Category Archives: Engineering

NEW: Activity Dashboard for API users

Our product strategy has always been shaped by the needs of our users. And one feature a lot of you have been asking for is a dashboard to keep tabs on your API account. So… (drum roll)… I’m happy to announce that we just released an Activity Dashboard for all users!

Before I show you some screenshots, I’ll explain why we put off this “essential” feature for so long. We follow an MVP – Minimum Viable Product – philosophy here at Unwired for everything we build: we ship the part of the product that adds the most value for our users first. All else comes later. Our focus has been on building a world-class Location API that reliably and affordably locates devices anywhere. Now that we’ve covered that (there are still miles to go before we sleep!), we’re working on the useful glitter. 🙂

Without further ado, here are some snaps:

[Screenshots: API Sandbox, API Reports, Support desk]

Log in to your dashboard here.

We’ll be adding many more user-facing features in the coming weeks. Also, any ideas on improving this dashboard are always welcome!

Getting to 100% uptime

For a startup, adding high availability can be both time-consuming and expensive, and time and money are usually more pressing concerns than high availability itself. Like most startups, we bootstrapped on AWS with an app server, a DB server, no replication or backups, and a few CloudWatch alarms.

[Diagram: high availability in the cloud on AWS]

As luck would have it, we had no major downtime, and this setup scaled easily to our first 100 users. Even at that scale, though, we couldn’t sleep too well at night, knowing that an outage would affect a lot of critical services relying on our API. One day we decided enough was enough and set out to build a good, solid infrastructure. Our team put in place a cost-effective (opinions welcome here!) setup made up of several pieces, some of which take minutes to configure and others days.

Here’s an extremely brief overview of what we did:

  1. Monitoring
    We set up a server outside our main infrastructure provider that runs Nagios / Icinga. It monitors all services (API health, CPU, RAM, etc.) on all hosts. It also calls us if something is critically wrong with any service, so we know about it right away. (There’s a rough sketch of this kind of external check after the list.)
  2. Load balancers, and many, many more instances
    To scale requests on our primary endpoint, we added more app servers, replication, and daily backups in different physical locations / data centers, and put them behind load balancers. For our servers on AWS, we used their load balancer; for the others, we deployed HAProxy as a reverse proxy.
  3. Multiple endpoints
    Our API serves customers worldwide, so it made sense to launch similar infrastructure in multiple locations. All it took was rock-solid replication (with alerts if it fails – see the replication check sketched after the list), fast DNS servers, and a little more money. We were also careful to pick a different data center for each location, to reduce the risk of downtime from hardware failures. This was a win-win: customers got lower latency, and we got less load on the primary endpoint.
  4. DNS-based Failovers
    Another win of a multiple-endpoint architecture is the ability to reroute requests from one endpoint to another, in the event of downtime, with just a DNS change. With Cloudflare as our DNS provider and per-second health checks by CloudRoutes, we can reroute requests instantly (again, via CloudRoutes) with zero downtime. DNS time-to-live doesn’t matter because our servers sit behind Cloudflare’s reverse proxy: it’s our IP that changes, not theirs. (The failover sketch after the list shows the basic idea.)
  5. Keeping track of logs
    There were so many things happening in our systems – malicious requests, slow requests, buggy code – that weren’t very visible. Then we found a great logging solution. All system and API log files are shipped to this logger, and email and webhook alerts are created from simple searches. (A toy version of this kind of search-based alerting is sketched after the list.)
  6. Processes
    We can’t stress this one enough! From having someone responsible for code review before anything is pushed to production, to drawing straws on who gets the first phone call if there’s a problem after hours, we spent some time putting SOPs down on paper and on an internal wiki.
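
To make the monitoring in (1) a little more concrete, here’s a rough sketch of what an external health check boils down to: poll the API from a box outside the main infrastructure and page someone when a few checks fail in a row. This isn’t our Nagios / Icinga configuration – the URLs and thresholds below are placeholders for illustration.

```python
#!/usr/bin/env python3
"""Minimal external health check: poll the API and alert on failure.

A hypothetical stand-in for what Nagios / Icinga does for us; the URLs
and thresholds below are placeholders, not our real configuration.
"""
import time
import requests

API_HEALTH_URL = "https://api.example.com/health"    # placeholder endpoint
ALERT_WEBHOOK = "https://hooks.example.com/alert"     # placeholder webhook (pager, chat, etc.)
CHECK_INTERVAL = 30         # seconds between checks
FAILURES_BEFORE_ALERT = 3   # avoid paging on a single blip


def check_once(timeout=5):
    """Return True if the API answers 200 within the timeout."""
    try:
        resp = requests.get(API_HEALTH_URL, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False


def send_alert(message):
    """Notify the on-call person via a webhook (placeholder)."""
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=5)


def main():
    consecutive_failures = 0
    while True:
        if check_once():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                send_alert(f"API health check failed {consecutive_failures} times in a row")
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```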
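
The “alerts if it fails” part of (3) can be as small as the following check. It assumes MySQL-style replication and the pymysql driver purely for illustration; the hosts, credentials and thresholds are placeholders and not necessarily what we run.

```python
#!/usr/bin/env python3
"""Sketch of a replication health check, assuming MySQL-style replication
and the pymysql driver purely for illustration. Hosts, credentials and the
alert webhook are placeholders.
"""
import pymysql
import requests

REPLICA_HOST = "replica.example.com"               # placeholder replica host
DB_USER = "monitor"
DB_PASSWORD = "<password>"
ALERT_WEBHOOK = "https://hooks.example.com/alert"  # placeholder webhook
MAX_LAG_SECONDS = 60


def replica_status():
    """Return the replica's SHOW SLAVE STATUS row as a dict."""
    conn = pymysql.connect(
        host=REPLICA_HOST,
        user=DB_USER,
        password=DB_PASSWORD,
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW SLAVE STATUS")
            return cursor.fetchone() or {}
    finally:
        conn.close()


def main():
    status = replica_status()
    lag = status.get("Seconds_Behind_Master")
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    # Alert if either replication thread stopped, or the replica fell too far behind.
    if not (io_ok and sql_ok) or lag is None or lag > MAX_LAG_SECONDS:
        requests.post(
            ALERT_WEBHOOK,
            json={"text": f"Replication problem: lag={lag}, io={io_ok}, sql={sql_ok}"},
            timeout=5,
        )


if __name__ == "__main__":
    main()
```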
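
For (4), the failover itself boils down to repointing a proxied A record at a standby origin when the primary looks unhealthy. The sketch below talks to Cloudflare’s v4 DNS API directly and uses placeholder IDs, hostnames and IPs; in our setup, CloudRoutes does the health checking and rerouting for us.

```python
#!/usr/bin/env python3
"""Sketch of a DNS-based failover: if the primary endpoint looks down,
repoint the (Cloudflare-proxied) A record at a standby endpoint.

Assumes Cloudflare's v4 API; the token, zone/record IDs, hostname and IPs
are placeholders, and in practice CloudRoutes handles this for us.
"""
import requests

CF_API_TOKEN = "<cloudflare-api-token>"   # placeholder credentials
ZONE_ID = "<zone-id>"
RECORD_ID = "<dns-record-id>"             # the A record for the API hostname
API_HOSTNAME = "api.example.com"          # placeholder hostname
PRIMARY_IP = "203.0.113.10"               # placeholder origin IPs
STANDBY_IP = "203.0.113.20"

CF_HEADERS = {"Authorization": f"Bearer {CF_API_TOKEN}"}


def primary_is_healthy(timeout=3):
    """Hit the primary origin directly (bypassing DNS) and check for a 200."""
    try:
        resp = requests.get(
            f"https://{PRIMARY_IP}/health",
            headers={"Host": API_HOSTNAME},
            timeout=timeout,
            verify=False,  # direct-to-IP request; the certificate won't match the IP
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False


def point_record_at(ip):
    """Update the proxied A record so Cloudflare forwards traffic to `ip`."""
    url = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}"
    payload = {"type": "A", "name": API_HOSTNAME, "content": ip, "ttl": 1, "proxied": True}
    resp = requests.put(url, headers=CF_HEADERS, json=payload, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    if not primary_is_healthy():
        # Clients never see the change: the record stays proxied behind Cloudflare.
        point_record_at(STANDBY_IP)
```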
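
And for (5), “alerts based on simple searches” amounts to something like the toy below: follow a log, match a pattern, fire a webhook. Our real logs are shipped to a hosted logging service; the path, pattern and webhook here are made up.

```python
#!/usr/bin/env python3
"""Toy version of search-based log alerting: follow a log file and fire a
webhook whenever a line matches a pattern. The path, pattern and webhook
URL are placeholders; our real setup ships logs to a hosted logging service.
"""
import re
import time
import requests

LOG_FILE = "/var/log/api/requests.log"             # placeholder log path
PATTERN = re.compile(r"HTTP/1\.\d\" 5\d\d ")       # e.g. any 5xx response in an access log
ALERT_WEBHOOK = "https://hooks.example.com/alert"  # placeholder webhook


def follow(path):
    """Yield new lines appended to the file (like `tail -f`)."""
    with open(path, "r") as handle:
        handle.seek(0, 2)  # jump to the end of the file
        while True:
            line = handle.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line


def main():
    for line in follow(LOG_FILE):
        if PATTERN.search(line):
            requests.post(ALERT_WEBHOOK, json={"text": f"Log alert: {line.strip()}"}, timeout=5)


if __name__ == "__main__":
    main()
```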

The industry standard seems to be 99.95%, but when thousands rely on our platform for non-trivial use cases, we think 100% availability is pretty darn important.