
Getting to 100% uptime

As a startup, adding high availability can be both time-consuming and expensive, and time and money are usually more critical than high availability itself. Like most startups, we bootstrapped on AWS with an app server, a DB server, no replication or backups, and a few CloudWatch alarms.

As luck would have it, we had no major downtime, and this setup scaled easily to our first 100 users. Even at this scale, we couldn’t sleep too well at night, knowing that downtime would affect a lot of critical services relying on our API. One day, we decided enough was enough and set out to build a good, solid infrastructure. Our team put in place a cost-effective (opinions welcome here!) setup made up of several pieces, some of which take minutes to configure and others days.

Here’s an extremely brief overview of what we did:

Monitoring
We set up a server outside our main infrastructure provider that runs Nagios / Icinga. It monitors all services (API health, CPU, RAM, etc.) on all hosts. It also calls us if there’s something critically wrong with any service, so we know about it right away.
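To make the idea concrete, here is a minimal sketch of the kind of threshold check a monitoring server runs against each host. It is not the Nagios / Icinga plugin API; the `Check` class, the `run_checks` function, and the thresholds are hypothetical names and values chosen for illustration. A probe that fails to answer at all is treated as critical, since an unreachable host is exactly what we want to be called about.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One monitored metric (hypothetical structure, not a Nagios/Icinga type)."""
    name: str
    probe: Callable[[], float]  # returns the current value of the metric
    critical_above: float       # values above this are critical


def run_checks(checks: List[Check]) -> List[str]:
    """Return one message per check that is critical or could not be probed."""
    critical = []
    for check in checks:
        try:
            value = check.probe()
        except Exception as exc:
            # A probe that cannot run is itself an alert-worthy event.
            critical.append(f"{check.name}: probe failed ({exc})")
            continue
        if value > check.critical_above:
            critical.append(f"{check.name}: {value} > {check.critical_above}")
    return critical
```

In a real setup each probe would query the host (an HTTP call to the API health URL, an agent reporting CPU and RAM), and a non-empty result would trigger the phone call.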
Load balancers, and many many more instances
To scale requests on our primary endpoint, we added more app servers, replication, and daily backups in different physical locations / data centers, and put the app servers behind load balancers. For our servers on AWS, we used AWS’s load balancer. For the others, we deployed HAProxy as a reverse proxy to do the same job.
Multiple endpoints
Our API serves customers worldwide, so it made sense to launch similar infrastructure in multiple locations. All it took was rock-solid replication (with alerts if it fails), fast DNS servers, and a little more money. We were also careful to pick a different data center for each location, to decrease the risk of downtime due to hardware failures. This was a win-win: for customers, because it reduced latency, and for us, because it reduced the load on the primary endpoint.
DNS-based Failovers
Another win for a multi-endpoint architecture is the ability to reroute requests from one endpoint to another, in the event of downtime, with just a DNS change. With Cloudflare as our DNS provider and per-second health checks by CloudRoutes, we could reroute requests almost instantly (again, via CloudRoutes), and with zero downtime. DNS time-to-live doesn’t matter because our servers are behind Cloudflare’s reverse proxy: clients keep resolving to Cloudflare’s IPs, and it’s only our IP behind the proxy that changes.
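The failover decision itself can be sketched in a few lines. This is an illustration of the logic only, not CloudRoutes’ or Cloudflare’s API: `FailoverPicker`, `max_failures`, and the example IP addresses (taken from the documentation ranges) are assumptions. Endpoints are listed in order of preference, an endpoint is skipped once it has failed several health checks in a row, and a single passing check restores it.

```python
from typing import Dict, List, Optional


class FailoverPicker:
    """Pick the first endpoint that has not failed too many checks in a row."""

    def __init__(self, endpoints: List[str], max_failures: int = 3):
        self.endpoints = endpoints          # ordered by preference
        self.max_failures = max_failures    # consecutive failures tolerated
        self.failures: Dict[str, int] = {e: 0 for e in endpoints}

    def record(self, endpoint: str, healthy: bool) -> None:
        """Store the result of one health check."""
        if healthy:
            self.failures[endpoint] = 0     # any success resets the streak
        else:
            self.failures[endpoint] += 1

    def active(self) -> Optional[str]:
        """Return the endpoint traffic should go to, or None if all are down."""
        for endpoint in self.endpoints:
            if self.failures[endpoint] < self.max_failures:
                return endpoint
        return None


picker = FailoverPicker(["203.0.113.10", "198.51.100.20"], max_failures=2)
picker.record("203.0.113.10", healthy=False)
picker.record("203.0.113.10", healthy=False)
target = picker.active()  # the secondary endpoint, after two failed checks
```

Requiring more than one failed check before switching trades a few seconds of detection time for protection against flapping on a single dropped probe. The step that actually updates the DNS record is provider-specific and deliberately left out.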
Keeping track of logs
There were so many things happening in our systems (malicious requests, slow requests, buggy code) that weren’t very visible. Then we found a great logging solution. All system and API log files are shipped to this logger, and email and webhook alerts are created based on simple searches.
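An alert based on a simple search boils down to matching each log line against a handful of named patterns. The sketch below assumes a made-up log format and hypothetical rule names (`server_error`, `slow_request`); it is not the API of any particular logging product.

```python
import re
from typing import Dict, Iterable, List, Tuple


def match_alerts(lines: Iterable[str], rules: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (rule name, log line) for every line that matches a rule."""
    compiled = {name: re.compile(pattern) for name, pattern in rules.items()}
    hits = []
    for line in lines:
        for name, regex in compiled.items():
            if regex.search(line):
                hits.append((name, line))
    return hits


# Hypothetical rules: any 5xx status, or a request taking 1000 ms or more.
rules = {
    "server_error": r" 5\d\d ",
    "slow_request": r"duration_ms=\d{4,}",
}
lines = [
    "GET /v1/items 200 duration_ms=35",
    "GET /v1/items 500 duration_ms=12",
    "POST /v1/items 201 duration_ms=4200",
]
alerts = match_alerts(lines, rules)
```

Here `alerts` holds one `server_error` hit for the second line and one `slow_request` hit for the third; each hit would then be turned into an email or a webhook call.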
Processes
We can’t stress this one enough! From having someone responsible for reviewing code before it is pushed to production, to drawing straws on who would get the first phone call if there was a problem after hours, we spent some time putting SOPs down on paper and on an internal wiki.

The industry standard seems to be 99.95%, but when thousands rely on our platform for non-trivial use cases, we think 100% availability is pretty darn important.
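For a sense of what those percentages mean in practice, an availability target converts directly into a yearly downtime budget. The helper below is simple arithmetic; the function name is ours, and it assumes a 365-day year.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes, ignoring leap years
    return minutes_per_year * (1 - availability_percent / 100)


budget_9995 = downtime_minutes_per_year(99.95)
budget_100 = downtime_minutes_per_year(100.0)
```

At 99.95%, the budget works out to about 263 minutes, or roughly 4.4 hours, of downtime per year. At 100% it is zero, which is why every layer described above has to be able to fail without taking the API down with it.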