For a startup, adding high availability can be both time-consuming and expensive, and time and money are usually more pressing concerns than availability itself. Like most startups, we bootstrapped on AWS with an app server, a DB server, no replication or backups, and a few CloudWatch alarms.
As luck would have it, we had no major downtime, and this setup scaled easily to our first 100 users. Even at that scale, we couldn’t sleep well at night, knowing that an outage would affect many critical services relying on our API. One day we decided enough was enough and set out to build a solid infrastructure. Our team put in place a cost-effective (opinions welcome here!) solution made up of several pieces, some of which take minutes to configure and others days.
Here’s an extremely brief overview of what we did:
- Monitoring
We set up a server outside our main infrastructure provider that runs Nagios / Icinga. It monitors all services (API health, CPU, RAM, etc.) on all hosts. It also calls us if there’s something critically wrong with any service, so we know about it right away.
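Our actual Nagios / Icinga checks aren’t shown here, but the idea is simple. Below is a minimal, hypothetical sketch of the kind of external API health check the monitoring box runs; the endpoint URL and alert webhook are stand-ins, not our real ones.

```python
import requests

# Hypothetical endpoints -- substitute your own API and alerting hook.
API_HEALTH_URL = "https://api.example.com/health"
ALERT_WEBHOOK = "https://alerts.example.com/notify"

def check_api_health(timeout_seconds=5):
    """Return True if the API answers 200 within the timeout."""
    try:
        response = requests.get(API_HEALTH_URL, timeout=timeout_seconds)
        return response.status_code == 200
    except requests.RequestException:
        return False

def alert(message):
    """Fire an alert; a real setup would page someone (phone call, SMS)."""
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=5)

if __name__ == "__main__":
    # Run from cron (or wrapped as a Nagios/Icinga plugin) on a host
    # outside the main infrastructure provider.
    if not check_api_health():
        alert("API health check failed")
```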
- Load balancers, and many many more instances
To scale requests on our primary endpoint, we added more app servers, replication and daily backups in different physical locations / data centers, and put them behind load balancers. For our servers on AWS, we used their load balancer; for the others, we deployed HAProxy as a reverse proxy.
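The post doesn’t include our load-balancer setup, but on AWS the wiring is a couple of API calls. Here’s a minimal sketch, assuming the Classic ELB API via boto3; the load-balancer name, region and instance IDs are placeholders.

```python
import boto3

# Assumed names -- replace with your own load balancer and instances.
LOAD_BALANCER_NAME = "api-primary"
APP_SERVER_IDS = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

elb = boto3.client("elb", region_name="us-east-1")

# Attach the app servers to the Classic ELB; it starts health-checking
# them and routing traffic to the ones that pass.
elb.register_instances_with_load_balancer(
    LoadBalancerName=LOAD_BALANCER_NAME,
    Instances=[{"InstanceId": instance_id} for instance_id in APP_SERVER_IDS],
)
```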
- Multiple end-points
Our API serves customers world-wide, so it made sense to launch similar infrastructure in multiple locations. All it took was rock-solid replication (with alerts if it fails), fast DNS servers and a little more money. We were also careful to pick a different data center for each location, to reduce the risk of downtime due to hardware failures. This was a win for our customers, because it reduced their latency, and for us, because the load on the primary endpoint dropped.
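The “alerts if it fails” part of replication is just another scheduled check. Here is a rough sketch of one, assuming MySQL replication (the post doesn’t say which database we use); the hostnames, credentials, webhook and lag threshold are all placeholders.

```python
import pymysql
import requests

# Hypothetical values -- replace with your replica host, credentials
# and alerting webhook.
REPLICA_HOST = "db-replica.example.com"
ALERT_WEBHOOK = "https://alerts.example.com/notify"
MAX_LAG_SECONDS = 60

def check_replication():
    """Return a problem description, or None if replication looks healthy."""
    connection = pymysql.connect(
        host=REPLICA_HOST,
        user="monitor",
        password="secret",
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute("SHOW SLAVE STATUS")
            status = cursor.fetchone()
    finally:
        connection.close()

    if status is None:
        return "replication is not configured on this replica"
    if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
        return "replication threads have stopped"
    lag = status["Seconds_Behind_Master"]
    if lag is None or lag > MAX_LAG_SECONDS:
        return f"replica is lagging ({lag} seconds behind)"
    return None

if __name__ == "__main__":
    problem = check_replication()
    if problem:
        requests.post(ALERT_WEBHOOK, json={"text": problem}, timeout=5)
```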
- DNS-based Failovers
Another win of a multi-endpoint architecture is the ability to reroute requests from one endpoint to another in the event of downtime, with just a DNS change. With Cloudflare as our DNS provider, and per-second health checks by CloudRoutes, we can reroute requests instantly (again, via CloudRoutes), with zero downtime. DNS time-to-live doesn’t matter because our servers sit behind Cloudflare’s reverse proxy; it’s our IP that changes, not theirs.
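CloudRoutes performs the switch for us, but under the hood a failover like this boils down to one DNS record update. Below is a rough sketch of the equivalent call against Cloudflare’s v4 API; the zone and record IDs, credentials, hostname and IPs are placeholders, and the exact request CloudRoutes makes may differ.

```python
import requests

# Hypothetical identifiers -- taken from your Cloudflare dashboard / API.
ZONE_ID = "your-zone-id"
RECORD_ID = "your-dns-record-id"
AUTH_HEADERS = {
    "X-Auth-Email": "ops@example.com",
    "X-Auth-Key": "your-api-key",
    "Content-Type": "application/json",
}

def point_api_at(ip_address):
    """Repoint the A record for the API at a healthy endpoint.

    Because the record is proxied through Cloudflare, clients keep
    resolving Cloudflare's IPs, so the change takes effect without
    waiting out a DNS TTL.
    """
    url = (
        "https://api.cloudflare.com/client/v4/zones/"
        f"{ZONE_ID}/dns_records/{RECORD_ID}"
    )
    payload = {
        "type": "A",
        "name": "api.example.com",
        "content": ip_address,
        "proxied": True,
    }
    response = requests.put(url, headers=AUTH_HEADERS, json=payload, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    # Fail over from the primary endpoint to the secondary one.
    point_api_at("203.0.113.20")
```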
- Keeping track of logs
There were so many things happening in our systems – malicious requests, slow requests, buggy code – that weren’t very visible. Then we found a great logging solution: all system and API log files are shipped to it, and email and webhook alerts are created from simple searches.
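The post doesn’t name the logging service, so here is a homegrown sketch of the same idea: tail a log file, run simple searches against each line, and fire a webhook alert on a match. The file path, patterns and webhook URL are placeholders.

```python
import re
import time
import requests

# Hypothetical values -- point these at your own logs and alert hook.
LOG_PATH = "/var/log/api/access.log"
ALERT_WEBHOOK = "https://alerts.example.com/notify"

# "Simple searches" expressed as regexes.
PATTERNS = [
    re.compile(r'" 5\d\d '),   # HTTP 5xx responses in an access log
    re.compile(r"Traceback"),  # unhandled exceptions in app logs
]

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as log_file:
        log_file.seek(0, 2)  # jump to the end of the file
        while True:
            line = log_file.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_PATH):
    for pattern in PATTERNS:
        if pattern.search(line):
            requests.post(ALERT_WEBHOOK, json={"text": line.strip()}, timeout=5)
            break
```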
- Processes
We can’t stress this one enough! From having someone responsible for code review before anything is pushed to production, to drawing straws on who gets the first phone call if there’s a problem after hours, we spent some time putting SOPs down on paper and on an internal wiki.
The industry standard seems to be 99.95% uptime, which still allows roughly 4.4 hours of downtime a year. When thousands rely on our platform for non-trivial use cases, we think 100% availability is pretty darn important.