Why Springpad was unavailable for a couple of hours yesterday
We had some unexpected downtime yesterday, which made Springpadit.com unavailable to both web and mobile clients. We did our best to let you know that we were having trouble and to assure you that your data was safe, just not accessible. Now, we want to take the time to explain a bit about how Springpad works, as well as what happened yesterday.
We have been working very hard to ensure that our users’ data is always safe. To that end, over the past year we have made some major architectural changes to enhance both the reliability and the performance of the system.
- Cassandra: Springpad uses Cassandra, which is a data store that was designed from the ground up (by Facebook) to handle failures. Everything that you save (notes, bookmarks, movies, notebooks) is replicated three times, to three different servers. The system can transparently handle and recover from storage failures.
- Capacity: We have been adding capacity – meaning servers – rapidly. Over the past year we’ve added about 20 machines to the mix. This has enabled us to increase performance and reliability across the entire environment. We host Springpad with Amazon EC2, which allows us to dynamically add and remove capacity as demand changes.
- Backups: All user data is backed up nightly in the event that something terrible happens.
- Offline access: The more clients with offline capability, the less load our server environment experiences. The iPhone and Android applications essentially do some of the processing for us. We are working to improve and expand our offline capabilities.
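To make the replication point concrete, here is a simplified sketch of how a key can be assigned to three servers so that it stays readable when some of them fail. It is an illustration only, not Cassandra's actual placement code or our configuration: the node names, the hash function, and the helpers `replicas_for` and `readable` are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
from bisect import bisect_right

REPLICATION_FACTOR = 3  # from the post: each item is stored on three servers


def token(value: str) -> int:
    """Map a string to a position on the ring (hash choice is illustrative)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, nodes: list[str], rf: int = REPLICATION_FACTOR) -> list[str]:
    """Return the distinct nodes that would hold a copy of `key`."""
    ring = sorted(nodes, key=token)           # nodes ordered by ring position
    tokens = [token(n) for n in ring]
    # First node clockwise from the key's position, wrapping around the ring.
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def readable(key: str, nodes: list[str], down: set[str]) -> bool:
    """A key stays readable while at least one of its replicas is up."""
    return any(n not in down for n in replicas_for(key, nodes))


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
    owners = replicas_for("note:123", nodes)
    print("replicas:", owners)
    print("readable with two replicas down:", readable("note:123", nodes, set(owners[:2])))
```

With three copies on three different servers, losing one or even two of them still leaves a copy that can be served, which is the property the Cassandra bullet above relies on.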
So, what happened? (technically speaking, that is)
- Around 1:30pm (ET), one of our backend web servers started having network trouble. We contacted Amazon’s technical support while simultaneously working on launching a new backend to replace the problem server. While this was happening, requests started to pile up.
- Within a few minutes the gateway into Springpad ran out of memory. We use redundant Apache load balancers, and the high number of requests caused both of them to run out of memory.
- At the same time we were having trouble launching new Amazon servers to replace both the backend servers, which were now down, and the load balancers. We still hadn’t heard from support as to why the problem started in the first place.
- By 3:30pm we had a new balancer up and about half of the requests to our servers were succeeding. The other half were still having problems. At around 4pm, we were finally able to recover a second load balancer and the problems subsided.
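The chain of events above comes down to simple arithmetic: when requests arrive faster than the backends can answer them, the waiting requests accumulate in the load balancer until its memory is gone. The sketch below illustrates that mechanism under assumed numbers. The request rates, the per-request memory cost, and the function names are hypothetical and are not measurements from our environment.

```python
def backlog_over_time(arrivals_per_sec, served_per_sec, seconds):
    """Return the number of queued requests at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # The queue grows by the shortfall, and never drops below zero.
        backlog = max(0, backlog + arrivals_per_sec - served_per_sec)
        history.append(backlog)
    return history


def seconds_until_memory_exhausted(arrivals_per_sec, served_per_sec,
                                   bytes_per_request, memory_bytes):
    """Return the first second at which queued requests exceed memory, or None."""
    growth = arrivals_per_sec - served_per_sec
    if growth <= 0:
        return None  # the backlog never grows, so memory is never exhausted
    max_requests = memory_bytes // bytes_per_request
    # First whole second t at which growth * t is larger than max_requests.
    return max_requests // growth + 1


if __name__ == "__main__":
    # Hypothetical figures: half of the backend capacity has been lost.
    print(backlog_over_time(arrivals_per_sec=200, served_per_sec=100, seconds=5))
    print(seconds_until_memory_exhausted(
        arrivals_per_sec=200, served_per_sec=100,
        bytes_per_request=1_000_000, memory_bytes=2_000_000_000))
```

Under these assumed numbers the balancer would be out of memory in well under a minute, which is consistent with the "within a few minutes" timeline above even though the real figures were different.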
What we’re doing to prevent issues like this from happening again
- We’ll now keep a spare set of load balancers waiting. We then won’t have to wait to launch new servers if things go wrong again.
- We are increasing the memory and CPU of our load balancers to handle more requests.
- We are still investigating the core network problem with Amazon and researching how to prevent it from happening again in the future.
Rest assured that your data is always safe with Springpad. This issue caused access problems, but did not affect our storage infrastructure.