Why Springpad was unavailable for a couple of hours yesterday
We had some unexpected downtime yesterday, which made Springpadit.com unavailable to both web and mobile clients. We did our best to let you know that we were having trouble and to assure you that your data was safe, just not accessible. Now, we want to take the time to explain a bit about how Springpad works, as well as what happened yesterday.
Springpad’s Design
We have been working very hard to ensure that our users’ data is always safe. To that end, over the past year we have made some major architectural changes to enhance both the reliability and the performance of the system.
- Cassandra: Springpad uses Cassandra, which is a data store that was designed from the ground up (by Facebook) to handle failures. Everything that you save (notes, bookmarks, movies, notebooks) is replicated three times, to three different servers. The system can transparently handle and recover from storage failures.
- Capacity: We have been adding capacity – meaning servers – rapidly. Over the past year we’ve added about 20 machines to the mix. This has enabled us to increase performance and reliability across the entire environment. We host Springpad on Amazon EC2, which allows us to add and remove capacity dynamically as demand requires.
- Backups: All user data is backed up nightly in the event that something terrible happens.
- Offline access: The more clients with offline capability, the less load our server environment experiences. The iPhone and Android applications essentially do some of the processing for us. We are working to improve and expand our offline capabilities.
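To make the "replicated three times" point concrete, here is a minimal sketch of the arithmetic behind tolerating a storage failure. It assumes quorum reads and writes, which is one of the consistency levels Cassandra offers; this post does not say which level Springpad actually uses, and the function names are illustrative.

```python
# Illustrative only: the post says data is replicated to three servers,
# but it does not say which consistency level is used.
# Quorum reads and writes are ASSUMED here purely for the arithmetic.


def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Number of replicas that can be down while quorum operations still succeed."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # three copies, as described in the Cassandra bullet above
    print(f"replicas={rf} quorum={quorum(rf)} "
          f"tolerated_failures={tolerated_failures(rf)}")
```

Under that assumption, three copies mean two must answer, so one storage server can fail without interrupting reads or writes.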
So, what happened? (technically speaking, that is)
- Around 1:30pm (ET), one of our backend web servers started having network trouble. We contacted Amazon’s technical support while simultaneously working on launching a new backend to replace the problem server. While this was happening, requests started to pile up.
- Within a few minutes the gateway into Springpad ran out of memory. We use redundant Apache load balancers and the high number of requests caused them both to run out of memory.
- At the same time, we were having trouble launching new Amazon servers to replace the backend servers, both of which were now down, as well as the load balancers. We still hadn’t heard from support as to why the problem started in the first place.
- By 3:30pm we had a new balancer up, and about half of the requests to our servers were succeeding. The other half were still having problems. At around 4pm, we were finally able to recover a second load balancer and the problems subsided.
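The pile-up described above can be approximated with Little's law: the average number of requests held open equals the arrival rate multiplied by how long each request is held. The sketch below uses made-up numbers (the request rate, per-request memory, and timeout are all assumptions, not Springpad measurements) to show why a backend that stops answering can exhaust a balancer's memory within minutes.

```python
# Hypothetical numbers throughout; none of these figures come from Springpad.


def in_flight_requests(arrival_rate_per_s: float, seconds_held: float) -> float:
    """Little's law: average requests held open = arrival rate x time each is held."""
    return arrival_rate_per_s * seconds_held


def balancer_memory_mb(in_flight: float, mb_per_request: float) -> float:
    """Rough memory a balancer holds for requests it is still waiting on."""
    return in_flight * mb_per_request


if __name__ == "__main__":
    rate = 200.0          # requests/second routed to the problem server (assumed)
    per_request_mb = 0.5  # memory per open request (assumed)

    healthy = in_flight_requests(rate, 0.1)   # backend answers in 100 ms
    stalled = in_flight_requests(rate, 60.0)  # backend hangs until a 60 s timeout

    for label, held in (("healthy", healthy), ("stalled", stalled)):
        mem = balancer_memory_mb(held, per_request_mb)
        print(f"{label}: {held:.0f} requests in flight, about {mem:.0f} MB")
```

With these assumed figures the stalled case holds 600 times as many requests open as the healthy one, simply because each request waits 60 seconds instead of 0.1 seconds.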
What we’re doing to prevent issues like this from happening again
- We’ll now keep a spare set of load balancers waiting. We then won’t have to wait to launch new servers if things go wrong again.
- We are increasing the memory and CPU of our load balancers to handle more requests.
- We are still investigating the core network problem with Amazon and researching how to prevent it from happening again in the future.
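One general technique for limiting this kind of failure is to have the balancer stop routing to a backend after several consecutive failed or timed-out requests. The sketch below is a generic illustration of that idea, not a description of how Apache behaves or of Springpad's configuration; the class names, the threshold of three failures, and the nine-backend pool (a figure mentioned in the comments below) are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


class Balancer:
    """Round-robin balancer that ejects a backend after repeated failures."""

    def __init__(self, backends, max_failures=3):
        self.backends = list(backends)
        self.max_failures = max_failures  # hypothetical threshold
        self._next = 0

    def pick(self):
        """Return the next backend still in rotation, or None if all are ejected."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend.in_rotation:
                return backend
        return None

    def record(self, backend, ok):
        """Update health after a request; a timeout counts as a failure."""
        if ok:
            backend.consecutive_failures = 0
            return
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= self.max_failures:
            backend.in_rotation = False  # stop sending traffic to it


def simulate(num_requests, failing_name, num_backends=9, max_failures=3):
    """Send requests round-robin; return how many reached the failing backend."""
    backends = [Backend(f"web-{i}") for i in range(1, num_backends + 1)]
    balancer = Balancer(backends, max_failures)
    failed = 0
    for _ in range(num_requests):
        backend = balancer.pick()
        if backend is None:
            break
        ok = backend.name != failing_name  # the failing backend never answers
        balancer.record(backend, ok)
        if not ok:
            failed += 1
    return failed


if __name__ == "__main__":
    print(simulate(1000, "web-1"))
```

In this simulation only three of the 1000 requests reach the failing server before it is removed from rotation; the rest go to the remaining eight. A real deployment would also need a way to return a recovered server to rotation, which this sketch leaves out.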
Rest assured that your data is always safe with Springpad. This issue caused access problems, but did not affect our storage infrastructure.
23 Responses to “Why Springpad was unavailable for a couple of hours yesterday”
No worries Jason. Thanks for the update
Even though I like the idea that I can integrate my diverse data via the cloud, I still need an OFF-LINE DESKTOP APP. That will make Springpad genuinely powerful. (My wish is offline desktop versions for my iMac and my MacBook Air. Thank you!)
I like Springpad very much. The only thing that prevents me from switching from Evernote to Springpad is that you don’t have an OFFLINE DESKTOP app!
Good thing I was so busy @ work I didn’t get to my office & try to use the computer! LOL
There will always be unexpected problems, but I believe the way you communicate them to users makes a big difference in how they are perceived. In this case, I am a happy user. Still, I don’t want a second without access to my Springpad.
Thanks for the details. Those of us who are tech savvy feel much better knowing what happened, how you recovered, that our data actually wasn’t lost, and how you plan to fix it for the future. Springpad customer communication rocks! Way to go Springpad team!
Part of IT growing pains. Would take a lot more for me to abandon using SpringPad. Great job communicating – the FB posts helped a lot during your downtime – Nice job. PS: Thanks Brittni you are always quick to respond to any questions that I have. – JC
Hi Jason,
If you don’t mind, I have two questions.
The network problem you mentioned, what was the reason ?
- guest (VM) based ?
- host based, hardware ?
- network devices based ?
You mentioned keeping load balancers ready. As I understood it, everything started with the backend web servers not being able to handle requests in a timely manner when one of them went down, which increased the queue and memory usage on the load balancers. Would it make more sense to have spare backend web servers ready?
We experienced high packet loss rates to a virtualized instance. We actually have plenty of backend web servers to handle the traffic (9 or so), but for some reason the load balancers didn’t take the problem server out of rotation. They just kept sending requests and waiting for responses, eventually leading to memory problems. We are still investigating why the lb’s didn’t take that machine out and reroute to the additional 8 servers. We are also still waiting on a better explanation of what the networking issue was from Amazon.
Phenomenal informing of customers with this easily digestible article and your Facebook updates. Just brilliant.
Great job handling this, guys. Keeping everyone informed as to exactly what’s going on both at the time and afterwards is SUCH a big (and often neglected) part of making us feel secure using your service. Five stars and a big hug
Jason, thanks for the reply.
Do you use ELB? Didn’t the LB take the faulty web server out of balancing even after you powered it off? The health checks might have been insufficient, but a powered-off server should have been taken out of balancing within a few seconds.
I am a bit curious because of a future project to implement a HA service on Amazon.
I lost everything on my Springpad on my Droid Incredible. Every piece of information was gone. When I tried to retrieve it, I was prompted to “register”, which I didn’t have to do when I first downloaded the app to the Droid. I registered etc., but still, everything is gone. I’ve changed to a different app, but I thought you would like to know what happened to me during the outage.
Thanks for the info. I too think it would be better to have the possibility to work off line, also on the mobile app (I have Android).
@ozgur No. We don’t use ELB yet. We use some custom url rewriting at the balancer level and haven’t had the time to think about how to redo some of that and switch to ELB. We currently use Apache, but are starting to switch some services to nginx.
I think the problem was that, because of the network troubles, the balancer didn’t see the front end as truly down, and the balancers crashed quicker than we could react.
Are you having trouble again? Because I haven’t been able to access springpad all day but I don’t see any mention of trouble here or on Facebook.
Melissa – I’m sorry to hear that you are having trouble!
But, we are not having any issues today. Please try multiple browsers, also please confirm that you are not behind a firewall that might be blocking us (some companies block us due to our integrations with social sites like facebook & twitter).
If you are still having trouble, please send your Springpad username & IP address to me at katin@springpartners.com so I can figure out what’s going on.
An easy way to find your IP address is by going here: http://www.whatismyip.com/
@Jason, thanks for your reply. We recently used haproxy in a project and are pretty happy with the results.
Just wanted to say I really really appreciate this post. Not only in acknowledging the error, but providing the technical reasoning instead of some PR speak.
(And on a separate note, how is Cassandra working out for you? We ended up reverting to MySQL after Cassandra started falling apart under our user load.)
as of 11:06 EST, 4/21/11, SpringPad is down. Can anybody explain what happened?
Chris – Here’s the current status on today’s outage: http://gsfn.us/t/28vdk
Like so many others, I’m so happy that Springpad is up and running again. Understand that those of us who know what happened don’t blame Springpad for the outage. I’m sure there will be many hours of postmortem review by Amazon and Springpad to mitigate a recurrence.
On a lighter note, today was the first time in many months that my wife and I had to resort to using a paper list to go food shopping! Ouch.
Springpad, you are changing our lives for the better, and I am confident you will build a better mousetrap as a result of this outage.
[...] View Springpad Blog on an outage and their design [...]
I just became a Springpad user and I stopped to read this post. It says a lot about a company when they put this type of post up. I like the way you guys communicate. Keep it up. I’m recommending you guys to my friends and family.