March 21, 2014’s Outage—What Happened, What We Did, and What We’re Doing

As many of you are undoubtedly aware, Groove experienced a serious server issue last night and this morning that caused the app to go down for all users.

We’ve said it a number of times today, but it bears repeating: while the outage was out of our control, we were unprepared to quickly and effectively deal with it, and the ultimate responsibility for that lies with us.

We undermined the trust that our customers have in us, and for that, we can’t apologize enough. We’ve recommitted ourselves to regaining and maintaining your trust by taking steps to ensure that nothing like this ever happens again.

Now that we’ve restored our servers and brought the app back to full functionality, we’d like to offer a rundown of the technical details behind what happened.

This morning at 8:51 AM EST, we received a notice from Engine Yard that one of our cloud servers was scheduled to be retired by Amazon on Feb 25th. To our shock, Amazon had already retired it late last night, with no prior warning.

The server in question was our master database instance. Because of this, the entire cluster had to be shut down and rebuilt, and the manual tweaks previously made to its configuration had to be tracked down and reapplied.

Some of the issues we ran into while rebuilding the cluster:

Hardcoded EC2 hostnames prevented server instances from talking to each other
SSL certificates for the server running Live Chat and real-time in-app updates were lost, preventing it from running
Lost link to search configuration
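
For readers wondering what avoiding hardcoded hostnames can look like, here is a minimal sketch in Python. It is not a description of our actual setup: the function name `resolve_host`, the `<ROLE>_HOST` environment variable convention, and the example hostname are all assumptions made for illustration. The idea is that each server looks up its peers by role at startup, so a rebuilt cluster only needs new configuration values rather than code changes.

```python
import os


def resolve_host(role, env=None):
    """Look up the hostname for a server role (hypothetical helper).

    Assumes a naming convention of <ROLE>_HOST, e.g. DB_MASTER_HOST.
    Raises KeyError instead of silently falling back to a hardcoded host.
    """
    env = os.environ if env is None else env
    key = f"{role.upper()}_HOST"
    host = env.get(key, "").strip()
    if not host:
        raise KeyError(f"{key} is not set; refusing to guess a hostname")
    return host


if __name__ == "__main__":
    # Example values only; a real deployment would set these per environment.
    example_env = {"DB_MASTER_HOST": "db-master.internal.example"}
    print(resolve_host("db_master", example_env))
```

Failing loudly when a value is missing is a deliberate choice in this sketch: a server that refuses to start is easier to diagnose than one that quietly tries to reach a host that no longer exists.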

Additionally, our previous public IP address was lost and reused by Amazon, so we had to acquire a new IP address from Amazon and update our DNS records.

One of our biggest failures, and one that delayed the resolution, was in monitoring. We have server monitoring set up, but because the server failed during the night, our team didn’t see the alert emails until they woke up in the morning. We’ve already switched our systems to call our personal cell phones in the event of an outage, 24 hours a day.
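
As a rough illustration of the policy described above, the rule "email for everything, phone calls for outages at any hour" could be expressed as follows. This is a sketch under stated assumptions, not our monitoring configuration: the severity labels, the `acknowledged` flag, and the function name `notification_channels` are hypothetical.

```python
def notification_channels(severity, acknowledged=False):
    """Return the channels to notify for an alert (hypothetical policy).

    Assumed severities: "outage" for a full service outage, anything
    else for lower-priority alerts. There is deliberately no time-of-day
    check: an outage triggers a phone call 24 hours a day.
    """
    if acknowledged:
        # Someone is already handling the alert; stop notifying.
        return []
    if severity == "outage":
        return ["email", "phone"]
    return ["email"]


if __name__ == "__main__":
    print(notification_channels("outage"))
    print(notification_channels("warning"))
```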

As much as today drained our team as we scrambled to put out this fire, it was worse for the customers who rely on us to provide a stable app. And for that, again, we apologize.

There’s a lot more to say about the events that occurred today, and we’ll be doing that on our blog next week. But in the meantime, we wanted to let you know exactly what happened from a technical standpoint.

We’re grateful for your patience, your words of support and understanding throughout today, and most of all, your trust in Groove.

Sincerely,
Alex
CEO, Groove
