2019/05/22 Dynalist Outage post-mortem

Shida · May 23, 2019, 11:23pm

Time

2019/05/22 2 AM to 9 AM ET (7 hours)
2019/05/22 11 PM to 2019/05/23 12:40 AM ET (1 hour and 40 minutes)
2019/05/23 4:20 AM to 10 AM ET (5 hours and 40 minutes)
Total downtime: almost 15 hours, spread across 2 days
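As a quick sanity check on that total, the three windows above can be summed with a few lines of Python. This is only an illustrative sketch: the `WINDOWS` list and the `total_downtime` helper are names made up for this example, and the timestamps are copied from the list above (all in ET).

```python
from datetime import datetime, timedelta

# Timestamp format matching the outage windows listed above (all times in ET).
FMT = "%Y/%m/%d %I:%M %p"

# The three outage windows from the post, as (start, end) pairs.
WINDOWS = [
    ("2019/05/22 2:00 AM", "2019/05/22 9:00 AM"),
    ("2019/05/22 11:00 PM", "2019/05/23 12:40 AM"),
    ("2019/05/23 4:20 AM", "2019/05/23 10:00 AM"),
]


def total_downtime(windows):
    """Add up the duration of every (start, end) window."""
    total = timedelta()
    for start, end in windows:
        total += datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return total


print(total_downtime(WINDOWS))  # 14:20:00
```

The three windows add up to 14 hours and 20 minutes, which is where "almost 15 hours" comes from.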

Issue

During the above time windows, Dynalist’s webpage was unreachable and clients were unable to sync. Users would see 500 errors, and requests would time out.

From 7 PM ET on 2019/05/21 until resolution at 10 AM ET on 2019/05/23, the server was also intermittently slow, taking up to 10 seconds to respond.

No data was lost during the process.

TL;DR: what caused it

Our hosting provider (DigitalOcean) over-provisioned their hardware, meaning they tried to put more servers on a physical machine than it had capacity for.

Timeline

Our database server normally uses only about 10% of the CPU we pay for. This headroom ensures that we are ready to handle surges in requests, but we rarely see anything above 20% CPU.

Starting at 2019/05/21 7 PM ET, we saw a slowdown in request handling, causing a portion of our users to wait 3-10 seconds instead of the usual <0.5s response time. Given that there were no network traffic abnormalities, we identified the cause as a CPU issue. We suspect it was triggered when a newly deployed server belonging to another DigitalOcean customer was automatically allocated to the same machine as our database server. It’s likely that this new server began using up CPU time on the physical machine, but somehow DigitalOcean wasn’t able to detect that there were too few CPU cycles to accommodate all the servers provisioned on that machine.

We filed a ticket with DigitalOcean on 2019/05/21 9:29 PM ET.

We waited anxiously for a whopping 26 hours(!) with no reply, until a support agent responded on 2019/05/22 11:28 PM ET. The support agent told us that they were seeing normal CPU usage on their side and asked us for log details as proof that the issue was happening.

We then generated logs from CPU monitoring that clearly showed a deficiency (known as CPU “steal” time). By this time, our server already had a backlog of user requests to serve and needed about 20% available CPU to catch up and return to normal operation. We were barely able to get 10% CPU time, and at times we could only use 5-8% of the CPU we were paying for.
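For readers unfamiliar with steal time: on Linux it is exposed as the eighth counter on the aggregate `cpu` line of `/proc/stat`, and it counts ticks during which the virtual machine wanted to run but the hypervisor was serving someone else. The Python sketch below shows one way to turn two samples of that line into a steal percentage. It is an illustration only, not the monitoring we actually used; the function names and the sample numbers in the comments are hypothetical.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples.

    Uses the first eight counters: user, nice, system, idle, iowait,
    irq, softirq, steal. Guest time is already included in user/nice.
    """
    b = parse_cpu_line(before)[:8]
    a = parse_cpu_line(after)[:8]
    if len(a) < 8 or len(b) < 8:
        raise ValueError("kernel does not report a steal counter")
    deltas = [new - old for old, new in zip(b, a)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = f.readline()
    time.sleep(interval)
    with open(path) as f:
        after = f.readline()
    return steal_percent(before, after)


# Hypothetical samples: 65 of the 100 ticks between them were stolen.
# steal_percent("cpu 100 0 50 800 10 0 5 35 0 0",
#               "cpu 110 0 55 820 10 0 5 100 0 0")
```

On a healthy virtual server the result should stay near zero; sustained high values mean the physical host is oversubscribed. The same figure appears as the `st` column in `top` and `vmstat`.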

We sent a series of replies starting at 2019/05/22 11:39 PM ET (just 11 minutes after their reply), then waited until 2019/05/23 1:45 AM ET for a reply from DigitalOcean support that acknowledged the issue and offered us a possible solution. We replied 5 minutes later, at 2019/05/23 1:50 AM ET, asking for the solution to be implemented.

Another 7 hours passed, and after many angry tweets to the DigitalOcean public Twitter account, we finally got a reply at 2019/05/23 9:36 AM ET from a support engineer, who ran a live migration to move our server to a less crowded physical machine.

Our server was restored to full operating capacity at 2019/05/23 9:55 AM ET.

Prevention going forward

During the downtime, there was almost nothing we could do while waiting on DigitalOcean support, so we heavily considered migrating to a different service provider. We also looked into DigitalOcean’s paid customer support, which they only started offering recently (it was not available at the time of our previous outage).

Moving forward, we’re considering a migration to a more reliable hosting provider (we’ve had many suggestions for AWS and various other options). We’re currently weighing the different providers to determine which one fits our needs the best.

Another option that we haven’t completely abandoned is to stay on DigitalOcean and purchase the new premium support package (which has SLA and response time guarantees). Since there’s no publicly available information about pricing and guarantees, and it’s available by inquiry only, we’ll need to go back and forth with them to get a better idea of whether this would help us prevent this kind of issue in the future, should we decide to stay with DO.

In all likelihood, a migration will be required, as we are disappointed with the current product offering and support from DigitalOcean. If and when we do get ready for a migration, it will be scheduled for a Saturday night, when our usage is lowest, to minimize service interruptions.

To everyone who was impacted, we’re deeply sorry for the inconvenience!

Jaime_g1 · May 25, 2019, 8:05am

Thank you for your detailed explanation of the issue, and congratulations on the rapid restoration of the service!!
Best regards,
Jaime

Abhishek_Mittal · May 25, 2019, 12:03pm

That’s OK, team. You have built a great product… I was using Workflowy and am now migrating to you guys. Hoping for good support ahead.

svsmailus · May 25, 2019, 8:47pm

I have since found that Dynalist on my iPhone has had errors in syncing. This has not happened with my iPad. I did use my iPhone to create a new document during this time, but not my iPad. No matter what I do, the document list and bookmarks refuse to sync with my iPhone and vice versa. Some documents sync content, but the new doc does not appear online or on the desktop.

In the end I deleted the app and reinstalled it, as I could no longer trust the app in its current state. It is now working after reinstalling. I did need to back up the new doc that was not syncing.

I’m sharing this because I thought all was OK until I couldn’t find the doc on my desktop that was created on my iPhone.

I would encourage anyone who created docs during the outage to check that those docs have synced from the device used to create them. An easy way is to create a new bookmark online and see if it appears on the devices used to create a new doc during the outage. I found my device did not sync the doc list or bookmark list after the outage.

1111 · May 26, 2019, 12:24am

Another option is to have a main service provider and backup ones. It is highly unlikely that all of them go out at the same time, which would assure Dynalist’s uptime. Say DO goes out: kick-start Google Cloud or Amazon Cloud. It only requires a routine backup of Dynalist.

Shida · May 26, 2019, 2:42am

That’s weird… Were you able to create a new document after the outage and have it sync, or did all new documents stop syncing?

Shida · May 26, 2019, 2:44am

That’s also an option we’re looking into for the future. We currently have a rolling backup in a completely different geographic region, but it’s not ready to serve as a master. We could consider making it capable of serving in the event of a failure on the main server.

svsmailus · May 26, 2019, 7:33am

The new doc created during the outage didn’t sync across, and neither did any changes made to the document structure or new bookmarks made on the desktop. Content of docs already in existence seemed to sync. I have not yet checked each one that was changed, but a cursory glance at a few shows they’re OK.

Shida · May 26, 2019, 8:02am

OK, sounds like the document structure sync was completely interrupted. I’ll look more into that and see if I can observe the same behavior on my side.

svsmailus · May 26, 2019, 8:04am

Thank you! All is well now. I noticed fairly promptly and managed to export that doc, so no data was lost. Appreciate your hard work during these testing times of host delays and shenanigans!

Shida · May 26, 2019, 8:20am

Thanks for your support!

BiosElement · May 27, 2019, 6:55pm

For what it’s worth, Linode has always been my go-to provider and been super reliable. In terms of just a drop-in replacement, it’s comparable on almost every level to Digital Ocean without the jank that comes from their rapid growth.

Ahmad_Halwani · May 28, 2019, 4:23pm

Dear Shida,
Thank you for the detailed explanation. Your and team Dynalist’s thoughtfulness comes across in how snappy Dynalist is and how well thought out and user-friendly the interface is. I did not experience the service outage, but how you all handled it, and your clear communication of how you did so, is most appreciated.

Best
H

Matt1 · May 29, 2019, 6:09am

Thanks so much for the detailed chronology of events and actions taken.

Love the product, and good luck at the continuous improvement!

PS: I know it’s not on the roadmap, but still praying for a native iOS app that isn’t a wrapper.

Christopher_Bistany · May 29, 2019, 2:35pm

This after-the-fact detailed breakdown of exactly what happened is first-class support that I’m happy to have helped to pay for. Very interesting, and puts me at ease, considering this is a tool I use every day and rely on for my productivity.

Erica · May 30, 2019, 1:36am

To anyone who’s concerned about DigitalOcean, we’re migrating away this Saturday:

https://blog.dynalist.io/2019-06-01-migration/