Time
2019/05/22 2 AM to 9 AM ET (7 hours)
2019/05/22 11 PM to 2019/05/23 12:40 AM ET (1 hour and 40 minutes)
2019/05/23 4:20 AM to 10 AM ET (5 hours and 40 minutes)
Total downtime: almost 15 hours (14 hours and 20 minutes), spread across 2 days
Issue
During the above time windows, Dynalist’s website was unreachable and clients were unable to sync. Users would see 500 errors, and requests would time out.
For the entire period from 7 PM ET on 2019/05/21 until resolution at 10 AM ET on 2019/05/23, the server was intermittently slow to respond, taking up to 10 seconds.
No data was lost during the process.
TL;DR: what caused it
Our hosting provider (DigitalOcean) over-provisioned their hardware, meaning they tried to place more servers on a physical machine than it had capacity for.
Timeline
Our database server normally uses only about 10% of the CPU we pay for. This headroom ensures that we are ready to handle surges in requests, and we rarely see anything above 20% CPU.
Starting at 7 PM ET on 2019/05/21, we saw a slowdown in request handling, causing a portion of our users to wait 3-10 seconds instead of the usual <0.5s response time. Given there were no network traffic abnormalities, we identified the cause as a CPU issue. We suspect this was triggered when a newly deployed server belonging to another DigitalOcean customer was automatically allocated on the same physical machine as our database server. It’s likely that this new server began consuming CPU time on that machine, but somehow DigitalOcean wasn’t able to detect that there were too few CPU cycles to accommodate all the servers provisioned on it.
We filed a ticket with DigitalOcean on 2019/05/21 9:29 PM ET.
There was no reply for a whopping 26 hours(!) of anxious waiting, until a support agent responded on 2019/05/22 11:28 PM ET. The agent told us that they were seeing normal CPU usage on their side and asked us for log details as proof that the issue was happening.
We then generated logs from CPU monitoring that clearly showed a deficiency (known as CPU “steal” time: time during which a virtual server is ready to run but the hypervisor is serving other virtual servers instead). By this time, our server already had a backlog of user requests to serve and needed about 20% available CPU to catch up and return to normal operation. We were barely able to get 10% CPU time, and from time to time we could only use 5-8% of the CPU we were paying for.
We sent a series of replies starting at 2019/05/22 11:39 PM ET (just 11 minutes after their reply), then waited until 2019/05/23 1:45 AM ET for a reply from DigitalOcean support that acknowledged the issue and offered us a possible solution. We replied 5 minutes later, at 2019/05/23 1:50 AM ET, asking for the solution to be implemented.
Another 7 hours passed, and after many angry tweets to DigitalOcean’s public Twitter account, we finally got a reply at 2019/05/23 9:36 AM ET from a support engineer, who ran a live migration to move our server to a less crowded physical machine.
Our server was restored to full operating capacity at 2019/05/23 9:55 AM ET.
Prevention going forward
During the downtime, there was almost nothing we could do while waiting on DigitalOcean support, so we heavily considered migrating to a different service provider. We also looked into DigitalOcean’s paid customer support, which they only started offering recently (it was not available at the time of our previous outage).
Moving forward, we’re considering a migration to a more reliable hosting provider (we’ve had many suggestions for AWS and various other options). We’re currently weighing the different providers to determine which one best fits our needs.
Another option that we haven’t completely abandoned is to stay on DigitalOcean and purchase the new premium support package (which has SLA and response time guarantees). Since no information about pricing and guarantees is publicly available and the package is by inquiry only, we’ll need to go back and forth with them to get a better idea of whether it would help us prevent this kind of issue in the future, should we decide to stay with DO.
In all likelihood, a migration will be required, as we are disappointed with the current product offering and support from DigitalOcean. If and when we do get ready for a migration, it will be scheduled for a Saturday night, when our usage is lowest, to minimize service interruptions.