Time
2017/08/28 2AM EDT to 2017/08/28 2:28PM EDT (12.5 hours)
Issue
Dynalist’s service was severely degraded: sometimes slow (taking a couple of seconds to load/sync), sometimes very slow (taking more than 30 seconds), and sometimes timing out entirely. Eventually it was completely down for about 30 minutes (Cloudflare error 502).
TL;DR
A recent server upgrade left our database de-optimized, which made syncing large documents slow. When Monday traffic started, large documents began syncing and slowed everything else down.
The fix took a lot longer than expected because of misleading metrics.
Root Cause
Because of a recent database upgrade migration (which happened on 2017/08/26 7PM EDT), our database table was de-optimized, which caused sync attempts on large documents to take 10–20 seconds to complete, even when there was no change to sync. We believe this issue only surfaced a day after the migration because there was barely any traffic during the weekend, which is why we thought the migration was safely over.
Monday’s traffic surge made the de-optimization issue very pronounced, to the point that the database could not keep up with the volume of incoming sync requests.
On top of that, it just so happened that Workflowy had some downtime. Many Workflowy users signed up for Dynalist and used the “Import from Workflowy” feature, which pushed a lot of extra data through our servers while Workflowy was available and hogged our resources when it wasn’t.
Mitigation
We optimized the database table. Specifically, we ran MySQL’s OPTIMIZE TABLE statement on the InnoDB table we use for synchronizing changes.
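For reference, the statement looked roughly like the following; doc_changes is a hypothetical table name standing in for our actual sync table:

    -- Rebuild and re-analyze the InnoDB table holding synced document changes.
    -- For InnoDB, OPTIMIZE TABLE internally does a table recreate plus ANALYZE.
    OPTIMIZE TABLE doc_changes;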
Even though the optimization could run without shutting down Dynalist, keeping the service up significantly slowed the recovery, so we took Dynalist down for about half an hour to speed it up.
Once the database optimization was finished, large sync requests that used to take over 20 seconds finished in around 0.1 seconds on average. For comparison, the average was about 0.5 seconds before the database upgrade.
Prevention going forward
We had planned to do database query optimization right after the server upgrade stabilized; after this incident, that work seems even more necessary. It will include identifying slow queries, optimizing them, and tuning database indices.
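As a rough sketch of what that work looks like in MySQL (the threshold, table, and column names below are illustrative, not our actual schema):

    -- Log any query that takes longer than 1 second.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

    -- Check how a suspect query is executed...
    EXPLAIN SELECT * FROM doc_changes WHERE document_id = 12345;

    -- ...and add an index if it turns out to be scanning the whole table.
    ALTER TABLE doc_changes ADD INDEX idx_document_id (document_id);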
On top of that, we will add more monitoring to our database to constantly watch the number of queries and how long they take on average. Right now our internal dashboards only monitor whether our web server responds normally to requests; moving forward, the health of our database will be another thing we watch at all times.
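For example, MySQL already exposes counters and per-statement statistics that a dashboard can poll periodically (this assumes the performance schema is enabled; the queries below are a sketch, not our actual dashboard):

    -- Total queries and slow queries since the server started.
    SHOW GLOBAL STATUS LIKE 'Questions';
    SHOW GLOBAL STATUS LIKE 'Slow_queries';

    -- Average latency per query pattern (timer columns are in picoseconds).
    SELECT DIGEST_TEXT,
           COUNT_STAR AS executions,
           AVG_TIMER_WAIT / 1e12 AS avg_seconds
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY AVG_TIMER_WAIT DESC
    LIMIT 10;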
Why it took so long
It just happened to be a really bad coincidence:
- Workflowy was down, so there was extra “heavy” traffic (many attempts to import from Workflowy), which we thought was the cause of the issue at first
- Misleading signs sent us down the wrong debugging path for a few hours (CPU and memory usage looked normal, so we didn’t think the database was the bottleneck)
- Our team had just finished a long day of work and 2AM was right around sleep time, so we were debugging while badly sleep-deprived
Sorry for the inconvenience!