2017/08/27 Dynalist outage post-mortem

Time

2017/08/28 2AM EDT to 2017/08/28 2:28PM EDT (12.5 hours)

Issue

Dynalist’s service was severely degraded: sometimes slow (taking a couple of seconds to load/sync), sometimes very slow (taking more than 30 seconds), and sometimes timing out altogether. Eventually the service was completely down for about 30 minutes (Cloudflare error 502).

TL;DR

A recent server upgrade left our database de-optimized, which made syncing large documents slow. When Monday started, large documents started to sync again, which slowed everything else down.

The fix took a lot longer than expected because of misleading metrics.

Root Cause

Because of a recent database upgrade migration (which happened on 2017/08/26 7PM EDT), our database table was de-optimized, which caused sync attempts on large documents to take 10-20 seconds to complete, even when there were no changes to sync. We believe the issue only surfaced a day after the migration because there was barely any traffic during the weekend, so we thought the migration was safely over.
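For readers curious what “de-optimized” looks like in practice, one rough way to spot it is the amount of free (wasted) space InnoDB reports for a table. The sketch below is illustrative only; the table name is a placeholder, not our actual schema.

```sql
-- Illustrative only: gauge InnoDB table bloat after heavy churn.
-- "sync_changes" is a hypothetical table name, not our actual schema.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb,
       ROUND(data_free    / 1024 / 1024) AS free_mb  -- a large value hints at fragmentation
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'sync_changes';
```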

Monday’s traffic surge made the de-optimization issue very pronounced, to the point that the database could no longer keep up with the volume of sync requests coming in.

On top of that, it just so happened that Workflowy had some downtime. Many Workflowy users sign up for Dynalist and use the “Import from Workflowy” feature, which pushed a lot of extra data through our servers while Workflowy was available and hogged our resources when it wasn’t.

Mitigation

We optimized the database table. Specifically, we ran MySQL’s OPTIMIZE TABLE table_name (on InnoDB) against the table we use for synchronizing changes.
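For the curious, the statement itself is a one-liner; the table name below is a placeholder rather than our real one.

```sql
-- Placeholder table name. On InnoDB, OPTIMIZE TABLE is mapped to ALTER TABLE ... FORCE,
-- which rebuilds the table and its indexes and reclaims unused space.
OPTIMIZE TABLE sync_changes;
```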

Even though the optimization could run without shutting down Dynalist, keeping the service up while it ran was significantly delaying recovery, so we took Dynalist down for about half an hour to speed things up.

Once the database optimization finished, large sync requests that used to take over 20 seconds completed in about 0.1 seconds on average (the average was about 0.5 seconds before the database upgrade).

Prevention going forward

We had already planned to do database query optimization once the server upgrade stabilized; after this incident, that work looks even more necessary. It will include identifying slow queries, optimizing them, and optimizing database indices.
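Roughly speaking, that work looks like the sketch below. The table, column, and index names are placeholders, not our real schema: turn on the slow query log, EXPLAIN the offenders, and add indices that match the access patterns.

```sql
-- Placeholder names throughout; a sketch of the planned workflow, not our exact queries.

-- 1. Capture anything slower than one second in the slow query log.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- 2. Check the plan of a suspect query; a full table scan or "Using filesort"
--    usually means an index is missing.
EXPLAIN SELECT * FROM sync_changes WHERE document_id = 12345 ORDER BY version;

-- 3. Add an index that matches the access pattern above.
ALTER TABLE sync_changes ADD INDEX idx_document_version (document_id, version);
```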

On top of that, we will add more monitoring to our database to constantly watch the number of queries and how long they take on average. Right now our internal dashboards only monitor whether our web server responds normally to requests; going forward, the health of our database will be another thing we watch at all times.
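As a rough idea of the kind of metrics such monitoring could poll (assuming the performance_schema and sys schema are available, MySQL 5.7+; this is not our exact dashboard setup):

```sql
-- Illustrative queries, not our actual dashboard.

-- Total number of statements executed since server startup.
SHOW GLOBAL STATUS LIKE 'Questions';

-- Statements ranked by total latency (requires the sys schema).
SELECT query, exec_count, total_latency, avg_latency
FROM sys.statement_analysis
ORDER BY total_latency DESC
LIMIT 10;
```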

Why it took so long

It just happened to be a really bad coincidence:

  • Workflowy was down, so there was extra “heavy” traffic (many attempts to import from Workflowy), which we initially thought was the cause of the issue
  • Misleading signs sent us down the wrong debugging path for a few hours (CPU and memory usage looked normal, so we didn’t suspect the database was the bottleneck)
  • Our team had already had a long day of work, and 2AM was right around sleep time, so we were working through it sleep-deprived

Sorry for the inconvenience!


Everything’s working great for me now, very quick sync. I’m so glad you guys and girls found and resolved the root cause. I hope you get some well deserved sleep!


Here to report that my data loss issue (most likely caused by the fact that I had to use the web client throughout the outage / while the DL servers were taken offline, I guess) has been 100% resolved by:

  • exporting the top-most changed node (and its children) to OPML (prior to closing the web client)
  • refreshing/reopening the web app
  • copying the content of the previously exported OPML file (in a text editor)
  • pasting the clipboard into a destination node in the target Dynalist document.

Might be worth adding to your FAQ, or something like that.

(Better yet would be the ability to work offline + export/import to cloud storage providers…)

Many thanks and kudos for your hard work.

Hi Shida,

As a fellow large-scale processing systems developer, I’d suggest planning for a suite of thorough load tests as an integral part of such infrastructure upgrades (I don’t see it mentioned explicitly).

Forgive the advice that wasn’t asked for, but it is all in good faith and with appreciation for your product and service.

Regards

Didn’t notice.

You could download the desktop app if you want to work offline: https://dynalist.io/download


The biggest thing is you guys were ON IT and got to the bottom of the problem pretty quickly. Yes, I noticed the slowness, but being notified (through the DL app) that there was a problem was helpful. Thanks for taking the time to keep us in the loop. Most important of all, you were ON IT, identified the problem, took responsibility, and are applying what was discovered in this incident to take preventative action for the future. Kudos.

I contrast Dynalist’s excellent ON IT attitude and response times with Microsoft’s horrible response to OneDrive problems and support. I had a situation recently where I had to do a full restore of my OneDrive data to my reformatted PC and suffered through data transfer speeds worse than a dial-up modem, even though I was on blazing-fast fiber Internet. It took Microsoft 21 DAYS to resolve the issue, which they said was being experienced by many other users. That is totally unacceptable. My data was held hostage on their servers while they spent days fooling around, not fixing the problem. On Day 20 I finally got a competent support technician, and on Day 21 data syncing returned to normal; I had all my data back within 24 hours.

Granted, Microsoft OneDrive has a much bigger customer base to support, but they also have more resources than DynaList to fix the problems. So, no excuses.

This is why I’m a huge DynaList fan and user. I can’t thank you enough for the incredible tool DynaList is in my life. It’s simple, yet powerful. Very versatile. Always improving and very reliable.

Keep up the outstanding work!

Mark


No data lost? Quick response? Complete transparency? A plan for future prevention?

Sounds like you guys handled this perfectly.


Nobody wishes for this type of incident but the way it was handled is tremendously reassuring. You guys care and your customers remember that long after the inconvenience has been forgotten.
