2018/02/07 Dynalist outage post-mortem

Shida · February 23, 2018, 5:46pm

Time

2018/02/07 5:00PM EDT to 2018/02/07 9:00PM EDT (~4 hours)

Issue

Dynalist service was intermittent, especially with regard to syncing. Several users reported seeing internal server errors. No data was lost.

Cause

The disk on our web server was full, so large web requests were dropped because their temp files couldn’t be stored for processing. Most of the space was taken up by /tmp.

Mitigation

Cleaned /tmp and restarted server.

Prevention going forward

Installed automated scripts to clean /tmp periodically.
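For readers who want something similar, here is a minimal sketch of what such a cleanup script could look like. It is not the script we actually run; the function name `clean_old_files` and the age cutoff are assumptions made for illustration.

```python
import os
import time
from typing import List, Optional


def clean_old_files(directory: str, max_age_seconds: float,
                    now: Optional[float] = None) -> List[str]:
    """Delete regular files under `directory` whose last modification is
    older than `max_age_seconds`. Returns the paths that were removed."""
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                # lstat judges a symlink by its own mtime, not its target's.
                if now - os.lstat(path).st_mtime > max_age_seconds:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or cannot be removed; skip it.
                continue
    return removed
```

A scheduler such as cron could call `clean_old_files("/tmp", 24 * 3600)` once a day. The one-day cutoff is only an example; a real cutoff has to be longer than the lifetime of any temp file still in use.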

Why this wasn’t caught/fixed sooner

Our disk space monitoring graph was just slightly out of view on our main dashboard, so we did not notice that the disk was filling up. We’ve since rearranged our dashboard to better monitor the disk space.
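A graph only helps if someone is looking at it, so a threshold check that can trigger an alert is a useful complement. The following is a minimal sketch, not our actual monitoring setup; the function names and the 90% default threshold are assumptions.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """True when the used fraction is at or above `threshold`."""
    return disk_usage_fraction(path) >= threshold
```

A periodic job could call `is_disk_nearly_full("/")` and send a notification when it returns `True`, instead of relying on someone spotting the graph.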

The service was down for ~4 hours because all of our team members were unavailable at the time of the outage.

Sorry for the inconvenience, and thanks to all those who reached out to report the problem!

Kevin_Murray · February 8, 2018, 3:44am

It’s reassuring to see how steps have been taken to prevent it happening again. Thank you.

Robert_Floyd · February 8, 2018, 4:27pm

Thank you for the resolution and explanation of the problem. One aspect of it is a bit troubling: the problem persisted because the entire team was unavailable. While I understand the team itself is small and likely to remain so, a four-hour outage of a paid service (I’m on Pro) that is being used for business purposes is Not A Good Thing.

Have you considered establishing a service level agreement (SLA) for paying customers? It’s important to know that there’s someone available to deal with issues as they arise. At a minimum, there should be a 24/7 on-call rotation to receive alerts when there’s a system problem. Support is one of the larger challenges of growing an Internet business. Once people start paying, there’s a certain expectation of service levels. That’s why I prefer paid services over free ones.

Shida · February 9, 2018, 1:32am

That’s a good point you’ve raised. Our current team consists of two engineers from the same timezone, so it’s really difficult to get full coverage for the on-call. Hopefully as we scale out, we’ll be able to afford more eyes on the dashboard as well as offering some kind of guarantee.

Jonathan_Yankovich · February 9, 2018, 2:24am

Consider a service like PagerDuty, which can be used to alert engineers when a service becomes unavailable.

The way this problem was described (“Disk space was full on our web server”) suggests a single point of failure… most modern web applications use an N+1 architecture such that there is ideally no single point of failure for the system, and if one part has a problem, another part will pick up for it. Ideally the server instances would be ephemeral, and a heartbeat service would detect when one is having a problem, kill it, and spawn a new one in its place. In the case of a full tmp drive, a solution like this would mitigate the problem.
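To make the heartbeat idea concrete, here is a rough sketch of the replace-on-failure logic. It is not tied to any particular cloud provider: `HeartbeatMonitor`, `check`, and `replace` are hypothetical names, and the two callbacks stand in for whatever health probe and instance-provisioning API is actually in use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    check: Callable[[str], bool]   # hypothetical probe: True if the instance is healthy
    replace: Callable[[str], str]  # hypothetical: kills the instance, returns the new id
    max_failures: int = 3          # consecutive failed checks before replacing
    failures: Dict[str, int] = field(default_factory=dict)

    def poll(self, instance_ids: List[str]) -> List[str]:
        """Run one round of health checks and return the updated instance ids."""
        current = []
        for iid in instance_ids:
            if self.check(iid):
                # A healthy response resets the failure counter.
                self.failures.pop(iid, None)
                current.append(iid)
                continue
            count = self.failures.get(iid, 0) + 1
            if count >= self.max_failures:
                # Too many consecutive failures: swap in a fresh instance.
                self.failures.pop(iid, None)
                current.append(self.replace(iid))
            else:
                self.failures[iid] = count
                current.append(iid)
        return current
```

Requiring several consecutive failures before replacing avoids killing an instance over a single slow response.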

Let me know if you want to do some private consultation to get something like this in place. I love Dynalist and would love to see it grow!

ruud · February 9, 2018, 11:31am

Not to put the blame on you, but especially if you use a cloud product for business, and especially if it is crucial to you, be ready to have it fail. I’m also an Evernote user, a service which has a business tier, and it does happen that people post about how they lost this or that or had no access to important information. There too the tip is: back up, back up, back up, and be ready with a solution.

Have a program ready to import your OPML into. Have Dynalist desktop installed and synced at least once a day so you can continue working even if there is an outage.

Never set yourself up for a situation where you have to blame your business loss on a cloud service.