2018/02/23 Dynalist outage post-mortem

Time

2018/02/23 10:20 AM EDT to 2018/02/23 12:20 PM EDT (~2 hours)

Issue

Dynalist was not accessible. No data was lost.

Cause

Our hosting provider was applying Spectre and Meltdown mitigations. The instance was rebooted, but the web server was not restarted afterward.

While we were notified that reboots would be happening, we did not receive specific times for them. We’re still investigating whether it’s an email issue or an issue with our hosting provider’s notifications.

Mitigation

Restarted web server for Dynalist.

Prevention going forward

Although accidents like this are hard to prevent, we can deliver a more timely fix the next time something like this happens (the server is rebooted and just needs a simple restart).

As our next priority, we’ll set up PagerDuty for Dynalist. We’ll post a reply to this post-mortem when it’s all set up.

Once it’s set up, PagerDuty will call us until we acknowledge the downtime, which will help minimize it. We did receive a lot of emails and Twitter messages, but it’s hard to make sure we’ll see them when we’re not working.

Sorry for the inconvenience, and thanks to all those who reached out to report the problem!


Thanks, Dynalist team, for being on this and getting it resolved, with an eye toward an even better future response! Keep up the great work! Dynalist is wonderful and has a great team to develop & support it! :grinning:

Thanks for the kind reply, Mark!

We’ll do all we can to prevent things like this from happening again.

Thank you for the notification and the post-mortem. I’m especially glad to hear that you’re moving toward better notification tools for your team. Keep up the good work.

Good stuff.

Also, you can change the startup scripts on your instance to restart your web server, or use monit on your server to auto-start anything that always needs to run. I keep servers running at work, so if you have questions, reach out and I will help!

tony
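
For readers who want to see the auto-restart idea in concrete form, here is a minimal Python sketch. It assumes a systemd-based host where `systemctl` is available, which the thread does not confirm; the function names are hypothetical. A tool like monit, or the init system itself, would normally do this job.

```python
import subprocess


def service_is_active(name: str) -> bool:
    """Return True if systemd reports the unit as active.

    Assumption: the host uses systemd and `systemctl` is on PATH.
    """
    result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    return result.returncode == 0


def restart_service(name: str) -> None:
    """Ask systemd to restart the unit; raises if the command fails."""
    subprocess.run(["systemctl", "restart", name], check=True)


def ensure_running(name, is_active=service_is_active, restart=restart_service) -> bool:
    """Restart the service only if it is not active.

    Returns True when a restart was issued, False when nothing was needed.
    The checker and restarter are injectable so the logic can be tried
    without touching a real service.
    """
    if is_active(name):
        return False
    restart(name)
    return True
```

Run from cron or a timer, something like `ensure_running("nginx")` would bring the web server back after a reboot. The unit name `nginx` is only a placeholder here.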


I wonder if the issue had anything to do with a Windows 10 update that occurred this morning as well.

Not really, as the server runs on Unix, not Windows. Thanks for asking!

Update: Dynalist now has a status page for the main app here.

We have set up UptimeRobot to check the availability of the site every 5 minutes. If it’s down, we’ll get a phone call through IFTTT.


We did initially look at PagerDuty. It’s designed with larger teams in mind, which means 90% of the features don’t make sense for a small team like ours. Also, PagerDuty is not a site uptime monitoring service by itself, so we would need to pay for another service to monitor site uptime. Monitoring site uptime is the thing we need the most, so we went with UptimeRobot.
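
The kind of check described above (probe the site every 5 minutes, raise an alert when it is down) can be sketched in a few lines of Python. This is only an illustration of the pattern, not how UptimeRobot or IFTTT work internally; the function names, the timeout, and the `alert` callback are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request

CHECK_INTERVAL_SECONDS = 5 * 60  # the 5-minute interval mentioned in the post


def site_is_up(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def monitor(url, alert, probe=site_is_up, interval=CHECK_INTERVAL_SECONDS,
            max_checks=None, sleep=time.sleep) -> int:
    """Probe the site repeatedly and call alert(message) on each failed check.

    `alert` is a hypothetical callback (for example, something that triggers
    a phone call). Returns the number of failed checks once max_checks is
    reached; with max_checks=None it runs until interrupted.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(url):
            failures += 1
            alert(f"{url} appears to be down")
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval)
    return failures
```

A hosted service has the advantage of running outside the server being monitored, so the check keeps working when the instance itself is rebooted.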
