2016/10/29 Dynalist outage

resolved

#1

The outage started around 12 am EDT and lasted about 9 hours, until we received the reports and checked the server. A simple restart solved the problem, but right now we’re not sure what caused it in the first place. We’re looking through various logs to find out.

We’ll post a follow-up postmortem after we find the cause. Fingers crossed.

This has never happened to us before. We’re so sorry that we weren’t using a website availability monitoring service that would have alerted us right when it happened. We’ll start using something like that asap.


#2

Everything is back up now. We’ve read through hours of server log data and now fully understand the root cause of the outage.

Current status

Dynalist is now up and running. We have resolved the issue, and we will continue to monitor the system to prevent similar issues in the future.

Cause of outage

When we migrated Dynalist to a better server last week (Oct 22, 2016), we also introduced a continuous database backup system using MySQL’s replication feature.

Basically, when our main database receives data (such as the content you just wrote), it also writes to a binary log. A second backup database reads this binary log and keeps itself in sync with the main database. This second database acts as our backup; if the main database is corrupted for some reason, we have an almost identical copy that we can use.

We later found that we had improperly configured the limits on the database’s binary log, which is used in the replication process. As a result, the server’s disk filled up and the server went down.
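To make the failure mode concrete, here is a minimal Python sketch of how one could tally the disk space taken by binary log files. The directory argument and the `mysql-bin.` file name prefix are assumptions for illustration (the real names depend on how `log_bin` is set), and `binlog_bytes` is a hypothetical helper, not part of MySQL.

```python
import os

def binlog_bytes(log_dir, prefix="mysql-bin."):
    """Return the total size in bytes of files in log_dir whose names start with prefix.

    Both log_dir and prefix are assumptions; adjust them to the server's log_bin setting.
    """
    total = 0
    for entry in os.scandir(log_dir):
        # Count only regular files that look like binary logs.
        if entry.is_file() and entry.name.startswith(prefix):
            total += entry.stat().st_size
    return total
```

If this number keeps growing without bound, the retention limits are not taking effect. In MySQL, `expire_logs_days` controls how long old binary logs are kept and `max_binlog_size` caps the size of each individual log file; which values are appropriate depends on the server.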

Unfortunately, the incident happened at midnight in our team’s timezone, so nobody was monitoring Dynalist or our support channels.

Mitigation

We have temporarily disabled continuous backups for our database while we study and test the configuration for properly limiting MySQL’s binary log. This freed up our disk space, which is now in a healthy state with lots of room to grow.

Future work

  • We’ll need to re-enable continuous backups as soon as we can
  • We plan to implement an alert system for system status monitoring so that we know asap when the site goes down
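As a rough idea of what the second item involves, below is a minimal Python sketch of an availability check. It is an illustration under assumptions, not the alert system itself: `site_is_up` is a hypothetical helper, the timeout value is arbitrary, and the `opener` parameter exists only so the check can be exercised without a real network call.

```python
import urllib.error
import urllib.request

def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if url answers with an HTTP 2xx or 3xx status within timeout seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses raised by urlopen, DNS failures,
        # refused connections, and timeouts.
        return False
```

A scheduler would call this every minute or so from a machine other than the web server and send a notification after a few consecutive failures, so that a single slow response does not page anyone.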

#3

As a first step, we’ve signed up for notifications when Dynalist goes down (using Down Notifier).

We’re also going to implement a status page so that you can be certain when the problem is on our end. It will also help prospective users assess our reliability.


#4

Quick update: Continuous backups are now back online. We’ll be monitoring the system for a week to see how disk usage is affected and whether our limits are taking effect correctly.
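For anyone curious what watching disk usage can look like, here is a minimal Python sketch using the standard library. The path and the 80% threshold are assumptions chosen for illustration, and both function names are hypothetical.

```python
import shutil

def disk_used_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def disk_needs_attention(path="/", threshold=0.8):
    """Return True when disk usage at path has reached the given threshold."""
    return disk_used_fraction(path) >= threshold
```

Recording `disk_used_fraction` at a fixed interval over the week would show whether usage levels off once old binary logs start being purged or keeps climbing.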