2016/10/29 Dynalist outage

Shida · October 29, 2016, 3:09pm

Everything is back up now. We’ve read through hours of server log data and now fully understand the root cause of the outage.

Current status

Dynalist is now up and running. We have resolved the issue, and we will continue to monitor the system to prevent similar issues in the future.

Cause of outage

When we migrated Dynalist to a better server last week (Oct 22, 2016), we also introduced a continuous database backup system using MySQL’s replication feature.

Basically, when our main database receives data (such as the content you just wrote), it also writes to a binary log. A second database reads this binary log and keeps itself in sync with the main database. This second database acts as our backup; if the main database is corrupted for some reason, we have an almost identical copy that we can use.
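To make the idea concrete, here is a toy sketch in Python of log-based replication. It is not how MySQL implements it; the `Primary` and `Replica` classes and their methods are hypothetical names made up for illustration.

```python
class Primary:
    """Toy main database: every write is also appended to a log."""

    def __init__(self):
        self.data = {}
        self.binlog = []  # append-only list of (key, value) change events

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))


class Replica:
    """Toy backup database: replays the primary's log to stay in sync."""

    def __init__(self):
        self.data = {}
        self.position = 0  # index of the next log event to apply

    def catch_up(self, primary):
        # Apply only the events we have not seen yet.
        for key, value in primary.binlog[self.position:]:
            self.data[key] = value
        self.position = len(primary.binlog)


primary = Primary()
replica = Replica()
primary.write("doc1", "hello")
primary.write("doc1", "hello world")
replica.catch_up(primary)
```

Note that in this sketch the log gains an entry on every write, even when the same document is overwritten, so the log keeps growing unless something trims it.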

We later found that we had improperly configured the limits on the database’s binary log, which is used in the replication process. Without a proper limit, the log kept growing until it filled the server’s disk, and the server went down as a result.

Unfortunately, the incident happened at midnight in our team’s timezone, so nobody was monitoring Dynalist or our support channels.

Mitigation

We have temporarily disabled continuous backups for our database while we study and test the configurations for properly limiting MySQL’s binary log. This freed up our disk space, which is now in a healthy state with lots of room to grow.
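As a rough illustration of what "limiting" a log means, the Python sketch below picks which log files to delete based on an age limit and a total size cap. MySQL has its own server settings for this, which the sketch does not use or model; `LogFile`, `files_to_purge`, the file names, and the limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogFile:
    name: str
    size_bytes: int
    age_days: float


def files_to_purge(files, max_age_days, max_total_bytes):
    """Return the names of log files to delete, oldest first.

    A file is purged if it is older than max_age_days, or if the
    remaining files still exceed max_total_bytes in total.
    This sketch ignores whether a replica has already read the file.
    """
    ordered = sorted(files, key=lambda f: f.age_days, reverse=True)
    total = sum(f.size_bytes for f in ordered)
    purge = []
    for f in ordered:
        if f.age_days > max_age_days or total > max_total_bytes:
            purge.append(f.name)
            total -= f.size_bytes
    return purge


logs = [
    LogFile("binlog.000001", 100, 10),
    LogFile("binlog.000002", 100, 5),
    LogFile("binlog.000003", 100, 1),
]
to_delete = files_to_purge(logs, max_age_days=7, max_total_bytes=150)
```

A real setup would also need to make sure the backup database has finished reading a log file before it is removed.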

Future work

We’ll need to re-enable continuous backups as soon as we can.
We plan to implement an alert system for system status monitoring, so that when the site goes down we know about it right away.
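A minimal sketch of the kind of check such an alert system could run, using only the Python standard library. This is an assumption about what monitoring might look like, not a description of what we will deploy; the function names and the 80% disk threshold are made up for illustration.

```python
import shutil
import urllib.request


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the disk holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def site_is_up(url, timeout=5.0):
    """Return True if the URL answers with a success or redirect status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection failures, timeouts, and HTTP error responses.
        return False


def collect_alerts(disk_fraction, site_up, disk_threshold=0.8):
    """Turn raw measurements into a list of human-readable alerts."""
    alerts = []
    if disk_fraction >= disk_threshold:
        alerts.append(f"disk usage at {disk_fraction:.0%}")
    if not site_up:
        alerts.append("site is not responding")
    return alerts
```

A scheduled job could call `collect_alerts(disk_usage_fraction("/"), site_is_up(url))` every few minutes and send any non-empty result to whoever is on call, which would have flagged the disk filling up before the server went down.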