2019/10/28 Dynalist outage post-mortem

Shida · October 31, 2019, 1:16am

Time

2019/10/28 4:30 PM - 6:30 PM ET (2 hours)

Issue

During the above time window, Dynalist’s webpage was unreachable and syncing failed. Users saw 5xx errors, and requests timed out.

Details

Our hosting provider had a major networking outage.
Below is the official post-mortem from our hosting provider:

OFFICIAL RFO - 10/28/2019

Summary of Incident:

———————————————

Yesterday, Monday October 28th 2019, at approximately 4:23pm portions of customers in our TPA1, TPA2 and DAL1 data centers experienced a loss of network that lasted anywhere from a few minutes to a few hours depending on your server(s) location. The cause of the issue has been identified and is as follows:

At roughly 4:23pm one of our Network Engineers applied a policy update to our DAL1 edge routers. This policy update was incomplete, which led to the full internet routing table being propagated throughout the aggregation layer of DAL1. This mistake was further exacerbated when that full routing table was automatically injected into the Hivelocity DDoS protection network, resulting in the full routing table being distributed to other Hivelocity facilities, i.e. TPA1 and TPA2. The full internet routing table injection led to multiple network devices having their resources exhausted, which ultimately led to the network disruption. Once our Network Engineers identified the cause of the issue we began reloading each of the affected network devices to correct the problem. Ultimately, yesterday’s network event was a result of human error.

Service Impact Times:

———————————————
October 28th, 4:23pm - 6:44pm EST

Remediation Plans:

———————————————

We have implemented new router policies that will prevent full route tables being similarly propagated should human error ever occur again. Additionally, we have introduced additional review protocols to minimize the chance of human error occurring.

For years most of our customers have experienced 100% uptime due to our redundancies and nearly 2 decades of experience. We take our responsibility to you very seriously and no one hates it more than us when we fall short of our goals. We are deeply sorry for the inconvenience and any negative impact this disruption had on your operation.

To everyone who was impacted, we’re deeply sorry for the inconvenience!
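
For anyone curious what a safeguard against this kind of route flood can look like: one widely used mechanism is a maximum-prefix limit, where a routing session is torn down once a peer announces more routes than expected, instead of letting the device run out of resources. We don't know which policies Hivelocity actually deployed, so the following is only a toy Python sketch of the idea. The names `PrefixLimitedSession` and `simulate`, the limit values, and the synthetic prefixes are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixLimitedSession:
    """Toy model of a routing session with a maximum-prefix limit."""

    max_prefixes: int  # hypothetical limit, chosen only for illustration
    routes: set = field(default_factory=set)
    shut_down: bool = False

    def receive(self, prefix: str) -> bool:
        """Accept a route unless the limit is hit; return True if accepted."""
        if self.shut_down:
            return False
        if prefix not in self.routes and len(self.routes) >= self.max_prefixes:
            # Drop everything learned on this session and stop accepting
            # routes, rather than exhausting the device's resources.
            self.routes.clear()
            self.shut_down = True
            return False
        self.routes.add(prefix)
        return True


def simulate(announced: int, limit: int) -> PrefixLimitedSession:
    """Feed `announced` distinct synthetic /24 prefixes into a session."""
    session = PrefixLimitedSession(max_prefixes=limit)
    for i in range(announced):
        # The addresses are meaningless; they only need to be distinct.
        session.receive(f"10.{(i >> 8) & 255}.{i & 255}.0/24")
    return session


if __name__ == "__main__":
    normal = simulate(announced=50, limit=100)
    flood = simulate(announced=500, limit=100)
    print("normal:", len(normal.routes), "routes, shut down:", normal.shut_down)
    print("flood: ", len(flood.routes), "routes, shut down:", flood.shut_down)
```

In the sketch, the session that stays under the limit keeps its routes, while the flooded session shuts itself down and holds nothing. Real routers implement this per peer in their own configuration, and the details differ by vendor.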

Louis_Kirsch · October 31, 2019, 9:46am

Didn’t you transition to AWS?

Shida · October 31, 2019, 5:13pm

No, our last migration was moving from VM-based hosting (DigitalOcean) to dedicated servers (Hivelocity). Currently, Dynalist’s main servers each have a whole physical machine to themselves, whereas before we were sharing the CPU with other customers on DigitalOcean.

Justin_Maxwell · November 4, 2019, 12:47am

A philosophical point (just 'cos it’s one of my hobbyhorses - nothing to do with Dynalist really):

There’s a logically cohesive, academic argument that there is no such thing as ‘human error’, because it’s always a failure in the system that allows the human factor to cause the problem. This is evidenced by your hosting provider putting in place system-level changes to prevent a recurrence. Of course, the prior lack of such policies can be argued to mean there was human error further up the management chain of the hosting provider over an extended period of time. But blaming the engineer’s human failure is an effective deflection.

The thinking arises mostly from the nuclear power industry, where the idea that an error by an individual could have a dangerous outcome is anathema to safety.