2019/05/22 Dynalist outage

Shida · May 22, 2019, 11:30am

Currently still ongoing, taking a quick break to update everyone on what’s going on:

Our database server has been hit with a severe CPU degradation issue over the past ~8+ hours.
So far, it doesn’t look malicious.

The last time we experienced this identical issue was when another DigitalOcean client co-located a bitcoin mining server on the same host as our server, which was against their Terms of Service.

We’ve reached out to DigitalOcean, but it’s been almost 5 hours with no reply from their side. There’s not much we can do at this moment, but if the issue persists much longer with no response, we’ll need to consider an emergency switch to a different hosting provider.

I will be updating this post as more details come up.

pottster · May 22, 2019, 11:43am

Thanks for the update Shida. Good luck.

Shida · May 22, 2019, 12:46pm

We’re observing improvement to the situation! Still no response from DigitalOcean but the server is making progress in catching up to requests.

The underlying issue doesn’t look fully addressed yet, so I’d expect intermittent issues, but for now we can take a short breath.

Erica · May 22, 2019, 3:25pm

So far it looks like performance has gotten much better, just a bit slower than usual. However, because we still haven’t received a reply from our service provider, we’ll continue monitoring.

Vik_Vas · May 22, 2019, 3:38pm

Hello!

I am experiencing some difficulties with the API.
{"_code":"InternalError","_msg":"A server error occured","_v":9,"_mv":7,"_b":"20190519150039"}
Is it related to the hosting issue?

Shida · May 22, 2019, 7:16pm

That is a possibility. If you could please test again once we announce the current issues are gone, that would help isolate the case. Thanks!
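
For anyone retesting later, here is a minimal sketch of how a script could flag this particular error. It only assumes that responses are JSON with a `_code` field, as in the message quoted above; the helper name `is_server_error` is made up for illustration and is not part of the Dynalist API.

```python
import json


def is_server_error(body: str) -> bool:
    """Return True if the response body carries the InternalError code.

    Hypothetical helper: it relies only on the `_code` field seen in the
    error quoted in this thread.
    """
    payload = json.loads(body)
    return payload.get("_code") == "InternalError"


# The error body reported above, copied with plain ASCII quotes.
sample = (
    '{"_code":"InternalError","_msg":"A server error occured",'
    '"_v":9,"_mv":7,"_b":"20190519150039"}'
)
print(is_server_error(sample))
```

If the same check still reports an error after the hosting issues are announced as resolved, that would suggest the problem is unrelated to the outage.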

Erica · May 23, 2019, 3:33am

Status update: seeing more update, working with service provider to figure out the cause.

Shida · May 23, 2019, 5:54am

Status update: Dynalist is still experiencing some slowness from time to time, but it’s currently up.

Our hosting provider has identified the issue and we’re working together to have it solved asap.
Depending on the solution, we may need to take some downtime for a server migration. If that is the case, we will make sure to let everyone know as soon as we have a time frame scheduled.

Thanks again to everyone for your patience, and we’re really sorry for the inconvenience!

Shida · May 23, 2019, 7:19am

For anyone who’s curious about what’s happening: it seems like DigitalOcean has “oversold” the physical server that houses Dynalist’s servers. This means that the amount of CPU processing power we paid for isn’t being delivered, kinda like how airlines overbook their flights, expecting some people to not show up.

We’re working with DigitalOcean to have our server migrated to a less crowded space, hopefully with plenty of processing power to spare.

Until the migration happens, we’re still expecting intermittent slowness and potential downtime.

pottster · May 23, 2019, 9:12am

Thanks for the transparency Shida.

Just a few thoughts. You guys have a fantastic product which is only going in one direction as far as I can see. How scalable is your business with this hosting provider? Yesterday’s incident, combined with the earlier incident of another of their customers mining bitcoin, would raise alarm bells for me. If you’re migrating to a new physical server anyway, is it time to look for another provider?

Shida · May 23, 2019, 9:17am

That is an option for us. The migration done within DigitalOcean would involve no work from us, as our server disk, IP addresses, database configurations, etc. are all kept in the same virtual machine and migrated over (potentially even live). The alternative would have been a manual migration, which would involve re-configuring all servers (as IPs would have changed) and manually transferring database files, which is riskier.

We will be considering a full migration to a different service provider (we’re eyeing AWS), which includes re-configuring the internal networking between the various boxes we use to serve Dynalist. But right now, we are not prepared to do that.

Erica · May 23, 2019, 10:14am

Definitely. The fact that they proposed a solution over 4 hours ago and haven’t replied since is very disappointing (we replied to their proposal immediately).

AWS support might not be much better, but I think it’s more stable. No support needed is the best support in this case.

pottster · May 23, 2019, 12:37pm

So true!

Alan · May 23, 2019, 1:26pm

Yup, I’m down today. I’ll have to do my note taking elsewhere.

Aaron_Disibio · May 23, 2019, 1:48pm

I thought DigitalOcean was all KVM? I didn’t know you could oversell KVM.

Shida · May 23, 2019, 2:19pm

Server is now up and at full capacity. We were told our server was live-migrated to another host.
We will be writing a post-mortem as soon as we catch a little bit of rest.

I believe this happens whenever the total CPU use of all the VMs adds up to more than the amount of CPU performance available.
Dynalist’s server uses on average 10%-15% of the CPU we pay for, but during the incident, we were barely able to get 10%, which eventually led to our web requests being overloaded. We were also told they had to throttle a “noisy neighbor” on the server.
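
As a rough illustration of the arithmetic (a toy model, not DigitalOcean’s actual scheduler): if the VMs on a host together demand more CPU than the host has, and everyone is scaled back proportionally, each VM gets less than it asks for. The host size, VM names, and demand figures below are made-up numbers, and `allocate_cpu` is a hypothetical helper.

```python
def allocate_cpu(demands, host_capacity):
    """Scale demands down proportionally when they exceed host capacity.

    Toy fair-share model; real hypervisor schedulers are more complex.
    """
    total = sum(demands.values())
    if total <= host_capacity:
        # Enough CPU for everyone: each VM gets what it asked for.
        return dict(demands)
    scale = host_capacity / total
    return {name: demand * scale for name, demand in demands.items()}


# Hypothetical host with 8 cores; demands are in cores.
demands = {"dynalist": 1.0, "noisy_neighbor": 12.0, "other": 3.0}
granted = allocate_cpu(demands, host_capacity=8.0)
for name, cores in granted.items():
    print(f"{name}: wanted {demands[name]:.1f}, got {cores:.2f}")
```

In this sketch the total demand is twice the host capacity, so every VM, including the one with modest needs, ends up with half of what it wanted.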

Aaron_Disibio · May 23, 2019, 2:22pm

Well… I’m back up. Maybe 30 minutes of downtime total for me? Not bad at all considering the situation. I appreciate the work you all do.

Shida · May 23, 2019, 2:24pm

I believe the total downtime was somewhere around 10 hours today, affecting mostly Asian and European customers.
Thankfully it seems to be completely resolved and everything is recovering.

Sorry again for the inconvenience this caused!

Craig_Oliver · May 23, 2019, 2:40pm

Great! Thank you for working on the problem and getting it fixed!

I had a Dynalist browser tab open from before the outage, and when I saw the sync problems I was careful not to close it. So I was able to work locally until the server came back up, at which point my browser automatically re-synced. So for me at least there was no lost work. Phew!

Thanks again for your hard work and your dedication to your customers.

Craig

Shida · May 23, 2019, 2:40pm

Thank you for your patience and for placing your trust in us!