2019/05/22 Dynalist outage


#1

Currently still ongoing, taking a quick break to update everyone on what’s going on:

Our database server has been hit with a severe CPU degradation issue over the past ~8+ hours.
So far, it doesn’t look malicious.

The last time we experienced this identical issue was when another DigitalOcean client co-located a bitcoin mining server on the same physical host as our server, which was against their Terms of Service.

We’ve reached out to DigitalOcean but so far it’s been almost 5 hours and no reply from their side. There’s not much we can do at this moment, but if the issue persists for more time with no response, we’ll need to consider an emergency switch to a different hosting provider.

I will be updating this post as more details come up.


#2

Thanks for the update Shida. Good luck. :crossed_fingers:


#3

We’re observing improvement to the situation! Still no response from DigitalOcean but the server is making progress in catching up to requests.

The underlying issue doesn’t look fully addressed yet, so I’d expect intermittent issues, but for now we can take a short breath.


#4

So far it looks like performance has improved a lot; things are just a bit slower than usual. However, since we still haven’t received a reply from our service provider, we’ll continue monitoring.


#5

Hello!

I am experiencing some difficulties with the API.
{"_code":"InternalError","_msg":"A server error occured","_v":9,"_mv":7,"_b":"20190519150039"}
Is it related to the hosting issue?


#6

That is a possibility. If you could please test again once we announce the current issues are gone, that would help isolate the case. Thanks!
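In the meantime, if your integration needs to ride out transient errors like the one above, a simple retry with exponential backoff usually does the trick. A minimal sketch of such a client-side helper (the set of codes treated as retryable, the retry counts, and the delays here are my own assumptions, not official API guidance):

```python
import time

# Assumption: these response codes indicate transient conditions worth retrying.
RETRYABLE_CODES = {"InternalError", "TooManyRequests"}

def is_retryable(response):
    """Decide from a Dynalist-style JSON body whether a retry makes sense."""
    return response.get("_code") in RETRYABLE_CODES

def call_with_backoff(call, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Invoke `call()` (which should return the parsed JSON response dict),
    retrying transient errors with exponential backoff (1s, 2s, 4s, ...).
    Raises RuntimeError once the retries are exhausted."""
    for attempt in range(max_tries):
        response = call()
        if not is_retryable(response):
            return response
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {max_tries} tries: {response}")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting between attempts.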


#7

Status update: seeing further improvement; working with the service provider to figure out the cause.


#8

Status update: Dynalist is still experiencing some slowness from time to time, but it’s currently up.

Our hosting provider has identified the issue and we’re working together to have it solved asap.
Depending on the solution, we may need to take some downtime for a server migration. If that is the case, we will make sure to let everyone know as soon as we have a time frame scheduled.

Thanks again to everyone for your patience, and we’re really sorry for the inconvenience!


#9

For anyone who’s curious what’s happening: it seems like DigitalOcean has “oversold” the physical server that houses Dynalist’s servers. This means that the amount of CPU processing power we paid for isn’t being delivered, kinda like how airlines overbook their flights, expecting some people to not show up.

We’re working with DigitalOcean to have our server migrated to a less crowded space, hopefully with plenty of processing power to spare.

Until the migration happens, we’re still expecting intermittent slowness and potential downtime.


#10

Thanks for the transparency Shida.

Just a few thoughts. You guys have a fantastic product which is only going in one direction as far as I can see. How scalable is your business with this hosting provider? Yesterday’s incident, combined with the earlier incident of another of their customers bitcoin-mining on your host, would raise alarm bells for me. If you’re migrating to a new physical server anyway, is it time to look for another provider?


#11

That is an option for us. The migration done within DigitalOcean involves no work from us, as our server disk, IP addresses, database configurations, etc. are all kept in the same virtual machine and migrated over (potentially even live). The alternative would have been a manual migration, which would involve re-configuring all servers (as IPs would have changed) and manually transferring database files, which is riskier.

We will be considering a full migration to a different service provider (we’re eyeing AWS) which includes re-configuring the internal networking between various boxes we use to serve Dynalist. But right now, we are not prepared to do that just yet.


#12

Definitely. The fact that they proposed a solution over 4 hours ago and didn’t reply ever since is very disappointing (we replied to their proposal immediately).

AWS support might not be much better, but I think the platform is more stable. No support needed is the best support in this case.


#13

So true!


#14

Yup, I’m down today. I’ll have to do my note-taking elsewhere.


#15

I thought DigitalOcean was all KVM? I didn’t know you could oversell KVM.


#16

Server is now up and at full capacity. We were told our server was live-migrated to another host.
We will be writing a post-mortem as soon as we catch a little bit of rest.

I believe this happens whenever the total CPU use of all the VMs on a host adds up to more than the amount of CPU performance available.
Dynalist’s server uses on average 10%-15% of the CPU we pay for, but during the incident, we were barely able to get 10%, which eventually led to our web requests being overloaded. We were also told they had to throttle a “noisy neighbor” on the server.
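For anyone who wants to check for this kind of contention on their own Linux VM: CPU time the hypervisor gives to other guests shows up as the “steal” field in `/proc/stat`. A rough sketch of measuring it (the field layout follows the Linux `proc(5)` documentation; the sampling interval is an arbitrary choice):

```python
import time

# Fields of the aggregate "cpu" line in /proc/stat, in order (Linux).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(line):
    """Parse the first 'cpu' line of /proc/stat into a field -> ticks dict."""
    parts = line.split()
    assert parts[0] == "cpu"
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))

def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    d_total = sum(after.values()) - sum(before.values())
    d_steal = after["steal"] - before["steal"]
    return 100.0 * d_steal / d_total if d_total else 0.0

def measure_steal(interval=5.0):
    """Sample /proc/stat twice, `interval` seconds apart, and report steal %."""
    with open("/proc/stat") as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_line(f.readline())
    return steal_percent(before, after)
```

A sustained high steal percentage (tools like `top` show it as `%st`) is the usual symptom of an oversold or noisy-neighbor host: the guest is runnable but the hypervisor isn’t scheduling it.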


#17

Well… I’m back up. Maybe 30 minutes of downtime total for me? Not bad at all considering the situation. I appreciate the work you all do.


#18

I believe the total downtime was somewhere around 10 hours today, mostly affecting Asian and European customers.
Thankfully it seems to be completely resolved and everything is recovering.

Sorry again for the inconveniences this caused!


#19

Great! Thank you for working on the problem and getting it fixed!

I had a Dynalist browser tab open from before the outage, and when I saw the sync problems I was careful not to close it. So I was able to work locally until the server came back up, at which point my browser automatically re-synced. So for me at least there was no lost work. Phew! :sweat_smile:

Thanks again for your hard work and your dedication to your customers.

Craig


#20

Thank you for your patience and for placing your trust in us!