Currently still ongoing, taking a quick break to update everyone on what’s going on:
Our database server has been hit with a severe CPU degradation issue over the past ~8+ hours.
So far, it doesn’t look malicious.
The last time we experienced this identical issue was when another DigitalOcean client co-located a bitcoin mining server on the same host as our server, which was against their Terms of Service.
We’ve reached out to DigitalOcean but so far it’s been almost 5 hours and no reply from their side. There’s not much we can do at this moment, but if the issue persists for more time with no response, we’ll need to consider an emergency switch to a different hosting provider.
I will be updating this post as more details come up.
So far it looks like performance has improved a lot and is now just a bit slower than usual. However, because we still haven’t received a reply from our service provider, we’ll continue monitoring.
I am experiencing some difficulties with the API.
{"_code":"InternalError","_msg":"A server error occured","_v":9,"_mv":7,"_b":"20190519150039"}
Is it related to the hosting issue?
Status update: Dynalist is still experiencing some slowness from time to time, but it’s currently up.
Our hosting provider has identified the issue and we’re working together to have it solved asap.
Depending on the solution, we may need to take some downtime for a server migration. If that is the case, we will make sure to let everyone know as soon as we have a time frame scheduled.
Thanks again to everyone for your patience, and we’re really sorry for the inconvenience!
For anyone who’s curious about what’s happening: it seems like DigitalOcean has “oversold” the physical server that houses Dynalist’s servers. This means that the amount of CPU processing power we paid for isn’t being delivered, kinda like how airlines overbook their flights, expecting some people to not show up.
We’re working with DigitalOcean to have our server migrated to a less crowded space, hopefully with plenty of processing power to spare.
Until the migration happens, we’re still expecting intermittent slowness and potential downtime.
Just a few thoughts. You guys have a fantastic product which is only going in one direction as far as I can see. How scalable is your business with this hosting provider? Yesterday’s incident, combined with the earlier incident of another of their customers mining bitcoin, would raise alarm bells for me. If you’re migrating to a new physical server anyway, is it time to look for another provider?
That is an option for us. The migration done within DigitalOcean would involve no work from us, as our server disk, IP addresses, database configurations, etc. are all kept in the same virtual machine and migrated over (potentially even live). The alternative would have been a manual migration, which would involve re-configuring all servers (as IPs would have changed) and manually transferring database files, which is riskier.
We will be considering a full migration to a different service provider (we’re eyeing AWS) which includes re-configuring the internal networking between various boxes we use to serve Dynalist. But right now, we are not prepared to do that just yet.
Definitely. The fact that they proposed a solution over 4 hours ago and haven’t replied since is very disappointing (we replied to their proposal immediately).
AWS support might not be much better but I think it’s stabler. No support needed is the best support in this case.
Server is now up and at full capacity. We were told our server was live-migrated to another host.
We will be writing a post-mortem as soon as we catch a little bit of rest.
I believe this happens whenever the total CPU use of all the VMs adds up to more than the amount of CPU performance available.
Dynalist’s server uses on average 10%-15% of the CPU we pay for, but during the incident, we were barely able to get 10%, which eventually led to our web requests being overloaded. We were also told they had to throttle a “noisy neighbor” on the server.
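To make the arithmetic above concrete, here is a minimal sketch of how an oversold host can squeeze a small VM. It assumes the host shares CPU proportionally to demand once it is overcommitted, which is a simplification and not necessarily how DigitalOcean's hypervisor schedules VMs. The function name and all the numbers (an 8-core host, a 4-core plan, two neighbors wanting 6 cores each) are hypothetical and chosen only for illustration.

```python
def delivered_cpu(demands, host_capacity):
    """Return the CPU (in cores) each VM actually receives.

    Assumption: when total demand exceeds host capacity, every VM is
    scaled down by the same factor (proportional sharing).
    """
    total = sum(demands)
    if total <= host_capacity:
        # Host is not overcommitted: everyone gets what they ask for.
        return list(demands)
    scale = host_capacity / total
    return [d * scale for d in demands]


# Hypothetical numbers, not measurements from the incident.
paid_cores = 4.0                  # cores on our plan (assumed)
our_demand = 0.15 * paid_cores    # we normally use about 15% of that
neighbors = [6.0, 6.0]            # two busy neighbors (assumed)

got = delivered_cpu([our_demand] + neighbors, host_capacity=8.0)
print(f"we asked for {our_demand / paid_cores:.1%} of our plan")
print(f"we received {got[0] / paid_cores:.1%} of our plan")
```

With these made-up inputs the small VM ends up below 10% of its plan even though it only wanted 15%, which matches the pattern described above: the shortfall comes from the neighbors' demand, not from our own load.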
I believe the total downtime was somewhere around 10 hours today, affecting mostly Asian and European customers.
Thankfully it seems to be completely resolved and everything is recovering.
Great! Thank you for working on the problem and getting it fixed!
I had a Dynalist browser tab open from before the outage, and when I saw the sync problems I was careful not to close it. So I was able to work locally until the server came back up, at which point my browser automatically re-synced. So for me at least there was no lost work. Phew!
Thanks again for your hard work and your dedication to your customers.