by MysteryMan » Sun Mar 17, 2013 11:51 pm
I'm pretty exhausted but the glitch appears to be fixed.
Here's a high level explanation of what happened:
Generally speaking, opening a connection to a SQL server takes a fair bit of resources. Think of it as the metaphorical equivalent of wiring up and then taking down a phone line. You don't want to do that a lot.
This is where the connection pool comes in. You have a server that has a lot of database calls with multiple web clients but you don't want to hammer the server. So what happens is that you wire a small number of connections and every node shares the line. The calls are quick so no line monopolizes the server. Think of it like a building that shares a few phone lines for quick conversations.
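For anyone who wants to see the idea in code, here's a rough sketch of a connection pool. It's not our actual setup: `FakeConnection`, `ConnectionPool` and `handle_request` are made-up names for illustration, and the "connection" is a stand-in that just counts how many lines got wired up instead of talking to a real database.

```python
import queue
import threading
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection (hypothetical, for illustration)."""

    opened = 0  # total number of "phone lines" wired up so far
    _lock = threading.Lock()

    def __init__(self):
        with FakeConnection._lock:
            FakeConnection.opened += 1

    def query(self, sql):
        return f"result of {sql!r}"


class ConnectionPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a shared line is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the line back for the next caller


pool = ConnectionPool(size=2)
results = []
results_lock = threading.Lock()


def handle_request(i):
    # Each web request borrows a line, makes a quick call, and returns it.
    with pool.connection() as conn:
        row = conn.query(f"SELECT {i}")
    with results_lock:
        results.append(row)


threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results), "requests served over", FakeConnection.opened, "connections")
```

A hundred requests go through, but only the two connections created by the pool are ever opened. Requests that find both lines busy just wait their turn in `get()`.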
This worked for a while but then it suddenly stopped. Instead what happened is that every person in the building got a new phone line wired up whenever they wanted to talk to MySQL. This freaked out MySQL and it had a heart attack. Vadym gave it a cyborg body so it could handle the load like the bionic man, but really the solution would be to stop flooding the server.
The thing is, there shouldn't be that many lines between the web and SQL. The nodes should just share a couple of public lines. They weren't.
This happened seemingly on its own when nothing was touched. Since we seem to experience Azure issues a few days before Microsoft admits they exist, over and over again, I just said **** it and moved the web nodes to our own servers.
Our own servers looked at the rate limit, understood the rules and intent, and automatically opened just a couple of phone lines at once, allowing people to share them as needed. Our servers never put up more than two lines at once, instead of the well over a hundred that Azure let through.
On a related note, I'll be speaking at the Microsoft building later on why we left Azure.