Login server is back up. http://redmine.apocoplay.com/issues/27 What happened:
As you know, we just launched.
Our own dedicated servers barely felt the load, and we were on call to upgrade them at a moment's notice. Our dedicated servers host everything but login.
Our login uses Azure and that's what failed.
We use the ridiculously expensive Microsoft Azure for new development (right now, just Login) because it is supposed to scale to handle any load (plus Microsoft is one of our sponsors and we wanted to let them have a success story). Microsoft's pitch is to use Azure for launches so you are assured of staying stable and online. When I woke up, our Azure account was suspended for overuse. If our own server farm had handled that amount of traffic and power, it would have used less than 1% of our idle resources.
Should something happen to one of our own servers (not just going down - even just something looking odd), Vadym has configured them so that we get a text message that wakes us up if something goes wrong while we are sleeping. Azure didn't even send us a notice over e-mail.
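To make the "down or just looking odd" idea concrete, here is a minimal sketch of that kind of check. It is for illustration only and is not Vadym's actual configuration: the `Sample` type, the `SLOW_MS` threshold, and the `notify` callback (standing in for whatever sends the text message) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """One health probe result for a host (hypothetical shape)."""
    host: str
    reachable: bool
    response_ms: Optional[float] = None  # None when no response time was measured


# Assumed threshold: anything slower than this counts as "looking odd".
SLOW_MS = 2000.0


def classify(sample: Sample) -> str:
    """Return "down", "odd", or "ok" for a single probe result."""
    if not sample.reachable:
        return "down"
    if sample.response_ms is None or sample.response_ms > SLOW_MS:
        return "odd"
    return "ok"


def check(sample: Sample, notify: Callable[[str], None]) -> str:
    """Classify a probe and call notify() for anything that is not "ok"."""
    status = classify(sample)
    if status != "ok":
        notify(f"{sample.host}: {status}")
    return status
```

The point of the sketch is that the alert fires on "odd" as well as "down", so a slow but reachable server still wakes someone up.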
I applied a patch that should get us through the week - but unless I get sufficient assurance from Microsoft by Tuesday (assurance that I actually believe) that this will never happen again, we are moving the login server either to our proven dedicated machines or to EC2.
edit: What this means is that, in the worst-case scenario, we may experience one more bump in the next couple of days, but that's it. One of the good things about our current setup is that if a service provider does not meet our expectations, we can move to one that does with minimal effort.
On the upside, this is why we have a one week limited launch - to get out any kinks that we can't find before the main launch.