Well that sucked. (Outage)


Alex
August 11th, 2011, 12:37 AM
As you may have noticed, the site was down almost all day today. We are hosted at a Tier 1 co-lo datacenter near Dallas. They have redundant power, generators with days of fuel, UPSes, and all of the other things that you'd expect a modern datacenter to have. But that doesn't mean unexpected things don't happen. One of the key pieces of equipment between the power grid, those backup generators, and the UPSes is evidently something called the Automated Transfer Switch (ATS). And while there are multiple ones on site, it turns out that they aren't truly redundant. When one failed catastrophically today, it took down a very large portion of the data center, and getting it back up involved some significant engineering and re-wiring.

Once the datacenter itself got its power back up and running to the cages, then the servers themselves needed to be confirmed good. For an unknown reason, the one that we were hosting ninjette.org on was behaving strangely when it came back up. It was "up", but not visible from the internet. After troubleshooting that for a few hours this evening, I was able to migrate all of our content (the entire VPS) onto a known good server, and voila, we're up.

Sorry for the trouble, and glad things like this happen as rarely as they do. If anything does look like it's behaving strangely here, please let me know, but I don't expect any issues. Time for bed...

Alex
August 11th, 2011, 12:40 AM
For a more detailed analysis, check out this thread (http://www.webhostingtalk.com/showthread.php?t=1072692) on the webhostingtalk forum.

I was glad that we had the facebook ninjette.org group (https://www.facebook.com/groups/ninjette/) to get some communications out during the day as well. If you're on facebook, consider joining that group for days just like these. Though hopefully they are few and far between... :thumbup:

Skippii
August 11th, 2011, 12:42 AM
Phew. I had JUST joined this site, so I figured you were taking extreme measures to keep me from posting...

Alex
August 11th, 2011, 12:43 AM
You can never be too careful. :)

akima
August 11th, 2011, 01:43 AM
The email company I use was down for the same period of time as Ninjette.org. They use colo4.com's Dallas co-location. I suspected you were using them too. I bet the people in the co-location centre were running around like maniacs trying to get everything fixed :p

Snake
August 11th, 2011, 01:50 AM
I was starting to get the shakes. I felt like an addict needing to get a fix.

ally99
August 11th, 2011, 02:58 AM
Yeah, Ninjette withdrawals SUCK! :hitputer:

Racer x
August 11th, 2011, 03:05 AM
Phew. I had JUST joined this site, so I figured you were taking extreme measures to keep me from posting...

Hi Skippy welcome

the big mike
August 11th, 2011, 03:07 AM
there is a facebook group? I'm getting there right now :p

TnNinjaGirl
August 11th, 2011, 04:28 AM
There needs to be an NA (Ninjetters Anonymous) group if you ask me.

dubojr1
August 11th, 2011, 04:36 AM
I was starting to get the shakes. I felt like an addict needing to get a fix.

Holy cow.... I was thinking the same thing. I was checking to see if the site was back up like every 5 mins at work and even continued checking about every half hour at home. I generally spend my time here while at work, but my wife even noticed my nervousness and questioned what was happening when I was grabbing for my iPhone so often.

I'm glad we are back up! :D

Azhyen
August 11th, 2011, 06:23 AM
I got SOOOOO much work done yesterday...not that I'm correlating or anything, :D

CC Cowboy
August 11th, 2011, 07:25 AM
I thought Kelly had hacked in and shut down the site.

caps
August 11th, 2011, 07:37 AM
I was starting to get the shakes. I felt like an addict needing to get a fix.

this is EXACTLY how I felt

Linuss
August 11th, 2011, 08:33 AM
running to the cages


Cagers... always ruining it for motorcyclists.

Palero
August 11th, 2011, 09:17 AM
Cagers... always ruining it for motorcyclists.

Heh, I work at a data center like the one Ninjette.org lives in. Not all data center co-location workers strictly commute in cages :D

nickjpass
August 11th, 2011, 09:39 AM
So how did you guys deal with the outage?

dubojr1
August 11th, 2011, 09:42 AM
So how did you guys deal with the outage?

My keyboard is recovering now... :madtyping:

nickjpass
August 11th, 2011, 09:45 AM
My keyboard is recovering now... :madtyping:

I just found out the problem and didn't worry about it :D

Snake
August 11th, 2011, 10:52 AM
There needs to be an NA (Ninjetters Anonymous) group if you ask me.

Hello, my name is Snake...umm mmm I mean Rick, and I am a ninjette.org addict. BTW is there a computer nearby? I need to check on my...umm...stocks, yeah my stocks, really.

TnNinjaGirl
August 11th, 2011, 11:23 AM
Lol :-)

Alex
August 12th, 2011, 12:17 PM
A more complete post-mortem from Rimu's perspective:

link: http://rimuhosting.com/maintenance.jsp?server_maint_oid=181086391

Summary

Colo4 lost power at their Dallas facility on 10 August CST. See http://accounts.colo4.com/status/ for the technical details.
RimuHosting's website was affected by the outage and was also unavailable.
We started pushing out information on the outage via Twitter and Facebook. We had a good number of customers join in on our live chat page. We brought up a temporary server in Auckland and used that to post a temporary page at http://rimuhosting.com with information on the outage.
Colo4 brought some temporary electrical switching equipment online.
Power was restored.
Some of our core infrastructure servers did not come back up. We needed to replace some power supplies to get them back up.
We then needed to restart some network switches.
Then there were some physical servers that had not powered up, so we needed to go through server by server to get those back up and running.
And we also had a large number of niggly little issues, e.g. some host servers that were misconfigured to not automatically restart their VPSs, one VLAN configuration issue, and failed power supplies, blown hard drives or broken GRUB configs on physical servers.
We received a large number of support requests, e.g. Tomcat or Liferay not set to run on startup, SSL certificates installed with a passphrase on the private key (requiring manual intervention for the server to complete boot up), and MySQL databases that required repair table commands to be issued.
We have worked through most of the issues now. And if you are experiencing any problems that you'd like our help with please just pop in a support ticket and we can try to help.
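As a rough illustration of the passphrase item in the summary above: a private key that needs a passphrase will stall an unattended boot, and that can usually be spotted ahead of time from the key file's PEM header. This is only a minimal Python sketch, assuming the key is PEM-encoded. The function name `pem_key_is_encrypted` is made up for this example and is not something from Rimu's notice.

```python
def pem_key_is_encrypted(pem_text: str) -> bool:
    """Return True if the PEM text looks like a passphrase-protected private key.

    Hypothetical helper for illustration; it only inspects header text and
    does not try to parse or decrypt the key.
    """
    # PKCS#8 encrypted keys use a dedicated BEGIN line.
    if "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text:
        return True
    # Traditional OpenSSL-format keys flag encryption in a Proc-Type header.
    return "Proc-Type: 4,ENCRYPTED" in pem_text


# Made-up sample headers (key bodies elided, not real keys).
plain_key = "-----BEGIN PRIVATE KEY-----\nMIIB...\n-----END PRIVATE KEY-----\n"
pkcs8_encrypted = (
    "-----BEGIN ENCRYPTED PRIVATE KEY-----\nMIIB...\n"
    "-----END ENCRYPTED PRIVATE KEY-----\n"
)
legacy_encrypted = (
    "-----BEGIN RSA PRIVATE KEY-----\n"
    "Proc-Type: 4,ENCRYPTED\n"
    "DEK-Info: AES-256-CBC,0123456789ABCDEF\n\nMIIB...\n"
    "-----END RSA PRIVATE KEY-----\n"
)

plain_result = pem_key_is_encrypted(plain_key)
pkcs8_result = pem_key_is_encrypted(pkcs8_encrypted)
legacy_result = pem_key_is_encrypted(legacy_encrypted)
```

Running a check like this over the key files a web server loads at startup would flag the ones that will wait for someone to type a passphrase after a power cycle.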

Issues

Our core routers are on A+B power and should not have been impacted by the outage. They were. We will be investigating further.
We have some cabinets with A+B power feeds. Some servers in these cabinets did not lose power. But because of the core router issue they did not have network connectivity. About half our VPS host servers did not lose power at all.
Core RimuHosting servers were affected by the outage, impacting our website, email, support tickets, and our equipment/location databases.
One of the things that was unavailable was our DNS management UI/API. This prevented customers from changing their DNS records from unresponsive servers to running servers.
After the power was restored some servers remained down and needed hands-on work done. e.g. restarts, re-grubbing, changing kernels, changing firewall settings via KVMoIPs, replacing power supplies. These required data center staff to act. And sometimes we were not getting the typically excellent and 'instant' responses we normally expect from the Colo4 staff.
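To make the A+B point in the issues above concrete, here is a toy model in Python. It is a sketch of what A+B power is expected to do, not an explanation of what actually happened to the core routers, which the notice says is still being investigated. The names `has_power` and `is_reachable` and the feed labels are assumptions made up for this example.

```python
def has_power(psu_feeds, live_feeds):
    """True if at least one of the device's power supplies sits on a live feed."""
    return any(feed in live_feeds for feed in psu_feeds)


def is_reachable(server_feeds, router_up, live_feeds):
    """A server is only useful if it has power AND the router in front of it works."""
    return router_up and has_power(server_feeds, live_feeds)


# Hypothetical scenario: feed A has failed, feed B is still live.
live = {"B"}

dual_fed_has_power = has_power({"A", "B"}, live)    # dual power supply, A+B cabinet
single_fed_has_power = has_power({"A"}, live)       # single power supply on feed A

# Powered server behind a router that went down anyway.
powered_but_cut_off = is_reachable({"A", "B"}, router_up=False, live_feeds=live)
powered_and_online = is_reachable({"A", "B"}, router_up=True, live_feeds=live)
```

The third case is the one described above: the server never lost power, but with the router down it was just as unavailable from the outside as one that did.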

Improvements

We need better fault tolerance and failover on our RimuHosting core infrastructure servers, so that we are better able to communicate with customers and so that they can access key services like DNS.
We have had a number of requests from customers about how to reduce the impact of a fault on any one server or at one of the data centers they use. Our howto on this at http://rimuhosting.com/knowledgebase/rimuhosting/load-balancing-and-failover is probably worth a refresh. I invite your suggestions and advice.
Some customers have requested more information about servers with redundant power. We can do that for dedicated server customers. Your server will need to be located in a cabinet with A+B power feeds. The server model will need to have a redundant dual power supply (currently only available on high end servers). Just let us know and we can work through the details with you.
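The failover idea in the improvements above can be sketched in a few lines of Python: keep an ordered list of servers and send traffic to the first one that answers a health check. This is a minimal sketch under stated assumptions, not Rimu's howto or anyone's production setup. The function names `tcp_check` and `pick_healthy` and the example.com hostnames are hypothetical.

```python
import socket


def tcp_check(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def pick_healthy(candidates, is_up=tcp_check):
    """Return the first candidate that passes the health check, or None."""
    for host in candidates:
        if is_up(host):
            return host
    return None


# Usage with a stand-in health check, so the example runs without a network.
# The hostnames below are placeholders.
down_hosts = {"dallas.example.com"}


def fake_check(host):
    return host not in down_hosts


servers = ["dallas.example.com", "auckland.example.com"]
chosen = pick_healthy(servers, is_up=fake_check)
```

The catch the notice points out still applies: a check like this only helps if the thing that acts on the result (for example, the DNS management API) is itself hosted somewhere that survives the outage.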

SLA credits

We will be offering credits to all customers with Dallas-based servers. We will be offering a standard credit percentage or amount (to be determined). What do you think is appropriate in your case?
We know some customers were more impacted by the outage than others so we want to have the flexibility to ensure those customers get an appropriate credit.
To do that we will be letting customers claim the standard credit or apply for a different amount that best reflects the impact of the outage. We will be setting up a web page that will help us manage this. We want it up before the end of next week. To claim your credit then, please go to https://rimuhosting.com/cp/hostingcredit.jsp

Any questions?

Thank you to everyone who contacted us on http://twitter.com/rimuhosting and http://facebook.com/rimuhostingcom for your supportive comments or constructive advice. We appreciate your feedback. It really helped our sysadmin crew during a long and stressful day.
If you feel this notice is missing any information or you need further details, email us and we can update this notice or provide you any further details you need. You are welcome to email us or join a public discussion on our blog at http://blog.rimuhosting.com/2011/08/12/10-august-report/

I'm generally happy with their response. They do provide great service, great uptime right up until this incident (for years), and they provide this service at a good price. No intention now or in any foreseeable future to move us anywhere else. :thumbup:

dubojr1
August 12th, 2011, 01:49 PM
That just scrambled my brains...

Alex
August 12th, 2011, 01:52 PM
Have you tried rebooting? :)

dubojr1
August 12th, 2011, 01:58 PM
Have you tried rebooting? :)

Um... I think I had a B power failure. :rolleyes: