Generator Failures Caused 365 Main Outage
July 24th, 2007 : Rich Miller

Several generators at 365 Main’s San Francisco data center failed to start when the facility lost grid power Tuesday afternoon, causing an outage that knocked many of the web’s most popular destinations offline for several hours. The disruption, which began at 1:45 pm Pacific time, occurred during a grid outage for Pacific Gas & Electric, which left significant portions of San Francisco in the dark. Parts of 365 Main’s data center lost power, causing downtime for customer sites including CraigsList, Technorati, LiveJournal, TypePad, AdBrite, the 1Up gaming network, Second Life and Yelp, among others.
Wild rumors circulated about why 365 Main’s backup systems failed to maintain power to key systems, including reports of employee sabotage or a possible triggering of the facility’s emergency power off (EPO) button, a frequent cause of outages at mission-critical facilities. While less sensational, the actual cause of the outage was the failure of backup diesel generators.
“An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building,” the company said in an incident report. “On-site facility engineers responded and manually started affected generators allowing stable power to be restored at approximately 2:34 pm across the entire facility.”
“As a result of the incident, continuous power was interrupted for up to 45 minutes for certain customers,” the report continued. “We’re certain 3 of the 8 colocation rooms were directly affected, and impact on other colocation rooms is still being investigated.”
The 365 Main data center is supported by 10 Hitec 2.1 megawatt generators, which are tested every month. The 277,000 square foot facility is partitioned into eight data center “pods,” some of which remained online while others went dark.
The facility’s backup systems use flywheel UPS systems - rather than batteries - to provide “ride-through” electricity that keeps servers online until the diesel generators can start up and begin powering the facility. A flywheel is a heavy spinning cylinder that stores kinetic energy; when grid power is interrupted it keeps spinning, briefly driving a generator to carry the load. In most data centers, the UPS (uninterruptible power supply) system instead draws power from a bank of large batteries. AboveNet, the original builder/owner of the 365 Main data center, was an early adopter of flywheel UPS systems, which have recently gained attention as a “greener” alternative to batteries.
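How long that ride-through lasts is a question of stored energy versus load. The figures below are purely illustrative assumptions (the rotor mass, radius, speed and load are not Hitec or 365 Main specifications), but a rough sketch shows why flywheel ride-through is measured in seconds rather than minutes, and why a generator that fails to start on the first attempt matters so much:

```python
# Back-of-envelope sketch with assumed, illustrative numbers -- not 365 Main's specs.
# Estimates how many seconds a flywheel can carry a load before the diesel must pick up.
import math

def flywheel_energy_joules(mass_kg, radius_m, rpm):
    """Kinetic energy of a solid cylinder: E = 1/2 * I * w^2, with I = 1/2 * m * r^2."""
    inertia = 0.5 * mass_kg * radius_m ** 2        # moment of inertia, kg*m^2
    omega = rpm * 2 * math.pi / 60.0               # angular velocity, rad/s
    return 0.5 * inertia * omega ** 2

def ride_through_seconds(energy_joules, load_watts, usable_fraction=0.5):
    """Only part of the stored energy is usable before output sags out of tolerance."""
    return energy_joules * usable_fraction / load_watts

# Hypothetical module: 3,000 kg rotor, 0.5 m radius, 1,800 RPM, feeding a 250 kW load.
energy = flywheel_energy_joules(3000, 0.5, 1800)
print(f"stored energy: {energy / 1e6:.1f} MJ")
print(f"ride-through at 250 kW: {ride_through_seconds(energy, 250_000):.1f} s")
```

With only seconds of stored energy available, a generator that does not start promptly leaves its pod exposed almost immediately until it can be started by hand, as happened here.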
Some customers speculated about a flywheel issue. Troubleshooting the exact reason for the generator failure will take some time, according to 365 Main. “Due to the complexity and specialization of data center electrical systems, we are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause of why certain generators did not start,” the company said in its incident report.

The downtime quickly became a public relations setback for 365 Main, as the blogosphere pounced on a failure that knocked many of its leading hosts and services offline. The outage was highlighted at O’Reilly Radar, Scobleizer and TechCrunch, among others.
Earlier in the day the company issued a press release noting two consecutive years of uptime for a customer at the San Francisco data center, RedEnvelope. The press release was noted on Slashdot and Techdirt and has since been removed from 365 Main’s web site.
Misinformation spread swiftly, propelled by the blogs and forums not affected by the outage. CNet, which hosts its servers at 365 Main, debunked reports from ValleyWag that a drunk employee had gone on a rampage and that a “mob of angry customers” assembled outside the 365 Main building. The “mob” was actually a line of customers who were forced to enter through the front door and have badges checked manually to get into the building because the parking garage gate was affected by the power outage, according to CNet. ValleyWag’s “drunk employee” post quickly became one of the most popular posts on the front page at Digg.
The problems began when parts of PG&E’s San Francisco area network began experiencing voltage fluctuations, which apparently caused a transformer to fail in a manhole under 560 Mission St. Witnesses told the San Francisco Chronicle they heard a blast shortly before 2 p.m. and then saw flames licking up through the manhole grate. PG&E could not confirm that an explosion had occurred, but said that 30,000 to 50,000 customers were affected.
The 365 Main data center was originally built by AboveNet, which spent $125 million to construct and “earthquake proof” the facility. After AboveNet filed for bankruptcy, 365 Main bought the property for $2.6 million in a court-approved deal. 365 Main has since expanded its network to seven data centers, including facilities in Oakland, Phoenix, Chantilly, Va. and two centers in Los Angeles (El Segundo and Vernon/Irvine).
DataGuy35
Posted July 25th, 2007

You said it, Rich. These companies gotta have redundant sites. Check this out: http://www.bizjournals.com/phoenix/stories/2007/07/02/story15.html?from_rss=1
Tom W
Posted July 25th, 2007

Unbelievable. I have experienced this time and again. Hosting facilities charge an arm and a leg for redundant power, pipe, and cooling, and yet constantly drop the ball, always citing “unforeseeable” factors.
JC
Posted July 25th, 2007

Rotary no-break is not a new concept, and it’s simple. I can’t believe that with monthly tests those generators would fail to start like that. I think somebody’s fudging those monthly tests, and I hope it costs them their job.
Michael T. Halligan
Posted July 25th, 2007

I used to be a customer of 365main, but thankfully completed our migration to Seattle in May of this year. It was the right move. Washington’s power infrastructure is far better maintained, and less of a political target, than the nightmare that is California’s dilapidated power grid. There isn’t enough duct tape or baling wire west of the Mississippi to keep California’s lights on, it seems.
This is the second or third serious transformer explosion at the Mission St Substation in the past 5 years.
As for 365main, well, here were our experiences:
In April, 2005 365main had an outage that affected all customers for 50 minutes due to a failed EPO valve. 365 handled that outage spectacularly, calling all of their customers within 15 minutes of the outage.
In February, 2006 365main experienced a partial outage for 3 seconds that only affected some customers, but caused problems in their Telco spine, affecting connectivity.
In October, 2006 365main had a backup generator fail; supposedly no customers were directly affected, but customers were not allowed to enter the building between 3:29 PM and 4:40 PM.
This isn’t that bad of a track record. Of course, 365main is the most expensive datacenter in California, and you’d expect more.
John Nagle
Posted July 30th, 2007

365 Main’s technical analysis of the failure is here: http://www.365main.com/status_update.html
They have ten Hitec continuous power units. Each unit has a generator, motor, flywheel, inductive coupling, mechanical clutch, and Diesel engine on the same shaft. There are no batteries or inverters involved. They need eight running units to operate the facility. When utility power failed, four of the ten units failed to start properly. They’re still trying to figure out exactly why. People from Hitec’s Holland HQ, including a member of Hitec’s Board of Directors, are on site.
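The arithmetic in that description is what made the outage unavoidable: the plant is effectively an N+2 design, so it can ride out two simultaneous unit failures but not four. A minimal capacity check, using the unit count and rating quoted above (the eight-unit threshold comes from the comment; the code simply restates it):

```python
# Capacity check for the plant as described above: ten 2.1 MW units, eight required.
UNIT_MW = 2.1
TOTAL_UNITS = 10
REQUIRED_UNITS = 8   # minimum number of running units needed to carry the facility

def plant_carries_load(failed_units: int) -> bool:
    """True if the units still running meet the minimum required count."""
    return TOTAL_UNITS - failed_units >= REQUIRED_UNITS

for failed in range(5):
    running = TOTAL_UNITS - failed
    status = "OK" if plant_carries_load(failed) else "load shed"
    print(f"{failed} failed -> {running} running ({running * UNIT_MW:.1f} MW): {status}")
# 0-2 failures: OK.  3-4 failures: below the eight-unit threshold, so some pods go dark.
```

With four of ten units refusing to start, only six were running, two short of the eight needed, which is consistent with the earlier report that three of the eight colocation rooms were directly affected.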
365 Main reports that utility power didn’t simply fail; there were “4-6 repetitive surges to the facility in a short period of time”. This apparently was mishandled by the Diesel start control system.
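365 Main and Hitec have not published the control logic, so the following is only a hypothetical sketch of the general failure mode being described: a start controller that aborts or resets its sequence whenever utility power briefly returns can fail to complete a start if the input “flaps” several times in quick succession. The timing values and behavior below are assumptions for illustration, not the actual Hitec design:

```python
# Hypothetical illustration only -- NOT 365 Main's or Hitec's actual control logic.
# Shows how a naive start controller that resets whenever utility power briefly
# returns can fail to start the engine when the input flaps repeatedly.
from dataclasses import dataclass

START_SEQUENCE_SECONDS = 8.0   # assumed time to crank and stabilize the engine

@dataclass
class NaiveStartController:
    cranking_elapsed: float = 0.0
    engine_running: bool = False

    def tick(self, utility_ok: bool, dt: float) -> None:
        if self.engine_running:
            return
        if utility_ok:
            # Utility came back, even momentarily: abort and reset the start sequence.
            self.cranking_elapsed = 0.0
        else:
            self.cranking_elapsed += dt
            if self.cranking_elapsed >= START_SEQUENCE_SECONDS:
                self.engine_running = True

# Simulate repetitive surges: power drops for ~5 s, returns for 1 s, over and over.
controller = NaiveStartController()
for utility_ok in ([False] * 5 + [True]) * 6:   # 36 one-second steps
    controller.tick(utility_ok, dt=1.0)

print("engine running:", controller.engine_running)   # False: the sequence never completes
# A controller that ignores short-lived returns, or latches the start once begun,
# would have ridden through the same sequence and started the engine.
```

Whether something like this happened inside the Hitec controls is exactly the question the root-cause investigation described above is meant to answer.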