Outage – Explosion at The Planet

Dear Members,

Introduction
We would like to offer you our sincere apologies for the interruption in service many of you experienced last weekend. During the outage our forum remained online and we attempted to keep all of our members up-to-date via our Service Status announcements.

What happened?
On Saturday 31 May at approximately 11pm GMT there was an explosion in The Planet Data Center in Houston, Texas. Electrical gear shorted, creating an explosion and fire that knocked down three walls. Thankfully there were no human casualties.

On the instructions of the Fire Department, The Planet then turned off all power to the Data Center, resulting in 9,000 servers being knocked offline.

How did the outage at The Planet affect StatCounter?
This affected StatCounter in a number of ways:

  • Some of our database servers went down
  • Our DNS servers were temporarily offline
  • Some of our incoming mail servers went down
  • Our blog was unavailable
  • Some of our web servers went down
  • Some of our partitions were knocked offline

What did this mean for StatCounter members?
Different members were affected in different ways.

Members with projects on the following partitions were most seriously affected:
c1 (PN 0), c7 (PN 6), c8 (PN 7), c14 (PN 13), c16 (PN 15), c17 (PN 16)
These members lost between 24 and 30 hours of stats over Sunday GMT and part of Monday morning.

Members with projects on the following partitions were unable to log into their accounts for a number of hours following the outage but stats continued to be recorded:
c4 (PN 3), c5 (PN 4), c6 (PN 5), c12 (PN 11)

New members and people who had just created new projects with StatCounter in the hours immediately prior to the outage temporarily “lost” their accounts/projects. This is because, since these projects were not on our last back-up, restoring the back-up did not “bring up” their projects. In this case, our advice is to generate a new project and re-install the StatCounter code on your site.

All other members lost about 3-5 hours of stats from approximately 11pm GMT on Saturday night. In addition, members experienced difficulties reaching the StatCounter site and logging into their accounts.
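
As a rough illustration only, the partition groups listed above can be written down as a small lookup. This is a sketch rather than anything taken from StatCounter's systems: the function names are hypothetical, and the rule that the PN is the partition number minus one is simply inferred from the pairs listed in this post.

```python
# Partition labels copied from the lists in this post.
STATS_LOST_24_TO_30_HOURS = {"c1", "c7", "c8", "c14", "c16", "c17"}
LOGIN_UNAVAILABLE_STATS_RECORDED = {"c4", "c5", "c6", "c12"}


def partition_number(label):
    """Return the PN for a label such as 'c7' (assumed rule: PN = N - 1)."""
    return int(label.strip().lower()[1:]) - 1


def impact_for(label):
    """Summarise how projects on a partition were affected (hypothetical helper)."""
    label = label.strip().lower()
    if label in STATS_LOST_24_TO_30_HOURS:
        return "lost between 24 and 30 hours of stats"
    if label in LOGIN_UNAVAILABLE_STATS_RECORDED:
        return "login unavailable for a number of hours; stats still recorded"
    # Every other partition falls into the general case described above.
    return "lost about 3-5 hours of stats"


if __name__ == "__main__":
    for name in ("c7", "c12", "c2"):
        print(name, partition_number(name), impact_for(name))
```

The lookup does not cover accounts or projects created in the hours just before the outage, since those were missing from the last back-up regardless of partition.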

As servers at The Planet come back online we continue to work to try to recover as many stats as possible to minimise the loss of information experienced.

Why doesn’t StatCounter have its own Data Center?
By outsourcing our server technology, we can keep costs down, minimise downtime and devote more resources to our members.

Why was StatCounter using The Planet?
StatCounter is powered by over 130 servers. These are located in a number of Data Centers in the United States and in Ireland and are spread among a number of hosting providers, although our main hosting partner is The Planet.

We chose The Planet as our main hosting partner because they are the largest dedicated hosting service in the world and because of the apparent reliability of the service they provide.

We believed The Planet to be one of the most reliable and redundant data center providers in the world, particularly as they host servers in multiple centers in Houston and Dallas.

From The Planet website:
With multiple state-of-the art data centers located in Dallas and Houston, Texas, The Planet provides On Demand IT Infrastructure backed by complete redundancy in power, HVAC, fire suppression, network connectivity, and security. So if any of our data centers experiences a disruption for any reason, your eggs (or servers) are never in one basket.

The Planet have let us down, and in turn, we have let you down. For this we are truly sorry.

What did The Planet do wrong?
Accidents are a fact of life. However, we believe that, had The Planet operated in the professional manner we expected from an organisation of its standing, the disruption experienced could have been substantially lessened.

For example, The Planet have hosted our DNS for a number of years. However, it was only this weekend that we discovered that, although our DNS servers are on different subnets within The Planet, the servers are actually all in the same location. We will be submitting a complaint to The Planet in this regard. We fully expected that The Planet would have implemented a geographical spread in our DNS servers – this was not something that we thought we would have to request or confirm – particularly since we have servers spread through all The Planet data centers. We have now secured the services of a new geographically spread, redundant DNS provider.
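
The distinction between "different subnets" and "different locations" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of our actual setup: the hostnames, site names and helper functions are hypothetical, and the IP addresses come from reserved documentation ranges rather than real servers.

```python
import ipaddress

# Hypothetical nameserver inventory: (hostname, IP address, physical site).
# All values are made up for illustration.
NAMESERVERS = [
    ("ns1.example.com", "192.0.2.10", "houston-dc1"),
    ("ns2.example.com", "198.51.100.10", "houston-dc1"),
    ("ns3.example.com", "203.0.113.10", "houston-dc1"),
]


def distinct_subnets(servers, prefix=24):
    """Return the set of /prefix networks that the server addresses fall into."""
    return {
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for _, ip, _ in servers
    }


def distinct_sites(servers):
    """Return the set of physical sites hosting the servers."""
    return {site for _, _, site in servers}


def is_geographically_redundant(servers):
    """True only if the servers are spread over at least two sites."""
    return len(distinct_sites(servers)) >= 2


if __name__ == "__main__":
    print("subnets:", len(distinct_subnets(NAMESERVERS)))
    print("sites:", len(distinct_sites(NAMESERVERS)))
    print("redundant:", is_geographically_redundant(NAMESERVERS))
```

In this made-up inventory the three nameservers sit on three separate subnets yet share a single site, so a power cut at that one site takes all of them offline together. Subnet diversity on its own says nothing about physical location, which has to be confirmed separately.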

We also feel that the extent of the damage could have been acknowledged and communicated by The Planet in a more timely fashion. While we decided to implement our back-up plans early on, others waited many hours in the hope that The Planet would come back online, only to find that restoration of service was continually delayed.

In addition, we found that our efforts to communicate with The Planet were largely ignored or dismissed with a “template” response. This was particularly galling as we received a presentation glass globe (see below) and a letter from The Planet CEO on FRIDAY thanking us for being one of their largest customers… then Saturday… THIS!

Thank you For making a world of Difference – The Planet ???

Why couldn’t The Planet get the Data Center back more quickly?
We don’t know. Hundreds of angry customers have been asking this question.

What action did StatCounter take when this accident happened?
We immediately began work to restore full service as soon as possible.

  • Initially, and in the absence of any official information from The Planet, we worked to establish exactly what was causing the problems.
  • We started a thread in our Service Status forum to advise our members of the situation – this thread has been updated continuously.
  • We added a notice to our homepage to advise members that service was limited.
  • We procured the services of a properly redundant and geographically spread DNS service and re-routed all our servers immediately.
  • We prioritised the restoration of all our affected partitions from our latest back-up taken in the hours before the outage in order to resume tracking stats.
  • We configured new servers.
  • We redirected web servers which were temporarily down due to the outage.
  • We responded to as many tickets as possible to try to explain the situation to our members.
  • We migrated our affected mail servers to a new data center.

How will StatCounter prevent this happening in the future?
The bitter irony of this recent episode is that we have been working on our new beta system since September last year. We decided to develop this new StatCounter system for a number of reasons, one of the major motivations being to improve the architecture of our system so as to insulate it against major outages such as the one just experienced. Considering that we have never before experienced an outage of this magnitude, we are bitterly disappointed that our new system was not up and running before this episode.

Once “normal” service is restored, work will continue on the beta project as planned. The sooner we launch the beta, the sooner we can minimise our vulnerability to this kind of outage.

I’m not happy – how do I complain?
We completely understand why you feel aggrieved. Should you wish to submit a complaint to us please do so by logging into your StatCounter account and clicking the “support” link in the top menu bar. Within this area you will be able to submit a ticket to us – we will endeavour to respond to you as soon as possible.

How do StatCounter feel about what happened?
We are so desperately sorry that any of our customers had to experience this outage. We also feel so humbled by the numerous messages of support that we have received. At a time when we feel so terrible about the interruption in service suffered by some of our members, we have been just bowled over to receive so many messages of encouragement. While we always knew that we had a great bunch of members, your support and patience throughout this episode have been nothing short of incredible and served to help maintain team morale in a very difficult situation. We are so grateful.

Conclusion
We hope this blog post has gone some way towards summarising the main issues relating to the recent outage. Work continues to restore full service. If you have any comments or queries, please do post them below.

219 Comments

  1. Negative perspiration here. I’m new on statcounter, and it’s no big deal to lose a couple of days. Looks like my counts are going up… (: I hope that all is well with you, and with the folks at planet as well. I’m sure that this was a terrible experience for them. All the best to you…G:

  2. I’ve been using StatCounter for — wow, nearly exactly 3 years now — and this is my first bit of lost logging *ever* as far as I know. 99.8% uptime for a completely free service, with the .2% being due to an explosion blowing out three walls of a data centre, is a pretty darn good record as far as I’m concerned, and I fully expect to be upgrading to a paid subscription in the very near future.

  3. Thank you for the candid and detailed communication. Events will occur and all companies will have problems with service from time to time. Great companies differentiate themselves by how they handle and deal with their customers. It forms the basis of trust. Yours is a great company and your actions are consistent with that measure.

    I have faith that you will continue to refine the fault resilience of your systems to improve the outcome when the next event occurs. Frankly, given the severity of the mishap at the DC, I’m impressed that you got back online as soon as you did and kept the data loss so minimal.

    Thank you!

  4. I am a VERY appreciative customer … more that you were willing to be so responsive to us than the fact that you got the servers up and running again! Your apology on your blog was so eloquent that I wish more companies would follow your example and remember that the customers are the people keeping them in business. Few companies are willing to be so apologetic when something goes wrong. Often, customers are left holding the bag and companies act like they’re doing a favor offering a product or service.
    It looks to me like you couldn’t help what happened, and you took the steps necessary to rectify the situation. It can happen to anyone, and I for one appreciate your quick resolution!
    Thanks so much for taking time to write to us and for offering such a detailed explanation!

  5. All is cool and thanks for being upfront. One or two days of stats lost is not bad 🙂

    Keep up the good work!

  6. The outage wasn’t a huge problem for me. I appreciate that even with a high level of uptime (say 99%), in a given year that might mean a number of days when there’s no service.

    That said, I can kind of see where Marcus is coming from. But I also understand Statcounter’s desire to explain exactly what happened.

    I do, however, think that Statcounter should in a future post detail if and how they’re going to set things up to avoid problems like this in the future. Maybe you need redundancy across providers. Perhaps even just a backup service using Amazon’s cloud computing, that you can switch on temporarily if things go belly up with your usual provider, and switch your DNS to point to that (you could set up virtual servers with amazon to match all your servers with your main provider). I was kind of surprised to hear you also had your DNS management with The Planet. Most big sites outsource DNS management to a large dedicated DNS management service independent of their server hosting (which I gather is what you’ve done now).

    I can see mentions that the new system that’s in beta now might help here, but no details of how exactly.

    As for The Planet, I guess if anything it might make them better prepared for the future, so that they hopefully will going forward be among the most reliable in a catastrophe. Nothing beats real experience, and they’ve had some now.

  7. I’ll truly say I was going ‘oh my gosh!’, but hey life goes on … there are bigger problems in the world.

    Glad things are back to normal.

  8. Hi Guys,

    I love statcounter and check it every day. You have handled this crisis so professionally, and we all wish you the very best! What is important is that none of you were hurt. Everything else is secondary.

    Thanks for all you do year-round for us.

  9. Spam? This is the first time I’ve ever posted in SC Blogs!

    I know you’re Irish – but censorship isn’t something common to UK/Irish sites.. I realise also that I had run CCleaner and my comment had “vanished” – once I reposted it was showing again. Cookie?

    Marcus

    StatCounter Team Response:
    Hi Marcus,

    Apologies – we are not saying that you posted spam, rather that spam was posted by someone from the IP address to which you are currently linked. Also, there is a delay with comments being published due to the use of caching. This is necessary due to the high volume of traffic our blog receives. Hope this explains.

  10. ‘Tis pretty much all good on my end, especially since I was lucky enough to not be on one of the partitions that went Tango Uniform with the data center. Yes, things were a bit slow yesterday and Sunday, but considering things were still jury-rigged, understandable.

    Thanks for letting me know the lengthier explanation is out here. It is much appreciated.

  11. I see SC is “moderating” by removing comments that disagree with their abuse towards The Planet. Such a biased stance is typical Americanism, as SC seems incapable of dealing with criticism.

    StatCounter Team Response:

    We’re confused Marcus… the only comments that get held in our moderation queue are those that appear to be spam or contain abusive words… If you disagree with us about anything, you are more than welcome to comment!

    Typical Americanism…? We’re Irish!

    Edit: NOW we understand – you were referring to the fact that your previous comment was not published. It has been released now and was held because we previously received numerous spam posts on our blog from the IP address from which you posted.

  12. Thanks for the full explanation. Having had outage time from my web provider in the last year, with no apology whatsoever, your refreshing honesty is much appreciated.
    Have you ever thought of branching out 🙂

  13. Pingback: Abandoned Stuff by Saskboy » Blog Archive » Customer Service when everything blows up - literally

  14. I think StatCounter’s response to this matter is childish and irresponsible, and playing “heroes” to satisfy customers is beyond a joke.

    It is clear there was an explosion, fire, and an evacuation. Where fire and electric are concerned people must come before data.. regardless of its importance – web stats are not vital to daily life – they are just numbers.

    Without wanting to sound too judgemental, the 9/11 incident resulted in the same thing – fire, damage, loss of data. Were the companies “hosted” in those towers expected to give their clients the run around and apologise for that? Forget the loss of life – fire is the most destructive thing known, and I don’t see the need for any complaint, or to have expected The Planet to have rushed its response.

    If The Planet were ordered to cut power by the Fire Brigade then it is a legal duty to respond to them – for the safety of everyone. The Planet would not have known the extent of the damage or loss until the situation was known. As I said – lives come first.

    Statcounter needs to get off its high-horse. The Paypal incident was possibly a reason to be angry, but this is not. There’s no excuse to keep the ball rolling just because Paypal cost Statcounter a few customers – making The Planet a scapegoat is pathetic behaviour. No one cares about a few lost stats if the main concern is whether anyone was hurt. Burns hurt a person more than a machine.

    I’m sure everyone respects Statcounter for dealing with things how it could, but give The Planet a break. They thank you for your custom and yet you stick it up their backside for an accident.. for shame! If your offices burned down would you go to the efforts of custom replies to 100’s of customers, or would a template suffice for the initial concerns until the facts were known and a resolution prepared.

    I expect The Planet are going to have to make insurance claims, repairs, etc. That’s assuming there is no long-lasting investigation into the cause. Do you expect them to return to 100% efficiency overnight?

    I appreciate Statcounter’s service and dedication to its clients, but I think it needs to be more considerate towards real-life concerns and a little less self-centered in such situations in future. Accidents happen, but the “What did the Planet do wrong?” section is petty – end of the day, they’re running a bigger business than Statcounter.

    If Statcounter is so upset and has been financially affected, why not take legal action against The Planet instead of flaming them in blogs?

    Marcus

    StatCounter Team Response

    Hi Marcus,

    We appreciate your views and your taking the time to comment, but we stand by our criticisms of The Planet. They claim to provide a fully redundant service, able to withstand even a total power outage – but this is not the case. For example, The Planet claimed redundant power… how can this be with only one power room? While, obviously, we are very grateful that there was no loss of life or serious injury, this does not excuse the fact that The Planet did not provide the service upon which they market themselves. Perhaps if you re-read our post, our stance may become clearer.

  15. ….that should be, You guys definitely rock in my books! I guess you’re not the only ones who need sleep.

  16. As a long-time paid subscriber I appreciate your hard work to get things back up and running quickly, and to evaluate the situation. That being said, in the four years I’ve been with you I think this is about the only thing I’ve ever seen go wrong, and it was definitely beyond your control. Don’t sweat it. You guys definitely ock in my books! 😉

    Cheers,
    Connie

  17. I just want to say that StatCounter is a model of what customer service should be – to the point that I have been known to keep you in mind when dealing with my own customer service issues. I wish all my vendors had this level of integrity!

    StatCounter Team Response:
    WOW John,

    High praise indeed – we really appreciate it – thank you!

  18. Hi,
    I haven’t been able to open the Statcounter website at all for the past twelve hours or so. I’m getting a Problem loading page error on Firefox whenever I try. Is there another/still some problem going on?

    StatCounter Team Response:
    Hi there,

    You *should* be able to log in – are you using a bookmark to log in? Please try logging in directly from www.statcounter.com or my9.statcounter.com.

  19. Thank you for your fantastic service through the years, you have huge integrity, and thank you for keeping us up to date, you are No 1. All the best.

Comments are closed.