Dear Members,
Introduction
We would like to offer you our sincere apologies for the interruption in service many of you experienced last weekend. During the outage our forum remained online and we attempted to keep all of our members up-to-date via our Service Status announcements.
What happened?
On Saturday 31 May at approximately 11pm GMT there was an explosion in The Planet Data Center in Houston, Texas. Electrical gear shorted, creating an explosion and fire that knocked down three walls. Thankfully there were no human casualties.
On the instructions of the Fire Department, The Planet then turned off all power to the Data Center resulting in 9,000 servers being knocked offline.
How did the outage at The Planet affect StatCounter?
This affected StatCounter in a number of ways:
- Some of our database servers went down
- Our dns servers were temporarily offline
- Some of our incoming mail servers went down
- Our blog was unavailable
- Some of our web servers went down
- Some of our partitions were knocked offline
What did this mean for StatCounter members?
Different members were affected in different ways.
Members with projects on the following partitions were most seriously affected:
c1 (PN 0), c7 (PN 6), c8 (PN 7), c14 (PN 13), c16 (PN 15), c17 (PN 16)
These members lost between 24 and 30 hours of stats over Sunday GMT and part of Monday morning.
Members with projects on the following partitions were unable to log into their accounts for a number of hours following the outage but stats continued to be recorded:
c4 (PN 3), c5 (PN 4), c6 (PN 5), c12 (PN 11)
New members and people who had just created new projects with StatCounter in the hours immediately prior to the outage temporarily “lost” their accounts/projects. This is because, since these projects were not on our last back-up, restoring the back-up did not “bring up” their projects. In this case, our advice is to generate a new project and re-install the StatCounter code on your site.
All other members lost about 3-5 hours of stats from approximately 11pm GMT on Saturday night. In addition members experienced difficulties reaching the StatCounter site and logging into their accounts.
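For members who want to see which group their project falls into, the partition lists above can be written as a small lookup. This is only an illustrative sketch in Python, not StatCounter's own code: the names `partition_number` and `impact` are made up for this example, and the rule that label cN corresponds to PN N-1 is inferred from the pairs listed above.

```python
# Partition labels copied from the lists above.
STATS_LOST = {"c1", "c7", "c8", "c14", "c16", "c17"}
LOGIN_ONLY = {"c4", "c5", "c6", "c12"}


def partition_number(label: str) -> int:
    """Convert a label such as 'c7' to its PN (assumed rule: cN -> PN N-1)."""
    digits = label[1:]
    if not label.startswith("c") or not digits.isdigit() or int(digits) < 1:
        raise ValueError(f"unexpected partition label: {label!r}")
    return int(digits) - 1


def impact(label: str) -> str:
    """Summarise the impact described in this post for a given partition."""
    partition_number(label)  # validates the label format
    if label in STATS_LOST:
        return "lost between 24 and 30 hours of stats"
    if label in LOGIN_ONLY:
        return "login unavailable for a number of hours; stats still recorded"
    return "lost about 3-5 hours of stats"


if __name__ == "__main__":
    for label in ("c1", "c5", "c9"):
        print(label, "PN", partition_number(label), "->", impact(label))
```

The sketch does not cover projects created after the last back-up, since those depend on creation time rather than partition.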
As servers at The Planet come back online we continue to work to try to recover as many stats as possible to minimise the loss of information experienced.
Why doesn’t StatCounter have its own Data Center?
By outsourcing our server technology, we can keep costs down, minimize downtime and devote more resources to our members.
Why was StatCounter using The Planet?
StatCounter is powered by over 130 servers. These are located in a number of Data Centers in the United States and in Ireland and are spread among a number of hosting providers although our main hosting partner is The Planet.
We chose The Planet as our main hosting partner because they are the largest dedicated hosting service in the world and because of the apparent reliability of the service they provide.
We believed The Planet to be one of the most reliable and redundant data center providers in the world, particularly as they host servers in multiple centers in Houston and Dallas.
From The Planet website:
With multiple state-of-the art data centers located in Dallas and Houston, Texas, The Planet provides On Demand IT Infrastructure backed by complete redundancy in power, HVAC, fire suppression, network connectivity, and security. So if any of our data centers experiences a disruption for any reason, your eggs (or servers) are never in one basket.
The Planet have let us down, and in turn, we have let you down. For this we are truly sorry.
What did The Planet do wrong?
Accidents are a fact of life. However, we believe that, had The Planet operated in the professional manner we expected from an organisation of its standing, the disruption experienced could have been substantially lessened.
For example, The Planet have hosted our DNS for a number of years. However, it was only this weekend that we discovered that, although our DNS servers are on different subnets within The Planet, the servers are actually all in the same location. We will be submitting a complaint to The Planet in this regard. We fully expected that The Planet would have implemented a geographical spread in our DNS servers – this was not something that we thought we would have to request or confirm – particularly since we have servers spread through all The Planet data centers. We have now secured the services of a new geographically spread, redundant DNS provider.
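To illustrate why separate subnets alone do not give redundancy, here is a minimal Python sketch. Everything in it is hypothetical: the host names, the documentation-range IP addresses, the /24 subnet size and the `site` field are assumptions made for the example. Physical location cannot be read from an IP address, so in practice the site of each server has to be confirmed with the hosting provider.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class NameServer:
    host: str  # hypothetical host name
    ip: str    # address from a documentation range, not a real server
    site: str  # physical location, as confirmed with the provider


def distinct_subnets(servers, prefix=24):
    """Return the set of subnets the servers fall into (assumed /24)."""
    return {ip_network(f"{s.ip}/{prefix}", strict=False) for s in servers}


def distinct_sites(servers):
    """Return the set of physical locations hosting the servers."""
    return {s.site for s in servers}


def is_geographically_redundant(servers):
    """True only if the servers are spread over at least two locations."""
    return len(distinct_sites(servers)) >= 2


# Two name servers on different subnets but in the same building.
servers = [
    NameServer("ns1.example.com", "192.0.2.10", "Houston"),
    NameServer("ns2.example.com", "198.51.100.10", "Houston"),
]

if __name__ == "__main__":
    print("subnets:", len(distinct_subnets(servers)))
    print("sites:", len(distinct_sites(servers)))
    print("geographically redundant:", is_geographically_redundant(servers))
```

In this example the two servers sit on different subnets, yet the check fails because both share one site, which is the situation described above.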
We also feel that the extent of the damage could have been acknowledged and communicated by The Planet in a more timely fashion. While we decided to implement our back-up plans early on, others waited many hours in the hope that The Planet would come back online, only to find that restoration of service was continually delayed.
In addition, we found that our efforts to communicate with The Planet were largely ignored or dismissed with a “template” response. This was particularly galling as we received a presentation glass globe (see below) and a letter from The Planet CEO on FRIDAY thanking us for being one of their largest customers… then Saturday… THIS!
Why couldn’t The Planet get the Data Center back more quickly?
We don’t know. Hundreds of angry customers have been asking this question.
What action did StatCounter take when this accident happened?
We immediately began work to restore full service as soon as possible.
- Initially, and in the absence of any official information from The Planet, we worked to establish exactly what was causing the problems.
- We started a thread in our Service Status forum to advise our members of the situation – this thread has been updated continuously.
- We added a notice to our homepage to advise members that service was limited.
- We procured the services of a properly redundant and geographically spread dns service and re-routed all our servers immediately.
- We prioritised the restoration of all our affected partitions from our latest back-up taken in the hours before the outage in order to resume tracking stats.
- We configured new servers.
- We redirected web servers which were temporarily down due to the outage.
- We responded to as many tickets as possible to try to explain the situation to our members.
- We migrated our affected mail servers to a new data center.
How will StatCounter prevent this happening in the future?
The bitter irony of this recent episode is that we have been working on our new beta system since September last year. We decided to develop this new StatCounter system for a number of reasons, one of the major motivations being to improve the architecture of our system so as to insulate it against major outages such as the one just experienced. Considering that we have never before experienced an outage of this magnitude, we are bitterly disappointed that our new system was not up and running before this episode.
Once “normal” service is restored, work will continue on the beta project as planned. The sooner we launch the beta, the sooner we can minimise our vulnerability to this kind of outage.
I’m not happy – how do I complain?
We completely understand why you feel aggrieved. Should you wish to submit a complaint to us please do so by logging into your StatCounter account and clicking the “support” link in the top menu bar. Within this area you will be able to submit a ticket to us – we will endeavour to respond to you as soon as possible.
How do StatCounter feel about what happened?
We are so desperately sorry that any of our customers had to experience this outage. We also feel so humbled by the numerous messages of support that we have received. At a time when we feel so terrible about the interruption in service suffered by some of our members, we have been just bowled over to receive so many messages of encouragement. While we always knew that we had a great bunch of members, your support and patience throughout this episode have been nothing short of incredible and have helped to maintain team morale in a very difficult situation. We are so grateful.
Conclusion
We hope this blog post has gone some way towards summarising the main issues relating to the recent outage. Work continues to restore full service. If you have any comments or queries, please do post them below.
How about not being so cheap and keeping online, offsite backups?
We do it and weren’t affected by the outage even though our servers reside in H1.
Not shame on The Planet. SHAME ON YOU!
There’s a lot of interesting comments on here. You can tell who makes money (Or lost some) with their websites and who doesn’t. I’m stuck between sides. On one hand, I was ticked I lost my server since it caused me a fair bit of problems.
On the other hand, what can you do at this point? I had to explain to my customers the same thing, that there was an unfortunate incident where my server is, and I could do nothing. Most of my customers understood, while a few responded by being upset and didn’t accept any excuses.
So which one should I be? The one who accepts it for what it is, or the one who doesn’t accept any excuses? I’ve decided to just leave it at that and be glad this doesn’t happen that often.
I have a free account, so losing my stats for one weekend after YEARS of perfect service is not going to get me worked up.
You guys have done a great job of explaining and fixing things. Thanks for all the effort and great service.
Wow, it seems like you have something against The Planet. “The Planet have let us down, and in turn, we have let you down. For this we are truly sorry.” I don’t think any of your free customers will think that is the case.
I am about to leave StatCounter as I need all my counters to be in house from now on. This isn’t because of your service or anything else; I have simply outgrown your free plan. My time at StatCounter has been amazing. Your support is brilliant and the product is ideal for every website owner.
But I wouldn’t be too pissed at The Planet. I mean, it’s not like they can prepare very well for something like this. At least none of their servers were lost.
Jen in particular needs to be commended for constantly updating the situation via the forum. You guys have my vote, you are always on top of things.
I have been using StatCounter’s services for over five years and have never had reason to switch. Other stat and counter providers–free and paid–have come and gone but SC has gotten better and better over the years. I love the service and I love how there are real people behind it who care enough to inform members of matters that concern us all. Next to the great service, it’s their thoughtfulness that sets SC apart.
Go, Aodhan and the whole StatCounter Team! You’ll always have my vote! 🙂
It would be hard for me to put a price tag on the value of the information SC has provided me with for several years.
While s*** does happen in life, planning ahead can avoid much of it. Thus, I’m wondering if SC has looked at hosting with 1and1.com whom I’ve used & trusted with all my websites since 2005. With banks and finance companies relying on my hosting services, I believe they offer the safest locations, back-ups, etc. Based on weather patterns in recent & projected years, I would not be comfortable with servers in Houston or Dallas.
These things happen. Don’t be so hard on The Planet. Don’t know the company, but a fire can always cause serious damage. We are still alive and counting …
PS: love the counter.
I guess we should all recognise and acknowledge the fact that most of us use StatCounter under a free registration to help us track our websites effectively. Even for users who have paid for the service, the problem was fully rectified in one day, which is really fantastic. Besides, this is not a problem that happens frequently enough to affect users detrimentally.
On the bright side, it is pleasing to see that although we are registered as free users, the StatCounter team was really enthusiastic about solving the matter calmly and maturely, which was really needed, and rendered us the proper service.
I am glad that we have such a great team of technicians powering StatCounter. 🙂 Great job.
Thanks for the detailed explanation.
Andrew
I launched a new page that weekend and losing data about how it did is a definite loss. A huge loss. If The Planet implied something they didn’t carry through, it’s definitely worth pursuing.
But you know what, fire happens, no one died, life goes on. Good work through the easy times and the hard times.
I’m a Sysadmin/DBA and have lived through the awful experience of a Datacenter loss. It was a nightmare I wouldn’t wish on anybody. Thanks for your great service; the effort and sacrifice you made this weekend showed again your commitment to us. After this I’m seriously considering upgrading my account…
Regards
OracleDisect
Hi
As someone who just absolutely despises poor customer service and badly run businesses (coming from the UK you have to just put up with it for some reason) I want to say how nice it was to read your report on the recent ‘outage’ we all had.
It was written well and with plenty of detail as to the cause and future preventative measures to be taken and most of all an APOLOGY, even though it wasn’t your fault.
As the saying goes ‘what a refreshing change’ !
Just how hard is it to say Sorry and keep people informed about what is going on? Not very.
You put to shame the VAST majority of companies who Charge for services and are run by Idiots with customer service departments that are put there purely to fob people off.
So Thank you very much for the Free service you give me and the superb Customer Service, Honesty and fast responses you give to all your ‘Members’.
Very kind regards
Andy 🙂
Amen LowGenius, I read those posts yesterday and wanted to respond but could only assume that data centers worked the way you just indicated so I had no real information to back my claim. Thanks for taking the time to post.
As for StatCounter staff, for a free service (sorry, I have not upgraded), you’ve definitely amazed me with your response. If only businesses that actually collect my money would be half as informative as you, this world would be a better place.
Keep up the amazing work!
StatCounter rocks regardless!
I’m very happy with your service, so I’m not upset or unhappy at all.
In addition, I’m glad no one was hurt at The Planet’s facility.
A note to those criticizing StatCounter’s response:
ThePlanet has advertised “With multiple state-of-the art data centers located in Dallas and Houston, Texas, The Planet provides On Demand IT Infrastructure backed by complete redundancy in power, HVAC, fire suppression, network connectivity, and security. So if any of our data centers experiences a disruption for any reason, your eggs (or servers) are never in one basket.”
This was obviously not the case, as there was no redundancy on the DNS servers that tell the rest of the internet where to go, nor was there true data redundancy. According to ThePlanet’s boilerplate and industry best-practice, there should have been backup servers in multiple, separate locations and there should have been a backup DNS server at a separate location so that no matter what happened, normal failover processes would kick in and downtime would be minimized.
One poster made a remark about 9-11. It happens that I was working on 9-11 for a Very Large Multinational Telecom who had a server room in Building 7. However, they also had true redundancy – the kind that ThePlanet claims to have – and downtime on 9-11 was less than one minute.
That’s right. NO downtime to speak of. Why? Because those of us who were tasked, partially or completely, with keeping that data up and running knew where those servers were, what was on them, and where the backups were, and the failover switches were thrown – from North Carolina, in the case of one small bank of machines that wasn’t set to automatically fall back to another location on fail – about 15 seconds after the second plane hit.
That’s how a redundant mission-critical data network is supposed to run. As much as I sympathize with ThePlanet for their loss and celebrate the fact that their losses did not include life or injury to human beings, the fact remains: They failed to deliver what they promised. With all due respect, anyone who thinks SC doesn’t have a legitimate gripe really has no idea how large data centers are configured and operated. There should be a backup of every single bit – literally – of data that is 100% synchronized, geographically separated (ideally by several hundred miles at least), and ready to go into action automatically, or as close to it as possible, when the primary machine 404s for any reason.
That is what TP promised, and they did not deliver. Do they have my sympathy? Absolutely, and I’m sure the SC team feels the same way. Do they have an excuse for their failure?
No. Not in the least. Any competent IT organization that size should have these systems in place and automated as a matter of common sense. It’s the IT equivalent of keeping a spare tire and a jack in the car…except in this case, you’ve got several thousand people in the car and relying on your ability to change a tire if you get a flat.
StatCounter Team Response:
We’re of the same mind it seems LowGenius – we couldn’t have said it better ourselves! We appreciate your taking the time to post – thank you… and best of luck with the site redesign!
Thanks. Far more visibility into the outage than I could reasonably expect from a free service. I can only hope other (paid) vendors will live up to your standard.
Nice job.
Thank you so much for the comprehensive explanation. Boy, you give far better service than many of the big guys!! 🙂