Posts Tagged ‘Google’
Cloud Failures and the Elephant in the Room
In the last 30 days, Google, LinkedIn, and Microsoft all had outages in their cloud-based services. Caroline Craig of InfoWorld wrote in a recent article, “Crashes are inevitable in the cloud. The trick to a successful cloud strategy is to design for the impending failure.” Everyone using cloud services should expect some downtime, since, as Craig writes, “… no cloud-based service offers a 100 percent uptime guarantee.”
But here’s the problem: InfoWorld, Google, LinkedIn, and Microsoft, in all of their analyses, post-mortems, and root-cause disclosures, talk only about downtime. Downtime is only half of the story. Where’s the discussion of data loss? That’s the elephant in the room that everyone sees but few are talking about.
If you want a 100 percent guarantee, I’ll give you one. It’s this.
Every time there is downtime in a cloud service, there is data loss.
If there’s always data loss, then why don’t people talk about it? That’s easy. You don’t even have to be a customer to know whether most cloud services are up and running. Twitter users are more than happy to tweet that a cloud service is down. And most reputable cloud-based services disclose downtime to the world through their status page. Here are just a few examples:
None of these services, however, publicly report the data loss that occurs every time they have an outage. Part of the reason may be that they don’t know. And unlike downtime, data loss doesn’t usually affect everyone, so only the affected person or company knows. While it’s frustrating and costly for the customer, especially if it’s a business customer, they are unlikely to tell anyone that they lost data. In fact, one of the few who acknowledged the potential for data loss after the recent Google Drive outage was @john_mccoubrey, who tweeted, “Some of my docs on Google Drive appear to have been removed. Hope this is just a temporary glitch.”
It’s time we all started talking about the impact of data loss. You know it’s there, even if you’re not talking about it.
Google: .02% of Users is a Big Number. 5 Days is a Long Time.
This post isn’t meant to criticize Google. They have a tremendous platform. I know many people who use Google, not only for advertising and searching, but for blogging, for collaboration applications, and for email. But I’ve been watching the continuing problems with Google’s Gmail service. On Sunday, February 27th, a software bug caused some Gmail user data to be deleted. As reported by Google, only 0.02% of users were affected by the data loss, down from earlier estimates of 0.08%. It turns out, though, that 0.02% of the Gmail user base is still a big number. By some estimates, it’s about 35,000 people. It’s now five days later. The latest update from Google, which is from yesterday, reports that:
We have restored the majority of the affected accounts, and will continue to restore the remaining accounts as quickly as possible. Accounts with more mail are taking more time.
Why would it take Google so long to restore data? Because Google has to restore the data from tape. Google has an interesting perspective on tape:
To protect your information from these unusual bugs, we also back it up to tape. Since the tapes are offline, they’re protected from such software bugs. But restoring data from them also takes longer than transferring your requests to another data center, which is why it’s taken us hours to get the email back instead of milliseconds.
Hours instead of milliseconds? Actually, for some users, it’s days instead of milliseconds.
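To put the percentages in perspective, here is a minimal sketch of the arithmetic behind the 35,000 figure. It assumes the 0.02% and "about 35,000 people" numbers quoted above and works backward to the user base they imply. That implied total is derived here for illustration only; it is not a number Google reported, and the function names are hypothetical.

```python
def implied_user_base(affected_users: int, affected_percent: float) -> float:
    """Total user base implied by an affected head count and an affected percentage."""
    return affected_users / (affected_percent / 100)


def affected_count(user_base: float, affected_percent: float) -> float:
    """Number of users that a given percentage of the user base represents."""
    return user_base * affected_percent / 100


# Figures quoted in the post: 0.02% of users, estimated at about 35,000 people.
base = implied_user_base(35_000, 0.02)
print(f"Implied user base: {round(base):,}")

# Assumption: the earlier 0.08% estimate applied to the same implied user base.
earlier = affected_count(base, 0.08)
print(f"Users affected under the earlier 0.08% estimate: {round(earlier):,}")
```

The point of the sketch is that a percentage that sounds negligible still maps to tens of thousands of real accounts once it is multiplied by a user base of that size.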








