Archive for the ‘Disaster Recovery’ Category
We just got back from another great EMC World conference. Thanks to everyone who came to our booth to learn more about implementing zero-data-loss disaster recovery.
The people we met were looking specifically for a solution like ours – one that would enable them to achieve zero data loss without breaking the bank. These weren’t your typical EMC VMAX global banking customers, who have already implemented three-site mirroring and who spend whatever it takes, because they have to. Instead, they were the kinds of companies (mid-sized banks, manufacturers, engineering firms, and services companies) that attended Michael Gordon’s presentation for VNX users.
We provide simple, affordable, reliable zero-data-loss protection over any distance for mid-market companies. Oh, yes. And if you happen to be an EMC VMAX customer looking to reduce the cost of triple-mirroring, we can do that, too.
In the last 30 days, Google, LinkedIn, and Microsoft all had outages in their cloud-based services. Caroline Craig of InfoWorld wrote in a recent article, “Crashes are inevitable in the cloud. The trick to a successful cloud strategy is to design for the impending failure.” Everyone using cloud services should expect some downtime, since, as Craig writes, “… no cloud-based service offers a 100 percent uptime guarantee.”
But here’s the problem: in all of their analyses, postmortems, and root-cause disclosures, InfoWorld, Google, LinkedIn, and Microsoft talk only about downtime. Downtime is only half of the story. Where’s the discussion of data loss? That’s the elephant in the room that everyone sees but few are talking about.
If you want a 100 percent guarantee, I’ll give you one. It’s this.
Every time there is downtime in a cloud service, there is data loss.
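Why can I offer that guarantee? Because most cloud services replicate data asynchronously: a write is acknowledged to the user before it has been copied to a second site, so whatever is still in flight when the primary fails is gone. Here’s a minimal sketch of that failure mode (my own simplified illustration, not any provider’s published architecture):

```python
# Toy model of asynchronous replication: writes are acknowledged before
# they reach the replica, so a primary failure strands the in-flight writes.
from collections import deque

class AsyncReplicatedStore:
    def __init__(self):
        self.primary = []       # writes acknowledged to the client
        self.replica = []       # writes that have reached the replica
        self.pending = deque()  # acknowledged but not yet replicated

    def write(self, record):
        self.primary.append(record)
        self.pending.append(record)  # acknowledge now, replicate later

    def replicate_one(self):
        if self.pending:
            self.replica.append(self.pending.popleft())

store = AsyncReplicatedStore()
for i in range(5):
    store.write(f"record-{i}")
store.replicate_one()  # replication lags behind the acknowledged writes
store.replicate_one()

# The primary fails here; recovery proceeds from the replica.
lost = len(store.primary) - len(store.replica)
print(f"Acknowledged writes lost in the failover: {lost}")  # prints 3
```

The three records stranded in the replication queue are exactly the kind of data loss that never shows up on a status page.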
If there’s always data loss, then why don’t people talk about it? That’s easy. You don’t even have to be a customer to know whether most cloud services are up and running. Twitter users are more than happy to tweet that a cloud service is down. And most reputable cloud-based services disclose downtime to the world through their status pages.
None of these services, however, publicly report the data loss that occurs every time they have an outage. Part of the reason may be that they don’t know. And unlike downtime, data loss doesn’t usually affect everyone, so only the affected person or company knows. While it’s frustrating and costly for the customer, especially if it’s a business customer, they are unlikely to tell anyone that they lost data. In fact, one of the few who acknowledged the potential for data loss after the recent Google Drive outage was @john_mccoubrey, who tweeted, “Some of my docs on Google Drive appear to have been removed. Hope this is just a temporary glitch.”
It’s time we all started talking about the impact of data loss. You know it’s there, even if you’re not talking about it.
If you accept the conclusions of Quorum Inc.’s DR Report, Q1 13*, which was recently summarized and republished by StorageNewsletter, computer hardware, people, and software are a disastrous combination, accounting for 95% of data center disasters. For anyone who has been working in business continuity planning and disaster recovery, this report is probably not news. Typical organizations have a lot of computers, a lot of people, and a lot of software. And while you can buy higher-quality hardware, better train and more tightly control your people, and test all software before you put it in production, you still won’t eliminate all disasters. Things happen.
It’s also good to remember that frequency and severity are not the same thing. And while, according to the report, natural disasters accounted for only 5% of outages, recovering from some of the recent natural disasters may be far more challenging than recovering from the occasional server outage, accidental file deletion, or bad software patch that corrupts a database. I’m saying “may be,” not “will be,” since we all know that one person’s minor headache is another person’s disaster.
We will never eliminate all causes of disasters: hardware, people, software, or Mother Nature. Just look at the recent meteor that exploded over Chelyabinsk, Russia. Traveling at roughly 40,000 miles per hour, it was virtually impossible to predict until it was already streaking through the Earth’s atmosphere, where it exploded. We can’t prevent disasters from occurring. But with proper systems, we can eliminate their impact on application recoverability, and we can prevent data loss, regardless of whether the disaster is one of the 95% or one of the 5%.
* Quorum reported that it used data from incoming calls to its IT support center to estimate the frequency of various causes of data center disasters at small and medium businesses.
Imagine you live in New York. You’ve booked a cross-country flight to attend a party in San Diego, California, where your parents have retired. Your entire family is about to celebrate your parents’ 50th wedding anniversary. It’s a once-in-a-lifetime event. Your brothers and sisters have selected you to give the speech honoring your parents, and you feel both proud and honored to have been chosen. It’s Wednesday morning, and you’ve left yourself a full day to get there. You’re a frequent traveler, and you know that flights get delayed. Especially this week, and especially today. Everyone has remarked on how fitting it is that this special anniversary falls on Thanksgiving Day. Today is the day before Thanksgiving, and it’s the busiest travel day of the year.
When you arrive at the airport for check-in, the agents at the counters are flustered and fellow travelers are visibly upset. The computers are down. You’re told they’ve been down for 10 minutes. Even though you are not surprised by the outage, you feel your tension rising. But after another five minutes, the computers come back up, everyone breathes a collective sigh of relief, you check in, and the flight, though slightly delayed, departs nearly on time. The pilot apologizes for the delay and tells the passengers, “I think I can make up the time.”
Now, imagine instead that when you arrive at the airport for check-in, everyone is smiling. The computers are up. Check-ins are being processed, and baggage is being checked. The lines are flowing smoothly. But when you reach the check-in desk, the agent tells you, “I’m sorry. I have no record of your reservation. You don’t have a seat on this flight.” You’re angry. She’s sympathetic, but says, “There’s nothing I can do. The flight is full.” You demand a seat on the next flight. But it’s the busiest travel day of the year. All other flights are booked. In fact, oversold. The next available seat is Friday, the day after the party. It turns out that the airline lost a small amount of data, a few kilobytes among terabytes of reservation data. Someone at the airline knew there had been a recent data-loss incident, but they didn’t know what was in the lost data. Unfortunately, it was the data that held the record of your reservation. And you miss this once-in-a-lifetime event.
As customers, there is a fundamental difference between the way we perceive downtime and the way we perceive data loss. In most cases, short-term downtime is, at worst, an annoyance. How many times have you walked into a retail store, a restaurant, a bank, a package shipping service, or an airline check-in desk and heard, “Sorry, the computers are down”? It happens. When computers go down, companies lose productivity. If you are in a hurry and can’t wait or won’t come back, companies may lose a sale. But unless the outages are frequent or prolonged, you reluctantly tolerate them. After all, you can see that the computers are down. Everyone is affected. We’re all in this together. It’s not personal.
Contrast that with data loss. If a company actually lost all of its data, you would know it’s not personal: just incompetence or a catastrophic event, one that everyone can see. But that’s not the way most data loss happens. Most data is protected. Just not all of it. So data loss, more than downtime, becomes personal. That airline lost your reservation. That shipping company lost your package. That bank lost your deposit. Unless you’ve got some physical or electronic receipt, you can’t see or prove data loss. By its very nature, it’s often unseen. You say something happened. They say it didn’t. One of you is lying. And you know it’s them. So when a company loses your data, you lose faith, and they are likely to lose a customer for life. So the next time someone asks you about the value of data, ask them, “What’s the value of a customer for life?”
When it comes to data centers, everyone believes site location matters. It seems obvious that anyone choosing a location for a new data center would want to be far away from any known disaster zones. However, an article published yesterday by GigaOM entitled “The states with the most data centers are also the most disaster-prone [maps]” shows that in the U.S., organizations do the exact opposite. They locate data centers in the most disaster-prone states! Wow!
First, thanks to Rani Molla, the article’s author. She’s an expert in data visualization, and her maps really help to tell the story of data center risk. In the article, she references Mark Thiele, data center expert and founder of Data Center Pulse, who argues that climate change will substantially change the risk profile for data centers going forward.
I’ll write more about this in a later post, but Nassim Taleb, author of Fooled by Randomness, The Black Swan, and Antifragile: Things That Gain from Disorder, would argue, correctly, that given enough time, any location is at risk. The strategy, then, needs to be to make your data center and your organization, at the very least, resilient, meaning they can survive disasters. At best, you make your organization anti-fragile.
In his latest book, he explains that something that is fragile is more likely to break the more you stress it. Something that is anti-fragile gets stronger or better the more you stress it. He also reminds us that “the absence of evidence is not the evidence of absence.” Prior to Hurricane Sandy, no one, or at least few, guessed that the lower half of Manhattan would be underwater. Prior to the World Trade Center collapse, no one, or at least few, would have guessed that terrorists would fly two planes into the twin towers.
Unlike other disaster recovery solutions, Axxana does not require that you predict the types of disasters that might occur, or how often they might occur. Given enough time, everyone will be affected by a disaster. Axxana also doesn’t require that you predict how you will retrieve the protected data. When preparing to survive unforeseen disasters, having options is key. That’s why we give customers four different ways to retrieve data.
I think it’s time to stop trying to predict those things we can’t predict, and instead spend our time figuring out how we will survive.
While it may not have been their intention, Amazon gave Disaster Recovery professionals a gift on Christmas Day. The gift came in the form of three valuable disaster recovery lessons, with no charge and no shipping costs. Here they are:
- Well-intentioned humans cause disasters
- Disasters can be hidden, for a while
- Continuing operations after a disaster lengthens the recovery process
On December 24th, Amazon Web Services (AWS) had an event. In the world of data centers, “event” is code for “problem.” Fortunately, the event was limited to a specific service, the Amazon Elastic Load Balancing (ELB) service, in a specific region, US-East. Unfortunately, that event impacted service levels and availability for Netflix, AWS’s largest customer. Amazon’s recently released postmortem on the outage is a great case study in how systems fail and why they can take so long to recover.
First, as is all too often the case, this event was caused by human error, when ELB data “was deleted by a maintenance process that was inadvertently run against the production ELB state data.” Said differently, this was an authorized employee performing an authorized task against the wrong set of data. No hackers and no bad intentions.
Second, the impact of deleting the data wasn’t felt immediately. The service degraded over time, and while service levels remained tolerable, the AWS team focused on the symptoms. It wasn’t until after several hours of continually degrading service that they started digging deeper and discovered the root cause.
Third, because production continued after the accidental file deletion, changes in the environment had to be reapplied to a restored snapshot. It took two attempts, two different methods, and several hours to restore a usable snapshot. It then took several hours more to reapply the changes that had occurred.
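In rough terms, the recovery pattern Amazon describes looks like the sketch below (a generic illustration of snapshot-plus-replay recovery, with hypothetical names and data, not Amazon’s actual tooling). The replay loop grows with every change made after the snapshot, which is exactly why continuing operations after a disaster lengthens the recovery process:

```python
def recover(snapshot, change_log, snapshot_time):
    """Restore a known-good snapshot, then replay post-snapshot changes."""
    state = dict(snapshot)  # step 1: restore the last good snapshot
    # step 2: reapply every change recorded after the snapshot was taken;
    # the longer production ran after the incident, the longer this takes
    for change in change_log:
        if change["time"] > snapshot_time:
            state[change["key"]] = change["value"]
    return state

snapshot = {"lb-1": "config-v1", "lb-2": "config-v1"}  # taken at time 5.0
change_log = [
    {"time": 10.0, "key": "lb-1", "value": "config-v2"},
    {"time": 12.5, "key": "lb-3", "value": "config-v1"},
]
print(recover(snapshot, change_log, snapshot_time=5.0))
# -> {'lb-1': 'config-v2', 'lb-2': 'config-v1', 'lb-3': 'config-v1'}
```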
For its part, Amazon is using this as a learning experience and is making changes to enable faster recovery in the future. So, too, is Netflix, which described its planned changes in an explanation and apology to its customers.
Since Amazon was kind enough to give a gift, it would be a shame if the rest of the IT community failed to use it.
Just 25 days after Hurricane Sandy devastated New Jersey, parts of Pennsylvania, and a good portion of the New York City metropolitan area, New Yorkers gathered to celebrate at the Macy’s Thanksgiving Day Parade. New Yorkers are a resilient bunch, but the impact of the superstorm will be felt for a long time. Only six days after the parade, the Huffington Post reported that New York’s mayor, Michael Bloomberg, went to Washington, D.C. to lobby for $42 billion to repair and rebuild New York State. New Jersey, meanwhile, is requesting $37 billion.
Due to the area’s investment in internet infrastructure, New York City is home to many businesses, both large and small, that depend on access to this communication grid. Even where the lights stayed on, the computers remained up, and the networks remained available, Hurricane Sandy impacted all of these businesses. Why? Because employees lost homes. Public transportation systems were in disarray. Fuel was rationed, and lines for fuel stretched for miles. Many individuals simply couldn’t get to work.
We’ve studied disasters enough to know that, regardless of where you live and regardless of what you do, it is not possible to prevent disasters from impacting your organization. But you can increase the probability that your business survives by maintaining multiple offices and multiple data centers. Your organization’s probability of survival increases dramatically the further apart your offices are. Hurricane Sandy’s winds spanned 1,100 miles, and its impact was felt in Jamaica, Cuba, the Dominican Republic, the Bahamas, Haiti, Bermuda, Canada, and 24 states in the U.S. A good place for a second location of a New York-based business might be somewhere in Kansas.
Hurricane Sandy helped surface some of the risks associated with operating from a single region. These include:
- Power outages can extend for hours, days, or weeks
- Organizations may not have enough fuel on site to sustain power during an extended outage
- Fuel may be in short supply and rationed after disasters
- Employee movement may be significantly restricted
- Public transportation systems may be impacted
All of this argues for a disaster recovery site that is a long distance, a very long distance, from the production data center.
According to Nate Silver, author of The Signal and the Noise, weather forecasting has dramatically improved over the past 30 years. So why was the U.S. National Oceanic and Atmospheric Administration (NOAA) so wrong in forecasting the 2012 hurricane season? As I wrote in my post, Dodging Bullets in Disaster Recovery, NOAA stated in May of this year that “conditions in the atmosphere and the ocean favor a near-normal hurricane season in the Atlantic Basin this season.” According to NOAA, that translated to “12 named storms with six hurricanes, including three major hurricanes.” Instead, assuming no more tropical storms and hurricanes, the U.S. will end the season with 19 named storms and 10 hurricanes, tying 1887, 1995, 2010, and 2011 as the “third most active Atlantic hurricane season in recorded history,” according to Wikipedia. Whether this is a long-term pattern remains to be seen, but we seem, at least for now, to be in a stretch of dangerous weather.
Trends are not the same as forecasts, and weather forecasts have improved. But, not surprisingly, Mr. Silver reports that “the further out in time these models go, the less accurate they turn out to be.” Forecasts for an entire season may not be very reliable, but organizations and individuals should pay close attention to near-term forecasts for a specific event. Forecasts of when and where a specific hurricane will make landfall, for example, have dramatically improved. Mr. Silver wrote, “Just twenty-five years ago, when the National Hurricane Center tried to forecast where a hurricane would hit three days in advance of landfall, it missed by an average of 350 miles.” He adds, “Today, however, the average miss is only about one hundred miles.”
Even better, forecasters can be highly accurate when making probabilistic estimates over longer periods of time, such as:
- What’s the probability of a magnitude 6.0 earthquake in the eastern United States in the next 100 years?
- What’s the probability that the Lincoln Tunnel will flood again in the next 50 years?
These probabilistic predictions improve even further if you look at conditional probabilities, such as:
- What’s the probability of a magnitude 6.0 aftershock within 2 days of a magnitude 7.0 earthquake?
- What’s the probability of the Lincoln Tunnel flooding if ocean temperatures increase 2 degrees?
These conditional probabilities enable us to evaluate scenarios and to plan and prepare. As organizations, we need to spend more time evaluating scenarios and looking at approaches that will mitigate the impact of dangerous events. I’ll write more on this in a later post, but, in the meantime, I’ll say it once again, “The greater the distance between your primary and disaster recovery data centers, the greater the probability that your organization can survive a catastrophic event.”
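To make that concrete, here’s a small worked example (the 1% annual probability is a hypothetical number chosen for illustration, not a real flood statistic) of turning an assumed annual event probability into the probability of at least one occurrence over a planning horizon:

```python
def prob_at_least_once(annual_prob, years):
    """P(at least one event in `years` years), assuming independent years."""
    return 1 - (1 - annual_prob) ** years

# Illustrative assumption: a 1% chance of a severe regional flood per year.
print(f"Over 50 years:  {prob_at_least_once(0.01, 50):.0%}")   # ~39%
print(f"Over 100 years: {prob_at_least_once(0.01, 100):.0%}")  # ~63%
```

Even a risk that feels negligible in any single year becomes more likely than not over a long enough horizon, which is why planning should start from scenarios rather than annual odds.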
Too often, business continuity planning and disaster recovery planning are treated as the same function. They are not. Business continuity planning helps organizations ensure that applications and processes continue through the myriad day-to-day disruptions that might occur. These include IT component failures, such as a disk-drive failure, a server failure, a dropped network link, or an application bug. Disaster recovery planning helps organizations recover operations after less frequent, but far more devastating, events, such as fires, floods, hurricanes, earthquakes, and a variety of man-made disasters. While the data center strategy is only one component of business continuity and disaster recovery planning, it is a key component. And while business continuity and disaster recovery planning are different functions, they must often be considered together because of budget limitations.
There are plenty of advantages to having a business continuity data center in region, a very short distance from the production data center. If the data centers are very close, there will be little impact on transaction latency for the always-important two-phase database commit. Failover times from the production data center to the business continuity data center can be very short. Staff who normally work at the primary data center can easily show up for work at the in-region business continuity data center. And WAN charges between the primary and business continuity data centers will be relatively low.
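How short is “very short”? Here’s a back-of-the-envelope sketch (my own approximation; real latency also depends on routing, equipment, and protocol overhead) of how distance alone adds latency to a synchronously replicated commit, assuming light travels through fiber at about 200,000 km/s and that a two-phase commit needs two network round trips:

```python
FIBER_SPEED_KM_PER_MS = 200.0  # light in fiber travels ~200,000 km/s

def commit_latency_ms(distance_km, round_trips=2):
    """Speed-of-light delay added per commit (two round trips for 2PC)."""
    round_trip_ms = 2 * distance_km / FIBER_SPEED_KM_PER_MS
    return round_trip_ms * round_trips

for km in (10, 100, 1000, 3000):
    print(f"{km:>5} km between sites -> ~{commit_latency_ms(km):.1f} ms per commit")
```

At 10 km, physics adds a fraction of a millisecond per commit; at 3,000 km, it adds roughly 60 ms, which is why synchronous replication across very long distances is so painful.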
The problem with an in-region business continuity data center is that it can’t replace an out-of-region disaster recovery data center. The two are simply too close for comfort. And few organizations can afford three data centers. Following are a few of the types of disasters that can prevent an in-region business continuity data center from acting as a disaster recovery data center:
- Electrical-grid failure
- Telecommunications failure
- Transportation systems failure
- Chemical spills
- Radiation leaks
- War, terrorism, and civil unrest
For these types of disasters, it is much more likely that both in-region data centers will be affected, and much more challenging to recover applications and data. One of the trade-offs organizations must make is between how quickly they can recover and how certain they are that they can recover from the full range of disasters that could strike them. We believe that a slight increase in recovery time is well worth the added assurance that you can actually recover applications after a disaster. Using an in-region business continuity data center as a disaster recovery data center is a little like doing a tandem skydive. It’s fine, as long as nothing goes wrong.