Archive for the ‘EDR’ Category
According to Nate Silver, author of The Signal and the Noise, weather forecasting has dramatically improved over the past 30 years. So why was the U.S. National Oceanic and Atmospheric Administration (NOAA) so wrong in forecasting the 2012 hurricane season? As I wrote in my post, Dodging Bullets in Disaster Recovery, NOAA in May of this year stated that “Conditions in the atmosphere and the ocean favor a near-normal hurricane season in the Atlantic Basin this season.” According to NOAA, that translated to “12 named storms with six hurricanes, including three major hurricanes.” Instead, assuming no more tropical storms and hurricanes, the U.S. will end the season with 19 named storms and 10 hurricanes, tying with 1887, 1995, 2010, and 2011 as the “third most active Atlantic hurricane season in recorded history,” according to a Wikipedia post. Whether this is a long-term trend remains to be seen, but, at least for now, we seem to be in a dangerous weather pattern.
Trends are not the same as forecasts, and weather forecasts have improved. But, not surprisingly, Mr. Silver reports, “the further out in time these models go, the less accurate they turn out to be.” Forecasts for a season may not be very reliable, but organizations and individuals should closely attend to near-term forecasts for a specific event. As an example, forecasts for when and where a specific hurricane will make landfall have dramatically improved. Mr. Silver wrote, “Just twenty-five years ago, when the National Hurricane Center tried to forecast where a hurricane would hit three days in advance of landfall, it missed by an average of 350 miles. Today, however, the average miss is only about one hundred miles.”
Even better, predictions can be highly accurate when they are framed as probabilistic estimates over longer periods of time, such as:
- What’s the probability of a magnitude 6.0 earthquake in the eastern United States in the next 100 years?
- What’s the probability that the Lincoln Tunnel will flood again in the next 50 years?
These probabilistic predictions improve even further if you look at conditional probabilities such as the following (a worked example appears after the list):
- What’s the probability of a magnitude 6.0 aftershock within 2 days of a magnitude 7.0 earthquake?
- What’s the probability of the Lincoln Tunnel flooding if ocean temperatures increase 2 degrees?
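To see why conditioning helps, here is a small worked example. The numbers are purely hypothetical, chosen only to illustrate the arithmetic; they are not drawn from any seismic catalog:

```latex
% Conditional probability: P(A | B) = P(A and B) / P(B).
% Hypothetical illustration: suppose 2% of observation windows in a
% long seismic record contain a magnitude 7.0 earthquake, and 1.8%
% contain a magnitude 7.0 earthquake followed within 2 days by a
% magnitude 6.0 aftershock. Then (amsmath notation):
\[
P(\text{aftershock} \mid \text{mainshock})
  = \frac{P(\text{aftershock} \cap \text{mainshock})}{P(\text{mainshock})}
  = \frac{0.018}{0.02}
  = 0.9
\]
```

Conditioning on the mainshock turns a rare unconditional event into a 90% likelihood in this toy example, which is exactly the kind of sharpened estimate that planners can act on.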
These conditional probabilities enable us to evaluate scenarios and to plan and prepare. As organizations, we need to spend more time evaluating scenarios and looking at approaches that will mitigate the impact of dangerous events. I’ll write more on this in a later post, but, in the meantime, I’ll say it once again, “The greater the distance between your primary and disaster recovery data centers, the greater the probability that your organization can survive a catastrophic event.”
By now, many people have forgotten the magnitude 5.8 earthquake that rocked Virginia on August 23, 2011 and caused $200-300 million in damage, including damage to the Washington Monument and the Washington National Cathedral. Maybe that’s because the earthquake was quickly followed by Hurricane Irene, which, according to the National Hurricane Center, caused an estimated $15.8 billion in damage, including $7.2 billion from inland flooding and storm surge.
Given the recent disaster caused by last week’s Hurricane Sandy, the largest Atlantic hurricane on record, and the Nor’easter that is expected to hit this evening and re-flood some of the impacted areas, Irene and the Virginia earthquake may soon be forgotten, except by those individuals still suffering and cleaning up from the previous disasters.
The question isn’t “Do you remember?”
The question is “What will you do differently to better prepare?”
I think everyone will agree that when nearly 10% of the world’s population loses power, that counts as a major disaster. Unlike some disasters, the recent power outage in India, which affected more than 600 million people, was entirely predictable. Penny Jones at DatacenterDynamics wrote about India’s power issues back in February 2011.
In a follow-up article this week, after the blackout, DatacenterDynamics’ general manager for India, Praveen Nair, reported that
Northern India alone can suffer an average of three to four hours of power cuts a day as the government carries out load shedding.
For larger data centers, the massive power outage of the past week was not particularly disruptive, because, as Nair says,
99% of the big players are used to this condition and have adequate backup. So when the outage took place, most data centers switched to generator sets for their power needs and most are equipped to run for days.
The same cannot be said for the millions of people trying to get to and from work on transportation systems that were completely shut down. In a disaster, the human factor can never be ignored, which is why we speak so frequently about the need to have a recovery location outside the disaster zone. In this case, the disaster zone was most of India, a much larger area than would be affected by even the largest typhoon, tsunami, flood, fire, or earthquake. And the best place to have a recovery facility would have been on another continent, where there would be no local human impact.
Organizations always need to prepare for recovery from natural disasters. I suspect, however, that some of the greatest challenges for organizations, going forward, will be disasters related to infrastructure failures, particularly in rapidly growing areas such as India.
What’s worse than losing your data?
Losing your data and having no backup.
What’s worse than having no backup?
Having a backup that restores inconsistent data.
That’s precisely the concern that Josh Kirsher raised on the April 10 Wikibon Peer Incite. A lot of people are buying insurance, in the form of snapshots of application data, and they leverage consistency groups, thinking this will ensure that the data is application-consistent. It’s the application-consistent snapshot that companies use as source volumes for off-site backups and asynchronous replication, and as on-premise application recovery points. And it’s consistency groups that enable applications to be restored in minutes rather than hours or days. Unfortunately, consistency groups only work when procedures are perfectly designed, when they are perfectly followed, when they are constantly maintained, and when no one makes an error.
In today’s dynamic environment, where the servers on which applications run are virtualized, where applications are frequently moved from one physical server to another, and where LUNs are quickly created and volumes are added and removed on a daily basis, the probability of developing a perfect consistency-group process that is precisely followed and continuously maintained, without any human error, is very low. That means that when you do need to call upon your insurance, the snapshot or backup you assume is application-consistent, the probability is very high that the data will in fact be inconsistent, and the time to restore consistent application data from paper source documents will be measured in days, not minutes or hours. Companies that primarily transact business electronically may not be able to reconstruct the data at all. This is the scenario that Tim Hays, of Animal Health International, avoided when he made the decision to protect everything. After all, if he could affordably protect everything, he didn’t have to worry about what he might miss.
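To make the failure mode concrete, here is a minimal sketch. Everything in it is hypothetical: the Volume and ConsistencyGroup classes are stand-ins for a storage array’s constructs, not any vendor’s API. The point is simply that a snapshot is only consistent across the volumes someone remembered to register.

```python
# Hypothetical model: a consistency group snapshots only the volumes
# that were registered with it. Names are illustrative, not a real
# storage API.

class Volume:
    def __init__(self, name):
        self.name = name
        self.last_write_seq = 0  # monotonically increasing write counter

    def write(self, seq):
        self.last_write_seq = seq


class ConsistencyGroup:
    def __init__(self):
        self.members = []

    def register(self, volume):
        self.members.append(volume)

    def snapshot(self):
        # Freezes only the registered members at the same point in time.
        return {v.name: v.last_write_seq for v in self.members}


# The application actually spans three volumes...
app_volumes = [Volume("data01"), Volume("data02"), Volume("logs01")]

cg = ConsistencyGroup()
cg.register(app_volumes[0])
cg.register(app_volumes[1])
# ...but "logs01" was added to the application last month, and no one
# updated the consistency group.

for seq, vol in enumerate(app_volumes, start=1):
    vol.write(seq)

snap = cg.snapshot()
missing = [v.name for v in app_volumes if v.name not in snap]
print(f"Snapshot covers {sorted(snap)}; missing volumes: {missing}")
```

A restore from that snapshot brings back data files without the matching transaction logs, which is exactly the inconsistent-recovery scenario described above.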
Having two data centers, especially when they are separated by a significant distance, brings so many advantages that it’s difficult to name them all, but here are just a few:
- The ability to increase the frequency and quality of disaster recovery testing
- The ability to perform site maintenance and upgrades, while maintaining application availability
- The ability to rapidly restore applications and continue operations in the event of a regional disaster
Some organizations have eliminated tape and migrated to disk-based backup methods, leveraging various techniques for creating application-consistent snapshots. This approach can dramatically improve recovery times, but again, it requires that the 3rd-party recovery location have all the necessary equipment and software to run the applications once the applications and data are restored. And, again, the location must be unoccupied.
The reason organizations use 3rd-party disaster recovery service providers is, in part, that they don’t want to absorb the full cost of having a second location sitting idle, just in case a disaster happens; for most organizations, that cost is prohibitive. But forward-thinking companies have recognized that application development and test environments can be re-purposed for production applications when a disaster occurs. In this way, no infrastructure is wasted, and no systems sit idle. A two-data-center architecture, with development, test, and disaster recovery in one location, and production in the other, provides the ideal approach for both resource efficiency and resiliency.
The biggest challenge for organizations may be determining the best way to get all of the current application data from the primary production location to the development, test, and disaster recovery location. Asynchronous replication is clearly the approach of choice, in terms of cost and flexibility for locating the secondary site, but it guarantees that some data will be lost. Many of you saw our recent announcement about Animal Health International becoming an Axxana customer. The approach that Animal Health took, combining asynchronous replication with disaster-proof protection of the replication lag, is precisely the approach that organizations should take. The combination gives organizations a complete solution that is both affordable and flexible.
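As a hedged illustration of what that lag protection buys (the numbers are invented for the sketch, and this is not a description of Axxana’s internals): with asynchronous replication, the writes acknowledged at the primary but not yet applied at the secondary are precisely the data a disaster would destroy, and capturing them in a local disaster-proof store is what closes the gap.

```python
# Hypothetical illustration of the asynchronous replication gap.
# The counters are made-up write sequence numbers, not a vendor API.

acked_at_primary = 1_000_000    # last write acknowledged to the application
applied_at_secondary = 999_200  # last write applied at the remote site

# Without lag protection, a primary-site disaster loses every write
# in the gap between the two counters:
at_risk = acked_at_primary - applied_at_secondary
print(f"Writes lost in a disaster without lag protection: {at_risk}")

# With the gap captured in a local disaster-proof store, recovery
# combines the remote copy with the protected lag:
recoverable = applied_at_secondary + at_risk
assert recoverable == acked_at_primary  # zero data loss in this model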
On 1 May 2011, French investigators recovered the flight data recorder from Air France Flight 447, 23 months after the Airbus A330 plunged into the Atlantic Ocean on a flight between Rio de Janeiro and Paris. At the time of the tragic crash, and for months afterward, there was a great deal of speculation regarding what caused the crash. Was it mechanical failure, pilot error, or some combination of the two? With the release of the information from the flight data recorder, including the full transcripts of the cockpit voice recordings, investigators now have a clear picture of what occurred. And from analysis of the retrieved data, they can make recommendations to airplane manufacturers and to pilot training programs on how to reduce or eliminate these kinds of tragedies.
In the case of Flight 447, investigators did have some data from the automatic transmissions, but the data was incomplete. Over the past several decades, as storage media have advanced from magnetic tape to solid state disk, the airline industry has been able to increase the amount of data that flight data recorders store and protect. Now, with the information from Flight 447’s flight data recorder, retrieved from the ocean floor two miles below the surface, the picture is complete.
Eye-witness accounts are notoriously unreliable, as this Stanford Journal of Legal Studies article and this APA Monitor article attest. And the stress that accompanies a disaster only increases the unreliability of eye-witness testimony. When disasters strike a business and data is lost, it is sometimes possible to reconstruct data from source documents, but source documents are sometimes lost. Data can also be reconstructed from memory, but, as research shows, memories can be flawed.
For the airline industry, the capture and protection of data in flight data recorders before and during disasters, and the analysis of that data afterward, have been critical to making airlines the safest mode of travel. Still, the industry looks constantly to improve. Imagine if, rather than after 23 months, the data in the flight data recorder had been recoverable immediately. Imagine if, rather than having to be found, the data could have been extracted automatically. Then the analysis of the cause, and the development and modification of procedures that might prevent future tragedies, could have begun almost immediately.
Whew! That was close. At least too close for my taste. The asteroid YU55, which passed by the Earth yesterday, is about 400 meters across and was traveling at 30,000 miles per hour when it whipped past us. It came within about 200,000 miles, or 0.85 lunar distances, of the Earth. We were never in any real danger; the astronomers who track these things determined quite a while ago that we were safe.
If it had hit, scientists estimate that it would have created a crater four miles across and a third of a mile deep. For the record, while Axxana’s Phoenix System is waterproof, fireproof, smokeproof, shock- and vibration-proof, and can survive building collapses, if one of our black boxes ever takes a direct hit from an asteroid of this size, we won’t survive. I guess we should put that in the disclaimers at the bottom of the product brochure. The next fly-by of an asteroid this size is expected in 2028, so we have a few years to reprint the brochures.