Posts Tagged ‘Consistency groups’
You Guessed Wrong
Tim is the VP of IT. His company’s data center is more than a thousand miles from the nearest ocean, so it’s not going to be impacted by a tsunami or a hurricane. It’s in an area that has very little seismic activity, so it’s not likely to be affected by an earthquake. There are no active volcanoes nearby. It’s not near any rivers or near a flood plain. There are no other major buildings nearby, and, even though his area has experienced a major drought over the past year, the risk from fire, at least from somewhere outside the data center, is very low.
Tim’s disaster recovery plan calls for a full backup of the application and data files of the critical applications once a week and an incremental backup nightly. The backups usually complete without error, but not always. Some applications are considered more critical than others, so the less critical ones are backed up less frequently. Tim runs a disaster recovery test twice a year to make sure that everything in the DR plan is working. Usually it is.
There are other risks to his data center. He could lose power or network communication. He could have a fire that starts inside the data center. His area does occasionally have tornadoes, but not very often. There could be a chemical spill that would require the area to be evacuated. None of these is very likely.
Like every IT director, Tim has a limited budget, and he is constantly under pressure to keep IT costs low. Tim has made a series of guesses, bets really, in developing his disaster recovery plan. He’s bet that he’s covered for most of the risk associated with natural disasters, that the applications he deemed critical are the right priorities, that he’s got all of the program files and data together in the proper groups, and that nothing has changed since he last revisited the plan. He’s betting that the backup process is working, that the tapes are readable, and that the applications are recoverable.
Those are only some of the bets that Tim has made. Each bet has a consequence. Sometimes he’ll win. Sometimes he’ll lose. But what happens if Tim guesses wrong?
I’m a fan of the movie, “The Princess Bride.” If you haven’t seen the movie, click on the link below to see a short clip of what can happen when you guess wrong.
The Princess Bride: The Man in Black in a battle of wits with Vizzini.
Actually, like the movie, the story I just told you is fantasy. Tim is real, but I made up the rest. In reality, Tim made a very different bet. He bet on Axxana. With one very good bet, he avoided making hundreds of bad ones.
Three Observations to End the Debate
Last week I posted a blog: Protecting Consistency Groups Against Human Error. I decided to see what other people were saying, so I did a little browsing around message boards, user groups, and forums. Back in 2010, W. Curtis Preston of Backup Central got into a lively debate with Scott Waterhouse of EMC, with Curtis stating emphatically, “Crash Consistent Backups Aren’t Good Enough,” and with Scott responding that they work, but “wouldn’t it be ideal if you could do better.”
There’s plenty of concern about the ability to reliably recover applications from crash-consistent copies of data. Of course, it’s different for every application and every environment. Here’s some advice from an EMC message board on how to ensure recovery using RecoverPoint with Oracle:
Many customers and field personnel use RecoverPoint to constantly and successfully access and bring up both application and crash-consistent copies of Oracle every day.
If the Oracle database is setup correctly in RecoverPoint in terms of consistency groups and ensuring that the target volumes are only accessible to ONE mount host and the other host-level best practices are followed, then I would expect you to have no issues.
Click here for the full discussion.
There’s also a helpful discussion by Mike Rothouse on recovering Oracle data from NetApp storage using NetApp’s crash-consistent Snapshot copies.
If a database has all of its files (control files, data files, online redo logs, and archived logs) contained within a single NetApp volume, then the task is straightforward. A Snapshot copy of that single volume will provide a crash-consistent copy.
Click here for the full discussion.
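The single-volume condition in that advice is easy to check mechanically. Here is a minimal sketch in Python, assuming you already have a mapping from each database file to the storage volume that holds it. The file paths, the volume names, and both function names are hypothetical, made up for illustration, and the code does not call any NetApp or Oracle API.

```python
from collections import defaultdict


def group_files_by_volume(file_to_volume):
    """Group database file paths by the volume that holds them."""
    by_volume = defaultdict(list)
    for path, volume in sorted(file_to_volume.items()):
        by_volume[volume].append(path)
    return dict(by_volume)


def single_volume_snapshot_is_enough(file_to_volume):
    """True only when every listed file lives on exactly one volume."""
    return len(group_files_by_volume(file_to_volume)) == 1


# Hypothetical layout: the archived logs sit on a second volume.
layout = {
    "/u01/oradata/control01.ctl": "vol_ora",
    "/u01/oradata/users01.dbf": "vol_ora",
    "/u01/oradata/redo01.log": "vol_ora",
    "/u02/arch/arch_0001.arc": "vol_arch",
}

if not single_volume_snapshot_is_enough(layout):
    # More than one volume is involved, so a snapshot of a single
    # volume would not capture the whole database.
    for volume, paths in group_files_by_volume(layout).items():
        print(volume, paths)
```

With this made-up layout the check fails, because one file class lives on a different volume. That is the kind of placement restriction the advice implies.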
My key observations are:
1. If you depend on crash-consistent copies of data, then it may work some of the time for some of your applications, but it won’t work all of the time for all of your applications.
2. Best practices for recovering from crash-consistent Snapshot copies restrict your options for data placement and volume management.
3. Applications, systems, and IT processes are in constant flux, so if you need to set up consistency groups “correctly” in order to ensure recovery, you are creating an inherent human-factor risk.
I don’t know about you, but with the pace of change, the cost of training, and the strain on staffing in today’s IT shops, I would opt for solutions that work the same way across all applications, automatically adapt to changes in the environment, and reduce the risk of human error.
Protecting Consistency Groups Against Human Error
What’s worse than losing your data?
Losing your data and having no backup.
What’s worse than having no backup?
Having a backup that restores inconsistent data.
That’s precisely the concern that Josh Kirsher raised on the April 10 Wikibon Peer Incite. A lot of people are buying insurance, in the form of snapshots of application data, and they leverage consistency groups, thinking this will ensure that the data is application-consistent. It’s the application-consistent snapshot that companies use as source volumes for off-site backups and asynchronous replication, and as on-premises application recovery points. And it’s consistency groups that enable applications to be restored in minutes rather than hours or days. Unfortunately, consistency groups only work when procedures are perfectly designed, when they are perfectly followed, when they are constantly maintained, and when no one makes an error.
In today’s dynamic environment, servers are virtualized, applications are frequently moved from one physical server to another, LUNs are quickly created, and volumes are added to and removed from LUNs on a daily basis. Under those conditions, the probability of developing a perfect consistency-group process that is precisely followed and continuously maintained, without introducing any human error, is very low. That means that when you need to call upon your insurance, the snapshot or backup that you assume is application-consistent, the probability is very high that the data will in fact be inconsistent, and the time to restore consistent application data from paper source documents will be measured in days, not minutes or hours. Companies that primarily transact business electronically may not be able to reconstruct the data at all. This is the scenario that Tim Hays, of Animal Health International, avoided when he made the decision to protect everything. After all, if he could affordably protect everything, he didn’t have to worry about what he might miss.
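To make the human-factor risk concrete, here is a minimal sketch in Python of the kind of drift that creeps in when volumes change but the consistency group definition does not. It assumes you can list the volumes an application currently uses and the volumes currently in its consistency group. The volume names and function names are hypothetical, and no storage vendor’s API is used.

```python
def consistency_group_drift(app_volumes, group_volumes):
    """Compare what the application uses with what the group protects."""
    app = set(app_volumes)
    group = set(group_volumes)
    return {
        # Used by the application but missing from the group.
        "unprotected": sorted(app - group),
        # Still in the group but no longer used by the application.
        "stale": sorted(group - app),
    }


def is_in_sync(app_volumes, group_volumes):
    """True when the group matches the application's volumes exactly."""
    drift = consistency_group_drift(app_volumes, group_volumes)
    return not drift["unprotected"] and not drift["stale"]


# Hypothetical example: someone added a log volume last week and
# nobody updated the consistency group.
app_volumes = ["vol_data01", "vol_data02", "vol_logs01"]
group_volumes = ["vol_data01", "vol_data02", "vol_old_tmp"]

print(consistency_group_drift(app_volumes, group_volumes))
print(is_in_sync(app_volumes, group_volumes))
```

A check like this only reports drift at the moment it runs. It still depends on someone running it and acting on the result, which is the same human-factor risk described above.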










