Disaster Recovery – From Practice to Theory?

I was recently invited to give a talk at a research institution about the products we are developing at Axxana. This forced me to step back and look at Disaster Recovery from a more “rigorous” point of view. Here are some key observations. First, let me start with a disclaimer. Our initial focus is on transactional environments where data loss translates very directly into lost revenue and reputation damage, and in some extreme cases can lead to business closure. We do not focus on large-scale systems that provide “eventual consistency,” but on classical environments built around strong consistency models.

The traditional world of Data Replication

This world has been dominated for the last two decades by online remote data replication technologies, initially provided mostly by storage vendors, and later also by middleware and applications. Replication solutions come in two flavors: (1) synchronous and (2) asynchronous. In my many years in R&D of storage systems in general, and advanced copy functions in particular, we conditioned our customers to accept the “obvious” trade-off between the two. Synchronous replication is the “ideal” answer, as it guarantees consistency, high availability and zero data loss. But it has a major drawback: it directly impacts application performance, and the impact is proportional to the distance between the primary and secondary sites, effectively limiting it to tens of kilometers at most. It also requires costly infrastructure, such as dark fiber between the sites. Asynchronous replication, on the other hand, removes the distance limitation, but does so at the cost of data loss. Customers learned to live with this oversimplification of the world. We made it very easy to understand…

And if you do not believe the storage vendors, you can surely trust the Securities and Exchange Commission, which in 2003 published the “Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System,” stating: “There is general consensus that the end-of-business-day recovery objective is achievable for firms that play significant roles in critical markets, although many state that this is possible only if firms are able to utilize synchronous data storage technologies, which can limit the extent of geographic separation between primary and back-up sites.” There you have it! As a CIO, if I implement synchronous replication, I can guarantee that my IT infrastructure will protect the business against data loss and allow fast recovery times.
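
To put the distance penalty in numbers, here is a back-of-the-envelope sketch (my own illustration, assuming light travels at roughly 200,000 km/s in fiber and that each write costs one network round trip; real deployments add switching and storage latencies on top of this):

```python
# Rough estimate of the per-write latency added by synchronous replication.
# Assumptions (not from the original text): light travels ~200,000 km/s in
# fiber, and each write requires one network round trip to the secondary site.

def sync_replication_latency_ms(distance_km: float) -> float:
    """Added latency (ms) for one synchronous write over the given distance."""
    speed_of_light_in_fiber_km_per_ms = 200.0  # ~200,000 km/s
    round_trip_km = 2 * distance_km
    return round_trip_km / speed_of_light_in_fiber_km_per_ms

for d in (10, 50, 100, 500):
    print(f"{d:>4} km -> +{sync_replication_latency_ms(d):.2f} ms per write")
# 10 km adds ~0.1 ms and 500 km adds ~5 ms, before any switching and storage overheads.
```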

Theory behind the practice

So much for practice… let’s take a look at (some of) the theory behind these statements. In the fall of 1998, Eric Brewer formulated the then CAP Conjecture (now Theorem), which states that any networked shared-data system can have at most two of three desirable properties:

  • Consistency (C) equivalent to having a single up-to-date copy of the data
  • High availability (A) of that data (for updates)
  • Tolerance to network partitions (P)

Figure 1 – The CAP Theorem
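
As a toy illustration of that forced choice (my own sketch, not part of Brewer’s formulation), consider a replica that detects a partition: it must either reject writes, preserving consistency at the expense of availability, or accept them locally and let the copies diverge:

```python
# Toy illustration of the CAP trade-off: during a partition, a replica must
# either refuse writes (stay consistent, lose availability) or accept them
# locally (stay available, risk divergence). Names here are illustrative only.

class Replica:
    def __init__(self, prefer_consistency: bool):
        self.data = {}
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, key, value):
        if self.partitioned and self.prefer_consistency:
            raise RuntimeError("rejecting write: peer unreachable (choosing C over A)")
        # Either there is no partition, or we choose availability and accept the
        # write locally, knowing the remote copy may now be out of sync.
        self.data[key] = value

cp_replica = Replica(prefer_consistency=True)
ap_replica = Replica(prefer_consistency=False)
cp_replica.partitioned = ap_replica.partitioned = True
ap_replica.write("balance", 100)   # accepted; copies may diverge
# cp_replica.write("balance", 100) # would raise: consistency chosen over availability
```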

Add to the CAP theorem the observation, made by Claus Mikkelsen, that disasters are not atomic. Mikkelsen coined the term “Rolling Disaster” to describe disasters with distinct beginning and end points, be they several milliseconds or minutes apart. A typical example is a network failure between the primary and secondary sites, followed several minutes later by a disaster at the primary site. Applying the CAP theorem, in order to ensure consistency of the data in the presence of network partitioning, the storage systems must freeze (with the obvious impact on availability); otherwise the data copies will fall out of sync. However, network outages are not rare. Most CIOs, weighing the loss of data and application availability on every network failure against the risk of data inconsistency in the case of a disaster, will choose availability: data access is not interrupted in the presence of network partitioning, at the price of letting the data copies become inconsistent.
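
To get a feel for the exposure, here is a rough sketch of the lag that accumulates in such a Rolling Disaster (the transaction rate and timing figures are made-up assumptions, purely for illustration):

```python
# Rough illustration of a "Rolling Disaster" under asynchronous replication.
# Assumptions (illustrative only): a constant transaction rate, a partition
# that starts some minutes before the primary site is destroyed, and no other
# surviving copy of the un-replicated transactions.

def transactions_lost(tx_per_second: float, partition_seconds: float,
                      steady_state_lag_tx: float = 0.0) -> float:
    """Committed transactions missing at the secondary when the primary is lost."""
    # While the link is down the primary keeps accepting writes (availability
    # is chosen over consistency), so the lag grows linearly with time.
    return steady_state_lag_tx + tx_per_second * partition_seconds

# E.g. 500 tx/s and a 3-minute gap between the network failure and the disaster:
print(transactions_lost(tx_per_second=500, partition_seconds=180))  # 90,000 transactions
```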

Meet Axxana

Adopting a proven hardware technology from the aircraft industry, the well-known black box, we have created a new approach to data protection that attempts to bridge the two worlds of synchronous and asynchronous replication.

Figure 2 – An aircraft black box

The Axxana Black Box is a full-fledged compute/storage/network node hardened to sustain extreme environmental conditions:

  • Direct fire of up to 2000°F (1100°C) for one hour
  • 482°F (250°C) for 6 hours
  • 40G shock
  • 5,000 pounds (2.3 tons) of weight
  • 30 feet (10 m) of water pressure
  • Piercing force of a 500 pound (230 kg) rod with a cross-section of 0.04 in², dropped from a height of 10 feet

In an environment protected by asynchronous replication, we place the Axxana Black Box in the primary datacenter, adjacent to the primary storage and/or database or applications. The Axxana Black Box stores the “lag” between the primary and secondary sites, namely the committed transactions that have not yet reached the secondary site. Unlike aircraft black boxes, which must be physically retrieved and have their data extracted manually, the Axxana Black Box autonomously transmits the information missing at the remote secondary site when a disaster strikes, ensuring zero data loss at any distance with fast recovery times.
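
Conceptually, recovery then becomes a merge of the secondary’s lagging copy with the tail of transactions preserved in the black box. The following is a minimal sketch of that idea only; the data structures and function names are hypothetical, not Axxana’s actual interfaces:

```python
# Conceptual sketch of zero-data-loss recovery with a "black box" at the
# primary site. The black box durably keeps every committed transaction that
# has not yet been acknowledged by the secondary; after a disaster it
# transmits exactly that tail. Names and structures here are hypothetical.

def recover_secondary(secondary_log: list, black_box_log: list) -> list:
    """Bring the secondary up to date by appending the un-replicated tail."""
    last_applied = secondary_log[-1]["seq"] if secondary_log else 0
    missing_tail = [tx for tx in black_box_log if tx["seq"] > last_applied]
    return secondary_log + missing_tail  # RPO = 0: no committed transaction lost

secondary = [{"seq": 1, "op": "credit"}, {"seq": 2, "op": "debit"}]
black_box = [{"seq": 2, "op": "debit"}, {"seq": 3, "op": "credit"}, {"seq": 4, "op": "debit"}]
print(recover_secondary(secondary, black_box))  # sequences 1..4, nothing missing
```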

Figure 3 – The Axxana Black Box

Challenges developing the Axxana solution

One set of challenges is obviously developing a black box that can continue operating uninterrupted under the harsh environmental conditions described above. Unlike an aircraft black box, which after a disaster is limited to transmitting a beacon signal so that it can be retrieved and its information later extracted in the lab, the Axxana Black Box needs to keep operating and transmitting the information stored in it through a variety of means: wired communications, Wi-Fi or, in the worst case, cellular 4G LTE technology, which has proven to be the most resilient in the presence of disasters. The second set of challenges is software related. The Axxana solution needs to have minimal effect on production, requiring very high performance (high throughput and low latency) during normal, protection mode. On the other hand, the physical dimensions of the black box and the harsh environmental conditions it needs to sustain constrain the resources available in terms of CPU, memory, storage and networking. During a disaster, it needs to switch modes and become very efficient in its power consumption, turning off devices when not in use. It also needs to be very efficient in its communication protocols, as bandwidth can be very limited and connections intermittent. Last but not least, it needs to support multiple applications, such as Oracle, MS SQL, File Server, Storage, … and, being a general-purpose compute platform, it enables third parties to create “plug-ins” that protect their applications. A great example of this approach is being implemented with Oracle’s Far Sync product.
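
As a simple illustration of that “variety of means,” a post-disaster fallback order might look like the following (a sketch under my own assumptions; the link names and probing callback are hypothetical):

```python
# Hypothetical sketch of post-disaster link selection: prefer the fastest
# surviving channel, and fall back to cellular as the most disaster-resilient
# last resort. The probe callback and link names are illustrative only.

PREFERRED_ORDER = ["wired", "wifi", "cellular_lte"]

def pick_link(link_is_up) -> str:
    """Return the first usable link in preference order, or raise if none."""
    for link in PREFERRED_ORDER:
        if link_is_up(link):
            return link
    raise RuntimeError("no communication channel available; retry later")

# Example: only the cellular modem survived the event.
print(pick_link(lambda link: link == "cellular_lte"))  # -> "cellular_lte"
```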

Phoenix OS – The Software Technology behind Axxana

Axxana’s Phoenix software provides a set of infrastructure/core technologies in support of the many applications that can run and be protected by Axxana. Some examples include:

  • Disaster-Aware Storage: During a disaster, Phoenix OS does aggressive caching so it can power down the disks while waiting for communications to be established.
  • Disaster-Optimized Communications: We implement a communication protocol optimized for disaster conditions, with selective compression based on power and bandwidth availability (see the sketch below).
  • Disaster-Sensitive Process Scheduling: To cope with the reduced resources (power, communication bandwidth) available during a disaster, Phoenix OS implements an optimized approach to process scheduling that minimizes the recovery time objective (RTO) for higher-priority applications versus less critical ones.
  • Automatic Disaster Detection: Phoenix OS leverages power-source failure detection and communications-variance sensors for early detection of disasters and preemptive handling.
  • Smart Data Prefetch: When a disaster strikes, Phoenix OS runs a data prefetching algorithm that ships data ahead of time, before it is requested by the application, significantly reducing the effective RTO.
  • Disaster-Driven Space Optimization: Phoenix OS monitors data transfer between the primary and secondary sites to optimize storage use on the black box.
  • Disaster-Optimized Computing: Phoenix OS optimizes the use of CPU cores and reduces CPU frequency to minimize power usage during a disaster.
  • Highest Level of Security: Access to the black box after a disaster is secured to ensure that only authorized personnel can access the data inside.

This set of services is used by the applications running on top of the Black Box in order to make them “disaster-resilient”.
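
To make one of these services concrete, the selective-compression decision in the Disaster-Optimized Communications service might conceptually look like this (a hypothetical sketch with made-up thresholds, not Phoenix OS internals):

```python
# Conceptual sketch of disaster-optimized, selective compression: spend CPU
# (and therefore battery) on compression only when the link is slow enough
# for it to pay off. Thresholds and names are illustrative assumptions.

def should_compress(bandwidth_kbps: float, battery_pct: float,
                    min_battery_pct: float = 20.0,
                    slow_link_kbps: float = 1000.0) -> bool:
    """Compress only on slow links, and never when the battery is critically low."""
    if battery_pct < min_battery_pct:
        return False  # preserve power: send raw data, however slowly
    return bandwidth_kbps < slow_link_kbps  # slow link: compression is worth the CPU

print(should_compress(bandwidth_kbps=300, battery_pct=80))    # True: slow link, plenty of power
print(should_compress(bandwidth_kbps=300, battery_pct=10))    # False: conserve the battery
print(should_compress(bandwidth_kbps=50000, battery_pct=80))  # False: fast link, skip the CPU cost
```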

Putting it all together

Axxana’s Black Box technology adds Zero Data Loss capabilities to asynchronous replication, i.e., it reduces the Recovery Point Objective to zero, and it also significantly reduces the Recovery Time Objective because the systems are brought back to an up-to-date, consistent state (at the storage and/or application level). For environments running asynchronous data replication, Axxana adds an important layer of protection (Zero Data Loss) that was deemed infeasible at long distances, as evidenced by the Securities and Exchange Commission’s 2003 paper. For environments running synchronous replication, new technologies such as all-flash arrays are amplifying the distance limitations of the traditional synchronous replication solution; Axxana’s solution can provide the same protection without incurring the network latency costs. In addition, it protects the customer from data inconsistency in Rolling Disaster scenarios, a risk that most customers have “silently” accepted in order to avoid impacting availability. Having said that, Axxana cannot break the CAP theorem, but it can significantly change some of the rules that were treated as hard limitations in the past, as exemplified in Eric Brewer’s more recent article (CAP Twelve Years Later: How the “Rules” Have Changed, published by the IEEE Computer Society, February 2012).