De-duping - a killer app
Commonality factoring can reduce file backup size impressively
What is the best way to store data? The best way in terms of getting access to it is to store the full original data: a 10,000-record database; a 40-slide PowerPoint deck; a 500-cell spreadsheet; or a 50-page set of accounts data. But this is not the best way of storing it if you want to have a backup copy on disk. There, speed of access is not everything. It has to be balanced against size. The smaller the better is the general rule.
Disk is costly compared to tape and you don't want to have vast amounts of nearline (secondary) storage.
How do you achieve this? The traditional method has been compression. You look for repeated byte strings and replace them with a location pointer and a pointer to the byte string in question. It only goes so far. You have to have the data set in memory and you have a block size problem. Just how many bytes in a file do you check sequentially before deciding that there isn't any repetition?
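As a rough illustration of that limit, here is a minimal sketch using Python's standard zlib module, which implements DEFLATE. DEFLATE only looks back 32KB for repeated byte strings; the block sizes and the random_bytes helper are arbitrary choices made for this example, not anything taken from a backup product.

```python
import random
import zlib

rng = random.Random(0)

def random_bytes(n):
    # Random bytes do not compress on their own, so any saving
    # must come from the compressor spotting the repeated block.
    return bytes(rng.randrange(256) for _ in range(n))

near_block = random_bytes(10_000)   # repeat lies 10,000 bytes back: inside the 32KB window
far_block = random_bytes(40_000)    # repeat lies 40,000 bytes back: outside the window

near = near_block * 2
far = far_block * 2

near_ratio = len(zlib.compress(near)) / len(near)
far_ratio = len(zlib.compress(far)) / len(far)
print(f"near repeat: {near_ratio:.2f} of original size")
print(f"far repeat:  {far_ratio:.2f} of original size")
```

The first figure should come out at roughly half, because the second copy is found and replaced with back-references. The second should stay close to 1.0, because the duplicate sits beyond the distance the compressor checks.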
There is also the application problem. Word deals with, well, words. Excel deals with spreadsheet cells and their attributes. E-mail deals with messages which have particular structures and possible attachments. Web pages have their own structures too. A simple compression algorithm looking for repeated byte patterns, relatively short ones at that, won't know anything about the specifics of these different applications.
EMC with Centera popularised content-addressed storage (CAS). Here a mathematical function, or hash, is calculated from a data sequence, and the data is stored only if there is no existing stored example of that hash value. If there is, then you just store a location pointer and the address of the repeated hash value.
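The idea can be sketched in a few lines of Python. This is a toy model only: the ContentStore class and its put and get methods are hypothetical names, and SHA-256 is an assumed choice of hash, since nothing here says which function Centera actually uses.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}   # hex digest -> bytes

    def put(self, data: bytes) -> str:
        # SHA-256 is an assumption made for this sketch.
        address = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this hash has not been seen before.
        self.blocks.setdefault(address, data)
        return address   # the caller keeps this address as its pointer

    def get(self, address: str) -> bytes:
        return self.blocks[address]

store = ContentStore()
first = store.put(b"quarterly accounts")
second = store.put(b"quarterly accounts")   # duplicate content, nothing new stored
third = store.put(b"annual accounts")
print(len(store.blocks), "blocks held for 3 writes")
```

Two writes of the same content return the same address, so the store holds a single instance and every later writer just keeps the pointer.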
Centera hardware is expensive and you need application software to write data to it and read data from it. Centera is no use, say, for backing up remote offices. It's also positioned as a fixed or reference content store, not as a generalised backup-anything facility.
Avamar's Axion is different. It uses compression but it also achieves a single instance store, like Centera, by using a patented Common Factoring algorithm set. The details are darned hard to understand. Simple examples are easy. For example, let's take a 10MB PowerPoint slide deck and back it up to an Avamar Axion system. For point of comparison we'll say there's no compression and we get 10MB in the Axion store - commodity hardware by the way.
Now we alter one slide and back it up again. The Axion stores an extra 378KB. We have a roughly 30:1 compression ratio - perhaps it's better to call it a 30:1 de-dupe ratio - because all the duplicated information in the revised slide deck is rejected by the Common Factoring algorithms and only the changed information is backed up.
We alter the deck again. Same result. As long as that original deck exists and gets amended over time it is never fully backed up again - ever. Each and every Avamar backup after the first one is an incremental backup; it stores only the delta, not the duplicated data.
How is this done?
This is where it gets a bit mysterious. I can refer you to a technical paper written by two Avamar guys: Design and Implementation of a Storage Repository Using Commonality Factoring.
It makes several excellent points: "In achieving benefit from data normalization, many issues must be addressed. Some sort of indexing system or pointer scheme is required. The indexing system itself is subject to concerns regarding scalability, performance, availability, and fault tolerance. The algorithms for identifying common data, factoring commonality, and re-integration of data must exhibit acceptable performance and reliability. With the elimination of redundancy, fault tolerance for the normalized data representation becomes particularly important. If the application involves storage over time, it may be necessary to provide some form of deletion and storage reclamation. Since data elements are shared by potentially unrelated users or applications, reliable and correct deletion becomes a significant design consideration."
Well, yes, indeed.
Here is what I think is a key pair of extracts from the document:-
1. "We define the term “Commonality Factoring System” (CFS) to mean a system that defines and computes atomic units of data, providing a mechanism for data normalization. The atomic units of data themselves will be simply termed atomics."
2. "Avamar has designed and implemented a CFS supporting both fixed and variable sized atomics. Typically, fixed size atomics are used for applications such as databases where the application is implemented around a concept of fixed blocksize."
"For general filesystems or data sets, a variable atomic size algorithm is employed. To partition a data stream into variable sized atomics, we have designed and implemented an algorithm that, for a given input stream, consistently factors that input stream into the same sequence of atomics. When the algorithm is presented with a slightly different input stream (as when a file is modified), it will identify the same atomics up to a point of difference and then resynchronize very quickly following the difference. This client-side algorithm is central to the overall implementation of our CFS."
What it means is that quite astounding levels of backup file size reduction can be achieved; well over 30:1 in many cases. The backup file is a file system and can be searched, by Google say, and users can restore files from it.
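To make the variable-sized atomics in the second extract a little less mysterious, here is a minimal sketch of the general technique usually called content-defined chunking. It is not Avamar's patented algorithm, which the paper does not spell out at this level. The factor function, the WINDOW and DIVISOR values, and the use of CRC32 are all assumptions made for illustration; a production system would normally use a rolling hash updated byte by byte rather than recomputing a checksum per position.

```python
import random
import zlib

WINDOW = 16      # bytes examined when deciding on a boundary (assumed value)
DIVISOR = 256    # a boundary is expected about once every 256 bytes (assumed value)

def factor(data: bytes) -> list:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1 : i + 1]
        # The decision depends only on the last WINDOW bytes, never on the
        # offset, so inserted data shifts boundaries without changing them.
        if zlib.crc32(window) % DIVISOR == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes after the last boundary
    return chunks

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
# Simulate an edit: 17 bytes inserted in the middle of the file.
modified = original[:10_000] + b"one changed slide" + original[10_000:]

old_chunks = factor(original)
new_chunks = factor(modified)
unseen = set(new_chunks) - set(old_chunks)   # chunks a second backup would have to store
print(len(old_chunks), "chunks originally;", len(unseen), "new after the edit")
```

Because each boundary decision looks only at the preceding WINDOW bytes, every chunk that ends before the edit is unchanged, and the chunking falls back into step within a short distance after it. Only the handful of chunks around the inserted bytes are new, which is the behaviour the extract describes as resynchronizing very quickly.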
Agent software in computers with data to be backed up does the data de-duplication and stores the result on a local Axion or sends it across a LAN or WAN to a central one. It means remote offices that took hours to send backup sets to tape across a WAN can do in 15 minutes what used to take 16 hours. (It's an ESG figure so quite believable.)
Avamar's Axion product is currently at v3.5. The company has opened up a London office with Chris Sweetapple as general manager, EMEA. There is a fair amount of service involved in implementing an Axion system. It's for medium to large enterprises. It looks - well, check out Avamar's ESG white paper. I almost feel embarrassed talking about it.
Adaptec is a customer so there's a good place to go to check out its usefulness.
The term 'killer app' is often over-used. Just maybe Avamar's Axion is a killer app for nearline storage.