Can de-dupe data be used for compliance storage?
What if document integrity has to be proved?
The view of Martin Baldock, operations manager at electronic discovery firm Kroll Ontrack, is relevant here.
He thinks the format in which electronic data is stored is not the key issue. In effect, every electronic representation has to reconstruct a file for viewing or printing. What matters is that the content is original, not that the representation is in WORM format.
He said: "We look at the hash value of the file's contents compared to what we know was the original value. We are told, for example, that file A on disk is the original file and we compute its hash value and compare it to other copies of the file to see if it has changed." He couldn't necessarily say what has changed, only that something has.
The hash value is the determinant, and even a change as small as adding an extra space between words is enough to alter it.
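The comparison Baldock describes can be sketched in a few lines of Python. This is an illustration only: the choice of SHA-256, the helper name content_hash and the sample text are all assumptions, since the article does not say which hash algorithm or tooling Kroll Ontrack uses.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample contents standing in for "file A" and two copies of it.
original = b"The quick brown fox"
copy_same = b"The quick brown fox"
copy_altered = b"The quick  brown fox"  # one extra space between two words

# A copy is accepted only if its digest equals the digest of the known original.
matches_original = content_hash(copy_same) == content_hash(original)
altered_matches = content_hash(copy_altered) == content_hash(original)

print("identical copy matches:", matches_original)
print("copy with extra space matches:", altered_matches)
```

As in the article, a mismatch shows only that something has changed. The digest gives no indication of what or where.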
His concern with sub-file-level de-duplication is with the reconstruction of the file when it is needed. "If you are recomputing the file from the components how confident are you that a bit pattern is exactly the same and so will compute exactly the same hash value? It would be a huge burden of concern to me."
Nexsan's Gary Watson is also a strong proponent of hashing as well as other measures to ensure file integrity: "Assureon is highly obsessed with data integrity – files are serialised, stored at least twice on separate RAIDs, and possibly stored on two RAIDs at a DR site, and in all cases are protected with two different hash algorithms which are checked every time the file is touched (plus a dozen other integrity features I won’t bore you with here)."
Referring to de-dupe, he said: "In contrast, a given sub-block (say, of zeros) might be referenced by a million files, and the corruption of this single sub-block could have wide-ranging impact through a wide swath of files. It’s like a failure 'amplifier'. I’m not saying this is an impossible challenge to overcome, but an enterprise-class solution to the problem is non-trivial."
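Watson's "amplifier" effect can be shown with a toy model. The sketch below assumes fixed-size, four-byte chunks indexed by their SHA-256 digest, and the file names and contents are invented. It is not a description of how any real de-duplication product lays out its data.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, unrealistically small, to keep the example readable


def dedupe(files, chunk_size=CHUNK_SIZE):
    """Split each file into fixed-size chunks and store each unique chunk once."""
    store = {}    # chunk digest -> chunk bytes (one copy per unique chunk)
    recipes = {}  # file name -> ordered list of chunk digests
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            store.setdefault(key, chunk)
            recipe.append(key)
        recipes[name] = recipe
    return store, recipes


def rebuild(store, recipe):
    """Reassemble a file from its chunk list."""
    return b"".join(store[key] for key in recipe)


# Three hypothetical files that all end in the same block of zeros.
zeros = b"\x00" * CHUNK_SIZE
files = {
    "a.doc": b"AAAA" + zeros,
    "b.doc": b"BBBB" + zeros,
    "c.doc": b"CCCC" + zeros,
}

store, recipes = dedupe(files)
unique_chunks = len(store)  # the zero block is stored once, not three times

# Simulate corruption of the single shared zero block.
zero_key = hashlib.sha256(zeros).hexdigest()
store[zero_key] = b"\x00\x00\x00\x01"

damaged = sorted(
    name for name, recipe in recipes.items()
    if rebuild(store, recipe) != files[name]
)
print(unique_chunks, "unique chunks;", "damaged files:", damaged)
```

One corrupted chunk damages every file that references it. In a conventional store the same fault would affect only the one file that owned the block.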
The legal holy grail
This seems to be the key point. Whatever form an electronic document is stored in, whether RAIDed and striped or de-duplicated, as long as it can provably be reconstructed in unaltered form it should, in principle, be acceptable in a court of law.
One way to do that is by computing the file's hash value before electronically altering its representation and then re-computing the hash value when the file is to be used for compliance or legal purposes.
If they are the same then the file is good. If they are not then it isn't.
Will they be the same after the file has gone through a de-duplication process?
No-one knows for sure and, until it can be proved that they are the same, the de-dupe doubters have a point.
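The mechanics of the before-and-after check can be sketched as follows. ToyDedupeStore, its eight-byte chunking and the sample document are hypothetical and exist only to show where the two hash computations sit. A toy that passes its own check says nothing about whether a real de-duplication product reconstructs files bit for bit.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyDedupeStore:
    """Hypothetical chunk store that records a whole-file hash at ingest time."""

    def __init__(self, chunk_size=8):
        self.chunk_size = chunk_size
        self.chunks = {}    # chunk digest -> chunk bytes
        self.recipes = {}   # file name -> ordered chunk digests
        self.recorded = {}  # file name -> hash taken before de-duplication

    def ingest(self, name, data):
        # Step 1: hash the file before its representation is altered.
        self.recorded[name] = sha256_hex(data)
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = sha256_hex(chunk)
            self.chunks.setdefault(key, chunk)
            recipe.append(key)
        self.recipes[name] = recipe

    def retrieve(self, name):
        # Step 2: reconstruct the file, re-hash it, compare with the record.
        data = b"".join(self.chunks[key] for key in self.recipes[name])
        return data, sha256_hex(data) == self.recorded[name]


store = ToyDedupeStore()
document = b"Contract text. " * 10  # invented sample content
store.ingest("contract.txt", document)
restored, verified = store.retrieve("contract.txt")

# Damage one stored chunk and check again: the mismatch is detected.
first_key = store.recipes["contract.txt"][0]
store.chunks[first_key] = b"Contrast"
_, verified_after_damage = store.retrieve("contract.txt")

print("verified:", verified, "| after damage:", verified_after_damage)
```

In this sketch the recorded hash is kept inside the same store as the chunks. A real compliance system would presumably hold it somewhere that cannot be altered along with the data, but that is a design assumption, not something the article specifies.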