EMC takes exception to Sepaton's views

EMC says Sepaton's products are not better than products from Avamar (EMC)

In a recent article Sepaton's chief technology officer, Miki Sandorfi, described how he saw Sepaton's de-duplication advantages against Avamar (owned by EMC) and FalconStor. Both of these companies vigorously dispute many of Sandorfi's assertions. (FalconStor's comments can be seen here.)

EMC has sent me its comments as inserts placed in its copy of the original article text. I'm reproducing below exactly what it sent me.

- - - - - - - - - -

Techworld had the opportunity to discuss de-duplication technology and some other issues with Miki Sandorfi, SEPATON's chief technology officer. He explained how Sepaton's content-aware approach differed from hash-based approaches used by Avamar and FalconStor.

TW: Could you compare and contrast the sub-file de-duplication technologies of Avamar, FalconStor and Sepaton please?
MIKI SANDORFI: The hash-based data de-duplication approach (used by Avamar and FalconStor) is typically used with in-band de-duplication solutions. This model runs incoming data through a hashing algorithm (typically MD5), which results in an identifier that is assumed to be unique to that piece of data. It then compares that hash to previous hashes stored in a lookup table. If a match is found, then the data is discarded and a pointer to the existing hash is added. If it is not found, then the data is added to the lookup table.

The idea is that the lookup table will be populated with many hashes from the backup data, making it more efficient over time. Therefore, the best de-duplication ratios will not be achieved until the hash table is populated. There are other challenges with this approach that are summarized below.
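
The lookup-table model described above can be sketched in a few lines of Python. This is a generic illustration, not any vendor's implementation: the function names `dedupe` and `restore`, the in-memory dictionary and the pre-split chunks are all assumptions made for the example, and the hash algorithm is left as a parameter because the parties disagree about which one is used.

```python
import hashlib

def dedupe(chunks, algorithm="sha1"):
    """Store each distinct chunk once; return (store, recipe)."""
    store = {}    # lookup table: digest -> chunk bytes (hypothetical layout)
    recipe = []   # ordered digests, i.e. the "pointers" needed to rebuild the stream
    for chunk in chunks:
        digest = hashlib.new(algorithm, chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk   # first sighting: keep the data
        recipe.append(digest)       # every sighting: keep only a pointer
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[digest] for digest in recipe)

# Illustrative input: one backup job that already contains a repeated chunk.
first_backup = [b"alpha", b"beta", b"alpha", b"gamma"]
store, recipe = dedupe(first_backup)
```

In this toy run the repeated chunk is stored once even though only a single job has been processed; how well that carries over to real backup streams depends on the chunking and on the data.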

- - - - - - - - - -

EMC RESPONSE - Any de-dupe approach relies upon pre-existing data. In fact, we can de-dupe within the first job whereas they must have a second job to perform de-duplication. Also, the EMC Avamar software uses SHA-1. Avamar operates globally and there is no single lookup table. We can de-duplicate at both source and target, looking at local caches or across Avamar storage nodes that comprise an Avamar grid.

- - - - - - - - - -

Hash-based data de-duplication requires substantial CPU performance because all of the hashes are generated by the CPU in real time. The more granular (e.g. the smaller the size of each piece of data being hashed) the hash, the more CPU-intensive and slow the process becomes.

- - - - - - - - - -

EMC RESPONSE - This is an incorrect statement. The CPU usage of the hashing function is independent of object size; it is a function of the total data set size. There is more lookup with smaller objects, but this is not CPU intensive. In fact, our software reduces weekly CPU load by dramatically reducing the amount of work required on client systems for backup and recovery (by up to 20x). By performing de-duplication at the client, we eliminate the "bottle-necking" of inline approaches.
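
The two quantities being argued over here, bytes hashed and number of lookups, can be counted separately. The sketch below assumes fixed-size chunks and SHA-1, and the function name `hash_workload` is invented for the example. It counts work rather than timing it, so it does not measure per-chunk overhead and does not settle the disagreement.

```python
import hashlib

def hash_workload(data, chunk_size):
    """Return (bytes_hashed, lookups) for fixed-size chunking of `data`."""
    bytes_hashed = 0
    lookups = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        hashlib.sha1(chunk).digest()   # the hashing work itself
        bytes_hashed += len(chunk)
        lookups += 1                   # one table lookup per chunk
    return bytes_hashed, lookups

data = bytes(1_000_000)                  # 1 MB of zero bytes, purely illustrative
large = hash_workload(data, 64 * 1024)   # coarse chunks
small = hash_workload(data, 4 * 1024)    # fine-grained chunks
```

Both runs hash the same one million bytes; the fine-grained run performs 245 lookups against 16 for the coarse one.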

- - - - - - - - - -

Another drawback relates to the size of the hash table and where it resides. Storing the hash table on disk further degrades performance. While storing it in memory increases performance, it requires that the table size (and thus amount of data protected) is constrained by the amount of memory in the system. Hash collisions and subsequent data integrity issues are possible. Maximum backup set size per appliance is also limited. Technologies using this method cannot de-duplicate data across appliances.

- - - - - - - - - -

EMC RESPONSE - The constraint "cannot de-duplicate data across appliances" does not apply to us. Same for "Maximum backup set size." We use a very small memory footprint across all the clients - only during backup operations, which are fast - in order to de-duplicate quickly. Memory available in the Avamar server scales as nodes are added to the Avamar storage grid. We eliminate the avalanche of data at the top of the mountain, rather than de-duplicating inline at the target or the base of the mountain, after the avalanche has formed.

- - - - - - - - - -

FalconStor has taken a mixed approach with its SIR Technology. It uses a ContentAware type approach to identify common objects and then uses hashing to find the redundancies. Because hashing is fundamentally part of finding redundancies, the algorithm is classified as hash-based.

The content-aware approach (used by SEPATON) is entirely different. This approach focuses on out-of-band data de-duplication. Data is backed up to the VTL first. When a backup set has completed, the data de-duplication process begins. This approach allows for unimpeded backup performance since de-duplication is not being performed on incoming data.

- - - - - - - - - -

EMC RESPONSE - This totally ignores LAN/WAN based efficiencies of client side de-duplication. While it allows for unimpeded backup performance, it does not reduce backup times like EMC's approach, which dramatically reduces the amount of work required for backup. They are still forced to move full backups on a recurring basis, so - to reiterate our previous comment - they are de-duplicating at the base of the mountain, after the avalanche has formed.
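
The network argument for de-duplicating at the source can be illustrated with a toy client that sends only chunks the server has not seen. This is a generic sketch, not EMC's protocol: `client_backup`, the shared dictionary standing in for the server, and the 1,000-byte chunks are all assumptions, and the small cost of sending the digests themselves is ignored.

```python
import hashlib

def client_backup(chunks, server_index):
    """Send only chunks whose digest the server does not already hold."""
    bytes_sent = 0
    for chunk in chunks:
        digest = hashlib.sha1(chunk).digest()
        if digest not in server_index:
            server_index[digest] = chunk   # stands in for a network transfer
            bytes_sent += len(chunk)
    return bytes_sent

server_index = {}   # hypothetical server-side store: digest -> chunk
monday = [b"A" * 1000, b"B" * 1000, b"C" * 1000]
tuesday = [b"A" * 1000, b"B" * 1000, b"D" * 1000]   # one chunk changed

sent_monday = client_backup(monday, server_index)
sent_tuesday = client_backup(tuesday, server_index)
```

Here the first job moves 3,000 bytes and the second only 1,000, whereas a scheme that de-duplicates after the data has landed on the target would move the full 3,000 bytes both times.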

- - - - - - - - - -

The other key element of the content-aware approach is that it uses a higher level of abstraction when analyzing backup data. Unlike the previous two approaches, content-aware de-duplication looks at data as objects. Unlike hashing or byte-level comparisons, which try to find redundancies in byte streams, content-aware looks at objects, comparing them to other objects (e.g., Word document to Word document or Oracle database to Oracle database).

- - - - - - - - - -

EMC RESPONSE - A lower level of abstraction allows our software to look across all files and systems for any duplicate data that can be eliminated. We are also content aware, looking at the bytes that make up a file to determine optimal segment boundaries, which maximizes the likelihood of finding and eliminating duplicates.
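
Choosing segment boundaries from the bytes themselves is usually called content-defined chunking. The sketch below is one simple, generic form of it, using a rolling sum over a small window; it is not EMC's algorithm, and the function name, window size, mask and minimum chunk size are arbitrary assumptions chosen to keep the example short.

```python
import random

def content_defined_chunks(data, window=16, mask=0x3F, min_size=32):
    """Split `data` where a rolling sum of the last `window` bytes hits a pattern."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # drop the byte leaving the window
        # Cut when the chunk is long enough and the low bits of the sum are zero.
        if i + 1 - start >= min_size and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing remainder
    return chunks

rng = random.Random(0)                    # fixed seed so runs are repeatable
original = bytes(rng.randrange(256) for _ in range(4096))
edited = original[:100] + b"INSERTED" + original[100:]

chunks_a = content_defined_chunks(original)
chunks_b = content_defined_chunks(edited)
shared = set(chunks_a) & set(chunks_b)    # chunks that would de-duplicate
```

Because boundaries depend on nearby content rather than on absolute offsets, chunks after an insertion will usually realign with those of the unedited data, which is what lets a hash-based system find duplicates despite the shift. With fixed-size chunks, every chunk after the insertion point would differ.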

- - - - - - - - - -

This approach results in an increase in disk space requirements, but provides better performance and de-duplication ratios than the other solutions. It also requires only minimal incremental disk, which is negligible in the overall solution cost.

- - - - - - - - - -

EMC RESPONSE - They are performing byte level comparisons, which can often be more CPU intensive than hashing when accounting for insertions or deletions. In addition, they require all the data to be moved to perform de-duplication which results in far fewer benefits than EMC Avamar customers realize due to our de-duplicating at the source.

- - - - - - - - - -

TW: You've mentioned several differentiating topics. Could you discuss data integrity in more detail, please?
MIKI SANDORFI: Since data de-duplication is modifying data stored on the backup system, it is vital that data integrity be guaranteed at all times. Given the many pointers involved, a data integrity issue can potentially have a cascading, negative impact on many backups.

The problem with hash-based algorithms is that they require that a unique hash be generated for each piece of data. The hash must provide a unique identifier for each given chunk of data. If this is not true then the system will silently corrupt data. This corruption occurs when the de-duplication algorithm mistakenly discards non-redundant data. This error will not be found during the backup. It is only apparent if a restore is attempted on data that includes or has pointers to the incorrectly discarded data. All modern hashing algorithms are susceptible to collisions and consequently, any hash-based data de-duplication approach is susceptible to this problem.

- - - - - - - - - -

EMC RESPONSE - All storage systems and file systems are susceptible to corruption and errors on reads and writes. The likelihood of a hash collision with SHA-1 is extremely low. In the extremely remote event that one does occur, it does not have a cascading effect; it affects only files that share the specific segment. File corruption originating in primary file systems during tape backups is orders of magnitude more likely.
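
The scale of the collision risk both sides refer to can be estimated with the standard birthday-bound approximation. The sketch assumes the hash behaves like an ideal random function and ignores deliberately crafted collisions; the function name and the figure of 2^40 stored chunks are hypothetical values chosen for illustration, not numbers from either vendor.

```python
def collision_probability(num_chunks, hash_bits):
    """Birthday-bound approximation: p is roughly n(n-1)/2 divided by 2^bits."""
    pairs = num_chunks * (num_chunks - 1) // 2   # number of chunk pairs
    return pairs / 2 ** hash_bits

chunks_stored = 2 ** 40   # hypothetical: about a trillion unique chunks

p_md5 = collision_probability(chunks_stored, 128)    # MD5 digests are 128 bits
p_sha1 = collision_probability(chunks_stored, 160)   # SHA-1 digests are 160 bits
```

Under these assumptions the 160-bit figure is smaller than the 128-bit one by a factor of 2^32, which is why the choice of hash algorithm matters to the argument.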

- - - - - - - - - -

Our content-aware algorithm is not susceptible to hash-based data integrity concerns since byte-level comparisons are performed.
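
Why a byte-level comparison removes the dependence on hash uniqueness can be shown with a deliberately weak hash. This is a generic sketch, not SEPATON's algorithm: `weak_hash`, `dedupe_verified` and the bucket layout are invented for the example, and the hash is used only as a hint to narrow down which stored chunks to compare.

```python
def weak_hash(chunk):
    """Deliberately poor 8-bit hash so collisions are easy to trigger."""
    return sum(chunk) % 256

def dedupe_verified(chunks):
    """De-duplicate, treating a hash match only as a hint to compare bytes."""
    buckets = {}   # hash -> list of distinct chunks sharing that hash
    recipe = []    # (hash, position in bucket) for each incoming chunk
    for chunk in chunks:
        key = weak_hash(chunk)
        candidates = buckets.setdefault(key, [])
        for position, stored in enumerate(candidates):
            if stored == chunk:            # byte-level comparison
                break
        else:
            candidates.append(chunk)       # same hash, different bytes: keep both
            position = len(candidates) - 1
        recipe.append((key, position))
    return buckets, recipe

def restore(buckets, recipe):
    """Rebuild the stream from (hash, position) pointers."""
    return b"".join(buckets[key][position] for key, position in recipe)

chunks = [b"ab", b"ba", b"ab"]   # b"ab" and b"ba" collide under weak_hash
buckets, recipe = dedupe_verified(chunks)
```

Even though two different chunks share a hash, both are kept and the restore is exact. The price is that every match costs a full comparison of the candidate data.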

TW: Could you discuss the scalability issue a little more?
MIKI SANDORFI: Data de-duplication allows customers to store dramatically more data online. As customers need to store more data and the system grows, capacity scalability becomes a substantial challenge with other solutions. However, the content-aware method enables them to minimize footprint and management overhead by reducing the number of systems that need to be managed.

As mentioned previously, hash-based algorithms rely on a lookup table that contains all previously seen unique hashes. In most implementations, this hash table is stored in memory to improve performance. As a result, the scalability of many of these systems is limited by the amount of memory and the size limitations of the supported hash table. Although vendors promote higher scalability numbers, they typically require multiple separate units to achieve them due to the limitations of the hash lookup table. Multiple separate units are inherently less efficient because each unit is a separate de-duplication space with its own lookup table. As a result, you gain no efficiencies from shared data de-duplication between systems.
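
The link between table size and protected capacity is simple arithmetic. The sketch below assumes one table entry per unique chunk; the function name, the 8 KiB chunk size and the 28-byte entry (a 20-byte digest plus an 8-byte location) are hypothetical figures for illustration, not measurements of any product.

```python
def index_memory_bytes(unique_data_bytes, chunk_size, entry_bytes):
    """Estimate the memory needed for one lookup-table entry per unique chunk."""
    entries = unique_data_bytes // chunk_size
    return entries * entry_bytes

TIB = 2 ** 40
GIB = 2 ** 30

# Hypothetical sizing: 1 TiB of unique data, 8 KiB chunks, 28 bytes per entry.
needed = index_memory_bytes(1 * TIB, 8 * 1024, 28)
needed_gib = needed / GIB
```

With these assumed figures the table needs 3.5 GiB per TiB of unique data, and halving the chunk size doubles that, which is the trade-off between granularity and capacity being described.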

- - - - - - - - - -

EMC RESPONSE - To be clear, EMC Avamar does not have this issue. Our system is a distributed grid that scales with additional nodes and has a single de-duplication domain.

- - - - - - - - - -

FalconStor indicates that it supports clustering. It is not clear what size data set can be supported with a dual node cluster. Their technology is designed as an add-on technology requiring entirely separate hardware and storage and is not an integrated part of their VTL solution.

SEPATON's content-aware technology creates a content-aware database that incorporates the metadata associated with backups. The database is dynamically scalable and can support 50 PB plus of corporate data and backups of any size.

(Part 2 continued here.)

