IBM wrestles with world's largest storage system
Storage Tank is a grid-based system that could change the future of storage. We look at how the project has been progressing.
By Kieren McCarthy and Chris Mellor, Techworld | Published: 11:00, 06 August 2004
Until last year, storage was arguably the least sexy market in IT. But with the Enron scandal and new laws on data retention on both sides of the Atlantic, it has suddenly jumped to the forefront of everyone's minds.
Storing the ever-increasing amount of data produced every day brings with it three problems: one, how to do it; two, what to store it all on; and three, how on earth to find it later on.
One of the companies scenting a hugely expanding market was IBM, which soon started work on what it hopes will become the de facto storage technology of the future: Storage Tank.
The idea from the outset was to create a technology that would form a vast Storage Area Network (SAN), working seamlessly with different companies' storage devices and able to expand simply and easily. It was a tall order and, as well as running tests in its own labs, IBM publicly announced its plan to run an R&D project at CERN, Europe's nuclear research organisation.
Its Storage Tank work fits into an industry-sponsored R&D arm of CERN's IT department, called Openlab, which CERN hoped would produce some innovative technology that might make it into its final system.
That final system is being designed to support the Large Hadron Collider (LHC) project, a next-generation particle accelerator and the biggest scientific instrument on the planet. The LHC is currently being built in Geneva and is due to go live sometime in 2007. By analysing the 40 million particle collisions per second the machine will produce, scientists hope to find clues to the origins of the universe. However, that process also produces a massive, constant stream of data.
The LHC is calculated to pump out 15 petabytes (PB) of data a year, or 34 terabytes (TB) a day, continuously. The problem will also grow over time, with CERN's head of IT, Wolfgang von Rüden, estimating that by 2010, 100PB of data a year will be produced.
IBM's plan was to build Storage Tank from the ground up and within two years challenge the huge storage system that CERN was developing with various publicly funded research organisations. It didn't get off to a good start, however. Work on the storage system started in April 2003; nearly a year later, in January 2004, CERN reported that Storage Tank had "never completed a successful test" and that it "hangs and crashes".
IBM continued to work on the complex technology, however, and last month proudly announced that Storage Tank had run more than 100 simultaneous SAN File System clients with its 28TB of storage distributed among 10 storage servers. It was a step in the right direction but will most likely prove too late in the day to be included in the CERN project.
The 28TB system tested by IBM would be able to store just 20 hours of the data the LHC is expected to produce in its first year. The test configuration would also have to be scaled up 100 times before CERN could risk using it on the LHC, although IBM claims that the product could handle a fraction of the needs of the LHC Computing Grid (LCG) and still be accepted as part of the overall solution. With von Rüden explaining in April that the aim was to expand the Storage Tank system to 100TB by the end of this year and to reach 1,000TB by 2005, it looks unlikely that the technology will be selected when decisions on the final system build are made in March 2005.
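The "20 hours" figure can be checked on the back of an envelope. The sketch below uses only the numbers quoted in this article; the function names are illustrative, and decimal units (1PB = 1,000TB) are an assumption, since the article does not say which convention is meant. Note that 15PB a year works out nearer 41TB a day than 34TB, so the quoted rates are best read as approximate.

```python
# Back-of-envelope check using only figures quoted in the article.
# Assumption: decimal units, 1 PB = 1,000 TB (the article does not specify).
TB_PER_PB = 1000


def hours_of_data(capacity_tb: float, rate_tb_per_day: float) -> float:
    """Hours of output a store of capacity_tb can hold at a steady daily rate."""
    return capacity_tb / rate_tb_per_day * 24


def daily_rate_tb(petabytes_per_year: float, days_per_year: int = 365) -> float:
    """Average daily output in TB implied by a yearly total in PB."""
    return petabytes_per_year * TB_PER_PB / days_per_year


if __name__ == "__main__":
    # 28TB test system against the quoted 34TB a day
    print(round(hours_of_data(28, 34), 1))   # 19.8
    # Daily rate implied by the quoted 15PB a year
    print(round(daily_rate_tb(15), 1))       # 41.1
```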
IBM's claim that it will "extend Storage Tank's capabilities so it can manage and provide access from any location worldwide to the unprecedented torrent of data... when it goes online in 2007" may be no more than wishful thinking.
This does not mean the technology has failed, however, maintains Brian Truskowski, general manager of IBM's storage software division. "We have always stated that this is a research project," he said, adding that the work done in the past year had already produced successful commercial results. Truskowski is confident the technology will be able to scale up, saying that from the very beginning of the project, his team decided to make expansion a main consideration.
The reason the CERN project has only dealt with 28TB of storage so far, he told us, is more a case of CERN not wishing to spend a fortune on storage hardware that will be out-of-date by the time the project goes fully live in 2007 than of Storage Tank not being up to the job.
As a storage network becomes bigger, it gets harder to keep the filing system out of the way - with more and more pieces added, more information is needed to record where a particular piece of data has gone. However, IBM is continuing to develop its caching technology, so that the amount of questioning a system has to do to find the right material is kept as small as possible. In this sense, it is similar to the Domain Name System (DNS) that the Internet works on - something that you can argue has proven its worth.
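The DNS comparison can be made concrete with a small sketch. This is not IBM's design, which the article does not describe in detail; it only illustrates the general idea of a client remembering where a file lives for a limited time, so that repeat requests do not have to question the metadata server again. The class name, the `lookup` callable, the time-to-live and the file and server names are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachingLocator:
    """Client-side cache in front of a (hypothetical) metadata lookup.

    Like a DNS resolver, it remembers each answer for a limited time
    (ttl_seconds) and only asks the server again once the entry expires.
    """

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup          # stands in for a query to a metadata server
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._cache: Dict[str, Tuple[str, float]] = {}
        self.server_queries = 0        # how often the server had to be asked

    def locate(self, path: str) -> str:
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]            # fresh cached answer, no server query
        self.server_queries += 1
        location = self._lookup(path)
        self._cache[path] = (location, now + self._ttl)
        return location


if __name__ == "__main__":
    # Hypothetical catalogue mapping a file to the server that holds it.
    catalogue = {"/lhc/run1/event-0001.dat": "storage-server-07"}
    locator = CachingLocator(catalogue.__getitem__)
    for _ in range(1000):
        locator.locate("/lhc/run1/event-0001.dat")
    print(locator.server_queries)  # 1
```

The trade-off is the same as with DNS: a longer time-to-live means fewer questions to the server, but a longer wait before clients notice that data has moved.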
Truskowski also explains that the system performs better with large files than with large numbers of small files. "With lots of small files, you have to be more efficient," he said. In an IBM lab, Storage Tank has already scaled up to a quarter-of-a-billion files. As it begins testing with bigger and bigger storage networks, Truskowski said he expects there will be some bumps, but says none of the scientists on the team are currently scratching their heads.
Nevertheless, while Big Blue has argued Storage Tank "will play a pivotal role" and CERN has said it will be "providing key storage technology" in the grid technology at the research centre, it is not yet able to deal with the LHC's vast storage demands.
CERN spokesman Francois Grey admitted to Techworld: "Yes, at the moment [Storage Tank] is not able to meet the requirements [of the LCG]." But he pointed out that currently "no system is able to".
Not that it will impact the Grid project itself. "The Storage Tank results do not reflect directly on our ability to cope with the LHC data in 2007," he said. "The CERN Openlab partnership is about testing and validation of future solutions for Grid technology. In parallel we have a Grid deployment effort for 2007 which addresses the pressing here-and-now needs for 2007."
It may have got off to a slow start but Storage Tank's clever way of tagging material - which can also allow important information to be kept close to hand, while less important information is stored on cheaper storage devices - is proving increasingly useful, and commercially successful.
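As a rough illustration of tag-driven placement, the sketch below maps a tag on each file to a class of storage. It is a minimal example of the general idea only: the tag names, pool names and policy table are all hypothetical and are not taken from Storage Tank itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    name: str
    tag: str  # hypothetical label, e.g. "critical", "routine" or "archive"


# Hypothetical policy table: which pool of storage each tag is sent to.
POLICY = {
    "critical": "fast-disk",      # important data kept close to hand
    "routine": "standard-disk",
    "archive": "low-cost-disk",   # less important data on cheaper devices
}
DEFAULT_POOL = "standard-disk"


def choose_pool(info: FileInfo) -> str:
    """Pick a storage pool from the file's tag, falling back to a default."""
    return POLICY.get(info.tag, DEFAULT_POOL)


if __name__ == "__main__":
    for f in (FileInfo("orders.db", "critical"), FileInfo("2001-logs.tar", "archive")):
        print(f.name, "->", choose_pool(f))
```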
And as for the LCG, CERN refused to rule Storage Tank out altogether, with Grey remarking: "Storage Tank may be part of the picture two or three years down the line after 2007." It had better hurry, however, as the LHC itself will shut down in 2020.