Lack of data scientists is the new Von Neumann bottleneck
Strata Conference's Founding Chair, Edd Dumbill, talks about bridging the data and information gap
By Brian Proffitt | IT World | Published: 14:44, 31 January 2012
Data is a huge presence within much of business and technology, and the next installment of the O'Reilly Strata Conference will provide attendees a look into the revolutionary ways data is driving, well, everything.
The Winter 2012 edition of the O'Reilly Strata Conference will offer sessions for everyone, from the businessperson trying to figure out just what this whole Big Data thing is all about to the hard-core data scientist wonks who are bringing all this new technology to the fore.
Big Data has gotten a lot of attention in the past couple of years, as Hadoop, Cassandra, MapReduce, and other open source technologies have enabled businesses and governments to use data in ways unheard of when using relational database technology. The Strata Conference is the first and most prominent gathering for any party interested in learning about just what makes big data tick.
And that, according to Founding Chair Edd Dumbill, is part of the whole point of Strata: educating users and data scientists about the benefits and applications of Big Data.
"There are three main themes examined at Strata," Dumbill said in a recent interview. "The increasing of data and the growth of ubiquitous computing are two, which form the start of an arc to the third aspect."
The arc, Dumbill continued, leads to a much higher level of interconnectivity, the so-called "Internet of Things," which describes the billions of objects tagged and otherwise connected to the internet, each providing massive amounts of data to be collected and processed.
But processed by whom? Stored how? And utilised in what manner? Those are the key questions that gatherings like Strata hope to address, particularly the third and last part of the arc: how data is used. This is what Dumbill metaphorically refers to as "data and the final mile."
The "final mile" is likely a familiar term to network engineers: it refers to the all-important connectivity between the end-user and the rest of the internet.
"So it is with data science and analytics within a business," Dumbill said. For data, the "final mile" refers to the capability to properly process data and convey what's really important: information.
The bridge that turns data into information (which can then be used to acquire knowledge) is exactly where the data scientist lives, and it's a skill that is still lacking within this burgeoning field.
Data scientists are described by Strata organisers as being talented in engineering, data management, mathematics, and writing. "The art of storytelling and visualisation are also important," Dumbill explained.
I suggested to Dumbill that an example might be the work of Hans Rosling, who very effectively uses stunning graphics to convey a wealth of information. Dumbill agreed that this was pretty much the same sort of work, though Rosling was not working with truly massive data sets. Data scientists for big data will be able to create models beyond even the work of Rosling.
"The headline here is that there are still very few data scientists to go around," Dumbill said. "The lack of data scientists is our new Von Neumann bottleneck."
Dumbill was quick to emphasise that data scientists do not necessarily have to be all-in-one super geniuses who can do it all. Teams whose members have complementary data science talents are also very effective.
The first Strata conference of the year will be held in Santa Clara, California, from February 28 until March 1. A second conference will be held in New York later this year. The conference will feature a Jumpstart track that will be "the missing MBA of big data" for businesspeople, as well as a Deep Data track into which data scientists can really sink their collective teeth.
"Strata is the home for the data science community," Dumbill explained, "and we're happy to have an oasis of deep geekery as well."
Tracks on Hadoop, which is currently regarded as the Linux of the Big Data world, as well as a showcase for data startups, will also be part of the three-day conference.