Top data visualisation and management tools
Your best resources for plowing through mountains of data
By Sharon Machlis | Computerworld US | Published: 12:00, 25 April 2011
You may not think you've got much in common with an investigative journalist or an academic medical researcher. But if you're trying to extract useful information from an eveR increasing inflow of data, you'll likely find visualisation useful, whether it's to show patterns or trends with graphics instead of mountains of text or to try to explain complex issues to a non-technical audience.
There are many tools around to help turn data into graphics, but they can carry hefty price tags. The cost can make sense for professionals whose primary job is to find meaning in mountains of information, but you might not be able to justify such an expence if you or your users only need a graphics application from time to time, or if your budget for new tools is somewhat limited. If one of the higher priced options is out of your reach, there are a surprising number of highly robust tools for data visualisation and analysis that are available at no charge.
Here's a rundown of some of the better known options, many of which were demonstrated at the Computer-Assisted Reporting (CAR) conference last month. Others are not as well known but show great promise. They range from easy enough for a beginner (ie, anyone who can do rudimentary spreadsheet data entry) to expert (requiring hands-on coding). But they all share one important characteristic: They're free. Your only investment: time.
Related Articles on Techworld
Before you can analyse and visualise data, it often needs to be "cleaned." What does that mean? Perhaps some entries list "New York City" while others say "New York, NY" and you need to standardise them before you can see patterns. There might be some records with misspellings or numerical data entry errors. The following two tools are designed to help get your data in tip-top shape to be analysed.
What it does: This web-based service from Stanford University's Visualisation Group is designed for cleaning and rearranging data so it's in a form that other tools such as a spreadsheet app can use.
Click on a row or column, and DataWrangler will suggest changes. For example, if you click on a blank row, several suggestions pop up such as "delete row" or "delete empty rows."
There's also a history list that allows for easy undo.
Drawbacks: I found that unexpected changes occurred as I attempted to explore DataWrangler's options. I constantly had to click "clear" to reset. And not all suggestions are useful ("promote row to header" seemed an odd suggestion when the row was blank) or easy to understand ("fold split 1 using 2 as key").
And while the fact that DataWrangler is a web-based service makes it convenient to use, don't forget that it sends your data off to an external site, which means it isn't an option for sensitive internal information. However, there are plans for a future release of a standalone desktop version. Another important thing to keep in mind is that DataWrangler is currently alpha code, and its creators say it's "still a work in progress."
Skill level: Advanced beginner.
Runs on: Any web browser.
What it does: Google Refine can be described as a spreadsheet on steroids for taking a first look at both text and numerical data. Like Excel, it can import and export data in a number of formats including tab- and comma-separate text files and Excel, XML and JSON files.
Refine features several built-in algorithms that find text items that are spelled differently but actually should be grouped together. After importing your data, you simply select edit cells -> cluster and edit and select which algorithm you want to use.
After Refine runs, you decide whether to accept or reject each suggestion. For example, you could say yes to combining Microsoft and Microsoft Inc, but no to combining Coach Inc with CQG Inc. If it's offering too few or too many suggestions, you can change the strength of the suggestion function.
There are also numerical options that offer quick and easy overviews of data distributions. This functionality can reveal anomalies that might be the result of data input errors, such as $800,000 instead of $80,000 for a salary entry, or it could expose inconsistencies such as differences in the way compensation data is reported from entry to entry, with some showing hourly wages and others showing weekly pay or yearly salaries.
Beyond data housekeeping, Google Refine offers some useful analysis tools, such as sorting and filtering.
What's cool: Once you get used to which commands do what, this is a powerful tool for data manipulation and analysis that strikes a good balance between functionality and ease of use. The undo/redo list of every action you've taken lets you roll back when needed. And text functions handle Java syntax regular expressions, allowing you to look for patterns (such as, say, three numbers followed by two digits) as well as specific text strings and numbers.
Finally, while this is a browser-based application, it works with files on your desktop, so your data remains local.
Drawbacks: Although Google Refine looks like a spreadsheet, you can't do typical spreadsheet calculations with it. For that, you must export to a conventional spreadsheet application. If you've got a large data set, carve out some time in your day to go through all of Refine's suggested changes, since it can take a while. And, depending on the data set, be prepared when looking for text items to merge: You're likely to get either a lot of false positives or missed problems or both.
Skill level: Advanced beginner. Knowledge of data analysis concepts is more important than technical prowess. Power Excel users who understand data cleaning needs should be comfortable with this.
Runs on: Windows, Mac OS X (if it appears to do nothing after loading on a Mac, point a browser manually to http://127.0.0.1:3333/ ), Linux.
Sometimes you need to combine graphical representation of your data with heftier numerical analysis.
What it does: R is a general statistical analysis platform (the authors call it an "environment") that runs on the command line. Need to find means, medians, standard deviations or correlations? R can handle that and much more, including "linear and generalised linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing".
R also graphs, charts and plots results. There are numerous add-ons to this open source project that significantly extend functionality. For users who prefer a GUI, Peter Aldhous, San Francisco bureau chief for New Scientist magazine, suggests RExcel, which offers access to the R engine through Excel.
What's cool: There is a great deal of functionality in R, including quite a number of visualisation options as well as numerical and spatial analysis.
Drawbacks: The fact that R runs on the command line means that users will have to take the time to learn which commands do what, and not all users will be comfortable with a text-only interface. In addition, Aldhous says those dealing with large data sets may hit a memory barrier (if so, there's a commercial option from Revolution Analytics).
Skill level: Intermediate to advanced. Comfort with command line prompts and a knowledge of statistics are a musts for the core application.
Runs on: Linux, Mac OS X, Unix, Windows XP or later.
Learn more: Try R for Statistics: First Steps (PDF) by Peter Aldhous, Hands-on R, a step-by-step tutorial (PDF) by Jacob Fenton, and the project's own An Introduction to R. The R Statistics blog has a number of visualisation samples.
Visualization applications and services
These tools offer a number of different visualisation options. While some stick to conventional charts and graphs, many offer a range of other choices such as treemaps and word clouds. A few offer geographical mapping as well, although if you're interested in maps, our sections on GIS/mapping focus specifically on that.
What it does: This is one of the simplest ways I've seen to turn data into a chart or map. You can upload a file in several different formats and then choose how to display it: table, map, heatmap, line chart, bar graph, pie chart, scatter plot, timeline, storyline or motion (animation over time). It's somewhat customisable, allowing you to change map icons and style info windows.
There are some data editing functions within Fusion Tables, although changing more than a few individual cell entries can quickly become tedious. You can also join tables (which is important when the data you want to map is in multiple tables), and filter, sort and add columns and so on. There are also options to allow others to make comments on the data itself.
Mapping goes beyond just placing points, as many of us are accustomed to with Google Maps. Fusion tables can also map multiple polygons with variations in colour based on underlying data, such as this intensity map showing the percentage of households with Internet access by state from 2007 US Census bureau data.
Unlike IBM's Many Eyes, Google lets you designate your data as private or unlisted as well as public, although your data still resides on Google's servers. This could be either a benefit or drawback, depending on whether server bandwidth costs or data privacy is more important to you.
What's cool: Fusion Tables offers relatively quick charting and mapping, including geographic information system (GIS) functions to analyse data by geography. The service also automatically geocodes addresses, which is useful when trying to place numerous points on a map. This is an excellent tool for beginners and advanced beginners to use to get comfortable with analysing data. It's also a good fit for people who don't program. For more advanced users, there's an API.
Drawbacks: Functionality, customisation and data capacity are all limited compared with desktop applications or custom code, and interacting with large data sets on the site can be sluggish. And it has its limitations. The site choked on March 11, the day of the devastating earthquake and tsunami in Japan. It is still a Google Labs beta project.
Skill level: Beginner.
Runs on: Any web browser.
What it does: Impure is sort of a Yahoo Pipes for data visualisation, designed for creating numerous types of highly polished graphical representations of data using a drag-and-drop workspace. The service includes a library of objects and various methods, and as with Yahoo Pipes, it allows you to click and drag to connect modules so that the output of one becomes the input of another. It was developed by Spanish analytics firm Bestiario.
What's cool: Impure offers a highly visual interface for the task of creating visualisations, which is not as common as you might expect. It has a sleek user interface and numerous modules, including quite a few APIs that are designed to pull data from the web.
It features numerous visualisation types that are searchable by keywords like numeric, tables, nodes, geometry and map. And although it saves your workspaces to the web, you can copy and save the code behind your workspaces locally, so you can back up your work or maintain your own libraries of code snippets.
Drawbacks: Users of Impure face a surprisingly steep learning curve despite its drag-and-drop functionality. The documentation is detailed in some areas, but lacking in others. For instance, while it was easy to find a list of APIs, it was more difficult to find basic instructions on how to use the workspace or even figure out that there was a workspace, let alone how to use the various objects and methods.
Once you save your workspace, it's on the public web, although it's unlikely that anyone else will be able to find it unless you share the URL. And I found some of the samples not all that helpful in understanding the underlying data, even if they were visually striking.
Skill level: Intermediate.
Runs on: Any web browser.
Learn more: To get started, I'd suggest the videos "Interface Basics" (7 minutes) and "Workspaces and Code." You can find a sample called The Pay Gap Between Men and Women Mapped at the website of The Guardian.
What it does: This tool can turn data into any number of visualisations, from simple to complex. You can drag and drop fields onto the work area and ask the software to suggest a visualisation type, then customise everything from labels and tool tips to size, interactive filters and legend display.
What's cool: Tableau Public offers a variety of ways to display interactive data. You can combine multiple connected visualisations onto a single dashboard, where one search filter can act on numerous charts, graphs and maps, underlying data tables can also be joined.
Drawbacks: In the free version of Tableau's business intelligence software, your visualisation and data must reside on Tableau's site. Whenever you save your work, it gets sent up to the public website, which means you can't save work in progress without running the risk that it will be seen before it's ready (while Tableau's site won't deliberately expose your work, it relies on security by obscurity, so someone could see your work if they guess your URL). And once it's saved, viewers are invited to download your entire workbook with data. Upgrading to a single user desktop edition costs $999.
Not surprisingly, all that functionality comes at a cost: Tableau's learning curve is fairly steep compared to that of, say, Fusion Tables. Even with the drag-and-drop interface, it'll take more than an hour or two to learn how to use the software's true capabilities, although you can get up and running doing simple charts and maps before too long.
Skill level: Advanced beginner to intermediate.
Runs on: Windows 7, Vista, XP, 2003, Server 2008, 2003.
Learn more: There are seven short training videos on the Tableau site, where you can also find downloadable data files that you can use to follow along. You can see a sample in our article "Tech Unemployment Climbs; Self-employment Steady."