My favorite definition of ‘Dark Data’ comes from Matt Aslett of 451 Research:
“Data that was previously ignored because of technology limitations”
I’ve recently talked to several big, traditional organizations that are finding lots of dark data down in the cellar.
There’s obviously the unstructured* data that companies have gathered but struggled to analyze in the past. Technologies like Hadoop make it much easier to get useful data from sources such as legal documents, the comments sections of customer surveys, medical research studies, social data, web logs, and many others.
But there’s also a lot of ‘dark’ structured data. It turns out that the number one use of Hadoop in most organizations today is analyzing old data that was previously too costly to process. In addition to providing known value, this data makes for an easy, low-risk pilot project for getting to know Hadoop. The structures are well known, so getting the data out in a usable format doesn’t require complex data exploration or new scripting skills.
Sometimes you may not realize that you have useful data. My favorite example of this is a project at Copenhagen Airport, where a team derived an amazing amount of useful information by crunching the data in the log files of the Wi-Fi routers scattered around the airport.
Passengers’ smartphones “ping” the different routers as they walk through the terminals, even if they don’t actually connect to the network, and the team found they could track passenger movements and behavior to a reasonable level of precision. The data was used to answer facilities questions about typical passenger flows and choke points, but also to help answer more commercial questions such as “which is the most visited area of duty free?”
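To make the idea concrete, here is a minimal sketch of that kind of analysis. It is not the airport team's actual method: the record layout (timestamp, router, hashed device ID), the router names, and the sample data are all made up for illustration. It counts distinct devices seen per router and how often devices move from one router's area to the next.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (seconds since start, router id, hashed device id).
# Real router logs will look different; this layout is an assumption.
PINGS = [
    (0, "security", "dev-a"),
    (5, "security", "dev-b"),
    (60, "duty_free_perfume", "dev-a"),
    (90, "duty_free_perfume", "dev-b"),
    (200, "duty_free_spirits", "dev-a"),
    (400, "gate_b", "dev-a"),
    (420, "gate_b", "dev-b"),
]


def visitors_per_zone(pings):
    """Count the distinct devices that pinged each router."""
    devices = defaultdict(set)
    for _, router, device in pings:
        devices[router].add(device)
    return {router: len(seen) for router, seen in devices.items()}


def zone_transitions(pings):
    """Count moves between routers, following each device in time order."""
    last_zone = {}
    moves = Counter()
    for _, router, device in sorted(pings):
        previous = last_zone.get(device)
        if previous is not None and previous != router:
            moves[(previous, router)] += 1
        last_zone[device] = router
    return moves


if __name__ == "__main__":
    print(visitors_per_zone(PINGS))
    print(zone_transitions(PINGS).most_common(3))
```

The visitor counts speak to the “most visited area” question, while the transition counts give a rough picture of passenger flows between areas.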
The technology barriers have tumbled down, and it’s a good time to get out your bucket and flashlight and go take a look in your corporate cellar!
* Hate that label? Get over it. You and everybody else know what we’re talking about.