Many thanks to William Gardella and others for the content below:
Traditional enterprise data warehousing and Hadoop/Big Data are like apples and oranges – the well-known and trusted approach being challenged by a zesty newcomer (sweet oranges were introduced to Europe sometime in the 16th century). Is there room for both? How will these two very different approaches co-exist?
This post is an attempt to summarize the current state of play with Hadoop, “Big Data” and Enterprise BI, and what it means to existing users of enterprise business intelligence. See the list of articles at the end of the post for more detailed materials.
What is Hadoop?
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It is:
- Reliable: The software is fault tolerant: it expects and handles hardware and software failures
- Scalable: Designed for massive scale of processors, memory, and locally attached storage
- Distributed: Handles replication, and offers a massively parallel programming model, MapReduce
Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. And it’s part of a larger framework of related technologies:
- HDFS: Hadoop Distributed File System
- HBase: Column oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable. Promises “Random, real-time read/write access to Big Data”
- Hive: Data warehouse system that provides SQL interface. Data structure can be projected ad hoc onto unstructured underlying data
- Pig: A platform for manipulating and analyzing large data sets. High level language for analysts
- ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
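To make the MapReduce model mentioned above concrete, here is a minimal sketch of a word-count job in plain Python (an illustration of the programming model only, not the actual Hadoop Java API): the map step emits key/value pairs from each input record, and the reduce step aggregates all values for a key. On a real cluster, Hadoop runs many mappers and reducers in parallel across HDFS blocks and handles the shuffle over the network.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in an input line
    for word in record.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: aggregate all the counts emitted for one key
    return (key, sum(values))

def run_job(records):
    # Shuffle: group intermediate pairs by key (on a real cluster,
    # Hadoop performs this between the map and reduce stages)
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

logs = ["error disk full", "error network down", "info disk ok"]
print(run_job(logs))  # e.g. {'error': 2, 'disk': 2, 'full': 1, ...}
```

The appeal of the model is that map and reduce are pure functions over independent chunks of data, which is what lets the framework distribute them across a cluster of commodity machines.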
Image: William Gardella
Are Companies Adopting Hadoop?
Yes. According to a recent Ventana survey:
- More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs
- More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings compared with users of other platforms; over 82% benefit from faster analyses and better utilization of computing resources
- 87% of Hadoop users are performing or planning new types of analyses with large scale data
- 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data
- Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%)
- More than two-thirds of Hadoop users perform advanced analysis — data mining or algorithm development and testing
How is it Being Used in Relation to Traditional BI and EDW?
Currently, Hadoop has carved out a clear niche next to conventional systems. Hadoop is good at handling batch processing of large sets of unstructured data, reliably, and at low cost. It does, however, require scarce engineering expertise; real-time analysis is challenging; and it is much less mature than traditional approaches. As a result, Hadoop is not typically being used for analyzing conventional structured data such as transaction data, customer information and call records, where traditional RDBMS tools are still better adapted:
“Hadoop is real, but it’s still quite immature. On the “real” side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the “immature” side, Hadoop is not ready for broader deployment in enterprise data analytics environments…” James Kobielus, Forrester Research.
To over-simplify considerably: if we consider the three ‘V’s of the data challenge – Volume, Velocity, and Variety (and there’s a fourth, Validity) – then traditional data warehousing is great at Volume and Velocity (especially with the new analytic architectures), while Hadoop is good at Volume and Variety.
Today, Hadoop is being used as a:
- Staging layer: The most common use of Hadoop in enterprise environments is as “Hadoop ETL” — preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
- Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
- Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
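The “Hadoop ETL” staging pattern above can be sketched in a few lines of Python. The log format and field names here are hypothetical, and a real deployment would run the parse/filter step as a MapReduce job over HDFS; the point is simply that semi-structured lines are parsed, malformed records are filtered out, and the survivors are reshaped into structured rows a warehouse bulk loader could ingest.

```python
import csv
import io

def parse_event(line):
    # Hypothetical semi-structured log format: "timestamp|user|action"
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None  # filter: drop malformed records
    ts, user, action = parts
    return {"ts": ts, "user": user, "action": action}

def to_warehouse_rows(raw_lines):
    # Transform raw event lines into clean, structured rows
    return [row for row in map(parse_event, raw_lines) if row]

raw = [
    "2011-06-01T10:00|alice|click",
    "garbage line",
    "2011-06-01T10:05|bob|view",
]
rows = to_warehouse_rows(raw)

# Emit CSV that a warehouse bulk loader could ingest
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ts", "user", "action"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because each line is processed independently, this kind of preprocessing parallelizes naturally across a cluster, which is why it has become the most common enterprise entry point for Hadoop.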
Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce, or their intention to provide them – for example, SAP has announced that they intend to implement MapReduce in the next version of Sybase IQ.
What Does The Future Look Like?
It’s clear that Hadoop will become a key part of future enterprise data warehouse architectures:
“The bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade.” James Kobielus, Forrester Research
But (despite some of the almost religious fervor of its backers) Hadoop is unlikely to supplant the role of traditional data warehouse and business intelligence:
“There are places for the traditional things associated with high-quality, high-reliability data in data warehouses, and then there’s the other thing that gets us to the extreme edge when we want to look at data in the raw form.” Yvonne Genovese, Gartner Inc.
Companies will continue to use conventional BI for mainstream business users to do ad hoc queries and reports, but they will supplement that effort with a big-data analytics environment optimized to handle a torrent of unstructured data – which, of course, has been part of the goal of enterprise data warehousing for a long time.
Hadoop is particularly useful when:
- Complex information processing is needed
- Unstructured data needs to be turned into structured data
- Queries can’t be reasonably expressed using SQL
- Algorithms are heavily recursive
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
- Machine learning is involved
- Data sets are too large to fit into database RAM or disk, or require too many cores (tens of TB up to PB)
- Data value does not justify expense of constant real-time availability, such as archives or special interest info, which can be moved to Hadoop and remain available at lower cost
- Results are not needed in real time
- Fault tolerance is critical
- Significant custom coding would be required to handle job scheduling
Do Hadoop and Big Data Solve All Our Data Problems?
Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.
References / Suggested Reading
- Hadoop Goes Mainstream for Big BI Tasks
- Hadoop: Is It Soup Yet?
- Hadoop: What Is It Good For? Absolutely . . . Something
- Hadoop: What Are These Big Bad Insights That Need All This Nouveau Stuff?
- Hadoop: Future Of Enterprise Data Warehousing? Are You Kidding?
- Hadoop: When Will The Inevitable Backlash Begin?
- Hadoop finds niche alongside conventional database systems
- ‘Big data’ analytics fulfilling the promise of predictive BI
- Big data: The next frontier for innovation, competition, and productivity