Hadoop, Big Data, and Enterprise Business Intelligence

Many thanks to William Gardella and others for the content below:

Traditional enterprise data warehousing and Hadoop/Big Data are like apples and oranges – the well-known and trusted approach being challenged by a zesty newcomer (sweet oranges were introduced to Europe sometime in the 16th century). Is there room for both? How will these two very different approaches co-exist?

This post is an attempt to summarize the current state of play with Hadoop, “Big Data” and Enterprise BI, and what it means to existing users of enterprise business intelligence. See the list of articles at the end of the post for more detailed materials.

What is Hadoop?

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It is:

  • Reliable: The software is fault tolerant; it expects and handles hardware and software failures
  • Scalable: Designed for massive scale-out of processors, memory, and locally attached storage
  • Distributed: Handles replication and offers a massively parallel programming model, MapReduce

Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. And it’s part of a larger framework of related technologies:

  • HDFS: Hadoop Distributed File System
  • HBase: Column oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable. Promises “Random, real-time read/write access to Big Data”
  • Hive: Data warehouse system that provides a SQL-like query interface; data structure can be projected ad hoc onto unstructured underlying data
  • Pig: A platform for manipulating and analyzing large data sets. High level language for analysts
  • ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
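The MapReduce model these components build on can be sketched in miniature. The following is a hypothetical, self-contained Python simulation of the map, shuffle, and reduce phases using the canonical word-count example; a real Hadoop job distributes these same phases across the cluster (for instance via the Java API or Hadoop Streaming), but the data flow is the same.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key (Hadoop performs this
    # step across the network, between mappers and reducers)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values observed for one key
    return (key, sum(values))

docs = ["big data big clusters", "big data pipelines"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # 3
```

Because each mapper sees only its own split and each reducer sees only one key group, every phase can run on a different machine, which is what makes the model scale.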

[Figure: Hadoop architecture diagram. Image: William Gardella]

Are Companies Adopting Hadoop?

Yes. According to a recent Ventana survey:

  • More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs
  • More than twice as many Hadoop users as users of other platforms report being able to create new products and services and to enjoy cost savings; over 82% benefit from faster analyses and better utilization of computing resources
  • 87% of Hadoop users are performing or planning new types of analyses with large scale data
  • 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data
  • Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%)
  • More than two-thirds of Hadoop users perform advanced analysis — data mining or algorithm development and testing

How Is It Being Used in Relation to Traditional BI and EDW?

Currently, Hadoop has carved out a clear niche next to conventional systems. Hadoop is good at handling batch processing of large sets of unstructured data, reliably and at low cost. It does, however, require scarce engineering expertise; real-time analysis is challenging; and it is much less mature than traditional approaches. As a result, Hadoop is not typically used for analyzing conventional structured data such as transaction data, customer information, and call records, where traditional RDBMS tools are still better adapted:

“Hadoop is real, but it’s still quite immature. On the “real” side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the “immature” side, Hadoop is not ready for broader deployment in enterprise data analytics environments…” James Kobielus, Forrester Research.

[Figure: Hadoop vs. traditional data warehousing]

To considerably over-simplify: if we consider the “3 Vs” of the data challenge (Volume, Velocity, and Variety, plus a fourth, Validity), then traditional data warehousing is great at Volume and Velocity (especially with the new analytic architectures), while Hadoop is good at Volume and Variety.

Today, Hadoop is being used as a:

  • Staging layer: The most common use of Hadoop in enterprise environments is as “Hadoop ETL” — preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
  • Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
  • Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
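As an illustration of the “Hadoop ETL” staging pattern, here is a hypothetical mapper-style Python function that turns raw semi-structured web-server log lines into structured records ready for loading into a warehouse. The log format and field names are assumptions for the sketch; at scale this parsing would run as a MapReduce job over files in HDFS rather than on a single machine.

```python
import re

# Assumed input format for this sketch (Apache common-log style):
# 127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

def parse_line(line):
    """Map one raw log line to a structured record, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # filter: drop lines the warehouse cannot use
    rec = m.groupdict()
    rec["status"] = int(rec["status"])  # cast numeric fields for loading
    rec["bytes"] = int(rec["bytes"])
    return rec

raw = '10.0.0.5 - - [10/Oct/2012:13:55:36 -0700] "GET /cart HTTP/1.1" 200 512'
record = parse_line(raw)
print(record["path"], record["status"])  # /cart 200
```

Each line is parsed independently, so the work splits cleanly across mappers; the structured output would then be bulk-loaded into the warehouse for conventional BI queries.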

Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce, or their intention to provide them – for example, SAP has announced that they intend to implement MapReduce in the next version of Sybase IQ.

What Does The Future Look Like?

It’s clear that Hadoop will become a key part of future enterprise data warehouse architectures:

“The bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade.” James Kobielus, Forrester Research

But (despite some of the almost religious fervor of its backers) Hadoop is unlikely to supplant the role of traditional data warehouse and business intelligence:

“There are places for the traditional things associated with high-quality, high-reliability data in data warehouses, and then there’s the other thing that gets us to the extreme edge when we want to look at data in the raw form.” Yvonne Genovese, Gartner Inc.

Companies will continue to use conventional BI for mainstream business users to do ad hoc queries and reports, but they will supplement that effort with a big-data analytics environment optimized to handle a torrent of unstructured data – which, of course, has been part of the goal of enterprise data warehousing for a long time.

Hadoop is particularly useful when:

  • Complex information processing is needed
  • Unstructured data needs to be turned into structured data
  • Queries can’t be reasonably expressed using SQL
  • Heavily recursive algorithms are involved
  • Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
  • Machine learning is involved
  • Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB)
  • Data value does not justify expense of constant real-time availability, such as archives or special interest info, which can be moved to Hadoop and remain available at lower cost
  • Results are not needed in real time
  • Fault tolerance is critical
  • Significant custom coding would be required to handle job scheduling
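As a toy illustration of the “heavily recursive” and “can’t be reasonably expressed using SQL” cases above, here is a hypothetical Python sketch of PageRank on a three-page graph, written as repeated map/reduce-style passes. The graph, damping factor, and iteration count are invented for the example; the point is that iterative algorithms like this map naturally onto chained MapReduce jobs but are awkward in a single SQL query.

```python
# Tiny link graph: page -> pages it links to (assumed data for the sketch)
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(links) for page in links}
DAMPING = 0.85

for _ in range(50):  # each iteration is one MapReduce-style pass
    # "map": every page sends rank / out-degree to each page it links to
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)
        for target in outgoing:
            contribs[target] += share
    # "reduce": combine contributions into the new rank for each page
    ranks = {p: (1 - DAMPING) / len(links) + DAMPING * c
             for p, c in contribs.items()}

print(max(ranks, key=ranks.get))  # "c" accumulates the most rank
```

On a real cluster, each iteration would be a separate MapReduce job reading the previous job’s output, which is exactly the kind of custom job chaining the last bullet refers to.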

Do Hadoop and Big Data Solve All Our Data Problems?

Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.

References / Suggested Reading

23 Comments
  1. Thanks Timo, very interesting read! And lots of articles that I need to read too.

    Just FYI, your link to Computer World (niche next to conventional systems) was missing a ‘w’ in ‘www’..

    Cheers, Josh

  2. Thanks for the well-structured and informative post… Hadoop, along with R, will be the next two big things

  3. Great article. It will be interesting to see how the commercial market integrates Hadoop and its family of products into the mainstream EDW and BI implementations already defined in corporations globally. A new generation of BI tools already denies the necessity of an EDW using in-memory caching and streaming instead. These data discovery tools are picking up steam in the enterprise but not as readily supported by IT. The dichotomy that will exist between the end user tools and the way in which IT delivers the data will have to be addressed in the near future.

  4. Great post Timo. I called Hadoop a BYOC BI system, as in bring your own code. It provides MapReduce as a parallel processing framework and a few related tools on top as you note, but ultimately the reason it can do so much pre-processing, heavy lifting, complex expression is that you’re writing it yourself.

    An excellent overview and well needed to help cut through the hype.

  5. Great post. It’s about time somebody wrote it. :)

    Some of the confusion comes from the fact that “big data” is becoming a problem in situations where Hadoop is overkill. 64-bit and multi-core computers have made a single node very powerful; today, the thresholds of volume and processing on a single node have been significantly extended.

    It would be interesting to see how columnar in-memory technologies evolve into distributed architectures. Those would be so powerful, it’s almost scary.

  6. Pingback: Hadoop, Big Data, and Enterprise BI: The Current Situation | DATAVERSITY

  7. Pingback: State of Data #66 « Dr Data's Blog

  8. Pingback: La petite Revue de Presse du Décisionnel | www.LeGrandBI.com

  9. Pingback: Hadoop, Big Data, and Enterprise Business Intelligence « the BPM freak !!

  10. A great summary of the concepts that are buzzing in the air! I liked the analogy and the way you related it to the apples and oranges story…and most importantly the final pic representing the futuristic state… :-)
    Most of the statistics and reports talk about the positive side of Hadoop penetration…but as you rightly mentioned, we cannot just rule out the EDW and the other products which have been “proven and tested” over the years. So, most importantly, it’s the tradeoff one has to make, to use the right concept/product for the right implementation at the right time!
    Thanks once again – interesting read! :-)
    -Pritiman

  11. Great post. A very lucid description of what Hadoop is and where it would be positioned. However, a question that comes to my mind is whether Hadoop would exist as an alternative to the Big Data offerings from the likes of SAP, such as HANA, or Exadata/Exalytics from Oracle. Although I understand that it may not be a direct one-to-one replacement, Hadoop can be a very viable alternative to avoid a pricey product like HANA/Exalytics.

  12. Agreed — customers and consultants tell me they’ve been able to use Hadoop etc to solve niche problems that they’d previously tried to solve more expensively with traditional DBs — without it really being a “big data” problem.

    So far, though, the data seems to show that these types of projects are a negligible percentage of the analytic problems people are trying to solve. Over time, Hadoop etc. will be extended and become more enterprise-friendly, potentially increasing this percentage — but at the same time, the vendors aren’t dumb, and they are very quickly introducing the best of the open-source approaches into their products… Overall, analytics is a fast-growing market, so it’s not a zero-sum game: I expect both open-source and traditional vendors to do well — and the real beneficiaries are businesses trying to do analytics…

  13. Good post Timo, thanks. Interesting to note that InformationWeek has recently published the results of their 2012 BI, Analytics, and Information Management survey. The results are not as optimistic as the ones from Ventana when it comes to Hadoop adoption. They suggest a much lower adoption rate: only 26% of the respondents were using Hadoop or a NoSQL processing platform. The breakdown was as follows:
    3% Extensively use
    11% Limited use
    12% Planned use
    74% No current or planned use
    n=431

    Nevertheless, InformationWeek believes there is a bright future for Hadoop, especially when scalability, flexibility, and affordability matter.

  14. Pingback: Julkalendern – Lucka 9 « Kentor BI

  15. Pingback: Timo explains Big Data, Hadoop

  16. Pingback: Confluence: Abiliton SaaS

  17. Pingback: Confluence: Abiliton SaaS

  18. For those interested in this topic, the upcoming SAP Press book “Enterprise Information Management with SAP” will feature a small section I wrote on Hadoop and EIM. It mostly focuses on Hadoop’s fit in the enterprise, how it relates to HANA, and how different parts of the SAP EIM portfolio will support it.

    Of course, the main purpose of the book is to explore SAP BusinessObjects Enterprise Information Management, including Data Services, Open Text, Master Data Governance, Master Data Management, and Lifecycle Management. In addition to my section, there’s a good discussion of HANA and how it relates to EIM.

    I’m told that you get a 10% discount if you pre-order from the SAP PRESS web store before May 1st. I don’t have any financial interest in the book beside the fact that I work at SAP.

  19. Hi Timo,

    It is indeed a good article, but I believe that beyond the famous 3 Vs that everybody sees in Big Data, there are two more important Vs that need to be added.
    1. Veracity: how do we know that the data we are including is correct and of true value, and not just spam?
    2. Value: I recently gave a pitch to senior sellers on Big Data, and the most amazing thing to them was that we did not talk about technology or about whether one component is better than another. It is all about the customer’s objective, which can be expressed as a goal or value. The value to be achieved will have many sub-values that need to be determined with the customer, and only then do we come to the point of understanding where the data has to come from and what type of solution will be applicable.

    For the moment I have seen many projects that involve data in motion (we call it Streams at IBM) but few that involve Hadoop, since you really have to build it from scratch. In those cases, traditional platforms most of the time deliver value much faster. What Hadoop can save you in licenses… you will pay at least in services.
