{"id":12214,"date":"2011-09-07T14:39:07","date_gmt":"2011-09-07T13:39:07","guid":{"rendered":"http:\/\/timoelliott.com\/blog\/?p=3349"},"modified":"2011-09-07T14:39:07","modified_gmt":"2011-09-07T13:39:07","slug":"hadoop-big-data-and-enterprise-business-intelligence","status":"publish","type":"post","link":"https:\/\/timoelliott.com\/blog\/2011\/09\/hadoop-big-data-and-enterprise-business-intelligence.html","title":{"rendered":"Hadoop, Big Data, and Enterprise Business Intelligence"},"content":{"rendered":"<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"Differences\" src=\"https:\/\/i0.wp.com\/timoelliott.com\/blog\/wp-content\/uploads\/2011\/09\/apple-orange-edw-hana-banner.jpg?resize=690%2C310&#038;ssl=1\" alt=\"Differences\" width=\"690\" height=\"310\" border=\"0\" \/><\/p>\n<p>Many thanks to<a href=\"https:\/\/cw.sdn.sap.com\/cw\/people\/5654\"> William Gardella<\/a>\u00a0and others for the content below:<\/p>\n<p>Traditional enterprise data warehousing and Hadoop\/Big Data are like apples and oranges \u2013 the well-known and trusted approach being challenged by a zesty newcomer (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Orange_(fruit)\" target=\"_blank\">sweet oranges were introduced to Europe sometime in the 16th century<\/a>). Is there room for both? How will these two very different approaches co-exist?<\/p>\n<p>This post is an attempt to summarize the current state of play with Hadoop, \u201cBig Data\u201d and Enterprise BI, and what it means to existing users of enterprise business intelligence. See the list of articles at the end of the post for more detailed materials.<\/p>\n<h3>What is Hadoop?<\/h3>\n<p><a href=\"http:\/\/hadoop.apache.org\/\">Hadoop<\/a> is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. 
It is:<\/p>\n<ul>\n<li>Reliable: The software is fault tolerant: it expects and handles hardware and software failures<\/li>\n<li>Scalable: Designed for massive scale of processors, memory, and locally attached storage<\/li>\n<li>Distributed: Handles replication. Offers a massively parallel programming model, <a href=\"http:\/\/hadoop.apache.org\/mapreduce\/\" target=\"_blank\">MapReduce<\/a><\/li>\n<\/ul>\n<p>Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. And it\u2019s part of a larger framework of related technologies:<\/p>\n<ul>\n<li><a href=\"http:\/\/hadoop.apache.org\/hdfs\/\" target=\"_blank\">HDFS<\/a>: Hadoop Distributed File System<\/li>\n<li><a href=\"http:\/\/hbase.apache.org\/\" target=\"_blank\">HBase<\/a>: Column-oriented, non-relational, schema-less, distributed database modeled after Google\u2019s <a href=\"http:\/\/en.wikipedia.org\/wiki\/BigTable\" target=\"_blank\">BigTable<\/a>. Promises \u201cRandom, real-time read\/write access to Big Data\u201d<\/li>\n<li><a href=\"http:\/\/hive.apache.org\/\" target=\"_blank\">Hive<\/a>: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data<\/li>\n<li><a href=\"http:\/\/pig.apache.org\/\" target=\"_blank\">Pig:<\/a> A platform for manipulating and analyzing large data sets. 
High-level language for analysts<\/li>\n<li><a href=\"http:\/\/zookeeper.apache.org\/\" target=\"_blank\">ZooKeeper:<\/a> A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services<\/li>\n<\/ul>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border-width: 0px;\" title=\"hadoop-architecture\" src=\"https:\/\/i0.wp.com\/timoelliott.com\/blog\/wp-content\/uploads\/2011\/09\/hadoop-architecture.jpg?resize=690%2C294&#038;ssl=1\" alt=\"hadoop-architecture\" width=\"690\" height=\"294\" border=\"0\" \/><\/p>\n<p><em>Image: William Gardella<\/em><\/p>\n<h3>Are Companies Adopting Hadoop?<\/h3>\n<p>Yes. According to a recent Ventana <a href=\"http:\/\/www.businesswire.com\/news\/home\/20110726005639\/en\/Ventana-Research-Survey-Shows-Organizations-Hadoop-Perform\" target=\"_blank\">survey<\/a>:<\/p>\n<ul>\n<li>More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs<\/li>\n<li>More than twice as many Hadoop users report being able to create new products and services and enjoy cost savings beyond those using other platforms; over 82% benefit from faster analyses and better utilization of computing resources<\/li>\n<li>87% of Hadoop users are performing or planning new types of analyses with large-scale data<\/li>\n<li>94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data<\/li>\n<li>Organizations use Hadoop in particular to work with unstructured data such as logs and event data (63%)<\/li>\n<li>More than two-thirds of Hadoop users perform advanced analysis \u2014 data mining or algorithm development and testing<\/li>\n<\/ul>\n<h3>How is it Being Used in Relation to Traditional BI and 
EDW?<\/h3>\n<p>Currently, Hadoop has carved out a clear <a href=\"http:\/\/www.computerworld.com\/s\/article\/358164\/Hadoop_Works_Alongside_RDBMS\" target=\"_blank\">niche next to conventional systems<\/a>. Hadoop is good at handling batch processing of large sets of unstructured data, reliably, and at low cost. It does, however, require scarce engineering expertise, real-time analysis is challenging, and it is much less mature than traditional approaches. As a result, Hadoop is not typically being used for analyzing conventional structured data such as transaction data, customer information and call records, where traditional RDBMS tools are still better adapted:<\/p>\n<blockquote><p>\u201cHadoop is real, but it\u2019s still quite immature. On the \u201creal\u201d side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the \u201cimmature\u201d side, Hadoop is not ready for broader deployment in enterprise data analytics environments\u2026\u201d James Kobielus, Forrester Research.<\/p><\/blockquote>\n<p><a href=\"https:\/\/i0.wp.com\/timoelliott.com\/blog\/wp-content\/uploads\/2011\/09\/hadoop-vs-traditional.png?ssl=1\" target=\"_blank\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; float: right; padding-top: 0px; border: 0px;\" title=\"hadoop-vs-traditional\" src=\"https:\/\/i0.wp.com\/timoelliott.com\/blog\/wp-content\/uploads\/2011\/09\/hadoop-vs-traditional_thumb.png?resize=201%2C164&#038;ssl=1\" alt=\"hadoop-vs-traditional\" width=\"201\" height=\"164\" align=\"right\" border=\"0\" \/><\/a>To considerably over-simplify: if we consider what\u2019s called the 3 \u2018V\u2019s of the data challenge: \u201cVolume, Velocity, and Variety\u201d (and there\u2019s a fourth, Validity), then traditional data warehousing is great at Volume and Velocity (especially with the new analytic architectures), while Hadoop is 
good at Volume and Variety.<\/p>\n<p>Today, Hadoop is being used as a:<\/p>\n<ul>\n<li><strong>Staging layer<\/strong>: The most common use of Hadoop in enterprise environments is as \u201cHadoop ETL\u201d &#8212; preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.<\/li>\n<li><strong>Event analytics layer<\/strong>: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.<\/li>\n<li><strong>Content analytics layer:<\/strong> next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.<\/li>\n<\/ul>\n<p>Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop\/MapReduce, or their intention to provide them \u2013 for example, <a href=\"http:\/\/www.zdnet.co.uk\/news\/infrastructure\/2011\/07\/07\/sybases-iq-data-analytics-gets-parallel-smarts-40093331\/\">SAP has announced that they intend to implement MapReduce in the next version of Sybase IQ<\/a>.<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"Differences\" src=\"https:\/\/i0.wp.com\/timoelliott.com\/blog\/wp-content\/uploads\/2011\/09\/apple-orange-edw-hana-together.jpg?resize=690%2C310&#038;ssl=1\" alt=\"Differences\" width=\"690\" height=\"310\" border=\"0\" \/><\/p>\n<h3>What Does The Future Look Like?<\/h3>\n<p>It\u2019s clear that Hadoop will become a key part of future enterprise data warehouse architectures:<\/p>\n<blockquote><p>\u201cThe bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies\u2019 core EDW architectures is likely to keep growing throughout this decade. 
\u201d <a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-08-hadoop_future_of_enterprise_data_warehousing_are_you_kidding\" target=\"_blank\">James Kobielus, Forrester Research<\/a><\/p><\/blockquote>\n<p>But (despite some of the almost religious fervor of its backers) Hadoop is <a href=\"http:\/\/searchbusinessanalytics.techtarget.com\/news\/2240074279\/Big-data-analytics-fulfilling-the-promise-of-predictive-BI\" target=\"_blank\">unlikely to supplant the role of traditional data warehouse and business intelligence<\/a>:<\/p>\n<blockquote><p>\u201cThere are places for the traditional things associated with high-quality, high-reliability data in data warehouses, and then there\u2019s the other thing that gets us to the extreme edge when we want to look at data in the raw form\u201d\u00a0 Yvonne Genovese, Gartner Inc.<\/p><\/blockquote>\n<p>Companies will continue to use conventional BI for mainstream business users to do ad hoc queries and reports, but they will supplement that effort with a big-data analytics environment optimized to handle a torrent of unstructured data \u2013 which, of course, has been part of the goal of enterprise data warehousing for a long time.<\/p>\n<p>Hadoop is <a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-07-hadoop_what_are_these_big_bad_insights_that_need_all_this_nouveau_stuff\" target=\"_blank\">particularly useful when<\/a>:<\/p>\n<ul>\n<li>Complex information processing is needed<\/li>\n<li>Unstructured data needs to be turned into structured data<\/li>\n<li>Queries can\u2019t be reasonably expressed using SQL<\/li>\n<li>Heavily recursive algorithms are involved<\/li>\n<li>Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing<\/li>\n<li>Machine learning is involved<\/li>\n<li>Data sets are too large to fit into database RAM or discs, or require too many cores (tens of TB up to PB)<\/li>\n<li>Data value does not justify expense of constant real-time availability, such as archives or 
special interest info, which can be moved to Hadoop and remain available at lower cost<\/li>\n<li>Results are not needed in real time<\/li>\n<li>Fault tolerance is critical<\/li>\n<li>Significant custom coding would be required to handle job scheduling<\/li>\n<\/ul>\n<h3>Do Hadoop and Big Data Solve All Our Data Problems?<\/h3>\n<p>Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it\u2019s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today\u2019s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.<\/p>\n<h3>References \/ Suggested Reading<\/h3>\n<ul>\n<li><a href=\"http:\/\/www.computerworld.com\/s\/article\/355363\/Hadoop_Goes_Mainstream_for_Big_BI_Tasks\" target=\"_blank\">Hadoop Goes Mainstream for Big BI Tasks<\/a><\/li>\n<li><a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-03-hadoop_is_it_soup_yet\">Hadoop: Is It Soup Yet?<\/a><\/li>\n<li><a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-06-hadoop_what_is_it_good_for_absolutely_something\">Hadoop: What Is It Good For? Absolutely . . . Something<\/a><\/li>\n<li><a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-07-hadoop_what_are_these_big_bad_insights_that_need_all_this_nouveau_stuff\">Hadoop: What Are These Big Bad Insights That Need All This Nouveau Stuff?<\/a><\/li>\n<li><a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-08-hadoop_future_of_enterprise_data_warehousing_are_you_kidding\">Hadoop: Future Of Enterprise Data Warehousing? 
Are You Kidding?<\/a><\/li>\n<li><a href=\"http:\/\/blogs.forrester.com\/james_kobielus\/11-06-09-hadoop_when_will_the_inevitable_backlash_begin\">Hadoop: When Will The Inevitable Backlash Begin?<\/a><\/li>\n<li><a href=\"http:\/\/www.computerworld.com\/s\/article\/358164\/Hadoop_Works_Alongside_RDBMS\" target=\"_blank\">Hadoop finds niche alongside conventional database systems<\/a><\/li>\n<li><a href=\"http:\/\/searchbusinessanalytics.techtarget.com\/news\/2240074279\/Big-data-analytics-fulfilling-the-promise-of-predictive-BI\" target=\"_blank\">&#8216;Big data&#8217; analytics fulfilling the promise of predictive BI<\/a><\/li>\n<li><a href=\"http:\/\/www.mckinsey.com\/mgi\/publications\/big_data\/index.asp\">Big data: The next frontier for innovation, competition, and productivity<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Traditional enterprise data warehousing and Hadoop\/Big Data are like apples and oranges. Is there room for both? How will these two very different approaches co-exist in the 
future?<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[3],"tags":[100,160,173,198,204,213,350,421,437,556,911],"class_list":["post-12214","post","type-post","status-publish","format-standard","hentry","category-bi-20","tag-analytics","tag-bi","tag-big-data","tag-business-analytics","tag-business-intelligence","tag-businessobjects","tag-data-warehouse","tag-edw","tag-enterprise-data-warehousing","tag-hadoop","tag-sap"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3X9RF-3b0","_links":{"self":[{"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/posts\/12214","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/comments?post=12214"}],"version-history":[{"count":0,"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/posts\/12214\/revisions"}],"wp:attachment":[{"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/media?parent=12214"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ti
moelliott.com\/blog\/wp-json\/wp\/v2\/categories?post=12214"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/timoelliott.com\/blog\/wp-json\/wp\/v2\/tags?post=12214"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}