Can You Trust Your Big Data?

Big Data is increasingly important to corporate architectures

Despite increasing claims that Big Data is a buzzword that has been overhyped, and that it’s really just data, the subject remains a hot topic in most organizations today.

There are many different definitions of Big Data but one common characterization uses what is known as the 3Vs: data whose volume, variety, or velocity makes it difficult to store and process using traditional corporate IT infrastructures.

For example, in the early 2000s, web companies such as Yahoo and Google struggled to cost-effectively store billions of pages of web data. They turned to new systems that distributed information processing across large numbers of inexpensive servers. These systems afforded easier analysis of unstructured data such as documents, images, and video, and some estimates say around 80-90% of all potentially usable business information may originate in such unstructured forms. Meanwhile, technological advances in real-time analysis of streaming data enabled applications such as faster trading in the finance industry and predictive maintenance of manufacturing machinery.

Over the last decade, these new data sources have gone from niche applications to a mainstream part of modern business. To support this, new architectures are entering corporate environments, such as open source frameworks Hadoop and Spark, and new in-memory enterprise application platforms.

But compared to traditional IT architectures, there’s typically a lot less control and governance built into Big Data systems.

Today, most organizations base their business decision-making on some form of data warehouse that contains carefully curated data. It often takes many years to create a robust data warehouse, to determine how the data should be structured in order to be of most use to the business, and then to create processes to cleanse, conform, and load data from multiple different data systems.

Big Data systems, on the other hand, typically involve raw, unprocessed data in some form of data lake where different data reduction, scrubbing, and processing techniques are then applied as and when needed. This “schema on read” approach contrasts with the traditional “schema on write” approach of data warehousing. It brings some big benefits, notably greater flexibility to deal with diverse and fast-changing data sets, and more scalable processing than traditional approaches. But there are also big disadvantages, chief among them that there is little upfront control over the quality of the information stored in the data lake.
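To make the schema-on-read idea concrete, here is a minimal Python sketch. The sample events, the field names, and the `read_with_schema` helper are all hypothetical, invented purely for illustration. The point is that nothing is checked when the data lands, so quality problems only surface when someone tries to read it.

```python
import json

# Hypothetical raw events as they might land in a data lake.
# Nothing validated them on the way in (no schema on write).
RAW_EVENTS = [
    '{"customer_id": "42", "amount": "19.99", "channel": "web"}',
    '{"customer_id": 43, "amount": 5}',
    '{"customer": "44", "amount": "n/a"}',
    'not even json',
]


def read_with_schema(raw_lines):
    """Apply a schema at read time and return (clean_rows, rejected_lines)."""
    clean, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
            # The "schema": an integer customer_id and a numeric amount.
            row = {
                "customer_id": int(record["customer_id"]),
                "amount": float(record["amount"]),
            }
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields, or values of the wrong type.
            rejected.append(line)
            continue
        clean.append(row)
    return clean, rejected


clean_rows, rejected_lines = read_with_schema(RAW_EVENTS)
print(f"{len(clean_rows)} usable rows, {len(rejected_lines)} rejected")
```

In a schema-on-write warehouse, the two bad records would have been caught (or fixed) during loading. Here they sit in the lake until an analysis trips over them.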

The problem with Big Data

This leads to another V associated with Big Data – veracity. Can business people actually trust the data coming from these new systems?

There are many different aspects to the question, but here are some of the key practical dimensions that organizations should consider when embarking on Big Data projects:

Data completeness and accuracy. For many Big Data sources, the problems start as soon as the data is collected. The recorded values may be approximations rather than exact real values—uncertainty and imprecision might be an inherent part of the process.

For example, consider an organization that wants to improve its marketing decisions by analyzing customer sentiment derived from social media. Languages are full of nuance, making it very hard to accurately determine the underlying sentiment of a tweet or blog post. Even the same emoji can be interpreted in completely different ways by different people.

Internet of things data may also only be approximate, because of cheap sensors with “good enough” accuracy that provide information only at intervals, and that run the risk of missing fleeting events. And with large volumes of sensors, it’s probable that some may not be working at any given moment, or data may be lost in transmission, requiring interpretation of missing values.
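As a rough illustration of what "interpretation of missing values" can involve, the Python sketch below fills gaps in a series of sensor readings by linear interpolation and keeps a flag for every value that was estimated rather than measured. The readings and the `fill_gaps` helper are hypothetical, and linear interpolation is only one possible choice; it is an assumption here, not a recommendation for any particular sensor.

```python
def fill_gaps(readings):
    """Linearly interpolate interior gaps (None) in a list of readings.

    Returns (filled, was_estimated). Gaps at the very start or end are
    left as None because there is no reading on both sides to draw from.
    """
    filled = list(readings)
    was_estimated = [value is None for value in readings]
    known = [i for i, value in enumerate(readings) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (readings[right] - readings[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = readings[left] + step * (i - left)
    return filled, was_estimated


# Hypothetical temperature readings; None marks a lost transmission.
SAMPLE = [20.0, None, None, 23.0, None]
filled, was_estimated = fill_gaps(SAMPLE)
print(filled)
print(was_estimated)
```

Keeping the `was_estimated` flags alongside the data matters for trust: anyone using the series later can see which points were measured and which were inferred.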

Data credibility. Big data often comes from systems not directly controlled by your organization, and may contain inherent biases or outright false values (for example because of bots in social media). And the schema-on-read approach means that you might not be able to ascertain the credibility of the data until after you’ve tried to use it for analysis.

Data consistency. Traditional database systems use sophisticated methods to ensure that users get consistent answers to queries even as new data is being added. In order to achieve higher processing capabilities, some Big Data systems relax these constraints, using eventually consistent systems where two people may get different answers to simultaneous, identical questions—at least in the short term.
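The toy Python model below shows how two people can get different answers to the same question under eventual consistency. It is a deliberately simplified sketch: the `EventuallyConsistentStore` class, its replicas, and its `sync` step are hypothetical and do not reflect how any particular Big Data product replicates data.

```python
class EventuallyConsistentStore:
    """Toy store: writes land on one replica first and reach the rest later."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # The write is acknowledged once replica 0 has it; the others lag.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Background replication catching up.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("orders_today", 101)

# Two users ask the same question at the same moment, via different replicas.
before_sync = (store.read("orders_today", 0), store.read("orders_today", 1))

store.sync()
after_sync = (store.read("orders_today", 0), store.read("orders_today", 1))

print("before sync:", before_sync)
print("after sync:", after_sync)
```

Before replication catches up, the two reads disagree; afterwards they match. Whether that short window of disagreement is acceptable depends on the decision the data is feeding.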

Data processing and algorithms. Big data is typically captured in low-level detail and the extraction of useable information may require extensive processing, interpretation, and the use of data science algorithms. The processing choices made can have a large effect on the final result, and there is a greater potential for bias and incorrect conclusions than with traditional systems. For example, self-driving cars rely on algorithms and image recognition to read traffic signs, and researchers have shown that small cosmetic changes can induce them to read the signs incorrectly.

Data validity. No matter how accurate the underlying data, judgment is still required to determine whether it is an appropriate and useful source of information for a specific business question. For example, social sentiment data would not be a useful measure of customer satisfaction for a segment that never uses social media.

What can you do about it?

For all these reasons, Big Data is necessarily at best a fuzzy source of truth – but that does not mean that it cannot be used for important business decisions. For example, a sharp decline in customer sentiment for a particular product in social media may be a useful leading indicator of poor financial results in the future. Or sensor data can be used to predict and avoid a possible electricity network blackout. These decisions directly affect the running of the organization, and may be subject to regulatory oversight. What can organizations do to strengthen trust, make sure that Big Data is used appropriately, and prove it to others?

Define the ROI for Big Data quality. Many organizations suspect they suffer from poor data quality but have no sensible basis for determining whether the problems warrant any investment in improvements. Organizations should identify areas where Big Data could provide business value, estimate what the impact of poor quality information would be, ascertain the data’s current reliability, and determine how much investment might be required.
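A back-of-the-envelope version of that calculation might look like the Python sketch below. Every figure is a made-up placeholder, and the simple model (decisions per year, share of decisions spoiled by bad data, cost per bad decision) is an assumption for illustration only; real estimates would come from the business.

```python
def expected_annual_loss(decisions_per_year, error_rate, cost_per_bad_decision):
    """Expected yearly cost of decisions that go wrong because of bad data."""
    return decisions_per_year * error_rate * cost_per_bad_decision


# Hypothetical placeholder figures, not real benchmarks.
DECISIONS_PER_YEAR = 1000
COST_PER_BAD_DECISION = 500.0
CURRENT_ERROR_RATE = 0.08   # share of decisions affected today
TARGET_ERROR_RATE = 0.03    # share expected after the quality project
INVESTMENT = 20000.0

loss_before = expected_annual_loss(
    DECISIONS_PER_YEAR, CURRENT_ERROR_RATE, COST_PER_BAD_DECISION
)
loss_after = expected_annual_loss(
    DECISIONS_PER_YEAR, TARGET_ERROR_RATE, COST_PER_BAD_DECISION
)
avoided_loss = loss_before - loss_after

# First-year return on the quality investment.
roi = (avoided_loss - INVESTMENT) / INVESTMENT
print(f"avoided loss: {avoided_loss:.0f}, first-year ROI: {roi:.0%}")
```

Even a crude model like this gives a basis for the conversation: if the avoided loss is smaller than the investment, the improvement project is hard to justify on those numbers alone.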

Robust governance. Even more than with traditional IT architectures, Big Data requires systems for determining and maintaining data ownership, data definitions, and data flows. New data orchestration systems track the full information lifecycle of Big Data and facilitate transfers to more traditional systems in ways that meet the governance and security needs of the enterprise. This information should be securely stored and immutable, so that it is a reliable source of evidence for auditors and regulators. One area of concern is how to deal with what has been called the dark secret at the heart of artificial intelligence: some algorithmic models can’t fully explain or justify why they came to a particular conclusion. Researchers are working on new techniques to address this.

Transparency. The appropriateness of any Big Data source for decision-making should be made clear to business users. Any known limitations of the data accuracy, sources, and bias should be readily available, along with recommendations about the kinds of decision-making the data can and cannot support. Some organizations use named information quality standards for different data sources, such as gold, silver, and bronze, with corresponding levels of support and guarantees.

Training. Big Data analysis is most appropriate for data scientists with deep expertise in the opportunities and limitations of the data and techniques involved. But organizations must ensure that less-skilled business users understand the consequences of using Big Data for decisions in their roles. For example, over-reliance on fundamentally imprecise data as part of a highly visible customer-facing process, or in the calculation of employee objectives and incentives, could lead to negative consequences. To avoid this, organizations should offer training, and restrict access to Big Data to employees who first obtain a company-issued “data driving license” indicating they know how to use the information responsibly.

Ethics. Big Data and the internet of things offer unprecedented opportunities to monitor business processes that were previously invisible. But as the infamous example of Target figuring out a girl was pregnant before her father did shows, they also risk being intrusive to customers, employees, and society as a whole. Just because something is now feasible doesn’t mean that it’s a good idea. If your customers would find it creepy to discover just how much you know about their activities, it’s probably a good indication that you shouldn’t be doing it.

In addition, the detail and volume of the data stored raises the stakes on issues such as data privacy and data sovereignty. Organizations should explicitly introduce ethics considerations into new project decisions as part of their governance processes. And you may want to collaborate with organizations such as the Partnership on AI, which was created to study and formulate best practices on AI technologies, and to serve as an open platform for discussion and engagement about AI and its influences on people and society.


Whether you want to call it Big Data or just “data”, it’s a key part of digital transformation and the business models of the future – and organizations that have robust systems in place to ensure that it can be trusted will be better positioned to take advantage of these powerful new technologies.


A version of this post first appeared in Brink News.