A popular refrain I hear these days is "I am planning to put all of my data into a data lake so my employees can do analytics over this potential treasure trove of information." This point of view is also touted by several vendors selling products in the Hadoop ecosystem. Unfortunately, it has a serious flaw, which I illustrate in this posting using (mostly fake) data on one of my M.I.T. colleagues.
Consider two very simplistic data sources containing data on employees. The first data source has records of the form:
Employee (name, salary, hobbies, age, city, state)
while the second contains data with a layout of:
Person (p-id, wages, address, birthday, year-born, likes)
An example record from the first data set might be:
(Sam Madden, $4000, {bike, dogs}, 36, Cambridge, Mass)
An example from the second data set might be:
(Samuel E. Madden, $5000, Newton Ma., October 4, 1985, bicycling)
A first reasonable step is to assemble these records into a single place for subsequent processing. This Ingest step is the first phase of a data curation process with the following components:
Ingest: Data sources must be assembled as noted above.
Data transformation: The field "address" must be decomposed into its constituent components "city" and "state". Similarly, "age" must be computed from "year-born" and "birthday".
Schema integration: It must be ascertained that "p-id" and "name" mean the same thing. Similar statements hold for the other attributes in the records, as well as for the transformed fields.
Data cleaning: The two salaries of Sam Madden are different numbers. One or both may be incorrect. Alternatively, both may be correct; for example, one could be total wages including consulting, while the other could be the individual's W-2 salary. Similarly, Sam can only have a single age, so one (or both) of the data sources has incorrect information.
Entity consolidation: It must be ascertained that the two Sam Madden records correspond to the same person, not two different people. Then the two records must be merged into a composite record. In the process, decisions (often called merge rules) need to be made about Sam's hobbies. For example, are "bike" and "bicycling" the same thing? A short code sketch of these transformation and consolidation steps follows this list.
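To make these steps concrete, here is a minimal Python sketch of the transformation, schema-integration, and consolidation logic for the two toy records above. Everything in it is an illustrative assumption: the field mapping, the address-splitting rule, the surname-only matching test, and the hobby synonym table are stand-ins for the far more careful logic a real curation system would need.

    # Toy sketch of the curation steps above; all rules are illustrative assumptions.
    employee = {"name": "Sam Madden", "salary": 4000, "hobbies": {"bike", "dogs"},
                "age": 36, "city": "Cambridge", "state": "Mass"}
    person = {"p-id": "Samuel E. Madden", "wages": 5000, "address": "Newton Ma.",
              "birthday": "October 4", "year-born": 1985, "likes": {"bicycling"}}

    # Data transformation: decompose "address" into "city" and "state".
    def split_address(address):
        city, _, state = address.partition(" ")   # naive; real addresses need real parsing
        return city.strip(), state.strip(". ")

    # Schema integration: map the second source's attribute names onto the first schema.
    FIELD_MAP = {"p-id": "name", "wages": "salary", "likes": "hobbies"}

    def to_employee_schema(rec):
        out = {FIELD_MAP.get(k, k): v for k, v in rec.items()}
        out["city"], out["state"] = split_address(out.pop("address"))
        return out

    # Entity consolidation: a deliberately naive matching rule plus two merge rules.
    SYNONYMS = {"bicycling": "bike"}

    def same_person(a, b):
        return a["name"].split()[-1] == b["name"].split()[-1]   # surname match only

    def merge(a, b):
        merged = dict(a)
        merged["hobbies"] = {SYNONYMS.get(h, h) for h in a["hobbies"] | b["hobbies"]}
        # Data cleaning is deferred: conflicting salaries are kept, not silently resolved.
        merged["salary_candidates"] = {a["salary"], b["salary"]}
        return merged

    candidate = to_employee_schema(person)
    if same_person(employee, candidate):
        print(merge(employee, candidate))

Even this toy version has to make judgment calls, such as what counts as a match and which salary to trust, which is exactly where the hard (and often human) work of curation lies.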
Several points are immediately evident. First, data curation is an iterative process. After applying some of these steps, it may make sense to go back and repeat others. For example, entity consolidation may reveal the problems with Sam’s salary and age, in which case further data cleaning is warranted. In effect, data curation is a workflow of processing steps with some backtracking.
Second, some data curation steps will require human involvement; it is not reasonable to expect automated systems to do the whole job. Moreover, in many environments it will take experts to provide the necessary judgment. For example, integration of genomics data requires a skilled biologist and cannot be performed by ordinary crowdsourcing techniques.
Third, real-world data is usually very dirty. Anecdotal evidence suggests that up to 10% of corporate data inside the firewall is incorrect. Hence, anyone who thinks his troubles are over once he has ingested his data sources is sadly mistaken. The remaining four steps will be very costly.
In effect, the ingest phase is trivial compared to the other four steps. Hence, data ingest into an uncurated "data swamp" is just the tip of a data consolidation iceberg. A huge amount of effort will have to be subsequently invested to turn the swamp into a data lake.
The moral of this story is "don’t underestimate the difficulty of data curation." If you do, you will revisit a well-worn path, namely the experience of enterprises in the 1990s with data warehouses. A popular strategy at the time was to assemble customer-facing data into a data warehouse. Business analysts could then use the result to determine customer behavior and make better sales decisions. The average data warehouse project at the time was a factor of two over budget and a factor of two late because the data curation problems were underestimated. Don’t repeat that particular mistake.
These data format and meaning challenges have been with us since the first 1s and 0s were thrown together in the cesspool of datastores.
We see this overconfidence in data quality on project after project, even though the same people made just as poor choices in their own current database designs. There's some sort of designer amnesia or blocking that leads professionals to believe that every other data source had good design, reliable data curation, and well-understood data lineage, even though their own designs work against data usability (age, wages).
We are repeating the failures of the past on every project because we rarely teach data principles, we rarely budget and plan for data quality efforts, and we rarely let data professionals dare to "slow down the project" to understand and architect a reliable data component for our solutions.
We work in an environment where profound data dementia and anosognosia is the way we build applications. For those of us in the data management profession, it's maddening to see other professionals delight in knowing even less about the data they work with, or caring not at all about ensuring that it is usable. Sure, some are terrified of the thought of constraining data, but surely even a new developer can see the tradeoff in capturing not AGE, but DATE OF BIRTH (and the other pieces actually required to understand age).
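A minimal sketch of that tradeoff (the function and field names here are hypothetical): a stored AGE goes stale, while a stored DATE OF BIRTH lets age be derived, correctly, for any reference date.

    from datetime import date

    # Store date_of_birth, not age: age is only meaningful relative to an "as of" date.
    def age_as_of(date_of_birth, as_of):
        had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
        return as_of.year - date_of_birth.year - (0 if had_birthday else 1)

    print(age_as_of(date(1985, 10, 4), date(2015, 2, 1)))   # 29: derived on read, never stale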
Originating systems need time, budget, and resources to ensure the data isn't corrupted at write time. That rarely happens, causing all kinds of unnecessary costs, risks, and harms later on. If we were actually professionals, we would fix that.
Can't be more than a few lines of code!