Communications of the ACM

Assessing Data Quality with Control Matrices


There is a growing trend for business and IS professionals to treat information as a product [6, 7]. This represents a fundamental shift in focus: under an information product approach, the quality of the information itself, rather than the system that produces it, is of paramount importance. That is not to say we no longer care about the quality of our computer systems, but rather that we care about it for a different reason. A well-designed, efficient, and effective information system is important insofar as it contributes to a higher-quality information product. For the purposes of this article, information products are the results of transforming raw data, using computer-based processing, into outputs that have value for either internal or external data consumers. (The terms "information" and "data" are used interchangeably throughout this article.)

Under this new paradigm, we need to inventory the information products produced by our systems and to measure their quality across multiple dimensions such as accuracy, accessibility, consistency, and timeliness [2, 5]. We need to evaluate information products in terms of how well they meet the consumers' needs, how well we produce information products, and how well we manage the life cycle of the data after it is produced. This approach suggests a new role, the information product manager (IPM). These individuals must apply an integrated, cross-functional approach to fulfill their responsibility of coordinating and managing the suppliers of raw information, the producers of deliverable information, and the information consumers [7].

To accomplish their task, IPMs require additional tools beyond currently available CASE tools or software quality products. Work is underway on several tools specifically designed for the data quality professional. Some of these tools, like the Integrity Analyzer [6], assist the data quality professional with the development, collection, and analysis of information quality metrics. Other tools, like the IP-Map methodology [1], are useful for diagramming and analyzing the process by which an information product is manufactured. This article presents a third technique, the control matrix, which is adapted from the IS audit and control literature for use in analyzing the quality of an information product.

Control Matrices

IS auditors have used control matrices since the 1970s to evaluate how reliably a system safeguards assets and protects data integrity [8]. This technique can also be used to evaluate the quality level of an information product. Control matrices are a concise way to link data problems to the quality controls that should detect and correct these data problems during the information manufacturing process. The columns of the matrix list the data quality problems that can afflict the finished information product. The rows of the matrix are the quality checks or corrective processes exercised during the information manufacturing process to prevent, detect, or correct these quality problems. These controls can be identified from either a data flow diagram or IP-Map of the data manufacturing process. The elements of the matrix rate the effectiveness of the quality check at reducing the level of data errors. These ratings can take several forms.

Yes/No. A quality check exists to prevent certain error(s) from appearing in the information product. In this case, the IPM has examined the information production process and has identified a corrective or detective process in place that should prevent that type of error from appearing in the final information product. Note that the Yes/No rating provides the lowest level of assessment, since it only indicates a quality check is present and does not address how well the check performs its function.

Category. A quality check exists to prevent certain error(s) from appearing in the final information product and the IPM is able to describe its effectiveness at error prevention, detection, or correction as low, moderate, or high. This categorical assessment provides more information than a simple Yes/No since it captures the IPM's belief as to how reliably the quality check performs its function.

Number. A quality check exists in the manufacturing process to prevent certain error(s) from appearing in the final information product, and the IPM is able to describe its performance at removing the data problem numerically. To obtain a numerical assessment, the IPM must devise a test of the control's effectiveness. For example, the IPM may create several test data sets seeded with known errors and then apply the quality check to each test data set. If the quality check corrects, on average, say 95% of the known errors, then it can be considered on average 95% effective at preventing those types of data irregularities from appearing in the information product.
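
As a concrete illustration, here is a minimal sketch in Python of this kind of seeded-error test. The record layout, the check_zip_code routine, and the seeded errors are all hypothetical; only the measurement logic (errors caught divided by errors seeded) follows the procedure described above.

    def measure_effectiveness(quality_check, test_records, seeded_error_ids):
        """Return the fraction of seeded errors the quality check catches."""
        flagged = {r["id"] for r in test_records if not quality_check(r)}
        return len(flagged & seeded_error_ids) / len(seeded_error_ids)

    def check_zip_code(record):
        """Hypothetical check: a zip+4 code is 5 digits, a dash, 4 digits."""
        parts = record.get("zip", "").split("-")
        return (len(parts) == 2 and parts[0].isdigit() and len(parts[0]) == 5
                and parts[1].isdigit() and len(parts[1]) == 4)

    # Test data set with two seeded zip-code errors out of four records.
    records = [
        {"id": 1, "zip": "15701-1087"},
        {"id": 2, "zip": "1570-11087"},   # seeded error
        {"id": 3, "zip": "15705-0001"},
        {"id": 4, "zip": "ABCDE-1234"},   # seeded error
    ]
    print(measure_effectiveness(check_zip_code, records, {2, 4}))  # 1.0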

Formula. A quality check exists in the manufacturing process to prevent certain error(s) from appearing in the final information product; however, its reliability rate depends on a relationship between itself and some other variables. Under this scenario, the IPM has learned through process experimentation how the reliability of a quality check may fluctuate and is able to describe that fluctuation through a mathematical function. For example, clerks who take orders over the phone may become less reliable at checking the data quality of the orders as the day progresses, and this behavior can be modeled using a function incorporating the time of day.
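
A minimal sketch of such a formula-based rating, assuming a simple linear decline in clerk reliability over the workday; the base rate and decay coefficient are hypothetical values that would come from process experimentation.

    def clerk_reliability(hour):
        """Fraction of order errors a clerk catches, declining after 9 a.m."""
        base, decay_per_hour = 0.95, 0.02  # hypothetical experimental estimates
        return max(0.0, base - decay_per_hour * max(0, hour - 9))

    for hour in (9, 12, 17):
        print(f"{hour}:00 -> {clerk_reliability(hour):.2f}")
    # 9:00 -> 0.95, 12:00 -> 0.89, 17:00 -> 0.79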


The control matrix is designed to focus on those parts of the data manufacturing process that prevent, detect, or correct data irregularities in the final information product. It is important to note that not every quality check will detect every type of error. It is quite possible that multiple quality checks are employed during the information manufacturing process, each one designed to catch different types of data problems.

Table 1. Generic control matrix for an information product.
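
To make the structure concrete, here is a minimal sketch of how such a matrix might be represented programmatically, with quality checks as rows and data problems as columns. The check names and ratings are hypothetical; numeric ratings are shown, but Yes/No or category labels would be stored the same way.

    # Control matrix as a nested dictionary: rows are quality checks,
    # columns are data problems, cells are effectiveness ratings.
    control_matrix = {
        "data-entry duplicate check": {"duplicate label": 0.30},
        "customer self-reporting": {"duplicate label": 0.02, "wrong address": 0.02},
        "vendor screening service": {"duplicate label": 0.75},
    }

    def controls_for(problem):
        """Scan down one column: list (check, rating) pairs for a problem."""
        return [(check, cells[problem])
                for check, cells in control_matrix.items() if problem in cells]

    print(controls_for("duplicate label"))
    # [('data-entry duplicate check', 0.3), ('customer self-reporting', 0.02),
    #  ('vendor screening service', 0.75)]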

Once the information production control matrix is complete, the IPM examines each data error column of the matrix to weigh the effects of the various data quality controls and to determine whether the quality of the information product is at an acceptable level. Columns with no controls or low control reliability ratings represent data quality problems the manufacturing process may be overlooking. Acceptable quality levels will depend on the organization's commitment to data quality as well as on the costs and benefits of maintaining the information product at a given quality level. In addition, information on the source, cost, and frequency of errors, and on the cost of controls, can be added to the control matrix to help estimate the impact of an unreliable information product.

For each column, the IPM assesses how many errors remain in the information product after the quality checks have been performed. This assessment depends on the IPM's understanding of the information production process. To get an overall assessment of the quality of the information product, the IPM combines the individual data irregularity rates into an overall rating. In the simplest case, where the error rates are independent and in the same numerical units, the IPM can apply basic probability rules to determine the probability that a given information product is free from defects.
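
As a minimal sketch, assuming independent error types and residual (after-control) rates expressed in the same units, the basic probability rule looks like this; the rates themselves are hypothetical.

    from math import prod

    def defect_free_probability(residual_error_rates):
        """P(defect-free) = product of (1 - rate), assuming the
        individual error types occur independently."""
        return prod(1.0 - rate for rate in residual_error_rates)

    # Hypothetical residual error rates left after all quality checks run.
    rates = [0.01, 0.005, 0.02]
    print(f"{defect_free_probability(rates):.4f}")  # 0.9653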

In the case where numerical assessments of reliability were not obtained, the IPM will need to subjectively combine the Yes/No or categorical ratings to get an overall feel for the quality level of the information product. In addition, if the error rates are expressed as functions, a spreadsheet or simulation program may be needed to gauge the overall reliability of the information product. In particular, where probability density functions are involved, the overall quality of the information product may be better expressed as a graph rather than as a single number such as an average.
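
Where rates are expressed as probability distributions rather than fixed numbers, a short Monte Carlo simulation can produce the graph mentioned above. This sketch assumes two hypothetical error types whose residual rates follow beta distributions; the parameters are invented for illustration.

    import random

    def simulate_defect_free(n_trials=10_000):
        """Sample residual error rates from assumed distributions and
        return the resulting distribution of defect-free probabilities."""
        outcomes = []
        for _ in range(n_trials):
            dup_rate = random.betavariate(2, 98)     # centered near 2%
            moved_rate = random.betavariate(5, 95)   # centered near 5%
            outcomes.append((1 - dup_rate) * (1 - moved_rate))
        return outcomes

    results = simulate_defect_free()
    print(f"mean {sum(results) / len(results):.3f}, "
          f"min {min(results):.3f}, max {max(results):.3f}")
    # A histogram of `results` gives the graph rather than a single average.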

Table 2. Control matrix for mailing list.

An Example

Here, a mailing list is used to illustrate how a control matrix assesses information product quality. The mailing list was chosen for several reasons. First, it is an easy information product to visualize. This hypothetical example assumes a basic mailing list composed of labels in a standard address format: the recipient's name, address, city, state, and zip+4 code. Second, the mailing list represents a significant business data quality problem. In 2001, the U.S. Postal Service processed and delivered over 207 billion pieces of mail to a delivery network that grew by 1.7 million new addresses [3]. A typical 100,000-piece mailing may have as many as 6,990 pieces returned as undeliverable-as-addressed (UAA). Reducing UAA volume caused by inaccurate addresses saves mailers related postage and processing costs and, for the U.S. Postal Service, reduces associated processing and delivery costs [4].

The IPM responsible for the mailing list begins the assessment of its reliability by constructing a control matrix that lists the major data problems and the quality checks the organization currently uses to address them. In this control matrix, the biggest problems with addresses are an incorrect or missing directional suffix; a customer who has moved; a wrong street name or number; a wrong zip code, city, or state; a wrong or missing rural route or box number; and a wrong or missing apartment number. In addition, duplicate mailing labels and mailing labels for individuals now deceased are also problems. In terms of maintaining the quality of the mailing list, the biggest controls include customer self-reporting of problems, mailings returned by the U.S. Postal Service, and the use of external vendor services for detecting problem labels. To produce the problem frequencies and effectiveness ratings, the IPM must rely on historical data, vendor guarantees, and knowledge about the process that maintains and generates mailing labels.

The IPM can use the control matrix in several ways. For instance, the columns in the control matrix document the known data quality problems associated with a particular information product. In this example, there are eight problems of concern to the IPM. The frequency of occurrence depends on the source of the data problem. Address changes and deaths can potentially affect any individual whose mailing information is stored by the organization. Duplicate, missing, or wrong address data typically arise during the collection and entry of new labels. In addition to frequency of occurrence, the IPM can also include the cost of each error, since some problems may have more serious consequences than others. In this case, all the data problems have a similar impact: excess mailing and printing costs.

The control matrix lists the quality controls currently in place and indicates which quality problems each control addresses. In this simulated case, some controls, like the obituary service, are useful for detecting only a single type of data irregularity, while others, such as customer-reported problems and the U.S. Postal Service, provide the means to detect several types of data irregularities.

Also, scanning down the columns reveals whether any controls exist for a data problem and, if so, how effective those controls are. A control's effectiveness may vary depending on the type of data error as well as on the frequency with which the control is used. For example, in the case of duplicate labels, there are three possible controls. As the data entry operator types in the new label, the input program performs a simple check to see if the new entry matches an existing entry in the database. This simple check is able to prevent about 30% of the duplicates from entering the database. Once a duplicate entry is in the database, there are two other controls that can potentially detect it. One is self-reporting: each time a mailing includes the duplicate labels, it is possible (although unlikely, given the 2% self-reporting rate) that the addressee will notify the mailer of the duplicates. The other is a vendor that, for a fee, will use a sophisticated address-matching program to screen a file for potential duplicate records. The vendor guarantees its process will detect and eliminate at least 75% of duplicate entries.

Moreover, if numerical estimates are available, a detailed analysis of the errors is possible. For example, in the case of duplicate labels, the IPM estimated that roughly 6% of new individuals being added to the mailing list are duplicates of existing entries. Since about 30% are caught during the data entry process, this leaves around 4% of the incoming new records as duplicates. Because addressees rarely report duplicate mailings, the percentage of duplicate entries in the database, if left untreated, will gradually mirror the duplicate percentage of the incoming records. If, once a year, the organization employs a vendor to help remove the duplicates, it appears from the control matrix that the service will eliminate three of every four duplicates, reducing the percentage of duplicate labels from 4% to about 1%. However, as time goes by, the percentage of duplicate labels will rise again in response to the duplicate rate of the incoming records.
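
These dynamics can be checked with a minimal simulation sketch. The list size, growth rate, and monthly mailing frequency are hypothetical; the 6%, 30%, 2%, and 75% figures come from the example above.

    def simulate_duplicates(months=24, list_size=100_000, growth=2_000):
        """Track the duplicate percentage under monthly list growth,
        monthly self-reporting, and an annual vendor screening pass."""
        dupes = 0.042 * list_size  # steady state: 6% incoming * (1 - 30%)
        total = float(list_size)
        for month in range(1, months + 1):
            dupes += growth * 0.06 * (1 - 0.30)  # dupes surviving data entry
            dupes *= 1 - 0.02                    # 2% self-reported per mailing
            total += growth
            if month % 12 == 0:                  # annual vendor service
                dupes *= 1 - 0.75
            if month % 6 == 0:
                print(f"month {month:2d}: {100 * dupes / total:.2f}% duplicates")

    simulate_duplicates()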


By repeating this same analysis for each of the data quality issues, the IPM can calculate an approximate percentage of problem-free labels using simple probability rules. For example, if the data problems are independent of each other, then the joint probability that a label is free of the eight different data quality problems can be obtained by simply multiplying the individual probabilities of a label being free of each error.
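
In code, this is the same product rule shown earlier, applied to eight hypothetical residual error rates, one per data problem in the mailing-list matrix:

    from math import prod

    # Hypothetical per-problem residual error rates after all controls run.
    residual_rates = [0.010, 0.015, 0.008, 0.005, 0.004, 0.006, 0.010, 0.002]
    print(f"problem-free labels: {prod(1 - r for r in residual_rates):.3f}")
    # problem-free labels: 0.941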

Conclusion

Once the IPM has established the reliability level of the labels and the associated costs of bad labels over time, decisions can be made as to whether the quality is acceptable or improvements should be made. The organization must weigh the costs of additional quality checks or corrective processes against the perceived benefits of increasing the percentage of correct labels. One factor not explicitly considered in this case is the timing of the mailings. It may be that there is no problem with letting the quality of the mailing list deteriorate, provided a cleanup is performed before a mass mailing begins. If further data quality improvements are desired, the control matrix is a useful tool for helping the IPM identify potential areas of improvement. Revisions to the information production control matrix and the subsequent what-if analysis can help project the value of the quality improvements as part of the overall cost-benefit analysis. Finally, the control matrix is a useful document for organizing and communicating data quality information about an information product.

In summary, the information product control matrix is a readily accessible tool for evaluating the reliability of an information product. Essentially, it is an application of control matrices, a tool IS auditors have long used to evaluate how effectively a system safeguards assets and protects data integrity. It should be noted, however, that this technique is only as good as one's understanding and knowledge of the information manufacturing process. The more detailed the control matrix's measurements, the better the estimates of the reliability of the information product.

References

1. Shankaranarayanan, G., Wang, R.Y. and Ziad, M. IP-Map: Representing the manufacture of an information product. In Proceedings of the 2000 Conference on Information Quality, (Cambridge, Mass., 2000).

2. Strong, D.M., Lee, Y.W. and Wang, R.Y. Data quality in context. Commun. ACM 40, 5 (May 1997), 103–110.

3. U.S. Postal Service. 2001 Annual Report; www.usps.com/history/anrpt01/.

4. U.S. Postal Service. Coming soon...delivery point validation. Memo to Mailers, (Mar./Apr. 2001), 8.

5. Wand, Y. and Wang, R.Y. Anchoring data quality dimensions in ontological foundations. Commun. ACM 39, 11 (Nov. 1996), 86–95.

6. Wang, R.Y. A product perspective on total data quality management. Commun. ACM 41, 2 (Feb. 1998), 58–65.

7. Wang, R.Y., Lee, Y.W., Pipino, L.L. and Strong, D.M. Manage your information as a product. MIT Sloan Management Review 39, 4 (Summer 1998), 95–106.

8. Weber, R. Information Systems Control and Audit. Prentice Hall, Upper Saddle River, NJ, 1999.

Author

Elizabeth M. Pierce ([email protected]) is an associate professor in the Eberly College of Business and Information Technology at Indiana University of Pennsylvania, Indiana, PA.

Tables

Table 1. Generic control matrix for an information product.

Table 2. Control matrix for mailing list.
