Expenditures on data warehouses and related business intelligence technologies are expected to increase from $60 billion in 2001 to $150 billion in 2005 [6], an increase of over 20% per year. To make effective use of these expenditures, information systems professionals need methodologies for developing data warehouses. The most widely used tool for developing information systems applications is the systems development life cycle (SDLC)—a general application development tool tailored to meet the requirements of a number of diverse applications, including database development.
Here, we tailor the SDLC to data mart development, and show how the steps in the SDLC must be transformed in order to be applicable for developing data marts. We also clarify the differences between the methods needed to develop transaction-oriented relational databases and those needed to develop decision-oriented dimensional data marts. Dimensional data marts are fundamentally different than relational databases—a different methodology and a different set of tools are required for their effective development. By contrasting the steps required to develop relational databases with the steps required to develop dimensional data marts, this article clarifies the differences in the methodologies and tools in a manner readily understandable to information systems professionals.
Ricardo [7] adapted the SDLC to transaction-oriented database development. Her adaptation is called the Staged Database Design Methodology. Steps unique to database development, such as constructing a logical data model, choosing a database management system, and performance tuning, are addressed directly in Ricardo's methodology.
Using this methodology as a starting point, we develop an analogous procedure for dimensional data marts. Although the steps of our Dimensional Data Mart Development Methodology are essentially the same as Ricardo's, the operationalization of those steps is quite different. These differences exist for two reasons: Relational databases (RDBs) are generally used to support transaction processing, whereas dimensional data marts (DDMs) are generally used to support decision making. Also, different tools and methods are required to design, build, and tune dimensional versus relational databases. Figure 1 highlights these differences.
Step 1. Analyze the user environment. The manner in which the user environment is analyzed differs greatly for a transaction-oriented relational database versus a decision-oriented dimensional data mart. In the development of a relational database, the goal is to support all the transaction-processing needs of the organization. The focus is on identifying the functions the organization must perform and the data required to perform them. Various tools such as process decomposition charts and business function-to-data entity matrices have been developed to aid database designers with this task [5].
In the development of a dimensional data mart, the goal is to support the analysis of a specific business process. Von Winterfeldt and Edwards [8] point out that "good problem structuring is the key to successful analysis." Accordingly, we present the following five-step structured process for analyzing the user environment.
This structure forces analysts to identify the relevant performance measures, and the dimensions along which those performance measures should be analyzed. Our emphasis on performance measures also highlights the difference in the way the user environment must be analyzed for dimensional data marts versus transaction-oriented databases. Figure 1 uses the phrases "Decision Perspective" and "Operations Perspective" to summarize those differences
Step 2. Develop the logical model. Relational and dimensional databases also use very different logical design techniques. Almost all relational databases use entity-relationship diagrams (ERDs) to develop the logical model. The intent is to represent all of the entities that impact an organization's operations, their relevant data characteristics, and the relationships among those data characteristics. The goal is to capture all of the information needed for the transaction processing required to run the organization. The entities identified during the analysis phase, as well as their relationships, are mapped in the ERD.
In the development of a dimensional data mart, the goal is to support the analysis of a specific business process.
A completely different logical design technique is used for data marts. The predominant tool for the development of the logical model for data marts is the star-schema diagram. The intent is for the model to represent all of the relevant questions that must be answered in order to make appropriate decisions in a specific business area. If the five-step process presented here is used to analyze the user environment, then the measured facts will map directly to the fact table, and the dimensions will map directly to the dimension tables. Figure 1 uses the phrases "Star-Schema Diagram" and "Entity-Relationship Diagram" to summarize the differences in the way that logical modeling must be done for data marts versus relational databases.
Figure 2 shows an ERD for a sales database, while Figure 3 shows a star-schema diagram (SSD) for a sales database. Note the ERD focuses on the transactions while the SSD focuses on measuring the impact of the transactions. In figures 2 and 3, PK indicates primary key and FK indicates foreign key.
Step 3. Choose the database management system (DBMS). As described by Ricardo [7], the best DBMS is "the system that best satisfies the specifications for the environment, within the constraints given." The central issue is performance. For a transaction-oriented database, there are a wide variety of choices, including products by IBM, Oracle, Microsoft, Sybase, and Informix. All of these products use a relational engine to process the tables that correspond to the entities and relationships in the ERD.
The choice is more complex for a data mart. A dimensional database can be stored as a set of tables (one table for each dimension plus one fact table), as a single data cube (the join of each dimension table with the fact table), or as both a data cube and a set of tables (the data cube for storing aggregated data and the fact and dimension tables for storing nonaggregated data). If the dimensional database is stored as a set of dimension tables and a fact table, then a relational engine must be used to process the tables (relational online analytical processing or ROLAP). If the dimensional database is stored as a data cube, then a dimensional engine must be used to process the data cube (multidimensional online analytical processing or MOLAP). If the dimensional database is stored using both a data cube and a set of dimension tables and a fact table, then both a dimensional engine and a relational engine must be used to process the data (hybrid online analytical processing or HOLAP). Dimensional engines are available from a number of vendors, including SAS, Oracle, IBM, and Microsoft.
Figure 1 indicates that only a relational engine may be used with a transaction-oriented relational database, but either a relational or a dimensional engine may be used with a decision-oriented dimensional data mart. Although benchmarks are offered regarding the performance of specific DBMS products, there is little information available regarding which dimensional engine offers the best performance. In fact, it is not even known whether a dimensional engine operating against a data cube always offers better performance than a relational engine operating on a set of dimension tables and a fact table. We believe this is an important area for additional research.
Step 4. Map the logical model to the database management system. This is the process of fitting the logical model for the database to the DBMS software chosen to implement the system. In transactional databases, a major component of this step is normalizing the tables. This is done primarily to ensure the integrity of the data. Because transactional databases are constantly updated, it is almost impossible to maintain the integrity of the data if the tables are not normalized. Because dimensional databases are read-only, preserving data integrity is not a critical concern. Instead, the focus is on enhancing performance and maintaining an intuitive user interface. Normalizing the dimension tables would interfere with both of these objectives.
The result of Step 4 is a large number of normalized tables in the relational database case, but only one or a few non-normalized tables in the dimensional data mart case. Figure 1 uses the phrases "One or More Non-Normalized Tables" and "Set of Normalized Tables" to summarize this difference. Removing all redundancy in the data is immensely beneficial to transaction processing, both for preserving data integrity and speeding updates. However, the goal of dimensional databases is user understandability and ease of querying. Removing redundancies does not improve the performance of dimensional data marts.
Steps 5 and 6. Develop and evaluate the physical model. Once the tables have been defined using the DBMS's data definition language, the DBMS generally creates the physical model. This is true for both dimensional and relational DBMSs. Evaluation of the physical model involves an assessment of its ability to meet the objectives of the system, which were defined in Step 1. Typically, this includes benchmarking a prototype of the system.
The criteria used to evaluate the physical model differ for dimensional data marts and transaction-oriented databases. Figure 1 uses the phrases "Preserve Relevant Analytical Criteria" and "Maintain Data Integrity" to summarize these differences. The transaction-oriented database is evaluated on its ability to process transactions over time without becoming corrupt. The focus is on the ability to correctly process all of the transactions required to support the organization's operations.
The dimensional data mart, on the other hand, is evaluated on its ability to help managers understand, evaluate, and improve a business process. The focus is on the understandability of the user interface and the ability of the system to accurately provide the information that managers need to improve the business process. The physical model also is evaluated for speed. If the queries cannot be processed in a timely manner, then the database will not be used. Step 7 addresses how to modify the physical model to increase query-processing speed.
Step 7. Perform tuning. For transaction-oriented databases, adding indices, denormalizing tables, and physically repositioning tables (for example, putting them on the same disk pack, or the same cylinder within a disk pack) are the most effective tools for increasing query processing speed. Indices, denormalized tables, and physical restructuring also are used to improve the performance of dimensional data marts. However, aggregates are the most powerful tool for increasing query processing speed in dimensional data marts. Kimball [4] states, "If you do not have aggregates, then you are potentially wasting millions of dollars on hardware upgrades to solve performance problems that could be otherwise addressed by aggregates."
Aggregates are the most powerful tool for increasing query processing speed in dimensional data marts.
Although storing aggregates can dramatically increase performance, caution is required. Kimball [3] warns against the explosion of aggregates, and shows that storing aggregates can easily increase the size of the data mart by 400%. Selection of aggregates is a complex and important decision. Although it is impossible to provide a single, simple answer to this question, it is possible to provide two useful guidelines.
The first guideline is provided by Kimball [3]: "The single most effective way to control an aggregate explosion, but still benefit from the value of aggregates, is to make sure that each aggregate summarizes at least 10 and preferably 20 or more lower-level items." Kimball is addressing the problem from a storage perspective. As long as each aggregate summarizes a large number of lower-level items, the number of records in the aggregate fact table is likely to be less than the number of records in the base fact table. Also, the greater the number of lower-level items summarized in the aggregate, the less sparse (more dense) the data cube.
Storage costs are an important component of data mart cost, but so are processing costs. Each time the data mart is refreshed, processing time is required to create each aggregate. If an aggregate is never used, that processing time was wasted. If an aggregate is used more than once, a net saving in processing time occurs (assuming the overhead cost of maintaining that aggregate in the data mart is negligible). This means it is efficient to store almost any aggregate that will be used frequently.
Therefore, we suggest a second guideline for storing aggregates: Store any aggregate that, on average, is likely to be used at least once per refresh period. Given the dramatic decrease in storage costs taken place (from $37.25 per gigabyte in 1999 to $5.68 per gigabyte in 2003 to $0.21 per gigabyte by 2010 [2]), this guideline is likely to become even more important in the future.
Figure 1 summarizes the difference in how tuning must be done for dimensional data marts versus transaction-oriented databases by noting that both indices, storage structures, and aggregates can be used to tune data marts, but only indices and storage structures can be used to tune transactional databases. The use of stored aggregates in a volatile transaction-oriented database is very impractical, since the stored aggregates must be updated almost continuously. However, the nonvolatile nature of the dimensional data mart makes it ideal for storing aggregates.
Step 8. Implementation/monitoring. Implementation involves putting the system into production. One of the duties of the database administrator (DBA) is to monitor the queries run against the database. The way in which this monitoring is done differs considerably between transaction-oriented relational databases and dimensional data marts. In transaction databases, queries are monitored primarily for their processing speed. The methods for increasing speed are assigning indices and restructuring tables. The content of the queries is of little interest to the DBA.
In dimensional data marts queries are monitored for both speed and content. If the same query is run repeatedly to create an aggregate, then that aggregate should be stored in a fact table. Since it is much quicker to retrieve a stored aggregate than to process the query that generates the aggregate, storing frequently requested aggregates in the data warehouse speeds the retrieval of summary information. As stated in Step 7, this is unique to data marts because these databases are designed to store non-volatile data.
Figure 1 uses the phrases "Track Queries for Speed, Frequency, and Content" and "Track Queries for Speed and Frequency" to highlight the differences in the way monitoring must take place for dimensional data marts versus transaction-oriented databases. In both cases the DBA must monitor for queries that take an unusually long time to run, and queries that are run very frequently. However, the DBA for a dimensional data mart must also monitor for summary queries. If the query involves an aggregate function, then the DBA must consider whether the creation of a stored aggregate would be beneficial.
This article makes four major contributions to the data warehousing practice and literature. First, it presents a methodology for data mart development. Our Dimensional Data Mart Development Methodology integrates the information systems development literature with the data warehouse literature to create an easy-to-follow, structured process for the data mart practitioner.
Second, we contrast each step in the dimensional data mart development process with its corresponding step in the relational database development process. Figure 1 highlights the differences in the development methodologies. This diagram is a valuable reference tool for understanding the differences in the development methodologies for the two types of databases.
Third, practical guidelines are presented for analyzing the user environment for dimensional data marts. We present a five-step process for analyzing the user environment. By focusing on performance measures and the ways in which those measures need to be analyzed, data mart designers can protect against providing what Ackoff [1] calls an "overabundance of irrelevant data," and yet still provide all the information needed to effectively manage a business process.
Fourth, guidelines are presented for determining which aggregates should be stored in a data mart. The use of stored aggregates can greatly enhance the performance of the data mart, but can also cause the size of the data mart to explode. Building on the work of Kimball [3], we present guidelines for storing aggregates.
In total, these contributions highlight the unique characteristics of data mart development. Because such systems are intended to support management decisions, developers must think in terms of business processes and the performance measures needed to monitor and improve those processes. Given the growing importance of these systems, it is critical that data mart implementations be undertaken in an efficient manner, and the methodology described in this article provides a structure that should contribute toward more successful implementations.
1. Ackoff, R.L. Management misinformation systems. Management Science 14, 4 (Dec. 1967), B147–B156.
2. Gilheany, S. Projecting the cost of magnetic disk storage over the next 10 years. White paper 22011v039h. Archivebuilders.com. Manhattan Beach, CA, 2000; www.archivebuilders.com/whitepapers/22011v039h.html.
3. Kimball, R. The Data Warehouse Toolkit. John Wiley, New York, 1996.
4. Kimball, R. A dimensional modeling manifesto. Database Magazine 10, 9 (Aug. 1997), 59–68.
5. McFadden, F., Hoffer, J., and Prescott, M. Modern Database Management, 5th Edition. Addison-Wesley, Reading, MA, 1999.
6. Parkes, C. Data Warehousing: The economy isn't the only reason the data warehouse industry is stumbling. Enterprise Systems (Mar. 1, 2002); www.esj.com/Departments/article.asp?EditorialsID=51
7. Ricardo, C. Database Systems: Principles, Design, & Implementation. Macmillian, New York, 1990.
8. Von Winterfeldt, D. and Edwards, W. Decision Analysis and Behavioral Research. Cambridge University Press, New York, 1986.
Figure 1. Dimensional vs. relational design methods.
�2003 ACM��0002-0782/03/1200��$5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright���2003 ACM, Inc.
No entries found