The capacity of digital data storage worldwide has doubled every nine months for at least a decade, at twice the rate predicted by Moore's Law for the growth of computing power during the same period [5]. This less familiar but noteworthy phenomenon, which we call Storage Law, is among the reasons for the increasing importance and rapid growth of the field of data mining.
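The gap between the two trends compounds quickly. As a rough sketch of the arithmetic, assume the nine-month doubling period stated above and an 18-month doubling period for computing power; the 18-month figure is an assumption inferred from "twice the rate," and the function name is purely illustrative.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth over `years` for a given doubling period in months."""
    return 2 ** (years * 12 / doubling_months)

# Ten years of growth under each trend.
# The 18-month period for computing power is an assumption, not a figure from the text.
storage_growth = growth_factor(10, 9)    # nine-month doubling (Storage Law)
compute_growth = growth_factor(10, 18)   # assumed 18-month doubling (Moore's Law)

print(f"storage capacity: about {storage_growth:,.0f}x")
print(f"computing power:  about {compute_growth:,.0f}x")
print(f"gap after ten years: about {storage_growth / compute_growth:,.0f}x")
```

Under these assumptions, storage grows roughly ten-thousand-fold in a decade while computing power grows roughly a hundredfold, so the amount of stored data per unit of processing capacity itself grows about a hundredfold.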
The aggressive growth of disk storage and the gap between the Moore's Law and Storage Law growth trends represent a very interesting pattern in the evolution of technology. Our ability to capture and store data has far outpaced our ability to process and utilize it. This growing challenge has produced a phenomenon we call data tombs: data stores that are effectively write-only, where data is deposited merely to rest in peace, since in all likelihood it will never be accessed again.
Data tombs also represent missed opportunities. Whether the data might support exploration in a scientific activity or commercial exploitation by a business organization, the data is potentially valuable information. Without next-generation data mining tools, most of it will stay unused; hence most of the opportunity to discover, profit, improve service, or optimize operations will be lost. Data mining, one of the most general approaches to reducing data in order to explore, analyze, and understand it, is the focus of this special section.
Data mining is defined as the identification of interesting structure in data. Structure designates patterns, statistical or predictive models of the data, and relationships among parts of the data. Each of these terms (patterns, models, and relationships) has a concrete definition in the context of data mining. A pattern is a parsimonious summary of a subset of the data (such as "people who own minivans have children"). A model of the data can be a model of the entire data set and can be predictive; it can be used to, say, anticipate future customer behavior (such as the likelihood a customer is or is not happy, based on historical data of interaction with a particular company). It can also be a general model (such as a joint probability distribution on the set of variables in the data). However, the concept of "interesting" is much more difficult to define.
What structure within a particular data set is likely to be interesting to a user or task? An algorithm could easily enumerate lots of patterns from a finite database. Identifying interesting structure and useful patterns among the plethora of possibilities is what a data mining algorithm must do, and it must do it quickly over very large databases.
For example, frequent item sets (variable values occurring together frequently in a database of transactions) could be used to answer, say, which items are most frequently bought together in a supermarket. Such an algorithm could also discover, with exceptionally high confidence, a pattern in a demographics database stating that, say, all husbands are males. Although true, this particular association is unlikely to be interesting. Yet the same method, applied to the set of transactions representing physicians billing the Australian Government's medical insurance agency, uncovered a correlation the agency's auditors deemed extremely interesting. Two billing codes were highly correlated; they were representative of the same medical procedure and hence had created the potential for double-billing fraud. This nugget of information represented millions of dollars of overpayment.
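The idea of frequent item sets can be illustrated with a minimal sketch that counts item pairs occurring together in a toy set of transactions. It is a brute-force illustration, not the algorithm behind the examples above, and would not scale to very large databases; the basket contents, the function name, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (the fraction of transactions that
    contain both items) is at least `min_support`, with rule confidences."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))          # ignore duplicates in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    results = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            results.append({
                "pair": (a, b),
                "support": support,
                # confidence of the rule "a implies b" and of "b implies a"
                "conf_a_to_b": count / item_counts[a],
                "conf_b_to_a": count / item_counts[b],
            })
    # Most frequent pairs first; ties broken alphabetically.
    return sorted(results, key=lambda r: (-r["support"], r["pair"]))

# Hypothetical supermarket baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
]

pairs = frequent_pairs(baskets, min_support=0.4)
for p in pairs:
    print(p)
```

In this toy data, every basket containing butter also contains bread, so the rule from butter to bread has a confidence of 1.0. As with the husbands example, high confidence alone does not make a pattern interesting; deciding which of the enumerated patterns matter is the harder problem.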
The quest for patterns in data has been studied for a long time in many fields, including statistics, pattern recognition, and exploratory data analysis [6]. Data mining is primarily concerned with making it easy, convenient, and practical to explore very large databases for organizations and users with lots of data but without years of training as data analysts [1, 3, 4]. The goals uniquely addressed by data mining fall into certain categories:
Among the most important trends in data mining is the rise of "verticalized," or highly specialized, solutions, rather than the earlier emphasis on building new data mining tools. Web analytics, customer behavior analysis, and customer relationship management all reflect the new trend; solutions to business problems increasingly embed data mining technology, often in a hidden fashion, into the application. Hence, data mining applications are increasingly targeted and designed specifically for end users. This is an important and positive departure from most of the field's earlier work, which tended to focus on building mining tools for data mining experts.
Transparency and data fusion represent two major challenges for the growth of the data mining market and technology development. Transparency concerns the need for an end-user-friendly interface, whereby the data mining is transparent as far as the user is concerned. Embedding data mining in vertical applications is a positive step toward addressing this problem, since it is easier to generate explanations from models built in a specific context. Data fusion concerns a more pervasive infrastructure problem: Where is the data that has to be mined? Unfortunately, most efforts at building the decision-support infrastructure, including data warehouses, have proved to be big, complicated, and expensive. Industry analysts report the failure of a majority of enterprise data warehousing efforts. Hence, even though the data accumulates in stores, it is not being organized in a format that is easy to access for mining or even for general decision support.
Much of the problem involves data fusion. How can a data miner consistently reconcile a variety of data sources? Often labeled as data integration, warehousing, or IT initiatives, the problem is also often the unsolved prerequisite to data mining. The problem of building and maintaining useful data warehouses remains one of the great obstacles to successful data mining. The sad reality today is that before users get around to applying a mining algorithm, they must spend months or years bringing together the data sources. Fortunately, new disciplined approaches to data warehousing and mining are emerging as part of the vertical solutions approach.
The six articles in this special section reflect the recent emphasis on targeted applications, as well as data characterization and standards.
Padhraic Smyth et al. explore the development of new algorithms and techniques in response to changing data forms and streams, covering the influence of the data form on the evolution of mining algorithms.
Paul Bradley et al. sample the effort to make data mining algorithms scale to very large databases, especially those in which one cannot assume the data is easily manipulated outside the database system or even scanned more than a few times.
Ron Kohavi et al. look into emerging trends in the vertical solutions arena, focusing on business analytics, which is driven by business value measured as progress toward bridging the gap between the needs of business users and the accessibility and usability of analytic tools.
Specific applications have always been an important aspect of data mining practice. Two overview articles cover mature and emerging applications. Chidanand Apte et al. examine industrial applications where these techniques supplement, sometimes supplant, existing human-expert-intensive analytical techniques for significantly improving the quality of business decision making. Jiawei Han et al. outline a number of data analysis and discovery challenges posed by emerging applications in the areas of bioinformatics, telecommunications, geospatial modeling, and climate and Earth ecosystem modeling.
Data mining also represents a step in the process of knowledge discovery in databases (KDD) [2]. The rapidly growing number of KDD tools and techniques, applied to a growing variety of applications, needs to follow a consistent process. The business requirement that any KDD solution must be seamlessly integrated into an existing environment makes it imperative that vendors, researchers, and practitioners all adhere to the technical standards that make their solutions interoperable, efficient, and effective. Robert Grossman et al. outline the various standards efforts under way today for dealing with the numerous steps in data mining and the KDD process.
Providing a realistic view of this still young field, these articles should help identify the opportunities for applying data mining tools and techniques in any area of research or practice, now and in the future. They also reflect the beginning of a still new science and the foundation for what will become a theory of effective inference from and exploitation of all those massive (and growing) databases.
1. Fayyad, U., Grinstein, G., and Wierse, A., Eds. Information Visualization in Data Mining. Morgan Kaufmann Publishers, San Francisco, 2002.
2. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.
3. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2000.
4. Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. MIT Press, Cambridge, MA, 2001.
5. Porter, J. Disk Trend 1998 Report; www.disktrend.com/pdf/portrpkg.pdf.
6. Tukey, J. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
©2002 ACM 0002-0782/02/0800 $5.00