
Communications of the ACM

Viewpoint

Discovery Informatics


Add it to the ACM Computing Classification System as a new methodology at the level of AI and simulation, due to its importance in identifying and validating patterns in data.

The Great Data Famine! I recall shuddering when I read that 1970s prediction, with its specter of "millions of computers fighting for the same small piece of data, like savages" [2]. We all prayed the proposed data manufacturing plants would stave off the lean times sure to come.

Today, our data silos are overflowing. We find new ways to grind every human experience into finer and finer granularity. Businesses that once viewed transaction data as plumbing the depths of detail are now buried in clickstream data, which for one company alone exceeds 250GB per day and is growing [11]. NASA archives more than 1TB of Earth science data per day, while the National Institutes of Health database of genetic sequence information doubles every 18 months [7]. Co-conspirators in this Great Data Glut are the increasingly plentiful and cheap silos; worldwide digital data storage capacity doubles every nine months [5], while the cost per gigabyte of magnetic disk storage declined by a factor of 500 from 1990 to 2000 [9].

With heaps of data growing exponentially, the task of mining it for relationships that might translate into new knowledge grows more daunting. We are "informing ourselves to death" [10], consciously collecting more data than we need, with every good intention of figuring out later what really is worth saving. An attractive alternative—preventive mining—puts filtering in the forefront, admitting only worthwhile data. But we don't always know in advance what really is worth collecting. Failing to grab data may be an irreversible decision, as I learned supporting NASA, where it's really important to catch the telemetry sent down from space-borne instruments. We haven't found a way to realign heavenly bodies to give us a second chance at data capture.

More fundamentally, the mining metaphor implies that some data nuggets never hint at their worth as predictors or indicators when considered in isolation. We know up front they are not active players in our working hypotheses. But because they are descriptive of phenomena under investigation, they hold promise of being related to important outcomes, if only the automation would help us recognize the underlying patterns. Data mining is being pursued with notable success. Innovation and growth are evident in the meetings and publications of the knowledge discovery in databases (KDD) community and its ACM special interest group. But as we harvest more data faster, the challenge of making sense of it all becomes ever more pressing.

Back to Top

Definition and Scope

My experience squeezing value from data suggests that "discovery informatics" may be the right term at the right time to express the comprehensive nature of an emerging methodology. Discovery informatics is the study and practice of applying the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data. It is most often associated with drug discovery processes and their supporting bioinformatics, but the strategies and methods are also practiced in financial, manufacturing, retail, and other domains.

Discovery informatics conveys the breadth of the approaches that may be applied, including algorithms, data structures, processes, data management, architectures, online analytics, statistical methods, problem-solving strategies, domain-specific techniques, and commercial products and services. It also highlights the significant, central, and growing challenge of discovery across related structured and unstructured content, including combinations of numbers, text, video, graphics, images, sound, and speech.


Discovery informatics is certainly current, as efforts to connect the dots in intelligence reports are covered regularly in the popular media. Though university courses and research centers are organized around constituent elements like data mining, use of the term discovery informatics appears to be much more limited. At Johns Hopkins University, we are planning a Center for Discovery Informatics, believing discovery can be enhanced by exploring synergies and leveraging approaches and technologies associated with a variety of disciplines and domains. Discovery informatics deserves such attention because our success as a society in solving problems and making progress depends on our capacity for discovery.

This conception of discovery informatics recognizes multiple discovery engines running quietly under the hood. Many of the activities I've pursued under the heading of knowledge management involve applying informatics to help organizations discover:

  • Pockets of expertise they don't know they have;
  • How knowledge gets into their products and services; and
  • How much is written down or in people's minds about how to run the organization.

Discovery processes can benefit from knowledge management lessons derived from implementing collaborative systems that help people work together, even in sharing-challenged corporate divisions that want to keep to themselves. Collaboration today is viewed as pivotal to next-generation advances in data mining for bioinformatics where evaluation of mined results increasingly requires many minds. "Due to the nature and volume of life sciences data, the industry suffers from inefficient data sharing, information distribution, and decision-making procedures" [3]. Leveraging knowledge management experience with these practices could accelerate progress.

Collaborative filtering is discovery. When shoppers visiting online booksellers rate the usefulness of the book reviews posted there, the ratings contribute to a process of discovery for everyone, an ongoing referendum on the most valued reviewers. A similar process can be used inside an organization to discover which person's expertise and which reports are the most highly valued. If one employee finds value in an article, how many other employees in the organization might benefit from it too? There certainly are better ways to share such information than emailing a few friends [6]. Collaborative filtering is an ongoing discovery engine, discerning value and propagating the information across the enterprise.
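
To make the mechanism concrete, here is a minimal sketch, in Python, of rating aggregation as a discovery engine; the employee names, document names, and ratings are invented for illustration only.

    # Aggregate ratings so the most-valued documents surface for everyone.
    from collections import defaultdict

    # (employee, document, rating on a 1-5 scale) -- illustrative data only
    ratings = [
        ("alice", "quarterly-report", 5),
        ("bob",   "quarterly-report", 4),
        ("carol", "style-guide",      2),
        ("bob",   "style-guide",      3),
        ("alice", "design-notes",     5),
    ]

    def most_valued(ratings, min_raters=2):
        """Rank documents by mean rating, keeping only those rated by
        at least min_raters people."""
        by_doc = defaultdict(list)
        for _, doc, score in ratings:
            by_doc[doc].append(score)
        ranked = [(doc, sum(scores) / len(scores))
                  for doc, scores in by_doc.items()
                  if len(scores) >= min_raters]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)

    for doc, avg in most_valued(ratings):
        print(f"{doc}: {avg:.1f}")

Requiring a minimum number of raters is one simple guard against a single enthusiast dominating the referendum.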

With an estimated 80% of content unstructured today [8], the starting point for discovery is as likely to be the Internet or streaming video as an organization's own database management system. When I turn to a Web search engine, I'm rarely just browsing. What I really want is to discover information that's new to me in the estimated 4,200TB of Web content [9]. Such query-driven discovery grows more sophisticated every day, promising we'll someday be more effective at coping with information overload. Semiotics, natural language processing, and digital signal processing already provide the analytical power for commercial tools that produce feature-rich digital signatures to support the classification and clustering that aid discovery.
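
As an illustration of the signature idea, the following sketch (Python, with toy documents) reduces each text to a simple term-frequency signature and compares signatures by cosine similarity; commercial tools build far richer features than this.

    # Compute simple bag-of-words signatures and compare them pairwise.
    import math
    from collections import Counter

    docs = {
        "d1": "gene sequence data doubles every eighteen months",
        "d2": "genetic sequence databases are growing rapidly",
        "d3": "retail clickstream data exceeds hundreds of gigabytes daily",
    }

    def signature(text):
        """Bag-of-words term-frequency vector as a crude document signature."""
        return Counter(text.lower().split())

    def cosine(a, b):
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    sigs = {name: signature(text) for name, text in docs.items()}
    pairs = [(x, y, cosine(sigs[x], sigs[y]))
             for x in docs for y in docs if x < y]
    for x, y, sim in sorted(pairs, key=lambda p: -p[2]):
        print(f"{x} ~ {y}: {sim:.2f}")

In this toy run, the two sequence-related documents score highest, the kind of grouping that feeds later clustering and classification.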

Exploratory processes increase the pressure to validate each discovery. New patterns and relationships illuminated by data-driven discovery must be viewed only as starting points for later confirmation by testing hypotheses for validity in domains of experience. If we can be fooled by nature (see theories of phlogiston and the ether), we're even more likely to be fooled when our starting point is data about nature—a level removed from reality—or mediated knowledge in which "the action context is paved over by the data highway" [12].
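
One simple discipline for such confirmation is to hold data back from the exploratory step and check whether a discovered relationship persists on the held-out portion. The Python sketch below illustrates the idea on synthetic numbers; it is not a substitute for testing hypotheses in the domain itself.

    # Split synthetic data into a discovery set and a confirmation set, then
    # check that a correlation "discovered" in one also holds in the other.
    import random

    random.seed(0)
    data = [(x, 2.0 * x + random.gauss(0, 10)) for x in range(200)]
    random.shuffle(data)
    explore, confirm = data[:100], data[100:]

    def correlation(pairs):
        """Pearson correlation coefficient for a list of (x, y) pairs."""
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        cov = sum((x - mx) * (y - my) for x, y in pairs)
        vx = sum((x - mx) ** 2 for x, _ in pairs)
        vy = sum((y - my) ** 2 for _, y in pairs)
        return cov / (vx ** 0.5 * vy ** 0.5)

    print(f"discovered r = {correlation(explore):.2f}, "
          f"confirmed r = {correlation(confirm):.2f}")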

A discovery-informatics mindset encourages these practices of knowledge management, collaborative filtering, search and retrieval, and validation to be understood, evaluated, and interpreted for their contributions to the discovery process. Further help will come from wide-ranging fields, including: grid computing, fuzzy systems, intelligent agents, multimedia databases, optimization theory, inductive learning, regression analysis, taxonomy generation, data management, systems architecture, decision support systems, user-system interfaces, reengineering, information visualization, information security, and privacy protection.

Back to Top

People and Automation

Historical discoveries have been attributed to both conscious and subconscious mental activity. Perhaps the most celebrated instance of subconscious discovery involved the German chemist August Kekule falling asleep in front of a fireplace while watching sparks of light spiraling into the air. When he awoke, he thought the benzene molecule might be shaped like a ring. The explanation for this discovery was that "in the subconscious, rationality could not censor the connection" [4].

Until the advances of the past decade, our automated equivalent of human discovery was incomplete. Our use of automation for discovery corresponded to conscious human activity, directed at specific objectives, confirming hypotheses or seeking statistical verification for presumed relationships or analysts' hunches. With data-driven approaches, we expanded our capacity for augmented discovery by adding the automated counterpart to the role of human subconscious. Data-driven discovery is unconstrained by intentionality—free to seize upon associations and patterns where they are found to exist. Indeed, data-driven discovery looks particularly fruitful, because "most breakthroughs are based on linking information that usually is not thought of as related" [4]. Discovery informatics may help clarify relative contributions and roles of humans and automation as they work together on informatics-enhanced discovery.
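
A minimal sketch of this kind of unhypothesized association finding, in Python: count how often items co-occur across transactions and flag pairs that appear together more often than independence would predict (lift greater than 1). The transactions are invented, and production association mining adds support and confidence thresholds this sketch omits.

    # Flag item pairs whose co-occurrence exceeds what independence predicts.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"telescope", "star-chart", "coffee"},
        {"telescope", "star-chart"},
        {"coffee", "notebook"},
        {"telescope", "coffee", "notebook"},
        {"star-chart", "notebook"},
    ]

    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(frozenset(p) for t in transactions
                          for p in combinations(sorted(t), 2))

    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        lift = (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        if lift > 1:
            print(f"{a} & {b}: lift = {lift:.2f}")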

Back to Top

A New Methodology?

Discovery informatics therefore merits visibility as a new methodology. As the world of computing evolves, specialty areas take form and acquire presence. Tracing changes to the ACM Computing Classification System (see www.acm.org/class/) is instructive, providing a historical record of such adaptation. The 1964 Classification showed artificial intelligence (AI) as a subarea—along with natural sciences, engineering, and humanities—under the top-level category of Applications. The 1991 Classification System recognized AI as being essentially different: no longer just another application area for computing, but something orthogonal that could be directed at application areas rather than being one itself. AI encompassed special hardware, software, theory, integrated systems, languages, methods, practices, algorithms, and data structures. Simulation was recognized through a similar rationale. Both took their places in a new top-level category called Methodologies. Similarly, discovery informatics, embracing special hardware, software, integrated systems, algorithms, processes, and data structures, can target application areas, suggesting it is ready to take its place alongside AI and simulation as a new methodology.

Consulting the current Computing Classification System (1998), we find elements of discovery informatics well represented across the taxonomy. The figure here shows current descriptors that would contribute to a new methodology of discovery informatics, with further elaboration surely needed. Categories in the figure stand up well as parent disciplines of discovery informatics: information storage and retrieval, AI, document and text processing, pattern recognition, and database management.

While computing science and technology are relatively new, our terminology would have fit comfortably in the world of 1620 when Francis Bacon described a path to new knowledge that began by bringing together all the relevant data ("particulars relating to the subject under inquiry" [1]) into Tables of Discovery. Perhaps in 2020, with perfect hindsight, we will say that discovery informatics was a useful synthesis that led to advances in knowledge for the benefit of organizations and society.

Back to Top

References

1. Bacon, F. Novum Organum (1620); translated and edited by P. Urbach and J. Gibson. Open Court Publishing Co., Chicago, 1994.

2. Buchwald, A. The great data famine. In Down the Seine and Up the Potomac. G.P. Putnam's Sons, New York, 1977.

3. Chudnow, C. The challenges of data management in biotech. Comput. Tech. Rev. (July 2002), 28–31.

4. Csikszentmihalyi, M. Creativity. HarperCollins Publishers, New York, 1996.

5. Fayyad, U. and Uthurusamy, R. Evolving data mining into solutions for insights. Commun. ACM 45, 8 (Aug. 2002), 28–31.

6. Goldberg, D., Nichols, D., Oki, B., and Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM 35, 12 (Dec. 1992), 61–70.

7. Kim, J. Computers are from Mars, organisms are from Venus. Comput. 35, 7 (July 2002), 25–32.

8. Leavitt, N. Data mining for the corporate masses? Comput. 35, 5 (May 2002), 22–24.

9. Lyman, P., et al. How much information? University of California at Berkeley, School of Information Management and Systems (report); see www.sims.berkeley.edu/research/projects/how-much-info/how-much-info.pdf.

10. Postman, N. Informing ourselves to death. In Computers, Ethics, and Society, M. Ermann et al., Eds. Oxford University Press, New York, 1997.

11. Whiting, R. Web data piles up. InformationWeek (Sept. 19, 2002); see www.informationweek.com/785/database.htm.

12. Zuboff, S. The abstraction of industrial work. In Knowledge Management and Organizational Design, P. Myers, Ed. Butterworth-Heinemann, Boston, 1996.

Back to Top

Author

William W. Agresti ([email protected]) is an associate professor in the Graduate Division of Business and Management at Johns Hopkins University, Baltimore.

Back to Top

Figures

Figure. ACM Classification System descriptors relating to the new methodology of Discovery Informatics.

Back to Top


©2003 ACM  0002-0782/03/0800  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.



 
