In 2001, Dow Chemical Co. merged with Union Carbide Corp., requiring the integration of 35,000 Union Carbide research and development reports into Dow's document management system. Dow partnered with ClearForest Corp., a commercial developer of text-driven business solutions, to help integrate the new combined document collection. Using technology it had developed, ClearForest indexed the documents, identifying chemical substances, products, companies, and people for inclusion in the combined database. Dow was able to add more than 80 years' worth of Union Carbide research to its information management system and approximately 100,000 new chemical substances to its registry. When the project was complete, Dow estimated it had spent some $3 million less than it would have if it had used its own methods for indexing documents. Dow also estimated it had reduced the time it would have spent sorting documents by 50% and data errors by 10% to 15% [2].
This scenario is an example of how the world is changing when it comes to managing electronic information. In the future, books and magazines will be used only for special purposes, as electronic documents become the primary means of storing, accessing, and sorting written communication. As many fields become overwhelmed with information, it will become physically impossible for any individual to process all the information available on a particular topic. Massive amounts of data will reside in cyberspace, generating demand for text mining technology and solutions.
Text mining has been defined as "the discovery by computer of new, previously unknown, information by automatically extracting information from different written resources" [6]. The Dow-Union Carbide scenario reflects how text mining can be applied in practical business situations. Many other domains (such as health care, government, education, and manufacturing) can also benefit from text-mining tools. Here, we explore these technologies and their applications to guide organizations looking for the most effective text-mining solutions.
Text mining is similar to data mining, except that data mining is designed to handle structured data from databases or XML files, while text mining works with unstructured or semistructured data sets (such as email, full-text documents, and HTML files). As a result, text mining is a much better solution for companies (such as Dow) where large volumes of diverse information must be merged and managed. To date, however, most research and development efforts have centered on data mining using structured data.
Machine intelligence remains a central challenge for text mining. Natural language developed to help humans communicate with one another and record information, and computers are still a long way from comprehending it. Humans can distinguish and apply linguistic patterns to text, overcoming obstacles (such as slang, spelling variations, and contextual meaning) that computers do not handle easily. Meanwhile, although our language capabilities allow us to comprehend unstructured data, we lack the computer's ability to process text in large volumes at high speeds. The key to text mining is creating technology that combines a human's linguistic capabilities with the speed and accuracy of a computer.
Figure 1 outlines a generic process model for a text-mining application. Starting with a collection of documents, a text-mining tool retrieves a particular document and preprocesses it by checking its format and character sets. It then goes through text analysis, sometimes repeating techniques until the targeted information is extracted. Figure 1 outlines three text analysis techniques, but many other combinations of techniques can be used, depending on the goals of the organization. The resulting information can be placed in a management information system to yield knowledge for its users.
Technological advances are, however, beginning to close the gap between human and computer languages. The field of natural language processing has produced technologies that teach computers natural language, enabling them to analyze, understand, and even generate text. Technologies in the text-mining process include information extraction, topic tracking, summarization, categorization, clustering, concept linkage, information visualization, and question answering.
Information extraction. This technology represents a starting point for computers analyzing unstructured text, identifying key phrases and relationships within it. It does so by looking for predefined sequences in the text, a process called pattern matching. Consider, for example, the following sentence: "Area relatives of a man being held hostage in Iraq waited for word about him Saturday as militants threatened to decapitate him, another American, and a Brit unless demands were met within 48 hours." Information-extraction software should be able to identify two American hostages and a British hostage, militants, and the relatives of one of the hostages as people; Iraq as the place; and Saturday as the time. The software infers the relationships among all the identified people, places, and times to give the user meaningful information. The technology can be useful when dealing with large volumes of text. Almost all text-mining software uses information extraction, since it is the basis for many other text-mining technologies.
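The pattern matching the paragraph describes can be illustrated with a toy extractor. This is only a sketch: real information-extraction systems use far richer patterns and statistical models, and the entity lists below are invented for the hostage-sentence example.

```python
import re

# Toy pattern matcher: each entity type is a hand-written regular
# expression -- the simplest form of the "predefined sequences"
# described above. The word lists are illustrative, not exhaustive.
PATTERNS = {
    "place": re.compile(r"\bin (Iraq|Iran|France)\b"),
    "time": re.compile(r"\b(Saturday|Sunday|Monday)\b"),
    "person": re.compile(r"\b(militants|relatives|hostage)\b"),
}

def extract(text):
    """Return {entity_type: [matches in text order]} for every pattern that fires."""
    found = {}
    for label, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[label] = hits
    return found

sentence = ("Area relatives of a man being held hostage in Iraq waited "
            "for word about him Saturday as militants threatened him.")
entities = extract(sentence)
```

A production system would then infer relationships among the extracted entities; here the extractor stops at identification.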
Topic tracking. A topic-tracking system keeps user profiles and, based on the documents a user views, predicts other documents of interest to the user. Yahoo offers a free topic-tracking tool (alerts.yahoo.com) that allows users to choose keywords and notifies them when news relating to the topics becomes available. Topic-tracking technology also has limitations; for example, users who set up an alert for text mining will receive several news stories on mining for minerals and few on text mining. Some of the more effective text-mining tools let users select particular categories of interest; the software even infers users' interests based on their reading histories and click-through information they've left behind online.
Topic tracking can be used to, say, alert a company whenever a competitor is in the news, allowing it to keep up with competitive products or changes in the market. Similarly, a company might want to track news on itself and on its own products. It could also be used in the medical industry by physicians and other care providers looking for new treatments for illnesses and those simply wishing to keep up on the latest research in the field. Educators could use topic tracking to be sure they have the latest references for research in their areas of interest.
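A minimal keyword-alert sketch shows the idea behind such profiles. Matching the full phrase "text mining" rather than the bare word "mining" avoids the mineral-mining false positives mentioned above; the profile phrases and news items are invented for illustration.

```python
# Each user profile is a list of phrases; an incoming document triggers
# an alert if it contains any of them. Real trackers also learn from
# reading histories and click-throughs, which this sketch omits.
def matches_profile(document, profile_phrases):
    text = document.lower()
    return [p for p in profile_phrases if p in text]

profile = ["text mining", "natural language processing"]
news = [
    "New text mining tool released for biomedical literature.",
    "Copper mining output rises in Chile.",
]
alerts = [doc for doc in news if matches_profile(doc, profile)]
```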
Summarization. Text summarization helps users figure out whether a lengthy document meets their needs and is worth reading. With large-volume texts, text-summarization software processes and summarizes the document in the time it would take the user to read the first paragraph. The key to summarization is reducing the length and detail of a document while retaining its main points and overall meaning. The challenge for text-mining application developers is that, although computers are able to identify people, places, and times, it is still difficult to teach software to analyze semantics and interpret meaning. When we humans summarize text, we generally read an entire selection to develop our understanding, then write a summary highlighting its main points. Since computers lack human language capabilities, alternative methods are needed.
Sentence extraction, a strategy widely used by text-summarization tools, extracts important sentences from a given text by statistically weighting all the sentences in the text. Other heuristics (such as position information) are also used for summarization; for example, summarization tools may extract the sentences following the key phrase "in conclusion," after which typically follow a document's main points. Summarization tools may also search for headings and other markers of subtopics in order to identify the document's key points; Microsoft Word's AutoSummarize function is a simple example of text summarization. Many text-summarization tools allow users to choose the percentage of the total text they want extracted as a summary.
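The sentence-extraction strategy just described can be sketched in a few lines: sentences are weighted by the summed frequency of their words, and the top-scoring ones are kept in their original order. This is a minimal statistical heuristic, not a full summarizer; position cues such as "in conclusion" would be layered on top in practice.

```python
import re
from collections import Counter

def summarize(text, ratio=0.3):
    """Extract the top-scoring sentences, preserving document order.

    Each sentence's weight is the sum of its words' frequencies in the
    whole text, so sentences dense in recurring terms rank highest.
    `ratio` is the fraction of sentences to keep (at least one).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w]
                           for w in re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(ranked[: max(1, int(len(sentences) * ratio))])
    return " ".join(sentences[i] for i in keep)
```

As in many commercial tools, the caller chooses what percentage of the text to extract via `ratio`.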
Summarization can work with topic-tracking tools and categorization tools to summarize the documents retrieved on a particular topic. If organizations, medical personnel, or researchers were given hundreds of documents addressing their topic of interest, then summarization tools could be used to reduce the time they would have to spend sorting through them. Individual users would thus more quickly assess the relevance of the information to the topic.
Categorization. Categorization involves identifying the main themes of a document [10]. When categorizing particular documents, a computer program often treats them as a "bag of words." The program does not attempt to process the actual information as information extraction does. Rather, categorization counts only words that appear and, from the counts, identifies the main topics covered in the document. Categorization often relies on a thesaurus for which topics are predefined and relationships identified by looking for broad terms, narrower terms, synonyms, and related terms. Categorization tools normally have a method for ranking the documents in order of which documents have the most content on a particular topic.
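The "bag of words" approach can be made concrete with a small sketch. Topics are predefined term lists (a tiny stand-in for a thesaurus), and a document's score for a topic is simply the count of those terms it contains; the topics and terms below are invented for illustration.

```python
import re
from collections import Counter

# Predefined topics, each a set of thesaurus-style terms. No semantics
# are processed -- only occurrence counts, as the article describes.
TOPICS = {
    "chemistry": {"chemical", "substance", "reaction", "compound"},
    "finance": {"bank", "credit", "transaction", "customer"},
}

def categorize(document):
    """Rank topics by how many of their terms appear in the document."""
    words = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {topic: sum(words[t] for t in terms)
              for topic, terms in TOPICS.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

The ranked output corresponds to the document ordering such tools produce, with the most topical documents first.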
As with summarization, categorization can be used with topic tracking to further specify the relevance of a document to a person seeking information on a particular topic. The documents returned from topic tracking could be ranked by content weights so users could give priority to the most relevant ones first. Categorization can be used in a number of application domains. For example, many businesses and industries provide customer support or must answer customer questions on a variety of topics. If they use categorization schemes to classify the documents by topic, then customers and end users will be much better able to access the information they seek.
Clustering. Clustering is a technique used to group similar documents, but it differs from categorization in that documents are clustered on the fly instead of through predefined topics. Documents can also appear in multiple subtopics, ensuring that useful documents are not omitted from the search results. A basic clustering algorithm creates a vector of topics for each document and measures the weights of how the document fits into each cluster; for example, a user who goes to www.clusty.com, a Web search engine site powered by Vivisimo, a clustering engine, and types "saturn" in the search field is likely to get back "planet," "photo," "car," and "performance." Users can quickly narrow the documents by identifying which topics are relevant to the search and which are not. Clustering technology can be useful in management information systems holding thousands of documents, as in the Dow-ClearForest example.
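A single-pass sketch conveys how documents can be grouped on the fly with no predefined topics: each document joins the first existing cluster whose centroid it resembles, otherwise it starts a new cluster. The stopword list, similarity threshold, and sample documents are illustrative choices, not taken from any particular product.

```python
import math
import re
from collections import Counter

STOP = {"the", "a", "is", "in", "and", "has", "of"}  # tiny illustrative list

def vectorize(doc):
    """Term-frequency vector of a document, minus stopwords."""
    return Counter(w for w in re.findall(r"[a-z]+", doc.lower())
                   if w not in STOP)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.35):
    """Single-pass clustering: groups emerge from the documents themselves."""
    clusters = []  # list of (centroid_vector, [doc indices])
    for i, doc in enumerate(docs):
        v = vectorize(doc)
        for centroid, members in clusters:
            if cosine(v, centroid) >= threshold:
                members.append(i)
                centroid.update(v)  # fold the document into the centroid
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

Run on four short "saturn" snippets, the planet documents and the car documents fall into separate groups, echoing the Clusty example.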
Concept linkage. Concept-linkage tools connect related documents by identifying their shared concepts, helping users find information they perhaps wouldn't have found through traditional search methods. It promotes browsing for information rather than searching for it. Concept linkage is valuable in text mining, especially in biomedicine where so much research is being done it is impossible for any individual researcher to read it all and relate it to other research. Concept-linking software can identify links between diseases and treatments when humans can't; for example, text-mining software may easily identify a link between topics X and Y and another between Y and Z, but it might also detect a potential link between X and Z, something human researchers have not yet come across due to the volume of information they would have to sort through to make the connection.
A well-known nontechnological example is from Don Swanson, a professor at the University of Chicago, whose research in the 1980s identified magnesium deficiency as a contributing factor in migraine headaches [9]. Swanson looked at articles with titles containing the keyword "migraine," then culled the keywords that appeared at a certain significant frequency within the documents. One such keyword term was "spreading depression." He then looked for titles containing "spreading depression" and repeated the process with the text of the documents. He identified "magnesium deficiency" as a keyword term, hypothesizing that magnesium deficiency was a factor contributing to migraine headaches. No direct link between magnesium deficiency and migraines could be found in the medical literature, and no previous research had been done suggesting the two were related. The hypothesis was made only by linking related documents from migraines to those covering spreading depression to those covering magnesium deficiency. The direct link between magnesium deficiency and migraine headaches was later proved valid through scientific experiments, showing that Swanson's linkage methods could be a valuable process in other medical research.
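Swanson's chain of reasoning can be sketched as a co-occurrence intersection: terms that co-occur with the start topic (the A-B links) are intersected with terms that co-occur with the candidate topic (the B-C links), surfacing an indirect A-C connection even when no single document mentions both. The two-document "literature" below is a deliberately tiny invention.

```python
def cooccurring(term, documents):
    """All other terms appearing in documents that mention `term`."""
    linked = set()
    for doc in documents:
        if term in doc:
            linked.update(doc)
    linked.discard(term)
    return linked

# Each document is reduced to its significant keyword terms.
literature = [
    {"migraine", "spreading depression"},
    {"spreading depression", "magnesium deficiency"},
]

bridge = cooccurring("migraine", literature)              # B terms for A
hidden = cooccurring("magnesium deficiency", literature)  # B terms for C
shared = bridge & hidden  # a shared bridge term suggests an A-C hypothesis
```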
The work Swanson did by hand mimicked the concept-linkage technology text-mining products provide today, showing how valuable they can be in medical research. Experiments similar to Swanson's have been replicated through automated tools that can be applied to text mining [4]. Within the next 10 years, we expect that text-mining tools with concept-linkage capabilities will help researchers discover new treatments by associating treatments that have been used in related fields.
Information visualization. Visual text mining, or information visualization, puts large textual sources in a visual hierarchy or map and provides browsing capabilities, in addition to simple searching. An example is the Web-based KartOO metasearch engine [7] (see Figure 2).
Governments, police departments, and intelligence agencies might all be able to use information visualization to identify terrorist networks or find information about crimes previously considered unconnected. It might provide them with a map of possible relationships between suspicious activities and potential perpetrators, helping them investigate connections they would not have come up with on their own. Text mining has been shown to be useful in academic areas [1], too, where authors are able to identify and explore papers, articles, and books in which their own publications are referenced.
Question answering. Another application area of natural language processing is natural language queries, or question answering (Q&A), which deals with how to find the best answer to a given question [8]. Many Web sites equipped with question-answering technology allow users to "ask" the computer questions and be given answers. MIT is often credited with implementing, in 1993, the first Web-based natural language question-answering system, START (www.ai.mit.edu/projects/infolab/).
Q&A utilizes multiple text-mining techniques; for example, it can use information extraction to extract entities (such as people, places, and events) and question categorization to assign questions into known types (such as who, where, when, and how). In addition to Web applications, companies can use Q&A techniques internally to help their employees searching for answers to common questions. Applications in education and medicine might also find uses for Q&A when people frequently ask certain questions.
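The question-categorization step mentioned above can be sketched as a lookup from interrogative word to expected answer type; a Q&A system then pairs that type with entities pulled out by information extraction. The mapping below is illustrative only.

```python
# Map interrogative words to the entity type the answer should have.
QUESTION_TYPES = {
    "who": "person",
    "where": "place",
    "when": "time",
    "how": "method",
}

def classify_question(question):
    """Assign a question to a known type based on its first word."""
    first = question.lower().split()[0]
    return QUESTION_TYPES.get(first, "unknown")
```

A "where" question, for instance, would be answered by searching the extracted place entities.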
Table 1 and Table 2 list major vendors that have developed text-mining technologies, along with the features implemented in their tools. Some of them (such as ClearForest) focus exclusively on text-mining tools, whereas for larger ones (such as IBM and SPSS), the tools represent only a small portion of the software they develop and market.
Table 3 does not include all examples of text-mining industries or applications but represents some of the most likely applications in medicine, business, government, and education. Data mining has been shown to be useful in telecommunications, geospatial data sets, biomedical engineering, and climate data [5], so there is definite potential for extending text mining to these areas, as well as to others in the future.
As the amount of unstructured data increases, text-mining tools that sift through it will be increasingly valuable. For example, such tools are beginning to be applied in biomedicine, where the volume of information on particular topics makes it impossible for any individual researcher to cover it all, much less explore related texts. Text-mining methods are also useful to government intelligence and security agencies trying to piece together terrorist warnings and other security threats before they have a chance to occur. Education is another area that benefits; students and educators are better able to find information relating to their topics than they would be through traditional ad hoc search.
For text-mining developers, business applications may represent the most promising target. Many businesses have overwhelming amounts of information they are unable to use because they have no reasonable way to analyze it. Text-mining tools can help them analyze their competition, customer base, and marketing strategies. However, in order to deploy new text-mining projects successfully, these businesses must follow sound project-management guidelines.
A future trend is likely to involve integration of data mining and text mining into a single system, a combination known as duo-mining [3]. SAS and SPSS are two major data mining vendors that have been recommending duo-mining to their customers looking for an edge on using consolidated information for better decision making. This combination has proved especially useful in banking and credit card customer relationship management. Instead of analyzing only the structured data they collect from transactions, these companies can add call logs associated with customer service and further analyze customer spending patterns from the text-mining side. These new developments in text-mining technology, beyond simple search methods, are the key to information discovery and promise support in all areas. Companies with vast document collections sitting idle should consider investing in text-mining applications that would help them analyze their documents and provide payback with the information they provide.
1. Bollacker, K., Lawrence, S., and Giles, C. A system for automatic personalized tracking of scientific literature on the Web. In Proceedings of the Joint Conference on Digital Libraries (Berkeley, CA). ACM Press, New York, 1999.
2. ClearForest. ClearForest-Dow Chemical Case Study. ClearForest Corp., Waltham, MA, 2004; www.clearforest.com/Customers/Dow.asp.
3. Creese, G. Duo-mining: Combining Data and Text Mining (Sept. 2004); www.dmreview.com/article_sub.cfm?articleId=1010449.
4. Gordon, M., Lindsay, R., and Fan, W. Literature-based discovery on the WWW. ACM Transactions on Internet Technology 2, 4 (2002), 262-275.
5. Han, J., Altman, R., Kumar, V., Mannila, H., and Pregibon, D. Emerging scientific applications in data mining. Commun. ACM 45, 8 (Aug. 2002), 54-58.
6. Hearst, M. What Is Text Mining?; www.sims.berkeley.edu/~hearst/text-mining.html.
7. KartOO Metasearch Engine. KartOO S.A., Clermont Ferrand, France; www.kartoo.com.
8. Radev, D., Libner, K., and Fan, W. Getting answers to natural language queries on the Web. Journal of the American Society for Information Science and Technology 53, 5 (2002), 359-364.
9. Swanson, D. Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science 38, 4 (1987), 228-233.
10. Yang, Y. and Pedersen, J. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, 1997, 412-420.
Figure 1. An example of text mining.
Figure 2. Visualization result for the query "text mining" from the KartOO metasearch engine (www.kartoo.com).
Table 1. Text-mining technologies offered by commercial vendors.
Table 2. Vendor Web sites and text-mining products.
Table 3. Examples of where text-mining tools can be applied in medicine, business, government and education.
©2006 ACM 0001-0782/06/0900 $5.00