Knowledge management efforts over the past decade have produced many document collections focused on particular domains. As such systems scale up, they become unwieldy and ultimately unusable if obsolete and redundant content is not continually identified and removed.
We are working with such a knowledge-sharing system at Xerox, focused on the repair of photocopiers. Called Eureka, it now contains about 40,000 technician-authored free text documents, in the form of tips on issues not covered in the official manuals. Figure 1 shows a pair of similar tips from this corpus. Our goal is to build a system that can identify such conceptually similar documents, regardless of how they are written; identify the parts of two documents that overlap; and identify parts of the documents that stand in some relation to each other, such as expanding on a particular topic or being in mutual contradiction. Such a system will enable the maintenance of vast document collections by identifying potential redundancies or inconsistencies for human attention.
This task requires extensive knowledge about language and the world, and a rich representation language. Moreover, assessing similarity imposes conflicting requirements on the underlying ontology. On one hand, the representations must capture enough of the nuances of natural language to be sufficiently discriminating; on the other, the ontology must support the normalization of differing representations of similar content to enable the detection of similarities.
At this point in our research, we have built a prototype system that embodies the knowledge necessary to analyze a test set of 15 pairs of similar tips. Although this system is far from complete, in the course of our work we have developed several design criteria for ontologies that support comparisons of natural language texts. In [2], we discuss the need for reified contexts to handle the representation of nonexistent situations and objects, and how reasoning with types and their instantiations can help. In this article, we focus on ways to produce normalized representations in our ontology from a wide range of different ways of expressing the same idea. We then describe a particular mechanism for normalizing frequently occurring comparative constructions, such as x is deeper than y and y is shallower than x, to a common representation.
To create useful representations of natural language text, we first obtain a compact representation of the syntactic and semantic structures for each sentence, using the Xerox Linguistic Environment, a deep parser based on Lexical Functional Grammar theory [3]. From these sentence structures, we automatically construct conceptual representations of the text based on our ontology.
A constrained graph matching of these representations determines the overall degree of conceptual similarity between two texts, while also identifying areas of overlap and conflict. For matching, we use the Structure Mapping Engine (SME) [6], an implementation of the structure mapping theory of analogy [7]. SME anchors its matching process in identical elements that occur in the same structural positions in the base and target representations. From this, it builds a correspondence subject to two constraints: preservation of one-to-one correspondences between base and target elements, and identicality of the aforementioned anchors. The larger the structure that can be recursively constructed in this manner, the greater the similarity score.
Reasoning systems that receive well-specified input can utilize carefully constrained ontologies that capture exactly the set of concepts necessary for the task at hand. In contrast, our ontology must be more expressive, in order to accept arbitrary natural language input (albeit with strong expectations about the nature of the content). At one extreme, we could define a separate concept for each sense of each word. For example, we might define BreakDamage, BreakInterrupt, and BreakRecuperate to represent the meaning of break in the following sentences:
This approach is tantamount to modeling the language in which the tips are written. It would require a vast ontology, as we would also have to represent equivalence classes, to match synonymous words. WordNet [5], a lexical database that is grounded in cognitive theories of human memory, provides an example of this approach. It contains on the order of 110,000 synsets, or classes of synonymous words.
However, even this fine-grained coverage is inadequate for our purposes. For example, none of the seven senses in WordNet for the word remove captures the use of the word to denote a cleaning event in the sentence
This is not a shortcoming of WordNet, because there is no sense of remove per se that denotes cleaning. The inference of a cleaning activity arises from world knowledge applied to the entire sentence. In this case, what is being removed, toner, is a form of dirt, and removing dirt is plausibly an abstract description of a cleaning event.
WordNet maps individual words onto synsets, not ontological concepts. For our purposes, such a mapping is inadequate. We need a richer relational structure that only an ontology can support and the means to compose concepts dynamically. (However, there are interesting applications of WordNet for semiautomatically constructing ontologies, such as in [4].)
Natural language texts generally contain descriptions of causally or sequentially related events in which entities play particular roles. An ontology that supports the representation of textual content must therefore at least comprise events and entities. Events have a richer structure than objects, including critical role relations for their participants, so an adequate ontology must also include such relations. For example, a cleaning event has roles for the agent doing the cleaning, the object being cleaned, the instrument for accomplishing the cleaning, and possibly the dirt or other contamination that is being removed.
Having these thematic roles in our ontology enables us to abstract away from grammatical relations in our representations of natural language texts. This enables us to compare events described at differing levels of specificity. For example, cleaning object y with instrument x is a more specific kind of event than cleaning object y or cleaning with instrument x, which are in turn more specific event types than cleaning. A description logic approach (see, for example, [1]) is well suited for capturing such distinctions systematically and economically, by enabling the composition of new subconcepts within the ontology.
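The specificity ordering among role-bearing event descriptions can be illustrated with a minimal Python sketch. It is not the description logic used in our system; the class `EventDescription`, the function `is_at_least_as_specific`, and the role names `object` and `instrument` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventDescription:
    """An event type plus the thematic roles that a text fills in (illustrative only)."""
    event_type: str
    roles: tuple = ()  # tuple of (role_name, filler) pairs

    def role_map(self):
        return dict(self.roles)


def is_at_least_as_specific(specific, general):
    """True if `specific` has the same event type as `general` and fills
    every role that `general` fills, with the same filler."""
    if specific.event_type != general.event_type:
        return False
    filled = specific.role_map()
    return all(filled.get(role) == filler
               for role, filler in general.role_map().items())


# The four descriptions from the cleaning example.
clean = EventDescription("Cleaning")
clean_y = EventDescription("Cleaning", (("object", "y"),))
clean_with_x = EventDescription("Cleaning", (("instrument", "x"),))
clean_y_with_x = EventDescription("Cleaning", (("object", "y"), ("instrument", "x")))
```

Under this sketch, `clean_y_with_x` is at least as specific as both `clean_y` and `clean_with_x`, which are each at least as specific as `clean`, while `clean_y` and `clean_with_x` are not comparable with each other.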
The question still remains of which types of events to represent, or more generally, the appropriate resolution of the ontology. For example, consider the event of damage or incapacitation of some sort. English contains around 40 verbs for events of incapacitation. Some, such as trample, cut, and rip, indicate the means, but sanction no inference about the resulting extent of the damage. Others, such as ruin, raze, and destroy, indicate damage of uncertain means but ultimate extent. Yet others, such as splinter, shatter, and crack, indicate the final state and possibly the nature of the material in question.
Figure 2 presents one possible representation of such events. The leaves of this tree are concepts that correspond to the Incapacitation sense of their label (so, for instance, we exclude the golfing sense of slice from consideration). Which details are relevant, and thus should enter the representation, ultimately depends on the nature of the document collection being represented. For example, an event in which a car door is dented is plausibly mundane, perhaps a parking lot fender-bender. In contrast, an event in which a car door is torn off is probably the result of a severe accident.
Such inferences arise from combinations of events with particular entity types in role relationships. Representing the underlying knowledge would require thousands of axioms. Rather than focus on this aspect of the problem, our approach is to start with a basic ontology and add detail to improve the system's performance. We conjecture that complex inferences, such as those about the car door previously mentioned, will often arise more directly from other parts of the text.
Notice that even without representing the difference between tear and dent, Figure 2 presents three levels of abstraction, from Incapacitate down to ForceDestroy and its sibling concepts. Consider the following, to illustrate how simple differences in expression can drive representations far apart in this ontology:
With the ontology depicted in Figure 2, the drive gear in (5) would participate in a Destroy event, whereas the gear in (6) would be the object of state change in a ForceDamage event.
However, each of these sentences contains a short causal sequence:
(7) ShortCircuitEvent → HeatingEvent¹ → Burn(Gear)
(8) ShortCircuitEvent → HeatingEvent → Crack(Gear)
Clearly, these sentences are similar, despite the different representation of the damage to the gear. However, matching (7) and (8) will require a more complex comparison operator than identicality-anchored correspondences. One approach is minimal ascension in the ontological hierarchy. In this case, Burn and Crack share a common ancestor, Damage, so at a higher level of abstraction, representations of these sentences match.
Traversing subsumption relations in search of a common ancestor will not always result in an accurate similarity assessment. Obviously, anything can be matched to anything else at a sufficiently high level of abstraction (such as Thing or Event). One way to address this problem would be to assign diminishing weights to matches in proportion to the number of taxonomic links traversed.
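Minimal ascension with link-based discounting can be sketched in a few lines of Python. The taxonomy below is a hypothetical fragment in the spirit of Figure 2, not a transcription of it, and the exponential `decay` factor is an assumed weighting scheme; the function names are likewise illustrative.

```python
# Hypothetical child -> parent links, loosely modeled on Figure 2.
PARENT = {
    "Damage": "Incapacitate",
    "Destroy": "Incapacitate",
    "ForceDamage": "Damage",
    "ThermalDamage": "Damage",
    "Crack": "ForceDamage",
    "Tear": "ForceDamage",
    "Burn": "ThermalDamage",
    "Ruin": "Destroy",
}


def ancestors(concept):
    """Return [concept, parent, grandparent, ...] up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def minimal_ascension(a, b):
    """Return (lowest common ancestor, number of links traversed),
    or (None, None) if the two concepts share no ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return node, steps_a + up_b.index(node)
    return None, None


def ascension_score(a, b, decay=0.5):
    """Discount a match by an assumed factor of `decay` per link traversed."""
    node, links = minimal_ascension(a, b)
    if node is None:
        return 0.0
    return decay ** links
```

In this fragment, `minimal_ascension("Burn", "Crack")` returns `("Damage", 4)`, while `minimal_ascension("Crack", "Ruin")` returns `("Incapacitate", 5)`. The score depends only on the link count, which is exactly the limitation of a purely distance-based weighting.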
This may help, but will not resolve the problem. At issue is the nature of the taxonomic relation, which can vary considerably in degree and type over the hierarchy. For example, the distinction made in Figure 2 between ForceDamage and ThermalDamage is one of means, but the distinction between Damage and Destroy is one of extent. Even worse, the Disable event may introduce a dimension of intentionality. In some contexts, a Crack event may be more similar to a Ruin event than a Tear event, even though Crack and Tear are siblings, and Ruin is at a different level in the hierarchy five links away from Crack.
Description logic formalisms allow for arbitrary binary relations, or roles, to hold between concepts. We can take advantage of this mechanism to minimize the number of distinct concepts, thus reducing or eliminating the need to traverse subsumption links. The role relationships enable us to retain the resolution we lose in reducing the number of concepts by making the remaining concepts richer. For example, we can define a single Damage concept that has four roles: Extent, Material, EndState, and Means, as shown in Figure 3. The representation of a Melt event becomes a Damage event that has a Flammable Material, a Deformed EndState, a Means of Heating, and either a Partial or a Total Extent.
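A single role-bearing Damage concept can be approximated with a small Python record. This is a sketch under assumptions, not our description logic encoding: the class `Damage`, the helpers `melt` and `agreeing_roles`, and the use of plain strings for role fillers are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Damage:
    """One Damage concept with the four roles named in the text.
    A role left as None is simply unspecified by the source text."""
    extent: Optional[str] = None
    material: Optional[str] = None
    end_state: Optional[str] = None
    means: Optional[str] = None


def melt(extent):
    """Build the Melt description from the text as a Damage event."""
    if extent not in ("Partial", "Total"):
        raise ValueError("extent must be 'Partial' or 'Total'")
    return Damage(extent=extent, material="Flammable",
                  end_state="Deformed", means="Heating")


def agreeing_roles(a, b):
    """Return the names of roles that both events specify identically."""
    roles_a, roles_b = asdict(a), asdict(b)
    return {name for name, value in roles_a.items()
            if value is not None and value == roles_b[name]}
```

Because the roles are exposed as ordinary fields, two events can be compared role by role: a partial and a total melt agree on `material`, `end_state`, and `means`, and differ only in `extent`.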
The ontology can express more specific concepts, so Melt may well be reified, particularly if it occurs frequently in the domain. However, there is a middle level of abstraction, where concepts are broad enough to maximize the likelihood of matching yet specific enough to minimize the likelihood of spurious matches. Damage is an example of such a concept, and therefore has the value of MidLevel for the metaproperty CategoryLevel.² It is a design criterion for our ontology that it can support the use of such metaproperties.
The matching process starts by looking for matches between MidLevel concepts. Failure here is strong evidence of dissimilarity. Success at the MidLevel, however, requires either a match between LowLevel categories (such as Melt), or a more expensive comparison of the properties of the base and target events, to ensure that there are no conflicts (for example, our knowledge base may mark an EndState of Pierced as incompatible with an EndState of Torn). Since most texts are different, we only incur this greater cost for promising candidates.
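The staged comparison can be outlined as follows. This Python sketch only illustrates the control flow; the tables `MID_LEVEL` and `INCOMPATIBLE`, the low-level concept names other than Melt, and the result labels are assumptions made for the example rather than contents of our knowledge base.

```python
# Hypothetical mapping from LowLevel concepts to their MidLevel category.
MID_LEVEL = {
    "Melt": "Damage",
    "Crack": "Damage",
    "Tear": "Damage",
    "Pierce": "Damage",
    "Wipe": "Clean",
}

# Hypothetical incompatibility table: (role, unordered pair of fillers).
INCOMPATIBLE = {("EndState", frozenset({"Pierced", "Torn"}))}


def roles_conflict(base_roles, target_roles):
    """True if some role has fillers that are marked as incompatible."""
    for role, value in base_roles.items():
        other = target_roles.get(role)
        if other is None or other == value:
            continue
        if (role, frozenset({value, other})) in INCOMPATIBLE:
            return True
    return False


def staged_match(base, target):
    """Compare two events given as (low_level_concept, roles_dict).

    Stage 1: cheap MidLevel test; failure means 'dissimilar'.
    Stage 2: accept at once if the LowLevel concepts are identical.
    Stage 3: otherwise pay for a role-by-role conflict check.
    """
    base_low, base_roles = base
    target_low, target_roles = target
    if MID_LEVEL[base_low] != MID_LEVEL[target_low]:
        return "dissimilar"
    if base_low == target_low:
        return "low-level match"
    if roles_conflict(base_roles, target_roles):
        return "conflict"
    return "compatible"
```

Only pairs that survive the MidLevel test reach the role comparison, so the expensive step runs on a small fraction of candidate pairs. Note that in this sketch differing fillers count as a conflict only when the table says so.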
The description logic approach to the construction of our ontology provides an advantage here by exposing all the properties of events to the similarity algorithm, thus enabling a fine-grained comparison. Plausible reasoning algorithms can efficiently produce similarity assessments that are explicitly based on the particular properties of the events.
Matching based on midlevel concepts and role comparisons is one mechanism for assessing similarity, but we need others to handle some common linguistic constructions. For example, comparisons occur frequently in our domain, and require additional representational machinery, which we now describe.
The tip on the left in Figure 1 states that removing the plastic sheath from the cable makes it more flexible, which prevents it from breaking. In the tip on the right, we find the plastic makes the cable too stiff, which causes it to snap. In both cases, the underlying situation is the same: the rigidity of the cable is too high for normal operation, with a similar end result, namely that the cable breaks. At issue is how we are to determine that the descriptions, one containing more flexible and the other too stiff, are in fact similar.
Natural language often contains two terms for a given dimension, such as high/low for height, deep/shallow for depth, or hot/cold for temperature, where one term implicitly encodes high values and the other low values along the dimension. To enable matching, the ontology must reify dimensions. This places a greater burden on the mechanism for transforming natural language text into our representation.
Specifically, we must represent a unique role for dimension concepts: a polarity, or normative direction of comparison. Consider the dimension of Rigidity, which we can describe qualitatively as a scalar value that has a range from Low to High. Making a cable more flexible results in a decrease in Rigidity, a movement along the dimension toward the Low end. In contrast, a cable that is too stiff has a High degree of Rigidity. The polarity of Rigidity, therefore, explicitly marks the High extreme of the scale as positive.
We also need to represent knowledge of the polarity implicit in dimensional adjectives. For example, High flexibility is equivalent to Low Rigidity, so we represent flexible as a negative-polarity predication. This knowledge about the dimension and its associated adjectives enables the transformation mechanism to produce a normalized representation of comparisons.
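The interaction between adjective polarity and the direction of change can be made concrete with a minimal Python sketch. The lexicon entries, the `DimensionalAdjective` class, and the `normalize_change` function are hypothetical; only the Rigidity example and the convention that the High end is positive come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionalAdjective:
    dimension: str  # reified dimension the adjective predicates over
    polarity: int   # +1 points toward the High end, -1 toward the Low end


# Hypothetical lexicon fragment; High values are taken as positive.
LEXICON = {
    "stiff": DimensionalAdjective("Rigidity", +1),
    "flexible": DimensionalAdjective("Rigidity", -1),
    "deep": DimensionalAdjective("Depth", +1),
    "shallow": DimensionalAdjective("Depth", -1),
}


def normalize_change(adjective, direction):
    """Map 'more <adjective>' (direction=+1) or 'less <adjective>'
    (direction=-1) onto a change along the underlying dimension."""
    entry = LEXICON[adjective]
    signed = entry.polarity * direction
    return entry.dimension, "Increase" if signed > 0 else "Decrease"
```

With this lexicon, `normalize_change("flexible", +1)` and `normalize_change("stiff", -1)` both yield `("Rigidity", "Decrease")`, so "more flexible" and "less stiff" receive the same normalized form.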
The choice of which extreme of a dimension to consider positive is arbitrary, so long as all representations adhere to the same convention. In many cases, language usage provides information to guide the choice. For example, (9) is felicitous, whereas (10) is not:
Taking this cue from the language to define large values of depth as positive, we reduce the chance of knowledge engineering mistakes stemming from intuitive understanding of the predicates. Ultimately, we hope to use such evidence to make inferences about the author's intent.
The next step after dimensional normalization is the representation of the comparison. There are three types of comparisons: to another entity; to a quantity, either a numerical or a landmark value such as the boiling point of water; and to an extreme degree. In the case of comparisons across entities, the desired representation for both
and
is
(GreaterThan (Depth UpperSocket) (Depth LowerSocket))
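As a rough illustration of how two surface comparatives can converge on this form, the following Python sketch maps statements of the pattern "x is deeper than y" and "y is shallower than x" to the same expression. The `COMPARATIVES` table and the function `normalize_comparative` are hypothetical and stand in for the transformation mechanism, which operates on parsed sentence structures rather than on word triples.

```python
# Hypothetical table: comparative adjective -> (dimension, polarity).
COMPARATIVES = {
    "deeper": ("Depth", +1),
    "shallower": ("Depth", -1),
    "stiffer": ("Rigidity", +1),
    "more flexible": ("Rigidity", -1),
}


def normalize_comparative(subject, comparative, reference):
    """Render '<subject> is <comparative> than <reference>' as a
    GreaterThan expression over the reified dimension."""
    dimension, polarity = COMPARATIVES[comparative]
    if polarity > 0:
        greater, lesser = subject, reference
    else:
        # A negative-polarity adjective reverses the order of the arguments.
        greater, lesser = reference, subject
    return f"(GreaterThan ({dimension} {greater}) ({dimension} {lesser}))"


a = normalize_comparative("UpperSocket", "deeper", "LowerSocket")
b = normalize_comparative("LowerSocket", "shallower", "UpperSocket")
```

Both `a` and `b` evaluate to the string `(GreaterThan (Depth UpperSocket) (Depth LowerSocket))`, so an identicality-anchored matcher can align them directly.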
Comparison to a norm is particularly common in texts concerning repairs. Consider this sentence, from the left side of Figure 1:
Pivoting is a normal function of the left cover; it is only that it has pivoted too far, that is, beyond its normal functional range, that there is a problem. The representation for this is along the lines of:
(ExcessiveHighAmount (AngularDistance RotationEvent))
(ObjectRotating RotationEvent LeftCover)
Note that in this case the system must infer the relevant dimension, AngularDistance, from the RotationEvent and the too far comparison. Reified dimensions enable the system to represent this as the dimension role for this RotationEvent.
Finally, the representation of extreme degree, as in
is along the lines of
(ExtremeLowAmount (Rigidity Cable))
In contrast to representations of excessive amounts, an extreme amount does not sanction an inference of abnormal function or failure. For example, a very flexible cable might be highly desirable, which would not be the case for a cable that is too flexible.
We have discussed two design criteria for ontologies to support our task of finding similarities and redundancies across documents: the use of metaproperties in an ontology to support tasks such as identification of the appropriate level of abstraction in representation, and the normalization of dimensions for comparatives. In general, we are balancing adequacy in expressiveness against complexity in similarity reasoning.
Our system currently normalizes the dimensional comparisons that occur in our test set of 15 similar pairs of documents, and exploits mid-level categories to make similarity assessments. We expect that these criteria, along with others that will emerge as our research progresses, will form the basis for defining a powerful yet tractable ontology for large-scale knowledge extraction from documents.
1. Brachman, R.J., McGuinness, D.L., Patel-Schneider, P.F., and Borgida, A. Reducing CLASSIC to practice: Knowledge representation theory meets reality. Artificial Intelligence 114, 1-2 (1999), 203-237.
2. Condoravdi, C., Crouch, R., Everett, J.O., de Paiva, V., Stolle, R., Bobrow, D., and van den Berg, M. Preventing Existence. In Proceedings of the Second International Conference on Formal Ontology in Information Systems. ACM Press, New York, NY, 2001, 162-173.
3. Dalrymple, M. Syntax and Semantics: Lexical Functional Grammar. Vol. 34. Academic Press, San Diego, CA, 2001.
4. Fabriani, P., Missikoff, M., and Velardi, P. Using text processing techniques to automatically enrich a domain ontology. In Proceedings of the Second International Conference on Formal Ontology in Information Systems (Ogunquit, ME). ACM Press, New York, NY, 2001, 270-284.
5. Fellbaum, C., Ed. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.
6. Forbus, K.D., Falkenhainer, B., and Gentner, D. The structure mapping engine: Algorithm and examples. Artificial Intelligence 41, 1 (1989), 1-63.
7. Gentner, D. Structure-mapping: A theoretical framework for analogy. Cog. Sci. 7 (1983), 155-170.
8. Rosch, E. Principles of Categorisation. In E. Rosch and B.B. Lloyd, Eds. Cognition and Categorization. Erlbaum, Hillsdale, NJ, 1978.
¹ Note that the system must infer the existence of the HeatingEvent in (7) from world knowledge about ShortCircuitEvents and BurnEvents.
² This is consistent with a cognitive account of category formation that stresses the primacy of categories such as chair over the more general furniture and more specific loveseat. See [8].
Figure 1. Example of Eureka tips.
Figure 2. A direct word-to-concept ontology for the concept Incapacitate.
©2002 ACM 0002-0782/02/0200 $5.00