
Communications of the ACM

Research highlights

Technical Perspective: Schema Mappings: Rules for Mixing Data


When you search for flight tickets on your favorite Web site, your query is often dispatched to tens of databases to produce an answer. When you search for products on Amazon.com, you are seeing results from thousands of vendor databases that were developed before Amazon existed. Did you ever wonder how that happens? What is the theory behind it all? At the core, these systems are powered by schema mappings, which provide the glue that ties all these databases together. The following paper by ten Cate and Kolaitis will give you a glimpse into the theoretical foundations underlying schema mappings and might even inspire you to work in the area.

The scenarios I've noted here are examples of data management applications that require access to multiple heterogeneous data sets. Data integration is the field that develops architectures, systems, formalisms, and algorithms for combining data from multiple sources, be they relational databases, XML repositories, or data from the Web. The goal of a data integration system is to offer uniform access to a collection of sources, freeing the user from having to locate individual sources, learn their specific interaction details, and manually combine the data. The work on data integration spans multiple fields of computer science, including data management, artificial intelligence, and human-computer interaction. The field has been nicknamed the "AI-complete" problem of data management, owing to the challenges that arise from reconciling multiple models of data created by humans and to the realization that we never expect to solve data integration completely automatically.

Data integration challenges are pervasive in practice. Large enterprises often must combine data from hundreds of repositories, and scientists constantly face an explosion in the number of data sources being created in their domains. The Web provides an extreme case of data integration, with tens of millions of independently developed data sources. Fortunately for researchers, data integration is also a pervasive problem in government organizations, ensuring a steady stream of research on the topic. In a nutshell, data integration is difficult because the data sets were developed independently and for different purposes. Therefore, different developers model varying aspects of the data, use inconsistent terminology, and make different assumptions about the data.

There are several architectures for data integration systems, and the appropriate choice depends on the needs of the application. In some cases it is possible to collect all the data in one physical repository; in other cases data must be exchanged from a source database to a target one; and in still other scenarios, organizational boundaries or other factors dictate that the data must be left at the original sources, so the relevant data can be combined only in response to a query. Regardless of the architecture used, the core of data integration relies on schema mappings, which specify how to translate terms (for example, table names and attribute names) between different sources and how to relate differing database organizations. Much of the effort in building a data integration application goes into constructing schema mappings and maintaining them over time. Building the mappings is difficult mainly because it requires understanding the semantics of both the source and target databases (which may require more than one person) and the ability to express the semantic relationship formally (which may require a database specialist in addition to the domain experts). There has been a large body of research on providing assistance in creating and debugging schema mappings.
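As a small illustration (the table and attribute names here are hypothetical, not drawn from the paper), a schema mapping can be written as a logical rule whose premise mentions source relations and whose conclusion mentions target relations:

    ∀n ∀d ∀p ( Vendor_Catalog(n, d, p) → Product(n, d) ∧ Price(n, p) )

The rule states that every row of the vendor's catalog contributes a Product tuple and a Price tuple to the integrated schema, with the vendor's columns (name, description, price) renamed onto the corresponding target attributes.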

A schema mapping must be written in some logical formalism. In the earliest data integration systems, schema mappings were written like ordinary view definitions (now known as GAV, or global-as-view, mappings), where an integrated view is defined over tables from multiple sources. With time, it became evident that this approach did not scale to a large number of sources, and thus LAV (local-as-view) mappings were developed. In LAV, the focus is on describing the contents of an individual source irrespective of the other sources. LAV mappings are complemented by a general reasoning engine that infers, given a particular query, how to combine data from multiple sources. As the study of mappings progressed, researchers discovered close relationships between mapping formalisms and constraint languages such as tuple-generating dependencies.
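To make the contrast concrete (again with hypothetical airline schemas rather than examples from the paper), a GAV mapping defines a relation of the integrated, or mediated, schema as a view over source tables, whereas a LAV mapping describes a single source in terms of the mediated schema:

    GAV:  ∀n ∀d ∀p ( AA_Schedule(n, d) ∧ AA_Fares(n, p) → Flight("AA", n, d, p) )
    LAV:  ∀n ∀d ( AA_Schedule(n, d) → ∃p Flight("AA", n, d, p) )

Under the GAV rule, answering a query over Flight amounts to unfolding the view definition; under the LAV rule, the reasoning engine must decide, for each query, which sources to combine, and a source may contribute only partial information (here, the fare is left unspecified). Both rules are instances of the tuple-generating dependencies mentioned above.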

Though some properties of these languages in isolation are well understood, this paper sheds significant light, for the first time, on the relationships between them. The authors identify general properties of mappings (properties that are not tied to the formalism in which a mapping is written) and show how these properties can be used to characterize the language needed to express a mapping. Beyond providing several insightful results, I believe the paper merits careful study because it opens up a new and exciting field of research involving the expressive power of data integration systems.


Author

Alon Halevy is a research scientist at Google, where he manages a team looking into how structured data can be used in Web search.


Footnotes

DOI: http://doi.acm.org/10.1145/1629175.1629200


©2010 ACM  0001-0782/10/0100  $10.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.


 
