acm-header
Sign In

Communications of the ACM

Research highlights

Technical Perspective: High-Level Data Structures


It is a defining moment for many a CS student when they start thinking about data structures relationally; when maps, n-tuples, sets, and bags become more natural terms of discourse than lists, vectors, hash tables, and binary trees. I often see a student's eyes light up when realizing that what they need is, say, a mapping from some Xs to an unordered set of Ys, with the correctness of the rest of their program unaffected by whether this mapping is implemented by hash tables and lists, trees and vectors, or something else. This difference in thinking is crucial for programming projects with any amount of data structure complexity, from the real world all the way down to university coursework. In a compiler course project, for instance, the difference is striking. Students who manage to grasp the separation of relational and low-level thinking have no trouble adapting to complex requirements. Students who cannot escape their preoccupation with low-level data structures often spend all their time changing the intertwined implementations of a mess of lexically scoped symbol tables, member tables, and global type structures.

This lifting of data structure thinking to the relational level has long inspired computer scientists. Relational databases have given us a vocabulary for data management, but have also hidden the internals of the implementation from the end programmer. General-purpose programming has largely remained low-level. Even when a library offers an abstraction with different implementations—for example, a map that can be implemented via either a hash table or a binary tree—standard relational reasoning is not supported automatically. A map from a pair of X and Y values to Z values is equivalent to a map from X values to maps from Ys to Zs, for instance, but the programmer cannot represent both forms equivalently and has to structure the code for one or the other. As a result, data structure code is almost universally too low-level to be amenable to compiler analysis, for performance or correctness.


Hawkins et al. make a contribution to a long line of work in elevating data manipulation.


In the following paper, the authors aim at elevating data structure programming to the relational realm. The separation of the relational view from the low-level implementation view of the data structure is performed in a disciplined way, and even automated. Given a relational specification of data to be stored, the system derives and verifies a "decomposition:" a combination of data structures that together implement the relational specification. Insertions, deletions, and lookups on the data structure are performed via actions that remain unchanged for different decompositions. Eventually the system automatically produces code that correctly implements the desired query. Key novelties of the approach are the use of functional dependencies over data and a type system that verifies the adequacy of a decomposition for representing the relations and dependencies.

In a greater context, Hawkins et al. make a contribution to a long line of work in elevating data manipulation. The SETL language4 pioneered high-level data structures by offering a programming model based on set theory, together with automatic selection of concrete data structures at compile time. Even closer to the authors' work in terms of programming model, Batory introduced a sequence of specialized languages1,2,5 exploring the refinement of a relational specification to concrete, intertwined data structures. The program is written in terms of queries oblivious to the specific data structures used. Once the data structures are selected (manually, in a single configuration statement2,5 or automatically via search in the space of data structures1), a query is classified as a scan, an ordered range lookup, or a direct map lookup, and efficient code is generated for the given structures. Additionally, other interesting angles on the relational view of data structures have been examined, such as the Collection Programming Language (CPL) by Buneman et al.3 CPL offers declarative whole-structure operations—an approach of great power, but probably more difficult to integrate in general-purpose code with imperative updates and a traversal state.

Accordingly, the work toward relational thinking in data structures is not over, nor is it likely to be over soon. The issue of the "right" high-level programming interface is not settled yet. Furthermore, there is always a tension between programmer control and automation. In some cases there may be a benefit to "cracking open the shell" and obtaining better performance via clever low-level data structure manipulations (for example, employing hybrid structures, such as an "augmented" red-black tree). Regardless of where a language or library draws the line, I hope that upon reading this paper you will agree that the effort to conceptually separate relational from low-level data structure design will be key to the future of high-level programming.

Back to Top

References

1. Batory, D., Chen, G., Robertson, E. and Wang, T. Design wizards and visual programming environments for GenVoca generators. IEEE Transactions on Software Engineering 26, 5 (2000), 441–452.

2. Batory, D. and Thomas, J. P2: A lightweight DBMS generator, Journal of Intelligent Information Systems 9 (1995), 107–123.

3. Buneman, P., Naqvi, S., Tannen, V. and Wong, L. Principles of programming with complex objects and collection types. Theoretical Computer Science 149, 1 (1995), 3–48.

4. Schwartz, J., Dewar, R., Schonberg, E. and Dubinsky, E. Programming With Sets: An Introduction to SETL. Springer-Verlag, 1986.

5. Smaragdakis, Y. and Batory, D. DiSTiL: A transformation library for data structures. In Proceedings of the Conference on Domain-Specific Languages. USENIX, 1997.

Back to Top

Author

Yannis Smaragdakis ([email protected]) is an associate professor in the Department of Informatics at the University of Athens, Greece.


©2012 ACM  0001-0782/12/12

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2012 ACM, Inc.


 

No entries found

Sign In for Full Access
» Forgot Password? » Create an ACM Web Account
Article Contents: