Here we describe table servers being developed by the National Institute of Statistical Sciences (NISS) that disseminate tabular summaries of statistical data in response to user queries for marginal sub-tables of a large (for example, 40 dimensions with 4 categories each) contingency table containing counts or sums. Table servers evaluate disclosure risk dynamically, in light of previously answered queries.
The query space Q, which contains all 2K sub-tables of a K-way table, is partially ordered by set inclusion of attributes in sub-tables. The set R(t) of all tables released through some time t contains direct releases in response to queries and indirect releases (previously unreleased children of direct releases); R(t) is specified by the released frontier RF(t) of its maximal elements.
Underlying dynamic release decisions is a risk criterion RC defined on subsets of Q: at all times the system must satisfy RC (R(t)) ≤ a, where a is a risk threshold set by the operators. A typical risk criterion is accuracy of bounds based on R(t) for sensitive (small count) cells in the full table. Bounds can be computed using integer programming methods, but these techniques do not scale to the table sizes under consideration here. There are also exact techniques for special cases. For example, if the released sub-tables constitute the minimal sufficient statistics of a decomposable graphical model [4], then bounds can be expressed as explicit functions of these sub-tables [2].
Whenever an answered query releases previously unreleased information, other queries become unanswerable. Consequently, at t there is an unreleasable set U(t) of sub-tables whose release would be too risky, with an unreleasable frontier UF (t) of its minimal elements.
Release rules determine which requests for unreleased tables will be fulfilled. The simplest is the myopic rule of releasing T at t as long as RC(R(t) T) ≤ a. To prevent the table server from taking excessively large steps, one can allow only tables adding but one attribute to a previously released table to be eligible for release. To prevent a single user (or a set of colluding users) from driving the table server into a region of Q that suits their needs but not those of other users, release rules can be biased against releases that add large numbers of tables to U(t). Rules can also incorporate the value of releasing T [3, 5].
A prototype table server, written as a Java application, is shown in the accompanying figure. Its principal strength is the engaging (but non-scalable) visualization of Q, which illustrates the underlying abstractions such as releasable and unreleasable frontiers.
A more powerful table server has been written using the Java 2 Enterprise Edition platform, with query processing performed by Java Servlets and risk computations by native executables. This prototype has been tested on a 14-dimensional, 300,000,000-cell, but extremely sparse, table derived from the Current Population Survey.
Table servers disseminate tabular summaries of statistical data in response to user queries for marginal sub-tables of a large contingency table containing counts or sums. Table servers evaluate disclosure risk dynamically, in light of previously answered queries.
If the requested table lies on or below RF (t), it is provided immediately, ordinarily via downloaded XML (HTML tables and text are alternatives). Releases are governed by the myopic and at most one step away from R(t) rules, and disclosure risk is evaluated in real time. The query history database, with tables for users, queries and the time trajectories of RF (t) and UF (t), is maintained in a MySQL database server. A frontier display facility monitors evolution of RF (t).
The system employs data structures based on hash tables for storing tables and algorithms that exploit sparsity and the fact that R(t) and U(t) are characterized completely by RF (t) and UF (t). The risk criterion is narrowness of cell bounds computed via a generalized shuttle algorithm [1].
1. Buzzigoli, L. and Giusti, A. An algorithm to calculate the lower and upper bounds of the elements of an array given its marginals. In Proceedings of the Conference on Statistical Data Protection. (Luxembourg, Germany, 1999) Eurostat, 131147.
2. Dobra, A. and Fienberg, S.E. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. In Proceedings of the National Academy of Science 97, 22 (2000) 1188511892.
3. Duncan, G.T., Keller-McNulty, S.A., and Stokes, S.L. Disclosure risk vs. data utility: The R-U confidentiality map. Management Science.
4. Lauritzen, S.L. Graphical Models. Clarendon Press, Oxford, UK, 1996.
5. Trottini, M. A decision-theoretic approach to data disclosure problems. In 2nd Joint ECE/Eurostat Work Session on Statistical Data Confidentiality. (Luxembourg, Germany, 2001). Eurostat.
Figure. Java table server prototype. The visualization of the query space Q shows direct releases (yellow), indirect releases (blue), unacceptably risky (red), and the potential effect (dark blue, magenta, and dark red) of releasing the 5-way table indicated by the cursor. The released (unreleasable) frontier lies at the top of the lower-left (bottom of the upper-right) portion of the visualization.
©2003 ACM 0002-0782/03/0100 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2003 ACM, Inc.
No entries found