When asked to pick a random buddy from our Facebook friends list, we may struggle since our nomination is likely to be biased toward individuals with whom we interact more frequently. Computer assistance in such a case would make things easier: we can just store our friends' names in an array. Whenever a query comes, the computer generates a random array index and returns the name stored in the corresponding location. The question now is whether, for any given data structure problem, we can build a data structure that generates "fair" output while maintaining query efficiency.
In the context of data structure query-answering, fairness can be defined as follows: we either return all valid answers or just return a uniformly random one. Although this definition does not capture all human expectations of fairness, it is the most natural one for data structures from a technical sense. The challenge is that returning all valid answers may be time prohibitive, while returning a uniformly random one in a timely manner could also be difficult due to specific data storage orders and query-answering procedures. Consider the problem of finding red nodes in a binary tree with half of the nodes colored red and the other half blue. We can certainly use our favorite tree traversal algorithms, such as breadth-first search, to find all red nodes in the tree, but doing so will take time proportional to the size of the tree. We can also take random root-leaf paths to find one red node; this approach is much more time-efficient, but it favors red nodes closer to the root, resulting in an output that is not uniformly random.
Designing efficient data structures has long been regarded as one of the most important problems in computer science, and it has been studied for decades. Surprisingly, however, the topic of designing fair data structures has not received much attention. While some general principles for fairness in (machine learning) algorithms exist, such as "similar treatment for similar items" and "output being independent from sensitive variables (for example, individuals' gender, ethnicity, etc.)," the definition of fairness can be debatable at times due to the complexities in socio-technical systems. Fortunately, as previously stated, defining fairness for data structure problems is relatively easy. Achieving both efficiency and fairness, however, may still be challenging.
In the following paper, Aumüller et al. investigated a basic problem in similarity search called near neighbor in the context of fair data structures. In this problem, given a set of points in a metric space and a distance threshold, the task is to build a data structure to store these points so that given any query point, we can efficiently return a point within the distance threshold from the query point. Similarity search has become increasingly important in the era of big data and has been used as a building block in numerous applications in databases, recommendation systems, machine learning, and so on. Fair (or, uniformly random) near neighbors is particularly useful for increasing the diversity of the output and estimating some desired properties/statistics in the neighborhood of the query point.
The following paper investigates a basic problem in similarity search called near neighbor in the context of fair data structures.
The standard method for efficiently retrieving a point from the set of near neighbors is locality sensitive hashing (LSH). In this method, we hash all input points into a collection of buckets. To handle each query, we simply search all points in the bucket to which the query point is hashed. We typically repeat this procedure multiple times using independent hash functions to increase the success probability of finding a near neighbor. Unfortunately, LSH does not produce a uniformly random near neighbor because it favors points that are closer to the query point.
The work by Aumüller et al. adds a fairness component to LSH algorithms. The authors achieved this by considering a more generic, but still fairly basic data structure problem in which we are given a collection of sets of items (think of each set as a bucket containing a set of points). The task is to preprocess this collection of sets, such that for each query that specifies a sub-collection of sets, we can return a uniformly random item from the union of sets in this sub-collection excluding a marked set of outliers. Solving this general problem requires several new and intriguing technical ideas that could potentially be applied to a variety of other fair data structure problems. This work sets the path for future research in fair data structure design, which has not yet received much attention despite its significance. We may anticipate more exciting studies in this area for the years to come.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
No entries found