ACM

Communications of the ACM

ACM TechNews

Data Diversity

By MIT News
December 19, 2016
Comments

View as: Print Mobile App Share:

Preserving variety in subsets of unmanageably large data sets should aid machine learning. — A new algorithm makes it much more practical to select diverse subsets from a much larger dataset.

Credit: Christine Daniloff/MIT

Selecting diverse subsets from a much larger dataset promises to be much more practical using a new algorithm designed by researchers from the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems.

The algorithm begins with a small subset of the data, chosen at random, then picks one point inside the subset and one point outside it and randomly selects one of three simple operations--swapping the points, adding the point outside the subset to the subset, or deleting the point inside the subset. The decision to perform the operation or not is probabilistic, and depends on the improvement in diversity the operation affords. As the subset grows, it becomes harder to add new points unless they dramatically improve diversity. The process repeats until the diversity of the subset reflects that of the full set.

The running time of the algorithm depends on the number of data points in the subset. For example, if the goal is to winnow a data set with 1 million points down to 1,000, the algorithm is 1 billion times faster than its predecessors.

From MIT News
View Full Article

No entries found