Communications of the ACM

ACM Opinion

The Rise and Fall (and Rise) of Datasets


Illustration of data and computer code.

Data sets are not neutral, but represent particular social and political norms, which can specifically affect marginalized groups.

Credit: Getty Images

In recent years, the machine-learning community has identified an alarming number of potential legal and ethical problems in many popular data sets, including representational harms, effects of bias, privacy infringement, and unclear or dubious downstream use. As a result, several data sets have been taken down or heavily redacted. In practice, however, they remain available and widely used, either in their original form (for example, via online torrents) or in derivative form, as subsets or modifications of the original data set or as models pretrained on the deprecated data set.

Moving forward, a fundamental change in data set culture is necessary. Harm mitigation and stewardship are required throughout a data set's life cycle, and creators must monitor how their data sets are used, update licenses and documentation, and limit access when necessary.

From Nature