Data plays a critical role in machine learning. Every machine learning model is trained and evaluated using data, quite often in the form of static datasets. The characteristics of these datasets fundamentally influence a model's behavior: a model is unlikely to perform well in the wild if its deployment context does not match its training or evaluation datasets, or if these datasets reflect unwanted societal biases. Mismatches like these can have especially severe consequences when machine learning models are used in high-stakes domains, such as criminal justice,1,13,24 hiring,19 critical infrastructure,11,21 and finance.18 Even in other domains, mismatches may lead to loss of revenue or public relations setbacks. Of particular concern are recent examples showing that machine learning models can reproduce or amplify unwanted societal biases reflected in training datasets.4,5,12 For these and other reasons, the World Economic Forum suggests all entities should document the provenance, creation, and use of machine learning datasets to avoid discriminatory outcomes.25
Although data provenance has been studied extensively in the databases community,3,8 it is rarely discussed in the machine learning community. Documenting the creation and use of datasets has received even less attention. Despite the importance of data to machine learning, there is currently no standardized process for documenting machine learning datasets.
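To make the idea of dataset documentation concrete, one could imagine capturing a datasheet's answers as a structured record so that gaps are machine-checkable rather than buried in prose. The sketch below is purely illustrative: the field names (`motivation`, `composition`, `collection_process`, and so on) are hypothetical stand-ins, not the article's actual question set, and a real datasheet is a set of prose answers rather than a schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a datasheet as a structured record.
# Field names are illustrative placeholders, not a standard.
@dataclass
class Datasheet:
    motivation: str                 # why the dataset was created
    composition: str                # what the instances represent
    collection_process: str         # how the data was gathered
    recommended_uses: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

    def unanswered(self) -> list:
        """Names of required prose fields left empty, making gaps visible."""
        return [name for name in ("motivation", "composition", "collection_process")
                if not getattr(self, name).strip()]

sheet = Datasheet(
    motivation="Benchmark face verification under varied lighting.",
    composition="Images of consenting adult volunteers.",
    collection_process="",  # left blank: collection details not yet documented
    known_limitations=["Skewed toward indoor lighting conditions"],
)
print(sheet.unanswered())  # -> ['collection_process']
```

Even this toy structure shows one benefit of standardized documentation: missing answers become detectable by consumers of the dataset rather than silently absent.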
I'm curious if/how the FAIR guiding principles for scientific data management and stewardship align with datasheets for datasets. https://www.nature.com/articles/sdata201618
Also, there seems to be an adversarial relationship between DAIR and industry. Are there opportunities for industry to work with DAIR to help eliminate unwanted societal biases?
This is interesting and timely work. It is crucial that rich context (metadata++) is available for datasets. The 50+ questions are daunting for an author/creator, but the time and effort given to adding this value will accrue on multiple levels: findability; credit for concept development; understandability; and computability/inferencing/linking to advance progress. Libraries have long developed metadata standards for individual objects to serve these value adds, and archives create finding aids that richly contextualize collections. This work illustrates how to move forward with curating datasets. Gary Marchionini
This is a great idea and I wonder if it could be integrated with the Open Science Framework so that people who post their datasets to OSF.io would follow one of your templates. Also, I wondered if this effort could learn from the research reporting guidelines established for medical research (see https://www.equator-network.org/). If you want to submit a medical research article, the journal will demand that you also submit a completed research report like CONSORT (or any of hundreds of others). Maybe data journals could do the same with your template (but not the hundreds part).