Summary contributed by Nga Than (@NgaThanNYC), Senior Data Scientist at Prudential Financial.
[Original paper by Amy Heger, Elizabeth B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan]
Overview: Data documentation, the practice whereby engineers and ML/AI practitioners record detailed information about how a dataset was created and how it is and may be used, is an important on-the-ground component of the push toward responsible AI. The paper interviews 14 ML practitioners at a large international information technology company to explore their data documentation practices. The authors then propose seven design criteria to make data documentation more streamlined and better integrated into ML practitioners’ day-to-day work.
Introduction
For machine learning (ML) practitioners, data is the starting point of any ML solution. Whether good, comprehensive data can be secured often determines whether the AI system to be built will be of any use at all.
Data documentation has been proposed and encouraged by researchers and practitioners alike to promote transparency in the process whereby datasets are created, curated, and used. In practice, a few frameworks for creating data documentation exist, such as datasheets for datasets (Gebru et al., 2021), dataset nutrition labels (Chmielinski et al., 2022), and data statements for NLP datasets (Bender & Friedman, 2018).
Data documentation fosters reflexivity on the part of dataset creators, prompting them to reflect on the underlying assumptions, potential risks, and future implications of their datasets’ use. It also helps dataset consumers (i.e., ML practitioners, data analysts, and data scientists) make informed decisions about whether a dataset is suitable for a proposed project.
As a collaborative practice, data documentation can serve as a channel of communication between the creators and users of a dataset, contributing to the development, evaluation, and deployment of AI systems that prioritize values such as “transparency, fairness, safety, reliability and privacy.”
Yet we know little about how actionable and practical these data documentation frameworks are. ML practitioners juggle competing organizational imperatives daily, and data documentation may not be among their priorities.
The researchers set out to answer two questions: (1) How do ML practitioners approach data documentation? (2) What are their perceptions, needs, and challenges around data documentation?
Key Insights
Overview of data documentation frameworks
Three frameworks have been proposed to date. The first, data statements, provides context for text data, including information such as speaker demographics and language variety. The second, datasheets for datasets, encourages dataset creators to reflect on choices made throughout the dataset lifecycle and helps dataset consumers make more informed choices. The third, dataset nutrition labels, is inspired by nutrition labels on food products. The authors used the datasheets for datasets framework (Gebru et al., 2021) to conduct the research; a datasheet consists of a list of questions spanning areas such as fairness, privacy, legal implications, and ethics.
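To make the datasheet idea concrete, here is a minimal sketch of how a team might record datasheet answers as a structured artifact that can live alongside code. The section names follow the headings in Gebru et al. (2021); the class name, dict-based grouping, and gap-flagging helper are my own illustration, not part of the paper.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Datasheet:
    """Answers to datasheet questions, grouped by the sections in Gebru et al. (2021)."""
    motivation: dict = field(default_factory=dict)      # why was the dataset created?
    composition: dict = field(default_factory=dict)     # what do the instances represent?
    collection: dict = field(default_factory=dict)      # how was the data acquired?
    preprocessing: dict = field(default_factory=dict)   # cleaning / labeling steps
    uses: dict = field(default_factory=dict)            # intended and out-of-scope uses
    distribution: dict = field(default_factory=dict)    # licensing and access
    maintenance: dict = field(default_factory=dict)     # who updates it, and how often

    def unanswered_sections(self) -> list:
        """Return section names with no recorded answers, to flag documentation gaps."""
        return [name for name, answers in asdict(self).items() if not answers]

# Hypothetical usage: a partially completed datasheet.
sheet = Datasheet(
    motivation={"purpose": "Train a support-ticket triage model"},
    uses={"out_of_scope": "Not suitable for evaluating individual employees"},
)
print(sheet.unanswered_sections())
```

Storing such a record as code (or serialized alongside it) lets gaps be surfaced automatically, though the reflective answers themselves still have to come from people.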
ML practitioners find data documentation useful
Participants stated that this practice helps them reduce the risk of losing information, which is especially important when a team member who holds undocumented knowledge about a dataset leaves. Data consumers can more effectively discover datasets useful for their needs, and documentation helps new hires onboard faster. Furthermore, it supports legal compliance and limits liability. Finally, the practice facilitates reflexivity, pushing practitioners to think beyond the short-term benefits of their datasets.
Ad hoc and myopic processes
ML practitioners attend to information about whether they or others could use a dataset for a specific purpose, while downplaying deeper consideration of whether they or others should use it and what might go wrong in the future.
Information was documented both in dedicated locations and in documents created for other purposes, such as slide decks and GitHub repository README files. There was no pipeline making documentation a daily practice: participants saved documentation in many forms (text files, wikis, code repositories such as GitHub), and some teams used these artifacts as references on the origins and purposes of their datasets. Because there was no central record of where data documentation lives, participants relied on other channels, such as PowerPoint decks and newsletters, to link to it, which undermined its importance. Practitioners argued, rightly, that data documentation should be stored and maintained alongside code, since versioning the two together streamlines the process.
ML practitioners prefer minimum viable documentation that is practical, efficient, and usable. That is, they want to provide only the minimum information necessary for their specific purpose and short-term benefit, and they would like the process to be automated where possible.
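The wish for automation can be partially met: mechanical facts about a dataset can be derived from the data itself, while reflective questions are deliberately left blank for a human. The sketch below illustrates this split under my own assumptions (the field names and the CSV input format are hypothetical, not from the paper).

```python
import csv
import hashlib
import io
from datetime import date

def auto_fields(csv_text: str) -> dict:
    """Fill in the mechanical parts of a dataset record from the data itself.
    Reflective questions (intended use, known risks) are left as None,
    since automating them would automate away responsibility."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return {
        "columns": rows[0],
        "num_instances": len(rows) - 1,  # exclude the header row
        "sha256": hashlib.sha256(csv_text.encode()).hexdigest(),
        "documented_on": date.today().isoformat(),
        # Deliberately unfilled: a human must answer these.
        "intended_use": None,
        "known_risks": None,
    }

sample = "user_id,age\n1,34\n2,29\n"
print(auto_fields(sample))
```

This keeps the cost of minimum viable documentation low without pretending that the judgment-laden questions can be generated automatically.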
No connection between data documentation and RAI applications
In practice, ML experts lose sight of why data documentation was necessary in the first place. The practice takes on a life of its own, detached from its academic motivation: embedding responsible AI principles in practitioners’ daily workflows. Participants resisted questions about the potential use and misuse of the datasets they created, contending that misuse depends on the intent of whoever uses the dataset in the future.
Datasets don’t have clear boundaries
Data can be static or streaming, making it difficult to define what counts as a dataset. Moreover, many participants’ datasets were created by merging features from several other datasets, raising the question of whether documentation is needed for the merged dataset or for the datasets the features came from. For datasets that change over time, documentation frameworks must also be sensitive to temporality.
Like any written artifact, data documentation needs a target audience. Yet ML practitioners are unsure who the readers are; most document simply for their own use. In other words, the implicit audience is an internal ML practitioner rather than the ML/AI community at large. Some recognized documentation’s usefulness to data consumers and even suggested creating a glossary for terms that might be unfamiliar to other groups.
How deeply to answer is also an open question for practitioners. When data documentation is not mandated and there is insufficient time for it, quality is low; participants would not go out of their way to find answers to the datasheet questions. There is also a lack of collaboration and communication between dataset users at different points in time: for example, ML practitioners may need answers about how the data was created that sit in the purview of software engineers. Finally, participants highlighted that context matters: whether a dataset is distributed internally or externally influences how they choose to document it.
Design Criteria for Data Documentation Frameworks
The authors proposed seven criteria for actionable data documentation frameworks:
- Make explicit the connection between data documentation and responsible AI
- Make data documentation frameworks practical
- Adapt data documentation frameworks to different contexts
- Don’t automate away responsibility
- Clarify the target audience for data documentation
- Standardize and centralize data documentation
- Integrate data documentation into existing frameworks and workflows
Between the Lines
The core issue with data documentation is generating high-quality documentation. Many problems can arise in the process: data creators may not understand why high-quality documentation is needed, the tools they use may not be up to the task, or documents may go missing altogether. The last problem is especially common in small businesses and startups, or in businesses that have only begun to experiment with AI/ML.
Two solutions could be implemented: (1) having a designated data documenter who understands the imperative of creating such a document; (2) training ML practitioners to incorporate professional data documentation into their day-to-day work. The data documenter might be trained more in information or library science than in ML. ML practitioners can also learn archiving and documentation practices to better keep records for future use.
For ML practitioners like me, fairness principles still represent additional labor that may not be rewarded. Nor do I usually get to create the dataset myself; most of the time, the data was created long before I was even aware it existed. Unearthing its original purpose, its changes over time, and its usefulness for my work entails a sociological imagination whereby I connect relationships and hidden changes. High-quality documentation would certainly help my work, and organizations ought to devote resources to the practice.
Yet the frameworks the authors suggest still operate at the organizational level. That means each organization can institute its own framework, which may or may not be compatible with others’. Industry-wide, standardized solutions adopted by all practitioners would be welcome.
References
Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604.
Chmielinski, K. S., Newman, S., Taylor, M., Joseph, J., Thomas, K., Yurkofsky, J., & Qiu, Y. C. (2022). The dataset nutrition label (2nd Gen): Leveraging context to mitigate harms in artificial intelligence. arXiv preprint arXiv:2201.03954.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.