Montreal AI Ethics Institute


Understanding Machine Learning Practitioners’ Data Documentation Perceptions, Needs, Challenges, and Desiderata

October 22, 2022

Summary contributed by Nga Than (@NgaThanNYC), Senior Data Scientist at Prudential Financial.

[Original paper by Amy Heger, Elizabeth B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan]


Overview: Data documentation, a practice whereby engineers and ML/AI practitioners record detailed information about how a dataset was created and how it is and may be used, is an important on-the-ground component of the push towards responsible AI. The paper reports on interviews with 14 ML practitioners at a large international information technology company about their data documentation practices. The authors then propose seven design criteria to make data documentation more streamlined and better integrated into ML practitioners' day-to-day work.


Introduction

For machine learning (ML) practitioners, data is the starting point of any ML solution. Securing good, comprehensive data can determine whether the AI system to be built will be of any use at all.

Data documentation has been proposed and encouraged by researchers and practitioners alike to promote transparency about how datasets are created, curated, and used. In practice, a few frameworks exist for creating data documentation, such as datasheets for datasets (Gebru et al. 2021), dataset nutrition labels (Chmielinski et al. 2022), and data statements for NLP datasets (Bender & Friedman 2018).

Data documentation encourages reflexivity on the part of dataset creators, who, through the process, reflect on the underlying assumptions, potential risks, and future implications of the use of their datasets. It also helps dataset consumers (e.g., ML practitioners, data analysts, data scientists) make informed decisions about whether a dataset is suitable for their proposed project.

As a collaborative practice, data documentation can serve as a channel for communication between the creators and users of a dataset. This supports the development, evaluation, and deployment of AI systems that prioritize values such as “transparency, fairness, safety, reliability and privacy.”

Yet we know little about how actionable and practical these data documentation frameworks are in practice. ML practitioners juggle competing organizational imperatives on a daily basis, and data documentation is not necessarily their priority.

The researchers set out to answer two questions: (1) how do ML practitioners approach data documentation, and (2) what are their perceptions, needs, and challenges around data documentation?

Key Insights

Overview of data documentation frameworks

Three frameworks have been proposed to date. The first is data statements, which provide context for text data, such as speaker demographics and language variety. The second is datasheets for datasets, which encourages dataset creators to reflect on choices made throughout the dataset lifecycle and helps dataset consumers make more informed choices. The third is dataset nutrition labels, inspired by the nutrition labels on food products. The authors used the datasheets for datasets framework (Gebru et al. 2021) to conduct the research. A datasheet contains a list of questions touching on fairness, privacy, and legal and ethical implications.
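To make the datasheet idea concrete, here is a minimal illustrative sketch. The field names are our own simplification of a few of Gebru et al.'s question areas, not the official schema; the `Datasheet` class and its `to_markdown` method are hypothetical helpers for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Datasheet:
    """A toy datasheet covering a few of the question areas from the
    datasheets-for-datasets framework (simplified; not the full set)."""
    motivation: str            # Why was the dataset created?
    composition: str           # What do the instances represent?
    collection_process: str    # How was the data collected?
    recommended_uses: str      # What tasks is the dataset suitable for?
    known_limitations: str     # Risks, biases, and gaps

    def to_markdown(self) -> str:
        """Render the datasheet as a Markdown document that can be
        committed to the same repository as the dataset and code."""
        sections = [
            ("Motivation", self.motivation),
            ("Composition", self.composition),
            ("Collection process", self.collection_process),
            ("Recommended uses", self.recommended_uses),
            ("Known limitations", self.known_limitations),
        ]
        return "\n".join(f"## {title}\n\n{body}\n" for title, body in sections)
```

Rendering to Markdown is one plausible way to keep the datasheet versioned alongside code, a point the participants themselves raise below.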

ML practitioners find data documentation useful 

Participants stated that the practice helps them reduce the risk of losing information. This is especially important when a team member leaves while holding knowledge about a dataset that was never documented. Documentation also helps data consumers discover datasets useful for their needs, helps new hires onboard faster, supports legal compliance, and limits liability. Finally, the practice encourages reflexivity, pushing practitioners to think beyond the short-term benefits of their datasets.

Ad hoc and myopic processes 

ML practitioners pay attention to information about whether they or others could use a dataset for a specific purpose, while downplaying deeper consideration of whether they or others should use it and what might go wrong in the future.

Information was documented both in dedicated locations and in documents created for other purposes (such as slide decks and GitHub repo README files). There is no pipeline that makes documentation a daily practice. Participants saved documentation in many forms, such as text files and wikis stored in repositories on GitHub, and some teams used these as references about the origins and purposes of their datasets. Because there is no central record of where data documentation lives, participants use other channels, such as PowerPoint decks and newsletters, to link to it, which undermines the importance of data documentation. Practitioners argued, rightly, that data documentation should be stored and maintained alongside the code, since versioning the two together streamlines the process.

ML practitioners prefer minimum viable documentation that is practical, efficient, and usable. That is, they want to provide only the minimum amount of information necessary for their specific purpose and its short-term benefits, and they would like the process to be automated where possible.
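The preference for automation can be illustrated with a small sketch (a hypothetical helper, not from the paper): the mechanical facts about a dataset can be generated from the data itself, while the reflective questions must be left to a human author.

```python
def describe_dataset(rows):
    """Auto-generate the mechanical parts of data documentation from
    the data itself. `rows` is a list of dicts, one per record.
    Reflective questions (intended use, risks, potential misuse)
    cannot be derived from the data and are left as explicit TODOs."""
    # Union of all field names appearing in any record.
    fields = sorted({key for row in rows for key in row})
    # Count records where each field is absent or None.
    missing = {
        f: sum(1 for row in rows if row.get(f) is None) for f in fields
    }
    return {
        "num_rows": len(rows),
        "fields": fields,
        "missing_values": missing,
        "intended_use": "TODO: human author",
        "known_risks": "TODO: human author",
    }
```

Running such a helper inside a data pipeline would keep the auto-generated half of the documentation in sync with the data; the TODO fields make visible what automation cannot answer, so responsibility is not automated away.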

No connection between data documentation and RAI applications 

In practice, ML practitioners lose sight of why data documentation was necessary in the first place. The practice takes on a life of its own, detached from the academic motivation of embedding responsible AI principles in practitioners’ daily workflows. Participants resisted the questions focused on potential use and misuse of the datasets they created, contending that this depended on the intent of whoever uses the datasets in the future.

Datasets don’t have clear boundaries 

Data can be static or streaming, making it difficult to define where a dataset begins and ends. A main reason is that participants’ datasets were created by merging features drawn from different source datasets. The question then becomes whether documentation is needed for the resulting dataset or for the datasets the features came from. With datasets that change over time, data documentation frameworks also have to be sensitive to temporality.

Like any written artifact, data documentation needs a target audience, yet ML practitioners are not sure who the readers are. Most simply document for their own use; in other words, their implicit audience is an internal ML practitioner rather than the ML/AI community at large. Some recognized its usefulness to data consumers, and even suggested creating a glossary for terms that might be unfamiliar to other groups.

How in-depth the answers should be is also an open question for practitioners. When data documentation is not mandated and there is not sufficient time for it, quality is low: participants would not go out of their way to find answers to the datasheet questions. There is also a lack of collaboration and communication between the different users of a dataset at different points in time; for example, ML practitioners need answers about how the data was created, which often lies in the purview of software engineers. Participants highlighted that context matters: whether a dataset is distributed internally or externally influences how they choose to document it.

Design Criteria for Data Documentation Frameworks

The authors proposed seven criteria for actionable data documentation frameworks:

  1. Make explicit the connection between data documentation and Responsible AI.
  2. Make data documentation frameworks practical.
  3. Adapt data documentation frameworks to different contexts.
  4. Don’t automate away responsibility.
  5. Clarify the target audience for data documentation.
  6. Standardize and centralize data documentation.
  7. Integrate data documentation into existing frameworks and workflows.

Between the Lines

The core challenge with data documentation is generating high-quality documentation. Many problems can arise in the process: data creators may not understand why high-quality documentation is needed, the tools they use may not be up to the task, or documents may simply go missing. The last problem is especially common in small businesses and startups, or in businesses that have just started to experiment with AI/ML.

Two solutions could be implemented: (1) appointing a designated data documenter who understands the imperative of creating such documents; (2) training ML practitioners to incorporate professional data documentation into their day-to-day work. The data documenter might be trained in information or library science rather than in ML, while ML practitioners could learn archiving and documentation practices to better keep records for future use.

For ML practitioners like me, fairness principles still amount to additional labor that is not necessarily rewarded. Often I do not get to create the dataset myself; most of the time, the data was created long before I was even aware it existed. Unearthing its original purpose, its changes over time, and its usefulness for my work requires a kind of sociological imagination, connecting relationships and hidden changes. High-quality documentation would certainly help my work, and organizations ought to devote resources to the practice.

Yet the frameworks the authors suggest still operate at the organizational level. Each organization can institute its own framework, which may or may not be compatible with others. Industry-wide solutions that are standardized and adopted by all practitioners would be welcome.

References 

Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604.

Chmielinski, K. S., Newman, S., Taylor, M., Joseph, J., Thomas, K., Yurkofsky, J., & Qiu, Y. C. (2022). The dataset nutrition label (2nd Gen): Leveraging context to mitigate harms in artificial intelligence. arXiv preprint arXiv:2201.03954.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.
