
Research summary: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning

August 9, 2020

Summary contributed by our researcher Victoria Heath (@victoria_heath7), Communications Manager at Creative Commons

*Authors of full paper & link at the bottom



Full summary:

It’s no secret that there are significant issues with how data is collected and annotated in machine learning (ML). Many of the ethical problems discussed in ML today stem from the absence of best practices and guidelines for collecting and using the data that trains these systems. As Eun Seo Jo (Stanford University) and Timnit Gebru (Google) write, “Haphazardly categorizing people in the data used to train ML models can harm vulnerable groups and propagate societal biases.”

In this paper, Jo and Gebru examine how ML can apply the data collection and annotation methodologies that archives, the “oldest human attempt to gather sociocultural data,” have refined over decades. They argue that ML should create an “interdisciplinary subfield” focused on “data gathering, sharing, annotation, ethics monitoring, and record-keeping processes.” In particular, they explore how archives have worked to resolve data collection issues around consent, power, inclusivity, transparency, and ethics and privacy, and how those lessons can be applied to ML, especially in subfields that rely on large, unstructured datasets such as natural language processing and computer vision.

The authors argue that ML should adapt what archives have implemented in their data collection work, including an institutional mission statement, full-time curators, codes of conduct/ethics, standardized forms of documentation, community-based activism, and data consortia for sharing data. These practices follow decades of research and work by archives to address “issues of concern in sociocultural material collection.”

There are important differences between archival and ML datasets, notably the level of intervention and supervision. Data collection in ML is generally done without “following a rigorous procedure or set of guidelines,” and often without scrutinizing the origins of the data, the motivations behind its collection, or its potential impacts on society. Archives, by contrast, are heavily supervised, with several layers of intervention that help archivists decide whether a document or source belongs in a collection. Jo and Gebru point to another important difference between ML and archival datasets: their motivations and objectives. ML datasets are mostly built to train a system and make it more accurate, while archival collections are built to preserve cultural heritage and educate society, with particular attention to “authenticity, privacy, inclusivity, and rarity of sources.”

The authors argue for a more interventionist approach to data collection in ML, similar to what archives practice, because historical bias and representational bias infect data from the very beginning. Historical bias refers to the “structural, empirical inequities inherent to society that is reflected in the data,” while representational bias arises from the “divergence between the true distribution and digitized input space” (a small sketch of how that divergence might be measured follows the list below). The best way to mitigate these biases, they argue, is to adopt what archives have put into place in their data collection practices, which includes:

  1. Drafting an institutional mission statement that prioritizes “fair representation or diversity” rather than “tasks or convenience.” This prevents collection methods, or even research questions, from being driven solely by the accessibility and availability of datasets, which can replicate bias. It also encourages researchers to explain their collection processes publicly and invites feedback from the public.
  2. Ensuring consent through community and participatory approaches. This is especially crucial for ML researchers building datasets based on demographic factors. “ML researchers without sufficient domain knowledge of minority groups,” write Jo and Gebru, “frequently miscategorize data, imposing undesirable or even detrimental labels onto groups.” Archives have tackled similar issues by creating community archives, where collections are built and essentially “owned” by the community being represented. These archives are open to public input and contributions, often enabling minority groups to “consent to and define their own categorization.”
  3. Creating data consortia to increase “parity in data ownership.” Archives, alongside libraries, have built a consortia model through institutional frameworks that lets them “gain economies of scale” by sharing resources and avoiding redundant collections. The Open Data Institute, for example, has adopted this model to share data among ML researchers. However, tensions between profit and data may prevent widespread adoption by ML companies and organizations.
  4. Encouraging transparency through appraisal records and committee-based data collection practices. Archives follow rigorous record-keeping standards, including 1) data content standards, 2) data structure standards, and 3) data value standards, each passing through several layers of supervision. They also record the process of their data collection itself for further transparency. ML should build and maintain similar standards in its data collection practices to address concerns from the public (and other researchers) about ML systems (a minimal machine-readable sketch of such a record appears at the end of this summary).
  5. Building overlapping “layers of codes on professional conduct” that guide and enforce decisions on ethical concerns. For archives, these codes are maintained and enforced by international bodies (e.g., the International Council on Archives), and because many archivists are employed as professional data collectors, they are held to specific standards enforced by ethics panels or committees. ML could benefit immensely from similar mechanisms for ensuring accountability, transparency, and ethical responsibility.
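
The paper does not prescribe a metric for the representational bias mentioned above, but one common way to make the “divergence between the true distribution and digitized input space” concrete is to compare a dataset’s group distribution against a reference (e.g., census) distribution. Below is a minimal Python sketch of that diagnostic; the group names, counts, and reference shares are all hypothetical.

    # Minimal sketch: quantify representational bias as the gap between a
    # dataset's group distribution and a reference population distribution.
    # All group names, counts, and reference shares below are hypothetical.
    from collections import Counter

    def to_distribution(counts):
        """Normalize raw counts into a probability distribution over groups."""
        total = sum(counts.values())
        return {group: n / total for group, n in counts.items()}

    def total_variation_distance(p, q):
        """Half the L1 distance between two discrete distributions (0 = identical)."""
        groups = set(p) | set(q)
        return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)

    dataset_counts = Counter({"group_a": 700, "group_b": 250, "group_c": 50})
    reference = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}

    gap = total_variation_distance(to_distribution(dataset_counts), reference)
    print(f"Representational gap (total variation distance): {gap:.2f}")  # 0.20

A gap of 0 would mean the dataset mirrors the reference exactly, while values approaching 1 indicate severe skew. Total variation distance is just one choice here; the same comparison could use KL divergence or per-group ratios.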

Of course, there are limits to the ML field’s ability to adopt the measures outlined above. In particular, the authors note, ML datasets are often far larger than archival collections, and the resources needed to implement these measures may exceed what many ML-focused companies and researchers are willing to commit, especially since their motivations are primarily profit-driven. Still, the ML community must confront and end its current, problematic data collection practices, and a “multi-layered,” “multi-person” intervention system informed by the systems archives have put into place would be a good place to start.
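
The “multi-layered,” “multi-person” intervention the authors call for, together with the appraisal records described in point 4 above, could begin as simply as a structured, machine-readable record attached to every dataset, in the spirit of dataset documentation proposals such as “Datasheets for Datasets.” The following Python sketch is illustrative only; none of these field names come from the paper.

    # Minimal sketch of a machine-readable dataset appraisal record. Field
    # names are illustrative assumptions, not a schema defined in the paper.
    from dataclasses import dataclass, field

    @dataclass
    class AppraisalRecord:
        dataset_name: str
        mission_statement: str    # why the data is collected, beyond task convenience
        collection_method: str    # how sources were gathered, and by whom
        consent_process: str      # how represented communities consented or contributed
        known_gaps: list = field(default_factory=list)  # acknowledged representational gaps
        reviewers: list = field(default_factory=list)   # the "multi-person" sign-off layer

    record = AppraisalRecord(
        dataset_name="example-corpus",  # hypothetical
        mission_statement="Document regional oral histories with fair representation.",
        collection_method="Community-submitted recordings curated by full-time staff.",
        consent_process="Contributors opt in and may define their own category labels.",
        known_gaps=["Under-representation of speakers born before 1940"],
        reviewers=["curator", "community liaison", "ethics panel"],
    )
    print(f"{record.dataset_name}: {len(record.reviewers)} reviewers, "
          f"{len(record.known_gaps)} known gap(s)")

Requiring such a record, reviewed by more than one person before a dataset is released, would mirror the committee-based supervision archives already practice.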


Original paper by Eun Seo Jo (Stanford University) and Timnit Gebru (Google): https://arxiv.org/abs/1912.10389

