✍️ Original article by Yacine Jernite, Giada Pistilli, and Carlos Muñoz Ferrandis from Hugging Face.
Yacine Jernite is a researcher at Hugging Face working on exploring the social and legal context of Machine Learning systems, particularly ML and NLP datasets.
Giada Pistilli is an Ethicist at Hugging Face and a Philosophy Ph.D. candidate at Sorbonne University.
Carlos Muñoz Ferrandis is a Tech & Regulatory Affairs Counsel at Hugging Face and a Ph.D. candidate focused on the intersection between open source and standards.
This is part of the Social Context in LLM Research: the BigScience Approach series written by participants of several working groups of the BigScience workshop, a year-long research project focused on developing a multilingual Large Language Model in an open and collaborative fashion. In particular, this series focuses on the work done within the workshop on social, legal, and ethical aspects of LLMs through the lens of the project organization, data curation and governance, and model release strategy. For further information about the workshop and its outcomes, we encourage you to also follow:
- BigScience Blog
- BigScience Organization Twitter
- BigScience Model Training Twitter
- BigScience ACL Workshop
Introduction
The previous post in this series provided an overview of the BigScience Workshop’s approach to addressing the social context of Large Language Model (LLM) research and development across its project governance, data, and model release strategy. In the present article, we dive further into the first of these aspects, and particularly into the work that enabled a value-driven, consensus-based, and legally grounded research approach.
Our ability to promote the kind of open, accountable, and conscientious research we believe is necessary to steer the development of new language technology toward more beneficial and equitable outcomes hinges first on the implementation of these values within the project’s internal processes. Thus, we first apply the analysis outlined in the previous article to the project’s internal governance and ethical and legal grounding, as this aspect of the workshop defines the framework for all of the other research questions we aim to address:
- People. Our work on this aspect of the workshop is focused on two main categories of stakeholders: the workshop participants themselves whose research will follow our proposed processes and be informed by its ethical and legal work, and the broader ML and NLP research community for whom we hope to showcase a working example of a large-scale value-grounded distributed research organization.
- Ethical focus. Finding a common approach and shared values that can guide ethical discussions within the project while valuing the diversity of contexts and perspectives at play is instrumental to enabling value-grounded decision-making within the workshop. We organize our work to that end around the elaboration of a collaborative ethical charter.
- Legal focus. We study the growing number of regulations relevant to our area emerging around the world for two reasons: to allow all of our participants to fully engage in research without exposing themselves to sanctions, and to understand how different jurisdictions operationalize the values that are represented in our ethical charter.
- Governance. A project’s internal governance processes can disrupt or entrench disparities and determine whose voice is welcomed to the table and taken into account when making decisions. We adopt decision processes guided by the ideal of consensus, as we see it as most consistent with our ethical charter.
This blog post outlines two main efforts that helped structure our work toward the goals outlined above: a multidisciplinary effort to gather input from all workshop participants in order to collaboratively build an ethical charter reflecting shared driving values for the project, and a week-long “hackathon” during which legal scholars from around the world worked, across 9 different jurisdictions, on answering questions that had come up during the preceding workshop months.
Collaboration, Ethical Charter, and Driving Values
First, the project had to determine how to align its participants’ expectations and provide a framework for working on a common goal while allowing for different perspectives to meaningfully coexist. One significant effort in that direction was the elaboration of an ethical charter with a threefold scope:
- to establish BigScience’s core values in order to allow its contributors to commit to them both individually and collectively;
- to serve as a pivot for drafting the documents intended to frame specific issues ethically and legally (e.g. license, model cards, data governance, etc.);
- to promote BigScience values within the research community through scientific publications, disseminations, and scientific popularization.
In order to devise such a charter, we found normative Western philosophical traditions such as virtue ethics, utilitarianism, or deontology ill-adapted to our particular setting given their focus on working toward a unified definition of values, whereas we wanted an ethical approach that is agnostic on value definitions and welcomes differences. We therefore decided to ground our approach in the Confucian moral notion of harmony, which allows us to emphasize value pluralism. The goal was to adopt an approach that takes into account our multidisciplinarity and cherishes our multiculturalism; the concept of harmony allows us to let different values coexist, despite their possible conflicts. The central idea of this approach is to ensure that possibly divergent opinions and points of view can confront each other and ultimately reach a harmonious balance.
From an internal governance perspective, this focus on confronting perspectives meant that we strove to use consensus, rather than unanimity, as the basis for making decisions when conflicts arose: continuing conversations until all major objections had been addressed. In particular, some decisions that involved multiple working groups (such as project-wide timelines) sometimes required us to resolve communication gaps that had developed and to manage different priorities. We did so through a combination of online (video calls) and offline (Slack and comment threads on documents, which helped keep track of dissensus) discussions that included the working group chairs and other members based on availability. Those discussions typically continued until consensus was reached or, in a few cases, until participants decided to ask another core workshop organizer to come in and make a call.
We wanted our charter to act as a dynamic document capable of guiding us while leaving focused working groups the freedom to adapt these values to the concerns that arose in their specific context. We started our brainstorming around BigScience’s moral values with a participative document that allowed us to focus on our shared values, especially in moments of crucial technical decision-making. After months of biweekly interdisciplinary discussion, we drafted the following list of values to serve as a basis for decisions within the workshop:
- Inclusivity: grounded in the general principles of acceptance, belonging, and non-discrimination, we interpret inclusivity as a core value that also allows us to articulate the subsequent values.
- Diversity: inclusivity is insubstantial if it doesn’t take diversity into account in all aspects, so that different cultures, scientific fields, and organizations can take part in the project.
- Reproducibility: given the open scientific nature of BigScience, this value allows the project to ensure the reproduction of its results and conclusions.
- Openness: echoing the previous value, openness allows collaborators to ensure that the internal processes of the project and its results will always be available to the scientific community.
- Responsibility: this value aims to make all collaborators responsible, both individually and collectively, for their actions; mindful of the impact of those actions, it also considers the link between the social and environmental responsibilities of developing LLMs.
The full ethical charter, containing more detail on the approach and operationalizing these values, is available here.
Legal Scholarship, NYU Collaboration, and Legal Playbook
The BigScience workshop was designed from the start as an international project: led by participants from many different countries, dealing with languages and text data from around the world, and aiming to produce research artifacts relevant to many different places. This makes being aware of the existing and emerging regulations around data and AI/ML in all jurisdictions at play particularly important for two reasons: first, it allows both data and algorithm subjects and project participants to fully benefit from the protections laid out in these regulations; second, insofar as these regulations reflect different values and priorities, they inform us about local contexts. The latter matters all the more given the weight we placed on value pluralism in the ethical work described above.
Given the novelty of both the technology and these regulatory frameworks, understanding how they interact concretely at the level of the dataset and algorithmic design choices requires a significant amount of new multidisciplinary research; a project like BigScience provides a rare opportunity to have contributors with this range of skills interact and learn from each other. Thus, BigScience’s legal scholarship group has taken a transversal approach, centralizing legal questions that came up in all working groups in the course of their work, and contributing to the project’s overall governance framework.
In particular, this cross-disciplinary work and dialogue led in January 2022 to the organization of a legal hackathon in partnership with the NYU Faculty of Law. Under this framework, Legum Magister students (i.e. legal experts) from all over the world spent an intensive week pushing the boundaries of research at the intersection between law, ethics, and policy in the AI realm. In addition to several research papers, the students produced a legal playbook to help practitioners navigate relevant regulations in the context of ML and NLP in several jurisdictions around the world, including not only North American and European contexts but also Brazil, China, South Africa, Japan, Colombia, and South Korea. The questions were organized into five themes corresponding to common concerns of ML practitioners:
- Intellectual Property: What kinds of content and data are subject to IP in each of these jurisdictions? How does IP on source content flow down to datasets and trained models?
- Licensing: What licensing mechanisms are available for various types of ML systems and artifacts? How can ML practitioners navigate the relationship between licenses, terms of use, and underlying regulations?
- Research exceptions and legal grounds for data use: Can I use copyrighted or protected data more easily if it’s for research purposes? What is Fair Use, and are there similar mechanisms outside of the US?
- Privacy: What are the legal privacy concerns for different kinds of web content? What are the NLP researchers’ responsibilities in terms of handling content?
- Regulated content: Are some kinds of content prohibited from being mined, stored, or used in model training? Are some kinds of content subject to other kinds of similar restrictions?
The playbook is available to support research by actors in, and on language data from, all of these jurisdictions.
Conclusion and Statement of Contributions
We started our discussion of the relationships between ethical, legal, and governance work and of their roles in fostering a welcoming research environment conscious of its place in society by outlining how they informed the project organization aspect of the BigScience workshop.
In particular, our work in all three domains was grounded in an approach that favors value pluralism as a necessary condition to enable substantial diversity and inclusivity of the project’s direct participants and general stakeholders. This value pluralist approach also guided our efforts on questions related to the training data and model release strategy, as described in the next two blog posts in this series.
The work presented here corresponds to efforts coordinated by the BigScience Ethical and Legal Scholarship Working Group and the Organization Working Group. Its outcomes reflect the contributions of all BigScience participants who chose to engage in the ethical charter creation process and of the Legum Magister students who answered legal questions for the playbook, in addition to members of the above Working Groups.