🔬 Research summary by Scott Cambo, a data scientist and independent researcher specializing in human-centered machine learning as well as designing and developing interactive machine learning systems.
[Original paper by Scott Allen Cambo, Darren Gergle]
Overview: Computational methods and qualitative thinking are not obvious partners. The former helps us automate simple reasoning to quantify phenomena in our data, while the latter is essential for framing, defining, and understanding what it is that we quantify. This research draws from feminist qualitative research pedagogy to inform new techniques and data representations that help us go beyond evaluating models in terms of accuracy to evaluating models in terms of who the model is accurate for.
Introduction
Many modern uses of data science attempt to build algorithms that automate a subjective interpretation of a phenomenon, as in automated content moderation. However, data science pedagogy is most often derived from math- and computation-oriented disciplines, which have seldom wrestled with the challenges that subjectivity, bias, and personal perspective bring to scientific inquiry. This paper suggests that to responsibly engage with social and subjective phenomena through data science, we must consider employing the concept of “positionality”, one’s social, cultural, and political position with regard to the research subject, by practicing “reflexivity”, the examination of how one’s own feelings, reactions, and motives influence critical choices made in the research process.
The value of positionality and reflexivity is demonstrated by introducing the novel concepts of “Model Positionality” and “Computational Reflexivity”. Model Positionality is a statement of a model’s social position with regard to the social, political, and cultural context of both its development and its deployment. However, the kind of critical reflection needed to understand model positionality is not easily achieved at the scale of “big data” using existing qualitative methods. “Computational Reflexivity” describes computational techniques that can help us carry out this critical reflection at scale. This research presents annotator fingerprinting and position mining as demonstrations of computational reflexivity.
Key Insights
Donna Haraway’s Situated Knowledges
Donna Haraway is a prominent feminist and post-modernist scholar whose contributions have had a large impact on the way that we account for social context in the scientific research process. One of her biggest contributions comes from an essay in which she introduces the concept of situated knowledges: the idea that by acknowledging and reflecting on one’s presence in the process of knowledge production, subjects can produce knowledge with greater objectivity than if they claimed to be neutral. When our personal perspectives are a primary tool for observation, as is often the case in qualitative research, we can only claim validity if we understand and account for the conditions under which we made and understood our observations. Knowledge is most valid under these situated conditions, which invite the critical dialog necessary for expanding the validity of such knowledge beyond local bounds.
Haraway emphasizes that self-presence, self-knowledge, and self-identity must be intentional and practiced in order to best answer questions like “What should I be looking for?”, “Who should I be looking with?”, or “What instrument should I be using to look?” In qualitative research, this intentional reflexive practice is the development of one’s positionality, or stance, in relation to the social, cultural, and political context of the subject. This necessitates clarity regarding which aspects of identity and personal experience are drawn from when producing knowledge. Making one’s positionality intentional and transparent with respect to the subject of research improves validity by recognizing the social, cultural, and political context of the knowledge produced.
Model Positionality and Computational Reflexivity
In the evolving discussion regarding both technical and social transparency in data science practice, this research contributes two important points. The first is that a model produced through data science can have its own positionality, meaningfully distinct from the positionality of the data scientist, because most machine learning algorithms encode assumptions about the way that information exists in the world, and those assumptions are not always clear. This is called Model Positionality.
Understanding the social, cultural, and political environment of any research requires a thoughtful process often referred to as reflexivity by social scientists. Reflexivity can take many forms depending on the risks and biases of a research project that are known at the beginning. However, many of these processes were designed for qualitative research methods that can observe rich detail regarding social phenomena, but at a smaller scale of data collection. This brings us to the second point of this research: to adequately understand the social context of data science work, we need reflexive methods that can scale to big data. This is called Computational Reflexivity.
Introducing Annotator Fingerprinting and Position Mining
To kick off this new methodology, Cambo and Gergle present annotator fingerprinting and position mining as methods of computational reflexivity which can help data scientists better understand how their perspective compares to those of the crowd annotators who label the development dataset and those of any models being evaluated. This is accomplished by leveraging the assumption that actors with similar perspectives will make similar judgements about similar content. Annotator fingerprints are a data structure that can be built from the labels of any person or algorithm that can label, e.g. a data scientist, UX designer, user, data annotator, or even a classification model. An annotator fingerprint represents both the judgements that an actor has made and the content those judgements were applied to. The beauty of this data structure is that we can compare fingerprints to each other using common matrix and vector similarity functions to understand the specific contexts in which these actors agree or disagree. This allows a data scientist to go beyond simply asking “Is my model accurate?” to asking “Is my model accurate for me? For certain annotators? For the users?”
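As a rough illustration of the idea, and not the authors' implementation, an annotator fingerprint can be sketched as a mapping from content items to the judgement an actor gave each one. Everything here is an assumption made for the example: the `Fingerprint` type, the +1/-1 label encoding, the item ids, and the choice of cosine similarity restricted to items both actors labeled.

```python
from math import sqrt

# Hypothetical encoding: +1 = "toxic", -1 = "not toxic".
# Items an actor never labeled are simply absent from the mapping.
Fingerprint = dict[str, int]


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Cosine similarity computed only over items both actors labeled."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disagreements(a: Fingerprint, b: Fingerprint) -> list[str]:
    """Items both actors labeled but judged differently."""
    return sorted(i for i in a.keys() & b.keys() if a[i] != b[i])


# Made-up fingerprints for a person, a crowd annotator, and a model.
data_scientist = {"c1": 1, "c2": -1, "c3": 1, "c4": -1}
annotator = {"c1": 1, "c2": -1, "c3": -1, "c5": 1}
model = {"c1": 1, "c2": -1, "c3": 1, "c4": 1}

print(cosine_similarity(data_scientist, model))            # 0.5
print(round(cosine_similarity(annotator, model), 2))       # 0.33
print(disagreements(data_scientist, model))                # ['c4']
```

Because people and models share one representation, "Is my model accurate for me?" becomes a similarity between two fingerprints, and the list of disagreements points to the specific content where the two perspectives diverge.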
This ability to deeply analyze the similarity between human and machine annotators enables position mining, computational techniques for determining probable common perspectives, or positions, with regard to the classification task. Currently, this is done by simply applying clustering with the annotator fingerprint similarity functions, but the door is open for developing more robust methods.
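To make the clustering step concrete, here is a minimal sketch under stated assumptions. The summary does not say which clustering algorithm or similarity function the paper uses, so this example swaps in a simple stand-in: actors are linked whenever their agreement rate on co-labeled items meets a threshold, and connected groups are read off as positions. The function names, the 0.75 threshold, and the toy labels are all hypothetical.

```python
from itertools import combinations


def agreement(a: dict, b: dict) -> float:
    """Fraction of co-labeled items on which two fingerprints match."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[i] == b[i] for i in shared) / len(shared)


def mine_positions(fingerprints: dict, threshold: float = 0.75) -> list:
    """Group actors linked by pairwise agreement >= threshold (union-find)."""
    parent = {name: name for name in fingerprints}

    def find(x):
        # Walk up to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(fingerprints, 2):
        if agreement(fingerprints[u], fingerprints[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up fingerprints: 1 = "toxic", 0 = "not toxic".
fingerprints = {
    "annotator_a": {"c1": 1, "c2": 1, "c3": 0, "c4": 0},
    "annotator_b": {"c1": 1, "c2": 1, "c3": 0, "c4": 1},
    "annotator_c": {"c1": 0, "c2": 0, "c3": 1, "c4": 1},
    "model": {"c1": 0, "c2": 0, "c3": 1, "c4": 1},
}

positions = mine_positions(fingerprints)
print(positions)
# [['annotator_a', 'annotator_b'], ['annotator_c', 'model']]
```

In this toy case the model lands in the same position as `annotator_c`, which is the kind of finding position mining is meant to surface: the model is accurate for one group of annotators and not for the other.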
Finally, these methods are all demonstrated through an analysis of the popular Wikipedia Toxic Comment Classification dataset, which is often used to train the automated content moderation algorithms that social media platforms use to determine what should be flagged for removal. This dataset has over 3,000 annotators, each of whom can be represented as an annotator fingerprint and grouped by common positions using the position mining technique.
Between the lines
This research demonstrates the value in integrating qualitative research concepts and methodology into data science practices via principled methods for navigating the subjective and discretionary choices that need to be made when we analyze social phenomena. Validating and expanding on these proposed methods for computational reflexivity will require new collaborations between qualitative social scientists and data scientists. These joint efforts could mean an exciting new chapter for the field of computational social science where computational methods are often used to augment traditional social science.