🔬 Research summary by Felipe González-Pizarro, a Ph.D. Computer Science student at the University of British Columbia. His research focuses on natural language processing, information visualization, and computational social science.
[Original paper by Felipe González-Pizarro, Andrea Figueroa, Claudia López, Cecilia Aragon]
Overview: While there is increasing global attention to data privacy, most of our understanding of it is based on research conducted in a few countries in North America and Europe. This paper proposes an approach to studying data privacy over a larger geographical scope. By analyzing Twitter content about the #CambridgeAnalytica scandal, we observe language and regional differences in privacy concerns that hint at a need to extend current information privacy frameworks.
In 2018, the firm Cambridge Analytica was accused of collecting and using the personal information of more than 87 million Facebook users without their authorization. Opinions, facts, and stories related to it were shared on social media, including Twitter, where the hashtag #DeleteFacebook became a trending topic for several days.
This paper analyzes more than a million public tweets related to the scandal. First, we divide the dataset by language (Spanish and English) and region (Latin America, Europe, North America, and Asia). Using word embeddings and manual content analysis, we study and compare the semantic contexts in which privacy-related terms were used. Then, we contrast our results with one of the most widely used information privacy concern frameworks (IUIPC). We pay special attention to the differences in emphasis on privacy-related terms across languages and world regions.
We observe a greater emphasis on data collection in English than in Spanish. Additionally, data from North America exhibits a narrower focus on awareness compared to other regions under study. Some key concepts, such as regulations, are discussed online in all regions and both languages but have not yet been added to current information privacy frameworks. Our results call for more diverse sources of data and nuanced analysis of data privacy concerns around the globe.
Can information privacy concerns be present in a Twitter dataset?
Mining text from social media platforms such as Twitter is a fast and inexpensive method to gather opinions from individuals and can complement findings obtained from traditional polls or other research methods. Following this line of research, we investigate whether Twitter data can reveal people’s information privacy concerns. Thus, our first research question is: Which information privacy concerns are present in social media content about a data-breach scandal?
To answer this question, we first retrieved the terms most closely related to four privacy-related keywords, “data”, “privacy”, “user”, and “company”, in multiple word embeddings. Word embeddings represent words as vectors that encode their meaning, so that words that are closer in the vector space are expected to be related. For instance, the three terms most closely related to “privacy” in our English word embedding are “data privacy”, “gdpr”, and “protection.”
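The idea of retrieving the terms closest to a keyword can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the three-dimensional vectors below are invented toy values, the function names `cosine_similarity` and `most_related` are hypothetical, and cosine similarity is assumed as the closeness measure because it is a common choice for word embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_related(keyword, embedding, top_n=3):
    """Return the top_n terms whose vectors are closest to the keyword's vector."""
    target = embedding[keyword]
    scored = [
        (term, cosine_similarity(target, vector))
        for term, vector in embedding.items()
        if term != keyword  # a word is trivially closest to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:top_n]]


# Toy vectors invented for illustration only; real embeddings are learned
# from the tweets and typically have hundreds of dimensions.
toy_embedding = {
    "privacy": [0.9, 0.1, 0.0],
    "gdpr": [0.8, 0.2, 0.1],
    "protection": [0.7, 0.3, 0.0],
    "football": [0.0, 0.1, 0.9],
}

related = most_related("privacy", toy_embedding, top_n=2)
print(related)
```

With these toy values, “gdpr” and “protection” rank above “football” because their vectors point in nearly the same direction as the vector for “privacy”.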
Collecting and analyzing the semantic contexts of these privacy-related keywords allows us to observe the presence of terms related to information privacy concerns in the collected tweets. We systematically conducted open coding of these terms. After several iterations, we developed a set of categories to characterize them. Finally, to assess whether information privacy concerns were present, we contrasted these categories with a widely accepted framework that describes internet users’ information privacy concerns (IUIPC). We find relationships among some of our categories and the three IUIPC concepts as well as our initial keywords (see Figure 1).
Figure 1: We identify several categories that can be easily mapped to the three dimensions of the Internet User Information Privacy Concerns (IUIPC): collection, awareness, and control. In this way, we find evidence that social media content can reveal information about privacy concerns.
Current conceptualizations of information privacy concerns might need to be extended
Our results suggest a more granular categorization of an IUIPC concept. Awareness might include more specific sub-topics that users can be aware of, such as privacy and security terms (e.g., cybersecurity, confidentiality), security mechanisms (e.g., credentials, encrypted), and privacy and security risks (e.g., scams, grooming). The presence of terms that fit these categories reveals that they are already part of public online conversations around privacy. A distinction among broad privacy and security terms, mechanisms to protect data, and potential data risks might be helpful to describe further the kinds of knowledge people have. Additionally, awareness about some of these subtopics might be more influential than others. For example, knowing about risks and mechanisms might be a sign of higher privacy concerns, while knowing broad privacy and security terms might not. The distinction between sub-topics could also guide users’, educators’, and practitioners’ efforts to enhance information privacy literacy.
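As a rough illustration of the finer-grained view of awareness, the sub-topics and example terms named above can be stored as a lookup table and tallied for any list of related terms. This is a sketch under stated assumptions: the table holds only the six example terms from the paragraph above, `category_shares` is a hypothetical helper, `sample_terms` is an invented list, and a simple proportion is used here as a stand-in that is not necessarily how the paper measures emphasis.

```python
from collections import Counter

# Example terms and sub-topics taken from the paragraph above; the full
# coding scheme in the paper covers far more terms than these six.
TERM_TO_SUBTOPIC = {
    "cybersecurity": "privacy and security terms",
    "confidentiality": "privacy and security terms",
    "credentials": "security mechanisms",
    "encrypted": "security mechanisms",
    "scams": "privacy and security risks",
    "grooming": "privacy and security risks",
}


def category_shares(terms):
    """Share of coded terms per awareness sub-topic (uncoded terms are skipped)."""
    counts = Counter(
        TERM_TO_SUBTOPIC[term] for term in terms if term in TERM_TO_SUBTOPIC
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {subtopic: count / total for subtopic, count in counts.items()}


# Invented list standing in for the terms related to one keyword.
sample_terms = ["encrypted", "scams", "credentials", "weather"]
shares = category_shares(sample_terms)
print(shares)
```

Computing such shares separately for each language or region would give one simple way to compare where the emphasis falls, although any claim of a significant difference would still require a proper statistical test.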
Regulations are not only a topic of data and law experts
The presence of the regulation category also highlights its importance in relation to information privacy concerns. Regulation refers to laws or rules that aim to regulate the use of personal data. The emergence of this category from our open coding, driven by its frequent appearance in public posts about a data breach scandal, confirms its relevance. Regulations are not only a topic for data and law experts; they also seem to be part of the public discourse around data privacy online.
Language and regional differences in emphasis on information privacy concerns
English speakers emphasize data collection more than Spanish speakers.
Our analysis reveals that English speakers place significantly more emphasis on data collection than Spanish speakers when they discuss privacy keywords freely online. This difference could prompt researchers and practitioners to explore the effectiveness of data privacy campaigns tailored to specific populations. For example, populations concerned about collection might need more information about the benefits of sharing their information.
North American privacy concerns are not generalizable to other regions.
We also observe significant regional differences in awareness. In particular, data from North America shows the smallest emphasis on awareness, while data from Latin America shows the greatest. This finding is particularly important because most studies on information privacy concerns are centered on the USA. It warns against the (sometimes implicit) assumption that North American privacy concerns generalize to other regions. Our result provides observational evidence that more diverse populations must be included to better understand the phenomena around data privacy. This finding also invites practitioners to approach other regions, such as Latin America, with different services and privacy policies. For example, populations that place more emphasis on awareness might be more receptive to companies that communicate their use of personal data more transparently.
Between the lines
Our paper uses an alternative approach to study information privacy concerns over a large geographical scope. This approach aims to discover knowledge from a large-scale social media dataset on a topic for which no ground truth exists. Unfortunately, such ground truth is unlikely to become available because large-scale, multi-country, and multi-language surveys are too expensive to conduct (Li et al., 2020).
We carefully analyzed more than a thousand terms from the semantic contexts, conducted open coding to formulate a data-grounded categorization, and contrasted our categorization with IUIPC (Malhotra et al., 2004), a well-accepted theoretical conceptualization of information privacy concerns.
In our paper, we discuss how our findings can extend current conceptualizations of information privacy concerns. Finally, we examine how they might relate to regulations about personal data usage in the regions we analyzed.
Future work can dig deeper into the observed differences and study the potential causes. Future studies might build upon our work to examine privacy concerns considering more languages, geographical locations, or different information privacy frameworks. Using our methodology to compare datasets across more extended periods could be helpful to determine whether the semantic contexts of the privacy keywords change over time.