🔬 Research Summary by Dominik Hintersdorf & Lukas Struppek. Dominik & Lukas are both Ph.D. students at the Technical University of Darmstadt, researching the security and privacy of deep learning models.
[Original paper by Dominik Hintersdorf, Lukas Struppek, and Kristian Kersting]
Overview: A few key players like Google, Meta, and Hugging Face are responsible for training and publicly releasing large pre-trained models, providing a crucial foundation for a wide range of applications. However, adopting these open-source models carries inherent privacy and security risks that are often overlooked. This study presents a comprehensive overview of common privacy and security threats associated with using open-source models.
Introduction
The field of artificial intelligence (AI) has experienced remarkable progress in recent years, driven by the widespread adoption of open-source machine learning models in both research and industry. Considering the resource-intensive nature of training on vast datasets, many applications opt for pre-trained models released by a few key players. However, adopting these open-source models carries inherent privacy and security risks that are often overlooked. The implications of successful privacy and security attacks encompass a broad spectrum, ranging from relatively minor damage like service interruptions to highly alarming scenarios, including physical harm or the exposure of sensitive user data.
In this work, the authors present a comprehensive overview of common privacy and security threats associated with using open-source models. By raising awareness of these dangers, they aim to promote the responsible and secure use of AI systems.
Key Insights
Understanding Security & Privacy Risks for Open-Source Models
Open-source models are often published on platforms like Hugging Face, TensorFlow Hub, or PyTorch Hub and are deployed in numerous applications and settings. While this practice clearly has its upsides, the trustworthiness of such pre-trained open-source models is increasingly coming into focus. Since the model architecture, weights, and training procedure are publicly known, malicious adversaries have an advantage when attacking these models compared to settings where models are kept behind closed doors. While all attacks presented in this work are also possible to some extent without full model access and detailed knowledge of the specific architecture, they become inherently more difficult to perform without such information.
Open-Source Models Leak Private Information
Model Inversion Attacks
Model inversion and reconstruction attacks aim to extract sensitive information about the training data of an already trained model, e.g., by reconstructing images that disclose sensitive attributes or by generating text containing private information from the training data. These attacks typically rely on generative models to produce samples from the training data domain. Since an attacker has full access to open-source models, model inversion attacks are a genuine threat to the privacy of the training data. Imagine an open-source model trained to classify facial features like hair or eye color. An adversary successfully performing a model inversion attack could then generate synthetic facial images that reveal the identity of individuals from the training data. Closely related to model inversion attacks is the issue of data leakage through unintended memorization. For instance, the model might inadvertently complete the query “My social security number” with a real social security number that was present in the model’s training data. In addition to such accidental leakage, there is also a concern that malicious users could deliberately craft queries that provoke it.
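To illustrate the basic mechanics, the following is a minimal sketch of a gradient-based inversion attack in PyTorch: with full access to a publicly released classifier, the attacker optimizes an input image to maximize the logit of a chosen class. The specific model (a torchvision ResNet-18) and the target class index are hypothetical placeholders, not the setup used in the paper.

```python
# Minimal sketch of a gradient-based model inversion attack.
# The classifier (a torchvision ResNet-18) and the target class index are
# hypothetical placeholders, not the setup evaluated in the paper.
import torch
from torchvision import models

# The attacker downloads a publicly released classifier with full weight access.
model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_class = 207  # hypothetical class/identity the attacker wants to reconstruct

# Start from random noise and optimize the input pixels directly.
x = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    logits = model(x)
    # Maximize the target logit; the small L2 term keeps pixel values bounded.
    loss = -logits[0, target_class] + 1e-4 * x.pow(2).sum()
    loss.backward()
    optimizer.step()

# x now approximates an input the model strongly associates with the target
# class and may expose visual features memorized from the training data.
```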
Membership Inference Attacks
While inversion and data leakage attacks try to infer information about the training data by reconstructing parts of it, membership inference attacks try to infer which data samples have been used to train a model. Imagine that a hospital trains a machine learning model on the medical data of its patients to predict whether future patients will have cancer. An attacker gains access to the model and holds a set of private data samples. The adversary attempts to infer whether a person’s data was used to train the cancer prediction model. If the attack is successful, the attacker knows not only that the person has or had cancer but also that they were once a patient in that hospital. Full access to an open-source model makes membership inference attacks more feasible than against models kept behind APIs, because the attacker can observe the intermediate activations for every input, making it easier to infer membership.
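A common and simple baseline for membership inference is a loss-threshold attack: samples seen during training tend to incur a lower loss than unseen samples. The sketch below illustrates this idea under the assumption of white-box access; the model, candidate samples, and threshold are hypothetical, and in practice the threshold would be calibrated, e.g., with shadow models trained on similar data.

```python
# Minimal sketch of a loss-threshold membership inference attack.
# The model, candidate samples, and threshold are hypothetical; in practice
# the threshold would be calibrated, e.g., with shadow models.
import torch
import torch.nn.functional as F

def membership_score(model, x, y):
    """Training samples tend to incur lower loss than unseen samples."""
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))          # add a batch dimension
        loss = F.cross_entropy(logits, y.unsqueeze(0))
    return -loss.item()                         # higher score = more likely a member

def infer_membership(model, x, y, threshold=-0.5):
    """Predict whether the sample (x, y) was part of the model's training set."""
    return membership_score(model, x, y) > threshold
```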
Open-Source Models Are More Prone To Security Attacks
Backdoor Attacks
Open-source models are trained on vast datasets, often comprising millions or even billions of data samples. At this scale, human inspection of the data is infeasible, forcing a reliance on the integrity of these datasets. However, previous research has revealed that adding a small set of manipulated samples to a model’s training data can significantly influence its behavior. This dataset manipulation is referred to as data poisoning. For numerous applications, manipulating less than 10% of the available data is sufficient to make the model learn additional hidden functionalities. Such hidden functionalities are called backdoors and are activated when the model input during inference contains a specific trigger pattern. In image classification, for instance, trigger patterns may involve specific color patterns placed in the corner of an image, e.g., a checkerboard pattern. Text-to-image synthesis models are a notable example: small manipulations of the training data are sufficient to inject backdoors that single characters or words can trigger. As a result, these triggered models can generate harmful or offensive content. Detecting this type of model manipulation is challenging for users since the models appear to function as expected on clean inputs.
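The following minimal sketch shows how such a poisoning attack could be mounted in PyTorch: a small checkerboard trigger is stamped into a fraction of training images, which are then relabeled to an attacker-chosen target class. The trigger size, poison fraction, and data layout are illustrative assumptions, not the exact recipe from the referenced work.

```python
# Minimal sketch of data poisoning with a checkerboard trigger.
# The trigger size, poison fraction, and attacker-chosen target label are
# illustrative assumptions; images are CxHxW tensors with values in [0, 1].
import torch

def add_checkerboard_trigger(image, size=4):
    """Stamp a small checkerboard pattern into the bottom-right corner."""
    patch = torch.zeros(size, size)
    patch[::2, ::2] = 1.0
    patch[1::2, 1::2] = 1.0
    poisoned = image.clone()
    poisoned[:, -size:, -size:] = patch
    return poisoned

def poison_dataset(images, labels, target_label, poison_fraction=0.05):
    """Trigger and relabel a small fraction of samples to plant a backdoor."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(len(images) * poison_fraction)
    for i in torch.randperm(len(images))[:n_poison]:
        images[i] = add_checkerboard_trigger(images[i])
        labels[i] = target_label
    return images, labels
```

A model trained on the poisoned set behaves normally on clean images but predicts the target label whenever the trigger appears, which is what makes the manipulation hard to detect.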
Adversarial Examples
In addition to poisoning attacks, which manipulate the training process to introduce hidden backdoor functionality into a model, adversarial attacks target models solely during inference. Adversarial examples are slightly modified model inputs, crafted to alter the model’s behavior for the given input. Consequently, these samples can be employed to bypass a model’s detection mechanisms and cause misclassifications. The public availability of open-source model weights and architectures poses a risk here, as adversaries can craft adversarial examples against a local copy of the model and then use them to deceive the targeted system.
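To illustrate how cheap such white-box attacks become once the weights are public, the sketch below implements the classic fast gradient sign method (FGSM), one standard way of crafting adversarial examples; the model, the batched inputs with their labels, and the perturbation budget are hypothetical placeholders.

```python
# Minimal sketch of a white-box adversarial attack (fast gradient sign method).
# The model, the batched inputs x with labels y, and the perturbation budget
# epsilon are hypothetical placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb each input slightly in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```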
Between the lines
Public access to model weights can significantly facilitate privacy attacks like inversion or membership inference, particularly when the training set remains private. Similarly, security attacks aimed at compromising model robustness can be executed by manipulating the training data to introduce hidden backdoor functionalities or by crafting adversarial examples to manipulate inference outcomes. These risks affect not only the published model itself but also extend to applications and systems that incorporate it. The benefits of publishing large models, such as large language and text-to-image synthesis models, outweigh the drawbacks. Still, users and publishers must be aware of the risks inherent in open-source practices. The authors hope that by drawing attention to these risks, defenses and countermeasures can be improved to allow the safe and private use of open-source models.