Summary contributed by Victoria Heath (@victoria_heath7), Communications Manager at Creative Commons
*Authors of full paper & link at the bottom
1. Introduction
Despite the growing deployment of machine learning (ML) systems, there is a profound lack of understanding regarding their inherent vulnerabilities and how to defend against attacks. In particular, more research is needed on the “sensitivity” of ML algorithms to their input data. In this paper, Papernot et al. “systematize findings on ML security and privacy,” “articulate a comprehensive threat model” and “categorize attacks and defenses within an adversarial framework.”
2. About Machine Learning
Overview of Machine Learning Tasks
Through machine learning, we’re able to automate data analysis and create relevant models and/or decision procedures that reflect the relationships identified in that analysis. There are three common classes of ML techniques: supervised learning (training with inputs labeled with corresponding outputs), unsupervised learning (training with unlabeled inputs), and reinforcement learning (training with data in the form of sequences of actions, observations, and rewards).
ML Stages: Training and Inference
There are two general stages (or phases) of ML. In the training stage, a model is learned from input data; the paper describes models as “functions hθ(x) taking an input x and parametrized by a vector θ ∈ Θ.” Once trained, the model’s performance is measured against a test dataset to assess its generalization (performance on data not included in its training). In the inference stage, the trained model is deployed to make predictions on inputs it did not see during training.
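To make the two stages concrete, here is a minimal sketch (not from the paper) using a toy logistic-regression model hθ(x): the training stage fits the parameter vector θ to labeled data, and the inference stage applies the fixed model to unseen inputs to estimate generalization. All data, names, and hyperparameters are illustrative.

```python
import numpy as np

def h(theta, X):
    """The parametrized model h_theta(x): a simple logistic regression."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

# --- Training stage: learn theta from labeled data (supervised learning) ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(float)

theta = np.zeros(3)
for _ in range(500):  # plain gradient descent on the logistic loss
    grad = X_train.T @ (h(theta, X_train) - y_train) / len(y_train)
    theta -= 0.5 * grad

# --- Inference stage: the fixed model predicts on inputs not seen in training ---
X_test = rng.normal(size=(50, 3))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(float)
accuracy = np.mean((h(theta, X_test) > 0.5) == y_test)  # rough generalization estimate
print(f"test accuracy: {accuracy:.2f}")
```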
3. Threat Model
It’s important to note that an information security model should consist of two primary components: the threat model and the trust model. In the case of ML systems, these components are examined below.
Threat Model
A system’s threat model is based on the adversarial goals and capabilities it’s designed to protect against. To create a comprehensive model, it’s important to first examine the system’s “attack surface”: where and how an adversary will attack. Although attack surfaces can vary, the authors view all ML systems within a “generalized data processing pipeline” in which adversaries have the opportunity to “manipulate the collection of data, corrupt the model, or tamper with the outputs” at different points within the pipeline.
Trust Model
A system’s trust model relates to the classes of actors involved in an ML-based system’s deployment—this includes data-owners, system providers, consumers, and outsiders. A level of trust is assigned to each actor and the sum forms the model—this helps identify ways in which bad actors may attack the system both internally and externally.
Adversarial Capabilities
The capabilities of an adversary (or a bad actor) refer to the “whats and hows of the available attacks.” These capabilities can be further understood by examining them separately at the inference phase and the training phase. At the inference phase, attacks are referred to as exploratory attacks because they either cause the model to produce selected outputs or “collect evidence about the model’s characteristics.” These attacks are classified as either white box (the adversary has some information about the model or its training data) or black box (the adversary has no knowledge about the model, and instead uses information about the model’s setting or past inputs to identify vulnerabilities).
At the training phase, attacks attempt to learn, influence, or corrupt the model through injection (inserting inputs into the existing training data as a user) or modification (directly altering the training data through the data collection component). A powerful attack against learning algorithms is logic corruption, which essentially modifies the model’s learning environment and gives the adversary control over the model itself.
Adversarial Goals
Using the CIA triad, with the addition of privacy, the authors model adversarial goals both in terms of the ML model itself and in terms of the environment in which the model is deployed. The CIA model includes confidentiality (concerning the model’s structure, parameters, or training/testing data), integrity (concerning the model’s outputs or behaviors), and availability (concerning access to the model’s meaningful outputs or features). With regard to confidentiality and privacy, attacks tend to target the model and its data with the general goal of exposing either one. Specifically, because ML models can “capture and memorize elements of their training data,” it is difficult to guarantee privacy to individuals included in that dataset. With regard to integrity and availability, attacks tend to target the model’s outputs with the goal of inducing “model behavior as chosen by the adversary” and thus undermining the integrity of the inferences. Researchers have also found that the integrity of an ML model can be compromised by attacks on its inputs or training data. Availability is slightly different from integrity: the goal of these attacks is to make the model “inconsistent or unreliable in the target environment.” Finally, if access to a system in which an ML model is deployed depends on the model’s outputs, then the system can be subject to denial-of-service attacks.
4. Training in Adversarial Settings
Training data for ML models is particularly vulnerable to manipulation by adversaries or bad actors. In a poisoning attack, points are inserted into, edited in, or removed from the training dataset “with the intent of modifying the decision boundaries” of the model. These attacks can render the system completely unavailable.
Targeting Integrity
In a study by Kearns et al., the researchers examined the accuracy of learning classifiers when training samples are modified. They found that to achieve 90% accuracy in a model, the “manipulation rate” needs to be less than 10%. Adversaries can attack the integrity of classifiers through label manipulation, which involves perturbing the labels of just a fraction of the training dataset. To do this, the adversary needs either partial or full knowledge of the learning algorithm. Unfortunately, this type of attack not only has immediate ramifications for the model during training but “further degrades the model’s performance during inference,” making its impact difficult to quantify. Adversaries can also attack the integrity of classifiers through input manipulation, which involves corrupting the labels and input features of training points. To do this, the adversary needs a great deal of knowledge about the learning algorithm and its training data. This poisoning attack can be carried out both directly and indirectly. Direct poisoning has primarily been studied against clustering models, where the adversary slowly moves the center of the cluster so that misclassifications occur during the inference stage.
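As a rough illustration of the label-manipulation attack described above (a sketch under simplified assumptions, not an experiment from the paper), the snippet below flips the labels of a small fraction of a synthetic training set and compares the resulting classifier’s test accuracy against a model trained on clean data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Label-flipping poisoning sketch: perturb the labels of a fraction of the
# training set and compare test accuracy of clean vs. poisoned models.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

rng = np.random.default_rng(0)
clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

for rate in (0.05, 0.10, 0.25):  # fraction of training labels flipped
    flipped = y_train.copy()
    idx = rng.choice(len(flipped), int(rate * len(flipped)), replace=False)
    flipped[idx] = 1 - flipped[idx]  # adversarial label manipulation
    poisoned_acc = LogisticRegression(max_iter=1000).fit(X_train, flipped).score(X_test, y_test)
    print(f"flip rate {rate:.0%}: clean {clean_acc:.3f} vs poisoned {poisoned_acc:.3f}")
```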
Targeting Privacy and Confidentiality
The confidentiality and privacy of a model during training may be impacted only to the extent that the adversary has access to the system hosting the ML model, which, the authors note, is a “traditional access control problem” and falls outside the scope of this paper.
5. Inferring in Adversarial Settings
During the inference stage, the adversary must mount an attack that will evade detection during deployment since the model’s parameters are fixed. There are two types of attackers, white-box (they have access to the model’s internals, such as the parameters, etc.) and black-box (they do not have access, and are thus limited to “interacting with the model as an oracle”).
White-box Adversaries
These attackers have access to both the model and its parameters, making them particularly dangerous. To attack a model’s integrity, these adversaries perturb the model’s inputs through direct manipulation (altering the feature values processed by the model), indirect manipulation (introducing perturbations earlier in the data pipeline, before the classifier), or attacks on ML models beyond classification, such as autoregressive models or reinforcement learning agents.
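As a concrete, heavily simplified illustration of direct manipulation by a white-box adversary (not the paper’s own experiments), the sketch below perturbs an input to a toy logistic-regression model using the sign of the loss gradient with respect to the input, in the spirit of gradient-based adversarial example crafting. The model parameters, input, and perturbation budget are all made up for illustration.

```python
import numpy as np

# White-box direct manipulation sketch: the adversary knows the parameters
# and uses the gradient of the loss w.r.t. the *input* to craft a small
# perturbation that pushes the prediction toward the wrong class.

def predict(theta, x):
    return 1.0 / (1.0 + np.exp(-x @ theta))

theta = np.array([2.0, -1.0, 0.5])   # white-box: parameters are known
x = np.array([0.4, -0.3, 0.8])       # a legitimate input, true label 1
y = 1.0

# Gradient of the cross-entropy loss with respect to the input x
grad_x = (predict(theta, x) - y) * theta

epsilon = 0.5                            # perturbation budget (illustrative)
x_adv = x + epsilon * np.sign(grad_x)    # step in the worst-case direction

print("clean prediction:      ", predict(theta, x))      # above 0.5 (correct)
print("adversarial prediction:", predict(theta, x_adv))  # pushed below 0.5
```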
Black-box Adversaries
These attackers do not know the model’s parameters but do have access to the model’s outputs, which allows them to observe its environment, including its detection and response policies. The common threat model for these adversaries is the oracle, in which they “issue queries to the ML model and observe its output for any chosen input” in order to reconstruct the model or identify its training data. These adversaries can attack a model’s integrity through direct manipulation of model inputs or manipulation of the data pipeline. To attack a model’s privacy and confidentiality, adversaries may mount membership attacks (testing whether a specific data point was part of the model’s training dataset), model inversion attacks (extracting training data from a model’s predictions), or model extraction attacks (extracting the parameters of a model by observing its predictions).
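To give a flavor of the oracle threat model, here is a simplified model-extraction sketch (an assumed setup, not taken from the paper): the adversary never sees the target’s parameters, only the labels it returns for chosen queries, and uses those query-label pairs to train a substitute model that mimics the target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Black-box model extraction sketch: query the target as an oracle, label
# adversary-chosen inputs with its predictions, and fit a substitute model.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)

rng = np.random.default_rng(1)
X_queries = rng.normal(size=(2000, 10))    # adversary-chosen query inputs
oracle_labels = target.predict(X_queries)  # only the outputs are observed

substitute = LogisticRegression(max_iter=1000).fit(X_queries, oracle_labels)

X_eval = rng.normal(size=(500, 10))
agreement = np.mean(substitute.predict(X_eval) == target.predict(X_eval))
print(f"substitute agrees with target on {agreement:.0%} of fresh inputs")
```

A substitute built this way can also be used to craft adversarial examples that transfer back to the target, which is why oracle access alone is already a meaningful capability.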
6. Towards Robust, Private, and Accountable Machine Learning Models
In this section, the authors identify parallel defense efforts against attacks to reach the following goals: “(a) robustness to distribution drifts, (b) learning privacy-preserving models, and (c) fairness and accountability.”
Robustness of Models to Distribution Drifts
To maintain integrity, ML models need to be robust to distribution drifts: differences between the training and test distributions. Drifts can occur, for example, as a result of adversarial manipulation. To defend against attacks at training time, proposed defenses include building a PCA-based detection module, adding a regularization term to the loss function, using obfuscation or disinformation to keep details of the model’s internals secret, or creating a detection model that removes outlying data points before the model is learned. Defending against attacks at inference time is difficult due to the “inherent complexity of ML models’ output surface,” and therefore remains an “open problem.” One way to defend against integrity attacks is gradient masking: reducing the model’s sensitivity to small changes in its inputs. However, this strategy has limited success: the adversary can craft adversarial examples on a substitute model whose gradients are not masked, and those examples often transfer to the defended model and are misclassified by it as well. To defend against larger perturbations, defenders can inject correctly labeled adversarial samples into the training dataset, which acts as a form of regularization and makes the model more robust. However, the authors note, this method is relatively weak “in the face of adaptive adversaries.”
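A minimal sketch of the adversarial-training idea mentioned above (assuming a toy logistic-regression model and gradient-sign perturbations; this is not the paper’s exact procedure): craft perturbed versions of the training points, keep their correct labels, and retrain on the augmented set.

```python
import numpy as np

# Adversarial-training sketch: augment the training set with correctly
# labeled adversarial samples and retrain. All values are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, steps=500, lr=0.5):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) > 0).astype(float)

theta = fit(X, y)  # standard training

# Gradient-sign perturbations of the training points, keeping correct labels
grad_X = (sigmoid(X @ theta) - y)[:, None] * theta[None, :]
X_adv = X + 0.3 * np.sign(grad_X)

# Retrain on clean + adversarial samples; the augmented data acts like a
# regularizer that smooths the decision boundary around the training points.
theta_robust = fit(np.vstack([X, X_adv]), np.concatenate([y, y]))
```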
Learning and inferring with privacy
Differential privacy is a framework used by defenders to analyze whether an algorithm provides a given level of privacy. Essentially, this framework treats privacy as the “property that an algorithm’s output does not differ significantly statistically for two versions of the data differing by only one record.” To achieve differential privacy, or most other forms of privacy, some stage of the ML system’s pipeline must be randomized. During the training stage, defenders may inject random noise into the training data itself (as in randomized response) or into the cost function minimized during learning (objective perturbation). To achieve differential privacy during the inference stage, defenders may instead inject noise into the model’s predictions, although this can reduce the accuracy of those predictions. A method for protecting the confidentiality of individual inputs to a model is homomorphic encryption, which encrypts the data in a way that allows the model to process it without decrypting it.
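As a small illustration of the noise-injection idea behind differential privacy (a generic Laplace-mechanism sketch, not one of the paper’s specific constructions), the snippet below adds calibrated noise to an aggregate statistic so that changing any single record cannot noticeably change the released value. The dataset, bounds, and epsilon values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.uniform(0, 100_000, size=1000)  # one record per individual

def private_mean(data, epsilon, lower=0.0, upper=100_000.0):
    """Release the mean with (approximately) epsilon-differential privacy
    via the Laplace mechanism: noise scaled to the query's sensitivity."""
    clipped = np.clip(data, lower, upper)
    sensitivity = (upper - lower) / len(clipped)  # max effect of one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean: ", incomes.mean())
print("eps = 1.0: ", private_mean(incomes, 1.0))
print("eps = 0.1: ", private_mean(incomes, 0.1))  # more privacy, more noise
```

The same trade-off shows up at inference time: stronger privacy (smaller epsilon) means more noise and therefore less accurate outputs.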
Fairness and accountability in ML
Due to the “opaque nature of ML,” there are concerns regarding the fairness and accountability of model predictions, as well as growing legal and policy requirements for companies to explain the predictions made by deployed models to users, officials, and others. Fairness largely refers to the “action taken in the physical domain” based on a model’s prediction, and to ensuring that the prediction, and thus the action, does not discriminate against certain individuals or groups. Often, concerns center on the model’s training data, which can be a source of bias that leads to a lack of fairness. The learning algorithm can also be a source of bias if it’s adapted to benefit a specific subset of the training data. One method for achieving fairness, illustrated by Edwards et al., is to have a model learn in “competition with an adversary trying to predict the sensitive variable from the fair model’s prediction.” Accountability refers to the ability to explain a model’s predictions based on its internals. One way of achieving accountability is to measure the “influence of specific inputs on the model output,” called quantitative input influence by Datta et al. Another is to identify the specific inputs a model is most sensitive to; for neural networks, activation maximization can synthesize inputs that activate specific neurons. Unfortunately, the authors note, methods for achieving accountability and fairness may open the ML model to more sophisticated attacks, since they provide an adversary with information about the model’s internals. However, these methods could also increase privacy.
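To illustrate the general idea of measuring how strongly individual inputs drive a model’s output (a simplified permutation-based stand-in, not the quantitative input influence measure of Datta et al.), the sketch below randomizes one feature at a time and records how often the model’s predictions change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simplified input-influence sketch: break one feature at a time by
# permuting it across the dataset and see how often predictions change.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
baseline = model.predict(X)

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perturbed = X.copy()
    X_perturbed[:, j] = rng.permutation(X_perturbed[:, j])  # randomize feature j
    changed = np.mean(model.predict(X_perturbed) != baseline)
    print(f"feature {j}: predictions change on {changed:.0%} of inputs")
```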
7. Conclusions
The primary takeaway of this paper is that it’s essential for the classes of actors involved in an ML-based system’s deployment to characterize the “sensitivity of learning algorithms to their training data” in order to achieve privacy-preserving ML. Further, controlling this sensitivity once models are deployed is essential for securing the model. In particular, the authors note, there must be more research into the “sensitivity of the generalization error” of ML models.
Original paper by Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman: https://ieeexplore.ieee.org/abstract/document/8406613