🔬 Research Summary by Max Krueger, a consultant at Accenture with an interest in both the long and short-term implications of AI on society.
[Original paper by Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, Debdeep Mukhopadhyay]
Overview: Deep learning systems are increasingly susceptible to adversarial threats. As a result, it is imperative to evaluate methods for adversarial robustness.
With the widespread adoption of deep learning systems across society, their security poses a significant concern. Attacks on these systems can produce catastrophic failures as communities increasingly rely on machine intelligence to drive decision-making. This paper provides a detailed discussion of the types of attacks deep learning systems face and of potential defenses against such attacks. “Adversaries can craftily manipulate legitimate inputs, which may be imperceptible to the human eye,” the report states. In other words, these attacks are hard to detect and carry severe consequences.
A machine learning system can be broadly decomposed into four principal components: data collection, data transfer, data processing by a machine learning model, and action taken based on an output. This pipeline represents the attack surface on which an adversary may mount an attack. The authors identify three primary attack vectors:
- Evasion attack – The adversary tries to evade the system by adjusting malicious samples during the testing phase. Evasion attacks are the most common type of attack.
- Poisoning attack – The adversary attempts to inject crafted data samples to poison the system, compromising the entire learning process. Poisoning attacks take place during model training.
- Exploratory attack – The adversary attacks a black-box model to learn as much about the underlying design as possible.
The capabilities of an adversary depend on the amount of knowledge available at the time of the attack. For example, if the attacker has access to the underlying dataset, the adversary can execute training-phase attacks such as injecting corrupt samples into the training dataset, modifying the training data, or logic corruption – tampering with the learning algorithm itself.
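One simple way to "modify the training data," as described above, is label flipping. The sketch below is illustrative, not from the paper; `flip_labels` is a hypothetical helper that corrupts a chosen fraction of training labels before the model ever sees them.

```python
import numpy as np

def flip_labels(y, fraction=0.2, n_classes=2, seed=0):
    """Data-poisoning sketch (hypothetical helper): flip a fraction of the
    training labels to a different class, corrupting the learning process."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    # Pick a random subset of indices to poison
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    # Shift each selected label to another class
    y_poisoned[idx] = (y_poisoned[idx] + 1) % n_classes
    return y_poisoned
```

A model trained on the poisoned labels inherits the corruption, which is why poisoning compromises the entire learning process rather than a single prediction.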
Adversaries may not have access to the underlying dataset but may have access to the model during the testing phase. Attacks in this phase are determined by the adversary’s knowledge of the underlying model and its parameters. In a white-box attack, “an adversary has total knowledge about the model,” including the training data distribution, the complete model architecture, and its hyperparameters. The adversary then alters an input to obtain a specific output. White-box attacks are highly effective because inputs are crafted specifically for the known model architecture.
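A classic white-box technique in this family is the fast gradient sign method (FGSM) of Goodfellow et al. The sketch below applies the idea to a plain logistic-regression model in numpy; the function name and parameters are illustrative assumptions, not the paper's code. Because the attacker knows the weights, it can compute the loss gradient with respect to the input and step along its sign.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y_true, w, b, eps=0.3):
    """White-box evasion sketch in the style of FGSM on logistic regression.
    For logistic loss, the gradient with respect to the input is (p - y) * w,
    so the attack adds eps * sign of that gradient to the input."""
    p = sigmoid(x @ w + b)          # model's current confidence
    grad_x = (p - y_true) * w       # d(loss)/dx for logistic loss
    return x + eps * np.sign(grad_x)
```

A small perturbation bounded by `eps` in each coordinate can be enough to flip the model's prediction while remaining nearly imperceptible to a human observer.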
The primary objective of a black-box attack is to train a local model to help craft malicious attacks on the target model. A black-box attack assumes no knowledge of the target. The authors classify black-box attacks into three categories:
- Non-adaptive black-box attack – The adversary can only access the target model’s training data distribution. The adversary trains a local model on data from that distribution, labeled by the black-box model’s outputs. The adversary can then attack the local model with white-box methods and send the crafted inputs to the target model for exploitation.
- Adaptive black-box attack – Like a non-adaptive attack, but the adversary does not know the training data distribution. The authors state, “The adversary issues adaptive oracle queries to the target model and labels a carefully selected dataset.” This dataset is then used to train a local model and craft malicious inputs.
- Strict black-box attack – Like the adaptive black-box attack, but the adversary cannot adaptively alter the inputs to observe the corresponding changes in output; it can only collect input–output pairs from the target.
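The substitute-model strategy behind these black-box attacks can be sketched in a few lines. Everything here is an illustrative assumption rather than the paper's code: `oracle` stands in for the black-box target, the queries are random rather than "carefully selected," and the local model is a simple logistic regression fit by gradient descent.

```python
import numpy as np

def train_substitute(oracle, dim, n_queries=500, lr=0.5, epochs=200, seed=0):
    """Black-box attack sketch: query the target model as a labeling oracle
    on synthetic inputs, then fit a local logistic-regression substitute.
    The substitute can then be attacked with white-box methods and the
    crafted inputs transferred to the target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_queries, dim))   # synthetic query points
    y = oracle(X).astype(float)             # labels obtained from the target
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):                 # full-batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b
```

When the substitute agrees closely with the target, adversarial inputs crafted against the substitute tend to transfer, which is what makes black-box attacks practical.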
An adversary’s goals are motivated by what they may gain from an attack. The authors illustrate four primary goals an adversary may have when attacking a model: reducing prediction confidence, causing misclassification, forcing outputs into a particular target class (targeted misclassification), and mapping a specific source input to a specific target output (source/target misclassification). The adversary’s end goal influences the type of attack used during exploitation.
Defending against such attacks is extremely difficult, and current defenses lack robustness against multiple attack types. Several defense mechanisms are available to the security practitioner, such as adversarial training, gradient hiding, and defensive distillation, to name a few. The paper presents these mechanisms along with the logic behind them. The primary takeaway is that no single defense mechanism can stop all attacks, and many current defenses are easily fooled. Given the relative newness of modern machine learning and its rapid deployment in new environments, stopping these attacks will remain challenging for years to come.
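Adversarial training, one of the defenses the paper discusses, amounts to training on worst-case perturbed inputs rather than clean ones. The sketch below is a minimal illustration on a logistic model, assuming an FGSM-style sign-of-gradient inner attack; the helper name and hyperparameters are assumptions, not the paper's method.

```python
import numpy as np

def adversarial_training_step(w, b, X, y, eps=0.1, lr=0.1):
    """Adversarial-training sketch: perturb each input along the sign of the
    loss gradient (an FGSM-style inner attack), then take a gradient step on
    the perturbed batch instead of the clean one."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    X_adv = X + eps * np.sign((p - y)[:, None] * w)   # worst-case inputs
    p_adv = 1.0 / (1.0 + np.exp(-(X_adv @ w + b)))
    w_new = w - lr * X_adv.T @ (p_adv - y) / len(y)   # update on adversarial batch
    b_new = b - lr * (p_adv - y).mean()
    return w_new, b_new
```

Training against one perturbation budget does not guarantee robustness to others, which reflects the paper's takeaway that no single defense stops all attacks.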
Between the lines
We all know that cybersecurity is a big issue and an even bigger business. This paper demonstrates that the security of machine intelligence is an open and pressing question. As deep learning becomes increasingly embedded in our everyday lives, we should be genuinely concerned about protecting these systems. It doesn’t take a big imagination to envision how one might exploit them to cause extreme harm. A focus on developing algorithms with built-in adversarial robustness will mitigate the consequences of such attacks. It would also be wise to create AI red teams to test the robustness of algorithms pre- and post-deployment. Deep learning is an educational process for us all, and security should be on our radar moving forward.