Beyond Empirical Windowing: An Attention-Based Approach for Trust Prediction in Autonomous Vehicles

🔬 Research Summary by Zhaobo Zheng, a scientist at Honda Research Institute USA, Inc.

[Original paper by Minxue Niu, Zhaobo Zheng, Kumar Akash, and Teruhisa Misu]

Overview: Trust in autonomous driving is critical for user experience and system efficiency. This paper uses a selective windowing attention network to estimate user trust in autonomous driving. The model can also identify and visualize the scenarios most closely related to changes in user trust.

Introduction

SWAN: A New Way to Understand Human Trust in Autonomous Vehicles (AV)

Do you want to know how you feel about autonomous driving? Do you want to improve your human-machine interaction design? Do you want to leverage the power of attention mechanisms to analyze long time-series data?

If you answered yes to any of these questions, then have a look at SWAN: a Selective Windowing Attention Network. SWAN is a novel neural network model that can estimate human trust in AV from multimodal signals, such as speech, facial expressions, and physiological data.

Unlike traditional windowing techniques, which require manual tuning and domain knowledge, SWAN automatically selects the most relevant data intervals for trust prediction. It uses window prompts and masked attention transformation to focus on the critical spans where trust changes while ignoring irrelevant or noisy parts.

SWAN has been tested on a new multimodal driving simulation dataset where it outperformed existing baselines, such as CNN-LSTM and Transformer, by a large margin. SWAN also showed robustness across different windowing ranges, demonstrating its flexibility and adaptability.

SWAN offers a new approach to human state estimation. By showing which intervals the model attends to, it lets you visualize the underlying nuances of human trust.

Key Insights

Trust is an important factor that affects how humans interact with machines, especially in safety-critical domains like AVs. However, trust is a gradual state that changes over time, and it is difficult to label and analyze long time-series data that capture trust variations.

One common technique to deal with long time-series data is windowing, which divides the data into fixed-size, overlapping segments and applies a model to each segment. However, windowing has some drawbacks, such as:

The model’s performance depends on the window size, which requires manual tuning and domain knowledge.
The window size is fixed, which may not capture the dynamic nature of trust changes.
The windowing process may introduce noise or loss of information.
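To make the drawbacks above concrete, here is a minimal sketch of conventional fixed-size windowing. The function name `sliding_windows` and the parameters `window_size` and `stride` are illustrative choices, not taken from the paper.

```python
def sliding_windows(signal, window_size, stride):
    """Split a sequence into fixed-size segments; they overlap when stride < window_size."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    # Only windows that fit entirely inside the signal are kept.
    for start in range(0, len(signal) - window_size + 1, stride):
        windows.append(signal[start:start + window_size])
    return windows


signal = list(range(10))  # toy stand-in for one sensor channel

# Overlapping windows: size 4, moved 2 samples at a time.
overlapping = sliding_windows(signal, window_size=4, stride=2)

# Non-overlapping windows: the last two samples (8 and 9) fit in no window.
disjoint = sliding_windows(signal, window_size=4, stride=4)
```

With these toy values, `overlapping` holds four segments and `disjoint` holds two. The second call silently drops the trailing samples, a small instance of the information loss mentioned above, and both results depend entirely on the hand-picked `window_size`.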

To overcome these limitations, the paper introduces a Selective Windowing Attention Network (SWAN), a neural network model that can automatically select the most relevant data segments for trust prediction using attention mechanisms.

SWAN consists of three main components:

A window prompt generator creates a set of window prompts representing different input data segments with varying lengths and positions.
A masked attention transformer computes the attention scores between the window prompts and the input data and selects the most informative segments based on the scores.
A trust predictor aggregates the selected segments and outputs a trust score for the whole input data.
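The three components can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names, the fixed `query` and `readout` vectors (which would be learned in a real model), the soft weighting of windows, and the sigmoid output are all hypothetical choices made for this example.

```python
import math


def window_prompts(seq_len, lengths, stride):
    """Candidate (start, end) windows with varying lengths and positions."""
    prompts = []
    for length in lengths:
        for start in range(0, seq_len - length + 1, stride):
            prompts.append((start, start + length))
    return prompts


def masked_attention(query, features, start, end):
    """Scaled dot-product attention restricted to time steps in [start, end)."""
    dim = len(query)
    scores = []
    for t, x in enumerate(features):
        if start <= t < end:
            scores.append(sum(q * v for q, v in zip(query, x)) / math.sqrt(dim))
        else:
            scores.append(-math.inf)  # masked step: receives zero weight
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    summary = [sum(w * x[d] for w, x in zip(weights, features)) for d in range(dim)]
    return weights, summary


def predict_trust(features, query, readout, lengths=(2, 4), stride=1):
    """Score each window, softly weight the windows, and return a trust value in (0, 1)."""
    prompts = window_prompts(len(features), lengths, stride)
    summaries, window_scores = [], []
    for start, end in prompts:
        _, summary = masked_attention(query, features, start, end)
        summaries.append(summary)
        window_scores.append(sum(q * s for q, s in zip(query, summary)))
    peak = max(window_scores)
    exps = [math.exp(s - peak) for s in window_scores]
    total = sum(exps)
    window_weights = [e / total for e in exps]
    pooled = [sum(w * s[d] for w, s in zip(window_weights, summaries))
              for d in range(len(query))]
    logit = sum(r * p for r, p in zip(readout, pooled))
    trust = 1.0 / (1.0 + math.exp(-logit))
    best = prompts[window_weights.index(max(window_weights))]
    return trust, best, window_weights


# Toy input: 8 time steps of 2-dimensional features with an "event" at steps 4 and 5.
features = [[0.0, 0.1]] * 4 + [[2.0, 0.1], [2.0, 0.1]] + [[0.0, 0.1]] * 2
query = [1.0, 0.0]    # hypothetical; learned in a real model
readout = [1.0, 0.0]  # hypothetical; learned in a real model

trust, best, window_weights = predict_trust(features, query, readout)
step_weights, _ = masked_attention(query, features, 4, 6)
```

On this toy input, the highest-weighted window is `(4, 6)`, the span that contains the event, and time steps outside a window receive exactly zero attention weight. Inspecting `window_weights` is the kind of view that makes the selected segments visible.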

The paper evaluates SWAN on a new multimodal driving simulation dataset, in which participants interacted with an AV system and reported their trust levels. The dataset contains speech, facial, and physiological signals, as well as contextual information such as driving scenarios and events.

The paper compares SWAN with several baselines, including:

A CNN-LSTM model that applies a convolutional neural network (CNN) and a long short-term memory network (LSTM) to the whole input data.
A Transformer model that applies a transformer network to the whole input data.
A windowing-based model that applies a CNN-LSTM model to each window segment, and uses an empirical method to select the optimal window size.
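The summary does not say which empirical method the windowing baseline uses to pick its window size. One common approach, assumed here purely for illustration, is a grid search over candidate sizes scored on validation data. In the sketch below, `validation_accuracy` is a toy stand-in for training and validating a real model, and the sequences and labels are invented.

```python
def windowed_means(signal, size):
    """Mean of each non-overlapping window of the given size."""
    return [sum(signal[i:i + size]) / size
            for i in range(0, len(signal) - size + 1, size)]


def validation_accuracy(size, sequences, labels, threshold=0.5):
    """Toy stand-in for a model: predict 1 if any window mean exceeds the threshold."""
    correct = 0
    for signal, label in zip(sequences, labels):
        predicted = int(max(windowed_means(signal, size)) > threshold)
        correct += int(predicted == label)
    return correct / len(labels)


def select_window_size(candidates, sequences, labels):
    """Grid search: evaluate every candidate size and keep the best-scoring one."""
    scores = {size: validation_accuracy(size, sequences, labels) for size in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Invented data: label 1 = sustained 3-sample event, label 0 = isolated spikes.
sequences = [
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
]
labels = [1, 0, 1, 0]

best_size, scores = select_window_size([1, 3, 6], sequences, labels)
```

Here a window of 3 samples separates the two classes perfectly, while windows of 1 and 6 samples each reach only 50% accuracy. That sensitivity to a hand-searched hyperparameter is what selective windowing is meant to remove.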

The paper shows that SWAN outperforms the baselines in trust prediction accuracy and remains robust across different windowing ranges. It also provides qualitative analysis and visualization of the attention scores and the selected segments, which shed light on trust dynamics and the factors that influence trust.

The paper demonstrates that SWAN is a novel and effective method for trust estimation in AVs. It can be extended to other applications that involve human state modeling from long time-series data.

Between the lines

This paper may pave the way for a universal model for cognitive state detection from multimodal signals. Such a model may also provide interpretability into what triggers cognitive state changes.