Top-level summary: This high-level position paper from Google (by Xiao Ma and Ariel Lu) succinctly lays out some of the challenges in designing useful and effective voice-activated AI systems. The authors provide short examples along with references to relevant literature that place these challenges in a socio-technical context. Design challenges abound in any technology, but users hold voice systems to very high expectations because of their increasing ubiquity and anthropomorphization. The challenges become more important in exploratory search, where questions are open-ended and the user is seeking a set of meaningful responses rather than the single best answer expected of a fact search. Users arrive with preset notions drawn from how they talk to each other in natural language, and they expect a similar experience from the system. When voice is the only possible mode of interaction, such as when driving, it is essential that people can get the results they are seeking quickly, without having to pull out the device and fall back on the traditional visual and touch modalities. Voice interfaces can also usher in novel paradigms of mixed-modal interaction that improve the user experience, such as presenting information partly through visual and partly through voice output. These systems also need to be sensitive to demographic differences in dialects, accents, modes of use, etc. More research is needed on how exploratory search by voice differs from text search, and the research and challenges highlighted in this paper serve as good starting points.
The paper highlights four challenges in designing more “intelligent” voice assistant systems that can respond to exploratory searches, which lack clear, short answers and require nuance and detail. This responds to the rising expectations users have of voice assistants as they become more familiar with them through increased interaction. Voice assistants are primarily used for productivity tasks like setting alarms, calling contacts, etc., and they can be driven by gestural as well as voice-activated commands. Exploratory search is currently not well supported by voice assistants because they take a fact-based approach that aims to deliver a single best response, whereas a more natural approach would be to ask follow-up questions that refine the user’s query until the system can offer a set of meaningful options. Addressing the challenges highlighted in this paper should help the community build more capable voice assistants.
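The contrast between the fact-based approach and refinement through follow-up questions can be made concrete with a small sketch. This is an illustration only and is not taken from the paper: the names `Option`, `single_best`, and `refine`, the restaurant data, and the stopping rule (`max_options`) are all hypothetical, and a real assistant would replace the `answer_for` callback with a spoken dialogue turn.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    attributes: dict  # hypothetical metadata, e.g. {"cuisine": "thai"}


def single_best(options, score):
    """Fact-style search: return only the top-scoring option."""
    return max(options, key=score)


def refine(options, answer_for, max_options=2):
    """Exploratory-style search: ask follow-up questions until the
    candidate set is small enough to present as a short list."""
    candidates = list(options)
    asked = []
    while len(candidates) > max_options:
        # Attributes not yet asked about that still distinguish candidates.
        open_attrs = sorted(
            attr
            for attr in {a for o in candidates for a in o.attributes}
            if attr not in asked
            and len({o.attributes.get(attr) for o in candidates}) > 1
        )
        if not open_attrs:
            break  # nothing left to ask
        attr = open_attrs[0]
        asked.append(attr)
        answer = answer_for(attr)  # stands in for a spoken follow-up question
        if answer is None:
            continue  # no preference given; move on to the next attribute
        narrowed = [o for o in candidates if o.attributes.get(attr) == answer]
        if narrowed:  # ignore answers that would leave no options
            candidates = narrowed
    return candidates, asked


# Made-up example data for illustration.
restaurants = [
    Option("Baan Thai", {"cuisine": "thai", "price": "low"}),
    Option("Siam Garden", {"cuisine": "thai", "price": "high"}),
    Option("Luigi's", {"cuisine": "italian", "price": "low"}),
    Option("Trattoria Roma", {"cuisine": "italian", "price": "high"}),
    Option("Pho Corner", {"cuisine": "vietnamese", "price": "low"}),
]

user_answers = {"cuisine": "thai"}  # simulated replies to follow-ups
shortlist, asked = refine(restaurants, user_answers.get)
print([o.name for o in shortlist])  # ['Baan Thai', 'Siam Garden']
print(asked)                        # ['cuisine']
```

In this sketch a single follow-up question narrows five options to a short list of two, whereas `single_best` would commit to one answer without ever learning what the user actually wanted. How many questions a user will tolerate by voice is exactly the kind of tradeoff the paper leaves open.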
The first challenge the authors present is situationally induced impairments. It highlights the importance of voice-activated commands, which are often used when no alternative way of interacting with the system is available, for example when driving or walking down a busy street. There is a tradeoff to balance between a quick, smooth user experience and the degree of granularity in asking questions and presenting results. We need to be able to quantify this tradeoff relative to achieving the same result through traditional touch-based interaction. Lastly, there is the issue of privacy: such interfaces are often used in public spaces, and individuals may not be comfortable speaking aloud the details needed to refine a search, such as clothing sizes, which they could discreetly type on a screen. Such considerations need to be taken into account when designing the interface and the system.
Mixed-modal interactions combine text, visual inputs and outputs, and voice inputs and outputs. This can be an effective paradigm for countering some of the problems highlighted above while also improving the efficacy of interactions between the user and the system. Further analysis is needed on how users approach text searches compared to voice searches, and whether one is more informational or exploratory than the other.
Designing for diverse populations is crucial because such systems are going to be widely deployed. For example, existing research already highlights how different demographics, even within the same socio-economic subgroup, use voice and text search differently. The system also needs to handle different dialects and accents to function properly, and to be responsive to cultural and contextual cues that might not be pre-built into it. Differing levels of digital and technical literacy also affect how effectively the system can meet the needs of the user.
As expectations of these systems rise over time, driven by their ubiquity and anthropomorphization, a gulf opens between expectations and execution. Users are less forgiving of mistakes made by the system, and designs need to account for this by providing alternate mechanisms through which users can still meet their needs.
In conclusion, designers of voice-activated systems must be especially sensitive to user expectations. With other, more traditional forms of interaction, expectations are set over the course of several uses of the system, whereas with voice systems the user arrives with expectations that closely mimic how people interact with each other using natural language. Addressing the challenges highlighted in this paper will lead to systems that are better able to delight their users and hence gain higher adoption.
Original white paper by Xiao Ma and Ariel Lu from Google: https://arxiv.org/abs/2003.02986