🔬 Column by Julia Anderson, a writer and conversational UX designer exploring how technology can make us better humans.
Part of the ongoing Like Talking to a Person series
Conversations are collaborative. Sharing information from one party to another is a fundamental practice in communication. When it comes to conversational AI, however, people are often limited to a single voice assistant, or agent, for completing a task. As smart as some agents are, it is rare that multiple agents can join forces to help a person during a single conversation.
Interoperability, or building voice services that are compatible with those made by other providers, is a significant conversational AI challenge. Imagine planning a vacation and using voice assistants or bots to complete logistical tasks, where each bot specializes in something different. Perhaps one can book a flight, but another is needed for reserving dinner and yet another for travel insurance. Rather than repeating the same credentials to every bot, certain information could be shared with each one during the same conversation. This saves you time and allows each bot to specialize in what it does best.
Such seamless transfer of information is much easier said than done. As voice becomes a primary mode of interaction across devices, the question arises of how to make conversations more inclusive while maintaining user privacy.
Lay of the Land
As with conversation itself, interoperability is a challenge because it requires interaction at various levels to ensure that a message comes across clearly. For developers, it means understanding how others are creating conversational AI technology and whether those protocols allow sharing with other systems. For designers, it means certain conversations may or may not be possible depending on which device(s) a user is interfacing with. More uncertainty arises when it comes to testing, deploying and maintaining the voice agent.
Amazon launched its Voice Interoperability Initiative in 2019, citing many of these challenges. The initiative’s vision touts customer choice, technological integration and security. Ideally, any agent a person wants to interact with could be available, regardless of which device is used. A user could then invoke, or request, any agent to collaborate with another agent. This is particularly useful if one agent specializes in a task that the other doesn’t, like in the vacation planning example.
Beyond the technological infrastructure needed to accomplish this is the issue of the “trust gap.” Users may be concerned their information will be shared indiscriminately with other platforms so a bot can complete a task. However, transparency measures that let users know when they are talking to a new agent enable people to choose when and how to share information. Through onboarding or a companion app, device makers could educate users on the availability of other agents, giving the customer the freedom to choose between preferred services.
The Open Voice Network (OVON) is exploring the limits of interoperability and the complexities of discovering agents and apps through voice, a medium that is more or less invisible. In a future where more users are surfing the web by voice, the ability to jump from one specialized agent to another becomes more crucial. OVON recognizes that the current economics of voice assistants, like the immense investments behind a handful of commercial products, may prevent widespread agent-to-agent interoperability. However, the benefits of standardized protocols could create more efficient sharing of languages and data formats, which facilitates innovation.
While it may be to a customer’s benefit to choose which agent to engage, the incentives for a company to allow this are less compelling. Technology businesses have recently been accused of self-preferencing, promoting their proprietary products or services over others. Such antitrust issues are not unique to voice assistants; however, they demonstrate the difficulty of balancing user experience and privacy with interoperability.
Sonos recently developed speakers that could use multiple voice assistants concurrently; however, Google explained that there were contractual issues with mixing and matching competing assistants. While this may be simpler to resolve on mobile phones, it is difficult for other mediums, like smart speakers, to mesh assistants. Recently, Google announced the end of Conversational Actions, or voice apps, with a shift toward commanding Android apps instead. The new focus on integrating voice on Android devices helps developers by letting them build experiences within one ecosystem instead of many.
Colang, a conversational AI modeling language, is meanwhile tackling the challenges of live interoperability. The ability for multiple bots to pass control and context to one another could vastly improve user experience. By integrating conversational AI components into existing systems, developers can reuse them efficiently rather than building them from scratch.
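To make the idea of passing control and context concrete, here is a minimal Python sketch of an agent-to-agent handoff. All names (`Context`, `FlightBot`, `DinnerBot`, `converse`) are hypothetical and illustrative; this is not Colang syntax or any real assistant API, just the underlying pattern: specialized bots take turns while sharing one conversation state instead of re-asking the user.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Shared conversation state handed from one specialized bot to the next."""
    user_name: str
    slots: dict = field(default_factory=dict)

class FlightBot:
    def handle(self, ctx: Context) -> str:
        # Fills a slot that later agents can reuse.
        ctx.slots["flight"] = "NYC->LIS, June 3"
        return f"Booked a flight for {ctx.user_name}."

class DinnerBot:
    def handle(self, ctx: Context) -> str:
        # Reuses the destination from the previous agent's slot
        # rather than asking the user again.
        city = ctx.slots["flight"].split("->")[1].split(",")[0]
        ctx.slots["dinner"] = f"table in {city}"
        return f"Reserved a {ctx.slots['dinner']} after your flight."

def converse(ctx: Context, agents) -> list[str]:
    """Pass control (and the same context) through a sequence of agents."""
    return [agent.handle(ctx) for agent in agents]

ctx = Context(user_name="Julia")
replies = converse(ctx, [FlightBot(), DinnerBot()])
```

The design choice worth noticing is that the context object, not the user, carries information across the handoff; this is the gap that repeating credentials to every bot currently fills.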
From User-Friendly to User-Centric
Opportunities are abundant to make interoperability a cornerstone of voice technology. In healthcare, patients juggling several separate platforms could soon use conversational AI bots that handle every part of their care. These complex conversations require conversational bot ensembles that combine specialty bot functions into one fluid experience. Patents targeting multibot architecture strive to simplify this process, which is currently rigid. When it comes to connecting smart devices, the Matter protocol is the industry-unifying standard supported by dozens of major companies. Built upon IP (Internet Protocol), Matter’s mission is interoperability across all smart devices, many of which support voice capabilities.
Legal and technical limitations hinder absolute conversational AI interoperability, and balancing innovation, usability and privacy is always tricky with emerging technology. Still, natural, seamless conversation, regardless of a voice bot’s origin, remains the ideal for those who want to expand the bounds of human-computer interaction.