Defining a Research Testbed for Manned-Unmanned Teaming Research

December 7, 2023

🔬 Research Summary by Dr. James E. McCarthy and Dr. Lillian K.E. Asiala.

Dr. McCarthy is Sonalysts’ Vice President of Instructional Systems and has 30+ years of experience developing adaptive training and performance support systems. 

Dr. Asiala is a cognitive scientist and human factors engineer at Sonalysts Inc., with experience in instructional design, human performance measurement, and cognitive research.

[Original papers by Dr. James E. McCarthy, Dr. Lillian K.E. Asiala, LeeAnn Maryeski, and Nyla Warren; see the list below]

  1. McCarthy, J.E., Asiala, L.K.E., Maryeski, L., & Warren, N. (2023). Improving the State of the Art for Training Human-AI Teams: Technical Report #1 — Results of Subject-Matter Expert Knowledge Elicitation Survey. arXiv:2309.03211 [cs.HC]
  2. McCarthy, J.E., Asiala, L.K.E., Maryeski, L., & Warren, N. (2023). Improving the State of the Art for Training Human-AI Teams: Technical Report #2 — Results of Researcher Knowledge Elicitation Survey. arXiv:2309.03212 [cs.HC]
  3. Asiala, L.K.E., McCarthy, J.E., & Huang, L. (2023). Improving the State of the Art for Training Human-AI Teams: Technical Report #3 — Analysis of Testbed Alternatives. arXiv:2309.03213 [cs.HC]

Overview: There is growing interest in using AI to increase the speed with which individuals and teams can make decisions and take action to deliver desirable outcomes. Much of this work focuses on using AI systems as tools. However, we are exploring whether autonomous agents could eventually work with humans as peers or teammates, not merely tools. That research requires a robust synthetic task environment (STE) that can serve as a testbed. The studies summarized here provide a foundation for developing such a testbed.


Introduction

Have you ever found yourself arguing with your GPS?  Have you ever thanked Siri or Alexa?  These examples of personification suggest that it is not far-fetched to believe that intelligent agents may soon be seen as teammates.  We needed to select or build an appropriate testbed to examine fundamental issues associated with “teaming” with AI-based agents.  We established a foundation for this effort by conducting stakeholder surveys and a structured analysis of candidate systems.

Key Insights

1 Step One:  Feature Surveys

Given the focus of our work, we developed separate but overlapping surveys for two stakeholder groups: human-AI teaming researchers and military subject-matter experts. The goal of the surveys was to determine what capabilities stakeholders felt should be included in a testbed.

1.1 Methods

We identified various topics for which stakeholders could provide valuable input and developed a range of open-ended and Likert-style questions from these knowledge elicitation objectives.  Respondents completed the survey electronically.

The analysis of the survey results proceeded in two threads. First, members of the research team conducted independent thematic analyses of the responses to each open-ended question to identify ideas that recurred across answers, even when respondents used different wording. After completing the independent analyses, we met, established a consensus list of themes for each question, and mapped the individual responses to that consensus list. Second, in parallel with this qualitative analysis, we conducted a quantitative analysis of the various Likert-style items.
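Neither report publishes analysis code, but the quantitative thread can be illustrated with a brief sketch. In the Python sketch below, the item names, the 5-point scale, and the coded responses are all invented for illustration; it simply computes per-item summary statistics for Likert-style responses and tallies how often each consensus theme was assigned during coding.

```python
from statistics import mean, median
from collections import Counter

# Hypothetical Likert-style responses (1 = strongly disagree ... 5 = strongly agree),
# keyed by survey item; the real items came from the knowledge elicitation objectives.
likert_responses = {
    "testbed_should_be_open_source": [5, 4, 5, 3, 4],
    "agents_should_fill_varied_roles": [4, 4, 5, 5, 3],
}

# Hypothetical mapping of respondents' open-ended answers to the consensus
# theme list agreed on after the independent thematic analyses.
coded_responses = [
    ("R1", "System Architecture"),
    ("R1", "Autonomy"),
    ("R2", "Teaming"),
    ("R3", "System Architecture"),
]

# Quantitative thread: per-item summary statistics.
for item, scores in likert_responses.items():
    print(f"{item}: n={len(scores)}, mean={mean(scores):.2f}, median={median(scores)}")

# Qualitative thread: how often each consensus theme recurred.
theme_counts = Counter(theme for _, theme in coded_responses)
for theme, count in theme_counts.most_common():
    print(f"{theme}: assigned to {count} response(s)")
```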

1.2 Results

Six themes emerged from the analysis:  

  1. System Architecture
  2. Teaming
  3. Task Domain
  4. Data Collection and Analysis
  5. Autonomy
  6. Ease of Use

One theme that emerged strongly in discussions of the desired system architecture was the need for operational flexibility within the STE. Respondents wanted to be able to modify the STE over time, expressing this in language calling for modularity, open-source development, flexibility, and so forth. Respondents also suggested that we investigate existing STEs that could be used “as-is” or extended to meet particular needs.

The flexibility theme continued when respondents discussed teaming. Desired team sizes ranged from 6 to 12, and respondents wanted to be able to assign humans and agents flexibly across a variety of roles. Regarding the task domain, respondents emphasized the need for sufficient levels of complexity, fidelity, and interdependence to ensure that lab results would transfer to the field.

Data collection and analysis was another topic that several respondents addressed. They wanted the STE instrumented to collect a wide range of data points from which they could create specific metrics. The last two themes focused on having some form of autonomy within the STE and on creating a game-play environment that is easy to learn, including intuitive displays.

2 Step Two:  Testbed Analysis

Several respondents recommended that our team look into existing human-AI teaming testbeds rather than creating something new.  This was surprising because our initial literature review indicated no “consensus” testbed existed, and each lab developed its own.  Nonetheless, we took the recommendation seriously and systematically investigated the associated landscape. 

2.1 Methods

The research team began its analysis of potential testbeds by defining a three-dimensional taxonomy.  We noted that testbeds could be assessed for the level of interdependency they support, their relevance to the likely application environment, and the sophistication of the agents they could house.  
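The reports describe this taxonomy in prose; as a minimal illustration of how it frames each candidate, one might record a testbed’s position on the three dimensions as a simple record. The field names and ordinal values below are our own placeholders, not the reports’ terminology.

```python
from dataclasses import dataclass

@dataclass
class TestbedProfile:
    """A candidate testbed's position on the three taxonomy dimensions."""
    name: str
    interdependency: int       # level of team interdependency supported (e.g., 1 = low, 3 = high)
    relevance: int             # relevance to the likely application environment
    agent_sophistication: int  # sophistication of the agents the testbed could house

# Hypothetical placement of one candidate; the values are invented for illustration.
example = TestbedProfile("BW4T", interdependency=3, relevance=1, agent_sophistication=2)
print(example)
```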

We then used the results of the surveys to identify, define, and weight eight evaluation factors:

  1. Data Collection & Performance Measurement Factors
  2. Implementation Factors
  3. Teaming Factors
  4. Task Features
  5. Scenario Authoring Factors 
  6. Data Processing Factors 
  7. System Architecture Factors
  8. Agent Factors

In parallel with this process, the team conducted a literature review and identified 19 potential testbeds across three categories:  

  1. Testbeds developed specifically to support research on human-AI teaming.
  2. Testbeds built on a foundation provided by commercial games.
  3. Testbeds built on a foundation provided by open-source games.

Using the factor definitions, the research team developed a scoring standard, and two researchers then rated each testbed selected for evaluation. After completing their evaluations, the researchers met to discuss any ratings that differed by more than two points. These discussions aimed to identify cases where the raters may have applied the evaluation criteria differently.
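The reports define the actual rubric; as a rough sketch of the mechanics, the Python snippet below computes a weighted total from two raters’ factor scores and flags discrepant ratings for discussion. The weights and scores are placeholders (only the greater-than-two-point reconciliation threshold comes from the procedure above).

```python
# Illustrative weights for the eight evaluation factors; the actual weights
# were derived from the stakeholder surveys and are not reproduced here.
weights = {
    "data_collection": 0.20, "implementation": 0.10, "teaming": 0.15,
    "task_features": 0.15, "scenario_authoring": 0.10, "data_processing": 0.05,
    "system_architecture": 0.15, "agent_factors": 0.10,
}

# Hypothetical scores from the two independent raters for one candidate testbed.
rater_a = {"data_collection": 7, "implementation": 5, "teaming": 8, "task_features": 6,
           "scenario_authoring": 4, "data_processing": 6, "system_architecture": 9, "agent_factors": 3}
rater_b = {"data_collection": 6, "implementation": 8, "teaming": 7, "task_features": 6,
           "scenario_authoring": 5, "data_processing": 6, "system_architecture": 8, "agent_factors": 4}

# Flag factor ratings that differ by more than two points, per the reconciliation rule.
discrepancies = [f for f in weights if abs(rater_a[f] - rater_b[f]) > 2]
print("Factors needing discussion:", discrepancies)

# Weighted total based on the mean of the two ratings.
total = sum(weights[f] * (rater_a[f] + rater_b[f]) / 2 for f in weights)
print(f"Weighted score: {total:.2f}")
```

Averaging the two raters’ scores before weighting is one reasonable aggregation choice; the team’s actual rule may differ.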

2.2 Results

An initial screening of the 19 testbeds allowed the research team to eliminate nine candidates without detailed review. The researchers then scored and ranked the remaining ten testbeds. The highest-ranked testbeds were:

ASIST Saturn+. This testbed was developed within the Artificial Social Intelligence for Successful Teams (ASIST) program.  ASIST aims to study Artificial Social Intelligence (ASI) as an advisor in all-human teaming. The Saturn+ testbed presented users with urban search and rescue scenarios within Microsoft’s Minecraft gaming environment.

ASIST Dragon. ASIST researchers also developed this testbed. It presents bomb disposal scenarios.

Black Horizon. Sonalysts, Inc. developed Black Horizon to allow students to master orbital mechanics fundamentals. In Black Horizon, each learner plays as a member of a fictional peacekeeping organization. Players learn to control an individual satellite and coordinate satellite formations with different sensor, communication, and weapon capabilities.  

BW4T. Researchers from Delft University of Technology in the Netherlands and the Florida Institute for Human and Machine Cognition developed Blocks World for Teams (BW4T). In it, teams of two or more humans and agents cooperate to move a particular sequence of blocks from rooms in a maze to a drop zone.

SABRE. The Situational Authorable Behavior Research Environment (SABRE) was created to explore the viability of using commercial game technology to study team behavior. SABRE was developed using Neverwinter Nights™, produced by BioWare. Neverwinter Nights is a role-playing game based on Dungeons and Dragons. The team’s task was to search a virtual city (an urban overlay developed for the Neverwinter Nights game) and locate hidden weapons caches while earning (or losing) goodwill with non-player characters.

Between the lines

Considering the results of the survey and analysis, none of the highly rated testbeds was an adequate fit for our needs. The research team opted to deprioritize Black Horizon because it did not provide a proper teamwork environment. SABRE was eliminated because it was not an open-source solution. The two ASIST testbeds were unsuitable because their architectures did not support synthetic teammates. BW4T was removed because it presented security challenges.

Instead, we used the lessons gathered during the review to develop a Concept of Operations for a novel testbed. The envisioned testbed would present team members with a time-sensitive search-and-recovery task in outer space. We are assessing whether we can affordably develop the testbed and release it as an open-source environment.

