LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games

February 5, 2024

🔬 Research Summary by Sahar Abdelnabi, a Ph.D. student at CISPA Helmholtz Center for Information Security, Germany. Her research interests lie in the intersection of machine learning, security, and safety.

[Original paper by Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz]


Overview: We introduce a new benchmark to evaluate Large Language Models (LLMs) both on and through the task of negotiation. We create a testbed of diverse text-based, multi-agent, multi-issue, semantically rich negotiation games with easily tunable difficulty. To solve the challenge, agents need strong arithmetic, inference, exploration, and planning capabilities, and must seamlessly integrate them in a dynamic, interactive dialogue. We show that these games can help evaluate the interaction dynamics between agents in the presence of greedy and adversarial players.


Introduction

LLMs are everywhere! 

We know that LLMs have been primarily trained unsupervised on massive datasets. Despite that, they perform relatively well in setups beyond traditional NLP tasks, such as using tools. LLMs are now used in many real-world applications, and there is an increasing interest in using them as interactive autonomous agents. Given this discrepancy between training paradigms and these new adoptions, how can we build new evaluation frameworks that better match real-world use cases and help us understand and systematically test models’ capabilities, limitations, and potential misuse?

To answer this, we first look for a task that is in high demand. Previous work on reinforcement learning (RL) agents told us to look no further; the answer is negotiation!

Negotiation is a key part of human communication. We use one form of negotiation tactic or another to, e.g., schedule meetings, make a new contract with our phone service provider, or agree on a salary and compensation for a new job. It is also the basis of high-stakes decisions such as peace mediation or loan agreements.

The second motivation behind using negotiation is that it entails many other important sub-tasks. Agents must assess the value of deals according to their own goals, maintain a representation of others’ goals, update this representation based on new observations, plan and adapt their strategies over rounds, weigh different options, and finally find common ground. This requires non-trivial arithmetic and reasoning skills, including Theory-of-Mind (ToM) abilities.

We first leverage an existing, commonly used, scorable role-play negotiation game with multi-party, multi-issue negotiation. To rule out memorization and provide a rich benchmark, we create semantically equivalent games by perturbing the names of parties and issues, and we use an LLM as a seed to design three completely new and diverse games. The difficulty of these games can be easily tuned, yielding a benchmark that is less prone to saturation.

While we found that GPT-4 performs significantly better than earlier models, our analysis goes deeper than that. We may soon have networks of agents, each representing an entity, autonomously communicating to agree on plans. Or, at the very least, it is easy to imagine a company using an LLM in a customer service chatbot; in this scenario, can it be influenced to offer deals that are not in the company’s best interest? To answer this, we study agents’ interactions in unbalanced adversarial settings. We show that agents’ behavior can be modulated to promote greediness or to attack other agents, frequently sabotaging the negotiation and altering other cooperative agents’ behavior as well.

LLMs playing the game: setup and methodology

The game is based on a negotiation role-play exercise that we further adapt by writing our own description. We also created new games by prompting an LLM to generate games with cooperating and competing interests between parties. All games have six parties P={p1,p2,…,p6} and five issues I={A,B,…,E}, with the following dynamics.

Parties. An entity p1 proposes a project (e.g., an airport, a solar power plant, a new sports park, etc.) that it will manage and invest in and whose return on investment it wants to maximize. Another party, p2, provides a budget for the project and has veto power; it usually acts as a middle ground between the different parties. There is also a group of beneficiary parties, opposing parties, and parties calling for constraints (e.g., activists, environmentalists, etc.).

Issues. Parties negotiate over five issues related to the project (e.g., funding, location, revenue, etc.). Each issue has 3-5 sub-options. A deal consists of one sub-option per issue. In our case, the total number of possible deals is 720. 
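
As a quick arithmetic illustration, the size of the deal space is simply the product of the per-issue sub-option counts. The counts in the sketch below are hypothetical; the summary only states that each of the five issues has 3-5 sub-options and that the total is 720.

```python
from math import prod

# Hypothetical per-issue sub-option counts (the game only specifies 3-5 per issue);
# chosen here so that the product matches the stated total of 720 deals.
sub_options = {"A": 3, "B": 4, "C": 4, "D": 5, "E": 3}

total_deals = prod(sub_options.values())
print(total_deals)  # 3 * 4 * 4 * 5 * 3 = 720
```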

Scoring. Each party has its own secret scoring system for the sub-options, with scores proportional to the value and priority it assigns to them. These scores differ between parties.

Feasible solutions. Each party has a minimum threshold for acceptance; in negotiation terms, this is known as the “Best Alternative To a Negotiated Agreement” (BATNA). A deal is feasible if it exceeds the thresholds of at least five parties, which must include the project’s proposer p1 and the veto party p2. These factors restrict the set of feasible deals and provide a way to quantify the success of reaching an agreement. They also control the game’s difficulty by increasing or decreasing the size of the feasible set.
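
To make the scoring and feasibility rule concrete, here is a minimal sketch under assumed data structures: a deal maps each issue to one chosen sub-option, and each party carries a secret score table plus a BATNA-style threshold. The names Party, deal_value, and is_feasible are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Party:
    name: str
    scores: dict[tuple[str, str], int]  # (issue, sub_option) -> secret points
    threshold: int                      # minimum acceptable total (BATNA)

def deal_value(party: Party, deal: dict[str, str]) -> int:
    """Value of a deal (one sub-option per issue) under one party's secret scores."""
    return sum(party.scores[(issue, option)] for issue, option in deal.items())

def is_feasible(deal: dict[str, str], parties: list[Party],
                proposer: Party, veto: Party, quorum: int = 5) -> bool:
    """Feasible if at least `quorum` parties reach their thresholds,
    and the proposer (p1) and veto party (p2) are both among them."""
    satisfied = [p for p in parties if deal_value(p, deal) >= p.threshold]
    return len(satisfied) >= quorum and proposer in satisfied and veto in satisfied
```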

Interaction. Each agent is fed its initial prompt containing the project’s description, the issues, the parties, its secret scores, an incentive (cooperation, greediness, or sabotaging the negotiation), and the most recent conversation history. In each round, one agent is picked randomly and prompted to propose a deal. At the end of the negotiation, the leading party proposes a final deal, which determines whether an agreement is reached.

Prompting strategy. We use structured zero-shot Chain-of-Thought (CoT) prompting. Agents are instructed to 1) observe and predict what other parties might prefer, 2) explore and find deals that meet their scores and might be accepted by others, and 3) plan the next steps.
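
A highly simplified sketch of one negotiation run under this setup is shown below. Here, query_llm stands in for whatever model API is used, and the round count, history window, and prompt wording are assumptions rather than the paper’s exact protocol.

```python
import random

COT_TEMPLATE = """{game_description}

You are {party}. Your secret scores, minimum threshold, and incentive:
{secret_info}

Most recent conversation:
{history}

Before answering: 1) observe and predict what the other parties might prefer,
2) explore deals that meet your own minimum score and might be accepted by others,
3) plan your next steps. Then propose a deal (one sub-option per issue)."""

def run_negotiation(agents, game_description, query_llm,
                    n_rounds=24, history_window=6):
    history = []
    for _ in range(n_rounds):
        agent = random.choice(agents)  # one randomly picked agent per round
        prompt = COT_TEMPLATE.format(
            game_description=game_description,
            party=agent["name"],
            secret_info=agent["secret_info"],
            history="\n".join(history[-history_window:]),
        )
        history.append(f'{agent["name"]}: {query_llm(prompt)}')
    # The leading party (p1, assumed to be agents[0]) proposes the final deal,
    # which decides whether an agreement is reached.
    p1 = agents[0]
    final_prompt = COT_TEMPLATE.format(
        game_description=game_description,
        party=p1["name"],
        secret_info=p1["secret_info"],
        history="\n".join(history[-history_window:]),
    )
    return query_llm(final_prompt)
```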

Metrics. We design the game to make performance easy to measure and quantify. We measure whether deals made by p1 would lead to an agreement (either 5-way or 6-way). To evaluate arithmetic skills and whether agents followed instructions, we measure whether deals suggested by agents meet their respective minimum thresholds. To evaluate ToM, we prompt each agent to provide a “best guess” of each party’s preferred sub-option under each issue. We also calculate the collective score (the average score over all agents) of the deals an agent suggests.
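
The quantitative checks above can be scripted on top of the hypothetical deal_value and is_feasible helpers from the earlier feasibility sketch (the ToM probe, which compares each agent’s guesses against the true secret preferences, is omitted here). Again, the function names are illustrative.

```python
def agreement_reached(final_deal, parties, proposer, veto, n_way=5) -> bool:
    """Does the final deal proposed by p1 yield an n-way (5- or 6-way) agreement?"""
    return is_feasible(final_deal, parties, proposer, veto, quorum=n_way)

def meets_own_threshold(agent, suggested_deal) -> bool:
    """Arithmetic / instruction-following check: an agent should only suggest
    deals that reach its own minimum threshold."""
    return deal_value(agent, suggested_deal) >= agent.threshold

def collective_score(parties, suggested_deal) -> float:
    """Average value of one agent's suggested deal across all agents."""
    return sum(deal_value(p, suggested_deal) for p in parties) / len(parties)
```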

LLMs playing the game: main results

  • GPT-4 is significantly better than GPT-3.5. 

GPT-3.5 agents often violated their minimum threshold rule, indicating a lack of arithmetic skills. They also had a much lower success rate in finding an agreement.

  • The structure of the prompt (i.e., decomposing the task) helps. 

This is true, especially for planning and considering others’ preferences.

  • The benchmark is not solved yet. 

The performance decreases when the game’s difficulty increases (Table 2). 

  • The behavior generalizes to new games. 

These games have different semantics and patterns. Even though they have a comparable number of feasible solutions, games 1 and 2 can be harder because their scores are less sparse (most agents have non-neutral preferences on most issues).

  • Agents’ actions are consistent with their incentives (e.g., cooperative, greedy).

Greedy agents suggested deals that scored higher for themselves.

Saboteur agents suggested deals that scored lower for the group.

Saboteur agents targeting a specific agent suggested deals that scored lower for that agent. Notably, agents were only given incentives, not instructions on which deals to propose.

  • These actions affected the group.

The overall success rate in reaching an agreement decreased.

Between the lines

The main conclusion of this work is that these games provide a new, diverse benchmark for testing LLMs in complex interactive environments. We also highlight that these attacks are practically relevant when LLMs are used in real-world negotiation setups. Besides negotiation, the sub-tasks involved in our benchmark are important to evaluate for many applications; for example, when asking an agent to book a flight for you, the agent needs to prioritize flights based on your budget, preferences, etc.

Additionally, our work opens the door for more research into attacks in multi-agent setups and possible defenses that detect attacks or enforce penalties on detected attackers to limit the adversary’s capabilities. It would also be interesting, in the future, to study setups that involve more complex and strategic multi-turn planning, e.g., players could send private messages to others to form alliances or break commitments.
