Google DeepMind's Research Lets an LLM Rewrite Its Own Game Theory Algorithms — And It Outperformed the Experts
By Michal Sutter – April 3, 2026
Introduction
Game theory plays a crucial role in understanding strategic interactions among rational decision-makers. Designing algorithms for Multi-Agent Reinforcement Learning (MARL) in imperfect-information games, such as poker, has traditionally been a manual process of iteration and intuition. Researchers at Google DeepMind have now introduced an approach in which a large language model (LLM) autonomously rewrites the source code of game theory algorithms, producing variants that outperform hand-designed baselines on benchmark games.
Background: Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO)
Two established paradigms in game theory are Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO).
Counterfactual Regret Minimization (CFR)
CFR is an iterative algorithm that minimizes regret across information sets. In each iteration, it computes ‘counterfactual regret’: how much a player could have gained by choosing a different action at each information set. As iterations accumulate, the players’ average strategy profile converges to a Nash Equilibrium (NE) in two-player zero-sum games. Variants of CFR, such as Discounted CFR (DCFR) and Predictive CFR+ (PCFR+), improve convergence through specific discounting and predictive update rules.
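To make the regret-minimization idea concrete, here is a minimal sketch using one-shot Rock-Paper-Scissors: the game has a single decision point, so full CFR reduces to plain regret matching, and the average strategy converges to the uniform Nash equilibrium. This is an illustration of the underlying principle, not DeepMind's implementation.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Map cumulative regrets to a strategy: play in proportion to positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret yet: fall back to the uniform strategy.
    return np.ones_like(cumulative_regret) / len(cumulative_regret)

def cfr_rps(iterations=50000, seed=0):
    """Self-play regret minimization for Rock-Paper-Scissors."""
    # Payoff matrix for player 0: rows = own action, cols = opponent action.
    payoff = np.array([[0, -1, 1],
                       [1, 0, -1],
                       [-1, 1, 0]], dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial regrets so the run does not start exactly at equilibrium.
    regret = [rng.random(3), rng.random(3)]
    strategy_sum = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        s = [regret_matching(regret[0]), regret_matching(regret[1])]
        for p in range(2):
            m = payoff if p == 0 else -payoff.T   # player 1 sees the negated game
            action_values = m @ s[1 - p]          # value of each pure action
            ev = s[p] @ action_values             # value of the current mix
            # Instantaneous regret: gain from switching to each pure action.
            regret[p] += action_values - ev
            strategy_sum[p] += s[p]
    # The *average* strategy is what converges to the Nash equilibrium.
    return [ss / ss.sum() for ss in strategy_sum]
```

Running this, both players' average strategies approach the uniform mixture (1/3, 1/3, 1/3), even though the iterates themselves keep cycling.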
Policy Space Response Oracles (PSRO)
PSRO operates at a higher level of abstraction, maintaining a population of policies for each player. It constructs a payoff tensor representing the meta-game by computing expected utilities for every combination of policies. A meta-strategy solver then produces a probability distribution over each player's policies, and a new best-response policy is trained against that distribution and added to the population.
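The meta-game construction can be sketched in a few lines. In this sketch, `evaluate` is a hypothetical callback that plays two policies against each other and returns both players' expected utilities; the uniform meta-solver shown is the simplest possible choice (with it, PSRO essentially reduces to fictitious play), not the solver used in the paper.

```python
import itertools
import numpy as np

def build_meta_game(populations, evaluate):
    """Build the meta-game payoff tensor: entry [i, j] holds both players'
    expected utilities when player 0 uses policy i and player 1 uses policy j."""
    n0, n1 = len(populations[0]), len(populations[1])
    tensor = np.zeros((n0, n1, 2))
    for i, j in itertools.product(range(n0), range(n1)):
        tensor[i, j] = evaluate(populations[0][i], populations[1][j])
    return tensor

def uniform_meta_solver(tensor):
    """Simplest meta-strategy solver: a uniform mixture over each population."""
    n0, n1, _ = tensor.shape
    return np.ones(n0) / n0, np.ones(n1) / n1
```

Each PSRO iteration would rebuild this tensor with the enlarged populations, re-solve the meta-game, and train a best response against the resulting mixture.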
The AlphaEvolve Framework
AlphaEvolve is a pioneering framework that employs LLMs to automate the coding process for game theory algorithms. Instead of manually designing algorithms, AlphaEvolve utilizes an evolutionary coding agent to explore and mutate source code.
Process Overview
The process begins by initializing a population of algorithms from standard implementations: CFR+ serves as the seed for the CFR experiments, and a uniform meta-strategy solver seeds the PSRO experiments. Each generation, a parent algorithm is selected according to its fitness, and the LLM (Gemini 2.5 Pro) modifies its source code. The modified candidates are evaluated on proxy games, and valid candidates are incorporated into the population.
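The loop above can be sketched as follows. Here `mutate` stands in for the LLM rewrite step (Gemini 2.5 Pro in the actual system), `evaluate` stands in for running a candidate on proxy games, and the greedy parent choice is a simplification; AlphaEvolve samples parents stochastically from the population.

```python
def evolve(seed_algorithms, mutate, evaluate, generations=100):
    """Minimal sketch of an AlphaEvolve-style evolutionary loop.

    `evaluate` returns a fitness score, or None if the candidate is invalid
    (e.g. it crashes or times out). Seeds are assumed to be valid.
    """
    population = [(src, evaluate(src)) for src in seed_algorithms]
    for _ in range(generations):
        # Greedy parent choice keeps this sketch short; the real system
        # samples parents stochastically, guided by fitness metrics.
        parent_src, _ = max(population, key=lambda p: p[1])
        child = mutate(parent_src)
        fitness = evaluate(child)
        if fitness is not None:  # discard candidates that fail evaluation
            population.append((child, fitness))
    return max(population, key=lambda p: p[1])
```

As a toy usage example, "algorithms" can be numbers, mutation a small random perturbation, and fitness the distance to a target: the loop then behaves like a simple hill climber.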
Multi-Objective Optimization
AlphaEvolve supports multi-objective optimization, allowing for the definition of multiple fitness metrics. Each generation randomly selects one metric to guide parent sampling. The primary fitness signal used is negative exploitability after a set number of iterations, evaluated on a fixed set of training games, including:
- 3-player Kuhn Poker
- 2-player Leduc Poker
- 4-card Goofspiel
- 5-sided Liar's Dice
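The fitness signal, negative exploitability, is easy to state for a two-player zero-sum matrix game: it measures how much each player could gain by best-responding to the other's strategy, and it is zero exactly at a Nash equilibrium. The sketch below illustrates the metric on a matrix game; the paper computes it on the sequential games listed above.

```python
import numpy as np

def exploitability(payoff, s0, s1):
    """NashConv for a two-player zero-sum matrix game.

    `payoff` holds player 0's utilities; player 1's utilities are their
    negation. Returns the total best-response gain of both players.
    """
    v = s0 @ payoff @ s1               # player 0's current value
    br0 = np.max(payoff @ s1)          # player 0's best-response value
    br1 = np.max(-(s0 @ payoff))       # player 1 maximizes the negated game
    return (br0 - v) + (br1 + v)       # player 1's current value is -v

def fitness(payoff, s0, s1):
    """AlphaEvolve-style fitness: negative exploitability (higher is better)."""
    return -exploitability(payoff, s0, s1)
```

For Rock-Paper-Scissors, the uniform profile has exploitability 0, while always playing rock against a uniform opponent is exploitable for exactly 1 (the opponent switches to paper).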
Discovered Algorithms
Through the AlphaEvolve framework, the researchers discovered several innovative algorithm variants that outperformed existing hand-designed algorithms.
1. VAD-CFR
The first evolved CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Unlike traditional CFR algorithms that use static discounting, VAD-CFR introduces three distinct mechanisms:
- Volatility-adaptive discounting: This mechanism tracks the volatility of the learning process using an Exponentially Weighted Moving Average (EWMA) of instantaneous regret magnitude, adjusting discounting based on volatility.
- Asymmetric instantaneous boosting: Positive instantaneous regrets are amplified before being added to cumulative regrets, enhancing responsiveness to beneficial actions.
- Hard warm-start with regret-magnitude weighting: Policy averaging is postponed until a specified iteration, prioritizing high-information iterations for constructing the average strategy.
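The three mechanisms can be illustrated in a single per-information-set update. Everything concrete below, the constants, the discount formula, and the weighting, is an interpretive guess: the article describes the mechanisms but not VAD-CFR's exact equations.

```python
import numpy as np

def vad_update(state, inst_regret, strategy, t,
               ewma_decay=0.9, boost=1.5, warm_start=50, base_discount=0.95):
    """Illustrative VAD-CFR-style update for one information set (all
    constants and formulas here are assumptions, not the published ones)."""
    # 1. Volatility-adaptive discounting: track an EWMA of the magnitude of
    #    the instantaneous regret, and discount cumulative regret more
    #    aggressively when the learning process is volatile.
    vol = ewma_decay * state["volatility"] + (1 - ewma_decay) * np.abs(inst_regret).mean()
    discount = base_discount / (1.0 + vol)
    # 2. Asymmetric instantaneous boosting: amplify only the positive
    #    instantaneous regrets before accumulating them.
    boosted = np.where(inst_regret > 0, boost * inst_regret, inst_regret)
    cum = discount * state["cum_regret"] + boosted
    # 3. Hard warm-start with regret-magnitude weighting: skip policy
    #    averaging entirely until iteration `warm_start`, then weight each
    #    iteration's strategy by the magnitude of its regret signal.
    avg = state["avg_strategy"]
    if t >= warm_start:
        avg = avg + np.abs(inst_regret).sum() * strategy
    return {"volatility": vol, "cum_regret": cum, "avg_strategy": avg}
```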
VAD-CFR was benchmarked against various CFR algorithms and achieved state-of-the-art performance in 10 out of 11 games tested.
2. AOD-CFR
Another variant, Asymmetric Optimistic Discounted CFR (AOD-CFR), was discovered during trials with a different training set. AOD-CFR employs a linear schedule for discounting cumulative regrets and incorporates trend-based policy optimism, achieving competitive performance through more conventional mechanisms.
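The two AOD-CFR ingredients described above can be sketched as follows. The linear discount schedule is the one used in Linear CFR, and the optimism term extrapolates the observed regret trend in the spirit of predictive regret matching (PCFR+); the exact schedule and optimism formula of AOD-CFR are not given in the article, so treat this as illustrative.

```python
import numpy as np

def aod_step(cum_regret, inst_regret, prev_inst, t):
    """Illustrative sketch of a linearly discounted, trend-optimistic
    regret update (the precise AOD-CFR equations are assumptions)."""
    # Linear discounting: iteration t's cumulative regret is scaled by
    # t / (t + 1), so older regrets decay linearly in weight.
    cum = cum_regret * (t / (t + 1.0)) + inst_regret
    # Trend-based optimism: predict the next instantaneous regret from the
    # last observation plus its trend, and regret-match on the prediction.
    trend = inst_regret - prev_inst
    predicted = cum + inst_regret + trend
    positive = np.maximum(predicted, 0.0)
    total = positive.sum()
    policy = positive / total if total > 0 else np.ones_like(cum) / len(cum)
    return cum, policy
```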
3. SHOR-PSRO
The evolved PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). This algorithm constructs a meta-strategy by blending components at each solver iteration:
- Optimistic Regret Matching: This component provides stability through regret-minimization.
- Smoothed Best Pure Strategy: A Boltzmann distribution over pure strategies that biases toward high-payoff modes, controlled by a temperature parameter.
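A blend of the two components might look like the sketch below. The blending weight, temperature, and the plain (non-optimistic) regret matching used here are assumptions; the article names the components but not SHOR-PSRO's exact blending rule.

```python
import numpy as np

def shor_meta_strategy(payoffs, cum_regret, temperature=0.5, mix=0.5):
    """Illustrative blend of regret matching and a smoothed best pure
    strategy for a meta-game (parameters and rule are assumptions).

    `payoffs` holds each population policy's expected payoff against the
    current opponent mixture; `cum_regret` is the meta-game regret vector.
    """
    # Component 1: regret matching over the population (stability).
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    rm = positive / total if total > 0 else np.ones_like(payoffs) / len(payoffs)
    # Component 2: smoothed best pure strategy, a Boltzmann distribution
    # that concentrates on high-payoff policies as temperature -> 0.
    z = np.exp((payoffs - payoffs.max()) / temperature)
    boltzmann = z / z.sum()
    # Blend the stable and the optimistic component.
    return mix * rm + (1 - mix) * boltzmann
```

Lowering the temperature makes the second component approach a pure best response, while raising it smooths the distribution toward uniform.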
SHOR-PSRO demonstrates enhanced performance by leveraging the strengths of both components in its strategy formulation.
Conclusion
Google DeepMind’s AlphaEvolve framework represents a significant advancement in the field of game theory and reinforcement learning. By enabling an LLM to autonomously rewrite and optimize its algorithms, the research team has not only streamlined the algorithm design process but has also achieved remarkable performance improvements over traditional methods. The discovery of new algorithm variants such as VAD-CFR, AOD-CFR, and SHOR-PSRO showcases the potential of automated systems in advancing complex strategic decision-making.
Note: The implications of this research extend beyond game theory, potentially influencing various fields that rely on strategic interactions, including economics, political science, and artificial intelligence.