Google DeepMind's Research Lets an LLM Rewrite Its Own Game Theory Algorithms — And It Outperformed the Experts
By Michal Sutter – April 3, 2026
Introduction
Game theory plays a crucial role in understanding strategic interactions among rational decision-makers. Designing algorithms for Multi-Agent Reinforcement Learning (MARL) in imperfect-information games, such as poker, has traditionally been a manual process of iteration and intuition. Researchers at Google DeepMind have now introduced an approach in which a large language model (LLM) autonomously rewrites the source code of game theory algorithms, producing variants that outperform hand-designed baselines on benchmark games.
Background: Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO)
Two established paradigms in game theory are Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO).
Counterfactual Regret Minimization (CFR)
CFR is an iterative algorithm that minimizes regret across information sets. In each iteration, it computes ‘counterfactual regret’: how much a player could have gained by choosing a different action at each information set. As iterations accumulate, the players’ average strategy profile converges to a Nash Equilibrium (NE) in two-player zero-sum games. Variants of CFR, such as Discounted CFR (DCFR) and Predictive CFR+ (PCFR+), improve convergence through specific discounting and predictive update rules.
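To make the regret-minimization idea concrete, here is a minimal sketch using one-shot Rock-Paper-Scissors: the game has a single decision point, so full CFR reduces to plain regret matching, and the average strategy converges to the uniform Nash equilibrium. This is an illustration of the underlying principle, not DeepMind's implementation.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Map cumulative regrets to a strategy: play in proportion to positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret yet: fall back to the uniform strategy.
    return np.ones_like(cumulative_regret) / len(cumulative_regret)

def cfr_rps(iterations=50000, seed=0):
    """Self-play regret minimization for Rock-Paper-Scissors."""
    # Payoff matrix for player 0: rows = own action, cols = opponent action.
    payoff = np.array([[0, -1, 1],
                       [1, 0, -1],
                       [-1, 1, 0]], dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial regrets so the run does not start exactly at equilibrium.
    regret = [rng.random(3), rng.random(3)]
    strategy_sum = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        s = [regret_matching(regret[0]), regret_matching(regret[1])]
        for p in range(2):
            m = payoff if p == 0 else -payoff.T   # player 1 sees the negated game
            action_values = m @ s[1 - p]          # value of each pure action
            ev = s[p] @ action_values             # value of the current mix
            # Instantaneous regret: gain from switching to each pure action.
            regret[p] += action_values - ev
            strategy_sum[p] += s[p]
    # The *average* strategy is what converges to the Nash equilibrium.
    return [ss / ss.sum() for ss in strategy_sum]
```

Running this, both players' average strategies approach the uniform mixture (1/3, 1/3, 1/3), even though the iterates themselves keep cycling.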
Policy Space Response Oracles (PSRO)
PSRO operates at a higher level of abstraction, maintaining a population of policies for each player. It constructs a payoff tensor representing the meta-game by computing expected utilities for every combination of policies. A meta-strategy solver then produces a probability distribution over each player's policies, and a new best-response policy is trained against that distribution and added to the population.
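The meta-game construction can be sketched in a few lines. In this sketch, `evaluate` is a hypothetical callback that plays two policies against each other and returns both players' expected utilities; the uniform meta-solver shown is the simplest possible choice (with it, PSRO essentially reduces to fictitious play), not the solver used in the paper.

```python
import itertools
import numpy as np

def build_meta_game(populations, evaluate):
    """Build the meta-game payoff tensor: entry [i, j] holds both players'
    expected utilities when player 0 uses policy i and player 1 uses policy j."""
    n0, n1 = len(populations[0]), len(populations[1])
    tensor = np.zeros((n0, n1, 2))
    for i, j in itertools.product(range(n0), range(n1)):
        tensor[i, j] = evaluate(populations[0][i], populations[1][j])
    return tensor

def uniform_meta_solver(tensor):
    """Simplest meta-strategy solver: a uniform mixture over each population."""
    n0, n1, _ = tensor.shape
    return np.ones(n0) / n0, np.ones(n1) / n1
```

Each PSRO iteration would rebuild this tensor with the enlarged populations, re-solve the meta-game, and train a best response against the resulting mixture.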
The AlphaEvolve Framework
AlphaEvolve is a pioneering framework that employs LLMs to automate the coding process for game theory algorithms. Instead of manually designing algorithms, AlphaEvolve utilizes an evolutionary coding agent to explore and mutate source code.
Process Overview
The process begins by initializing a population of algorithms from standard implementations: CFR+ serves as the seed for the CFR experiments, and a uniform meta-strategy solver seeds the PSRO experiments. Each generation, a parent algorithm is selected according to its fitness, and the LLM (Gemini 2.5 Pro) modifies its source code. The modified candidates are evaluated on proxy games, and valid candidates are incorporated into the population.
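The loop above can be sketched as follows. Here `mutate` stands in for the LLM rewrite step (Gemini 2.5 Pro in the actual system), `evaluate` stands in for running a candidate on proxy games, and the greedy parent choice is a simplification; AlphaEvolve samples parents stochastically from the population.

```python
def evolve(seed_algorithms, mutate, evaluate, generations=100):
    """Minimal sketch of an AlphaEvolve-style evolutionary loop.

    `evaluate` returns a fitness score, or None if the candidate is invalid
    (e.g. it crashes or times out). Seeds are assumed to be valid.
    """
    population = [(src, evaluate(src)) for src in seed_algorithms]
    for _ in range(generations):
        # Greedy parent choice keeps this sketch short; the real system
        # samples parents stochastically, guided by fitness metrics.
        parent_src, _ = max(population, key=lambda p: p[1])
        child = mutate(parent_src)
        fitness = evaluate(child)
        if fitness is not None:  # discard candidates that fail evaluation
            population.append((child, fitness))
    return max(population, key=lambda p: p[1])
```

As a toy usage example, "algorithms" can be numbers, mutation a small random perturbation, and fitness the distance to a target: the loop then behaves like a simple hill climber.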
Multi-Objective Optimization
AlphaEvolve supports multi-objective optimization, allowing for the definition of multiple fitness metrics. Each generation randomly selects one metric to guide parent sampling. The primary fitness signal used is negative exploitability after a set number of iterations, evaluated on a fixed set of training games, including:
- 3-player Kuhn Poker
- 2-player Leduc Poker
- 4-card Goofspiel
- 5-sided Liar's Dice
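The fitness signal, negative exploitability, is easy to state for a two-player zero-sum matrix game: it measures how much each player could gain by best-responding to the other's strategy, and it is zero exactly at a Nash equilibrium. The sketch below illustrates the metric on a matrix game; the paper computes it on the sequential games listed above.

```python
import numpy as np

def exploitability(payoff, s0, s1):
    """NashConv for a two-player zero-sum matrix game.

    `payoff` holds player 0's utilities; player 1's utilities are their
    negation. Returns the total best-response gain of both players.
    """
    v = s0 @ payoff @ s1               # player 0's current value
    br0 = np.max(payoff @ s1)          # player 0's best-response value
    br1 = np.max(-(s0 @ payoff))       # player 1 maximizes the negated game
    return (br0 - v) + (br1 + v)       # player 1's current value is -v

def fitness(payoff, s0, s1):
    """AlphaEvolve-style fitness: negative exploitability (higher is better)."""
    return -exploitability(payoff, s0, s1)
```

For Rock-Paper-Scissors, the uniform profile has exploitability 0, while always playing rock against a uniform opponent is exploitable for exactly 1 (the opponent switches to paper).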
Discovered Algorithms
Through the AlphaEvolve framework, the researchers discovered several innovative algorithm variants that outperformed existing hand-designed algorithms.
1. VAD-CFR
The first evolved CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Unlike traditional CFR algorithms that use static discounting, VAD-CFR introduces three distinct mechanisms:
- Volatility-adaptive discounting: This mechanism tracks the volatility of the learning process using an Exponentially Weighted Moving Average (EWMA) of instantaneous regret magnitude, adjusting discounting based on volatility.
- Asymmetric instantaneous boosting: Positive instantaneous regrets are amplified before being added to cumulative regrets, enhancing responsiveness to beneficial actions.
- Hard warm-start with regret-magnitude weighting: Policy averaging is postponed until a specified iteration, prioritizing high-information iterations for constructing the average strategy.
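The three mechanisms can be illustrated in a single per-information-set update. Everything concrete below, the constants, the discount formula, and the weighting, is an interpretive guess: the article describes the mechanisms but not VAD-CFR's exact equations.

```python
import numpy as np

def vad_update(state, inst_regret, strategy, t,
               ewma_decay=0.9, boost=1.5, warm_start=50, base_discount=0.95):
    """Illustrative VAD-CFR-style update for one information set (all
    constants and formulas here are assumptions, not the published ones)."""
    # 1. Volatility-adaptive discounting: track an EWMA of the magnitude of
    #    the instantaneous regret, and discount cumulative regret more
    #    aggressively when the learning process is volatile.
    vol = ewma_decay * state["volatility"] + (1 - ewma_decay) * np.abs(inst_regret).mean()
    discount = base_discount / (1.0 + vol)
    # 2. Asymmetric instantaneous boosting: amplify only the positive
    #    instantaneous regrets before accumulating them.
    boosted = np.where(inst_regret > 0, boost * inst_regret, inst_regret)
    cum = discount * state["cum_regret"] + boosted
    # 3. Hard warm-start with regret-magnitude weighting: skip policy
    #    averaging entirely until iteration `warm_start`, then weight each
    #    iteration's strategy by the magnitude of its regret signal.
    avg = state["avg_strategy"]
    if t >= warm_start:
        avg = avg + np.abs(inst_regret).sum() * strategy
    return {"volatility": vol, "cum_regret": cum, "avg_strategy": avg}
```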
VAD-CFR was benchmarked against various CFR algorithms and achieved state-of-the-art performance in 10 out of 11 games tested.
2. AOD-CFR
Another variant, Asymmetric Optimistic Discounted CFR (AOD-CFR), was discovered during trials with a different training set. AOD-CFR employs a linear schedule for discounting cumulative regrets and incorporates trend-based policy optimism, achieving competitive performance through more conventional mechanisms.
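The two AOD-CFR ingredients described above can be sketched as follows. The linear discount schedule is the one used in Linear CFR, and the optimism term extrapolates the observed regret trend in the spirit of predictive regret matching (PCFR+); the exact schedule and optimism formula of AOD-CFR are not given in the article, so treat this as illustrative.

```python
import numpy as np

def aod_step(cum_regret, inst_regret, prev_inst, t):
    """Illustrative sketch of a linearly discounted, trend-optimistic
    regret update (the precise AOD-CFR equations are assumptions)."""
    # Linear discounting: iteration t's cumulative regret is scaled by
    # t / (t + 1), so older regrets decay linearly in weight.
    cum = cum_regret * (t / (t + 1.0)) + inst_regret
    # Trend-based optimism: predict the next instantaneous regret from the
    # last observation plus its trend, and regret-match on the prediction.
    trend = inst_regret - prev_inst
    predicted = cum + inst_regret + trend
    positive = np.maximum(predicted, 0.0)
    total = positive.sum()
    policy = positive / total if total > 0 else np.ones_like(cum) / len(cum)
    return cum, policy
```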
3. SHOR-PSRO
The evolved PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). This algorithm constructs a meta-strategy by blending components at each solver iteration:
- Optimistic Regret Matching: This component provides stability through regret-minimization.
- Smoothed Best Pure Strategy: A Boltzmann distribution over pure strategies that biases toward high-payoff modes, controlled by a temperature parameter.
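A blend of the two components might look like the sketch below. The blending weight, temperature, and the plain (non-optimistic) regret matching used here are assumptions; the article names the components but not SHOR-PSRO's exact blending rule.

```python
import numpy as np

def shor_meta_strategy(payoffs, cum_regret, temperature=0.5, mix=0.5):
    """Illustrative blend of regret matching and a smoothed best pure
    strategy for a meta-game (parameters and rule are assumptions).

    `payoffs` holds each population policy's expected payoff against the
    current opponent mixture; `cum_regret` is the meta-game regret vector.
    """
    # Component 1: regret matching over the population (stability).
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    rm = positive / total if total > 0 else np.ones_like(payoffs) / len(payoffs)
    # Component 2: smoothed best pure strategy, a Boltzmann distribution
    # that concentrates on high-payoff policies as temperature -> 0.
    z = np.exp((payoffs - payoffs.max()) / temperature)
    boltzmann = z / z.sum()
    # Blend the stable and the optimistic component.
    return mix * rm + (1 - mix) * boltzmann
```

Lowering the temperature makes the second component approach a pure best response, while raising it smooths the distribution toward uniform.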
SHOR-PSRO demonstrates enhanced performance by leveraging the strengths of both components in its strategy formulation.
Conclusion
Google DeepMind’s AlphaEvolve framework represents a significant advancement in the field of game theory and reinforcement learning. By enabling an LLM to autonomously rewrite and optimize its algorithms, the research team has not only streamlined the algorithm design process but has also achieved remarkable performance improvements over traditional methods. The discovery of new algorithm variants such as VAD-CFR, AOD-CFR, and SHOR-PSRO showcases the potential of automated systems in advancing complex strategic decision-making.
Note: The implications of this research extend beyond game theory, potentially influencing various fields that rely on strategic interactions, including economics, political science, and artificial intelligence.