The E-Value — Game-Theoretic Evidence

PROLOGUE

On the nature of evidence

Probabilities should not be assumed a priori.
A game should be assumed a priori.
The game forces you to play probabilistically.

— The foundational principle of game-theoretic probability

For nearly a century, statistical inference has rested on assumed probability measures. We postulate that data comes from some distribution P, then derive tests and decisions from that assumption.

But what if the structure of the decision problem itself, a game between a Skeptic who seeks truth and a Nature that reveals data, were enough to give rise to all of probability?

This is the insight behind e-values and game-theoretic probability. An e-value is not a probability. It is a bet. It is the wealth of a gambler who wagered against the null hypothesis. When that wealth grows large, the evidence is strong, not because we computed a probability, but because the game would not allow it otherwise.

EDUCATIONAL COMPANION TO

expectation

A Python library for sequential testing with e-values, e-processes, confidence sequences, and game-theoretic probability.

github.com/jakorostami/expectation

Based on "Hypothesis Testing with E-values" — Ramdas & Wang (2025)

THE E-VALUE

Evidence as a wager

Imagine Forecaster claims that hypothesis P is true. Skeptic pays $1 and specifies a bet: a nonnegative payoff function E. If outcome ω is observed, Skeptic gets E(ω) back. The contract is fair to Forecaster when:

𝔼_P[ E ] ≤ 1

Under the null, Skeptic cannot expect to profit.

Below: the distribution of e-values under two worlds. Under the null (left), e-values average at most 1, so Skeptic at best breaks even. Under the alternative (right), e-values shift upward: Skeptic profits because truth is on her side.

E-VALUE DISTRIBUTION: NULL vs ALTERNATIVE

The simplest e-variable is the likelihood ratio. It is the log-optimal bet, the one that maximizes the expected logarithm of wealth under the alternative:

E = exp(μ · x − μ²/2)

The Kelly-optimal wager for testing N(0,1) vs N(μ,1).
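The likelihood-ratio bet can be checked numerically. The sketch below assumes NumPy; the helper name `gaussian_lr_e_value` and the choice μ = 0.5 are illustrative and not part of the expectation library. It estimates the mean of E under the null (which should be close to 1) and the average log-growth under the alternative (which should be close to μ²/2).

```python
import numpy as np

def gaussian_lr_e_value(x, mu):
    """Likelihood ratio of N(mu, 1) to N(0, 1), evaluated at x."""
    x = np.asarray(x, dtype=float)
    return np.exp(mu * x - mu**2 / 2)

rng = np.random.default_rng(0)
mu = 0.5  # assumed alternative mean, chosen only for illustration

null_draws = rng.normal(0.0, 1.0, size=200_000)  # data from the null
alt_draws = rng.normal(mu, 1.0, size=200_000)    # data from the alternative

# Under the null the bet is fair: the average payoff is about 1.
null_mean = gaussian_lr_e_value(null_draws, mu).mean()

# Under the alternative the average log-payoff is about mu**2 / 2.
alt_log_growth = np.log(gaussian_lr_e_value(alt_draws, mu)).mean()
```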

Section 1.5 — Betting interpretation of e-values

THE GAME

A protocol between Skeptic and Nature

Testing a hypothesis can be played as a sequential game. At each round, Skeptic declares a bet before seeing the data. Nature reveals the outcome. Wealth grows or shrinks.

ALGORITHM · TESTING BY BETTING

Skeptic's initial wealth is W₀ = 1

for t = 1, 2, 3, … do:

Skeptic declares bet S_t such that 𝔼_P[S_t | past] ≤ 1

Nature reveals X_t

W_t = W_{t−1} · S_t(X_t)

ABSTRACT GAME

The same protocol applies in every domain. Here are two real-world applications of the testing-by-betting game:

NEUROIMAGING · VOXEL ACTIVATION

Sequential monitoring of BOLD signal in a single brain voxel. Each fMRI volume provides one observation. The e-process accumulates evidence of activation without pre-specifying how many scans to collect.

A/B TESTING · CONVERSION RATE

Sequential comparison of two conversion rates. Each visitor provides a Bernoulli observation. Peek at the result after every user — the e-process remains valid. No need to wait for a fixed sample size.

Section 7.6 — Testing by betting · Algorithm 7.16

THE E-PROCESS

Evidence accumulating through time

An e-process remains valid at any stopping time. You may peek whenever you wish and the conclusion holds. This is the gift of Ville's inequality.

P( sup_t M_t ≥ 1/α ) ≤ α

For any nonnegative supermartingale with M₀ ≤ 1. The bound holds at all times simultaneously.
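Ville's inequality can be illustrated by simulation. The sketch below assumes NumPy and uses the Gaussian likelihood-ratio martingale as the wealth process; the values of α, μ, the horizon, and the number of paths are arbitrary illustrative choices. It counts how often a null path ever reaches 1/α, which should happen in at most roughly a fraction α of the paths.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, mu = 0.05, 0.5         # level and assumed alternative mean (illustrative)
n_paths, horizon = 4000, 200

# Null data: X_t ~ N(0, 1). The running product of likelihood ratios
# exp(mu * X_t - mu^2 / 2) is a nonnegative martingale started at 1.
x = rng.normal(0.0, 1.0, size=(n_paths, horizon))
log_wealth = np.cumsum(mu * x - mu**2 / 2, axis=1)

# Fraction of null paths whose wealth ever reaches 1/alpha within the horizon.
crossed = log_wealth.max(axis=1) >= np.log(1 / alpha)
crossing_rate = crossed.mean()
```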

Each observation is a round
in a game against Nature.

Skeptic places a bet — constrained
so that under the null, no strategy
can systematically profit.

But if the alternative is true,
wealth grows — evidence accumulates
like light gathering in a lens.

No probability was assumed.
The game was assumed.
The game forced the probability.

The wealth of Skeptic, who bets against Nature

W₀ = 1,  Wₜ = Wₜ₋₁ · Sₜ(Xₜ)

grows only when truth is on her side.

The evidence has spoken.

Skeptic's wealth crosses 1/α — the null is rejected.

By Ville's inequality, under the null this happens with probability at most α.

Chapter 7 — Sequential e-values and e-processes

BETTING STRATEGIES

How to wager against the null

The e-process framework leaves one choice free: how much of your wealth to bet on each round. Different rules for the betting fraction λ_t yield different growth-risk profiles. Four canonical strategies:

All-In (λ=1): Bet everything every round. Maximum growth when right, maximum volatility. The product of raw e-values.

Conservative (λ fixed, small): Steady, predictable growth. Lower variance, slower accumulation. Good when you need robustness.

Empirically Adaptive: Start cautious (λ₁=0), then learn the optimal fraction from past e-values. Asymptotically log-optimal without knowing the alternative.

Log-Optimal: If you know the alternative Q, bet the Kelly fraction. Maximizes 𝔼_Q[log W_t]. The gold standard, rarely achievable in practice.
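The strategies above differ only in how λ_t is chosen. The sketch below, assuming NumPy, applies the update M_t = M_{t−1} · ((1 − λ_t) + λ_t E_t) for a given sequence of fractions. The function names are hypothetical, and the grid search in `adaptive_fractions` is one simple assumed way to learn λ from past e-values, not necessarily the rule the library uses.

```python
import numpy as np

def fractional_wealth(e_values, lams):
    """Wealth path M_t = M_{t-1} * ((1 - lam_t) + lam_t * E_t), with M_0 = 1."""
    e_values = np.asarray(e_values, dtype=float)
    lams = np.broadcast_to(np.asarray(lams, dtype=float), e_values.shape)
    return np.cumprod((1.0 - lams) + lams * e_values)

def adaptive_fractions(e_values, grid=np.linspace(0.0, 1.0, 21)):
    """lam_1 = 0, then the grid value with the best average log-growth on past e-values only."""
    e_values = np.asarray(e_values, dtype=float)
    lams = np.zeros(len(e_values))
    for t in range(1, len(e_values)):
        past = e_values[:t, None]
        # The tiny constant avoids log(0) when lam = 1 meets a zero e-value.
        growth = np.log((1.0 - grid) + grid * past + 1e-300).mean(axis=0)
        lams[t] = grid[np.argmax(growth)]
    return lams

rng = np.random.default_rng(3)
mu = 0.4
x = rng.normal(mu, 1.0, size=500)   # data drawn from the alternative
e = np.exp(mu * x - mu**2 / 2)      # per-round likelihood-ratio e-values

all_in = fractional_wealth(e, 1.0)  # product of raw e-values
conservative = fractional_wealth(e, 0.3)
adaptive = fractional_wealth(e, adaptive_fractions(e))
```

Each λ_t in the adaptive rule depends only on e-values observed before round t, which is what keeps the resulting wealth a valid e-process.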

Legend: All-In · Conservative (λ=0.3) · Empirically Adaptive · Log-Optimal (oracle)

Definition 7.21 — Betting strategies for e-processes

E VERSUS P

Two philosophies of evidence

E-VALUE                                   | P-VALUE
A bet: Skeptic's realized wealth          | A tail probability under the null
Valid at any stopping time                | Valid only at pre-specified n
Multiply to combine independent evidence  | Cannot be multiplied directly
Average under arbitrary dependence        | Combining requires extra assumptions
Markov: P(E ≥ 1/α) ≤ α. Always.           | Requires distributional assumptions
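The combination rules in the e-value column can be made concrete with a small example. The sketch below assumes NumPy; the three e-values are made-up numbers standing in for three studies.

```python
import numpy as np

alpha = 0.05
e_values = np.array([4.0, 3.0, 2.5])  # hypothetical e-values from three studies

# Independent e-values: the product is again an e-value.
product_e = e_values.prod()

# Arbitrary dependence: the arithmetic mean is always an e-value.
mean_e = e_values.mean()

# Markov's inequality gives a level-alpha test: reject when E >= 1/alpha.
reject_product = bool(product_e >= 1 / alpha)
reject_mean = bool(mean_e >= 1 / alpha)
```

With these numbers the product is 30, which clears the threshold 1/α = 20, while the mean (about 3.17) does not. Averaging is safer under dependence but accumulates evidence far more slowly.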

An e-variable is genuinely a valid bet, but one over a p-variable is not — the latter is effectively "double dipping."

— Ramdas & Wang, Section 1.5

Section 2.3 — Calibrators · Section 1.5 — Betting interpretation

CONFIDENCE SEQUENCES

The dual face of the e-process

Every e-process has a dual: a confidence sequence — an interval containing the true parameter at every time step simultaneously.

P( ∀ t ≥ 1 : μ ∈ C_t ) ≥ 1 − α

A confidence interval is a photograph. A confidence sequence is a film.

It starts wide (little data, much uncertainty), then tightens as evidence accumulates, but never loses its guarantee. Stop collecting data whenever the interval is narrow enough for your purpose.
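One way to build such a sequence is to invert a two-sided normal-mixture supermartingale. The sketch below assumes NumPy and unit-variance observations; the function name is hypothetical and ρ = 1 is an arbitrary tuning value rather than anything computed by the library. At time t the interval keeps every candidate mean whose mixture wealth is still below 1/α.

```python
import numpy as np

def normal_mixture_cs(x, alpha=0.05, rho=1.0):
    """Time-uniform confidence sequence for the mean of unit-variance data.

    The mixture wealth sqrt(rho / (t + rho)) * exp(S^2 / (2 * (t + rho)))
    stays below 1/alpha exactly when |S| is below `boundary`, where S is
    the centered running sum.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(1, len(x) + 1)
    running_mean = np.cumsum(x) / t
    boundary = np.sqrt((t + rho) * (2 * np.log(1 / alpha) + np.log((t + rho) / rho)))
    radius = boundary / t
    return running_mean - radius, running_mean + radius

rng = np.random.default_rng(4)
true_mu = 0.5
lower, upper = normal_mixture_cs(rng.normal(true_mu, 1.0, size=2000))
width = upper - lower  # shrinks as observations accumulate
```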

Howard, Ramdas, McAuliffe & Sekhon (2021) — Time-uniform confidence sequences

PHYSICS

Evidence in the fundamental sciences

In particle physics, discoveries are declared at "5 sigma." But sigma is a fixed-sample concept. An e-process lets us monitor the detector continuously — accumulating evidence as each collision arrives.

— The sequential paradigm applied to experimental physics

The connection between e-processes and physics runs deep. In particle physics, sequential monitoring of collision data yields an e-process that accumulates evidence for new particles — the Poisson likelihood ratio is a natural e-value. And the TwoSidedNormalMixture from martingales.py provides the core machinery for testing any sequential stream of observations: a mixture supermartingale that simultaneously gives an e-process and a time-uniform confidence sequence.

Particle Discovery at a Collider. A detector counts events in successive time windows. Under background only (null), counts follow Poisson(λ₀). If a new particle exists (alternative), counts follow Poisson(λ₁) with λ₁ > λ₀. Each window yields an e-value, the likelihood ratio of its count. The product e-process accumulates evidence window by window. (Here λ₀ and λ₁ are Poisson rates, not betting fractions.)

E_t = (λ₁/λ₀)^{X_t} · exp(λ₀ − λ₁)

Poisson likelihood ratio: the optimal bet for counting experiments.
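The counting experiment can be sketched directly from this formula. The code below assumes NumPy; the function name is hypothetical, and the rates (background 5, signal adding 2 events per window) are illustrative assumptions.

```python
import numpy as np

def poisson_lr_e_values(counts, lam0, lam1):
    """Per-window e-values E_t = (lam1 / lam0)^X_t * exp(lam0 - lam1)."""
    counts = np.asarray(counts, dtype=float)
    return (lam1 / lam0) ** counts * np.exp(lam0 - lam1)

lam0 = 5.0         # background rate (assumed)
lam1 = lam0 + 2.0  # background plus signal (assumed)

rng = np.random.default_rng(5)
counts = rng.poisson(lam1, size=100)  # windows recorded with a real signal
e_process = np.cumprod(poisson_lr_e_values(counts, lam0, lam1))
```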

PARTICLE COLLIDER · SEQUENTIAL DISCOVERY

Sequential Mean Testing. The core of the library: testing whether a stream of observations has mean zero. The TwoSidedNormalMixture class (martingales.py) constructs a mixture supermartingale that yields both an e-process and a time-uniform confidence sequence. Under the null (μ = 0), the wealth stays small: it exceeds 1/α with probability at most α. Under drift μ > 0, the running mean escapes the shrinking confidence band and the e-process grows, with log-wealth increasing at a rate roughly proportional to μ².

log M(S, v) = ½ log( ρ / (v + ρ) ) + S² / ( 2(v + ρ) )

TwoSidedNormalMixture.log_superMG, with ρ = best_rho(v_opt, α).
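The formula is short enough to evaluate by hand. The sketch below assumes NumPy and is a standalone re-implementation of the displayed expression, not the library's `log_superMG` itself; ρ = 10 is a hand-picked stand-in for the value `best_rho` would return, and unit-variance data is assumed so that v_t = t.

```python
import numpy as np

def log_normal_mixture(s, v, rho):
    """log M(S, v) = 0.5 * log(rho / (v + rho)) + S^2 / (2 * (v + rho))."""
    return 0.5 * np.log(rho / (v + rho)) + s**2 / (2.0 * (v + rho))

rng = np.random.default_rng(8)
alpha, rho = 0.05, 10.0              # rho is an assumed tuning value
x = rng.normal(0.3, 1.0, size=1000)  # drift mu = 0.3, unit variance
s = np.cumsum(x)                     # running sum S_t
v = np.arange(1, len(x) + 1)         # intrinsic time v_t = t
log_m = log_normal_mixture(s, v, rho)

# Times at which the e-process is at or above the rejection threshold 1/alpha.
crossed = np.flatnonzero(log_m >= np.log(1 / alpha))
```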

SEQUENTIAL MEAN TEST · CONFIDENCE SEQUENCE

Legend: Running mean X̄_t · Confidence band (CS) · E-process wealth M_t

TwoSidedNormalMixture (martingales.py) · Confidence sequences (Howard, Ramdas, McAuliffe & Sekhon, 2021)

FINANCE

Backtesting, risk, and regime detection

A firm may have incentive to forecast risk arbitrarily, but forecasting a larger risk results in financial cost. A backtest e-statistic rewards prudence — overestimation of risk passes the test, underestimation is caught.

— Chapter 16 · E-statistics for risk measures

Game-theoretic probability was born, in part, from the study of financial markets. Shafer and Vovk's foundational work placed finance on game-theoretic footing: the market is Nature, the trader is Skeptic, and the absence of arbitrage is mathematically equivalent to the existence of a probability measure.

VaR Backtesting. A bank reports daily Value-at-Risk: the loss threshold it claims will be exceeded only 1% of the time. The backtest e-statistic (Example 16.9) is e^q_β(x, r) = 𝟙{x > r} / (1−β), where x is the realized loss, r the reported VaR, and β the VaR level. Writing α = 1 − β for the claimed exceedance probability (here α = 0.01), it equals 1/α on exceedance days and 0 otherwise. These sequential e-values are combined into an e-process via the betting fraction approach (Equation 16.15, EProcessUpdater): M_t = M_{t−1} · ((1−λ) + λE_t). If the bank's model is honest, the e-process is a nonnegative supermartingale and rarely grows large. If risk is underestimated, evidence accumulates.

E_t = 𝟙{L_t > VaR_t} / α

Exceedance indicator divided by the claimed exceedance probability: a backtest e-statistic (Ex. 16.9).
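The backtest can be sketched as follows, assuming NumPy. The function name and the fixed betting fraction λ = 0.05 are illustrative assumptions, and the code is a standalone sketch rather than the library's EProcessUpdater. Losses are simulated as standard normal, with one forecast at the honest 99% quantile and one at the 95% quantile to mimic underestimated risk.

```python
import numpy as np

def var_backtest_e_process(losses, var_forecasts, alpha=0.01, lam=0.05):
    """E_t = 1{L_t > VaR_t} / alpha, then M_t = M_{t-1} * ((1 - lam) + lam * E_t)."""
    losses = np.asarray(losses, dtype=float)
    var_forecasts = np.asarray(var_forecasts, dtype=float)
    e_values = (losses > var_forecasts).astype(float) / alpha
    return np.cumprod((1.0 - lam) + lam * e_values)

rng = np.random.default_rng(6)
losses = rng.normal(0.0, 1.0, size=1000)
honest_var = np.full(1000, 2.326)      # approx. 99% quantile of N(0, 1)
optimistic_var = np.full(1000, 1.645)  # approx. 95% quantile: risk underestimated

m_honest = var_backtest_e_process(losses, honest_var)
m_optimistic = var_backtest_e_process(losses, optimistic_var)
```

Each quiet day multiplies wealth by 1 − λ, and each exceedance multiplies it by (1 − λ) + λ/α, so too many exceedances push the process upward.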

VaR BACKTESTING · REGULATORY MONITORING

Market Regime Detection. Financial markets alternate between calm and turbulent regimes. The ConformalCUSUM class (cusum.py) implements a multiplicative e-detector: each observation yields an e-statistic for variance, E_t = e(x,r) = x²/σ₀² (Example 16.8), and the detector accumulates C_t = max(C_{t−1} · E_t, ε) with truncation floor ε = 10⁻¹⁰. Under the null (σ = σ₀, Gaussian returns), E[log E_t] ≈ −1.27, so the detector decays rapidly toward the floor. Under a volatility shift, it grows, raises an alarm when it reaches the threshold (default 20), and then resets to 0.

C_t = max( C_{t−1} · E_t , ε )

ConformalCUSUM e-detector (cusum.py): truncation floor ε = 10⁻¹⁰, reset to 0 after alarm at threshold.
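The recursion can be sketched as a short loop, assuming NumPy. This is a standalone illustration under stated assumptions, not the library's ConformalCUSUM class: the function name is hypothetical, and the starting value C₀ = 1 is assumed because the text does not specify it.

```python
import numpy as np

def variance_e_detector(x, sigma0=1.0, threshold=20.0, eps=1e-10):
    """Multiplicative e-detector C_t = max(C_{t-1} * E_t, eps), E_t = x_t^2 / sigma0^2.

    Returns the detector path and the indices at which an alarm fired.
    """
    c = 1.0  # assumed starting value
    path, alarms = [], []
    for t, x_t in enumerate(x):
        e_t = (x_t / sigma0) ** 2
        c = max(c * e_t, eps)  # the truncation floor keeps the detector alive
        path.append(c)
        if c >= threshold:
            alarms.append(t)
            c = 0.0            # reset after the alarm; the floor lifts it to eps next step
    return np.array(path), alarms

rng = np.random.default_rng(7)
# 200 calm returns followed by 200 returns with 2.5 times the volatility.
returns = np.concatenate([rng.normal(0, 1.0, 200), rng.normal(0, 2.5, 200)])
path, alarms = variance_e_detector(returns)
```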

MARKET REGIME · CUSUM E-DETECTOR

Legend: Returns · CUSUM E-Detector · Regime shift · Alarm

Shafer & Vovk (2019) · Chapter 16 — E-statistics for risk measures · E-detectors (Shin, Ramdas & Rinaldo, 2024)

NEUROSCIENCE

Massively parallel sequential testing

An fMRI scan tests tens of thousands of voxels simultaneously. Classical methods must correct for all comparisons at once. With e-values, each voxel runs its own sequential test — and the e-BH procedure controls false discovery rate at any stopping time, under arbitrary dependence.

— Chapter 9 · The e-BH procedure for FDR control

A brain scan divides the cortex into a lattice of volumetric pixels, or voxels. Each voxel produces a noisy time series of BOLD signal. The question at every voxel: is there neural activation above baseline? This is a massively parallel sequential testing problem. Each voxel runs its own TwoSidedNormalMixture e-process (martingales.py). The e-BH procedure (Definition 9.8) then selects discoveries: reject all voxels whose e-process is at least K/(α · |D|), where K is the number of voxels and |D| is the number of rejections, controlling FDR at level α under arbitrary dependence between voxels.

k* = max{ k : k · E_[k] / K ≥ 1/α }

e-BH discovery threshold, where E_[k] is the k-th largest of the K e-values and the k* largest are rejected. Theorem 9.11 guarantees FDR ≤ α for any compound e-variables.
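The selection rule can be sketched in a few lines, assuming NumPy. The function name `e_bh` is hypothetical and the four e-values in the usage line are made-up numbers.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """Indices rejected by e-BH: the k* largest e-values, where
    k* = max{k : k * E_[k] / K >= 1/alpha} and E_[k] is the k-th largest."""
    e_values = np.asarray(e_values, dtype=float)
    K = len(e_values)
    order = np.argsort(-e_values)  # indices sorted by decreasing e-value
    ks = np.arange(1, K + 1)
    ok = ks * e_values[order] / K >= 1 / alpha
    if not ok.any():
        return np.array([], dtype=int)  # no discoveries
    k_star = ks[ok].max()
    return np.sort(order[:k_star])

# K = 4 hypotheses, 1/alpha = 10: the two largest e-values are rejected.
rejected = e_bh([100.0, 0.5, 50.0, 1.0], alpha=0.1)
```

In this example k · E_[k] / K equals 25, 25, 0.75 and 0.5 for k = 1, …, 4, so k* = 2 and the hypotheses at indices 0 and 2 are rejected.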

Live Cortical Mapping. A top-down view of the cortex. Each cell is one voxel — a region of interest being tested in parallel. As fMRI volumes arrive, evidence accumulates. Cool regions show no activation (e-process near 1). Warm regions show growing evidence. When a voxel's e-process crosses 1/α, it lights up — a discovery. The scan progress bar shows acquisition time, and the counter tracks discoveries in real time.

CORTICAL HEATMAP · SEQUENTIAL EVIDENCE ACCUMULATION

Voxel Signals. Below, each ribbon is a single voxel's BOLD signal over time. Most voxels are null — pure noise around zero. A subset carry true activation (a small positive drift). The 3D view reveals the spatial structure: many parallel streams of evidence, tested simultaneously.

VOXEL SIGNALS · 3D TRAJECTORY FIELD

Parallel E-Processes. Each voxel's signal feeds a TwoSidedNormalMixture supermartingale. Under the null, wealth stays near 1. Under activation, it grows — the active voxels rise from the field like peaks on a landscape. The e-BH threshold adapts to the number of discoveries, controlling false discovery rate across all voxels simultaneously.

PARALLEL E-PROCESSES · MIXTURE SUPERMARTINGALES

Merged E-Process. The arithmetic mean M_K = (E₁ + ··· + E_K) / K is the canonical e-merging function (Proposition 8.3, Theorem 8.4). It essentially dominates all symmetric e-merging functions. At each scan volume, we average all K voxel e-processes into a single global test statistic. If the merged process crosses 1/α, there is global evidence that some activation exists — anywhere in the brain — while each individual voxel's test retains its own validity.

M_K(t) = (1/K) Σ_{k=1}^{K} E_k(t)

Arithmetic mean e-merging function, admissible under arbitrary dependence (Theorem 8.4).
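Merging by averaging is a one-line operation. The sketch below assumes NumPy and an array layout with one row per voxel and one column per scan volume; both function names and the toy numbers are hypothetical.

```python
import numpy as np

def merged_e_process(e_paths):
    """Average K e-processes at each time: M_K(t) = (1/K) * sum_k E_k(t).

    `e_paths` has shape (K, T): one row per voxel, one column per scan volume.
    """
    return np.asarray(e_paths, dtype=float).mean(axis=0)

def first_global_rejection(merged, alpha=0.05):
    """First time index at which the merged process reaches 1/alpha, or None."""
    hits = np.flatnonzero(merged >= 1 / alpha)
    return int(hits[0]) if hits.size else None

# Three toy voxel e-processes observed over three scan volumes.
e_paths = [[1.0, 2.0, 40.0],
           [1.0, 1.0, 20.0],
           [1.0, 0.5, 10.0]]
merged = merged_e_process(e_paths)
t_reject = first_global_rejection(merged, alpha=0.05)
```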

Legend: Null voxels (no activation) · Active voxels (true signal) · Discoveries (e-BH rejections)

TwoSidedNormalMixture (martingales.py) · e-BH procedure (Chapter 9, Theorem 9.11) · Compound e-values