Reinforcement Learning

MDPs, policy gradient, model-based RL, offline RL, and distributional shift.

Evidence briefs

Reviewed claims

Claim-level summaries connect a practical takeaway to the papers that actually support it.

High confidence · Published

Clipped surrogate objective · positive · Policy performance stability and implementation complexity

PPO achieves comparable or better performance than TRPO while using only first-order optimization, eliminating the need for conjugate gradient or Fisher-vector products.

Population: On-policy reinforcement learning with continuous or discrete actions · Comparator: Trust Region Policy Optimization (TRPO) with second-order optimization

Primary evidence

Proximal Policy Optimization Algorithms

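For reference, the contrast behind this claim can be written out using the notation of the PPO paper, where r_t(θ) is the probability ratio between the new and old policies and Â_t is an advantage estimate. TRPO maximizes a surrogate subject to a KL constraint of size δ, which is what calls for conjugate gradient; PPO replaces the constraint with an unconstrained clipped objective that ordinary stochastic gradient ascent can optimize. The symbols below follow the paper, not this page.

```latex
% Probability ratio between the new and old policies
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% TRPO: constrained surrogate maximization (second-order machinery needed)
\max_\theta \; \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t \right]
\quad \text{s.t.} \quad
\hat{\mathbb{E}}_t\!\left[ \mathrm{KL}\!\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right] \le \delta

% PPO: unconstrained clipped surrogate (first-order optimization suffices)
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]
```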

High confidence · Published

Multiple epochs of minibatch updates on the same trajectory data · positive · Data efficiency and policy collapse prevention

PPO runs multiple epochs of minibatch updates on the same batch of trajectories. Clipping the probability ratio keeps those repeated updates from becoming destructively large, which improves data efficiency while keeping performance stable.

Population: On-policy reinforcement learning with stochastic policies · Comparator: Single epoch update (standard policy gradient)

Primary evidence

Proximal Policy Optimization Algorithms

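The reuse pattern in this claim can be sketched as a loop over reshuffled minibatches of one fixed rollout. This is a minimal illustration, not the paper's implementation: the function name `minibatch_epochs` and its parameters are hypothetical, and the gradient step itself is left as a comment.

```python
import random

def minibatch_epochs(num_samples, num_epochs, batch_size, seed=0):
    """Yield (epoch, indices) pairs, reshuffling the same rollout each epoch."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(num_epochs):
        rng.shuffle(indices)  # new minibatch composition every epoch
        for start in range(0, num_samples, batch_size):
            yield epoch, indices[start:start + batch_size]

# Hypothetical usage: 8 collected transitions, reused for 3 epochs.
for epoch, batch in minibatch_epochs(num_samples=8, num_epochs=3, batch_size=4):
    # A gradient step on the clipped objective for `batch` would go here.
    pass
```

Every transition is visited once per epoch, so the same data drives several gradient steps before a new rollout is collected.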

High confidence · Published

Clipped probability ratio in surrogate objective · positive · Policy update stability and monotonic improvement

Clipping the probability ratio to [1-ε, 1+ε] removes the incentive for large policy updates that could collapse performance. Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic lower bound on the unclipped surrogate.

Population: On-policy reinforcement learning with function approximation · Comparator: Unclipped surrogate objective (standard policy gradient)

Primary evidence

Proximal Policy Optimization Algorithms

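A minimal sketch of the clipped surrogate, assuming per-sample log-probabilities under the new and old policies and precomputed advantage estimates are available as plain lists. The function name and the default ε = 0.2 are illustrative choices, not taken from this page.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate over a batch (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                 # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # ratio limited to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)          # pessimistic: keep the smaller term
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * adv.
gain_capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the penalty is kept in full.
penalty_kept = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

Here `gain_capped` evaluates to 1.2 and `penalty_kept` to -1.5 (up to floating-point rounding). The asymmetry is the pessimistic part: moving the ratio further never increases the objective once it is outside the clip range, but changes that make the objective worse are still counted.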

High confidence · Published

Offline reinforcement learning with standard off-policy algorithms (e.g., DQN, SAC, TD3) · negative · Policy performance due to distributional shift and value function extrapolation error

Standard off-policy RL algorithms can fail catastrophically when applied to static datasets: the learned policy selects out-of-distribution actions whose values are extrapolated rather than learned from data, and bootstrapping compounds those extrapolation errors in the value function.

Population: Fixed pre-collected datasets in Markov decision processes · Comparator: Online reinforcement learning with the same algorithms

Primary evidence

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
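The failure mode can be illustrated with a toy tabular example. This is a constructed sketch, not an experiment from the cited paper: a single-state MDP, a dataset that only ever contains action 0, and an arbitrary initial value of 50.0 for the unseen action 1 standing in for a function approximator's extrapolation error. All names and numbers are assumptions chosen for illustration.

```python
def offline_q_learning(dataset, q_init, gamma=0.9, alpha=0.5, sweeps=200):
    """Tabular Q-learning over a fixed dataset for a single-state MDP.

    dataset: list of (action, reward) pairs; the next state is always the same state.
    q_init:  initial Q-value per action; entries for unseen actions are never corrected.
    """
    q = list(q_init)
    for _ in range(sweeps):
        for action, reward in dataset:
            target = reward + gamma * max(q)  # the max also ranges over unseen actions
            q[action] += alpha * (target - q[action])
    return q

# Only action 0 appears in the data; Q[1] = 50.0 is the stand-in extrapolation error.
q = offline_q_learning(dataset=[(0, 1.0)], q_init=[0.0, 50.0])
greedy_action = max(range(len(q)), key=q.__getitem__)
# Always taking action 0 is worth 1 / (1 - 0.9) = 10, yet Q[0] settles near
# 1 + 0.9 * 50 = 46, and the greedy policy picks action 1, which has no data.
```

No amount of further training on the same dataset fixes this, because nothing in the data ever contradicts the value assigned to the unseen action. Online learning would correct it by trying action 1 and observing the outcome.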

High confidence · Published

Support overlap assumption (behavior policy must cover actions the learned policy would take) · negative · Feasibility of learning a good policy

When the behavior policy does not place probability mass on actions that the optimal policy would take, offline RL algorithms cannot learn a good policy due to lack of data support.

Population: Offline RL settings with fixed datasets · Comparator: Violation of support overlap (insufficient coverage)

Primary evidence

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

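In a small discrete problem, a rough coverage check is to count which state-action pairs the dataset contains and flag states where a candidate policy's action never appears. The sketch below assumes hashable states and actions and uses empirical counts as a proxy for the behavior policy's support; the function name and data layout are hypothetical, and continuous spaces would need a density estimate instead.

```python
from collections import Counter

def unsupported_states(dataset, target_policy):
    """Return states where the target policy's action never appears in the data.

    dataset:       iterable of (state, action) pairs logged by the behavior policy.
    target_policy: dict mapping state -> action the candidate policy would take.
    """
    counts = Counter(dataset)
    return sorted(s for s, a in target_policy.items() if counts[(s, a)] == 0)

# Hypothetical log: "right" was never tried in state "s1".
logged = [("s0", "left"), ("s0", "right"), ("s1", "left")]
candidate = {"s0": "right", "s1": "right"}
missing = unsupported_states(logged, candidate)
```

In this example `missing` is `["s1"]`: the data says nothing about taking "right" in "s1", so any value assigned to that choice is a guess.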

High confidence · Published

Thompson sampling · positive · Sample efficiency and regret

Thompson sampling avoids wasting samples by integrating exploration and exploitation, leading to improved sample efficiency compared to methods that separate phases.

Population: Online decision problems (multi-armed bandits, shortest path, product recommendation, etc.) · Comparator: Classical approaches with separate exploration/exploitation phases

Primary evidence

A Tutorial on Thompson Sampling

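A minimal sketch of Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors, one of the simplest settings where the idea applies. The arm means, round count, and function name are illustrative assumptions. Each round draws one sample per arm from its posterior and plays the arm with the largest sample, so exploration and exploitation happen in the same step.

```python
import random

def thompson_bernoulli(true_means, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns the pull count per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # Beta(1, 1) uniform prior: 1 + observed successes
    beta = [1] * k   # 1 + observed failures
    pulls = [0] * k
    for _ in range(rounds):
        # Draw a plausible mean for each arm from its posterior, then act greedily on the draws.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical three-armed problem; the last arm has the highest success rate.
pulls = thompson_bernoulli(true_means=[0.2, 0.5, 0.8], rounds=2000)
```

Arms whose posteriors are still wide occasionally produce the largest draw and get tried, while clearly inferior arms are sampled less and less often. With these settings most pulls should end up on the last arm.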

Evidence base

50 papers

Book · Wiki · Canonical · High evidence score

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto · MIT Press · 2018

The standard textbook introduction to reinforcement learning, covering MDPs, value functions, temporal-difference learning, policy gradients, and core algorithms.
