Sequential Decisions

Multi-armed bandits, Thompson sampling, UCB, regret bounds, and online learning.

Evidence briefs

Reviewed claims

Claim-level summaries connect a practical takeaway to the papers that actually support it.

High confidence · Published

Clipped surrogate objective · positive · Policy performance stability and data efficiency

Clipping the probability ratio to [1-ε, 1+ε] removes the incentive for destructively large policy updates. This allows multiple epochs of minibatch SGD on the same batch without performance collapse and yields TRPO-level performance with first-order optimization alone.

Population: On-policy reinforcement learning with MDPs · Comparator: Unclipped surrogate objective (standard policy gradient with multiple epochs)

Primary evidence

Proximal Policy Optimization Algorithms
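
A minimal sketch of the clipped surrogate described in this claim, in plain Python for illustration only. The function name `clipped_surrogate`, its argument names, and the default ε = 0.2 are assumptions made here; a practical implementation would compute the same quantity on tensors and differentiate through it.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    All three inputs are equal-length sequences of floats. `epsilon` is the
    clip range; 0.2 is an illustrative default, not a recommendation.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(log_prob_new, log_prob_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - epsilon, 1 + epsilon].
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage the objective stops growing once the ratio exceeds 1+ε, and with a negative advantage it stops improving once the ratio falls below 1-ε, so repeated minibatch epochs gain nothing from pushing the ratio further.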

High confidence · Published

PPO with clipped objective · positive · Sample efficiency, computational complexity, and final performance

PPO matches or exceeds TRPO's performance on continuous control benchmarks while being significantly simpler to implement (first-order vs second-order optimization) and computationally cheaper.

Population: Continuous control tasks in MuJoCo and Atari games · Comparator: Trust Region Policy Optimization (TRPO)

Primary evidence

Proximal Policy Optimization Algorithms

High confidence · Published

Thompson Sampling for contextual bandits with linear payoffs · positive · Regret upper bound

Thompson Sampling achieves regret O((d²/ε)·√(T^(1+ε))) for any ε>0. Against the lower bound Ω(d√T) this leaves a gap of order (d/ε)·T^(ε/2); the √d gap corresponds to the paper's Õ(d^(3/2)·√T) statement of the bound.

Population: Stochastic contextual multi-armed bandit problems with linear payoff functions · Comparator: Theoretical lower bound Ω(d√T)

Primary evidence

Thompson Sampling for Contextual Bandits with Linear Payoffs

High confidence · Published

Truncated empirical mean · positive · Regret bound

Achieves distribution-free regret of order n^(1/(1+ε)), up to factors in the number of arms and logarithmic terms, under heavy-tailed rewards, matching the lower bound's rate in n.

Population: Multi-armed bandit problems with heavy-tailed reward distributions having finite moments of order 1+ε (0<ε≤1) · Comparator: Standard empirical mean (sub-Gaussian bandit algorithms)

Primary evidence

Bandits With Heavy Tail
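
A minimal sketch of a truncated empirical mean, for illustration only. It assumes the t-th sample is discarded when its magnitude exceeds (u·t / log(1/δ))^(1/(1+ε)), where `u` is a known bound on the raw moment of order 1+ε and `delta` is a confidence parameter. This threshold form and all names are assumptions stated here, not quoted from the brief.

```python
import math


def truncated_mean(samples, u, eps, delta):
    """Truncated empirical mean of a heavy-tailed sample.

    samples: observed rewards, in the order they arrived
    u:       assumed upper bound on E|X|^(1+eps)
    eps:     moment parameter in (0, 1]
    delta:   confidence parameter in (0, 1)
    """
    log_term = math.log(1.0 / delta)
    kept_sum = 0.0
    for t, x in enumerate(samples, start=1):
        # The truncation level grows with t, so later samples are cut less.
        threshold = (u * t / log_term) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            kept_sum += x
    # Discarded samples count as zeros: divide by the full sample size.
    return kept_sum / len(samples)
```

Because discarded samples contribute zero, the estimate is biased toward zero; the growing threshold is what keeps that bias small relative to the confidence width.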

High confidence · Published

Catoni's M-estimator · positive · Regret bound

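Achieves distribution-free regret of order n^(1/(1+ε)), up to factors in the number of arms and logarithmic terms, matching the lower bound's rate in n. The paper analyzes this estimator in the finite-variance case ε = 1, where the rate is √n.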

Population: Multi-armed bandit problems with heavy-tailed reward distributions having finite moments of order 1+ε (0<ε≤1) · Comparator: Standard empirical mean (sub-Gaussian bandit algorithms)

Primary evidence

Bandits With Heavy Tail
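
A minimal sketch of Catoni's M-estimator, for illustration only. The estimate is the root θ of Σ ψ(α(xᵢ − θ)) = 0 with the influence function ψ(x) = log(1 + x + x²/2) for x ≥ 0 and ψ(x) = −log(1 − x + x²/2) otherwise. The scale `alpha` is left as a caller-supplied parameter because its tuning depends on the variance bound and confidence level; the function names and the bisection solver are assumptions made here.

```python
import math


def _psi(x):
    """Catoni's influence function: roughly linear near 0, logarithmic in the tails."""
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)


def catoni_mean(samples, alpha, tol=1e-9):
    """Solve sum(psi(alpha * (x - theta))) = 0 for theta by bisection."""
    lo, hi = min(samples), max(samples)

    def score(theta):
        return sum(_psi(alpha * (x - theta)) for x in samples)

    # score is non-increasing in theta, >= 0 at min(samples), <= 0 at max(samples).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `catoni_mean([0, 0, 0, 0, 100], alpha=0.1)` lands well below the empirical mean of 20, because the single large value enters only logarithmically.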

High confidence · Published

Median-of-means estimator · positive · Regret bound

Achieves distribution-free regret of order n^(1/(1+ε)), up to factors in the number of arms and logarithmic terms, under heavy-tailed rewards, matching the lower bound's rate in n.

Population: Multi-armed bandit problems with heavy-tailed reward distributions having finite moments of order 1+ε (0<ε≤1) · Comparator: Standard empirical mean (sub-Gaussian bandit algorithms)

Primary evidence

Bandits With Heavy Tail
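
A minimal sketch of the median-of-means estimator, for illustration only. The number of blocks `k` is left as a parameter (in the analysis it is chosen from the confidence level); the function name and the choice to drop leftover samples when `k` does not divide the sample size are assumptions made here.

```python
import statistics


def median_of_means(samples, k):
    """Split samples into k equal blocks, average each block, return the median."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and len(samples)")
    block = len(samples) // k  # leftover samples beyond k * block are ignored
    means = [
        sum(samples[i * block:(i + 1) * block]) / block
        for i in range(k)
    ]
    return statistics.median(means)
```

A single extreme reward can corrupt at most one block mean, so the median over blocks stays close to the bulk of the data: `median_of_means([1, 2, 3, 4, 5, 1000], k=3)` gives block means 1.5, 3.5, and 502.5 and returns 3.5.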

Evidence base

50 papers

Study · Wiki · Canonical · Moderate

Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach

Junzhe Zhang · International Conference on Machine Learning · 2020 · 16 citations

Book · Wiki · Canonical · High evidence score

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto · MIT Press · 2018

The standard textbook introduction to reinforcement learning, covering MDPs, value functions, temporal-difference learning, policy gradients, and core algorithms.

Study · Wiki · Canonical · High confidence

A Survey of Constraint Formulations in Safe Reinforcement Learning

Akifumi Wachi, Xun Shen, Yanan Sui · IJCAI · 2024

A survey of safe RL constraint formulations, representative algorithms, and the relationships among common constrained decision-making criteria.

Book · Wiki · Canonical · High evidence score

Bandit Algorithms

Tor Lattimore, Csaba Szepesvari · Cambridge University Press · 2020

A comprehensive reference for stochastic bandits, adversarial bandits, contextual bandits, lower bounds, UCB, Thompson sampling, and structured variants.

Study · Wiki · Canonical · High confidence

A Comprehensive Survey on Safe Reinforcement Learning

Javier Garcia, Fernando Fernandez · Journal of Machine Learning Research · 2015

A classic survey of safe reinforcement learning, including risk-sensitive criteria, constrained exploration, safety during learning, and external guidance.

Study · Wiki · Canonical · High confidence

Off-Policy Policy Evaluation for Sequential Decisions under Unobserved Confounding

Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky +1 more · arXiv · 2020

Studies off-policy evaluation for sequential decisions when hidden confounders may bias logged trajectories.

Study · Wiki · Canonical · High confidence

Markov Decision Processes with Unobserved Confounders: A Causal Approach

Junzhe Zhang, Elias Bareinboim · CausalAI Lab Technical Report R-23 · 2016

Extends causal reasoning to MDPs where hidden variables may affect both actions and outcomes, motivating CRL methods that reason about confounding in sequential settings.

Study · Wiki · Canonical · High confidence

Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes

Junzhe Zhang, Elias Bareinboim · NeurIPS · 2019

Connects causal reinforcement learning with dynamic treatment regimes, focusing on near-optimal sequential treatment policies.

Study · Wiki · Canonical · High confidence

Bandits with Unobserved Confounders: A Causal Approach

Elias Bareinboim, Andrew Forney, Judea Pearl · NeurIPS · 2015

Introduces a causal treatment of bandit problems where observational feedback may be confounded, showing when causal structure can improve intervention selection.

Study · Wiki · Canonical · High confidence

Structural Causal Bandits: Where to Intervene?

Sanghack Lee, Elias Bareinboim · NeurIPS · 2018

Introduces structural causal bandits, where the learner chooses interventions in a causal graph rather than arms with unrelated reward distributions.

Study · Wiki · Canonical · High confidence

An Introduction to Causal Reinforcement Learning

Elias Bareinboim, Junzhe Zhang, Sanghack Lee · CausalAI Lab Technical Report R-65 · 2024

A tutorial survey that organizes causal reinforcement learning around offline-to-online learning, intervention choice, counterfactual decision-making, transportability, causal discovery, imitation, curriculum learning, reward shaping, and causal game theory.

Study · Preprint · Wiki · Canonical · Moderate

Always Valid Inference: Bringing Sequential Analysis to A/B Testing

Ramesh Johari, Leo Pekelis, David J. Walsh · 2015 · 101 citations

A/B tests are typically analyzed via frequentist p-values and confidence intervals; but these inferences are wholly unreliable if users endogenously choose sample sizes by continuously monitoring their tests. We define always valid p-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision. Always valid inference can be interpreted as a natural interface for a sequential hypothesis test, which empowers users to implement a modified test tailored to them. In particular, we show in an appropriate sense that the measures we develop trade off sample size and power efficiently, despite a lack of prior knowledge of the user's relative preference between these two goals. We also use always valid p-values to obtain multiple hypothesis testing control in the sequential context. Our methodology has been implemented in a large scale commercial A/B testing platform to analyze hundreds of thousands of experiments to date.

Study · Preprint · Wiki · Canonical · Moderate

A Tutorial on Thompson Sampling

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni +2 more · 2017 · 1,175 citations

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
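
A minimal sketch of Thompson sampling on the Bernoulli bandit mentioned in the abstract above, for illustration only. It assumes independent uniform Beta(1, 1) priors on each arm's success probability; the function name, the simulated environment, and the seed handling are assumptions made here.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        draws = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(k)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=draws.__getitem__)
        # Simulated environment: Bernoulli reward with the arm's true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Exploration comes entirely from posterior randomness: an arm with few observations has a wide posterior, so it is occasionally sampled as the best and pulled, while well-estimated poor arms are chosen less and less often.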

Study · Preprint · Wiki · Canonical · Moderate

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker +1 more · 2020

In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

Study · Preprint · Wiki · Canonical · Moderate

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal +2 more · 2017

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

Study · Wiki · High confidence

Sequential Causal Imitation Learning with Unobserved Confounders

Daniel Kumor, Junzhe Zhang, Elias Bareinboim · NeurIPS · 2021

Extends causal imitation learning to sequential settings where confounding can persist across time.

Study · Wiki · High confidence

Characterizing Optimal Mixed Policies: Where to Intervene, What to Observe

Sanghack Lee, Elias Bareinboim · NeurIPS · 2020

Characterizes policies that mix interventions and observations in causal decision problems.

Study · Wiki · High confidence

Structural Causal Bandits with Non-Manipulable Variables

Sanghack Lee, Elias Bareinboim · AAAI · 2019

Extends structural causal bandits to settings where some variables can be observed but not directly manipulated.

Study · Wiki · High confidence

Budgeted Experiment Design for Causal Structure Learning

AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash +1 more · ICML · 2018

Addresses how to allocate a limited intervention budget to learn causal structure efficiently.

Study · Wiki · High confidence

Counterfactual Data-Fusion for Online Reinforcement Learners

Andrew Forney, Judea Pearl, Elias Bareinboim · ICML · 2017

Studies how online learners can combine heterogeneous observational and experimental data sources using counterfactual data-fusion principles.

Meta-analysis · High evidence score

The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective

John K. Kruschke, Torrin M. Liddell · Psychonomic Bulletin & Review · 2017 · 1,326 citations

Meta-analysis · High evidence score

Testing by Betting: A Strategy for Statistical and Scientific Communication

Glenn Shafer · Journal of the Royal Statistical Society Series A (Statistics in Society) · 2021 · 112 citations

The most widely used concept of statistical inference—the p-value—is too complicated for effective communication to a wide audience. This paper introduces a simpler way of reporting statistical evidence: report the outcome of a bet against the null hypothesis. This leads to a new role for likelihood, to alternatives to power and confidence, and to a framework for meta-analysis that accommodates both planned and opportunistic testing of statistical hypotheses and probabilistic forecasts. This framework builds on the foundation for mathematical probability developed in previous work by Vladimir Vovk and myself.

Systematic Review · High evidence score

A survey on causal inference for recommendation

Huishi Luo, Fuzhen Zhuang, Ruobing Xie +4 more · The Innovation · 2024 · 37 citations

Causal inference has recently garnered significant interest among recommender system (RS) researchers due to its ability to dissect cause-and-effect relationships and its broad applicability across multiple fields. It offers a framework to model the causality in RSs such as confounding effects and deal with counterfactual problems such as offline policy evaluation and data augmentation. Although there are already some valuable surveys on causal recommendations, they typically classify approaches based on the practical issues faced in RS, a classification that may disperse and fragment the unified causal theories. Considering RS researchers' unfamiliarity with causality, it is necessary yet challenging to comprehensively review relevant studies from a coherent causal theoretical perspective, thereby facilitating a deeper integration of causal inference in RS. This survey provides a systematic review of up-to-date papers in this area from a causal theory standpoint and traces the evolutionary development of RS methods within the same causal strategy. First, we introduce the fundamental concepts of causal inference as the basis of the following review. Subsequently, we propose a novel theory-driven taxonomy, categorizing existing methods based on the causal theory employed, namely those based on the potential outcome framework, the structural causal model, and general counterfactuals. The review then delves into the technical details of how existing methods apply causal inference to address particular recommender issues. Finally, we highlight some promising directions for future research in this field. Representative papers and open-source resources will be progressively available at https://github.com/Chrissie-Law/Causal-Inference-for-Recommendation.

Study · Moderate

Quantiles via moments

José A. F. Machado, João Santos Silva · Journal of Econometrics · 2019 · 2,156 citations

Study · Moderate

Reinforcement Learning: A Survey

Leslie Pack Kaelbling, Michael L. Littman, Andrew Moore · Journal of Artificial Intelligence Research · 1996 · 8,787 citations

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

Study · Moderate

Interpretable machine learning: Fundamental principles and 10 grand challenges

Cynthia Rudin, Chaofan Chen, Zhi Chen +3 more · Statistics Surveys · 2022 · 828 citations

Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the “Rashomon set” of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.

Study · Moderate

Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications

Eric‐Jan Wagenmakers, Maarten Marsman, Tahira Jamil +10 more · Psychonomic Bulletin & Review · 2017 · 1,697 citations

Bayesian parameter estimation and Bayesian hypothesis testing present attractive alternatives to classical inference using confidence intervals and p values. In part I of this series we outline ten prominent advantages of the Bayesian approach. Many of these advantages translate to concrete opportunities for pragmatic researchers. For instance, Bayesian hypothesis testing allows researchers to quantify evidence and monitor its progression as data come in, without needing to know the intention with which the data were collected. We end by countering several objections to Bayesian hypothesis testing. Part II of this series discusses JASP, a free and open source software program that makes it easy to conduct Bayesian estimation and testing for a range of popular statistical scenarios (Wagenmakers et al. this issue).

Study · Top journal · Moderate

Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors)

Jennifer A. Hoeting, David Madigan, Adrian E. Raftery +1 more · Statistical Science · 1999 · 4,164 citations

Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty. Several methods for implementing BMA have recently emerged. We discuss these methods and present a number of examples. In these examples, BMA provides improved out-of-sample predictive performance. We also provide a catalogue of currently available BMA software.

Study · Moderate

An Introduction to Deep Reinforcement Learning

Vincent François-Lavet, Peter Henderson, Riashat Islam +2 more · Foundations and Trends® in Machine Learning · 2018 · 1,241 citations

Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.

Study · Top journal · Moderate

The magical number 4 in short-term memory: A reconsideration of mental storage capacity

Nelson Cowan · Behavioral and Brain Sciences · 2001 · 6,745 citations

Miller (1956) summarized evidence that people can remember about seven chunks in short-term memory (STM) tasks. However, that number was meant more as a rough estimate and a rhetorical device than as a real capacity limit. Others have since suggested that there is a more precise capacity limit, but that it is only three to five chunks. The present target article brings together a wide variety of data on capacity limits suggesting that the smaller capacity limit is real. Capacity limits will be useful in analyses of information processing only if the boundary conditions for observing them can be carefully described. Four basic conditions in which chunks can be identified and capacity limits can accordingly be observed are: (1) when information overload limits chunks to individual stimulus items, (2) when other steps are taken specifically to block the recording of stimulus items into larger chunks, (3) in performance discontinuities caused by the capacity limit, and (4) in various indirect effects of the capacity limit. Under these conditions, rehearsal and long-term memory cannot be used to combine stimulus items into chunks of an unknown size; nor can storage mechanisms that are not capacity-limited, such as sensory memory, allow the capacity-limited storage mechanism to be refilled during recall. A single, central capacity limit averaging about four chunks is implicated along with other, noncapacity-limited sources. The pure STM capacity limit expressed in chunks is distinguished from compound STM limits obtained when the number of separately held chunks is unclear. Reasons why pure capacity estimates fall within a narrow range are discussed and a capacity limit for the focus of attention is proposed.

Study · Moderate

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

Tom McCoy, Ellie Pavlick, Tal Linzen · 2019 · 914 citations

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

Study · Moderate

Challenges of real-world reinforcement learning: definitions, benchmarks and analysis

Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz +4 more · Machine Learning · 2021 · 557 citations

RCT · High evidence score

Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments

Sylvain Chassang, Gerard Padró i Miquel, Erik Snowberg · American Economic Review · 2012 · 153 citations

We study the design of randomized controlled experiments when outcomes are significantly affected by experimental subjects' unobserved effort expenditure. While standard randomized controlled trials (RCTs) are internally consistent, the unobservability of effort compromises external validity. We approach trial design as a principal-agent problem and show that natural extensions of RCTs—which we call selective trials—can help improve external validity. In particular, selective trials can disentangle the effects of treatment, effort, and the interaction of treatment and effort. Moreover, they can help identify when treatment effects are affected by erroneous beliefs and inappropriate effort expenditure. (JEL C90, D82)

Study · Moderate

Benchmarking and survey of explanation methods for black box models

Francesco Bodria, Fosca Giannotti, Riccardo Guidotti +3 more · Data Mining and Knowledge Discovery · 2023 · 234 citations

The rise of sophisticated black-box machine learning models in Artificial Intelligence systems has prompted the need for explanation methods that reveal how these models work in an understandable way to users and decision makers. Unsurprisingly, the state-of-the-art exhibits currently a plethora of explainers providing many different types of explanations. With the aim of providing a compass for researchers and practitioners, this paper proposes a categorization of explanation methods from the perspective of the type of explanation they return, also considering the different input data formats. The paper accounts for the most representative explainers to date, also discussing similarities and discrepancies of returned explanations through their visual appearance. A companion website to the paper is provided as a continuous update to new explainers as they appear. Moreover, a subset of the most robust and widely adopted explainers are benchmarked with respect to a repertoire of quantitative metrics.

Study · Moderate

Finite-time Analysis of the Multiarmed Bandit Problem

Peter Auer, Nicolò Cesa‐Bianchi, Paul Fischer · Machine Learning · 2002 · 5,786 citations

Meta-analysis · Top journal · High evidence score

The Anytime-Valid Logrank Test: Error Control Under Continuous Monitoring with Unlimited Horizon

Judith ter Schure, Muriel F. Pérez-Ortiz, Alexander Ly +1 more · The New England Journal of Statistics in Data Science · 2024 · 7 citations

We introduce the anytime-valid (AV) logrank test, a version of the logrank test that provides type-I error guarantees under optional stopping and optional continuation. The test is sequential without the need to specify a maximum sample size or stopping rule, and allows for cumulative meta-analysis with type-I error control. The method can be extended to define anytime-valid confidence intervals. The logrank test is an instance of the martingale tests based on E-variables that have been recently developed. We demonstrate type-I error guarantees for the test in a semiparametric setting of proportional hazards, show explicitly how to extend it to ties and confidence sequences and indicate further extensions to the full Cox regression model. Using a Gaussian approximation on the logrank statistic, we show that the AV logrank test (which itself is always exact) has a similar rejection region to O’Brien-Fleming α-spending but with the potential to achieve 100% power by optional continuation. Although our approach to study design requires a larger sample size, the expected sample size is competitive by optional stopping.

Study · Top journal · Moderate

Interval Estimation for a Binomial Proportion

Lawrence D. Brown, Tommaso Cai, Anirban Dasgupta · Statistical Science · 2001 · 3,475 citations

We revisit the problem of interval estimation of a binomial proportion. The erratic behavior of the coverage probability of the standard Wald confidence interval has previously been remarked on in the literature (Blyth and Still, Agresti and Coull, Santner and others). We begin by showing that the chaotic coverage properties of the Wald interval are far more persistent than is appreciated. Furthermore, common textbook prescriptions regarding its safety are misleading and defective in several respects and cannot be trusted. This leads us to consideration of alternative intervals. A number of natural alternatives are presented, each with its motivation and context. Each interval is examined for its coverage probability and its length. Based on this analysis, we recommend the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n. We also provide an additional frequentist justification for use of the Jeffreys interval.
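
A minimal sketch of the Wilson score interval that the abstract above recommends for small n, for illustration only. The function name and the default z = 1.96 (roughly 95% nominal coverage) are assumptions made here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of observed successes
    n:         number of trials (must be positive)
    z:         standard normal quantile for the desired coverage
    """
    p_hat = successes / n
    denom = 1.0 + z * z / n
    # The center is pulled from p_hat toward 1/2, more strongly for small n.
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)
    )
    return center - half_width, center + half_width
```

Unlike the Wald interval, this one does not collapse to zero width when no successes are observed: `wilson_interval(0, 10)` has a lower end at essentially 0 and an upper end near 0.28.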

RCT · High evidence score

Group Sequential Tests for Delayed Responses (with discussion)

Lisa V. Hampson, Christopher Jennison · Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2012 · 109 citations

Group sequential methods are used routinely to monitor clinical trials and to provide early stopping when there is evidence of a treatment effect, a lack of an effect or concerns about patient safety. In many studies, the response of clinical interest is measured some time after the start of treatment and there are subjects at each interim analysis who have been treated but are yet to respond. We formulate a new form of group sequential test which gives a proper treatment of these ‘pipeline’ subjects; these tests can be applied even when the continued accrual of data after the decision to stop the trial is unexpected. We illustrate our methods through a series of examples. We define error spending versions of these new designs which handle unpredictable group sizes and provide an information monitoring framework that can accommodate nuisance parameters, such as an unknown response variance. By studying optimal versions of our new designs, we show how the benefits of lower expected sample size that are normally achieved by a group sequential test are reduced when there is a delay in response. The loss of efficiency for larger delays can be ameliorated by incorporating data on a correlated short-term end point, fitting a joint model for the two end points but still making inferences on the original, longer-term end point. We derive p-values and confidence intervals on termination of our new tests.

Study · Moderate

Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities

Carl Orge Retzlaff, Srijita Das, Christabel Wayllace +7 more · Journal of Artificial Intelligence Research · 2024 · 129 citations

Artificial intelligence (AI) and especially reinforcement learning (RL) have the potential to enable agents to learn and perform tasks autonomously with superhuman performance. However, we consider RL as fundamentally a Human-in-the-Loop (HITL) paradigm, even when an agent eventually performs its task autonomously. In cases where the reward function is challenging or impossible to define, HITL approaches are considered particularly advantageous. The application of Reinforcement Learning from Human Feedback (RLHF) in systems such as ChatGPT demonstrates the effectiveness of optimizing for user experience and integrating their feedback into the training loop. In HITL RL, human input is integrated during the agent’s learning process, allowing iterative updates and fine-tuning based on human feedback, thus enhancing the agent’s performance. Since the human is an essential part of this process, we argue that human-centric approaches are the key to successful RL, a fact that has not been adequately considered in the existing literature. This paper aims to inform readers about current explainability methods in HITL RL. It also shows how the application of explainable AI (xAI) and specific improvements to existing explainability approaches can enable a better human-agent interaction in HITL RL for all types of users, whether for lay people, domain experts, or machine learning specialists. Accounting for the workflow in HITL RL and based on software and machine learning methodologies, this article identifies four phases for human involvement for creating HITL RL systems: (1) Agent Development, (2) Agent Learning, (3) Agent Evaluation, and (4) Agent Deployment. We highlight human involvement, explanation requirements, new challenges, and goals for each phase. We furthermore identify low-risk, high-return opportunities for explainability research in HITL RL and present long-term research goals to advance the field. 
Finally, we propose a vision of human-robot collaboration that allows both parties to reach their full potential and cooperate effectively.

Study · Moderate

Bayes factor design analysis: Planning for compelling evidence

Felix D. Schönbrodt, Eric‐Jan Wagenmakers · Psychonomic Bulletin & Review · 2017 · 764 citations

A sizeable literature exists on the use of frequentist power analysis in the null-hypothesis significance testing (NHST) paradigm to facilitate the design of informative experiments. In contrast, there is almost no literature that discusses the design of experiments when Bayes factors (BFs) are used as a measure of evidence. Here we explore Bayes Factor Design Analysis (BFDA) as a useful tool to design studies for maximum efficiency and informativeness. We elaborate on three possible BF designs, (a) a fixed-n design, (b) an open-ended Sequential Bayes Factor (SBF) design, where researchers can test after each participant and can stop data collection whenever there is strong evidence for either $\mathcal{H}_{1}$ or $\mathcal{H}_{0}$, and (c) a modified SBF design that defines a maximal sample size where data collection is stopped regardless of the current state of evidence. We demonstrate how the properties of each design (i.e., expected strength of evidence, expected sample size, expected probability of misleading evidence, expected probability of weak evidence) can be evaluated using Monte Carlo simulations and equip researchers with the necessary information to compute their own Bayesian design analyses.
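The Monte Carlo evaluation of design (c), the SBF design with a maximal sample size, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' procedure: it swaps in a known-variance normal model with a N(0, τ²) prior on the mean under H1, for which the Bayes factor has a closed form. All function and parameter names are hypothetical.

```python
import math
import random


def bf10_normal(mean, n, sigma=1.0, tau=1.0):
    """BF for H1: mu ~ N(0, tau^2) against H0: mu = 0, with known sigma."""
    v0 = sigma**2 / n          # variance of the sample mean under H0
    v1 = tau**2 + v0           # marginal variance of the sample mean under H1
    return math.sqrt(v0 / v1) * math.exp(0.5 * mean**2 * (1 / v0 - 1 / v1))


def run_sbf(true_mu, n_min=10, n_max=200, threshold=10.0, rng=random):
    """One sequential run: test after each observation once n >= n_min."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(true_mu, 1.0)
        if n >= n_min:
            bf = bf10_normal(total / n, n)
            if bf >= threshold:
                return n, "H1"
            if bf <= 1 / threshold:
                return n, "H0"
    return n_max, "inconclusive"   # stopped at the maximal sample size


def simulate(true_mu, runs=2000, seed=0, **kwargs):
    """Estimate expected sample size and the rate of each stopping outcome."""
    rng = random.Random(seed)
    results = [run_sbf(true_mu, rng=rng, **kwargs) for _ in range(runs)]
    mean_n = sum(n for n, _ in results) / runs
    rates = {
        label: sum(1 for _, d in results if d == label) / runs
        for label in ("H1", "H0", "inconclusive")
    }
    return mean_n, rates
```

Running `simulate(0.0)` estimates the probability of misleading evidence (the "H1" rate when the null is true) and of weak evidence (the "inconclusive" rate), while `simulate(0.5)` gives the expected sample size under a true effect.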

Study · Moderate

Pair-copula constructions of multiple dependence

Kjersti Aas, Claudia Czado, Arnoldo Frigessi +1 more · Insurance Mathematics and Economics · 2007 · 2,030 citations

RCT · High evidence score

Choice of futility boundaries for group sequential designs with two endpoints

Svenja Schüler, Meinhard Kieser, Geraldine Rauch · BMC Medical Research Methodology · 2017 · 43 citations

BACKGROUND: In clinical trials, the opportunity for an early stop during an interim analysis (either for efficacy or for futility) may relevantly save time and financial resources. This is especially important, if the planning assumptions required for power calculation are based on a low level of evidence. For example, when including two primary endpoints in the confirmatory analysis, the power of the trial depends on the effects of both endpoints and on their correlation. Assessing the feasibility of such a trial is therefore difficult, as the number of parameter assumptions to be correctly specified is large. For this reason, so-called 'group sequential designs' are of particular importance in this setting. Whereas the choice of adequate boundaries to stop a trial early for efficacy has been broadly discussed in the literature, the choice of optimal futility boundaries has not been investigated so far, although this may have serious consequences with respect to performance characteristics. METHODS: In this work, we propose a general method to construct 'optimal' futility boundaries according to predefined criteria. Further, we present three different group sequential designs for two endpoints applying these futility boundaries. Our methods are illustrated by a real clinical trial example and by Monte-Carlo simulations. RESULTS: By construction, the provided method of choosing futility boundaries maximizes the probability to correctly stop in case of small or opposite effects while limiting the power loss and the probability of stopping the study 'wrongly'. Our results clearly demonstrate the benefit of using such 'optimal' futility boundaries, especially compared to futility boundaries commonly applied in practice. 
CONCLUSIONS: As the properties of futility boundaries are often not considered in practice and unfavorably chosen futility boundaries may imply bad properties of the study design, we recommend assessing the performance of these boundaries according to the criteria proposed here.

Observational · Moderate

Optimal Dynamic Treatment Regimes

Susan A. Murphy · Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2003 · 1,040 citations

Summary: A dynamic treatment regime is a list of decision rules, one per time interval, for how the level of treatment will be tailored through time to an individual’s changing status. The goal of this paper is to use experimental or observational data to estimate decision regimes that result in a maximal mean response. To explicate our objective and to state the assumptions, we use the potential outcomes model. The method proposed makes smooth parametric assumptions only on quantities that are directly relevant to the goal of estimating the optimal rules. We illustrate the methodology proposed via a small simulation.

Study · Moderate

A Survey of Zero-shot Generalisation in Deep Reinforcement Learning

Robert Kirk, Amy Zhang, Edward Grefenstette +1 more · Journal of Artificial Intelligence Research · 2023 · 158 citations

The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.

Study · Moderate

Meaningful Explanations of Black Box AI Decision Systems

Dino Pedreschi, Fosca Giannotti, Riccardo Guidotti +3 more · Proceedings of the AAAI Conference on Artificial Intelligence · 2019 · 201 citations

Black box AI systems for automated decision making, often based on machine learning over (big) data, map a user’s features into a class or a score without exposing the reasons why. This is problematic not only for lack of transparency, but also for possible biases inherited by the algorithms from human prejudices and collection artifacts hidden in the training data, which may lead to unfair or wrong decisions. We focus on the urgent open challenge of how to construct meaningful explanations of opaque AI/ML systems, introducing the local-to-global framework for black box explanation, articulated along three lines: (i) the language for expressing explanations in terms of logic rules, with statistical and causal interpretation; (ii) the inference of local explanations for revealing the decision rationale for a specific case, by auditing the black box in the vicinity of the target instance; (iii) the bottom-up generalization of many local explanations into simple global ones, with algorithms that optimize for quality and comprehensibility. We argue that the local-first approach opens the door to a wide variety of alternative solutions along different dimensions: a variety of data sources (relational, text, images, etc.), a variety of learning problems (multi-label classification, regression, scoring, ranking), a variety of languages for expressing meaningful explanations, a variety of means to audit a black box.

Study · Moderate

A Gentle Introduction to Reinforcement Learning and its Application in Different Fields

Muddasar Naeem, Syed Tahir Hussain Rizvi, Antonio Coronato · IEEE Access · 2020 · 241 citations

Due to the recent progress in Deep Neural Networks, Reinforcement Learning (RL) has become one of the most important and useful technologies. It is a learning method where a software agent interacts with an unknown environment, selects actions, and progressively discovers the environment dynamics. RL has been effectively applied in many important areas of real life. This article intends to provide an in-depth introduction of the Markov Decision Process, RL and its algorithms. Moreover, we present a literature review of the application of RL to a variety of fields, including robotics and autonomous control, communication and networking, natural language processing, games and self-organized systems, scheduling management and configuration of resources, and computer vision.

Study · Moderate

Harms from Increasingly Agentic Algorithmic Systems

Alan Chan, Rebecca Salganik, Alva Markelius +19 more · 2023 · 100 citations

Research in Fairness, Accountability, Transparency, and Ethics (FATE) has established many sources and forms of algorithmic harm, in domains as diverse as health care, finance, policing, and recommendations. Much work remains to be done to mitigate the serious harms of these systems, particularly those disproportionately affecting marginalized communities. Despite these ongoing harms, new systems are being developed and deployed, typically without strong regulatory barriers, threatening the perpetuation of the same harms and the creation of novel ones. In response, the FATE community has emphasized the importance of anticipating harms, rather than just responding to them. Anticipation of harms is especially important given the rapid pace of developments in machine learning (ML). Our work focuses on the anticipation of harms from increasingly agentic systems. Rather than providing a definition of agency as a binary property, we identify 4 key characteristics which, particularly in combination, tend to increase the agency of a given algorithmic system: underspecification, directness of impact, goal-directedness, and long-term planning. We also discuss important harms which arise from increasing agency – notably, these include systemic and/or long-range impacts, often on marginalized or unconsidered stakeholders. We emphasize that recognizing agency of algorithmic systems does not absolve or shift the human responsibility for algorithmic harms. Rather, we use the term agency to highlight the increasingly evident fact that ML systems are not fully under human control. Our work explores increasingly agentic algorithmic systems in three parts. First, we explain the notion of an increase in agency for algorithmic systems in the context of diverse perspectives on agency across disciplines. Second, we argue for the need to anticipate harms from increasingly agentic systems. 
Third, we discuss important harms from increasingly agentic systems and ways forward for addressing them. We conclude by reflecting on implications of our work for anticipating algorithmic harms from emerging systems.

Study · Moderate

Filtering via Simulation: Auxiliary Particle Filters

M. Pitt, Neil Shephard · Journal of the American Statistical Association · 1999 · 2,261 citations

This article analyses the recently suggested particle approach to filtering time series. We suggest that the algorithm is not robust to outliers for two reasons: the design of the simulators and the use of the discrete support to represent the sequentially updating prior distribution. Here we tackle the first of these problems.

Study · Moderate

Bridging Direct and Indirect Data-Driven Control Formulations via Regularizations and Relaxations

Florian Dörfler, Jeremy Coulson, Ivan Markovsky · IEEE Transactions on Automatic Control · 2022 · 195 citations

In this article, we discuss connections between sequential system identification and control for linear time-invariant systems, often termed indirect data-driven control, as well as a contemporary direct data-driven control approach seeking an optimal decision compatible with recorded data assembled in a Hankel matrix and robustified through suitable regularizations. We formulate these two problems in the language of behavioral systems theory and parametric mathematical programs, and we bridge them through a multicriteria formulation trading off system identification and control objectives. We illustrate our results with two methods from subspace identification and control: namely, subspace predictive control and low-rank approximation, which constrain trajectories to be consistent with a nonparametric predictor derived from (respectively, the column span of) a data Hankel matrix. In both cases, we conclude that direct and regularized data-driven control can be derived as convex relaxation of the indirect approach, and the regularizations account for an implicit identification step. Our analysis further reveals a novel regularizer and a plausible hypothesis explaining the remarkable empirical performance of direct methods on nonlinear systems.
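For readers unfamiliar with the data structure at the centre of this abstract, the following Python sketch shows how recorded data can be assembled into a Hankel matrix. It assumes a scalar signal for brevity (the paper works with input/output trajectories), and the helper name `hankel` is illustrative rather than taken from the paper.

```python
def hankel(signal, depth):
    """Return the depth-row Hankel matrix of a scalar signal as a list of rows.

    Entry (i, j) is signal[i + j], so every anti-diagonal is constant and
    every column is one length-`depth` window of the recorded trajectory.
    """
    cols = len(signal) - depth + 1
    if depth < 1 or cols < 1:
        raise ValueError("depth must be between 1 and len(signal)")
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]


# Hypothetical recorded trajectory of six samples, windowed to depth 3.
recorded = [1, 2, 3, 4, 5, 6]
H = hankel(recorded, 3)
```

Direct data-driven methods of the kind discussed here constrain candidate trajectories to lie in the column span of such a matrix, whereas indirect methods first identify a model from the same data.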

Study · Leading journal · Moderate

Regimes of Expectations: An Active Inference Model of Social Conformity and Human Decision Making

Axel Constant, Maxwell J. D. Ramstead, Samuel P. L. Veissière +1 more · Frontiers in Psychology · 2019 · 174 citations

How do humans come to acquire shared expectations about how they ought to behave in distinct normalized social settings? This paper offers a normative framework to answer this question. We introduce the computational construct of 'deontic value' - based on active inference and Markov decision processes - to formalize conceptions of social conformity and human decision-making. Deontic value is an attribute of choices, behaviors, or action sequences that inherit directly from deontic cues in our econiche (e.g., red traffic lights); namely, cues that denote an obligatory social rule. Crucially, the prosocial aspect of deontic value rests upon a particular form of circular causality: deontic cues exist in the environment in virtue of the environment being modified by repeated actions, while action itself is contingent upon the deontic value of environmental cues. We argue that this construction of deontic cues enables the epistemic (i.e., information-seeking) and pragmatic (i.e., goal-seeking) values of any behavior to be 'cached' or 'outsourced' to the environment, where the environment effectively 'learns' about the behavior of its denizens. We describe the process whereby this particular aspect of value enables learning of habitual behavior over neurodevelopmental and transgenerational timescales.