Study · Wiki · Canonical · Moderate
Online controlled experiments at large scale
Ron Kohavi, Alex Deng, Brian Frasca +3 more · Knowledge Discovery and Data Mining · 2013 · 420 citations
Study · Preprint · Wiki · Moderate
Mind the Sim-to-Real Gap & Think Like a Scientist
Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas +1 more · 2026
Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.
Study · Preprint · Moderate
Prior-Free Sample Size Design for Test-and-Roll Experiments
Kentaro Kawato, Shosei Sakaguchi · 2026
This paper studies sample-size design for finite-population test-and-roll experiments, where a decision-maker first conducts an experiment on $m$ units and then assigns the remaining $N-m$ units to the treatment that performs better in the experiment. We consider welfare-aware sample-size choice, which involves an exploration-exploitation tradeoff: larger experiments improve the rollout decision but impose welfare losses on experimental units assigned to the inferior treatment. We show that the standard absolute minimax regret criterion can lead to implausibly small experiments by over-penalizing exploration in its worst-case objective. To address this limitation, we propose the Worst-case Marginal Benefit (WMB) rule, which compares the worst-case marginal benefit of adding one more matched pair to the experiment with the corresponding marginal exploration cost. We establish a simple rule-of-thirds benchmark. For Bernoulli outcomes, after excluding pathological cases, the WMB criterion yields the optimal sample size of $m \approx N/3$ through a Gaussian approximation. For Gaussian outcomes with a known common variance, the same benchmark arises exactly. These results provide a prior-free and practically implementable guide for welfare-based sample-size design.
Study · Preprint · Wiki · Moderate
Designing Persuasive Experiments
Karun Adusumilli, Abhi Vemulapati · 2026
Incentives in experimental design are often misaligned: experimenters design and finance experiments to seek regulatory approval, while regulators seek to maximize social-welfare. We propose a framework to resolve this conflict, wherein regulators set a minimum expected welfare threshold, and experimenters optimize designs subject to this constraint. It requires no knowledge of experimenters' private preferences or costs and mitigates strategic Bayesian persuasion. Under normal priors, sampling according to the Neyman-allocation is always optimal, independent of the specific objectives. Furthermore, we characterize the optimal stopping-rule. In a numerical study calibrated to historical clinical-trial data, our framework reduces expected sample-sizes by over 48% relative to classical designs that attain the same social-welfare.
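For readers unfamiliar with the term, the classical Neyman allocation assigns units to each arm in proportion to that arm's outcome standard deviation. The sketch below illustrates only this textbook rule for a two-arm difference in means; it is not the paper's Bayesian design or stopping rule, and the function names and example numbers are hypothetical.

```python
def neyman_allocation(n, sigma_treat, sigma_ctrl):
    """Split n units in proportion to each arm's standard deviation."""
    share = sigma_treat / (sigma_treat + sigma_ctrl)
    n_treat = round(n * share)
    return n_treat, n - n_treat


def diff_in_means_variance(n_treat, n_ctrl, sigma_treat, sigma_ctrl):
    """Variance of the difference in sample means for independent arms."""
    return sigma_treat ** 2 / n_treat + sigma_ctrl ** 2 / n_ctrl


# Hypothetical example: the treatment arm is twice as noisy as control.
n_t, n_c = neyman_allocation(300, 2.0, 1.0)
var_neyman = diff_in_means_variance(n_t, n_c, 2.0, 1.0)
var_equal = diff_in_means_variance(150, 150, 2.0, 1.0)
print(n_t, n_c, var_neyman, var_equal)
```

With these illustrative inputs the noisier arm receives two thirds of the sample, and the resulting estimator variance is lower than under an equal split.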
Study · Preprint · Moderate
Improving Sensitivity in A/B Tests: Integrating CUPED with Trimmed Mean Techniques
Kevin Charette, Tristan Boudreault · 2025
Accurate estimation of treatment effects in online A/B testing is challenging with zero-inflated and skewed metrics. Traditional tests, like Welch's t-test, often lack sensitivity with heavy-tailed data due to their reliance on means, as opposed to e.g., percentiles. The Controlled Experiments Using Pre-experiment Data (CUPED) technique improves sensitivity by reducing variance, yet that variance reduction is insufficient for highly skewed metrics. Alternatively, Yuen's t-test uses trimmed means to robustly handle outliers and skewness. This paper introduces a method that combines the variance reduction of CUPED with the robustness of Yuen's t-test to enhance hypothesis testing sensitivity. Our novel approach integrates trimmed data in a principled manner, offering a framework that balances variance reduction with robust location measures. We demonstrate improved detection of significant effects with smaller sample sizes, enabling quicker experimental decisions without sacrificing statistical power. This work broadens the utility of controlled experiments in environments characterized by highly skewed or high-variance data.
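The two ingredients named in the abstract can be sketched separately. The code below shows the standard CUPED adjustment (subtracting a scaled, centered pre-experiment covariate) and a plain symmetric trimmed mean. It does not reproduce the paper's combined estimator, whose details are not given here; the synthetic data, function names, and parameters are assumptions for illustration.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of values."""
    k = int(proportion * len(values))
    kept = sorted(values)[k:len(values) - k]
    return statistics.fmean(kept)


# Synthetic, skewed pre-period metric and a correlated in-experiment metric.
rng = random.Random(0)
pre = [rng.expovariate(1.0) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0, 0.5) for p in pre]

adjusted = cuped_adjust(post, pre)
var_raw = statistics.variance(post)
var_cuped = statistics.variance(adjusted)
print(var_raw, var_cuped, trimmed_mean(post, 0.05))
```

The adjustment leaves the sample mean unchanged while shrinking the variance by roughly the squared correlation between the two metrics, which is the source of CUPED's sensitivity gain.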
Study · Preprint · Wiki · Moderate
Valuing Winners: When and How to Correct for Selection Bias in Randomized Experiments
Ron Berman, Walter W. Zhang, Hangcheng Zhao · 2026
Decision-makers often deploy the best-performing treatment from a randomized experiment, creating a winner's curse: selection favors treatments whose observed outcomes are high partly because of statistical noise, so the naïve estimate of the winner is upward biased. We distinguish two forms of winner's curse, bias relative to the true best treatment (global) and bias relative to the selected treatment's true mean (selective), and link them to regret from deploying a suboptimal treatment. This framework defines seven decision-relevant evaluation targets: mean bias, mean squared error, and confidence interval coverage for the global and selective winner's curse, and mean regret. We then show that methods that perform well on one target can perform poorly on others, so corrections should be matched to the manager's objective. Across simulations with varying effect sizes, multiple-arm settings, and data calibrated to an online A/B testing platform, no method dominates uniformly: the plug-in estimator performs best when treatment differences are large, cross-fitting performs best when treatments are similar, and resampling methods often achieve low mean squared error for moderate differences. We also introduce an adaptive empirical likelihood procedure that delivers asymptotically valid confidence intervals across settings without the tuning sensitivity of resampling-based methods.
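The selective and global forms of the winner's curse, and their link to regret, can be illustrated with a small Monte Carlo simulation. This is a minimal sketch under assumed Gaussian noise with a common standard error; the arm means, standard error, and function name are hypothetical and are not taken from the paper's simulations.

```python
import random
import statistics


def simulate_winner_bias(true_means, se, reps, seed=0):
    """Average selective bias, global bias, and regret of the naive winner."""
    rng = random.Random(seed)
    best_true = max(true_means)
    selective, global_bias, regret = [], [], []
    for _ in range(reps):
        observed = [rng.gauss(mu, se) for mu in true_means]
        winner = max(range(len(observed)), key=observed.__getitem__)
        # Selective: naive estimate minus the selected arm's true mean.
        selective.append(observed[winner] - true_means[winner])
        # Global: naive estimate minus the true best arm's mean.
        global_bias.append(observed[winner] - best_true)
        # Regret: loss from deploying the selected arm instead of the best.
        regret.append(best_true - true_means[winner])
    return (statistics.fmean(selective),
            statistics.fmean(global_bias),
            statistics.fmean(regret))


similar = simulate_winner_bias([0.0, 0.0, 0.0, 0.0], se=1.0, reps=20000)
separated = simulate_winner_bias([0.0, 5.0], se=1.0, reps=20000)
print(similar, separated)
```

When the arms are identical the naive estimate of the winner is biased upward by about one standard error, whereas with well-separated arms the bias is negligible, consistent with the abstract's remark that the plug-in estimator does well when differences are large.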
Study · Preprint · Wiki · Moderate
EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
Eliseo Curcio · 2026 · 0 citations
Selecting the right electricity market region for a hyperscale AI datacenter requires reasoning across live electricity prices, grid carbon intensity, technology cost trajectories, and causal grid dynamics -- a multi-step, multi-source analytical task that static knowledge benchmarks cannot evaluate. We introduce EnergyAgentBench, the first agentic benchmark grounded in live electricity market data for this problem class. The benchmark comprises 70 task variants across five families: datacenter siting under cost-carbon trade-offs (F1), long-horizon portfolio siting (F1-LH), lifetime LCOE ranking over multi-decade cost trajectories (F2), 30-year portfolio optimization (F2-LH), and causal grid diagnosis (F3). Tasks require 3 to 48 sequential tool calls against live endpoints from the QuarluxAI infrastructure platform, the U.S. Energy Information Administration (EIA), and the National Renewable Energy Laboratory (NREL) with ground truth derived from trained XGBoost cost-surface models (R^2 0.967--0.995) and the NREL Annual Technology Baseline 2024. We evaluate nine models across Anthropic, OpenAI, and HuggingFace over 1,414 runs at three random seeds. Claude Sonnet 4.6 achieves the highest overall score (0.900) at one-quarter the cost of Claude Opus 4.7 (0.889). Claude Haiku 4.5 leads on long-horizon procedural siting (0.986), outperforming all frontier models including those costing 16x more per run. F3 Causal is the most discriminating family, with a 30.7-point spread between Sonnet (0.793) and Llama 3.3 70B (0.486), versus a 6.6-point spread on F1 Siting. A failure taxonomy of 135 coded failures identifies null-value integration in NREL ATB trajectories as the dominant failure mode (70%), followed by premature commitment on causal tasks (20%) and adversarial injection blindness (6%). Benchmark code, run trajectories, and the failure taxonomy dataset are publicly released.
Study · Preprint · Wiki · Moderate
Variance Reduction for Expectations with Diffusion Teachers
Jesse Bettencourt, Xindi Wu, Matan Atzmon +2 more · 2026
Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
Study · Preprint · Wiki · Moderate
TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering
Kexin Xie, Ryan Lekivetz, Xinwei Deng · 2026
Modern experimental designs often face the so-called treatment cardinality constraint, which is the constraint on the number of included factors in each treatment. Experiments with such constraints are commonly encountered in engineering simulation, AI system tuning, and large-scale system verification. This calls for the development of adequate designs to enable statistical efficiency for modeling and analysis within feasible constraints. In this work, we study two-level designs under this $k$-treatment cardinality constraint (TCARD), where the design matrix $\mathbf{X} \in \{0,1\}^{n \times p}$ has constant row sums equal to $k$. Although TCARDs are closely related to balanced incomplete block designs (BIBDs), exact BIBD structure is unavailable for many practical $(n,p,k)$ combinations. This leads to the notion of nearly balanced TCARDs, which we prove minimize the first two components of the generalized word-length pattern. We also show that good projection behavior in this setting is governed by two count-based regularities: balanced factor replications and uniform pairwise concurrences. Motivated by this characterization, we then propose the Balanced Concurrence Deviation ($\Phi_{\mathrm{BCD}}$), a model-free objective that jointly penalizes replication imbalance and concurrence dispersion. We further show that this criterion is closely connected to classical optimality principles, including $(M,S)$-optimality, centered $\mathrm{UE}(s^2)$ criterion, and Bayesian $D$-optimality. To construct designs minimizing $\Phi_{\mathrm{BCD}}$, we develop a coordinate-exchange (CE) algorithm with efficient incremental updates, together with a simulation-based procedure for calibrating the criterion weights to the intended downstream task. Numerical experiments confirm that the proposed method compares favorably with existing alternatives across a range of problem sizes and constraint strengths.
Study · Preprint · Wiki · Moderate
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
Jikai Jin, Vasilis Syrgkanis · 2026 · 0 citations
Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
Study · Preprint · Moderate
Finite Population Sampling as n to N: Empirical Evidence for the Transition from Inference to Accuracy
Mike Crowhurst · 2026
The Central Limit Theorem provides a foundation for inferential statistics and hypothesis testing. It describes how standardized statistics behave under repeated sampling from large populations. However, if the size of the sample (n) becomes so large that it approaches the size of the population (N), sampling variability becomes very small, and standard errors and margins of error both approach zero. The purpose of this project was to investigate the behavior of estimators as the sampling fraction (f = n/N) approaches 1, motivated by modern data streams from administrative records, transaction logs, sensor systems, and institutional databases that capture large portions of finite populations. We constructed two finite populations with known parameters and drew repeated samples across a range of sampling fractions. We then examined the resulting randomization distributions of the sample mean to understand how sampling variability collapses. Additional experiments were conducted using various CPU- and GPU-based methods to evaluate the deviation of the sample mean from the defined population mean under different computational conditions. The results confirm that sampling variability diminishes as expected under finite population theory and becomes negligible well before full enumeration is reached. Once sampling variability is minimized, remaining deviations in estimators are primarily related to numerical precision and computational structure rather than random sampling. These findings support a reassessment of inferential assumptions in high-coverage, large-scale data settings.
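The collapse of sampling variability as f = n/N approaches 1 follows from the standard finite population correction for simple random sampling without replacement. The sketch below compares that textbook formula with a small simulation; the synthetic population, its size, and the number of repetitions are assumptions and do not correspond to the populations constructed in the study.

```python
import math
import random
import statistics


def fpc_standard_error(sigma, n, N):
    """SE of the sample mean under simple random sampling without replacement.

    sigma is the population standard deviation computed with divisor N.
    """
    return sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))


rng = random.Random(0)
N = 1000
population = [rng.gauss(50, 10) for _ in range(N)]
sigma = statistics.pstdev(population)

results = {}
for f in (0.1, 0.5, 0.9, 1.0):
    n = round(f * N)
    # Repeatedly sample without replacement and record the sample mean.
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(500)]
    results[f] = (fpc_standard_error(sigma, n, N), statistics.pstdev(means))

for f, (theory, empirical) in results.items():
    print(f, theory, empirical)
```

At f = 1 the formula gives exactly zero, and any residual spread in the computed means can only come from floating-point arithmetic, which is the regime the abstract attributes to numerical precision and computational structure.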
RCT · Preprint · Wiki · Moderate
Assessing Estimate of CATE from Observational Data via an RCT Study
Bosen Cui, Yuhong Yang · 2026
Conditional average treatment effects (CATEs) are increasingly estimated from observational data and used to guide policy and individualized treatment decisions. Before such estimates can be trusted in practice, their predictive fitness needs to be assessed, yet observational data alone offer limited opportunities for doing so. We propose CATE Assessment via Fitness Evaluation (CAFE), a formal framework for directly assessing the goodness-of-fit of a CATE estimate learned from observational data, rather than the full underlying outcome model, using evidence from a randomized trial. CAFE partitions the trial covariate space according to estimated propensity scores (or the like) and compares observationally derived conditional treatment effects with group-level experimental averages. The framework accommodates a broad class of CATE learners, including parametric models and flexible machine learning methods such as causal forest and boosting. We establish theoretical guarantees under both the null and alternative hypotheses, and introduce a maximum-type extension to improve sensitivity to localized lack of fit. When both randomized trial and observational data are available, we further develop a two-stage procedure to detect the existence of unobserved confounders. Extensive numerical studies show the utility of the CAFE approach when assessing observational-derived CATE estimates.
Study · Preprint · Wiki · Moderate
A Goodness-of-Fit Test for Independent Component Models in High Dimensions
Mingshuo Liu, Siyao Wang, Miles E. Lopes · 2026
Independent component (IC) models are a standard tool for representing multivariate data in statistics, signal processing, and machine learning. Despite the extensive use of IC models, much less attention has been given to goodness-of-fit tests for assessing their compatibility with data. We develop the first goodness-of-fit test for IC models that is supported by a theoretical validity guarantee when the data dimension and sample size diverge proportionally. This is made possible by the fact that the test does not rely on a pre-whitening step, which often limits the applicability of other goodness-of-fit tests in high dimensions. Our theoretical analysis is complemented with numerical experiments that demonstrate the test's size control and power under a range of conditions. In addition, we provide examples involving gene-expression data to illustrate that the test has potential for effective diagnostic use in practice.
Study · Preprint · Wiki · Moderate
Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization
Aurélien Pion, Emmanuel Vazquez · 2026
Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for predictive reliability below $t$ is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded $\mu$-calibration on sublevel sets of the form $\{x\in\mathbb{X}, f(x)\le t\}$. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below $t$, and we show that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and globally calibrated GP models.
Study · Preprint · Moderate
Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation
Serhii Zabolotnii · 2026
Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22--36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.
Study · Preprint · Moderate
The Spatial Cramér--von Mises Test of Independence under $\beta$-Mixing: Asymptotic Theory and Python Implementation
Marco Mandap · 2026
We derive the asymptotic distribution of the spatial Cramér--von Mises statistic for testing bivariate independence in stationary random fields on $\mathbb{R}^2$ under polynomial $\beta$-mixing dependence, and document the Python implementation that reproduces all simulation results. The classical test assumes i.i.d. observations; we extend it to spatially dependent data by combining three ingredients: (i) a Davydov-type covariance bound yielding integrability of the spatial covariance kernel under $\theta > 2(2+\delta)/\delta$; (ii) a reformulation of the inner-form test statistic as a degenerate U-statistic of order 2 with product kernel $Q = G_1 \otimes G_2$, following De Wet (1980); and (iii) an extension of Gregory's (1977) U-statistic limit theorem to $\beta$-mixing sequences via Yoshihara (1976). The limit distribution is a weighted sum of correlated $\chi^2_1$ variables whose eigenvalues factor as products of marginal eigenvalues; in the small-bandwidth limit the correlation vanishes and the limit reduces to the classical i.i.d. form. Explicit eigenvalue formulas are given for three weight functions (uniform, optimal normal, Anderson--Darling), producing computable critical values. The software generates Matérn random fields by circulant embedding, computes the test statistic via the inner-form kernel decomposition, evaluates asymptotic critical values by Monte Carlo, and runs permutation-based alternatives. Simulation experiments show that the Anderson--Darling weight achieves the best power, while the Mantel and cross-$K$ tests have no power against cross-dependence in spatially correlated fields.
Observational · Preprint · Wiki · Moderate
The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study
Victoria Lin, Taedong Yun, Maja Matarić +3 more · 2026
Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.
Study · Preprint · Moderate
$2B$ or Not $2B$: A Tale of Three Algorithms for Streaming: Covariance Estimation after Welford and Chan-Golub-LeVeque
Felix Reichel · 2026
We place three algorithms for computing the unbiased sample covariance matrix in streaming and distributed settings on a common algebraic, numerical, and statistical foundation. The Gram algorithm, derived from the variance reformulation, maintains the running cross-product matrix $G_t = \sum_{i=1}^t x_i x_i^\top$ and the column-sum vector $s_t = \sum_{i=1}^t x_i$, yielding the unbiased covariance estimator $S_t = (t-1)^{-1}(G_t - t^{-1}s_t s_t^\top)$ in $O(p^2)$ time per update. The Welford algorithm propagates a running mean $m_t$ and outer-product corrections $M_t$, with updates $m_t = m_{t-1} + (x_t - m_{t-1})/t$ and $M_t = M_{t-1} + (x_t - m_{t-1})(x_t - m_t)^\top$, achieving the same asymptotic cost with improved numerical stability under large data shifts. The Chan-Golub-LeVeque algorithm supports block-parallel merging through the exact identity $M = M_A + M_B + \frac{n_A n_B}{n_A+n_B}(m_B - m_A)(m_B - m_A)^\top$, making it the natural choice for distributed and map-reduce architectures. All three algorithms produce the same estimator $S_t = M_t/(t-1)$ in exact arithmetic, although their finite-precision behavior differs markedly. Beyond runtime and numerical comparisons, we introduce a conformal prediction framework for streaming covariance estimation that yields finite-sample, distribution-free confidence sets $C_{t,jk}$ for each entry $S_{t,jk}$ of the covariance matrix at any step $t$ of the data stream. Experiments confirm that the Gram algorithm is fastest for batch computation, Welford is uniquely robust to catastrophic cancellation under large mean shifts, CGL is optimal for distributed settings, and conformal intervals achieve the nominal coverage level across all three algorithms.
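The three update rules quoted in the abstract can be written out directly. The sketch below is a plain-Python transcription of those formulas (Gram, Welford, and the Chan-Golub-LeVeque merge) on small synthetic data; it is an illustration under assumed data and helper names, not the authors' implementation, and it omits the conformal intervals entirely.

```python
import random


def gram_cov(xs):
    """Gram form: S_t = (G_t - s_t s_t^T / t) / (t - 1)."""
    p = len(xs[0])
    G = [[0.0] * p for _ in range(p)]
    s = [0.0] * p
    t = 0
    for x in xs:
        t += 1
        for j in range(p):
            s[j] += x[j]
            for k in range(p):
                G[j][k] += x[j] * x[k]
    return [[(G[j][k] - s[j] * s[k] / t) / (t - 1) for k in range(p)]
            for j in range(p)]


def welford_state(xs):
    """Return (n, mean, M) using the running-mean update from the abstract."""
    p = len(xs[0])
    n, m = 0, [0.0] * p
    M = [[0.0] * p for _ in range(p)]
    for x in xs:
        n += 1
        d_old = [x[j] - m[j] for j in range(p)]      # x_t - m_{t-1}
        m = [m[j] + d_old[j] / n for j in range(p)]  # m_t
        d_new = [x[j] - m[j] for j in range(p)]      # x_t - m_t
        for j in range(p):
            for k in range(p):
                M[j][k] += d_old[j] * d_new[k]
    return n, m, M


def cgl_merge(a, b):
    """Merge two (n, mean, M) blocks with the Chan-Golub-LeVeque identity."""
    nA, mA, MA = a
    nB, mB, MB = b
    n = nA + nB
    p = len(mA)
    d = [mB[j] - mA[j] for j in range(p)]
    w = nA * nB / n
    m = [mA[j] + d[j] * nB / n for j in range(p)]
    M = [[MA[j][k] + MB[j][k] + w * d[j] * d[k] for k in range(p)]
         for j in range(p)]
    return n, m, M


def cov_from_state(state):
    """Unbiased covariance S = M / (n - 1)."""
    n, _, M = state
    return [[v / (n - 1) for v in row] for row in M]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

S_gram = gram_cov(data)
S_welford = cov_from_state(welford_state(data))
S_cgl = cov_from_state(cgl_merge(welford_state(data[:80]),
                                 welford_state(data[80:])))
print(max_abs_diff(S_gram, S_welford), max_abs_diff(S_welford, S_cgl))
```

On well-centered data like this the three results agree to rounding error. Adding a large constant to every observation is a simple way to see the Gram form lose digits to cancellation while the Welford and merged estimates stay close to each other.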
Study · Preprint · Wiki · Moderate
Causal Inference with Categorical Unobserved Confounder via Mixture Learning
Aytijhya Saha, Stephen Bates, Devavrat Shah · 2026
Unobserved confounding is a fundamental challenge for estimating causal effects. To address unobserved confounding, recent literature has turned to two different approaches -- proxy variables and the use of multiple treatments. The first approach, commonly referred to as proximal causal inference, requires proxies to be assigned to specific asymmetric roles: treatment-inducing proxies (negative control exposures), variables that act as common causes of the treatment and outcome, and outcome-inducing proxies (negative control outcomes). In practice, however, identifying variables that satisfy these asymmetric roles can be difficult depending on the application domain. The second approach, commonly referred to as the "Deconfounder," deals with multiple conditionally independent treatments. There has been limited progress towards developing a consistent estimation method for this setting. As the primary contribution of this work, we establish that causal effects are identifiable in both settings when the unobserved confounder is categorical under suitable conditions. Our approach builds on a mixture learning perspective: we show that the underlying confounding structure can be recovered by identifying the corresponding mixture distribution. We propose an estimation procedure based on tensor decomposition, which allows consistent recovery of the latent structure and comes with non-asymptotic guarantees. Simulation studies and real data experiments demonstrate that the proposed method performs well even with limited data.
Study · Preprint · Moderate
Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting
Max Kleinebrahm, Jonathan Berrisch, Philipp Eiser +11 more · 2026
Energy forecasting research faces a persistent comparability gap that makes it difficult to measure consistent progress over time. Reported accuracy gains are often not directly comparable because models are evaluated under study-specific datasets, time periods, information sets, and scoring setups, while widely used benchmarks and competition datasets are typically tied to fixed historical windows. This paper introduces the Energy-Arena, a dynamic benchmarking platform for operational energy time series forecasting that provides a continuously updated reference point as energy systems evolve. The platform operates as an open, API-based submission system and standardizes challenge definitions and submission deadlines aligned with operational constraints. Performance is reported on rolling evaluation windows via persistent leaderboards. By moving from retrospective backtesting to forward-looking benchmarking, the Energy-Arena enforces standardized ex-ante submission and ex-post evaluation, thereby improving transparency by preventing information leakage and retroactive tuning. The platform is publicly available at Energy-Arena.org.
RCT · Preprint · Wiki · Moderate
Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making
Abhirami Pillai · 2026
Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline - first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluated on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.
Study · Preprint · Moderate
Auditing Marketing Budget Allocation with Hindsight Regret
Nilavra Pathak, Olivier Jeunen, Eric Lambert · 2026
Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized allocations were close to the best feasible choices in hindsight. We present a retrospective auditing framework based on hindsight regret, defined as the opportunity cost of the realized allocation relative to a constraint-faithful benchmark under the same budget and stability guardrails. The framework estimates regime-specific spend--response functions from historical logs, computes feasible hindsight allocations via constrained optimization, and propagates uncertainty through Monte Carlo evaluation to produce regret distributions, expected lift, and probability-of-improvement summaries. This separates allocation inefficiency from uncertainty in the estimated response surfaces. Experiments on real marketing allocation logs show that the framework yields interpretable post-hoc diagnostics and reveals a practical trade-off between allocation flexibility and detectability: moderate feasible reallocations often capture most measurable gain, while larger shifts move into weak-support regions with higher uncertainty. The result is a practical method for auditing historical budget decisions when online experimentation is costly or infeasible.
Study · Preprint · Wiki · Moderate
Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference
Ziang Song, Ying Jin, Emmanuel J. Candès · 2026
Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini-Hochberg procedure. This approach fails to provide high-probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite-sample, distribution-free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high-probability envelope for the empirical distribution function of null conformal p-values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope's shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real-data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.
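As background for the baseline this abstract describes, the following sketch computes conformal p-values from calibration scores, selects with the Benjamini-Hochberg procedure, and evaluates the realized FDP. It illustrates only the standard expected-FDP approach, not the paper's simultaneous bounds. The function names and the convention that larger scores are more atypical are assumptions made for illustration.

```python
import numpy as np

def conformal_p_values(calib_scores, test_scores):
    """p_j = (1 + #{i : calib_i >= test_j}) / (n + 1), with larger scores more atypical."""
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = calib.size
    # searchsorted(side="left") gives the index of the first calibration score >= test_j.
    n_greater_equal = n - np.searchsorted(calib, test, side="left")
    return (1.0 + n_greater_equal) / (n + 1.0)

def benjamini_hochberg(p_values, alpha=0.1):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = int(np.max(np.nonzero(below)[0])) + 1  # largest rank passing the threshold
    return np.sort(order[:k])

def false_discovery_proportion(selected, is_null):
    """Fraction of selected items that are true nulls (0 if nothing is selected)."""
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0
    return float(np.mean(np.asarray(is_null, dtype=bool)[selected]))

p_demo = conformal_p_values([1.0, 2.0, 3.0, 4.0], [5.0, 2.5])
selected_demo = benjamini_hochberg([0.01, 0.5, 0.02, 0.9], alpha=0.1)
fdp_demo = false_discovery_proportion(selected_demo, [False, True, True, True])
```

Benjamini-Hochberg controls the FDP only in expectation, so `fdp_demo` for a single data set can exceed `alpha`. That gap is what motivates high-probability bounds on the realized FDP.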
Study · Preprint · Wiki · Moderate
Stable direct estimation for GPLSIAMs using P-splines with dynamically updated boundaries
Danilo V. Silva, Gilberto A. Paula · 2026
Generalized partially linear single-index additive models (GPLSIAMs) have been increasingly applied across diverse areas due to their versatility in integrating functional flexibility with parametric dimension reduction while maintaining interpretability. However, the estimation presents severe computational challenges. This paper introduces a novel stable method that uses the model matrix for each single-index effect, defined by its single-index coefficients, and the penalized complete Fisher information matrix to dynamically update the boundaries of the single-index covariates within a unified iterative framework. The derived model matrices enable the fast computation of the estimated effective degrees of freedom and pointwise confidence bands for the single-index effects. The smoothing parameter updates are integrated into the iterative process via the generalized Fellner-Schall method, which recycles the derived matrix decompositions, thereby providing an efficient approximation to the global penalized optimization problem. Simulation studies with moderate sample sizes under non-Gaussian distributions confirm the empirical consistency of the estimation across multiple scenarios. Notably, the proposed approach remains stable where state-of-the-art competitive methods fail to recover true single-index coefficients and nonlinear functions, and is 80.13 times faster than the usual two-step method in the most computationally intensive scenario. The modeling advantage is illustrated through an application to Capital Bike Sharing data, where we deal with a single-index interaction effect for each year, with distinct single-index coefficients, a complex structure that makes competitive methods inapplicable. The proposed method is implemented in R, with functions available for reproducibility and transparency in the comparisons.
Study · Preprint · Wiki · Moderate
Partial Identification of the Valuation Distribution in Sequential English Auctions
Dongwoo Kim, Kyoo il Kim, Pallavi Pal · 2026
This paper extends the incomplete model of Haile and Tamer (2003) from static English auctions to sequential English auctions. Because bidders may wait for future opportunities, the static condition that bidders do not let rivals win at beatable prices need not hold. We replace it with a dynamic opportunity-cost restriction, yielding nonparametric valuation bounds without solving a dynamic equilibrium. Sharp bounds are also characterized. We propose a novel moment-condition inversion estimator that pools auctions with heterogeneous bidder counts, mitigating finite-sample instability of order statistics approaches and admitting analytical standard errors and smooth confidence intervals. Applications to Korean wholesale used-car auctions and Cars and Bids online auctions deliver informative bounds. Counterfactual analyses show that the option to wait lowers first-period revenue by 8--11% in the Korean market, that increasing effective competition from 8 to 20 serious bidders in Cars and Bids raises seller revenue by 40--65%, and that maximin reserve prices vary substantially across vehicle clusters.
Study · Preprint · Wiki · Moderate
Selecting Informative Conformal Prediction Sets with an Optimized FCR-Controlled Approach
Israela Solomon, Etienne Roquain, Saharon Rosset +1 more · 2026
Conformal methods provide prediction sets for outcomes with confidence guarantees. We study their use in a selective inference setting, where inference is performed only when the prediction set is informative. The analyst may consider as informative, for example, cases with prediction sets that are sufficiently small, exclude null values, or satisfy other appropriate monotone constraints. Because inference is typically restricted to informative cases in practical applications, accounting for the resulting selection bias is crucial to maintaining false coverage rate (FCR) control. A general framework for constructing such informative conformal prediction sets while controlling the FCR on the selected sample was suggested in Gazin et al. (2025). In this work we focus on oracle-guided procedures. We derive the optimal decision policy under a suitable power objective in the oracle setting where the probability of belonging to each prediction set can be computed. In practice, of course, only estimated probabilities are available. We therefore introduce a calibration procedure that adjusts the oracle policy to maintain finite sample FCR control. We show that this approach can achieve substantially higher power than available alternatives. We demonstrate the effectiveness of our new methods for classification outcomes on both real and simulated data.
Study · Preprint · Wiki · Moderate
Actor-Critic with Active Importance Sampling
Majid Molaei, Gabor Paczolay, Matteo Papini +2 more · 2026
This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
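The unbiasedness property mentioned in this abstract rests on standard importance weighting, which can be sketched as follows. This is not AISAC itself: it shows only how samples drawn from a Gaussian behavior distribution are reweighted by the density ratio to estimate an expectation under a different Gaussian target. The function names and the chosen means and standard deviations are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_sampling_estimate(f, target, behavior, n_samples=200_000, seed=0):
    """Estimate E_target[f(a)] from actions sampled under the behavior distribution."""
    rng = np.random.default_rng(seed)
    (mu_t, sd_t), (mu_b, sd_b) = target, behavior
    actions = rng.normal(mu_b, sd_b, size=n_samples)
    # Density ratio target/behavior corrects for sampling from the wrong distribution.
    weights = gaussian_pdf(actions, mu_t, sd_t) / gaussian_pdf(actions, mu_b, sd_b)
    return float(np.mean(weights * f(actions)))

# E[a^2] under N(0, 1) equals 1; the behavior distribution is shifted and wider.
estimate = importance_sampling_estimate(
    lambda a: a ** 2, target=(0.0, 1.0), behavior=(0.5, 1.5)
)
```

The estimator stays unbiased for any behavior distribution that covers the target's support, while its variance depends on the choice of behavior distribution. That dependence is the quantity a method of this kind would optimize.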
Study · Preprint · Moderate
Hall-Like Transversal Stress and Sandpile Criticality on Real Production Networks
Diego Vallarino · 2026 · 0 citations
This paper develops a Hall-Sandpile model of economic instability that combines a Hall-like transversal stress mechanism with sandpile threshold dynamics on a real production-network substrate. In analogy with the physical Hall effect, where exposed flows under an external field generate stress in a transversal direction, we model economic shocks as fields that act on flow-intensive, low-redundancy, low-capacity nodes and produce systemic stress through a multiplicative conversion function. The accumulated stress drives a discrete toppling rule and avalanche dynamics whose effective activation threshold declines with transversal exposure. The model is calibrated on annual World Input--Output Database (WIOD) production networks for 2000--2014 and simulated on the 2014 substrate (2,283 country--sector nodes) under three alternative propagation normalisations to avoid mechanical near-criticality from row-stochastic operators. Controlled Monte Carlo experiments over external field intensity and redundancy stress generate four ordered regimes: stable absorption, latent fragility, critical transition, and avalanche regime. Mean avalanche size and the probabilities of finite-size systemic events $\Pr(S \geq 5)$, $\Pr(S \geq 10)$ and $\Pr(S \geq 20)$ rise jointly with field intensity and redundancy stress. Tail diagnostics show regime-dependent thickening of the avalanche distribution, but the estimated tail indices remain too high to interpret as evidence of universal power-law criticality. The contribution is therefore a finite-size, real-network description of how transversal stress activates structural fragility, not a claim of self-organised criticality in the global economy.
Study · Preprint · Wiki · Moderate
Evaluation of the number of clusters in a data set using $p$-values from Multiple Tests of Hypotheses
Soumita Modak · 2026 · 4 citations
This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether any groups exist in a given data set and, if so, how many. It is a cluster accuracy index applicable to data sets of arbitrary dimension, in association with any clustering algorithm that requires the number of groups to be specified a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. Their $p$-values are combined to reach a decision, which is taken in a step-wise process over possible numbers of clusters. The procedure reduces unnecessary computation compared with other accuracy measures from the literature. A data study establishes the proposed index's efficiency and superiority.
Study · Preprint · Wiki · Moderate
Missing data and cluster graphs: cluster-level missingness vs variable-level missingness
Willow Scott, Eugenio Valdano, Charles Assaad · 2026
Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models even though, in many applications, only coarse structural information is available, for instance when variables are grouped into clusters due to limited knowledge or for interpretability reasons. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution as well as graphical conditions for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference, and when finer-grained modeling is necessary.
Study · Preprint · Wiki · Moderate
Regret Equals Covariance: A Closed-Form Characterization for Stochastic Optimization
Irene Aldridge · 2026 · 0 citations
Regret is the cost of uncertainty in algorithmic decision-making. Quantifying regret typically requires computationally expensive simulation via Sample Average Approximation (SAA), with complexity $\mathcal{O}(Bn^{2}d^{3})$ in the number of scenarios $B$, variables $n$, and constraints $d$. This paper proves that expected regret in any stochastic optimization problem admits the exact decomposition $\mathrm{Regret}(c) = \mathrm{Cov}(c,\,\pi^{*}(c)) + R(c)$, where $c$ is the vector of uncertain parameters, $\pi^{*}(c)$ is the optimal decision, and $R(c)$ is a residual whose magnitude we bound explicitly under Lipschitz, smooth, and strongly convex conditions. For linear programs and unconstrained quadratic programs, including the classical Markowitz portfolio problem, we prove $R(c)=0$ exactly, so that $\mathrm{Regret}(c) = \mathrm{Cov}(c,\,\pi^{*}(c))$ holds without approximation. When historical cost-decision pairs $\{(c_i, \pi^{*}(c_i))\}$ are available, the covariance can be estimated in $\mathcal{O}(nd^{2})$ time, which is orders of magnitude faster than SAA. The estimation is performed by a single pass through the data. We derive concentration bounds, a central limit theorem, and an asymptotically unbiased residual estimator, and we validate all results on synthetic LP, QP, and integer programming instances and on a rolling-window portfolio experiment using ten years of CRSP equity data.
Observational · Preprint · Wiki · Moderate
Targeted maximum likelihood estimation of vaccine effectiveness and immune correlates in test-negative design studies with missing data
Leah I. B. Andrews, Lars van der Laan, Peter B. Gilbert · 2026
The test-negative design (TND) is a resource-efficient observational study design that can assess vaccine effectiveness and exposure-proximal immune correlates of disease. The TND enrolls symptomatic individuals seeking diagnostic testing and compares case status by an exposure variable, such as vaccination status or immune marker level, that is measured at testing. While the TND reduces confounding by healthcare-seeking behavior, other sources of confounding may remain. TND studies may also have missing data in the exposure variable due to incomplete records or two-phase sampling designs. We present a targeted maximum likelihood estimation approach involving a semiparametric logistic regression model that targets a causal conditional risk ratio of symptomatic disease in the healthcare-seeking population. Under causal and missing at random assumptions, our method produces an efficient, asymptotically linear estimator that provides flexible, data-driven confounding control and valid causal inference when analyzing TND studies with missing exposure variable data. We evaluate our method's finite sample properties using plasmode simulations of a two-phase TND immune correlates study. We also apply our method to assess COVID-19 vaccine effectiveness and antibody marker correlates of COVID-19 from TND study cohorts derived from the Moderna Coronavirus Efficacy phase 3 trial.
Study · Preprint · Wiki · Moderate
The Bayesian Gaussian Process Latent Variable Model for Spatio-Temporal Stream Networks
Marno Basson, Tobias M. Louw, Theresa R. Smith · 2026
A variational inference-based framework for training a multi-output Gaussian process latent variable model, specifically tailored to the tails-up spatio-temporal stream network, is developed. Training, given a censored observational data set subject to missing values, proceeds by maximising a secondary variational lower bound on the model log marginal likelihood using gradient-based optimisation. To this end, the theoretical development for a new family of tails-up spatio-temporal stream network models is introduced, relying on the sparse Gaussian process inducing variable framework, the Bayesian Gaussian process latent variable model, and local variational methods. These spatio-temporal models use stream distance instead of Euclidean distance and capture spatial and temporal dependencies using auto/cross-correlation and process convolution, respectively, which allows for the development of valid separable spatio-temporal stream network-based covariance functions. Results from the simulation-based case studies indicate that the proposed framework performs well across benchmark comparisons and several performance metrics.
Study · Preprint · Wiki · Moderate
From Volterra Series to Kunchenko Stochastic Polynomials: Half a Century of Non-Gaussian Estimation Methodology
Serhii Zabolotnii · 2026
This paper reconstructs the half-century evolution of the scientific school founded by Yuriy P. Kunchenko (1939--2006) as the development of a semiparametric methodology for non-Gaussian estimation. Starting with Kunchenko's 1972/1973 dissertation applying Volterra series to estimate parameters of random processes, the trajectory is followed through 2006--2026. Kunchenko stochastic polynomials are presented as a coherent family of moment-cumulant procedures: the polynomial maximization method (PMM) for parameter estimation, polynomial criteria for hypothesis testing, and decomposition in spaces with a generating element. The paper details the school's structure: a verified genealogy of 15 defended dissertations, collaborations in Poland, Slovakia, and Germany, and the R package EstemPMM. A recent 2026 paper on Volterra-based signal processing is analyzed, showing how Kunchenko's nonlinear formulation reappears in applied radio engineering. We build a formal bridge between finite Volterra models and generalized Kunchenko polynomials, while separating the MMSE/L2 criterion from PMM: the former is a covariance projection for kernel adaptation, whereas PMM is a parameter-dependent moment procedure. PMM efficiency claims are stated conditionally: gains require that moments exist, the centered correlant matrix is nondegenerate, and the variance reduction coefficient is below one. The concluding research program operationalizes the historical reconstruction into testable statistical and signal-processing tasks.
Study · Preprint · Wiki · Moderate
CausalGuard: Conformal Inference under Graph Uncertainty
Vikash Singh, Weicong Chen, Debargha Ganguly +12 more · 2026
Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
Study · Preprint · Moderate
Parameterized 4-Qubit EWL Quantum Game Circuits with Dirac-Solow-Swan Hamiltonian Integration for Quadruple Helix Disruptive Innovation Recommender Systems
Agung Trisetyarso, Fithra Faisal Hastiadi, Kridanto Surendro · 2026
We present a novel parameterized 4-qubit Eisert-Wilkens-Lewenstein (EWL) quantum game circuit for recommender systems in quadruple helix innovation ecosystems (academia, industry, government, and civil society). The local strategy operators $U_{i} = R_y(\theta_{i})$ for each helix actor are directly tuned by normalized dominance weights extracted from real participant funding data (\textit{ecContribution}) in the European Commission CORDIS Horizon Europe database (project COVend, ID 101045956). The circuit employs a multi-qubit EWL entangler followed by parameterized local rotations, inverse entangler, and full measurement, requiring only 22 gates and circuit depth 11 while scaling as $O(n)$ for $n$-round helix communications. Measurement probabilities after the quantum game serve as recommender scores for disruptive versus sustaining innovation trends. These scores are subsequently mapped into the diagonal Dirac potential of a Dirac-Solow-Swan Hamiltonian, enabling time-evolution simulation of capital accumulation and bifurcation dynamics under disruptive innovation. Numerical experiments on real CORDIS quadruple-helix collaboration networks demonstrate the circuit's NISQ compatibility and its ability to forecast disruptive capital trajectories with high fidelity. The proposed framework bridges quantum game theory, parameterized quantum circuits, and relativistic economic growth models, offering a computationally efficient tool for innovation policy and strategic decision-making in complex socio-economic ecosystems. Complexity analysis and reproducibility are provided through open Qiskit implementations.
Study · Preprint · Wiki · Moderate
New Confidence Regions for Linear Regression Parameters with Stationary-Ergodic Dependent Errors
Mous-Abou Hamadou, Martial Longla, Mathias Nthiani Muia +1 more · 2026
We develop joint confidence regions for linear regression coefficients when the regressors and errors are jointly stationary and ergodic with unspecified serial dependence. The method applies random smoothing, using an independent auxiliary sample and shrinking bandwidth, to a vector of regression and second-moment statistics. Under stationarity, ergodicity, and finite second moments, the estimator is asymptotically normal and yields Wald confidence regions and simultaneous confidence intervals without direct long-run variance estimation or a parametric dependence model. For implementation, we introduce a scaled estimator with data-driven bandwidth selection and a mild truncation that improves finite-sample stability. Simulations under ARMA, ARFIMA, copula-based Markov errors, and fractional Gaussian noise, with Gaussian and heavy-tailed margins, show near-nominal coverage and competitive region volumes relative to Newey-West HAC and MAC. A winter Beijing PM2.5 application illustrates the procedure. Keywords: Random smoothing, Joint inference, Confidence regions, Dependent errors, Long memory, Regression inference
Study · Preprint · Wiki · Moderate
Laplace Approximations for Mixed-Effects and Gaussian Process Quantile Regression
Andrea Nava, Fabio Sigrist · 2026
Laplace approximations are a standard tool for computationally efficient inference in latent Gaussian models, but they fail for quantile regression with the asymmetric Laplace likelihood because the observed Hessian vanishes almost everywhere. We show that this obstacle can be overcome without smoothing the likelihood: the relevant local curvature is given not by the observed Hessian, but by the Fisher information when the model is correctly specified and by the population curvature of the expected loss under misspecification. On this basis, we develop a Laplace approximation framework for quantile regression with mixed-effects and Gaussian process models. We propose practical curvature estimators, including the triangular kernel curvature (TKC) estimator, that yield approximations for posterior distributions and marginal likelihoods, and we establish their asymptotic validity. Empirically, the proposed methods are scalable and numerically stable, and for latent Gaussian models, they achieve accuracy comparable to or better than MCMC and variational competitors at substantially lower computational costs. More broadly, the framework clarifies how Laplace approximations can be justified for non-smooth generalized posteriors through local quadratic behavior of the expected loss.
Study · Preprint · Wiki · Moderate
Two-stage Ensemble Clustering of Functional Data Using Random Projections
Sourav Chakrabarty, Anirvan Chakraborty, Shyamal K. De · 2026
We propose a computationally simple framework for clustering functional data based on Gaussian-process-generated random projections. In this approach, each curve is first projected onto a large collection of independent Gaussian process realizations. The resulting high-dimensional representations are clustered using the Mean Absolute Difference of Distances (MADD), a dissimilarity measure well suited for high-dimensional settings. A population-level analysis of this dissimilarity provides insight into how random projections help capture distributional differences between functional populations. We introduce a second stage of clustering to additionally leverage data-driven projection directions. Thus, in Stage I, an initial clustering is obtained using a set of prespecified projection families. In Stage II, this partition is refined by constructing Gaussian random projections based on an estimated covariance operator that uses the cluster labels from the first stage. Finally, a normalized cost function is used to select the optimal clustering among candidate solutions. The proposed clustering algorithm is broadly applicable to diverse functional data regimes including irregular and partially observed data. Through extensive simulations and real-data applications, we show that the proposed method achieves a high degree of accuracy and outperforms many state-of-the-art methods across a wide range of functional data settings.
Study · Preprint · Wiki · Moderate
Causal Discovery in Structural VAR Models Under Equal Noise Variance
SeyedSina Seyedi HasanAbadi, Fahimeh Arab, Erfan Nozari +1 more · 2026 · 5 citations
Causal discovery from multivariate time series is challenging when causal effects may occur both across time and within the same sampling interval. This issue is especially important in applications such as neuroscience, where the sampling rate may be coarse relative to the underlying dynamics and contemporaneous effects need not form an acyclic graph. We study causal discovery in linear Gaussian structural VAR models under an equal noise variance assumption, meaning that the structural noise terms have a common variance. Unlike the DAG-based cross-sectional equal noise variance setting, the time-series setting considered here does not generally yield point identification of a unique causal graph. Instead, multiple structural VAR parameterizations can induce the same stationary observed process law. We introduce a notion of observational equivalence tailored to this setting and show that the corresponding equivalence class is characterized by orthogonal transformations of the structural equations together with a global positive scale. This characterization leads to an equivalence-aware model discrepancy, the observational alignment discrepancy, which compares structural models modulo transformations that preserve the observed law. Building on this theory, we propose ENVAR, a sparsity-based procedure that searches over the induced observational equivalence class for a sparse normalized structural representative. We evaluate the proposed methodology on synthetic structural VAR data and on an fMRI dataset.
Study · Preprint · Wiki · Moderate
Clustering Craters on the Moon with Dysfunctional Families
Nathan Weed, Emily Castleton, Dave Osthus +2 more · 2026
Summaries of craters on terrestrial bodies, such as the number and size distribution, are essential for understanding the history of the Solar System. Identifying craters, however, has not been automated and thus relies on expert crater-counters marking static images. Robbins et al. (2014) (hereafter R14) showed that, contrary to previously held assumptions, there exists large variability across expert crater-counters' identified crater lists. How best to combine identified crater lists across multiple experts for the purposes of learning about the Solar System is an open and consequential question. R14 combined identified crater lists via clustering through a modification of the popular DBSCAN clustering method. Their approach did not, however, make use of all the constraining information available nor did it provide an estimate of clustering uncertainty. To address the shortcomings of the DBSCAN method, we present a novel clustering approach that can combine multiple lists of identified objects of interest from the same image. The key innovation is incorporating a dysfunctional family constraint into the Bayesian nonparametric clustering approach, the Chinese restaurant process (CRP), which naturally takes into account information about the crater identifier. The dysfunctional family Chinese restaurant process (DFCRP) provides an estimate of clustering uncertainty. In this work, we provide guidance on hyperparameter specification, present a Gibbs sampler, and perform a simulation study to compare the performance of the DFCRP to the CRP. Finally, we apply the DFCRP to the crater identification problem of R14, comparing results, and also demonstrate the types of analyses that can be performed with posterior draws of cluster assignments.
Study · Preprint · Wiki · Moderate
Component over Composite: Mitigating Type I Error Inflation when Imputing "Days Alive and at Home"
Mia S. Tackney, Sarah Dawson, Letao Yuan +2 more · 2026
Background: Days Alive and at Home (DAH) over a pre-defined follow-up period is a novel post-intervention composite outcome that combines data from at least three components: (i) initial length of hospital stay, (ii) length of total readmissions or other post-discharge care and (iii) mortality. Missing values bring unique challenges to the analysis of trials with the DAH outcome as the three components may have different rates of missingness caused by distinct missing data mechanisms. Current approaches define DAH as missing if any of the components are missing, and proceed with complete cases or Multiple Imputation (MI) of the composite. Methods: Through a simulation study motivated by the NOTACS trial, we compare several methods of handling missing data, including complete case analysis, MI of the composite, and MI of the components when the primary analysis is a Mann-Whitney-Wilcoxon test. Results: MI on the component level has good properties in terms of type I error control and power. We caution against the use of MI on the composite level with Predictive Mean Matching, which can lead to type I error inflation. Conclusions: Given the complex distributional characteristics of DAH, naive approaches such as defining missingness on the composite level and directly imputing the composite with Predictive Mean Matching can lead to type I error inflation. Imputing on the component level is recommended. Suggested future work includes imputation approaches that are compatible with more complex definitions of DAH, as well as recommendations for sensitivity analyses to the Missing at Random assumption.
Study · Preprint · Wiki · Moderate
Nonparametric Bayesian Policy Learning
Haonan Ye · 2026 · 41 citations
I propose Nonparametric Bayesian Policy Learning (NBPL) as a framework for uncertainty-aware treatment choice. I consider a decision-maker (DM) seeking to select an expected welfare-maximizing treatment rule using observable characteristics. A key observation is that, for a given welfare criterion and policy class, uncertainty about welfare-relevant objects is entirely induced by uncertainty about a reduced-form distribution. I assume the DM places a nonparametric Dirichlet process prior on this reduced-form parameter and uses the resulting posterior to conduct inference on optimal treatment assignments, optimal welfare, and comparisons across policy classes. The NBPL framework is flexible, and its implementation via the Bayesian bootstrap is highly tractable. I establish two main theoretical properties of NBPL. First, posterior welfare regret under NBPL converges at the minimax-optimal rate. Second, posterior model comparison across policy classes is pointwise consistent. I illustrate NBPL in two empirical applications: the bednet subsidy experiment of Bhattacharya and Dupas (2012) and the JTPA experiment studied by Kitagawa and Tetenov (2018).
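The Bayesian bootstrap mentioned above can be sketched in a few lines. This is an illustration only, assuming a randomized experiment with a known propensity score and an inverse-propensity-weighted welfare score; the function name `bayesian_bootstrap_welfare`, the synthetic data, and the threshold policy are hypothetical and not taken from the paper.

```python
import numpy as np

def bayesian_bootstrap_welfare(y, d, policy, propensity, n_draws=2000, seed=0):
    """Posterior draws of a policy's mean welfare via the Bayesian bootstrap.

    y: outcomes; d: observed binary treatment; policy: treatment the rule would assign;
    propensity: known probability of d == 1 in the experiment.
    """
    rng = np.random.default_rng(seed)
    y, d, policy = np.asarray(y, float), np.asarray(d), np.asarray(policy)
    # Probability of the treatment arm each unit actually received.
    p_obs = np.where(d == 1, propensity, 1.0 - propensity)
    # IPW score: an outcome counts only where the observed arm matches the policy.
    score = y * (d == policy) / p_obs
    # Each draw reweights the sample with Dirichlet(1, ..., 1) weights.
    weights = rng.dirichlet(np.ones(len(y)), size=n_draws)
    return weights @ score

# Synthetic example (assumption): treatment helps only when x > 0.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)
y = 1.0 + d * x + rng.normal(scale=0.5, size=n)
policy = (x > 0).astype(int)

draws = bayesian_bootstrap_welfare(y, d, policy, propensity=0.5)
print("posterior mean welfare:", draws.mean())
print("95% credible interval:", np.quantile(draws, [0.025, 0.975]))
```

The draws can be compared across candidate rules to obtain posterior statements about which policy yields higher welfare.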
Study · Preprint · Wiki · Moderate
Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification
Cong Jiang, Michael D. Hughes, Nima S. Hejazi · 2026
Causal mediation analysis is essential for disentangling the mechanisms by which investigational therapeutic and preventive agents impact clinical outcomes. However, the measurement of biological mediators is often subject to left-censoring by technical measurement limitations, most commonly an assay's limit of quantification. This form of censoring can pose severe challenges for both identification and estimation of causal mediation estimands, particularly when the censoring mechanism is deterministic and the resulting missingness is missing not at random (MNAR) or nonignorable. Motivated by the question of assessing the role of viral RNA in the action mechanism of monoclonal antibody therapies for COVID-19 in the Accelerating COVID-19 Therapeutics and Vaccine (ACTIV)-2 platform trial, we develop a semi-parametric framework for estimation of the natural direct and indirect effects when the mediator of interest is partially subject to this form of left-censoring. Our proposed strategy combines fractional imputation with a semi-parametric EM algorithm to flexibly estimate key components of the factorized data likelihood. Applying the proposed strategy to circumvent the left-censoring, we discuss both traditional plug-in and asymptotically efficient estimators of the direct and indirect effect estimands, introducing a data-adaptive $m$-out-of-$n$ bootstrap for robust inference under the imputation procedure. We demonstrate in numerical experiments that our approach significantly reduces bias and allows for reliable inference. An application to data from the ACTIV-2 platform trial confirms that monoclonal antibody therapies reduce the risk of hospitalization and death due to COVID-19, while suggesting that changes in viral RNA mediate only a modest proportion of the overall treatment effect.
Study · Preprint · Wiki · Moderate
Application of Propensity Score Models and Causal Estimators in Observational Studies under Model Misspecification
Apu Chandra Das, Sakib Salam, Md Robiul Islam Talukder +3 more · 2026
Propensity score (PS) methods are widely used in observational studies to reduce confounding and estimate causal treatment effects. However, the validity of PS-based causal estimators depends heavily on correct model specification, and model misspecification may lead to substantial bias and instability. In this study, we systematically evaluate the performance of commonly used causal estimators, including response surface modeling (RSM), inverse probability weighting (IPW), and augmented inverse probability weighting (AIPW), under varying levels of PS and outcome model misspecification. We compare classical logistic regression with several machine learning approaches for PS estimation, including random forests (RF), support vector machines (SVM), and linear discriminant analysis (LDA). Extensive simulation studies were conducted under multiple scenarios defined by combinations of correctly specified and misspecified PS and outcome models, varying sample sizes, and different covariate correlation structures. Estimator performance was assessed using bias, absolute bias, root mean squared error, empirical standard error, and confidence interval width. Results demonstrate that AIPW consistently provides robust and stable estimates across most scenarios due to its doubly robust property, whereas IPW is highly sensitive to PS misspecification and unstable PS estimates produced by flexible machine learning methods. RSM performs well only when the outcome model is correctly specified. Real-world applications using the ACTG175 clinical trial and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further illustrate the practical implications of estimator choice and PS modeling strategy. Overall, our findings highlight the importance of integrating flexible machine learning approaches within doubly robust frameworks to improve causal effect estimation in observational studies.
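The contrast between IPW and AIPW described above can be sketched with the textbook forms of the two estimators. The simulated data, the deliberately wrong propensity score, and the function names `ipw_ate` and `aipw_ate` are assumptions for illustration, not the study's simulation design.

```python
import numpy as np

def ipw_ate(y, d, ps):
    """Inverse probability weighting estimate of the average treatment effect."""
    return np.mean(d * y / ps - (1 - d) * y / (1 - ps))

def aipw_ate(y, d, ps, mu1, mu0):
    """AIPW: outcome-model contrast plus an IPW correction of its residuals."""
    return np.mean(
        mu1 - mu0
        + d * (y - mu1) / ps
        - (1 - d) * (y - mu0) / (1 - ps)
    )

# Simulated data (assumption): one confounder x, true ATE equal to 2.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
true_ps = 1.0 / (1.0 + np.exp(-x))
d = rng.binomial(1, true_ps)
y = 2.0 * d + x + rng.normal(size=n)

wrong_ps = np.full(n, 0.5)  # misspecified: ignores the confounder
no_model = np.zeros(n)      # misspecified outcome model: predicts 0 for everyone

print("IPW, wrong PS:", ipw_ate(y, d, wrong_ps))
print("AIPW, wrong PS, correct outcome model:", aipw_ate(y, d, wrong_ps, 2.0 + x, x))
print("AIPW, correct PS, wrong outcome model:", aipw_ate(y, d, true_ps, no_model, no_model))
```

AIPW stays close to the true effect when either nuisance model is correct, which is the doubly robust property, whereas IPW inherits the bias of a misspecified propensity score.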
Study · Preprint · Wiki · Moderate
A Mixed Self-Exciting Process to Model Epileptic Seizures
Karen Kanaster, Giovani L. Silva, Peter Mueller +2 more · 2026
Epilepsy is a neurological disorder characterized by recurrent seizures, affecting more than 70 million people worldwide. Often, an individual with epilepsy is more likely to experience subsequent seizures following an initial seizure, a process we call seizure clustering. Motivated by seizure diary data collected over three years from 407 individuals newly diagnosed with focal epilepsy in the Human Epilepsy Project (HEP), we propose a Bayesian mixed Hawkes process model that addresses seizure clustering and heterogeneity between individuals. In the Hawkes process, the intensity is accelerated each time an event occurs, through the composition of background and excitation intensity functions. The proposed model incorporates a Weibull baseline intensity to model a trend in background seizure rates over time, while the excitation process accounts for seizure clustering within individuals. We model heterogeneity among individuals by including covariates and random effects in both the background and excitation intensities. In the HEP study, the average time between primary and secondary seizures within an individual is 1.57 (95% CrI: 1.43, 1.70) days, with an average of 2.20 (1.96, 2.47) seizures per cluster. We demonstrate that omitting random effects in the presence of heterogeneity leads to underestimation of the background intensity and overestimation of excitation rates.
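How a Hawkes intensity combines a background trend with self-excitation can be sketched as follows. The Weibull hazard form of the baseline follows the abstract, but the exponential excitation kernel, the parameter names, and the example values are assumptions; covariates and random effects are omitted.

```python
import numpy as np

def weibull_baseline(t, shape, scale):
    """Weibull hazard used as the background intensity at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def hawkes_intensity(t, event_times, shape, scale, alpha, beta):
    """Conditional intensity: background rate plus excitation from past events.

    alpha: expected number of events triggered directly by one event.
    beta:  decay rate of the (assumed) exponential excitation kernel.
    """
    past = np.asarray(event_times, dtype=float)
    past = past[past < t]  # only events strictly before t contribute
    excitation = np.sum(alpha * beta * np.exp(-beta * (t - past)))
    return weibull_baseline(t, shape, scale) + excitation

# Hypothetical seizure diary: event times in days.
seizure_days = [1.0, 1.5, 9.0]
for day in (1.6, 5.0, 9.1):
    rate = hawkes_intensity(day, seizure_days, shape=0.8, scale=30.0, alpha=0.5, beta=1.0)
    print(day, rate)
```

The intensity jumps right after each seizure and decays back toward the baseline, which is the clustering behaviour the model is designed to capture.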
Meta-analysis · Preprint · Wiki · Moderate
Meta-analysis and network meta-analysis of time-to-event outcomes with non-proportional hazards: a Bayesian time-varying hazard ratio approach
Rhiannon K Owen, Keith R Abrams · 2026
Background: Often when undertaking meta-analyses of time-to-event (TTE) outcomes, especially in a Health Technology Assessment (HTA) context, a hazard ratio (HR) scale is used. However, issues arise when there is evidence of non-proportional hazards in some of the studies included. A number of methods have been advocated, but their use has been limited by their complexity and/or by how easily their results can be used in HTA. An alternative approach is to assume a treatment-log(time) interaction within a Cox proportional hazards model for each study, and to then undertake a bivariate meta-analysis of the resulting treatment and interaction coefficients, so that an overall time-varying HR (TVHR) can be obtained. Methods: A TVHR approach was applied to a meta-analysis of chemotherapy compared to Standard of Care for advanced recurrent gastric cancer, in which Progression-Free Survival (PFS) was an outcome. The approach was also applied to a network meta-analysis (NMA) evaluating overall survival (OS) in advanced BRAF-mutated melanoma. Results: Five trials in the advanced gastric cancer meta-analysis displayed evidence of non-proportional hazards for PFS. Using a TVHR model produced HRs ranging from 0.83 (CrI:0.75-0.91) at 0.5 years to 0.99 (CrI:0.79-1.23) at 3.5 years. Three studies showed evidence of non-proportional hazards in the advanced BRAF-mutated melanoma NMA for OS. Using a TVHR model, nivolumab plus ipilimumab demonstrated consistent superiority from month 7 onwards, with an HR improving from 0.37 (CrI:0.26-0.51) at one year to 0.24 (CrI:0.12-0.45) at five years. Conclusions: A TVHR approach to the meta-analysis or NMA of TTE outcomes, when the proportional hazards assumption appears not to hold, produces an intuitive solution which can be readily used in HTA.
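The treatment-log(time) interaction described above implies a simple closed form for the time-varying hazard ratio. The notation below ($x$ for the treatment indicator, $\beta_i$ and $\gamma_i$ for the study-level treatment and interaction coefficients, $\beta$ and $\gamma$ for their pooled counterparts) is an assumption for illustration, not the authors' own.

```latex
% Study i: Cox model with a treatment-by-log(time) interaction
h_i(t \mid x) = h_{0i}(t)\,\exp\{\beta_i x + \gamma_i\, x \log t\}

% Implied study-level hazard ratio (treated vs. control) at time t
\mathrm{HR}_i(t) = \exp\{\beta_i + \gamma_i \log t\} = e^{\beta_i}\, t^{\gamma_i}

% Overall time-varying hazard ratio from the pooled coefficients
\mathrm{HR}(t) = \exp\{\beta + \gamma \log t\}
```

Setting $\gamma = 0$ recovers the usual proportional-hazards case with a constant hazard ratio $e^{\beta}$, and $e^{\beta}$ is the hazard ratio at $t = 1$ in whatever time unit is used.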
Study · Preprint · Wiki · Moderate
Group-Aware Matrix Estimation and Latent Subspace Recovery
Hamza Golubovic, Matthew Shen, Genevera I. Allen +1 more · 2026
Modern matrix completion problems often involve heterogeneous data whose rows simultaneously belong to many meta-categories, such as demographic and age groups in recommendation systems, or region and recording session labels in neural electrophysiological experiments. Standard low-rank estimators impose a single global latent geometry, which can recover average structure but may smooth away subgroup-specific variation, especially when observations are unevenly distributed across groups. We introduce Group-Aware Matrix Estimation (GAME), a convex estimator for overlapping subgroup-wise low-rank matrix estimation. GAME regularizes category-specific submatrices through overlapping nuclear-norm penalties, allowing related groups to borrow information while preserving local latent structure in a shared coordinate system. We provide finite-sample guarantees for both reconstruction error and subgroup-specific subspace recovery, showing how performance depends on sampling density, subgroup rank, and overlap structure. Experiments on synthetic, recommendation, ecological, and neuroscience datasets show that GAME is most beneficial in structured missingness regimes, where subgroup-aware regularization improves both reconstruction accuracy and latent subspace fidelity. Across these benchmarks, GAME is competitive or best among global low-rank, side-information, and modern imputation baselines, with the largest gains when subgroups exhibit distinct low-rank structure.
Study · Preprint · Wiki · Moderate
Sequential Sensitivity Analysis for Multiple Assumptions: A Framework for Understanding Racial Disparity in Police Use of Force
Thomas Leavitt, Jake Bowers, Luke Miratrix · 2026
Inferring racial discrimination in police use of force -- the average causal effect of civilian race on use of force -- requires two assumptions about policing prior to potential use of force: that officers do not discriminate in whom they would stop (no discrimination in stops) and that, conditional on patrol context, the probability that an encounter is with a minority rather than a white civilian does not vary across encounters (no bias in encounters). As Knox et al. (2020) show, violations of the first can mask racial disparity in force. Whether such disparity reflects discrimination in force also depends on the second. Existing sensitivity analyses address one assumption at a time. We develop a framework that varies both sequentially and apply it to NYPD Stop, Question, and Frisk data (2003--2013). Under plausible levels of discrimination in stops, we find substantial racial disparity in force. However, the conclusion that this disparity reflects discrimination is fragile to modest departures from no bias in encounters that census-based calibration suggests are demographically feasible. By jointly addressing both confounding channels, the framework reveals how they interact in ways that separate analyses cannot, contributing to understanding what generates racial disparities and how they might be addressed.
Study · Preprint · Wiki · Moderate
Assessing covariate-adjusted risk differences in small-sample clinical trials
Martin Schnuerch, Alex Ocampo, Klaus Kähler Holst +1 more · 2026
Binary endpoints are common in clinical trials and conditional odds ratios have traditionally been used to assess treatment effects. However, odds ratios are difficult to interpret, are non-collapsible, and rely on strong assumptions in order to be a relevant overall summary measure for the trial. As an alternative, risk differences have gained increasing prominence as a more interpretable, clinically meaningful and assumption-lean measure of treatment effects. This shift has also been motivated by new regulatory guidance, which emphasizes the relevance of marginal estimands and encourages covariate adjustment. Yet, covariate-adjusted inference for risk differences, particularly in smaller samples, has methodological subtleties and lacks well-established best practices. We conduct a simulation study comparing methods for estimating and testing risk differences in small-sample ($N \leq 150$) randomized clinical trials with prognostic categorical baseline covariates, focusing on exact unconditional tests, Mantel-Haenszel methods, and $g$-computation (standardization) approaches. We find that several $g$-computation approaches exhibit inflated Type-I error in very small samples when standard Wald-type inference is applied, whereas robust or penalized variants improve error control at the expense of power. Classical methods such as the Mantel-Haenszel and Suissa-Shuster tests remain robust but may forgo efficiency gains from covariate adjustment. Overall, our results indicate that much of the observed Type-I error inflation reflects misalignment between estimand and variance estimation rather than small sample size alone. Based on these results, we provide practical recommendations to guide method selection that align the estimand, variance estimation, and inferential target.
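As a concrete reference point for the Mantel-Haenszel methods mentioned above, the stratified risk difference can be sketched as a weighted average of stratum-specific risk differences. The function name `mantel_haenszel_rd` and the toy counts are hypothetical; the sketch returns only the point estimate and does not implement any of the variance estimators or tests compared in the study.

```python
import numpy as np

def mantel_haenszel_rd(events_trt, n_trt, events_ctl, n_ctl):
    """Mantel-Haenszel risk difference across strata of a categorical covariate.

    Each argument holds one entry per stratum: event counts and arm sizes
    for the treatment and control arms.
    """
    a = np.asarray(events_trt, dtype=float)
    n1 = np.asarray(n_trt, dtype=float)
    c = np.asarray(events_ctl, dtype=float)
    n0 = np.asarray(n_ctl, dtype=float)
    weights = n1 * n0 / (n1 + n0)     # Mantel-Haenszel stratum weights
    stratum_rd = a / n1 - c / n0      # risk difference within each stratum
    return np.sum(weights * stratum_rd) / np.sum(weights)

# Hypothetical small trial with two strata of a prognostic covariate.
rd = mantel_haenszel_rd(
    events_trt=[8, 15], n_trt=[20, 25],
    events_ctl=[5, 10], n_ctl=[22, 24],
)
print("Mantel-Haenszel risk difference:", rd)
```

With a single stratum the estimate reduces to the unadjusted difference in proportions between arms.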