arXiv paper list - stat.ML updates on arXiv.org

#1 Fast and Robust Likelihood-Guided Diffusion Posterior Sampling with Amortized Variational Inference

diffusion

Authors: Léon Zheng (MBZUAI, LRE), Thomas Hirtz (MBZUAI, LRE), Yazid Janati (MBZUAI, LRE), Eric Moulines (MBZUAI, LRE)

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07102

Abstract:
Zero-shot diffusion posterior sampling offers a flexible framework for inverse problems by accommodating arbitrary degradation operators at test time, but incurs high computational cost due to repeated likelihood-guided updates. In contrast, previous amortized diffusion approaches enable fast inference by replacing likelihood-based sampling with implicit inference models, but at the expense of robustness to unseen degradations. We introduce an amortization strategy for diffusion posterior sampling that preserves explicit likelihood guidance by amortizing the inner optimization problems arising in variational diffusion posterior sampling. This accelerates inference for in-distribution degradations while maintaining robustness to previously unseen operators, thereby improving the trade-off between efficiency and flexibility in diffusion-based inverse problems.

#2 Discrete Adjoint Matching

Authors: Oswin So, Brian Karrer, Chuchu Fan, Ricky T. Q. Chen, Guan-Horng Liu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07132

Abstract:
Computational methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of the discrete adjoint -- an estimator of the optimal solution to the original problem but formulated on discrete domains -- to which standard matching frameworks can be applied. This is derived from a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM's effectiveness on synthetic and mathematical reasoning tasks.

#3 Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization

Authors: Jinhua Lyu, Tianmin Yu, Ying Ma, Naichen Shi

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07632

Abstract:
In this work, we investigate the large-scale mean-field variational inference (MFVI) problem from a mini-batch primal-dual perspective. By reformulating MFVI as a constrained finite-sum problem, we develop a novel primal-dual algorithm based on an augmented Lagrangian formulation, termed primal-dual variational inference (PD-VI). PD-VI jointly updates global and local variational parameters in the evidence lower bound in a scalable manner. To further account for heterogeneous loss geometry across different variational parameter blocks, we introduce a block-preconditioned extension, P$^2$D-VI, which adapts the primal-dual updates to the geometry of each parameter block and improves both numerical robustness and practical efficiency. We establish convergence guarantees for both PD-VI and P$^2$D-VI under a properly chosen constant step size, without relying on conjugacy assumptions or explicit bounded-variance conditions. In particular, we prove $O(1/T)$ convergence to a stationary point in general settings and linear convergence under strong convexity. Numerical experiments on synthetic data and a real large-scale spatial transcriptomics dataset demonstrate that our methods consistently outperform existing stochastic variational inference approaches in terms of convergence speed and solution quality.

#4 Flow-Based Conformal Predictive Distributions

Authors: Trevor Harris

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07633

Abstract:
Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Boundary samples can be reconformalized to form pointwise prediction sets with controlled risk, and mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide exactly with conformal prediction sets. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.

#5 On Generation in Metric Spaces

Authors: Jiaxun Li, Vinod Raman, Ambuj Tewari

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07710

Abstract:
We study generation in separable metric instance spaces. We extend the language generation framework from Kleinberg and Mullainathan [2024] beyond countable domains by defining novelty through metric separation and allowing asymmetric novelty parameters for the adversary and the generator. We introduce the $(\varepsilon,\varepsilon')$-closure dimension, a scale-sensitive analogue of closure dimension, which yields characterizations of uniform and non-uniform generatability and a sufficient condition for generation in the limit. Along the way, we identify a sharp geometric contrast. Namely, in doubling spaces, including all finite-dimensional normed spaces, generatability is stable across novelty scales and invariant under equivalent metrics. In general metric spaces, however, generatability can be highly scale-sensitive and metric-dependent; even in the natural infinite-dimensional Hilbert space $\ell^2$, all notions of generation may fail abruptly as the novelty parameters vary.

#6 BFTS: Thompson Sampling with Bayesian Additive Regression Trees

Authors: Ruizhe Deng, Bibhas Chakraborty, Ran Chen, Yan Shuo Tan

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07767

Abstract:
Contextual bandits are a core technology for personalized mobile health interventions, where decision-making requires adapting to complex, non-linear user behaviors. While Thompson Sampling (TS) is a preferred strategy for these problems, its performance hinges on the quality of the underlying reward model. Standard linear models suffer from high bias, while neural network approaches are often brittle and difficult to tune in online settings. Conversely, tree ensembles dominate tabular data prediction but typically rely on heuristic uncertainty quantification, lacking a principled probabilistic basis for TS. We propose Bayesian Forest Thompson Sampling (BFTS), the first contextual bandit algorithm to integrate Bayesian Additive Regression Trees (BART), a fully probabilistic sum-of-trees model, directly into the exploration loop. We prove that BFTS is theoretically sound, deriving an information-theoretic Bayesian regret bound of $\tilde{O}(\sqrt{T})$. As a complementary result, we establish frequentist minimax optimality for a "feel-good" variant, confirming the structural suitability of BART priors for non-parametric bandits. Empirically, BFTS achieves state-of-the-art regret on tabular benchmarks with near-nominal uncertainty calibration. Furthermore, in an offline policy evaluation on the Drink Less micro-randomized trial, BFTS improves engagement rates by over 30% compared to the deployed policy, demonstrating its practical effectiveness for behavioral interventions.

#7 Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

Authors: TrungKhang Tran, TrungTin Nguyen, Md Abul Bashar, Nhat Ho, Richi Nayak, Christopher Drovandi

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07997

Abstract:
Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein-protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.

#8 Graph-based Semi-Supervised Learning via Maximum Discrimination

Authors: Nadav Katz, Ariel Jaffe

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08042

Abstract:
Semi-supervised learning (SSL) addresses the critical challenge of training accurate models when labeled data is scarce but unlabeled data is abundant. Graph-based SSL (GSSL) has emerged as a popular framework that captures data structure through graph representations. Classic graph SSL methods, such as Label Propagation and Label Spreading, aim to compute low-dimensional representations where points with the same labels are close in representation space. Although often effective, these methods can be suboptimal on data with complex label distributions. In our work, we develop AUC-spec, a graph approach that computes a low-dimensional representation that maximizes class separation. We compute this representation by optimizing the Area Under the ROC Curve (AUC) as estimated via the labeled points. We provide a detailed analysis of our approach under a product-of-manifold model, and show that the required number of labeled points for AUC-spec is polynomial in the model parameters. Empirically, we show that AUC-spec balances class separation with graph smoothness. It demonstrates competitive results on synthetic and real-world datasets while maintaining computational efficiency comparable to the field's classic and state-of-the-art methods.

#9 Information Geometry of Absorbing Markov-Chain and Discriminative Random Walks

Authors: Masanari Kimura

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08185

Abstract:
Discriminative Random Walks (DRWs) are a simple yet powerful tool for semi-supervised node classification, but their theoretical foundations remain fragmentary. We revisit DRWs through the lens of information geometry, treating the family of class-specific hitting-time laws on an absorbing Markov chain as a statistical manifold. Starting from a log-linear edge-weight model, we derive closed-form expressions for the hitting-time probability mass function, its full moment hierarchy, and the observed Fisher information. The Fisher matrix of each seed node turns out to be rank-one; taking the quotient by its null space yields a low-dimensional, globally flat manifold that captures all identifiable directions of the model. Leveraging the geometry, we introduce a sensitivity score for unlabeled nodes that bounds, and in one-dimensional cases attains, the maximal first-order change in DRW betweenness under unit Fisher perturbations. The score can lead to principled strategies for active label acquisition, edge re-weighting, and explanation.

#10 Discrete Adjoint Schrödinger Bridge Sampler

Authors: Wei Guo, Yuchen Zhu, Xiaochen Du, Juno Nam, Yongxin Chen, Rafael Gómez-Bombarelli, Guan-Horng Liu, Molei Tao, Jaemoo Choi

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08243

Abstract:
Learning discrete neural samplers is challenging due to the lack of gradients and combinatorial complexity. While stochastic optimal control (SOC) and Schrödinger bridge (SB) provide principled solutions, efficient SOC solvers like adjoint matching (AM), which excel in continuous domains, remain unexplored for discrete spaces. We bridge this gap by revealing that the core mechanism of AM is state-space agnostic, and introduce discrete ASBS, a unified framework that extends AM and adjoint Schrödinger bridge sampler (ASBS) to discrete spaces. Theoretically, we analyze the optimality conditions of the discrete SB problem and its connection to SOC, identifying a necessary cyclic group structure on the state space to enable this extension. Empirically, discrete ASBS achieves competitive sample quality with significant advantages in training efficiency and scalability.

#11 A Statistical Framework for Alignment with Biased AI Feedback

Authors: Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08259

Abstract:
Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.

#12 Is Flow Matching Just Trajectory Replay for Sequential Data?

Authors: Soon Hoe Lim, Shizheng Lin, Michael W. Mahoney, N. Benjamin Erichson

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08318

Abstract:
Flow matching (FM) is increasingly used for time-series generation, but it is not well understood whether it learns a general dynamical structure or simply performs an effective "trajectory replay". We study this question by deriving the velocity field targeted by the empirical FM objective on sequential data, in the limit of perfect function approximation. For the Gaussian conditional paths commonly used in practice, we show that the implied sampler is an ODE whose dynamics constitutes a nonparametric, memory-augmented continuous-time dynamical system. The optimal field admits a closed-form expression as a similarity-weighted mixture of instantaneous velocities induced by past transitions, making the dataset dependence explicit and interpretable. This perspective positions neural FM models trained by stochastic optimization as parametric surrogates of an ideal nonparametric solution. Using the structure of the optimal field, we study sampling and approximation schemes that improve the efficiency and numerical robustness of ODE-based generation. On nonlinear dynamical system benchmarks, the resulting closed-form sampler yields strong probabilistic forecasts directly from historical transitions, without training.

#13 Schrödinger bridge problem via empirical risk minimization

Authors: Denis Belomestny, Alexey Naumov, Nikita Puchkin, Denis Suchkov

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08374

Abstract:
We study the Schrödinger bridge problem when the endpoint distributions are available only through samples. Classical computational approaches estimate Schrödinger potentials via Sinkhorn iterations on empirical measures and then construct a time-inhomogeneous drift by differentiating a kernel-smoothed dual solution. In contrast, we propose a learning-theoretic route: we rewrite the Schrödinger system in terms of a single positive transformed potential that satisfies a nonlinear fixed-point equation and estimate this potential by empirical risk minimization over a function class. We establish uniform concentration of the empirical risk around its population counterpart under sub-Gaussian assumptions on the reference kernel and terminal density. We plug the learned potential into a stochastic control representation of the bridge to generate samples. We illustrate the performance of the proposed approach with numerical experiments.

#14 Amortising Inference and Meta-Learning Priors in Neural Networks

Authors: Tommy Rochussen, Vincent Fortuin

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08782

Abstract:
One of the core facets of Bayesianism is the updating of prior beliefs in light of new evidence -- so how can we maintain a Bayesian approach if we have no prior beliefs in the first place? This is one of the central challenges in the field of Bayesian deep learning, where it is not clear how to represent beliefs about a prediction task by prior distributions over model parameters. Bridging the fields of Bayesian deep learning and probabilistic meta-learning, we introduce a way to learn a weights prior from a collection of datasets by performing per-dataset amortised variational inference. The model we develop can be viewed as a neural process whose latent variable is the set of weights of a BNN and whose decoder is the neural network parameterised by a sample of the latent variable itself. This unique model allows us to study the behaviour of Bayesian neural networks under well-specified priors, use Bayesian neural networks as flexible generative models, and perform desirable but previously elusive feats in neural processes such as within-task minibatching or meta-learning under extreme data-starvation.

#15 Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials

Authors: Terry C. W. Lam, Niamh O'Neill, Christoph Schran, Lars L. Schaaf

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08849

Abstract:
The accuracy of machine learning interatomic potentials suffers from reference data that contains numerical noise. Often originating from unconverged or inconsistent electronic-structure calculations, this noise is challenging to identify. Existing mitigation strategies, such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. The method's effectiveness is demonstrated by recovering accurate physical observables for liquid water from unconverged reference data, including diffusion coefficients. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets across dataset sizes.
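
Editor's sketch (not from the paper): the general idea of tracking the loss distribution with an exponential moving average and down-weighting samples whose loss is far above it can be illustrated as follows. The class name `EmaOutlierWeigher`, the z-score rule, and the parameters `decay` and `z_cut` are hypothetical choices made for illustration; the authors' actual weighting scheme may differ.

```python
import math


class EmaOutlierWeigher:
    """Hypothetical illustration: EMA of loss mean/variance -> per-sample weight."""

    def __init__(self, decay=0.99, z_cut=3.0):
        self.decay = decay    # EMA decay factor (assumed value)
        self.z_cut = z_cut    # z-score above which a loss is down-weighted (assumed)
        self.mean = None      # running EMA of the loss
        self.var = 0.0        # running EMA of the squared deviation

    def weight(self, loss):
        """Return a weight in (0, 1] for this loss and update the statistics."""
        if self.mean is None:
            self.mean = loss
            return 1.0
        std = math.sqrt(self.var)
        z = (loss - self.mean) / std if std > 0 else 0.0
        # Only unusually HIGH losses are treated as outliers.
        w = 1.0 if z <= self.z_cut else self.z_cut / z
        # Outliers contribute less to the running statistics as well.
        alpha = (1.0 - self.decay) * w
        delta = loss - self.mean
        self.mean += alpha * delta
        self.var = (1.0 - alpha) * self.var + alpha * delta * delta
        return w
```

In a training loop, each per-sample loss would be multiplied by the returned weight before backpropagation, so that a sample with a loss far above the running average contributes little to the gradient.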

#16 Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching

Authors: Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08892

Abstract:
A major challenge in data-driven decision-making is accurate policy evaluation, i.e., guaranteeing that a learned decision-making policy achieves the promised benefits. A popular strategy is model-based policy evaluation, which estimates a model from data to infer counterfactual outcomes. This strategy is known to produce unwarrantedly optimistic estimates of the true benefit due to the winner's curse. We searched the recent literature on data-driven decision-making, identifying a sample of 55 papers published in Management Science in the past decade; all but two relied on this flawed methodology. Several common justifications are provided: (1) the estimated models are accurate, stable, and well-calibrated, (2) the historical data uses random treatment assignment, (3) the model family is well-specified, and (4) the evaluation methodology uses sample splitting. Unfortunately, we show that no combination of these justifications avoids the winner's curse. First, we provide a theoretical analysis demonstrating that the winner's curse can cause large, spurious reported benefits even when all these justifications hold. Second, we perform a simulation study based on the recent and consequential data-driven refugee matching problem. We construct a synthetic refugee matching environment (calibrated to closely match the real setting) but designed so that no assignment policy can improve expected employment compared to random assignment. Model-based methods report large, stable gains of around 60% even when the true effect is zero; these gains are on par with improvements of 22-75% reported in the literature. Our results provide strong evidence against model-based evaluation.

#17 Online monotone density estimation and log-optimal calibration

Authors: Rohan Hore, Ruodu Wang, Aaditya Ramdas

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08927

Abstract:
We study the problem of online monotone density estimation, where density estimators must be constructed in a predictable manner from sequentially observed data. We propose two online estimators: an online analogue of the classical Grenander estimator, and an expert aggregation estimator inspired by exponential weighting methods from the online learning literature. In the well-specified stochastic setting, where the underlying density is monotone, we show that the expected cumulative log-likelihood gap between the online estimators and the true density admits an $O(n^{1/3})$ bound. We further establish a $\sqrt{n\log{n}}$ pathwise regret bound for the expert aggregation estimator relative to the best offline monotone estimator chosen in hindsight, under minimal regularity assumptions on the observed sequence. As an application of independent interest, we show that the problem of constructing log-optimal p-to-e calibrators for sequential hypothesis testing can be formulated as an online monotone density estimation problem. We adapt the proposed estimators to build empirically adaptive p-to-e calibrators and establish their optimality. Numerical experiments illustrate the theoretical results.
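
Editor's sketch (not from the paper): the link between p-to-e calibrators and monotone densities can be seen in the standard power family f(p) = kappa * p^(kappa - 1) with kappa in (0, 1), which is decreasing on (0, 1] and integrates to 1, so it is itself a monotone density. This fixed family is a textbook example only, not the adaptive calibrators of the abstract; the function names and the default `kappa` are illustrative assumptions.

```python
def power_calibrator(p, kappa=0.5):
    """Map a p-value in (0, 1] to an e-value via f(p) = kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def midpoint_integral(f, n=100_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h
```

Because a p-value is uniform on [0, 1] under the null, the expected e-value equals the integral of f, which `midpoint_integral(power_calibrator)` approximates numerically; small p-values map to large e-values because f is decreasing.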

#18 Provably robust learning of regression neural networks using $\beta$-divergences

Authors: Abhik Ghosh, Suryasis Jana

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08933

Abstract:
Regression neural networks (NNs) are most commonly trained by minimizing the mean squared prediction error, which is highly sensitive to outliers and data contamination. Existing robust training methods for regression NNs are often limited in scope and rely primarily on empirical validation, with only a few offering partial theoretical guarantees. In this paper, we propose a new robust learning framework for regression NNs based on the $\beta$-divergence (also known as the density power divergence), which we call "rRNet". It applies to a broad class of regression NNs, including models with non-smooth activation functions and error densities, and recovers the classical maximum likelihood learning as a special case. The rRNet is implemented via an alternating optimization scheme, for which we establish convergence guarantees to stationary points under mild, verifiable conditions. The (local) robustness of rRNet is theoretically characterized through the influence functions of both the parameter estimates and the resulting rRNet predictor, which are shown to be bounded for suitable choices of the tuning parameter $\beta$, depending on the error density. We further prove that rRNet attains the optimal 50% asymptotic breakdown point at the assumed model for all $\beta\in(0, 1]$, providing a strong global robustness guarantee that is largely absent for existing NN learning methods. Our theoretical results are complemented by simulation experiments and real-data analyses, illustrating practical advantages of rRNet over existing approaches in both function approximation problems and prediction tasks with noisy observations.

#19 Conformal changepoint localization

Authors: Rohan Hore, Aaditya Ramdas

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.06267

Abstract:
We study the problem of offline changepoint localization in a distribution-free setting. One observes a vector of data with a single changepoint, assuming that the data before and after the changepoint are iid (or more generally exchangeable) from arbitrary and unknown distributions. The goal is to produce a finite-sample confidence set for the index at which the change occurs without making any other assumptions. Existing methods often rely on parametric assumptions, tail conditions, or asymptotic approximations, or only produce point estimates. In contrast, our distribution-free algorithm, CONformal CHangepoint localization (CONCH), only leverages exchangeability arguments to construct confidence sets with finite sample coverage. By proving a conformal Neyman-Pearson lemma, we derive principled score functions that yield informative (small) sets. Moreover, with such score functions, the normalized length of the confidence set shrinks to zero under weak assumptions. We also establish a universality result showing that any distribution-free changepoint localization method must be an instance of CONCH. Experiments suggest that CONCH delivers precise confidence sets even in challenging settings involving images or text.

#20 Scalable spatial point process models for forensic footwear analysis

Authors: Alokesh Manna, Neil Spencer, Dipak K. Dey

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07006

Abstract:
Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of "accidentals," i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.

#21 BayesFlow 2.0: Multi-Backend Amortized Bayesian Inference in Python

Authors: Lars Kühmichel, Jerry M. Huang, Valentin Pratz, Jonas Arruda, Hans Olischläger, Daniel Habermann, Simon Kucharsky, Lasse Elsemüller, Aayush Mishra, Niels Bracher, Svenja Jedhoff, Marvin Schmitt, Paul-Christian Bürkner, Stefan T. Radev

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07098

Abstract:
Modern Bayesian inference involves a mixture of computational methods for estimating, validating, and drawing conclusions from probabilistic models as part of principled workflows. An overarching motif of many Bayesian methods is that they are relatively slow, which often becomes prohibitive when fitting complex models to large data sets. Amortized Bayesian inference (ABI) offers a path to solving the computational challenges of Bayes. ABI trains neural networks on model simulations, rewarding users with rapid inference of any model-implied quantity, such as point estimates, likelihoods, or full posterior distributions. In this work, we present the Python library BayesFlow, Version 2.0, for general-purpose ABI. Along with direct posterior, likelihood, and ratio estimation, the software includes support for multiple popular deep learning backends, a rich collection of generative networks for sampling and density estimation, complete customization and high-level interfaces, as well as new capabilities for hyperparameter optimization, design optimization, and hierarchical modeling. Using a case study on dynamical system parameter estimation, combined with comparisons to similar software, we show that our streamlined, user-friendly workflow has strong potential to support broad adoption.

#22 BONSAI: Bayesian Optimization with Natural Simplicity and Interpretability

Authors: Samuel Daulton, David Eriksson, Maximilian Balandat, Eytan Bakshy

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07144

Abstract:
Bayesian optimization (BO) is a popular technique for sample-efficient optimization of black-box functions. In many applications, the parameters being tuned come with a carefully engineered default configuration, and practitioners only want to deviate from this default when necessary. Standard BO, however, does not aim to minimize deviation from the default and, in practice, often pushes weakly relevant parameters to the boundary of the search space. This makes it difficult to distinguish between important and spurious changes and increases the burden of vetting recommendations when the optimization objective omits relevant operational considerations. We introduce BONSAI, a default-aware BO policy that prunes low-impact deviations from a default configuration while explicitly controlling the loss in acquisition value. BONSAI is compatible with a variety of acquisition functions, including expected improvement and upper confidence bound (GP-UCB). We theoretically bound the regret incurred by BONSAI, showing that, under certain conditions, it enjoys the same no-regret property as vanilla GP-UCB. Across many real-world applications, we empirically find that BONSAI substantially reduces the number of non-default parameters in recommended configurations while maintaining competitive optimization performance, with little effect on wall time.

#23 Free Energy Mixer

Authors: Jiecheng Lu, Shihao Yang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07160

Abstract:
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
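
The log-sum-exp read described above can be illustrated with a small NumPy sketch. This is one plausible reading of the abstract, not the paper's implementation: the function name `free_energy_read`, the arguments `prior`, `values`, and `beta`, and the specific formula (1/beta) * log sum_i prior[i] * exp(beta * values[i, c]) are assumptions made for illustration, and the two-level gating is not reproduced.

```python
import numpy as np

def free_energy_read(prior, values, beta):
    """Per-channel log-sum-exp read (hypothetical form, for illustration only).

    prior:  shape (T,), nonnegative weights over indices that sum to 1
    values: shape (T, C), one value vector per index
    beta:   positive inverse temperature
    """
    scaled = beta * values                         # (T, C)
    shift = scaled.max(axis=0, keepdims=True)      # stabilise the exponentials
    weighted = prior[:, None] * np.exp(scaled - shift)
    return (shift[0] + np.log(weighted.sum(axis=0))) / beta

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.3, 0.2])                  # stands in for attention weights
values = rng.normal(size=(3, 4))

avg_read = prior @ values                          # standard convex-average read
low = free_energy_read(prior, values, beta=1e-4)   # small beta: close to the average
high = free_energy_read(prior, values, beta=1e4)   # large beta: close to a per-channel max
```

Under this assumed form, a small `beta` recovers the usual convex average, while a large `beta` selects, independently in each channel, the largest value among indices with nonzero prior weight.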

#24 Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Authors: Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07205

Abstract:
We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well-known to be achievable, their result is still the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
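
The two extremes and the role of $\eta$ can be checked by hand from the bounds quoted above. The short derivation below is only a sketch: it assumes that a fixed opponent corresponds to $C = 0$, and in the last line it treats $\eta$ as a free real parameter with $C > 0$ known, whereas the abstract describes an adaptive restart scheme precisely because $C$ is not known in advance.

```latex
\begin{align*}
\text{fixed opponent } (C = 0):\quad
  & \sqrt{K} + (CK)^{1/3} = \sqrt{K},\\
\text{worst case } (C = O(K)):\quad
  & \sqrt{K} + (CK)^{1/3} = O\big(\sqrt{K} + K^{2/3}\big) = O\big(K^{2/3}\big),\\
\text{balancing } \eta C = \sqrt{K/\eta}:\quad
  & \eta = (K/C^{2})^{1/3} \;\Rightarrow\; \eta C = \sqrt{K/\eta} = (CK)^{1/3}.
\end{align*}
```

The last line shows where the $(CK)^{1/3}$ term in the final bound plausibly comes from: it is the value of $\eta C + \sqrt{K/\eta}$, up to a factor of 2, when the two terms are equalized.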

#25 Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

Authors: Gagik Magakyan, Amirhossein Reisizadeh, Chanwoo Park, Pablo A. Parrilo, Asuman Ozdaglar

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07218

Abstract:
Adaptability has been regarded as a central feature of foundation models, enabling them to adapt effectively to unseen downstream tasks. Parameter-efficient fine-tuning methods such as the celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks should be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

#26 Privately Learning Decision Lists and a Differentially Private Winnow

privacy

Authors: Mark Bun, William Fang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07370

Abstract:
We give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the-art non-private guarantees.

#27 Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent

privacy

Authors: Shota Imai, Sota Nishiyama, Masaaki Imaizumi

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07378

Abstract:
The dynamics of gradient-based training in neural networks often exhibit nontrivial structures; hence, understanding them remains a central challenge in theoretical machine learning. In particular, a concept of feature unlearning, in which a neural network progressively loses previously learned features over long training, has gained attention. In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch stochastic gradient, then derive differential equations with different time scales, revealing the mechanism and conditions for feature unlearning to occur. Specifically, we utilize the fast-slow dynamics: while an alignment of first-layer weights develops rapidly, the second-layer weights develop slowly. The direction of a flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. We give numerical validation of the result, and derive theoretical grounding and scaling laws of the feature unlearning. Our results yield the following insights: (i) the strength of the primary nonlinear term in data induces the feature unlearning, and (ii) an initial scale of the second-layer weights mitigates the feature unlearning. Technically, our analysis utilizes Tensor Programs and the singular perturbation theory.

#28 Achieving Optimal Static and Dynamic Regret Simultaneously in Bandits with Deterministic Losses

Authors: Jian Qian, Chen-Yu Wei

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07418

Abstract:
In adversarial multi-armed bandits, two performance measures are commonly used: static regret, which compares the learner to the best fixed arm, and dynamic regret, which compares it to the best sequence of arms. While optimal algorithms are known for each measure individually, there is no known algorithm achieving optimal bounds for both simultaneously. Marinov and Zimmert [2021] first showed that such simultaneous optimality is impossible against an adaptive adversary. Our work takes a first step to demonstrate its possibility against an oblivious adversary when losses are deterministic. First, we extend the impossibility result of Marinov and Zimmert [2021] to the case of deterministic losses. Then, we present an algorithm achieving optimal static and dynamic regret simultaneously against an oblivious adversary. Together, they reveal a fundamental separation between adaptive and oblivious adversaries when multiple regret benchmarks are considered simultaneously. It also provides new insight into the long open problem of simultaneously achieving optimal regret against switching benchmarks of different numbers of switches. Our algorithm uses negative static regret to compensate for the exploration overhead incurred when controlling dynamic regret, and leverages Blackwell approachability to jointly control both regrets. This yields a new model selection procedure for bandits that may be of independent interest.

#29 Data-Aware and Scalable Sensitivity Analysis for Decision Tree Ensembles

Authors: Namrita Varshney, Ashutosh Gupta, Arhaan Ahmad, Tanay V. Tayal, S. Akshay

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07453

Abstract:
Decision tree ensembles are widely used in critical domains, making robustness and sensitivity analysis essential to their trustworthiness. We study the feature sensitivity problem, which asks whether an ensemble is sensitive to a specified subset of features -- such as protected attributes -- whose manipulation can alter model predictions. Existing approaches often yield examples of sensitivity that lie far from the training distribution, limiting their interpretability and practical value. We propose a data-aware sensitivity framework that constrains the sensitive examples to remain close to the dataset, thereby producing realistic and interpretable evidence of model weaknesses. To this end, we develop novel techniques for data-aware search using a combination of mixed-integer linear programming (MILP) and satisfiability modulo theories (SMT) encodings. Our contributions are fourfold. First, we strengthen the NP-hardness result for sensitivity verification, showing it holds even for trees of depth 1. Second, we develop MILP-optimizations that significantly speed up sensitivity verification for single ensembles and for the first time can also handle multiclass tree ensembles. Third, we introduce a data-aware framework generating realistic examples close to the training distribution. Finally, we conduct an extensive experimental evaluation on large tree ensembles, demonstrating scalability to ensembles with up to 800 trees of depth 8, achieving substantial improvements over the state of the art. This framework provides a practical foundation for analyzing the reliability and fairness of tree-based models in high-stakes applications.

#30 Bandit Allocational Instability

Authors: Yilun Chen, Jiaqi Lu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07472

Abstract:
When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T=\Omega(T^{\frac{3}{2}})$ as $T\rightarrow\infty$, as long as $R_T=o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale; while any algorithm with sublinear worst-case regret must necessarily incur ${S}_T= \omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T=\tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference, when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).
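
The two consequences stated above follow from the product bound by simple division. The sketch below assumes the standard minimax regret rate $R_T = \Theta(\sqrt{T})$ for a fixed number of arms, which the abstract implies but does not state explicitly.

```latex
\begin{align*}
R_T = \Theta(\sqrt{T}) \;&\Rightarrow\;
  S_T = \Omega\!\left(\frac{T^{3/2}}{R_T}\right) = \Omega(T),\\
R_T = o(T) \;&\Rightarrow\;
  S_T = \Omega\!\left(\frac{T^{3/2}}{R_T}\right) = \omega(\sqrt{T}).
\end{align*}
```

Since an arm's number of pulls lies in $[0, T]$, its standard deviation is at most $T/2$, so the first line gives $S_T = \Theta(T)$.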

#31 Statistical inference after variable selection in Cox models: A simulation study

Authors: Lena Schemet, Sarah Friedrich-Welz

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07477

Abstract:
Choosing relevant predictors is central to the analysis of biomedical time-to-event data. Classical frequentist inference, however, presumes that the set of covariates is fixed in advance and does not account for data-driven variable selection. As a consequence, naive post-selection inference may be biased and misleading. In right-censored survival settings, these issues may be further exacerbated by the additional uncertainty induced by censoring. We investigate several inference procedures applied after variable selection for the coefficients of the Lasso and its extension, the adaptive Lasso, in the context of the Cox model. The methods considered include sample splitting, exact post-selection inference, and the debiased Lasso. Their performance is examined in a neutral simulation study reflecting realistic covariate structures and censoring rates commonly encountered in biomedical applications. To complement the simulation results, we illustrate the practical behavior of these procedures in an applied example using a publicly available survival dataset.

#32 Deriving Neural Scaling Laws from the statistics of natural language

Authors: Francesco Cagnetta, Allan Raventós, Surya Ganguli, Matthieu Wyart

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07488

Abstract:
Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.

#33 Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction

Authors: Antoine Gonon, Alexandre Cordonnier, Nicolas Boumal

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07562

Abstract:
Match-and-copy is a core retrieval primitive used at inference time by large language models to retrieve a matching token from the context then copy its successor. Yet, understanding how this behavior emerges on natural data is challenging because retrieval and memorization are entangled. To disentangle the two, we introduce Gaussian Match-and-Copy (GMC), a minimalist benchmark that isolates long-range retrieval through pure second-order correlation signals. Numerical investigations show that this task retains key qualitative aspects of how Transformers develop match-and-copy circuits in practice, and separates architectures by their retrieval capabilities. We also analyze the optimization dynamics in a simplified attention setting. Although many solutions are a priori possible under a regression objective, including ones that do not implement retrieval, we identify an implicit-bias regime in which gradient descent drives the parameters to diverge while their direction aligns with the max-margin separator, yielding hard match selection. We prove this max-margin alignment for GD trajectories that reach vanishing empirical loss under explicit technical conditions.

#34 Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

Authors: Polina Gordienko, Christoph Jansen, Julian Rodemann, Georg Schollmeyer

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07593

Abstract:
Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When trying to turn these metrics into a single ranking, natural aggregation procedures can become incoherent or unstable to changes in the model set. We formalize this aggregation as a social choice problem where each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which these disappear, and meaningful multi-criteria benchmarking becomes possible. In particular, we deal with three restrictions on the combinations of rankings and prove that on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.

#35 Fast Rerandomization for Balancing Covariates in Randomized Experiments: A Metropolis-Hastings Framework

Authors: Jiuyao Lu, Tianruo Zhang, Ke Zhu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07613

Abstract:
Balancing covariates is critical for credible and efficient randomized experiments. Rerandomization addresses this by repeatedly generating treatment assignments until covariate balance meets a prespecified threshold. By shrinking this threshold, it can achieve arbitrarily strong balance, with established results guaranteeing optimal estimation and valid inference in both finite-sample and asymptotic settings across diverse complex experimental settings. Despite its rigorous theoretical foundations, practical use is limited by the extreme inefficiency of rejection sampling, which becomes prohibitively slow under small thresholds and often forces practitioners to adopt suboptimal settings, leading to degraded performance. Existing work focusing on acceleration typically fails to maintain uniformity over the acceptable assignment space, thus losing the theoretical grounding of classical rerandomization. Building upon a Metropolis-Hastings framework, we address this challenge by introducing an additional sampling-importance resampling step, which restores uniformity and preserves statistical guarantees. Our proposed algorithm, PSRSRR, achieves speedups ranging from 10 to 10,000 times while maintaining exact and asymptotic validity, as demonstrated by simulations and two real-data applications.
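
For context, the classical rejection-sampling baseline that the abstract describes as inefficient can be sketched as follows. This is not the PSRSRR algorithm: it is a minimal illustration of plain rerandomization, assuming a Mahalanobis-distance balance criterion and complete randomization with a fixed number of treated units. The names `mahalanobis_imbalance` and `rerandomize` and all parameter values are hypothetical.

```python
import numpy as np

def mahalanobis_imbalance(X, assign):
    """Mahalanobis distance between treated and control covariate means."""
    n1, n0 = assign.sum(), (1 - assign).sum()
    diff = X[assign == 1].mean(axis=0) - X[assign == 0].mean(axis=0)
    # Covariance of the mean difference under complete randomization
    cov = np.cov(X, rowvar=False) * (1.0 / n1 + 1.0 / n0)
    return float(diff @ np.linalg.solve(cov, diff))

def rerandomize(X, n_treated, threshold, rng, max_draws=100_000):
    """Redraw assignments until the imbalance falls below the threshold."""
    base = np.zeros(X.shape[0], dtype=int)
    base[:n_treated] = 1
    for draws in range(1, max_draws + 1):
        assign = rng.permutation(base)          # one complete randomization
        if mahalanobis_imbalance(X, assign) <= threshold:
            return assign, draws                # accepted assignment and cost
    raise RuntimeError("no acceptable assignment found within max_draws")

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                    # synthetic covariates
assign, draws = rerandomize(X, n_treated=20, threshold=0.5, rng=rng)
```

The expected number of draws is the reciprocal of the acceptance probability, so shrinking `threshold` makes this loop progressively slower, which is the bottleneck the abstract sets out to remove.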

#36 Dense Neural Networks are not Universal Approximators

Authors: Levi Rauchwerger, Stefanie Jegelka, Ron Levie

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07618

Abstract:
We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

#37 Fast Response or Silence: Conversation Persistence in an AI-Agent Social Network

agent

Authors: Aysajan Eziz

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07667

Abstract:
Autonomous AI agents are beginning to populate social platforms, but it is still unclear whether they can sustain the back-and-forth needed for extended coordination. We study Moltbook, an AI-agent social network, using a first-week snapshot and introduce interaction half-life: how quickly a comment's chance of receiving a direct reply fades as the comment ages. Across tens of thousands of commented threads, Moltbook discussions are dominated by first-layer reactions rather than extended chains. Most comments never receive a direct reply, reciprocal back-and-forth is rare, and when replies do occur they arrive almost immediately -- typically within seconds -- implying persistence on the order of minutes rather than hours. Moltbook is often described as running on an approximately four-hour "heartbeat" check-in schedule; using aggregate spectral tests on the longest contiguous activity window, we do not detect a reliable four-hour rhythm in this snapshot, consistent with jittered or out-of-phase individual schedules. A contemporaneous Reddit baseline analyzed with the same estimators shows substantially deeper threads and much longer reply persistence. Overall, early agent social interaction on Moltbook fits a "fast response or silence" regime, suggesting that sustained multi-step coordination will likely require explicit memory, thread resurfacing, and re-entry scaffolds.

#38 CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Authors: Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07915

Abstract:
Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The code and datasets are available at https://github.com/huiyang-yi/CausalCompass.

#39 When Is Compositional Reasoning Learnable from Verifiable Rewards?

Authors: Daniel Barzilai, Yotam Wolf, Ronen Basri

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.07992

Abstract:
The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines if such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.

#40 Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection

Authors: Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08003

Abstract:
Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
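
The greedy selection idea can be sketched on a toy example. This is an illustration under stated assumptions, not the authors' algorithm: it uses a naive plug-in estimate of the mutual information between the label and the joint predictions of the chosen models, and the names `plug_in_mi` and `greedy_select` and the synthetic data are hypothetical.

```python
import numpy as np
from collections import Counter

def plug_in_mi(labels, preds):
    """Plug-in estimate (in nats) of I(label; joint predictions of chosen models)."""
    n = len(labels)
    keys = [tuple(row) for row in preds]        # joint prediction per example
    c_xy = Counter(zip(keys, labels))
    c_x = Counter(keys)
    c_y = Counter(labels)
    return sum((c / n) * np.log(c * n / (c_x[x] * c_y[y]))
               for (x, y), c in c_xy.items())

def greedy_select(labels, all_preds, budget):
    """Add, one at a time, the model that most increases the estimated MI."""
    chosen = []
    remaining = list(range(all_preds.shape[1]))
    for _ in range(budget):
        scores = {m: plug_in_mi(labels, all_preds[:, chosen + [m]])
                  for m in remaining}
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Deterministic toy data: model 1 duplicates model 0 (accuracy 5/6),
# model 2 is less accurate (4/5) but its errors are independent of model 0's.
i = np.arange(1200)
labels = i % 2
m0 = np.where((i // 2) % 6 == 0, 1 - labels, labels)
m1 = m0.copy()
m2 = np.where((i // 12) % 5 == 0, 1 - labels, labels)
all_preds = np.column_stack([m0, m1, m2])

chosen = greedy_select(labels, all_preds, budget=2)   # → [0, 2]
```

In this toy case the second pick is the less accurate but independent model rather than the duplicate, which adds no information once model 0 is selected.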

#41 Sharp analysis of linear ensemble sampling

Authors: Arya Akhavan, David Janz, Csaba Szepesvári

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08026

Abstract:
We analyse linear ensemble sampling (ES) with standard Gaussian perturbations in stochastic linear bandits. We show that for ensemble size $m=\Theta(d\log n)$, ES attains $\tilde O(d^{3/2}\sqrt n)$ high-probability regret, closing the gap to the Thompson sampling benchmark while keeping computation comparable. The proof brings a new perspective on randomized exploration in linear bandits by reducing the analysis to a time-uniform exceedance problem for $m$ independent Brownian motions. Intriguingly, this continuous-time lens is not forced; it appears natural--and perhaps necessary: the discrete-time problem seems to be asking for a continuous-time solution, and we know of no other way to obtain a sharp ES bound.

#42 GAAVI: Global Asymptotic Anytime Valid Inference for the Conditional Mean Function

Authors: Brian M Cho, Raaz Dwivedi, Nathan Kallus

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08096

Abstract:
Inference on the conditional mean function (CMF) is central to tasks from adaptive experimentation to optimal treatment assignment and algorithmic fairness auditing. In this work, we provide a novel asymptotic anytime-valid test for a CMF global null (e.g., that all conditional means are zero) and contrasts between CMFs, enabling experimenters to make high confidence decisions at any time during the experiment beyond a minimum sample size. We provide mild conditions under which our tests achieve (i) asymptotic type-I error guarantees, (ii) power one, and, unlike past tests, (iii) optimal sample complexity relative to Gaussian location testing. By inverting our tests, we show how to construct function-valued asymptotic confidence sequences for the CMF and contrasts thereof. Experiments on both synthetic and real-world data show our method is well-powered across various distributions while preserving the nominal error rate under continuous monitoring.

#43 Mutual information and task-relevant latent dimensionality

Authors: Paarth Gulati, Eslam Abdelaleem, Audrey Sederberg, Ilya Nemenman

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08105

Abstract:
Estimating the dimensionality of the latent representation needed for prediction -- the task-relevant dimension -- is a difficult, largely unsolved problem with broad scientific applications. We cast it as an Information Bottleneck question: what embedding bottleneck dimension is sufficient to compress predictor and predicted views while preserving their mutual information (MI). This repurposes neural MI estimators for dimensionality estimation. We show that standard neural estimators with separable/bilinear critics systematically inflate the inferred dimension, and we address this by introducing a hybrid critic that retains an explicit dimensional bottleneck while allowing flexible nonlinear cross-view interactions, thereby preserving the latent geometry. We further propose a one-shot protocol that reads off the effective dimension from a single over-parameterized hybrid model, without sweeping over bottleneck sizes. We validate the approach on synthetic problems with known task-relevant dimension. We extend the approach to intrinsic dimensionality by constructing paired views of a single dataset, enabling comparison with classical geometric dimension estimators. In noisy regimes where those estimators degrade, our approach remains reliable. Finally, we demonstrate the utility of the method on multiple physics datasets.

#44 Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Authors: H. Martin Gillis, Isaac Xu, Thomas Trappenberg

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08142

Abstract:
Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (i.e., data-related) and epistemic (i.e., model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models. An open-source implementation is available at: https://github.com/nextdevai/vge.

#45 A second order regret bound for NormalHedge

Authors: Yoav Freund, Nicholas J. A. Harvey, Victor S. Portella, Yabing Qi, Yu-Xiang Wang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08151

Abstract:
We consider the problem of prediction with expert advice for "easy" sequences. We show that a variant of NormalHedge enjoys a second-order $\epsilon$-quantile regret bound of $O\big(\sqrt{V_T \log(V_T/\epsilon)}\big)$ when $V_T > \log N$, where $V_T$ is the cumulative second moment of instantaneous per-expert regret averaged with respect to a natural distribution determined by the algorithm. The algorithm is motivated by a continuous time limit using Stochastic Differential Equations. The discrete time analysis uses self-concordance techniques.

#46 Interpretable Dynamic Network Modeling of Tensor Time Series via Kronecker Time-Varying Graphical Lasso

Authors: Shingo Higashiguchi, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08197

Abstract:
With the rapid development of web services, large amounts of time series data are generated and accumulated across various domains such as finance, healthcare, and online platforms. As such data often co-evolves with multiple variables interacting with each other, estimating the time-varying dependencies between variables (i.e., the dynamic network structure) has become crucial for accurate modeling. However, real-world data is often represented as tensor time series with multiple modes, resulting in large, entangled networks that are hard to interpret and computationally intensive to estimate. In this paper, we propose Kronecker Time-Varying Graphical Lasso (KTVGL), a method designed for modeling tensor time series. Our approach estimates mode-specific dynamic networks in a Kronecker product form, thereby avoiding overly complex entangled structures and producing interpretable modeling results. Moreover, the partitioned network structure prevents the exponential growth of computational time with data dimension. In addition, our method can be extended to stream algorithms, making the computational time independent of the sequence length. Experiments on synthetic data show that the proposed method achieves higher edge estimation accuracy than existing methods while requiring less computation time. To further demonstrate its practical value, we also present a case study using real-world data. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/KTVGL.
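
As a rough illustration of why a Kronecker product form keeps the estimated network small, the sketch below builds one joint matrix from two mode-specific matrices and compares parameter counts. It assumes each mode-specific network is stored as a square matrix (as a graphical lasso precision matrix would be). The helper names `kron` and `parameter_counts` and the toy values are hypothetical, and this is not the KTVGL implementation.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def parameter_counts(mode_sizes):
    """Entries in one joint matrix versus the total over mode-specific matrices."""
    joint_dim = 1
    for p in mode_sizes:
        joint_dim *= p
    return joint_dim ** 2, sum(p * p for p in mode_sizes)


# Toy mode-specific matrices (illustrative values only).
theta_a = [[1, 2], [3, 4]]
theta_b = [[0, 1], [1, 0]]
joint = kron(theta_a, theta_b)  # 4 x 4 matrix indexed by (mode A, mode B) pairs

# Three modes of sizes 10, 20 and 30: joint entries vs. factored entries.
full, factored = parameter_counts([10, 20, 30])
print(full, factored)
```

The joint matrix grows with the square of the product of the mode sizes, while the factored form grows only with the sum of their squares. This is one way to read the abstract's claim about avoiding growth in computation with data dimension.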

#47 CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization

Authors: Hyungseok Song, Deunsol Yoon, Kanghoon Lee, Han-Seul Jeong, Soonyoung Lee, Woohyung Lim

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08210

Abstract:
Heatmap-based solvers have emerged as a promising paradigm for Combinatorial Optimization (CO). However, we argue that the dominant Supervised Learning (SL) training paradigm suffers from a fundamental objective mismatch: minimizing imitation loss (e.g., cross-entropy) does not guarantee solution cost minimization. We dissect this mismatch into two deficiencies: Decoder-Blindness (being oblivious to the non-differentiable decoding process) and Cost-Blindness (prioritizing structural imitation over solution quality). We empirically demonstrate that these intrinsic flaws impose a hard performance ceiling. To overcome this limitation, we propose CADO (Cost-Aware Diffusion models for Optimization), a streamlined Reinforcement Learning fine-tuning framework that formulates the diffusion denoising process as an MDP to directly optimize the post-decoded solution cost. We introduce Label-Centered Reward, which repurposes ground-truth labels as unbiased baselines rather than imitation targets, and Hybrid Fine-Tuning for parameter-efficient adaptation. CADO achieves state-of-the-art performance across diverse benchmarks, validating that objective alignment is essential for unlocking the full potential of heatmap-based solvers.

#48 Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics

Authors: Gunn Kim

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08216

Abstract:
Although the Transformer architecture has revolutionized artificial intelligence, its underlying mechanisms remain largely heuristic and lack a unified physical theory. In this work, we propose a first-principles framework for information dynamics, treating the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold with the Fisher information metric, we derive the intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state that minimizes the Helmholtz free energy of the information gas. In addition, we identify the query-key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. This theory establishes the first law of information thermodynamics, unifying inference (mechanical work) and learning (chemical evolution). It also explains emergent phenomena, such as scaling laws and grokking, as phase transitions characterized by the divergence of specific heat. Finally, we discuss how rotational symmetry breaking in the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary positional embeddings (RoPE). Our work connects Statistical Physics and Deep Learning, laying the groundwork for a general theory of physics-based intelligence.

#49 Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization

Authors: Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08232

Abstract:
We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.

#50 Noise Stability of Transformer Models

Authors: Themistoklis Haris, Zihan Zhang, Yuichi Yoshida

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08287

Abstract:
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35\%$ and $75\%$ respectively. Our results sculpt a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.
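
To make the notion concrete, the sketch below estimates noise stability in its classical Boolean form, E[f(x) f(y)], where y is a copy of x with every coordinate independently flipped with probability (1 - rho) / 2. This is the textbook definition from Boolean function analysis, not the real-valued extension or the regularizer described in the abstract. The function names and sample sizes are illustrative assumptions.

```python
import random


def noise_stability(f, n, rho, samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x) f(y)] for rho-correlated x, y in {-1, 1}^n."""
    rng = random.Random(seed)
    flip_prob = (1.0 - rho) / 2.0  # each coordinate of y disagrees with x with this probability
    total = 0.0
    for _ in range(samples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        y = [-xi if rng.random() < flip_prob else xi for xi in x]
        total += f(x) * f(y)
    return total / samples


def dictator(x):
    """Depends on a single coordinate (a 1-junta)."""
    return x[0]


def parity(x):
    """Depends on every coordinate."""
    out = 1
    for xi in x:
        out *= xi
    return out


stab_dictator = noise_stability(dictator, n=5, rho=0.8)
stab_parity = noise_stability(parity, n=5, rho=0.8)
print(stab_dictator, stab_parity)
```

In exact terms the dictator has stability rho and the parity on n bits has stability rho^n, so a function that depends on few coordinates comes out as markedly more stable. That is the kind of junta-like dependence the metric is meant to capture.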

#51 Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback

Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08307

Abstract:
In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
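
For readers unfamiliar with Inverse-Gap-Weighting, the sketch below shows its standard single-step (contextual bandit) form: each non-greedy action gets probability inversely proportional to its estimated reward gap, and the greedy action takes the remaining mass. The paper's multi-step variant over MDP policies is not specified in the abstract, so this is only the textbook rule. The function name and the parameter `gamma` are assumptions for illustration.

```python
def inverse_gap_weighting(scores, gamma):
    """Map estimated rewards to action probabilities via inverse gap weighting.

    scores: estimated reward per action (higher is better).
    gamma:  exploration parameter; larger values concentrate mass on the greedy action.
    """
    k = len(scores)
    best = max(range(k), key=lambda a: scores[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = scores[best] - scores[a]
            probs[a] = 1.0 / (k + gamma * gap)
    # Every non-greedy probability is at most 1/k, so the remainder is at least 1/k.
    probs[best] = 1.0 - sum(probs)
    return probs


probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
print(probs)
```

With `gamma = 0` the rule reduces to uniform exploration. As `gamma` grows, actions with large estimated gaps are sampled less and less often.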

#52 Fast Flow Matching based Conditional Independence Tests for Causal Discovery

Authors: Shunyu Zhao, Yanfeng Yang, Shuai Li, Kenji Fukumizu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08315

Abstract:
Constraint-based causal discovery methods require a large number of conditional independence (CI) tests, which severely limits their practical applicability due to high computational complexity. Therefore, it is crucial to design an algorithm that accelerates each individual test. To this end, we propose the Flow Matching-based Conditional Independence Test (FMCIT). The proposed test leverages the high computational efficiency of flow matching and requires the model to be trained only once throughout the entire causal discovery procedure, substantially accelerating causal discovery. According to numerical experiments, FMCIT effectively controls type-I error and maintains high testing power under the alternative hypothesis, even in the presence of high-dimensional conditioning sets. We further integrate FMCIT into a two-stage guided PC skeleton learning framework, termed GPC-FMCIT, which combines fast screening with guided, budgeted refinement using FMCIT. This design yields explicit bounds on the number of CI queries while maintaining high statistical power. Experiments on synthetic and real-world causal discovery tasks demonstrate favorable accuracy-efficiency trade-offs over existing CI testing methods and PC variants.

#53 All ERMs Can Fail in Stochastic Convex Optimization: Lower Bounds in Linear Dimension

Authors: Tal Burla, Roi Livni

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08350

Abstract:
We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction, we also show that (constrained) Gradient Descent potentially overfits when the horizon and learning rate grow with respect to the sample size. Specifically, we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(\eta T/m)$ and existing lower bounds from previous constructions.

#54 Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization

Authors: Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, Marios M. Polycarpou

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08467

Abstract:
Multivariate time series (MTS) anomaly diagnosis, which encompasses both anomaly detection and localization, is critical for the safety and reliability of complex, large-scale real-world systems. The vast majority of existing anomaly diagnosis methods offer limited theoretical insights, especially for anomaly localization, which is a vital but largely unexplored area. The aim of this contribution is to study the learning process of a Transformer when applied to MTS by revealing connections to statistical time series methods. Based on these theoretical insights, we propose the Attention Low-Rank Transformer (ALoRa-T) model, which applies low-rank regularization to self-attention, and we introduce the Attention Low-Rank score, effectively capturing the temporal characteristics of anomalies. Finally, to enable anomaly localization, we propose the ALoRa-Loc method, a novel approach that associates anomalies with specific variables by quantifying interrelationships among time series. Extensive experiments and real data analysis show that the proposed methodology significantly outperforms state-of-the-art methods in both detection and localization tasks.

#55 Learning Credal Ensembles via Distributionally Robust Optimization

Authors: Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08470

Abstract:
Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.

#56 Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets

Authors: Fredrik Cumlin

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08552

Abstract:
Subjective ratings contain inherent noise that limits the model-human correlation, but this reliability issue is rarely quantified. In this paper, we present $\rho$-Perfect, a practical estimation of the highest achievable correlation of a model on subjectively rated datasets. We define $\rho$-Perfect to be the correlation between a perfect predictor and human ratings, and derive an estimate of the value based on heteroscedastic noise scenarios, a common occurrence in subjectively rated datasets. We show that $\rho$-Perfect squared estimates test-retest correlation and use this to validate the estimate. We demonstrate the use of $\rho$-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.
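
The claim that the squared ceiling correlation matches test-retest correlation can be checked on simulated data. The sketch below assumes each item has a latent true quality and that every rating adds independent zero-mean noise with an item-specific standard deviation (a simple heteroscedastic model chosen for illustration). It computes the correlations directly from the simulated ground truth and does not reproduce the paper's estimator. All variable names and distribution parameters are hypothetical.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


rng = random.Random(0)
n_items = 20000
true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
# Heteroscedastic noise: each item has its own rating-noise standard deviation.
noise_sd = [rng.uniform(0.2, 1.2) for _ in range(n_items)]
ratings_a = [q + rng.gauss(0.0, s) for q, s in zip(true_quality, noise_sd)]
ratings_b = [q + rng.gauss(0.0, s) for q, s in zip(true_quality, noise_sd)]

ceiling = pearson(true_quality, ratings_a)   # perfect predictor vs. observed ratings
test_retest = pearson(ratings_a, ratings_b)  # two independent rating rounds
print(ceiling ** 2, test_retest)
```

Under these assumptions both quantities converge to the ratio of true-score variance to total rating variance, so the two printed numbers should agree up to sampling error.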

#57 CauScale: Neural Causal Discovery at Scale

Authors: Bo Peng, Sirui Chen, Jiaguo Tian, Yu Qiao, Chaochao Lu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08629

Abstract:
Causal discovery is essential for advancing data-driven fields such as scientific AI and data analysis, yet existing approaches face significant time- and space-efficiency bottlenecks when scaling to large graphs. To address this challenge, we present CauScale, a neural architecture designed for efficient causal discovery that scales inference to graphs with up to 1000 nodes. CauScale improves time efficiency via a reduction unit that compresses data embeddings and improves space efficiency by adopting tied attention weights to avoid maintaining axis-specific attention maps. To maintain high causal discovery accuracy, CauScale adopts a two-stream design: a data stream extracts relational evidence from high-dimensional observations, while a graph stream integrates statistical graph priors and preserves key structural signals. CauScale successfully scales to 500-node graphs during training, where prior work fails due to space limitations. Across testing data with varying graph scales and causal mechanisms, CauScale achieves 99.6% mAP on in-distribution data and 84.4% on out-of-distribution data, while delivering inference speedups of 4 to 13,000 times over prior methods. Our project page is at https://github.com/OpenCausaLab/CauScale.

#58 The Theory and Practice of MAP Inference over Non-Convex Constraints

Authors: Leander Kurscheidt, Gabriele Masina, Roberto Sebastiani, Antonio Vergari

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08681

Abstract:
In many safety-critical settings, probabilistic ML systems have to make predictions subject to algebraic constraints, e.g., predicting the most likely trajectory that does not cross obstacles. These real-world constraints are rarely convex, nor are the densities considered (log-)concave. This makes computing this constrained maximum a posteriori (MAP) prediction efficiently and reliably extremely challenging. In this paper, we first investigate under which conditions we can perform constrained MAP inference over continuous variables exactly and efficiently and devise a scalable message-passing algorithm for this tractable fragment. Then, we devise a general constrained MAP strategy that interleaves partitioning the domain into convex feasible regions with numerical constrained optimization. We evaluate both methods on synthetic and real-world benchmarks, showing our approaches outperform constraint-agnostic baselines, and scale to complex densities intractable for SoTA exact solvers.

#59 Data Reconstruction: Identifiability and Optimization with Sample Splitting

privacy

Authors: Yujie Shen, Zihan Wang, Jian Qian, Qi Lei

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08723

Abstract:
Training data reconstruction from KKT conditions has shown striking empirical success, yet it remains unclear when the resulting KKT equations have unique solutions and, even in identifiable regimes, how to reliably recover solutions by optimization. This work focuses on these two complementary questions: identifiability and optimization. On the identifiability side, we discuss sufficient conditions for the KKT system of two-layer networks with polynomial activations to uniquely determine the training data, providing a theoretical explanation of when and why reconstruction is possible. On the optimization side, we introduce sample splitting, a curvature-aware refinement step applicable to general reconstruction objectives (not limited to KKT-based formulations): it creates additional descent directions to escape poor stationary points and refine solutions. Experiments demonstrate that augmenting several existing reconstruction methods with sample splitting consistently improves reconstruction performance.

#60 Near-optimal Swap Regret Minimization for Convex Losses

Authors: Lunjia Hu, Jon Schneider, Yifan Wu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08862

Abstract:
We give a randomized online algorithm that guarantees near-optimal $\widetilde O(\sqrt T)$ expected swap regret against any sequence of $T$ adaptively chosen Lipschitz convex losses on the unit interval. This improves the previous best bound of $\widetilde O(T^{2/3})$ and answers an open question of Fishelson et al. [2025b]. In addition, our algorithm is efficient: it runs in $\mathsf{poly}(T)$ time. A key technical idea we develop to obtain this result is to discretize the unit interval into bins at multiple scales of granularity and simultaneously use all scales to make randomized predictions, which we call multi-scale binning and may be of independent interest. A direct corollary of our result is an efficient online algorithm for minimizing the calibration error for general elicitable properties. This result does not require the Lipschitzness assumption of the identification function needed in prior work, making it applicable to median calibration, for which we achieve the first $\widetilde O(\sqrt T)$ calibration error guarantee.

#61 Positive Distribution Shift as a Framework for Understanding Tractable Learning

Authors: Marko Medvedev, Idan Attias, Elisabetta Cornacchia, Theodor Misiakiewicz, Gal Vardi, Nathan Srebro

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08907

Abstract:
We study a setting where the goal is to learn a target function f(x) with respect to a target distribution D(x), but training is done on i.i.d. samples from a different training distribution D'(x), labeled by the true target f(x). Such a distribution shift (here in the form of covariate shift) is usually viewed negatively, as hurting or making learning harder, and the traditional distribution shift literature is mostly concerned with limiting or avoiding this negative effect. In contrast, we argue that with a well-chosen D'(x), the shift can be positive and make learning easier -- a perspective called Positive Distribution Shift (PDS). Such a perspective is central to contemporary machine learning, where much of the innovation is in finding good training distributions D'(x), rather than changing the training algorithm. We further argue that the benefit is often computational rather than statistical, and that PDS allows computationally hard problems to become tractable even using standard gradient-based training. We formalize different variants of PDS, show how certain hard classes are easily learnable under PDS, and make connections with membership query learning.

#62 GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

Authors: Kateřina Henclová, Václav Šmídl

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08913

Abstract:
Selecting interpretable feature sets in underdetermined ($n \ll p$) and highly correlated regimes constitutes a fundamental challenge in data science, particularly when analyzing physical measurements. In such settings, multiple distinct sparse subsets may explain the response equally well. Identifying these alternatives is crucial for generating domain-specific insights into the underlying mechanisms, yet conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. We present GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational Bayesian framework specifically designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. Unlike sequential greedy approaches, GEMSS optimizes the entire ensemble of solutions within a single objective function via stochastic gradient descent. The method is validated on a comprehensive benchmark comprising 128 synthetic experiments across classification and regression tasks. Results demonstrate that GEMSS scales effectively to high-dimensional settings ($p=5000$) with sample size as small as $n = 50$, generalizes seamlessly to continuous targets, handles missing data natively, and exhibits remarkable robustness to class imbalance and Gaussian noise. GEMSS is available as a Python package 'gemss' at PyPI. The full GitHub repository at https://github.com/kat-er-ina/gemss/ also includes a free, easy-to-use application suitable for non-coders.

#63 When do neural ordinary differential equations generalize on complex networks?

privacy

Authors: Moritz Laber, Tina Eliassi-Rad, Brennan Klein

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08980

Abstract:
Neural ordinary differential equations (neural ODEs) can effectively learn dynamical systems from time series data, but their behavior on graph-structured data remains poorly understood, especially when applied to graphs with different size or structure than encountered during training. We study neural ODEs ($\mathtt{nODE}$s) with vector fields following the Barabási-Barzel form, trained on synthetic data from five common dynamical systems on graphs. Using the $\mathbb{S}^1$-model to generate graphs with realistic and tunable structure, we find that degree heterogeneity and the type of dynamical system are the primary factors in determining $\mathtt{nODE}$s' ability to generalize across graph sizes and properties. This extends to $\mathtt{nODE}$s' ability to capture fixed points and maintain performance amid missing data. Average clustering plays a secondary role in determining $\mathtt{nODE}$ performance. Our findings highlight $\mathtt{nODE}$s as a powerful approach to understanding complex systems but underscore challenges emerging from degree heterogeneity and clustering in realistic graphs.

#64 Universal Coefficients and Mayer-Vietoris Sequence for Groupoid Homology

Authors: Luciano Melodia

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.08998

Abstract:
We study homology of ample groupoids via the compactly supported Moore complex of the nerve. Let $A$ be a topological abelian group. For $n\ge 0$ set $C_n(\mathcal G;A) := C_c(\mathcal G_n,A)$ and define $\partial_n^A=\sum_{i=0}^n(-1)^i(d_i)_*$. This defines $H_n(\mathcal G;A)$. The theory is functorial for continuous étale homomorphisms. It is compatible with standard reductions, including restriction to saturated clopen subsets. In the ample setting it is invariant under Kakutani equivalence. We reprove Matui type long exact sequences and identify the comparison maps at chain level. For discrete $A$ we prove a natural universal coefficient short exact sequence $$0\to H_n(\mathcal G)\otimes_{\mathbb Z}A\xrightarrow{\ \iota_n^{\mathcal G}\ }H_n(\mathcal G;A)\xrightarrow{\ \kappa_n^{\mathcal G}\ }\operatorname{Tor}_1^{\mathbb Z}\bigl(H_{n-1}(\mathcal G),A\bigr)\to 0.$$ The key input is the chain level isomorphism $C_c(\mathcal G_n,\mathbb Z)\otimes_{\mathbb Z}A\cong C_c(\mathcal G_n,A)$, which reduces the groupoid statement to the classical algebraic UCT for the free complex $C_c(\mathcal G_\bullet,\mathbb Z)$. We also isolate the obstruction for non-discrete coefficients. For a locally compact totally disconnected Hausdorff space $X$ with a basis of compact open sets, the image of $\Phi_X:C_c(X,\mathbb Z)\otimes_{\mathbb Z}A\to C_c(X,A)$ is exactly the compactly supported functions with finite image. Thus $\Phi_X$ is surjective if and only if every $f\in C_c(X,A)$ has finite image, and for suitable $X$ one can produce compactly supported continuous maps $X\to A$ with infinite image. Finally, for a clopen saturated cover $\mathcal G_0=U_1\cup U_2$ we construct a short exact sequence of Moore complexes and derive a Mayer-Vietoris long exact sequence for $H_\bullet(\mathcal G;A)$ for explicit computations.

#65 CoinPress: Practical Private Mean and Covariance Estimation

privacy

Authors: Sourav Biswas, Yihe Dong, Gautam Kamath, Jonathan Ullman

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2006.06618

Abstract:
We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically using synthetic and real-world datasets -- showing that their asymptotic error rates match the state-of-the-art theoretical bounds, and that they concretely outperform all previous methods. Specifically, previous estimators either have weak empirical accuracy at small sample sizes, perform poorly for multivariate data, or require the user to provide strong a priori estimates for the parameters.

#66 Non-negative matrix factorization algorithms generally improve topic model fits

Authors: Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2105.13440

Abstract:
In an effort to develop topic modeling methods that can be quickly applied to large data sets, we revisit the problem of maximum-likelihood estimation in topic models. It is known, at least informally, that maximum-likelihood estimation in topic models is closely related to non-negative matrix factorization (NMF). Yet, to our knowledge, this relationship has not been exploited previously to fit topic models. We show that recent advances in NMF optimization methods can be leveraged to fit topic models very efficiently, often resulting in much better fits and in less time than existing algorithms for topic models. We also formally make the connection between the NMF optimization problem and maximum-likelihood estimation for the topic model, and using this result we show that the expectation maximization (EM) algorithm for the topic model is essentially the same as the classic multiplicative updates for NMF (the only difference being that the operations are performed in a different order). Our methods are implemented in the R package fastTopics.
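
For reference, the classic multiplicative updates mentioned above can be sketched as follows for the Poisson (generalized Kullback-Leibler) NMF objective, in the form usually attributed to Lee and Seung. This is a plain-Python illustration on a toy count matrix, not the fastTopics implementation and not the faster optimizers the abstract advocates. The names `counts`, `w`, `h` and the rank are assumptions.

```python
import math
import random


def matmul(w, h):
    """Product of an n x r matrix and an r x m matrix, both lists of rows."""
    return [[sum(w[i][k] * h[k][j] for k in range(len(h))) for j in range(len(h[0]))]
            for i in range(len(w))]


def poisson_loss(x, w, h):
    """Negative Poisson log-likelihood of counts x under rates w @ h, up to a constant."""
    wh = matmul(w, h)
    return sum(
        wh[i][j] - (x[i][j] * math.log(wh[i][j]) if x[i][j] > 0 else 0.0)
        for i in range(len(x)) for j in range(len(x[0]))
    )


def ratios(x, w, h):
    """Elementwise x / (w @ h), with zero counts mapped to zero."""
    wh = matmul(w, h)
    return [[x[i][j] / wh[i][j] if x[i][j] > 0 else 0.0 for j in range(len(x[0]))]
            for i in range(len(x))]


def multiplicative_step(x, w, h):
    """One pass of multiplicative updates: first h, then w using the new h."""
    n, m, r = len(x), len(x[0]), len(h)
    q = ratios(x, w, h)
    h = [[h[k][j] * sum(w[i][k] * q[i][j] for i in range(n)) / sum(w[i][k] for i in range(n))
          for j in range(m)] for k in range(r)]
    q = ratios(x, w, h)
    w = [[w[i][k] * sum(h[k][j] * q[i][j] for j in range(m)) / sum(h[k][j] for j in range(m))
          for k in range(r)] for i in range(n)]
    return w, h


# Toy document-by-word count matrix (illustrative values only).
counts = [[5, 3, 0, 1], [4, 2, 1, 0], [0, 1, 6, 4], [1, 0, 5, 3], [2, 2, 2, 2]]
rng = random.Random(0)
rank = 2
w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in counts]
h = [[rng.uniform(0.1, 1.0) for _ in counts[0]] for _ in range(rank)]

losses = [poisson_loss(counts, w, h)]
for _ in range(30):
    w, h = multiplicative_step(counts, w, h)
    losses.append(poisson_loss(counts, w, h))
print(losses[0], losses[-1])
```

Each update only rescales the current factors, so they stay non-negative without any projection step, and the objective does not increase from one pass to the next. Convergence is typically slow, which is consistent with the abstract's motivation for using more recent NMF optimizers.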

#67 Large Deviations of Gaussian Neural Networks with ReLU activation

model extraction

Authors: Quirin Vogel

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2405.16958

Abstract:
We prove a large deviation principle for deep neural networks with Gaussian weights and at most linearly growing activation functions, such as ReLU. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and provide a power-series expansion for the ReLU case.

#68 Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2406.03628

Abstract:
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.

#69 On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2406.19958

Abstract:
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data-generating settings and for appropriate prior choices. However, the computational properties of the widely used BART sampler proposed by Chipman et al. (2010) are not yet well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.

#70 Note on computational complexity of the Gromov-Wasserstein distance

Authors: Natalia Kravtsova

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2408.06525

Abstract:
This note addresses the computational difficulty of the Gromov-Wasserstein distance frequently mentioned in the literature. We provide details on the structure of the Gromov-Wasserstein distance optimization problem that show its non-convex quadratic nature for any instance of input data. We further illustrate the non-convexity of the problem with several explicit examples.
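
The non-convexity mentioned above can be checked numerically on a toy instance. The sketch below is not taken from the note: it assumes the common squared-loss form of the Gromov-Wasserstein objective, and the function name `gw_objective` and the two-point spaces are illustrative choices. On this instance the two endpoint couplings both give an objective of 0, while their midpoint gives 0.5, which a convex function could not do.

```python
import numpy as np

def gw_objective(C1, C2, T):
    """Squared-loss Gromov-Wasserstein objective for a coupling T (assumed form).

    Computes sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l],
    which is quadratic in T.
    """
    # diff[i, k, j, l] = (C1[i, k] - C2[j, l])**2
    diff = (C1[:, :, None, None] - C2[None, None, :, :]) ** 2
    return float(np.einsum("ikjl,ij,kl->", diff, T, T))

# Two identical two-point spaces with unit distance (illustrative toy data).
C = np.array([[0.0, 1.0], [1.0, 0.0]])

# Both couplings have uniform marginals and correspond to isometries.
T_identity = np.array([[0.5, 0.0], [0.0, 0.5]])
T_swap = np.array([[0.0, 0.5], [0.5, 0.0]])
T_mid = 0.5 * (T_identity + T_swap)  # still a valid coupling

f_identity = gw_objective(C, C, T_identity)
f_swap = gw_objective(C, C, T_swap)
f_mid = gw_objective(C, C, T_mid)

# Convexity would require f_mid <= 0.5 * (f_identity + f_swap).
print(f_identity, f_swap, f_mid)
```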

#71 Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks

Authors: Kyle Sung, Kholood Khalil, Noah Forman, Steven Samu, Anastasis Kratsios

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2502.03792

Abstract:
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.

#72 Sparsified-Learning for High-Dimensional Heavy-Tailed Locally Stationary Time Series, Concentration and Oracle Inequalities

Authors: Yingjie Wang, Mokhtar Z. Alaya, Salim Bouzebda, Xinsheng Liu

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2504.06477

Abstract:
Sparse learning is ubiquitous in many machine learning tasks. It aims to regularize the goodness-of-fit objective by adding a penalty term to encode structural constraints on the model parameters. In this paper, we develop a flexible sparse learning framework tailored to high-dimensional heavy-tailed locally stationary time series (LSTS). The data-generating mechanism incorporates a regression function that changes smoothly over time and is observed under noise belonging to the class of sub-Weibull and regularly varying distributions. We introduce a sparsity-inducing penalized estimation procedure that combines additive modeling with kernel smoothing and define an additive kernel-smoothing hypothesis class. In the presence of locally stationary dynamics, we assume exponentially decaying $\beta$-mixing coefficients to derive concentration inequalities for kernel-weighted sums of locally stationary processes with heavy-tailed noise. We further establish nonasymptotic prediction-error bounds, yielding both slow and fast convergence rates under different sparsity structures, including Lasso and total variation penalization with the least-squares loss. To support our theoretical results, we conduct numerical experiments on simulated LSTS with sub-Weibull and Pareto noise, highlighting how tail behavior affects prediction error across different covariate-dimensions as the sample size increases.

#73 Differentially Private Geodesic Regression

privacy

Authors: Aditya Kulkarni, Carlos Soto

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2504.11304

Abstract:
In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable, both of which are assumed to live in Euclidean space. Thus, geodesic regression emerged as an extension where the response variable lives on a Riemannian manifold. The parameters of geodesic regression, as with linear regression, capture the relationship of sensitive data and hence one should consider the privacy protection practices of said parameters. We consider releasing Differentially Private (DP) parameters of geodesic regression via the K-Norm Gradient (KNG) mechanism for Riemannian manifolds. We derive theoretical bounds for the sensitivity of the parameters showing they are tied to their respective Jacobi fields and hence the curvature of the space. This corroborates, and extends, recent findings of differential privacy for the Fréchet mean. We demonstrate the efficacy of our methodology on the sphere, $S^2\subset\mathbb{R}^3$, the space of symmetric positive definite matrices, and Kendall's planar shape space. Our methodology is general to any Riemannian manifold, and thus it is suitable for data in domains such as medical imaging and computer vision.

#74 Liouville PDE-based sliced-Wasserstein flow

Authors: Jayshawn Cooper, Pilhwa Lee

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.17204

Abstract:
The sliced Wasserstein flow (SWF), a nonparametric and implicit generative gradient flow, is transformed to a Liouville partial differential equation (PDE)-based formalism. First, the stochastic diffusive term from the Fokker-Planck equation-based Monte Carlo is reformulated to Liouville PDE-based transport without the diffusive term, and the involved density estimation is handled by neural-ODE normalizing flows. Next, the computation of the Wasserstein barycenter is approximated by the Liouville PDE-based SWF barycenter with the prescription of Kantorovich potentials for the induced gradient flow to generate its samples. These two changes yield improved convergence, with reduced variance, in training and testing the Liouville PDE-based SWF and SWF barycenters. Applying the generative SWF barycenter to fair regression yields competitive accuracy-fairness Pareto curves.

#75 Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

Authors: Alessio Giorlandino, Sebastian Goldt

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.24333

Abstract:
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

#76 Algorithm- and Data-Dependent Generalization Bounds for Diffusion Models

diffusion

Authors: Benjamin Dupuis, Dario Shariatian, Maxime Haddouche, Alain Durmus, Umut Simsekli

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.03849

Abstract:
Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithm- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.

#77 Scaling Laws for Uncertainty in Deep Learning

Authors: Mattia Rosso, Simone Rossi, Giulio Franzese, Markus Heinonen, Maurizio Filippone

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.09648

Abstract:
Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating model parameters in a Bayesian way. In this case, for example, we obtain $O(1/N)$ contraction rates for epistemic uncertainty with respect to the number of data $N$. However, in over-parameterized models, these guarantees do not hold, leading to largely unexplored behaviors. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger data or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: "In many applications of deep learning we have so much data available: what do we need Bayes for?". Our findings show that "so much data" is typically not enough to make epistemic uncertainty negligible.
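
As a textbook illustration of the $O(1/N)$ behaviour in an identifiable parametric model, consider a Gaussian mean with known noise variance and a Gaussian prior. This is a minimal sketch and not an experiment from the paper: the function name `posterior_variance`, the prior and noise variances, and the use of posterior variance as a stand-in for epistemic uncertainty are all assumptions made here.

```python
import numpy as np

def posterior_variance(n, noise_var=1.0, prior_var=10.0):
    """Posterior variance of a Gaussian mean after n observations.

    Conjugate Gaussian model with known noise variance (illustrative values).
    """
    return 1.0 / (1.0 / prior_var + n / noise_var)

sample_sizes = np.array([10, 100, 1_000, 10_000])
variances = np.array([posterior_variance(n) for n in sample_sizes])

# n * variance approaches noise_var, so the variance decays like 1/n.
scaled = sample_sizes * variances
print(variances)
print(scaled)
```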

#78 Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Authors: Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.12751

Abstract:
Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.

#79 The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units

Authors: Oswaldo Ludwig

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.16289

Abstract:
This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of information encoded by the associated processing unit, viewed through the lens of information theory. It argues that a high condition number, though not sufficient for effective knowledge encoding, may indicate that the unit has learned to selectively amplify and compress information. This intuition is formalized for linear units with Gaussian inputs, linking the condition number and the transformation's log-volume scaling factor to the characteristics of the output entropy and the geometric properties of the learned transformation. The analysis demonstrates that for a fixed weight norm, a concentrated distribution of singular values (high condition number) corresponds to reduced overall information transfer, indicating a specialized and efficient encoding strategy. Furthermore, the linear stage entropy bound provides an upper limit on post-activation information for contractive, element-wise nonlinearities, supporting the condition number as a scale-invariant proxy for encoding capacity in practical neural networks. An empirical case study applies these principles to guide selective fine-tuning of Large Language Models for both a new task and a new input modality. The experiments show that the proposed method, named KappaTune, effectively mitigates catastrophic forgetting. Unlike many existing catastrophic forgetting mitigation methods that rely on access to pre-training statistics, which are often unavailable, this selective fine-tuning approach offers a way to bypass this common requirement.
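
The scale invariance of the condition number is easy to verify for a single weight matrix. The sketch below is not the KappaTune procedure: it assumes a 2-D weight matrix, a random matrix standing in for trained weights, and a hypothetical helper `condition_number`.

```python
import numpy as np

def condition_number(W):
    """Ratio of the largest to the smallest singular value of a 2-D matrix."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(s[0] / s[-1])

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # stand-in for a layer's weight matrix

kappa = condition_number(W)
kappa_scaled = condition_number(3.0 * W)  # rescaling leaves the ratio unchanged
print(kappa, kappa_scaled)
```

Rescaling multiplies every singular value by the same factor, so the ratio is unchanged. A norm-based statistic would not have this property.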

#80 The Relative Instability of Model Comparison with Cross-validation

Authors: Alexandre Bayle, Lucas Janson, Lester Mackey

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2508.04409

Abstract:
Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.

#81 Beating the Winner's Curse via Inference-Aware Policy Optimization

Authors: Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.18161

Abstract:
There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. In addition, practitioners want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse -- an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements -- predicted performance improvements are often not substantiated by downstream policy evaluation. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the estimate of the policy's improvement passes a significance test during downstream policy evaluation. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that estimates the Pareto frontier using machine learning models; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.

#82 Deep Ensembles for Epistemic Uncertainty: A Frequentist Perspective

Authors: Anchit Jain, Stephen Bates

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.22063

Abstract:
Decomposing prediction uncertainty into aleatoric (irreducible) and epistemic (reducible) components is critical for the reliable deployment of machine learning systems. While the mutual information between the response variable and model parameters is a principled measure for epistemic uncertainty, it requires access to the parameter posterior, which is computationally challenging to approximate. Consequently, practitioners often rely on probabilistic predictions from deep ensembles to quantify uncertainty, which have demonstrated strong empirical performance. However, a theoretical understanding of their success from a frequentist perspective remains limited. We address this gap by first considering a bootstrap-based estimator for epistemic uncertainty, which we prove is asymptotically correct. Next, we connect deep ensembles to the bootstrap estimator by decomposing it into data variability and training stochasticity; specifically, we show that deep ensembles capture the training stochasticity component. Through empirical studies, we show that this stochasticity component constitutes the majority of epistemic uncertainty, thereby explaining the effectiveness of deep ensembles.
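
For intuition, the widely used ensemble-based plug-in estimate of the mutual information is the entropy of the averaged prediction minus the average entropy of the members. The sketch below shows that standard decomposition for a single input. It is not the bootstrap estimator analysed in the paper, and the function names and toy probabilities are assumptions.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) along the class axis."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(probs):
    """Split predictive uncertainty for ensemble outputs on one input.

    probs has shape (n_members, n_classes).
    """
    total = entropy(probs.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy(probs).mean()     # average entropy of the members
    epistemic = total - aleatoric         # plug-in mutual-information estimate
    return total, aleatoric, epistemic

# Members agree on an uncertain prediction: no epistemic component.
agree = np.array([[0.5, 0.5], [0.5, 0.5]])
# Members are individually confident but disagree: mostly epistemic.
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])

_, _, epistemic_agree = uncertainty_decomposition(agree)
_, _, epistemic_disagree = uncertainty_decomposition(disagree)
print(epistemic_agree, epistemic_disagree)
```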

#83 Understanding Fairness and Prediction Error through Subspace Decomposition and Influence Analysis

Authors: Enze Shi, Pankaj Bhagwat, Zhixian Yang, Linglong Kong, Bei Jiang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.23935

Abstract:
Machine learning models have achieved widespread success but often inherit and amplify historical biases, resulting in unfair outcomes. Traditional fairness methods typically impose constraints at the prediction level, without addressing underlying biases in data representations. In this work, we propose a principled framework that adjusts data representations to balance predictive utility and fairness. Using sufficient dimension reduction, we decompose the feature space into target-relevant, sensitive, and shared components, and control the fairness-utility trade-off by selectively removing sensitive information. We provide a theoretical analysis of how prediction error and fairness gaps evolve as shared subspaces are added, and employ influence functions to quantify their effects on the asymptotic behavior of parameter estimates. Experiments on both synthetic and real-world datasets validate our theoretical insights and show that the proposed method effectively improves fairness while preserving predictive performance.

#84 Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

Authors: Zhongjie Shi, Puyu Wang, Chenyang Zhang, Yuan Cao

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.22270

Abstract:
Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance while preserving privacy. In this paper, we investigate the generalization and privacy performance of the differentially private gradient descent (DP-GD) algorithm, a private variant of gradient descent (GD) that adds noise to the gradients at each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing trained models with poor test accuracy, whereas DP-GD can yield trained models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.

#85 Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond

Authors: Kazuma Sawaya

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.04696

Abstract:
We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.

#86 Calibrated Multi-Level Quantile Forecasting

Authors: Tiffany Ding, Isaac Gibbs, Ryan J. Tibshirani

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.23671

Abstract:
We develop an online method that guarantees calibration of quantile forecasts at multiple quantile levels simultaneously. In this work, a sequence of quantile forecasts is said to be calibrated provided that its $\alpha$-level predictions are greater than or equal to the target value at an $\alpha$ fraction of time steps, for each level $\alpha$. Our procedure, called the multi-level quantile tracker (MultiQT), is lightweight and wraps around any point or quantile forecaster to produce adjusted quantile forecasts that are guaranteed to be calibrated, even against adversarial distribution shifts. Critically, it does so while ensuring that the quantiles remain ordered, e.g., the 0.5-level quantile forecast will never be larger than the 0.6-level forecast. Moreover, the method has a no-regret guarantee, implying it will not degrade the performance of the existing forecaster (asymptotically), with respect to the quantile loss. In our experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems, while leaving the quantile loss largely unchanged or slightly improved.
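
The calibration criterion stated above, together with the ordering requirement, can be checked directly on a sequence of forecasts. The sketch below only evaluates those two properties on toy data. It does not implement MultiQT, and the helper names `empirical_coverage` and `is_ordered` are hypothetical.

```python
import numpy as np

def empirical_coverage(forecasts, targets):
    """Fraction of time steps where each quantile forecast is >= the target.

    forecasts: shape (T, K), one column per quantile level.
    targets:   shape (T,).
    """
    return np.mean(forecasts >= targets[:, None], axis=0)

def is_ordered(forecasts):
    """True if forecasts are non-decreasing across levels at every time step."""
    return bool(np.all(np.diff(forecasts, axis=1) >= 0))

levels = np.array([0.25, 0.5, 0.75])
targets = np.arange(1.0, 101.0)  # toy targets 1, 2, ..., 100

# Constant forecasts placed at the empirical quantiles of the toy targets.
forecasts = np.tile(np.array([25.0, 50.0, 75.0]), (targets.size, 1))

coverage = empirical_coverage(forecasts, targets)
print(coverage, is_ordered(forecasts))
```

A forecaster is calibrated in the sense above when `coverage` matches `levels`. The gap between the two is a natural diagnostic to track over time.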

#87 Local EGOP for Continuous Index Learning

Authors: Alex Kokot, Anand Hemmady, Vydhourie Thiyageswaran, Marina Meila

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.07061

Abstract:
We introduce the setting of continuous index learning, in which a function of many variables varies only along a small number of directions at each point. For efficient estimation, it is beneficial for a learning algorithm to adapt, near each point $x$, to the subspace that captures the local variability of the function $f$. We pose this task as kernel adaptation along a manifold with noise, and introduce Local EGOP learning, a recursive algorithm that utilizes the Expected Gradient Outer Product (EGOP) quadratic form as both a metric and inverse-covariance of our target distribution. We prove that Local EGOP learning adapts to the regularity of the function of interest, showing that under a supervised noisy manifold hypothesis, intrinsic dimensional learning rates are achieved for arbitrarily high-dimensional noise. Empirically, we compare our algorithm to the feature learning capabilities of deep learning. Additionally, we demonstrate improved regression quality compared to two-layer neural networks in the continuous single-index setting.

#88 Small Gradient Norm Regret for Online Convex Optimization

Authors: Wenzhi Gao, Chang He, Madeleine Udell

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.13519

Abstract:
This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the $G^\star$ regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight. We show that the $G^\star$ regret strictly refines the existing $L^\star$ (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the $G^\star$ regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Some experiments validate our theoretical findings.
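
To make the two quantities concrete, the sketch below computes a cumulative loss and a cumulative squared gradient norm at the hindsight decision for toy least-squares losses $f_t(x) = \tfrac{1}{2}(a_t^\top x - b_t)^2$. The loss family, the random data, and the variable names `L_star` and `G_star` are assumptions made here, and the paper's exact definitions and normalisation may differ. The final check uses the standard fact that a non-negative $\beta$-smooth function satisfies $\|\nabla f(x)\|^2 \le 2\beta f(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 5
A = rng.standard_normal((T, d))  # rows a_t (toy data)
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(T)

# Hindsight decision: minimizer of the cumulative loss sum_t f_t(x),
# with f_t(x) = 0.5 * (a_t @ x - b_t)**2.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

residuals = A @ x_star - b
row_sq_norms = np.sum(A**2, axis=1)

losses = 0.5 * residuals**2                  # f_t(x_star)
grad_sq_norms = residuals**2 * row_sq_norms  # ||grad f_t(x_star)||^2

L_star = losses.sum()         # cumulative loss at the hindsight decision
G_star = grad_sq_norms.sum()  # cumulative squared gradient norm
smoothness = row_sq_norms.max()  # each f_t is ||a_t||^2-smooth

print(L_star, G_star, 2 * smoothness * L_star)
```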

#89 Optimal Decision-Making Based on Prediction Sets

Authors: Tao Wang, Edgar Dobriban

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00989

Abstract:
Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decision-making. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and safety-critical decision-making tasks demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly.

#90 Near-Universal Multiplicative Updates for Nonnegative Einsum Factorization

Authors: John Hood, Aaron Schein

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02759

Abstract:
Despite the ubiquity of multiway data across scientific domains, there are few user-friendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the $(\alpha,\beta)$-divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a stationary point of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over $37\%$ and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.
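
For readers unfamiliar with multiplicative updates, the classical Lee-Seung scheme for Frobenius-loss nonnegative matrix factorization is the simplest case. It corresponds to the contraction string `"ik,kj->ij"`. The sketch below shows only that classical special case. It is not NNEinFact and does not use its API, and the function name `nmf_multiplicative` and all parameter values are illustrative.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||V - W @ H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    losses = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise by a nonnegative ratio,
        # so nonnegativity is preserved without any projection step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        losses.append(float(np.linalg.norm(V - W @ H) ** 2))
    return W, H, losses

rng = np.random.default_rng(1)
V = rng.random((30, 4)) @ rng.random((4, 20))  # nonnegative data of rank at most 4
W, H, losses = nmf_multiplicative(V, rank=4)
print(losses[0], losses[-1])
```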

#91 Deep networks learn to parse uniform-depth context-free languages from local statistics

Authors: Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.06065

Abstract:
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

#92 Nuclear Norm Regularized Estimation of Panel Regression Models

Authors: Hyungsik Roger Moon, Martin Weidner

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/1810.10987

Abstract:
In this paper we investigate panel regression models with interactive fixed effects. We propose two new estimation methods that are based on minimizing convex objective functions. The first method minimizes the sum of squared residuals with a nuclear (trace) norm regularization. The second method minimizes the nuclear norm of the residuals. We establish the consistency of the two resulting estimators. Those estimators have a very important computational advantage compared to the existing least squares (LS) estimator, in that they are defined as minimizers of a convex objective function. In addition, the nuclear norm penalization helps to resolve a potential identification problem for interactive fixed effect models, in particular when the regressors are low-rank and the number of the factors is unknown. We also show how to construct estimators that are asymptotically equivalent to the least squares (LS) estimator in Bai (2009) and Moon and Weidner (2017) by using our nuclear norm regularized or minimized estimators as initial values for a finite number of LS minimizing iteration steps. This iteration avoids any non-convex minimization, while the original LS estimation problem is generally non-convex, and can have multiple local minima.
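The computational appeal of nuclear norm penalties can be illustrated with a small sketch. It assumes the simplest related problem, minimizing 0.5 * ||M - L||_F^2 + tau * ||L||_* over L, whose closed-form solution is singular value soft-thresholding. This is not the paper's panel regression estimator (there are no regressors here), and the name `svt` and the simulated dimensions are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: argmin_L 0.5*||M - L||_F^2 + tau*||L||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # shrink every singular value by tau
    return (U * s_shrunk) @ Vt           # rebuild with the shrunken spectrum

# Simulated N x T panel: rank-2 factor structure plus small noise (assumed setup).
rng = np.random.default_rng(0)
N, T, R = 30, 20, 2
low_rank = rng.normal(size=(N, R)) @ rng.normal(size=(R, T))
noisy = low_rank + 0.1 * rng.normal(size=(N, T))

L_hat = svt(noisy, tau=2.0)
rank_hat = np.linalg.matrix_rank(L_hat)
```

Because the objective is convex, this single SVD gives the global minimizer, and the threshold removes the small singular values that come from noise without the number of factors being specified in advance.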

#93 Estimating the Value of Evidence-Based Decision Making

Authors: Alberto Abadie, Anish Agarwal, Guido Imbens, Siwei Jia, James McQueen, Serguei Stepaniants, Santiago Torres

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2306.13681

Abstract:
In an era of data abundance, statistical evidence is increasingly critical for business and policy decisions. Yet, organizations lack empirical tools to assess the value of evidence-based decision making (EBDM), optimize statistical precision, and balance the costs of evidence-gathering strategies against their benefits. To tackle these challenges, this article introduces an empirical framework to estimate the value of EBDM and evaluate the return on investment in statistical precision and project ideation. The framework leverages parametric and nonparametric empirical Bayes methods to account for parameter heterogeneity and measure how statistical precision changes the value of evidence. The value extracted from statistical evidence depends critically on how organizations translate evidence into policy decisions. Commonly used decision rules based on statistical significance can leave substantial value unrealized and, in some cases, generate negative expected value.

#94 Analysis of singular subspaces under random perturbations

Authors: Ke Wang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2403.09170

Abstract:
We present a comprehensive analysis of singular vector and singular subspace perturbations in the signal-plus-noise matrix model with random Gaussian noise. Assuming a low-rank signal matrix, we extend the Davis-Kahan-Wedin theorem in a fully generalized manner, applicable to any unitarily invariant matrix norm, building on previous results by O'Rourke, Vu, and the author. Our analysis provides fine-grained insights, including $\ell_\infty$ bounds for singular vectors, $\ell_{2, \infty}$ bounds for singular subspaces, and results for linear and bilinear functions of singular vectors. Additionally, we derive $\ell_{2,\infty}$ bounds on perturbed singular vectors, taking into account the weighting by their corresponding singular values. Finally, we explore practical implications of these results in the Gaussian mixture model and the submatrix localization problem.

#95 Bias-Targeted Nonparametric Balancing for Stable Causal Mediation Analysis

Authors: Chang Liu, AmirEmad Ghassami

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2404.00735

Abstract:
Influence function (IF)-based estimators are widely used in mediation analysis due to their modeling flexibility, but standard implementations require direct estimation of the distribution functions of the mediator and treatment variables. Since these functions appear in the denominator of IF-based estimators, they can induce significant instability, particularly with continuous mediators. In this work, we propose an alternative implementation of IF-based estimators for both single- and multiple-mediator settings, based on reparametrizations of the likelihood. The key idea is to estimate the involved nuisance functions according to their role in the bias structure of the IF-based estimators. In our approach, key nuisance functions that are potential sources of instability are estimated using a novel nonparametric weighted balancing method (which can be viewed as a nonparametric extension of covariate balancing generalized to mediation analysis), fully stabilizing the estimators. We establish consistency and multiple robustness under suitable regularity conditions, and asymptotic normality. Simulation studies demonstrate substantial reductions in bias and variance relative to existing methods for continuous mediators. We further illustrate the approach using NHANES 2013-2014 data to estimate the effect of obesity on coronary heart disease mediated by glycohemoglobin.

#96 Kernel-based Optimally Weighted Conformal Time-Series Prediction

Authors: Jonghyeok Lee, Chen Xu, Yao Xie

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2405.16828

Abstract:
In this work, we present a novel conformal prediction method for time-series, which we call Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI). Specifically, KOWCPI adapts the classic Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. Theoretically, we tackle the challenge of establishing a conditional coverage guarantee for non-exchangeable data under strong mixing conditions on the non-conformity scores. We demonstrate the superior performance of KOWCPI on real and synthetic time-series data against state-of-the-art methods, where KOWCPI achieves narrower confidence intervals without losing coverage.

#97 Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Authors: Weiqin Chen, Xinjie Zhang, Sandipan Mishra, Santiago Paternain

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2408.12136

Abstract:
Offline reinforcement learning (RL) learns effective policies from a static target dataset. Although state-of-the-art offline RL algorithms perform well, their performance depends on the size of the target dataset and degrades when only limited target samples are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provable theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions in this work.

#98 CoHiRF: Hierarchical Consensus for Interpretable Clustering Beyond Scalability Limits

Authors: Katia Meziani, Bruno Belucci, Karim Lounici, Vladimir R. Kostic

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2502.00380

Abstract:
We introduce CoHiRF (Consensus Hierarchical Random Features), a hierarchical consensus framework that enables existing clustering methods to operate beyond their usual computational and memory limits. CoHiRF is a meta-algorithm that operates exclusively on the label assignments produced by a base clustering method, without modifying its objective function, optimization procedure, or geometric assumptions. It repeatedly applies the base method to multiple low-dimensional feature views or stochastic realizations, enforces agreement through consensus, and progressively reduces the problem size via representative-based contraction. Across a diverse set of synthetic and real-world experiments involving centroid-based, kernel-based, density-based, and graph-based methods, we show that CoHiRF can improve robustness to high-dimensional noise, enhance stability under stochastic variability, and enable scalability to regimes where the base method alone is infeasible. We also provide an empirical characterization of when hierarchical consensus is beneficial, highlighting the role of reproducible label relations and their compatibility with representative-based contraction. Beyond flat partitions, CoHiRF produces an explicit Cluster Fusion Hierarchy, offering a multi-resolution and interpretable view of the clustering structure. Together, these results position hierarchical consensus as a practical and flexible tool for large-scale clustering, extending the applicability of existing methods without altering their underlying behavior.

#99 Summaries as Centroids for Interpretable and Scalable Text Clustering

Authors: Jairo Diaz-Rodriguez

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2502.09667

Abstract:
We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.

#100 Remasking Discrete Diffusion Models with Inference-Time Scaling

diffusion

Authors: Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2503.00307

Abstract:
Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://guanghanwang.com/remdm

#101 Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

Authors: Zhiheng Chen, Ruofan Wu, Guanhua Fang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.11918

Abstract:
The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the understanding of pre-trained large language models. However, most recent works have focused on studying supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem, through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, while exhibiting reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.

#102 Parallel Layer Normalization for Universal Approximation

Authors: Yunhao Ni, Yuxin Guo, Yuhe Liu, Wenxin Sun, Jie Luo, Wenjun Wu, Lei Huang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.13142

Abstract:
This paper studies the approximation capabilities of neural networks that combine layer normalization (LN) with linear layers. We prove that networks consisting of two linear layers with parallel layer normalizations (PLNs) inserted between them (referred to as PLN-Nets) achieve universal approximation, whereas architectures that use only standard LN exhibit strictly limited expressive power. We further analyze approximation rates of shallow and deep PLN-Nets under the $L^\infty$ norm as well as in Sobolev norms. Our analysis extends beyond LN to RMSNorm, and from standard MLPs to position-wise feed-forward networks, the core building blocks used in RNNs and Transformers. Finally, we provide empirical experiments to explore other possible potentials of PLN-Nets.

#103 Meta-reinforcement learning with minimum attention

Authors: Shashank Gupta, Pilhwa Lee

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.16741

Abstract:
Minimum attention, first proposed by Brockett, applies the least-action principle to the changes of control with respect to state and time. The resulting regularization is highly relevant to emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, minimum attention outperforms state-of-the-art model-free and model-based RL algorithms, showing fast adaptation in few shots and variance reduction under perturbations of the model and environment. Furthermore, minimum attention demonstrates an improvement in energy efficiency.

#104 Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

Authors: Bob Junyi Zou, Lu Tian

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.18996

Abstract:
Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

#105 Interpretability and Generalization Bounds for Learning Spatial Physics

Authors: Alejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.15199

Abstract:
While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.

#106 These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Authors: Xingyu Alice Yang, Jianyu Zhang, Léon Bottou

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.18221

Abstract:
Transfer learning is widely used to adapt large pretrained models to new tasks with only a small amount of new data. However, a challenge persists -- the features from the original task often do not fully cover what is needed for unseen data, especially when the relatedness of tasks is not clear. Since deep learning models tend to learn very sparse representations, they retain only the minimal features required for the initial training while discarding potentially useful ones for downstream transfer. A theoretical framework developed in this work demonstrates that such pretraining captures inconsistent aspects of the data distribution, thereby inducing transfer bias. To address this limitation, we propose an inexpensive ensembling strategy that aggregates multiple models to generate richer feature representations. On ResNet, this approach yields a $9\%$ improvement in transfer accuracy without incurring extra pretraining cost. We also present empirical evidence from a range of deep learning studies, confirming that the phenomenon is pervasive across modern deep learning architectures. These results suggest that relying solely on large pretrained networks is not always the most effective way to improve model generalization. Instead, fostering richer, more diverse representations (e.g., through model ensembles) can substantially enhance transfer learning performance.

#107 Predicting Graph Structure via Adapted Flux Balance Analysis

Authors: Sevvandi Kandanaarachchi, Ziqi Xu, Stefan Westerlund, Conrad Sanderson

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2507.05806

Abstract:
Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via the Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.

#108 Improved sampling algorithms and functional inequalities for non-log-concave distributions

Authors: Yuchen He, Zhehan Lei, Jianan Shao, Chihao Zhang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2507.11236

Abstract:
We study the problem of sampling from a distribution $\mu$ with density $\propto e^{-V}$ for some potential function $V:\mathbb R^d\to \mathbb R$ with query access to $V$ and $\nabla V$. We start with the following standard assumptions: (1) $V$ is $L$-smooth. (2) The second moment $\mathbf{E}_{X\sim \mu}[\|X\|^2]\leq M$. Recently, He and Zhang (COLT'25) showed that the query complexity of this problem is at least $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ where $\epsilon$ is the desired accuracy in total variation distance, and the Poincaré constant can be unbounded. Meanwhile, another common assumption in the study of diffusion-based samplers (see e.g., the work of Chen, Chewi, Li, Li, Salim and Zhang (ICLR'23)) strengthens (1) to the following: (1*) The potential function of *every* distribution along the Ornstein-Uhlenbeck process starting from $\mu$ is $L$-smooth. We show that under the assumptions (1*) and (2), the query complexity of sampling from $\mu$ can be $\mathrm{poly}(L,d)\cdot \left(\frac{Ld+M}{\epsilon^2}\right)^{\mathcal{O}(L+1)}$, which is polynomial in $d$ and $\frac{1}{\epsilon}$ when $L=\mathcal{O}(1)$ and $M=\mathrm{poly}(d)$. This improves the algorithm with quasi-polynomial query complexity developed by Huang et al. (COLT'24). Our results imply that the seemingly moderate strengthening from (1) to (1*) yields an exponential gap in the query complexity. Furthermore, we show that together with the assumption (1*) and the stronger moment assumption that $\|X\|$ is $\lambda$-sub-Gaussian for $X\sim\mu$, the Poincaré constant of $\mu$ is at most $\mathcal{O}(\lambda)^{2(L+1)}$. We also establish a modified log-Sobolev inequality for $\mu$ under these conditions. As an application of our technique, we obtain a new estimate of the modified log-Sobolev constant for a specific class of mixtures of strongly log-concave distributions.

#109 TensorHyper-VQC: A Tensor-Train-Guided Hypernetwork for Robust and Scalable Variational Quantum Computing

Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2508.01116

Abstract:
Variational Quantum Computing (VQC) faces fundamental scalability barriers, primarily due to barren plateaus and sensitivity to quantum noise. To address these challenges, we introduce TensorHyper-VQC, a novel tensor-train (TT)-guided hypernetwork framework that significantly improves the robustness and scalability of VQC. Our framework fully delegates the generation of quantum-circuit parameters to a classical TT network, thereby decoupling optimization from quantum hardware. This innovative parameterization mitigates gradient vanishing, enhances noise resilience through structured low-rank representations, and facilitates efficient gradient propagation. Grounded in Neural Tangent Kernel and statistical learning theory, our rigorous theoretical analyses establish strong guarantees on approximation capability, optimization stability, and generalization performance. Extensive empirical results across quantum dot classification, Max-Cut optimization, and molecular quantum simulation tasks demonstrate that TensorHyper-VQC consistently achieves superior performance and robust noise tolerance, including hardware-level validation on a 156-qubit IBM Heron processor. These results position TensorHyper-VQC as a scalable and noise-resilient framework for advancing practical quantum machine learning on near-term devices.

#110 Predictability Enables Parallelization of Nonlinear State Space Models

Authors: Xavier Gonzalez, Leo Kozachkov, David M. Zoltowski, Kenneth L. Clarkson, Scott W. Linderman

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2508.16817

Abstract:
The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) and DeepPCR (arXiv:2309.16318) recast sequential evaluation as a parallelizable optimization problem, sometimes yielding dramatic speedups. However, the factors governing the difficulty of these optimization problems remained unclear, limiting broader adoption. In this work, we establish a precise relationship between a system's dynamics and the conditioning of its corresponding optimization problem, as measured by its Polyak-Lojasiewicz (PL) constant. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior and quantified by the largest Lyapunov exponent (LLE), impacts the number of optimization steps required for evaluation. For predictable systems, the state trajectory can be computed in at worst $O((\log T)^2)$ time, where $T$ is the sequence length: a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis shows that predictable systems always yield well-conditioned optimization problems, whereas unpredictable systems lead to severe conditioning degradation. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized. We highlight predictability as a key design principle for parallelizable models.
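The largest Lyapunov exponent mentioned above can be estimated numerically for simple systems. The sketch below is an illustration on the one-dimensional logistic map, not the paper's method or models: it averages log |f'(x_t)| along a trajectory, and the helper names `largest_lyapunov_exponent` and `logistic` are hypothetical. A negative estimate corresponds to a predictable (contracting) system and a positive one to a chaotic system; for the logistic map with r = 4 the exponent is known analytically to be ln 2.

```python
import math

def largest_lyapunov_exponent(f, df, x0, n_steps=10_000, burn_in=1_000):
    """Estimate the LLE of a 1-D map x -> f(x) as the trajectory average of log|f'(x)|."""
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = f(x)
    total = 0.0
    for _ in range(n_steps):
        # Floor the derivative magnitude so log never receives exactly zero.
        total += math.log(max(abs(df(x)), 1e-300))
        x = f(x)
    return total / n_steps

def logistic(r):
    """Return the logistic map x -> r*x*(1-x) and its derivative."""
    return (lambda x: r * x * (1.0 - x)), (lambda x: r * (1.0 - 2.0 * x))

f, df = logistic(2.5)  # stable fixed point: perturbations shrink over time
lle_predictable = largest_lyapunov_exponent(f, df, x0=0.2)

f, df = logistic(4.0)  # chaotic regime: perturbations grow exponentially
lle_chaotic = largest_lyapunov_exponent(f, df, x0=0.2)
```

Under the abstract's claim, a system like the first case would be a good candidate for parallel evaluation, while one like the second would not.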

#111 Adaptive Off-Policy Inference for M-Estimators Under Model Misspecification

Authors: James Leiner, Robin Dunn, Aaditya Ramdas

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.14218

Abstract:
When data are collected adaptively, such as in bandit algorithms, classical statistical approaches such as ordinary least squares and $M$-estimation will often fail to achieve asymptotic normality. Although recent lines of work have modified the classical approaches to ensure valid inference on adaptively collected data, most of these works assume that the model is correctly specified. The misspecified setting poses unique challenges because the parameter of interest itself may not be well-defined over a non-stationary distribution of rewards. We therefore tackle the problem of *off-policy* inference in adaptive settings, where we uniquely define a projected solution over a stationary evaluation policy. Our method provides valid inference for $M$-estimators that use adaptively collected bandit data with a possibly misspecified working model. A key ingredient in our approach is the use of flexible approaches to stabilize the variance induced by adaptive data collection. A major novelty is that the procedure enables the construction of valid confidence sets even in settings where treatment policies are unstable and non-converging, such as when there is no unique optimal arm and standard bandit algorithms are used. Empirical results on semi-synthetic datasets constructed from the Osteoarthritis Initiative demonstrate that the method maintains type I error control, while existing methods for inference in adaptive settings do not cover in the misspecified case.

#112 Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

Authors: Shuang Liang, Guido Montúfar

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.25351

Abstract:
We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the training outcome is unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.
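The step-size threshold behavior can be seen in a toy experiment. The sketch below assumes the scalar-scalar special case, minimizing f(x, y) = 0.5 * (x*y - 1)^2 with plain gradient descent; it does not reproduce the paper's analysis or its fractal boundaries, and the function name `gd_scalar_factorization` and all defaults are hypothetical.

```python
def gd_scalar_factorization(x, y, step, target=1.0, n_iter=500, blowup=1e6):
    """Run gradient descent on f(x, y) = 0.5 * (x*y - target)**2.

    Returns the final iterate and a flag saying whether it reached a minimizer.
    """
    for _ in range(n_iter):
        r = x * y - target  # residual
        # Simultaneous update: both gradients use the old (x, y).
        x, y = x - step * r * y, y - step * r * x
        if abs(x) > blowup or abs(y) > blowup:
            return x, y, False  # iterates escaped: treat as divergence
    return x, y, abs(x * y - target) < 1e-8

# At a minimizer with x*y = target, the top Hessian eigenvalue is x**2 + y**2,
# so gradient descent is locally stable there only if step < 2 / (x**2 + y**2).
_, _, converged_small = gd_scalar_factorization(1.5, 1.5, step=0.1)
_, _, converged_large = gd_scalar_factorization(1.5, 1.5, step=2.0)

# Sweep step sizes from one initialization to see where convergence is lost.
sweep = {s / 10: gd_scalar_factorization(1.5, 1.5, step=s / 10)[2] for s in range(1, 21)}
```

Repeating the sweep over a grid of initializations is one way to visualize the convergence region as a function of starting point and step size.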

#113 Probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentration

Authors: Parsa Gooya, Reinel Sospedra-Alfonso

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.09891

Abstract:
Seasonal forecasting of Arctic sea ice concentration is key to mitigating the negative impacts and assessing the potential opportunities posed by the rapid decline of sea ice coverage. Seasonal prediction systems based on climate models often show systematic biases and complex spatio-temporal errors that grow with the forecasts. Consequently, operational predictions are routinely bias corrected and calibrated using retrospective forecasts. For predictions of Arctic sea ice concentration, error corrections are mainly based on one-to-one post-processing methods including climatological mean or linear regression correction and, more recently, machine learning. Such deterministic adjustments are confined at best to the limited number of costly-to-run ensemble members of the raw forecast. However, decision-making requires proper quantification of uncertainty and likelihood of events, particularly of extremes. We introduce a probabilistic error correction framework based on a conditional Variational Autoencoder model to map the conditional distribution of observations given the biased model prediction. This method naturally allows for generating large ensembles of adjusted forecasts. We evaluate our model using deterministic and probabilistic metrics and show that the adjusted forecasts are better calibrated, closer to the observational distribution, and have smaller errors than climatological mean adjusted forecasts.

#114 PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Authors: Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.10544

Abstract:
We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. The new bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the practical utility of the bound through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across several continuous control tasks show that the proposed approach provides meaningful confidence certificates while maintaining competitive performance.

#115 Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees

privacy, synthetic data

Authors: Shurong Lin, Aleksandra Slavković, Deekshith Reddy Bhoomireddy

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.16974

Abstract:
In the social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation, with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches are either tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical in social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show that our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDG methods.

#116 Scalable LinUCB: Low-Rank Design Matrix Updates for Recommenders with Large Action Spaces

Authors: Evgenia Shustova, Marina Sheshukova, Sergey Samsonov, Evgeny Frolov

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.19349

Abstract:
In this paper, we introduce PSI-LinUCB, a scalable variant of LinUCB that enables efficient training, inference, and memory usage by representing the inverse regularized design matrix as the sum of a diagonal matrix and a low-rank correction. We derive numerically stable rank-1 and batched updates that maintain the inverse without explicitly forming the matrix. To control memory growth, we employ a projector-splitting integrator for dynamical low-rank approximation, yielding an average per-step update cost and memory usage of $O(dr)$ for approximation rank $r$. The inference complexity of the proposed algorithm is $O(dr)$ per action evaluation. Experiments on recommender system datasets demonstrate the effectiveness of our algorithm.
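
For context, the baseline that such low-rank schemes aim to improve on can be sketched as the textbook dense LinUCB update, which maintains the inverse design matrix with a Sherman-Morrison rank-1 step at $O(d^2)$ cost. This is not the PSI-LinUCB algorithm: the function names and the dense list-of-lists representation are assumptions made for illustration only.

```python
def sherman_morrison_update(a_inv, x):
    """Return (A + x x^T)^{-1} given a symmetric A^{-1}, in O(d^2) time."""
    d = len(x)
    # u = A^{-1} x
    u = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    # 1 + x^T A^{-1} x is strictly positive when A is positive definite
    denom = 1.0 + sum(x[i] * u[i] for i in range(d))
    return [[a_inv[i][j] - u[i] * u[j] / denom for j in range(d)]
            for i in range(d)]


def ucb_score(a_inv, theta, x, alpha):
    """Standard LinUCB score: theta^T x + alpha * sqrt(x^T A^{-1} x)."""
    d = len(x)
    mean = sum(theta[i] * x[i] for i in range(d))
    u = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
    width = sum(x[i] * u[i] for i in range(d)) ** 0.5
    return mean + alpha * width


# Start from A = I (ridge parameter 1) and observe the feature vector e_1.
a_inv = sherman_morrison_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

Both the update and the score touch all $d^2$ entries of the inverse here, which is the cost that the abstract's diagonal-plus-low-rank representation is reported to reduce to $O(dr)$.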

#117 Identification and Debiased Learning of Causal Effects with General Instrumental Variables

Authors: Shuyuan Chen, Peng Zhang, Yifan Cui

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.20404

Abstract:
Instrumental variable methods are fundamental to causal inference when treatment assignment is confounded by unobserved variables. In this article, we develop a general nonparametric causal framework for identification and learning with multi-categorical or continuous instrumental variables. Specifically, the mean potential outcomes and the average treatment effect can be identified via a regular weighting function derived from the proposed framework. Leveraging semiparametric theory, we derive efficient influence functions and construct two consistent, asymptotically normal estimators via debiased machine learning. The first estimator uses a prespecified weighting function, while the second selects the optimal weighting function adaptively. Extensions to longitudinal data, dynamic treatment regimes, and multiplicative instrumental variables are further developed. We demonstrate the proposed method through simulation studies and an analysis of real data from the Job Training Partnership Act program.

#118 How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Authors: Xiwen Huang, Pierre Pinson

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.20605

Abstract:
We introduce and analyse active learning markets as a way to purchase labels in situations where analysts aim to acquire additional data to improve model fitting or to better train models for predictive analytics applications. This contrasts with the many existing proposals for purchasing features and examples. By formalising market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer, multiple-seller setup and propose two active learning strategies (variance-based and query-by-committee-based), paired with distinct pricing mechanisms. They are compared against baselines including random sampling and a greedy knapsack heuristic. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, which consistently achieves superior performance with fewer labels acquired than conventional methods. Our proposal constitutes an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

#119 How to Correctly Report LLM-as-a-Judge Evaluations

Authors: Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.21140

Abstract:
Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
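
To make the bias mechanism concrete: if a binary judge has sensitivity $q_1$ and specificity $q_0$, and the true score is $\theta$, the judge's expected positive rate is $\theta q_1 + (1-\theta)(1-q_0)$, which differs from $\theta$ unless the judge is perfect. The sketch below inverts that relation using the classical Rogan-Gladen correction. Treating this as the plug-in estimator of the paper is an assumption, as are the function name and the clipping to $[0, 1]$. The confidence intervals and adaptive calibration allocation described in the abstract are not covered.

```python
def corrected_score(p_judge, sensitivity, specificity):
    """Rogan-Gladen style correction of a naive judge pass rate.

    p_judge      -- fraction of test items the judge marked correct
    sensitivity  -- P(judge says correct | truly correct)
    specificity  -- P(judge says incorrect | truly incorrect)
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        # A judge at or below chance level carries no usable signal.
        raise ValueError("need sensitivity + specificity > 1")
    theta = (p_judge + specificity - 1.0) / denom
    # Clip, since sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, theta))


# A true score of 0.6 seen through a judge with sensitivity 0.9 and
# specificity 0.8 yields a naive rate of 0.6*0.9 + 0.4*0.2 = 0.62.
estimate = corrected_score(0.62, 0.9, 0.8)
```

In practice the sensitivity and specificity would themselves be estimated from a human-labeled calibration set, which is why the abstract stresses accounting for uncertainty from both datasets.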

#120 Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Authors: Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.23075

Abstract:
Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
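
The masking idea in the last sentence can be sketched as follows: a sequence contributes to the loss only if its largest per-token divergence stays within a threshold, so one bad token removes the whole sequence rather than just that token. This is a minimal illustration under stated assumptions. The names `sequence_mask`, `masked_mean_loss`, and `delta` are hypothetical, the per-token divergence estimates are taken as given, and normalizing by the number of kept sequences is a choice made here rather than something the abstract specifies.

```python
def sequence_mask(token_divergences, delta):
    """Return 1.0 for sequences whose max token divergence is <= delta, else 0.0."""
    return [1.0 if max(seq, default=0.0) <= delta else 0.0
            for seq in token_divergences]


def masked_mean_loss(sequence_losses, mask):
    """Average the per-sequence losses over the sequences that were kept."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every sequence violated the trust region
    return sum(loss * m for loss, m in zip(sequence_losses, mask)) / kept


# Two rollouts: the second contains one token far outside the trust region.
divergences = [[0.01, 0.02], [0.01, 0.5]]
mask = sequence_mask(divergences, delta=0.1)
loss = masked_mean_loss([2.0, 10.0], mask)
```

By contrast, a token-level rule such as PPO clipping would still let the remaining tokens of the second sequence contribute, which is the behaviour the abstract argues cannot control the sequence-level maximum.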

#121 Categorical Reparameterization with Denoising Diffusion models

diffusion

Authors: Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.00781

Abstract:
Learning models with categorical variables requires optimizing expectations over discrete distributions, a setting in which stochastic gradient-based optimization is challenging due to the non-differentiability of categorical sampling. A common workaround is to replace the discrete distribution with a continuous relaxation, yielding a smooth surrogate that admits gradient estimates via the reparameterization trick. Building on this idea, we introduce ReDGE, a novel and efficient diffusion-based soft reparameterization method for categorical distributions. Our approach defines a flexible class of gradient estimators that includes the Straight-Through estimator as a special case. Experiments spanning latent variable models and inference-time reward guidance in discrete diffusion models demonstrate that ReDGE consistently matches or outperforms existing gradient-based methods. The code will be made available at https://github.com/samsongourevitch/redge.

#122 Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

Authors: Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.18626

Abstract:
Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to the full inverse FIM. We theoretically show that, under certain conditions, a rank-1 approximation to the inverse FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard actor-critic and trust-region baselines.

#123 Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

Authors: M. Arashi, M. Amintoosi

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01777

Abstract:
Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging statistics from adaptive optimizers. Under a Gaussian noise model, we show our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible computational cost. Empirical evaluations on CIFAR10 and CIFAR100 across multiple levels of input noise show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled approach to improving stochastic gradient estimation in deep learning.
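
The shrinkage step can be illustrated with the textbook positive-part James-Stein estimator, which contracts a noisy vector toward a fixed target by a factor that depends on the dimension and the noise variance. This is a sketch of the classical rule only, not of SR-Adam: the function name is hypothetical, the noise variance is passed in as a known scalar (the abstract describes an online estimate), and the target stands in for the momentum-based estimator.

```python
def stein_shrink(grad, target, noise_var):
    """Positive-part James-Stein shrinkage of `grad` toward `target`.

    Returns target + max(0, 1 - (d - 2) * noise_var / ||grad - target||^2)
                     * (grad - target),
    where d is the dimension. For d < 3 the classical rule offers no gain,
    so the input is returned unchanged.
    """
    d = len(grad)
    diff = [g - m for g, m in zip(grad, target)]
    sq_norm = sum(v * v for v in diff)
    if d < 3 or sq_norm == 0.0:
        return list(grad)
    factor = max(0.0, 1.0 - (d - 2) * noise_var / sq_norm)
    return [m + factor * v for m, v in zip(target, diff)]


# d = 4, squared distance to the target = 4, noise variance = 1,
# so the shrinkage factor is 1 - 2 * 1 / 4 = 0.5.
shrunk = stein_shrink([2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], noise_var=1.0)
```

When the noise variance is large relative to the distance from the target, the factor clips to zero and the estimate collapses onto the target. When the noise is small, the mini-batch gradient passes through almost unchanged.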

#124 Quantum Circuit Generation via test-time learning with large language models

Authors: Adriano Macarone-Palmieri, Rosario Lo Franco

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.03466

Abstract:
Large language models (LLMs) can generate structured artifacts, but using them as dependable optimizers for scientific design requires a mechanism for iterative improvement under black-box evaluation. Here, we cast quantum circuit synthesis as a closed-loop, test-time optimization problem: an LLM proposes edits to a fixed-length gate list, and an external simulator evaluates the resulting state with the Meyer-Wallach (MW) global entanglement measure. We introduce a lightweight test-time learning recipe that reuses prior high-performing candidates as an explicit memory trace, augments prompts with score-difference feedback, and applies restart-from-the-best sampling to escape potential plateaus. Across fixed 20-qubit settings, the loop without feedback and restart-from-the-best improves random initial circuits over a range of gate budgets. To raise performance and success rate further, we use the full learning strategy. In the 25-qubit setting, it mitigates the pronounced performance plateau that appears under naive querying. Beyond raw scores, we analyze the structure of synthesized states and find that high-MW solutions can correspond to stabilizer or graph-state-like constructions, but full connectivity is not guaranteed due to the metric property and prompt design. These results illustrate both the promise and the pitfalls of memory evaluator-guided LLM optimization for circuit synthesis, highlighting the critical role of prior human-derived theorems in optimally designing a custom tool in support of research.
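
For readers unfamiliar with the scoring function, the Meyer-Wallach measure of an $n$-qubit pure state is commonly written as $Q = 2\left(1 - \frac{1}{n}\sum_{k} \mathrm{Tr}\,\rho_k^2\right)$, where $\rho_k$ is the reduced density matrix of qubit $k$. The sketch below evaluates it directly from a state vector. It is an illustration only: the function name and the little-endian bit convention are assumptions, the input is assumed normalized with length $2^n$, and a brute-force loop like this is practical only for small $n$, not the 20 to 25 qubits mentioned in the abstract.

```python
def meyer_wallach(state, n_qubits):
    """Meyer-Wallach global entanglement Q of a normalized pure state.

    `state` holds 2**n_qubits amplitudes; bit k of the index is qubit k.
    """
    total_purity = 0.0
    for k in range(n_qubits):
        bit = 1 << k
        rho00 = rho11 = 0.0
        rho01 = 0j
        for i, amp in enumerate(state):
            if i & bit:
                rho11 += abs(amp) ** 2
            else:
                rho00 += abs(amp) ** 2
                # Off-diagonal element pairs index i with its bit-k partner.
                rho01 += amp * state[i | bit].conjugate()
        # Tr(rho_k^2) for a 2x2 Hermitian matrix
        total_purity += rho00 ** 2 + rho11 ** 2 + 2 * abs(rho01) ** 2
    return 2.0 * (1.0 - total_purity / n_qubits)


bell = [2 ** -0.5, 0.0, 0.0, 2 ** -0.5]   # (|00> + |11>) / sqrt(2)
product = [1.0, 0.0, 0.0, 0.0]            # |00>
q_bell = meyer_wallach(bell, 2)
q_product = meyer_wallach(product, 2)
```

The measure ranges from 0 for product states to 1 when every single-qubit marginal is maximally mixed. Because it averages single-qubit purities only, a high score does not by itself certify that all qubits are mutually entangled, which is consistent with the abstract's caveat about full connectivity.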

#125 f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.05946

Abstract:
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
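
As background, the variational representation of f-divergences referred to above is usually stated as follows, for a convex $f$ with $f(1) = 0$ and convex conjugate $f^*$. The notation ($P$, $Q$, $T$) is chosen here for illustration and need not match the paper's, and how the aligned and unaligned response distributions are plugged in is not specified by the abstract.

```latex
D_f(P \,\|\, Q)
  = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(x) \right) \right]
  = \sup_{T} \Big\{ \mathbb{E}_{x \sim P}\big[ T(x) \big]
                    - \mathbb{E}_{x \sim Q}\big[ f^{*}(T(x)) \big] \Big\},
\qquad
f^{*}(t) = \sup_{u > 0} \big\{ u t - f(u) \big\}.
% Example: for the KL divergence, f(u) = u \log u and f^{*}(t) = e^{t - 1}.
```

Any fixed critic $T$ gives a lower bound on the divergence, so maximizing the right-hand side over a parametric family turns divergence estimation into an optimization objective.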

#126 Vision Transformer Finetuning Benefits from Non-Smooth Components

Authors: Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Published: Tue, 10 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.06883

Abstract:
The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.

stat.ML updates on arXiv.org

📋 Paper Title List