arXiv paper list - stat.ML updates on arXiv.org

#1 Uncertainty-Aware Multimodal Learning via Conformal Shapley Intervals

Authors: Mathew Chandy, Michael Johnson, Judong Shen, Devan V. Mehrotra, Hua Zhou, Jin Zhou, Xiaowu Dai

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00171

Abstract:
Multimodal learning combines information from multiple data modalities to improve predictive performance. However, modalities often contribute unequally and in a data-dependent way, making it unclear which data modalities are genuinely informative and to what extent their contributions can be trusted. Quantifying modality-level importance together with uncertainty is therefore central to interpretable and reliable multimodal learning. We introduce conformal Shapley intervals, a framework that combines Shapley values with conformal inference to construct uncertainty-aware importance intervals for each modality. Building on these intervals, we propose a modality selection procedure with a provable optimality guarantee: conditional on the observed features, the selected subset of modalities achieves performance close to that of the optimal subset. We demonstrate the effectiveness of our approach on multiple datasets, showing that it provides meaningful uncertainty quantification and strong predictive performance while relying on only a small number of informative modalities.
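
As background for the Shapley-value ingredient, here is a minimal sketch of exact Shapley values over a small set of modalities. It is not the authors' conformal procedure: the modality names and the subset scores are made-up assumptions, and `shapley_values` is a hypothetical helper written for this illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` defined on subsets of `players`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical validation scores for each modality subset (illustrative numbers only).
scores = {
    frozenset(): 0.50,
    frozenset({"image"}): 0.70,
    frozenset({"text"}): 0.60,
    frozenset({"image", "text"}): 0.80,
}
phi = shapley_values(["image", "text"], lambda s: scores[frozenset(s)])
```

The values sum to the gain of the full modality set over the empty set (the efficiency property), which is what makes them usable as per-modality importance scores.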

#2 Neuron Block Dynamics for XOR Classification with Zero-Margin

Authors: Guillaume Braun, Masaaki Imaizumi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00172

Abstract:
The ability of neural networks to learn useful features through stochastic gradient descent (SGD) is a cornerstone of their success. Most theoretical analyses focus on regression or on classification tasks with a positive margin, where worst-case gradient bounds suffice. In contrast, we study zero-margin nonlinear classification by analyzing the Gaussian XOR problem, where inputs are Gaussian and the XOR decision boundary determines labels. In this setting, a non-negligible fraction of data lies arbitrarily close to the boundary, breaking standard margin-based arguments. Building on Glasgow's (2024) analysis, we extend the study of training dynamics from discrete to Gaussian inputs and develop a framework for the dynamics of neuron blocks. We show that neurons cluster into four directions and that block-level signals evolve coherently, a phenomenon essential in the Gaussian setting where individual neuron signals vary significantly. Leveraging this block perspective, we analyze generalization without relying on margin assumptions, adopting an average-case view that distinguishes regions of reliable prediction from regions of persistent error. Numerical experiments confirm the predicted two-phase block dynamics and demonstrate their robustness beyond the Gaussian setting.

#3 Singular Bayesian Neural Networks

Authors: Mame Diarra Toure, David A. Stephens

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00387

Abstract:
Bayesian neural networks promise calibrated uncertainty but require $O(mn)$ parameters for standard mean-field Gaussian posteriors. We argue this cost is often unnecessary, particularly when weight matrices exhibit fast singular value decay. By parameterizing weights as $W = AB^{\top}$ with $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{n \times r}$, we induce a posterior that is singular with respect to the Lebesgue measure, concentrating on the rank-$r$ manifold. This singularity captures structured weight correlations through shared latent factors, geometrically distinct from mean-field's independence assumption. We derive PAC-Bayes generalization bounds whose complexity term scales as $\sqrt{r(m+n)}$ instead of $\sqrt{m n}$, and prove loss bounds that decompose the error into optimization and rank-induced bias using the Eckart-Young-Mirsky theorem. We further adapt recent Gaussian complexity bounds for low-rank deterministic networks to Bayesian predictive means. Empirically, across MLPs, LSTMs, and Transformers on standard benchmarks, our method achieves predictive performance competitive with 5-member Deep Ensembles while using up to $15\times$ fewer parameters. Furthermore, it substantially improves OOD detection and often improves calibration relative to mean-field and perturbation baselines.
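
To make the $O(mn)$ versus $r(m+n)$ comparison concrete, the following sketch counts variational parameters for a single layer. It assumes one mean and one variance per Gaussian coordinate, and a factorized Gaussian over the entries of $A$ and $B$ in the low-rank case. Both are assumptions for illustration, and the layer sizes are hypothetical rather than taken from the paper.

```python
def mean_field_params(m, n):
    """Mean and variance for each of the m*n weights (assumed parameterization)."""
    return 2 * m * n


def low_rank_params(m, n, r):
    """Mean and variance for each entry of A (m x r) and B (n x r) (assumed parameterization)."""
    return 2 * r * (m + n)


# Hypothetical layer size and rank.
m, n, r = 1024, 1024, 16
full = mean_field_params(m, n)
low = low_rank_params(m, n, r)
ratio = full / low
```

For this hypothetical layer the ratio is $mn / (r(m+n)) = 32$. It is single-layer arithmetic only and says nothing about the whole-model savings reported in the abstract.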

#4 Reinforcement Learning for Control Systems with Time Delays: A Comprehensive Survey

Authors: Armando Alves Neto

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00399

Abstract:
In the last decade, Reinforcement Learning (RL) has achieved remarkable success in the control and decision-making of complex dynamical systems. However, most RL algorithms rely on the Markov Decision Process assumption, which is violated in practical cyber-physical systems affected by sensing delays, actuation latencies, and communication constraints. Such time delays introduce memory effects that can significantly degrade performance and compromise stability, particularly in networked and multi-agent environments. This paper presents a comprehensive survey of RL methods designed to address time delays in control systems. We first formalize the main classes of delays and analyze their impact on the Markov property. We then systematically categorize existing approaches into five major families: state augmentation and history-based representations, recurrent policies with learned memory, predictor-based and model-aware methods, robust and domain-randomized training strategies, and safe RL frameworks with explicit constraint handling. For each family, we discuss underlying principles, practical advantages, and inherent limitations. A comparative analysis highlights key trade-offs among these approaches and provides practical guidelines for selecting suitable methods under different delay characteristics and safety requirements. Finally, we identify open challenges and promising research directions, including stability certification, large-delay learning, multi-agent communication co-design, and standardized benchmarking. This survey aims to serve as a unified reference for researchers and practitioners developing reliable RL-based controllers in delay-affected cyber-physical systems.
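
The first family, state augmentation, can be sketched as follows. This is a minimal illustration assuming a constant delay of `delay` steps and scalar actions. `DelayAugmentedState` is a hypothetical helper, not an API from the survey or from any RL library.

```python
from collections import deque


class DelayAugmentedState:
    """Pairs the latest (delayed) observation with the last `delay` actions."""

    def __init__(self, delay, default_action=0.0):
        # Pre-fill with a default action so the augmented state has a fixed size.
        self.buffer = deque([default_action] * delay, maxlen=delay)

    def step(self, delayed_observation, action):
        # Record the newest action; the oldest one falls out of the window.
        self.buffer.append(action)
        return (delayed_observation, tuple(self.buffer))


aug = DelayAugmentedState(delay=2)
s1 = aug.step(1.0, 0.5)
s2 = aug.step(2.0, -0.5)
```

The policy then acts on the pair (delayed observation, pending actions) instead of the raw observation. The cost is that the augmented state grows linearly with the delay.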

#5 Alignment of Diffusion Model and Flow Matching for Text-to-Image Generation

diffusion

Authors: Yidong Ouyang, Liyan Xie, Hongyuan Zha, Guang Cheng

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00413

Abstract:
Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. While many existing alignment methods primarily focus on fine-tuning pre-trained generative models to maximize a given reward function, these approaches require extensive computational resources and may not generalize well across different objectives. In this work, we propose a novel alignment framework by leveraging the underlying nature of the alignment problem -- sampling from reward-weighted distributions -- and show that it applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance). The score function (velocity field) required for the reward-weighted distribution can be decomposed into the pre-trained score (velocity field) plus a conditional expectation of the reward. For the alignment on the diffusion model, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts in the generated images. Therefore, we propose a finetuning-free framework that trains a guidance network to estimate the conditional expectation of the reward. We achieve comparable performance to finetuning-based models with one-step generation with at least a 60% reduction in computational cost. For the alignment on flow matching, we propose a training-free framework that improves the generation quality without additional computational cost.
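
The decomposition mentioned in the abstract can be sketched in a standard form. The notation is an assumption, not taken from the paper: $p_0$ is the pre-trained data distribution, $r$ the reward, $\beta > 0$ a temperature, and the tilted and base distributions are assumed to share the same forward noising kernel.

```latex
\begin{align*}
p^{\star}_0(x_0) &\propto p_0(x_0)\,\exp\!\big(r(x_0)/\beta\big), \\
p^{\star}_t(x_t) &\propto p_t(x_t)\,\mathbb{E}\!\left[\exp\!\big(r(X_0)/\beta\big) \,\middle|\, X_t = x_t\right], \\
\nabla_{x_t} \log p^{\star}_t(x_t) &= \nabla_{x_t} \log p_t(x_t)
  + \nabla_{x_t} \log \mathbb{E}\!\left[\exp\!\big(r(X_0)/\beta\big) \,\middle|\, X_t = x_t\right].
\end{align*}
```

The second line follows by integrating the tilted density against the forward kernel. The paper's exact parameterization of the guidance term may differ from this textbook form.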

#6 Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits

privacy

Authors: Sahasrajit Sarmasarkar

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00417

Abstract:
We present the first algorithms for generalized linear contextual bandits under shuffle differential privacy and joint differential privacy. While prior work on private contextual bandits has been restricted to linear reward models -- which admit closed-form estimators -- generalized linear models (GLMs) pose fundamental new challenges: no closed-form estimator exists, requiring private convex optimization; privacy must be tracked across multiple evolving design matrices; and optimization error must be explicitly incorporated into regret analysis. We address these challenges under two privacy models and context settings. For stochastic contexts, we design a shuffle-DP algorithm achieving $\tilde{O}(d^{3/2}\sqrt{T}/\sqrt{\varepsilon})$ regret. For adversarial contexts, we provide a joint-DP algorithm with $\tilde{O}(d\sqrt{T}/\sqrt{\varepsilon})$ regret -- matching the non-private rate up to a $1/\sqrt{\varepsilon}$ factor. Both algorithms remove dependence on the instance-specific parameter $\kappa$ (which can be exponential in dimension) from the dominant $\sqrt{T}$ term. Unlike prior work on locally private GLM bandits, our methods require no spectral assumptions on the context distribution beyond $\ell_2$ boundedness.

#7 Topological Residual Asymmetry for Bivariate Causal Direction

Authors: Mouad El Bouchattaoui

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00427

Abstract:
Inferring causal direction from purely observational bivariate data is fragile: many methods commit to a direction even in ambiguous or near non-identifiable regimes. We propose Topological Residual Asymmetry (TRA), a geometry-based criterion for additive-noise models. TRA compares the shapes of two cross-fitted regressor-residual clouds after rank-based copula standardization: in the correct direction, residuals are approximately independent, producing a two-dimensional bulk, while in the reverse direction -- especially under low noise -- the cloud concentrates near a one-dimensional tube. We quantify this bulk-tube contrast using a 0D persistent-homology functional, computed efficiently from Euclidean MST edge-length profiles. We prove consistency in a triangular-array small-noise regime, extend the method to fixed noise via a binned variant (TRA-s), and introduce TRA-C, a confounding-aware abstention rule calibrated by a Gaussian-copula plug-in bootstrap. Extensive experiments across many challenging synthetic and real-data scenarios demonstrate the method's superiority.

#8 Stabilizing Fixed-Point Iteration for Markov Chain Poisson Equations

Authors: Yang Xu, Vaneet Aggarwal

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00474

Abstract:
Poisson equations underpin average-reward reinforcement learning, but beyond ergodicity they can be ill-posed, meaning that solutions are non-unique and standard fixed-point iterations can oscillate on reducible or periodic chains. We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution. Building on this viewpoint, we develop an end-to-end pipeline that learns the chain structure, estimates an anchor-based gauge map, and runs projected stochastic approximation to estimate a gauge-fixed representative together with an associated peripheral residual. We prove $\widetilde{O}(T^{-1/2})$ convergence up to projection estimation error, enabling stable Poisson equation learning for multichain and periodic regimes with applications to performance evaluation of average-reward reinforcement learning beyond ergodicity.
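
The oscillation on periodic chains is easy to reproduce on a toy example. The sketch below runs the plain iteration $h \leftarrow r - g\mathbf{1} + Ph$ on a two-state chain that deterministically alternates between states. The chain, rewards, and gain are assumptions chosen for illustration, and the code shows only the failure mode, not the stabilized method of the paper.

```python
def poisson_iterates(P, r, gain, steps):
    """Run h <- r - gain + P h from h = 0 and return every iterate."""
    n = len(r)
    h = [0.0] * n
    history = [h]
    for _ in range(steps):
        h = [r[i] - gain + sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
        history.append(h)
    return history


# Period-2 chain: state 0 always moves to state 1 and vice versa.
P = [[0.0, 1.0], [1.0, 0.0]]
r = [1.0, -1.0]  # rewards with long-run average (gain) equal to 0
iterates = poisson_iterates(P, r, gain=0.0, steps=4)
```

The iterates alternate between `[1.0, -1.0]` and `[0.0, 0.0]` and never settle, even though `h = [0.5, -0.5]` satisfies the equation exactly.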

#9 Action-Free Offline-to-Online RL via Discretised State Policies

Authors: Natinael Solomon Neggatu, Jeremie Houssineau, Giovanni Montana

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00629

Abstract:
Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s,r,s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN, a value-based algorithm designed to pre-train state policies from action-free data. The algorithm integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting associated with continuous state prediction. Second, we propose a novel mechanism for guided online learning that leverages these pre-trained state policies to accelerate the learning of online agents. Together, these components establish a scalable and practical framework for leveraging action-free datasets to accelerate online RL. Empirical results across diverse benchmarks demonstrate that our approach improves convergence speed and asymptotic performance, while analyses reveal that discretisation and regularisation are critical to its effectiveness.

#10 Sampling from multi-modal distributions on Riemannian manifolds with training-free stochastic interpolants

Authors: Alain Durmus, Maxence Noble, Thibaut Pellerin

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00641

Abstract:
In this paper, we propose a general methodology for sampling from un-normalized densities defined on Riemannian manifolds, with a particular focus on multi-modal targets that remain challenging for existing sampling methods. Inspired by the framework of diffusion models developed for generative modeling, we introduce a sampling algorithm based on the simulation of a non-equilibrium deterministic dynamics that transports an easy-to-sample noise distribution toward the target. At the marginal level, the induced density path follows a prescribed stochastic interpolant between the noise and target distributions, specifically constructed to respect the underlying Riemannian geometry. In contrast to related generative modeling approaches that rely on machine learning, our method is entirely training-free. It instead builds on iterative posterior sampling procedures using only standard Monte Carlo techniques, thereby extending recent diffusion-based sampling methodologies beyond the Euclidean setting. We complement our approach with a rigorous theoretical analysis and demonstrate its effectiveness on a range of multi-modal sampling problems, including high-dimensional and heavy-tailed examples.

#11 Emergence of Distortions in High-Dimensional Guided Diffusion Models

diffusion

Authors: Enrico Ventura, Beatrice Achilli, Luca Ambrogioni, Carlo Lucibello

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00716

Abstract:
Classifier-free guidance (CFG) is the de facto standard for conditional sampling in diffusion models, yet it often leads to a loss of diversity in generated samples. We formalize this phenomenon as generative distortion, defined as the mismatch between the CFG-induced sampling distribution and the true conditional distribution. Considering Gaussian mixtures and their exact scores, and leveraging tools from statistical physics, we characterize the onset of distortion in a high-dimensional regime as a function of the number of classes. Our analysis reveals that distortions emerge through a phase transition in the effective potential governing the guided dynamics. In particular, our dynamical mean-field analysis shows that distortion persists when the number of modes grows exponentially with dimension, but vanishes in the sub-exponential regime. Consistent with prior finite-dimensional results, we further demonstrate that vanilla CFG shifts the mean and shrinks the variance of the conditional distribution. We show that standard CFG schedules are fundamentally incapable of preventing variance shrinkage. Finally, we propose a theoretically motivated guidance schedule featuring a negative-guidance window, which mitigates loss of diversity while preserving class separability.
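
For reference, one common way to write the CFG-modified score is given below. The notation and the scale convention are assumptions, since conventions differ across papers: $s(x,t)$ is the unconditional score, $s(x,t \mid c)$ the class-conditional score, and $w$ the guidance scale, with $w = 1$ recovering the plain conditional score.

```latex
\begin{align*}
\tilde{s}_w(x, t \mid c) &= s(x, t) + w\,\big[s(x, t \mid c) - s(x, t)\big] \\
                         &= \nabla_x \log p_t(x) + w\,\nabla_x \log p_t(c \mid x).
\end{align*}
```

The second line uses Bayes' rule, $\log p_t(x \mid c) - \log p_t(x) = \log p_t(c \mid x) - \log p(c)$. For $w \neq 1$ the guided dynamics no longer target the true conditional distribution, which is the kind of mismatch the abstract calls generative distortion.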

#12 Zero-Flow Encoders

Authors: Yakun Wang, Leyang Wang, Song Liu, Taiji Suzuki

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00797

Abstract:
Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at $t=0.5$ if and only if the source and target distributions are identical. We term this property the zero-flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting sufficient information from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe.
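
The "if" direction of the criterion can be seen from a short symmetry argument. The sketch assumes the usual rectified-flow convention, which the abstract does not spell out: linear interpolation $X_t = (1-t)X_0 + tX_1$ with $X_0$ and $X_1$ drawn independently.

```latex
\begin{align*}
v(x, t) &= \mathbb{E}\!\left[X_1 - X_0 \,\middle|\, X_t = x\right],
  \qquad X_t = (1-t)\,X_0 + t\,X_1, \\
v\!\left(x, \tfrac{1}{2}\right) &= \mathbb{E}\!\left[X_1 \,\middle|\, \tfrac{X_0 + X_1}{2} = x\right]
  - \mathbb{E}\!\left[X_0 \,\middle|\, \tfrac{X_0 + X_1}{2} = x\right] = 0
  \quad \text{if } X_0, X_1 \text{ are i.i.d.}
\end{align*}
```

When the two distributions coincide, the pair $(X_0, X_1)$ is exchangeable and the midpoint is symmetric in its arguments, so the two conditional expectations are equal. The converse direction is the substantive claim of the paper and is not derived here.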

#13 Hessian Spectral Analysis at Foundation Model Scale

Authors: Diego Granziol, Khurshid Juarev

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00816

Abstract:
Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian-vector products compatible with Fully Sharded Data Parallelism, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that are validated empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral probing incurs only a modest constant-factor overhead over first-order training. Crucially, direct access to the Hessian reveals that widely used block-diagonal curvature approximations can fail catastrophically, exhibiting order-one relative error and poor directional alignment even in mid-scale LLMs. Together, our results demonstrate that foundation-model Hessian spectra are both computable and qualitatively misrepresented by prevailing approximations, opening the door to principled curvature-based analysis at scale.
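
A finite-difference Hessian-vector product needs only gradient evaluations, which is what makes it attractive at scale. The sketch below uses a central difference on a toy two-parameter quadratic. The toy loss, the step size `eps`, and the choice of a central rather than a forward difference are assumptions for illustration, and nothing here reflects the sharded implementation described in the abstract.

```python
def grad(theta):
    """Gradient of the toy loss f(x, y) = x**2 + 3*x*y + 2*y**2."""
    x, y = theta
    return [2 * x + 3 * y, 3 * x + 4 * y]


def fd_hvp(grad_fn, theta, v, eps=1e-3):
    """Central-difference estimate of H(theta) @ v using two gradient calls."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


# The Hessian of the toy loss is [[2, 3], [3, 4]], so H @ [1, 0] = [2, 3].
hv = fd_hvp(grad, [1.0, -2.0], [1.0, 0.0])
```

For a quadratic the central difference is exact up to rounding. For general losses the step size trades truncation bias against floating-point noise, which is the regime question the abstract refers to.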

#14 Safety-Efficacy Trade Off: Robustness against Data-Poisoning

backdoor

Authors: Diego Granziol

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00822

Abstract:
Backdoor and data poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input space curvature.

#15 Harmful Overfitting in Sobolev Spaces

Authors: Kedar Karhadkar, Alexander Sietsema, Deanna Needell, Guido Montufar

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00825

Abstract:
Motivated by recent work on benign overfitting in overparameterized machine learning, we study the generalization behavior of functions in Sobolev spaces $W^{k, p}(\mathbb{R}^d)$ that perfectly fit a noisy training data set. Under assumptions of label noise and sufficient regularity in the data distribution, we show that approximately norm-minimizing interpolators, which are canonical solutions selected by smoothness bias, exhibit harmful overfitting: even as the training sample size $n \to \infty$, the generalization error remains bounded below by a positive constant with high probability. Our results hold for arbitrary values of $p \in [1, \infty)$, in contrast to prior results studying the Hilbert space case ($p = 2$) using kernel methods. Our proof uses a geometric argument which identifies harmful neighborhoods of the training data using Sobolev inequalities.

#16 Score-based Metropolis-Hastings for Fractional Langevin Algorithms

Authors: Ahmed Aloui, Junyi Liao, Ali Hasan, Jose Blanchet, Vahid Tarokh

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00835

Abstract:
Sampling from heavy-tailed and multimodal distributions is challenging when neither the target density nor the proposal density can be evaluated, as in $\alpha$-stable Lévy-driven fractional Langevin algorithms. While the target distribution can be estimated from data via score-based or energy-based models, the $\alpha$-stable proposal density and its score are generally unavailable, rendering classical density-based Metropolis--Hastings (MH) corrections impractical. Consequently, existing fractional Langevin methods operate in an unadjusted regime and can exhibit substantial finite-time errors and poor empirical control of tail behavior. We introduce the Metropolis-Adjusted Fractional Langevin Algorithm (MAFLA), an MH-inspired, fully score-based correction mechanism. MAFLA employs designed proxies for fractional proposal score gradients under isotropic symmetric $\alpha$-stable noise and learns an acceptance function via Score Balance Matching. We empirically illustrate the strong performance of MAFLA on a series of tasks including combinatorial optimization problems where the method significantly improves finite-time sampling accuracy over unadjusted fractional Langevin dynamics.

#17 Multivariate Time Series Data Imputation via Distributionally Robust Regularization

Authors: Che-Yi Liao, Zheng Dong, Gian-Gabriel Garcia, Kamran Paynabar

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00844

Abstract:
Multivariate time series (MTS) imputation is often compromised by mismatch between observed and true data distributions -- a bias exacerbated by non-stationarity and systematic missingness. Standard methods that minimize reconstruction error or encourage distributional alignment risk overfitting these biased observations. We propose the Distributionally Robust Regularized Imputer Objective (DRIO), which jointly minimizes reconstruction error and the divergence between the imputer and a worst-case distribution within a Wasserstein ambiguity set. We derive a tractable dual formulation that reduces infinite-dimensional optimization over measures to adversarial search over sample trajectories, and propose an adversarial learning algorithm compatible with flexible deep learning backbones. Comprehensive experiments on diverse real-world datasets show DRIO consistently improves imputation under both missing-completely-at-random and missing-not-at-random settings, reaching Pareto-optimal trade-offs between reconstruction accuracy and distributional alignment.

#18 Optimal Decision-Making Based on Prediction Sets

Authors: Tao Wang, Edgar Dobriban

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00989

Abstract:
Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decision-making. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and safety-critical decision-making tasks demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly.
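
As background on how such prediction sets are typically built, here is a minimal split-conformal sketch. It is not ROCP. The nonconformity scores are made-up numbers, and `conformal_threshold` and `prediction_set` are hypothetical helper names introduced for this illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the set must contain every label.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}


# Illustrative nonconformity scores, e.g. 1 - predicted probability of the true label.
threshold = conformal_threshold([0.3, 0.1, 0.2], alpha=0.5)
labels = prediction_set({"a": 0.1, "b": 0.5}, threshold)
```

Under exchangeability of calibration and test points, this threshold yields marginal coverage of at least $1 - \alpha$. How to act on the resulting set is the question the paper addresses.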

#19 Online Social Welfare Function-based Resource Allocation

Authors: Kanad Pardeshi, Samsara Foubert, Aarti Singh

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01400

Abstract:
In many real-world settings, a centralized decision-maker must repeatedly allocate finite resources to a population over multiple time steps. Individuals who receive a resource derive some stochastic utility; to characterize the population-level effects of an allocation, the expected individual utilities are then aggregated using a social welfare function (SWF). We formalize this setting and present a general confidence sequence framework for SWF-based online learning and inference, valid for any monotonic, concave, and Lipschitz-continuous SWF. Our key insight is that monotonicity alone suffices to lift confidence sequences from individual utilities to anytime-valid bounds on optimal welfare. Building on this foundation, we propose SWF-UCB, a SWF-agnostic online learning algorithm that achieves near-optimal $\tilde{O}(n+\sqrt{nkT})$ regret (for $k$ resources distributed among $n$ individuals at each of $T$ time steps). We instantiate our framework on three normatively distinct SWF families: Weighted Power Mean, Kolm, and Gini, providing bespoke oracle algorithms for each. Experiments confirm $\sqrt{T}$ scaling and reveal rich interactions between $k$ and SWF parameters. This framework naturally supports inference applications such as sequential hypothesis testing, optimal stopping, and policy evaluation.

#20 Importance Weighted Variational Inference without the Reparameterization Trick

Authors: Kamélia Daudel, Minh-Ngoc Tran, Cheng Zhang

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01412

Abstract:
Importance weighted variational inference (VI) approximates densities known up to a normalizing constant by optimizing bounds that tighten with the number of Monte Carlo samples $N$. Standard optimization relies on reparameterized gradient estimators, which are well-studied theoretically yet restrict both the choice of the data-generating process and the variational approximation. While REINFORCE gradient estimators do not suffer from such restrictions, they lack rigorous theoretical justification. In this paper, we provide the first comprehensive analysis of REINFORCE gradient estimators in importance weighted VI, leveraging this theoretical foundation to diagnose and resolve fundamental deficiencies in current state-of-the-art estimators. Specifically, we introduce and examine a generalized family of variational inference for Monte Carlo objectives (VIMCO) gradient estimators. We prove that state-of-the-art VIMCO gradient estimators exhibit a vanishing signal-to-noise ratio (SNR) as $N$ increases, which prevents effective optimization. To overcome this issue, we propose the novel VIMCO-$\star$ gradient estimator and show that it averts the SNR collapse of existing VIMCO gradient estimators by achieving a $\sqrt{N}$ SNR scaling instead. We demonstrate its superior empirical performance compared to current VIMCO implementations in challenging settings where reparameterized gradients are typically unavailable.
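
The objective being optimized is the log of an average of $N$ importance weights. The sketch below evaluates that quantity stably from log-weights. It covers only the bound estimate, not any of the gradient estimators analyzed in the paper, and `iw_bound_estimate` is a hypothetical helper name.

```python
import math


def iw_bound_estimate(log_weights):
    """log((1/N) * sum_i exp(log_w_i)), computed with the log-sum-exp trick."""
    m = max(log_weights)
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total) - math.log(len(log_weights))


# Two samples with importance weights 1 and 3: the estimate is log(2).
estimate = iw_bound_estimate([math.log(1.0), math.log(3.0)])
# Very small weights would underflow without subtracting the maximum first.
tiny = iw_bound_estimate([-1000.0, -1000.0])
```

With $N = 1$ this reduces to the usual single-sample ELBO estimate, and by Jensen's inequality it is never smaller than the plain average of the log-weights.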

#21 Robust Generalization with Adaptive Optimal Transport Priors for Decision-Focused Learning

Authors: Haixiang Sun, Andrew L. Liu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01427

Abstract:
Few-shot learning requires models to generalize under limited supervision while remaining robust to distribution shifts. Existing Sinkhorn Distributionally Robust Optimization (DRO) methods provide theoretical guarantees but rely on a fixed reference distribution, which limits their adaptability. We propose a Prototype-Guided Distributionally Robust Optimization (PG-DRO) framework that learns class-adaptive priors from abundant base data via hierarchical optimal transport and embeds them into the Sinkhorn DRO formulation. This design enables few-shot information to be organically integrated into producing class-specific robust decisions that are both theoretically grounded and efficient, and further aligns the uncertainty set with transferable structural knowledge. Experiments show that PG-DRO achieves stronger robust generalization in few-shot scenarios, outperforming both standard learners and DRO baselines.

#22 Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function

Authors: Tuan Minh Pham, Thinh Cao, Viet Nguyen, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01466

Abstract:
The sigmoid gate in mixture-of-experts (MoE) models has been empirically shown to outperform the softmax gate across several tasks, ranging from approximating feed-forward networks to language modeling. Additionally, recent efforts have demonstrated that the sigmoid gate is provably more sample-efficient than its softmax counterpart under regression settings. Nevertheless, there are three notable concerns that have not been addressed in the literature, namely (i) the benefits of the sigmoid gate have not been established under classification settings; (ii) existing sigmoid-gated MoE models may not converge to their ground-truth; and (iii) the effects of a temperature parameter in the sigmoid gate remain theoretically underexplored. To tackle these open problems, we perform a comprehensive analysis of multinomial logistic MoE equipped with a modified sigmoid gate to ensure model convergence. Our results indicate that the sigmoid gate exhibits a lower sample complexity than the softmax gate for both parameter and expert estimation. Furthermore, we find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order due to an intrinsic interaction between the temperature and gating parameters. To overcome this issue, we propose replacing the vanilla inner product score in the gating function with a Euclidean score that effectively removes that interaction, thereby substantially improving the sample complexity to a polynomial order.

#23 Density-Informed Pseudo-Counts for Calibrated Evidential Deep Learning

Authors: Pietro Carlotti, Nevena Gligić, Arya Farahi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01477

Abstract:
Evidential Deep Learning (EDL) is a popular framework for uncertainty-aware classification that models predictive uncertainty via Dirichlet distributions parameterized by neural networks. Despite its popularity, its theoretical foundations and behavior under distributional shift remain poorly understood. In this work, we provide a principled statistical interpretation by proving that EDL training corresponds to amortized variational inference in a hierarchical Bayesian model with a tempered pseudo-likelihood. This perspective reveals a major drawback: standard EDL conflates epistemic and aleatoric uncertainty, leading to systematic overconfidence on out-of-distribution (OOD) inputs. To address this, we introduce Density-Informed Pseudo-count EDL (DIP-EDL), a new parametrization that decouples class prediction from the magnitude of uncertainty by separately estimating the conditional label distribution and the marginal covariate density. This separation preserves evidence in high-density regions while shrinking predictions toward a uniform prior for OOD data. Theoretically, we prove that DIP-EDL achieves asymptotic concentration. Empirically, we show that our method enhances interpretability and improves robustness and uncertainty calibration under distributional shift.

#24 Inference-Aware Meta-Alignment of LLMs via Non-Linear GRPO

Authors: Shokichi Takakura, Akifumi Wachi, Rei Higuchi, Kohei Miyaguchi, Taiji Suzuki

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01603

Abstract:
Aligning large language models (LLMs) to diverse human preferences is fundamentally challenging since criteria can often conflict with each other. Inference-time alignment methods have recently gained popularity as they allow LLMs to be aligned to multiple criteria via different alignment algorithms at inference time. However, inference-time alignment is computationally expensive since it often requires multiple forward passes of the base model. In this work, we propose inference-aware meta-alignment (IAMA), a novel approach that enables LLMs to be aligned to multiple criteria with limited computational budget at inference time. IAMA trains a base model such that it can be effectively aligned to multiple tasks via different inference-time alignment algorithms. To solve the non-linear optimization problems involved in IAMA, we propose non-linear GRPO, which provably converges to the optimal solution in the space of probability measures.
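
For background, standard GRPO (not the non-linear variant proposed in this paper) scores each sampled response relative to the other responses in its group by standardizing the rewards. A minimal sketch of that normalization follows. The function name, the use of the population standard deviation, the `eps` constant, and the rewards are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    # eps guards against division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # made-up rewards for four sampled responses
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses with above-average reward receive positive advantages and the rest receive negative ones, so the update depends only on relative quality within the group.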

#25 ST-BCP: Tightening Coverage Bound for Backward Conformal Prediction via Non-Conformity Score Transformation

Authors: Junxian Liu, Hao Zeng, Hongxin Wei

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01733

Abstract:
Conformal Prediction (CP) provides a statistical framework for uncertainty quantification that constructs prediction sets with coverage guarantees. While CP yields uncontrolled prediction set sizes, Backward Conformal Prediction (BCP) inverts this paradigm by enforcing a predefined upper bound on set size and estimating the resulting coverage guarantee. However, the looseness induced by Markov's inequality within the BCP framework causes a significant gap between the estimated coverage bound and the empirical coverage. In this work, we introduce ST-BCP, a novel method that applies a data-dependent transformation to the nonconformity scores to narrow the coverage gap. In particular, we develop a computable transformation and prove that it outperforms the baseline identity transformation. Extensive experiments demonstrate the effectiveness of our method, reducing the average coverage gap from 4.20% to 1.12% on common benchmarks.
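
As a reference point for the abstract above, here is a minimal sketch of standard split conformal prediction, the forward procedure that BCP inverts. It is not ST-BCP. The threshold is the ceil((n + 1)(1 - alpha))-th smallest calibration score, and the prediction set keeps every label whose score does not exceed it. The function names and all scores are made up for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every label must be kept
    return sorted(cal_scores)[k - 1]

def prediction_set(label_scores, threshold):
    """Keep every label whose nonconformity score is at or below the threshold."""
    return [label for label, s in label_scores.items() if s <= threshold]

# Made-up nonconformity scores on a calibration set of size 9
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
q_hat = conformal_threshold(cal_scores, alpha=0.25)

# Made-up scores of three candidate labels for one test input
test_scores = {"cat": 0.12, "dog": 0.38, "bird": 0.71}
selected = prediction_set(test_scores, q_hat)
print(q_hat, selected)
```

The size of `selected` is whatever the scores produce, which is the uncontrolled set size the abstract refers to.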

#26 Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Authors: Ryotaro Kawata, Taiji Suzuki

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01863

Abstract:
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^I \mu^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)},x_\mathrm{q})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.

#27 Reliable Real-Time Value at Risk Estimation via Quantile Regression Forest with Conformal Calibration

Authors: Du-Yi Wang, Guo Liang, Kun Zhang, Qianwen Zhu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01912

Abstract:
Rapidly evolving market conditions call for real-time risk monitoring, but its online estimation remains challenging. In this paper, we study the online estimation of one of the most widely used risk measures, Value at Risk (VaR). Its accurate and reliable estimation is essential for timely risk control and informed decision-making. We propose to use the quantile regression forest in the offline-simulation-online-application (OSOA) framework. Specifically, the quantile regression forest is trained offline to learn the relationship between the online VaR and risk factors, and real-time VaR estimates are then produced online by incorporating observed risk factors. To further ensure reliability, we develop a conformalized estimator that calibrates the online VaR estimates. To the best of our knowledge, we are the first to leverage conformal calibration to estimate real-time VaR reliably based on the OSOA formulation. Theoretical analysis establishes the consistency and coverage validity of the proposed estimators. Numerical experiments validate the proposed method and demonstrate its effectiveness in practice.

#28 Privacy Amplification by Missing Data

privacy

Authors: Simon Roburin (LPSM), Rafaël Pinot (LPSM), Erwan Scornet (LPSM)

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01928

Abstract:
Privacy preservation is a fundamental requirement in many high-stakes domains such as medicine and finance, where sensitive personal data must be analyzed without compromising individual confidentiality. At the same time, these applications often involve datasets with missing values due to non-response, data corruption, or deliberate anonymization. Missing data is traditionally viewed as a limitation because it reduces the information available to analysts and can degrade model performance. In this work, we take an alternative perspective and study missing data from a privacy preservation standpoint. Intuitively, when features are missing, less information is revealed about individuals, suggesting that missingness could inherently enhance privacy. We formalize this intuition by analyzing missing data as a privacy amplification mechanism within the framework of differential privacy. We show, for the first time, that incomplete data can yield privacy amplification for differentially private algorithms.

#29 Stochastic Interpolants in Hilbert Spaces

Authors: James Boran Yu, RuiKang OuYang, Julien Horwood, José Miguel Hernández-Lobato

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01988

Abstract:
Although diffusion models have been successfully extended to function-valued data, stochastic interpolants -- which offer a flexible way to bridge arbitrary distributions -- remain limited to finite-dimensional settings. This work bridges this gap by establishing a rigorous framework for stochastic interpolants in infinite-dimensional Hilbert spaces. We provide comprehensive theoretical foundations, including proofs of well-posedness and explicit error bounds. We demonstrate the effectiveness of the proposed framework for conditional generation, focusing particularly on complex PDE-based benchmarks. By enabling generative bridges between arbitrary functional distributions, our approach achieves state-of-the-art results, offering a powerful, general-purpose tool for scientific discovery.

#30 Training-free score-based diffusion for parameter-dependent stochastic dynamical systems

diffusion

Authors: Minglei Yang, Sicheng He

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02113

Abstract:
Simulating parameter-dependent stochastic differential equations (SDEs) presents significant computational challenges, as separate high-fidelity simulations are typically required for each parameter value of interest. Despite the success of machine learning methods in learning SDE dynamics, existing approaches either require expensive neural network training for score function estimation or lack the ability to handle continuous parameter dependence. We present a training-free conditional diffusion model framework for learning stochastic flow maps of parameter-dependent SDEs, where both drift and diffusion coefficients depend on physical parameters. The key technical innovation is a joint kernel-weighted Monte Carlo estimator that approximates the conditional score function using trajectory data sampled at discrete parameter values, enabling interpolation across both state space and the continuous parameter domain. Once trained, the resulting generative model produces sample trajectories for any parameter value within the training range without retraining, significantly accelerating parameter studies, uncertainty quantification, and real-time filtering applications. The performance of the proposed approach is demonstrated via three numerical examples of increasing complexity, showing accurate approximation of conditional distributions across varying parameter values.
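
To make the idea of a kernel-weighted Monte Carlo score estimate concrete, here is a minimal one-dimensional sketch. The paper's estimator may differ in its details. It assumes a Gaussian forward noising `x_t = alpha * x0 + sigma * noise`, so each stored sample contributes a Gaussian bump in state space, and it weights samples by a Gaussian kernel in parameter space. The function name, the kernel choice, and the `bandwidth` argument are all assumptions made for illustration.

```python
import math

def kernel_weighted_score(x, theta, samples, alpha, sigma, bandwidth):
    """Estimate the score of the noised conditional density at (x, theta).

    samples: list of (x0_i, theta_i) pairs with scalar states and parameters.
    Sample i gets weight proportional to
    N(x; alpha * x0_i, sigma^2) * exp(-(theta - theta_i)^2 / (2 * bandwidth^2)).
    """
    log_w = [
        -((x - alpha * x0) ** 2) / (2 * sigma ** 2)
        - ((theta - th) ** 2) / (2 * bandwidth ** 2)
        for x0, th in samples
    ]
    top = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    # Weighted average of the per-sample Gaussian scores (alpha * x0_i - x) / sigma^2
    return sum(
        wi * (alpha * x0 - x) / sigma ** 2 for wi, (x0, _) in zip(w, samples)
    ) / total

# Made-up data: the first two samples sit symmetrically around x = 0 at theta = 0
samples = [(1.0, 0.0), (-1.0, 0.0), (3.0, 2.0)]
score_at_origin = kernel_weighted_score(0.0, 0.0, samples[:2], 1.0, 1.0, 0.5)
score_with_far_sample = kernel_weighted_score(0.0, 0.0, samples, 1.0, 1.0, 0.5)
print(score_at_origin, score_with_far_sample)
```

No network is fitted here: the estimate is a closed-form weighted average over stored samples, and the parameter kernel lets it be queried at values of `theta` between the sampled ones.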

#31 Learning Beyond the Gaussian Data: Learning Dynamics of Neural Networks on an Expressive and Cumulant-Controllable Data Model

Authors: Onat Ure, Samet Demir, Zafer Dogan

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02153

Abstract:
We study the effect of high-order statistics of data on the learning dynamics of neural networks (NNs) by using a moment-controllable non-Gaussian data model. Considering the expressivity of two-layer neural networks, we first construct the data model as a generative two-layer NN where the activation function is expanded by using Hermite polynomials. This allows us to achieve interpretable control over high-order cumulants such as skewness and kurtosis through the Hermite coefficients while keeping the data model realistic. Using samples generated from the data model, we perform controlled online learning experiments with a two-layer NN. Our results reveal a moment-wise progression in training: networks first capture low-order statistics such as mean and covariance, and progressively learn high-order cumulants. Finally, we pretrain the generative model on the Fashion-MNIST dataset and leverage the generated samples for further experiments. The results of these additional experiments confirm our conclusions and show the utility of the data model in a real-world scenario. Overall, our proposed approach bridges simplified data assumptions and practical data complexity, which offers a principled framework for investigating distributional effects in machine learning and signal processing.

#32 PCA of probability measures: Sparse and Dense sampling regimes

Authors: Gachon Erell, Jérémie Bigot, Elsa Cazelles

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02190

Abstract:
A common approach to perform PCA on probability measures is to embed them into a Hilbert space where standard functional PCA techniques apply. While convergence rates for estimating the embedding of a single measure from $m$ samples are well understood, the literature has not addressed the setting involving multiple measures. In this paper, we study PCA in a double asymptotic regime where $n$ probability measures are observed, each through $m$ samples. We derive convergence rates of the form $n^{-1/2} + m^{-\alpha}$ for the empirical covariance operator and the PCA excess risk, where $\alpha>0$ depends on the chosen embedding. This characterizes the relationship between the number $n$ of measures and the number $m$ of samples per measure, revealing a sparse (small $m$) to dense (large $m$) transition in the convergence behavior. Moreover, we prove that the dense-regime rate is minimax optimal for the empirical covariance error. Our numerical experiments validate these theoretical rates and demonstrate that appropriate subsampling preserves PCA accuracy while reducing computational cost.

#33 Transfer Learning Through Conditional Quantile Matching

Authors: Yikun Zhang, Steven Wilkins-Reeves, Wesley Lee, Aude Hofleitner

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02358

Abstract:
We introduce a transfer learning framework for regression that leverages heterogeneous source domains to improve predictive performance in a data-scarce target domain. Our approach learns a conditional generative model separately for each source domain and calibrates the generated responses to the target domain via conditional quantile matching. This distributional alignment step corrects general discrepancies between source and target domains without imposing restrictive assumptions such as covariate or label shift. The resulting framework provides a principled and flexible approach to high-quality data augmentation for downstream learning tasks in the target domain. From a theoretical perspective, we show that an empirical risk minimizer (ERM) trained on the augmented dataset achieves a tighter excess risk bound than the target-only ERM under mild conditions. In particular, we establish new convergence rates for the quantile matching estimator that governs the transfer bias-variance tradeoff. From a practical perspective, extensive simulations and real data applications demonstrate that the proposed method consistently improves prediction accuracy over target-only learning and competing transfer learning methods.

#34 Provably Data-driven Multiple Hyper-parameter Tuning with Structured Loss Function

Authors: Tung Quoc Le, Anh Tuan Nguyen, Viet Anh Nguyen

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02406

Abstract:
Data-driven algorithm design automates hyperparameter tuning, but its statistical foundations remain limited because model performance can depend on hyperparameters in implicit and highly non-smooth ways. Existing guarantees focus on the simple case of a one-dimensional (scalar) hyperparameter. This leaves the practically important, multi-dimensional hyperparameter tuning setting unresolved. We address this open question by developing the first general framework for establishing generalization guarantees for tuning multi-dimensional hyperparameters in data-driven settings. Our approach strengthens the generalization guarantee framework for semi-algebraic function classes by exploiting tools from real algebraic geometry, yielding sharper, more broadly applicable guarantees. We then extend the analysis to hyperparameter tuning using the validation loss under minimal assumptions, and derive improved bounds when additional structure is available. Finally, we demonstrate the scope of the framework with new learnability results, including data-driven weighted group lasso and weighted fused lasso.

#35 Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

Authors: Filip Kovačević, Hong Chang Ji, Denny Wu, Mahdi Soltanolkotabi, Marco Mondelli

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02431

Abstract:
It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.

#36 Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning

Authors: Jice Zeng, David Barajas-Solano, Hui Chen

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00072

Abstract:
The performance of machine learning surrogates is critically dependent on data quality and quantity. This presents a major challenge, as high-fidelity (HF) data is often scarce and computationally expensive to acquire, while low-fidelity (LF) data is abundant but less accurate. To address this data scarcity problem, we develop a probabilistic multi-fidelity surrogate framework based on generative transfer learning. We employ a normalizing flow (NF) generative model as the backbone, which is trained in two phases: (i) the NF is first pretrained on a large LF dataset to learn a probabilistic forward model; (ii) the pretrained model is then fine-tuned on a small HF dataset, allowing it to correct for LF-HF discrepancies via knowledge transfer. To relax the dimension-preserving constraint of standard bijective NFs, we integrate surjective (dimension-reducing) layers with standard coupling blocks. This architecture enables learned dimension reduction while preserving the ability to train with exact likelihoods. The resulting surrogate provides fast probabilistic predictions with quantified uncertainty and significantly outperforms LF-only baselines while using fewer HF evaluations. We validate the approach on a reinforced concrete slab benchmark, combining many coarse-mesh (LF) simulations with a limited set of fine-mesh (HF) simulations. The proposed model achieves probabilistic predictions with HF accuracy, demonstrating a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems.

#37 Test-Time Adaptation for Non-stationary Time Series: From Synthetic Regime Shifts to Financial Markets

Authors: Yurui Wu, Qingying Deng, Wonou Chung, Mairui Li

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00073

Abstract:
Time series encountered in practice are rarely stationary. When the data distribution changes, a forecasting model trained on past observations can lose accuracy. We study a small-footprint test-time adaptation (TTA) framework for causal time-series forecasting and direction classification. The backbone is frozen, and only normalization affine parameters are updated using recent unlabeled windows. For classification we minimize entropy and enforce temporal consistency; for regression we minimize prediction variance across weak time-preserving augmentations and optionally distill from an EMA teacher. A quadratic drift penalty and an uncertainty-triggered fallback keep updates stable. We evaluate this framework in two stages: synthetic regime shifts on ETT benchmarks, and daily equity and FX series (SPY, QQQ, EUR/USD) across pandemic, high-inflation, and recovery regimes. On synthetic gradual drift, normalization-based TTA improves forecasting error, while in financial markets a simple batch-normalization statistics update is a robust default and more aggressive norm-only adaptation can even hurt. Our results provide practical guidance for deploying TTA on non-stationary time series.

#38 Early warning prediction: Onsager-Machlup vs Schrödinger

Authors: Xiaoai Xu, Yixuan Zhou, Xiang Zhou, Jingqiao Duan, Ting Gao

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00143

Abstract:
Predicting critical transitions in complex systems, such as epileptic seizures in the brain, represents a major challenge in scientific research. The high-dimensional characteristics and hidden critical signals further complicate early-warning tasks. This study proposes a novel early-warning framework that integrates manifold learning with stochastic dynamical system modeling. Through systematic comparison, six methods including diffusion maps (DM) are selected to construct low-dimensional representations. Based on these, a data-driven stochastic differential equation model is established to robustly estimate the probability evolution scoring function of the system. Building on this, a new Score Function (SF) indicator is defined by incorporating Schrödinger bridge theory to quantify the likelihood of significant state transitions in the system. Experiments demonstrate that this indicator exhibits higher sensitivity and robustness in epilepsy prediction, enables earlier identification of critical points, and clearly captures dynamic features across various stages before and after seizure onset. This work provides a systematic theoretical framework and practical methodology for extracting early-warning signals from high-dimensional data.

#39 GRIP2: A Robust and Powerful Deep Knockoff Method for Feature Selection

Authors: Bob Junyi Zou, Lu Tian

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00218

Abstract:
Identifying truly predictive covariates while strictly controlling false discoveries remains a fundamental challenge in nonlinear, highly correlated, and low signal-to-noise regimes, where deep learning based feature selection methods are most attractive. We propose Group Regularization Importance Persistence in 2 Dimensions (GRIP2), a deep knockoff feature importance statistic that integrates first-layer feature activity over a two-dimensional regularization surface controlling both sparsity strength and sparsification geometry. To approximate this surface integral in a single training run, we introduce efficient block-stochastic sampling, which aggregates feature activity magnitudes across diverse regularization regimes along the optimization trajectory. The resulting statistics are antisymmetric by construction, ensuring finite-sample FDR control. In extensive experiments on synthetic and semi-real data, GRIP2 demonstrates improved robustness to feature correlation and noise level: in high correlation and low signal-to-noise ratio regimes where standard deep learning based feature selectors may struggle, our method retains high power and stability. Finally, on real-world HIV drug resistance data, GRIP2 recovers known resistance-associated mutations with power better than established linear baselines, confirming its reliability in practice.

#40 LatentTrack: Sequential Weight Generation via Latent Filtering

Authors: Omer Haq

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00458

Abstract:
We introduce LatentTrack (LT), a sequential neural architecture for online probabilistic prediction under nonstationary dynamics. LT performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate predictive model parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. At each time step, a learned latent model predicts the next latent distribution, which is updated via amortized inference using new observations, yielding a predict--generate--update filtering framework in function space. The formulation supports both structured (Markovian) and unstructured latent dynamics within a unified objective, while Monte Carlo inference over latent trajectories produces calibrated predictive mixtures with fixed per-step cost. Evaluated on long-horizon online regression using the Jena Climate benchmark, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration, demonstrating that latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift.

#41 Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

Authors: Jiamin Xu, Kyra Gan

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00781

Abstract:
Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove fast finite-sample convergence: it achieves minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.
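
The sketch below illustrates what a K-step lookahead Q-function with a threshold could look like on a toy problem. It is not the paper's learning algorithm, which estimates these values online. Here the transition model and rewards are assumed known, the lookahead values come from plain backward recursion, and the two-state MDP, the function names, and the threshold value are all made up for illustration.

```python
def k_step_lookahead_q(P, R, K):
    """Compute Q_K(s, a) by backward recursion over K >= 1 steps.

    P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the reward.
    Q_1(s, a) = R[s][a]
    Q_k(s, a) = R[s][a] + sum over s' of P(s'|s, a) * max over a' of Q_{k-1}(s', a')
    """
    n_states = len(R)
    V = [0.0] * n_states  # value with zero steps remaining
    Q = []
    for _ in range(K):
        Q = [
            [R[s][a] + sum(p * V[s2] for s2, p in P[s][a]) for a in range(len(R[s]))]
            for s in range(n_states)
        ]
        V = [max(q_s) for q_s in Q]
    return Q

def thresholded_actions(q_row, threshold):
    """Return the actions whose lookahead value exceeds the threshold."""
    return [a for a, q in enumerate(q_row) if q > threshold]

# Toy MDP: action 0 stays in place, action 1 moves to the other state.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
R = [[1.0, 0.0],   # state 0: staying pays 1, moving pays 0
     [3.0, 0.0]]   # state 1: staying pays 3, moving pays 0

q1 = k_step_lookahead_q(P, R, K=1)
q2 = k_step_lookahead_q(P, R, K=2)
eligible = thresholded_actions(q2[0], threshold=2.5)
print(q1, q2, eligible)
```

With `K=1` the myopic choice in state 0 is to stay, while with `K=2` moving to state 1 looks better, which shows how lookahead depth changes the preferred action.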

#42 Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization

Authors: Taesun Yeom, Taehyeok Ha, Jaeho Lee

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00827

Abstract:
Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes -- both in training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an optimal FLS -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: an excessively large FLS induces an over-alignment phenomenon that degrades generalization, while an overly small FLS leads to over-fitting.

#43 Don't Forget Its Variance! The Minimum Path Variance Principle for Accurate and Stable Score-Based Density Ratio Estimation

Authors: Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang, John Paisley, Delu Zeng

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00834

Abstract:
Score-based methods have emerged as a powerful framework for density ratio estimation (DRE), but they face an important paradox in that, while theoretically path-independent, their practical performance depends critically on the chosen path schedule. We resolve this issue by proving that tractable training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the time score. To address this, we propose the MinPV (Minimum Path Variance) principle, which introduces a principled heuristic to minimize the overlooked path variance. Our key contribution is the derivation of a closed-form expression for the variance, turning an intractable problem into a tractable optimization. By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns a data-adaptive, low-variance path without heuristic selection. This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks.

#44 Multimodal Scientific Learning Beyond Diffusions and Flows

diffusion

Authors: Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.00960

Abstract:
Scientific machine learning (SciML) increasingly requires models that capture multimodal conditional uncertainty arising from ill-posed inverse problems, multistability, and chaotic dynamics. While recent work has favored highly expressive implicit generative models such as diffusion and flow-based methods, these approaches are often data-hungry, computationally costly, and misaligned with the structured solution spaces frequently found in scientific problems. We demonstrate that Mixture Density Networks (MDNs) provide a principled yet largely overlooked alternative for multimodal uncertainty quantification in SciML. As explicit parametric density estimators, MDNs impose an inductive bias tailored to low-dimensional, multimodal physics, enabling direct global allocation of probability mass across distinct solution branches. This structure delivers strong data efficiency, allowing reliable recovery of separated modes in regimes where scientific data is scarce. We formalize these insights through a unified probabilistic framework contrasting explicit and implicit distribution networks, and demonstrate empirically that MDNs achieve superior generalization, interpretability, and sample efficiency across a range of inverse, multistable, and chaotic scientific regression tasks.
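
A Mixture Density Network outputs the weights, means, and standard deviations of a mixture and is trained by minimizing the negative log-likelihood of the observed targets. The following minimal sketch evaluates that loss for a one-dimensional Gaussian mixture, with the network itself omitted. The function name and all numbers are assumptions chosen for illustration.

```python
import math

def mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture."""
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((y - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(log_terms)  # log-sum-exp trick for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

# A target on one of two well-separated solution branches (made-up numbers)
y = 2.0
bimodal = mixture_nll(y, weights=[0.5, 0.5], means=[-2.0, 2.0], stds=[0.5, 0.5])
unimodal = mixture_nll(y, weights=[1.0], means=[0.0], stds=[2.0])
print(bimodal, unimodal)
```

The two-component mixture assigns the target a lower loss than a single wide Gaussian centered between the branches, which is the sense in which an explicit mixture can place probability mass directly on separate modes.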

#45 Superposition unifies power-law training dynamics

Authors: Zixin Jessie Chen, Hao Chen, Yizhou Liu, Jeff Gore

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01045

Abstract:
We investigate the role of feature superposition in the emergence of power-law training dynamics using a teacher-student framework. We first derive an analytic theory for training without superposition, establishing that the power-law training exponent depends on both the input data statistics and channel importance. Remarkably, we discover that a superposition bottleneck induces a transition to a universal power-law exponent of $\sim 1$, independent of data and channel statistics. This one-over-time training with superposition represents an up to tenfold acceleration compared to the purely sequential learning that takes place in the absence of superposition. Our finding that superposition leads to rapid training with a data-independent power-law exponent may have important implications for a wide range of neural networks that employ superposition, including production-scale large language models.

#46 Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

Authors: Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01285

Abstract:
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI
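For orientation, the distribution-free guarantee mentioned above can be illustrated with plain split conformal calibration. This is a minimal sketch of the standard construction, not MACI's multiplicative filtering or its group-conditional calibration; the function name and the toy scores are assumptions made for illustration.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split conformal threshold on nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    A new exchangeable score falls at or below it with probability >= 1 - alpha.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial threshold is valid.
        return math.inf
    return sorted(calibration_scores)[k - 1]

# Hypothetical calibration scores (higher = less trustworthy claim).
scores = [0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.7]
threshold = conformal_threshold(scores, alpha=0.25)
```

Claims whose score exceeds `threshold` would be filtered out. The finite-sample correction `(n + 1)` is what separates this from a naive empirical quantile.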

#47 High-accuracy sampling for diffusion models and log-concave distributions

diffusion

Authors: Fan Chen, Sinho Chewi, Constantinos Daskalakis, Alexander Rakhlin

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01338

Abstract:
We present algorithms for diffusion model sampling which obtain $\delta$-error in $\mathrm{polylog}(1/\delta)$ steps, given access to $\widetilde O(\delta)$-accurate score estimates in $L^2$. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is $\widetilde O(d\,\mathrm{polylog}(1/\delta))$ where $d$ is the dimension of the data; under a non-uniform $L$-Lipschitz condition, the complexity is $\widetilde O(\sqrt{dL}\,\mathrm{polylog}(1/\delta))$; and if the data distribution has intrinsic dimension $d_\star$, then the complexity reduces to $\widetilde O(d_\star\,\mathrm{polylog}(1/\delta))$. Our approach also yields the first $\mathrm{polylog}(1/\delta)$ complexity sampler for general log-concave distributions using only gradient evaluations.

#48 Context Dependence and Reliability in Autoregressive Language Models

Authors: Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01378

Abstract:
Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

#49 On the Power of (Approximate) Reward Models for Inference-Time Scaling

Authors: Youheng Zhu, Yiping Lu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01381

Abstract:
Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.

#50 An Odd Estimator for Shapley Values

Authors: Fabian Fumagalli, Landon Butler, Justin Singh Kang, Kannan Ramchandran, R. Teal Witter

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01399

Abstract:
The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark evaluation, we find that OddSHAP achieves state-of-the-art estimation accuracy.
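The claim that the Shapley value depends only on the odd component can be checked numerically on a small game. The sketch below assumes that "odd component" means the part of the set function that flips sign under complementation, $(v(S) - v(N \setminus S))/2$. That reading, the toy game, and all function names are illustrative assumptions. The code is a brute-force check, not OddSHAP itself.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values of a set function on players 0..n-1 (exponential cost)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classical Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - 1 - size) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def odd_part(value, n):
    """Odd component under complementation: (v(S) - v(N \\ S)) / 2."""
    full = frozenset(range(n))
    return lambda s: 0.5 * (value(s) - value(full - s))

def toy_game(s):
    """Hypothetical game: quadratic in coalition size plus a bonus for players 0 and 2."""
    return len(s) ** 2 + (3.0 if {0, 2} <= s else 0.0)

N_PLAYERS = 4
phi_full = shapley_values(toy_game, N_PLAYERS)
phi_odd = shapley_values(odd_part(toy_game, N_PLAYERS), N_PLAYERS)
```

Under this reading, `phi_full` and `phi_odd` agree up to floating-point error. The even part takes the same value on a coalition and on its complement, so its marginal contributions cancel in pairs under the Shapley weights.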

#51 DCD: Decomposition-based Causal Discovery from Autocorrelated and Non-Stationary Temporal Data

Authors: Muhammad Hasan Ferdous, Md Osman Gani

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01433

Abstract:
Multivariate time series in domains such as finance, climate science, and healthcare often exhibit long-term trends, seasonal patterns, and short-term fluctuations, complicating causal inference under non-stationarity and autocorrelation. Existing causal discovery methods typically operate on raw observations, making them vulnerable to spurious edges and misattributed temporal dependencies. We introduce a decomposition-based causal discovery framework that separates each time series into trend, seasonal, and residual components and performs component-specific causal analysis. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The resulting component-level graphs are integrated into a unified multi-scale causal structure. This approach isolates long- and short-range causal effects, reduces spurious associations, and improves interpretability. Across extensive synthetic benchmarks and real-world climate data, our framework more accurately recovers ground-truth causal structure than state-of-the-art baselines, particularly under strong non-stationarity and temporal autocorrelation.

#52 Theoretical Analysis of Measure Consistency Regularization for Partially Observed Data

Authors: Yinsong Wang, Shahin Shahrampour

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01437

Abstract:
The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in various applications, such as image inpainting, data imputation and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR's generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence to support our theoretical claims and to show the effectiveness and accuracy of our proposed stopping condition. We further provide a set of real-world data simulations to show the versatility of MCR under different model architectures designed for different data sources.

#53 Dimension-Free Multimodal Sampling via Preconditioned Annealed Langevin Dynamics

Authors: Lorenzo Baldassari, Josselin Garnier, Knut Solna, Maarten V. de Hoop

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01449

Abstract:
Designing algorithms that can explore multimodal target distributions accurately across successive refinements of an underlying high-dimensional problem is a central challenge in sampling. Annealed Langevin dynamics (ALD) is a widely used alternative to classical Langevin since it often yields much faster mixing on multimodal targets, but there is still a gap between this empirical success and existing theory: when, and under which design choices, can ALD be guaranteed to remain stable as dimension increases? In this paper, we help bridge this gap by providing a uniform-in-dimension analysis of continuous-time ALD for multimodal targets that can be well-approximated by Gaussian mixture models. Along an explicit annealing path obtained by progressively removing Gaussian smoothing of the target, we identify sufficient spectral conditions (linking the smoothing covariance and the covariances of the Gaussian components of the mixture) under which ALD achieves a prescribed accuracy within a single, dimension-uniform time horizon. We then establish dimension-robustness to imperfect initialization and score approximation: under a misspecified-mixture score model, we derive explicit conditions showing that preconditioning the ALD algorithm with a sufficiently decaying spectrum is necessary to prevent error terms from accumulating across coordinates and destroying dimension-uniform control. Finally, numerical experiments illustrate and validate the theory.

#54 A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts

Authors: Viet Nguyen, Tuan Minh Pham, Thinh Cao, Tan Dinh, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01468

Abstract:
Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention's benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention. In particular, while the former needs only a polynomial number of data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis also provides a theoretical justification for why gated attention yields higher performance when a gate is placed at the output of the scaled dot product attention or the value map rather than at other positions in the multi-head self-attention architecture.

#55 Rod Flow: A Continuous-Time Model for Gradient Descent at the Edge of Stability

Authors: Eric Regis, Sinho Chewi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01480

Abstract:
How can we understand gradient-based training over non-convex landscapes? The edge of stability phenomenon, introduced in Cohen et al. (2021), indicates that the answer is not so simple: namely, gradient descent (GD) with large step sizes often diverges away from the gradient flow. In this regime, the "Central Flow", recently proposed in Cohen et al. (2025), provides an accurate ODE approximation to the GD dynamics over many architectures. In this work, we propose Rod Flow, an alternative ODE approximation, which carries the following advantages: (1) it rests on a principled derivation stemming from a physical picture of GD iterates as an extended one-dimensional object -- a "rod"; (2) it better captures GD dynamics for simple toy examples and matches the accuracy of Central Flow for representative neural network architectures; and (3) it is explicit and cheap to compute. Theoretically, we prove that Rod Flow correctly predicts the critical sharpness threshold and explains self-stabilization in quartic potentials. We validate our theory with a range of numerical experiments.

#56 Predicting and improving test-time scaling laws via reward tail-guided search

Authors: Muheng Li, Jian Qian, Wenlong Mou

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01485

Abstract:
Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straightforward "best-of-$N$" (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.
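As a point of reference for the BoN baseline discussed above, here is a minimal sketch of best-of-$N$ selection on a toy problem. The Gaussian "responses", the identity reward, and all names are assumptions chosen for illustration; nothing here reproduces SLG Search or the linked repository.

```python
import random

def best_of_n(sample_response, reward, n, rng):
    """Draw n candidate responses and return the one with the highest reward."""
    candidates = [sample_response(rng) for _ in range(n)]
    return max(candidates, key=reward)

def mean_best_reward(n, trials=2000, seed=0):
    """Monte Carlo estimate of the expected reward of best-of-n on the toy problem."""
    rng = random.Random(seed)
    # Toy stand-ins (assumptions): a response is a Gaussian draw and its reward is its value.
    sample_response = lambda r: r.gauss(0.0, 1.0)
    reward = lambda x: x
    total = sum(reward(best_of_n(sample_response, reward, n, rng)) for _ in range(trials))
    return total / trials

# Expected best reward for a few sampling budgets.
means = {n: mean_best_reward(n) for n in (1, 4, 16)}
```

In this toy setting the expected best reward grows with $N$, and how fast it grows is governed by the upper tail of the reward distribution. That dependence is one way to see why a tail estimate can be informative about scaling behavior.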

#57 Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Authors: Navdeep Kumar, Tehila Dahan, Lior Cohen, Ananyabrata Barua, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01505

Abstract:
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(\epsilon^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer containing a small fraction of recent samples and sample uniformly from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
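The generic STORM recursion can be sketched as follows. It corrects the previous estimate using the gradient at the previous iterate evaluated on the same fresh sample: $d_t = g(x_t; \xi_t) + (1 - a)\,(d_{t-1} - g(x_{t-1}; \xi_t))$. This is a minimal sketch of that generic recursion only. How the paper applies it to critic updates and combines it with the sample buffer is not shown, and the function and argument names are assumptions.

```python
def storm_estimate(d_prev, grad_curr, grad_prev, a):
    """One STORM update of a gradient estimate.

    d_prev:    previous estimate d_{t-1}
    grad_curr: stochastic gradient at the current iterate on the fresh sample
    grad_prev: stochastic gradient at the previous iterate on the SAME fresh sample
    a:         momentum weight in (0, 1]; a = 1 recovers the plain stochastic gradient
    """
    return [gc + (1.0 - a) * (dp - gp)
            for gc, dp, gp in zip(grad_curr, d_prev, grad_prev)]

# Hypothetical numbers for a two-dimensional parameter.
d = storm_estimate(d_prev=[1.0, -1.0], grad_curr=[0.5, 0.5], grad_prev=[0.4, 0.7], a=0.1)
```

Evaluating both gradients on the same sample is what makes the correction term low-variance when consecutive iterates are close.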

#58 When Is Generalized Bayes Bayesian? A Decision-Theoretic Characterization of Loss-Based Updating

Authors: Kenichiro McAlinn, Kōsaku Takanashi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01573

Abstract:
Loss-based updating, including generalized Bayes, Gibbs, and quasi-posteriors, replaces likelihoods by a user-chosen loss and produces a posterior-like distribution via exponential tilt. We give a decision-theoretic characterization that separates belief posteriors -- conditional beliefs justified by the foundations of Savage and Anscombe-Aumann under a joint probability model -- from decision posteriors -- randomized decision rules justified by preferences over decision rules. We make explicit that a loss-based posterior coincides with ordinary Bayes if and only if the loss is, up to scale and a data-only term, negative log-likelihood. We then show that generalized marginal likelihood is not evidence for decision posteriors, and Bayes factors are not well-defined without additional structure. In the decision posterior regime, non-degenerate posteriors require nonlinear preferences over decision rules. Under sequential coherence and separability, these lead to an entropy-penalized variational representation yielding generalized Bayes as the optimal rule.

#59 Universal Redundancies in Time Series Foundation Models

Authors: Anthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai, William Gilpin

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01605

Abstract:
Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.

#60 Minimax optimal differentially private synthetic data for smooth queries

privacy, synthetic data

Authors: Rundong Ding, Yiyun He, Yizhe Zhu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01607

Abstract:
Differentially private synthetic data enables the sharing and analysis of sensitive datasets while providing rigorous privacy guarantees for individual contributors. A central challenge is to achieve strong utility guarantees for meaningful downstream analysis. Many existing methods ensure uniform accuracy over broad query classes, such as all Lipschitz functions, but this level of generality often leads to suboptimal rates for statistics of practical interest. Since many common data analysis queries exhibit smoothness beyond what worst-case Lipschitz bounds capture, we ask whether exploiting this additional structure can yield improved utility. We study the problem of generating $(\varepsilon,\delta)$-differentially private synthetic data from a dataset of size $n$ supported on the hypercube $[-1,1]^d$, with utility guarantees uniformly for all smooth queries having bounded derivatives up to order $k$. We propose a polynomial-time algorithm that achieves a minimax error rate of $n^{-\min \{1, \frac{k}{d}\}}$, up to a $\log(n)$ factor. This characterization uncovers a phase transition at $k=d$. Our results generalize the Chebyshev moment matching framework of Musco et al. (2025) and Wang et al. (2016), and strictly improve the error rates for $k$-smooth queries established by Wang et al. (2016). Moreover, we establish the first minimax lower bound for the utility of $(\varepsilon,\delta)$-differentially private synthetic data with respect to $k$-smooth queries, extending the Wasserstein lower bound for $\varepsilon$-differential privacy of Boedihardjo et al. (2024).

#61 The Effect of Mini-Batch Noise on the Implicit Bias of Adam

Authors: Matias D. Cattaneo, Boris Shigida

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01642

Abstract:
With limited high-quality data and growing compute, multi-epoch training is regaining importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the common "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
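To make the roles of $\beta_1$ and $\beta_2$ concrete, here is a minimal sketch of one standard Adam step for a scalar parameter, without the decoupled weight decay of AdamW. It shows only where the two memory hyperparameters enter the update. It does not reproduce the paper's implicit-bias analysis, and the function name and default values are assumptions.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment memory (beta1)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment memory (beta2)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from zero-initialized moments with a hypothetical mini-batch gradient of 2.0.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr`. Mini-batch noise in `grad` enters both moment estimates, which is where batch size interacts with the two memory parameters.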

#62 Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets

Authors: Taihei Oki, Shinsaku Sakaue

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01682

Abstract:
We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent's hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent's true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent's optimal values and those achieved by the learner's recommended actions. Prior work has established a regret bound of $O(d\log T)$, as well as a finite but exponentially large bound of $\exp(O(d\log d))$, where $d$ is the dimension of the optimization problem and $T$ is the time horizon, while a regret lower bound of $\Omega(d)$ is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in $d$ is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are M-convex -- a broad class that includes matroids -- a finite regret bound of $O(d\log d)$ is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to $C$ rounds. We obtain a regret bound of $O((C+1)d\log d)$ without prior knowledge of $C$, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.

#63 Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

Authors: M. Arashi, M. Amintoosi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01777

Abstract:
Stochastic gradient methods are central to large-scale learning, yet their analysis typically treats mini-batch gradients as unbiased estimators of the population gradient. In high-dimensional settings, however, classical results from statistical decision theory show that unbiased estimators are generally inadmissible under quadratic loss, suggesting that standard stochastic gradients may be suboptimal from a risk perspective. In this work, we formulate stochastic gradient computation as a high-dimensional estimation problem and introduce a decision-theoretic framework based on Stein-rule shrinkage. We construct a shrinkage gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging second-moment statistics commonly maintained by adaptive optimization methods. Under a Gaussian noise model and for dimension $p \geq 3$, we show that the proposed estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal in the classical decision-theoretic sense. We further demonstrate how this estimator can be incorporated into the Adam optimizer, yielding a practical algorithm with negligible additional computational cost. Empirical evaluations on CIFAR-10 and CIFAR-100, across multiple levels of label noise, show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that the gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled and effective approach to improving stochastic gradient estimation in modern deep learning.
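The classical idea behind Stein-rule shrinkage can be sketched with the positive-part James-Stein estimator, which contracts a noisy vector toward a target. This sketch treats the target and the noise variance as given, whereas the abstract describes a momentum-derived target and an online variance estimate. It illustrates the classical estimator, not the paper's, and all names are assumptions.

```python
def shrink_toward(g, target, noise_var):
    """Positive-part James-Stein shrinkage of vector g toward target.

    Assumes g ~ N(mu, noise_var * I) in dimension p >= 3.
    """
    p = len(g)
    if p < 3:
        raise ValueError("James-Stein shrinkage needs dimension p >= 3")
    diff = [gi - ti for gi, ti in zip(g, target)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        return list(target)
    # Shrink more when the deviation from the target is small relative to the noise.
    factor = max(0.0, 1.0 - (p - 2) * noise_var / sq_norm)
    return [ti + factor * d for ti, d in zip(target, diff)]

# Hypothetical mini-batch gradient shrunk toward a zero target with unit noise variance.
shrunk = shrink_toward([1.0, 2.0, 2.0, 0.0], target=[0.0, 0.0, 0.0, 0.0], noise_var=1.0)
```

Here the squared deviation is 9 and $p = 4$, so the shrinkage factor is $1 - 2/9 = 7/9$. The positive-part clamp stops the estimator from overshooting past the target when the noise dominates.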

#64 Learning Sequential Decisions from Multiple Sources via Group-Robust Markov Decision Processes

Authors: Mingyuan Xu, Zongqi Xia, Tianxi Cai, Doudou Zhou, Nian Si

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01825

Abstract:
We often collect data from multiple sites (e.g., hospitals) that share common structure but also exhibit heterogeneity. This paper aims to learn robust sequential decision-making policies from such offline, multi-site datasets. To model cross-site uncertainty, we study distributionally robust MDPs with a group-linear structure: all sites share a common feature map, and both the transition kernels and expected reward functions are linear in these shared features. We introduce feature-wise (d-rectangular) uncertainty sets, which preserve tractable robust Bellman recursions while maintaining key cross-site structure. Building on this, we then develop an offline algorithm based on pessimistic value iteration that includes: (i) per-site ridge regression for Bellman targets, (ii) feature-wise worst-case (row-wise minimization) aggregation, and (iii) a data-dependent pessimism penalty computed from the diagonals of the inverse design matrices. We further propose a cluster-level extension that pools similar sites to improve sample efficiency, guided by prior knowledge of site similarity. Under a robust partial coverage assumption, we prove a suboptimality bound for the resulting policy. Overall, our framework addresses multi-site learning with heterogeneous data sources and provides a principled approach to robust planning without relying on strong state-action rectangularity assumptions.

#65 Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

Authors: Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, Chengchun Shi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01853

Abstract:
A/B testing has become a gold standard for policy evaluation at modern technology companies. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.

#66 Observation-dependent Bayesian active learning via input-warped Gaussian processes

Authors: Sanna Jarl, Maria Bånkestad, Jonathan J. S. Scragg, Jens Sjölund

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01898

Abstract:
Bayesian active learning relies on the precise quantification of predictive uncertainty to explore unknown function landscapes. While Gaussian process surrogates are the standard for such tasks, an underappreciated fact is that their posterior variance depends on the observed outputs only through the hyperparameters, rendering exploration largely insensitive to the actual measurements. We propose to inject observation-dependent feedback by warping the input space with a learned, monotone reparameterization. This mechanism allows the design policy to expand or compress regions of the input space in response to observed variability, thereby shaping the behavior of variance-based acquisition functions. We demonstrate that while such warps can be trained via marginal likelihood, a novel self-supervised objective yields substantially better performance. Our approach improves sample efficiency across a range of active learning benchmarks, particularly in regimes where non-stationarity challenges traditional methods.

#67 Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

Authors: Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01903

Abstract:
This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

#68 Probabilistic function-on-function nonlinear autoregressive model for emulation and reliability analysis of dynamical systems

Authors: Zhouzhou Song, Marcos A. Valdebenito, Styfen Schär, Stefano Marelli, Bruno Sudret, Matthias G. R. Faes

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01929

Abstract:
Constructing accurate and computationally efficient surrogate models (or emulators) for predicting dynamical system responses is critical in many engineering domains, yet remains challenging due to the strongly nonlinear and high-dimensional mapping from external excitations and system parameters to system responses. This work introduces a novel Function-on-Function Nonlinear AutoRegressive model with eXogenous inputs (F2NARX), which reformulates the conventional NARX model from a function-on-function regression perspective, inspired by the recently proposed $\mathcal{F}$-NARX method. The proposed framework substantially improves predictive efficiency while maintaining high accuracy. By combining principal component analysis with Gaussian process regression, F2NARX further enables probabilistic predictions of dynamical responses via the unscented transform in an autoregressive manner. The effectiveness of the method is demonstrated through case studies of varying complexity. Results show that F2NARX outperforms the state-of-the-art NARX model by orders of magnitude in efficiency while generally achieving higher accuracy. Moreover, its probabilistic prediction capabilities facilitate active learning, enabling accurate estimation of first-passage failure probabilities of dynamical systems using only a small number of training time histories.

#69 Deep Multivariate Models with Parametric Conditionals

Authors: Dmitrij Schlesinger, Boris Flach, Alexander Shekhovtsov

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.01953

Abstract:
We consider deep multivariate models for heterogeneous collections of random variables. In the context of computer vision, such collections may e.g. consist of images, segmentations, image attributes, and latent variables. When developing such models, most existing works start from an application task and design the model components and their dependencies to meet the needs of the chosen task. This has the disadvantage of limiting the applicability of the resulting model for other downstream tasks. Here, instead, we propose to represent the joint probability distribution by means of conditional probability distributions for each group of variables conditioned on the rest. Such models can then be used for practically any possible downstream task. Their learning can be approached as training a parametrised Markov chain kernel by maximising the data likelihood of its limiting distribution. This has the additional advantage of allowing a wide range of semi-supervised learning scenarios.

#70 SNAP: A Self-Consistent Agreement Principle with Application to Robust Computation

Authors: Xiaoyi Jiang, Andreas Nienkötter

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02013

Abstract:
We introduce SNAP (Self-coNsistent Agreement Principle), a self-supervised framework for robust computation based on mutual agreement. Based on an Agreement-Reliability Hypothesis, SNAP assigns weights that quantify agreement, emphasizing trustworthy items and downweighting outliers without supervision or prior knowledge. A key result is the Exponential Suppression of Outlier Weights, ensuring that outliers contribute negligibly to computations, even in high-dimensional settings. We study properties of the SNAP weighting scheme and show its practical benefits on vector averaging and subspace estimation. In particular, we demonstrate that non-iterative SNAP outperforms the iterative Weiszfeld algorithm and two variants of multivariate median of means. SNAP thus provides a flexible, easy-to-use, broadly applicable approach to robust computation.

#71 Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks

Authors: Duc Hoang, Aarush Gupta, Philip Harris

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02056

Abstract:
Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.

#72 Handling Covariate Mismatch in Federated Linear Prediction

Authors: Alexis Ayme, Rémi Khellaf

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02083

Abstract:
Federated learning enables institutions to train predictive models collaboratively without sharing raw data, addressing privacy and regulatory constraints. In the standard horizontal setting, clients hold disjoint cohorts of individuals and collaborate to learn a shared predictor. Most existing methods, however, assume that all clients measure the same features. We study the more realistic setting of covariate mismatch, where each client observes a different subset of features, which typically arises in multicenter collaborations with no prior agreement on data collection. We formalize learning a linear prediction under client-wise MCAR patterns and develop two modular approaches tailored to the dimensional regime and communication budget. In the low-dimensional setting, we propose a plug-in estimator that approximates the oracle linear predictor by aggregating sufficient statistics to estimate the covariance and cross-moment terms. In higher dimensions, we study an impute-then-regress strategy: (i) impute missing covariates using any exchangeability-preserving imputation procedure, and (ii) fit a ridge-regularized linear model on the completed data. We provide asymptotic and finite-sample learning rates for our predictors, explicitly characterizing their behaviour with the global dimension, the client-specific feature partition, and the distribution of samples across sites.

#73 Efficient Swap Regret Minimization in Combinatorial Bandits

Authors: Andreas Kontogiannis, Vasilis Pollatos, Panayotis Mertikopoulos, Ioannis Panageas

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02087

Abstract:
This paper addresses the problem of designing efficient no-swap regret algorithms for combinatorial bandits, where the number of actions $N$ is exponentially large in the dimensionality of the problem. In this setting, designing efficient no-swap regret translates to sublinear -- in horizon $T$ -- swap regret with polylogarithmic dependence on $N$. In contrast to the weaker notion of external regret minimization - a problem which is fairly well understood in the literature - achieving no-swap regret with a polylogarithmic dependence on $N$ has remained elusive in combinatorial bandits. Our paper resolves this challenge, by introducing a no-swap-regret learning algorithm with regret that scales polylogarithmically in $N$ and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently -- that is, with a per-iteration complexity that also scales polylogarithmically in $N$ -- across a wide range of well-studied applications.

#74 Spectral Superposition: A Theory of Feature Geometry

Authors: Georgi Ivanov, Narmeen Oozeer, Shivam Raval, Tasana Pejovic, Shriyash Upadhyay, Amir Abdullah

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02224

Abstract:
Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structure of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight-derived matrices. In particular, we introduce the frame operator $F = WW^\top$, which gives us a spectral measure that describes how each feature allocates norm across eigenspaces. While previous tools could describe the pairwise interactions between features, spectral methods capture the global geometry ("how do all features interact?"). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes, classifying all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.

#75 Causal Inference for Preprocessed Outcomes with an Application to Functional Connectivity

Authors: Zihang Wang, Razieh Nabi, Benjamin B. Risk

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02240

Abstract:
In biomedical research, repeated measurements within each subject are often processed to remove artifacts and unwanted sources of variation. The resulting data are used to construct derived outcomes that act as proxies for scientific outcomes that are not directly observable. Although intra-subject processing is widely used, its impact on inter-subject statistical inference has not been systematically studied, and a principled framework for causal analysis in this setting is lacking. In this article, we propose a semiparametric framework for causal inference with derived outcomes obtained after intra-subject processing. This framework applies to settings with a modular structure, where intra-subject analyses are conducted independently across subjects and are followed by inter-subject analyses based on parameters from the intra-subject stage. We develop multiply robust estimators of causal parameters under rate conditions on both intra-subject and inter-subject models, which allows the use of flexible machine learning. We specialize the framework to a mediation setting and focus on the natural direct effect. For high dimensional inference, we employ a step-down procedure that controls the exceedance rate of the false discovery proportion. Simulation studies demonstrate the superior performance of the proposed approach. We apply our method to estimate the impact of stimulant medication on brain connectivity in children with autism spectrum disorder.

#76 Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

Authors: Owen Shen, Patrick Jaillet

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02283

Abstract:
We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm-Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4-2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
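
The idea of a model-imputed learning target can be sketched as a single tabular Q-learning update. This is an illustrative sketch only, not the authors' implementation: the function names, the form of the imputed term (expected cancellation loss equal to cancellation probability times price), and the state and action labels are all hypothetical.

```python
def imputed_delayed_value(price, p_cancel):
    """Hypothetical imputation: expected revenue lost to a later cancellation.

    In the abstract's setting p_cancel would come from a calibrated choice
    model; here it is simply passed in as a number (assumption).
    """
    return -p_cancel * price


def q_update(Q, s, a, r_now, s_next, delayed, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step whose target includes an imputed term.

    Q is a dict of dicts: Q[state][action] -> value.
    r_now is the reward observed at decision time; `delayed` stands in for
    the part of the reward that would only be observed days later.
    """
    target = r_now + delayed + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


# Toy usage with two states and two price actions (all values made up).
Q = {0: {"low": 0.0, "high": 0.0}, 1: {"low": 0.0, "high": 0.0}}
delayed = imputed_delayed_value(price=100.0, p_cancel=0.2)
q_update(Q, s=0, a="high", r_now=100.0, s_next=1, delayed=delayed,
         alpha=0.5, gamma=0.9)
```

Because the imputed term replaces a quantity that is not yet observed, any error in the model feeds directly into the target, which is consistent with the abstract's bound depending on the partial-model error $\varepsilon$.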

#77 C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference

Authors: Jing Wang, Jie Shen, Qiaomin Xie, Jeremy C Weiss

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02371

Abstract:
Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions, such as comorbidities and long COVID recovery, and to optimizing clinical decision-making. We introduce C-kNN-LSH, a nearest-neighbor framework for sequential causal inference designed to handle such high-dimensional, confounded situations. By utilizing locality-sensitive hashing, we efficiently identify "clinical twins" with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate the neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees our estimator is consistent and second-order robust to nuisance error. Evaluated on a real-world Long COVID cohort with 13,511 participants, C-kNN-LSH demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to existing baselines.

#78 Maximizing Reliability with Bayesian Optimization

Authors: Jack M. Buckingham, Ivo Couckuyt, Juergen Branke

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02432

Abstract:
Bayesian optimization (BO) is a popular, sample-efficient technique for expensive, black-box optimization. One such problem arising in manufacturing is that of maximizing the reliability, or equivalently minimizing the probability of a failure, of a design which is subject to random perturbations - a problem that can involve extremely rare failures ($P_\mathrm{fail} = 10^{-6}-10^{-8}$). In this work, we propose two BO methods based on Thompson sampling and knowledge gradient, the latter approximating the one-step Bayes-optimal policy for minimizing the logarithm of the failure probability. Both methods incorporate importance sampling to target extremely small failure probabilities. Empirical results show the proposed methods outperform existing methods in both extreme and non-extreme regimes.
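
To see why importance sampling matters at failure probabilities around $10^{-6}$, consider a toy problem that is not from the paper: the "failure" event is a standard normal variable exceeding a threshold. Plain Monte Carlo with 20,000 draws would almost never observe a failure, whereas sampling from a proposal centred on the threshold and reweighting gives a usable estimate. The function name and the choice of a mean-shifted proposal are assumptions made for illustration.

```python
import math
import random


def failure_prob_is(threshold, n_samples=20000, seed=0):
    """Estimate P(Z > threshold), Z ~ N(0, 1), by importance sampling.

    Proposal: N(threshold, 1). Each sample in the failure region is
    weighted by the density ratio phi(z) / phi(z - threshold), which
    simplifies to exp(-threshold * z + threshold**2 / 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(threshold, 1.0)
        if z > threshold:
            total += math.exp(-threshold * z + 0.5 * threshold ** 2)
    return total / n_samples


threshold = 4.75
estimate = failure_prob_is(threshold)
# Closed-form tail probability for comparison (roughly 1e-6).
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
print(estimate, exact)
```

About half of the proposal draws land in the failure region, so the estimator's relative error stays small even though the event itself is extremely rare.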

#79 New explanations and inference for least angle regression

Authors: Karl B. Gregory, Daniel J. Nordman

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2602.02491

Abstract:
Efron et al. (2004) introduced least angle regression (LAR) as an algorithm for linear predictions, intended as an alternative to forward selection with connections to penalized regression. However, LAR has remained somewhat of a "black box," where some basic behavioral properties of LAR output are not well understood, including an appropriate termination point for the algorithm. We provide a novel framework for inference with LAR, which also allows LAR to be understood from new perspectives with several newly developed mathematical properties. The LAR algorithm at a data level can be viewed as estimating a population counterpart "path" that organizes a response mean along regressor variables which are ordered according to a decreasing series of population "correlation" parameters; such parameters are shown to have meaningful interpretations for explaining variable contributions whereby zero correlations denote unimportant variables. In the output of LAR, estimates of all non-zero population correlations turn out to have independent normal distributions for use in inference, while estimates of zero-valued population correlations have a certain non-normal joint distribution. These properties help to provide a formal rule for stopping the LAR algorithm. While the standard bootstrap for regression can fail for LAR, a modified bootstrap provides a practical and formally justified tool for interpreting the entrance of variables and quantifying uncertainty in estimation. The LAR inference method is studied through simulation and illustrated with data examples.

#80 VC Theory for Inventory Policies

Authors: Yaqi Xie, Will Ma, Linwei Xin

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2404.11509

Abstract:
There has been growing interest in applying reinforcement learning (RL) to inventory management, either by optimizing over temporal transitions or by learning directly from full historical demand trajectories. This contrasts sharply with classical data-driven approaches, which first estimate demand distributions from past data and then compute well-structured optimal policies via dynamic programming. This paper considers a hybrid approach that combines trajectory-based RL with policy regularization imposing base-stock and $(s, S)$ structures. We provide generalization guarantees for this combined approach for several well-known classes in a $T$-period dynamic inventory model, using tools from the celebrated Vapnik-Chervonenkis (VC) theory, such as the Pseudo-dimension and Fat-shattering dimension. Our results have implications for regret against the best-in-class policies, and allow for an arbitrary distribution over demand sequences, which makes no assumptions such as independence across time. Surprisingly, we prove that the class of policies defined by $T$ non-stationary base-stock levels exhibits a generalization error that does not grow with $T$, whereas the two-parameter $(s, S)$ policy class has a generalization error growing logarithmically with $T$. Overall, our analysis leverages specific inventory structures within the learning theory framework, and improves sample complexity guarantees even compared to existing results assuming independent demands.
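
For readers unfamiliar with the two policy classes, a minimal sketch follows. The modelling details are assumptions chosen for simplicity and are not taken from the paper: zero lead time, backlogged unmet demand, a reorder trigger of "inventory at or below $s$" (conventions differ), and hypothetical holding and backlog cost rates.

```python
def base_stock_order(inventory, level):
    """Base-stock policy: order up to `level` whenever inventory is below it."""
    return max(0, level - inventory)


def s_S_order(inventory, s, S):
    """(s, S) policy: if inventory is at or below s, order up to S; else nothing."""
    return S - inventory if inventory <= s else 0


def simulate_base_stock(levels, demands, start=0, holding=1.0, backlog=2.0):
    """Total cost of T non-stationary base-stock levels on one demand trajectory."""
    inventory, cost = start, 0.0
    for level, demand in zip(levels, demands):
        inventory += base_stock_order(inventory, level)  # order arrives at once
        inventory -= demand                              # may go negative (backlog)
        cost += holding * max(inventory, 0) + backlog * max(-inventory, 0)
    return cost


print(simulate_base_stock(levels=[5, 5], demands=[3, 7]))
```

A base-stock class over $T$ periods has one parameter per period (the list `levels`), while the $(s, S)$ class has two parameters in total. Averaging `simulate_base_stock` over historical demand trajectories gives the empirical objective whose generalization the abstract studies.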

#81 Joint Bayesian Parameter and Model Order Estimation for Low-Rank Probability Mass Tensors

Authors: Joseph K. Chege, Arie Yeredor, Martin Haardt

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2410.06329

Abstract:
Obtaining a reliable estimate of the joint probability mass function (PMF) of a set of random variables from observed data is a significant objective in statistical signal processing and machine learning. Modelling the joint PMF as a tensor that admits a low-rank canonical polyadic decomposition (CPD) has enabled the development of efficient PMF estimation algorithms. However, these algorithms require the rank (model order) of the tensor to be specified beforehand. In real-world applications, the true rank is unknown. Therefore, an appropriate rank is usually selected from a candidate set either by observing validation errors or by computing various likelihood-based information criteria, a procedure that could be costly in terms of computational time or hardware resources, or could result in mismatched models which affect the model accuracy. This paper presents a novel Bayesian framework for estimating the low-rank components of a joint PMF tensor and simultaneously inferring its rank from the observed data. We specify a Bayesian PMF estimation model and employ appropriate prior distributions for the model parameters, allowing the rank to be inferred without cross-validation. We then derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of various model parameters. Numerical experiments involving both synthetic data and real classification and item recommendation data illustrate the advantages of our VI-based method in terms of estimation accuracy, automatic rank detection, and computational efficiency.

#82 Graph Max Shift: A Hill-Climbing Method for Graph Clustering

Authors: Ery Arias-Castro, Elizabeth Coda, Wanli Qiao

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2411.18794

Abstract:
We present a method for graph clustering that is analogous to gradient ascent methods previously proposed for clustering points in space. The algorithm, which can be viewed as a max-degree hill-climbing procedure on the graph, iteratively moves each node to a neighboring node of highest degree. We show that, when applied to a random geometric graph whose nodes correspond to data drawn i.i.d. from a density with Morse regularity, the method is asymptotically consistent. Here, consistency is in the sense of Fukunaga and Hostetler, meaning, with respect to the partition of the support of the density defined by the basins of attraction of the density gradient flow.
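
The hill-climbing step described above can be sketched in a few lines. This is an illustration under assumptions the abstract does not state: nodes are integers, a node may stay where it is (its own degree competes with its neighbors'), and degree ties go to the smaller node id. The function name `graph_max_shift` is hypothetical.

```python
def graph_max_shift(adj):
    """Cluster an undirected graph by max-degree hill climbing.

    adj maps each integer node to the set of its neighbors.
    Returns a dict mapping each node to the node where its climb stops;
    nodes sharing an endpoint form one cluster.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    def step(v):
        # Candidates: v itself plus its neighbors. Highest degree wins;
        # ties go to the smaller id (assumption, ensures termination).
        return max([v, *adj[v]], key=lambda u: (degree[u], -u))

    labels = {}
    for v in adj:
        current = v
        while True:
            nxt = step(current)
            if nxt == current:
                break
            current = nxt
        labels[v] = current
    return labels


# Two star graphs (centres 0 and 4) joined by the edge 3-5.
adj = {
    0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 5},
    4: {5, 6, 7}, 5: {3, 4}, 6: {4}, 7: {4},
}
print(graph_max_shift(adj))
```

Each move strictly increases the pair (degree, negative id), so every climb stops after finitely many steps at a local degree maximum, which plays the role of a density mode.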

#83 Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

Authors: Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, Paul-Christian Bürkner

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2501.13483

Abstract:
Amortized Bayesian inference (ABI) with neural networks can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, ABI is not yet sufficiently robust for widespread and safe application. When performing inference on observations outside the scope of the simulated training data, posterior approximations are likely to become highly biased, which cannot be corrected by additional simulations due to the bad pre-asymptotic behavior of current neural posterior estimators. In this paper, we propose a semi-supervised approach that enables training not only on labeled simulated data generated from the model, but also on unlabeled data originating from any source, including real data. To achieve this, we leverage Bayesian self-consistency properties that can be transformed into strictly proper losses that do not require knowledge of ground-truth parameters. We test our approach on several real-world case studies, including applications to high-dimensional time-series and image data. Our results show that semi-supervised learning with unlabeled data drastically improves the robustness of ABI in the out-of-simulation regime. Notably, inference remains accurate even when evaluated on observations far away from the labeled and unlabeled data seen during training.

#84 DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

Authors: Shu Tamano

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.00961

Abstract:
Off-policy evaluation and learning in contextual bandits use logged interaction data to estimate and optimize the value of a target policy. Most existing methods require sufficient action overlap between the logging and target policies, and violations can bias value and policy gradient estimates. To address this issue, we propose DOLCE (Decomposing Off-policy evaluation/learning into Lagged and Current Effects), which uses only lagged contexts already stored in bandit logs to construct lag-marginalized importance weights and to decompose the objective into a support-robust lagged correction term and a current, model-based term, yielding bias cancellation when the reward-model residual is conditionally mean-zero given the lagged context and action. With multiple candidate lags, DOLCE softly aggregates lag-specific estimates, and we introduce a moment-based training procedure that promotes the desired invariance using only logged lag-augmented data. We show that DOLCE is unbiased in an idealized setting and yields consistent and asymptotically normal estimates with cross-fitting under standard conditions. Our experiments demonstrate that DOLCE achieves substantial improvements in both off-policy evaluation and learning, particularly as the proportion of individuals who violate support increases.

#85 Transportability without Graphs: A Bayesian Approach to Identifying s-Admissible Backdoor Sets

Authors: Konstantina Lelova, Gregory F. Cooper, Sofia Triantafillou

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.12801

Abstract:
Transporting causal information across populations is a critical challenge in clinical decision-making. Causal modeling provides criteria for identifiability and transportability, but these require knowledge of the causal graph, which rarely holds in practice. We propose a Bayesian method that combines observational data from the target domain with experimental data from a different domain to identify s-admissible backdoor sets, which enable unbiased estimation of causal effects across populations, without requiring the causal graph. We prove that if such a set exists, we can always find one within the Markov boundary of the outcome, narrowing the search space, and we establish asymptotic convergence guarantees for our method. We develop a greedy algorithm that reframes transportability as a feature selection problem, selecting conditioning sets that maximize the marginal likelihood of experimental data given observational data. In simulated and semi-synthetic data, our method correctly identifies transportability bias, improves causal effect estimation, and performs favorably against alternatives.

#86 The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

Authors: Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.17958

Abstract:
We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.
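
The link between $\ell_2$ regularization and the nuclear norm can be seen in one line. The parametrization below is an assumption for illustration (the abstract does not spell it out): a network $f_W(x) = \sum_k (w_k^\top x)^2$ with first-layer weight matrix $W$ whose rows are $w_k^\top$ and with second-layer weights fixed to one. Writing $S = W^\top W \succeq 0$,

$$f_W(x) = x^\top S x, \qquad \lambda \lVert W \rVert_F^2 = \lambda \operatorname{tr}(W^\top W) = \lambda \operatorname{tr}(S) = \lambda \lVert S \rVert_*,$$

since the nuclear norm of a positive semidefinite matrix equals its trace. When the width is at least the input dimension, every positive semidefinite $S$ can be written as $W^\top W$, so minimizing over $W$ is equivalent to minimizing over $S \succeq 0$ with a nuclear norm penalty. The predictor is linear in $S$, so the problem over $S$ is convex whenever the loss is convex in the prediction.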

#87 On Theoretical Identifiability of Discrete Latent Causal Graphical Models

Authors: Seunghyun Lee, Yuqi Gu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.18410

Abstract:
This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, it is common for all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data.

#88 Safely Learning Controlled Stochastic Dynamics

Authors: Luc Brogat-Motte, Alessandro Rudi, Riccardo Bonalli

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.02754

Abstract:
We address the problem of safely learning controlled stochastic dynamics from discrete-time trajectory observations, ensuring system trajectories remain within predefined safe regions during both training and deployment. Safety-critical constraints of this kind are crucial in applications such as autonomous robotics, finance, and biomedicine. We introduce a method that ensures safe exploration and efficient estimation of system dynamics by iteratively expanding an initial known safe control set using kernel-based confidence bounds. After training, the learned model enables predictions of the system's dynamics and permits safety verification of any given control. Our approach requires only mild smoothness assumptions and access to an initial safe control set, enabling broad applicability to complex real-world systems. We provide theoretical guarantees for safety and derive adaptive learning rates that improve with increasing Sobolev regularity of the true dynamics. Experimental evaluations demonstrate the practical effectiveness of our method in terms of safety, estimation accuracy, and computational efficiency.

#89 When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values

Authors: Christophe Muller (LPSM), Erwan Scornet (LPSM), Julie Josse (PREMEDICAL)

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2507.13024

Abstract:
Predicting with missing inputs challenges even parametric models, as parameter estimation alone is insufficient for prediction on incomplete data. While several works study prediction in linear models, we focus on logistic models, where optimal predictors lack closed-form expressions. We prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities under a Gaussian Pattern Mixture Model (GPMM). Crucially, this result holds across standard missing data scenarios (MCAR and MAR) and, notably, in Missing Not at Random (MNAR) settings where standard methods often fail. Empirically, we compare PbP against imputation and EM methods across classification, probability estimation, calibration, and inference. Our analysis provides a comprehensive view of logistic regression with missing values. It reveals that mean imputation can be used as a baseline for low sample sizes and PbP for large sample sizes, as both methods are fast to train and can perform well in some settings. The best performance is achieved by non-linear multiple iterative imputation techniques that include the response label (Random Forest MICE with response), which are more computationally expensive.
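
The Pattern-by-Pattern idea, one logistic model per missingness pattern, can be sketched as follows. This is a toy illustration, not the authors' code: missing entries are encoded as `None`, each model is fit by plain gradient descent with hypothetical step size and epoch count, and patterns unseen during training are not handled.

```python
import math
from collections import defaultdict


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Gradient-descent logistic regression; returns [bias, w_1, ..., w_d]."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += p - yi
            for j, xj in enumerate(xi):
                grad[j + 1] += (p - yi) * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def fit_pbp(rows, labels):
    """Group rows by missingness pattern and fit one model per group."""
    groups = defaultdict(lambda: ([], []))
    for row, yi in zip(rows, labels):
        pattern = tuple(v is None for v in row)
        groups[pattern][0].append([v for v in row if v is not None])
        groups[pattern][1].append(yi)
    return {pat: fit_logistic(X, y) for pat, (X, y) in groups.items()}


def predict_pbp(models, row):
    """Predicted probability from the model matching this row's pattern."""
    w = models[tuple(v is None for v in row)]  # KeyError if pattern unseen
    observed = [v for v in row if v is not None]
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], observed))
    return 1.0 / (1.0 + math.exp(-z))


# Toy data in which the label's relation to the first feature flips
# depending on whether the second feature is missing.
rows = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (2.0, 0.0),
        (-1.0, None), (1.0, None)]
labels = [0, 0, 1, 1, 1, 0]
models = fit_pbp(rows, labels)
print(predict_pbp(models, (2.0, 0.0)), predict_pbp(models, (2.0, None)))
```

The toy data make the point of the MNAR discussion: the missingness pattern itself carries information about the label, which a single model fit on imputed data could miss. The cost is that each pattern only uses its own samples, in line with the abstract's remark that PbP suits large sample sizes.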

#90 Multivariate Standardized Residuals for Conformal Prediction

Authors: Sacha Braun, Eugène Berta, Michael I. Jordan, Francis Bach

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2507.20941

Abstract:
While split conformal prediction guarantees marginal coverage, approaching the stronger property of conditional coverage is essential for reliable uncertainty quantification. Naive conformal scores, however, suffer from poor conditional coverage in heteroskedastic settings. In univariate regression, this is commonly addressed by normalizing nonconformity scores using estimated local score variance. In this work, we propose a natural extension of this normalization to the multivariate setting, effectively whitening the residuals to decouple output correlations and standardize local variance. We demonstrate that using the Mahalanobis distance induced by a learned local covariance as a nonconformity score provides a closed-form, computationally efficient mechanism for capturing inter-output correlations and heteroskedasticity, avoiding the expensive sampling required by previous methods based on cumulative distribution functions. This structure unlocks several practical extensions, including the handling of missing output values, the refinement of conformal sets when partial information is revealed, and the construction of valid conformal sets for transformations of the output. Finally, we provide extensive empirical evidence on both synthetic and real-world datasets showing that our approach yields conformal sets that significantly improve upon the conditional coverage of existing multivariate baselines.
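
A minimal sketch of a Mahalanobis nonconformity score combined with the usual split-conformal quantile is given below. It is an illustration under assumptions, not the paper's implementation: the local covariance is passed in as a plain matrix (the paper learns it), whitening is done with a hand-written Cholesky factorization, and all function names are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def mahalanobis_score(residual, cov):
    """sqrt(r^T cov^{-1} r), computed by whitening: solve L z = r, return ||z||."""
    L = cholesky(cov)
    z = []
    for i, r in enumerate(residual):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((r - s) / L[i][i])
    return math.sqrt(sum(v * v for v in z))


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


# Residual y - f(x) for one calibration point and a correlated covariance.
print(mahalanobis_score([1.0, 1.0], [[2.0, 1.0], [1.0, 2.0]]))
```

With calibration scores computed this way, the prediction set at a new input is the ellipsoid of outputs whose score is at most the threshold, so its shape follows the local covariance while its size is set by calibration.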

#91 Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws

Authors: Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, Lenka Zdeborová

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.24914

Abstract:
Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.

#92 Test time training enhances in-context learning of nonlinear functions

Authors: Kento Kuwataka, Taiji Suzuki

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.25741

Abstract:
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction to adapt to the test data. While TTT has demonstrated considerable empirical success, its theoretical underpinnings remain limited, particularly for nonlinear models. In this paper, we investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time. We analyze this framework in the setting of single-index models $y=\sigma_*(\langle \beta, \mathbf{x} \rangle)$, where the feature vector $\beta$ is drawn from a hidden low-dimensional subspace. For single-layer transformers trained with gradient-based algorithms and adopting TTT, we establish an upper bound on the prediction risk. Our theory reveals that TTT enables the single-layer transformers to adapt to both the feature vector $\beta$ and the link function $\sigma_*$, which vary across tasks. This creates a sharp contrast with ICL alone, which is theoretically difficult to adapt to shifts in the link function. Moreover, we provide the convergence rate with respect to the data length, showing the predictive error can be driven arbitrarily close to the noise level as the context size and the network width grow.

#93 BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields

Authors: Rui-Yang Zhang, Henry B. Moss, Lachlan Astfalck, Edward Cripps, David S. Leslie

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.26005

Abstract:
We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed spatio-temporal Gaussian process surrogate model. The majority of existing placement campaigns either follow standard "space-filling" designs or rely on relatively ad hoc expert opinions. A key challenge to applying principled active learning in this setting is that Lagrangian observers are continuously advected through the vector field, so they make measurements at different locations and times. It is, therefore, important to consider the likely future trajectories of placed observers to account for the utility of candidate placement locations. To this end, we present BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. We observe noticeable benefits of BALLAST-aided sequential observer placement strategies on both synthetic and high-fidelity ocean current models. In addition, we develop a novel GP inference method -- the Vanilla SPDE Exchange (VaSE) -- to boost the GP posterior sampling efficiency, which is also of independent interest.

#94 A Proof of Learning Rate Transfer under $\mu$P

Authors: Soufiane Hayou

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.01734

Abstract:
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network parametrization designed to "maximize" feature learning in the infinite-width limit. We show that under $\mu$P, the optimal learning rate converges to a non-zero constant as width goes to infinity, providing a theoretical explanation for learning rate transfer. In contrast, we show that this property fails to hold under alternative parametrizations such as Standard Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide intuitive proofs and support the theoretical findings with extensive empirical results.

#95 A Diffusive Classification Loss for Learning Energy-based Generative Models

Authors: RuiKang OuYang, Louis Grenioux, José Miguel Hernández-Lobato

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.21025

Abstract:
Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.

#96 The Function Representation of Artificial Neural Network

Authors: Zhongkui Ma

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/1908.10493

Abstract:
This paper expresses the structure of an artificial neural network (ANN) in functional form, using the activation integral concept derived from the activation function. In this way, the structure of an ANN can be represented by a simple function, and it becomes possible to find mathematical solutions of the ANN. Thus, current ANNs can be placed in a more reasonable framework. Perhaps all questions about ANNs will be eliminated.

#97 Individual Regret in Cooperative Stochastic Multi-Armed Bandits

Authors: Idan Barnea, Tal Lancewicki, Yishay Mansour

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2411.06501

Abstract:
We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, Coop-SE, and show an individual regret bound of $O(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single-agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. When considering communication networks there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of $O(\mathcal{R} / m + A \log T)$.
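
To make the quantities in this bound concrete, the following sketch evaluates the single-agent term $\mathcal{R}$ and the three terms of the stated individual regret bound for given gaps. It drops all constants hidden by the $O(\cdot)$ notation, so the numbers only indicate how the terms scale; the function names and the example gaps are hypothetical.

```python
import math

def single_agent_regret(gaps, horizon):
    """R = sum over suboptimal actions (gap > 0) of log(T) / Delta_i."""
    return sum(math.log(horizon) / g for g in gaps if g > 0)

def individual_regret_terms(gaps, horizon, num_agents):
    """R/m + A^2 + A*sqrt(log T), with the O(.) constants dropped."""
    num_actions = len(gaps)
    r = single_agent_regret(gaps, horizon)
    return r / num_agents + num_actions**2 + num_actions * math.sqrt(math.log(horizon))

# Example gaps (an assumption): one optimal action and three suboptimal ones.
gaps = [0.0, 0.1, 0.2, 0.5]
alone = individual_regret_terms(gaps, horizon=10**6, num_agents=1)
team = individual_regret_terms(gaps, horizon=10**6, num_agents=10)
```

Only the $\mathcal{R}/m$ term shrinks as agents are added; the $A^2 + A\sqrt{\log T}$ part does not depend on $m$ or on the communication graph.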

#98 AverageTime: Enhance Long-Term Time Series Forecasting with Simple Averaging

Authors: Gaoxiang Zhao, Chunmao Huang, Li Zhou, Xiaoqiang Wang

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2412.20727

Abstract:
Multivariate long-term time series forecasting aims to predict future sequences by utilizing historical observations, with a core focus on modeling intra-sequence and cross-channel dependencies. Numerous studies have developed diverse architectures to capture these patterns, achieving significant improvements in forecasting accuracy. Among them, iTransformer, a representative method for channel information extraction, leverages the Transformer architecture to model channel-wise dependencies, thereby facilitating sequence transformation for enhanced forecasting performance. Building upon iTransformer's channel extraction concept, we propose AverageTime, a simple, efficient, and scalable forecasting model. Beyond iTransformer, AverageTime retains the original sequence information and reframes channel extraction as a stackable and extensible architecture. This allows the model to generate multiple novel sequences through various structural mechanisms, rather than being limited to transforming the original input. Moreover, the newly extracted sequences are not restricted to channel processing; other techniques such as series decomposition can also be incorporated to enhance predictive accuracy. Additionally, we introduce a channel clustering technique into AverageTime, which substantially improves training and inference efficiency with negligible performance loss. Experiments on real-world datasets demonstrate that with only two straightforward averaging operations, applied to both the extracted sequences and the original series, AverageTime surpasses state-of-the-art models in forecasting performance while maintaining near-linear complexity. This work offers a new perspective on time series forecasting: enriching sequence information through extraction and fusion. The source code is available at https://github.com/UniqueoneZ/AverageTime.

#99 Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Authors: Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2502.04204

Abstract:
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis however is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend against "long-length" jailbreak attacks via efficient "short-length" AT. The code is available at https://github.com/fshp971/adv-icl.

#100 An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

Authors: Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2503.19859

Abstract:
The substantial computational demands of modern large-scale deep learning present significant challenges for efficient training and deployment. Recent research has revealed a widespread phenomenon wherein deep networks inherently learn low-rank structures in their weights and representations during training. This tutorial paper provides a comprehensive review of advances in exploiting these low-rank structures, bridging mathematical foundations with practical applications. We present two complementary theoretical perspectives on the emergence of low-rankness: viewing it through the optimization dynamics of gradient descent throughout training, and understanding it as a result of implicit regularization effects at convergence. Practically, these theoretical frameworks provide a foundation for understanding the success of techniques such as Low-Rank Adaptation (LoRA) in fine-tuning, inspire new parameter-efficient low-rank training strategies, and explain the effectiveness of masked training approaches like dropout and masked self-supervised learning.

#101 Scaling Gaussian Process Regression with Full Derivative Observations

Authors: Daniel Huang

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.09134

Abstract:
We present a scalable Gaussian Process (GP) method called DSoftKI that can fit and predict full derivative observations. It extends SoftKI, a method that approximates a kernel via softmax interpolation, to the setting with derivatives. DSoftKI enhances SoftKI's interpolation scheme by replacing its global temperature vector with local temperature vectors associated with each interpolation point. This modification allows the model to encode local directional sensitivity, enabling the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. Moreover, the interpolation scheme eliminates the need for kernel derivatives, facilitating extensions such as Deep Kernel Learning (DKL). We evaluate DSoftKI on synthetic benchmarks, a toy n-body physics simulation, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). Our results demonstrate that DSoftKI is accurate and scales to larger datasets with full derivative observations than previously possible.

#102 Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Authors: Zijun Chen, Shengbo Wang, Nian Si

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.10007

Abstract:
Motivated by practical applications where stable long-term performance is critical, such as robotics, operations research, and healthcare, we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

#103 On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.10860

Abstract:
Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

#104 Bayes optimal learning of attention-indexed models

Authors: Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.01582

Abstract:
We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, which are key components of modern architectures.

#105 Lions and Muons: Optimization via Stochastic Frank-Wolfe

Authors: Maria-Eleni Sfyraki, Jun-Kun Wang

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.04192

Abstract:
Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained considerable popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure in non-convex optimization for Frank-Wolfe methods. We further find that convergence to this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees that hold for general compact convex sets without the need for a large batch size, filling the gap in the literature on Stochastic Frank-Wolfe for non-convex optimization. Our contributions in the latter part of this work, in turn, yield new variants of Lion and Muon that better accommodate heavy-tailed gradient noise, thereby enhancing their practical scope.

#106 Identifiability of Deep Polynomial Neural Networks

Authors: Konstantin Usevich, Ricardo Borsoi, Clara Dérand, Marianne Clausel

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.17093

Abstract:
Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability -- a key property for ensuring interpretability -- remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly compared to the activation degrees. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. We also settle an open conjecture on the dimension of PNN's neurovarieties, and provide new bounds on the activation degrees required for it to reach the expected dimension.

#107 Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control

Authors: Jose Blanchet, Jiayi Cheng, Yuewei Ling, Hao Liu, Yang Liu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.19294

Abstract:
We study diffusion control problems under parameter uncertainty. Controllers based on plug-in estimation can be brittle due to potential distribution shifts. Bayesian control with a prior on the parameters offers a formulation with beliefs about such shifts. However, as with any Bayesian model, the prior may be misspecified. To mitigate misspecification and reduce over-pessimism compared to classical robust control approaches (e.g., Hansen and Sargent, 2008), we propose a distributionally robust Bayesian control (DRBC) formulation in which an adversary perturbs the prior within a divergence neighborhood of a baseline prior. We develop a strong duality result that reduces the distributionally robust prior evaluation to a low-dimensional optimization and yields a practical simulation-based policy evaluation and learning procedure with structured policy parameterizations. We validate the efficiency of the algorithm on a synthetic linear-quadratic control example and real-data portfolio selection.

#108 Bridging GANs and Bayesian Neural Networks via Partial Stochasticity

Authors: Maurizio Filippone, Marius P. Linhard

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2507.00651

Abstract:
Generative Adversarial Networks (GANs) are popular and successful generative models. Despite their success, optimization is notoriously challenging. In this work, we explain the success and limitations of GANs by casting them as Bayesian neural networks with partial stochasticity. This interpretation allows us to establish conditions of universal approximation and to rewrite the adversarial-style optimization of several variants of GANs as the optimization of a proxy for the likelihood obtained by marginalizing out the stochastic variables. Following this interpretation, the need for regularization becomes apparent, and we propose to adopt strategies to smooth the loss landscape and methods to search for solutions with minimum description length, which are associated with flat minima and good generalization. Results obtained on a wide range of experiments indicate that these strategies lead to performance improvements and pave the way to a deeper understanding of GANs.

#109 Dense associative memory for Gaussian distributions

Authors: Chandan Tankala, Krishnakumar Balasubramanian

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.23162

Abstract:
Dense associative memories (DAMs) store and retrieve patterns via energy-function based fixed points, but existing models are limited to vector representations. We extend DAMs to Gaussian densities equipped with the 2-Wasserstein distance. Our framework defines a log-sum-exp energy over stored distributions and retrieval dynamics that aggregate optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity and provide quantitative retrieval guarantees under Wasserstein perturbations. We validate the method on synthetic and real-world image (CelebA and CIFAR-10 datasets) and text (text8 and NLI corpus) datasets. By generalizing from vectors to distributions, our work bridges classical DAMs with modern generative modeling and paves the way for distributional storage and retrieval in memory-augmented learning.
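
A one-dimensional toy version of the construction in this abstract can be sketched as follows. For 1D Gaussians given as (mean, std), the squared 2-Wasserstein distance is $(m_1 - m_2)^2 + (s_1 - s_2)^2$, and averaging the optimal transport maps with Gibbs weights moves the query to the weighted average of the stored means and standard deviations. The specific energy $E(\xi) = -\beta^{-1} \log \sum_i \exp(-\beta W_2^2(\xi, \mu_i))$, the update rule, and every name below are assumptions chosen for illustration; the paper's exact definitions may differ.

```python
import math

def w2_squared(g1, g2):
    """Squared 2-Wasserstein distance between 1D Gaussians given as (mean, std)."""
    return (g1[0] - g2[0]) ** 2 + (g1[1] - g2[1]) ** 2

def gibbs_weights(query, stored, beta):
    """Softmax of -beta * W2^2 over the stored Gaussians (numerically stabilised)."""
    logits = [-beta * w2_squared(query, g) for g in stored]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def energy(query, stored, beta):
    """Assumed log-sum-exp energy: -(1/beta) * log sum_i exp(-beta * W2^2)."""
    logits = [-beta * w2_squared(query, g) for g in stored]
    top = max(logits)
    return -(top + math.log(sum(math.exp(v - top) for v in logits))) / beta

def retrieve_step(query, stored, beta):
    """Push the query through the Gibbs-weighted average of 1D Gaussian OT maps."""
    w = gibbs_weights(query, stored, beta)
    mean = sum(wi * g[0] for wi, g in zip(w, stored))
    std = sum(wi * g[1] for wi, g in zip(w, stored))
    return (mean, std)

# Three stored Gaussians and a perturbed query near the second one.
stored = [(0.0, 1.0), (5.0, 0.5), (-4.0, 2.0)]
query = (4.6, 0.7)
retrieved = retrieve_step(query, stored, beta=5.0)
```

With a large inverse temperature the Gibbs weights concentrate on the nearest stored Gaussian, so a single step essentially returns that stored pattern and lowers the energy.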

#110 Revisiting Multivariate Time Series Forecasting with Missing Values

Authors: Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Philip S. Yu, Kaize Ding

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.23494

Abstract:
Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available at https://github.com/Muyiiiii/CRIB.

#111 AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Authors: Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.01268

Abstract:
We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log-probabilities can be sub-optimal. In response, we introduce AdaDetectGPT -- a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show that AdaDetectGPT nearly uniformly improves on the state-of-the-art method across various combinations of datasets and LLMs, and the improvement can reach up to 37%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.

#112 Flatness-Aware Stochastic Gradient Langevin Dynamics

Authors: Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.02174

Abstract:
Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases its learning dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD converges to a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale $\sigma$ and the inverse temperature $\beta$, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed $\beta$-$\sigma$ coupling compared to decoupled choices.

#113 TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Authors: Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.03817

Abstract:
Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

#114 Myopic Bayesian Decision Theory for Batch Active Learning with Partial Batch Label Sampling

Authors: Kangping Hu, Stephen Mussmann

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.09877

Abstract:
Over the past couple of decades, many active learning acquisition functions have been proposed, leaving practitioners with an unclear choice of which to use. Bayesian Decision Theory (BDT) offers a universal principle to guide decision-making. In this work, we derive BDT for (Bayesian) active learning in the myopic framework, where we imagine we only have one more point to label. This derivation leads to effective algorithms such as Expected Error Reduction (EER), Expected Predictive Information Gain (EPIG), and other algorithms that appear in the literature. A key challenge of such methods is the difficult scaling to large batch sizes, leading to either computational challenges (BatchBALD) or dramatic performance drops (top-$B$ selection). Here, using a particular formulation of the decision process, we derive Partial Batch Label Sampling (ParBaLS) for the EPIG algorithm. We show experimentally for several datasets that ParBaLS EPIG gives superior performance for a fixed budget and Bayesian Logistic Regression on Neural Embeddings. Our code is available at https://github.com/ADDAPT-ML/ParBaLS.

#115 Functional Distribution Networks (FDN)

Authors: Omer Haq

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2510.17794

Abstract:
Modern probabilistic regressors often remain overconfident under distribution shift. Functional Distribution Networks (FDN) place input-conditioned distributions over network weights, producing predictive mixtures whose dispersion adapts to the input; we train them with a Monte Carlo beta-ELBO objective. We pair FDN with an evaluation protocol that separates interpolation from extrapolation and emphasizes simple OOD sanity checks. On controlled 1D tasks and small/medium UCI-style regression benchmarks, FDN remains competitive in accuracy with strong Bayesian, ensemble, dropout, and hypernetwork baselines, while providing strongly input-dependent, shift-aware uncertainty and competitive calibration under matched parameter and update budgets.

#116 Forgetting is Everywhere

Authors: Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.04666

Abstract:
A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget and demonstrates that exact Bayesian inference allows for adaptation without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

#117 Counterfactual Forecasting for Panel Data

Authors: Navonil Deb, Raaz Dwivedi, Sumanta Basu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.06189

Abstract:
We address the challenge of forecasting counterfactual outcomes in panel data with missing entries and temporally dependent latent factors -- a common scenario in causal inference, where estimating unobserved potential outcomes ahead of time is essential. We propose Forecasting Counterfactuals under Stochastic Dynamics (FOCUS), a method that extends traditional matrix completion methods by leveraging time series dynamics of the factors, thereby enhancing the prediction accuracy of future counterfactuals. Building upon a consistent estimator of the factors, our method accommodates both stochastic and deterministic components within the factors, and provides a flexible framework for various applications. In the case of stationary autoregressive factors and under standard conditions, we derive error bounds and establish asymptotic normality of our estimator. Empirical evaluations demonstrate that our method outperforms existing benchmarks when the latent factors have an autoregressive component. We illustrate FOCUS on HeartSteps, a mobile health study, showing its effectiveness in forecasting step counts for users receiving activity prompts by leveraging temporal patterns in user behavior.

#118 Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Authors: Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, Jie Liu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.23402

Abstract:
Multimodal foundation models are increasingly trained on sensitive data across domains such as finance, biomedicine, and personal identifiers. However, this distributed setup raises serious privacy concerns due to the need for cross-partition data sharing. Split learning addresses these concerns by enabling collaborative model training without raw data exchange between partitions, yet it introduces a significant challenge: transmitting high-dimensional intermediate feature representations between partitions leads to substantial communication costs. To address this challenge, we propose Quantized-TinyLLaVA, a multimodal foundation model with an integrated communication-efficient split learning framework. Our approach adopts a compression module that quantizes intermediate features into discrete representations before transmission, substantially reducing communication overhead. In addition, we derive a principled quantization strategy grounded in entropy coding theory to determine the optimal number of discrete representation levels. We deploy our framework in a two-partition setting, with one partition operating as the client and the other as the server, to realistically simulate distributed training. Under this setup, Quantized-TinyLLaVA achieves an approximate 87.5% reduction in communication overhead with 2-bit quantization, while maintaining performance of the original 16-bit model across five benchmark datasets. Furthermore, our compressed representations exhibit enhanced resilience against feature inversion attacks, validating the privacy of transmission. The code is available at https://github.com/anonymous-1742/Quantized-TinyLLaVA.

#119 Learning to Reason in LLMs by Expectation Maximization

Authors: Junghyun Lee, Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Ryan A. Rossi, Sunav Choudhary, Alexa Siu

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.20169

Abstract:
Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution of rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR that conditions on the correct answer in the prompt. We experiment with LLM-as-a-judge calibration and summarization from feedback tasks, where conditioning on the correct answer provides strong guidance for generating rationales. Our experiments show the efficacy of PPS over the other sampling schemes, and that the sampling scheme can have a significant impact on performance.

#120 Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Authors: Nathan Kallus

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.21917

Abstract:
Aligning large language models (LLMs) to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., a logistic Bradley-Terry link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study preference alignment under an unknown and unrestricted link function. We show that realizability of $f$-divergence-constrained reward maximization in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-dependent index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than assuming this model has identifiable finite-dimensional structural parameters and estimating them, as in econometrics, we focus on policy learning with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable nonparametric indices. We develop preference optimization algorithms robust to the unknown link and prove convergence guarantees in terms of generic function complexity measures. We demonstrate this empirically on LLM alignment. Code is available at https://github.com/causalml/spo/

#121 A Community-Aware Framework for Influence Maximization with Explicit Accounting for Inter-Community Influence

Authors: Eliot W. Robson, Abhishek K. Umrawal

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.23973

Abstract:
Influence Maximization (IM) seeks to identify a small set of seed nodes in a social network to maximize expected information spread under a diffusion model. While community-based approaches improve scalability by exploiting modular structure, they typically assume independence between communities, overlooking inter-community influence, a limitation that reduces effectiveness in real-world networks. We introduce Community-IM++, a scalable framework that explicitly models cross-community diffusion through a principled heuristic based on community-based diffusion degree (CDD) and a progressive budgeting strategy. The algorithm partitions the network, computes CDD to prioritize bridging nodes, and allocates seeds adaptively across communities using lazy evaluation to minimize redundant computations. Experiments on large real-world social networks under different edge weight models show that Community-IM++ achieves near-greedy influence spread at up to 100 times lower runtime, while outperforming Community-IM and degree heuristics across budgets and structural conditions. These results demonstrate the practicality of Community-IM++ for large-scale applications such as viral marketing, misinformation control, and public health campaigns, where efficiency and cross-community reach are critical.

#122 When Does Pairing Seeds Reduce Variance? Evidence from a Multi-Agent Economic Simulation

Authors: Udit Sharma

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.24145

Abstract:
Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across repeated executions. Standard evaluation practice typically treats runs across alternatives as independent and does not exploit shared sources of randomness. This paper analyses the statistical structure of comparative evaluation under shared random seeds. Under this design, competing systems are evaluated using identical seeds, inducing matched stochastic realisations and yielding strict variance reduction whenever outcomes are positively correlated at the seed level. We demonstrate these effects using an extended learning-based multi-agent economic simulator, where paired evaluation exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.
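
The variance-reduction claim above follows from the identity Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B): when two systems share a seed and their outcomes are positively correlated, the covariance term shrinks the variance of their difference. The following is a minimal sketch of that effect on a toy model, not the paper's simulator. The function names, the noise scales, and the additive "shared plus system-specific noise" structure are all illustrative assumptions.

```python
import random
import statistics

def outcome(system_id: int, effect: float, seed: int) -> float:
    """Hypothetical toy system: true effect + seed-level shared noise + system-specific noise."""
    # Shared component: identical for every system evaluated with this seed.
    shared = random.Random(seed).gauss(0.0, 1.0)
    # System-specific component: derived from the seed but different per system.
    own = random.Random(1_000_003 * (seed + 1) + system_id).gauss(0.0, 0.3)
    return effect + shared + own

def difference_variances(n: int = 500) -> tuple[float, float]:
    """Return (paired, independent) sample variances of the A-minus-B outcome difference."""
    # Paired design: both systems are evaluated on the same seed.
    paired = [outcome(1, 0.2, s) - outcome(2, 0.0, s) for s in range(n)]
    # Independent design: system B is evaluated on a disjoint set of seeds.
    independent = [outcome(1, 0.2, s) - outcome(2, 0.0, s + n) for s in range(n)]
    return statistics.variance(paired), statistics.variance(independent)

paired_var, independent_var = difference_variances()
print(f"paired: {paired_var:.3f}, independent: {independent_var:.3f}")
```

In this toy setup the shared noise cancels exactly in the paired difference, so only the system-specific noise remains. In a real simulator the cancellation would be partial and would depend on how strongly outcomes are correlated at the seed level.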

#123 A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Authors: Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.16979

Abstract:
Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta \mathbf{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.

#124 A Universal Load Balancing Principle and Its Application to Large Language Model Serving

Authors: Zixi Chen, Tianci Bu, Chendong Song, Xin Lu, Yinyu Ye, Zijie Zhou

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.17855

Abstract:
Over 40% of computational power in Large Language Model (LLM) serving systems can be systematically wasted - not from hardware limits, but from load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In large language model inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.

#125 Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Authors: Jan Schuchardt, Nikita Kalinin

Published: Tue, 03 Feb 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.21636

Abstract:
We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(\epsilon,\delta)$-DP is inversely proportional to $\delta$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $\epsilon$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.
