arXiv論文一覧 - stat.ML updates on arXiv.org

#1 Deep networks learn to parse uniform-depth context-free languages from local statistics

著者: Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06065

要約:
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

#2 Algebraic Robustness Verification of Neural Networks

著者: Yulia Alexandr, Hao Duan, Guido Mont\'ufar

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06105

要約:
We formulate formal robustness verification of neural networks as an algebraic optimization problem. We leverage the Euclidean Distance (ED) degree, which is the generic number of complex critical points of the distance minimization problem to a classifier's decision boundary, as an architecture-dependent measure of the intrinsic complexity of robustness verification. To make this notion operational, we define the associated ED discriminant, which characterizes input points at which the number of real critical points changes, distinguishing test instances that are easier or harder to verify. We provide an explicit algorithm for computing this discriminant. We further introduce the parameter discriminant of a neural network, identifying parameters where the ED degree drops and the decision boundary exhibits reduced algebraic complexity. We derive closed-form expressions for the ED degree for several classes of neural architectures, as well as formulas for the expected number of real critical points in the infinite-width limit. Finally, we present an exact robustness certification algorithm based on numerical homotopy continuation, establishing a concrete link between metric algebraic geometry and neural network verification.

#3 Inheritance Between Feedforward and Convolutional Networks via Model Projection

著者: Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06245

要約:
Techniques for feedforward networks (FFNs) and convolutional networks (CNNs) are frequently reused across families, but the relationship between the underlying model classes is rarely made explicit. We introduce a unified node-level formalization with tensor-valued activations and show that generalized feedforward networks form a strict subset of generalized convolutional networks. Motivated by the mismatch in per-input parameterization between the two families, we propose model projection, a parameter-efficient transfer learning method for CNNs that freezes pretrained per-input-channel filters and learns a single scalar gate for each (output channel, input channel) contribution. Projection keeps all convolutional layers adaptable to downstream tasks while substantially reducing the number of trained parameters in convolutional layers. We prove that projected nodes take the generalized FFN form, enabling projected CNNs to inherit feedforward techniques that do not rely on homogeneous layer inputs. Experiments across multiple ImageNet-pretrained backbones and several downstream image classification datasets show that model projection is a strong transfer learning baseline under simple training recipes.

#4 Time-uniform conformal and PAC prediction

著者: Kayla E. Scharfstein, Arun Kumar Kuchibhotla

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06297

要約:
Given that machine learning algorithms are increasingly being deployed to aid in high stakes decision-making, uncertainty quantification methods that wrap around these black box models such as conformal prediction have received much attention in recent years. In sequential settings, where data are observed/generated in a streaming fashion, traditional conformal methods do not provide any guarantee without fixing the sample size. More importantly, traditional conformal methods cannot cope with sequentially updated predictions. As such, we develop an extension of the conformal prediction and related probably approximately correct (PAC) prediction frameworks to sequential settings where the number of data points is not fixed in advance. The resulting prediction sets are anytime-valid in that their expected coverage is at the required level at any time chosen by the analyst even if this choice depends on the data. We present theoretical guarantees for our proposed methods and demonstrate their validity and utility on simulated and real datasets.

#5 High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory

著者: Sota Nishiyama, Masaaki Imaizumi

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06320

要約:
Modern machine learning models are typically trained via multi-pass stochastic gradient descent (SGD) with small batch sizes, and understanding their dynamics in high dimensions is of great interest. However, an analytical framework for describing the high-dimensional asymptotic behavior of multi-pass SGD with small batch sizes for nonlinear models is currently missing. In this study, we address this gap by analyzing the high-dimensional dynamics of a stochastic differential equation called a \emph{stochastic gradient flow} (SGF), which approximates multi-pass SGD in this regime. In the limit where the number of data samples $n$ and the dimension $d$ grow proportionally, we derive a closed system of low-dimensional and continuous-time equations and prove that it characterizes the asymptotic distribution of the SGF parameters. Our theory is based on the dynamical mean-field theory (DMFT) and is applicable to a wide range of models encompassing generalized linear models and two-layer neural networks. We further show that the resulting DMFT equations recover several existing high-dimensional descriptions of SGD dynamics as special cases, thereby providing a unifying perspective on prior frameworks such as online SGD and high-dimensional linear regression. Our proof builds on the existing DMFT technique for gradient flow and extends it to handle the stochasticity in SGF using tools from stochastic calculus.

#6 Revisiting the Sliced Wasserstein Kernel for persistence diagrams: a Figalli-Gigli approach

著者: Marc Janthial (X), Th\'eo Lacombe (LIGM)

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06539

要約:
The Sliced Wasserstein Kernel (SWK) for persistence diagrams was introduced in (Carri{\`e}re et al. 2017) as a powerful tool to implicitly embed persistence diagrams in a Hilbert space with reasonable distortion. This kernel is built on the intuition that the Figalli-Gigli distance-that is the partial matching distance routinely used to compare persistence diagrams-resembles the Wasserstein distance used in the optimal transport literature, and that the later could be sliced to define a positive definite kernel on the space of persistence diagrams. This efficient construction nonetheless relies on ad-hoc tweaks on the Wasserstein distance to account for the peculiar geometry of the space of persistence diagrams. In this work, we propose to revisit this idea by directly using the Figalli-Gigli distance instead of the Wasserstein one as the building block of our kernel. On the theoretical side, our sliced Figalli-Gigli kernel (SFGK) shares most of the important properties of the SWK of Carri{\`e}re et al., including distortion results on the induced embedding and its ease of computation, while being more faithful to the natural geometry of persistence diagrams. In particular, it can be directly used to handle infinite persistence diagrams and persistence measures. On the numerical side, we show that the SFGK performs as well as the SWK on benchmark applications.

#7 Operationalizing Stein's Method for Online Linear Optimization: CLT-Based Optimal Tradeoffs

著者: Zhiyu Zhang, Aaditya Ramdas

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06545

要約:
Adversarial online linear optimization (OLO) is essentially about making performance tradeoffs with respect to the unknown difficulty of the adversary. In the setting of one-dimensional fixed-time OLO on a bounded domain, it has been observed since Cover (1966) that achievable tradeoffs are governed by probabilistic inequalities, and these descriptive results can be converted into algorithms via dynamic programming, which, however, is not computationally efficient. We address this limitation by showing that Stein's method, a classical framework underlying the proofs of probabilistic limit theorems, can be operationalized as computationally efficient OLO algorithms. The associated regret and total loss upper bounds are "additively sharp", meaning that they surpass the conventional big-O optimality and match normal-approximation-based lower bounds by additive lower order terms. Our construction is inspired by the remarkably clean proof of a Wasserstein martingale central limit theorem (CLT) due to R\"ollin (2018). Several concrete benefits can be obtained from this general technique. First, with the same computational complexity, the proposed algorithm improves upon the total loss upper bounds of online gradient descent (OGD) and multiplicative weight update (MWU). Second, our algorithm can realize a continuum of optimal two-point tradeoffs between the total loss and the maximum regret over comparators, improving upon prior works in parameter-free online learning. Third, by allowing the adversary to randomize on an unbounded support, we achieve sharp in-expectation performance guarantees for OLO with noisy feedback.

#8 Infinite-dimensional generative diffusions via Doob's h-transform

diffusion

著者: Thorben Pieper-Sethmacher, Daniel Paulin

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06621

要約:
This paper introduces a rigorous framework for defining generative diffusion models in infinite dimensions via Doob's h-transform. Rather than relying on time reversal of a noising process, a reference diffusion is forced towards the target distribution by an exponential change of measure. Compared to existing methodology, this approach readily generalises to the infinite-dimensional setting, hence offering greater flexibility in the diffusion model. The construction is derived rigorously under verifiable conditions, and bounds with respect to the target measure are established. We show that the forced process under the changed measure can be approximated by minimising a score-matching objective and validate our method on both synthetic and real data.

#9 Missing At Random as Covariate Shift: Correcting Bias in Iterative Imputation

著者: Luke Shannon, Song Liu, Katarzyna Reluga

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06713

要約:
Accurate imputation of missing data is critical to downstream machine learning performance. We formulate missing data imputation as a risk minimisation problem, which highlights a covariate shift between the observed and unobserved data distributions. This covariate shift induced bias is not accounted for by popular imputation methods and leads to suboptimal performance. In this paper, we derive theoretically valid importance weights that correct for the induced distributional bias. Furthermore, we propose a novel imputation algorithm that jointly estimates both the importance weights and imputation models, enabling bias correction throughout the imputation process. Empirical results across benchmark datasets show reductions in root mean squared error and Wasserstein distance of up to 7% and 20%, respectively, compared to otherwise identical unweighted methods.

#10 Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

著者: Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06797

要約:
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practiceand characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.

#11 Pragmatic Curiosity: A Hybrid Learning-Optimization Paradigm via Active Inference

著者: Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06104

要約:
Many engineering and scientific workflows depend on expensive black-box evaluations, requiring decision-making that simultaneously improves performance and reduces uncertainty. Bayesian optimization (BO) and Bayesian experimental design (BED) offer powerful yet largely separate treatments of goal-seeking and information-seeking, providing limited guidance for hybrid settings where learning and optimization are intrinsically coupled. We propose "pragmatic curiosity," a hybrid learning-optimization paradigm derived from active inference, in which actions are selected by minimizing the expected free energy--a single objective that couples pragmatic utility with epistemic information gain. We demonstrate the practical effectiveness and flexibility of pragmatic curiosity on various real-world hybrid tasks, including constrained system identification, targeted active search, and composite optimization with unknown preferences. Across these benchmarks, pragmatic curiosity consistently outperforms strong BO-type and BED-type baselines, delivering higher estimation accuracy, better critical-region coverage, and improved final solution quality.

#12 Warm Starts, Cold States: Exploiting Adiabaticity for Variational Ground-States

著者: Ricard Puig, Berta Casas, Alba Cervera-Lierta, Zo\"e Holmes, Adri\'an P\'erez-Salinas

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06137

要約:
Reliable preparation of many-body ground states is an essential task in quantum computing, with applications spanning areas from chemistry and materials modeling to quantum optimization and benchmarking. A variety of approaches have been proposed to tackle this problem, including variational methods. However, variational training often struggle to navigate complex energy landscapes, frequently encountering suboptimal local minima or suffering from barren plateaus. In this work, we introduce an iterative strategy for ground-state preparation based on a stepwise (discretized) Hamiltonian deformation. By complementing the Variational Quantum Eigensolver (VQE) with adiabatic principles, we demonstrate that solving a sequence of intermediate problems facilitates tracking the ground-state manifold toward the target system, even as we scale the system size. We provide a rigorous theoretical foundation for this approach, proving a lower bound on the loss variance that suggests trainability throughout the deformation, provided the system remains away from gap closings. Numerical simulations, including the effects of shot noise, confirm that this path-dependent tracking consistently converges to the target ground state.

#13 Optimistic Training and Convergence of Q-Learning -- Extended Version

著者: Prashant Mehta, Sean Meyn

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06146

要約:
In recent work it is shown that Q-learning with linear function approximation is stable, in the sense of bounded parameter estimates, under the $(\varepsilon,\kappa)$-tamed Gibbs policy; $\kappa$ is inverse temperature, and $\varepsilon>0$ is introduced for additional exploration. Under these assumptions it also follows that there is a solution to the projected Bellman equation (PBE). Left open is uniqueness of the solution, and criteria for convergence outside of the standard tabular or linear MDP settings. The present work extends these results to other variants of Q-learning, and clarifies prior work: a one dimensional example shows that under an oblivious policy for training there may be no solution to the PBE, or multiple solutions, and in each case the algorithm is not stable under oblivious training. The main contribution is that far more structure is required for convergence. An example is presented for which the basis is ideal, in the sense that the true Q-function is in the span of the basis. However, there are two solutions to the PBE under the greedy policy, and hence also for the $(\varepsilon,\kappa)$-tamed Gibbs policy for all sufficiently small $\varepsilon>0$ and $\kappa\ge 1$.

#14 Latent Structure Emergence in Diffusion Models via Confidence-Based Filtering

diffusion

著者: Wei Wei, Yizhou Zeng, Kuntian Chen, Sophie Langer, Mariia Seleznova, Hung-Hsu Chou

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06155

要約:
Diffusion models rely on a high-dimensional latent space of initial noise seeds, yet it remains unclear whether this space contains sufficient structure to predict properties of the generated samples, such as their classes. In this work, we investigate the emergence of latent structure through the lens of confidence scores assigned by a pre-trained classifier to generated samples. We show that while the latent space appears largely unstructured when considering all noise realizations, restricting attention to initial noise seeds that produce high-confidence samples reveals pronounced class separability. By comparing class predictability across noise subsets of varying confidence and examining the class separability of the latent space, we find evidence of class-relevant latent structure that becomes observable only under confidence-based filtering. As a practical implication, we discuss how confidence-based filtering enables conditional generation as an alternative to guidance-based methods.

#15 Optimal rates for density and mode estimation with expand-and-sparsify representations

著者: Kaushik Sinha, Christopher Tosh

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06175

要約:
Expand-and-sparsify representations are a class of theoretical models that capture sparse representation phenomena observed in the sensory systems of many animals. At a high level, these representations map an input $x \in \mathbb{R}^d$ to a much higher dimension $m \gg d$ via random linear projections before zeroing out all but the $k \ll m$ largest entries. The result is a $k$-sparse vector in $\{0,1\}^m$. We study the suitability of this representation for two fundamental statistical problems: density estimation and mode estimation. For density estimation, we show that a simple linear function of the expand-and-sparsify representation produces an estimator with minimax-optimal $\ell_{\infty}$ convergence rates. In mode estimation, we provide simple algorithms on top of our density estimator that recover single or multiple modes at optimal rates up to logarithmic factors under mild conditions.

#16 Statistical Learning from Attribution Sets

著者: Lorne Applebaum, Robert Busa-Fekete, August Y. Chen, Claudio Gentile, Tomer Koren, Aryan Mokhtari

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06276

要約:
We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

#17 Envy-Free Allocation of Indivisible Goods via Noisy Queries

著者: Zihan Li, Yan Hao Ling, Jonathan Scarlett, Warut Suksompong

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06361

要約:
We introduce a problem of fairly allocating indivisible goods (items) in which the agents' valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the two-agent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, $m$, and the negative-envy of the optimal allocation, $\Delta$. In particular, when $\Delta$ is not too small (namely, $\Delta \gg m^{1/4}$), we establish that the optimal number of queries scales as $\frac{\sqrt m }{(\Delta / m)^2} = \frac{m^{2.5}}{\Delta^2}$ up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.

#18 Sequential Auditing for f-Differential Privacy

privacy

著者: Tim Kutta, Martin Dunsche, Yu Wei, Vassilis Zikas

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06518

要約:
We present new auditors to assess Differential Privacy (DP) of an algorithm based on output samples. Such empirical auditors are common to check for algorithmic correctness and implementation bugs. Most existing auditors are batch-based or targeted toward the traditional notion of $(\varepsilon,\delta)$-DP; typically both. In this work, we shift the focus to the highly expressive privacy concept of $f$-DP, in which the entire privacy behavior is captured by a single tradeoff curve. Our auditors detect violations across the full privacy spectrum with statistical significance guarantees, which are supported by theory and simulations. Most importantly, and in contrast to prior work, our auditors do not require a user-specified sample size as an input. Rather, they adaptively determine a near-optimal number of samples needed to reach a decision, thereby avoiding the excessively large sample sizes common in many auditing studies. This reduction in sampling cost becomes especially beneficial for expensive training procedures such as DP-SGD. Our method supports both whitebox and blackbox settings and can also be executed in single-run frameworks.

#19 Which Graph Shift Operator? A Spectral Answer to an Empirical Question

著者: Yassine Abbahaddou

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06557

要約:
Graph Neural Networks (GNNs) have established themselves as the leading models for learning on graph-structured data, generally categorized into spatial and spectral approaches. Central to these architectures is the Graph Shift Operator (GSO), a matrix representation of the graph structure used to filter node signals. However, selecting the optimal GSO, whether fixed or learnable, remains largely empirical. In this paper, we introduce a novel alignment gain metric that quantifies the geometric distortion between the input signal and label subspaces. Crucially, our theoretical analysis connects this alignment directly to generalization bounds via a spectral proxy for the Lipschitz constant. This yields a principled, computation-efficient criterion to rank and select the optimal GSO for any prediction task prior to training, eliminating the need for extensive search.

#20 Efficient Online Variational Estimation via Monte Carlo Sampling

著者: Mathis Chagneux (KTH STOCKHOLM), Mathias M\"uller (KTH STOCKHOLM), Pierre Gloaguen (UBS), Sylvain Le Corff (LPSM), Jimmy Olsson (KTH STOCKHOLM)

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06579

要約:
This article addresses online variational estimation in parametric state-space models. We propose a new procedure for efficiently computing the evidence lower bound and its gradient in a streaming-data setting, where observations arrive sequentially. The algorithm allows for the simultaneous training of the model parameters and the distribution of the latent states given the observations. It is based on i.i.d. Monte Carlo sampling, coupled with a well-chosen deep architecture, enabling both computational efficiency and flexibility. The performance of the method is illustrated on both synthetic data and real-world air-quality data. The proposed approach is theoretically motivated by the existence of an asymptotic contrast function and the ergodicity of the underlying Markov chain, and applies more generally to the computation of additive expectations under posterior distributions in state-space models.

#21 Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning

著者: Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao, David Hartmann, Edouardo Honig, Dehong Xu, Amit Kumar, Matt Sarte, Chuan Li, Jianwen Xie, Ying Nian Wu

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06584

要約:
Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.

#22 On the Convergence of Multicalibration Gradient Boosting

著者: Daniel Haimovich, Fridolin Linder, Lorenzo Perini, Niek Tax, Milan Vojnovic

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06773

要約:
Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its convergence properties are not well understood. In this paper, we bridge the gap by providing convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. We show that the magnitude of successive prediction updates decays at $O(1/\sqrt{T})$, which implies the same convergence rate bound for the multicalibration error over rounds. Under additional smoothness assumptions on the weak learners, this rate improves to linear convergence. We further analyze adaptive variants, showing local quadratic convergence of the training loss, and we study rescaling schemes that preserve convergence. Experiments on real-world datasets support our theory and clarify the regimes in which the method achieves fast convergence and strong multicalibration.

#23 Robust Online Learning

著者: Sajad Ashkezari

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06775

要約:
We study the problem of learning robust classifiers where the classifier will receive a perturbed input. Unlike robust PAC learning studied in prior work, here the clean data and its label are also adversarially chosen. We formulate this setting as an online learning problem and consider both the realizable and agnostic learnability of hypothesis classes. We define a new dimension of classes and show it controls the mistake bounds in the realizable setting and the regret bounds in the agnostic setting. In contrast to the dimension that characterizes learnability in the PAC setting, our dimension is rather simple and resembles the Littlestone dimension. We generalize our dimension to multiclass hypothesis classes and prove similar results in the realizable case. Finally, we study the case where the learner does not know the set of allowed perturbations for each point and only has some prior on them.

#24 Learning Deep Hybrid Models with Sharpness-Aware Minimization

著者: Naoya Takeishi

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06837

要約:
Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, effectively the scientific models may be ignored in prediction due to the flexibility of the machine learning model, making the idea of hybrid modeling pointless. Typically some regularization is applied to hybrid model learning to avoid such a failure case, but the formulation of the regularizer strongly depends on model architectures and domain knowledge. In this paper, we propose to focus on the flatness of loss minima in learning hybrid models, aiming to make the model as simple as possible. We employ the idea of sharpness-aware minimization and adapt it to the hybrid modeling setting. Numerical experiments show that the SAM-based method works well across different choices of models and datasets.

#25 Vision Transformer Finetuning Benefits from Non-Smooth Components

著者: Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06883

要約:
The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.

#26 Sample Complexity of Causal Identification with Temporal Heterogeneity

著者: Ameya Rathod, Sujay Belsare, Salvik Krishna Nautiyal, Dhruv Laad, Ponnurangam Kumaraguru

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06899

要約:
Recovering a unique causal graph from observational data is an ill-posed problem because multiple generating mechanisms can lead to the same observational distribution. This problem becomes solvable only by exploiting specific structural or distributional assumptions. While recent work has separately utilized time-series dynamics or multi-environment heterogeneity to constrain this problem, we integrate both as complementary sources of heterogeneity. This integration yields unified necessary identifiability conditions and enables a rigorous analysis of the statistical limits of recovery under thin versus heavy-tailed noise. In particular, temporal structure is shown to effectively substitute for missing environmental diversity, possibly achieving identifiability even under insufficient heterogeneity. Extending this analysis to heavy-tailed (Student's t) distributions, we demonstrate that while geometric identifiability conditions remain invariant, the sample complexity diverges significantly from the Gaussian baseline. Explicit information-theoretic bounds quantify this cost of robustness, establishing the fundamental limits of covariance-based causal graph recovery methods in realistic non-stationary systems. This work shifts the focus from whether causal structure is identifiable to whether it is statistically recoverable in practice.

#27 Supercharging Simulation-Based Inference for Bayesian Optimal Experimental Design

著者: Samuel Klein, Willie Neiswanger, Daniel Ratner, Michael Kagan, Sean Gasiorowski

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06900

要約:
Bayesian optimal experimental design (BOED) seeks to maximize the expected information gain (EIG) of experiments. This requires a likelihood estimate, which in many settings is intractable. Simulation-based inference (SBI) provides powerful tools for this regime. However, existing work explicitly connecting SBI and BOED is restricted to a single contrastive EIG bound. We show that the EIG admits multiple formulations which can directly leverage modern SBI density estimators, encompassing neural posterior, likelihood, and ratio estimation. Building on this perspective, we define a novel EIG estimator using neural likelihood estimation. Further, we identify optimization as a key bottleneck of gradient based EIG maximization and show that a simple multi-start parallel gradient ascent procedure can substantially improve reliability and performance. With these innovations, our SBI-based BOED methods are able to match or outperform by up to $22\%$ existing state-of-the-art approaches across standard BOED benchmarks.

#28 Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory

著者: Emmanuel Esposito, Andrew Jacobsen, Hao Qiu, Mengxiao Zhang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06902

要約:
In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $\lambda_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}(\sqrt{(1+P_T)(T+\sum_t \lambda_t)})$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds. This recovers the optimal guarantees for both static and dynamic regret in standard OCO as a special case where $\lambda_t=0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.

#29 Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

著者: Wenlong Mou

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.06930

要約:
We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.

#30 Training-Conditional Coverage Bounds under Covariate Shift

著者: Mehrdad Pournaderi, Yu Xiang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2405.16594

要約:
Conformal prediction methodology has recently been extended to the covariate shift setting, where the distribution of covariates differs between training and test data. While existing results ensure that the prediction sets from these methods achieve marginal coverage above a nominal level, their coverage rate conditional on the training dataset (referred to as training-conditional coverage) remains unexplored. In this paper, we address this gap by deriving upper bounds on the tail of the training-conditional coverage distribution, offering probably approximately correct (PAC) guarantees for these methods. Our results characterize the reliability of the prediction sets in terms of the severity of distributional changes and the size of the training dataset.

#31 Ensemble Transport Filter via Optimized Maximum Mean Discrepancy

著者: Dengfei Zeng, Lijian Jiang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2407.11518

要約:
In this paper, we present a new ensemble-based filter method by reconstructing the analysis step of the particle filter through a transport map, which directly transports prior particles to posterior particles. The transport map is constructed through an optimization problem described by the Maximum Mean Discrepancy loss function, which matches the expectation information of the approximated posterior and reference posterior. The proposed method inherits the accurate estimation of the posterior distribution from particle filtering while gives an extension to high dimensional assimilation problems. To improve the robustness of Maximum Mean Discrepancy, a variance penalty term is used to guide the optimization. It prioritizes minimizing the discrepancy between the expectations of highly informative statistics for the reference posteriors. The penalty term significantly enhances the robustness of the proposed method and leads to a better approximation of the posterior. A few numerical examples are presented to illustrate the advantage of the proposed method over ensemble Kalman filter.

#32 Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models

diffusion

著者: Etrit Haxholli, Yeti Z. Gurbuz, Ogul Can, Eli Waxman

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2507.04341

要約:
While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound of the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy with up to 10% lower perplexity/generative-perplexity, and 15% faster training steps. To further support our findings, we introduce and evaluate a novel CTMC transition-rate matrix that allows prediction refinement, and derive the analytic expression for its matrix exponential which facilitates the computation of conditional ratios thus enabling efficient training and generation.

#33 Optimal Bias-variance Tradeoff in Matrix and Tensor Estimation

著者: Shivam Kumar, Xiaokai Luo, Haotian Xu, Carlos Misael Madrid Padilla, Oscar Hernan Madrid Padilla, Daren Wang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2509.17382

要約:
We study matrix and tensor denoising when the underlying signal is \textbf{not} necessarily low-rank. In the tensor setting, we observe \[ Y = X^\ast + Z \in \mathbb{R}^{p_1 \times p_2 \times p_3}, \] where $X^\ast$ is an unknown signal tensor and $Z$ is a noise tensor. We propose a one-step variant of the higher-order SVD (HOSVD) estimator, denoted $\widetilde X$, and show that, uniformly over any user-specified Tucker ranks $(r_1,r_2,r_3)$, with high probability, \[ \|\widetilde X - X^\ast\|_{\mathrm F}^2 = O\Big( \kappa^2\Big\{r_1r_2r_3 + \sum_{k=1}^3 p_k r_k\Big\} + \xi_{(r_1,r_2,r_3)}^2 \Big). \] Here, $\xi_{(r_1,r_2,r_3)}$ is the best achievable Tucker rank-$(r_1,r_2,r_3)$ approximation error of $X^\ast$ (bias), $\kappa^2$ quantifies the noise level, and $\kappa^2\{r_1r_2r_3+\sum_{k=1}^3 p_k r_k\}$ is the variance term scaling with the effective degrees of freedom of $\widetilde X$. This yields a rank-adaptive bias-variance tradeoff: increasing $(r_1,r_2,r_3)$ decreases the bias $\xi_{(r_1,r_2,r_3)}$ while increasing variance. In the matrix setting, we show that truncated SVD achieves an analogous bias-variance tradeoff for arbitrary signal matrices. Notably, our matrix result requires \textbf{no} assumptions on the signal matrix, such as finite rank or spectral gaps. Finally, we complement our upper bounds with matching information-theoretic lower bounds, showing that the resulting bias-variance tradeoff is minimax optimal up to universal constants in both the matrix and tensor settings.

#34 FreDN: Spectral Disentanglement for Time Series Forecasting via Learnable Frequency Decomposition

著者: Zhongde An, Jinhong You, Jiyanglin Li, Yiming Tang, Wen Li, Heming Du, Shouguo Du

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.11817

要約:
Time series forecasting is essential in a wide range of real world applications. Recently, frequency-domain methods have attracted increasing interest for their ability to capture global dependencies. However, when applied to non-stationary time series, these methods encounter the $\textit{spectral entanglement}$ and the computational burden of complex-valued learning. The $\textit{spectral entanglement}$ refers to the overlap of trends, periodicities, and noise across the spectrum due to $\textit{spectral leakage}$ and the presence of non-stationarity. However, existing decompositions are not suited to resolving spectral entanglement. To address this, we propose the Frequency Decomposition Network (FreDN), which introduces a learnable Frequency Disentangler module to separate trend and periodic components directly in the frequency domain. Furthermore, we propose a theoretically supported ReIm Block to reduce the complexity of complex-valued operations while maintaining performance. We also re-examine the frequency-domain loss function and provide new theoretical insights into its effectiveness. Extensive experiments on seven long-term forecasting benchmarks demonstrate that FreDN outperforms state-of-the-art methods by up to 10\%. Furthermore, compared with standard complex-valued architectures, our real-imaginary shared-parameter design reduces the parameter count and computational cost by at least 50\%.

#35 Performative Learning Theory

著者: Julian Rodemann, Unai Fischer-Abaigar, James Bailie, Krikamol Muandet

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.04402

要約:
Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.

#36 Radon--Wasserstein Gradient Flows for Interacting-Particle Sampling in High Dimensions

著者: Elias Hess-Childs, Dejan Slep\v{c}ev, Lantian Xu

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2602.05227

要約:
Gradient flows of the Kullback--Leibler (KL) divergence, such as the Fokker--Planck equation and Stein Variational Gradient Descent, evolve a distribution toward a target density known only up to a normalizing constant. We introduce new gradient flows of the KL divergence with a remarkable combination of properties: they admit accurate interacting-particle approximations in high dimensions, and the per-step cost scales linearly in both the number of particles and the dimension. These gradient flows are based on new transportation-based Riemannian geometries on the space of probability measures: the Radon--Wasserstein geometry and the related Regularized Radon--Wasserstein (RRW) geometry. We define these geometries using the Radon transform so that the gradient-flow velocities depend only on one-dimensional projections. This yields interacting-particle-based algorithms whose per-step cost follows from efficient Fast Fourier Transform-based evaluation of the required 1D convolutions. We additionally provide numerical experiments that study the performance of the proposed algorithms and compare convergence behavior and quantization. Finally, we prove some theoretical results including well-posedness of the flows and long-time convergence guarantees for the RRW flow.

#37 Adventures in Demand Analysis Using AI

著者: Philipp Bach, Victor Chernozhukov, Sven Klaassen, Martin Spindler, Jan Teichert-Kluge, Suhas Vijaykumar

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2501.00382

要約:
This paper advances empirical demand analysis by integrating multimodal product representations derived from artificial intelligence (AI). Using a detailed dataset of toy cars on textit{Amazon.com}, we combine text descriptions, images, and tabular covariates to represent each product using transformer-based embedding models. These embeddings capture nuanced attributes, such as quality, branding, and visual characteristics, that traditional methods often struggle to summarize. Moreover, we fine-tune these embeddings for causal inference tasks. We show that the resulting embeddings substantially improve the predictive accuracy of sales ranks and prices and that they lead to more credible causal estimates of price elasticity. Notably, we uncover strong heterogeneity in price elasticity driven by these product-specific features. Our findings illustrate that AI-driven representations can enrich and modernize empirical demand analysis. The insights generated may also prove valuable for applied causal inference more broadly.

#38 Dataset Distillation as Pushforward Optimal Quantization

model extraction

著者: Hong Ye Tan, Emma Slade

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2501.07681

要約:
Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D\textsuperscript{4}M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.

#39 Training in reverse: How iteration order influences convergence and stability in deep learning

著者: Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2502.01557

要約:
Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the composition order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces parameter iterates at each step by reverting the usual forward composition order of batch gradients. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom of modifying the usual iteration composition by reusing creatively previous batches at each optimization step may have important beneficial effects in improving training. Our experiments provide a proof of concept supporting this phenomenon. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.

#40 High-dimensional censored MIDAS logistic regression for corporate survival forecasting

著者: Wei Miao, Jad Beyhum, Jonas Striaukas, Ingrid Van Keilegom

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2502.09740

要約:
This paper addresses the challenge of forecasting corporate distress, a problem marked by three key statistical hurdles: (i) right censoring, (ii) high-dimensional predictors, and (iii) mixed-frequency data. To overcome these complexities, we introduce a novel high-dimensional censored MIDAS (Mixed Data Sampling) logistic regression. Our approach handles censoring through inverse probability weighting and achieves accurate estimation with numerous mixed-frequency predictors by employing a sparse-group penalty. We establish finite-sample bounds for the estimation error, accounting for censoring, MIDAS approximation error, and heavy tails. For statistical inference, we develop a de-sparsified version of the proposed penalized estimator and establish its asymptotic theory, which enables valid statistical inference in high-dimensional settings with censoring. We show that censoring induces a nonstandard variance structure for the de-sparsified estimator, a feature that, to the best of our knowledge, has not been studied in the existing literature. The superior performance of the method is demonstrated through Monte Carlo simulations. Finally, we present an extensive application of our methodology to predict the financial distress of Chinese-listed firms and to identify covariates that are statistically significant for predicting distress. Our novel procedure is implemented in the R package \texttt{Survivalml}.

#41 Generative modelling with jump-diffusions

diffusion

著者: Adrian Baule

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2503.06558

要約:
Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with super-imposed Poisson jumps representing a finite activity Levy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution and can be estimated by minimizing a simple MSE loss as in conventional Gaussian models. Both probability flow ODE and SDE formulations are derived using basic technical effort. A detailed implementation for a pure jump process with Laplace distributed amplitudes yields a generalized score function in closed analytical form and is shown to outperform the equivalent Gaussian model in specific parameter regimes.

#42 Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints

著者: Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2504.15243

要約:
Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a {\bf hinge-based penalty}, which permits the use of a constant penalty parameter, enabling us to achieve a {\bf state-of-the-art complexity} for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.

#43 A Kolmogorov-Arnold Neural Model for Cascading Extremes

著者: Miguel de Carvalho, Clemente Ferrer, Ronny Vallejos

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.13370

要約:
This paper addresses the growing concern of cascading extreme events, such as an extreme earthquake followed by a tsunami, by presenting a novel method for risk assessment focused on these domino effects. The proposed approach develops an extreme value theory framework within a Kolmogorov-Arnold network (KAN) to estimate the probability of one extreme event triggering another, conditionally on a feature vector. An extra layer is added to the KAN architecture to ensure that the parameter of interest lies within the unit interval, and we refer to the resulting neural model as KANE (KAN with Natural Enforcement). The proposed method is backed by exhaustive numerical studies and further illustrated with real-world applications to seismology and climatology.

#44 Position: Epistemic uncertainty estimation methods are fundamentally incomplete

著者: Sebasti\'an Jim\'enez, Mira J\"urgens, Willem Waegeman

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.23506

要約:
Identifying and disentangling sources of predictive uncertainty is essential for trustworthy supervised learning. We argue that widely used second-order methods that disentangle aleatoric and epistemic uncertainty are fundamentally incomplete. First, we show that unaccounted bias contaminates uncertainty estimates by overestimating aleatoric (data-related) uncertainty and underestimating the epistemic (model-related) counterpart, leading to incorrect uncertainty quantification. Second, we demonstrate that existing methods capture only partial contributions to the variance-driven part of epistemic uncertainty; different approaches account for different variance sources, yielding estimates that are incomplete and difficult to interpret. Together, these results highlight that current epistemic uncertainty estimates can only be used in safety-critical and high-stakes decision-making when limitations are fully understood by end users and acknowledged by AI developers.

#45 Learning a distance measure from the information-estimation geometry of data

著者: Guy Ohayon, Pierre-Etienne H. Fiquet, Florentin Guth, Jona Ball\'e, Eero P. Simoncelli

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.02514

要約:
We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global distance metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.

#46 A Unified Framework for Lifted Training and Inversion Approaches

著者: Xiaoyu Wang, Alexandra Valavanis, Azhir Mahmood, Andreas Mang, Martin Benning, Audrey Repetti

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.09796

要約:
The training of deep neural networks predominantly relies on a combination of gradient-based optimisation and back-propagation for the computation of the gradient. While incredibly successful, this approach faces challenges such as vanishing or exploding gradients, difficulties with non-smooth activations, and an inherently sequential structure that limits parallelisation. Lifted training methods offer an alternative by reformulating the nested optimisation problem into a higher-dimensional, constrained optimisation problem where the constraints are no longer enforced directly but penalised with penalty terms. This chapter introduces a unified framework that encapsulates various lifted training strategies, including the Method of Auxiliary Coordinates, Fenchel Lifted Networks, and Lifted Bregman Training, and demonstrates how diverse architectures, such as Multi-Layer Perceptrons, Residual Neural Networks, and Proximal Neural Networks fit within this structure. By leveraging tools from convex optimisation, particularly Bregman distances, the framework facilitates distributed optimisation, accommodates non-differentiable proximal activations, and can improve the conditioning of the training landscape. We discuss the implementation of these methods using block-coordinate descent strategies, including deterministic implementations enhanced by accelerated and adaptive optimisation techniques, as well as implicit stochastic gradient methods. Furthermore, we explore the application of this framework to inverse problems, detailing methodologies for both the training of specialised networks (e.g., unrolled architectures) and the stable inversion of pre-trained networks. Numerical results on standard imaging tasks validate the effectiveness and stability of the lifted Bregman approach compared to conventional training, particularly for architectures employing proximal activations.

#47 On topological descriptors for graph products

著者: Mattie Ji, Amauri H. Souza, Vikas Garg

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.08846

要約:
Topological descriptors have been increasingly utilized for capturing multiscale structural information in relational data. In this work, we consider various filtrations on the (box) product of graphs and the effect on their outputs on the topological descriptors - the Euler characteristic (EC) and persistent homology (PH). In particular, we establish a complete characterization of the expressive power of EC on general color-based filtrations. We also show that the PH descriptors of (virtual) graph products contain strictly more information than the computation on individual graphs, whereas EC does not. Additionally, we provide algorithms to compute the PH diagrams of the product of vertex- and edge-level filtrations on the graph product. We also substantiate our theoretical analysis with empirical investigations on runtime analysis, expressivity, and graph classification performance. Overall, this work paves way for powerful graph persistent descriptors via product filtrations. Code is available at https://github.com/Aalto-QuML/tda_graph_product.

#48 Path Signatures Enable Model-Free Mapping of RNA Modifications

著者: Maud Lemercier, Paola Arrubarrena, Salvatore Di Giorgio, Julia Brettschneider, Thomas Cass, Valerie Griesche, Isabel S. Naarmann-de Vries, Anastasia Papavasiliou, Alessia Ruggieri, Irem Tellioglu, Chia Ching Wu, F. Nina Papavasiliou, Terry Lyons

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.08855

要約:
Detecting chemical modifications on RNA molecules remains a key challenge in epitranscriptomics. Traditional reverse transcription-based sequencing methods introduce enzyme- and sequence-dependent biases and fragment RNA molecules, confounding the accurate mapping of modifications across the transcriptome. Nanopore direct RNA sequencing offers a powerful alternative by preserving native RNA molecules, enabling the detection of modifications at single-molecule resolution. However, current computational tools can identify only a limited subset of modification types within well-characterized sequence contexts for which ample training data exists. Here, we introduce a model-free computational method that reframes modification detection as an anomaly detection problem, requiring only canonical (unmodified) RNA reads without any other annotated data. For each nanopore read, our approach extracts robust, modification-sensitive features from the raw ionic current signal at a site using the signature transform, then computes an anomaly score by comparing the resulting feature vector to its nearest neighbors in an unmodified reference dataset. We convert anomaly scores into statistical p-values to enable anomaly detection at both individual read and site levels. Validation on densely-modified \textit{E. coli} rRNA demonstrates that our approach detects known sites harboring diverse modification types, without prior training on these modifications. We further applyied this framework to dengue virus (DENV) transcripts and mammalian mRNAs. For DENV sfRNA, it led to revealing a novel 2'-O-methylated site, which we validate orthogonally by qRT-PCR assays. These results demonstrate that our model-free approach operates robustly across different types of RNAs and datasets generated with different nanopore sequencing chemistries.

#49 Trust Region Masking for Long-Horizon LLM Reinforcement Learning

著者: Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2512.23075

要約:
Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences, such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$), leading to approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two new bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. We further derive an Adaptive bound that strictly generalizes the Pinsker-Marginal bound by combining an importance-ratio decomposition of the error with an adaptive per-position application of Pinsker's inequality on the future trajectory divergence; the minimum over all three bounds is tighter than any individual bound. Crucially, all bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$, the maximum token-level KL divergence across the sequence. As a \emph{sequence-level} term, the divergence cannot be controlled by previous token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM enables the first non-vacuous monotonic improvement guarantees and demonstrates empirical training stability for long-horizon LLM-RL.

#50 Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail

著者: Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2512.23087

要約:
Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1-p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a dynamically determined ''safe'' vocabulary that excludes the extreme tail. This strategy trades large, destabilizing numerical errors for a small, bounded optimization bias. We validate DVP empirically by demonstrating stable training, and theoretically by deriving strict bounds on the induced bias.

#51 Multi-fidelity graph-based neural networks architectures to learn Navier-Stokes solutions on non-parametrized 2D domains

著者: Francesco Songia (SIMBIOTX), Raoul Sall\'e de Chou (SIMBIOTX), Hugues Talbot (OPIS, CVN), Irene Vignon-Clementel (SIMBIOTX)

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.02157

要約:
We propose a graph-based, multi-fidelity learning framework for the prediction of stationary Navier--Stokes solutions in non-parametrized two-dimensional geometries. The method is designed to guide the learning process through successive approximations, starting from reduced-order and full Stokes models, and progressively approaching the Navier--Stokes solution. To effectively capture both local and long-range dependencies in the velocity and pressure fields, we combine graph neural networks with Transformer and Mamba architectures. While Transformers achieve the highest accuracy, we show that Mamba can be successfully adapted to graph-structured data through an unsupervised node-ordering strategy. The Mamba approach significantly reduces computational cost while maintaining performance. Physical knowledge is embedded directly into the architecture through an encoding-processing-physics informed decoding pipeline. Derivatives are computed through algebraic operators constructed via the Weighted Least Squares method. The flexibility of these operators allows us not only to make the output obey the governing equations, but also to constrain selected hidden features to satisfy mass conservation. We introduce additional physical biases through an enriched graph convolution with the same differential operators describing the PDEs. Overall, we successfully guide the learning process by physical knowledge and fluid dynamics insights, leading to more regular and accurate predictions

#52 To Grok Grokking: Provable Grokking in Ridge Regression

著者: Mingyue Xu, Gal Vardi, Itay Safran

公開日: Mon, 09 Feb 2026 00:00:00 -0500

リンク: https://arxiv.org/abs/2601.19791

要約:
We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.

stat.ML updates on arXiv.org

📋 論文タイトル一覧