arXiv paper list - stat.ML updates on arXiv.org

#1 ROOFS: RObust biOmarker Feature Selection

Authors: Anastasiia Bakhmach, Paul Dufossé, Andrea Vaglio, Florence Monville, Laurent Greillier, Fabrice Barlési, Sébastien Benzekry

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05151

Abstract:
Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user's data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.
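
The filter singled out in this abstract takes the union of features that pass a Benjamini-Hochberg (BH) adjusted threshold under two different tests. The sketch below is only an illustration of that idea and is not the roofs API: the function names `bh_adjust` and `union_filter`, and the default `alpha=0.05`, are assumptions made here.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def union_filter(p_ttest, p_logreg, alpha=0.05):
    """Indices of features significant under either adjusted test (hypothetical helper)."""
    adj_t = bh_adjust(p_ttest)
    adj_l = bh_adjust(p_logreg)
    return [j for j in range(len(p_ttest)) if adj_t[j] <= alpha or adj_l[j] <= alpha]
```

A feature is kept when either test flags it after adjustment, so the union is more permissive than either test alone.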

#2 CAOS: Conformal Aggregation of One-Shot Predictors

Authors: Maja Waldron

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05219

Abstract:
One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
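
For context, the split conformal baseline that this abstract compares against can be sketched as follows. This is the standard textbook construction, not CAOS itself; the helper names and the convention that a lower nonconformity score means a better fit are assumptions made here.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: only the trivial (full) set is valid.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate label whose score does not exceed the threshold."""
    return [label for label, s in candidate_scores.items() if s <= threshold]
```

With a single calibration score and `alpha=0.1` the threshold is infinite, which illustrates why plain data splitting is wasteful when labeled examples are this scarce.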

#3 Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

Authors: James Rice

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05227

Abstract:
I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.

#4 Aligned explanations in neural networks

Authors: Corentin Lobet, Francesca Chiaromonte

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04378

Abstract:
Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model's prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.

#5 Learning Multinomial Logits in $O(n \log n)$ time

Authors: Flavio Chierichetti, Mirko Giacchini, Ravi Kumar, Silvio Lattanzi, Alessandro Panconesi, Erasmo Tani, Andrew Tomkins

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04423

Abstract:
A Multinomial Logit (MNL) model is composed of a finite universe of items $[n]=\{1,..., n\}$, each assigned a positive weight. A query specifies an admissible subset -- called a slate -- and the model chooses one item from that slate with probability proportional to its weight. This query model is also known as the Plackett-Luce model or conditional sampling oracle in the literature. Although MNLs have been studied extensively, a basic computational question remains open: given query access to slates, how efficiently can we learn weights so that, for every slate, the induced choice distribution is within total variation distance $\varepsilon$ of the ground truth? This question is central to MNL learning and has direct implications for modern recommender system interfaces. We provide two algorithms for this task, one with adaptive queries and one with non-adaptive queries. Each algorithm outputs an MNL $M'$ that induces, for each slate $S$, a distribution $M'_S$ on $S$ that is within $\varepsilon$ total variation distance of the true distribution. Our adaptive algorithm makes $O\left(\frac{n}{\varepsilon^{3}}\log n\right)$ queries, while our non-adaptive algorithm makes $O\left(\frac{n^{2}}{\varepsilon^{3}}\log n \log\frac{n}{\varepsilon}\right)$ queries. Both algorithms query only slates of size two and run in time proportional to their query complexity. We complement these upper bounds with lower bounds of $\Omega\left(\frac{n}{\varepsilon^{2}}\log n\right)$ for adaptive queries and $\Omega\left(\frac{n^{2}}{\varepsilon^{2}}\log n\right)$ for non-adaptive queries, thus proving that our adaptive algorithm is optimal in its dependence on the support size $n$, while the non-adaptive one is tight within a $\log n$ factor.
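
The query model and the error criterion described in this abstract can be made concrete with a small sketch. It shows only the definitions (choice probabilities on a slate and total variation distance), not the paper's learning algorithms; the function names and the dictionary representation of weights are assumptions made here.

```python
def slate_distribution(weights, slate):
    """MNL choice probabilities on a slate: proportional to item weights."""
    total = sum(weights[i] for i in slate)
    return {i: weights[i] / total for i in slate}


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[i] - q[i]) for i in p)


# Hypothetical ground truth and a learned model with rescaled weights.
true_weights = {1: 1.0, 2: 3.0, 3: 4.0}
learned_weights = {1: 2.0, 2: 6.0, 3: 8.0}

pair = [1, 2]  # slates of size two, as in the abstract
p_true = slate_distribution(true_weights, pair)
p_learned = slate_distribution(learned_weights, pair)
```

Because only weight ratios matter, rescaling every weight by the same factor leaves every slate distribution unchanged, so `total_variation(p_true, p_learned)` is zero here.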

#6 Convergence Rates for Learning Pseudo-Differential Operators

Authors: Jiaheng Chen, Daniel Sanz-Alonso

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04473

Abstract:
This paper establishes convergence rates for learning elliptic pseudo-differential operators, a fundamental operator class in partial differential equations and mathematical physics. In a wavelet-Galerkin framework, we formulate learning over this class as a structured infinite-dimensional regression problem with multiscale sparsity. Building on this structure, we propose a sparse, data- and computation-efficient estimator, which leverages a novel matrix compression scheme tailored to the learning task and a nested-support strategy to balance approximation and estimation errors. In addition to obtaining convergence rates for the estimator, we show that the learned operator induces an efficient and stable Galerkin solver whose numerical error matches its statistical accuracy. Our results therefore contribute to bringing together operator learning, data-driven solvers, and wavelet methods in scientific computing.

#7 A Generalized Adaptive Joint Learning Framework for High-Dimensional Time-Varying Models

Authors: Baolin Chen, Mengfei Ran

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04499

Abstract:
In modern biomedical and econometric studies, longitudinal processes are often characterized by complex time-varying associations and abrupt regime shifts that are shared across correlated outcomes. Standard functional data analysis (FDA) methods, which prioritize smoothness, often fail to capture these dynamic structural features, particularly in high-dimensional settings. This article introduces Adaptive Joint Learning (AJL), a regularization framework designed to simultaneously perform functional variable selection and structural changepoint detection in multivariate time-varying coefficient models. We propose a convex optimization procedure that synergizes adaptive group-wise penalization with fused regularization, effectively borrowing strength across multiple outcomes to enhance estimation efficiency. We provide a rigorous theoretical analysis of the estimator in the ultra-high-dimensional regime (p >> n), establishing non-asymptotic error bounds and proving that AJL achieves the oracle property--performing as well as if the true active set and changepoint locations were known a priori. A key theoretical contribution is the explicit handling of approximation bias via undersmoothing conditions to ensure valid asymptotic inference. The proposed method is validated through comprehensive simulations and an application to Primary Biliary Cirrhosis (PBC) data. The analysis uncovers synchronized phase transitions in disease progression and identifies a parsimonious set of time-varying prognostic markers.

#8 Forecasting the U.S. Treasury Yield Curve: A Distributionally Robust Machine Learning Approach

Authors: Jinjun Liu, Ming-Yen Cheng

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04608

Abstract:
We study U.S. Treasury yield curve forecasting under distributional uncertainty and recast forecasting as an operations research and managerial decision problem. Rather than minimizing average forecast error, the forecaster selects a decision rule that minimizes worst case expected loss over an ambiguity set of forecast error distributions. To this end, we propose a distributionally robust ensemble forecasting framework that integrates parametric factor models with high dimensional nonparametric machine learning models through adaptive forecast combinations. The framework consists of three machine learning components. First, a rolling window Factor Augmented Dynamic Nelson Siegel model captures level, slope, and curvature dynamics using principal components extracted from economic indicators. Second, Random Forest models capture nonlinear interactions among macro financial drivers and lagged Treasury yields. Third, distributionally robust forecast combination schemes aggregate heterogeneous forecasts under moment uncertainty, penalizing downside tail risk via expected shortfall and stabilizing second moment estimation through ridge regularized covariance matrices. The severity of the worst case criterion is adjustable, allowing the forecaster to regulate the trade off between robustness and statistical efficiency. Using monthly data, we evaluate out of sample forecasts across maturities and horizons from one to twelve months ahead. Adaptive combinations deliver superior performance at short horizons, while Random Forest forecasts dominate at longer horizons. Extensions to global sovereign bond yields confirm the stability and generalizability of the proposed framework.

#9 Estimating Causal Effects in Gaussian Linear SCMs with Finite Data

Authors: Aurghya Maiti, Prateek Jain

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.04673

Abstract:
Estimating causal effects from observational data remains a fundamental challenge in causal inference, especially in the presence of latent confounders. This paper focuses on estimating causal effects in Gaussian Linear Structural Causal Models (GL-SCMs), which are widely used due to their analytical tractability. However, parameter estimation in GL-SCMs is often infeasible with finite data, primarily due to overparameterization. To address this, we introduce the class of Centralized Gaussian Linear SCMs (CGL-SCMs), a simplified yet expressive subclass where exogenous variables follow standardized distributions. We show that CGL-SCMs are equally expressive in terms of causal effect identifiability from observational distributions and present a novel EM-based estimation algorithm that can learn CGL-SCM parameters and estimate identifiable causal effects from finite observational samples. Our theoretical analysis is validated through experiments on synthetic data and benchmark causal graphs, demonstrating that the learned models accurately recover causal distributions.

#10 DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Authors: Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05052

Abstract:
Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.

#11 Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms

Authors: Alkis Kalavasis, Pravesh K. Kothari, Shuchen Li, Manolis Zampetakis

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05157

Abstract:
In this work, we give a ${\rm poly}(d,k)$ time and sample algorithm for efficiently learning the parameters of a mixture of $k$ spherical distributions in $d$ dimensions. Unlike all previous methods, our techniques apply to heavy-tailed distributions and include examples that do not even have finite covariances. Our method succeeds whenever the cluster distributions have a characteristic function with sufficiently heavy tails. Such distributions include the Laplace distribution but crucially exclude Gaussians. All previous methods for learning mixture models relied implicitly or explicitly on the low-degree moments. Even for the case of Laplace distributions, we prove that any such algorithm must use super-polynomially many samples. Our method thus adds to the short list of techniques that bypass the limitations of the method of moments. Somewhat surprisingly, our algorithm does not require any minimum separation between the cluster means. This is in stark contrast to spherical Gaussian mixtures where a minimum $\ell_2$-separation is provably necessary even information-theoretically [Regev and Vijayaraghavan '17]. Our methods compose well with existing techniques and allow obtaining "best of both worlds" guarantees for mixtures where every component either has a heavy-tailed characteristic function or has a sub-Gaussian tail with a light-tailed characteristic function. Our algorithm is based on a new approach to learning mixture models via efficient high-dimensional sparse Fourier transforms. We believe that this method will find more applications to statistical estimation. As an example, we give an algorithm for consistent robust mean estimation against noise-oblivious adversaries, a model practically motivated by the literature on multiple hypothesis testing. It was formally proposed in a recent Master's thesis by one of the authors, and has already inspired follow-up works.

#12 Optimal Lower Bounds for Online Multicalibration

Authors: Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.05245

Abstract:
We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration. In the general setting where group functions can depend on both context and the learner's predictions, we prove an $\Omega(T^{2/3})$ lower bound on expected multicalibration error using just three disjoint binary groups. This matches the upper bounds of Noarov et al. (2025) up to logarithmic factors and exceeds the $O(T^{2/3-\varepsilon})$ upper bound for marginal calibration (Dagan et al., 2025), thereby separating the two problems. We then turn to lower bounds for the more difficult case of group functions that may depend on context but not on the learner's predictions. In this case, we establish an $\widetilde{\Omega}(T^{2/3})$ lower bound for online multicalibration via a $\Theta(T)$-sized group family constructed using orthogonal function systems, again matching upper bounds up to logarithmic factors.

#13 Centroid Decision Forest

Authors: Amjad Ali, Saeed Aldahmani, Hailiang Du, Zardad Khan

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2503.19306

Abstract:
This paper introduces the centroid decision forest (CDF), a novel ensemble learning framework that redefines the splitting strategy and tree building in ordinary decision trees for high-dimensional classification. The splitting approach in CDF differs from traditional decision trees in that the class separability score (CSS) determines the selection of the most discriminative features at each node to construct centroids of the partitions (daughter nodes). The splitting criterion uses the Euclidean distance measurements from each class centroid to achieve a splitting mechanism that is more flexible and robust. Centroids are constructed by computing the mean feature values of the selected features for each class, ensuring a class-representative division of the feature space. This centroid-driven approach enables CDF to capture complex class structures while maintaining interpretability and scalability. To evaluate CDF, 23 high-dimensional datasets are used to assess its performance against different state-of-the-art classifiers through classification accuracy and Cohen's kappa statistic. The experimental results show that CDF outperforms the conventional methods, establishing its effectiveness and flexibility for high-dimensional classification problems.
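
The centroid-based split described in this abstract can be sketched roughly as follows. The sketch assumes the discriminative features have already been chosen (the class separability score itself is not reproduced here), and the helper names are hypothetical rather than taken from the paper's code.

```python
import math


def class_centroids(X, y, features):
    """Mean of the selected features for each class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def assign_partition(row, centroids, features):
    """Send a sample to the daughter node of its nearest class centroid."""
    point = [row[f] for f in features]
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))
```

Each node therefore yields one partition per class centroid, using Euclidean distance on the selected features only.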

#14 Structured Matching via Cost-Regularized Unbalanced Optimal Transport

Authors: Emanuele Pardini, Katerina Papagiannouli

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2511.19075

Abstract:
Unbalanced optimal transport (UOT) provides a flexible way to match or compare nonnegative finite Radon measures. However, UOT requires a predefined ground transport cost, which may misrepresent the data's underlying geometry. Choosing such a cost is particularly challenging when datasets live in heterogeneous spaces, often motivating practitioners to adopt Gromov-Wasserstein formulations. To address this challenge, we introduce cost-regularized unbalanced optimal transport (CR-UOT), a framework that allows the ground cost to vary while allowing mass creation and removal. We show that CR-UOT incorporates unbalanced Gromov-Wasserstein type problems through families of inner-product costs parameterized by linear transformations, enabling the matching of measures or point clouds across Euclidean spaces. We develop algorithms for such CR-UOT problems using entropic regularization and demonstrate that this approach improves the alignment of heterogeneous single-cell omics profiles, especially when many cells lack direct matches.

#15 High-Dimensional Change Point Detection using Graph Spanning Ratio

Authors: Yang-Wen Sun, Katerina Papagiannouli, Vladimir Spokoiny

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.07541

Abstract:
Inspired by graph-based methodologies, we introduce a novel graph-spanning algorithm designed to identify changes in both offline and online data across low to high dimensions. This versatile approach is applicable to Euclidean and graph-structured data with unknown distributions, while maintaining control over error probabilities. Theoretically, we demonstrate that the algorithm achieves high detection power when the magnitude of the change surpasses the lower bound of the minimax separation rate, which scales on the order of $\sqrt{nd}$. Our method outperforms other techniques in terms of accuracy for both Gaussian and non-Gaussian data. Notably, it maintains strong detection power even with small observation windows, making it particularly effective for online environments where timely and precise change detection is critical.

#16 Avoiding the Price of Adaptivity: Inference in Linear Contextual Bandits via Stability

Authors: Samya Praharaj, Koulik Khamaru

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.20368

Abstract:
Statistical inference in contextual bandits is challenging due to the adaptive, non-i.i.d. nature of the data. A growing body of work shows that classical least-squares inference can fail under adaptive sampling, and that valid confidence intervals for linear functionals typically require an inflation of order $\sqrt{d \log T}$. This phenomenon -- often termed the price of adaptivity -- reflects the intrinsic difficulty of reliable inference under general contextual bandit policies. A key structural condition that overcomes this limitation is the stability condition of Lai and Wei, which requires the empirical feature covariance to converge to a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals remain asymptotically valid under adaptation, without incurring the $\sqrt{d \log T}$ price of adaptivity. In this paper, we propose and analyze a regularized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai--Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. We additionally provide quantitative rates of convergence in the associated central limit theorem. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. As an application of our theory, we show how it can be used to construct confidence intervals for the conditional average treatment effect (CATE) under adaptively collected data. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.

#17 Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.22473

Abstract:
Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
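
The routing law quoted in this abstract can be checked numerically for a single query position. The sketch below assumes a linear surrogate loss $L = u^\top \sum_j \alpha_j v_j$, chosen here so that the upstream gradient is exactly $u$; the function names and the test values are illustrative assumptions, not the authors' code.

```python
import math


def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]


def loss(s, values, u):
    # Linear surrogate loss L = u . (sum_j alpha_j v_j), so dL/d(output) = u.
    alpha = softmax(s)
    out = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(len(u))]
    return sum(ud * od for ud, od in zip(u, out))


def routing_gradient(s, values, u):
    # dL/ds_j = alpha_j * (b_j - E_alpha[b]), with b_j = u . v_j.
    alpha = softmax(s)
    b = [sum(ud * vd for ud, vd in zip(u, v)) for v in values]
    mean_b = sum(a * bj for a, bj in zip(alpha, b))
    return [a * (bj - mean_b) for a, bj in zip(alpha, b)]
```

Comparing `routing_gradient` with a central finite difference of `loss` confirms the formula; the gradient entries also sum to zero, since the scores only redistribute attention mass.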

#18 Binary Iterative Hard Thresholding Converges with Optimal Number of Measurements for 1-Bit Compressed Sensing

Authors: Namiko Matsumoto, Arya Mazumdar

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2207.03427

Abstract:
Compressed sensing has been a very successful high-dimensional signal acquisition and recovery technique that relies on linear operations. However, the actual measurements of signals have to be quantized before storing or processing. One-bit compressed sensing is a heavily quantized version of compressed sensing, where each linear measurement of a signal is reduced to just one bit: the sign of the measurement. Once enough of such measurements are collected, the recovery problem in 1-bit compressed sensing aims to find the original signal with as much accuracy as possible. The recovery problem is related to the traditional "halfspace-learning" problem in learning theory. For recovery of sparse vectors, a popular reconstruction method from 1-bit measurements is the binary iterative hard thresholding (BIHT) algorithm. The algorithm is a simple projected sub-gradient descent method, and is known to converge well empirically, despite the nonconvexity of the problem. The convergence property of BIHT was not theoretically justified, except with an exorbitantly large number of measurements (i.e., a number of measurements greater than $\max\{k^{10}, 24^{48}, k^{3.5}/\epsilon\}$, where $k$ is the sparsity, $\epsilon$ denotes the approximation error, and even this expression hides other factors). In this paper we show that the BIHT algorithm converges with only $\tilde{O}(\frac{k}{\epsilon})$ measurements. Note that this dependence on $k$ and $\epsilon$ is optimal for any recovery method in 1-bit compressed sensing. With this result, to the best of our knowledge, BIHT is the only practical and efficient (polynomial time) algorithm that requires the optimal number of measurements in all parameters (both $k$ and $\epsilon$). This is also an example of a gradient descent algorithm converging to the correct solution for a nonconvex problem, under suitable structural conditions.
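
A minimal sketch of the BIHT iteration mentioned in this abstract: a sub-gradient step on the sign mismatch, a projection onto $k$-sparse vectors, and a renormalization. The step size `eta`, the zero initialization, the convention `sign(0) = +1`, and the renormalization step are assumptions of this sketch rather than details taken from the paper.

```python
import math


def sign(v):
    return 1.0 if v >= 0 else -1.0


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out


def biht(A, y, k, steps=50, eta=1.0):
    """Recover a unit-norm k-sparse vector from one-bit measurements y = sign(A x)."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        # Mismatch between observed signs and the signs of the current estimate.
        residual = [y[r] - sign(sum(A[r][c] * x[c] for c in range(n))) for r in range(m)]
        grad = [sum(A[r][c] * residual[r] for r in range(m)) for c in range(n)]
        x = hard_threshold([x[c] + eta / m * grad[c] for c in range(n)], k)
        norm = math.sqrt(sum(v * v for v in x))
        if norm > 0:
            x = [v / norm for v in x]  # signs carry no scale information
    return x
```

One-bit measurements discard magnitude, so only the direction of the signal can be recovered; this is why the estimate is renormalized to unit length.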

#19 Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

Authors: Zhen Qin, Zhishuai Liu, Pan Xu

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2310.15976

Abstract:
signSGD is popular in nonconvex optimization due to its communication efficiency. Yet, existing analyses typically assume data are sampled with replacement in each iteration, contradicting a common practical implementation where data are randomly reshuffled and sequentially fed into the algorithm. This gap leaves the theoretical understanding of the more practical algorithm, signSGD with random reshuffling (SignRR), largely unexplored. We develop the first analysis of SignRR to identify the core technical challenge that prevents a thorough convergence analysis of this method. In particular, given a dataset of size $n$ and $T$ epochs, we show that the expected gradient norm of SignRR is upper bounded by $O(\log(nT)/\sqrt{nT} + \sigma)$, where $\sigma$ is the averaged conditional mean square error that may not vanish. To tackle this limitation, we develop two new sign-based algorithms under random reshuffling: SignRVR, which incorporates variance-reduced gradients, and SignRVM, which integrates momentum-based updates. Both algorithms achieve a faster convergence rate of ${O}(\log(nT)/\sqrt{nT} +\log(nT)\sqrt{n}/\sqrt{T})$. We further extend our algorithms to a distributed setting, with a convergence rate of ${O}(\log(n_0T)/\sqrt{n_0T} +\log (n_0T)\sqrt{n_0}/\sqrt{T})$, where $n_0$ is the size of the dataset of a single machine. These results mark the first step towards the theoretical understanding of practical implementation of sign-based optimization algorithms. Finally, we back up our theoretical findings through experiments on simulated and real-world problems, verifying that randomly reshuffled sign methods match or surpass existing baselines.
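
The difference between sampling with replacement and random reshuffling is easiest to see in code. The sketch below is a bare-bones version of sign-based updates with one fresh permutation per epoch; the constant step size, the function signature, and the toy quadratic objective in the usage lines are assumptions made here, and the variance-reduced and momentum variants from the abstract are not shown.

```python
import random


def sign(v):
    return (v > 0) - (v < 0)


def sign_rr(x0, data, grad_fn, step, epochs, seed=0):
    """signSGD with random reshuffling: each sample is used once per epoch."""
    rng = random.Random(seed)
    x = list(x0)
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)  # one fresh permutation per epoch
        for i in order:
            g = grad_fn(x, data[i])
            # Only the sign of each gradient coordinate is used.
            x = [xj - step * sign(gj) for xj, gj in zip(x, g)]
    return x


# Toy usage: f_i(x) = 0.5 * (x - a_i)^2, so the per-sample gradient is x - a_i.
data = [0.0, 1.0, 2.0, 3.0]
result = sign_rr([10.0], data, lambda x, a: [x[0] - a], step=0.1, epochs=50)
```

With a constant step the iterate ends up oscillating inside the range of the data rather than settling at a point, a toy analogue of the non-vanishing term in the SignRR bound.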

#20 Demonstrating the power and flexibility of variational assumptions for amortized neural posterior estimation in environmental applications

Authors: Elliot Maceda, Emily C. Hector, Amanda Lenzi, Brian J. Reich

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2404.10899

Abstract:
Classic Bayesian methods with complex models are frequently infeasible due to an intractable likelihood. Simulation-based inference methods, such as Approximate Bayesian Computing (ABC), calculate posteriors without accessing a likelihood function by leveraging the fact that data can be quickly simulated from the model, but converge slowly and/or poorly in high-dimensional settings. In this paper, we propose a framework for Bayesian posterior estimation by mapping data to posteriors of parameters using a neural network trained on data simulated from the complex model. Posterior distributions of model parameters are efficiently obtained by feeding observed data into the trained neural network. We show theoretically that our posteriors converge to the true posteriors in Kullback-Leibler divergence. Our approach yields computationally efficient and theoretically justified uncertainty quantification, which is lacking in existing simulation-based neural network approaches. Comprehensive simulation studies highlight our method's robustness and accuracy.

#21 What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Authors: Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2406.03707

Abstract:
Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.
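The first setting, i.i.d. data, can be made concrete with a small sketch: under an assumed Beta-Bernoulli model, the optimal next-symbol prediction depends on the sequence only through its sufficient statistics (the number of ones and the length). The function name and the uniform Beta(1, 1) prior are illustrative choices, not taken from the paper.

```python
from fractions import Fraction

def predictive_next_one(seq, alpha=1, beta=1):
    """P(next = 1 | seq) for i.i.d. Bernoulli data under a Beta(alpha, beta) prior."""
    ones = sum(seq)
    return Fraction(alpha + ones, alpha + beta + len(seq))

a = [1, 1, 0, 0, 1]
b = [0, 1, 0, 1, 1]  # same number of ones, different order

print(predictive_next_one(a), predictive_next_one(b))  # both print 4/7
```

Any embedding that retains the pair (count of ones, length) is therefore enough for optimal prediction in this toy case, regardless of token order.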

#22 Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data

Authors: William Cappelletti, Pascal Frossard

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2411.05729

Abstract:
Representing and exploiting multivariate signals requires capturing relations between variables, which we can represent by graphs. Graph dictionaries allow describing complex relational information as a sparse sum of simpler structures, but no prior model exists to infer such underlying structure elements from data. We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in the data distribution as filters on the weighted sum of their Laplacians. We propose a framework to infer the graph dictionary representation from observed node signals, which allows including a priori knowledge about signal properties, and about underlying graphs and their coefficients. We introduce a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem. We show the capability of our method to reconstruct graphs from signals in multiple synthetic settings, where our model outperforms popular baselines. Then, we exploit graph-dictionary representations in an illustrative motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods relying on many more features. Our graph-dictionary model bridges a gap between sparse representations of multivariate data and a structured decomposition of sample-varying relationships into a sparse combination of elementary graph atoms.

#23 Meta-Learning Objectives for Preference Optimization

Authors: Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2411.06568

Abstract:
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperforms existing baselines in an LLM alignment task.

#24 Estimation of partial rankings from sparse, noisy comparisons

Authors: Sebastian Morel-Balbi, Alec Kirkley

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2501.02505

Abstract:
Ranking items based on pairwise comparisons is common, from using match outcomes to rank sports teams to using purchase or survey data to rank consumer products. Statistical inference-based methods such as the Bradley-Terry model, which extract rankings based on an underlying generative model, have emerged as flexible and powerful tools to tackle ranking in empirical data. In situations with limited and/or noisy comparisons, it is often challenging to confidently distinguish the performance of different items based on the evidence available in the data. However, most inference-based ranking methods choose to assign each item to a unique rank or score, suggesting a meaningful distinction when there is none. Here, we develop a principled nonparametric Bayesian method, adaptable to any statistical ranking method, for learning partial rankings (rankings with ties) that distinguishes among the ranks of different items only when there is sufficient evidence available in the data. We develop a fast agglomerative algorithm to perform Maximum A Posteriori (MAP) inference of partial rankings under our framework and examine the performance of our method on a variety of real and synthetic network datasets, finding that it frequently gives a more parsimonious summary of the data than traditional ranking, particularly when observations are sparse.
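As a point of reference, the standard Bradley-Terry model named above can be fitted with the classical minorization-maximization (Zermelo) iteration. The sketch below is that baseline only, not the partial-ranking method the abstract proposes; the win matrix is made-up illustrative data, and the function assumes every item has at least one win and one loss.

```python
import numpy as np

def fit_bradley_terry(wins, iters=200):
    """wins[i, j] = number of times item i beat item j.
    Returns Bradley-Terry strengths normalized to sum to 1."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)  # no self-comparisons
        p = total_wins / denom.sum(axis=1)
        p /= p.sum()
    return p

# Hypothetical comparison counts for three items
wins = [[0, 3, 4],
        [1, 0, 3],
        [0, 1, 0]]
scores = fit_bradley_terry(wins)
print(scores)
```

The fit always returns a distinct score per item, however few comparisons support the ordering, which is the behaviour that motivates allowing ties.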

#25 Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Authors: Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.12225

Abstract:
Best-of-N sampling is a powerful method for improving Large Language Model (LLM) performance, but it is often limited by its dependence on massive, text-based reward models. These models are not only computationally expensive but also data-hungry, requiring extensive labeled datasets for training. This creates a significant data challenge, as they overlook a rich, readily available data source: the LLM's own internal hidden states. To address this data and efficiency gap, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel and lightweight method that learns a reward function directly from the rich information embedded in LLM hidden states. Operating at the token embedding level, SWIFT employs simple linear layers to effectively distinguish between preferred and dispreferred generations, eliminating the need for computationally intensive text-based modeling. Extensive experiments on standard benchmarks show that SWIFT outperforms existing baselines (12.7% higher accuracy than EurusRM-7B on the MATH dataset) while using less than 0.005% of their parameters. Its robust scalability, compatibility with certain closed-source models via logit access, and ability to combine with traditional reward models for additional performance highlight SWIFT's practical value and contribution to more efficient data-driven LLM post-training. Our code is available at https://github.com/aster2024/SWIFT .
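The general idea of best-of-N selection with a linear scorer over hidden states can be sketched as follows. This is not the SWIFT implementation: the unweighted mean over tokens, the weight vector `w`, and the synthetic hidden states are all assumptions made for illustration, and the actual method's aggregation and training procedure are in the linked repository.

```python
import numpy as np

def best_of_n(hidden_states, w, b=0.0):
    """hidden_states: list of (tokens, dim) arrays, one per candidate generation.
    Scores each candidate by the mean per-token linear reward and returns the best index."""
    scores = [float(np.mean(h @ w + b)) for h in hidden_states]
    return int(np.argmax(scores)), scores

# Synthetic stand-ins for per-token hidden states of N = 3 candidates
rng = np.random.default_rng(0)
dim = 8
w = np.ones(dim)  # hypothetical learned reward direction
candidates = [rng.normal(loc=m, scale=0.1, size=(5, dim)) for m in (0.0, 1.0, -1.0)]

best, scores = best_of_n(candidates, w)
print(best, scores)
```

A scorer of this shape has only `dim + 1` parameters, which gives a sense of how a hidden-state reward can be far smaller than a text-based reward model.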

#26 Breaking AR's Sampling Bottleneck: Provable Acceleration via Diffusion Language Models

diffusion

Authors: Gen Li, Changxiao Cai

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2505.21400

Abstract:
Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models allow for parallel sampling, offering a promising path to accelerate generation and eliminate the left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion language models remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. Crucially, our theory covers the regime $T<L$, where $L$ is the text sequence length. This justifies that high-quality samples can be generated with fewer iterations than $L$, thereby breaking the fundamental sampling bottleneck of $L$ steps required by AR models. We further establish matching upper and lower bounds, up to some constant factor, that show the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.

#27 Blockchain-Enabled Privacy-Preserving Second-Order Federated Edge Learning in Personalized Healthcare

privacy

Authors: Anum Nawaz, Muhammad Irfan, Xianjia Yu, Hamad Aldawsari, Rayan Hamza Alsisi, Zhuo Zou, Tomi Westerlund

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.00416

Abstract:
Federated learning (FL) is increasingly recognised for addressing security and privacy concerns in traditional cloud-centric machine learning (ML), particularly within personalised health monitoring such as wearable devices. By enabling global model training through localised policies, FL allows resource-constrained wearables to operate independently. However, conventional first-order FL approaches face several challenges in personalised model training due to the heterogeneous non-independent and identically distributed (non-iid) data arising from each individual's unique physiology and usage patterns. Recent second-order FL approaches maintain the stability and consistency of non-iid datasets while improving personalised model training. This study proposes and develops a verifiable and auditable optimised second-order FL framework BFEL (blockchain enhanced federated edge learning) based on optimised FedCurv for personalised healthcare systems. FedCurv incorporates information about the importance of each parameter to each client's task (through the Fisher information matrix), which helps to preserve client-specific knowledge and reduce model drift during aggregation. Moreover, it minimizes the communication rounds required to achieve a target precision convergence for each client device while effectively managing personalised training on non-iid and heterogeneous data. The incorporation of Ethereum-based model aggregation ensures trust, verifiability, and auditability, while public key encryption enhances privacy and security. Experimental results of federated CNNs and MLPs utilizing MNIST, CIFAR-10, and PathMNIST demonstrate the framework's high efficiency, scalability, suitability for edge deployment on wearables, and significant reduction in communication cost.

#28 When Lower-Order Terms Dominate: Adaptive Expert Algorithms for Heavy-Tailed Losses

Authors: Antoine Moulin, Emmanuel Esposito, Dirk van der Hoeven

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.01722

Abstract:
We consider the problem setting of prediction with expert advice with possibly heavy-tailed losses, i.e. the only assumption on the losses is an upper bound on their second moments, denoted by $\theta$. We develop adaptive algorithms that do not require any prior knowledge about the range or the second moment of the losses. Existing adaptive algorithms have what is typically considered a lower-order term in their regret guarantees. We show that this lower-order term, which is often the maximum of the losses, can actually dominate the regret bound in our setting. Specifically, we show that even with small constant $\theta$, this lower-order term can scale as $\sqrt{KT}$, where $K$ is the number of experts and $T$ is the time horizon. We propose adaptive algorithms with improved regret bounds that avoid the dependence on such a lower-order term and guarantee $\mathcal{O}(\sqrt{\theta T\log(K)})$ regret in the worst case, and $\mathcal{O}(\theta \log(KT)/\Delta_{\min})$ regret when the losses are sampled i.i.d. from some fixed distribution, where $\Delta_{\min}$ is the difference between the mean losses of the second best expert and the best expert. Additionally, when the loss function is the squared loss, our algorithm also guarantees improved regret bounds over prior results.
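For orientation, the classical fixed-rate Hedge (exponential weights) algorithm for prediction with expert advice is sketched below. It is the non-adaptive textbook baseline, not one of the adaptive algorithms proposed in the paper: it assumes losses bounded in $[0, 1]$ and a learning rate tuned with knowledge of $T$ and $K$, which are exactly the kinds of prior knowledge the abstract aims to avoid. The uniform random losses are illustrative.

```python
import numpy as np

def hedge(losses, eta):
    """losses: (T, K) array of expert losses.
    Returns the learner's cumulative expected loss and its regret to the best expert."""
    T, K = losses.shape
    cum = np.zeros(K)      # cumulative loss of each expert so far
    learner = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        p = w / w.sum()
        learner += p @ losses[t]
        cum += losses[t]
    return learner, learner - cum.min()

T, K = 1000, 5
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=(T, K))
eta = np.sqrt(8.0 * np.log(K) / T)  # standard tuning for losses in [0, 1]

total, regret = hedge(losses, eta)
print(total, regret)
```

With this tuning the regret is at most $\sqrt{T \log(K)/2}$ for bounded losses; that guarantee does not carry over to heavy-tailed losses, where only a second-moment bound is available.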

#29 Simulation-based population inference of LISA's Galactic binaries: Bypassing the global fit

Authors: Rahul Srinivasan, Enrico Barausse, Natalia Korsakova, Roberto Trotta

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2506.22543

Abstract:
The Laser Interferometer Space Antenna (LISA) is expected to detect thousands of individually resolved gravitational wave sources, overlapping in time and frequency, on top of unresolved astrophysical and/or primordial backgrounds. Disentangling resolved sources from backgrounds and extracting their parameters in a computationally intensive "global fit" is normally regarded as a necessary step toward reconstructing the properties of the underlying astrophysical populations. Here, we show that it is in principle feasible to infer the population properties of the most numerous of LISA sources -- Galactic double white dwarfs -- directly from the frequency (or, equivalently, time) strain series by adopting a simulation-based approach, without extracting and estimating the parameters of each single source. By training a normalizing flow on a custom-designed compression of simulated LISA frequency series from the Galactic double white dwarf population, we demonstrate how to infer the posterior distribution of population parameters (e.g., mass function, frequency, and spatial distributions). This allows for extracting information on the population parameters from both resolved and unresolved sources simultaneously and in a computationally efficient manner. This approach can be extended to other source classes (e.g., massive and stellar-mass black holes, extreme mass ratio inspirals) and to scenarios involving non-Gaussian or non-stationary noise (e.g., data gaps), provided that fast and accurate simulations are available.

#30 Gaussian Mixture Model with unknown diagonal covariances via continuous sparse regularization

Authors: Romane Giard (ECL, ICJ, PSPM), Yohann de Castro (ICJ, ECL, PSPM, IUF), Clément Marteau (PSPM, ICJ, UCBL)

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2509.12889

Abstract:
This paper addresses the statistical estimation of Gaussian Mixture Models (GMMs) with unknown diagonal covariances from independent and identically distributed samples. We employ the Beurling-LASSO (BLASSO), a convex optimization framework that promotes sparsity in the space of measures, to simultaneously estimate the number of components and their parameters. Our main contribution extends the BLASSO methodology to multivariate GMMs with component-specific unknown diagonal covariance matrices. This setting is significantly more flexible than previous approaches, which required known and identical covariances. We establish non-asymptotic recovery guarantees with nearly parametric convergence rates for component means, diagonal covariances, and weights, as well as for density prediction. A key theoretical contribution is the identification of an explicit separation condition on mixture components that enables the construction of non-degenerate dual certificates, essential tools for establishing statistical guarantees for the BLASSO. Our analysis leverages the Fisher-Rao geometry of the statistical model and introduces a novel semi-distance adapted to our framework, providing new insights into the interplay between component separation, parameter space geometry, and achievable statistical recovery.

#31 The Bayesian Geometry of Transformer Attention

Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2512.22471

Abstract:
Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing Bayesian wind tunnels -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a frame-precision dissociation predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
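The closed-form posterior available in the HMM state-tracking task can be illustrated with the standard forward (filtering) recursion. The sketch below shows only that analytic reference computation, not the paper's transformer experiments; the two-state transition and emission matrices are made-up values.

```python
import numpy as np

def hmm_filter(pi, A, B, obs):
    """Return P(state_t | obs_1..t) for each t.
    pi: (S,) initial distribution; A[i, j] = P(next = j | current = i);
    B[i, o] = P(observation = o | state = i)."""
    belief = pi * B[:, obs[0]]
    belief = belief / belief.sum()
    out = [belief]
    for o in obs[1:]:
        belief = (belief @ A) * B[:, o]  # predict, then condition on the observation
        belief = belief / belief.sum()
        out.append(belief)
    return np.array(out)

# Hypothetical two-state HMM
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])

posteriors = hmm_filter(pi, A, B, obs=[0, 0, 1])
print(posteriors)
```

Each row is the exact Bayesian belief after one more observation, which is the kind of ground truth a model's in-context predictions can be compared against.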

#32 A Gap Between Decision Trees and Neural Networks

Authors: Akash Kumar

Published: Fri, 09 Jan 2026 00:00:00 -0500

Link: https://arxiv.org/abs/2601.03919

Abstract:
We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{R}\mathrm{TV}$) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator $1_A$ has infinite $\mathrm{R}\mathrm{TV}$. Moreover, two natural split-wise continuous surrogates--piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing--also have infinite $\mathrm{R}\mathrm{TV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{R}\mathrm{TV}$ but with an explicit exponential dependence on $d$. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{R}\mathrm{TV}$ whose fixed threshold $\tau=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{R}\mathrm{TV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy--complexity tradeoff and how threshold selection shifts where training lands along it.

stat.ML updates on arXiv.org

📋 List of paper titles