arXiv論文一覧 - stat.ML updates on arXiv.org

#1 NS-RGS: Newton-Schulz based Riemannian gradient method for orthogonal group synchronization

著者: Haiyang Peng, Deren Han, Xin Chen, Meng Huang

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07372

要約:
Group synchronization is a fundamental task involving the recovery of group elements from pairwise measurements. For orthogonal group synchronization, the most common approach reformulates the problem as a constrained nonconvex optimization and solves it using projection-based methods, such as the generalized power method. However, these methods rely on exact SVD or QR decompositions in each iteration, which are computationally expensive and become a bottleneck for large-scale problems. In this paper, we propose a Newton-Schulz-based Riemannian Gradient Scheme (NS-RGS) for orthogonal group synchronization that significantly reduces computational cost by replacing the SVD or QR step with the Newton-Schulz iteration. This approach leverages efficient matrix multiplications and aligns perfectly with modern GPU/TPU architectures. By employing a refined leave-one-out analysis, we overcome the challenge arising from statistical dependencies, and establish that NS-RGS with spectral initialization achieves linear convergence to the target solution up to near-optimal statistical noise levels. Experiments on synthetic data and real-world global alignment tasks demonstrate that NS-RGS attains accuracy comparable to state-of-the-art methods such as the generalized power method, while achieving nearly a 2$\times$ speedup.

#2 Variational Approximated Restricted Maximum Likelihood Estimation for Spatial Data

著者: Debjoy Thakur

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07635

要約:
This research considers a scalable inference for spatial data modeled through Gaussian intrinsic conditional autoregressive (ICAR) structures. The classical estimation method, restricted maximum likelihood (REML), requires repeated inversion and factorization of large, sparse precision matrices, which makes this computation costly. To sort this problem out, we propose a variational restricted maximum likelihood (VREML) framework that approximates the intractable marginal likelihood using a Gaussian variational distribution. By constructing an evidence lower bound (ELBO) on the restricted likelihood, we derive a computationally efficient coordinate-ascent algorithm for jointly estimating the spatial random effects and variance components. In this article, we theoretically establish the monotone convergence of ELBO and mathematically exhibit that the variational family is exact under Gaussian ICAR settings, which is an indication of nullifying approximation error at the posterior level. We empirically establish the supremacy of our VREML over MLE and INLA.

#3 On the Unique Recovery of Transport Maps and Vector Fields from Finite Measure-Valued Data

著者: Jonah Botvinick-Greenhouse, Yunan Yang

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07671

要約:
We establish guarantees for the unique recovery of vector fields and transport maps from finite measure-valued data, yielding new insights into generative models, data-driven dynamical systems, and PDE inverse problems. In particular, we provide general conditions under which a diffeomorphism can be uniquely identified from its pushforward action on finitely many densities, i.e., when the data $\{(\rho_j,f_\#\rho_j)\}_{j=1}^m$ uniquely determines $f$. As a corollary, we introduce a new metric which compares diffeomorphisms by measuring the discrepancy between finitely many pushforward densities in the space of probability measures. We also prove analogous results in an infinitesimal setting, where derivatives of the densities along a smooth vector field are observed, i.e., when $\{(\rho_j,\text{div} (\rho_j v))\}_{j=1}^m$ uniquely determines $v$. Our analysis makes use of the Whitney and Takens embedding theorems, which provide estimates on the required number of densities $m$, depending only on the intrinsic dimension of the problem. We additionally interpret our results through the lens of Perron--Frobenius and Koopman operators and demonstrate how our techniques lead to new guarantees for the well-posedness of certain PDE inverse problems related to continuity, advection, Fokker--Planck, and advection-diffusion-reaction equations. Finally, we present illustrative numerical experiments demonstrating the unique identification of transport maps from finitely many pushforward densities, and of vector fields from finitely many weighted divergence observations.

#4 The Condition-Number Principle for Prototype Clustering

著者: Romano Li, Jianfei Cao

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07744

要約:
We develop a geometric framework that links objective accuracy to structural recovery in prototype-based clustering. The analysis is algorithm-agnostic and applies to a broad class of admissible loss functions. We define a clustering condition number that compares within-cluster scale to the minimum loss increase required to move a point across a cluster boundary. When this quantity is small, any solution with a small suboptimality gap must also have a small misclassification error relative to a benchmark partition. The framework also clarifies a fundamental trade-off between robustness and sensitivity to cluster imbalance, leading to sharp phase transitions for exact recovery under different objectives. The guarantees are deterministic and non-asymptotic, and they separate the role of algorithmic accuracy from the intrinsic geometric difficulty of the instance. We further show that errors concentrate near cluster boundaries and that sufficiently deep cluster cores are recovered exactly under strengthened local margins. Together, these results provide a geometric principle for interpreting low objective values as reliable evidence of meaningful clustering structure.

#5 Sparse $\epsilon$ insensitive zone bounded asymmetric elastic net support vector machines for pattern classification

著者: Haiyan Du, Hu Yang

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07748

要約:
Existing support vector machines(SVM) models are sensitive to noise and lack sparsity, which limits their performance. To address these issues, we combine the elastic net loss with a robust loss framework to construct a sparse $\varepsilon$-insensitive bounded asymmetric elastic net loss, and integrate it with SVM to build $\varepsilon$ Insensitive Zone Bounded Asymmetric Elastic Net Loss-based SVM($\varepsilon$-BAEN-SVM). $\varepsilon$-BAEN-SVM is both sparse and robust. Sparsity is proven by showing that samples inside the $\varepsilon$-insensitive band are not support vectors. Robustness is theoretically guaranteed because the influence function is bounded. To solve the non-convex optimization problem, we design a half-quadratic algorithm based on clipping dual coordinate descent. It transforms the problem into a series of weighted subproblems, improving computational efficiency via the $\varepsilon$ parameter. Experiments on simulated and real datasets show that $\varepsilon$-BAEN-SVM outperforms traditional and existing robust SVMs. It balances sparsity and robustness well in noisy environments. Statistical tests confirm its superiority. Under the Gaussian kernel, it achieves better accuracy and noise insensitivity, validating its effectiveness and practical value.

#6 Order-Optimal Sequential 1-Bit Mean Estimation in General Tail Regimes

著者: Ivan Lau, Jonathan Scarlett

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07796

要約:
In this paper, we study the problem of mean estimation under strict 1-bit communication constraints. We propose a novel adaptive mean estimator based solely on randomized threshold queries, where each 1-bit outcome indicates whether a given sample exceeds a sequentially chosen threshold. Our estimator is $(\epsilon, \delta)$-PAC for any distribution with a bounded mean $\mu \in [-\lambda, \lambda]$ and a bounded $k$-th central moment $\mathbb{E}[|X-\mu|^k] \le \sigma^k$ for any fixed $k > 1$. Crucially, our sample complexity is order-optimal in all such tail regimes, i.e., for every such $k$ value. For $k \neq 2$, our estimator's sample complexity matches the unquantized minimax lower bounds plus an unavoidable $O(\log(\lambda/\sigma))$ localization cost. For the finite-variance case ($k=2$), our estimator's sample complexity has an extra multiplicative $O(\log(\sigma/\epsilon))$ penalty, and we establish a novel information-theoretic lower bound showing that this penalty is a fundamental limit of 1-bit quantization. We also establish a significant adaptivity gap: for both threshold queries and more general interval queries, the sample complexity of any non-adaptive estimator must scale linearly with the search space parameter $\lambda/\sigma$, rendering it vastly less sample efficient than our adaptive approach. Finally, we present algorithmic variants that (i) handle an unknown sampling budget, (ii) adapt to an unknown scale parameter~$\sigma$ given (possibly loose) bounds, and (iii) require only two stages of adaptivity at the expense of more complicated general 1-bit queries.

#7 Intensity Dot Product Graphs

著者: Giulio Valentino Dalla Riva, Matteo Dalla Riva

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07810

要約:
Latent-position random graph models usually treat the node set as fixed once the sample size is chosen, while graphon-based and random-measure constructions allow more randomness at the cost of weaker geometric interpretability. We introduce \emph{Intensity Dot Product Graphs} (IDPGs), which extend Random Dot Product Graphs by replacing a fixed collection of latent positions with a Poisson point process on a Euclidean latent space. This yields a model with random node populations, RDPG-style dot-product affinities, and a population-level intensity that links continuous latent structure to finite observed graphs. We define the heat map and the desire operator as continuous analogues of the probability matrix, prove a spectral consistency result connecting adjacency singular values to the operator spectrum, compare the construction with graphon and digraphon representations, and show how classical RDPGs arise in a concentrated limit. Because the model is parameterized by an evolving intensity, temporal extensions through partial differential equations arise naturally.

#8 Differentially Private Language Generation and Identification in the Limit

privacy

著者: Anay Mehrotra, Grigoris Velegkas, Xifan Yu, Felix Zhou

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08504

要約:
We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $\Omega(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting where the sample strings are sampled i.i.d. from a distribution (instead of being generated by an adversary). Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.

#9 Massively Parallel Exact Inference for Hawkes Processes

著者: Ahmer Raza, Hudson Smith

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.01342

要約:
Multivariate Hawkes processes are a widely used class of self-exciting point processes, but maximum likelihood estimation naively scales as $O(N^2)$ in the number of events. The canonical linear exponential Hawkes process admits a faster $O(N)$ recurrence, but prior work evaluates this recurrence sequentially, without exploiting parallelization on modern GPUs. We show that the Hawkes process intensity can be expressed as a product of sparse transition matrices admitting a linear-time associative multiply, enabling computation via a parallel prefix scan. This yields a simple yet massively parallelizable algorithm for maximum likelihood estimation of linear exponential Hawkes processes. Our method reduces the computational complexity to approximately $O(N/P)$ with $P$ parallel processors, and naturally yields a batching scheme to maintain constant memory usage, avoiding GPU memory constraints. Importantly, it computes the exact likelihood without any additional assumptions or approximations, preserving the simplicity and interpretability of the model. We demonstrate orders-of-magnitude speedups on simulated and real datasets, scaling to thousands of nodes and tens of millions of events, substantially beyond scales reported in prior work. We provide an open-source PyTorch library implementing our optimizations.

#10 SBBTS: A Unified Schr\"odinger-Bass Framework for Synthetic Financial Time Series

著者: Alexandre Alouadi, Gr\'egoire Loeper, C\'elian Marsala, Othmane Mazhar, Huy\^en Pham

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07159

要約:
We study the problem of generating synthetic time series that reproduce both marginal distributions and temporal dynamics, a central challenge in financial machine learning. Existing approaches typically fail to jointly model drift and stochastic volatility, as diffusion-based methods fix the volatility while martingale transport models ignore drift. We introduce the Schr\"odinger-Bass Bridge for Time Series (SBBTS), a unified framework that extends the Schr\"odinger-Bass formulation to multi-step time series. The method constructs a diffusion process that jointly calibrates drift and volatility and admits a tractable decomposition into conditional transport problems, enabling efficient learning. Numerical experiments on the Heston model demonstrate that SBBTS accurately recovers stochastic volatility and correlation parameters that prior Schr\"odingerBridge methods fail to capture. Applied to S&P 500 data, SBBTS-generated synthetic time series consistently improve downstream forecasting performance when used for data augmentation, yielding higher classification accuracy and Sharpe ratio compared to real-data-only training. These results show that SBBTS provides a practical and effective framework for realistic time series generation and data augmentation in financial applications.

#11 Score Shocks: The Burgers Equation Structure of Diffusion Generative Models

diffusion

著者: Krisanu Sarkar

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07404

要約:
We analyze the score field of a diffusion generative model through a Burgers-type evolution law. For VE diffusion, the heat-evolved data density implies that the score obeys viscous Burgers in one dimension and the corresponding irrotational vector Burgers system in $\R^d$, giving a PDE view of \emph{speciation transitions} as the sharpening of inter-mode interfaces. For any binary decomposition of the noised density into two positive heat solutions, the score separates into a smooth background and a universal $\tanh$ interfacial term determined by the component log-ratio; near a regular binary mode boundary this yields a normal criterion for speciation. In symmetric binary Gaussian mixtures, the criterion recovers the critical diffusion time detected by the midpoint derivative of the score and agrees with the spectral criterion of Biroli, Bonnaire, de~Bortoli, and M\'ezard (2024). After subtracting the background drift, the inter-mode layer has a local Burgers $\tanh$ profile, which becomes global in the symmetric Gaussian case with width $\sigma_\tau^2/a$. We also quantify exponential amplification of score errors across this layer, show that Burgers dynamics preserves irrotationality, and use a change of variables to reduce the VP-SDE to the VE case, yielding a closed-form VP speciation time. Gaussian-mixture formulas are verified to machine precision, and the local theorem is checked numerically on a quartic double-well.

#12 Virtual Dummies: Enabling Scalable FDR-Controlled Variable Selection via Sequential Sampling of Null Features

著者: Taulant Koka, Jasin Machkour, Daniel P. Palomar, Michael Muma

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07464

要約:
High-dimensional variable selection, particularly in genomics, requires error-controlling procedures that scale to millions of predictors. The Terminating-Random Experiments (T-Rex) selector achieves false discovery rate (FDR) control by aggregating results of early terminated random experiments, each combining original predictors with i.i.d. synthetic null variables (dummies). At biobank scales, however, explicit dummy augmentation requires terabytes of memory. We demonstrate that this bottleneck is not fundamental. Formalizing the information flow of forward selection through a filtration, we show that compatible selectors interact with unselected dummies solely through projections onto an adaptively evolving low-dimensional subspace. For rotationally invariant dummy distributions, we derive an adaptive stick-breaking construction sampling these projections from their exact conditional distribution given the selection history, thereby eliminating dummy matrix materialization. We prove a pathwise universality theorem: under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. We instantiate the theory through Virtual Dummy LARS (VD-LARS), reducing memory and runtime by several orders of magnitude while preserving the exact selection law and FDR guarantees of the T-Rex selector. Experiments on realistic genome-wide association study data confirm that VD-T-Rex controls FDR and achieves power at scales where all competing methods either fail or time out.

#13 From Ground Truth to Measurement: A Statistical Framework for Human Labeling

著者: Robert Chew, Stephanie Eckman, Christoph Kern, Frauke Kreuter

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07591

要約:
Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.

#14 Unified Precision-Guaranteed Stopping Rules for Contextual Learning

著者: Mingrui Ding, Qiuhong Zhao, Siyang Gao, Jing Dong

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.07913

要約:
Contextual learning seeks to learn a decision policy that maps an individual's characteristics to an action through data collection. In operations management, such data may come from various sources, and a central question is when data collection can stop while still guaranteeing that the learned policy is sufficiently accurate. We study this question under two precision criteria: a context-wise criterion and an aggregate policy-value criterion. We develop unified stopping rules for contextual learning with unknown sampling variances in both unstructured and structured linear settings. Our approach is based on generalized likelihood ratio (GLR) statistics for pairwise action comparisons. To calibrate the corresponding sequential boundaries, we derive new time-uniform deviation inequalities that directly control the self-normalized GLR evidence and thus avoid the conservativeness caused by decoupling mean and variance uncertainty. Under the Gaussian sampling model, we establish finite-sample precision guarantees for both criteria. Numerical experiments on synthetic instances and two case studies demonstrate that the proposed stopping rules achieve the target precision with substantially fewer samples than benchmark methods. The proposed framework provides a practical way to determine when enough information has been collected in personalized decision problems. It applies across multiple data-collection environments, including historical datasets, simulation models, and real systems, enabling practitioners to reduce unnecessary sampling while maintaining a desired level of decision quality.

#15 The ecosystem of machine learning competitions: Platforms, participants, and their impact on AI development

著者: Ioannis Nasios

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08001

要約:
Machine learning competitions (MLCs) play a pivotal role in advancing artificial intelligence (AI) by fostering innovation, skill development, and practical problem-solving. This study provides a comprehensive analysis of major competition platforms such as Kaggle and Zindi, examining their workflows, evaluation methodologies, and reward structures. It further assesses competition quality, participant expertise, and global reach, with particular attention to demographic trends among top-performing competitors. By exploring the motivations of competition hosts, this paper underscores the significant role of MLCs in shaping AI development, promoting collaboration, and driving impactful technological progress. Furthermore, by combining literature synthesis with platform-level data analysis and practitioner insights a comprehensive understanding of the MLC ecosystem is provided. Moreover, the paper demonstrates that MLCs function at the intersection of academic research and industrial application, fostering the exchange of knowledge, data, and practical methodologies across domains. Their strong ties to open-source communities further promote collaboration, reproducibility, and continuous innovation within the broader ML ecosystem. By shaping research priorities, informing industry standards, and enabling large-scale crowdsourced problem-solving, these competitions play a key role in the ongoing evolution of AI. The study provides insights relevant to researchers, practitioners, and competition organizers, and includes an examination of the future trajectory and sustained influence of MLCs on AI development.

#16 A unifying view of contrastive learning, importance sampling, and bridge sampling for energy-based models

著者: Luca Martino

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08116

要約:
In the last decades, energy-based models (EBMs) have become an important class of probabilistic models in which a component of the likelihood is intractable and therefore cannot be evaluated explicitly. Consequently, parameter estimation in EBMs is challenging for conventional inference methods. In this work, we provide a unified framework that connects noise contrastive estimation (NCE), reverse logistic regression (RLR), multiple importance sampling (MIS), and bridge sampling within the context of EBMs. We further show that these methods are equivalent under specific conditions. This unified perspective clarifies relationships among existing methods and enables the development of new estimators, with the potential to improve statistical and computational efficiency. Furthermore, this study helps elucidate the success of NCE in terms of its flexibility and robustness, while also identifying scenarios in which its performance can be further improved. Hence, rather than being a purely descriptive review, this work offers a unifying perspective and additional methodological contributions. The MATLAB code used in the numerical experiments is also made freely available to support the reproducibility of the results.

#17 A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

著者: Zhen Li (LMO, CELESTE, HEC Paris), Gilles Stoltz (LMO, CELESTE, HEC Paris)

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08149

要約:
We revisit the finite-armed linear bandit model by Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model by a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which suffer in addition from unnecessary complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.

#18 Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

著者: Simon Zhang, Ryan P. DeMilt, Kun Jin, Cathy H. Xia

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08404

要約:
Out-of-distribution (OoD) generalization occurs when representation learning encounters a distribution shift. This occurs frequently in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in the input data, while the concept distribution stays invariant. We propose RIA - Regularization for Invariance with Adversarial training, a new method for OoD generalization under convariate shift. Motivated by an analogy to $Q$-learning, it performs an adversarial exploration for training data environments. These new environments are induced by adversarial label invariant data augmentations that prevent a collapse to an in-distribution trained learner. It works with many existing OoD generalization methods for covariate shift that can be formulated as constrained optimization problems. We develop an alternating gradient descent-ascent algorithm to solve the problem, and perform extensive experiments on OoD graph classification for various kinds of synthetic and natural distribution shifts. We demonstrate that our method can achieve high accuracy compared with OoD baselines.

#19 Synthetic Data for any Differentiable Target

synthetic data

著者: Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08423

要約:
What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

#20 Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

privacy

著者: Jiayuan Ye, Vitaly Feldman, Kunal Talwar

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.08519

要約:
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

#21 A Probabilistic Formulation of Offset Noise in Diffusion Models

diffusion

著者: Takuro Kutsuna

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2412.03134

要約:
Diffusion models have become fundamental tools for modeling data distributions in machine learning. Despite their success, these models face challenges when generating data with extreme brightness values, as evidenced by limitations observed in practical large-scale diffusion models. Offset noise has been proposed as an empirical solution to this issue, yet its theoretical basis remains insufficiently explored. In this paper, we propose a novel diffusion model that naturally incorporates additional noise within a rigorous probabilistic framework. Our approach modifies both the forward and reverse diffusion processes, enabling inputs to be diffused into Gaussian distributions with arbitrary mean structures. We derive a loss function based on the evidence lower bound and show that the resulting objective is structurally analogous to that of offset noise, with time-dependent coefficients. Experiments on controlled synthetic datasets demonstrate that the proposed model mitigates brightness-related limitations and achieves improved performance over conventional methods, particularly in high-dimensional settings.

#22 Rate-optimal Design for Anytime Best Arm Identification

著者: Junpei Komiyama, Kyoungseok Jang, Junya Honda

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2510.23199

要約:
We consider the best arm identification problem, where the goal is to identify the arm with the highest mean reward from a set of $K$ arms under a limited sampling budget. This problem models many practical scenarios such as A/B testing. We consider a class of algorithms for this problem, which is provably minimax optimal up to a constant factor. This idea is a generalization of existing works in fixed-budget best arm identification, which are limited to a particular choice of risk measures. Based on the framework, we propose Almost Tracking, a closed-form algorithm that has a provable guarantee on the popular risk measure $H_1$. Unlike existing algorithms, Almost Tracking does not require the total budget in advance nor does it need to discard a significant part of samples, which gives a practical advantage. Through experiments on synthetic and real-world datasets, we show that our algorithm outperforms existing anytime algorithms as well as fixed-budget algorithms.

#23 Physics-Informed Neural Networks for Joint Source and Parameter Estimation in Advection-Diffusion Equations

diffusion

著者: Brenda Anague, Bamdad Hosseini, Issa Karambal, Jean Medard Ngnotchouye

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2512.07755

要約:
Recent studies have demonstrated the success of deep learning in solving forward and inverse problems in engineering and scientific computing domains, such as physics-informed neural networks (PINNs). Source inversion problems under sparse measurements for parabolic partial differential equations (PDEs) are particularly challenging to solve using PINNs, due to their severe ill-posedness and the multiple unknowns involved including the source function and the PDE parameters. Although the neural tangent kernel (NTK) of PINNs has been widely used in forward problems involving a single neural network, its extension to inverse problems involving multiple neural networks remains less explored. In this work, we propose a weighted adaptive approach based on the NTK of PINNS including multiple separate networks representing the solution, the unknown source, and the PDE parameters. The key idea behind our methodology is to simultaneously solve the joint recovery of the solution, the source function along with the unknown parameters thereby using the underlying partial differential equation as a constraint that couples multiple unknown functional parameters, leading to more efficient use of the limited information in the measurements. We apply our method on the advection-diffusion equation and we present various 2D and 3D numerical experiments using different types of measurements data that reflect practical engineering systems. Our proposed method is successful in estimating the unknown source function, the velocity and diffusion parameters as well as recovering the solution of the equation, while remaining robust to additional noise in the measurements.

#24 Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

著者: Kohei Nishikawa, Koki Shimizu, Hiroki Hashiguchi

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2512.12911

要約:
This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

#25 Flow Matching is Adaptive to Manifold Structures

著者: Shivam Kumar, Yixin Wang, Lizhen Lin

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2602.22486

要約:
Flow matching has emerged as a simulation-free alternative to diffusion-based generative modeling, producing samples by solving an ODE whose time-dependent velocity field is learned along an interpolation between a simple source distribution (e.g., a standard normal) and a target data distribution. Flow-based methods often exhibit greater training stability and have achieved strong empirical performance in high-dimensional settings where data concentrate near a low-dimensional manifold, such as text-to-image synthesis, video generation, and molecular structure generation. Despite this success, existing theoretical analyses of flow matching assume target distributions with smooth, full-dimensional densities, leaving its effectiveness in manifold-supported settings largely unexplained. To this end, we theoretically analyze flow matching with linear interpolation when the target distribution is supported on a smooth manifold. We establish a non-asymptotic convergence guarantee for the learned velocity field, and then propagate this estimation error through the ODE to obtain statistical consistency of the implicit density estimator induced by the flow-matching objective. The resulting convergence rate is near minimax-optimal, depends only on the intrinsic dimension, and reflects the smoothness of both the manifold and the target distribution. Together, these results provide a principled explanation for how flow matching adapts to intrinsic data geometry and circumvents the curse of dimensionality.

#26 tBayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

著者: Amuche Ibenegbu, Pierre Lafaye de Micheaux, Rohitash Chandra

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2603.27142

要約:
Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through "fully conditional specification". We extend MICE using the Bayesian framework (tBayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the tBayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that tBayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA mixed better than RWM across most variables, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the tBayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.

#27 Causal Discovery in Linear Models with Unobserved Variables and Measurement Error

著者: Yuqin Yang, Mohamed Nafea, Negar Kiyavash, Kun Zhang, AmirEmad Ghassami

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2407.19426

要約:
The presence of unobserved common causes and measurement error poses two major obstacles to causal structure learning, since ignoring either source of complexity can induce spurious causal relations among variables of interest. We study causal structure learning in linear systems where both challenges may occur simultaneously. We introduce a causal model called LV-SEM-ME, which contains four types of variables: directly observed variables, variables that are not directly observed but are measured with error, the corresponding measurements, and variables that are neither observed nor measured. Under a separability condition-namely, identifiability of the mixing matrix associated with the exogenous noise terms of the observed variables-together with certain faithfulness assumptions, we characterize the extent of identifiability and the corresponding observational equivalence classes. We provide graphical characterizations of these equivalence classes and develop recovery algorithms that enumerate all models in the equivalence class of the ground truth. We also establish, via a four-node union model that subsumes instrumental variable, front-door, and negative-control-outcome settings, a form of identification robustness: the target effect remains identifiable in the broader LV-SEM-ME model even when the assumptions underlying the specialized identification formulas for the corresponding submodels need not all hold simultaneously.

#28 LipKernel: Lipschitz-Bounded Convolutional Neural Networks via Dissipative Layers

著者: Patricia Pauli, Ruigang Wang, Ian Manchester, Frank Allg\"ower

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2410.22258

要約:
We propose a novel layer-wise parameterization for convolutional neural networks (CNNs) that includes built-in robustness guarantees by enforcing a prescribed Lipschitz bound. Each layer in our parameterization is designed to satisfy a linear matrix inequality (LMI), which in turn implies dissipativity with respect to a specific supply rate. Collectively, these layer-wise LMIs ensure Lipschitz boundedness for the input-output mapping of the neural network, yielding a more expressive parameterization than through spectral bounds or orthogonal layers. Our new method LipKernel directly parameterizes dissipative convolution kernels using a 2-D Roesser-type state space model. This means that the convolutional layers are given in standard form after training and can be evaluated without computational overhead. In numerical experiments, we show that the run-time using our method is orders of magnitude faster than state-of-the-art Lipschitz-bounded networks that parameterize convolutions in the Fourier domain, making our approach particularly attractive for improving the robustness of learning-based real-time perception or control in robotics, autonomous vehicles, or automation systems. We focus on CNNs, and in contrast to previous works, our approach accommodates a wide variety of layers typically used in CNNs, including 1-D and 2-D convolutional layers, maximum and average pooling layers, as well as strided and dilated convolutions and zero padding. However, our approach naturally extends beyond CNNs as we can incorporate any layer that is incrementally dissipative.

#29 Entropic optimal transport beyond product reference couplings: the Gaussian case on Euclidean space

著者: Paul Freulon, Nikitas Georgakis, Victor Panaretos

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2507.01709

要約:
The Optimal Transport (OT) problem with squared Euclidean cost consists in finding a coupling between two input measures that maximizes correlation. Consequently, the optimal coupling is often singular with respect to the Lebesgue measure. Regularizing the OT problem with an entropy term yields an approximation called entropic optimal transport. Entropic penalties steer the induced coupling toward a reference measure with desired properties. For instance, when seeking a diffuse coupling, the most popular reference measures are the Lebesgue measure and the product of the two input measures. In this work, we study the case where the reference coupling is not a product, focussing on the Gaussian case as a core paradigm. We establish a reduction of such a regularised OT problem to a matrix optimization problem, enabling us to provide a complete description of the solution, both in terms of the primal variable and the dual variables. Beyond its intrinsic interest, allowing non-product references is essential in dynamic statistical settings. As a key motivation, we address the reconstruction of trajectory dynamics from finitely many time marginals where, unlike product references, Gaussian process references produce transitions that assemble into a coherent continuous-time process.

#30 Capturing Unseen Spatial Heat Extremes Through Dependence-Aware Generative Modeling

著者: Xinyue Liu, Xiao Peng, Shuyue Yan, Yuntian Chen, Dongxiao Zhang, Zhixiao Niu, Hui-Min Wang, Xiaogang He

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2507.09211

要約:
Observed records of climate extremes provide an incomplete view of risk, missing "unseen" events beyond historical experience. Ignoring spatial dependence further underestimates hazards striking multiple locations simultaneously. We introduce DeepX-GAN (Dependence-Enhanced Embedding for Physical eXtremes - Generative Adversarial Network), a deep generative model that explicitly captures the spatial structure of rare extremes. Its zero-shot generalizability enables simulation of statistically plausible extremes beyond the observed record, validated against long climate model large-ensemble simulations. We define two unseen types: direct-hit extremes that affect the target and near-miss extremes that narrowly miss. These unrealized events reveal hidden risks and can either prompt proactive adaptation or reinforce a sense of false resilience. Applying DeepX-GAN to the Middle East and North Africa shows that unseen heat extremes disproportionately threaten countries with high vulnerability and low socioeconomic readiness. Future warming is projected to expand and shift these extremes, creating persistent hotspots in Northwest Africa and the Arabian Peninsula, and new hotspots in Central Africa, necessitating spatially adaptive risk planning.

#31 LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

著者: Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2510.13907

要約:
Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation to expand the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality--cost trade-offs under constrained comparison budgets.

#32 Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks

著者: Rajit Rajpal, Benedict Leimkuhler, Yuanhao Jiang

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2511.11666

要約:
Bayesian neural networks (BNNs) require scalable sampling algorithms to approximate posterior distributions over parameters. Existing stochastic gradient Markov Chain Monte Carlo (SGMCMC) methods are highly sensitive to the choice of stepsize and adaptive variants such as pSGLD typically fail to sample the correct invariant measure without addition of a costly divergence correction term. In this work, we build on the recently proposed `SamAdams' framework for timestep adaptation (Leimkuhler, Lohmann, and Whalley 2025), introducing an adaptive scheme: SA-SGLD, which employs time rescaling to modulate the stepsize according to a monitored quantity (typically the local gradient norm). SA-SGLD can automatically shrink stepsizes in regions of high curvature and expand them in flatter regions, improving both stability and mixing without introducing bias. We show that our method can achieve more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with BNNs using sharp priors.

#33 We Still Don't Understand High-Dimensional Bayesian Optimization

著者: Colin Doumont, Donney Fan, Natalie Maus, Jacob R. Gardner, Henry Moss, Geoff Pleiss

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2512.00170

要約:
Existing high-dimensional Bayesian optimization (BO) methods aim to overcome the curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behavior, Gaussian processes with linear kernels match state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form sampling and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with >20,000 observations. Coupled with empirical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensions.

#34 Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk

著者: Rohan Tangri, Jan-Peter Calliess

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2601.22993

要約:
We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide worst-case bounds for both policy improvement and constraint violation during the training process.

#35 Computationally lightweight classifiers with frequentist bounds on predictions

著者: Shreeram Murali, Cristian R. Rojas, Dominik Baumann

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2603.22128

要約:
While both classical and neural network classifiers can achieve high accuracy, they fall short on offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale with $\mathcal O (n^{\sim3})$ in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy $>$\SI{96}{\percent} at $\mathcal O(n)$ and $\mathcal O(\log n)$ operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.

#36 A Bayesian Information-Theoretic Approach to Data Attribution

著者: Dharmesh Tailor, Nicol\`o Felicioni, Kamil Ciosek

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.03858

要約:
Training Data Attribution (TDA) seeks to trace model predictions back to influential training examples, enhancing interpretability and safety. We formulate TDA as a Bayesian information-theoretic problem: subsets are scored by the information loss they induce - the entropy increase at a query when removed. This criterion credits examples for resolving predictive uncertainty rather than label noise. To scale to modern networks, we approximate information loss using a Gaussian Process surrogate built from tangent features. We show this aligns with classical influence scores for single-example attribution while promoting diversity for subsets. For even larger-scale retrieval, we relax to an information-gain objective and add a variance correction for scalable attribution in vector databases. Experiments show competitive performance on counterfactual sensitivity, ground-truth retrieval and coreset selection, showing that our method scales to modern architectures while bridging principled measures with practice.

#37 A Generalized Sinkhorn Algorithm for Mean-Field Schr\"odinger Bridge

著者: Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder

公開日: Fri, 10 Apr 2026 00:00:00 -0400

リンク: https://arxiv.org/abs/2604.06531

要約:
The mean-field Schr\"odinger bridge (MFSB) problem concerns designing a minimum-effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schr\"odinger bridge, the dynamical constraint for MFSB is the mean-field limit of a population of interacting agents with controls. It serves as a natural model for large-scale multi-agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf-Cole transform for MFSB and, building on it, design a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.

stat.ML updates on arXiv.org

📋 論文タイトル一覧