arXiv論文一覧 - stat.ML updates on arXiv.org

#1 FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

著者: Jin Cui (State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University), Boran Zhao (School of Software Engineering, Xi'an Jiaotong University), Jiajun Xu (School of Software Engineering, Xi'an Jiaotong University), Jiaqi Guo (School of Mathematical Sciences, Nankai University), Shuo Guan (School of Software Engineering, Xi'an Jiaotong University), Pengju Ren (State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University)

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19476

要約:
Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, MMD, CE) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a "vanishing phase gradient" issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.

#2 Optimization and Regularization Under Arbitrary Objectives

著者: Jared N. Lakhani, Etienne Pienaar

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19628

要約:
This study investigates the limitations of applying Markov Chain Monte Carlo (MCMC) methods to arbitrary objective functions, focusing on a two-block MCMC framework which alternates between Metropolis-Hastings and Gibbs sampling. While such approaches are often considered advantageous for enabling data-driven regularization, we show that their performance critically depends on the sharpness of the employed likelihood form. By introducing a sharpness parameter and exploring alternative likelihood formulations proportional to the target objective function, we demonstrate how likelihood curvature governs both in-sample performance and the degree of regularization inferred by the training data. Empirical applications are conducted on reinforcement learning tasks: including a navigation problem and the game of tic-tac-toe. The study concludes with a separate analysis examining the implications of extreme likelihood sharpness on arbitrary objective functions stemming from the classic game of blackjack, where the first block of the two-block MCMC framework is replaced with an iterative optimization step. The resulting hybrid approach achieves performance nearly identical to the original MCMC framework, indicating that excessive likelihood sharpness effectively collapses posterior mass onto a single dominant mode.

#3 Clustering Approaches for Mixed-Type Data: A Comparative Study

著者: Badih Ghattas, Alvaro Sanchez San-Benito

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19755

要約:
Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study presents the state-of-the-art of these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and latent class model (LCM). The aim is to provide insights into the behavior of different methods across a wide range of scenarios by varying some experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and clusters' distribution. The degree of cluster overlap and the proportion of continuous variables in the dataset and the sample size have a significant impact on the observed performances. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments KAMILA, LCM, and k-prototypes exhibited the best performance, with respect to the adjusted rand index (ARI). All the methods are available in R.

#4 A Fully Probabilistic Tensor Network for Regularized Volterra System Identification

著者: Afra Kilic, Kim Batselier

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20457

要約:
Modeling nonlinear systems with Volterra series is challenging because the number of kernel coefficients grows exponentially with the model order. This work introduces Bayesian Tensor Network Volterra kernel machines (BTN-V), extending the Bayesian Tensor Network framework to Volterra system identification. BTN-V represents Volterra kernels using canonical polyadic decomposition, reducing model complexity from O(I^D) to O(DIR). By treating all tensor components and hyperparameters as random variables, BTN-V provides predictive uncertainty estimation at no additional computational cost. Sparsity-inducing hierarchical priors enable automatic rank determination and the learning of fading-memory behavior directly from data, improving interpretability and preventing overfitting. Empirical results demonstrate competitive accuracy, enhanced uncertainty quantification, and reduced computational cost.

#5 Generative Modeling with Manifold Percolation

著者: Rui Tong

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20503

要約:
Generative modeling is typically framed as learning mapping rules, but from an observer's perspective without access to these rules, the task manifests as disentangling the geometric support from the probability distribution. We propose that Continuum Percolation is uniquely suited for this support analysis, as the sampling process effectively projects high-dimensional density estimation onto a geometric counting problem on the support. In this work, we establish a rigorous isomorphism between the topological phase transitions of Random Geometric Graphs and the underlying data manifold in high-dimensional space. By analyzing the relationship between our proposed Percolation Shift metric and FID, we demonstrate that our metric captures structural pathologies (such as implicit mode collapse) where statistical metrics fail. Finally, we translate this topological phenomenon into a differentiable loss function to guide training. Experimental results confirm that this approach not only prevents manifold shrinkage but drives the model toward a state of "Hyper-Generalization," achieving good fidelity and verified topological expansion.

#6 Spatio-Temporal Hierarchical Causal Models

著者: Xintong Li, Haoran Zhang, Xiao Zhou

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20558

要約:
The abundance of fine-grained spatio-temporal data, such as traffic sensor networks, offers vast opportunities for scientific discovery. However, inferring causal relationships from such observational data remains challenging, particularly due to unobserved confounders that are specific to units (e.g., geographical locations) yet influence outcomes over time. Most existing methods for spatio-temporal causal inference assume that all confounders are observed, an assumption that is often violated in practice. In this paper, we introduce Spatio-Temporal Hierarchical Causal Models (ST-HCMs), a novel graphical framework that extends hierarchical causal modeling to the spatio-temporal domain. At the core of our approach is the Spatio-Temporal Collapse Theorem, which shows that a complex ST-HCM converges to a simpler flat causal model as the amount of subunit data increases. This theoretical result enables a general procedure for causal identification, allowing ST-HCMs to recover causal effects even in the presence of unobserved, time-invariant unit-level confounders, a scenario where standard non-hierarchical models fail. We validate the effectiveness of our framework on both synthetic and real-world datasets, demonstrating its potential for robust causal inference in complex dynamic systems.

#7 Heckman Selection Contaminated Normal Model

著者: Heeju Lim, Jose Alejandro Ordonez, Victor H. Lachos, Antonio Punzo

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2409.12348

要約:
The Heckman selection model is one of the most well-renounced econometric models in the analysis of data with sample selection. This model is designed to rectify sample selection biases based on the assumption of bivariate normal error terms. However, real data diverge from this assumption in the presence of heavy tails and/or atypical observations. Recently, this assumption has been relaxed via a more flexible Student's t-distribution, which has appealing statistical properties. This paper introduces a novel Heckman selection model using a bivariate contaminated normal distribution for the error terms. We present an efficient ECM algorithm for parameter estimation with closed-form expressions at the E-step based on truncated multinormal distribution formulas. The identifiability of the proposed model is also discussed, and its properties have been examined. Through simulation studies, we compare our proposed model with the normal and Student's t counterparts and investigate the finite-sample properties and the variation in missing rate. Results obtained from two real data analyses showcase the usefulness and effectiveness of our model. The proposed algorithms are implemented in the R package HeckmanEM.

#8 The Generalized Proximity Forest

著者: Ben Shaw, Adam Rustad, Sofia Pelagalli Maia, Jake S. Rhodes, Kevin R. Moon

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19487

要約:
Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur. Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier. We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the $k$-nearest neighbors model.

#9 RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression

著者: Chris Kuchar

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19493

要約:
RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz)--all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions addressing the proximity matrix memory bottleneck limiting Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80GB to 6.4MB for 100k samples (12,500x compression with INT8 quantization) while maintaining 99% geometric structure preservation, (2) CPU TriBlock proximity--combining upper-triangle storage with block-sparse thresholding--achieving 2.7x memory reduction with lossless quality, (3) SM-aware GPU batch sizing achieving 95% GPU utilization, and (4) GPU-accelerated 3D MDS visualization computing embeddings directly from low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. GPU achieves 1.4x speedup over CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (requiring GPU QLORA), with CPU TriBlock filling the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. Open-source production-ready classification following Breiman and Cutler's original methodology.

#10 Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

著者: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19504

要約:
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.

#11 Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

著者: Shivam Pal, Sakshi Varshney, Piyush Rai

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19525

要約:
Deep neural networks are prone to learning shortcuts, spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A dominant line of work seeks robustness by learning a robust representation, often explicitly partitioning the latent space into core and spurious components; this approach can be complex, brittle, and difficult to scale. We take a different approach, instead of a robust representation, we learn a robust function. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. Our method operates within a disentangled latent space, which is essential as it isolates spurious and core features into distinct dimensions. This separation enables the identification of candidate shortcut features by their strong correlation with the label, used as a proxy for semantic simplicity. The classifier is then desensitized to these features by injecting targeted, anisotropic latent noise during training. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals. The result is state-of-the-art OOD performance on established shortcut learning benchmarks.

#12 ModHiFi: Identifying High Fidelity predictive components for Model Modification

著者: Dhruva Kashyap, Chaitanya Murti, Pranav K Nayak, Tanay Narshana, Chiranjib Bhattacharyya

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19566

要約:
Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning constrained by this unavailability an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model's predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global reconstruction error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term Subset Fidelity. In the uncorrelated features setting, selecting individual components via their Subset Fidelity scores is optimal, which we use to propose ModHiFi, an algorithm for model modification that requires no training data or loss function access. ModHiFi-P, for structured pruning, achieves an 11% speedup over the current state of the art on ImageNet models and competitive performance on language models. ModHiFi-U, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.

#13 Neural Tractability via Structure: Learning-Augmented Algorithms for Graph Combinatorial Optimization

著者: Jialiang Li, Weitong Chen, Mingyu Guo

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19573

要約:
Neural models have shown promise in solving NP-hard graph combinatorial optimization (CO) problems. Once trained, they offer fast inference and reasonably high-quality solutions for in-distribution testing instances, but they generally fall short in terms of absolute solution quality compared to classical search-based algorithms that are admittedly slower but offer optimality guarantee once search finishes. We propose a novel framework that combines the inference efficiency and exploratory power of neural models with the solution quality guarantee of search-based algorithms. In particular, we use parameterized algorithms (PAs) as the search component. PAs are dedicated to identifying easy instances of generally NP-hard problems, and allow for practically efficient search by exploiting structural simplicity (of the identified easy instances). Under our framework, we use parameterized analysis to identify the structurally hard parts of a CO instance. The neural model handles the hard parts by generating advisory signals based on its data-driven understanding. The PA-based search component then integrates the advisory signals to systematically and efficiently searches through the remaining structurally easy parts. Notably, our framework is agnostic to the choice of neural model and produces strictly better solutions than neural solvers alone. We examine our framework on multiple CO tasks. Empirical results show that it achieves superior solution quality, competitive with that of commercial solvers. Furthermore, by using the neural model only for exploratory advisory signals, our framework exhibits improved out-of-distribution generalization, addressing a key limitation of existing neural CO solvers.

#14 Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

著者: Kaiyi Ji

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19656

要約:
Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $\Omega(\kappa^{3/2}\epsilon^{-2})$ oracle calls to find an $\epsilon$-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least $\Omega(\kappa^{5/2}\epsilon^{-4})$ stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

#15 Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds

diffusion

著者: Jiaxin Shi, Michalis K. Titsias

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19664

要約:
We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood, that provably improves upon the standard evidence lower bound and results in reduced data-model KL-divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model including both continuous Gaussian diffusion and masked (discrete) diffusion models. Then, we showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.

#16 Order Selection in Vector Autoregression by Mean Square Information Criterion

著者: Michael Hellstern, Ali Shojaie

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19761

要約:
Vector autoregressive (VAR) processes are ubiquitously used in economics, finance, and biology. Order selection is an essential step in fitting VAR models. While many order selection methods exist, all come with weaknesses. Order selection by minimizing AIC is a popular approach but is known to consistently overestimate the true order for processes of small dimension. On the other hand, methods based on BIC or the Hannan-Quinn (HQ) criteria are shown to require large sample sizes in order to accurately estimate the order for larger-dimensional processes. We propose the mean square information criterion (MIC) based on the observation that the expected squared error loss is flat once the fitted order reaches or exceeds the true order. MIC is shown to consistently estimate the order of the process under relatively mild conditions. Our simulation results show that MIC offers better performance relative to AIC, BIC, and HQ under misspecification. This advantage is corroborated when forecasting COVID-19 outcomes in New York City. Order selection by MIC is implemented in the micvar R package available on CRAN.

#17 When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

著者: Wenzhang Du

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19794

要約:
Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.

#18 Terminal Velocity Matching

著者: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19797

要約:
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

#19 Latent-space metrics for Complex-Valued VAE out-of-distribution detection under radar clutter

著者: Y. A. Rouzoumka, E. Terreaux, C. Morisseau, J. -P. Ovarlez, C. Ren

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19805

要約:
We investigate complex-valued Variational AutoEncoders (CVAE) for radar Out-Of-Distribution (OOD) detection in complex radar environments. We proposed several detection metrics: the reconstruction error of CVAE (CVAE-MSE), the latent-based scores (Mahalanobis, Kullback-Leibler divergence (KLD)), and compared their performance against the classical ANMF-Tyler detector (ANMF-FP). The performance of all these detectors is analyzed on synthetic and experimental radar data, showing the advantages and the weaknesses of each detector.

#20 Provably Outlier-resistant Semi-parametric Regression for Transferable Calibration of Low-cost Air-quality Sensors

著者: Divyansh Chaurasia, Manoj Daram, Roshan Kumar, Nihal Thukarama Rao, Vipul Sangode, Pranjal Srivastava, Avnish Tripathi, Shoubhik Chakraborty, Akanksha, Ambasht Kumar, Davender Sethi, Sachchida Nand Tripathi, Purushottam Kar

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19810

要約:
We present a case study for the calibration of Low-cost air-quality (LCAQ) CO sensors from one of the largest multi-site-multi-season-multi-sensor-multi-pollutant mobile air-quality monitoring network deployments in India. LCAQ sensors have been shown to play a critical role in the establishment of dense, expansive air-quality monitoring networks and combating elevated pollution levels. The calibration of LCAQ sensors against regulatory-grade monitors is an expensive, laborious and time-consuming process, especially when a large number of sensors are to be deployed in a geographically diverse layout. In this work, we present the RESPIRE technique to calibrate LCAQ sensors to detect ambient CO (Carbon Monoxide) levels. RESPIRE offers specific advantages over baseline calibration methods popular in literature, such as improved prediction in cross-site, cross-season, and cross-sensor settings. RESPIRE offers a training algorithm that is provably resistant to outliers and an explainable model with the ability to flag instances of model overfitting. Empirical results are presented based on data collected during an extensive deployment spanning four sites, two seasons and six sensor packages. RESPIRE code is available at https://github.com/purushottamkar/respire.

#21 Time-Varying Network Driver Estimation (TNDE) Quantifies Stage-Specific Regulatory Effects From Single-Cell Snapshots

著者: Jiaxin Li, Shanjun Mao

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19813

要約:
Identifying key driver genes governing biological processes such as development and disease progression remains a challenge. While existing methods can reconstruct cellular trajectories or infer static gene regulatory networks (GRNs), they often fail to quantify time-resolved regulatory effects within specific temporal windows. Here, we present Time-varying Network Driver Estimation (TNDE), a computational framework quantifying dynamic gene driver effects from single-cell snapshot data under a linear Markov assumption. TNDE leverages a shared graph attention encoder to preserve the local topological structure of the data. Furthermore, by incorporating partial optimal transport, TNDE accounts for unmatched cells arising from proliferation or apoptosis, thereby enabling trajectory alignment in non-equilibrium processes. Benchmarking on simulated datasets demonstrates that TNDE outperforms existing baseline methods across diverse complex regulatory scenarios. Applied to mouse erythropoiesis data, TNDE identifies stage-specific driver genes, the functional relevance of which is corroborated by biological validation. TNDE offers an effective quantitative tool for dissecting dynamic regulatory mechanisms underlying complex biological processes.

#22 Cisco Time Series Model Technical Report

著者: Liang Gou (Ellen), Archit Khare (Ellen), Praneet Pabolu (Ellen), Prachi Patel (Ellen), Joseph Ross (Ellen), Hercy Shen (Ellen), Yuhan (Ellen), Song, Jingze Sun, Kristal Curtis, Vedant Dharnidharka, Abhinav Mathur, Hao Yang

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19841

要約:
We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enabling it to accept multiresolution input, applied to a popular decoder-only time series model (TimesFM). The resulting multiresolution decoder-only model is trained on over 300B unique data points, with more than half coming from the observability domain. Quantitative and qualitative evaluations demonstrate that the resulting model achieves superior performance on observability datasets while retaining very similar performance on a standard general-purpose forecasting benchmark (GIFT-Eval), and suggest that the multiresolution structure enables the model to make more accurate predictions on long context input.

#23 SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions

著者: Chaogui Kang, Lijian Luo, Qingfeng Guan, Yu Liu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19845

要約:
Decision trees remain central for tabular prediction but struggle with (i) capturing spatial dependence and (ii) producing locally stable (robust) explanations. We present SX-GeoTree, a self-explaining geospatial regression tree that integrates three coupled objectives during recursive splitting: impurity reduction (MSE), spatial residual control (global Moran's I), and explanation robustness via modularity maximization on a consensus similarity network formed from (a) geographically weighted regression (GWR) coefficient distances (stimulus-response similarity) and (b) SHAP attribution distances (explanatory similarity). We recast local Lipschitz continuity of feature attributions as a network community preservation problem, enabling scalable enforcement of spatially coherent explanations without per-sample neighborhood searches. Experiments on two exemplar tasks (county-level GDP in Fujian, n=83; point-wise housing prices in Seattle, n=21,613) show SX-GeoTree maintains competitive predictive accuracy (within 0.01 $R^{2}$ of decision trees) while improving residual spatial evenness and doubling attribution consensus (modularity: Fujian 0.19 vs 0.09; Seattle 0.10 vs 0.05). Ablation confirms Moran's I and modularity terms are complementary; removing either degrades both spatial residual structure and explanation stability. The framework demonstrates how spatial similarity - extended beyond geometric proximity through GWR-derived local relationships - can be embedded in interpretable models, advancing trustworthy geospatial machine learning and offering a transferable template for domain-aware explainability.

#24 Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization

著者: Peng Zhao, Yu-Hu Yan, Hang Yu, Zhi-Hua Zhou

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19937

要約:
Universal online learning aims to achieve optimal regret guarantees without requiring prior knowledge of the curvature of online functions. Existing methods have established minimax-optimal regret bounds for universal online learning, where a single algorithm can simultaneously attain $\mathcal{O}(\sqrt{T})$ regret for convex functions, $\mathcal{O}(d \log T)$ for exp-concave functions, and $\mathcal{O}(\log T)$ for strongly convex functions, where $T$ is the number of rounds and $d$ is the dimension of the feasible domain. However, these methods still lack problem-dependent adaptivity. In particular, no universal method provides regret bounds that scale with the gradient variation $V_T$, a key quantity that plays a crucial role in applications such as stochastic optimization and fast-rate convergence in games. In this work, we introduce UniGrad, a novel approach that achieves both universality and adaptivity, with two distinct realizations: UniGrad.Correct and UniGrad.Bregman. Both methods achieve universal regret guarantees that adapt to gradient variation, simultaneously attaining $\mathcal{O}(\log V_T)$ regret for strongly convex functions and $\mathcal{O}(d \log V_T)$ regret for exp-concave functions. For convex functions, the regret bounds differ: UniGrad.Correct achieves an $\mathcal{O}(\sqrt{V_T \log V_T})$ bound while preserving the RVU property that is crucial for fast convergence in online games, whereas UniGrad.Bregman achieves the optimal $\mathcal{O}(\sqrt{V_T})$ regret bound through a novel design. Both methods employ a meta algorithm with $\mathcal{O}(\log T)$ base learners, which naturally requires $\mathcal{O}(\log T)$ gradient queries per round. To enhance computational efficiency, we introduce UniGrad++, which retains the regret while reducing the gradient query to just $1$ per round via surrogate optimization. We further provide various implications.

#25 Popularity Bias Alignment Estimates

著者: Anton Lyubinin

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.19999

要約:
We are extending Popularity Bias Memorization theorem from arXiv:archive/2404.12008 in several directions. We extend it to arbitrary degree distributions and also prove both upper and lower estimates for the alignment with top-k singular hyperspace.

#26 An Infinite BART model

著者: Marco Battiston, Yu Luo

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20087

要約:
Bayesian additive regression trees (BART) are popular Bayesian ensemble models used in regression and classification analysis. Under this modeling framework, the regression function is approximated by an ensemble of decision trees, interpreted as weak learners that capture different features of the data. In this work, we propose a generalization of the BART model that has two main features: first, it automatically selects the number of decision trees using the given data; second, the model allows clusters of observations to have different regression functions since each data point can only use a selection of weak learners, instead of all of them. This model generalization is accomplished by including a binary weight matrix in the conditional distribution of the response variable, which activates only a specific subset of decision trees for each observation. Such a matrix is endowed with an Indian Buffet process prior, and sampled within the MCMC sampler, together with the other BART parameters. We then compare the Infinite BART model with the classic one on simulated and real datasets. Specifically, we provide examples illustrating variable importance, partial dependence and causal estimation.

#27 Adaptive SGD with Line-Search and Polyak Stepsizes: Nonconvex Convergence and Accelerated Rates

著者: Haotian Wu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20207

要約:
We extend the convergence analysis of AdaSLS and AdaSPS in [Jiang and Stich, 2024] to the nonconvex setting, presenting a unified convergence analysis of stochastic gradient descent with adaptive Armijo line-search (AdaSLS) and Polyak stepsize (AdaSPS) for nonconvex optimization. Our contributions include: (1) an $\mathcal{O}(1/\sqrt{T})$ convergence rate for general nonconvex smooth functions, (2) an $\mathcal{O}(1/T)$ rate under quasar-convexity and interpolation, and (3) an $\mathcal{O}(1/T)$ rate under the strong growth condition for general nonconvex functions.

#28 Modified Equations for Stochastic Optimization

著者: Stefan Perko

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20322

要約:
In this thesis, we extend the recently introduced theory of stochastic modified equations (SMEs) for stochastic gradient optimization algorithms. In Ch. 3 we study time-inhomogeneous SDEs driven by Brownian motion. For certain SDEs we prove a 1st and 2nd-order weak approximation properties, and we compute their linear error terms explicitly, under certain regularity conditions. In Ch. 4 we instantiate our results for SGD, working out the example of linear regression explicitly. We use this example to compare the linear error terms of gradient flow and two commonly used 1st-order SMEs for SGD in Ch. 5. In the second part of the thesis we introduce and study a novel diffusion approximation for SGD without replacement (SGDo) in the finite-data setting. In Ch. 6 we motivate and define the notion of an epoched Brownian motion (EBM). We argue that Young differential equations (YDEs) driven by EBMs serve as continuous-time models for SGDo for any shuffling scheme whose induced permutations converge to a det. permuton. Further, we prove a.s. convergence for these YDEs in the strongly convex setting. Moreover, we compute an upper asymptotic bound on the convergence rate which is as sharp as, or better than previous results for SGDo. In Ch. 7 we study scaling limits of families of random walks (RW) that share the same increments up to a random permutation. We show weak convergence under the assumption that the sequence of permutations converges to a det. (higher-dimensional) permuton. This permuton determines the covariance function of the limiting Gaussian process. Conversely, we show that every Gaussian process with a covariance function determined by a permuton in this way arises as a weak scaling limit of families of RW with shared increments. Finally, we apply our weak convergence theory to show that EBMs arise as scaling limits of RW with finitely many distinct increments.

#29 PAC-Bayes Meets Online Contextual Optimization

著者: Zhuojun Xie, Adam Abdin, Yiping Fang

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20413

要約:
The predict-then-optimize paradigm bridges online learning and contextual optimization in dynamic environments. Previous works have investigated the sequential updating of predictors using feedback from downstream decisions to minimize regret in the full-information settings. However, existing approaches are predominantly frequentist, rely heavily on gradient-based strategies, and employ deterministic predictors that could yield high variance in practice despite their asymptotic guarantees. This work introduces, to the best of our knowledge, the first Bayesian online contextual optimization framework. Grounded in PAC-Bayes theory and general Bayesian updating principles, our framework achieves $\mathcal{O}(\sqrt{T})$ regret for bounded and mixable losses via a Gibbs posterior, eliminates the dependence on gradients through sequential Monte Carlo samplers, and thereby accommodates nondifferentiable problems. Theoretical developments and numerical experiments substantiate our claims.

#30 Variational bagging: a robust approach for Bayesian uncertainty quantification

著者: Shitao Fan, Ilsang Ohn, David Dunson, Lizhen Lin

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20594

要約:
Variational Bayes methods are popular due to their computational efficiency and adaptability to diverse applications. In specifying the variational family, mean-field classes are commonly used, which enables efficient algorithms such as coordinate ascent variational inference (CAVI) but fails to capture parameter dependence and typically underestimates uncertainty. In this work, we introduce a variational bagging approach that integrates a bagging procedure with variational Bayes, resulting in a bagged variational posterior for improved inference. We establish strong theoretical guarantees, including posterior contraction rates for general models and a Bernstein-von Mises (BVM) type theorem that ensures valid uncertainty quantification. Notably, our results show that even when using a mean-field variational family, our approach can recover off-diagonal elements of the limiting covariance structure and provide proper uncertainty quantification. In addition, variational bagging is robust to model misspecification, with covariance structures matching those of the target covariance. We illustrate our variational bagging method in numerical studies through applications to parametric models, finite mixture models, deep neural networks, and variational autoencoders (VAEs).

#31 How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

著者: Xiwen Huang, Pierre Pinson

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20605

要約:
We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

#32 Optimization of Sums of Bivariate Functions: An Introduction to Relaxation-Based Methods for the Case of Finite Domains

著者: Nils M\"uller

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.20607

要約:
We study the optimization of functions with $n>2$ arguments that have a representation as a sum of several functions that have only $2$ of the $n$ arguments each, termed sums of bivariates, on finite domains. The complexity of optimizing sums of bivariates is shown to be NP-equivalent and it is shown that there exists free lunch in the optimization of sums of bivariates. Based on measure-valued extensions of the objective function, so-called relaxations, $\ell^2$-approximation, and entropy-regularization, we derive several tractable problem formulations solvable with linear programming, coordinate ascent as well as with closed-form solutions. The limits of applying tractable versions of such relaxations to sums of bivariates are investigated using general results for reconstructing measures from their bivariate marginals. Experiments in which the derived algorithms are applied to random functions, vertex coloring, and signal reconstruction problems provide insights into qualitatively different function classes that can be modeled as sums of bivariates.

#33 Hyperparameter Optimization in Machine Learning

著者: Luca Franceschi, Michele Donini, Valerio Perrone, Aaron Klein, C\'edric Archambeau, Matthias Seeger, Massimiliano Pontil, Paolo Frasconi

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2410.22854

要約:
Hyperparameters are configuration variables controlling the behavior of machine learning algorithms. They are ubiquitous in machine learning and artificial intelligence and the choice of their values determines the effectiveness of systems based on these technologies. Manual hyperparameter search is often time-consuming and becomes infeasible when the number of hyperparameters is large. Automating the search is an important step towards advancing, streamlining, and systematizing machine learning, freeing researchers and practitioners alike from the burden of finding a good set of hyperparameters by trial and error. In this survey, we present a unified treatment of hyperparameter optimization, providing the reader with examples, insights into the state-of-the-art, and numerous links to further reading. We cover the main families of techniques to automate hyperparameter search, often referred to as hyperparameter optimization or tuning, including random and quasi-random search, bandit-, model-, population-, and gradient-based approaches. We further discuss extensions, including online, constrained, and multi-objective formulations, touch upon connections with other fields, such as meta-learning and neural architecture search, and conclude with open questions and future research directions.

#34 Sparse Techniques for Regression in Deep Gaussian Processes

著者: Jonas Latz, Aretha L. Teckentrup, Simon Urbainczyk

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.11355

要約:
Gaussian processes (GPs) have gained popularity as flexible machine learning models for regression and function approximation with an in-built method for uncertainty quantification. However, GPs suffer when the amount of training data is large or when the underlying function contains multi-scale features that are difficult to represent by a stationary kernel. To address the former, training of GPs with large-scale data is often performed through inducing point approximations, also known as sparse GP regression (GPR), where the size of the covariance matrices in GPR is reduced considerably through a greedy search on the data set. To aid the latter, deep GPs have gained traction as hierarchical models that resolve multi-scale features by combining multiple GPs. Posterior inference in deep GPs requires a sampling or, more usual, a variational approximation. Variational approximations lead to large-scale stochastic, non-convex optimisation problems and the resulting approximation tends to represent uncertainty incorrectly. In this work, we combine variational learning with MCMC to develop a particle-based expectation-maximisation method to simultaneously find inducing points within the large-scale data (variationally) and accurately train the deep GPs (sampling-based). The result is a highly efficient and accurate methodology for deep GP training on large-scale data. We test our method on standard benchmark problems.

#35 Missing Data Imputation by Reducing Mutual Information with Rectified Flows

著者: Jiahao Yu, Qizhen Ying, Leyang Wang, Ziyue Jiang, Song Liu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.11749

要約:
This paper introduces a novel iterative method for missing data imputation that sequentially reduces the mutual information between data and the corresponding missingness mask. Inspired by GAN-based approaches that train generators to decrease the predictability of missingness patterns, our method explicitly targets this reduction in mutual information. Specifically, our algorithm iteratively minimizes the KL divergence between the joint distribution of the imputed data and missingness mask, and the product of their marginals from the previous iteration. We show that the optimal imputation under this framework can be achieved by solving an ODE whose velocity field minimizes a rectified flow training objective. We further illustrate that some existing imputation techniques can be interpreted as approximate special cases of our mutual-information-reducing framework. Comprehensive experiments on synthetic and real-world datasets validate the efficacy of our proposed approach, demonstrating its superior imputation performance. Our implementation is available at https://github.com/yujhml/MIRI-Imputation.

#36 An Asymptotic Equation Linking WAIC and WBIC in Singular Models

著者: Naoki Hayashi, Takuro Kutsuna, Sawa Takamuku

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2505.13902

要約:
In statistical learning, models are classified as regular or singular depending on whether the mapping from parameters to probability distributions is injective. Most models with hierarchical structures or latent variables are singular, for which conventional criteria such as the Akaike Information Criterion and the Bayesian Information Criterion are inapplicable due to the breakdown of normal approximations for the likelihood and posterior. To address this, the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC) have been proposed. Since WAIC and WBIC are computed using posterior distributions at different temperature settings, separate posterior sampling is generally required. In this paper, we theoretically derive an asymptotic equation that links WAIC and WBIC, despite their dependence on different posteriors. This equation yields an asymptotically unbiased expression of WAIC in terms of the posterior distribution used for WBIC. The result clarifies the structural relationship between these criteria within the framework of singular learning theory, and deepens understanding of their asymptotic behavior. This theoretical contribution provides a foundation for future developments in the computational efficiency of model selection in singular models.

#37 Adjoint Schr\"odinger Bridge Sampler

著者: Guan-Horng Liu, Jaemoo Choi, Yongxin Chen, Benjamin Kurt Miller, Ricky T. Q. Chen

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.22565

要約:
Computational methods for learning to sample from the Boltzmann distribution -- where the target distribution is known only up to an unnormalized energy function -- have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as diffusion samplers, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose Adjoint Schr\"odinger Bridge Sampler (ASBS), a new diffusion sampler that employs simple and scalable matching-based objectives yet without the need to estimate target samples during training. ASBS is grounded on a mathematical model -- the Schr\"odinger Bridge -- which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions. Code available at https://github.com/facebookresearch/adjoint_samplers

#38 Spectral Thresholds for Identifiability and Stability:Finite-Sample Phase Transitions in High-Dimensional Learning

著者: William Hao-Cheng Huang

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2510.03809

要約:
In high-dimensional learning, models remain stable until they collapse abruptly once the sample size falls below a critical level. This instability is not algorithm-specific but a geometric mechanism: when the weakest Fisher eigendirection falls beneath sample-level fluctuations, identifiability fails. Our Fisher Threshold Theorem formalizes this by proving that stability requires the minimal Fisher eigenvalue to exceed an explicit $O(\sqrt{d/n})$ bound. Unlike prior asymptotic or model-specific criteria, this threshold is finite-sample and necessary, marking a sharp phase transition between reliable concentration and inevitable failure. To make the principle constructive, we introduce the Fisher floor, a verifiable spectral regularization robust to smoothing and preconditioning. Synthetic experiments on Gaussian mixtures and logistic models confirm the predicted transition, consistent with $d/n$ scaling. Statistically, the threshold sharpens classical eigenvalue conditions into a non-asymptotic law; learning-theoretically, it defines a spectral sample-complexity frontier, bridging theory with diagnostics for robust high-dimensional inference.

#39 Learning to Validate Generative Models: a Goodness-of-Fit Approach

著者: Pietro Cappelli, Gaia Grosso, Marco Letizia, Humberto Reyes-Gonz\'alez, Marco Zanetti

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.09118

要約:
Generative models are increasingly central to scientific workflows, yet their systematic use and interpretation require a proper understanding of their limitations through rigorous validation. Classic approaches struggle with scalability, statistical power, or interpretability when applied to high-dimensional data, making it difficult to certify the reliability of these models in realistic, high-dimensional scientific settings. Here, we propose the use of the New Physics Learning Machine (NPLM), a learning-based approach to goodness-of-fit testing inspired by the Neyman--Pearson construction, to test generative networks trained on high-dimensional scientific data. We demonstrate the performance of NPLM for validation in two benchmark cases: generative models trained on mixtures of Gaussian models with increasing dimensionality, and a public end-to-end model, known as FlowSim, developed to generate high-energy physics collision events. We demonstrate that the NPLM can serve as a powerful validation method while also providing a means to diagnose sub-optimally modeled regions of the data.

#40 Differential privacy with dependent data

privacy

著者: Valentin Roth, Marco Avella-Medina

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.18583

要約:
Dependent data underlies many statistical studies in the social and health sciences, which often involve sensitive or private information. Differential privacy (DP) and in particular \textit{user-level} DP provide a natural formalization of privacy requirements for processing dependent data where each individual provides multiple observations to the dataset. However, dependence introduced, e.g., through repeated measurements challenges the existing statistical theory under DP-constraints. In \iid{} settings, noisy Winsorized mean estimators have been shown to be minimax optimal for standard (\textit{item-level}) and \textit{user-level} DP estimation of a mean $\mu \in \R^d$. Yet, their behavior on potentially dependent observations has not previously been studied. We fill this gap and show that Winsorized mean estimators can also be used under dependence for bounded and unbounded data, and can lead to asymptotic and finite sample guarantees that resemble their \iid{} counterparts under a weak notion of dependence. For this, we formalize dependence via log-Sobolev inequalities on the joint distribution of observations. This enables us to adapt the stable histogram by Karwa and Vadhan (2018) to a non-\iid{} setting, which we then use to estimate the private projection intervals of the Winsorized estimator. The resulting guarantees for our item-level mean estimator extend to \textit{user-level} mean estimation and transfer to the local model via a randomized response histogram. Using the mean estimators as building blocks, we provide extensions to random effects models, longitudinal linear regression and nonparametric regression. Therefore, our work constitutes a first step towards a systematic study of DP for dependent data.

#41 Categorical Flow Matching on Statistical Manifolds

著者: Chaoran Cheng, Jiahan Li, Jian Peng, Ge Liu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2405.16441

要約:
We introduce Statistical Flow Matching (SFM), a novel and mathematically rigorous flow-matching framework on the manifold of parameterized probability measures inspired by the results from information geometry. We demonstrate the effectiveness of our method on the discrete generation problem by instantiating SFM on the manifold of categorical distributions whose geometric properties remain unexplored in previous discrete generative models. Utilizing the Fisher information metric, we equip the manifold with a Riemannian structure whose intrinsic geometries are effectively leveraged by following the shortest paths of geodesics. We develop an efficient training and sampling algorithm that overcomes numerical stability issues with a diffeomorphism between manifolds. Our distinctive geometric perspective of statistical manifolds allows us to apply optimal transport during training and interpret SFM as following the steepest direction of the natural gradient. Unlike previous models that rely on variational bounds for likelihood estimation, SFM enjoys the exact likelihood calculation for arbitrary probability measures. We manifest that SFM can learn more complex patterns on the statistical manifold where existing models often fail due to strong prior assumptions. Comprehensive experiments on real-world generative tasks ranging from image, text to biological domains further demonstrate that SFM achieves higher sampling quality and likelihood than other discrete diffusion or flow-based models.

#42 Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data

著者: Iris Ivy Gauran, Hernando Ombao, Zhaoxia Yu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2408.03138

要約:
It is crucial to assess the predictive performance of a model to establish its practicality and relevance in real-world scenarios, particularly for high-dimensional data analysis. Among data splitting or resampling methods, cross-validation (CV) is extensively used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model among competing alternatives. The $K$-fold cross-validation is a popular CV method but its limitation is that the risk estimates are highly dependent on the partitioning of the data (for training and testing). Here, the issues regarding the reproducibility of the $K$-fold CV estimator are demonstrated in hypothesis testing wherein different partitions lead to notably disparate conclusions. This study presents a novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation for determining the difference in prediction error between two model-fitting algorithms. A naive implementation of the exhaustive nested cross-validation is computationally costly. Here, we address concerns regarding computational complexity by devising a computationally tractable closed-form expression for the proposed cross-validation estimator. Our study also investigates strategies aimed at enhancing statistical power within high-dimensional scenarios while controlling the Type I error rate. \tr{Through comprehensive numerical experiments, we demonstrate that Ridge-based methods using bias to measure uncertainty of CV estimates and adaptive hyperparameter selection provide the most reliable approach for high-dimensional predictive performance testing.} To illustrate the practical utility of our method, we apply it to an RNA sequencing study and demonstrate its effectiveness in the context of biological data analysis.

#43 SCNode: Spatial and Contextual Coordinates for Graph Representation Learning

著者: Md Joshem Uddin, Astrit Tola, Varin Sikand, Cuneyt Gurcan Akcora, Baris Coskunuzer

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2410.02158

要約:
Effective node representation lies at the heart of Graph Neural Networks (GNNs), as it directly impacts their ability to perform downstream tasks such as node classification and link prediction. Most existing GNNs, particularly message passing graph neural networks, rely on neighborhood aggregation to iteratively compute node embeddings. While powerful, this paradigm suffers from well-known limitations of oversquashing, oversmoothing, and underreaching that degrade representation quality. More critically, MPGNNs often assume homophily, where connected nodes share similar features or labels, leading to poor generalization in heterophilic graphs where this assumption breaks down. To address these challenges, we propose \textit{SCNode}, a \textit{Spatial-Contextual Node Embedding} framework designed to perform consistently well in both homophilic and heterophilic settings. SCNode integrates spatial and contextual information, yielding node embeddings that are not only more discriminative but also structurally aware. Our approach introduces new homophily matrices for understanding class interactions and tendencies. Extensive experiments on benchmark datasets show that SCNode achieves superior performance over conventional GNN models, demonstrating its robustness and adaptability in diverse graph structures.

#44 Transfer Learning for High-dimensional Quantile Regression with Distribution Shift

著者: Ruiqi Bai, Yijiao Zhang, Hanbo Yang, Zhongyi Zhu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2411.19933

要約:
Information from related source studies can often enhance the findings of a target study. However, the distribution shift between target and source studies can severely impact the efficiency of knowledge transfer. In the high-dimensional regression setting, existing transfer approaches mainly focus on the parameter shift. In this paper, we focus on the high-dimensional quantile regression with knowledge transfer under three types of distribution shift: parameter shift, covariate shift, and residual shift. We propose a novel transferable set and a new transfer framework to address the above three discrepancies. Non-asymptotic estimation error bounds and source detection consistency are established to validate the availability and superiority of our method in the presence of distribution shift. Additionally, an orthogonal debiased approach is proposed for statistical inference with knowledge transfer, leading to sharper asymptotic results. Extensive simulation results as well as real data applications further demonstrate the effectiveness of our proposed procedure.

#45 GRAM: Generalization in Deep RL with a Robust Adaptation Module

著者: James Queeney, Xiaoyi Cai, Alexander Schperberg, Radu Corcodel, Mouhacine Benosman, Jonathan P. How

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2412.04323

要約:
The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate through extensive simulation and hardware locomotion experiments on a quadruped robot.

#46 Data-Driven High-Dimensional Statistical Inference with Generative Models

著者: Oz Amram, Manuel Szewc

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.06438

要約:
Crucial to many measurements at the LHC is the use of correlated multi-dimensional information to distinguish rare processes from large backgrounds, which is complicated by the poor modeling of many of the crucial backgrounds in Monte Carlo simulations. In this work, we introduce HI-SIGMA, a method to perform unbinned high-dimensional statistical inference with data-driven background distributions. In contradistinction to many applications of Simulation Based Inference in High Energy Physics, HI-SIGMA relies on generative ML models, rather than classifiers, to learn the signal and background distributions in the high-dimensional space. These ML models allow for interpretable inference while also incorporating model errors and other sources of systematic uncertainties. We showcase this methodology on a simplified version of a di-Higgs measurement in the $bb\gamma\gamma$ final state, where the di-photon resonance allows for background interpolation from sidebands into the signal region. We demonstrate that HI-SIGMA provides improved sensitivity as compared to standard classifier-based methods, and that systematic uncertainties can be straightforwardly incorporated by extending methods which have been used for histogram based analyses.

#47 FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling

diffusion

著者: Sifan Wang, Zehao Dou, Siming Shan, Tong-Rui Liu, Lu Lu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.07902

要約:
Recent advances in generative modeling -- particularly diffusion models and flow matching -- have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce $\textbf{FunDiff}$, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at https://github.com/sifanexisted/fundiff.

#48 Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting

diffusion

著者: Salva R\"uhling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2506.20024

要約:
Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional complex systems predict future states individually. This approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to the systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components-its noise schedule, network preconditioning, and Heun sampler-to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5-degree resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based dynamics forecasting problems where modeling uncertainty propagation is paramount.

#49 Improving Constrained Language Generation via Self-Distilled Twisted Sequential Monte Carlo

著者: Sooyeon Kim, Giung Nam, Byoungwoo Park, Juho Lee

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2507.02315

要約:
Recent work has framed constrained text generation with autoregressive language models as a probabilistic inference problem. Among these, Zhao et al. (2024) introduced a promising approach based on twisted Sequential Monte Carlo, which incorporates learned twist functions and twist-induced proposals to guide the generation process. However, in constrained generation settings where the target distribution concentrates on outputs that are unlikely under the base model, learning becomes challenging due to sparse and uninformative reward signals. We show that iteratively refining the base model through self-distillation alleviates this issue by making the model progressively more aligned with the target, leading to substantial gains in generation quality.

#50 Scalable neural network-based blackbox optimization

著者: Pavankumar Koratikere, Leifur Leifsson

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2508.03827

要約:
Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN prediction -- a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.

#51 Scaling Up Active Testing to Large Language Models

著者: Gabrielle Berrada, Jannik Kossen, Freddie Bickford Smith, Muhammed Razzak, Yarin Gal, Tom Rainforth

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2508.09093

要約:
Active testing enables label-efficient evaluation of predictive models through careful data acquisition, but it can pose a significant computational cost. We identify cost-saving measures that enable active testing to be scaled up to large language models (LLMs). In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without making predictions with the target model. As a result we are able to achieve much more accurate evaluations of LLM performance relative to using randomly acquired data. We additionally introduce a bootstrap estimator of evaluation error, which we show to be a useful indicator of how well active testing is working within a single run.

#52 Scaling Up ROC-Optimizing Support Vector Machines

著者: Gimun Bae, Seung Jun Shin

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.04979

要約:
The ROC-SVM, originally proposed by Rakotomamonjy, directly maximizes the area under the ROC curve (AUC) and has become an attractive alternative of the conventional binary classification under the presence of class imbalance. However, its practical use is limited by high computational cost, as training involves evaluating all $O(n^2)$. To overcome this limitation, we develop a scalable variant of the ROC-SVM that leverages incomplete U-statistics, thereby substantially reducing computational complexity. We further extend the framework to nonlinear classification through a low-rank kernel approximation, enabling efficient training in reproducing kernel Hilbert spaces. Theoretical analysis establishes an error bound that justifies the proposed approximation, and empirical results on both synthetic and real datasets demonstrate that the proposed method achieves comparable AUC performance to the original ROC-SVM with drastically reduced training time.

#53 A Latent-Variable Formulation of the Poisson Canonical Polyadic Tensor Model: Maximum Likelihood Estimation and Fisher Information

著者: Carlos Llosa-Vite, Daniel M. Dunlavy, Richard B. Lehoucq, Oscar L\'opez, Arvind Prasadan

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.05352

要約:
We establish parameter inference for the Poisson canonical polyadic (PCP) model of tensor count data through a latent-variable formulation. Our approach exploits the property that any random tensor that follows the PCP model can be derived by marginalizing an unobservable random tensor of one dimension larger. The loglikelihood of this larger dimensional tensor, referred to as the "complete" loglikelihood, is comprised of multiple loglikelihoods corresponding to rank one PCP models. Using this methodology, we first demonstrate that several existing algorithms for fitting non-negative matrix and tensor factorizations are Expectation-Maximization algorithms. Next, we derive the observed and expected Fisher information matrices for the PCP model by leveraging its latent-variable formulation. The Fisher information provides us crucial insights into the well-posedness of the tensor model, such as the role that the rank of parameter tensor plays in identifiability and indeterminacy. For the special case of PCP models with rank one parameter tensors, we demonstrate that these results are greatly simplified.

#54 Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic

privacy

著者: Mostafa Mozafari, Farooq Ahmad Wani, Maria Sofia Bucarelli, Fabrizio Silvestri

公開日: Wed, 26 Nov 2025 00:00:00 -0500

リンク: https://arxiv.org/abs/2511.18660

要約:
Corrupted training data are ubiquitous. Corrective Machine Unlearning (CMU) seeks to remove the influence of such corruption post-training. Prior CMU typically assumes access to identified corrupted training samples (a "forget set"). However, in many real-world scenarios the training data are no longer accessible. We formalize source-free CMU, where the original training data are unavailable and, consequently, no forget set of identified corrupted training samples can be specified. Instead, we assume a small proxy (surrogate) set of corrupted samples that reflect the suspected corruption type without needing to be the original training samples. In this stricter setting, methods relying on forget set are ineffective or narrow in scope. We introduce Corrective Unlearning in Task Space (CUTS), a lightweight weight space correction method guided by the proxy set using task arithmetic principles. CUTS treats the clean and the corruption signal as distinct tasks. Specifically, we briefly fine-tune the corrupted model on the proxy to amplify the corruption mechanism in the weight space, compute the difference between the corrupted and fine-tuned weights as a proxy task vector, and subtract a calibrated multiple of this vector to cancel the corruption. Without access to clean data or a forget set, CUTS recovers a large fraction of the lost utility under label noise and, for backdoor triggers, nearly eliminates the attack with minimal damage to utility, outperforming state-of-the-art specialized CMU methods in source-free setting.

stat.ML updates on arXiv.org

📋 論文タイトル一覧