要約:
Computational modeling of single-cell gene expression is crucial for understanding cellular processes, but generating realistic expression profiles remains a major challenge. This difficulty arises from the count nature of gene expression data and complex latent dependencies among genes. Existing generative models often impose artificial gene orderings or rely on shallow neural network architectures. We introduce a scalable latent diffusion model for single-cell gene expression data, which we refer to as scLDM, that respects the fundamental exchangeability property of the data. Our VAE uses fixed-size latent variables leveraging a unified Multi-head Cross-Attention Block (MCAB) architecture, which serves dual roles: permutation-invariant pooling in the encoder and permutation-equivariant unpooling in the decoder. We enhance this framework by replacing the Gaussian prior with a latent diffusion model using Diffusion Transformers and linear interpolants, enabling high-quality generation with multi-conditional classifier-free guidance. We show its superior performance in a variety of experiments for both observational and perturbational single-cell data, as well as downstream tasks like cell-level classification.
要約:
Comparing clusterings is central to evaluating unsupervised models, yet the many existing similarity measures can produce widely divergent, sometimes contradictory, evaluations. Clustering similarity measures are typically organized into two principal families, pair-counting and information-theoretic, reflecting whether they quantify agreement through element pairs or aggregate information across full cluster contingency tables. Prior work has uncovered parallels between these families and applied empirical normalization or chance-correction schemes, but their deeper analytical connection remains only partially understood. Here, we develop an analytical framework that unifies these families through two complementary perspectives. First, both families are expressed as weighted expansions of observed versus expected co-occurrences, with pair-counting arising as a quadratic, low-order approximation and information-theoretic measures as higher-order, frequency-weighted extensions. Second, we generalize pair-counting to $k$-tuple agreement and show that information-theoretic measures can be viewed as systematically accumulating higher-order co-assignment structure beyond the pairwise level. We illustrate the approaches analytically for the Rand index and Mutual Information, and show how other indices in each family emerge as natural extensions. Together, these views clarify when and why the two regimes diverge, relating their sensitivities directly to weighting and approximation order, and provide a principled basis for selecting, interpreting, and extending clustering similarity measures across applications.
要約:
Gradient information is widely useful and available in applications, and is therefore natural to include in the training of neural networks. Yet little is known theoretically about the impact of Sobolev training -- regression with both function and gradient data -- on the generalization error of highly overparameterized predictive models in high dimensions. In this paper, we obtain a precise characterization of this training modality for random feature (RF) models in the limit where the number of trainable parameters, input dimensions, and training data tend proportionally to infinity. Our model for Sobolev training reflects practical implementations by sketching gradient data onto finite dimensional subspaces. By combining the replica method from statistical physics with linearizations in operator-valued free probability theory, we derive a closed-form description for the generalization errors of the trained RF models. For target functions described by single-index models, we demonstrate that supplementing function data with additional gradient data does not universally improve predictive performance. Rather, the degree of overparameterization should inform the choice of training method. More broadly, our results identify settings where models perform optimally by interpolating noisy function and gradient data.
要約:
We study how Bayesian optimization (BO) can be accelerated on a target task with historical knowledge transferred from related source tasks. Existing works on BO with knowledge transfer either do not have theoretical guarantees or achieve the same regret as BO in the non-transfer setting, $\tilde{\mathcal{O}}(\sqrt{T \gamma_f})$, where $T$ is the number of evaluations of the target function and $\gamma_f$ denotes its information gain. In this paper, we propose the DeltaBO algorithm, in which a novel uncertainty-quantification approach is built on the difference function $\delta$ between the source and target functions, which are allowed to belong to different reproducing kernel Hilbert spaces (RKHSs). Under mild assumptions, we prove that the regret of DeltaBO is of order $\tilde{\mathcal{O}}(\sqrt{T (T/N + \gamma_\delta)})$, where $N$ denotes the number of evaluations from source tasks and typically $N \gg T$. In many applications, source and target tasks are similar, which implies that $\gamma_\delta$ can be much smaller than $\gamma_f$. Empirical studies on both real-world hyperparameter tuning tasks and synthetic functions show that DeltaBO outperforms other baseline methods and support our theoretical claims.
要約:
Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization -- reproducing training data rather than generating novel outputs. This not only limits their creative potential but also raises concerns about privacy and safety. While empirical studies have explored mitigation strategies, theoretical understanding of memorization remains limited. We address this gap through developing a dual-separation result via two complementary perspectives: statistical estimation and network approximation. From the estimation side, we show that the ground-truth score function does not minimize the empirical denoising loss, creating a separation that drives memorization. From the approximation side, we prove that implementing the empirical score function requires network size to scale with sample size, spelling a separation compared to the more compact network representation of the ground-truth score function. Guided by these insights, we develop a pruning-based method that reduces memorization while maintaining generation quality in diffusion transformers.
要約:
RKUM is an R package developed for implementing robust kernel-based unsupervised methods. It provides functions for estimating the robust kernel covariance operator (CO) and the robust kernel cross-covariance operator (CCO) using generalized loss functions instead of the conventional quadratic loss. These operators form the foundation of robust kernel learning and enable reliable analysis under contaminated or noisy data conditions. The package includes implementations of robust kernel canonical correlation analysis (Kernel CCA), as well as the influence function (IF) for both standard and multiple kernel CCA frameworks. The influence function quantifies sensitivity and helps detect influential or outlying observations across two-view and multi-view datasets. Experiments using synthesized two-view and multi-view data demonstrate that the IF of the standard kernel CCA effectively identifies outliers, while the robust kernel methods implemented in RKUM exhibit reduced sensitivity to contamination. Overall, RKUM provides an efficient and extensible platform for robust kernel-based analysis in high-dimensional data applications.
要約:
The study of self-normalized processes plays a crucial role in a wide range of applications, from sequential decision-making to econometrics. While the behavior of self-normalized concentration has been widely investigated for scalar-valued processes, vector-valued processes remain comparatively underexplored, especially outside of the sub-Gaussian framework. In this contribution, we provide concentration bounds for self-normalized processes with light tails beyond sub-Gaussianity (such as Bennett or Bernstein bounds). We illustrate the relevance of our results in the context of online linear regression, with applications in (kernelized) linear bandits.
要約:
Colorectal cancer (CRC) grading is a critical prognostic factor but remains hampered by inter-observer variability and the privacy constraints of multi-institutional data sharing. While deep learning offers a path to automation, centralized training models conflict with data governance regulations and neglect the diagnostic importance of multi-scale analysis. In this work, we propose a scalable, privacy-preserving federated learning (FL) framework for CRC histopathological grading that integrates multi-scale feature learning within a distributed training paradigm. Our approach employs a dual-stream ResNetRS50 backbone to concurrently capture fine-grained nuclear detail and broader tissue-level context. This architecture is integrated into a robust FL system stabilized using FedProx to mitigate client drift across heterogeneous data distributions from multiple hospitals. Extensive evaluation on the CRC-HGD dataset demonstrates that our framework achieves an overall accuracy of 83.5%, outperforming a comparable centralized model (81.6%). Crucially, the system excels in identifying the most aggressive Grade III tumors with a high recall of 87.5%, a key clinical priority to prevent dangerous false negatives. Performance further improves with higher magnification, reaching 88.0% accuracy at 40x. These results validate that our federated multi-scale approach not only preserves patient privacy but also enhances model performance and generalization. The proposed modular pipeline, with built-in preprocessing, checkpointing, and error handling, establishes a foundational step toward deployable, privacy-aware clinical AI for digital pathology.
要約:
A common challenge for decision makers is selecting actions whose rewards are unknown and evolve over time based on prior policies. For instance, repeated use may reduce an action's effectiveness (habituation), while inactivity may restore it (recovery). These nonstationarities are captured by the Reducing or Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world settings such as behavioral health interventions. While existing algorithms can compute sublinear regret policies to optimize these settings, they may not provide sufficient exploration due to overemphasis on exploitation, limiting the ability to estimate population-level effects. This is a challenge of particular interest in micro-randomized trials (MRTs) that aid researchers in developing just-in-time adaptive interventions that have population-level effects while still providing personalized recommendations to individuals. In this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored to the ROGUE framework, and provide theoretical guarantees of sublinear regret. We then introduce a probability clipping procedure to balance personalization and population-level learning, with quantified trade-off that balances regret and minimum exploration probability. Validation on two MRT datasets concerning physical activity promotion and bipolar disorder treatment shows that our methods both achieve lower regret than existing approaches and maintain high statistical power through the clipping procedure without significantly increasing regret. This enables reliable detection of treatment effects while accounting for individual behavioral dynamics. For researchers designing MRTs, our framework offers practical guidance on balancing personalization with statistical validity.
要約:
Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.
要約:
Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, challenging the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge on observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
要約:
Generalized linear models (GLMs) are fundamental tools for statistical modeling, with maximum likelihood estimation (MLE) serving as the classical method for parameter inference. While MLE performs well in canonical GLMs, it can become computationally inefficient near the true parameter value. In more general settings with non-canonical or fully general link functions, the resulting optimization landscape is often non-convex, non-smooth, and numerically unstable. To address these challenges, we investigate an alternative estimator based on solving the variational inequality (VI) formulation of the GLM likelihood equations, originally proposed by Juditsky and Nemirovski as an alternative for solving nonlinear least-squares problems. Unlike their focus on algorithmic convergence in monotone settings, we analyze the VI approach from a statistical perspective, comparing it systematically with the MLE. We also extend the theory of VI estimators to a broader class of link functions, including non-monotone cases satisfying a strong Minty condition, and show that it admits weaker smoothness requirements than MLE, enabling faster, more stable, and less locally trapped optimization. Theoretically, we establish both non-asymptotic estimation error bounds and asymptotic normality for the VI estimator, and further provide convergence guarantees for fixed-point and stochastic approximation algorithms. Numerical experiments show that the VI framework preserves the statistical efficiency of MLE while substantially extending its applicability to more challenging GLM settings.
要約:
Rectified flow (Liu et al., 2022; Liu, 2022; Wu et al., 2023) is a method for defining a transport map between two distributions, and enjoys popularity in machine learning, although theoretical results supporting the validity of these methods are scant. The rectified flow can be regarded as an approximation to optimal transport, but in contrast to other transport methods that require optimization over a function space, computing the rectified flow only requires standard statistical tools such as regression or density estimation. Because of this, one can leverage standard data analysis tools for regression and density estimation to develop empirical versions of transport maps. We study some structural properties of the rectified flow, including existence, uniqueness, and regularity, as well as the related statistical properties, such as rates of convergence and central limit theorems, for some selected estimators. To do so, we analyze separately the bounded and unbounded cases as each presents unique challenges. In both cases, we are able to establish convergence at faster rates than the ones for the usual nonparametric regression and density estimation.
要約:
Various data modalities are common in real-world applications (e.g., electronic health records, medical images and clinical notes in healthcare). It is essential to develop multimodal learning methods to aggregate various information from multiple modalities. The main challenge is how to appropriately align and fuse the representations of different modalities into a joint distribution. Existing methods mainly rely on concatenation or the Kronecker product, oversimplifying the interaction structure between modalities and indicating a need to model more complex interactions. Additionally, the joint distribution of latent representations with higher-order interactions is underexplored. Copula is a powerful statistical structure for modelling the interactions among variables, as it naturally bridges the joint distribution and marginal distributions of multiple variables. We propose a novel copula-driven multimodal learning framework, which focuses on learning the joint distribution of various modalities to capture the complex interactions among them. The key idea is to interpret the copula model as a tool to align the marginal distributions of the modalities efficiently. By assuming a Gaussian mixture distribution for each modality and a copula model on the joint distribution, our model can generate accurate representations for missing modalities. Extensive experiments on public MIMIC datasets demonstrate the superior performance of our model over other competitors. The code is available at https://github.com/HKU-MedAI/CMCM.
要約:
Entropy Minimization (EM) is beneficial to reducing class overlap, bridging domain gap, and restricting uncertainty for various tasks in machine learning, yet its potential is limited. To study the internal mechanism of EM, we reformulate and decouple the classical EM into two parts with opposite effects: cluster aggregation driving factor (CADF) rewards dominant classes and prompts a peaked output distribution, while gradient mitigation calibrator (GMC) penalizes high-confidence classes based on predicted probabilities. Furthermore, we reveal the limitations of classical EM caused by its coupled formulation: 1) reward collapse impedes the contribution of high-certainty samples in the learning process, and 2) easy-class bias induces misalignment between output distribution and label distribution. To address these issues, we propose Adaptive Decoupled Entropy Minimization (AdaDEM), which normalizes the reward brought from CADF and employs a marginal entropy calibrator (MEC) to replace GMC. AdaDEM outperforms DEM*, an upper-bound variant of classical EM, and achieves superior performance across various imperfectly supervised learning tasks in noisy and dynamic environments.
要約:
Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.
要約:
In this paper, we investigate optimization problems with nonnegative and orthogonal constraints, where any feasible matrix of size $n \times p$ exhibits a sparsity pattern such that each row accommodates at most one nonzero entry. Our analysis demonstrates that, by fixing the support set, the global solution of the minimization subproblem for the proximal linearization of the objective function can be computed in closed form with at most $n$ nonzero entries. Exploiting this structural property offers a powerful avenue for dramatically enhancing computational efficiency. Guided by this insight, we propose a support-set algorithm preserving strictly the feasibility of iterates. A central ingredient is a strategically devised update scheme for support sets that adjusts the placement of nonzero entries. We establish the global convergence of the support-set algorithm to a first-order stationary point, and show that its iteration complexity required to reach an $\epsilon$-approximate first-order stationary point is $O (\epsilon^{-2})$. Numerical results are strongly in favor of our algorithm in real-world applications, including nonnegative PCA, clustering, and community detection.
要約:
This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting ``more is more'' (Sun et al., 2025) are challenged by methods like LIMO (``less is more'') and s1 (Ye et al., 2025; Muenighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.
要約:
In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.
要約:
Complex inference tasks, such as those encountered in Pulsar Timing Array (PTA) data analysis, rely on Bayesian frameworks. The high-dimensional parameter space and the strong interdependencies among astrophysical, pulsar noise, and nuisance parameters introduce significant challenges for efficient learning and robust inference. These challenges are emblematic of broader issues in decision science, where model over-parameterization and prior sensitivity can compromise both computational tractability and the reliability of the results. We address these issues in the framework of hierarchical Bayesian modeling by introducing a reparameterization strategy. Our approach employs Normalizing Flows (NFs) to decorrelate the parameters governing hierarchical priors from those of astrophysical interest. The use of NF-based mappings provides both the flexibility to realize the reparametrization and the tractability to preserve proper probability densities. We further adopt i-nessai, a flow-guided nested sampler, to accelerate exploration of complex posteriors. This unified use of NFs improves statistical robustness and computational efficiency, providing a principled methodology for addressing hierarchical Bayesian inference in PTA analysis.
要約:
We study batched nonparametric contextual bandits under a margin condition when the margin parameter $\alpha$ is unknown. To capture the statistical price of this ignorance, we introduce the regret inflation criterion, defined as the ratio between the regret of an adaptive algorithm and that of an oracle knowing $\alpha$. We show that the optimal regret inflation grows polynomial with the horizon $T$, with exponent precisely given by the value of a convex optimization problem involving the dimension, smoothness, and batch budget. Moreover, the minimizers of this optimization problem directly prescribe the batch allocation and exploration strategy of a rate-optimal algorithm. Building on this principle, we develop RoBIN (RObust batched algorithm with adaptive BINning), which achieves the optimal regret inflation up to logarithmic factors. These results reveal a new adaptivity barrier: under batching, adaptation to an unknown margin parameter inevitably incurs a polynomial penalty, sharply characterized by a variational problem. Remarkably, this barrier vanishes when the number of batches exceeds $\log \log T$; with only a doubly logarithmic number of updates, one can recover the oracle regret rate up to polylogarithmic factors.
要約:
We study two-sample variable selection: identifying variables that discriminate between the distributions of two sets of data vectors. Such variables help scientists understand the mechanisms behind dataset discrepancies. Although domain-specific methods exist (e.g., in medical imaging, genetics, and computational social science), a general framework remains underdeveloped. We make two separate contributions. (i) We introduce a mathematical notion of the discriminating set of variables: the largest subset containing no variables whose marginals are identical across the two distributions and independent of the remaining variables. We prove this set is uniquely defined and establish further properties, making it a suitable ground truth for theory and evaluation. (ii) We propose two methods for two-sample variable selection that assign weights to variables and optimise them to maximise the power of a kernel two-sample test while enforcing sparsity to downweight redundant variables. To select the regularisation parameter - unknown in practice, as it controls the number of selected variables - we develop two data-driven procedures to balance recall and precision. Synthetic experiments show improved performance over baselines, and we illustrate the approach on two applications using datasets from water-pipe and traffic networks.
要約:
Understanding the behavior of predictive models with random inputs can be achieved through functional decompositions into sub-models that capture interpretable effects of input groups. Building on recent advances in uncertainty quantification, the existence and uniqueness of a generalized Hoeffding decomposition have been established for correlated input variables, using oblique projections onto suitable functional subspaces. This work focuses on the case of Bernoulli inputs and provides a complete analytical characterization of the decomposition. We show that, in this discrete setting, the associated subspaces are one-dimensional and that the decomposition admits a closed-form representation. One of the main contributions of this study is to generalize the classical Fourier--Walsh--Hadamard decomposition for pseudo-Boolean functions to the correlated case, yielding an oblique version when the underlying distribution is not a product measure, and recovering the standard orthogonal form when independence holds. This explicit structure offers a fully interpretable framework, clarifying the contribution of each input combination and theoretically enabling model reverse engineering. From this formulation, explicit sensitivity measures-such as Sobol' indices and Shapley effects-can be directly derived. Numerical experiments illustrate the practical interest of the approach for decision-support problems involving binary features. The paper concludes with perspectives on extending the methodology to high-dimensional settings and to models involving inputs with finite, non-binary support.
著者: Clemens Sch\"achter, Maren Hackenberg, Michelle Pfaffenlehner, F\'elix B. Tambe-Ndonfack, Thorsten Schmidt, Astrid Pechmann, Janbernd Kirschner, Jan Hasenauer, Harald Binder
要約:
Many rare diseases offer limited established treatment options, leading patients to switch therapies when new medications emerge. To analyze the impact of such treatment switches within the low sample size limitations of rare disease trials, it is important to use all available data sources. This, however, is complicated when usage of measurement instruments change during the observation period, for example when instruments are adapted to specific age ranges. The resulting disjoint longitudinal data trajectories, complicate the application of traditional modeling approaches like mixed-effects regression. We tackle this by mapping observations of each instrument to a aligned low-dimensional temporal trajectory, enabling longitudinal modeling across instruments. Specifically, we employ a set of variational autoencoder architectures to embed item values into a shared latent space for each time point. Temporal disease dynamics and treatment switch effects are then captured through a mixed-effects regression model applied to latent representations. To enable statistical inference, we present a novel statistical testing approach that accounts for the joint parameter estimation of mixed-effects regression and variational autoencoders. The methodology is applied to quantify the impact of treatment switches for patients with spinal muscular atrophy. Here, our approach aligns motor performance items from different measurement instruments for mixed-effects regression and maps estimated effects back to the observed item level to quantify the treatment switch effect. Our approach allows for model selection as well as for assessing effects of treatment switching. The results highlight the potential of modeling in joint latent representations for addressing small data challenges.
要約:
The goal of policy learning is to train a policy function that recommends a treatment given covariates to maximize population welfare. There are two major approaches in policy learning: the empirical welfare maximization (EWM) approach and the plug-in approach. The EWM approach is analogous to a classification problem, where one first builds an estimator of the population welfare, which is a functional of policy functions, and then trains a policy by maximizing the estimated welfare. In contrast, the plug-in approach is based on regression, where one first estimates the conditional average treatment effect (CATE) and then recommends the treatment with the highest estimated outcome. This study bridges the gap between the two approaches by showing that both are based on essentially the same optimization problem. In particular, we prove an exact equivalence between EWM and least squares over a reparameterization of the policy class. As a consequence, the two approaches are interchangeable in several respects and share the same theoretical guarantees under common conditions. Leveraging this equivalence, we propose a regularization method for policy learning. The reduction to least squares yields a smooth surrogate that is typically easier to optimize in practice. At the same time, for many natural policy classes the inherent combinatorial hardness of exact EWM generally remains, so the reduction should be viewed as an optimization aid rather than a universal bypass of NP-hardness.
要約:
We study the mean field Langevin dynamics and the associated particle system. By assuming the functional convexity of the energy, we obtain the $L^p$-convergence of the marginal distributions towards the unique invariant measure for the mean field dynamics. Furthermore, we prove the uniform-in-time propagation of chaos in both the $L^2$-Wasserstein metric and relative entropy.
要約:
In machine learning, there is a long history of trying to build neural networks that can learn from fewer example data by baking in strong geometric priors. However, it is not always clear a priori what geometric constraints are appropriate for a given task. Here, we explore the possibility that one can uncover useful geometric inductive biases by studying how training molds the Riemannian geometry induced by unconstrained neural network feature maps. We first show that at infinite width, neural networks with random parameters induce highly symmetric metrics on input space. This symmetry is broken by feature learning: networks trained to perform classification tasks learn to magnify local areas along decision boundaries. This holds in deep networks trained on high-dimensional image classification tasks, and even in self-supervised representation learning. These results begin to elucidate how training shapes the geometry induced by unconstrained neural network feature maps, laying the groundwork for an understanding of this richly nonlinear form of feature learning.
要約:
The effective analysis of high-dimensional Electronic Health Record (EHR) data, with substantial potential for healthcare research, presents notable methodological challenges. Employing predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs, existing techniques often lack statistical certainty concerning the presence of links between entities, especially in scenarios where the utilization of patient-level EHR data is limited due to privacy concerns. In this paper, we propose the first inferential framework for deriving a sparse KG with statistical guarantee based on the dynamic log-linear topic model proposed by \cite{arora2016latent}. Within this model, the KG embeddings are estimated by performing singular value decomposition on the empirical pointwise mutual information matrix, offering a scalable solution. We then establish entrywise asymptotic normality for the KG low-rank estimator, enabling the recovery of sparse graph edges with controlled type I error. Our work uniquely addresses the under-explored domain of statistical inference about non-linear statistics under the low-rank temporal dependent models, a critical gap in existing research. We validate our approach through extensive simulation studies and then apply the method to real-world EHR data in constructing clinical KGs and generating clinical feature embeddings.
要約:
A quantum generalized divergence by definition satisfies the data-processing inequality; as such, the relative decrease in such a divergence under the action of a quantum channel is at most one. This relative decrease is formally known as the contraction coefficient of the channel and the divergence. Interestingly, there exist combinations of channels and divergences for which the contraction coefficient is strictly less than one. Furthermore, understanding the contraction coefficient is fundamental for the study of statistical tasks under privacy constraints. To this end, here we establish upper bounds on contraction coefficients for the hockey-stick divergence under privacy constraints, where privacy is quantified with respect to the quantum local differential privacy (QLDP) framework, and we fully characterize the contraction coefficient for the trace distance under privacy constraints. With the machinery developed, we also determine an upper bound on the contraction of both the Bures distance and quantum relative entropy relative to the normalized trace distance, under QLDP constraints. Next, we apply our findings to establish bounds on the sample complexity of quantum hypothesis testing under privacy constraints. Furthermore, we study various scenarios in which the sample complexity bounds are tight, while providing order-optimal quantum channels that achieve those bounds. Lastly, we show how private quantum channels provide fairness and Holevo information stability in quantum learning settings.
要約:
In the field of non-invasive medical imaging, radiomic features are utilized to measure tumor characteristics. However, these features can be affected by the techniques used to discretize the images, ultimately impacting the accuracy of diagnosis. To investigate the influence of various image discretization methods on diagnosis, it is common practice to evaluate multiple discretization strategies individually. This approach often leads to redundant and time-consuming tasks such as training predictive models and fine-tuning hyperparameters separately. This study examines the feasibility of employing multi-task Bayesian optimization to accelerate the hyperparameters search for classifying benign and malignant pulmonary nodules using RBF SVM. Our findings suggest that multi-task Bayesian optimization significantly accelerates the search for hyperparameters in comparison to a single-task approach. To the best of our knowledge, this is the first investigation to utilize multi-task Bayesian optimization in a critical medical context.
著者: Oz Amram, Luca Anzalone, Joschka Birk, Darius A. Faroughy, Anna Hallin, Gregor Kasieczka, Michael Kr\"amer, Ian Pang, Humberto Reyes-Gonzalez, David Shih
要約:
Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 178M high $p_T$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-$\alpha$ foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training of a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.
要約:
We study the statistical complexity of private linear regression under an unknown, potentially ill-conditioned covariate distribution. Somewhat surprisingly, under privacy constraints the intrinsic complexity is \emph{not} captured by the usual covariance matrix but rather its $L_1$ analogues. Building on this insight, we establish minimax convergence rates for both the central and local privacy models and introduce an Information-Weighted Regression method that attains the optimal rates.
As application, in private linear contextual bandits, we propose an efficient algorithm that achieves rate-optimal regret bounds of order $\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ under joint and local $\alpha$-privacy models, respectively. Notably, our results demonstrate that joint privacy comes at almost no additional cost, addressing the open problems posed by Azize and Basu (2024).
要約:
Accurately quantifying air-sea fluxes is important for understanding air-sea interactions and improving coupled weather and climate systems. This study introduces a probabilistic framework to represent the highly variable nature of air-sea fluxes, which is missing in deterministic bulk algorithms. Assuming Gaussian distributions conditioned on the input variables, we use artificial neural networks and eddy-covariance measurement data to estimate the mean and variance by minimizing negative log-likelihood loss. The trained neural networks provide alternative mean flux estimates to existing bulk algorithms, and quantify the uncertainty around the mean estimates. Stochastic parameterization of air-sea turbulent fluxes can be constructed by sampling from the predicted distributions. Tests in a single-column forced upper-ocean model suggest that changes in flux algorithms influence sea surface temperature and mixed layer depth seasonally. The ensemble spread in stochastic runs is most pronounced during spring restratification.
要約:
We introduce NeuralSurv, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non-parametric, architecture-agnostic framework captures time-varying covariate-risk relationships in continuous time via a novel two-stage data-augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean-field variational algorithm with coordinate-ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, NeuralSurv delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data-scarce regimes by enhancing model calibration and providing robust, well-calibrated uncertainty estimates for the survival function.
要約:
The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function decreased by state updates. While it provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and provide an energy-agnostic characterization of inference dynamics by dynamical systems analysis. In more detail, we first consider relaxing the symmetry and single-head constraints traditionally required in energy-based formulations. Next, we show that analyzing the Jacobian matrix of the state is highly valuable when investigating more general SA architectures without necessarily admitting an energy function. It reveals that the normalization layer plays an essential role in suppressing the Lipschitzness of SA and the Jacobian's complex eigenvalues, which correspond to the oscillatory components of the dynamics. In addition, the Lyapunov exponents computed from the Jacobians demonstrate that the normalized dynamics lie close to a critical state, and this criticality serves as a strong indicator of high inference performance. Furthermore, the Jacobian perspective also enables us to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.
要約:
We study the dynamics of gradient flow with small weight decay on general training losses $F: \mathbb{R}^d \to \mathbb{R}$. Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay $\lambda$ exhibits a two-phase behaviour as $\lambda \to 0$. During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of $F$. Then, at time of order $1/\lambda$, the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the $\ell_2$-norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textit{grokking} effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.
要約:
Gaussian processes (GPs) are widely used for regression and optimization tasks such as Bayesian optimization (BO) due to their expressiveness and principled uncertainty estimates. However, in settings with large datasets corrupted by outliers, standard GPs and their sparse approximations struggle with computational tractability and robustness. We introduce Robust Computation-aware Gaussian Process (RCaGP), a novel GP model that jointly addresses these challenges by combining a principled treatment of approximation-induced uncertainty with robust generalized Bayesian updating. The key insight is that robustness and approximation-awareness are not orthogonal but intertwined: approximations can exacerbate the impact of outliers, and mitigating one without the other is insufficient. Unlike previous work that focuses narrowly on either robustness or approximation quality, RCaGP combines both in a principled and scalable framework, thus effectively managing both outliers and computational uncertainties introduced by approximations such as low-rank matrix multiplications. Our model ensures more conservative and reliable uncertainty estimates, a property we rigorously demonstrate. Additionally, we establish a robustness property and show that the mean function is key to preserving it, motivating a tailored model selection scheme for robust mean functions. Empirical results confirm that solving these challenges jointly leads to superior performance across both clean and outlier-contaminated settings, both on regression and high-throughput Bayesian optimization benchmarks.
要約:
In recent years various supervised learning methods that disentangle aleatoric and epistemic uncertainty based on second-order distributions have been proposed. We argue that these methods fail to capture critical components of epistemic uncertainty, particularly due to the often-neglected component of model bias. To show this, we make use of a more fine-grained taxonomy of epistemic uncertainty sources in machine learning models, and analyse how the classical bias-variance decomposition of the expected prediction error can be decomposed into different parts reflecting these uncertainties. By using a simulation-based evaluation protocol which encompasses epistemic uncertainty due to both procedural- and data-driven uncertainty components, we illustrate that current methods rarely capture the full spectrum of epistemic uncertainty. Through theoretical insights and synthetic experiments, we show that high model bias can lead to misleadingly low estimates of epistemic uncertainty, and common second-order uncertainty quantification methods systematically blur bias-induced errors into aleatoric estimates, thereby underrepresenting epistemic uncertainty. Our findings underscore that meaningful aleatoric estimates are feasible only if all relevant sources of epistemic uncertainty are properly represented.
要約:
Variational inference often struggles with the posterior geometry exhibited by complex hierarchical Bayesian models. Recent advances in flow-based variational families and Variationally Inferred Parameters (VIP) each address aspects of this challenge, but their formal relationship is unexplored. Here, we prove that the combination of VIP and a full-rank Gaussian can be represented exactly as a forward autoregressive flow augmented with a translation term and input from the model's prior. Guided by this theoretical insight, we introduce the Model-Informed Flow (MIF) architecture, which adds the necessary translation mechanism, prior information, and hierarchical ordering. Empirically, MIF delivers tighter posterior approximations and matches or exceeds state-of-the-art performance across a suite of hierarchical and non-hierarchical benchmarks.
要約:
Variational quantum circuits (VQCs) hold promise for quantum machine learning but face challenges in expressivity, trainability, and noise resilience. We propose VQC-MLPNet, a hybrid architecture where a VQC generates the first-layer weights of a classical multilayer perceptron during training, while inference is performed entirely classically. This design preserves scalability, reduces quantum resource demands, and enables practical deployment. We provide a theoretical analysis based on statistical learning and neural tangent kernel theory, establishing explicit risk bounds and demonstrating improved expressivity and trainability compared to purely quantum or existing hybrid approaches. These theoretical insights demonstrate exponential improvements in representation capacity relative to quantum circuit depth and the number of qubits, providing clear computational advantages over standalone quantum circuits and existing hybrid quantum architectures. Empirical results on diverse datasets, including quantum-dot classification and genomic sequence analysis, show that VQC-MLPNet achieves high accuracy and robustness under realistic noise models, outperforming classical and quantum baselines while using significantly fewer trainable parameters.
要約:
Variational Quantum Computing (VQC) faces fundamental scalability barriers, primarily due to the presence of barren plateaus and its sensitivity to quantum noise. To address these challenges, we introduce TensorHyper-VQC, a novel tensor-train (TT)-guided hypernetwork framework that significantly improves the robustness and scalability of VQC. Our framework fully delegates the generation of quantum circuit parameters to a classical TT network, effectively decoupling optimization from quantum hardware. This innovative parameterization mitigates gradient vanishing, enhances noise resilience through structured low-rank representations, and facilitates efficient gradient propagation. Grounded in Neural Tangent Kernel and statistical learning theory, our rigorous theoretical analyses establish strong guarantees on approximation capability, optimization stability, and generalization performance. Extensive empirical results across quantum dot classification, Max-Cut optimization, and molecular quantum simulation tasks demonstrate that TensorHyper-VQC consistently achieves superior performance and robust noise tolerance, including hardware-level validation on a 156-qubit IBM Heron processor. These results position TensorHyper-VQC as a scalable and noise-resilient framework for advancing practical quantum machine learning on near-term devices.
要約:
Causal loop diagrams (CLDs) are widely used in health and environmental research to represent hypothesized causal structures underlying complex problems. However, as qualitative and static representations, CLDs are limited in their ability to support dynamic analysis and inform intervention strategies. We propose Diagrams-to-Dynamics (D2D), a method for converting CLDs into exploratory system dynamics models (SDMs) in the absence of empirical data. With minimal user input - following a protocol to label variables as stocks, flows or auxiliaries, and constants - D2D leverages the structural information already encoded in CLDs, namely, link existence and polarity, to simulate hypothetical interventions and explore potential leverage points under uncertainty. Results suggest that D2D helps distinguish between high- and low-ranked leverage points. We compare D2D to a data-driven SDM constructed from the same CLD and variable labels. D2D showed greater consistency with the data-driven model compared to static network centrality analysis, while providing uncertainty estimates and guidance for future data collection. The D2D method is implemented in an open-source Python package and a web-based application to support further testing and lower the barrier to dynamic modeling for researchers working with CLDs. We expect that additional validation studies will further establish the approach's utility across a broad range of cases and domains.
要約:
Direct Preference Optimization (DPO) has become a standard for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a theoretically grounded framework that extends preference learning to soft, listwise supervision through reference anchoring. Our key theoretical contributions are threefold: (1) we establish that ADPO unifies major learning paradigms, including supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO, as special cases through different choices of target distribution, anchor policy, and temperature; (2) we prove that anchoring induces an implicit trust region governed by the softmax Fisher metric; and (3) we formalize the stability of dynamic anchor updates. Empirically, we discover a task-dependent tradeoff: dynamic anchors suit online exploration, while fixed anchors excel at offline distillation, reducing teacher-student KL divergence by two to three orders of magnitude (170 to 5000 times).
要約:
Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests the hardness of designing general methods, supporting dataset- and architecture-specific approaches instead.
要約:
Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions with closed-form forward and backward. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.