要約:
We study the expected star discrepancy under a newly designed class of non-equal volume partitions. The main contributions are twofold. First, we establish a strong partition principle for the star discrepancy, showing that our newly designed non-equal volume partitions yield stratified sampling point sets with lower expected star discrepancy than classical jittered sampling. Specifically, we prove that $\mathbb{E}(D^{*}_{N}(Z)) < \mathbb{E}(D^{*}_{N}(Y))$, where $Y$ and $Z$ represent jittered sampling and our non-equal volume partition sampling, respectively. Second, we derive explicit upper bounds for the expected star discrepancy under our non-equal volume partition models, which improve upon existing bounds for jittered sampling. Our results provide a theoretical foundation for using non-equal volume partitions in high-dimensional numerical integration.
要約:
Score-based generative models (SGMs) aim at generating samples from a target distribution by approximating the reverse-time dynamics of a stochastic differential equation. Despite their strong empirical performance, classical samplers initialized from a Gaussian distribution require a long time horizon noising typically inducing a large number of discretization steps and high computational cost. In this work, we present a Kullback-Leibler convergence analysis of Variance Exploding diffusion samplers that highlights the critical role of the backward process initialization. Based on this result, we propose a theoretically grounded sampling strategy that learns the reverse-time initialization, directly minimizing the initialization error. The resulting procedure is independent of the specific score training procedure, network architecture, and discretization scheme. Experiments on toy distributions and benchmark datasets demonstrate competitive or improved generative quality while using significantly fewer sampling steps.
要約:
Latent-space Bayesian optimization (LSBO) extends Bayesian optimization to structured domains, such as molecular design, by searching in the continuous latent space of a generative model. However, most LSBO methods assume a fixed objective, whereas real design campaigns often face temporal drift (e.g., evolving preferences or shifting targets). Bringing time-varying BO into LSBO is nontrivial: drift can affect not only the surrogate, but also the latent search space geometry induced by the representation. We propose Time-Aware Latent-space Bayesian Optimization (TALBO), which incorporates time in both the surrogate and the learned generative representation via a GP-prior variational autoencoder, yielding a latent space aligned as objectives evolve. To evaluate timevarying LSBO systematically, we adapt widely used molecular design tasks to drifting multi-property objectives and introduce metrics tailored to changing targets. Across these benchmarks, TALBO consistently outperforms strong LSBO baselines and remains robust across drift speeds and design choices, while remaining competitive under actually time-invariant objectives.
要約:
The Nash-Sutcliffe efficiency ($\text{NSE}$) is a widely used, positively oriented relative measure for evaluating forecasts across multiple time series. However, it lacks a decision-theoretic foundation for this purpose. To address this, we examine its negatively oriented counterpart, which we refer to as Nash-Sutcliffe loss, defined as $L_{\text{NS}} = 1 - \text{NSE}$. We prove that $L_{\text{NS}}$ is strictly consistent for an elicitable and identifiable multi-dimensional functional, which we name the Nash-Sutcliffe functional. This functional is a data-weighted component-wise mean. The common practice of maximizing the average NSE across multiple series is the sample analog of minimizing the expected $L_{\text{NS}}$. Consequently, this operation implicitly assumes that all series originate from a single non-stationary, stochastic process. We introduce Nash-Sutcliffe linear regression, a multi-dimensional model estimated by minimizing the average $L_{\text{NS}}$, which reduces to a data-weighted least squares formulation. By reorienting the sample average loss function, we extend the previously proposed evaluation and estimation framework to forecasting multiple stationary dependent time series with differing stochastic properties. This constitutes a more natural empirical implementation of the $\text{NSE}$ than the earlier formulation. Our results establish a decision-theoretic foundation for $\text{NSE}$-based model estimation and forecast evaluation in large datasets, while further clarifying the benefits of global over local machine learning models.
要約:
In this work, we investigate the generalization properties of random feature methods. Our analysis extends prior results for Tikhonov regularization to a broad class of spectral regularization techniques and further generalizes the setting to operator-valued kernels. This unified framework enables a rigorous theoretical analysis of neural operators and neural networks through the lens of the Neural Tangent Kernel (NTK). In particular, it allows us to establish optimal learning rates and provides a good understanding of how many neurons are required to achieve a given accuracy. Furthermore, we establish minimax rates in the well-specified case and also in the misspecified case, where the target is not contained in the reproducing kernel Hilbert space. These results sharpen and complete earlier findings for specific kernel algorithms.
要約:
Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks, with the corresponding empirical verification; as well as empirical evidence demonstrating that LLC trajectories provide a reliable tool for tracking generalisation dynamics and interpreting phase transitions during training.
要約:
We study adaptive estimation and inference in ill-posed linear inverse problems defined by conditional moment restrictions. Existing regularized estimators such as Regularized DeepIV (RDIV) require prior knowledge of the smoothness of the nuisance function, typically encoded by a beta source condition to tune their regularization parameters. In practice, this smoothness is unknown, and misspecified hyperparameters can lead to suboptimal convergence or instability.
We introduce a discrepancy-principle-based framework for adaptive hyperparameter selection that automatically balances bias and variance without relying on the unknown smoothness parameter. Our framework applies to both RDIV (Li et al. [2024]) and the Tikhonov Regularized Adversarial Estimator (TRAE) (Bennett et al. [2023a]) and achieves the same rates in both weak and strong metrics. Building on this, we construct a fully adaptive doubly robust estimator for linear functionals that attains the optimal rate of the better-conditioned primal or dual problem, providing a practical, theoretically grounded approach for adaptive inference in ill-posed econometric models.
要約:
We study experiments on interacting populations of humans and AI agents, where both unit types and the interaction network remain unobserved. Although causal effects propagate throughout the system, the goal is to estimate effects on humans. Examples include online platforms where human users interact alongside AI-driven accounts. We assume a human-AI prior that gives each unit a probability of being human. While humans cannot be distinguished at the unit level, the prior allows us to compute the average human composition within large subpopulations. We then model outcome dynamics through a causal message passing (CMP) framework and analyze sample-mean outcomes across subpopulations. We show that by constructing subpopulations that vary in expected human composition and treatment exposure, one can consistently recover human-specific causal effects. Our results characterize when distributional knowledge of population composition (without observing unit types or the interaction network) is sufficient for identification. We validate the approach on a simulated human-AI platform driven by behaviorally differentiated LLM agents. Together, these results provide a theoretical and practical framework for experimentation in emerging human-AI systems.
要約:
Conformal prediction (CP) provides finite-sample, distribution-free marginal coverage, but standard conformal regression intervals can be inefficient under heteroscedasticity and skewness. In particular, popular constructions such as conformalized quantile regression (CQR) often inherit a fixed notion of center and enforce equal-tailed errors, which can displace the interval away from high-density regions and produce unnecessarily wide sets. We propose Co-optimization for Adaptive Conformal Prediction (CoCP), a framework that learns prediction intervals by jointly optimizing a center $m(x)$ and a radius $h(x)$.CoCP alternates between (i) learning $h(x)$ via quantile regression on the folded absolute residual around the current center, and (ii) refining $m(x)$ with a differentiable soft-coverage objective whose gradients concentrate near the current boundaries, effectively correcting mis-centering without estimating the full conditional density. Finite-sample marginal validity is guaranteed by split-conformal calibration with a normalized nonconformity score. Theory characterizes the population fixed point of the soft objective and shows that, under standard regularity conditions, CoCP asymptotically approaches the length-minimizing conditional interval at the target coverage level as the estimation error and smoothing vanish. Experiments on synthetic and real benchmarks demonstrate that CoCP yields consistently shorter intervals and achieves state-of-the-art conditional-coverage diagnostics.
要約:
Modern machine learning models can be accurate on average yet still make mistakes that dominate deployment cost. We introduce Locus, a distribution-free wrapper that produces a per-input loss-scale reliability score for a fixed prediction function. Rather than quantifying uncertainty about the label, Locus models the realized loss of the prediction function using any engine that outputs a predictive distribution for the loss given an input. A simple split-calibration step turns this function into a distribution-free interpretable score that is comparable across inputs and can be read as an upper loss level. The score is useful on its own for ranking, and it can optionally be thresholded to obtain a transparent flagging rule with distribution-free control of large-loss events. Experiments across 13 regression benchmarks show that Locus yields effective risk ranking and reduces large-loss frequency compared to standard heuristics.
要約:
We introduce a supervised dimensionality reduction methodology for categorical (and discretized mixed-type) data based on a density-matrix construction induced by class-conditional frequencies. Given a labeled dataset encoded in a one-hot survey space, we assemble a frequency matrix whose columns aggregate feature occurrences within each class, and define a normalized Gram-type operator that satisfies the axioms of a density matrix. The resulting representation admits an intrinsic rank bound controlled by the number of classes, enabling low-dimensional spectral embeddings via dominant eigenmodes.
Classification is performed in the reduced space through class-conditional kernel density estimation and a maximum-likelihood decision rule.
We establish structural invariances, provide complexity estimates, and validate the approach on synthetic benchmarks probing high cardinality, sparsity, noise, and class imbalance.
要約:
Extreme weather events, such as windstorms and heatwaves, are driven by persistent atmospheric circulation patterns that evolve over several consecutive days. While traditional circulation-based studies often focus on instantaneous atmospheric states, capturing the temporal evolution, or trajectory, of these spatial fields is essential for characterizing rare and potentially impactful atmospheric behavior. However, performing an exhaustive similarity search on multi-decadal, continental-scale gridded datasets presents significant computational and memory challenges. In this paper, we propose TRAKNN (TRajectory Aware KNN), a fully unsupervised and data-agnostic framework for detecting geometrically rare short trajectories in spatio-temporal data with an exact kNN approach. TRAKNN leverages a recurrence-based algorithm that decouples computational complexity from trajectory length and efficient batch operations, maximizing computational intensity. These optimizations enable exhaustive analysis on standard workstations, either on CPU or on GPU. We evaluate our approach on 75 years of daily European sea-level pressure data. Our results illustrate that rare trajectories identified by TRAKNN correspond to physically coherent atmospheric anomalies and align with independent extreme-event databases.
要約:
Instrumental variable (IV) and proximal causal learning (Proxy) methods are central frameworks for causal inference in the presence of unobserved confounding. Despite substantial methodological advances, existing approaches rarely provide reliable epistemic uncertainty (EU) quantification. We address this gap through a Deconditional Gaussian Process (DGP) framework for uncertainty-aware causal learning. Our formulation recovers popular kernel estimators as the posterior mean, ensuring predictive precision, while the posterior variance yields principled and well-calibrated EU. Moreover, the probabilistic structure enables systematic model selection via marginal log-likelihood optimization. Empirical results demonstrate strong predictive performance alongside informative EU quantification, evaluated via empirical coverage frequencies and decision-aware accuracy rejection curves. Together, our approach provides a unified, practical solution for causal inference under unobserved confounding with reliable uncertainty.
要約:
LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and finite-sample recovery under shared confounders, and we quantify the systematic bias incurred when aggregation models omit confounding latent factors. Across 12 public benchmarks spanning continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8\%. Code is released in \href{https://github.com/SprocketLab/CARE}{https://github.com/SprocketLab/CARE}.
要約:
Large language models (LLMs) have gained significant attention by many researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries based on prompts. Yet evaluating the quality of these summaries remains challenging due to the complexity of language. To this end, in this paper we suggest a new method of LLM summary inference with BERT-SVD-based direction metric and SOFARI (LIDS) that assesses the summary accuracy equipped with interpretable key words for layered themes. The LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and original text, leveraging the BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.
要約:
In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that quantifies individual-level prediction instability by using two complementary diagnostics: empirical prediction interval width (ePIW), which captures variability in continuous risk estimates, and empirical decision flip rate (eDFR), which measures instability in threshold-based clinical decisions. We apply these diagnostics to simulated data and GUSTO-I clinical dataset. Across observed settings, we find that for flexible machine-learning models, randomness arising solely from optimization and initialization can induce individual-level variability comparable to that produced by resampling the entire training dataset. Neural networks exhibit substantially greater instability in individual risk predictions compared to logistic regression models. Risk estimate instability near clinically relevant decision thresholds can alter treatment recommendations. These findings that stability diagnostics should be incorporated into routine model validation for assessing clinical reliability.
要約:
This paper investigates numerical solution methods for the Schatten-$p$ quasi-norm regularized problem with $p \in [0,1]$, which has been widely studied for finding low-rank solutions of linear inverse problems and gained successful applications in various mathematics and applied science fields. We propose a dynamic proximal gradient algorithm that, through the use of the Cayley transformation, avoids computationally expensive singular value decompositions at each iteration, thereby significantly reducing the computational complexity. The algorithm incorporates two step size selection strategies: an adaptive backtracking search and an explicit step size rule. We establish the sublinear convergence of the proposed algorithm for all $p \in [0,1]$ within the framework of the Kurdyka-Lojasiewicz property. Notably, under mild assumptions, we show that the generated sequence converges to a stationary point of the objective function of the problem. For the special case when $p=1$, the linear convergence is further proved under the strict complementarity-type regularity condition commonly used in the linear convergence analysis of the forward-backward splitting algorithms. Preliminary numerical results validate the superior computational efficiency of the proposed algorithm.
要約:
Inverse problems constrained by partial differential equations are often ill-conditioned due to noisy and incomplete data or inherent non-uniqueness. A prominent example is full waveform inversion, which estimates Earth's subsurface properties by fitting seismic measurements subject to the wave equation, where ill-conditioning is inherent to noisy, band-limited, finite-aperture data and shadow zones. Casting the inverse problem into a Bayesian framework allows for a more comprehensive description of its solution, where instead of a single estimate, the posterior distribution characterizes non-uniqueness and can be sampled to quantify uncertainty. However, no clear procedure exists for translating hard physical constraints, such as the wave equation, into prior distributions amenable to existing sampling techniques. To address this, we perform posterior sampling in the dual space using an augmented Lagrangian formulation, which translates hard constraints into penalties amenable to sampling algorithms while ensuring their exact satisfaction. We achieve this by seamlessly integrating the alternating direction method of multipliers (ADMM) with Stein variational gradient descent (SVGD) -- a particle-based sampler -- where the constraint is relaxed at each iteration and multiplier updates progressively enforce satisfaction. This enables constrained posterior sampling while inheriting the favorable conditioning properties of dual-space solvers, where partial constraint relaxation allows productive updates even when the current model is far from the true solution. We validate the method on a stylized Rosenbrock conditional inference problem and on frequency-domain full waveform inversion for a Gaussian anomaly model and the Marmousi~II benchmark, demonstrating well-calibrated uncertainty estimates and posterior contraction with increasing data coverage.
要約:
Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
要約:
We propose a retrodictive forecasting paradigm for time series: instead of predicting the future from the past, we identify the future that best explains the observed present via inverse MAP optimization over a Conditional Variational Autoencoder (CVAE). This conditioning is a statistical modeling choice for Bayesian inversion; it does not assert that future events cause past observations. The approach is theoretically grounded in an information-theoretic arrow-of-time measure: the symmetrized Kullback-Leibler divergence between forward and time-reversed trajectory ensembles provides both the conceptual rationale and an operational GO/NO-GO diagnostic for applicability. We implement the paradigm as MAP inference over an inverse CVAE with a learned RealNVP normalizing-flow prior and evaluate it on six time series cases: four synthetic processes with controlled temporal asymmetry and two ERA5 reanalysis datasets (wind speed and solar irradiance). The work makes four contributions: (i) a formal retrodictive inference formulation; (ii) an inverse CVAE architecture; (iii) a model-free irreversibility diagnostic; and (iv) a falsifiable validation protocol with four pre-specified predictions. All pre-specified predictions are empirically supported: the diagnostic correctly classifies all six cases; the learned flow prior improves over an isotropic Gaussian baseline on GO cases; the inverse MAP yields no spurious advantage on time-reversible dynamics; and on irreversible GO cases, it achieves competitive or superior RMSE relative to forward baselines, with a statistically significant 17.7% reduction over a forward MLP on ERA5 solar irradiance. These results provide a structured proof-of-concept that retrodictive forecasting can constitute a viable alternative to conventional forward prediction when statistical time-irreversibility is present and exploitable.
要約:
We study computationally and statistically efficient reinforcement learning under the linear $Q^{\pi}$ realizability assumption, where any policy's $Q$-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable, or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, under the linear $Q^{\pi}$ realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards and deterministic transitions. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{d^2H^6T})$, where $d$ is the dimensionality of the feature space, $H$ is the horizon length, and $T$ is the total number of episodes. Our regret bound is optimal for linear (contextual) bandits which is a special case of our setting with $H = 1$.
Existing policy iteration algorithms under the same setting heavily rely on repeatedly sampling the same state by access to the simulator, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only high-confidence part of the trajectory data and freezing the policy for well-explored states, which ensures that all data used by our algorithm remains effectively on-policy during the whole course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.
要約:
Despite exceptional predictive performance of Deep sequence models (DSMs), the main concern of their deployment centers around the lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables with rules of probability. Notably, Bayesian methods leverage Bayes' rule to express our belief of unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied in deep neural networks, are prior specification and approximation quality. In Chapter 3 & 4, we investigate how the architectures of DSMs themselves can be informative for the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer based on the similarity between attention and sparse Gaussian process. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian process, which successfully memorizes the history in online learning. In addition to the progress of DSMs in predictive tasks, sequential generative models consisting of a sequence of latent variables are popularized in the domain of deep generative models. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5, we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desired probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, which bridges the gap between DSMs and probabilistic models, leading to mutually reinforced improvement.
要約:
We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or robust dynamic programming principle. Under the weak communication assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate finite-time performance of robust optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transients and deriving regret-based lower bounds on transient values. Finally, we construct an epoch-based policy that combines an optimal stationary policy for the worst-case model with an anytime-valid sequential test and an online learning fallback, achieving a constant-order transient value.
要約:
Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
要約:
We extend the unified kernel framework for transport equations and Koopman eigenfunctions, developed in previous work by the authors for deterministic systems, to stochastic differential equations (SDEs). In the deterministic setting, three analytically grounded constructions-Lions-type variational principles, Green's function convolution, and resolvent operators along characteristic flows--were shown to yield identical reproducing kernels. For stochastic systems, the Koopman generator includes a second-order diffusion term, transforming the first-order hyperbolic transport equation into a second-order elliptic-parabolic PDE. This fundamental change necessitates replacing the method of characteristics with probabilistic representations based on the Feynman--Kac formula.
Our main contributions include: (i) extension of all three kernel constructions to stochastic systems via Feynman--Kac path-integral representations; (ii) proof of kernel equivalence under uniform ellipticity assumptions; (iii) a collocation-based computational framework incorporating second-order differential operators; (iv) error bounds separating RKHS approximation error from Monte Carlo sampling error; (v) analysis of how diffusion affects numerical conditioning; and (vi) connections to generator EDMD, diffusion maps, and kernel analog forecasting. Numerical experiments on Ornstein--Uhlenbeck processes, nonlinear SDEs with varying diffusion strength, and multi-dimensional systems validate the theoretical developments and demonstrate that moderate diffusion can improve numerical stability through elliptic regularization.
要約:
Risk forecasts in financial regulation and internal management are calculated through historical data. The unknown structural changes of financial data poses a substantial challenge in selecting an appropriate look-back window for risk modeling and forecasting. We develop a data-driven online learning method, called the bootstrap-based adaptive window selection (BAWS), that adaptively determines the window size in a sequential manner. A central component of BAWS is to compare the realized scores against a data-dependent threshold, which is evaluate based on an idea of bootstrap. The proposed method is applicable to the forecast of risk measures that are elicitable individually or jointly, such as the Value-at-Risk (VaR) and the pair of the VaR and the corresponding Expected Shortfall. Through simulation studies and empirical analyses, we demonstrate that BAWS generally outperforms the standard rolling window approach and the recently developed method of stability-based adaptive window selection, especially when there are structural changes in the data-generating process.
要約:
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE), derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm -- one with access to a value function that quantifies the goodness of its learning policy at each training iteration -- and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments further validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
要約:
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: $(i)$ balanced pretraining data can induce latent capabilities later activated during post-training, and $(ii)$ SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
要約:
We propose two nonconvex regularization methods, LogLOP-l2/l1 and AdaLOP-l2/l1, for recovering block-sparse signals with unknown block partitions. These methods address the underestimation bias of existing convex approaches by extending log-sum penalty and the Minimax Concave Penalty (MCP) to the block-sparse domain via novel variational formulations. Unlike Generalized Moreau Enhancement (GME) and Bayesian methods dependent on the squared-error data fidelity term, our proposed methods are compatible with a broad range of data fidelity terms. We develop efficient Alternating Direction Method of Multipliers (ADMM)-based algorithms for these formulations that exhibit stable empirical convergence. Numerical experiments on synthetic data, angular power spectrum estimation, and denoising of nanopore currents demonstrate that our methods outperform state-of-the-art baselines in estimation accuracy.
要約:
When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-\delta$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
要約:
We revisit the framework of Smart PAC learning, which seeks supervised learners which compete with semi-supervised learners that are provided full knowledge of the marginal distribution on unlabeled data. Prior work has shown that such marginal-by-marginal guarantees are possible for "most" marginals, with respect to an arbitrary fixed and known measure, but not more generally. We discover that this failure can be attributed to an "indistinguishability" phenomenon: There are marginals which cannot be statistically distinguished from other marginals that require different learning approaches. In such settings, semi-supervised learning cannot certify its guarantees from unlabeled data, rendering them arguably non-actionable.
We propose relatively smart learning, a new framework which demands that a supervised learner compete only with the best "certifiable" semi-supervised guarantee. We show that such modest relaxation suffices to bypass the impossibility results from prior work. In the distribution-free setting, we show that the OIG learner is relatively smart up to squaring the sample complexity, and show that no supervised learning algorithm can do better. For distribution-family settings, we show that relatively smart learning can be impossible or can require idiosyncratic learning approaches, and its difficulty can be non-monotone in the inclusion order on distribution families.
要約:
Sparse plus Low-Rank $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $(\mathbf{W} \approx \mathbf{S} + \mathbf{LR})$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal as it can enhance any $(\mathbf{S} + \mathbf{LR})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to SOTA $(\mathbf{S} + \mathbf{LR})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
要約:
Graph Neural Networks (GNNs) face fundamental limitations in expressivity and capturing structural heterogeneity. Standard message-passing architectures are constrained by the 1-dimensional Weisfeiler-Leman (1-WL) test, unable to distinguish graphs beyond degree sequences, and aggregate information uniformly from neighbors, failing to capture how nodes occupy different structural positions within higher-order patterns. While methods exist to achieve higher expressivity, they incur prohibitive computational costs and lack unified frameworks for flexibly encoding diverse structural properties. To address these limitations, we introduce Invariant-Stratified Propagation (ISP), a framework comprising both a novel WL variant (ISP-WL) and its efficient neural network implementation (ISPGNN). ISP stratifies nodes according to graph invariants, processing them in hierarchical strata that reveal structural distinctions invisible to 1-WL. Through hierarchical structural heterogeneity encoding, ISP quantifies differences in nodes' structural positions within higher-order patterns, distinguishing interactions where participants occupy different roles from those with uniform participation. We provide formal theoretical analysis establishing enhanced expressivity beyond 1-WL, convergence guarantees, and inherent resistance to oversmoothing. Extensive experiments across graph classification, node classification, and influence estimation demonstrate consistent improvements over standard architectures and state-of-the-art expressive baselines.
要約:
Non-negative matrix factorization (NMF) is widely used for parts-based representations, yet formal inference for covariate effects is rarely available when the basis is learned under non-negativity. We introduce non-negative matrix factorization with random effects (NMF-RE), a mean-structure latent-variable model $Y=X(\Theta A+U)+\mathcal{E}$ that combines covariate-driven scores with unit-specific deviations. Random effects act as a working device for modeling heterogeneity and controlling complexity; we monitor their effective degrees of freedom and enforce a df-based cap to prevent near-saturated fits. Estimation alternates closed-form ridge (BLUP-like) updates for $U$ with multiplicative non-negative updates for $X$ and $\Theta$. For inference on $\Theta$, we condition on $(\widehat X,\widehat U)$ and obtain fast uncertainty quantification via asymptotic linearization, a one-step Newton update, and a multiplier (wild) bootstrap; this avoids repeated constrained re-optimization. Simulations include a targeted stress test showing that, without df control, the random-effects penalty can collapse and inference for $\Theta$ becomes degenerate, whereas the df-cap prevents this failure mode. The non-negativity constraint induces sparse, parts-based loadings -- a measurement-side variable selection -- while inference on $\Theta$ identifies which covariates affect which components, providing covariate-side selection. Longitudinal, psychometric, spatial-flow, and text examples further illustrate stable, interpretable covariate-effect inference.
要約:
We consider an optimization problem of an expensive-to-evaluate black-box function, in which we can obtain noisy function values in parallel. For this problem, parallel Bayesian optimization (PBO) is a promising approach, which aims to optimize with fewer function evaluations by selecting a diverse input set for parallel evaluation. However, existing PBO methods suffer from poor practical performance or lack theoretical guarantees. In this study, we propose a PBO method, called randomized kriging believer (KB), based on a well-known KB heuristic and inheriting the advantages of the original KB: low computational complexity, a simple implementation, versatility across various BO methods, and applicability to asynchronous parallelization. Furthermore, we show that our randomized KB achieves Bayesian expected regret guarantees. We demonstrate the effectiveness of the proposed method through experiments on synthetic and benchmark functions and emulators of real-world data.
要約:
We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a data-dependent spectral initialization of parameters which lie near the manifold of global minima with high probability.
要約:
Due to their efficiency and small size, decision trees and random forests are popular machine learning models used for classification on resource-constrained systems. In such systems, the available execution time for inference in a random forest might not be sufficient for a complete model execution. Ideally, the already gained prediction confidence should be retained. An anytime algorithm is designed to be able to be aborted anytime, while giving a result with an increasing quality over time. Previous approaches have realized random forests as anytime algorithms on the granularity of trees, stopping after some but not all trees of a forest have been executed. However, due to the way decision trees subdivide the sample space in every step, an increase in prediction quality is achieved with every additional step in one tree. In this paper, we realize decision trees and random forest as anytime algorithms on the granularity of single steps in trees. This approach opens a design space to define the step order in a forest, which has the potential to optimize the mean accuracy. We propose the Optimal Order, which finds a step order with a maximal mean accuracy in exponential runtime and the polynomial runtime heuristics Forward Squirrel Order and Backward Squirrel Order, which greedily maximize the accuracy for each additional step taken down and up the trees, respectively.
Our evaluation shows, that the Backward Squirrel Order performs $\sim94\%$ as well as the Optimal Order and $\sim99\%$ as well as all other step orders.
要約:
We investigate the limiting behavior of discrete determinantal point processes (DPPs) towards continuous DPPs when the size of the set to sample from goes to infinity. We propose a non-asymptotic characterization of this limit in terms of the concentration of statistics associated to these processes, which we refer to as "weak coherency". This allows to translate statistical guarantees from the limiting process to the original, discrete one. Our main result describes sufficient conditions for weak coherency to hold. In particular, our study encompasses settings where both the kernel of the continuous process and its underlying space are inaccessible, or when the discrete marginal kernel is a noisy version of its continuous counterpart. We illustrate our theory on several examples. We prove that a discrete multivariate orthogonal polynomial ensemble can be used to produce coresets strictly smaller than independent sampling for the same error. We propose a process achieving repulsive sampling on an unknown manifold from a set of points sampled from an unknown density. Finally, we show that continuous DPPs can be obtained as limits on random graphs with Bernoulli edges, even when only observing the graph structure. We obtain interesting byproduct results along the way.
要約:
In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik's Cubes. For the latter problem, our approach shows competitive results with state-of-the-art machine learning approaches designed specifically for this task in terms of the solution length, while requiring smaller search budget at test-time.
要約:
Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori.
We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
要約:
Modern e-commerce platforms employ various auction mechanisms to allocate paid slots for a given item. To scale this approach to the millions of auctions, the platforms suggest promotion tools based on the autobidding algorithms. These algorithms typically depend on the Click-Through-Rate (CTR) and Conversion-Rate (CVR) estimates provided by a pre-trained machine learning model. However, the predictions of such models are uncertain and can significantly affect the performance of the autobidding algorithm. To address this issue, we propose the DenoiseBid method, which corrects the generated CTRs and CVRs to make the resulting bids more efficient in auctions. The underlying idea of our method is to employ a Bayesian approach and replace noisy CTR or CVR estimates with those from recovered distributions. To demonstrate the performance of the proposed approach, we perform extensive experiments on the synthetic, iPinYou, and BAT datasets. To evaluate the robustness of our approach to the noise scale, we use synthetic noise and noise estimated from the predictions of the pre-trained machine learning model.
要約:
Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce \textbf{\emph{Constrained Particle Seeking (CPS)}}, a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at https://github.com/deng-ai-lab/CPS.
要約:
We study generalized linear prediction under a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles mis-specification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
要約:
Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
要約:
Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models' strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
要約:
Leave-one-out (LOO) prediction provides a principled, data-dependent measure of generalization, yet guarantees in fully transductive settings remain poorly understood beyond specialized models. We introduce Median of Level-Set Aggregation (MLSA), a general aggregation procedure based on empirical-risk level sets around the ERM. For arbitrary fixed datasets and losses satisfying a mild monotonicity condition, we establish a multiplicative oracle inequality for the LOO error of the form \[ LOO_S(\hat{h}) \;\le\; C \cdot \frac{1}{n} \min_{h\in H} L_S(h) \;+\; \frac{Comp(S,H,\ell)}{n}, \qquad C>1. \]
The analysis is based on a local level-set growth condition controlling how the set of near-optimal empirical-risk minimizers expands as the tolerance increases. We verify this condition in several canonical settings. For classification with VC classes under the 0-1 loss, the resulting complexity scales as $O(d \log n)$, where $d$ is the VC dimension. For finite hypothesis and density classes under bounded or log loss, it scales as $O(\log |H|)$ and $O(\log |P|)$, respectively. For logistic regression with bounded covariates and parameters, a volumetric argument based on the empirical covariance matrix yields complexity scaling as $O(d \log n)$ up to problem-dependent factors.
要約:
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
要約:
Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K\log^2T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
要約:
Reservoir expansion can improve online independent component analysis (ICA) under nonlinear mixing, yet top-$n$ whitening may discard injected features. We formalize this bottleneck as \emph{reservoir subspace injection} (RSI): injected features help only if they enter the retained eigenspace without displacing passthrough directions. RSI diagnostics (IER, SSO, $\rho_x$) identify a failure mode in our top-$n$ setting: stronger injection increases IER but crowds out passthrough energy ($\rho_x: 1.00\!\rightarrow\!0.77$), degrading SI-SDR by up to $2.2$\,dB. A guarded RSI controller preserves passthrough retention and recovers mean performance to within $0.1$\,dB of baseline $1/N$ scaling. With passthrough preserved, RE-OICA improves over vanilla online ICA by $+1.7$\,dB under nonlinear mixing and achieves positive SI-SDR$_{\mathrm{sc}}$ on the tested super-Gaussian benchmark ($+0.6$\,dB).
要約:
Reasoning problems such as Sudoku and ARC-AGI remain challenging for neural networks. The structured problem solving architecture family of Recurrent Reasoning Models (RRMs), including Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), offer a compact alternative to large language models, but currently handle symbol symmetries only implicitly via costly data augmentation. We introduce Symbol-Equivariant Recurrent Reasoning Models (SE-RRMs), which enforce permutation equivariance at the architectural level through symbol-equivariant layers, guaranteeing identical solutions under symbol or color permutations. SE-RRMs outperform prior RRMs on 9x9 Sudoku and generalize from just training on 9x9 to smaller 4x4 and larger 16x16 and 25x25 instances, to which existing RRMs cannot extrapolate. On ARC-AGI-1 and ARC-AGI-2, SE-RRMs achieve competitive performance with substantially less data augmentation and only 2 million parameters, demonstrating that explicitly encoding symmetry improves the robustness and scalability of neural reasoning. Code is available at https://github.com/ml-jku/SE-RRM.
要約:
An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
要約:
Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable "unaffected" (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of "unaffected" calibration examples degrades coverage via an explicit function $g(\delta,n)$ of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators $Z_{a,i}=\mathbf{1}\{i\in\mathrm{desc}(a)\}$ needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to $\delta=0.30$, the corrected procedure maintains $\ge 0.95$ coverage while uncorrected selective CP degrades to $0.867$. A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
要約:
This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are smaller in scale regarding the number of used datasets and extent of empirical evaluation. They often lack appropriate tuning or evaluation procedures, while other comparison studies focus on qualitative reviews rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable guidelines for practitioners. We benchmark 19 models, ranging from classical statistical approaches to many common machine learning methods, on 34 publicly available datasets. The benchmark tunes models using both a discrimination measure (Harrell's C-index) and a scoring rule (Integrated Survival Brier Score), and evaluates them across six metrics covering discrimination, calibration, and overall predictive performance. Despite superior average ranks in overall predictive performance from individual learners like oblique random survival forests and likelihood-based boosting, and better discrimination rankings from multiple boosting- and tree-based methods as well as parametric survival models, no method significantly outperforms the commonly used Cox proportional hazards model for either tuning measure. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for most practitioners. All code, data, and results are publicly available on GitHub https://github.com/slds-lmu/paper_2023_survival_benchmark
要約:
Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratifiction of patients or samples. However, the growth in availability of high-dimensional categorical data, including `omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in term of efficiency, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We demonstrate VICatMix's utility in integrative cluster analysis with different `omics datasets, enabling the discovery of novel subtypes.
\textbf{Availability:} VICatMix is freely available as an R package, incorporating C++ for faster computation, at https://github.com/j-ackierao/VICatMix.
要約:
Dynamic feature transformation (the rich regime) does not always align with predictive performance (better representation), yet accuracy is often used as a proxy for richness, limiting analysis of their relationship. We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case. The metric is empirically more stable than existing alternatives and captures known lazy-torich transitions (e.g., grokking) without relying on accuracy. We further use it to examine how training factors (e.g., learning rate) relate to richness, confirming recognized assumptions and highlighting new observations (e.g., batch normalization promotes rich dynamics). An eigendecomposition-based visualization is also introduced to support interpretability, together providing a diagnostic tool for studying the relationship between training factors, dynamics, and representations.
要約:
Data assimilation techniques are crucial for accurately tracking complex dynamical systems by integrating observational data with numerical forecasts. Recently, score-based data assimilation methods emerged as powerful tools for high-dimensional and nonlinear data assimilation. However, these methods still incur substantial computational costs due to the need for expensive forward simulations. In this work, we propose LD-EnSF, a novel score-based data assimilation method that eliminates the need for full-space simulations by evolving dynamics directly in a compact latent space. Our method incorporates improved Latent Dynamics Networks (LDNets) to learn accurate surrogate dynamics and introduces a history-aware LSTM encoder to effectively process sparse and irregular observations. By operating entirely in the latent space, LD-EnSF achieves speedups orders of magnitude over existing methods while maintaining high accuracy and robustness. We demonstrate the effectiveness of LD-EnSF on several challenging high-dimensional benchmarks with highly sparse (in both space and time) and noisy observations.
要約:
This paper introduces a novel approach to learning sparsity-promoting regularizers for solving linear inverse problems. We develop a bilevel optimization framework to select an optimal synthesis operator, denoted as $B$, which regularizes the inverse problem while promoting sparsity in the solution. The method leverages statistical properties of the underlying data and incorporates prior knowledge through the choice of $B$. We establish the well-posedness of the optimization problem, provide theoretical guarantees for the learning process, and present sample complexity bounds. The approach is demonstrated through theoretical infinite-dimensional examples, including compact perturbations of a known operator and the problem of learning the mother wavelet, and through extensive numerical simulations. This work extends previous efforts in Tikhonov regularization by addressing non-differentiable norms and proposing a data-driven approach for sparse regularization in infinite dimensions.
要約:
We study the mixing time of the projected Langevin algorithm (LA) and the privacy curve of noisy Stochastic Gradient Descent (SGD), beyond nonexpansive iterations. Specifically, we derive new mixing time bounds for the projected LA which are, in some important cases, dimension-free and poly-logarithmic on the accuracy, closely matching the existing results in the smooth convex case. Additionally, we establish new upper bounds for the privacy curve of the subsampled noisy SGD algorithm. These bounds show a crucial dependency on the regularity of gradients, and are useful for a wide range of convex losses beyond the smooth case. Our analysis relies on a suitable extension of the Privacy Amplification by Iteration (PABI) framework (Feldman et al., 2018; Altschuler and Talwar, 2022, 2023) to noisy iterations whose gradient map is not necessarily nonexpansive. This extension is achieved by designing an optimization problem which accounts for the best possible R\'enyi divergence bound obtained by an application of PABI, where the tractability of the problem is crucially related to the modulus of continuity of the associated gradient mapping. We show that, in several interesting cases -- namely the nonsmooth convex, weakly smooth and (strongly) dissipative -- such optimization problem can be solved exactly and explicitly, yielding the tightest possible PABI-based bounds.
要約:
We investigate the problem of weight uncertainty originally proposed by [Blundell et al. (2015). Weight uncertainty in neural networks. In International conference on machine learning, 1613-1622, PMLR.] in the context of neural networks designed for regression tasks, and we extend their framework by incorporating variance uncertainty into the model. Our analysis demonstrates that explicitly modeling uncertainty in the variance parameter can significantly enhance the predictive performance of Bayesian neural networks. By considering a full posterior distribution over the variance, the model achieves improved generalization compared to approaches that treat variance as fixed or deterministic. We evaluate the generalization capability of our proposed approach through a function approximation example and further validate it on the riboflavin genetic dataset. Our exploration encompasses both fully connected dense networks and dropout neural networks, employing Gaussian and spike-and-slab priors respectively for the network weights, providing a comprehensive assessment of how variance uncertainty affects model performance across different architectural choices.
要約:
We introduce Galerkin-ARIMA and Galerkin-SARIMA, a projection-based extension of classical ARIMA/SARIMA that replaces rigid linear lag operators with low-dimensional Galerkin basis expansions while preserving the familiar AR-MA decomposition. Experiments on synthetic series and on quarterly GDP and daily S&P 500 returns show that Galerkin-SARIMA matches or improves forecast accuracy relative to classical ARIMA/SARIMA. Estimation is closed-form via a two-stage least-squares procedure, and the closed-form two-stage estimator enables efficient rolling-window re-estimation while preserving the familiar AR-MA operator structure, facilitating applications in central bank forecasting and portfolio risk management. We establish approximation-estimation trade-offs under weak dependence, provide consistency and asymptotic distributional results for the unpenalized estimator, compare prediction risk to classical SARIMA, and propose information-criterion selection of basis size. We further develop bootstrap-based inference for exogenous factor blocks and block-bootstrap prediction intervals that account for serial dependence and the two-stage generated-regressor structure.
要約:
While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are naturally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found in https://github.com/David-cripto/RealUID.
要約:
The global dimensionality of a neural representation manifold provides rich insight into the computational process underlying both artificial and biological neural networks. However, all existing measures of global dimensionality are sensitive to the number of samples, i.e., the number of rows and columns of the sample matrix. We show that, in particular, the participation ratio of eigenvalues, a popular measure of global dimensionality, is highly biased with small sample sizes, and propose a bias-corrected estimator that is more accurate with finite samples and with noise. On synthetic data examples, we demonstrate that our estimator can recover the true known dimensionality. We apply our estimator to neural brain recordings, including calcium imaging, electrophysiological recordings, and fMRI data, and to the neural activations in a large language model and show our estimator is invariant to the sample size. Finally, our estimators can additionally be used to measure the local dimensionalities of curved neural manifolds by weighting the finite samples appropriately.
要約:
When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error, which is proved to be the optimal compression rate. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. Implication (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
要約:
Fourier analysis on the Boolean hypercube is fundamentally defined as the orthogonal decomposition of the space of pseudo-Boolean functions with respect to the uniform probability measure. In this work, we propose an ANOVA-based generalization of the Fourier decomposition on the Boolean hypercube endowed with any arbitrary probability measure. We provide an \emph{explicit} decomposition basis which generalizes the Walsh-Hadamard (or parity functions) basis under any \emph{arbitrary} probability measure on the Boolean hypercube. We formulate the computation of the entire functional decomposition as a least squares problem and also provide a method to address the classical \emph{curse of dimensionality} challenge. We provide a comprehensive generalization of Fourier analysis on the Boolean hypercube, enabling the handling of non-uniform configuration spaces inherent to real-world machine learning tasks, \textit{e.g.} when dealing with \emph{one-hot encoded} features. Finally, we demonstrate its practical impact in the field of explainable AI, by conducting comparative studies with feature attribution methods such as SHAP or TreeHFD.
要約:
We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.
要約:
This paper studies how geopolitical and geoeconomic shocks transmit to sovereign risk. Using a daily panel of CDS spreads, financial variables, and news-based indicators for 42 advanced and emerging economies over 2018-2025, we estimate nonlinear Machine Learning models that capture the interactions and threshold effects through which these shocks operate. A Shapley-Taylor decomposition exactly partitions predicted spreads into four channels: Direct, Global Financial Cycle, Uncertainty, and Local. The decomposition reveals a structural distinction. Geopolitical shocks enter through the Direct channel -- repricing default probability -- with the global financial cycle providing a transient offset. Geoeconomic uncertainty shocks bypass the Direct channel and operate through expected monetary policy and risk appetite. Gravity regressions show geopolitical transmission decays with distance; policy-uncertainty shocks activate the Uncertainty channel globally. The taxonomy implies that liquidity provision can address financial-cycle-mediated transmission but not the persistent component of geopolitical risk premia.
要約:
Time-series forecasting increasingly demands not only accurate observational predictions but also causal forecasting under interventional and counterfactual queries in multivariate systems. We present DoFlow, a flow-based generative model defined over a causal Directed Acyclic Graph (DAG) that delivers coherent observational and interventional predictions, as well as counterfactuals through the natural encoding-decoding mechanism of continuous normalizing flows (CNFs). We also provide a supporting counterfactual recovery theory under certain assumptions. Beyond forecasting, DoFlow provides explicit likelihoods of future trajectories, enabling principled anomaly detection. Experiments on synthetic datasets with various causal DAG structures and real-world hydropower and cancer-treatment time series show that DoFlow achieves accurate system-wide observational forecasting, enables causal forecasting over interventional and counterfactual queries, and effectively detects anomalies. This work contributes to the broader goal of unifying causal reasoning and generative modeling for complex dynamical systems.
要約:
We revisit the problem of denoising from noisy measurements where only the noise level is known, not the noise distribution. In multi-dimensions, independent noise $Z$ corrupts the signal $X$, resulting in the noisy measurement $Y = X + \sigma Z$, where $\sigma \in (0, 1)$ is a known noise level. Our goal is to recover the underlying signal distribution $P_X$ from denoising $P_Y$. We propose and analyze universal denoisers that are agnostic to a wide range of signal and noise distributions. Our distributional denoisers offer order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie's formula, if the focus is on the entire distribution $P_X$ rather than on individual realizations of $X$. Our denoisers shrink $P_Y$ toward $P_X$ optimally, achieving $O(\sigma^4)$ and $O(\sigma^6)$ accuracy in matching generalized moments and density functions. Inspired by optimal transport theory, the proposed denoisers are optimal in approximating the Monge-Amp\`ere equation with higher-order accuracy, and can be implemented efficiently via score matching.
Let $q$ represent the density of $P_Y$; for optimal distributional denoising, we recommend replacing the Bayes-optimal denoiser, \[ \mathbf{T}^*(y) = y + \sigma^2 \nabla \log q(y), \] with denoisers exhibiting less aggressive distributional shrinkage, \[ \mathbf{T}_1(y) = y + \frac{\sigma^2}{2} \nabla \log q(y), \] \[ \mathbf{T}_2(y) = y + \frac{\sigma^2}{2} \nabla \log q(y) - \frac{\sigma^4}{8} \nabla \left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right) . \]
要約:
Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges. We present an information theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering. Our proposal constitutes a competitive alternative to existing clustering algorithms for sparse data, as demonstrated through simulations on synthetic data. The effectiveness of our method is established by an application on a real-world genomics data set.
要約:
The Kullback-Leibler (KL) divergence is not a proper distance metric and does not satisfy the triangle inequality, posing theoretical challenges in certain practical applications. Existing work has demonstrated that KL divergence between multivariate Gaussian distributions follows a relaxed triangle inequality. Given any three multivariate Gaussian distributions $\mathcal{N}_1, \mathcal{N}_2$, and $\mathcal{N}_3$, if $KL(\mathcal{N}_1, \mathcal{N}_2)\leq \epsilon_1$ and $KL(\mathcal{N}_2, \mathcal{N}_3)\leq \epsilon_2$, then $KL(\mathcal{N}_1, \mathcal{N}_3)< 3\epsilon_1+3\epsilon_2+2\sqrt{\epsilon_1\epsilon_2}+o(\epsilon_1)+o(\epsilon_2)$. However, the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ is still unknown. In this paper, we investigate the relaxed triangle inequality for the KL divergence between multivariate Gaussian distributions and give the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ as well as the conditions when the supremum can be attained. When $\epsilon_1$ and $\epsilon_2$ are small, the supremum is $\epsilon_1+\epsilon_2+2\sqrt{\epsilon_1\epsilon_2}+o(\epsilon_1)+o(\epsilon_2)$. Finally, we demonstrate several applications of our results in out-of-distribution detection with flow-based generative models and safe reinforcement learning.
要約:
We develop a finite-sample, design-based theory for random forests in which each tree is a randomized conditional predictor acting on fixed covariates and the forest is their Monte Carlo average. An exact variance identity separates Monte Carlo error from a covariance floor that persists under infinite aggregation. The floor arises through two mechanisms: observation reuse, where the same training outcomes receive weight across multiple trees, and partition alignment, where independently generated trees discover similar conditional prediction rules. We prove the floor is strictly positive under minimal conditions and show that alignment persists even when sample splitting eliminates observation overlap entirely. We introduce procedure-aligned synthetic resampling (PASR) to estimate the covariance floor, decomposing the total prediction uncertainty of a deployed forest into interpretable components. For continuous outcomes, resulting prediction intervals achieve nominal coverage with a theoretically guaranteed conservative bias direction. For classification forests, the PASR estimator is asymptotically unbiased, providing the first pointwise confidence intervals for predicted conditional probabilities from a deployed forest. Nominal coverage is maintained across a range of design configurations for both outcome types, including high-dimensional settings. The underlying theory extends to any tree-based ensemble with an exchangeable tree-generating mechanism.
要約:
Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we study both approximation and statistical properties of neural networks with smooth activations for learning functions in the Sobolev space $W^{s,\infty}([0,1]^d)$ with $s>0$. We prove that constant-depth networks equipped with smooth activations achieve smoothness adaptivity: increasing width alone suffices to attain the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In contrast, for non-smooth activations such as ReLU, smoothness adaptivity is fundamentally limited by depth: the attainable approximation order is bounded by depth, and higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, complementary to depth, for achieving optimal rates over Sobolev function classes. Technically, our analysis is based on a multi-scale approximation framework that yields explicit neural network approximators with controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and avoids the impractical $\ell^0$-sparsity constraints commonly required in prior analyses.
要約:
Autoregressive models enable tractable sampling from learned probability distributions, but their performance critically depends on the variable ordering used in the factorization via complexities of the resulting conditional distributions. We propose to learn the Markov random field describing the underlying data, and use the inferred graphical model structure to construct optimized variable orderings. We illustrate our approach on two-dimensional image-like models where a structure-aware ordering leads to restricted conditioning sets, thereby reducing model complexity. Numerical experiments on Ising models with discrete data demonstrate that graph-informed orderings yield higher-fidelity generated samples compared to naive variable orderings.
要約:
We propose a Multivariate Spatio-Temporal Neural Hawkes Process for modeling complex multivariate event data with spatio-temporal dynamics. The proposed model extends continuous-time neural Hawkes processes by integrating spatial information into latent state evolution through learned temporal and spatial decay dynamics, enabling flexible modeling of excitation and inhibition without predefined triggering kernels. By analyzing fitted intensity functions of deep learning-based temporal Hawkes process models, we identify a modeling gap in how fitted intensity behavior is captured beyond likelihood-based performance, which motivates the proposed spatio-temporal approach. Simulation studies show that the proposed method successfully recovers sensible temporal and spatial intensity structure in multivariate spatio-temporal point patterns, while existing temporal neural Hawkes process approach fails to do so. An application to terrorism data from Pakistan further demonstrates the proposed model's ability to capture complex spatio-temporal interaction across multiple event types.
要約:
We propose a novel hierarchical Bayesian approach to Federated Learning (FL), where our model reasonably describes the generative process of clients' local data via hierarchical Bayesian modeling: constituting random variables of local models for clients that are governed by a higher-level global variate. Interestingly, the variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and allows them not to reveal their own private data at all, thus fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume the well-known FL algorithms including Fed-Avg and Fed-Prox as special cases. Beyond introducing novel modeling and derivations, we also offer convergence analysis showing that our block-coordinate FL algorithm converges to an (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as the generalisation error analysis where we prove that the test error of our model on unseen data is guaranteed to vanish as we increase the training data size, thus asymptotically optimal.
要約:
NUBO, short for Newcastle University Bayesian Optimisation, is a Bayesian optimization framework for the optimization of expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a costefficient optimization strategy that uses surrogate modelling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO itself focuses on transparency and user experience to make Bayesian optimization easily accessible to researchers from all disciplines. Clean and understandable code, precise references, and thorough documentation ensure transparency, while user experience is ensured by a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms. NUBO allows users to tailor Bayesian optimization to their specific problem by writing the optimization loop themselves using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and/or mixed (discrete and continuous) parameter input spaces. Only algorithms and methods that are extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize your simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.
要約:
Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine-grained annotations are unavailable or costly to obtain. However, the effectiveness of MIL drops sharply when training data are scarce, such as for rare disease classification. To address this challenge, we propose incorporating topological inductive biases into the data representation space within the MIL framework. This bias introduces a topology-preserving constraint that encourages the instance encoder to maintain the topological structure of the instance distribution within each bag when mapping them to MIL latent space. As a result, our Topology Guided MIL (TG-MIL) method enhances the performance and generalizability of MIL classifiers across different aggregation functions, especially under scarce-data regimes. Our evaluations show average performance improvements of 15.3% for synthetic MIL datasets, 2.8% for MIL benchmarks, and 5.5% for rare anemia classification compared to current state-of-the-art MIL models, where only 17-120 samples per class are available. We make our code publicly available.
要約:
Directed Acyclic Graphs (DAGs) are a standard tool in causal modeling, but their suitability for capturing the complexity of large-scale multimodal data is questionable. In practice, real-world multimodal datasets are often collected from heterogeneous generative processes that do not conform to a single DAG. Instead, they may involve multiple, and even opposing, DAG structures with inverse causal directions. To address this gap, in this work, we first propose a novel latent partial causal model tailored for multimodal data representation learning, featuring two latent coupled variables parts connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by MultiModal Contrastive Learning (MMCL) correspond to the latent coupled variables up to a trivial transformation. This result deepens our understanding of the why MMCL works, highlights its potential for representation disentanglement, and expands the utility of pre-trained models like CLIP. Synthetic experiments confirm the robustness of our findings, even when the assumptions are partially violated. Most importantly, experiments on a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets. Together, these contributions push the boundaries of MMCL, both in theory and in practical applications.
要約:
We introduce the concept of an imprecise Markov semigroup \(\mathbf Q\). It is a tool that allows us to represent ambiguity around both the transition probabilities and the invariant measure of a continuous-time Markov process via a collection of Markov semigroups, each associated with a (possibly different) Markov process. We use techniques from topology, geometry, and probability to analyze ergodic limits under model uncertainty encoded by \(\mathbf Q\). We establish long-term bounds that are uniform in the initial state and identify regimes in which the imprecision in these bounds collapses asymptotically. Our results are proved in progressively more general settings. We first assume that \(\mathbf Q\) is compact and that the state space is Euclidean or a Riemannian manifold, working with a fixed bounded observable. We then allow the state space to be standard Borel, while keeping \(\mathbf Q\) compact and the observable fixed. Finally, we drop compactness and work on Polish metric spaces of finite diameter, where we treat arbitrary bounded Lipschitz observables. The importance of our findings for the fields of artificial intelligence and computer vision is also discussed at a high level; In particular, in the study of how the probability of an output evolves over time as we perturb the input of a convolutional autoencoder.
著者: Peter Mostowsky, Vincent Dutordoir, Iskander Azangulov, No\'emie Jaquier, Michael John Hutchinson, Aditya Ravuri, Leonel Rozo, Alexander Terenin, Viacheslav Borovitskiy
要約:
Kernels are a fundamental technical primitive in machine learning. In recent years, kernel-based methods such as Gaussian processes are becoming increasingly important in applications where quantifying uncertainty is of key interest. In settings that involve structured data defined on graphs, meshes, manifolds, or other related spaces, defining kernels with good uncertainty-quantification behavior, and computing their value numerically, is less straightforward than in the Euclidean setting. To address this difficulty, we present GeometricKernels, a Python software package which implements the geometric analogs of classical Euclidean squared exponential - also known as heat - and Mat\'ern kernels, which are widely-used in settings where uncertainty is of key interest. As a byproduct, we obtain the ability to compute Fourier-feature-type expansions, which are widely used in their own right, on a wide set of geometric spaces. Our implementation supports automatic differentiation in every major current framework simultaneously via a backend-agnostic design. In this companion paper to the package and its documentation, we outline the capabilities of the package and present an illustrated example of its interface. We also include a brief overview of the theory the package is built upon and provide some historic context in the appendix.
著者: Paolo Maria Ceriani (Department of Decision Sciences, Bocconi University, Milan, Italy), Andrea Pandolfi (Department of Decision Sciences, Bocconi University, Milan, Italy), Giacomo Zanella (Department of Decision Sciences, Bocconi University, Milan, Italy, Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy)
要約:
We design and analyze unbiased Markov chain Monte Carlo (MCMC) schemes based on couplings of blocked Gibbs samplers (BGSs), whose total computational costs scale linearly with the number of parameters and data points. Our methodology is designed for and applicable to high-dimensional BGS with conditionally independent blocks, which are often encountered in Bayesian modeling. We provide bounds on the expected number of iterations needed for coalescence for Gaussian targets, as well as on the tails of the coalescence times distribution. These imply that practical two-step coupling strategies achieve coalescence times that match the relaxation times of the original BGS scheme up to logarithmic factors. To illustrate the practical relevance of our methodology, we apply it to high-dimensional crossed random effect and probabilistic matrix factorization models, for which we develop a novel BGS scheme with improved convergence speed. Our methodology provides unbiased posterior estimates at linear cost (usually requiring only a few BGS iterations for problems with thousands of parameters), matching state-of-the-art procedures for both frequentist and Bayesian estimation of those models.
要約:
We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy can not be directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide the theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations REAG$_\text{Dara}^{*}$ and REAG$_\text{MV}^{*}$ respectively. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.
要約:
There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train the diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing the image quality, for both text-conditional and unconditional models and for a variety of data availability settings.
要約:
A fundamental problem in statistics is measuring the correlation between two rankings of a set of items. Kendall's $\tau$ and Spearman's $\rho$ are well established correlation coefficients whose symmetric structure guarantees zero expected value between two rankings randomly chosen with uniform probability. In many modern applications, however, greater importance is assigned to top-ranked items, motivating weighted variants of these coefficients. Such weighting schemes generally break the symmetry of the original formulations, resulting in a non-zero expected value under independence and compromising the interpretation of zero correlation. We propose a general standardization function $g(\cdot)$ that transforms a ranking correlation coefficient $\Gamma$ into a standardized form $g(\Gamma)$ with zero expected value under randomness. The transformation preserves the domain $[-1,1]$, satisfies the boundary conditions, is continuous and increasing, and reduces to the identity for coefficients that already satisfy the zero-expected-value property. The construction of $g(x)$ depends on three distributional parameters of $\Gamma$: its mean, variance, and left variance; since their exact calculation becomes infeasible for large ranking lengths $n$, we develop accurate numerical estimates based on Monte Carlo sampling combined with polynomial regression to capture their dependence on $n$.
要約:
The statistical properties of deep neural networks (DNNs) at initialization play an important role to comprehend their trainability and the intrinsic architectural biases they possess before data exposure Well established mean field (MF) theories have uncovered that the distribution of parameters of randomly initialized networks strongly influences the behavior of the gradients, dictating whether they explode or vanish. Recent work has showed that untrained DNNs also manifest an initial guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we provide a theoretical proof that links IGB to previous MF theories for a vast class of DNNs, showing that efficient learning is tightly connected to a network prejudice towards a specific class. This connection leads to a counterintuitive conclusion: the initialization that optimizes trainability is systematically biased rather than neutral.
要約:
Adversarial training is one of the most effective defenses against adversarial attacks, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models -- models that can adapt robustly to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can generalize robustly to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can obtain adversarial robustness for free. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
要約:
Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different data modalities. However, modalities observed in the source environment may differ from the modalities observed in the target environment due to multiple factors, including cost, hardware failure, or the perceived \textit{informativeness} of a given modality. This change in missingness patterns between the source and target environment has not been carefully studied. Na{\"i}ve estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality's value in the target environment. We formalize the problem of missingness, demonstrate its ubiquity, and show that the subsequent distribution shift induces bias when the missingness process is not explicitly accounted for. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction. We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world datasets.
要約:
Ensemble learning remains a cornerstone of machine learning, with stacking used to integrate predictions from multiple base learners through a meta-model. However, deep stacking remains uncommon due to feature redundancy, complexity, and computational burden. To address these limitations, RocketStack is introduced as a level-aware recursive stacking architecture explored up to ten stacking levels, extending beyond prior architectures. At level 1, base-learner predictions are fused with original features; at later levels, weaker learners are incrementally pruned using out-of-fold (OOF) scores. To curb early saturation, pruning is regularized by applying Gaussian perturbations at two noise scales to OOF scores prior to model selection for next-level stacking, alongside deterministic pruning. To control feature growth, periodic compression is applied at levels 3, 6, and 9 using Simple, Fast, Efficient (SFE) filtering, attention-based selection, and autoencoders. Across 33 datasets (23 binary, 10 multi-class), increasing accuracy with depth is confirmed by linear mixed-effects trend tests, and the best meta-model per level increasingly outperforms the best standalone ensemble. OOF-perturbed pruning is found to improve stability and late-level gains, while periodic compression is found to yield substantial runtime and dimensionality reductions with minimal accuracy drop. At the deepest level, accuracy slightly surpasses established deep tabular baselines. When hyperparameter optimization is performed on baseline models, early performance is boosted; however, untuned RocketStack closes the gap with depth and remains competitive at later levels. It achieves deep recursive stacking with sublinear computational growth and provides a modular, depth-aware foundation for scalable decision fusion as model pools and feature spaces evolve.
要約:
The performance of predictive models in clinical settings often degrades when deployed in new hospitals due to distribution shifts. This paper presents a large-scale study of causality-inspired domain generalization on heterogeneous multi-center intensive care unit (ICU) data. We apply anchor regression and introduce anchor boosting, a novel, tree-based nonlinear extension, to a large dataset comprising 400,000 patients from nine distinct ICU databases. We find that anchor regularization yields improvements of out-of-distribution performance, particularly for the most dissimilar target domains. The methods appear robust to violations of theoretical assumptions, such as anchor exogeneity. Furthermore, we propose a novel conceptual framework to quantify the utility of large external data datasets. By evaluating performance as a function of available target-domain data, we identify three regimes: (i) a domain generalization regime, where only the external model should be used, (ii) a domain adaptation regime, where refitting the external model is optimal, and (iii) a data-rich regime, where external data provides no additional value.
要約:
Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
要約:
Multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time causal model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and the causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.
要約:
The success of federated learning (FL) ultimately depends on how strategic participants behave under partial observability, yet most formulations still treat FL as a static optimization problem. We instead view FL deployments as governed strategic systems and develop an analytical framework that separates welfare-improving behavior from metric gaming. Within this framework, we introduce indices that quantify manipulability, the price of gaming, and the price of cooperation, and we use them to study how rules, information disclosure, evaluation metrics, and aggregator-switching policies reshape incentives and cooperation patterns. We derive threshold conditions for deterring harmful gaming while preserving benign cooperation, and for triggering auto-switch rules when early-warning indicators become critical. Building on these results, we construct a design toolkit including a governance checklist and a simple audit-budget allocation algorithm with a provable performance guarantee. Simulations across diverse stylized environments and a federated learning case study consistently match the qualitative and quantitative patterns predicted by our framework. Taken together, our results provide design principles and operational guidelines for reducing metric gaming while sustaining stable, high-welfare cooperation in FL platforms.
要約:
We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning. Code is available at https://github.com/voilalab/MLP-IHT.
要約:
We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high quality samples with very few function evaluations while retaining the numerical stability benefits of finite speed probability flows.
要約:
In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under the assumption of exchangeability. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
要約:
Under the data manifold hypothesis, high-dimensional data are concentrated near a low-dimensional manifold. We study the problem of Riemannian optimization over such manifolds when they are given only implicitly through the data distribution, and the standard manifold operations required by classical algorithms are unavailable. This formulation captures a broad class of data-driven design problems that are central to modern generative AI. Our key idea is to introduce a link function that connects the data distribution to the geometric operations needed for optimization. We show that this function enables the recovery of essential manifold operations, such as retraction and Riemannian gradient computation. Moreover, we establish a direct connection between our construction and the score function in diffusion models of the data distribution. This connection allows us to leverage well-studied parameterizations, efficient training procedures, and even pretrained score networks from the diffusion model literature to perform optimization. Building on this foundation, we propose two efficient inference-time algorithms -- Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD) -- and provide theoretical guarantees for both feasibility (approximate manifold adherence) and optimality (small Riemannian gradient norm). Finally, we demonstrate the effectiveness of our approach on finite-horizon reference tracking tasks in data-driven control, highlighting its potential for practical generative and design applications.
要約:
Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency. The code is available on GitHub.
要約:
Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thoughts (CoTs). This enables models to tackle more complex problem by breaking them down into additional steps, backtracking, and correcting mistakes. Despite its strong performance--demonstrated by OpenAI's o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve the performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.
要約:
Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed the accuracy of larger explicit networks by allocating more test-time compute, the underlying mechanism remains poorly understood.
We study this gap through a nonparametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process lets the model's expressive power scale with test-time compute, ultimately matching a much richer function class. The theory is validated across four domains: image reconstruction, scientific computing, operations research, and LLM reasoning, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.
要約:
Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data. Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies. However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes. To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series. Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores. In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.
要約:
The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on $\beta_2 \to 1$, while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.
要約:
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
要約:
Generalized linear models (GLMs) are fundamental tools for statistical modeling, with maximum likelihood estimation (MLE) serving as the classical approach for parameter inference. While MLE performs well for canonical GLMs, it can become computationally challenging in more general settings with non-canonical, non-smooth, or nonlinear link functions, where the resulting optimization landscape may be ill-conditioned, non-convex, or non-differentiable. In this paper, we study an alternative estimation framework based on variational inequalities (VIs), which formulates GLM estimation through an operator-based equilibrium condition rather than likelihood minimization. We analyze the VI estimator from a statistical perspective and establish finite-sample error bounds and asymptotic normality under mild regularity conditions, together with convergence guarantees for fixed-point and stochastic approximation algorithms. The framework accommodates a broad class of link functions, including non-canonical and non-monotone cases satisfying a strong Minty-type condition, and extends naturally to generalized additive models via basis expansion. Numerical experiments demonstrate that the VI approach achieves competitive finite-sample accuracy and improved numerical stability relative to MLE, particularly in GLMs and GAMs with non-canonical or non-smooth link functions.
要約:
We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + \sigma Z$, with $Z$ being standard Gaussian independent of $X$ and $\sigma>0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $\sigma^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$.
We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.
要約:
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
要約:
Feature attribution is the dominant paradigm for explaining the predictions of complex machine learning models like neural networks. However, most existing methods offer little guarantee of reflecting the model's prediction-making process. We define the notion of explanatory alignment and argue that it is central to trustworthy predictive modeling: in short, it requires that explanations directly underlie predictions rather than serve as rationalizations. We present model readability as a design principle enabling alignment, and Pointwise-interpretable Networks (PiNets) as a modeling framework to pursue it in a deep learning context. PiNets combine statistical intelligence with a pseudo-linear structure that yields instance-wise linear predictions in an arbitrary feature space. We illustrate their use on image classification and segmentation tasks, demonstrating that PiNets produce explanations that are not only aligned by design but also faithful across other dimensions: meaningfulness, robustness, and sufficiency.
要約:
Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, making it an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 54.3% to 75.4% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini). A python implementation of our proposal is publicly available at https://github.com/Mamba413/L2D.
要約:
Autonomous AI agents are beginning to populate social platforms, but it is still unclear whether they can sustain the back-and-forth needed for extended coordination. We study Moltbook, an AI-agent social network, using a first-week snapshot and introduce interaction half-life: how quickly a comment's chance of receiving a direct reply fades as the comment ages. Across tens of thousands of commented threads, Moltbook discussions are dominated by first-layer reactions rather than extended chains. Most comments never receive a direct reply, reciprocal back-and-forth is rare, and when replies do occur they arrive almost immediately -- typically within seconds -- implying persistence on the order of minutes rather than hours. Moltbook is often described as running on an approximately four-hour ``heartbeat'' check-in schedule; using aggregate spectral tests on the longest contiguous activity window, we do not detect a reliable four-hour rhythm in this snapshot, consistent with jittered or out-of-phase individual schedules. A contemporaneous Reddit baseline analyzed with the same estimators shows substantially deeper threads and much longer reply persistence. Overall, early agent social interaction on Moltbook fits a ``fast response or silence'' regime, suggesting that sustained multi-step coordination will likely require explicit memory, thread resurfacing, and re-entry scaffolds.
要約:
Spatio-temporal forecasting is fundamental to intelligent systems in transportation, climate science, and urban planning. However, training deep learning models on the massive, often redundant, datasets from these domains presents a significant computational bottleneck. Existing solutions typically focus on optimizing model architectures or optimizers, while overlooking the inherent inefficiency of the training data itself. This conventional approach of iterating over the entire static dataset each epoch wastes considerable resources on easy-to-learn or repetitive samples. In this paper, we explore a novel training-efficiency techniques, namely learning from complexity with dynamic sample pruning, ST-Prune, for spatio-temporal forecasting. Through dynamic sample pruning, we aim to intelligently identify the most informative samples based on the model's real-time learning state, thereby accelerating convergence and improving training efficiency. Extensive experiments conducted on real-world spatio-temporal datasets show that ST-Prune significantly accelerates the training speed while maintaining or even improving the model performance, and it also has scalability and universality.
要約:
In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.