要約:
Data assimilation (DA) is a cornerstone of scientific and engineering applications, combining model forecasts with sparse and noisy observations to estimate latent system states. Classical DA methods, such as the ensemble Kalman filter, rely on Gaussian approximations and heuristic tuning (e.g., inflation and localization) to scale to high dimensions. While often successful, these approximations can make the methods unstable or inaccurate when the underlying distributions of states and observations depart significantly from Gaussianity. To address this limitation, we introduce DAISI, a scalable filtering algorithm built on flow-based generative models that enables flexible probabilistic inference using data-driven priors. The core idea is to use a stationary, pre-trained generative prior to assimilate observations via guidance-based conditional sampling while incorporating forecast information through a novel inverse-sampling step. This step maps the forecast ensemble into a latent space to provide initial conditions for the conditional sampling, allowing us to encode model dynamics into the DA pipeline without having to retrain or fine-tune the generative prior at each assimilation step. Experiments on challenging nonlinear systems show that DAISI achieves accurate filtering results in regimes with sparse, noisy, and nonlinear observations where traditional methods struggle.
要約:
Random Forests and Gradient Boosting are among the most effective algorithms for supervised learning on tabular data. Both belong to the class of tree-based ensemble methods, where predictions are obtained by aggregating many randomized regression trees. In this paper, we develop a theoretical framework for analyzing such methods through Reproducing Kernel Hilbert Spaces (RKHSs) constructed on tree ensembles -- more precisely, on the random partitions generated by randomized regression trees. We establish fundamental analytical properties of the resulting Random Forest kernel, including boundedness, continuity, and universality, and show that a Random Forest predictor can be characterized as the unique minimizer of a penalized empirical risk functional in this RKHS, providing a variational interpretation of ensemble learning. We further extend this perspective to the continuous-time formulation of Gradient Boosting introduced by Dombry and Duchamps, and demonstrate that it corresponds to a gradient flow on a Hilbert manifold induced by the Random Forest RKHS. A key feature of this framework is that both the kernel and the RKHS geometry are data-dependent, offering a theoretical explanation for the strong empirical performance of tree-based ensembles. Finally, we illustrate the practical potential of this approach by introducing a kernel principal component analysis built on the Random Forest kernel, which enhances the interpretability of ensemble models, as well as GVI, a new geometric variable importance criterion.
要約:
Sequential optimization of black-box functions from noisy evaluations has been widely studied, with Gaussian Process bandit algorithms such as GP-UCB guaranteeing no-regret in stationary settings. However, for time-varying objectives, it is known that no-regret is unattainable under pure bandit feedback unless strong and often unrealistic assumptions are imposed.
In this article, we propose a novel method to optimize time-varying rewards in the frequentist setting, where the objective has bounded RKHS norm. Time variations are captured through uncertainty injection (UI), which enables heteroscedastic GP regression that adapts past observations to the current time step. As no-regret is unattainable in general in the strict bandit setting, we relax the latter allowing additional queries on previously observed points. Building on sparse inference and the effect of UI on regret, we propose \textbf{W-SparQ-GP-UCB}, an online algorithm that achieves no-regret with only a vanishing number of additional queries per iteration. To assess the theoretical limits of this approach, we establish a lower bound on the number of additional queries required for no-regret, proving the efficiency of our method. Finally, we provide a comprehensive analysis linking the degree of time-variation of the function to achievable regret rates, together with upper and lower bounds on the number of additional queries needed in each regime.
要約:
We investigate the existence of a statistical-computational gap in multiple Gaussian graph alignment. We first generalize a previously established informational threshold from Vassaux and Massouli\'e (2025) to regimes where the number of observed graphs $p$ may also grow with the number of nodes $n$: when $p \leq O(n/\log(n))$, we recover the results from Vassaux and Massouli\'e (2025), and $p \geq \Omega(n/\log(n))$ corresponds to a regime where the problem is as difficult as aligning one single graph with some unknown "signal" graph. Moreover, when $\log p = \omega(\log n)$, the informational thresholds for partial and exact recovery no longer coincide, in contrast to the all-or-nothing phenomenon observed when $\log p=O(\log n)$. Then, we provide the first computational barrier in the low-degree framework for (multiple) Gaussian graph alignment. We prove that when the correlation $\rho$ is less than $1$, up to logarithmic terms, low degree non-trivial estimation fails. Our results suggest that the task of aligning $p$ graphs in polynomial time is as hard as the problem of aligning two graphs in polynomial time, up to logarithmic factors. These results characterize the existence of a statistical-computational gap and provide another example in which polynomial-time algorithms cannot handle complex combinatorial bi-dimensional structures.
要約:
We study the problem of learning disentangled signals from data using non-linear Independent Component Analysis (ICA). Motivated by advances in self-supervised learning, we propose to learn self-sufficient signals: A recovered signal should be able to reconstruct a missing value of its own from all remaining components without relying on any other signals. We formulate this problem as the minimization of a conditional KL divergence. Compared to traditional maximum likelihood estimation, our algorithm is prior-free and likelihood-free, meaning that we do not need to impose any prior on the original signals or any observational model, which often restricts the model's flexibility. To tackle the KL divergence minimization problem, we propose a sequential algorithm that reduces the KL divergence and learns an optimal de-mixing flow model at each iteration. This approach completely avoids the unstable adversarial training, a common issue in minimizing the KL divergence. Experiments on toy and real-world datasets show the effectiveness of our method.
要約:
We study a structured permutation scheme for two-sample testing that restricts permutations to single cross-swaps between block-selected representatives. Our analysis yields three main results. First, we provide an exact validity construction that applies to any fixed restricted permutation set. Second, for both the difference of sample means and the unbiased $\widehat{\mathrm{MMD}}^{2}$ estimator, we derive closed-form one-swap increment identities whose conditional variances scale as $O(h^{2})$, in contrast to the $\Theta(h)$ increment variability under full relabeling. This increment-level variance contraction sharpens the Bernstein--Freedman variance proxy and leads to substantially smaller permutation critical values. Third, we obtain explicit, data-dependent expressions for the resulting critical values and statistical power. Together, these results show that block-restricted one-swap permutations can achieve strictly higher power than classical full permutation tests while maintaining exact finite-sample validity, without relying on pessimistic worst-case Lipschitz bounds.
要約:
We address the problem of causal effect estimation in the presence of hidden confounders using nonparametric instrumental variable (IV) regression. An established approach is to use estimators based on learned spectral features, that is, features spanning the top singular subspaces of the operator linking treatments to instruments. While powerful, such features are agnostic to the outcome variable. Consequently, the method can fail when the true causal function is poorly represented by these dominant singular functions. To mitigate, we introduce Augmented Spectral Feature Learning, a framework that makes the feature learning process outcome-aware. Our method learns features by minimizing a novel contrastive loss derived from an augmented operator that incorporates information from the outcome. By learning these task-specific features, our approach remains effective even under spectral misalignment. We provide a theoretical analysis of this framework and validate our approach on challenging benchmarks.
要約:
We study the multi-objective linear contextual bandit problem, where multiple possible conflicting objectives must be optimized simultaneously. We propose \texttt{MOL-TS}, the \textit{first} Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, \texttt{MOL-TS} samples parameters across objectives and efficiently selects an arm from a novel \emph{effective Pareto front}, which accounts for repeated selections over time. Our analysis shows that \texttt{MOL-TS} achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors, $T$ is the total number of rounds, matching the best known order for randomized linear bandit algorithms for single objective. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
要約:
Principal Component Analysis (PCA) and K-means constitute fundamental techniques in multivariate analysis. Although they are frequently applied independently or sequentially to cluster observations, the relationship between them, especially when K-means is used to cluster variables rather than observations, has been scarcely explored. This study seeks to address this gap by proposing an innovative method that analyzes the relationship between clusters of variables obtained by applying K-means on transposed data and the principal components of PCA. Our approach involves applying PCA to the original data and K-means to the transposed data set, where the original variables are converted into observations. The contribution of each variable cluster to each principal component is then quantified using measures based on variable loadings. This process provides a tool to explore and understand the clustering of variables and how such clusters contribute to the principal dimensions of variation identified by PCA.
要約:
We introduce Smart Bayes, a new classification framework that bridges generative and discriminative modeling by integrating likelihood-ratio-based generative features into a logistic-regression-style discriminative classifier. From the generative perspective, Smart Bayes relaxes the fixed unit weights of Naive Bayes by allowing data-driven coefficients on density-ratio features. From a discriminative perspective, it constructs transformed inputs as marginal log-density ratios that explicitly quantify how much more likely each feature value is under one class than another, thereby providing predictors with stronger class separation than the raw covariates. To support this framework, we develop a spline-based estimator for univariate log-density ratios that is flexible, robust, and computationally efficient. Through extensive simulations and real-data studies, Smart Bayes often outperforms both logistic regression and Naive Bayes. Our results highlight the potential of hybrid approaches that exploit generative structure to enhance discriminative performance.
要約:
Mean-field games (MFGs) study the Nash equilibrium of systems with a continuum of interacting agents, which can be formulated as the fixed-point of optimal control problems. They provide a unified framework for a variety of applications, including optimal transport (OT) and generative models. Despite their broad applicability, solving high-dimensional MFGs remains a significant challenge due to fundamental computational and analytical obstacles. In this work, we propose a particle-based deep Flow Matching (FM) method to tackle high-dimensional MFG computation. In each iteration of our proximal fixed-point scheme, particles are updated using first-order information, and a flow neural network is trained to match the velocity of the sample trajectories in a simulation-free manner. Theoretically, in the optimal control setting, we prove that our scheme converges to a stationary point sublinearly, and upgrade to linear (exponential) convergence under additional convexity assumptions. Our proof uses FM to induce an Eulerian coordinate (density-based) from a Lagrangian one (particle-based), and this also leads to certain equivalence results between the two formulations for MFGs when the Eulerian solution is sufficiently regular. Our method demonstrates promising performance on non-potential MFGs and high-dimensional OT problems cast as MFGs through a relaxed terminal-cost formulation.
要約:
Many online learning algorithms, including classical online PCA methods, enforce explicit normalization steps that discard the evolving norm of the parameter vector. We show that this norm can in fact encode meaningful information about the underlying statistical structure of the problem, and that exploiting this information leads to improved learning behavior. Motivated by this principle, we introduce Implicitly Normalized Online PCA (INO-PCA), an online PCA algorithm that removes the unit-norm constraint and instead allows the parameter norm to evolve dynamically through a simple regularized update. We prove that in the high-dimensional limit the joint empirical distribution of the estimate and the true component converges to a deterministic measure-valued process governed by a nonlinear PDE. This analysis reveals that the parameter norm obeys a closed-form ODE coupled with the cosine similarity, forming an internal state variable that regulates learning rate, stability, and sensitivity to signal-to-noise ratio (SNR). The resulting dynamics uncover a three-way relationship between the norm, SNR, and optimal step size, and expose a sharp phase transition in steady-state performance. Both theoretically and experimentally, we show that INO-PCA consistently outperforms Oja's algorithm and adapts rapidly in non-stationary environments. Overall, our results demonstrate that relaxing norm constraints can be a principled and effective way to encode and exploit problem-relevant information in online learning algorithms.
要約:
Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.
要約:
Learning the structure of a Bayesian network from decentralized data poses two major challenges: (i) ensuring rigorous privacy guarantees for participants, and (ii) avoiding communication costs that scale poorly with dimensionality. In this work, we introduce Fed-Sparse-BNSL, a novel federated method for learning linear Gaussian Bayesian network structures that addresses both challenges. By combining differential privacy with greedy updates that target only a few relevant edges per participant, Fed-Sparse-BNSL efficiently uses the privacy budget while keeping communication costs low. Our careful algorithmic design preserves model identifiability and enables accurate structure estimation. Experiments on synthetic and real datasets demonstrate that Fed-Sparse-BNSL achieves utility close to non-private baselines while offering substantially stronger privacy and communication efficiency.
要約:
Bipartite networks are widely used to encode the ecological interactions. Being able to compare the organization of bipartite networks is a first step toward a better understanding of how environmental factors shape community structure and resilience. Yet current methods for structure detection in bipartite networks overlook shared patterns across collections of networks. We introduce the \emph{colBiSBM}, a family of probabilistic models for collections of bipartite networks that extends the classical Latent Block Model (LBM). The proposed framework assumes that networks are independent realizations of a shared mesoscale structure, encoded through common inter-block connectivity parameters. We establish identifiability conditions for the different variants of \emph{colBiSBM} and develop a variational EM algorithm for parameter estimation, coupled with an adaptation of the Integrated Classification Likelihood (ICL) criterion for model selection. We demonstrate how our approach can be used to classify networks based on their topology or organization. Simulation studies highlight the ability of \emph{colBiSBM} to recover common structures, improve clustering performance, and enhance link prediction by borrowing strength across networks. An application to plant--pollinator networks highlights how the method uncovers shared ecological roles and partitions networks into sub-collections with similar connectivity patterns. These results illustrate the methodological and practical advantages of joint modeling over separate network analyses in the study of bipartite systems.
要約:
Decision trees and random forest remain highly competitive for classification on medium-sized, standard datasets due to their robustness, minimal preprocessing requirements, and interpretability. However, a single tree suffers from high estimation variance, while large ensembles reduce this variance at the cost of substantial computational overhead and diminished interpretability. In this paper, we propose Decision Tree Embedding (DTE), a fast and effective method that leverages the leaf partitions of a trained classification tree to construct an interpretable feature representation. By using the sample means within each leaf region as anchor points, DTE maps inputs into an embedding space defined by the tree's partition structure, effectively circumventing the high variance inherent in decision-tree splitting rules. We further introduce an ensemble extension based on additional bootstrap trees, and pair the resulting embedding with linear discriminant analysis for classification. We establish several population-level theoretical properties of DTE, including its preservation of conditional density under mild conditions and a characterization of the resulting classification error. Empirical studies on synthetic and real datasets demonstrate that DTE strikes a strong balance between accuracy and computational efficiency, outperforming or matching random forest and shallow neural networks while requiring only a fraction of their training time in most cases. Overall, the proposed DTE method can be viewed either as a scalable decision tree classifier that improves upon standard split rules, or as a neural network model whose weights are learned from tree-derived anchor points, achieving an intriguing integration of both paradigms.
要約:
Diffusion generative models have emerged as powerful tools for producing synthetic data from an empirically observed distribution. A common approach involves simulating the time-reversal of an Ornstein-Uhlenbeck (OU) process initialized at the true data distribution. Since the score function associated with the OU process is typically unknown, it is approximated using a trained neural network. This approximation, along with finite time simulation, time discretization and statistical approximation, introduce several sources of error whose impact on the generated samples must be carefully understood. Previous analyses have quantified the error between the generated and the true data distributions in terms of Wasserstein distance or Kullback-Leibler (KL) divergence. However, both metrics present limitations: KL divergence requires absolute continuity between distributions, while Wasserstein distance, though more general, leads to error bounds that scale poorly with dimension, rendering them impractical in high-dimensional settings. In this work, we derive an explicit, dimension-free bound on the discrepancy between the generated and the true data distributions. The bound is expressed in terms of a smooth test functional with bounded first and second derivatives. The key novelty lies in the use of this weaker, functional metric to obtain dimension-independent guarantees, at the cost of higher regularity on the test functions. As an application, we formulate and solve a variational problem to minimize the time-discretization error, leading to the derivation of an optimal time-scheduling strategy for the reverse-time diffusion. Interestingly, this scheduler has appeared previously in the literature in a different context; our analysis provides a new justification for its optimality, now grounded in minimizing the discretization bias in generative sampling.
要約:
This chapter opens with a review of classic tools for regression, a subset of machine learning that seeks to find relationships between variables. With the advent of scientific machine learning this field has moved from a purely data-driven (statistical) formalism to a constrained or ``physics-informed'' formalism, which integrates physical knowledge and methods from traditional computational engineering. In the first part, we introduce the general concepts and the statistical flavor of regression versus other forms of curve fitting. We then move to an overview of traditional methods from machine learning and their classification and ways to link these to traditional computational science. Finally, we close with a note on methods to combine machine learning and numerical methods for physics
要約:
This chapter provides three tutorial exercises on physics-constrained regression. These are implemented as toy problems that seek to mimic grand challenges in (1) the super-resolution and data assimilation of the velocity field in image velocimetry, (2) data-driven turbulence modeling, and (3) system identification and digital twinning for forecasting and control. The Python codes for all exercises are provided in the course repository.
要約:
High-dimensional spaces have challenged Bayesian optimization (BO). Existing methods aim to overcome this so-called curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behavior, Gaussian processes with linear kernels match state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form sampling and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with > 20,000 observations. Coupled with empirical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensional spaces.
要約:
Identifying causal effects in the presence of unmeasured variables is a fundamental challenge in causal inference, for which proxy variable methods have emerged as a powerful solution. We contrast two major approaches in this framework: (1) bridge equation methods, which leverage solutions to integral equations to recover causal targets, and (2) array decomposition methods, which recover latent factors composing counterfactual quantities by exploiting unique determination of eigenspaces. We compare the model restrictions underlying these two approaches and provide insight into implications of the underlying assumptions, clarifying the scope of applicability for each method.
要約:
Tabular data drive most real-world machine learning applications, yet building general-purpose models for them remains difficult. Mixed numeric and categorical fields, weak feature structure, and limited labeled data make scaling and generalization challenging. To this end, we introduce Orion-Bix, a tabular foundation model that combines biaxial attention with meta-learned in-context reasoning for few-shot tabular learning. Its encoder alternates standard, grouped, hierarchical, and relational attention, fusing their outputs through multi-CLS summarization to capture both local and global dependencies efficiently. A label-aware ICL head adapts on the fly and scales to large label spaces via hierarchical decision routing. Meta-trained on synthetically generated, structurally diverse tables with causal priors, Orion-Bix learns transferable inductive biases across heterogeneous data. Delivered as a scikit-learn compatible foundation model, it outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models on public benchmarks, showing that biaxial attention with episodic meta-training enables robust, few-shot-ready tabular learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-BiX .
要約:
Many tasks require mapping continuous input data (e.g. images) to discrete task outputs (e.g. class labels). Yet, how neural networks learn to perform such discrete computations on continuous data manifolds remains poorly understood. Here, we show that signatures of such computations emerge in the representational geometry of neural networks as they learn. By analysing the Riemannian pullback metric across layers of a neural network, we find that network computation can be decomposed into two functions: discretising continuous input features and performing logical operations on these discretised variables. Furthermore, we demonstrate how different learning regimes (rich vs. lazy) have contrasting metric and curvature structures, affecting the ability of the networks to generalise to unseen inputs. Overall, our work provides a geometric framework for understanding how neural networks learn to perform discrete computations on continuous manifolds.
要約:
Adaptively collected data has become ubiquitous within modern practice. However, even seemingly benign adaptive sampling schemes can introduce severe biases, rendering traditional statistical inference tools inapplicable. This can be mitigated by a property called stability, which states that if the rate at which an algorithm takes actions converges to a deterministic limit, one can expect that certain parameters are asymptotically normal. Building on a recent line of work for the multi-armed bandit setting, we show that the linear upper confidence bound (LinUCB) algorithm for linear bandits satisfies this property. In doing so, we painstakingly characterize the behavior of the eigenvalues and eigenvectors of the random design feature covariance matrix in the setting where the action set is the unit ball, showing that it decomposes into a rank-one direction that locks onto the true parameter and an almost-isotropic bulk that grows at a predictable $\sqrt{T}$ rate. This allows us to establish a central limit theorem for the LinUCB algorithm, establishing asymptotic normality for the limiting distribution of the estimation error where the convergence occurs at a $T^{-1/4}$ rate. The resulting Wald-type confidence sets and hypothesis tests do not depend on the feature covariance matrix and are asymptotically tighter than existing nonasymptotic confidence sets. Numerical simulations corroborate our findings.
要約:
Deep neural networks often struggle to recognize when an input lies outside their training experience, leading to unreliable and overconfident predictions. Building dependable machine learning systems therefore requires methods that can both estimate predictive \textit{uncertainty} and detect \textit{out-of-distribution (OOD)} samples in a unified manner. In this paper, we propose \textbf{TIE: a Training--Inversion--Exclusion} framework for visually interpretable and uncertainty-guided anomaly detection that jointly addresses these challenges through iterative refinement. TIE extends a standard $n$-class classifier to an $(n+1)$-class model by introducing a garbage class initialized with Gaussian noise to represent outlier inputs. Within each epoch, TIE performs a closed-loop process of \textit{training, inversion, and exclusion}, where highly uncertain inverted samples reconstructed from the just-trained classifier are excluded into the garbage class. Over successive iterations, the inverted samples transition from noisy artifacts into visually coherent class prototypes, providing transparent insight into how the model organizes its learned manifolds. During inference, TIE rejects OOD inputs by either directly mapping them to the garbage class or producing low-confidence, uncertain misclassifications within the in-distribution classes that are easily separable, all without relying on external OOD datasets. A comprehensive threshold-based evaluation using multiple OOD metrics and performance measures such as \textit{AUROC}, \textit{AUPR}, and \textit{FPR@95\%TPR} demonstrates that TIE offers a unified and interpretable framework for robust anomaly detection and calibrated uncertainty estimation (UE) achieving near-perfect OOD detection with \textbf{\(\!\approx\!\) 0 FPR@95\%TPR} when trained on MNIST or FashionMNIST and tested against diverse unseen datasets.
要約:
The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited due to reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time-series. This framework reveals our key insight that class identity can be efficiently captured by extracting information about the generative variables related to the system parameters shared across similar time series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time series datasets that explicitly extracts system information while discarding non-transferrable sample-specific ones. We establish theory that provides sufficient conditions for the system information to be recovered, and empirically validate it using a synthetic dynamical systems experiment. Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that can broadly distinguish semantic classes, increase label efficiency, and improve transfer learning.
要約:
Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and a linear learnable restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.
要約:
Kolmogorov-Arnold Networks (KANs) offer a promising alternative to Multi-Layer Perceptron (MLP) by placing learnable univariate functions on network edges, enhancing interpretability. However, standard KANs lack probabilistic outputs, limiting their utility in applications requiring uncertainty quantification. While recent Gaussian Process (GP) extensions to KANs address this, they utilize exact inference methods that scale cubically with data size N, restricting their application to smaller datasets. We introduce the Sparse Variational GP-KAN (SVGP-KAN), an architecture that integrates sparse variational inference with the KAN topology. By employing $M$ inducing points and analytic moment matching, our method reduces computational complexity from $O(N^3)$ to $O(NM^2)$ or linear in sample size, enabling the application of probabilistic KANs to larger scientific datasets. Furthermore, we demonstrate that integrating a permutation-based importance analysis enables the network to function as a framework for structural identification, identifying relevant inputs and classifying functional relationships.
要約:
The analysis of non-real-valued data, such as binary time series, has attracted great interest in recent years. This manuscript proposes a post-selection estimator for estimating the coefficient matrices of a high-dimensional generalized binary vector autoregressive process and establishes a Gaussian approximation theorem for the proposed estimator. Furthermore, it introduces a second-order wild bootstrap algorithm to enable statistical inference on the coefficient matrices. Numerical studies and empirical applications demonstrate the good finite-sample performance of the proposed method.
要約:
The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)} for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $\{A, B\}$ is the number of actions for two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity with ours.
要約:
Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (\textit{RTZ-VI-LCB}) for offline RTZMGs, which is optimistic robust value iteration combined with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally.
要約:
A critical challenge for reinforcement learning (RL) is making decisions based on incomplete and noisy observations, especially in perturbed and partially observable Markov decision processes (P$^2$OMDPs). Existing methods fail to mitigate perturbations while addressing partial observability. We propose \textit{Causal State Representation under Asynchronous Diffusion Model (CaDiff)}, a framework that enhances any RL algorithm by uncovering the underlying causal structure of P$^2$OMDPs. This is achieved by incorporating a novel asynchronous diffusion model (ADM) and a new bisimulation metric. ADM enables forward and reverse processes with different numbers of steps, thus interpreting the perturbation of P$^2$OMDP as part of the noise suppressed through diffusion. The bisimulation metric quantifies the similarity between partially observable environments and their causal counterparts. Moreover, we establish the theoretical guarantee of CaDiff by deriving an upper bound for the value function approximation errors between perturbed observations and denoised causal states, reflecting a principled trade-off between approximation errors of reward and transition-model. Experiments on Roboschool tasks show that CaDiff enhances returns by at least 14.18\% compared to baselines. CaDiff is the first framework that approximates causal states using diffusion models with both theoretical rigor and practicality.
要約:
We take the novel perspective of incorporating offline RL algorithms as subroutines of tabula rasa online RL. This is feasible because an online learning agent can repurpose its historical interactions as offline dataset. We formalize this idea into a framework that accommodates several variants of offline RL incorporation such as final policy recommendation and online fine-tuning. We further introduce convenient techniques to improve its effectiveness in enhancing online learning efficiency. Our extensive and systematic empirical analyses show that 1) the effectiveness of the proposed framework depends strongly on the nature of the task, 2) our proposed techniques greatly enhance its effectiveness, and 3) existing online fine-tuning methods are overall ineffective, calling for more research therein.
要約:
Many emerging applications - such as adversarial training, AI alignment, and robust optimization - can be framed as zero-sum games between neural nets, with von Neumann-Nash equilibria (NE) capturing the desirable system behavior. While such games often involve non-convex non-concave objectives, empirical evidence shows that simple gradient methods frequently converge, suggesting a hidden geometric structure. In this paper, we provide a theoretical framework that explains this phenomenon through the lens of hidden convexity and overparameterization. We identify sufficient conditions - spanning initialization, training dynamics, and network width - that guarantee global convergence to a NE in a broad class of non-convex min-max games. To our knowledge, this is the first such result for games that involve two-layer neural networks. Technically, our approach is twofold: (a) we derive a novel path-length bound for the alternating gradient descent-ascent scheme in min-max games; and (b) we show that the reduction from a hidden convex-concave geometry to two-sided Polyak-{\L}ojasiewicz (P{\L}) min-max condition hold with high probability under overparameterization, using tools from random matrix theory.
著者: Benjamin D. Ballyk, Ankit Gupta, Sujay Konda, Kavitha Subramanian, Chris Landon, Ahmed Ammar Naseer, Georg Maierhofer, Sumanth Swaminathan, Vasudevan Venkateshwaran
要約:
Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guaranties. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.
要約:
Large language model (LLM) reinforcement learning has increasingly relied on group-based policy optimization frameworks, such as GRPO and GSPO, to achieve stable fine-tuning at scale. However, a fundamental trade-off persists between optimization granularity and training stability. While GSPO improves robustness via sequence-level optimization, its monolithic treatment of sequences introduces severe inefficiencies: its conservative clipping mechanism indiscriminately discards valid training samples-a phenomenon we term gradient underutilization-and its uniform credit assignment fails to capture the heterogeneous contributions of critical reasoning steps. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that reconciles fine-grained control with training stability. ESPO decomposes sequences into groups based on predictive entropy, enabling (1) Entropy-driven Importance Sampling to capture intra-sequence heterogeneity, and (2) Entropy-adaptive Clipping to dynamically allocate trust regions based on model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, notably improving accuracy on the challenging HMMT benchmark from 4.4% to 13.13%.
要約:
Hierarchical clustering is a fundamental machine-learning technique for grouping data points into dendrograms. However, existing hierarchical clustering methods encounter two primary challenges: 1) Most methods specify dendrograms without a global objective. 2) Graph-based methods often neglect the significance of graph structure, optimizing objectives on complete or static predefined graphs. In this work, we propose Hyperbolic Continuous Structural Entropy neural networks, namely HypCSE, for structure-enhanced continuous hierarchical clustering. Our key idea is to map data points in the hyperbolic space and minimize the relaxed continuous structural entropy (SE) on structure-enhanced graphs. Specifically, we encode graph vertices in hyperbolic space using hyperbolic graph neural networks and minimize approximate SE defined on graph embeddings. To make the SE objective differentiable for optimization, we reformulate it into a function using the lowest common ancestor (LCA) on trees and then relax it into continuous SE (CSE) by the analogy of hyperbolic graph embeddings and partitioning trees. To ensure a graph structure that effectively captures the hierarchy of data points for CSE calculation, we employ a graph structure learning (GSL) strategy that updates the graph structure during training. Extensive experiments on seven datasets demonstrate the superior performance of HypCSE.
要約:
Given a training dataset, the goal of dataset distillation is to derive a synthetic dataset such that models trained on the latter perform as well as those trained on the training dataset. In this work, we develop and analyze an efficient dataset distillation algorithm for supervised learning, specifically regression in $\mathbb{R}^d$, based on matching the losses on the training and synthetic datasets with respect to a fixed set of randomly sampled regressors without any model training. Our first key contribution is a novel performance guarantee proving that our algorithm needs only $\tilde{O}(d^2)$ sampled regressors to derive a synthetic dataset on which the MSE loss of any bounded linear model is nearly the same as its MSE loss on the given training data. In particular, the model optimized on the synthetic data has close to minimum loss on the training data, thus performing nearly as well as the model optimized on the latter. Complementing this, we also prove a matching lower bound of $\Omega(d^2)$ for the number of sampled regressors showing the tightness of our analysis.
Our second contribution is to extend our algorithm to offline RL dataset distillation by matching the Bellman loss, unlike previous works which used a behavioral cloning objective. This is the first such method which leverages both, the rewards and the next state information, available in offline RL datasets, without any policy model optimization. Our algorithm generates a synthetic dataset whose Bellman loss with respect to any linear action-value predictor is close to the latter's Bellman loss on the offline RL training dataset. Therefore, a policy associated with an action-value predictor optimized on the synthetic dataset performs nearly as well as that derived from the one optimized on the training data. We conduct experiments to validate our theoretical guarantees and observe performance gains.
要約:
The Influence Maximization (IM) problem aims to select a set of seed nodes within a given budget to maximize the spread of influence in a social network. However, real-world social networks have several structural inequalities, such as dominant majority groups and underrepresented minority groups. If these inequalities are not considered while designing IM algorithms, the outcomes might be biased, disproportionately benefiting majority groups while marginalizing minorities. In this work, we address this gap by designing a fairness-aware IM method using Reinforcement Learning (RL) that ensures equitable influence outreach across all communities, regardless of protected attributes. Fairness is incorporated using a maximin fairness objective, which prioritizes improving the outreach of the least-influenced group, pushing the solution toward an equitable influence distribution. We propose a novel fairness-aware deep RL method, called DQ4FairIM, that maximizes the expected number of influenced nodes by learning an RL policy. The learnt policy ensures that minority groups formulate the IM problem as a Markov Decision Process (MDP) and use deep Q-learning, combined with the Structure2Vec network embedding, earning together with Structure2Vec network embedding to solve the MDP. We perform extensive experiments on synthetic benchmarks and real-world networks to compare our method with fairness-agnostic and fairness-aware baselines. The results show that our method achieves a higher level of fairness while maintaining a better fairness-performance trade-off than baselines. Additionally, our approach learns effective seeding policies that generalize across problem instances without retraining, such as varying the network size or the number of seed nodes.
要約:
Replicability is a fundamental challenge in reinforcement learning (RL), as RL algorithms are empirically observed to be unstable and sensitive to variations in training conditions. To formally address this issue, we study \emph{list replicability} in the Probably Approximately Correct (PAC) RL framework, where an algorithm must return a near-optimal policy that lies in a \emph{small list} of policies across different runs, with high probability. The size of this list defines the \emph{list complexity}. We introduce both weak and strong forms of list replicability: the weak form ensures that the final learned policy belongs to a small list, while the strong form further requires that the entire sequence of executed policies remains constrained. These objectives are challenging, as existing RL algorithms exhibit exponential list complexity due to their instability. Our main theoretical contribution is a provably efficient tabular RL algorithm that guarantees list replicability by ensuring the list complexity remains polynomial in the number of states, actions, and the horizon length. We further extend our techniques to achieve strong list replicability, bounding the number of possible policy execution traces polynomially with high probability. Our theoretical result is made possible by key innovations including (i) a novel planning strategy that selects actions based on lexicographic order among near-optimal choices within a randomly chosen tolerance threshold, and (ii) a mechanism for testing state reachability in stochastic environments while preserving replicability. Finally, we demonstrate that our theoretical investigation sheds light on resolving the \emph{instability} issue of RL algorithms used in practice. In particular, we show that empirically, our new planning strategy can be incorporated into practical RL frameworks to enhance their stability.
要約:
We investigate the theoretical underpinnings of Discrete Diffusion Models (DDMs) on discrete state spaces. Unlike in the continuous setting-where diffusion models are well understood both theoretically and empirically-the discrete case poses significant challenges due to its combinatorial structure and the lack of rigorous analysis. In this work, we establish convergence guarantees for DDMs on both the finite space $\mathbb{Z}^d_m=\{0,...,m-1\}^d$ and the countably infinite space $\mathbb{N}^d$ under mild assumptions, focusing on forward masked and random walk dynamics. Similar to the continuous case, the backward process can be characterized by a discrete score function, whose monotonicity plays a central role in deriving the error bounds of the generated data. Notably, the complexity of our model scales linearly up to logarithmic factors, rather than exponentially, with the dimension, making it efficiently scalable to high-dimensional data. To the best of our knowledge, this study provides the first non-asymptotic convergence guarantees that do not rely on the boundedness of the estimated score-covering not only uniform noising processes on $\mathbb{Z}^d_m$ and on $\mathbb{N}^d$, but also masking-based noising dynamics.
要約:
Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we understand the SLT free energy $\mathcal{F}_n$ by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic's Toy Models of Superposition. Second, we understand the local learning coefficient $\lambda_{\alpha}$ by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Our experiments recover known scaling laws while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.
要約:
Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with a state-of-the-art diffusion method (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers -- something possible when learning to generate using \textit{variational} flow matching -- characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieves better performance with remarkably low function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial, as using the OT path demonstrates superior performance, while VP has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high utility synthetic data with reduced disclosure risk.
要約:
Differential privacy is increasingly formalized through the lens of hypothesis testing via the robust and interpretable $f$-DP framework, where privacy guarantees are encoded by a baseline Blackwell trade-off function $f_{\infty} = T(P_{\infty}, Q_{\infty})$ involving a pair of distributions $(P_{\infty}, Q_{\infty})$. The problem of choosing the right privacy metric in practice leads to a central question: what is a statistically appropriate baseline $f_{\infty}$ given some prior modeling assumptions? The special case of Gaussian differential privacy (GDP) showed that, under compositions of nearly perfect mechanisms, these trade-off functions exhibit a central limit behavior with a Gaussian limit experiment. Inspired by Le Cam's theory of limits of statistical experiments, we answer this question in full generality in an infinitely divisible setting.
We show that suitable composition experiments $(P_n^{\otimes n}, Q_n^{\otimes n})$ converge to a binary limit experiment $(P_{\infty}, Q_{\infty})$ whose log-likelihood ratio $L = \log(dQ_{\infty} / dP_{\infty})$ is infinitely divisible under $P_{\infty}$. Thus any limiting trade-off function $f_{\infty}$ is determined by an infinitely divisible law $P_{\infty}$, characterized by its Levy--Khintchine triplet, and its Esscher tilt defined by $dQ_{\infty}(x) = e^{x} dP_{\infty}(x)$. This characterizes all limiting baseline trade-off functions $f_{\infty}$ arising from compositions of nearly perfect differentially private mechanisms. Our framework recovers GDP as the purely Gaussian case and yields explicit non-Gaussian limits, including Poisson examples. It also positively resolves the empirical $s^2 = 2k$ phenomenon observed in the GDP paper and provides an optimal mechanism for count statistics achieving asymmetric Poisson differential privacy.
要約:
Evaluating rare-event forecasts is challenging because standard metrics collapse as event prevalence declines. Measures such as F1-score, AUPRC, MCC, and accuracy induce degenerate thresholds -- converging to zero or one -- and their values become dominated by class imbalance rather than tail discrimination. We develop a family of rare-event-stable (RES) metrics whose optimal thresholds remain strictly interior as the event probability approaches zero, ensuring coherent decision rules under extreme rarity. Simulations spanning event probabilities from 0.01 down to one in a million show that RES metrics maintain stable thresholds, consistent model rankings, and near-complete prevalence invariance, whereas traditional metrics exhibit statistically significant threshold drift and structural collapse. A credit-default application confirms these results: RES metrics yield interpretable probability-of-default cutoffs (4-9%) and remain robust under subsampling, while classical metrics fail operationally. The RES framework provides a principled, prevalence-invariant basis for evaluating extreme-risk forecasts.
要約:
Multipurpose batch processes become increasingly popular in manufacturing industries since they adapt to low-volume, high-value products and shifting demands. These processes often operate in a dynamic environment, which faces disturbances such as processing delays and demand changes. To minimise long-term cost and system nervousness (i.e., disruptive changes to schedules), schedulers must design rescheduling strategies to address such disturbances effectively. Existing methods often assume complete look-ahead information over the scheduling horizon. This assumption contrasts with realistic situations where schedulers can only access incomplete look-ahead information. Sticking with existing methods may lead to suboptimal long-term costs and high-level system nervousness. In this work we propose a Bayesian dynamic scheduling method. Our method relies on learning a Bayesian Network from the probability distribution of disturbances. Specifically, the Bayesian Network represents how likely each operation will be impacted by disturbances. During the online execution, when new disturbances become observed, this method updates the posterior distribution and therefore guides the rescheduling strategy. We compare our method with the existing periodic rescheduling strategy (which generates new schedules from scratch at fixed intervals) on four benchmark problems. Computational results show that our method achieves statistically better long-term costs and system nervousness. In the theoretical aspect, we prove that if disturbances are mutually independent, the impact-quantifying variables inherently satisfy the independence assumptions required by Bayesian Networks. As an implication, practitioners can extend the method to other scheduling problems (such as job shop scheduling and continuous processes), given that they define the problem-specific dependencies between operations.
要約:
Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these ''synthetic'' outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, which shows that model-generated outputs are not as real observations, but draws from the foundation prior induced prior predictive distribution. As such synthetic data reflects both the model's learned patterns and the user's subjective priors, expectations, and biases. We model the subjectivity of the generative process by making explicit the dependence of synthetic outputs on the user's anticipated data distribution, the prompt-engineering process, and the trust placed in the foundation model.
We derive the foundation prior as an exponential-tilted, generalized Bayesian update of the user's primitive prior, where a trust parameter governs the weight assigned to synthetic data. We then show how synthetic data and the associated foundation prior can be incorporated into standard statistical and econometric workflows, and discuss their use in applications such as refining complex models, informing latent constructs, guiding experimental design, and augmenting random-coefficient and partially linear specifications. By treating generative outputs as structured, explicitly subjective priors rather than as empirical observations, the framework offers a principled way to harness foundation models in empirical work while avoiding the conflation of synthetic ''facts'' with real data.
要約:
Online learning is the cornerstone of applications like recommendation and advertising systems, where models continuously adapt to shifting data distributions. Model training for such systems is remarkably expensive, a cost that multiplies during hyperparameter search. We introduce a two-stage paradigm to reduce this cost: (1) efficiently identifying the most promising configurations, and then (2) training only these selected candidates to their full potential. Our core insight is that focusing on accurate identification in the first stage, rather than achieving peak performance, allows for aggressive cost-saving measures. We develop novel data reduction and prediction strategies that specifically overcome the challenges of sequential, non-stationary data not addressed by conventional hyperparameter optimization. We validate our framework's effectiveness through a dual evaluation: first on the Criteo 1TB dataset, the largest suitable public benchmark, and second on an industrial advertising system operating at a scale two orders of magnitude larger. Our methods reduce the total hyperparameter search cost by up to 10$\times$ on the public benchmark and deliver significant, validated efficiency gains in the industrial setting.
要約:
Diffusion-based solvers for partial differential equations (PDEs) are often bottle-necked by slow gradient-based test-time optimization routines that use PDE residuals for loss guidance. They additionally suffer from optimization instabilities and are unable to dynamically adapt their inference scheme in the presence of noisy PDE residuals. To address these limitations, we introduce PRISMA (PDE Residual Informed Spectral Modulation with Attention), a conditional diffusion neural operator that embeds PDE residuals directly into the model's architecture via attention mechanisms in the spectral domain, enabling gradient-descent free inference. In contrast to previous methods that use PDE loss solely as external optimization targets, PRISMA integrates PDE residuals as integral architectural features, making it inherently fast, robust, accurate, and free from sensitive hyperparameter tuning. We show that PRISMA has competitive accuracy, at substantially lower inference costs, compared to previous methods across five benchmark PDEs, especially with noisy observations, while using 10x to 100x fewer denoising steps, leading to 15x to 250x faster inference.
要約:
We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment yields narrower prediction intervals at the same target coverage, especially on small to medium tabular datasets where data are scarce and uncertainty modeling matters. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners understand why intervals shrink when they do. Across multiple benchmarks using the same MLP backbone, CLAPS consistently attains nominal coverage with improved efficiency and minimal overhead, offering a clear, practical upgrade to residual-based conformal baselines.
要約:
Discrete event systems are present both in observations of nature, socio economical sciences, and industrial systems. Standard analysis approaches do not usually exploit their dual event / state nature: signals are either modeled as transition event sequences, emphasizing event order alignment, or as categorical or ordinal state timeseries, usually resampled a distorting and costly operation as the observation period and number of events grow. In this work we define state transition event timeseries (STE-ts) and propose a new Selective Temporal Hamming distance (STH) leveraging both transition time and duration-in-state, avoiding costly and distorting resampling on large databases. STH generalizes both resampled Hamming and Jaccard metrics with better precision and computation time, and an ability to focus on multiple states of interest. We validate these benefits on simulated and real-world datasets.
要約:
We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that the \emph{flatness implied generalization} is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most \emph{uncertain} sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.
要約:
This paper introduces a comprehensive unified framework for constructing multi-view diffusion geometries through intertwined multi-view diffusion trajectories (MDTs), a class of inhomogeneous diffusion processes that iteratively combine the random walk operators of multiple data views. Each MDT defines a trajectory-dependent diffusion operator with a clear probabilistic and geometric interpretation, capturing over time the interplay between data views. Our formulation encompasses existing multi-view diffusion models, while providing new degrees of freedom for view interaction and fusion. We establish theoretical properties under mild assumptions, including ergodicity of both the point-wise operator and the process in itself. We also derive MDT-based diffusion distances, and associated embeddings via singular value decompositions. Finally, we propose various strategies for learning MDT operators within the defined operator space, guided by internal quality measures. Beyond enabling flexible model design, MDTs also offer a neutral baseline for evaluating diffusion-based approaches through comparison with randomly selected MDTs. Experiments show the practical impact of the MDT operators in a manifold learning and data clustering context.
要約:
In this paper, we investigate a second-order stochastic algorithm for solving large-scale binary classification problems. We propose to make use of a new hybrid stochastic Newton algorithm that includes two weighted components in the Hessian matrix estimation: the first one coming from the natural Hessian estimate and the second associated with the stochastic gradient information. Our motivation comes from the fact that both parts evaluated at the true parameter of logistic regression, are equal to the Hessian matrix. This new formulation has several advantages and it enables us to prove the almost sure convergence of our stochastic algorithm to the true parameter. Moreover, we significantly improve the almost sure rate of convergence to the Hessian matrix. Furthermore, we establish the central limit theorem for our hybrid stochastic Newton algorithm. Finally, we show a surprising result on the almost sure convergence of the cumulative excess risk.
要約:
A central challenge in machine learning is to distinguish genuine structure from chance correlations in high-dimensional data. In this work, we address this issue for the perceptron, a foundational model of neural computation. Specifically, we investigate the relationship between the pattern load $\alpha$ and the variable selection ratio $\rho$ for which a simple perceptron can perfectly classify $P = \alpha N$ random patterns by optimally selecting $M = \rho N$ variables out of $N$ variables. While the Cover--Gardner theory establishes that a random subset of $\rho N$ dimensions can separate $\alpha N$ random patterns if and only if $\alpha < 2\rho$, we demonstrate that optimal variable selection can surpass this bound by developing a method, based on the replica method from statistical mechanics, for enumerating the combinations of variables that enable perfect pattern classification. This not only provides a quantitative criterion for distinguishing true structure in the data from spurious regularities, but also yields the storage capacity of associative memory models with sparse asymmetric couplings.
要約:
Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
要約:
Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
要約:
We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) which measures the overall miscoverage level. We develop a general framework named CAP (Calibration after Adaptive Pick) that performs an adaptive pick rule on historical data to construct a calibration set if the current individual is selected and then outputs a conformal prediction interval for the unobserved label. We provide tractable procedures for constructing the calibration set for popular online selection rules. We proved that CAP can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. To account for the distribution shift in online data, we also embed CAP into some recent dynamic conformal prediction algorithms and show that the proposed method can deliver long-run FCR control. Numerical results on both synthetic and real data corroborate that CAP can effectively control FCR around the target level and yield more narrowed prediction intervals over existing baselines across various settings.
要約:
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an ''inflated'' argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, (c) a graph subset selection problem using cell-signaling data from proteomics, and (d) unsupervised $\kappa$-means clustering. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
要約:
We consider Heterogeneous Transfer Learning (HTL) from a source to a new target domain for high-dimensional regression with differing feature sets. Most homogeneous TL methods assume that target and source domains share the same feature space, which limits their practical applicability. In applications, the target and source features are frequently different due to the inability to measure certain variables in data-poor target environments. Conversely, existing HTL methods do not provide statistical error guarantees, limiting their utility for scientific discovery. Our method first learns a feature map between the missing and observed features, leveraging the vast source data, and then imputes the missing features in the target. Using the combined matched and imputed features, we then perform a two-step transfer learning for penalized regression. We develop upper bounds on estimation and prediction errors, assuming that the source and target parameters differ sparsely but without assuming sparsity in the target model. We obtain results for both when the feature map is linear and when it is nonparametrically specified as unknown functions. Our results elucidate how estimation and prediction errors of HTL depend on the model's complexity, sample size, the quality and differences in feature maps, and differences in the models across domains.
要約:
Forecast reconciliation is considered an effective method to achieve coherence (within a forecast hierarchy) and to improve forecast quality. However, the value of reconciled forecasts in downstream decision-making tasks has been mostly overlooked. In a multi-agent setup with heterogeneous loss functions, this oversight may lead to unfair outcomes, hence resulting in conflicts during the reconciliation process. To address this, we propose a value-oriented forecast reconciliation approach that focuses on the forecast value for all individual agents. Fairness is ensured through the use of a Nash bargaining framework. Specifically, we model this problem as a cooperative bargaining game, where each agent aims to optimize their own gain while contributing to the overall reconciliation process. We then present a primal-dual algorithm for parameter estimation based on empirical risk minimization. From an application perspective, we consider an aggregated wind energy trading problem, where profits are distributed using a weighted allocation rule. We demonstrate the effectiveness of our approach through several numerical experiments, showing that it consistently results in increased profits for all agents involved.
要約:
In this article, we study the problem of sampling from distributions whose densities are not necessarily smooth nor logconcave. We propose a simple Langevin-based algorithm that does not rely on popular but computationally challenging techniques, such as the Moreau-Yosida envelope or Gaussian smoothing, and show consequently that the performance of samplers like ULA does not necessarily degenerate arbitrarily with low regularity. In particular, we show that the Lipschitz or H\"older continuity assumption can be replaced by a geometric one-sided Lipschitz condition that allows even for discontinuous log-gradients. We derive non-asymptotic guarantees for the convergence of the algorithm to the target distribution in Wasserstein distances. Non-asymptotic bounds are also provided for the performance of the algorithm as an optimizer, specifically for the solution of associated excess risk optimization problems.
要約:
We analyse the convergence of one-hidden-layer ReLU networks trained by gradient flow on $n$ data points. Our main contribution leverages the high dimensionality of the ambient space, which implies low correlation of the input samples, to demonstrate that a network with width of order $\log(n)$ neurons suffices for global convergence with high probability. Our analysis uses a Polyak-{\L}ojasiewicz viewpoint along the gradient-flow trajectory, which provides an exponential rate of convergence of $\frac{1}{n}$. When the data are exactly orthogonal, we give further refined characterizations of the convergence speed, proving its asymptotic behavior lies between the orders $\frac{1}{n}$ and $\frac{1}{\sqrt{n}}$, and exhibiting a phase-transition phenomenon in the convergence rate, during which it evolves from the lower bound to the upper, and in a relative time of order $\frac{1}{\log(n)}$.
要約:
Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite-dimensional function spaces - where such problems are naturally formulated - is crucial to ensure stability and convergence as the discretization of the underlying problem is refined. In this paper, we contribute to this line of work by analyzing a widely used sampler for linear inverse problems: Langevin dynamics driven by score-based generative models (SGMs) acting as priors, formulated directly in function space. Building on the theoretical framework for SGMs in Hilbert spaces, we give a rigorous definition of this sampler in the infinite-dimensional setting and derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback-Leibler divergence on the underlying function space. Preventing numerical instabilities requires preconditioning of the Langevin algorithm and we prove the existence and the form of an optimal preconditioner. The preconditioner depends on both the score error and the forward operator and guarantees a uniform convergence rate across all posterior modes. Our analysis applies to both Gaussian and a general class of non-Gaussian priors. Finally, we present examples that illustrate and validate our theoretical findings.
要約:
Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization gap, which is the impact of overfitting. Understanding generalization gap behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization gap applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the R\'enyi entropy of the data-generating distribution, suggesting that a small generalization gap can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degrade is attributed to the consequent increase in the data distribution's R\'enyi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the R\'enyi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.
要約:
Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches $S$ in the comparator sequence, the total variation $\Delta$ of the loss functions, and the path-length $P$ of the comparator sequence. We propose a polynomial-time algorithm, Tilted Exponentially Weighted Average with Sleeping Experts (TEWA-SE), which adapts the sleeping experts framework from online convex optimization to the bandit setting. For strongly convex losses, we prove that TEWA-SE is minimax-optimal with respect to known $S$ and $\Delta$ by establishing matching upper and lower bounds. By equipping TEWA-SE with the Bandit-over-Bandit framework, we extend our analysis to environments with unknown non-stationarity measures. For general convex losses, we introduce a second algorithm, clipped Exploration by Optimization (cExO), based on exponential weights over a discretized action space. While not polynomial-time computable, this method achieves minimax-optimal regret with respect to known $S$ and $\Delta$, and improves on the best existing bounds with respect to $P$.
要約:
Understanding the geometry of the loss landscape near a minimum is key to explaining the implicit bias of gradient-based methods in non-convex optimization problems such as deep neural network training and deep matrix factorization. A central quantity to characterize this geometry is the maximum eigenvalue of the Hessian of the loss, which measures the sharpness of the landscape. Currently, its precise role has been obfuscated because no exact expressions for this sharpness measure were known in general settings. In this paper, we present the first exact expression for the maximum eigenvalue of the Hessian of the squared-error loss at any minimizer in general overparameterized deep matrix factorization (i.e., deep linear neural network training) problems, resolving an open question posed by Mulayoff & Michaeli (2020). This expression uncovers a fundamental property of the loss landscape of depth-2 matrix factorization problems: a minimum is flat if and only if it is spectral-norm balanced, which implies that flat minima are not necessarily Frobenius-norm balanced. Furthermore, to complement our theory, we empirically investigate an escape phenomenon observed during gradient-based training near a minimum that crucially relies on our exact expression of the sharpness.
要約:
Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a critical need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts-including missingness mechanisms, single versus multiple imputation, and different imputation goals-and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, spanning classical techniques (e.g., regression, the EM algorithm) to modern approaches like low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the crucial integration of imputation with downstream tasks like classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types, thereby outlining a roadmap for future research.
要約:
Wasserstein gradient flow provides a general framework for minimizing an energy functional $J$ over the space of probability measures on a Riemannian manifold $(M,g)$. Its canonical time-discretization, the Jordan-Kinderlehrer-Otto (JKO) scheme, produces for any step size $\eta>0$ a sequence of probability distributions $\rho_k^\eta$ that approximate to first order in $\eta$ Wasserstein gradient flow on $J$. But the JKO scheme also has many other remarkable properties not shared by other first order integrators, e.g. it preserves energy dissipation and exhibits unconditional stability for $\lambda$-geodesically convex functionals $J$. To better understand the JKO scheme we characterize its implicit bias at second order in $\eta$. We show that $\rho_k^\eta$ are approximated to order $\eta^2$ by Wasserstein gradient flow on a modified energy \[ J^{\eta}(\rho) = J(\rho) - \frac{\eta}{4}\int_M \Big\lVert \nabla_g \frac{\delta J}{\delta \rho} (\rho) \Big\rVert_{2}^{2} \,\rho(dx), \] obtained by subtracting from $J$ the squared metric curvature of $J$ times $\eta/4$. The JKO scheme therefore adds at second order in $\eta$ a deceleration in directions where the metric curvature of $J$ is rapidly changing. This corresponds to canonical implicit biases for common functionals: for entropy the implicit bias is the Fisher information, for KL-divergence it is the Fisher-Hyv{\"a}rinen divergence, and for Riemannian gradient descent it is the kinetic energy in the metric $g$. To understand the differences between minimizing $J$ and $J^\eta$ we study JKO-Flow, Wasserstein gradient flow on $J^\eta$, in several simple numerical examples. These include exactly solvable Langevin dynamics on the Bures-Wasserstein space and Langevin sampling from a quartic potential in 1D.
要約:
Feature selection has remained a daunting challenge in machine learning and artificial intelligence, where increasingly complex, high-dimensional datasets demand principled strategies for isolating the most informative predictors. Despite widespread adoption, many established techniques suffer from notable limitations; some incur substantial computational cost, while others offer no definite statistical driven stopping criteria or assesses the significance of their importance scores. A common heuristic approach introduces multiple random noise features and retains all predictors ranked above the strongest noise feature. Although intuitive, this strategy lacks theoretical justification and depends heavily on heuristics. This paper proposes a novel feature selection method that addresses these limitations. Our approach introduces multiple random noise features and evaluates each feature's importance against the maximum importance value among these noise features incorporating a non-parametric bootstrap-based hypothesis testing framework to establish a solid theoretical foundation. We establish the conceptual soundness of our approach through statistical derivations that articulate the principles guiding the design of our algorithm. To evaluate its reliability, we generated simulated datasets under controlled statistical settings and benchmarked performance against Boruta and Knockoff-based methods, observing consistently stronger recovery of meaningful signal. As a demonstration of practical utility, we applied the technique across diverse real-world datasets, where it surpassed feature selection techniques including Boruta, RFE, and Extra Trees. Hence, the method emerges as a robust algorithm for principled feature selection, enabling the distillation of informative predictors that support reliable inference, enhanced predictive performance, and efficient computation.
要約:
Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain -- even well-calibrated models provide no mechanism to identify \textit{which specific predictions} are unreliable. We develop a geometric framework addressing both calibration and instance-level uncertainty quantification for neural network probability outputs. Treating probability vectors as points on the $(c-1)$-dimensional probability simplex equipped with the Fisher--Rao metric, we construct: (i) Additive Log-Ratio (ALR) calibration maps that reduce exactly to Platt scaling for binary problems while extending naturally to multi-class settings, and (ii) geometric reliability scores that translate calibrated probabilities into actionable uncertainty measures, enabling principled deferral of ambiguous predictions to human review.
Theoretical contributions include: consistency of the calibration estimator at rate $O_p(n^{-1/2})$ via M-estimation theory (Theorem~1), and tight concentration bounds for reliability scores with explicit sub-Gaussian parameters enabling sample size calculations for validation set design (Theorem~2). We conjecture Neyman--Pearson optimality of our neutral zone construction based on connections to Bhattacharyya coefficients. Empirical validation on Adeno-Associated Virus classification demonstrates that the two-stage framework captures 72.5\% of errors while deferring 34.5\% of samples, reducing automated decision error rates from 16.8\% to 6.9\%. Notably, calibration alone yields marginal accuracy gains; the operational benefit arises primarily from the reliability scoring mechanism, which applies to any well-calibrated probability output. This work bridges information geometry and statistical learning, offering formal guarantees for uncertainty-aware classification in applications requiring rigorous validation.
要約:
Sparse Principal Component Analysis (sPCA) is a cardinal technique for obtaining combinations of features, or principal components (PCs), that explain the variance of high-dimensional datasets in an interpretable manner. This involves solving a sparsity and orthogonality constrained convex maximization problem, which is extremely computationally challenging. Most existing works address sparse PCA via methods-such as iteratively computing one sparse PC and deflating the covariance matrix-that do not guarantee the orthogonality, let alone the optimality, of the resulting solution when we seek multiple mutually orthogonal PCs. We challenge this status by reformulating the orthogonality conditions as rank constraints and optimizing over the sparsity and rank constraints simultaneously. We design tight semidefinite relaxations to supply high-quality upper bounds, which we strengthen via additional second-order cone inequalities when each PC's individual sparsity is specified. Further, we derive a combinatorial upper bound on the maximum amount of variance explained as a function of the support. We exploit these relaxations and bounds to propose exact methods and rounding mechanisms that, together, obtain solutions with a bound gap on the order of 0%-15% for real-world datasets with p = 100s or 1000s of features and r \in {2, 3} components. Numerically, our algorithms match (and sometimes surpass) the best performing methods in terms of fraction of variance explained and systematically return PCs that are sparse and orthogonal. In contrast, we find that existing methods like deflation return solutions that violate the orthogonality constraints, even when the data is generated according to sparse orthogonal PCs. Altogether, our approach solves sparse PCA problems with multiple components to certifiable (near) optimality in a practically tractable fashion.
要約:
The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve linear speedup with respect to the number of nodes. Key questions regarding its behavior in the non-convex setting and the achievability of linear speedup remain open. In this paper, we revisit ProxSkip and address both questions. We provide a comprehensive analysis for stochastic non-convex, convex, and strongly convex problems, revealing the effects of gradient noise, local updates, network connectivity, and data heterogeneity on its convergence. We prove that ProxSkip achieves linear speedup across all three settings, and can further achieve linear speedup with network-independent stepsizes in the strongly convex setting. Moreover, we show that properly increasing local updates effectively reduces communication complexity.
要約:
Real-world time series data often exhibits substantial missing values, posing challenges for advanced analysis. A common approach to addressing this issue is imputation, where the primary challenge lies in determining the appropriate values to fill in. While previous deep learning methods have proven effective for time series imputation, they often produce overconfident imputations, which poses a potentially overlooked risk to the reliability of the intelligent system. Diffusion methods are proficient in estimating probability distributions but face challenges under a high missing rate and are, moreover, computationally expensive due to the nature of the generative model framework. In this paper, we propose Quantile Sub-Ensembles, a novel method that estimates uncertainty with ensembles of quantile-regression-based task networks and incorporate Quantile Sub-Ensembles into a non-generative time series imputation method. Our method not only produces accurate and reliable imputations, but also remains computationally efficient due to its non-generative framework. We conduct extensive experiments on five real-world datasets, and the results demonstrates superior performance in both deterministic and probabilistic imputation compared to baselines across most experimental settings. The code is available at https://github.com/yingliu-coder/QSE.
要約:
We study the information geometry of $\bcc$-divergences from families of costs of the form $\mathsf{c}(x, \barx) =\mathsf{u}(x^{\mathfrak{t}}\barx)$ through the optimal transport point of view. Here, $\mathsf{u}$ is a scalar function with inverse $\mathsf{s}$, $x^{\ft}\barx$ is a nondegenerate bilinear pairing of vectors $x, \barx$ belonging to an open subset of $\mathbb{R}^n$. We compute explicitly the MTW tensor (or cross curvature) for the optimal transport problem on $\mathbb{R}^n$ with this cost. The condition that the MTW-tensor vanishes on null vectors under the Kim-McCann metric is a fourth-order nonlinear ODE, which could be reduced to a linear ODE of the form $\mathsf{s}^{(2)} - S\mathsf{s}^{(1)} + P\mathsf{s} = 0$ with constant coefficients $P$ and $S$. The resulting inverse functions include {\it Lambert} and {\it generalized inverse hyperbolic\slash trigonometric} functions. The square Euclidean metric and $\log$-type costs are equivalent to instances of these solutions. The optimal map may be written explicitly in terms of the potential function. For cost functions of a similar form on a hyperboloid model of the hyperbolic space and unit sphere, we also express this tensor in terms of algebraic expressions in derivatives of $\mathsf{s}$ using the Gauss-Codazzi equation, obtaining new families of strictly regular costs for these manifolds, including new families of {\it power function costs}. We express the divergence geometry of the $\mathsf{c}$-divergence in terms of the Kim-McCann metric, including a $\mathsf{c}$-Crouzeix identity and a formula for the primal connection. We analyze the $\sinh$-type hyperbolic cost, providing examples of $\mathsf{c}$-convex functions, which are used to construct a new \emph{local form} of the $\alpha$-divergences on probability simplices. We apply the optimal maps to sample the multivariate $t$-distribution.
要約:
We propose methods to estimate the individual $\beta$-mixing coefficients of a real-valued geometrically ergodic Markov process from a single sample-path $X_0,X_1, \dots,X_n$. Under standard smoothness conditions on the densities, namely, that the joint density of the pair $(X_0,X_m)$ for each $m$ lies in a Besov space $B^s_{1,\infty}(\mathbb R^2)$ for some known $s>0$, we obtain a rate of convergence of order $\mathcal{O}(\log(n) n^{-[s]/(2[s]+2)})$ for the expected error of our estimator in this case\footnote{We use $[s]$ to denote the integer part of the decomposition $s=[s]+\{s\}$ of $s \in (0,\infty)$ into an integer term and a {\em strictly positive} remainder term $\{s\} \in (0,1]$.}. We complement this result with a high-probability bound on the estimation error, and further obtain analogues of these bounds in the case where the state-space is finite. Naturally no density assumptions are required in this setting; the expected error rate is shown to be of order $\mathcal O(\log(n) n^{-1/2})$.
要約:
In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
要約:
This paper presents a novel approach for estimating the Koopman operator defined on a reproducing kernel Hilbert space (RKHS) and its spectra. We propose an estimation method, what we call Jet Extended Dynamic Mode Decomposition (JetEDMD), leveraging the intrinsic structure of RKHS and the geometric notion known as jets to enhance the estimation of the Koopman operator. This method refines the traditional Extended Dynamic Mode Decomposition (EDMD) in accuracy, especially in the numerical estimation of eigenvalues. This paper proves JetEDMD's superiority through explicit error bounds and convergence rate for special positive definite kernels, offering a solid theoretical foundation for its performance. We also investigate the spectral analysis of the Koopman operator, proposing the notion of an extended Koopman operator within a framework of a rigged Hilbert space. This notion leads to a deeper understanding of estimated Koopman eigenfunctions and capturing them outside the original function space. Through the theory of rigged Hilbert space, our study provides a principled methodology to analyze the estimated spectrum and eigenfunctions of Koopman operators, and enables eigendecomposition within a rigged RKHS. We also propose a new effective method for reconstructing the dynamical system from temporally-sampled trajectory data of the dynamical system with solid theoretical guarantee. We conduct several numerical simulations using the van der Pol oscillator, the Duffing oscillator, the H\'enon map, and the Lorenz attractor, and illustrate the performance of JetEDMD with clear numerical computations of eigenvalues and accurate predictions of the dynamical systems.
要約:
The Cox Proportional Hazards (CPH) model has long been the preferred survival model for its explainability. However, to increase its predictive power beyond its linear log-risk, it was extended to utilize deep neural networks, sacrificing its explainability. In this work, we explore the potential of self-explaining neural networks (SENN) for survival analysis. We propose a new locally explainable Cox proportional hazards model, named CoxSE, by estimating a locally-linear log-hazard function using the SENN. We also propose a modification to the Neural additive (NAM) model, hybrid with SENN, named CoxSENAM, which enables the control of the stability and consistency of the generated explanations.
Several experiments using synthetic and real datasets are presented, benchmarking CoxSE and CoxSENAM against a NAM-based model, a DeepSurv model explained with SHAP, and a linear CPH model. The results show that, unlike the NAM-based model, the SENN-based model can provide more stable and consistent explanations while maintaining the predictive power of the black-box model. The results also show that, due to their structural design, NAM-based models demonstrate better robustness to non-informative features. Among the models, the hybrid model exhibits the best robustness.
要約:
Effective feature selection is essential for optimizing contextual multi-armed bandits (CMABs) in large-scale online systems, where suboptimal features can degrade rewards, interpretability, and efficiency. Traditional feature selection often prioritizes outcome correlation, neglecting the crucial role of heterogeneous treatment effects (HTE) across arms in CMAB decision-making. This paper introduces two novel, model-free filter methods, Heterogeneous Incremental Effect (HIE) and Heterogeneous Distribution Divergence (HDD), specifically designed to identify features driving HTE. HIE quantifies a feature's value based on its ability to induce changes in the optimal arm, while HDD measures its impact on reward distribution divergence across arms. These methods are computationally efficient, robust to model mis-specification, and adaptable to various feature types, making them suitable for rapid screening in dynamic environments where retraining complex models is infeasible. We validate HIE and HDD on synthetic data with known ground truth and in a large-scale commercial recommender system, demonstrating their consistent ability to identify influential HTE features and thereby enhance CMAB performance.
要約:
Class-incremental learning (CIL) aims to learn new classes while retaining previous knowledge. Although pre-trained model (PTM) based approaches show strong performance, directly fine-tuning PTMs on incremental task streams often causes renewed catastrophic forgetting. This paper proposes a Dual-Prototype Network with Task-wise Adaptation (DPTA) for PTM-based CIL. For each incremental learning task, an adapter module is built to fine-tune the PTM, where the center-adapt loss forces the representation to be more centrally clustered and class separable. The dual prototype network improves the prediction process by enabling test-time adapter selection, where the raw prototypes deduce several possible task indexes of test samples to select suitable adapter modules for PTM, and the augmented prototypes that could separate confusable classes are utilized to determine the final result. Experiments on multiple benchmarks show that DPTA consistently surpasses recent methods by 1\% - 5\%. Notably, on the VTAB dataset, it achieves approximately 3\% improvement over state-of-the-art methods. The code is open-sourced in https://github.com/Yorkxzm/DPTA}
要約:
Standard conformal prediction methods guarantee marginal coverage but often produce inefficient intervals that fail to adapt to local heteroscedasticity, while recent localized approaches often struggle to maintain validity across distinct subpopulations with varying noise profiles. To address these challenges, we introduce Localized Conformal Multi-Quantile Regression (LCMQR), a novel framework that synergizes multi-quantile information with kernel-based localization to construct efficient and adaptive prediction intervals. Theoretically, we resolve an inconsistency in Conformalized Composite Quantile Regression (CCQR) by proving that our consistent Average-then-Max scoring mechanism systematically yields tighter intervals than the Max-then-Average approach used in prior work. For heterogeneous environments, we extend this framework to Group-Calibrated LCMQR (GC-LCMQR) via a stratified calibration step that guarantees finite-sample validity within distinct subgroups. Experiments on benchmark datasets and an Individual Treatment Effect (ITE) task demonstrate that LCMQR achieves superior efficiency on standard benchmarks, while GC-LCMQR uniquely achieves group-level coverage for target subgroups in mixture populations where baselines fail.
要約:
The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.
要約:
In orthogonal frequency division multiplexing (OFDM), accurate channel estimation is crucial. Classical signal processing based approaches, such as minimum mean-squared error (MMSE) estimation, often require second-order statistics that are difficult to obtain in practice. Recent deep neural networks based methods have been introduced to address this; yet they often suffer from high inference complexity. This paper proposes an Attention-aided MMSE (A-MMSE), a novel model-based DNN framework that learns the optimal MMSE filter via the Attention Transformer. Once trained, the A-MMSE estimates the channel through a single linear operation for channel estimation, eliminating nonlinear activations during inference and thus reducing computational complexity. To enhance the learning efficiency of the A-MMSE, we develop a two-stage Attention encoder, designed to effectively capture the channel correlation structure. Additionally, a rank-adaptive extension of the proposed A-MMSE allows flexible trade-offs between complexity and channel estimation accuracy. Extensive simulations with 3GPP TDL channel models demonstrate that the proposed A-MMSE consistently outperforms other baseline methods in terms of normalized MSE across a wide range of signal-to-noise ratio (SNR) conditions. In particular, the A-MMSE and its rank-adaptive extension establish a new frontier in the performance-complexity trade-off, providing a powerful yet highly efficient solution for practical channel estimation
要約:
Including intricate topological information (e.g., cycles) provably enhances the expressivity of message-passing graph neural networks (GNNs) beyond the Weisfeiler-Leman (WL) hierarchy. Consequently, Persistent Homology (PH) methods are increasingly employed for graph representation learning. In this context, recent works have proposed decorating classical PH diagrams with vertex and edge features for improved expressivity. However, these methods still fail to capture basic graph structural information. In this paper, we propose SpectRe -- a new topological descriptor for graphs that integrates spectral information into PH diagrams. Notably, SpectRe is strictly more expressive than existing descriptors on graphs. We also introduce notions of global and local stability to analyze existing descriptors and establish that SpectRe is locally stable. Finally, experiments on synthetic and real-world datasets demonstrate the effectiveness of SpectRe and its potential to enhance the capabilities of graph models in relevant learning tasks. Code is available at https://github.com/Aalto-QuML/SpectRe/.
要約:
State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
著者: Lydia T. Liu, Inioluwa Deborah Raji, Angela Zhou, Luke Guerdan, Jessica Hullman, Daniel Malinsky, Bryan Wilder, Simone Zhang, Hammaad Adam, Amanda Coston, Ben Laufer, Ezinne Nwankwo, Michael Zanger-Tishler, Eli Ben-Michael, Solon Barocas, Avi Feller, Marissa Gerchick, Talia Gillis, Shion Guha, Daniel Ho, Lily Hu, Kosuke Imai, Sayash Kapoor, Joshua Loftus, Razieh Nabi, Arvind Narayanan, Ben Recht, Juan Carlos Perdomo, Matthew Salganik, Mark Sendak, Alexander Tolbert, Berk Ustun, Suresh Venkatasubramanian, Angelina Wang, Ashia Wilson
要約:
Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
要約:
We derive a new class of non-linear expectations from first-principles deterministic chaotic dynamics. The homogenization of the system's skew-adjoint microscopic generator is achieved using the spectral theory of transfer operators for uniformly hyperbolic flows. We prove convergence in the viscosity sense to a macroscopic evolution governed by a fully non-linear Hamilton-Jacobi-Bellman (HJB) equation. Our central result establishes that the HJB Hamiltonian possesses a rigid structure: affine in the Hessian but demonstrably non-convex in the gradient. This defines a new $\theta$-expectation and constructively establishes a class of non-convex stochastic control problems fundamentally outside the sub-additive framework of G-expectations.
要約:
The success of federated learning (FL) ultimately depends on how strategic participants behave under partial observability, yet most formulations still treat FL as a static optimization problem. We instead view FL deployments as governed strategic systems and develop an analytical framework that separates welfare-improving behavior from metric gaming. Within this framework, we introduce indices that quantify manipulability, the price of gaming, and the price of cooperation, and we use them to study how rules, information disclosure, evaluation metrics, and aggregator-switching policies reshape incentives and cooperation patterns. We derive threshold conditions for deterring harmful gaming while preserving benign cooperation, and for triggering auto-switch rules when early-warning indicators become critical. Building on these results, we construct a design toolkit including a governance checklist and a simple audit-budget allocation algorithm with a provable performance guarantee. Simulations across diverse stylized environments and a federated learning case study consistently match the qualitative and quantitative patterns predicted by our framework. Taken together, our results provide design principles and operational guidelines for reducing metric gaming while sustaining stable, high-welfare cooperation in FL platforms.
要約:
We introduce ORACLE, a framework that explains neural networks on tabular and scientific design data. It fits ANOVA-style main and pairwise interaction effects to a model's prediction surface. ORACLE treats a trained network as a black-box response, learns an orthogonal factorial surrogate on a discretized input grid, and uses simple centering and $\mu$-rebalancing steps to obtain main- and interaction-effect tables that remain $L^2$-consistent with the original model. The resulting grid-based interaction maps are easy to visualize, comparable across backbones, and directly connected to classical design-of-experiments analyses. On synthetic factorial and low- to medium-dimensional tabular regression benchmarks, ORACLE more accurately recovers ground-truth ANOVA interactions and hotspot structure than Monte Carlo SHAP-family interaction methods, as measured by ranking, localization, and cross-backbone stability metrics. In latent image and text settings, ORACLE instead delineates its natural scope, and our results indicate that grid-based ANOVA surrogates are most effective when features admit interpretable factorial structure, making ORACLE particularly well-suited to scientific and engineering tabular workflows that require stable, DoE-style interaction summaries.
要約:
Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed the accuracy of larger explicit networks by allocating more test-time compute, the underlying mechanism remains poorly understood.
We study this gap through a nonparametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process lets the model's expressive power scale with test-time compute, ultimately matching a much richer function class. The theory is validated across four domains: image reconstruction, scientific computing, operations research, and LLM reasoning, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.
要約:
The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) -- each update step is $UV^T$ where $U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in balanced accuracy favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data's underlying components.
要約:
We study the computational task of detecting and estimating correlated signals in a pair of spiked matrices $$ X=\tfrac{\lambda}{\sqrt{n}} xu^{\top}+W, \quad Y=\tfrac{\mu}{\sqrt{n}} yv^{\top}+Z $$ where the spikes $x,y$ have correlation $\rho$. Specifically, we consider two fundamental models: (1) Correlated spiked Wigner model with signal-to-noise ratio $\lambda,\mu$; (2) Correlated spiked $n*N$ Wishart (covariance) model with signal-to-noise ratio $\sqrt\lambda,\sqrt\mu$.
We propose an efficient detection and estimation algorithm based on counting a specific family of edge-decorated cycles. The algorithm's performance is governed by the function $$ F(\lambda,\mu,\rho,\gamma)=\max\Big\{ \frac{ \lambda^2 }{ \gamma }, \frac{ \mu^2 }{ \gamma }, \frac{ \lambda^2 \rho^2 }{ \gamma-\lambda^2+\lambda^2 \rho^2 } + \frac{ \mu^2 \rho^2 }{ \gamma-\mu^2+\mu^2 \rho^2 } \Big\} \,. $$ We prove our algorithm succeeds for the correlated spiked Wigner model whenever $F(\lambda,\mu,\rho,1)>1$, and succeeds for the correlated spiked Wishart model whenever $F(\lambda,\mu,\rho,\tfrac{n}{N})>1$. Our result shows that an algorithm can leverage the correlation between the spikes to detect and estimate the signals even in regimes where efficiently recovering either $x$ from ${X}$ alone or $y$ from ${Y}$ alone is believed to be computationally infeasible.
We complement our algorithmic results with evidence for a matching computational lower bound. In particular, we prove that when $F(\lambda,\mu,\rho,1)<1$ for the correlated spiked Wigner model and when $F(\lambda,\mu,\rho,\tfrac{n}{N})<1$ for the spiked Wishart model, all algorithms based on low-degree polynomials fails to distinguish $({X},{Y})$ with two independent noise matrices. This strongly suggests that $F=1$ is the precise computation threshold for our models.
要約:
Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamNX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to momentum SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamNX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.
要約:
We propose a nonparametric estimator of multivariate joint entropy based on partitioned sample spacing (PSS). The method extends univariate spacing ideas to $\mathbb{R}^{d}$ by partitioning into localized cells and aggregating within-cell statistics, with strong consistency guarantees under mild conditions. In benchmarks across diverse distributions, PSS consistently outperforms $k$-nearest neighbor estimators and achieves accuracy competitive with recent normalizing flow-based methods, while requiring no training or auxiliary density modeling. The estimator scales favorably in moderately high dimensions ($d = 10$--$40$) and shows particular robustness to correlated or skewed distributions. These properties position PSS as a practical and reliable alternative to both $k$NN and NF-based entropy estimators, with broad utility in information-theoretic machine learning tasks such as total-correlation estimation, representation learning, and feature selection.
要約:
We extend the convergence analysis of AdaSLS and AdaSPS in [Jiang and Stich, 2024] to the nonconvex setting, presenting a unified convergence analysis of stochastic gradient descent with adaptive Armijo line-search (AdaSLS) and Polyak stepsize (AdaSPS) for nonconvex optimization. Our contributions include: (1) an $\mathcal{O}(1/\sqrt{T})$ convergence rate for general nonconvex smooth functions, (2) an $\mathcal{O}(1/T)$ rate under quasar-convexity and interpolation, and (3) an $\mathcal{O}(1/T)$ rate under the strong growth condition for general nonconvex functions.
要約:
We study streaming data with categorical features where the vocabulary of categorical feature values is changing and can even grow unboundedly over time. Feature hashing is commonly used as a pre-processing step to map these categorical values into a feature space of fixed size before learning their embeddings. While these methods have been developed and evaluated for offline or batch settings, in this paper we consider online settings. We show that deterministic embeddings are sensitive to the arrival order of categories and suffer from forgetting in online learning, leading to performance deterioration. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model parameters and infer/update the posteriors of hash embeddings and other latent variables. Our algorithm (i) can handle an evolving vocabulary of categorical items, (ii) is adaptive to new items without forgetting old items, (iii) is implementable with a bounded set of parameters that does not grow with the number of distinct observed values on the stream, and (iv) is invariant to the item arrival order. Experiments in classification, sequence modeling, and recommendation systems in online learning setups demonstrate the superior performance of PHE while maintaining high memory efficiency (consumes as low as 2~4 memory of a one-hot embedding table). Supplementary materials are at https://github.com/aodongli/probabilistic-hash-embeddings