要約:
Trust Region Bayesian Optimization (TuRBO) is an effective strategy for alleviating the curse of dimensionality in high-dimensional black-box optimization. However, inappropriate lengthscale design can cause the local Gaussian process (GP) model within the trust region to degenerate, leading to suboptimal performance in high dimensions. In this work, we show that TuRBO's local GP may remain either excessively complex or overly simple as the dimension $D$ and trust region side length $L$ vary. To address this issue, we propose a straightforward variant, AdaScale-TuRBO, which scales the GP lengthscale with both the problem dimension and trust region size, thereby preserving kernel geometry and maintaining consistent prior complexity. Empirically, we show that AdaScale-TuRBO can robustly outperform standard TuRBO and other popular high-dimensional BO methods on synthetic benchmarks and real-world trajectory planning tasks.
要約:
Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.
要約:
Causal effect estimation from observational data requires careful adjustment for confounding. Classical estimators such as inverse probability weighting and augmented inverse probability weighting are effective under favorable model specification, but may become unstable when treatment assignment and outcome mechanisms are complex, non-linear, and high-dimensional. Machine learning and representation learning approaches improve flexibility, yet joint training can allow outcome-related information to influence treatment-side representations, which is undesirable from a causal perspective. We propose MOCA (Modular One-way Causal Attention), a transformer-based framework that separates treatment and outcome modeling through a modular design, and performs confounder adjustment using a one-way attention mechanism. A cutting-feedback strategy, implemented via gradient detachment, prevents the outcome loss from updating the treatment module. This design preserves directional information flow while retaining the representational power of transformer architectures for causal inference. Across multiple simulated scenarios, including linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings, MOCA shows competitive or improved performance relative to IPW, AIPW, X-learner, TARNet, and DragonNet. We further illustrate the method on the Infant Health and Development Program dataset and the Dehejia-Wahba dataset as real-world benchmarks. These results suggest that modular attention with one-way information flow provides a promising and interpretable direction for causal inference with modern deep learning models.
要約:
Existing large-dimensional theory for spectral algorithms resolves either the optimally tuned point or the interpolation limit, but leaves the under-regularized regime unexplored. We study the learning curve and benign overfitting of spectral algorithms in the large-dimensional setting where the sample size and dimension are of comparable order, i.e., $n \asymp d^{\gamma}$ for some $\gamma>0$. We first consider inner-product kernels on the sphere $\mathbb{S}^{d-1}$ and establish a sharp asymptotic characterization of the excess risk across the full regularization path under various source conditions $s \geq 0$, where $s$ measures the relative smoothness of the regression function. Our results reveal that the learning curve is not simply U-shaped but instead consists of three distinct regimes: over-regularized, under-regularized, and interpolation regimes. This characterization allows us to fully capture the benign overfitting phenomenon, demonstrating that benign overfitting arises consistently across both the under-regularized and interpolation regimes whenever $s$ is positive but no larger than a critical threshold. We further show that, in the sufficiently regularized regime, the kernel learning curve is recovered by an associated sequence model. Finally, we extend the learning-curve analysis to large-dimensional KRR for a class of kernels on general domains in $\mathbb{R}^d$ whose low-degree eigenspaces satisfy spectral-scaling and hyper-contractivity conditions.
要約:
An approach to construct explicit integral representations for two-layer ReLU networks is presented, which provides relatively simple representations for any multivariate polynomial. Quantitative bounds are provided for a particular, sharpened ReLU integral representation, which involves a harmonic extension and a projection. The bounds demonstrate that functions can be approximated with $L^{2}(\mathcal{D})$ errors that do not depend explicitly on dimension or degree, but rather the coefficients of their monomial expansions and the distribution $\mathcal{D}$.
要約:
Reliable decision-making with streaming data requires principled uncertainty quantification of online methods. While first-order methods enable efficient iterate updates, their inference procedures still require updating proper (covariance) matrices, incurring $O(d^2)$ time and memory complexity, and are sensitive to ill-conditioning and noise heterogeneity of the problem. This costly inference task offers an opportunity for more robust second-order methods, which are, however, bottlenecked by solving Newton systems with $O(d^3)$ complexity. In this paper, we address this gap by studying an online Newton method with Hessian averaging, where the Newton direction at each step is approximately computed using a sketch-and-project solver with Nesterov's acceleration, matching $O(d^2)$ complexity of first-order methods. For the proposed method, we quantify its uncertainty arising from both random data and randomized computation. Under standard smoothness and moment conditions, we establish global almost-sure convergence, prove asymptotic normality of the last iterate with a limiting covariance characterized by a Lyapunov equation, and develop a fully online covariance estimator with non-asymptotic convergence guarantees. We also connect the resulting uncertainty quantification to that of exact and sketched Newton methods without Nesterov's acceleration. Extensive experiments on regression models demonstrate the superiority of the proposed method for online inference.
要約:
The health condition of components in civil infrastructures can be described by various discrete states according to their performance degradation. Inferring these states from measurable responses is typically an ill-posed inverse problem. Although Bayesian methods are well-suited to tackle such problems, computing the posterior probability density function (PDF) presents challenges. The likelihood function cannot be analytically formulated due to the unclear relationship between discrete states and structural responses, and the high-dimensional state parameters resulting from numerous components severely complicates the computation of the marginal likelihood function. To address these challenges, this study proposes a novel Bayesian inversion paradigm for discrete variables based on Probabilistic Graphical Models (PGMs). The Markov networks are employed as modeling tools, with model parameters learned from data and structural topology prior. It has been proved that inferring this PGM produces the same probabilistic estimation as the posterior PDF derived from Bayesian inference, which effectively solves the above challenges. The inference is accomplished by Graph Neural Networks (GNNs), and a graph property-based GNN training strategy is developed to enable accurate inference across varying graph scales, thereby significantly reducing the computational overhead in high-dimensional problems. Both synthetic and experimental data are used to validate the proposed framework
要約:
Semi-supervised classification, where unlabeled data are massive but labeled data are limited, often arises in machine learning applications. We address this challenge under high-dimensional data by leveraging the manifold and cluster assumptions. Based on the Fermat distance, a density-sensitive metric that naturally encodes the cluster assumption, we propose the weighted $k$-nearest neighbors (NN) classifier and multidimensional scaling (MDS)-induced classifiers. The use of MDS with a large target dimension allows the effective application of linear classifiers to complex manifold data. Theoretically, we derive a sharp lower bound for the expected excess risk within clusters and prove that the weighted $k$-NN classifier utilizing the true Fermat distance is minimax optimal. Furthermore, we explicitly quantify the utility of unlabeled data by showing that the error arising from estimating the Fermat distance decays exponentially with the pooled sample size. Such a rate is much faster than the related rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance of our approaches compared to state-of-the-art graph-based semi-supervised classifiers.
要約:
We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a smoothened sliced OT (SOT) plan. To the best of our knowledge, SROT is the first approach to leverage a version of SOT plan as a reference to improve classical OT. We provide a formal definition of SROT, derive its dual formulation, and provide a post-Bayesian interpretation of SROT. We then develop a Sinkhorn-style algorithm for efficient computation, retaining the same scalability advantages as EOT. By incorporating a scalable SOT plan as a prior, SROT yields more accurate approximations of the exact OT plan than EOT under the same level of regularization. Moreover, the resulting transport plan improves upon the reference SOT plan itself. We further introduce the corresponding OT divergence induced by SROT, named SROT divergence, and analyze its topological and computational properties. Finally, we validate our approach through experiments on synthetic datasets and color transfer tasks, demonstrating that SROT is better than both EOT and SOT in approximating exact OT. Additional experiments on gradient flows further highlight the advantages of SROT divergence.
要約:
Stochastic reduced-order models are widely used to represent the effective dynamics of complex systems, but estimating their drift and diffusion coefficients from data remains challenging. Standard approaches often rely on short-time trajectory increments, state-space partitioning, or repeated simulation of candidate models, which become unreliable or computationally expensive for high-dimensional systems, coarse temporal sampling, or unevenly sampled data. We introduce a data-driven calibration method based on a novel relationship between the coefficients of a stochastic reduced model and the conditional score of the finite-time transition density, defined as the gradient of the logarithm of the transition density with respect to the initial state. The resulting identity expresses derivatives of lagged correlation functions as stationary expectations over observed lagged pairs involving this conditional score and the unknown model coefficients. This formulation allows the drift and diffusion structure to be constrained directly from finite-lag statistics, without differentiating trajectories, partitioning state space, or repeatedly integrating candidate reduced models during calibration, yielding a least-squares fitting problem over stationary lagged pairs. We validate the approach on analytically tractable and data-driven nonequilibrium diffusions, demonstrating that the inferred models preserve the invariant statistics while accurately reproducing finite-lag dynamical correlations. The framework provides a scalable route for learning stochastic reduced-order models from data that reproduce prescribed statistical and dynamical properties.
要約:
This paper uses a minimum divergence framework to introduce a new way of calculating model weights that can be used to average probabilistic predictions from statistical and machine learning models. The method is general and can be applied regardless of whether the models under consideration are fit to data using frequentist, Bayesian, or some other fitting method. The proposed method is motivated in two different ways and is shown empirically to perform better than or on a par with standard model averaging methods, including model stacking and model averaging that relies on Akaike-style negative exponentiated model weighting, especially when the sample size is small. Our theoretical analysis explains why the method has a small-sample advantage.
要約:
This paper analyzes identifiability and stability for the drifting field underlying distributional matching in the Generative Drifting framework of Deng et al. First, we introduce the class of companion-elliptic kernels, which includes the Laplace kernel and is characterized by a second-order elliptic coupling between each kernel $\kappa$ in this class and its companion function $\eta$. For each kernel in this class and each pair of Borel probability measures, we prove that the drifting field vanishes if and only if the two probability measures are equal. We further show that this class consists precisely of Gaussian kernels and Mat\'ern kernels with $\nu \ge 1/2$. Second, by constructing counterexamples, we exhibit sequences for which mass escapes to infinity while the field tends to zero; in particular, control of the field norm alone does not guarantee weak convergence. Nevertheless, we prove that the only possible mode of failure is confined to the one-dimensional ray $\{c\,p:0\le c\le 1\}$. Consequently, weak convergence can be restored by imposing an asymptotic lower bound on the intrinsic overlap scalar, a linear observable defined by the kernel and the target measure.
要約:
In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.
要約:
We consider debiased inference on least-squares solutions to inverse problems as a way to avoid having to assume exact solutions exist. Such assumptions are substantive and not innocuous and their failure may well imperil inference when we impose them on the statistical model. Our approach instead allows us to conduct inference on a quantity that is defined regardless of solutions existing and coincides with the usual estimands when they do. For the case of instrumental variables, this means we can motivate the analysis with structural models but these do not need to hold exactly for the inferential procedure to remain valid.
要約:
Optimal transport with quadratic cost provides a geometric framework for steering an ensemble, modeled by a probability law, with minimal effort. Yet ambient-space formulations become unwieldy in high dimensions, and sensing or actuation in practice often reveals only linear views of the state -- camera silhouettes, LiDAR beams, tomographic slices. We develop a sliced feedback controller for distribution steering: the evolving law is projected onto one-dimensional directions on the sphere, the optimal one-dimensional velocity is synthesized in each projection, and these velocities are averaged to produce a feedback control in the ambient space. The construction reduces to the Benamou--Brenier problem in one dimension. In addition, it is invariant under orthogonal transforms, nonexpansive under projections, and well posed on $\mathcal{P}_2(\mathbb{R}^n)$. Computation proceeds by sampling directions on the sphere and solving independent one-dimensional subproblems, yielding a scalable method aligned with partial observations. In the Gaussian setting, we show that the developed sliced controller steers the law to the prescribed target. Furthermore, we derive an identity relating the energy consumption incurred by the controller to the sliced Wasserstein distance.
要約:
We formalize a new inference problem: requirement-steered interface type inference. Given spatiotemporal observations of a physical system and functional requirements, the task is to infer what kind of interface must separate the system's interior from its environment for those requirements to be satisfiable. Unlike classical constrained design, which optimizes parameters within a pre-specified object type, here the type itself is unknown.
We cast the problem as constrained variational Bayesian inference over Markov blanket partitions and introduce Constrained Dynamic Markov Blanket Detection (C-DMBD). The algorithm extends DMBD by steering blanket discovery toward functional targets through Lagrange multipliers updated by dual ascent. These multipliers penalize violations computed from model-predicted blanket statistics, allowing requirements to shape both the inferred partition and the interface dynamics.
The framework yields three phenomena unavailable to classical design: intra-family navigation, where one interface type supports different functional modes; family transition, where changing requirements induce a discontinuous shift in interface type; and ontological disambiguation, where requirements resolve ambiguities left open by physical data alone. The converged multipliers form a certificate of functional effort, recording which physical properties the inferred interface resists satisfying.
Finally, we argue that the generative model family associated with a cup's Markov blanket belongs to the designer, not to the cup. The cup is inert; the model family is the designer's representation of the dynamics its surface can sustain. This yields a loop in which the designer encodes a cup-user model into the surface, the user reconstructs it through active inference, and physiological data reveal the gap between the designer's prior and the user's actual generative model.
要約:
History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose Score-Repellent Monte Carlo (SRMC) framework that summarizes trajectory history by a running average of score evaluations in $R^d$, where $d$ is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt, indexed with $\alpha$ that represents the strength of repellence in controlling the magnitude of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting $\pi$ can instead be run on the current surrogate $\pi_{\theta_n}$ while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as $\alpha$ increases, with scaling $O(1/\alpha)$, extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining $O(d)$ memory usage and modest per-iteration overhead.
要約:
Two-stage recommender systems first choose a candidate generator and then rank items within the generated set. Because the generator decides which items are available to the ranker, changing the generator changes both the policy value and the data support used to estimate that value. This creates an offline selection problem that standard single-stage objectives do not capture: a policy may look good under a retrieval score or a raw off-policy value estimate, but still be unreliable if it depends on weakly supported generator-item pairs. We propose CASP (Coupled Action-Set Pessimism), a support-aware offline selector for finite libraries of two-stage recommender policies. CASP combines doubly robust value estimation with a support-burden penalty. We show that stagewise rules that ignore downstream continuation value can be arbitrarily suboptimal, and we derive population, finite-class, and reconstructed-propensity guarantees for conservative selection. In simulations and a reconstructed MovieLens 1M application, CASP selects lower-burden policies when estimated value and support credibility are in tension.
要約:
We argue that current definitions of machine unlearning are underspecified for second-order optimizers. We compare first-order and second-order learners for their ability to handle the data deletion task with varying degrees of eigendecomposition to mimic the loss model memory. While both first and second-order methods realign with the ideal counterfactul in terms of performance and gradient, the second-order optimizer shows significant volatility in the optimizer state. This indicates residual information, supposedly deleted, that isn't detectable by first-order analysis. Various eigendecay treatments show that stability and information loss is regained only under controlled state pertubation where geometric information (or memory) is erased.
要約:
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
要約:
Perturbing a deterministic $n$-dimensional matrix with small Gaussian noise is a cornerstone of smoothed analysis of algorithms [Spielman and Teng, JACM 2004], as it reduces the condition number of the input to $O(n)$, and with it the complexity of many matrix algorithms. However, when deployed algorithmically, these perturbations are expensive due to the cost of generating and storing $n^2$ Gaussian random variables. We propose a perturbation that requires generating and storing $O(n)$ random numbers in $O(\log n)$ bits of precision, and reduces the condition number of any deterministic matrix to $O(n)$, matching Gaussian perturbations. Our result in particular implies a better complexity for the perturbed conjugate gradient algorithm, showing that we can solve an $n\times n$ linear system in linear space to within an arbitrarily small constant backward error using $O(n)$ matrix-vector products.
In our construction, we introduce the concept of a pattern matrix, which is a dense deterministic matrix that maps all sparse vectors into dense vectors, and we combine it with a sparse perturbation whose entries are dependent and located in a non-uniform fashion. In order to analyze this construction, we develop new techniques for lower bounding the smallest singular value of a random matrix with dependent entries.
要約:
Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.
要約:
Solutions to the Schr\"{o}dinger bridge problem and its generalizations yield feedback control policies for optimal density steering over a controlled diffusion. To numerically compute the same, the dynamic Sinkhorn recursion has become a standard approach. The mathematical engine behind this approach is the Hopf-Cole transform that recasts the conditions for optimality into a system of boundary-coupled linear PDEs. Recent works pointed out that for the control-affine Schr\"{o}dinger bridge problem, this exact linearity via Hopf-Cole transform, and thus the standard Sinkhorn recursion, apply only if the control and noise channels are proportional. When the channels do not match, the Hopf-Cole-transformed PDEs remain nonlinear, and no algorithm is available to solve the same. We advance the state-of-the-art by designing a Sinkhorn recursion with memory that leverages the structure of these nonlinear PDEs, and demonstrate how it solves the control-affine Schr\"{o}dinger bridge problem with input and noise channel mismatch. We prove the local stability of the proposed algorithm.
要約:
Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under \(3\times\) and \(6\times\) compression budgets. Our method outperforms both baselines, reducing FLOPs from \(8.21\,\mathrm{G}\) to \(1.55\,\mathrm{G}\) (\(81.1\%\) reduction), achieving a \(1.38\times\) inference speed-up, and increasing classification accuracy from \(87.76\%\) to \(89.80\%\). The method also improves the macro \(F_1\)-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.
要約:
Over decades, Markov chain Monte Carlo (MCMC) methods have been widely studied, with a typical application being the quantification of posterior uncertainties in Bayesian system identification of structural dynamic models. To address the issue of excessively low sampling efficiency in generic MCMC methods when applied to specific problems, researchers developed several MCMC algorithms that integrate trainable neural networks to replace and enhance their critical components. Later, meta-learning MCMC methods emerged to reduce training time. However, they require considerable similarity between test and training tasks, while their sampling efficiency is constrained by trade-off-simplified network designs. This paper proposes the Adaptive Principal-Component (PC) Meta-learning Stochastic Gradient Hamiltonian Monte Carlo (APM-SGHMC) algorithm. It adaptively rotates coordinate axes in the parameter space to align with the PC directions of the current posterior samples, ensuring rotation-invariance of sampling performance with respect to the posterior distribution. By incorporating translation-invariance, scale-invariance, and rotation-invariance in a unified framework, APM-SGHMC enables universal samplers to acquire generalizable knowledge across diverse Bayesian system identification tasks using minimalistic tasks while eliminating the constraints imposed by network design trade-offs on sampling efficiency. Practical feasibility issues are also addressed. Two Bayesian system identification case studies demonstrate its effectiveness and universality: our method overcomes the case-by-case limitations of traditional data-driven approaches, achieving zero-shot generalization across structurally distinct models without retraining and maintaining consistent superior performance across all scenarios.
要約:
Sequential latent-variable models with subject-specific random effects provide a flexible framework for modeling temporally structured data with both local latent dynamics and stable between-subject heterogeneity. In such models, conditional inference for the local latent process is often tractable, but integrating over subject-specific random effects can be computationally demanding. We propose an anchored variational inference framework for efficient approximate inference in this setting. The central idea is to replace the full conditional posterior of the local latent process with its evaluation at a representative value of the subject-specific latent effect, called the anchor point, thereby preserving tractable local inference while substantially reducing computational cost. This approximation is especially appealing in sequential settings, where the posterior distribution of the random effect becomes increasingly concentrated as the sequence length grows. Under suitable conditions, we show that the posterior mean is a nearly optimal anchor point and that the resulting anchored variational EM (AVEM) algorithm approximately preserves the local monotonicity behavior of standard variational inference. We instantiate the framework in two representative classes of sequential latent-variable models, namely mixed hidden Markov models and mixed-effects state-space models, derive the corresponding AVEM algorithms, and use simulation studies to indicate that the resulting methods achieve accurate estimation with substantial computational gains. We also discuss a partially anchored variant of the framework, in which only the components of the subject-specific latent effect whose posteriors are well concentrated are anchored.
要約:
Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid.
We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-\beta}$ and step-size exponent $\alpha \in (1/2, 1)$. We identify a stabilization-rate threshold $\beta > (\alpha+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $\beta > (\alpha+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $\beta \le (\alpha+1)/2$.
A single stabilization argument certifies three SA variants - SA-AdaGrad, SA-RMSProp, and SA-ONS - with gain $\rho_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$, yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs the pathwise rate $\beta = 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $\alpha^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.
要約:
Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.
要約:
A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding.
Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal.
A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.
要約:
AI/ML methods are increasingly used in economics to generate binary variables (or labels) via classification algorithms. When these generated variables are included as covariates in regressions, even small misclassification errors can induce large biases in OLS estimators and invalidate standard inference. We study whether the bootstrap can correct this bias and deliver valid inference. We first show that a seemingly natural fixed-label bootstrap, which generates data using estimated labels but relies on a corrupted version in estimation, is generally invalid unless a strong independence condition between the latent true labels and other covariates holds. We then propose a coupled-label bootstrap that jointly resamples the true and imputed labels, and show it is valid without this condition. Two finite-sample adjustments further improve coverage: a variance correction for uncertainty in estimated misclassification rates and a Hessian rotation for near-singular designs. We illustrate the methods in simulations and apply them to investigate the relationship between wages and remote work status.
要約:
A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.
要約:
Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments, which assume how data distributions change, including single-node interventions, coupled interventions, or hard interventions, or parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of the results, they are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results match or improve upon many existing works, but require less restrictive assumptions about changing environments.
要約:
Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.
要約:
Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic-test-on-real performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record diagnostics to monitor covariate synthesis and separately learned nuisance models to construct (W, A, Y) triplets. We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps by improving conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of OR, IPW, AIPW, and TMLE under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.
要約:
Learning low-dimensional representations from multi-view relational data is challenging when underlying geometries differ across views. We propose Bary-GWMDS, a Gromov-Wasserstein-based method that operates directly on distance matrices to learn a consensus embedding preserving shared relational structure. By leveraging intrinsic distances, the approach naturally handles nonlinear distortions across views. We also introduce Mean-GWMDS-C, a clustering-oriented formulation that averages distance matrices and learns reduced-support representations via a consensus Gromov-Wasserstein transport. Experiments on synthetic and real-world datasets show that the proposed framework yields stable and geometrically meaningful embeddings.
要約:
When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further consideration. A central question for selecting subdata of size $n$ from $N$ available data points is which $n$ points to select. While an answer to this question depends on the objective, one approach for a parametric model and a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of $N$ and $n$, and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology facilitates an assessment of the efficiency of subdata selected by any method by obtaining tight lower and upper bounds for the efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.
要約:
Accurate time series forecasting in scientific domains such as climate modeling, physiological monitoring, and energy systems benefits from both competitive predictions and model transparency. This work proposes DecompKAN, a lightweight attention-free architecture that combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. Each KAN edge learns an explicit, inspectable 1D scalar function over learned patch-embedding coordinates that can be directly visualized. On standard benchmarks, DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected published baselines, and achieves best or tied-best MSE on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets including the physiological PPG-DaLiA benchmark. The architecture shows particular strength on datasets with smooth temporal dynamics (Solar -17%, ECL -10% vs. iTransformer, Weather) and physiological time series. Visualization of learned edge functions reveals qualitatively different latent nonlinearities across domains. Ablation analysis shows that the architectural pipeline (decomposition, patching, normalization) drives performance more than the choice of nonlinear layer, while the KAN formulation enables inspection of learned latent transformations.
要約:
In this paper we study continuum-marginal optimal transport. Given a time-continuous family of probability marginals, the problem is to recover the minimum-energy velocity field whose flow reproduces every marginal. This problem is the continuum limit of the classical two-marginal Benamou--Brenier formulation, and also the deterministic limit of the Nelson problem of stochastic optimal transport. We propose a practical mesh-free solver for this problem. The weak continuity equation is embedded in a reproducing kernel Hilbert space, yielding a sample-only objective that requires no spatial discretization. The velocity is parametrized by any linear-in-parameters dictionary or neural network, and is optimized by mini-batch stochastic methods. Synthetic experiments confirm that the method achieves accurate drift recovery and marginal consistency. The same computational framework also applies to the stochastic Nelson problem.
要約:
We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011a) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.
要約:
We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
要約:
Machine-learning interatomic potentials (MLIPs) have enabled molecular dynamics at near ab initio accuracy, yet remain limited to energies and forces by construction, leaving electronic observables such as dipole moments and polarizabilities inaccessible. We introduce DenSNet, a density-first approach to machine-learned electronic structure that learns the Hohenberg--Kohn map from nuclear configurations to the ground-state electron density. Our approach employs an SE(3)-equivariant neural network to predict density coefficients of a flexible atom-centered Gaussian basis, combined with a $\Delta$-learning strategy that uses superposed atomic densities as a prior to accelerate training. A second equivariant network then maps the predicted density to the total energy, providing a unified framework for molecular dynamics and electronic structure. We validate DenSNet on ethanol, ethanethiol, and resorcinol, where infrared spectra from machine-learned trajectories show excellent agreement with experimental gas-phase measurements. To test scalability, we train on polythiophene oligomers with 1--6 monomers and extrapolate to chains of up to 12 monomers, generating stable long-time trajectories whose infrared spectra agree with reference density functional theory calculations. Here, we show that reinstating the electron density as the central learned quantity opens a practical route to transferable prediction of spectroscopic and electronic observables in large-scale molecular simulations.
要約:
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem.
We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings.
On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.
要約:
While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of $\sqrt{\text{DS}}$ has persisted between the upper and lower bounds on sample complexity.
Recent work by Hanneke et al. (2026) shows a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building up on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning.
要約:
Anomaly localization in images -- identifying regions that deviate from normal patterns -- is vital in applications such as medical diagnosis and industrial inspection. A recent trend is the use of image generation models in anomaly localization, where these models generate normal-looking counterparts of anomalous images, thereby allowing flexible and adaptive anomaly localization. However, these methods inherit the uncertainty and bias implicitly embedded in the employed generative model, raising concerns about the reliability. To address this, we propose a statistical framework based on selective inference to quantify the significance of detected anomalous regions. Our method provides $p$-values to assess the false positive detection rates, providing a principled measure of reliability. As a proof of concept, we consider anomaly localization using a diffusion model and its applications to medical diagnoses and industrial inspections. The results indicate that the proposed method effectively controls the risk of false positive detection, supporting its use in high-stakes decision-making tasks.
要約:
We consider a class of statistical inverse problems involving the estimation of a regression operator from a Polish space to a separable Hilbert space, where the target lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel. To address the associated ill-posedness, we analyze regularized stochastic gradient descent (SGD) algorithms in both online and finite-horizon settings. The former uses polynomially decaying step sizes and regularization parameters, while the latter adopts fixed values. Under suitable structural and distributional assumptions, we establish dimension-independent bounds for prediction and estimation errors. The resulting convergence rates are near-optimal in expectation, and we also derive high-probability estimates that imply almost sure convergence. Our analysis introduces a general technique for obtaining high-probability guarantees in infinite-dimensional settings. We illustrate the practical scope of our framework with applications to structured prediction and parametric PDEs, providing examples that reflect how the approach can be applied in practice.
要約:
Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several approaches proposed to improve error rates. Our work considers the popular high-dimensional regime with random data, where the number of training samples $n$ and the input dimension $d$ grow at a proportional rate $d / n \to \gamma$, and it studies a family of one-pass DP gradient descent (DP-GD) algorithms satisfying $\rho^2 / 2$ zero concentrated DP. In this setting, we establish a deterministic equivalent for the DP-GD trajectory in terms of a system of ordinary differential equations. This allows to analyze the effect of gradient clipping constants that are smaller than the typical norm of the per-sample gradients - a setup shown to improve performance in practice. For well-conditioned data, we show that DP-GD, upon properly choosing clipping constant and learning rate, achieves the non-asymptotic risk of $O(\gamma + \gamma^2 / \rho^2)$, and we establish that this rate is minimax optimal. Then, we consider the ill-conditioned case where the data covariance spectrum follows a power-law distribution, and we show that the risk displays a power-like scaling law in $\gamma$, highlighting the change in the exponent as a function of the privacy parameter $\rho$. Overall, our analysis demonstrates the benefits of practical algorithmic design choices, including aggressive gradient clipping and decaying learning rate schedules.
要約:
In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the property of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.
要約:
Longitudinal voice biomarkers provide a non-invasive source of information for monitoring Parkinson's disease progression, but their statistical analysis is difficult because repeated measurements from the same subject are correlated, clinical cohorts are often small, and disease trajectories can vary substantially across individuals. This study evaluates statistical and neural mixed-effects approaches for modeling Parkinson's disease progression from telemonitoring voice data. Using the Oxford Parkinson's telemonitoring dataset (N=42), we compare Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) under the same longitudinal prediction setting. The results show that neural mixed-effects models provide flexible nonlinear representations but can overfit severely in this small-sample setting, whereas GAMMs achieve stronger predictive performance and retain interpretable smooth effects and subject-level structure. In particular, the GAMM-based approach attains the lowest prediction error (MSE 6.56), while the neural baselines have substantially larger errors (MSE > 90). These findings support the use of interpretable statistical mixed-effects models for small longitudinal telemonitoring studies and suggest that larger and more diverse cohorts are needed before highly flexible neural mixed-effects models can be reliably assessed in this application.
要約:
Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding & design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori.
Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and `multimodal' product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities.
要約:
We propose a flexible deep neural network (DNN) framework for modeling survival data within a partially linear regression structure. The approach preserves interpretability through a parametric linear component for covariates of primary interest, while a nonparametric DNN component captures complex time-covariate interactions among nuisance variables. We refer to the method as FLEXI-Haz, a FLEXIble Hazard model with a partially linear structure. In contrast to existing DNN approaches for partially linear Cox models, FLEXI-Haz does not rely on the proportional hazards assumption. We establish theoretical guarantees: the neural network component attains minimax-optimal convergence rates over composite H\"older classes, the linear estimator is sqrt-n-consistent, asymptotically normal, and semiparametrically efficient, and we develop a cross-fitted one-step estimator of the cumulative hazard and survival function for a new subject, together with pointwise asymptotic confidence intervals. To the best of our knowledge, this is the first frequentist asymptotic pointwise inference result for a survival function in a DNN survival model, with or without a linear component. Simulations and real-data analyses demonstrate the utility of FLEXI-Haz as a principled and interpretable alternative to methods based on proportional hazards.
要約:
We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp \Theta(d^{\ell_0}/\eps)$ with high probability, in contrast with the representative sample complexity $\Theta\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $\Theta(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\ell_0$, from $L \ge \ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.
要約:
We present the first algorithms for generalized linear contextual bandits under shuffle differential privacy and joint differential privacy. While prior work on private contextual bandits has been restricted to linear reward models -- which admit closed-form estimators -- generalized linear models (GLMs) pose fundamental new challenges: no closed-form estimator exists, requiring private convex optimization; privacy must be tracked across multiple evolving design matrices; and optimization error must be explicitly incorporated into regret analysis.
We address these challenges under two privacy models and context settings. For stochastic contexts, we design a shuffle-DP algorithm achieving $\tilde{O}(d^{3/2}\sqrt{T \log T}/\sqrt{\varepsilon})$ regret in dominant term, differing from the non-private rate by a factor of $\sqrt{d/\varepsilon}$. For adversarial contexts, we provide a joint-DP algorithm with regret $\tilde{O}\!\big(d\sqrt{T} \log T + d^{3/4}\sqrt{T/\varepsilon}\,(\log T)\,(d + \log T)^{1/4}\big)$ -- matching the non-private rate $\tilde{O}(d\sqrt{T} \log T)$ in the leading term, with privacy contributing only an additive correction. Unlike prior work on locally private GLM bandits, our methods require no spectral assumptions on the context distribution beyond $\ell_2$ boundedness.
要約:
Building local surrogates to accelerate stationary point searches on potential energy surfaces spans decades of effort. Done correctly, surrogates can reduce the number of expensive electronic structure evaluations by roughly an order of magnitude while preserving the accuracy of the underlying theory, with the gain depending on oracle cost, search distance, and the availability of analytical forces. We present a unified Bayesian optimization view of minimization, single-point saddle searches, and double-ended path searches: all three share one six-step surrogate loop and differ only in the inner optimization target and the acquisition criterion. The framework uses Gaussian process regression with derivative observations, inverse-distance kernels, and active learning, and we develop optional extensions for production use, including farthest-point sampling with the Earth Mover's Distance, MAP regularization, an adaptive trust radius, and random Fourier features for scaling. Accompanying pedagogical Rust code demonstrates that all three applications use the same Bayesian optimization loop, bridging the gap between theoretical formulation and practical execution.
要約:
Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity presents. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.
要約:
We introduce a Banach space-valued extension of random feature learning, a data-driven supervised machine learning technique for large-scale kernel approximation. By randomly initializing the feature maps, only the linear readout needs to be trained, which reduces the computational complexity substantially. Viewing random feature models as Banach space-valued random variables, we prove a universal approximation result in the corresponding Bochner space. Moreover, we derive approximation rates and an explicit algorithm to learn an element of the given Banach space by such models. The framework of this paper includes random trigonometric/Fourier regression and in particular random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized, whence only the linear readout needs to be trained. For the latter, we can then lift the universal approximation property of deterministic neural networks to random neural networks, even within function spaces over non-compact domains, e.g., weighted spaces, $L^p$-spaces, and (weighted) Sobolev spaces, where the latter includes the approximation of the (weak) derivatives. In addition, we analyze when the training costs for approximating a given function grow polynomially in both the input/output dimension and the reciprocal of a pre-specified tolerated approximation error. Furthermore, we demonstrate in a numerical example the empirical advantages of random feature models over their deterministic counterparts.
要約:
We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting. Strikingly little is known about this problem. For instance, it is not even known if there is a polynomial-time algorithm for learning the intersection of only two halfspaces. On the other hand, lower bounds based on well-established assumptions (such as approximating worst-case lattice problems or variants of Feige's 3SAT hypothesis) are only known (or are implied by existing results) for the intersection of super-logarithmically many halfspaces [KS09,KS06,DSS16]. With intersections of fewer halfspaces being only ruled out under less standard assumptions [DV21] (such as the existence of local pseudo-random generators with large stretch). We significantly narrow this gap by showing that even learning $\omega(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time under standard assumptions on worst-case lattice problems (namely that SVP and SIVP are hard to approximate within polynomial factors). Further, we give unconditional hardness results in the statistical query framework. Specifically, we show that for any $k$ (even constant), learning $k$ halfspaces in dimension $N$ requires accuracy $N^{-\Omega(k)}$, or exponentially many queries -- in particular ruling out SQ algorithms with polynomial accuracy for $\omega(1)$ halfspaces. To the best of our knowledge this is the first unconditional hardness result for learning a super-constant number of halfspaces.
Our lower bounds are obtained in a unified way via a novel connection we make between intersections of halfspaces and the so-called parallel pancakes distribution [DKS17,BLPR19,BRST21] that has been at the heart of many lower bound constructions in (robust) high-dimensional statistics in the past few years.
要約:
Machine learning has become increasingly popular in informing data-driven policy-making. Policies influence behavior in individuals or populations, and ideally, through observational signals, policy-makers learn which policies are effective. However, in many settings, individual actions cannot be perfectly observed. This issue, known in economics as moral hazard, poses a significant challenge. In this work, we study the foundational multitasking principal-agent contract design problem and demonstrate how instrumental regression and the generalized method of moments (GMM) estimator can be used to estimate or learn a good contract. As a bonus result, we also give a uniformity characterization of the shape of the optimal contract.
要約:
AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on the Maximum Independent Set (MIS) problem. Strikingly, even on in-distribution random graphs, leading AI-inspired methods are consistently outperformed by the state-of-the-art classical solver KaMIS running on a single CPU, and some AI-inspired methods frequently fail to surpass even the simplest degree-based greedy heuristic. Even with post-processing techniques like local search, AI-inspired methods still perform worse than CPU-based solvers. To better understand the source of these failures, we introduce a novel analysis, serialization, which reveals that non-backtracking AI-inspired methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy, and thus worse than KaMIS. More generally, our findings suggest a need for a rethinking of current approaches in AI for CO, advocating for more rigorous benchmarking and the principled integration of classical heuristics. Additionally, we also find that CPU-based algorithm KaMIS have strong performance on sparse random graphs, which appears to show that the shattering threshold conjecture for large independent sets proposed by Coja-Oghlan & Efthymiou (2015) does not apply for real-life sizes (such as 10^6 nodes).
要約:
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate both the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the base model $\pi_B$, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$, while reducing end-to-end latency by up to $28\%$. The code is available at https://github.com/j-geuter/GSI .
要約:
Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data-hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this trade-off is often unnecessary, particularly when modeling fully or partially translation-invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high-dimensional fixed effects, and (5) scale gracefully, running inference on over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is provided as part of the `dl4bi` package.
要約:
Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.
要約:
We study neural architectures in which each hidden layer is defined by the stationary state of a dissipative Schr\"odinger-type dynamics on a learned latent graph. On stable branches, the local stationary problem defines a differentiable implicit graph layer. To learn the graph itself, we optimize over the stratified moduli space of weighted graphs and equip each stratum with a non-degenerate K\"ahler-Hessian metric that keeps natural-gradient descent and face crossing well posed. We then show that a multilayer stationary network is equivalent to an exact global stationary problem on a supra-graph, and that it admits a penalized global relaxation whose stationary states converge to the exact one as the penalty parameter tends to infinity. Reverse-mode differentiation is recovered as the adjoint of the exact global system, and the penalized adjoint converges to it in the same limit. Finally, under finite-dimensional strong-monotonicity and admissible-lift assumptions, the corresponding represented hypothesis classes coincide among resolvent feed-forward networks, graph-stationary networks, supra-graph stationary systems, and sheaf-based architectures with unitary connection. The resulting structural identifications yield complexity bounds controlled by sparse graph or supra-graph geometry rather than dense ambient connectivity.
要約:
Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
要約:
Marketing Mix Modeling (MMM) estimates the impact of marketing activities on business outcomes such as sales or revenue. Traditional MMM approaches rely on linear regression or Bayesian hierarchical models that assume channel independence and struggle to capture temporal dynamics and non-linear saturation. DeepCausalMMM addresses these limitations by combining deep learning, causal inference, and marketing science. It uses Gated Recurrent Units (GRUs) to learn temporal patterns (adstock, lag) while learning statistical dependencies between channels through Directed Acyclic Graph (DAG) structure with upper triangular constraints. It implements Hill equation saturation curves for diminishing returns and budget optimization. Key features: (1) data-driven hyperparameters learned from data with defaults, (2) linear mean scaling of the dependent variable, (3) configurable attribution priors with dynamic loss scaling, (4) multi-region modeling with shared and region-specific parameters, (5) robust methods including Huber loss, (6) response curve analysis.
要約:
We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment-covariate dependence without adversarial training. Starting from a bound that links the counterfactual-factual risk gap to mutual information, we learn a stochastic representation Z that is predictive of outcomes while minimizing I(Z; T). We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.
要約:
Distributed Fiber Optic Sensing (DFOS) is promising for long-range perimeter security, yet practical deployment faces three key obstacles: severe cross-deployment domain shift, scarce or unavailable labels at new sites, and limited within-class coverage even in source deployments. We propose DUPLE, a prototype-based meta-learning framework tailored for cross-deployment DFOS recognition. The core idea is to jointly exploit complementary time- and frequency-domain cues and adapt class representations to sample-specific statistics: (i) a dual-domain learner constructs multi-prototype class representations to cover intra-class heterogeneity; (ii) a lightweight statistical guidance mechanism estimates the reliability of each domain from raw signal statistics; and (iii) a query-adaptive aggregation strategy selects and combines the most relevant prototypes for each query. Extensive experiments on two real-world cross-deployment benchmarks demonstrate consistent improvements over strong deep learning and meta-learning baselines, achieving more accurate and stable recognition under label-scarce target deployments.
要約:
Bayesian optimization is widely used for optimizing expensive black box functions, but most existing approaches focus on scalar responses. In many scientific and engineering settings the response is functional, varying smoothly over an index such as time or wavelength, which makes classical formulations inadequate. Existing methods often minimize integrated error, which captures average performance but neglects worst case deviations. To address this limitation we propose min-max Functional Bayesian Optimization (MM-FBO), a framework that directly minimizes the maximum error across the functional domain. Functional responses are represented using functional principal component analysis, and Gaussian process surrogates are constructed for the principal component scores. Building on this representation, MM-FBO introduces an integrated uncertainty acquisition function that balances exploitation of worst case expected error with exploration across the functional domain. We provide two theoretical guarantees: a discretization bound for the worst case objective, and a consistency result showing that as the surrogate becomes accurate and uncertainty vanishes, the acquisition converges to the true min-max objective. We validate the method through experiments on synthetic benchmarks and physics inspired case studies involving electromagnetic scattering by metaphotonic devices and vapor phase infiltration. Results show that MM-FBO consistently outperforms existing baselines and highlights the importance of explicitly modeling functional uncertainty in Bayesian optimization.
要約:
We present algorithms for diffusion model sampling which obtain $\delta$-error in $\mathrm{polylog}(1/\delta)$ steps, given access to $\widetilde O(\delta)$-accurate score estimates in $L^2$. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is $\widetilde O(d_\star \mathrm{polylog}(1/\delta))$ where $d_\star$ is the intrinsic dimension of the data. Further, under a non-uniform $L$-Lipschitz condition, the complexity reduces to $\widetilde O(L \mathrm{polylog}(1/\delta))$. Our approach also yields the first $\mathrm{polylog}(1/\delta)$ complexity sampler for general log-concave distributions using only gradient evaluations.
要約:
Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.
要約:
We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training error, test error, and generalization gap beyond the mean kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive loop corrections to the training error, test error, and generalization gap, obtain their scaling laws, and support the theory with
要約:
Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including the computing convergence and the statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.