要約:
Hamiltonian Monte Carlo (HMC) algorithms are among the most widely used sampling methods in high dimensional settings, yet their convergence properties are poorly understood in divergences that quantify relative density mismatch, such as Kullback-Leibler (KL) and R\'enyi divergences. These divergences naturally govern acceptance probabilities and warm-start requirements for Metropolis-adjusted Markov chains. In this work, we develop a framework for upgrading Wasserstein convergence guarantees for unadjusted Hamiltonian Monte Carlo (uHMC) to guarantees in tail-sensitive KL and R\'enyi divergences. Our approach is based on one-shot couplings, which we use to establish a regularization property of the uHMC transition kernel. This regularization allows Wasserstein-2 mixing-time and asymptotic bias bounds to be lifted to KL divergence, and analogous Orlicz-Wasserstein bounds to be lifted to R\'enyi divergence, paralleling earlier work of Bou-Rabee and Eberle (2023) that upgrade Wasserstein-1 bounds to total variation distance via kernel smoothing. As a consequence, our results provide quantitative control of relative density mismatch, clarify the role of discretization bias in strong divergences, and yield principled guarantees relevant both for unadjusted sampling and for generating warm starts for Metropolis-adjusted Markov chains.
要約:
Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior's adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
要約:
Multilabel Classification (MLC) deals with the simultaneous classification of multiple binary labels. The task is challenging because, not only may there be arbitrarily different and complex relationships between predictor variables and each label, but associations among labels may exist even after accounting for effects of predictor variables. In this paper, we present a Bayesian additive regression tree (BART) framework to model the problem. BART is a nonparametric and flexible model structure capable of uncovering complex relationships within the data. Our adaptation, MLCBART, assumes that labels arise from thresholding an underlying numeric scale, where a multivariate normal model allows explicit estimation of the correlation structure among labels. This enables the discovery of complicated relationships in various forms and improves MLC predictive performance. Our Bayesian framework not only enables uncertainty quantification for each predicted label, but our MCMC draws produce an estimated conditional probability distribution of label combinations for any predictor values. Simulation experiments demonstrate the effectiveness of the proposed model by comparing its performance with a set of models, including the oracle model with the correct functional form. Results show that our model predicts vectors of labels more accurately than other contenders and its performance is close to the oracle model. An example highlights how the method's ability to produce measures of uncertainty on predictions provides nuanced understanding of classification results.
要約:
This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm SCaLE that provably achieves a distribution-agnostic sub-linear dynamic regret, without the knowledge of hitting cost structure. En-route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments, against online-learning baselines, corroborate our claims, and highlight statistical consistency of our algorithm.
要約:
We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on the synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.
要約:
Analysis of learned representations has a blind spot: it focuses on $similarity$, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce $geometric$ $stability$, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present $Shesha$, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated ($\rho \approx 0.01$) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2$\times$ more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability ($\rho = 0.89$-$0.96$); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying $how$ $reliably$ systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
要約:
Kernel-based methods are used in the context of Granger Causality to enable the identification of nonlinear causal relationships between time series variables. In this paper, we show that two state of the art kernel-based Granger Causality (GC) approaches can be theoretically unified under the framework of Kernel Principal Component Regression (KPCR), and introduce a method based on this unification, demonstrating that this approach can improve causal identification. Additionally, we introduce a Gaussian Process score-based model with Smooth Information Criterion penalisation on the marginal likelihood, and demonstrate improved performance over existing state of the art time-series nonlinear causal discovery methods. Furthermore, we propose a contemporaneous causal identification algorithm, fully based on GC, using the proposed score-based $GP_{SIC}$ method, and compare its performance to a state of the art contemporaneous time series causal discovery algorithm.
要約:
The graphical Lasso (GLASSO) is a widely used algorithm for learning high-dimensional undirected Gaussian graphical models (GGM). Given i.i.d. observations from a multivariate normal distribution, GLASSO estimates the precision matrix by maximizing the log-likelihood with an \ell_1-penalty on the off-diagonal entries. However, selecting an optimal regularization parameter \lambda in this unsupervised setting remains a significant challenge. A well-known issue is that existing methods, such as out-of-sample likelihood maximization, select a single global \lambda and do not account for heterogeneity in variable scaling or partial variances. Standardizing the data to unit variances, although a common workaround, has been shown to negatively affect graph recovery. Addressing the problem of nodewise adaptive tuning in graph estimation is crucial for applications like computational neuroscience, where brain networks are constructed from highly heterogeneous, region-specific fMRI data.
In this work, we develop Locally Adaptive Regularization for Graph Estimation (LARGE), an approach to adaptively learn nodewise tuning parameters to improve graph estimation and selection. In each block coordinate descent step of GLASSO, we augment the nodewise Lasso regression to jointly estimate the regression coefficients and error variance, which in turn guides the adaptive learning of nodewise penalties. In simulations, LARGE consistently outperforms benchmark methods in graph recovery, demonstrates greater stability across replications, and achieves the best estimation accuracy in the most difficult simulation settings. We demonstrate the practical utility of our method by estimating brain functional connectivity from a real fMRI data set.
要約:
Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.
要約:
This article presents an input convex neural network architecture using Kolmogorov-Arnold networks (ICKAN). Two specific networks are presented: the first is based on a low-order, linear-by-part, representation of functions, and a universal approximation theorem is provided. The second is based on cubic splines, for which only numerical results support convergence. We demonstrate on simple tests that these networks perform competitively with classical input convex neural networks (ICNNs). In a second part, we use the networks to solve some optimal transport problems needing a convex approximation of functions and demonstrate their effectiveness. Comparisons with ICNNs show that cubic ICKANs produce results similar to those of classical ICNNs.
要約:
Neural simulation-based inference (SBI) is a popular set of methods for Bayesian inference when models are only available in the form of a simulator. These methods are widely used in the sciences and engineering, where writing down a likelihood can be significantly more challenging than constructing a simulator. However, the performance of neural SBI can suffer when simulators are computationally expensive, thereby limiting the number of simulations that can be performed. In this paper, we propose a novel approach to neural SBI which leverages multilevel Monte Carlo techniques for settings where several simulators of varying cost and fidelity are available. We demonstrate through both theoretical analysis and extensive experiments that our method can significantly enhance the accuracy of SBI methods given a fixed computational budget.
要約:
We consider fully connected and feedforward deep neural networks with dependent and possibly heavy-tailed weights, as introduced in Lee et al. 2023, to address limitations of the standard Gaussian prior. It has been proved in Lee et al. 2023 that, as the number of nodes in the hidden layers grows large, according to a sequential and ordered limit, the law of the output converges weakly to a Gaussian mixture. Among our results, we present sufficient conditions on the model parameters (the activation function and the associated L\'evy measures) which ensure that the sequential limit is independent of the order. Next, we study the neural network through the lens of the posterior distribution with a Gaussian likelihood. If the random covariance matrix of the infinite-width limit is positive definite under the prior, we identify the posterior distribution of the output in the wide-width limit according to a sequential regime. Remarkably, we provide mild sufficient conditions to ensure the aforementioned invertibility of the random covariance matrix under the prior, thereby extending the results in L. Carvalho et al. 2025. We illustrate our findings using numerical simulations.
著者: James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee
要約:
Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to "inverse problems" to directly predict hidden parameters from observed data along with measures of uncertainty. While these predictive or posterior-based methods can handle intractable likelihoods and large-scale studies, they can also produce biased or overconfident conclusions even without model misspecifications. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated posterior probability distributions into (locally valid) confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB's effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.
要約:
Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
要約:
We consider inferring the causal effect of a treatment (intervention) on an outcome of interest in situations where there is potentially an unobserved confounder influencing both the treatment and the outcome. This is achievable by assuming access to two separate sets of control (proxy) measurements associated with treatment and outcomes, which are used to estimate treatment effects through a function termed the em causal bridge (CB). We present a new theoretical perspective, associated assumptions for when estimating treatment effects with the CB is feasible, and a bound on the average error of the treatment effect when the CB assumptions are violated. From this new perspective, we then demonstrate how coupling the CB with an autoencoder architecture allows for the sharing of statistical strength between observed quantities (proxies, treatment, and outcomes), thus improving the quality of the CB estimates. Experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach in relation to the state-of-the-art methodology for proxy measurements.
要約:
Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous subspace, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
要約:
Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(1/N)$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.
要約:
Contrastive learning has shown to be effective to learn representations from time series in a self-supervised way. However, contrasting similar time series instances or values from adjacent timestamps within a time series leads to ignore their inherent correlations, which results in deteriorating the quality of learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.
要約:
Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows to focus the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
要約:
In an era of increasing computational capabilities and growing environmental consciousness, organizations face a critical challenge in balancing the accuracy of forecasting models with computational efficiency and sustainability. Global forecasting models, lowering the computational time, have gained significant attention over the years. However, the common practice of retraining these models with new observations raises important questions about the costs of forecasting. Using ten different machine learning and deep learning models, we analyzed various retraining scenarios, ranging from continuous updates to no retraining at all, across two large retail demand datasets. We showed that less frequent retraining strategies maintain the forecast accuracy while reducing the computational costs, providing a more sustainable approach to large-scale forecasting. We also found that machine learning models are a marginally better choice to reduce the costs of forecasting when coupled with less frequent model retraining strategies as the frequency of the data increases. Our findings challenge the conventional belief that frequent retraining is essential for maintaining forecasting accuracy. Instead, periodic retraining offers a good balance between predictive performance and efficiency, both in the case of point and probabilistic forecasting. These insights provide actionable guidelines for organizations seeking to optimize forecasting pipelines while reducing costs and energy consumption.
要約:
Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering potential algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, standard BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to one-step self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step-sizes or multi-step self-consistency losses. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.
要約:
In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker-Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
要約:
Machine learning systems increasingly face requirements to remove entire domains of information--such as toxic language or biases--rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain's statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.
要約:
Classical joint modeling approaches often rely on competing risks or recurrent event formulations to describe complex processes involving evolving longitudinal biomarkers and discrete event occurrences, but these frameworks typically capture only limited aspects of the underlying event dynamics. We propose a general multi-state joint modeling framework that unifies longitudinal biomarker dynamics with multi-state time-to-event processes defined on arbitrary directed graphs. The proposed framework accommodates arbitrary directed transition graphs, nonlinear longitudinal submodels, and scalable inference via stochastic gradient descent. This formulation encompasses both Markovian and semi-Markovian transition structures, allowing recurrent cycles and terminal absorptions to be naturally represented. The longitudinal and event processes are linked through shared latent structures within nonlinear mixed-effects models, extending classical joint modeling formulations. We derive the complete likelihood, establish conditions for identifiability, and develop scalable inference procedures based on stochastic gradient descent to enable high-dimensional and large-scale applications. In addition, we formulate a dynamic prediction framework that provides individualized state-transition probabilities and personalized risk assessments along complex event trajectories. Through simulation and application to the PAQUID cohort, we demonstrate accurate parameter recovery and individualized prediction.
要約:
Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space $\{\boldsymbol{y}_t\}$, evolving in parallel with the data-space diffusion $\{\boldsymbol{x}_t\}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{y}_{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.
要約:
Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
要約:
Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.
要約:
The Rashomon effect -- the existence of multiple, distinct models that achieve nearly equivalent predictive performance -- has emerged as a fundamental phenomenon in modern machine learning and statistics. In this paper, we explore the causes underlying the Rashomon effect, organizing them into three categories: statistical sources arising from finite samples and noise in the data-generating process; structural sources arising from non-convexity of optimization objectives and unobserved variables that create fundamental non-identifiability; and procedural sources arising from limitations of optimization algorithms and deliberate restrictions to suboptimal model classes. We synthesize insights from machine learning, statistics, and optimization literature to provide a unified framework for understanding why the multiplicity of good models arises. A key distinction emerges: statistical multiplicity diminishes with more data, structural multiplicity persists asymptotically and cannot be resolved without different data or additional assumptions, and procedural multiplicity reflects choices made by practitioners. Beyond characterizing causes, we discuss both the challenges and opportunities presented by the Rashomon effect, including implications for inference, interpretability, fairness, and decision-making under uncertainty.