Training Mamba Architectures With Approximate Second-Order Methods

Issue Date

2025-10

Language

en

Abstract

Transformer architectures dominate sequence modeling and underpin foundation models, but their memory and compute scale quadratically with sequence length during training, and their inference cost per token grows linearly with context length, making them impractical for long-context modalities such as audio and genomics. The Mamba architecture (Gu and Dao, 2024), based on state space models, scales linearly with sequence length during training and offers constant per-token inference cost, enabling modeling of extremely long contexts. While Mamba is already more inference-efficient than Transformers, training efficiency can also be improved through faster convergence. Adaptive first-order optimizers like Adam (Kingma and Ba, 2017) are widely used but capture the curvature of the loss landscape only coarsely. Second-order methods such as Newton’s method or natural gradient descent (Amari, 1998) can converge in fewer steps but are intractable for large networks. Two recent approaches approximate natural gradient descent: decorrelated backpropagation (DBP), which adds decorrelation layers optimized alongside the main objective (Ahmad, 2024; Dalm et al., 2024a), and SOAP (Vyas et al., 2025), which applies updates in a curvature-informed eigenbasis. In our experiments with Mamba on autoregressive DNA modeling, DBP and SOAP matched Adam’s performance while reducing training time by up to ∼30%, depending on the performance level at which the optimizers are compared. In our experiments with autoregressive audio modeling, SOAP produced better models than Adam and reduced training time by up to ∼40%, again depending on the performance level compared, while DBP converged more slowly than Adam and produced slightly inferior models. Finally, we trained Mamba with both SOAP and DBP on Induction Heads and Selective Copying, two tasks that state space models prior to Mamba were unable to solve well. We found impaired zero-shot generalization on Induction Heads when using DBP, suggesting that feature decorrelation might be detrimental in a recurrent context.
We analyzed the computational cost of the optimizer update steps and found that although DBP and SOAP required orders of magnitude more operations than Adam, this overhead was negligible compared to the operations required by Mamba’s forward and backward passes. In memory profiling, DBP increased peak reserved and allocated memory during training by ∼30-35% relative to Adam, while SOAP’s peak memory overhead was negligible.
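
For reference, the update rules the abstract contrasts can be written in their standard textbook form. This is a sketch for orientation, not a formulation taken from the thesis. The symbols are assumed notation: parameters θ, learning rate η, loss L, Hessian H, Fisher information matrix F, and model distribution p_θ.

```latex
% Plain gradient descent (first-order)
\[
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)
\]
% Newton's method: precondition the gradient with the inverse Hessian
\[
\theta_{t+1} = \theta_t - \eta \, H^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\qquad H = \nabla_\theta^2 \mathcal{L}(\theta_t)
\]
% Natural gradient descent: precondition with the inverse Fisher information matrix
\[
\theta_{t+1} = \theta_t - \eta \, F^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\qquad F = \mathbb{E}\!\left[ \nabla_\theta \log p_\theta(y \mid x) \,
\nabla_\theta \log p_\theta(y \mid x)^{\top} \right]
\]
```

For a model with n parameters, H and F are n × n matrices. Storing them densely takes O(n^2) memory and inverting them directly takes O(n^3) time, which is why exact second-order updates are intractable for large networks and approximations such as DBP and SOAP are used instead.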

Faculty

Faculteit der Sociale Wetenschappen