Training Mamba Architectures With Approximate Second-Order Methods
Issue Date
2025-10
Language
en
Abstract
Transformer architectures dominate sequence modeling and underpin foundation models, but their memory
and compute scale quadratically with sequence length during training, and inference costs grow linearly
per token, making them impractical for long-context modalities such as audio and genomics. The
Mamba architecture (Gu and Dao, 2024), based on state space models, scales linearly with sequence
length during training and offers constant per-token inference cost, enabling modeling of extremely long
contexts. While Mamba is already more inference-efficient than Transformers, training efficiency can also be
improved through faster convergence. Adaptive first-order optimizers like Adam (Kingma and
Ba, 2017) are widely used but capture the curvature of the loss landscape only coarsely. Second-order methods such as
Newton’s method or natural gradient descent (Amari, 1998) can converge in fewer steps but are intractable for large networks. Two
recent approaches approximate natural gradient descent: decorrelated backpropagation (DBP), which
adds decorrelation layers optimized alongside the main objective (Ahmad, 2024; Dalm et al., 2024a), and
SOAP (Vyas et al., 2025), which applies updates in a curvature-informed eigenbasis. In our experiments
with Mamba on autoregressive DNA modeling, DBP and SOAP matched Adam’s performance while
reducing training time by up to ∼30%, depending on the performance level at which the optimizers are compared. In our
experiments with autoregressive audio modeling, SOAP produced better models than Adam and reduced
training time by up to ∼40%, again depending on the performance level being compared, while DBP converged
more slowly than Adam and produced slightly inferior models. Finally, we trained Mamba with both
SOAP and DBP on Induction Heads and Selective Copying, two tasks that state space models prior to Mamba
were unable to solve well. We found impaired zero-shot generalization on Induction Heads when
using DBP, suggesting that feature decorrelation might be detrimental in a recurrent context.
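The decorrelation idea behind DBP can be illustrated with a minimal NumPy sketch. It assumes a single linear decorrelation layer z = R x whose matrix R is repeatedly nudged to shrink the off-diagonal entries of the covariance of z. The update rule shown, the function names (decorrelation_step, offdiag_mean_abs), the learning rate, and the toy data are illustrative assumptions rather than the implementation used in this work, where such layers sit inside Mamba and are optimized alongside the main objective.

```python
import numpy as np

def offdiag_mean_abs(Z):
    """Mean absolute off-diagonal entry of the (uncentered) covariance of Z."""
    C = Z.T @ Z / len(Z)
    return np.abs(C - np.diag(np.diag(C))).mean()

def decorrelation_step(R, X, lr=0.05):
    """One assumed decorrelation update: R <- R - lr * offdiag(cov(z)) @ R."""
    Z = X @ R.T                          # decorrelated features z = R x, one row per sample
    C = Z.T @ Z / len(Z)                 # batch estimate of E[z z^T]
    C_off = C - np.diag(np.diag(C))      # keep only the cross-correlations
    return R - lr * C_off @ R

# Toy data (assumption): four strongly correlated input features.
rng = np.random.default_rng(0)
A = np.eye(4) + 0.4 * np.ones((4, 4))    # mixing matrix that introduces correlations
X = rng.standard_normal((2000, 4)) @ A.T

R = np.eye(4)                            # start from the identity (no decorrelation)
before = offdiag_mean_abs(X)
for _ in range(200):
    R = decorrelation_step(R, X)
after = offdiag_mean_abs(X @ R.T)

print(f"mean |off-diagonal covariance| before: {before:.3f}, after: {after:.2e}")
```

In this toy setting the cross-correlations of the transformed inputs shrink toward zero while the layer's own objective is independent of any task loss, which is what allows such layers to be trained in parallel with the main network.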
We analyzed the computational costs of optimizer update steps and found that although DBP and
SOAP required orders of magnitude more operations than Adam, this overhead was negligible compared
to the operations required by Mamba’s forward and backward passes. Memory profiling showed that DBP
increased the peak reserved and allocated memory by ∼30–35% over Adam during training,
while SOAP’s peak memory overhead was negligible.
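To give a sense of why a SOAP update step needs more operations than Adam's elementwise update, the following is a simplified NumPy sketch of a SOAP-style step for one 2-D weight matrix. It is an assumption-laden illustration, not the implementation profiled here: the names soap_like_step and init_state are hypothetical, the eigenbasis is recomputed with a full eigendecomposition on every step (practical implementations refresh it only occasionally), and the hyperparameters are arbitrary.

```python
import numpy as np

def init_state(shape):
    """Optimizer state for one m x n weight matrix (hypothetical helper)."""
    m, n = shape
    return {"t": 0, "L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "m": np.zeros(shape), "v": np.zeros(shape)}

def soap_like_step(W, G, state, lr=1e-2, beta1=0.9, beta2=0.99, eps=1e-8):
    """One simplified SOAP-style update of weights W given gradient G."""
    state["t"] += 1
    t = state["t"]
    # Running Kronecker-factored curvature statistics, as in Shampoo.
    state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)
    # Eigenbases of both factors: cubic cost in each matrix dimension.
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    # First moment is tracked in the original coordinates.
    state["m"] = beta1 * state["m"] + (1 - beta1) * G
    # Second moment is tracked in the rotated (eigenbasis) coordinates.
    G_rot = QL.T @ G @ QR
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot**2
    # Adam-style update computed in the eigenbasis, with bias correction.
    m_hat = (QL.T @ state["m"] @ QR) / (1 - beta1**t)
    v_hat = state["v"] / (1 - beta2**t)
    step_rot = m_hat / (np.sqrt(v_hat) + eps)
    # Rotate the step back to the original coordinates.
    return W - lr * (QL @ step_rot @ QR.T)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
G = rng.standard_normal((3, 3))
W_new = soap_like_step(W, G, init_state(W.shape), lr=1e-2)
D = (W - W_new) / 1e-2   # applied update direction, rescaled by the learning rate
```

Compared with Adam, each step in this sketch adds two eigendecompositions and several matrix products per weight matrix. On the very first step from an empty state, the rescaled update D is (numerically) an orthogonal matrix, since the rotated gradient is diagonal and the Adam-style normalization maps its nonzero entries to ±1.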
Faculty
Faculteit der Sociale Wetenschappen
