idiap/LARM


LARM: Loop Audio Recurrent Model

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

Paper on arXiv


Abstract

End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback to structure the loop into recognition checkpoints separated by latent refinement phases. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines, while using a fraction of the parameter count.


Architecture

[Figure: LARM architecture overview]

LARM applies a shared Transformer encoder recurrently to a latent acoustic sequence. Each loop reuses the same parameters and shared CTC head, modulated by FiLM depth conditioning and supervision-clock embeddings, with delayed soft-posterior feedback reinjected into the recurrent state. CTC loss is applied only at sparse recognition checkpoints.

Recurrence. The acoustic frontend $\phi$ produces the initial state $\mathbf{h}^{(0)} = \phi(\mathbf{x})$. For each loop $k = 1, \dots, K$, the shared encoder $F_\theta$ and CTC head $\psi$ give an encoder output and token posteriors:

$$\mathbf{z}^{(k)} = F_\theta\big(\mathbf{h}^{(k-1)}\big), \qquad \mathbf{p}^{(k)} = \mathrm{softmax}\big(\psi(\mathbf{z}^{(k)})\big).$$

Delayed prediction feedback. The posteriors are projected back to the hidden space and shifted by one frame (with zero padding at $t=0$):

$$\mathbf{r}^{(k)}_t = \mathbf{p}^{(k)}_t \mathbf{W}_\rho, \qquad \bar{\mathbf{r}}^{(k)}_t = \mathbf{r}^{(k)}_{t-1}.$$

State aggregation. The encoder output, the frontend skip connection, and the delayed feedback are combined with learnable scalars $\alpha, \beta$:

$$\mathbf{a}^{(k)} = \mathbf{z}^{(k)} + \beta\,\mathbf{h}^{(0)} + \alpha\,\bar{\mathbf{r}}^{(k)}.$$

Clock and FiLM depth conditioning. A supervision-clock embedding (period $c$) is added, then FiLM modulation on the normalized depth $\bar{d}(k) = \tfrac{k-1}{K-1}$ produces the next state:

$$\hat{\mathbf{a}}^{(k)} = \mathbf{a}^{(k)} + \mathbf{W}_c\big[(k-1) \bmod c\big], \qquad \mathbf{h}^{(k)} = \boldsymbol{\gamma}_{\mathrm{film}}\big(\bar{d}(k)\big) \odot \hat{\mathbf{a}}^{(k)} + \boldsymbol{\beta}_{\mathrm{film}}\big(\bar{d}(k)\big).$$
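The per-loop update can be traced end to end with a small NumPy sketch. This is not the repository's implementation: the shared Transformer encoder is replaced by a single `tanh` layer, the FiLM generator `film` is a toy affine function of the scalar depth, and all weights, shapes, and names (`W_f`, `W_psi`, `W_rho`, `W_c`, `larm_loops`) are hypothetical stand-ins chosen only to make the data flow visible.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 6, 8, 5        # frames, hidden width, vocabulary size (toy values)
K, c = 4, 2              # number of loops and clock period (toy values)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical stand-ins for the learned modules.
W_f = rng.normal(scale=0.3, size=(D, D))    # replaces the shared encoder F_theta
W_psi = rng.normal(scale=0.3, size=(D, V))  # CTC head psi
W_rho = rng.normal(scale=0.3, size=(V, D))  # feedback projection W_rho
W_c = rng.normal(scale=0.1, size=(c, D))    # clock embedding table W_c
alpha, beta = 0.5, 0.5                      # mixing scalars (learned in LARM)

def film(d_bar):
    """Toy FiLM generator (assumption): scale and shift depend on depth only."""
    gamma = np.ones(D) + 0.1 * d_bar
    shift = np.full(D, 0.05 * d_bar)
    return gamma, shift

def larm_loops(h0):
    h, posteriors = h0, []
    for k in range(1, K + 1):
        z = np.tanh(h @ W_f)                    # z^(k) = F_theta(h^(k-1))
        p = softmax(z @ W_psi)                  # p^(k), one distribution per frame
        r = p @ W_rho                           # project posteriors to hidden space
        r_bar = np.vstack([np.zeros((1, D)), r[:-1]])  # one-frame delay, zero at t=0
        a = z + beta * h0 + alpha * r_bar       # state aggregation
        a_hat = a + W_c[(k - 1) % c]            # supervision-clock embedding
        gamma, shift = film((k - 1) / (K - 1))  # normalized depth in [0, 1]
        h = gamma * a_hat + shift               # h^(k), input to the next loop
        posteriors.append(p)
    return h, posteriors

h0 = rng.normal(size=(T, D))                    # stands in for phi(x)
h_K, posteriors = larm_loops(h0)
print(h_K.shape, len(posteriors))
```

Note that the frontend output `h0` is re-added at every loop (the skip connection), while the feedback term at frame `t` only sees the posterior of frame `t-1` from the same loop.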

Sparse supervision. CTC loss is applied only at the checkpoint loops $\mathcal{S} = \{c, 2c, \dots, K\}$:

$$\mathcal{L} = \frac{1}{|\mathcal{S}|} \sum_{k \in \mathcal{S}} \mathcal{L}_{\mathrm{CTC}}\big(\psi(\mathbf{z}^{(k)}), \mathbf{y}\big).$$
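The loop schedule implied by these definitions is easy to enumerate. The sketch below is illustrative only: the function names are hypothetical, and it assumes $K$ is a multiple of $c$ (as the set $\{c, 2c, \dots, K\}$ suggests) and $K > 1$ so that the normalized depth is defined.

```python
def checkpoint_loops(K, c):
    """Loops that receive CTC supervision: c, 2c, ..., K."""
    if K % c != 0:
        raise ValueError("this sketch assumes K is a multiple of c")
    return list(range(c, K + 1, c))

def clock_index(k, c):
    """Row of the clock embedding table used at loop k (1-based)."""
    return (k - 1) % c

def normalized_depth(k, K):
    """FiLM conditioning input: 0 at the first loop, 1 at the last (needs K > 1)."""
    return (k - 1) / (K - 1)

def sparse_ctc_loss(per_loop_loss, K, c):
    """Average already-computed per-loop CTC losses over checkpoint loops only.

    per_loop_loss maps the 1-based loop index k to a scalar loss value.
    """
    S = checkpoint_loops(K, c)
    return sum(per_loop_loss[k] for k in S) / len(S)

K, c = 12, 4
print(checkpoint_loops(K, c))                          # [4, 8, 12]
print([clock_index(k, c) for k in range(1, K + 1)])    # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

With this indexing, every checkpoint loop falls on clock index $c-1$, so the clock embedding tells the model how many refinement loops remain before the next supervised exit.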


Main results

LibriSpeech WER (%) on test-clean / test-other, with greedy CTC decoding and 4-gram LM beam search. The reference LARM ($d=384$, $K=12$; 7.7M) matches or beats a 16-block unshared encoder at ~4× fewer parameters. Scaling LARM to $d=768$ (28.9M) surpasses even the 48-block encoder (85.7M, the largest standard baseline) on both 100h and 960h, and larger budgets (more blocks, wider $d$, more loops) push WER lower still.

| Train | Model | #Params | Greedy clean | Greedy other | +LM clean | +LM other |
|---|---|---|---|---|---|---|
| 100h | Standard encoder, 16 blocks | 28.9M | 14.43 | 37.23 | 9.97 | 28.68 |
| 100h | Standard encoder, 48 blocks | 85.7M | 12.24 | 34.06 | 9.03 | 27.07 |
| 100h | LARM ($d=384$, $K=12$) | 7.7M | 11.34 | 31.84 | 8.66 | 26.28 |
| 100h | LARM ($d=768$, $K=12$) | 28.9M | 9.58 | 28.25 | 7.57 | 23.89 |
| 960h | Standard encoder, 16 blocks | 28.9M | 4.79 | 13.26 | 3.51 | 9.87 |
| 960h | Standard encoder, 48 blocks | 85.7M | 3.87 | 10.58 | 3.20 | 8.56 |
| 960h | LARM ($d=384$, $K=12$) | 7.7M | 4.59 | 11.75 | 3.51 | 9.38 |
| 960h | LARM ($d=768$, $K=12$) | 28.9M | 3.45 | 9.44 | 2.93 | 7.93 |
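The parameter and WER comparisons quoted above can be recomputed from the table. The snippet below copies the 960h rows verbatim; the dictionary keys and helper names are hypothetical and only serve this illustration.

```python
# 960h rows: (params in millions, greedy clean, greedy other, +LM clean, +LM other)
results_960h = {
    "standard_16": (28.9, 4.79, 13.26, 3.51, 9.87),
    "standard_48": (85.7, 3.87, 10.58, 3.20, 8.56),
    "larm_d384":   (7.7, 4.59, 11.75, 3.51, 9.38),
    "larm_d768":   (28.9, 3.45, 9.44, 2.93, 7.93),
}

def param_ratio(baseline, model, table=results_960h):
    """How many times more parameters the baseline has than the model."""
    return table[baseline][0] / table[model][0]

def relative_wer_reduction(baseline, model, column, table=results_960h):
    """Fractional WER reduction of `model` over `baseline` for a WER column (1..4)."""
    b, m = table[baseline][column], table[model][column]
    return (b - m) / b

print(f"{param_ratio('standard_16', 'larm_d384'):.2f}x fewer parameters")
print(f"{relative_wer_reduction('standard_48', 'larm_d768', 2):.1%} lower greedy test-other WER")
```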

Pretrained checkpoints

To reproduce the table above without retraining, download the released checkpoints from HuggingFace:

  • [TBD]: LARM 100h ($d=384$, $K=12$, the reference model)
  • [TBD]: LARM 960h ($d=384$, $K=12$)
  • [TBD]: LARM 960h ($d=768$, $K=12$)

Place each downloaded checkpoint-<step>/ directory under <output_root>/larm/<suffix>/ (where output_root is your LARGE_MODELS_PATH or --output_dir), set SUFFIX accordingly in scripts/eval/eval_template.sh, and run it. Using the released checkpoints removes two sources of variation when reproducing the table: differences in the software environment and differences in GPU numerics.


Loop-structured test-time computation

WER trajectories across recurrent loops for two trained LARM models. Red stars mark the supervised recognition checkpoints (every c loops); intermediate loops refine the latent acoustic representation. Supervised checkpoints improve from one to the next, while intermediate loops can be non-monotonic (100h) or follow a smoother trajectory (960h).

[Figure, two panels: WER vs. loop, 960h, K=12, c=4 (left); WER vs. loop, 100h, K=16, c=4 (right)]

Left: LibriSpeech 960h (K=12, c=4), smooth refinement. Right: LibriSpeech 100h (K=16, c=4), non-monotonic intermediate loops.


Installation

Python 3.12, PyTorch 2.3. env.yml is the environment used for the paper; env_quick.yml is provided as a quicker-to-build alternative.

conda env create -f env.yml          # or env_quick.yml for a faster install
conda activate larmenv

Training requires a CUDA GPU.


Configuration (path_config.py)

All machine-specific paths live in a single file that is not committed. Copy the template and fill in your own locations:

cp larm/config/path_config_example.py larm/config/path_config.py

Then edit larm/config/path_config.py:

| Variable | Meaning |
|---|---|
| DATASETS_ROOT_PATH | Folder for a local LibriSpeech copy (librispeech_asr/ inside). If it does not exist, LibriSpeech is downloaded automatically into HUGGINGFACE_CACHE. |
| HUGGINGFACE_CACHE | HuggingFace datasets cache directory. |
| LARGE_MODELS_PATH | Where experiment outputs/checkpoints are written by default. |

These can be overridden per run with --output_dir and --hf_cache_dir.
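As a rough idea of what a filled-in file might look like, here is a sketch using the three variable names from the table. The directory values are made up, and the real template in larm/config/path_config_example.py may define them differently (for example as pathlib objects), so treat the template as authoritative.

```python
# larm/config/path_config.py (illustrative values only; replace with your own paths)

# Hypothetical location; a local copy is expected at <DATASETS_ROOT_PATH>/librispeech_asr/
DATASETS_ROOT_PATH = "/data/datasets"

# Hypothetical location of the HuggingFace datasets cache
HUGGINGFACE_CACHE = "/data/hf_cache"

# Hypothetical location for experiment outputs and checkpoints
LARGE_MODELS_PATH = "/data/experiments"
```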


Running an experiment

Experiments are driven by run.py and organized as shell scripts under scripts/, grouped by category:

scripts/
├── main_exp/              # reference LARM (100h, 960h)
├── baseline/              # standard non-looped encoders
├── scaling/               # width / data / epoch scaling
├── ablative_depth/        # depth-conditioning ablations
├── ablative_feedback/     # feedback & aggregation ablations
├── ablative_supervision/  # checkpoint-interval ablations
├── ablative_loop_budget/  # loop-budget (K) ablations
├── ablative_nblocks/      # encoder-depth ablations
└── eval/                  # eval_template.sh (ready to fill)

Each script begins with two placeholders you must set for your machine:

cd path/to/larm           # repo root
path/to/python run.py ...  # python from the `larmenv` env

Then launch a run, e.g. the reference 100h model:

bash scripts/main_exp/run_libri_100h_d384_full.sh

Key flags (see python run.py --help for the full list):

| Flag | Role |
|---|---|
| --K | number of recurrent loops |
| --clock_period | checkpoint interval c (sparse-supervision period) |
| --depth_mode film | FiLM depth conditioning |
| --rep_feedback_mode prev_frame | delayed (one-frame) prediction feedback |
| --learn_alpha_beta | learn the feedback / skip mixing scalars |
| --d_model, --n_heads, --n_blocks | encoder width / heads / shared blocks |
| --num_train_epochs, --batch_size, --gradient_accumulation_steps | optimization budget |

How data is stored

Input data. LibriSpeech is loaded with the HuggingFace datasets loader. If a local copy exists at DATASETS_ROOT_PATH/librispeech_asr it is used; otherwise the dataset is downloaded from the Hub into HUGGINGFACE_CACHE (or --hf_cache_dir) on first run, so a from-scratch run needs internet access and ~60 GB of disk for the full 960h set.

Acoustic frontend features. LARM uses only Whisper's log-Mel feature extractor (80 mel bins), never the Whisper model weights. The extractor config is downloaded once from openai/whisper-small (a few KB); no model checkpoint is required.

Preprocessing & caching. run.py maps each split to 80-dim log-Mel features and then filters samples. Both the map and filter steps are content-addressed: results are cached by a fingerprint, so re-running an experiment reuses the prepared data instead of recomputing it. Filtering is controlled by:

| Flag | Default | Meaning |
|---|---|---|
| --min_input_length | 400 | drop utterances with fewer input samples than this |
| --max_input_length | 480000 | drop utterances with more input samples than this (~30 s @ 16 kHz) |
| --min_label_length | 1 | drop utterances with fewer label tokens than this |
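The effect of these thresholds can be sketched as a simple predicate. This is not the code in run.py: `keep_example` and `duration_seconds` are hypothetical names, and treating the bounds as inclusive is an assumption.

```python
SAMPLE_RATE = 16_000          # LibriSpeech audio is sampled at 16 kHz

MIN_INPUT_LENGTH = 400        # default of --min_input_length (samples)
MAX_INPUT_LENGTH = 480_000    # default of --max_input_length (samples)
MIN_LABEL_LENGTH = 1          # default of --min_label_length (tokens)

def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a length in samples to seconds."""
    return num_samples / sample_rate

def keep_example(num_samples, num_label_tokens):
    """Return True if an utterance passes all three length filters.

    Assumption: bounds are inclusive; the actual comparison in run.py may differ.
    """
    return (
        MIN_INPUT_LENGTH <= num_samples <= MAX_INPUT_LENGTH
        and num_label_tokens >= MIN_LABEL_LENGTH
    )

print(duration_seconds(MIN_INPUT_LENGTH), duration_seconds(MAX_INPUT_LENGTH))  # 0.025 30.0
print(keep_example(16_000, 12), keep_example(500_000, 12))                     # True False
```

With the defaults, anything shorter than 25 ms or longer than 30 s is dropped, as is any utterance with an empty transcript.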

The character-level CTC vocabulary is written to vocab/vocab_libri.json (or vocab/vocab_libri_lowercase.json with --force_lowercase).

Outputs. Checkpoints are written to:

<output_root>/<encoder_name>/<suffix>/checkpoint-<step>/

where output_root = --output_dir (or LARGE_MODELS_PATH/large_models_results by default), encoder_name defaults to larm, and suffix is the --suffix of the run. Saving/eval cadence is set by --save_strategy {steps,epoch}, --save_steps, and --eval_steps.
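To make the layout concrete, the path pattern can be assembled as below. The helper name, the output root, the suffix, and the step are all made-up examples; only the `<output_root>/<encoder_name>/<suffix>/checkpoint-<step>/` pattern and the `larm` default come from the description above.

```python
from pathlib import Path

def checkpoint_dir(output_root, suffix, step, encoder_name="larm"):
    """Build <output_root>/<encoder_name>/<suffix>/checkpoint-<step>."""
    return Path(output_root) / encoder_name / suffix / f"checkpoint-{step}"

# Hypothetical run: suffix "libri_100h_d384", checkpoint saved at step 5000.
path = checkpoint_dir("/data/experiments/large_models_results", "libri_100h_d384", 5000)
print(path.as_posix())
# /data/experiments/large_models_results/larm/libri_100h_d384/checkpoint-5000
```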


What's in the checkpoint JSON files

Each checkpoint-<step>/ directory contains the weights (model.pt, optim.pt, sched.pt, scaler.pt) plus three JSON files:

  • config.json: the full model + training configuration used to rebuild the model: architecture (d_model, n_heads, n_blocks, K, clock_period, depth_period, depth_mode, rep_feedback_mode, alpha/beta, conditioning flags, vocab_size, pad_id/blank_id) and training settings (optimizer, schedule, SpecAugment, dataset splits, batch size, epochs).
  • trainer_state.json: HF-style training state:
    • global_step, epoch, learning_rate
    • best_metric and best_model_checkpoint (lowest eval_wer so far)
    • log_history: a list of training-step entries {step, epoch, loss, learning_rate} and evaluation events {step, epoch, eval_loss, eval_wer}.
  • meta.json: minimal {step, epoch} marker used for resuming.

For evaluation, run.py resolves the checkpoint from trainer_state.json's best_model_checkpoint, falling back to the latest checkpoint-* directory.
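The resolution rule can be approximated as follows. This is a sketch of the described behaviour, not the logic in run.py: it assumes that the trainer_state.json of the most recent checkpoint is the one consulted and that best_model_checkpoint stores a directory path, and the function names are hypothetical.

```python
import json
import re
from pathlib import Path

def list_checkpoints(run_dir):
    """Return checkpoint-<step> directories sorted by step, ascending."""
    found = [
        p for p in Path(run_dir).iterdir()
        if p.is_dir() and re.fullmatch(r"checkpoint-\d+", p.name)
    ]
    # Sort numerically so that checkpoint-1000 comes after checkpoint-900.
    return sorted(found, key=lambda p: int(p.name.split("-")[1]))

def resolve_checkpoint(run_dir):
    """Prefer best_model_checkpoint; otherwise fall back to the latest checkpoint."""
    ckpts = list_checkpoints(run_dir)
    if not ckpts:
        raise FileNotFoundError(f"no checkpoint-* directory in {run_dir}")
    latest = ckpts[-1]
    state_file = latest / "trainer_state.json"
    if state_file.is_file():
        state = json.loads(state_file.read_text())
        best = state.get("best_model_checkpoint")
        # Assumption: the field holds a path to an existing checkpoint directory.
        if best and Path(best).is_dir():
            return Path(best)
    return latest
```

Calling `resolve_checkpoint` on a `<output_root>/<encoder_name>/<suffix>/` directory returns the directory whose weights would be loaded under these assumptions.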


Evaluating with a language model

Evaluation reuses the same architecture flags as training (so the weights load) plus --only_evaluate. A ready-to-fill template is provided at scripts/eval/eval_template.sh: set SUFFIX to the trained model's id, match the architecture flags, then choose the decoding options.

Greedy CTC (WER at every loop exit):

path/to/python run.py ... --only_evaluate --eval_all_steps
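For readers unfamiliar with greedy CTC decoding, the standard rule is: take the most likely token per frame, merge consecutive repeats, then drop blanks. The sketch below shows that rule on a toy example; the vocabulary and `blank_id=0` are assumptions for illustration (the real ids are stored in config.json and the vocab file), and this is not the decoding code in run.py.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a token sequence.

    Step 1 merges runs of identical ids; step 2 removes the blank symbol.
    """
    collapsed = [tok for tok, _ in groupby(frame_ids)]
    return [tok for tok in collapsed if tok != blank_id]

# Hypothetical character vocabulary with the blank at index 0.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]   # argmax id for each frame
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # cat
```

Applying this to the posteriors of each loop gives one hypothesis per loop exit, from which a per-loop WER can be computed.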

4-gram KenLM beam search. The LibriSpeech 4-gram LM (4-gram.arpa) can be downloaded from OpenSLR:

path/to/python run.py ... \
    --only_evaluate \
    --eval_all_steps \
    --lm_path path/to/4-gram.arpa \
    --eval_all_steps_lm \
    --lm_alpha 0.5 \
    --lm_beta 1.0 \
    --beam_size 100

| Flag | Meaning |
|---|---|
| --lm_path | KenLM .arpa / .binary file (enables LM decoding) |
| --eval_all_steps_lm | run LM beam search at every loop exit |
| --lm_alpha | LM (shallow-fusion) weight (paper: 0.5) |
| --lm_beta | word-insertion bonus (paper: 1.0) |
| --beam_size | beam width (paper: 100) |

Omit the --lm_* block to report greedy WER only.
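To see how --lm_alpha and --lm_beta interact, shallow fusion is conventionally written as a weighted sum of the acoustic score, the LM score, and a per-word bonus. The sketch below uses that conventional form with made-up scores; the decoder used by run.py may differ in log base or normalization, and `fused_score` is a hypothetical name.

```python
def fused_score(ctc_log_prob, lm_log_prob, num_words, lm_alpha=0.5, lm_beta=1.0):
    """Conventional shallow-fusion score for one beam hypothesis (assumed form)."""
    return ctc_log_prob + lm_alpha * lm_log_prob + lm_beta * num_words

# Two hypothetical hypotheses with the same number of words.
# B has the better acoustic (CTC) score, A is more plausible under the LM.
hyps = {
    "A": fused_score(ctc_log_prob=-4.0, lm_log_prob=-6.0, num_words=3),
    "B": fused_score(ctc_log_prob=-3.8, lm_log_prob=-9.0, num_words=3),
}
best = max(hyps, key=hyps.get)
print(best)  # A
```

A larger `lm_alpha` lets the LM override the acoustic model more often, while `lm_beta` offsets the LM's tendency to favour hypotheses with fewer words.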


Authors and Affiliations

| Name | Affiliation | Contact |
|---|---|---|
| Yacouba Kaloga | Idiap Research Institute, Martigny, Switzerland | yacouba.kaloga@idiap.ch |
| Shashi Kumar | Idiap Research Institute, Martigny, Switzerland | shashi.kumar@idiap.ch |
| Shakeel A. Sheikh | TOCOME | Shakeelzmail608@gmail.com |
| Driss Khalil | Idiap Research Institute, Martigny, Switzerland | driss.khalil@idiap.ch |
| Petr Motlicek | Idiap Research Institute, Martigny, Switzerland | petr.motlicek@idiap.ch |
| Ina Kodrasi | Idiap Research Institute, Martigny, Switzerland | ina.kodrasi@idiap.ch |

Citation

If you use this code, please cite (placeholder, paper currently under review):

@misc{kaloga2026larm,
  title         = {Test-Time Compute Scaling for {ASR} with Depth-Conditioned Looped Transformers},
  author        = {Kaloga, Yacouba and Kumar, Shashi and Sheikh, Shakeel and Khalil, Driss and Motlicek, Petr and Kodrasi, Ina},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  note          = {Preprint, under review},
}

License

The code in this project is released under the MIT License. See LICENSES/MIT.txt for details. The LibriSpeech dataset is distributed under the CC BY 4.0 license. Third-party dependencies and datasets are listed in THIRDPARTY.md.

