LARM: Loop Audio Recurrent Model

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

Paper on arXiv

Abstract

End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback to structure the loop into recognition checkpoints separated by latent refinement phases. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines, while using a fraction of the parameter count.

Architecture

LARM applies a shared Transformer encoder recurrently to a latent acoustic sequence. Each loop reuses the same parameters and shared CTC head, modulated by FiLM depth conditioning and supervision-clock embeddings, with delayed soft-posterior feedback reinjected into the recurrent state. CTC loss is applied only at sparse recognition checkpoints.

Recurrence. The acoustic frontend $\phi$ produces the initial state $\mathbf{h}^{(0)} = \phi(\mathbf{x})$. For each loop $k = 1, \dots, K$, the shared encoder $F_\theta$ and CTC head $\psi$ give an encoder output and token posteriors:

$$\mathbf{z}^{(k)} = F_\theta\big(\mathbf{h}^{(k-1)}\big), \qquad \mathbf{p}^{(k)} = \mathrm{softmax}\big(\psi(\mathbf{z}^{(k)})\big).$$

Delayed prediction feedback. The posteriors are projected back to the hidden space and shifted by one frame (with zero padding at $t=0$):

$$\mathbf{r}^{(k)}_t = \mathbf{p}^{(k)}_t \mathbf{W}_\rho, \qquad \bar{\mathbf{r}}^{(k)}_t = \mathbf{r}^{(k)}_{t-1}.$$
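
The one-frame delay can be illustrated in plain Python. This is a minimal sketch, assuming the projected feedback of one utterance is stored as a list of per-frame vectors; the function name `shift_one_frame` and the toy values are hypothetical and not taken from the repository.

```python
def shift_one_frame(r):
    """Return r delayed by one frame: out[t] = r[t-1], with zeros at t = 0."""
    if not r:
        return []
    zero = [0.0] * len(r[0])  # zero padding for the first frame
    return [zero] + [list(v) for v in r[:-1]]


# Three frames of 2-dimensional projected feedback (made-up values).
r = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r_bar = shift_one_frame(r)
print(r_bar)  # [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
```

The sequence length is unchanged: the last frame of `r` is dropped and frame $t$ only sees the prediction made at frame $t-1$.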

State aggregation. The encoder output, the frontend skip connection, and the delayed feedback are combined with learnable scalars $\alpha, \beta$:

$$\mathbf{a}^{(k)} = \mathbf{z}^{(k)} + \beta\,\mathbf{h}^{(0)} + \alpha\,\bar{\mathbf{r}}^{(k)}.$$

Clock and FiLM depth conditioning. A supervision-clock embedding (period $c$) is added, then FiLM modulation on the normalized depth $\bar{d}(k) = \tfrac{k-1}{K-1}$ produces the next state:

$$\hat{\mathbf{a}}^{(k)} = \mathbf{a}^{(k)} + \mathbf{W}_c\big[(k-1) \bmod c\big], \qquad \mathbf{h}^{(k)} = \boldsymbol{\gamma}_{\mathrm{film}}\big(\bar{d}(k)\big) \odot \hat{\mathbf{a}}^{(k)} + \boldsymbol{\beta}_{\mathrm{film}}\big(\bar{d}(k)\big).$$
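
To make the two conditioning signals concrete, the sketch below lists the clock index $(k-1) \bmod c$ and the normalized depth $\bar{d}(k)$ for every loop. It only evaluates the index formulas above; the function name `conditioning_schedule` is hypothetical, and the guard for $K < 2$ is an assumption, since the normalized depth is undefined for a single loop.

```python
def conditioning_schedule(K, c):
    """List (loop k, clock index (k-1) mod c, normalized depth (k-1)/(K-1))."""
    if K < 2:
        raise ValueError("normalized depth needs K >= 2")
    return [(k, (k - 1) % c, (k - 1) / (K - 1)) for k in range(1, K + 1)]


for k, clock, depth in conditioning_schedule(K=12, c=4):
    print(f"k={k:2d}  clock={clock}  depth={depth:.3f}")
```

With $K=12$ and $c=4$ the clock index cycles through 0, 1, 2, 3 three times, while the normalized depth grows linearly from 0 to 1. The clock therefore encodes the position within a refinement phase and the FiLM input encodes absolute progress through the loop budget.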

Sparse supervision. CTC loss is applied only at the checkpoint loops $\mathcal{S} = \{c, 2c, \dots, K\}$:

$$\mathcal{L} = \frac{1}{|\mathcal{S}|} \sum_{k \in \mathcal{S}} \mathcal{L}_{\mathrm{CTC}}\big(\psi(\mathbf{z}^{(k)}), \mathbf{y}\big).$$
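
The checkpoint set and the averaged loss can be sketched as follows. This is an illustration only: the function names are hypothetical, the per-loop loss values are made up, and the sketch assumes $K$ is a multiple of $c$ so that the last loop is itself a checkpoint.

```python
def checkpoint_loops(K, c):
    """Loops that receive CTC supervision: c, 2c, ..., K."""
    if c < 1 or K < c or K % c != 0:
        raise ValueError("this sketch assumes K is a positive multiple of c")
    return list(range(c, K + 1, c))


def sparse_loss(per_loop_ctc, K, c):
    """Average the CTC losses of the checkpoint loops only.

    per_loop_ctc maps the 1-based loop index k to a CTC loss value.
    """
    loops = checkpoint_loops(K, c)
    return sum(per_loop_ctc[k] for k in loops) / len(loops)


print(checkpoint_loops(K=12, c=4))  # [4, 8, 12]

# Made-up per-loop losses; only loops 4, 8 and 12 contribute.
toy_losses = {k: float(k) for k in range(1, 13)}
print(sparse_loss(toy_losses, K=12, c=4))  # 8.0
```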

Main results

LibriSpeech WER (%) on test-clean / test-other, with greedy CTC decoding and 4-gram LM beam search. The reference LARM ($d=384$, $K=12$; 7.7M) matches or beats a 16-block unshared encoder at ~4× fewer parameters. Scaling LARM to $d=768$ (28.9M) surpasses even the 48-block encoder (85.7M, the largest standard baseline) on both 100h and 960h, and larger budgets (more blocks, wider $d$, more loops) push WER lower still.

Train	Model	#Params	Greedy clean	Greedy other	+LM clean	+LM other
100h	Standard encoder, 16 blocks	28.9M	14.43	37.23	9.97	28.68
	Standard encoder, 48 blocks	85.7M	12.24	34.06	9.03	27.07
	LARM ($d=384$, $K=12$)	7.7M	11.34	31.84	8.66	26.28
	LARM ($d=768$, $K=12$)	28.9M	9.58	28.25	7.57	23.89
960h	Standard encoder, 16 blocks	28.9M	4.79	13.26	3.51	9.87
	Standard encoder, 48 blocks	85.7M	3.87	10.58	3.20	8.56
	LARM ($d=384$, $K=12$)	7.7M	4.59	11.75	3.51	9.38
	LARM ($d=768$, $K=12$)	28.9M	3.45	9.44	2.93	7.93
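
The comparison against the largest baseline can be checked with a few lines of arithmetic. The sketch below uses only numbers from the table (960h, greedy decoding); the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# 960h, greedy decoding, test-clean / test-other (values from the table).
baseline_48 = {"clean": 3.87, "other": 10.58}  # standard encoder, 48 blocks
larm_768 = {"clean": 3.45, "other": 9.44}      # LARM d=768, K=12

for split in ("clean", "other"):
    rel = relative_reduction(baseline_48[split], larm_768[split])
    print(f"{split}: {rel:.1f}% relative WER reduction")

# Parameter counts in millions: 48-block baseline vs. LARM d=768.
print(f"parameter ratio: {85.7 / 28.9:.2f}x")
```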

Pretrained checkpoints

To reproduce the table above without retraining, download the released checkpoints from HuggingFace:

[TBD]: LARM 100h ($d=384$, $K=12$, the reference model)
[TBD]: LARM 960h ($d=384$, $K=12$)
[TBD]: LARM 960h ($d=768$, $K=12$)

Place each downloaded checkpoint-<step>/ directory under <output_root>/larm/<suffix>/ (where output_root is --output_dir if given, otherwise the default location derived from LARGE_MODELS_PATH), set SUFFIX accordingly in scripts/eval/eval_template.sh, and run it. Using the released checkpoints removes two sources of run-to-run variation: differences in the software environment and in GPU numerics.

Loop-structured test-time computation

WER trajectories across recurrent loops for two trained LARM models. Red stars mark the supervised recognition checkpoints (every c loops); intermediate loops refine the latent acoustic representation. Supervised checkpoints improve from one to the next, while intermediate loops can be non-monotonic (100h) or follow a smoother trajectory (960h).

Left: LibriSpeech 960h (K=12, c=4), smooth refinement. Right: LibriSpeech 100h (K=16, c=4), non-monotonic intermediate loops.

Installation

Python 3.12, PyTorch 2.3. env.yml is the environment used for the paper; env_quick.yml is provided as a quicker-to-build alternative.

conda env create -f env.yml          # or env_quick.yml for a faster install
conda activate larmenv

Training requires a CUDA GPU.

Configuration (`path_config.py`)

All machine-specific paths live in a single file that is not committed. Copy the template and fill in your own locations:

cp larm/config/path_config_example.py larm/config/path_config.py

Then edit larm/config/path_config.py:

Variable	Meaning
`DATASETS_ROOT_PATH`	Folder for a local LibriSpeech copy (`librispeech_asr/` inside). If it does not exist, LibriSpeech is downloaded automatically into `HUGGINGFACE_CACHE`.
`HUGGINGFACE_CACHE`	HuggingFace `datasets` cache directory.
`LARGE_MODELS_PATH`	Where experiment outputs/checkpoints are written by default.

These can be overridden per run with --output_dir and --hf_cache_dir.
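
A filled-in path_config.py might look like the sketch below. The three variable names come from the table above; the directory values are placeholders, and treating them as plain strings is an assumption, so follow whatever form path_config_example.py actually uses.

```python
# larm/config/path_config.py (illustrative values; adjust to your machine)

# Folder expected to contain a local librispeech_asr/ copy, if you have one.
DATASETS_ROOT_PATH = "/data/datasets"

# HuggingFace `datasets` cache; LibriSpeech is downloaded here when no local copy exists.
HUGGINGFACE_CACHE = "/data/hf_cache"

# Default location for experiment outputs and checkpoints.
LARGE_MODELS_PATH = "/data/experiments"
```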

Running an experiment

Experiments are driven by run.py and organized as shell scripts under scripts/, grouped by category:

scripts/
├── main_exp/            # reference LARM (100h, 960h)
├── baseline/            # standard non-looped encoders
├── scaling/             # width / data / epoch scaling
├── ablative_depth/      # depth-conditioning ablations
├── ablative_feedback/   # feedback & aggregation ablations
├── ablative_supervision/ # checkpoint-interval ablations
├── ablative_loop_budget/ # loop-budget (K) ablations
├── ablative_nblocks/    # encoder-depth ablations
└── eval/                # eval_template.sh (ready to fill)

Each script begins with two placeholders you must set for your machine:

cd path/to/larm           # repo root
path/to/python run.py ...  # python from the `larmenv` env

Then launch a run, e.g. the reference 100h model:

bash scripts/main_exp/run_libri_100h_d384_full.sh

Key flags (see python run.py --help for the full list):

Flag	Role
`--K`	number of recurrent loops
`--clock_period`	checkpoint interval `c` (sparse-supervision period)
`--depth_mode film`	FiLM depth conditioning
`--rep_feedback_mode prev_frame`	delayed (one-frame) prediction feedback
`--learn_alpha_beta`	learn the feedback / skip mixing scalars
`--d_model`, `--n_heads`, `--n_blocks`	encoder width / heads / shared blocks
`--num_train_epochs`, `--batch_size`, `--gradient_accumulation_steps`	optimization budget

How data is stored

Input data. LibriSpeech is loaded with the HuggingFace datasets loader. If a local copy exists at DATASETS_ROOT_PATH/librispeech_asr it is used; otherwise the dataset is downloaded from the Hub into HUGGINGFACE_CACHE (or --hf_cache_dir) on first run, so a from-scratch run needs internet access and ~60 GB of disk for the full 960h set.

Acoustic frontend features. LARM uses only Whisper's log-Mel feature extractor (80 mel bins), never the Whisper model weights. The extractor config is downloaded once from openai/whisper-small (a few KB); no model checkpoint is required.

Preprocessing & caching. run.py maps each split to 80-dim log-Mel features and then filters samples. Both the map and filter steps are content-addressed: results are cached by a fingerprint, so re-running an experiment reuses the prepared data instead of recomputing it. Filtering is controlled by:
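
The idea behind fingerprint-based caching can be sketched generically. This is not how the HuggingFace `datasets` library computes its fingerprints; it is a simplified illustration in which a cache key is derived from the step name, its parameters, and the fingerprint of the input. All names below are hypothetical.

```python
import hashlib
import json


def fingerprint(step_name, params, parent_fingerprint):
    """Hash a processing step together with the fingerprint of its input."""
    payload = json.dumps(
        {"step": step_name, "params": params, "parent": parent_fingerprint},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


cache = {}


def cached_step(step_name, params, parent_fingerprint, compute):
    """Run `compute` only if this (step, params, input) was not seen before."""
    key = fingerprint(step_name, params, parent_fingerprint)
    if key not in cache:
        cache[key] = compute()
    return key, cache[key]


calls = []


def expensive():
    calls.append(1)  # count how often the work is really done
    return "features"


k1, _ = cached_step("log_mel", {"n_mels": 80}, "raw-v1", expensive)
k2, _ = cached_step("log_mel", {"n_mels": 80}, "raw-v1", expensive)
k3, _ = cached_step("log_mel", {"n_mels": 128}, "raw-v1", expensive)
print(k1 == k2, k1 != k3, len(calls))  # True True 2
```

The second call reuses the stored result because nothing about the step or its input changed; changing a parameter yields a new key and triggers recomputation.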

Flag	Default	Meaning
`--min_input_length`	400	drop utterances with fewer input samples than this (400 samples = 25 ms @ 16 kHz)
`--max_input_length`	480000	drop utterances with more input samples than this (480000 samples = 30 s @ 16 kHz)
`--min_label_length`	1	drop utterances with too few label tokens
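
The three filters amount to a simple predicate per utterance. The sketch below uses the default values from the table; the function name `keep_example` is hypothetical, and treating both input-length bounds as inclusive is an assumption about the exact comparison used in run.py.

```python
MIN_INPUT_LENGTH = 400      # --min_input_length default (samples)
MAX_INPUT_LENGTH = 480000   # --max_input_length default (30 s at 16 kHz)
MIN_LABEL_LENGTH = 1        # --min_label_length default (tokens)


def keep_example(num_samples, num_label_tokens):
    """Return True if an utterance passes all three length filters."""
    return (
        MIN_INPUT_LENGTH <= num_samples <= MAX_INPUT_LENGTH
        and num_label_tokens >= MIN_LABEL_LENGTH
    )


print(keep_example(16000, 12))    # True: 1 s of audio, 12 label tokens
print(keep_example(399, 12))      # False: too short
print(keep_example(480001, 12))   # False: longer than 30 s
print(keep_example(16000, 0))     # False: empty transcript
```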

The character-level CTC vocabulary is written to vocab/vocab_libri.json (or vocab/vocab_libri_lowercase.json with --force_lowercase).

Outputs. Checkpoints are written to:

<output_root>/<encoder_name>/<suffix>/checkpoint-<step>/

where output_root = --output_dir (or LARGE_MODELS_PATH/large_models_results by default), encoder_name defaults to larm, and suffix is the --suffix of the run. Saving/eval cadence is set by --save_strategy {steps,epoch}, --save_steps, and --eval_steps.
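
For scripting around runs, the layout above can be assembled with `pathlib`. This is a convenience sketch; the helper name `checkpoint_dir` and the example root and suffix are hypothetical.

```python
from pathlib import Path


def checkpoint_dir(output_root, suffix, step, encoder_name="larm"):
    """Build <output_root>/<encoder_name>/<suffix>/checkpoint-<step>/."""
    return Path(output_root) / encoder_name / suffix / f"checkpoint-{step}"


# Placeholder root and suffix, chosen only for illustration.
print(checkpoint_dir("/data/experiments", "libri_100h_d384", 5000).as_posix())
# /data/experiments/larm/libri_100h_d384/checkpoint-5000
```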

What's in the checkpoint JSON files

Each checkpoint-<step>/ directory contains the weights (model.pt, optim.pt, sched.pt, scaler.pt) plus three JSON files:

- config.json: the full model + training configuration used to rebuild the model: architecture (d_model, n_heads, n_blocks, K, clock_period, depth_period, depth_mode, rep_feedback_mode, alpha/beta, conditioning flags, vocab_size, pad_id/blank_id) and training settings (optimizer, schedule, SpecAugment, dataset splits, batch size, epochs).
- trainer_state.json: HF-style training state:
  - global_step, epoch, learning_rate
  - best_metric and best_model_checkpoint (lowest eval_wer so far)
  - log_history: a list of training-step entries {step, epoch, loss, learning_rate} and evaluation events {step, epoch, eval_loss, eval_wer}.
- meta.json: minimal {step, epoch} marker used for resuming.

For evaluation, run.py resolves the checkpoint from trainer_state.json's best_model_checkpoint, falling back to the latest checkpoint-* directory.
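
The resolution rule can be sketched as below. This is not the code in run.py: the function name `resolve_checkpoint` is hypothetical, and the sketch assumes that trainer_state.json is read from the latest checkpoint directory and that best_model_checkpoint holds a path that can be used as-is.

```python
import json
import tempfile
from pathlib import Path


def resolve_checkpoint(run_dir):
    """Prefer best_model_checkpoint, else fall back to the latest checkpoint-*."""
    run_dir = Path(run_dir)
    ckpts = [
        p for p in run_dir.glob("checkpoint-*")
        if p.is_dir() and p.name.split("-", 1)[1].isdigit()
    ]
    if not ckpts:
        raise FileNotFoundError(f"no checkpoint-* directory under {run_dir}")
    # "Latest" means the highest step number, not lexicographic order.
    latest = max(ckpts, key=lambda p: int(p.name.split("-", 1)[1]))
    state_file = latest / "trainer_state.json"
    if state_file.is_file():
        state = json.loads(state_file.read_text(encoding="utf-8"))
        best = state.get("best_model_checkpoint")
        if best and Path(best).is_dir():
            return Path(best)
    return latest


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "larm" / "my_run"
    for step in (100, 200):
        (run / f"checkpoint-{step}").mkdir(parents=True)
    state_path = run / "checkpoint-200" / "trainer_state.json"
    state_path.write_text(
        json.dumps({"best_model_checkpoint": str(run / "checkpoint-100")}),
        encoding="utf-8",
    )
    chosen_with_state = resolve_checkpoint(run).name
    state_path.unlink()  # without trainer_state.json the latest one is used
    chosen_fallback = resolve_checkpoint(run).name

print(chosen_with_state, chosen_fallback)  # checkpoint-100 checkpoint-200
```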

Evaluating with a language model

Evaluation reuses the same architecture flags as training (so the weights load) plus --only_evaluate. A ready-to-fill template is provided at scripts/eval/eval_template.sh: set SUFFIX to the trained model's id, match the architecture flags, then choose the decoding options.

Greedy CTC (WER at every loop exit):

path/to/python run.py ... --only_evaluate --eval_all_steps

4-gram KenLM beam search. The LibriSpeech 4-gram LM (4-gram.arpa) can be downloaded from OpenSLR and is passed via --lm_path:

path/to/python run.py ... \
    --only_evaluate \
    --eval_all_steps \
    --lm_path path/to/4-gram.arpa \
    --eval_all_steps_lm \
    --lm_alpha 0.5 \
    --lm_beta 1.0 \
    --beam_size 100

Flag	Meaning
`--lm_path`	KenLM `.arpa` / `.binary` file (enables LM decoding)
`--eval_all_steps_lm`	run LM beam search at every loop exit
`--lm_alpha`	LM (shallow-fusion) weight (paper: 0.5)
`--lm_beta`	word-insertion bonus (paper: 1.0)
`--beam_size`	beam width (paper: 100)

Omit the --lm_* block to report greedy WER only.
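
To see how `--lm_alpha` and `--lm_beta` interact, the sketch below scores hypotheses with the shallow-fusion formula commonly used in CTC beam-search decoders: acoustic log-probability plus a weighted LM log-probability plus a per-word bonus. Whether run.py uses exactly this form is an assumption, and the hypothesis scores are made up for illustration.

```python
def fused_score(ctc_logprob, lm_logprob, num_words, lm_alpha=0.5, lm_beta=1.0):
    """Shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    return ctc_logprob + lm_alpha * lm_logprob + lm_beta * num_words


# Two made-up 4-word hypotheses: B is slightly better acoustically,
# but A is much more plausible under the language model.
score_a = fused_score(ctc_logprob=-12.0, lm_logprob=-20.0, num_words=4)
score_b = fused_score(ctc_logprob=-11.5, lm_logprob=-26.0, num_words=4)
print(score_a, score_b)  # -18.0 -20.5
```

With the paper's settings the language model overturns the acoustic ranking here. Setting `lm_alpha=0` and `lm_beta=0` in this sketch recovers the purely acoustic ordering.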

Authors and Affiliations

Name	Affiliation	Contact
Yacouba Kaloga	Idiap Research Institute, Martigny, Switzerland	yacouba.kaloga@idiap.ch
Shashi Kumar	Idiap Research Institute, Martigny, Switzerland	shashi.kumar@idiap.ch
Shakeel A. Sheikh	TOCOME	Shakeelzmail608@gmail.com
Driss Khalil	Idiap Research Institute, Martigny, Switzerland	driss.khalil@idiap.ch
Petr Motlicek	Idiap Research Institute, Martigny, Switzerland	petr.motlicek@idiap.ch
Ina Kodrasi	Idiap Research Institute, Martigny, Switzerland	ina.kodrasi@idiap.ch

Citation

If you use this code, please cite (placeholder, paper currently under review):

@misc{kaloga2026larm,
  title         = {Test-Time Compute Scaling for {ASR} with Depth-Conditioned Looped Transformers},
  author        = {Kaloga, Yacouba and Kumar, Shashi and Sheikh, Shakeel and Khalil, Driss and Motlicek, Petr and Kodrasi, Ina},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  note          = {Preprint, under review},
}

License

The code in this project is released under the MIT License. See LICENSES/MIT.txt for details. The LibriSpeech dataset is distributed under the CC BY 4.0 license. Third-party dependencies and datasets are listed in THIRDPARTY.md.
