Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers
End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback to structure the loop into recognition checkpoints separated by latent refinement phases. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines, while using a fraction of the parameter count.
LARM applies a shared Transformer encoder recurrently to a latent acoustic sequence. Each loop reuses the same parameters and shared CTC head, modulated by FiLM depth conditioning and supervision-clock embeddings, with delayed soft-posterior feedback reinjected into the recurrent state. CTC loss is applied only at sparse recognition checkpoints.
Recurrence. The acoustic frontend
Delayed prediction feedback. The posteriors are projected back to the hidden space and
shifted by one frame (with zero padding at
State aggregation. The encoder output, the frontend skip connection, and the delayed
feedback are combined with learnable scalars
Clock and FiLM depth conditioning. A supervision-clock embedding (period
Sparse supervision. CTC loss is applied only at the checkpoint loops
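The delayed feedback and state aggregation steps above can be illustrated with a small plain-Python sketch. It is not the repository's implementation: the function names, the zero padding at the first frame, and the additive form `encoder_out + skip_weight * skip + feedback_weight * feedback` are all assumptions made for illustration, and frames are plain lists of floats rather than tensors.

```python
from typing import List

Frame = List[float]


def delay_one_frame(frames: List[Frame]) -> List[Frame]:
    """Shift a sequence right by one frame (hypothetical helper).

    Assumption: the vacated first position is filled with zeros, so frame t
    only sees the prediction made at frame t - 1.
    """
    if not frames:
        return []
    zero = [0.0] * len(frames[0])
    return [zero] + [list(f) for f in frames[:-1]]


def aggregate(
    encoder_out: List[Frame],
    skip: List[Frame],
    feedback: List[Frame],
    skip_weight: float,
    feedback_weight: float,
) -> List[Frame]:
    """Combine the three streams element-wise with two scalars.

    Assumption: a simple weighted sum; the actual mixing rule may differ.
    """
    return [
        [e + skip_weight * s + feedback_weight * f for e, s, f in zip(e_t, s_t, f_t)]
        for e_t, s_t, f_t in zip(encoder_out, skip, feedback)
    ]


# Toy example: 3 frames, hidden size 2.
projected = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
feedback = delay_one_frame(projected)
state = aggregate(
    encoder_out=[[0.0, 0.0]] * 3,
    skip=[[1.0, 1.0]] * 3,
    feedback=feedback,
    skip_weight=0.5,
    feedback_weight=2.0,
)
print(feedback)
print(state)
```

In the model the two scalars would be learned (see `--learn_alpha_beta`); here they are fixed constants so the arithmetic is easy to follow.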
LibriSpeech WER (%) on test-clean / test-other, with greedy CTC decoding and 4-gram LM
beam search. The reference LARM (
| Train | Model | #Params | Greedy clean | Greedy other | +LM clean | +LM other |
|---|---|---|---|---|---|---|
| 100h | Standard encoder, 16 blocks | 28.9M | 14.43 | 37.23 | 9.97 | 28.68 |
| | Standard encoder, 48 blocks | 85.7M | 12.24 | 34.06 | 9.03 | 27.07 |
| | LARM (…) | 7.7M | 11.34 | 31.84 | 8.66 | 26.28 |
| | LARM (…) | 28.9M | 9.58 | 28.25 | 7.57 | 23.89 |
| 960h | Standard encoder, 16 blocks | 28.9M | 4.79 | 13.26 | 3.51 | 9.87 |
| | Standard encoder, 48 blocks | 85.7M | 3.87 | 10.58 | 3.20 | 8.56 |
| | LARM (…) | 7.7M | 4.59 | 11.75 | 3.51 | 9.38 |
| | LARM (…) | 28.9M | 3.45 | 9.44 | 2.93 | 7.93 |
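As a worked reading of the table, the snippet below compares the 7.7M-parameter LARM row with the 48-block standard encoder on the 100h setup (greedy decoding). The helper name `relative_reduction` is illustrative; the numbers are copied from the table, and nothing here is part of the repository.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


# 100h, greedy CTC: standard 48-block encoder vs. the 7.7M-parameter LARM row.
param_ratio = 7.7 / 85.7
rel_clean = relative_reduction(12.24, 11.34)
rel_other = relative_reduction(34.06, 31.84)

print(f"parameter ratio:              {param_ratio:.1%}")
print(f"relative WER reduction clean: {rel_clean:.1%}")
print(f"relative WER reduction other: {rel_other:.1%}")
```

On these rows LARM uses roughly 9% of the baseline's parameters while reaching a lower greedy WER on both test sets.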
To reproduce the table above without retraining, download the released checkpoints from HuggingFace:
- [TBD]: LARM 100h ($d=384$, $K=12$, the reference model)
- [TBD]: LARM 960h ($d=384$, $K=12$)
- [TBD]: LARM 960h ($d=768$, $K=12$)
Place each downloaded checkpoint-<step>/ directory under
<output_root>/larm/<suffix>/ (where output_root is your LARGE_MODELS_PATH or
--output_dir), set SUFFIX accordingly in
scripts/eval/eval_template.sh, and run it. Using the
released checkpoints removes both environment differences and GPU numerics as sources of variation in the results.
WER trajectories across recurrent loops for two trained LARM models. Red stars mark the
supervised recognition checkpoints (every c loops); intermediate loops refine the latent
acoustic representation. Supervised checkpoints improve from one to the next, while
intermediate loops can be non-monotonic (100h) or follow a smoother trajectory (960h).
Left: LibriSpeech 960h (K=12, c=4), smooth refinement. Right: LibriSpeech 100h (K=16, c=4), non-monotonic intermediate loops.
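The checkpoint positions in the two panels follow directly from `K` and `c`. A minimal sketch, assuming loops are numbered from 1 and a loop is a supervised checkpoint when its index is a multiple of `c` (the function name is hypothetical):

```python
from typing import List


def checkpoint_loops(K: int, c: int) -> List[int]:
    """Return the 1-indexed loops that carry CTC supervision.

    Assumption: supervision falls on every c-th loop, i.e. c, 2c, ... up to K.
    """
    if K < 1 or c < 1:
        raise ValueError("K and c must be positive integers")
    return [k for k in range(1, K + 1) if k % c == 0]


print(checkpoint_loops(12, 4))  # 960h panel: K=12, c=4
print(checkpoint_loops(16, 4))  # 100h panel: K=16, c=4
```

All other loops are the intermediate latent-refinement steps between two checkpoints.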
Python 3.12, PyTorch 2.3. env.yml is the environment used for the paper;
env_quick.yml is provided as a quicker-to-build alternative.
conda env create -f env.yml # or env_quick.yml for a faster install
conda activate larmenv
Training requires a CUDA GPU.
All machine-specific paths live in a single file that is not committed. Copy the template and fill in your own locations:
cp larm/config/path_config_example.py larm/config/path_config.py
Then edit larm/config/path_config.py:
| Variable | Meaning |
|---|---|
| DATASETS_ROOT_PATH | Folder for a local LibriSpeech copy (librispeech_asr/ inside). If it does not exist, LibriSpeech is downloaded automatically into HUGGINGFACE_CACHE. |
| HUGGINGFACE_CACHE | HuggingFace datasets cache directory. |
| LARGE_MODELS_PATH | Where experiment outputs/checkpoints are written by default. |
These can be overridden per run with --output_dir and --hf_cache_dir.
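A filled-in `path_config.py` might look like the sketch below. The three variable names come from the table above; the directory values are placeholders, and whether the template expects plain strings or `Path` objects is an assumption, so check `path_config_example.py` before copying this.

```python
# larm/config/path_config.py (illustrative values only, not the real template)
from pathlib import Path

# Hypothetical locations; replace with paths that exist on your machine.
DATASETS_ROOT_PATH = "/data/datasets"
HUGGINGFACE_CACHE = "/data/hf_cache"
LARGE_MODELS_PATH = "/data/experiments"

# A local LibriSpeech copy, if present, is expected in this subfolder.
local_librispeech = Path(DATASETS_ROOT_PATH) / "librispeech_asr"
print(local_librispeech)
```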
Experiments are driven by run.py and organized as shell scripts under scripts/,
grouped by category:
scripts/
├── main_exp/ # reference LARM (100h, 960h)
├── baseline/ # standard non-looped encoders
├── scaling/ # width / data / epoch scaling
├── ablative_depth/ # depth-conditioning ablations
├── ablative_feedback/ # feedback & aggregation ablations
├── ablative_supervision/# checkpoint-interval ablations
├── ablative_loop_budget/# loop-budget (K) ablations
├── ablative_nblocks/ # encoder-depth ablations
└── eval/ # eval_template.sh (ready to fill)
Each script begins with two placeholders you must set for your machine:
cd path/to/larm # repo root
path/to/python run.py ... # python from the `larmenv` env
Then launch a run, e.g. the reference 100h model:
bash scripts/main_exp/run_libri_100h_d384_full.sh
Key flags (see python run.py --help for the full list):
| Flag | Role |
|---|---|
| --K | number of recurrent loops |
| --clock_period | checkpoint interval c (sparse-supervision period) |
| --depth_mode film | FiLM depth conditioning |
| --rep_feedback_mode prev_frame | delayed (one-frame) prediction feedback |
| --learn_alpha_beta | learn the feedback / skip mixing scalars |
| --d_model, --n_heads, --n_blocks | encoder width / heads / shared blocks |
| --num_train_epochs, --batch_size, --gradient_accumulation_steps | optimization budget |
Input data. LibriSpeech is loaded with the HuggingFace datasets loader. If a local copy
exists at DATASETS_ROOT_PATH/librispeech_asr it is used; otherwise the dataset is downloaded
from the Hub into HUGGINGFACE_CACHE (or --hf_cache_dir) on first run, so a from-scratch
run needs internet access and ~60 GB of disk for the full 960h set.
Acoustic frontend features. LARM uses Whisper's log-Mel feature extractor (80 mels), only
the extractor, never the Whisper model weights. Its config is downloaded once from
openai/whisper-small (a few KB); no model checkpoint is required.
Preprocessing & caching. run.py maps each split to 80-dim log-Mel features and then
filters samples. Both the map and filter steps are content-addressed: results are cached
by a fingerprint, so re-running an experiment reuses the prepared data instead of recomputing
it. Filtering is controlled by:
| Flag | Default | Meaning |
|---|---|---|
| --min_input_length | 400 | drop utterances with fewer input samples |
| --max_input_length | 480000 | drop utterances longer than ~30 s @ 16 kHz |
| --min_label_length | 1 | drop utterances with too few label tokens |
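The three filters amount to a simple predicate per utterance. The sketch below uses the defaults from the table; the function name and the choice of inclusive bounds are assumptions, since the exact comparison used in `run.py` is not shown here.

```python
def keep_utterance(
    n_samples: int,
    n_labels: int,
    min_input_length: int = 400,
    max_input_length: int = 480000,
    min_label_length: int = 1,
) -> bool:
    """Return True if an utterance passes the length filters.

    Assumption: lengths are counted in raw audio samples and label tokens,
    and both input bounds are inclusive.
    """
    return (
        min_input_length <= n_samples <= max_input_length
        and n_labels >= min_label_length
    )


SAMPLE_RATE = 16000
print(480000 / SAMPLE_RATE)          # upper bound in seconds
print(keep_utterance(16000, 10))     # 1 s utterance with 10 label tokens
print(keep_utterance(480001, 10))    # just over the upper bound
```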
The character-level CTC vocabulary is written to vocab/vocab_libri.json (or
vocab/vocab_libri_lowercase.json with --force_lowercase).
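With a character-level CTC vocabulary, greedy decoding reduces to taking the most likely id per frame, collapsing repeats, and dropping blanks. The sketch below shows that last step on a toy vocabulary; the token-to-id layout, the `<blank>` token name, and `blank_id=0` are assumptions and may not match the generated vocabulary file.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_decode(frame_ids: List[int], id_to_char: Dict[int, str], blank_id: int = 0) -> str:
    """Collapse repeated frame labels, then remove CTC blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical); the real one is read from the vocab JSON file.
vocab = {"<blank>": 0, "a": 1, "c": 2, "t": 3}
id_to_char = {idx: ch for ch, idx in vocab.items()}

text = ctc_greedy_decode([2, 2, 0, 1, 1, 0, 3, 3, 3], id_to_char)
print(text)
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.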
Outputs. Checkpoints are written to:
<output_root>/<encoder_name>/<suffix>/checkpoint-<step>/
where output_root = --output_dir (or LARGE_MODELS_PATH/large_models_results by default),
encoder_name defaults to larm, and suffix is the --suffix of the run. Saving/eval
cadence is set by --save_strategy {steps,epoch}, --save_steps, and --eval_steps.
Each checkpoint-<step>/ directory contains the weights (model.pt, optim.pt,
sched.pt, scaler.pt) plus three JSON files:
- config.json: the full model + training configuration used to rebuild the model: architecture (d_model, n_heads, n_blocks, K, clock_period, depth_period, depth_mode, rep_feedback_mode, alpha/beta, conditioning flags, vocab_size, pad_id/blank_id) and training settings (optimizer, schedule, SpecAugment, dataset splits, batch size, epochs).
- trainer_state.json: HF-style training state:
  - global_step, epoch, learning_rate
  - best_metric and best_model_checkpoint (lowest eval_wer so far)
  - log_history: a list of training-step entries {step, epoch, loss, learning_rate} and evaluation events {step, epoch, eval_loss, eval_wer}.
- meta.json: minimal {step, epoch} marker used for resuming.
For evaluation, run.py resolves the checkpoint from trainer_state.json's
best_model_checkpoint, falling back to the latest checkpoint-* directory.
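The resolution rule can be sketched as follows. This is an illustration rather than the code in `run.py`: the function names are hypothetical, and it assumes `trainer_state.json` is read from the latest checkpoint and that `best_model_checkpoint` stores a directory path.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Optional


def _step(path: Path) -> int:
    """Extract <step> from a checkpoint-<step> directory name, or -1."""
    match = re.fullmatch(r"checkpoint-(\d+)", path.name)
    return int(match.group(1)) if match else -1


def resolve_checkpoint(run_dir: Path) -> Optional[Path]:
    """Prefer best_model_checkpoint; otherwise fall back to the latest step."""
    ckpts = sorted(
        (p for p in run_dir.glob("checkpoint-*") if p.is_dir() and _step(p) >= 0),
        key=_step,
    )
    if not ckpts:
        return None
    latest = ckpts[-1]
    state_file = latest / "trainer_state.json"
    if state_file.is_file():
        best = json.loads(state_file.read_text()).get("best_model_checkpoint")
        if best and Path(best).is_dir():
            return Path(best)
    return latest


# Demo on a throwaway directory standing in for <output_root>/larm/<suffix>/.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for step in (100, 200):
        (run_dir / f"checkpoint-{step}").mkdir()
    fallback_name = resolve_checkpoint(run_dir).name  # no trainer_state.json yet
    state = {"best_model_checkpoint": str(run_dir / "checkpoint-100")}
    (run_dir / "checkpoint-200" / "trainer_state.json").write_text(json.dumps(state))
    best_name = resolve_checkpoint(run_dir).name

print(fallback_name, best_name)
```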
Evaluation reuses the same architecture flags as training (so the weights load) plus
--only_evaluate. A ready-to-fill template is provided at
scripts/eval/eval_template.sh: set SUFFIX to the trained
model's id, match the architecture flags, then choose the decoding options.
Greedy CTC (WER at every loop exit):
path/to/python run.py ... --only_evaluate --eval_all_steps
4-gram KenLM beam search. The LibriSpeech 4-gram LM (4-gram.arpa) can be downloaded from OpenSLR:
path/to/python run.py ... \
--only_evaluate \
--eval_all_steps \
--lm_path path/to/4-gram.arpa \
--eval_all_steps_lm \
--lm_alpha 0.5 \
--lm_beta 1.0 \
--beam_size 100
| Flag | Meaning |
|---|---|
| --lm_path | KenLM .arpa / .binary file (enables LM decoding) |
| --eval_all_steps_lm | run LM beam search at every loop exit |
| --lm_alpha | LM (shallow-fusion) weight (paper: 0.5) |
| --lm_beta | word-insertion bonus (paper: 1.0) |
| --beam_size | beam width (paper: 100) |
Omit the --lm_* block to report greedy WER only.
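To make the roles of `--lm_alpha` and `--lm_beta` concrete, the sketch below scores hypotheses with the common shallow-fusion form `ctc_logprob + lm_alpha * lm_logprob + lm_beta * n_words`. Whether the decoder used by `run.py` applies exactly this formula is an assumption, and the hypotheses and log-probabilities are made-up toy numbers.

```python
def fused_score(
    ctc_logprob: float,
    lm_logprob: float,
    n_words: int,
    lm_alpha: float = 0.5,
    lm_beta: float = 1.0,
) -> float:
    """Shallow-fusion score: acoustic term + weighted LM term + word bonus."""
    return ctc_logprob + lm_alpha * lm_logprob + lm_beta * n_words


# (text, CTC log-probability, LM log-probability); toy values only.
hypotheses = [
    ("the cat sat", -4.0, -6.0),
    ("the cats at", -3.8, -9.0),
]

best = max(hypotheses, key=lambda h: fused_score(h[1], h[2], len(h[0].split())))
print(best[0])
```

The second hypothesis has the better acoustic score, but the LM term outweighs that gap at `lm_alpha = 0.5`, so the first one is selected.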
| Name | Affiliation | Contact |
|---|---|---|
| Yacouba Kaloga | Idiap Research Institute, Martigny, Switzerland | yacouba.kaloga@idiap.ch |
| Shashi Kumar | Idiap Research Institute, Martigny, Switzerland | shashi.kumar@idiap.ch |
| Shakeel A. Sheikh | TOCOME | Shakeelzmail608@gmail.com |
| Driss Khalil | Idiap Research Institute, Martigny, Switzerland | driss.khalil@idiap.ch |
| Petr Motlicek | Idiap Research Institute, Martigny, Switzerland | petr.motlicek@idiap.ch |
| Ina Kodrasi | Idiap Research Institute, Martigny, Switzerland | ina.kodrasi@idiap.ch |
If you use this code, please cite (placeholder, paper currently under review):
@misc{kaloga2026larm,
title = {Test-Time Compute Scaling for {ASR} with Depth-Conditioned Looped Transformers},
author = {Kaloga, Yacouba and Kumar, Shashi and Sheikh, Shakeel and Khalil, Driss and Motlicek, Petr and Kodrasi, Ina},
year = {2026},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
note = {Preprint, under review},
}
The code in this project is released under the MIT License. See LICENSES/MIT.txt for details. The LibriSpeech dataset is distributed under the CC BY 4.0 license. Third-party dependencies and datasets are listed in THIRDPARTY.md.


