FIGMA: Towards Fine-Grained Music Retrieval

ACL 2026

Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha*, Ramani Duraiswami*

University of Maryland, College Park  ·  *Equal advising  ·  nishit@umd.edu

Current music retrieval systems fail on detailed queries. A composer searching for "a track in F major at 110 BPM with a 4/4 beat and Am–F–C–G chords" gets irrelevant results—even from models trained on richly annotated captions. FIGMA fixes this by aligning audio frames to caption tokens, not just global embeddings.

Key results

73.3% relative improvement on out-of-domain FMACaps-Eval
21.4% relative gain on MusicBench R@1
~22M trainable parameters (frozen backbone encoders)

The Problem

Why existing models fail on detailed queries

Despite being trained on long, richly detailed captions, CLAP-based models effectively use only the first 40–50 tokens. Everything after—the musical substance—is ignored.

The cause is architectural. Standard models compress audio into a single mean-pooled vector and text into a single [CLS] token before computing similarity. This collapses both modalities, discarding temporal audio structure and token-level text distinctions. Long captions behave like bags of words.

We confirmed this empirically: truncating MusicBench captions to their first 50 tokens yields retrieval scores nearly identical to those obtained with the full captions. Moreover, fine-tuning LAION-CLAP on the FGMCaps training data, which carries detailed music-theoretic annotations, yields only marginal gains, indicating that the bottleneck lies in the training objective rather than in the data.

Figure: Coarse vs. fine-grained retrieval. CLAP-style models retrieve correctly when queries describe general mood or genre, but fail when queries specify tempo, key, or chord structure. FIGMA handles both.

Figure: Token saturation. Recall@1/5/10 on MusicBench as caption length grows (MuQ-MuLaN). All metrics plateau around 50 tokens; existing models ignore the additional musical detail.

Our Approach

Multi-view contrastive alignment

Architecture. We freeze both encoders—MuQ (audio, pretrained self-supervised on music) and Microsoft Multilingual E5 Large (text)—and train only lightweight projection heads (~22M parameters). Each projector is two Transformer encoder layers + a linear map into a shared 512-d space, applied separately to both global pooled and full-sequence features.

Frame-level loss. For each audio frame, we find the most similar caption token, average these per-frame maxima over all frames to obtain a clip-caption score, and apply InfoNCE over the batch. Matched audio-caption pairs are positives; all other pairings in the batch are negatives. This encourages tempo to be grounded in rhythm tokens, key in harmony tokens, and so on.
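The frame-level score and loss described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the padding mask, the cosine normalisation, and the symmetric (audio-to-text plus text-to-audio) form of InfoNCE are all assumptions. Only the max-over-tokens, mean-over-frames scoring and the temperature τ = 0.07 come from the text.

```python
import numpy as np

def frame_level_scores(audio, text, text_mask):
    """Score every clip against every caption in a batch.

    audio:     (B, F, D) projected frame embeddings
    text:      (B, T, D) projected token embeddings
    text_mask: (B, T) bool, True for real tokens, False for padding (assumed)
    Returns a (B, B) matrix S with S[i, j] = score of clip i vs. caption j.
    """
    # L2-normalise so that dot products are cosine similarities (assumption).
    a = audio / np.maximum(np.linalg.norm(audio, axis=-1, keepdims=True), 1e-12)
    t = text / np.maximum(np.linalg.norm(text, axis=-1, keepdims=True), 1e-12)
    # sim[i, j, f, k]: frame f of clip i against token k of caption j.
    sim = np.einsum("ifd,jkd->ijfk", a, t)
    # Padding tokens can never be the best match.
    sim = np.where(text_mask[None, :, None, :], sim, -np.inf)
    best_token = sim.max(axis=-1)      # (B, B, F): best token per frame
    return best_token.mean(axis=-1)    # (B, B): average over frames

def _cross_entropy_diag(logits):
    # Row-wise log-softmax; the positive for row i sits on the diagonal.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def info_nce(scores, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) score matrix (symmetry is an assumption)."""
    logits = scores / temperature
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits.T))
```

The `(B, B, F, T)` similarity tensor is materialised in full here for clarity; at batch size 256 a real implementation would likely chunk this computation to limit memory.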

Multi-view loss. Global and frame-level objectives are weighted by α = 0.6:

\[\mathcal{L}_{\mathrm{Multi\text{-}View}} = \alpha\,\mathcal{L}_{\mathrm{global}} + (1-\alpha)\,\mathcal{L}_{\mathrm{frame}}\]
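As a rough sketch of how the two terms combine, the snippet below computes a global loss from pooled embeddings and mixes it with a frame-level loss value using α = 0.6. The helper names, the cosine normalisation, and the symmetric InfoNCE form are assumptions for illustration; the frame-level loss is passed in as a plain number so the example stays self-contained.

```python
import numpy as np

def info_nce(scores, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) score matrix; positives on the diagonal."""
    def cross_entropy_diag(logits):
        shifted = logits - logits.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))
    logits = scores / temperature
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def global_scores(audio_vec, text_vec):
    """Cosine similarity between pooled clip and caption embeddings, shape (B, B)."""
    a = audio_vec / np.maximum(np.linalg.norm(audio_vec, axis=-1, keepdims=True), 1e-12)
    t = text_vec / np.maximum(np.linalg.norm(text_vec, axis=-1, keepdims=True), 1e-12)
    return a @ t.T

def multi_view_loss(global_loss, frame_loss, alpha=0.6):
    """alpha * L_global + (1 - alpha) * L_frame, as in the equation above."""
    return alpha * global_loss + (1.0 - alpha) * frame_loss

# Example with random 512-d pooled embeddings and a placeholder frame loss.
rng = np.random.default_rng(0)
pooled_audio = rng.normal(size=(8, 512))
pooled_text = rng.normal(size=(8, 512))
l_global = info_nce(global_scores(pooled_audio, pooled_text))
l_total = multi_view_loss(l_global, frame_loss=2.0)  # 2.0 is a made-up value
```

With α = 0.6 the global term carries weight 0.6 and the frame-level term weight 0.4, so the global view still dominates slightly.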

Training. 15 epochs, batch size 256, Adam with a learning rate of 1×10⁻⁴, temperature τ = 0.07, early stopping on validation Recall@1.

Figure: FIGMA architecture. Frozen MuQ and E5 encoders produce frame-level and token-level features. Projection heads map both to a shared space for joint global + frame-level contrastive training. Only the ~22M projector parameters are updated.

Dataset

FGMCaps — 380K annotated music–caption pairs

The first large-scale dataset combining music-theoretic annotations (chord, tempo, key, beat) with natural-language captions.

Audio is drawn from four public sources (MTG-Jamendo, Music4All, JamendoMaxCaps, MusicBench). Musical attributes are extracted automatically: BeatNet for tempo and beat count, Omnizart for chord progressions, Essentia's KeyExtractor for key and mode. Captions are generated by Qwen3-Next-80B-A3B-Instruct from structured attribute prompts with randomized attribute order to avoid position bias. Stratified splitting ensures no data leakage between train / validation / test.
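The randomized attribute order mentioned above can be illustrated with a short sketch. The prompt wording, attribute names, and function name below are hypothetical; the actual prompt sent to the caption model is not given here. The point is only that shuffling the attribute list before formatting prevents any attribute from always appearing first.

```python
import random

def build_attribute_prompt(attributes, rng):
    """Format extracted attributes into a prompt, in a random order.

    attributes: dict of attribute name -> extracted value (names are hypothetical)
    rng:        a random.Random instance, so the shuffle is reproducible
    """
    items = list(attributes.items())
    rng.shuffle(items)  # a different attribute leads the list on each call
    lines = [f"- {name}: {value}" for name, value in items]
    header = "Write a natural-language caption for a music clip with these attributes:"
    return header + "\n" + "\n".join(lines)

# Example values taken from the sample query at the top of the page.
example = {
    "key": "F major",
    "tempo": "110 BPM",
    "beat": "4/4",
    "chords": "Am-F-C-G",
}
prompt = build_attribute_prompt(example, random.Random(0))
```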

380K+ training pairs  ·  10K test set  ·  6 attributes per clip

Figure: Construction pipeline. Audio from four sources → parallel attribute extraction (BeatNet, Omnizart, Essentia) → caption generation with Qwen3 → quality control. Failed extractions (<0.5%) and low-confidence key predictions are discarded.

Dataset           # Train    # Test
JamendoMaxCaps    189,515         0
Music4All         108,042         0
MusicCaps           5,521         0
MusicBench         52,768       400
FGMCaps (ours)    380,878    10,000

The original table also marks each dataset's coverage of chord, tempo, beat, key, and captions; those marks are not recoverable here.

FGMCaps is the only dataset at this scale combining all four music-theoretic attributes with natural-language captions.


Experiments & Results

State-of-the-art across all benchmarks

FIGMA is evaluated against 10+ baselines including all LAION-CLAP variants, MS-CLAP, MuQ-MuLaN, M2D-CLAP, and CLAMP 3. On MusicBench, FIGMA reaches 34.52% T2A R@1—a 21.4% relative gain over the previous best (CLAMP 3). The gains are larger on the harder, out-of-domain FMACaps-Eval benchmark (1,000 pairs from Free Music Archive), where FIGMA achieves a 73.3% relative improvement. On the in-distribution FGMCaps test set, FIGMA reaches 26.15% T2A R@1 while the best baseline manages only 2.22%.
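The relative gains quoted above follow directly from the T2A R@1 values in the results tables, taking CLAMP 3 as the previous best on both benchmarks. A small check (the helper name is ours):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# T2A R@1: FIGMA vs. CLAMP 3, values from the results tables.
musicbench_gain = relative_gain(34.52, 28.43)
fmacaps_gain = relative_gain(13.00, 7.50)

print(f"MusicBench:   {musicbench_gain:.1f}%")  # 21.4%
print(f"FMACaps-Eval: {fmacaps_gain:.1f}%")     # 73.3%
```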

MusicBench

Model               Text-to-Audio (%)             Audio-to-Text (%)
                      R@1    R@5   R@10   R@20      R@1    R@5   R@10   R@20
LAION-CLAP (Music)  25.38  55.84  68.53  79.70    25.38  61.93  76.14  89.34
MuQ-MuLaN           20.81  47.71  62.94  74.62    17.76  43.65  57.86  78.68
M2D-CLAP            25.38  55.33  70.05  78.17    36.55  63.96  75.63  84.77
CLAMP 3             28.43  57.87  74.62  89.85     5.08  24.37  34.01  52.28
FIGMA               34.52  65.99  81.73  91.37    39.09  68.02  80.71  88.83

FMACaps-Eval (out-of-domain)

Model               Text-to-Audio (%)             Audio-to-Text (%)
                      R@1    R@5   R@10   R@20      R@1    R@5   R@10   R@20
LAION-CLAP (Music)   2.60   9.40  14.80  21.60     3.00  11.60  18.50  28.10
MuQ-MuLaN            4.10  12.40  17.80  27.50     3.90  11.70  19.10  28.10
M2D-CLAP             1.90   6.70  11.40  17.80     3.30  11.90  19.70  30.80
CLAMP 3              7.50  20.70  30.80  43.10     1.10   4.10   6.40  11.70
FIGMA               13.00  28.00  37.60  48.60    13.20  33.30  42.90  53.40

The full 11-model comparison, including all LAION-CLAP and MS-CLAP variants, is in the paper.
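For readers unfamiliar with the metric, Recall@K values like those in the tables can be computed from a query-by-candidate similarity matrix as sketched below. This assumes exactly one correct candidate per query, placed on the diagonal; the function name and that layout are our assumptions, not details taken from the paper.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10, 20)):
    """Recall@K in percent for a square similarity matrix.

    sim[i, j] is the similarity of query i to candidate j; candidate i is
    assumed to be the single correct match for query i.
    """
    true_scores = np.diag(sim)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return {k: 100.0 * float(np.mean(ranks < k)) for k in ks}

# Text-to-audio uses captions as queries (rows); audio-to-text is the transpose.
sim_t2a = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.9, 0.6],
])
t2a = recall_at_k(sim_t2a, ks=(1, 2))
a2t = recall_at_k(sim_t2a.T, ks=(1, 2))
```

In this toy matrix only the first query ranks its match first, while all three rank it within the top two.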

Robustness to attribute perturbations

To verify FIGMA genuinely encodes fine-grained attributes—not just surface-level patterns—we construct hard-negative queries from 3K test clips by altering one attribute at a time: key, BPM, tempo marking, beat count, or chord sequence. FIGMA maintains 34–43% A2T R@1 under targeted perturbations, confirming representations are grounded in specific musical properties rather than holistic semantics.
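One way to build such single-attribute hard negatives is sketched below. The attribute names, the candidate values, and the function are hypothetical and only illustrate the "alter one attribute at a time" idea; how the perturbed attributes are turned back into captions is not covered here.

```python
import random

def perturb_one_attribute(attributes, alternatives, rng):
    """Return one hard negative per attribute.

    Each negative is a copy of `attributes` in which exactly one attribute
    has been replaced by a different value drawn from `alternatives`.
    """
    negatives = {}
    for name, value in attributes.items():
        options = [v for v in alternatives[name] if v != value]
        changed = dict(attributes)
        changed[name] = rng.choice(options)
        negatives[name] = changed
    return negatives

# Hypothetical clip attributes and candidate replacement values.
clip = {
    "key": "F major",
    "bpm": 110,
    "tempo_marking": "Moderato",
    "beat_count": 4,
    "chords": "Am-F-C-G",
}
pool = {
    "key": ["F major", "D minor", "G major"],
    "bpm": [70, 110, 150],
    "tempo_marking": ["Adagio", "Moderato", "Presto"],
    "beat_count": [3, 4],
    "chords": ["Am-F-C-G", "C-G-Am-F", "Dm-G-C-Am"],
}
hard_negatives = perturb_one_attribute(clip, pool, random.Random(0))
```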

Figure: Ablation on negative set size, showing its effect on FMACaps-Eval T2A recall. Larger negative pools consistently improve fine-grained discrimination.

Takeaway

A new paradigm for music retrieval

Fine-grained music retrieval requires aligning specific audio moments to specific text tokens—not just global embeddings. FIGMA demonstrates that a frame-level, token-wise contrastive objective atop frozen encoders is sufficient to unlock the musical information already present in rich captions. Paired with FGMCaps—the first dataset systematically combining chord, tempo, beat, and key annotations at scale—FIGMA sets a new state of the art across in-domain and out-of-domain benchmarks while remaining compute-efficient (~22M trainable parameters). We hope both the model and dataset serve as a foundation for future work on fine-grained audio understanding and precision music search.