The 2nd Cadenza Lyric Intelligibility Prediction Challenge

Info: CLIP2 will be submitted as a proposal for an ICASSP 2027 Grand Challenge; however, the challenge will run regardless of whether it is accepted as an ICASSP Grand Challenge.

Building on the success of last year's lyric intelligibility challenge, the ICASSP 2026 Challenge (CLIP1), the new CLIP2 challenge features AI-generated music spanning a broader range of genres.

To advance music processing through machine learning, we need reliable ways to evaluate audio quality. This is true whether the music comes from humans or AI. For songs, we need a metric to assess the intelligibility of the lyrics. There are several factors that can affect intelligibility, including:

  • Vocal style and articulation
  • Song genre
  • Mixing and production techniques
  • Listener hearing ability

In 2025, CLIP1 introduced the first dataset for assessing lyric intelligibility (the CLIP1 dataset), derived from human performances in the FMA dataset. We are now excited to pre-announce the second Cadenza lyric intelligibility prediction challenge, CLIP2, with a new dataset that includes more music genres, new listening conditions and AI-generated music.

While most entrants beat the baseline in CLIP1, the residual RMSE was still high, so there is plenty of room to improve on the CLIP1 results.

CLIP2 Dataset​

The CLIP2 dataset features thousands of music excerpts extracted from AI-generated songs created in collaboration with ElevenLabs. Generating music in this way allows us to include pieces from genres that would normally be difficult to distribute due to copyright restrictions.

Music genres include pop/rock, musical theatre and choral music. The excerpts will be processed with reverberation and background noise to represent different listening scenarios and spatial locations. Additionally, excerpts will be provided both in their original form and processed through a hearing-loss simulator to mimic listeners with hearing loss who are not wearing hearing aids.

ElevenLabs models are trained on properly licensed music through agreements with organizations such as Merlin and Kobalt, ensuring fair use of artists' work.

Challenge Overview​

Participants will build models to predict lyric intelligibility from audio recordings. The intelligibility metric will be a predictive model that takes audio as input and estimates the score a listener would likely achieve in a listening test.
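As in CLIP1, submissions of this kind are typically scored by comparing predicted intelligibility against the scores listeners actually achieved, using RMSE. A minimal sketch of that comparison is below; the score values and the 0-1 scale (fraction of words correctly identified) are illustrative assumptions, not the official CLIP2 specification.

```python
import math

def rmse(predicted, measured):
    """Root-mean-square error between predicted and listener-measured scores."""
    assert len(predicted) == len(measured), "one prediction per excerpt"
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical per-excerpt scores on a 0-1 intelligibility scale:
predicted = [0.80, 0.55, 0.30]
measured = [0.90, 0.50, 0.25]
print(rmse(predicted, measured))  # lower is better
```

A lower RMSE means the model's estimates track listener performance more closely across excerpts.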

What will be provided?​

  • A dataset of song excerpts.
  • Lyric intelligibility scores from listening tests for every sample.
  • Software, including a baseline system.