
Task 1: Lyric Intelligibility

Image by Marcísio Coelho Mac Hostile from Pixabay

Studies show that not being able to understand lyrics is an important problem to tackle for those with hearing loss. Consequently, this task is about improving the intelligibility of lyrics when listening to pop/rock over headphones. But this needs to be done without losing too much audio quality - you can't improve intelligibility just by turning off the rest of the band! We will be using one metric for intelligibility and another metric for audio quality, and giving you different targets to explore the balance between these metrics.

This task could be tackled in many different ways using machine learning. A few examples:

  • Speech technology has produced many approaches to improving speech intelligibility. Can these methods be adapted to improve listening to vocals?
  • Demixing technologies can separate music into different components, including a vocal track. This then allows the vocals to be processed and remixed to improve intelligibility.
  • End-to-end approaches allow the transformation of audio from one style to another. How can this be adapted for this task?

But we'd welcome other approaches as well.

What is lyric intelligibility?

Lyric intelligibility can be defined as "the extent to which a listener understands a singer's message" [1]. According to Fine et al., there are four groups of factors that can lead to lyric misunderstanding:

  1. Performer: includes articulation, voice quality and diction.
  2. Music-to-singer balance: includes music genre, song speed and composition style.
  3. Listener [2]: includes listener attention and hearing ability.
  4. Environmental: includes room acoustics, proximity to performer and use or abuse of amplification.

As listeners are using headphones in our scenario, environmental factors are not included.

Of the first three factors, (1) and (2) are addressed in the task by including samples with different singing styles and background accompaniment. For example, the challenge datasets contain music tracks where the background is not prominent and the singing style is easier to understand. This is illustrated by the following example, extracted from the training set of the MUSDB18-HQ dataset:

Did you pick up the lyrics?

Track Name: Actions - South Of The Water
  my skin's falling off i'm breaking at the seeps
  he's holding me under and i can't breath

Transcriptions made by Schulze-Forster et al. [6]

The datasets also include tracks where the singing is harder to understand, because the background level is higher than the singing level, because the singing style itself is hard to follow, or both. The next example, also drawn from the MUSDB18-HQ dataset, illustrates how the background accompaniment can mask the singing line and reduce intelligibility.

Did you pick up the lyrics?

Track Name: Dark Ride - Burning Bridges
  burning bridges fire in my soul burning bridges forget about control
  burn those witches i am the only one
  burn the bridges i relied upon

Transcriptions made by Schulze-Forster et al. [6]

Listener issues (factor 3) will be covered by providing listeners' hearing characteristics as audiograms.

Challenge entrants will be provided with appropriate music datasets and sets of audiograms for training, development and evaluation.

Task overview

Entrants will process part of a pop/rock track to increase intelligibility with the least loss of audio quality. Two metrics will gauge the systems, one for quality Q and the other for intelligibility I. The balance between intelligibility and audio quality will be set by a randomly selected α (alpha) value between 0 (prioritise intelligibility) and 1 (prioritise audio quality). Thus the overall metric is αQ + (1 - α)I.
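To make the trade-off concrete, here is a minimal Python sketch of the overall metric. The function name `overall_score` and the example scores are illustrative assumptions, not part of the challenge software, and the sketch assumes Q and I are reported on comparable scales (here both in the range 0 to 1).

```python
def overall_score(quality: float, intelligibility: float, alpha: float) -> float:
    """Combine quality Q and intelligibility I as alpha*Q + (1 - alpha)*I."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * quality + (1.0 - alpha) * intelligibility


# Hypothetical scores for one processed extract (assumed to lie in [0, 1]).
q, i = 0.8, 0.4
for alpha in (0.0, 0.5, 1.0):
    # alpha = 0 returns I alone, alpha = 1 returns Q alone.
    print(alpha, overall_score(q, i, alpha))
```

Because α is selected at random, a system tuned for a single fixed balance is unlikely to score well across all targets.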


We will accept causal and non-causal systems. Non-causal systems could be used for recorded music, whereas causal systems would also work for live listening. A baseline will be provided for each case. The allowed latency for causal systems will be 5 milliseconds, that is, systems cannot look beyond 5 ms into the future.
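As a rough illustration of what the 5 ms limit means in practice, the sketch below converts the latency budget into a number of future samples. The helper name `max_lookahead_samples` is hypothetical, and the sample rates are assumptions chosen for the example; the challenge data may use a different rate.

```python
import math


def max_lookahead_samples(sample_rate_hz: int, latency_ms: float = 5.0) -> int:
    """Largest whole number of future samples that fits in the latency budget."""
    return math.floor(sample_rate_hz * latency_ms / 1000.0)


# Assumed sample rates, for illustration only.
print(max_lookahead_samples(44100))  # 220
print(max_lookahead_samples(48000))  # 240
```

Under these assumptions, a causal system producing the output at sample n could use input samples up to roughly n + 220 at 44.1 kHz, but nothing later.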


Objective metrics

The plan is to score audio quality using HAAQI [3], HAAQI-Net [4], or an audio quality metric we are developing based on the CAD1 results.

Intelligibility might be scored using Word Error Rate (WER) or other metrics such as Singing Adapted STOI [5].
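For readers unfamiliar with WER, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. It is not the challenge's scoring code. The text normalisation (simple whitespace splitting) and the hypothesis transcript are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "he's holding me under and i can't breath"
hypothesis = "he's holding me under and i can breathe"  # hypothetical transcript
print(word_error_rate(reference, hypothesis))  # 0.25
```

Here two of the eight reference words are substituted, giving a WER of 0.25. Lower values indicate better intelligibility, and WER can exceed 1 when the hypothesis contains many insertions.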

For intrusive metrics, the reference will be the original signal with amplification applied to the vocal signal to achieve the target intelligibility.
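One way to picture such a reference is as a remix of separate stems with a gain applied to the vocals. The sketch below assumes mono vocal and accompaniment stems are available as equal-length sequences of samples, and uses plain Python lists to stay dependency-free. The names `build_reference` and `vocal_gain_db` are hypothetical, and how the gain would be chosen to reach a target intelligibility is not specified here.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def build_reference(vocals, accompaniment, vocal_gain_db):
    """Remix two mono stems with the vocals amplified by vocal_gain_db."""
    if len(vocals) != len(accompaniment):
        raise ValueError("stems must have the same number of samples")
    gain = db_to_linear(vocal_gain_db)
    # No clipping protection is applied in this sketch.
    return [gain * v + a for v, a in zip(vocals, accompaniment)]


# Toy stems and an assumed 6 dB vocal boost (roughly doubling the amplitude).
vocals = [0.10, -0.20, 0.05]
accompaniment = [0.30, 0.10, -0.40]
print(build_reference(vocals, accompaniment, vocal_gain_db=6.0))
```

A gain of 0 dB reproduces the original mix, so the reference differs from the original only in the level of the vocals.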

Note, we are currently working on the metrics, and a definitive list will be published when we launch the challenge. As most of these metrics have never been tested under challenge conditions, systems will probably be scored but not ranked by these. Entrants are free to use any metric they may find useful during training as well.

Listening tests

The systems included in the listening test will be selected using criteria such as the originality of the system or selection by a pilot listening test. Listeners will be asked to transcribe some short extracts and to rate longer ones for quality and intelligibility, probably on a scale.


[1] Fine, P. A., & Ginsborg, J. (2014). Making myself understood: perceived factors affecting the intelligibility of sung text. Frontiers in Psychology, 5, 809.
[2] Greasley, A., Crook, H., & Fulford, R. (2020). Music listening and hearing aids: perspectives from audiologists and their patients. International Journal of Audiology, 59(9), 694-706.
[3] Kates, J. M., & Arehart, K. H. (2015). The hearing-aid audio quality index (HAAQI). IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(2), 354-365.
[4] Wisnu, D. A., Pratiwi, E., Rini, S., Zezario, R. E., Wang, H. M., & Tsao, Y. (2024). HAAQI-Net: A non-intrusive neural music quality assessment model for hearing aids. arXiv preprint arXiv:2401.01145.
[5] Sharma, B., & Wang, Y. (2019). Automatic evaluation of song intelligibility using singing adapted STOI and vocal-specific features. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 319-331.
[6] Schulze-Forster, K., Doire, C. S., Richard, G., & Badeau, R. (2021). Phoneme level lyrics alignment and text-informed singing voice separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2382-2395.