
Task 1: Lyric Intelligibility

A. Introduction

Studies show that not being able to hear the lyrics in music is an important problem to tackle for those with hearing loss [1]. Consequently, this task is about improving the intelligibility of lyrics when listening to pop/rock over headphones. But this needs to be done without losing too much audio quality - you can't improve intelligibility just by turning off the rest of the band! For this reason, we will be evaluating both intelligibility and audio quality, and giving you different targets to explore the balance between these attributes.

This task could be tackled in many different ways using machine learning. A few examples:

  • Within speech technology, many approaches have been developed for improving speech intelligibility. Can these methods be adapted to improve listening to vocals?
  • Within demixing, technologies allow the separation of music into different components, including a vocal track. This then allows the vocals to be processed and remixed to improve intelligibility.
  • End-to-end approaches allow the transformation of audio from one style to another. How can these be adapted for this task?

But we'd welcome other approaches as well.
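As a concrete illustration of the demixing route above, the sketch below remixes a (separated) vocal stem with a gain boost before summing it back with the accompaniment. The function name `remix_for_intelligibility` and the toy stems are our own for illustration; in practice the stems would come from a source-separation model, and the gain would be tuned against the intelligibility/quality trade-off.

```python
import numpy as np

def remix_for_intelligibility(vocals, accompaniment, vocal_gain_db=6.0):
    """Demix-then-remix sketch: boost the separated vocal stem, then remix.

    `vocals` and `accompaniment` are mono float arrays in [-1, 1]; in a real
    system they would be produced by a demixing model.
    """
    gain = 10.0 ** (vocal_gain_db / 20.0)
    mix = gain * vocals + accompaniment
    # Normalise only if the vocal boost pushed the mix past full scale.
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak
    return mix

# Toy stems: a 440 Hz sine as the "vocal" and broadband noise as the
# "accompaniment" (one second at 16 kHz).
sr = 16000
t = np.arange(sr) / sr
vocals = 0.3 * np.sin(2 * np.pi * 440 * t)
accompaniment = 0.3 * np.random.default_rng(0).standard_normal(sr)

remix = remix_for_intelligibility(vocals, accompaniment, vocal_gain_db=6.0)
```

Boosting the vocal-to-accompaniment ratio this way raises intelligibility, but an aggressive boost degrades the overall mix quality, which is exactly the balance the evaluation targets.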

A.1 What is lyric intelligibility?

Lyric intelligibility, as defined by Cadenza's sensory panel of hearing aid users, refers to "how clearly and effortlessly the words in the music can be heard". Drawing on this sensory panel and the work of Fine et al. [2], four groups of factors can affect lyric intelligibility:

  1. Performer: includes articulation, voice quality and diction.
  2. Music-to-singer balance: includes balance in dynamics or pitch, music genre, song speed and composition style.
  3. Listener [1]: includes listener attention, familiarity, expectation and hearing ability.
  4. Environmental: includes room acoustics, proximity to performer and use or abuse of amplification.

As listeners are using headphones in our scenario, environmental factors are not included.

Of the first three factors, (1) and (2) are addressed in the task by including samples with different singing styles and background accompaniment. For example, in the challenge datasets, one can find samples of music tracks where the background is not prominent and the sung words are more easily heard. This is illustrated by the following example extracted from the training set of the MUSDB18-HQ dataset:

Did you pick up the lyrics?

Track Name: Actions - South Of The Water
Lyrics:
  my skin's falling off i'm breaking at the seeps
  he's holding me under and i can't breath

Transcriptions made by [Schulze-Forster et al.]

The datasets also include tracks where the singing can be more difficult to hear, because the background level is higher than the singing level, because the singing style makes the lyrics difficult to hear, or both. The next example, also drawn from the MUSDB18-HQ dataset, illustrates how the background accompaniment can mask the singing line, affecting lyric intelligibility.

Did you pick up the lyrics?

Track Name: Dark Ride - Burning Bridges
Lyrics:
  burning bridges fire in my soul burning bridges forget about control
  burn those witches i am the only one
  burn the bridges i relied upon

Transcriptions made by [Schulze-Forster et al.]

Listener factors (3) will be covered by providing each listener's hearing characteristics as an audiogram.
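An audiogram is a set of hearing thresholds (in dB HL) at standard audiometric frequencies. The sketch below shows one way a listener's audiogram might be represented and queried at arbitrary frequencies; the frequency set and the threshold values are illustrative assumptions, not values from the challenge data.

```python
import numpy as np

# Standard audiometric test frequencies (Hz).
AUDIOGRAM_FREQS = np.array([250, 500, 1000, 2000, 4000, 8000])

# Illustrative thresholds (dB HL) for a mild-to-moderate sloping loss;
# larger values mean poorer hearing at that frequency.
thresholds_db_hl = np.array([20.0, 25.0, 30.0, 40.0, 55.0, 65.0])

def threshold_at(freq_hz, freqs=AUDIOGRAM_FREQS, levels=thresholds_db_hl):
    """Hearing threshold at an arbitrary frequency, interpolated linearly
    on a log-frequency axis (a common convention for audiograms)."""
    return float(np.interp(np.log2(freq_hz), np.log2(freqs), levels))

print(threshold_at(1000))  # → 30.0
```

A processing system could use such thresholds, for example, to decide how much high-frequency vocal energy a particular listener is likely to miss.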

Challenge entrants will be provided with appropriate music datasets and sets of audiograms for training, development and evaluation.

References

[1] A. Greasley, H. Crook, and R. Fulford, "Music listening and hearing aids: perspectives from audiologists and their patients," International Journal of Audiology, vol. 59, no. 9, pp. 694–706, 2020.
[2] P. A. Fine and J. Ginsborg, "Making myself understood: perceived factors affecting the intelligibility of sung text," Frontiers in Psychology, vol. 5, p. 809, 2014.