Lyric Intelligibility Data
- To obtain the data and baseline code, please see the download page.
- For instructions on what data and pretrained models can be used in the challenges, please see the rules page.
A. Training, validation and evaluation data
The training and validation data are provided at challenge launch. The evaluation data is provided closer to the submission deadline.
A.1 Training and validation data
The dataset uses the transcription extension [1] of the training split of MUSDB18-HQ [2]. This extension comprises 96 manual transcriptions of English songs by non-native English speakers, totalling 366 minutes of audio.
We permit the use of the following additional datasets in training: FMA, MedleyDB (versions 1 and 2), and MoisesDB. We also permit the use of pre-trained models that may have been developed using these datasets.
You should not use pre-trained models that were trained on our evaluation data.
A.2 Evaluation (test) set
The evaluation dataset combines the English subset of the JamendoLyrics dataset (20 songs) [3] with the 46 transcribed songs from the evaluation split of the MUSDB18-HQ dataset. We will specify which parts of the songs are required, the required value of alpha, and the audiograms of the listeners.
The evaluation set must not be used for refining your system.
B. Metadata Information
B.1 Listener characteristics
We provide metadata characterising the hearing abilities of the listeners so that the audio signals can be personalised. This metadata is common to both tasks, so please see Listener Metadata for more details.
{
  "L0001": {
    "name": "L0001",
    "audiogram_cfs": [250, 500, 1000, 2000, 3000, 4000, 6000, 8000],
    "audiogram_levels_l": [45, 45, 35, 45, 60, 65, 70, 65],
    "audiogram_levels_r": [40, 40, 45, 45, 60, 65, 80, 80]
  },
  "L0002": {
    "name": "L0002",
    ...
  }
}
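As a minimal sketch of reading this metadata (the filename `listeners.json` is an assumption; use the name shipped with the challenge data), the audiograms can be loaded like this:

```python
import json

# Load the listener metadata; "listeners.json" is an assumed filename.
with open("listeners.json") as f:
    listeners = json.load(f)

# Print one listener's audiogram, pairing each centre frequency with
# the left- and right-ear hearing levels (dB HL).
listener = listeners["L0001"]
for cf, left, right in zip(
    listener["audiogram_cfs"],
    listener["audiogram_levels_l"],
    listener["audiogram_levels_r"],
):
    print(f"{cf:5d} Hz: left {left} dB HL, right {right} dB HL")
```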
B.2 Alpha
This gives the balance between intelligibility and quality. It ranges from 0 to 1 in steps of 0.1.
{
  "alpha_0": 0.0,
  "alpha_1": 0.1,
  ...
}
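The keys map directly to numeric weights. How alpha enters the challenge's scoring is defined by the organisers; the weighted sum below is only an illustration of such a trade-off, and the filename `alphas.json` is an assumption:

```python
import json

# Resolve an alpha key to its numeric value; "alphas.json" is an
# assumed filename.
with open("alphas.json") as f:
    alphas = json.load(f)

alpha = alphas["alpha_5"]  # -> 0.5


def combined_score(intelligibility: float, quality: float, alpha: float) -> float:
    """Illustrative trade-off only; the challenge defines the real metric.

    alpha = 1.0 weights intelligibility alone, alpha = 0.0 quality alone.
    """
    return alpha * intelligibility + (1.0 - alpha) * quality
```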
B.3 Music
This provides information about the audio segments together with their transcriptions.
{
  "A_Classic_Education_-_NightOwl_1": {
    "track_name": "A Classic Education - NightOwl",
    "path": "musdb18_hq/train/audios/A Classic Education - NightOwl",
    "segment_id": 1,
    "start_time": 0,
    "end_time": 8.2,
    "confidence": "a",
    "text": "i think you're right i do"
  },
  "A_Classic_Education_-_NightOwl_2": {
    ...
  }
}
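A sketch of cutting one labelled segment out of a track, assuming the metadata is saved as `music.json` and that the MUSDB18-HQ stem layout applies (e.g. a `mixture.wav` inside each track directory); adjust both if your copy differs:

```python
import json

import soundfile as sf

with open("music.json") as f:  # assumed filename
    music = json.load(f)

segment = music["A_Classic_Education_-_NightOwl_1"]

# start_time/end_time are in seconds; convert to sample indices.
audio, sr = sf.read(f"{segment['path']}/mixture.wav")
start = int(segment["start_time"] * sr)
end = int(segment["end_time"] * sr)
clip = audio[start:end]
print(segment["text"], clip.shape)
```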
B.4 Scenes
This file provides the combination of segment ID and alpha to use for each scene. The combinations are randomly generated.
{
  "S10001": {
    "segment_id": "A_Classic_Education_-_NightOwl_1",
    "alpha": "alpha_10"
  },
  "S10002": {
    "segment_id": "A_Classic_Education_-_NightOwl_2",
    "alpha": "alpha_5"
  },
  "S10003": {
    ...
  }
}
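Resolving a scene therefore means looking up its segment metadata and its numeric alpha. A minimal sketch, with all filenames assumed:

```python
import json

with open("scenes.json") as f:  # assumed filename
    scenes = json.load(f)
with open("alphas.json") as f:  # assumed filename
    alphas = json.load(f)
with open("music.json") as f:   # assumed filename
    music = json.load(f)

scene = scenes["S10001"]
segment = music[scene["segment_id"]]
alpha = alphas[scene["alpha"]]  # e.g. "alpha_10" -> 1.0
print(segment["track_name"], alpha)
```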
B.5 Scene-Listeners
This provides the list of listeners for each scene.
{
  "S10001": ["L0067", "L0044"],
  "S10002": ["L0073", "L0054"],
  ...
}
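Combining this file with the listener metadata yields the (scene, listener) pairs a system must process. A sketch, with filenames assumed:

```python
import json

with open("scene_listeners.json") as f:  # assumed filename
    scene_listeners = json.load(f)
with open("listeners.json") as f:        # assumed filename
    listeners = json.load(f)

for scene_id, listener_ids in scene_listeners.items():
    for listener_id in listener_ids:
        listener = listeners[listener_id]
        # Personalise the scene's signal for this listener's audiogram here.
        print(scene_id, listener_id, listener["audiogram_levels_l"])
```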
References
[1] Schulze-Forster, K., Doire, C. S., Richard, G. and Badeau, R., 2021. Phoneme level lyrics alignment and text-informed singing voice separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, pp. 2382-2395.
[2] Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I. and Bittner, R., 2019. MUSDB18-HQ - an uncompressed version of MUSDB18 [Dataset]. doi:10.5281/zenodo.3338373
[3] Durand, S., Stoller, D. and Ewert, S., 2023. Contrastive learning-based audio to lyrics alignment for multiple languages. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE.