⚠️ This is a beta version. Please report any issues or request new tutorials by opening an issue on the GitHub repository. ⚠️

Baseline CAD2 Task1#

Lyrics Intelligibility#


Image by Marcísio Coelho Mac Hostile from Pixabay

Mishearings of lyrics are very common, and numerous examples can be found on the internet, from websites dedicated to misheard lyrics to stand-up comedy routines that exploit them for laughs.

However, this is a significant issue for people with hearing loss [GCF20].

In this second round of Cadenza Challenges, we are presenting a challenge where entrants need to process a pop/rock music signal and increase the intelligibility of its lyrics with minimal loss of audio quality.

More details about the challenge can be found on the Cadenza website.

This tutorial walks you through the process of running the lyrics intelligibility baseline using the shell interface.

Create the environment#

We first need to install the Clarity package. The tag version for CAD2 is v0.6.1

Setting the Location of the Project#

For convenience, we are setting an environment variable with the location of the root working directory of the project. This variable will be used in various places throughout the tutorial. Please change this value to reflect where you have installed this notebook on your system.

import os
os.environ["NBOOKROOT"] = os.getcwd()
os.environ["NBOOKROOT"] = f"{os.environ['NBOOKROOT']}/.."
os.environ['NBOOKROOT']
'/home/gerardoroadabike/Extended/Projects/cadenza_tutorials/cad2/..'
from IPython.display import clear_output

import os
import sys

print("Cloning git repo...")
!git clone --depth 1 --branch v0.6.1 https://github.com/claritychallenge/clarity.git

clear_output()
print("Installing pyClarity...\n")
%cd clarity
%pip install -e .

sys.path.append(f'{os.getenv("NBOOKROOT")}/clarity')

clear_output()
print("Repository installed")
Repository installed
%cd {os.environ['NBOOKROOT']}/clarity/recipes/cad2/task1
!pip install -r requirements.txt
clear_output()

Get the demo data#

The next step is to download a demo data package that will help demonstrate the process. This package has the same structure as the official data package, so it will help you understand how the files are organized.

Before continuing, it is recommended that you familiarize yourself with the data structure and content, which you can find on the website.

Now, let’s download the data…

%cd {os.environ['NBOOKROOT']}
!gdown 1UqiqwYJuyC1o-C14DpVL4QYncsOGCvHF
!tar -xf cad2_demo_data.tar.xz

clear_output()
print("Data installed")
Data installed

Changing Working Directory#

Next, we change the working directory to the location of the shell scripts we wish to run.

%cd {os.environ['NBOOKROOT']}/clarity/recipes/cad2/task1/baseline
/home/gerardoroadabike/Extended/Projects/cadenza_tutorials/clarity/recipes/cad2/task1/baseline

Let’s save the path to the dataset in root_data

root_data = f"{os.environ['NBOOKROOT']}/cadenza_data_demo/cad2/task1"
!ls -l {root_data}
total 8
drwxr-xr-x 3 gerardoroadabike gerardoroadabike 4096 Aug 23 08:37 audio
drwxr-xr-x 2 gerardoroadabike gerardoroadabike 4096 Sep  3 16:11 metadata

Running the Baseline#

The enhancement baseline employs a ConvTasNet model to separate the lyrics from the background accompaniment. The model was trained for both the causal and non-causal cases. The pre-trained models are stored on Hugging Face. The causality is defined in config.yaml.

The config parameters#

The parameters of the baseline are defined in the config.yaml file.


First, it configures the paths to the metadata and audio files, and the location for the output files.

path:
  root: ???   # Set to the root of the dataset
  metadata_dir: ${path.root}/metadata
  music_dir: ${path.root}/audio
  musics_file: ${path.metadata_dir}/music.valid.json
  alphas_file: ${path.metadata_dir}/alpha.json
  listeners_file: ${path.metadata_dir}/listeners.valid.json
  enhancer_params_file: ${path.metadata_dir}/compressor_params.valid.json
  scenes_file: ${path.metadata_dir}/scene.valid.json
  scene_listeners_file: ${path.metadata_dir}/scene_listeners.valid.json
  exp_folder: ./exp_${separator.causality}
  • path.root: must be set to the location of the dataset.

  • exp_folder: by default, the folder name includes the causality parameter, but you can change this to suit your requirements.


The next parameters are the different sample rates

input_sample_rate: 44100 # sample rate of the input mixture
remix_sample_rate: 44100 # sample rate for the output remixed signal
HAAQI_sample_rate: 24000 # sample rate for computing HAAQI score

The HAAQI sample rate is used when computing the HAAQI score during evaluation.
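For instance, signals enhanced at 44.1 kHz must be resampled to 24 kHz before HAAQI is computed. The baseline uses clarity's own resample utility; purely as an illustration, the same operation can be sketched with scipy's polyphase resampler:

```python
import numpy as np
from scipy.signal import resample_poly

input_sample_rate = 44100  # sample rate of the input mixture
haaqi_sample_rate = 24000  # sample rate expected by HAAQI

# One second of a stereo test signal at 44.1 kHz (samples, channels)
signal = np.random.default_rng(0).standard_normal((input_sample_rate, 2))

# Polyphase resampling along the time axis; 24000/44100 reduces to 80/147
resampled = resample_poly(signal, up=haaqi_sample_rate, down=input_sample_rate, axis=0)
print(resampled.shape)  # (24000, 2)
```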


The next parameters are related to the source separation and how it operates

separator:
  causality: causal
  device: ~
  separation:
    number_sources: 2
    segment: 6.0
    overlap: 0.1
    sample_rate: ${input_sample_rate}
  • separator.causality: this is where we set the causality.

  • separator.separation: these parameters are used to separate long signals in overlapping segments with fades.
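As a rough sketch of the idea (the actual chunk lengths and fade shapes used by separate_sources may differ), a long signal can be processed in overlapping segments and recombined with linear cross-fades:

```python
import numpy as np

def chunked_process(signal, sample_rate, process, segment=6.0, overlap=0.1):
    """Process a long 1-D signal in overlapping chunks and cross-fade.

    Simplified illustration of the overlap/fade idea behind
    separate_sources; here `overlap` is interpreted as seconds.
    """
    hop = int(segment * sample_rate)
    ov = int(overlap * sample_rate)
    out = np.zeros_like(signal)
    win_sum = np.zeros_like(signal)
    start = 0
    while start < len(signal):
        end = min(start + hop + ov, len(signal))
        chunk = process(signal[start:end])
        win = np.ones(end - start)
        if start > 0:
            win[:ov] = np.linspace(0.0, 1.0, ov)   # fade in
        if end < len(signal):
            win[-ov:] = np.linspace(1.0, 0.0, ov)  # fade out
        out[start:end] += chunk * win
        win_sum[start:end] += win
        start += hop
    return out / np.maximum(win_sum, 1e-8)

# With an identity "model", the signal is reconstructed unchanged
x = np.ones(44100 * 13)  # 13 s of a constant signal at 44.1 kHz
y = chunked_process(x, 44100, process=lambda c: c)
```

In the baseline, `process` would be the separation model applied to each chunk; the cross-fade avoids audible discontinuities at the chunk boundaries.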


The enhancer parameters are the amplification parameters used by the multiband dynamic range compressor that do not directly depend on the listener.

enhancer:
  crossover_frequencies: [ 353.55, 707.11, 1414.21, 2828.43, 5656.85 ] # [250, 500, 1000, 2000, 4000] * sqrt(2)
  attack: [ 11, 11, 14, 13, 11, 11 ]
  release: [ 80, 80, 80, 80, 100, 100 ]
  threshold: [ -30, -30, -30, -30, -30, -30 ]

You are free to change these parameters if you believe they may improve the signals for the listener panel. However, take into consideration that the objective evaluation uses these parameters. This means that any changes may result in lower objective HAAQI scores, as this metric is based on the correlation between the enhanced and reference signals.
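As the comment in the config notes, the five crossover frequencies are the octave-band centres 250–4000 Hz multiplied by √2, i.e. the geometric midpoints between adjacent bands. Five crossovers define six bands, which is why attack, release, and threshold each list six values:

```python
import numpy as np

# Octave-band centre frequencies in Hz
centre_frequencies = np.array([250, 500, 1000, 2000, 4000])

# Crossovers sit a half octave above each centre: f * sqrt(2)
crossover_frequencies = centre_frequencies * np.sqrt(2)
print(np.round(crossover_frequencies, 2))
# [ 353.55  707.11 1414.21 2828.43 5656.85]
```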


The last parameters are evaluation configurations

evaluate:
  whisper_version: base.en
  set_random_seed: True
  small_test: False
  save_intermediate: False
  equiv_0db_spl: 100
  batch_size: 1  # Number of batches
  batch: 0       # Batch number to evaluate

whisper_version indicates which version of Whisper is used for the intelligibility metric. The objective evaluation will employ the base.en version.
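The batch_size and batch parameters allow the evaluation to be split across several parallel jobs. A common pattern for this kind of splitting (shown here with hypothetical scene and listener IDs; the baseline's exact implementation may differ) is strided slicing:

```python
# Hypothetical scene-listener pairs (IDs invented for illustration)
scene_listener_pairs = [
    ("S50009", "L5086"), ("S50077", "L5042"),
    ("S50100", "L5001"), ("S50101", "L5002"),
]

batch_size = 2  # total number of batches (jobs)
batch = 0       # which batch this job evaluates

# Each job takes every batch_size-th pair, starting at its batch index
my_pairs = scene_listener_pairs[batch::batch_size]
print(my_pairs)  # [('S50009', 'L5086'), ('S50100', 'L5001')]
```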

Running enhance.py#

The first steps in the script are:

  1. Load the different metadata files into dictionaries.

  2. Load the causal or non-causal separation model using the method load_separation_model().

  3. Create an instance of a MultibandCompressor.

  4. Load the scenes and the listeners per scene.

Then, the script processes one scene-listener pair at a time.

  1. Load the original mixture and select the requested segment

input_mixture, input_sample_rate = read_flac_signal(
    Path(config.path.music_dir)
    / songs[scene["segment_id"]]["path"]
    / "mixture.flac"
)
start_sample = int(
    songs[scene["segment_id"]]["start_time"] * config.input_sample_rate
)
end_time = int(
    (songs[scene["segment_id"]]["end_time"]) * config.input_sample_rate
)
  2. Normalise the signal to -40 dB LUFS. This is an important step, as it affects how the compressor works later.

  3. Separate the vocals from the background.

est_sources = separate_sources(
    separation_model,
    input_mixture.T,
    device=device,
    **config.separator.separation,
)
vocals, accompaniment = est_sources.squeeze(0).cpu().detach().numpy()
  4. Remix the sources into a stereo signal using the alpha as an input parameter. You are free to modify the downmix_signal function according to your approach.

enhanced_signal = downmix_signal(vocals, accompaniment, beta=alpha)
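The exact weighting applied inside downmix_signal is defined in the recipe; purely as a hypothetical illustration, an alpha-weighted remix of two stereo stems could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
vocals = rng.standard_normal((2, 1000))         # stand-in stereo vocal estimate
accompaniment = rng.standard_normal((2, 1000))  # stand-in stereo accompaniment
alpha = 0.8

# Hypothetical weighting: larger alpha emphasises the vocals in the remix.
# This is NOT necessarily what downmix_signal does -- check the recipe.
enhanced = alpha * vocals + (1 - alpha) * accompaniment
```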
  5. Load the compressor parameters for the listener and compress the signal using the multiband compressor and the listener's audiograms.

# Get the listener's compressor params
mbc_params_listener: dict[str, dict] = {"left": {}, "right": {}}

for ear in ["left", "right"]:
    mbc_params_listener[ear]["release"] = config.enhancer.release
    mbc_params_listener[ear]["attack"] = config.enhancer.attack
    mbc_params_listener[ear]["threshold"] = config.enhancer.threshold
mbc_params_listener["left"]["ratio"] = enhancer_params[listener_id]["cr_l"]
mbc_params_listener["right"]["ratio"] = enhancer_params[listener_id]["cr_r"]
mbc_params_listener["left"]["makeup_gain"] = enhancer_params[listener_id][
    "gain_l"
]
mbc_params_listener["right"]["makeup_gain"] = enhancer_params[listener_id][
    "gain_r"
]
        
enhancer.set_compressors(**mbc_params_listener["left"])
left_enhanced = enhancer(signal=enhanced_signal[0, :])

enhancer.set_compressors(**mbc_params_listener["right"])
right_enhanced = enhancer(signal=enhanced_signal[1, :])

enhanced_signal = np.stack((left_enhanced[0], right_enhanced[0]), axis=1)
  6. Save the enhanced signal in FLAC format. The signals are saved in the enhanced_signals directory within the experiment path path.exp_folder defined in config.yaml.

We can now call the enhance.py script. When calling it, make sure you are loading the correct files.

In shell, we can call the enhancer using the demo data as:

python enhance.py \
    path.root={root_data} \
    'path.listeners_file=${path.metadata_dir}/listeners.demo.json' \
    'path.enhancer_params_file=${path.metadata_dir}/compressor_params.demo.json' \
    'path.scenes_file=${path.metadata_dir}/scene.demo.json' \
    'path.scene_listeners_file=${path.metadata_dir}/scene_listeners.demo.json' \
    'path.musics_file=${path.metadata_dir}/music.demo.json'
!python enhance.py path.root={root_data} path.listeners_file={root_data}/metadata/listeners.demo.json path.enhancer_params_file={root_data}/metadata/compressor_params.demo.json path.scenes_file={root_data}/metadata/scene.demo.json path.scene_listeners_file={root_data}/metadata/scene_listeners.demo.json path.musics_file={root_data}/metadata/music.demo.json
config.json: 100%|██████████████████████████████| 204/204 [00:00<00:00, 519kB/s]
model.safetensors: 100%|████████████████████| 43.4M/43.4M [00:00<00:00, 106MB/s]
[2024-09-04 15:33:41,281][__main__][INFO] - [0001/0002] Processing scene-listener pair: ('S50009', 'L5086')
[2024-09-04 15:34:03,629][__main__][INFO] - [0002/0002] Processing scene-listener pair: ('S50077', 'L5042')
[2024-09-04 15:34:24,233][__main__][INFO] - Enhancement completed.

Let’s check the output path

!ls -l {os.environ['NBOOKROOT']}/clarity/recipes/cad2/task1/baseline/exp_causal/enhanced_signals
total 1776
-rw-rw-r-- 1 gerardoroadabike gerardoroadabike 1024842 Sep  4 15:34 S50009_L5086_A0.4_remix.flac
-rw-rw-r-- 1 gerardoroadabike gerardoroadabike  790055 Sep  4 15:34 S50077_L5042_A0.8_remix.flac

Let’s listen to these signals.

from pathlib import Path
from clarity.utils.flac_encoder import read_flac_signal
from clarity.utils.signal_processing import resample
import IPython.display as ipd

audio_path = Path(os.environ['NBOOKROOT']) / "clarity/recipes/cad2/task1/baseline/exp_causal/enhanced_signals" 
audio_files = [f for f in audio_path.glob('*') if f.suffix == '.flac']

for file_to_play in audio_files:
  signal, sample_rate = read_flac_signal(file_to_play)
  signal = resample(signal, sample_rate, 16000)
  print(file_to_play.name)
  ipd.display(ipd.Audio(signal.T, rate=16000))
S50077_L5042_A0.8_remix.flac
S50009_L5086_A0.4_remix.flac

Running evaluate.py#

Now that we have enhanced the signals we can use the evaluate.py script to generate the HAAQI and Whisper scores for the signals. It is important to run the evaluation using the same parameters as the enhancement.

! python evaluate.py path.root={root_data} path.listeners_file={root_data}/metadata/listeners.demo.json path.enhancer_params_file={root_data}/metadata/compressor_params.demo.json path.scenes_file={root_data}/metadata/scene.demo.json path.scene_listeners_file={root_data}/metadata/scene_listeners.demo.json path.musics_file={root_data}/metadata/music.demo.json
[2024-09-04 15:37:56,150][__main__][INFO] - Evaluating from enhanced_signals directory
[2024-09-04 15:37:57,373][__main__][INFO] - [0001/0002] Processing scene-listener pair: ('S50009', 'L5086')
[2024-09-04 15:38:18,446][root][INFO] - Severity level - SEVERE
[2024-09-04 15:38:18,446][root][INFO] - Processing {len(chans)} samples
[2024-09-04 15:38:18,453][root][INFO] - tracking fixed threshold
[2024-09-04 15:38:18,724][root][INFO] - Rescaling: leveldBSPL was 90.5 dB SPL, now 90.5 dB SPL.  Target SPL is 90.5 dB SPL.
[2024-09-04 15:38:18,724][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:21,237][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:23,266][root][INFO] - Severity level - MODERATE
[2024-09-04 15:38:23,266][root][INFO] - Processing {len(chans)} samples
[2024-09-04 15:38:23,272][root][INFO] - tracking fixed threshold
[2024-09-04 15:38:23,524][root][INFO] - Rescaling: leveldBSPL was 89.1 dB SPL, now 89.1 dB SPL.  Target SPL is 89.1 dB SPL.
[2024-09-04 15:38:23,524][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:26,288][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:28,199][__main__][INFO] - [0002/0002] Processing scene-listener pair: ('S50077', 'L5042')
[2024-09-04 15:38:48,559][root][INFO] - Severity level - SEVERE
[2024-09-04 15:38:48,560][root][INFO] - Processing {len(chans)} samples
[2024-09-04 15:38:48,566][root][INFO] - tracking fixed threshold
[2024-09-04 15:38:48,663][root][INFO] - Rescaling: leveldBSPL was 88.1 dB SPL, now 88.1 dB SPL.  Target SPL is 88.1 dB SPL.
[2024-09-04 15:38:48,663][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:50,988][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:52,896][root][INFO] - Severity level - SEVERE
[2024-09-04 15:38:52,896][root][INFO] - Processing {len(chans)} samples
[2024-09-04 15:38:52,902][root][INFO] - tracking fixed threshold
[2024-09-04 15:38:52,974][root][INFO] - Rescaling: leveldBSPL was 88.1 dB SPL, now 88.1 dB SPL.  Target SPL is 88.1 dB SPL.
[2024-09-04 15:38:52,974][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:55,448][root][INFO] - performing outer/middle ear corrections
[2024-09-04 15:38:57,393][__main__][INFO] - Evaluation completed

The evaluation scores are saved in path.exp_folder/scores.csv

import pandas as pd
pd.read_csv(f"exp_causal/scores.csv")
    scene                                            song listener  haaqi_left  haaqi_right  haaqi_avg  whisper_left  whisper_rigth  whisper_be  alpha    score
0  S50009                      Actions - One Minute Smile    L5086     0.93910     0.947933   0.943517           0.0            0.0         0.0    0.4  0.56611
1  S50077  Clara Berry And Wooldog - Waltz For My Victims    L5042     0.59874     0.571855   0.585298           0.5            0.5         0.5    0.8  0.51706

The HAAQI scores are computed for the left and right ears, and the average is saved. The intelligibility scores are computed for the left and right ears, and the better-ear score is saved.
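Consistent with the two rows above, the final score combines the better-ear Whisper score and the average HAAQI score, weighted by alpha:

```python
def combined_score(alpha, whisper_be, haaqi_avg):
    # alpha weights intelligibility; (1 - alpha) weights audio quality
    return alpha * whisper_be + (1 - alpha) * haaqi_avg

# Reproducing the score column from the table above
print(round(combined_score(0.4, 0.0, 0.943517), 5))  # 0.56611
print(round(combined_score(0.8, 0.5, 0.585298), 5))  # 0.51706
```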