Word Correctness
In summary, the correctness of a signal is computed as:
\(
\text{Correctness} = \frac{\sum{\text{Correct words}}}{\sum{\text{Total words}}}
\)
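As a rough illustration of the ratio above, the sketch below counts how many reference words are recovered, in order, in a transcription and divides by the number of reference words. The function name `word_correctness` and the longest-common-subsequence alignment are assumptions made for this example; the scoring used in the challenge may align words differently.

```python
def word_correctness(reference: str, hypothesis: str) -> float:
    """Fraction of reference words recovered, in order, in the hypothesis.

    Assumption: words are matched with a longest-common-subsequence
    alignment, which is only one possible way of counting correct words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0

    # lcs[i][j] holds the LCS length of ref[:i] and hyp[:j]
    lcs = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        for j, hyp_word in enumerate(hyp, start=1):
            if ref_word == hyp_word:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])

    correct_words = lcs[-1][-1]
    return correct_words / len(ref)


# One wrong word out of seven gives a correctness of 6/7
score = word_correctness(
    "as long as you hold my hand",
    "as long as you told my hand",
)
```

A perfect transcription scores 1.0 and an empty transcription scores 0.0.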
To compute this score, we employ the Whisper [RKX+23] model. Whisper is a robust speech recognition system trained on a very large dataset covering different acoustic conditions.
Although Whisper was trained for speech recognition, the wide variety of acoustic conditions in its training data makes it capable of transcribing lyrics.
The correlation between the Whisper base.en model and human performance (Ibrahim data) is 0.69.
In this case, the transcription was run on the original mixture, that is, without suppressing the background accompaniment.
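The transcription step could be sketched as follows, assuming the `openai-whisper` package (and its `ffmpeg` dependency) is installed. The helper names `normalise_text` and `transcribe_lyrics` are hypothetical, and the text normalisation is an assumption added so that transcripts can be compared word by word; it is not necessarily the normalisation used in the challenge.

```python
import re


def normalise_text(text: str) -> str:
    """Lower-case the text and drop punctuation (assumed normalisation)."""
    text = text.lower()
    # Keep letters, digits, apostrophes and spaces; replace the rest with spaces
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse repeated whitespace
    return " ".join(text.split())


def transcribe_lyrics(audio_path: str, model_name: str = "base.en") -> str:
    """Transcribe an audio file with Whisper and normalise the result."""
    # Imported lazily so the helper above works without Whisper installed
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return normalise_text(result["text"])
```

Calling `transcribe_lyrics` on a mixture file returns a normalised string that can be compared with the reference lyrics.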
Google Colab
To run this tutorial on Google Colab, you will need to install the pyClarity module by uncommenting and running the next cell.
# from IPython.display import clear_output  # needed for clear_output() below
# print("Cloning git repo...")
# !git clone --depth 1 --branch v0.6.1 https://github.com/claritychallenge/clarity.git
# print("Installing pyClarity...\n")
# %cd clarity
# %pip install -e .
# clear_output()
# print("Repository installed")
Computing Correctness in Cadenza
To automatically estimate intelligibility for a listener with hearing loss, we need to simulate how that listener hears the signal. To do this, the signal under evaluation goes through two steps:
Apply amplification according to the listener's hearing thresholds (audiogram), e.g. using a multiband dynamic range compressor.
Simulate the hearing loss, e.g. using the MSBG hearing loss simulation.
One can argue that intelligibility depends on how well the words are understood by the better ear, not on an average of the left and right ears. Therefore, in Cadenza we compute intelligibility as the maximum correctness between the left and right channels.
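The better-ear rule can be sketched as below. This is a minimal illustration, not the Cadenza implementation: `process` stands in for the amplification and hearing loss simulation steps and `score` for the transcription and correctness computation. Both are hypothetical callables supplied by the caller.

```python
from typing import Callable, Sequence

Channel = Sequence[float]


def better_ear_correctness(
    left: Channel,
    right: Channel,
    process: Callable[[Channel], Channel],
    score: Callable[[Channel], float],
) -> float:
    """Process and score each ear separately, then keep the better ear."""
    left_score = score(process(left))
    right_score = score(process(right))
    # Maximum rather than mean: the better ear determines intelligibility
    return max(left_score, right_score)
```

For example, if the left ear yields 5 correct words out of 7 and the right ear 6 out of 7, the reported correctness is 6/7, whereas averaging the two ears would give 5.5/7.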
Let’s run an example:
Let’s take a 6-second excerpt from Aretha Franklin’s As Long As You There. The lyrics of this sample are:
... as long as you hold my hand ...
!wget "https://github.com/CadenzaProject/cadenza_tutorials/raw/main/_static/audio/Aretha%20Franklin-As%20Long%20As%20You%20There.wav"
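import IPython.display as ipd
from scipy.io import wavfile

# Ground-truth lyrics of the excerpt
reference = "as long as you hold my hand"

# `sr` is the sample rate in Hz; `signal` has shape (n_samples, n_channels)
sr, signal = wavfile.read("Aretha Franklin-As Long As You There.wav")

# ipd.Audio expects (n_channels, n_samples), hence the transpose
ipd.display(ipd.Audio(signal.T, rate=sr))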