Document how to derive pyin's voicing decision from its probabilities
twoertwein opened this issue · comments
Describe the bug
pyin returns both a per-frame voicing decision and a per-frame voicing probability. There does not appear to be a constant threshold for deriving the voicing decision from the probabilities.
To Reproduce
```python
import librosa
import pandas as pd

# wav: a mono audio signal sampled at 16 kHz
f0, voicing, probs = librosa.pyin(
    wav,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C5"),
    sr=16_000,
    frame_length=1024,
    hop_length=256,
    center=False,
)
pd.Series(probs[voicing]).describe()
```
```
count    1341.000000
mean        0.573702
std         0.377040
min         0.010000   # would have expected 0.5
25%         0.176531
50%         0.686083
75%         0.945930
max         1.000000
dtype: float64
```
```python
pd.Series(probs[~voicing]).describe()
```

```
count    719.000000
mean       0.012068
std        0.033047
min        0.000000
25%        0.010000
50%        0.010000   # good to see that the mean is much lower compared to when voicing is asserted
75%        0.010000
max        0.826669   # would have expected 0.5
dtype: float64
```
Expected behavior
I would have expected a constant threshold for deriving the voicing decision from the probabilities.
Software versions*
INSTALLED VERSIONS
------------------
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
librosa: 0.10.1
audioread: 3.0.1
numpy: 1.26.4
scipy: 1.12.0
sklearn: 1.4.1.post1
joblib: 1.3.2
decorator: 5.1.1
numba: 0.59.1
soundfile: 0.12.1
pooch: v1.8.1
soxr: 0.3.7
typing_extensions: installed, no version number available
lazy_loader: installed, no version number available
msgpack: 1.0.8
numpydoc: None
sphinx: None
sphinx_rtd_theme: None
matplotlib: None
sphinx_multiversion: None
sphinx_gallery: None
mir_eval: None
ipython: None
sphinxcontrib.rsvgconverter: None
pytest: None
pytest_mpl: None
pytest_cov: None
samplerate: None
resampy: None
presets: None
packaging: 24.0
> It seems that there might not be a constant threshold for deriving the voicing decision from the probabilities.
It's a little more complex than a simple framewise threshold. As the docstring states, pyin uses a hidden Markov model to infer a globally optimal state sequence that explains the observations. The hidden states are two sets of frequencies: one for "voiced" frequencies and one for "unvoiced". The first N states are voiced (for N candidate frequencies) and the second N are unvoiced; the voicing detection logic simply checks whether the maximum-likelihood state for each frame belongs to the first N or not:
Line 860 in 222b228
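In other words, the decision reduces to comparing the decoded state index against N. A minimal sketch of that comparison (the value of `N` and the decoded path here are toy values for illustration, not taken from librosa):

```python
import numpy as np

# Hypothetical state layout: indices 0..N-1 are voiced pitch candidates,
# indices N..2N-1 are their unvoiced counterparts.
N = 5                               # toy number of pitch candidates
states = np.array([1, 3, 7, 8, 2])  # toy decoded Viterbi state path

# A frame is flagged voiced iff its decoded state falls in the first N.
voiced_flag = states < N
print(voiced_flag)                  # [ True  True False False  True]
```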
The voicing probability for each frame is the framewise marginal likelihood of that frame belonging to the first N states (prior to Viterbi decoding):
Lines 937 to 939 in 222b228
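As a hedged sketch of that marginal (assuming a normalized per-frame distribution over 2N states; the sizes and values are synthetic, not librosa's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_frames = 5, 4                        # toy sizes, not librosa defaults

# Per-frame distribution over 2N states: rows 0..N-1 voiced, N..2N-1 unvoiced.
obs = rng.random((2 * N, n_frames))
obs /= obs.sum(axis=0, keepdims=True)     # normalize each frame to sum to 1

# Framewise voicing probability: total probability mass on the voiced states.
voiced_prob = obs[:N].sum(axis=0)
```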
So the probability of a given frame being voiced might be low, but that doesn't mean the frame isn't voiced! It just means the frame is unlikely to be voiced when considered independently of all other frames. The Viterbi decoding path determines whether the frame is more likely to be voiced or not when considered as part of the entire sequence.
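To see how the global path can override a low framewise probability, here is a toy two-state Viterbi decode (voiced/unvoiced, with hypothetical numbers, not pyin's actual matrices): frame 2's observation favors "unvoiced" (voiced mass 0.3 < 0.5), yet sticky transitions keep the decoded path voiced throughout.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely state path for a tiny HMM, computed in the log domain."""
    T, S = log_obs.shape
    dp = np.empty((T, S))
    ptr = np.zeros((T, S), dtype=int)
    dp[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans  # scores[i, j]: from i to j
        ptr[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = dp[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = ptr[t + 1, path[t + 1]]
    return path

# State 0 = voiced, state 1 = unvoiced (toy observation likelihoods).
obs = np.array([[0.9, 0.1],
                [0.9, 0.1],
                [0.3, 0.7],   # framewise, this frame looks unvoiced
                [0.9, 0.1],
                [0.9, 0.1]])
trans = np.array([[0.99, 0.01],   # sticky: switching states is expensive
                  [0.01, 0.99]])
init = np.array([0.5, 0.5])

path = viterbi(np.log(obs), np.log(trans), np.log(init))
print(path)   # [0 0 0 0 0] -- frame 2 is still decoded as voiced
```

With pyin's real transition model, the same effect explains why `probs[voicing]` can dip well below 0.5.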
Thank you for the clarification! I forgot that the decision is contextualized by the Viterbi decoding!
Feel free to close the issue.
Well, if there's something we could add to the docstring to make this more clear, we should do so. (PRs would be welcome!)