machine-learning audio-processing normalization segmentation pydub

Audio Segmentation

This script uses a modified pydub package (0.24.1) to segment and normalize raw conversational/speech audio files for machine learning. Compared with the standard pydub library, it removes the need for multiple loops when segmenting data.

Please be aware that this script may break if any other pydub version is used.

audio_segmentation.ipynb

This script assumes that there is only one speaker. To find the optimal silence threshold and silence length for your audio, use 'parameter_tester.ipynb'.

What this script does:

Removes unnecessarily long pauses/silences while retaining natural silences, which indicate the speaker's thinking or use of fillers.
Splits the audio files into 5-second intervals. Files that are too short are kept but labelled as "leftover".
Normalizes amplitude, channel, and sampling rate.
[Future Feature] Removes background noise, if applicable.
[Future Feature] Generates a unique ID for each file.
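The silence-trimming and splitting steps above can be sketched in plain Python on millisecond timestamps. This is a minimal illustration, not the notebook's actual code: the helper names cap_silences and split_chunks, the max_silence_ms parameter, and the "full" label are assumptions made for this example. Only the 5-second interval and the "leftover" label come from the description above.

```python
from typing import List, Tuple

CHUNK_MS = 5000  # 5-second intervals, as described above


def cap_silences(nonsilent: List[Tuple[int, int]], total_ms: int,
                 max_silence_ms: int) -> List[Tuple[int, int]]:
    """Return the (start, end) spans of the original audio to keep.

    Every non-silent range is kept whole. Each silent gap is shortened to at
    most max_silence_ms (hypothetical parameter), so natural pauses survive
    while long ones are cut down.
    """
    keep = []
    cursor = 0
    for start, end in nonsilent:
        gap = start - cursor
        if gap > 0:
            # Keep only the part of the gap right before the speech resumes.
            keep.append((start - min(gap, max_silence_ms), start))
        keep.append((start, end))
        cursor = end
    trailing = total_ms - cursor
    if trailing > 0:
        # Keep only the part of the trailing silence right after the speech.
        keep.append((cursor, cursor + min(trailing, max_silence_ms)))
    return keep


def split_chunks(total_ms: int, chunk_ms: int = CHUNK_MS) -> List[Tuple[int, int, str]]:
    """Split a duration into fixed-size chunks; a short final chunk is 'leftover'."""
    chunks = []
    for start in range(0, total_ms, chunk_ms):
        end = min(start + chunk_ms, total_ms)
        label = "full" if end - start == chunk_ms else "leftover"
        chunks.append((start, end, label))
    return chunks


# Example: 15 s of audio with speech at 1-3 s and 9-12 s, pauses capped at 0.5 s.
spans = cap_silences([(1000, 3000), (9000, 12000)], total_ms=15000, max_silence_ms=500)
kept_ms = sum(end - start for start, end in spans)
print(kept_ms)                # 6500
print(split_chunks(kept_ms))  # [(0, 5000, 'full'), (5000, 6500, 'leftover')]
```

The spans returned by cap_silences refer to positions in the original audio, while split_chunks works on the duration of the trimmed result, so the two steps stay independent of each other.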

parameter_tester.ipynb

Use this notebook if you need to find the optimal parameters for removing silence from your audio.

Open parameter_tester.ipynb. This notebook takes a sample of your original file, which you can use to test and find the optimal silence length and threshold.
Run the first cell to splice a sample from your original raw audio data.
Adjust the parameters in nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=4000, silence_thresh=-32, seek_step=1), then run the cell. It should output a series of time frames.
Run the third cell to output the waveform graph.
Check that the waveform graph matches the time frames from the second cell. If it doesn't, adjust the parameters and run the cells again. With optimal parameters, the time frames line up with the non-silent sections of the waveform.
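To build intuition for how min_silence_len (in milliseconds) and silence_thresh (in dBFS) change the reported time frames, the sketch below applies the same idea to a list of per-millisecond loudness values. It is a simplified stand-in written for this example, not pydub's detect_nonsilent: the function name nonsilent_ranges and the per-millisecond input are assumptions, and pydub itself measures loudness over windows of audio rather than single values.

```python
from typing import List


def nonsilent_ranges(levels_dbfs: List[float], min_silence_len: int,
                     silence_thresh: float) -> List[List[int]]:
    """Return [start, end] ranges (in ms) that are not part of a long silence.

    levels_dbfs[i] is the loudness of millisecond i. A run of values at or
    below silence_thresh counts as silence only if it lasts at least
    min_silence_len milliseconds; shorter quiet runs stay inside the speech.
    """
    n = len(levels_dbfs)
    silent = []
    run_start = None
    for i in range(n + 1):  # the extra step at i == n closes a trailing quiet run
        quiet = i < n and levels_dbfs[i] <= silence_thresh
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if i - run_start >= min_silence_len:
                silent.append((run_start, i))
            run_start = None

    # Invert the silent ranges to get the non-silent ones.
    ranges = []
    cursor = 0
    for start, end in silent:
        if start > cursor:
            ranges.append([cursor, start])
        cursor = end
    if cursor < n:
        ranges.append([cursor, n])
    return ranges


# Toy signal: 3 ms loud, 5 ms quiet, 2 ms loud, 2 ms quiet, 1 ms loud.
levels = [-20.0] * 3 + [-50.0] * 5 + [-20.0] * 2 + [-50.0] * 2 + [-20.0]
print(nonsilent_ranges(levels, min_silence_len=4, silence_thresh=-32))  # [[0, 3], [8, 13]]
print(nonsilent_ranges(levels, min_silence_len=2, silence_thresh=-32))  # [[0, 3], [8, 10], [12, 13]]
```

In this toy case, lowering min_silence_len makes the short 2 ms pause count as silence and splits the second range in two, which is the kind of change to look for when comparing the time frames against the waveform graph.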

MIT License

Languages

Language: Jupyter Notebook 100.0%