
WhisperPlus: Advancing Speech-to-Text Processing πŸš€


πŸ› οΈ Installation

pip install whisperplus

πŸ€— Model Hub

You can find the models on HuggingFace Spaces or on the HuggingFace Model Hub.

πŸŽ™οΈ Usage

To use the whisperplus library, follow the steps below for different tasks:

🎡 Youtube URL to Audio

from whisperplus import SpeechToTextPipeline, download_and_convert_to_mp3

# Define the URL of the YouTube video that you want to convert to text.
url = "https://www.youtube.com/watch?v=di3rHkEZuUw"

# Download the video and convert it to an MP3 audio file.
audio_path = download_and_convert_to_mp3(url)

# Initialize the speech-to-text pipeline with the specified model.
pipeline = SpeechToTextPipeline(model_id="openai/whisper-large-v3")

# Run the pipeline on the audio file.
transcript = pipeline(
    audio_path=audio_path, model_id="openai/whisper-large-v3", language="english"
)

# Print the transcript of the audio.
print(transcript)
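Once the pipeline has returned, the transcript can be handled like any other string. For example, a small helper (not part of whisperplus, just a sketch) to persist it to disk:

```python
import os
import tempfile
from pathlib import Path


def save_transcript(transcript: str, path: str) -> Path:
    """Write the transcript to a UTF-8 text file and return its path."""
    out = Path(path)
    out.write_text(transcript, encoding="utf-8")
    return out


# Demo with a throwaway string; in practice, pass the pipeline's output.
target = os.path.join(tempfile.mkdtemp(), "transcript.txt")
saved = save_transcript("Hello from Whisper.", target)
print(saved.read_text(encoding="utf-8"))
```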

Summarization

from whisperplus.pipelines.summarization import TextSummarizationPipeline

summarizer = TextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary = summarizer.summarize(transcript)
print(summary[0]["summary_text"])
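Note that summarization models such as facebook/bart-large-cnn have a bounded input length (on the order of 1024 tokens), so a very long transcript may need to be split before summarizing. A minimal word-based chunker (a sketch, not part of the library) could look like this:

```python
def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split text into word-bounded chunks so each chunk stays
    comfortably under the summarizer's input limit."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# 1000 words split into chunks of at most 400 words each.
chunks = chunk_text("word " * 1000, max_words=400)
print(len(chunks))  # -> 3
```

Each chunk can then be summarized separately and the partial summaries joined, or summarized once more for a single overall summary.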

Speaker Diarization

from whisperplus import (
    ASRDiarizationPipeline,
    download_and_convert_to_mp3,
    format_speech_to_dialogue,
)

audio_path = download_and_convert_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")

device = "cuda"  # or "cpu" / "mps"
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization",
    use_auth_token=False,
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)
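`format_speech_to_dialogue` takes care of rendering the diarized output. As a rough illustration of the idea only, here is a hypothetical formatter, assuming the pipeline yields segments as dicts with `speaker` and `text` keys (the library's actual output structure may differ):

```python
def to_dialogue(segments: list[dict]) -> str:
    """Render speaker-labelled segments as 'Speaker: text' lines,
    merging consecutive segments from the same speaker."""
    lines: list[str] = []
    for seg in segments:
        speaker, text = seg["speaker"], seg["text"].strip()
        if lines and lines[-1].startswith(speaker + ":"):
            lines[-1] += " " + text  # same speaker: extend the last line
        else:
            lines.append(f"{speaker}: {text}")
    return "\n".join(lines)


# Hypothetical segments for illustration.
example = [
    {"speaker": "SPEAKER_00", "text": "Hello there."},
    {"speaker": "SPEAKER_00", "text": "How are you?"},
    {"speaker": "SPEAKER_01", "text": "Fine, thanks."},
]
print(to_dialogue(example))
```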

Contributing

To set up a development environment, install the dev requirements and enable the pre-commit hooks:

pip install -r dev-requirements.txt
pre-commit install
pre-commit run --all-files

πŸ“œ License

This project is licensed under the terms of the Apache License 2.0.

πŸ€— Acknowledgments

This project is based on the HuggingFace Transformers library.

πŸ€— Citation

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
