EgorLakomkin / KTSpeechCrawler

Automatically constructing corpus for automatic speech recognition from YouTube videos

Home Page: https://arxiv.org/abs/1903.00216

Why does GOOGLE_TEST default to OK? Why isn't the GoogleRandomSubsetWERFilter class added in process.py?

MuruganR96 opened this issue · comments

Why does GOOGLE_TEST default to OK?

Why isn't the GoogleRandomSubsetWERFilter class added to the pipeline in process.py?

First of all, thank you for releasing this project as open source. Awesome work. Thank you so much, KTSpeechCrawler team. :)

I tried the KTSpeechCrawler project to collect YouTube audio datasets for an ASR (speech-to-text) task.

I completed all the steps and then checked the transcripts against their corresponding audio files (.wav, .txt).

I found that 11 out of 100 audio files have incorrect transcripts.

If we apply google_speech_test and remove the samples that score below a threshold (threshold=0.85), we should get properly matched audio files and transcripts.
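
To make the idea concrete, here is a minimal sketch of the check I have in mind. It is not the project's GoogleRandomSubsetWERFilter, whose interface I have not verified. The function names are hypothetical, and I am assuming the 0.85 threshold is a word-level agreement score (1 minus the word error rate) between the caption and an ASR transcript of the same audio.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Empty reference: perfect only if the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Agreement score in [0, 1]; assumed meaning of the threshold."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


def keep_sample(caption: str, asr_transcript: str, threshold: float = 0.85) -> bool:
    """Keep the (audio, caption) pair only if the ASR output agrees enough."""
    return word_accuracy(caption, asr_transcript) >= threshold
```

With this scoring, a 10-word caption with one wrong word scores 0.9 and is kept, while a 6-word caption with one wrong word scores about 0.83 and is dropped.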

Can you please tell me where I should start, and where to add this module, in order to run google_speech_test?

Are there any complications in using google_speech_test?

This is the pipeline module:

pipeline = Pipeline([
    OverlappingSubtitlesRemover(),
    SubtitleCaptionTextFilter(),
    CaptionNormalizer(),
    CaptionRegexMatcher(good_chars_regexp),
    CaptionLengthFilter(min_length=5),
    CaptionLeaveOnlyAlphaNumCharacters(),
    SubtitleMerger(max_len_merged_sec=10),
    CaptionDurationFilter(min_length=1, max_length=20.0)
])

Where in this pipeline should I add the module? Is adding it at the end enough?
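
For reference, this is roughly what I mean by adding it last. It is a self-contained toy sketch and does not use the project's classes. ToyPipeline, min_length_stage, asr_agreement_stage and fake_transcribe are all hypothetical, and I am assuming stages run in order, each taking and returning a list of captions. The difflib ratio is only a stand-in for a proper WER score.

```python
from difflib import SequenceMatcher


class ToyPipeline:
    """Hypothetical stand-in: runs each stage in order on a list of captions."""

    def __init__(self, stages):
        self.stages = list(stages)

    def run(self, captions):
        for stage in self.stages:
            captions = stage(captions)
        return captions


def min_length_stage(min_length):
    """Cheap text-only filter, similar in spirit to a caption length filter."""
    return lambda captions: [c for c in captions if len(c["text"]) >= min_length]


def asr_agreement_stage(transcribe, threshold=0.85):
    """Expensive filter: one transcription call per caption that reaches it."""
    def stage(captions):
        kept = []
        for caption in captions:
            ref = caption["text"].lower().split()
            hyp = transcribe(caption["audio"]).lower().split()
            if SequenceMatcher(None, ref, hyp).ratio() >= threshold:
                kept.append(caption)
        return kept
    return stage


# Fake ASR backend so the sketch runs offline; it records which files it saw.
FAKE_ASR = {
    "a.wav": "hello world again",
    "b.wav": "something else entirely",
    "c.wav": "hi",
}
calls = []


def fake_transcribe(audio):
    calls.append(audio)
    return FAKE_ASR[audio]


captions = [
    {"audio": "a.wav", "text": "hello world again"},
    {"audio": "b.wav", "text": "completely different words"},
    {"audio": "c.wav", "text": "hi"},
]

# The ASR check goes last, after the cheap filters have shrunk the list.
pipeline = ToyPipeline([min_length_stage(5), asr_agreement_stage(fake_transcribe)])
result = pipeline.run(captions)
```

In this toy run, the short caption is removed before the ASR stage, so only two transcription calls are made. That is why I am guessing the last position is the right one.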

Thank you sir :)