Could you please explain these two sentences from the README.md?

Question

afparsons opened this issue 2 years ago

Great work! Could you please explain these two sentences from the README.md?

The sequence of text is tokenized using the top 10,000 words found in sponsorships.

Where does this list come from? Did you simply fetch transcripts for the videos in the training data (a subset of the SponsorBlock data) and collect the tokens from the known sponsor segments? Is this just a simple count? Are stopwords removed? etc.

Note, using a pre-trained word embedding by fastText does not yield better performance.

I'm familiar with fastText, but could you please elaborate a little more on this? I'm not sure I understand what optimization you attempted (and ultimately concluded to not be worthwhile).

Andrew Lee · Answer 1 · Thu Dec 29 2022 09:53:19 GMT+0800 (China Standard Time)

Hi, hopefully this doesn't come too late and is still helpful.

Where does this list come from?

You're correct. Using SponsorBlock's labels, I extracted the complete text of the segments that SponsorBlock identified as sponsorships. I then used a built-in function that TensorFlow provides to identify the top 10,000 words and tokenize the text. You can see the code here.
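For illustration, here is a minimal sketch of what frequency-based tokenization with a capped vocabulary looks like. It is written in plain Python rather than with the TensorFlow built-in, so it is a stand-in for the idea and not the project's actual code. The names `build_vocab` and `encode`, the regex word splitter, the reserved ids 0 and 1, and the sample texts are all assumptions made for this example.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # assumed word splitter, not the library's rule


def build_vocab(texts, num_words=10_000):
    """Map the num_words most frequent words to integer ids.

    Ids 0 and 1 are reserved here for padding and out-of-vocabulary
    words (an assumption of this sketch), so real words start at 2.
    """
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    most_common = counts.most_common(num_words)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(text, vocab, oov_id=1):
    """Turn a text into a list of ids; unknown words become oov_id."""
    return [vocab.get(word, oov_id) for word in WORD_RE.findall(text.lower())]


# Hypothetical sponsor-segment transcripts, for demonstration only.
sponsor_texts = [
    "this video is sponsored by example vpn",
    "thanks to example vpn for sponsoring this video",
]
vocab = build_vocab(sponsor_texts, num_words=5)
ids = encode("this video is about cooking", vocab)
```

In this sketch the vocabulary is a plain count over the sponsor text with no stopword removal; whether the TensorFlow function used in the project filters anything further depends on how it was configured.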

could you please elaborate a little more on this?

I didn't try anything sophisticated. I used the default fastText embeddings in lieu of my custom embedding and didn't see any meaningful improvement in accuracy. That makes sense, since fastText isn't focused on the language of sponsorships.
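The answer does not say how the fastText vectors were wired in. One common way to try a pre-trained embedding is to build an embedding matrix whose rows line up with the tokenizer's ids and use it to initialize the model's embedding layer. The sketch below shows only that matrix-building step, assuming a word-to-id vocabulary and a word-to-vector lookup; `build_embedding_matrix`, the tiny 3-dimensional vectors, and the sample words are hypothetical.

```python
def build_embedding_matrix(vocab, pretrained, dim):
    """Return a list of rows where row i is the vector for the word with id i.

    Words missing from the pretrained lookup, and any unused ids,
    keep an all-zero row.
    """
    num_rows = max(vocab.values()) + 1
    matrix = [[0.0] * dim for _ in range(num_rows)]
    for word, idx in vocab.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


# Hypothetical vocabulary (ids 0 and 1 left for padding / unknown words).
word_ids = {"sponsored": 2, "vpn": 3, "examplevpn": 4}

# Hypothetical stand-in for pre-trained vectors; real ones are much wider.
pretrained_vectors = {
    "sponsored": [0.1, 0.2, 0.3],
    "vpn": [0.4, 0.5, 0.6],
}

embedding_matrix = build_embedding_matrix(word_ids, pretrained_vectors, dim=3)
```

Sponsor-specific words such as the made-up `examplevpn` get no pre-trained vector in this sketch, which is one way to picture why general-purpose embeddings may not help on sponsorship text.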