Cached Usage docs of SparseBytePairFeaturizer contain dense BytePairFeaturizer
namhoai167 opened this issue · comments
The Cached Usage section of the SparseBytePairFeaturizer docs shows the dense BytePairFeaturizer. I checked sparse_bytepair.md and it correctly uses SparseBytePairFeaturizer. There was a commit that fixed this, but the change isn't showing on the docs site.
That's a fair comment!
One thing though; are you using the sparse BytePair featurizer? I'm currently prepping the repository for Rasa 3.0 and my impression was that the feature was barely used and I was considering dropping it. Would you happen to have an anecdote that suggests that I should keep it around?
Thanks @koaning for your reply; you and Dr. Rachael are the two people who have taught me most of what I know about NLP since January. I'm not using the sparse BytePair featurizer at the moment, so you can drop it if you want. I came here from your videos (this, this and this). CountVectorsFeaturizer does a great job of handling spelling errors, but I want to try stacking a subword featurizer (the dense BytePair one) on top and benchmark whether it increases the DIET classification score on my small dataset of artificial errors.
As for the sparse BytePair featurizer, I'm not sure how it works. It uses terminology I don't know, like "BytePair tokeniser"; I'm trying to figure out whether that's the BPE tokenizer or something else. Once I understand it well, I may give it a shot.
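For what it's worth, the "BytePair" in the name does refer to byte-pair encoding (BPE): a subword tokenizer that starts from individual characters and repeatedly merges the most frequent adjacent symbol pair to build a subword vocabulary. Here's a minimal sketch of the merge loop (this is just an illustration with a made-up toy vocabulary, not Rasa's actual implementation):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the vocabulary (word -> frequency)
    and return the most frequent one."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: each word is a sequence of space-separated symbols,
# initially single characters, mapped to its corpus frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = merge_pair(pair, vocab)

print(merges)  # learned merge rules, e.g. ('e', 's') then ('es', 't'), ...
print(vocab)   # words now contain multi-character subword symbols
```

The learned merge rules are what lets BPE split an unseen (or misspelled) word into known subword units, which is why it can complement CountVectorsFeaturizer on noisy text.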
Happy to hear that you find our content useful :)
I will drop the sparse featurizer then, which will also resolve this issue.