Undoing casefolding?
jhdeov opened this issue · comments
The command line lets the user choose whether to apply casefolding, so that an entry like English can be kept as English or changed to english. But the scraped data in the repo seems to have casefolding applied by default. Would it be more useful if the online data weren't casefolded? That way:
- If the user wants the original data (with the correct case), they can just use the scraped data online instead of running WikiPron in the terminal.
- If the user wants the casefolded data, they can take the un-casefolded data from the repo and apply casefolding on their own machine (a simple, fast Excel function).
Right now, if the user wants to get the original cases, then they have to run the terminal option (which takes a while).
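For anyone not using Excel, the local casefolding step can be sketched in a few lines of Python. This is only an illustration: it assumes the scraped data is tab-separated text with the word in the first column, and `casefold_words` is a hypothetical helper name, not part of WikiPron.

```python
def casefold_words(tsv_text: str) -> str:
    """Casefold the first (word) column of tab-separated text.

    Assumption: one entry per line, with the word before the first tab.
    Everything after the first tab is left untouched.
    """
    lines = []
    for line in tsv_text.split("\n"):
        word, sep, rest = line.partition("\t")
        lines.append(word.casefold() + sep + rest)
    return "\n".join(lines)


sample = "English\tɪ ŋ ɡ l ɪ ʃ\nStraße\tʃ t ʁ aː s ə"
print(casefold_words(sample))
```

Note that `str.casefold()` is more aggressive than `str.lower()` (for example, it maps `ß` to `ss`), so the result may differ from a plain lowercase function.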
I'm not opposed. Would you send a PR? You'll just remove casefold: true from languages.json and run the scrape.
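Removing the key by hand is tedious if many languages carry it, so here is a rough sketch of doing it programmatically. The layout of languages.json is an assumption here (a mapping from language code to a dictionary of settings), and `drop_casefold` is a hypothetical helper; adjust it to the file's real structure.

```python
import json


def drop_casefold(config: dict) -> dict:
    """Return a copy of the config with every "casefold" key removed.

    Assumption: config maps a language code to a dict of settings.
    """
    return {
        code: {key: value for key, value in settings.items() if key != "casefold"}
        for code, settings in config.items()
    }


# Hypothetical sample standing in for the contents of languages.json.
sample = json.loads(
    '{"eng": {"casefold": true, "name": "English"}, "jpn": {"name": "Japanese"}}'
)
cleaned = drop_casefold(sample)
print(json.dumps(cleaned, ensure_ascii=False, indent=4))
```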
Just to confirm, you mean delete casefold: true, and not simply change it to casefold: false?
Sadly, I don't think I have a good enough computer/internet to rescrape everything :(
Did a PR
I wonder if the various cleanup processes (casefolding, syllable removal, stress removal, etc.) could be turned into a single script, so that the WikiPron scrape has the pure form of everything; then, if the user is interested, they could run a cleanup script to apply all the default casefolding, etc.?
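To make the idea concrete, here is a minimal sketch of what such a cleanup script could look like. It is not WikiPron's implementation: the function names are hypothetical, and it assumes entries are (word, pronunciation) pairs with space-separated IPA segments, `ˈ`/`ˌ` as stress marks, and `.` as the syllable boundary.

```python
from typing import Callable, Iterable, List, Tuple

Entry = Tuple[str, str]  # (word, pronunciation)
Step = Callable[[Entry], Entry]

# Translation table that deletes primary and secondary stress marks.
_STRESS_TABLE = str.maketrans("", "", "ˈˌ")


def casefold_word(entry: Entry) -> Entry:
    word, pron = entry
    return word.casefold(), pron


def remove_stress(entry: Entry) -> Entry:
    word, pron = entry
    # Re-join on single spaces in case a stress mark stood alone as a segment.
    return word, " ".join(pron.translate(_STRESS_TABLE).split())


def remove_syllable_boundaries(entry: Entry) -> Entry:
    word, pron = entry
    return word, " ".join(pron.replace(".", "").split())


DEFAULT_STEPS: List[Step] = [casefold_word, remove_stress, remove_syllable_boundaries]


def cleanup(entries: Iterable[Entry], steps: Iterable[Step] = DEFAULT_STEPS) -> List[Entry]:
    """Apply each cleanup step, in order, to every entry."""
    steps = list(steps)
    result = []
    for entry in entries:
        for step in steps:
            entry = step(entry)
        result.append(entry)
    return result


raw = [("English", "ˈɪ ŋ . ɡ l ɪ ʃ")]
print(cleanup(raw))
```

Keeping each step as a separate function means a user could pass only the steps they want, for example `cleanup(raw, steps=[remove_stress])` to keep the original case.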
We have a hint of this in our notion of "filtered" vs. "unfiltered", this could just be an additional layer.
I was working on this and trying to run step 1 of "the big scrape", but I ran into a weird error with some languages not being recognized, details here.