CUNY-CL / wikipron

Massively multilingual pronunciation mining

Undoing casefolding?

jhdeov opened this issue

The command line lets the user choose whether to apply casefolding, so that an entry like English can either be kept as English or lowered to english. But for the scraped data on the repo, it seems you apply casefolding by default. Would it be more useful if the online data didn't do casefolding? That way,

  • If the user wanted the original data (with the original casing), they could just use the scraped data online instead of running WikiPron on the terminal.
  • If the user wanted the casefolded data, they could take the un-casefolded data from the repo and apply casefolding on their own machine (a simple, fast Excel function).

Right now, if the user wants the original casing, they have to run the terminal option (which takes a while).
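For reference, applying casefolding yourself is trivial. A minimal sketch, assuming the two-column TSV layout (word, tab, pronunciation) the repo's data files use; the function name is illustrative:

```python
import csv

def casefold_tsv(in_path: str, out_path: str) -> None:
    """Casefold the word column of a word<TAB>pronunciation TSV."""
    with open(in_path, encoding="utf-8", newline="") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for word, pron in reader:
            # str.casefold() is a more aggressive lower() suited to
            # caseless matching across scripts.
            writer.writerow([word.casefold(), pron])
```

So distributing the un-casefolded data would not cost users who want the casefolded form much effort.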

I'm not opposed. Would you send a PR? You'll just remove casefold: true from languages.json and run the scrape.

Just to confirm, you mean delete casefold: true and not simply change it to casefold: false?
Sadly, I don't think I have a good enough computer/internet to rescrape everything :(

Did a PR
I wonder if the various cleanup processes (casefolding, syllable-boundary removal, stress removal, etc.) could be turned into a single script. That way the WikiPron scrape would have the pure form of everything, and an interested user could run a cleanup script to apply all the default casefolding, etc.

We have a hint of this in our notion of "filtered" vs. "unfiltered"; this could just be an additional layer.

I was working on this and trying to run step 1 of "the big scrape", but I ran into a weird error with some languages not being recognized, details here.