CUNY-CL / wikipron

Massively multilingual pronunciation mining

Undoing casefolding?

jhdeov opened this issue

The command line lets the user choose whether to apply casefolding, so that an entry like English can either be kept as English or lowered to english. But for the scraped data on the repo, it seems you apply casefolding by default. Would it be more useful if the online data didn't do casefolding? That way,

  • If the user wanted the original data (with the original casing), they could just use the scraped data online instead of running WikiPron on the terminal.
  • If the user wanted the casefolded data, they could take the un-casefolded data from the repo and apply casefolding on their own machine (a simple, fast Excel function).

Right now, if the user wants the original casing, they have to run the terminal option (which takes a while).
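For reference, applying casefolding yourself is trivial. A minimal sketch, assuming the two-column TSV layout (word, tab, pronunciation) the repo's data files use; the function name is illustrative:

```python
import csv

def casefold_tsv(in_path: str, out_path: str) -> None:
    """Casefold the word column of a word<TAB>pronunciation TSV."""
    with open(in_path, encoding="utf-8", newline="") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for word, pron in reader:
            # str.casefold() is a more aggressive lower() suited to
            # caseless matching across scripts.
            writer.writerow([word.casefold(), pron])
```

So distributing the un-casefolded data would not cost users who want the casefolded form much effort.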

I'm not opposed. Would you send a PR? You'll just remove casefold: true from languages.json and run the scrape.

Just to confirm, you mean delete casefold: true and not simply change it to casefold: false?
Sadly, I don't think I have a good enough computer/internet to rescrape everything :(

Did a PR
I wonder if the various cleanup processes (casefolding, syllable-boundary removal, stress removal, etc.) could be turned into a single script. That way the WikiPron scrape would have the pure form of everything, and an interested user could run a cleanup script to apply all the default casefolding, etc.

We have a hint of this in our notion of "filtered" vs. "unfiltered"; this could just be an additional layer.

I was working on this and trying to run step 1 of "the big scrape", but I ran into a weird error with some languages not being recognized, details here.