CUNY-CL / wikipron

Massively multilingual pronunciation mining

New columns in the summary tables

kylebgorman opened this issue · comments

This issue suggests some minor enhancements to the autogenerated tables in data/src/README.md and data/src/tsv_summary.tsv.

  1. The column Phonetic/Phonemic currently has labels like Phonetic_Filtered. These labels combine two independent pieces of information. A new column "Filtered" (with True and False values) should be introduced instead.
  2. Arguably, dialect information should be in a separate column from language information. For languages without a dialect specification, the column would just be blank.
  3. PR #364, which addresses issue #319, adds script information to every file. This should go in a separate column from language, dialect, etc.

Example entry under the proposed layout:

TSV | wel | Welsh | North Wales | Latin | True | Phonemic | True | 8,477 
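
As a rough illustration of item 1, here is a minimal sketch of splitting a combined label into two separate values. The function name `split_label` is hypothetical, and the sketch assumes the only suffix in use is `_Filtered`; it is not taken from the wikipron codebase.

```python
from typing import Tuple


def split_label(label: str) -> Tuple[str, bool]:
    """Split a combined label into (transcription level, filtered flag).

    Assumption: labels look like "Phonetic", "Phonemic",
    "Phonetic_Filtered", or "Phonemic_Filtered".
    """
    level, sep, suffix = label.partition("_")
    # Reject any suffix other than "Filtered" rather than guessing.
    if sep and suffix != "Filtered":
        raise ValueError(f"Unexpected label: {label!r}")
    return level, bool(sep)


print(split_label("Phonetic_Filtered"))
print(split_label("Phonemic"))
```

The two returned values would then populate the Phonetic/Phonemic column and the new "Filtered" column independently.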

The problem of data/phones/README.md coming after a long list of .phones files has been mentioned in the past (e.g. in one of your comments on #360). Maybe we could add a True/False column as well to indicate which languages/dialects have phonelists?

I guess that's an alternative to my proposal in #360, @ajmalanoski? Seems like it would make it hard to report other useful information (e.g., number of phones) or link to the filtered ones, right?

Oops, I totally missed your proposed reorganization. And now that I think about it a little more, the proposed "Filtered" column would be identical to a hypothetical "Phonelist" column (since Filtered=True entails Phonelist=True, and if the instructions in data/phones/HOWTO.md are followed, then the reverse will be true as well).
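
If it helps, the claimed equivalence could be checked mechanically. This is only a sketch: the row structure and the keys `"filtered"` and `"phonelist"` are hypothetical and do not reflect how the summary is actually generated.

```python
from typing import Dict, List


def inconsistent_entries(rows: List[Dict]) -> List[Dict]:
    """Return rows where the filtered flag and the phonelist flag disagree.

    Assumption: each row is a dict with boolean "filtered" and
    "phonelist" keys (hypothetical names).
    """
    return [row for row in rows if row["filtered"] != row["phonelist"]]


rows = [
    {"code": "aaa", "filtered": True, "phonelist": True},
    {"code": "bbb", "filtered": False, "phonelist": True},
]
# Any row listed here has a phonelist but no filtered file, or vice versa.
print(inconsistent_entries(rows))
```

An empty result would mean the two columns carry the same information, so only one of them is needed.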