CUNY-CL / wikipron

Massively multilingual pronunciation mining

Including both broad and narrow transcriptions in German scrapes

sonofthomp opened this issue

Many German words (about 5,000) have a narrow transcription but no broad transcription on Wiktionary (example). These narrow transcriptions aren't included in scrapes by default; you only get them if you add the --narrow parameter to the command (e.g. wikipron deu --narrow). However, with --narrow, WikiPron scrapes narrow transcriptions exclusively and skips the broad ones.

If we're trying to parse a language like German, wouldn't it be better to have a way of parsing both broad and narrow transcriptions at the same time? Or would the resulting scrape not be useful, because it would be ambiguous whether each transcription is broad or narrow? I don't have a strong background in linguistics so I have no idea, personally.

Actually, my bad, I hadn't seen how broad and narrow transcriptions were kept in separate files. That approach makes a lot more sense.

Yeah, we definitely want to store them separately. It seems unlikely that you'd use them in a combined fashion without also knowing which is which (and then you'd need some convention for indicating that). Now, if it were possible to scrape both simultaneously (and send them to the separate files) we could speed up the big scrape by a bit, but it's not an urgent thing.
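To make the "scrape both simultaneously and send them to separate files" idea concrete, here is a minimal sketch. It is not WikiPron's code and does not use its API. It assumes the scraper could hand over each pronunciation with its delimiters still attached, and that slashes mark broad transcriptions and square brackets mark narrow ones, as is conventional on Wiktionary. The names `classify` and `split_scrape` are hypothetical.

```python
from typing import Iterable, TextIO, Tuple


def classify(raw: str) -> Tuple[str, str]:
    """Return (kind, transcription) for a delimited IPA string.

    Assumption: /.../ marks a broad transcription, [...] a narrow one.
    """
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == "/" and raw[-1] == "/":
        return "broad", raw[1:-1]
    if len(raw) >= 2 and raw[0] == "[" and raw[-1] == "]":
        return "narrow", raw[1:-1]
    raise ValueError(f"unrecognized transcription delimiters: {raw!r}")


def split_scrape(
    entries: Iterable[Tuple[str, str]],
    broad_out: TextIO,
    narrow_out: TextIO,
) -> Tuple[int, int]:
    """Write (word, raw_pron) pairs to the matching TSV stream in one pass.

    Returns the number of rows written as (broad_count, narrow_count).
    """
    sinks = {"broad": broad_out, "narrow": narrow_out}
    counts = {"broad": 0, "narrow": 0}
    for word, raw in entries:
        kind, pron = classify(raw)
        sinks[kind].write(f"{word}\t{pron}\n")
        counts[kind] += 1
    return counts["broad"], counts["narrow"]
```

Called with two open file handles (say, for deu-broad.tsv and deu-narrow.tsv), this visits each entry once and keeps the two kinds apart, so nothing downstream has to guess which is which.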

I just ran the current WikiPron (command line) over German. Line counts for the two output files:

  47980 deu-broad.tsv
  16070 deu-narrow.tsv
  64050 total