Including both broad and narrow transcriptions in German scrapes

Question

sonofthomp opened this issue a year ago

Many German words (about 5,000) have a narrow transcription but no broad transcription on Wiktionary (example). These narrow transcriptions aren't included in scrapes by default; you have to add the --narrow parameter to the command (e.g. wikipron deu --narrow). However, if you add the --narrow parameter, WikiPron scrapes narrow transcriptions exclusively (it won't scrape broad transcriptions).

If we're trying to parse a language like German, wouldn't it be better to have a way of parsing both broad and narrow transcriptions at the same time? Or would the resulting scrape not be useful, because it would be ambiguous whether each transcription is broad or narrow? I don't have a strong background in linguistics so I have no idea, personally.

Gabriel Thompson · Answer 1 · Mon Aug 21 2023 21:31:04 GMT+0800 (China Standard Time)

Actually, my bad, I hadn't seen how broad and narrow transcriptions were kept in separate files. That approach makes a lot more sense.

Kyle Gorman · Answer 2 · Mon Aug 21 2023 23:05:40 GMT+0800 (China Standard Time)

Yeah, we definitely want to store them separately. It seems unlikely that you'd use them in a combined fashion without also knowing which is which (and then you'd need some convention for indicating that). Now, if it were possible to scrape both simultaneously (and send them to the separate files) we could speed up the big scrape by a bit, but it's not an urgent thing.
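
For anyone who does want a combined list, here is a minimal sketch of one possible convention for "knowing which is which": append a third column that labels each row. This is not something WikiPron does itself. The two-column layout (word, tab, pronunciation), the helper names tag_rows and combine, the "broad"/"narrow" labels, and the sample rows are all assumptions made up for illustration.

```python
def tag_rows(tsv_text, label):
    """Return (word, pronunciation, label) tuples from two-column TSV text.

    Assumes each non-empty line is "word<TAB>pronunciation"; lines with
    fewer than two columns are skipped.
    """
    rows = []
    for line in tsv_text.splitlines():
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        rows.append((columns[0], columns[1], label))
    return rows


def combine(broad_text, narrow_text):
    """Merge both scrapes, keeping track of where each row came from."""
    return tag_rows(broad_text, "broad") + tag_rows(narrow_text, "narrow")


# Invented sample rows, not actual scrape output.
broad_sample = "wordA\tp1 p2\n"
narrow_sample = "wordA\tp1 p2 p3\nwordB\tp4 p5\n"

combined = combine(broad_sample, narrow_sample)
for word, pron, kind in combined:
    print(f"{word}\t{pron}\t{kind}")
```

With real data you would read the two separate files and pass their contents to combine; keeping the files separate and labeling only at merge time means the originals stay unambiguous.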

Kyle Gorman · Answer 3 · Tue Aug 22 2023 03:51:40 GMT+0800 (China Standard Time)

I just ran a current WikiPron (command line) over German:

  47980 deu-broad.tsv
  16070 deu-narrow.tsv
  64050 total