CUNY-CL / wikipron

Massively multilingual pronunciation mining

wikipron/data/src reorganization

lfashby opened this issue

wikipron/data/src is host to a lot of different files that do a lot of different things in different places. The unorganized state of wikipron/data/src may be off-putting to new/prospective contributors who don’t want to wade into this large collection of files to try and get their bearings.

I have a few suggestions for things we can do to reorganize wikipron/data/src, though I’d also like suggestions for how (or even if) wikipron/data/src should be reorganized from those who contributed some of these files.

A few things come to mind immediately:

  • We wrote postprocess but for whatever reason only call split.py from within it. We should call generate_tsv_summary.py from postprocess as well and maybe put both split.py and generate_tsv_summary.py in an src/postprocess_helpers subdirectory.
  • All of our .json files should go in a subdirectory (or at least all of them that are not languages.json).
  • The “Big Scrape” Scripts README is out of date; we no longer retry languages 10 times.
  • The scripts which have to do with the phones stuff should be in their own subdirectory? There are quite a few of them now.
  • covering_grammar.py seems a bit homeless. I don’t think we’ve used it.

I’m not sure if putting things in subdirectories of src is a great solution, so any alternatives would be welcome.
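
As a rough illustration of the first suggestion, here is a minimal sketch of a driver that runs both helpers in order. It assumes the helpers sit in a postprocess_helpers directory and take no arguments; the real postprocess may well be a shell script, and the function names build_commands and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, List, Sequence

# Helper names come from the issue text. Their command-line arguments are
# unknown, so this sketch passes none (an assumption).
HELPERS = ("split.py", "generate_tsv_summary.py")


def build_commands(
    helper_dir: Path, helpers: Sequence[str] = HELPERS
) -> List[List[str]]:
    """Returns one interpreter command per helper, in the order given."""
    return [[sys.executable, str(helper_dir / name)] for name in helpers]


def run_all(
    commands: Sequence[List[str]], runner: Callable = subprocess.run
) -> None:
    """Runs each command in turn; check=True stops at the first failure."""
    for command in commands:
        runner(command, check=True)
```

Calling run_all(build_commands(Path("src/postprocess_helpers"))) would then run split.py followed by generate_tsv_summary.py. The runner parameter exists only so the order can be checked without spawning processes.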

I agree this needs to be reorganized. It's a mess.

A few things come to mind immediately:

  • We wrote postprocess but for whatever reason only call split.py from within it. We should call generate_tsv_summary.py from postprocess as well and maybe put both split.py and generate_tsv_summary.py in an src/postprocess_helpers subdirectory.

+1.

The things people need to call, scrape.py and postprocess, should live in some top-level directory; everything else can maybe go in a subdirectory.

  • All of our .json files should go in a subdirectory (or at least all of them that are not languages.json).

I agree these should be in a subdirectory but I don't think it should be organized around their JSONness. I have a proposal below.

  • The “Big Scrape” Scripts README is out of date; we no longer retry languages 10 times.

+1, easy fix.

  • The scripts which have to do with the phones stuff should be in their own subdirectory? There are quite a few of them now.
  • covering_grammar.py seems a bit homeless. I don’t think we’ve used it.

I think that in general the phones directory is poorly organized too, since GitHub still shows the generated README after the long list of .phones files.

I’m not sure if putting things in subdirectories of src is a great solution, so any alternatives would be welcome.

Here's a worked proposal for data/:

README.md  # manually generated, with links to the immediate subdirectories
frequencies/*  # as is
morphology/*  # as Reuben is currently developing
phones/README.md  # autogenerated
phones/postprocess  # new, but just calls whatever you do after you update phones files
phones/lib/*.py  # everything else
phones/phones/*.phones
scrape/scrape.py
scrape/postprocess
scrape/README.md  # autogenerated
scrape/lib/*.py  # everything but scrape.py
scrape/lib/*.json
scrape/tsv/*.tsv
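
To make the mapping above concrete, here is a small sketch that computes where a file would land under the proposed layout, as a dry run before anything is moved. It is only an illustration: proposed_location is a hypothetical helper, it sorts by file name and suffix alone, and the caller has to say which .py files belong to the phones side because the listing does not name them.

```python
from pathlib import PurePosixPath
from typing import FrozenSet, Optional

# Entry points that the proposal keeps directly under scrape/.
ENTRY_POINTS = frozenset({"scrape.py", "postprocess"})


def proposed_location(
    filename: str, phones_scripts: FrozenSet[str] = frozenset()
) -> Optional[PurePosixPath]:
    """Maps a file name to its place in the proposed data/ layout.

    phones_scripts lists .py files that belong under phones/lib; this is
    supplied by the caller. Returns None for files the listing does not
    cover.
    """
    name = PurePosixPath(filename).name
    suffix = PurePosixPath(filename).suffix
    if name in phones_scripts:
        return PurePosixPath("phones/lib") / name
    if name in ENTRY_POINTS:
        return PurePosixPath("scrape") / name
    if suffix == ".phones":
        return PurePosixPath("phones/phones") / name
    if suffix == ".tsv":
        return PurePosixPath("scrape/tsv") / name
    if suffix in {".py", ".json"}:
        return PurePosixPath("scrape/lib") / name
    return None
```

Printing the result for every file in the old directory gives a list of moves that can be reviewed before running any git mv commands.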

Sidebar: I've also asked the error analysis team to begin merging their code and data into data/, so that will be a test case for whatever proposals...

The proposal makes sense to me. It’ll be a pain to move these files around and make sure everything works, but I think it’s worth it.

I suppose in phones/postprocess we can call generate_phones_summary.py and maybe normalize.py? I think those are the only relevant steps.

Should we break this up into smaller tickets or just assign ourselves to sections here? I can tackle the scrape stuff tonight, although maybe it'd be better to hold off until after the covering grammar stuff gets merged.

I think normalize.py goes in phones/lib, since it's only needed to fix problems. I would just make sure there's a postprocess next to each autogenerated README...

If you want to try to break it up into smaller tickets go ahead. I'm terrible at that. Feel free to assign some of it to me.

I think you can do this independently of the CG work, since I asked Arundhati to take her time.

Now that #364 is in, this can go through, methinks.

Aidan and I are both going to move some stuff around. I won't have time to work on this until Thursday.

Should I close the bug @ajmalanoski @lfashby?

There are still a few things left to do. I've got to fix some paths in the big scrape README and we need to take care of the covering grammar/error analysis files. Ultimately I think we want to get rid of data/src entirely.
Also we've got to write up data/README.md.
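
For the README path fixes, one possible aid is a check that flags paths that no longer exist after the move. This is a minimal sketch under a loud assumption: that the README writes paths in backticks with at least one slash, which may not match how the big scrape README is actually formatted. The name stale_paths is hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumption: paths of interest appear in backticks and contain a slash.
_PATH_IN_BACKTICKS = re.compile(r"`([^`\s]+/[^`\s]*)`")


def stale_paths(readme_text: str, root: Path) -> List[str]:
    """Returns backticked paths in the text that do not exist under root."""
    candidates = _PATH_IN_BACKTICKS.findall(readme_text)
    return [p for p in candidates if not (root / p).exists()]
```

Run against the README text with the data directory as root, it returns the references that still point at the old layout.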