Find a way to share corpora between Fathom applications
biancadanforth opened this issue · comments
Creating a corpus is a sizable task (see #139), and different applications can likely make use of all or part of an existing corpus.
- Do different end uses of Fathom require different corpora? How do we know?
- If not, can we have a centralized corpus not unlike ImageNet? How might we best share that corpus with others?
For the former, paraphrasing erikrose:
We must retain a detailed rubric for each corpus, including the strategy we used to find samples. These rubrics should live in version control alongside the corpus. When a new application comes along, we can look through the rubrics to see whether an existing corpus is suitable for reuse.
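To make the rubric idea a bit more concrete, here is a minimal sketch of a rubric stored as a JSON file next to its corpus, plus a lookup for a new application. Everything in it is an assumption for illustration: `CorpusRubric`, its field names, and `candidates_for` are hypothetical and not part of Fathom.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CorpusRubric:
    """Hypothetical rubric record; these field names are not a Fathom format."""
    corpus_name: str
    target: str                # what a labeled element represents
    labeling_rules: List[str]  # how labelers decided what to tag
    sampling_strategy: str     # how the sample pages were found


def rubric_to_json(rubric: CorpusRubric) -> str:
    """Serialize a rubric so it can be committed alongside the corpus."""
    return json.dumps(asdict(rubric), indent=2, sort_keys=True)


def rubric_from_json(text: str) -> CorpusRubric:
    """Load a rubric that was written by rubric_to_json."""
    return CorpusRubric(**json.loads(text))


def candidates_for(rubrics: List[CorpusRubric], target: str) -> List[CorpusRubric]:
    """Return the rubrics whose target matches what a new application needs.

    A match only narrows the search; someone still has to read the labeling
    rules and sampling strategy to judge whether the corpus really fits.
    """
    return [r for r in rubrics if r.target == target]
```

An exact match on `target` is deliberately crude. The point is only that a machine-readable rubric lets us shortlist corpora before reading the full rubric text.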
For the latter, paraphrasing danielhertenstein:
If we had a standardized set of labels (non-exhaustive; people would be welcome to add, change, or remove labels locally), we could start our own ImageNet for webpages, which would let us share corpora and reduce the need for more samples.
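One way to picture a standardized label set with local changes is a shared vocabulary that each application copies and adjusts. The sketch below is only illustrative: `SHARED_LABELS`, the label names, and both helper functions are hypothetical, not an existing Fathom feature.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical shared vocabulary: label name -> short definition.
SHARED_LABELS: Dict[str, str] = {
    "price": "Element showing the main product price",
    "title": "Element showing the main product title",
    "image": "Element showing the main product image",
}


def local_vocabulary(
    shared: Dict[str, str],
    add: Optional[Dict[str, str]] = None,
    remove: Iterable[str] = (),
) -> Dict[str, str]:
    """Copy the shared vocabulary and apply local additions and removals.

    Passing an existing name in `add` changes its definition locally.
    The shared vocabulary itself is never modified.
    """
    vocab = dict(shared)
    for name in remove:
        vocab.pop(name, None)
    vocab.update(add or {})
    return vocab


def reusable_labels(local: Dict[str, str], shared: Dict[str, str]) -> Set[str]:
    """Labels whose name and definition still match the shared vocabulary.

    Samples tagged with these labels are the ones that could plausibly be
    shared between applications without relabeling.
    """
    return {name for name, definition in local.items()
            if shared.get(name) == definition}
```

For example, an application that drops `image` and adds its own `coupon` label would still be able to exchange samples labeled `price` and `title` with everyone else.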