mozilla / fathom

A framework for extracting meaning from web pages

Home Page: http://mozilla.github.io/fathom/

Find a way to share corpora between Fathom applications

biancadanforth opened this issue

Creating a corpus is a sizable task (see #139), and different applications can likely reuse all or part of existing corpora.

  • Do different end uses of Fathom require different corpora? How do we know?
  • If not, can we have a centralized corpus not unlike ImageNet? How might we best share that corpus with others?

For the former, paraphrasing erikrose:

We must retain detailed rubrics for each corpus, including the strategy we used to find samples. These should be in VCS with the corpus. Then we can look through those rubrics for new applications and see if they’re suitable for reuse.

For the latter, paraphrasing danielhertenstein:

If we had a standardized set of labels (non-exhaustive and people would be welcome to add/change/remove locally), we could start our ImageNet for webpages allowing us to share corpora and reduce the need for more samples.
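To make the label idea concrete, here is a minimal Python sketch of a shared label set with local add/change/remove overrides. Everything in it is an assumption for illustration: the label names, `SHARED_LABELS`, `LocalOverrides`, and both helper functions are hypothetical and are not part of Fathom.

```python
from dataclasses import dataclass, field

# Hypothetical shared vocabulary; these label names are illustrative only.
SHARED_LABELS = frozenset({"price", "title", "image", "login-form"})


@dataclass(frozen=True)
class LocalOverrides:
    """Per-application changes layered on top of the shared label set."""
    add: frozenset = frozenset()                # labels only this application uses
    remove: frozenset = frozenset()             # shared labels this application ignores
    rename: dict = field(default_factory=dict)  # shared name -> local name


def effective_labels(shared, overrides):
    """Return the label names an application actually works with."""
    kept = (set(shared) - overrides.remove) | overrides.add
    return {overrides.rename.get(label, label) for label in kept}


def reusable_labels(shared, overrides):
    """Return the shared labels whose samples the application could reuse."""
    # A renamed label still refers to the same shared samples, so it counts.
    return set(shared) - overrides.remove
```

For example, an application that adds `coupon-code`, drops `login-form`, and renames `title` to `product-title` would work with four local labels, three of which map back to shared labels and could draw on a common corpus.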

I think this has a lot of overlap with #130, which would require all these same questions to be answered. Please re-open if you think there's value to this ticket in addition.