EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages

Home Page: https://emilhvitfeldt.github.io/textdata/

Add Stanford GloVe Embeddings Datasets

jonthegeek opened this issue

I'd like to add the GloVe pre-trained word vectors, for use in tidymodels/textrecipes#20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are 4 downloads, which break down like this:

  • glove.6B.zip = 4 datasets
  • glove.42B.300d.zip = 1 dataset
  • glove.840B.300d.zip = 1 dataset
  • glove.twitter.27B.zip = 4 datasets

The first one is all I directly need right now, but it seems worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like this should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments for the specifics (something like dataset_glove({normal stuff plus}, token_set, dimensions)).
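A rough sketch of what that single-function interface might look like, assuming the usual textdata arguments (dir, delete, return_path) and treating dataset_glove, token_set, and dimensions as placeholder names rather than a final API:

    # Minimal sketch only: dataset_glove(), token_set, and dimensions are
    # hypothetical names from the proposal above, not a final API.
    dataset_glove <- function(dir = NULL,
                              token_set = c("6B", "42B", "840B", "twitter.27B"),
                              dimensions = 100,
                              delete = FALSE,
                              return_path = FALSE) {
      token_set <- match.arg(token_set)
      file <- paste0("glove.", token_set, ".", dimensions, "d.txt")
      # download_glove()/process_glove() (also hypothetical) would fetch the
      # zip and parse the requested file into a tibble here.
      message("Would load ", file)
      invisible(file)
    }

    dataset_glove(token_set = "6B", dimensions = 100)
    #> Would load glove.6B.100d.txt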

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).

This sounds good.

It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.
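For concreteness, one of those four functions might look roughly like this (a sketch only; the argument names mirror the proposal above and are not final):

    # Hypothetical signature: glove.6B.zip holds four sub-datasets, so a
    # dimensions argument selects which one to load.
    embedding_glove6b <- function(dir = NULL,
                                  dimensions = c(50, 100, 200, 300),
                                  delete = FALSE,
                                  return_path = FALSE) {
      dimensions <- match.arg(as.character(dimensions),
                              choices = c("50", "100", "200", "300"))
      message("Would load glove.6B.", dimensions, "d.txt")
    }

    # embedding_glove42b() and embedding_glove840b() would drop the dimensions
    # argument (each of those zips holds a single 300d dataset), while
    # embedding_glove27b() would offer c(25, 50, 100, 200).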

I did a little write-up of what should be done to make a new data set work:
https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html

If you need an example of how this procedure works, look at commit 7ce4e42.

Please feel free to ping me if you have any questions or problems.

OK, that sounds good. The downloads will be separate, but I'll add a parameter to the dataset_ function to load the appropriate sub-dataset (for 6B and 27B). I should have a PR for this within the next couple of hours, depending on what other distractions come up.
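In other words, the eventual user-facing call would look something like the following (assuming the embedding_ naming agreed above and a dimensions argument; both are still subject to the PR):

    library(textdata)

    # Load the 100-dimensional vectors from glove.6B.zip; the zip would only
    # need to be downloaded and parsed once, which is the point of textdata.
    glove6b_100d <- embedding_glove6b(dimensions = 100)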