EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages

Home Page: https://emilhvitfeldt.github.io/textdata/

Add Stanford GloVe Embeddings Datasets

jonthegeek opened this issue

I'd like to add the GloVe pre-trained word vectors, for use in tidymodels/textrecipes#20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are 4 downloads, which break down like this:

  • glove.6B.zip = 4 datasets
  • glove.42B.300d.zip = 1 dataset
  • glove.840B.300d.zip = 1 dataset
  • glove.twitter.27B.zip = 4 datasets

The first one is all I directly need right now, but it seems worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like this should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments for the specifics (something like dataset_glove({normal stuff plus}, token_set, dimensions)).
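A rough sketch of what that single-function interface might look like, assuming the usual textdata arguments (dir, delete, return_path) and treating dataset_glove, token_set, and dimensions as placeholder names rather than a final API:

    # Minimal sketch only: dataset_glove(), token_set, and dimensions are
    # hypothetical names from the proposal above, not a final API.
    dataset_glove <- function(dir = NULL,
                              token_set = c("6B", "42B", "840B", "twitter.27B"),
                              dimensions = 100,
                              delete = FALSE,
                              return_path = FALSE) {
      token_set <- match.arg(token_set)
      file <- paste0("glove.", token_set, ".", dimensions, "d.txt")
      # download_glove()/process_glove() (also hypothetical) would fetch the
      # zip and parse the requested file into a tibble here.
      message("Would load ", file)
      invisible(file)
    }

    dataset_glove(token_set = "6B", dimensions = 100)
    #> Would load glove.6B.100d.txt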

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).

This sounds good.

It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.
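For concreteness, one of those four functions might look roughly like this (a sketch only; the argument names mirror the proposal above and are not final):

    # Hypothetical signature: glove.6B.zip holds four sub-datasets, so a
    # dimensions argument selects which one to load.
    embedding_glove6b <- function(dir = NULL,
                                  dimensions = c(50, 100, 200, 300),
                                  delete = FALSE,
                                  return_path = FALSE) {
      dimensions <- match.arg(as.character(dimensions),
                              choices = c("50", "100", "200", "300"))
      message("Would load glove.6B.", dimensions, "d.txt")
    }

    # embedding_glove42b() and embedding_glove840b() would drop the dimensions
    # argument (each of those zips holds a single 300d dataset), while
    # embedding_glove27b() would offer c(25, 50, 100, 200).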

I did a little write-up of what should be done to make a new data set work:
https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html

If you need an example of how this procedure works, look at commit 7ce4e42.

Please feel free to ping me if you have any questions or problems.

OK, that sounds good. The downloads will be separate, but I'll add a parameter to the dataset_ function to load the appropriate sub-dataset (for 6B and 27B). I should have a PR for this within the next couple of hours, depending on what other distractions come up.
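In other words, the eventual user-facing call would look something like the following (assuming the embedding_ naming agreed above and a dimensions argument; both are still subject to the PR):

    library(textdata)

    # Load the 100-dimensional vectors from glove.6B.zip; the zip would only
    # need to be downloaded and parsed once, which is the point of textdata.
    glove6b_100d <- embedding_glove6b(dimensions = 100)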