laurieburchell / open-lid-dataset

Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)

Make the dataset and model available on the Hugging Face Hub

davanstrien opened this issue

Really awesome work! I've been doing some work around automatic language detection (https://huggingface.co/blog/huggy-lingo) so I'm really happy to see more models and datasets in this space!

It would be really nice to also share the model + dataset on the huggingface_hub. I've already uploaded the model (https://huggingface.co/davanstrien/OpenLID?text=I+like+you.+I+love+you) but it would be nicer to transfer this to the Wikimedia organisation (https://huggingface.co/wikimedia). Happy to do this if you are okay with that?

Thanks for your interest and putting the model up. I actually have my own HF account (laurievb), would it be possible to transfer the model directly to me and then add links to Wikimedia and the other projects that use it?

As for making the data available through HF, different parts of the dataset are covered by different licenses (available in the repo). Does HF have the ability to make the dataset available whilst respecting the multiple license terms?

> Thanks for your interest and putting the model up. I actually have my own HF account (laurievb), would it be possible to transfer the model directly to me and then add links to Wikimedia and the other projects that use it?

I just transferred the model to your account :)
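For anyone who wants to try the model from its new home, here is a minimal sketch. It assumes the repo ended up at `laurievb/OpenLID` and that the fastText weights are stored in a file called `model.bin`; neither name is confirmed in this thread, so adjust them to whatever the model card says. It also assumes the `fasttext` and `huggingface_hub` packages are installed.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def strip_label_prefix(label: str) -> str:
    """Turn a fastText label such as '__label__eng_Latn' into 'eng_Latn'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def predict_language(
    text: str,
    repo_id: str = "laurievb/OpenLID",  # assumed repo name, not confirmed here
    filename: str = "model.bin",        # assumed file name, not confirmed here
):
    """Download the model from the Hub and return (label, score) for `text`."""
    # Imported lazily so strip_label_prefix works without these packages.
    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    model = fasttext.load_model(model_path)
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, scores = model.predict(text.replace("\n", " "), k=1)
    return strip_label_prefix(labels[0]), float(scores[0])
```

Calling `predict_language("I like you. I love you")` should then return the top label and its score. Note that this sketch reloads the model on every call, so for batch use you would want to load it once and reuse it.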

> As for making the data available through HF, different parts of the dataset are covered by different licenses (available in the repo). Does HF have the ability to make the dataset available whilst respecting the multiple license terms?

The approach we usually take for these situations is to add an 'other' licence to the dataset overall. You could then either specify the licence breakdown in the dataset card (as you've done in this repo) or you could directly add the licence as a new column for the dataset (this is easier if you already have a nice mapping between the dataset and the licence).
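To make the licence-column option concrete, here is a rough sketch of how it could look. It assumes each TSV row has three tab-separated fields (text, language code, source), which I haven't checked against the actual file, and the source names and licences in the mapping are made-up placeholders rather than the real breakdown from this repo.

```python
# Hypothetical source-to-licence mapping; the real breakdown lives in the repo.
SOURCE_TO_LICENCE = {
    "example-corpus-a": "CC BY 4.0",
    "example-corpus-b": "CC BY-SA 4.0",
}


def add_licence_column(tsv_lines, source_to_licence, default="other"):
    """Yield each TSV row with a licence field appended.

    Assumes three tab-separated fields per row: text, language, source.
    Sources missing from the mapping fall back to `default`.
    """
    for line in tsv_lines:
        text, language, source = line.rstrip("\n").split("\t")
        licence = source_to_licence.get(source, default)
        yield "\t".join([text, language, source, licence])
```

Because it is a generator, this works line by line on an open file handle, so the whole dataset never needs to fit in memory.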

In terms of sharing the dataset via the Hub, you could directly upload the current TSV file (compressed is fine) and it should work. I've also uploaded a version of the dataset in Parquet format here (https://huggingface.co/datasets/davanstrien/open-lid-dataset), which has the advantage of supporting streaming and being a little quicker to load. I can transfer that to your account if you like?
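As a rough illustration of what streaming looks like in practice, the sketch below uses `load_dataset(..., streaming=True)` from the `datasets` library to read rows lazily instead of downloading everything first. The split name `"train"` is an assumption on my part, and `preview` and `stream_from_hub` are just illustrative helper names.

```python
from itertools import islice
from typing import Any, Iterable, List


def preview(rows: Iterable[Any], n: int = 5) -> List[Any]:
    """Return the first n items, reading no further than that."""
    return list(islice(rows, n))


def stream_from_hub(repo_id: str, split: str = "train"):
    """Open a Hub dataset in streaming mode (needs `datasets` and network).

    The default split name "train" is an assumption, not confirmed here.
    """
    # Imported lazily so preview() works without the datasets package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

Something like `preview(stream_from_hub("davanstrien/open-lid-dataset"))` would then pull just the first five rows, which is handy for a quick look at the columns.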

Thanks for transferring the model! I will update the model card today. Could you please transfer the Parquet dataset to my account as well? I'll add the information about the licenses to the repo on HF.

> Thanks for transferring the model! I will update the model card today. Could you please transfer the Parquet dataset to my account as well? I'll add the information about the licenses to the repo on HF.

Just moved that to https://huggingface.co/datasets/laurievb/open-lid-dataset :)

Thank you! I updated the model card so I'll close the issue.