NbAiLab / notram

Norwegian Transformer Model


Could not get the train json by gsutil

erichen510 opened this issue · comments

The error message is: "does not have storage.objects.list access to the Google Cloud Storage bucket."

commented

Exactly what url are you trying to retrieve?

Are you authenticated on gcloud?

The exact command is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_social_media/splits/social_train.jsonl social_train.json
How do I get authorization on gcloud? Am I supposed to join the project?

commented

You are trying to access a non-open dataset. Where was this linked from?

The link is from this section of the guide:

## RoBERTa

I want to pretrain RoBERTa-large on this corpus. If I cannot get the JSON, where should I get the original corpus?
I notice that https://huggingface.co/datasets/NbAiLab/NCC lists the datasets; could you tell me how to convert the original data to the JSON format required by run_mlm_flax_stream.py?
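For context, run_mlm_flax_stream.py consumes JSON Lines: one JSON object per line. A minimal sketch of producing that format, assuming the script's default text field name of "text" (check the script's `--text_column_name` equivalent for your version):

```python
import json

# One JSON object per line; the field name "text" is assumed here,
# matching the common default for the Flax MLM streaming script.
records = [
    {"text": "Dette er en norsk setning."},
    {"text": "Her er en til."},
]

with open("social_train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back to verify that each line parses independently.
with open("social_train.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed))  # 2
```

Any corpus you can reshape into this one-record-per-line layout should be usable as a train or validation file.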

Sorry, the link is:
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_official_short/norwegian_colossal_corpus_train.jsonl norwegian_colossal_corpus_train.json

commented

Sorry, that is an internal link in the guide. You should replace it with whatever dataset you have available.

One alternative is of course the NCC (that was released after this tutorial was written).

There are several ways of training on this dataset. Assuming you are using Flax (since you are following the tutorial), a simple way is to specify `--dataset_name NbAiLab/NCC` instead of a train and validation file. Another way is to clone the Hugging Face repo and copy/combine the files from it. NCC is already in JSON Lines format, but it is sharded and gzipped. If you insist on having the files locally, they should be combined and unzipped.
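The combine-and-unzip step can be sketched as below. This is a sketch with made-up shard names; the real NCC repo uses a similar `*.jsonl.gz` layout, so adjust the glob pattern to the actual file names you cloned:

```python
import glob
import gzip
import os
import shutil

os.makedirs("data", exist_ok=True)

# Create two tiny sample shards standing in for the real NCC shards
# (the shard names here are made up for illustration).
samples = {
    "data/train-shard-0001.jsonl.gz": ['{"text": "linje en"}\n', '{"text": "linje to"}\n'],
    "data/train-shard-0002.jsonl.gz": ['{"text": "linje tre"}\n'],
}
for name, lines in samples.items():
    with gzip.open(name, "wt", encoding="utf-8") as f:
        f.writelines(lines)

# Concatenate the decompressed shards, in order, into one local JSONL file.
shards = sorted(glob.glob("data/train-shard-*.jsonl.gz"))
with open("ncc_train.jsonl", "wb") as out:
    for shard in shards:
        with gzip.open(shard, "rb") as f:
            shutil.copyfileobj(f, out)

with open("ncc_train.jsonl", encoding="utf-8") as f:
    total = sum(1 for _ in f)
print(total)  # 3
```

Because JSON Lines files are simply line-oriented, concatenating decompressed shards preserves a valid training file with no extra parsing step.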

commented

Early next year, we will also place the NCC in an open gcloud bucket.