utterworks / fast-bert

Super easy library for BERT based NLP models

Is it possible to retrieve the original data from a fast_bert.data_cls.BertDataBunch object?

Yasminabouzbiba opened this issue

Hi, I need to train my model in the cloud, but my data is confidential, so I'm planning to save the fast_bert.data_cls.BertDataBunch object and use it there:
databunch = BertDataBunch(DATA_PATH, LABEL_PATH, tokenizer='./model/camembert-base', train_file='train_set.csv', val_file='val_set.csv', label_file='labels.txt', text_col='TEXT', label_col='label', batch_size_per_gpu=4, max_seq_length=512, multi_gpu=False, multi_label=False, model_type='camembert-base')

databunch.save()

Then, in the cloud, I would load the databunch and train my model like so:

with open("./data/tmp/databunch.pkl", "rb") as f:
    databunch = pickle.load(f)

cl_learner = BertLearner.from_pretrained_model( databunch, pretrained_path='model/model_out', metrics=metrics, device=device_cuda, logger=logger, output_dir=OUTPUT_DIR, finetuned_wgts_path=WGTS_PATH, warmup_steps=300, multi_gpu=False, multi_label=False, is_fp16=False)

cl_learner.fit(epochs=30, lr=9e-5, validate=True, schedule_type="warmup_cosine", optimizer_type="adamw")

I just want to be sure that the original data cannot be retrieved from the databunch. Is that the case?

Thank you.

I am afraid the data can be retrieved by calling the tokenizer's decode function. The saved databunch does not provide data encryption. I would suggest you delete the databunch file after you finish training the model.

OK, thank you :) In which file can we find this decode function, please?

It's the decode method of the tokenizer object stored in the databunch.
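
To make the point above concrete, here is a minimal sketch of why pickled token ids are not a form of protection. It deliberately does not use fast-bert or the CamemBERT tokenizer: the vocabulary and the `encode` and `decode` helpers are hypothetical stand-ins, and the saved payload is a plain dict rather than a real BertDataBunch. The principle should carry over, though: whoever holds the file also holds the ids and the mapping needed to turn them back into text.

```python
import pickle

# Toy vocabulary standing in for a real tokenizer's vocab (hypothetical).
vocab = ["<unk>", "patient", "record", "is", "confidential"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Map whitespace-separated words to integer ids (unknown words map to 0)."""
    return [token_to_id.get(word, 0) for word in text.split()]


def decode(ids, vocab):
    """Invert encode(): map integer ids back to words."""
    return " ".join(vocab[i] for i in ids)


secret = "patient record is confidential"

# What gets shipped: the token ids plus the vocabulary needed to train with them.
blob = pickle.dumps({"input_ids": [encode(secret)], "vocab": vocab})

# On the receiving side, anyone holding the file can reverse the encoding.
loaded = pickle.loads(blob)
recovered = decode(loaded["input_ids"][0], loaded["vocab"])
print(recovered)
```

With a real subword tokenizer the decoded text may differ slightly from the original (for example in whitespace, or because sequences were truncated to max_seq_length), but the content would typically still be readable.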