nlpyang / BertSum

Code for paper Fine-tune BERT for Extractive Summarization


Problems with my own dataset and with the format_to_bert function.

DarlineFiedler opened this issue · comments

Maybe you can help me. For my final paper I have to use BertSum with my own data. The data consists of title and abstract pairs, and the goal is to generate a title from the abstract.
At the moment I'm stuck on the question of where to insert my dataset into the model.
Furthermore, I cannot open a .story file, so I don't know exactly how the original dataset is structured.

Maybe you can help me to customize BertSum.
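
For context, the CNN/DailyMail .story files are plain text: the article body comes first, followed by one or more `@highlight` markers, each introducing a reference summary sentence. A minimal sketch of laying out a title/abstract pair in that way could look like the following. The function name `to_story`, the sample strings, and the choice to use the title as the only highlight are assumptions for illustration and are not part of BertSum.

```python
def to_story(abstract: str, title: str) -> str:
    """Lay out one abstract/title pair like a CNN/DailyMail .story file."""
    # Source text first, then the reference summary after an @highlight marker.
    return f"{abstract.strip()}\n\n@highlight\n\n{title.strip()}\n"


# Made-up sample pair, used only to show the resulting layout.
story = to_story(
    abstract="We fine-tune a pretrained encoder for extractive summarization.",
    title="Fine-tuning an encoder for summarization",
)
print(story)
```

Each pair would then be written to its own .story file before running the earlier preprocessing steps.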

I've got a different problem now.
It does not matter if I use the example file or my own. I always get the same message.
When I use the -format_to_bert function I get the following error:

(base) D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src>python preprocess.py -mode format_to_bert -raw_path ../json_data -save_path ../bert_data -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
[('../json_data\cnndm_sample.train.0.json', Namespace(dataset='', log_file='../logs/preprocess.log', lower=True, map_path='../data/', max_nsents=100, max_src_ntokens=200, min_nsents=3, min_src_ntokens=5, mode='format_to_bert', n_cpus=4, oracle_mode='greedy', raw_path='../json_data', save_path='../bert_data', shard_size=2000), '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt')]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "D:\Programme\anaconda3\lib\site-packages\multiprocess\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src\prepro\data_builder.py", line 273, in _format_to_bert
torch.save(datasets, save_file)
File "D:\Programme\anaconda3\lib\site-packages\torch\serialization.py", line 209, in save
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "D:\Programme\anaconda3\lib\site-packages\torch\serialization.py", line 132, in _with_file_like
f = open(f, mode)
FileNotFoundError: [Errno 2] No such file or directory: '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "preprocess.py", line 63, in <module>
eval('data_builder.'+args.mode + '(args)')
File "<string>", line 1, in <module>
File "D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src\prepro\data_builder.py", line 212, in format_to_bert
for d in pool.imap(_format_to_bert, a_lst):
File "D:\Programme\anaconda3\lib\site-packages\multiprocess\pool.py", line 748, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt'
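
The path in the error message contains an unexpected `bert.pt_data` directory. That is the result one would get if the output name were built by splitting the input path on '/' only and then replacing every occurrence of 'json', because the paths found on Windows use backslashes. The sketch below reproduces that effect and shows a safer alternative. Both helper names are hypothetical, and the "naive" version is an assumption about how the path was derived, not a quote of the BertSum code.

```python
from pathlib import PureWindowsPath


def naive_output_name(json_path: str) -> str:
    # Assumed behaviour: split on '/' only, then replace every 'json' substring.
    return json_path.split("/")[-1].replace("json", "bert.pt")


def safer_output_name(json_path: str) -> str:
    # PureWindowsPath treats both '/' and '\\' as separators on any platform.
    name = PureWindowsPath(json_path).name
    # Swap only the trailing extension, not every 'json' in the name.
    if name.endswith(".json"):
        name = name[: -len(".json")] + ".bert.pt"
    return name


path = "../json_data\\cnndm_sample.train.0.json"
print(naive_output_name(path))  # bert.pt_data\cnndm_sample.train.0.bert.pt
print(safer_output_name(path))  # cnndm_sample.train.0.bert.pt
```

With the naive name, saving fails unless a `bert.pt_data` folder happens to exist inside the save path.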

I solved the problem above. When I run the function now, I get two pairs of empty brackets. Is that right?

(base) D:\Studium\Bachelor Arbeit\Bachlorarbeit\Bachelorarbeit\BERT\BertSum\BertSum\src>python preprocess.py -mode format_to_bert -raw_path ../json_data -save_path ../bert_data -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
[('../json_data\cnndm_sample.train.0.json', Namespace(dataset='', log_file='../logs/preprocess.log', lower=True, map_path='../data/', max_nsents=100, max_src_ntokens=200, min_nsents=3, min_src_ntokens=5, mode='format_to_bert', n_cpus=4, oracle_mode='greedy', raw_path='../json_data', save_path='../bert_data', shard_size=2000), '../bert_data\bert.pt_data\cnndm_sample.train.0.bert.pt')]
[]
[]

commented

Empty brackets mean that some function received no input, so this is not correct. I remember having the same problem, but I don't remember how I solved it. I suggest running in debug mode by adding "-m pdb" to the command, i.e. "python -m pdb preprocess.py" followed by the usual arguments. Then you can print variables and check what is going wrong.

Thank you. I tried "-m pdb" and got an AttributeError, but I don't know exactly what it tells me, or rather, I don't know how to fix it.

The exact error is this:
--Return--

D:\studium\bachelorarbeit\bachlorarbeit\bachelorarbeit\bert\bertsum\bertsum\src\preprocess.py(63)<module>()->None
-> eval('data_builder.'+args.mode + '(args)')
(Pdb) next
AttributeError: module '__main__' has no attribute '__spec__'
<string>(1)<module>()->None

commented

I think I remember now: I solved it by copying the following line into the preprocess module, where the argparse arguments are defined.
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
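
As a rough sketch of where such a line could go: one common placement is at the top of the `if __name__ == "__main__":` block, before any worker pool is created, so that it becomes a module-level attribute of the main script. The layout below is an assumed, simplified stand-in for preprocess.py (only two of its flags are shown, and `parse_args` is a hypothetical helper), so the exact spot in the real file may differ.

```python
import argparse


def parse_args(argv=None):
    # Simplified stand-in for the argument parser; only two flags are shown.
    parser = argparse.ArgumentParser()
    parser.add_argument("-mode", default="format_to_bert", type=str)
    parser.add_argument("-n_cpus", default=2, type=int)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Workaround line from the comment above, set before any pool is started.
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
    # An empty list keeps this sketch runnable without flags;
    # a real script would call parse_args() to read the command line.
    args = parse_args([])
    print(args.mode, args.n_cpus)
```

This only concerns the AttributeError seen when running under pdb; it does not by itself change which input files are found.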

I added this line to preprocess.py, but maybe in the wrong place.
Or aren't the argparse arguments in preprocess.py?

I still get the empty brackets, but when I run with -m pdb I no longer get an error.
Maybe you can show me the exact spot in the code where the line should go.

Hi @DarlineFiedler, I am also currently writing my bachelor thesis on BertSum. For me, the problem had something to do with the way my json files from step 4 were named. Maybe this comment helps you: #90 (comment)

I also get these empty brackets when I try "cnndm_sample.train.0.json", not only with my own json data.

The empty brackets indicate that no files were found. Try the absolute path and create a file for each category, e.g. cnndm_sample.valid.0.json, cnndm_sample.test.0.json, cnndm_sample.train.0.json. They could all be copies of cnndm_sample.train.0.json.
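
A quick way to check this before running the preprocessing is to list which files match each category. The sketch below assumes the files are looked up with a pattern like `*<split>.*.json` for the splits train, valid and test; that pattern and the helper name `find_split_files` are assumptions based on the file names in this thread, not taken from the BertSum source.

```python
import glob
import os
import tempfile


def find_split_files(raw_path, splits=("train", "valid", "test")):
    """Return the json files found for each split (empty list if none match)."""
    found = {}
    for split in splits:
        pattern = os.path.join(raw_path, "*" + split + ".*.json")
        found[split] = sorted(glob.glob(pattern))
    return found


# Demo with a temporary folder that only contains a train file.
with tempfile.TemporaryDirectory() as raw_path:
    open(os.path.join(raw_path, "cnndm_sample.train.0.json"), "w").close()
    result = find_split_files(raw_path)
    for split, files in result.items():
        print(split, [os.path.basename(f) for f in files])
```

In this demo only the train split has a match; the valid and test lists come back empty, which mirrors one filled list followed by two empty brackets in the output above.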

I have also encountered a few problems with the original BertSum. Because of this I switched to a fork of it: https://github.com/Santosh-Gupta/BertSum and based on this I have created my own fork: https://github.com/tschomacker/BertSum

Thanks, that really helped me a lot. I actually just forgot to create the valid and test data.

Hello @DarlineFiedler and @tschomacker ,
I'm working on this repo and need help to process my own dataset to test the model.
I've followed the guidance and have completed the training with the preprocessed dataset.
However when working on my own dataset, I'm stuck at this step:
Step 4. Format to Simpler Json Files
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower
image
As you can see, no output was printed. I think it's because of the /urls folder. I don't know what it means so can you help me?

As I have indicated previously:
There is a fork https://github.com/Santosh-Gupta/BertSum and based on this I have created my own fork: https://github.com/tschomacker/BertSum . Both fixed this problem. As a starting point look at: https://github.com/tschomacker/BertSum/blob/master/src/prepro/data_builder.py#L246 . I hope this helps :)

@tschomacker I've figured out the problem myself. Thanks very much anyway!

@tschomacker hi, is there any pre-trained model for bertsum under your branch? If so, could you please send me a copy? It would be very useful to me. hannan@stumail.nwu.edu.cn