rikdz / GraphWriter

Code for "Text Generation from Knowledge Graphs with Graph Transformers"

Exact command line arguments to reproduce results

mnschmit opened this issue · comments

I have followed the instructions as they are listed in the README, i.e., I ran the following commands verbatim in the root folder of the repository (please correct me if I misunderstood anything):

python3 ./train.py -save trained_weights
python3 generator.py -save=trained_weights/11.vloss-3.562062.lr-0.1
mkdir ../outputs
python3 eval.py ../outputs/11.vloss-3.562062.lr-0.1.inputs.beam_predictions.cmdline data/preprocessed.test.tsv

It got me the following output:

Bleu_1:	 17.554950053967314
Bleu_2:	 9.937139335974187
Bleu_3:	 5.843996614371239
Bleu_4:	 3.4978054776396585
METEOR:	 7.639675023911272
ROUGE_L: 15.840991976680046

As this is considerably worse than the results reported in the paper, I assume I have missed something. As others have reported in other issues that they were able to reproduce the results, could someone please post their exact command line arguments?
I suppose the default learning rate (0.1) is wrong, as 0.25 was reported in the paper. However, if the defaults are not the optimal hyperparameters, I am unsure how to achieve the rest of the reported training regime, e.g., the exact scheduling behavior where the learning rate goes down to 0.05 over 5 epochs.

Some other thoughts I had:

  • In the second command I chose epoch 11 because it had the smallest validation loss. I understand that this was the procedure for model selection in the paper. Am I wrong here?
  • Do I have to set a specific random seed to reproduce results? (See the sketch below for what I mean by seeding.)
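(I could not find a seed flag in the repo, so I imagine seeding would take something like the following added near the top of train.py; this is just a sketch of what I mean, with a made-up default value:)

import random
import numpy as np
import torch

def set_seed(seed=1234):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN autotuning is nondeterministic; disabling it trades speed for repeatability
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False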

I got the same results as the author.
It seems that you did not use the title; it should be added as an argument on the command line.
You should use the result of epoch 20 rather than epoch 11, and do not change any parameter except the save path.
A low validation loss does not necessarily mean better results.

Hi @wutaiqiang, thank you for answering!

If I understand the code correctly, adding -title as argument would not use the title as input but only the graph/entities. Did I get that wrong?

I assumed that model selection was based on validation loss because this is the only feature visible from the weight file names. Thank you for the hint! Here are the BLEU scores when I use the weights from epoch 20:

Bleu_1:  13.81998084017321
Bleu_2:  7.975161909290924
Bleu_3:  4.785343961925754
Bleu_4:  2.9152031333545163

Unfortunately, they're worse than what I got with epoch 11.

Could you please post what you typed in exactly when you obtained the same results as reported in the paper? It would be very helpful to have a complete documentation of every step necessary to obtain the good results.

Code:

parser.add_argument("-title",action='store_true',help="do not use title as input, only graph/entities")

So adding -title as an argument would USE the title as input. Here are my results:

Bleu_1: 42.2755
Bleu_2: 27.9623
Bleu_3: 19.6838
Bleu_4: 14.0688

The code that you quoted actually made me believe that adding -title would, as the help text says, not use title as input, only graph/entities. Thank you for pointing that out ^^

Unfortunately, I get a CUDNN_STATUS_EXECUTION_FAILED error with -title now. I suppose I am running out of GPU memory. So there is not really anything I can do about it.
Thanks again for your help trying to reproduce the results!

Maybe you can adjust the batch size; it uses approximately 10GB of GPU memory.

I tried -bsz 16 but that did not help. And I'm also afraid that altering the batch size could worsen the final results. Normally, I have nearly 11GB of GPU memory. I am not sure what else could be the reason for the error...

My parameters:

parser.add_argument("-t1size",default=24,type=int,help="batch size for short targets")
parser.add_argument("-t2size",default=16,type=int,help="batch size for medium length targets")
parser.add_argument("-t3size",default=6,type=int,help="batch size for long targets")

The '-bsz' flag is not used during training; you should adjust t1size through t3size instead.
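For example, to roughly halve memory you could try something like this (the exact values are just my guess for an 11GB card):

python3 ./train.py -save trained_weights -title -t1size 12 -t2size 8 -t3size 3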

CUDNN_STATUS_EXECUTION_FAILED usually means the CUDA version and the Python version (or torchtext, etc.) are not compatible; maybe you can use Anaconda and reinstall your environment.
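To check what your environment actually pairs together, something like this helps (plain PyTorch introspection, nothing repo-specific):

import torch
print(torch.__version__)                 # PyTorch build
print(torch.version.cuda)                # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())    # cuDNN version actually loaded
print(torch.cuda.get_device_name(0))     # the GPU in use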

Thank you! This is very helpful. I will try this!

Hi @wutaiqiang, have you compared the performance with and without -title? Indeed, in my experiment I found that without -title I get a higher BLEU score (14.37), which is close to the paper, while with -title I get a lower BLEU score (13.+). I want to confirm this result because the argument also got me confused.

As for the original issue, I don't think the -title setting could cause a 10-point BLEU decrease. Have you changed the default learning rate, @mnschmit? As your performance at epoch 20 is worse than at epoch 11, it sounds like you changed to a higher learning rate (e.g. 0.25), which causes overfitting.

I reached the conclusion that '-title means using the title' by reading the code rather than by comparing results, e.g.:

if self.args.title:
    # encode the title tokens and build an attention mask over them
    tencs, _ = self.tenc(b.src)
    tmask = self.maskFromList(tencs.size(), b.src[1]).unsqueeze(1)

if self.args.title:
    # attend over the title encoding and concatenate it with the graph context
    a2 = self.attn2(hx.unsqueeze(1), tencs, mask=tmask).squeeze(1)
    a = torch.cat((a, a2), 1)

I did the experiment just now, using the results of epoch 20 for the model with '-title' and without '-title':
Without '-title':

BLEU: 14.2996
METEOR: 18.7

With '-title':

BLEU: 14.0688
METEOR: 18.8525


Hard to believe the result.

I got similar results to yours. I used the '-title' setting and did not change the learning rate; it was always 0.1.
At epoch 20 the output is as follows:
Bleu_1: 20.36
Bleu_2: 9.75
Bleu_3: 5.03
Bleu_4: 2.77
METEOR: 6.04
ROUGE_L: 13.65
At epoch 11 the output is as follows:
Bleu_1: 21.07
Bleu_2: 11.92
Bleu_3: 6.95
Bleu_4: 4.13
METEOR: 7.90
ROUGE_L: 16.2
Do you know what went wrong?

Hi @menggehe, have you changed the path according to this issue? As far as I can remember, that is the only code modification I made.

@menggehe maybe you can use generator.py to generate result.txt and ref.txt (by default it only writes result.txt; you have to modify the code to also get ref.txt), then pass result.txt and ref.txt as the parameters rather than result.txt and test.tsv.
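Assuming both files end up in ../outputs (the file names here are just an example), the eval call would then look something like:

python3 eval.py ../outputs/result.txt ../outputs/ref.txt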

Hi @menggehe, have you changed the path according to this issue?

Yes, I changed the path.

Thanks to @wutaiqiang's comments concerning the batch sizes, I was able to run the code with -title. Here are my BLEU scores for epoch 20:

Bleu_1:  18.633505234444097
Bleu_2:  10.556388209416772
Bleu_3:  6.175432745502417
Bleu_4:  3.676021095744585

So it got a little better but not enough.

@sysu-zjw, I did not modify anything except the path in generator.py as was recommended in the other issue you linked. Then I used the commands as I showed in my original post.

It is very mysterious why some of us can reproduce the results and some can't...
@menggehe, thank you for posting your results, too! It is good to know I am not the only one struggling ^^ Unfortunately, I still do not know what goes wrong though...

What should I do to reproduce the results?

I can't reproduce the same results either. Here are my results running the code with -title:

Bleu_1: 18.937064313624802
Bleu_2: 10.804192932667048
Bleu_3: 6.398713632926152
Bleu_4: 3.8613881960224457
METEOR: 7.894808327890248
ROUGE_L: 15.322604643734424

Hi @wutaiqiang, can you provide the complete commands needed to reproduce the same results as the paper? Thanks!

I got better results than yours, but also can't reproduce the reported results:

Bleu_1: 32.937958268788414
Bleu_2: 22.447175820889868
Bleu_3: 16.162693118259686
Bleu_4: 11.788624343179919
METEOR: 17.268461361418215
ROUGE_L: 27.478546640828654

The reported results are:
[image: screenshot of the results reported in the paper]

Those who were able to reproduce the results, could you please share the command line used and any changes made to the code? I have been trying to reproduce the results for a couple of weeks now and have tried every combination I could derive from the paper, but the results are nowhere close to the published ones.

There are some things you can do to reproduce the results:
1. First, use the title information, i.e., add -title to the train.py command line.
2. eval.py takes the generated results and the original abstracts as input, but data/preprocessed.test.tsv is not the original abstracts; it is the input data. So you should extract the original abstracts into the file you pass to eval.py.
3. The paper says it used 'warm restarts' from 0.25 to 0.05, so you can change the lr at each epoch, like this:
if o.param_groups[0]['lr'] > 0.05:
    o.param_groups[0]['lr'] -= 0.05
if e != 0 and e % 5 == 0:
    o.param_groups[0]['lr'] = 0.25
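My reading of that schedule, simulated standalone (where exactly it sits in train.py's epoch loop is a guess):

lr = 0.25
for e in range(12):
    if lr > 0.05:
        lr = round(lr - 0.05, 2)    # decay by 0.05 toward the 0.05 floor
    if e != 0 and e % 5 == 0:
        lr = 0.25                   # warm restart every 5 epochs
    print(e, lr)
# epochs 0..5 give: 0.2, 0.15, 0.1, 0.05, 0.05, 0.25 (restart), then it repeats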
But even if you reproduce the results, I don't think they can be used in your projects, because the author has not provided the post-processing code. The generated output has many repeated words, especially entities copied from the graph; there are too many repeated entities in the result.
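For what it's worth, a naive post-processing sketch of my own (the authors' post-processing is not released) that collapses immediately repeated tokens:

def dedup_repeats(text):
    # keep a token only if it differs from the previous one
    out = []
    for tok in text.split():
        if not out or tok != out[-1]:
            out.append(tok)
    return " ".join(out)

print(dedup_repeats("uses graph graph transformers transformers"))
# -> "uses graph transformers"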

Thank you very much for your reply. I was able to get the desired results. I had done all of this except the second point you made, which is what helped me achieve it.

I see that the author reported some results that are far lower than those reported in this issue:
[image: results table from the paper]
Is that what you are comparing against? If so, the conversation doesn't add up, or am I missing something here?

@ahhussein for BLEU, the lower the score, the better it is. Oops, I was thinking of perplexity. Disregard!