martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).

Home Page: http://style.cs.umass.edu


Mismatch between reported and reproduced ACC and SIM in Formality dataset

minniie opened this issue

Hi :)

I am currently working on a paper on style transfer and was trying to reproduce the scores reported in Table 1 of your paper. I am getting roughly the same scores, except for ACC and SIM on the GYAFC Entertainment_Music split.

The following compares the reported scores with the scores we reproduced:

| Model | Reported ACC | Reproduced ACC | Reported SIM | Reproduced SIM |
|-------|--------------|----------------|--------------|----------------|
| STRAP 0.0 | 67.7 | 71.3 | 72.5 | 60.9 |
| STRAP 0.6 | 70.7 | 73.5 | 69.9 | 58.7 |
| STRAP 0.9 | 76.8 | 79.3 | 62.9 | 52.4 |

A general trend is that the reproduced scores have higher ACC and lower SIM.
The "reproduced SIM" scores were averaged after comparing against the four .ref* files in GYAFC/Entertainment_Music/test.

Qualitatively examining your results, one possible reason for the discrepancy might be the presence of `&apos`, `&quot`, and `&amp` in your outputs. Should I manually replace these with ', ", and & before evaluating? Were the currently reported scores measured after such a replacement? FYI, the example outputs in this repo also contain `&apos`, `&quot`, and `&amp`.

I think it is fair to replace them in order to compare your outputs with the other baselines and ours, but I didn't want to edit any of your outputs without your permission. The snippet below sketches the kind of replacement I have in mind.
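(Rough sketch only, not code from this repo; the file names are placeholders for the generated outputs being evaluated.)

```python
# Hypothetical cleanup pass over generated outputs before scoring:
# undo the XML-style escapes seen in the released example outputs.
ENTITY_MAP = {"&apos;": "'", "&quot;": '"', "&amp;": "&",
              "&apos": "'", "&quot": '"', "&amp": "&"}  # handle both escaped forms

def unescape_line(line: str) -> str:
    # Replace longer entities first so "&amp;" is not mangled by the bare "&amp" rule.
    for entity in sorted(ENTITY_MAP, key=len, reverse=True):
        line = line.replace(entity, ENTITY_MAP[entity])
    return line

with open("transferred_outputs.txt") as f_in, \
        open("transferred_outputs.clean.txt", "w") as f_out:
    for line in f_in:
        f_out.write(unescape_line(line))
```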

I also measured SIM against the original inputs just in case, and got scores closer to your reported SIM: 71.4, 68.3, and 60.4 for STRAP 0.0, 0.6, and 0.9, respectively. My understanding is that the reported scores should come from comparison with the reference sentences, NOT the original sentences, but it was a little strange that the reported scores look closer to SIM against the original sentences.

If neither of these is the cause, could you tell me what the possible reason might be, and whether I should report my reproduced scores or the previously reported scores for ACC and SIM in my paper?

I would really appreciate any guidance. By the way, awesome work; my coworkers and I were really inspired. Thanks in advance! :)

Hi @minniie, thanks for reaching out! Are these reproduced numbers from the pretrained checkpoint, or from models you trained yourself using the provided scripts? If it's the latter, there could be some variation from run to run.

Another thing I wanted to check: are you using the provided evaluation script, which takes the max SIM score (instead of the average) over the four provided references at the instance level? You mentioned you took an average, but was that across instances only, or across references as well?
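To make the distinction concrete, here is a rough illustration (not the actual evaluation code; `sims` just stands in for the per-reference SIM scores of each output):

```python
# Illustration only: sims[i][j] = SIM of output i against reference file j
# (GYAFC provides four references per test instance).
def aggregate(sims, use_max=True):
    per_instance = [max(row) if use_max else sum(row) / len(row) for row in sims]
    return sum(per_instance) / len(per_instance)

sims = [[0.62, 0.71, 0.58, 0.66]]      # one output vs. its four references
print(aggregate(sims, use_max=True))   # 0.71   (max over references, as the provided script does)
print(aggregate(sims, use_max=False))  # 0.6425 (average over references, noticeably lower)
```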

Hi, thank you so much for your reply.

First, we downloaded the pretrained checkpoint and used it only for inference.

We used the script at https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/style_paraphrase/evaluation/scripts/evaluate_formality.sh and only changed line 48 so that it takes the four .ref* files instead of the original inputs concatenated in reverse order, since that is not the correct reference for the GYAFC dataset.

formality.log is the log file from running the above script as-is, which gives strange scores in the generated-vs-gold section. The reason for this, I believe, is that unlike the Shakespeare dataset, the two test sets in GYAFC are not parallel and thus cannot serve as references for each other.

After making our edit, our scores are in formality_after_edit.log. As you pointed out, we averaged the scores manually instead of taking the max SIM score. Taking the max, the scores are 71.46, 68.88, and 61.71, which are still about one point lower than the reported scores, but much closer to them than our averages.

Hi, thanks for getting back.

I agree that line 48 doesn't make sense (since the sizes of the formal/informal sets are different), but the file created in line 48 is not used for evaluation (we use the ref* files instead); see https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/style_paraphrase/evaluation/scripts/evaluate_formality.sh#L66, which is unlike the corresponding line in the Shakespeare script. Did you make any other edits to the script?

Yes, a ~1% difference could be due to preprocessing. p=0.6/0.9 can vary due to stochasticity, but p=0.0 should give you a number similar to the one reported in the paper. How did you create the data splits? If you created them on your own, I could send over preprocessed files which do some tokenization etc. (see point 3 here: https://github.com/martiansideofthemoon/style-transfer-paraphrase#datasets).

Hi,

Now that I look at the script more carefully, line 48 shouldn't be the issue, so sorry about that. My teammate actually ran the code and sent me the results, so I'd have to check with them whether any other changes were made. But line 66 makes total sense to me, and I believe we're doing the same thing, given the similar scores.

Also, we may have missed the part about asking you for the preprocessed dataset, so my teammate preprocessed it himself. The splits were the same as the original GYAFC splits, which come with train/dev/test, but I agree there might be slight differences from BPE tokenization. Could you send the preprocessed files to mylee@princeton.edu?

Again, so sorry about our hastiness. We're somewhat short on time and I wasn't careful enough to check everything thoroughly. We'll try again once we receive the preprocessed files, and hopefully that solves the issue.

Oh, no worries! I think the preprocessing will fix the issues for p=0.0. Could you write me an email (kalpesh@cs.umass.edu) with a screenshot of the email from Joel Tetreault confirming you have access to GYAFC? I'll promptly send the preprocessed version after I get it. (I know you probably do have access, but I'm just following protocol since GYAFC is not publicly available.)

Awesome! Just sent it :)

Awesome, sent you the dataset. I'm closing this issue for now, but feel free to re-open it if the issue persists.