Preprocess/filter custom data
l0rn0r opened this issue
Hi,
thanks for your work! I'm trying to use your method to transfer contemporary German text into the style of the Swiss author Jeremias Gotthelf (19th century). I'm at the first step, training the paraphraser - at the moment I have 386k back-translated TED-talk sentences (English translated to German with T5).
Now I want to filter the back-translated corpus; reading #38 gave me a first idea, but there are some points I do not yet understand. Here is what I have understood so far:
- Putting the back-translation data into a TSV file and running https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/prepare_paraphrase_data.py gives me a train and a dev pickle with data lines like:
  `None, None, None, Sentence, BacktranslatedSentence, None, None, None, None`
  These positions stand for
  `tmp1, tmp2, equality, sent1, sent2, f1_scores, kendall_tau_scores, ed_scores, langid_scores`
  as used in https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/parse_paranmt_postprocess.py.
In your paper you describe the filtering steps in Appendix A.1. To get the data ready for `parse_paranmt_postprocess.py`, I have to write my own script:
- Calculate `get_kendall_tau()` and `f1_score()` with `preprocess_utils.py`. Since I only have German sentences, I do not need `langid` for filtering. What are `tmp1` and `tmp2`? Are those `sent1` and `sent2` normalized? What is `ed_scores`?
- Filter by content: calculate the similarity measure with `test_sim.py` and drop results with a score lower than 0.5, then filter by length difference with `parse_paranmt_postprocess.py` (`lendiff_less_`) and by length (you propose 7 to 25 tokens). That gives me the content-filtered dataset.
- Lexical diversity filtering: which SQuAD evaluation scripts did you use? I guess some from here: https://worksheets.codalab.org/worksheets/0xd53d03a48ef64b329c16b9baf0f99b0c Of course I will have to adapt the hard-coded English articles etc. to German in those scripts. In my own script I will then filter on the results of those scripts.
- Syntactic diversity filtering: I'll take the Kendall tau score and filter the dataset with `parse_paranmt_postprocess.py` (`kt_less_`).
- LangID filtering: no need for that here.
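To check my own understanding, a minimal sketch of what such a content-filtering pass could look like (the `sim_score` argument stands in for the output of `test_sim.py`; `max_lendiff` is a hypothetical placeholder value, since the `lendiff_less_` threshold is not specified here):

```python
# Hypothetical sketch of the content-filtering step, not the repo's code.
# Thresholds: similarity >= 0.5 and 7-25 tokens, as discussed above;
# max_lendiff is an assumed placeholder for the lendiff_less_ cutoff.

def keep_pair(sent1_tokens, sent2_tokens, sim_score,
              min_len=7, max_len=25, min_sim=0.5, max_lendiff=5):
    """Return True if a (sentence, paraphrase) pair survives content filtering."""
    if sim_score < min_sim:                      # semantic similarity filter
        return False
    for toks in (sent1_tokens, sent2_tokens):    # length filter (7-25 tokens)
        if not (min_len <= len(toks) <= max_len):
            return False
    # length-difference filter (placeholder threshold)
    if abs(len(sent1_tokens) - len(sent2_tokens)) > max_lendiff:
        return False
    return True

print(keep_pair("ein kurzer Satz über Filter und Daten".split(),
                "ein kurzer Satz über Daten und Filter".split(), 0.9))  # → True
```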
Open questions:
- What are `tmp1` and `tmp2` in the dataset?
- What is `ed_scores`?
- How do I get the lexical diversity with the SQuAD evaluation scripts?
Sorry for the long issue 😄
Thanks for your help in advance!
Hi @l0rn0r,
Thanks for your interest in our work and your detailed issue describing the points of confusion.
`tmp1` and `tmp2` are `benepar` constituency parses of the sentences, as you can see in this file. `ed_scores` are some kind of edit-distance scores between the parses.
Most importantly, none of `tmp1`, `tmp2`, or `ed_scores` were used for filtering the data --- we only used `f1_scores`, `kendall_tau_scores`, `langid`, and sentence lengths. So please ignore those fields; I'm sorry for the confusion they may have caused.
> How to get the lexical diversity with the SQuAD evaluation scripts?
Use this function. We used the `precision` in the paper, but I think `f1_score` is more appropriate if you don't have a length bias in your data like paraNMT (paraNMT is notorious for dropping content).
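For reference, the SQuAD-style token-overlap F1 boils down to roughly the following (omitting the article/punctuation normalization, which would need adapting from English to German anyway):

```python
import collections

def token_f1(pred_tokens, gold_tokens):
    """Token-overlap F1 in the style of the SQuAD evaluation script.

    A sketch of the idea, not the exact script: counts the multiset
    overlap between the two token lists and combines precision/recall.
    """
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("der Hund läuft schnell".split(),
               "der Hund rennt schnell".split()))  # → 0.75
```

For lexical-diversity filtering, this would presumably be used to drop pairs whose overlap score is too high, keeping only lexically diverse paraphrases.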
> Kendall Tau
Use this function.
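As a rough illustration of the idea (not the repo's `get_kendall_tau()` implementation), Kendall's tau over the positions of shared tokens can be computed like this, assuming for simplicity that the shared tokens are unique:

```python
def kendall_tau(sent1_tokens, sent2_tokens):
    """Kendall's tau between the orderings of tokens shared by both sentences.

    Low tau means heavy reordering between sentence and paraphrase, which is
    what the syntactic-diversity filter looks for. Simplified sketch: assumes
    shared tokens occur only once in each sentence.
    """
    shared = [t for t in sent1_tokens if t in sent2_tokens]
    positions = [sent2_tokens.index(t) for t in shared]
    n = len(positions)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if positions[i] < positions[j]:
                concordant += 1
            elif positions[i] > positions[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau(list("abcd"), list("dcba")))  # → -1.0 (fully reversed order)
```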
Please feel free to reopen if you have more questions!