Preprocess/filter custom data
l0rn0r opened this issue
Hi,
thanks for your work! I'm trying to use your method to transfer contemporary German text into the style of the Swiss author Jeremias Gotthelf (19th century). I'm at the first step, training the paraphraser - at the moment I have 386k back-translated TED-talk sentences (English translated to German with T5).
Now I want to filter the back-translated corpus; reading #38 gave me a first idea, but there are some points I do not yet understand. Here is what I have understood so far:
- Putting the back-translation data into a TSV file and running https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/prepare_paraphrase_data.py gives me a train and a dev pickle with data lines like:
  `None, None, None, Sentence, BacktranslatedSentence, None, None, None, None`
  These positions stand for
  `tmp1, tmp2, equality, sent1, sent2, f1_scores, kendall_tau_scores, ed_scores, langid_scores`
  as used in https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/parse_paranmt_postprocess.py.
In your paper you describe the filtering steps in Appendix A.1. To get the data ready for `parse_paranmt_postprocess.py`, I have to write my own script:
- Calculate `get_kendall_tau()` and `f1_score()` with `preprocess_utils.py`. Since I only have German sentences, I do not need `langid` for filtering. What are `tmp1` and `tmp2`? Are those `sent1` and `sent2` normalized? What is `ed_scores`?
- Filter by content: calculate the similarity measure with `test_sim.py` and drop results with a score lower than 0.5, then filter by length difference with `parse_paranmt_postprocess.py` (`lendiff_less_`) and by length (you propose 7 to 25 tokens). That gives me the content-filtered dataset.
- Lexical diversity filtering: which SQuAD evaluation scripts did you use? I guess some from here: https://worksheets.codalab.org/worksheets/0xd53d03a48ef64b329c16b9baf0f99b0c Of course I will have to adapt the hard-coded English articles etc. to German in those scripts. In my own script I will then filter on the results of those scripts.
- Syntactic diversity filtering: I'll take the Kendall tau score and filter the dataset with `parse_paranmt_postprocess.py` (`kt_less_`).
- LangID filtering: no need for that here.
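To check my own understanding, a minimal sketch of what such a content-filtering pass could look like (the `sim_score` argument stands in for the output of `test_sim.py`; `max_lendiff` is a hypothetical placeholder value, since the `lendiff_less_` threshold is not specified here):

```python
# Hypothetical sketch of the content-filtering step, not the repo's code.
# Thresholds: similarity >= 0.5 and 7-25 tokens, as discussed above;
# max_lendiff is an assumed placeholder for the lendiff_less_ cutoff.

def keep_pair(sent1_tokens, sent2_tokens, sim_score,
              min_len=7, max_len=25, min_sim=0.5, max_lendiff=5):
    """Return True if a (sentence, paraphrase) pair survives content filtering."""
    if sim_score < min_sim:                      # semantic similarity filter
        return False
    for toks in (sent1_tokens, sent2_tokens):    # length filter (7-25 tokens)
        if not (min_len <= len(toks) <= max_len):
            return False
    # length-difference filter (placeholder threshold)
    if abs(len(sent1_tokens) - len(sent2_tokens)) > max_lendiff:
        return False
    return True

print(keep_pair("ein kurzer Satz über Filter und Daten".split(),
                "ein kurzer Satz über Daten und Filter".split(), 0.9))  # → True
```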
Open questions:
- What are `tmp1` and `tmp2` in the dataset?
- What is `ed_scores`?
- How do I get the lexical diversity with the SQuAD evaluation scripts?
Sorry for the long issue 😄
Thanks for your help in advance!
Hi @l0rn0r,
Thanks for your interest in our work and your detailed issue describing the points of confusion.
`tmp1` and `tmp2` are `benepar` constituency parses of the sentences, as you can see in this file. `ed_scores` are some kind of edit-distance scores between the parses.
Most importantly, none of `tmp1`, `tmp2`, or `ed_scores` were used for filtering the data --- we only used `f1_scores`, `kendall_tau_scores`, `langid`, and sentence lengths. So please ignore those fields; I'm sorry for the confusion they may have caused.
> How to get the lexical diversity with the SQuAD evaluation scripts?
Use this function. We used the `precision` in the paper, but I think `f1_score` is more appropriate if you don't have a length bias in your data like paraNMT (paraNMT is notorious for dropping content).
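For reference, the SQuAD-style token-overlap F1 boils down to roughly the following (omitting the article/punctuation normalization, which would need adapting from English to German anyway):

```python
import collections

def token_f1(pred_tokens, gold_tokens):
    """Token-overlap F1 in the style of the SQuAD evaluation script.

    A sketch of the idea, not the exact script: counts the multiset
    overlap between the two token lists and combines precision/recall.
    """
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("der Hund läuft schnell".split(),
               "der Hund rennt schnell".split()))  # → 0.75
```

For lexical-diversity filtering, this would presumably be used to drop pairs whose overlap score is too high, keeping only lexically diverse paraphrases.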
> Kendall Tau
Use this function.
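As a rough illustration of the idea (not the repo's `get_kendall_tau()` implementation), Kendall's tau over the positions of shared tokens can be computed like this, assuming for simplicity that the shared tokens are unique:

```python
def kendall_tau(sent1_tokens, sent2_tokens):
    """Kendall's tau between the orderings of tokens shared by both sentences.

    Low tau means heavy reordering between sentence and paraphrase, which is
    what the syntactic-diversity filter looks for. Simplified sketch: assumes
    shared tokens occur only once in each sentence.
    """
    shared = [t for t in sent1_tokens if t in sent2_tokens]
    positions = [sent2_tokens.index(t) for t in shared]
    n = len(positions)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if positions[i] < positions[j]:
                concordant += 1
            elif positions[i] > positions[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau(list("abcd"), list("dcba")))  # → -1.0 (fully reversed order)
```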
Please feel free to reopen if you have more questions!