kpu / preprocess

Corpus preprocessing

-k from `cache` doesn't work when numbers are not sorted

cgr71ii opened this issue · comments

Hi!

I'm using cache with the -k option, and it seems necessary to provide the indexes sorted. Is this right? If I provide the indexes as, e.g., -k 4,3, it fails (exit code 139, SIGSEGV), but with -k 3,4 it works fine. Is this a bug, or are the indexes expected to always be sorted?

Thank you!

It should work with -k4,3. I just tried a minimal example.

As to why bitextor has developed headers and then a script to cut them off, I remain confused.

Sorry, I closed the issue by mistake.

I ran into this issue with a large data set, but the following minimal example fails too:

echo -e "1\t2\t3\t4\n6\t7\t8\t9\n1\t2\t7\t9" | cache -k 1,2 cut -f4
# 4
# 9
# 4

echo -e "1\t2\t3\t4\n6\t7\t8\t9\n1\t2\t7\t9" | cache -k 2,1 cut -f4
# Segmentation fault (core dumped)
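
For comparison, here is a rough model of the output I would expect from the second command. This is only a sketch: `memo_f4` is a hypothetical awk stand-in, not how cache is implemented. It stores column 4 under a key built from the listed fields, in the order given, so the order of the indexes should not matter:

```shell
# Hypothetical stand-in for `cache -k FIELDS cut -f4` (NOT the real cache code):
# memoize column 4 under a key built from the listed fields, in the given order.
memo_f4() {
    awk -F'\t' -v keys="$1" '
        BEGIN { n = split(keys, k, ",") }
        {
            key = ""
            for (i = 1; i <= n; i++) key = key $(k[i]) SUBSEP
            if (!(key in seen)) seen[key] = $4   # first sighting: store the result
            print seen[key]                      # later sightings: reuse it
        }'
}

printf '1\t2\t3\t4\n6\t7\t8\t9\n1\t2\t7\t9\n' | memo_f4 2,1
# 4
# 9
# 4
```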

As for why we cut off the headers in bitextor, I'm not sure which part you mean. If you mean the commit where I referenced this issue, there we take the indexes of the src and trg sentence columns and provide them to cache, but we don't cut the header off. In fact, cutting the header off because of cache is something we considered in the past, because cache doesn't have an option like --header from GNU parallel to ignore the header, so this might lead to a problem like:

src_sentence    trg_sentence    bicleaner_score  # header: cache stores 'bicleaner_score' as result for 'src_sentence    trg_sentence'
sent1           sent2           0.1              # bicleaner returns '0.1' and cache stores 'sent1    sent2'
src_sentence    trg_sentence    bicleaner_score  # since 'src_sentence    trg_sentence' is stored because of the header, 'bicleaner_score' is returned instead of the actual score from bicleaner
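
One way to sidestep this without a --header option would be to keep the header out of the cached command's input entirely. A minimal sketch, assuming a shell wrapper is acceptable: `skip_header` is a hypothetical helper (not part of cache or bitextor) that copies the first line through unchanged and runs the given command only on the remaining lines, and `cut -f2` stands in for the real command:

```shell
# Hypothetical helper: pass the first line (the header) through untouched,
# then run the given command on the rest of stdin.
skip_header() {
    IFS= read -r header || return 0   # empty input: nothing to do
    printf '%s\n' "$header"
    "$@"                              # the command only ever sees the data rows
}

# Stand-in command for illustration; the header row never reaches it.
printf 'src_sentence\ttrg_sentence\nsent1\tsent2\n' | skip_header cut -f2
```

With cache, the command after `skip_header` would be the `cache -k ...` invocation, so the header row would never be stored as a key. Note that the header is printed as-is, so if the wrapped command adds a column (like the score), the header would need that column name appended separately.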

Should be working now, passes your example, sorry for the error.