kpu / preprocess

Corpus preprocessing

Cache util::EndOfFileException

zuny26 opened this issue

Hi. During Bitextor testing, we encountered an issue with empty lines in the cache output. More precisely, if the first line of the input produces an empty output and the same input line then occurs again, cache crashes with util::EndOfFileException.

We found out that util::Pool is initialized with NULL, so copy_to is NULL if got is an empty string:

char *copy_to = (char*)string_pool.Allocate(got.size());

So, when the second occurrence of the first line is read, the check in the Input function treats the line as already cached (res.second is false) and does not send it to the child process:

std::pair<std::unordered_map<uint64_t, StringPiece>::iterator, bool> res(cache.insert(entry));
if (res.second) {

The check in Output, however, treats the same entry as not cached:

while (queue.Consume(q).value) {
  StringPiece &value = *q.value;
  if (!value.data()) {

value.data() returns NULL even though the (empty) output was stored in the cache, so Output tries to read a result from the child process for an input line that was never passed to it.

One workaround would be to allocate a byte before the loop, so that a NULL pointer is never returned: char *small_init = (char*)string_pool.Allocate(1);
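
To illustrate why the extra byte helps, here is a minimal sketch. ToyPool is a hypothetical stand-in written for this example, not the real util::Pool; it only assumes the behaviour described above, namely that the pool starts with a NULL pointer and that a zero-byte request does not make it grow. The two helper functions are also made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical toy arena (NOT util::Pool): the bump pointer starts out null
// and a zero-byte request never triggers growth.
class ToyPool {
 public:
  ToyPool() : current_(nullptr), end_(nullptr) {}
  ~ToyPool() { for (void *b : blocks_) std::free(b); }
  ToyPool(const ToyPool &) = delete;
  ToyPool &operator=(const ToyPool &) = delete;

  void *Allocate(std::size_t size) {
    // With both pointers null the free space is 0, so size == 0 skips Grow.
    if (static_cast<std::size_t>(end_ - current_) < size) Grow(size);
    void *ret = current_;
    current_ += size;
    return ret;
  }

 private:
  void Grow(std::size_t at_least) {
    std::size_t block = at_least > 4096 ? at_least : 4096;
    char *mem = static_cast<char*>(std::malloc(block));
    if (!mem) std::abort();
    blocks_.push_back(mem);
    current_ = mem;
    end_ = mem + block;
  }

  char *current_;
  char *end_;
  std::vector<void*> blocks_;
};

// True if an empty string stored as the very first allocation gets a null
// data pointer, i.e. would look "not cached" to a data()-based check.
bool EmptyFirstOutputLooksUncached() {
  ToyPool pool;
  return pool.Allocate(0) == nullptr;
}

// Same question after reserving one byte first, as in the workaround.
bool EmptyOutputLooksUncachedWithWorkaround() {
  ToyPool pool;
  pool.Allocate(1);  // plays the role of small_init
  return pool.Allocate(0) == nullptr;
}
```

In this toy model, EmptyFirstOutputLooksUncached() is true and EmptyOutputLooksUncachedWithWorkaround() is false: once any block has been allocated, an empty string gets a non-NULL pointer and can be told apart from an unfilled cache slot.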

Alternatively, the check in the Input function could be changed to test the same condition as the Output function, i.e. change if (res.second) to if (!res.first->second.data()). This way, if the first line produces an empty output, it would be sent to the child process twice instead of once, but there would be no error.
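
The effect of changing the Input check can be sketched with a small sequential simulation. This is only a model under stated assumptions: Slice, Counts and Simulate are hypothetical names, every line is assumed to produce an empty output stored with a NULL data pointer, and each line's output is assumed to be stored before the next line is read, which a real queued pipeline may not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for StringPiece: a pointer plus a length.
struct Slice {
  const char *data = nullptr;
  std::size_t size = 0;
};

// How many lines were sent to the child and how many results were read back.
struct Counts {
  int sent = 0;
  int read = 0;
};

// keys are the hashes of the input lines, in input order.
Counts Simulate(const std::vector<std::uint64_t> &keys, bool check_data_in_input) {
  std::unordered_map<std::uint64_t, Slice> cache;
  Counts counts;
  for (std::uint64_t key : keys) {
    auto res = cache.insert(std::make_pair(key, Slice()));
    // Input side: decide whether the line goes to the child process.
    bool send = check_data_in_input ? (res.first->second.data == nullptr)
                                    : res.second;
    if (send) ++counts.sent;
    // Output side: a null data pointer means "not cached, read from child".
    Slice &value = res.first->second;
    if (!value.data) {
      ++counts.read;
      value = Slice();  // empty output from a never-grown pool: data stays null
    }
  }
  return counts;
}
```

For the input {7, 7}, the original check gives sent == 1 and read == 2, which corresponds to the read past end of file. The data-based check gives sent == 2 and read == 2, so the counts match at the cost of running the line twice.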

Good point, followed your suggested fix.