polm / fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.


bizarre print behavior

EtienneGagnon1 opened this issue

Hello,

When using the tokenizer in a loop, I get different outputs depending on whether there is a print call inside the for loop or not. With the following loop, printing articles afterwards gives this output:


import fugashi

tagger = fugashi.Tagger()
text = ['未来ある子どもたちを、たばこがもたらす健康被害',
        '◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。']

articles = []
for art in text:
    tokenized = tagger(art)
    articles.append(tokenized)
print(articles)

[[生, の, 厳, 立法, 討, 年, 党議, を, どう, る, か], [◆, 生命, の, 尊厳,  , 立法, 化, 検討, 13, 年,  , 党議, 拘束, 見送り, 「, 死, 」, を, どう, 考える, か, 。]]

While the following for loop, with print calls inside, gives the correct result:


import fugashi

tagger = fugashi.Tagger()
text = ['未来ある子どもたちを、たばこがもたらす健康被害',
        '◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。']

articles = []
for art in text:
    print(art)
    tokenized = tagger(art)
    print(tokenized)
    articles.append(tokenized)
print(articles)

[[未来, ある, 子ども, たち, を, 、, たばこ, が, もたらす, 健康, 被害], [◆, 生命, の, 尊厳,  , 立法, 化, 検討, 13, 年,  , 党議, 拘束, 見送り, 「, 死, 」, を, どう, 考える, か, 。]]

Well that's a nasty bug.

What's happening is that for some reason the string pointer in the node is referring to your second article instead of the first one, which is why characters from the second string show up in the first list.

This shouldn't be happening: internally fugashi passes the -C flag to MeCab specifically so that it allocates new memory for each input string. I've seen errors like this when the -C flag isn't passed, but not when it is.
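Distilled to a minimal sketch (based on the reports in this thread; exactly how the output gets garbled depends on the strings involved), the symptom is:

import fugashi

tagger = fugashi.Tagger()
nodes = tagger("未来ある子どもたちを、たばこがもたらす健康被害")
tagger("◆生命の尊厳\u3000立法化検討13年")  # a second parse reuses the buffer
# The surfaces of the first result can now show text from the second input
print(*[node.surface for node in nodes])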

It's especially weird that printing inside the loop also fixes the post-loop output, as in this sample:

import fugashi

tagger = fugashi.Tagger()
text = [
        "◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。",
        "未来ある子どもたちを、たばこがもたらす健康被害", 
        ]

articles = []
for art in text:
    tokenized = tagger(art)
    print(tokenized)
    articles.append(tokenized)

print("-----")
for article in articles:
    print(*[node.surface for node in article])

Anyway, something is going wrong with memory management somewhere, so I'll have to look into it.

As a workaround, it looks like making any reference to the input string ensures it's preserved properly, so you can add a line like this to fix it:

[len(tt.surface) for tt in tokenized]

Note you don't have to save the result to a variable; just evaluating the expression fixes the strings.
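Applied to the loop from the original report, the workaround looks like this (the same code as above with one added line):

import fugashi

tagger = fugashi.Tagger()
text = ['未来ある子どもたちを、たばこがもたらす健康被害',
        '◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。']

articles = []
for art in text:
    tokenized = tagger(art)
    # Touching each surface materializes the strings before the next
    # parse can invalidate them.
    [len(tt.surface) for tt in tokenized]
    articles.append(tokenized)
print(articles)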

Thank you for the workaround and for your work on the tokenizer. Best of luck solving the issue. Please let me know if I can help.

I found a second example.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from fugashi import Tagger

if __name__ == '__main__':
    tagger = Tagger()

    print("OK")
    t1 = "このテキストは1番目です"
    t2 = "このテキストは2番目です"
    n1 = tagger.parseToNodeList(t1)
    print('n1', n1) # debug
    n2 = tagger.parseToNodeList(t2)
    print('n2', n2) # debug

    # printing after both parses causes wrong output
    print("Wrong")
    t1 = "このテキストは1番目です"
    t2 = "このテキストは2番目です"
    n1 = tagger.parseToNodeList(t1)
    n2 = tagger.parseToNodeList(t2)
    print('n1', n1) # debug
    print('n2', n2) # debug

    # other ok example
    print("OK2")
    t1 = "このテキストは1番目です"
    t2 = "2番目はこれでござる"
    n1 = tagger.parseToNodeList(t1)
    print('n1', n1) # debug
    n2 = tagger.parseToNodeList(t2)
    print('n2', n2) # debug

    # printing after both parses causes an error in this case
    print("Error")
    t1 = "このテキストは1番目です"
    t2 = "2番目はこれでござる"
    n1 = tagger.parseToNodeList(t1)
    n2 = tagger.parseToNodeList(t2)
    print('n1', n1) # debug
    print('n2', n2) # debug


Thanks for the report.

I understand what's going on, mostly.

When a node list is created, the nodes hold pointers into the tokenized string. However, they don't "own" the original string, and another call to parseToNode ends up freeing or re-using it.
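As a pure-Python analogy for that lifetime hazard (hypothetical, not fugashi's actual internals): a node that lazily decodes from a shared buffer behaves just like the C pointers here.

class LazyNode:
    def __init__(self, buf, start, end):
        self.buf = buf      # shared buffer that may be overwritten later
        self.start = start
        self.end = end

    @property
    def surface(self):
        # decodes on access, like reading the node's surface pointer
        return self.buf[self.start:self.end].decode("utf8")

buf = bytearray("1番目".encode("utf8"))
node = LazyNode(buf, 0, len(buf))
buf[:] = "2番目".encode("utf8")  # simulate MeCab reusing its internal buffer
print(node.surface)  # prints 2番目, not 1番目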

I implemented a workaround, but it causes a significant slowdown, so I'm trying some other solutions.

The thing I don't understand about the current issue is which string the pointers refer to. There are several strings involved in tokenization:

  1. The original Python string
  2. The bytes version of the string created to pass to the C++ interface
  3. The copy of 2 that MeCab creates internally

I thought that if I removed the -C option, the pointers in the node objects would refer to string 2 and string 3 wouldn't exist, but that doesn't seem to be the case.

I'll put the fix I have so far in a branch.

The fix is in the fix/node-text branch, with a test.

https://github.com/polm/fugashi/tree/fix/node-text

Released test wheels with the "slow" fix for Linux only; you can install them with pip install fugashi==1.1.1a1.

On further review, the slowdown is not as bad as I thought: it looks like about 10%, which is not great but acceptable given that it can be fixed later. The reason I was confused is that the wheel builds from GitHub Actions appear to be faster than the versions I build locally, even on my own machine.
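A rough way to sanity-check that kind of overhead difference between builds (a hypothetical timing sketch using only the public fugashi.Tagger API, not the benchmark referenced above):

import time

import fugashi

tagger = fugashi.Tagger()
text = "未来ある子どもたちを、たばこがもたらす健康被害"

start = time.perf_counter()
for _ in range(100000):
    tagger(text)  # parse the same sentence repeatedly
print(f"{time.perf_counter() - start:.2f}s for 100k parses")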

Given all that, I should be able to review the code and do a proper release soon.

Sorry it took much longer than expected to get around to making a release for this, but I just released v1.1.1, which resolves this issue. If you have any other problems, please open an issue any time.