PureTextTokenizer: inconsistent results when tokenizing same sentences

nnnyt opened this issue 3 years ago

🐛 Description

I tokenize the same item twice using PureTextTokenizer, but the two calls return inconsistent results.

To Reproduce

The code I use:

from EduNLP.Tokenizer import PureTextTokenizer

tokenizer = PureTextTokenizer()
item = ["有公式$\\FormFigureID{wrong1?}$，如图$\\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\\FormFigureBase64{wrong2?}$,$\\SIFSep$，则$z=x+7 y$的最大值为$\\SIFBlank$"]
# Tokenize the same input twice; both calls are expected to return the same tokens.
print(next(tokenizer(item)))
print(next(tokenizer(item)))
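
To make the mismatch explicit rather than comparing printed output by eye, a small helper can run any tokenizer-like callable twice on the same input and compare the first result of each call. This is only a sketch: `tokenize_twice`, `is_consistent` and `LeakyTokenizer` are hypothetical names that are not part of EduNLP, and the stub's call counter is purely illustrative, not a claim about the actual cause of this bug.

```python
def tokenize_twice(tokenizer, items):
    """Call `tokenizer` twice on the same `items`; return the first result of each call."""
    first = next(iter(tokenizer(items)))
    second = next(iter(tokenizer(items)))
    return first, second


def is_consistent(tokenizer, items):
    """True if both calls produce the same tokens for the first item."""
    first, second = tokenize_twice(tokenizer, items)
    return first == second


class LeakyTokenizer:
    """Hypothetical stand-in that keeps state across calls, so its output drifts."""

    def __init__(self):
        self._calls = 0

    def __call__(self, items):
        self._calls += 1  # state shared between calls (illustrative only)
        for item in items:
            yield item.split() + ["[CALL_%d]" % self._calls]


def stateless_tokenizer(items):
    """Hypothetical stand-in with no shared state."""
    for item in items:
        yield item.split()
```

With EduNLP installed, `is_consistent(PureTextTokenizer(), item)` would be the corresponding check for this report; it should return True once the bug is fixed.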

Environment

Operating System: macOS

Python Version: anaconda/python3.7

tswsxk · Answer 1 · Mon Sep 27 2021 11:16:01 GMT+0800 (China Standard Time)

Solved by PR #99