loomchild / segment

Program used to split text into segments

Accurate version of the iterator returns results which differ from ultimate/fast on very short texts

dchaplinsky opened this issue

Hi Jarek!

I'm currently working on a project called choppa, which is a partial Python port of your great library. My intention is to bring the sentence tokenization found in LanguageTool to the Python world.

To make my life a little bit easier, I decided to implement only the accurate iterator and SAX parser for now. I successfully ported the code and tests and got it working (despite the lack of the Matcher class in Python regexes and the general difference in regex syntax between Python and Java). Then I started to port the LanguageTool tests for the Ukrainian language, and most of them worked, except for a few. I literally banged my head against the wall for a couple of days (you can see it from the commit messages).

Then I decided to compile segment itself and run the tests using the SRX file found in the LanguageTool distro.

And boom. When using the fast or ultimate algorithm, it works flawlessly. But with accurate it fails the same way as my code:

$ echo "Алисов Н. В. , Хореев Б. С." | ./segment -a accurate -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
Алисов Н. В.
, Хореев Б. С.

$ echo "Алисов Н. В. , Хореев Б. С." | ./segment -a fast -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
Алисов Н. В. , Хореев Б. С.

$ echo "М. Л. Гончарука, I. О. Денисюка" | ./segment -a fast -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
М. Л. Гончарука, I. О. Денисюка

$ echo "М. Л. Гончарука, I. О. Денисюка" | ./segment -a accurate -s ~/Projects/choppa/data/srx/segment_new.srx -l uk_two -r
М. Л. Гончарука, I. О.
Денисюка

On one hand, I'm now happy that my implementation is correct after all. On the other hand, I'm not, because I need to implement either fast or ultimate to make it work, and presumably there is an error in the segment library that is not covered by the tests.

P.S. I cannot express how grateful I am for the library you wrote and the quality of its code. Thank you very much for your hard work!

Hi Dimitry,

Thank you for working on this and for your nice message:)

I will look closer into the issue and try to help you in a few days. However, keep in mind that the code was written a long time ago (most of the work was done in 2007-2008, AFAIR), I no longer actively code in Java (actually, I prefer Python nowadays), and the fast and accurate algorithms are considered legacy. Still, I hope the issue can be fixed by digging deeper into the code.

Meanwhile, would you be able to share the current version of your project for reference?

Thanks. Yes, my immediate plan is to port the ultimate algorithm and see if that helps.

The current version is here: https://github.com/lang-uk/choppa. It's still a WIP, but already good enough to show :)

If you could give it a code review once I'm done with the implementation of the ultimate algorithm, that would be marvelous.

Thanks. Keep in mind that the ultimate algorithm is more complex because it needs to support streaming. On the other hand, it should be better optimized for situations where there are few break rules but many exception rules (typical for SRX).

Well, after all my struggles, the ultimate algorithm doesn't sound so scary now. Will post an update once I finish it.

Finished the port of the SrxTextIterator (as well as an implementation of useTransparentBounds for the JavaMatcher, TextManager, and RuleManager).

It's slow (since my implementation of JavaMatcher is very inefficient) and ESPECIALLY slow on that TEXT_LONGER_THAN_BUFFER_RESULT test, but it finally works.

I'll try to add some optimizations over the weekend. If you have a minute, could you please look at the current code?

Thanks for the update. I will try to look at it on Sunday.

I've added a few potential speedups here: https://github.com/lang-uk/choppa/pull/1/files
I'm still not sure if it's a good idea (but that code passes all the tests and works faster).

I found the issue with the accurate algorithm and came to some interesting conclusions (I hope). I will share them tomorrow (a bit too late for me today :)

Sure. I'd also be grateful if you could review the speedups (and the code in general).

Yesterday I did some tests, and now I know what the issue with the accurate algorithm is.

The problem is with this exception rule:

<rule break="no">
	<beforebreak>[\h\v][А-ЯІЇЄҐ]\.[\h\v]*</beforebreak>
	<afterbreak>[А-ЯІЇЄҐ]\.|[0-9]|[\h\v]*,|[\h\v]*[:«]|\([0-9]{4}</afterbreak>
</rule>

For the text "Алисов Н. В. , Хореев Б. С.", this rule should match four times, after every dot. However, because the space at the beginning of the rule has already been consumed by the previous match, it matches only twice, after "Н. " and "Б. ".
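
The effect is easy to reproduce in isolation. Here is a minimal Python sketch of both the problem and the lookbehind fix described below, with [ \t] standing in for Java's \h and \v classes, and the variable-length tail dropped so the lookbehind stays fixed-width for re:

import re

text = "Алисов Н. В. , Хореев Б. С."

# Consuming version of the beforebreak pattern: the leading space eaten by
# one match cannot start the next one, so only 2 of the 4 positions match.
consuming = re.compile(r"[ \t][А-ЯІЇЄҐ]\.[ \t]*")
print([m.span() for m in consuming.finditer(text)])   # 2 matches

# Zero-length lookbehind version: it consumes nothing, so every position
# right after an initial-plus-dot is found.
lookbehind = re.compile(r"(?<=[ \t][А-ЯІЇЄҐ]\.)")
print([m.span() for m in lookbehind.finditer(text)])  # 4 matches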

A straightforward solution is to convert the rule to use a zero-length lookbehind. You could do it manually by changing the rule, but since you already implemented the necessary finitize() function, it can be done automatically as follows:

public static String createLookbehindPattern(String pattern, int maxLength) {
	if (pattern.length() == 0) {
		return pattern;
	}
	// Wrap the (finitized) before pattern in a zero-length lookbehind so it no longer consumes input.
	return "(?<=" + Util.finitize(pattern, maxLength) + ")";
}

Apply the above transformation to the before pattern of each exception rule when initializing the algorithm:

for (LanguageRule languageRule : languageRuleList) {
	for (Rule rule : languageRule.getRuleList()) {
		if (!rule.isBreak()) {
			// Rewrite the before pattern of every exception rule as a zero-length lookbehind.
			rule = new Rule(rule.isBreak(),
					Util.createLookbehindPattern(rule.getBeforePattern(), maxLookbehindConstructLength),
					rule.getAfterPattern());
		}
		RuleMatcher matcher = new RuleMatcher(document, rule, text);
		ruleMatcherList.add(matcher);
	}
}
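
For the Python port, a rough equivalent might look like the sketch below. Note that finitize() here is only a naive stand-in for segment's Util.finitize (which rewrites unbounded quantifiers into bounded ones, since Java's lookbehind must have a bounded maximum length), and the resulting variable-length lookbehind requires the regex module, as re only accepts fixed-width lookbehind:

import re

def finitize(pattern, max_length):
    # Naive stand-in for Util.finitize: rewrite the unbounded quantifiers
    # * and + into bounded {0,n} / {1,n}. (Ignores quantifiers that are
    # escaped or inside character classes.)
    pattern = re.sub(r"(?<!\\)\*", "{0,%d}" % max_length, pattern)
    return re.sub(r"(?<!\\)\+", "{1,%d}" % max_length, pattern)

def create_lookbehind_pattern(pattern, max_length):
    if not pattern:
        return pattern
    return "(?<=" + finitize(pattern, max_length) + ")"

print(create_lookbehind_pattern(r"[\h\v][А-ЯІЇЄҐ]\.[\h\v]*", 100))
# -> (?<=[\h\v][А-ЯІЇЄҐ]\.[\h\v]{0,100})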

The complete code, along with a test, has been committed to segment here:
783d4e9

The accurate algorithm is much simpler, but generally slower, because it matches the entire text against all rules, instead of focusing on break rules first and then applying exception rules at the candidate break positions. However, the performance might be completely sufficient if you plan to segment texts that fit into memory.
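
For illustration, a rough sketch of the accurate strategy (this is not segment's actual code; rules here are precompiled regexes of the form beforebreak plus a lookahead of afterbreak, and rule precedence is modeled as first-writer-wins):

import re

def find_breaks(text, rules):
    # rules: ordered (is_break, compiled_pattern) pairs, where each pattern
    # is roughly beforebreak + "(?=" + afterbreak + ")", so match.end() is a
    # candidate break position. Every rule scans the whole text, and the
    # earliest-listed rule matching a position decides break vs. no-break.
    # (fast/ultimate instead scan with break rules only, then test exception
    # rules just at each candidate position, which wins when exception rules
    # vastly outnumber break rules.)
    decided = {}
    for is_break, pattern in rules:
        for m in pattern.finditer(text):
            decided.setdefault(m.end(), is_break)
    return sorted(pos for pos, is_brk in decided.items() if is_brk)

rules = [
    (False, re.compile(r"[А-ЯІЇЄҐ]\.(?=[ \t])")),  # exception: an initial like "Н."
    (True, re.compile(r"\.(?=[ \t])")),            # break: full stop followed by space
]
print(find_breaks("Алисов Н. В. , Хореев Б. С.", rules))  # [] - all candidates vetoed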

Regarding the ultimate algorithm, your optimizations might be enough, if you can live with much more complex code. Alternatively, consider using a read-only view of the string instead of a copy. I looked into this for a bit, but I'm not sure if it's possible in Python without implementing a simple string view yourself (memoryview doesn't seem to solve the problem, but I don't know enough about the matter). It could work by exposing the same interface as a string, but operating only on a part of the text. Another option could be an external regex library for Python that supports a region interface.
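
For anyone who wants to experiment with the string-view idea, a minimal sketch could start like the hypothetical class below; only a few read operations are forwarded, and Python's re would still insist on a real str, which is exactly the open question:

class StringView:
    # Minimal read-only view over a slice of a larger string, avoiding a
    # copy until materialization is actually requested.
    def __init__(self, text, start, end):
        self._text, self._start, self._end = text, start, end

    def __len__(self):
        return self._end - self._start

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices with positive step only, for brevity.
            start, stop, step = index.indices(len(self))
            return self._text[self._start + start:self._start + stop:step]
        if index < 0:
            index += len(self)
        return self._text[self._start + index]

    def __str__(self):
        # Materializes a copy; the point of the view is to delay this.
        return self._text[self._start:self._end]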

In summary, I would try to fix the accurate algorithm first as described above, see if it's sufficient for your needs, and then try to tweak the ultimate algorithm further.

Please let me know what you think.

Thanks for the in-depth analysis.

I'll apply the change to the accurate algorithm soon and double-check the result.
In my tests (and I wasn't very thorough) I found that on very long texts (TEXT_LONGER_THAN_BUFFER_RESULT) the accurate algorithm works faster for me (in fact, without the recent optimization I never saw the ultimate algorithm complete that text at all; with the optimizations they are on par). I'll do performance testing for both and publish the results here some day.

Could you please look into two more things for me, if you have a spare minute:

  • Attribution of your work in the package, as I'm preparing choppa for public release. Spelling, order, other things. Just want to be on the safe side here :)
  • Lookahead/lookbehind buffer lengths in JavaMatcher. Currently those are hardcoded to 1000/100, I think, and smaller buffers tend to give faster processing. Is it safe to set them both to (say) DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH?

Wrt other regex libraries: I'm using regex instead of re, and its author was kind enough to add new character classes to replace \h/\v from Java. Unfortunately, I wasn't able to find any other library that supports regions the way Java does. Actually, both re and regex allow you to specify a start/end position for the match, but neither can match things like ^ at the beginning of the region, i.e. re.compile("^foo").search("barfoo", 3, 6) yields no results :( There are a couple of other bindings for re2 (Google's library for fast regexes), but those aren't really maintained and lack a lot of functionality (not to mention regions).

In the accurate algorithm, the entire text is loaded into memory first, so that could make a difference in performance (and testing text longer than the buffer doesn't make much sense). Also, the necessity to copy the memory in Python can slow it down greatly (maybe some simple cache could be used to avoid doing it many times, but then it requires more memory?). Additionally, keep in mind that the TEXT_LONGER_THAN_BUFFER test is very basic and not really representative of a real corpus.

I still think that if the performance (both CPU and memory) of the accurate algorithm is sufficient for you, then there is no point in introducing more complicated ones that require a special JavaMatcher, regions, etc. Keep in mind that the code was written a long time ago, when using gigabytes of memory was unthinkable :)

Attribution is great, thank you. You can perhaps add a link to my profile directly instead of using my pseudonym, if you wish:
by Jarek Lipski.

The lookahead/lookbehind buffers should fit the longest rule, so yes, it should be safe to decrease them both to 100. Hopefully, the class won't be necessary anymore.

About other libraries - I see, that makes sense. I am only not sure why re.compile("^foo").search("barfoo", 3, 6) should match. Shouldn't ^ match only at the beginning of the entire text? Or perhaps at the beginning of a line? I see that it's being used in LanguageTool SRX rules, so probably I forgot how it should work... could you please explain (just for my info)?

The stream parser (ultimate algo) is already there and it has some other advantages, but as I said, I'll do performance testing on real texts for both.
Lookahead/lookbehind down to 100: great news, another lil' speedup!

JavaMatcher cannot be removed, not only because the ultimate algo is using it, but also because AccurateSRXParser relies on it (see below for the explanation regarding re).

Consider the following code:

        Pattern TEST_PATTERN = Pattern.compile("^foo");
        Matcher beforeMatcher = TEST_PATTERN.matcher("barfoo");
        System.out.println(beforeMatcher.lookingAt());
        beforeMatcher.region(3, 6);
        System.out.println(beforeMatcher.lookingAt());

It'll produce false, true.

A rough Python equivalent is:

In [1]: import regex as re

In [2]: pattern = re.compile("^foo")

In [4]: pattern.match("barfoo") is None
Out[4]: True

In [5]: pattern.match("barfoo", 3, 6) is None
Out[5]: True

So in the case of Java's Matcher, if you supply a region for the pattern, lookingAt/search with ^ will match at the beginning of the region, while in Python ^ always matches at the beginning of the string (or line).

Thanks for the explanations. I understand that the Java version works like that, but I didn't realize that real SRX rules rely on this fact and use ^ this way, as it seems unnecessary (but probably I am missing something).

Yeah, one problem with flexible rules is that the people who write the rules can come up with anything, basically fine-tuning their regexes to the desired behavior.

Yes, and maybe even some SRX-building tools add these...

Nevertheless, reviewing the SRX file from LanguageTool, I don't see a case where it would be a good idea to match ^ as the beginning of the current region. It's not used in afterbreak rules, and in beforebreak it either means the beginning of the text or should be removed. I think that the default Python interpretation makes more sense in this context.

Perhaps it makes sense to reach out to the LanguageTool SRX authors and ask what effect they were trying to achieve in these rules (I found only one in Polish and a few in Ukrainian, so I suppose those were added by your team).

Side note: I forgot that ^ also matches the beginning of the line, but only in MULTILINE mode. So, it might make sense in some patterns (although MULTILINE matching is not enabled in segment).

>>> pattern = re.compile("^foo", re.MULTILINE)

This doesn't work:

>>> pattern.match("bar\nfoo") is None
True

But this does:

>>> pattern.match("bar\nfoo", 4, 7) is None
False

(not sure if relevant, I don't remember exactly how it's handled in LanguageTool with segment; I think it works the same in Java, but I haven't checked)

Yes, that works, but it doesn't give much of an advantage. I was trying to play with different implementations of the caret matching (for example, if a regex pattern starts with a caret, I remove it and use match instead of search, but patterns like (^\h\v...)|(foobar) killed that idea).
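
For reference, the abandoned heuristic looked roughly like the hypothetical helper below; it emulates Java's region-relative ^ for the simple prefix case, but has no answer when the caret is buried inside a group or alternation:

import re

def region_search(pattern, text, start, end):
    # If the pattern begins with ^, strip it and use match(), which anchors
    # at the start position, emulating Java's region-relative caret.
    if pattern.startswith("^"):
        return re.compile(pattern[1:]).match(text, start, end)
    return re.compile(pattern).search(text, start, end)

print(region_search("^foo", "barfoo", 3, 6))       # matches, like Java's region
print(region_search(r"(^f)|(zz)", "barfoo", 3, 6)) # None, although Java would match "f"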

As for the changes to the accurate SRX parser, I've tried to add them to the Python implementation, and it doesn't work :(

E   AssertionError: Lists differ: ["Don't split strings like U.S.A. please."] != ["Don't split strings like U.S.", 'A. please.']
E   AssertionError: Lists differ: ['Алисов Н. В. , Хореев Б. С.'] != ['Алисов Н. ', 'В. ', ', Хореев Б. ', 'С.']

Strange, I will debug it later today.

I have no idea why it doesn't work in Python. I will try to take a look into it over the weekend.

Hi! Any chance you could look into this once again? I'm trying to wrap up the work and release it at last.

Yes, sorry for the delay. I will try to do it this weekend.

I have resolved an issue in JavaMatcher related to zero-length matches in the accurate algorithm.

However, I still think it might be unnecessary to fully emulate the Java behavior, and I will try to propose a simpler solution later today.

I have created two alternative PRs:

  1. Fix JavaMatcher empty match handling - Resolves a bug in JavaMatcher. Please note that the JavaMatcher code is pretty hard to understand, so I am not 100% sure the fix is correct. On the other hand, I am pretty sure there was an issue with handling zero-length matches, and since all exception rules are wrapped in lookbehind, most matches will be zero-length (see the sketch after this list).

  2. Implement simple rule matcher - Get rid of JavaMatcher and implement a simple, pure-Python RuleMatcher. There are differences between Java and Python as described above, but I think they are not significant in real-life rules (unless I am missing the point of using ^ in rule patterns).
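
To illustrate the kind of pitfall item 1 is about (a sketch, not the actual JavaMatcher code): a find-next loop has to advance past an empty match by hand, otherwise it keeps returning the same position forever, and with all exception rules wrapped in lookbehind, empty matches become the norm.

import re

def find_all(pattern, text):
    pat = re.compile(pattern)
    pos, spans = 0, []
    while pos <= len(text):
        m = pat.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # After an empty match, bump the position by one; otherwise the
        # next search would return the very same zero-length match.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return spans

print(find_all(r"(?<=\.)", "А. Б."))  # [(2, 2), (5, 5)]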