AlexPoint / OpenNlp

Open source NLP tools (sentence splitter, tokenizer, chunker, coref, NER, parse trees, etc.) in C#

Special characters

kennetherland opened this issue

I am seeing words split in odd places when a sentence includes characters like commas, colons, semicolons, and slashes.

Example:
"As a Hydra Store Visitor, I want to see the latest version."

Becomes:
"As a Hydra Store Vis i t or , I want to see the latest version."

I noticed that during `TokenizePositions`, the last token in the first phrase ("Visitor,") does not pass `AlphaNumeric.IsMatch(token)` because the token ends with a non-alphanumeric character (here, a comma), so the model evaluator takes over and splits the string in strange ways. Am I doing something wrong?
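To illustrate (assuming `AlphaNumeric` is a letters-and-digits-only pattern along the lines of `^[A-Za-z0-9]+$`; the exact regex in the library may differ), the trailing comma is what makes the fast path fail:

```csharp
using System;
using System.Text.RegularExpressions;

class AlphaNumericCheck
{
    static void Main()
    {
        // Assumed stand-in for the tokenizer's alphanumeric fast-path check.
        var alphaNumeric = new Regex("^[A-Za-z0-9]+$");

        Console.WriteLine(alphaNumeric.IsMatch("Visitor"));  // True  -> span kept as one token
        Console.WriteLine(alphaNumeric.IsMatch("Visitor,")); // False -> falls through to the model
    }
}
```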

Please advise.

Here is how I solved it, right or wrong:

In `MaximumEntropyTokenizer.TokenizePositions`:

```csharp
// ...

else if (AlphaNumericOptimization && AlphaNumeric.IsMatch(token))
{
    newTokens.Add(tokenSpan);
    tokenProbabilities.Add(1.0);
}
else if (token.ContainsAny(";", ",", ":", ")", "/"))
{
    // ContainsAny/GetWhatsContained are extension methods I added.
    var contained = token.GetWhatsContained(";", ",", ":", ")", "/").ToList();
    var tokenSpanStart = tokenSpan.Start;

    foreach (var ch in contained.Select(s => char.Parse(s)))
    {
        // Split the token on the delimiter; the returned spans are relative to the token.
        var relativeSubTokens = SplitOn(token, ch);

        foreach (var relativeSubTokenSpan in relativeSubTokens)
        {
            // Translate the token-relative span to absolute offsets in the input.
            var absoluteStart = tokenSpanStart + relativeSubTokenSpan.Start;
            var absoluteEnd = tokenSpanStart + relativeSubTokenSpan.End;
            var absoluteTokenSpan = new Span(absoluteStart, absoluteEnd);
            var subToken = input.Substring(absoluteTokenSpan.Start, absoluteTokenSpan.End - absoluteTokenSpan.Start);

            newTokens.Add(absoluteTokenSpan);
            tokenProbabilities.Add(0.9);

            // Emit a one-character span for the delimiter itself.
            if (absoluteEnd + 1 > input.Length)
            {
                break;
            }

            absoluteTokenSpan = new Span(absoluteTokenSpan.End, absoluteTokenSpan.End + 1);
            subToken = input.Substring(absoluteTokenSpan.Start, absoluteTokenSpan.End - absoluteTokenSpan.Start);

            // Skip the delimiter span if it is whitespace.
            if (subToken.Length == 1 && char.IsWhiteSpace(subToken[0]))
            {
                break;
            }

            newTokens.Add(absoluteTokenSpan);
            tokenProbabilities.Add(0.9);
        }

        // Only the first delimiter found is handled.
        break;
    }
}
else
{
    // ...
```
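For reference, here is a rough sketch of the `SplitOn` helper the snippet above assumes (it yields token-relative spans of the substrings between delimiter occurrences; my actual version may differ):

```csharp
// Hypothetical sketch of SplitOn: yields spans, relative to the token,
// of the substrings between occurrences of the delimiter character.
// Requires System.Collections.Generic and OpenNLP.Tools.Util (for Span).
private static IEnumerable<Span> SplitOn(string token, char delimiter)
{
    var start = 0;
    for (var i = 0; i < token.Length; i++)
    {
        if (token[i] != delimiter) continue;

        if (i > start)
        {
            yield return new Span(start, i);
        }
        start = i + 1;
    }

    if (start < token.Length)
    {
        yield return new Span(start, token.Length);
    }
}
```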
commented

I'll take a quick look at your fix and get back to you once the solution has been integrated into the master branch.
Thanks for your help on this!

commented

When I run the default tokenizer with the models provided, I get the expected results:

[screenshot: the example sentence tokenized as expected]

Can you tell me which tokenizer and which models you are using to reproduce this issue?

commented

Sorry, Alex, I didn't see your last reply. I should have clarified my goal from the beginning: I was trying to figure out how to map the sub-parses back to their locations in the original text. You are right that the tokenizer tokenizes correctly, but I had to add a few properties to the Parse class and write a routine to analyze the spacing correctly.
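In case it helps anyone with the same goal, the approach boils down to working with the character spans instead of the token strings. A minimal sketch (using the repo's `EnglishMaximumEntropyTokenizer`; the model path, and my actual routine on `Parse`, will differ):

```csharp
using System;
using OpenNLP.Tools.Tokenize;

class SpanMappingDemo
{
    static void Main()
    {
        // Adjust the tokenizer model path for your setup.
        var tokenizer = new EnglishMaximumEntropyTokenizer("Resources/Models/EnglishTok.nbin");
        var input = "As a Hydra Store Visitor, I want to see the latest version.";

        // TokenizePositions returns spans into the original string, so original
        // character locations are recovered by slicing, never by re-joining tokens.
        foreach (var span in tokenizer.TokenizePositions(input))
        {
            var token = input.Substring(span.Start, span.End - span.Start);
            Console.WriteLine($"{span.Start,3}-{span.End,3}: {token}");
        }
    }
}
```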

I would submit a pull request with my code, but I went a bit rogue and it would break the port from Java. I am okay with closing the issue. Thank you for looking into it.