kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Home Page: https://grobid.readthedocs.io

Utilities.convertStringOffsetToTokenOffset seems to miss the last token

lfoppiano opened this issue · comments

I'm wondering whether the interval returned by this method:

public static List<OffsetPosition> convertStringOffsetToTokenOffset(...)

does not follow the Java convention of "inclusive, exclusive", so that when we call subList() with it we miss the last token.

This test would fail because the calculated string will be "is ":

        String input = "This is a token.";
        List<LayoutToken> layoutTokens = GrobidAnalyzer.getInstance().tokenizeWithLayoutToken(input);
        OffsetPosition stringPosition = new OffsetPosition(5, 9);
        List<OffsetPosition> tokenOffsets = convertStringOffsetToTokenOffset(Arrays.asList(stringPosition), layoutTokens);

        assertThat(tokenOffsets, hasSize(1));
        OffsetPosition position = tokenOffsets.get(0);
        assertThat(LayoutTokensUtil.toText(layoutTokens.subList(position.start, position.end)), is("is a"));

Hi Luca! If I am not wrong, all token positions are "inclusive, inclusive", while all character positions are "inclusive, exclusive". This is what is expected when creating the features, for example.
The convertStringOffsetToTokenOffset method is typically used for converting regex (character) positions into layout token positions.
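To illustrate the two conventions side by side: with "inclusive, inclusive" token positions, a caller has to pass position.end + 1 to subList() to get the full span. A minimal self-contained sketch (the tokenize method below is a hypothetical stand-in for GrobidAnalyzer, which likewise keeps whitespace as separate tokens):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenOffsetDemo {

    // Hypothetical stand-in for GrobidAnalyzer: splits into word tokens,
    // keeping each whitespace/punctuation character as its own token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(c));
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // Convert a character span [charStart, charEnd) ("inclusive, exclusive")
    // into a token span [tokenStart, tokenEnd] ("inclusive, inclusive").
    static int[] charSpanToTokenSpan(List<String> tokens, int charStart, int charEnd) {
        int tokenStart = -1, tokenEnd = -1, pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int next = pos + tokens.get(i).length();
            if (tokenStart == -1 && charStart < next) tokenStart = i;
            if (pos < charEnd) tokenEnd = i;
            pos = next;
        }
        return new int[] { tokenStart, tokenEnd };
    }

    // Because the token span is "inclusive, inclusive", subList needs end + 1.
    static String joinInclusive(List<String> tokens, int start, int end) {
        return String.join("", tokens.subList(start, end + 1));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("This is a token.");
        int[] span = charSpanToTokenSpan(tokens, 5, 9); // characters "is a"
        System.out.println(joinInclusive(tokens, span[0], span[1])); // prints "is a"
    }
}
```

Passing span[1] directly to subList(), as in the failing test above, would drop the last token and yield "is " instead.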

OK. In general the Java approach is "inclusive, exclusive" for any subList/substring; however, I agree we should keep consistency with the existing Grobid code. For the features, since it's probably checking token by token, it's simpler to keep the "inclusive, inclusive" approach. 😎

Anyway, at the moment, this specific method is used only for the URL recognition.

I know, it is like that for historical reasons, and the amount of code depending on the "inclusive, inclusive" token convention in Grobid and all the Grobid modules is clearly far too large to justify a change here (it's not a problem, just a matter of convention).

The convertStringOffsetToTokenOffset method is also used in Grobid core for recognizing DOI, arXiv identifiers, and emails, as well as in a few other Grobid modules such as dataset and softcite.

Ah, it's true, sorry. I overlooked the places where it's used in the Lexicon 😅