java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.

Question

lpla opened this issue 4 years ago · comments

Leopoldo Pla Sempere commented 4 years ago

I downloaded the sentence-join model from http://data.statmt.org/paracrawl/sentence-join/en/ and ran PDFExtract with the default config file (PDFExtract.json) on a simple PDF that had worked without this model (https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf). Using the commit before the #54 fix, I got this error:

java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.

        at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
        at pdfextract.PDFExtract.Extract(PDFExtract.java:391)

With the #54 fix, only this non-specific warning was shown in the output:

<warnings>
<warning>
<method>sentenceJoin</method>
<details>
        <message><![CDATA[Fail loading model for language: en]]></message>
        <suggestion><![CDATA[Please verify the "sentencejoin_model" value of language {en} in configuration file.]]></suggestion>
</details>
</warning>
</warnings>

ROMUELEE BUESA · Answer 1 · Thu Sep 03 2020 19:17:04 GMT+0800 (China Standard Time)

Hi @lpla ,
As a workaround, please follow the instructions here.
I will look for the root cause and fix it once I can reproduce the error message above.

Leopoldo Pla Sempere · Answer 2 · Thu Sep 03 2020 19:24:20 GMT+0800 (China Standard Time)

Hi. I didn't use Bitextor for this example. I only ran this command with the PDF I mentioned:
java -jar target/PDFExtract-2.0.jar -I ~/forcada16j.pdf -O test

with the attached JSON config file (compressed because of GitHub's attachment format restrictions) and the data I downloaded from statmt, as mentioned in the original post.

PDFExtract.zip

ROMUELEE BUESA · Answer 3 · Fri Sep 04 2020 01:05:46 GMT+0800 (China Standard Time)

Hi,
I still cannot reproduce it, even when running the command below:

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test

with the attached JSON config file. It returned the attached result.

pdfExtract.zip

Leopoldo Pla Sempere · Answer 4 · Fri Sep 04 2020 02:04:21 GMT+0800 (China Standard Time)

The result you attached is not HTML, as I was getting; it is plain text. Also, as mentioned, I am using the penultimate master commit, checked out with git checkout 56f327a26e6b1bf4ad137d2c4c86c6e0c5402448

ROMUELEE BUESA · Answer 5 · Fri Sep 04 2020 07:23:12 GMT+0800 (China Standard Time)

You can use the command below to get the HTML result:

-O <output_file> specifies the path to the output HTML file after extraction

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test.html

Result:
html_result.zip

Leopoldo Pla Sempere · Answer 6 · Fri Sep 04 2020 14:43:39 GMT+0800 (China Standard Time)

Sorry, I was opening your output file with an editor that actually interpreted the HTML content. You were right.

Which version or commit of kenlm are you using?

ROMUELEE BUESA · Answer 7 · Fri Sep 04 2020 15:52:16 GMT+0800 (China Standard Time)

I used the current version on git (commit 209ceb6).
Could you try re-downloading setup.sh and running the command below to reinstall PDFExtract-2.0.jar?

sudo bash setup.sh

Leopoldo Pla Sempere · Answer 8 · Fri Sep 04 2020 16:02:17 GMT+0800 (China Standard Time)

I was asking about the kenlm version, which is the only part that is not installed by setup.sh.

ROMUELEE BUESA · Answer 9 · Fri Sep 04 2020 22:22:01 GMT+0800 (China Standard Time)

Sorry, I tested with the current KenLM version and hit the same error as you.
As a workaround:

You can use Moses instead of KenLM.
Or you can keep using KenLM, but you have to redirect stderr to "/dev/null" in sentence-join.py, as in the attachment below.
sentence-join.zip
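
For readers without the attachment, the second workaround can be sketched roughly as follows. This is a minimal illustration, not the actual contents of sentence-join.py: it assumes the script launches an external command and that the message is written to that command's stderr. The helper name `run_discarding_stderr` and the stand-in child process are hypothetical.

```python
import subprocess
import sys


def run_discarding_stderr(cmd, stdin_text=""):
    """Run cmd, return its stdout as text, and throw away anything on stderr.

    Hypothetical helper for illustration; not taken from sentence-join.py.
    """
    completed = subprocess.run(
        cmd,
        input=stdin_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as `2>/dev/null` in a shell
        text=True,
        check=True,  # still raise if the command exits with a non-zero status
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in child process: writes a notice to stderr and a result to stdout.
    child = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('model notice\\n'); print('joined')",
    ]
    print(run_discarding_stderr(child).strip())
```

Note that discarding stderr also hides genuine error messages, so keeping `check=True` (or otherwise inspecting the exit status) is a reasonable safeguard with this kind of workaround.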