bitextor / pdf-extract

PDF parser and converter to HTML

java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.

lpla opened this issue

I downloaded the sentence-join model from http://data.statmt.org/paracrawl/sentence-join/en/ and tried to run it on a simple PDF that I had already processed successfully without this model (https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf), using the default config file (PDFExtract.json). With the code at the commit before the #54 fix, I got this error:

java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.

        at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
        at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
        at pdfextract.PDFExtract.Extract(PDFExtract.java:391)

With the #54 fix, only this non-specific warning is shown in the output:

<warnings>
<warning>
<method>sentenceJoin</method>
<details>
        <message><![CDATA[Fail loading model for language: en]]></message>
        <suggestion><![CDATA[Please verify the "sentencejoin_model" value of language {en} in configuration file.]]></suggestion>
</details>
</warning>
</warnings>
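
Since the warning above is easy to miss, it can help to pull it out of the output programmatically. The following is only a minimal sketch: it assumes the `<warnings>` block has already been isolated as a standalone XML fragment (how it is embedded in the full PDFExtract output is not shown in this thread), and the function name `extract_warnings` is hypothetical, not part of PDFExtract.

```python
import xml.etree.ElementTree as ET


def extract_warnings(xml_text):
    """Return one dict per <warning> element found in the fragment.

    Hypothetical helper: assumes xml_text is a standalone <warnings> block
    shaped like the one quoted in this issue.
    """
    root = ET.fromstring(xml_text)
    results = []
    for warning in root.iter("warning"):
        results.append({
            "method": (warning.findtext("method") or "").strip(),
            # CDATA sections are returned as ordinary text by the parser.
            "message": (warning.findtext("details/message") or "").strip(),
            "suggestion": (warning.findtext("details/suggestion") or "").strip(),
        })
    return results


# The warning block quoted in this issue, used as sample input.
SAMPLE = """<warnings>
<warning>
<method>sentenceJoin</method>
<details>
        <message><![CDATA[Fail loading model for language: en]]></message>
        <suggestion><![CDATA[Please verify the "sentencejoin_model" value of language {en} in configuration file.]]></suggestion>
</details>
</warning>
</warnings>"""

if __name__ == "__main__":
    for w in extract_warnings(SAMPLE):
        print(f'{w["method"]}: {w["message"]}')
```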

Hi @lpla,
As a workaround, please follow the instructions here.
I will look for the root cause and fix it once I can reproduce the error message above.

Hi. I didn't use Bitextor for this example. I only ran this command with the PDF I mentioned:
java -jar target/PDFExtract-2.0.jar -I ~/forcada16j.pdf -O test

with the attached JSON config file (compressed because of GitHub's attachment format restrictions) and the data I downloaded from statmt, as mentioned in the original post.

PDFExtract.zip

Hi,
I still cannot reproduce it, even when using the command below:

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test

with the attached JSON config file. The result it returned is attached.

pdfExtract.zip

The result you attached is not HTML like the output I was getting; it is plain text. Also, as mentioned, I am using the penultimate master commit, checked out with git checkout 56f327a26e6b1bf4ad137d2c4c86c6e0c5402448

You can use the command below to get the HTML result:

-O <output_file> specifies the path to the output HTML file after extraction

java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test.html

Result:
html_result.zip

Sorry, I was opening your output file with an editor that actually interpreted the HTML content. You were right.

Which version or commit of kenlm are you using?

I used the current version on git (commit 209ceb6).
Could you try re-downloading setup.sh and running the command below to reinstall PDFExtract-2.0.jar?

sudo bash setup.sh

I was asking about the kenlm version, which is the only part that is not installed by setup.sh.

Sorry, I tested with the current KenLM version and hit the same error as you.
As a workaround:

  1. You can use Moses instead of KenLM.
  2. Or you can keep using KenLM, but you have to redirect stderr to "/dev/null" in sentence-join.py, as in the attachment below.
    sentence-join.zip
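
The second workaround can be sketched as follows. The attached sentence-join.py is not reproduced in this thread, so this is not its actual code: it only illustrates, under the assumption that the script launches an external language-model command as a child process, how that child's stderr can be sent to /dev/null so that informational messages do not reach the caller. The helper name `run_discarding_stderr` and the stand-in command `NOISY_CHILD` are hypothetical.

```python
import subprocess
import sys


def run_discarding_stderr(cmd, input_text=""):
    """Run cmd, feed input_text on stdin, and return its stdout.

    Anything the child writes to stderr is discarded (sent to /dev/null),
    so informational messages cannot be mistaken for errors by the caller.
    """
    completed = subprocess.run(
        cmd,
        input=input_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # equivalent to a shell "2>/dev/null"
        text=True,
        check=True,  # still raise if the child exits with a non-zero status
    )
    return completed.stdout


# Stand-in for the real language-model command (an assumption for this demo):
# it writes a message to stderr and a score to stdout.
NOISY_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('info on stderr'); print('-12.5')",
]

if __name__ == "__main__":
    print(run_discarding_stderr(NOISY_CHILD).strip())
```

Note that discarding stderr also hides genuine error messages, which is why the sketch keeps `check=True` so that a failing child is still reported through its exit status.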