java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.
lpla opened this issue
I downloaded the sentence-join model from http://data.statmt.org/paracrawl/sentence-join/en/ and tried to run it with a simple PDF that I had working without this model (https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf) and the default config file (PDFExtract.json). Using the code at the commit before the #54 fix, I got this error:
java.lang.Exception: This binary file contains trie with quantization and array-compressed pointers.
at pdfextract.SentenceJoin.start(SentenceJoin.java:110)
at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1706)
at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1130)
at pdfextract.PDFExtract.Extract(PDFExtract.java:391)
With the #54 fix, only this non-specific warning was shown in the output:
<warnings>
<warning>
<method>sentenceJoin</method>
<details>
<message><![CDATA[Fail loading model for language: en]]></message>
<suggestion><![CDATA[Please verify the "sentencejoin_model" value of language {en} in configuration file.]]></suggestion>
</details>
</warning>
</warnings>
Hi. I didn't use Bitextor for this example. I only ran this command with the PDF I mentioned:
java -jar target/PDFExtract-2.0.jar -I ~/forcada16j.pdf -O test
with the attached JSON config file (compressed because of GitHub's attachment format restrictions) and the data I downloaded from statmt, as mentioned in the original post.
Hi,
I still cannot reproduce it, even when using the command below:
java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test
with the attached JSON config file. The attached result was returned.
The result you attached is not HTML like the output I was getting; it is plain text. Also, as mentioned, I am using the penultimate master commit, checked out with git checkout 56f327a26e6b1bf4ad137d2c4c86c6e0c5402448
You may use the command below to get the HTML result:
-O <output_file> specifies the path to the output HTML file after extraction
java -jar PDFExtract-2.0.jar -I "/home/ramoslee/work/pdfExtract/testing/forcada16j.pdf" -O test.html
Result:
html_result.zip
Sorry, I was opening your output file with an editor that actually interpreted the HTML content. You were right.
Which version or commit of kenlm are you using?
I was talking about the kenlm version, which is the only part that is not installed with setup.sh.
Sorry, I tested with the current KenLM version and hit the same error as you.
As a workaround:
- You may use Moses instead of KenLM.
- Or you may use KenLM, but you have to redirect stderr to "/dev/null" in sentence-join.py, as in the attachment below.
sentence-join.zip
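For readers without access to the attachment, the second workaround can be sketched roughly as follows. This is only a minimal illustration of redirecting stderr to /dev/null from Python, not the actual change in the attached sentence-join.py, which is not shown in this thread. The helper name `suppress_stderr` is hypothetical. It assumes the messages are written directly to file descriptor 2 by native code, which is why the redirect is done at the descriptor level instead of by reassigning `sys.stderr`.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_stderr():
    """Temporarily point file descriptor 2 (stderr) at os.devnull.

    Hypothetical helper, not taken from the attached script. Working on
    the descriptor also silences output written by native libraries,
    which reassigning sys.stderr alone would not catch.
    """
    sys.stderr.flush()
    saved_fd = os.dup(2)                        # keep a copy of the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)                  # stderr now goes to /dev/null
        yield
    finally:
        sys.stderr.flush()
        os.dup2(saved_fd, 2)                    # restore the original stderr
        os.close(devnull_fd)
        os.close(saved_fd)


if __name__ == "__main__":
    with suppress_stderr():
        # In a real script, the model loading call would go here.
        os.write(2, b"this message is discarded\n")
    os.write(2, b"stderr works again\n")
```

Wrapping only the model loading step keeps later, genuine error messages visible. Note that this hides the message and does not change how the model file is read.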