0xabu / pdfannots

Extracts and formats text annotations from a PDF file

Hebrew text displayed backwards

dotancohen opened this issue

This tool is terrific, thank you.

Highlighted and underlined Hebrew text is displayed backwards. Interestingly, the heading shown before the highlighted text is not backwards.

Attached is a PDF file, created in LibreOffice Writer, with the following structure:

שלום, עולם.

# כותרת
זה קובץ לבדיקה, אני סתם כותב משהו כאן.

## עוד כותרת
אין שמש גשם יש רק צל.

pdfannots.pdf

I then highlighted the text אני סתם כותב under the first heading and שמש גשם under the second heading, using Okular (the KDE PDF viewer) for the annotations:
pdfannots.pdf

Here is the output:

$ pdfannots pdfannots.pdf
## Highlights

 * Page #1 (כותרת): "בתוכ םתס ינא"

 * Page #1 (עוד כותרת): "םשג שמש"

Note that כותרת and עוד כותרת are displayed properly, but אני סתם כותב and שמש גשם are backwards.

$ pdfannots --version
pdfannots 0.4
$ uname -a
Linux nefora 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Thanks for the report!

In this case the headings are extracted correctly because they come as a string from the PDF metadata. The problem is that pdfminer's text extraction routines don't support right-to-left text: pdfminer/pdfminer.six#515
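Until right-to-left extraction is supported upstream, one possible stopgap is to post-process the extracted strings. The following is only a minimal sketch, not part of pdfannots or pdfminer: the helper names `looks_rtl` and `to_logical_order` are hypothetical, and it assumes each highlight is a single line of purely RTL text that came out in reversed (visual) order, as in the output above. Mixed Hebrew/Latin text, numbers, and multi-line highlights would need a real bidi algorithm.

```python
import unicodedata


def looks_rtl(text: str) -> bool:
    """Return True if strongly right-to-left characters outnumber left-to-right ones."""
    rtl = ltr = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):  # Hebrew letters are "R", Arabic letters are "AL"
            rtl += 1
        elif cls == "L":        # Latin and other left-to-right letters
            ltr += 1
    return rtl > ltr


def to_logical_order(text: str) -> str:
    """Naive workaround (an assumption, not a library API): reverse text that
    looks RTL. Only valid for strings known to be in reversed order already."""
    return text[::-1] if looks_rtl(text) else text


if __name__ == "__main__":
    # The second highlight exactly as reported in the output above.
    print(to_logical_order("םשג שמש"))
```

Applied to the two reported highlights, this restores אני סתם כותב and שמש גשם. It cannot tell whether a string is already in the correct order, so it should only be applied to text known to have come from the affected extraction path.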

There are also some similar assumptions in pdfannots that affect things like the relative order in which two annotations are reported when they appear on the same line of text. I could probably fix that, but the bigger issue is the one linked above.
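To illustrate the ordering assumption, here is a rough sketch of how reading order might depend on text direction. It is not pdfannots' actual code: the `Annot` class, its fields, and the `reading_order` function are hypothetical, and it assumes PDF-style coordinates (origin at the bottom-left, y increasing upwards) with annotations on the same line sharing exactly the same y value.

```python
from dataclasses import dataclass


@dataclass
class Annot:
    """Hypothetical stand-in for an annotation with its position on the page."""
    text: str
    x: float  # horizontal position of the annotation start
    y: float  # vertical position; larger values are higher on the page


def reading_order(annots, rtl=False):
    """Sort top-to-bottom, then left-to-right (or right-to-left if rtl=True)."""
    def key(a):
        # Negate y so higher annotations come first; negate x for RTL lines.
        return (-a.y, -a.x if rtl else a.x)
    return sorted(annots, key=key)


if __name__ == "__main__":
    annots = [Annot("a", 100, 700), Annot("b", 300, 700), Annot("c", 200, 650)]
    print([a.text for a in reading_order(annots)])            # left-to-right
    print([a.text for a in reading_order(annots, rtl=True)])  # right-to-left
```

A real implementation would also need a tolerance when deciding whether two annotations sit on the same line, and a way to detect the direction of each line rather than taking it as a flag.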

Thank you.

That bug report points to a fork, PdfMiner.RTL, which has experimental RTL support:
https://pypi.org/project/pdfminer.rtl/

I tried it and in general the tool works well.