Hebrew text displayed backwards
dotancohen opened this issue · comments
This tool is terrific, thank you.
Highlighted and underlined Hebrew text is displayed backwards. Interestingly, the title blurb preceding the highlighted text is not backwards.
Find attached a PDF file, created in LibreOffice Writer, with the following structure:
שלום, עולם.
# כותרת
זה קובץ לבדיקה, אני סתם כותב משהו כאן.
## עוד כותרת
אין שמש גשם יש רק צל.
I then highlighted the text אני סתם כותב under the first heading and שמש גשם under the second heading. I used Okular (KDE PDF viewer) for the annotations:
pdfannots.pdf
Here is the output:
$ pdfannots pdfannots.pdf
## Highlights
* Page #1 (כותרת): "בתוכ םתס ינא"
* Page #1 (עוד כותרת): "םשג שמש"
Note that כותרת and עוד כותרת are displayed properly, but אני סתם כותב and שמש גשם are backwards.
$ pdfannots --version
pdfannots 0.4
$ uname -a
Linux nefora 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Thanks for the report!
In this case the headings are extracted correctly because they come as a string from the PDF metadata. The problem is that pdfminer's text extraction routines don't support right-to-left text: pdfminer/pdfminer.six#515
There are also some similar assumptions in pdfannots that affect things like the relative order in which two annotations are reported when they appear on the same line of text. I could probably fix that, but the bigger issue is the one linked above.
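As a stopgap, the reversed strings in the output above can be flipped back after extraction. The following is only a rough sketch, not part of pdfannots or pdfminer: `fix_visual_order` is a hypothetical helper name, and it assumes the extracted text is a pure right-to-left run with no Latin letters mixed in. It does not implement the Unicode bidirectional algorithm, so mixed-direction text is left untouched, and combining marks (such as niqqud) or paired punctuation would likely come out wrong.

```python
import unicodedata


def fix_visual_order(text: str) -> str:
    """Reverse text that was extracted in visual (left-to-right) order.

    Hypothetical helper: only handles strings whose strong characters
    are all right-to-left. Anything else is returned unchanged.
    """
    # Bidi classes "R" (e.g. Hebrew) and "AL" (Arabic letters) are strongly RTL.
    has_rtl = any(unicodedata.bidirectional(c) in ("R", "AL") for c in text)
    # Bidi class "L" covers strongly LTR characters such as Latin letters.
    has_ltr = any(unicodedata.bidirectional(c) == "L" for c in text)
    if has_rtl and not has_ltr:
        return text[::-1]
    return text


if __name__ == "__main__":
    # The first highlight as reported by pdfannots in this issue.
    print(fix_visual_order("בתוכ םתס ינא"))
```

This only papers over the symptom on the output side; proper support still needs right-to-left handling in the text extraction itself.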
Thank you.
That bug report points to a fork, PdfMiner.RTL, which has experimental RTL support:
https://pypi.org/project/pdfminer.rtl/
I tried it and in general the tool works well.