kevinboone / epub2txt2

A simple command-line utility for Linux, for extracting text from EPUB documents.

Formatting issue with Arabic text

Kentoseth opened this issue

Hi,

Thank you for this excellent & efficient utility.

I am having formatting issues with both the raw and the normal output for EPUBs that contain Arabic text.

Issues:

  1. Raw: Formatting connects words together & does not preserve newlines
  2. Normal: Does not preserve newlines

I cannot attach an epub, but it can be downloaded here: Shamela Islamic library

Raw copy:
bidaya.txt

Normal copy:
bidaya2.txt

Zamzar.com copy:
29796.txt

Zamzar seems to preserve the original formatting better, although I'm not sure what the stars (*) represent except that they might be for different pages.

I think that at least one problem here has nothing to do with Arabic: the document you linked marks its line breaks with a form of hard line-break markup that epub2txt2 does not handle. Still, it's legal, so it should work. Are you able to build and test the latest commit?
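For readers following along, here is a minimal sketch of what "handling hard line breaks" means when converting XHTML to text. It is not epub2txt2's code (that project is written in C); it uses Python's standard `html.parser` module, and the class name `BreakAwareExtractor`, the helper `extract_text` and the `BLOCK_TAGS` set are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class BreakAwareExtractor(HTMLParser):
    """Collect text, emitting a newline for hard breaks and block ends."""

    # Illustrative subset of block-level tags; a real converter needs more.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # "<br>", "<br/>" and "<br></br>" all arrive here as a "br" start tag.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Closing a block element ends the current line of text.
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(xhtml):
    """Return the text of an XHTML fragment, one line per break or block."""
    parser = BreakAwareExtractor()
    parser.feed(xhtml)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Collapse runs of whitespace inside each line and drop empty lines.
    cleaned = [" ".join(line.split()) for line in lines]
    return "\n".join(line for line in cleaned if line)


sample = "<p>مقدمة المصنف<br/>second line</p><p>third line</p>"
print(extract_text(sample))
```

A converter that only reacts to the end of block elements would print the first two lines of this sample fused together, which matches the word-joining symptom reported in this issue.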

To be honest, I've never tested with right-to-left text before, so there could be other problems. Sadly, I don't have any skill at reading Arabic, so I'm not entirely sure whether what I see is readable or not.

Thanks for fixing the one issue.

I tested the latest build and the issue with words at the end of paragraphs being joined to words at the start of paragraphs persists.

I will do my best to assist with explaining the Arabic words & their meanings. So here is how the sample output should look for lines 1-6 of page 3 of the epub (this is the actual book content):

بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

مقدمة المصنف

بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

الحَمْدُ للَّهِ الذي فَقَّه في دينهِ مَنْ شَاءَ مِنَ العِبَادِ، وَوَفَّقَ أَهَلَ طَاعَتِهِ لِلْعِبَادَةِ والسَّدَادِ، والصَّلاَةُ والسلامُ على سيِّدنا مُحمدٍ الهَادِي إِلى طريقِ الرَّشَادِ، وعَلَى آله وأصحابهِ السَّادَةِ القَادَةِ الأَمْجَادِ، وعَلَى تَابِعيهم بإِحسانٍ صَلاَةً دَائِمَةً مُتَّصِلَة إِلى يَومِ المَعَادِ.

and this is how the current epub2txt2 converts to text (raw):

سْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِمقدمة المصنفبِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِالحَمْدُ للَّهِ الذي فَقَّه في دينهِ مَنْ شَاءَ مِنَ العِبَادِ، وَوَفَّقَ أَهَلَ طَاعَتِهِ لِلْعِبَادَةِ والسَّدَادِ، والصَّلاَةُ والسلامُ على سيِّدنا مُحمدٍ الهَادِي إِلى طريقِ الرَّشَادِ، وعَلَى آله وأصحابهِ السَّادَةِ القَادَةِ الأَمْجَادِ، وعَلَى تَابِعيهم بإِحسانٍ صَلاَةً دَائِمَةً مُتَّصِلَة إِلى يَومِ المَعَادِ.

and this is how the current epub2txt2 converts to text:

بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ مقدمة المصنف بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ 
الحَمْدُ للَّهِ الذي فَقَّه في دينهِ مَنْ شَاءَ مِنَ العِبَادِ، وَوَفَّقَ أَهَلَ طَاعَتِهِ 
لِلْعِبَادَةِ والسَّدَادِ، والصَّلاَةُ والسلامُ على سيِّدنا مُحمدٍ الهَادِي إِلى طريقِ 
الرَّشَادِ، وعَلَى آله وأصحابهِ السَّادَةِ القَادَةِ الأَمْجَادِ، وعَلَى تَابِعيهم بإِحسانٍ 
صَلاَةً دَائِمَةً مُتَّصِلَة إِلى يَومِ المَعَادِ.

It appears that raw mode joins the last word of every paragraph to the first word of the next paragraph, while normal mode only does so in certain places.

Whatever is splitting the lines/paragraphs like this in the epub (and zamzar version):

بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

مقدمة المصنف

بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

Should persist in the text conversion so that:

  1. Word-joining does not occur
  2. Paragraph/line breaks are maintained as in the original
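The first requirement above can be checked mechanically, without reading Arabic. The sketch below assumes you have the expected paragraphs as a list of strings; the function name `find_joined_boundaries` is hypothetical. For each pair of neighbouring paragraphs it looks for the last word of one fused directly to the first word of the next.

```python
def find_joined_boundaries(paragraphs, output):
    """Return (last_word, first_word) pairs that appear fused in output."""
    joined = []
    for prev, nxt in zip(paragraphs, paragraphs[1:]):
        prev_words = prev.split()
        next_words = nxt.split()
        if not prev_words or not next_words:
            continue  # nothing to compare for empty paragraphs
        last_word, first_word = prev_words[-1], next_words[0]
        # A fused boundary shows up as the two words with nothing between.
        if last_word + first_word in output:
            joined.append((last_word, first_word))
    return joined


expected = [
    "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ",
    "مقدمة المصنف",
]

# Simulated outputs: one with paragraphs fused, one with breaks preserved.
fused_output = "".join(expected)
good_output = "\n".join(expected)

print(len(find_joined_boundaries(expected, fused_output)))  # 1
print(len(find_joined_boundaries(expected, good_output)))   # 0
```

This only detects fused words. It does not check whether a break was replaced by a space, which is the behaviour shown in the normal-mode output.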

What may help you resolve this bug is to use the Zamzar copy I sent above and get your output as close to theirs as possible. A quick way to test without needing to know Arabic (or indeed any right-to-left language) is to diff your output against theirs.

I can also verify this manually.
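The diff suggested here could be sketched as follows, using Python's standard `difflib`. The helper names `normalized_lines` and `diff_outputs` are hypothetical, and the whitespace normalization is an assumption meant to hide differences in line wrapping and trailing spaces so that only structural differences remain.

```python
import difflib


def normalized_lines(text):
    """Split text into non-empty lines with inner whitespace collapsed."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def diff_outputs(ours, reference):
    """Return unified-diff lines describing how ours differs from reference."""
    return list(
        difflib.unified_diff(
            normalized_lines(reference),
            normalized_lines(ours),
            fromfile="reference",
            tofile="ours",
            lineterm="",
        )
    )


# Usage with the files attached to this issue (paths are assumptions):
#   from pathlib import Path
#   ours = Path("bidaya2.txt").read_text(encoding="utf-8")
#   reference = Path("29796.txt").read_text(encoding="utf-8")
#   print("\n".join(diff_outputs(ours, reference)))
```

An empty result means the two files have the same line structure after normalization. The star (*) lines in the Zamzar copy would show up as differences unless they are filtered out first.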

This also assumes that you are willing & have the capacity to go that far.

Regardless of the outcome, I once again reiterate my appreciation for this amazing software & thank you for trying to assist me.

Hi. I'm trying my best here, but I confess that I'm struggling with the Arabic. Ironically, the problem isn't with the Arabic at all -- it's with the handling of hard line breaks. Unfortunately, these are rarely used in XHTML, so I haven't tested them very thoroughly.

I have uploaded a new version that might help. Do please let me know. If it helps you, I will still have to do a lot of testing, to ensure that it doesn't break other books that use hard line breaks.