preserveOnlyMultipleLineBreaks does not work on PDF when eol:'dos'
thegoatherder opened this issue
Adam commented
I'm trying to extract text from a PDF using textract.fromFileWithPath() in a Windows environment, with textract v2.5.0.
The following config is set:
{
  preserveOnlyMultipleLineBreaks: true,
  pdftotextOptions: {
    eol: 'dos',
    layout: 'raw',
    encoding: 'UTF-8',
    splitPages: true
  }
}
I have found that preserveOnlyMultipleLineBreaks: true is not working as expected. When the setting is on, the output converts \r\n to \r. But AFAIK \r on its own doesn't mean anything on Windows or Unix systems. I'm expecting it instead to convert \r\n\r\n to \r\n and to remove solo \r\n completely from the text output.
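To make the expected behaviour concrete, here is a minimal sketch of the transformation I have in mind. The helper name collapseCrlfLineBreaks is hypothetical and not part of textract. It also makes two assumptions of mine: a solo \r\n is replaced with a single space so adjacent words don't fuse, and runs of three or more \r\n are treated like a double.

```javascript
// Hypothetical helper, not part of textract: a sketch of the behaviour
// this issue expects when the text uses DOS (CRLF) line endings.
function collapseCrlfLineBreaks(text) {
  // Match each run of one or more consecutive CRLF pairs.
  return text.replace(/(?:\r\n)+/g, (run) =>
    // A run longer than one pair (more than 2 characters) is a paragraph
    // break and is kept as a single CRLF; a solo CRLF becomes a space
    // (assumption: a space rather than nothing, so words stay separated).
    run.length > 2 ? '\r\n' : ' '
  );
}

console.log(JSON.stringify(collapseCrlfLineBreaks('a\r\nb\r\n\r\nc')));
// prints "a b\r\nc"
```

Something like this could also be run as a workaround on output extracted with line breaks preserved, until the option handles CRLF itself.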
Seems like a bug?
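One possible explanation, which I have not confirmed against textract's source, is that the single-line-break rule only recognizes \n. The sketch below is purely illustrative: the regex and the function name stripSingleLfBreaks are my own assumptions. It shows how an LF-only rule would strip every \n from CRLF text and leave the \r characters behind, which matches what I'm seeing.

```javascript
// Illustrative assumption only, not taken from textract: a rule that
// treats "\n" as the sole line-break character.
const LF_ONLY_SINGLE_BREAK = /(^|[^\n])\n(?!\n)/g;

function stripSingleLfBreaks(text) {
  // Replace a "\n" that is neither preceded nor followed by another "\n"
  // with a space, keeping the character in front of it.
  return text.replace(LF_ONLY_SINGLE_BREAK, '$1 ');
}

// With CRLF input the "\r" counts as "not a newline", so every "\n" looks
// like a single break: each "\n" is removed and each "\r" is left behind.
console.log(JSON.stringify(stripSingleLfBreaks('a\r\nb')));
// prints "a\r b"
console.log(JSON.stringify(stripSingleLfBreaks('a\r\n\r\nb')));
// prints "a\r \r b"
```

If that is what happens internally, a double \r\n never looks like a double line break to the rule, so paragraph breaks are lost as well.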