preserveOnlyMultipleLineBreaks does not work on PDF when eol:'dos'

Question

thegoatherder opened this issue 5 years ago

I'm trying to extract text from a PDF using textract.fromFileWithPath() in a Windows environment, with textract v2.5.0.

The following config is set:

{
  preserveOnlyMultipleLineBreaks: true,
  pdftotextOptions: {
    eol: 'dos',
    layout: 'raw',
    encoding: 'UTF-8',
    splitPages: true
  }
}

I have found that preserveOnlyMultipleLineBreaks: true is not working as expected. When the setting is on, every \r\n in the output is converted to a lone \r. But AFAIK \r on its own is not a line ending on either Windows or Unix systems. I expected it instead to convert \r\n\r\n to \r\n and to remove single \r\n occurrences completely from the text output.
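To make the expected behaviour concrete, here is a minimal sketch of the post-processing I have in mind. collapseLineBreaks is a hypothetical helper of my own, not part of textract, and it assumes that "multiple" means a run of two or more line breaks of any style (\r\n, \n, or a lone \r):

```javascript
// Hypothetical helper, NOT part of textract: a sketch of the expected output.
// Runs of two or more line breaks collapse to a single `eol`;
// a lone line break is removed entirely.
function collapseLineBreaks(text, eol = '\r\n') {
  // Match a run of one or more line breaks in any style.
  return text.replace(/(?:\r\n|\r|\n)+/g, (run) => {
    // Count the individual breaks in the run (\r\n counts as one break).
    const count = run.match(/\r\n|\r|\n/g).length;
    return count >= 2 ? eol : '';
  });
}
```

As a possible workaround, this could be applied to the text passed to the fromFileWithPath() callback, assuming the extraction is run with line breaks left intact (for example with preserveLineBreaks: true instead of preserveOnlyMultipleLineBreaks). I have not verified that against every extractor.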

Seems like a bug?