dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

preserveOnlyMultipleLineBreaks does not work on PDF when eol:'dos'

thegoatherder opened this issue

I'm trying to extract text from a PDF using textract.fromFileWithPath() in a Windows environment, with textract v2.5.0.

The following config is set:

{
  preserveOnlyMultipleLineBreaks: true,
  pdftotextOptions: {
    eol: 'dos',
    layout: 'raw',
    encoding: 'UTF-8',
    splitPages: true
  }
}

I have found that preserveOnlyMultipleLineBreaks: true does not work as expected. With the setting on, every \r\n in the output is converted to \r. But AFAIK a lone \r is not a line ending on either Windows or Unix systems. I expected it instead to convert \r\n\r\n to \r\n and to remove single \r\n sequences from the text output completely.

Seems like a bug?
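
In the meantime, a possible workaround is to extract the text without preserveOnlyMultipleLineBreaks and post-process it. The following is only a minimal sketch of the behaviour described above, not textract's own implementation: collapseDosLineBreaks is a hypothetical helper name, and replacing a single line break with a space (rather than removing it outright, which would fuse the neighbouring words) is an assumption.

```javascript
// Hypothetical helper, not part of textract: applies the line-break handling
// expected in this report to text that uses DOS (\r\n) line endings.
function collapseDosLineBreaks(text) {
  return text
    // Normalise CRLF (and any stray CR) to LF so one pattern covers every case.
    .replace(/\r\n?/g, '\n')
    // Runs of two or more breaks are treated as paragraph boundaries.
    .split(/\n{2,}/)
    // Inside a paragraph, a single break becomes a space (assumption).
    .map((paragraph) => paragraph.replace(/\n/g, ' ').trim())
    // Drop empty fragments left by leading or trailing breaks.
    .filter((paragraph) => paragraph.length > 0)
    // Re-join the paragraphs with one DOS line ending each.
    .join('\r\n');
}

const sample = 'first line\r\nsame paragraph\r\n\r\nnext paragraph\r\n';
console.log(JSON.stringify(collapseDosLineBreaks(sample)));
// prints "first line same paragraph\r\nnext paragraph"
```

The helper could be applied to the text passed to the fromFileWithPath() callback. Normalising to \n first keeps the patterns simple, and the DOS endings are restored only when the paragraphs are joined.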