dbashford / textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

Problems with garbled characters in docx files

uptown opened this issue

Hi,

While extracting text from a docx file, I found that some text is not extracted as expected.

[AS-IS]
image

I think the problem is caused by the conversion from byte buffer to string in this file:

function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    var text = ''
      , error = ''
      ;

    if ( err ) {
      cb( err, null );
      return;
    }

    readStream.on( 'data', function( chunk ) {
      text += chunk; // HERE !! each chunk is converted to a string on its own
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        cb( null, text );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}

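In this function, the line text += chunk; implicitly converts each chunk to a string on its own. If a multi-byte character happens to be split across two chunks, neither half decodes correctly, so text can end up containing garbled characters.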
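
A minimal sketch of how this can go wrong, assuming the stream happens to cut a multi-byte UTF-8 character across two chunks. The sample string and the split point are made up for illustration and are not taken from the failing docx:

```javascript
// Illustration only: the sample text and split position are assumptions.
const whole = Buffer.from('café', 'utf8'); // 'é' takes 2 bytes, so 5 bytes total

// Simulate a stream that delivers the data in two chunks, cutting 'é' in half.
const chunks = [whole.subarray(0, 4), whole.subarray(4)];

// String concatenation decodes every chunk separately, so the split
// character is replaced by U+FFFD replacement characters.
let asString = '';
for (const chunk of chunks) {
  asString += chunk;
}

// Joining the raw bytes first and decoding once keeps the character intact.
const asBuffer = Buffer.concat(chunks).toString('utf8');

console.log(asString === 'café'); // false
console.log(asBuffer === 'café'); // true
```

Whether a real file triggers this depends on where the chunk boundaries fall, which would explain why only some of the text comes out wrong.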

So I changed the function slightly, making text a Buffer instead of a string:

function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    // Buffer.alloc(0) replaces the deprecated new Buffer("")
    var text = Buffer.alloc( 0 )
      , error = ''
      ;

    if ( err ) {
      cb( err, null );
      return;
    }

    readStream.on( 'data', function( chunk ) {
      // Keep the raw bytes; nothing is decoded until the stream ends
      text = Buffer.concat([text, chunk]);
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        // Decode the complete byte sequence once (UTF-8 is the default)
        cb( null, text.toString() );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}

With that change, I finally get the correct output.

[TO-BE]
image
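
A possible variation on the same idea, as a rough sketch rather than the actual textract code: collect the chunks in an array and concatenate them once at the end, which avoids re-copying the growing buffer on every 'data' event. The helper name readStreamToString and the PassThrough stream used to exercise it are hypothetical and only there for the example.

```javascript
const { PassThrough } = require('stream');

// Sketch only: readStreamToString is a hypothetical helper, not part of textract.
function readStreamToString(readStream, cb) {
  const chunks = []; // raw Buffers, left undecoded until the stream ends
  let done = false;  // guards against calling cb more than once

  readStream.on('data', function (chunk) {
    chunks.push(chunk);
  });
  readStream.on('error', function (err) {
    if (!done) {
      done = true;
      cb(err, null);
    }
  });
  readStream.on('end', function () {
    if (!done) {
      done = true;
      // Decode the full byte sequence in one go.
      cb(null, Buffer.concat(chunks).toString('utf8'));
    }
  });
}

// Usage: feed a string whose last character is split across two writes.
const demoStream = new PassThrough();
readStreamToString(demoStream, function (err, text) {
  console.log(err === null && text === 'café'); // true
});
const demoBytes = Buffer.from('café', 'utf8');
demoStream.write(demoBytes.subarray(0, 4));
demoStream.end(demoBytes.subarray(4));
```

Node's string_decoder module is another option when the text has to be decoded incrementally, since it holds back incomplete multi-byte sequences until the remaining bytes arrive.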

Is there any problem with this approach?
Thank you.

Doesn't seem to be anything up with that approach! Please do give the tests a go and submit a PR. I tend to wait 3 months or so while issues and PRs pile up and then go through them and release. Should be doing another soon.

Thanks!