Problems with garbled characters in docx files
uptown opened this issue
Hi,
While extracting text from docx files, I found that some text is not extracted as expected.
I think the problem is caused by the conversion from byte buffer to string in this file:
function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    var text = ''
      , error = ''
      ;
    if ( err ) {
      cb( err, null );
      return;
    }
    readStream.on( 'data', function( chunk ) {
      text += chunk; // HERE !!
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        cb( null, text );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}
In this function, the line text += chunk; converts each chunk to a string on its own,
so a multibyte character that is split across two chunks is decoded incorrectly, and text can end up containing wrong characters.
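To illustrate the failure mode described above, here is a minimal sketch that splits one multibyte UTF-8 character across two chunks by hand. The helper name concatAsStrings and the sample character are assumptions made for the demonstration; they are not part of the library.

```javascript
// Hypothetical helper: mimics `text += chunk` from the function above.
// Adding a Buffer to a string decodes that Buffer as UTF-8 on its own.
function concatAsStrings(chunks) {
  let text = '';
  for (const chunk of chunks) {
    text += chunk; // implicit chunk.toString('utf8')
  }
  return text;
}

// '한' is three bytes in UTF-8 (0xED 0x95 0x9C).
const whole = Buffer.from('한', 'utf8');
const firstPart = whole.subarray(0, 2);  // incomplete sequence
const secondPart = whole.subarray(2);    // stray continuation byte

const garbled = concatAsStrings([firstPart, secondPart]);

// Each fragment is invalid by itself, so the decoder emits the
// replacement character U+FFFD instead of the original text.
console.log(garbled === '한');            // false
console.log(garbled.includes('\uFFFD')); // true
```

Where a real stream cuts its chunks depends on the underlying reader, so the garbling only shows up when a boundary happens to fall inside a character.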
So I changed the function slightly, changing the type of text from string to Buffer:
function getTextFromZipFile( zipfile, entry, cb ) {
  zipfile.openReadStream( entry, function( err, readStream ) {
    var text = new Buffer("")
      , error = ''
      ;
    if ( err ) {
      cb( err, null );
      return;
    }
    readStream.on( 'data', function( chunk ) {
      text = Buffer.concat([text, chunk]);
    });
    readStream.on( 'end', function() {
      if ( error.length > 0 ) {
        cb( error, null );
      } else {
        cb( null, "" + text );
      }
    });
    readStream.on( 'error', function( _err ) {
      error += _err;
    });
  });
}
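A variant of the same idea, sketched under a few assumptions: the chunks are collected in an array and joined with a single Buffer.concat call at the end, which avoids re-copying the growing buffer on every 'data' event. It also avoids new Buffer(""), which current Node.js versions deprecate in favor of Buffer.alloc and Buffer.from. The names decodeChunks and collectText are hypothetical, and a PassThrough stream stands in for the stream returned by openReadStream.

```javascript
const { PassThrough } = require('stream');

// Hypothetical helper: join the raw bytes first, decode once at the end.
function decodeChunks(chunks) {
  return Buffer.concat(chunks).toString('utf8');
}

// Hypothetical helper: gathers a readable stream into one string.
function collectText(readStream, cb) {
  const chunks = [];
  let failed = false;
  readStream.on('data', function (chunk) {
    chunks.push(chunk); // keep raw Buffers, no decoding yet
  });
  readStream.on('error', function (err) {
    failed = true;
    cb(err, null);
  });
  readStream.on('end', function () {
    if (!failed) {
      cb(null, decodeChunks(chunks));
    }
  });
}

// Usage: feed a multibyte string in two pieces that split a character.
const sample = Buffer.from('한글', 'utf8');
const fakeStream = new PassThrough(); // stand-in for the zip entry stream
collectText(fakeStream, function (err, text) {
  console.log(err, text === '한글');
});
fakeStream.write(sample.subarray(0, 2));
fakeStream.write(sample.subarray(2));
fakeStream.end();
```

Because decoding happens only after all bytes are joined, chunk boundaries no longer matter.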
With this change, I finally get the correct output.
Is there any problem with this approach?
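For comparison, one alternative to buffering everything is Node's built-in string_decoder module, which holds back incomplete multibyte sequences between writes. This is only a sketch of that swapped-in technique, not what the library does; decodeIncrementally is a hypothetical name.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decodes chunk by chunk without garbling characters
// that straddle a chunk boundary.
function decodeIncrementally(chunks) {
  const decoder = new StringDecoder('utf8');
  let text = '';
  for (const chunk of chunks) {
    // write() returns only the complete characters seen so far and
    // keeps any trailing partial sequence for the next call.
    text += decoder.write(chunk);
  }
  return text + decoder.end(); // flush whatever is left
}

const bytes = Buffer.from('한', 'utf8');
const decoded = decodeIncrementally([bytes.subarray(0, 2), bytes.subarray(2)]);
console.log(decoded === '한'); // true
```

This keeps memory use flat for large entries, while the Buffer approach is simpler when the whole text is needed anyway.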
Thank you.
Doesn't seem to be anything up with that approach! Please do give the tests a go and submit a PR. I tend to wait 3 months or so while issues and PRs pile up and then go through them and release. Should be doing another soon.
Thanks!