Parsing text from PDF

Question

nunofgs opened this issue 5 years ago

Hi @Hopding, thank you for the great lib.

Apologies if this is a newbie question, but I can't seem to find a way to parse text out of an existing PDF. I'm looking to retrieve a string from a PDF in order to determine which page it's on.

Any idea how I could accomplish this?

David da Silva · Answer 1 · Tue Jul 16 2019 19:39:49 GMT+0800 (China Standard Time)

I'm personally looking to find some text and replace the "field"'s contents.

Andrew Dillon · Answer 2 · Sun Jul 21 2019 07:04:25 GMT+0800 (China Standard Time)

Hello @nunofgs!

It is not currently possible to parse plain text out of a document with pdf-lib (but you can extract the content of acroform fields). I'd suggest you consider using PDF.js to parse/extract text.

Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
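
For the original use case (finding which page a string is on), a minimal sketch of the PDF.js route could look like the following. It assumes a document object shaped like the one PDF.js hands back, with `numPages`, `getPage()` and `getTextContent()`; the `DocLike` and `PageLike` interfaces and the `findPagesContaining` helper are hypothetical names made up for this example, not part of either library.

```typescript
// Minimal structural types covering only the members this sketch uses.
// They are meant to match a PDF.js document and page, but are assumptions.
interface PageLike {
  getTextContent(): Promise<{ items: object[] }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Returns the 1-based page numbers whose extracted text contains `needle`.
async function findPagesContaining(doc: DocLike, needle: string): Promise<number[]> {
  const matches: number[] = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    // Some items carry no `str` (e.g. marked-content entries), so skip those.
    const pageText = content.items
      .map((item) => {
        const str = (item as { str?: unknown }).str;
        return typeof str === "string" ? str : "";
      })
      .join(" ");
    if (pageText.includes(needle)) {
      matches.push(pageNumber);
    }
  }
  return matches;
}
```

With the `pdfjs-dist` package, the document would typically come from `getDocument({ data: bytes }).promise`, and the result can be passed straight to `findPagesContaining`. Note that text runs are joined with a space here, so a search string that is split across runs in the middle of a word may not match; joining with an empty string instead trades that problem for words running together.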

Andrew Dillon · Answer 3 · Sun Jul 21 2019 07:08:35 GMT+0800 (China Standard Time)

@dasilvacontin Is the field you are working with just plain text? Or is it an acroform field? If it is raw text, I'm afraid pdf-lib doesn't have the necessary features to parse it (but as I mentioned above, you could use PDF.js instead).

However, if it's in an acroform, pdf-lib should be able to do what you need. pdf-lib's acroform support isn't currently well documented, so I'd suggest taking a look at some of the existing acroform issues. Please let me know if you have any questions!
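
As a rough sketch of the find-and-replace case for form fields: this assumes a later pdf-lib release with a form API, where text fields expose `getName()`, `getText()` and `setText()`. The `TextFieldLike` interface and the `replaceInTextFields` helper are hypothetical names for illustration only.

```typescript
// Structural type for the text-field methods this sketch relies on.
// Assumed to match pdf-lib's text fields in releases that ship a form API.
interface TextFieldLike {
  getName(): string;
  getText(): string | undefined;
  setText(text: string | undefined): void;
}

// Replaces every occurrence of `search` in each field's text.
// Returns the names of the fields that were changed.
function replaceInTextFields(
  fields: TextFieldLike[],
  search: string,
  replacement: string,
): string[] {
  const changed: string[] = [];
  if (search === "") return changed; // nothing sensible to search for
  for (const field of fields) {
    const current = field.getText();
    // Empty fields report undefined; skip them and fields without a match.
    if (current === undefined || !current.includes(search)) continue;
    field.setText(current.split(search).join(replacement));
    changed.push(field.getName());
  }
  return changed;
}
```

In a pdf-lib version that provides `pdfDoc.getForm()`, the input list would be the form's fields filtered down to text fields, and the document would then be saved as usual. This only touches form field values; plain page text is still out of reach, as described above.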