Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment

Home Page: https://pdf-lib.js.org

Parsing text from PDF

nunofgs opened this issue

Hi @Hopding, thank you for the great lib.

Apologies if this is a newbie question, but I can't seem to find a way to parse text out of an existing PDF. I'm looking to retrieve a string from a PDF in order to determine which page it's on.

Any idea how I could accomplish this?

I'm personally looking to find some text and replace the contents of the "field" it's in.

Hello @nunofgs!

It is not currently possible to parse plain text out of a document with pdf-lib (but you can extract the content of acroform fields). I'd suggest you consider using PDF.js to parse/extract text.

Of course, this isn't an ideal solution since it requires two different libraries for a seemingly simple task. But it's the best approach I know of for now, until pdf-lib gains support for text parsing.
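For anyone landing here, the PDF.js route could look roughly like the sketch below. It assumes you already have a PDF.js document object (the result of `pdfjsLib.getDocument(...).promise`) and walks its pages looking for a string. `findPagesContaining` is a hypothetical helper name, not part of PDF.js or pdf-lib. Joining text runs with an empty string is a simplification, so matches that span line breaks or oddly split runs may be missed.

```javascript
// Sketch only: returns the 1-based page numbers whose text contains `needle`.
// `pdf` is assumed to be a PDF.js document (PDFDocumentProxy).
// `findPagesContaining` is a hypothetical helper, not a library function.
async function findPagesContaining(pdf, needle) {
  const matches = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each item is a run of text; items without `str` (e.g. marked-content
    // markers) contribute nothing. Joining with '' only approximates the
    // page text.
    const text = content.items.map((item) => item.str ?? '').join('');
    if (text.includes(needle)) {
      matches.push(pageNumber);
    }
  }
  return matches;
}

// Usage (assumes `pdfjs-dist` is installed; the import path varies by version):
//   const pdfjsLib = await import('pdfjs-dist');
//   const pdf = await pdfjsLib.getDocument({ data: pdfBytes }).promise;
//   console.log(await findPagesContaining(pdf, 'some text'));
```

The page numbers PDF.js reports are 1-based, so subtract one before using them as an index into pdf-lib's `pdfDoc.getPages()` array.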

@dasilvacontin Is the field you are working with just plain text? Or is it an acroform field? If it is raw text, I'm afraid pdf-lib doesn't have the necessary features to parse it (but as I mentioned above, you could use PDF.js instead).

However, if it's in an acroform, pdf-lib should be able to do what you need. pdf-lib's acroform support isn't currently well documented, so I'd suggest taking a look at some of the existing acroform issues. Please let me know if you have any questions!
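If the value does live in an acroform text field, a minimal sketch of reading and replacing it might look like this. It assumes a more recent pdf-lib release that exposes the form API (`pdfDoc.getForm()`), which is newer than this reply. `replaceTextFieldValue` and the field name in the usage comment are hypothetical.

```javascript
// Sketch only: replaces the contents of an acroform text field and returns
// the previous value. `pdfDoc` is assumed to be a pdf-lib PDFDocument from a
// release that includes the form API. `replaceTextFieldValue` is a
// hypothetical helper, not a library function.
function replaceTextFieldValue(pdfDoc, fieldName, newText) {
  const form = pdfDoc.getForm();
  // getTextField throws if the field does not exist or is not a text field.
  const field = form.getTextField(fieldName);
  const previous = field.getText(); // undefined when the field is empty
  field.setText(newText);
  return previous;
}

// Usage (assumes `pdf-lib` is installed and `existingPdfBytes` holds the file;
// 'customer.name' is a made-up field name):
//   const { PDFDocument } = require('pdf-lib');
//   const pdfDoc = await PDFDocument.load(existingPdfBytes);
//   replaceTextFieldValue(pdfDoc, 'customer.name', 'Jane Doe');
//   const updatedBytes = await pdfDoc.save();
```

To discover which field names a document contains, `form.getFields()` returns the fields and each one has a `getName()` method.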