Reading text contents of page

Question

matthopson opened this issue 5 years ago

Hi, thanks for working on this project. It has met my needs beautifully with one exception (and it's probably a lack of understanding on my part).

I couldn't find a very intuitive way to get the contents of a page and verify its text content.

While generating a new page and inserting it into a document was very straightforward, I'd also like to test this functionality, including that the expected contents end up on the page (it's dynamically generated). So when writing a test, I'd like to create a page, insert several lines of text, and then read that page back in to verify that the expected lines of text exist on it.

Am I overlooking something obvious, or is there no straightforward way to do this?

Thanks!

Andrew Dillon · Answer 1 · Wed Apr 17 2019 01:20:24 GMT+0800 (China Standard Time)

Hello @matthopson. pdf-lib is primarily focused on creating and editing PDFs right now. It does not currently have functionality to extract text content from them, though this is something I've considered adding at some point in the future.

For your use case, I'd suggest using pdf.js to extract text from the documents you create/modify with pdf-lib. pdf.js is a library specifically designed to extract text, images, etc. from PDFs for rendering. The pdf.js project has an example of using it in Node.
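As a rough illustration of that suggestion, here is a minimal sketch of reading a page's text back with pdf.js so a test can assert on it. The function names `itemsToLines` and `extractPageLines` are hypothetical helpers, not part of either library. The import path assumes a recent `pdfjs-dist` release that ships an ES module legacy build; older releases expose the same API from `pdfjs-dist/legacy/build/pdf.js`. Grouping items by their y coordinate is a simple heuristic and may not suit every layout.

```javascript
// Hypothetical helper: turn pdf.js text items into lines, top to bottom.
// Each item carries `str` and a `transform` matrix whose entries [4] and [5]
// are the x and y of the text origin (PDF y coordinates grow upward).
function itemsToLines(items) {
  const rows = new Map();
  for (const item of items) {
    if (!item.str) continue; // skip empty marker items
    const y = Math.round(item.transform[5]);
    if (!rows.has(y)) rows.set(y, []);
    rows.get(y).push(item);
  }
  return [...rows.entries()]
    .sort(([yA], [yB]) => yB - yA) // higher y first, i.e. top of the page
    .map(([, row]) =>
      row
        .sort((a, b) => a.transform[4] - b.transform[4]) // left to right
        .map((item) => item.str)
        .join('')
        .trim()
    )
    .filter((line) => line.length > 0);
}

// Hypothetical helper: load PDF bytes with pdf.js and return one page's lines.
// Assumption: the module path below matches the installed pdfjs-dist version.
async function extractPageLines(pdfBytes, pageNumber = 1) {
  const pdfjs = await import('pdfjs-dist/legacy/build/pdf.mjs');
  const doc = await pdfjs.getDocument({ data: pdfBytes }).promise;
  const page = await doc.getPage(pageNumber); // pdf.js pages are 1-indexed
  const { items } = await page.getTextContent();
  return itemsToLines(items);
}
```

In a test, the bytes returned by pdf-lib's `pdfDoc.save()` could be passed straight to `extractPageLines`, and the resulting array compared against the lines that were drawn on the page.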

Let me know if you have any further questions!

Matt · Answer 2 · Wed Apr 17 2019 01:50:54 GMT+0800 (China Standard Time)

Thanks for the response. I had considered this and was hoping not to need two separate PDF libraries, but it sounds like that's my best bet for the time being.

Thanks!