davidbrandfonbrener / lapdftext

LA-PDFText is an open-source system for extracting accurate text from PDF-based research articles, with an interface for improving performance where needed. It provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. The baseline works well for most applications, though it may occasionally extract the wrong text; in those cases you can 'hack' your own rules to improve performance.

About

License: GNU General Public License v3.0


Languages

Java 82.0%
XSLT 12.4%
HTML 4.7%
XProc 0.6%
KiCad Layout 0.2%
CSS 0.1%