pyxpdf is a fast, memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources.
- Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison)
- Extract text while preserving the original document layout as closely as possible
- Support almost all PDF encodings, CMaps and predefined CMaps.
- Extract LZW, RLE, CCITTFax, DCT, JBIG2 and JPX compressed images and image masks along with their BBox.
- Render PDF pages as images in the '1', 'L', 'LA', 'RGB', 'RGBA' and 'CMYK' color modes.
- No explicit dependencies (except optional ones, see Installation)
- Thread Safe
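As a rough illustration of the thread-safety point, the sketch below extracts text from several PDFs in a thread pool. The helper names `pyxpdf_text` and `extract_many` are hypothetical and not part of pyxpdf. The call `Document(path).text()` is assumed from pyxpdf's basic usage and should be checked against the installed version.

```python
from concurrent.futures import ThreadPoolExecutor


def pyxpdf_text(path):
    """Return all text of one PDF (assumes pyxpdf's Document(path).text())."""
    # Imported lazily so extract_many can be used with another extractor
    # even when pyxpdf is not installed.
    from pyxpdf import Document
    return Document(path).text()


def extract_many(paths, extract=pyxpdf_text, workers=4):
    """Apply `extract` to every path in a thread pool.

    Results are returned in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, paths))
```

A call such as `extract_many(["a.pdf", "b.pdf"])` would return one string per file. Whether the threads actually run in parallel depends on how the library handles the GIL, which this sketch does not verify.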
pyxpdf is licensed under the GNU General Public License (GPL), version 2 or 3. See the LICENSE.
- xpdf reader by Derek Noonburg
- lxml - project structure and build adapted from lxml
- poppler project