DOCx files parser

Parses the docx file and returns an html string. Any help or critics appreciated.

Supported elements:

paragraphs (w:p),
images (pic:pic),
links (w:hyperlink),
tables (w:t),
boorkmarks (w:bookmarkStart),
lists.

TODO:

shapes support
table cell styles
add links filter
images optimization (remove extra equal images)
add i18n

Known issues:

memory consuming
for big files, executing more than 30 sec (default timout time)

Usage example:

<?php
    // load lib
	require_once('DocumentsParser.php');

	// init parser
	$parserSettings = array(
		'filesDestinationFolder' => 'images',
	);

	$defaultStyles = array();

	$parser = new DocumentsParser($parserSettings, $defaultStyles);

	// parse DOCx
	$html = $parser->parseFile('test_document.docx');

	// save content to file
	file_put_contents('test_document.html', $html);
?>

Reminder links:

Comparing images - http://pastebin.com/wgeu2DqE
OpenXML Standards - http://www.schemacentral.com/sc/ooxml/ss.html
Awesome XML editor - http://www.oxygenxml.com/download_oxygenxml_developer.html

t1gor / DocumentsParser

DOCx files parser

Supported elements:

TODO:

Known issues:

Usage example:

Reminder links:

About

Languages