Support reading docx files in Flat OPC format
jessrosenfield opened this issue
Jessica Rosenfield commented
Docx files may be stored as a single flattened XML file (the Flat OPC format) rather than as a zip archive. I believe _calculateExtractedText has this functionality. Here's an example of what this input looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
<pkg:part pkg:name="/_rels/.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="512">
<pkg:xmlData>
[FILE CONTENT]
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/_rels/document.xml.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="256">
<pkg:xmlData>
[FILE CONTENT]
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/footnotes.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml">
<pkg:xmlData>
[FILE CONTENT]
</pkg:xmlData>
</pkg:part>
<pkg:part pkg:name="/word/media/image1.png" pkg:contentType="image/png" pkg:compression="store">
<pkg:binaryData>
[FILE CONTENT]
</pkg:binaryData>
</pkg:part>