zhr1991 / doc2txt

Extracts text from files in Microsoft Word's binary .doc format.

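The tool parses the binary OLE structure of a .doc file to locate the WordDocument Stream, the Table Stream, and the other parts, then uses certain fields in those streams to recover the text and formatting information.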

Compilation

$ make

Encoding

The extracted text is encoded in UTF-16. ANSI output is not supported.
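If another encoding is needed, the UTF-16 output can be converted afterwards. The Python sketch below is one way to do that; it assumes the output either starts with a byte-order mark or is little-endian (the README does not state the byte order), and the function names are hypothetical.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, honouring a BOM if present.

    Assumption: without a BOM the data is little-endian.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # this codec consumes the BOM itself
    return data.decode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode extracted UTF-16 text as UTF-8."""
    return decode_utf16(data).encode("utf-8")
```

For example, reading the extractor's output file in binary mode and passing the bytes to `utf16_to_utf8` yields text that UTF-8 tools can consume.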
