Processing Word (.docx) documents with Python

(1) Read the Word document with the python-docx library (imported as docx):

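import docx  # provided by the python-docx package

fn = r'E:\abc\test.docx'
doc = docx.Document(fn)

# doc.paragraphs contains only the paragraphs of the document body
for paragraph in doc.paragraphs:
    print(paragraph.text)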

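When I used this method myself, I found that it does not read all of the text: some content with different formatting may be skipped during reading. For that reason I tend to recommend the second method.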
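(2) A .docx file is a ZIP archive. Unzipping it yields several files and folders, and the body text is stored in the <w:t> tags of word/document.xml, so it is enough to parse that file and take the text out of those tags.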

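from zipfile import ZipFile
from bs4 import BeautifulSoup

# Open the .docx as a ZIP archive and read the XML that holds the body
with ZipFile('test.docx') as document:
    xml = document.read("word/document.xml")

# Name the parser explicitly; html.parser ships with Python and keeps "w:t" as a tag name
wordObj = BeautifulSoup(xml.decode("utf-8"), "html.parser")
texts = wordObj.find_all("w:t")

# Join the text of every <w:t> tag (a variable called `str` would shadow the built-in)
content = ''.join(text.text for text in texts)
# print(content)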
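
The snippet above joins every piece of text into one string, so paragraph boundaries are lost. As a rough sketch of one way to keep them, the same idea can be written with only the standard library by grouping the <w:t> text under each <w:p> (paragraph) element. The function names docx_paragraphs and make_sample_docx are made up for this illustration, and the sample archive is a stripped-down stand-in that Word itself would not open, since it contains only word/document.xml.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# WordprocessingML namespace bound to the w: prefix in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each <w:p> paragraph found in a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # iter() walks the whole tree, so paragraphs inside tables are included
    for p in root.iter(W_NS + "p"):
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory archive for demonstration only.

    It holds just word/document.xml, so it is not a complete Word file.
    """
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:tbl><w:tr><w:tc>"
        "<w:p><w:r><w:t>cell text</w:t></w:r></w:p>"
        "</w:tc></w:tr></w:tbl>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in docx_paragraphs(make_sample_docx()):
        print(line)
```

With a real file, docx_paragraphs('test.docx') would be called the same way. One limitation of this sketch: a paragraph that contains a text box holds nested <w:p> elements, so that text would appear twice in the result.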

Original link: https://blog.csdn.net/qq_33470156/article/details/93399742