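Series contents

Python data processing 1: importing data, slicing data, plotting data
Python data processing 2: fitting data, merging data, exporting data
Python data processing 3: Lorentzian fitting of spectral curves
Python data processing 4: embedding matplotlib in PyQt, listing files with os, running Python directly from cmd/bat
Python data processing 5: batch-saving doc files as docx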

Installing the modules

pip install python-docx pandas

Source code

import os  # used to list the folder and to split file paths
from docx import Document
import pandas as pd

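path = ".\\docx\\"  # folder holding the Word files (relative to the working directory)
files = []
for file in os.listdir(path):
    if file.endswith(".docx"):  # skip other files in the folder; keep only Word files with the ".docx" suffix
        files.append(path + file)
print(files)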

numRows = len(files) + 5
numCols = 110
# Create an empty DataFrame (object dtype) that can hold strings
out = pd.DataFrame(index=range(numRows), columns=range(numCols))

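# Row 0 of `out` is left empty; file number m is written to row m
for m, docx_path in enumerate(files, start=1):
    document = Document(docx_path)  # open the file
    tables = document.tables  # all tables in the document
    table = tables[0]  # the first table in the document
    (filepath, tempfilename) = os.path.split(docx_path)
    (filename, extension) = os.path.splitext(tempfilename)
    out.iloc[m, 1] = filename
    for i in range(1, len(table.rows)):  # start reading from the second row of the table
        for k in range(1, len(table.columns)):  # the first column is skipped as well
            out.iloc[m, 10*i+k] = table.cell(i, k).text
    value = out.iloc[m, 25]
    if isinstance(value, str):  # the entry stays NaN when the table has no cell (2, 5)
        out.iloc[m, 25] = value.replace('汉族', '汉')  # apply any needed substitutions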
  
# Keep rows 1 to 45 (one row per file) and only the columns of interest
out1 = out.iloc[1:46, [1,21,23,25,28,31,33,38,41,43,48,51,55,61,65,73,81,91]]
out1.to_csv('out1.csv', encoding='utf_8_sig')  # UTF-8 with BOM
out2 = out.iloc[1:46, [1,21,23,25,28,31,33,38,41,43,48,51,55,61,65]]
out2.to_csv('out2.csv', encoding='utf_8_sig')
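
The loop above stores cell (i, k) of each table in column 10*i+k of the output row, which is why the selected columns are numbers such as 21, 23 and 25 (table row 2, columns 1, 3 and 5). The following is a minimal sketch of that mapping on a plain list of lists, so it runs without any Word files. The helper name `flatten_table`, its parameters, and the sample grid are hypothetical and not part of the original script. It also makes the limits of the layout visible: the computed position must stay below the number of columns in the DataFrame, and a column index k of 11 or more would land on positions that belong to the next table row.

```python
def flatten_table(grid, stride=10, num_cols=110):
    """Map cell (i, k) to position stride*i + k, skipping row 0 and column 0.

    `grid` is a list of rows, each a list of cell strings (a stand-in for
    a python-docx table). Returns a dict {position: text}.
    """
    flat = {}
    for i in range(1, len(grid)):          # skip the first row, as in the script
        for k in range(1, len(grid[i])):   # skip the first column, as in the script
            pos = stride * i + k
            if pos >= num_cols:
                raise IndexError(
                    f"cell ({i}, {k}) maps to {pos}, beyond {num_cols} columns"
                )
            flat[pos] = grid[i][k]
    return flat


# Hypothetical 3x4 grid standing in for a form-like Word table.
sample = [
    ["h0", "h1", "h2", "h3"],
    ["r1", "a", "b", "c"],
    ["r2", "d", "e", "f"],
]
positions = flatten_table(sample)
print(positions)
```

Raising an error when a position falls outside the frame is a design choice of this sketch; the original script would instead fail inside `out.iloc` with an out-of-bounds index.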

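Takeaways

Reading table data from a docx file with python-docx.
Getting the file name from a path.
Creating a DataFrame that can hold strings.
Selecting rows and columns of a DataFrame for output.
Chinese characters not displaying correctly (for example when the CSV is opened in Excel): write the output as UTF-8 with BOM; an existing file can also be re-saved in this format with Notepad.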
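
Two of the points above can be tried with the standard library alone. The following is a minimal sketch, assuming a made-up path and a temporary output file: it takes the file name without its extension using `os.path.split` and `os.path.splitext`, then writes a small CSV with the `utf_8_sig` codec and checks that the file starts with the UTF-8 byte order mark (bytes EF BB BF). The `csv` module is used here only to keep the example free of dependencies; the script itself writes through pandas.

```python
import csv
import os
import tempfile

# Hypothetical path; the file does not need to exist for the split to work.
path = os.path.join("docx", "form_001.docx")
folder, name_with_ext = os.path.split(path)
stem, ext = os.path.splitext(name_with_ext)
print(stem, ext)

# Write one row containing Chinese text as UTF-8 with BOM.
csv_path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(csv_path, "w", encoding="utf_8_sig", newline="") as f:
    csv.writer(f).writerow([stem, "汉"])

# The first three bytes of the file are the byte order mark.
with open(csv_path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes == b"\xef\xbb\xbf")

# Reading back with the same codec strips the BOM again.
with open(csv_path, encoding="utf_8_sig", newline="") as f:
    row = next(csv.reader(f))
print(row)
```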

References

https://blog.csdn.net/zhengyikuangge/article/details/80451424
https://blog.csdn.net/zichen_ziqi/article/details/104859963

Original article: https://blog.csdn.net/jell14/article/details/118146299