NLP Text Classification in Practice (1): Text Preprocessing

Abstract:

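I have recently been working through some text classification projects. The dataset used is the IMDB movie reviews, which comes as three data files in the /data/rawData directory: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification itself needs the labeled data (labeledTrainData), but because training the word2vec word-embedding model is unsupervised, the unlabeled data can be used there as well.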

(1) Import the libraries and read the data

# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup

# Read the data files, keeping only rows with the expected number of tab-separated fields
with open("../data/rawData/unlabeledTrainData.tsv", "r", encoding='UTF-8') as f:
    unlabeledTrain = [fields for fields in (line.strip().split("\t") for line in f) if len(fields) == 2]

with open("../data/rawData/labeledTrainData.tsv", "r", encoding='UTF-8') as f:
    labeledTrain = [fields for fields in (line.strip().split("\t") for line in f) if len(fields) == 3]

(2) Convert the data to pandas DataFrames

# The first row of each list is the header; the remaining rows are the data
unlabel = pd.DataFrame(unlabeledTrain[1:], columns=unlabeledTrain[0])
label = pd.DataFrame(labeledTrain[1:], columns=labeledTrain[0])
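
To make the header/data split concrete, here is a minimal sketch on invented rows that have the same shape as the lists built in step (1). The names `rows` and `demo` are made up for this illustration.

```python
import pandas as pd

# Invented rows: the first inner list is the header, the rest are data rows.
rows = [
    ["id", "review"],
    ['"1_0"', '"Great cast."'],
    ['"2_0"', '"Too long."'],
]

# rows[0] supplies the column names, rows[1:] supplies the records.
demo = pd.DataFrame(rows[1:], columns=rows[0])

print(demo.columns.tolist())
print(len(demo))
```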

(3) Clean the text and normalize each review into lowercase, space-separated words


def cleanReview(subject):
    # Strip HTML tags; naming the parser explicitly avoids bs4's "no parser specified" warning
    beau = BeautifulSoup(subject, "html.parser")
    newSubject = beau.get_text()
    # Remove punctuation characters
    newSubject = newSubject.replace("\\", "").replace("\'", "").replace('/', '').replace('"', '').replace(',', '').\
        replace('.', '').replace('?', '').replace('(', '').replace(')', '')
    # Split the text on spaces
    newSubject = newSubject.strip().split(" ")
    # Convert every word to lowercase
    newSubject = [word.lower() for word in newSubject]
    # Join the words back into a single space-separated string
    newSubject = " ".join(newSubject)

    return newSubject
    
unlabel["review"] = unlabel["review"].apply(cleanReview)
label["review"] = label["review"].apply(cleanReview)
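
One thing to watch for: `get_text()` joins text fragments with no separator by default, so when tags such as `<br />` are removed the words on either side can run together. The sketch below is a variant, not the function used in this post. It relies only on the standard library (a simple regular expression stands in for BeautifulSoup's tag stripping and does not decode HTML entities), removes the same nine punctuation characters with `str.translate`, and splits on any run of whitespace. The name `clean_review_sketch` and the sample text are made up for illustration.

```python
import re

# The same nine characters that cleanReview removes: \ ' / " , . ? ( )
PUNCT_TABLE = str.maketrans("", "", "\\'/\",.?()")
TAG_RE = re.compile(r"<[^>]+>")


def clean_review_sketch(text):
    # Replace each tag with a space so neighbouring words stay separate.
    text = TAG_RE.sub(" ", text)
    # Delete the punctuation characters in one pass.
    text = text.translate(PUNCT_TABLE)
    # split() with no argument splits on any run of whitespace.
    return " ".join(word.lower() for word in text.split())


sample = '"A great movie.<br /><br />The ending (really) surprised me, didn\'t it?"'
print(clean_review_sketch(sample))
```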

(4) Merge the labeled and unlabeled text. Because word2vec training is unsupervised, the unlabeled reviews can be used alongside the labeled ones.

# Combine the labeled and unlabeled reviews
newDf = pd.concat([unlabel["review"], label["review"]], axis=0)
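
Note that `pd.concat` keeps each input's original index, so the combined Series contains duplicate index labels. That is harmless when the result is written out with `index=False`, but if the combined data is used further in memory, `ignore_index=True` gives a clean 0..n-1 index. A small sketch with invented data (the variable names are illustrative):

```python
import pandas as pd

unlabeled_demo = pd.Series(["first review", "second review"], name="review")
labeled_demo = pd.Series(["third review"], name="review")

# Default behaviour: the original index labels are carried over.
kept = pd.concat([unlabeled_demo, labeled_demo], axis=0)
# With ignore_index=True the result is renumbered from zero.
renumbered = pd.concat([unlabeled_demo, labeled_demo], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```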

(5) Export the preprocessed data

newDf.to_csv("../data/preProcess/wordEmbdiing.txt", index=False)

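# Note: this selection needs a "rate" column on label. None of the steps above create one,
# so it has to be added before this line, otherwise pandas raises a KeyError.
newLabel = label[["review", "sentiment", "rate"]]
newLabel.to_csv("../data/preProcess/labeledCharTrain.csv", index=False)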
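
The post does not show where the `rate` column comes from. One plausible source, offered here as an assumption rather than something stated in this post, is the `id` column: if the ids have the form `<number>_<rating>` (for example `5814_8`), the rating can be parsed from the part after the underscore. The helper name `get_rate` and the sample ids below are invented for illustration, and a step like this would have to run before the `label[["review", "sentiment", "rate"]]` selection.

```python
import pandas as pd


def get_rate(review_id):
    # Drop surrounding double quotes if present, then take the number after "_".
    return int(review_id.strip('"').split("_")[1])


# Invented ids, assumed to follow the "<number>_<rating>" pattern.
label_demo = pd.DataFrame({"id": ['"5814_8"', '"2381_9"', "7759_3"]})
label_demo["rate"] = label_demo["id"].apply(get_rate)

print(label_demo["rate"].tolist())
```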

Original article: https://blog.csdn.net/weixin_40437821/article/details/102638049