Sentiment Analysis with NLP: Building the Model and Calling It from Java

A recent project required sentiment analysis with NLP. As a Java programmer I knew nothing about NLP, and I had barely touched Python either, so this was a real challenge.

After digging around on Baidu I found this post (https://www.cnblogs.com/jclian91/p/10886031.html) and wrote an NLP demo based on it. The reference source code and part of the data are here: https://github.com/renjunxiang/Text-Classification/blob/master/TextClassification/data/data_single.csv

I pulled the GitHub code to my local machine, opened it in PyCharm, installed the required packages, and ran it. I have to say the sheer variety of Python environments is a huge pitfall: all kinds of version mismatches, and code that breaks because of them.

1. First, the directory structure of the Python project after my changes:

The code under TextClassification is the GitHub code mentioned above, which wraps the CNN training logic. Pay attention to this directory structure, because it is exactly where I later hit a problem when calling Python from Java. A rough sketch of the layout follows.
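The original screenshot of the tree is not reproduced here. Pieced together from the files listed under Related code at the end of this post (so the exact layout is partly an assumption on my part), with the project root anonymized as xxx to match the paths used later, it looks roughly like this:

xxx/
    TextClassification/        # package pulled from GitHub; its __init__.py presumably re-exports the class
        DataPreprocess.py
        net.py
        TextClassification.py
    Data/
        corpus.csv             # training data
    dump/                      # clf_preprocess.pkl and clf_model.h5 are written here
    CNN_train.py               # trains and serializes the model
    CNN_call.py                # loads the serialized files and predicts
    javaCallPython.py          # entry script invoked from Java
    test.py                    # quick local test of CNN_call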

2. The Python version and the versions of the third-party packages

First, the Python version itself: I went with Anaconda.

After installation you get Python 3.8. During installation, make sure to tick the option that adds the path to the environment variables; configuring it by hand is tedious.

Next, the third-party packages the project depends on:

The main ones are the packages imported by the code below: jieba, keras, numpy, pandas and scikit-learn. One thing to call out is that Keras needs TensorFlow, so you indirectly need the tensorflow package as well.

As for specific versions, the main constraint is that keras and tensorflow have to be compatible with each other and with your Python version; for the other packages the latest release should be fine.

A word on downloading the third-party packages: installing them from inside PyCharm very often fails, even with the Douban mirror http://pypi.douban.com/simple configured.

Installing them through Anaconda was much faster for me:

First, configure a global mirror for pip:

Create a pip folder in your user directory with a pip.ini file inside; the contents of pip.ini are as follows:

[global]
index-url = http://pypi.douban.com/simple

[install]
trusted-host = pypi.douban.com

Then open a cmd window in the Anaconda installation directory and pip install each of the dependencies (jieba, keras, tensorflow, pandas, scikit-learn, and so on).

Once everything is installed, you can run the training.

 

The training run serializes two files, corresponding to the preprocess and model members of the wrapper class, and they need different serialization mechanisms: the wrapper holding preprocess is pickled, while the Keras model is saved to an .h5 file with model.save (see CNN_train.py below).

Once serialization succeeds, the model can be loaded back and used for prediction:

Read back the pkl and h5 files saved during training, roughly like restoring the member variables of a Java object, and then call the wrapper class's predict method (see CNN_call.py below).

Run a quick test (test.py below):

The model now runs end to end and can predict on a passed-in argument. The next step was to call it from the Java project, and this is where I got stuck.

I first looked for material online and found that there are two ways to call Python from Java. The first is Jython. I tried it, but it handles Python projects that pull in third-party packages poorly. I followed this post: https://www.cnblogs.com/nuccch/p/8435693.html and added the paths of the third-party dependencies to the system environment variables in Jython, but it still did not work. Then I realized the deeper problem: my Python project is on 3.8, while the Jython I had downloaded is 2.7.2. I went looking for a Jython 3, but there is essentially nothing out there; the only Jython 3 project on GitHub seemed to have no users, so I gave up on the first approach. The second approach is the crude but simple Runtime.getRuntime().exec(), and that is what I ended up using.

First, the code (CallPythonNLP.java, listed in full under Related code below):

The custom StreamGobbler class here starts a thread that reads the process output. Note that the reading has to happen in a separate thread, otherwise you never get the return value (the child process can block once its output buffer fills up). A sketch of such a class follows.
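The original post does not show the StreamGobbler source. Below is a minimal sketch of what such a helper might look like; only the class name and constructor arguments are taken from the call in CallPythonNLP.java, everything else is my own assumption:

StreamGobbler.java
package com.simon.python;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Drains one stream of a child process on its own thread, so the process
// cannot block on a full output buffer, and prints each line with a prefix.
public class StreamGobbler extends Thread {
    private final InputStream is;
    private final String type;

    public StreamGobbler(InputStream is, String type) {
        this.is = is;
        this.type = type;
    }

    @Override
    public void run() {
        // The charset may need to be GBK instead of UTF-8 depending on the
        // Windows console and the encoding of the Python script's output.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(type + "> " + line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In the same spirit it is worth draining process.getErrorStream() with a second gobbler, otherwise a chatty stderr (for example Keras/TensorFlow warnings) can fill its buffer and stall the child process too.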

Earlier I had put the code in a subdirectory instead of directly under the project root, and as a result the saved model could not be loaded back: unpickling clf_preprocess.pkl has to re-import the TextClassification class, because pickle stores the module path of the object's class rather than the class itself, and when the script ran from a different directory that module could not be found on the import path, even though it sat in a subdirectory of the project (that is my understanding of it, anyway).

The original layout had CNN_train and CNN_call under a CNN subdirectory, so loading the pkl file kept failing.

After moving the code up into the project directory and dropping the subdirectory, the call goes through directly:

The result of the Java call (the predicted label and its probability are printed to the console):

Related code:

DataPreprocess.py
# coding=UTF-8
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

jieba.setLogLevel('WARN')


class DataPreprocess():
    def __init__(self, tokenizer=None,
                 label_set=None):
        self.tokenizer = tokenizer
        self.num_words = None
        self.label_set = label_set
        self.sentence_len = None
        self.word_len = None

    def cut_texts(self, texts=None, word_len=1):
        """
        对文本分词
        :param texts: 文本列表
        :param word_len: 保留最短长度的词语
        :return:
        """
        if word_len > 1:
            texts_cut = [[word for word in jieba.lcut(text) if len(word) >= word_len] for text in texts]
        else:
            texts_cut = [jieba.lcut(one_text) for one_text in texts]

        self.word_len = word_len

        return texts_cut

    def train_tokenizer(self,
                        texts_cut=None,
                        num_words=2000):
        """
        生成编码字典
        :param texts_cut: 分词的列表
        :param num_words: 字典按词频从高到低保留数量
        :return:
        """
        tokenizer = Tokenizer(num_words=num_words)
        tokenizer.fit_on_texts(texts=texts_cut)
        num_words = min(num_words, len(tokenizer.word_index) + 1)
        self.tokenizer = tokenizer
        self.num_words = num_words

    def text2seq(self,
                 texts_cut,
                 sentence_len=30):
        """
        文本转序列,用于神经网络的ebedding层输入。
        :param texts_cut: 分词后的文本列表
        :param sentence_len: 文本转序列保留长度
        :return:sequence list
        """
        tokenizer = self.tokenizer
        texts_seq = tokenizer.texts_to_sequences(texts=texts_cut)
        del texts_cut

        texts_pad_seq = pad_sequences(texts_seq,
                                      maxlen=sentence_len,
                                      padding='post',
                                      truncating='post')
        self.sentence_len = sentence_len
        return texts_pad_seq

    def creat_label_set(self, labels):
        '''
        Collect the label set, used for one-hot encoding.
        :param labels: original labels
        :return:
        '''
        label_set = set()
        for i in labels:
            label_set = label_set.union(set(i))

        self.label_set = np.array(list(label_set))

    def creat_label(self, label):
        '''
        Build the one-hot encoding of a single label.
        :param label: original label
        :return: one-hot array of the label
        eg. creat_label(label=data_valid_accusations[12], label_set=accusations_set)
        '''
        label_set = self.label_set
        label_zero = np.zeros(len(label_set))
        label_zero[np.in1d(label_set, label)] = 1
        return label_zero

    def creat_labels(self, labels=None):
        '''
        Call creat_label over the label list to build a 2-D one-hot array.
        :param labels: original labels
        :return:
        '''
        label_set = self.label_set
        labels_one_hot = [self.creat_label(label) for label in labels]

        return np.array(labels_one_hot)

net.py
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import Conv1D, GlobalMaxPool1D, Dropout


def CNN(input_dim,
        input_length,
        vec_size,
        output_shape,
        output_type='multiple'):
    '''
    Create a CNN: Embedding + Conv1D + GlobalMaxPool1D + Dense.
    You can change the filters and dropout rate in the code.

    :param input_dim: size of the vocabulary
    :param input_length: length of the input sequences
    :param vec_size: dimension of the dense embedding
    :param output_shape: target shape, the target should be one-hot encoded
    :param output_type: last layer type, multiple (activation="sigmoid") or single (activation="softmax")
    :return: keras model
    '''
    data_input = Input(shape=[input_length])
    word_vec = Embedding(input_dim=input_dim + 1,
                         input_length=input_length,
                         output_dim=vec_size)(data_input)
    x = Conv1D(filters=128,
               kernel_size=[3],
               strides=1,
               padding='same',
               activation='relu')(word_vec)
    x = GlobalMaxPool1D()(x)
    x = Dense(500, activation='relu')(x)
    x = Dropout(0.1)(x)
    if output_type == 'multiple':
        x = Dense(output_shape, activation='sigmoid')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    elif output_type == 'single':
        x = Dense(output_shape, activation='softmax')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    else:
        raise ValueError('output_type should be multiple or single')
    return model


if __name__ == '__main__':
    model = CNN(input_dim=10, input_length=10, vec_size=10, output_shape=10, output_type='multiple')
    model.summary()

TextClassification.py
# coding=UTF-8
from .DataPreprocess import DataPreprocess
from .net import CNN
import numpy as np


class TextClassification():
    def __init__(self):
        self.preprocess = None
        self.model = None

    def get_preprocess(self, texts, labels, word_len=1, num_words=2000, sentence_len=30):
        # data preprocessing
        preprocess = DataPreprocess()

        # tokenize and encode the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        preprocess.train_tokenizer(texts_cut, num_words)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        # build the labels
        preprocess.creat_label_set(labels)
        labels = preprocess.creat_labels(labels)
        self.preprocess = preprocess

        return texts_seq, labels

    def fit(self, texts_seq, texts_labels, output_type, epochs, batch_size, model=None):
        if model is None:
            preprocess = self.preprocess
            model = CNN(preprocess.num_words,
                        preprocess.sentence_len,
                        128,
                        len(preprocess.label_set),
                        output_type)
        # train the network
        model.fit(texts_seq,
                  texts_labels,
                  epochs=epochs,
                  batch_size=batch_size)
        self.model = model

    def predict(self, texts):
        preprocess = self.preprocess
        word_len = preprocess.word_len
        sentence_len = preprocess.sentence_len

        # tokenize and encode the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        return self.model.predict(texts_seq)

    def label2toptag(self, predictions, labelset):
        labels = []
        for prediction in predictions:
            label = labelset[prediction == prediction.max()]
            labels.append(label.tolist())
        return labels

    def label2half(self, predictions, labelset):
        labels = []
        for prediction in predictions:
            label = labelset[prediction > 0.5]
            labels.append(label.tolist())
        return labels

    def label2tag(self, predictions, labelset):
        labels1 = self.label2toptag(predictions, labelset)
        labels2 = self.label2half(predictions, labelset)
        labels = []
        for i in range(len(predictions)):
            if len(labels2[i]) == 0:
                labels.append(labels1[i])
            else:
                labels.append(labels2[i])
        return labels

CNN_call.py
#coding:utf-8
import pickle
import numpy as np
import keras

class CNN_call():
    def callPredict(self,str):
        # load the saved preprocess wrapper (it was pickled before the model was trained)
        with open('dump/clf_preprocess.pkl', 'rb') as clf_preprocess:
            clf = pickle.load(clf_preprocess)

        model = keras.models.load_model('dump/clf_model.h5')
        clf.model = model

        x_test = [str]
        predict_rate = clf.predict(x_test)
        max_rate = np.max(predict_rate)

        for i in predict_rate:
            # index of the largest element in i
            predict_lable = clf.preprocess.label_set[i.argmax()]
        return predict_lable,max_rate

CNN_train.py
#coding:utf-8
from TextClassification import TextClassification
from sklearn.model_selection import train_test_split
import pandas as pd
import os

import pickle

# load the data and train the model
def importData(data_type,data_name):

    projectpath = os.path.dirname(os.path.abspath(__file__))


    data = pd.read_csv(projectpath + "/Data/"+data_name, encoding='utf8')
    x = data['evaluation']
    y = [[i] for i in data['label']]

    # split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.01, random_state=1)
    clf = TextClassification()

    # training happens below
    texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,word_len=1,num_words=2000,sentence_len=50)

    # clf has two members, model and preprocess; the model does not exist yet, so first pickle the clf that only carries preprocess
    with open('./dump/clf_preprocess.pkl', 'wb') as f:
        pickle.dump(clf, f)

    clf.fit(texts_seq=texts_seq,texts_labels=texts_labels,output_type=data_type,epochs=10,batch_size=64,model=None)

    # the model exists now; serialize only the model (Keras h5 format)
    clf.model.save('./dump/clf_model.h5')

    print(data_name + " has been trained")

if __name__ == '__main__':
    data_type = 'single'
    data_name = 'corpus.csv'
    importData(data_type,data_name)

javaCallPython.py
#coding:utf-8
import pickle
import keras
import numpy as np
from sys import argv

def callPre(str):

    # load the saved preprocess wrapper and model
    # (raw strings keep the Windows backslashes from being read as escape sequences)
    pklPath = r'E:\ideaSpace\xxx\dump\clf_preprocess.pkl'
    modelPath = r'E:\ideaSpace\xxx\dump\clf_model.h5'

    with open(pklPath, 'rb') as clf_preprocess:
        try:
            clf = pickle.load(clf_preprocess)
        except KeyboardInterrupt:
            print("quit")
        except Exception as ex:
            print("出现如下异常%s" % ex)
    model = keras.models.load_model(modelPath)
    clf.model = model
    x_test = [str]
    predict_rate = clf.predict(x_test)
    max_rate = np.max(predict_rate)

    for i in predict_rate:
        # index of the largest element in i
        predict_lable = clf.preprocess.label_set[i.argmax()]

    print(predict_lable)
    print(max_rate)

if __name__ == '__main__':
    str = argv[1]
    callPre(str)

test.py
#coding:utf-8
from CNN_call import CNN_call

str = "真好"
predict_lable,max_rate = CNN_call().callPredict(str)

print(str)
print(predict_lable)
print(max_rate)

CallPythonNLP.java
package com.simon.python;
import java.io.*;

public class CallPythonNLP {
    public static void main(String[] args) throws IOException, InterruptedException {
//      path to the python interpreter
        String exe = "python";
        String command = "E:\\ideaSpace\\xxx\\javaCallPython.py";

        String num1 = "还行";
        String[] cmdArr = new String[] {exe, command, num1};

        Process process = Runtime.getRuntime().exec(cmdArr);
        StreamGobbler outputGobbler = new StreamGobbler(process.getInputStream(), "Output");
        outputGobbler.start();
        process.waitFor();
    }


}

 


Copyright notice: this is an original article by Simon_09010817, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.