AI with Python – Speech Recognition

In this chapter, we will learn about speech recognition using AI with Python.

Speech is the most basic means of adult human communication. The basic goal of speech processing is to provide an interaction between a human and a machine.

A speech processing system mainly has three tasks −

  • First, speech recognition that allows the machine to catch the words, phrases and sentences we speak

  • Second, natural language processing to allow the machine to understand what we speak, and

  • Third, speech synthesis to allow the machine to speak.

This chapter focuses on speech recognition, the process of understanding the words that are spoken by human beings. Remember that the speech signals are captured with the help of a microphone and then it has to be understood by the system.

Building a Speech Recognizer

Speech Recognition or Automatic Speech Recognition (ASR) is the center of attention for AI projects like robotics. Without ASR, it is not possible to imagine a cognitive robot interacting with a human. However, it is not quite easy to build a speech recognizer.

Difficulties in developing a speech recognition system

Developing a high-quality speech recognition system is really a difficult problem. The difficulty of speech recognition technology can be broadly characterized along a number of dimensions, as discussed below −

  • Size of the vocabulary − Size of the vocabulary impacts the ease of developing an ASR. Consider the following sizes of vocabulary for a better understanding.

    • A small size vocabulary consists of 2-100 words, for example, as in a voice-menu system

    • A medium size vocabulary consists of several hundreds to thousands of words, for example, as in a database-retrieval task

    • A large size vocabulary consists of several tens of thousands of words, as in a general dictation task.

    Note that the larger the size of the vocabulary, the harder it is to perform recognition.

  • Channel characteristics − Channel quality is also an important dimension. For example, human speech has high bandwidth with a full frequency range, while telephone speech has low bandwidth with a limited frequency range. Note that recognition is harder in the latter.

  • Speaking mode − The ease of developing an ASR also depends on the speaking mode, that is, whether the speech is in isolated-word mode, connected-word mode, or continuous-speech mode. Note that continuous speech is harder to recognize.

  • Speaking style − Read speech may be in a formal style, or spontaneous and conversational with a casual style. The latter is harder to recognize.

  • Speaker dependency − Speech can be speaker dependent, speaker adaptive, or speaker independent. A speaker-independent system is the hardest to build.

  • Type of noise − Noise is another factor to consider while developing an ASR. The signal-to-noise ratio may fall in various ranges, depending on how much background noise the acoustic environment contains (see the sketch after this list) −

    • If the signal-to-noise ratio is greater than 30 dB, it is considered high range

    • If the signal-to-noise ratio lies between 10 dB and 30 dB, it is considered medium SNR

    • If the signal-to-noise ratio is less than 10 dB, it is considered low range

    The type of background noise, such as stationary non-human noise, background speech, or crosstalk from other speakers, also contributes to the difficulty of the problem.

  • Microphone characteristics − The quality of the microphone may be good, average, or below average. Also, the distance between the mouth and the microphone can vary. These factors should also be considered for recognition systems.
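
The SNR thresholds above map directly onto a few lines of code. Here is a minimal sketch (the classify_snr helper and the synthetic test tone are illustrative assumptions, not part of the original tutorial) −

import numpy as np

def classify_snr(signal, noise):
   # SNR in dB is 10 * log10(signal power / noise power)
   snr_db = 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
   if snr_db > 30:
      return snr_db, 'high range'
   elif snr_db >= 10:
      return snr_db, 'medium SNR'
   return snr_db, 'low range'

# Hypothetical test: a pure 440 Hz tone with mild Gaussian background noise
t = np.linspace(0, 1, 44100)
tone = np.sin(2 * np.pi * 440 * t)
noise = 0.01 * np.random.randn(len(t))
print(classify_snr(tone, noise))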

Despite these difficulties, researchers have worked extensively on various aspects of speech, such as understanding the speech signal and the speaker, and identifying accents.

You will have to follow the steps given below to build a speech recognizer −

Visualizing Audio Signals - Reading from a File and Working on it

This is the first step in building a speech recognition system, as it gives an understanding of how an audio signal is structured. Some common steps that can be followed to work with audio signals are as follows −

Recording

Before you can read an audio signal from a file, you first have to record it using a microphone, as in the sketch below.
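
The following is a minimal sketch using the third-party sounddevice package; this package choice and the output filename are assumptions for illustration, since the rest of this chapter works with pre-recorded files and any recording tool (such as PyAudio) would do −

import sounddevice as sd
from scipy.io.wavfile import write

frequency_sampling = 44100   # samples per second (Hz)
duration = 3                 # length of the recording, in seconds

# Record from the default microphone and block until the recording is finished
recording = sd.rec(int(duration * frequency_sampling),
   samplerate=frequency_sampling, channels=1, dtype='int16')
sd.wait()

write('recorded_audio.wav', frequency_sampling, recording)   # illustrative filename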

Sampling

When recording with a microphone, the signals are stored in digitized form. But to work on them, the machine needs them in discrete numeric form. Hence, we should perform sampling at a certain frequency and convert the signal into discrete numerical form. Choosing a high sampling frequency implies that when humans listen to the signal, they perceive it as a continuous audio signal.
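
The mapping from sample index to time is simply t = n / fs, where fs is the sampling frequency. The short sketch below illustrates this with assumed values (a 1 kHz tone sampled at 8 kHz; neither number comes from the tutorial) −

import numpy as np

frequency_sampling = 8000   # assumed sampling rate, in Hz
frequency_tone = 1000       # assumed tone frequency, in Hz

n = np.arange(0, 16)                        # the first 16 sample indices
time_axis = n / float(frequency_sampling)   # sample n occurs at t = n / fs seconds
samples = np.sin(2 * np.pi * frequency_tone * time_axis)

# 8000 / 1000 = 8 samples per cycle; a higher sampling rate places more samples
# in each cycle, so playback sounds smoother and more continuous
print(samples)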

Example

The following example shows a stepwise approach to analyzing, with Python, an audio signal that is stored in a file. The sampling frequency of this audio signal is 44,100 Hz.

Import the necessary packages as shown here −


import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

Now, read the stored audio file. It will return two values: the sampling frequency and the audio signal. Provide the path of the audio file where it is stored, as shown here −


frequency_sampling, audio_signal = wavfile.read("/Users/admin/audio_file.wav")

Display parameters like the sampling frequency of the audio signal, the data type of the signal, and its duration, using the commands shown −


print('\nSignal shape:', audio_signal.shape)
print('Signal Datatype:', audio_signal.dtype)
print('Signal duration:', round(audio_signal.shape[0] / 
float(frequency_sampling), 2), 'seconds')

This step involves normalizing the signal as shown below −


audio_signal = audio_signal / np.power(2, 15)   # scale 16-bit integer samples to the range [-1, 1)

In this step, we are extracting the first 100 values from this signal to visualize. Use the following commands for this purpose −


audio_signal = audio_signal[:100]
time_axis = 1000 * np.arange(0, len(audio_signal), 1) / float(frequency_sampling)

Now, visualize the signal using the commands given below −


plt.plot(time_axis, audio_signal, color='blue')
plt.xlabel('Time (milliseconds)')
plt.ylabel('Amplitude')
plt.title('Input audio signal')
plt.show()

You will be able to see an output graph and the data extracted for the above audio signal, as shown in the image here −

[Figure: Visualizing Audio Signals]

Signal shape: (132300,)
Signal Datatype: int16
Signal duration: 3.0 seconds

Characterizing the Audio Signal: Transforming to Frequency Domain

Characterizing an audio signal involves converting the time-domain signal into the frequency domain and understanding its frequency components. This is an important step because it gives a lot of information about the signal. You can use a mathematical tool like the Fourier Transform to perform this transformation.
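
For reference, the Discrete Fourier Transform that NumPy's np.fft.fft computes for a signal x[n] of N samples is

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N - 1

Frequency bin k corresponds to the physical frequency k * fs / N Hz (fs being the sampling frequency), which is exactly the scaling applied to the x-axis at the end of the following example.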

Example

The following example shows, step by step, how to characterize, with Python, a signal that is stored in a file. Note that here we are using the Fourier Transform to convert it into the frequency domain.

Import the necessary packages, as shown here −


import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

Now, read the stored audio file. It will return two values: the sampling frequency and the audio signal. Provide the path of the audio file where it is stored, as shown in the command here −


frequency_sampling, audio_signal = wavfile.read("/Users/admin/sample.wav")

In this step, we will display parameters like the sampling frequency of the audio signal, the data type of the signal, and its duration, using the commands given below −


print('\nSignal shape:', audio_signal.shape)
print('Signal Datatype:', audio_signal.dtype)
print('Signal duration:', round(audio_signal.shape[0] / 
float(frequency_sampling), 2), 'seconds')

In this step, we need to normalize the signal, as shown in the following command −


audio_signal = audio_signal / np.power(2, 15)   # scale 16-bit integer samples to the range [-1, 1)

This step involves extracting the length and half length of the signal. Use the following commands for this purpose −


length_signal = len(audio_signal)
half_length = np.ceil((length_signal + 1) / 2.0).astype(int)   # np.int is deprecated; use the built-in int

Now, we need to apply mathematical tools to transform into the frequency domain. Here we are using the Fourier Transform.


signal_frequency = np.fft.fft(audio_signal)

Now, normalize the frequency-domain signal and square it −


signal_frequency = abs(signal_frequency[0:half_length]) / length_signal
signal_frequency **= 2

Next, extract the length of the frequency-transformed signal −


len_fts = len(signal_frequency)

Note that the Fourier-transformed signal must be adjusted for the even as well as the odd signal-length case −


if length_signal % 2:
   signal_frequency[1:len_fts] *= 2        # odd length: double all bins except DC
else:
   signal_frequency[1:len_fts-1] *= 2      # even length: also leave the Nyquist bin unscaled

Now, extract the power in decibels (dB) −


signal_power = 10 * np.log10(signal_frequency)

Build the X-axis frequencies in kHz −


x_axis = np.arange(0, half_length, 1) * (frequency_sampling / length_signal) / 1000.0

Now, visualize the characterization of the signal as follows −


plt.figure()
plt.plot(x_axis, signal_power, color='black')
plt.xlabel('Frequency (kHz)')
plt.ylabel('Signal power (dB)')
plt.show()

You can observe the output graph of the above code as shown in the image below −

[Figure: Characterizing the Audio Signal]

Generating Monotone Audio Signal

The two steps that you have seen till now are important for learning about signals. Now, this step will be useful if you want to generate an audio signal with some predefined parameters. Note that this step will save the audio signal in an output file.

Example

In the following example, we are going to generate a monotone signal, using Python, which will be stored in a file. For this, you will have to take the following steps −

Import the necessary packages as shown −


import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write

Provide the name of the file where the output should be saved −


output_file = 'audio_signal_generated.wav'

Now, specify the parameters of your choice, as shown −


duration = 4                 # duration of the tone, in seconds
frequency_sampling = 44100   # sampling rate, in Hz
frequency_tone = 784         # tone frequency, in Hz
min_val = -4 * np.pi
max_val = 4 * np.pi

In this step, we can generate the audio signal, as shown −


t = np.linspace(min_val, max_val, duration * frequency_sampling)
audio_signal = np.sin(2 * np.pi * frequency_tone * t)

Now, save the audio file in the output file −


signal_scaled = np.int16(audio_signal / np.max(np.abs(audio_signal)) * 32767)   # scale to the 16-bit range of a .wav file
write(output_file, frequency_sampling, signal_scaled)

Extract the first 100 values for our graph, as shown −


audio_signal = audio_signal[:100]
time_axis = 1000 * np.arange(0, len(audio_signal), 1) / float(frequency_sampling)

Now, visualize the generated audio signal as follows −


plt.plot(time_axis, audio_signal, color='blue')
plt.xlabel('Time in milliseconds')
plt.ylabel('Amplitude')
plt.title('Generated audio signal')
plt.show()

You can observe the plot as shown in the figure given here −

[Figure: Generated audio signal]

Feature Extraction from Speech

This is the most important step in building a speech recognizer because, after converting the speech signal into the frequency domain, we must convert it into a usable feature-vector form. We can use different feature extraction techniques, such as MFCC, PLP, and PLP-RASTA, for this purpose.

Example

In the following example, we are going to extract features from a signal, step by step, using Python, with the MFCC technique.

Import the necessary packages, as shown here −


import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank

Now, read the stored audio file. It will return two values − the sampling frequency and the audio signal. Provide the path of the audio file where it is stored.


frequency_sampling, audio_signal = wavfile.read("/Users/admin/audio_file.wav")

Note that here we are taking the first 15,000 samples for analysis.


audio_signal = audio_signal[:15000]

Use the MFCC technique by executing the following command to extract the MFCC features −


features_mfcc = mfcc(audio_signal, frequency_sampling)

Now, print the MFCC parameters, as shown −


print('\nMFCC:\nNumber of windows =', features_mfcc.shape[0])
print('Length of each feature =', features_mfcc.shape[1])

Now, plot and visualize the MFCC features using the commands given below −


features_mfcc = features_mfcc.T
plt.matshow(features_mfcc)
plt.title('MFCC')

In this step, we work with the filter bank features as shown −

Extract the filter bank features −


filterbank_features = logfbank(audio_signal, frequency_sampling)

Now, print the filterbank parameters.


print('\nFilter bank:\nNumber of windows =', filterbank_features.shape[0])
print('Length of each feature =', filterbank_features.shape[1])

Now, plot and visualize the filterbank features.


filterbank_features = filterbank_features.T
plt.matshow(filterbank_features)
plt.title('Filter bank')
plt.show()

As a result of the steps above, you can observe the following outputs: Figure 1 for MFCC and Figure 2 for the filter bank.

[Figure 1: MFCC features]
[Figure 2: Filter bank features]

Recognition of Spoken Words

Speech recognition means that when humans are speaking, a machine understands them. Here we are using the Google Speech API in Python to make it happen. We need to install the following packages for this −

  • PyAudio − It can be installed by using the pip install PyAudio command.

  • SpeechRecognition − This package can be installed by using pip install SpeechRecognition.

  • Google-Speech-API − It can be installed by using the command pip install google-api-python-client.

Example

Observe the following example to understand recognition of spoken words −

Import the necessary packages as shown −


import speech_recognition as sr

Create a Recognizer object as shown below −


recording = sr.Recognizer()

Now, the Microphone() module will take the voice as input −


with sr.Microphone() as source:
   recording.adjust_for_ambient_noise(source)   # calibrate for background noise
   print("Please Say something:")
   audio = recording.listen(source)

Now, the Google API will recognize the voice and give the output.


try:
   print("You said: \n" + recording.recognize_google(audio))
except Exception as e:
   print(e)

You can see the following output −


Please Say Something:
You said:

For example, if you said tutorialspoint.com, then the system recognizes it correctly as follows −


tutorialspoint.com

Translated from: https://www.tutorialspoint.com/artificial_intelligence_with_python/artificial_intelligence_with_python_speech_recognition.htm