- References: pyAudioAnalysis, openSMILE, and 《语音信号处理实验教程》 (Speech Signal Processing Lab Tutorial, with MATLAB source code)
- Introduction to Audio Analysis – A MATLAB Approach
- Complete test file
- Note: code below that is not in genFeatures.py can be found in pyAudioAnalysis.audioFeatureExtraction
1. Zero-crossing rate
The number of times the signal crosses zero within each frame; it reflects the frequency content.
Z_n = \frac{1}{2}\sum_{m=0}^{N-1}|sgn[x_n(m)] - sgn[x_n(m-1)]|
import numpy as np

eps = np.finfo(np.float64).eps  # small constant, used by later functions to avoid division by zero

def stZCR(frame):
    # compute the zero crossing rate of a frame
    count = len(frame)
    count_z = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_z) / np.float64(count - 1)
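As a quick sanity check of the formula, a standalone example on a synthetic tone (the sampling rate and frequency are made-up values); a pure tone crosses zero twice per cycle, so the rate should be close to 2f/sr:

```python
import numpy as np

sr = 8000                              # sampling rate in Hz (made-up value)
f = 101                                # tone frequency in Hz (made-up value)
t = np.arange(sr) / sr                 # one second of samples
frame = np.sin(2 * np.pi * f * t)

# same computation as stZCR, inlined
zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / 2 / (len(frame) - 1)
print(zcr, 2 * f / sr)                 # the two values should nearly agree
```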
2. Energy
Short-time energy: the sum of squares of each frame (normalized by frame length in the code below), reflecting the signal's strength.
E_n = \sum_{m=0}^{N-1}x_n^2(m)
import numpy as np

def stEnergy(frame):
    return np.sum(frame ** 2) / np.float64(len(frame))
2.1 Shimmer, decibel form
Shimmer(dB) = \frac{1}{N-1}\sum_{i=1}^{N-1}|20\log_{10}(A_{i+1}/A_i)|
import numpy as np

def stShimmerDB(frame):
    '''
    Amplitude shimmer: variability of the peak-to-peak
    amplitude, expressed in decibels [3]
    '''
    count = len(frame)
    sigma = 0
    for i in range(count - 1):
        sigma += np.abs(20 * np.log10(np.abs(frame[i + 1] / (frame[i] + eps))))
    return np.float64(sigma) / np.float64(count - 1)
2.2 Shimmer, relative form
Shimmer(relative) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}|A_i-A_{i+1}|}{\frac{1}{N}\sum_{i=1}^{N}A_i}
def stShimmerRelative(frame):
    '''
    Relative shimmer: the average absolute difference between the amplitudes
    of consecutive periods, divided by the average amplitude,
    expressed as a percentage [3]
    '''
    count = len(frame)
    sigma_diff = 0
    sigma_sum = 0
    for i in range(count):
        if i < count - 1:
            sigma_diff += np.abs(np.abs(frame[i]) - np.abs(frame[i + 1]))
        sigma_sum += np.abs(frame[i])
    return np.float64(sigma_diff / (count - 1)) / np.float64(sigma_sum / count + eps)
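A standalone worked example of both shimmer definitions on a made-up sequence of period peak amplitudes (alternating 1.0 and 1.1), re-deriving the computations inline:

```python
import numpy as np

A = np.array([1.0, 1.1] * 5)           # 10 consecutive peak amplitudes (made up)
shimmer_db = np.mean(np.abs(20 * np.log10(A[1:] / A[:-1])))
shimmer_rel = np.mean(np.abs(np.diff(A))) / np.mean(A)
print(shimmer_db)    # every ratio is 1.1 or 1/1.1, so this is |20*log10(1.1)|
print(shimmer_rel)   # 0.1 / 1.05
```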
3. Intensity / loudness
- intensity: mean of squared input values, weighted by a Hamming window
Intensity and loudness are corresponding concepts; see the openSMILE source.
intensity = \frac{\sum_{m=0}^{N-1}hamWin[m]\,x_n^2(m)}{\sum_{m=0}^{N-1}hamWin[m]}
loudness = \left(\frac{intensity}{I_0}\right)^{0.3},\quad I_0 = 1\times10^{-12}
###################
##
## from openSMILE
##
###################

def stIntensity(frame):
    '''
    Intensity: like stEnergy, but weighted by a Hamming window.
    Loudness follows the openSMILE formula above.
    '''
    fn = len(frame)
    hamWin = np.hamming(fn)
    winSum = np.sum(hamWin)
    if winSum <= 0.0:
        winSum = 1.0
    I0 = 1e-12  # reference intensity
    Im = 0
    for i in range(fn):
        Im += hamWin[i] * frame[i] ** 2  # accumulate windowed energy
    intensity = Im / winSum
    loudness = (intensity / I0) ** 0.3
    return intensity, loudness
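A quick standalone check of the intensity formula: for a constant-amplitude signal the Hamming weights cancel, so the intensity equals the constant power:

```python
import numpy as np

frame = np.full(256, 0.5)              # constant amplitude 0.5 -> power 0.25
w = np.hamming(len(frame))
intensity = np.sum(w * frame ** 2) / np.sum(w)
print(intensity)   # ≈ 0.25
```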
4. Fundamental frequency (pitch)
Methods for estimating the fundamental frequency include the cepstrum method, short-time autocorrelation, and linear prediction. This post uses short-time autocorrelation.
1) Preprocessing for pitch detection: endpoint detection
Since the beginning and end of an utterance are not periodic, endpoint detection is applied first to improve pitch-detection accuracy. The spectral-entropy method is used here.
Let x(i) be the time-domain waveform, x_n(m) the n-th frame after windowing and framing, and X_n(k) its FFT, where k indexes the spectral lines. The short-time energy of the frame in the frequency domain is
E_n = \sum_{k=0}^{N/2}X_n(k)X_n^*(k)
where N is the FFT length and only positive frequencies are kept.
The energy spectrum of spectral line k is Y_n(k) = X_n(k)X_n^*(k), so the normalized spectral probability density of each frequency component is
p_n(k) = \frac{Y_n(k)}{E_n} = \frac{Y_n(k)}{\sum_{l=0}^{N/2}Y_n(l)}
The short-time spectral entropy of the frame is then
H_n = -\sum_{k=0}^{N/2}p_n(k)\,\ln p_n(k)
H_n, the spectral entropy of frame n, measures how the spectral energy is distributed. It is fairly robust across noise levels, but the amplitude of each individual spectral point is easily corrupted by noise, which hurts endpoint detection.
Within speech regions the spectral entropy is smaller than in noise regions, so the ratio of energy to spectral entropy is larger in speech than in noise. This ratio is called the energy-entropy ratio EEF_n:
EEF_n = \sqrt{1+|E_n/H_n|}
A threshold T1 is set; frames where EEF_n is not less than T1 are treated as speech.
2) Pitch detection by autocorrelation
To reduce interference from formants, pitch detection is usually restricted to 60 Hz to 500 Hz, corresponding to lags of f_s/60 and f_s/500 samples.
The frame is autocorrelated with itself; the autocorrelation is maximal when the lag equals the pitch period. Pitch detection therefore amounts to autocorrelating the signal and finding the lag (in samples) of the maximum.
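The pitch code below relies on a frame-splitting helper `enframe(x, win, step)` that is not shown in the original post; a minimal sketch (one frame per row, so callers transpose with `.T`):

```python
import numpy as np

def enframe(x, win, step):
    """Split signal x into frames of length win with hop size step (frames as rows)."""
    fn = (len(x) - win) // step + 1                       # number of full frames
    idx = step * np.arange(fn)[:, None] + np.arange(win)  # index matrix, one row per frame
    return x[idx]
```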
####################################
##
## calculate pitch with correlation
## calPitch() JitterAbsolute() JitterRelative()
##
####################################

class voiceSegment:
    def __init__(self, in1=0, in2=0, duratioin=0):
        self.begin = in1
        self.end = in2
        self.duratioin = duratioin
def pitch_vad(x, win, step, T1, miniL):
    # endpoint detection via the energy-entropy ratio
    y = enframe(x, win, step).T
    fn = len(y[0, :])
    Esum = []  # energy of each frame
    H = []     # spectral entropy of each frame
    for i in range(fn):
        Sp = np.abs(np.fft.fft(y[:, i]))
        Sp = Sp[:int(win / 2)]            # keep positive frequencies
        Esum.append(np.sum(Sp ** 2))      # energy
        prob = Sp / np.sum(Sp)            # spectral probability density
        H.append(-np.sum(prob * np.log(prob + eps)))
    H = np.array(H)
    hindex = np.where(H < 0.1)
    H[hindex] = np.max(H)
    Ef = np.sqrt(1 + np.abs(Esum / H))    # energy-entropy ratio
    Ef = Ef / np.max(Ef)
    zindex = np.where(Ef >= T1)           # frames whose Ef exceeds T1
    zseg = findSegment(zindex)            # per-segment endpoint information
    zsl = len(zseg)                       # number of segments
    SF = np.zeros(fn)
    voiceseg = []
    for k in range(zsl):
        if zseg[k].duratioin >= miniL:
            in1 = zseg[k].begin
            in2 = zseg[k].end
            voiceseg.append(zseg[k])
            SF[in1:in2] = 1
    vosl = len(voiceseg)                  # number of voiced segments
    return voiceseg, vosl, SF, Ef
def findSegment(express):
    # convert a tuple of frame indices (from np.where) into a list of contiguous segments
    voiceIndex = np.array(express).flatten()
    soundSegment = []
    k = 0
    soundSegment.append(voiceSegment(voiceIndex[0]))
    for i in range(len(voiceIndex) - 1):
        if voiceIndex[i + 1] - voiceIndex[i] > 1:   # a gap starts a new segment
            soundSegment[k].end = voiceIndex[i]
            soundSegment.append(voiceSegment(voiceIndex[i + 1]))
            k = k + 1
    soundSegment[k].end = voiceIndex[-1]
    for i in range(k + 1):
        soundSegment[i].duratioin = soundSegment[i].end - soundSegment[i].begin + 1
    return soundSegment
def pitch_Corr(x, win, step, T1, sr, miniL=10):
    win = int(win)
    step = int(step)
    vseg, vsl, SF, Ef = pitch_vad(x, win, step, T1, miniL)
    y = enframe(x, win, step).T
    fn = len(SF)
    lmin = int(sr / 500)   # shortest lag to search (500 Hz)
    lmax = int(sr / 27.5)  # longest lag to search (27.5 Hz)
    period = np.zeros(fn)
    for i in range(vsl):
        ixb = vseg[i].begin
        ixe = vseg[i].end
        ixd = vseg[i].duratioin
        for k in range(ixd):
            u = y[:, k + ixb]
            ru = np.correlate(u, u, mode='full')
            ru = ru[win - 1:]                # keep non-negative lags
            tloc = np.argmax(ru[lmin:lmax])  # lag of the autocorrelation peak
            period[k + ixb] = lmin + tloc - 1
    return vseg, vsl, SF, Ef, period
def calPitch(y, win, step, sr):
    '''
    calculate pitch
    :param y: data of the wav file
    :param win: window length in samples
    :param step: hop size in samples
    :param sr: sampling rate of the wav file
    :return: pitch in Hz, period in samples
    '''
    T1 = 0.05
    voiceseg, vosl, SF, Ef, period = pitch_Corr(y, win, step, T1, sr)
    pitch = sr / (period + eps)        # period (samples) -> pitch (Hz)
    pindex = np.where(pitch > 5000)    # unvoiced frames (period 0) -> pitch 0
    pitch[pindex] = 0
    return pitch, period
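A standalone illustration of the autocorrelation idea behind pitch_Corr, on a synthetic tone (sampling rate and pitch are made-up values): the autocorrelation peaks at a lag equal to the pitch period.

```python
import numpy as np

sr, f0 = 8000, 200                     # 200 Hz tone -> period of 40 samples
t = np.arange(400) / sr
u = np.sin(2 * np.pi * f0 * t)

ru = np.correlate(u, u, mode='full')[len(u) - 1:]  # non-negative lags
lmin, lmax = sr // 500, sr // 60       # search the 60-500 Hz range from the text
lag = lmin + np.argmax(ru[lmin:lmax])
print(sr / lag)                        # estimated pitch in Hz, close to 200
```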
4.1 Jitter, absolute form
Jitter(absolute) = \frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i+1}|
def JitterAbsolute(pitch):
    period = 1 / (pitch + eps)
    pindex = np.where(period > 5000)
    period[pindex] = 0                 # unvoiced frames contribute nothing
    n = len(period)
    sigma = 0
    for i in range(n - 1):
        sigma += np.abs(period[i] - period[i + 1])
    jitter_absolute = sigma / (n - 1)
    return jitter_absolute
4.2 Jitter, relative form
Jitter(relative) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i+1}|}{\frac{1}{N}\sum_{i=1}^{N}T_i}
def JitterRelative(pitch):
    period = 1 / (pitch + eps)
    pindex = np.where(period > 5000)
    period[pindex] = 0
    n = len(period)
    jitter_relative = JitterAbsolute(pitch) / (np.sum(period) / n)
    return jitter_relative
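A quick standalone check of the two jitter definitions on a made-up sequence of pitch periods (in seconds):

```python
import numpy as np

T = np.array([0.0100, 0.0102, 0.0099, 0.0101, 0.0100])  # made-up periods (s)
jitter_abs = np.mean(np.abs(np.diff(T)))
jitter_rel = jitter_abs / np.mean(T)
print(jitter_abs, jitter_rel)
```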
4.3 Harmonics-to-noise ratio (HNR)
HNR = 10\log_{10}\left(\frac{ACF(T_0)}{ACF(0)-ACF(T_0)}\right)
where T_0 is the pitch period.
def stHNR(frame, period):
    '''
    harmonics-to-noise ratio 谐噪比
    HNR = 10 * log10( ACF(T0) / (ACF(0) - ACF(T0)) )
    frame: a frame of signal
    period: pitch period (in samples) of the given frame
    return: HNR in dB
    '''
    period = int(period)
    if period == 0:
        return 0  # when the pitch period is zero (unvoiced), return zero
    ru = np.correlate(frame, frame, mode='full')
    win = len(frame)
    ru = ru[win - 1:]  # keep non-negative lags
    HNR = 10 * np.log10(ru[period] / (ru[0] - ru[period]))
    return HNR
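A standalone check of the HNR formula on a synthetic tone with weak additive noise (all values made up); a mostly periodic frame should give a clearly positive HNR:

```python
import numpy as np

rng = np.random.default_rng(0)
sr, f0 = 8000, 200
t = np.arange(800) / sr
frame = np.sin(2 * np.pi * f0 * t) + 0.05 * rng.standard_normal(len(t))

ru = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # non-negative lags
T0 = sr // f0                          # pitch period = 40 samples
hnr = 10 * np.log10(ru[T0] / (ru[0] - ru[T0]))
print(hnr)                             # positive dB value for a mostly periodic frame
```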
5. Formants
Formants can be estimated by the cepstrum method or via LPC. The LPC approach uses the FFT to evaluate the power-spectrum magnitude response at arbitrary frequencies and locates the formants in that response; the two common variants are parabolic interpolation and finding the complex roots of the LPC polynomial.
This post uses parabolic interpolation.
The vocal tract can be viewed as a tube of non-uniform cross-section that acts as a resonator during phonation. When the quasi-periodic excitation enters the vocal tract, it excites resonances, producing a set of resonance frequencies called formant frequencies.
The speech-production model collapses the combined effects of radiation, the vocal tract, and the glottal excitation into a single time-varying digital filter with transfer function
H(z) = \frac{S(z)}{U(z)} = \frac{G}{1-\sum_{i=1}^{p}a_iz^{-i}}
This is the p-th-order linear prediction model, an all-pole model.
Substituting z^{-1} = \exp(-j2\pi f/f_s), the power spectrum P(f) becomes
P(f) = |H(f)|^2 = \frac{G^2}{|1-\sum_{i=1}^{p}a_i\exp(-j2\pi if/f_s)|^2}
For the parabolic interpolation method, see 《语音信号处理实验教程》 pp. 105-107.
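The interpolation step below fits a parabola through a spectral peak sample and its two neighbors, then reads off the vertex. A standalone check with a known parabola:

```python
def parabola_vertex(p1, p0, p2):
    """Vertex of the parabola through (-1, p1), (0, p0), (1, p2)."""
    aa = (p1 + p2) / 2 - p0
    bb = (p2 - p1) / 2
    dm = -bb / (2 * aa)            # x-offset of the vertex from the middle point
    pp = -bb ** 2 / (4 * aa) + p0  # value at the vertex
    return dm, pp

# Sample y = -(x - 0.3)^2 + 1 at x = -1, 0, 1; the vertex is (0.3, 1.0).
f = lambda x: -(x - 0.3) ** 2 + 1
dm, pp = parabola_vertex(f(-1), f(0), f(1))
print(dm, pp)   # ≈ 0.3, 1.0
```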
#############################################
##
## calculate formant frequency and bandwidth
##
#############################################

from audiolazy import lpc  # assumed source of lpc.autocor / .numdict; adjust to your LPC library
from scipy import signal

def stFormant(u, sr, p=12):
    '''
    u: one frame of signal
    sr: sampling rate
    p: LPC order
    return
    [1] formant frequency array F
    [2] formant bandwidth array Bw
    '''
    ### calculate LPC coefficients
    a_filter = lpc.autocor(u, p)
    a_filter_num = a_filter.numdict
    i = 0
    a = []
    for k, v in a_filter_num.items():  # expand the sparse coefficient dict into an array
        if i != k:
            while i != k:
                a.append(0)
                i = i + 1
        a.append(v)
        i = i + 1
    a = np.array(a)
    U = lpcar2pf(a, 255)               # spectrum curve from the LPC coefficients
    df = sr / 512                      # frequency resolution
    Loc, Mdict = signal.find_peaks(U)  # find peaks in U
    nFormant = len(Loc)
    F = np.zeros(nFormant)             # formant frequencies
    Bw = np.zeros(nFormant)            # formant bandwidths
    # parabolic interpolation around each peak
    i = 0
    for m in Loc:
        m1 = m - 1
        m2 = m + 1
        p0 = U[m]
        p1 = U[m1]
        p2 = U[m2]
        aa = (p1 + p2) / 2 - p0
        bb = (p2 - p1) / 2
        cc = p0
        dm = -bb / (2 * aa)            # frequency offset of the true maximum
        pp = -bb ** 2 / (4 * aa) + cc  # power spectrum at the center frequency
        bf = -np.sqrt(bb ** 2 - 4 * aa * (cc - 0.5 * pp)) / (2 * aa)  # half-power width on the x axis
        F[i] = (m + dm) * df
        Bw[i] = 2 * bf * df
        i = i + 1
    return F, Bw
def lpcar2pf(ar, npoints):
    '''
    ar: LPC coefficients
    npoints: spectrum range (the result has 2*npoints+2 FFT bins)
    return: spectrum curve
    '''
    return np.abs(np.fft.rfft(ar, 2 * npoints + 2)) ** (-2)
def pre_emphasis(y, coefficient=0.99):
    '''
    y: original signal
    coefficient: pre-emphasis coefficient
    '''
    return np.append(y[0], y[1:] - coefficient * y[:-1])
6. Entropy of Energy
Similar to spectral entropy, but it describes the time-domain distribution of the signal, reflecting its continuity; it is also a short-time feature. Within frame i, the signal is divided into K sub-blocks.
The ratio of the j-th sub-block's energy to the total frame energy is e_j = \frac{E_{subFrame_j}}{E_{shortFrame_i}}, where E_{shortFrame_i} = \sum_{k=1}^{K}E_{subFrame_k} is the total energy of the sub-blocks.
The energy entropy of the frame is then
H_i = -\sum_{j=1}^{K} e_j \log_2(e_j)
import numpy

def stEnergyEntropy(frame, n_short_blocks=10):
    """Computes entropy of energy"""
    Eol = numpy.sum(frame ** 2)  # total frame energy
    L = len(frame)
    sub_win_len = int(numpy.floor(L / n_short_blocks))
    if L != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]
    # sub_wins is of size [sub_win_len x n_short_blocks]
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()
    # Compute normalized sub-frame energies:
    s = numpy.sum(sub_wins ** 2, axis=0) / (Eol + eps)
    # Compute entropy of the normalized sub-frame energies:
    Entropy = -numpy.sum(s * numpy.log2(s + eps))
    return Entropy
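A standalone check of the energy-entropy computation: a perfectly uniform frame spreads its energy evenly over the K sub-blocks, giving the maximum possible entropy log2(K):

```python
import numpy as np

eps = np.finfo(float).eps
K = 10
frame = np.ones(1000)                          # uniform energy across the frame
sub_wins = frame.reshape(len(frame) // K, K, order='F')
s = np.sum(sub_wins ** 2, axis=0) / (np.sum(frame ** 2) + eps)
entropy = -np.sum(s * np.log2(s + eps))
print(entropy, np.log2(K))                     # both ≈ 3.32
```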
7. Spectral Centroid
Also called the first spectral moment. A smaller centroid means more spectral energy is concentrated at low frequencies; for example, voice usually has a lower spectral centroid than music. The spectral centroid C_i of frame i is
C_i = \frac{\sum_{k=1}^{f_n}kX_i(k)}{\sum_{k=1}^{f_n}X_i(k)}
where X_i(k) is the k-th spectral line of frame i and f_n is the frame length.
def stSpectralCentroidAndSpread(X, fs):
    """Computes spectral centroid and spread of a frame (given abs(FFT))"""
    ind = (numpy.arange(1, len(X) + 1)) * (fs / (2.0 * len(X)))
    Xt = X.copy()
    Xt = Xt / Xt.max()
    NUM = numpy.sum(ind * Xt)
    DEN = numpy.sum(Xt) + eps
    # Centroid:
    C = (NUM / DEN)
    # Spread:
    S = numpy.sqrt(numpy.sum(((ind - C) ** 2) * Xt) / DEN)
    # Normalize:
    C = C / (fs / 2.0)
    S = S / (fs / 2.0)
    return (C, S)
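A standalone check of the centroid formula: a flat magnitude spectrum puts the normalized centroid at the middle of the frequency axis, i.e. about 0.5:

```python
import numpy as np

eps = np.finfo(float).eps
fs, n = 8000, 256
X = np.ones(n)                                 # flat abs(FFT) magnitudes
ind = np.arange(1, n + 1) * (fs / (2.0 * n))   # bin center frequencies
C = np.sum(ind * X) / (np.sum(X) + eps)
C_norm = C / (fs / 2.0)
print(C_norm)                                  # ≈ 0.5
```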
8. Spectral Spread
Also called the second central spectral moment; it describes how the spectrum is distributed around its centroid. The spectral spread S_i of frame i is
S_i = \sqrt{\frac{\sum_{k=1}^{f_n}(k-C_i)^2X_i(k)}{\sum_{k=1}^{f_n}X_i(k)}}
See the code of 7. Spectral Centroid.
9. Spectral Entropy
By the properties of entropy, a more uniform distribution gives higher entropy, so spectral entropy reflects how evenly each frame's spectrum is distributed: a speaker's spectrum is uneven because of formants, while white noise has a much flatter spectrum. One application of this is VAD.
Like the entropy of energy, the spectral entropy divides the FFT spectrum into K sub-blocks, computes each sub-block's share of the total spectral energy, and sums; this gives the spectral entropy of frame i.
The share of the k-th sub-block: n_k = \frac{E_k}{\sum_{l=1}^{K}E_l}
The spectral entropy of frame i:
H_i = -\sum_{k=1}^{K}n_k\log_2(n_k)
def stSpectralEntropy(X, n_short_blocks=10):
    """Computes the spectral entropy"""
    L = len(X)                # number of frame samples
    Eol = numpy.sum(X ** 2)   # total spectral energy
    sub_win_len = int(numpy.floor(L / n_short_blocks))  # length of sub-frame
    if L != sub_win_len * n_short_blocks:
        X = X[0:sub_win_len * n_short_blocks]
    # define sub-frames (using matrix reshape)
    sub_wins = X.reshape(sub_win_len, n_short_blocks, order='F').copy()
    # compute spectral sub-energies
    s = numpy.sum(sub_wins ** 2, axis=0) / (Eol + eps)
    # compute spectral entropy
    En = -numpy.sum(s * numpy.log2(s + eps))
    return En
10. Spectral Flux
Describes how the spectrum changes between adjacent frames: after normalizing the two spectra, it is the sum of their squared differences.
Fl_{(i,i-1)} = \sum_{k=1}^{f_n}(EN_i(k)-EN_{i-1}(k))^2
where EN_i(k) = \frac{X_i(k)}{\sum_{l=1}^{f_n}X_i(l)}
and X_i(k) is the k-th spectral line.
def stSpectralFlux(X, X_prev):
    """
    Computes the spectral flux feature of the current frame
    ARGUMENTS:
        X:      the abs(fft) of the current frame
        X_prev: the abs(fft) of the previous frame
    """
    # compute the spectral flux as the sum of square distances:
    sumX = numpy.sum(X + eps)
    sumPrevX = numpy.sum(X_prev + eps)
    F = numpy.sum((X / sumX - X_prev / sumPrevX) ** 2)
    return F
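A standalone check of the flux definition: two identical frames give exactly zero flux, while any spectral change gives a positive value:

```python
import numpy as np

eps = np.finfo(float).eps
X = np.abs(np.fft.rfft(np.random.default_rng(1).standard_normal(512)))
Y = X.copy()
Y[10] *= 2.0                                   # perturb one spectral line

flux_same = np.sum((X / np.sum(X + eps) - X / np.sum(X + eps)) ** 2)
flux_diff = np.sum((Y / np.sum(Y + eps) - X / np.sum(X + eps)) ** 2)
print(flux_same, flux_diff)
```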
11. Spectral Rolloff
Spectral energy is concentrated within a certain frequency range. The roll-off point is the DFT bin below which a fixed percentage of the total spectral energy (usually about 90%) lies; the bin index is then normalized by the FFT length:
\sum_{k=1}^{m}X_i(k) = C\sum_{k=1}^{f_n}X_i(k)
where m is the roll-off bin, f_n is the FFT length, and X_i(k) is the k-th spectral line.
def stSpectralRollOff(X, c, fs):
    """Computes spectral roll-off"""
    totalEnergy = numpy.sum(X ** 2)
    fftLength = len(X)
    Thres = c * totalEnergy
    # Find the spectral rolloff as the frequency position
    # where the respective spectral energy is equal to c*totalEnergy
    CumSum = numpy.cumsum(X ** 2) + eps
    [a, ] = numpy.nonzero(CumSum > Thres)
    if len(a) > 0:
        mC = numpy.float64(a[0]) / (float(fftLength))
    else:
        mC = 0.0
    return (mC)
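A standalone check of the roll-off definition: if all spectral energy sits in the first bin, the cumulative energy exceeds the 90% threshold immediately and the normalized roll-off position is 0:

```python
import numpy as np

eps = np.finfo(float).eps
X = np.zeros(128)
X[0] = 1.0                                     # all energy in the first bin
CumSum = np.cumsum(X ** 2) + eps
a = np.nonzero(CumSum > 0.9 * np.sum(X ** 2))[0]
mC = np.float64(a[0]) / len(X)
print(mC)   # → 0.0
```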
12. MFCCs
Mel-frequency cepstral coefficients are based on human auditory perception: the speech spectrum is analyzed according to the results of human hearing experiments. The mel scale relates to physical frequency by
F_{mel}(f) = 1125\ln(1+f/700)
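A quick check of the mel formula and its inverse (the constants 1125 and 700 are those in the text; on this variant of the scale, 1000 Hz maps to roughly 1000 mel):

```python
import numpy as np

def hz2mel(f):
    return 1125 * np.log(1 + f / 700)

def mel2hz(m):
    return 700 * (np.exp(m / 1125) - 1)

print(hz2mel(1000))            # ≈ 998 mel
print(mel2hz(hz2mel(440)))     # round-trips to 440 Hz
```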
- Mel filter bank
The mel scale amounts to placing a set of band-pass filters H_m(k), 0 \le m \le M, over the speech spectrum, where M (the number of filters) is chosen by the user. Each filter has a triangular response with center frequency f(m), and the filters have equal bandwidth on the mel scale. The transfer function of each band-pass filter is
H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \frac{k-f(m-1)}{f(m)-f(m-1)} & f(m-1)\leqslant k \leqslant f(m) \\ \frac{f(m+1)-k}{f(m+1)-f(m)} & f(m)\leqslant k \leqslant f(m+1) \\ 0 & k > f(m+1) \end{cases}
The center frequencies of the mel filters are
f(m) = \frac{N}{f_s}F_{mel}^{-1}\left(F_{mel}(f_l)+m\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)
where f_h and f_l are the highest and lowest frequencies of the filter bank, f_s is the sampling rate, M is the number of filters, and N is the FFT size.
- MFCC computation
(1) Preprocessing
Pre-emphasis, framing, and windowing give the i-th frame x_i(m).
(2) FFT
X_i(k) = FFT[x_i(m)]
(3) Spectral line energy
E_i(k) = [X_i(k)]^2
(4) Energy through the mel filters
Multiply each spectral line E_i(k) of the frame's energy spectrum by the corresponding filter response H_m(k) and sum. Since there are M mel filters, each frame yields M filter energies:
S_i(m) = \sum_{k=0}^{f_n-1}E_i(k)H_m(k),\quad 0\leqslant m < M
(5) DCT cepstrum
[Author's note: at this step, the Fourier spectrum has been mapped into the mel domain, where the spectral lines correspond to the mel filters.] The DCT is applied to the log energies of the mel spectrum:
mfcc_i(n) = \sqrt{\frac{2}{M}}\sum_{m=0}^{M-1}\log[S_i(m)]\cos\left[\frac{\pi n(2m-1)}{2M}\right],\quad 0\leqslant n < M
def mfccInitFilterBanks(fs, nfft):
    """
    Computes the triangular filterbank for MFCC computation
    (used in the stFeatureExtraction function before the stMFCC function call)
    This function is taken from the scikits.talkbox library (MIT Licence):
    https://pypi.python.org/pypi/scikits.talkbox
    """
    # filter bank params:
    lowfreq = 133.33
    linsc = 200 / 3.
    logsc = 1.0711703
    numLinFiltTotal = 13
    numLogFilt = 27
    if fs < 8000:
        numLogFilt = 5
    # Total number of filters
    nFiltTotal = numLinFiltTotal + numLogFilt
    # Compute frequency points of the triangle:
    freqs = numpy.zeros(nFiltTotal + 2)
    freqs[:numLinFiltTotal] = lowfreq + numpy.arange(numLinFiltTotal) * linsc
    freqs[numLinFiltTotal:] = freqs[numLinFiltTotal - 1] * logsc ** numpy.arange(1, numLogFilt + 3)
    heights = 2. / (freqs[2:] - freqs[0:-2])
    # Compute filterbank coeff (in fft domain, in bins)
    fbank = numpy.zeros((nFiltTotal, nfft))
    nfreqs = numpy.arange(nfft) / (1. * nfft) * fs
    for i in range(nFiltTotal):
        lowTrFreq = freqs[i]
        cenTrFreq = freqs[i + 1]
        highTrFreq = freqs[i + 2]
        lid = numpy.arange(numpy.floor(lowTrFreq * nfft / fs) + 1,
                           numpy.floor(cenTrFreq * nfft / fs) + 1,
                           dtype=int)
        lslope = heights[i] / (cenTrFreq - lowTrFreq)
        rid = numpy.arange(numpy.floor(cenTrFreq * nfft / fs) + 1,
                           numpy.floor(highTrFreq * nfft / fs) + 1,
                           dtype=int)
        rslope = heights[i] / (highTrFreq - cenTrFreq)
        fbank[i][lid] = lslope * (nfreqs[lid] - lowTrFreq)
        fbank[i][rid] = rslope * (highTrFreq - nfreqs[rid])
    return fbank, freqs
from scipy.fftpack import dct

def stMFCC(X, fbank, n_mfcc_feats):
    """
    Computes the MFCCs of a frame, given the fft magnitude
    ARGUMENTS:
        X:     fft magnitude abs(FFT)
        fbank: filter bank (see mfccInitFilterBanks)
    RETURN
        ceps:  MFCCs (13 element vector)
    Note: MFCC calculation is, in general, taken from the
    scikits.talkbox library (MIT Licence),
    with a small number of modifications to make it more
    compact and suitable for the pyAudioAnalysis Lib
    """
    mspec = numpy.log10(numpy.dot(X, fbank.T) + eps)
    ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:n_mfcc_feats]
    return ceps