This post summarizes VAD based on the algorithm in kaldi/src/ivector/voice-activity-detection.cc, together with a few online resources.
First, VAD stands for Voice Activity Detection. It separates the speech portions of a transmitted signal from background noise, and in communication systems it can also distinguish speech from silence segments
to avoid wasting bandwidth. Here we only discuss its use in speaker recognition, where background noise has to be screened out before building the UBM model.
Let's go straight to the Kaldi source; pay attention to the comments.
run.sh calls compute_vad_decision.sh as follows:
Usage: $0 [options] <data-dir> <log-dir> <path-to-vad-dir>
sid/compute_vad_decision.sh --nj 40 --cmd "$train_cmd" \
data/train exp/make_vad $vaddir

Inside compute_vad_decision.sh, the binary that actually does the work is compute-vad:

Usage: compute-vad [options] <feats-rspecifier> <vad-wspecifier>
Its input is the per-job features file; since --nj is 40 above, JOB ranges over 1 to 40, reading the MFCC archive and writing the VAD archive:
compute-vad --config=$vad_config scp:$sdata/JOB/feats.scp ark,scp:$vaddir/vad_${name}.JOB.ark,$vaddir/vad_${name}.JOB.scp
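The --config=$vad_config file holds the four energy-VAD options. As a sketch, the following values mirror the ones used in the demo later in this post (an actual recipe's conf/vad.conf may differ):

--vad-energy-threshold=5.5
--vad-energy-mean-scale=0.5
--vad-frames-context=5
--vad-proportion-threshold=0.6

With the options loaded, the main loop of compute-vad (kaldi/src/ivectorbin/compute-vad.cc) processes each utterance: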
for (; !feat_reader.Done(); feat_reader.Next()) {  // iterate over every utterance
  std::string utt = feat_reader.Key();
  Matrix<BaseFloat> feat(feat_reader.Value());
  if (feat.NumRows() == 0) {
    KALDI_WARN << "Empty feature matrix for utterance " << utt;
    num_err++;
    continue;
  }
  // Declare a vector whose dimension equals the number of frames in this utterance.
  Vector<BaseFloat> vad_result(feat.NumRows());
  // Compute the VAD decision: the option set, the MFCC matrix, and the result
  // vector; see the next source snippet for ComputeVadEnergy.
  ComputeVadEnergy(opts, feat, &vad_result);
  double sum = vad_result.Sum();
  if (sum == 0.0) {
    KALDI_WARN << "No frames were judged voiced for utterance " << utt;
    num_unvoiced++;
  } else {
    num_done++;
  }
  tot_decision += vad_result.Sum();
  tot_length += vad_result.Dim();
  if (!(omit_unvoiced_utts && sum == 0)) {
    vad_writer.Write(utt, vad_result);
  }
}
The function that computes the VAD result lives in kaldi/src/ivector/voice-activity-detection.cc:

#include "ivector/voice-activity-detection.h"
#include "matrix/matrix-functions.h"

namespace kaldi {

void ComputeVadEnergy(const VadEnergyOptions &opts,
                      const MatrixBase<BaseFloat> &feats,
                      Vector<BaseFloat> *output_voiced) {
  // feats is the MFCC feature matrix.
  int32 T = feats.NumRows();
  output_voiced->Resize(T);
  if (T == 0) {
    KALDI_WARN << "Empty features";
    return;
  }
  // A vector of dimension T holding the per-frame log-energy.
  Vector<BaseFloat> log_energy(T);
  // Copy column 0 of feats into log_energy.
  log_energy.CopyColFromMat(feats, 0);  // column zero is log-energy.
  // Base threshold from the config (--vad-energy-threshold=5.5): frames that
  // fail the energy test are treated as noise, the rest as speech.
  BaseFloat energy_threshold = opts.vad_energy_threshold;
  // If --vad-energy-mean-scale is nonzero, raise the threshold by that
  // fraction of the utterance's mean log-energy.
  if (opts.vad_energy_mean_scale != 0.0) {
    KALDI_ASSERT(opts.vad_energy_mean_scale > 0.0);
    energy_threshold += opts.vad_energy_mean_scale * log_energy.Sum() / T;
  }
  KALDI_ASSERT(opts.vad_frames_context >= 0);
  KALDI_ASSERT(opts.vad_proportion_threshold > 0.0 &&
               opts.vad_proportion_threshold < 1.0);
  for (int32 t = 0; t < T; t++) {
    const BaseFloat *log_energy_data = log_energy.Data();
    int32 num_count = 0, den_count = 0, context = opts.vad_frames_context;
    // Count the frames of the window [t - context, t + context] that fall
    // inside the utterance, and how many of them pass the energy test.
    for (int32 t2 = t - context; t2 <= t + context; t2++) {
      if (t2 >= 0 && t2 < T) {
        den_count++;
        if (log_energy_data[t] > energy_threshold)
          num_count++;
      }
    }
    // Voiced if at least --vad-proportion-threshold of the window passed.
    if (num_count >= den_count * opts.vad_proportion_threshold)
      (*output_voiced)(t) = 1.0;
    else
      (*output_voiced)(t) = 0.0;
  }
}
}
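In equation form, writing $E_t$ for the log-energy of frame $t$ (column 0 of the MFCCs), the adaptive threshold computed above is

$$\theta = \theta_0 + s \cdot \frac{1}{T}\sum_{t=1}^{T} E_t$$

where $\theta_0$ is --vad-energy-threshold and $s$ is --vad-energy-mean-scale. A frame $t$ is then meant to be declared voiced when at least a proportion $p$ (--vad-proportion-threshold) of the frames in the window $[t-c, t+c]$ (with $c$ = --vad-frames-context) pass the energy test. One thing worth noticing about the snippet as quoted: the inner loop compares log_energy_data[t], not the neighbor log_energy_data[t2], against the threshold, so num_count is either 0 or den_count; since $0 < p < 1$, the decision collapses to the per-frame test $E_t > \theta$ and the context window has no effect. If the comparison were on t2 instead, the window and proportion threshold would smooth the decision across neighboring frames, which appears to be the intent of those options.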
Below is a demo of the actual computation. raw_mfcc_train1.txt and vad_train1.txt are produced by running the following in the mfcc directory:
./../../../../src/bin/copy-vector ark:vad_train.1.ark ark,t:- > vad_train1.txt
./../../../../src/featbin/copy-feats ark:raw_mfcc_train.1.ark ark,t:- > raw_mfcc_train1.txt
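For reference, the text format that copy-feats emits looks roughly like this: the utterance id is followed by "[", one row per frame follows, and the last row ends with "]" (the ids and values here are made up):

train_utt_001  [
  15.31 -0.27 1.70 ...
  14.92 -0.63 0.51 ... ]

The read_feats parser below assumes exactly this layout.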
import numpy as np

def read_feats(filename):
    # Parse a Kaldi text-format feature archive into a list of matrices
    # (one list of rows per utterance).
    f = open(filename, 'r')
    all_xs = []
    arr = []
    for line in f:
        temp = []
        if '[' in line:
            pass  # header line "utt-id [" opens a new matrix
        else:
            l = line.strip().split(' ')
            if ']' in l:
                # Last row of a matrix: drop the closing bracket,
                # then finish this utterance.
                l_temp = l[:-1]
                for i in range(len(l_temp)):
                    if l_temp[i] != '':
                        temp.append(float(l_temp[i]))
                arr.append(temp)
                all_xs.append(arr)
                arr = []
            else:
                for i in range(len(l)):
                    if l[i] != '':
                        temp.append(float(l[i]))
                arr.append(temp)
    f.close()
    return all_xs

mfcc_filename = 'raw_mfcc_train1.txt'
all_feats = read_feats(mfcc_filename)

# The same option values as in vad.conf above.
vad_energy_threshold = 5.5
vad_energy_mean_scale = 0.5
vad_frames_context = 5
vad_proportion_threshold = 0.6

for i in range(2):
    print("utt id is :", i)
    feat = all_feats[i]
    T = len(feat)
    print("this utt has frames:", T)
    log_energy = [f[0] for f in feat]  # column zero is log-energy
    print("log_energy :", log_energy)
    energy_threshold = (vad_energy_threshold +
                        vad_energy_mean_scale * sum(log_energy) / T)
    print("energy_threshold :", energy_threshold)
    output_voiced = np.zeros(T)
    for t in range(T):
        num_count = 0
        den_count = 0
        context = vad_frames_context
        for t2 in range(t - context, t + context + 1):
            if 0 <= t2 < T:
                den_count += 1
                if log_energy[t] > energy_threshold:
                    num_count += 1
        print("den_count is :", den_count)
        print("num_count is :", num_count)
        if num_count >= den_count * vad_proportion_threshold:
            output_voiced[t] = 1.0
        else:
            output_voiced[t] = 0.0
    print(output_voiced)
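To check the demo's output against Kaldi's decisions programmatically, a small helper for the vector archive is handy: copy-vector's text output puts each utterance on a single line, "utt  [ v1 v2 ... ]". read_vecs here is just a sketch I added on top of the demo, not Kaldi code:

import numpy as np

def read_vecs(filename):
    # Parse a Kaldi text vector archive: one "utt  [ v1 v2 ... ]" per line.
    vecs = []
    with open(filename) as f:
        for line in f:
            _, rest = line.split('[', 1)
            vecs.append(np.array([float(x) for x in rest.replace(']', '').split()]))
    return vecs

all_vad = read_vecs('vad_train1.txt')
# After computing output_voiced for utterance i as above:
# print(np.array_equal(output_voiced, all_vad[i]))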
To save time I only compare the first two utterances here; they agree with vad_train1.txt, and the rest should as well. I will come back and fill in the formula derivation and the underlying theory later.
One more note: VAD is really just a classification model, and Kaldi's handling is very simple, based on the relative magnitude of frame energy. A more rigorous approach is to model the training data with a discriminative model (a DNN or an SVM) or a statistical model such as a GMM: using frame-level alignments prepared in advance, split the frames into a voiced class and an other class (silence plus noise) and train on that; the results are usually better. There is a caveat, though: for something like animal-species recognition, some calls sit at very high frequencies and others at very low ones, so a single model trained on all the data together may not perform well. It is better to model each person, or each animal, separately. Just my personal take.
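As a rough sketch of the statistical-model route, here is a minimal two-class GMM frame classifier using scikit-learn. feats and labels are hypothetical stand-ins for frame-aligned training data; this is only an illustration of the idea, not how Kaldi does it:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_vad(feats, labels, n_components=4):
    # feats: (T, D) frame features; labels: (T,) array with 1 = voiced,
    # 0 = other (silence + noise). Both are assumed to exist already.
    voiced = GaussianMixture(n_components=n_components).fit(feats[labels == 1])
    other = GaussianMixture(n_components=n_components).fit(feats[labels == 0])
    return voiced, other

def gmm_vad(feats, voiced, other):
    # A frame is voiced when the voiced-class GMM assigns it the higher
    # log-likelihood.
    return (voiced.score_samples(feats) > other.score_samples(feats)).astype(np.float64)

At test time each frame simply goes to whichever class GMM gives it the higher log-likelihood; smoothing the per-frame decisions over a context window, as the Kaldi options above intend, would suppress isolated flips.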