WebRTC's VAD Algorithm

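VAD (Voice Activity Detection) detects whether speech is present. In far-field voice interaction, VAD faces two problems:

Detecting speech even at the lowest energy levels (sensitivity).
Detecting speech reliably in noisy environments (miss rate and false-alarm rate).
The miss rate measures speech that goes undetected; the false-alarm rate is the probability that a non-speech signal is detected as speech. Misses are comparatively unacceptable, whereas false alarms can be filtered out further by the back-end ASR and NLP algorithms. False alarms do, however, raise system resource usage, which in turn increases power consumption and heat, and that becomes a real problem for mobile and wearable devices.
This article covers the VAD algorithm in WebRTC. WebRTC's VAD is built on a Gaussian model, a model that is very widely used.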

Gaussian distribution

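The Gaussian distribution is also called the normal distribution.
If a random variable X follows a Gaussian distribution with mean μ and standard deviation σ (variance σ^2), we write:
$$X \sim N(\mu, \sigma^2)$$
Its probability density function is:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$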
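How WebRTC uses the Gaussian:
$$f(x_k \mid Z, r_k) = \frac{1}{\sqrt{2\pi}\,\sigma_Z} \exp\left(-\frac{(x_k-\mu_Z)^2}{2\sigma_Z^2}\right)$$
Here x_k is the selected feature. In WebRTC the feature vector consists of the (log) energies of the six subbands, and x_k is the value for subband k. r_k combines the mean μ_Z and the standard deviation σ_Z, and these two parameters determine the Gaussian probability. Z = 0 gives the probability of noise, Z = 1 the probability of speech.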

WebRTC algorithm flow

1. Set the VAD aggressiveness mode

There are four modes, numbered 0 to 3. The larger the number, the more aggressive the mode:
0: Normal (called Quality in the code), 1: Low bitrate, 2: Aggressive, 3: Very aggressive.
These modes are tied directly to the following parameters.

<common_audio/vad/vad_core.c>
// Mode 0, Quality.
static const int16_t kOverHangMax1Q[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2Q[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdQ[3] = { 24, 21, 24 };
static const int16_t kGlobalThresholdQ[3] = { 57, 48, 57 };
// Mode 1, Low bitrate.
static const int16_t kOverHangMax1LBR[3] = { 8, 4, 3 };
static const int16_t kOverHangMax2LBR[3] = { 14, 7, 5 };
static const int16_t kLocalThresholdLBR[3] = { 37, 32, 37 };
static const int16_t kGlobalThresholdLBR[3] = { 100, 80, 100 };
// Mode 2, Aggressive.
static const int16_t kOverHangMax1AGG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2AGG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdAGG[3] = { 82, 78, 82 };
static const int16_t kGlobalThresholdAGG[3] = { 285, 260, 285 };
// Mode 3, Very aggressive.
static const int16_t kOverHangMax1VAG[3] = { 6, 3, 2 };
static const int16_t kOverHangMax2VAG[3] = { 9, 5, 3 };
static const int16_t kLocalThresholdVAG[3] = { 94, 94, 94 };
static const int16_t kGlobalThresholdVAG[3] = { 1100, 1050, 1100 };

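They are used when computing the Gaussian model probabilities. Each array has three entries, one per frame length (10 ms, 20 ms and 30 ms). The local and global thresholds feed the likelihood ratio tests described below, and the OverHang values set how many frames the speech decision is held after speech ends.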
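To make the role of these tables concrete, here is a hypothetical C sketch that picks a parameter set from the mode (0 to 3) and the 8 kHz frame length, and then applies a simplified hangover rule. After speech stops, the decision is held for OverHangMax1 frames, or for OverHangMax2 frames after a long run of speech. The table values come from the listing above. The structs, the function names and the "long run" limit of 6 frames are our assumptions, and WebRTC's own code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rows: mode 0..3; columns: 10, 20, 30 ms frames (values from vad_core.c). */
static const int16_t kOverHang1[4][3] = {{8, 4, 3}, {8, 4, 3}, {6, 3, 2}, {6, 3, 2}};
static const int16_t kOverHang2[4][3] = {{14, 7, 5}, {14, 7, 5}, {9, 5, 3}, {9, 5, 3}};
static const int16_t kLocalThr[4][3] = {{24, 21, 24}, {37, 32, 37}, {82, 78, 82}, {94, 94, 94}};
static const int16_t kGlobalThr[4][3] = {{57, 48, 57}, {100, 80, 100},
                                         {285, 260, 285}, {1100, 1050, 1100}};

typedef struct { int16_t overhang1, overhang2, local_thr, global_thr; } VadParams;
typedef struct { int over_hang; int speech_run; } HangoverState;

enum { kSpeechRunLimit = 6 }; /* hypothetical length of a "long" speech run */

/* Returns 0 and fills *p for a valid mode / 8 kHz frame length, -1 otherwise. */
static int select_vad_params(int mode, size_t frame_length, VadParams* p) {
  int col;
  if (p == NULL || mode < 0 || mode > 3) return -1;
  switch (frame_length) {
    case 80:  col = 0; break; /* 10 ms */
    case 160: col = 1; break; /* 20 ms */
    case 240: col = 2; break; /* 30 ms */
    default:  return -1;
  }
  p->overhang1 = kOverHang1[mode][col];
  p->overhang2 = kOverHang2[mode][col];
  p->local_thr = kLocalThr[mode][col];
  p->global_thr = kGlobalThr[mode][col];
  return 0;
}

/* Simplified hangover: keeps reporting speech for a few frames after the raw
 * decision drops to 0, longer if the preceding speech run was long. */
static int apply_hangover(HangoverState* s, int raw_speech, const VadParams* p) {
  assert(s != NULL && p != NULL);
  if (!raw_speech) {
    s->speech_run = 0;
    if (s->over_hang > 0) {
      s->over_hang--;
      return 1; /* still inside the hangover window */
    }
    return 0;
  }
  if (s->speech_run <= kSpeechRunLimit) s->speech_run++;
  s->over_hang = (s->speech_run > kSpeechRunLimit) ? p->overhang2 : p->overhang1;
  return 1;
}
```

In mode 0 with 10 ms frames, a single speech frame is followed by 8 held frames, while a run of 7 or more speech frames is followed by 14.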

2. Frame length setting

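A) Three frame lengths are supported: 80 samples (10 ms), 160 samples (20 ms) and 240 samples (30 ms), counted at 8 kHz.
B) Input at other sample rates (48 kHz, 32 kHz, 16 kHz) is resampled to 8 kHz before the VAD is computed.
These frame lengths were chosen because speech is a short-time stationary signal: over 10 ms to 30 ms it can be treated as stationary. Signal-processing methods such as Gaussian and Markov models assume a stationary signal, so they can be applied to frames of 10 ms to 30 ms.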
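The valid frame sizes follow from samples = rate × duration. The sketch below checks a (sample rate, frame length) pair against 10, 20 and 30 ms frames. The helper name is hypothetical, and the accepted rates (8, 16, 32 and 48 kHz) are an assumption that you may need to adjust for your WebRTC version.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if frame_length samples at sample_rate_hz form a
 * 10, 20 or 30 ms frame at one of the accepted rates, 0 otherwise. */
static int is_valid_vad_frame(int sample_rate_hz, size_t frame_length) {
  static const int kRates[] = {8000, 16000, 32000, 48000};
  static const int kFrameMs[] = {10, 20, 30};
  for (size_t r = 0; r < sizeof(kRates) / sizeof(kRates[0]); r++) {
    if (kRates[r] != sample_rate_hz) continue;
    for (size_t m = 0; m < sizeof(kFrameMs) / sizeof(kFrameMs[0]); m++) {
      /* samples per frame = samples per millisecond * frame duration in ms */
      size_t expected = (size_t)(kRates[r] / 1000 * kFrameMs[m]);
      if (frame_length == expected) return 1;
    }
  }
  return 0;
}
```

So 80, 160 and 240 samples at 8 kHz are accepted, as is 1440 samples at 48 kHz (30 ms), while 100 samples at 8 kHz is rejected.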

3. Choosing the feature vector for the Gaussian model

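WebRTC's VAD algorithm uses the idea of clustering with only two classes: speech and noise. For each frame it computes the probability that the frame is speech and the probability that it is noise, and classifies the frame by these probabilities. To avoid errors caused by a single frame, the algorithm also includes a statistical decision over several frames. This raises the question: which feature should be the input to the Gaussian model? The choice determines the accuracy of the clustering, that is, the VAD's performance. Since a VAD exists to separate noise from speech, the question becomes which features differ most between noise signals and speech signals; features that differ strongly naturally give better discrimination.
Signal processing broadly works in three domains: time, frequency and space. In the spatial domain, WebRTC's VAD works on a single microphone, so noise and speech cannot be separated spatially (multi-microphone VAD algorithms do exist for multi-microphone setups). In the time domain, both are time-varying signals that change slowly over short intervals. By elimination, the frequency domain is the most likely to offer good discrimination.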

[Figure: car noise spectrum]

[Figure: pink noise spectrum]

[Figure: white noise spectrum]

[Figure: speech spectrum]

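The four figures above show that the spectra of noise and speech differ considerably, and that they take the form of a series of peaks and valleys.
WebRTC builds on exactly this observation and splits the spectrum into 6 subbands:
80-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, 3-4 kHz.
With 1 kHz as the dividing point, there are three bands below it, 500 Hz, 250 Hz and 170 Hz wide, and three bands above it, each 1 kHz wide. This range covers most of the energy in speech, and the subbands where the energy is highest are divided most finely.
Mains power in China is 220 V / 50 Hz. Interference from the 50 Hz supply leaks into the microphone signal, and mechanical vibration also adds low-frequency noise, so only content above 80 Hz is used.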

High-pass filter design

// High pass filtering, with a cut-off frequency at 80 Hz, if the |data_in| is
// sampled at 500 Hz.
//
// - data_in      [i]   : Input audio data sampled at 500 Hz.
// - data_length  [i]   : Length of input and output data.
// - filter_state [i/o] : State of the filter.
// - data_out     [o]   : Output audio data in the frequency interval
//                        80 - 250 Hz.
static void HighPassFilter(const int16_t* data_in, size_t data_length,
                           int16_t* filter_state, int16_t* data_out) {
  size_t i;
  const int16_t* in_ptr = data_in;
  int16_t* out_ptr = data_out;
  int32_t tmp32 = 0;

  // The sum of the absolute values of the impulse response:
  // The zero/pole-filter has a max amplification of a single sample of: 1.4546
  // Impulse response: 0.4047 -0.6179 -0.0266  0.1993  0.1035  -0.0194
  // The all-zero section has a max amplification of a single sample of: 1.6189
  // Impulse response: 0.4047 -0.8094  0.4047  0       0        0
  // The all-pole section has a max amplification of a single sample of: 1.9931
  // Impulse response: 1.0000  0.4734 -0.1189 -0.2187 -0.0627   0.04532

  for (i = 0; i < data_length; i++) {
    // All-zero section (filter coefficients in Q14).
    tmp32 = kHpZeroCoefs[0] * *in_ptr;
    tmp32 += kHpZeroCoefs[1] * filter_state[0];
    tmp32 += kHpZeroCoefs[2] * filter_state[1];
    filter_state[1] = filter_state[0];
    filter_state[0] = *in_ptr++;

    // All-pole section (filter coefficients in Q14).
    tmp32 -= kHpPoleCoefs[1] * filter_state[2];
    tmp32 -= kHpPoleCoefs[2] * filter_state[3];
    filter_state[3] = filter_state[2];
    filter_state[2] = (int16_t) (tmp32 >> 14);
    *out_ptr++ = filter_state[2];
  }
}
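The fixed-point loop above computes y[n] = b0·x[n] + b1·x[n-1] + b2·x[n-2] - a1·y[n-1] - a2·y[n-2], with filter_state holding x[n-1], x[n-2], y[n-1] and y[n-2]. The arrays kHpZeroCoefs and kHpPoleCoefs are not shown in the excerpt. The floating-point sketch below assumes the Q14 values {6631, -13262, 6631} and {16384, -7756, 5620}, which reproduce the impulse responses quoted in the comments. Treat them as an assumption and check them against your copy of vad_filterbank.c. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed Q14 coefficients converted to floating point:
 * zeros b = {6631, -13262, 6631} / 2^14, poles a = {16384, -7756, 5620} / 2^14. */
static const double kB[3] = {6631.0 / 16384.0, -13262.0 / 16384.0, 6631.0 / 16384.0};
static const double kA[3] = {1.0, -7756.0 / 16384.0, 5620.0 / 16384.0};

/* Same roles as filter_state[0..3]: x[n-1], x[n-2], y[n-1], y[n-2]. */
typedef struct { double x1, x2, y1, y2; } HpState;

/* Floating-point version of the all-zero + all-pole loop above. */
static void high_pass_float(const double* in, size_t n, HpState* s, double* out) {
  assert(in != NULL && s != NULL && out != NULL);
  for (size_t i = 0; i < n; i++) {
    double y = kB[0] * in[i] + kB[1] * s->x1 + kB[2] * s->x2 /* all-zero */
             - kA[1] * s->y1 - kA[2] * s->y2;                /* all-pole */
    s->x2 = s->x1;
    s->x1 = in[i];
    s->y2 = s->y1;
    s->y1 = y;
    out[i] = y;
  }
}
```

With these coefficients a constant (DC) input decays to zero, because b0 + b1 + b2 = 0, while an alternating ±1 input (250 Hz, the Nyquist frequency at a 500 Hz sampling rate) comes out with an amplitude of about 0.89.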

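The function is defined in common_audio\vad\vad_filterbank.c. An older version of it uses int for the length and the WEBRTC_SPL_MUL_16_16 macro for the multiplications: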

// High pass filtering, with a cut-off frequency at 80 Hz, if the |data_in| is
// sampled at 500 Hz.
//
// - data_in      [i]   : Input audio data sampled at 500 Hz.
// - data_length  [i]   : Length of input and output data.
// - filter_state [i/o] : State of the filter.
// - data_out     [o]   : Output audio data in the frequency interval
//                        80 - 250 Hz.
static void HighPassFilter(const int16_t* data_in, int data_length,
                           int16_t* filter_state, int16_t* data_out) {
  int i;
  const int16_t* in_ptr = data_in;
  int16_t* out_ptr = data_out;
  int32_t tmp32 = 0;


  // The sum of the absolute values of the impulse response:
  // The zero/pole-filter has a max amplification of a single sample of: 1.4546
  // Impulse response: 0.4047 -0.6179 -0.0266  0.1993  0.1035  -0.0194
  // The all-zero section has a max amplification of a single sample of: 1.6189
  // Impulse response: 0.4047 -0.8094  0.4047  0       0        0
  // The all-pole section has a max amplification of a single sample of: 1.9931
  // Impulse response: 1.0000  0.4734 -0.1189 -0.2187 -0.0627   0.04532

  for (i = 0; i < data_length; i++) {
    // All-zero section (filter coefficients in Q14).
    tmp32 = WEBRTC_SPL_MUL_16_16(kHpZeroCoefs[0], *in_ptr);
    tmp32 += WEBRTC_SPL_MUL_16_16(kHpZeroCoefs[1], filter_state[0]);
    tmp32 += WEBRTC_SPL_MUL_16_16(kHpZeroCoefs[2], filter_state[1]);
    filter_state[1] = filter_state[0];
    filter_state[0] = *in_ptr++;

    // All-pole section (filter coefficients in Q14).
    tmp32 -= WEBRTC_SPL_MUL_16_16(kHpPoleCoefs[1], filter_state[2]);
    tmp32 -= WEBRTC_SPL_MUL_16_16(kHpPoleCoefs[2], filter_state[3]);
    filter_state[3] = filter_state[2];
    filter_state[2] = (int16_t) (tmp32 >> 14);
    *out_ptr++ = filter_state[2];
  }
}

The caller lives in modules\audio_processing\voice_detection_impl.cc. It eventually reaches the VAD detection through WebRtcVad_Process, defined in common_audio\vad\webrtc_vad.c as int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame, int frame_length):

int VoiceDetectionImpl::ProcessCaptureAudio(AudioBuffer* audio) {
  if (!is_component_enabled()) {
    return apm_->kNoError;
  }

  if (using_external_vad_) {
    using_external_vad_ = false;
    return apm_->kNoError;
  }
  assert(audio->samples_per_split_channel() <= 160);

  // TODO(ajm): concatenate data in frame buffer here.

  int vad_ret = WebRtcVad_Process(static_cast<Handle*>(handle(0)),
                                  apm_->proc_split_sample_rate_hz(),
                                  audio->mixed_low_pass_data(),
                                  frame_size_samples_);
  // ... (remainder of the function omitted in this excerpt)
}

Source on GitHub: https://github.com/starmier/webrtc-1/blob/master/webrtc/modules/audio_processing/audio_processing_impl.cc

Interested readers may want to study it in detail.

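WebRTC's design of this filter is quite clever. It uses two tricks:

Fixed-point arithmetic, in two respects: the filter coefficients are quantized, and the filtering itself is computed in fixed point. The Gaussian model computation uses the same technique.
A low filter order, which keeps the amount of computation small.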
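Let us see how these tricks are applied. The code comments show that the impulse responses of the all-zero and all-pole sections are really floating-point values:
0.4047 -0.8094 0.4047 0 0 0
1.0000 0.4734 -0.1189 -0.2187 -0.0627 0.04532
Six values are listed for each, but each section is only second order. The all-zero section has just three non-zero taps, and the all-pole response follows from the two-coefficient recursion h[n] = 0.4734·h[n-1] - 0.3430·h[n-2], so none of its later taps has to be stored or multiplied. The loop therefore needs only five multiplications per output sample, with two past inputs and two past outputs kept in filter_state.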
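The listed all-pole values can be checked against a second-order recursion. The sketch below regenerates the impulse response of 1/(1 + a1·z^-1 + a2·z^-2). The coefficients used in the example, a1 = -7756/16384 and a2 = 5620/16384, are an assumption (the Q14 pole coefficients we believe vad_filterbank.c uses), and the function name is our own.

```c
#include <assert.h>
#include <stddef.h>

/* First n samples of the impulse response of 1 / (1 + a1 z^-1 + a2 z^-2),
 * computed as h[i] = delta[i] - a1 * h[i-1] - a2 * h[i-2]. */
static void all_pole_impulse_response(double a1, double a2, double* h, size_t n) {
  assert(h != NULL);
  for (size_t i = 0; i < n; i++) {
    double h1 = (i >= 1) ? h[i - 1] : 0.0; /* previous output */
    double h2 = (i >= 2) ? h[i - 2] : 0.0; /* output two steps back */
    h[i] = (i == 0 ? 1.0 : 0.0) - a1 * h1 - a2 * h2;
  }
}
```

With those coefficients the six outputs round to 1.0000, 0.4734, -0.1189, -0.2187, -0.0627 and 0.0453, matching the comment in the code.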

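The coefficients are quantized by scaling with 2^14 (Q14 format). A Q14 value stored in 16 bits can represent magnitudes up to 2, and according to the comments the worst-case amplification of each section stays below 2 (1.4546, 1.6189 and 1.9931), so 16-bit values are sufficient. The pole-zero plots are shown below: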

[Figure: pole-zero plots of the high-pass filter]

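I will skip a detailed explanation of these plots; once you understand the code above and the plots, you can change the filter's characteristics. Where phase information is not needed, an IIR filter reaches the same gain flatness with a lower order than an FIR filter. Questions about the high-pass filter design are welcome in the comments. The frequency response is: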

[Figure: frequency response of the high-pass filter]

WebRtcVad_CalculateFeatures computes the energy of each subband (as a logarithm, in Q4) and stores the results in feature_vector; GmmProbability is then called to compute the probabilities:

int WebRtcVad_CalcVad8khz(VadInstT* inst, const int16_t* speech_frame,
                          size_t frame_length) {
    int16_t feature_vector[kNumChannels], total_power;

    // Get power in the bands
    total_power = WebRtcVad_CalculateFeatures(inst, speech_frame, frame_length,
                                              feature_vector);

    // Make a VAD
    inst->vad = GmmProbability(inst, feature_vector, total_power, frame_length);

    return inst->vad;
}

Computation flow

The Gaussian model has two hypotheses, H0 and H1, representing noise and speech respectively. The decision uses a likelihood ratio test (LRT), applied both globally and locally.

[Figure: hypotheses H0/H1 and the likelihood ratio test]

a) The Gaussian probability is computed with the following formula:

[Figure: Gaussian probability formula]

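b) The log-likelihood ratio comes in a global and a local form. The global one is a weighted sum over the six subbands, while the local one is computed for each subband on its own. The speech decision therefore checks the subbands first and, if none passes, checks the global value; if either test passes, the frame is classified as speech. The formulas are: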

[Figure: local and global log-likelihood ratio tests]

Postscript:

The parameters related to the decision rule are in vad_core.c:

// Spectrum Weighting
static const int16_t kSpectrumWeight[kNumChannels] = { 6, 8, 10, 12, 14, 16 };
static const int16_t kNoiseUpdateConst = 655; // Q15
static const int16_t kSpeechUpdateConst = 6554; // Q15
static const int16_t kBackEta = 154; // Q8
// Minimum difference between the two models, Q5
static const int16_t kMinimumDifference[kNumChannels] = {
    544, 544, 576, 576, 576, 576 };
// Upper limit of mean value for speech model, Q7
static const int16_t kMaximumSpeech[kNumChannels] = {
    11392, 11392, 11520, 11520, 11520, 11520 };
// Minimum value for mean value
static const int16_t kMinimumMean[kNumGaussians] = { 640, 768 };
// Upper limit of mean value for noise model, Q7
static const int16_t kMaximumNoise[kNumChannels] = {
    9216, 9088, 8960, 8832, 8704, 8576 };
// Start values for the Gaussian models, Q7
// Weights for the two Gaussians for the six channels (noise)
static const int16_t kNoiseDataWeights[kTableSize] = {
    34, 62, 72, 66, 53, 25, 94, 66, 56, 62, 75, 103 };
// Weights for the two Gaussians for the six channels (speech)
static const int16_t kSpeechDataWeights[kTableSize] = {
    48, 82, 45, 87, 50, 47, 80, 46, 83, 41, 78, 81 };
// Means for the two Gaussians for the six channels (noise)
static const int16_t kNoiseDataMeans[kTableSize] = {
    6738, 4892, 7065, 6715, 6771, 3369, 7646, 3863, 7820, 7266, 5020, 4362 };
// Means for the two Gaussians for the six channels (speech)
static const int16_t kSpeechDataMeans[kTableSize] = {
    8306, 10085, 10078, 11823, 11843, 6309, 9473, 9571, 10879, 7581, 8180, 7483
};
// Stds for the two Gaussians for the six channels (noise)
static const int16_t kNoiseDataStds[kTableSize] = {
    378, 1064, 493, 582, 688, 593, 474, 697, 475, 688, 421, 455 };
// Stds for the two Gaussians for the six channels (speech)
static const int16_t kSpeechDataStds[kTableSize] = {
    555, 505, 567, 524, 585, 1231, 509, 828, 492, 1540, 1079, 850 };
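The comments give each constant's Q format: a value stored in Qn stands for value / 2^n. A tiny hypothetical helper makes the real-valued meaning visible.

```c
#include <assert.h>
#include <stdint.h>

/* Real value of a fixed-point number stored in Qn format: value / 2^n. */
static double q_to_double(int32_t value, int n) {
  assert(n >= 0 && n <= 30);
  return (double)value / (double)(1L << n);
}
```

Applied to the table: kNoiseUpdateConst = 655 in Q15 is about 0.02, kSpeechUpdateConst = 6554 in Q15 is about 0.2, kBackEta = 154 in Q8 is about 0.6, kMinimumDifference 544 in Q5 is 17.0, kMaximumSpeech 11392 in Q7 is 89.0, and kMaximumNoise 9216 in Q7 is 72.0.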

Original article: https://blog.csdn.net/shichaog/article/details/52399354
