Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets

Victoria D. Doty

Voice activity detection is the task of segmenting an audio signal into speech and non-speech regions. Current methods struggle with this task in noisy environments, particularly in the presence of transient noises.

A recent study on arXiv.org proposes a novel algorithm that addresses the limitations of previous approaches.

Image credit: Dennis Hill via Wikimedia (CC BY 2.)

The spatial patterns of speech and non-speech audio frames are learned independently using the diffusion maps method. This method performs non-linear dimensionality reduction by mapping high-dimensional data points to a manifold embedded in a low-dimensional space. That makes it possible to distinguish the intrinsic structure of speech from that of transients and background noises. Five comparative experiments showed that the proposed algorithm improved voice activity detection performance and generalizes better than competing methods.
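To make the dimensionality-reduction step more concrete, here is a minimal sketch of a textbook diffusion maps embedding in Python with NumPy. It is not the authors' implementation: the function name `diffusion_map`, the Gaussian kernel, the median heuristic for the kernel scale `epsilon`, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Embed the rows of X (n_samples, n_features) into n_components dimensions.

    Textbook diffusion maps; every parameter choice here is illustrative.
    """
    # Pairwise squared Euclidean distances between all samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Kernel scale: median heuristic (an assumption, not taken from the paper).
    if epsilon is None:
        epsilon = np.median(d2)

    # Gaussian affinity between samples and the degree of each sample.
    K = np.exp(-d2 / epsilon)
    d = K.sum(axis=1)

    # Symmetric matrix that shares eigenvalues with the Markov matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)

    # Sort eigenpairs from largest to smallest eigenvalue.
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Convert to right eigenvectors of P.
    psi = vecs / np.sqrt(d)[:, None]

    # Drop the trivial constant eigenvector and scale by eigenvalue^t.
    return (vals[1:n_components + 1] ** t) * psi[:, 1:n_components + 1]

# Synthetic stand-in for two kinds of frames: two clusters in 3 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.3, size=(40, 3))
cluster_b = rng.normal(6.0, 0.3, size=(40, 3))
embedding = diffusion_map(np.vstack([cluster_a, cluster_b]), n_components=2)
print(embedding.shape)
```

On data like this, the first diffusion coordinate tends to take opposite signs on the two clusters, which is the kind of geometric separation the article describes for speech versus transients and background noise.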

We address voice activity detection in acoustic environments of transients and stationary noises, which often occur in real-life scenarios. We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure. This process is done through a deep encoder-decoder based neural network architecture. This structure involves an encoder that maps spectral features with temporal information to their low-dimensional representations, which are generated by applying the diffusion maps method. The encoder feeds a decoder that maps the embedded data back into the high-dimensional space. A deep neural network, which is trained to separate speech from non-speech frames, is obtained by concatenating the decoder to the encoder, resembling the known Diffusion nets architecture. Experimental results show enhanced performance compared to competing voice activity detection methods. The improvement is achieved in accuracy, robustness, and generalization ability. Our model performs in a real-time manner and can be integrated into audio-based communication systems. We also present a batch algorithm which obtains an even higher accuracy for off-line applications.
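The data flow described in the abstract (spectral features with temporal context, an encoder to a low-dimensional space, a decoder back to the high-dimensional space) can be sketched roughly as follows. This is an untrained toy with random weights, meant only to show shapes and composition. The helper names `stack_context`, `init_mlp`, and `forward`, the layer sizes, the `tanh` activation, and the 13-dimensional stand-in features are all hypothetical; the paper's actual features, layer sizes, training procedure, and decision rule are not reproduced here.

```python
import numpy as np

def stack_context(frames, context=2):
    """Concatenate each frame with `context` past and future frames.

    Edges are padded by repeating the first and last frame.
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = frames.shape[0]
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def init_mlp(sizes, rng):
    """Random (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Apply the layers in order; tanh on every layer except the last."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))      # stand-in for per-frame spectral features
x = stack_context(frames, context=2)     # shape (100, 65): 5 frames x 13 features

encoder = init_mlp([65, 32, 3], rng)     # high-dimensional -> low-dimensional
decoder = init_mlp([3, 32, 65], rng)     # low-dimensional -> high-dimensional

embedding = forward(encoder, x)          # in the paper, trained toward diffusion-map coordinates
reconstruction = forward(decoder, embedding)  # the decoder concatenated to the encoder
print(embedding.shape, reconstruction.shape)
```

In a trained version, the encoder's targets would be the diffusion maps coordinates of the training frames, as the abstract states; how the stacked network's output is turned into a speech or non-speech decision is left to the paper.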

Research paper: Ivry, A., Berdugo, B., and Cohen, I., “Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets”, 2021. Link: https://arxiv.org/abs/2106.13763
