Towards end-to-end speech enhancement with a variational U-Net architecture
In this paper, we investigate the viability of a variational U-Net architecture for denoising single-channel audio data. Deep network speech enhancement systems commonly aim to estimate filter masks, or skip preprocessing steps to work directly on the waveform signal, potentially neglecting relationships across higher-dimensional spectro-temporal features. We study the adoption of a probabilistic bottleneck, as well as dilated convolutions, into the classic U-Net architecture. A number of network variants are evaluated using signal-to-distortion ratio and perceptual model scores, with audio data including known and unknown noise types as well as reverberation. Our experiments show that the residual (skip) connections in the proposed system are required for successful end-to-end signal enhancement, i.e., without filter mask estimation. Further, they indicate a slight advantage of the variational U-Net architecture over its non-variational version in terms of signal enhancement performance under reverberant conditions. Specifically, PESQ scores show increases of 0.28 and 0.49 in reverberant and non-reverberant scenes, respectively. Anecdotal evidence points to improved suppression of impulsive noise sources with the variational end-to-end U-Net compared to the recurrent mask estimation network baseline.
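To illustrate what a probabilistic bottleneck typically involves, the following is a minimal NumPy sketch of the standard variational autoencoder formulation: sampling a latent code with the reparameterization trick and computing the KL divergence to a standard normal prior. The abstract does not specify the paper's bottleneck, so the diagonal Gaussian posterior, the standard normal prior, and the function names `sample_latent` and `kl_to_standard_normal` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).

    Assumes a diagonal Gaussian posterior parameterized by its mean and
    log-variance (a common choice, not confirmed by the abstract).
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over the latent axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical bottleneck output: a batch of 2 items with 8 latent dimensions.
rng = np.random.default_rng(0)
mu = np.zeros((2, 8))
log_var = np.zeros((2, 8))

z = sample_latent(mu, log_var, rng)        # latent code passed on to the decoder
kl = kl_to_standard_normal(mu, log_var)    # regularization term, one value per item
```

In a variational network of this kind, the KL term would usually be added to the reconstruction loss with some weighting; how the paper balances the two is not stated in the abstract.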