Efficient Context Aggregation for End-to-End Speech Enhancement Using a Densely Connected Convolutional and Recurrent Network
In speech enhancement, an end-to-end deep neural network converts a noisy speech signal directly into clean speech in the time domain, without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution time-domain signal at an affordable model complexity remains challenging. In this paper, we propose a hybrid architecture that incorporates densely connected convolutional networks (DenseNet) and gated recurrent units (GRU) to enable dual-level temporal context aggregation. Owing to the dense connectivity pattern and a cross-component identical shortcut, the proposed model consistently outperforms competing convolutional baselines, with an average STOI improvement of 0.23 and a PESQ improvement of 1.38 across three SNR levels. In addition, the proposed hybrid architecture is computationally efficient, with only 1.38 million parameters.
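The sketch below is a minimal, illustrative PyTorch rendering of the kind of hybrid described in the abstract, not the authors' implementation: a densely connected 1-D convolutional front end captures local context, a GRU aggregates longer-range context, and a cross-component shortcut carries the convolutional features forward to the output stage. All layer sizes, the growth rate, and the exact placement of the shortcut are assumptions made for illustration.

```python
# Minimal sketch of a DenseNet + GRU hybrid for time-domain speech enhancement.
# Hyperparameters and layer placement are illustrative guesses, not the paper's values.
import torch
import torch.nn as nn


class DenseBlock1d(nn.Module):
    """Densely connected 1-D conv block: each layer sees all earlier feature maps."""

    def __init__(self, in_channels, growth_rate=16, num_layers=4, kernel_size=9):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(channels, growth_rate, kernel_size, padding=kernel_size // 2),
                nn.PReLU(),
            ))
            channels += growth_rate  # dense connectivity: outputs are concatenated
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)


class DenseGRUEnhancer(nn.Module):
    """DenseNet front end for local context, GRU for longer-range context,
    plus a cross-component shortcut from the conv features to the output stage."""

    def __init__(self, hidden_size=128):
        super().__init__()
        self.encoder = nn.Conv1d(1, 32, kernel_size=9, padding=4)
        self.dense = DenseBlock1d(32)
        self.gru = nn.GRU(self.dense.out_channels, hidden_size, batch_first=True)
        self.shortcut = nn.Conv1d(self.dense.out_channels, hidden_size, kernel_size=1)
        self.decoder = nn.Conv1d(hidden_size, 1, kernel_size=9, padding=4)

    def forward(self, noisy):                                 # noisy: (batch, 1, samples)
        conv_feats = self.dense(self.encoder(noisy))          # (batch, C, samples)
        rnn_out, _ = self.gru(conv_feats.transpose(1, 2))     # (batch, samples, hidden)
        rnn_out = rnn_out.transpose(1, 2)                     # (batch, hidden, samples)
        fused = rnn_out + self.shortcut(conv_feats)           # cross-component shortcut
        return self.decoder(fused)                            # estimated clean waveform


if __name__ == "__main__":
    model = DenseGRUEnhancer()
    noisy = torch.randn(2, 1, 16000)   # two 1-second clips at a 16 kHz sampling rate
    print(model(noisy).shape)          # torch.Size([2, 1, 16000])
```

Under these assumptions, the dense block keeps the convolutional path compact (feature reuse rather than wide layers), while the recurrent layer supplies context beyond the convolutional receptive field, which is the dual-level aggregation the abstract refers to.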