Transformer with Bidirectional Decoder for Speech Recognition

Model Overview

Abstract

Attention-based models have recently made tremendous progress on end-to-end automatic speech recognition (ASR). However, conventional transformer-based approaches usually generate the recognition sequence token by token from left to right, leaving right-to-left context unexploited. In this work, we introduce a synchronous bidirectional speech transformer that exploits both directional contexts simultaneously. Specifically, the proposed transformer outputs both a left-to-right target and a right-to-left target. At inference, we use a bidirectional beam search that generates both left-to-right and right-to-left candidates and selects the best hypothesis by score. To evaluate the proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline, and that the strongest model in this paper, STBD-Big, achieves 6.64% CER on the test set without language model rescoring or any extra data augmentation.
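The final selection step of the bidirectional beam search can be illustrated with a minimal sketch. It is not the paper's implementation: the `Hypothesis` class, the `pick_best` function, and the use of a single comparable score per candidate are assumptions made for illustration. The sketch only shows the idea that candidates from both directions compete on score, and that a right-to-left winner is reversed before being read out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """Hypothetical container for one beam-search candidate."""
    tokens: List[str]  # tokens in the order the decoder emitted them
    score: float       # assumed: a score comparable across directions
    direction: str     # "l2r" or "r2l"


def to_reading_order(hyp: Hypothesis) -> List[str]:
    """Return tokens in natural reading order (reverse r2l outputs)."""
    if hyp.direction == "r2l":
        return hyp.tokens[::-1]
    return list(hyp.tokens)


def pick_best(l2r: List[Hypothesis], r2l: List[Hypothesis]) -> List[str]:
    """Pool candidates from both directions and keep the top-scoring one."""
    candidates = l2r + r2l
    if not candidates:
        raise ValueError("no candidates to choose from")
    best = max(candidates, key=lambda h: h.score)
    return to_reading_order(best)


# Illustrative values only; these are not results from the paper.
l2r_beam = [Hypothesis(["a", "b", "c"], -1.2, "l2r")]
r2l_beam = [Hypothesis(["d", "b", "a"], -0.9, "r2l")]
best_tokens = pick_best(l2r_beam, r2l_beam)  # r2l candidate wins, then is reversed
```

In practice the scores would come from the decoder (for example, length-normalised log-probabilities), and how the two directions are made comparable is a design choice this sketch does not address.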

Publication
In InterSpeech, 2020
Songyang Zhang
PhD Students

My research interests include few/low-shot learning, graph neural networks and video understanding.