Attention-Based Encoder-Decoder Models for Speech Processing
Qiujia Li
Department of Engineering
University of Cambridge
I hereby declare that this dissertation is the result of my own work and includes nothing which
is the outcome of work done in collaboration except as declared in the Preface and specified in
the text. I further state that no substantial part of my dissertation has already been submitted,
or, is being concurrently submitted for any such degree, diploma or other qualification at the
University of Cambridge or any other University or similar institution except as declared in
the Preface and specified in the text. Some of the material has been presented at or submitted
to international conferences and journals (Li et al., 2021c,d, 2019d, 2021e,f, 2022). This
dissertation contains fewer than 65,000 words including appendices, footnotes, tables and
equations, but excluding the bibliography, and has fewer than 150 figures.
Qiujia Li
April 2022
Abstract
Speech processing is one of the key components of machine perception. It covers a wide
range of topics and plays an important role in many real-world applications. Many speech
processing problems are modelled using sequence-to-sequence models. More recently, the
Attention-Based Encoder-Decoder (AED) model has become a general and effective neural
network that transforms a source sequence into a target sequence. These two sequences may
have different lengths and belong to different modalities. AED models offer a new perspective
for various speech processing tasks. In this thesis, the fundamentals of AED models and
Automatic Speech Recognition (ASR) are first covered. The rest of the thesis focuses on
the application of AED models for three major speech processing tasks - speech recognition,
confidence estimation and speaker diarisation.
Speech recognition technology is widely used in voice assistants and dictation systems.
It converts speech signals into text. Traditionally, Hidden Markov Models (HMMs), as
generative sequence-to-sequence models, are widely used as the backbone of an acoustic model.
Under the Source-Channel Model (SCM) framework, the ASR system finds the most likely text
sequence that produces the corresponding acoustic sequence together with a language model
and a lexicon. Alternatively, the speech recognition task can be addressed discriminatively
using a single AED model. There are distinct characteristics associated with each modelling
approach. As the first contribution of the thesis, the Integrated Source-Channel and Attention
(ISCA) framework is proposed to leverage the advantages of both approaches with two passes.
The first pass uses the traditional SCM-based ASR system to generate diverse hypotheses,
either in the form of N -best lists or lattices. The second pass obtains the AED model score for
each hypothesis. Experiments on the Augmented Multi-Party Interaction (AMI) dataset showed
that ISCA using two-pass decoding reduced the Word Error Rate (WER) by 13% relative
compared to a joint SCM and AED system using one-pass decoding. Further experiments on both the
AMI dataset and the larger Switchboard (SWB) dataset showed that, if the SCM and AED
systems were trained separately to be more complementary, the combined system using ISCA
outperformed the individual systems by around 30%. Also, the refined lattice rescoring algorithm
is significantly better than N-best rescoring as a lattice is a more compact representation of the
hypothesis space, especially for longer utterances.
With various advancements in neural network training, AED models can reach similar or
better performance than traditional systems for many ASR tasks. Compared to a conventional
ASR system, one important but perhaps missing attribute of an AED-based system is good
confidence scores which indicate the reliability of automatic transcriptions. Confidence scores
are very helpful for various downstream tasks, including semi-supervised training, keyword
spotting and dialogue systems. As the second contribution of this thesis, effective confidence
estimators for AED-based ASR systems are proposed. The Confidence Estimation Module
(CEM) is a lightweight simple add-on neural network that takes various features from the
encoder, attention mechanism and decoder to estimate a confidence score for each output unit
(token). Experiments on the LibriSpeech dataset showed that compared to using Softmax
probabilities as confidence scores, the CEM improved token-level confidence estimation perfor-
mance substantially and largely addressed the over-confidence issue. For various downstream
tasks such as data selection, utterance-level confidence scores are more desirable. The Residual
Energy-Based Model (R-EBM), an utterance-level confidence estimator, was demonstrated
to outperform both Softmax probabilities and the CEM. The R-EBM directly operates at the
utterance level and takes deletion errors into account implicitly. The R-EBM also provides
a global normalisation term for the locally normalised auto-regressive AED models. On the
LibriSpeech dataset, the R-EBM reduced the WER of an AED model by up to 8% relative. One
potential issue for model-based confidence estimators such as the CEM and R-EBM is their
performance on Out-of-Domain (OOD) data. To ensure that confidence estimators generalise
well for OOD input, two simple approaches are suggested that can effectively inject OOD
information during the training of the CEM and R-EBM.
Speaker diarisation, a task of identifying “who spoke when”, is a crucial step for information
extraction and retrieval. The speaker diarisation pipeline often consists of multiple stages. The
last stage is to perform clustering over segment-level or window-level speaker representations.
Although clustering is normally an unsupervised task, this thesis proposes the use of AED
models for supervised clustering. With specific data augmentation techniques, the proposed
approach, Discriminative Neural Clustering (DNC), has been shown to be an effective alternative
to unsupervised clustering algorithms. Experiments on the very challenging AMI dataset
showed that DNC improved the Speaker Error Rate (SpkER) by around 30% relative compared
to a strong spectral clustering baseline. Furthermore, DNC opens more interesting research
directions, e.g. speaker diarisation with multi-channel or multi-modality information and
end-to-end neural network-based speaker diarisation.
Acknowledgements
First and foremost, I would like to thank my PhD supervisor Prof. Phil Woodland. I deeply
appreciate the opportunity Phil gave me to work on speech recognition about six years ago
when I was still an undergraduate student. Since then, I have been determined to pursue research,
especially in the field of speech technologies. During my PhD studies, I have been exceedingly
privileged to be guided by Phil through this challenging journey. He has spent countless hours
discussing my projects, replying to my messages, improving my paper drafts etc. I have learned
abundant technical knowledge and insight, but perhaps more importantly, a diligent and
rigorous attitude towards research. Phil has been more than my PhD supervisor; he has been a mentor
who selflessly supports my personal growth. For that, I will be forever grateful.
I would like to thank Dr Chao Zhang. Without him, I cannot imagine how many pitfalls and
sidetracks I would have gone through during my PhD studies. He is very patient, knowledgeable
and always willing to help. He can always inspire me when I face difficulties. I would also
like to express my gratitude to my labmate and friend, Florian Kreyssig. We brainstorm ideas,
discuss intriguing questions, and cheer each other on. I cannot overstate how lucky I am to
have Chao and Florian around in the past six years.
I want to thank Prof. Mark Gales, who has served as my PhD advisor and pointed me in the
right direction early on. I also appreciate my college tutors at Peterhouse, Dr Saskia Murk
Jansen and Dr Christopher Lester, who have helped secure my Graduate Studentship from
Peterhouse and supported me on various tutorial matters. I thank my collaborators at Google,
Dr Yu Zhang, Dr Liangliang Cao, Dr Bo Li, Dr David Qiu and Dr Yanzhang He, who provided
tremendous support and expertise during my internship. I am grateful to many members at the
Machine Intelligence Laboratory, Guangzhi Sun, Dr Anton Ragni, Dr Yu Wang, Xiaoyu Yang,
Qingyun Dou, Yiting Lu, Dr Kate Knill and Dr Linlin Wang, who have helped me in various
ways. I must also extend my thanks to my best friends at Cambridge, Xuan Guo, Yudong
Chen, Weiming Che and Yichen Yang, who have brought me joy and laughter in my daily life,
especially during the coronavirus pandemic. I greatly cherish their invaluable friendship.
Finally, I would like to thank my girlfriend, Ke Li, for her indispensable support and
encouragement. Her companionship makes the distance across the Atlantic Ocean seem
insignificant. And my most profound gratitude goes to my parents. They have supported me
viii
unconditionally in all possible ways over the past twenty-six years. I would like to dedicate
this thesis to them.
Table of contents
List of figures
Notation
Acronyms
1 Introduction
1.1 Speech Processing
1.1.1 Speech Recognition
1.1.2 Confidence Scores
1.1.3 Speaker Diarisation
1.2 Thesis Outline
1.3 Contributions
References
Notation
Neural Networks
a, a attention weight value/vector
b, b bias value/vector
β momentum coefficient
d decoder state vector
D, H number of nodes in a hidden layer
δ small number for computational stability
e, E encoder output embedding vector/matrix
ϵ learning rate
η decay factor for Adam optimiser
f (·), g(·) generic functions
g gradient of model parameters
f , i, o, c forget/input/output gates and cell state for LSTM
h hidden state vector
I number of nodes in an input layer
i, j general indices
J (·) overall cost
K convolutional kernel matrix
κ decay factor for exponential moving average
L(·) loss function
L number of layers in a network
m momentum of model parameters
M number of samples in a mini-batch
N number of data samples
n general count
∥ · ∥p p-norm of a vector
ν weight decay factor
O number of nodes in an output layer
Acronyms

AED     Attention-Based Encoder-Decoder
AM      Acoustic Model
AMI     Augmented Multi-Party Interaction
ANN     Artificial Neural Network
ASR     Automatic Speech Recognition
AUC     Area Under the Curve
CTC     Connectionist Temporal Classification
CTS     Conversational Telephone Speech
HMM     Hidden Markov Model
KL      Kullback–Leibler
LM      Language Model
LN      Length Normalisation
LSTM    Long Short-Term Memory
R-EBM   Residual Energy-Based Model
RNN     Recurrent Neural Network
RNNLM   Recurrent Neural Network Language Model
ROC     Receiver Operating Characteristics
ROVER   Recogniser Output Voting Error Reduction
SentER  Sentence Error Rate
SGD     Stochastic Gradient Descent
SpkER   Speaker Error Rate
SSL     Self-Supervised Learning
SWB     Switchboard
SWBC    Switchboard Cellular
Introduction
perspectives, this thesis proposes practical approaches to exploit their complementarity via
system combination. Compared to conventional ASR systems, one missing piece of the
AED-based system is high-quality confidence scores for their automatic transcriptions. As
AED models address the ASR task using a different principle from conventional systems,
novel confidence estimators are proposed in this thesis to produce reliable confidence scores
for AED-based ASR systems at the token, word, and utterance levels. Effective confidence
estimators have significant implications for various downstream tasks such as data selection
for semi-supervised or active learning, and dialogue systems. Apart from ASR, AED models
can also be used for speaker diarisation or determining “who spoke when” in a multi-talker
audio recording. Clustering is the last stage of a speaker diarisation pipeline and is normally
regarded as an unsupervised task. This thesis proposes a novel supervised approach for speaker
clustering using AED models.
In this chapter, the three speech processing topics covered in this thesis are first briefly
introduced. Then the thesis outline is presented and the main contributions are highlighted.
system without worrying too much about the rest of the pipeline. Many sources of structured
knowledge such as the lexicon and phonetic decision trees can readily be incorporated into the
system which helps the recogniser perform relatively robustly, even with a limited resource
budget. SCM-based systems process the acoustic sequence in a frame-by-frame manner, which
allows the system to be designed to handle streaming data.
Recently, with the accelerating development of learning algorithms and computing hardware,
a single end-to-end trainable AED model can reach a similar performance to SCM-based
systems for ASR. An AED model consists of an encoder, an attention mechanism and a
decoder. For ASR, the encoder transforms the input acoustic features into a sequence of hidden
representations, and the decoder works together with the attention mechanism to produce one
output token at a time based on all its previous output. Unlike HMM-based models, AED
models impose much weaker conditional independence assumptions for both word sequences and
acoustic sequences. AED models jointly learn the acoustic and language models whereas SCM-
based systems optimise them separately. Since AED models operate in a label-synchronous
fashion, processing streaming data is more challenging.
SCM and AED-based systems have their respective characteristics and are highly com-
plementary. Given two systems that have similar performance but different error patterns, a
significant gain is expected from combining them. To this end, it is of great interest to
investigate possible combination approaches and look for the best strategy in terms of the final
recognition performance while considering practical constraints.
Apart from the clustering algorithm, DNNs have been applied to all the other stages and have
shown promising performance. Clustering algorithms are generally unsupervised. However,
the task is sometimes ambiguous by nature. Given many high-dimensional features without
additional information, there may be more than one sensible clustering result. Current clustering
procedures rely on a distance measure for speaker representations and treat each segment as
an independent sample. Some clustering approaches even impose stricter assumptions about
the distribution of each cluster. The limitations of unsupervised clustering algorithms place
more pressure on the upstream stages, especially the extraction of speaker representations.
If it is possible to view speaker clustering as a supervised sequence-to-sequence task where
the input is a sequence of speaker representations and the output is a sequence of cluster
labels, there would be several benefits to reformulating clustering in the context of speaker
diarisation. The sequential relationship between speaker representations can then be exploited,
the ambiguity of clustering results can be removed by supervised training samples, a pre-defined
distance measure of speaker representations becomes optional, and the assumptions made by
unsupervised algorithms can be lifted. Moreover, the entire diarisation pipeline could be
designed as an end-to-end trainable model in order to avoid error propagation across different
stages.
• Chapter 2 first establishes the fundamentals of DNNs, including their basic building
blocks such as different neural network layers and activation functions. Then Chapter 2
introduces the AED models that are constructed based on these building blocks. Optimi-
sation procedures and regularisation techniques for DNNs are also described. Parts of
this chapter will be referred to throughout the thesis.
• Chapter 3 first describes SCM-based systems and components, including feature extrac-
tion, HMM-based acoustic models and their adaptation, language models, and the decod-
ing procedure. Connectionist Temporal Classification (CTC) and neural transducers are
also introduced as other forms of frame-synchronous systems. Next, label-synchronous
AED models for ASR are described. Their specific training and decoding techniques
are also covered. Finally, some self-supervised pre-training approaches for ASR are
presented. As they often use a large amount of unlabelled acoustic data, the pre-trained
model can be used to initialise an acoustic model or the encoder of an AED model.
• Chapter 5 addresses the confidence estimation problem for AED models for ASR. The
Confidence Estimation Module (CEM) is proposed as the token-level confidence esti-
mator, which was published in a conference paper at ICASSP 2021 (Li et al., 2021d).
Subsequently, the Residual Energy-Based Model (R-EBM) is proposed as the utterance-
level confidence estimator and can also be used to rescore the top hypotheses from the
AED models for better ASR performance. The R-EBM was published at the Interspeech 2021
conference (Li et al., 2021f). To improve model-based confidence scores on Out-of-
Domain (OOD) data, practical approaches are suggested and validated. The findings are
included in a conference paper at ICASSP 2022 (Li et al., 2022).
• Chapter 6 focuses on the clustering stage of the speaker diarisation task. Discriminative
Neural Clustering (DNC), based on AED models, is proposed as an effective supervised
alternative to existing unsupervised clustering algorithms. Three data augmentation
techniques are tailored to DNC when the training data is very limited. DNC was first
published as a conference paper at SLT 2021 (Li et al., 2021c).
• The key conclusions and contributions are summarised in Chapter 7. Based on the current
findings and the understanding of AED models, Chapter 7 also suggests several related
topics for further investigation with respect to ASR, confidence estimation and speaker
diarisation.
1.3 Contributions
The key contributions of this thesis are as follows.
• A general system combination framework called ISCA is proposed for ASR. The frame-
work uses a two-pass approach where hypotheses are first generated using a frame-
synchronous SCM-based system and then rescored by a label-synchronous AED model.
The lattice rescoring algorithm used for Recurrent Neural Network Language Model
(RNNLM) rescoring is adapted and improved for ISCA. Both N -best and lattice rescoring
algorithms are highly effective for improving the overall ASR performance.
• A lightweight confidence estimator for AED models, the CEM, is proposed for ASR.
The CEM can predict a confidence score for each output token in the hypotheses. By
simply aggregating token-level scores, word-level and utterance-level confidence scores
can be obtained.
• Since utterance-level confidence scores are very useful for various downstream tasks, the
R-EBM is proposed as an utterance-level confidence estimator that is trained directly
at the utterance level and implicitly takes deletion error into account. The R-EBM
can also be used to rescore N -best hypotheses from the AED models to improve the
ASR performance as it provides the globally normalised residual term for the locally
normalised AED model.
• Because both the CEM and R-EBM are model-based confidence estimators for AED
models, two practical approaches are proposed to improve the reliability of confidence
scores on OOD data when some unlabelled data from the target domain is available.
• A novel method called DNC is proposed to use an AED model to perform clustering
for speaker diarisation in a supervised fashion. Together with various specific data
augmentation schemes, DNC implicitly handles the permutation issue for cluster labels.
As DNC does not assume a pre-defined distance measure for speaker representations and
learns to disambiguate speakers from data, it outperforms a commonly used unsupervised
clustering algorithm on a challenging dataset.
Chapter 2

Attention-Based Encoder-Decoder Models
Artificial Neural Networks (ANNs), inspired by biological neural networks, are mathematical
models or a class of functions that transform input features or representations into the desired
output space via non-linear mappings (Bishop, 1995). An ANN is a directed and weighted
graph consisting of interconnected groups of nodes. There is a weight associated with each
connection and a non-linear activation function associated with each node. Deep Neural
Networks (DNNs) are a major class of ANNs where there are multiple layers of nodes between
the input and the output layers. By increasing the width and depth of neural networks, more
complex functions can be approximated (Goodfellow et al., 2016). The building blocks of
DNNs will be first introduced in this chapter.
Based on the basic building blocks, various kinds of neural networks can be constructed.
For speech processing, the input and the output are often sequences with variable lengths.
Therefore, Attention-Based Encoder-Decoder (AED) models (Cho et al., 2014; Sutskever et al.,
2014) are of particular interest. In this section, two commonly used AED models are described.
They will be frequently mentioned in the rest of the thesis.
Deep learning provides a powerful framework for DNNs to perform classification or regres-
sion tasks in a supervised fashion (LeCun et al., 2015). One of the key advantages of DNNs is
that they do not require any prior knowledge between the input and output spaces (Goodfellow
et al., 2016). Provided with a reasonable amount of training data (i.e. input and output pairs),
model parameters (i.e. weights) can be optimised according to the defined criterion via error
backpropagation (Rumelhart et al., 1988). Various optimisation techniques are included in this
chapter.
DNNs have been shown to achieve state-of-the-art results on various tasks in computer
vision, speech processing and natural language processing (LeCun et al., 2015). With more
complex model architectures, more advanced training techniques, an increasing amount of
training data and more powerful computing facilities, DNNs are expected to continue breaking
existing performance records. Despite the success of DNNs, this data-driven approach cannot
easily solve other tasks that would require more than the input-to-output feature mapping and a
higher level of understanding and reasoning (LeCun et al., 2015). However, within the scope of
speech processing, DNNs have become one of the most important modules in the system. This
chapter covers generic deep neural networks and their learning procedures, which will be often
referred to throughout the thesis.
[Figure 2.1: a single artificial neuron with inputs x_1, x_2, x_3, ..., x_D, a bias b and output y.]

For a single artificial neuron,
$$y = \phi\left(\sum_{i=1}^{D} w_i x_i + b\right) = \phi\left(\boldsymbol{w}^{\mathsf{T}}\boldsymbol{x} + b\right), \tag{2.1}$$
where the input feature $\boldsymbol{x} = [x_1, \ldots, x_D]^{\mathsf{T}}$ and the weight vector $\boldsymbol{w} = [w_1, \ldots, w_D]^{\mathsf{T}}$ have the same
dimension $D$, and the input value of the activation function is the dot product of the input and
weight vectors plus a bias value $b$. The corresponding output of the neuron $y$ is the evaluation
of the non-linear activation function $\phi(\cdot)$.
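For illustration, the following minimal NumPy sketch evaluates Equation (2.1) for one neuron. It is not part of the thesis; the four-dimensional input, the specific values and the sigmoid activation are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    # one common choice for the non-linear activation phi(.)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0, 0.1])   # input feature vector, D = 4
w = np.array([0.2, 0.4, -0.1, 0.7])   # weight vector of the neuron
b = 0.3                               # bias value

y = sigmoid(np.dot(w, x) + b)         # Equation (2.1): y = phi(w^T x + b)
print(y)
```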
<latexit sha1_base64="YEy0rtL8OREFxfc23Jo0M0weW64=">AAAB/XicbVA7T8MwGHTKq5RXeGwsERVSWaoEKmCsYGEsEn1Ibagcx2mtOnZkO0ghivgrLAwgxMr/YOPf4LQZoOUky6e775PP50WUSGXb30ZpaXllda28XtnY3NreMXf3OpLHAuE24pSLngclpoThtiKK4l4kMAw9irve5Dr3uw9YSMLZnUoi7IZwxEhAEFRaGpoHA49TXyahvtIku09rZyfZ0KzadXsKa5E4BamCAq2h+TXwOYpDzBSiUMq+Y0fKTaFQBFGcVQaxxBFEEzjCfU0ZDLF002n6zDrWim8FXOjDlDVVf2+kMJR5QD0ZQjWW814u/uf1YxVcuilhUawwQ7OHgphailt5FZZPBEaKJppAJIjOaqExFBApXVhFl+DMf3mRdE7rznm9cduoNq+KOsrgEByBGnDABWiCG9ACbYDAI3gGr+DNeDJejHfjYzZaMoqdffAHxucPpaaVWg==</latexit>
y (3)
<latexit sha1_base64="XVIwW/ek2F5To0f+3o276B5c3y8=">AAACEnicbVDLSsNAFJ3UV62vqEs3g0VoQUqiRV0W3bisYB/QxjKZTNqhk0mYmQgl5Bvc+CtuXCji1pU7/8ZJm0VtPTDM4Zx7ufceN2JUKsv6MQorq2vrG8XN0tb2zu6euX/QlmEsMGnhkIWi6yJJGOWkpahipBsJggKXkY47vsn8ziMRkob8Xk0i4gRoyKlPMVJaGpjVvhsyT04C/SWd9CGpnFfT03nRzcWBWbZq1hRwmdg5KYMczYH53fdCHAeEK8yQlD3bipSTIKEoZiQt9WNJIoTHaEh6mnIUEOkk05NSeKIVD/qh0I8rOFXnOxIUyGxBXRkgNZKLXib+5/Vi5V85CeVRrAjHs0F+zKAKYZYP9KggWLGJJggLqneFeIQEwkqnWNIh2IsnL5P2Wc2+qNXv6uXGdR5HERyBY1ABNrgEDXALmqAFMHgCL+ANvBvPxqvxYXzOSgtG3nMI/sD4+gWAHJ39</latexit>
W (3) , b(3)
<latexit sha1_base64="YDK+9JZXWSiEBgkZZCgDG4oMORw=">AAACFHicbVDLSsNAFJ34rPUVdelmsAgtQklqUTdC0Y3LCvYBbSyTyaQdOnkwMxFDyEe48VfcuFDErQt3/o2TNovaemCYwzn3cu89dsiokIbxoy0tr6yurRc2iptb2zu7+t5+WwQRx6SFAxbwro0EYdQnLUklI92QE+TZjHTs8XXmdx4IFzTw72QcEstDQ5+6FCOppIF+0rcD5ojYU1/ymN4n5dNKCi/hrBxncq2SDvSSUTUmgIvEzEkJ5GgO9O++E+DII77EDAnRM41QWgnikmJG0mI/EiREeIyGpKeojzwirGRyVAqPleJAN+Dq+RJO1NmOBHkiW1BVekiOxLyXif95vUi6F1ZC/TCSxMfTQW7EoAxglhB0KCdYslgRhDlVu0I8QhxhqXIsqhDM+ZMXSbtWNc+q9dt6qXGVx1EAh+AIlIEJzkED3IAmaAEMnsALeAPv2rP2qn1on9PSJS3vOQB/oH39ArO5npk=</latexit>
<latexit sha1_base64="aj0iK0uF5FS5rq9zA+djRoblxj0=">AAACEnicbVDLSsNAFJ3UV62vqEs3g0VoQUpSirosunFZwT6gjWUymbRDJ5MwMxFKyDe48VfcuFDErSt3/o2TNovaemCYwzn3cu89bsSoVJb1YxTW1jc2t4rbpZ3dvf0D8/CoI8NYYNLGIQtFz0WSMMpJW1HFSC8SBAUuI113cpP53UciJA35vZpGxAnQiFOfYqS0NDSrAzdknpwG+ku66UNSqVfT80XRzcWhWbZq1gxwldg5KYMcraH5PfBCHAeEK8yQlH3bipSTIKEoZiQtDWJJIoQnaET6mnIUEOkks5NSeKYVD/qh0I8rOFMXOxIUyGxBXRkgNZbLXib+5/Vj5V85CeVRrAjH80F+zKAKYZYP9KggWLGpJggLqneFeIwEwkqnWNIh2Msnr5JOvWZf1Bp3jXLzOo+jCE7AKagAG1yCJrgFLdAGGDyBF/AG3o1n49X4MD7npQUj7zkGf2B8/QJ8+537</latexit>
W (2) , b(2)
<latexit sha1_base64="2OCJ4yowjQjJtTTBNG707Fkxe8c=">AAACFHicbVDLSsNAFJ3UV62vqEs3g0VoEUpSiroRim5cVrAPaGOZTCbt0MmDmYkYQj7Cjb/ixoUibl2482+ctFnU1gPDHM65l3vvsUNGhTSMH62wsrq2vlHcLG1t7+zu6fsHHRFEHJM2DljAezYShFGftCWVjPRCTpBnM9K1J9eZ330gXNDAv5NxSCwPjXzqUoykkob66cAOmCNiT33JY3qfVOrVFF7CeTnOZLOaDvWyUTOmgMvEzEkZ5GgN9e+BE+DII77EDAnRN41QWgnikmJG0tIgEiREeIJGpK+ojzwirGR6VApPlOJAN+Dq+RJO1fmOBHkiW1BVekiOxaKXif95/Ui6F1ZC/TCSxMezQW7EoAxglhB0KCdYslgRhDlVu0I8RhxhqXIsqRDMxZOXSadeM89qjdtGuXmVx1EER+AYVIAJzkET3IAWaAMMnsALeAPv2rP2qn1on7PSgpb3HII/0L5+AbCWnpc=</latexit>
<latexit sha1_base64="HCZhClYpxAVH8RM8qXbCA+wzMac=">AAACEnicbVDLSgMxFM34rPU16tJNsAgtSJmRoi6LblxWsA9ox5LJZNrQTDIkGaEM/QY3/oobF4q4deXOvzHTzqK2Hgg5nHMv997jx4wq7Tg/1srq2vrGZmGruL2zu7dvHxy2lEgkJk0smJAdHynCKCdNTTUjnVgSFPmMtP3RTea3H4lUVPB7PY6JF6EBpyHFSBupb1d6vmCBGkfmS9uTh7TsViZn86Kfi3275FSdKeAycXNSAjkaffu7FwicRIRrzJBSXdeJtZciqSlmZFLsJYrECI/QgHQN5SgiykunJ03gqVECGAppHtdwqs53pChS2YKmMkJ6qBa9TPzP6yY6vPJSyuNEE45ng8KEQS1glg8MqCRYs7EhCEtqdoV4iCTC2qRYNCG4iycvk9Z51b2o1u5qpfp1HkcBHIMTUAYuuAR1cAsaoAkweAIv4A28W8/Wq/Vhfc5KV6y85wj8gfX1C3nanfk=</latexit>
W (1) , b(1)
<latexit sha1_base64="qtFCYMCwUv9TbEokav80f7ju0Ww=">AAAB/XicbVA7T8MwGHTKq5RXeGwsFhVSWaoEVcBYwcJYJPqQ2lA5jtNadezIdhAlivgrLAwgxMr/YOPf4LQdoOUky6e775PP58eMKu0431ZhaXllda24XtrY3NresXf3WkokEpMmFkzIjo8UYZSTpqaakU4sCYp8Rtr+6Cr32/dEKir4rR7HxIvQgNOQYqSN1LcPer5ggRpH5kofsru04p5kfbvsVJ0J4CJxZ6QMZmj07a9eIHASEa4xQ0p1XSfWXoqkppiRrNRLFIkRHqEB6RrKUUSUl07SZ/DYKAEMhTSHazhRf2+kKFJ5QDMZIT1U814u/ud1Ex1eeCnlcaIJx9OHwoRBLWBeBQyoJFizsSEIS2qyQjxEEmFtCiuZEtz5Ly+S1mnVPavWbmrl+uWsjiI4BEegAlxwDurgGjRAE2DwCJ7BK3iznqwX6936mI4WrNnOPvgD6/MHoQ+VVw==</latexit>
x(1)
Figure 2.2 An example of a three-layer MLP consisting of two hidden layers and one output
layer. In this example, the input has four dimensions and the output has two dimensions.
Instead of a single neuron written in Equation (2.1), for a general layer $l$ in an MLP, the
forward function modelled by the layer is
$$y_j^{(l)} = \phi\left(\sum_{i=1}^{I_l} w_{ji}^{(l)} x_i^{(l)} + b_j^{(l)}\right),$$
or in matrix notation
$$\boldsymbol{y}^{(l)} = \phi\left(\boldsymbol{W}^{(l)} \boldsymbol{x}^{(l)} + \boldsymbol{b}^{(l)}\right), \tag{2.2}$$
where the output from the $l$-th layer is the input to the next layer, i.e. $\boldsymbol{y}^{(l)} = \boldsymbol{x}^{(l+1)}$. If the input feature
$\boldsymbol{x}^{(l)}$ has dimension $I_l$ and the output feature $\boldsymbol{y}^{(l)}$ has dimension $O_l$, then the weight matrix
$\boldsymbol{W}^{(l)} \in \mathbb{R}^{O_l \times I_l}$ and the bias $\boldsymbol{b}^{(l)}$ is an $O_l$-dimensional vector, where $O_l = I_{l+1}$. Note that when
the input to the activation function $\phi(\cdot)$ is a vector, the operation is elementwise.
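To make the forward computation in Equation (2.2) concrete, the NumPy sketch below applies one fully connected layer to an input vector. This is an illustrative sketch rather than code from the thesis; the layer sizes and the tanh activation are arbitrary assumptions.

```python
import numpy as np

def dense_layer(x, W, b, phi=np.tanh):
    """One MLP layer: y = phi(W x + b), cf. Equation (2.2)."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
I_l, O_l = 4, 3                      # input and output dimensions (arbitrary)
W = rng.standard_normal((O_l, I_l))  # weight matrix of shape O_l x I_l
b = np.zeros(O_l)                    # bias vector
x = rng.standard_normal(I_l)         # input feature vector x^(l)

y = dense_layer(x, W, b)             # layer output y^(l), which becomes x^(l+1)
print(y.shape)                       # (3,)
```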
For MLPs, the number of layers defines the depth of the model while the number of artificial
neurons in hidden layers defines the width of the model. Varying the depth or width of an MLP
model allows it to have various levels of modelling capability by having a different number
of parameters, which is closely related to the selection of training setup. Assuming all hidden
layers have the same dimension D, the number of hidden layers is L, and the network has no
bias, then the number of parameters is $ID + (L-1)D^2 + DO$, where $I$ and $O$ are the input and
output layer dimensions.
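This parameter count can be checked with a small helper. The sketch below is only an illustration (not from the thesis), and the dimensions used are arbitrary assumptions.

```python
def mlp_num_params(I, D, O, L):
    """Weights in a bias-free MLP with L hidden layers of width D:
    I*D for the input layer, (L-1)*D*D between hidden layers, D*O for the output layer."""
    return I * D + (L - 1) * D * D + D * O

# e.g. 80-dimensional input, 5 hidden layers of width 512, 1000 output units
print(mlp_num_params(I=80, D=512, O=1000, L=5))  # 1601536 parameters
```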
A Time Delay Neural Network (TDNN) (Waibel et al., 1989) is a variant of MLP, where
each layer of TDNN processes a window of context from the previous layer, as shown in
Figure 2.3.
Figure 2.3 An example of a three-layer TDNN for a sequence of input with context from -4 to
+4. If dotted connections are removed, it becomes a subsampled TDNN.
Since a TDNN layer is effectively an MLP layer that can shift across time, TDNNs are
particularly useful for sequential inputs. The higher layers in a TDNN have an increasingly
wide view of the original input sequence (i.e. receptive field) and the lower layers are forced to
learn translation-invariant feature transforms (Peddinti et al., 2015). Effectively, parameters are
shared across different time steps, which reduces the total number of parameters. To further
reduce the computation load, subsampled TDNNs are often used where a window processes a
fixed amount of context and then shifts multiple steps (Peddinti et al., 2015).
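As an illustration of how a (subsampled) TDNN layer slides a shared MLP over time, the following NumPy sketch splices neighbouring frames and applies one weight matrix at every position. It is not code from the thesis; the context of ±2, the stride of 3 and the layer sizes are arbitrary assumptions.

```python
import numpy as np

def tdnn_layer(X, W, b, context=2, stride=1, phi=np.tanh):
    """Apply one TDNN layer to a sequence X of shape (T, D).
    Each output frame sees input frames [t-context, ..., t+context];
    a stride > 1 gives a subsampled TDNN."""
    T, D = X.shape
    outputs = []
    for t in range(context, T - context, stride):
        window = X[t - context:t + context + 1].reshape(-1)  # splice 2*context+1 frames
        outputs.append(phi(W @ window + b))                  # weights shared across time
    return np.stack(outputs)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))            # 100 frames of 40-dim features
W = rng.standard_normal((256, 5 * 40))        # context of +-2 -> 5 spliced frames
b = np.zeros(256)

H = tdnn_layer(X, W, b, context=2, stride=3)  # subsampled: shift 3 steps at a time
print(H.shape)                                # (32, 256)
```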
Figure 2.4 (a) Connections in 1-dimensional CNN layers. The size of the kernel is 3. Connections
with the same colour share the same weight. The receptive field of $s_3^{(2)}$ covers the full input
with sparse connections.
After the convolution operation, the output $s_{ij}$ is passed through a non-linear activation
function as in the MLP, and the entire feature map is processed by a pooling function. As
shown in Figure 2.4b, the pooling function replaces the output feature map at a certain location
with a summary statistic of the nearby output. For example, max-pooling (Zhou and Chellappa,
1988) only takes the maximum value of the output within a rectangular area in the feature
map, whereas mean pooling takes the average or a weighted average of the elements within the
defined area. The pooling operation effectively reduces the output dimension from each convo-
lutional layer and makes the representation approximately invariant to a small temporal/spatial
translation. This property is desirable for some features but not so much when the specific
location of a feature needs to be preserved (Boureau et al., 2010). Normally in a convolutional
neural network, there is more than one kernel in each layer to capture individual features and
many convolutional layers to have a large effective receptive field for deeper layers (Goodfellow
et al., 2016).
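The pooling operations described above can be illustrated with a short NumPy sketch over a 1-dimensional feature map. This is an illustration only, not code from the thesis, and the pooling width of 2 is an arbitrary assumption.

```python
import numpy as np

def pool_1d(feature_map, width=2, mode="max"):
    """Summarise each non-overlapping window of `width` values by its max or mean."""
    T = len(feature_map) - len(feature_map) % width      # drop any ragged tail
    windows = feature_map[:T].reshape(-1, width)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

s = np.array([0.1, 0.9, -0.3, 0.4, 0.7, 0.2])
print(pool_1d(s, width=2, mode="max"))    # [0.9 0.4 0.7]
print(pool_1d(s, width=2, mode="mean"))   # [0.5  0.05 0.45]
```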
Compared to MLPs, CNNs have two other distinct characteristics that make them very
successful in many pattern recognition tasks, especially in the field of computer vision (Goodfellow
et al., 2016). First, the convolution kernels interact with the input relatively sparsely, i.e. the
convolution kernel is normally much smaller than the input features. Unlike MLPs where each
input neuron is fully connected to each of the neurons in the next layer, only sets of weights in
these kernels are stored and used, and therefore the number of parameters can be several orders
of magnitude smaller. The sparse connectivity is demonstrated in Figure 2.4a. Furthermore,
since the kernels are much smaller than the input and they are acting as sliding windows
across the whole of the input, the same set of weights are reused multiple times instead of
having different parameters for different locations. Therefore, CNNs can be significantly more
efficient in terms of memory and computation. The two-dimensional convolution operation is
very suitable for image processing where feature invariance is desired in both dimensions. In
contrast, the advantage of CNNs for speech processing is less obvious where feature invariance
in the frequency dimension is much less desirable.
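To make the savings from sparse connectivity and weight sharing concrete, the short calculation below compares the parameter count of a 1-dimensional convolutional layer with that of a fully connected map over the same input. It is an illustrative sketch, not from the thesis; the sequence length, channel counts and kernel size are arbitrary assumptions.

```python
# A 1-D convolutional layer with kernel size 3 mapping 64 input channels to
# 64 output channels reuses the same small kernel at every time step.
kernel_size, c_in, c_out, T = 3, 64, 64, 1000

conv_params = c_out * c_in * kernel_size   # 12288 weights, independent of T
dense_params = (T * c_in) * (T * c_out)    # 4096000000 weights for a fully
                                           # connected map over the whole sequence
print(conv_params, dense_params)
```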
[Figure: an RNN unrolled over time, with the same input weight matrix U and history weight matrix W applied at every time step, mapping inputs x_{t-1}, x_t, x_{t+1} to hidden states h_{t-1}, h_t, h_{t+1}.]

For a unidirectional RNN layer, the hidden state at time $t$ is computed recursively as
$$\boldsymbol{h}^{(t)} = \phi\left(\boldsymbol{W}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}\boldsymbol{x}^{(t)}\right),$$
where W is the history weight matrix and U is the input weight matrix. RNNs can also be
applied to a sequence in the backwards direction. For one layer in a bidirectional RNN (Schuster
and Paliwal, 1997), the forward and backward hidden states can be concatenated or transformed
by a projection layer to be fed into the next layer or an output layer.
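A minimal NumPy sketch of this recurrence is given below. It is an illustration rather than code from the thesis; the dimensions, the tanh activation and the zero initial state are arbitrary assumptions.

```python
import numpy as np

def rnn_forward(X, W, U, phi=np.tanh):
    """Run a unidirectional RNN over a sequence X of shape (T, D_in).
    h^(t) = phi(W h^(t-1) + U x^(t)), with h^(0) = 0."""
    T = X.shape[0]
    h = np.zeros(W.shape[0])
    states = []
    for t in range(T):
        h = phi(W @ h + U @ X[t])   # the same W and U are reused at every time step
        states.append(h)
    return np.stack(states)         # hidden states h^(1), ..., h^(T)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40))              # 50 frames of 40-dim input
W = 0.1 * rng.standard_normal((128, 128))      # history weight matrix
U = 0.1 * rng.standard_normal((128, 40))       # input weight matrix
print(rnn_forward(X, W, U).shape)              # (50, 128)
```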
RNNs can be viewed as a fully connected graphical model that models the target sequence.
The hidden variable h is an efficient way of parameterising the graphical model despite being a
deterministic function of the input and the previous hidden state. As a result, the output at any
time step depends on all the previous input for a unidirectional RNN and the total number of
parameters in an RNN is independent of the length of input sequences.
RNNs are particularly useful for tasks whose input and/or output are sequences. RNNs
can map an input sequence to an output sequence of the same length (e.g. part of speech
tagging (Socher et al., 2010)), can encode an input sequence into a fixed-length vector (e.g.
sentiment analysis (Tang et al., 2015)), and can convert a fixed-size input to a variable-length
sequence (e.g. image captioning (Xu et al., 2015)). For tasks such as machine translation (Bah-
danau et al., 2014) and speech recognition (Chorowski et al., 2015), they are also important
building blocks for RNN-based AED models (see Section 2.2) where, in general, input and
output sequences have different lengths.
One significant drawback of training RNNs with gradient-based optimisation is that the gradient
can either vanish or explode for very long sequences. Therefore, it is challenging to establish
long-term dependencies, which can be very important for tasks such as language and
speech processing. The gating function was therefore introduced to allow the gradient to flow
unchanged from history steps, thus improving the ability of the model to capture longer-term
dependencies. Among many variants of RNNs, the Long Short-Term Memory (LSTM) net-
work (Hochreiter and Schmidhuber, 1997) is one of the most successful models (Sak et al.,
2014). As shown in Figure 2.6, LSTM networks have memory cells in addition to the outer
recurrence of the RNN.
The internal recurrence of an LSTM memory cell has three gates which control the flow
of information at time $t$: a forget gate $\boldsymbol{f}^{(t)}$, an input gate $\boldsymbol{i}^{(t)}$ and an output gate $\boldsymbol{o}^{(t)}$.
[Figure 2.6: an LSTM memory cell at time t, with cell states c_{t-1} and c_t, forget gate f_t, input gate i_t, output gate o_t, tanh non-linearities, elementwise multiplications and additions, input x_t and hidden state h_t.]
<latexit sha1_base64="jW9JLZAIsBxFZZG6OL0K66i4oHM=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE1GPRi8cKpi20oWw2m3bpZhN2J0Ip/Q1ePCji1R/kzX/jts1BWx8MPN6bYWZemElh0HW/ndLa+sbmVnm7srO7t39QPTxqmTTXjPsslanuhNRwKRT3UaDknUxzmoSSt8PR3cxvP3FtRKoecZzxIKEDJWLBKFrJ77EoxX615tbdOcgq8QpSgwLNfvWrF6UsT7hCJqkxXc/NMJhQjYJJPq30csMzykZ0wLuWKppwE0zmx07JmVUiEqfalkIyV39PTGhizDgJbWdCcWiWvZn4n9fNMb4JJkJlOXLFFoviXBJMyexzEgnNGcqxJZRpYW8lbEg1ZWjzqdgQvOWXV0nrou5d1S8fLmuN2yKOMpzAKZyDB9fQgHtogg8MBDzDK7w5ynlx3p2PRWvJKWaO4Q+czx/b+I67</latexit>
ht
<latexit sha1_base64="OmKwESMdSF05m6DumMXdfshC3E0=">AAAB+XicbVDLSsNAFL2pr1pfUZduBovgqiRS1GXRjcsK9gFtCJPJpB06mYSZSaGE/okbF4q49U/c+TdO2iy09cAwh3PuZc6cIOVMacf5tiobm1vbO9Xd2t7+weGRfXzSVUkmCe2QhCeyH2BFORO0o5nmtJ9KiuOA014wuS/83pRKxRLxpGcp9WI8EixiBGsj+bY9DBIeqllsrnw897Vv152GswBaJ25J6lCi7dtfwzAhWUyFJhwrNXCdVHs5lpoRTue1YaZoiskEj+jAUIFjqrx8kXyOLowSoiiR5giNFurvjRzHqghnJmOsx2rVK8T/vEGmo1svZyLNNBVk+VCUcaQTVNSAQiYp0XxmCCaSmayIjLHERJuyaqYEd/XL66R71XCvG83HZr11V9ZRhTM4h0tw4QZa8ABt6ACBKTzDK7xZufVivVsfy9GKVe6cwh9Ynz9L65Qa</latexit>
Figure 2.6 Diagram of an LSTM cell. At time t, xt is the input vector, ht is the hidden state,
and ct is the cell state.
where the W and U matrices are the weights associated with each gate or the cell, b denotes the corresponding biases, ◦ denotes element-wise multiplication, and σ(·) is the sigmoid activation function (see Section 2.1.5). If the feature dimension is D and the hidden/cell state dimension is H, then W ∈ R^{H×H}, U ∈ R^{H×D}, and b ∈ R^{H}. In Equations (2.5), (2.6) and (2.7), all three gates have the same input, i.e. the previous hidden state h_{t−1} and the current input feature x_t, and follow the recurrence in Equation (2.4). As in Equation (2.8), the cell state c_t is a weighted combination of the previous cell state modified by the forget gate and a transform of the current input modulated by the input gate. The updated hidden state h_t is the updated cell state c_t, squashed by a tanh and controlled by the output gate. The introduction of the memory cell with gating functions allows history information to be integrated dynamically depending on the input sequence.
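To make the gating operations concrete, the following is a minimal NumPy sketch of a single LSTM step. The parameter names and shapes simply follow the dimensionalities quoted above (W ∈ R^{H×H}, U ∈ R^{H×D}, b ∈ R^{H}) and are illustrative rather than taken from any particular toolkit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, params):
    # One LSTM step. For each of the forget (f), input (i) and output (o) gates
    # and the cell candidate (g), params holds a recurrent matrix W (H x H),
    # an input matrix U (H x D) and a bias b (H,), matching the shapes in the text.
    acts = {}
    for name in ("f", "i", "o", "g"):
        W, U, b = params[name]
        pre = W @ h_prev + U @ x_t + b
        acts[name] = np.tanh(pre) if name == "g" else sigmoid(pre)
    c_t = acts["f"] * c_prev + acts["i"] * acts["g"]   # forget old state, add gated input
    h_t = acts["o"] * np.tanh(c_t)                     # output gate on squashed cell state
    return h_t, c_t

# toy usage: D = 3 input features, H = 4 hidden units
rng = np.random.default_rng(0)
D, H = 3, 4
params = {k: (rng.standard_normal((H, H)), rng.standard_normal((H, D)), np.zeros(H))
          for k in ("f", "i", "o", "g")}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell(rng.standard_normal(D), h, c, params)
```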
v = Σ_{t=1}^{T} a^{(t)} h^{(t)},    (2.10)

where Σ_t a^{(t)} = 1. The set of weights a is usually the output of a Softmax function on the set of scores produced by the model. The input feature vectors, h^{(t)}, can either be the hidden
representation at a certain stage of the model or the input features. The context vector can be
subsequently transformed into the output distribution. The attention weights are dynamically
generated for different inputs and the attention mechanism is a differentiable function that can
be jointly optimised with the rest of the network.
The attention mechanism is predominantly used in sequence-to-sequence tasks, where capturing an entire sequence in a single hidden representation becomes increasingly difficult as sequences grow longer (Bahdanau et al., 2014). Instead of compressing the information from a sequence into a fixed-length vector, a more effective approach is to extract a hidden representation for each position in the input sequence, and then use an attention mechanism to dynamically focus on different parts of the input sequence in the transformed space when generating each position of the corresponding output sequence. This is extremely
helpful for tasks like machine translation (Wu et al., 2016) where the input-to-output alignment
is irregular, and the decoder needs to “attend to” various features associated with words in the
input sequence at different stages. The attention mechanism has also inspired the creation of
new sequence models such as Transformers (Vaswani et al., 2017). A detailed description of
AED models will be given in Section 2.2.
Name      ϕ(x)                              ϕ′(x)                              Range
sigmoid   1 / (1 + e^{−x})                  ϕ(x)(1 − ϕ(x))                     (0, 1)
ReLU      0 if x < 0;  x if x ≥ 0           0 if x < 0;  1 if x ≥ 0            [0, ∞)
PReLU     ρx if x < 0;  x if x ≥ 0          ρ if x < 0;  1 if x ≥ 0            (−∞, ∞)
tanh      (e^x − e^{−x}) / (e^x + e^{−x})   1 − ϕ(x)²                          (−1, 1)
swish     x / (1 + e^{−x})                  (1 + ϕ(x) e^{−x}) / (1 + e^{−x})   (−0.3, ∞)
Table 2.1 List of frequently used activation functions for a single input unit. The exact minimum of the swish activation function is −W(1/e) where W is the Lambert W-function.
The activation functions in Table 2.1 are often used within layers and some of them can
be used for output layers as well. If the output layer requires a probability distribution for
classification or in the case of attention weights over various inputs, the Softmax activation
function is normally used as it ensures the output values are between 0 and 1 and sum up to
unity,
ϕ_i(x) = e^{x_i} / Σ_k e^{x_k},    (2.11)
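As a small illustration, Equation (2.11) can be implemented directly; the max-subtraction below is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x):
    """Softmax of Equation (2.11), shifted by max(x) for numerical stability.
    The shift cancels between numerator and denominator."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # outputs sum to 1
```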
The predecessor of AED models, the RNN encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014), uses two RNNs to achieve sequence-to-sequence modelling. One RNN encodes the input sequence into a
fixed-length vector representation and the other RNN decodes the representation into the output
sequence. The performance of this approach may be significantly limited by the information
stored in the fixed-length vector and the sequential information associated with it, especially for
long sequences. This shortcoming is addressed by the attention mechanism (Bahdanau et al.,
2014) introduced in Section 2.1.4. The AED model is able to “attend to” or “focus on” the
most relevant part of the input sequence to generate the output sequence. Based on the basic
building blocks in Section 2.1, two widely used AED models are described in this section.
E = Encoder(X),    (2.13)
a_i = Attention(a_{i−1}, d_{i−1}, E),    (2.14)
v_i = E a_i,    (2.15)
P(u_i | u_{1:i−1}, X), d_i = Decoder(u_{i−1}, d_{i−1}, v_i).    (2.16)
As described in Section 2.1.3, RNNs process sequences and can flexibly perform one-to-one, many-to-one and one-to-many sequence-to-sequence mappings. For attention-based models, an
RNN can be used as the encoder that performs the one-to-one mapping from the input feature
xt to embedding space et . The decoder can also use an RNN to generate the transcription
sequence based on the previous output token u1 , . . . , ui−1 , the decoder history hidden state
di−1 and the context vector vi produced by the attention mechanism. Figure 2.7 illustrates the
model architecture for RNN-based AED models with attention.
Figure 2.7 Model architecture for RNN-based AED models with attention.
The attention mechanism, as described in Section 2.1.4, produces a set of weights for
intermediate embedding vectors at all time steps. Based on the previous decoder state, encoded
feature embeddings, and optionally the attention weights, the attention mechanism provides
information on which parts of the input sequence to focus on. The decoder then can use
the context information given by the attention mechanism and the previous decoder state to
generate the next symbol in the output sequence. Therefore, the attention mechanism plays a
central role in the automatic soft alignment between input and output sequences.
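The interaction of Equations (2.13)-(2.16) can be summarised as a simple decoding loop. The sketch below uses placeholder callables for the encoder, attention mechanism and decoder rather than any specific library API (they are assumed to accept None for the initial state and weights), and greedy selection stands in for the beam search normally used in practice.

```python
import numpy as np

def greedy_decode(encoder, attention, decoder, x, sos_id, eos_id, max_len=100):
    """Schematic greedy decoding loop following Equations (2.13)-(2.16)."""
    E = encoder(x)                 # (2.13): D x T matrix of encoder embeddings
    u, d, a = sos_id, None, None   # previous token, decoder state, attention weights
    hyp = []
    for _ in range(max_len):
        a = attention(a, d, E)     # (2.14): weights over the T encoder positions
        v = E @ a                  # (2.15): context vector as a weighted sum
        p, d = decoder(u, d, v)    # (2.16): P(u_i | u_{1:i-1}, X) and new state
        u = int(np.argmax(p))      # greedy choice; beam search is used in practice
        if u == eos_id:
            break
        hyp.append(u)
    return hyp
```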
Context-based Attention This is the form of attention mechanism originally proposed for machine translation (Bahdanau et al., 2014). The attention scores are computed by an MLP that takes the previous decoder hidden state d_{i−1} and the encoded embeddings e_1, . . . , e_T as input, where w, W, V, b are the trainable parameters of the attention mechanism. The attention network can be trained jointly with the rest of the network.
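Since the scoring equation is not reproduced here, the sketch below assumes the standard additive form from Bahdanau et al. (2014), score_t = wᵀ tanh(W d_{i−1} + V e_t + b), with the trainable parameters named as in the text; all shapes are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_attention(d_prev, E, w, W, V, b):
    """Additive (Bahdanau-style) attention weights over the T encoder steps.
    E is a T x D matrix of encoder embeddings; shapes of w, W, V, b match."""
    scores = np.array([w @ np.tanh(W @ d_prev + V @ e_t + b) for e_t in E])
    return softmax(scores)            # attention weights a_i, summing to 1

# toy shapes: decoder state size 4, embedding size 5, attention size 3
rng = np.random.default_rng(0)
T, D, H, A = 6, 5, 4, 3
a = context_attention(rng.standard_normal(H), rng.standard_normal((T, D)),
                      rng.standard_normal(A), rng.standard_normal((A, H)),
                      rng.standard_normal((A, D)), np.zeros(A))
assert abs(a.sum() - 1.0) < 1e-6
```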
Location-based Attention For tasks like speech recognition, the alignment between input and output sequences is always monotonic. Therefore, knowing the alignment for the previous output should help the attention network produce the attention weights for the next step. Location-based attention (Chorowski et al., 2015) allows the attention network to take the previous attention weights as an extra input.
The attention mechanism takes a query and looks for similar keys. Then it uses the similarity
between each pair of query and key to obtain a weight for the value that corresponds to the key.
The output of the attention mechanism for the query is a weighted sum of all values, i.e.
Attention(E_q, E_k, E_v) = Softmax( E_q E_k^T / √D_k ) E_v,    (2.23)
where the Softmax is applied to each row. To interpret this operation, assume the query only
contains one entry, i.e. Tq = 1. The numerator inside the Softmax operation is the dot product
of the query vector with every key vector for all time steps T_k. The scalar result of each dot product represents the closeness of the query and the key (a cosine similarity scaled by the vector norms). The
Softmax of the Tk values normalises these dot products into a valid probability distribution,
which is used as weights when computing the weighted sum of the value vector across all time
steps. By assuming that each dimension of the query and key vectors is an independent random
variable with zero mean and unit variance, the value of the dot product has zero mean and variance D_k. Therefore, the dot product is scaled down by √D_k to prevent small gradients
caused by large values passed to the Softmax function. To ensure the validity of this assumption,
layer normalisation is usually used before the attention mechanism to normalise each feature to
approximately zero mean and unit variance. For a query with an arbitrary number of entries,
the attention operation first computes the closeness between each pair of query vector and
key vector using the dot product, and then normalises the values from the dot product across all time steps T_k for each query vector. A weighted sum of the value vectors produces the feature
vector for each time step in the query. Therefore, the result of the attention operation has
dimension Tq × Dv . The Transformer model is constructed based on this definition of the
attention mechanism as shown in Figure 2.8.
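A direct NumPy implementation of Equation (2.23) is shown below; the shapes follow the T_q × D_k, T_k × D_k and T_k × D_v convention described above, and the variable names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Eq, Ek, Ev):
    """Attention of Equation (2.23): Eq is Tq x Dk, Ek is Tk x Dk, Ev is Tk x Dv;
    the result has shape Tq x Dv."""
    Dk = Ek.shape[-1]
    scores = Eq @ Ek.T / np.sqrt(Dk)                     # Tq x Tk similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise Softmax
    return weights @ Ev                                  # weighted sum of the values

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.standard_normal((2, 8)),
                                   rng.standard_normal((5, 8)),
                                   rng.standard_normal((5, 16)))
assert out.shape == (2, 16)
```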
The attention mechanism in the encoder and in the first layer of the decoder block is also called self-attention, as the input features come from the same source, i.e. Xq = Xk = Xv. Self-attention acts like a feature extractor that pays attention to the input features at all other time steps. In this sense, self-attention is similar to a TDNN whose scope is the entire sequence. The attention mechanism that bridges the encoder and the decoder is called source-attention, as Xk and Xv are the same encoder embeddings whereas Eq comes from the decoder embeddings based on the history sequence. This is similar to the attention mechanism in RNN-based AED models, but the attention weights are obtained by the dot product of the transformed features rather than directly via a neural network.

Figure 2.8 Model architecture of the Transformer, where multiple encoder blocks and decoder blocks are stacked, each containing multi-head attention, add & norm and MLP layers, and position encoding is added to the input embeddings of both the encoder and the decoder.
The other novel component of the Transformer model is multi-head attention, where the attention mechanism in Equation (2.23) is split into N heads. For head i, E_q^{(i)}, E_k^{(i)}, E_v^{(i)} are computed with a weight matrix associated with each head, W_q^{(i)}, W_k^{(i)}, W_v^{(i)}, according to Equation (2.22). Then the output from each attention head is concatenated and transformed by another weight matrix W_o,

Head_i = Attention(E_q^{(i)}, E_k^{(i)}, E_v^{(i)}),    (2.24)
MultiHeadAttention = Concat(Head_1, . . . , Head_N) W_o,    (2.25)
where N is the number of heads and W_o ∈ R^{N D_v × D_o}. Multi-head attention allows the model
to jointly attend to information from different representation subspaces at different positions.
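The following sketch illustrates Equations (2.24)-(2.25). Splitting single projection matrices into per-head slices is one common implementation choice assumed here for brevity; it is not necessarily how any particular toolkit organises the weights.

```python
import numpy as np

def multi_head_attention(Xq, Xk, Xv, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head attention in the spirit of Equations (2.24)-(2.25)."""
    def attend(q, k, v):                           # scaled dot-product attention
        s = q @ k.T / np.sqrt(k.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    Eq, Ek, Ev = Xq @ Wq, Xk @ Wk, Xv @ Wv         # project inputs (Equation (2.22))
    dq, dv = Eq.shape[-1] // n_heads, Ev.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):                       # one slice of the projections per head
        heads.append(attend(Eq[:, h * dq:(h + 1) * dq],
                            Ek[:, h * dq:(h + 1) * dq],
                            Ev[:, h * dv:(h + 1) * dv]))
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate heads and mix with Wo

rng = np.random.default_rng(0)
T, D = 5, 16
X = rng.standard_normal((T, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) for _ in range(4))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo, n_heads=4)   # self-attention
assert out.shape == (T, D)
```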
The Transformer model also includes an MLP network after the attention block and residual/skip connections across each attention block. Although the ordering of the input sequence
is important, the Transformer model so far does not explicitly use the sequential information
due to the lack of recurrence or convolution. To address this, positional encoding is added to
the input embedding for both the encoder and the decoder. Positional encoding describes the
position of an entity in a sequence so that each position is assigned a unique representation. The
positional encoding can either be learned or fixed (Vaswani et al., 2017). The fixed positional
encoding proposed by Vaswani et al. (2017) is based on sinusoidal functions
PE(t, i) = sin( t / 10000^{i/D} )      if i is even,
PE(t, i) = cos( t / 10000^{(i−1)/D} )  if i is odd,
where t is the position in the sequence and i is the dimension for D-dimensional input embed-
dings. This function means that each dimension of the positional encoding corresponds to a
sinusoid and allows the model to easily learn to attend by relative positions (Vaswani et al.,
2017).
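A sketch of the fixed sinusoidal encoding is given below, assuming an even embedding dimension D and the 10000-based wavelength scaling of Vaswani et al. (2017); the exact exponent convention follows the reconstruction above.

```python
import numpy as np

def positional_encoding(T, D):
    """Fixed sinusoidal positional encoding: even dimensions use a sine, odd
    dimensions a cosine, with geometrically increasing wavelengths."""
    pe = np.zeros((T, D))
    t = np.arange(T)[:, None]                          # positions 0 .. T-1
    div = np.power(10000.0, np.arange(0, D, 2) / D)    # per-pair wavelength scale
    pe[:, 0::2] = np.sin(t / div)
    pe[:, 1::2] = np.cos(t / div)
    return pe                                          # added to the input embeddings

pe = positional_encoding(T=50, D=8)
assert pe.shape == (50, 8)
```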
Compared to recurrent networks for sequence modelling, self-attention can be fully parallelised across time steps, and the maximum path length (the number of steps to traverse the network between two input features that are T steps apart) reduces from O(T) to O(1), which allows longer dependencies to be modelled directly. However, the per-layer computational complexity is O(T²) for self-attention vs. O(T) for recurrent networks.
2.3 Optimisation
Once the DNNs are designed for specific tasks, optimisation is the next challenge as millions
of parameters need to be estimated efficiently. The optimisation problem, in short, is to find a set of parameters θ that minimises the overall loss J(θ), i.e. the expected loss over the true data distribution p_data, where f(·) represents a non-linear function that describes the neural network.
However, the underlying data distribution is generally inaccessible. Therefore, the approximate optimisation objective is the empirical risk on given samples drawn from the empirical data distribution p̂_data,

J(θ) ≈ E_{(x,y)∼p̂_data}[ L(f(x; θ), y) ] = (1/N) Σ_{i=1}^{N} L(f(x^{(i)}; θ), y^{(i)}),    (2.28)
where L(·) is the loss function and N is the total number of samples in the training dataset.
Unlike linear models, there is no closed-form analytical solution for optimising DNNs.
In this section, the cornerstone of neural network optimisation is first described. Then the
most widely adopted gradient-based algorithm (i.e. Stochastic Gradient Descent (SGD)) and
its related learning strategies are introduced. However, minimising the empirical risk leads
to an overfitting problem. As one key aspect of deep learning, effective regularisation helps
deep models with a huge number of parameters to generalise well on unseen data. Some
regularisation techniques are included in Section 2.3.4.
Consider a composition y = g(x) and z = f(y), where z is a scalar output, x ∈ R^I is the input vector, y ∈ R^O is the intermediate result, and f(·) and g(·) are two general functions. According to the chain rule of calculus,

∂z/∂x_i = Σ_j (∂z/∂y_j) (∂y_j/∂x_i),    (2.30)

or in matrix notation

∇_x z = (∂y/∂x)^T ∇_y z,    (2.31)
1 The term “error” makes sense for the least squares loss function, but in general, “error” here refers to the partial derivative of the loss function.
where ∂y/∂x is the Jacobian matrix of g(·) of dimension O × I. This operation can be extended
recursively to a chain of operations of arbitrary steps and higher dimensional inputs.
In practice, a neural network forms a computational graph whose nodes are variables for
each operation and the directed edges are defined operations from an input variable to an output
variable. By using the chain rule, the derivatives of the final scalar loss can be computed with
respect to all the nodes within the computational graph recursively (Goodfellow et al., 2016).
However, naively computing all the gradients will have exponential cost as the computational
graph grows. Since many sub-expressions can be reused, storing these intermediate results can
yield linear computation time w.r.t. the number of edges in the graph. The backpropagation
algorithm constructs an identical computational graph as the forward propagation but with
reversed edges. Each edge in the backward graph has the derivative of the corresponding
operation in the forward graph.
Computational graphs are directed acyclic graphs. For the case of RNNs, the computation
graph is still acyclic after unfolding as shown in Figure 2.5. Figure 2.9 shows a generic example
of how the backpropagation algorithm works for a chain of vector operations.
Figure 2.9 A generic example of the backpropagation algorithm for a chain of vector operations x → f1(·) → h → f2(·) → y → f3(·) → z: the backward graph has reversed edges and applies the derivatives f1′(·), f2′(·) and f3′(·) to propagate the gradients of z with respect to y, h and x.
In many modern neural network toolkits and libraries, e.g. PyTorch2 and TensorFlow3 , an
approach called symbol-to-symbol (Goodfellow et al., 2016) is commonly used for computing
derivatives. During the forward pass, it adds additional nodes to the computational graph that
contains a symbolic description of the derivative functions. Any subset of the graph may be
evaluated later using specific numerical values. One advantage of this approach is that obtaining higher-order derivatives is straightforward, as backpropagation can simply be applied again to the extended graph containing the derivative nodes.
2 https://www.pytorch.org
3 https://www.tensorflow.org
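As a small illustration of the symbol-to-symbol approach, the PyTorch snippet below builds the derivative graph explicitly (via create_graph=True) so that it can itself be differentiated again; this is a toy example of the idea rather than anything specific to AED models.

```python
import torch

# Forward graph: z = x**3 with x = 2.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# Symbol-to-symbol differentiation: the first call adds derivative nodes to the
# graph, so the result can be differentiated again for the second derivative.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)   # 3 * x**2 = 12
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)                 # 6 * x    = 12
print(dy_dx.item(), d2y_dx2.item())
```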
The gradient of the empirical risk in Equation (2.28) w.r.t. the parameters is

∇_θ J(θ) = (1/N) Σ_{i=1}^{N} ∇_θ L(f(x^{(i)}; θ), y^{(i)}),    (2.32)

and gradient descent iteratively updates the parameters in the opposite direction of the gradient,

θ ← θ − ϵ ∇_θ J(θ),    (2.33)

where the step size is controlled by ϵ, which is known as the learning rate.
In practice, the number of data samples is usually very large and the computational cost of
Equation (2.32) increases linearly with N . Therefore, performing an update on all parameters
with the full training set can be very computationally expensive considering that gradient
descent is an iterative process. As Equation (2.28) shows, the loss is an expected value, so the gradient for one update can be approximated by sampling a small portion of the data, M ≪ N, i.e. a minibatch of samples randomly drawn from the training set. This is the idea of SGD (Bottou, 2010),

g = (1/M) Σ_{i=1}^{M} ∇_θ L(f(x^{(i)}; θ), y^{(i)}),    (2.34)
θ ← θ − ϵ g.    (2.35)
During training, the minibatch size is usually constant and is in the order of a few hundred.
Therefore, each update can be performed at a constant cost regardless of the size of the dataset.
However, by using a small batch size, the gradient calculation can be biased towards sampled
data which yields noisy gradients and is prone to local optima.
Since SGD may be very noisy and slow, the choice of the learning rate ϵ becomes critically
important. If the learning rate is too large, the update step may overshoot and sometimes may
even diverge. On the other hand, a small learning rate will cause the convergence to be slow and
stuck at poor local minima. To address this problem, various learning rate schedulers have been
proposed to reach a good compromise between convergence and speed. The learning rate can be
set to gradually ramp up at the beginning and/or decay linearly or quadratically as the training
proceeds. Some schedulers decrease the learning rate in a more discrete fashion based on the
performance of the current model during training. The momentum term, i.e. the accumulated
update history direction, can be interpolated with the gradient of the current minibatch to carry
the inertia of past updates and helps SGD gain faster convergence with reduced oscillation
across the error surface (Bishop, 1995; Polyak, 1964). SGD with momentum can be written as
follows.
m ← βm + ϵg, (2.36)
θ ← θ − m, (2.37)
where s and r are the estimated first and second moments of the gradient g, and η1 = 0.9
and η2 = 0.999 are the suggested settings. δ is a small number, e.g. 10^{−8}, for computational
stability. However, in general, there is no conclusive advantage of using one particular learning
rate scheduling algorithm over others in practical neural network training. Techniques such
as gradient clipping (Mikolov, 2012) and gradient scaling (Pascanu et al., 2013) restrict the
magnitudes of update value and gradient norm to minimise the effect caused by abnormal or
noisy data within a minibatch.
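Equations (2.36)-(2.37) and the moment estimates s and r described above can be written as a few lines of update code. The sketch below assumes the widely used Adam-style update for the adaptive case, since the corresponding equations are not reproduced here; all names are illustrative.

```python
import numpy as np

def sgd_momentum_step(theta, m, g, lr=0.01, beta=0.9):
    """SGD with momentum, following Equations (2.36)-(2.37): the accumulated
    direction m carries the inertia of past updates."""
    m = beta * m + lr * g
    return theta - m, m

def adaptive_step(theta, s, r, g, t, lr=1e-3, eta1=0.9, eta2=0.999, delta=1e-8):
    """Adaptive update in the style of the widely used Adam optimiser, using the
    symbols from the text: s and r estimate the first and second moments of g,
    eta1/eta2 are their decay rates and delta keeps the division stable."""
    s = eta1 * s + (1 - eta1) * g
    r = eta2 * r + (1 - eta2) * g ** 2
    s_hat, r_hat = s / (1 - eta1 ** t), r / (1 - eta2 ** t)   # bias correction
    return theta - lr * s_hat / (np.sqrt(r_hat) + delta), s, r

# toy quadratic loss J(theta) = ||theta||^2, so the gradient is 2 * theta
theta, m = np.ones(3), np.zeros(3)
for _ in range(5):
    theta, m = sgd_momentum_step(theta, m, g=2 * theta)
```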
where U and N are uniform and normal distributions, and g is the gain factor associated with activation functions, where g = 1 for sigmoid, g = √2 for ReLU, g = √(2/(1 + ρ²)) for PReLU, and g = 5/3 for tanh.
Normalisation accounts for two aspects. The first one is feature normalisation, where each dimension of the input features is normalised using first and second-order statistics computed at the global level, the batch level, or the level of some data-specific subset. This aligns well with the assumptions made when initialising weights randomly.
The other aspect is model reparameterisation. While training DNNs with SGD, only
first-order interactions between parameters are considered since all parameters are updated
simultaneously. Batch normalisation (Ioffe and Szegedy, 2015) is a technique that reparameterises the activations of each layer as each minibatch is processed. Specifically, for a batch of input X = [x_1, . . . , x_M] to a layer, the features are first normalised

x̂_i = (x_i − µ) / √(σ² + δ),    (2.43)
where µ and σ are the mean and standard deviation vectors computed across the batch of features. To restore the representation power, the normalised features are transformed again by a layer-specific standard deviation vector σ̃ and mean vector µ̃ to give the output of the layer Y = [y_1, . . . , y_M], where µ̃ and σ̃ are learnt parameters of the network. The output of the batch normalisation, Y, is then treated as input features to the next layer in the DNN. During inference time, the
population statistics are used for the input to each layer. The batch normalisation operation
introduces greater Lipschitzness (the change of output values due to the change of input values
is more limited) into the loss and the gradient during training, thus generating a smoother loss
landscape (Santurkar et al., 2018).
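A minimal training-time sketch of Equation (2.43) plus the learnt scale and shift is given below; at inference the batch statistics would be replaced by population statistics, as noted above. The names gamma and beta correspond to σ̃ and µ̃ in the text and are illustrative.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, delta=1e-5):
    """Batch normalisation for a minibatch X of shape (M, D) at training time."""
    mu = X.mean(axis=0)                       # per-dimension batch mean
    var = X.var(axis=0)                       # per-dimension batch variance
    X_hat = (X - mu) / np.sqrt(var + delta)   # Equation (2.43)
    return gamma * X_hat + beta               # learnt scale/shift restores capacity

rng = np.random.default_rng(0)
Y = batch_norm_train(rng.standard_normal((32, 8)), np.ones(8), np.zeros(8))
assert Y.shape == (32, 8)
```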
Despite the success of batch normalisation, it is challenging to apply this technique to
RNNs where having a batch normalisation procedure for each time step is impractical. Another
alternative is called layer normalisation (Ba et al., 2016). Instead of normalising across each
feature dimension in the batch of input features, layer normalisation normalises across all
feature dimensions for each feature at each layer. Therefore, layer normalisation does not
depend on other samples and can be used to train RNNs with long sequences and small minibatches. Layer normalisation has been shown to speed up the training of RNNs (Ba et al.,
2016). Transformer models also use layer normalisation (Vaswani et al., 2017).
Parameter norm penalty, or weight decay (Krogh and Hertz, 1992), aims to constrain the capability of the model by penalising large weight values. It adds an extra penalty term, based on a norm of the parameters, to the standard training objective, where ν is a non-negative hyperparameter for the weight of the norm regularisation and ∥θ∥_p
is the p-norm of all parameters. For DNNs, the L2 norm on all weights is commonly used.
Note that bias terms are normally unregularised (Goodfellow et al., 2016). For simplicity, the
objective function and its derivatives w.r.t. weights can be written for the L2 norm as
J̃(θ) = J(θ) + (ν/2) θ^T θ,    (2.46)
∇_θ J̃(θ) = ∇_θ J(θ) + ν θ.    (2.47)
Ensemble methods usually combine multiple models at different levels, e.g. voting at the
output level or averaging at the posterior level (Breiman, 1996). Since different models tend
to make different errors, ensemble methods can normally outperform individual models. To
demonstrate, if there are n regression models and each one makes an error ε_i on each sample, where E[ε_i] = 0, the expected squared error of the ensemble model (averaged model outputs) is

E[ ( (1/n) Σ_i ε_i )² ] = (1/n²) E[ Σ_i ε_i² + Σ_i Σ_{j≠i} ε_i ε_j ] = (1/n) E[ε_i²] + ((n−1)/n) E[ε_i ε_j].    (2.49)
If all errors are perfectly correlated, i.e. E[ε_i²] = E[ε_i ε_j], meaning the n models are identical, the ensemble does not help. If all errors made by different models are independent, i.e. E[ε_i ε_j] = 0,
the expected squared error decreases linearly w.r.t. the size of the ensemble. If n models are
constructed differently, e.g. different subsets of training data, different initialisation, different
data shuffling, different architectures, and different hyperparameters, the ensemble model
is able to reduce the generalisation error, especially when individual ensemble components
become more complementary to each other.
Similar to ensembling multiple models with different architectures trained on the same set of data, dropout (Srivastava et al., 2014) is an alternative method that provides strong regularisation by randomly disabling part of the neurons in a DNN during training. Dropout is
an approximation of training an exponentially large number of neural networks with partially
shared parameters simultaneously as different dropout masks are applied to different layers of
the network for each minibatch. The proportion of neurons to be dropped for an iteration is
a hyperparameter P_dropout to be tuned. Intuitively, dropout prevents some neurons from becoming over-specialised on certain data. At test time, all the weights going out of each neuron are multiplied by the probability of including the neuron (1 − P_dropout). This empirical rule
performs well in practice.
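The rule described above can be sketched as follows; the mask-based training branch and the (1 − P_dropout) test-time scaling mirror the description in the text, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_dropout, training):
    """Dropout: units are disabled with probability p_dropout during training;
    at test time the activations are scaled by the keep probability instead."""
    if training:
        mask = rng.random(h.shape) >= p_dropout   # keep each unit with prob 1 - p
        return h * mask
    return h * (1.0 - p_dropout)                  # test-time scaling rule

h = np.ones(10)
print(dropout(h, 0.3, training=True))    # some units zeroed out
print(dropout(h, 0.3, training=False))   # all units scaled by 0.7
```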
Exponential Moving Average (EMA), a temporal ensemble of model parameters, is another
commonly used approach to boost the final performance (Polyak and Juditsky, 1992). A set of
“shadow” parameters θ̃ are kept to maintain a moving average of the trained parameters,
θ̃ ← κ θ̃ + (1 − κ) θ,    (2.50)
where κ is the decay parameter for EMA. κ is normally set to be close to 1.0, e.g. 0.9999. The
shadow parameters are updated after each model update or a fixed number of model updates,
which does not influence the training process at all. Maintaining an EMA of the model weights during training can improve the final performance significantly compared with just using the final weights (Tarvainen and Valpola, 2017).
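Equation (2.50) amounts to a one-line update of the shadow parameters, for example:

```python
import numpy as np

def ema_update(theta_shadow, theta, kappa=0.9999):
    """Update the "shadow" parameters of Equation (2.50) after a model update."""
    return kappa * theta_shadow + (1.0 - kappa) * theta

theta_shadow, theta = np.zeros(4), np.ones(4)
for _ in range(10):                 # called after each training step
    theta_shadow = ema_update(theta_shadow, theta)
```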
Parameter tying or parameter sharing is a technique that forces parts of the parameter vector
within a model to be identical. In CNNs described in Section 2.1.2, parameters are shared
for each kernel that convolves across the entire input. In RNNs introduced in Section 2.1.3,
parameters are tied across each time step if viewed from the unfolding perspective. Parameter
sharing can reduce the memory footprint of a model and provide a regularisation effect by
restricting the number of parameters.
Multi-task training (Caruana, 1997) makes the model jointly learn from multiple objective
functions or perform multiple tasks at the same time. Usually, a significant part of the network
is shared and only the output layers differ. The multiple output branches can perform
different but similar tasks, e.g. grapheme and phone recognition, or the same task using
different objective functions, e.g. Cross Entropy (CE) and squared error. In the multi-task
training framework, the lower parts of the network are normally shared to extract useful features
from the input while the upper parts are split into multiple branches to transform the shared
features to perform the desired task. The improved generalisation performance mainly comes from the shared parameters and the limited difference between the tasks, which prevent the network from over-specialising to one task. This approach is not always applicable as two or more
tasks with shared statistical factors are required. The overall loss function, an interpolation
of losses from all branches, needs to be carefully balanced in order to achieve the desired
performance for all tasks.
A model that generalises well should be able to cope with data uncertainties, i.e. noise (Sietsma and Dow, 1991). A model's noise robustness can always be improved by
training with more data sampled from the real data distribution. However, this may not
be practical due to limited available data resources. Data augmentation is one effective
approach to increasing the amount of training data based on the currently available data. For
example, images can be transformed differently to generate new images, e.g. cropping, scaling,
translating, rotating (Krizhevsky et al., 2012); speech can be augmented by vocal tract length
perturbation or speed perturbation (Jaitly and Hinton, 2013; Ko et al., 2015). Noise can
always be injected into the data by adding or subtracting some small random values. Data
augmentation allows the space of training data to be enriched and forces the model to be
invariant to the applied transformations and more robust to noisy data. Specific to speech
recognition, SpecAugment (Park et al., 2019) has been widely used as an effective approach
to augmenting the input log mel spectrogram by applying multiple instances of time warping,
frequency masking and time masking. The randomly corrupted input prevents the network
from overfitting specific features and improves the generalisation of the model for mismatched
acoustic conditions.
Similar to augmenting the input, the output labels can also be corrupted by some degree of
noise, especially for recognition tasks. Label smoothing (Szegedy et al., 2016) replaces the
one-hot training targets by 1 − ξ for the 1s and ξ/(|y| − 1) for the 0s. By smoothing the hard targets to soft ones, this technique injects noise into the output labels and acts as a regulariser that prevents the model from pursuing hard Softmax output distributions by growing ever larger weights. From another perspective, label smoothing simulates the scenario where the training
labels are not perfect.
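For a vocabulary of size |y|, the smoothed target described above can be constructed as below; the smoothing factor ξ and the target index are illustrative values.

```python
import numpy as np

def smooth_labels(target_id, num_classes, xi=0.1):
    """Label smoothing: the correct class gets 1 - xi and the remaining mass xi
    is spread evenly over the other num_classes - 1 classes."""
    y = np.full(num_classes, xi / (num_classes - 1))
    y[target_id] = 1.0 - xi
    return y

print(smooth_labels(target_id=2, num_classes=5))   # sums to 1
```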
Apart from adding noise to input features and output labels, small and random perturbations to
model parameters should ideally result in a minimal change at the output. One strategy is to
add zero-mean, fixed variance Gaussian noise to the network weights (Jim et al., 1996) during
training, which is shown to improve the performance of RNNs (Graves, 2013). Note that,
for each minibatch, the gradient is computed based on the model parameters with Gaussian
noise added, but the update to the parameters using gradient descent is applied to the original
parameters without Gaussian noise. Weight noise tends to “simplify” neural networks as
noise reduces the precision with which the weights must be described. Simpler networks
are preferred because they normally generalise better (Graves, 2011). Other more complex
strategies such as adaptive weight noise (Graves, 2011) have also been proposed. Furthermore,
weight quantisation (Hubara et al., 2018; Woodland, 1989) can also be regarded as another
form of weight noise that improves the generalisation of neural networks.
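The training scheme with fixed-variance Gaussian weight noise can be sketched as follows, assuming a generic PyTorch model, loss function and optimiser; the noise standard deviation sigma is a placeholder. The key point, as described above, is that the gradient is computed with the noisy parameters while the update is applied to the clean ones.

import torch

def train_step_with_weight_noise(model, loss_fn, batch, optimiser, sigma=0.01):
    """One update with zero-mean Gaussian weight noise: gradients are computed
    on noisy parameters, but the update is applied to the original parameters."""
    clean = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))   # perturb the weights
    optimiser.zero_grad()
    loss = loss_fn(model, batch)                  # forward pass on noisy weights
    loss.backward()                               # gradients w.r.t. noisy weights
    with torch.no_grad():
        for p, c in zip(model.parameters(), clean):
            p.copy_(c)                            # restore the clean weights
    optimiser.step()                              # update clean weights with those gradients
    return loss.item()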
2.4 Summary
This chapter describes the basic building blocks used for the construction of RNN-based
and Transformer-based AED models, including MLP, CNN, RNN and attention mechanism.
Various optimisation procedures and techniques are then introduced for training AED models
and other deep neural networks. Many of these terms will be referred to frequently throughout the thesis.
Chapter 3
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the process of converting the acoustic speech signal
into its corresponding textual representation. It is considered to be a challenging task for
real-life applications due to both inter-speaker variability (including physiological differences
and pronunciation differences of accents or dialects) and intra-speaker variability (including
different styles of speech). Other factors such as channel distortion, reverberation and back-
ground noise pose more difficulties for achieving effective speech to text conversion (Jurafsky
and Martin, 2008).
There are two major modern frameworks for ASR. The first one is based on the noisy
Source-Channel Model (SCM) (Jelinek, 1997; MacKay, 2003; Shannon, 1948) where the
acoustic sequence is modelled using Hidden Markov Models (HMMs) from a generative
perspective. This framework has multiple modules such as an acoustic model, a language
model, a pronunciation model and a decoder. The advantage of the framework is that multiple
sources of structured knowledge such as phonetic and linguistic information can be easily
integrated. The modular design allows each component to be optimised separately. Because the
decoding procedure is generally frame-synchronous, the framework is able to process speech
data in a streaming fashion, which is desirable in various applications. In this chapter, the
SCM-based ASR framework is first described. Some similar frame-synchronous models, such
as Connectionist Temporal Classification (CTC) and neural transducers, are also introduced.
Another type of framework is based on Attention-Based Encoder-Decoder (AED) models.
Unlike the generative modelling approach with HMMs, the speech recognition task is formu-
lated from a discriminative perspective. An AED model directly learns the probability of the
transcription sequence given the input acoustic sequence (Chorowski et al., 2015). As only
a single model needs to be optimised, constructing an AED-based ASR system is, in theory,
much simpler than an SCM-based system. In this chapter, details about the model architecture,
training and decoding procedures of AED models will be given.
Both SCM-based and AED-based frameworks rely on fully transcribed speech data. How-
ever, labelled training data is typically limited and expensive to acquire. In order to leverage
a large amount of unlabelled data and the modelling power of large Deep Neural Networks
(DNNs), self-supervised pre-training enables the model to learn meaningful feature represen-
tations without supervision (van den Oord et al., 2018). Pre-trained models can be used to
initialise the acoustic model in an SCM-based system or the encoder in an AED-based system
for subsequent fine-tuning to improve recognition performance (Baevski et al., 2020b). In the
last part of the chapter, some self-supervised pre-training approaches will be described.
where p(O|w), estimated by an AM, is the likelihood of generating the observation sequence
through the channel; P (w), approximated by an LM, describes the underlying probabilistic dis-
tribution of the source. In this way, an SCM-based ASR system consists of several independent
modules shown in Figure 3.1.
[Figure 3.1: an SCM-based ASR system. The word sequence w in the speaker's mind is converted by speech production into a speech signal, pre-processed into the observation sequence O, and decoded into the hypothesis ŵ using an acoustic model, a language model and a lexicon.]
To extract FBANK features, the windowed signal is processed by the short-time Fourier
transform to obtain the spectrum. Based on the human perception that larger frequency intervals
are required to produce equal pitch increments at higher frequencies, the Mel-scale is used to
adjust to this phenomenon by warping the normal frequency f to f_mel:
f_mel = 1127 \ln\left(1 + \frac{f}{700}\right).    (3.4)
A series of overlapping triangular band-pass filters that are linearly spaced across the Mel-scale
are applied to the linear power spectrum. The log of the value produced by each filter is
concatenated into a vector, which is referred to as FBANK features.
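A simplified NumPy sketch of this FBANK extraction is given below, assuming a power spectrum has already been obtained from the short-time Fourier transform; the filter placement and edge handling are deliberately minimal and differ in detail from standard toolkits such as HTK or Kaldi.

import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)      # Equation (3.4)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def fbank(power_spec, sample_rate=16000, n_filters=40):
    """power_spec: (T, n_bins) linear power spectrum from the STFT."""
    n_bins = power_spec.shape[1]
    # filter boundary frequencies, equally spaced on the Mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_bins - 1) * hz_points / (sample_rate / 2)).astype(int)
    filters = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        filters[i - 1, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)   # rising edge
        filters[i - 1, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)   # falling edge
    return np.log(power_spec @ filters.T + 1e-10)    # (T, n_filters) FBANK features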
FBANK features are found to be highly correlated between dimensions, which is not ideal
for Gaussian Mixture Models (GMMs) with diagonal covariance matrices used in GMM-HMM
based models for ASR (see Section 3.2). To decorrelate the coefficients in FBANK features,
the Discrete Cosine Transform (DCT) is applied on these features to obtain MFCCs. The i-th
dimension of a D_c-dimensional MFCC feature o_i^{MFCC} is computed from the D_f-dimensional FBANK feature o^{FBANK} as
o_i^{MFCC} = \sqrt{\frac{2}{D_f - 1}} \sum_{j=0}^{D_f} o_j^{FBANK} \cos\left(\frac{\pi i}{D_f}\left(j + \frac{1}{2}\right)\right), \quad i = 0, \ldots, D_c - 1.    (3.5)
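A corresponding sketch of the DCT step is shown below; it applies a type-II DCT along the filterbank axis and keeps the first D_c coefficients. The exact normalisation differs slightly between toolkits and the form of Equation (3.5), so this should be read as illustrative only.

import numpy as np

def mfcc_from_fbank(fbank_feats, n_ceps=13):
    """fbank_feats: (T, D_f) log filterbank features; returns (T, n_ceps) MFCCs."""
    T, D_f = fbank_feats.shape
    j = np.arange(D_f)
    i = np.arange(n_ceps)
    # DCT-II basis: cos(pi * i * (j + 0.5) / D_f)
    basis = np.cos(np.pi * np.outer(i, j + 0.5) / D_f)    # (n_ceps, D_f)
    return fbank_feats @ basis.T * np.sqrt(2.0 / D_f)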
The feature vector augmented with first and second order deltas is
o'_t = \begin{bmatrix} o_t \\ \Delta o_t \\ \Delta^2 o_t \end{bmatrix}.    (3.7)
However, the addition of deltas introduces correlation across dimensions which is inconsistent
with GMMs that assumes that features are element-wise independent. Linear discriminant
analysis (Campbell, 1984; Kumar, 1998) can be used to project into a new space where feature
dimensions are uncorrelated. Before undergoing further processing, the cepstrum is normalised
by subtracting the mean and dividing by the standard deviation. The normalisation procedure,
Cepstral Mean and Variance Normalisation (CMVN) (Viikki and Laurila, 1998), can effectively
minimise the channel and noise distortions. Vocal Tract Length Normalisation (VTLN) (Lee
and Rose, 1996) can be further applied to compensate for the differences in vocal tract length
and shape between speakers. Gaussianisation (Liu et al., 2005; Saon et al., 2004) decorrelates
each dimension of the feature vector to reduce the impact on models that have independence
assumptions. Gaussianisation is related to CMVN but additionally normalises higher-order
moments across dimensions.
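A minimal per-utterance CMVN sketch is shown below, assuming feats is a (T, D) array of features; in practice the statistics may instead be accumulated per speaker or over a sliding window.

import numpy as np

def cmvn(feats, eps=1e-8):
    """Cepstral Mean and Variance Normalisation: per-dimension zero mean and
    unit variance, computed here over a single utterance."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps)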
Unlike GMM-HMM systems where the covariance matrix is assumed to be diagonal, neural
networks do not make assumptions about the correlation between input feature dimensions.
Therefore, FBANK can be directly used as input to neural network models. Meanwhile, methods
have been explored to remove the front-end processing stage and let neural networks learn
implicit filters from raw waveforms and perform recognition within a single model (Sainath
et al., 2015; Tüske et al., 2014; von Platen et al., 2019).
Kanthak and Ney, 2002). However, using graphemic units for acoustic modelling poses a greater
challenge for acoustic models, especially for languages such as English with orthographic
irregularities where many different pronunciations correspond to the same grapheme or a single
pronunciation corresponds to multiple different graphemes. For languages that are not based
on the Latin script, the graphemic units can be constructed using unicode (Gales et al., 2015; Li
et al., 2019a).
Table 3.1 shows several different modelling units for the same English word “hello”. For
phonetic units, two pronunciations are available. Note that for subword units using both past
and future contexts, the number of possible combinations is |V|^3, where |V| is the number of
context-independent units. To avoid an explosive increase in the number of classes, decision tree
clustering (Hwang and Huang, 1993; Young et al., 1994) is used to cluster the context-dependent
units into a manageable number of classes.
mono-grapheme                  h  e  l  l  o
tri-grapheme                   sil-h+e  h-e+l  e-l+l  l-l+o  l-o+sil
mono-grapheme with position    h^I  e^M  l^M  l^M  o^F
mono-phone                     hh ax l ow
                               hh eh l ow
triphone                       sil-hh+ax  hh-ax+l  ax-l+ow  l-ow+sil
                               sil-hh+eh  hh-eh+l  eh-l+ow  l-ow+sil
Table 3.1 Different modelling units for the word “hello”. The symbol before ‘-’ is the previous
context and the symbol after ‘+’ is the future context. Superscripts indicate the location of the
modelling unit, where ‘I’, ‘M’ and ‘F’ correspond to the initial, middle and final positions in a
word.
emitted observation vectors. First, it assumes that the observations within an acoustic unit can be segmented into multiple phases, each of which is stationary. Second, the current hidden state only
depends on the previous state. Third, the observation vector only depends on the current hidden
state. In other words, the current feature vector is conditionally independent of all the other
surrounding frames given the current hidden state. Although none of these assumptions holds for real speech signals, they allow HMMs to offer many computationally efficient algorithms for search. Many other aspects of acoustic modelling try to overcome these
strict assumptions by including more context information.
To formally introduce HMMs, let O = [o1 , . . . , oT ] be an observation sequence. For the
five-state left-to-right HMM shown in Figure 3.2, states 1 and 5 are non-emitting states whereas
states 2, 3 and 4 are emitting states.
Figure 3.2 An illustration of the five-state HMM used for speech recognition. The first and last
HMM states are non-emitting.
a_{ij} = P(s_{t+1} = j | s_t = i), \quad \text{where } \sum_{j=1}^{S} a_{ij} = 1 \;\; \forall i = 1, \ldots, S,    (3.8)
where S is the total number of states in the HMM. The emission probability density for o_t at state j is defined as
b_j(o_t) = p(o_t | s_t = j).    (3.9)
For the observation sequence of length T , there must be a corresponding state sequence
s = [s1 , . . . , sT ] (excluding non-emitting states). The hidden state sequence can only be
inferred from the output sequence.
b_j(o_t) = N(o_t; μ_j, Σ_j) = \frac{1}{\sqrt{(2π)^D |Σ_j|}} \exp\left\{ -\frac{1}{2} (o_t - μ_j)^T Σ_j^{-1} (o_t - μ_j) \right\},    (3.10)
where D is the feature dimension and (µj , Σj ) are the mean vector and covariance matrix asso-
ciated with the output distribution of the j-th state in an HMM. As mentioned in Section 3.1.1,
the covariance matrix is usually assumed to be diagonal to save both memory and computation
with limited data. Note that the feature vectors need to be processed such that the dimensions are approximately statistically independent.
In practice, using a single Gaussian distribution to model the output probability function is
a relatively poor approximation. Instead, GMMs can be used (Juang, 1985). Consequently, the
emission probability becomes
b_j(o_t) = \sum_{m=1}^{M_j} c_{jm} N(o_t; μ_{jm}, Σ_{jm}),    (3.11)
where Mj is the number of Gaussian components in the GMM associated with state j and the
scalar cjm is the weight (or prior) of the Gaussian m in state j. For a valid distribution, the
weights for all Gaussian components must be non-negative and must sum to 1. An HMM-based
ASR system that uses GMMs to model output probability distributions is called a GMM-HMM
system.
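A small sketch of evaluating log b_j(o_t) for a single diagonal-covariance GMM state, as in Equation (3.11), is given below; the component weights, means and variances are assumed to be given, and the log-sum-exp trick is used for numerical stability.

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(o_t, weights, means, covars):
    """log b_j(o_t) for a diagonal-covariance GMM.

    weights: (M,), means: (M, D), covars: (M, D) diagonal variances."""
    D = o_t.shape[0]
    diff = o_t - means                                  # (M, D)
    log_dets = np.sum(np.log(covars), axis=1)           # log |Sigma_m| for diagonal covariances
    mahal = np.sum(diff * diff / covars, axis=1)        # Mahalanobis terms
    log_gauss = -0.5 * (D * np.log(2 * np.pi) + log_dets + mahal)
    return logsumexp(np.log(weights) + log_gauss)       # log-sum-exp over components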
In contrast, the output density can also be approximated by DNNs (Bourlard et al., 1994). Different from the generative GMMs, where p(o_t|j) is directly modelled by multiple multivariate Gaussian distributions, a neural network is a discriminative model that estimates the state posterior probabilities P(j|o_t) (Bishop, 2006). According to Bayes' rule,
p(o_t | j) = \frac{P(j | o_t)\, p(o_t)}{P(j)} \propto \frac{P(j | o_t)}{P(j)}.    (3.12)
For each frame, the neural network takes the feature vector as an input and produces an output
posterior probability over all HMM states where the denominator P (j) is the prior probability
of the state j. It is worth noting that the marginal distribution p(ot ) can be safely ignored as it
does not depend on a particular state. The DNN-HMM system is also referred to as the hybrid
approach (Bourlard et al., 1994). This approach can be regarded as discriminative training of a
generative model. However, in order to train such neural networks for state classification, the
class label for each frame is often required, i.e. frame-level alignment. The alignment can be
obtained from a pre-trained GMM-HMM system by running the Viterbi algorithm (Viterbi,
1967) with a composite HMM for each utterance in the training data that models the reference
word sequence. The acoustic model can also be trained without frame-level alignment by
using sequence-level optimisation, such as Lattice Free (LF)-Maximum Mutual Information
(MMI) (Povey et al., 2016).
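Returning to the hybrid approach, the conversion from DNN state posteriors to the scaled log-likelihoods used during decoding can be sketched as follows; prior_scale is a hypothetical smoothing factor, and the state priors are assumed to be estimated from the training alignments.

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_state_priors, prior_scale=1.0):
    """Hybrid-system pseudo-likelihoods: log p(o_t|j) is approximated, up to the
    state-independent term log p(o_t), by log P(j|o_t) - prior_scale * log P(j)."""
    return log_posteriors - prior_scale * log_state_priors

# log_posteriors: (T, n_states) from the DNN's log-Softmax output;
# log_state_priors: (n_states,) from state frequencies in the alignments.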
Another type of acoustic modelling approach is called a Tandem system (Grézl et al., 2007;
Hermansky et al., 2000). The neural network is used as a feature extractor. The DNN-based
feature and the acoustic feature obtained from the front-end processing (e.g. MFCC or PLP) are
concatenated, and the combined features are subsequently modelled by a GMM-HMM system. Tandem systems
can make use of the adaptation techniques developed for GMM-HMM systems while exploiting
the representation power from the neural networks.
where Sw is a set of all possible state sequences based on the hypothesis w. The model
parameters θ include state transition probabilities and state output distributions. Simply
summing over all possible state sequences is computationally intractable as the number of
possible sequences grows exponentially with the number of time steps. The forward-backward
algorithm uses the conditional independence property of HMMs and exploits the idea of
dynamic programming to allow the computational cost of the likelihood term to drop from O(S^T) to O(ST) (Huang et al., 2001).
As the name suggests, the forward-backward algorithm factorises the computation into
forward and backward probabilities.
p(O|w; θ) = \sum_s p(o_{1:t} | o_{t+1:T}, s)\, p(o_{t+1:T}, s)    (3.14)
= \sum_s p(o_{1:t} | s_{1:t})\, p(o_{t+1:T}, s_{t+1:T} | s_{1:t})\, p(s_{1:t})    (3.15)
= \sum_s p(o_{1:t}, s_{1:t})\, p(o_{t+1:T}, s_{t+1:T} | s_t)    (3.16)
= \sum_i p(o_{1:t}, s_t = i)\, p(o_{t+1:T} | s_t = i)    (3.17)
= \sum_i α_t(i)\, β_t(i),    (3.18)
where the assumption that future states and future observations are independent from past states
and past observations given the current hidden state is used. The forward probability α and
backward probability β can be computed recursively:
α_t(i) = p(o_{1:t}, s_t = i) = \sum_j α_{t-1}(j)\, a_{ji}\, b_i(o_t),    (3.19)
β_t(i) = p(o_{t+1:T} | s_t = i) = \sum_j β_{t+1}(j)\, a_{ij}\, b_j(o_{t+1}).    (3.20)
More rigorously, for an HMM with non-emitting entry and exit states 1 and S, the forward
probability with initial conditions can be written as
α_t(i) = \begin{cases} 1 & i = 1 \text{ and } t = 0, \\ 0 & i = 1 \text{ and } 1 < t \leq T, \\ 0 & 1 < i \leq S \text{ and } t = 0, \\ \sum_{j=2}^{S-1} α_{t-1}(j)\, a_{ji}\, b_i(o_t) & 1 < i < S \text{ and } 1 \leq t \leq T. \end{cases}    (3.21)
The corresponding terminating conditions are
α_T(S) = \sum_{j=2}^{S-1} α_T(j)\, a_{jS},    (3.23)
β_0(1) = \sum_{j=2}^{S-1} β_1(j)\, a_{1j}\, b_j(o_1),    (3.24)
and both terminating probabilities are equal to the likelihood p(O|w; θ). After computing the
α’s and β’s in both directions, the posterior probability of being in state i at time t, γ_t(i), can be easily computed by
γ_t(i) = P(s_t = i | O, w; θ) = \frac{α_t(i)\, β_t(i)}{p(O|w; θ)}.    (3.25)
The state transition posterior from state i to j at time t, denoted as χ_t(i, j), can be expressed as
χ_t(i, j) = P(s_t = i, s_{t+1} = j | O, w; θ) = \frac{α_t(i)\, a_{ij}\, b_j(o_{t+1})\, β_{t+1}(j)}{p(O|w; θ)}.    (3.26)
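A log-domain sketch of the forward-backward computation is given below; for brevity it assumes a fully emitting HMM with a uniform initial state distribution rather than the non-emitting entry and exit states of Figure 3.2, so the boundary conditions differ from Equations (3.21), (3.23) and (3.24).

import numpy as np
from scipy.special import logsumexp

def forward_backward(log_a, log_b):
    """log_a: (S, S) log transition probabilities a_ij; log_b: (T, S) log emission
    probabilities log b_j(o_t). Returns log alpha, log beta, gamma and log p(O|w)."""
    T, S = log_b.shape
    log_alpha = np.full((T, S), -np.inf)
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_b[0] - np.log(S)            # uniform initial state distribution
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) a_ji b_i(o_t), Equation (3.19)
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_a, axis=0) + log_b[t]
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j beta_{t+1}(j) a_ij b_j(o_{t+1}), Equation (3.20)
        log_beta[t] = logsumexp(log_a + log_b[t + 1] + log_beta[t + 1], axis=1)
    log_like = logsumexp(log_alpha[-1])            # log p(O|w)
    log_gamma = log_alpha + log_beta - log_like    # Equation (3.25)
    return log_alpha, log_beta, np.exp(log_gamma), log_like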
the final decision rule (e.g. Minimum Bayes Risk (MBR)) (Bishop, 2006). Both of the training
schemes are described in the following sections.
where u is the index of utterances in the training set, which is omitted for brevity below. For
GMM-HMM systems, the Baum-Welch algorithm (Baum and Eagon, 1967), which is a special
case of the Expectation Maximisation (EM) algorithm (Dempster et al., 1977), is often used to
obtain the MLE parameters.
The auxiliary function for this EM algorithm is Q(θ; θk ), which defines the lower bound to
the log-likelihood function. By iteratively maximising the lower bound with respect to θ, a
new set of parameters that guarantees a non-decreasing likelihood will be obtained. To derive
the auxiliary function used for EM, Jensen’s inequality is first used for the log-likelihood
\log p(O|w; θ) = \log\left( \sum_s q(s)\, \frac{p(O, s|w; θ)}{q(s)} \right)    (3.30)
\geq \sum_s q(s) \log p(O, s|w; θ) - \sum_s q(s) \log q(s),    (3.31)
where q(s) could be any valid distribution w.r.t. the state sequence s. Please note that the
equality holds when the distribution is exactly the posterior distribution of state sequences
p(s|O, w; θ). During the k-th EM iteration where the current parameters are θk , the sequence
posterior probability distribution given by the current set of parameters p(s|O, w; θk ) is used
for the arbitrary distribution q(s). Consequently, Equation (3.31) becomes
\log p(O|w; θ) \geq \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ) - \sum_s p(s|O, w; θ_k) \log p(s|O, w; θ_k),    (3.32)
and the equality holds for the likelihood based on the current parameters
\log p(O|w; θ_k) = \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ_k) - \sum_s p(s|O, w; θ_k) \log p(s|O, w; θ_k).    (3.33)
where the auxiliary function Q(θ; θ_k) = \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ). Maximising the auxiliary function Q(θ; θ_k) w.r.t. θ is guaranteed not to decrease the log-likelihood. During
the E-step in the EM algorithm, the posterior probabilities are computed using the current set
of parameters using the forward-backward algorithm. The M-step then maximises the auxiliary
function and updates the parameters.
For HMMs, the auxiliary function can be written as
Q(θ; θ_k) = \sum_{i,t} γ_t(i) \log b_i(o_t) + \sum_{i,j,t} χ_t(i,j) \log a_{ij},    (3.35)
where γt (i) is the state occupancy posterior probability as in Equation (3.25) and χt (i, j) is
the state pairwise posterior occupancy as in Equation (3.26). The optimal state transition
probability becomes
â_{ij} = \frac{\sum_{t=1}^{T} χ_t(i,j)}{\sum_{t=1}^{T} γ_t(i)}.    (3.36)
For GMMs as output distributions, the posterior probability that the observation o_t is generated by component m of state j can be expressed analogously using the forward and backward probabilities, which leads to the update formulae for the component weights, means and covariances.
The equations above are derived for a single utterance for simplicity. In practice, GMM-HMM
parameters are updated when the statistics have been accumulated over all utterances in the
training set.
If the state posterior distribution for each input frame is assumed to be concentrated solely
around its mode, i.e. the probability mass on one state is one and zero for all other states, then
the CML criterion can be written as frame-level Cross Entropy (CE) minimisation
L_{CE} = -\sum_u \log \prod_t P(s_t^{(u)} | O^{(u)}; θ)    (3.42)
= -\sum_u \sum_t \log P(s_t^{(u)} | O^{(u)}; θ),    (3.43)
where the state-level alignment s = [s_1, . . . , s_T] is obtained from an existing model, e.g. by running
the Viterbi algorithm with a GMM-HMM system. This alignment is also called a hard alignment
as each frame is assigned to an HMM state with absolute certainty. Therefore, Equation (3.43)
is equivalent to the cross-entropy between frame posterior probability given by the model and
the one-hot frame label. To obtain the frame posterior probabilities, a neural network frame
classifier can be trained by minimising the CE criterion over the entire dataset.
For the DNN-HMM hybrid approach, a DNN trained by gradient-based optimisation
methods performs a state classification task based on the hard alignment. It is worth noting
that unlike a GMM-HMM system where the Baum-Welch algorithm updates both GMMs and
HMMs jointly, the DNN in the hybrid system is normally trained separately by the frame-level
CE criterion since the HMM transition probabilities make negligible difference to the system
performance (Dahl et al., 2011).
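A minimal sketch of one frame-level CE update for such a DNN frame classifier is given below, assuming a generic PyTorch model and an existing forced alignment; batching over utterances is omitted.

import torch
import torch.nn.functional as F

def ce_step(dnn, feats, alignment, optimiser):
    """feats: (T, D) acoustic features; alignment: (T,) hard state labels from a
    GMM-HMM forced alignment, as in Equation (3.43)."""
    optimiser.zero_grad()
    logits = dnn(feats)                           # (T, n_states)
    loss = F.cross_entropy(logits, alignment)     # averaged over frames
    loss.backward()
    optimiser.step()
    return loss.item()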
For ASR, one possible objective would be to minimise the conditional entropy of the word
sequence w given the observation sequence O. For a single utterance,
H(w|O) = -\sum_w \int p(w, O) \log P(w|O)\, dO    (3.44)
= -\sum_w P(w) \log P(w) - \sum_w \int p(w, O) \log \frac{p(w, O)}{P(w)\, p(O)}\, dO    (3.45)
= H(P(w)) - D_{KL}(p(w, O) \| p(O) P(w)),    (3.46)
where the second term in Equation (3.46) is the mutual information between w and O. Assuming that the language model is fixed, it is clear that minimising the conditional entropy is equivalent to maximising the mutual information (MacKay, 2003). As directly computing p(w, O) is infeasible, approximating the expectation by a single representative pair of w and O leads to the following simplification
D_{KL}(p(w, O) \| p(O) P(w)) ≈ \log \frac{p(w, O)}{P(w)\, p(O)}    (3.47)
= \log \frac{p(O|w)}{\sum_{w'} p(O|w') P(w')}.    (3.48)
Therefore, with the fixed language model P (w), maximising mutual information is equivalent
to maximising
\log \frac{p(O|w)\, P(w)}{\sum_{w'} p(O|w') P(w')}.    (3.49)
By applying Bayes’ rule, Equation (3.49) is equivalent to the CML criterion. In other words, for MMI training (Bahl et al., 1986; Valtchev et al., 1997; Woodland and Povey, 2002), the posterior probability of the correct hypothesis for each utterance is maximised:
\sum_u \log \frac{p(O^{(u)}|w^{(u)}; θ)^κ\, P(w^{(u)})}{\sum_{h \in H(u)} p(O^{(u)}|h; θ)^κ\, P(h)},
where H(u) is a set of hypothesised sequences for a given utterance u and κ is the acoustic
scaling factor to match the range of acoustic and language model scores. κ is normally set to the
inverse of the language model scaling factor used for decoding and the denominator is estimated
through a lattice, both of which will be detailed in Section 3.4 later. To estimate model parame-
ters based on the MMI criterion, the extended Baum-Welch algorithm is used (Gopalakrishnan
et al., 1989; Normandin, 1991). More recently, Povey et al. (2016) proposed a practical method
called LF-MMI to perform MMI-based sequence training of acoustic models without explicitly
generating lattices. LF-MMI represents both the numerator and denominator lattices in the
form of Weighted Finite State Transducers (WFSTs) and the forward-backward computation is
parallelised on a Graphics Processing Unit (GPU). A 4-gram phone LM is used for the WFST composition instead of a word LM for efficiency. With a reduced frame rate, a modified HMM structure, and
various normalisation and regularisation tricks, LF-MMI allows the acoustic model to be trained
faster and often leads to better performance than standard models (Povey et al., 2016).
Although the MMI criterion integrates discriminative ability during training, it does not directly
optimise the final evaluation metric for speech recognition – the Word Error Rate (WER) (see
Section 3.4.4). Unlike MMI, the MBR criterion takes into account the risk associated with each hypothesis. In other words, MBR is a class of criteria that considers a general error metric R(h, w) between the hypothesis and the reference sequence, by minimising the expected risk
\sum_u \sum_{h \in H(u)} P(h|O^{(u)}; θ)\, R(h, w^{(u)}).
The design of the risk function is important for MBR training. At the utterance level, if
we define the risk to be 0 for the correct hypothesis and 1 otherwise, MBR becomes equivalent to MMI. The risk function R can also be defined at the word level to reflect WER
directly (Povey and Woodland, 2002). However, due to the same data sparsity issue as selecting
the basic modelling units in Section 3.1.2, phone and HMM state-based loss functions, such
as the Minimum Phone Error (MPE) criterion (Povey, 2003; Povey and Woodland, 2002), are
more widely used. Models trained using the MPE criterion are usually interpolated with the
MLE or MMI criterion to avoid over-fitting (Povey and Woodland, 2002).
One may argue that, with sufficient data and modelling capacity, a single model can learn or extract speaker-invariant features. However, this is rarely the case for practical deployment of speech recognition services, as unseen classes of speakers always exist, e.g. new accents. One way to ap-
proach speaker variability is to have speaker-dependent models. However, the amount of speech
data from a single speaker is normally very limited. As a viable solution, a speaker-adapted
system (Huang and Lee, 1993) is usually used where a speaker-independent acoustic model is
adapted on a small amount of speaker-specific data, either in a supervised or an unsupervised
fashion. Speaker Adaptive Training (SAT) (Anastasakos et al., 1996) is normally a two-pass
training procedure where the first pass is used to estimate the speaker-related parameters, and
the second pass uses these parameters to update the speaker-independent model. If the correct transcription is available for some utterances of the unseen speaker, this is called
supervised adaptation. Otherwise, it is unsupervised since only the recognition hypothesis is
available for adaptation.
For GMM-HMM systems, many techniques have been developed for speaker adaptation
including linear transform-based adaptation (Gales, 1998; Gales and Woodland, 1996; Leggetter
and Woodland, 1995), MAP adaptation (Gauvain and Lee, 1994), and speaker cluster-based
adaptation (Gales, 2000). Maximum Likelihood Linear Regression (MLLR) is a speaker
adaptation technique that uses linear transforms to adapt the means (Leggetter and Woodland,
1995) and variances (Gales, 1998; Gales and Woodland, 1996) of GMM components. A
commonly used version of MLLR, called Constrained Maximum Likelihood Linear Regression
(CMLLR), constrains the mean and variance transformation matrices of a Gaussian component
to be identical (Digalakis et al., 1995; Gales, 1998). When only a single transform is used per
speaker, CMLLR is equivalent to feature space linear transformation, so that speaker adaptation
can be applied without changing the model parameters (Gales, 1998). Therefore, this method
is also called Feature-Space Maximum Likelihood Linear Regression (FMLLR). The EM
algorithm can also be used to estimate CMLLR transforms that maximise the likelihood of
generating the speaker-specific data.
In addition to test-time adaptation, speaker adaptation can also be applied during training,
i.e. SAT (Anastasakos et al., 1996; Gales, 1998). The SAT approach first estimates the parameters
of the canonical speaker-independent model. Then the CMLLR transforms can be estimated
for every speaker in the training set. The new speaker-independent model can be trained using
these transformed features. During recognition, the speaker-independent model can produce
initial hypotheses, which are used for alignment and estimation of the CMLLR transforms for
the speakers in the test set. Then using these transforms, a second-pass recognition is performed
to yield the final results. However, if the adaptation data has associated transcriptions, the
CMLLR transforms can be estimated directly without relying on potentially erroneous first-pass recognition results.
For DNN-HMM systems, one similar alternative to CMLLR is to have a linear input
layer before passing the transformed feature to the speaker-independent network (Li and
Sim, 2010; Seide et al., 2011). During training, it can be set to an identity matrix or can
be trained with the rest of the network. During testing where there are unseen speakers,
the linear layer needs to be trained separately on either recognition results from an existing
system or some speaker enrolment transcription. Instead of having an extra linear layer, the
DNN can have speaker-specific parameterised activation functions (Siniscalchi et al., 2010;
Swietojanski and Renals, 2014; Zhang and Woodland, 2016), e.g. parameterised ReLU (see
Section 2.1.5). Activation function parameters are chosen because far fewer speaker-dependent parameters need to be learned than when adapting the model weights, which mitigates the data sparsity issue. This approach is called Learning Hidden Unit Contributions (LHUC) (Swietojanski
and Renals, 2014). Alternatively, applying regularisation when adapting weights of a DNN
acoustic model, e.g. minimising the Kullback–Leibler (KL)-divergence between the speaker-
independent output distribution and the speaker-adapted output distribution, has been shown to be effective (Yu et al., 2013). All the above techniques for DNNs can be used for test-time adaptation. However, ASR systems usually benefit the most if SAT is also used to minimise the mismatch between training and testing conditions. Also, some of these methods can be used together to yield the
best performance by exploiting complementarity between different methods, e.g. CMLLR and
LHUC for DNN-HMM systems (Swietojanski and Renals, 2014).
Apart from these two-pass speaker adaptation approaches, one-pass methods are also available. The DNN input features can be augmented with speaker-related information directly during training. If the extra feature is obtained in an unsupervised fashion (e.g. an i-vector based on
factor analysis on speech features (Dehak et al., 2010)), or from another network or system
(e.g. speaker code (Abdel-Hamid and Jiang, 2013)), then the first pass is no longer necessary.
Adversarial training is also an approach related to speaker adaptation that explicitly tries to
de-correlate the acoustic model from the speaker information (Meng et al., 2018). The single
neural network is trained on both acoustic targets and speaker targets simultaneously, but the
gradient from the speaker branch is reversed to remove the speaker information in the acoustic
model.
where log p(O|h; θ) is the log-likelihood of the observation O for the hypothesis h, or the
acoustic score given by the AM, and log P (h) is the prior probability of the hypothesis h,
or the language score given by the LM. The recognition procedure exploits the conditional
independence structure of HMM to efficiently search through a large number of possible paths
and find the corresponding word sequence with the highest overall score.
P(w) = P(w_1) \prod_{k=2}^{K} P(w_k | w_{k-1}, \ldots, w_1).    (3.55)
A good language model can generally help improve the performance of the ASR system
regardless of the model architecture and the criterion used for the acoustic model training. In
this section, two major types of LM are discussed. The n-gram language model (Jelinek, 1991;
Manning et al., 1999) is purely statistical and non-parametric, whereas the Neural Network
Language Model (NNLM) (Bengio et al., 2003; Mikolov et al., 2010) is a neural network
trained to predict the next word given the history word sequence. The SCM framework allows
the AM and LM to be decoupled. Since training either an n-gram LM or an NNLM is much less
computationally expensive than the AM for a certain amount of speech data, and furthermore,
text-only corpora in a similar domain are relatively cheap to collect, the LM can exploit large
amounts of data in addition to the corresponding speech audio transcriptions to alleviate the
data sparsity issue at the word level and avoid bias within the training data that impedes
generalisation. Note that for LM estimation, the start and end of sentences are important, which
are often treated as separate tokens.
As shown in Equation (3.55), the size of the distribution table grows exponentially w.r.t. the length of the sequence, and it becomes nearly impossible to obtain a good estimate even for short sequences due to the vocabulary size of real use cases. With a limited amount of
training data, some degree of conditional independence must be imposed for P (w) to be
computationally feasible. The simplest form of a language model is a uni-gram model, where
the probability of a word in a sequence is independent of its context and is estimated by its
frequency count in the training set. However, the order of words in a sequence becomes
irrelevant due to the strong independence assumption. By incorporating history information of
a word, an n-gram LM could be obtained by considering the conditional distribution of a word
given the previous n − 1 words, i.e.
P(w) = P(w_1) \prod_{k=2}^{K} P(w_k | w_{k-1}, \ldots, w_{k-n+1}).    (3.57)
Obtaining a robust and unbiased estimation of the conditional probabilities requires the
training corpus to be large enough to cover all possible n-grams. However, this is nearly
impossible for a large vocabulary system where a large majority of these n-grams are not
present in the training data due to the huge number of n-grams and the data distribution
itself (Chen and Goodman, 1999). Smoothing techniques are therefore required. To avoid the problem where n-grams unseen in the training data would have zero probability, a certain amount of the overall probability mass, controlled by a discounting factor, is allocated to these unseen
3.4 Language Models and Decoding for Source-Channel Models 55
cases, e.g. Katz smoothing (Katz, 1987), absolute discounting (Ney et al., 1994), Kneser-
Ney smoothing (Kneser and Ney, 1995). A back-off scheme is normally used for the above
smoothing techniques, which allows the unseen n-grams to have a certain amount of probability
mass according to lower-order n-gram distributions. Another widely adopted method is to
interpolate high-order LMs with lower-order LMs or to interpolate multiple LMs trained on
different sources of data. The weights for linear interpolation can be adjusted based on a
validation set.
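A small sketch of linear interpolation between two LMs' next-word distributions, together with a grid search over the interpolation weight on a validation set, is given below; the LMs are represented here simply as dictionaries of probabilities, and score_fn is a hypothetical perplexity-style scoring function supplied by the user.

def interpolate_lms(lm_probs_a, lm_probs_b, lam):
    """Linear interpolation of two LMs' next-word probabilities."""
    return {w: lam * lm_probs_a.get(w, 0.0) + (1 - lam) * lm_probs_b.get(w, 0.0)
            for w in set(lm_probs_a) | set(lm_probs_b)}

def tune_weight(dev_sentences, score_fn, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the interpolation weight that minimises a user-supplied score
    (e.g. total negative log-probability) on a validation set."""
    return min(grid, key=lambda lam: sum(score_fn(s, lam) for s in dev_sentences))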
Instead of using frequency counts statistics in the text corpus to estimate the LM, the neural
network can be trained for word prediction which can then be used as an LM (Bengio et al.,
2003). The advantage of the NNLM is that it does not give zero probability mass to any
predicted word in the vocabulary due to the Softmax output layer over the vocabulary and the
non-linear functions can extract word representations with semantic and contextual information,
i.e. word embeddings. Similar to standard n-gram LMs, Multilayer Perceptrons (MLPs) or
Convolutional Neural Networks (CNNs) can be used as LMs with n − 1 history words. A more
powerful and widely used class of NNLMs is Recurrent Neural Network (RNN)-based (Mikolov
et al., 2010). A recurrent model such as the Long Short-Term Memory (LSTM) can model long
dependencies in Equation (3.55) by representing word histories in hidden states with gating
functions. Self-attention based models such as Transformer LMs have become a major research
direction that to model the relationship between words across longer ranges due to the lack
of recurrence (Dai et al., 2019; Irie et al., 2019b), and large-scale self-attention based models
pre-trained on a very large amount of data have reached state-of-the-art performance on various
language-related tasks (Devlin et al., 2018; Radford and Narasimhan, 2018; Radford et al.,
2019). Other variants of NNLMs that exploit future information have also been explored (Chen
et al., 2017). As the normalisation term in the Softmax function over a very large vocabulary
size can be computationally problematic for NNLMs, noise contrastive estimation can be used
for training and inference of an NNLM without needing the Softmax normalisation term (Chen
et al., 2015a; Gutmann and Hyvärinen, 2010). Specifically for ASR, the information from
previous utterances can be utilised during LM training and decoding (Irie et al., 2019c; Sun
et al., 2021b). Similar to AM adaptation mentioned in Section 3.3.3, one advantage of NNLMs
is that they can adapt to input data in a flexible manner, including augmenting the input with a
topic vector (Chen et al., 2015b), adapting LM parameters (Gangireddy et al., 2016; Park et al.,
2010) and learning to bias towards recent histories (Li et al., 2018).
At the end of the utterance, the best path can be traced backwards through the recursion.
One practical issue for decoding an SCM-based system is the difference in dynamic range
between the acoustic scores and language scores. The acoustic scores have a much wider
range due to underestimation of likelihood originating from HMM assumptions (Woodland and
Povey, 2002). To offset this discrepancy, another decoding parameter ψ named the grammar
scaling factor is often used to scale up the language scores. Another practice is to set a word
insertion penalty ω (usually negative) to limit the degree of inserting words. Putting them
together, the search criterion used in practice is
ŵ = \arg\max_w \left\{ \log p(O|w; θ) + ψ \log P(w) + ω |w| \right\},
where |w| denotes the number of words in the hypothesis w.
[Figure 3.3a: an example word lattice, in which arcs carry word labels (e.g. <s>, HI, HOW, <hes>, </s>, !NULL), scores and timestamps.]
Lattices can be pruned using a fixed beam width to discard arcs whose likelihoods fall below a certain threshold relative to the best path, or by restricting the number of unique arcs within a time interval, as many arcs represent the same word with similar likelihoods but slightly different timestamps (Woodland et al., 1995). WFSTs can also be used equivalently to represent lattices (Povey et al., 2012).
Confusion networks are another dense representation for the most likely hypotheses in
lattices (Mangu et al., 2000). They are formed by grouping lattice arcs. The confusion network is also known as a “sausage” because of its constraint that all paths in a confusion network pass through all nodes, as shown in Figure 3.3b. The arcs in a confusion network correspond to words, some of which are null arcs that represent skips introduced to satisfy this constraint. The score associated with each arc is the log posterior probability of the corresponding word. In terms of the information stored, confusion networks discard a large number of low-likelihood arcs from the lattice. However, confusion networks also create some new hypothesis sequences which are not present in the lattice as a result of imposing the
constraint.
3.4.4 Evaluation
The most common evaluation metric for ASR is WER, which measures the percentage of errors
in the hypotheses when compared to the reference transcriptions. There are three types of
errors: a substitution error (S) where a word in the reference is misrecognised as another word;
an insertion error (I) where a word is inserted; and a deletion error (D) where a word in the
reference is missing in the hypothesis. To define these errors unambiguously for two sequences, the alignment between the hypothesis and the reference is chosen to be the one that minimises the WER, defined as
WER = \frac{S + D + I}{N} × 100%,    (3.65)
where N is the total number of words in the reference. For languages that have no word
boundaries in their written form, e.g. Chinese and Japanese, a similar metric evaluated at the
character level called the Character Error Rate (CER) is often used. Note that both WER
and CER can be above 100% because of unbounded insertion errors. At a coarser level, the
Sentence Error Rate (SentER) which is the percentage of utterances that contain at least one
error, is sometimes used.
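The WER computation of Equation (3.65) can be sketched with a standard Levenshtein alignment over words, as below; this minimal version returns only the overall error rate rather than the separate substitution, deletion and insertion counts.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment.
    reference and hypothesis are lists of words."""
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                      # deletions
    for j in range(H + 1):
        d[0][j] = j                      # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[R][H] / max(R, 1)

# e.g. wer("the cat sat".split(), "the cat sat down".split()) -> 33.3 (one insertion)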
= -\sum_u \log \sum_{s' \in B^{-1}(w^{(u)})} \prod_t P(s'_t | O^{(u)}),    (3.67)
Figure 3.4 A CTC model. The model can be either uni-directional or bi-directional. For each
input frame, the model predicts a probability distribution over all symbols, including ∅. et is
the hidden representation corresponding to the input at time t.
where s′ is a sequence of symbols chosen from all possible states (normally the subword
vocabulary) plus the blank symbol ∅, and B is a many-to-one mapping from the symbol
sequence to the word sequence. The mapping first collapses repeated symbols and then removes all blanks, e.g. both (a, ∅, a, a, ∅, b) and (a, a, ∅, a, b, b) are mapped to (a, a, b).
From Equation (3.67), the criterion maximises the conditional likelihood of all possible
paths/alignments for the ground truth word sequence w. The introduction of the blank symbol
∅ is critical for the operation of CTC as it allows flexible alignment. Equation (3.67) also implicitly assumes that the outputs at different time steps are conditionally independent given the internal state of the network. Under this assumption, the sum over all possible alignments can be computed efficiently by the forward-backward algorithm. Minimising Equation (3.67) over the training set is equivalent to maximising the conditional likelihood of the target labelling given the observations.
By comparing MLE of HMMs from Equation (3.13) and CML training of CTC from
Equation (3.67) in terms of state sequences,
HMM:  \sum_s \prod_{t=1}^{T} P(s_{t+1} | s_t)\, p(o_t | s_t),    (3.68)
CTC:  \sum_{s'} \prod_{t=1}^{T} P(s'_t | o_t),    (3.69)
where s is a state sequence for HMMs and s′ is a possible symbol sequence for CTC. It
can be shown that CTC is equivalent to a special instantiation of the 2-state HMM structure
when the state prior P (st ) and the state transition probabilities P (st+1 |st ) are ignored or
assumed equal (Hadian et al., 2018; Zeyer et al., 2017). As illustrated in Figure 3.5, the first
emission state of the HMM is the skippable blank state (∅) with a self-loop and the second one
corresponds to the subword unit. The blank state is shared across all HMMs. One caveat is
that if two consecutive symbols are the same, the blank state in the later HMM is not skippable
because two repetitive symbols must be separated by at least one blank symbol to prevent the
mapping function B from collapsing them into a single symbol.
Figure 3.5 The relationship between HMM and CTC. The white and grey circles are emitting and non-emitting HMM states, and ∅ stands for the CTC blank symbol.
In practice, forcing the transition probabilities P(s_{t+1}|s_t) to 1.0 makes little difference. At test time, the posterior probability of the CTC blank symbol can be penalised
by an extra empirical value (Sak et al., 2015a,b), which can be seen as a rough approximation
of the prior P (st ). As a result, it is reasonable to view CTC as an acoustic modelling method
in the SCM framework, which combines the HMM topology in Figure 3.5, the CML training
criterion and the forward-backward procedure.
To train the model with the CTC loss function, the probabilities of all possible alignments
for the reference sequence have to be added together. As illustrated in Figure 3.6, there can be
an exponential number of possible alignments. For efficient computation, the forward-backward
algorithm is used analogously to Section 3.2 to derive the gradient w.r.t. the model output
distribution at each time step. Note that only the forward or the backward pass is sufficient if
only the total loss is needed.
There are two general decoding strategies for CTC models (Graves et al., 2006). Best
path decoding is the simpler one where the most probable path is considered to yield the most
likely transcription. After removing repeated symbols and blanks, the transcription is obtained.
However, because multiple symbol sequences can correspond to the same transcription, best
path decoding can be sub-optimal. A better yet more complex strategy, called prefix search
decoding, efficiently computes the probability of each partial hypothesis during beam search
by summing over all possible alignments of the prefix using a modified forward-backward
algorithm. In practice, prefix search decoding yields marginally better results as the output
distributions from CTC models are generally very peaky, i.e. the posterior probability of a
single alignment dominates (Graves et al., 2006). However, prefix search is more useful when
decoding with a separate LM (Maas et al., 2014). CTC has been shown to perform well on ASR
tasks when a large amount of training data is available (Amodei et al., 2016; Miao et al., 2016a).
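Best path decoding, together with the collapsing mapping B, can be sketched as follows, assuming per-frame log posteriors from a trained CTC model with the blank symbol at index 0.

import numpy as np

def ctc_best_path(log_probs, blank=0):
    """Best path decoding: take the most probable symbol at each frame,
    collapse repeats, then remove blanks (the mapping B).

    log_probs: (T, V) per-frame log posteriors over symbols."""
    best = np.argmax(log_probs, axis=1)
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:     # collapse repeats, drop blanks
            out.append(int(s))
        prev = s
    return out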
[Figure 3.6: an illustration of the possible CTC alignments for a short label sequence (the symbols C, A, T interleaved with the blank ∅) over six input frames.]
t
<latexit sha1_base64="4mSRiAOC1HPbUsbyd7QN48TyFAA=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeiF48t2FpoQ9lsN+3azSbsToQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IJHCoOt+O4W19Y3NreJ2aWd3b/+gfHjUNnGqGW+xWMa6E1DDpVC8hQIl7ySa0yiQ/CEY3878hyeujYjVPU4S7kd0qEQoGEUrNbFfrrhVdw6ySrycVCBHo1/+6g1ilkZcIZPUmK7nJuhnVKNgkk9LvdTwhLIxHfKupYpG3PjZ/NApObPKgISxtqWQzNXfExmNjJlEge2MKI7MsjcT//O6KYbXfiZUkiJXbLEoTCXBmMy+JgOhOUM5sYQyLeythI2opgxtNiUbgrf88ippX1S9y2qtWavUb/I4inACp3AOHlxBHe6gAS1gwOEZXuHNeXRenHfnY9FacPKZY/gD5/MH4xeNAQ==</latexit>
Figure 3.6 An example of a CTC alignment lattice. The acoustic sequence has 6 input frames
and the symbol sequence is graphemes of the word “CAT”. A darker path refers to an alignment
with higher probability. Nodes with two concentric circles denote the start and end nodes.
Some variants of CTC have been proposed recently to improve its performance (Higuchi et al.,
2020; Lee and Watanabe, 2021).
Figure 3.7 A neural transducer model that consists of an encoder, a prediction network and a
joint network. w0 is normally the start of sentence symbol <s> and wk is the kth modelling
unit in the output sequence. gk is the hidden representation for the history sequence w0:k .
The joint network combines the acoustic representation e_t from the encoder with the history representation g_k from the prediction network, and uses the combined representation to predict the probability distribution over all output symbols P(s′_{t,k} | e_t, g_k), including the null symbol ∅.
Popular encoder architectures for neural transducers include LSTMs and Transformers. The
prediction network is usually a uni-directional LSTM and the joint network can be as simple as
a single fully connected projection layer. By computing the output probabilities for all pairs of
t and k, an output probability lattice can be constructed as in Figure 3.8.
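The following is a minimal PyTorch-style sketch of such a joint network, assuming the encoder produces e with shape (B, T, enc_dim) and the prediction network produces g with shape (B, K+1, pred_dim); the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Combines encoder output e_t and prediction network output g_k and
    produces P(s'_{t,k} | e_t, g_k) over the vocabulary plus the blank symbol."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size_with_blank):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, joint_dim)
        self.proj_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size_with_blank)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim), pred: (B, K+1, pred_dim)
        e = self.proj_enc(enc).unsqueeze(2)    # (B, T, 1, joint_dim)
        g = self.proj_pred(pred).unsqueeze(1)  # (B, 1, K+1, joint_dim)
        joint = torch.tanh(e + g)              # broadcast over the (t, k) lattice
        return self.out(joint).log_softmax(dim=-1)  # (B, T, K+1, V+1)
```

The returned tensor corresponds to the output probability lattice of Figure 3.8: element (t, k) holds the distribution P(s′_{t,k} | e_t, g_k).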
Figure 3.8 An example of a neural transducer alignment lattice. The node at (t, k) represents the
probability of having output the first k non-blank symbols by time step t in the input acoustic
feature sequence. The horizontal transition leaving node (t, k) represents the probability of
∅, whereas the vertical transition represents the probability of outputting symbol wk+1 . The
acoustic sequence has 6 input frames and the symbol sequence is graphemes of the word “CAT”.
A darker path refers to an alignment with higher probability. Nodes with two concentric circles
denote the start and end nodes.
Note that the meaning of ∅ is slightly different from CTC. For CTC models, ∅ sometimes
acts as a separator between two repetitive symbols. For neural transducers, ∅ means outputting
nothing for the current frame. In other words, for neural transducers, ∅ is not a real output
symbol, but a symbolic instruction for the model to move forward to the next frame. Compared
with the CTC alignment lattice in Figure 3.6, the total length of a transducer alignment is
T + K instead of T for CTC, because when the transducer outputs real symbols, no acoustic
frame is consumed and the alignment path travels upwards. Acoustic frames are only consumed
when the special symbol ∅ occurs and the alignment path advances horizontally in time. The
training loss is similar to CTC as in Equation (3.67),
\[
\mathcal{L} = -\sum_{u} \log \sum_{s \in \mathcal{B}^{-1}(w^{(u)})} P(s \mid O^{(u)}),
\]
where B^{-1}(w) is the set of all possible alignments as shown in Figure 3.8. Again, the summation
can be computed efficiently using the forward-backward algorithm. For decoding, although
prefix search is a theoretically better procedure, the standard beam search can often reach
similar results with much less computation (Graves et al., 2013). Neural transducers, with
various modelling improvements, have been widely adopted to process streaming audio for
commercial applications (Li et al., 2021a, 2019b; Rao et al., 2017; Saon et al., 2021).
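As an illustration of how ∅ drives the search, below is a sketch of greedy (beam width one) transducer decoding. The encoder, prediction_net and joint callables and their interfaces (e.g. the prediction network returning an output together with a recurrent state) are assumptions made for illustration and are not taken from any specific toolkit.

```python
import torch

@torch.no_grad()
def greedy_transducer_decode(encoder, prediction_net, joint, feats,
                             blank, sos, max_symbols_per_frame=10):
    """Emit the most likely symbol at each lattice node: blank advances to the
    next acoustic frame, a non-blank symbol updates the prediction network."""
    enc = encoder(feats.unsqueeze(0))                        # (1, T, enc_dim)
    hyp = [sos]
    g, state = prediction_net(torch.tensor([[sos]]), None)   # (1, 1, pred_dim)
    for t in range(enc.size(1)):
        emitted = 0
        while emitted < max_symbols_per_frame:
            logits = joint(enc[:, t:t + 1], g)               # (1, 1, 1, V+1)
            k = int(logits.argmax(dim=-1))
            if k == blank:                                   # consume the frame, move right
                break
            hyp.append(k)                                    # output a symbol, move up
            g, state = prediction_net(torch.tensor([[k]]), state)
            emitted += 1
    return hyp[1:]
```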
Numerous improvements to AED models for ASR have been proposed, including external language
model integration (Shan et al., 2019; Toshniwal et al., 2018), loss function modification (Cui et al., 2018;
Prabhavalkar et al., 2018; Sabour et al., 2019), and data augmentation (Hayashi et al., 2018; Park
et al., 2019). Some experiments have shown that with significantly more data and computing
resources (Chiu et al., 2018; Lüscher et al., 2019; Park et al., 2019), a single attention model is
capable of learning the structured knowledge that is explicitly built into traditional HMM-based systems,
e.g. lexicons and decision trees. AED models also offer the advantage of joint optimisation, which
allows numerous downstream tasks to be naturally connected as a single trainable model, e.g.
speech translation (Bérard et al., 2018; Weiss et al., 2017) and the speech chain, which connects an ASR
model with a text-to-speech model so that each improves the other (Tjandra et al.,
2017a, 2018a). The conditional independence assumption made in the SCM approach is also
eliminated.
This section builds on the architecture of AED models described in Section 2.2 and dis-
cusses modelling and training techniques that are specific to ASR. Compared to the SCM in
Equation (3.3), attention-based models provide an alternative view of addressing the speech
recognition problem. Instead of decomposing the overall model into an acoustic model and a
language model as in Section 3.1, attention-based models compute the posterior distribution
P(w|O) directly by following the chain rule of conditional probability,
\[
P(w|O) = \prod_{k=1}^{K} P(w_k \mid w_{0:k-1}, O),
\]
where w = w_1, \ldots, w_K is the word sequence and w_0 is the start-of-sentence symbol. Compared to Equation (3.3), AED models hold some theoretical
advantages over HMM-based systems. AED models do not rely on the first-order Markovian
assumption, and the acoustic information and the language information are jointly learned
using a single model without making any independence assumptions.
3.6.1 Front-End
Normally the output transcription sequence is much shorter than the input acoustic sequence
using features extracted at a frequency of 100 Hz (see Section 3.1.1). As a result, modelling
very long acoustic sequences without subsampling can be computationally expensive for RNNs
and Transformers and may also cause difficulties for optimisation such as vanishing gradients.
Instead of concatenating multiple adjacent frames as a single input frame, the following two
model-based approaches are commonly used.
For RNNs, a hierarchical structure can be used to skip or combine states from a previous
RNN layer to the next. In Figure 3.9, two types of hierarchical RNN (Chan et al., 2016; Kim
et al., 2017) are illustrated that can achieve a frame rate reduction of 4 times.
Figure 3.9 Unfolded representations of two-layer hierarchical RNNs that reduce the input
sequence lengths by a factor of four. Circles represent the RNN recurrent units.
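A minimal PyTorch sketch of the concatenation-based variant is shown below; stacking two such layers reduces the frame rate by a factor of four. The layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PyramidalLSTMLayer(nn.Module):
    """Concatenates every pair of adjacent frames before the LSTM, halving the
    sequence length (two stacked layers give a 4x frame-rate reduction)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(2 * input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, time, input_dim)
        B, T, D = x.shape
        if T % 2:                            # drop the last frame if the length is odd
            x = x[:, :T - 1]
        x = x.reshape(B, T // 2, 2 * D)      # concatenate adjacent frames
        out, _ = self.lstm(x)                # (batch, time // 2, 2 * hidden_dim)
        return out
```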
Byte-pair encoding iteratively merges the most frequent pairs of symbols in the training text so that the number
of symbols for encoding the text is minimised. Similarly, a unigram LM can be used as an
entropy encoder to iteratively increase the number of word-pieces until the desired vocabulary
size is reached, by maximising the LM probability on the training text corpus (Kudo, 2018; Whittaker and
Woodland, 2000). Both methods preserve the basic set of graphemes to allow open vocabulary
recognition without the Out-of-Vocabulary (OOV) problem. Note that the tokenisation results
are not unique even for the same word sequence as shown in Table 3.2. One advantage of using
a probabilistic LM is that multiple tokenisation results can be generated and ranked by their
probabilities. This leads to a regularisation technique that randomly samples the target subword
sequence during the training of an AED model (Kudo, 2018).
grapheme _ h e l l o _ w o r l d
word-piece _he llo _world
_h e ll o _w or l d
Table 3.2 Word-piece modelling units for “hello world”. The symbol ‘_’ denotes the word
boundary. There can be many different word-piece sequences for the same word sequence.
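The sampling-based regularisation can be illustrated with a toy unigram word-piece LM; the vocabulary and log-probabilities below are made up purely for illustration.

```python
import math
import random

# Toy unigram word-piece LM (log-probabilities); '_' marks a word boundary.
LOGP = {'_he': -2.0, 'llo': -2.3, '_world': -2.5, '_h': -3.0, 'e': -1.5,
        'll': -2.8, 'o': -1.4, '_w': -3.1, 'or': -2.6, 'l': -1.6, 'd': -1.7}

def segmentations(text):
    """Enumerate all ways of covering `text` with in-vocabulary word-pieces."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        piece = text[:i]
        if piece in LOGP:
            for rest in segmentations(text[i:]):
                results.append([piece] + rest)
    return results

def sample_tokenisation(text):
    """Sample a tokenisation with probability proportional to its unigram LM score."""
    segs = segmentations(text)
    weights = [math.exp(sum(LOGP[p] for p in seg)) for seg in segs]
    return random.choices(segs, weights=weights, k=1)[0]

print(sample_tokenisation('_hello_world'))  # e.g. ['_he', 'llo', '_world']
```

In practice, toolkits that implement unigram word-piece models provide such sampling directly, so that a different tokenisation of the same reference can be drawn at each training epoch.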
Figure 3.10 The convolution-augmented Transformer encoder architecture (Gulati et al., 2020). The block combines feed-forward (MLP) modules with half-scaled (×1/2) residual connections and a convolution module built from pointwise convolutions, a 1D depth-wise convolution, a Swish activation, batch norm and layer norm, all connected with residual additions.
For the 1D depth-wise convolution, each channel has its own kernel of size (1, kernel_size), and the depth-wise convolution produces an output of size (batch, in_channels, time), given that the two ends of the time dimension are padded. As introduced in Section 2.1.5, the Swish activation is x · sigmoid(x), where x is a scalar input value (Ramachandran et al., 2018); it is effectively a function gated by itself.
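A PyTorch sketch of the convolution module described above is given below; the GLU gate after the first pointwise convolution follows the published Conformer design (Gulati et al., 2020), and the kernel size is illustrative.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Convolution module of a Conformer block: layer norm, pointwise conv with
    a GLU gate, 1D depth-wise conv, batch norm, Swish, pointwise conv, residual."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.layer_norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def swish(self, x):
        return x * torch.sigmoid(x)              # Swish: x * sigmoid(x)

    def forward(self, x):                        # x: (batch, time, dim)
        y = self.layer_norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.batch_norm(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                             # residual connection
```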
The Conformer encoder can be viewed as another building block for various speech-related
tasks. For example, it can be used in SCM-based systems, including the AM in DNN-HMM
systems, CTC and neural transducers described in Section 3.1.
AED models are normally trained with the Cross-Entropy (CE) loss, conditioning the decoder on the ground-truth history,
\[
\mathcal{L}_{\text{CE}} = \sum_{u} \sum_{k=1}^{K} -\log P\big(w_k^{(u)} \mid w_{k-1}^{(u)}, \ldots, w_0^{(u)}, O^{(u)}\big), \qquad (3.74)
\]
where w_1, \ldots, w_K are the ground-truth subword sequence and w_0 is the start-of-sentence
symbol <s>. However, for recognition, the ground truth history subwords are not available and
the decoded subwords are used instead. Therefore, the following improved training techniques,
scheduled sampling and Minimum Word Error Rate (MWER) training, have been developed
to address either the exposure bias between training and testing (Ranzato et al., 2016) or the
criteria mismatch issue between conditional maximum likelihood and WER (Graves and Jaitly,
2014; Prabhavalkar et al., 2018; Shannon, 2017).
However, the summation over all possible output sequences h is intractable. To approximate
such a summation, instead of using word lattices for MMI and MBR criteria, the n-best
hypotheses from the beam search could be used. Assuming the probability mass over all
possible sequences is concentrated in the top N hypotheses, the loss function becomes
\[
\mathcal{L}_{\text{MWER}} \approx \sum_{u} \sum_{h \in \text{BEAM}(O^{(u)},\, N)} \hat{P}\big(h \mid O^{(u)}\big)\, \mathcal{W}\big(h, w^{(u)}\big),
\]
where N is the beam width of the beam search algorithm BEAM(·, ·) (see Section 3.8.1), \hat{P}(h|O^{(u)}) is the posterior renormalised over the N-best hypotheses, and \mathcal{W}(h, w^{(u)}) is the number of word errors of h with respect to the reference w^{(u)}. In
practice, the sequence-level loss is interpolated with the CE loss for stable training, which is
similar to the F -smoothing (Su et al., 2013) approach used for discriminative sequence training
of SCM models.
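A sketch of the resulting N-best training loss is shown below, assuming the per-hypothesis log-probabilities and word-error counts have already been computed; the mean-error baseline subtraction is a common variance-reduction choice in the literature rather than something specified here, and the interpolation weight is illustrative.

```python
import torch

def mwer_loss(nbest_logprobs, nbest_word_errors, ce_loss, ce_weight=0.1):
    """N-best approximation of the MWER loss, interpolated with the CE loss.

    nbest_logprobs:    (N,) sequence log-probabilities log P(h|O) from beam search.
    nbest_word_errors: (N,) word errors of each hypothesis vs. the reference.
    ce_loss:           the standard cross-entropy loss of Equation (3.74).
    """
    posterior = torch.softmax(nbest_logprobs, dim=-1)   # renormalise over the N-best list
    errors = nbest_word_errors.float()
    errors = errors - errors.mean()                     # subtract the mean as a baseline
    expected_errors = torch.sum(posterior * errors)     # expected (relative) word errors
    return expected_errors + ce_weight * ce_loss
```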
To optimise the model directly towards the final non-differentiable objective function, WER,
reinforcement learning related loss functions (Cui et al., 2018; Sabour et al., 2019) or policy
gradient-based training (Karita et al., 2018; Tjandra et al., 2018b; Zhou et al., 2018b) have also
been proposed.
For attention-based encoder-decoder models, a naive search algorithm without pruning would expand the search
tree exponentially as the sequence grows longer. Beam search, however, preserves at most
N paths within the search graph at each stage before the next search step. A very
small beam size results in a significant number of search errors, while a large beam size
increases the decoding time linearly. In practice, for each path in the
search tree, the hidden state of the decoder is cached so that the history does not need to be recomputed when
the tree expands. A search path terminates when the end-of-sentence symbol is decoded. To
prevent excessively long hypotheses, an upper limit on the number of decoder steps is normally
set. A conservative setting for ASR is to use the number of acoustic feature frames after
subsampling, i.e. the encoder output sequence length, since the acoustic features are designed
to capture fine-grained phonetic units and the number of frames is usually much larger than
the length of the written sequence.
Attention-based models may be prone to deletion and insertion errors. Because the attention
alignment is not explicitly restricted to be contiguous or to cover the entire input sequence,
the decoder may generate the end-of-sentence symbol prematurely without reaching the end of
the utterance, which leads to deletion errors. Also, because no monotonicity is strictly enforced,
the attention mechanism may be trapped in a certain section of the input sequence and generate
the same symbol repetitively, which leads to insertion errors. To address these issues, two
extra terms are included to rank the hypotheses
\[
\hat{h} = \underset{h \in \mathcal{H}}{\arg\max} \left\{ \log P(h|O) + \omega |h| + \eta \sum_{t=1}^{T} \mathbb{1}\!\left[ \sum_{k=1}^{K} a_{kt} > \tau \right] \right\}, \qquad (3.78)
\]
where ω is the insertion penalty (non-positive) that penalises long sequences (Bahdanau et al., 2016;
Chorowski et al., 2015), η is the coverage coefficient and τ is the coverage threshold (Chorowski
and Jaitly, 2017). The third term in Equation (3.78) is called the coverage term (Chorowski and
Jaitly, 2017); it counts the number of frames that have received cumulative attention greater than τ.
The coverage term mitigates the repetition issue because looping over the same few frames does not
increase the coverage. All three decoding parameters (ω, η and τ) need to be tuned on a separate
validation set for the best result.
During decoding, an external LM score can be log-linearly combined with the AED model score, i.e. hypotheses are ranked by \( \log P(h|O) + \psi \log P_{\text{LM}}(h) \), where ψ is the language model weight. As mentioned in Section 3.4.1, LMs can be trained on a
large amount of external text data.
If a separately trained NNLM shares the same vocabulary as the AED model, LM scores
can be directly interpolated at each step of the beam search decoding, which is often referred to
as shallow fusion (Gülçehre et al., 2015). Shallow fusion has been widely adopted because of
its simplicity and effectiveness. Other structured approaches such as deep fusion (Gülçehre
et al., 2015), cold fusion (Sriram et al., 2018) and component fusion (Shan et al., 2019) jointly
train the AED model with an LM in order to find the optimal combination of the two. However,
the additional complexity of these approaches often outweighs their benefits (Toshniwal et al.,
2018).
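One beam search expansion step with shallow fusion can be sketched as follows, assuming the AED decoder and the external LM share the same subword vocabulary; the names and the weight value are illustrative.

```python
import numpy as np

def shallow_fusion_step(asr_logp, lm_logp, psi=0.3, beam=8):
    """One beam-search expansion step with shallow fusion.

    asr_logp: (V,) log P(w_k | w_<k, O) from the AED decoder.
    lm_logp:  (V,) log P(w_k | w_<k) from the external LM (same vocabulary).
    Returns the indices and fused scores of the top `beam` candidate symbols.
    """
    fused = asr_logp + psi * lm_logp
    top = np.argsort(fused)[::-1][:beam]
    return top, fused[top]
```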
Since each output of an AED model depends on the full history of previous outputs, the ASR
model has implicitly learned an internal LM. Although the internal LM cannot be computed
exactly, several approaches have been proposed to estimate it (McDermott et al., 2019; Meng
et al., 2021; Variani et al., 2020b). Then, the external LM can be integrated after the internal LM
is subtracted. This can be particularly useful when there is a mismatch between training and test
domains as the bias of the training text data can be removed (Zeineldeen et al., 2021). As AED
models normally use subwords as output units, it is also possible to incorporate word-level
LMs during the decoding procedure (Hori et al., 2018, 2017b).
Self-Supervised Learning (SSL) with a contrastive objective enables the model to learn representations that encode the underlying shared information
between different parts of the signal (van den Oord et al., 2018). Features learned by SSL
should ideally discard low-level information such as local noise in the speech signal but preserve
the high-level structure that could span many time steps such as phonetic information and
speaker characteristics. Compared to reconstructive loss that aims to model the complex local
features of the signal, contrastive loss encourages the model to learn a contextual representation
that can predict observation signals in the future (van den Oord et al., 2018). The following
approach, Contrastive Predictive Coding (CPC), combines predicting future observations with
a probabilistic contrastive loss.
As shown in Figure 3.11, if the input waveform is split into small segments x and the encoder
transforms the waveform into a feature sequence z, the current contextual representation vt is
the output from a sequence model that summarises the history feature sequence z1:t .
Figure 3.11 CPC framework (van den Oord et al., 2018). The feature encoder normally uses
CNNs with temporal pooling and the context network is a sequence model such as an LSTM
that summarises the history representations.
Given a set of N random samples X = {x1 , . . . , xN }, containing one positive sample from
p(xt+k |vt ) and N − 1 negative samples from the proposal distribution p(xt+k ), the contrastive
loss for predicting the k-th step into the future is
\[
\mathcal{L}^{t,k}_{\text{CPC}} = -\,\mathbb{E}_{X} \left[ \log \frac{ f_k(x_{t+k}, v_t) }{ \sum_{x_j \in X} f_k(x_j, v_t) } \right], \qquad (3.80)
\]
where f_k is an (unnormalised) density ratio, which can be modelled with a simple log-bilinear form,
\[
f_k(x_{t+k}, v_t) = \exp\!\big( z_{t+k}^{\top} W_k v_t \big). \qquad (3.82)
\]
Note that Wk is a linear transform and is different for every offset k. It can be shown that by
minimising LCPC , the mutual information between vt and xt+k is maximised (van den Oord
et al., 2018)
\[
I(x_{t+k}, v_t) = \sum_{x_{t+k}, v_t} p(x_{t+k}, v_t) \log \frac{p(x_{t+k} \mid v_t)}{p(x_{t+k})} \qquad (3.83)
\]
\[
\geq \log(N) - \mathcal{L}^{t,k}_{\text{CPC}}. \qquad (3.84)
\]
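A sketch of the loss in Equation (3.80) for a single offset k is shown below, assuming the positive sample z_{t+k} is stored in the first row of z_samples followed by the N−1 negatives; the shapes and names are illustrative.

```python
import torch

def cpc_loss_step(v_t, z_samples, W_k):
    """InfoNCE-style contrastive loss for one prediction offset k.

    v_t:       (D_c,) contextual representation at time t.
    z_samples: (N, D_z) candidate future features; row 0 is the positive z_{t+k},
               the remaining rows are negatives from the proposal distribution.
    W_k:       (D_z, D_c) linear transform, different for every offset k.
    """
    scores = z_samples @ (W_k @ v_t)            # exponent of f_k(x_j, v_t)
    log_softmax = scores - torch.logsumexp(scores, dim=0)
    return -log_softmax[0]                      # -log of the positive sample's share
```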
After demonstrating that features extracted using CPC can achieve much better results than
handcrafted features such as MFCCs on phone classification and speaker classification (van den
Oord et al., 2018), wav2vec (Schneider et al., 2019) showed that pre-trained features can also
help improve ASR performance. As shown in Figure 3.12, by using quantised prediction
targets (Baevski et al., 2020a) and Transformer encoder blocks as the context network, wav2vec
2.0 (w2v2) (Baevski et al., 2020b) has shown promising performance on ASR, especially when
a large amount of unlabelled acoustic data is used for pre-training and only a small amount of
transcribed speech data is available.
Although CPC in Figure 3.11 and w2v2 in Figure 3.12 follow a similar principle, they
have two differences. First, w2v2 has a quantisation module that quantises the output of the
feature encoder z to a finite set of speech representations via product quantisation (Jégou et al.,
2011). The Gumbel Softmax (Jang et al., 2017) allows the quantisation procedure to be fully
differentiable. To encourage the use of all of the quantised representations in the codebook, a
diversity loss Ldiversity is interpolated with the contrastive loss during training (Baevski et al.,
2020b). Secondly, instead of using a uni-directional sequence model for the context network
and generating a contextual representation to predict input in the future, w2v2 adopts the
Transformer encoder architecture (see Section 2.2.2) with masked input. For each masked
input feature xt , the context network uses the information available in the past and the future
to generate the contextual representation vt , which is similar to a “fill in the blank” question.
The positive sample is the current quantised feature qt and the negative samples are uniformly
sampled from quantised features that correspond to other masked time steps.

Figure 3.12 The w2v2 framework: the feature encoder maps the input waveform X to latent features Z, which are masked and quantised into Q; the context network produces contextual representations V; and the contrastive loss separates the positive quantised sample from the negative samples.

The density ratio in w2v2 is computed from the cosine similarity between the contextual representation v_t and the quantised feature q_{t'}, scaled by a temperature ζ,
\[
f(x_{t'}, v_t) = \exp\!\left( \frac{1}{\zeta} \, \frac{ q_{t'}^{\top} v_t }{ \lVert q_{t'} \rVert\, \lVert v_t \rVert } \right). \qquad (3.85)
\]
For fine-tuning, task-specific layers are added on top of the w2v2 model and are updated
together with the context network. An alternative implementation (Zhang et al., 2020) shows
that the input to w2v2 can also be handcrafted acoustic features and the quantisation module
is optional. After fine-tuning the pre-trained model with a very limited amount of labelled
data, the resulting system can still achieve strong recognition performance (Baevski et al., 2020b). This indicates that SSL can
effectively leverage unlabelled data to learn useful representations.
3.10 Summary
In this chapter, two major frameworks of speech recognition were covered, i.e. the source-
channel model-based system and the attention-based encoder-decoder model-based system.
Although both systems share some components, including feature extraction, language mod-
elling, and evaluation procedure, they are very different in terms of the modelling principle,
modelling units, and training and decoding procedures. Each ingredient of both systems was
briefly described in this chapter. Finally, some self-supervised training approaches that leverage a large amount of unlabelled data to improve ASR model performance via unsupervised
pre-training were introduced.
Chapter 4
4.1 Background
This section first covers various approaches to combining multiple ASR systems. Then related
work that combines an SCM-based system and an AED model is described. Differences between
the related work and the proposed method are highlighted.
Mathematically, with a single LM, system combination for M Acoustic Models (AMs) estimates
\[
P(h|O; \theta_1, \ldots, \theta_M) = \frac{ p(O|h; \theta_1, \ldots, \theta_M)\, P(h; \theta_{\text{LM}}) }{ \sum_{\tilde{h}} p(O|\tilde{h}; \theta_1, \ldots, \theta_M)\, P(\tilde{h}; \theta_{\text{LM}}) }, \qquad (4.1)
\]
where the denominator sums over the set of all possible hypothesis sequences h̃. In practice, this set can be approximated by the union of
the top hypotheses from all candidate systems. It is possible to combine likelihoods from different
acoustic models or to directly combine hypothesis posteriors.
Likelihood Combination There are two general approaches to combine likelihoods (Beyerlein, 1997). The most straightforward one is a linear combination,
\[
p(O|h; \theta_1, \ldots, \theta_M) \approx \sum_{m=1}^{M} z_m\, p(O|h; \theta_m), \qquad (4.2)
\]
where z_m is the prior probability of model θ_m. For the combined likelihood to be a valid
distribution, \(\sum_{m=1}^{M} z_m = 1\) and z_m are non-negative for all m. The alternative is a log-linear
combination of all likelihoods,
\[
p(O|h; \theta_1, \ldots, \theta_M) \approx \frac{1}{Z} \exp\left( \sum_{m=1}^{M} z_m \log p(O|h; \theta_m) \right) = \frac{1}{Z} \prod_{m=1}^{M} p(O|h; \theta_m)^{z_m}, \qquad (4.3)
\]
First, the top hypotheses from the candidate systems are aligned into Word Transition Networks (WTNs); a confidence score of the recognised word, which can be derived from the word posterior probability or other sources
of information during decoding, can optionally be incorporated into the WTNs (Fiscus, 1997).
Second, a word needs to be chosen from each set of competing words based on the maximum
posterior criterion
\[
P(h_i|O; \theta_1, \ldots, \theta_M) \approx (1 - \lambda)\, \text{Conf}\big(h_i|O; \theta_1, \ldots, \theta_M\big) + \lambda\, D(h_i, O), \qquad (4.4)
\]
where Conf(hi |O; θ1 , . . . , θM ) is the average or maximum confidence score of the word h in
the i-th set among all the top hypotheses from M systems; D(hi , O) is the frequency of the
word h in the i-th set; and λ is the interpolation coefficient between the confidence score and
the word frequency. Without confidence scores, or when λ is set to 1, Equation (4.4) is equivalent
to voting by frequency, where ties within a set of competing words are decided randomly.
For CNC, the process is very similar, except that multiple words with their posteriors
from each system are used in the form of confusion networks. Although aligning multiple
confusion networks is more complex, CNC generally works slightly better than ROVER.
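The voting rule of Equation (4.4) for one set of competing words can be sketched as follows; using the maximum confidence per word is one of the two options mentioned above, and the interpolation weight is illustrative.

```python
from collections import defaultdict

def rover_vote(candidates, lam=0.5):
    """Pick the word from one set of competing words using Equation (4.4).

    candidates: list of (word, confidence) pairs, one entry per system
                (the word may be empty for a NULL/deletion arc).
    lam:        interpolation weight between confidence and frequency.
    """
    m = len(candidates)
    freq = defaultdict(float)
    conf = defaultdict(float)
    for word, c in candidates:
        freq[word] += 1.0 / m                   # D(h_i, O): relative frequency
        conf[word] = max(conf[word], c)         # maximum (or average) confidence
    score = {w: (1 - lam) * conf[w] + lam * freq[w] for w in freq}
    return max(score, key=score.get)
```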
Unlike explicit combination where the likelihoods or hypotheses from all candidate systems
are used for decoding, a combination scheme that propagates information from one system to
another is regarded as an implicit combination approach (Gales and Young, 2008; Peskin et al.,
1999).
One commonly used approach is N -best (Schwartz and Austin, 1991) or lattice rescor-
ing (Aubert and Ney, 1995; Richardson et al., 1995; Woodland et al., 1995), where the AM
and/or the LM scores for each word decoded by one system are updated based on the scores
from another system. Rescoring can restrict the search space and avoid possible errors made by
one system. However, the final hypothesis can only come from the lattice or the N -best list
generated by a single system, i.e. hypotheses made solely by the other system are not included.
Another approach for implicit combination is cross-adaptation (Gales et al., 2006; Stüker
et al., 2006; Woodland et al., 1995). In Section 3.3.3, in order to estimate Maximum Likeli-
hood Linear Regression (MLLR)-based transforms for unseen speakers during recognition, a
transcription is needed to compute the regression statistics. For unsupervised adaptation, the
transcription can come from another system. This approach allows the information from the
other system to be explicitly used during the estimation of the transform.
Analogously to SCM-based systems, the hypothesis posteriors from M AED models can be combined log-linearly,

\[
P(h|O; \theta_1, \ldots, \theta_M) \approx \frac{1}{Z} \exp\left(\sum_{m=1}^{M} z_m \log P(h|O; \theta_m)\right). \tag{4.5}
\]
Instead of using a union of N -best lists from all AED models and interpolating the sequence-
level scores, the same approach can be applied during each step of beam search, which is also
known as joint decoding (Tüske et al., 2021). Also similar to hypothesis combination for SCM
systems, ROVER can be applied to AED models (Wong et al., 2020). Confidence scores for
each word P (hi |O; θ) can be the Softmax scores of the AED models or come from a dedicated
confidence estimation module as discussed in Chapter 5.
Kim et al. (2017) and Watanabe et al. (2017) proposed to use the CTC objective function as an
auxiliary task when training the AED model. The encoder is shared between the CTC branch
and the AED model. The CTC output layer and the decoder of the attention-based model are
separate. The two objective functions are used together during training via a log-linear combination,

\[
\mathcal{L} = \lambda \log P_{\text{CTC}}(h^*|O) + (1 - \lambda) \log P_{\text{AED}}(h^*|O), \tag{4.6}
\]

where h^* is the reference sequence and λ ∈ [0, 1] is the multi-task weight.
During decoding of the joint CTC and AED models, the length penalty and coverage term in
Equation (3.78) become unnecessary as the CTC probability encourages a monotonic alignment
that does not allow the skipping or looping behaviour observed for decoding standalone AED
models. Also, it can prevent the premature termination of hypotheses. For decoding two
models jointly, the challenge is to handle the asynchrony between two output branches. The
CTC branch outputs symbols in a frame-synchronous fashion whereas the attention-based
decoder emits symbols in a label-synchronous way. A straightforward approach to handle this
problem is to combine scores at the utterance level, i.e. obtain N -best hypotheses from the
attention-based decoder and rerank these hypotheses based on the combined score from CTC
and attention-based models,
\[
\hat{h} = \underset{h \in \text{Beam}(O, N)}{\arg\max} \left\{ \lambda' \log P_{\text{CTC}}(h|O) + (1 - \lambda') \log P_{\text{AED}}(h|O) \right\}. \tag{4.7}
\]
A more systematic approach is to integrate the CTC loss during the beam search so that the
search space is modified (Hori et al., 2017a). The CTC prefix probability is defined as the
cumulative probability of all label sequences that have the same prefix (Graves et al., 2006).
Then the beam search process can use the log-linear combination of the CTC prefix score and
the attention-based model score for each partial hypothesis.
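A minimal sketch of the utterance-level combination in Equation (4.7) is shown below; ctc_log_prob and aed_log_prob stand for hypothetical functions returning the sequence-level log-probabilities of the two branches, and the N-best list is assumed to come from beam search with the attention-based decoder.

```python
def rescore_nbest(nbest, ctc_log_prob, aed_log_prob, lam=0.3):
    """Pick the best hypothesis from an N-best list using the log-linear
    combination of CTC and AED scores (Equation (4.7)).

    nbest:        list of hypothesis token sequences from beam search.
    ctc_log_prob: callable, hypothesis -> log P_CTC(h | O).
    aed_log_prob: callable, hypothesis -> log P_AED(h | O).
    lam:          interpolation weight lambda' on the CTC score.
    """
    def combined(h):
        return lam * ctc_log_prob(h) + (1.0 - lam) * aed_log_prob(h)
    return max(nbest, key=combined)
```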
Sainath et al. (2019) proposed using a Recurrent Neural Network (RNN) transducer to
generate N-best hypotheses during the first pass and then rescore them using an AED model.
The encoders of the transducer and the AED model share the same set of parameters. The
first pass using a transducer allows streaming processing of speech data while the second pass
rescoring improves the recognition accuracy. Consequently, the latency requirement for the
first pass is met and the final recognition quality is also improved by the second pass. The
training procedure has three steps. First, a standalone RNN transducer model is trained. Second,
an AED model is trained where the encoder has the same parameters as the encoder of the
RNN transducer and is frozen. Finally, all components are fine-tuned with a combination of
RNN transducer and AED losses. To further integrate the RNN transducer and AED models,
Minimum Word Error Rate (MWER) training is used for the AED model where the N -best
hypotheses are generated by the transducer. During rescoring, an adaptive beam and a prefix-
tree representation of the N-best list are used to improve both the rescoring speed and the final
performance.
Since DNN-HMMs or CTC systems prune most of the less likely hypotheses, rescoring with
AED models can be more robust and efficient than decoding with AED models.
Compared to the joint CTC and attention architecture reviewed in Section 4.1.3.1, ISCA
first performs beam search in a frame-synchronous fashion. It is known that beam search with
frame-synchronous systems can more easily handle streaming data and explore a larger search
space. Furthermore, the proposed framework uses a frame-synchronous decoding pass followed
by a separate rescoring pass by a label-synchronous AED system. The two-pass approach not
only makes it easier to implement by reusing the existing AED rescoring framework, but also
allows the two types of system to have different output units. Moreover, the proposed method is
configured to combine multiple systems and can use different model structures with or without
sharing the encoder. This offers more flexibility in the choice of the systems and possibly more
complementarity.
Compared to the two-pass end-to-end approach described in Section 4.1.3.2, the proposed
framework allows a more flexible choice of the first pass frame-synchronous system, including
DNN-HMMs, CTC, and neural transducers. Since DNN-HMMs and CTC do not have a
model-based decoder with learned parameters, unlike an AED model, it is easier for them to
incorporate structured knowledge and construct richer lattices.
Figure 4.1 The ISCA framework. The top branch is the SCM-based system and the bottom
branch is the AED model. The acoustic model and the neural encoder can optionally share
parameters. Rounded boxes indicate trainable models and a rectangular box indicates an
algorithm or a procedure.
When the acoustic model likelihood p(O|h) and the language model probability P_LM(h) from the SCM-based system, together with the score P_AED(h|O) from the AED system, are available, the final score of a hypothesis h for an utterance O is

\[
S(h, O) = \log p(O|h) + \psi \log P_{\text{LM}}(h) + \alpha \log P_{\text{AED}}(h|O), \tag{4.9}
\]

where ψ is the grammar scaling factor used in the SCM-based system and α is the interpolation
coefficient. When there are multiple LMs and multiple AED models, the score from each
additional model will be interpolated similarly. Additional decoding parameters used in SCM
and AED models can also be added to the total score S, such as an insertion penalty and
coverage score. Based on Equation (4.9), AED models can be viewed as audio-grounded
language models.
In contrast to the proposed framework, standard combination methods for ASR systems
are not suitable for combining with AED models. For example, confusion network combina-
tion (Evermann and Woodland, 2000b) requires decoding lattices and ROVER (Fiscus, 1997)
requires comparable confidence measures from both types of system. The benefits of using
hypothesis-level combination may be limited when combining an SCM-based system with an
AED model, because the hypothesis space generated from the SCM-based system is generally
much larger than that of the AED model.
Since there are two widely used neural decoder architectures, their training and inference
efficiencies need to be taken into account when using different rescoring algorithms. The first
one is the RNN decoder as in Section 2.2.1 that processes the input sequentially from left to
right. The second one is the Transformer decoder as in Section 2.2.2 that uses multi-head
attention and positional encoding to process all time steps in a sequence in parallel. Although
this accelerates training with a higher memory cost, a sequential step-by-step procedure still
needs to be followed during decoding. A summary of the time and space complexities for both
types of neural decoders is shown in Table 4.1.
Table 4.1 Comparison of the time and space complexities of RNN and Transformer decoders
with respect to the output sequence length L.
4.2.3 Training
The SCM-based system and the AED model can be trained separately or jointly in a multi-task
fashion by sharing the neural encoder with the acoustic model. For multi-task trained mod-
els (Watanabe et al., 2017), the total number of parameters in the entire system is smaller due
to parameter sharing. Although multi-task training can be an effective way of regularisation,
setting the interpolation weights between the two losses and configuring the learning rate to
achieve good performance for both models may not be straightforward. Moreover, multi-task
training also limits the model architectures or model-specific training techniques that can
be adopted for individual systems. For example, the acoustic model can be a unidirectional
architecture for streaming purposes but the encoder of the AED model can be bi-directional
for second-pass rescoring. Acoustic models in SCM-based systems normally have a frame
subsampling rate of 3 in a low frame-rate system (Povey et al., 2018; Pundak and Sainath, 2016)
but neural encoders normally have a frame rate reduction of 4, by using convolutional layers or
pyramidal RNNs (see Section 3.6.1), for better performance. Frame-level shuffling (Su et al.,
2013) is important for the optimisation of HMM-based AMs, whereas AED models have to be
trained on a per utterance basis1 . Triphone units are commonly used for HMM-based acoustic
models whereas word-pieces are widely used for attention-based models. Different discrim-
inative sequence training can also be applied separately, e.g. Maximum Mutual Information
(MMI) for SCM-based systems and MWER for AED models. Overall, sharing the acoustic
model and the neural encoder in a multi-task training framework hampers both systems from
reaching the best possible WER performance.
1 For both sequence training of HMM-based systems and cross-entropy training of AED models, utterance-level shuffling is often used.
4.2.4 Inference
During inference, the SCM-based system generates the top hypotheses for each utterance. The
word sequence h can be tokenised into the set of word-pieces used by the AED model. The
word-piece sequence h′ is forwarded through the neural decoder to obtain the probability for
each token P(h′_t | h′_1:t−1, O). By finding the interpolation coefficients ψ and α in Equation (4.9) that give the lowest WER on the development set via grid search, the final hypothesis is chosen as the one with the highest combined score among the top candidates.
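The inference procedure can be summarised by the following sketch (a simplification rather than the exact ISCA implementation): each hypothesis in the N-best list carries an acoustic model score, an LM score and an AED score, and ψ and α are tuned on the development set by grid search before being applied to the test set. The wer_fn helper and the per-hypothesis score fields are illustrative assumptions.

```python
import itertools
import numpy as np

def isca_score(hyp, psi, alpha):
    """Combined score S(h, O) of Equation (4.9) for one hypothesis.
    hyp is a dict holding the three per-hypothesis log scores."""
    return hyp["am"] + psi * hyp["lm"] + alpha * hyp["aed"]

def rescore(nbest, psi, alpha):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: isca_score(h, psi, alpha))

def grid_search(dev_utts, wer_fn, psi_grid, alpha_grid):
    """Pick (psi, alpha) minimising WER on the development set.
    dev_utts: list of (nbest, reference) pairs; wer_fn computes corpus WER."""
    best = (None, None, float("inf"))
    for psi, alpha in itertools.product(psi_grid, alpha_grid):
        hyps = [rescore(nbest, psi, alpha)["words"] for nbest, _ in dev_utts]
        refs = [ref for _, ref in dev_utts]
        wer = wer_fn(hyps, refs)
        if wer < best[2]:
            best = (psi, alpha, wer)
    return best

# Example search grids; the actual ranges would be chosen empirically.
psi_grid = np.arange(5.0, 20.0, 1.0)
alpha_grid = np.arange(0.0, 2.0, 0.1)
```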
As shown in Table 4.1, when rescoring each hypothesis in the N -best list with an AED model
with an RNN-based decoder, the time complexity is O(L) and the space complexity is O(1)
because of the sequential nature of RNNs. In contrast, a Transformer-based decoder has a
time complexity of O(1) and space complexity of O(L2 ) as during Transformer training. Since
the entire hypothesis is available, self-attention can be directly computed across the whole
sequence for each token.
For a fixed number of candidate hypotheses, the number of alternatives per word in the sequence
is smaller when the hypothesis becomes longer. This means the potential for WER reduction
diminishes for longer utterances. Therefore, in order to have the same number of alternatives per
word, the size of an N -best list needs to grow exponentially with respect to the utterance length.
However, lattice rescoring can effectively mitigate this issue. As described in Section 3.4.3,
lattices are directed acyclic graphs where nodes represent words and edges represent associated
acoustic and language model scores. A complete path from the start to the end of a lattice is
a hypothesis. Because a different number of arcs can merge to or split from a node, a lattice
generally contains a far greater number of hypotheses than a limited N -best list. The size of
lattices can be measured by the number of arcs per second of speech, also known as the lattice
density (Woodland et al., 1995).
One commonly used lattice rescoring approach for Recurrent Neural Network Language
Models (RNNLMs) is on-the-fly lattice expansion with n-gram based history clustering (Liu
et al., 2016). Similar to RNNLMs, attention-based models are also auto-regressive models
88 Integrating Source-Channel and Attention-Based ASR Models
where the next prediction depends on all history tokens. This means approximations must be
made when assigning scores on edges of lattices because each word in the lattice may have
numerous history sequences. Although lattices can be expanded to allow each word to have
a more specific history, a trade-off between the uniqueness of the history and computational
efficiency needs to be considered. n-gram based history clustering (Liu et al., 2016) assumes
that the history before the previous n − 1 words has no impact on the probability of the current
word. As illustrated in Figure 4.2, a lattice can be expanded such that the n − 1 history word
sequence for each word in the lattice is unique. During rescoring, a hash table-based cache is
Figure 4.2 Example of expanding a 2-gram lattice to a 3-gram lattice. The hollow node on the
left is expanded into two hollow nodes on the right, and the symbols of the nodes and arcs
correspond to lines 16-18 in Algorithm 1.
created, where the key is the n − 1 history words and the value is the corresponding hidden
state of RNNLM and the output distribution. When the same history appears again during
rescoring, repetitive computation can be avoided, regardless of the more distant history.
However, for AED models based on attention mechanisms, n-gram based history clustering
can lead to undesirable behaviour. For example (from Switchboard),
I think those are are wonderful things to have but I think in a big company ...
the phrase “I think” appears twice in the utterance. If the original trigram based history
clustering developed for RNNLMs is used, at the second occurrence of the phrase, the algorithm
will restore the cache from the first occurrence including the attention context and the decoder
state and then continue to score the rest of the utterance. Consequently, scores for the second
half of the utterance will be wrong because of the incorrect attention context.
To this end, a time-dependent two-level n-gram cache is proposed for the lattice rescoring al-
gorithm shown in Algorithm 1. The input to the algorithm is a lattice from a frame-synchronous
model, the AED model, the corresponding acoustic sequence, and the n-gram order used for
approximation. The output of the algorithm is an expanded lattice with additional AED scores
on each arc. The algorithm first initialises a two-level hash table as the cache (line 2). The
hash-key of the first level is the history phrase. The hash-key of the second level cache is the
frame index of the current word. When looking up the second level, a collar of ±9 frames is
used to accommodate a small time difference in the alignment. Then the algorithm performs
lattice expansion and rescoring at the same time. After initialising the new lattice (lines 3-7),
the nodes in the lattice are traversed in a topological order (line 8). For each node in the
original lattice, the new lattice may have multiple copies such that the n − 1 preceding words
are unique (line 9). Each outgoing arc from the node in the original lattice is duplicated for all
the duplicated nodes in the expanded lattice. Depending on the updated n − 1-gram history,
the destination node of the outgoing arc may need to be duplicated (lines 11-19). Next, for
the duplicated arc, the AED score needs to be obtained. First, the two-level cache is accessed
with the word history and the timestamp of the current word. The cache is only hit when the
timestamp falls within the vicinity of one of the timestamps in the cache. Otherwise, there is a
cache miss and a new entry is created in the cache with the timestamp and the corresponding
decoder states (lines 21-23). Another key detail is that when there is a cache hit, the sum of
arc posteriors of the current (n − 1)-gram is compared with the one stored in the cache. If the
current posterior is larger, indicating the current (n − 1)-gram is on a better path, then the cache
entry is updated to store the current hidden states (lines 24-26). For the example in Figure 4.2,
when the lower path is visited after the upper path, the cache entry “sat on” should already
exist. If the lower path has a higher posterior probability, then the cache entry will be updated,
so that future words in the lattice will adopt the history from the lower path2. Now, the entry
corresponding to the duplicated arc must exist in the cache, having either just been created or renewed.
The duplicated arc with the score from a label-synchronous model can then be connected to the
corresponding nodes in the expanded lattice (lines 28-31). After traversing through all nodes in
the original lattice, the algorithm returns a new expanded lattice with new scores from an AED
model (line 36).
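The two-level cache at the heart of Algorithm 1 can be sketched as follows (an illustrative simplification, not the released implementation; the decoder-state objects are placeholders): the first level is keyed by the (n − 1)-word history, the second level by the frame index of the current word with a ±9-frame collar, and a cache hit is refreshed whenever a better path, i.e. one with a higher sum of arc posteriors, reaches the same history.

```python
COLLAR = 9  # frames of tolerance when matching timestamps

class TwoLevelCache:
    """Maps a history (tuple of n-1 words) to a list of
    {frame, posterior, state} entries."""

    def __init__(self):
        self.table = {}

    def lookup(self, history, frame):
        """Return a cached entry whose timestamp lies within the collar, else None."""
        for entry in self.table.get(history, []):
            if abs(entry["frame"] - frame) <= COLLAR:
                return entry
        return None

    def insert(self, history, frame, posterior, state):
        """Cache miss: create a new entry for this history and timestamp."""
        self.table.setdefault(history, []).append(
            {"frame": frame, "posterior": posterior, "state": state})

    def maybe_update(self, entry, posterior, state):
        """Cache hit: keep the decoder state of the better (higher-posterior) path."""
        if posterior > entry["posterior"]:
            entry["posterior"] = posterior
            entry["state"] = state
```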
For label-synchronous systems with RNN decoders, lattice rescoring has O(L) for time
complexity and O(1) for space complexity as the RNN hidden states can be stored and carried
forward at each node in the lattice. However, since lattice rescoring operates on partial
hypotheses, Transformer decoders have to run in decoding mode as in Table 4.1. Because self-attention needs to be computed with all previous tokens, lattice rescoring with the Transformer-based decoder has O(L²) time and space complexities3.
2 Code is available at https://github.com/qiujiali/lattice-rescore.
3 Recent work has shown that the computational complexity of Transformers can be reduced to O(L) with some modifications (Katharopoulos et al., 2020; Wang et al., 2020b).
4.3 Preliminary Experiments
The AMI-Individual Headset Microphone (IHM) recordings were used (see Section A.1). The
dataset contains around 80 hours of speech for training, and 8 hours for both development (dev)
and evaluation (eval). The inputs used were 80-dim filter-bank features at a 100 Hz frame rate
concatenated with 3-dimensional pitch features. Manual utterance-level segmentation was used
throughout.
The pipeline is based on the ESPnet setup (Kim et al., 2017; Watanabe et al., 2018). The
default configurations for the CTC model, the acoustic model of the DNN-HMM system and
the encoder of the attention model are all an 8-layer bi-directional Long Short-Term Memory
(LSTM) with a projection layer. The bi-directional LSTM has 320 units in each direction and
the projection size is also 320. For the SCM model, an additional Multilayer Perceptron (MLP)
of size 320 is added before the Softmax output. The attention model uses a location-aware
attention mechanism connecting to the decoder with a one-layer 300-unit LSTM. The n-gram
LM is trained using both the AMI and Fisher transcriptions. The RNNLM is trained purely
based on the text transcriptions of AMI data. The RNNLM has one LSTM layer with 1000
hidden units. The perplexities of the RNNLM on the AMI dev and eval data are 73 and 64
respectively.
To train the HMM-based model, the Cross Entropy (CE) objective function was used with align-
ments produced by a pre-trained DNN-HMM system. Unigram label-smoothing (Szegedy et al.,
2016) was applied before computing the CE loss of the AED model. The numbers of graphemes,
monophones and tied triphones are 31, 48 and 4016. The AdaDelta optimiser (Zeiler, 2012)
was used with a batch size of 30 utterances for all models.
For decoding the SCM-based models, PyHTK (Zhang et al., 2019b) was used to set up the
corresponding HMM structures and the decoding pipeline. HTK tools (Young et al., 2015)
were used for lattice generation, lattice rescoring and N -best list generation. A trigram LM
was used for SCM-based model decoding. For AED model decoding, the width of the beam
search was set to 30.
In the baseline setup (Kim et al., 2017), both the CTC model and the AED model used the same
set of graphemes as their output units and the effective frame rate is 25 Hz. The frame rate
reduction is achieved by skipping some time steps in the LSTM layer. The first LSTM layer
has a full frame rate and the next two layers skip one step at every other frame. Statistics on the
training data show that the input/output length ratio is more than four for 95.0% of all utterances
but is more than three for 99.4% of the data. For phones, 99.9% of the utterances have an
input/output length ratio higher than four. For AED models where frame-synchronicity is not
required, a higher ratio of frame rate reduction may be appropriate. However, for SCM-based
models, the frame rate should not be less than 1/3 of the full frame rate.
Table 4.2 AMI dev set WERs by decoding on the AED branch only and joint decoding of the
CTC and attention models with a word-based RNNLM. The output units are the same set of
graphemes for both branches.
In the experiments shown in Table 4.2, the frame rate was reduced at the input feature level,
which is then more similar to the setup from SCM-based systems (Miao et al., 2016b; Povey et al., 2016). By sampling one in every three frames, the performance of the model improves
despite two-thirds of training data not being used. Offsets to the starting point of the input
sequence allow the data to be fully used, which reduced the WER by another 2% absolute.
Furthermore, by concatenating two adjacent frames as the input feature to cover the test-time
unused frames (Pundak and Sainath, 2016), the WER was further reduced by 2%. Overall, a
6-7% absolute reduction of WER was observed over the baseline.
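The frame-rate manipulations described above can be illustrated with a short NumPy sketch (an illustration only, not the ESPnet recipe): subsampling by a factor of three from different starting offsets so that all frames are eventually used during training, and stacking each kept frame with its right neighbour so that otherwise unused frames still contribute to the input features.

```python
import numpy as np

def subsample(features, factor=3, offset=0):
    """Keep one frame in every `factor`, starting from `offset`.
    features: (T, D) filter-bank (+pitch) features."""
    return features[offset::factor]

def subsample_and_stack(features, factor=3):
    """Concatenate each kept frame with its right neighbour so that the
    skipped frames are partially covered (cf. Pundak and Sainath, 2016)."""
    kept = features[0::factor]
    right = features[1::factor]
    n = min(len(kept), len(right))
    return np.concatenate([kept[:n], right[:n]], axis=1)  # (T/factor, 2D)

feats = np.random.randn(300, 83)             # 3 s of 80-dim fbank + 3-dim pitch
print(subsample(feats, 3, offset=1).shape)   # (100, 83)
print(subsample_and_stack(feats, 3).shape)   # (100, 166)
```

During training, the offset can be drawn from {0, 1, 2} for each epoch or utterance so that every frame is eventually seen.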
By treating the CTC model as a type of SCM-based model as described in Section 3.5.1,
the decoding procedure of the traditional HMM-based systems can also be used, where var-
ious sources of structured information are incorporated. The baseline is based on the prefix search decoding procedure, and the improvements are made using the lexical-tree-based decoding procedure.
                          WER
CTC baseline              47.6
  + graphemic lexicon     43.9
  + trigram LM            38.5
  + prior                 33.2
  + multi-task training   32.1

Table 4.3 AMI dev set WER of the standalone graphemic CTC model and several improvements by incorporating structured information during decoding and multi-task training. The RNNLM is not used.
As shown in Table 4.3, the addition of a graphemic lexicon that prevents decoding words
with incorrect spelling reduces the WER by 2.7% absolute. However, some words in the
lexicon can be decomposed into shorter word-pieces, which happen to also be legal words in
the lexicon. The introduction of the trigram LM greatly reduces the fragmentation of words and
reduces the WER significantly (5.4% absolute). One of the many assumptions made by CTC
is that all output units have equal prior probabilities. Computed by accumulating the output
posteriors from the DNN, the estimated priors are very imbalanced. For example, in the case
of 1/3 frame rate, the priors for the blank symbol, the letter ‘A’ and the letter ‘Z’ are 0.43, 0.03 and 10⁻⁴
respectively. By using the graphemic unit priors similarly to HMMs as in Equation (3.12), the
WER improves by another 5.3% absolute. Finally, by training the CTC model and the AED
model in a multi-task fashion, i.e. the same system as the last row in Table 4.2, the CTC WER
is further reduced to 32.1%, whereas the performance of the AED model alone is not improved.
Narrowing the performance gap between the CTC model and the AED model facilitates the use
of the ISCA framework.
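The prior correction can be sketched as follows (a simplified illustration, analogous to the treatment of HMM state priors in Equation (3.12)): unit priors are estimated by accumulating the DNN output posteriors over the training data, and the log-posteriors are then divided by the priors in log space before decoding, so that frequent units such as the blank symbol are not unduly favoured.

```python
import numpy as np

def estimate_priors(posteriors_list):
    """Estimate unit priors by accumulating the DNN output posteriors
    over the training data. posteriors_list: iterable of (T, V) arrays."""
    total = sum(p.sum(axis=0) for p in posteriors_list)
    return total / total.sum()

def apply_prior_scaling(log_posteriors, priors, scale=1.0):
    """Convert log P(unit | o_t) into scaled pseudo log-likelihoods,
    log P(o_t | unit) ∝ log P(unit | o_t) - scale * log P(unit)."""
    return log_posteriors - scale * np.log(priors)
```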
Following the multi-task training scheme where the AM (SCM-based system) and the encoder
(AED model) are shared, the following experiments vary the modelling units and objective
functions of the SCM-based model, while keeping those of the AED model unchanged. For the
following ISCA experiments, 20-best hypotheses from the SCM-based models were ranked
by optimally interpolating the AM scores, trigram LM scores, the AED model scores, and
optionally the RNNLM scores. A derivative-free stochastic global search algorithm, called the
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen and Ostermeier, 2001),
was used to optimise the interpolation weights. Note that results presented on the dev set in
Table 4.4 are optimistic as the interpolation weights are directly optimised for this set.
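The weight optimisation can be reproduced with the publicly available cma package, as in the hedged sketch below; the real objective would rescore the development set with the given weight vector and return its WER, which is replaced here by a toy function, and the starting point and step size are illustrative assumptions rather than the exact settings used in the thesis.

```python
import cma  # pip install cma

def dev_wer(weights):
    """Objective to minimise: the WER on the development set after rescoring
    its N-best lists with the given interpolation weights (AM, trigram LM,
    AED, optionally RNNLM). A toy quadratic stands in for the real pipeline."""
    target = [1.0, 0.8, 0.5, 0.3]            # purely illustrative optimum
    return sum((w - t) ** 2 for w, t in zip(weights, target))

# Start from uniform weights with an initial step size of 0.2.
es = cma.CMAEvolutionStrategy([1.0, 1.0, 1.0, 1.0], 0.2)
es.optimize(dev_wer)
print(es.result.xbest)  # optimised interpolation weights
```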
Table 4.4 AMI dev set WERs of multi-task trained systems. CE systems use single-state HMMs.
A trigram LM was used for SCM-based system decoding. The RNNLM was not used except
for the results in brackets.
The model in the first row in Table 4.4 corresponds to the model of the last rows in both
Tables 4.2 and 4.3. For this multi-task trained graphemic CTC and AED model, the performance
of the ISCA framework is similar to the joint decoding result in Table 4.2.
Given that the joint decoding method only applies to a CTC model and an AED model
with the same graphemic units, one of the major advantages of the ISCA framework is that the
SCM model can have any subword units and any loss function. The next two rows change the
modelling units of SCM-based systems from graphemes to monophones, and the WER of the
SCM-based system is reduced by 10.3% relative to the graphemic CTC model. This is mainly
due to the orthographic irregularity in English where phonetic-based units reduce the difficulty of modelling significantly.

4 A decision tree is used to tie 93k triphone models (or states) to 4k.
5 An acoustic model trained with a similar setup by HTK can yield approximately the same WERs with only a fifth of the parameters due to better data shuffling, larger batch size, etc. For results to be comparable, all models in this paper are trained using ESPnet.

The mapping between pronunciation and text is achieved by the
lexicon embedded in the decoding graph. For these two monophone SCM models, improved
CTC and CE models have similar performance. However, this observation does not hold for
triphone systems. This may be because triphone systems are more sensitive to the quality of the
training alignments since there are many more confusing triphone units than monophone units.
The amount of training data in the AMI dataset may not be sufficient for the CTC system to
discriminate different triphone units and to learn the alignment simultaneously. In contrast, the
CE system outperforms the CTC system as it uses alignments produced by an existing system.
Further experiments show that multi-task trained SCM-based models perform marginally
better than their standalone counterparts. However, the expected benefits of multi-task training
are not observed for the AED models. The last column in Table 4.4 shows that the ISCA
approach yields consistent reductions in WER while the SCM-based model improves, with
or without an external RNNLM. However, as the performance gap between the SCM-based
system and the AED model widens, the relative improvement w.r.t. the SCM-based model
shrinks. In the extreme case where the triphone CE model outperforms the AED model by
15.8% relative, ISCA with 20-best rescoring can still improve the WER of the SCM-based
model by 4.8% relative.
Since the ISCA framework integrates two models at the word level, attempts have also been
made to change the modelling units of the AED model. Similar to the findings by Sainath et al.
(2018) and Zhou et al. (2018a), using monophones for the AED model is not as helpful as
graphemes for ISCA, which shows that the complementarity of graphemic and phonetic models
is essential. As expected, using context-dependent units for the AED model yields essentially
no improvement compared to their context-independent counterparts. Since the neural decoder
directly conditions on the previous output as in Equation (3.73), modelling context-dependent
units is an even harder task than context-independent ones as the model also needs to learn the
tying results found by phonetic decision trees.
Since the triphone CE model outperforms the AED model significantly when using the multi-
task training setup and only 20-best hypotheses are used for ISCA, two more questions remain.
First, how does ISCA perform when the attention-based model improves? Second, how much more improvement can ISCA yield if the N-best list becomes longer? In order to answer these
questions, three separate models were trained whose configurations are listed in Table 4.5 and
the WERs of the combined system with respect to the length of N -best lists are plotted in
Figure 4.3.
[Figure: WER (%) against the length of the N-best list (1 to 100) for the triphone CE model rescored with +RNNLM, +AED(small), +AED(large), +RNNLM+AED(small) and +RNNLM+AED(large).]
Figure 4.3 ISCA between a standalone triphone CE model and AED models of different
configurations with various sizes of N -best lists.
As shown in Figure 4.3, the reduction in WER from adding an external RNNLM stagnates
after an N-best list of length 20 is used. However, the WER of ISCA continues to drop,
especially for the large attention-based model where two systems have similar WERs. Although
the attention-based system models the acoustic and language model jointly, an RNNLM trained
using the AMI transcriptions only is still useful. The improvement from the RNNLM is
expected to be greater when additional text corpora in similar domains are used for training.
For the 100-best list, the small and large attention-based models with the RNNLM reduce
the WER by 7.1% and 10.5% respectively relative to the triphone/CE model rescored by the
RNNLM.
Table 4.6 The number of parameters and WERs of some key models on both the AMI dev and
eval sets, with RNNLM included. ISCA uses 100-best from the triphone CE model.
4.4 Large Scale Experiments

Two common ASR benchmarks were used for training and evaluation. The AMI dataset
(Section A.1) is relatively small-scale and contains recordings of spontaneous meetings. The
IHM channel was used. As shown in Table 4.7, it has many short utterances as multiple
speakers take short turns during meetings. SWB-300 (see Section A.3) is a larger-scale dataset
with telephony conversations. Compared to AMI, it has more training data, longer utterances,
more speakers and a larger vocabulary. Hub5’00 was used as the development set, which is split
into two subsets: SWB and CallHome (CH). RT03 was used as the evaluation set, which is split
into Switchboard Cellular (SWBC) and Fisher (FSH) subsets. The acoustic data preparation
follows the Kaldi recipes (Povey et al., 2011).
                        AMI-IHM      SWB-300
training data           78 hours     319 hours
avg. utterance length   7.4 words    11.8 words
number of speakers      155          543
vocabulary size         12k          30k

Table 4.7 Statistics of the AMI-IHM and SWB-300 training sets.
For each dataset, its training transcription and Fisher transcription (see Section A.3) were used
to train both n-gram language models and RNNLMs. Text processing and building n-gram
LMs for both datasets follow the Kaldi recipes (Povey et al., 2011). RNNLMs were trained
using the ESPnet toolkit (Watanabe et al., 2018). The vocabulary used for RNNLMs is the
same as for n-gram LMs and has 49k words for AMI-IHM and 30k words for SWB-300. The
RNNLMs have 2-layer LSTMs with 2048 units in each layer. The models were trained with
Stochastic Gradient Descent (SGD) with a learning rate of 10.0 and a dropout rate of 0.5. The embedding dimension is 256. Gradient norms were clipped to 0.25 and the weight decay was set to 10⁻⁶. The training transcriptions and the Fisher transcriptions were mixed in a 3:1 ratio.
Because of the domain mismatch between the AMI and Fisher text data, the RNNLM for AMI
was fine-tuned on the AMI transcriptions after training using the mixture of data with a learning
rate of 1.0. The AMI RNNLM has 161M parameters and the Switchboard RNNLM has 122M
parameters. The perplexities of the LMs for both datasets are shown in Table 4.8.
              AMI-IHM         SWB-300
LM            dev     eval    Hub5'00   RT03
3-gram        80.2    76.7    82.8      67.7
4-gram        79.3    75.7    80.0      65.3
RNNLM         58.0    53.5    51.9      45.3

Table 4.8 Perplexities of the language models on the AMI-IHM and SWB-300 test sets.
The acoustic models of the SCM-based systems are factorised Time Delay Neural Networks
(TDNNs), which were trained with the lattice-free MMI objective (Povey et al., 2018) by
following the standard Kaldi recipes (Povey et al., 2011). The total numbers of parameters are
10M for AMI-IHM and 19M for SWB-300.
Two types of AED systems were trained using the ESPnet toolkit (Watanabe et al., 2018)
without the CTC branch. The neural encoders were composed of 2 convolutional layers that
reduce the frame rate by 4, followed by 16 Conformer blocks (Gulati et al., 2020). For the
Conformer block, the dimension for the feed-forward layer was 2048, the attention dimension
was 512 for the model with an LSTM decoder and 256 for the model with a Transformer
decoder. The number of attention heads was 4 and the convolutional kernel size is 31. For the
AED model with an LSTM decoder, the decoder had a location-aware attention mechanism
and 2-layer LSTMs with 1024 units. For the AED model with a Transformer-based decoder,
the decoder had 6-layer Transformer decoder blocks where the attention dimension is 256 and
the feed-forward dimension is 2048. Both AED models were trained using the Noam learning
rate scheduler on the Adam optimiser. The learning rate was 5.0 and the number of warm-up
steps was 25k. Label smoothing of 0.1 and a dropout rate of 0.1 were applied during training.
An exponential moving average of all model parameters with a decay factor of 0.999 was
used. The AED model with LSTM decoder has 130M parameters while the AED model with
Transformer decoder has 54M parameters. Beam search with a beam-width of 8 was used for
decoding. Apart from applying length normalisation for the AED (Transformer) model, other
decoding heuristics were not used. SpecAugment (Park et al., 2019) and speed perturbation
were applied. Word-piece outputs (Kudo, 2018) were used with 200 units for AMI-IHM and
800 units for SWB-300. The single model WERs for the SCM-based system and AED models
are given in Table 4.9.
The WERs of the SCM systems are higher than those of the AED models, mainly because the number of parameters in the SCM systems is much smaller. The WER gap between the SCM and AED systems is smaller on the AMI-IHM dataset than on the SWB-300 dataset, because the training data for the large AED models is more abundant for SWB-300. By comparing the two AED models with LSTM and Transformer decoders, the WERs are relatively close despite the fact that the AED model with the LSTM decoder has more parameters.

                     AMI-IHM          SWB-300
                   dev     eval    Hub5'00       RT03
                                   (SWB/CH)      (SWBC/FSH)
SCM                19.9    19.2    8.6 / 17.0    18.8 / 11.4
AED (LSTM)         19.6    18.2    7.5 / 15.3    16.2 / 10.7
AED (Transformer)  19.4    19.1    7.8 / 14.4    17.5 / 10.4

Table 4.9 Single-system WERs on the AMI-IHM and SWB-300 datasets. Systems do not use RNNLMs for rescoring or decoding. All AED models have Conformer encoders. The decoder architectures of the AED models are given in brackets.
After pruning the lattices generated by the SCM-based system by limiting the beam width and
the maximum lattice density, 20-best, 100-best and 500-best hypotheses were obtained from
these lattices. The N -best hypotheses were then forwarded through the RNNLM, AED (LSTM)
and AED (Transformer). Each hypothesis has five scores, i.e. AM score, n-gram LM score,
RNNLM score, and scores from an AED model with an LSTM decoder and an AED model
with a Transformer decoder. By following Equation (4.9), the five interpolation coefficients are
found by using CMA-ES (Hansen and Ostermeier, 2001) that minimise the WER on the dev
set. By applying the optimal combination coefficients on the test set, the hypotheses with the
highest score were picked. The lattice density for each N -best or lattice rescoring experiment
is also reported. For lattice rescoring, the lattice density refers to the number of arcs in the
expanded lattice for each second of speech audio. For N -best rescoring, since an equivalent
lattice representation for an N -best list would be N parallel paths, the lattice density refers
to the number of words in all N -best hypotheses divided by the duration of the utterance in
seconds6.

6 Another common representation of an N-best list is a prefix tree, which has a lower lattice density than N parallel paths.

For AMI-IHM, as shown in Table 4.10, the WER consistently decreases as the N-best list size increases. As expected, the lattice density grows approximately linearly with respect to the size of the N-best list. However, the WER improvement from 100 to 500-best is smaller than from 20 to 100-best, i.e. the gain from increasing N reduces for larger N. For example, in
the last row, the relative WER reduction from 20 to 100-best is 5.2%, but it shrinks to 2.8%
when increasing the size of N -best lists from 100 to 500. As this observation holds for all
rows in Table 4.10, the benefit of expanding the size of N -best lists diminishes. For lattice
rescoring results shown in Table 4.11, the improvement from using a higher-order n-gram
approximation is often marginal whereas the lattice density nearly doubles from 4-gram to
5-gram.

Table 4.10 WERs on the AMI-IHM eval set using N-best rescoring with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

Table 4.11 WERs on the AMI-IHM eval set using lattice rescoring with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

For both N-best and lattice rescoring, Tables 4.10 and 4.11 show that the WER is lower when combining scores from more models. If just one additional model is to be used
for combination with an SCM-based system, as in the first blocks of Tables 4.10 and 4.11, using an AED model seems to be more effective than an RNNLM, because an AED model can be viewed as an audio-grounded language model. In the second blocks of Tables 4.10 and 4.11, as the performance of the two AED models is similar, the final rescoring performance using an RNNLM and an AED model is also similar. When all three models are used together for rescoring, the best WERs are reached, which shows that all three models are complementary.
The WER of the combined system is 25-29% relative lower compared to a single system in
Table 4.9. With half the lattice density, lattice rescoring with a 4-gram approximation has a
2.8% relative WER reduction over 500-best rescoring.
For SWB-300, 500-best rescoring and lattice rescoring with a 5-gram approximation are reported for the Hub5'00 and RT03 sets in Tables 4.12 and 4.13.

Table 4.12 WERs on Hub5'00 using 500-best rescoring and lattice rescoring (5-gram approximation) with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

Table 4.13 WERs on RT03 using 500-best rescoring and lattice rescoring (5-gram approximation) with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

For most of the rows in Tables 4.12 and 4.13, WERs are lower for lattice rescoring using a 5-gram approximation than
N -best rescoring using 500-best while having much lower lattice densities. When only using
an RNNLM, the lattice rescoring results are slightly worse than N -best rescoring. This may be
due to the fact that the RNNLM operates at the word level whereas AED models operate at the
word-piece level. As a result, by having a 5-gram approximation during word lattice rescoring,
the number of auto-regressive steps in the RNNLM is exactly 5, but it is much greater than 5 for the AED models. Similar to the observations made on the AMI dataset, the best system is obtained
with lattice rescoring using all three models. For RT03, the final combined system using lattice
rescoring reduces the WER by 19-33% relative compared to the single systems in Table 4.9.
Although the 500-best has a greater lattice density than the expanded 5-gram lattice, lattice
rescoring has 5-8% relative WER reduction over 500-best rescoring. The relative gain is greater
than on the AMI dataset. This is not unexpected because SWB-300 has longer utterances than
AMI-IHM on average as shown in Table 4.7. The benefit of using lattice rescoring is more
substantial on long utterances as lattices can represent more alternatives more efficiently.
The Matched-Pair Sentence-Segment Word Error (MAPSSWE) statistical tests (Pallett
et al., 1990) show that the p-values for the null hypotheses “there is no performance difference
between 500-best rescoring and 5-gram lattice rescoring” on both AMI and SWB test sets are
below 0.001.
4.4.2.2 Analysis
Lattices are a more compact representation of the hypothesis space that scales well with the
utterance length. In Figure 4.4, the Relative Word Error Rate Reductions (WERRs) by using
20-best rescoring, 500-best rescoring and lattice rescoring with a 5-gram approximation are
compared for different utterance lengths measured by the number of words in the reference.
[Figure: relative WER reduction (%) against the number of words in the reference (1-8, 9-16, 17-24, 25-32, >32) for 20-best, 500-best and 5-gram lattice rescoring.]
Figure 4.4 Relative WER reduction by utterance length on RT03 for various ISCA rescoring
methods.
As expected, the gap between N -best rescoring and lattice rescoring widens as the utterance
length increases. The number of alternatives per word represented by the N -best list is smaller
for longer utterances, which explains the downward trend in Figure 4.4 for 20-best and 500-best.
However, the number of alternatives per word for lattices is constant for a given lattice density.
Therefore, the WERR from lattice rescoring does not drop even for very long utterances.
Table 4.1 compared the complexities of using RNN and Transformer decoders for AED
models. Based on our implementation, speed disparities between the two types of decoders
are significant. For 500-best rescoring, AED (Transformer) is about four times faster than
AED (LSTM). However, AED (LSTM) is nearly twice as fast as AED (Transformer) for lattice
rescoring with a 5-gram approximation. Explicit comparisons between N -best and lattice
rescoring, and between RNNLM and AED model rescoring are not made here as they depend
on other factors including the implementation, hardware, the degree of parallel computation
and the extent of optimisation. For example, representing the N -best list in the form of a
prefix tree (Sainath et al., 2019) or using noise contrastive estimation (Chen et al., 2016) will
significantly accelerate N -best rescoring using RNNLMs.
Under the constraint that no additional acoustic or text data is used, ISCA outperforms
various recent results on AMI-IHM shown in Table 4.14 and SWB-300 shown in Table 4.15.
Further improvements are expected if cross-utterance language models or cross-utterance
label-synchronous models are used for rescoring (Irie et al., 2019c; Sun et al., 2021b; Tüske
et al., 2020), or combining more and stronger individual systems (Tüske et al., 2021).
4.4.3 Conclusions
In this chapter, a flexible framework called ISCA that combines an SCM-based system with
one or more AED models is proposed. Frame-synchronous SCM systems are used as the first
pass, which can process streaming data and integrate structured knowledge such as a lexicon.
Label-synchronous AED models are viewed as audio-grounded language models to rescore
hypotheses from the first pass. Since the two highly complementary systems are integrated at
the word level, they can be trained jointly in a multi-task fashion or optimised separately to
fully utilise system-specific techniques for optimal performance. Experiments showed that the proposed lattice rescoring algorithm with AED models generally outperforms N-best rescoring. AED models with RNN decoders are better suited for lattice rescoring while Transformer decoders are more time-efficient for N-best rescoring. On both the AMI and SWB datasets, the combined systems have WERs around 30% relatively lower than the individual systems and outperform other recently published results.

                       Hub5'00       RT03
                       (SWB/CH)      (SWBC/FSH)
Hadian et al. (2018)   9.3 / 18.9
Zeyer et al. (2018)    8.3 / 17.3
Park et al. (2019)     6.8 / 14.1
Irie et al. (2019c)    6.7 / 12.9
Kitza et al. (2019)    6.7 / 13.5
Wang et al. (2020c)    6.3 / 13.3
Tüske et al. (2020)    6.4 / 12.5    14.8 / 8.4
Saon et al. (2021)     5.9 / 12.5    14.1 / 8.6
Hu et al. (2021)       7.2 / 13.6    14.4 / 8.9
Sun et al. (2021b)     6.7 / 13.2    14.9 / 8.3
ISCA                   5.7 / 12.1    13.2 / 7.6

Table 4.15 Comparison with other SWB-300 results. Some RT03 results are not available.
Chapter 5

Confidence Scores for Attention-Based Encoder-Decoder ASR Models
For various speech-related tasks, a confidence score is normally a value between 0 and 1
associated with each subword/word/utterance that indicates the quality of Automatic Speech
Recognition (ASR) transcriptions. Confidence scores play an important role in various down-
stream applications, such as semi-supervised learning, active learning, keyword spotting,
systems combination and dialogue systems. In traditional Hidden Markov Model (HMM)-
based ASR systems, confidence scores can be estimated from word posteriors in decoding
lattices. However, for an Attention-Based Encoder-Decoder (AED)-based ASR model with
an auto-regressive decoder, computing word posteriors is difficult. As AED models reach
promising performance for ASR, various downstream tasks rely on good confidence estima-
tors. An obvious approach for estimating confidence scores for AED models is to use the
decoder Softmax probabilities. In practice, Softmax scores are poor confidence scores and are
affected heavily by regularisation techniques used during ASR training. In the first part of the
chapter, a lightweight and effective approach called the Confidence Estimation Module (CEM)
is proposed. The CEM generates a confidence score for each hypothesis token. Word-level
confidence scores can be obtained by aggregating the token-level scores. Experiments show
that the CEM is a much better confidence estimator than the Softmax probabilities and the
overconfidence problem of Softmax scores is effectively mitigated.
Although AED models have shown impressive results for ASR, they formulate the sequence-
level probability as a product of the conditional probabilities of all individual tokens given their
histories. However, the performance of locally normalised models can be sub-optimal because
of factors such as exposure bias (Deng et al., 2020). Consequently, the model distribution
differs from the underlying data distribution. In the next part of the chapter, the Residual
Energy-Based Model (R-EBM) is used to complement the auto-regressive AED model to
close the gap between the two distributions. Meanwhile, an R-EBM can also be regarded as
an utterance-level confidence estimator. Experiments show that an R-EBM produces better
utterance-level confidence scores than aggregating token-level confidence scores from the CEM.
Furthermore, the utterance-level scores generated by an R-EBM can be used to rescore the
N -best hypotheses and reduce the Word Error Rate (WER).
The CEM and R-EBM are better confidence estimators than using output Softmax prob-
abilities on in-domain data. Since they are model-based confidence estimators trained using
the same data as the underlying ASR model, generalising to out-of-domain data may be chal-
lenging. If the input data to the speech recogniser is from mismatched acoustic and linguistic
conditions, the ASR performance and the corresponding confidence estimators may exhibit
severe degradation. To this end, the last part of this chapter proposes two approaches to improve
the model-based confidence estimators on Out-of-Domain (OOD) data while keeping the ASR
model untouched: using pseudo transcriptions and an additional OOD language model. Experi-
ments show that the proposed methods can considerably improve the confidence metrics on
OOD datasets while preserving in-domain performance. Furthermore, the improved confidence
estimators are shown to better reflect the probability of a word being recognised correctly and
can also provide a much more reliable criterion for data selection.
5.1 Background
Confidence scores have been an intrinsic part of ASR systems (Jiang, 2005; Wessel et al.,
2001; Yu et al., 2011) and provide an indication of the reliability of transcriptions given by
the recogniser. Many speech-related applications depend on high-quality confidence scores to
mitigate errors from speech recognisers. For example, in semi-supervised learning and active
learning, utterances with highly confident hypotheses are selected to further improve ASR
performance (Chan and Woodland, 2004; Riccardi and Hakkani-Tür, 2005; Tür et al., 2005).
Confidence scores are also used in dialogue systems where queries with low confidence may
be returned to users for clarification (Tür et al., 2005). As an indication of ASR uncertainty,
confidence scores can play a role in speaker adaptation (Uebel and Woodland, 2001), and
system combination (Evermann and Woodland, 2000b).
A commonly used confidence feature is the word posterior probability, which is normally derived from word lattices or confusion
networks (Evermann and Woodland, 2000a,b; Mangu et al., 2000). Other useful features
include the hypothesis density (Kemp and Schaaf, 1997), word trellis stability (Sanchís et al.,
2012), acoustic stability (Zeppenfeld et al., 1997), normalised acoustic likelihood (Pinto and
Sitaram, 2005), and language model back-off behaviour (Weintraub et al., 1997). In order
to have more reliable estimates, many model-based approaches have been proposed, such as
conditional random fields (Seigel and Woodland, 2011), recurrent neural networks (Del-Agua
et al., 2018; Kalgaonkar et al., 2015; Kastanos et al., 2020; Ragni et al., 2018) and graph neural
networks (Li et al., 2019c; Ragni et al., 2022).
Normalised Cross Entropy (NCE) measures how much better the estimated confidence scores are than a constant score equal to the average probability of a token/word being correct in the entire dataset. If confidence scores for all N tokens/words are gathered and denoted c = [c_1, ..., c_N] where c_n ∈ [0, 1], and their corresponding target confidences are c* = [c*_1, ..., c*_N] where c*_n ∈ {0, 1}, NCE is given by

\[
\text{NCE}(\mathbf{c}^*, \mathbf{c}) = \frac{H(\mathbf{c}^*) - H(\mathbf{c}^*, \mathbf{c})}{H(\mathbf{c}^*)}, \tag{5.1}
\]

where

\[
H(\mathbf{c}^*) = -\bar{c}^* \log(\bar{c}^*) - (1 - \bar{c}^*) \log(1 - \bar{c}^*), \quad \text{with } \bar{c}^* = \frac{1}{N}\sum_{n=1}^{N} c^*_n, \tag{5.2}
\]

and H(c*, c) is the binary cross-entropy between the target and the estimated confidence scores,

\[
H(\mathbf{c}^*, \mathbf{c}) = -\frac{1}{N}\sum_{n=1}^{N} \left[ c^*_n \log(c_n) + (1 - c^*_n) \log(1 - c_n) \right]. \tag{5.3}
\]
When confidence estimation is systematically better than the word correct ratio c¯∗ , the NCE
is positive. For perfect confidence scores, the NCE is 1. Since the NCE is sensitive to the absolute values of the confidence scores, a Piece-wise Linear Mapping (PWLM) is commonly used to boost the NCE (Evermann and Woodland, 2000a). The PWLM is estimated on a development set and maps the confidence scores closer to the probability of the recognised token/word being correct, while maintaining the relative ordering of the tokens/words based on their confidence scores.
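As an illustration, the NCE in Equations (5.1)-(5.3) can be computed with a few lines of NumPy. The sketch below is illustrative only and is not the evaluation code used in this work; the clipping constant eps is an implementation detail rather than part of the definition.

import numpy as np

def nce(c_star, c, eps=1e-12):
    # Normalised Cross Entropy of Equations (5.1)-(5.3).
    # c_star: binary targets in {0, 1}; c: estimated confidence scores in [0, 1].
    c_star = np.asarray(c_star, dtype=float)
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    c_bar = np.clip(c_star.mean(), eps, 1.0 - eps)      # token/word correct ratio
    h_target = -(c_bar * np.log(c_bar) + (1.0 - c_bar) * np.log(1.0 - c_bar))
    h_cross = -np.mean(c_star * np.log(c) + (1.0 - c_star) * np.log(1.0 - c))
    return (h_target - h_cross) / h_target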
However, in some applications such as keyword spotting and data filtering, it is the order of
tokens ranked by confidence scores that matters. In these cases, operating points have to be
chosen where hypotheses with confidence scores above a certain threshold c̃ are deemed to be
correct and incorrect otherwise. There are four outcomes of the binary decision as shown in
Table 5.1.
                          predicted confidence
                          c ≥ c̃                    c < c̃
  target       c∗ = 1     true positive (TP)       false negative (FN)
  confidence   c∗ = 0     false positive (FP)      true negative (TN)

Table 5.1 Outcomes of the binary decision given a confidence threshold c̃.
Precision-Recall (P-R) curves are commonly used to illustrate the operating characteris-
tics (Davis and Goadrich, 2006), where
precision(c̃) = TP(c̃) / (TP(c̃) + FP(c̃)),  (5.4)

recall(c̃) = TP(c̃) / (TP(c̃) + FN(c̃)).  (5.5)
For a given threshold, precision is the fraction of true positives over all samples that are deemed
to be positives by the confidence estimator, and recall is the fraction of true positives over all
samples that are actually positive. Normally, when the threshold c̃ increases, there are fewer
false positives and more false negatives, which leads to higher precision and lower recall. The
trade-off behaviour between precision and recall yields a downward trending curve from the top
left corner to the bottom right corner. Therefore, the Area Under the Curve (AUC) can measure
the quality of the confidence estimator, which has a maximum value of 1. It is worth noting
that two confidence estimators can have the same AUC value but different NCE values. Using
P-R curves is more informative than the Receiver Operating Characteristics (ROC) curves
under unbalanced classes (Saito and Rehmsmeier, 2015). In practice, a downstream application
normally needs to make decisions based on confidence scores. The Equal Error Rate (EER) is
where the false negative rate (FN/(TP + FN)) equals the false positive rate (FP/(FP + TN)),
which is the optimal operating point if false acceptance and false rejection have equal costs.
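Both ranking-based metrics can be computed with standard tools. A possible sketch using scikit-learn is given below; it is illustrative only, and the exact evaluation scripts used in this work may differ.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

def pr_auc_and_eer(c_star, c):
    # Area under the precision-recall curve defined by Equations (5.4)-(5.5).
    precision, recall, _ = precision_recall_curve(c_star, c)
    pr_auc = auc(recall, precision)
    # EER: operating point where the false negative rate equals the false positive rate.
    fpr, tpr, _ = roc_curve(c_star, c)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return pr_auc, eer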
Figure 5.1 Filtering behaviour of a conventional HMM-based system and an AED model based
on confidence scores. Utterances with confidence higher than a threshold (x-axis) are selected,
and the WER of the filtered subset (y-axis) is plotted. Both systems are trained on LibriSpeech
100-hour data and the test-clean set is used for filtering.
Figure 5.2 Confidence Estimation Module (CEM). The AED model in the shaded box is frozen
during CEM training.
Confidence scores can be estimated at different granularities, such as a token, a word or an utterance. For AED models, each step in the auto-regressive decoder is treated as a
classification task over all possible output tokens. However, there is a subtle difference between
calibration for standard classification and confidence scores for sequences. For a hypothesis
sequence, each token can either be correct, a substitution or an insertion. Because of the
auto-regressive nature of the decoder and the use of the teacher forcing approach for training,
the calibration behaviour for sequences with an incorrect history is uncertain. Furthermore, a model can become poorly calibrated when it is made very deep and large in pursuit of state-of-the-art WER (Guo et al., 2017).
To obtain high-quality confidence scores while maintaining the WER performance of
the ASR model, the CEM is proposed as shown in the top box in Figure 5.2. The CEM is
designed to be a lightweight module that can be easily configured on top of any AED model.
The CEM gathers information from the attention mechanism, the decoder state, the Softmax
output probabilities, and the current token embedding as the input of the confidence feature
extractor. The feature extractor can be any Deep Neural Network (DNN) that transforms a
high-dimensional input feature to a single dimension. Then the sigmoid output layer generates
a value between 0 and 1 that indicates the confidence score c(hi ) for the current token, i.e.
c(h_i) = Sigmoid(FeatureExtractor(v_i, d_i, p(h_i | h_{0:i−1}, o_{1:T}), h_i)).  (5.6)
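A minimal PyTorch sketch of Equation (5.6) is given below. The layer sizes and the single fully-connected feature extractor are illustrative choices, and the inputs are assumed to be features already gathered from the frozen AED model.

import torch
import torch.nn as nn

class ConfidenceEstimationModule(nn.Module):
    # c(h_i) = Sigmoid(FeatureExtractor(v_i, d_i, p(h_i | h_{0:i-1}, o_{1:T}), emb(h_i)))
    def __init__(self, ctx_dim, dec_dim, vocab_size, emb_dim, hidden_dim=256):
        super().__init__()
        in_dim = ctx_dim + dec_dim + vocab_size + emb_dim
        self.feature_extractor = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, ctx, dec_state, softmax_probs, token_emb):
        # All inputs have shape [batch, num_tokens, dim]; they are detached from the
        # AED model, whose parameters stay frozen during CEM training.
        feats = torch.cat([ctx, dec_state, softmax_probs, token_emb], dim=-1)
        return torch.sigmoid(self.output(self.feature_extractor(feats))).squeeze(-1)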
Assuming there is an existing well-trained AED model available, N -best hypotheses can
be generated by running a beam search using the model. Then the edit distance between each
hypothesis sequence h can be computed with respect to the ground truth reference sequence
w. The alignment from the edit distance computation can be used as the target for confidence
if correct tokens are assigned as 1 while substituted or inserted tokens are assigned as 0. For
example, if the ground truth sequence is “A B C D” and one of the hypotheses is “A C C D”, then the binary target sequence is c∗ = [1, 0, 1, 1]. Note that confidence scores are only
associated with hypothesised tokens and deletion errors are not modelled here. For each of
the N -best hypotheses, the CEM is trained to minimise the binary cross entropy between the
estimated confidence c and the target c∗ ,
L(c∗, c) = −(1/L) Σ_{i=1}^{L} [ c∗(h_i) log(c(h_i)) + (1 − c∗(h_i)) log(1 − c(h_i)) ].  (5.7)
The total loss for an utterance is the aggregated confidence estimation loss for all the N -best
hypotheses. During CEM training, all parameters of the AED model are fixed.
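The construction of the binary targets and the loss in Equation (5.7) can be sketched as follows. The alignment routine is a plain Levenshtein back-trace written out for illustration; deletions produce no hypothesis token and are therefore skipped.

import torch
import torch.nn.functional as F

def confidence_targets(hyp, ref):
    # Label each hypothesis token: 1 if correct, 0 if a substitution or an insertion.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    targets, i, j = [0] * n, m, n
    while j > 0:
        if i > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            targets[j - 1] = int(ref[i - 1] == hyp[j - 1])   # correct or substituted token
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1                                           # deletion: no hypothesis token
        else:
            targets[j - 1] = 0                               # inserted token
            j -= 1
    return targets

def cem_loss(confidences, targets):
    # Binary cross-entropy of Equation (5.7) for one hypothesis; both arguments are float
    # tensors, e.g. targets = torch.tensor(confidence_targets(h, w), dtype=torch.float).
    return F.binary_cross_entropy(confidences, targets)

For example, confidence_targets(["A", "C", "C", "D"], ["A", "B", "C", "D"]) returns [1, 0, 1, 1], matching the example above.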
The LibriSpeech “train-clean-100” subset (see Section A.2), containing read speech from
audiobooks, was used for model training. It reflects a typical use case of confidence scores
where the WERs are moderately high and the amount of supervised data is limited. The dev and
test sets are the dev-clean/dev-other and test-clean/test-other sets, each of which contains over 5 hours of audio. The input features are 80-dimensional filterbank coefficients with ∆ and ∆∆ values appended (240 dimensions per frame). The output targets are 16k word-pieces tokenised from the full LibriSpeech training set using a Word Piece Model (WPM) (Schuster
and Nakajima, 2012). All experimental results in Section 5.2.4 on LibriSpeech are on the
test-clean/test-other sets.
The baseline model is an AED model trained using the open-source Lingvo toolkit (Shen et al.,
2019). The encoder consists of a 2-layer convolutional neural network with max-pooling and a
stride of 2 and a 4-layer bi-directional Long Short-Term Memory (LSTM) network with 1024
units in each direction. The decoder has a 2-layer uni-directional LSTM network with 1024
units. The total number of parameters in the baseline model is 184 million. Training used the
Adam optimiser (Kingma and Ba, 2015) with a learning rate of 0.001 and the batch size was
512. During training, five regularisation techniques have been adopted: dropout (Srivastava
et al., 2014) of 0.1 on the decoder; uniform label smoothing (Szegedy et al., 2016) of 0.2;
Gaussian weight noise (Graves, 2011) with zero mean and a standard deviation of 0.05 for
all model parameters after 20k updates; SpecAugment (Park et al., 2019) with 2 frequency
masks with mask parameter F = 27, 2 time masks with mask parameter T = 40, and time
warping with warp parameter W = 40; and an Exponential Moving Average (EMA) (Polyak and Juditsky, 1992) of all model parameters. For details of various
regularisation techniques, please refer to Section 2.3.4.
Many training techniques have been developed for deep neural networks. They normally
reduce the extent to which the model overfits the training data and improve the generalisation
of the model. State-of-the-art performance is often achieved when a large model is trained
using aggressive regularisation (Chiu et al., 2018). As for the baseline setup described in
Section 5.2.3.2, five techniques have been used. Broadly speaking, regularisation methods can
be classified into three categories, i.e. augmenting input features (SpecAugment), manipulating
model weights (dropout, EMA & weight noise), and modifying output targets (label smoothing).
In Table 5.2, the ASR WERs and the confidence metric AUCs of five additional models
are shown, where each model is trained by removing one regularisation technique from the
baseline setup. The confidence scores were computed using Softmax probabilities.
                        WER ↓          AUC ↑
  baseline              7.5/21.6       0.976/0.912
  − dropout             7.8/22.0       0.977/0.916
  − EMA                 8.2/24.8       0.974/0.903
  − label smoothing     10.6/24.6      0.985/0.950
  − weight noise        12.9/25.8      0.978/0.925
  − SpecAugment         10.8/34.3      0.952/0.911

Table 5.2 ASR and token-level confidence performance when removing one regularisation method from the baseline model, on test-clean/test-other. Confidence scores are based on the raw Softmax probabilities.
In general, removing a regularisation technique results in a higher WER. However, AUCs based on Softmax scores may be even higher when the ASR
system becomes worse by excluding a regularisation technique during training. For example,
by removing label smoothing, the AUC is unexpectedly better than the baseline. This shows
that although Softmax probabilities can be directly used as confidence scores, they can be
heavily affected by regularisation techniques. Ideally, confidence estimation should perform
well regardless of the specific training procedure. Since confidence estimation is only an
auxiliary task to the main ASR task, improved confidence estimators should not sacrifice ASR
WER performance.
To keep good ASR WER performance while having reliable confidence scores, a dedicated
CEM can be trained as in Section 5.2.2. During CEM training, more aggressive SpecAugment
(10 time masks with mask parameter T = 50) is used to increase the WER on the training
set for more negative training samples. For each utterance, 8-best hypotheses are generated
on-the-fly and are aligned with the reference to obtain the binary training targets. The CEM
only has one fully-connected layer with 256 units. The number of additional parameters is 0.4%
of the baseline AED model. A Piece-wise Linear Mapping (PWLM) (Evermann and Woodland, 2000a) is estimated on dev-clean/dev-other and then applied to test-clean/test-other, so that the confidence scores better match the token or word correctness. Since the PWLM is
monotonic, the NCE is boosted while the AUC remains unchanged as the relative order of the
confidence scores is unchanged. Table 5.3 reports the confidence metrics at the token level and
the word level.
                      AUC ↑           NCE ↑ (w/o PWLM)    NCE ↑ (w/ PWLM)
  token   Softmax     0.976/0.912     -0.195/0.131        0.166/0.172
          CEM         0.990/0.958      0.189/0.019        0.344/0.275
  word    Softmax     0.981/0.927     -0.180/0.139        0.269/0.195
          CEM         0.990/0.962      0.192/0.039        0.350/0.270

Table 5.3 Comparison of confidence scores between using Softmax probabilities and using the CEM on the baseline model. Piece-wise Linear Mapping (PWLM) was estimated on the dev sets and applied on the test sets to improve the NCE metric. The first row corresponds to the baseline in Table 5.2.
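The PWLM itself can be estimated with simple binning on a development set. The sketch below is one possible realisation, using equal-width bins and forcing the mapping to be non-decreasing so that the relative ordering, and hence the AUC, is preserved; the exact mapping used in this work may differ.

import numpy as np

def fit_pwlm(dev_conf, dev_correct, num_bins=20):
    # Map raw confidence scores to the empirical probability of being correct,
    # estimated per confidence bin on the dev set (both inputs are NumPy arrays).
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    centres, accuracies = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dev_conf >= lo) & (dev_conf < hi)
        if mask.any():
            centres.append((lo + hi) / 2.0)
            accuracies.append(dev_correct[mask].mean())
    accuracies = np.maximum.accumulate(accuracies)   # enforce a monotonic mapping
    return lambda conf: np.interp(conf, centres, accuracies)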
The word-level confidence is the average of the token-level ones if a word consists of
multiple tokens. The AUC is improved at both token and word levels by using the CEM. Unlike
the use of Softmax probabilities, the NCE values are all positive for the CEM. After PWLM,
the CEM yields much higher NCE values than when using Softmax probabilities. The AUC
values given in Table 5.3 do not show the whole picture. As shown in Figure 5.3, the P-R curves
of Softmax and CEM are drastically different. A sharp downward spike at the high-confidence
region in Figure 5.3a corresponds to a low precision and a low recall. In other words, the
Softmax probabilities are overconfident for some incorrect tokens, which also explains the
spike shown in Figure 5.1. The CEM, however, suffers little from overconfidence as Figure 5.3b
depicts the desired trade-off between precision and recall. Overall, the CEM is a more reliable
confidence estimator for both the AUC and NCE metrics.
(a) Softmax  (b) CEM
Figure 5.3 Precision-recall curves for token-level confidence scores on LibriSpeech test-clean
and test-other sets.
For AED-based ASR models, shallow fusion of a Language Model (LM) (Gülçehre et al., 2015)
is commonly used to improve ASR performance during decoding. The effect of shallow fusion
on confidence estimation was investigated. The LM used was a three-layer LSTM network
with width 4096 trained on the LibriSpeech LM corpus, which shares the same word-piece
vocabulary as the AED model. To use LM information for confidence estimation, the input to the CEM was extended with the LM probability of the current token. The other aspects of the
setup are the same as used in Section 5.2.2.
Table 5.4 shows the word-level confidence scores after PWLM for both Softmax probabili-
ties and the CEM. Although WERs on test-clean and test-other decreased by 8∼9% relative,
there is no clear improvement in AUC and there is even a substantial degradation in NCE. Comparing the first and second blocks of Table 5.4, the CEM improves the quality of confidence
estimation even more noticeably when an additional LM is used. The contrast of P-R curves
between Softmax and CEM with LM shallow fusion is similar to Figure 5.3.
Table 5.4 ASR and word-level confidence performance for models with and without Recurrent
Neural Network Language Model (RNNLM) shallow fusion (with PWLM).
5.2.5 Analysis
5.2.5.1 Generalisation to a Mismatched Domain
Since the CEM is a model-based approach and the training data for the CEM is the same
as for the ASR model, the CEM is naturally more confident on the training set. Although
the mismatch between training and test for the CEM is mitigated by having more aggressive
augmentation during training and applying a PWLM estimated on dev sets during testing,
it is still unclear how well the confidence scores from the CEM generalise to data from a
mismatched domain. The Wall Street Journal (WSJ) corpus (Paul and Baker, 1992) is a dataset
of clean read speech of news articles and is in a moderately mismatched domain compared
to LibriSpeech in terms of the speaker, style and vocabulary. In Table 5.5, the WSJ eval92
test set was fed into the same setup as in Section 5.2.4.3, where all models were trained on
LibriSpeech.
Table 5.5 ASR and confidence performance on WSJ eval92 with a PWLM. The PWLM was
estimated on LibriSpeech dev-other set. The LM used was trained on the LibriSpeech LM
corpus as in Section 5.2.4.3.
Similar to the observations in Table 5.4, shallow fusion worsens the confidence estimation
by Softmax probabilities despite reduced WER. The CEM improves the quality of confidence
estimation considerably with or without an LM.
As mentioned in Section 5.1, confidence scores are widely used to select unlabelled data for
semi-supervised learning in order to improve ASR performance. First, a speech recogniser
is trained using the limited transcribed data. Then the recogniser transcribes the unlabelled
data, and the resulting automatic transcriptions can be used as noisy labels to train the existing model further. However, erroneous automatic transcriptions can hurt the model. If confidence scores can reflect the WER well,
filtering out utterances with low confidence can be beneficial to semi-supervised training.
Similar to Figure 5.1 and plots used for semi-supervised learning (Park et al., 2020),
Figure 5.4 shows the WER of the filtered utterances whose confidence scores are above the
corresponding threshold. If confidence scores strongly correlate with WER, a higher threshold
will filter a subset with lower WER. In Figure 5.4a, sharp spikes at the high confidence threshold
region clearly indicate overconfidence based on Softmax probabilities. In contrast, curves for
all three test sets in Figure 5.4b are monotonically decreasing (i.e. without spikes), which shows
that confidence scores from the CEM match WER more closely.
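Plots such as Figures 5.1 and 5.4 only require per-utterance confidence scores and error counts. A possible sketch is given below, where utt_errors is assumed to hold the number of word errors of each utterance against its reference.

import numpy as np

def filtered_wer(utt_conf, utt_errors, utt_ref_words, threshold):
    # WER of the subset of utterances whose confidence is above the threshold.
    keep = np.asarray(utt_conf) >= threshold
    if not keep.any():
        return float("nan")
    return np.asarray(utt_errors)[keep].sum() / np.asarray(utt_ref_words)[keep].sum()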
(a) Softmax  (b) CEM (curves for test-clean, test-other and WSJ-eval92)
Figure 5.4 WERs of filtered utterances w.r.t. confidence thresholds for Softmax and CEM with
LM shallow fusion.
5.3 Residual Energy-Based Model

As a discriminator between correct and erroneous hypotheses, R-EBMs can also produce
utterance-level confidence scores for AED models. Some downstream tasks only require
utterance-level confidence scores, such as data selection for semi-supervised learning (Park
et al., 2020; Zhang et al., 2020), and hypothesis-level model combination (Qiu et al., 2021b).
Compared to token or word level confidence, direct modelling of utterance-level confidence
scores implicitly takes deletion errors into account, and does not require calibration of the
confidence scores (e.g. using PWLM) before taking the average for utterance-level scores.
Previously, including the CEM in Section 5.2.2, various model-based methods have been used
for confidence estimation for AED models (Kumar et al., 2020; Oneata et al., 2021; Qiu et al.,
2021a,b; Woodward et al., 2020), and N -best re-ranking models (Li et al., 2019d; Ogawa et al.,
2018; Sainath et al., 2019; Variani et al., 2020a) have been proposed to improve WER. The
R-EBM is a single model that can improve both speech recognition WER performance and
utterance-level confidence estimation performance at the same time.
ASR models the conditional distribution of the text sequence h given the input acoustic
sequence O. For AED models, the model distribution can be expanded using the chain rule,
P(h|O) = P(h_1 | O) ∏_{i=2}^{L} P(h_i | h_{1:i−1}, O),  (5.8)

where L is the number of tokens in h. An R-EBM augments this locally normalised distribution with a residual energy term, which gives the joint distribution

P_θ(h|O) = P(h|O) exp(−E_θ(O, h)) / Z_θ(O),  (5.9)

where P_θ is the joint model, E_θ is the residual energy function, and Z_θ is the partition function for the energy-based model, which can be computed as

Z_θ(O) = Σ_h P(h|O) exp(−E_θ(O, h)).  (5.10)
Figure 5.5 Schematic of an R-EBM for an AED model. The baseline AED model in the shaded
box is fixed during R-EBM training.
In principle, the R-EBM itself can be any model that takes a pair of acoustic sequence O
and hypothesis sequence h to produce a scalar value (−Eθ (O, h)). Here, the R-EBM has a
feature extractor similar to the CEM. For each token in a hypothesis, it takes features including
the current decoder hidden state, the acoustic context vector from the attention mechanism, the
output token embeddings, and the top-K Softmax probabilities at each output step (Li et al., 2021d). Then a sequence model transforms these features and performs pooling over the hidden
representations of the whole sequence. The pooled representation is then passed to the output
layer that uses the sigmoid activation function. Since R-EBMs operate at the utterance level,
bi-directional models can be used.
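A possible PyTorch sketch of such an R-EBM is shown below; the bi-directional LSTM encoder, the mean pooling and the layer sizes are illustrative choices rather than the exact configuration used in the experiments.

import torch
import torch.nn as nn

class ResidualEBM(nn.Module):
    # Maps the per-token features of a hypothesis to a single scalar -E_theta(O, h);
    # sigmoid(-E_theta(O, h)) then serves as an utterance-level confidence score.
    def __init__(self, feat_dim, hidden_dim=512, num_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                               batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_feats):
        # token_feats: [batch, num_tokens, feat_dim] gathered from the frozen AED model
        # (decoder state, attention context, token embedding, top-K probabilities).
        hidden, _ = self.encoder(token_feats)
        pooled = hidden.mean(dim=1)                # pooling over the whole hypothesis
        return self.output(pooled).squeeze(-1)     # negative energy -E_theta(O, h)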
5.3.2.2 Training
For a system with a token vocabulary size V and a maximum output sequence length L, the
partition function quickly becomes intractable as the summation is over V L possible sequences.
With the baseline auto-regressive model P fixed, noise contrastive estimation for conditional
models (Ma and Collins, 2018) can be used to train the R-EBM where the noise distribution is
the auto-regressive ASR model P . The loss can be expressed as
L(θ) = E_O [ E_{h+ ∼ P_data(·|O)} log( 1 / (1 + exp(E_θ(O, h+))) ) + E_{h− ∼ P(·|O)} log( 1 / (1 + exp(−E_θ(O, h−))) ) ],  (5.11)
where h+ are positive samples from the data distribution and h− are negative samples from
the noise distribution. For ASR, the noise samples are the N -best hypotheses from the auto-
regressive model via beam search, and the positive samples are the corresponding ground truth
transcriptions. Thus, the R-EBM is effectively a discriminator between the incorrect set of
sequences H− and correct set of sequences H+ , trained using the binary cross-entropy loss,
L(O; θ) ≈ (1/|H+|) Σ_{h ∈ H+} log( 1 / (1 + exp(E_θ(O, h))) ) + (1/|H−|) Σ_{h ∈ H−} log( 1 / (1 + exp(−E_θ(O, h))) ),  (5.12)
where H+ ∪ H− = {w, BeamSearch(O, N)}. Note that there may be more than one element in H+, since multiple tokenisations of the ground truth using sub-word units may exist or the ground truth may be among the N-best hypotheses. Finally, the parameters of the R-EBM can be estimated by optimising the binary classifier over all the utterances in the entire dataset.
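In other words, Equation (5.12) is a binary cross-entropy over the negative energies of the positive and negative sequences. A sketch using the ResidualEBM above is given below; minimising this loss is equivalent to maximising the objective in Equation (5.12).

import torch
import torch.nn.functional as F

def r_ebm_loss(neg_energy_pos, neg_energy_neg):
    # neg_energy_pos / neg_energy_neg hold -E_theta(O, h) for hypotheses in H+ and H-.
    pos_loss = F.binary_cross_entropy_with_logits(
        neg_energy_pos, torch.ones_like(neg_energy_pos))    # push sigmoid(-E) towards 1
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_energy_neg, torch.zeros_like(neg_energy_neg))   # push sigmoid(-E) towards 0
    return pos_loss + neg_loss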
5.3.2.3 Inference
For a given utterance during inference, the log-likelihood of a hypothesis and the negative energy value of the hypothesis are added to obtain the joint score as in Equation (5.13),

ĥ = argmax_h [ log P(h|O) − E_θ(O, h) ].  (5.13)

With a shared partition function, the N-best hypotheses can be re-ranked based on the joint scores to yield the best candidate. In practice, a variant of Equation (5.13) is used and will be
described in Section 5.3.4.
From another perspective, the R-EBM is a binary classifier that learns to assign scores
close to 1 for correct hypotheses and 0 for erroneous ones, which is also the objective for
utterance-level confidence scores (Kumar et al., 2020). Confidence scores can be used to
automatically assess the quality of transcriptions of ASR systems. For applications where
utterance-level confidence scores are also required, the R-EBM can be used to achieve two aims.
The pre-sigmoid values of R-EBM (−Eθ (O, h)) can be used to rerank the N -best hypotheses
to lower the word error rate of the ASR system while the post-sigmoid values can be used as a
model-based confidence measure.
The data used for the following experiments is the same as that used in Section 5.2.3. One
difference is that the modelling units used here were a set of 1024 word-pieces (Schuster and
Nakajima, 2012) derived from the LibriSpeech 100h training transcriptions.
5.3.3.2 Models
The baseline AED model was also the same as that used in Section 5.2.3, with a total number
of parameters of 145 million. The LM has 2 uni-directional LSTM layers with 1024 units in
each layer. Shallow fusion (Gülçehre et al., 2015) was used for decoding and for generating the
N -best hypotheses for R-EBMs. The hyper-parameters for the AED models, the LM and beam
search were tuned on the dev sets.
The R-EBMs were trained with the baseline ASR model fixed. The N -best hypotheses
of the training set were generated on-the-fly with a beam size N and random SpecAugment
masks. For the time masks, instead of 2 masks with a maximum of 40 frames per mask for the
baseline ASR model, 10 masks with a maximum of 50 frames per mask were used for R-EBM
training. This is to simulate the errors made by the model during inference and the randomness
of masks allows diverse errors to appear during training. The WER on the augmented training
set should ideally match that of the dev set.
After beam search, the N -best hypotheses are determined by keeping the N terminated hy-
potheses with the highest sequence-level log-likelihood. However, this criterion may favour
shorter hypotheses when finding the 1-best. Therefore, normalising the log-likelihood by the
number of tokens in each hypothesis results in a slightly lower WER as shown in Table 5.6.
Length Normalisation (LN) becomes more important when N is large. When combining the
log-likelihood score log P (h|O) with the negative energy score −Eθ (O, h) for the joint score,
an interpolation coefficient α is tuned on dev sets to minimise the WER and accommodate
potentially different numerical ranges as in Equation (5.14).
ĥ = argmax_h [ log P(h|O) / |h| − α E_θ(O, h) ].  (5.14)
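Equation (5.14) amounts to a simple rescoring of the N-best list. A sketch is given below, assuming the per-hypothesis log-likelihoods, negative energies and token counts have already been collected.

import numpy as np

def rerank_nbest(log_likelihoods, neg_energies, num_tokens, alpha):
    # Return the index of the hypothesis maximising Equation (5.14):
    # length-normalised log-likelihood plus alpha times the negative energy.
    scores = (np.asarray(log_likelihoods) / np.asarray(num_tokens)
              + alpha * np.asarray(neg_energies))
    return int(np.argmax(scores))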
Table 5.6 Impact on WER (%) by applying Length Normalisation (LN) for the log-likelihood
score on dev-clean/dev-other sets. The joint score is a linear combination of the log-likelihood
and R-EBM score. The R-EBM is a uni-directional LSTM and the beam size is 8.
Table 5.6 shows that ranking the N -best hypotheses just using the R-EBM scores reduces
the WER compared to 1-best WER without LN. After log-linear interpolation, the WER of the
joint model is lower than the 1-best results with or without LN. Therefore, all the following
experiments will use LN.
As a globally normalised model, an R-EBM can be bi-directional and take advantage of the full
utterance context in the hypotheses. Table 5.7 compares the performance of uni-directional
and bi-directional R-EBMs. Both R-EBMs have two layers of LSTMs with 512 units in each
direction. Since the bi-directional model performs slightly better, all the following experiments
will use bi-directional LSTMs for R-EBMs.
Table 5.7 Comparison of WERs (%) by using uni-directional and bi-directional LSTMs for
R-EBMs on dev-clean/dev-other sets. The beam size for decoding is 8.
The performance ceiling when reranking with either the joint log-linear interpolation score or the R-EBM score alone is the oracle WER of the N-best hypotheses. In this section, the lengths of the N-best lists range from 4
to 32 for both training and inference. Table 5.8 shows that oracle WERs improve consistently
when larger N -best lists are used whereas the 1-best WERs only have a minor reduction.
With more hypotheses available, the joint WER decreases substantially, as does the WER when ranking by R-EBM scores alone. The last column in Table 5.8 shows that the Relative Word Error Rate Reduction (WERR) of the joint model over the 1-best steadily increases with larger N-best lists, which indicates that the gain from the R-EBM outpaces that of the 1-best. With 32-best,
8.2%/6.8% WERRs were obtained for dev-clean/dev-other sets. The top 32-best will be used
for all the following experiments.
Table 5.8 WERs of the ASR model and joint models with various numbers of hypotheses used
for training and inference.
Based on previous experimental results on dev sets, bi-directional LSTMs with 32-best hypothe-
ses and length normalisation are used as the best setup. Applying the log-linear combination
coefficients tuned on dev-clean/dev-other sets, the results on two test sets were generated and
shown in Table 5.9. The AUC of the P-R curve is used as the metric for confidence estima-
tion. Also in Table 5.9, the second row corresponds to the CEM, which predicts a confidence
score for each token in the hypothesis sequence using the same input features as R-EBMs.
By averaging the token-level scores¹, utterance-level scores can be obtained and combined with the baseline model score. Note that although the CEM yields improved confidence at token and word
¹ For each hypothesis, the mean of the pre-sigmoid logits of all tokens is used for N-best reranking, whereas the mean of the post-sigmoid confidence scores of all tokens is used for confidence evaluation.
levels as in Li et al. (2021d), the utterance-level confidence may under-perform the baseline
log-likelihood score. Since the R-EBM is directly optimised for the utterance-level confidence,
issues such as multiple tokenisations for the same word or sequence and deletion errors are
addressed implicitly during training. As a result, the R-EBM reduces WERs and improves
utterance-level confidence at the same time.
              test-clean               test-other
              WER ↓      AUC ↑         WER ↓      AUC ↑
  baseline    5.61       0.684         18.68      0.529
  + CEM       5.59       0.697         18.44      0.501
  + R-EBM     5.15       0.770         17.42      0.679

Table 5.9 Recognition and confidence performance on LibriSpeech test sets. The average of token-level scores from the CEM is used as utterance-level scores. For both the CEM and R-EBM, the best interpolation coefficients are tuned on the dev sets.
5.3.4.5 Scalability
This section investigates the situation when the auto-regressive ASR model has seen much
more data such that the baseline WER is far lower. In this set of experiments, the encoder of
the ASR model is first initialised with the pre-trained wav2vec 2.0 (w2v2) model (Baevski
et al., 2020b) trained on 57.7 thousand hours of unlabelled speech data from Libri-light (Kahn
et al., 2020) and then fine-tuned on the same amount of labelled data (“train-clean-100”) as
before2 . Although the WERs in Table 5.10 are much lower than in Table 5.9, the joint model
further yields 5.3%/4.4% WERRs on test-clean/-other. Table 5.10 also shows that R-EBMs can
substantially boost confidence estimation performance.
              test-clean               test-other
              WER ↓      AUC ↑         WER ↓      AUC ↑
  w2v2        2.63       0.786         4.74       0.684
  + R-EBM     2.49       0.928         4.53       0.890

Table 5.10 Recognition and confidence performance on LibriSpeech test sets when the encoder is initialised using a pre-trained w2v2 model as a stronger baseline.
² The implementation follows Zhang et al. (2020), which shows state-of-the-art performance on LibriSpeech.
5.3.5 Analysis
5.3.5.1 Relative Improvement by Utterance Length
Figure 5.6 shows the breakdown of the WERR for the joint model and the oracle hypotheses
with respect to the number of words in the reference sequences. The baseline ASR model is
used for this analysis. The oracle WERR, i.e. the maximum possible WERR given the N -best
lists, is lower for longer utterances, as the number of alternatives per word is smaller for a given number of top hypotheses. The general trend of the joint WERR follows the trend of the oracle
WERR except for short utterances (1-8 words in reference). We hypothesise that R-EBMs may
need more global context information to give a higher WER reduction.
Figure 5.6 Relative WER reductions (%) of the joint model (left) and the oracle (right) with respect to the number of words in the reference sequence, on test-clean and test-other.
If the joint model Pθ matches the data distribution Pdata better, then statistics computed on
a large set of samples from the two distributions should also match (Baevski et al., 2020b).
Figure 5.7 shows the density plot of the log-likelihood scores (left) and the joint model scores
(right) on test-other set. The red lines correspond to the score distributions of the ground
truth transcriptions. The distribution of log-likelihood scores of the best hypotheses from the
auto-regressive model does not match the data distribution well. However, the distribution from
the joint model is much closer to the data distribution.
Figure 5.7 Density plot of log-probability scores using the baseline model (left) and the joint
model (right) on test-other set.
Figure 5.8 The system schematic of CEM / R-EBM for confidence estimation. The pooling
layer is only used in R-EBM.
As illustrated in Figure 5.8, for each token in a hypothesis, the feature extractor takes the attention context, the decoder state and the output distribution from the AED model; the corresponding output distribution from an LM can also be included as additional features (in
red). These features are then passed to a sequence model such as a recurrent neural network or
a self-attention network. For the CEM, the hidden representation for each token is projected
to a scalar and then mapped to a confidence score between 0 and 1 by the sigmoid function.
For the R-EBM, a pooling layer reduces the hidden representations of the entire sequence to a
single representation. A projection layer with a sigmoid activation produces utterance-level
confidence scores. During training, N -best hypotheses are generated from the fixed ASR
model and the binary training targets are obtained by aligning hypotheses with the ground truth
transcription using an edit distance.
As shown in Sections 5.2.4 and 5.3.4, using the CEM and R-EBM can provide much more
reliable confidence scores than the Softmax scores obtained directly from the ASR model.
Although training data is augmented more aggressively during CEM / R-EBM training than
ASR training to avoid over-confidence, the training data used for confidence estimation is often the same as that used for the ASR model. A useful confidence estimator should not only perform
well on in-domain data, but also generalise well to OOD data without modifying the ASR
model. Assuming that some unlabelled OOD data is available, the following two methods
are proposed to improve the OOD confidence scores: pseudo transcriptions and an additional
language model.
One approach to exposing confidence estimators to OOD acoustic data is to include it in the training process (i.e. feeding the confidence estimator OOD hypotheses, as in Figure 5.8).
To this end, the existing AED model can be used to transcribe unlabelled OOD data to give
pseudo transcriptions without any data augmentation. During CEM/R-EBM training, N -
best hypotheses are generated on-the-fly with data augmentation, and are then aligned with
the pseudo transcription to produce binary confidence targets. Note that N -best hypotheses
are nearly always erroneous w.r.t. the pseudo transcription because beam search is run with
augmented input acoustic data. The effectiveness of this approach depends on the similarity
between the in-domain and OOD data. If the OOD data has, for example, very mismatched
acoustic conditions and a different speaking style, the quality of pseudo transcriptions may be
poor, which gives misleading training labels for confidence estimators.
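Schematically, this procedure only changes where the reference comes from. In the sketch below, asr_decode and align_targets are hypothetical wrappers around the frozen AED decoder and the edit-distance alignment described in Section 5.2.2; the sketch is illustrative and not the pipeline used in the experiments.

def ood_confidence_targets(ood_utterance, asr_decode, align_targets, nbest=8):
    # 1. Pseudo reference: 1-best decode of the clean, un-augmented OOD audio.
    pseudo_ref = asr_decode(ood_utterance, augment=False, nbest=1)[0]
    # 2. Training hypotheses: N-best decode with aggressive SpecAugment applied.
    hyps = asr_decode(ood_utterance, augment=True, nbest=nbest)
    # 3. Binary confidence targets by aligning each hypothesis with the pseudo reference.
    return [(hyp, align_targets(hyp, pseudo_ref)) for hyp in hyps]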
Language models have been an important source of information for confidence estimation (Jiang,
2005; Ragni et al., 2022; Yu et al., 2011). In CEM/R-EBM, an in-domain LM is normally used.
However, the difference in speaking style can lead to very different linguistic patterns. For
example, comparing audiobooks with telephone conversations, the vocabularies used are dis-
tinct, as telephone conversations are generally much more casual and spontaneous. Therefore,
it may be useful to leverage the OOD text data to train an additional LM, which can provide
additional features for confidence estimators (i.e. feeding both in-domain and OOD LM output
distribution to the feature extractor). If either an in-domain LM or an OOD LM has a high
probability for a hypothesis token, then the token is more likely to be recognised correctly.
The in-domain speech data used in this section is from audiobooks. For training, there are
57.7 thousand hours of unlabelled speech data from Libri-light (Kahn et al., 2020) (“unlab-60k”
subset) for unsupervised pre-training and 100 hours of transcribed data from LibriSpeech for
fine-tuning. The text data for language modelling has around 810 million words. The standard
dev and test sets are used for in-domain development and evaluation (see Section A.2).
Two out-of-domain datasets were used in the experiments. The TED-LIUM release 3 (TED)
corpus (Hernandez et al., 2018) contains 452 hours of talks, which are in a somewhat different
domain to audiobooks. Although the talks are mostly well prepared, the speaking style is more
casual and the language is sometimes colloquial. The text data to train LMs has around 255
million words (Rousseau et al., 2014). The standard dev and test sets from the TED dataset were
used. Another out-of-domain dataset is Switchboard (SWB) (see Section A.3). The text data
for language modelling is the combination of the SWB transcriptions and Fisher Transcriptions.
The Hub5’00 set was used as the development set and RT03 used as the evaluation set. Since
all SWB data was collected at 8 kHz, it is upsampled to 16 kHz before processing by the AED
model.
5.4.3.2 Models
The AED model follows the “Conformer XL” setup in (Zhang et al., 2020) with 24 Conformer
layers and has around 600M parameters. w2v2 (Baevski et al., 2020b), an unsupervised pre-
training method, was used to pre-train the encoder using the unlabelled in-domain Libri-light
data. The decoder is a single-layer LSTM network with 640 units. The randomly initialised
decoder was fine-tuned jointly with the encoder initialised from w2v2 using LibriSpeech 100h
labelled data. Regularisation techniques such as SpecAugment (Park et al., 2019), dropout,
label smoothing, Gaussian weight noise and exponential moving average were used to improve
performance. The modelling units were a set of 1024 word-pieces (Schuster and Nakajima,
2012) derived from the LibriSpeech 100h training transcriptions.
The model-based confidence estimators, i.e. CEM and R-EBM, were trained on the Lib-
riSpeech 100h dataset while freezing the AED model. 8-best hypotheses were generated
on-the-fly with more aggressive SpecAugment masks to simulate errors on unseen data (Li
et al., 2021f). Both the CEM and R-EBM have two-layer bi-directional LSTMs with 512
units in each direction. Since the direct output of the CEM is the token-level confidence, the
word-level confidence is obtained by taking the minimum among all the tokens per word.
The simplest baseline for confidence estimation is to directly use the Softmax probability from
the decoder output distribution. As discussed in Section 5.2.4, the Softmax-based confidence
scores can be severely impacted by the regularisation techniques used during training. In
Tables 5.11 and 5.12, the performance of the ASR model (in WER and Sentence Error Rate
(SentER)) and the corresponding confidence performance (in AUC and EER) using both
the Softmax probabilities and model-based confidence estimators are given. As shown in
Tables 5.11 and 5.12, having a dedicated confidence estimator, at either the word level using the CEM or the utterance level using the R-EBM, substantially improves the confidence metrics over the Softmax baseline on the in-domain LibriSpeech test sets.
                              Softmax              CEM
  dataset            WER      AUC↑     EER↓        AUC↑     EER↓
  LS (test-clean)    2.7      99.29    21.59       99.64    16.40
  LS (test-other)    4.9      98.79    19.93       99.40    15.39
  TED (test)         9.4      96.49    24.07       98.78    18.28
  SWB (RT03)         28.3     92.88    20.15       96.45    18.59

Table 5.11 Baseline WERs (%) and word-level confidence estimation performance.
                              Softmax              R-EBM
  dataset            SentER   AUC↑     EER↓        AUC↑     EER↓
  LS (test-clean)    31.4     77.63    41.97       91.68    21.53
  LS (test-other)    45.2     67.84    38.63       88.05    18.43
  TED (test)         72.1     33.57    47.66       71.24    23.17
  SWB (RT03)         82.7     34.91    36.59       52.31    28.35

Table 5.12 Baseline SentERs (%) and utterance-level confidence estimation performance.
On the two OOD sets, the benefit of using the CEM or R-EBM is also substantial. However,
the relative improvement of confidence estimation performance by using a confidence module
diminishes as the data becomes increasingly dissimilar to the in-domain data. For example, in
Table 5.12, the AUC on TED increased from 33.57% to 71.24% whereas the AUC on SWB only
increased from 34.91% to 52.31%. The EER on TED is reduced by more than half whereas the
EER on SWB is only reduced by 23% relative. This is expected as the confidence module is
trained only on in-domain data.
Based on the previous observations, this section explores various techniques that use the unla-
belled OOD data to improve the confidence estimator while keeping the in-domain performance
unchanged. As described in Section 5.4.2, the pseudo transcriptions on the OOD data can
be used to train confidence estimators and additional features from an LM trained on OOD
text may also be useful. The OOD data with pseudo transcriptions was mixed with in-domain
data in a 1:9 ratio for each minibatch. Table 5.13 shows the results when incorporating this
additional OOD information on the TED dataset.
pseudo   LM    word-level        utterance-level
               AUC↑    EER↓      AUC↑    EER↓
  –       –    98.78   18.28     71.24   23.17
  ✓       –    98.85   16.41     74.89   20.05
  –       ✓    98.73   18.56     73.51   21.97
  ✓       ✓    98.85   16.08     75.70   19.21

Table 5.13 Word and utterance-level confidence estimation performance in AUC (%) and EER
(%) on the TED dataset with additional OOD information for the CEM and R-EBM.
Compared with the first row of Table 5.13 (i.e. CEM and R-EBM baselines in Tables 5.11
and 5.12), using pseudo transcriptions can effectively improve AUC and reduce EER. Although
the improvement brought by additional OOD LM features is smaller, using both the pseudo
transcription and OOD LM features yields the best confidence estimator. A similar observation
is also made on the SWB dataset, which will be omitted here. Therefore, both pieces of OOD
information will be used for the following experiments.
To further validate the effectiveness of the two proposed approaches, the OOD acoustic
data is included during pretraining. This is a more challenging setup because the confidence
performance baseline will improve as pretraining can effectively reduce the WERs on OOD
data. After continued training of the w2v2 model on a mixture of in-domain and OOD data with
a reduced learning rate, the new encoder is then fine-tuned on the LibriSpeech 100h dataset.
The results are shown in Table 5.14.
The final WERs on the in-domain dataset are very similar to the baseline ASR model
(within ±0.1%), but the WER is reduced by 10.6% relative on TED and 14.5% relative on
SWB. This result shows that by having unlabelled OOD data during pre-training of the ASR
model, the WER on OOD data can be greatly reduced. An improved ASR model generally
suggests that the quality of the confidence scores is also better. By comparing the confidence
metrics before and after pretraining with OOD data (for each dataset, compare the first row in
Table 5.14a and the first row in Table 5.14b), the AUCs are generally better with pretraining.
However, EERs can be higher for the better ASR model because the EER only represents a
single operating point whereas the AUC presents the overall picture of the confidence estimator
at all operating points. With the stronger baseline (for each dataset, compare the first and second
rows in Table 5.14b), by including the OOD information for CEM or R-EBM, the confidence
quality is consistently improved. This observation shows that even when the encoder has been
exposed to OOD data during pretraining, it is still very useful to add the two pieces of OOD
information during the training of confidence estimators.

                              w/ OOD pretraining
dataset       WER   OOD info   word-level        utterance-level
                               AUC↑    EER↓      AUC↑    EER↓
TED (test)     8.4      –      98.87   17.55     74.00   22.48
                        ✓      99.06   15.56     75.90   20.37
SWB (RT03)    24.2      –      96.80   19.00     58.54   25.59
                        ✓      97.54   16.83     61.75   23.07

Table 5.14 Confidence metrics in AUC (%) and EER (%) after using CEM & R-EBM with
additional OOD information on TED & SWB. After pretraining with OOD data, ASR models
have nearly the same performance on in-domain data but lower WERs on OOD data. This is a
more challenging setup for making improvements on confidence estimators as the encoder has
been exposed to OOD data.
5.4.5 Analysis
5.4.5.1 Word-Level Confidence Calibration
Confidence metrics such as AUC and EER are only influenced by the rank ordering of the
confidence scores, but not their absolute values. However, well-calibrated word-level confidence
scores can be important for some downstream applications. In other words, the absolute value
of the confidence score should ideally reflect the probability of the word being recognised
correctly. Two commonly used metrics for evaluating calibration performance are NCE (Siu
et al., 1997) and Expected Calibration Error (ECE) (Guo et al., 2017). ECE is computed as the
average gap between the empirical accuracy and the predicted confidence after binning all words
into M buckets, i.e.
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|, \tag{5.15}$$
where B_m denotes the set of words falling within the m-th bin when the words are ranked by their confidence scores, and N is the total number of words.
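For concreteness, Equation (5.15) can be computed with the following short sketch, assuming word-level confidence scores in [0, 1] and binary correctness labels obtained from the alignment with the reference:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=50):
    """Sketch of Equation (5.15): bin the N words by confidence score and
    average the absolute gap between the per-bin accuracy acc(B_m) and the
    per-bin mean confidence conf(B_m), weighted by |B_m| / N."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Assign each word to one of n_bins equal-width bins over [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```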
Table 5.15 Word-level NCE and ECE (%) on TED & SWB test sets.
Before computing NCE and ECE values, a PWLM (Evermann and Woodland, 2000b; Guo
et al., 2017) was estimated on the dev set and then applied to the test set. In this experiment,
the PWLM uses 5 linear segments and 50 bins, and the computation of ECE also uses 50 bins.
As shown in Table 5.15, on both OOD datasets, the word-level NCE and ECE are significantly
better than the Softmax or the CEM baseline after using OOD information during the training
of CEM.
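The PWLM step can be approximated with the rough sketch below; it bins the dev-set confidences, reads off per-bin accuracies and interpolates knot values for the linear segments. This is a simplification rather than the exact estimation procedure of Evermann and Woodland (2000b).

```python
import numpy as np

def fit_pwlm(dev_conf, dev_correct, n_segments=5, n_bins=50):
    """Rough sketch of estimating a piece-wise linear mapping (PWLM) from raw
    confidence to empirical word accuracy on the dev set."""
    dev_conf = np.asarray(dev_conf, dtype=float)
    dev_correct = np.asarray(dev_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centres, accs = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = dev_conf <= hi if hi >= 1.0 else dev_conf < hi
        mask = (dev_conf >= lo) & upper
        if mask.any():
            centres.append(0.5 * (lo + hi))        # bin centre
            accs.append(dev_correct[mask].mean())  # empirical accuracy in the bin
    # Knots of the linear segments; knot values are interpolated from the
    # per-bin accuracies (a crude but serviceable fit).
    knots_x = np.linspace(0.0, 1.0, n_segments + 1)
    knots_y = np.interp(knots_x, np.array(centres), np.array(accs))
    return knots_x, knots_y

def apply_pwlm(conf, knots_x, knots_y):
    """Map raw test-set confidences through the fitted piece-wise linear function."""
    return np.interp(conf, knots_x, knots_y)
```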
Although aggregating all word-level confidence scores can yield an utterance-level score,
Section 5.3.4 showed that the utterance-level confidence is more effectively modelled by an
R-EBM which directly optimises the utterance-level objective. The improved utterance-level
confidence scores can be readily used for data selection tasks. For active learning, utterances
with low confidence are normally selected for manual transcription. For semi-supervised
learning, utterances with high confidence are often included as additional training data because
the hypotheses can be used as high-quality pseudo transcriptions.
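A small illustrative sketch of this selection step is given below (the helper and its arguments are hypothetical, not part of the thesis tooling):

```python
def split_by_confidence(utt_ids, utt_scores, k=200):
    """Hypothetical helper: rank utterances by utterance-level confidence and
    return the k lowest-scoring ones (candidates for manual transcription in
    active learning) and the k highest-scoring ones (candidates for
    pseudo-labelled semi-supervised training)."""
    order = sorted(range(len(utt_ids)), key=lambda i: utt_scores[i])
    lowest = [utt_ids[i] for i in order[:k]]
    highest = [utt_ids[i] for i in order[-k:]]
    return lowest, highest
```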
In Table 5.16, the SentERs of the 200 utterances with the lowest and the highest confidence
scores from each dataset are reported. As expected, the R-EBM with additional OOD information
during training can effectively filter utterances in either regime. In particular, for the
high-confidence utterances, OOD information is very helpful in reducing the SentER of the
utterances selected by the R-EBM.

Table 5.16 SentERs (%) of the bottom and top 200 utterances of the TED & SWB test sets filtered
using different confidence estimators.
and erroneous hypotheses. The R-EBM is globally normalised as it learns from the residual
error of the locally normalised model and complements the locally normalised AED models.
Experiments showed that the R-EBM was a very effective utterance-level confidence estimator
while reducing speech recognition error rates, even on top of a state-of-the-art model trained
using w2v2. Further analysis showed that the performance of an R-EBM may depend on
the amount of context, and confirmed that the R-EBM closed the gap between the model
distribution and the data distribution. Because deletion prediction and confidence estimation at
different levels are very similar, a multi-task training framework can be used to estimate the
word-level confidence scores, utterance-level confidence scores, and the number of deletion
errors jointly (Qiu et al., 2021a).
Both the CEM and R-EBM are model-based and their confidence estimation performance on
OOD data is of particular interest. Experiments showed that although model-based confidence
scores were more reliable than Softmax probabilities from AED models as confidence estimators
for both in-domain and OOD data, the performance on OOD data lagged far behind the in-
domain scenario. To this end, two approaches were investigated. Using pseudo transcriptions to
provide binary targets for training model-based confidence estimators and including additional
features from an OOD LM were both useful for improving the confidence scores on OOD
datasets. By exposing the CEM or R-EBM to OOD data, the word-level calibration performance
was also significantly improved. Selecting OOD data using the improved confidence estimators
is expected to aid active or semi-supervised learning.
Chapter 6

Attention-Based Encoder-Decoder Models for Speaker Diarisation
Attention-Based Encoder-Decoder (AED) models have been widely applied to solve Automatic
Speech Recognition (ASR) and other sequence-to-sequence tasks and the attention mechanism
can flexibly handle variable input and output sequence lengths. For each decoding step, the
prediction depends on the context from the encoder given by the attention mechanism and
the decoder state that represents the full output history. AED models require paired input
and output sequences for supervised training, e.g. acoustic sequences and the corresponding
transcriptions for ASR. This chapter explores the possibility of using AED models in a speech
processing task called speaker diarisation (Anguera et al., 2012; Tranter and Reynolds, 2006).
Speaker diarisation aims to identify “who spoke when” in conversations. The diarisation
pipeline normally has multiple stages (Anguera et al., 2012; Moattar and Homayounpour, 2012;
Park et al., 2022). First, the non-speech components in the audio recording are stripped and the
remainder is divided into short segments such that each segment only has one active speaker.
Then a speaker representation is extracted for each segment. Finally, a clustering algorithm
is used to determine the number of speakers in the whole recording and also which segments
belong to the same speaker. Although there are other alternatives, this chapter adopts the above
procedure to mainly focus on the use of AED models for the final clustering stage.
Clustering is normally regarded as an unsupervised task but here Discriminative Neural Clus-
tering (DNC) is proposed which formulates clustering as a supervised sequence-to-sequence
learning problem with a maximum number of clusters (Li et al., 2021c)1 . Compared to tra-
ditional unsupervised clustering algorithms, DNC learns clustering patterns from training
data without requiring a pre-defined similarity measure such as the cosine distance between speaker
embeddings.

1 Note that this work was carried out in collaboration with Florian Kreyssig with equal contribution. In this chapter,
overlapped speech is not considered and is left as future work.
6.1 Background
This section first describes the speaker diarisation pipeline that includes audio segmentation,
speaker representation extraction and clustering. As Deep Neural Network (DNN)-based
approaches have been very effective for the first two stages (Park et al., 2022), alternative ap-
proaches to replacing unsupervised clustering methods with supervised ones are then introduced.
The differences between the proposed DNC and the related work are briefly highlighted.
6.1.1.1 Segmentation
The segmentation stage aims to obtain small segments of audio where each segment only
contains speech signals from a single speaker. This is normally performed in two sub-stages.
The first one is called Voice Activity Detection (VAD), which is a classifier that determines
Figure 6.1 A simplified diarisation pipeline (segmentation, speaker representation extraction,
clustering). The grey blocks after the segmentation stage denote non-speech. Speaker clusters
are colour coded.
whether each frame of the acoustic features (see Section 3.1.1) contains speech or not. DNN-
based VAD systems (Hughes and Mierle, 2013; Wang et al., 2016; Zhang and Wu, 2013) have
recently shown better performance over traditional methods using the zero-crossing rate, energy
constraints or a phone recogniser (Savoji, 1989; Sinha et al., 2005; Tranter et al., 2004).
After removing non-speech regions, Change Point Detection (CPD) can be applied to
split each speech region into smaller speaker-homogeneous segments. A metric-based ap-
proach (Chen, 1998; Kemp et al., 2000) was widely used, until DNN-based approaches demon-
strated better performance more recently (Gupta, 2015; Hrúz and Zajíc, 2017; India et al., 2017;
Sun et al., 2021a).
For each segment, a fixed-length speaker representation is needed to allow the clustering stage
to differentiate speakers. An unsupervised method based on joint factor analysis
in the total variability space, which generates an i-vector representation (Dehak et al., 2010),
is commonly used. The cosine distance between i-vectors is used as the distance metric for
speaker diarisation (Sell and Garcia-Romero, 2014; Senoussaoui et al., 2014; Shum et al., 2011).
More recently, DNN-based approaches have emerged as a more powerful alternative. Neural
networks are trained for a speaker classification task and the hidden vector from the penultimate
layer is often used as the speaker representation (Cyrta et al., 2017; Garcia-Romero et al.,
2017; Okabe et al., 2018; Shi et al., 2020; Sun et al., 2019; Sun et al., 2021a; Variani et al.,
2014; Wang et al., 2018; Wang et al., 2018; Yella and Stolcke, 2015; Zhu et al., 2018). The
end-to-end loss (Díez et al., 2019; Heigold et al., 2016; Wan et al., 2018) and the angular
Softmax loss (Deng et al., 2019; Fathullah et al., 2020; Huang et al., 2018; Liu et al., 2019;
Wang et al., 2018; Yu et al., 2019) have been proposed to improve the representations so that
they better match the clustering algorithm.
The clustering stage usually relies on the distance metric associated with the speaker representa-
tions. Popular unsupervised algorithms include agglomerative hierarchical clustering, k-means
clustering and spectral clustering (Dimitriadis and Fousek, 2017; Garcia-Romero et al., 2017;
Karanasou et al., 2015; Ning et al., 2006; Sell et al., 2018; Shum et al., 2013; Sun et al., 2019;
Wang et al., 2018). With the DNN-based speaker representations, spectral clustering that uses
the cosine distance is commonly used. It first computes the affinity matrix between the speaker
representations for each pair of segments. Then multiple operations are conducted to refine the
affinity matrix including Gaussian blur and thresholding. Eigen-decomposition is performed
on the refined matrix and the number of clusters is determined by the largest eigen-gap of its
eigen-values. Finally, k-means clustering is used to cluster the new segment representations
formed from the eigen-vectors corresponding to the largest eigen-values. Recently, graph
neural networks have been used to improve spectral clustering (Shaham et al., 2018; Wang
et al., 2020a).
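A simplified sketch of this clustering procedure, omitting the affinity refinement steps and using numpy and scikit-learn, might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, min_k=2, max_k=4):
    """Simplified sketch of the procedure above: cosine affinity ->
    eigen-decomposition -> eigen-gap to choose the number of clusters ->
    k-means on the leading eigen-vectors. The refinement steps (Gaussian
    blur, thresholding, etc.) are omitted."""
    # Cosine affinity between L2-normalised segment embeddings.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = x @ x.T

    # Eigen-decomposition; sort eigen-values in descending order.
    eigvals, eigvecs = np.linalg.eigh(affinity)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # The largest eigen-gap among the leading eigen-values determines k.
    gaps = eigvals[:max_k] - eigvals[1:max_k + 1]
    k = int(np.clip(np.argmax(gaps) + 1, min_k, max_k))

    # k-means on the new representations formed by the leading eigen-vectors.
    return KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
```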
The performance of a speaker diarisation system is measured by the Diarisation Error Rate
(DER), which is the sum of three types of errors,
$$\mathrm{DER} = \frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}, \tag{6.1}$$
where the durations are accumulated over the whole recording.
Note that if the oracle/reference segmentation is used, which is a widely adopted setup
when testing the performance of the speaker representations or clustering, the first two terms in
the numerator of Equation (6.1) are zero as the VAD is assumed to be perfect. The resulting
metric is also referred to as the Speaker Error Rate (SpkER).
labels, even though they are strongly correlated in practice. In contrast, DNC uses the sequence-
to-sequence structure that conditions each output on the full output history. Moreover, PIT has
a complexity of O(K!) if the number of speakers is K, and can be very expensive for a large
number of speakers2 , while DNC uses a permutation-free training loss by enforcing a specific
way of ordering the output label sequence.
Here, {A, B, C, D, E} are five different speaker identities. In the first meeting, only {A, C,
E} participate and ‘E’ was the first person to speak, so for DNC the cluster label ‘1’ is assigned
to the first input vector and to every later segment from speaker ‘E’. When a new speaker speaks, ‘A’ in this case,
DNC is trained to assign the incremented cluster label ‘2’ and similarly thereafter. As shown in
the second example, DNC will assign ‘1’ to the speaker ‘A’ and ‘2’ to speaker ‘C’ according to
the order of appearance.
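This order-of-appearance relabelling can be sketched as follows (an illustrative helper operating on per-segment speaker identities):

```python
def relabel_by_appearance(speaker_ids):
    """Map real speaker identities to DNC cluster labels in order of first
    appearance: the first speaker gets label 1, each subsequently appearing
    speaker gets the next unused label."""
    first_seen, labels = {}, []
    for spk in speaker_ids:
        if spk not in first_seen:
            first_seen[spk] = len(first_seen) + 1
        labels.append(first_seen[spk])
    return labels

# 'E' speaks first, then 'A', then 'E' again, then 'C'
print(relabel_by_appearance(['E', 'A', 'E', 'C']))  # [1, 2, 1, 3]
```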
In practice, cluster boundaries are rarely clear. Without prior information, making clustering
decisions (deciding how many clusters and cluster boundaries) is intrinsically ambiguous.
Learning domain-specific knowledge contained within the data samples can help resolve
ambiguities. A model attempting to determine yi , the cluster assignment of xi , should condition
that decision on the entire input sequence X and also on all assignments made for previous
feature vectors y0:i−1 . Hence, it is proposed to model clustering with a discriminative sequence-
to-sequence model:
$$P(y_{1:N} \mid X) = \prod_{i=1}^{N} P(y_i \mid y_{0:i-1}, X), \tag{6.2}$$
$$H = \mathrm{Encoder}(X), \tag{6.3}$$
$$y_i = \mathrm{Decoder}(y_{0:i-1}, H), \tag{6.4}$$
where, for DNC, each input vector in X is the speaker embedding of a speaker-homogeneous segment rather than a frame. Treating one input sequence as one
conversation or one meeting causes the amount of supervised training data for clustering to be
severely limited. For example, in the AMI dataset, widely used for speaker diarisation (Carletta
et al., 2005), only 147 training meetings exist that in turn can be used as individual training
sequences. The three data augmentation schemes that are proposed to overcome the data
scarcity problem are called sub-sequence randomisation, input vector randomisation, and
Diaconis Augmentation (Diac-Aug). The three techniques can also be combined. The data
augmentation techniques have two, possibly competing, objectives. The first is to generate
as many training sequences (X, y1:N ) as possible. The second is for them to match the true
data distribution p (X, y1:N ) as closely as possible. The augmentation schemes also enable
DNC to learn the importance of relative speaker identities (the cluster labels) rather than real
speaker identities across segments, which allows DNC to perform speaker clustering with a
simple cross-entropy training loss function.
Figure 6.2 Examples of input vector randomisation generating two input sequences for one
label sequence.
Figure 6.3 Diac-Aug for two clusters. The rotated clusters form a new training example.
Table 6.1 Details of AMI corpus partitions used for both training the speaker embedding
generator and training the DNC model.
{+99} (resulting in the overall input window of [-107,+106]). The output vectors of the TDNN
are combined using the self-attentive layer proposed in (Sun et al., 2019). This is followed by
a linear projection down to the embedding size, which is then the window-level embedding.
The TDNN structure resembles the one used in the x-vector models (Snyder et al., 2018) (i.e.
TDNN-layers with the following input contexts: [-2,+2], followed by {-2,0,+2}, followed by
{-3,0,+3}, followed by {0}). The first three TDNN-layers have a size of 512, the fourth a size of
128, and the embedding size is 32. The embedding generator is trained as a speaker classifier on
the AMI training data with the angular Softmax loss (Liu et al., 2017) using HTK (Young et al.,
2015) and PyHTK (Zhang et al., 2019b).
By using the angular Softmax loss combined with a linear activation function for the penulti-
mate layer of the d-vector generator, the L2 -normalised window-level speaker embeddings, and
in turn the segment-level speaker embeddings, should be approximately uniformly distributed
on the unit hypersphere. Based on this assumption and the speaker embedding size of 32, the
mean and variance of individual dimensions of speaker embeddings should be close to zero and
1/32, respectively. Empirically, this assumption holds well for the mean and for most dimensions
of the variance. Variance normalisation for the DNC models was performed by scaling the
embeddings by √32.
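This assumption and the resulting scaling can be checked with a small numerical sketch using synthetic points on the unit hypersphere (not the actual d-vectors):

```python
import numpy as np

# Points distributed uniformly on the unit hypersphere have per-dimension
# mean close to 0 and variance close to 1/dim, so scaling by sqrt(dim)
# gives roughly unit variance per dimension.
dim, n = 32, 100000
emb = np.random.randn(n, dim)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalise onto the sphere

print(np.abs(emb.mean(axis=0)).max())    # close to 0
print(emb.var(axis=0).mean(), 1 / dim)   # close to 1/32
emb_scaled = emb * np.sqrt(dim)          # variance normalisation used for DNC
```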
6.4.3 Clustering
Spectral Clustering The baseline uses the refined spectral clustering algorithm proposed
in (Wang et al., 2018), the input of which is the segment-level speaker embeddings described
in Section 6.4.2. Our implementation is based on the one published by Wang et al. (2018),
but the distance measure used in the k-means algorithm is modified from the Euclidean distance to the cosine
similarity, to align exactly with the description in (Wang et al., 2018). The number of clusters allowed is set to be
between two and four.
DNC Model The Transformer used in the DNC model contains 4 encoder blocks and 4
decoder blocks with a dimension of 256. The total number of parameters is 7.3 million. The
number of heads for the multi-head attention is 4. The model architecture follows (Vaswani
et al., 2017) and is implemented using ESPnet (Watanabe et al., 2018). The Adam optimiser
was used with a variable learning rate, which first ramps up linearly from 0 to 12 in the first
40,000 training updates and then decreases in proportion to the inverse square root of the
number of training steps (Vaswani et al., 2017). A dropout rate of 10% was applied to all
parameters. Considering that the input-to-output alignment for DNC is strictly one-to-one
and monotonic (see Section 6.2, one cluster label needs to be assigned to each input vector),
the source attention between encoder and decoder, represented as a square matrix, can be
6.5 Experimental Results 151
restricted to an identity matrix. For our experiments, the source attention matrix is masked to
be a tri-diagonal matrix, i.e. only the main diagonal and the first diagonals above and below
are non-zero. This restriction was found to be important for effective training of DNC models
in preliminary experiments. In this thesis, only experiments that used monotonic restricted
attention are reported.
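A minimal sketch of such a tri-diagonal source-attention mask is shown below in generic PyTorch; it illustrates the restriction only and is not the actual ESPnet implementation used for the experiments:

```python
import torch

def tridiagonal_mask(n):
    """Boolean mask (True = allowed) restricting source attention to a
    tri-diagonal band: decoder step i may only attend to encoder positions
    i-1, i and i+1, reflecting the one-to-one monotonic alignment of DNC."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= 1

# Apply the mask to an (n x n) matrix of attention scores before the Softmax.
scores = torch.randn(6, 6)
masked = scores.masked_fill(~tridiagonal_mask(6), float("-inf"))
attn = torch.softmax(masked, dim=-1)
```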
Table 6.2 SpkERs (%) of different data augmentation techniques for sub-meetings with a length
of 50 segments on the eval set. The input vector randomisation schemes none, global and meeting
are combined with Diac-Aug.
The SpkER of a DNC model trained on non-augmented data (none in Table 6.2), is 20.19%.
Using global augmentation reduces the SpkER to 14.47%, whilst meeting augmentation only
achieves a SpkER of 23.03%. Neighbouring embeddings are challenging to cluster due to
overlapping speech and similarities in the acoustic environment. For short sub-meetings,
meeting randomisation can move such neighbouring embeddings into separate sub-meetings.
Hence, for short sub-meetings meeting might generate atypical data whilst providing far less
augmentation than global.
The trend is different after applying Diac-Aug (second column of Table 6.2). Using none
achieves a SpkER of 15.25%, and the result for meeting reduces to 13.57%, which shows that
Diac-Aug generates fairly typical data. Section 6.4.2 showed that the assumptions behind Diac-
Aug are not perfect, which explains the performance drop of global. The number of speaker
groups available from global ($\sim C^{4}_{155}$) is larger than the number of generated sub-meetings.
Thus, Diac-Aug does not increase the number of speaker groups seen by the DNC model when
used together with global.
After training to convergence with data augmentation, the DNC models were fine-tuned using only
sub-sequence randomisation for augmentation (akin to none in Table 6.2).
                             dev               eval
#segments in sub-meetings    PT      FT        PT      FT
50                           18.44   17.94     13.57   13.90
200                          23.82   20.51     16.92   16.75
500                          25.48   21.89     17.73   18.39
all                          28.13   26.15     20.65   16.92

Table 6.3 Results of DNC's SpkER performance for the four CL stages. Comparison between
only pre-training (PT) on data augmented with meeting randomisation and Diac-Aug, and
finetuning (FT) afterwards on non-augmented data.
For the stages with meeting lengths above 50, for each original meeting, 104 sub-
meetings were generated using the augmentation techniques. When the maximum length was
set to 200, applying the two best-performing techniques of Section 6.5.1 results in a SpkER
of 19.14% for global and 16.92% for meeting combined with Diac-Aug. The later CL stages
use the latter combination of techniques for data augmentation. Increasing the maximum
sub-meeting length to 500 results in a SpkER of 17.73%. The SpkER for full-length (34.0
minutes on average) meetings was 20.65%. Table 6.3 shows the results for meeting with
Diac-Aug in column “eval-PT”.
sub-meeting length                DNC     spectral clustering
#segments    duration (mins)
50            2.8                 13.90   15.89
200           9.7                 16.75   22.38
500          20.9                 18.39   23.56
all          34.0                 16.92   23.95

Table 6.4 Comparison of DNC vs. spectral clustering on the eval set for different meeting
lengths. Fine-tuned DNC models are used for comparison.
Table 6.3 also shows the results of finetuning the DNC models of column “eval-PT” by
training on data that uses only sub-sequence randomisation, but neither meeting nor Diac-Aug
(see column “eval-FT”). While the eval set SpkER increases after finetuning for some cases,
the SpkER on the dev set (i.e. the validation set) reduces in all cases. The finetuned models
are compared with spectral clustering in Table 6.4 (column “eval-FT” in Table 6.3 is column
“DNC” in Table 6.4). The spectral clustering parameters were chosen to optimise the SpkER
154 Attention-Based Encoder-Decoder Models for Speaker Diarisation
on the dev set. For all meeting lengths, the DNC model outperforms spectral clustering. The
finetuned DNC model for the full meeting length achieves a SpkER of 16.92%, which is a
reduction of 29.4% relative to spectral clustering. The SpkER of a DNC model trained with
meeting and Diac-Aug, but without CL, is only 34.48% after finetuning.
6.5.3 Visualisation
Figure 6.4 Clustering results of different algorithms for one meeting. The 32-dimensional
speaker embeddings are projected to 2-dimensional using t-distributed Stochastic Neighbour
Embedding (t-SNE). In this example, (b) shows that spectral clustering fails to identify the
correct number of speakers. Even when the correct number of clusters is given to spectral
clustering as in (c), DNC shows a better clustering result in (d).
Figure 6.4 visualises the clustering results of a sub-meeting from the eval set using a
2-dimensional t-SNE (Maaten and Hinton, 2008) projection of the 32-dimensional segment
embeddings. Figure 6.4a depicts the ground truth cluster labels and shows the
difficulty of the clustering task. For example, some samples from speaker 2 are very close
to samples from speaker 1 (top left of Figure 6.4a), and some samples from speaker 1 are
very close to speaker 4 (bottom of Figure 6.4a). As these clusters are not well separated, it
is very challenging to determine the right number of clusters and also draw the right cluster
boundaries. As a result, spectral clustering produced two clusters instead of four in Figure 6.4b,
which leads to high SpkER. If the spectral clustering algorithm is forced to produce four
clusters, the result is illustrated in Figure 6.4c. Note that the four clusters are roughly linearly
separable in Figure 6.4c, which shows that the unsupervised clustering algorithm based on
cosine distance draws hard boundaries between clusters. Finally, the result given by DNC is
plotted in Figure 6.4d. First, the DNC model correctly recognises the existence of four speakers.
Compared to Figure 6.4c, the cluster boundaries given by the DNC approach are more complex
as Figure 6.4d shows multiple points of cross over to other clusters, which is more similar to
Figure 6.4a. In particular, when two clusters have significant overlap, spectral clustering makes
more errors, e.g. it wrongly assigns many samples from speaker 2 to speaker 1 (top middle
of Figure 6.4c). By comparison, the DNC model can split these two confusing clusters better.
Overall, DNC yields a much lower SpkER than spectral clustering, even when the correct number
of speakers is provided to the unsupervised algorithm.
could be modified to use forms of monotonic attention (Arivazhagan et al., 2019; Ma et al.,
2020).
Although DNC is a novel and promising approach for speaker diarisation, several aspects
can still be improved. For example, the experimental setup assumed a perfect VAD by using
manual segmentation information. The performance of DNC needs to be verified under a more
realistic setup. As the experiments used segment-level speaker embeddings for the Transformer
model, short and long segments were treated equally in the training loss, which was not true for
the evaluation metric. In practice, a time-weighted loss function or window-level
speaker embeddings could be used instead. The DNC approach itself also requires the maximum number
of speakers to be set. More research is needed to allow DNC to handle an unknown number of
speakers. The setup in this chapter does not handle overlapping speaker regions during training
as oracle segmentation is used and these regions are excluded during evaluation. However, it
would be more useful to allow the model to identify the overlapping regions in practice. For
example, it is possible to include overlapped speech in the training data and also introduce a
special output unit as the corresponding target.
Chapter 7

Conclusions and Future Work
This thesis first describes the fundamentals of Deep Neural Networks (DNNs) and two major
Automatic Speech Recognition (ASR) paradigms. Then various novel approaches for three
speech processing topics relating to Attention-Based Encoder-Decoder (AED) models are
proposed and validated by extensive experimentation, including the Integrated Source-Channel
and Attention (ISCA) framework that combines Source-Channel Model (SCM) and AED-based
ASR systems using N -best and lattice rescoring; the Confidence Estimation Module (CEM) and
the Residual Energy-Based Model (R-EBM) that produce reliable token/word/utterance-level
confidence scores for AED models; and Discriminative Neural Clustering (DNC) that uses the
Transformer model to perform supervised clustering for speaker diarisation. In this chapter,
observations are summarised and conclusions are drawn for each proposed approach. Based
on all the aforementioned contributions, this chapter recommends potential extensions and
promising prospects to be explored in the future.
7.1 Conclusions
AED models play an increasingly important role in various aspects of speech processing. In this
thesis, AED models are regarded as complementary components to combine with traditional
systems for ASR, effective confidence estimators for AED models are developed, and AED
models are used for the clustering stage of speaker diarisation.
As discussed in Chapter 4, AED models are highly complementary to conventional SCM-
based systems by approaching the ASR task from a different perspective. There are multiple
ways to combine these two types of systems. One widely used method is to jointly train a
Connectionist Temporal Classification (CTC) model and an AED model where the encoder
is shared. Decoding is a single pass procedure based on the AED decoder where the decoder
output probabilities are interpolated with CTC prefix scores for each decoding step. There
are multiple shortcomings with this framework. As the majority of the model parameters and
the output units of the two systems are shared, the individual system cannot be adapted to its
optimal performance. The weights of the two losses for multi-task training and decoding can be
very sensitive to tune. As the decoding is label-synchronous, it is challenging to adapt the system to
process streaming data. Our proposed alternative, ISCA, is to have a two-pass system where
the SCM-based system such as a CTC model produces first pass hypotheses and the AED
model performs rescoring in a second pass. However, the vanilla CTC model yields very
high error rates. Experiments show that once the token prior, lexicon and language model are
re-introduced for CTC similar to a standard Hidden Markov Model (HMM)-based system,
the CTC model can perform reasonably well. As the proposed approach does not restrict the
output units of the SCM system and the AED model to be the same, experiments found that a
triphone HMM-based Acoustic Model (AM) trained using frame-level cross-entropy criterion
outperforms a CTC model using either grapheme or phone as targets. Under the same multi-task
training setup, the proposed ISCA approach reached a lower Word Error Rate (WER) than the
single-pass approach. To illustrate the full potential of the ISCA framework, the SCM system
and the AED model are trained separately. Experimental results showed that further reductions
in WER can be observed. Also, when the N -best lists become larger, the combined system
consistently gives better performance. These initial experiments demonstrated that the benefits
of system combination are maximised when two systems are optimised separately based on
their individual best practice. Based on these observations, more extensive experiments were
carried out on a larger scale dataset and the individual SCM and AED systems are close to the
state-of-the-art performance. Both N -best rescoring and lattice rescoring are tested for ISCA.
Note that the lattice rescoring algorithm is extended from Neural Network Language Model
(NNLM) lattice rescoring by considering the effect of the attention mechanism. Recurrent
Neural Network (RNN)-based and Transformer-based AED decoders were also compared. As
far as computational cost is concerned, RNN-based AED decoders may be more suitable for
lattice rescoring while Transformer-based may be better for N -best rescoring based on the
structure of the hypotheses and the mechanism of the AED decoder. Given the same N -best
or lattice density, lattice rescoring generally outperforms N -best rescoring for ISCA as a far
larger number of alternative hypotheses are considered in lattices. From a practical perspective,
the proposed ISCA framework allows streaming processing of speech data in the first pass and
then adjusts the final output by rescoring in the second pass to improve the performance while
the increase in latency of the service is marginal.
Confidence scores for AED-based ASR systems were investigated in Chapter 5. Experi-
ments showed that using Softmax probabilities as confidence scores is not reliable as they
often tend to be overconfident and can be heavily influenced by the regularisation techniques
used during training. The CEM was first proposed as a simple additional module on top of an
existing AED-based ASR model. The CEM is trained to predict a binary correct/incorrect target
per output token. Experimental results demonstrated that confidence scores based on the CEM
are much better than Softmax probabilities at both the token and word levels. The overconfi-
dence issue is effectively mitigated by the CEM. Further experiments indicate that the CEM
also works well after Language Model (LM) shallow fusion and can generalise to a slightly mis-
matched domain. Considering that some applications rely on utterance-level confidence scores,
simply aggregating token-level confidence scores from the CEM may not be optimal. Deletion
errors are not within the scope of the CEM but are important for utterance-level confidence. The
aggregating function from a sequence of token-level scores to an utterance-level score, such as
taking the minimum or the average, may not be optimal. Therefore, the R-EBM was proposed
to directly learn the utterance-level confidence scores. Coincidentally, the training objective
is effectively the same as a discriminator between correct and erroneous hypotheses. In other
words, the utterance-level confidence estimator is also an energy-based model that learns the
residual space between a locally normalised AED model and a globally normalised model. The
negative energy value can be used to rescore the N -best hypotheses and improve the recog-
nition performance. Experiments verified that the R-EBM improves both the utterance-level
confidence performance and reduces WERs at the same time. For utterance-level confidence
scores, the R-EBM outperforms the CEM as expected. Further experiments showed that even
under a much more challenging setup where the encoder of the AED model is pre-trained on a
large amount of data, the R-EBM can still improve the confidence and recognition performance.
Analysis of the rescoring results showed that the reduction of WER correlates with the utterance
length, which indicates that the effectiveness of the R-EBM may depend on the amount of
global context available. By simply plotting the data and model distributions, it seems that the
R-EBM does reduce the gap between them as suggested by the theory. From the perspective
of confidence estimation, both the CEM and the R-EBM are model-based approaches. They
may be subject to generalisation problems, especially when the input to the models is from
Out-of-Domain (OOD) data. However, an ideal confidence estimator should be able to provide
reliable confidence scores for both in-domain and OOD data. Assuming some unlabelled
OOD data is available, experiments showed some interesting observations. Either by including
automatic transcriptions of OOD data or by having additional OOD language models during the
training of confidence estimators, the confidence performance is boosted on OOD data while
keeping the in-domain performance unchanged. With OOD information injected during CEM
training, the word-level calibration performance on OOD data can be significantly improved.
Similarly for the R-EBM, OOD information enables the confidence estimator to better filter
OOD data, which is expected to assist active or semi-supervised learning substantially.
For speaker diarisation, DNC was proposed in Chapter 6 which is a supervised clustering
approach that outperforms the commonly used unsupervised clustering algorithms. Diarisation
experiments were carried out on a very challenging meeting corpus. The meetings were
typically longer than half an hour and there were three or four active speakers. Three data
augmentation techniques work together with curriculum learning to effectively address the data
scarcity problem. The final results showed that DNC is a very promising approach and offers a
new perspective on speaker diarisation.
steps, DNC should also be modified to handle speaker overlaps and accommodate a variable
number of speakers. Since DNC offers a supervised alternative to the clustering stage, DNC
can potentially merge with upstream neural network-based components such as voice activity
detection and speaker representation extraction. Consequently, the whole diarisation pipeline
can be optimised in an end-to-end manner, which may help mitigate the propagation of errors
across different stages. Furthermore, an improved training criterion can be designed for DNC
to have a closer match with the final evaluation criterion, i.e. the Diarisation Error Rate (DER).
As DNC does not impose any assumptions on the input, it is also straightforward to include
signals from other microphones or information from other modalities such as video or text to
push the diarisation performance even further.
Appendix A
Datasets
This appendix provides detailed information about the major datasets used in this thesis,
including Augmented Multi-Party Interaction (AMI), LibriSpeech and Switchboard (SWB).
Table A.1 AMI dataset. Train/dev/test sets follow the official split.
For language modelling on the AMI dataset, the training transcriptions are normally used
to train Language Models (LMs). Sometimes, additional text data from the Fisher (FSH)
dataset (Cieri et al., 2004a) (see Section A.3) is used to augment the training data for LMs.
A.2 LibriSpeech
The LibriSpeech dataset (Panayotov et al., 2015) is a large scale dataset with nearly 1000 hours
of read English speech sampled at 16 kHz. The content is derived from audiobooks from the
LibriVox project. In this thesis, only the 100-hour subset is used during training. There are two
subsets in both the dev set and the test set. “clean” refers to the partition of the data with low
Word Error Rate (WER) speakers, whereas “other” refers to the partition of the data with high
WER speakers, based on an existing model trained on another read speech dataset. Some key
statistics of this 100h subset dataset are given in Table A.2.
                             train     dev               test
                                       clean    other    clean    other
# utterances                 28.5k     2.7k     2.9k     2.6k     2.9k
total duration (h)           100.6     5.4      5.1      5.4      5.3
duration per utterance (s)   12.7      7.2      6.4      7.4      6.5
total # words                990.1k    54.4k    50.9k    52.6k    52.3k
# words per utterance        34.7      20.1     17.8     20.1     17.8
vocabulary size              33.8k     8.3k     7.4k     8.1k     7.6k
# speakers                   251       40       33       40       33
# speakers in training set   –         0        0        0        0

Table A.2 LibriSpeech dataset. The training set is the “train-clean-100” subset. The dev and
test sets follow the official split.
For language modelling on the LibriSpeech dataset, there is a separate text corpus available,
which is much larger than the amount of training text. The additional text data has around 810
million words and 900 thousand unique words.
A.3 Switchboard
The SWB-300 dataset (Godfrey and Holliman, 1993) (Switchboard-1 release 2) is a Conver-
sational Telephone Speech (CTS) dataset with around 2400 two-sided recordings of landline
telephone conversations sampled at 8 kHz. Each phone call is between two English speakers
and the topic of each conversation is picked from a list of 70 topics.
The Hub5 2000 evaluation data (Hub5’00) (LDC, 2002a) and its transcripts (LDC, 2002b)
are used as the dev set. Hub5’00 has two parts, Switchboard (SWB) and CallHome (CH),
where each part has 20 conversations. The SWB part has some overlapping speakers with the
training set because it was originally collected together with the training set but not released at the time. The
CH part is from the CH English corpus (Canavan et al., 1997). Note that although the two
parts are both landline telephone conversations, they are different in nature. The two callers
in SWB did not know each other and they followed the assigned topics during phone calls.
However, CH participants in each call were family or friends, which resulted in less topicality
and formality (Fiscus et al., 2000). In addition, CH contains more accented speech. Therefore,
the CH part is expected to be more challenging.
The English CTS set of RT03 (Fiscus et al., 1997) is used as the test set. It consists of 36
telephony conversations from the Switchboard Cellular (SWBC) collection (Graff et al., 2001)
and 36 from the Fisher (FSH) collection (Cieri et al., 2004b). Note that SWBC sometimes has
a large channel mismatch with SWB due to the nature of the Global System for Mobile Communications
(GSM) cellular network, which may result in a higher WER if a system
trained on SWB is used. FSH is similar to SWB data, except that the topics are
more diverse, the vocabulary has a broader range, and a different collection protocol was
used, as FSH was collected around a decade later than SWB.
Table A.3 Switchboard-1 release 2 dataset (LDC97S62). The dev set is Hub5’00
(LDC2002S09). The test set is RT03 (LDC2007S10).
For language modelling on the SWB dataset, apart from the corresponding transcriptions
of the 300-hour training data, the FSH transcriptions (Cieri et al., 2004a, 2005) are sometimes also
included as additional in-domain text data to improve the LM. The FSH text has around 22
million words and 65 thousand unique words.
References
Abdel-Hamid, O. and Jiang, H. (2013). Fast speaker adaptation of hybrid NN/HMM model
for speech recognition based on discriminative learning of speaker code. Proc. ICASSP,
Vancouver, BC, Canada. 52
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper,
J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G.F., Elsen, E., Engel,
J., Fan, L.J., Fougner, C., Hannun, A.Y., Jun, B., Han, T.X., LeGresley, P., Li, X., Lin, L.,
Narang, S., Ng, A., Ozair, S., Prenger, R.J., Qian, S., Raiman, J., Satheesh, S., Seetapun, D.,
Sengupta, S., Sriram, A., Wang, C.J., Wang, Y., Wang, Z., Xiao, B., Xie, Y., Yogatama, D.,
Zhan, J., and Zhu, Z. (2016). Deep Speech 2 : End-to-end speech recognition in English and
Mandarin. Proc. ICML, New York, NY, USA. 61
Anastasakos, T., McDonough, J., Schwartz, R., and Makhoul, J. (1996). A compact model for
speaker-adaptive training. Proc. ICSLP, Philadelphia, PA, USA. 36, 51
Anguera, X., Bozonnet, S., Evans, N.W.D., Fredouille, C., Friedland, G., and Vinyals, O.
(2012). Speaker diarization: A review of recent research. IEEE Trans. on Audio, Speech, &
Language Processing, 20:356–370. 139
Anguera, X., Wooters, C., and Hernando, J. (2007). Acoustic beamforming for speaker
diarization of meetings. IEEE Trans. on Audio, Speech, & Language Processing, 15:2011–
2022. 149
Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.C., Yavuz, S., Pang, R., Li, W., and Raffel,
C. (2019). Monotonic infinite lookback attention for simultaneous machine translation. Proc.
ACL, Florence, Italy. 156
Atal, B.S. and Hanauer, S.L. (1971). Speech analysis and synthesis by linear prediction of the
speech wave. The Journal of the Acoustical Society of America, 50:637–655. 38
Aubert, X. and Ney, H. (1995). Large vocabulary continuous speech recognition using word
graphs. Proc. ICASSP, Detroit, MI, USA. 80
Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. Proc. NIPS Deep Learning
Symposium, Barcelona, Spain. 30
Baevski, A. and Mohamed, A. (2020). Effectiveness of self-supervised pre-training for ASR.
Proc. ICASSP, Barcelona, Spain. 72
Baevski, A., Schneider, S., and Auli, M. (2020a). vq-wav2vec: Self-supervised learning of
discrete speech representations. Proc. ICLR, Addis Ababa, Ethiopia. 74
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020b). wav2vec 2.0: A framework for
self-supervised learning of speech representations. Proc. NeurIPS, Vancouver, BC, Canada.
36, 74, 75, 127, 128, 132
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. Proc. ICLR, Banff, AB, Canada. 15, 17, 18, 19, 20, 64, 145, 146
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end
attention-based large vocabulary speech recognition. Proc. ICASSP, Shanghai, China. 64,
71, 146
Bahl, L.R., Brown, P.F., de Souza, P.V., and Mercer, R.L. (1986). Maximum mutual information
estimation of hidden Markov model parameters for speech recognition. Proc. ICASSP, Tokyo,
Japan. 36, 49
Baker, J. (1975). The DRAGON system – An overview. IEEE Trans. on Acoustics, Speech, &
Signal Processing, 23:24–29. 53
Baum, L.E. and Eagon, J.A. (1967). An inequality with applications to statistical estimation
for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the
American Mathematical Society, 73:360–363. 46
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence
prediction with recurrent neural networks. Proc. NIPS, Montreal, QC, Canada. 69, 120
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3:1137–1155. 53, 55
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. Proc.
ICML, Montreal, QC, Canada. 152
Bérard, A., Besacier, L., Kocabiyikoglu, A.C., and Pietquin, O. (2018). End-to-end automatic
speech translation of audiobooks. Proc. ICASSP, Calgary, AB, Canada. 65
Beyerlein, P. (1997). Discriminative model combination. Proc. ASRU, Santa Barbara, CA,
USA. 79
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford University Press. 9, 11,
28, 31
Bishop, C.M. (2006). Pattern recognition and machine learning. Springer. 42, 46
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proc.
COMPSTAT, Paris, France. 27
Boureau, Y.L., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
visual recognition. Proc. ICML, Haifa, Israel. 13
Bourlard, H., Bourlard, H.A., and Morgan, N. (1994). Connectionist speech recognition: A
hybrid approach, volume 247. Springer. 42
Breiman, L. (1996). Bagging predictors. Machine learning, 24:123–140. 31, 78
Brown, P.F. (1987). The acoustic modeling problem in automatic speech recognition. PhD
thesis, Carnegie Mellon University. 45
Campbell, N.A. (1984). Canonical variate analysis - a general model formulation. Australian
Journal of Statistics, 26:86–96. 39
Canavan, A., Graff, D., and Zipperlen, G. (1997). CALLHOME American English speech
LDC97S42. Web Download. Philadelphia: Linguistic Data Consortium. 165
Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos,
V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Masson, A.L., McCowan, I., Post,
W., Reidsma, D., and Wellner, P.D. (2005). The AMI meeting corpus: A pre-announcement.
Proc. MLMI, Edinburgh, UK. 147, 149, 163
Caruana, R. (1997). Multitask learning. Machine learning, 28:41–75. 33
Chan, R.H.Y. and Woodland, P.C. (2004). Improving broadcast news transcription by lightly
supervised discriminative training. Proc. ICASSP, Montreal, QC, Canada. 108
Chan, W., Jaitly, N., Le, Q.V., and Vinyals, O. (2016). Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition. Proc. ICASSP, Shanghai,
China. 64, 66, 146
Chan, W., Zhang, Y., Le, Q., and Jaitly, N. (2017). Latent sequence decompositions. Proc.
ICLR, Toulon, France. 64
Chen, S. (1998). Speaker, environment and channel change detection and clustering via the
Bayesian information criterion. Proc. Broadcast News Transcription and Understanding
Workshop, Lansdowne, VA, USA. 141
Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X.,
Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Zeng, M., and Wei, F. (2021). WavLM: Large-
scale self-supervised pre-training for full stack speech processing. arXiv.org:2110.13900.
72
Chen, S.F. and Goodman, J. (1999). An empirical study of smoothing techniques for language
modeling. Computer Speech & Language, 13:359–394. 54
Chen, X., Liu, X., Gales, M.J.F., and Woodland, P.C. (2015a). Recurrent neural network
language model training with noise contrastive estimation for speech recognition. Proc.
ICASSP, South Brisbane, QLD, Australia. 55
Chen, X., Liu, X., Ragni, A., Wang, Y., and Gales, M.J.F. (2017). Future word contexts in
neural network language models. Proc. ASRU, Okinawa, Japan. 55
Chen, X., Liu, X., Wang, Y., Gales, M.J.F., and Woodland, P.C. (2016). Efficient training and
evaluation of recurrent neural network language models for automatic speech recognition.
IEEE/ACM Trans. on Audio, Speech, & Language Processing, 24:2146–2157. 104
Chen, X., Liu, X., Wang, Y., Ragni, A., Wong, J.H.M., and Gales, M.J.F. (2019). Exploiting
future word contexts in neural network language models for speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 27:1444–1454. 104
Chen, X., Tan, T., Liu, X., Lanchantin, P., Wan, M., Gales, M.J.F., and Woodland, P.C. (2015b).
Recurrent neural network language model adaptation for multi-genre broadcast speech
recognition. Proc. Interspeech, Dresden, Germany. 55
Cheng, G., Peddinti, V., Povey, D., Manohar, V., Khudanpur, S., and Yan, Y. (2017). An
exploration of dropout with LSTMs. Proc. Interspeech, Stockholm, Sweden. 104
Chiu, C.C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss,
R.J., Rao, K., Gonina, K., Jaitly, N., Li, B., Chorowski, J., and Bacchiani, M. (2018). State-
of-the-art speech recognition with sequence-to-sequence models. Proc. ICASSP, Calgary,
AB, Canada. 65, 115
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. Proc. EMNLP, Doha, Qatar. 9, 19
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based
models for speech recognition. Proc. NIPS, Montreal, QC, Canada. 15, 18, 21, 35, 64, 71
Chorowski, J. and Jaitly, N. (2017). Towards better decoding and language model integration
in sequence to sequence models. Proc. Interspeech, Stockholm, Sweden. 71, 120
Chorowski, J., Weiss, R.J., Bengio, S., and van den Oord, A. (2019). Unsupervised speech
representation learning using WaveNet autoencoders. IEEE/ACM Trans. on Audio, Speech,
& Language Processing, 27:2041–2053. 72
Chung, Y.A. and Glass, J.R. (2020). Generative pre-training for speech with autoregressive
predictive coding. Proc. ICASSP, Barcelona, Spain. 72
Cieri, C., Graff, D., Kimball, O., Miller, D., and Walker, K. (2004a). Fisher English training
speech part 1 transcripts LDC2004T19. Web Download. Philadelphia: Linguistic Data
Consortium. 164, 165
Cieri, C., Graff, D., Kimball, O., Miller, D., and Walker, K. (2005). Fisher English training
speech part 2 transcripts LDC2005T19. Web Download. Philadelphia: Linguistic Data
Consortium. 165
Cieri, C., Miller, D., and Walker, K. (2004b). The Fisher corpus: A resource for the next
generations of speech-to-text. Proc. LREC, Lisbon, Portugal. 165
Cui, J., Weng, C., Wang, G., Wang, J., Wang, P., Yu, C., Su, D., and Yu, D. (2018). Improving
attention-based end-to-end ASR systems with sequence-based loss functions. Proc. SLT,
Athens, Greece. 65, 70
Cyrta, P., Trzciński, T., and Stokowiec, W. (2017). Speaker diarization using deep recurrent
convolutional neural networks for speaker embeddings. Proc. ISAT, Szklarska Poręba,
Poland. 141
Dahl, G.E., Yu, D., Deng, L., and Acero, A. (2011). Context-dependent pre-trained deep
neural networks for large-vocabulary speech recognition. IEEE Trans. on Audio, Speech, &
Language Processing, 20:30–42. 48
Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., and Salakhutdinov, R. (2019).
Transformer-XL: Attentive language models beyond a fixed-length context. Proc. ACL,
Florence, Italy. 55
Dauphin, Y., Fan, A., Auli, M., and Grangier, D. (2017). Language modeling with gated
convolutional networks. Proc. ICML, Sydney, NSW, Australia. 67
Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves.
Proc. ICML, Pittsburgh, PA, USA. 111
Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations for monosyl-
labic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech,
& Signal Processing, 28:357–366. 37
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., and Ouellet, P. (2010). Front-end factor
analysis for speaker verification. IEEE Trans. on Audio, Speech, & Language Processing,
19:788–798. 52, 141
Del-Agua, M.Á., Giménez, A., Sanchís, A., Saiz, J.C., and Juan, A. (2018). Speaker-adapted
confidence measures for ASR using deep bidirectional recurrent neural networks. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 26:1194–1202. 109
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–22. 46
Demuynck, K., Duchateau, J., Van Compernolle, D., and Wambacq, P. (2000). An efficient
search space representation for large vocabulary continuous speech recognition. Speech
Communication, 30:37–53. 53
Deng, J., Guo, J., and Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep
face recognition. Proc. CVPR, Long Beach, CA, USA. 142
Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M. (2020). Residual energy-based
models for text generation. Proc. ICLR, Addis Ababa, Ethiopia. 107, 120, 121
Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional Transformers for language understanding. Proc. NAACL, Minneapolis, MN,
USA. 21, 55
Diaconis, P. and Shahshahani, M. (1987). The subgroup algorithm for generating uniform
random variables. Probability in the Engineering & Informational Sciences, 1:15–32. 148
Díez, M., Burget, L., Wang, S., Rohdin, J., and Černocký, J.H. (2019). Bayesian HMM based
x-vector clustering for speaker diarization. Proc. Interspeech, Graz, Austria. 141
Digalakis, V.V., Rtischev, D., and Neumeyer, L.G. (1995). Speaker adaptation using constrained
estimation of Gaussian mixtures. IEEE Trans. on Speech & Audio Processing, 3:357–366.
51
Dimitriadis, D. and Fousek, P. (2017). Developing on-line speaker diarization system. Proc.
Interspeech, Stockholm, Sweden. 142
Dong, L., Xu, S., and Xu, B. (2018). Speech-Transformer: A no-recurrence sequence-to-
sequence model for speech recognition. Proc. ICASSP, Calgary, AB, Canada. 64
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159. 28
Elman, J.L. (1993). Learning and development in neural networks: The importance of starting
small. Cognition, 48:71–99. 152
Evermann, G. and Woodland, P.C. (2000a). Large vocabulary decoding and confidence
estimation using word posterior probabilities. Proc. ICASSP, Istanbul, Turkey. 79, 109, 110,
116
Evermann, G. and Woodland, P.C. (2000b). Posterior probability decoding, confidence esti-
mation and system combination. Proc. NIST Speech Transcription Workshop, College Park,
MD, USA. 85, 108, 109, 111, 136
Fathullah, Y., Zhang, C., and Woodland, P.C. (2020). Improved large-margin softmax loss for
speaker diarisation. Proc. ICASSP, Barcelona, Spain. 142
Fiscus, J.G. (1997). A post-processing system to yield reduced word error rates: Recognizer
output voting error reduction (ROVER). Proc. ASRU, Santa Barbara, CA, USA. 79, 80, 85,
137
Fiscus, J.G., Ajot, J., Michel, M., and Garofolo, J.S. (2006). The Rich Transcription 2006
spring meeting recognition evaluation. Proc. MLMI, Bethesda, MD, USA. 142
Fiscus, J.G., Doddington, G., Le, A., Sanders, G., Przybocki, M., and Pallett, D. (1997). 2003
NIST Rich Transcription evaluation data LDC2007S10. Web Download. Philadelphia:
Linguistic Data Consortium. 165
Fiscus, J.G., Fisher, W.M., Martin, A.F., Przybocki, M.A., and Pallett, D.S. (2000). 2000 NIST
evaluation of conversational speech recognition over the telephone: English and Mandarin
performance results. Proc. NIST Speech Transcription Workshop, College Park, MD, USA.
165
Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., and Watanabe, S. (2019a). End-to-end
neural speaker diarization with permutation-free objectives. Proc. Interspeech, Graz, Austria.
143
Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., and Watanabe, S. (2019b).
End-to-end neural speaker diarization with self-attention. Proc. ASRU, Singapore. 143
Furui, S. (1986). Speaker-independent isolated word recognition based on emphasized spectral
dynamics. Proc. ICASSP, Tokyo, Japan. 38
Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12:23–38.
66
Gales, M.J.F. (1998). Maximum likelihood linear transformations for HMM-based speech
recognition. Computer Speech & Language, 12:75–98. 51
Gales, M.J.F. (2000). Cluster adaptive training of hidden Markov models. IEEE Trans. on
Speech & Audio Processing, 8:417–428. 51
Gales, M.J.F., Kim, D.Y., Woodland, P.C., Chan, R.H.Y., Mrva, D., Sinha, R., and Tranter,
S. (2006). Progress in the CU-HTK broadcast news transcription system. IEEE Trans. on
Audio, Speech, & Language Processing, 14:1513–1525. 80
Gales, M.J.F., Knill, K., and Ragni, A. (2015). Unicode-based graphemic systems for limited
resource languages. Proc. ICASSP, South Brisbane, QLD, Australia. 39, 40, 66
Gales, M.J.F. and Woodland, P.C. (1996). Mean and variance adaptation within the MLLR
framework. Computer Speech & Language, 10:249–264. 51
Gales, M.J.F. and Young, S.J. (2008). The application of hidden Markov models in speech
recognition. Foundations & Trends in Signal Processing, 1:195–304. 80
Gangireddy, S.C.R., Swietojanski, P., Bell, P., and Renals, S. (2016). Unsupervised adaptation
of recurrent neural network language models. Proc. Interspeech, San Francisco, CA, USA.
55
Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., and McCree, A. (2017). Speaker diarization
using deep neural network embeddings. Proc. ICASSP, New Orleans, LA, USA. 141, 142
Gauvain, J.L. and Lee, C.H. (1994). Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Trans. on Speech & Audio Processing, 2:291–
298. 51
Gill, P.E., Murray, W., and Wright, M.H. (1981). Practical optimization. London Academic
Press. 27
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. Proc. AISTATS, Sardinia, Italy. 29
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. Proc.
AISTATS, Fort Lauderdale, FL, USA. 17
Godfrey, J. and Holliman, E. (1993). Switchboard-1 release 2 LDC97S62. Web Download.
Philadelphia: Linguistic Data Consortium. 164
Goodfellow, I.J., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press. 9, 14, 26,
27, 30, 31, 78
Gopalakrishnan, P.S., Kanevsky, D., Nádas, A., and Nahamoo, D. (1989). A generalization of
the Baum algorithm to rational objective functions. Proc. ICASSP, Glasgow, UK. 49
Graff, D., Walker, K., and Miller, D. (2001). Switchboard cellular part 1 audio LDC2001S13.
Web Download. Philadelphia: Linguistic Data Consortium. 165
Graves, A. (2011). Practical variational inference for neural networks. Proc. NIPS, Granada,
Spain. 34, 115
Graves, A. (2012). Sequence transduction with recurrent neural networks. Proc. ICML
Representation Learning Workshop, Edinburgh, UK. 59, 62
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of
the Acoustical Society of America, 87:1738–1752. 37
Hermansky, H., Ellis, D.P.W., and Sharma, S. (2000). Tandem connectionist feature extraction
for conventional HMM systems. Proc. ICASSP, Istanbul, Turkey. 43
Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N., and Estève, Y. (2018). TED-LIUM
3: Twice as much data and corpus repartition for experiments on speaker adaptation. Proc.
SPECOM, Leipzig, Germany. 131
Hershey, J.R., Chen, Z., Le Roux, J., and Watanabe, S. (2016). Deep clustering: Discriminative
embeddings for segmentation and separation. Proc. ICASSP, Shanghai, China. 144
Higuchi, Y., Watanabe, S., Chen, N., Ogawa, T., and Kobayashi, T. (2020). Mask CTC: Non-
autoregressive end-to-end ASR with CTC and mask predict. Proc. Interspeech, Shanghai,
China. 62
Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A.W., Vanhoucke,
V., Nguyen, P., Sainath, T.N., and Kingsbury, B. (2012). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29. 36
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9:1735–1780. 15
Hori, T., Cho, J., and Watanabe, S. (2018). End-to-end speech recognition with word-based
RNN language models. Proc. SLT, Athens, Greece. 64, 72
Hori, T., Watanabe, S., and Hershey, J.R. (2017a). Joint CTC/attention decoding for end-to-end
speech recognition. Proc. ACL, Vancouver, BC, Canada. 82
Hori, T., Watanabe, S., and Hershey, J.R. (2017b). Multi-level language modeling and decoding
for open vocabulary end-to-end speech recognition. Proc. ASRU, Okinawa, Japan. 65, 72
Hori, T., Watanabe, S., Zhang, Y., and Chan, W. (2017c). Advances in joint CTC-attention based
end-to-end speech recognition with a deep CNN encoder and RNN-LM. Proc. Interspeech,
Stockholm, Sweden. 64, 66
Hrúz, M. and Zajíc, Z. (2017). Convolutional neural network for speaker change detection in
telephone speaker diarization system. Proc. ICASSP, New Orleans, LA, USA. 141
Hu, S., Xie, X., Liu, S., Yu, J., Ye, Z., Geng, M., Liu, X., and Meng, H. (2021). Bayesian
learning of LF-MMI trained time delay neural networks for speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 29:1514–1529. 105
Huang, X., Acero, A., and Hon, H.W. (2001). Spoken language processing: A guide to theory,
algorithm, and system development. Prentice Hall PTR. 40, 43, 50, 56, 57
Huang, X. and Lee, K.F. (1993). On speaker-independent, speaker-dependent, and speaker-
adaptive speech recognition. IEEE Trans. on Speech & Audio Processing, 1:150–157. 51
Huang, Z., Wang, S., and Yu, K. (2018). Angular softmax for short-duration text-independent
speaker verification. Proc. Interspeech, Hyderabad, India. 142
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2018). Quantized neural
networks: Training neural networks with low precision weights and activations. Journal of
Machine Learning Research, 18:6869–6898. 34
Hughes, T. and Mierle, K. (2013). Recurrent neural networks for voice activity detection. Proc.
ICASSP, Vancouver, BC, Canada. 141
Hwang, M.Y. and Huang, X. (1993). Shared-distribution hidden Markov models for speech
recognition. IEEE Trans. on Speech & Audio Processing, 1:414–420. 40
India, M., Fonollosa, J.A.R., and Hernando, J. (2017). LSTM neural network-based speaker
segmentation using acoustic and language modelling. Proc. Interspeech, Stockholm, Sweden.
141
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. Proc. ICML, Lille, France. 29
Irie, K., Prabhavalkar, R., Kannan, A., Bruguier, A., Rybach, D., and Nguyen, P. (2019a). On
the choice of modeling unit for sequence-to-sequence speech recognition. Proc. Interspeech,
Graz, Austria. 66
Irie, K., Zeyer, A., Schlüter, R., and Ney, H. (2019b). Language modeling with deep Trans-
formers. Proc. Interspeech, Graz, Austria. 55
Irie, K., Zeyer, A., Schlüter, R., and Ney, H. (2019c). Training language models for long-span
cross-sentence evaluation. Proc. ASRU, Singapore. 55, 104, 105
Jaitly, N. and Hinton, G.E. (2013). Vocal tract length perturbation (VTLP) improves speech
recognition. Proc. ICML Workshop on Deep Learning for Audio, Speech & Language,
Atlanta, GA, USA. 33
Jang, E., Gu, S.S., and Poole, B. (2017). Categorical reparameterization with Gumbel-Softmax.
Proc. ICLR, Toulon, France. 74
Jégou, H., Douze, M., and Schmid, C. (2011). Product quantization for nearest neighbor search.
IEEE Trans. on Pattern Analysis & Machine Intelligence, 33:117–128. 74
Jelinek, F. (1991). Up from trigrams! - The struggle for improved language models. Proc.
Eurospeech, Genova, Italy. 53
Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press. 35, 36, 40
Jiang, H. (2005). Confidence measures for speech recognition: A survey. Speech Communica-
tion, 45:455–470. 108, 131
Jim, K.C., Giles, C.L., and Horne, B.G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Trans. on Neural Networks, 7:1424–1438.
34
Juang, B.H. (1985). Maximum-likelihood estimation for mixture multivariate stochastic
observations of Markov chains. AT&T Technical Journal, 64:1235–1249. 42
Juang, B.H., Hou, W., and Lee, C.H. (1997). Minimum classification error rate methods for
speech recognition. IEEE Trans. on Speech & Audio Processing, 5:257–265. 45
Jurafsky, D. and Martin, J.H. (2008). Speech and language processing: An introduction to
speech recognition, computational linguistics and natural language processing. Prentice
Hall. 35
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchin-
sky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A.,
and Dupoux, E. (2020). Libri-Light: A benchmark for ASR with limited or no supervision.
Proc. ICASSP, Barcelona, Spain. 127, 131
Kalgaonkar, K., Liu, C., Gong, Y., and Yao, K. (2015). Estimating confidence scores on ASR
results using recurrent neural networks. Proc. ICASSP, South Brisbane, QLD, Australia. 109
Kanda, N., Fujita, Y., and Nagamatsu, K. (2018). Lattice-free state-level minimum Bayes risk
training of acoustic models. Proc. Interspeech, Hyderabad, India. 104
Kanthak, S. and Ney, H. (2002). Context-dependent acoustic modeling using graphemes for
large vocabulary speech recognition. Proc. ICASSP, Orlando, FL, USA. 40, 66
Karanasou, P., Gales, M.J.F., Lanchantin, P., Liu, X., Qian, Y., Wang, L., Woodland, P.C., and
Zhang, C. (2015). Speaker diarisation and longitudinal linking in multi-genre broadcast data.
Proc. ASRU, Scottsdale, AZ, USA. 142
Karita, S., Ogawa, A., Delcroix, M., and Nakatani, T. (2018). Sequence training of encoder-
decoder model using policy gradient for end-to-end speech recognition. Proc. ICASSP,
Calgary, AB, Canada. 70
Kastanos, A., Ragni, A., and Gales, M.J.F. (2020). Confidence estimation for black box
automatic speech recognition systems using lattice recurrent neural networks. Proc. ICASSP,
Barcelona, Spain. 109
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast
autoregressive Transformers with linear attention. Proc. ICML, Vienna, Austria. 89
Katz, S. (1987). Estimation of probabilities from sparse data for the language model component
of a speech recognizer. IEEE Trans. on Acoustics, Speech, & Signal Processing, 35:400–401.
55
Kemp, T. and Schaaf, T. (1997). Estimating confidence using word lattices. Proc. Eurospeech,
Rhodes, Greece. 109
Kemp, T., Schmidt, M., Westphal, M., and Waibel, A.H. (2000). Strategies for automatic
segmentation of audio data. Proc. ICASSP, Istanbul, Turkey. 141
Killer, M., Stüker, S., and Schultz, T. (2003). Grapheme based speech recognition. Proc.
Eurospeech, Geneva, Switzerland. 66
Kim, S., Hori, T., and Watanabe, S. (2017). Joint CTC-attention based end-to-end speech
recognition using multi-task learning. Proc. ICASSP, New Orleans, LA, USA. 66, 81, 91,
92, 97
Kingma, D.P. and Ba, J. (2015). Adam: A method for stochastic optimization. Proc. ICLR,
San Diego, CA, USA. 28, 115
Kitza, M., Golik, P., Schlüter, R., and Ney, H. (2019). Cumulative adaptation for BLSTM
acoustic models. Proc. Interspeech, Graz, Austria. 105
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. Proc.
ICASSP, Detroit, MI, USA. 55
Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015). Audio augmentation for speech
recognition. Proc. Interspeech, Dresden, Germany. 33
Kreyssig, F.L., Zhang, C., and Woodland, P.C. (2018). Improved TDNNs using deep kernels
and frequency dependent grid-RNNs. Proc. ICASSP, Calgary, AB, Canada. 149
Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet classification with deep
convolutional neural networks. Proc. NIPS, Stateline, NV, USA. 12, 33
Krogh, A. and Hertz, J.A. (1992). A simple weight decay can improve generalization. Proc.
NIPS, Denver, CO, USA. 30
Kudo, T. (2018). Subword regularization: Improving neural network translation models with
multiple subword candidates. Proc. ACL, Melbourne, VIC, Australia. 67, 99
Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Naval Research
Logistics Quarterly, 2:83–97. 144
Kumar, A., Singh, S., Gowda, D.N., Garg, A., Singh, S., and Kim, C. (2020). Utterance
confidence measure for end-to-end speech recognition with applications to distributed speech
recognition scenarios. Proc. Interspeech, Shanghai, China. 109, 121, 124
Kumar, N. (1998). Investigation of silicon auditory models and generalization of linear
discriminant analysis for improved speech recognition. PhD thesis, Johns Hopkins University.
39
LDC (2002a). 2000 HUB5 English evaluation speech LDC2002S09. Web Download. Philadel-
phia: Linguistic Data Consortium. 165
LDC (2002b). 2000 HUB5 English evaluation transcripts LDC2002T43. Web Download.
Philadelphia: Linguistic Data Consortium. 165
Le, N. and Odobez, J.M. (2018). Robust and discriminative speaker embedding via intra-class
distance variance regularization. Proc. Interspeech, Hyderabad, India. 144
LeCun, Y., Bengio, Y., and Hinton, G.E. (2015). Deep learning. Nature, 521:436–444. 9, 10
LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., and
Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1:541–551. 12
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, A., and Huang, F. (2006). A tutorial on energy-
based learning. Predicting Structured Data. 120
Lee, J. and Watanabe, S. (2021). Intermediate loss regularization for CTC-based speech
recognition. Proc. ICASSP, Toronto, ON, Canada. 62
Lee, K.F. (1988). On large-vocabulary speaker-independent continuous speech recognition.
Speech Communication, 7:375–379. 39
Lee, L. and Rose, R.C. (1996). Speaker normalization using efficient frequency warping
procedures. Proc. ICASSP, Atlanta, GA, USA. 39
Leggetter, C.J. and Woodland, P.C. (1995). Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech & Language,
9:171–185. 51
Li, B., Gulati, A., Yu, J., Sainath, T.N., Chiu, C.C., Narayanan, A., Chang, S.Y., Pang, R., He,
Y., Qin, J., Han, W., Liang, Q., Zhang, Y., Strohman, T., and Wu, Y. (2021a). A better and
faster end-to-end model for streaming ASR. Proc. ICASSP, Toronto, ON, Canada. 64, 67
Li, B. and Sim, K.C. (2010). Comparison of discriminative input and output transformations
for speaker adaptation in the hybrid NN/HMM systems. Proc. Interspeech, Makuhari, Japan.
52
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. (2019a). Bytes are all you need: End-to-end
multilingual speech recognition and synthesis with bytes. Proc. ICASSP, Brighton, UK. 40
Li, J., Zhao, R., Hu, H., and Gong, Y. (2019b). Improving RNN transducer modeling for
end-to-end speech recognition. Proc. ASRU, Singapore. 64
Li, K., Povey, D., and Khudanpur, S. (2021b). A parallelizable lattice rescoring strategy with
neural language models. Proc. ICASSP, Toronto, ON, Canada. 160
Li, K., Xu, H., Wang, Y., Povey, D., and Khudanpur, S. (2018). Recurrent neural network lan-
guage model adaptation for conversational speech recognition. Proc. Interspeech, Hyderabad,
India. 55
Li, Q. (2018). Confidence scores for speech processing. Master’s thesis, University of
Cambridge. 108
Li, Q., Kreyssig, F., Zhang, C., and Woodland, P.C. (2021c). Discriminative neural clustering
for speaker diarisation. Proc. SLT, Shenzhen, China. iii, 6, 139
Li, Q., Ness, P., Ragni, A., and Gales, M.J.F. (2019c). Bi-directional lattice recurrent neural
networks for confidence estimation. Proc. ICASSP, Brighton, UK. 109
Li, Q., Qiu, D., Zhang, Y., Li, B., He, Y., Woodland, P.C., Cao, L., and Strohman, T. (2021d).
Confidence estimation for attention-based sequence-to-sequence models for speech recogni-
tion. Proc. ICASSP, Toronto, ON, Canada. iii, 6, 109, 122, 127
Li, Q., Zhang, C., and Woodland, P.C. (2019d). Integrating source-channel and attention-based
sequence-to-sequence models for speech recognition. Proc. ASRU, Singapore. iii, 6, 121,
137
Li, Q., Zhang, C., and Woodland, P.C. (2021e). Combining frame-synchronous and label-
synchronous systems for speech recognition. arXiv.org:2107.00764. iii, 6
Li, Q., Zhang, Y., Li, B., Cao, L., and Woodland, P.C. (2021f). Residual energy-based models
for end-to-end speech recognition. Proc. Interspeech, Brno, Czech Republic. iii, 6, 109, 132
Li, Q., Zhang, Y., Qiu, D., He, Y., Cao, L., and Woodland, P.C. (2022). Improving confi-
dence estimation on out-of-domain data for end-to-end speech recognition. Proc. ICASSP,
Singapore. iii, 6
Lin, Q., Yin, R., Li, M., Bredin, H., and Barras, C. (2019). LSTM based similarity measurement
with spectral clustering for speaker diarization. Proc. Interspeech, Graz, Austria. 144
Liu, W. and Lee, T. (2021). Utterance-level neural confidence measure for end-to-end children
speech recognition. Proc. ASRU, Cartagena, Colombia. 109
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017). SphereFace: Deep hypersphere
embedding for face recognition. Proc. CVPR, Honolulu, HI, USA. 150
Liu, X., Chen, X., Wang, Y., Gales, M.J.F., and Woodland, P.C. (2016). Two efficient lattice
rescoring methods using recurrent neural network language models. IEEE/ACM Trans. on
Audio, Speech, & Language Processing, 24:1438–1449. 87, 88
Liu, X., Gales, M.J.F., Sim, K.C., and Yu, K. (2005). Investigation of acoustic modeling
techniques for LVCSR systems. Proc. ICASSP, Philadelphia, PA, USA. 39
Liu, Y., He, L., and Liu, J. (2019). Large margin softmax loss for speaker verification. Proc.
Interspeech, Graz, Austria. 142
Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network
encoder-decoder for large vocabulary speech recognition. Proc. Interspeech, Dresden,
Germany. 64
Lüscher, C., Beck, E., Irie, K., Kitza, M., Michel, W., Zeyer, A., Schlüter, R., and Ney, H.
(2019). RWTH ASR systems for Librispeech: Hybrid vs attention – w/o data augmentation.
Proc. Interspeech, Graz, Austria. 65
Ma, X., Pino, J., Cross, J., Puzon, L., and Gu, J. (2020). Monotonic multihead attention. Proc.
ICLR, Addis Ababa, Ethiopia. 156
Ma, Z. and Collins, M. (2018). Noise contrastive estimation and negative sampling for
conditional models: Consistency and statistical efficiency. Proc. EMNLP, Brussels, Belgium.
120, 123
Maas, A.L., Hannun, A.Y., Jurafsky, D., and Ng, A. (2014). First-pass large vocabulary
continuous speech recognition using bi-directional recurrent DNNs. arXiv.org:1408.2873.
61
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine
Learning Research, 9:2579–2605. 155
MacKay, D.J.C. (2003). Information theory, inference and learning algorithms. Cambridge
University Press. 35, 36, 49
Mangu, L., Brill, E., and Stolcke, A. (2000). Finding consensus in speech recognition: Word
error minimization and other applications of confusion networks. Computer Speech &
Language, 14:373–400. 58, 109, 111
Manning, C.D., Manning, C.D., and Schütze, H. (1999). Foundations of statistical natural
language processing. MIT Press. 53
McDermott, E., Sak, H., and Variani, E. (2019). A density ratio approach to language model
fusion in end-to-end automatic speech recognition. Proc. ASRU, Singapore. 72
Meng, Z., Li, J., Chen, Z., Zhao, Y., Mazalov, V., Gong, Y., and Juang, B.H. (2018). Speaker-
invariant training via adversarial learning. Proc. ICASSP, Calgary, AB, Canada. 52
Meng, Z., Parthasarathy, S., Sun, E., Gaur, Y., Kanda, N., Lu, L., Chen, X., Zhao, R., Li, J., and
Gong, Y. (2021). Internal language model estimation for domain-adaptive end-to-end speech
recognition. Proc. SLT, Shenzhen, China. 72
Miao, H., Cheng, G., Zhang, P., and Yan, Y. (2020). Online hybrid CTC/attention end-to-end
automatic speech recognition architecture. IEEE/ACM Trans. on Audio, Speech, & Language
Processing, 28:1452–1465. 83
Miao, Y., Gowayyed, M.A., Na, X., Ko, T., Metze, F., and Waibel, A.H. (2016a). An empirical
exploration of CTC acoustic models. Proc. ICASSP, Shanghai, China. 61
Miao, Y., Li, J., Wang, Y., Zhang, S.X., and Gong, Y. (2016b). Simplifying long short-term
memory acoustic models for fast training and decoding. Proc. ICASSP, Shanghai, China. 92
Mikolov, T. (2012). Statistical language models based on neural networks. PhD thesis, Brno
University of Technology. 28
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural
network based language model. Proc. Interspeech, Makuhari, Japan. 36, 53, 55
Moattar, M.H. and Homayounpour, M.M. (2012). A review on speaker diarization systems and
approaches. Speech Communication, 54:1065–1103. 139
Mohri, M., Pereira, F., and Riley, M. (2002). Weighted finite-state transducers in speech
recognition. Computer Speech & Language, 16:69–88. 53, 56
Moritz, N., Hori, T., and Roux, J.L. (2020). Streaming automatic speech recognition with the
Transformer model. Proc. ICASSP, Barcelona, Spain. 83
Nair, V. and Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines.
Proc. ICML, Haifa, Israel. 17
Ney, H., Essen, U., and Kneser, R. (1994). On structuring probabilistic dependences in
stochastic language modelling. Computer Speech & Language, 8:1–38. 55
Nguyen, H., Bougares, F., Tomashenko, N.A., Estève, Y., and Besacier, L. (2020). Investigating
self-supervised pre-training for end-to-end speech translation. Proc. Interspeech, Shanghai,
China. 72
Ning, H., Liu, M., Tang, H., and Huang, T.S. (2006). A spectral clustering approach to speaker
diarization. Proc. Interspeech, Pittsburgh, PA, USA. 142
Normandin, Y. (1991). Hidden Markov models, maximum mutual information estimation, and
the speech recognition problem. PhD thesis, McGill University. 50
Odell, J.J., Valtchev, V., Woodland, P.C., and Young, S.J. (1994). A one pass decoder design
for large vocabulary recognition. Proc. HLT, Plainsboro, NJ, USA. 53
Ogawa, A., Delcroix, M., Karita, S., and Nakatani, T. (2018). Rescoring N-best speech
recognition list based on one-on-one hypothesis comparison using encoder-classifier model.
Proc. ICASSP, Calgary, AB, Canada. 121
Okabe, K., Koshinaka, T., and Shinoda, K. (2018). Attentive statistics pooling for deep speaker
embedding. Proc. Interspeech, Hyderabad, India. 141
Oneata, D., Caranica, A., Stan, A., and Cucu, H. (2021). An evaluation of word-level confidence
estimation for end-to-end automatic speech recognition. Proc. SLT, Shenzhen, China. 109,
121
Ortmanns, S., Ney, H., and Aubert, X. (1997). A word graph algorithm for large vocabulary
continuous speech recognition. Computer Speech & Language, 11:43–72. 57
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J.V., Lakshmi-
narayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating
predictive uncertainty under dataset shift. Proc. NeurIPS, Vancouver, BC, Canada. 129
Pallett, D.S., Fisher, W.M., and Fiscus, J.G. (1990). Tools for the analysis of benchmark speech
recognition tests. Proc. ICASSP, Albuquerque, NM, USA. 103
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: An ASR corpus
based on public domain audio books. Proc. ICASSP, South Brisbane, QLD, Australia. 164
Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019).
SpecAugment: A simple data augmentation method for automatic speech recognition. Proc.
Interspeech, Graz, Austria. 33, 65, 99, 105, 115, 132
Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.C., Li, B., Wu, Y., and Le, Q.V. (2020). Improved
noisy student training for automatic speech recognition. Proc. Interspeech, Shanghai, China.
111, 119, 121, 137
Park, J., Liu, X., Gales, M.J.F., and Woodland, P.C. (2010). Improved neural network based
language modelling and adaptation. Proc. Interspeech, Makuhari, Japan. 55
Park, T.J., Kanda, N., Dimitriadis, D., Han, K.J., Watanabe, S., and Narayanan, S.S. (2022).
A review of speaker diarization: Recent advances with deep learning. Computer Speech &
Language, 72:101317. 139, 140
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural
networks. Proc. ICML, Atlanta, GA, USA. 28
Paul, D.B. and Baker, J. (1992). The design for the Wall Street Journal-based CSR corpus.
Proc. ICSLP, Banff, AB, Canada. 118
Peddinti, V., Povey, D., and Khudanpur, S. (2015). A time delay neural network architecture for
efficient modeling of long temporal contexts. Proc. Interspeech, Dresden, Germany. 12, 149
Peddinti, V., Wang, Y., Povey, D., and Khudanpur, S. (2018). Low latency acoustic modeling
using temporal convolution and LSTMs. IEEE Signal Processing Letters, 25:373–377. 104
Peskin, B., Newman, M., McAllaster, D., Nagesha, V., Richards, H.B., Wegmann, S., Hunt,
M.J., and Gillick, L. (1999). Improvements in recognition of conversational telephone speech.
Proc. ICASSP, Phoenix, AZ, USA. 80
Pham, N.Q., Nguyen, T.S., Niehues, J., Müller, M., Stüker, S., and Waibel, A.H. (2019). Very
deep self-attention networks for end-to-end speech recognition. Proc. Interspeech, Graz,
Austria. 64
Pineda, F.J. (1987). Generalization of back-propagation to recurrent neural networks. Physical
Review Letters, 59:2229. 14
Pinto, J. and Sitaram, R.N.V. (2005). Confidence measures in speech recognition based on
probability distribution of likelihoods. Proc. Interspeech, Lisbon, Portugal. 109
Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging.
SIAM Journal on Control & Optimization, 30:838–855. 32, 115
Polyak, B.T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics & Mathematical Physics, 4:1–17. 28
Povey, D. (2003). Discriminative training for large vocabulary speech recognition. PhD thesis,
University of Cambridge. 36, 50
Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., and Khudanpur, S. (2018).
Semi-orthogonal low-rank matrix factorization for deep neural networks. Proc. Interspeech,
Hyderabad, India. 86, 99
Povey, D., Ghoshal, A.K., Boulianne, G., Burget, L., Glembek, O., Goel, N.K., Hannemann,
M., Motlícek, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., and Veselý, K. (2011).
The Kaldi speech recognition toolkit. Proc. ASRU, Waikoloa, HI, USA. 98, 99
Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A.K., Janda, M., Karafiát,
M., Kombrink, S., Motlícek, P., Qian, Y., Riedhammer, K., Veselý, K., and Vu, N.T. (2012).
Generating exact lattices in the WFST framework. Proc. ICASSP, Kyoto, Japan. 58
Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., and
Khudanpur, S. (2016). Purely sequence-trained neural networks for ASR based on lattice-
free MMI. Proc. Interspeech, San Francisco, CA, USA. 43, 50, 92
Povey, D. and Woodland, P.C. (2002). Minimum phone error and I-smoothing for improved
discriminative training. Proc. ICASSP, Orlando, FL, USA. 50
Prabhavalkar, R., He, Y., Rybach, D., Campbell, S., Narayanan, A., Strohman, T., and Sainath,
T.N. (2021). Less is more: Improved RNN-T decoding using limited label context and path
merging. Proc. ICASSP, Toronto, ON, Canada. 83, 160
Prabhavalkar, R., Sainath, T.N., Li, B., Rao, K., and Jaitly, N. (2017). An analysis of "attention"
in sequence-to-sequence models. Proc. Interspeech, Stockholm, Sweden. 21
Prabhavalkar, R., Sainath, T.N., Wu, Y., Nguyen, P., Chen, Z., Chiu, C.C., and Kannan, A.
(2018). Minimum word error rate training for attention-based sequence-to-sequence models.
Proc. ICASSP, Calgary, AB, Canada. 65, 69, 120, 155
Pundak, G. and Sainath, T.N. (2016). Lower frame rate neural network acoustic models. Proc.
Interspeech, San Francisco, CA, USA. 86, 93
Qian, Y., Bi, M., Tan, T., and Yu, K. (2016). Very deep convolutional neural networks for noise
robust speech recognition. IEEE/ACM Trans. on Audio, Speech, & Language Processing,
24:2263–2276. 66
Qiu, D., He, Y., Li, Q., Zhang, Y., Cao, L., and McGraw, I. (2021a). Multi-task learning for
end-to-end ASR word and utterance confidence with deletion prediction. Proc. Interspeech,
Brno, Czech Republic. 109, 121, 137, 138, 160
Qiu, D., Li, Q., He, Y., Zhang, Y., Li, B., Cao, L., Prabhavalkar, R., Bhatia, D., Li, W., Hu,
K., Sainath, T.N., and McGraw, I. (2021b). Learning word-level confidence for subword
end-to-end ASR. Proc. ICASSP, Toronto, ON, Canada. 109, 121, 137, 160
Radford, A. and Narasimhan, K. (2018). Improving language understanding by generative
pre-training. [Online] https://blog.openai.com/language-unsupervised/. 55
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language
models are unsupervised multitask learners. [Online] https://blog.openai.com/better-language-models/. 55
Ragni, A., Gales, M.J.F., Rose, O., Knill, K., Kastanos, A., Li, Q., and Ness, P. (2022). Increas-
ing context for estimating confidence scores in automatic speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 30:1319–1329. 109, 131
Ragni, A., Li, Q., Gales, M.J.F., and Wang, Y. (2018). Confidence estimation and deletion
prediction using bidirectional recurrent neural networks. Proc. SLT, Athens, Greece. 109,
137
Ramachandran, P., Zoph, B., and Le, Q.V. (2018). Searching for activation functions. Proc.
ICLR Workshop, Vancouver, BC, Canada. 17, 68
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence level training with
recurrent neural networks. Proc. ICLR, San Juan, Puerto Rico. 69, 120
Rao, K., Sak, H., and Prabhavalkar, R. (2017). Exploring architectures, data and units for
streaming end-to-end speech recognition with RNN-transducer. Proc. ASRU, Okinawa,
Japan. 64
Riccardi, G. and Hakkani-Tür, D. (2005). Active learning: Theory and applications to automatic
speech recognition. IEEE Trans. on Speech & Audio Processing, 13:504–511. 108
Richardson, F., Ostendorf, M., and Rohlicek, J.R. (1995). Lattice-based search strategies for
large vocabulary speech recognition. Proc. ICASSP, Detroit, MI, USA. 80
Rousseau, A., Deléglise, P., and Estève, Y. (2014). Enhancing the TED-LIUM corpus with
selected data for language modeling and more TED talks. Proc. LREC, Reykjavik, Iceland.
132
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1988). Learning representations by back-
propagating errors. Cognitive Modeling, 5:1. 9, 14, 17, 25
Sabour, S., Chan, W., and Norouzi, M. (2019). Optimal completion distillation for sequence
learning. Proc. ICLR, New Orleans, LA, USA. 65, 70
Sainath, T.N., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional
neural networks for LVCSR. Proc. ICASSP, Vancouver, BC, Canada. 66
Sainath, T.N., Pang, R., Rybach, D., He, Y., Prabhavalkar, R., Li, W., Visontai, M., Liang,
Q., Strohman, T., Wu, Y., McGraw, I., and Chiu, C.C. (2019). Two-pass end-to-end speech
recognition. Proc. Interspeech, Graz, Austria. 82, 104, 121, 160
Sainath, T.N., Prabhavalkar, R., Kumar, S., Lee, S., Kannan, A., Rybach, D., Schogol, V.,
Nguyen, P., Li, B., Wu, Y., Chen, Z., and Chiu, C.C. (2018). No need for a lexicon?
Evaluating the value of the pronunciation lexica in end-to-end models. Proc. ICASSP,
Calgary, AB, Canada. 66, 95
Sainath, T.N., Weiss, R.J., Senior, A.W., Wilson, K.W., and Vinyals, O. (2015). Learning the
speech front-end with raw waveform CLDNNs. Proc. Interspeech, Dresden, Germany. 39
Saito, T. and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the
ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10. 111
Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network
architectures for large scale acoustic modeling. Proc. Interspeech, Singapore. 15
Sak, H., Senior, A., Rao, K., and Beaufays, F. (2015a). Fast and accurate recurrent neural
network acoustic models for speech recognition. Proc. Interspeech, Dresden, Germany. 61
Sak, H., Senior, A.W., Rao, K., Irsoy, O., Graves, A., Beaufays, F., and Schalkwyk, J. (2015b).
Learning acoustic frame labeling for speech recognition with recurrent neural networks.
Proc. ICASSP, South Brisbane, QLD, Australia. 61
Sanchís, A., Juan-Císcar, A., and Vidal, E. (2012). A word-based naïve Bayes classifier for
confidence estimation in speech recognition. IEEE Trans. on Audio, Speech, & Language
Processing, 20:565–574. 109
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help
optimization? Proc. NIPS, Montreal, QC, Canada. 30
Saon, G., Dharanipragada, S., and Povey, D. (2004). Feature space Gaussianization. Proc.
ICASSP, Montreal, QC, Canada. 39
Saon, G., Tüske, Z., Bolaños, D., and Kingsbury, B. (2021). Advancing RNN transducer
technology for speech recognition. Proc. ICASSP, Toronto, ON, Canada. 64, 105
Savoji, M.H. (1989). A robust algorithm for accurate endpointing of speech signals. Speech
Communication, 8:45–60. 141
Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M.X., Jia, Y., Kannan, A., Sainath, T.N., Cao, Y.,
Chiu, C.C., He, Y., Chorowski, J., Hinsu, S., Laurenzo, S., Qin, J., Firat, O., Macherey, W.,
Gupta, S., Bapna, A., Zhang, S., Pang, R., Weiss, R.J., Prabhavalkar, R., Liang, Q., Jacob,
B., Liang, B., Lee, H., Chelba, C., Jean, S., Li, B., Johnson, M., Anil, R., Tibrewal, R., Liu,
X., Eriguchi, A., Jaitly, N., Ari, N., Cherry, C., Haghani, P., Good, O., Cheng, Y., Álvarez,
R., Caswell, I., Hsu, W.N., Yang, Z., Wang, K., Gonina, E., Tomanek, K., Vanik, B., Wu,
Z., Jones, L., Schuster, M., Huang, Y., Chen, D., Irie, K., Foster, G.F., Richardson, J., Alon,
U., and et al. (2019). Lingvo: A modular and scalable framework for sequence-to-sequence
modeling. arXiv.org:1902.08295. 114
Shi, Y., Huang, Q., and Hain, T. (2020). H-vectors: Utterance-level speaker embedding using a
hierarchical attention model. Proc. ICASSP, Barcelona, Spain. 141
Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D.A., and Glass, J.R. (2011). Exploiting
intra-conversation variability for speaker diarization. Proc. Interspeech, Florence, Italy. 141
Shum, S.H., Dehak, N., Dehak, R., and Glass, J.R. (2013). Unsupervised methods for speaker
diarization: An integrated and iterative approach. IEEE Trans. on Audio, Speech, & Language
Processing, 21:2015–2028. 142
Sietsma, J. and Dow, R.J.F. (1991). Creating artificial neural networks that generalize. Neural
Networks, 4:67–79. 33
Sinha, R., Tranter, S., Gales, M.J.F., and Woodland, P.C. (2005). The Cambridge University
March 2005 speaker diarisation system. Proc. Interspeech, Lisbon, Portugal. 141
Siniscalchi, S.M., Svendsen, T., Sorbello, F., and Lee, C.H. (2010). Experimental studies on
continuous speech recognition using neural architectures with “adaptive” hidden activation
functions. Proc. ICASSP, Dallas, TX, USA. 52
Siohan, O., Ramabhadran, B., and Kingsbury, B. (2005). Constructing ensembles of ASR
systems using randomized decision trees. Proc. ICASSP, Philadelphia, PA, USA. 78
Siu, M., Gish, H., and Richardson, F. (1997). Improved estimation, evaluation and applications
of confidence measures for speech recognition. Proc. Eurospeech, Rhodes, Greece. 109, 136
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors:
Robust DNN embeddings for speaker recognition. Proc. ICASSP, Calgary, AB, Canada. 150
Socher, R., Manning, C.D., and Ng, A.Y. (2010). Learning continuous phrase representations
and syntactic parsing with recursive neural networks. Proc. NIPS Deep Learning and
Unsupervised Feature Learning Workshop, Vancouver, BC, Canada. 15
Soltau, H., Liao, H., and Sak, H. (2017). Neural speech recognizer: Acoustic-to-word LSTM
model for large vocabulary speech recognition. Proc. Interspeech, Stockholm, Sweden. 39
Sriram, A., Jun, H., Satheesh, S., and Coates, A. (2018). Cold fusion: Training seq2seq models
together with language models. Proc. Interspeech, Hyderabad, India. 72
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15:1929–1958. 32, 115
Stüker, S., Fügen, C., Burger, S., and Wölfel, M. (2006). Cross-system adaptation and
combination for continuous speech recognition: The influence of phoneme set and acoustic
front-end. Proc. ICSLP, Pittsburgh, PA, USA. 80
Su, H., Li, G., Yu, D., and Seide, F. (2013). Error back propagation for sequence training of
context-dependent deep networks for conversational speech transcription. Proc. ICASSP,
Vancouver, BC, Canada. 70, 86
Sun, G., Zhang, C., and Woodland, P.C. (2019). Speaker diarisation using 2D self-attentive
combination of embeddings. Proc. ICASSP, Brighton, UK. 141, 142, 144, 150
Sun, G., Zhang, C., and Woodland, P.C. (2021a). Combination of deep speaker embeddings for
diarisation. Neural Networks, 141:372–384. 141, 160
Sun, G., Zhang, C., and Woodland, P.C. (2021b). Transformer language models with LSTM-
based cross-utterance information representation. Proc. ICASSP, Toronto, ON, Canada. 55, 104,
105
Sung, Y.H., Hughes, T., Beaufays, F., and Strope, B. (2009). Revisiting graphemes with
increasing amounts of data. Proc. ICASSP, Taipei, Taiwan. 66
Sutskever, I., Vinyals, O., and Le, Q.V. (2014). Sequence to sequence learning with neural
networks. Proc. NIPS, Montreal, QC, Canada. 9, 19
Swietojanski, P. and Renals, S. (2014). Learning hidden unit contributions for unsupervised
speaker adaptation of neural network acoustic models. Proc. SLT, South Lake Tahoe, NV,
USA. 52
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception
architecture for computer vision. Proc. CVPR, Las Vegas, NV, USA. 34, 92, 115
Tang, D., Qin, B., and Liu, T. (2015). Document modeling with gated recurrent neural network
for sentiment classification. Proc. EMNLP, Lisbon, Portugal. 15
Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results. Proc. NIPS, Long Beach,
CA, USA. 32
Tjandra, A., Sakti, S., and Nakamura, S. (2017a). Listening while speaking: Speech chain by
deep learning. Proc. ASRU, Okinawa, Japan. 65
Tjandra, A., Sakti, S., and Nakamura, S. (2017b). Local monotonic attention mechanism for
end-to-end speech and language processing. Proc. IJCNLP, Taipei, Taiwan. 21
Tjandra, A., Sakti, S., and Nakamura, S. (2018a). Machine speech chain with one-shot speaker
adaptation. Proc. Interspeech, Hyderabad, India. 65
Tjandra, A., Sakti, S., and Nakamura, S. (2018b). Sequence-to-sequence ASR optimization via
reinforcement learning. Proc. ICASSP, Calgary, AB, Canada. 70
Toshniwal, S., Kannan, A., Chiu, C.C., Wu, Y., Sainath, T.N., and Livescu, K. (2018). A com-
parison of techniques for language model integration in encoder-decoder speech recognition.
Proc. SLT, Athens, Greece. 65, 72
Tranter, S. and Reynolds, D.A. (2006). An overview of automatic speaker diarization systems.
IEEE Trans. on Audio, Speech, & Language Processing, 14:1557–1565. 139
Tranter, S., Yu, K., Evermann, G., and Woodland, P.C. (2004). Generating and evaluating
segmentations for automatic speech recognition of conversational telephone speech. Proc.
ICASSP, Montreal, QC, Canada. 141
Tür, G., Hakkani-Tür, D.Z., and Schapire, R. (2005). Combining active and semi-supervised
learning for spoken language understanding. Speech Communication, 45:171–186. 108
Tüske, Z., Golik, P., Schlüter, R., and Ney, H. (2014). Acoustic modeling with deep neural
networks using raw time signal for LVCSR. Proc. Interspeech, Singapore. 39
Tüske, Z., Saon, G., Audhkhasi, K., and Kingsbury, B. (2020). Single headed attention
based sequence-to-sequence model for state-of-the-art results on Switchboard-300. Proc.
Interspeech, Shanghai, China. 104, 105
Tüske, Z., Saon, G., and Kingsbury, B. (2021). On the limit of English conversational speech
recognition. Proc. Interspeech, Brno, Czech Republic. 81, 104
Uebel, L.F. and Woodland, P.C. (2001). Speaker adaptation using lattice-based MLLR. Proc.
ITRW on Adaptation Methods for Speech Recognition, Sophia Antipolis, France. 108, 137
Valtchev, V., Odell, J.J., Woodland, P.C., and Young, S.J. (1997). MMIE training of large
vocabulary recognition systems. Speech Communication, 22:303–314. 36, 49
van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive
predictive coding. arXiv.org:1807.03748. 36, 73, 74
Variani, E., Chen, T., Apfel, J.A., Ramabhadran, B., Lee, S., and Moreno, P.J. (2020a). Neural
oracle search on N-best hypotheses. Proc. ICASSP, Barcelona, Spain. 121
Variani, E., Lei, X., McDermott, E., Moreno, I.L., and Gonzalez-Dominguez, J. (2014). Deep
neural networks for small footprint text-dependent speaker verification. Proc. ICASSP,
Florence, Italy. 141
Variani, E., Rybach, D., Allauzen, C., and Riley, M. (2020b). Hybrid autoregressive transducer
(HAT). Proc. ICASSP, Barcelona, Spain. 72
Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. Proc. NIPS, Long Beach, CA, USA. 17, 21,
23, 24, 30, 146, 150
Viikki, O. and Laurila, K. (1998). Cepstral domain segmental feature vector normalization for
noise robust speech recognition. Speech Communication, 25:133–147. 39
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Trans. on Information Theory, 13:260–269. 43
von Platen, P., Zhang, C., and Woodland, P.C. (2019). Multi-span acoustic modelling using raw
waveform signals. Proc. Interspeech, Graz, Austria. 39
Waibel, A.H., Hanazawa, T., Hinton, G.E., Shikano, K., and Lang, K.J. (1989). Phoneme
recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, & Signal
Processing, 37:328–339. 12
Wan, L., Wang, Q., Papir, A., and Moreno, I.L. (2018). Generalized end-to-end loss for speaker
verification. Proc. ICASSP, Calgary, AB, Canada. 141, 144
Wang, H., Ragni, A., Gales, M.J.F., Knill, K., Woodland, P.C., and Zhang, C. (2015). Joint
decoding of tandem and hybrid systems for improved keyword spotting on low resource
languages. Proc. Interspeech, Dresden, Germany. 79
Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., and Liu, W. (2018). CosFace:
Large margin cosine loss for deep face recognition. Proc. CVPR, Salt Lake City, UT, USA.
142
Wang, J., Xiao, X., Wu, J., Ramamurthy, R., Rudzicz, F., and Brudno, M. (2020a). Speaker
diarization with session-level speaker embedding refinement using graph neural networks.
Proc. ICASSP, Barcelona, Spain. 142
Wang, L., Zhang, C., Woodland, P.C., Gales, M.J.F., Karanasou, P., Lanchantin, P., Liu, X., and
Qian, Y. (2016). Improved DNN-based segmentation for multi-genre broadcast audio. Proc.
ICASSP, Shanghai, China. 141
Wang, M., Soltau, H., Shafey, L.E., and Shafran, I. (2021). Word-level confidence estimation
for RNN transducers. Proc. ASRU, Cartagena, Colombia. 160
Wang, Q., Downey, C., Wan, L., Mansfield, P.A., and Moreno, I.L. (2018). Speaker diarization
with LSTM. Proc. ICASSP, Calgary, AB, Canada. 141, 142, 150
Wang, Q., Okabe, K., Lee, K.A., Yamamoto, H., and Koshinaka, T. (2018). Attention mech-
anism in speaker recognition: What does it learn in deep speaker embedding? Proc. SLT,
Athens, Greece. 141
Wang, S., Li, B.Z., Khabsa, M., Fang, H., and Ma, H. (2020b). Linformer: Self-attention with
linear complexity. arXiv.org:2006.04768. 89
Wang, W., Zhou, Y., Xiong, C., and Socher, R. (2020c). An investigation of phone-based
subword units for end-to-end speech recognition. Proc. Interspeech, Shanghai, China. 105
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Yalta, N., Heymann,
J., Wiesner, M., Chen, N., Renduchintala, A., and Ochiai, T. (2018). ESPnet: End-to-end
speech processing toolkit. Proc. Interspeech, Hyderabad, India. 91, 98, 99, 150
Watanabe, S., Hori, T., Kim, S., Hershey, J.R., and Hayashi, T. (2017). Hybrid CTC/attention
architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal
Processing, 11:1240–1253. 81, 86, 91
Weintraub, M., Beaufays, F., Rivlin, Z., Konig, Y., and Stolcke, A. (1997). Neural-network
based measures of confidence for word recognition. Proc. ICASSP, Munich, Germany. 109
Weiss, R.J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. (2017). Sequence-to-sequence
models can directly translate foreign speech. Proc. Interspeech, Stockholm, Sweden. 65
Wessel, F., Schlüter, R., Macherey, K., and Ney, H. (2001). Confidence measures for large
vocabulary continuous speech recognition. IEEE Trans. on Speech & Audio Processing,
9:288–298. 108
Whittaker, E. and Woodland, P.C. (2000). Particle-based language modelling. Proc. Interspeech,
Beijing, China. 67
Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8:229–256. 155
Williams, R.J. and Zipser, D. (1989). A learning algorithm for continually running fully
recurrent neural networks. Neural Computation, 1:270–280. 68
Wong, J.H.M., Gaur, Y., Zhao, R., Lu, L., Sun, E., Li, J., and Gong, Y. (2020). Combination of
end-to-end and hybrid models for speech recognition. Proc. Interspeech, Shanghai, China.
81
Woodland, P.C. (1989). Weight limiting, weight quantisation and generalisation in multi-layer
perceptrons. Proc. ICANN, London, UK. 34
Woodland, P.C. (2001). Speaker adaptation for continuous density HMMs: A review. Proc.
ISCA ITR-Workshop on Adaptation Methods for Speech Recognition, Salt Lake City, UT,
USA. 36
Woodland, P.C., Gales, M.J.F., Pye, D., and Young, S.J. (1997). Broadcast news transcription
using HTK. Proc. ICASSP, Munich, Germany. 38
Woodland, P.C., Leggetter, C.J., Odell, J.J., Valtchev, V., and Young, S.J. (1995). The 1994
HTK large vocabulary speech recognition system. Proc. ICASSP, Detroit, MI, USA. 57, 58,
80, 87
Woodland, P.C. and Povey, D. (2002). Large scale discriminative training of hidden Markov
models for speech recognition. Computer Speech & Language, 16:25–47. 49, 57
Woodward, A., Bonnín, C., Masuda, I., Varas, D., Bou, E., and Riveiro, J.C. (2020). Confidence
measures in encoder-decoder models for speech recognition. Proc. Interspeech, Shanghai,
China. 109, 121
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y.,
Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S.,
Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C.,
Smith, J.R., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G.S., Hughes, M., and Dean, J.
(2016). Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv.org:1609.08144. 17
Xu, H., Ding, S., and Watanabe, S. (2019). Improving end-to-end speech recognition with
pronunciation-assisted sub-word modeling. Proc. ICASSP, Brighton, UK. 64
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., and Bengio,
Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. Proc.
ICML, Lille, France. 15
Yella, S.H. and Stolcke, A. (2015). A comparison of neural network feature transforms for
speaker diarization. Proc. Interspeech, Dresden, Germany. 141
Young, S.J. (1996). Large vocabulary continuous speech recognition: A review. IEEE Signal
Processing Magazine, 13:45–57. 57
Young, S.J., Evermann, G., Gales, M.J.F., Hain, T., Kershaw, D.J., Liu, X., Moore, G.L., Odell,
J.J., Ollason, D., Povey, D., Ragni, A., Valtchev, V., Woodland, P.C., and Zhang, C. (2015).
The HTK Book (for HTK version 3.5). Cambridge University Engineering Department. 92,
150
Young, S.J., Odell, J.J., and Woodland, P.C. (1994). Tree-based state tying for high accuracy
acoustic modelling. Proc. HLT, Plainsboro, NJ, USA. 40
Young, S.J., Russell, N.H., and Thornton, J.H.S. (1989). Token passing: A simple conceptual
model for connected speech recognition systems. Technical report, Cambridge University
Engineering Department. 53
Yu, D., Kolbæk, M., Tan, Z.H., and Jensen, J. (2017). Permutation invariant training of deep
models for speaker-independent multi-talker speech separation. Proc. ICASSP, New Orleans,
LA, USA. 143
Yu, D., Li, J., and Deng, L. (2011). Calibration of confidence measures in speech recognition.
IEEE Trans. on Audio, Speech, & Language Processing, 19:2461–2473. 108, 131
Yu, D., Yao, K., Su, H., Li, G., and Seide, F. (2013). KL-divergence regularized deep neural
network adaptation for improved large vocabulary speech recognition. Proc. ICASSP,
Vancouver, BC, Canada. 52
Yu, Y.Q., Fan, L., and Li, W.J. (2019). Ensemble additive margin softmax for speaker verifica-
tion. Proc. ICASSP, Brighton, UK. 142
Zapotoczny, M., Pietrzak, P., Lancucki, A., and Chorowski, J. (2019). Lattice generation in
attention-based speech recognition models. Proc. Interspeech, Graz, Austria. 111
Zeiler, M.D. (2012). ADADELTA: An adaptive learning rate method. arXiv.org:1212.5701.
28, 92
Zeineldeen, M., Glushko, A., Michel, W., Zeyer, A., Schlüter, R., and Ney, H. (2021). Investi-
gating methods to improve language model integration for attention-based encoder-decoder
ASR models. Proc. Interspeech, Brno, Czech Republic. 72
Zeppenfeld, T., Finke, M., Ries, K., Westphal, M., and Waibel, A.H. (1997). Recognition of
conversational telephone speech using the JANUS speech engine. Proc. ICASSP, Munich,
Germany. 109
Zeyer, A., Beck, E., Schlüter, R., and Ney, H. (2017). CTC in the context of generalized
full-sum HMM training. Proc. Interspeech, Stockholm, Sweden. 60
Zeyer, A., Irie, K., Schlüter, R., and Ney, H. (2018). Improved training of end-to-end attention
models for speech recognition. Proc. Interspeech, Hyderabad, India. 105
Zhang, A., Wang, Q., Zhu, Z., Paisley, J., and Wang, C. (2019a). Fully supervised speaker
diarization. Proc. ICASSP, Brighton, UK. 143
Zhang, C., Kreyssig, F.L., Li, Q., and Woodland, P.C. (2019b). PyHTK: Python library and
ASR pipelines for HTK. Proc. ICASSP, Brighton, UK. 92, 150
Zhang, C. and Woodland, P.C. (2016). DNN speaker adaptation using parameterised sigmoid
and ReLU hidden activation functions. Proc. ICASSP, Shanghai, China. 52
Zhang, X.L. and Wu, J. (2013). Deep belief networks based voice activity detection. Proc.
ICASSP, Vancouver, BC, Canada. 141
Zhang, Y., Chan, W., and Jaitly, N. (2017). Very deep convolutional networks for end-to-end
speech recognition. Proc. ICASSP, New Orleans, LA, USA. 64, 66
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A.C.
(2016). Towards end-to-end speech recognition with deep convolutional neural networks.
Proc. Interspeech, San Francisco, CA, USA. 64, 66
Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.C., Pang, R., Le, Q.V., and Wu, Y. (2020).
Pushing the limits of semi-supervised learning for automatic speech recognition. Proc.
NeurIPS SAS Workshop, Vancouver, BC, Canada. 75, 121, 127, 132
Zhou, S., Dong, L., Xu, S., and Xu, B. (2018a). Syllable-based sequence-to-sequence speech
recognition with the Transformer in Mandarin Chinese. Proc. Interspeech, Hyderabad, India.
95
Zhou, Y., Xiong, C., and Socher, R. (2018b). Improving end-to-end speech recognition with
policy learning. Proc. ICASSP, Calgary, AB, Canada. 70
Zhou, Y.T. and Chellappa, R. (1988). Computation of optical flow using a neural network.
Proc. ICNN, San Diego, CA, USA. 13
Zhu, Y., Ko, T., Snyder, D., Mak, B.K.W., and Povey, D. (2018). Self-attentive speaker
embeddings for text-independent speaker verification. Proc. Interspeech, Hyderabad, India.
141
Zweig, G., Yu, C., Droppo, J., and Stolcke, A. (2017). Advances in all-neural speech recognition.
Proc. ICASSP, New Orleans, LA, USA. 64
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal
of the Acoustical Society of America, 33(2):248–248. 37