Attention-Based Encoder-Decoder Models for Speech Processing
Qiujia Li
Department of Engineering
University of Cambridge
I hereby declare that this dissertation is the result of my own work and includes nothing which
is the outcome of work done in collaboration except as declared in the Preface and specified in
the text. I further state that no substantial part of my dissertation has already been submitted,
or, is being concurrently submitted for any such degree, diploma or other qualification at the
University of Cambridge or any other University or similar institution except as declared in
the Preface and specified in the text. Some of the material has been presented at or submitted
to international conferences and journals (Li et al., 2021c,d, 2019d, 2021e,f, 2022). This
dissertation contains fewer than 65,000 words including appendices, footnotes, tables and
equations, but excluding the bibliography, and has fewer than 150 figures.
Qiujia Li
April 2022
Abstract
Speech processing is one of the key components of machine perception. It covers a wide
range of topics and plays an important role in many real-world applications. Many speech
processing problems are modelled using sequence-to-sequence models. More recently, the
Attention-Based Encoder-Decoder (AED) model has become a general and effective neural
network that transforms a source sequence into a target sequence. These two sequences may
have different lengths and belong to different modalities. AED models offer a new perspective
for various speech processing tasks. In this thesis, the fundamentals of AED models and
Automatic Speech Recognition (ASR) are first covered. The rest of the thesis focuses on
the application of AED models for three major speech processing tasks - speech recognition,
confidence estimation and speaker diarisation.
Speech recognition technology is widely used in voice assistants and dictation systems.
It converts speech signals into text. Traditionally, Hidden Markov Models (HMMs), as
generative sequence-to-sequence models, are widely used as the backbone of an acoustic model.
Under the Source-Channel Model (SCM) framework, the ASR system finds the most likely text
sequence that produces the corresponding acoustic sequence together with a language model
and a lexicon. Alternatively, the speech recognition task can be addressed discriminatively
using a single AED model. There are distinct characteristics associated with each modelling
approach. As the first contribution of the thesis, the Integrated Source-Channel and Attention
(ISCA) framework is proposed to leverage the advantages of both approaches with two passes.
The first pass uses the traditional SCM-based ASR system to generate diverse hypotheses,
either in the form of N -best lists or lattices. The second pass obtains the AED model score for
each hypothesis. Experiments on the Augmented Multi-Party Interaction (AMI) dataset showed
that ISCA using two-pass decoding reduced the Word Error Rate (WER) by 13% relative
compared to a joint SCM and AED system using one-pass decoding. Further experiments on both the
AMI dataset and the larger Switchboard (SWB) dataset showed that, if the SCM and AED
systems were trained separately to be more complementary, the combined system using ISCA
outperformed the individual systems by around 30%. Also, the refined lattice rescoring algorithm
is significantly better than N-best rescoring as a lattice is a more compact representation of the
hypothesis space, especially for longer utterances.
With various advancements in neural network training, AED models can reach similar or
better performance than traditional systems for many ASR tasks. Compared to a conventional
ASR system, one important but perhaps missing attribute of an AED-based system is good
confidence scores which indicate the reliability of automatic transcriptions. Confidence scores
are very helpful for various downstream tasks, including semi-supervised training, keyword
spotting and dialogue systems. As the second contribution of this thesis, effective confidence
estimators for AED-based ASR systems are proposed. The Confidence Estimation Module
(CEM) is a lightweight simple add-on neural network that takes various features from the
encoder, attention mechanism and decoder to estimate a confidence score for each output unit
(token). Experiments on the LibriSpeech dataset showed that compared to using Softmax
probabilities as confidence scores, the CEM improved token-level confidence estimation perfor-
mance substantially and largely addressed the over-confidence issue. For various downstream
tasks such as data selection, utterance-level confidence scores are more desirable. The Residual
Energy-Based Model (R-EBM), an utterance-level confidence estimator, was demonstrated
to outperform both Softmax probabilities and the CEM. The R-EBM directly operates at the
utterance level and takes deletion errors into account implicitly. The R-EBM also provides
a global normalisation term for the locally normalised auto-regressive AED models. On the
LibriSpeech dataset, the R-EBM reduced the WER of an AED model by up to 8% relative. One
potential issue for model-based confidence estimators such as the CEM and R-EBM is their
performance on Out-of-Domain (OOD) data. To ensure that confidence estimators generalise
well for OOD input, two simple approaches are suggested that can effectively inject OOD
information during the training of the CEM and R-EBM.
Speaker diarisation, a task of identifying “who spoke when”, is a crucial step for information
extraction and retrieval. The speaker diarisation pipeline often consists of multiple stages. The
last stage is to perform clustering over segment-level or window-level speaker representations.
Although clustering is normally an unsupervised task, this thesis proposes the use of AED
models for supervised clustering. With specific data augmentation techniques, the proposed
approach, Discriminative Neural Clustering (DNC), has been shown to be an effective alternative
to unsupervised clustering algorithms. Experiments on the very challenging AMI dataset
showed that DNC improved the Speaker Error Rate (SpkER) by around 30% relative compared
to a strong spectral clustering baseline. Furthermore, DNC opens more interesting research
directions, e.g. speaker diarisation with multi-channel or multi-modality information and
end-to-end neural network-based speaker diarisation.
Acknowledgements
First and foremost, I would like to thank my PhD supervisor Prof. Phil Woodland. I deeply
appreciate the opportunity Phil gave me to work on speech recognition about six years ago
when I was still an undergraduate student. Since then, I have been determined to pursue research,
especially in the field of speech technologies. During my PhD studies, I have been exceedingly
privileged to be guided by Phil through this challenging journey. He has spent countless hours
discussing my projects, replying to my messages, improving my paper drafts etc. I have learned
abundant technical knowledge and insight, but perhaps more importantly, a diligent and
rigorous attitude towards research. Phil has been more than my PhD supervisor; he has been a mentor
who selflessly supports my personal growth. For that, I will be forever grateful.
I would like to thank Dr Chao Zhang. Without him, I cannot imagine how many pitfalls and
sidetracks I would have gone through during my PhD studies. He is very patient, knowledgeable
and always willing to help. He can always inspire me when I face difficulties. I would also
like to express my gratitude to my labmate and friend, Florian Kreyssig. We brainstorm ideas,
discuss intriguing questions, and cheer each other on. I cannot overstate how lucky I am to
have Chao and Florian around in the past six years.
I want to thank Prof. Mark Gales, who has served as my PhD advisor and pointed me in the
right direction early on. I also appreciate my college tutors at Peterhouse, Dr Saskia Murk
Jansen and Dr Christopher Lester, who have helped secure my Graduate Studentship from
Peterhouse and supported me on various tutorial matters. I thank my collaborators at Google,
Dr Yu Zhang, Dr Liangliang Cao, Dr Bo Li, Dr David Qiu and Dr Yanzhang He, who provided
tremendous support and expertise during my internship. I am grateful to many members at the
Machine Intelligence Laboratory, Guangzhi Sun, Dr Anton Ragni, Dr Yu Wang, Xiaoyu Yang,
Qingyun Dou, Yiting Lu, Dr Kate Knill and Dr Linlin Wang, who have helped me in various
ways. I must also extend my thanks to my best friends at Cambridge, Xuan Guo, Yudong
Chen, Weiming Che and Yichen Yang, who have brought me joy and laughter in my daily life,
especially during the coronavirus pandemic. I greatly cherish their invaluable friendship.
Finally, I would like to thank my girlfriend, Ke Li, for her indispensable support and
encouragement. Her companionship makes the distance across the Atlantic Ocean seem
insignificant. And my most profound gratitude goes to my parents. They have supported me
viii
unconditionally in all possible ways over the past twenty-six years. I would like to dedicate
this thesis to them.
Table of contents
List of figures
Notation
Acronyms
1 Introduction
1.1 Speech Processing
1.1.1 Speech Recognition
1.1.2 Confidence Scores
1.1.3 Speaker Diarisation
1.2 Thesis Outline
1.3 Contributions
References
Notation
Neural Networks
a, a attention weight value/vector
b, b bias value/vector
β momentum coefficient
d decoder state vector
D, H number of nodes in a hidden layer
δ small number for computational stability
e, E encoder output embedding vector/matrix
ϵ learning rate
η decay factor for Adam optimiser
f (·), g(·) generic functions
g gradient of model parameters
f , i, o, c forget/input/output gates and cell state for LSTM
h hidden state vector
I number of nodes in an input layer
i, j general indices
J (·) overall cost
K convolutional kernel matrix
κ decay factor for exponential moving average
L(·) loss function
L number of layers in a network
m momentum of model parameters
M number of samples in a mini-batch
N number of data samples
n general count
∥ · ∥p p-norm of a vector
ν weight decay factor
O number of nodes in an output layer
Acronyms

AED     Attention-Based Encoder-Decoder
AM      Acoustic Model
AMI     Augmented Multi-Party Interaction
ANN     Artificial Neural Network
ASR     Automatic Speech Recognition
AUC     Area Under the Curve
CTC     Connectionist Temporal Classification
CTS     Conversational Telephone Speech
HMM     Hidden Markov Model
KL      Kullback–Leibler
LM      Language Model
LN      Length Normalisation
LSTM    Long Short-Term Memory
R-EBM   Residual Energy-Based Model
RNN     Recurrent Neural Network
RNNLM   Recurrent Neural Network Language Model
ROC     Receiver Operating Characteristics
ROVER   Recogniser Output Voting Error Reduction
SentER  Sentence Error Rate
SGD     Stochastic Gradient Descent
SpkER   Speaker Error Rate
SSL     Self-Supervised Learning
SWB     Switchboard
SWBC    Switchboard Cellular
Introduction
perspectives, this thesis proposes practical approaches to exploit their complementarity via
system combination. Compared to conventional ASR systems, one missing piece of the
AED-based system is high-quality confidence scores for their automatic transcriptions. As
AED models address the ASR task using a different principle from conventional systems,
novel confidence estimators are proposed in this thesis to produce reliable confidence scores
for AED-based ASR systems at the token, word, and utterance levels. Effective confidence
estimators have significant implications for various downstream tasks such as data selection
for semi-supervised or active learning, and dialogue systems. Apart from ASR, AED models
can also be used for speaker diarisation or determining “who spoke when” in a multi-talker
audio recording. Clustering is the last stage of a speaker diarisation pipeline and is normally
regarded as an unsupervised task. This thesis proposes a novel supervised approach for speaker
clustering using AED models.
In this chapter, the three speech processing topics covered in this thesis are first briefly
introduced. Then the thesis outline is presented and the main contributions are highlighted.
system without worrying too much about the rest of the pipeline. Many sources of structured
knowledge such as the lexicon and phonetic decision trees can readily be incorporated into the
system which helps the recogniser perform relatively robustly, even with a limited resource
budget. SCM-based systems process the acoustic sequence in a frame-by-frame manner, which
allows the system to be designed to handle streaming data.
Recently, with the accelerating development of learning algorithms and computing hardware,
a single end-to-end trainable AED model can reach a similar performance to SCM-based
systems for ASR. An AED model consists of an encoder, an attention mechanism and a
decoder. For ASR, the encoder transforms the input acoustic features into a sequence of hidden
representations, and the decoder works together with the attention mechanism to produce one
output token at a time based on all its previous output. Unlike HMM-based models, AED
models impose much weaker conditional independence assumptions for both word sequences and
acoustic sequences. AED models jointly learn the acoustic and language models whereas SCM-
based systems optimise them separately. Since AED models operate in a label-synchronous
fashion, processing streaming data is more challenging.
SCM and AED-based systems have their respective characteristics and are highly com-
plementary. Given two systems that have similar performance but different error patterns, a
significant gain is expected from combining them. To this end, it is of great interest to
investigate possible combination approaches and look for the best strategy in terms of the final
recognition performance while considering practical constraints.
Apart from the clustering algorithm, DNNs have been applied to all the other stages and have
shown promising performance. Clustering algorithms are generally unsupervised. However,
the task is sometimes ambiguous by nature. Given many high-dimensional features without
additional information, there may be more than one sensible clustering result. Current clustering
procedures rely on a distance measure for speaker representations and treat each segment as
an independent sample. Some clustering approaches even impose stricter assumptions about
the distribution of each cluster. The limitations of unsupervised clustering algorithms place
more pressure on the upstream stages, especially the extraction of speaker representations.
If it is possible to view speaker clustering as a supervised sequence-to-sequence task where
the input is a sequence of speaker representations and the output is a sequence of cluster
labels, there would be several benefits to reformulating clustering in the context of speaker
diarisation. The sequential relationship between speaker representations can then be exploited,
the ambiguity of clustering results can be removed by supervised training samples, a pre-defined
distance measure of speaker representations becomes optional, and the assumptions made by
unsupervised algorithms can be lifted. Moreover, the entire diarisation pipeline could be
designed as an end-to-end trainable model in order to avoid error propagation across different
stages.
• Chapter 2 first establishes the fundamentals of DNNs, including their basic building
blocks such as different neural network layers and activation functions. Then Chapter 2
introduces the AED models that are constructed based on these building blocks. Optimi-
sation procedures and regularisation techniques for DNNs are also described. Parts of
this chapter will be referred to throughout the thesis.
• Chapter 3 first describes SCM-based systems and components, including feature extrac-
tion, HMM-based acoustic models and their adaptation, language models, and the decod-
ing procedure. Connectionist Temporal Classification (CTC) and neural transducers are
also introduced as other forms of frame-synchronous systems. Next, label-synchronous
AED models for ASR are described. Their specific training and decoding techniques
are also covered. Finally, some self-supervised pre-training approaches for ASR are
presented. As they often use a large amount of unlabelled acoustic data, the pre-trained
model can be used to initialise an acoustic model or the encoder of an AED model.
• Chapter 5 addresses the confidence estimation problem for AED models for ASR. The
Confidence Estimation Module (CEM) is proposed as the token-level confidence esti-
mator, which was published in a conference paper at ICASSP 2021 (Li et al., 2021d).
Subsequently, the Residual Energy-Based Model (R-EBM) is proposed as the utterance-
level confidence estimator and can also be used to rescore the top hypotheses from the
AED models for better ASR performance. The R-EBM was published at the Interspeech 2021
conference (Li et al., 2021f). To improve model-based confidence scores on Out-of-
Domain (OOD) data, practical approaches are suggested and validated. The findings are
included in a conference paper at ICASSP 2022 (Li et al., 2022).
• Chapter 6 focuses on the clustering stage of the speaker diarisation task. Discriminative
Neural Clustering (DNC), based on AED models, is proposed as an effective supervised
alternative to existing unsupervised clustering algorithms. Three data augmentation
techniques are tailored to DNC when the training data is very limited. DNC was first
published as a conference paper at SLT 2021 (Li et al., 2021c).
• The key conclusions and contributions are summarised in Chapter 7. Based on the current
findings and the understanding of AED models, Chapter 7 also suggests several related
topics for further investigation with respect to ASR, confidence estimation and speaker
diarisation.
1.3 Contributions
The key contributions of this thesis are as follows.
• A general system combination framework called ISCA is proposed for ASR. The frame-
work uses a two-pass approach where hypotheses are first generated using a frame-
synchronous SCM-based system and then rescored by a label-synchronous AED model.
The lattice rescoring algorithm used for Recurrent Neural Network Language Model
(RNNLM) rescoring is adapted and improved for ISCA. Both N -best and lattice rescoring
algorithms are highly effective for improving the overall ASR performance.
• A lightweight confidence estimator for AED models, the CEM, is proposed for ASR.
The CEM can predict a confidence score for each output token in the hypotheses. By
simply aggregating token-level scores, word-level and utterance-level confidence scores
can be obtained.
• Since utterance-level confidence scores are very useful for various downstream tasks, the
R-EBM is proposed as an utterance-level confidence estimator that is trained directly
at the utterance level and implicitly takes deletion error into account. The R-EBM
can also be used to rescore N -best hypotheses from the AED models to improve the
ASR performance as it provides the globally normalised residual term for the locally
normalised AED model.
• Because both the CEM and R-EBM are model-based confidence estimators for AED
models, two practical approaches are proposed to improve the reliability of confidence
scores on OOD data when some unlabelled data from the target domain is available.
• A novel method called DNC is proposed to use an AED model to perform clustering
for speaker diarisation in a supervised fashion. Together with various specific data
augmentation schemes, DNC implicitly handles the permutation issue for cluster labels.
As DNC does not assume a pre-defined distance measure for speaker representations and
learns to disambiguate speakers from data, it outperforms a commonly used unsupervised
clustering algorithm on a challenging dataset.
Chapter 2

Attention-Based Encoder-Decoder Models
Artificial Neural Networks (ANNs), inspired by biological neural networks, are mathematical
models or a class of functions that transform input features or representations into the desired
output space via non-linear mappings (Bishop, 1995). An ANN is a directed and weighted
graph consisting of interconnected groups of nodes. There is a weight associated with each
connection and a non-linear activation function associated with each node. Deep Neural
Networks (DNNs) are a major class of ANNs where there are multiple layers of nodes between
the input and the output layers. By increasing the width and depth of neural networks, more
complex functions can be approximated (Goodfellow et al., 2016). The building blocks of
DNNs will be first introduced in this chapter.
Based on the basic building blocks, various kinds of neural networks can be constructed.
For speech processing, the input and the output are often sequences with variable lengths.
Therefore, Attention-Based Encoder-Decoder (AED) models (Cho et al., 2014; Sutskever et al.,
2014) are of particular interest. In this section, two commonly used AED models are described.
They will be frequently mentioned in the rest of the thesis.
Deep learning provides a powerful framework for DNNs to perform classification or regres-
sion tasks in a supervised fashion (LeCun et al., 2015). One of the key advantages of DNNs is
that they do not require any prior knowledge between the input and output spaces (Goodfellow
et al., 2016). Provided with a reasonable amount of training data (i.e. input and output pairs),
model parameters (i.e. weights) can be optimised according to the defined criterion via error
backpropagation (Rumelhart et al., 1988). Various optimisation techniques are included in this
chapter.
DNNs have been shown to achieve state-of-the-art results on various tasks in computer
vision, speech processing and natural language processing (LeCun et al., 2015). With more
complex model architectures, more advanced training techniques, an increasing amount of
training data and more powerful computing facilities, DNNs are expected to continue breaking
existing performance records. Despite the success of DNNs, this data-driven approach cannot
easily solve other tasks that would require more than the input-to-output feature mapping and a
higher level of understanding and reasoning (LeCun et al., 2015). However, within the scope of
speech processing, DNNs have become one of the most important modules in the system. This
chapter covers generic deep neural networks and their learning procedures, which will be often
referred to throughout the thesis.
[Figure 2.1: a single artificial neuron with inputs x_1, x_2, x_3, ..., x_D, a bias b and output y.]

For a single artificial neuron,
$$y = \phi\left(\sum_{i=1}^{D} w_i x_i + b\right) = \phi\left(\boldsymbol{w}^{\mathsf{T}}\boldsymbol{x} + b\right), \tag{2.1}$$
where the input feature $\boldsymbol{x} = [x_1, \ldots, x_D]^{\mathsf{T}}$ and the weight vector $\boldsymbol{w} = [w_1, \ldots, w_D]^{\mathsf{T}}$ have the same
dimension $D$, and the input value of the activation function is the dot product of the input and
weight vectors plus a bias value $b$. The corresponding output of the neuron $y$ is the evaluation
of the non-linear activation function $\phi(\cdot)$.
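For illustration, the following minimal NumPy sketch evaluates Equation (2.1) for one neuron. It is not part of the thesis; the four-dimensional input, the specific values and the sigmoid activation are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    # one common choice for the non-linear activation phi(.)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0, 0.1])   # input feature vector, D = 4
w = np.array([0.2, 0.4, -0.1, 0.7])   # weight vector of the neuron
b = 0.3                               # bias value

y = sigmoid(np.dot(w, x) + b)         # Equation (2.1): y = phi(w^T x + b)
print(y)
```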
<latexit sha1_base64="YEy0rtL8OREFxfc23Jo0M0weW64=">AAAB/XicbVA7T8MwGHTKq5RXeGwsERVSWaoEKmCsYGEsEn1Ibagcx2mtOnZkO0ghivgrLAwgxMr/YOPf4LQZoOUky6e775PP50WUSGXb30ZpaXllda28XtnY3NreMXf3OpLHAuE24pSLngclpoThtiKK4l4kMAw9irve5Dr3uw9YSMLZnUoi7IZwxEhAEFRaGpoHA49TXyahvtIku09rZyfZ0KzadXsKa5E4BamCAq2h+TXwOYpDzBSiUMq+Y0fKTaFQBFGcVQaxxBFEEzjCfU0ZDLF002n6zDrWim8FXOjDlDVVf2+kMJR5QD0ZQjWW814u/uf1YxVcuilhUawwQ7OHgphailt5FZZPBEaKJppAJIjOaqExFBApXVhFl+DMf3mRdE7rznm9cduoNq+KOsrgEByBGnDABWiCG9ACbYDAI3gGr+DNeDJejHfjYzZaMoqdffAHxucPpaaVWg==</latexit>
y (3)
<latexit sha1_base64="XVIwW/ek2F5To0f+3o276B5c3y8=">AAACEnicbVDLSsNAFJ3UV62vqEs3g0VoQUqiRV0W3bisYB/QxjKZTNqhk0mYmQgl5Bvc+CtuXCji1pU7/8ZJm0VtPTDM4Zx7ufceN2JUKsv6MQorq2vrG8XN0tb2zu6euX/QlmEsMGnhkIWi6yJJGOWkpahipBsJggKXkY47vsn8ziMRkob8Xk0i4gRoyKlPMVJaGpjVvhsyT04C/SWd9CGpnFfT03nRzcWBWbZq1hRwmdg5KYMczYH53fdCHAeEK8yQlD3bipSTIKEoZiQt9WNJIoTHaEh6mnIUEOkk05NSeKIVD/qh0I8rOFXnOxIUyGxBXRkgNZKLXib+5/Vi5V85CeVRrAjHs0F+zKAKYZYP9KggWLGJJggLqneFeIQEwkqnWNIh2IsnL5P2Wc2+qNXv6uXGdR5HERyBY1ABNrgEDXALmqAFMHgCL+ANvBvPxqvxYXzOSgtG3nMI/sD4+gWAHJ39</latexit>
W (3) , b(3)
<latexit sha1_base64="YDK+9JZXWSiEBgkZZCgDG4oMORw=">AAACFHicbVDLSsNAFJ34rPUVdelmsAgtQklqUTdC0Y3LCvYBbSyTyaQdOnkwMxFDyEe48VfcuFDErQt3/o2TNovaemCYwzn3cu89dsiokIbxoy0tr6yurRc2iptb2zu7+t5+WwQRx6SFAxbwro0EYdQnLUklI92QE+TZjHTs8XXmdx4IFzTw72QcEstDQ5+6FCOppIF+0rcD5ojYU1/ymN4n5dNKCi/hrBxncq2SDvSSUTUmgIvEzEkJ5GgO9O++E+DII77EDAnRM41QWgnikmJG0mI/EiREeIyGpKeojzwirGRyVAqPleJAN+Dq+RJO1NmOBHkiW1BVekiOxLyXif95vUi6F1ZC/TCSxMfTQW7EoAxglhB0KCdYslgRhDlVu0I8QhxhqXIsqhDM+ZMXSbtWNc+q9dt6qXGVx1EAh+AIlIEJzkED3IAmaAEMnsALeAPv2rP2qn1on9PSJS3vOQB/oH39ArO5npk=</latexit>
<latexit sha1_base64="aj0iK0uF5FS5rq9zA+djRoblxj0=">AAACEnicbVDLSsNAFJ3UV62vqEs3g0VoQUpSirosunFZwT6gjWUymbRDJ5MwMxFKyDe48VfcuFDErSt3/o2TNovaemCYwzn3cu89bsSoVJb1YxTW1jc2t4rbpZ3dvf0D8/CoI8NYYNLGIQtFz0WSMMpJW1HFSC8SBAUuI113cpP53UciJA35vZpGxAnQiFOfYqS0NDSrAzdknpwG+ku66UNSqVfT80XRzcWhWbZq1gxwldg5KYMcraH5PfBCHAeEK8yQlH3bipSTIKEoZiQtDWJJIoQnaET6mnIUEOkks5NSeKYVD/qh0I8rOFMXOxIUyGxBXRkgNZbLXib+5/Vj5V85CeVRrAjH80F+zKAKYZYP9KggWLGpJggLqneFeIwEwkqnWNIh2Msnr5JOvWZf1Bp3jXLzOo+jCE7AKagAG1yCJrgFLdAGGDyBF/AG3o1n49X4MD7npQUj7zkGf2B8/QJ8+537</latexit>
W (2) , b(2)
<latexit sha1_base64="2OCJ4yowjQjJtTTBNG707Fkxe8c=">AAACFHicbVDLSsNAFJ3UV62vqEs3g0VoEUpSiroRim5cVrAPaGOZTCbt0MmDmYkYQj7Cjb/ixoUibl2482+ctFnU1gPDHM65l3vvsUNGhTSMH62wsrq2vlHcLG1t7+zu6fsHHRFEHJM2DljAezYShFGftCWVjPRCTpBnM9K1J9eZ330gXNDAv5NxSCwPjXzqUoykkob66cAOmCNiT33JY3qfVOrVFF7CeTnOZLOaDvWyUTOmgMvEzEkZ5GgN9e+BE+DII77EDAnRN41QWgnikmJG0tIgEiREeIJGpK+ojzwirGR6VApPlOJAN+Dq+RJO1fmOBHkiW1BVekiOxaKXif95/Ui6F1ZC/TCSxMezQW7EoAxglhB0KCdYslgRhDlVu0I8RhxhqXIsqRDMxZOXSadeM89qjdtGuXmVx1EER+AYVIAJzkET3IAWaAMMnsALeAPv2rP2qn1on7PSgpb3HII/0L5+AbCWnpc=</latexit>
<latexit sha1_base64="HCZhClYpxAVH8RM8qXbCA+wzMac=">AAACEnicbVDLSgMxFM34rPU16tJNsAgtSJmRoi6LblxWsA9ox5LJZNrQTDIkGaEM/QY3/oobF4q4deXOvzHTzqK2Hgg5nHMv997jx4wq7Tg/1srq2vrGZmGruL2zu7dvHxy2lEgkJk0smJAdHynCKCdNTTUjnVgSFPmMtP3RTea3H4lUVPB7PY6JF6EBpyHFSBupb1d6vmCBGkfmS9uTh7TsViZn86Kfi3275FSdKeAycXNSAjkaffu7FwicRIRrzJBSXdeJtZciqSlmZFLsJYrECI/QgHQN5SgiykunJ03gqVECGAppHtdwqs53pChS2YKmMkJ6qBa9TPzP6yY6vPJSyuNEE45ng8KEQS1glg8MqCRYs7EhCEtqdoV4iCTC2qRYNCG4iycvk9Z51b2o1u5qpfp1HkcBHIMTUAYuuAR1cAsaoAkweAIv4A28W8/Wq/Vhfc5KV6y85wj8gfX1C3nanfk=</latexit>
W (1) , b(1)
<latexit sha1_base64="qtFCYMCwUv9TbEokav80f7ju0Ww=">AAAB/XicbVA7T8MwGHTKq5RXeGwsFhVSWaoEVcBYwcJYJPqQ2lA5jtNadezIdhAlivgrLAwgxMr/YOPf4LQdoOUky6e775PP58eMKu0431ZhaXllda24XtrY3NresXf3WkokEpMmFkzIjo8UYZSTpqaakU4sCYp8Rtr+6Cr32/dEKir4rR7HxIvQgNOQYqSN1LcPer5ggRpH5kofsru04p5kfbvsVJ0J4CJxZ6QMZmj07a9eIHASEa4xQ0p1XSfWXoqkppiRrNRLFIkRHqEB6RrKUUSUl07SZ/DYKAEMhTSHazhRf2+kKFJ5QDMZIT1U814u/ud1Ex1eeCnlcaIJx9OHwoRBLWBeBQyoJFizsSEIS2qyQjxEEmFtCiuZEtz5Ly+S1mnVPavWbmrl+uWsjiI4BEegAlxwDurgGjRAE2DwCJ7BK3iznqwX6936mI4WrNnOPvgD6/MHoQ+VVw==</latexit>
x(1)
Figure 2.2 An example of a three-layer MLP consisting of two hidden layers and one output
layer. In this example, the input has four dimensions and the output has two dimensions.
Instead of a single neuron written in Equation (2.1), for a general layer $l$ in an MLP, the
forward function modelled by the layer is
$$y_j^{(l)} = \phi\left(\sum_{i=1}^{I_l} w_{ji}^{(l)} x_i^{(l)} + b_j^{(l)}\right),$$
or in matrix notation
$$\boldsymbol{y}^{(l)} = \phi\left(\boldsymbol{W}^{(l)} \boldsymbol{x}^{(l)} + \boldsymbol{b}^{(l)}\right), \tag{2.2}$$
where the output from the $l$-th layer is the input to the next layer, i.e. $\boldsymbol{y}^{(l)} = \boldsymbol{x}^{(l+1)}$. If the input feature
$\boldsymbol{x}^{(l)}$ has dimension $I_l$ and the output feature $\boldsymbol{y}^{(l)}$ has dimension $O_l$, then the weight matrix
$\boldsymbol{W}^{(l)} \in \mathbb{R}^{O_l \times I_l}$ and the bias $\boldsymbol{b}^{(l)}$ is an $O_l$-dimensional vector, where $O_l = I_{l+1}$. Note that when
the input to the activation function $\phi(\cdot)$ is a vector, the operation is elementwise.
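To make the forward computation in Equation (2.2) concrete, the NumPy sketch below applies one fully connected layer to an input vector. This is an illustrative sketch rather than code from the thesis; the layer sizes and the tanh activation are arbitrary assumptions.

```python
import numpy as np

def dense_layer(x, W, b, phi=np.tanh):
    """One MLP layer: y = phi(W x + b), cf. Equation (2.2)."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
I_l, O_l = 4, 3                      # input and output dimensions (arbitrary)
W = rng.standard_normal((O_l, I_l))  # weight matrix of shape O_l x I_l
b = np.zeros(O_l)                    # bias vector
x = rng.standard_normal(I_l)         # input feature vector x^(l)

y = dense_layer(x, W, b)             # layer output y^(l), which becomes x^(l+1)
print(y.shape)                       # (3,)
```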
For MLPs, the number of layers defines the depth of the model while the number of artificial
neurons in hidden layers defines the width of the model. Varying the depth or width of an MLP
model allows it to have various levels of modelling capability by having a different number
of parameters, which is closely related to the selection of training setup. Assuming all hidden
layers have the same dimension D, the number of hidden layers is L, and the network has no
bias, then the number of parameters is $ID + (L-1)D^2 + DO$, where $I$ and $O$ are the input and
output layer dimensions.
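This parameter count can be checked with a small helper. The sketch below is only an illustration (not from the thesis), and the dimensions used are arbitrary assumptions.

```python
def mlp_num_params(I, D, O, L):
    """Weights in a bias-free MLP with L hidden layers of width D:
    I*D for the input layer, (L-1)*D*D between hidden layers, D*O for the output layer."""
    return I * D + (L - 1) * D * D + D * O

# e.g. 80-dimensional input, 5 hidden layers of width 512, 1000 output units
print(mlp_num_params(I=80, D=512, O=1000, L=5))  # 1601536 parameters
```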
A Time Delay Neural Network (TDNN) (Waibel et al., 1989) is a variant of MLP, where
each layer of TDNN processes a window of context from the previous layer, as shown in
Figure 2.3.
Figure 2.3 An example of a three-layer TDNN for a sequence of input with context from -4 to
+4. If dotted connections are removed, it becomes a subsampled TDNN.
Since a TDNN layer is effectively an MLP layer that can shift across time, TDNNs are
particularly useful for sequential inputs. The higher layers in a TDNN have an increasingly
wide view of the original input sequence (i.e. receptive field) and the lower layers are forced to
learn translation-invariant feature transforms (Peddinti et al., 2015). Effectively, parameters are
shared across different time steps, which reduces the total number of parameters. To further
reduce the computation load, subsampled TDNNs are often used where a window processes a
fixed amount of context and then shifts multiple steps (Peddinti et al., 2015).
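As an illustration of how a (subsampled) TDNN layer slides a shared MLP over time, the following NumPy sketch splices neighbouring frames and applies one weight matrix at every position. It is not code from the thesis; the context of ±2, the stride of 3 and the layer sizes are arbitrary assumptions.

```python
import numpy as np

def tdnn_layer(X, W, b, context=2, stride=1, phi=np.tanh):
    """Apply one TDNN layer to a sequence X of shape (T, D).
    Each output frame sees input frames [t-context, ..., t+context];
    a stride > 1 gives a subsampled TDNN."""
    T, D = X.shape
    outputs = []
    for t in range(context, T - context, stride):
        window = X[t - context:t + context + 1].reshape(-1)  # splice 2*context+1 frames
        outputs.append(phi(W @ window + b))                  # weights shared across time
    return np.stack(outputs)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))            # 100 frames of 40-dim features
W = rng.standard_normal((256, 5 * 40))        # context of +-2 -> 5 spliced frames
b = np.zeros(256)

H = tdnn_layer(X, W, b, context=2, stride=3)  # subsampled: shift 3 steps at a time
print(H.shape)                                # (32, 256)
```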
Figure 2.4 (a) Connections in 1-dimensional CNN layers. The size of the kernel is 3. Connections
with the same colour share the same weight. The receptive field of $s_3^{(2)}$ covers the full input
with sparse connections.
After the convolution operation, the output $s_{ij}$ is passed through a non-linear activation
function as in the MLP, and the entire feature map is processed by a pooling function. As
shown in Figure 2.4b, the pooling function replaces the output feature map at a certain location
with a summary statistic of the nearby output. For example, max-pooling (Zhou and Chellappa,
1988) only takes the maximum value of the output within a rectangular area in the feature
map, whereas mean pooling takes the average or a weighted average of the elements within the
defined area. The pooling operation effectively reduces the output dimension from each convo-
lutional layer and makes the representation approximately invariant to a small temporal/spatial
translation. This property is desirable for some features but not so much when the specific
location of a feature needs to be preserved (Boureau et al., 2010). Normally in a convolutional
neural network, there is more than one kernel in each layer to capture individual features and
many convolutional layers to have a large effective receptive field for deeper layers (Goodfellow
et al., 2016).
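The pooling operations described above can be illustrated with a short NumPy sketch over a 1-dimensional feature map. This is an illustration only, not code from the thesis, and the pooling width of 2 is an arbitrary assumption.

```python
import numpy as np

def pool_1d(feature_map, width=2, mode="max"):
    """Summarise each non-overlapping window of `width` values by its max or mean."""
    T = len(feature_map) - len(feature_map) % width      # drop any ragged tail
    windows = feature_map[:T].reshape(-1, width)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

s = np.array([0.1, 0.9, -0.3, 0.4, 0.7, 0.2])
print(pool_1d(s, width=2, mode="max"))    # [0.9 0.4 0.7]
print(pool_1d(s, width=2, mode="mean"))   # [0.5  0.05 0.45]
```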
Compared to MLPs, CNNs have two other distinct characteristics that make them very
successful in many pattern recognition tasks, especially in the field of computer vision (Goodfellow
et al., 2016). First, the convolution kernels interact with the input relatively sparsely, i.e. the
convolution kernel is normally much smaller than the input features. Unlike MLPs where each
input neuron is fully connected to each of the neurons in the next layer, only sets of weights in
these kernels are stored and used, and therefore the number of parameters can be several orders
of magnitude smaller. The sparse connectivity is demonstrated in Figure 2.4a. Furthermore,
since the kernels are much smaller than the input and they are acting as sliding windows
across the whole of the input, the same set of weights are reused multiple times instead of
having different parameters for different locations. Therefore, CNNs can be significantly more
efficient in terms of memory and computation. The two-dimensional convolution operation is
very suitable for image processing where feature invariance is desired in both dimensions. In
contrast, the advantage of CNNs for speech processing is less obvious where feature invariance
in the frequency dimension is much less desirable.
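To make the savings from sparse connectivity and weight sharing concrete, the short calculation below compares the parameter count of a 1-dimensional convolutional layer with that of a fully connected map over the same input. It is an illustrative sketch, not from the thesis; the sequence length, channel counts and kernel size are arbitrary assumptions.

```python
# A 1-D convolutional layer with kernel size 3 mapping 64 input channels to
# 64 output channels reuses the same small kernel at every time step.
kernel_size, c_in, c_out, T = 3, 64, 64, 1000

conv_params = c_out * c_in * kernel_size   # 12288 weights, independent of T
dense_params = (T * c_in) * (T * c_out)    # 4096000000 weights for a fully
                                           # connected map over the whole sequence
print(conv_params, dense_params)
```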
[Figure: an RNN unrolled over time, with the same input weight matrix U and history weight matrix W applied at every time step, mapping inputs x_{t-1}, x_t, x_{t+1} to hidden states h_{t-1}, h_t, h_{t+1}.]

For a unidirectional RNN layer, the hidden state at time $t$ is computed recursively as
$$\boldsymbol{h}^{(t)} = \phi\left(\boldsymbol{W}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}\boldsymbol{x}^{(t)}\right),$$
where W is the history weight matrix and U is the input weight matrix. RNNs can also be
applied to a sequence in the backwards direction. For one layer in a bidirectional RNN (Schuster
and Paliwal, 1997), the forward and backward hidden states can be concatenated or transformed
by a projection layer to be fed into the next layer or an output layer.
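A minimal NumPy sketch of this recurrence is given below. It is an illustration rather than code from the thesis; the dimensions, the tanh activation and the zero initial state are arbitrary assumptions.

```python
import numpy as np

def rnn_forward(X, W, U, phi=np.tanh):
    """Run a unidirectional RNN over a sequence X of shape (T, D_in).
    h^(t) = phi(W h^(t-1) + U x^(t)), with h^(0) = 0."""
    T = X.shape[0]
    h = np.zeros(W.shape[0])
    states = []
    for t in range(T):
        h = phi(W @ h + U @ X[t])   # the same W and U are reused at every time step
        states.append(h)
    return np.stack(states)         # hidden states h^(1), ..., h^(T)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40))              # 50 frames of 40-dim input
W = 0.1 * rng.standard_normal((128, 128))      # history weight matrix
U = 0.1 * rng.standard_normal((128, 40))       # input weight matrix
print(rnn_forward(X, W, U).shape)              # (50, 128)
```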
RNNs can be viewed as a fully connected graphical model that models the target sequence.
The hidden variable h is an efficient way of parameterising the graphical model despite being a
deterministic function of the input and the previous hidden state. As a result, the output at any
time step depends on all the previous input for a unidirectional RNN and the total number of
parameters in an RNN is independent of the length of input sequences.
RNNs are particularly useful for tasks whose input and/or output are sequences. RNNs
can map an input sequence to an output sequence of the same length (e.g. part of speech
tagging (Socher et al., 2010)), can encode an input sequence into a fixed-length vector (e.g.
sentiment analysis (Tang et al., 2015)), and can convert a fixed-size input to a variable-length
sequence (e.g. image captioning (Xu et al., 2015)). For tasks such as machine translation (Bah-
danau et al., 2014) and speech recognition (Chorowski et al., 2015), they are also important
building blocks for RNN-based AED models (see Section 2.2) where, in general, input and
output sequences have different lengths.
One significant drawback of training RNNs with gradient-based optimisation is that the gradient
can either vanish or explode for very long sequences. Therefore, it is challenging to establish
long-term dependencies, which can be very important for tasks such as language and
speech processing. The gating function was therefore introduced to allow the gradient to flow
unchanged from history steps, thus improving the ability of the model to capture longer-term
dependencies. Among many variants of RNNs, the Long Short-Term Memory (LSTM) net-
work (Hochreiter and Schmidhuber, 1997) is one of the most successful models (Sak et al.,
2014). As shown in Figure 2.6, LSTM networks have memory cells in addition to the outer
recurrence of the RNN.
The internal recurrence of an LSTM memory cell has three gates which control the flow
of information at time $t$: a forget gate $\boldsymbol{f}^{(t)}$, an input gate $\boldsymbol{i}^{(t)}$ and an output gate $\boldsymbol{o}^{(t)}$.
[Figure 2.6: an LSTM memory cell at time t, with cell states c_{t-1} and c_t, forget gate f_t, input gate i_t, output gate o_t, tanh non-linearities, elementwise multiplications and additions, input x_t and hidden state h_t.]
<latexit sha1_base64="jW9JLZAIsBxFZZG6OL0K66i4oHM=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE1GPRi8cKpi20oWw2m3bpZhN2J0Ip/Q1ePCji1R/kzX/jts1BWx8MPN6bYWZemElh0HW/ndLa+sbmVnm7srO7t39QPTxqmTTXjPsslanuhNRwKRT3UaDknUxzmoSSt8PR3cxvP3FtRKoecZzxIKEDJWLBKFrJ77EoxX615tbdOcgq8QpSgwLNfvWrF6UsT7hCJqkxXc/NMJhQjYJJPq30csMzykZ0wLuWKppwE0zmx07JmVUiEqfalkIyV39PTGhizDgJbWdCcWiWvZn4n9fNMb4JJkJlOXLFFoviXBJMyexzEgnNGcqxJZRpYW8lbEg1ZWjzqdgQvOWXV0nrou5d1S8fLmuN2yKOMpzAKZyDB9fQgHtogg8MBDzDK7w5ynlx3p2PRWvJKWaO4Q+czx/b+I67</latexit>
ht
<latexit sha1_base64="OmKwESMdSF05m6DumMXdfshC3E0=">AAAB+XicbVDLSsNAFL2pr1pfUZduBovgqiRS1GXRjcsK9gFtCJPJpB06mYSZSaGE/okbF4q49U/c+TdO2iy09cAwh3PuZc6cIOVMacf5tiobm1vbO9Xd2t7+weGRfXzSVUkmCe2QhCeyH2BFORO0o5nmtJ9KiuOA014wuS/83pRKxRLxpGcp9WI8EixiBGsj+bY9DBIeqllsrnw897Vv152GswBaJ25J6lCi7dtfwzAhWUyFJhwrNXCdVHs5lpoRTue1YaZoiskEj+jAUIFjqrx8kXyOLowSoiiR5giNFurvjRzHqghnJmOsx2rVK8T/vEGmo1svZyLNNBVk+VCUcaQTVNSAQiYp0XxmCCaSmayIjLHERJuyaqYEd/XL66R71XCvG83HZr11V9ZRhTM4h0tw4QZa8ABt6ACBKTzDK7xZufVivVsfy9GKVe6cwh9Ynz9L65Qa</latexit>
Figure 2.6 Diagram of an LSTM cell. At time t, xt is the input vector, ht is the hidden state,
and ct is the cell state.
where the W and U matrices are the weights associated with each gate or the cell, b denotes the corresponding biases, ◦ denotes element-wise multiplication, and σ(·) is the sigmoid activation function (see Section 2.1.5). If the feature dimension is D and the hidden/cell state dimension is H, then W ∈ R^{H×H}, U ∈ R^{H×D}, and b ∈ R^{H}. In Equations (2.5), (2.6) and (2.7), all three gates have the same input, i.e. the previous hidden state h_{t−1} and the current input feature x_t, and follow the recurrence in Equation (2.4). As in Equation (2.8), the cell state c_t is a weighted combination of the previous cell state modified by the forget gate and a transform of the current input modulated by the input gate. The updated hidden state h_t is the updated cell state c_t, squashed by a tanh and controlled by the output gate. The introduction of the memory cell with gating functions allows history information to be integrated dynamically depending on the input sequence.
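To make the gating operations concrete, the following is a minimal NumPy sketch of a single LSTM step. The parameter names and shapes simply follow the dimensionalities quoted above (W ∈ R^{H×H}, U ∈ R^{H×D}, b ∈ R^{H}) and are illustrative rather than taken from any particular toolkit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, params):
    # One LSTM step. For each of the forget (f), input (i) and output (o) gates
    # and the cell candidate (g), params holds a recurrent matrix W (H x H),
    # an input matrix U (H x D) and a bias b (H,), matching the shapes in the text.
    acts = {}
    for name in ("f", "i", "o", "g"):
        W, U, b = params[name]
        pre = W @ h_prev + U @ x_t + b
        acts[name] = np.tanh(pre) if name == "g" else sigmoid(pre)
    c_t = acts["f"] * c_prev + acts["i"] * acts["g"]   # forget old state, add gated input
    h_t = acts["o"] * np.tanh(c_t)                     # output gate on squashed cell state
    return h_t, c_t

# toy usage: D = 3 input features, H = 4 hidden units
rng = np.random.default_rng(0)
D, H = 3, 4
params = {k: (rng.standard_normal((H, H)), rng.standard_normal((H, D)), np.zeros(H))
          for k in ("f", "i", "o", "g")}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell(rng.standard_normal(D), h, c, params)
```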
v = Σ_{t=1}^{T} a^{(t)} h^{(t)},    (2.10)

where Σ_t a^{(t)} = 1. The set of weights a is usually the output of a Softmax function on the set of scores produced by the model. The input feature vectors, h^{(t)}, can either be the hidden
representation at a certain stage of the model or the input features. The context vector can be
subsequently transformed into the output distribution. The attention weights are dynamically
generated for different inputs and the attention mechanism is a differentiable function that can
be jointly optimised with the rest of the network.
The attention mechanism is predominantly used in sequence-to-sequence tasks, where capturing an entire sequence in a single hidden representation becomes increasingly difficult as sequences grow longer (Bahdanau et al., 2014). Instead of compressing the information from a sequence into a fixed-length vector, a more effective approach is to extract a hidden representation for each position in the input sequence, and then use an attention mechanism to dynamically focus on different parts of the input sequence in the transformed space when generating each position of the corresponding output sequence. This is extremely
helpful for tasks like machine translation (Wu et al., 2016) where the input-to-output alignment
is irregular, and the decoder needs to “attend to” various features associated with words in the
input sequence at different stages. The attention mechanism has also inspired the creation of
new sequence models such as Transformers (Vaswani et al., 2017). A detailed description of
AED models will be given in Section 2.2.
Name      ϕ(x)                              ϕ′(x)                              Range
sigmoid   1 / (1 + e^{−x})                  ϕ(x)(1 − ϕ(x))                     (0, 1)
ReLU      0 if x < 0;  x if x ≥ 0           0 if x < 0;  1 if x ≥ 0            [0, ∞)
PReLU     ρx if x < 0;  x if x ≥ 0          ρ if x < 0;  1 if x ≥ 0            (−∞, ∞)
tanh      (e^x − e^{−x}) / (e^x + e^{−x})   1 − ϕ(x)²                          (−1, 1)
swish     x / (1 + e^{−x})                  (1 + ϕ(x) e^{−x}) / (1 + e^{−x})   (−0.3, ∞)
Table 2.1 List of frequently used activation functions for a single input unit. The exact minimum of the swish activation function is −W(1/e) where W is the Lambert W-function.
The activation functions in Table 2.1 are often used within layers and some of them can
be used for output layers as well. If the output layer requires a probability distribution for
classification or in the case of attention weights over various inputs, the Softmax activation
function is normally used as it ensures the output values are between 0 and 1 and sum up to
unity,
ϕ_i(x) = e^{x_i} / Σ_k e^{x_k},    (2.11)
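As a small illustration, Equation (2.11) can be implemented directly; the max-subtraction below is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x):
    """Softmax of Equation (2.11), shifted by max(x) for numerical stability.
    The shift cancels between numerator and denominator."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # outputs sum to 1
```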
The predecessor of AED models, the RNN encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014), uses two RNNs to achieve sequence-to-sequence modelling. One RNN encodes the input sequence into a
fixed-length vector representation and the other RNN decodes the representation into the output
sequence. The performance of this approach may be significantly limited by the information
stored in the fixed-length vector and the sequential information associated with it, especially for
long sequences. This shortcoming is addressed by the attention mechanism (Bahdanau et al.,
2014) introduced in Section 2.1.4. The AED model is able to “attend to” or “focus on” the
most relevant part of the input sequence to generate the output sequence. Based on the basic
building blocks in Section 2.1, two widely used AED models are described in this section.
E = Encoder(X),    (2.13)
a_i = Attention(a_{i−1}, d_{i−1}, E),    (2.14)
v_i = E a_i,    (2.15)
P(u_i | u_{1:i−1}, X), d_i = Decoder(u_{i−1}, d_{i−1}, v_i).    (2.16)
As described in Section 2.1.3, RNNs process sequences and can flexibly perform one-to-one, many-to-one and one-to-many sequence-to-sequence mappings. For attention-based models, an
RNN can be used as the encoder that performs the one-to-one mapping from the input feature
xt to embedding space et . The decoder can also use an RNN to generate the transcription
sequence based on the previous output token u1 , . . . , ui−1 , the decoder history hidden state
di−1 and the context vector vi produced by the attention mechanism. Figure 2.7 illustrates the
model architecture for RNN-based AED models with attention.
Figure 2.7 Model architecture for RNN-based AED models with attention.
The attention mechanism, as described in Section 2.1.4, produces a set of weights for
intermediate embedding vectors at all time steps. Based on the previous decoder state, encoded
feature embeddings, and optionally the attention weights, the attention mechanism provides
information on which parts of the input sequence to focus on. The decoder then can use
the context information given by the attention mechanism and the previous decoder state to
generate the next symbol in the output sequence. Therefore, the attention mechanism plays a
central role in the automatic soft alignment between input and output sequences.
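The interaction of Equations (2.13)-(2.16) can be summarised as a simple decoding loop. The sketch below uses placeholder callables for the encoder, attention mechanism and decoder rather than any specific library API (they are assumed to accept None for the initial state and weights), and greedy selection stands in for the beam search normally used in practice.

```python
import numpy as np

def greedy_decode(encoder, attention, decoder, x, sos_id, eos_id, max_len=100):
    """Schematic greedy decoding loop following Equations (2.13)-(2.16)."""
    E = encoder(x)                 # (2.13): D x T matrix of encoder embeddings
    u, d, a = sos_id, None, None   # previous token, decoder state, attention weights
    hyp = []
    for _ in range(max_len):
        a = attention(a, d, E)     # (2.14): weights over the T encoder positions
        v = E @ a                  # (2.15): context vector as a weighted sum
        p, d = decoder(u, d, v)    # (2.16): P(u_i | u_{1:i-1}, X) and new state
        u = int(np.argmax(p))      # greedy choice; beam search is used in practice
        if u == eos_id:
            break
        hyp.append(u)
    return hyp
```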
Context-based Attention This is the form of attention mechanism originally proposed for machine translation (Bahdanau et al., 2014). The attention scores are computed by an MLP that takes the previous decoder hidden state d_{i−1} and the encoded embeddings e_1, . . . , e_T as input, where w, W, V, b are the trainable parameters of the attention mechanism. The attention network can be trained jointly with the rest of the network.
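Since the scoring equation is not reproduced here, the sketch below assumes the standard additive form from Bahdanau et al. (2014), score_t = wᵀ tanh(W d_{i−1} + V e_t + b), with the trainable parameters named as in the text; all shapes are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def context_attention(d_prev, E, w, W, V, b):
    """Additive (Bahdanau-style) attention weights over the T encoder steps.
    E is a T x D matrix of encoder embeddings; shapes of w, W, V, b match."""
    scores = np.array([w @ np.tanh(W @ d_prev + V @ e_t + b) for e_t in E])
    return softmax(scores)            # attention weights a_i, summing to 1

# toy shapes: decoder state size 4, embedding size 5, attention size 3
rng = np.random.default_rng(0)
T, D, H, A = 6, 5, 4, 3
a = context_attention(rng.standard_normal(H), rng.standard_normal((T, D)),
                      rng.standard_normal(A), rng.standard_normal((A, H)),
                      rng.standard_normal((A, D)), np.zeros(A))
assert abs(a.sum() - 1.0) < 1e-6
```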
Location-based Attention For tasks like speech recognition, the alignment between input and output sequences is always monotonic. Therefore, knowing the alignment for the previous output should help the attention network produce the attention weights for the next step. Location-based attention (Chorowski et al., 2015) allows the attention network to take the previous attention weights as an extra input.
The attention mechanism takes a query and looks for similar keys. Then it uses the similarity
between each pair of query and key to obtain a weight for the value that corresponds to the key.
The output of the attention mechanism for the query is a weighted sum of all values, i.e.
Attention(E_q, E_k, E_v) = Softmax( E_q E_k^T / √D_k ) E_v,    (2.23)
where the Softmax is applied to each row. To interpret this operation, assume the query only
contains one entry, i.e. Tq = 1. The numerator inside the Softmax operation is the dot product
of the query vector with every key vector for all time steps T_k. The scalar result of each dot product represents the closeness of the query and the key (a cosine similarity scaled by the vector norms). The
Softmax of the Tk values normalises these dot products into a valid probability distribution,
which is used as weights when computing the weighted sum of the value vector across all time
steps. By assuming that each dimension of the query and key vectors is an independent random
variable with zero mean and unit variance, the value of the dot product has zero mean and variance D_k. Therefore, the dot product is scaled down by √D_k to prevent small gradients
caused by large values passed to the Softmax function. To ensure the validity of this assumption,
layer normalisation is usually used before the attention mechanism to normalise each feature to
approximately zero mean and unit variance. For a query with an arbitrary number of entries,
the attention operation first computes the closeness between each pair of query vector and
key vector using the dot product, and then normalises the values from the dot product across all time steps T_k for each query vector. A weighted sum of the value vectors produces the feature
vector for each time step in the query. Therefore, the result of the attention operation has
dimension Tq × Dv . The Transformer model is constructed based on this definition of the
attention mechanism as shown in Figure 2.8.
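A direct NumPy implementation of Equation (2.23) is shown below; the shapes follow the T_q × D_k, T_k × D_k and T_k × D_v convention described above, and the variable names are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Eq, Ek, Ev):
    """Attention of Equation (2.23): Eq is Tq x Dk, Ek is Tk x Dk, Ev is Tk x Dv;
    the result has shape Tq x Dv."""
    Dk = Ek.shape[-1]
    scores = Eq @ Ek.T / np.sqrt(Dk)                     # Tq x Tk similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise Softmax
    return weights @ Ev                                  # weighted sum of the values

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.standard_normal((2, 8)),
                                   rng.standard_normal((5, 8)),
                                   rng.standard_normal((5, 16)))
assert out.shape == (2, 16)
```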
The attention mechanism in the encoder and in the first layer of the decoder block is also called self-attention, as the input features come from the same source, i.e. Xq = Xk = Xv. Self-attention acts like a feature extractor that pays attention to the input features at all other time steps. In this sense, self-attention is similar to a TDNN whose scope is the entire sequence. The attention mechanism that bridges the encoder and the decoder is called source-attention, as Xk and Xv are the same encoder embeddings whereas Eq comes from the decoder embeddings based on the history sequence. This is similar to the attention mechanism in RNN-based AED models, but the attention weights are obtained by the dot product of the transformed features rather than directly via a neural network.

Figure 2.8 Model architecture of the Transformer, where multiple encoder blocks and decoder blocks are stacked, each containing multi-head attention, add & norm and MLP layers, and position encoding is added to the input embeddings of both the encoder and the decoder.
The other novel component of the Transformer model is multi-head attention, where the attention mechanism in Equation (2.23) is split into N heads. For head i, E_q^{(i)}, E_k^{(i)}, E_v^{(i)} are computed with a weight matrix associated with each head, W_q^{(i)}, W_k^{(i)}, W_v^{(i)}, according to Equation (2.22). Then the output from each attention head is concatenated and transformed by another weight matrix W_o,

Head_i = Attention(E_q^{(i)}, E_k^{(i)}, E_v^{(i)}),    (2.24)
MultiHeadAttention = Concat(Head_1, . . . , Head_N) W_o,    (2.25)
where N is the number of heads and W_o ∈ R^{N D_v × D_o}. Multi-head attention allows the model
to jointly attend to information from different representation subspaces at different positions.
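The following sketch illustrates Equations (2.24)-(2.25). Splitting single projection matrices into per-head slices is one common implementation choice assumed here for brevity; it is not necessarily how any particular toolkit organises the weights.

```python
import numpy as np

def multi_head_attention(Xq, Xk, Xv, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head attention in the spirit of Equations (2.24)-(2.25)."""
    def attend(q, k, v):                           # scaled dot-product attention
        s = q @ k.T / np.sqrt(k.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    Eq, Ek, Ev = Xq @ Wq, Xk @ Wk, Xv @ Wv         # project inputs (Equation (2.22))
    dq, dv = Eq.shape[-1] // n_heads, Ev.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):                       # one slice of the projections per head
        heads.append(attend(Eq[:, h * dq:(h + 1) * dq],
                            Ek[:, h * dq:(h + 1) * dq],
                            Ev[:, h * dv:(h + 1) * dv]))
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate heads and mix with Wo

rng = np.random.default_rng(0)
T, D = 5, 16
X = rng.standard_normal((T, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) for _ in range(4))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo, n_heads=4)   # self-attention
assert out.shape == (T, D)
```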
The Transformer model also includes an MLP network after the attention block and residual/skip connections across each attention block. Although the ordering of the input sequence
is important, the Transformer model so far does not explicitly use the sequential information
due to the lack of recurrence or convolution. To address this, positional encoding is added to
the input embedding for both the encoder and the decoder. Positional encoding describes the
position of an entity in a sequence so that each position is assigned a unique representation. The
positional encoding can either be learned or fixed (Vaswani et al., 2017). The fixed positional
encoding proposed by Vaswani et al. (2017) is based on sinusoidal functions
PE(t, i) = sin( t / 10000^{i/D} )      if i is even,
PE(t, i) = cos( t / 10000^{(i−1)/D} )  if i is odd,
where t is the position in the sequence and i is the dimension for D-dimensional input embed-
dings. This function means that each dimension of the positional encoding corresponds to a
sinusoid and allows the model to easily learn to attend by relative positions (Vaswani et al.,
2017).
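A sketch of the fixed sinusoidal encoding is given below, assuming an even embedding dimension D and the 10000-based wavelength scaling of Vaswani et al. (2017); the exact exponent convention follows the reconstruction above.

```python
import numpy as np

def positional_encoding(T, D):
    """Fixed sinusoidal positional encoding: even dimensions use a sine, odd
    dimensions a cosine, with geometrically increasing wavelengths."""
    pe = np.zeros((T, D))
    t = np.arange(T)[:, None]                          # positions 0 .. T-1
    div = np.power(10000.0, np.arange(0, D, 2) / D)    # per-pair wavelength scale
    pe[:, 0::2] = np.sin(t / div)
    pe[:, 1::2] = np.cos(t / div)
    return pe                                          # added to the input embeddings

pe = positional_encoding(T=50, D=8)
assert pe.shape == (50, 8)
```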
Compared to recurrent networks for sequence modelling, self-attention can be fully parallelised across time steps, and the maximum path length (the number of steps to traverse the network between two input features that are T steps apart) reduces from O(T) to O(1), which allows longer dependencies to be modelled directly. However, the per-layer computational complexity is O(T²) for self-attention vs. O(T) for recurrent networks.
2.3 Optimisation
Once the DNNs are designed for specific tasks, optimisation is the next challenge as millions
of parameters need to be estimated efficiently. The optimisation problem, in short, is to find a set of parameters θ that minimises the overall loss J(θ), i.e. the expected loss over the true data distribution p_data, where f(·) represents a non-linear function that describes the neural network.
However, the underlying data distribution is generally inaccessible. Therefore, the approximate optimisation objective is the empirical risk on given samples drawn from the empirical data distribution p̂_data,

J(θ) ≈ E_{(x,y)∼p̂_data}[ L(f(x; θ), y) ] = (1/N) Σ_{i=1}^{N} L(f(x^{(i)}; θ), y^{(i)}),    (2.28)
where L(·) is the loss function and N is the total number of samples in the training dataset.
Unlike linear models, there is no closed-form analytical solution for optimising DNNs.
In this section, the cornerstone of neural network optimisation is first described. Then the
most widely adopted gradient-based algorithm (i.e. Stochastic Gradient Descent (SGD)) and
its related learning strategies are introduced. However, minimising the empirical risk leads
to an overfitting problem. As one key aspect of deep learning, effective regularisation helps
deep models with a huge number of parameters to generalise well on unseen data. Some
regularisation techniques are included in Section 2.3.4.
Consider a composition y = g(x) and z = f(y), where z is a scalar output, x ∈ R^I is the input vector, y ∈ R^O is the intermediate result, and f(·) and g(·) are two general functions. According to the chain rule of calculus,

∂z/∂x_i = Σ_j (∂z/∂y_j) (∂y_j/∂x_i),    (2.30)

or in matrix notation

∇_x z = (∂y/∂x)^T ∇_y z,    (2.31)
1 The term “error” makes sense for the least squares loss function, but in general, “error” here refers to the partial derivative of the loss function.
where ∂y/∂x is the Jacobian matrix of g(·) of dimension O × I. This operation can be extended
recursively to a chain of operations of arbitrary steps and higher dimensional inputs.
In practice, a neural network forms a computational graph whose nodes are variables for
each operation and the directed edges are defined operations from an input variable to an output
variable. By using the chain rule, the derivatives of the final scalar loss can be computed with
respect to all the nodes within the computational graph recursively (Goodfellow et al., 2016).
However, naively computing all the gradients will have exponential cost as the computational
graph grows. Since many sub-expressions can be reused, storing these intermediate results can
yield linear computation time w.r.t. the number of edges in the graph. The backpropagation
algorithm constructs an identical computational graph as the forward propagation but with
reversed edges. Each edge in the backward graph has the derivative of the corresponding
operation in the forward graph.
Computational graphs are directed acyclic graphs. For the case of RNNs, the computation
graph is still acyclic after unfolding as shown in Figure 2.5. Figure 2.9 shows a generic example
of how the backpropagation algorithm works for a chain of vector operations.
Figure 2.9 A generic example of the backpropagation algorithm for a chain of vector operations x → f1(·) → h → f2(·) → y → f3(·) → z: the backward graph has reversed edges and applies the derivatives f1′(·), f2′(·) and f3′(·) to propagate the gradients of z with respect to y, h and x.
In many modern neural network toolkits and libraries, e.g. PyTorch2 and TensorFlow3 , an
approach called symbol-to-symbol (Goodfellow et al., 2016) is commonly used for computing
derivatives. During the forward pass, it adds additional nodes to the computational graph that
contains a symbolic description of the derivative functions. Any subset of the graph may be
evaluated later using specific numerical values. One advantage of this approach is that obtaining higher-order derivatives is straightforward, as backpropagation can simply be applied again to the extended graph containing the derivative nodes.
2 https://www.pytorch.org
3 https://www.tensorflow.org
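As a small illustration of the symbol-to-symbol approach, the PyTorch snippet below builds the derivative graph explicitly (via create_graph=True) so that it can itself be differentiated again; this is a toy example of the idea rather than anything specific to AED models.

```python
import torch

# Forward graph: z = x**3 with x = 2.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# Symbol-to-symbol differentiation: the first call adds derivative nodes to the
# graph, so the result can be differentiated again for the second derivative.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)   # 3 * x**2 = 12
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)                 # 6 * x    = 12
print(dy_dx.item(), d2y_dx2.item())
```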
The gradient of the empirical risk in Equation (2.28) w.r.t. the parameters is

∇_θ J(θ) = (1/N) Σ_{i=1}^{N} ∇_θ L(f(x^{(i)}; θ), y^{(i)}),    (2.32)

and gradient descent iteratively updates the parameters in the opposite direction of the gradient,

θ ← θ − ϵ ∇_θ J(θ),    (2.33)

where the step size is controlled by ϵ, which is known as the learning rate.
In practice, the number of data samples is usually very large and the computational cost of
Equation (2.32) increases linearly with N . Therefore, performing an update on all parameters
with the full training set can be very computationally expensive considering that gradient
descent is an iterative process. As Equation (2.28) shows, the loss is an expected value, so the gradient for one update can be approximated by sampling a small portion of the data, M ≪ N, i.e. a minibatch of samples randomly drawn from the training set. This is the idea of SGD (Bottou, 2010),

g = (1/M) Σ_{i=1}^{M} ∇_θ L(f(x^{(i)}; θ), y^{(i)}),    (2.34)
θ ← θ − ϵ g.    (2.35)
During training, the minibatch size is usually constant and is in the order of a few hundred.
Therefore, each update can be performed at a constant cost regardless of the size of the dataset.
However, by using a small batch size, the gradient calculation can be biased towards sampled
data which yields noisy gradients and is prone to local optima.
Since SGD may be very noisy and slow, the choice of the learning rate ϵ becomes critically
important. If the learning rate is too large, the update step may overshoot and sometimes may
even diverge. On the other hand, a small learning rate will cause the convergence to be slow and
stuck at poor local minima. To address this problem, various learning rate schedulers have been
proposed to reach a good compromise between convergence and speed. The learning rate can be
set to gradually ramp up at the beginning and/or decay linearly or quadratically as the training
proceeds. Some schedulers decrease the learning rate in a more discrete fashion based on the
performance of the current model during training. The momentum term, i.e. the accumulated
update history direction, can be interpolated with the gradient of the current minibatch to carry
the inertia of past updates and helps SGD gain faster convergence with reduced oscillation
across the error surface (Bishop, 1995; Polyak, 1964). SGD with momentum can be written as
follows.
m ← βm + ϵg, (2.36)
θ ← θ − m, (2.37)
where s and r are the estimated first and second moments of the gradient g, and η1 = 0.9
and η2 = 0.999 are the suggested settings. δ is a small number, e.g. 10^{−8}, for computational
stability. However, in general, there is no conclusive advantage of using one particular learning
rate scheduling algorithm over others in practical neural network training. Techniques such
as gradient clipping (Mikolov, 2012) and gradient scaling (Pascanu et al., 2013) restrict the
magnitudes of update value and gradient norm to minimise the effect caused by abnormal or
noisy data within a minibatch.
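Equations (2.36)-(2.37) and the moment estimates s and r described above can be written as a few lines of update code. The sketch below assumes the widely used Adam-style update for the adaptive case, since the corresponding equations are not reproduced here; all names are illustrative.

```python
import numpy as np

def sgd_momentum_step(theta, m, g, lr=0.01, beta=0.9):
    """SGD with momentum, following Equations (2.36)-(2.37): the accumulated
    direction m carries the inertia of past updates."""
    m = beta * m + lr * g
    return theta - m, m

def adaptive_step(theta, s, r, g, t, lr=1e-3, eta1=0.9, eta2=0.999, delta=1e-8):
    """Adaptive update in the style of the widely used Adam optimiser, using the
    symbols from the text: s and r estimate the first and second moments of g,
    eta1/eta2 are their decay rates and delta keeps the division stable."""
    s = eta1 * s + (1 - eta1) * g
    r = eta2 * r + (1 - eta2) * g ** 2
    s_hat, r_hat = s / (1 - eta1 ** t), r / (1 - eta2 ** t)   # bias correction
    return theta - lr * s_hat / (np.sqrt(r_hat) + delta), s, r

# toy quadratic loss J(theta) = ||theta||^2, so the gradient is 2 * theta
theta, m = np.ones(3), np.zeros(3)
for _ in range(5):
    theta, m = sgd_momentum_step(theta, m, g=2 * theta)
```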
where U and N are uniform and normal distributions, and g is the gain factor associated with activation functions, where g = 1 for sigmoid, g = √2 for ReLU, g = √(2/(1 + ρ²)) for PReLU, and g = 5/3 for tanh.
Normalisation accounts for two aspects. The first one is feature normalisation, where each dimension of the input features is normalised using first and second-order statistics computed at the global level, the batch level, or the level of some data-specific subset. This aligns well with the assumptions made when initialising weights randomly.
The other aspect is model reparameterisation. While training DNNs with SGD, only
first-order interactions between parameters are considered since all parameters are updated
simultaneously. Batch normalisation (Ioffe and Szegedy, 2015) is a technique that reparameterises the activations of each layer as each minibatch is processed. Specifically, for a batch of input X = [x_1, . . . , x_M] to a layer, the features are first normalised

x̂_i = (x_i − µ) / √(σ² + δ),    (2.43)
where µ and σ are the mean and standard deviation vectors computed across the batch of features. To restore the representation power, the normalised features are transformed again by a layer-specific standard deviation vector σ̃ and mean vector µ̃ to give the output of the layer Y = [y_1, . . . , y_M], where µ̃ and σ̃ are learnt parameters of the network. The output of the batch normalisation, Y, is then treated as input features to the next layer in the DNN. During inference time, the
population statistics are used for the input to each layer. The batch normalisation operation
introduces greater Lipschitzness (the change of output values due to the change of input values
is more limited) into the loss and the gradient during training, thus generating a smoother loss
landscape (Santurkar et al., 2018).
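A minimal training-time sketch of Equation (2.43) plus the learnt scale and shift is given below; at inference the batch statistics would be replaced by population statistics, as noted above. The names gamma and beta correspond to σ̃ and µ̃ in the text and are illustrative.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, delta=1e-5):
    """Batch normalisation for a minibatch X of shape (M, D) at training time."""
    mu = X.mean(axis=0)                       # per-dimension batch mean
    var = X.var(axis=0)                       # per-dimension batch variance
    X_hat = (X - mu) / np.sqrt(var + delta)   # Equation (2.43)
    return gamma * X_hat + beta               # learnt scale/shift restores capacity

rng = np.random.default_rng(0)
Y = batch_norm_train(rng.standard_normal((32, 8)), np.ones(8), np.zeros(8))
assert Y.shape == (32, 8)
```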
Despite the success of batch normalisation, it is challenging to apply this technique to
RNNs where having a batch normalisation procedure for each time step is impractical. Another
alternative is called layer normalisation (Ba et al., 2016). Instead of normalising across each
feature dimension in the batch of input features, layer normalisation normalises across all
feature dimensions for each feature at each layer. Therefore, layer normalisation does not
depend on other samples and can be used to train RNNs with long sequences and small minibatches. Layer normalisation has been shown to speed up the training of RNNs (Ba et al.,
2016). Transformer models also use layer normalisation (Vaswani et al., 2017).
Parameter norm penalty, or weight decay (Krogh and Hertz, 1992), aims to constrain the capability of the model by penalising large weight values. It adds an extra penalty term, based on a norm of the parameters, to the standard training objective, where ν is a non-negative hyperparameter for the weight of the norm regularisation and ∥θ∥_p
is the p-norm of all parameters. For DNNs, the L2 norm on all weights is commonly used.
Note that bias terms are normally unregularised (Goodfellow et al., 2016). For simplicity, the
objective function and its derivatives w.r.t. weights can be written for the L2 norm as
J̃(θ) = J(θ) + (ν/2) θ^T θ,    (2.46)
∇_θ J̃(θ) = ∇_θ J(θ) + ν θ.    (2.47)
Ensemble methods usually combine multiple models at different levels, e.g. voting at the
output level or averaging at the posterior level (Breiman, 1996). Since different models tend
to make different errors, ensemble methods can normally outperform individual models. To
demonstrate, if there are n regression models and each one makes an error ε_i on each sample, where E[ε_i] = 0, the expected squared error of the ensemble model (averaged model outputs) is

E[ ( (1/n) Σ_i ε_i )² ] = (1/n²) E[ Σ_i ε_i² + Σ_i Σ_{j≠i} ε_i ε_j ] = (1/n) E[ε_i²] + ((n−1)/n) E[ε_i ε_j].    (2.49)
If all errors are perfectly correlated, i.e. E[ε_i²] = E[ε_i ε_j], meaning the n models are identical, the ensemble does not help. If all errors made by different models are independent, i.e. E[ε_i ε_j] = 0,
the expected squared error decreases linearly w.r.t. the size of the ensemble. If n models are
constructed differently, e.g. different subsets of training data, different initialisation, different
data shuffling, different architectures, and different hyperparameters, the ensemble model
is able to reduce the generalisation error, especially when individual ensemble components
become more complementary to each other.
Similar to ensembling multiple models with different architectures trained on the same set of data, dropout (Srivastava et al., 2014) is an alternative method that provides strong regularisation by randomly disabling part of the neurons in a DNN during training. Dropout is
an approximation of training an exponentially large number of neural networks with partially
shared parameters simultaneously as different dropout masks are applied to different layers of
the network for each minibatch. The proportion of neurons to be dropped for an iteration is
a hyperparameter P_dropout to be tuned. Intuitively, dropout prevents some neurons from becoming over-specialised on certain data. At test time, all the weights going out of each neuron are multiplied by the probability of including the neuron (1 − P_dropout). This empirical rule
performs well in practice.
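The rule described above can be sketched as follows; the mask-based training branch and the (1 − P_dropout) test-time scaling mirror the description in the text, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_dropout, training):
    """Dropout: units are disabled with probability p_dropout during training;
    at test time the activations are scaled by the keep probability instead."""
    if training:
        mask = rng.random(h.shape) >= p_dropout   # keep each unit with prob 1 - p
        return h * mask
    return h * (1.0 - p_dropout)                  # test-time scaling rule

h = np.ones(10)
print(dropout(h, 0.3, training=True))    # some units zeroed out
print(dropout(h, 0.3, training=False))   # all units scaled by 0.7
```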
Exponential Moving Average (EMA), a temporal ensemble of model parameters, is another
commonly used approach to boost the final performance (Polyak and Juditsky, 1992). A set of
“shadow” parameters θ̃ are kept to maintain a moving average of the trained parameters,
θ̃ ← κ θ̃ + (1 − κ) θ,    (2.50)
where κ is the decay parameter for EMA. κ is normally set to be close to 1.0, e.g. 0.9999. The
shadow parameters are updated after each model update or a fixed number of model updates,
which does not influence the training process at all. Maintaining an EMA of the model weights during training can improve the final performance significantly compared with just using the final weights (Tarvainen and Valpola, 2017).
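Equation (2.50) amounts to a one-line update of the shadow parameters, for example:

```python
import numpy as np

def ema_update(theta_shadow, theta, kappa=0.9999):
    """Update the "shadow" parameters of Equation (2.50) after a model update."""
    return kappa * theta_shadow + (1.0 - kappa) * theta

theta_shadow, theta = np.zeros(4), np.ones(4)
for _ in range(10):                 # called after each training step
    theta_shadow = ema_update(theta_shadow, theta)
```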
Parameter tying or parameter sharing is a technique that forces parts of the parameter vector
within a model to be identical. In CNNs described in Section 2.1.2, parameters are shared
for each kernel that convolves across the entire input. In RNNs introduced in Section 2.1.3,
parameters are tied across each time step if viewed from the unfolding perspective. Parameter
sharing can reduce the memory footprint of a model and provide a regularisation effect by
restricting the number of parameters.
Multi-task training (Caruana, 1997) makes the model jointly learn from multiple objective
functions or perform multiple tasks at the same time. Usually, a significant part of the network
is shared and only the output layers differ. The multiple output branches can perform
different but similar tasks, e.g. grapheme and phone recognition, or the same task using
different objective functions, e.g. Cross Entropy (CE) and squared error. In the multi-task
training framework, the lower parts of the network are normally shared to extract useful features
from the input while the upper parts are split into multiple branches to transform the shared
features to perform the desired task. The improved generalisation performance mainly comes from the shared parameters and the limited difference between the tasks, which prevent the network from over-specialising to one task. This approach is not always applicable as two or more
tasks with shared statistical factors are required. The overall loss function, an interpolation
of losses from all branches, needs to be carefully balanced in order to achieve the desired
performance for all tasks.
A model that generalises well should be able to cope with data uncertainties, i.e. noise (Sietsma and Dow, 1991). A model's noise robustness can always be improved by
training with more data sampled from the real data distribution. However, this may not
be practical due to limited available data resources. Data augmentation is one effective
approach to increasing the amount of training data based on the currently available data. For
example, images can be transformed differently to generate new images, e.g. cropping, scaling,
translating, rotating (Krizhevsky et al., 2012); speech can be augmented by vocal tract length
perturbation or speed perturbation (Jaitly and Hinton, 2013; Ko et al., 2015). Noise can
always be injected into the data by adding or subtracting some small random values. Data
augmentation allows the space of training data to be enriched and forces the model to be
invariant to the applied transformations and more robust to noisy data. Specific to speech
recognition, SpecAugment (Park et al., 2019) has been widely used as an effective approach
to augmenting the input log mel spectrogram by applying multiple instances of time warping,
frequency masking and time masking. The randomly corrupted input prevents the network
from overfitting specific features and improves the generalisation of the model for mismatched
acoustic conditions.
Similar to augmenting the input, the output labels can also be corrupted by some degree of
noise, especially for recognition tasks. Label smoothing (Szegedy et al., 2016) replaces the
one-hot training targets by 1 − ξ for the 1s and ξ/(|y| − 1) for the 0s. By smoothing the hard targets to soft ones, this technique injects noise into the output labels and acts as a regulariser that prevents the model from pursuing hard Softmax output distributions by growing ever larger weights. From another perspective, label smoothing simulates the scenario where the training
labels are not perfect.
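For a vocabulary of size |y|, the smoothed target described above can be constructed as below; the smoothing factor ξ and the target index are illustrative values.

```python
import numpy as np

def smooth_labels(target_id, num_classes, xi=0.1):
    """Label smoothing: the correct class gets 1 - xi and the remaining mass xi
    is spread evenly over the other num_classes - 1 classes."""
    y = np.full(num_classes, xi / (num_classes - 1))
    y[target_id] = 1.0 - xi
    return y

print(smooth_labels(target_id=2, num_classes=5))   # sums to 1
```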
Apart from adding noise to input features and output labels, small and random perturbations to
model parameters should ideally result in a minimal change at the output. One strategy is to
add zero-mean, fixed variance Gaussian noise to the network weights (Jim et al., 1996) during
training, which is shown to improve the performance of RNNs (Graves, 2013). Note that,
for each minibatch, the gradient is computed based on the model parameters with Gaussian
noise added, but the update to the parameters using gradient descent is applied to the original
parameters without Gaussian noise. Weight noise tends to “simplify” neural networks as
noise reduces the precision with which the weights must be described. Simpler networks
are preferred because they normally generalise better (Graves, 2011). Other more complex
strategies such as adaptive weight noise (Graves, 2011) have also been proposed. Furthermore,
weight quantisation (Hubara et al., 2018; Woodland, 1989) can also be regarded as another
form of weight noise that improves the generalisation of neural networks.
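The training scheme with fixed-variance Gaussian weight noise can be sketched as follows, assuming a generic PyTorch model, loss function and optimiser; the noise standard deviation sigma is a placeholder. The key point, as described above, is that the gradient is computed with the noisy parameters while the update is applied to the clean ones.

import torch

def train_step_with_weight_noise(model, loss_fn, batch, optimiser, sigma=0.01):
    """One update with zero-mean Gaussian weight noise: gradients are computed
    on noisy parameters, but the update is applied to the original parameters."""
    clean = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))   # perturb the weights
    optimiser.zero_grad()
    loss = loss_fn(model, batch)                  # forward pass on noisy weights
    loss.backward()                               # gradients w.r.t. noisy weights
    with torch.no_grad():
        for p, c in zip(model.parameters(), clean):
            p.copy_(c)                            # restore the clean weights
    optimiser.step()                              # update clean weights with those gradients
    return loss.item()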
2.4 Summary
This chapter describes the basic building blocks used for the construction of RNN-based
and Transformer-based AED models, including MLP, CNN, RNN and attention mechanism.
Various optimisation procedures and techniques are then introduced for training AED models
and other deep neural networks. Many of these terms will be referred to frequently throughout the thesis.
Chapter 3
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the process of converting the acoustic speech signal
into its corresponding textual representation. It is considered to be a challenging task for
real-life applications due to both inter-speaker variability (including physiological differences
and pronunciation differences of accents or dialects) and intra-speaker variability (including
different styles of speech). Other factors such as channel distortion, reverberation and back-
ground noise pose more difficulties for achieving effective speech to text conversion (Jurafsky
and Martin, 2008).
There are two major modern frameworks for ASR. The first one is based on the noisy
Source-Channel Model (SCM) (Jelinek, 1997; MacKay, 2003; Shannon, 1948) where the
acoustic sequence is modelled using Hidden Markov Models (HMMs) from a generative
perspective. This framework has multiple modules such as an acoustic model, a language
model, a pronunciation model and a decoder. The advantage of the framework is that multiple
sources of structured knowledge such as phonetic and linguistic information can be easily
integrated. The modular design allows each component to be optimised separately. Because the
decoding procedure is generally frame-synchronous, the framework is able to process speech
data in a streaming fashion, which is desirable in various applications. In this chapter, the
SCM-based ASR framework is first described. Some similar frame-synchronous models, such
as Connectionist Temporal Classification (CTC) and neural transducers, are also introduced.
Another type of framework is based on Attention-Based Encoder-Decoder (AED) models.
Unlike the generative modelling approach with HMMs, the speech recognition task is formu-
lated from a discriminative perspective. An AED model directly learns the probability of the
transcription sequence given the input acoustic sequence (Chorowski et al., 2015). As only
a single model needs to be optimised, constructing an AED-based ASR system is, in theory,
much simpler than an SCM-based system. In this chapter, details about the model architecture,
training and decoding procedures of AED models will be given.
Both SCM-based and AED-based frameworks rely on fully transcribed speech data. How-
ever, labelled training data is typically limited and expensive to acquire. In order to leverage
a large amount of unlabelled data and the modelling power of large Deep Neural Networks
(DNNs), self-supervised pre-training enables the model to learn meaningful feature represen-
tations without supervision (van den Oord et al., 2018). Pre-trained models can be used to
initialise the acoustic model in an SCM-based system or the encoder in an AED-based system
for subsequent fine-tuning to improve recognition performance (Baevski et al., 2020b). In the
last part of the chapter, some self-supervised pre-training approaches will be described.
where p(O|w), estimated by an AM, is the likelihood of generating the observation sequence
through the channel; P (w), approximated by an LM, describes the underlying probabilistic dis-
tribution of the source. In this way, an SCM-based ASR system consists of several independent
modules shown in Figure 3.1.
[Figure 3.1: an SCM-based ASR system. The word sequence w in the speaker's mind is converted by speech production into a speech signal, pre-processed into the observation sequence O, and decoded into the hypothesis ŵ using an acoustic model, a language model and a lexicon.]
To extract FBANK features, the windowed signal is processed by the short-time Fourier
transform to obtain the spectrum. Based on the human perception that larger frequency intervals
are required to produce equal pitch increments at higher frequencies, the Mel-scale is used to
adjust to this phenomenon by warping the normal frequency f to f_mel:
f_mel = 1127 \ln\left(1 + \frac{f}{700}\right).    (3.4)
A series of overlapping triangular band-pass filters that are linearly spaced across the Mel-scale
are applied to the linear power spectrum. The log of the value produced by each filter is
concatenated into a vector, which is referred to as FBANK features.
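A simplified NumPy sketch of this FBANK extraction is given below, assuming a power spectrum has already been obtained from the short-time Fourier transform; the filter placement and edge handling are deliberately minimal and differ in detail from standard toolkits such as HTK or Kaldi.

import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)      # Equation (3.4)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def fbank(power_spec, sample_rate=16000, n_filters=40):
    """power_spec: (T, n_bins) linear power spectrum from the STFT."""
    n_bins = power_spec.shape[1]
    # filter boundary frequencies, equally spaced on the Mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_bins - 1) * hz_points / (sample_rate / 2)).astype(int)
    filters = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        filters[i - 1, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)   # rising edge
        filters[i - 1, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)   # falling edge
    return np.log(power_spec @ filters.T + 1e-10)    # (T, n_filters) FBANK features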
FBANK features are found to be highly correlated between dimensions, which is not ideal
for Gaussian Mixture Models (GMMs) with diagonal covariance matrices used in GMM-HMM
based models for ASR (see Section 3.2). To decorrelate the coefficients in FBANK features,
the Discrete Cosine Transform (DCT) is applied on these features to obtain MFCCs. The i-th
dimension of a D_c-dimensional MFCC feature o_i^{MFCC} is computed from the D_f-dimensional FBANK feature o^{FBANK} as
o_i^{MFCC} = \sqrt{\frac{2}{D_f - 1}} \sum_{j=0}^{D_f} o_j^{FBANK} \cos\left(\frac{\pi i}{D_f}\left(j + \frac{1}{2}\right)\right), \quad i = 0, \ldots, D_c - 1.    (3.5)
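A corresponding sketch of the DCT step is shown below; it applies a type-II DCT along the filterbank axis and keeps the first D_c coefficients. The exact normalisation differs slightly between toolkits and the form of Equation (3.5), so this should be read as illustrative only.

import numpy as np

def mfcc_from_fbank(fbank_feats, n_ceps=13):
    """fbank_feats: (T, D_f) log filterbank features; returns (T, n_ceps) MFCCs."""
    T, D_f = fbank_feats.shape
    j = np.arange(D_f)
    i = np.arange(n_ceps)
    # DCT-II basis: cos(pi * i * (j + 0.5) / D_f)
    basis = np.cos(np.pi * np.outer(i, j + 0.5) / D_f)    # (n_ceps, D_f)
    return fbank_feats @ basis.T * np.sqrt(2.0 / D_f)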
The feature vector augmented with first and second order deltas is
o'_t = \begin{bmatrix} o_t \\ \Delta o_t \\ \Delta^2 o_t \end{bmatrix}.    (3.7)
However, the addition of deltas introduces correlation across dimensions which is inconsistent
with GMMs that assumes that features are element-wise independent. Linear discriminant
analysis (Campbell, 1984; Kumar, 1998) can be used to project into a new space where feature
dimensions are uncorrelated. Before undergoing further processing, the cepstrum is normalised
by subtracting the mean and dividing by the standard deviation. The normalisation procedure,
Cepstral Mean and Variance Normalisation (CMVN) (Viikki and Laurila, 1998), can effectively
minimise the channel and noise distortions. Vocal Tract Length Normalisation (VTLN) (Lee
and Rose, 1996) can be further applied to compensate for the differences in vocal tract length
and shape between speakers. Gaussianisation (Liu et al., 2005; Saon et al., 2004) decorrelates
each dimension of the feature vector to reduce the impact on models that have independence
assumptions. Gaussianisation is related to CMVN but additionally normalises higher-order
moments across dimensions.
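A minimal per-utterance CMVN sketch is shown below, assuming feats is a (T, D) array of features; in practice the statistics may instead be accumulated per speaker or over a sliding window.

import numpy as np

def cmvn(feats, eps=1e-8):
    """Cepstral Mean and Variance Normalisation: per-dimension zero mean and
    unit variance, computed here over a single utterance."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps)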
Unlike GMM-HMM systems where the covariance matrix is assumed to be diagonal, neural
networks do not make assumptions about the correlation between input feature dimensions.
Therefore, FBANK can be directly used as input to neural network models. Meanwhile, methods
have been explored to remove the front-end processing stage and let neural networks learn
implicit filters from raw waveforms and perform recognition within a single model (Sainath
et al., 2015; Tüske et al., 2014; von Platen et al., 2019).
Kanthak and Ney, 2002). However, using graphemic units for acoustic modelling poses a greater
challenge for acoustic models, especially for languages such as English with orthographic
irregularities where many different pronunciations correspond to the same grapheme or a single
pronunciation corresponds to multiple different graphemes. For languages that are not based
on the Latin script, the graphemic units can be constructed using unicode (Gales et al., 2015; Li
et al., 2019a).
Table 3.1 shows several different modelling units for the same English word “hello”. For
phonetic units, two pronunciations are available. Note that for subword units using both past
and future contexts, the number of possible combinations is |V|^3, where |V| is the number of
context-independent units. To avoid an explosive increase in the number of classes, decision tree
clustering (Hwang and Huang, 1993; Young et al., 1994) is used to cluster the context-dependent
units into a manageable number of classes.
mono-grapheme                  h  e  l  l  o
tri-grapheme                   sil-h+e  h-e+l  e-l+l  l-l+o  l-o+sil
mono-grapheme with position    h^I  e^M  l^M  l^M  o^F
mono-phone                     hh ax l ow
                               hh eh l ow
triphone                       sil-hh+ax  hh-ax+l  ax-l+ow  l-ow+sil
                               sil-hh+eh  hh-eh+l  eh-l+ow  l-ow+sil
Table 3.1 Different modelling units for the word “hello”. The symbol before ‘-’ is the previous
context and the symbol after ‘+’ is the future context. Superscripts indicate the location of the
modelling unit, where ‘I’, ‘M’ and ‘F’ correspond to the initial, middle and final positions in a
word.
emitted observation vectors. First, it assumes that the observations within an acoustic unit can be segmented into multiple phases, each of which is stationary. Second, the current hidden state only
depends on the previous state. Third, the observation vector only depends on the current hidden
state. In other words, the current feature vector is conditionally independent of all the other
surrounding frames given the current hidden state. Although none of these assumptions holds for real speech signals, they allow HMMs to offer many computationally efficient algorithms for search. Many other aspects of acoustic modelling try to overcome these
strict assumptions by including more context information.
To formally introduce HMMs, let O = [o1 , . . . , oT ] be an observation sequence. For the
five-state left-to-right HMM shown in Figure 3.2, states 1 and 5 are non-emitting states whereas
states 2, 3 and 4 are emitting states.
Figure 3.2 An illustration of the five-state HMM used for speech recognition. The first and last
HMM states are non-emitting.
a_{ij} = P(s_{t+1} = j | s_t = i), \quad \text{where } \sum_{j=1}^{S} a_{ij} = 1 \;\; \forall i = 1, \ldots, S,    (3.8)
where S is the total number of states in the HMM. The emission probability density for o_t at state j is defined as
b_j(o_t) = p(o_t | s_t = j).    (3.9)
For the observation sequence of length T , there must be a corresponding state sequence
s = [s1 , . . . , sT ] (excluding non-emitting states). The hidden state sequence can only be
inferred from the output sequence.
b_j(o_t) = N(o_t; μ_j, Σ_j) = \frac{1}{\sqrt{(2π)^D |Σ_j|}} \exp\left\{ -\frac{1}{2} (o_t - μ_j)^T Σ_j^{-1} (o_t - μ_j) \right\},    (3.10)
where D is the feature dimension and (µj , Σj ) are the mean vector and covariance matrix asso-
ciated with the output distribution of the j-th state in an HMM. As mentioned in Section 3.1.1,
the covariance matrix is usually assumed to be diagonal to save both memory and computation
with limited data. Note that the feature vectors need to be processed such that the dimensions are approximately statistically independent.
In practice, using a single Gaussian distribution to model the output probability function is
a relatively poor approximation. Instead, GMMs can be used (Juang, 1985). Consequently, the
emission probability becomes
b_j(o_t) = \sum_{m=1}^{M_j} c_{jm} N(o_t; μ_{jm}, Σ_{jm}),    (3.11)
where Mj is the number of Gaussian components in the GMM associated with state j and the
scalar cjm is the weight (or prior) of the Gaussian m in state j. For a valid distribution, the
weights for all Gaussian components must be non-negative and must sum to 1. An HMM-based
ASR system that uses GMMs to model output probability distributions is called a GMM-HMM
system.
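A small sketch of evaluating log b_j(o_t) for a single diagonal-covariance GMM state, as in Equation (3.11), is given below; the component weights, means and variances are assumed to be given, and the log-sum-exp trick is used for numerical stability.

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(o_t, weights, means, covars):
    """log b_j(o_t) for a diagonal-covariance GMM.

    weights: (M,), means: (M, D), covars: (M, D) diagonal variances."""
    D = o_t.shape[0]
    diff = o_t - means                                  # (M, D)
    log_dets = np.sum(np.log(covars), axis=1)           # log |Sigma_m| for diagonal covariances
    mahal = np.sum(diff * diff / covars, axis=1)        # Mahalanobis terms
    log_gauss = -0.5 * (D * np.log(2 * np.pi) + log_dets + mahal)
    return logsumexp(np.log(weights) + log_gauss)       # log-sum-exp over components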
In contrast, the output density can also be approximated by DNNs (Bourlard et al., 1994). Different from the generative GMMs, where p(o_t|j) is directly modelled by multiple multivariate Gaussian distributions, a neural network is a discriminative model that estimates the state posterior probabilities P(j|o_t) (Bishop, 2006). According to Bayes' rule,
p(o_t | j) = \frac{P(j | o_t)\, p(o_t)}{P(j)} \propto \frac{P(j | o_t)}{P(j)}.    (3.12)
For each frame, the neural network takes the feature vector as an input and produces an output
posterior probability over all HMM states where the denominator P (j) is the prior probability
of the state j. It is worth noting that the marginal distribution p(ot ) can be safely ignored as it
does not depend on a particular state. The DNN-HMM system is also referred to as the hybrid
approach (Bourlard et al., 1994). This approach can be regarded as discriminative training of a
generative model. However, in order to train such neural networks for state classification, the
class label for each frame is often required, i.e. frame-level alignment. The alignment can be
obtained from a pre-trained GMM-HMM system by running the Viterbi algorithm (Viterbi,
1967) with a composite HMM for each utterance in the training data that models the reference
word sequence. The acoustic model can also be trained without frame-level alignment by
using sequence-level optimisation, such as Lattice Free (LF)-Maximum Mutual Information
(MMI) (Povey et al., 2016).
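Returning to the hybrid approach, the conversion from DNN state posteriors to the scaled log-likelihoods used during decoding can be sketched as follows; prior_scale is a hypothetical smoothing factor, and the state priors are assumed to be estimated from the training alignments.

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_state_priors, prior_scale=1.0):
    """Hybrid-system pseudo-likelihoods: log p(o_t|j) is approximated, up to the
    state-independent term log p(o_t), by log P(j|o_t) - prior_scale * log P(j)."""
    return log_posteriors - prior_scale * log_state_priors

# log_posteriors: (T, n_states) from the DNN's log-Softmax output;
# log_state_priors: (n_states,) from state frequencies in the alignments.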
Another type of acoustic modelling approach is called a Tandem system (Grézl et al., 2007;
Hermansky et al., 2000). The neural network is used as a feature extractor. The DNN-based
feature and the acoustic feature obtained from the front-end processing (e.g. MFCC or PLP) are
concatenated, and the combined features are subsequently modelled by a GMM-HMM system. Tandem systems
can make use of the adaptation techniques developed for GMM-HMM systems while exploiting
the representation power from the neural networks.
where Sw is a set of all possible state sequences based on the hypothesis w. The model
parameters θ include state transition probabilities and state output distributions. Simply
summing over all possible state sequences is computationally intractable as the number of
possible sequences grows exponentially with the number of time steps. The forward-backward
algorithm uses the conditional independence property of HMMs and exploits the idea of
dynamic programming to allow the computational cost of the likelihood term to drop from O(S^T) to O(ST) (Huang et al., 2001).
As the name suggests, the forward-backward algorithm factorises the computation into
forward and backward probabilities.
p(O|w; θ) = \sum_s p(o_{1:t} | o_{t+1:T}, s)\, p(o_{t+1:T}, s)    (3.14)
= \sum_s p(o_{1:t} | s_{1:t})\, p(o_{t+1:T}, s_{t+1:T} | s_{1:t})\, p(s_{1:t})    (3.15)
= \sum_s p(o_{1:t}, s_{1:t})\, p(o_{t+1:T}, s_{t+1:T} | s_t)    (3.16)
= \sum_i p(o_{1:t}, s_t = i)\, p(o_{t+1:T} | s_t = i)    (3.17)
= \sum_i α_t(i)\, β_t(i),    (3.18)
where the assumption that future states and future observations are independent from past states
and past observations given the current hidden state is used. The forward probability α and
backward probability β can be computed recursively:
α_t(i) = p(o_{1:t}, s_t = i) = \sum_j α_{t-1}(j)\, a_{ji}\, b_i(o_t),    (3.19)
β_t(i) = p(o_{t+1:T} | s_t = i) = \sum_j β_{t+1}(j)\, a_{ij}\, b_j(o_{t+1}).    (3.20)
More rigorously, for an HMM with non-emitting entry and exit states 1 and S, the forward
probability with initial conditions can be written as
α_t(i) = \begin{cases} 1 & i = 1 \text{ and } t = 0, \\ 0 & i = 1 \text{ and } 1 < t \leq T, \\ 0 & 1 < i \leq S \text{ and } t = 0, \\ \sum_{j=2}^{S-1} α_{t-1}(j)\, a_{ji}\, b_i(o_t) & 1 < i < S \text{ and } 1 \leq t \leq T. \end{cases}    (3.21)
The corresponding terminating conditions are
α_T(S) = \sum_{j=2}^{S-1} α_T(j)\, a_{jS},    (3.23)
β_0(1) = \sum_{j=2}^{S-1} β_1(j)\, a_{1j}\, b_j(o_1),    (3.24)
and both terminating probabilities are equal to the likelihood p(O|w; θ). After computing the
α’s and β’s in both directions, the posterior probability of being in state i at time t, γ_t(i), can be easily computed by
γ_t(i) = P(s_t = i | O, w; θ) = \frac{α_t(i)\, β_t(i)}{p(O|w; θ)}.    (3.25)
The state transition posterior from state i to j at time t, denoted as χ_t(i, j), can be expressed as
χ_t(i, j) = P(s_t = i, s_{t+1} = j | O, w; θ) = \frac{α_t(i)\, a_{ij}\, b_j(o_{t+1})\, β_{t+1}(j)}{p(O|w; θ)}.    (3.26)
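A log-domain sketch of the forward-backward computation is given below; for brevity it assumes a fully emitting HMM with a uniform initial state distribution rather than the non-emitting entry and exit states of Figure 3.2, so the boundary conditions differ from Equations (3.21), (3.23) and (3.24).

import numpy as np
from scipy.special import logsumexp

def forward_backward(log_a, log_b):
    """log_a: (S, S) log transition probabilities a_ij; log_b: (T, S) log emission
    probabilities log b_j(o_t). Returns log alpha, log beta, gamma and log p(O|w)."""
    T, S = log_b.shape
    log_alpha = np.full((T, S), -np.inf)
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_b[0] - np.log(S)            # uniform initial state distribution
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) a_ji b_i(o_t), Equation (3.19)
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_a, axis=0) + log_b[t]
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j beta_{t+1}(j) a_ij b_j(o_{t+1}), Equation (3.20)
        log_beta[t] = logsumexp(log_a + log_b[t + 1] + log_beta[t + 1], axis=1)
    log_like = logsumexp(log_alpha[-1])            # log p(O|w)
    log_gamma = log_alpha + log_beta - log_like    # Equation (3.25)
    return log_alpha, log_beta, np.exp(log_gamma), log_like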
the final decision rule (e.g. Minimum Bayes Risk (MBR)) (Bishop, 2006). Both of the training
schemes are described in the following sections.
where u is the index of utterances in the training set, which is omitted for brevity below. For
GMM-HMM systems, the Baum-Welch algorithm (Baum and Eagon, 1967), which is a special
case of the Expectation Maximisation (EM) algorithm (Dempster et al., 1977), is often used to
obtain the MLE parameters.
The auxiliary function for this EM algorithm is Q(θ; θk ), which defines the lower bound to
the log-likelihood function. By iteratively maximising the lower bound with respect to θ, a
new set of parameters that guarantees a non-decreasing likelihood will be obtained. To derive
the auxiliary function used for EM, Jensen’s inequality is first used for the log-likelihood
\log p(O|w; θ) = \log\left( \sum_s q(s)\, \frac{p(O, s|w; θ)}{q(s)} \right)    (3.30)
\geq \sum_s q(s) \log p(O, s|w; θ) - \sum_s q(s) \log q(s),    (3.31)
where q(s) could be any valid distribution w.r.t. the state sequence s. Please note that the
equality holds when the distribution is exactly the posterior distribution of state sequences
p(s|O, w; θ). During the k-th EM iteration where the current parameters are θk , the sequence
posterior probability distribution given by the current set of parameters p(s|O, w; θk ) is used
for the arbitrary distribution q(s). Consequently, Equation (3.31) becomes
\log p(O|w; θ) \geq \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ) - \sum_s p(s|O, w; θ_k) \log p(s|O, w; θ_k),    (3.32)
and the equality holds for the likelihood based on the current parameters
\log p(O|w; θ_k) = \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ_k) - \sum_s p(s|O, w; θ_k) \log p(s|O, w; θ_k).    (3.33)
where the auxiliary function Q(θ; θ_k) = \sum_s p(s|O, w; θ_k) \log p(O, s|w; θ). Maximising the auxiliary function Q(θ; θ_k) w.r.t. θ is guaranteed not to decrease the log-likelihood. During
the E-step in the EM algorithm, the posterior probabilities are computed using the current set
of parameters using the forward-backward algorithm. The M-step then maximises the auxiliary
function and updates the parameters.
For HMMs, the auxiliary function can be written as
Q(θ; θ_k) = \sum_{i,t} γ_t(i) \log b_i(o_t) + \sum_{i,j,t} χ_t(i,j) \log a_{ij},    (3.35)
where γt (i) is the state occupancy posterior probability as in Equation (3.25) and χt (i, j) is
the state pairwise posterior occupancy as in Equation (3.26). The optimal state transition
probability becomes
â_{ij} = \frac{\sum_{t=1}^{T} χ_t(i,j)}{\sum_{t=1}^{T} γ_t(i)}.    (3.36)
For GMMs as output distributions, the posterior probability that the observation o_t is generated by component m of state j can be expressed analogously using the forward and backward probabilities, which leads to the update formulae for the component weights, means and covariances.
The equations above are derived for a single utterance for simplicity. In practice, GMM-HMM
parameters are updated when the statistics have been accumulated over all utterances in the
training set.
If the state posterior distribution for each input frame is assumed to be concentrated solely
around its mode, i.e. the probability mass on one state is one and zero for all other states, then
the CML criterion can be written as frame-level Cross Entropy (CE) minimisation
L_{CE} = -\sum_u \log \prod_t P(s_t^{(u)} | O^{(u)}; θ)    (3.42)
= -\sum_u \sum_t \log P(s_t^{(u)} | O^{(u)}; θ),    (3.43)
where the state-level alignment s = [s_1, . . . , s_T] is obtained from an existing model, e.g. by running
the Viterbi algorithm with a GMM-HMM system. This alignment is also called a hard alignment
as each frame is assigned to an HMM state with absolute certainty. Therefore, Equation (3.43)
is equivalent to the cross-entropy between frame posterior probability given by the model and
the one-hot frame label. To obtain the frame posterior probabilities, a neural network frame
classifier can be trained by minimising the CE criterion over the entire dataset.
For the DNN-HMM hybrid approach, a DNN trained by gradient-based optimisation
methods performs a state classification task based on the hard alignment. It is worth noting
that unlike a GMM-HMM system where the Baum-Welch algorithm updates both GMMs and
HMMs jointly, the DNN in the hybrid system is normally trained separately by the frame-level
CE criterion since the HMM transition probabilities make negligible difference to the system
performance (Dahl et al., 2011).
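A minimal sketch of one frame-level CE update for such a DNN frame classifier is given below, assuming a generic PyTorch model and an existing forced alignment; batching over utterances is omitted.

import torch
import torch.nn.functional as F

def ce_step(dnn, feats, alignment, optimiser):
    """feats: (T, D) acoustic features; alignment: (T,) hard state labels from a
    GMM-HMM forced alignment, as in Equation (3.43)."""
    optimiser.zero_grad()
    logits = dnn(feats)                           # (T, n_states)
    loss = F.cross_entropy(logits, alignment)     # averaged over frames
    loss.backward()
    optimiser.step()
    return loss.item()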
For ASR, one possible objective would be to minimise the conditional entropy of the word
sequence w given the observation sequence O. For a single utterance,
H(w|O) = -\sum_w \int p(w, O) \log P(w|O)\, dO    (3.44)
= -\sum_w P(w) \log P(w) - \sum_w \int p(w, O) \log \frac{p(w, O)}{P(w)\, p(O)}\, dO    (3.45)
= H(P(w)) - D_{KL}(p(w, O) \| p(O) P(w)),    (3.46)
where the second term in Equation (3.46) is the mutual information between w and O. Assuming that the language model is fixed, it is clear that minimising the conditional entropy is equivalent to maximising the mutual information (MacKay, 2003). As directly computing p(w, O) is infeasible, approximating the expectation by a single representative pair of w and O leads to the following simplification
D_{KL}(p(w, O) \| p(O) P(w)) ≈ \log \frac{p(w, O)}{P(w)\, p(O)}    (3.47)
= \log \frac{p(O|w)}{\sum_{w'} p(O|w') P(w')}.    (3.48)
Therefore, with the fixed language model P (w), maximising mutual information is equivalent
to maximising
\log \frac{p(O|w)\, P(w)}{\sum_{w'} p(O|w') P(w')}.    (3.49)
By applying Bayes’ rule, Equation (3.49) is equivalent to the CML criterion. In other words, for MMI training (Bahl et al., 1986; Valtchev et al., 1997; Woodland and Povey, 2002), the posterior probability of the correct hypothesis for each utterance is maximised:
\sum_u \log \frac{p(O^{(u)}|w^{(u)}; θ)^κ\, P(w^{(u)})}{\sum_{h \in H(u)} p(O^{(u)}|h; θ)^κ\, P(h)},
where H(u) is a set of hypothesised sequences for a given utterance u and κ is the acoustic
scaling factor to match the range of acoustic and language model scores. κ is normally set to the
inverse of the language model scaling factor used for decoding and the denominator is estimated
through a lattice, both of which will be detailed in Section 3.4 later. To estimate model parame-
ters based on the MMI criterion, the extended Baum-Welch algorithm is used (Gopalakrishnan
et al., 1989; Normandin, 1991). More recently, Povey et al. (2016) proposed a practical method
called LF-MMI to perform MMI-based sequence training of acoustic models without explicitly
generating lattices. LF-MMI represents both the numerator and denominator lattices in the
form of Weighted Finite State Transducers (WFSTs) and the forward-backward computation is
parallelised on a Graphics Processing Unit (GPU). A 4-gram phone LM is used for the WFST composition instead of a word LM for efficiency. With a reduced frame rate, a modified HMM structure, and
various normalisation and regularisation tricks, LF-MMI allows the acoustic model to be trained
faster and often leads to better performance than standard models (Povey et al., 2016).
Although the MMI criterion integrates discriminative ability during training, it does not directly
optimise the final evaluation metric for speech recognition – the Word Error Rate (WER) (see
Section 3.4.4). Unlike MMI, the MBR criterion takes into account the risk associated with each hypothesis. In other words, MBR is a class of criteria that considers a general error metric R(h, w) between the hypothesis and the reference sequence, by minimising the expected risk
\sum_u \sum_{h \in H(u)} P(h|O^{(u)}; θ)\, R(h, w^{(u)}).
The design of the risk function is important for MBR training. At the utterance level, if
we define the risk to be 0 for the correct hypothesis and 1 otherwise, MBR becomes equivalent to MMI. The risk function R can also be defined at the word level to reflect WER
directly (Povey and Woodland, 2002). However, due to the same data sparsity issue as selecting
the basic modelling units in Section 3.1.2, phone and HMM state-based loss functions, such
as the Minimum Phone Error (MPE) criterion (Povey, 2003; Povey and Woodland, 2002), are
more widely used. Models trained using the MPE criterion are usually interpolated with the
MLE or MMI criterion to avoid over-fitting (Povey and Woodland, 2002).
One may argue that, with sufficient data and modelling capacity, a single model can learn or extract speaker-invariant features. However, this is rarely the case for practical deployment of speech recognition services, as unseen classes of speakers always exist, e.g. new accents. One way to ap-
proach speaker variability is to have speaker-dependent models. However, the amount of speech
data from a single speaker is normally very limited. As a viable solution, a speaker-adapted
system (Huang and Lee, 1993) is usually used where a speaker-independent acoustic model is
adapted on a small amount of speaker-specific data, either in a supervised or an unsupervised
fashion. Speaker Adaptive Training (SAT) (Anastasakos et al., 1996) is normally a two-pass
training procedure where the first pass is used to estimate the speaker-related parameters, and
the second pass uses these parameters to update the speaker-independent model. If the correct transcription is available for some utterances of the unseen speaker, this is called
supervised adaptation. Otherwise, it is unsupervised since only the recognition hypothesis is
available for adaptation.
For GMM-HMM systems, many techniques have been developed for speaker adaptation
including linear transform-based adaptation (Gales, 1998; Gales and Woodland, 1996; Leggetter
and Woodland, 1995), MAP adaptation (Gauvain and Lee, 1994), and speaker cluster-based
adaptation (Gales, 2000). Maximum Likelihood Linear Regression (MLLR) is a speaker
adaptation technique that uses linear transforms to adapt the means (Leggetter and Woodland,
1995) and variances (Gales, 1998; Gales and Woodland, 1996) of GMM components. A
commonly used version of MLLR, called Constrained Maximum Likelihood Linear Regression
(CMLLR), constrains the mean and variance transformation matrices of a Gaussian component
to be identical (Digalakis et al., 1995; Gales, 1998). When only a single transform is used per
speaker, CMLLR is equivalent to feature space linear transformation, so that speaker adaptation
can be applied without changing the model parameters (Gales, 1998). Therefore, this method
is also called Feature-Space Maximum Likelihood Linear Regression (FMLLR). The EM
algorithm can also be used to estimate CMLLR transforms that maximise the likelihood of
generating the speaker-specific data.
In addition to test-time adaptation, speaker adaptation can also be applied during training,
i.e. SAT (Anastasakos et al., 1996; Gales, 1998). The SAT approach first estimates the parameters
of the canonical speaker-independent model. Then the CMLLR transforms can be estimated
for every speaker in the training set. The new speaker-independent model can be trained using
these transformed features. During recognition, the speaker-independent model can produce
initial hypotheses, which are used for alignment and estimation of the CMLLR transforms for
the speakers in the test set. Then using these transforms, a second-pass recognition is performed
to yield the final results. However, if the adaptation data has associated transcriptions, the
CMLLR transforms can be estimated directly without relying on potentially erroneous first-pass recognition results.
For DNN-HMM systems, one similar alternative to CMLLR is to have a linear input
layer before passing the transformed feature to the speaker-independent network (Li and
Sim, 2010; Seide et al., 2011). During training, it can be set to an identity matrix or can
be trained with the rest of the network. During testing where there are unseen speakers,
the linear layer needs to be trained separately on either recognition results from an existing
system or some speaker enrolment transcription. Instead of having an extra linear layer, the
DNN can have speaker-specific parameterised activation functions (Siniscalchi et al., 2010;
Swietojanski and Renals, 2014; Zhang and Woodland, 2016), e.g. parameterised ReLU (see
Section 2.1.5). Activation function parameters are chosen because far fewer speaker-dependent parameters need to be learned than when adapting the model weights, which mitigates the data sparsity issue. This approach is called Learning Hidden Unit Contributions (LHUC) (Swietojanski
and Renals, 2014). Alternatively, applying regularisation when adapting weights of a DNN
acoustic model, e.g. minimising the Kullback–Leibler (KL)-divergence between the speaker-
independent output distribution and the speaker-adapted output distribution, has been shown to be effective (Yu et al., 2013). All the above techniques for DNNs can be used for test-time adaptation. However, ASR systems usually benefit the most if SAT is also used to minimise the mismatch between training and testing conditions. Also, some of these methods can be used together to yield the
best performance by exploiting complementarity between different methods, e.g. CMLLR and
LHUC for DNN-HMM systems (Swietojanski and Renals, 2014).
Apart from these two-pass speaker adaptation approaches, one-pass methods are also available. The DNN input features can be augmented with speaker-related information directly during training. If the extra feature is obtained in an unsupervised fashion (e.g. an i-vector based on
factor analysis on speech features (Dehak et al., 2010)), or from another network or system
(e.g. speaker code (Abdel-Hamid and Jiang, 2013)), then the first pass is no longer necessary.
Adversarial training is also an approach related to speaker adaptation that explicitly tries to
de-correlate the acoustic model from the speaker information (Meng et al., 2018). The single
neural network is trained on both acoustic targets and speaker targets simultaneously, but the
gradient from the speaker branch is reversed to remove the speaker information in the acoustic
model.
where log p(O|h; θ) is the log-likelihood of the observation O for the hypothesis h, or the
acoustic score given by the AM, and log P (h) is the prior probability of the hypothesis h,
or the language score given by the LM. The recognition procedure exploits the conditional
independence structure of HMM to efficiently search through a large number of possible paths
and find the corresponding word sequence with the highest overall score.
P(w) = P(w_1) \prod_{k=2}^{K} P(w_k | w_{k-1}, \ldots, w_1).    (3.55)
A good language model can generally help improve the performance of the ASR system
regardless of the model architecture and the criterion used for the acoustic model training. In
this section, two major types of LM are discussed. The n-gram language model (Jelinek, 1991;
Manning et al., 1999) is purely statistical and non-parametric, whereas the Neural Network
Language Model (NNLM) (Bengio et al., 2003; Mikolov et al., 2010) is a neural network
trained to predict the next word given the history word sequence. The SCM framework allows
the AM and LM to be decoupled. Since training either an n-gram LM or an NNLM is much less
computationally expensive than the AM for a certain amount of speech data, and furthermore,
text-only corpora in a similar domain are relatively cheap to collect, the LM can exploit large
amounts of data in addition to the corresponding speech audio transcriptions to alleviate the
data sparsity issue at the word level and avoid bias within the training data that impedes
generalisation. Note that for LM estimation, the start and end of sentences are important, which
are often treated as separate tokens.
As shown in Equation (3.55), the size of the distribution table grows exponentially w.r.t. the length of the sequence, and it becomes nearly impossible to obtain a good estimate even for short sequences due to the vocabulary size of real use cases. With a limited amount of
training data, some degree of conditional independence must be imposed for P (w) to be
computationally feasible. The simplest form of a language model is a uni-gram model, where
the probability of a word in a sequence is independent of its context and is estimated by its
frequency count in the training set. However, the order of words in a sequence becomes
irrelevant due to the strong independence assumption. By incorporating history information of
a word, an n-gram LM could be obtained by considering the conditional distribution of a word
given the previous n − 1 words, i.e.
P(w) = P(w_1) \prod_{k=2}^{K} P(w_k | w_{k-1}, \ldots, w_{k-n+1}).    (3.57)
Obtaining a robust and unbiased estimation of the conditional probabilities requires the
training corpus to be large enough to cover all possible n-grams. However, this is nearly
impossible for a large vocabulary system where a large majority of these n-grams are not
present in the training data due to the huge number of n-grams and the data distribution
itself (Chen and Goodman, 1999). Smoothing techniques are therefore required. To avoid the problem where n-grams unseen in the training data would have zero probability, a certain amount of the overall probability mass, controlled by a discounting factor, is allocated to these unseen
3.4 Language Models and Decoding for Source-Channel Models 55
cases, e.g. Katz smoothing (Katz, 1987), absolute discounting (Ney et al., 1994), Kneser-
Ney smoothing (Kneser and Ney, 1995). A back-off scheme is normally used for the above
smoothing techniques, which allows the unseen n-grams to have a certain amount of probability
mass according to lower-order n-gram distributions. Another widely adopted method is to
interpolate high-order LMs with lower-order LMs or to interpolate multiple LMs trained on
different sources of data. The weights for linear interpolation can be adjusted based on a
validation set.
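A small sketch of linear interpolation between two LMs' next-word distributions, together with a grid search over the interpolation weight on a validation set, is given below; the LMs are represented here simply as dictionaries of probabilities, and score_fn is a hypothetical perplexity-style scoring function supplied by the user.

def interpolate_lms(lm_probs_a, lm_probs_b, lam):
    """Linear interpolation of two LMs' next-word probabilities."""
    return {w: lam * lm_probs_a.get(w, 0.0) + (1 - lam) * lm_probs_b.get(w, 0.0)
            for w in set(lm_probs_a) | set(lm_probs_b)}

def tune_weight(dev_sentences, score_fn, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the interpolation weight that minimises a user-supplied score
    (e.g. total negative log-probability) on a validation set."""
    return min(grid, key=lambda lam: sum(score_fn(s, lam) for s in dev_sentences))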
Instead of using frequency counts statistics in the text corpus to estimate the LM, the neural
network can be trained for word prediction which can then be used as an LM (Bengio et al.,
2003). The advantage of the NNLM is that it does not give zero probability mass to any
predicted word in the vocabulary due to the Softmax output layer over the vocabulary and the
non-linear functions can extract word representations with semantic and contextual information,
i.e. word embeddings. Similar to standard n-gram LMs, Multilayer Perceptrons (MLPs) or
Convolutional Neural Networks (CNNs) can be used as LMs with n − 1 history words. A more
powerful and widely used class of NNLMs is Recurrent Neural Network (RNN)-based (Mikolov
et al., 2010). A recurrent model such as the Long Short-Term Memory (LSTM) can model long
dependencies in Equation (3.55) by representing word histories in hidden states with gating
functions. Self-attention based models such as Transformer LMs have become a major research
direction that to model the relationship between words across longer ranges due to the lack
of recurrence (Dai et al., 2019; Irie et al., 2019b), and large-scale self-attention based models
pre-trained on a very large amount of data have reached state-of-the-art performance on various
language-related tasks (Devlin et al., 2018; Radford and Narasimhan, 2018; Radford et al.,
2019). Other variants of NNLMs that exploit future information have also been explored (Chen
et al., 2017). As the normalisation term in the Softmax function over a very large vocabulary
size can be computationally problematic for NNLMs, noise contrastive estimation can be used
for training and inference of an NNLM without needing the Softmax normalisation term (Chen
et al., 2015a; Gutmann and Hyvärinen, 2010). Specifically for ASR, the information from
previous utterances can be utilised during LM training and decoding (Irie et al., 2019c; Sun
et al., 2021b). Similar to AM adaptation mentioned in Section 3.3.3, one advantage of NNLMs
is that they can adapt to input data in a flexible manner, including augmenting the input with a
topic vector (Chen et al., 2015b), adapting LM parameters (Gangireddy et al., 2016; Park et al.,
2010) and learning to bias towards recent histories (Li et al., 2018).
At the end of the utterance, the best path can be traced backwards through the recursion.
One practical issue for decoding an SCM-based system is the difference in dynamic range
between the acoustic scores and language scores. The acoustic scores have a much wider
range due to underestimation of likelihood originating from HMM assumptions (Woodland and
Povey, 2002). To offset this discrepancy, another decoding parameter ψ named the grammar
scaling factor is often used to scale up the language scores. Another practice is to set a word
insertion penalty ω (usually negative) to limit the degree of inserting words. Putting them
together, the search criterion used in practice is
ŵ = \arg\max_w \left\{ \log p(O|w; θ) + ψ \log P(w) + ω |w| \right\},
where |w| denotes the number of words in the hypothesis w.
[Figure 3.3a: an example word lattice, in which arcs carry word labels (e.g. <s>, HI, HOW, <hes>, </s>, !NULL), scores and timestamps.]
Lattices can be pruned using a fixed beam width to discard arcs whose likelihoods fall below a certain threshold relative to the best path, or by restricting the number of unique arcs within a time interval, as many arcs represent the same word with similar likelihoods but slightly different timestamps (Woodland et al., 1995). WFSTs can also be used equivalently to represent lattices (Povey et al., 2012).
Confusion networks are another dense representation for the most likely hypotheses in
lattices (Mangu et al., 2000). They are formed by grouping lattice arcs. The confusion network is also known as a “sausage” because of its constraint that all paths in a confusion network pass through all nodes, as shown in Figure 3.3b. The arcs in a confusion network correspond to words, some of which are null arcs that represent skips introduced to satisfy this constraint. The score associated with each arc is the log posterior probability of the corresponding word. In terms of the information stored, confusion networks discard a large number of low-likelihood arcs from the lattice. However, confusion networks also create some new hypothesis sequences which are not present in the lattice as a result of imposing the
constraint.
3.4.4 Evaluation
The most common evaluation metric for ASR is WER, which measures the percentage of errors
in the hypotheses when compared to the reference transcriptions. There are three types of
errors: a substitution error (S) where a word in the reference is misrecognised as another word;
an insertion error (I) where a word is inserted; and a deletion error (D) where a word in the
reference is missing in the hypothesis. To define these errors unambiguously for two sequences, the alignment between the hypothesis and the reference is chosen to be the one that minimises the WER, defined as
WER = \frac{S + D + I}{N} × 100%,    (3.65)
where N is the total number of words in the reference. For languages that have no word
boundaries in their written form, e.g. Chinese and Japanese, a similar metric evaluated at the
character level called the Character Error Rate (CER) is often used. Note that both WER
and CER can be above 100% because of unbounded insertion errors. At a coarser level, the
Sentence Error Rate (SentER) which is the percentage of utterances that contain at least one
error, is sometimes used.
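The WER computation of Equation (3.65) can be sketched with a standard Levenshtein alignment over words, as below; this minimal version returns only the overall error rate rather than the separate substitution, deletion and insertion counts.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment.
    reference and hypothesis are lists of words."""
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                      # deletions
    for j in range(H + 1):
        d[0][j] = j                      # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[R][H] / max(R, 1)

# e.g. wer("the cat sat".split(), "the cat sat down".split()) -> 33.3 (one insertion)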
= -\sum_u \log \sum_{s' \in B^{-1}(w^{(u)})} \prod_t P(s'_t | O^{(u)}),    (3.67)
Figure 3.4 A CTC model. The model can be either uni-directional or bi-directional. For each
input frame, the model predicts a probability distribution over all symbols, including ∅. et is
the hidden representation corresponding to the input at time t.
where s′ is a sequence of symbols chosen from all possible states (normally the subword
vocabulary) plus the blank symbol ∅, and B is a many-to-one mapping from the symbol
sequence to the word sequence. The mapping first collapses repeated symbols and then removes all blanks, e.g. both (a, ∅, a, a, ∅, b) and (a, a, ∅, a, b, b) are mapped to (a, a, b).
From Equation (3.67), the criterion maximises the conditional likelihood of all possible
paths/alignments for the ground truth word sequence w. The introduction of the blank symbol
∅ is critical for the operation of CTC as it allows flexible alignment. Equation (3.67) also implicitly assumes that the outputs at different time steps are conditionally independent given the internal state of the network. Under this assumption, the sum over all possible alignments can be computed efficiently by the forward-backward algorithm. Minimising Equation (3.67) over the training set is equivalent to maximising the conditional likelihood of the target labelling given the observations.
By comparing MLE of HMMs from Equation (3.13) and CML training of CTC from
Equation (3.67) in terms of state sequences,
HMM:  \sum_s \prod_{t=1}^{T} P(s_{t+1} | s_t)\, p(o_t | s_t),    (3.68)
CTC:  \sum_{s'} \prod_{t=1}^{T} P(s'_t | o_t),    (3.69)
where s is a state sequence for HMMs and s′ is a possible symbol sequence for CTC. It
can be shown that CTC is equivalent to a special instantiation of the 2-state HMM structure
when the state prior P (st ) and the state transition probabilities P (st+1 |st ) are ignored or
assumed equal (Hadian et al., 2018; Zeyer et al., 2017). As illustrated in Figure 3.5, the first
emission state of the HMM is the skippable blank state (∅) with a self-loop and the second one
corresponds to the subword unit. The blank state is shared across all HMMs. One caveat is
that if two consecutive symbols are the same, the blank state in the later HMM is not skippable
because two repetitive symbols must be separated by at least one blank symbol to prevent the
mapping function B from collapsing them into a single symbol.
Figure 3.5 The relationship between HMM and CTC. The white and grey circles are emitting and non-emitting HMM states, and ∅ stands for the CTC blank symbol.
In practice, forcing the transition probabilities P(s_{t+1}|s_t) to 1.0 makes little difference. At test time, the posterior probability of the CTC blank symbol can be penalised
by an extra empirical value (Sak et al., 2015a,b), which can be seen as a rough approximation
of the prior P (st ). As a result, it is reasonable to view CTC as an acoustic modelling method
in the SCM framework, which combines the HMM topology in Figure 3.5, the CML training
criterion and the forward-backward procedure.
To train the model with the CTC loss function, the probabilities of all possible alignments
for the reference sequence have to be added together. As illustrated in Figure 3.6, there can be
an exponential number of possible alignments. For efficient computation, the forward-backward
algorithm is used analogously to Section 3.2 to derive the gradient w.r.t. the model output
distribution at each time step. Note that only the forward or the backward pass is sufficient if
only the total loss is needed.
There are two general decoding strategies for CTC models (Graves et al., 2006). Best
path decoding is the simpler one where the most probable path is considered to yield the most
likely transcription. After removing repeated symbols and blanks, the transcription is obtained.
However, because multiple symbol sequences can correspond to the same transcription, best
path decoding can be sub-optimal. A better yet more complex strategy, called prefix search
decoding, efficiently computes the probability of each partial hypothesis during beam search
by summing over all possible alignments of the prefix using a modified forward-backward
algorithm. In practice, prefix search decoding yields marginally better results as the output
distributions from CTC models are generally very peaky, i.e. the posterior probability of a
single alignment dominates (Graves et al., 2006). However, prefix search is more useful when
decoding with a separate LM (Maas et al., 2014). CTC has been shown to perform well on ASR
tasks when a large amount of training data is available (Amodei et al., 2016; Miao et al., 2016a).
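Best path decoding, together with the collapsing mapping B, can be sketched as follows, assuming per-frame log posteriors from a trained CTC model with the blank symbol at index 0.

import numpy as np

def ctc_best_path(log_probs, blank=0):
    """Best path decoding: take the most probable symbol at each frame,
    collapse repeats, then remove blanks (the mapping B).

    log_probs: (T, V) per-frame log posteriors over symbols."""
    best = np.argmax(log_probs, axis=1)
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:     # collapse repeats, drop blanks
            out.append(int(s))
        prev = s
    return out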
[Figure 3.6: an illustration of the possible CTC alignments for a short label sequence (the symbols C, A, T interleaved with the blank ∅) over six input frames.]
t
<latexit sha1_base64="4mSRiAOC1HPbUsbyd7QN48TyFAA=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkqMeiF48t2FpoQ9lsN+3azSbsToQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IJHCoOt+O4W19Y3NreJ2aWd3b/+gfHjUNnGqGW+xWMa6E1DDpVC8hQIl7ySa0yiQ/CEY3878hyeujYjVPU4S7kd0qEQoGEUrNbFfrrhVdw6ySrycVCBHo1/+6g1ilkZcIZPUmK7nJuhnVKNgkk9LvdTwhLIxHfKupYpG3PjZ/NApObPKgISxtqWQzNXfExmNjJlEge2MKI7MsjcT//O6KYbXfiZUkiJXbLEoTCXBmMy+JgOhOUM5sYQyLeythI2opgxtNiUbgrf88ippX1S9y2qtWavUb/I4inACp3AOHlxBHe6gAS1gwOEZXuHNeXRenHfnY9FacPKZY/gD5/MH4xeNAQ==</latexit>
Figure 3.6 An example of a CTC alignment lattice. The acoustic sequence has 6 input frames
and the symbol sequence is graphemes of the word “CAT”. A darker path refers to an alignment
with higher probability. Nodes with two concentric circles denote the start and end nodes.
Some variants of CTC have been proposed recently to improve its performance (Higuchi et al.,
2020; Lee and Watanabe, 2021).
Figure 3.7 A neural transducer model that consists of an encoder, a prediction network and a
joint network. w0 is normally the start of sentence symbol <s> and wk is the kth modelling
unit in the output sequence. gk is the hidden representation for the history sequence w0:k .
The joint network combines the acoustic representation e_t from the encoder with the history representation g_k from the prediction network, and uses the combined representation to predict the probability distribution over all output symbols P(s′_{t,k} | e_t, g_k), including the null symbol ∅.
Popular encoder architectures for neural transducers include LSTMs and Transformers. The
prediction network is usually a uni-directional LSTM and the joint network can be as simple as
a single fully connected projection layer. By computing the output probabilities for all pairs of
t and k, an output probability lattice can be constructed as in Figure 3.8.
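The following is a minimal PyTorch-style sketch of such a joint network, assuming the encoder produces e with shape (B, T, enc_dim) and the prediction network produces g with shape (B, K+1, pred_dim); the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Combines encoder output e_t and prediction network output g_k and
    produces P(s'_{t,k} | e_t, g_k) over the vocabulary plus the blank symbol."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size_with_blank):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, joint_dim)
        self.proj_pred = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size_with_blank)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim), pred: (B, K+1, pred_dim)
        e = self.proj_enc(enc).unsqueeze(2)    # (B, T, 1, joint_dim)
        g = self.proj_pred(pred).unsqueeze(1)  # (B, 1, K+1, joint_dim)
        joint = torch.tanh(e + g)              # broadcast over the (t, k) lattice
        return self.out(joint).log_softmax(dim=-1)  # (B, T, K+1, V+1)
```

The returned tensor corresponds to the output probability lattice of Figure 3.8: element (t, k) holds the distribution P(s′_{t,k} | e_t, g_k).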
Figure 3.8 An example of a neural transducer alignment lattice. The node at (t, k) represents the
probability of having output the first k non-blank symbols by time step t in the input acoustic
feature sequence. The horizontal transition leaving node (t, k) represents the probability of
∅, whereas the vertical transition represents the probability of outputting symbol wk+1 . The
acoustic sequence has 6 input frames and the symbol sequence is graphemes of the word “CAT”.
A darker path refers to an alignment with higher probability. Nodes with two concentric circles
denote the start and end nodes.
Note that the meaning of ∅ is slightly different from CTC. For CTC models, ∅ sometimes
acts as a separator between two repetitive symbols. For neural transducers, ∅ means outputting
nothing for the current frame. In other words, for neural transducers, ∅ is not a real output
symbol, but a symbolic instruction for the model to move forward to the next frame. Compared
with the CTC alignment lattice in Figure 3.6, the total length of a transducer alignment is
T + K instead of T for CTC, because when the transducer outputs real symbols, no acoustic
frame is consumed and the alignment path travels upwards. Acoustic frames are only consumed
when the special symbol ∅ occurs and the alignment path advances horizontally in time. The
training loss is similar to CTC as in Equation (3.67),
\[
\mathcal{L} = -\sum_{u} \log \sum_{s \in \mathcal{B}^{-1}(w^{(u)})} P(s \mid O^{(u)}),
\]
where B^{-1}(w) is the set of all possible alignments as shown in Figure 3.8. Again, the summation
can be computed efficiently using the forward-backward algorithm. For decoding, although
prefix search is a theoretically better procedure, the standard beam search can often reach
similar results with much less computation (Graves et al., 2013). Neural transducers, with
various modelling improvements, have been widely adopted to process streaming audio for
commercial applications (Li et al., 2021a, 2019b; Rao et al., 2017; Saon et al., 2021).
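As an illustration of how ∅ drives the search, below is a sketch of greedy (beam width one) transducer decoding. The encoder, prediction_net and joint callables and their interfaces (e.g. the prediction network returning an output together with a recurrent state) are assumptions made for illustration and are not taken from any specific toolkit.

```python
import torch

@torch.no_grad()
def greedy_transducer_decode(encoder, prediction_net, joint, feats,
                             blank, sos, max_symbols_per_frame=10):
    """Emit the most likely symbol at each lattice node: blank advances to the
    next acoustic frame, a non-blank symbol updates the prediction network."""
    enc = encoder(feats.unsqueeze(0))                        # (1, T, enc_dim)
    hyp = [sos]
    g, state = prediction_net(torch.tensor([[sos]]), None)   # (1, 1, pred_dim)
    for t in range(enc.size(1)):
        emitted = 0
        while emitted < max_symbols_per_frame:
            logits = joint(enc[:, t:t + 1], g)               # (1, 1, 1, V+1)
            k = int(logits.argmax(dim=-1))
            if k == blank:                                   # consume the frame, move right
                break
            hyp.append(k)                                    # output a symbol, move up
            g, state = prediction_net(torch.tensor([[k]]), state)
            emitted += 1
    return hyp[1:]
```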
Numerous improvements to AED models for ASR have been proposed, including external language
model integration (Shan et al., 2019; Toshniwal et al., 2018), loss function modification (Cui et al., 2018;
Prabhavalkar et al., 2018; Sabour et al., 2019), and data augmentation (Hayashi et al., 2018; Park
et al., 2019). Some experiments have shown that with significantly more data and computing
resources (Chiu et al., 2018; Lüscher et al., 2019; Park et al., 2019), a single attention model is
capable of learning the structured knowledge that is explicitly built into traditional HMM-based systems,
e.g. lexicons and decision trees. AED models also offer the advantage of joint optimisation, which
allows numerous downstream tasks to be naturally connected as a single trainable model, e.g.
speech translation (Bérard et al., 2018; Weiss et al., 2017) and the speech chain, which connects an ASR
model with a text-to-speech model so that each improves the other (Tjandra et al.,
2017a, 2018a). The conditional independence assumption made in the SCM approach is also
eliminated.
This section builds on the architecture of AED models described in Section 2.2 and dis-
cusses modelling and training techniques that are specific to ASR. Compared to the SCM in
Equation (3.3), attention-based models provide an alternative view of addressing the speech
recognition problem. Instead of decomposing the overall model into an acoustic model and a
language model as in Section 3.1, attention-based models compute the posterior distribution
P(w|O) directly by following the chain rule of conditional probability,
\[
P(w|O) = \prod_{k=1}^{K} P(w_k \mid w_{0:k-1}, O),
\]
where w = w_1, \ldots, w_K is the word sequence and w_0 is the start-of-sentence symbol. Compared to Equation (3.3), AED models hold some theoretical
advantages over HMM-based systems. AED models do not rely on the first-order Markovian
assumption, and the acoustic information and the language information are jointly learned
using a single model without making any independence assumptions.
3.6.1 Front-End
Normally the output transcription sequence is much shorter than the input acoustic sequence
using features extracted at a frequency of 100 Hz (see Section 3.1.1). As a result, modelling
very long acoustic sequences without subsampling can be computationally expensive for RNNs
and Transformers and may also cause difficulties for optimisation such as vanishing gradients.
Instead of concatenating multiple adjacent frames as a single input frame, the following two
model-based approaches are commonly used.
For RNNs, a hierarchical structure can be used to skip or combine states from a previous
RNN layer to the next. In Figure 3.9, two types of hierarchical RNN (Chan et al., 2016; Kim
et al., 2017) are illustrated that can achieve a frame rate reduction of 4 times.
Figure 3.9 Unfolded representations of two-layer hierarchical RNNs that reduce the input
sequence lengths by a factor of four. Circles represent the RNN recurrent units.
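A minimal PyTorch sketch of the concatenation-based variant is shown below; stacking two such layers reduces the frame rate by a factor of four. The layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PyramidalLSTMLayer(nn.Module):
    """Concatenates every pair of adjacent frames before the LSTM, halving the
    sequence length (two stacked layers give a 4x frame-rate reduction)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(2 * input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, time, input_dim)
        B, T, D = x.shape
        if T % 2:                            # drop the last frame if the length is odd
            x = x[:, :T - 1]
        x = x.reshape(B, T // 2, 2 * D)      # concatenate adjacent frames
        out, _ = self.lstm(x)                # (batch, time // 2, 2 * hidden_dim)
        return out
```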
Byte-pair encoding iteratively merges the most frequent pairs of symbols in the training text so that the number
of symbols for encoding the text is minimised. Similarly, a unigram LM can be used as an
entropy encoder to iteratively increase the number of word-pieces until the desired vocabulary
size is reached, by maximising the LM probability on the training text corpus (Kudo, 2018; Whittaker and
Woodland, 2000). Both methods preserve the basic set of graphemes to allow open vocabulary
recognition without the Out-of-Vocabulary (OOV) problem. Note that the tokenisation results
are not unique even for the same word sequence as shown in Table 3.2. One advantage of using
a probabilistic LM is that multiple tokenisation results can be generated and ranked by their
probabilities. This leads to a regularisation technique that randomly samples the target subword
sequence during the training of an AED model (Kudo, 2018).
grapheme _ h e l l o _ w o r l d
word-piece _he llo _world
_h e ll o _w or l d
Table 3.2 Word-piece modelling units for “hello world”. The symbol ‘_’ denotes the word
boundary. There can be many different word-piece sequences for the same word sequence.
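The sampling-based regularisation can be illustrated with a toy unigram word-piece LM; the vocabulary and log-probabilities below are made up purely for illustration.

```python
import math
import random

# Toy unigram word-piece LM (log-probabilities); '_' marks a word boundary.
LOGP = {'_he': -2.0, 'llo': -2.3, '_world': -2.5, '_h': -3.0, 'e': -1.5,
        'll': -2.8, 'o': -1.4, '_w': -3.1, 'or': -2.6, 'l': -1.6, 'd': -1.7}

def segmentations(text):
    """Enumerate all ways of covering `text` with in-vocabulary word-pieces."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        piece = text[:i]
        if piece in LOGP:
            for rest in segmentations(text[i:]):
                results.append([piece] + rest)
    return results

def sample_tokenisation(text):
    """Sample a tokenisation with probability proportional to its unigram LM score."""
    segs = segmentations(text)
    weights = [math.exp(sum(LOGP[p] for p in seg)) for seg in segs]
    return random.choices(segs, weights=weights, k=1)[0]

print(sample_tokenisation('_hello_world'))  # e.g. ['_he', 'llo', '_world']
```

In practice, toolkits that implement unigram word-piece models provide such sampling directly, so that a different tokenisation of the same reference can be drawn at each training epoch.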
Figure 3.10 The convolution-augmented Transformer encoder architecture (Gulati et al., 2020). The block combines feed-forward (MLP) modules with half-scaled (×1/2) residual connections and a convolution module built from pointwise convolutions, a 1D depth-wise convolution, a Swish activation, batch norm and layer norm, all connected with residual additions.
For the 1D depth-wise convolution, each channel has its own kernel of size (1, kernel_size), and the depth-wise convolution produces an output of size (batch, in_channels, time), given that the two ends of the time dimension are padded. As introduced in Section 2.1.5, the Swish activation is x · sigmoid(x), where x is a scalar input value (Ramachandran et al., 2018); it is effectively a function gated by itself.
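A PyTorch sketch of the convolution module described above is given below; the GLU gate after the first pointwise convolution follows the published Conformer design (Gulati et al., 2020), and the kernel size is illustrative.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Convolution module of a Conformer block: layer norm, pointwise conv with
    a GLU gate, 1D depth-wise conv, batch norm, Swish, pointwise conv, residual."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.layer_norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def swish(self, x):
        return x * torch.sigmoid(x)              # Swish: x * sigmoid(x)

    def forward(self, x):                        # x: (batch, time, dim)
        y = self.layer_norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.batch_norm(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                             # residual connection
```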
The Conformer encoder can be viewed as another building block for various speech-related
tasks. For example, it can be used in SCM-based systems, including the AM in DNN-HMM
systems, CTC and neural transducers described in Section 3.1.
AED models are normally trained with the Cross-Entropy (CE) loss, conditioning the decoder on the ground-truth history,
\[
\mathcal{L}_{\text{CE}} = \sum_{u} \sum_{k=1}^{K} -\log P\big(w_k^{(u)} \mid w_{k-1}^{(u)}, \ldots, w_0^{(u)}, O^{(u)}\big), \qquad (3.74)
\]
where w_1, \ldots, w_K are the ground-truth subword sequence and w_0 is the start-of-sentence
symbol <s>. However, for recognition, the ground truth history subwords are not available and
the decoded subwords are used instead. Therefore, the following improved training techniques,
scheduled sampling and Minimum Word Error Rate (MWER) training, have been developed
to address either the exposure bias between training and testing (Ranzato et al., 2016) or the
criteria mismatch issue between conditional maximum likelihood and WER (Graves and Jaitly,
2014; Prabhavalkar et al., 2018; Shannon, 2017).
However, the summation over all possible output sequences h is intractable. To approximate
such a summation, instead of using word lattices for MMI and MBR criteria, the n-best
hypotheses from the beam search could be used. Assuming the probability mass over all
possible sequences is concentrated in the top N hypotheses, the loss function becomes
\[
\mathcal{L}_{\text{MWER}} \approx \sum_{u} \sum_{h \in \text{BEAM}(O^{(u)},\, N)} \hat{P}\big(h \mid O^{(u)}\big)\, \mathcal{W}\big(h, w^{(u)}\big),
\]
where N is the beam width of the beam search algorithm BEAM(·, ·) (see Section 3.8.1), \hat{P}(h|O^{(u)}) is the posterior renormalised over the N-best hypotheses, and \mathcal{W}(h, w^{(u)}) is the number of word errors of h with respect to the reference w^{(u)}. In
practice, the sequence-level loss is interpolated with the CE loss for stable training, which is
similar to the F -smoothing (Su et al., 2013) approach used for discriminative sequence training
of SCM models.
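A sketch of the resulting N-best training loss is shown below, assuming the per-hypothesis log-probabilities and word-error counts have already been computed; the mean-error baseline subtraction is a common variance-reduction choice in the literature rather than something specified here, and the interpolation weight is illustrative.

```python
import torch

def mwer_loss(nbest_logprobs, nbest_word_errors, ce_loss, ce_weight=0.1):
    """N-best approximation of the MWER loss, interpolated with the CE loss.

    nbest_logprobs:    (N,) sequence log-probabilities log P(h|O) from beam search.
    nbest_word_errors: (N,) word errors of each hypothesis vs. the reference.
    ce_loss:           the standard cross-entropy loss of Equation (3.74).
    """
    posterior = torch.softmax(nbest_logprobs, dim=-1)   # renormalise over the N-best list
    errors = nbest_word_errors.float()
    errors = errors - errors.mean()                     # subtract the mean as a baseline
    expected_errors = torch.sum(posterior * errors)     # expected (relative) word errors
    return expected_errors + ce_weight * ce_loss
```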
To optimise the model directly towards the final non-differentiable objective function, WER,
reinforcement learning related loss functions (Cui et al., 2018; Sabour et al., 2019) or policy
gradient-based training (Karita et al., 2018; Tjandra et al., 2018b; Zhou et al., 2018b) have also
been proposed.
For attention-based encoder-decoder models, a naive search algorithm without pruning would expand the search
tree exponentially as the sequence grows longer. Beam search, however, preserves at most
N paths within the search graph at each stage before the next search step. A very
small beam size results in a significant number of search errors, while a large beam size
increases the decoding time linearly. In practice, for each path in the
search tree, the hidden state of the decoder is cached so that the history does not need to be recomputed when
the tree expands. A search path terminates when the end-of-sentence symbol is decoded. To
prevent excessively long hypotheses, an upper limit on the number of decoder steps is normally
set. A conservative setting for ASR is to use the number of acoustic feature frames after
subsampling, i.e. the encoder output sequence length, since the acoustic features are designed
to capture fine-grained phonetic units and the number of frames is usually much larger than
the length of the written sequence.
Attention-based models may be prone to deletion and insertion errors. Because the attention
alignment is not explicitly restricted to be contiguous or to cover the entire input sequence,
the decoder may generate the end-of-sentence symbol prematurely without reaching the end of
the utterance, which leads to deletion errors. Also, because no monotonicity is strictly enforced,
the attention mechanism may be trapped in a certain section of the input sequence and generate
the same symbol repetitively, which leads to insertion errors. To address these issues, two
extra terms are included to rank the hypotheses
\[
\hat{h} = \underset{h \in \mathcal{H}}{\arg\max} \left\{ \log P(h|O) + \omega |h| + \eta \sum_{t=1}^{T} \mathbb{1}\!\left[ \sum_{k=1}^{K} a_{kt} > \tau \right] \right\}, \qquad (3.78)
\]
where ω is the insertion penalty (non-positive) that penalises long sequences (Bahdanau et al., 2016;
Chorowski et al., 2015), η is the coverage coefficient and τ is the coverage threshold (Chorowski
and Jaitly, 2017). The third term in Equation (3.78) is called the coverage term (Chorowski and
Jaitly, 2017); it counts the number of frames that have received cumulative attention greater than τ.
The coverage term mitigates the repetition issue because looping over the same few frames does not
increase the coverage. All three decoding parameters (ω, η and τ) need to be tuned on a separate
validation set for the best result.
During decoding, an external LM score can be log-linearly combined with the AED model score, i.e. hypotheses are ranked by \( \log P(h|O) + \psi \log P_{\text{LM}}(h) \), where ψ is the language model weight. As mentioned in Section 3.4.1, LMs can be trained on a
large amount of external text data.
If a separately trained NNLM shares the same vocabulary as the AED model, LM scores
can be directly interpolated at each step of the beam search decoding, which is often referred to
as shallow fusion (Gülçehre et al., 2015). Shallow fusion has been widely adopted because of
its simplicity and effectiveness. Other structured approaches such as deep fusion (Gülçehre
et al., 2015), cold fusion (Sriram et al., 2018) and component fusion (Shan et al., 2019) jointly
train the AED model with an LM in order to find the optimal combination of the two. However,
the additional complexity of these approaches often outweighs their benefits (Toshniwal et al.,
2018).
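One beam search expansion step with shallow fusion can be sketched as follows, assuming the AED decoder and the external LM share the same subword vocabulary; the names and the weight value are illustrative.

```python
import numpy as np

def shallow_fusion_step(asr_logp, lm_logp, psi=0.3, beam=8):
    """One beam-search expansion step with shallow fusion.

    asr_logp: (V,) log P(w_k | w_<k, O) from the AED decoder.
    lm_logp:  (V,) log P(w_k | w_<k) from the external LM (same vocabulary).
    Returns the indices and fused scores of the top `beam` candidate symbols.
    """
    fused = asr_logp + psi * lm_logp
    top = np.argsort(fused)[::-1][:beam]
    return top, fused[top]
```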
Since each output of an AED model depends on the full history of previous outputs, the ASR
model has implicitly learned an internal LM. Although the internal LM cannot be computed
exactly, several approaches have been proposed to estimate it (McDermott et al., 2019; Meng
et al., 2021; Variani et al., 2020b). Then, the external LM can be integrated after the internal LM
is subtracted. This can be particularly useful when there is a mismatch between training and test
domains as the bias of the training text data can be removed (Zeineldeen et al., 2021). As AED
models normally use subwords as output units, it is also possible to incorporate word-level
LMs during the decoding procedure (Hori et al., 2018, 2017b).
Self-Supervised Learning (SSL) with a contrastive objective enables the model to learn representations that encode the underlying shared information
between different parts of the signal (van den Oord et al., 2018). Features learned by SSL
should ideally discard low-level information such as local noise in the speech signal but preserve
the high-level structure that could span many time steps such as phonetic information and
speaker characteristics. Compared to reconstructive loss that aims to model the complex local
features of the signal, contrastive loss encourages the model to learn a contextual representation
that can predict observation signals in the future (van den Oord et al., 2018). The following
approach, Contrastive Predictive Coding (CPC), combines predicting future observations with
a probabilistic contrastive loss.
As shown in Figure 3.11, if the input waveform is split into small segments x and the encoder
transforms the waveform into a feature sequence z, the current contextual representation vt is
the output from a sequence model that summarises the history feature sequence z1:t .
Figure 3.11 CPC framework (van den Oord et al., 2018). The feature encoder normally uses
CNNs with temporal pooling and the context network is a sequence model such as an LSTM
that summarises the history representations.
Given a set of N random samples X = {x1 , . . . , xN }, containing one positive sample from
p(xt+k |vt ) and N − 1 negative samples from the proposal distribution p(xt+k ), the contrastive
loss for predicting the k-th step into the future is
\[
\mathcal{L}^{t,k}_{\text{CPC}} = -\,\mathbb{E}_{X} \left[ \log \frac{ f_k(x_{t+k}, v_t) }{ \sum_{x_j \in X} f_k(x_j, v_t) } \right], \qquad (3.80)
\]
where f_k is an (unnormalised) density ratio, which can be modelled with a simple log-bilinear form,
\[
f_k(x_{t+k}, v_t) = \exp\!\big( z_{t+k}^{\top} W_k v_t \big). \qquad (3.82)
\]
Note that Wk is a linear transform and is different for every offset k. It can be shown that by
minimising LCPC , the mutual information between vt and xt+k is maximised (van den Oord
et al., 2018)
\[
I(x_{t+k}, v_t) = \sum_{x_{t+k}, v_t} p(x_{t+k}, v_t) \log \frac{p(x_{t+k} \mid v_t)}{p(x_{t+k})} \qquad (3.83)
\]
\[
\geq \log(N) - \mathcal{L}^{t,k}_{\text{CPC}}. \qquad (3.84)
\]
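A sketch of the loss in Equation (3.80) for a single offset k is shown below, assuming the positive sample z_{t+k} is stored in the first row of z_samples followed by the N−1 negatives; the shapes and names are illustrative.

```python
import torch

def cpc_loss_step(v_t, z_samples, W_k):
    """InfoNCE-style contrastive loss for one prediction offset k.

    v_t:       (D_c,) contextual representation at time t.
    z_samples: (N, D_z) candidate future features; row 0 is the positive z_{t+k},
               the remaining rows are negatives from the proposal distribution.
    W_k:       (D_z, D_c) linear transform, different for every offset k.
    """
    scores = z_samples @ (W_k @ v_t)            # exponent of f_k(x_j, v_t)
    log_softmax = scores - torch.logsumexp(scores, dim=0)
    return -log_softmax[0]                      # -log of the positive sample's share
```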
After demonstrating that features extracted using CPC can achieve much better results than
handcrafted features such as MFCCs on phone classification and speaker classification (van den
Oord et al., 2018), wav2vec (Schneider et al., 2019) showed that pre-trained features can also
help improve ASR performance. As shown in Figure 3.12, by using quantised prediction
targets (Baevski et al., 2020a) and Transformer encoder blocks as the context network, wav2vec
2.0 (w2v2) (Baevski et al., 2020b) has shown promising performance on ASR, especially when
a large amount of unlabelled acoustic data is used for pre-training and only a small amount of
transcribed speech data is available.
Although CPC in Figure 3.11 and w2v2 in Figure 3.12 follow a similar principle, they
have two differences. First, w2v2 has a quantisation module that quantises the output of the
feature encoder z to a finite set of speech representations via product quantisation (Jégou et al.,
2011). The Gumbel Softmax (Jang et al., 2017) allows the quantisation procedure to be fully
differentiable. To encourage the use of all of the quantised representations in the codebook, a
diversity loss Ldiversity is interpolated with the contrastive loss during training (Baevski et al.,
2020b). Secondly, instead of using a uni-directional sequence model for the context network
and generating a contextual representation to predict input in the future, w2v2 adopts the
Transformer encoder architecture (see Section 2.2.2) with masked input. For each masked
input feature xt , the context network uses the information available in the past and the future
to generate the contextual representation vt , which is similar to a “fill in the blank” question.
The positive sample is the current quantised feature qt and the negative samples are uniformly
sampled from quantised features that correspond to other masked time steps.

Figure 3.12 The w2v2 framework: the feature encoder maps the input waveform X to latent features Z, which are masked and quantised into Q; the context network produces contextual representations V; and the contrastive loss separates the positive quantised sample from the negative samples.

The density ratio in w2v2 is computed from the cosine similarity between the contextual representation v_t and the quantised feature q_{t'}, scaled by a temperature ζ,
\[
f(x_{t'}, v_t) = \exp\!\left( \frac{1}{\zeta} \, \frac{ q_{t'}^{\top} v_t }{ \lVert q_{t'} \rVert\, \lVert v_t \rVert } \right). \qquad (3.85)
\]
For fine-tuning, task-specific layers are added on top of the w2v2 model and are updated
together with the context network. An alternative implementation (Zhang et al., 2020) shows
that the input to w2v2 can also be handcrafted acoustic features and the quantisation module
is optional. After fine-tuning the pre-trained model with a very limited amount of labelled
data, the resulting system can still achieve strong recognition performance (Baevski et al., 2020b). This indicates that SSL can
effectively leverage unlabelled data to learn useful representations.
3.10 Summary
In this chapter, two major frameworks of speech recognition were covered, i.e. the source-
channel model-based system and the attention-based encoder-decoder model-based system.
Although both systems share some components, including feature extraction, language mod-
elling, and evaluation procedure, they are very different in terms of the modelling principle,
modelling units, and training and decoding procedures. Each ingredient of both systems was
briefly described in this chapter. Finally, some self-supervised training approaches that leverage a large amount of unlabelled data to improve ASR model performance via unsupervised
pre-training were introduced.
Chapter 4
4.1 Background
This section first covers various approaches to combining multiple ASR systems. Then related
work that combines an SCM-based system and an AED model is described. Differences between
the related work and the proposed method are highlighted.
Mathematically, with a single LM, system combination for M Acoustic Models (AMs) estimates
\[
P(h|O; \theta_1, \ldots, \theta_M) = \frac{ p(O|h; \theta_1, \ldots, \theta_M)\, P(h; \theta_{\text{LM}}) }{ \sum_{\tilde{h}} p(O|\tilde{h}; \theta_1, \ldots, \theta_M)\, P(\tilde{h}; \theta_{\text{LM}}) }, \qquad (4.1)
\]
where the denominator sums over the set of all possible hypothesis sequences h̃. In practice, this set can be approximated by the union of
the top hypotheses from all candidate systems. It is possible to combine likelihoods from different
acoustic models or to directly combine hypothesis posteriors.
Likelihood Combination There are two general approaches to combine likelihoods (Beyerlein, 1997). The most straightforward one is a linear combination,
\[
p(O|h; \theta_1, \ldots, \theta_M) \approx \sum_{m=1}^{M} z_m\, p(O|h; \theta_m), \qquad (4.2)
\]
where z_m is the prior probability of model θ_m. For the combined likelihood to be a valid
distribution, \(\sum_{m=1}^{M} z_m = 1\) and z_m are non-negative for all m. The alternative is a log-linear
combination of all likelihoods,
\[
p(O|h; \theta_1, \ldots, \theta_M) \approx \frac{1}{Z} \exp\left( \sum_{m=1}^{M} z_m \log p(O|h; \theta_m) \right) = \frac{1}{Z} \prod_{m=1}^{M} p(O|h; \theta_m)^{z_m}, \qquad (4.3)
\]
First, the top hypotheses from the candidate systems are aligned into Word Transition Networks (WTNs); a confidence score of the recognised word, which can be derived from the word posterior probability or other sources
of information during decoding, can optionally be incorporated into the WTNs (Fiscus, 1997).
Second, a word needs to be chosen from each set of competing words based on the maximum
posterior criterion
\[
P(h_i|O; \theta_1, \ldots, \theta_M) \approx (1 - \lambda)\, \text{Conf}\big(h_i|O; \theta_1, \ldots, \theta_M\big) + \lambda\, D(h_i, O), \qquad (4.4)
\]
where Conf(hi |O; θ1 , . . . , θM ) is the average or maximum confidence score of the word h in
the i-th set among all the top hypotheses from M systems; D(hi , O) is the frequency of the
word h in the i-th set; and λ is the interpolation coefficient between the confidence score and
the word frequency. Without confidence scores, or when λ is set to 1, Equation (4.4) is equivalent
to voting by frequency, where ties within a set of competing words are decided randomly.
For CNC, the process is very similar, except that multiple words with their posteriors
from each system are used in the form of confusion networks. Although aligning multiple
confusion networks is more complex, CNC generally works slightly better than ROVER.
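The voting rule of Equation (4.4) for one set of competing words can be sketched as follows; using the maximum confidence per word is one of the two options mentioned above, and the interpolation weight is illustrative.

```python
from collections import defaultdict

def rover_vote(candidates, lam=0.5):
    """Pick the word from one set of competing words using Equation (4.4).

    candidates: list of (word, confidence) pairs, one entry per system
                (the word may be empty for a NULL/deletion arc).
    lam:        interpolation weight between confidence and frequency.
    """
    m = len(candidates)
    freq = defaultdict(float)
    conf = defaultdict(float)
    for word, c in candidates:
        freq[word] += 1.0 / m                   # D(h_i, O): relative frequency
        conf[word] = max(conf[word], c)         # maximum (or average) confidence
    score = {w: (1 - lam) * conf[w] + lam * freq[w] for w in freq}
    return max(score, key=score.get)
```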
Unlike explicit combination where the likelihoods or hypotheses from all candidate systems
are used for decoding, a combination scheme that propagates information from one system to
another is regarded as an implicit combination approach (Gales and Young, 2008; Peskin et al.,
1999).
One commonly used approach is N -best (Schwartz and Austin, 1991) or lattice rescor-
ing (Aubert and Ney, 1995; Richardson et al., 1995; Woodland et al., 1995), where the AM
and/or the LM scores for each word decoded by one system are updated based on the scores
from another system. Rescoring can restrict the search space and avoid possible errors made by
one system. However, the final hypothesis can only come from the lattice or the N -best list
generated by a single system, i.e. hypotheses made solely by the other system are not included.
Another approach for implicit combination is cross-adaptation (Gales et al., 2006; Stüker
et al., 2006; Woodland et al., 1995). In Section 3.3.3, in order to estimate Maximum Likeli-
hood Linear Regression (MLLR)-based transforms for unseen speakers during recognition, a
transcription is needed to compute the regression statistics. For unsupervised adaptation, the
transcription can come from another system. This approach allows the information from the
other system to be explicitly used during the estimation of the transform.
Analogously to SCM-based systems, the hypothesis posteriors from M AED models can be combined log-linearly,

\[
P(h|O; \theta_1, \ldots, \theta_M) \approx \frac{1}{Z} \exp\left(\sum_{m=1}^{M} z_m \log P(h|O; \theta_m)\right). \tag{4.5}
\]
Instead of using a union of N -best lists from all AED models and interpolating the sequence-
level scores, the same approach can be applied during each step of beam search, which is also
known as joint decoding (Tüske et al., 2021). Also similar to hypothesis combination for SCM
systems, ROVER can be applied to AED models (Wong et al., 2020). Confidence scores for
each word P (hi |O; θ) can be the Softmax scores of the AED models or come from a dedicated
confidence estimation module as discussed in Chapter 5.
Kim et al. (2017) and Watanabe et al. (2017) proposed to use the CTC objective function as an
auxiliary task when training the AED model. The encoder is shared between the CTC branch
and the AED model. The CTC output layer and the decoder of the attention-based model are
separate. The two objective functions are used together during training via a log-linear combination,

\[
\mathcal{L} = \lambda \log P_{\text{CTC}}(h^*|O) + (1 - \lambda) \log P_{\text{AED}}(h^*|O), \tag{4.6}
\]

where h^* is the reference sequence and λ ∈ [0, 1] is the multi-task weight.
During decoding of the joint CTC and AED models, the length penalty and coverage term in
Equation (3.78) become unnecessary as the CTC probability encourages a monotonic alignment
that does not allow the skipping or looping behaviour observed for decoding standalone AED
models. Also, it can prevent the premature termination of hypotheses. For decoding two
models jointly, the challenge is to handle the asynchrony between two output branches. The
CTC branch outputs symbols in a frame-synchronous fashion whereas the attention-based
decoder emits symbols in a label-synchronous way. A straightforward approach to handle this
problem is to combine scores at the utterance level, i.e. obtain N -best hypotheses from the
attention-based decoder and rerank these hypotheses based on the combined score from CTC
and attention-based models,
\[
\hat{h} = \underset{h \in \text{Beam}(O, N)}{\arg\max} \left\{ \lambda' \log P_{\text{CTC}}(h|O) + (1 - \lambda') \log P_{\text{AED}}(h|O) \right\}. \tag{4.7}
\]
A more systematic approach is to integrate the CTC loss during the beam search so that the
search space is modified (Hori et al., 2017a). The CTC prefix probability is defined as the
cumulative probability of all label sequences that have the same prefix (Graves et al., 2006).
Then the beam search process can use the log-linear combination of the CTC prefix score and
the attention-based model score for each partial hypothesis.
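A minimal sketch of the utterance-level combination in Equation (4.7) is shown below; ctc_log_prob and aed_log_prob stand for hypothetical functions returning the sequence-level log-probabilities of the two branches, and the N-best list is assumed to come from beam search with the attention-based decoder.

```python
def rescore_nbest(nbest, ctc_log_prob, aed_log_prob, lam=0.3):
    """Pick the best hypothesis from an N-best list using the log-linear
    combination of CTC and AED scores (Equation (4.7)).

    nbest:        list of hypothesis token sequences from beam search.
    ctc_log_prob: callable, hypothesis -> log P_CTC(h | O).
    aed_log_prob: callable, hypothesis -> log P_AED(h | O).
    lam:          interpolation weight lambda' on the CTC score.
    """
    def combined(h):
        return lam * ctc_log_prob(h) + (1.0 - lam) * aed_log_prob(h)
    return max(nbest, key=combined)
```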
Sainath et al. (2019) proposed using a Recurrent Neural Network (RNN) transducer to
generate N-best hypotheses during the first pass and then rescore them using an AED model.
The encoders of the transducer and the AED model share the same set of parameters. The
first pass using a transducer allows streaming processing of speech data while the second pass
rescoring improves the recognition accuracy. Consequently, the latency requirement for the
first pass is met and the final recognition quality is also improved by the second pass. The
training procedure has three steps. First, a standalone RNN transducer model is trained. Second,
an AED model is trained where the encoder has the same parameters as the encoder of the
RNN transducer and is frozen. Finally, all components are fine-tuned with a combination of
RNN transducer and AED losses. To further integrate the RNN transducer and AED models,
Minimum Word Error Rate (MWER) training is used for the AED model where the N -best
hypotheses are generated by the transducer. During rescoring, an adaptive beam and a prefix-
tree representation of the N-best list are used to improve both the rescoring speed and the final
performance.
Since DNN-HMMs or CTC systems prune most of the less likely hypotheses, rescoring with
AED models can be more robust and efficient than decoding with AED models.
Compared to the joint CTC and attention architecture reviewed in Section 4.1.3.1, ISCA
first performs beam search in a frame-synchronous fashion. It is known that beam search with
frame-synchronous systems can more easily handle streaming data and explore a larger search
space. Furthermore, the proposed framework uses a frame-synchronous decoding pass followed
by a separate rescoring pass by a label-synchronous AED system. The two-pass approach not
only makes it easier to implement by reusing the existing AED rescoring framework, but also
allows the two types of system to have different output units. Moreover, the proposed method is
configured to combine multiple systems and can use different model structures with or without
sharing the encoder. This offers more flexibility in the choice of the systems and possibly more
complementarity.
Compared to the two-pass end-to-end approach described in Section 4.1.3.2, the proposed
framework allows a more flexible choice of the first pass frame-synchronous system, including
DNN-HMMs, CTC, and neural transducers. Since DNN-HMMs and CTC do not have a
model-based decoder with learned parameters, unlike an AED model, it is easier for them to
incorporate structured knowledge and construct richer lattices.
Figure 4.1 The ISCA framework. The top branch is the SCM-based system and the bottom
branch is the AED model. The acoustic model and the neural encoder can optionally share
parameters. Rounded boxes indicate trainable models and a rectangular box indicates an
algorithm or a procedure.
When the acoustic model likelihood p(O|h) and the language model probability P_LM(h) from the SCM-based system, together with the score P_AED(h|O) from the AED system, are available, the final score of a hypothesis h for an utterance O is

\[
S(h, O) = \log p(O|h) + \psi \log P_{\text{LM}}(h) + \alpha \log P_{\text{AED}}(h|O), \tag{4.9}
\]

where ψ is the grammar scaling factor used in the SCM-based system and α is the interpolation
coefficient. When there are multiple LMs and multiple AED models, the score from each
additional model will be interpolated similarly. Additional decoding parameters used in SCM
and AED models can also be added to the total score S, such as an insertion penalty and
coverage score. Based on Equation (4.9), AED models can be viewed as audio-grounded
language models.
In contrast to the proposed framework, standard combination methods for ASR systems
are not suitable for combining with AED models. For example, confusion network combina-
tion (Evermann and Woodland, 2000b) requires decoding lattices and ROVER (Fiscus, 1997)
requires comparable confidence measures from both types of system. The benefits of using
hypothesis-level combination may be limited when combining an SCM-based system with an
AED model, because the hypothesis space generated from the SCM-based system is generally
much larger than that of the AED model.
Since there are two widely used neural decoder architectures, their training and inference
efficiencies need to be taken into account when using different rescoring algorithms. The first
one is the RNN decoder as in Section 2.2.1 that processes the input sequentially from left to
right. The second one is the Transformer decoder as in Section 2.2.2 that uses multi-head
attention and positional encoding to process all time steps in a sequence in parallel. Although
this accelerates training with a higher memory cost, a sequential step-by-step procedure still
needs to be followed during decoding. A summary of the time and space complexities for both
types of neural decoders is shown in Table 4.1.
Table 4.1 Comparison of the time and space complexities of RNN and Transformer decoders
with respect to the output sequence length L.
4.2.3 Training
The SCM-based system and the AED model can be trained separately or jointly in a multi-task
fashion by sharing the neural encoder with the acoustic model. For multi-task trained mod-
els (Watanabe et al., 2017), the total number of parameters in the entire system is smaller due
to parameter sharing. Although multi-task training can be an effective way of regularisation,
setting the interpolation weights between the two losses and configuring the learning rate to
achieve good performance for both models may not be straightforward. Moreover, multi-task
training also limits the model architectures or model-specific training techniques that can
be adopted for individual systems. For example, the acoustic model can be a unidirectional
architecture for streaming purposes but the encoder of the AED model can be bi-directional
for second-pass rescoring. Acoustic models in SCM-based systems normally have a frame
subsampling rate of 3 in a low frame-rate system (Povey et al., 2018; Pundak and Sainath, 2016)
but neural encoders normally have a frame rate reduction of 4, by using convolutional layers or
pyramidal RNNs (see Section 3.6.1), for better performance. Frame-level shuffling (Su et al.,
2013) is important for the optimisation of HMM-based AMs, whereas AED models have to be
trained on a per utterance basis1 . Triphone units are commonly used for HMM-based acoustic
models whereas word-pieces are widely used for attention-based models. Different discrim-
inative sequence training can also be applied separately, e.g. Maximum Mutual Information
(MMI) for SCM-based systems and MWER for AED models. Overall, sharing the acoustic
model and the neural encoder in a multi-task training framework hampers both systems from
reaching the best possible WER performance.
1 For both sequence training of HMM-based systems and cross-entropy training of AED models, utterance-level shuffling is often used.
4.2.4 Inference
During inference, the SCM-based system generates the top hypotheses for each utterance. The
word sequence h can be tokenised into the set of word-pieces used by the AED model. The
word-piece sequence h′ is forwarded through the neural decoder to obtain the probability for
each token P(h′_t | h′_1:t−1, O). By finding the interpolation coefficients ψ and α in Equation (4.9) that give the lowest WER on the development set via grid search, the final hypothesis is chosen as the one with the highest combined score among the top candidates.
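The inference procedure can be summarised by the following sketch (a simplification rather than the exact ISCA implementation): each hypothesis in the N-best list carries an acoustic model score, an LM score and an AED score, and ψ and α are tuned on the development set by grid search before being applied to the test set. The wer_fn helper and the per-hypothesis score fields are illustrative assumptions.

```python
import itertools
import numpy as np

def isca_score(hyp, psi, alpha):
    """Combined score S(h, O) of Equation (4.9) for one hypothesis.
    hyp is a dict holding the three per-hypothesis log scores."""
    return hyp["am"] + psi * hyp["lm"] + alpha * hyp["aed"]

def rescore(nbest, psi, alpha):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: isca_score(h, psi, alpha))

def grid_search(dev_utts, wer_fn, psi_grid, alpha_grid):
    """Pick (psi, alpha) minimising WER on the development set.
    dev_utts: list of (nbest, reference) pairs; wer_fn computes corpus WER."""
    best = (None, None, float("inf"))
    for psi, alpha in itertools.product(psi_grid, alpha_grid):
        hyps = [rescore(nbest, psi, alpha)["words"] for nbest, _ in dev_utts]
        refs = [ref for _, ref in dev_utts]
        wer = wer_fn(hyps, refs)
        if wer < best[2]:
            best = (psi, alpha, wer)
    return best

# Example search grids; the actual ranges would be chosen empirically.
psi_grid = np.arange(5.0, 20.0, 1.0)
alpha_grid = np.arange(0.0, 2.0, 0.1)
```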
As shown in Table 4.1, when rescoring each hypothesis in the N -best list with an AED model
with an RNN-based decoder, the time complexity is O(L) and the space complexity is O(1)
because of the sequential nature of RNNs. In contrast, a Transformer-based decoder has a
time complexity of O(1) and space complexity of O(L2 ) as during Transformer training. Since
the entire hypothesis is available, self-attention can be directly computed across the whole
sequence for each token.
For a fixed number of candidate hypotheses, the number of alternatives per word in the sequence
is smaller when the hypothesis becomes longer. This means the potential for WER reduction
diminishes for longer utterances. Therefore, in order to have the same number of alternatives per
word, the size of an N -best list needs to grow exponentially with respect to the utterance length.
However, lattice rescoring can effectively mitigate this issue. As described in Section 3.4.3,
lattices are directed acyclic graphs where nodes represent words and edges represent associated
acoustic and language model scores. A complete path from the start to the end of a lattice is
a hypothesis. Because a different number of arcs can merge to or split from a node, a lattice
generally contains a far greater number of hypotheses than a limited N -best list. The size of
lattices can be measured by the number of arcs per second of speech, also known as the lattice
density (Woodland et al., 1995).
One commonly used lattice rescoring approach for Recurrent Neural Network Language
Models (RNNLMs) is on-the-fly lattice expansion with n-gram based history clustering (Liu
et al., 2016). Similar to RNNLMs, attention-based models are also auto-regressive models
88 Integrating Source-Channel and Attention-Based ASR Models
where the next prediction depends on all history tokens. This means approximations must be
made when assigning scores on edges of lattices because each word in the lattice may have
numerous history sequences. Although lattices can be expanded to allow each word to have
a more specific history, a trade-off between the uniqueness of the history and computational
efficiency needs to be considered. n-gram based history clustering (Liu et al., 2016) assumes
that the history before the previous n − 1 words has no impact on the probability of the current
word. As illustrated in Figure 4.2, a lattice can be expanded such that the n − 1 history word
sequence for each word in the lattice is unique. During rescoring, a hash table-based cache is
Figure 4.2 Example of expanding a 2-gram lattice to a 3-gram lattice. The hollow node on the
left is expanded into two hollow nodes on the right, and the symbols of the nodes and arcs
correspond to lines 16-18 in Algorithm 1.
created, where the key is the n − 1 history words and the value is the corresponding hidden
state of RNNLM and the output distribution. When the same history appears again during
rescoring, repetitive computation can be avoided, regardless of the more distant history.
However, for AED models based on attention mechanisms, n-gram based history clustering
can lead to undesirable behaviour. For example (from Switchboard),
I think those are are wonderful things to have but I think in a big company ...
the phrase “I think” appears twice in the utterance. If the original trigram based history
clustering developed for RNNLMs is used, at the second occurrence of the phrase, the algorithm
will restore the cache from the first occurrence including the attention context and the decoder
state and then continue to score the rest of the utterance. Consequently, scores for the second
half of the utterance will be wrong because of the incorrect attention context.
To this end, a time-dependent two-level n-gram cache is proposed for the lattice rescoring al-
gorithm shown in Algorithm 1. The input to the algorithm is a lattice from a frame-synchronous
model, the AED model, the corresponding acoustic sequence, and the n-gram order used for
approximation. The output of the algorithm is an expanded lattice with additional AED scores
on each arc. The algorithm first initialises a two-level hash table as the cache (line 2). The
hash-key of the first level is the history phrase. The hash-key of the second level cache is the
frame index of the current word. When looking up the second level, a collar of ±9 frames is
used to accommodate a small time difference in the alignment. Then the algorithm performs
lattice expansion and rescoring at the same time. After initialising the new lattice (lines 3-7),
the nodes in the lattice are traversed in a topological order (line 8). For each node in the
original lattice, the new lattice may have multiple copies such that the n − 1 preceding words
are unique (line 9). Each outgoing arc from the node in the original lattice is duplicated for all
the duplicated nodes in the expanded lattice. Depending on the updated n − 1-gram history,
the destination node of the outgoing arc may need to be duplicated (lines 11-19). Next, for
the duplicated arc, the AED score needs to be obtained. First, the two-level cache is accessed
with the word history and the timestamp of the current word. The cache is only hit when the
timestamp falls within the vicinity of one of the timestamps in the cache. Otherwise, there is a
cache miss and a new entry is created in the cache with the timestamp and the corresponding
decoder states (lines 21-23). Another key detail is that when there is a cache hit, the sum of
arc posteriors of the current (n − 1)-gram is compared with the one stored in the cache. If the
current posterior is larger, indicating the current (n − 1)-gram is on a better path, then the cache
entry is updated to store the current hidden states (lines 24-26). For the example in Figure 4.2,
when the lower path is visited after the upper path, the cache entry “sat on” should already
exist. If the lower path has a higher posterior probability, then the cache entry will be updated,
so that future words in the lattice will adopt the history from the lower path2. Now, the entry
corresponding to the duplicated arc must exist in the cache, having either just been created or renewed.
The duplicated arc with the score from a label-synchronous model can then be connected to the
corresponding nodes in the expanded lattice (lines 28-31). After traversing through all nodes in
the original lattice, the algorithm returns a new expanded lattice with new scores from an AED
model (line 36).
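The two-level cache at the heart of Algorithm 1 can be sketched as follows (an illustrative simplification, not the released implementation; the decoder-state objects are placeholders): the first level is keyed by the (n − 1)-word history, the second level by the frame index of the current word with a ±9-frame collar, and a cache hit is refreshed whenever a better path, i.e. one with a higher sum of arc posteriors, reaches the same history.

```python
COLLAR = 9  # frames of tolerance when matching timestamps

class TwoLevelCache:
    """Maps a history (tuple of n-1 words) to a list of
    {frame, posterior, state} entries."""

    def __init__(self):
        self.table = {}

    def lookup(self, history, frame):
        """Return a cached entry whose timestamp lies within the collar, else None."""
        for entry in self.table.get(history, []):
            if abs(entry["frame"] - frame) <= COLLAR:
                return entry
        return None

    def insert(self, history, frame, posterior, state):
        """Cache miss: create a new entry for this history and timestamp."""
        self.table.setdefault(history, []).append(
            {"frame": frame, "posterior": posterior, "state": state})

    def maybe_update(self, entry, posterior, state):
        """Cache hit: keep the decoder state of the better (higher-posterior) path."""
        if posterior > entry["posterior"]:
            entry["posterior"] = posterior
            entry["state"] = state
```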
For label-synchronous systems with RNN decoders, lattice rescoring has O(L) for time
complexity and O(1) for space complexity as the RNN hidden states can be stored and carried
forward at each node in the lattice. However, since lattice rescoring operates on partial
hypotheses, Transformer decoders have to run in decoding mode as in Table 4.1. Because self-attention needs to be computed with all previous tokens, lattice rescoring with the Transformer-based decoder has O(L²) time and space complexities3.
2 Code is available at https://github.com/qiujiali/lattice-rescore.
3 Recent work has shown that the computational complexity of Transformers can be reduced to O(L) with some modifications (Katharopoulos et al., 2020; Wang et al., 2020b).
4.3 Preliminary Experiments
The AMI-Individual Headset Microphone (IHM) recordings were used (see Section A.1). The
dataset contains around 80 hours of speech for training, and 8 hours for both development (dev)
and evaluation (eval). The inputs used were 80-dim filter-bank features at a 100 Hz frame rate
concatenated with 3-dimensional pitch features. Manual utterance-level segmentation was used
throughout.
The pipeline is based on the ESPnet setup (Kim et al., 2017; Watanabe et al., 2018). The
default configurations for the CTC model, the acoustic model of the DNN-HMM system and
the encoder of the attention model are all an 8-layer bi-directional Long Short-Term Memory
(LSTM) with a projection layer. The bi-directional LSTM has 320 units in each direction and
the projection size is also 320. For the SCM model, an additional Multilayer Perceptron (MLP)
of size 320 is added before the Softmax output. The attention model uses a location-aware
attention mechanism connecting to the decoder with a one-layer 300-unit LSTM. The n-gram
LM is trained using both the AMI and Fisher transcriptions. The RNNLM is trained purely
based on the text transcriptions of AMI data. The RNNLM has one LSTM layer with 1000
hidden units. The perplexities of the RNNLM on the AMI dev and eval data are 73 and 64
respectively.
To train the HMM-based model, the Cross Entropy (CE) objective function was used with align-
ments produced by a pre-trained DNN-HMM system. Unigram label-smoothing (Szegedy et al.,
2016) was applied before computing the CE loss of the AED model. The numbers of graphemes,
monophones and tied triphones are 31, 48 and 4016. The AdaDelta optimiser (Zeiler, 2012)
was used with a batch size of 30 utterances for all models.
For decoding the SCM-based models, PyHTK (Zhang et al., 2019b) was used to set up the
corresponding HMM structures and the decoding pipeline. HTK tools (Young et al., 2015)
were used for lattice generation, lattice rescoring and N -best list generation. A trigram LM
was used for SCM-based model decoding. For AED model decoding, the width of the beam
search was set to 30.
In the baseline setup (Kim et al., 2017), both the CTC model and the AED model used the same
set of graphemes as their output units and the effective frame rate is 25 Hz. The frame rate
reduction is achieved by skipping some time steps in the LSTM layer. The first LSTM layer
has a full frame rate and the next two layers skip one step at every other frame. Statistics on the
training data show that the input/output length ratio is more than four for 95.0% of all utterances
but is more than three for 99.4% of the data. For phones, 99.9% of the utterances have an
input/output length ratio higher than four. For AED models where frame-synchronicity is not
required, a higher ratio of frame rate reduction may be appropriate. However, for SCM-based
models, the frame rate should not be less than 1/3 of the full frame rate.
Table 4.2 AMI dev set WERs by decoding on the AED branch only and joint decoding of the
CTC and attention models with a word-based RNNLM. The output units are the same set of
graphemes for both branches.
In the experiments shown in Table 4.2, the frame rate was reduced at the input feature level,
which is then more similar to the setup from SCM-based systems (Miao et al., 2016b; Povey et al., 2016). By sampling one in every three frames, the performance of the model improves
despite two-thirds of training data not being used. Offsets to the starting point of the input
sequence allow the data to be fully used, which reduced the WER by another 2% absolute.
Furthermore, by concatenating two adjacent frames as the input feature to cover the test-time
unused frames (Pundak and Sainath, 2016), the WER was further reduced by 2%. Overall, a
6-7% absolute reduction of WER was observed over the baseline.
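The frame-rate manipulations described above can be illustrated with a short NumPy sketch (an illustration only, not the ESPnet recipe): subsampling by a factor of three from different starting offsets so that all frames are eventually used during training, and stacking each kept frame with its right neighbour so that otherwise unused frames still contribute to the input features.

```python
import numpy as np

def subsample(features, factor=3, offset=0):
    """Keep one frame in every `factor`, starting from `offset`.
    features: (T, D) filter-bank (+pitch) features."""
    return features[offset::factor]

def subsample_and_stack(features, factor=3):
    """Concatenate each kept frame with its right neighbour so that the
    skipped frames are partially covered (cf. Pundak and Sainath, 2016)."""
    kept = features[0::factor]
    right = features[1::factor]
    n = min(len(kept), len(right))
    return np.concatenate([kept[:n], right[:n]], axis=1)  # (T/factor, 2D)

feats = np.random.randn(300, 83)             # 3 s of 80-dim fbank + 3-dim pitch
print(subsample(feats, 3, offset=1).shape)   # (100, 83)
print(subsample_and_stack(feats, 3).shape)   # (100, 166)
```

During training, the offset can be drawn from {0, 1, 2} for each epoch or utterance so that every frame is eventually seen.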
By treating the CTC model as a type of SCM-based model as described in Section 3.5.1,
the decoding procedure of the traditional HMM-based systems can also be used, where var-
ious sources of structured information are incorporated. The baseline is based on the prefix search decoding procedure, and the improvements are made using the lexical-tree-based decoding procedure.
                          WER
CTC baseline              47.6
  + graphemic lexicon     43.9
  + trigram LM            38.5
  + prior                 33.2
  + multi-task training   32.1

Table 4.3 AMI dev set WER of the standalone graphemic CTC model and several improvements by incorporating structured information during decoding and multi-task training. The RNNLM is not used.
As shown in Table 4.3, the addition of a graphemic lexicon that prevents decoding words
with incorrect spelling reduces the WER by 2.7% absolute. However, some words in the
lexicon can be decomposed into shorter word-pieces, which happen to also be legal words in
the lexicon. The introduction of the trigram LM greatly reduces the fragmentation of words and
reduces the WER significantly (5.4% absolute). One of the many assumptions made by CTC
is that all output units have equal prior probabilities. Computed by accumulating the output
posteriors from the DNN, the estimated priors are very imbalanced. For example, in the case
of 1/3 frame rate, the priors for the blank symbol, the letter ‘A’ and the letter ‘Z’ are 0.43, 0.03 and 10⁻⁴
respectively. By using the graphemic unit priors similarly to HMMs as in Equation (3.12), the
WER improves by another 5.3% absolute. Finally, by training the CTC model and the AED
model in a multi-task fashion, i.e. the same system as the last row in Table 4.2, the CTC WER
is further reduced to 32.1%, whereas the performance of the AED model alone is not improved.
Narrowing the performance gap between the CTC model and the AED model facilitates the use
of the ISCA framework.
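The prior correction can be sketched as follows (a simplified illustration, analogous to the treatment of HMM state priors in Equation (3.12)): unit priors are estimated by accumulating the DNN output posteriors over the training data, and the log-posteriors are then divided by the priors in log space before decoding, so that frequent units such as the blank symbol are not unduly favoured.

```python
import numpy as np

def estimate_priors(posteriors_list):
    """Estimate unit priors by accumulating the DNN output posteriors
    over the training data. posteriors_list: iterable of (T, V) arrays."""
    total = sum(p.sum(axis=0) for p in posteriors_list)
    return total / total.sum()

def apply_prior_scaling(log_posteriors, priors, scale=1.0):
    """Convert log P(unit | o_t) into scaled pseudo log-likelihoods,
    log P(o_t | unit) ∝ log P(unit | o_t) - scale * log P(unit)."""
    return log_posteriors - scale * np.log(priors)
```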
Following the multi-task training scheme where the AM (SCM-based system) and the encoder
(AED model) are shared, the following experiments vary the modelling units and objective
functions of the SCM-based model, while keeping those of the AED model unchanged. For the
following ISCA experiments, 20-best hypotheses from the SCM-based models were ranked
by optimally interpolating the AM scores, trigram LM scores, the AED model scores, and
optionally the RNNLM scores. A derivative-free stochastic global search algorithm, called the
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen and Ostermeier, 2001),
was used to optimise the interpolation weights. Note that results presented on the dev set in
Table 4.4 are optimistic as the interpolation weights are directly optimised for this set.
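The weight optimisation can be reproduced with the publicly available cma package, as in the hedged sketch below; the real objective would rescore the development set with the given weight vector and return its WER, which is replaced here by a toy function, and the starting point and step size are illustrative assumptions rather than the exact settings used in the thesis.

```python
import cma  # pip install cma

def dev_wer(weights):
    """Objective to minimise: the WER on the development set after rescoring
    its N-best lists with the given interpolation weights (AM, trigram LM,
    AED, optionally RNNLM). A toy quadratic stands in for the real pipeline."""
    target = [1.0, 0.8, 0.5, 0.3]            # purely illustrative optimum
    return sum((w - t) ** 2 for w, t in zip(weights, target))

# Start from uniform weights with an initial step size of 0.2.
es = cma.CMAEvolutionStrategy([1.0, 1.0, 1.0, 1.0], 0.2)
es.optimize(dev_wer)
print(es.result.xbest)  # optimised interpolation weights
```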
Table 4.4 AMI dev set WERs of multi-task trained systems. CE systems use single-state HMMs.
A trigram LM was used for SCM-based system decoding. The RNNLM was not used except
for the results in brackets.
The model in the first row in Table 4.4 corresponds to the model of the last rows in both
Tables 4.2 and 4.3. For this multi-task trained graphemic CTC and AED model, the performance
of the ISCA framework is similar to the joint decoding result in Table 4.2.
Given that the joint decoding method only applies to a CTC model and an AED model
with the same graphemic units, one of the major advantages of the ISCA framework is that the
SCM model can have any subword units and any loss function. The next two rows change the
modelling units of SCM-based systems from graphemes to monophones, and the WER of the
SCM-based system is reduced by 10.3% relative to the graphemic CTC model. This is mainly
due to the orthographic irregularity in English where phonetic-based units reduce the difficulty of modelling significantly.

4 A decision tree is used to tie 93k triphone models (or states) to 4k.
5 An acoustic model trained with a similar setup by HTK can yield approximately the same WERs with only a fifth of the parameters due to better data shuffling, larger batch size, etc. For results to be comparable, all models in this paper are trained using ESPnet.

The mapping between pronunciation and text is achieved by the
lexicon embedded in the decoding graph. For these two monophone SCM models, improved
CTC and CE models have similar performance. However, this observation does not hold for
triphone systems. This may be because triphone systems are more sensitive to the quality of the
training alignments since there are many more confusing triphone units than monophone units.
The amount of training data in the AMI dataset may not be sufficient for the CTC system to
discriminate different triphone units and to learn the alignment simultaneously. In contrast, the
CE system outperforms the CTC system as it uses alignments produced by an existing system.
Further experiments show that multi-task trained SCM-based models perform marginally
better than their standalone counterparts. However, the expected benefits of multi-task training
are not observed for the AED models. The last column in Table 4.4 shows that the ISCA
approach yields consistent reductions in WER while the SCM-based model improves, with
or without an external RNNLM. However, as the performance gap between the SCM-based
system and the AED model widens, the relative improvement w.r.t. the SCM-based model
shrinks. In the extreme case where the triphone CE model outperforms the AED model by
15.8% relative, ISCA with 20-best rescoring can still improve the WER of the SCM-based
model by 4.8% relative.
Since the ISCA framework integrates two models at the word level, attempts have also been
made to change the modelling units of the AED model. Similar to the findings by Sainath et al.
(2018) and Zhou et al. (2018a), using monophones for the AED model is not as helpful as
graphemes for ISCA, which shows that the complementarity of graphemic and phonetic models
is essential. As expected, using context-dependent units for the AED model yields essentially
no improvement compared to their context-independent counterparts. Since the neural decoder
directly conditions on the previous output as in Equation (3.73), modelling context-dependent
units is an even harder task than context-independent ones as the model also needs to learn the
tying results found by phonetic decision trees.
Since the triphone CE model outperforms the AED model significantly when using the multi-
task training setup and only 20-best hypotheses are used for ISCA, two more questions remain.
First, how does ISCA perform when the attention-based model improves? Second, how much more improvement can ISCA yield if the N-best list becomes longer? In order to answer these
questions, three separate models were trained whose configurations are listed in Table 4.5 and
the WERs of the combined system with respect to the length of N -best lists are plotted in
Figure 4.3.
[Figure: WER (%) against the length of the N-best list (1 to 100) for the triphone CE model rescored with +RNNLM, +AED(small), +AED(large), +RNNLM+AED(small) and +RNNLM+AED(large).]
Figure 4.3 ISCA between a standalone triphone CE model and AED models of different
configurations with various sizes of N -best lists.
As shown in Figure 4.3, the reduction in WER from adding an external RNNLM stagnates
after an N-best list of length 20 is used. However, the WER of ISCA continues to drop,
especially for the large attention-based model where two systems have similar WERs. Although
the attention-based system models the acoustic and language model jointly, an RNNLM trained
using the AMI transcriptions only is still useful. The improvement from the RNNLM is
expected to be greater when additional text corpora in similar domains are used for training.
For the 100-best list, the small and large attention-based models with the RNNLM reduce
the WER by 7.1% and 10.5% respectively relative to the triphone/CE model rescored by the
RNNLM.
Table 4.6 The number of parameters and WERs of some key models on both the AMI dev and
eval sets, with RNNLM included. ISCA uses 100-best from the triphone CE model.
4.4 Large Scale Experiments

Two common ASR benchmarks were used for training and evaluation. The AMI dataset
(Section A.1) is relatively small-scale and contains recordings of spontaneous meetings. The
IHM channel was used. As shown in Table 4.7, it has many short utterances as multiple
speakers take short turns during meetings. SWB-300 (see Section A.3) is a larger-scale dataset
with telephony conversations. Compared to AMI, it has more training data, longer utterances,
more speakers and a larger vocabulary. Hub5’00 was used as the development set, which is split
into two subsets: SWB and CallHome (CH). RT03 was used as the evaluation set, which is split
into Switchboard Cellular (SWBC) and Fisher (FSH) subsets. The acoustic data preparation
follows the Kaldi recipes (Povey et al., 2011).
                        AMI-IHM      SWB-300
training data           78 hours     319 hours
avg. utterance length   7.4 words    11.8 words
number of speakers      155          543
vocabulary size         12k          30k

Table 4.7 Statistics of the AMI-IHM and SWB-300 training sets.
For each dataset, its training transcription and Fisher transcription (see Section A.3) were used
to train both n-gram language models and RNNLMs. Text processing and building n-gram
LMs for both datasets follow the Kaldi recipes (Povey et al., 2011). RNNLMs were trained
using the ESPnet toolkit (Watanabe et al., 2018). The vocabulary used for RNNLMs is the
same as for n-gram LMs and has 49k words for AMI-IHM and 30k words for SWB-300. The
RNNLMs have 2-layer LSTMs with 2048 units in each layer. The models were trained with
Stochastic Gradient Descent (SGD) with a learning rate of 10.0 and a dropout rate of 0.5. The embedding dimension is 256. Gradient norms were clipped to 0.25 and the weight decay was set to 10⁻⁶. The training transcriptions and the Fisher transcriptions were mixed in a 3:1 ratio.
Because of the domain mismatch between the AMI and Fisher text data, the RNNLM for AMI
was fine-tuned on the AMI transcriptions after training using the mixture of data with a learning
rate of 1.0. The AMI RNNLM has 161M parameters and the Switchboard RNNLM has 122M
parameters. The perplexities of the LMs for both datasets are shown in Table 4.8.
              AMI-IHM         SWB-300
LM            dev     eval    Hub5'00   RT03
3-gram        80.2    76.7    82.8      67.7
4-gram        79.3    75.7    80.0      65.3
RNNLM         58.0    53.5    51.9      45.3

Table 4.8 Perplexities of the language models on the AMI-IHM and SWB-300 test sets.
The acoustic models of the SCM-based systems are factorised Time Delay Neural Networks
(TDNNs), which were trained with the lattice-free MMI objective (Povey et al., 2018) by
following the standard Kaldi recipes (Povey et al., 2011). The total numbers of parameters are
10M for AMI-IHM and 19M for SWB-300.
Two types of AED systems were trained using the ESPnet toolkit (Watanabe et al., 2018)
without the CTC branch. The neural encoders were composed of 2 convolutional layers that
reduce the frame rate by 4, followed by 16 Conformer blocks (Gulati et al., 2020). For the
Conformer block, the dimension for the feed-forward layer was 2048, the attention dimension
was 512 for the model with an LSTM decoder and 256 for the model with a Transformer
decoder. The number of attention heads was 4 and the convolutional kernel size is 31. For the
AED model with an LSTM decoder, the decoder had a location-aware attention mechanism
and 2-layer LSTMs with 1024 units. For the AED model with a Transformer-based decoder,
the decoder had 6-layer Transformer decoder blocks where the attention dimension is 256 and
the feed-forward dimension is 2048. Both AED models were trained using the Noam learning
rate scheduler on the Adam optimiser. The learning rate was 5.0 and the number of warm-up
steps was 25k. Label smoothing of 0.1 and a dropout rate of 0.1 were applied during training.
An exponential moving average of all model parameters with a decay factor of 0.999 was
used. The AED model with LSTM decoder has 130M parameters while the AED model with
Transformer decoder has 54M parameters. Beam search with a beam-width of 8 was used for
decoding. Apart from applying length normalisation for the AED (Transformer) model, other
decoding heuristics were not used. SpecAugment (Park et al., 2019) and speed perturbation
were applied. Word-piece outputs (Kudo, 2018) were used with 200 units for AMI-IHM and
800 units for SWB-300. The single model WERs for the SCM-based system and AED models
are given in Table 4.9.
The WERs of the SCM systems are higher than those of the AED models, mainly because the number of parameters in the SCM systems is much smaller. The WER gap between the SCM and AED systems is smaller on the AMI-IHM dataset than on the SWB-300 dataset, because the training data for the large AED models is more abundant for SWB-300. By comparing the two AED models with LSTM and Transformer decoders, the WERs are relatively close despite the fact that the AED model with the LSTM decoder has more parameters.

                     AMI-IHM          SWB-300
                   dev     eval    Hub5'00       RT03
                                   (SWB/CH)      (SWBC/FSH)
SCM                19.9    19.2    8.6 / 17.0    18.8 / 11.4
AED (LSTM)         19.6    18.2    7.5 / 15.3    16.2 / 10.7
AED (Transformer)  19.4    19.1    7.8 / 14.4    17.5 / 10.4

Table 4.9 Single-system WERs on the AMI-IHM and SWB-300 datasets. Systems do not use RNNLMs for rescoring or decoding. All AED models have Conformer encoders. The decoder architectures of the AED models are given in brackets.
After pruning the lattices generated by the SCM-based system by limiting the beam width and
the maximum lattice density, 20-best, 100-best and 500-best hypotheses were obtained from
these lattices. The N -best hypotheses were then forwarded through the RNNLM, AED (LSTM)
and AED (Transformer). Each hypothesis has five scores, i.e. AM score, n-gram LM score,
RNNLM score, and scores from an AED model with an LSTM decoder and an AED model
with a Transformer decoder. By following Equation (4.9), the five interpolation coefficients are
found by using CMA-ES (Hansen and Ostermeier, 2001) that minimise the WER on the dev
set. By applying the optimal combination coefficients on the test set, the hypotheses with the
highest score were picked. The lattice density for each N -best or lattice rescoring experiment
is also reported. For lattice rescoring, the lattice density refers to the number of arcs in the
expanded lattice for each second of speech audio. For N -best rescoring, since an equivalent
lattice representation for an N -best list would be N parallel paths, the lattice density refers
to the number of words in all N -best hypotheses divided by the duration of the utterance in
seconds6.

6 Another common representation of an N-best list is a prefix tree, which has a lower lattice density than N parallel paths.

For AMI-IHM, as shown in Table 4.10, the WER consistently decreases as the N-best list size increases. As expected, the lattice density grows approximately linearly with respect to the size of the N-best list. However, the WER improvement from 100 to 500-best is smaller than from 20 to 100-best, i.e. the gain from increasing N reduces for larger N. For example, in
the last row, the relative WER reduction from 20 to 100-best is 5.2%, but it shrinks to 2.8%
when increasing the size of N -best lists from 100 to 500. As this observation holds for all
rows in Table 4.10, the benefit of expanding the size of N -best lists diminishes. For lattice
rescoring results shown in Table 4.11, the improvement from using a higher-order n-gram
approximation is often marginal whereas the lattice density nearly doubles from 4-gram to
5-gram.

Table 4.10 WERs on the AMI-IHM eval set using N-best rescoring with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

Table 4.11 WERs on the AMI-IHM eval set using lattice rescoring with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

For both N-best and lattice rescoring, Tables 4.10 and 4.11 show that the WER is lower when combining scores from more models. If just one additional model is to be used
for combination with an SCM-based system, as in the first blocks of Tables 4.10 and 4.11, using an AED model seems to be more effective than an RNNLM, because an AED model can be viewed as an audio-grounded language model. In the second blocks of Tables 4.10 and 4.11, as the performance of the two AED models is similar, the final rescoring performance using an RNNLM and an AED model is also similar. When all three models are used together for rescoring, the best WERs are reached, which shows that all three models are complementary.
The WER of the combined system is 25-29% relative lower compared to a single system in
Table 4.9. With half the lattice density, lattice rescoring with a 4-gram approximation has a
2.8% relative WER reduction over 500-best rescoring.
For SWB-300, 500-best rescoring and lattice rescoring with a 5-gram approximation are reported for the Hub5'00 and RT03 sets in Tables 4.12 and 4.13.

Table 4.12 WERs on Hub5'00 using 500-best rescoring and lattice rescoring (5-gram approximation) with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

Table 4.13 WERs on RT03 using 500-best rescoring and lattice rescoring (5-gram approximation) with various combinations of the RNNLM and two AED models. Lattice density is provided in square brackets.

For most of the rows in Tables 4.12 and 4.13, WERs are lower for lattice rescoring using a 5-gram approximation than
N -best rescoring using 500-best while having much lower lattice densities. When only using
an RNNLM, the lattice rescoring results are slightly worse than N -best rescoring. This may be
due to the fact that the RNNLM operates at the word level whereas AED models operate at the
word-piece level. As a result, by having a 5-gram approximation during word lattice rescoring,
the number of auto-regressive steps in the RNNLM is exactly 5, but it is much greater than 5 for the AED models. Similar to the observations made on the AMI dataset, the best system is obtained
with lattice rescoring using all three models. For RT03, the final combined system using lattice
rescoring reduces the WER by 19-33% relative compared to the single systems in Table 4.9.
Although the 500-best has a greater lattice density than the expanded 5-gram lattice, lattice
rescoring has 5-8% relative WER reduction over 500-best rescoring. The relative gain is greater
than on the AMI dataset. This is not unexpected because SWB-300 has longer utterances than
AMI-IHM on average as shown in Table 4.7. The benefit of using lattice rescoring is more
substantial on long utterances as lattices can represent more alternatives more efficiently.
The Matched-Pair Sentence-Segment Word Error (MAPSSWE) statistical tests (Pallett
et al., 1990) show that the p-values for the null hypotheses “there is no performance difference
between 500-best rescoring and 5-gram lattice rescoring” on both AMI and SWB test sets are
below 0.001.
4.4.2.2 Analysis
Lattices are a more compact representation of the hypothesis space that scales well with the
utterance length. In Figure 4.4, the Relative Word Error Rate Reductions (WERRs) by using
20-best rescoring, 500-best rescoring and lattice rescoring with a 5-gram approximation are
compared for different utterance lengths measured by the number of words in the reference.
[Figure: relative WER reduction (%) against the number of words in the reference (1-8, 9-16, 17-24, 25-32, >32) for 20-best, 500-best and 5-gram lattice rescoring.]
Figure 4.4 Relative WER reduction by utterance length on RT03 for various ISCA rescoring
methods.
As expected, the gap between N -best rescoring and lattice rescoring widens as the utterance
length increases. The number of alternatives per word represented by the N -best list is smaller
for longer utterances, which explains the downward trend in Figure 4.4 for 20-best and 500-best.
However, the number of alternatives per word for lattices is constant for a given lattice density.
Therefore, the WERR from lattice rescoring does not drop even for very long utterances.
Table 4.1 compared the complexities of using RNN and Transformer decoders for AED
models. Based on our implementation, speed disparities between the two types of decoders
are significant. For 500-best rescoring, AED (Transformer) is about four times faster than
AED (LSTM). However, AED (LSTM) is nearly twice as fast as AED (Transformer) for lattice
rescoring with a 5-gram approximation. Explicit comparisons between N -best and lattice
rescoring, and between RNNLM and AED model rescoring are not made here as they depend
on other factors including the implementation, hardware, the degree of parallel computation
and the extent of optimisation. For example, representing the N -best list in the form of a
prefix tree (Sainath et al., 2019) or using noise contrastive estimation (Chen et al., 2016) will
significantly accelerate N -best rescoring using RNNLMs.
Under the constraint that no additional acoustic or text data is used, ISCA outperforms
various recent results on AMI-IHM shown in Table 4.14 and SWB-300 shown in Table 4.15.
Further improvements are expected if cross-utterance language models or cross-utterance
label-synchronous models are used for rescoring (Irie et al., 2019c; Sun et al., 2021b; Tüske
et al., 2020), or combining more and stronger individual systems (Tüske et al., 2021).
4.4.3 Conclusions
In this chapter, a flexible framework called ISCA that combines an SCM-based system with
one or more AED models is proposed. Frame-synchronous SCM systems are used as the first
pass, which can process streaming data and integrate structured knowledge such as a lexicon.
Label-synchronous AED models are viewed as audio-grounded language models to rescore
hypotheses from the first pass. Since the two highly complementary systems are integrated at
the word level, they can be trained jointly in a multi-task fashion or optimised separately to
fully utilise system-specific techniques for optimal performance. Experiments showed that the proposed lattice rescoring algorithm with AED models generally outperforms N-best rescoring. AED models with RNN decoders are better suited for lattice rescoring while Transformer decoders are more time-efficient for N-best rescoring. On both the AMI and SWB datasets, the combined systems have WERs around 30% relatively lower than the individual systems and outperform other recently published results.

                       Hub5'00       RT03
                       (SWB/CH)      (SWBC/FSH)
Hadian et al. (2018)   9.3 / 18.9
Zeyer et al. (2018)    8.3 / 17.3
Park et al. (2019)     6.8 / 14.1
Irie et al. (2019c)    6.7 / 12.9
Kitza et al. (2019)    6.7 / 13.5
Wang et al. (2020c)    6.3 / 13.3
Tüske et al. (2020)    6.4 / 12.5    14.8 / 8.4
Saon et al. (2021)     5.9 / 12.5    14.1 / 8.6
Hu et al. (2021)       7.2 / 13.6    14.4 / 8.9
Sun et al. (2021b)     6.7 / 13.2    14.9 / 8.3
ISCA                   5.7 / 12.1    13.2 / 7.6

Table 4.15 Comparison with other SWB-300 results. Some RT03 results are not available.
Chapter 5

Confidence Scores for Attention-Based Encoder-Decoder ASR Models
For various speech-related tasks, a confidence score is normally a value between 0 and 1
associated with each subword/word/utterance that indicates the quality of Automatic Speech
Recognition (ASR) transcriptions. Confidence scores play an important role in various down-
stream applications, such as semi-supervised learning, active learning, keyword spotting,
systems combination and dialogue systems. In traditional Hidden Markov Model (HMM)-
based ASR systems, confidence scores can be estimated from word posteriors in decoding
lattices. However, for an Attention-Based Encoder-Decoder (AED)-based ASR model with
an auto-regressive decoder, computing word posteriors is difficult. As AED models reach
promising performance for ASR, various downstream tasks rely on good confidence estima-
tors. An obvious approach for estimating confidence scores for AED models is to use the
decoder Softmax probabilities. In practice, Softmax scores are poor confidence scores and are
affected heavily by regularisation techniques used during ASR training. In the first part of the
chapter, a lightweight and effective approach called the Confidence Estimation Module (CEM)
is proposed. The CEM generates a confidence score for each hypothesis token. Word-level
confidence scores can be obtained by aggregating the token-level scores. Experiments show
that the CEM is a much better confidence estimator than the Softmax probabilities and the
overconfidence problem of Softmax scores is effectively mitigated.
Although AED models have shown impressive results for ASR, they formulate the sequence-
level probability as a product of the conditional probabilities of all individual tokens given their
histories. However, the performance of locally normalised models can be sub-optimal because
of factors such as exposure bias (Deng et al., 2020). Consequently, the model distribution
differs from the underlying data distribution. In the next part of the chapter, the Residual
Energy-Based Model (R-EBM) is used to complement the auto-regressive AED model to
close the gap between the two distributions. Meanwhile, an R-EBM can also be regarded as
an utterance-level confidence estimator. Experiments show that an R-EBM produces better
utterance-level confidence scores than aggregating token-level confidence scores from the CEM.
Furthermore, the utterance-level scores generated by an R-EBM can be used to rescore the
N -best hypotheses and reduce the Word Error Rate (WER).
The CEM and R-EBM are better confidence estimators than using output Softmax prob-
abilities on in-domain data. Since they are model-based confidence estimators trained using
the same data as the underlying ASR model, generalising to out-of-domain data may be chal-
lenging. If the input data to the speech recogniser is from mismatched acoustic and linguistic
conditions, the ASR performance and the corresponding confidence estimators may exhibit
severe degradation. To this end, the last part of this chapter proposes two approaches to improve
the model-based confidence estimators on Out-of-Domain (OOD) data while keeping the ASR
model untouched: using pseudo transcriptions and an additional OOD language model. Experi-
ments show that the proposed methods can considerably improve the confidence metrics on
OOD datasets while preserving in-domain performance. Furthermore, the improved confidence
estimators are shown to better reflect the probability of a word being recognised correctly and
can also provide a much more reliable criterion for data selection.
5.1 Background
Confidence scores have been an intrinsic part of ASR systems (Jiang, 2005; Wessel et al.,
2001; Yu et al., 2011) and provide an indication of the reliability of transcriptions given by
the recogniser. Many speech-related applications depend on high-quality confidence scores to
mitigate errors from speech recognisers. For example, in semi-supervised learning and active
learning, utterances with highly confident hypotheses are selected to further improve ASR
performance (Chan and Woodland, 2004; Riccardi and Hakkani-Tür, 2005; Tür et al., 2005).
Confidence scores are also used in dialogue systems where queries with low confidence may
be returned to users for clarification (Tür et al., 2005). As an indication of ASR uncertainty,
confidence scores can play a role in speaker adaptation (Uebel and Woodland, 2001), and
system combination (Evermann and Woodland, 2000b).
A commonly used confidence feature is the word posterior probability, which is normally derived from word lattices or confusion
networks (Evermann and Woodland, 2000a,b; Mangu et al., 2000). Other useful features
include the hypothesis density (Kemp and Schaaf, 1997), word trellis stability (Sanchís et al.,
2012), acoustic stability (Zeppenfeld et al., 1997), normalised acoustic likelihood (Pinto and
Sitaram, 2005), and language model back-off behaviour (Weintraub et al., 1997). In order
to have more reliable estimates, many model-based approaches have been proposed, such as
conditional random fields (Seigel and Woodland, 2011), recurrent neural networks (Del-Agua
et al., 2018; Kalgaonkar et al., 2015; Kastanos et al., 2020; Ragni et al., 2018) and graph neural
networks (Li et al., 2019c; Ragni et al., 2022).
Normalised Cross Entropy (NCE) measures how much better the estimated confidence scores are than a constant score equal to the average probability of a token/word being correct in the entire dataset. If confidence scores for all N tokens/words are gathered and denoted c = [c_1, ..., c_N] where c_n ∈ [0, 1], and their corresponding target confidences are c* = [c*_1, ..., c*_N] where c*_n ∈ {0, 1}, NCE is given by

\[
\text{NCE}(\mathbf{c}^*, \mathbf{c}) = \frac{H(\mathbf{c}^*) - H(\mathbf{c}^*, \mathbf{c})}{H(\mathbf{c}^*)}, \tag{5.1}
\]

where

\[
H(\mathbf{c}^*) = -\bar{c}^* \log(\bar{c}^*) - (1 - \bar{c}^*) \log(1 - \bar{c}^*), \quad \text{with } \bar{c}^* = \frac{1}{N}\sum_{n=1}^{N} c^*_n, \tag{5.2}
\]

and H(c*, c) is the binary cross-entropy between the target and the estimated confidence scores,

\[
H(\mathbf{c}^*, \mathbf{c}) = -\frac{1}{N}\sum_{n=1}^{N} \left[ c^*_n \log(c_n) + (1 - c^*_n) \log(1 - c_n) \right]. \tag{5.3}
\]
When confidence estimation is systematically better than the word correct ratio c¯∗ , the NCE
is positive. For perfect confidence scores, the NCE is 1. Since the NCE is sensitive to the absolute values of the confidence scores, a Piece-wise Linear Mapping (PWLM) is commonly used to boost the NCE (Evermann and Woodland, 2000a). The PWLM is estimated on a development set and maps the confidence scores closer to the probability of the recognised token/word being correct, while maintaining the relative ordering of the tokens/words based on their confidence scores.
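As an illustration, the NCE in Equations (5.1)-(5.3) can be computed with a few lines of NumPy. The sketch below is illustrative only and is not the evaluation code used in this work; the clipping constant eps is an implementation detail rather than part of the definition.

import numpy as np

def nce(c_star, c, eps=1e-12):
    # Normalised Cross Entropy of Equations (5.1)-(5.3).
    # c_star: binary targets in {0, 1}; c: estimated confidence scores in [0, 1].
    c_star = np.asarray(c_star, dtype=float)
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    c_bar = np.clip(c_star.mean(), eps, 1.0 - eps)      # token/word correct ratio
    h_target = -(c_bar * np.log(c_bar) + (1.0 - c_bar) * np.log(1.0 - c_bar))
    h_cross = -np.mean(c_star * np.log(c) + (1.0 - c_star) * np.log(1.0 - c))
    return (h_target - h_cross) / h_target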
However, in some applications such as keyword spotting and data filtering, it is the order of
tokens ranked by confidence scores that matters. In these cases, operating points have to be
chosen where hypotheses with confidence scores above a certain threshold c̃ are deemed to be
correct and incorrect otherwise. There are four outcomes of the binary decision as shown in
Table 5.1.
                          predicted confidence
                          c ≥ c̃                    c < c̃
  target       c∗ = 1     true positive (TP)       false negative (FN)
  confidence   c∗ = 0     false positive (FP)      true negative (TN)

Table 5.1 Outcomes of the binary decision given a confidence threshold c̃.
Precision-Recall (P-R) curves are commonly used to illustrate the operating characteris-
tics (Davis and Goadrich, 2006), where
precision(c̃) = TP(c̃) / (TP(c̃) + FP(c̃)),  (5.4)

recall(c̃) = TP(c̃) / (TP(c̃) + FN(c̃)).  (5.5)
For a given threshold, precision is the fraction of true positives over all samples that are deemed
to be positives by the confidence estimator, and recall is the fraction of true positives over all
samples that are actually positive. Normally, when the threshold c̃ increases, there are fewer
false positives and more false negatives, which leads to higher precision and lower recall. The
trade-off behaviour between precision and recall yields a downward trending curve from the top
left corner to the bottom right corner. Therefore, the Area Under the Curve (AUC) can measure
the quality of the confidence estimator, which has a maximum value of 1. It is worth noting
that two confidence estimators can have the same AUC value but different NCE values. Using
P-R curves is more informative than the Receiver Operating Characteristics (ROC) curves
under unbalanced classes (Saito and Rehmsmeier, 2015). In practice, a downstream application
normally needs to make decisions based on confidence scores. The Equal Error Rate (EER) is
where the false negative rate (FN/(TP + FN)) equals the false positive rate (FP/(FP + TN)),
which is the optimal operating point if false acceptance and false rejection have equal costs.
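Both ranking-based metrics can be computed with standard tools. A possible sketch using scikit-learn is given below; it is illustrative only, and the exact evaluation scripts used in this work may differ.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

def pr_auc_and_eer(c_star, c):
    # Area under the precision-recall curve defined by Equations (5.4)-(5.5).
    precision, recall, _ = precision_recall_curve(c_star, c)
    pr_auc = auc(recall, precision)
    # EER: operating point where the false negative rate equals the false positive rate.
    fpr, tpr, _ = roc_curve(c_star, c)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return pr_auc, eer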
Figure 5.1 Filtering behaviour of a conventional HMM-based system and an AED model based
on confidence scores. Utterances with confidence higher than a threshold (x-axis) are selected,
and the WER of the filtered subset (y-axis) is plotted. Both systems are trained on LibriSpeech
100-hour data and the test-clean set is used for filtering.
Figure 5.2 Confidence Estimation Module (CEM). The AED model in the shaded box is frozen
during CEM training.
Confidence scores can be estimated at different granularities, such as a token, a word or an utterance. For AED models, each step in the auto-regressive decoder is treated as a
classification task over all possible output tokens. However, there is a subtle difference between
calibration for standard classification and confidence scores for sequences. For a hypothesis
sequence, each token can either be correct, a substitution or an insertion. Because of the
auto-regressive nature of the decoder and the use of the teacher forcing approach for training,
the calibration behaviour for sequences with an incorrect history is uncertain. Furthermore, a model can become poorly calibrated when it is made very deep and large in pursuit of state-of-the-art WER (Guo et al., 2017).
To obtain high-quality confidence scores while maintaining the WER performance of
the ASR model, the CEM is proposed as shown in the top box in Figure 5.2. The CEM is
designed to be a lightweight module that can be easily configured on top of any AED model.
The CEM gathers information from the attention mechanism, the decoder state, the Softmax
output probabilities, and the current token embedding as the input of the confidence feature
extractor. The feature extractor can be any Deep Neural Network (DNN) that transforms a
high-dimensional input feature to a single dimension. Then the sigmoid output layer generates
a value between 0 and 1 that indicates the confidence score c(hi ) for the current token, i.e.
c(h_i) = Sigmoid(FeatureExtractor(v_i, d_i, p(h_i | h_{0:i−1}, o_{1:T}), h_i)).  (5.6)
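A minimal PyTorch sketch of Equation (5.6) is given below. The layer sizes and the single fully-connected feature extractor are illustrative choices, and the inputs are assumed to be features already gathered from the frozen AED model.

import torch
import torch.nn as nn

class ConfidenceEstimationModule(nn.Module):
    # c(h_i) = Sigmoid(FeatureExtractor(v_i, d_i, p(h_i | h_{0:i-1}, o_{1:T}), emb(h_i)))
    def __init__(self, ctx_dim, dec_dim, vocab_size, emb_dim, hidden_dim=256):
        super().__init__()
        in_dim = ctx_dim + dec_dim + vocab_size + emb_dim
        self.feature_extractor = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, ctx, dec_state, softmax_probs, token_emb):
        # All inputs have shape [batch, num_tokens, dim]; they are detached from the
        # AED model, whose parameters stay frozen during CEM training.
        feats = torch.cat([ctx, dec_state, softmax_probs, token_emb], dim=-1)
        return torch.sigmoid(self.output(self.feature_extractor(feats))).squeeze(-1)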
Assuming there is an existing well-trained AED model available, N -best hypotheses can
be generated by running a beam search using the model. Then the edit distance between each
hypothesis sequence h can be computed with respect to the ground truth reference sequence
w. The alignment from the edit distance computation can be used as the target for confidence
if correct tokens are assigned as 1 while substituted or inserted tokens are assigned as 0. For
example, if the ground truth sequence is “A B C D” and one of the hypotheses is “A C C D”, then the binary target sequence is c∗ = [1, 0, 1, 1]. Note that confidence scores are only
associated with hypothesised tokens and deletion errors are not modelled here. For each of
the N -best hypotheses, the CEM is trained to minimise the binary cross entropy between the
estimated confidence c and the target c∗ ,
L(c∗, c) = −(1/L) Σ_{i=1}^{L} [ c∗(h_i) log(c(h_i)) + (1 − c∗(h_i)) log(1 − c(h_i)) ].  (5.7)
The total loss for an utterance is the aggregated confidence estimation loss for all the N -best
hypotheses. During CEM training, all parameters of the AED model are fixed.
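The construction of the binary targets and the loss in Equation (5.7) can be sketched as follows. The alignment routine is a plain Levenshtein back-trace written out for illustration; deletions produce no hypothesis token and are therefore skipped.

import torch
import torch.nn.functional as F

def confidence_targets(hyp, ref):
    # Label each hypothesis token: 1 if correct, 0 if a substitution or an insertion.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    targets, i, j = [0] * n, m, n
    while j > 0:
        if i > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            targets[j - 1] = int(ref[i - 1] == hyp[j - 1])   # correct or substituted token
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1                                           # deletion: no hypothesis token
        else:
            targets[j - 1] = 0                               # inserted token
            j -= 1
    return targets

def cem_loss(confidences, targets):
    # Binary cross-entropy of Equation (5.7) for one hypothesis; both arguments are float
    # tensors, e.g. targets = torch.tensor(confidence_targets(h, w), dtype=torch.float).
    return F.binary_cross_entropy(confidences, targets)

For example, confidence_targets(["A", "C", "C", "D"], ["A", "B", "C", "D"]) returns [1, 0, 1, 1], matching the example above.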
The LibriSpeech “train-clean-100” subset (see Section A.2), containing read speech from
audiobooks, was used for model training. It reflects a typical use case of confidence scores
where the WERs are moderately high and the amount of supervised data is limited. The dev and
test sets are the dev-clean/dev-other and test-clean/test-other sets, each of which contains over 5 hours of audio. The input features are 80-dimensional filterbank coefficients with ∆ and ∆∆ values appended (240 dimensions per frame). The output targets are 16k word-pieces tokenised from the full LibriSpeech training set using a Word Piece Model (WPM) (Schuster
and Nakajima, 2012). All experimental results in Section 5.2.4 on LibriSpeech are on the
test-clean/test-other sets.
The baseline model is an AED model trained using the open-source Lingvo toolkit (Shen et al.,
2019). The encoder consists of a 2-layer convolutional neural network with max-pooling and a
stride of 2 and a 4-layer bi-directional Long Short-Term Memory (LSTM) network with 1024
units in each direction. The decoder has a 2-layer uni-directional LSTM network with 1024
units. The total number of parameters in the baseline model is 184 million. Training used the
Adam optimiser (Kingma and Ba, 2015) with a learning rate of 0.001 and the batch size was
512. During training, five regularisation techniques have been adopted: dropout (Srivastava
et al., 2014) of 0.1 on the decoder; uniform label smoothing (Szegedy et al., 2016) of 0.2;
Gaussian weight noise (Graves, 2011) with zero mean and a standard deviation of 0.05 for
all model parameters after 20k updates; SpecAugment (Park et al., 2019) with 2 frequency
masks with mask parameter F = 27, 2 time masks with mask parameter T = 40, and time
warping with warp parameter W = 40; and an Exponential Moving Average (EMA) (Polyak and Juditsky, 1992) of all model parameters. For details of various
regularisation techniques, please refer to Section 2.3.4.
Many training techniques have been developed for deep neural networks. They normally
reduce the extent to which the model overfits the training data and improve the generalisation
of the model. State-of-the-art performance is often achieved when a large model is trained
using aggressive regularisation (Chiu et al., 2018). As for the baseline setup described in
Section 5.2.3.2, five techniques have been used. Broadly speaking, regularisation methods can
be classified into three categories, i.e. augmenting input features (SpecAugment), manipulating
model weights (dropout, EMA & weight noise), and modifying output targets (label smoothing).
In Table 5.2, the ASR WERs and the confidence metric AUCs of five additional models
are shown, where each model is trained by removing one regularisation technique from the
baseline setup. The confidence scores were computed using Softmax probabilities.
                        WER ↓          AUC ↑
  baseline              7.5/21.6       0.976/0.912
  − dropout             7.8/22.0       0.977/0.916
  − EMA                 8.2/24.8       0.974/0.903
  − label smoothing     10.6/24.6      0.985/0.950
  − weight noise        12.9/25.8      0.978/0.925
  − SpecAugment         10.8/34.3      0.952/0.911

Table 5.2 ASR and token-level confidence performance when removing one regularisation method from the baseline model, on test-clean/test-other. Confidence scores are based on the raw Softmax probabilities.
In general, removing a regularisation technique results in a higher WER. However, AUCs based on Softmax scores may be even higher when the ASR
system becomes worse by excluding a regularisation technique during training. For example,
by removing label smoothing, the AUC is unexpectedly better than the baseline. This shows
that although Softmax probabilities can be directly used as confidence scores, they can be
heavily affected by regularisation techniques. Ideally, confidence estimation should perform
well regardless of the specific training procedure. Since confidence estimation is only an
auxiliary task to the main ASR task, improved confidence estimators should not sacrifice ASR
WER performance.
To keep good ASR WER performance while having reliable confidence scores, a dedicated
CEM can be trained as in Section 5.2.2. During CEM training, more aggressive SpecAugment
(10 time masks with mask parameter T = 50) is used to increase the WER on the training
set for more negative training samples. For each utterance, 8-best hypotheses are generated
on-the-fly and are aligned with the reference to obtain the binary training targets. The CEM
only has one fully-connected layer with 256 units. The number of additional parameters is 0.4%
of the baseline AED model. A Piece-wise Linear Mapping (PWLM) (Evermann and Woodland, 2000a) is estimated on dev-clean/dev-other and then applied to test-clean/test-other, so that the confidence scores better match the token or word correctness. Since the PWLM is
monotonic, the NCE is boosted while the AUC remains unchanged as the relative order of the
confidence scores is unchanged. Table 5.3 reports the confidence metrics at the token level and
the word level.
                      AUC ↑           NCE ↑ (w/o PWLM)    NCE ↑ (w/ PWLM)
  token   Softmax     0.976/0.912     -0.195/0.131        0.166/0.172
          CEM         0.990/0.958      0.189/0.019        0.344/0.275
  word    Softmax     0.981/0.927     -0.180/0.139        0.269/0.195
          CEM         0.990/0.962      0.192/0.039        0.350/0.270

Table 5.3 Comparison of confidence scores between using Softmax probabilities and using the CEM on the baseline model. Piece-wise Linear Mapping (PWLM) was estimated on the dev sets and applied on the test sets to improve the NCE metric. The first row corresponds to the baseline in Table 5.2.
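The PWLM itself can be estimated with simple binning on a development set. The sketch below is one possible realisation, using equal-width bins and forcing the mapping to be non-decreasing so that the relative ordering, and hence the AUC, is preserved; the exact mapping used in this work may differ.

import numpy as np

def fit_pwlm(dev_conf, dev_correct, num_bins=20):
    # Map raw confidence scores to the empirical probability of being correct,
    # estimated per confidence bin on the dev set (both inputs are NumPy arrays).
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    centres, accuracies = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dev_conf >= lo) & (dev_conf < hi)
        if mask.any():
            centres.append((lo + hi) / 2.0)
            accuracies.append(dev_correct[mask].mean())
    accuracies = np.maximum.accumulate(accuracies)   # enforce a monotonic mapping
    return lambda conf: np.interp(conf, centres, accuracies)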
The word-level confidence is the average of the token-level ones if a word consists of
multiple tokens. The AUC is improved at both token and word levels by using the CEM. Unlike
the use of Softmax probabilities, the NCE values are all positive for the CEM. After PWLM,
the CEM yields much higher NCE values than when using Softmax probabilities. The AUC
values given in Table 5.3 do not show the whole picture. As shown in Figure 5.3, the P-R curves
of Softmax and CEM are drastically different. A sharp downward spike at the high-confidence
region in Figure 5.3a corresponds to a low precision and a low recall. In other words, the
Softmax probabilities are overconfident for some incorrect tokens, which also explains the
spike shown in Figure 5.1. The CEM, however, suffers little from overconfidence as Figure 5.3b
depicts the desired trade-off between precision and recall. Overall, the CEM is a more reliable
confidence estimator for both the AUC and NCE metrics.
(a) Softmax  (b) CEM
Figure 5.3 Precision-recall curves for token-level confidence scores on LibriSpeech test-clean
and test-other sets.
For AED-based ASR models, shallow fusion of a Language Model (LM) (Gülçehre et al., 2015)
is commonly used to improve ASR performance during decoding. The effect of shallow fusion
on confidence estimation was investigated. The LM used was a three-layer LSTM network
with width 4096 trained on the LibriSpeech LM corpus, which shares the same word-piece
vocabulary as the AED model. To use LM information for confidence estimation, the input to the CEM was extended with the LM probability of the current token. The other aspects of the
setup are the same as used in Section 5.2.2.
Table 5.4 shows the word-level confidence scores after PWLM for both Softmax probabili-
ties and the CEM. Although WERs on test-clean and test-other decreased by 8∼9% relative,
there is no clear improvement in AUC and there is even a substantial degradation in NCE. Comparing the first and second blocks of Table 5.4, the CEM improves the quality of confidence
estimation even more noticeably when an additional LM is used. The contrast of P-R curves
between Softmax and CEM with LM shallow fusion is similar to Figure 5.3.
Table 5.4 ASR and word-level confidence performance for models with and without Recurrent
Neural Network Language Model (RNNLM) shallow fusion (with PWLM).
5.2.5 Analysis
5.2.5.1 Generalisation to a Mismatched Domain
Since the CEM is a model-based approach and the training data for the CEM is the same
as for the ASR model, the CEM is naturally more confident on the training set. Although
the mismatch between training and test for the CEM is mitigated by having more aggressive
augmentation during training and applying a PWLM estimated on dev sets during testing,
it is still unclear how well the confidence scores from the CEM generalise to data from a
mismatched domain. The Wall Street Journal (WSJ) corpus (Paul and Baker, 1992) is a dataset
of clean read speech of news articles and is in a moderately mismatched domain compared
to LibriSpeech in terms of the speaker, style and vocabulary. In Table 5.5, the WSJ eval92
test set was fed into the same setup as in Section 5.2.4.3, where all models were trained on
LibriSpeech.
Table 5.5 ASR and confidence performance on WSJ eval92 with a PWLM. The PWLM was
estimated on LibriSpeech dev-other set. The LM used was trained on the LibriSpeech LM
corpus as in Section 5.2.4.3.
Similar to the observations in Table 5.4, shallow fusion worsens the confidence estimation
by Softmax probabilities despite reduced WER. The CEM improves the quality of confidence
estimation considerably with or without an LM.
As mentioned in Section 5.1, confidence scores are widely used to select unlabelled data for
semi-supervised learning in order to improve ASR performance. First, a speech recogniser
is trained using the limited transcribed data. Then the recogniser transcribes the unlabelled
data, and the resulting automatic transcriptions can be used as noisy labels to train the existing model further. However, erroneous automatic transcriptions can hurt the model. If confidence scores can reflect the WER well,
filtering out utterances with low confidence can be beneficial to semi-supervised training.
Similar to Figure 5.1 and plots used for semi-supervised learning (Park et al., 2020),
Figure 5.4 shows the WER of the filtered utterances whose confidence scores are above the
corresponding threshold. If confidence scores strongly correlate with WER, a higher threshold
will filter a subset with lower WER. In Figure 5.4a, sharp spikes at the high confidence threshold
region clearly indicate overconfidence based on Softmax probabilities. In contrast, curves for
all three test sets in Figure 5.4b are monotonically decreasing (i.e. without spikes), which shows
that confidence scores from the CEM match WER more closely.
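Plots such as Figures 5.1 and 5.4 only require per-utterance confidence scores and error counts. A possible sketch is given below, where utt_errors is assumed to hold the number of word errors of each utterance against its reference.

import numpy as np

def filtered_wer(utt_conf, utt_errors, utt_ref_words, threshold):
    # WER of the subset of utterances whose confidence is above the threshold.
    keep = np.asarray(utt_conf) >= threshold
    if not keep.any():
        return float("nan")
    return np.asarray(utt_errors)[keep].sum() / np.asarray(utt_ref_words)[keep].sum()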
(a) Softmax  (b) CEM (curves for test-clean, test-other and WSJ-eval92)
Figure 5.4 WERs of filtered utterances w.r.t. confidence thresholds for Softmax and CEM with
LM shallow fusion.
5.3 Residual Energy-Based Model

As a discriminator between correct and erroneous hypotheses, R-EBMs can also produce
utterance-level confidence scores for AED models. Some downstream tasks only require
utterance-level confidence scores, such as data selection for semi-supervised learning (Park
et al., 2020; Zhang et al., 2020), and hypothesis-level model combination (Qiu et al., 2021b).
Compared to token or word level confidence, direct modelling of utterance-level confidence
scores implicitly takes deletion errors into account, and does not require calibration of the
confidence scores (e.g. using PWLM) before taking the average for utterance-level scores.
Previously, including the CEM in Section 5.2.2, various model-based methods have been used
for confidence estimation for AED models (Kumar et al., 2020; Oneata et al., 2021; Qiu et al.,
2021a,b; Woodward et al., 2020), and N -best re-ranking models (Li et al., 2019d; Ogawa et al.,
2018; Sainath et al., 2019; Variani et al., 2020a) have been proposed to improve WER. The
R-EBM is a single model that can improve both speech recognition WER performance and
utterance-level confidence estimation performance at the same time.
ASR models the conditional distribution of the text sequence h given the input acoustic
sequence O. For AED models, the model distribution can be expanded using the chain rule,
P(h|O) = P(h_1 | O) ∏_{i=2}^{L} P(h_i | h_{1:i−1}, O),  (5.8)

where L is the number of tokens in h. An R-EBM augments this locally normalised distribution with a residual energy term, which gives the joint distribution

P_θ(h|O) = P(h|O) exp(−E_θ(O, h)) / Z_θ(O),  (5.9)

where P_θ is the joint model, E_θ is the residual energy function, and Z_θ is the partition function for the energy-based model, which can be computed as

Z_θ(O) = Σ_h P(h|O) exp(−E_θ(O, h)).  (5.10)
Figure 5.5 Schematic of an R-EBM for an AED model. The baseline AED model in the shaded
box is fixed during R-EBM training.
In principle, the R-EBM itself can be any model that takes a pair of acoustic sequence O
and hypothesis sequence h to produce a scalar value (−Eθ (O, h)). Here, the R-EBM has a
feature extractor similar to the CEM. For each token in a hypothesis, it takes features including
the current decoder hidden state, the acoustic context vector from the attention mechanism, the
output token embeddings, and the top-K Softmax probabilities at each output step (Li et al., 2021d). Then a sequence model transforms these features and performs pooling over the hidden
representations of the whole sequence. The pooled representation is then passed to the output
layer that uses the sigmoid activation function. Since R-EBMs operate at the utterance level,
bi-directional models can be used.
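A possible PyTorch sketch of such an R-EBM is shown below; the bi-directional LSTM encoder, the mean pooling and the layer sizes are illustrative choices rather than the exact configuration used in the experiments.

import torch
import torch.nn as nn

class ResidualEBM(nn.Module):
    # Maps the per-token features of a hypothesis to a single scalar -E_theta(O, h);
    # sigmoid(-E_theta(O, h)) then serves as an utterance-level confidence score.
    def __init__(self, feat_dim, hidden_dim=512, num_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                               batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_feats):
        # token_feats: [batch, num_tokens, feat_dim] gathered from the frozen AED model
        # (decoder state, attention context, token embedding, top-K probabilities).
        hidden, _ = self.encoder(token_feats)
        pooled = hidden.mean(dim=1)                # pooling over the whole hypothesis
        return self.output(pooled).squeeze(-1)     # negative energy -E_theta(O, h)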
5.3.2.2 Training
For a system with a token vocabulary size V and a maximum output sequence length L, the
partition function quickly becomes intractable as the summation is over V L possible sequences.
With the baseline auto-regressive model P fixed, noise contrastive estimation for conditional
models (Ma and Collins, 2018) can be used to train the R-EBM where the noise distribution is
the auto-regressive ASR model P . The loss can be expressed as
L(θ) = E_O [ E_{h+ ∼ P_data(·|O)} log( 1 / (1 + exp(E_θ(O, h+))) ) + E_{h− ∼ P(·|O)} log( 1 / (1 + exp(−E_θ(O, h−))) ) ],  (5.11)
where h+ are positive samples from the data distribution and h− are negative samples from
the noise distribution. For ASR, the noise samples are the N -best hypotheses from the auto-
regressive model via beam search, and the positive samples are the corresponding ground truth
transcriptions. Thus, the R-EBM is effectively a discriminator between the incorrect set of
sequences H− and correct set of sequences H+ , trained using the binary cross-entropy loss,
L(O; θ) ≈ (1/|H+|) Σ_{h ∈ H+} log( 1 / (1 + exp(E_θ(O, h))) ) + (1/|H−|) Σ_{h ∈ H−} log( 1 / (1 + exp(−E_θ(O, h))) ),  (5.12)
where H+ ∪ H− = {w, BeamSearch(O, N)}. Note that there may be more than one element in H+, since multiple tokenisations of the ground truth using sub-word units may exist or the ground truth may be among the N-best hypotheses. Finally, the parameters of the R-EBM can be estimated by optimising the binary classifier over all the utterances in the entire dataset.
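In other words, Equation (5.12) is a binary cross-entropy over the negative energies of the positive and negative sequences. A sketch using the ResidualEBM above is given below; minimising this loss is equivalent to maximising the objective in Equation (5.12).

import torch
import torch.nn.functional as F

def r_ebm_loss(neg_energy_pos, neg_energy_neg):
    # neg_energy_pos / neg_energy_neg hold -E_theta(O, h) for hypotheses in H+ and H-.
    pos_loss = F.binary_cross_entropy_with_logits(
        neg_energy_pos, torch.ones_like(neg_energy_pos))    # push sigmoid(-E) towards 1
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_energy_neg, torch.zeros_like(neg_energy_neg))   # push sigmoid(-E) towards 0
    return pos_loss + neg_loss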
5.3.2.3 Inference
For a given utterance during inference, the log-likelihood of a hypothesis and the negative energy value of the hypothesis are added to obtain the joint score as in Equation (5.13),

ĥ = argmax_h [ log P(h|O) − E_θ(O, h) ].  (5.13)

With a shared partition function, the N-best hypotheses can be re-ranked based on the joint scores to yield the best candidate. In practice, a variant of Equation (5.13) is used and will be
described in Section 5.3.4.
From another perspective, the R-EBM is a binary classifier that learns to assign scores
close to 1 for correct hypotheses and 0 for erroneous ones, which is also the objective for
utterance-level confidence scores (Kumar et al., 2020). Confidence scores can be used to
automatically assess the quality of transcriptions of ASR systems. For applications where
utterance-level confidence scores are also required, the R-EBM can be used to achieve two aims.
The pre-sigmoid values of R-EBM (−Eθ (O, h)) can be used to rerank the N -best hypotheses
to lower the word error rate of the ASR system while the post-sigmoid values can be used as a
model-based confidence measure.
The data used for the following experiments is the same as that used in Section 5.2.3. One
difference is that the modelling units used here were a set of 1024 word-pieces (Schuster and
Nakajima, 2012) derived from the LibriSpeech 100h training transcriptions.
5.3.3.2 Models
The baseline AED model was also the same as that used in Section 5.2.3, with a total number
of parameters of 145 million. The LM has 2 uni-directional LSTM layers with 1024 units in
each layer. Shallow fusion (Gülçehre et al., 2015) was used for decoding and for generating the
N -best hypotheses for R-EBMs. The hyper-parameters for the AED models, the LM and beam
search were tuned on the dev sets.
The R-EBMs were trained with the baseline ASR model fixed. The N -best hypotheses
of the training set were generated on-the-fly with a beam size N and random SpecAugment
masks. For the time masks, instead of 2 masks with a maximum of 40 frames per mask for the
baseline ASR model, 10 masks with a maximum of 50 frames per mask were used for R-EBM
training. This is to simulate the errors made by the model during inference and the randomness
of masks allows diverse errors to appear during training. The WER on the augmented training
set should ideally match that of the dev set.
After beam search, the N -best hypotheses are determined by keeping the N terminated hy-
potheses with the highest sequence-level log-likelihood. However, this criterion may favour
shorter hypotheses when finding the 1-best. Therefore, normalising the log-likelihood by the
number of tokens in each hypothesis results in a slightly lower WER as shown in Table 5.6.
Length Normalisation (LN) becomes more important when N is large. When combining the
log-likelihood score log P (h|O) with the negative energy score −Eθ (O, h) for the joint score,
an interpolation coefficient α is tuned on dev sets to minimise the WER and accommodate
potentially different numerical ranges as in Equation (5.14).
ĥ = argmax_h [ log P(h|O) / |h| − α E_θ(O, h) ].  (5.14)
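Equation (5.14) amounts to a simple rescoring of the N-best list. A sketch is given below, assuming the per-hypothesis log-likelihoods, negative energies and token counts have already been collected.

import numpy as np

def rerank_nbest(log_likelihoods, neg_energies, num_tokens, alpha):
    # Return the index of the hypothesis maximising Equation (5.14):
    # length-normalised log-likelihood plus alpha times the negative energy.
    scores = (np.asarray(log_likelihoods) / np.asarray(num_tokens)
              + alpha * np.asarray(neg_energies))
    return int(np.argmax(scores))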
Table 5.6 Impact on WER (%) by applying Length Normalisation (LN) for the log-likelihood
score on dev-clean/dev-other sets. The joint score is a linear combination of the log-likelihood
and R-EBM score. The R-EBM is a uni-directional LSTM and the beam size is 8.
Table 5.6 shows that ranking the N -best hypotheses just using the R-EBM scores reduces
the WER compared to 1-best WER without LN. After log-linear interpolation, the WER of the
joint model is lower than the 1-best results with or without LN. Therefore, all the following
experiments will use LN.
As a globally normalised model, an R-EBM can be bi-directional and take advantage of the full
utterance context in the hypotheses. Table 5.7 compares the performance of uni-directional
and bi-directional R-EBMs. Both R-EBMs have two layers of LSTMs with 512 units in each
direction. Since the bi-directional model performs slightly better, all the following experiments
will use bi-directional LSTMs for R-EBMs.
Table 5.7 Comparison of WERs (%) by using uni-directional and bi-directional LSTMs for
R-EBMs on dev-clean/dev-other sets. The beam size for decoding is 8.
The performance ceiling when reranking with either the joint log-linear interpolation score or the R-EBM score alone is the oracle WER of the N-best hypotheses. In this section, the lengths of the N-best lists range from 4
to 32 for both training and inference. Table 5.8 shows that oracle WERs improve consistently
when larger N -best lists are used whereas the 1-best WERs only have a minor reduction.
With more hypotheses available, the joint WER decreases substantially, as does the WER when ranking by R-EBM scores alone. The last column in Table 5.8 shows that the Relative Word Error Rate Reduction (WERR) of the joint model over the 1-best steadily increases with larger N-best lists, which indicates that the gain from the R-EBM outpaces that of the 1-best. With 32-best,
8.2%/6.8% WERRs were obtained for dev-clean/dev-other sets. The top 32-best will be used
for all the following experiments.
Table 5.8 WERs of the ASR model and joint models with various numbers of hypotheses used
for training and inference.
Based on previous experimental results on dev sets, bi-directional LSTMs with 32-best hypothe-
ses and length normalisation are used as the best setup. Applying the log-linear combination
coefficients tuned on dev-clean/dev-other sets, the results on two test sets were generated and
shown in Table 5.9. The AUC of the P-R curve is used as the metric for confidence estima-
tion. Also in Table 5.9, the second row corresponds to the CEM, which predicts a confidence
score for each token in the hypothesis sequence using the same input features as R-EBMs.
By averaging the token-level scores¹, utterance-level scores can be obtained and combined with the baseline model score. Note that although the CEM yields improved confidence at token and word
¹ For each hypothesis, the mean of the pre-sigmoid logits of all tokens is used for N-best reranking, whereas the mean of the post-sigmoid confidence scores of all tokens is used for confidence evaluation.
levels as in Li et al. (2021d), the utterance-level confidence may under-perform the baseline
log-likelihood score. Since the R-EBM is directly optimised for the utterance-level confidence,
issues such as multiple tokenisations for the same word or sequence and deletion errors are
addressed implicitly during training. As a result, the R-EBM reduces WERs and improves
utterance-level confidence at the same time.
              test-clean               test-other
              WER ↓      AUC ↑         WER ↓      AUC ↑
  baseline    5.61       0.684         18.68      0.529
  + CEM       5.59       0.697         18.44      0.501
  + R-EBM     5.15       0.770         17.42      0.679

Table 5.9 Recognition and confidence performance on LibriSpeech test sets. The average of token-level scores from the CEM is used as utterance-level scores. For both the CEM and R-EBM, the best interpolation coefficients are tuned on the dev sets.
5.3.4.5 Scalability
This section investigates the situation when the auto-regressive ASR model has seen much
more data such that the baseline WER is far lower. In this set of experiments, the encoder of
the ASR model is first initialised with the pre-trained wav2vec 2.0 (w2v2) model (Baevski
et al., 2020b) trained on 57.7 thousand hours of unlabelled speech data from Libri-light (Kahn
et al., 2020) and then fine-tuned on the same amount of labelled data (“train-clean-100”) as
before2 . Although the WERs in Table 5.10 are much lower than in Table 5.9, the joint model
further yields 5.3%/4.4% WERRs on test-clean/-other. Table 5.10 also shows that R-EBMs can
substantially boost confidence estimation performance.
              test-clean               test-other
              WER ↓      AUC ↑         WER ↓      AUC ↑
  w2v2        2.63       0.786         4.74       0.684
  + R-EBM     2.49       0.928         4.53       0.890

Table 5.10 Recognition and confidence performance on LibriSpeech test sets when the encoder is initialised using a pre-trained w2v2 model as a stronger baseline.
² The implementation follows Zhang et al. (2020), which shows state-of-the-art performance on LibriSpeech.
5.3.5 Analysis
5.3.5.1 Relative Improvement by Utterance Length
Figure 5.6 shows the breakdown of the WERR for the joint model and the oracle hypotheses
with respect to the number of words in the reference sequences. The baseline ASR model is
used for this analysis. The oracle WERR, i.e. the maximum possible WERR given the N -best
lists, is lower for longer utterances, as the number of alternatives per word is smaller for a given number of top hypotheses. The general trend of the joint WERR follows the trend of the oracle
WERR except for short utterances (1-8 words in reference). We hypothesise that R-EBMs may
need more global context information to give a higher WER reduction.
Figure 5.6 Relative WER reductions (%) of the joint model (left) and the oracle (right) with respect to the number of words in the reference sequence, on test-clean and test-other.
If the joint model Pθ matches the data distribution Pdata better, then statistics computed on
a large set of samples from the two distributions should also match (Baevski et al., 2020b).
Figure 5.7 shows the density plot of the log-likelihood scores (left) and the joint model scores
(right) on test-other set. The red lines correspond to the score distributions of the ground
truth transcriptions. The distribution of log-likelihood scores of the best hypotheses from the
auto-regressive model does not match the data distribution well. However, the distribution from
the joint model is much closer to the data distribution.
Figure 5.7 Density plot of log-probability scores using the baseline model (left) and the joint
model (right) on test-other set.
Figure 5.8 The system schematic of CEM / R-EBM for confidence estimation. The pooling
layer is only used in R-EBM.
As illustrated in Figure 5.8, for each token in a hypothesis, the feature extractor takes the attention context, the decoder state and the output distribution from the AED model; the corresponding output distribution from an LM can also be included as additional features (in
red). These features are then passed to a sequence model such as a recurrent neural network or
a self-attention network. For the CEM, the hidden representation for each token is projected
to a scalar and then mapped to a confidence score between 0 and 1 by the sigmoid function.
For the R-EBM, a pooling layer reduces the hidden representations of the entire sequence to a
single representation. A projection layer with a sigmoid activation produces utterance-level
confidence scores. During training, N -best hypotheses are generated from the fixed ASR
model and the binary training targets are obtained by aligning hypotheses with the ground truth
transcription using an edit distance.
As shown in Sections 5.2.4 and 5.3.4, using the CEM and R-EBM can provide much more
reliable confidence scores than the Softmax scores obtained directly from the ASR model.
Although training data is augmented more aggressively during CEM / R-EBM training than
ASR training to avoid over-confidence, the training data used for confidence estimation is often the same as that used for the ASR model. A useful confidence estimator should not only perform
well on in-domain data, but also generalise well to OOD data without modifying the ASR
model. Assuming that some unlabelled OOD data is available, the following two methods
are proposed to improve the OOD confidence scores: pseudo transcriptions and an additional
language model.
One approach to exposing confidence estimators to OOD acoustic data is to include it in the training process (i.e. feeding the confidence estimator OOD hypotheses, as in Figure 5.8).
To this end, the existing AED model can be used to transcribe unlabelled OOD data to give
pseudo transcriptions without any data augmentation. During CEM/R-EBM training, N -
best hypotheses are generated on-the-fly with data augmentation, and are then aligned with
the pseudo transcription to produce binary confidence targets. Note that N -best hypotheses
are nearly always erroneous w.r.t. the pseudo transcription because beam search is run with
augmented input acoustic data. The effectiveness of this approach depends on the similarity
between the in-domain and OOD data. If the OOD data has, for example, very mismatched
acoustic conditions and a different speaking style, the quality of pseudo transcriptions may be
poor, which gives misleading training labels for confidence estimators.
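Schematically, this procedure only changes where the reference comes from. In the sketch below, asr_decode and align_targets are hypothetical wrappers around the frozen AED decoder and the edit-distance alignment described in Section 5.2.2; the sketch is illustrative and not the pipeline used in the experiments.

def ood_confidence_targets(ood_utterance, asr_decode, align_targets, nbest=8):
    # 1. Pseudo reference: 1-best decode of the clean, un-augmented OOD audio.
    pseudo_ref = asr_decode(ood_utterance, augment=False, nbest=1)[0]
    # 2. Training hypotheses: N-best decode with aggressive SpecAugment applied.
    hyps = asr_decode(ood_utterance, augment=True, nbest=nbest)
    # 3. Binary confidence targets by aligning each hypothesis with the pseudo reference.
    return [(hyp, align_targets(hyp, pseudo_ref)) for hyp in hyps]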
Language models have been an important source of information for confidence estimation (Jiang,
2005; Ragni et al., 2022; Yu et al., 2011). In CEM/R-EBM, an in-domain LM is normally used.
However, the difference in speaking style can lead to very different linguistic patterns. For
example, comparing audiobooks with telephone conversations, the vocabularies used are dis-
tinct, as telephone conversations are generally much more casual and spontaneous. Therefore,
it may be useful to leverage the OOD text data to train an additional LM, which can provide
additional features for confidence estimators (i.e. feeding both in-domain and OOD LM output
distribution to the feature extractor). If either an in-domain LM or an OOD LM has a high
probability for a hypothesis token, then the token is more likely to be recognised correctly.
The in-domain speech data used in this section is from audiobooks. For training, there are
57.7 thousand hours of unlabelled speech data from Libri-light (Kahn et al., 2020) (“unlab-60k”
subset) for unsupervised pre-training and 100 hours of transcribed data from LibriSpeech for
fine-tuning. The text data for language modelling has around 810 million words. The standard
dev and test sets are used for in-domain development and evaluation (see Section A.2).
Two out-of-domain datasets were used in the experiments. The TED-LIUM release 3 (TED)
corpus (Hernandez et al., 2018) contains 452 hours of talks, which are in a somewhat different
domain to audiobooks. Although the talks are mostly well prepared, the speaking style is more
casual and the language is sometimes colloquial. The text data to train LMs has around 255
million words (Rousseau et al., 2014). The standard dev and test sets from the TED dataset were
used. Another out-of-domain dataset is Switchboard (SWB) (see Section A.3). The text data
for language modelling is the combination of the SWB transcriptions and Fisher Transcriptions.
The Hub5’00 set was used as the development set and RT03 used as the evaluation set. Since
all SWB data was collected at 8 kHz, it is upsampled to 16 kHz before processing by the AED
model.
5.4.3.2 Models
The AED model follows the “Conformer XL” setup in (Zhang et al., 2020) with 24 Conformer
layers and has around 600M parameters. w2v2 (Baevski et al., 2020b), an unsupervised pre-
training method, was used to pre-train the encoder using the unlabelled in-domain Libri-light
data. The decoder is a single-layer LSTM network with 640 units. The randomly initialised
decoder was fine-tuned jointly with the encoder initialised from w2v2 using LibriSpeech 100h
labelled data. Regularisation techniques such as SpecAugment (Park et al., 2019), dropout,
label smoothing, Gaussian weight noise and exponential moving average were used to improve
performance. The modelling units were a set of 1024 word-pieces (Schuster and Nakajima,
2012) derived from the LibriSpeech 100h training transcriptions.
The model-based confidence estimators, i.e. CEM and R-EBM, were trained on the Lib-
riSpeech 100h dataset while freezing the AED model. 8-best hypotheses were generated
on-the-fly with more aggressive SpecAugment masks to simulate errors on unseen data (Li
et al., 2021f). Both the CEM and R-EBM have two-layer bi-directional LSTMs with 512
units in each direction. Since the direct output of the CEM is the token-level confidence, the
word-level confidence is obtained by taking the minimum among all the tokens per word.
The simplest baseline for confidence estimation is to directly use the Softmax probability from
the decoder output distribution. As discussed in Section 5.2.4, the Softmax-based confidence
scores can be severely impacted by the regularisation techniques used during training. In
Tables 5.11 and 5.12, the performance of the ASR model (in WER and Sentence Error Rate
(SentER)) and the corresponding confidence performance (in AUC and EER) using both
the Softmax probabilities and model-based confidence estimators are given. As shown in
Tables 5.11 and 5.12, having a dedicated confidence estimator, at either the word level using the CEM or the utterance level using the R-EBM, substantially improves the confidence metrics over the Softmax baseline on the in-domain LibriSpeech test sets.
                              Softmax              CEM
  dataset            WER      AUC↑     EER↓        AUC↑     EER↓
  LS (test-clean)    2.7      99.29    21.59       99.64    16.40
  LS (test-other)    4.9      98.79    19.93       99.40    15.39
  TED (test)         9.4      96.49    24.07       98.78    18.28
  SWB (RT03)         28.3     92.88    20.15       96.45    18.59

Table 5.11 Baseline WERs (%) and word-level confidence estimation performance.
                              Softmax              R-EBM
  dataset            SentER   AUC↑     EER↓        AUC↑     EER↓
  LS (test-clean)    31.4     77.63    41.97       91.68    21.53
  LS (test-other)    45.2     67.84    38.63       88.05    18.43
  TED (test)         72.1     33.57    47.66       71.24    23.17
  SWB (RT03)         82.7     34.91    36.59       52.31    28.35

Table 5.12 Baseline SentERs (%) and utterance-level confidence estimation performance.
On the two OOD sets, the benefit of using the CEM or R-EBM is also substantial. However,
the relative improvement of confidence estimation performance by using a confidence module
diminishes as the data becomes increasingly dissimilar to the in-domain data. For example, in
Table 5.12, the AUC on TED increased from 33.57% to 71.24% whereas the AUC on SWB only
increased from 34.91% to 52.31%. The EER on TED is reduced by more than half whereas the
EER on SWB is only reduced by 23% relative. This is expected as the confidence module is
trained only on in-domain data.
Based on the previous observations, this section explores various techniques that use the unla-
belled OOD data to improve the confidence estimator while keeping the in-domain performance
unchanged. As described in Section 5.4.2, the pseudo transcriptions on the OOD data can
be used to train confidence estimators and additional features from an LM trained on OOD
text may also be useful. The OOD data with pseudo transcriptions was mixed with in-domain
data in a 1:9 ratio for each minibatch. Table 5.13 shows the results when incorporating this
additional OOD information on the TED dataset.
pseudo   LM    word-level        utterance-level
               AUC↑    EER↓      AUC↑    EER↓
  –       –    98.78   18.28     71.24   23.17
  ✓       –    98.85   16.41     74.89   20.05
  –       ✓    98.73   18.56     73.51   21.97
  ✓       ✓    98.85   16.08     75.70   19.21

Table 5.13 Word and utterance-level confidence estimation performance in AUC (%) and EER
(%) on the TED dataset with additional OOD information for the CEM and R-EBM.
Compared with the first row of Table 5.13 (i.e. CEM and R-EBM baselines in Tables 5.11
and 5.12), using pseudo transcriptions can effectively improve AUC and reduce EER. Although
the improvement brought by additional OOD LM features is smaller, using both the pseudo
transcription and OOD LM features yields the best confidence estimator. A similar observation
is also made on the SWB dataset, which will be omitted here. Therefore, both pieces of OOD
information will be used for the following experiments.
To further validate the effectiveness of the two proposed approaches, the OOD acoustic
data is included during pretraining. This is a more challenging setup because the confidence
performance baseline will improve as pretraining can effectively reduce the WERs on OOD
data. After continued training of the w2v2 model on a mixture of in-domain and OOD data with
a reduced learning rate, the new encoder is then fine-tuned on the LibriSpeech 100h dataset.
The results are shown in Table 5.14.
The final WERs on the in-domain dataset are very similar to the baseline ASR model
(within ±0.1%), but the WER is reduced by 10.6% relative on TED and 14.5% relative on
SWB. This result shows that by having unlabelled OOD data during pre-training of the ASR
model, the WER on OOD data can be greatly reduced. An improved ASR model generally
suggests that the quality of the confidence scores is also better. By comparing the confidence
metrics before and after pretraining with OOD data (for each dataset, compare the first row in
Table 5.14a and the first row in Table 5.14b), the AUCs are generally better with pretraining.
However, EERs can be higher for the better ASR model because the EER only represents a
single operating point whereas the AUC presents the overall picture of the confidence estimator
at all operating points. With the stronger baseline (for each dataset, compare the first and second
rows in Table 5.14b), by including the OOD information for CEM or R-EBM, the confidence
quality is consistently improved. This observation shows that even when the encoder has been
exposed to OOD data during pretraining, it is still very useful to add the two pieces of OOD
information during the training of confidence estimators.

                              w/ OOD pretraining
dataset       WER   OOD info   word-level        utterance-level
                               AUC↑    EER↓      AUC↑    EER↓
TED (test)     8.4      –      98.87   17.55     74.00   22.48
                        ✓      99.06   15.56     75.90   20.37
SWB (RT03)    24.2      –      96.80   19.00     58.54   25.59
                        ✓      97.54   16.83     61.75   23.07

Table 5.14 Confidence metrics in AUC (%) and EER (%) after using CEM & R-EBM with
additional OOD information on TED & SWB. After pretraining with OOD data, ASR models
have nearly the same performance on in-domain data but lower WERs on OOD data. This is a
more challenging setup for making improvements on confidence estimators as the encoder has
been exposed to OOD data.
5.4.5 Analysis
5.4.5.1 Word-Level Confidence Calibration
Confidence metrics such as AUC and EER are only influenced by the rank ordering of the
confidence scores, but not their absolute values. However, well-calibrated word-level confidence
scores can be important for some downstream applications. In other words, the absolute value
of the confidence score should ideally reflect the probability of the word being recognised
correctly. Two commonly used metrics for evaluating calibration performance are NCE (Siu
et al., 1997) and Expected Calibration Error (ECE) (Guo et al., 2017). ECE is computed as the
average gap between the empirical accuracy and the predicted confidence after binning all words
into M buckets, i.e.
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|, \tag{5.15}$$
where B_m denotes the set of words falling within the m-th bin when the words are ranked by their confidence scores, and N is the total number of words.
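For concreteness, Equation (5.15) can be computed with the following short sketch, assuming word-level confidence scores in [0, 1] and binary correctness labels obtained from the alignment with the reference:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=50):
    """Sketch of Equation (5.15): bin the N words by confidence score and
    average the absolute gap between the per-bin accuracy acc(B_m) and the
    per-bin mean confidence conf(B_m), weighted by |B_m| / N."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Assign each word to one of n_bins equal-width bins over [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```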
Table 5.15 Word-level NCE and ECE (%) on TED & SWB test sets.
Before computing NCE and ECE values, a PWLM (Evermann and Woodland, 2000b; Guo
et al., 2017) was estimated on the dev set and then applied to the test set. In this experiment,
the PWLM uses 5 linear segments and 50 bins, and the computation of ECE also uses 50 bins.
As shown in Table 5.15, on both OOD datasets, the word-level NCE and ECE are significantly
better than the Softmax or the CEM baseline after using OOD information during the training
of CEM.
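The PWLM step can be approximated with the rough sketch below; it bins the dev-set confidences, reads off per-bin accuracies and interpolates knot values for the linear segments. This is a simplification rather than the exact estimation procedure of Evermann and Woodland (2000b).

```python
import numpy as np

def fit_pwlm(dev_conf, dev_correct, n_segments=5, n_bins=50):
    """Rough sketch of estimating a piece-wise linear mapping (PWLM) from raw
    confidence to empirical word accuracy on the dev set."""
    dev_conf = np.asarray(dev_conf, dtype=float)
    dev_correct = np.asarray(dev_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centres, accs = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = dev_conf <= hi if hi >= 1.0 else dev_conf < hi
        mask = (dev_conf >= lo) & upper
        if mask.any():
            centres.append(0.5 * (lo + hi))        # bin centre
            accs.append(dev_correct[mask].mean())  # empirical accuracy in the bin
    # Knots of the linear segments; knot values are interpolated from the
    # per-bin accuracies (a crude but serviceable fit).
    knots_x = np.linspace(0.0, 1.0, n_segments + 1)
    knots_y = np.interp(knots_x, np.array(centres), np.array(accs))
    return knots_x, knots_y

def apply_pwlm(conf, knots_x, knots_y):
    """Map raw test-set confidences through the fitted piece-wise linear function."""
    return np.interp(conf, knots_x, knots_y)
```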
Although aggregating all word-level confidence scores can yield an utterance-level score,
Section 5.3.4 showed that the utterance-level confidence is more effectively modelled by an
R-EBM which directly optimises the utterance-level objective. The improved utterance-level
confidence scores can be readily used for data selection tasks. For active learning, utterances
with low confidence are normally selected for manual transcription. For semi-supervised
learning, utterances with high confidence are often included as additional training data because
the hypotheses can be used as high-quality pseudo transcriptions.
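A small illustrative sketch of this selection step is given below (the helper and its arguments are hypothetical, not part of the thesis tooling):

```python
def split_by_confidence(utt_ids, utt_scores, k=200):
    """Hypothetical helper: rank utterances by utterance-level confidence and
    return the k lowest-scoring ones (candidates for manual transcription in
    active learning) and the k highest-scoring ones (candidates for
    pseudo-labelled semi-supervised training)."""
    order = sorted(range(len(utt_ids)), key=lambda i: utt_scores[i])
    lowest = [utt_ids[i] for i in order[:k]]
    highest = [utt_ids[i] for i in order[-k:]]
    return lowest, highest
```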
In Table 5.16, the SentERs of the 200 utterances with the lowest and the highest confidence
scores from each dataset are reported. As expected, the R-EBM with additional OOD information
during training can effectively filter utterances in either regime. In particular, for the
high-confidence utterances, OOD information is very helpful in reducing the SentER of the
utterances selected by the R-EBM.

Table 5.16 SentERs (%) of the bottom and top 200 utterances of the TED & SWB test sets filtered
using different confidence estimators.
and erroneous hypotheses. The R-EBM is globally normalised as it learns from the residual
error of the locally normalised model and complements the locally normalised AED models.
Experiments showed that the R-EBM was a very effective utterance-level confidence estimator
while reducing speech recognition error rates, even on top of a state-of-the-art model trained
using w2v2. Further analysis showed that the performance of an R-EBM may depend on
the amount of context, and confirmed that the R-EBM closed the gap between the model
distribution and the data distribution. Because deletion prediction and confidence estimation at
different levels are very similar, a multi-task training framework can be used to estimate the
word-level confidence scores, utterance-level confidence scores, and the number of deletion
errors jointly (Qiu et al., 2021a).
Both the CEM and R-EBM are model-based and their confidence estimation performance on
OOD data is of particular interest. Experiments showed that although model-based confidence
scores were more reliable than Softmax probabilities from AED models as confidence estimators
for both in-domain and OOD data, the performance on OOD data lagged far behind the in-
domain scenario. To this end, two approaches were investigated. Using pseudo transcriptions to
provide binary targets for training model-based confidence estimators and including additional
features from an OOD LM were both useful for improving the confidence scores on OOD
datasets. By exposing the CEM or R-EBM to OOD data, the word-level calibration performance
was also significantly improved. Selecting OOD data using the improved confidence estimators
is expected to aid active or semi-supervised learning.
Chapter 6

Attention-Based Encoder-Decoder Models for Speaker Diarisation
Attention-Based Encoder-Decoder (AED) models have been widely applied to solve Automatic
Speech Recognition (ASR) and other sequence-to-sequence tasks and the attention mechanism
can flexibly handle variable input and output sequence lengths. For each decoding step, the
prediction depends on the context from the encoder given by the attention mechanism and
the decoder state that represents the full output history. AED models require paired input
and output sequences for supervised training, e.g. acoustic sequences and the corresponding
transcriptions for ASR. This chapter explores the possibility of using AED models in a speech
processing task called speaker diarisation (Anguera et al., 2012; Tranter and Reynolds, 2006).
Speaker diarisation aims to identify “who spoke when” in conversations. The diarisation
pipeline normally has multiple stages (Anguera et al., 2012; Moattar and Homayounpour, 2012;
Park et al., 2022). First, the non-speech components in the audio recording are stripped and the
remainder is divided into short segments such that each segment only has one active speaker.
Then a speaker representation is extracted for each segment. Finally, a clustering algorithm
is used to determine the number of speakers in the whole recording and also which segments
belong to the same speaker. Although there are other alternatives, this chapter adopts the above
procedure to mainly focus on the use of AED models for the final clustering stage.
Clustering is normally regarded as an unsupervised task but here Discriminative Neural Clus-
tering (DNC) is proposed which formulates clustering as a supervised sequence-to-sequence
learning problem with a maximum number of clusters (Li et al., 2021c)1 . Compared to tra-
ditional unsupervised clustering algorithms, DNC learns clustering patterns from training
data without requiring a pre-defined similarity measure such as the cosine distance between speaker
embeddings.

1 Note that this work was carried out in collaboration with Florian Kreyssig with equal contribution. In this chapter,
overlapped speech is not considered and is left as future work.
6.1 Background
This section first describes the speaker diarisation pipeline that includes audio segmentation,
speaker representation extraction and clustering. As Deep Neural Network (DNN)-based
approaches have been very effective for the first two stages (Park et al., 2022), alternative ap-
proaches to replacing unsupervised clustering methods with supervised ones are then introduced.
The differences between the proposed DNC and the related work are briefly highlighted.
6.1.1.1 Segmentation
The segmentation stage aims to obtain small segments of audio where each segment only
contains speech signals from a single speaker. This is normally performed in two sub-stages.
The first one is called Voice Activity Detection (VAD), which is a classifier that determines
Figure 6.1 A simplified diarisation pipeline (segmentation, speaker representation extraction,
clustering). The grey blocks after the segmentation stage denote non-speech. Speaker clusters
are colour coded.
whether each frame of the acoustic features (see Section 3.1.1) contains speech or not. DNN-
based VAD systems (Hughes and Mierle, 2013; Wang et al., 2016; Zhang and Wu, 2013) have
recently shown better performance over traditional methods using the zero-crossing rate, energy
constraints or a phone recogniser (Savoji, 1989; Sinha et al., 2005; Tranter et al., 2004).
After removing non-speech regions, Change Point Detection (CPD) can be applied to
split each speech region into smaller speaker-homogeneous segments. A metric-based ap-
proach (Chen, 1998; Kemp et al., 2000) was widely used, until DNN-based approaches demon-
strated better performance more recently (Gupta, 2015; Hrúz and Zajíc, 2017; India et al., 2017;
Sun et al., 2021a).
For each segment, a fixed-length speaker representation is needed to allow the clustering stage
to differentiate speakers. An unsupervised method based on joint factor analysis
in the total variability space, which generates an i-vector representation (Dehak et al., 2010),
is commonly used. The cosine distance between i-vectors is used as the distance metric for
speaker diarisation (Sell and Garcia-Romero, 2014; Senoussaoui et al., 2014; Shum et al., 2011).
More recently, DNN-based approaches have emerged as a more powerful alternative. Neural
networks are trained for a speaker classification task and the hidden vector from the penultimate
layer is often used as the speaker representation (Cyrta et al., 2017; Garcia-Romero et al.,
2017; Okabe et al., 2018; Shi et al., 2020; Sun et al., 2019; Sun et al., 2021a; Variani et al.,
2014; Wang et al., 2018; Wang et al., 2018; Yella and Stolcke, 2015; Zhu et al., 2018). The
end-to-end loss (Díez et al., 2019; Heigold et al., 2016; Wan et al., 2018) and the angular
Softmax loss (Deng et al., 2019; Fathullah et al., 2020; Huang et al., 2018; Liu et al., 2019;
Wang et al., 2018; Yu et al., 2019) have been proposed to improve the representations so that
they better match the clustering algorithm.
The clustering stage usually relies on the distance metric associated with the speaker representa-
tions. Popular unsupervised algorithms include agglomerative hierarchical clustering, k-means
clustering and spectral clustering (Dimitriadis and Fousek, 2017; Garcia-Romero et al., 2017;
Karanasou et al., 2015; Ning et al., 2006; Sell et al., 2018; Shum et al., 2013; Sun et al., 2019;
Wang et al., 2018). With the DNN-based speaker representations, spectral clustering that uses
the cosine distance is commonly used. It first computes the affinity matrix between the speaker
representations for each pair of segments. Then multiple operations are conducted to refine the
affinity matrix including Gaussian blur and thresholding. Eigen-decomposition is performed
on the refined matrix and the number of clusters is determined by the largest eigen-gap of its
eigen-values. Finally, k-means clustering is used to cluster the new segment representations
formed from the eigen-vectors corresponding to the largest eigen-values. Recently, graph
neural networks have been used to improve spectral clustering (Shaham et al., 2018; Wang
et al., 2020a).
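A simplified sketch of this clustering procedure, omitting the affinity refinement steps and using numpy and scikit-learn, might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, min_k=2, max_k=4):
    """Simplified sketch of the procedure above: cosine affinity ->
    eigen-decomposition -> eigen-gap to choose the number of clusters ->
    k-means on the leading eigen-vectors. The refinement steps (Gaussian
    blur, thresholding, etc.) are omitted."""
    # Cosine affinity between L2-normalised segment embeddings.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = x @ x.T

    # Eigen-decomposition; sort eigen-values in descending order.
    eigvals, eigvecs = np.linalg.eigh(affinity)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # The largest eigen-gap among the leading eigen-values determines k.
    gaps = eigvals[:max_k] - eigvals[1:max_k + 1]
    k = int(np.clip(np.argmax(gaps) + 1, min_k, max_k))

    # k-means on the new representations formed by the leading eigen-vectors.
    return KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
```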
The performance of a speaker diarisation system is measured by the Diarisation Error Rate
(DER), which is the sum of three types of errors,
$$\mathrm{DER} = \frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker confusion}}}{T_{\text{total reference speech}}}, \tag{6.1}$$
where the durations are accumulated over the whole recording.
Note that if the oracle/reference segmentation is used, which is a widely adopted setup
when testing the performance of the speaker representations or clustering, the first two terms in
the numerator of Equation (6.1) are zero as the VAD is assumed to be perfect. The resulting
metric is also referred to as the Speaker Error Rate (SpkER).
labels, even though they are strongly correlated in practice. In contrast, DNC uses the sequence-
to-sequence structure that conditions each output on the full output history. Moreover, PIT has
a complexity of O(K!) if the number of speakers is K, and can be very expensive for a large
number of speakers2 , while DNC uses a permutation-free training loss by enforcing a specific
way of ordering the output label sequence.
Here, {A, B, C, D, E} are five different speaker identities. In the first meeting, only {A, C,
E} participate and ‘E’ was the first person to speak, so for DNC the cluster label ‘1’ is assigned
to the first input vector and to every later segment from speaker ‘E’. When a new speaker speaks, ‘A’ in this case,
DNC is trained to assign the incremented cluster label ‘2’ and similarly thereafter. As shown in
the second example, DNC will assign ‘1’ to the speaker ‘A’ and ‘2’ to speaker ‘C’ according to
the order of appearance.
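This order-of-appearance relabelling can be sketched as follows (an illustrative helper operating on per-segment speaker identities):

```python
def relabel_by_appearance(speaker_ids):
    """Map real speaker identities to DNC cluster labels in order of first
    appearance: the first speaker gets label 1, each subsequently appearing
    speaker gets the next unused label."""
    first_seen, labels = {}, []
    for spk in speaker_ids:
        if spk not in first_seen:
            first_seen[spk] = len(first_seen) + 1
        labels.append(first_seen[spk])
    return labels

# 'E' speaks first, then 'A', then 'E' again, then 'C'
print(relabel_by_appearance(['E', 'A', 'E', 'C']))  # [1, 2, 1, 3]
```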
In practice, cluster boundaries are rarely clear. Without prior information, making clustering
decisions (deciding how many clusters and cluster boundaries) is intrinsically ambiguous.
Learning domain-specific knowledge contained within the data samples can help resolve
ambiguities. A model attempting to determine yi , the cluster assignment of xi , should condition
that decision on the entire input sequence X and also on all assignments made for previous
feature vectors y0:i−1 . Hence, it is proposed to model clustering with a discriminative sequence-
to-sequence model:
$$P(y_{1:N} \mid X) = \prod_{i=1}^{N} P(y_i \mid y_{0:i-1}, X), \tag{6.2}$$
$$H = \mathrm{Encoder}(X), \tag{6.3}$$
$$y_i = \mathrm{Decoder}(y_{0:i-1}, H), \tag{6.4}$$
where, for DNC, each input vector in X is the speaker embedding of a speaker-homogeneous segment rather than a frame. Treating one input sequence as one
conversation or one meeting causes the amount of supervised training data for clustering to be
severely limited. For example, in the AMI dataset, widely used for speaker diarisation (Carletta
et al., 2005), only 147 training meetings exist that in turn can be used as individual training
sequences. The three data augmentation schemes that are proposed to overcome the data
scarcity problem are called sub-sequence randomisation, input vector randomisation, and
Diaconis Augmentation (Diac-Aug). The three techniques can also be combined. The data
augmentation techniques have two, possibly competing, objectives. The first is to generate
as many training sequences (X, y1:N ) as possible. The second is for them to match the true
data distribution p (X, y1:N ) as closely as possible. The augmentation schemes also enable
DNC to learn the importance of relative speaker identities (the cluster labels) rather than real
speaker identities across segments, which allows DNC to perform speaker clustering with a
simple cross-entropy training loss function.
Figure 6.2 Examples of input vector randomisation generating two input sequences for one
label sequence.
Figure 6.3 Diac-Aug for two clusters. The rotated clusters form a new training example.
Table 6.1 Details of AMI corpus partitions used for both training the speaker embedding
generator and training the DNC model.
{+99} (resulting in the overall input window of [-107,+106]). The output vectors of the TDNN
are combined using the self-attentive layer proposed in (Sun et al., 2019). This is followed by
a linear projection down to the embedding size, which is then the window-level embedding.
The TDNN structure resembles the one used in the x-vector models (Snyder et al., 2018) (i.e.
TDNN-layers with the following input contexts: [-2,+2], followed by {-2,0,+2}, followed by
{-3,0,+3}, followed by {0}). The first three TDNN-layers have a size of 512, the fourth a size of
128, and the embedding size is 32. The embedding generator is trained as a speaker classifier on
the AMI training data with the angular Softmax loss (Liu et al., 2017) using HTK (Young et al.,
2015) and PyHTK (Zhang et al., 2019b).
By using the angular Softmax loss combined with a linear activation function for the penulti-
mate layer of the d-vector generator, the L2 -normalised window-level speaker embeddings, and
in turn the segment-level speaker embeddings, should be approximately uniformly distributed
on the unit hypersphere. Based on this assumption and the speaker embedding size of 32, the
mean and variance of individual dimensions of speaker embeddings should be close to zero and
1/32, respectively. Empirically, this assumption holds well for the mean and for most dimensions
of the variance. Variance normalisation for the DNC models was performed by scaling the
embeddings by √32.
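This assumption and the resulting scaling can be checked with a small numerical sketch using synthetic points on the unit hypersphere (not the actual d-vectors):

```python
import numpy as np

# Points distributed uniformly on the unit hypersphere have per-dimension
# mean close to 0 and variance close to 1/dim, so scaling by sqrt(dim)
# gives roughly unit variance per dimension.
dim, n = 32, 100000
emb = np.random.randn(n, dim)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalise onto the sphere

print(np.abs(emb.mean(axis=0)).max())    # close to 0
print(emb.var(axis=0).mean(), 1 / dim)   # close to 1/32
emb_scaled = emb * np.sqrt(dim)          # variance normalisation used for DNC
```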
6.4.3 Clustering
Spectral Clustering The baseline uses the refined spectral clustering algorithm proposed
in (Wang et al., 2018), the input of which is the segment-level speaker embeddings described
in Section 6.4.2. Our implementation is based on the one published by Wang et al. (2018),
but the distance measure used in the k-means algorithm is modified from the Euclidean distance to the cosine
similarity, to align exactly with the description in (Wang et al., 2018). The number of clusters allowed is set to be
between two and four.
DNC Model The Transformer used in the DNC model contains 4 encoder blocks and 4
decoder blocks with a dimension of 256. The total number of parameters is 7.3 million. The
number of heads for the multi-head attention is 4. The model architecture follows (Vaswani
et al., 2017) and is implemented using ESPnet (Watanabe et al., 2018). The Adam optimiser
was used with a variable learning rate, which first ramps up linearly from 0 to 12 in the first
40,000 training updates and then decreases in proportion to the inverse square root of the
number of training steps (Vaswani et al., 2017). A dropout rate of 10% was applied to all
parameters. Considering that the input-to-output alignment for DNC is strictly one-to-one
and monotonic (see Section 6.2, one cluster label needs to be assigned to each input vector),
the source attention between encoder and decoder, represented as a square matrix, can be
6.5 Experimental Results 151
restricted to an identity matrix. For our experiments, the source attention matrix is masked to
be a tri-diagonal matrix, i.e. only the main diagonal and the first diagonals above and below
are non-zero. This restriction was found to be important for effective training of DNC models
in preliminary experiments. In this thesis, only experiments that used monotonic restricted
attention are reported.
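A minimal sketch of such a tri-diagonal source-attention mask is shown below in generic PyTorch; it illustrates the restriction only and is not the actual ESPnet implementation used for the experiments:

```python
import torch

def tridiagonal_mask(n):
    """Boolean mask (True = allowed) restricting source attention to a
    tri-diagonal band: decoder step i may only attend to encoder positions
    i-1, i and i+1, reflecting the one-to-one monotonic alignment of DNC."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= 1

# Apply the mask to an (n x n) matrix of attention scores before the Softmax.
scores = torch.randn(6, 6)
masked = scores.masked_fill(~tridiagonal_mask(6), float("-inf"))
attn = torch.softmax(masked, dim=-1)
```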
Table 6.2 SpkERs (%) of different data augmentation techniques for sub-meetings with a length
of 50 segments on the eval set. The input vector randomisation schemes none, global and meeting
are combined with Diac-Aug.
The SpkER of a DNC model trained on non-augmented data (none in Table 6.2), is 20.19%.
Using global augmentation reduces the SpkER to 14.47%, whilst meeting augmentation only
achieves a SpkER of 23.03%. Neighbouring embeddings are challenging to cluster due to
overlapping speech and similarities in the acoustic environment. For short sub-meetings,
meeting randomisation can move such neighbouring embeddings into separate sub-meetings.
Hence, for short sub-meetings meeting might generate atypical data whilst providing far less
augmentation than global.
The trend is different after applying Diac-Aug (second column of Table 6.2). Using none
achieves a SpkER of 15.25%, and the result for meeting reduces to 13.57%, which shows that
Diac-Aug generates fairly typical data. Section 6.4.2 showed that the assumptions behind Diac-
Aug are not perfect, which explains the performance drop of global. The number of speaker
groups available from global ($\sim C^{4}_{155}$) is larger than the number of generated sub-meetings.
Thus, Diac-Aug does not increase the number of speaker groups seen by the DNC model when
used together with global.
After training to convergence with data augmentation, the DNC models were fine-tuned using only
sub-sequence randomisation for augmentation (akin to none in Table 6.2).
                             dev               eval
#segments in sub-meetings    PT      FT        PT      FT
50                           18.44   17.94     13.57   13.90
200                          23.82   20.51     16.92   16.75
500                          25.48   21.89     17.73   18.39
all                          28.13   26.15     20.65   16.92

Table 6.3 Results of DNC's SpkER performance for the four CL stages. Comparison between
only pre-training (PT) on data augmented with meeting randomisation and Diac-Aug, and
finetuning (FT) afterwards on non-augmented data.
For the stages with meeting lengths above 50, for each original meeting, 104 sub-
meetings were generated using the augmentation techniques. When the maximum length was
set to 200, applying the two best-performing techniques of Section 6.5.1 results in a SpkER
of 19.14% for global and 16.92% for meeting combined with Diac-Aug. The later CL stages
use the latter combination of techniques for data augmentation. Increasing the maximum
sub-meeting length to 500 results in a SpkER of 17.73%. The SpkER for full-length (34.0
minutes on average) meetings was 20.65%. Table 6.3 shows the results for meeting with
Diac-Aug in column “eval-PT”.
sub-meeting length                DNC     spectral clustering
#segments    duration (mins)
50            2.8                 13.90   15.89
200           9.7                 16.75   22.38
500          20.9                 18.39   23.56
all          34.0                 16.92   23.95

Table 6.4 Comparison of DNC vs. spectral clustering on the eval set for different meeting
lengths. Fine-tuned DNC models are used for comparison.
Table 6.3 also shows the results of finetuning the DNC models of column “eval-PT” by
training on data that uses only sub-sequence randomisation, but neither meeting nor Diac-Aug
(see column “eval-FT”). While the eval set SpkER increases after finetuning for some cases,
the SpkER on the dev set (i.e. the validation set) reduces in all cases. The finetuned models
are compared with spectral clustering in Table 6.4 (column “eval-FT” in Table 6.3 is column
“DNC” in Table 6.4). The spectral clustering parameters were chosen to optimise the SpkER
154 Attention-Based Encoder-Decoder Models for Speaker Diarisation
on the dev set. For all meeting lengths, the DNC model outperforms spectral clustering. The
finetuned DNC model for the full meeting length achieves a SpkER of 16.92%, which is a
reduction of 29.4% relative to spectral clustering. The SpkER of a DNC model trained with
meeting and Diac-Aug, but without CL, is only 34.48% after finetuning.
6.5.3 Visualisation
Figure 6.4 Clustering results of different algorithms for one meeting. The 32-dimensional
speaker embeddings are projected to 2-dimensional using t-distributed Stochastic Neighbour
Embedding (t-SNE). In this example, (b) shows that spectral clustering fails to identify the
correct number of speakers. Even when the correct number of clusters is given to spectral
clustering as in (c), DNC shows a better clustering result in (d).
Figure 6.4 visualises the clustering results of a sub-meeting from the eval set using a
2-dimensional t-SNE (Maaten and Hinton, 2008) projection of the 32-dimensional segment
embeddings. Figure 6.4a depicts the ground truth cluster labels and shows the
difficulty of the clustering task. For example, some samples from speaker 2 are very close
to samples from speaker 1 (top left of Figure 6.4a), and some samples from speaker 1 are
very close to speaker 4 (bottom of Figure 6.4a). As these clusters are not well separated, it
is very challenging to determine the right number of clusters and also draw the right cluster
boundaries. As a result, spectral clustering produced two clusters instead of four in Figure 6.4b,
which leads to high SpkER. If the spectral clustering algorithm is forced to produce four
clusters, the result is illustrated in Figure 6.4c. Note that the four clusters are roughly linearly
separable in Figure 6.4c, which shows that the unsupervised clustering algorithm based on
cosine distance draws hard boundaries between clusters. Finally, the result given by DNC is
plotted in Figure 6.4d. First, the DNC model correctly recognises the existence of four speakers.
Compared to Figure 6.4c, the cluster boundaries given by the DNC approach are more complex
as Figure 6.4d shows multiple points of cross over to other clusters, which is more similar to
Figure 6.4a. In particular, when two clusters have significant overlap, spectral clustering makes
more errors, e.g. it wrongly assigns many samples from speaker 2 to speaker 1 (top middle
of Figure 6.4c). By comparison, the DNC model can split these two confusing clusters better.
Overall, DNC yields a much lower SpkER than spectral clustering, even when the correct number
of speakers is provided to the unsupervised algorithm.
could be modified to use forms of monotonic attention (Arivazhagan et al., 2019; Ma et al.,
2020).
Although DNC is a novel and promising approach for speaker diarisation, several aspects
can still be improved. For example, the experimental setup assumed a perfect VAD by using
manual segmentation information. The performance of DNC needs to be verified under a more
realistic setup. As the experiments used segment-level speaker embeddings for the Transformer
model, short and long segments were treated equally in the training loss, which was not true for
the evaluation metric. In practice, a time-weighted loss function or window-level
speaker embeddings could be used instead. The DNC approach itself also requires the maximum number
of speakers to be set. More research is needed to allow DNC to handle an unknown number of
speakers. The setup in this chapter does not handle overlapping speaker regions during training
as oracle segmentation is used and these regions are excluded during evaluation. However, it
would be more useful to allow the model to identify the overlapping regions in practice. For
example, it is possible to include overlapped speech in the training data and also introduce a
special output unit as the corresponding target.
Chapter 7

Conclusions and Future Work
This thesis first describes the fundamentals of Deep Neural Networks (DNNs) and two major
Automatic Speech Recognition (ASR) paradigms. Then various novel approaches for three
speech processing topics relating to Attention-Based Encoder-Decoder (AED) models are
proposed and validated by extensive experimentation, including the Integrated Source-Channel
and Attention (ISCA) framework that combines Source-Channel Model (SCM) and AED-based
ASR systems using N -best and lattice rescoring; the Confidence Estimation Module (CEM) and
the Residual Energy-Based Model (R-EBM) that produce reliable token/word/utterance-level
confidence scores for AED models; and Discriminative Neural Clustering (DNC) that uses the
Transformer model to perform supervised clustering for speaker diarisation. In this chapter,
observations are summarised and conclusions are drawn for each proposed approach. Based
on all the aforementioned contributions, this chapter recommends potential extensions and
promising prospects to be explored in the future.
7.1 Conclusions
AED models play an increasingly important role in various aspects of speech processing. In this
thesis, AED models are regarded as complementary components to combine with traditional
systems for ASR, effective confidence estimators for AED models are developed, and AED
models are used for the clustering stage of speaker diarisation.
As discussed in Chapter 4, AED models are highly complementary to conventional SCM-
based systems by approaching the ASR task from a different perspective. There are multiple
ways to combine these two types of systems. One widely used method is to jointly train a
Connectionist Temporal Classification (CTC) model and an AED model where the encoder
is shared. Decoding is a single pass procedure based on the AED decoder where the decoder
output probabilities are interpolated with CTC prefix scores for each decoding step. There
are multiple shortcomings with this framework. As the majority of the model parameters and
the output units of the two systems are shared, the individual system cannot be adapted to its
optimal performance. The weights of the two losses for multi-task training and decoding can be
very sensitive to tune. As the decoding is label-synchronous, it is challenging to adapt the system to
process streaming data. Our proposed alternative, ISCA, is to have a two-pass system where
the SCM-based system such as a CTC model produces first pass hypotheses and the AED
model performs rescoring in a second pass. However, the vanilla CTC model yields very
high error rates. Experiments show that once the token prior, lexicon and language model are
re-introduced for CTC similar to a standard Hidden Markov Model (HMM)-based system,
the CTC model can perform reasonably well. As the proposed approach does not restrict the
output units of the SCM system and the AED model to be the same, experiments found that a
triphone HMM-based Acoustic Model (AM) trained using frame-level cross-entropy criterion
outperforms a CTC model using either grapheme or phone as targets. Under the same multi-task
training setup, the proposed ISCA approach reached a lower Word Error Rate (WER) than the
single-pass approach. To illustrate the full potential of the ISCA framework, the SCM system
and the AED model are trained separately. Experimental results showed that further reductions
in WER can be observed. Also, when the N -best lists become larger, the combined system
consistently gives better performance. These initial experiments demonstrated that the benefits
of system combination are maximised when two systems are optimised separately based on
their individual best practice. Based on these observations, more extensive experiments were
carried out on a larger scale dataset and the individual SCM and AED systems are close to the
state-of-the-art performance. Both N -best rescoring and lattice rescoring are tested for ISCA.
Note that the lattice rescoring algorithm is extended from Neural Network Language Model
(NNLM) lattice rescoring by considering the effect of the attention mechanism. Recurrent
Neural Network (RNN)-based and Transformer-based AED decoders were also compared. As
far as computational cost is concerned, RNN-based AED decoders may be more suitable for
lattice rescoring while Transformer-based may be better for N -best rescoring based on the
structure of the hypotheses and the mechanism of the AED decoder. Given the same N -best
or lattice density, lattice rescoring generally outperforms N -best rescoring for ISCA as a far
larger number of alternative hypotheses are considered in lattices. From a practical perspective,
the proposed ISCA framework allows streaming processing of speech data in the first pass and
then adjusts the final output by rescoring in the second pass to improve the performance while
the increase in latency of the service is marginal.
Confidence scores for AED-based ASR systems were investigated in Chapter 5. Experi-
ments showed that using Softmax probabilities as confidence scores is not reliable as they
often tend to be overconfident and can be heavily influenced by the regularisation techniques
used during training. The CEM was first proposed as a simple additional module on top of an
existing AED-based ASR model. The CEM is trained to predict a binary correct/incorrect target
per output token. Experimental results demonstrated that confidence scores based on the CEM
are much better than Softmax probabilities at both the token and word levels. The overconfi-
dence issue is effectively mitigated by the CEM. Further experiments indicate that the CEM
also works well after Language Model (LM) shallow fusion and can generalise to a slightly mis-
matched domain. Considering that some applications rely on utterance-level confidence scores,
simply aggregating token-level confidence scores from the CEM may not be optimal. Deletion
errors are not within the scope of the CEM but are important for utterance-level confidence. The
aggregating function from a sequence of token-level scores to an utterance-level score, such as
taking the minimum or the average, may not be optimal. Therefore, the R-EBM was proposed
to directly learn the utterance-level confidence scores. Coincidentally, the training objective
is effectively the same as a discriminator between correct and erroneous hypotheses. In other
words, the utterance-level confidence estimator is also an energy-based model that learns the
residual space between a locally normalised AED model and a globally normalised model. The
negative energy value can be used to rescore the N -best hypotheses and improve the recog-
nition performance. Experiments verified that the R-EBM improves both the utterance-level
confidence performance and reduces WERs at the same time. For utterance-level confidence
scores, the R-EBM outperforms the CEM as expected. Further experiments showed that even
under a much more challenging setup where the encoder of the AED model is pre-trained on a
large amount of data, the R-EBM can still improve the confidence and recognition performance.
Analysis of the rescoring results showed that the reduction of WER correlates with the utterance
length, which indicates that the effectiveness of the R-EBM may depend on the amount of
global context available. By simply plotting the data and model distributions, it seems that the
R-EBM does reduce the gap between them as suggested by the theory. From the perspective
of confidence estimation, both the CEM and the R-EBM are model-based approaches. They
may be subject to generalisation problems, especially when the input to the models is from
Out-of-Domain (OOD) data. However, an ideal confidence estimator should be able to provide
reliable confidence scores for both in-domain and OOD data. Assuming some unlabelled
OOD data is available, experiments showed some interesting observations. Either by including
automatic transcriptions of OOD data or by having additional OOD language models during the
training of confidence estimators, the confidence performance is boosted on OOD data while
keeping the in-domain performance unchanged. With OOD information injected during CEM
training, the word-level calibration performance on OOD data can be significantly improved.
Similarly for the R-EBM, OOD information enables the confidence estimator to better filter
OOD data, which is expected to assist active or semi-supervised learning substantially.
For speaker diarisation, DNC was proposed in Chapter 6 which is a supervised clustering
approach that outperforms the commonly used unsupervised clustering algorithms. Diarisation
experiments were carried out on a very challenging meeting corpus. The meetings were
typically longer than half an hour and there were three or four active speakers. Three data
augmentation techniques work together with curriculum learning to effectively address the data
scarcity problem. The final results showed that DNC is a very promising approach and offers a
new perspective on speaker diarisation.
steps, DNC should also be modified to handle speaker overlaps and accommodate a variable
number of speakers. Since DNC offers a supervised alternative to the clustering stage, DNC
can potentially merge with upstream neural network-based components such as voice activity
detection and speaker representation extraction. Consequently, the whole diarisation pipeline
can be optimised in an end-to-end manner, which may help mitigate the propagation of errors
across different stages. Furthermore, an improved training criterion can be designed for DNC
to have a closer match with the final evaluation criterion, i.e. the Diarisation Error Rate (DER).
As DNC does not impose any assumptions on the input, it is also straightforward to include
signals from other microphones or information from other modalities such as video or text to
push the diarisation performance even further.
Appendix A
Datasets
This appendix provides detailed information about the major datasets used in this thesis,
including Augmented Multi-Party Interaction (AMI), LibriSpeech and Switchboard (SWB).
Table A.1 AMI dataset. Train/dev/test sets follow the official split.
For language modelling on the AMI dataset, the training transcriptions are normally used
to train Language Models (LMs). Sometimes, additional text data from the Fisher (FSH)
dataset (Cieri et al., 2004a) (see Section A.3) is used to augment the training data for LMs.
A.2 LibriSpeech
The LibriSpeech dataset (Panayotov et al., 2015) is a large scale dataset with nearly 1000 hours
of read English speech sampled at 16 kHz. The content is derived from audiobooks from the
LibriVox project. In this thesis, only the 100-hour subset is used during training. There are two
subsets in both the dev set and the test set. “clean” refers to the partition of the data with low
Word Error Rate (WER) speakers, whereas “other” refers to the partition of the data with high
WER speakers, based on an existing model trained on another read speech dataset. Some key
statistics of this 100h subset dataset are given in Table A.2.
                             train     dev               test
                                       clean    other    clean    other
# utterances                 28.5k     2.7k     2.9k     2.6k     2.9k
total duration (h)           100.6     5.4      5.1      5.4      5.3
duration per utterance (s)   12.7      7.2      6.4      7.4      6.5
total # words                990.1k    54.4k    50.9k    52.6k    52.3k
# words per utterance        34.7      20.1     17.8     20.1     17.8
vocabulary size              33.8k     8.3k     7.4k     8.1k     7.6k
# speakers                   251       40       33       40       33
# speakers in training set   –         0        0        0        0

Table A.2 LibriSpeech dataset. The training set is the “train-clean-100” subset. The dev and
test sets follow the official split.
For language modelling on the LibriSpeech dataset, there is a separate text corpus available,
which is much larger than the amount of training text. The additional text data has around 810
million words and 900 thousand unique words.
A.3 Switchboard
The SWB-300 dataset (Godfrey and Holliman, 1993) (Switchboard-1 release 2) is a Conver-
sational Telephone Speech (CTS) dataset with around 2400 two-sided recordings of landline
telephone conversations sampled at 8 kHz. Each phone call is between two English speakers
and the topic of each conversation is picked from a list of 70 topics.
The Hub5 2000 evaluation data (Hub5’00) (LDC, 2002a) and its transcripts (LDC, 2002b)
are used as the dev set. Hub5’00 has two parts, Switchboard (SWB) and CallHome (CH),
where each part has 20 conversations. The SWB part has some overlapping speakers with the
training set because it was originally collected together with the training set but not released at the time. The
CH part is from the CH English corpus (Canavan et al., 1997). Note that although the two
parts are both landline telephone conversations, they are different in nature. The two callers
in SWB did not know each other and they followed the assigned topics during phone calls.
However, CH participants in each call were family or friends, which resulted in less topicality
and formality (Fiscus et al., 2000). In addition, CH contains more accented speech. Therefore,
the CH part is expected to be more challenging.
The English CTS set of RT03 (Fiscus et al., 1997) is used as the test set. It consists of 36
telephony conversations from the Switchboard Cellular (SWBC) collection (Graff et al., 2001)
and 36 from the Fisher (FSH) collection (Cieri et al., 2004b). Note that SWBC sometimes has
a large channel mismatch with SWB due to the nature of the Global System for Mobile Communications
(GSM) cellular network, which may result in a higher WER if a system
trained on SWB is used. FSH is similar to SWB data, except that the topics are
more diverse, the vocabulary has a broader range, and a different collection protocol was
used, as FSH was collected around a decade later than SWB.
Table A.3 Switchboard-1 release 2 dataset (LDC97S62). The dev set is Hub5’00
(LDC2002S09). The test set is RT03 (LDC2007S10).
For language modelling on the SWB dataset, apart from the corresponding transcriptions
of the 300-hour training data, the FSH transcriptions (Cieri et al., 2004a, 2005) are sometimes also
included as additional in-domain text data to improve the LM. The FSH text has around 22
million words and 65 thousand unique words.
References
Abdel-Hamid, O. and Jiang, H. (2013). Fast speaker adaptation of hybrid NN/HMM model
for speech recognition based on discriminative learning of speaker code. Proc. ICASSP,
Vancouver, BC, Canada. 52
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper,
J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G.F., Elsen, E., Engel,
J., Fan, L.J., Fougner, C., Hannun, A.Y., Jun, B., Han, T.X., LeGresley, P., Li, X., Lin, L.,
Narang, S., Ng, A., Ozair, S., Prenger, R.J., Qian, S., Raiman, J., Satheesh, S., Seetapun, D.,
Sengupta, S., Sriram, A., Wang, C.J., Wang, Y., Wang, Z., Xiao, B., Xie, Y., Yogatama, D.,
Zhan, J., and Zhu, Z. (2016). Deep Speech 2 : End-to-end speech recognition in English and
Mandarin. Proc. ICML, New York, NY, USA. 61
Anastasakos, T., McDonough, J., Schwartz, R., and Makhoul, J. (1996). A compact model for
speaker-adaptive training. Proc. ICSLP, Philadelphia, PA, USA. 36, 51
Anguera, X., Bozonnet, S., Evans, N.W.D., Fredouille, C., Friedland, G., and Vinyals, O.
(2012). Speaker diarization: A review of recent research. IEEE Trans. on Audio, Speech, &
Language Processing, 20:356–370. 139
Anguera, X., Wooters, C., and Hernando, J. (2007). Acoustic beamforming for speaker
diarization of meetings. IEEE Trans. on Audio, Speech, & Language Processing, 15:2011–
2022. 149
Arivazhagan, N., Cherry, C., Macherey, W., Chiu, C.C., Yavuz, S., Pang, R., Li, W., and Raffel,
C. (2019). Monotonic infinite lookback attention for simultaneous machine translation. Proc.
ACL, Florence, Italy. 156
Atal, B.S. and Hanauer, S.L. (1971). Speech analysis and synthesis by linear prediction of the
speech wave. The Journal of the Acoustical Society of America, 50:637–655. 38
Aubert, X. and Ney, H. (1995). Large vocabulary continuous speech recognition using word
graphs. Proc. ICASSP, Detroit, MI, USA. 80
Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. Proc. NIPS Deep Learning
Symposium, Barcelona, Spain. 30
Baevski, A. and Mohamed, A. (2020). Effectiveness of self-supervised pre-training for ASR.
Proc. ICASSP, Barcelona, Spain. 72
Baevski, A., Schneider, S., and Auli, M. (2020a). vq-wav2vec: Self-supervised learning of
discrete speech representations. Proc. ICLR, Addis Ababa, Ethiopia. 74
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020b). wav2vec 2.0: A framework for
self-supervised learning of speech representations. Proc. NeurIPS, Vancouver, BC, Canada.
36, 74, 75, 127, 128, 132
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. Proc. ICLR, Banff, AB, Canada. 15, 17, 18, 19, 20, 64, 145, 146
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end
attention-based large vocabulary speech recognition. Proc. ICASSP, Shanghai, China. 64,
71, 146
Bahl, L.R., Brown, P.F., de Souza, P.V., and Mercer, R.L. (1986). Maximum mutual information
estimation of hidden Markov model parameters for speech recognition. Proc. ICASSP, Tokyo,
Japan. 36, 49
Baker, J. (1975). The DRAGON system – An overview. IEEE Trans. on Acoustics, Speech, &
Signal Processing, 23:24–29. 53
Baum, L.E. and Eagon, J.A. (1967). An inequality with applications to statistical estimation
for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the
American Mathematical Society, 73:360–363. 46
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence
prediction with recurrent neural networks. Proc. NIPS, Montreal, QC, Canada. 69, 120
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3:1137–1155. 53, 55
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. Proc.
ICML, Montreal, QC, Canada. 152
Bérard, A., Besacier, L., Kocabiyikoglu, A.C., and Pietquin, O. (2018). End-to-end automatic
speech translation of audiobooks. Proc. ICASSP, Calgary, AB, Canada. 65
Beyerlein, P. (1997). Discriminative model combination. Proc. ASRU, Santa Barbara, CA,
USA. 79
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford University Press. 9, 11,
28, 31
Bishop, C.M. (2006). Pattern recognition and machine learning. Springer. 42, 46
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proc.
COMPSTAT, Paris, France. 27
Boureau, Y.L., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
visual recognition. Proc. ICML, Haifa, Israel. 13
Bourlard, H., Bourlard, H.A., and Morgan, N. (1994). Connectionist speech recognition: A
hybrid approach, volume 247. Springer. 42
Breiman, L. (1996). Bagging predictors. Machine learning, 24:123–140. 31, 78
Brown, P.F. (1987). The acoustic modeling problem in automatic speech recognition. PhD
thesis, Carnegie Mellon University. 45
Campbell, N.A. (1984). Canonical variate analysis - a general model formulation. Australian
Journal of Statistics, 26:86–96. 39
Canavan, A., Graff, D., and Zipperlen, G. (1997). CALLHOME American English speech
LDC97S42. Web Download. Philadelphia: Linguistic Data Consortium. 165
Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos,
V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Masson, A.L., McCowan, I., Post,
W., Reidsma, D., and Wellner, P.D. (2005). The AMI meeting corpus: A pre-announcement.
Proc. MLMI, Edinburgh, UK. 147, 149, 163
Caruana, R. (1997). Multitask learning. Machine learning, 28:41–75. 33
Chan, R.H.Y. and Woodland, P.C. (2004). Improving broadcast news transcription by lightly
supervised discriminative training. Proc. ICASSP, Montreal, QC, Canada. 108
Chan, W., Jaitly, N., Le, Q.V., and Vinyals, O. (2016). Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition. Proc. ICASSP, Shanghai,
China. 64, 66, 146
Chan, W., Zhang, Y., Le, Q., and Jaitly, N. (2017). Latent sequence decompositions. Proc.
ICLR, Toulon, France. 64
Chen, S. (1998). Speaker, environment and channel change detection and clustering via the
Bayesian information criterion. Proc. Broadcast News Transcription and Understanding
Workshop, Lansdowne, VA, USA. 141
Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X.,
Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Zeng, M., and Wei, F. (2021). WavLM: Large-
scale self-supervised pre-training for full stack speech processing. arXiv.org:2110.13900.
72
Chen, S.F. and Goodman, J. (1999). An empirical study of smoothing techniques for language
modeling. Computer Speech & Language, 13:359–394. 54
Chen, X., Liu, X., Gales, M.J.F., and Woodland, P.C. (2015a). Recurrent neural network
language model training with noise contrastive estimation for speech recognition. Proc.
ICASSP, South Brisbane, QLD, Australia. 55
Chen, X., Liu, X., Ragni, A., Wang, Y., and Gales, M.J.F. (2017). Future word contexts in
neural network language models. Proc. ASRU, Okinawa, Japan. 55
Chen, X., Liu, X., Wang, Y., Gales, M.J.F., and Woodland, P.C. (2016). Efficient training and
evaluation of recurrent neural network language models for automatic speech recognition.
IEEE/ACM Trans. on Audio, Speech, & Language Processing, 24:2146–2157. 104
Chen, X., Liu, X., Wang, Y., Ragni, A., Wong, J.H.M., and Gales, M.J.F. (2019). Exploiting
future word contexts in neural network language models for speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 27:1444–1454. 104
Chen, X., Tan, T., Liu, X., Lanchantin, P., Wan, M., Gales, M.J.F., and Woodland, P.C. (2015b).
Recurrent neural network language model adaptation for multi-genre broadcast speech
recognition. Proc. Interspeech, Dresden, Germany. 55
Cheng, G., Peddinti, V., Povey, D., Manohar, V., Khudanpur, S., and Yan, Y. (2017). An
exploration of dropout with LSTMs. Proc. Interspeech, Stockholm, Sweden. 104
Chiu, C.C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss,
R.J., Rao, K., Gonina, K., Jaitly, N., Li, B., Chorowski, J., and Bacchiani, M. (2018). State-
of-the-art speech recognition with sequence-to-sequence models. Proc. ICASSP, Calgary,
AB, Canada. 65, 115
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. Proc. EMNLP, Doha, Qatar. 9, 19
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based
models for speech recognition. Proc. NIPS, Montreal, QC, Canada. 15, 18, 21, 35, 64, 71
Chorowski, J. and Jaitly, N. (2017). Towards better decoding and language model integration
in sequence to sequence models. Proc. Interspeech, Stockholm, Sweden. 71, 120
Chorowski, J., Weiss, R.J., Bengio, S., and van den Oord, A. (2019). Unsupervised speech
representation learning using WaveNet autoencoders. IEEE/ACM Trans. on Audio, Speech,
& Language Processing, 27:2041–2053. 72
Chung, Y.A. and Glass, J.R. (2020). Generative pre-training for speech with autoregressive
predictive coding. Proc. ICASSP, Barcelona, Spain. 72
Cieri, C., Graff, D., Kimball, O., Miller, D., and Walker, K. (2004a). Fisher English training
speech part 1 transcripts LDC2004T19. Web Download. Philadelphia: Linguistic Data
Consortium. 164, 165
Cieri, C., Graff, D., Kimball, O., Miller, D., and Walker, K. (2005). Fisher English training
speech part 2 transcripts LDC2005T19. Web Download. Philadelphia: Linguistic Data
Consortium. 165
Cieri, C., Miller, D., and Walker, K. (2004b). The Fisher corpus: A resource for the next
generations of speech-to-text. Proc. LREC, Lisbon, Portugal. 165
Cui, J., Weng, C., Wang, G., Wang, J., Wang, P., Yu, C., Su, D., and Yu, D. (2018). Improving
attention-based end-to-end ASR systems with sequence-based loss functions. Proc. SLT,
Athens, Greece. 65, 70
Cyrta, P., Trzciński, T., and Stokowiec, W. (2017). Speaker diarization using deep recurrent
convolutional neural networks for speaker embeddings. Proc. ISAT, Szklarska Poręba,
Poland. 141
Dahl, G.E., Yu, D., Deng, L., and Acero, A. (2011). Context-dependent pre-trained deep
neural networks for large-vocabulary speech recognition. IEEE Trans. on Audio, Speech, &
Language Processing, 20:30–42. 48
Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., and Salakhutdinov, R. (2019).
Transformer-XL: Attentive language models beyond a fixed-length context. Proc. ACL,
Florence, Italy. 55
Dauphin, Y., Fan, A., Auli, M., and Grangier, D. (2017). Language modeling with gated
convolutional networks. Proc. ICML, Sydney, NSW, Australia. 67
Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves.
Proc. ICML, Pittsburgh, PA, USA. 111
Davis, S. and Mermelstein, P. (1980). Comparison of parametric representations for monosyl-
labic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech,
& Signal Processing, 28:357–366. 37
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., and Ouellet, P. (2010). Front-end factor
analysis for speaker verification. IEEE Trans. on Audio, Speech, & Language Processing,
19:788–798. 52, 141
Del-Agua, M.Á., Giménez, A., Sanchís, A., Saiz, J.C., and Juan, A. (2018). Speaker-adapted
confidence measures for ASR using deep bidirectional recurrent neural networks. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 26:1194–1202. 109
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–22. 46
Demuynck, K., Duchateau, J., Van Compernolle, D., and Wambacq, P. (2000). An efficient
search space representation for large vocabulary continuous speech recognition. Speech
Communication, 30:37–53. 53
Deng, J., Guo, J., and Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep
face recognition. Proc. CVPR, Long Beach, CA, USA. 142
Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M. (2020). Residual energy-based
models for text generation. Proc. ICLR, Addis Ababa, Ethiopia. 107, 120, 121
Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional Transformers for language understanding. Proc. NAACL, Minneapolis, MN,
USA. 21, 55
Diaconis, P. and Shahshahani, M. (1987). The subgroup algorithm for generating uniform
random variables. Probability in the Engineering & Informational Sciences, 1:15–32. 148
Díez, M., Burget, L., Wang, S., Rohdin, J., and Černocký, J.H. (2019). Bayesian HMM based
x-vector clustering for speaker diarization. Proc. Interspeech, Graz, Austria. 141
Digalakis, V.V., Rtischev, D., and Neumeyer, L.G. (1995). Speaker adaptation using constrained
estimation of Gaussian mixtures. IEEE Trans. on Speech & Audio Processing, 3:357–366.
51
Dimitriadis, D. and Fousek, P. (2017). Developing on-line speaker diarization system. Proc.
Interspeech, Stockholm, Sweden. 142
Dong, L., Xu, S., and Xu, B. (2018). Speech-Transformer: A no-recurrence sequence-to-
sequence model for speech recognition. Proc. ICASSP, Calgary, AB, Canada. 64
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159. 28
Elman, J.L. (1993). Learning and development in neural networks: The importance of starting
small. Cognition, 48:71–99. 152
Evermann, G. and Woodland, P.C. (2000a). Large vocabulary decoding and confidence
estimation using word posterior probabilities. Proc. ICASSP, Istanbul, Turkey. 79, 109, 110,
116
Evermann, G. and Woodland, P.C. (2000b). Posterior probability decoding, confidence esti-
mation and system combination. Proc. NIST Speech Transcription Workshop, College Park,
MD, USA. 85, 108, 109, 111, 136
Fathullah, Y., Zhang, C., and Woodland, P.C. (2020). Improved large-margin softmax loss for
speaker diarisation. Proc. ICASSP, Barcelona, Spain. 142
Fiscus, J.G. (1997). A post-processing system to yield reduced word error rates: Recognizer
output voting error reduction (ROVER). Proc. ASRU, Santa Barbara, CA, USA. 79, 80, 85,
137
Fiscus, J.G., Ajot, J., Michel, M., and Garofolo, J.S. (2006). The Rich Transcription 2006
spring meeting recognition evaluation. Proc. MLMI, Bethesda, MD, USA. 142
Fiscus, J.G., Doddington, G., Le, A., Sanders, G., Przybocki, M., and Pallett, D. (1997). 2003
NIST Rich Transcription evaluation data LDC2007S10. Web Download. Philadelphia:
Linguistic Data Consortium. 165
Fiscus, J.G., Fisher, W.M., Martin, A.F., Przybocki, M.A., and Pallett, D.S. (2000). 2000 NIST
evaluation of conversational speech recognition over the telephone: English and Mandarin
performance results. Proc. NIST Speech Transcription Workshop, College Park, MD, USA.
165
Fujita, Y., Kanda, N., Horiguchi, S., Nagamatsu, K., and Watanabe, S. (2019a). End-to-end
neural speaker diarization with permutation-free objectives. Proc. Interspeech, Graz, Austria.
143
Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., and Watanabe, S. (2019b).
End-to-end neural speaker diarization with self-attention. Proc. ASRU, Singapore. 143
Furui, S. (1986). Speaker-independent isolated word recognition based on emphasized spectral
dynamics. Proc. ICASSP, Tokyo, Japan. 38
Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12:23–38.
66
Gales, M.J.F. (1998). Maximum likelihood linear transformations for HMM-based speech
recognition. Computer Speech & Language, 12:75–98. 51
Gales, M.J.F. (2000). Cluster adaptive training of hidden Markov models. IEEE Trans. on
Speech & Audio Processing, 8:417–428. 51
Gales, M.J.F., Kim, D.Y., Woodland, P.C., Chan, R.H.Y., Mrva, D., Sinha, R., and Tranter,
S. (2006). Progress in the CU-HTK broadcast news transcription system. IEEE Trans. on
Audio, Speech, & Language Processing, 14:1513–1525. 80
Gales, M.J.F., Knill, K., and Ragni, A. (2015). Unicode-based graphemic systems for limited
resource languages. Proc. ICASSP, South Brisbane, QLD, Australia. 39, 40, 66
Gales, M.J.F. and Woodland, P.C. (1996). Mean and variance adaptation within the MLLR
framework. Computer Speech & Language, 10:249–264. 51
Gales, M.J.F. and Young, S.J. (2008). The application of hidden Markov models in speech
recognition. Foundations & Trends in Signal Processing, 1:195–304. 80
Gangireddy, S.C.R., Swietojanski, P., Bell, P., and Renals, S. (2016). Unsupervised adaptation
of recurrent neural network language models. Proc. Interspeech, San Francisco, CA, USA.
55
Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., and McCree, A. (2017). Speaker diarization
using deep neural network embeddings. Proc. ICASSP, New Orleans, LA, USA. 141, 142
Gauvain, J.L. and Lee, C.H. (1994). Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Trans. on Speech & Audio Processing, 2:291–
298. 51
Gill, P.E., Murray, W., and Wright, M.H. (1981). Practical optimization. London Academic
Press. 27
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. Proc. AISTATS, Sardinia, Italy. 29
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. Proc.
AISTATS, Fort Lauderdale, FL, USA. 17
Godfrey, J. and Holliman, E. (1993). Switchboard-1 release 2 LDC97S62. Web Download.
Philadelphia: Linguistic Data Consortium. 164
Goodfellow, I.J., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press. 9, 14, 26,
27, 30, 31, 78
Gopalakrishnan, P.S., Kanevsky, D., Nádas, A., and Nahamoo, D. (1989). A generalization of
the Baum algorithm to rational objective functions. Proc. ICASSP, Glasgow, UK. 49
Graff, D., Walker, K., and Miller, D. (2001). Switchboard cellular part 1 audio LDC2001S13.
Web Download. Philadelphia: Linguistic Data Consortium. 165
Graves, A. (2011). Practical variational inference for neural networks. Proc. NIPS, Granada,
Spain. 34, 115
Graves, A. (2012). Sequence transduction with recurrent neural networks. Proc. ICML
Representation Learning Workshop, Edinburgh, UK. 59, 62
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of
the Acoustical Society of America, 87:1738–1752. 37
Hermansky, H., Ellis, D.P.W., and Sharma, S. (2000). Tandem connectionist feature extraction
for conventional HMM systems. Proc. ICASSP, Istanbul, Turkey. 43
Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N., and Estève, Y. (2018). TED-LIUM
3: Twice as much data and corpus repartition for experiments on speaker adaptation. Proc.
SPECOM, Leipzig, Germany. 131
Hershey, J.R., Chen, Z., Le Roux, J., and Watanabe, S. (2016). Deep clustering: Discriminative
embeddings for segmentation and separation. Proc. ICASSP, Shanghai, China. 144
Higuchi, Y., Watanabe, S., Chen, N., Ogawa, T., and Kobayashi, T. (2020). Mask CTC: Non-
autoregressive end-to-end ASR with CTC and mask predict. Proc. Interspeech, Shanghai,
China. 62
Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A.W., Vanhoucke,
V., Nguyen, P., Sainath, T.N., and Kingsbury, B. (2012). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29. 36
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9:1735–1780. 15
Hori, T., Cho, J., and Watanabe, S. (2018). End-to-end speech recognition with word-based
RNN language models. Proc. SLT, Athens, Greece. 64, 72
Hori, T., Watanabe, S., and Hershey, J.R. (2017a). Joint CTC/attention decoding for end-to-end
speech recognition. Proc. ACL, Vancouver, BC, Canada. 82
Hori, T., Watanabe, S., and Hershey, J.R. (2017b). Multi-level language modeling and decoding
for open vocabulary end-to-end speech recognition. Proc. ASRU, Okinawa, Japan. 65, 72
Hori, T., Watanabe, S., Zhang, Y., and Chan, W. (2017c). Advances in joint CTC-attention based
end-to-end speech recognition with a deep CNN encoder and RNN-LM. Proc. Interspeech,
Stockholm, Sweden. 64, 66
Hrúz, M. and Zajíc, Z. (2017). Convolutional neural network for speaker change detection in
telephone speaker diarization system. Proc. ICASSP, New Orleans, LA, USA. 141
Hu, S., Xie, X., Liu, S., Yu, J., Ye, Z., Geng, M., Liu, X., and Meng, H. (2021). Bayesian
learning of LF-MMI trained time delay neural networks for speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 29:1514–1529. 105
Huang, X., Acero, A., and Hon, H.W. (2001). Spoken language processing: A guide to theory,
algorithm, and system development. Prentice Hall PTR. 40, 43, 50, 56, 57
Huang, X. and Lee, K.F. (1993). On speaker-independent, speaker-dependent, and speaker-
adaptive speech recognition. IEEE Trans. on Speech & Audio Processing, 1:150–157. 51
Huang, Z., Wang, S., and Yu, K. (2018). Angular softmax for short-duration text-independent
speaker verification. Proc. Interspeech, Hyderabad, India. 142
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2018). Quantized neural
networks: Training neural networks with low precision weights and activations. Journal of
Machine Learning Research, 18:6869–6898. 34
Hughes, T. and Mierle, K. (2013). Recurrent neural networks for voice activity detection. Proc.
ICASSP, Vancouver, BC, Canada. 141
Hwang, M.Y. and Huang, X. (1993). Shared-distribution hidden Markov models for speech
recognition. IEEE Trans. on Speech & Audio Processing, 1:414–420. 40
India, M., Fonollosa, J.A.R., and Hernando, J. (2017). LSTM neural network-based speaker
segmentation using acoustic and language modelling. Proc. Interspeech, Stockholm, Sweden.
141
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. Proc. ICML, Lille, France. 29
Irie, K., Prabhavalkar, R., Kannan, A., Bruguier, A., Rybach, D., and Nguyen, P. (2019a). On
the choice of modeling unit for sequence-to-sequence speech recognition. Proc. Interspeech,
Graz, Austria. 66
Irie, K., Zeyer, A., Schlüter, R., and Ney, H. (2019b). Language modeling with deep Trans-
formers. Proc. Interspeech, Graz, Austria. 55
Irie, K., Zeyer, A., Schlüter, R., and Ney, H. (2019c). Training language models for long-span
cross-sentence evaluation. Proc. ASRU, Singapore. 55, 104, 105
Jaitly, N. and Hinton, G.E. (2013). Vocal tract length perturbation (VTLP) improves speech
recognition. Proc. ICML Workshop on Deep Learning for Audio, Speech & Language,
Atlanta, GA, USA. 33
Jang, E., Gu, S.S., and Poole, B. (2017). Categorical reparameterization with Gumbel-Softmax.
Proc. ICLR, Toulon, France. 74
Jégou, H., Douze, M., and Schmid, C. (2011). Product quantization for nearest neighbor search.
IEEE Trans. on Pattern Analysis & Machine Intelligence, 33:117–128. 74
Jelinek, F. (1991). Up from trigrams! - The struggle for improved language models. Proc.
Eurospeech, Genova, Italy. 53
Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press. 35, 36, 40
Jiang, H. (2005). Confidence measures for speech recognition: A survey. Speech Communica-
tion, 45:455–470. 108, 131
Jim, K.C., Giles, C.L., and Horne, B.G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Trans. on Neural Networks, 7:1424–1438.
34
Juang, B.H. (1985). Maximum-likelihood estimation for mixture multivariate stochastic
observations of Markov chains. AT&T Technical Journal, 64:1235–1249. 42
Juang, B.H., Hou, W., and Lee, C.H. (1997). Minimum classification error rate methods for
speech recognition. IEEE Trans. on Speech & Audio Processing, 5:257–265. 45
Jurafsky, D. and Martin, J.H. (2008). Speech and language processing: An introduction to
speech recognition, computational linguistics and natural language processing. Prentice
Hall. 35
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchin-
sky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A.,
and Dupoux, E. (2020). Libri-Light: A benchmark for ASR with limited or no supervision.
Proc. ICASSP, Barcelona, Spain. 127, 131
Kalgaonkar, K., Liu, C., Gong, Y., and Yao, K. (2015). Estimating confidence scores on ASR
results using recurrent neural networks. Proc. ICASSP, South Brisbane, QLD, Australia. 109
Kanda, N., Fujita, Y., and Nagamatsu, K. (2018). Lattice-free state-level minimum Bayes risk
training of acoustic models. Proc. Interspeech, Hyderabad, India. 104
Kanthak, S. and Ney, H. (2002). Context-dependent acoustic modeling using graphemes for
large vocabulary speech recognition. Proc. ICASSP, Orlando, FL, USA. 40, 66
Karanasou, P., Gales, M.J.F., Lanchantin, P., Liu, X., Qian, Y., Wang, L., Woodland, P.C., and
Zhang, C. (2015). Speaker diarisation and longitudinal linking in multi-genre broadcast data.
Proc. ASRU, Scottsdale, AZ, USA. 142
Karita, S., Ogawa, A., Delcroix, M., and Nakatani, T. (2018). Sequence training of encoder-
decoder model using policy gradient for end-to-end speech recognition. Proc. ICASSP,
Calgary, AB, Canada. 70
Kastanos, A., Ragni, A., and Gales, M.J.F. (2020). Confidence estimation for black box
automatic speech recognition systems using lattice recurrent neural networks. Proc. ICASSP,
Barcelona, Spain. 109
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast
autoregressive Transformers with linear attention. Proc. ICML, Vienna, Austria. 89
Katz, S. (1987). Estimation of probabilities from sparse data for the language model component
of a speech recognizer. IEEE Trans. on Acoustics, Speech, & Signal Processing, 35:400–401.
55
Kemp, T. and Schaaf, T. (1997). Estimating confidence using word lattices. Proc. Eurospeech,
Rhodes, Greece. 109
Kemp, T., Schmidt, M., Westphal, M., and Waibel, A.H. (2000). Strategies for automatic
segmentation of audio data. Proc. ICASSP, Istanbul, Turkey. 141
Killer, M., Stüker, S., and Schultz, T. (2003). Grapheme based speech recognition. Proc.
Eurospeech, Geneva, Switzerland. 66
Kim, S., Hori, T., and Watanabe, S. (2017). Joint CTC-attention based end-to-end speech
recognition using multi-task learning. Proc. ICASSP, New Orleans, LA, USA. 66, 81, 91,
92, 97
Kingma, D.P. and Ba, J. (2015). Adam: A method for stochastic optimization. Proc. ICLR,
San Diego, CA, USA. 28, 115
Kitza, M., Golik, P., Schlüter, R., and Ney, H. (2019). Cumulative adaptation for BLSTM
acoustic models. Proc. Interspeech, Graz, Austria. 105
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. Proc.
ICASSP, Detroit, MI, USA. 55
Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015). Audio augmentation for speech
recognition. Proc. Interspeech, Dresden, Germany. 33
Kreyssig, F.L., Zhang, C., and Woodland, P.C. (2018). Improved TDNNs using deep kernels
and frequency dependent grid-RNNs. Proc. ICASSP, Calgary, AB, Canada. 149
Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet classification with deep
convolutional neural networks. Proc. NIPS, Stateline, NV, USA. 12, 33
Krogh, A. and Hertz, J.A. (1992). A simple weight decay can improve generalization. Proc.
NIPS, Denver, CO, USA. 30
Kudo, T. (2018). Subword regularization: Improving neural network translation models with
multiple subword candidates. Proc. ACL, Melbourne, VIC, Australia. 67, 99
Kuhn, H.W. (1955). The Hungarian method for the assignment problem. Naval Research
Logistics Quarterly, 2:83–97. 144
Kumar, A., Singh, S., Gowda, D.N., Garg, A., Singh, S., and Kim, C. (2020). Utterance
confidence measure for end-to-end speech recognition with applications to distributed speech
recognition scenarios. Proc. Interspeech, Shanghai, China. 109, 121, 124
Kumar, N. (1998). Investigation of silicon auditory models and generalization of linear
discriminant analysis for improved speech recognition. PhD thesis, Johns Hopkins University.
39
LDC (2002a). 2000 HUB5 English evaluation speech LDC2002S09. Web Download. Philadel-
phia: Linguistic Data Consortium. 165
LDC (2002b). 2000 HUB5 English evaluation transcripts LDC2002T43. Web Download.
Philadelphia: Linguistic Data Consortium. 165
Le, N. and Odobez, J.M. (2018). Robust and discriminative speaker embedding via intra-class
distance variance regularization. Proc. Interspeech, Hyderabad, India. 144
LeCun, Y., Bengio, Y., and Hinton, G.E. (2015). Deep learning. Nature, 521:436–444. 9, 10
LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., and
Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1:541–551. 12
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, A., and Huang, F. (2006). A tutorial on energy-
based learning. Predicting Structured Data. 120
Lee, J. and Watanabe, S. (2021). Intermediate loss regularization for CTC-based speech
recognition. Proc. ICASSP, Toronto, ON, Canada. 62
Lee, K.F. (1988). On large-vocabulary speaker-independent continuous speech recognition.
Speech Communication, 7:375–379. 39
Lee, L. and Rose, R.C. (1996). Speaker normalization using efficient frequency warping
procedures. Proc. ICASSP, Atlanta, GA, USA. 39
Leggetter, C.J. and Woodland, P.C. (1995). Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech & Language,
9:171–185. 51
Li, B., Gulati, A., Yu, J., Sainath, T.N., Chiu, C.C., Narayanan, A., Chang, S.Y., Pang, R., He,
Y., Qin, J., Han, W., Liang, Q., Zhang, Y., Strohman, T., and Wu, Y. (2021a). A better and
faster end-to-end model for streaming ASR. Proc. ICASSP, Toronto, ON, Canada. 64, 67
Li, B. and Sim, K.C. (2010). Comparison of discriminative input and output transformations
for speaker adaptation in the hybrid NN/HMM systems. Proc. Interspeech, Makuhari, Japan.
52
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. (2019a). Bytes are all you need: End-to-end
multilingual speech recognition and synthesis with bytes. Proc. ICASSP, Brighton, UK. 40
Li, J., Zhao, R., Hu, H., and Gong, Y. (2019b). Improving RNN transducer modeling for
end-to-end speech recognition. Proc. ASRU, Singapore. 64
Li, K., Povey, D., and Khudanpur, S. (2021b). A parallelizable lattice rescoring strategy with
neural language models. Proc. ICASSP, Toronto, ON, Canada. 160
Li, K., Xu, H., Wang, Y., Povey, D., and Khudanpur, S. (2018). Recurrent neural network lan-
guage model adaptation for conversational speech recognition. Proc. Interspeech, Hyderabad,
India. 55
Li, Q. (2018). Confidence scores for speech processing. Master’s thesis, University of
Cambridge. 108
Li, Q., Kreyssig, F., Zhang, C., and Woodland, P.C. (2021c). Discriminative neural clustering
for speaker diarisation. Proc. SLT, Shenzhen, China. iii, 6, 139
Li, Q., Ness, P., Ragni, A., and Gales, M.J.F. (2019c). Bi-directional lattice recurrent neural
networks for confidence estimation. Proc. ICASSP, Brighton, UK. 109
Li, Q., Qiu, D., Zhang, Y., Li, B., He, Y., Woodland, P.C., Cao, L., and Strohman, T. (2021d).
Confidence estimation for attention-based sequence-to-sequence models for speech recogni-
tion. Proc. ICASSP, Toronto, ON, Canada. iii, 6, 109, 122, 127
Li, Q., Zhang, C., and Woodland, P.C. (2019d). Integrating source-channel and attention-based
sequence-to-sequence models for speech recognition. Proc. ASRU, Singapore. iii, 6, 121,
137
Li, Q., Zhang, C., and Woodland, P.C. (2021e). Combining frame-synchronous and label-
synchronous systems for speech recognition. arXiv.org:2107.00764. iii, 6
Li, Q., Zhang, Y., Li, B., Cao, L., and Woodland, P.C. (2021f). Residual energy-based models
for end-to-end speech recognition. Proc. Interspeech, Brno, Czech Republic. iii, 6, 109, 132
Li, Q., Zhang, Y., Qiu, D., He, Y., Cao, L., and Woodland, P.C. (2022). Improving confi-
dence estimation on out-of-domain data for end-to-end speech recognition. Proc. ICASSP,
Singapore. iii, 6
Lin, Q., Yin, R., Li, M., Bredin, H., and Barras, C. (2019). LSTM based similarity measurement
with spectral clustering for speaker diarization. Proc. Interspeech, Graz, Austria. 144
Liu, W. and Lee, T. (2021). Utterance-level neural confidence measure for end-to-end children
speech recognition. Proc. ASRU, Cartagena, Colombia. 109
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017). SphereFace: Deep hypersphere
embedding for face recognition. Proc. CVPR, Honolulu, HI, USA. 150
Liu, X., Chen, X., Wang, Y., Gales, M.J.F., and Woodland, P.C. (2016). Two efficient lattice
rescoring methods using recurrent neural network language models. IEEE/ACM Trans. on
Audio, Speech, & Language Processing, 24:1438–1449. 87, 88
Liu, X., Gales, M.J.F., Sim, K.C., and Yu, K. (2005). Investigation of acoustic modeling
techniques for LVCSR systems. Proc. ICASSP, Philadelphia, PA, USA. 39
Liu, Y., He, L., and Liu, J. (2019). Large margin softmax loss for speaker verification. Proc.
Interspeech, Graz, Austria. 142
Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network
encoder-decoder for large vocabulary speech recognition. Proc. Interspeech, Dresden,
Germany. 64
Lüscher, C., Beck, E., Irie, K., Kitza, M., Michel, W., Zeyer, A., Schlüter, R., and Ney, H.
(2019). RWTH ASR systems for Librispeech: Hybrid vs attention – w/o data augmentation.
Proc. Interspeech, Graz, Austria. 65
Ma, X., Pino, J., Cross, J., Puzon, L., and Gu, J. (2020). Monotonic multihead attention. Proc.
ICLR, Addis Ababa, Ethiopia. 156
Ma, Z. and Collins, M. (2018). Noise contrastive estimation and negative sampling for
conditional models: Consistency and statistical efficiency. Proc. EMNLP, Brussels, Belgium.
120, 123
Maas, A.L., Hannun, A.Y., Jurafsky, D., and Ng, A. (2014). First-pass large vocabulary
continuous speech recognition using bi-directional recurrent DNNs. arXiv.org:1408.2873.
61
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine
Learning Research, 9:2579–2605. 155
MacKay, D.J.C. (2003). Information theory, inference and learning algorithms. Cambridge
University Press. 35, 36, 49
Mangu, L., Brill, E., and Stolcke, A. (2000). Finding consensus in speech recognition: Word
error minimization and other applications of confusion networks. Computer Speech &
Language, 14:373–400. 58, 109, 111
Manning, C.D., Manning, C.D., and Schütze, H. (1999). Foundations of statistical natural
language processing. MIT Press. 53
McDermott, E., Sak, H., and Variani, E. (2019). A density ratio approach to language model
fusion in end-to-end automatic speech recognition. Proc. ASRU, Singapore. 72
Meng, Z., Li, J., Chen, Z., Zhao, Y., Mazalov, V., Gong, Y., and Juang, B.H. (2018). Speaker-
invariant training via adversarial learning. Proc. ICASSP, Calgary, AB, Canada. 52
Meng, Z., Parthasarathy, S., Sun, E., Gaur, Y., Kanda, N., Lu, L., Chen, X., Zhao, R., Li, J., and
Gong, Y. (2021). Internal language model estimation for domain-adaptive end-to-end speech
recognition. Proc. SLT, Shenzhen, China. 72
Miao, H., Cheng, G., Zhang, P., and Yan, Y. (2020). Online hybrid CTC/attention end-to-end
automatic speech recognition architecture. IEEE/ACM Trans. on Audio, Speech, & Language
Processing, 28:1452–1465. 83
Miao, Y., Gowayyed, M.A., Na, X., Ko, T., Metze, F., and Waibel, A.H. (2016a). An empirical
exploration of CTC acoustic models. Proc. ICASSP, Shanghai, China. 61
Miao, Y., Li, J., Wang, Y., Zhang, S.X., and Gong, Y. (2016b). Simplifying long short-term
memory acoustic models for fast training and decoding. Proc. ICASSP, Shanghai, China. 92
Mikolov, T. (2012). Statistical language models based on neural networks. PhD thesis, Brno
University of Technology. 28
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural
network based language model. Proc. Interspeech, Makuhari, Japan. 36, 53, 55
Moattar, M.H. and Homayounpour, M.M. (2012). A review on speaker diarization systems and
approaches. Speech Communication, 54:1065–1103. 139
Mohri, M., Pereira, F., and Riley, M. (2002). Weighted finite-state transducers in speech
recognition. Computer Speech & Language, 16:69–88. 53, 56
Moritz, N., Hori, T., and Roux, J.L. (2020). Streaming automatic speech recognition with the
Transformer model. Proc. ICASSP, Barcelona, Spain. 83
Nair, V. and Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines.
Proc. ICML, Haifa, Israel. 17
Ney, H., Essen, U., and Kneser, R. (1994). On structuring probabilistic dependences in
stochastic language modelling. Computer Speech & Language, 8:1–38. 55
Nguyen, H., Bougares, F., Tomashenko, N.A., Estève, Y., and Besacier, L. (2020). Investigating
self-supervised pre-training for end-to-end speech translation. Proc. Interspeech, Shanghai,
China. 72
Ning, H., Liu, M., Tang, H., and Huang, T.S. (2006). A spectral clustering approach to speaker
diarization. Proc. Interspeech, Pittsburgh, PA, USA. 142
Normandin, Y. (1991). Hidden Markov models, maximum mutual information estimation, and
the speech recognition problem. PhD thesis, McGill University. 50
Odell, J.J., Valtchev, V., Woodland, P.C., and Young, S.J. (1994). A one pass decoder design
for large vocabulary recognition. Proc. HLT, Plainsboro, NJ, USA. 53
Ogawa, A., Delcroix, M., Karita, S., and Nakatani, T. (2018). Rescoring N-best speech
recognition list based on one-on-one hypothesis comparison using encoder-classifier model.
Proc. ICASSP, Calgary, AB, Canada. 121
Okabe, K., Koshinaka, T., and Shinoda, K. (2018). Attentive statistics pooling for deep speaker
embedding. Proc. Interspeech, Hyderabad, India. 141
Oneata, D., Caranica, A., Stan, A., and Cucu, H. (2021). An evaluation of word-level confidence
estimation for end-to-end automatic speech recognition. Proc. SLT, Shenzhen, China. 109,
121
Ortmanns, S., Ney, H., and Aubert, X. (1997). A word graph algorithm for large vocabulary
continuous speech recognition. Computer Speech & Language, 11:43–72. 57
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J.V., Lakshmi-
narayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating
predictive uncertainty under dataset shift. Proc. NeurIPS, Vancouver, BC, Canada. 129
Pallett, D.S., Fisher, W.M., and Fiscus, J.G. (1990). Tools for the analysis of benchmark speech
recognition tests. Proc. ICASSP, Albuquerque, NM, USA. 103
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: An ASR corpus
based on public domain audio books. Proc. ICASSP, South Brisbane, QLD, Australia. 164
Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019).
SpecAugment: A simple data augmentation method for automatic speech recognition. Proc.
Interspeech, Graz, Austria. 33, 65, 99, 105, 115, 132
Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.C., Li, B., Wu, Y., and Le, Q.V. (2020). Improved
noisy student training for automatic speech recognition. Proc. Interspeech, Shanghai, China.
111, 119, 121, 137
Park, J., Liu, X., Gales, M.J.F., and Woodland, P.C. (2010). Improved neural network based
language modelling and adaptation. Proc. Interspeech, Makuhari, Japan. 55
Park, T.J., Kanda, N., Dimitriadis, D., Han, K.J., Watanabe, S., and Narayanan, S.S. (2022).
A review of speaker diarization: Recent advances with deep learning. Computer Speech &
Language, 72:101317. 139, 140
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural
networks. Proc. ICML, Atlanta, GA, USA. 28
Paul, D.B. and Baker, J. (1992). The design for the Wall Street Journal-based CSR corpus.
Proc. ICSLP, Banff, AB, Canada. 118
Peddinti, V., Povey, D., and Khudanpur, S. (2015). A time delay neural network architecture for
efficient modeling of long temporal contexts. Proc. Interspeech, Dresden, Germany. 12, 149
Peddinti, V., Wang, Y., Povey, D., and Khudanpur, S. (2018). Low latency acoustic modeling
using temporal convolution and LSTMs. IEEE Signal Processing Letters, 25:373–377. 104
Peskin, B., Newman, M., McAllaster, D., Nagesha, V., Richards, H.B., Wegmann, S., Hunt,
M.J., and Gillick, L. (1999). Improvements in recognition of conversational telephone speech.
Proc. ICASSP, Phoenix, AZ, USA. 80
Pham, N.Q., Nguyen, T.S., Niehues, J., Müller, M., Stüker, S., and Waibel, A.H. (2019). Very
deep self-attention networks for end-to-end speech recognition. Proc. Interspeech, Graz,
Austria. 64
Pineda, F.J. (1987). Generalization of back-propagation to recurrent neural networks. Physical
Review Letters, 59:2229. 14
Pinto, J. and Sitaram, R.N.V. (2005). Confidence measures in speech recognition based on
probability distribution of likelihoods. Proc. Interspeech, Lisbon, Portugal. 109
Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging.
SIAM Journal on Control & Optimization, 30:838–855. 32, 115
Polyak, B.T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics & Mathematical Physics, 4:1–17. 28
Povey, D. (2003). Discriminative training for large vocabulary speech recognition. PhD thesis,
University of Cambridge. 36, 50
Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., and Khudanpur, S. (2018).
Semi-orthogonal low-rank matrix factorization for deep neural networks. Proc. Interspeech,
Hyderabad, India. 86, 99
Povey, D., Ghoshal, A.K., Boulianne, G., Burget, L., Glembek, O., Goel, N.K., Hannemann,
M., Motlícek, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., and Veselý, K. (2011).
The Kaldi speech recognition toolkit. Proc. ASRU, Waikoloa, HI, USA. 98, 99
Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A.K., Janda, M., Karafiát,
M., Kombrink, S., Motlícek, P., Qian, Y., Riedhammer, K., Veselý, K., and Vu, N.T. (2012).
Generating exact lattices in the WFST framework. Proc. ICASSP, Kyoto, Japan. 58
Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., and
Khudanpur, S. (2016). Purely sequence-trained neural networks for ASR based on lattice-
free MMI. Proc. Interspeech, San Francisco, CA, USA. 43, 50, 92
Povey, D. and Woodland, P.C. (2002). Minimum phone error and I-smoothing for improved
discriminative training. Proc. ICASSP, Orlando, FL, USA. 50
Prabhavalkar, R., He, Y., Rybach, D., Campbell, S., Narayanan, A., Strohman, T., and Sainath,
T.N. (2021). Less is more: Improved RNN-T decoding using limited label context and path
merging. Proc. ICASSP, Toronto, ON, Canada. 83, 160
Prabhavalkar, R., Sainath, T.N., Li, B., Rao, K., and Jaitly, N. (2017). An analysis of "attention"
in sequence-to-sequence models. Proc. Interspeech, Stockholm, Sweden. 21
Prabhavalkar, R., Sainath, T.N., Wu, Y., Nguyen, P., Chen, Z., Chiu, C.C., and Kannan, A.
(2018). Minimum word error rate training for attention-based sequence-to-sequence models.
Proc. ICASSP, Calgary, AB, Canada. 65, 69, 120, 155
Pundak, G. and Sainath, T.N. (2016). Lower frame rate neural network acoustic models. Proc.
Interspeech, San Francisco, CA, USA. 86, 93
Qian, Y., Bi, M., Tan, T., and Yu, K. (2016). Very deep convolutional neural networks for noise
robust speech recognition. IEEE/ACM Trans. on Audio, Speech, & Language Processing,
24:2263–2276. 66
Qiu, D., He, Y., Li, Q., Zhang, Y., Cao, L., and McGraw, I. (2021a). Multi-task learning for
end-to-end ASR word and utterance confidence with deletion prediction. Proc. Interspeech,
Brno, Czech Republic. 109, 121, 137, 138, 160
Qiu, D., Li, Q., He, Y., Zhang, Y., Li, B., Cao, L., Prabhavalkar, R., Bhatia, D., Li, W., Hu,
K., Sainath, T.N., and McGraw, I. (2021b). Learning word-level confidence for subword
end-to-end ASR. Proc. ICASSP, Toronto, ON, Canada. 109, 121, 137, 160
Radford, A. and Narasimhan, K. (2018). Improving language understanding by generative
pre-training. [Online] https://blog.openai.com/language-unsupervised/. 55
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language
models are unsupervised multitask learners. [Online] https://blog.openai.com/better-language-models/. 55
Ragni, A., Gales, M.J.F., Rose, O., Knill, K., Kastanos, A., Li, Q., and Ness, P. (2022). Increas-
ing context for estimating confidence scores in automatic speech recognition. IEEE/ACM
Trans. on Audio, Speech, & Language Processing, 30:1319–1329. 109, 131
Ragni, A., Li, Q., Gales, M.J.F., and Wang, Y. (2018). Confidence estimation and deletion
prediction using bidirectional recurrent neural networks. Proc. SLT, Athens, Greece. 109,
137
Ramachandran, P., Zoph, B., and Le, Q.V. (2018). Searching for activation functions. Proc.
ICLR Workshop, Vancouver, BC, Canada. 17, 68
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence level training with
recurrent neural networks. Proc. ICLR, San Juan, Puerto Rico. 69, 120
Rao, K., Sak, H., and Prabhavalkar, R. (2017). Exploring architectures, data and units for
streaming end-to-end speech recognition with RNN-transducer. Proc. ASRU, Okinawa,
Japan. 64
Riccardi, G. and Hakkani-Tür, D. (2005). Active learning: Theory and applications to automatic
speech recognition. IEEE Trans. on Speech & Audio Processing, 13:504–511. 108
Richardson, F., Ostendorf, M., and Rohlicek, J.R. (1995). Lattice-based search strategies for
large vocabulary speech recognition. Proc. ICASSP, Detroit, MI, USA. 80
Rousseau, A., Deléglise, P., and Estève, Y. (2014). Enhancing the TED-LIUM corpus with
selected data for language modeling and more TED talks. Proc. LREC, Reykjavik, Iceland.
132
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1988). Learning representations by back-
propagating errors. Cognitive Modeling, 5:1. 9, 14, 17, 25
Sabour, S., Chan, W., and Norouzi, M. (2019). Optimal completion distillation for sequence
learning. Proc. ICLR, New Orleans, LA, USA. 65, 70
Sainath, T.N., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional
neural networks for LVCSR. Proc. ICASSP, Vancouver, BC, Canada. 66
Sainath, T.N., Pang, R., Rybach, D., He, Y., Prabhavalkar, R., Li, W., Visontai, M., Liang,
Q., Strohman, T., Wu, Y., McGraw, I., and Chiu, C.C. (2019). Two-pass end-to-end speech
recognition. Proc. Interspeech, Graz, Austria. 82, 104, 121, 160
Sainath, T.N., Prabhavalkar, R., Kumar, S., Lee, S., Kannan, A., Rybach, D., Schogol, V.,
Nguyen, P., Li, B., Wu, Y., Chen, Z., and Chiu, C.C. (2018). No need for a lexicon?
Evaluating the value of the pronunciation lexica in end-to-end models. Proc. ICASSP,
Calgary, AB, Canada. 66, 95
Sainath, T.N., Weiss, R.J., Senior, A.W., Wilson, K.W., and Vinyals, O. (2015). Learning the
speech front-end with raw waveform CLDNNs. Proc. Interspeech, Dresden, Germany. 39
Saito, T. and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the
ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10. 111
Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network
architectures for large scale acoustic modeling. Proc. Interspeech, Singapore. 15
Sak, H., Senior, A., Rao, K., and Beaufays, F. (2015a). Fast and accurate recurrent neural
network acoustic models for speech recognition. Proc. Interspeech, Dresden, Germany. 61
Sak, H., Senior, A.W., Rao, K., Irsoy, O., Graves, A., Beaufays, F., and Schalkwyk, J. (2015b).
Learning acoustic frame labeling for speech recognition with recurrent neural networks.
Proc. ICASSP, South Brisbane, QLD, Australia. 61
Sanchís, A., Juan-Císcar, A., and Vidal, E. (2012). A word-based naïve Bayes classifier for
confidence estimation in speech recognition. IEEE Trans. on Audio, Speech, & Language
Processing, 20:565–574. 109
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help
optimization? Proc. NIPS, Montreal, QC, Canada. 30
Saon, G., Dharanipragada, S., and Povey, D. (2004). Feature space Gaussianization. Proc.
ICASSP, Montreal, QC, Canada. 39
Saon, G., Tüske, Z., Bolaños, D., and Kingsbury, B. (2021). Advancing RNN transducer
technology for speech recognition. Proc. ICASSP, Toronto, ON, Canada. 64, 105
Savoji, M.H. (1989). A robust algorithm for accurate endpointing of speech signals. Speech
Communication, 8:45–60. 141
Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M.X., Jia, Y., Kannan, A., Sainath, T.N., Cao, Y.,
Chiu, C.C., He, Y., Chorowski, J., Hinsu, S., Laurenzo, S., Qin, J., Firat, O., Macherey, W.,
Gupta, S., Bapna, A., Zhang, S., Pang, R., Weiss, R.J., Prabhavalkar, R., Liang, Q., Jacob,
B., Liang, B., Lee, H., Chelba, C., Jean, S., Li, B., Johnson, M., Anil, R., Tibrewal, R., Liu,
X., Eriguchi, A., Jaitly, N., Ari, N., Cherry, C., Haghani, P., Good, O., Cheng, Y., Álvarez,
R., Caswell, I., Hsu, W.N., Yang, Z., Wang, K., Gonina, E., Tomanek, K., Vanik, B., Wu,
Z., Jones, L., Schuster, M., Huang, Y., Chen, D., Irie, K., Foster, G.F., Richardson, J., Alon,
U., and et al. (2019). Lingvo: A modular and scalable framework for sequence-to-sequence
modeling. arXiv.org:1902.08295. 114
Shi, Y., Huang, Q., and Hain, T. (2020). H-vectors: Utterance-level speaker embedding using a
hierarchical attention model. Proc. ICASSP, Barcelona, Spain. 141
Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D.A., and Glass, J.R. (2011). Exploiting
intra-conversation variability for speaker diarization. Proc. Interspeech, Florence, Italy. 141
Shum, S.H., Dehak, N., Dehak, R., and Glass, J.R. (2013). Unsupervised methods for speaker
diarization: An integrated and iterative approach. IEEE Trans. on Audio, Speech, & Language
Processing, 21:2015–2028. 142
Sietsma, J. and Dow, R.J.F. (1991). Creating artificial neural networks that generalize. Neural
Networks, 4:67–79. 33
Sinha, R., Tranter, S., Gales, M.J.F., and Woodland, P.C. (2005). The Cambridge University
March 2005 speaker diarisation system. Proc. Interspeech, Lisbon, Portugal. 141
Siniscalchi, S.M., Svendsen, T., Sorbello, F., and Lee, C.H. (2010). Experimental studies on
continuous speech recognition using neural architectures with “adaptive” hidden activation
functions. Proc. ICASSP, Dallas, TX, USA. 52
Siohan, O., Ramabhadran, B., and Kingsbury, B. (2005). Constructing ensembles of ASR
systems using randomized decision trees. Proc. ICASSP, Philadelphia, PA, USA. 78
Siu, M., Gish, H., and Richardson, F. (1997). Improved estimation, evaluation and applications
of confidence measures for speech recognition. Proc. Eurospeech, Rhodes, Greece. 109, 136
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors:
Robust DNN embeddings for speaker recognition. Proc. ICASSP, Calgary, AB, Canada. 150
Socher, R., Manning, C.D., and Ng, A.Y. (2010). Learning continuous phrase representations
and syntactic parsing with recursive neural networks. Proc. NIPS Deep Learning and
Unsupervised Feature Learning Workshop, Vancouver, BC, Canada. 15
Soltau, H., Liao, H., and Sak, H. (2017). Neural speech recognizer: Acoustic-to-word LSTM
model for large vocabulary speech recognition. Proc. Interspeech, Stockholm, Sweden. 39
Sriram, A., Jun, H., Satheesh, S., and Coates, A. (2018). Cold fusion: Training seq2seq models
together with language models. Proc. Interspeech, Hyderabad, India. 72
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15:1929–1958. 32, 115
Stüker, S., Fügen, C., Burger, S., and Wölfel, M. (2006). Cross-system adaptation and
combination for continuous speech recognition: The influence of phoneme set and acoustic
front-end. Proc. ICSLP, Pittsburgh, PA, USA. 80
Su, H., Li, G., Yu, D., and Seide, F. (2013). Error back propagation for sequence training of
context-dependent deep networks for conversational speech transcription. Proc. ICASSP,
Vancouver, BC, Canada. 70, 86
Sun, G., Zhang, C., and Woodland, P.C. (2019). Speaker diarisation using 2D self-attentive
combination of embeddings. Proc. ICASSP, Brighton, UK. 141, 142, 144, 150
Sun, G., Zhang, C., and Woodland, P.C. (2021a). Combination of deep speaker embeddings for
diarisation. Neural Networks, 141:372–384. 141, 160
Sun, G., Zhang, C., and Woodland, P.C. (2021b). Transformer language models with LSTM-
based cross-utterance information representation. Proc. ICASSP, Toronto, ON, Canada. 55, 104,
105
Sung, Y.H., Hughes, T., Beaufays, F., and Strope, B. (2009). Revisiting graphemes with
increasing amounts of data. Proc. ICASSP, Taipei, Taiwan. 66
Sutskever, I., Vinyals, O., and Le, Q.V. (2014). Sequence to sequence learning with neural
networks. Proc. NIPS, Montreal, QC, Canada. 9, 19
Swietojanski, P. and Renals, S. (2014). Learning hidden unit contributions for unsupervised
speaker adaptation of neural network acoustic models. Proc. SLT, South Lake Tahoe, NV,
USA. 52
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception
architecture for computer vision. Proc. CVPR, Las Vegas, NV, USA. 34, 92, 115
Tang, D., Qin, B., and Liu, T. (2015). Document modeling with gated recurrent neural network
for sentiment classification. Proc. EMNLP, Lisbon, Portugal. 15
Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results. Proc. NIPS, Long Beach,
CA, USA. 32
Tjandra, A., Sakti, S., and Nakamura, S. (2017a). Listening while speaking: Speech chain by
deep learning. Proc. ASRU, Okinawa, Japan. 65
Tjandra, A., Sakti, S., and Nakamura, S. (2017b). Local monotonic attention mechanism for
end-to-end speech and language processing. Proc. IJCNLP, Taipei, Taiwan. 21
Tjandra, A., Sakti, S., and Nakamura, S. (2018a). Machine speech chain with one-shot speaker
adaptation. Proc. Interspeech, Hyderabad, India. 65
Tjandra, A., Sakti, S., and Nakamura, S. (2018b). Sequence-to-sequence ASR optimization via
reinforcement learning. Proc. ICASSP, Calgary, AB, Canada. 70
Toshniwal, S., Kannan, A., Chiu, C.C., Wu, Y., Sainath, T.N., and Livescu, K. (2018). A com-
parison of techniques for language model integration in encoder-decoder speech recognition.
Proc. SLT, Athens, Greece. 65, 72
Tranter, S. and Reynolds, D.A. (2006). An overview of automatic speaker diarization systems.
IEEE Trans. on Audio, Speech, & Language Processing, 14:1557–1565. 139
Tranter, S., Yu, K., Evermann, G., and Woodland, P.C. (2004). Generating and evaluating
segmentations for automatic speech recognition of conversational telephone speech. Proc.
ICASSP, Montreal, QC, Canada. 141
Tür, G., Hakkani-Tür, D.Z., and Schapire, R. (2005). Combining active and semi-supervised
learning for spoken language understanding. Speech Communication, 45:171–186. 108
Tüske, Z., Golik, P., Schlüter, R., and Ney, H. (2014). Acoustic modeling with deep neural
networks using raw time signal for LVCSR. Proc. Interspeech, Singapore. 39
Tüske, Z., Saon, G., Audhkhasi, K., and Kingsbury, B. (2020). Single headed attention
based sequence-to-sequence model for state-of-the-art results on Switchboard-300. Proc.
Interspeech, Shanghai, China. 104, 105
Tüske, Z., Saon, G., and Kingsbury, B. (2021). On the limit of English conversational speech
recognition. Proc. Interspeech, Brno, Czech Republic. 81, 104
Uebel, L.F. and Woodland, P.C. (2001). Speaker adaptation using lattice-based MLLR. Proc.
ITRW on Adaptation Methods for Speech Recognition, Sophia Antipolis, France. 108, 137
Valtchev, V., Odell, J.J., Woodland, P.C., and Young, S.J. (1997). MMIE training of large
vocabulary recognition systems. Speech Communication, 22:303–314. 36, 49
van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive
predictive coding. arXiv.org:1807.03748. 36, 73, 74
Variani, E., Chen, T., Apfel, J.A., Ramabhadran, B., Lee, S., and Moreno, P.J. (2020a). Neural
oracle search on N-best hypotheses. Proc. ICASSP, Barcelona, Spain. 121
Variani, E., Lei, X., McDermott, E., Moreno, I.L., and Gonzalez-Dominguez, J. (2014). Deep
neural networks for small footprint text-dependent speaker verification. Proc. ICASSP,
Florence, Italy. 141
Variani, E., Rybach, D., Allauzen, C., and Riley, M. (2020b). Hybrid autoregressive transducer
(HAT). Proc. ICASSP, Barcelona, Spain. 72
Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. Proc. NIPS, Long Beach, CA, USA. 17, 21,
23, 24, 30, 146, 150
Viikki, O. and Laurila, K. (1998). Cepstral domain segmental feature vector normalization for
noise robust speech recognition. Speech Communication, 25:133–147. 39
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Trans. on Information Theory, 13:260–269. 43
von Platen, P., Zhang, C., and Woodland, P.C. (2019). Multi-span acoustic modelling using raw
waveform signals. Proc. Interspeech, Graz, Austria. 39
Waibel, A.H., Hanazawa, T., Hinton, G.E., Shikano, K., and Lang, K.J. (1989). Phoneme
recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech, & Signal
Processing, 37:328–339. 12
Wan, L., Wang, Q., Papir, A., and Moreno, I.L. (2018). Generalized end-to-end loss for speaker
verification. Proc. ICASSP, Calgary, AB, Canada. 141, 144
Wang, H., Ragni, A., Gales, M.J.F., Knill, K., Woodland, P.C., and Zhang, C. (2015). Joint
decoding of tandem and hybrid systems for improved keyword spotting on low resource
languages. Proc. Interspeech, Dresden, Germany. 79
Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., and Liu, W. (2018). CosFace:
Large margin cosine loss for deep face recognition. Proc. CVPR, Salt Lake City, UT, USA.
142
Wang, J., Xiao, X., Wu, J., Ramamurthy, R., Rudzicz, F., and Brudno, M. (2020a). Speaker
diarization with session-level speaker embedding refinement using graph neural networks.
Proc. ICASSP, Barcelona, Spain. 142
Wang, L., Zhang, C., Woodland, P.C., Gales, M.J.F., Karanasou, P., Lanchantin, P., Liu, X., and
Qian, Y. (2016). Improved DNN-based segmentation for multi-genre broadcast audio. Proc.
ICASSP, Shanghai, China. 141
Wang, M., Soltau, H., Shafey, L.E., and Shafran, I. (2021). Word-level confidence estimation
for RNN transducers. Proc. ASRU, Cartagena, Colombia. 160
Wang, Q., Downey, C., Wan, L., Mansfield, P.A., and Moreno, I.L. (2018). Speaker diarization
with LSTM. Proc. ICASSP, Calgary, AB, Canada. 141, 142, 150
Wang, Q., Okabe, K., Lee, K.A., Yamamoto, H., and Koshinaka, T. (2018). Attention mech-
anism in speaker recognition: What does it learn in deep speaker embedding? Proc. SLT,
Athens, Greece. 141
Wang, S., Li, B.Z., Khabsa, M., Fang, H., and Ma, H. (2020b). Linformer: Self-attention with
linear complexity. arXiv.org:2006.04768. 89
Wang, W., Zhou, Y., Xiong, C., and Socher, R. (2020c). An investigation of phone-based
subword units for end-to-end speech recognition. Proc. Interspeech, Shanghai, China. 105
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Yalta, N., Heymann,
J., Wiesner, M., Chen, N., Renduchintala, A., and Ochiai, T. (2018). ESPnet: End-to-end
speech processing toolkit. Proc. Interspeech, Hyderabad, India. 91, 98, 99, 150
Watanabe, S., Hori, T., Kim, S., Hershey, J.R., and Hayashi, T. (2017). Hybrid CTC/attention
architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal
Processing, 11:1240–1253. 81, 86, 91
Weintraub, M., Beaufays, F., Rivlin, Z., Konig, Y., and Stolcke, A. (1997). Neural-network
based measures of confidence for word recognition. Proc. ICASSP, Munich, Germany. 109
Weiss, R.J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. (2017). Sequence-to-sequence
models can directly translate foreign speech. Proc. Interspeech, Stockholm, Sweden. 65
Wessel, F., Schlüter, R., Macherey, K., and Ney, H. (2001). Confidence measures for large
vocabulary continuous speech recognition. IEEE Trans. on Speech & Audio Processing,
9:288–298. 108
Whittaker, E. and Woodland, P.C. (2000). Particle-based language modelling. Proc. Interspeech,
Beijing, China. 67
Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8:229–256. 155
Williams, R.J. and Zipser, D. (1989). A learning algorithm for continually running fully
recurrent neural networks. Neural Computation, 1:270–280. 68
Wong, J.H.M., Gaur, Y., Zhao, R., Lu, L., Sun, E., Li, J., and Gong, Y. (2020). Combination of
end-to-end and hybrid models for speech recognition. Proc. Interspeech, Shanghai, China.
81
Woodland, P.C. (1989). Weight limiting, weight quantisation and generalisation in multi-layer
perceptrons. Proc. ICANN, London, UK. 34
Woodland, P.C. (2001). Speaker adaptation for continuous density HMMs: A review. Proc.
ISCA ITR-Workshop on Adaptation Methods for Speech Recognition, Salt Lake City, UT,
USA. 36
Woodland, P.C., Gales, M.J.F., Pye, D., and Young, S.J. (1997). Broadcast news transcription
using HTK. Proc. ICASSP, Munich, Germany. 38
Woodland, P.C., Leggetter, C.J., Odell, J.J., Valtchev, V., and Young, S.J. (1995). The 1994
HTK large vocabulary speech recognition system. Proc. ICASSP, Detroit, MI, USA. 57, 58,
80, 87
Woodland, P.C. and Povey, D. (2002). Large scale discriminative training of hidden Markov
models for speech recognition. Computer Speech & Language, 16:25–47. 49, 57
Woodward, A., Bonnín, C., Masuda, I., Varas, D., Bou, E., and Riveiro, J.C. (2020). Confidence
measures in encoder-decoder models for speech recognition. Proc. Interspeech, Shanghai,
China. 109, 121
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y.,
Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S.,
Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C.,
Smith, J.R., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G.S., Hughes, M., and Dean, J.
(2016). Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv.org:1609.08144. 17
Xu, H., Ding, S., and Watanabe, S. (2019). Improving end-to-end speech recognition with
pronunciation-assisted sub-word modeling. Proc. ICASSP, Brighton, UK. 64
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., and Bengio,
Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. Proc.
ICML, Lille, France. 15
Yella, S.H. and Stolcke, A. (2015). A comparison of neural network feature transforms for
speaker diarization. Proc. Interspeech, Dresden, Germany. 141
Young, S.J. (1996). Large vocabulary continuous speech recognition: A review. IEEE Signal
Processing Magazine, 13:45–57. 57
Young, S.J., Evermann, G., Gales, M.J.F., Hain, T., Kershaw, D.J., Liu, X., Moore, G.L., Odell,
J.J., Ollason, D., Povey, D., Ragni, A., Valtchev, V., Woodland, P.C., and Zhang, C. (2015).
The HTK Book (for HTK version 3.5). Cambridge University Engineering Department. 92,
150
Young, S.J., Odell, J.J., and Woodland, P.C. (1994). Tree-based state tying for high accuracy
acoustic modelling. Proc. HLT, Plainsboro, NJ, USA. 40
Young, S.J., Russell, N.H., and Thornton, J.H.S. (1989). Token passing: A simple conceptual
model for connected speech recognition systems. Technical report, Cambridge University
Engineering Department. 53
Yu, D., Kolbæk, M., Tan, Z.H., and Jensen, J. (2017). Permutation invariant training of deep
models for speaker-independent multi-talker speech separation. Proc. ICASSP, New Orleans,
LA, USA. 143
Yu, D., Li, J., and Deng, L. (2011). Calibration of confidence measures in speech recognition.
IEEE Trans. on Audio, Speech, & Language Processing, 19:2461–2473. 108, 131
Yu, D., Yao, K., Su, H., Li, G., and Seide, F. (2013). KL-divergence regularized deep neural
network adaptation for improved large vocabulary speech recognition. Proc. ICASSP,
Vancouver, BC, Canada. 52
Yu, Y.Q., Fan, L., and Li, W.J. (2019). Ensemble additive margin softmax for speaker verifica-
tion. Proc. ICASSP, Brighton, UK. 142
Zapotoczny, M., Pietrzak, P., Lancucki, A., and Chorowski, J. (2019). Lattice generation in
attention-based speech recognition models. Proc. Interspeech, Graz, Austria. 111
Zeiler, M.D. (2012). ADADELTA: An adaptive learning rate method. arXiv.org:1212.5701.
28, 92
Zeineldeen, M., Glushko, A., Michel, W., Zeyer, A., Schlüter, R., and Ney, H. (2021). Investi-
gating methods to improve language model integration for attention-based encoder-decoder
ASR models. Proc. Interspeech, Brno, Czech Republic. 72
Zeppenfeld, T., Finke, M., Ries, K., Westphal, M., and Waibel, A.H. (1997). Recognition of
conversational telephone speech using the JANUS speech engine. Proc. ICASSP, Munich,
Germany. 109
Zeyer, A., Beck, E., Schlüter, R., and Ney, H. (2017). CTC in the context of generalized
full-sum HMM training. Proc. Interspeech, Stockholm, Sweden. 60
Zeyer, A., Irie, K., Schlüter, R., and Ney, H. (2018). Improved training of end-to-end attention
models for speech recognition. Proc. Interspeech, Hyderabad, India. 105
Zhang, A., Wang, Q., Zhu, Z., Paisley, J., and Wang, C. (2019a). Fully supervised speaker
diarization. Proc. ICASSP, Brighton, UK. 143
Zhang, C., Kreyssig, F.L., Li, Q., and Woodland, P.C. (2019b). PyHTK: Python library and
ASR pipelines for HTK. Proc. ICASSP, Brighton, UK. 92, 150
Zhang, C. and Woodland, P.C. (2016). DNN speaker adaptation using parameterised sigmoid
and ReLU hidden activation functions. Proc. ICASSP, Shanghai, China. 52
Zhang, X.L. and Wu, J. (2013). Deep belief networks based voice activity detection. Proc.
ICASSP, Vancouver, BC, Canada. 141
Zhang, Y., Chan, W., and Jaitly, N. (2017). Very deep convolutional networks for end-to-end
speech recognition. Proc. ICASSP, New Orleans, LA, USA. 64, 66
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A.C.
(2016). Towards end-to-end speech recognition with deep convolutional neural networks.
Proc. Interspeech, San Francisco, CA, USA. 64, 66
Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.C., Pang, R., Le, Q.V., and Wu, Y. (2020).
Pushing the limits of semi-supervised learning for automatic speech recognition. Proc.
NeurIPS SAS Workshop, Vancouver, BC, Canada. 75, 121, 127, 132
Zhou, S., Dong, L., Xu, S., and Xu, B. (2018a). Syllable-based sequence-to-sequence speech
recognition with the Transformer in Mandarin Chinese. Proc. Interspeech, Hyderabad, India.
95
Zhou, Y., Xiong, C., and Socher, R. (2018b). Improving end-to-end speech recognition with
policy learning. Proc. ICASSP, Calgary, AB, Canada. 70
Zhou, Y.T. and Chellappa, R. (1988). Computation of optical flow using a neural network.
Proc. ICNN, San Diego, CA, USA. 13
Zhu, Y., Ko, T., Snyder, D., Mak, B.K.W., and Povey, D. (2018). Self-attentive speaker
embeddings for text-independent speaker verification. Proc. Interspeech, Hyderabad, India.
141
Zweig, G., Yu, C., Droppo, J., and Stolcke, A. (2017). Advances in all-neural speech recognition.
Proc. ICASSP, New Orleans, LA, USA. 64
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal
of the Acoustical Society of America, 33(2):248–248. 37