Speaker Independent Speech Recognition For Afan Oromo Language Using Hybrid Approach
BY TARIKU ENDALE
May, 2020
WOLISO, ETHIOPIA
APPROVAL SHEET
Submitted By:
AMBO UNIVERSITY WOLISO CAMPUS
SCHOOL OF GRADUATE STUDIES
CERTIFICATION SHEET
As thesis research advisor, I hereby certify that I have read and evaluated this thesis, prepared under my guidance by Tariku Endale Terefe, entitled Speaker Independent Spontaneous Speech Recognition for Afan Oromo using Hybrid Approach, and I recommend that it be submitted as fulfilling the thesis requirement.
Declaration
I, the undersigned, declare that this thesis comprises my own work. In compliance with internationally accepted practices, I have duly acknowledged and referenced all materials used in this work. I understand that non-compliance is grounds for disciplinary action by the University and can also evoke penal action from the sources which have not been properly cited or acknowledged.
List of Tables
Table 1: Historical development of Speech Recognition, taken from (NEC R&D Meeting, 2009)
Table 2: Afan Oromo Alphabets
Table 3: Afan Oromo International Phonetic Alphabets
Table 4: Afan Oromo Numbering System, taken from (Yadeta, 2016)
Table 5: Articulation of Afan Oromo vowels, compiled from (Yadeta, 2016)
Table 6: Articulation of Afan Oromo consonant Alphabets, compiled from (Yadeta, 2016)
Table 7: Afan Oromo phones used during Transcription, taken from (Yadeta, 2016)
Table 8: Results of experiments conducted by increasing Gaussian Mixtures
Table 9: Results obtained from Experiments of Language Model built by HTK
Table of Contents
Declaration
List of Tables
Acknowledgements
List of Figures
List of Acronyms
Abstract
CHAPTER ONE
1.1. INTRODUCTION
1.2. Statement of the Problem
1.3. Research Questions
1.4. Objectives of the Study
1.4.1. General Objectives
1.4.2. Specific Objectives
1.5. Scope and Limitations of the Study
1.6. Research Methodology
1.6.1. Review of Related Literature
1.6.2. Data Selection and Preparation
1.6.3. Modeling Techniques
1.6.4. Testing Techniques
1.6.5. Tools used for the Study
1.6.6. Significance of the Study
1.6.7. Organization of the Study
CHAPTER TWO
LITERATURE REVIEW
2.1. Introduction
2.2. Overview of Automatic Speech Recognition (ASR)
2.2.1. Categories of Automatic Speech Recognition (ASR) System
2.2.2. History of Speech Recognition System
2.2.3. Speech Recognition Approaches
2.2.4. Overviews of HMM, GMM and ANN
2.3. Automatic Speech Recognition Process
2.4. Toolkits used in Speech Recognition
2.4.1. Data Preparation Tools
2.4.2. Training Tools
2.4.3. Recognition Tools
2.4.4. Analysis Tools
2.5. Related Works
CHAPTER THREE
AFAAN OROMO LANGUAGE
3. INTRODUCTION
3.3. AFAN OROMO ALPHABETS (QUBEE AFAAN OROMOO)
3.3.1. Sound Formation in Afan Oromo
3.3.2. IPA Representation of Afan Oromo
3.3.3. Morphological Features in Afan Oromo
3.3.4. Afan Oromo Phonetics
3.3.5. Articulation of Afan Oromo Vowels
3.3.6. Articulation of Afan Oromo Consonants
3.4. Data Preparation
3.5. Audio Pre-processing
3.6. Transcribing the Segmented Speech
3.7. Hidden Markov Toolkits (HTK)
CHAPTER FOUR
EXPERIMENTS AND OUTCOMES
4. Introduction
4.3. Data Preparation Phase
4.3.1. Pronunciation Dictionary
4.3.2. Creating Transcription Files
4.3.3. Feature Extraction
4.4. Training Phase
4.4.1. Creating Mono and Tri-Phone HMMs
4.4.2. Re-estimating Mono-phones
4.4.3. Refinement and Optimization
4.5. Recognition Phase
4.6. Analysis Phase
4.7. Challenges
CHAPTER FIVE
CONCLUSION AND RECOMMENDATIONS
5.1. Conclusion
5.2. Recommendations
References
Appendices
Appendix A: Summary of Tools used in the study
Appendix B: Samples of prompts afanoromotrainprompt.txt and afanoromotestprompt.txt
Appendix C: Samples of Pronunciation Dictionary
Appendix D: Number of alternative pronunciation dictionary
Appendix E: Samples of coding for afanoromotraincode.scp and afanoromotest.scp code
Appendix F: Configuration Tools
Appendix G: Prototype of Proto files
Appendix H: Editing script of tree.hed
Acknowledgements
It is my great pleasure to heartily thank my God for the work he is doing in my daily life. Past life situations could not have been withstood if God were not helping me in all, at all, and for all in my life. God, thank you very much!
My heartfelt thanks go to my advisor, Dr. Getachew Mamo, for his constructive comments and guidance. I am thankful to him because his guidance and open comments supported me through the completion of this research. I extol Dr. Getachew Mamo for his prayer.
I would also like to thank my wife, Demitu Kebede, and my son, Goftanbon Tariku, for the time they have given me and for supporting me in all conditions while I was learning.
My heartfelt thanks also go to Mr. Daba Ararsa, the General Manager of Ambo Urban Water Supply and Sewerage Service Enterprise, for his support in all and with all during my study. I also thank all the staff of Ambo Urban Water Supply and Sewerage Service Enterprise, especially Mr. Teshome Kenea, for his encouragement and support.
List of Figures
Figure 1: Overview of Automatic Speech Recognition (ASR)
Figure 2: Automatic Speech Recognition Components in Process, taken from (Yadeta, 2016)
List of Acronyms
EM: Expectation-Maximization
Abstract
Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computer. The goal of speech recognition is to develop a model that automatically converts speech utterances into a sequence of words. With the similar objective of transforming Afan Oromo spoken words into an equivalent sequence of words, this study explored the possibility of developing Speaker Independent Spontaneous Speech Recognition for the Afan Oromo Language using hybrid models (Hidden Markov Model, Artificial Neural Network, and Gaussian Mixture Model).
A speaker independent spontaneous speech recognizer for Afan Oromo was built using conversational speech between two or more speakers. The training data was planned to comprise 15,000 utterances, of which 14,500 utterances were used for training and the other 500 for testing. Automatic speech recognition (ASR) on some controlled speech has achieved almost human performance; however, performance on spontaneous speech decreases drastically due to the diversity of speaking styles, speaking rate, the presence of additive and non-linear distortion, accents, and weakened articulation. The spontaneous speech recognizer for Afan Oromo was developed with a speech database containing 14,500 utterances for training the acoustic model. The language model was developed using 120MB of text data and a 30MB audio dataset; the results obtained are presented in Chapter Four.
Keywords
Speech recognition, acoustic modeling, language modeling, finite state network, dialog systems
CHAPTER ONE
1.1. INTRODUCTION
Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text (2013).
It is a computer technology that enables a device to recognize and understand spoken words by digitizing the sound and matching its pattern against stored patterns (Adugna, 2015). It enables the computer to understand real spoken words and convert them to text or actions. Speech recognition (also known as Automatic Speech Recognition) is a technology that allows spoken word input into systems: it enables the user to talk to mobile phones and computers, and it uses the spoken word as an input to trigger some action. It is the process of finding the sequence of linguistic characters that forms the spoken word. Speech is a friendly human interface for communication, but spontaneous speech differs in some important ways from the types of speech for which human language technology is often developed (Elizabeth, 2012).
The human ability to communicate with one another has inspired researchers to develop systems that can imitate human beings. Different researchers have been working on several fronts to decode most of the information from the speech signal. Some of these fronts include tasks like identifying speakers by voice, detecting the language being spoken, transcribing speech, translating speech, and understanding speech. Among all speech tasks, automatic speech recognition (ASR) has been the focus of many researchers for several decades. In this task, the linguistic message is the area of interest.
An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters (IJCSIS, 2009). Continuous speech recognizers allow users to speak almost naturally while the computer determines the content (basically, computer dictation). Recognizers with continuous speech capabilities are among the most difficult to create, because they must utilize special methods to determine utterance boundaries (IJCSIS, 2009).
Afan Oromo is one of the major languages that is widely spoken and used in Ethiopia. Currently it is an official language of Oromia state (which is the largest region in Ethiopia). Read speech, by contrast with spontaneous speech, is a common feature of database-query spoken language systems (Ward, 1987); it has a very consistent rate and articulation across the sentence and across speakers (Ward, 1987).
1.2. Statement of the Problem
The objective of speech recognition is to capture the human voice in a digital computer and decode it into corresponding text. The ultimate goal of any automatic speech recognition system is to develop a model that converts speech utterances to text or words (Getachew, 2009). Human beings need, and are trying, to communicate with machines such as computers by voice using natural language processing. That is why researchers are investigating speech recognition technology for different languages, following the writing, reading and speaking rules of each language. Researchers have shown the feasibility of such systems for several languages.
Apart from foreign languages, some researchers have conducted research on Ethiopian languages: (Zegaye, 2003) developed a large vocabulary, speaker independent, continuous Amharic speech recognizer using an HMM-based approach; (Solomon, 2005) explored various possibilities for developing a large vocabulary speaker independent continuous speech recognition system for Amharic; (Adugna, 2015) developed a spontaneous speech recognizer for Amharic using spontaneous speech such as conversation between two or more speakers; (Hafte, 2009) tried to design a speaker independent continuous Tigrigna speech recognition system; and (Abdella, 2010) tried to explore the possibility of developing a prototype speaker dependent speech recognition system.
Besides these local researchers, there is research done on Afan Oromo as well: a speech recognition system using a hybrid Hidden Markov Model and Artificial Neural Network, and a large vocabulary, speaker independent continuous speech recognition system for Afan Oromo using a broadcast news speech corpus (Yadeta, 2016). The reported performance of the latter was low, that is, it had a large word error rate. The researcher attributed this to the use of a bigram language model and recommended that using a trigram would improve performance. See Chapter 2 for a detailed review of these works.
Most of the aforementioned researchers used the Hidden Markov Model for speech recognition development. Any work with Hidden Markov Models requires three things: estimating the A (transition) matrix, estimating the B (observation) matrix, and estimating the initial vector π. All of these impose limitations, and HMMs do not encode the physics of the vocal tract. The Artificial Neural Network is a popular tool in condition monitoring (South Africa UWJ, 2006). The Gaussian Mixture Model is a classification tool in pattern recognition tasks like speech and face recognition (South Africa UWJ, 2006), while the Hidden Markov Model is used for modeling dynamic signals such as speech. The success of GMM in classification has also been demonstrated by many researchers, such as Cardinaux and Marcel (Reynolds, 2000), who compared GMM and MLP in face recognition and speech recognition. A Markov chain is a random process of discrete-valued variables that involves a number of states (South Africa UWJ, 2006). Therefore, this thesis has been conducted with a hybrid model to develop a successful speech recognizer for Afan Oromo, in which the machine is trained on speech that is spontaneous rather than rehearsed.
1.3. Research Questions
What are the challenges of developing spontaneous speech recognition for Afan Oromo using a corpus collected from social media like Oromia Broadcasting Network?
What are the techniques and methodologies that can be used to develop a speaker independent spontaneous speech recognizer for Afan Oromo?
1.4. Objectives of the Study
1.4.1. General Objectives
The general objective of the study is to improve the development of spontaneous speech recognition for Afan Oromo using a hybrid approach.
1.4.2. Specific Objectives
To explore the methodologies, tools and techniques used in previous research.
To build acoustic and language models for the collected audio and text corpora.
To build a prototype that recognizes words spoken by a real person to the machine using the hybrid model.
1.5. Scope and Limitations of the Study
The main aim of this study is to scrutinize the possibility of developing ASR for Afan Oromo using a corpus collected from Afan Oromo social media such as Oromia Broadcasting Network and Oromia Media Network, together with a self-recorded dataset. Due to time and financial constraints, only about 17 hours of speech corpus were collected, and the text corpus was collected from the Kallacha Oromiyaa newspaper. A limitation of the research is that it has no component identifying the age, dialect or sex of the speaker. A bigram language model is not used in this thesis; instead, a trigram is used for the large Afan Oromo dataset, because the bigram approximation considers only one word of context.
1.6. Research Methodology
The research methodology used in this study comprises reviewing related literature, data collection, data preparation, data modeling, and evaluation.
1.6.1. Review of Related Literature
To find the approaches, techniques, methods and tools used in speech recognition theses already developed for different languages, a number of studies were reviewed. In addition, research papers, articles and other literature on related topics were reviewed.
1.6.2. Data Selection and Preparation
The data was collected from Oromia Broadcasting Network, Oromia Media Network and Finfine Integrated Broadcast, and was self-recorded, in both audio and text format. In addition to those media, the researcher recorded speech manually from 10 male and 5 female speakers at Sol Studio in Ambo town. Out of the audio speech collected from the broadcasting networks, the researcher constructed around 4000 sentences after audio pre-processing, giving 4000 utterances for the experiments of the study. Also, after text normalization, a text corpus of 3,000 sentences was prepared for language modeling purposes. For training the built system, the researcher used 3500 utterances, and 500 utterances were used for testing. The detailed explanation is presented in Chapter Three.
1.6.3. Modeling Techniques
Fulufhelo V. Nelwamondo (December 2005, revised 2006) clearly stated the definitions and appropriate functions of the three models: Artificial Neural Networks, Gaussian Mixture Models and Hidden Markov Models. According to this definition, the GMM is a classification tool for pattern recognition, particularly in speech and face recognition, while the Hidden Markov Model is used in machine controlling (Bernard, 1997), speech recognition (Zhang, 1998) and fault detection (Zhang, 1998). According to Jurafsky and Martin (2007), the Hidden Markov Model (HMM) is the more appropriate model for speech recognition because it is very rich in mathematical structure (Gonfa, 2016). The Gaussian Mixture Model has been a reliable classification tool in many applications of pattern recognition, particularly in speech and face recognition, and GMMs have been shown to perform better than Hidden Markov Models in text-independent speaker recognition (South Africa UWJ, 2006). Hence, this study uses a hybrid model of Artificial Neural Network (ANN), Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM).
1.6.4. Testing Techniques
To test the accuracy of the trained system, the Word Error Rate (WER) is used, because it is the standard evaluation metric for speech recognition systems and is provided by the HTK toolkit (CUED, 2009). WER measures how faithfully the speech recognizer returns the word string while in operation, and is computed from the insertion, deletion and substitution errors in the recognized words:

WER = ((S + D + I) / N) × 100

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, H is the number of correct words, and N is the number of words in the reference. A high word error rate shows low accuracy, and a low word error rate shows high accuracy. To evaluate the performance of our speech recognizer we take the word accuracy (Wacc), which can be obtained from the word error rate as

Wacc = 100 − WER.
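As an illustration of how this metric behaves, the following short Python sketch computes WER by dynamic-programming alignment of a hypothesis against a reference transcript and derives Wacc from it. It is a minimal re-implementation for illustration only, not the HTK HResults tool, and the sample word lists are hypothetical.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment of two word lists."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edits (S + D + I) turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                         # i deletions
    for j in range(m + 1):
        d[0][j] = j                         # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub,              # substitution (or exact match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return 100.0 * d[n][m] / n

ref = "mana keenya gaarii dha".split()      # hypothetical reference transcript
hyp = "mana gaarii dha".split()             # hypothetical recognizer output
w = wer(ref, hyp)
print("WER = %.1f%%, Wacc = %.1f%%" % (w, 100 - w))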
1.6.5. Tools used for the Study
The collected audio speech was saved in the WAV audio file format during pre-processing. The audio processing tool was Audacity, and the Hidden Markov Model Toolkit (HTK), a toolkit for building hidden Markov models that can model any time series, was used for building the recognizer. Microsoft Visual Studio 2017 with full Python libraries was used for prototyping. HTK was preferred over other toolkits due to the researcher's familiarity with it, because it is freely available for academic and research use, and because it is a state-of-the-art toolkit.
1.6.6. Significance of the Study
Human beings need to communicate with machines like computers through their vocal speech, and speech recognition is essential to the technologies that make this possible; this is the motivation for developing speaker independent spontaneous speech recognition for Afan Oromo. Many researchers have attempted to develop speech recognition of different types, like isolated and continuous speech recognition; this thesis focuses on developing spontaneous speech recognition for Afan Oromo using a hybrid approach.
1.6.7. Organization of the Study
This thesis comprises five chapters. Chapter Two reviews the related literature in detail. Chapter Three presents detailed information about the Afan Oromo language. Chapter Four covers data preparation, the experiments and the outcomes of the work, and the final chapter presents the conclusion and recommendations.
CHAPTER TWO
LITERATURE REVIEW
2.1. Introduction
In this chapter, the overview of ASR systems, the types of ASR, the applications of ASR, and other aspects of speech recognition are described; the approaches and tools used in the area of speech recognition are also discussed. At the end of the chapter, a survey of ASR work is given in the related works section.
2.2. Overview of Automatic Speech Recognition (ASR)
Speech recognition, which is also called Automatic Speech Recognition (ASR), works by taking audio speech as input and giving a string of words as output (Jurafsky, 2014). To develop an ASR system, audio is needed as the input, and large text data is required for building the language model. Here the required audio is real-world, spontaneous speech. Developing a spontaneous ASR system from real-world speech is not as easy as using a read speech corpus for continuous speech recognition: continuous speech recognition does not involve cut-offs and fillers, whereas spontaneous speech contains false starts and pauses while talking, because it takes its input from real speakers.
2.2.1. Categories of Automatic Speech Recognition (ASR) System
Several works in the literature reveal that ASR systems can be classified into different categories based on the nature of the utterance, the speaker type, and the vocabulary size (Suman et al., 2015).
Types of Speech Recognition Based on the Nature of Utterances
Depending upon the nature of the utterances, ASR systems can be classified as isolated word, connected word, continuous, and spontaneous speech recognition systems.
Isolated word speech recognition system: a recognizer that recognizes single words. In such systems, a pause or silence is needed before and after each word. It is suitable for conditions where the user is required to give only a single word. This type of ASR system is the simplest, because the word boundaries are easily identified (Yadeta, 2016). The beginning and the end of each word can be detected directly from the energy of the signal (Manan, 2013).
Connected word speech recognition system: this type of recognizer is similar to the isolated word recognizer; the difference is that it allows separate utterances to be run together with only a minimal pause between the words. Recall, however, that the pause is still what shows the boundary of words.
Spontaneous speech recognition system: this type of recognizer handles natural utterances, which can be distinguished from well-planned utterances like radio news and movie dialogues (Mikio Nakano, 2013). An example of this kind of speech is natural conversation between speakers.
Types of Speech Recognition Based on Speaker Class
Speaker dependent systems are developed for a particular speaker, and their performance depends on the speaker's characteristics like accent and dialect (Adugna, 2015). Speaker independent speech recognition is not as easy and cheap as speaker dependent recognition (Adugna, 2015). It is more difficult, because the internal representation of the speech must somehow be global enough to cover all types of voices and all possible ways of pronouncing words, and yet specific enough to discriminate between the words of the vocabulary.
Types of Speech Recognition Based on Vocabulary Size
Small vocabulary: speech recognition with 1 to 100 words; it is suitable for simple command-and-control tasks.
Medium vocabulary: speech recognition with 101 to 1000 words (Yadeta, 2016).
Large vocabulary: speech recognition with 1001 to 10,000 words.
Very large vocabulary: speech recognition developed with more than 10,000 words.
2.2.2. History of Speech Recognition System
Automatic speech recognition started in the 20th century, as mentioned by (Rabiner and Juang, 2004). Speech recognition began with an isolated word recognition system for a single speaker, built at Bell Laboratories in 1952. It was designed to recognize the ten digits (one to nine and zero).
The first speech recognition systems, developed in the 1960s, handled small vocabularies of only 10-100 words using the acoustic-phonetic approach. Systems for medium vocabulary speech recognition of 101 to 1000 words followed in the 1970s (Yadeta, 2016); these were also developed for connected digits and continuous speech in addition to isolated word recognition (Rabiner & Juang, 2004). The evolution of ASR continued with the idea of large vocabulary recognition starting in the 1980s, with systems developed for connected words and continuous speech by applying the statistical approach. Continuing from large vocabulary systems, very large vocabulary ASR was investigated in the 2000s for spoken dialogue systems.
Table 1: Historical development of Speech Recognition taken from (NEC R & D Meeting 2009)
2.2.3. Speech Recognition Approaches
According to Rabiner and Juang (1993), there are several approaches to carrying out automatic speech recognition:
a. Acoustic-Phonetic Approach
b. Template-Based Approach
c. Statistical (Stochastic) Approach
d. Artificial Intelligence (Knowledge-Based) Approach
a. Acoustic-Phonetic Approach
This approach relies on knowledge of phonetics and linguistics to guide the search process (Yadeta, 2016; Adugna, 2015). It depends on finding speech sounds and providing appropriate labels to those sounds. The approach has drawbacks, as mentioned by Rabiner and Juang (1993); even so, it can be used in artificial-intelligence-based recognizers.
b. Template-Based Approach
In this approach, unknown speech is compared against a set of pre-recorded examples (templates) in order to find the example that most closely fits the input. It extracts features from the speech signal and then matches them against templates with similar features (Adugna, 2015).
c. Statistical Approach
This approach is an extension of the template-based approach using more powerful mathematical and statistical tools. It uses probabilistic models to deal with the uncertain and incomplete information found in speech recognition; the most widely used such model is the HMM, although in this thesis we use all three of HMM, GMM and ANN. The approach works by collecting a large corpus of transcribed speech recordings and training the computer on it; at run time, statistical processes search through the space of all possible solutions and pick the statistically most likely one.
d. Artificial Intelligence (Knowledge-Based) Approach
The main idea of this approach is collecting and employing knowledge from different sources in order to perform the recognition process. The knowledge sources include acoustic, lexical, syntactic, semantic and pragmatic knowledge, all of which are important for speech recognition. The approach is a hybrid of the acoustic-phonetic approach and the pattern recognition approach: it exploits the ideas and concepts of both, and it uses information regarding linguistics, phonetics and spectrograms (Debela, 2011).
2.2.4. OVERVIEWS OF HMM, GMM AND ANN
The core of the pattern-matching approach to speech recognition is a set of statistical models. Speech has a sequential structure and can be encoded as a sequence of spectral vectors, and the hidden Markov model (HMM) provides a natural framework for constructing such models.
An HMM is a Markov chain plus an emission probability function for each state. In a plain Markov chain each state represents one observable event, but that model is too restrictive: for a large number of observations the size of the model explodes, and the case where the range of observations is continuous is not covered at all (Jurafsky D., 2009). A first-order hidden Markov model is specified by:

S = {S1, S2, S3, ..., SN}: a set of states (usually indexed by i, j); st = i means that the model is in state i at time t.

A = {aij}: a transition probability matrix, each aij representing the probability of moving from state i to state j, with aij ≥ 0 for all i, j and Σj=1..N aij = 1 for all i.

O = o1 o2 ... oT: a sequence of observations, each one drawn from a vocabulary V = v1, v2, v3, ..., vV.

B = bi(ot): an emission probability distribution, each bi(ot) expressing the probability of observation ot being generated from state i.

π = π1, π2, π3, ..., πN: an initial probability distribution over states; πi is the probability that the Markov chain starts in state i.
The three basic HMM problems are evaluation, decoding and training [21]. The next topics discuss these three problems and their solutions.
Computing likelihood (evaluation): given an HMM λ = (A, B, π) and an observation sequence O, determine the likelihood P(O|λ).
Decoding: given an observation sequence O and an HMM λ, discover the best hidden state sequence Q, which can be written as Q = Q1, Q2, Q3, ..., QT.
Learning: optimize the model parameters so that they best describe the observations. The data used here are called training sequences, since they are used for training the HMM. Training is one of the crucial elements of HMM work: it allows adjusting the model parameters to the observed data.
The three HMM problems are solved with the following algorithms.
Forward Algorithm for Computing Likelihood [P(O|λ)]
We want to find P(O|λ), given an observation sequence O = O1, O2, ..., OT and a model. The most straightforward way to find the solution is to enumerate every possible state sequence of length T. Consider one such state sequence Q = Q1, Q2, ..., QT, such that Q1 produces O1 with some probability, Q2 produces O2 with some probability, and so on. Enumerating via the chain rule in this way requires on the order of N^T state sequences, with calculations at every t = 1, 2, ..., T, which quickly becomes intractable; the forward algorithm computes the same likelihood efficiently by dynamic programming.
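To make the forward recursion concrete, the sketch below computes P(O|λ) for a small discrete HMM using the λ = (A, B, π) notation defined above. The model values are toy numbers chosen for illustration, not parameters estimated in this study.

import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(O | lambda) for a discrete HMM via the forward algorithm, O(N^2 T)."""
    T = len(obs)
    alpha = pi * B[:, obs[0]]               # initialization at t = 1
    for t in range(1, T):
        # induction: sum over all predecessor states, then emit o_t
        alpha = (alpha @ A) * B[:, obs[t]]
    return alpha.sum()                      # termination

A = np.array([[0.7, 0.3],                   # toy 2-state transition matrix
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],              # toy emission matrix, 3 symbols
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])                   # toy initial distribution
print(forward_likelihood(A, B, pi, obs=[0, 1, 2]))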
Viterbi Algorithm for Decoding the Hidden State Sequence [P(Q, O|λ)]
Unlike the likelihood computation, there is no single exact definition of the optimum state sequence. One way to find an optimum state sequence is to choose the states that are individually most likely, but this way has a serious flaw: if two states i and j are selected such that aij = 0, then despite each being the most likely state at times t and t+1, the result is not a valid state sequence. This is solved by deciding on an optimality criterion over whole sequences. The single best state sequence is found using a dynamic programming algorithm called the Viterbi algorithm, which has widespread application in speech recognition and is used for decoding.
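A matching sketch of the Viterbi recursion, reusing the toy λ = (A, B, π) from the forward example above, returns the single best state sequence together with its joint probability P(Q, O|λ):

import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden state sequence for a discrete HMM."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)       # backpointers to best predecessors
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]        # best final state
    for t in range(T - 1, 0, -1):           # backtrack through the pointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())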
Baum-Welch Algorithm for Training
Training iteratively increases the log likelihood and updates the current model to be closer to the optimal model. However, the optimal solution is not guaranteed, and adjusting the model parameters to maximize the probability of the observation sequence given the model remains challenging.
In this study, we use the GMM to represent a large class of sample distributions over our training data. The GMM problem and its solution are stated as follows, with the help of Maximum Likelihood (ML) estimation, which is used to find the model parameters that maximize the likelihood of the GMM given the training data. Given a sequence of T training vectors X = {x1, x2, x3, ..., xT}, the GMM likelihood, assuming independence between the vectors, is

P(X|λ) = ∏ t=1..T P(xt|λ),

where each density is a weighted sum of N Gaussian components,

P(x|λ) = Σ i=1..N pi g(x; μi, Σi), i = 1 to N.

Direct analytic maximization of this likelihood is not possible, and hence the iterative Expectation-Maximization (EM) algorithm was used, for training as well as for matching purposes. In each iteration, the normalized likelihoods (responsibilities)

nti = pi g(xt; μi, Σi) / Σ k=1..N pk g(xt; μk, Σk)

are computed, and the weights, means and covariances are then re-estimated from them.
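As a brief illustration of this EM-based estimation, the sketch below fits a diagonal-covariance GMM to stand-in feature vectors with scikit-learn, whose GaussianMixture class implements exactly this iterative E-step/M-step procedure. The random data and the component count are arbitrary placeholders for real MFCC vectors.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))   # placeholder for T x 39 MFCC training vectors

# EM fitting: the E-step computes the normalized likelihoods (responsibilities),
# the M-step re-estimates weights p_i, means mu_i and covariances Sigma_i.
gmm = GaussianMixture(n_components=8, covariance_type="diag", max_iter=100)
gmm.fit(X)

print(gmm.score(X))              # average log-likelihood of the data per sample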
Speech is the most efficient mode of communication between people. This being the case, people also want to communicate naturally with machines, and therefore the popularity of automatic speech recognition systems has greatly increased. There are different approaches to speech recognition, such as the Hidden Markov Model (HMM), Dynamic Time Warping (DTW), and Vector Quantization (VQ). Recognition is made difficult by variation in emotional state, gender, pitch, speed, volume, background noise and echoes (Mikio Nakano, 2013).
1. Feed-forward Artificial Neural Network: these are the first and simplest form of ANN (Kamble, 2005). In this network the information flows in only one direction, that is, only forward information flow is allowed, and the network has no loops or cycles. A neuron in layer 'a' can send data only to a neuron in layer 'b' if b is greater than a. Learning is the adaptation of the network to the training data; learning with a teacher is called supervised training, whereas learning without a teacher is called unsupervised training (Kamble, 2005). The back-propagation algorithm emerged to train a new class of layered feed-forward network called the Multi-Layer Perceptron (MLP). It generally contains at least two layers of perceptrons: one input layer, one or more hidden layers and an output layer. The hidden layer plays a very important role and acts as a feature extractor; it uses a nonlinear activation function. Trained to minimize classification error, the output layer acts as a logical net which chooses an index to send to the output on the basis of the input it receives from the hidden layer (Kamble, 2005).
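The layered feed-forward computation described above can be sketched in a few lines of NumPy; the layer sizes here are arbitrary illustrative choices, not a configuration used in this study.

import numpy as np

def mlp_forward(x, weights, biases):
    """One MLP forward pass: nonlinear hidden layers, softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)              # hidden layer as feature extractor
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()                      # class probabilities

rng = np.random.default_rng(0)
sizes = [39, 64, 10]                        # input dim, hidden units, classes
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
print(mlp_forward(rng.normal(size=39), weights, biases))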
2. Recurrent Neural Network (RNN): a neural network that operates in time. An RNN accepts an input vector, updates its hidden state via a non-linear activation function, and uses that hidden state, together with the next input, to produce its outputs over time.
ANNs have several attractive properties:
ANN is flexible in changing environments.
ANN can build informative models where conventional models fail, and can handle noisy or incomplete data.
ANN is a non-linear model that is easier to use and understand than statistical methods.
ANN has the ability to learn how to do a task based on the data given for training.
ANNs can create their own organization and require no supervision, as they can learn on their own.
ANN can be used in pattern recognition, which is a powerful technique for harnessing the information in data and generalizing from it.
2.3. Automatic Speech Recognition Process
Automatic Speech Recognition (ASR) is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform. Speech understanding goes one step further and gleans the meaning of the utterance in order to carry out the speaker's command.
Figure 2: Automatic Speech Recognition Components in Process, taken from (Yadeta, 2016)
i. Acoustic Model: models the relationship between the audio signal and the phonetic units in the language. It is a file that contains a statistical representation of each distinct sound that makes up a spoken word (Yadeta, 2016); it consists of the phonemes that form words.
ii. Language Model: responsible for modeling the word sequences in the language. A statistical language model is a probability distribution over sequences of words: for a word sequence Ws = w1, w2, w3, ..., wl of length l, it assigns the probability P(Ws) = P(w1, w2, w3, ..., wl) to the whole sequence. It is used to improve the performance of speech recognition and translation systems (Yadeta, 2016). A language model contains very large lists of words.
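As a sketch of how such a model assigns probabilities, the snippet below estimates trigram probabilities P(wi | wi-2, wi-1) from raw counts over a tiny hypothetical corpus; a real language model would be trained on the full text corpus and smoothed.

from collections import Counter

corpus = ["<s> <s> mana keenya gaarii dha </s>",   # hypothetical sentences
          "<s> <s> mana gaarii dha </s>"]

tri, ctx = Counter(), Counter()
for line in corpus:
    w = line.split()
    for a, b, c in zip(w, w[1:], w[2:]):
        tri[(a, b, c)] += 1      # trigram counts
        ctx[(a, b)] += 1         # two-word context counts

def p_trigram(w3, w1, w2):
    """Maximum-likelihood P(w3 | w1, w2); real systems apply smoothing."""
    return tri[(w1, w2, w3)] / ctx[(w1, w2)] if ctx[(w1, w2)] else 0.0

print(p_trigram("gaarii", "mana", "keenya"))       # 1.0 in this toy corpus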
iii. Decoder: a software program that takes the sounds spoken by the speaker and searches for the equivalent acoustic model (Yadeta, 2016). When a match is made, the decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech, then searches the language model or grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program (Yadeta, 2016). The decoder has to deal with a huge search space obtained by combining the acoustic model and the language model; its aim is to determine the most likely word sequence W, given the language model, pronunciation dictionary and acoustic model (Adugna, 2015).
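This search is conventionally summarized by the fundamental equation of statistical speech recognition, which combines exactly these two knowledge sources, the acoustic model P(O|W) and the language model P(W):

W* = argmax over W of P(W|O) = argmax over W of P(O|W) · P(W)

where O is the sequence of acoustic observations; the denominator P(O) is dropped because it does not depend on W.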
Recognition (pattern matching) refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) of each element that can be recognized.
2.4. Toolkits used in Speech Recognition
There are several toolkits implementing the algorithms needed in speech recognition, such as CMU Sphinx, the Hidden Markov Toolkit (HTK), Kaldi, Julius and ISIP. HTK is a portable software toolkit for building and manipulating systems that use continuous density hidden Markov models. It has been developed by the Speech Group at Cambridge University Engineering Department. HMMs can be used to model any time series, and the core of HTK is similarly general purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular speech recognizers. It can be used to perform a wide range of tasks in this domain, including isolated or connected speech recognition using models based on whole-word or sub-word units, and it is especially suitable for building continuous speech recognition systems.
2.4.1. Data Preparation Tools
HSLab is an interactive label editor for manipulating speech label files. It can be used both to record speech and to manually annotate it with any required transcriptions (Adugna, 2015). An example of using HSLab would be to load a sampled waveform file, determine the boundaries of the speech units of interest and assign labels to them. Alternatively, an existing label file can be loaded and edited by changing current label boundaries, deleting labels and creating new ones. HSLab is the only tool in the HTK package which provides a graphical user interface (Adugna, 2015). HCopy is used to parameterize the data just once. HList is used to check the contents of any speech file and to check conversions before processing large quantities of data. HLEd is a script-driven label editor designed to make transformations to label files, like translating word-level label files to phone-level label files, merging labels or creating tri-phone labels. HLEd can also output files to a single Master Label File (MLF), which is usually more convenient for subsequent processing (Adugna, 2015).
2.4.2. Training Tools
HTK allows HMMs to be built in any desired topology (Adugna, 2015). HMM definitions can be stored externally as simple text files, and hence it is possible to edit them with any convenient text editor (Adugna, 2015). The lists of words in the datasets were recorded from 20 speakers from different areas with different accents, so that the speech agent can be added to the Windows operating system and a program written in C# and Python can retrieve the data from it. However, because this work concerns spontaneous speech between the machine and native or non-native speakers of the language, we prepared a list of 15,000 words, of which 500 are used for testing.
For data storage we used comma-delimited files in Microsoft Excel 2010, from which the word lists were formed; precision and recall were used to examine the processing.
The second step of system building is to define the topology required for each HMM by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files, and hence it is possible to edit them with any convenient text editor. Alternatively, the standard HTK distribution includes a number of example HMM prototypes and a script to generate the most common topologies automatically. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored: the purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM, and the actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given, but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely.
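To make the prototype definition concrete, the short Python script below writes a plain-text prototype in the standard HTK HMM definition format: a 5-state model (3 emitting states) for 39-dimensional MFCC_0_D_A features, with placeholder zero means, unit variances, and roughly equal transition probabilities out of each state. This is a minimal sketch of the format only; the actual prototype used in this study is the one listed in Appendix G.

N_DIM = 39  # MFCC_0_D_A: 13 static + delta + acceleration coefficients

def state_block(state_id):
    """One emitting state with placeholder mean and variance vectors."""
    means = " ".join(["0.0"] * N_DIM)
    variances = " ".join(["1.0"] * N_DIM)
    return ("<State> %d\n<Mean> %d\n %s\n<Variance> %d\n %s\n"
            % (state_id, N_DIM, means, N_DIM, variances))

transitions = ("<TransP> 5\n"
               " 0.0 1.0 0.0 0.0 0.0\n"   # entry state always moves on
               " 0.0 0.6 0.4 0.0 0.0\n"   # emitting states: loop or advance
               " 0.0 0.0 0.6 0.4 0.0\n"
               " 0.0 0.0 0.0 0.7 0.3\n"
               " 0.0 0.0 0.0 0.0 0.0\n")  # exit state: no outgoing transitions

proto = ("~o <VecSize> 39 <MFCC_0_D_A>\n~h \"proto\"\n<BeginHMM>\n<NumStates> 5\n"
         + "".join(state_block(s) for s in (2, 3, 4))
         + transitions + "<EndHMM>\n")

with open("proto", "w") as f:
    f.write(proto)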
2.4.3. Recognition Tools
HTK provides a recognition tool called HVite that allows recognition using language models and lattices, and HLRescore, a tool for processing lattices generated using HVite (or other tools). HVite uses the token passing algorithm to perform Viterbi-based recognition. It takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance.
Recognition can then be performed on either a list of stored speech files or on direct audio
input. HVite can support cross-word triphones and
it can run with multiple tokens to generate lattices containing multiple hypotheses. It can
also be configured to rescore lattices and perform forced alignments. The word networks
needed to drive HVite are usually either simple word loops in which any word can follow
any other word or they are directed graphs representing a finite-state task grammar. In the
former case, bigram probabilities are normally attached to the word transitions. Word
networks are stored using the HTK standard lattice format. This is a text-based format and
hence word networks can be created directly using a text-editor. However, this is rather
tedious and hence HTK provides two tools to assist in creating word networks. Firstly,
HBuild allows sub-networks to be created and used within higher level networks. Hence,
although the same low level notation is used, much duplication is avoided. Also, HBuild
can be used to generate word loops and it can also read in a backed-off bigram language
model and modify the word loop transitions to incorporate the bigram probabilities. Note
that the label statistics tool HLStats mentioned earlier can be used to generate a backed-off bigram language model. Secondly, as an alternative to specifying a word network directly, a higher-level grammar notation can be used. This notation is based on the Extended Backus Naur
Form (EBNF) used in compiler specification and it is compatible with the grammar
specification language used in earlier versions of HTK. The tool HParse is supplied to
convert this notation into the equivalent word network. Whichever method is chosen to
generate a word network, it is useful to be able to see examples of the language that it
defines. The tool HSGen is provided to do this. It takes as input a network and then
randomly traverses the network outputting word strings. These strings can then be
inspected to ensure that they correspond to what is required. HSGen can also compute the
empirical perplexity of the task. Finally, the construction of large dictionaries can involve
merging several sources and performing a variety of transformations on each source. The
dictionary management tool HDMan is supplied to assist with this process. HLRescore is
a tool for manipulating lattices. It reads lattices in standard lattice format (for example, as produced by HVite) and applies one of the following operations to them:
• finding the 1-best path through a lattice: this allows language model scale factors and insertion penalties to be optimized rapidly;
• expanding lattices with a new language model: this allows the application of more complex language models, e.g. 4-grams, than can be used efficiently in the decoder;
• converting lattices to equivalent word networks: this is necessary prior to using lattices as input recognition networks.
2.4.4. Analysis Tools
Once the HMM-based recognizer has been built, it is necessary to evaluate its performance. This is usually done by using it to transcribe some pre-recorded test sentences and matching the recognizer output against the correct reference transcriptions. This comparison is performed by a tool called HResults, which uses dynamic programming to align the two transcriptions and then count substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and output formats used by HResults are compatible with those used by the US National Institute of Standards and Technology (NIST). As well as global performance measures, for word-spotting applications HResults can also compute Figure of Merit (FOM) scores and Receiver Operating Curve (ROC) information.
There are three possible types of errors that can occur in speech recognition: substitution errors, in which a reference word is replaced by a different word; deletion errors, in which a reference word is omitted; and insertion errors, which occur when the ASR system generates a word that does not correspond to any word in the reference transcript. Once it finds the optimal alignment, HResults calculates the number of substitution errors (S), deletion errors (D) and insertion errors (I), where N is the total number of labels in the reference:

Percent Correct = ((N − D − S) / N) × 100%    (1)

As observed from the above equation, percent correct ignores the insertion errors. The percentage accuracy takes them into account:

Percent Accuracy = ((N − D − S − I) / N) × 100%    (2)

HResults outputs both of the above measures during result analysis (Adugna, 2015).
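To connect these tools end to end, a recognition-and-scoring run can be driven from Python as sketched below. The command-line options follow standard HTK tutorial usage of HVite and HResults, but the file names (HMM definitions, word network, dictionary, script and label files) are placeholders standing in for this study's actual artifacts.

import subprocess

# Decode the coded test utterances with HVite (placeholder file names).
subprocess.run([
    "HVite",
    "-H", "hmmdefs",            # trained HMM definitions
    "-S", "afanoromotest.scp",  # list of coded test files
    "-i", "recout.mlf",         # where recognized transcriptions are written
    "-w", "wdnet",              # word network built from the language model
    "-p", "0.0", "-s", "5.0",   # insertion penalty and LM scale factor
    "dict", "monophones"        # pronunciation dictionary and HMM list
], check=True)

# Align the output against the references and report correctness/accuracy.
subprocess.run([
    "HResults",
    "-I", "testref.mlf",        # reference master label file
    "monophones", "recout.mlf"
], check=True)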
2.5. Related Works
Research on spontaneous speech recognition has been conducted for different languages. (Adugna, 2015) worked on spontaneous speech recognition for the Amharic language using HMMs with a 90-minute speech database; the language model was developed using 30MB of text data and different smoothing techniques, 68 sentences were used for testing, and the result obtained was 47% accuracy and 53% WER. The collected database is very small, as Adugna himself notes: "The size of data we have used is very low in size if compared with another databases used for other language (CSJ)" (Adugna, 2015). Using CSJ training data 510 hours long and around 6.84M words for language modeling, a best word error rate of 25.3% was obtained (Furui S, et al., 2005).
A spontaneous speech recognizer for Japanese was developed by (Furui S, et al., 2005) using two different corpora: the "Corpus of Spontaneous Japanese (CSJ)" and text collected from the World Wide Web. In this experiment, part of the CSJ was used as the test set for the evaluation of speech recognition (Furui S, et al., 2005). (Yadeta, 2016) developed a large vocabulary, speaker independent continuous speech recognition system for Afan Oromo using a broadcast news speech corpus. The speech recognizer was developed from 57 anchors or speakers (42 males and 15 females) using 2953 sentences totalling 06:15:38 hours of speech. In addition, text data of about 2000 sentences
was used for language modeling purposes, collected from Kallacha Oromiyaa. Out of the 2953 sentences, 2653 (a 10,138 unique-word dictionary) were used for training the speech recognizer and the remaining 300 sentences (a 2,516 unique-word dictionary) were used to test the developed system. Speakers who were involved in training were not involved in testing: of the 57 speakers, only 12 (9 males and 3 females) participated in testing [1].
(Murat, 2003) conducted research on large vocabulary continuous speech recognition for Turkish using HTK. In this work, he used un-segmented data consisting of 7650 utterances spoken by 104 male and 89 female speakers for training, and performed five experiments. The first was an IWR (Isolated Word Recognition) task, for which he cut 40 speech segments containing only one word from the test data. In the second, he tested a CWR (Connected Word Recognition) system with no grammar, meaning every word can follow any other word with no rule applied. In the third experiment, he tested a CWR system with a simple grammar that he designed, in which follower words are determined according to that grammar. The fourth experiment was a CSR (Continuous Speech Recognition) system in which the cross-word expanded network is based on bigrams that include stems and endings. The fifth experiment was performed in order to test the bigram language model actually proposed in his thesis, developed using the HTK language modeling tool (HLM). The test utterances contained 220 sentences (a vocabulary of 1168 words) that were randomly
selected from the text corpus and continuously spoken by 6 speakers (4 male, 2 female). After performing the five experiments and comparing the correct sentence recognition rates (CSRR) of 20.09% and 30.90% obtained in experiments 4 and 5, experiment 5 performed better than experiment 4, since its correct sentence recognition rate is higher. Finally, since he applied no smoothing algorithm to the language model built, he concluded that applying a smoothing technique would be better.
(Adugna, 2015) built the Amharic recognizer with a speech database from a single domain containing 1550 utterances for training the acoustic model. The language model was developed using 30MB of text data and different smoothing techniques, and 68 sentences from the same domain were used for testing. Using a context dependent acoustic model of 8 Gaussian mixtures and a tri-gram language model with absolute discounting smoothing, the results obtained were 47% word accuracy and 53% WER. Based on the experimental results, if the data size is small, using training and testing data from the same domain is advisable.
CHAPTER THREE
AFAAN OROMO LANGUAGE
3. INTRODUCTION
Afan Oromo belongs to the Cushitic branch of the Afro-Asiatic language family of Africa, which is divided into six major families: Chadic, Berber, Egyptian, Cushitic, Omotic, and Semitic. The Cushitic family is further divided into four groups: North, Central, South and East. Accordingly, Afaan Oromo is one of the languages of the Lowland group within the East Cushitic group, and the most widely spoken language of the Cushitic family. The Ethiopian census (2007) indicates that Afaan Oromo speakers make up around 34% of the total Ethiopian population (Yadeta, 2016). (Abraham, 2014) also stated that Afaan Oromo is spoken by more than 40 million people around the world; the majority of these speakers live in Ethiopia, and others live in neighboring countries such as Kenya and Somalia.
3.3. AFAN OROMO ALPHABETS (QUBEE AFAAN OROMOO)
Afan Oromo started to be written in 1991 G.C. with Latin alphabets called 'Qubee', which were formally approved in that year (Abraham, 1993). Afan Oromo has 26 letters adopted from the Latin alphabet plus six additional compound letters, for a total of 32 (Yadeta, 2016). Afan Oromo alphabets are divided into two groups: the vowels, which are five (5), and the consonants, which are twenty-seven (27) (Yadeta, 2016).
Table 2: Afan Oromo Alphabets
3.3.1. Sound Formation in Afan Oromo
Afan Oromo has five vowels that are used to form sounds with consonants or by themselves
(Yadeta, 2016). A sound can be either long or short. A Short Sound (Sagalee Gabaabaa) is a sound formed from a consonant and a single vowel. Sounds have no meaning independently unless they relate with other sounds to form a word. E.g. Ma- is a sound formed from the consonant 'M' and the single vowel 'a'. Ma- has no meaning by itself, but if Ma- is concatenated with -na, then the word 'Mana', which means Home or House in English, is formed. Both Ma- and -na are short sounds. A Long Sound (Sagalee Dheeraa) is formed from a consonant and two vowels of the same kind; in Afan Oromo, no two different vowels form a long sound. E.g. the sound Gaa- is formed from the consonant 'G' and two vowels of the same kind, '-aa'. This 'Gaa-' has no meaning by itself unless it is concatenated with other sounds that help it to have a meaning. So, 'Gaa-' and '-rii' form 'Gaarii', a meaningful word which means Good. Sounds can also be either geminated or non-geminated; e.g. the doubled -dd- in Baddaa, which means Highland, makes it a geminated sound. Afan Oromo consonants can be
geminated except [h] and Compound Symbols (Yadeta, 2016). Compound letters in Afan
Oromo are categorized under consonants (Hinsene, 2009). These compound symbols are constructed in the language from double consonants to represent one single consonant
(Yadeta, 2016). These consonants are pronounced as a single letter even though they are
33
formed from two independent consonants. E.g. Nyaata which means food, Dhadhaa which
means Butter are words and sounds formed from compound letters.
Afan Oromo Glottal Sound: Afaan Oromo has a glottal sound called 'Hudhaa', a diacritical
marker represented by a single quote ['] or sometimes by [h]. For instance, in Oromo words
like 'Ga'aa or Gahaa', which means enough, both ['] and [h] are used interchangeably, and
most of the time the single quote ['] is used (Yadeta, 2016).
The International Phonetic Alphabet (IPA) is used in our transcription, while Afan Oromo
itself is still written with Latin alphabets. The IPA representation is shown as follows.
Alphabets (Qubee): Aa Bb Cc CH ch Dd DH dh Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn NY ny Oo Pp PH ph Qq Rr Ss SH sh Tt TS ts Uu Vv Ww Xx Yy Zz
IPA for Alphabets: [a] [b] ...
A word is a collection of two or more sounds that has a full meaning. In Afan Oromo, words
can be formed from short sounds, long sounds or a hybrid of the two.
a. Word formation from short sounds: two sounds, each formed from a consonant
followed by a single vowel, are concatenated to each other. E.g. Ma- + -ra are two
independent meaningless sounds; 'Mara' is a meaningful word formed from the two sounds,
and it means All. Other examples are Lama (Two), Buna (Coffee),
Kuma (Thousand), Muka (Tree), Cina (Beside), Laga (River) and many others.
b. Word formation from long sounds: two sounds, each formed from a consonant
followed by two identical vowels, are concatenated to each other. E.g. Laa- + -gaa are
two independent meaningless sounds; after concatenation the word Laagaa, which
means Jaws, is formed. Many others like Laamaa (Starved person), Gaarii
(Good), Yaalii (Treatment), Mootii (King) and Loogii (Biasness) are formed in the same way.
c. Word formation from hybrid sounds: formed from the concatenation of a short
sound and a long sound, or a long sound and a short sound. Words like Garuu (but) are formed this way.
i. Number
Afan Oromo numbers are used to show either a singular or plural form of entities
(Yadeta, 2016). Afan Oromo words have their own singular and plural forms, which are
formed by adding suffixes to the base word. E.g. Nama (Person) is a single person, whereas
the plural form is derived by suffixation.
Nine          Sagal
Ten           Kudhan          "Two Digit" 10-99
Eleven        Kudha Tokko     "Qub-Lamee"
Twelve        Kudha Lama
...
Ninety Nine   Sagaltamii Sagal
ii. Gender
Gender is the other morphological feature of Afan Oromo; gender is called 'Koorniyaa' or
'Saala' in Afan Oromo (Yadeta, 2016). Afan Oromo nouns take suffixes that are used to
identify whether the gender is masculine or feminine. E.g. for Daa'ima (child), masculinity
or femininity cannot easily be identified without a suffix, but Daa'im- + -cha shows the
child is masculine and Daa'im- + -ttii shows the child is feminine.
Afan Oromo is spoken the way it is written, which makes it a phonetic
language (Yadeta, 2016). Afan Oromo is free of the problem of homonymy; that is, there are
no pairs of differently written words with the same pronunciation (like English write and
right). Even though Latin letters were adopted for developing the Afaan Oromo alphabet,
their pronunciation varies with the language, as we have discussed for Table 3. Also, in
the Afaan Oromo alphabet there is no difference in pronunciation between capital and small
letters: as in the English language, which also uses Latin letters, the pronunciations of
[A] and [a] are the same.
Phonetics is the study of speech sounds used in the languages of the world. It is concerned
with the sounds of languages, how these sounds are articulated and how the hearer perceives
them. Phonetics is related to the science of acoustics in that it uses many of the
techniques of acoustic analysis.
Phonology is the study of the sound patterns of a language. It describes the systematic
way in which sounds are differently realized in different contexts, and how this system of
sounds is related to the rest of the grammar. Phonology is concerned with how sounds are
organized and used in a language.
Morphology is the study of word formation and structure. It studies how words are put
together from their smaller parts and the rules governing this process. The elements that
are combined to form words are called morphemes; a morpheme is the smallest meaningful
unit of a language.
Afan Oromo has five long and five short vowels, which can be found either at the beginning,
in the middle or at the end of a word, next to a consonant. E.g. Aadaa consists of two long
vowels, at the beginning and at the end, next to the consonant d. The pronunciation of these
vowels is the same throughout the Oromia region. No word can be created without vowels in
Afan Oromo.
Table 5: Articulation of Afan Oromo vowels compiled from (Yadeta, 2016)
Vowels   Front             Central   Back
Close    i /ɪ/, ii /i:/              u /ʊ/, uu /u:/
Mid      e /ɛ/, ee /e:/              o /ɔ/, oo /o:/
Open                       a /ʌ/     aa /ɑ:/
A front vowel is formed by raising the tongue towards the hard palate, and a back vowel is
articulated at the back of the mouth, while [a] is articulated at the center of the mouth.
As discussed above, most Afan Oromo letters are categorized under consonants:
of the 32 alphabets, only five are vowels and the remaining 27 are consonants
(Yadeta, 2016). As seen above, the vowels are articulated in three positions (front, central
and back) and at three heights (close, mid and open) (Yadeta, 2016).
Additionally, in some sources the glottal sound symbol is categorized under consonants,
and we discussed earlier that it is represented by a single quote ['] or sometimes by the
letter [h]. However, in this section we elaborate on its pronunciation and manner of
articulation. Whether the symbol used for the glottal sound is the single quote or the
letter [h], its IPA representation is /Ɂ/, and it is articulated as a stop or fricative at
the glottal place of articulation. In this section, we discuss mostly the manner of
articulation of the consonants.
Table 6: Articulation of Afan Oromo consonant Alphabets (Compiled from Yadeta, 2016)
Manner of Articulation   Bilabial   Labiodental   Alveolar   Palatal   Velar     Glottal
Stop                     b, ph      f             d, n                 k, g, q   /Ɂ/
Fricative                                         s, z       sh                  h
Affricative                                                  c, h, j
Nasal                    m                        n          ny
Flap                                              r
Lateral                                           l
Semi-vowel               w                                   y         w
Afan Oromo consonants can be articulated as stops, fricatives, affricatives, nasals, flaps,
laterals or semi-vowels, with places of articulation bilabial, labiodental, alveolar,
palatal, velar and glottal.
i. Text Corpus
To estimate the probability of word sequences in speech recognition, the regularities of
the language are captured by the language model; language models are used to model
regularities in natural language (Yadeta, 2016). Large amounts of text are required for
the development of a language model. The text data were collected from the "Kallacha
Oromiyaa" newspaper and normalized manually, because the written text contains different
forms, such as numbers, that must be expanded or cleaned.
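As an illustration, a first normalization pass of this kind can be scripted in the shell; this is only a minimal sketch with hypothetical file names, since the actual normalization in this study was done manually:
# lowercase the text, strip digits and most punctuation (the hudhaa mark ['] is kept),
# and squeeze repeated spaces
tr 'A-Z' 'a-z' < kallacha_raw.txt | sed 's/[0-9]//g; s/[,.;:!?"()]//g' | tr -s ' ' > kallacha_norm.txt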
For acoustic modeling we need audio/speech data; in other words, speech is the primary
input to the recognizer system (Yadeta, 2016). We have used the OMN (Oromia Media
Network), OBN (Oromia Broadcasting Network) and FIB (Finfine Integrated Broadcast) Afan
Oromo programs. The sources and the number of utterances collected from each are as
follows: from the Oromia Media Network studio, 4500 utterances with a length of 05:15:40;
from the Oromia Broadcast Network studio, 5600 utterances with a length of 08:15:40; from
the Finfine Integrated Broadcast (FIB) studio, 3400 utterances with a length of 03:15:55;
and directly from Afan Oromo teachers, 1500 utterances. In total, 15000 utterances were
collected from these sources, for a total duration of around 17 hours.
As can be seen from the figures above, 37% were collected from the OBN studio, 30% from the
OMN studio, 23% from FIB and 10% from Afan Oromo teachers like Fetenu Chimdesa,
Lami Gesherbo and Sena Gemechu. The collected data were sourced both from direct
recording and from the internet.
The collected data exist in both text and audio formats. The audio must be pre-processed
using the Audacity tool: to construct speech data in a format suitable for HTK (the Hidden
Markov Model Toolkit), audio preprocessing was performed on both the collected and the
self-recorded data.
Speech Segmentation: The researcher tried to find out whether there might be a tool for
automatic audio segmentation; according to (Yadeta, 2016), there is no preferred tool that
can be used for this purpose.
Therefore, sentence segmentation was performed manually by listening to the collected
audio files. Because this work concerns spontaneous speech recognition, the developed
system must understand what the speaker is saying and then convert it to text. About 15000
utterances were recorded; the recorded sentences carry attributes such as (ibsa xumuraa)
and verb (xumurtuu), and when the speaker speaks a sentence constructed from the recorded
utterances, the system forms the sentence accordingly. It recognizes children's voices and
the voices of both genders. Because HTK requires .wav, .AIFF and similar supported audio
formats, the audio files were saved accordingly.
Transcription took place with consideration of Afan Oromo grammar, consulting different
literatures on Afan Oromo. The Afan Oromo language has a glottal sound which is
represented by a single quote [ˈ] and sometimes by the letter [h] (Hinsene, 2009). Even
though this glottal sound exists in Afan Oromo, researchers have argued over classifying
it either under consonants or under vowels; most have classified the glottal sound as a
consonant, and we likewise classify it as a consonant in this study. Letters like C, CH,
DH, J, NY, PH, Q, SH and X represent sounds whose codes are non-ASCII and which HTK cannot
support (Yadeta, 2016). The glottal sound itself also cannot be represented by an ASCII
code, which posed a challenge to the researcher while making the transcription; [hh] has
been used to represent the glottal sound in this study to distinguish it from the letter
[h] (Yadeta, 2016). All consonants except [h] and the compound symbols can be either
geminated or non-geminated (Yadeta, 2016); double consonants are used to represent the
geminated form.
Table 7: Afan Oromo phones used during Transcription taken from (Yadeta, 2016)
Letters: A  AA  B  BB  C  CC  CH  D  DD  DH  E  EE  F  FF
IPA:     a  aa  b  bb  c  cc  ch  d  dd  dh  e  ee  f  ff
Letters: G  GG  H  I  II  J  JJ  K  KK  L  LL  M  MM  N
IPA:     g  gg  h  i  ii  j  jj  k  kk  l  ll  m  mm  n
Letters: NN  NY  O  OO  P  PP  PH  Q  QQ  R  RR  S  SS  SH
IPA:     nn  ny  o  oo  p  pp  ph  q  qq  r  rr  s  ss  sh
Letters: T  TT  TS  U  UU  V  W  WW  X  XX  Y  YY  Z  ˈ
IPA:     t  tt  ts  u  uu  v  w  ww  x  xx  y  yy  z  hh
In doing this, the variety of dialects, punctuation, and the rules of capitalization were
not considered.
The HTK toolkit is used for building Hidden Markov Models (HMMs). Young et al. (2006)
stated that HTK is primarily designed for building HMM-based speech processing tools,
particularly speech recognizers, and much of the functionality of HTK is built into the
library modules. Figure 2 shows the software structure of the HTK tool and its
input/output interfaces. User input/output and interaction with the operating system are
controlled by the library module HShell, and all memory management is controlled by HMem.
Figure 2: Software Architecture (Young et.al, 2006)
Math support is provided by HMath, and the signal processing operations needed for
speech analysis are in HSigP. Each of the file types required by HTK has a dedicated
interface module: HLabel provides the interface for label files, HLM for language model
files, HNet for networks and lattices, HDict for dictionaries, HVQ for Vector Quantization
(VQ) codebooks and HModel for HMM definitions. In the next sections we discuss the HTK
tools that we have used for our work as they were required (Yadeta, 2016).
a. Data Preparation Tools
Data preparation is the first step in the development of a speech recognition system. To
build a set of HMMs, a set of speech data files and their associated transcriptions are
needed, because before speech data can be used in training it must be converted into the
required and correct format with phone or word labels (Young et al. 2006). While all HTK
tools can parameterize waveforms on-the-fly, in practice we need to parameterize the data
just once, so the tool HCopy is used for this purpose. The tool HList is used to check the
contents of any speech file, and it can simply be used to check the results of the
conversion before use. In order to convert many files in one pass, HCopy can be driven by a
script (.scp) file listing each source waveform and the name of its target feature file
(see Appendix E).
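For example, a single converted file can be inspected with HList to verify the parameterization; the file name below follows the layout of Appendix E:
# print the feature vectors of one coded file for a quick sanity check
HList Voxforge/train/word1.mfc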
b. Training Tools
Defining the topology required for each HMM by writing a prototype definition is the
second step of system building. HTK allows HMMs to be built with any desired topology:
HMM definitions can be stored externally as simple text files, and hence it is possible to
edit them with any convenient text editor. The purpose of the prototype definition is only
to specify the overall characteristics and topology of the HMM. In this study all of the
phone models are initialized to be identical, with state means and variances equal to the
global speech mean and variance, because no bootstrap data is available.
c. Recognition Tools
HTK provides a single recognition tool called HVite, which uses the token passing
algorithm to perform Viterbi-based speech recognition (Young et al. 2006). HVite takes as
input a network describing the allowable word sequences, a dictionary defining how each
word is pronounced, and a set of HMMs. It operates by converting the word network to a
phone network and then attaching the appropriate HMM definition to each phone instance.
Recognition can be performed either on a list of stored speech files or on direct audio
input. In this study we have used HVite, which supports both triphones and monophones and
can run with multiple tokens to generate lattices containing multiple hypotheses; it can
also be configured to rescore lattices and perform forced alignments. The word networks
needed to drive HVite are usually either simple word loops, in which any word can follow
any other word, or directed graphs representing a finite state task grammar. In the
former case, n-gram probabilities are normally attached to the word transitions.
d. Analysis Tools
Once the HMM-based recognizer has been built, it is necessary to evaluate its performance.
This is usually done by using it to transcribe some pre-recorded test sentences and
matching the recognizer output against the correct reference transcriptions. This
comparison is performed by a tool called HResults, which uses dynamic programming to align
the two transcriptions and then count substitution, deletion and insertion errors.
CHAPTER FOUR
EXPERIMENTS AND OUTCOMES
4. Introduction
The experiments pass through five steps or phases of HTK, the first of which is data
preparation. Data preparation is the core and first step in the development of a speech
recognition system: a set of HMMs will be built from the set of speech data files and
their associated transcription files. Data preparation is accomplished through steps like
constructing the pronunciation dictionary, creating the transcription files, and coding
the audio data.
Any language has its own way of pronouncing words while reading and speaking; e.g. the way
Afan Oromo words are pronounced differs from the way English words are pronounced. So,
creating a sorted list of the words contained in the grammar, one per line with their
pronunciations, is the initial step in building the pronunciation dictionary; the
trainlist is a sorted list of the unique words appearing in the training transcription.
Lists of unique words were derived from the sentences, in different formats, using a Perl
script and a Jupyter notebook. The word lists were split into a trainlist and a testlist,
containing 14500 and 500 words respectively; in total, 15000 words were collected from
different sources.
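For instance, such a sorted unique word list can be derived with a small shell pipeline; the output file name here is an assumption:
# one word per line, drop empty lines, sort and remove duplicates
tr -s ' ' '\n' < afanoromotrainprompt.txt | grep -v '^$' | sort -u > trainlist.txt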
The words have their own pronunciations depending on the speaker. The dictionary provides
a connection between the words used in the task grammar and the acoustic models, which are
composed of sub-word units (phonetic, syllabic, etc.) (Yadeta, 2016). In this study we
recorded around fifteen thousand (15000) utterances: we manually recorded 6500 utterances
from 23 speakers, and 8500 utterances were taken from the OMN, OBN and FIB Afan Oromo
programs. As per the training dataset, when the speaker speaks a word to the machine, the
developed speech recognizer searches for the word/sentence, displays it to the speaker,
and pronounces it back to the speaker in the same way. We used Microsoft Visual Studio
2017 with full Python libraries, Anaconda, and Weka to do this.
Different pronunciations of words with the same written form have not been considered.
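For reference, HTK's HDMan tool is the usual way to derive a task dictionary from a word list and a source lexicon; this is only a sketch under assumed file names, not necessarily the exact command used in this study:
# build traindictionary.txt for the words in trainlist.txt and emit the phone list
HDMan -m -w trainlist.txt -n monophones1 -l dlog traindictionary.txt sourcelexicon.txt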
All HTK tools can parameterize waveforms on-the-fly, and HCopy is used for this purpose,
while HLEd is used to process a Master Label File (MLF). An MLF is a single file that
contains a label entry for each line in the prompts file. Since a single MLF is easier to
handle than individual label files, we preferred it for our experiment. In order to
generate the MLF file from our prompts, we used the Perl script prompts2mlf, which is
distributed with the VoxForge HTK tutorial materials.
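The resulting word level MLF has the standard HTK structure, sketched below with a prompt sentence from Appendix B and a hypothetical label file name (the final "." closes each entry):
#!MLF!#
"*/sentence1.lab"
itti
gaafatamummaa
akka
biyyaatti
nutti
kenname
milkeessuuf
.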
ii. Phone Level Transcription
After the completion of the word level transcription, the HLEd command provided by the
HTK tool is executed to expand the word level transcriptions to phone level
transcriptions. This command replaces each word by its equivalent phonemes and puts
the result in a new phone level master label file. This is done by reviewing each word
in the MLF file, looking up the phones that make up that word in the dictionary
file created earlier, and outputting the result in a file called afanoromophone.mlf,
which does not have short pauses ("sp"s) after each word's phone group.
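A typical invocation is sketched below; the edit script name mkphones0.led follows the HTK tutorial convention, and the word level MLF name is an assumption:
# expand word labels to phone labels using the dictionary, dropping sp
HLEd -l '*' -d traindictionary.txt -i afanoromophone.mlf mkphones0.led afanoromowords.mlf
where mkphones0.led contains the usual three commands:
EX
IS sil sil
DE sp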
The final stage of data preparation is to parameterize the raw speech waveforms into
sequences of feature vectors (Young et al., 2006). Because HTK is not as efficient at
processing wav files as it is with its internal format, we need to convert our audio wav
files to another format. HTK supports both FFT-based and LPC-based analysis; here we have
used Mel Frequency Cepstral Coefficients (MFCCs), which are derived from FFT-based log
spectra. We used the HCopy tool to convert our wav files to MFCC format. For doing this we
have two options: we can execute the HCopy command by hand for each of our audio (wav)
files, or we can create a file containing a list of each source audio file and the name of
the MFCC file it will be converted to, and pass that file to HCopy with the -S option.
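Using the second option with the configuration from Appendix F and the script file from Appendix E, the whole conversion runs in a single pass:
# code every listed wav file to MFCC features
HCopy -T 1 -C config_hcopy -S afanoromotraincode.scp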
4.4. Training Phase
Defining the topology required for each HMM by writing a prototype definition is the
second step of system building. HTK allows HMMs to be built with any desired topology:
HMM definitions can be stored externally as simple text files, so it is possible to edit
them with any convenient text editor. The purpose of the prototype definition is only to
specify the overall characteristics and topology of the HMM. In our study all of the phone
models are initialized to be identical, with state means and variances equal to the global
speech mean and variance, because no bootstrap data is available.
Training is the next step, undertaken after the work of preparing the required data.
Splitting our data into training and testing sets is another important step. We used 6350
training sentences and 650 testing sentences, spoken by both male and female speakers: 10
males and 5 females. The total number of constructed sentences is 7000 and the total
number of speakers is 15. This self-recorded data was added because the data collected and
retrieved from OMN, OBN and FIB are full of noise and background music; adding
self-recorded speech enabled the researcher to build the language model and the
pronunciation dictionary with a lower word error rate.
Prototype definition
The principal step of HMM model training is defining the prototype model. As Young
et al. (2006) stated, the parameters of this model are not important; its purpose is to
define the model topology. Since our recognition system is phone-based, we have trained
left-to-right HMMs with a 5-state topology (3 emitting states and two non-emitting states)
without skips. To define the topology we created the prototype file shown in Appendix G.
The HTK tool HCompV will scan a set of data files, compute the global mean and variance
and set all of the Gaussians in a given HMM to have the same mean and variance.
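A minimal invocation, assuming the configuration file from Appendix F and an output directory hmm0, is:
# flat start: write a proto with global means/variances plus the vFloors macro
HCompV -C config_hcompv -f 0.01 -m -S afanoromotrain.scp -M hmm0 proto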
Therefore, supposing that a list of all the training files is stored in afanoromotrain.scp,
this command creates a new version of our proto file in which the zero means and unit
variances are replaced by the global speech means and variances. Using the new prototype
model generated by HCompV, a Master Macro File (MMF) called hmmdefs, containing a
copy for each of the required mono-phone HMMs, is constructed manually by copying the
prototype and relabeling it for each required mono-phone, including "sil". Consequently,
the file macros contains a global options macro and the variance floor macro vFloors
generated earlier by HCompV; the global options macro simply defines the HMM parameter
kind and the vector size.
The mono-phones created are re-estimated using the embedded re-estimation tool HERest; a
representative invocation (directory names follow the HTK tutorial convention) is:
HERest -C config -I afanoromophone.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 trainmonophones0
This command loads all the models, both hmmdefs and macros, listed in the model list; the
mono-phones used here exclude the short pause (sp) model. These are then re-estimated
using the data listed in train.scp and the new model set is stored in hmm1. In the
command, the -t option sets the pruning thresholds to be used during training; the pruning
threshold is normally 250.0, and if re-estimation fails on any particular file the
threshold is increased by 150.0 and the file is reprocessed. This is repeated until either
the file is successfully processed or the pruning limit of 1000.0 is exceeded.
Fixing the silence models is done by running the HMM editor HHEd to add the extra
transitions required and to tie the sp state to the center sil state. HHEd works in a
similar way to HLEd: it applies a set of commands in a script to modify a set of HMMs. In
this case it is executed with an edit script sil.hed of the usual form:
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
Here the AT commands add transitions to the given transition matrices and the final TI
command creates a tied state called silst; the parameters of this tied state are stored in
the hmmdefs file, and within each silence model the original state parameters are replaced
by the name of this macro. The new list of mono-phones, which contains the sp model, is
used in the above HHEd command. At the end, HERest is applied again using the phone
transcriptions with sp models between words.
Realigning the Training Data: as we said earlier, our pronunciation dictionary is an
alternative pronunciation dictionary; 109 words have at least two pronunciations in our
training pronunciation dictionary. For this reason we need to realign the training data.
The basic difference between realigning the training data and the original word-to-phone
mapping performed by HLEd during data preparation is that, in the realigning operation,
all pronunciations of each word are considered and the pronunciation that best matches the
acoustic data is output. Using the phone models created before, we realigned the training
data and created new transcriptions with a single call of HVite, of the form (model
directory names are illustrative):
HVite -l '*' -o SW -C config -a -m -t 250.0 -y lab -I trainwords.mlf -i aligned.mlf -S train.scp -H hmm7/macros -H hmm7/hmmdefs traindictionary.txt trainmonophones1 > HVite_log
This command uses the HMMs created previously to transform the input word level
transcription trainwords.mlf into the new phone level transcription aligned.mlf, using the
pronunciations stored in the traindictionary.txt constructed so far. When aligning the
data it is sometimes clear that there are significant amounts of silence at the beginning
and end of some utterances; to spot this, the time-stamp information needs to be output
during the alignment, which is why we used the option -o SW in the above command. After
the new phone alignments have been created, HERest is applied again to re-estimate the HMM
set parameters.
Tied-State Tri-phones
As pointed out by (Young et al., 2006), the first stage of model refinement is usually to
convert the context independent mono-phones to context dependent tri-phones; the first
decision to make is whether or not cross-word tri-phones are to be used. If cross-word
tri-phones are used, then word boundaries in the training data can be ignored and all
mono-phone labels can be converted to tri-phones. If word internal tri-phones are to be
used, then word boundaries in
the training transcriptions must be marked. So, we have built context dependent tri-phones
for this study. Since we have prepared a set of mono-phone HMMs with the previous steps,
now we can use them to create context-dependent tri-phone HMMs. We did this in two
steps. Firstly, the mono phone transcriptions are converted to tri-phone transcriptions and
a set of triphones models are created and re-estimated. Secondly, similar acoustic states of
these triphones are tied. Tying is simply the method of making one or more HMMs share the
same set of parameters; the set of re-estimated context-dependent models can then be
created from the mono-phones. The tri-phone transcriptions have to be created first using
the HLEd tool, which enables us to generate a list of all the tri-phones for which there
is at least one example in the training data:
HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
Using the above command we created the tri-phone transcriptions in the wintri.mlf file
from the mono-phone transcriptions in the aligned.mlf file; at the same time, a list of
the tri-phones is written to the file triphones1. The edit script mktri.led used above
contains the commands:
WB sp
WB sil
TC
The two WB commands define sp and sil as word boundary symbols. These then block the
addition of context in the TC command, which converts all phones except word boundary
symbols to tri-phones. For example, sil a kk a m sp becomes sil a+kk a-kk+a kk-a+m a-m sp.
This tri-phone transcription is word internal; some bi-phones may be generated as contexts
at word boundaries, since sometimes they include only two phones.
The cloning of mono-phones into tri-phones is done with HHEd, for example (model directory
names are illustrative):
HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed trainmonophones1
where the edit script mktri.hed contains a clone command CL followed by TI commands to tie
all of the transition matrices in each tri-phone set, that is:
CL triphones1
TI T_aa {(*-aa+*,aa+*,*-aa).transP}
...
We generated the file mktri.hed using the Perl script maketrihed included in the HTK
Tutorial directory. The clone command CL takes as its argument the name of the file
containing the list of tri-phones (and bi-phones) generated above. For each model of the
form a-b+c in this list, it looks for the mono-phone b and makes a copy of it. Because the
transition matrix transP is regarded as a sub-component of each HMM, each TI command takes
as its argument the name of a macro and a list of HMM components; the list of items within
brackets is a pattern designed to match the set of tri-phones, right bi-phones and left
bi-phones for each phone.
Making Tied-State Tri-phones: now that a set of tri-phone HMMs has been prepared, with all
tri-phones in a phone set sharing the same transition matrix, we can tie them. Tying
states within tri-phone sets helps to share data and thus be able to make
robust parameter estimates. The HTK tool HHEd provides two mechanisms which allow states
to be clustered and then each cluster tied. The first is data-driven and uses a similarity
measure between states; the second uses decision trees and is based on asking questions
about the left and right contexts of each tri-phone. The decision tree attempts to find
those contexts which make the largest difference to the acoustics and which should
therefore distinguish clusters. Decision tree state tying is performed by running HHEd,
for example (directory names are illustrative):
HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1 > log
The edit script tree.hed contains the instructions regarding which contexts to examine for
possible clustering; its detailed contents are attached in Appendix H. We used the
mkclscript script, found in the RM Demo, to create the TB commands (decision tree
clustering of states) which form one part of tree.hed. Firstly, the RO command
is used to set the outlier threshold to 100.0 and load the statistics file generated at the end
of the previous step. The outlier threshold determines the minimum occupancy of any
cluster and prevents a single outlier state forming a singleton cluster just because it is
acoustically very different to all the other states. The TR command sets the trace level to
zero in preparation for loading in the questions. Each QS command loads a single question
and each question is defined by a set of contexts. For example, one QS command in the
tree.hed file defines a question called 'R_Nasal' which is true if the right context is
one of the nasals m, mm, n, nn or ny. Questions referring to both the right and left
contexts of a phone are included.
command would include every possible context which can influence the acoustic
realization of a phone, and can include any linguistic or phonetic classification which may
be relevant. For this study we have constructed the questions (QS) which are attached at
Appendix H. The set of tri-phones used so far only includes those needed to cover the
training data. The AU command takes as its argument a new list of tri-phones expanded to
include all those needed for recognition. This list can be generated, for example, by using
HDMan on the entire dictionary (not just the training dictionary), converting it to
tri-phones using the command TC and outputting a list of the distinct tri-phones to a file
using the -n option.
The -b sp option specifies that the sp phone is used as a word boundary and it is excluded
from tri-phones. The effect of the AU command is to use the decision trees to synthesize
all of the new previously unseen tri-phones in the new list. Once all state-tying has been
completed and new models synthesized, some models may share exactly the same states and
transition matrices, making them identical. The CO command is used to compact the model
set by finding all identical models and tying them together, producing a new list of
models called tiedlist. One advantage of decision tree clustering is that it allows
previously unseen tri-phones to be synthesized; to do this, the trees must be saved, and
this is done by the ST command. Finally, the models are re-estimated using HERest.
Increasing Gaussian mixtures: we constructed mono-phones and context dependent tri-phones
in the previous stages with only single Gaussian models; the next refinement is to
increase the number of Gaussian mixture components of the models constructed so far. The
Gaussian mixtures were increased up to 12 in this study in search of the best performance.
According to Young et al. (2006), in HTK the conversion from single Gaussian HMMs to
multiple mixture component HMMs is usually one of the final steps in system refinement.
The mechanism provided to do this is the HHEd MU command, which increases the number of
components in a mixture by a process called mixture splitting. We used this approach to
build a multiple mixture component system, incrementally increasing the number of
components until the desired level of performance was achieved. The MU command has the
form:
MU n itemList
where n is the new number of mixture components required and itemList defines the actual
mixture distributions to be modified. For instance, increasing the number of mixture
components in the output distributions of states 2 to 4 of all models to 2 is written as:
MU 2 {*.state[2-4].mix}
The language model used for recognition was developed using a 2000-sentence text corpus
taken from
the Kallacha Oromiyaa Afaan Oromo newspaper. Since this text corpus was small, we added
the text corpus that we prepared during transcription. The reason for doing this is to
increase the size of our data: when the size of the text data increases, the probability
of occurrence of words also increases, and this has an effect on the quality of the
language model. HTK provides language modeling tools for the development of bigram and
trigram language models. The HTK tools we used to build the bigram language model and the
word network were HLStats and HBuild, respectively; using these two different language
modeling tools was only for identifying which performs better. We also did the same for
the trigram language model.
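The two steps can be sketched as follows, assuming the word list and the word level MLF prepared earlier (file names are illustrative):
# gather bigram statistics from the training transcriptions
HLStats -b bigfn -o trainlist.txt trainwords.mlf
# build the recognition word network with the bigram probabilities attached
HBuild -n bigfn trainlist.txt wdnet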
For the recognition phase, HTK provides a single recognition tool called HVite, which uses
the token passing algorithm to perform Viterbi-based speech recognition. HVite takes as
input a network describing the allowable word sequences, a dictionary defining how each
word is pronounced, and a set of HMMs. The word networks needed to drive HVite are usually
either simple word loops, in which any word can follow any other word, or directed graphs
representing a finite state task grammar; in the former case, bigram probabilities are
normally attached to the word transitions. The task of searching for the most likely
sequence of words given the observed features extracted from the speech signal is usually
referred to as decoding or recognizing the speech signal. Decoding begins by constructing
a search graph which contains every word in the recognition vocabulary. Each word is then
replaced by the HMMs that correspond to
the sequence of sound units which make up the word. As a result, the search graph is one
large HMM, and recognition is performed by using the Viterbi algorithm to align the search
graph to the speech features derived from the utterances. Because the Viterbi algorithm is
used to find the most likely word sequence, the decoding procedure is said to be done via
Viterbi search. Suppose that the test.scp file holds a list of the coded test files; each
test file is then recognized and its transcription output to an MLF file called recout.mlf
by running the HVite decoder.
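A representative decoding command, with illustrative model directory and tuning values, is:
# recognize the coded test files against the tied-state models and word network
HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 traindictionary.txt tiedlist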
The options -p and -s set the word insertion penalty and the grammar scale factor,
respectively. The word insertion penalty is a fixed value added to each token when it
transits from the end of one word to the start of the next. The grammar scale factor is the
amount by which the language model probability is scaled before being added to each token
as it transits from the end of one word to the start of the next. Because these parameters
can have a significant effect on recognition performance, we made some modifications to
them while tuning the recognizer.
The final stage of the HTK toolkit is the analysis stage. Once the HMM-based recognizer
has been built, it is necessary to evaluate its performance. This is usually done by using
it to transcribe pre-recorded test sentences and matching the recognizer output against
the correct reference transcriptions; this comparison uses dynamic programming to align
the two transcriptions and then count substitution, deletion and insertion errors. Once
the test data has been processed by the recognizer, the next step is to analyze the
results, and the HTK tool HResults is provided for this purpose. HResults compares the
transcriptions output by HVite with the original reference transcriptions and then outputs
various statistics. HResults matches each of the recognized and reference label sequences
by performing an optimal string match using dynamic programming; once it finds the optimal
alignment, it calculates the number of substitution errors (S), deletion errors (D) and
insertion errors (I), and then outputs both sentence-level and word-level statistics.
Since HTK tools can process both individual label files and files stored in MLFs, we used
our MLF file containing the word level transcriptions of the test files, called
testref.mlf, and ran:
HResults -I testref.mlf tiedlist recout.mlf
After running the HResults command there are several results to be seen; the findings of
this study are described in the following tables.
Our experiments were conducted using different parameters for the word insertion penalty
and the grammar scale factor; in these experiments we used test data collected only from
OBN. From the experiment with increasing Gaussian mixtures, the best result was obtained
at 10 mixtures, with a WER of 65.23%; hence the performance of the developed spontaneous
speech recognizer was still small, since a high word error rate shows weak performance
even with increased Gaussian mixtures. Therefore we tried to build a language model using
HTK to increase the performance of our work. The result of the language model experiment
shows greater precision than increasing the Gaussian mixtures. Experiments were run on
both training and test data. The performance gained from the experiment with the trigram
language model built by HTK is as follows.
As seen in Table 9, the word error rates for tri-phones and mono-phones are 36.45% and
50.23%, respectively. A lower WER means better performance, and the best result was
achieved by building the language model with HTK. Even though we increased the Gaussian
mixtures and built a language model, we used a minimal data size, which may limit the
performance of the developed speech recognizer.
4.7. Challenges
Data collection of audio files from broadcasting corporations and other social media is
tiresome, because no clean data suitable for this work existed before. The data collected
from social media have background music and other noisy sounds which are not suitable for
the work, but they were transcribed manually. We used a hybrid model (Hidden Markov Model,
Gaussian Mixture Model and Artificial Neural Network) for the recognizer: HMM is commonly
used as a model for discrete and continuous speech recognition, but since we set out to
develop spontaneous speech recognition for Afan Oromo, we used the three models together
in order to obtain a successful Afan Oromo Spontaneous Speech Recognizer. Time and budget
constraints were also a great headache while working on this study. From the obtained
results, the word error rate was 36.45%, which is less than the word accuracy; the word
accuracy was maximized because of the models we used during the work. We searched for the
optimal settings to reach the highest performance using different approaches, like
increasing the Gaussian mixtures and tuning the word insertion penalty and grammar scale
factor parameters. For the language model we used the hidden Markov model for discrete and
continuous recognition; from that we tried to transform the continuous speech with the
Artificial Neural Network and upgrade the performance to this level. We used the Gaussian
Mixture Model because it efficiently imputes missing values, and we used the artificial
neural network to understand and model the variability of spontaneous speech.
CHAPTER FIVE
5.1. Conclusion
In this study, the possibility of speaker independent spontaneous speech recognition for
Afan Oromo has been explored. To conduct this, we thoroughly reviewed many related works
on spontaneous speech recognition and its development, as well as other literature on Afan
Oromo speech recognition covering different types of speech recognition. In this study we
used a hybrid model together with an acoustic model and a language model.
For this study, among the approaches to ASR systems the stochastic approach was used and
HMM was implemented for modeling; the HTK tool and the other tools compatible with our
preferred approach and modeling technique were applied accordingly. Throughout this work
we used Visual Studio 2017 and the other tools listed in Appendix A.
The spontaneous speech recognizer for Afan Oromo was developed using a medium data size
from 10 male and 5 female Afan Oromo native speakers, amounting to 12 hours and 20 minutes
of training speech. The test data was prepared with 500 utterances consisting of 1200
unique words from 15 speakers, of whom 9 were involved in training and 6 were not. The
language model used was a trigram language model developed using HLStats, and the word
network was built from it using HBuild.
Non-speech events (dis-fluencies), which are the main distinguishers of spontaneous speech
from read speech, occur in both the training and test data. Because of their direct
influence on the performance of the recognizer, rather than treating them merely as
silence we modeled them explicitly.
The training was done by first modeling context independent mono-phones and re-estimating
the models using the HERest tool. In order to improve our recognizer's accuracy we refined
our models; as a result we developed context dependent tri-phones, both cross-word and
word internal, from the mono-phones and re-estimated the tri-phone models.
During the course of this study, the recognizers developed using different acoustic models
have been tested using the test data we prepared for this purpose. Tuning the parameters
of the decoder at recognition time brought different recognition accuracies; some of these
parameters are the word insertion penalty (p), the grammar scale factor (s) and the
pruning level.
The spontaneous speech recognizer for Afan Oromo was developed with a speech database
containing 14500 utterances for training the acoustic model. The language model was
developed using 120MB of text data and a 30MB audio dataset, and the resulting word error
rate was 36.45%. Generally, the achieved spontaneous speech recognizer understands only
what is covered by the developed language model and the pronunciation dictionary of 15000
words; this study can serve as a starting point for further work.
5.2. Recommendations
From what we have learned throughout the course of this research and from the experimental
results, we would like to forward recommendations that further researchers can pursue.
There was no spontaneous speech corpus prepared before this study that could be used for
it; therefore, we prepared one from scratch in both audio and text formats.
The speech recognizer was developed with the concept of spontaneous speech recognition,
which is almost natural speech, and if the recognizer is developed in full it will enable
Afan Oromo native speakers to speak to machines. Afan Oromo has many more possible
utterances than the 15000 used here, so using more utterances can increase the performance
and accuracy. Since Automatic Speech Recognition with low accuracy is difficult to use in
the real world, it is mandatory to work on increasing the data size and improving the
recognizer's accuracy. As is known, the target of speech technology is to operate machines
with the words we speak, and today text-to-speech (read speech), speech-to-text (write
speech) and other speech technologies are still under development.
In this study we did our experiments using a trigram language model. In addition to
increasing the data size, and depending on the size of the data at hand, much can be done
with the language model, for example increasing it to a 4-gram.
The speech data used for this study is sparse and diversified, coming from different
speakers. Consequently, in order to handle the variability among speakers, investigating
speaker adaptation is one of the issues to be addressed for recognizer development;
conducting research in this area is an important way to improve the accuracy of the
recognizer. One of the difficult features of spontaneous speech is non-speech events
(dis-fluencies); handling non-speech events therefore plays a big role in recognizer
performance. We dealt with some issues to handle them in general, but it is still possible
to handle them further by applying different techniques: one task that can be done to
handle the effect of these non-speech events is to include them in, or remove them from,
the acoustic, lexical and language models, by studying their nature in detail and
separately. Besides increasing the size of the data, performance can be improved by
working on dialect and accent identification and recognition, because Afan Oromo words can
be pronounced in the same way but defined in different ways. E.g. 'Bukkee', which means
beside, is 'cina' or 'bira' for Afan Oromo speakers around Horo Guduru Wolega, East
Wolega, West and Kelem Wolega, and it is 'Maseena' around the West and East Hararghe
zones. We have not focused on dialect and accent identification and recognition, so we
recommend that further researchers work on the same topic with increased data size,
higher-order n-gram language models (4-gram, 5-gram, since we used trigram), and dialect
identification in Afan Oromo.
References
Bantegize. (2015). BRANA: Application of Amharic Speech Recognition System for Dictation
in Judicial Domain. MSc Thesis, Gondar University, Gondar.
Carnegie Mellon University. (2013). Understanding Spontaneous Speech. Wayne Ward,
Carnegie Mellon University Computer Science Department, Pittsburgh, PA 15213.
Gamta, T. (1993). Qubee Afaan Oromoo: Reasons for Choosing the Latin Script for
Developing an Oromo Alphabet. The Journal of Oromo Studies, 1(1), p. 36.
Makuria, H. (2009). Elellee Conversation: Afaan Oromo Writing System. Addis Ababa,
Ethiopia: Commercial Printing E., pp. 23-39.
Shriberg, B. E. (2012). Speech Technology and Research Laboratory, SRI International,
Menlo Park, CA 94025, USA; International Computer Science Institute, Berkeley, CA 94704,
USA.
South Africa. (2006). Artificial Neural Network. School of Electrical and Information
Engineering, University of the Witwatersrand, Johannesburg.
Zhang. (1998). A Fault Detection and Diagnosis Approach Based on Hidden Markov Chain
Model. Proc. of the American Control Conference, Philadelphia, pp. 2012-2016.
Appendices
Appendix A: Tools used
1  Visual Studio 2017   For the installation of HTK and writing speech recognition code
2  HTK 3.4.1
5  Notepad++
Appendix B: Samples of prompts afanoromotrainprompt.txt and afanoromotestprompt.txt
afanoromotrainprompt.txt file
itti gaafatamummaa akka biyyaatti nutti kenname milkeessuuf
Afaanoromotestprompts.txt file
cimsanii dubbatan.
jedhaniiru.
rakkoo hiikan irratti xiyyeeffachuun murteessaa ta’uu
eeraniiru.
Appendix C: Samples from the pronunciation dictionary
mana [mana] ma na
danaa [danaa] da na
gahee [Gahee] ga h ee
gatii [gatii] ga t ii
lammii [lammii] la m mii
kaleessa [kaleessa] ka le es sa
Appendix D: Words in the pronunciation dictionary with their number of alternative pronunciations
47 Bartee 2
48 Cina 2
49 Cinqii 2
50 Ciiggahuu 2
51 Cunqursaa 2
52 Eeruu 2
53 Ga’aa 2
54 Tarreessaa 2
55 Booji’uu 2
56 Lakkaa’uu 2
57 Gurmuu 2
58 Salgaffaa 2
59 Jahaffaa 2
60 Galgalaa 2
61 Ta’a 2
62 Ta’ullee 2
63 To’achuu 2
64 Tasgabbaa’e 2
65 Xaa’oo 2
66 Haara’umsa 2
67 Danda’a 2
68 Buqqa’uu 2
69 Daldaluu 2
70 Xiinxaluu 2
71 Durduuba 2
Appendix E: Sample of coding for afanoromotraincode.scp and afanoromotest.scp code.
Voxforge/train/word1.wav Voxforge/train/word1.mfc
Voxforge/train/word2.wav Voxforge/train/word2.mfc
Voxforge/train/word3.wav Voxforge/train/word3.mfc
Voxforge/train/word4.wav Voxforge/train/word4.mfc
Voxforge/train/word5.wav Voxforge/train/word5.mfc
Voxforge/train/word6.wav Voxforge/train/word6.mfc
Voxforge/train/word7.wav Voxforge/train/word7.mfc
Appendix F: Configuration Tools
# PATH needs CR
as_cr_letters='abcdefghijklmnopqrstuvwxyz'
as_cr_LETTERS='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
as_cr_compoundletter='CH,DH,NY,PH,SH,TS,ZY'
as_cr_glottal='h'
as_cr_Letters=$as_cr_letters$as_cr_LETTERS$as_cr_compoundletter
as_cr_digits='0123456789'
as_cr_alnum=$as_cr_Letters$as_cr_digits
chmod +x conf$$.sh
PATH_SEPARATOR=';'
else
PATH_SEPARATOR=:
fi
rm -f conf$$.sh
fi
as_unset=unset
else
as_unset=false
fi
# IFS
# We need space, tab and new line, in precisely that order. Quoting is
# (If _AS_PATH_WALK were called with IFS unset, it would disable word
as_nl='
'
case $0 in
*[\\/]* ) as_myself=$0 ;;
*) as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
do
IFS=$as_save_IFS
done
IFS=$as_save_IFS
;;
esac
# We did not find ourselves, most probably we were run as `sh COMMAND'
if test "x$as_myself" = x; then
as_myself=$0
fi
echo "$as_myself: error: cannot find myself; rerun with an absolute file name"
>&2
fi
done
PS1='$ '
PS2='> '
PS4='+ '
# NLS nuisances.
for as_var in \
LC_TELEPHONE LC_TIME
do
else
fi
done
config_hcopy
TARGETKIND = MFCC_0_D_A
SOURCEFORMAT = WAV
TARGETFORMAT = HTK
SOURCERATE = 625
TARGETRATE = 100000.0
SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
Config_hcompv
TARGETKIND = MFCC_0_D_A
SOURCEFORMAT = HTK
SOURCERATE = 625
TARGETRATE = 100000.0
SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
Config_hvite
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED= TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
SOURCEFORMAT = HTK
USESILDET = TRUE
MEASURESIL = FALSE
OUTSILWARN = TRUE
MICIN= TRUE
Appendix G: Prototype of Proto files
~o <VecSize> 25 <MFCC_0_D_N_Z>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
<EndHMM>
RO 100 "stats"
TR 0
QS "R_NonBoundary" { *+* }
QS "R_Silence" { *+sil }
QS "R_Stop" { *+p,*+ph,*+d,*+b,*+t,*+j,*+dh,*+k,*+q,*+g,*+ch,*+x,*+c, }
QS "R_Nasal" { *+m,*+mm,*+n,*+nn,*+ny }
QS "R_Fricative" { *+s,*+ss,*+sh,*+z,*+f,*+ff,*+hh,*+v }
QS "R_Vowel" { *+a,*+aa,*+e,*+ee,*+i,*+ii,*+o,*+oo,*+u,*+uu }
QS "R_C-Front" { *+p,*+ph,*+b,*+m,*+*+f,*+v,*+w }
QS "R_C-Central" { *+t,*+dd,*+d,*+dh,*+x,*+n,*+s,*+z,*+r }
QS "R_C-Back" { *+sh,*+ch,*+j,*+jj,*+y,*+k,*+kk,*+g,*+ny,hh }
QS "R_V-Front" { *+i,*+ii,*+e,*+ee }
QS "R_V-Central" { *+a }
QS "R_V-Back" { *+u,*+aa,*+oo,*+uu,*+o }
QS "R_Unvoiced-Cons" { *+p,*+tt,*+k,*+kk,*+ch,*+f,*+ff,*+s,*+ss,*+sh }
QS "R_Voiced-Cons" { *+j,*+b,*+bb,*+d,*+dd,*+dh,*+g,*+gg,*+v,*+z }
QS "R_Long" { *+aa,*+ee,*+ii,*+oo,*+uu }
QS "R_Short" { *+a,*+e,*+i,*+o,*+u }
QS "R_IVowel" { *+i,*+ii }
QS "R_EVowel" { *+e,*+ee }
QS "R_AVowel" { *+a,*+aa }
QS "R_OVowel" { *+o,*+oo, }
QS "R_UVowel" { *+u,*+uu }
QS "R_Voiced-Stop" { *+b,*+bb,*+d,*+dd,*+g,*+gg,*+j,*+jj }
QS "R_Unvoiced-Stop" { *+p,*+pp,*+t,*+tt,*+k,*+kk,*+ch }
QS "R_Voiced-Fric" { *+z,*+v }
QS "R_Unvoiced-Fric" { *+s,*+sh,*+th,*+f,*+ch }
QS "R_Front-Fric" { *+f,*+ff,*+v }
QS "R_Central-Fric" { *+s,*+ss,*+z }
QS "R_Back-Fric" { *+sh,*+ch }
QS "R_a" { *+a }
QS "R_aa" { *+aa }
QS "R_b" { *+b }
QS "R_bb" { *+bb }
QS "R_c" { *+c }
QS "R_cc" { *+cc }
QS "R_ch" { *+ch }
QS "R_d" { *+d }
QS "R_dd" { *+dd }
QS "R_dh" { *+dh }
QS "R_e" { *+e }
QS "R_ee" { *+ee }
QS "R_f" { *+f }
QS "R_ff" { *+ff }
QS "R_g" { *+g }
QS "R_gg" { *+gg }
QS "R_h" { *+h }
QS "R_hh" { *+hh }
QS "R_i" { *+i }
QS "R_ii" { *+ii }
QS "R_j" { *+j }
QS "R_jj" { *+jj }
QS "R_k" { *+k }
QS "R_kk" { *+kk }
QS "R_l" { *+l }
QS "R_ll" { *+ll }
QS "R_m" { *+m }
QS "R_mm" { *+mm }
QS "R_n" { *+n }
QS "R_nn" { *+nn }
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
QS "L_NonBoundary" { *-* }
QS "L_Silence" { sil-* }
QS "L_Stop" { p-*,ph-*,d-*,b-*,t-*,j-*,dh-*,k-*,q-*,g-*,ch-*,x-*,c-* }
QS "L_Nasal" { m-*,mm-*,n-*,nn-*,ny-* }
QS "L_Fricative" { s-*,ss-*,sh-*,z-*,f-*,ff-*,hh-*,v-* }
QS "L_Vowel" { a-*,aa-*,e-*,ee-*,i-*,ii-*,o-*,oo-*,u-*,uu-* }
QS "L_C-Front" { p-*,ph-*,b-*,m-*,f-*,v-*,w-* }
QS "L_C-Central" { t-*,dd-*,d-*,dh-*,x-*,n-*,s-*,z-*,r-* }
QS "L_C-Back" { sh-*,ch-*,j-*,jj-*,y-*,k-*,kk-*,g-*,ny-*,hh-* }
QS "L_V-Front" { i-*,ii-*,e-*,ee-* }
QS "L_V-Central" { a-* }
QS "L_V-Back" { u-*,aa-*,oo-*,uu-*,o-* }
QS "L_Unvoiced-Cons" { p-*,t-*,k-*,kk-*,ch-*,f-*,ff-*,s-*,ss-*,sh-* }
QS "L_Voiced-Cons" { j-*,b-*,bb-*,d-*,dd-*,dh-*,g-*,gg-*,v-*,z-* }
QS "L_Long" { aa-*,ee-*,ii-*,oo-*,uu-* }
QS "L_Short" { a-*,e-*,i-*,o-*,u-* }
QS "L_IVowel" { i-*,ii-* }
QS "L_EVowel" { e-*,ee-* }
QS "L_AVowel" { a-*,aa-* }
QS "L_OVowel" { o-*,oo-* }
QS "L_UVowel" { u-*,uu-* }
QS "L_Voiced-Stop" { b-*,bb-*,d-*,dd-*,g-*,gg-*,j-*,jj-* }
QS "L_Unvoiced-Stop" { p-*,pp-*,t-*,tt-*,k-*,kk-*,ch-* }
QS "L_Voiced-Fric" { z-*,v-* }
QS "L_Unvoiced-Fric" { s-*,sh-*,th-*,f-*,ch-* }
QS "L_Front-Fric" { f-*,ff-*,v-* }
QS "L_Central-Fric" { s-*,ss-*,z-* }
QS "L_Back-Fric" { sh-*,ch-* }
QS "L_a" { a-* }
QS "L_aa" { aa-* }
QS "L_b" { b-* }
QS "L_bb" { bb-* }
QS "L_c" { c-* }
QS "L_cc" { cc-* }
QS "L_ch" { ch-* }
QS "L_d" { d-* }
QS "L_dd" { dd-* }
QS "L_dh" { dh-* }
QS "L_e" { e-* }
QS "L_ee" { ee-* }
QS "L_f" { f-* }
QS "L_ff" { ff-* }
QS "L_g" { g-* }
QS "L_gg" { gg-* }
QS "L_h" { h-* }
QS "L_hh" { hh-* }
QS "L_i" { i-* }
QS "L_ii" { ii-* }
QS "L_j" { j-* }
QS "L_jj" { jj-* }
QS "L_k" { k-* }
QS "L_kk" { kk-* }
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
TR 2
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
TR 1
AU "fulllist"
CO "tiedlist"
ST "trees"