Speaker Independent Speech Recognition For Afan Oromo Language Using Hybrid Approach
BY TARIKU ENDALE
May, 2020
WOLISO, ETHIOPIA
APPROVAL SHEET
Submitted By:
AMBO UNIVERSITY WOLISO CAMPUS
SCHOOL OF GRADUATE STUDIES
CERTIFICATION SHEET
As thesis research advisor, I hereby certify that I have read and evaluated this thesis, prepared under my guidance by Tariku Endale Terefe, entitled Speaker Independent Spontaneous Speech Recognition for Afan Oromo using Hybrid Approach, and I recommend that it be submitted as fulfilling the thesis requirement.
Declaration
I, the undersigned, declare that this thesis comprises my own work. In compliance with internationally accepted practices, I have duly acknowledged and referenced all materials used in this work. I understand that non-compliance is grounds for disciplinary action by the University and can also evoke penal action from the sources which have not been properly cited or acknowledged.
List of Tables
Table 1: Historical development of Speech Recognition, taken from (NEC R&D Meeting, 2009)
Table 2: Afan Oromo Alphabets
Table 3: Afan Oromo International Phonetic Alphabets
Table 4: Afan Oromo Numbering System, taken from (Yadeta, 2016)
Table 5: Articulation of Afan Oromo vowels, compiled from (Yadeta, 2016)
Table 6: Articulation of Afan Oromo consonant Alphabets, compiled from (Yadeta, 2016)
Table 7: Afan Oromo phones used during Transcription, taken from (Yadeta, 2016)
Table 8: Results of experiments conducted by increasing Gaussian Mixtures
Table 9: Results obtained from Experiments of Language Model built by HTK
Table of Contents
Declaration
List of Tables
Acknowledgements
List of Figures
List of Acronyms
Abstract
CHAPTER ONE
1.1. INTRODUCTION
1.2. Statement of the Problem
1.3. Research Questions
1.4. Objectives of the Study
1.4.1. General Objectives
1.4.2. Specific Objectives
1.5. Scope and Limitations of the Study
1.6. Research Methodology
1.6.1. Review of Related Literature
1.6.2. Data Selection and Preparation
1.6.3. Modeling Techniques
1.6.4. Testing Techniques
1.6.5. Tools used for the Study
1.6.6. Significance of the Study
1.6.7. Organization of the Study
CHAPTER TWO
LITERATURE REVIEW
2.1. Introduction
2.2. Overview of Automatic Speech Recognition (ASR)
2.2.1. Categories of Automatic Speech Recognition (ASR) System
2.2.2. History of Speech Recognition System
2.2.3. Speech Recognition Approaches
2.2.4. Overviews of HMM, GMM and ANN
2.3. Automatic Speech Recognition Process
2.4. Toolkits used in Speech Recognition
2.4.1. Data Preparation Tools
2.4.2. Training Tools
2.4.3. Recognition Tools
2.4.4. Analysis Tools
2.5. Related Works
CHAPTER THREE
AFAAN OROMO LANGUAGE
3. INTRODUCTION
3.3. AFAN OROMO ALPHABETS (QUBEE AFAAN OROMOO)
3.3.1. Sound Formation in Afan Oromo
3.3.2. IPA Representation of Afan Oromo
3.3.3. Morphological Features in Afan Oromo
3.3.4. Afan Oromo Phonetics
3.3.5. Articulation of Afan Oromo Vowels
3.3.6. Articulation of Afan Oromo Consonants
3.4. Data Preparation
3.5. Audio Pre-processing
3.6. Transcribing the Segmented Speech
3.7. Hidden Markov Toolkits (HTK)
CHAPTER FOUR
EXPERIMENTS AND OUTCOMES
4. Introduction
4.3. Data Preparation Phase
4.3.1. Pronunciation Dictionary
4.3.2. Creating Transcription Files
4.3.3. Feature Extraction
4.4. Training Phase
4.4.1. Creating Mono and Tri-Phone HMMs
4.4.2. Re-estimating Mono-phones
4.4.3. Refinement and Optimization
4.5. Recognition Phase
4.6. Analysis Phase
4.7. Challenges
CHAPTER FIVE
CONCLUSION AND RECOMMENDATIONS
5.1. Conclusion
5.2. Recommendations
References
Appendices
Appendix A: Summary of Tools used in the study
Appendix B: Samples of prompts afanoromotrainprompt.txt and afanoromotestprompt.txt
Appendix C: Samples of Pronunciation Dictionary
Appendix D: Number of alternative pronunciation dictionary
Appendix E: Samples of coding for afanoromotraincode.scp and afanoromotest.scp code
Appendix F: Configuration Tools
Appendix G: Prototype of Proto files
Appendix H: Editing script of tree.hed
Acknowledgements
It is my great pleasure to heartily thank my God for the work he is doing in my daily life. Past life situations could not have been withstood if God were not helping me in all, at all, and for all in my life. God, thank you very much!
My heartfelt thanks go to my advisor, Dr. Getachew Mamo, for his constructive comments and guidance. I am thankful to him because his guidance and open comments supported me through the completion of this research. I extol Dr. Getachew Mamo for his prayer.
I would also like to thank my wife, Demitu Kebede, and my son, Goftanbon Tariku, for the time they have given me and for supporting me in all conditions while I was learning.
My heartfelt thanks also go to Mr. Daba Ararsa, the General Manager of Ambo Urban Water Supply and Sewerage Service Enterprise, for his support in all and with all during my study. I also thank all the staff of Ambo Urban Water Supply and Sewerage Service Enterprise, especially Mr. Teshome Kenea, for his encouragement and support.
List of Figures
Figure 1: Overview of Automatic Speech Recognition (ASR)
Figure 2: Automatic Speech Recognition Components in Process, taken from (Yadeta, 2016)
List of Acronyms
EM: Expectation-Maximization
Abstract
Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computer. The goal of speech recognition is to develop a model that automatically converts speech utterances into a sequence of words. With the similar objective of transforming Afan Oromo spoken words into an equivalent sequence of words, this study explored the possibility of developing Speaker Independent Spontaneous Speech Recognition for the Afan Oromo Language using hybrid models (Hidden Markov Model, Artificial Neural Network, and Gaussian Mixture Model).
A speaker independent spontaneous speech recognizer for Afan Oromo was built using conversational speech between two or more speakers. The training data was planned to comprise 15,000 utterances, of which 14,500 utterances were used for training and the other 500 for testing. Automatic speech recognition (ASR) on some controlled speech has achieved almost human performance; however, performance on spontaneous speech decreases drastically due to the diversity of speaking styles, speaking rate, the presence of additive and non-linear distortion, accents, and weakened articulation. The spontaneous speech recognizer for Afan Oromo was developed with a speech database containing 14,500 utterances for training the acoustic model. The language model was developed using 120MB of text data and a 30MB audio dataset; the results obtained are presented in Chapter Four.
Keywords
Speech recognition, acoustic modeling, language modeling, finite state network, dialog systems
CHAPTER ONE
1.1. INTRODUCTION
Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text (2013).
It is a computer technology that enables a device to recognize and understand spoken words by digitizing the sound and matching its pattern against stored patterns (Adugna, 2015). It enables the computer to understand real spoken words and convert them to text or actions. Speech recognition (also known as Automatic Speech Recognition) is a technology that allows spoken word input into systems: it enables the user to talk to mobile phones and computers, and it uses the spoken word as an input to trigger some action. It is the process of finding the sequence of linguistic characters that forms the spoken word. Speech is a friendly human interface for communication, but spontaneous speech differs in some important ways from the types of speech for which human language technology is often developed (Elizabeth, 2012).
The human ability to communicate with one another has inspired researchers to develop systems that can imitate human beings. Different researchers have been working on several fronts to decode most of the information from the speech signal. Some of these fronts include tasks like identifying speakers by voice, detecting the language being spoken, transcribing speech, translating speech, and understanding speech. Among all speech tasks, automatic speech recognition (ASR) has been the focus of many researchers for several decades. In this task, the linguistic message is the area of interest.
An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters (IJCSIS, 2009). Continuous speech recognizers allow users to speak almost naturally while the computer determines the content (basically, computer dictation). Recognizers with continuous speech capabilities are among the most difficult to create, because they must utilize special methods to determine utterance boundaries (IJCSIS, 2009).
Afan Oromo is one of the major languages that is widely spoken and used in Ethiopia. Currently it is an official language of Oromia state (which is the largest region in Ethiopia). Read speech, by contrast with spontaneous speech, is a common feature of database-query spoken language systems (Ward, 1987); it has a very consistent rate and articulation across the sentence and across speakers (Ward, 1987).
1.2. Statement of the Problem
The objective of speech recognition is to capture the human voice in a digital computer and decode it into corresponding text. The ultimate goal of any automatic speech recognition system is to develop a model that converts speech utterances to text or words (Getachew, 2009). Human beings need, and are trying, to communicate with machines such as computers by voice using natural language processing. That is why researchers are investigating speech recognition technology for different languages, following the writing, reading and speaking rules of each language. Researchers have shown the feasibility of such systems for several languages.
Apart from foreign languages, some researchers have conducted research on Ethiopian languages: (Zegaye, 2003) developed a large vocabulary, speaker independent, continuous Amharic speech recognizer using an HMM-based approach; (Solomon, 2005) explored various possibilities for developing a large vocabulary speaker independent continuous speech recognition system for Amharic; (Adugna, 2015) developed a spontaneous speech recognizer for Amharic using spontaneous speech such as conversation between two or more speakers; (Hafte, 2009) tried to design a speaker independent continuous Tigrigna speech recognition system; and (Abdella, 2010) tried to explore the possibility of developing a prototype speaker dependent speech recognition system.
Besides these local researchers, there is research done on Afan Oromo as well: a speech recognition system using a hybrid Hidden Markov Model and Artificial Neural Network, and a large vocabulary, speaker independent continuous speech recognition system for Afan Oromo using a broadcast news speech corpus (Yadeta, 2016). The reported performance of the latter was low, that is, it had a large word error rate. The researcher attributed this to the use of a bigram language model and recommended that using a trigram would improve performance. See Chapter 2 for a detailed review of these works.
Most of the aforementioned researchers used the Hidden Markov Model for speech recognition development. Any work with Hidden Markov Models requires three things: estimating the A (transition) matrix, estimating the B (observation) matrix, and estimating the initial vector π. All of these impose limitations, and HMMs do not encode the physics of the vocal tract. The Artificial Neural Network is a popular tool in condition monitoring (South Africa UWJ, 2006). The Gaussian Mixture Model is a classification tool in pattern recognition tasks like speech and face recognition (South Africa UWJ, 2006), while the Hidden Markov Model is used for modeling dynamic signals such as speech. The success of GMM in classification has also been demonstrated by many researchers, such as Cardinaux and Marcel (Reynolds, 2000), who compared GMM and MLP in face recognition and speech recognition. A Markov chain is a random process of discrete-valued variables that involves a number of states (South Africa UWJ, 2006). Therefore, this thesis has been conducted with a hybrid model to develop a successful speech recognizer for Afan Oromo, in which the machine is trained on speech that is spontaneous rather than rehearsed.
1.3. Research Questions
What are the challenges of developing spontaneous speech recognition for Afan Oromo using a corpus collected from social media like Oromia Broadcasting Network?
What are the techniques and methodologies that can be used to develop a speaker independent spontaneous speech recognizer for Afan Oromo?
1.4. Objectives of the Study
1.4.1. General Objectives
The general objective of the study is to improve the development of spontaneous speech recognition for Afan Oromo using a hybrid approach.
1.4.2. Specific Objectives
To explore the methodologies, tools and techniques used in previous research.
To build acoustic and language models for the collected audio and text corpora.
To build a prototype that recognizes words spoken by a real person to the machine using the hybrid model.
1.5. Scope and Limitations of the Study
The main aim of this study is to scrutinize the possibility of developing ASR for Afan Oromo using a corpus collected from Afan Oromo social media such as Oromia Broadcasting Network and Oromia Media Network, together with a self-recorded dataset. Due to time and financial constraints, only about 17 hours of speech corpus were collected, and the text corpus was collected from the Kallacha Oromiyaa newspaper. A limitation of the research is that it has no component identifying the age, dialect or sex of the speaker. A bigram language model is not used in this thesis; instead, a trigram is used for the large Afan Oromo dataset, because the bigram approximation considers only one word of context.
1.6. Research Methodology
The research methodology used in this study comprises reviewing related literature, data collection, data preparation, data modeling, and evaluation.
1.6.1. Review of Related Literature
To find the approaches, techniques, methods and tools used in speech recognition theses already developed for different languages, a number of studies were reviewed. In addition, research papers, articles and other literature on related topics were reviewed.
1.6.2. Data Selection and Preparation
The data was collected from Oromia Broadcasting Network, Oromia Media Network and Finfine Integrated Broadcast, and was self-recorded, in both audio and text format. In addition to those media, the researcher recorded speech manually from 10 male and 5 female speakers at Sol Studio in Ambo town. Out of the audio speech collected from the broadcasting networks, the researcher constructed around 4000 sentences after audio pre-processing, giving 4000 utterances for the experiments of the study. Also, after text normalization, a text corpus of 3,000 sentences was prepared for language modeling purposes. For training the built system, the researcher used 3500 utterances, and 500 utterances were used for testing. The detailed explanation is presented in Chapter Three.
1.6.3. Modeling Techniques
Fulufhelo V. Nelwamondo (December 2005, revised 2006) clearly stated the definitions and appropriate functions of the three models: Artificial Neural Networks, Gaussian Mixture Models and Hidden Markov Models. According to this definition, the GMM is a classification tool for pattern recognition, particularly in speech and face recognition, while the Hidden Markov Model is used in machine controlling (Bernard, 1997), speech recognition (Zhang, 1998) and fault detection (Zhang, 1998). According to Jurafsky and Martin (2007), the Hidden Markov Model (HMM) is the more appropriate model for speech recognition because it is very rich in mathematical structure (Gonfa, 2016). The Gaussian Mixture Model has been a reliable classification tool in many applications of pattern recognition, particularly in speech and face recognition, and GMMs have been shown to perform better than Hidden Markov Models in text-independent speaker recognition (South Africa UWJ, 2006). Hence, this study uses a hybrid model of Artificial Neural Network (ANN), Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM).
1.6.4. Testing Techniques
To test the accuracy of the trained system, the Word Error Rate (WER) is used, because it is the standard evaluation metric for speech recognition systems and is provided by the HTK toolkit (CUED, 2009). WER measures how faithfully the speech recognizer returns the word string while in operation, and is computed from the insertion, deletion and substitution errors in the recognized words:

WER = ((S + D + I) / N) × 100

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, H is the number of correct words, and N is the number of words in the reference. A high word error rate shows low accuracy, and a low word error rate shows high accuracy. To evaluate the performance of our speech recognizer we take the word accuracy (Wacc), which can be obtained from the word error rate as

Wacc = 100 − WER.
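As an illustration of how this metric behaves, the following short Python sketch computes WER by dynamic-programming alignment of a hypothesis against a reference transcript and derives Wacc from it. It is a minimal re-implementation for illustration only, not the HTK HResults tool, and the sample word lists are hypothetical.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment of two word lists."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edits (S + D + I) turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                         # i deletions
    for j in range(m + 1):
        d[0][j] = j                         # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub,              # substitution (or exact match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return 100.0 * d[n][m] / n

ref = "mana keenya gaarii dha".split()      # hypothetical reference transcript
hyp = "mana gaarii dha".split()             # hypothetical recognizer output
w = wer(ref, hyp)
print("WER = %.1f%%, Wacc = %.1f%%" % (w, 100 - w))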
1.6.5. Tools used for the Study
The collected audio speech was saved in the WAV audio file format during pre-processing. The audio processing tool was Audacity, and the Hidden Markov Model Toolkit (HTK), a toolkit for building hidden Markov models that can model any time series, was used for building the recognizer. Microsoft Visual Studio 2017 with full Python libraries was used for prototyping. HTK was preferred over other toolkits due to the researcher's familiarity with it, because it is freely available for academic and research use, and because it is a state-of-the-art toolkit.
1.6.6. Significance of the Study
Human beings need to communicate with machines like computers through their vocal speech, and speech recognition is essential to the technologies that make this possible; this is the motivation for developing speaker independent spontaneous speech recognition for Afan Oromo. Many researchers have attempted to develop speech recognition of different types, like isolated and continuous speech recognition; this thesis focuses on developing spontaneous speech recognition for Afan Oromo using a hybrid approach.
1.6.7. Organization of the Study
This thesis comprises five chapters. Chapter Two reviews the related literature in detail. Chapter Three presents detailed information about the Afan Oromo language. Chapter Four covers data preparation, the experiments and the outcomes of the work, and the final chapter presents the conclusion and recommendations.
CHAPTER TWO
LITERATURE REVIEW
2.1. Introduction
In this chapter, the overview of ASR systems, the types of ASR, the applications of ASR, and other aspects of speech recognition are described; the approaches and tools used in the area of speech recognition are also discussed. At the end of the chapter, a survey of ASR work is given in the related works section.
2.2. Overview of Automatic Speech Recognition (ASR)
Speech recognition, which is also called Automatic Speech Recognition (ASR), works by taking audio speech as input and giving a string of words as output (Jurafsky, 2014). To develop an ASR system, audio is needed as the input, and large text data is required for building the language model. Here the required audio is real-world, spontaneous speech. Developing a spontaneous ASR system from real-world speech is not as easy as using a read speech corpus for continuous speech recognition: continuous speech recognition does not involve cut-offs and fillers, whereas spontaneous speech contains false starts and pauses while talking, because it takes its input from real speakers.
2.2.1. Categories of Automatic Speech Recognition (ASR) System
Several works in the literature reveal that ASR systems can be classified into different categories based on the nature of the utterance, the speaker type, and the vocabulary size (Suman et al., 2015).
Types of Speech Recognition Based on the Nature of Utterances
Depending upon the nature of the utterances, ASR systems can be classified as isolated word, connected word, continuous, and spontaneous speech recognition systems.
Isolated word speech recognition system: a recognizer that recognizes single words. In such systems, a pause or silence is needed before and after each word. It is suitable for conditions where the user is required to give only a single word. This type of ASR system is the simplest, because the word boundaries are easily identified (Yadeta, 2016). The beginning and the end of each word can be detected directly from the energy of the signal (Manan, 2013).
Connected word speech recognition system: this type of recognizer is similar to the isolated word recognizer; the difference is that it allows separate utterances to be run together with only a minimal pause between the words. Recall, however, that the pause is still what shows the boundary of words.
Spontaneous speech recognition system: this type of recognizer handles natural utterances, which can be distinguished from well-planned utterances like radio news and movie dialogues (Mikio Nakano, 2013). An example of this kind of speech is natural conversation between speakers.
Types of Speech Recognition Based on Speaker Class
Speaker dependent systems are developed for a particular speaker, and their performance depends on the speaker's characteristics like accent and dialect (Adugna, 2015). Speaker independent speech recognition is not as easy and cheap as speaker dependent recognition (Adugna, 2015). It is more difficult, because the internal representation of the speech must somehow be global enough to cover all types of voices and all possible ways of pronouncing words, and yet specific enough to discriminate between the words of the vocabulary.
Types of Speech Recognition Based on Vocabulary Size
Small vocabulary: speech recognition with 1 to 100 words; it is suitable for simple command-and-control tasks.
Medium vocabulary: speech recognition with 101 to 1000 words (Yadeta, 2016).
Large vocabulary: speech recognition with 1001 to 10,000 words.
Very large vocabulary: speech recognition developed with more than 10,000 words.
2.2.2. History of Speech Recognition System
Automatic speech recognition started in the 20th century, as mentioned by (Rabiner and Juang, 2004). Speech recognition began with an isolated word recognition system for a single speaker, built at Bell Laboratories in 1952. It was designed to recognize the ten digits (one to nine and zero).
The first speech recognition systems, developed in the 1960s, handled small vocabularies of only 10-100 words using the acoustic-phonetic approach. Systems for medium vocabulary speech recognition of 101 to 1000 words followed in the 1970s (Yadeta, 2016); these were also developed for connected digits and continuous speech in addition to isolated word recognition (Rabiner & Juang, 2004). The evolution of ASR continued with the idea of large vocabulary recognition starting in the 1980s, with systems developed for connected words and continuous speech by applying the statistical approach. Continuing from large vocabulary systems, very large vocabulary ASR was investigated in the 2000s for spoken dialogue systems.
Table 1: Historical development of Speech Recognition taken from (NEC R & D Meeting 2009)
2.2.3. Speech Recognition Approaches
According to Rabiner and Juang (1993), there are several approaches to carrying out automatic speech recognition:
a. Acoustic-Phonetic Approach
b. Template-Based Approach
c. Statistical (Stochastic) Approach
d. Artificial Intelligence (Knowledge-Based) Approach
a. Acoustic-Phonetic Approach
This approach relies on knowledge of phonetics and linguistics to guide the search process (Yadeta, 2016; Adugna, 2015). It depends on finding speech sounds and providing appropriate labels to those sounds. The approach has drawbacks, as mentioned by Rabiner and Juang (1993); even so, it can be used in artificial-intelligence-based recognizers.
b. Template-Based Approach
In this approach, unknown speech is compared against a set of pre-recorded examples (templates) in order to find the example that most closely fits the input. It extracts features from the speech signal and then matches them against templates with similar features (Adugna, 2015).
c. Statistical Approach
This approach is an extension of the template-based approach using more powerful mathematical and statistical tools. It uses probabilistic models to deal with the uncertain and incomplete information found in speech recognition; the most widely used such model is the HMM, although in this thesis we use all three of HMM, GMM and ANN. The approach works by collecting a large corpus of transcribed speech recordings and training the computer on it; at run time, statistical processes search through the space of all possible solutions and pick the statistically most likely one.
d. Artificial Intelligence (Knowledge-Based) Approach
The main idea of this approach is collecting and employing knowledge from different sources in order to perform the recognition process. The knowledge sources include acoustic, lexical, syntactic, semantic and pragmatic knowledge, all of which are important for speech recognition. The approach is a hybrid of the acoustic-phonetic approach and the pattern recognition approach: it exploits the ideas and concepts of both, and it uses information regarding linguistics, phonetics and spectrograms (Debela, 2011).
2.2.4. OVERVIEWS OF HMM, GMM AND ANN
The core of the pattern-matching approach to speech recognition is a set of statistical models. Speech has a sequential structure and can be encoded as a sequence of spectral vectors, and the hidden Markov model (HMM) provides a natural framework for constructing such models.
An HMM is a Markov chain plus an emission probability function for each state. In a plain Markov chain each state represents one observable event, but that model is too restrictive: for a large number of observations the size of the model explodes, and the case where the range of observations is continuous is not covered at all (Jurafsky D., 2009). A first-order hidden Markov model is specified by:

S = {S1, S2, S3, ..., SN}: a set of states (usually indexed by i, j); st = i means that the model is in state i at time t.

A = {aij}: a transition probability matrix, each aij representing the probability of moving from state i to state j, with aij ≥ 0 for all i, j and Σj=1..N aij = 1 for all i.

O = o1 o2 ... oT: a sequence of observations, each one drawn from a vocabulary V = v1, v2, v3, ..., vV.

B = bi(ot): an emission probability distribution, each bi(ot) expressing the probability of observation ot being generated from state i.

π = π1, π2, π3, ..., πN: an initial probability distribution over states; πi is the probability that the Markov chain starts in state i.
The three basic HMM problems are evaluation, decoding and training [21]. The next topics discuss these three problems and their solutions.
Computing likelihood (evaluation): given an HMM λ = (A, B, π) and an observation sequence O, determine the likelihood P(O|λ).
Decoding: given an observation sequence O and an HMM λ, discover the best hidden state sequence Q, which can be written as Q = Q1, Q2, Q3, ..., QT.
Learning: optimize the model parameters so that they best describe the observations. The data used here are called training sequences, since they are used for training the HMM. Training is one of the crucial elements of HMM work: it allows adjusting the model parameters to the observed data.
The three HMM problems are solved with the following algorithms.
Forward Algorithm for Computing Likelihood [P(O|λ)]
We want to find P(O|λ), given an observation sequence O = O1, O2, ..., OT and a model. The most straightforward way to find the solution is to enumerate every possible state sequence of length T. Consider one such state sequence Q = Q1, Q2, ..., QT, such that Q1 produces O1 with some probability, Q2 produces O2 with some probability, and so on. Enumerating via the chain rule in this way requires on the order of N^T state sequences, with calculations at every t = 1, 2, ..., T, which quickly becomes intractable; the forward algorithm computes the same likelihood efficiently by dynamic programming.
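To make the forward recursion concrete, the sketch below computes P(O|λ) for a small discrete HMM using the λ = (A, B, π) notation defined above. The model values are toy numbers chosen for illustration, not parameters estimated in this study.

import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(O | lambda) for a discrete HMM via the forward algorithm, O(N^2 T)."""
    T = len(obs)
    alpha = pi * B[:, obs[0]]               # initialization at t = 1
    for t in range(1, T):
        # induction: sum over all predecessor states, then emit o_t
        alpha = (alpha @ A) * B[:, obs[t]]
    return alpha.sum()                      # termination

A = np.array([[0.7, 0.3],                   # toy 2-state transition matrix
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],              # toy emission matrix, 3 symbols
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])                   # toy initial distribution
print(forward_likelihood(A, B, pi, obs=[0, 1, 2]))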
Viterbi Algorithm for Decoding the Hidden State Sequence [P(Q, O|λ)]
Unlike the likelihood computation, there is no single exact definition of the optimum state sequence. One way to find an optimum state sequence is to choose the states that are individually most likely, but this way has a serious flaw: if two states i and j are selected such that aij = 0, then despite each being the most likely state at times t and t+1, the result is not a valid state sequence. This is solved by deciding on an optimality criterion over whole sequences. The single best state sequence is found using a dynamic programming algorithm called the Viterbi algorithm, which has widespread application in speech recognition and is used for decoding.
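A matching sketch of the Viterbi recursion, reusing the toy λ = (A, B, π) from the forward example above, returns the single best state sequence together with its joint probability P(Q, O|λ):

import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden state sequence for a discrete HMM."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)       # backpointers to best predecessors
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]        # best final state
    for t in range(T - 1, 0, -1):           # backtrack through the pointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())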
Baum-Welch Algorithm for Training
Training iteratively increases the log likelihood and updates the current model to be closer to the optimal model. However, the optimal solution is not guaranteed, and adjusting the model parameters to maximize the probability of the observation sequence given the model remains challenging.
In this study, we use the GMM to represent a large class of sample distributions over our training data. The GMM problem and its solution are stated as follows, with the help of Maximum Likelihood (ML) estimation, which is used to find the model parameters that maximize the likelihood of the GMM given the training data. Given a sequence of T training vectors X = {x1, x2, x3, ..., xT}, the GMM likelihood, assuming independence between the vectors, is

P(X|λ) = ∏ t=1..T P(xt|λ),

where each density is a weighted sum of N Gaussian components,

P(x|λ) = Σ i=1..N pi g(x; μi, Σi), i = 1 to N.

Direct analytic maximization of this likelihood is not possible, and hence the iterative Expectation-Maximization (EM) algorithm was used, for training as well as for matching purposes. In each iteration, the normalized likelihoods (responsibilities)

nti = pi g(xt; μi, Σi) / Σ k=1..N pk g(xt; μk, Σk)

are computed, and the weights, means and covariances are then re-estimated from them.
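As a brief illustration of this EM-based estimation, the sketch below fits a diagonal-covariance GMM to stand-in feature vectors with scikit-learn, whose GaussianMixture class implements exactly this iterative E-step/M-step procedure. The random data and the component count are arbitrary placeholders for real MFCC vectors.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))   # placeholder for T x 39 MFCC training vectors

# EM fitting: the E-step computes the normalized likelihoods (responsibilities),
# the M-step re-estimates weights p_i, means mu_i and covariances Sigma_i.
gmm = GaussianMixture(n_components=8, covariance_type="diag", max_iter=100)
gmm.fit(X)

print(gmm.score(X))              # average log-likelihood of the data per sample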
Speech is the most efficient mode of communication between people. This being the case, people also want to communicate naturally with machines, and therefore the popularity of automatic speech recognition systems has greatly increased. There are different approaches to speech recognition, such as the Hidden Markov Model (HMM), Dynamic Time Warping (DTW), and Vector Quantization (VQ). Recognition is made difficult by variation in emotional state, gender, pitch, speed, volume, background noise and echoes (Mikio Nakano, 2013).
1. Feed-forward Artificial Neural Network: these are the first and simplest form of ANN (Kamble, 2005). In this network the information flows in only one direction, that is, only forward information flow is allowed, and the network has no loops or cycles. A neuron in layer 'a' can send data only to a neuron in layer 'b' if b is greater than a. Learning is the adaptation of the network to the training data; learning with a teacher is called supervised training, whereas learning without a teacher is called unsupervised training (Kamble, 2005). The back-propagation algorithm emerged to train a new class of layered feed-forward network called the Multi-Layer Perceptron (MLP). It generally contains at least two layers of perceptrons: one input layer, one or more hidden layers and an output layer. The hidden layer plays a very important role and acts as a feature extractor; it uses a nonlinear activation function. Trained to minimize classification error, the output layer acts as a logical net which chooses an index to send to the output on the basis of the input it receives from the hidden layer (Kamble, 2005).
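The layered feed-forward computation described above can be sketched in a few lines of NumPy; the layer sizes here are arbitrary illustrative choices, not a configuration used in this study.

import numpy as np

def mlp_forward(x, weights, biases):
    """One MLP forward pass: nonlinear hidden layers, softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)              # hidden layer as feature extractor
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()                      # class probabilities

rng = np.random.default_rng(0)
sizes = [39, 64, 10]                        # input dim, hidden units, classes
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
print(mlp_forward(rng.normal(size=39), weights, biases))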
2. Recurrent Neural Network (RNN): a neural network that operates in time. An RNN accepts an input vector, updates its hidden state via a non-linear activation function, and uses that hidden state, together with the next input, to produce its outputs over time.
ANNs have several attractive properties:
ANN is flexible in changing environments.
ANN can build informative models where conventional models fail, and can handle noisy or incomplete data.
ANN is a non-linear model that is easier to use and understand than statistical methods.
ANN has the ability to learn how to do a task based on the data given for training.
ANNs can create their own organization and require no supervision, as they can learn on their own.
ANN can be used in pattern recognition, which is a powerful technique for harnessing the information in data and generalizing from it.
2.3. Automatic Speech Recognition Process
Automatic Speech Recognition (ASR) is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform. Speech understanding goes one step further and gleans the meaning of the utterance in order to carry out the speaker's command.
Figure 2: Automatic Speech Recognition Components in Process, taken from (Yadeta, 2016)
i. Acoustic Model: models the relationship between the audio signal and the phonetic units in the language. It is a file that contains a statistical representation of each distinct sound that makes up a spoken word (Yadeta, 2016); it consists of the phonemes that form words.
ii. Language Model: responsible for modeling the word sequences in the language. A statistical language model is a probability distribution over sequences of words: for a word sequence Ws = w1, w2, w3, ..., wl of length l, it assigns the probability P(Ws) = P(w1, w2, w3, ..., wl) to the whole sequence. It is used to improve the performance of speech recognition and translation systems (Yadeta, 2016). A language model contains very large lists of words.
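As a sketch of how such a model assigns probabilities, the snippet below estimates trigram probabilities P(wi | wi-2, wi-1) from raw counts over a tiny hypothetical corpus; a real language model would be trained on the full text corpus and smoothed.

from collections import Counter

corpus = ["<s> <s> mana keenya gaarii dha </s>",   # hypothetical sentences
          "<s> <s> mana gaarii dha </s>"]

tri, ctx = Counter(), Counter()
for line in corpus:
    w = line.split()
    for a, b, c in zip(w, w[1:], w[2:]):
        tri[(a, b, c)] += 1      # trigram counts
        ctx[(a, b)] += 1         # two-word context counts

def p_trigram(w3, w1, w2):
    """Maximum-likelihood P(w3 | w1, w2); real systems apply smoothing."""
    return tri[(w1, w2, w3)] / ctx[(w1, w2)] if ctx[(w1, w2)] else 0.0

print(p_trigram("gaarii", "mana", "keenya"))       # 1.0 in this toy corpus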
iii. Decoder: a software program that takes the sounds spoken by the speaker and searches for the equivalent acoustic model (Yadeta, 2016). When a match is made, the decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech, then searches the language model or grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program (Yadeta, 2016). The decoder has to deal with a huge search space obtained by combining the acoustic model and the language model; its aim is to determine the most likely word sequence W, given the language model, pronunciation dictionary and acoustic model (Adugna, 2015).
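This search is conventionally summarized by the fundamental equation of statistical speech recognition, which combines exactly these two knowledge sources, the acoustic model P(O|W) and the language model P(W):

W* = argmax over W of P(W|O) = argmax over W of P(O|W) · P(W)

where O is the sequence of acoustic observations; the denominator P(O) is dropped because it does not depend on W.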
Recognition (pattern matching) refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) of each element that can be recognized.
2.4. Toolkits used in Speech Recognition
There are several toolkits implementing the algorithms needed in speech recognition, such as CMU Sphinx, the Hidden Markov Toolkit (HTK), Kaldi, Julius and ISIP. HTK is a portable software toolkit for building and manipulating systems that use continuous density hidden Markov models. It has been developed by the Speech Group at Cambridge University Engineering Department. HMMs can be used to model any time series, and the core of HTK is similarly general purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular speech recognizers. It can be used to perform a wide range of tasks in this domain, including isolated or connected speech recognition using models based on whole-word or sub-word units, and it is especially suitable for building continuous speech recognition systems.
2.4.1. Data Preparation Tools
HSLab is an interactive label editor for manipulating speech label files. It can be used both to record speech and to manually annotate it with any required transcriptions (Adugna, 2015). An example of using HSLab would be to load a sampled waveform file, determine the boundaries of the speech units of interest and assign labels to them. Alternatively, an existing label file can be loaded and edited by changing current label boundaries, deleting labels and creating new ones. HSLab is the only tool in the HTK package which provides a graphical user interface (Adugna, 2015). HCopy is used to parameterize the data just once. HList is used to check the contents of any speech file and to check conversions before processing large quantities of data. HLEd is a script-driven label editor designed to make transformations to label files, like translating word-level label files to phone-level label files, merging labels or creating tri-phone labels. HLEd can also output files to a single Master Label File (MLF), which is usually more convenient for subsequent processing (Adugna, 2015).
2.4.2. Training Tools
HTK allows HMMs to be built in any desired topology (Adugna, 2015). HMM definitions can be stored externally as simple text files, and hence it is possible to edit them with any convenient text editor (Adugna, 2015). The lists of words in the datasets were recorded from 20 speakers from different areas with different accents, so that the speech agent can be added to the Windows operating system and a program written in C# and Python can retrieve the data from it. However, because this work concerns spontaneous speech between the machine and native or non-native speakers of the language, we prepared a list of 15,000 words, of which 500 are used for testing.
For data storage we used comma-delimited files in Microsoft Excel 2010, from which the word lists were formed; precision and recall were used to examine the processing.
The second step of system building is to define the topology required for each HMM by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files, and hence it is possible to edit them with any convenient text editor. Alternatively, the standard HTK distribution includes a number of example HMM prototypes and a script to generate the most common topologies automatically. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored: the purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM, and the actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given, but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely.
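To make the prototype definition concrete, the short Python script below writes a plain-text prototype in the standard HTK HMM definition format: a 5-state model (3 emitting states) for 39-dimensional MFCC_0_D_A features, with placeholder zero means, unit variances, and roughly equal transition probabilities out of each state. This is a minimal sketch of the format only; the actual prototype used in this study is the one listed in Appendix G.

N_DIM = 39  # MFCC_0_D_A: 13 static + delta + acceleration coefficients

def state_block(state_id):
    """One emitting state with placeholder mean and variance vectors."""
    means = " ".join(["0.0"] * N_DIM)
    variances = " ".join(["1.0"] * N_DIM)
    return ("<State> %d\n<Mean> %d\n %s\n<Variance> %d\n %s\n"
            % (state_id, N_DIM, means, N_DIM, variances))

transitions = ("<TransP> 5\n"
               " 0.0 1.0 0.0 0.0 0.0\n"   # entry state always moves on
               " 0.0 0.6 0.4 0.0 0.0\n"   # emitting states: loop or advance
               " 0.0 0.0 0.6 0.4 0.0\n"
               " 0.0 0.0 0.0 0.7 0.3\n"
               " 0.0 0.0 0.0 0.0 0.0\n")  # exit state: no outgoing transitions

proto = ("~o <VecSize> 39 <MFCC_0_D_A>\n~h \"proto\"\n<BeginHMM>\n<NumStates> 5\n"
         + "".join(state_block(s) for s in (2, 3, 4))
         + transitions + "<EndHMM>\n")

with open("proto", "w") as f:
    f.write(proto)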
2.4.3. Recognition Tools
HTK provides a recognition tool called HVite that allows recognition using language models and lattices, and HLRescore, a tool for processing lattices generated using HVite (or other tools). HVite uses the token passing algorithm to perform Viterbi-based recognition. It takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance.
Recognition can then be performed on either a list of stored speech files or on direct audio
input. HVite can support cross-word triphones and
it can run with multiple tokens to generate lattices containing multiple hypotheses. It can
also be configured to rescore lattices and perform forced alignments. The word networks
needed to drive HVite are usually either simple word loops in which any word can follow
any other word or they are directed graphs representing a finite-state task grammar. In the
former case, bigram probabilities are normally attached to the word transitions. Word
networks are stored using the HTK standard lattice format. This is a text-based format and
hence word networks can be created directly using a text-editor. However, this is rather
tedious and hence HTK provides two tools to assist in creating word networks. Firstly,
HBuild allows sub-networks to be created and used within higher level networks. Hence,
although the same low level notation is used, much duplication is avoided. Also, HBuild
can be used to generate word loops and it can also read in a backed-off bigram language
model and modify the word loop transitions to incorporate the bigram probabilities. Note
that the label statistics tool HLStats mentioned earlier can be used to generate a backed-off bigram language model. Secondly, as an alternative to specifying a word network directly, a higher-level grammar notation can be used. This notation is based on the Extended Backus Naur
Form (EBNF) used in compiler specification and it is compatible with the grammar
specification language used in earlier versions of HTK. The tool HParse is supplied to
convert this notation into the equivalent word network. Whichever method is chosen to
generate a word network, it is useful to be able to see examples of the language that it
defines. The tool HSGen is provided to do this. It takes as input a network and then
randomly traverses the network outputting word strings. These strings can then be
inspected to ensure that they correspond to what is required. HSGen can also compute the
empirical perplexity of the task. Finally, the construction of large dictionaries can involve
merging several sources and performing a variety of transformations on each source. The
dictionary management tool HDMan is supplied to assist with this process. HLRescore is
a tool for manipulating lattices. It reads lattices in standard lattice format (for example, as produced by HVite) and applies one of the following operations to them:
• finding the 1-best path through a lattice: this allows language model scale factors and insertion penalties to be optimized rapidly;
• expanding lattices with a new language model: this allows the application of more complex language models, e.g. 4-grams, than can be used efficiently in the decoder;
• converting lattices to equivalent word networks: this is necessary prior to using lattices as input recognition networks.
2.4.4. Analysis Tools
Once the HMM-based recognizer has been built, it is necessary to evaluate its performance. This is usually done by using it to transcribe some pre-recorded test sentences and matching the recognizer output against the correct reference transcriptions. This comparison is performed by a tool called HResults, which uses dynamic programming to align the two transcriptions and then count substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and output formats used by HResults are compatible with those used by the US National Institute of Standards and Technology (NIST). As well as global performance measures, for word-spotting applications HResults can also compute Figure of Merit (FOM) scores and Receiver Operating Curve (ROC) information.
There are three possible types of errors that can occur in speech recognition: substitution errors, in which a reference word is replaced by a different word; deletion errors, in which a reference word is omitted; and insertion errors, which occur when the ASR system generates a word that does not correspond to any word in the reference transcript. Once it finds the optimal alignment, HResults calculates the number of substitution errors (S), deletion errors (D) and insertion errors (I), where N is the total number of labels in the reference:

Percent Correct = ((N − D − S) / N) × 100%    (1)

As observed from the above equation, percent correct ignores the insertion errors. The percentage accuracy takes them into account:

Percent Accuracy = ((N − D − S − I) / N) × 100%    (2)

HResults outputs both of the above measures during result analysis (Adugna, 2015).
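To connect these tools end to end, a recognition-and-scoring run can be driven from Python as sketched below. The command-line options follow standard HTK tutorial usage of HVite and HResults, but the file names (HMM definitions, word network, dictionary, script and label files) are placeholders standing in for this study's actual artifacts.

import subprocess

# Decode the coded test utterances with HVite (placeholder file names).
subprocess.run([
    "HVite",
    "-H", "hmmdefs",            # trained HMM definitions
    "-S", "afanoromotest.scp",  # list of coded test files
    "-i", "recout.mlf",         # where recognized transcriptions are written
    "-w", "wdnet",              # word network built from the language model
    "-p", "0.0", "-s", "5.0",   # insertion penalty and LM scale factor
    "dict", "monophones"        # pronunciation dictionary and HMM list
], check=True)

# Align the output against the references and report correctness/accuracy.
subprocess.run([
    "HResults",
    "-I", "testref.mlf",        # reference master label file
    "monophones", "recout.mlf"
], check=True)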
2.5. Related Works
Research on spontaneous speech recognition has been conducted for different languages. (Adugna, 2015) worked on spontaneous speech recognition for the Amharic language using HMMs with a 90-minute speech database; the language model was developed using 30MB of text data and different smoothing techniques, 68 sentences were used for testing, and the result obtained was 47% accuracy and 53% WER. The collected database is very small, as Adugna himself notes: "The size of data we have used is very low in size if compared with another databases used for other language (CSJ)" (Adugna, 2015). Using CSJ training data 510 hours long and around 6.84M words for language modeling, a best word error rate of 25.3% was obtained (Furui S, et al., 2005).
A spontaneous speech recognizer for Japanese was developed by (Furui S, et al., 2005) using two different corpora: the "Corpus of Spontaneous Japanese (CSJ)" and text collected from the World Wide Web. In this experiment, part of the CSJ was used as the test set for the evaluation of speech recognition (Furui S, et al., 2005). (Yadeta, 2016) developed a large vocabulary, speaker independent continuous speech recognition system for Afan Oromo using a broadcast news speech corpus. The speech recognizer was developed from 57 anchors or speakers (42 males and 15 females) using 2953 sentences totalling 06:15:38 hours of speech. In addition, text data of about 2000 sentences
was used for language modeling purposes, collected from Kallacha Oromiyaa. Out of the 2953 sentences, 2653 (a 10,138 unique-word dictionary) were used for training the speech recognizer and the remaining 300 sentences (a 2,516 unique-word dictionary) were used to test the developed system. Speakers who were involved in training were not involved in testing: of the 57 speakers, only 12 (9 males and 3 females) participated in testing [1].
(Murat, 2003) conducted research on large vocabulary continuous speech recognition for Turkish using HTK. In this work, he used un-segmented data consisting of 7650 utterances spoken by 104 male and 89 female speakers for training, and performed five experiments. The first was an IWR (Isolated Word Recognition) task, for which he cut 40 speech segments containing only one word from the test data. In the second, he tested a CWR (Connected Word Recognition) system with no grammar, meaning every word can follow any other word with no rule applied. In the third experiment, he tested a CWR system with a simple grammar that he designed, in which follower words are determined according to that grammar. The fourth experiment was a CSR (Continuous Speech Recognition) system in which the cross-word expanded network is based on bigrams that include stems and endings. The fifth experiment was performed in order to test the bigram language model actually proposed in his thesis, developed using the HTK language modeling tool (HLM). The test utterances contained 220 sentences (a vocabulary of 1168 words) that were randomly
selected from the text corpus and continuously spoken by 6 speakers (4 male, 2 female). After performing the five experiments and comparing the correct sentence recognition rates (CSRR) of 20.09% and 30.90% obtained in experiments 4 and 5, experiment 5 performed better than experiment 4, since its correct sentence recognition rate is higher. Finally, since he applied no smoothing algorithm to the language model built, he concluded that applying a smoothing technique would be better.
(Adugna, 2015) built the Amharic recognizer with a speech database from a single domain containing 1550 utterances for training the acoustic model. The language model was developed using 30MB of text data and different smoothing techniques, and 68 sentences from the same domain were used for testing. Using a context dependent acoustic model of 8 Gaussian mixtures and a tri-gram language model with absolute discounting smoothing, the results obtained were 47% word accuracy and 53% WER. Based on the experimental results, if the data size is small, using training and testing data from the same domain is advisable.
CHAPTER THREE
AFAAN OROMO LANGUAGE
3. INTRODUCTION
Afan Oromo belongs to the Cushitic branch of the Afro-Asiatic language family of Africa, which is divided into six major families: Chadic, Berber, Egyptian, Cushitic, Omotic, and Semitic. The Cushitic family is further divided into four groups: North, Central, South and East. Accordingly, Afaan Oromo is one of the languages of the Lowland group within the East Cushitic group, and the most widely spoken language of the Cushitic family. The Ethiopian census (2007) indicates that Afaan Oromo speakers make up around 34% of the total Ethiopian population (Yadeta, 2016). (Abraham, 2014) also stated that Afaan Oromo is spoken by more than 40 million people around the world; the majority of these speakers live in Ethiopia, and others live in neighboring countries such as Kenya and Somalia.
3.3. AFAN OROMO ALPHABETS (QUBEE AFAAN OROMOO)
Afan Oromo started to be written in 1991 G.C. with Latin alphabets called 'Qubee', which were formally approved in that year (Abraham, 1993). Afan Oromo has 26 letters adopted from the Latin alphabet plus six additional compound letters, for a total of 32 (Yadeta, 2016). Afan Oromo alphabets are divided into two groups: the vowels, which are five (5), and the consonants, which are twenty-seven (27) (Yadeta, 2016).
Table 2: Afan Oromo Alphabets
3.3.1. Sound Formation in Afan Oromo
Afan Oromo has five vowels that are used to form sounds with consonants or by themselves
(Yadeta, 2016). A sound can be either long or short. A Short Sound (Sagalee Gabaabaa) is a sound formed from a consonant and a single vowel. Sounds have no meaning independently unless they relate with other sounds to form a word. E.g. Ma- is a sound formed from the consonant 'M' and the single vowel 'a'. Ma- has no meaning by itself, but if Ma- is concatenated with -na, then the word 'Mana', which means Home or House in English, is formed. Both Ma- and -na are short sounds. A Long Sound (Sagalee Dheeraa) is formed from a consonant and two vowels of the same kind; in Afan Oromo, no two different vowels form a long sound. E.g. the sound Gaa- is formed from the consonant 'G' and two vowels of the same kind, '-aa'. This 'Gaa-' has no meaning by itself unless it is concatenated with other sounds that help it to have a meaning. So, 'Gaa-' and '-rii' form 'Gaarii', a meaningful word which means Good. Sounds can also be either geminated or non-geminated; e.g. the doubled -dd- in Baddaa, which means Highland, makes it a geminated sound. Afan Oromo consonants can be
geminated except [h] and Compound Symbols (Yadeta, 2016). Compound letters in Afan
Oromo are categorized under consonants (Hinsene, 2009). These compound symbols are constructed in the language from double consonants to represent one single consonant
(Yadeta, 2016). These consonants are pronounced as a single letter even though they are
33
formed from two independent consonants. E.g. Nyaata which means food, Dhadhaa which
means Butter are words and sounds formed from compound letters.
Afan Oromo Glottal Sound: Afaan Oromo has a glottal sound called 'Hudhaa', a diacritical
marker represented by a single quote ['] or sometimes by [h]. For instance, in Oromo words
like 'Ga'aa or Gahaa', which means enough, both ['] and [h] are used interchangeably, and
most of the time the single quote ['] is used (Yadeta, 2016).
The International Phonetic Alphabet (IPA) is used in our transcription, while Afan Oromo
itself is still written with Latin alphabets. The IPA representation is shown as follows.
Alphabets (Qubee): Aa Bb Cc CH ch Dd DH dh Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn NY ny Oo Pp PH ph Qq Rr Ss SH sh Tt TS ts Uu Vv Ww Xx Yy Zz
IPA for Alphabets: [a] [b] ...
A word is a collection of two or more sounds that has a full meaning. In Afan Oromo, words
can be formed from short sounds, long sounds or a hybrid of the two.
a. Word formation from short sounds: two sounds, each formed from a consonant
followed by a single vowel, are concatenated to each other. E.g. Ma- + -ra are two
independent meaningless sounds; 'Mara' is a meaningful word formed from the two sounds,
and it means All. Other examples are Lama (Two), Buna (Coffee),
Kuma (Thousand), Muka (Tree), Cina (Beside), Laga (River) and many others.
b. Word formation from long sounds: two sounds, each formed from a consonant
followed by two identical vowels, are concatenated to each other. E.g. Laa- + -gaa are
two independent meaningless sounds; after concatenation the word Laagaa, which
means Jaws, is formed. Many others like Laamaa (Starved person), Gaarii
(Good), Yaalii (Treatment), Mootii (King) and Loogii (Biasness) are formed in the same way.
c. Word formation from hybrid sounds: formed from the concatenation of a short
sound and a long sound, or a long sound and a short sound. Words like Garuu (but) are formed this way.
i. Number
Afan Oromo numbers are used to show either a singular or plural form of entities
(Yadeta, 2016). Afan Oromo words have their own singular and plural forms, which are
formed by adding suffixes to the base word. E.g. Nama (Person) is a single person, whereas
the plural form is derived by suffixation.
Nine          Sagal
Ten           Kudhan          "Two Digit" 10-99
Eleven        Kudha Tokko     "Qub-Lamee"
Twelve        Kudha Lama
...
Ninety Nine   Sagaltamii Sagal
ii. Gender
Gender is the other morphological feature of Afan Oromo; gender is called 'Koorniyaa' or
'Saala' in Afan Oromo (Yadeta, 2016). Afan Oromo nouns take suffixes that are used to
identify whether the gender is masculine or feminine. E.g. for Daa'ima (child), masculinity
or femininity cannot easily be identified without a suffix, but Daa'im- + -cha shows the
child is masculine and Daa'im- + -ttii shows the child is feminine.
Afan Oromo is spoken the way it is written, which makes it a phonetic
language (Yadeta, 2016). Afan Oromo is free of the problem of homonymy; that is, there are
no pairs of differently written words with the same pronunciation (like English write and
right). Even though Latin letters were adopted for developing the Afaan Oromo alphabet,
their pronunciation varies with the language, as we have discussed for Table 3. Also, in
the Afaan Oromo alphabet there is no difference in pronunciation between capital and small
letters: as in the English language, which also uses Latin letters, the pronunciations of
[A] and [a] are the same.
Phonetics is the study of speech sounds used in the languages of the world. It is concerned
with the sounds of languages, how these sounds are articulated and how the hearer perceives
them. Phonetics is related to the science of acoustics in that it uses many of the
techniques of acoustic analysis.
Phonology is the study of the sound patterns of a language. It describes the systematic
way in which sounds are differently realized in different contexts, and how this system of
sounds is related to the rest of the grammar. Phonology is concerned with how sounds are
organized and used in a language.
Morphology is the study of word formation and structure. It studies how words are put
together from their smaller parts and the rules governing this process. The elements that
are combined to form words are called morphemes; a morpheme is the smallest meaningful
unit of a language.
Afan Oromo has five long and five short vowels, which can be found either at the beginning,
in the middle or at the end of a word, next to a consonant. E.g. Aadaa consists of two long
vowels, at the beginning and at the end, next to the consonant d. The pronunciation of these
vowels is the same throughout the Oromia region. No word can be created without vowels in
Afan Oromo.
Table 5: Articulation of Afan Oromo vowels compiled from (Yadeta, 2016)
Vowels   Front             Central   Back
Close    i /ɪ/, ii /i:/              u /ʊ/, uu /u:/
Mid      e /ɛ/, ee /e:/              o /ɔ/, oo /o:/
Open                       a /ʌ/     aa /ɑ:/
A front vowel is formed by raising the tongue towards the hard palate, and a back vowel is
articulated at the back of the mouth, while [a] is articulated at the center of the mouth.
As discussed above, most Afan Oromo letters are categorized under consonants:
of the 32 alphabets, only five are vowels and the remaining 27 are consonants
(Yadeta, 2016). As seen above, the vowels are articulated in three positions (front, central
and back) and at three heights (close, mid and open) (Yadeta, 2016).
Additionally, in some sources the glottal sound symbol is categorized under consonants,
and we discussed earlier that it is represented by a single quote ['] or sometimes by the
letter [h]. However, in this section we elaborate on its pronunciation and manner of
articulation. Whether the symbol used for the glottal sound is the single quote or the
letter [h], its IPA representation is /Ɂ/, and it is articulated as a stop or fricative at
the glottal place of articulation. In this section, we discuss mostly the manner of
articulation of the consonants.
Table 6: Articulation of Afan Oromo consonant Alphabets (Compiled from Yadeta, 2016)
Manner of Articulation   Bilabial   Labiodental   Alveolar   Palatal   Velar     Glottal
Stop                     b, ph      f             d, n                 k, g, q   /Ɂ/
Fricative                                         s, z       sh                  h
Affricative                                                  c, h, j
Nasal                    m                        n          ny
Flap                                              r
Lateral                                           l
Semi-vowel               w                                   y         w
Afan Oromo consonants can be articulated as stops, fricatives, affricatives, nasals, flaps,
laterals or semi-vowels, with places of articulation bilabial, labiodental, alveolar,
palatal, velar and glottal.
i. Text Corpus
To estimate the probability of word sequences in speech recognition, the regularities of
the language are captured by the language model; language models are used to model
regularities in natural language (Yadeta, 2016). Large amounts of text are required for
the development of a language model. The text data were collected from the "Kallacha
Oromiyaa" newspaper and normalized manually, because the written text contains different
forms, such as numbers, that must be expanded or cleaned.
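As an illustration, a first normalization pass of this kind can be scripted in the shell; this is only a minimal sketch with hypothetical file names, since the actual normalization in this study was done manually:
# lowercase the text, strip digits and most punctuation (the hudhaa mark ['] is kept),
# and squeeze repeated spaces
tr 'A-Z' 'a-z' < kallacha_raw.txt | sed 's/[0-9]//g; s/[,.;:!?"()]//g' | tr -s ' ' > kallacha_norm.txt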
For acoustic modeling we need audio/speech data; in other words, speech is the primary
input to the recognizer system (Yadeta, 2016). We have used the OMN (Oromia Media
Network), OBN (Oromia Broadcasting Network) and FIB (Finfine Integrated Broadcast) Afan
Oromo programs. The sources and the number of utterances collected from each are as
follows: from the Oromia Media Network studio, 4500 utterances with a length of 05:15:40;
from the Oromia Broadcast Network studio, 5600 utterances with a length of 08:15:40; from
the Finfine Integrated Broadcast (FIB) studio, 3400 utterances with a length of 03:15:55;
and directly from Afan Oromo teachers, 1500 utterances. In total, 15000 utterances were
collected from these sources, for a total duration of around 17 hours.
As can be seen from the figures above, 37% were collected from the OBN studio, 30% from the
OMN studio, 23% from FIB and 10% from Afan Oromo teachers like Fetenu Chimdesa,
Lami Gesherbo and Sena Gemechu. The collected data were sourced both from direct
recording and from the internet.
The collected data exist in both text and audio formats. The audio must be pre-processed
using the Audacity tool: to construct speech data in a format suitable for HTK (the Hidden
Markov Model Toolkit), audio preprocessing was performed on both the collected and the
self-recorded data.
Speech Segmentation: The researcher tried to find out whether there might be a tool for
automatic audio segmentation; according to (Yadeta, 2016), there is no preferred tool that
can be used for this purpose.
Therefore, sentence segmentation was performed manually by listening to the collected
audio files. Because this work concerns spontaneous speech recognition, the developed
system must understand what the speaker is saying and then convert it to text. About 15000
utterances were recorded; the recorded sentences carry attributes such as (ibsa xumuraa)
and verb (xumurtuu), and when the speaker speaks a sentence constructed from the recorded
utterances, the system forms the sentence accordingly. It recognizes children's voices and
the voices of both genders. Because HTK requires .wav, .AIFF and similar supported audio
formats, the audio files were saved accordingly.
Transcription took place with consideration of Afan Oromo grammar, consulting different
literatures on Afan Oromo. The Afan Oromo language has a glottal sound which is
represented by a single quote [ˈ] and sometimes by the letter [h] (Hinsene, 2009). Even
though this glottal sound exists in Afan Oromo, researchers have argued over classifying
it either under consonants or under vowels; most have classified the glottal sound as a
consonant, and we likewise classify it as a consonant in this study. Letters like C, CH,
DH, J, NY, PH, Q, SH and X represent sounds whose codes are non-ASCII and which HTK cannot
support (Yadeta, 2016). The glottal sound itself also cannot be represented by an ASCII
code, which posed a challenge to the researcher while making the transcription; [hh] has
been used to represent the glottal sound in this study to distinguish it from the letter
[h] (Yadeta, 2016). All consonants except [h] and the compound symbols can be either
geminated or non-geminated (Yadeta, 2016); double consonants are used to represent the
geminated form.
Table 7: Afan Oromo phones used during Transcription taken from (Yadeta, 2016)
Letters: A  AA  B  BB  C  CC  CH  D  DD  DH  E  EE  F  FF
IPA:     a  aa  b  bb  c  cc  ch  d  dd  dh  e  ee  f  ff
Letters: G  GG  H  I  II  J  JJ  K  KK  L  LL  M  MM  N
IPA:     g  gg  h  i  ii  j  jj  k  kk  l  ll  m  mm  n
Letters: NN  NY  O  OO  P  PP  PH  Q  QQ  R  RR  S  SS  SH
IPA:     nn  ny  o  oo  p  pp  ph  q  qq  r  rr  s  ss  sh
Letters: T  TT  TS  U  UU  V  W  WW  X  XX  Y  YY  Z  ˈ
IPA:     t  tt  ts  u  uu  v  w  ww  x  xx  y  yy  z  hh
In doing this, the variety of dialects, punctuation, and the rules of capitalization were
not considered.
The HTK toolkit is used for building Hidden Markov Models (HMMs). Young et al. (2006)
stated that HTK is primarily designed for building HMM-based speech processing tools,
particularly speech recognizers, and much of the functionality of HTK is built into the
library modules. Figure 2 shows the software structure of the HTK tool and its
input/output interfaces. User input/output and interaction with the operating system are
controlled by the library module HShell, and all memory management is controlled by HMem.
Figure 2: Software Architecture (Young et.al, 2006)
Math support is provided by HMath, and the signal processing operations needed for
speech analysis are in HSigP. Each of the file types required by HTK has a dedicated
interface module: HLabel provides the interface for label files, HLM for language model
files, HNet for networks and lattices, HDict for dictionaries, HVQ for Vector Quantization
(VQ) codebooks and HModel for HMM definitions. In the next sections we discuss the HTK
tools that we have used for our work as they were required (Yadeta, 2016).
a. Data Preparation Tools
Data preparation is the first step in the development of a speech recognition system. To
build a set of HMMs, a set of speech data files and their associated transcriptions are
needed, because before speech data can be used in training it must be converted into the
required and correct format with phone or word labels (Young et al. 2006). While all HTK
tools can parameterize waveforms on-the-fly, in practice we need to parameterize the data
just once, so the tool HCopy is used for this purpose. The tool HList is used to check the
contents of any speech file, and it can simply be used to check the results of the
conversion before use. In order to convert many files in one pass, HCopy can be driven by a
script (.scp) file listing each source waveform and the name of its target feature file
(see Appendix E).
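For example, a single converted file can be inspected with HList to verify the parameterization; the file name below follows the layout of Appendix E:
# print the feature vectors of one coded file for a quick sanity check
HList Voxforge/train/word1.mfc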
b. Training Tools
Defining the topology required for each HMM by writing a prototype definition is the
second step of system building. HTK allows HMMs to be built with any desired topology:
HMM definitions can be stored externally as simple text files, and hence it is possible to
edit them with any convenient text editor. The purpose of the prototype definition is only
to specify the overall characteristics and topology of the HMM. In this study all of the
phone models are initialized to be identical, with state means and variances equal to the
global speech mean and variance, because no bootstrap data is available.
c. Recognition Tools
HTK provides a single recognition tool called HVite, which uses the token passing
algorithm to perform Viterbi-based speech recognition (Young et al. 2006). HVite takes as
input a network describing the allowable word sequences, a dictionary defining how each
word is pronounced, and a set of HMMs. It operates by converting the word network to a
phone network and then attaching the appropriate HMM definition to each phone instance.
Recognition can be performed either on a list of stored speech files or on direct audio
input. In this study we have used HVite, which supports both triphones and monophones and
can run with multiple tokens to generate lattices containing multiple hypotheses; it can
also be configured to rescore lattices and perform forced alignments. The word networks
needed to drive HVite are usually either simple word loops, in which any word can follow
any other word, or directed graphs representing a finite state task grammar. In the
former case, n-gram probabilities are normally attached to the word transitions.
d. Analysis Tools
Once the HMM-based recognizer has been built, it is necessary to evaluate its performance.
This is usually done by using it to transcribe some pre-recorded test sentences and
matching the recognizer output against the correct reference transcriptions. This
comparison is performed by a tool called HResults, which uses dynamic programming to align
the two transcriptions and then count substitution, deletion and insertion errors.
CHAPTER FOUR
EXPERIMENTS AND OUTCOMES
4. Introduction
The experiments pass through five steps or phases of HTK, the first of which is data
preparation. Data preparation is the core and first step in the development of a speech
recognition system: a set of HMMs will be built from the set of speech data files and
their associated transcription files. Data preparation is accomplished through steps like
constructing the pronunciation dictionary, creating the transcription files, and coding
the audio data.
Any language has its own way of pronouncing words while reading and speaking; e.g. the way
Afan Oromo words are pronounced differs from the way English words are pronounced. So,
creating a sorted list of the words contained in the grammar, one per line with their
pronunciations, is the initial step in building the pronunciation dictionary; the
trainlist is a sorted list of the unique words appearing in the training transcription.
Lists of unique words were derived from the sentences, in different formats, using a Perl
script and a Jupyter notebook. The word lists were split into a trainlist and a testlist,
containing 14500 and 500 words respectively; in total, 15000 words were collected from
different sources.
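For instance, such a sorted unique word list can be derived with a small shell pipeline; the output file name here is an assumption:
# one word per line, drop empty lines, sort and remove duplicates
tr -s ' ' '\n' < afanoromotrainprompt.txt | grep -v '^$' | sort -u > trainlist.txt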
The words have their own pronunciations depending on the speaker. The dictionary provides
a connection between the words used in the task grammar and the acoustic models, which are
composed of sub-word units (phonetic, syllabic, etc.) (Yadeta, 2016). In this study we
recorded around fifteen thousand (15000) utterances: we manually recorded 6500 utterances
from 23 speakers, and 8500 utterances were taken from the OMN, OBN and FIB Afan Oromo
programs. As per the training dataset, when the speaker speaks a word to the machine, the
developed speech recognizer searches for the word/sentence, displays it to the speaker,
and pronounces it back to the speaker in the same way. We used Microsoft Visual Studio
2017 with full Python libraries, Anaconda, and Weka to do this.
Different pronunciations of words with the same written form have not been considered.
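For reference, HTK's HDMan tool is the usual way to derive a task dictionary from a word list and a source lexicon; this is only a sketch under assumed file names, not necessarily the exact command used in this study:
# build traindictionary.txt for the words in trainlist.txt and emit the phone list
HDMan -m -w trainlist.txt -n monophones1 -l dlog traindictionary.txt sourcelexicon.txt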
All HTK tools can parameterize waveforms on-the-fly, and HCopy is used for this purpose,
while HLEd is used to process a Master Label File (MLF). An MLF is a single file that
contains a label entry for each line in the prompts file. Since a single MLF is easier to
handle than individual label files, we preferred it for our experiment. In order to
generate the MLF file from our prompts, we used the Perl script prompts2mlf, which is
distributed with the VoxForge HTK tutorial materials.
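The resulting word level MLF has the standard HTK structure, sketched below with a prompt sentence from Appendix B and a hypothetical label file name (the final "." closes each entry):
#!MLF!#
"*/sentence1.lab"
itti
gaafatamummaa
akka
biyyaatti
nutti
kenname
milkeessuuf
.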
ii. Phone Level Transcription
After the completion of the word level transcription, the HLEd command provided by the
HTK tool is executed to expand the word level transcriptions to phone level
transcriptions. This command replaces each word by its equivalent phonemes and puts
the result in a new phone level master label file. This is done by reviewing each word
in the MLF file, looking up the phones that make up that word in the dictionary
file created earlier, and outputting the result in a file called afanoromophone.mlf,
which does not have short pauses ("sp"s) after each word's phone group.
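A typical invocation is sketched below; the edit script name mkphones0.led follows the HTK tutorial convention, and the word level MLF name is an assumption:
# expand word labels to phone labels using the dictionary, dropping sp
HLEd -l '*' -d traindictionary.txt -i afanoromophone.mlf mkphones0.led afanoromowords.mlf
where mkphones0.led contains the usual three commands:
EX
IS sil sil
DE sp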
The final stage of data preparation is to parameterize the raw speech waveforms into
sequences of feature vectors (Young et al., 2006). Because HTK is not as efficient at
processing wav files as it is with its internal format, we need to convert our audio wav
files to another format. HTK supports both FFT-based and LPC-based analysis; here we have
used Mel Frequency Cepstral Coefficients (MFCCs), which are derived from FFT-based log
spectra. We used the HCopy tool to convert our wav files to MFCC format. For doing this we
have two options: we can execute the HCopy command by hand for each of our audio (wav)
files, or we can create a file containing a list of each source audio file and the name of
the MFCC file it will be converted to, and pass that file to HCopy with the -S option.
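Using the second option with the configuration from Appendix F and the script file from Appendix E, the whole conversion runs in a single pass:
# code every listed wav file to MFCC features
HCopy -T 1 -C config_hcopy -S afanoromotraincode.scp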
4.4. Training Phase
Defining the topology required for each HMM by writing a prototype definition is the
second step of system building. HTK allows HMMs to be built with any desired topology:
HMM definitions can be stored externally as simple text files, so it is possible to edit
them with any convenient text editor. The purpose of the prototype definition is only to
specify the overall characteristics and topology of the HMM. In our study all of the phone
models are initialized to be identical, with state means and variances equal to the global
speech mean and variance, because no bootstrap data is available.
Training is the next step, undertaken after the work of preparing the required data.
Splitting our data into training and testing sets is another important step. We used 6350
training sentences and 650 testing sentences, spoken by both male and female speakers: 10
males and 5 females. The total number of constructed sentences is 7000 and the total
number of speakers is 15. This self-recorded data was added because the data collected and
retrieved from OMN, OBN and FIB are full of noise and background music; adding
self-recorded speech enabled the researcher to build the language model and the
pronunciation dictionary with a lower word error rate.
Prototype definition
The principal step of HMM model training is defining the prototype model. As Young
et al. (2006) stated, the parameters of this model are not important; its purpose is to
define the model topology. Since our recognition system is phone-based, we have trained
left-to-right HMMs with a 5-state topology (3 emitting states and two non-emitting states)
without skips. To define the topology we created the prototype file shown in Appendix G.
The HTK tool HCompV will scan a set of data files, compute the global mean and variance
and set all of the Gaussians in a given HMM to have the same mean and variance.
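A minimal invocation, assuming the configuration file from Appendix F and an output directory hmm0, is:
# flat start: write a proto with global means/variances plus the vFloors macro
HCompV -C config_hcompv -f 0.01 -m -S afanoromotrain.scp -M hmm0 proto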
Therefore, supposing that a list of all the training files is stored in afanoromotrain.scp,
this command creates a new version of our proto file in which the zero means and unit
variances are replaced by the global speech means and variances. Using the new prototype
model generated by HCompV, a Master Macro File (MMF) called hmmdefs, containing a
copy for each of the required mono-phone HMMs, is constructed manually by copying the
prototype and relabeling it for each required mono-phone, including "sil". Consequently,
the file macros contains a global options macro and the variance floor macro vFloors
generated earlier by HCompV; the global options macro simply defines the HMM parameter
kind and the vector size.
The mono-phones created are re-estimated using the embedded re-estimation tool HERest; a
representative invocation (directory names follow the HTK tutorial convention) is:
HERest -C config -I afanoromophone.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 trainmonophones0
This command loads all the models, both hmmdefs and macros, listed in the model list; the
mono-phones used here exclude the short pause (sp) model. These are then re-estimated
using the data listed in train.scp and the new model set is stored in hmm1. In the
command, the -t option sets the pruning thresholds to be used during training; the pruning
threshold is normally 250.0, and if re-estimation fails on any particular file the
threshold is increased by 150.0 and the file is reprocessed. This is repeated until either
the file is successfully processed or the pruning limit of 1000.0 is exceeded.
Fixing the silence models is done by running the HMM editor HHEd to add the extra
transitions required and to tie the sp state to the center sil state. HHEd works in a
similar way to HLEd: it applies a set of commands in a script to modify a set of HMMs. In
this case it is executed with an edit script sil.hed of the usual form:
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
Here the AT commands add transitions to the given transition matrices and the final TI
command creates a tied state called silst; the parameters of this tied state are stored in
the hmmdefs file, and within each silence model the original state parameters are replaced
by the name of this macro. The new list of mono-phones, which contains the sp model, is
used in the above HHEd command. At the end, HERest is applied again using the phone
transcriptions with sp models between words.
Realigning the Training Data: as we said earlier, our pronunciation dictionary is an
alternative pronunciation dictionary; 109 words have at least two pronunciations in our
training pronunciation dictionary. For this reason we need to realign the training data.
The basic difference between realigning the training data and the original word-to-phone
mapping performed by HLEd during data preparation is that, in the realigning operation,
all pronunciations of each word are considered and the pronunciation that best matches the
acoustic data is output. Using the phone models created before, we realigned the training
data and created new transcriptions with a single call of HVite, of the form (model
directory names are illustrative):
HVite -l '*' -o SW -C config -a -m -t 250.0 -y lab -I trainwords.mlf -i aligned.mlf -S train.scp -H hmm7/macros -H hmm7/hmmdefs traindictionary.txt trainmonophones1 > HVite_log
This command uses the HMMs created previously to transform the input word level
transcription trainwords.mlf into the new phone level transcription aligned.mlf, using the
pronunciations stored in the traindictionary.txt constructed so far. When aligning the
data it is sometimes clear that there are significant amounts of silence at the beginning
and end of some utterances; to spot this, the time-stamp information needs to be output
during the alignment, which is why we used the option -o SW in the above command. After
the new phone alignments have been created, HERest is applied again to re-estimate the HMM
set parameters.
Tied-State Tri-phones
As pointed out by (Young et al., 2006), the first stage of model refinement is usually to
convert the context independent mono-phones to context dependent tri-phones; the first
decision to make is whether or not cross-word tri-phones are to be used. If cross-word
tri-phones are used, then word boundaries in the training data can be ignored and all
mono-phone labels can be converted to tri-phones. If word internal tri-phones are to be
used, then word boundaries in
the training transcriptions must be marked. So, we have built context dependent tri-phones
for this study. Since we have prepared a set of mono-phone HMMs with the previous steps,
now we can use them to create context-dependent tri-phone HMMs. We did this in two
steps. Firstly, the mono phone transcriptions are converted to tri-phone transcriptions and
a set of triphones models are created and re-estimated. Secondly, similar acoustic states of
these triphones are tied. Tying is simply the method of making one or more HMMs share the
same set of parameters; the set of re-estimated context-dependent models can then be
created from the mono-phones. The tri-phone transcriptions have to be created first using
the HLEd tool, which enables us to generate a list of all the tri-phones for which there
is at least one example in the training data:
HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
Using the above command we created the tri-phone transcriptions in the wintri.mlf file
from the mono-phone transcriptions in the aligned.mlf file; at the same time, a list of
the tri-phones is written to the file triphones1. The edit script mktri.led used above
contains the commands:
WB sp
WB sil
TC
The two WB commands define sp and sil as word boundary symbols. These then block the
addition of context in the TC command, which converts all phones except word boundary
symbols to tri-phones. For example, sil a kk a m sp becomes sil a+kk a-kk+a kk-a+m a-m sp.
This tri-phone transcription is word internal; some bi-phones may be generated as contexts
at word boundaries, since sometimes they include only two phones.
The cloning of mono-phones into tri-phones is done with HHEd, for example (model directory
names are illustrative):
HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed trainmonophones1
where the edit script mktri.hed contains a clone command CL followed by TI commands to tie
all of the transition matrices in each tri-phone set, that is:
CL triphones1
TI T_aa {(*-aa+*,aa+*,*-aa).transP}
...
We generated the file mktri.hed using the Perl script maketrihed included in the HTK
Tutorial directory. The clone command CL takes as its argument the name of the file
containing the list of tri-phones (and bi-phones) generated above. For each model of the
form a-b+c in this list, it looks for the mono-phone b and makes a copy of it. Because the
transition matrix transP is regarded as a sub-component of each HMM, each TI command takes
as its argument the name of a macro and a list of HMM components; the list of items within
brackets is a pattern designed to match the set of tri-phones, right bi-phones and left
bi-phones for each phone.
Making Tied-State Tri-phones: now that a set of tri-phone HMMs has been prepared, with all
tri-phones in a phone set sharing the same transition matrix, we can tie them. Tying
states within tri-phone sets helps to share data and thus be able to make
robust parameter estimates. The HTK tool HHEd provides two mechanisms which allow states
to be clustered and then each cluster tied. The first is data-driven and uses a similarity
measure between states; the second uses decision trees and is based on asking questions
about the left and right contexts of each tri-phone. The decision tree attempts to find
those contexts which make the largest difference to the acoustics and which should
therefore distinguish clusters. Decision tree state tying is performed by running HHEd,
for example (directory names are illustrative):
HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1 > log
The edit script tree.hed contains the instructions regarding which contexts to examine for
possible clustering; its detailed contents are attached in Appendix H. We used the
mkclscript script, found in the RM Demo, to create the TB commands (decision tree
clustering of states) which form one part of tree.hed. Firstly, the RO command
is used to set the outlier threshold to 100.0 and load the statistics file generated at the end
of the previous step. The outlier threshold determines the minimum occupancy of any
cluster and prevents a single outlier state forming a singleton cluster just because it is
acoustically very different to all the other states. The TR command sets the trace level to
zero in preparation for loading in the questions. Each QS command loads a single question
and each question is defined by a set of contexts. For example, one QS command in the
tree.hed file defines a question called 'R_Nasal' which is true if the right context is
one of the nasals m, mm, n, nn or ny. Questions referring to both the right and left
contexts of a phone are included.
command would include every possible context which can influence the acoustic
realization of a phone, and can include any linguistic or phonetic classification which may
be relevant. For this study we have constructed the questions (QS) which are attached at
Appendix H. The set of tri-phones used so far only includes those needed to cover the
training data. The AU command takes as its argument a new list of tri-phones expanded to
include all those needed for recognition. This list can be generated, for example, by using
HDMan on the entire dictionary (not just the training dictionary), converting it to
tri-phones using the command TC and outputting a list of the distinct tri-phones to a file
using the -n option.
The -b sp option specifies that the sp phone is used as a word boundary and it is excluded
from tri-phones. The effect of the AU command is to use the decision trees to synthesize
all of the new previously unseen tri-phones in the new list. Once all state-tying has been
completed and new models synthesized, some models may share exactly the same states and
transition matrices, making them identical. The CO command is used to compact the model
set by finding all identical models and tying them together, producing a new list of
models called tiedlist. One advantage of decision tree clustering is that it allows
previously unseen tri-phones to be synthesized; to do this, the trees must be saved, and
this is done by the ST command. Finally, the models are re-estimated using HERest.
Increasing Gaussian mixtures: we constructed mono-phones and context dependent tri-phones
in the previous stages with only single Gaussian models; the next refinement is to
increase the number of Gaussian mixture components of the models constructed so far. The
Gaussian mixtures were increased up to 12 in this study in search of the best performance.
According to Young et al. (2006), in HTK the conversion from single Gaussian HMMs to
multiple mixture component HMMs is usually one of the final steps in system refinement.
The mechanism provided to do this is the HHEd MU command, which increases the number of
components in a mixture by a process called mixture splitting. We used this approach to
build a multiple mixture component system, incrementally increasing the number of
components until the desired level of performance was achieved. The MU command has the
form:
MU n itemList
where n is the new number of mixture components required and itemList defines the actual
mixture distributions to be modified. For instance, increasing the number of mixture
components in the output distributions of states 2 to 4 of all models to 2 is written as:
MU 2 {*.state[2-4].mix}
The language model used for recognition was developed using a 2000-sentence text corpus
taken from
the Kallacha Oromiyaa Afaan Oromo newspaper. Since this text corpus was small, we added
the text corpus that we prepared during transcription. The reason for doing this is to
increase the size of our data: when the size of the text data increases, the probability
of occurrence of words also increases, and this has an effect on the quality of the
language model. HTK provides language modeling tools for the development of bigram and
trigram language models. The HTK tools we used to build the bigram language model and the
word network were HLStats and HBuild, respectively; using these two different language
modeling tools was only for identifying which performs better. We also did the same for
the trigram language model.
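The two steps can be sketched as follows, assuming the word list and the word level MLF prepared earlier (file names are illustrative):
# gather bigram statistics from the training transcriptions
HLStats -b bigfn -o trainlist.txt trainwords.mlf
# build the recognition word network with the bigram probabilities attached
HBuild -n bigfn trainlist.txt wdnet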
For the recognition phase, HTK provides a single recognition tool called HVite, which uses
the token passing algorithm to perform Viterbi-based speech recognition. HVite takes as
input a network describing the allowable word sequences, a dictionary defining how each
word is pronounced, and a set of HMMs. The word networks needed to drive HVite are usually
either simple word loops, in which any word can follow any other word, or directed graphs
representing a finite state task grammar; in the former case, bigram probabilities are
normally attached to the word transitions. The task of searching for the most likely
sequence of words given the observed features extracted from the speech signal is usually
referred to as decoding or recognizing the speech signal. Decoding begins by constructing
a search graph which contains every word in the recognition vocabulary. Each word is then
replaced by the HMMs that correspond to
the sequence of sound units which make up the word. As a result, the search graph is one
large HMM, and recognition is performed by using the Viterbi algorithm to align the search
graph to the speech features derived from the utterances. Because the Viterbi algorithm is
used to find the most likely word sequence, the decoding procedure is said to be done via
Viterbi search. Suppose that the test.scp file holds a list of the coded test files; each
test file is then recognized and its transcription output to an MLF file called recout.mlf
by running the HVite decoder.
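A representative decoding command, with illustrative model directory and tuning values, is:
# recognize the coded test files against the tied-state models and word network
HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 traindictionary.txt tiedlist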
The options -p and -s set the word insertion penalty and the grammar scale factor,
respectively. The word insertion penalty is a fixed value added to each token when it
transits from the end of one word to the start of the next. The grammar scale factor is the
amount by which the language model probability is scaled before being added to each token
as it transits from the end of one word to the start of the next. Because these parameters
can have a significant effect on recognition performance, we made some modifications to
them while tuning the recognizer.
The final stage of the HTK toolkit is the analysis stage. Once the HMM-based recognizer
has been built, it is necessary to evaluate its performance. This is usually done by using
it to transcribe pre-recorded test sentences and matching the recognizer output against
the correct reference transcriptions; this comparison uses dynamic programming to align
the two transcriptions and then count substitution, deletion and insertion errors. Once
the test data has been processed by the recognizer, the next step is to analyze the
results, and the HTK tool HResults is provided for this purpose. HResults compares the
transcriptions output by HVite with the original reference transcriptions and then outputs
various statistics. HResults matches each of the recognized and reference label sequences
by performing an optimal string match using dynamic programming; once it finds the optimal
alignment, it calculates the number of substitution errors (S), deletion errors (D) and
insertion errors (I), and then outputs both sentence-level and word-level statistics.
Since HTK tools can process both individual label files and files stored in MLFs, we used
our MLF file containing the word level transcriptions of the test files, called
testref.mlf, and ran:
HResults -I testref.mlf tiedlist recout.mlf
After running the HResults command there are several results to be seen; the findings of
this study are described in the following tables.
Our experiments were conducted using different parameters for the word insertion penalty
and the grammar scale factor; in these experiments we used test data collected only from
OBN. From the experiment with increasing Gaussian mixtures, the best result was obtained
at 10 mixtures, with a WER of 65.23%; hence the performance of the developed spontaneous
speech recognizer was still small, since a high word error rate shows weak performance
even with increased Gaussian mixtures. Therefore we tried to build a language model using
HTK to increase the performance of our work. The result of the language model experiment
shows greater precision than increasing the Gaussian mixtures. Experiments were run on
both training and test data. The performance gained from the experiment with the trigram
language model built by HTK is as follows.
As seen in Table 9, the word error rates for tri-phones and mono-phones are 36.45% and
50.23%, respectively. A lower WER means better performance, and the best result was
achieved by building the language model with HTK. Even though we increased the Gaussian
mixtures and built a language model, we used a minimal data size, which may limit the
performance of the developed speech recognizer.
4.7. Challenges
Data collection of audio files from broadcasting corporations and other social media is
tiresome, because no clean data suitable for this work existed before. The data collected
from social media have background music and other noisy sounds which are not suitable for
the work, but they were transcribed manually. We used a hybrid model (Hidden Markov Model,
Gaussian Mixture Model and Artificial Neural Network) for the recognizer: HMM is commonly
used as a model for discrete and continuous speech recognition, but since we set out to
develop spontaneous speech recognition for Afan Oromo, we used the three models together
in order to obtain a successful Afan Oromo Spontaneous Speech Recognizer. Time and budget
constraints were also a great headache while working on this study. From the obtained
results, the word error rate was 36.45%, which is less than the word accuracy; the word
accuracy was maximized because of the models we used during the work. We searched for the
optimal settings to reach the highest performance using different approaches, like
increasing the Gaussian mixtures and tuning the word insertion penalty and grammar scale
factor parameters. For the language model we used the hidden Markov model for discrete and
continuous recognition; from that we tried to transform the continuous speech with the
Artificial Neural Network and upgrade the performance to this level. We used the Gaussian
Mixture Model because it efficiently imputes missing values, and we used the artificial
neural network to understand and model the variability of spontaneous speech.
CHAPTER FIVE
5.1. Conclusion
In this study, the possibility of speaker independent spontaneous speech recognition for
Afan Oromo has been explored. To conduct this, we thoroughly reviewed many related works
on spontaneous speech recognition and its development, as well as other literature on Afan
Oromo speech recognition covering different types of speech recognition. In this study we
used a hybrid model together with an acoustic model and a language model.
For this study, among the approaches to ASR systems the stochastic approach was used and
HMM was implemented for modeling; the HTK tool and the other tools compatible with our
preferred approach and modeling technique were applied accordingly. Throughout this work
we used Visual Studio 2017 and the other tools listed in Appendix A.
The spontaneous speech recognizer for Afan Oromo was developed using a medium data size
from 10 male and 5 female Afan Oromo native speakers, amounting to 12 hours and 20 minutes
of training speech. The test data was prepared with 500 utterances consisting of 1200
unique words from 15 speakers, of whom 9 were involved in training and 6 were not. The
language model used was a trigram language model developed using HLStats, and the word
network was built from it using HBuild.
Non-speech events (dis-fluencies), which are the main distinguishers of spontaneous speech
from read speech, occur in both the training and test data. Because of their direct
influence on the performance of the recognizer, rather than treating them merely as
silence we modeled them explicitly.
The training was done by first modeling context independent mono-phones and re-estimating
the models using the HERest tool. In order to improve our recognizer's accuracy we refined
our models; as a result we developed context dependent tri-phones, both cross-word and
word internal, from the mono-phones and re-estimated the tri-phone models.
During the course of this study, the recognizers developed using different acoustic models
have been tested using the test data we prepared for this purpose. Tuning the parameters
of the decoder at recognition time brought different recognition accuracies; some of these
parameters are the word insertion penalty (p), the grammar scale factor (s) and the
pruning level.
The spontaneous speech recognizer for Afan Oromo was developed with a speech database
containing 14500 utterances for training the acoustic model. The language model was
developed using 120MB of text data and a 30MB audio dataset, and the resulting word error
rate was 36.45%. Generally, the achieved spontaneous speech recognizer understands only
what is covered by the developed language model and the pronunciation dictionary of 15000
words; this study can serve as a starting point for further work.
5.2. Recommendations
From what we have learned throughout the course of this research and from the experimental
results, we would like to forward recommendations that further researchers can pursue.
There was no spontaneous speech corpus prepared before this study that could be used for
it; therefore, we prepared one from scratch in both audio and text formats.
The speech recognizer was developed with the concept of spontaneous speech recognition,
which is almost natural speech, and if the recognizer is developed in full it will enable
Afan Oromo native speakers to speak to machines. Afan Oromo has many more possible
utterances than the 15000 used here, so using more utterances can increase the performance
and accuracy. Since Automatic Speech Recognition with low accuracy is difficult to use in
the real world, it is mandatory to work on increasing the data size and improving the
recognizer's accuracy. As is known, the target of speech technology is to operate machines
with the words we speak, and today text-to-speech (read speech), speech-to-text (write
speech) and other speech technologies are still under development.
In this study we did our experiments using a trigram language model. In addition to
increasing the data size, and depending on the size of the data at hand, much can be done
with the language model, for example increasing it to a 4-gram.
The speech data used for this study is sparse and diversified, coming from different
speakers. Consequently, in order to handle the variability among speakers, investigating
speaker adaptation is one of the issues to be addressed for recognizer development;
conducting research in this area is an important way to improve the accuracy of the
recognizer. One of the difficult features of spontaneous speech is non-speech events
(dis-fluencies); handling non-speech events therefore plays a big role in recognizer
performance. We dealt with some issues to handle them in general, but it is still possible
to handle them further by applying different techniques: one task that can be done to
handle the effect of these non-speech events is to include them in, or remove them from,
the acoustic, lexical and language models, by studying their nature in detail and
separately. Besides increasing the size of the data, performance can be improved by
working on dialect and accent identification and recognition, because Afan Oromo words can
be pronounced in the same way but defined in different ways. E.g. 'Bukkee', which means
beside, is 'cina' or 'bira' for Afan Oromo speakers around Horo Guduru Wolega, East
Wolega, West and Kelem Wolega, and it is 'Maseena' around the West and East Hararghe
zones. We have not focused on dialect and accent identification and recognition, so we
recommend that further researchers work on the same topic with increased data size,
higher-order n-gram language models (4-gram, 5-gram, since we used trigram), and dialect
identification in Afan Oromo.
References
Bantegize. (2015). BRANA: Application of Amharic Speech Recognition System for Dictation
in Judicial Domain. MSc Thesis, Gondar University, Gondar.
Carnegie Mellon University. (2013). Understanding Spontaneous Speech. Wayne Ward,
Carnegie Mellon University Computer Science Department, Pittsburgh, PA 15213.
Gamta, T. (1993). Qubee Afaan Oromoo: Reasons for Choosing the Latin Script for
Developing an Oromo Alphabet. The Journal of Oromo Studies, 1(1), p. 36.
Makuria, H. (2009). Elellee Conversation: Afaan Oromo Writing System. Addis Ababa,
Ethiopia: Commercial Printing E., pp. 23-39.
Shriberg, B. E. (2012). Speech Technology and Research Laboratory, SRI International,
Menlo Park, CA 94025, USA; International Computer Science Institute, Berkeley, CA 94704,
USA.
South Africa. (2006). Artificial Neural Network. School of Electrical and Information
Engineering, University of the Witwatersrand, Johannesburg.
Zhang. (1998). A Fault Detection and Diagnosis Approach Based on Hidden Markov Chain
Model. Proc. of the American Control Conference, Philadelphia, pp. 2012-2016.
Appendices
Appendix A: Tools used
1  Visual Studio 2017   For the installation of HTK and writing speech recognition code
2  HTK 3.4.1
5  Notepad++
Appendix B: Samples of prompts afanoromotrainprompt.txt and afanoromotestprompt.txt
afanoromotrainprompt.txt file
itti gaafatamummaa akka biyyaatti nutti kenname milkeessuuf
Afaanoromotestprompts.txt file
cimsanii dubbatan.
jedhaniiru.
rakkoo hiikan irratti xiyyeeffachuun murteessaa ta’uu
eeraniiru.
Appendix C: Samples from the pronunciation dictionary
mana [mana] ma na
danaa [danaa] da na
gahee [Gahee] ga h ee
gatii [gatii] ga t ii
lammii [lammii] la m mii
kaleessa [kaleessa] ka le es sa
Appendix D: Words in the pronunciation dictionary with their number of alternative pronunciations
47 Bartee 2
48 Cina 2
49 Cinqii 2
50 Ciiggahuu 2
51 Cunqursaa 2
52 Eeruu 2
53 Ga’aa 2
54 Tarreessaa 2
55 Booji’uu 2
56 Lakkaa’uu 2
57 Gurmuu 2
58 Salgaffaa 2
59 Jahaffaa 2
60 Galgalaa 2
61 Ta’a 2
62 Ta’ullee 2
63 To’achuu 2
64 Tasgabbaa’e 2
65 Xaa’oo 2
66 Haara’umsa 2
67 Danda’a 2
68 Buqqa’uu 2
69 Daldaluu 2
70 Xiinxaluu 2
71 Durduuba 2
Appendix E: Sample of coding for afanoromotraincode.scp and afanoromotest.scp code.
Voxforge/train/word1.wav Voxforge/train/word1.mfc
Voxforge/train/word2.wav Voxforge/train/word2.mfc
Voxforge/train/word3.wav Voxforge/train/word3.mfc
Voxforge/train/word4.wav Voxforge/train/word4.mfc
Voxforge/train/word5.wav Voxforge/train/word5.mfc
Voxforge/train/word6.wav Voxforge/train/word6.mfc
Voxforge/train/word7.wav Voxforge/train/word7.mfc
Appendix F: Configuration Tools
# PATH needs CR
as_cr_letters='abcdefghijklmnopqrstuvwxyz'
as_cr_LETTERS='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
as_cr_compoundletter='CH,DH,NY,PH,SH,TS,ZY'
as_cr_glottal='h'
as_cr_Letters=$as_cr_letters$as_cr_LETTERS$as_cr_compoundletter
as_cr_digits='0123456789'
as_cr_alnum=$as_cr_Letters$as_cr_digits
chmod +x conf$$.sh
PATH_SEPARATOR=';'
else
PATH_SEPARATOR=:
fi
rm -f conf$$.sh
fi
as_unset=unset
else
as_unset=false
fi
# IFS
# We need space, tab and new line, in precisely that order. Quoting is
# (If _AS_PATH_WALK were called with IFS unset, it would disable word
as_nl='
'
case $0 in
*[\\/]* ) as_myself=$0 ;;
*) as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
do
IFS=$as_save_IFS
done
IFS=$as_save_IFS
;;
esac
# We did not find ourselves, most probably we were run as `sh COMMAND'
if test "x$as_myself" = x; then
as_myself=$0
fi
echo "$as_myself: error: cannot find myself; rerun with an absolute file name"
>&2
fi
done
PS1='$ '
PS2='> '
PS4='+ '
# NLS nuisances.
for as_var in \
LC_TELEPHONE LC_TIME
do
else
fi
done
config_hcopy
TARGETKIND = MFCC_0_D_A
SOURCEFORMAT = WAV
TARGETFORMAT = HTK
SOURCERATE = 625
TARGETRATE = 100000.0
SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
Config_hcompv
TARGETKIND = MFCC_0_D_A
SOURCEFORMAT = HTK
SOURCERATE = 625
TARGETRATE = 100000.0
SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
Config_hvite
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED= TRUE
SAVEWITHCRC = TRUE
WINDOWSIZE = 250000.0
USEHAMMING = TRUE
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = FALSE
SOURCEFORMAT = HTK
USESILDET = TRUE
MEASURESIL = FALSE
OUTSILWARN = TRUE
MICIN= TRUE
Appendix G: Prototype of Proto files
~o <VecSize> 25 <MFCC_0_D_N_Z>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4
<Mean> 25
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 25
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
<EndHMM>
RO 100 "stats"
TR 0
QS "R_NonBoundary" { *+* }
QS "R_Silence" { *+sil }
QS "R_Stop" { *+p,*+ph,*+d,*+b,*+t,*+j,*+dh,*+k,*+q,*+g,*+ch,*+x,*+c, }
QS "R_Nasal" { *+m,*+mm,*+n,*+nn,*+ny }
QS "R_Fricative" { *+s,*+ss,*+sh,*+z,*+f,*+ff,*+hh,*+v }
QS "R_Vowel" { *+a,*+aa,*+e,*+ee,*+i,*+ii,*+o,*+oo,*+u,*+uu }
QS "R_C-Front" { *+p,*+ph,*+b,*+m,*+*+f,*+v,*+w }
QS "R_C-Central" { *+t,*+dd,*+d,*+dh,*+x,*+n,*+s,*+z,*+r }
QS "R_C-Back" { *+sh,*+ch,*+j,*+jj,*+y,*+k,*+kk,*+g,*+ny,hh }
QS "R_V-Front" { *+i,*+ii,*+e,*+ee }
QS "R_V-Central" { *+a }
QS "R_V-Back" { *+u,*+aa,*+oo,*+uu,*+o }
QS "R_Unvoiced-Cons" { *+p,*+tt,*+k,*+kk,*+ch,*+f,*+ff,*+s,*+ss,*+sh }
QS "R_Voiced-Cons" { *+j,*+b,*+bb,*+d,*+dd,*+dh,*+g,*+gg,*+v,*+z }
QS "R_Long" { *+aa,*+ee,*+ii,*+oo,*+uu }
QS "R_Short" { *+a,*+e,*+i,*+o,*+u }
QS "R_IVowel" { *+i,*+ii }
QS "R_EVowel" { *+e,*+ee }
QS "R_AVowel" { *+a,*+aa }
QS "R_OVowel" { *+o,*+oo, }
QS "R_UVowel" { *+u,*+uu }
QS "R_Voiced-Stop" { *+b,*+bb,*+d,*+dd,*+g,*+gg,*+j,*+jj }
QS "R_Unvoiced-Stop" { *+p,*+pp,*+t,*+tt,*+k,*+kk,*+ch }
QS "R_Voiced-Fric" { *+z,*+v }
QS "R_Unvoiced-Fric" { *+s,*+sh,*+th,*+f,*+ch }
QS "R_Front-Fric" { *+f,*+ff,*+v }
QS "R_Central-Fric" { *+s,*+ss,*+z }
QS "R_Back-Fric" { *+sh,*+ch }
QS "R_a" { *+a }
QS "R_aa" { *+aa }
QS "R_b" { *+b }
QS "R_bb" { *+bb }
QS "R_c" { *+c }
QS "R_cc" { *+cc }
QS "R_ch" { *+ch }
QS "R_d" { *+d }
QS "R_dd" { *+dd }
QS "R_dh" { *+dh }
QS "R_e" { *+e }
QS "R_ee" { *+ee }
QS "R_f" { *+f }
QS "R_ff" { *+ff }
QS "R_g" { *+g }
QS "R_gg" { *+gg }
QS "R_h" { *+h }
QS "R_hh" { *+hh }
QS "R_i" { *+i }
QS "R_ii" { *+ii }
QS "R_j" { *+j }
QS "R_jj" { *+jj }
QS "R_k" { *+k }
QS "R_kk" { *+kk }
QS "R_l" { *+l }
QS "R_ll" { *+ll }
QS "R_m" { *+m }
QS "R_mm" { *+mm }
QS "R_n" { *+n }
QS "R_nn" { *+nn }
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
QS "L_NonBoundary" { *-* }
QS "L_Silence" { sil-* }
QS "L_Stop" { p-*,ph-*,d-*,b-*,t-*,j-*,dh-*,k-*,q-*,g-*,ch-*,x-*,c-* }
QS "L_Nasal" { m-*,mm-*,n-*,nn-*,ny-* }
QS "L_Fricative" { s-*,ss-*,sh-*,z-*,f-*,ff-*,hh-*,v-* }
QS "L_Vowel" { a-*,aa-*,e-*,ee-*,i-*,ii-*,o-*,oo-*,u-*,uu-* }
QS "L_C-Front" { p-*,ph-*,b-*,m-*,f-*,v-*,w-* }
QS "L_C-Central" { t-*,dd-*,d-*,dh-*,x-*,n-*,s-*,z-*,r-* }
QS "L_C-Back" { sh-*,ch-*,j-*,jj-*,y-*,k-*,kk-*,g-*,ny-*,hh-* }
QS "L_V-Front" { i-*,ii-*,e-*,ee-* }
QS "L_V-Central" { a-* }
QS "L_V-Back" { u-*,aa-*,oo-*,uu-*,o-* }
QS "L_Unvoiced-Cons" { p-*,t-*,k-*,kk-*,ch-*,f-*,ff-*,s-*,ss-*,sh-* }
QS "L_Voiced-Cons" { j-*,b-*,bb-*,d-*,dd-*,dh-*,g-*,gg-*,v-*,z-* }
QS "L_Long" { aa-*,ee-*,ii-*,oo-*,uu-* }
QS "L_Short" { a-*,e-*,i-*,o-*,u-* }
QS "L_IVowel" { i-*,ii-* }
QS "L_EVowel" { e-*,ee-* }
QS "L_AVowel" { a-*,aa-* }
QS "L_OVowel" { o-*,oo-* }
QS "L_UVowel" { u-*,uu-* }
QS "L_Voiced-Stop" { b-*,bb-*,d-*,dd-*,g-*,gg-*,j-*,jj-* }
QS "L_Unvoiced-Stop" { p-*,pp-*,t-*,tt-*,k-*,kk-*,ch-* }
QS "L_Voiced-Fric" { z-*,v-* }
QS "L_Unvoiced-Fric" { s-*,sh-*,th-*,f-*,ch-* }
QS "L_Front-Fric" { f-*,ff-*,v-* }
QS "L_Central-Fric" { s-*,ss-*,z-* }
QS "L_Back-Fric" { sh-*,ch-* }
QS "L_a" { a-* }
QS "L_aa" { aa-* }
QS "L_b" { b-* }
QS "L_bb" { bb-* }
QS "L_c" { c-* }
QS "L_cc" { cc-* }
QS "L_ch" { ch-* }
QS "L_d" { d-* }
QS "L_dd" { dd-* }
QS "L_dh" { dh-* }
QS "L_e" { e-* }
QS "L_ee" { ee-* }
QS "L_f" { f-* }
QS "L_ff" { ff-* }
QS "L_g" { g-* }
QS "L_gg" { gg-* }
QS "L_h" { h-* }
QS "L_hh" { hh-* }
QS "L_i" { i-* }
QS "L_ii" { ii-* }
QS "L_j" { j-* }
QS "L_jj" { jj-* }
QS "L_k" { k-* }
QS "L_kk" { kk-* }
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
TR 2
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
TR 1
AU "fulllist"
CO "tiedlist"
ST "trees"