Tamil Morphological Analysis

The document summarizes a student project on developing an efficient rule-based system for morphological parsing of the Tamil language. It discusses the challenges of agglutination and inflections in Tamil morphology. The proposed solution uses a rule-based approach to analyze word inflections according to Tamil grammar rules, combined with a machine learning approach to resolve conflicts and optimize the analysis of recurring inflectional patterns. The project aims to enable downstream applications for Tamil like machine translation by performing accurate morphological parsing of text.

Uploaded by

Karthik Sankar
Copyright
© Attribution Non-Commercial (BY-NC)

An Efficient Rule-Based System for Morphological Parsing of Tamil Language


தமிழ் உருபனியல் ஆய்வு (Tamil Morphological Analysis)

Final Semester Project


Department of Computer Science and Engineering
National Institute of Technology, Tiruchirappalli

May 2010

STUDENTS:
Karthik S 106106029
Praveen Kumar 106106045
Venkataraman GB 106106073

GUIDE:
Dr. V. Gopalakrishnan
Agenda
• Overview of the Project
• NLP Applications – The Stakeholders
• The problem at hand
• The proposed solution
◦ Rule-Based Morphological Analysis
◦ Machine Learning
• Where does it all fit in?
• Need for Tamil Morphological Analysis
• Resources Obtained
• Implementation Details
• Demonstration
• Future Scope

1 WHO WHAT WHERE WHY HOW


12/08/2021 National Institute of Technology, Tiruchirappalli
Overview of the Project
• Natural Language Processing
• Morphological Analysis
• Tamil Language

Morphing … and in Tamil:
நடப்பான் (he will walk), நடக்கின்றான் (he walks), நடக்கின்றாள் (she walks), நடந்தான் (he walked), நடந்தனர் (they walked)

NLP Applications – The Stakeholders

WHO ARE THE STAKEHOLDERS?

Natural Language Processing applications such as:
• Stemming
• Machine Translation
• Speech Recognition
• Information Retrieval

WHY ARE THESE APPLICATIONS THE STAKEHOLDERS?

The problem at hand
Morphological analysis of Tamil involves understanding the structure of a word and its inflections.
AGGLUTINATION IN TAMIL
• Agglutination is the morphological process of adding affixes to the base of a word.
• A typical Tamil verb form carries a number of suffixes that mark person, number, mood, tense and voice.
INFLECTIONS IN TAMIL
• திணை - Class
• பால் - Gender
• எண் - Number
• இடம் - Person
• காலம் - Tense

 Example: vAḷntukkoṇṭiruntēṉ: [வாழ்ந்துகொண்டிருந்தேன்]

◦ vAḷ - வாழ் - root - live
◦ intu - ந்து - voice marker - past tense object voice
◦ koṇṭu - கொண்டு - tense marker - during
◦ irunta - இருந்த - aspect marker - past progressive
◦ ēn - ஏன் - person marker - first person, singular

The proposed solution
A word is represented at two levels: the surface level and the lexical level.
At the surface level, a word appears in its original orthographic form. At the
lexical level, a word is represented by all of its functional components.

SURFACE LEVEL LEXICAL LEVEL

RULE-BASED MORPHOLOGICAL ANALYSIS

Word inflections are analysed using the rules specified in Tamil grammar texts: நன்னூல் (Nannool) and தொல்காப்பியம் (Tholkaapiyam). The verse below lists verb suffixes (விகுதி):

அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர்ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே

The proposed solution
MACHINE LEARNING APPROACH
1. While checking for suffixes in a given word, the rules, if followed strictly, may admit more than one suffix, but only one of them is semantically valid.
விகுதி (suffix): படித்து – "உ"; படித்தது – "து" or "உ"?
The M/L approach lets the system "learn" the correct parse for a word, so that when the same word is processed again the wrong possibilities are eliminated automatically.

2. Two words might share the same inflectional part, for example நடக்கின்றான் and படிக்கின்றான்.
The system learns the inflectional part of every word. This optimizes the analysis, because the second word need not be analysed again from scratch.
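
The reuse of a learnt inflectional part can be illustrated with a minimal Java sketch. It assumes the learnt inflections live in an in-memory map; the class `InflectionMemory`, its methods, the romanized word forms and the tag string are all hypothetical, and plain memoization stands in here for whatever learning component the project actually used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: remembers the analysis of each inflectional part
// so that a second word with the same inflection skips the rule engine.
final class InflectionMemory {
    private final Map<String, String> learned = new HashMap<>();
    private final Function<String, String> ruleAnalyser; // assumed rule-based analyser
    private int ruleCalls = 0;

    InflectionMemory(Function<String, String> ruleAnalyser) {
        this.ruleAnalyser = ruleAnalyser;
    }

    // Splits off the known root and analyses only the remaining inflection.
    String analyse(String word, String root) {
        if (!word.startsWith(root)) {
            throw new IllegalArgumentException("word does not start with root");
        }
        String inflection = word.substring(root.length());
        String cached = learned.get(inflection);
        if (cached != null) {
            return cached; // learnt earlier, no rule analysis needed
        }
        ruleCalls++;
        String result = ruleAnalyser.apply(inflection);
        learned.put(inflection, result);
        return result;
    }

    int ruleCalls() {
        return ruleCalls;
    }

    public static void main(String[] args) {
        // The lambda is a stand-in for the real rule engine.
        InflectionMemory memory = new InflectionMemory(s -> "PRESENT+3SM");
        memory.analyse("nadakkinraan", "nada");
        memory.analyse("padikkinraan", "padi");
        System.out.println("rule analyser calls: " + memory.ruleCalls());
    }
}
```

Here "nadakkinraan" and "padikkinraan" are assumed romanizations of நடக்கின்றான் and படிக்கின்றான். Both share the inflection "kkinraan", so the rule analyser runs only for the first word.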

Where does it all fit in?

Characters: ப டி த் தா ன்
Word Tokenization: படித்தான்
Morphological Analysis: படி - த்த் - ஆன்
Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான் (He read the book)
Semantic Analysis: Meaning of the sentence?

Need for Tamil Morphological Analysis
ENGLISH vs. TAMIL
I came - நான் வந்தேன்
You came - நீ வந்தாய்
They came - அவர்கள் வந்தனர்
He came - அவன் வந்தான்
She came - அவள் வந்தாள்

TRANSLATION AND SEMANTIC ANALYSIS

அவன் மதுரைக்கு வந்தாள் -- Semantically wrong: the subject அவன் (he) is masculine, but the verb வந்தாள் (she came) carries the feminine suffix.

To check the semantic correctness of a sentence, morphological analysis is needed.

How should the above sentence be translated?

Resources Obtained
EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
• EMILLE: Enabling Minority Language Engineering
• Collaborative venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
• Distributed by the European Language Resources Association [ELRA]
TAMIL WORDNET
• The database is a semantic dictionary designed as a lexical network
• Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
• Tamil Wordnet resembles a traditional dictionary. It also contains valuable information about morphologically related words

Implementation Details - 1
[Flowchart. Recoverable node labels: Input Tamil Word; Classify and Remove Inflection of inflections; Backward Scanning; C-V Segmentation; Check in DB; Root verb? (Yes / No); Conflict Resolution; Machine Learning; Output]
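
A minimal Java sketch of the backward scan, under stated assumptions: suffixes are matched from the end of a romanized word against two lookup tables, and the remainder is accepted only if it is a known root. The class `BackwardScanner`, the table contents and the romanization are hypothetical; the project's own scanner works on Tamil text and a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of backward scanning: strip a vigudhi (final suffix),
// then an idainilai (medial marker), then look the remainder up as a root.
final class BackwardScanner {
    // Returns every analysis the tables allow. More than one result
    // is a conflict that a later resolution step would have to settle.
    static List<List<String>> parse(String word,
                                    Map<String, String> vigudhi,
                                    Map<String, String> idainilai,
                                    Set<String> roots) {
        List<List<String>> analyses = new ArrayList<>();
        for (Map.Entry<String, String> v : vigudhi.entrySet()) {
            if (!word.endsWith(v.getKey())) {
                continue;
            }
            String stem = word.substring(0, word.length() - v.getKey().length());
            for (Map.Entry<String, String> t : idainilai.entrySet()) {
                if (!stem.endsWith(t.getKey())) {
                    continue;
                }
                String root = stem.substring(0, stem.length() - t.getKey().length());
                if (roots.contains(root)) {
                    analyses.add(Arrays.asList(
                            root + "<VERB_ROOT>",
                            t.getKey() + "<" + t.getValue() + ">",
                            v.getKey() + "<" + v.getValue() + ">"));
                }
            }
        }
        return analyses;
    }
}
```

With the assumed entries "aan" and "an" tagged 3SM, "tt" tagged PAST TENSE, and "padi" as a root, the romanized word "padittaan" (படித்தான்) yields the single analysis padi, tt, aan. The candidate "an" is rejected because the remaining stem "paditta" ends in no known medial marker.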

Implementation Details - 2
படித்தான்

ப டி த் தா ன்

ப் - அ ட் - இ த் த் - ஆ ன்

ப் அ ட் இ த் த் ஆ ன்

படி < VERB_ROOT >


த்த் < PAST TENSE >
ஆன் < 3SM >
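
The consonant-vowel (C-V) split shown above can be sketched in Java as follows. This is an illustration under assumptions, not the project's code: the class `CvSegmenter` is hypothetical, the consonant test is a simple code point range check, and the input is normalized to NFC so that two-part vowel signs arrive as single code points.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits a Tamil word into pure consonants
// (consonant + pulli) and independent vowels.
final class CvSegmenter {
    private static final char PULLI = '\u0BCD';      // Tamil sign virama
    private static final char INHERENT_A = '\u0B85'; // Tamil letter A

    // Rough range check; the range also spans a few unassigned code points.
    private static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';
    }

    // Maps a dependent vowel sign to its independent vowel, or 0 if none.
    private static char vowelForSign(char sign) {
        switch (sign) {
            case '\u0BBE': return '\u0B86'; // AA
            case '\u0BBF': return '\u0B87'; // I
            case '\u0BC0': return '\u0B88'; // II
            case '\u0BC1': return '\u0B89'; // U
            case '\u0BC2': return '\u0B8A'; // UU
            case '\u0BC6': return '\u0B8E'; // E
            case '\u0BC7': return '\u0B8F'; // EE
            case '\u0BC8': return '\u0B90'; // AI
            case '\u0BCA': return '\u0B92'; // O
            case '\u0BCB': return '\u0B93'; // OO
            case '\u0BCC': return '\u0B94'; // AU
            default: return 0;
        }
    }

    static List<String> segment(String word) {
        String s = Normalizer.normalize(word, Normalizer.Form.NFC);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (!isConsonant(c)) {          // independent vowel or other symbol
                out.add(String.valueOf(c));
                i++;
                continue;
            }
            out.add("" + c + PULLI);        // emit the pure consonant
            char next = i + 1 < s.length() ? s.charAt(i + 1) : 0;
            if (next == PULLI) {
                i += 2;                     // consonant already carried a pulli
            } else if (vowelForSign(next) != 0) {
                out.add(String.valueOf(vowelForSign(next)));
                i += 2;                     // explicit vowel sign
            } else {
                out.add(String.valueOf(INHERENT_A));
                i++;                        // bare consonant: inherent vowel
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The escaped string spells the example word from the slide.
        System.out.println(segment("\u0BAA\u0B9F\u0BBF\u0BA4\u0BCD\u0BA4\u0BBE\u0BA9\u0BCD"));
    }
}
```

For படித்தான் the sketch returns the eight units ப், அ, ட், இ, த், த், ஆ, ன், matching the fourth row of the example. Grouping these units into படி, த்த் and ஆன் is left to the suffix analysis.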

Implementation Details - 3
UNICODE SUPPORT FOR TAMIL
• The Tamil block in Unicode spans U+0B80 – U+0BFF
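
In Java, membership in this block can be tested with the standard `Character.UnicodeBlock` class. The helper class `TamilScript` and its method names below are hypothetical, offered as a small sketch of input validation rather than as the project's code.

```java
// Hypothetical helper: checks whether text lies in the Tamil Unicode block.
final class TamilScript {
    static boolean isTamil(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.TAMIL;
    }

    // True only for a non-empty word made entirely of Tamil-block code points.
    static boolean isTamilWord(String word) {
        return !word.isEmpty() && word.codePoints().allMatch(TamilScript::isTamil);
    }

    public static void main(String[] args) {
        System.out.println(isTamilWord("\u0BAA\u0B9F\u0BBF")); // Tamil letters PA, TTA, sign I
        System.out.println(isTamilWord("abc"));
    }
}
```

A check like this could reject non-Tamil input before any morphological rule is applied.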

GOOGLE TAMIL TRANSLITERATOR IME (Input Method)

• Google Transliteration IME is an input method editor that allows users to enter Tamil text using a Roman keyboard

PROGRAMMING LANGUAGE
• Java

DATABASES
• MySQL databases, accessed through JDBC

Implementation Details - 4
TRANSLITERATION MODULE
• A simple transliterator module converts Tamil script to the Roman (English) alphabet and vice versa
• Example:
◦ அ - a
◦ ஆ - aa
◦ க - ka
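
A table-driven transliterator along these lines might look like the Java sketch below. It holds only the three pairs listed above and handles nothing else (no vowel signs or pulli); the class `Transliterator` and the greedy longest-match strategy for the reverse direction are assumptions, not the project's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a table-driven transliterator.
final class Transliterator {
    private static final Map<String, String> TAMIL_TO_LATIN = new LinkedHashMap<>();
    private static final Map<String, String> LATIN_TO_TAMIL = new LinkedHashMap<>();
    private static int maxLatinKey = 0;

    private static void pair(String tamil, String latin) {
        TAMIL_TO_LATIN.put(tamil, latin);
        LATIN_TO_TAMIL.put(latin, tamil);
        maxLatinKey = Math.max(maxLatinKey, latin.length());
    }

    static {
        pair("\u0B85", "a");  // Tamil letter A
        pair("\u0B86", "aa"); // Tamil letter AA
        pair("\u0B95", "ka"); // Tamil letter KA
    }

    // Tamil to Roman: one table lookup per character, unknown characters pass through.
    static String toLatin(String tamil) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tamil.length(); i++) {
            String ch = String.valueOf(tamil.charAt(i));
            out.append(TAMIL_TO_LATIN.getOrDefault(ch, ch));
        }
        return out.toString();
    }

    // Roman to Tamil: greedy longest match, so "aa" wins over "a".
    static String toTamil(String latin) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < latin.length()) {
            boolean matched = false;
            for (int len = Math.min(maxLatinKey, latin.length() - i); len > 0; len--) {
                String tamil = LATIN_TO_TAMIL.get(latin.substring(i, i + len));
                if (tamil != null) {
                    out.append(tamil);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(latin.charAt(i)); // unknown input passes through
                i++;
            }
        }
        return out.toString();
    }
}
```

Greedy matching makes the reverse direction ambiguous in some cases: the Roman string "aaa" could be அ followed by ஆ or the other way round, and the sketch always picks the longest match first.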

HASH TABLE GENERATOR

• The application uses two data files (workbooks) containing lists of vigudhi (விகுதி, suffixes) and idainilai (இடைநிலை, medial markers).
• The Java hash generator code loads the data from the workbooks, adds it to a hash table, serializes the table and writes it to an external data file, which can be loaded whenever the application requires access.
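
The serialize-and-reload step can be sketched with standard Java object serialization. The class `SuffixTableStore`, the method names and the choice of a `HashMap<String, String>` from suffix to tag are assumptions; reading the workbooks themselves is omitted.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical sketch: persists a suffix table and reads it back.
final class SuffixTableStore {
    // Writes the whole table to the given file in one serialized object.
    static void save(HashMap<String, String> table, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(table);
        }
    }

    // Reads the table back; the cast is unchecked because of type erasure.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

Serializing once and loading the file at start-up avoids re-parsing the source data on every run, which appears to be the motivation described above.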

Future Scope
• The algorithm can be extended to cover nouns and noun forms too.

• The algorithm can be improved to incorporate stricter rules so as to reduce conflicts that arise in the output generated by the current system.

• The algorithm can be extended for other agglutinative languages.

• The various resources obtained as a part of this project, including the EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools, can be used for further study, research and development in the field of Natural Language Processing at our college in the years to come.

References
• A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P
• Nannool and Tholkaapiyam
◦ Tamil grammar texts
• The Morphological Generator and Parsing Engine for Tamil Verb Forms.
◦ Ultimate Software Solution, Dindigul
• Morphological Analyzer for Tamil
◦ Anandan P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India.
• Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]
• Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP).
• Unsupervised Learning of the Morphology of a Natural Language.
◦ John Goldsmith. [2001]
◦ Computational Linguistics, 27(2):153–198.
• Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read in Conference at Dravidian University, Kuppam, December 26-29, 2001.
Thank you
