
CH-2 Natural Language Processing Models and Algorithms
What is a Language Model?
• A language model in Natural Language Processing (NLP) is a statistical tool that helps understand and generate human language. It predicts the probability of a sequence of words, aiding various NLP tasks like text generation, translation, and sentiment analysis.
• Language models assign a probability to a sequence of words. This helps in predicting the next word in a sentence or the likelihood of a given sentence.
Types of Language Model:
• N-gram Models: These are simple statistical models that predict the next word in a sequence based on
the previous 'n-1' words.

• Neural Language Models: These use neural networks to predict word sequences and are more
powerful than traditional statistical models.

• Recurrent Neural Networks (RNNs): These models handle sequential data by maintaining a 'memory'
of previous words in the sequence.

• Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU): These are advanced types of RNNs
designed to better capture long-range dependencies.

• Transformers: These models, like BERT and GPT, use self-attention mechanisms to process the entire sequence of words simultaneously, leading to improved performance on many NLP tasks.
Applications of Language Models:
• Text Generation: Creating coherent and contextually relevant text.
• Machine Translation: Translating text from one language to another.
• Speech Recognition: Converting spoken language into text.
• Chatbots and Conversational Agents: Facilitating human-like interactions.
• Sentiment Analysis: Understanding and categorizing emotions and opinions in text.
• Autocompletion: Predicting the next words or phrases to assist in typing.
Examples of Language Models:
• GPT (Generative Pre-trained Transformer): Developed by OpenAI, models like GPT-3 and GPT-4 generate human-like text based on the input they receive.
• BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT is designed to understand the context of words in a sentence by looking at the words that come before and after.
• T5 (Text-to-Text Transfer Transformer): Also developed by Google, T5 treats every NLP problem as a text-to-text problem, making it highly versatile.
Training Language Models:
• Language models are trained on large corpora of text data. During training,
they learn to capture linguistic patterns, grammar, context, and even world
knowledge. The training process involves:
• Tokenization: Breaking down text into individual words or subwords.
• Learning Weights: Adjusting parameters in the neural network to minimize
prediction errors.
• Fine-Tuning: Adjusting a pre-trained model to better perform on specific tasks.
N-Gram Language Models and Log Probabilities:
• We always represent and compute language model probabilities in log format, as log probabilities.
• Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes. Multiplying enough n-grams together would result in numerical underflow.
• By using log probabilities instead of raw probabilities, we get numbers that are not as small. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them.
• The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the log probability:
• p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
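A minimal Python sketch (with made-up probability values, purely for illustration) showing why summing log probabilities is preferred over multiplying raw probabilities:

```python
import math

# Hypothetical n-gram probabilities for a short sentence (illustrative values only).
probs = [0.2, 0.05, 0.1, 0.01]

# Multiplying raw probabilities shrinks quickly and can underflow for long sequences.
product = math.prod(probs)

# Summing log probabilities is numerically stable and equivalent.
log_sum = sum(math.log(p) for p in probs)

print(product)            # 1e-05
print(math.exp(log_sum))  # same value, recovered from log space
```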


What is Smoothing in NLP?
• In NLP, we have statistical models to perform tasks like auto-completion of sentences, where we use a probabilistic model. We predict the next words based on training data that contains complete sentences, so that the model can learn the patterns needed for prediction. Naturally, there is a huge number of possible word combinations, and it is next to impossible to include all of them in the training data so that the model can predict accurately on unseen data. This is where smoothing comes to the rescue.
Why do we need smoothing in NLP?
• To improve the accuracy of our model.
• To handle data sparsity and out-of-vocabulary words, i.e., words that are absent from the training set.
• Example - Training set: ["I like coding", "Prakriti likes mathematics", "She likes coding"]
• Let's consider bigrams, a group of two words.
• P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1))
• So, let's find the probability of "I like mathematics" (a worked sketch follows below).
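A small Python sketch of this worked example, computing unsmoothed and add-one (Laplace) smoothed bigram probabilities from the three training sentences above. Sentence-boundary markers are left out to keep the sketch short, which introduces a couple of spurious cross-sentence bigrams:

```python
from collections import Counter

corpus = ["I like coding", "Prakriti likes mathematics", "She likes coding"]
tokens = [w for sentence in corpus for w in sentence.split()]
vocab = set(tokens)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(prev, word):
    """Unsmoothed bigram probability P(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_laplace(prev, word):
    """Add-one (Laplace) smoothed bigram probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

# "like mathematics" never occurs in training, so the unsmoothed estimate is 0,
# which would zero out the whole sentence probability for "I like mathematics".
print(p_mle("I", "like") * p_mle("like", "mathematics"))          # 0.0
print(p_laplace("I", "like") * p_laplace("like", "mathematics"))  # 0.03125 (non-zero)
```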
Natural Language Generation:
Part of Speech Tagging (POS)
• In Natural Language Processing (NLP), part-of-speech tagging is a linguistic task in which each word in a document is assigned to a specific part of speech (verb, adjective, adverb, etc.) or grammatical category. This process helps to clarify the meaning and structure of the sentence by adding a layer of syntactic and semantic information to the words.
Part of Speech Tagging (POS):
• POS tagging is typically performed using machine learning algorithms,
which are trained on a large annotated corpus of text. The algorithm
learns to predict the correct POS tag for a given word based on the
context in which it appears.
• Text: “The cat sat on the mat.”
• POS tags:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
• POS tagging is a useful tool in natural language processing (NLP) as it allows algorithms to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings.
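A quick sketch of POS tagging with NLTK's default perceptron tagger (assuming the required NLTK resources have been downloaded; resource names can vary slightly between NLTK versions):

```python
import nltk

# One-time downloads (uncomment on first run).
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
#           ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```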
Use of Parts of Speech Tagging in NLP:
1. To understand the grammatical structure of a sentence: By labeling each word with its POS, we can better understand the syntax and structure of a sentence. This is useful for tasks such as machine translation and information extraction, where it is important to know how words relate to each other in the sentence.
2. To disambiguate words with multiple meanings: Some words, such as "bank," can have multiple meanings depending on the context in which they are used. By labeling each word with its POS, we can disambiguate these words and better understand their intended meaning.
3. To improve the accuracy of NLP tasks: POS tagging can help improve the performance of various NLP tasks, such as named entity recognition and text classification. By providing additional context and information about the words in a text, we can build more accurate and sophisticated algorithms.
4. To facilitate research in linguistics: POS tagging can also be used to study the patterns and characteristics of language use and to gain insights into the structure and function of different parts of speech.
Steps Involved in the POS tagging:
1. Collect a dataset of annotated text: This dataset will be used to train and test the POS
tagger. The text should be annotated with the correct POS tags for each word.

2. Preprocess the text: This may include tasks such as tokenization (splitting the text into
individual words), lowercasing, and removing punctuation.

3. Divide the dataset into training and testing sets: The training set will be used to train
the POS tagger, and the testing set will be used to evaluate its performance.

4. Train the POS tagger: This may involve building a statistical model, such as a hidden
Markov model (HMM), or defining a set of rules for a rule-based or transformation-
based tagger. The model or rules will be trained on the annotated text in the training
set.
5. Test the POS tagger: Use the trained model or rules to predict the POS tags of the
words in the testing set. Compare the predicted tags to the true tags and calculate
metrics such as precision and recall to evaluate the performance of the tagger.

6. Fine-tune the POS tagger: If the performance of the tagger is not satisfactory,
adjust the model or rules and repeat the training and testing process until the
desired level of accuracy is achieved.

7. Use the POS tagger: Once the tagger is trained and tested, it can be used to
perform POS tagging on new, unseen text. This may involve preprocessing the text
and inputting it into the trained model or applying the rules to the text. The output
will be the predicted POS tags for each word in the text.
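A compact sketch of steps 3 to 5 using NLTK: a simple unigram tagger trained on the tagged Penn Treebank sample shipped with NLTK (the 'treebank' resource is assumed to be downloaded); a real system would use a stronger model such as an HMM or perceptron tagger:

```python
import nltk
from nltk.corpus import treebank

# nltk.download("treebank")  # one-time download

tagged_sents = list(treebank.tagged_sents())
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Unigram tagger with a default 'NN' backoff for unseen words.
backoff = nltk.DefaultTagger("NN")
tagger = nltk.UnigramTagger(train_sents, backoff=backoff)

print("accuracy:", tagger.accuracy(test_sents))  # use .evaluate() on older NLTK
print(tagger.tag(["The", "cat", "sat", "on", "the", "mat"]))
```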
Application of POS Tagging:
• Information extraction: POS tagging can be used to identify specific types of
information in a text, such as names, locations, and organizations. This is useful for
tasks such as extracting data from news articles or building knowledge bases for
artificial intelligence systems.

• Named entity recognition: POS tagging can be used to identify and classify named
entities in a text, such as people, places, and organizations. This is useful for tasks such
as building customer profiles or identifying key figures in a news story.

• Text classification: POS tagging can be used to help classify texts into different
categories, such as spam emails or sentiment analysis. By analyzing the POS tags of the
words in a text, algorithms can better understand the content and tone of the text.
• Machine translation: POS tagging can be used to help translate texts
from one language to another by identifying the grammatical
structure and relationships between words in the source language
and mapping them to the target language.

• Natural language generation: POS tagging can be used to generate natural-sounding text by selecting appropriate words and constructing grammatically correct sentences. This is useful for tasks such as chatbots and virtual assistants.
Types of POS Tagging in NLP:
Rule-based POS Tagging:
• Rule-based part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of speech using a set of pre-defined rules. This is in contrast to machine learning-based POS tagging, which relies on training a model on a large annotated corpus of text.
• Rule-based POS taggers can be relatively simple to implement and are often used as a starting point for more complex machine learning-based taggers. However, they can be less accurate and less efficient than machine learning-based taggers, especially for tasks with large or complex datasets.
Statistical POS Tagging:
• In statistical POS tagging, a model is trained on a large annotated
corpus of text to learn the patterns and characteristics of different
parts of speech. The model uses this training data to predict the POS
tag of a given word based on the context in which it appears and the
probability of different POS tags occurring in that context.
• Statistical POS taggers can be more accurate and efficient than rule-
based taggers, especially for tasks with large or complex datasets.

• Collect a large annotated corpus of text and divide it into training and
testing sets.

• Train a statistical model on the training data, using techniques such as maximum likelihood estimation or hidden Markov models.

• Use the trained model to predict the POS tags of the words in the
testing data.
• Evaluate the performance of the model by comparing the predicted
tags to the true tags in the testing data and calculating metrics such
as precision and recall.

• Fine-tune the model and repeat the process until the desired level of
accuracy is achieved.

• Use the trained model to perform POS tagging on new, unseen text.
Transformation-based tagging (TBT):
• A set of rules is defined to transform the tags of words in a text based
on the context in which they appear. For example, a rule might
change the tag of a verb to a noun if it appears after a determiner
such as “the.” The rules are applied to the text in a specific order, and
the tags are updated after each transformation.
• TBT can be more accurate than rule-based tagging, especially for tasks with
complex grammatical structures. However, it can be more computationally
intensive and requires a larger set of rules to achieve good performance.

• Define a set of rules for transforming the tags of words in the text. For
example:

• If the word is a verb and appears after a determiner, change the tag to
“noun.”

• If the word is a noun and appears after an adjective, change the tag to
“adjective.”
• Iterate through the words in the text and apply the rules in a specific
order. For example:

• In the sentence “The cat sat on the mat,” the word “sat” would be
changed from a verb to a noun based on the first rule.

• In the sentence "The red cat sat on the mat," the word "cat" would be changed from a noun to an adjective based on the second rule.

• Output the transformed tags for each word in the text.


Hidden Markov Model POS Tagging:
• Hidden Markov models (HMMs) are a type of statistical model that
can be used for part-of-speech (POS) tagging in natural language
processing (NLP). In an HMM-based POS tagger, a model is trained
on a large annotated corpus of text to learn the patterns and
characteristics of different parts of speech. The model uses this
training data to predict the POS tag of a given word based on the
probability of different tags occurring in the context of the word.
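An illustrative (not production) sketch of HMM decoding with the Viterbi algorithm; the transition and emission probabilities below are tiny made-up numbers, whereas a real tagger would estimate them from an annotated corpus:

```python
import math

states = ["DT", "NN", "VB"]
# Toy parameters, assumed purely for illustration.
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
           "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
           "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
emit_p = {"DT": {"the": 0.9, "cat": 0.05, "sat": 0.05},
          "NN": {"the": 0.05, "cat": 0.8, "sat": 0.15},
          "VB": {"the": 0.05, "cat": 0.05, "sat": 0.9}}

def viterbi(words):
    # V[t][s] = best log-probability of any tag sequence ending in state s at time t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][words[t]]))
            back[t][s] = best_prev
    # Trace back the most probable tag sequence.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["the", "cat", "sat"]))  # expected: ['DT', 'NN', 'VB']
```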
Challenges of POS Tagging:

• Ambiguity: Some words can have multiple POS tags depending on the
context in which they appear, making it difficult to determine their
correct tag. For example, the word “bass” can be a noun (a type of
fish) or an adjective (having a low frequency or pitch).
• Out-of-vocabulary (OOV) words: Words that are not present in the
training data of a POS tagger can be difficult to tag accurately,
especially if they are rare or specific to a particular domain.

• Complex grammatical structures: Languages with complex grammatical structures, such as languages with many inflections or free word order, can be more challenging to tag accurately.
• Lack of annotated training data: Some languages or domains may
have limited annotated training data, making it difficult to train a
high-performing POS tagger.

• Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies, which can negatively impact the performance of a POS tagger.
DefaultTagger:
• Default tagging is a fundamental step in part-of-speech labelling. It is done with the DefaultTagger class, which takes a single parameter, 'tag'. NN is the tag for a singular noun. DefaultTagger is most beneficial when it comes to dealing with the most frequent part-of-speech tags. This is why a noun tag is recommended.
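A brief NLTK sketch of the DefaultTagger just described:

```python
from nltk.tag import DefaultTagger

# Assign 'NN' (singular noun) to every token, regardless of context.
tagger = DefaultTagger("NN")
print(tagger.tag(["The", "quick", "brown", "fox"]))
# [('The', 'NN'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
```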
Example of POS Tagging:
• Consider the sentence: “The quick brown fox jumps over the lazy dog.”
• After performing POS Tagging:
• “The” is tagged as determiner (DT)
• “quick” is tagged as adjective (JJ)
• “brown” is tagged as adjective (JJ)
• “fox” is tagged as noun (NN)
• “jumps” is tagged as verb (VBZ)
• “over” is tagged as preposition (IN)
• “the” is tagged as determiner (DT)
• “lazy” is tagged as adjective (JJ)
• “dog” is tagged as noun (NN)
Morphology
• Morphology is the study of the way words are built from smaller meaningful units called
morphemes.
• We can divide morphemes into two broad classes.
• Stems – the core meaningful units, the root of the word.
• Affixes – add additional meanings and grammatical functions to words.
• Affixes are further divided into:
• Prefixes – precede the stem: do / undo
• Suffixes – follow the stem: eat / eats
• Infixes – are inserted inside the stem
• Circumfixes – precede and follow the stem
• English doesn't tend to stack many affixes.
• But Turkish can have words with a lot of suffixes.
• Languages, such as Turkish, that tend to string affixes together are called agglutinative languages.



Surface and Lexical Forms
• The surface level of a word represents the actual spelling
of that word.
• geliyorum eats cats kitabım
• The lexical level of a word represents a simple concatenation
of morphemes making up that word.
• gel +PROG +1SG
• eat +AOR
• cat +PLU
• kitap +P1SG
• Morphological processors try to find correspondences between lexical and
surface forms of words.
• Morphological recognition – surface to lexical
• Morphological generation – lexical to surface



Inflectional and Derivational Morphology
• There are two broad classes of morphology:
• Inflectional morphology
• Derivational morphology
• After a combination with an inflectional morpheme,
the meaning and class of the actual stem usually do not change.
• eat / eats pencil / pencils
• gel / geliyorum masa / masam
• After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
• compute / computer    do / undo    friend / friendly
• Uygar / uygarlaş    kapı / kapıcı
• Irregular changes may happen with derivational affixes.



English Inflectional Morphology
• Nouns have simple inflectional morphology.
• plural -- cat / cats
• possessive -- John / John’s
• Verbs have slightly more complex, but still relatively simple, inflectional morphology.
• past form -- walk / walked
• past participle form -- walk / walked
• gerund -- walk / walking
• singular third person -- walk / walks
• Verbs can be categorized as:
• main verbs
• modal verbs -- can, will, should
• primary verbs -- be, have, do
• Regular and irregular verbs: walk / walked -- go / went



English Derivational Morphology
• Some English derivational affixes
• -ation : transport / transportation
• -er : kill / killer
• -ness : fuzzy / fuzziness
• -al : computation / computational
• -able : break / breakable
• -less : help / helpless
• un- : do / undo
• re- : try / retry



Morphological Parsing
• Morphological parsing is finding the lexical form of a word from its surface form.
• cats -- cat +N +PLU
• cat -- cat +N +SG
• goose -- goose +N +SG or goose +V
• geese -- goose +N +PLU
• gooses -- goose +V +3SG
• catch -- catch +V
• caught -- catch +V +PAST or catch +V +PP
• geliyorum -- gel +V +PROG +1SG
• masalardan -- masa +N +PLU +ABL
• There can be more than one lexical level representation
for a given word. (ambiguity)
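A toy Python sketch of dictionary-based morphological parsing for a few of the nouns above; the tiny lexicon and tag strings are illustrative assumptions, not a general-purpose parser:

```python
# Toy lexicon: irregular forms listed explicitly, regular nouns handled by a +s rule.
IRREGULAR = {"geese": ["goose +N +PLU"], "mice": ["mouse +N +PLU"]}
STEMS = {"cat", "goose", "fox"}

def parse(surface):
    """Return all lexical-level analyses of a surface form (may be ambiguous)."""
    analyses = list(IRREGULAR.get(surface, []))
    if surface in STEMS:
        analyses.append(f"{surface} +N +SG")
    if surface.endswith("s") and surface[:-1] in STEMS:
        analyses.append(f"{surface[:-1]} +N +PLU")
    return analyses

print(parse("cats"))   # ['cat +N +PLU']
print(parse("geese"))  # ['goose +N +PLU']
print(parse("goose"))  # ['goose +N +SG']
```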
Parts of A Morphological Processor
• For a morphological processor, we need at least the following:

• Lexicon : The list of stems and affixes together with basic information
about them such as their main categories (noun, verb, adjective, …) and
their sub-categories (regular noun, irregular noun, …).
• Morphotactics : The model of morpheme ordering that explains which
classes of morphemes can follow other classes of morphemes inside a
word.
• Orthographic Rules (Spelling Rules) : These spelling rules are used to
model changes that occur in a word (normally when two morphemes
combine).



Lexicon
• A lexicon is a repository for words (stems).
• They are grouped according to their main categories.
• noun, verb, adjective, adverb, …
• They may be also divided into sub-categories.
• regular-nouns, irregular-singular nouns, irregular-plural nouns, …
• The simplest way to create a morphological parser is to put all possible words (together with their inflections) into a lexicon.



Combine Lexicon and Morphotactics
[Figure: an FSA built by plugging the lexicon into the morphotactics; its arcs spell out the stems fox, cat, dog, sheep, goose/geese and mouse/mice letter by letter, with a final plural s arc.]
• This only says yes or no; it does not give a lexical representation.
• It accepts a wrong word (foxs).


Formal Definition of FST (Mealy Machine)
• A FST is a 5-tuple (Q, Σ, q0, F, δ).
• Q : a finite set of N states q0, q1, …, qN
• Σ : a finite input alphabet of complex symbols.
• Each complex symbol is a pair of an input and an output symbol i:o,
• where i is a member of I (an input alphabet),
• and o is a member of O (an output alphabet).
• I and O may contain the empty string.
• So, Σ is a subset of I×O.
• q0 : the start state
• F : the set of final states -- F is a subset of Q
• δ(q, i:o) : the transition function
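A minimal Python sketch that instantiates this definition for a tiny, made-up transducer mapping a lexical form such as cat +PL (given as a list of symbols) to a surface form; the states, alphabet and transitions here are assumptions for illustration only:

```python
# A tiny FST sketch following the (Q, Sigma, q0, F, delta) definition.
# It transduces a lexical symbol sequence like c a t +PL into the surface string "cats".
Q = {"q0", "q1", "q2"}
q0, F = "q0", {"q1", "q2"}
# delta maps (state, input_symbol) -> (next_state, output_symbol);
# "" plays the role of the empty string epsilon on the output side.
delta = {
    ("q0", "c"): ("q0", "c"),
    ("q0", "a"): ("q0", "a"),
    ("q0", "t"): ("q1", "t"),    # end of the stem "cat"
    ("q1", "+SG"): ("q2", ""),   # singular: realized as nothing
    ("q1", "+PL"): ("q2", "s"),  # plural: realized as "s"
}

def transduce(symbols):
    state, output = q0, []
    for sym in symbols:
        state, out = delta[(state, sym)]
        output.append(out)
    assert state in F, "input not accepted"
    return "".join(output)

print(transduce(["c", "a", "t", "+PL"]))  # cats
print(transduce(["c", "a", "t", "+SG"]))  # cat
```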


FST (cont.)
• Σ may not contain all possible pairs from I×O.
• For example:
• I = {a, b, c}   O = {a, b, c, ε}
• Σ = {a:a, b:b, c:c, a:ε, b:ε, c:ε}
• feasible pairs – In two-level morphology terminology, the pairs in Σ are called feasible pairs.
• default pair – Instead of a:a we can use a single character a for this default pair.
• FSAs are isomorphic to regular languages, and FSTs are isomorphic to regular relations (pairs of strings of regular languages).


FST Properties
• FSTs are closed under: union, inversion, and composition.
• union : The union of two regular relations is also a regular relation.
• inversion : The inversion of a FST simply switches the input and output labels.
• This means that the same FST can be used for both directions of a morphological processor.
• composition : If T1 is a FST from I1 to O1 and T2 is a FST from O1 to O2, then the composition of T1 and T2 (T1 o T2) maps from I1 to O2.
• We use these properties of FSTs in the creation of the FST for a morphological processor.


A FST for Simple English Nominals
[Figure: a transducer over three stem classes. After a reg-noun stem come the arcs +N:ε and then either +SG:# or +PL:^s#; after an irreg-sg-noun stem come +N:ε and +SG:#; after an irreg-pl-noun stem come +N:ε and +PL:#.]


FST for stems
• A FST for stems which maps roots to their root class:
  reg-noun: fox, cat, dog
  irreg-pl-noun: g o:e o:e s e ("geese"), sheep, m o:i u:ε s:c e ("mice")
  irreg-sg-noun: goose, sheep, mouse
• fox stands for f:f o:o x:x
• When these two transducers are composed, we have a FST which maps lexical forms to intermediate forms of words for simple English noun inflections.
• The next thing that we should handle is to design the FSTs for orthographic rules, and combine all these transducers.


Lexical, intermediate and surface tapes for "dogs":
lexical:       d o g +N +PL
intermediate:  d o g ^ s #
surface:       d o g s


Lexical to Intermediate FST



Orthographic Rules
• We need FSTs to map the intermediate level to the surface level.
• For each spelling rule we will have a FST, and these FSTs run in parallel.
• Some English spelling rules:
• consonant doubling -- a 1-letter consonant is doubled before -ing/-ed -- beg/begging
• E deletion -- silent e is dropped before -ing and -ed -- make/making
• E insertion -- e is added after s, z, x, ch, sh before s -- watch/watches
• Y replacement -- y changes to ie before -s, and to i before -ed -- try/tries
• K insertion -- for verbs ending with vowel + c, we add k -- panic/panicked
• We represent these rules using two-level morphology rules:
• a => b / c __ d (rewrite a as b when it occurs between c and d)
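A rough Python approximation of the E-insertion rule above, written as a plain regular expression over intermediate forms rather than as a compiled FST; the '^' morpheme boundary and '#' end-of-word marker follow the earlier intermediate-level notation:

```python
import re

def e_insertion(intermediate):
    """Map an intermediate form like 'watch^s#' to a surface form."""
    # Insert 'e' when the morpheme boundary '^' separates s/z/x/ch/sh from a final 's'.
    s = re.sub(r"(s|z|x|ch|sh)\^(?=s#)", r"\1e", intermediate)
    # Remove the remaining boundary and end-of-word markers.
    return s.replace("^", "").replace("#", "")

print(e_insertion("watch^s#"))  # watches
print(e_insertion("fox^s#"))    # foxes
print(e_insertion("cat^s#"))    # cats
```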


Generating or Parsing with FST Lexicon and Rules


Accepting Foxes



Intersection
• We can intersect all rule FSTs to create a single FST.
• The intersection algorithm just takes the Cartesian product of states.
• For each state qi of the first machine and qj of the second machine, we create a new state qij.
• For input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine would transition to qnm.


What is named entity recognition?
• Named entity recognition (NER), also called entity chunking or entity extraction, is a component of natural language processing (NLP) that identifies predefined categories of objects in a body of text.
• These categories include organizations, locations, expressions of time, quantities, medical codes, monetary values and percentages, among others. Essentially, NER is the process of taking a string of text (i.e., a sentence, paragraph or entire document) and identifying and classifying the entities that refer to each category.
NER techniques:
• The organizations that do utilize NER for unstructured data extraction rely on a range of approaches, but most fall into three broad categories: rule-based approaches, machine learning approaches and hybrid approaches.
• Rule-based approaches involve creating a set of rules for the grammar of a language. The rules are then used to identify entities in the text based on their structural and grammatical features. These methods can be time-consuming and may not generalize well to unseen data. (A small sketch follows this list.)
• Machine learning approaches involve training an AI-driven machine learning model on a labeled dataset using algorithms like conditional random fields and maximum entropy (two types of complex statistical language models).
• Techniques can range from traditional machine learning methods (e.g., decision trees and support vector machines) to more complex deep learning approaches, like recurrent neural networks (RNNs) and transformers.
• Hybrid approaches combine rule-based and machine learning methods to leverage the strengths of both. They can use a rule-based system to quickly identify easy-to-recognize entities and a machine learning system to identify more complex entities.
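A toy illustration of the rule-based approach mentioned above, using a few simple regular-expression patterns; the patterns and label names are deliberately naive, made-up examples rather than a real rule set:

```python
import re

# Hypothetical, deliberately naive patterns for a few entity types.
PATTERNS = {
    "MONEY": r"\$\d+(?:\.\d+)?(?:\s?(?:million|billion))?",
    "DATE": r"\b(?:January|February|March|April|May|June|July|August|"
            r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b",
    "ORG": r"\b[A-Z][a-zA-Z]+\s+(?:Inc|Corp|Ltd)\b\.?",
}

def rule_based_ner(text):
    entities = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label))
    return entities

text = "Acme Corp. was acquired for $2.5 billion on March 3, 2021."
print(rule_based_ner(text))
# [('$2.5 billion', 'MONEY'), ('March 3, 2021', 'DATE'), ('Acme Corp.', 'ORG')]
```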
NER methodologies:
• Recurrent neural networks (RNNs) and long short-term memory (LSTM). RNNs are a type of neural network designed for sequence prediction problems.
• LSTMs, a special kind of RNN, can learn to recognize patterns over time and maintain information in "memory" over long sequences, making them particularly useful for understanding context and identifying entities.
• Conditional random fields (CRFs). CRFs are often used in combination with LSTMs for NER tasks. They can model the conditional probability of an entire sequence of labels, rather than just individual labels, making them useful for tasks where the label of a word depends on the labels of surrounding words.
• Transformers and BERT. Transformer networks, particularly the BERT (Bidirectional Encoder Representations from Transformers) model, have had a significant impact on NER. Using a self-attention mechanism that weighs the importance of different words, BERT accounts for the full context of a word by looking at the words that come before and after it.
NER process:
• Step 1. Data collection
• The first step of NER is to aggregate a dataset of annotated text. The
dataset should contain examples of text where named entities are
labeled or marked, indicating their types. The annotations can be
done manually or using automated methods.
• Step 2. Data preprocessing
• Once the dataset is collected, the text should be cleaned and formatted.
You may need to remove unnecessary characters, normalize the text
and/or split text into sentences or tokens.
• Step 3. Feature extraction
• During this stage, relevant features are extracted from the preprocessed
text. These features can include part-of-speech tagging (POS tagging), word
embeddings and contextual information, among others. The choice of
features will depend on the specific NER model the organization uses.
• Step 4. Model training
• The next step is to train a machine learning or deep learning model
using the annotated dataset and the extracted features. The model
learns to identify patterns and relationships between words in the
text, as well as their corresponding named entity labels.
• Step 5. Model evaluation
• After you have trained the NER model, it should be evaluated to
assess its performance. You can measure metrics like precision, recall
and F1 score, which indicate how well the model correctly identifies
and classifies named entities.
• Step 6. Model fine-tuning
• Based on the evaluation results, you will refine the model to improve
its performance. This can include adjusting hyperparameters,
modifying the training data and/or using more advanced techniques
(e.g., ensembling or domain adaptation).
• Step 7. Inference
• At this stage, you can start using the model for inference on new,
unseen text. The model will take the input text, apply the
preprocessing steps, extract relevant features and ultimately predict
the named entity labels for each token or span of text.
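A brief inference sketch (step 7) using spaCy's pretrained pipeline, assuming the en_core_web_sm model has been installed with `python -m spacy download en_core_web_sm`; a custom-trained model would be used the same way:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
#   Apple ORG
#   U.K. GPE
#   $1 billion MONEY
```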
Application of NER:
Challenges of NER:
• Ambiguity in Entity Names: Certain words or phrases can have multiple possible meanings or interpretations.
• Misspelled Entity Names: Text data often contains spelling errors or variations, making it difficult to recognize named entities accurately.
• Ambiguity in Entity Types: Some words or phrases can be classified into multiple entity types, leading to uncertainty in classification.
• Variations in Entity References: Entities can be referred to using different expressions or synonyms, making their identification challenging.
• Contextual Challenges: Understanding the context of a word or phrase within a sentence or document is essential for accurate entity recognition.
