Web Mining UNIT-II Chapter-01 - 02 - 03

This document discusses information retrieval models and describes the Boolean and vector space models. The Boolean model retrieves documents that match a user's Boolean query but has limitations. The vector space model represents documents and queries as vectors of term weights and calculates similarity between them to return relevant results.

Uploaded by Ganesh Pandey

1 - Introduction to Information Retrieval (IR) Systems

Definition and goals of information retrieval


1. Information retrieval (IR) is the process of obtaining relevant information from a large
collection of data or documents. It involves searching for, identifying, and retrieving
documents or resources that match a user's information need, typically expressed as a
query.
2. IR can also be defined as a software process that deals with the organization, storage,
retrieval, and evaluation of information from document repositories, particularly textual
information.
3. Information retrieval is the activity of finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within large
collections stored on computers.
4. For example, information retrieval takes place whenever a user enters a query into a
search system.
The primary goals of information retrieval are:
Relevance: The retrieved information should be relevant to the user's query. Relevance is
subjective and depends on the context, user preferences, and the specific information need.
Precision: Precision refers to the accuracy of retrieved results. It measures the proportion of
relevant documents among all the retrieved documents. A system with high precision returns a
high percentage of relevant documents in its results.
Recall: Recall measures the ability of an IR system to retrieve all relevant documents from the
collection. It is the proportion of relevant documents that are successfully retrieved out of all the
relevant documents in the collection.
Efficiency: IR systems should be efficient in terms of both time and resources. Users expect
quick responses to their queries, and systems should be able to handle large volumes of data
efficiently.
User Satisfaction: Ultimately, the goal of information retrieval is to satisfy the information
needs of users. Users should find the retrieved information useful and be satisfied with the
performance of the IR system.
Adaptability: IR systems should be adaptable to different types of queries and domains. They
should be able to handle various data formats, languages, and user preferences effectively.
Scalability: As data collections continue to grow, IR systems must be scalable to handle
increasingly large volumes of data while maintaining performance and efficiency.
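The precision and recall goals above can be computed directly from the sets of retrieved and relevant documents. A minimal sketch (the document IDs are illustrative, not from a real collection):

```python
# Precision and recall from sets of retrieved and relevant document IDs.
# The IDs below are illustrative, not from a real collection.
retrieved = {"d1", "d2", "d3", "d4"}   # documents the system returned
relevant  = {"d1", "d3", "d5"}         # documents actually relevant

hits = retrieved & relevant            # relevant documents that were retrieved

precision = len(hits) / len(retrieved) # fraction of retrieved docs that are relevant
recall    = len(hits) / len(relevant)  # fraction of relevant docs that were retrieved

print(precision)  # 0.5
print(recall)     # 0.666...
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).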
Components of IR System

User Interface: The user interface is the component through which users interact with the IR
system. It allows users to input their queries, view search results, and navigate through
retrieved documents. User interfaces can vary widely, ranging from simple keyword search
boxes to more advanced query builders and filtering options.
Query Processor: The query processor interprets user queries and translates them into a
format that the IR system can understand and use for searching. It may perform tasks such as
query parsing, expansion (e.g., adding synonyms or related terms), and normalization to
improve the accuracy and effectiveness of the search process.
Indexer: The indexer is responsible for creating and maintaining an index of the documents in
the collection. This index contains information about the terms (words or phrases) present in
each document and their locations. Indexing allows for efficient searching by enabling the
system to quickly identify relevant documents based on the terms in the user's query.
Search Engine: The search engine executes queries against the document index to retrieve
relevant documents. It ranks the retrieved documents based on their relevance to the user's
query, typically using algorithms that consider factors such as term frequency, document
length, and document popularity.
Ranking Model: The ranking model determines the order in which retrieved documents are
presented to the user. It assigns a relevance score to each document based on various factors,
such as the frequency of query terms in the document, the position of query terms, and the
authority of the document.
Retrieval Model: The retrieval model defines the principles and algorithms used to retrieve
relevant documents from the index. Common retrieval models include Boolean retrieval,
vector space models, and probabilistic models.
Relevance Feedback Mechanism: Relevance feedback allows users to provide feedback on
the relevance of retrieved documents, which the system can use to improve future search
results. This feedback can be explicit (e.g., user ratings or annotations) or implicit (e.g.,
clicks and reading time).
Evaluation Module: The evaluation module assesses the performance of the IR system by
measuring metrics such as precision, recall, and relevance in response to a set of test queries.
Evaluation is essential for optimizing and fine-tuning the system's components to improve
overall effectiveness.
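The indexer component described above typically builds an inverted index: a mapping from each term to the documents that contain it. A minimal sketch over toy documents (the texts are illustrative):

```python
from collections import defaultdict

# Minimal inverted index sketch: maps each term to the set of document IDs
# containing it. The toy documents are illustrative.
docs = {
    1: "web mining finds patterns",
    2: "information retrieval finds documents",
    3: "web search uses retrieval models",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():       # naive tokenization by whitespace
        index[term].add(doc_id)

# Lookup: which documents contain "retrieval"?
print(sorted(index["retrieval"]))   # [2, 3]
```

A query can then be answered by intersecting the posting sets of its terms instead of scanning every document, which is what makes searching efficient.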

Challenges and applications of IR


Information retrieval (IR) faces various challenges and finds applications across a wide range of
domains. Here are some of the key challenges and applications:
Ambiguity: Words or phrases often have multiple meanings, leading to ambiguity in user queries
and document content.
Heterogeneity of Data: Data in various formats (text, images, videos) and languages poses
challenges for indexing and searching.
Information Overload: With the exponential growth of data, users may be overwhelmed by the
volume of information available, making it difficult to find relevant content.
Dynamic Nature of Data: Information is constantly evolving, with new documents being
created and existing ones being updated or removed, requiring IR systems to adapt in real-time.
Context Sensitivity: The relevance of information often depends on the context in which it is
used, making it challenging to accurately retrieve relevant content.
Evaluation and Metrics: Assessing the effectiveness of IR systems and defining appropriate
evaluation metrics is challenging due to the subjective nature of relevance and user satisfaction.

Applications:
Web Search Engines: Web search engines like Google, Bing, and Yahoo use IR techniques to
retrieve relevant web pages in response to user queries.
Digital Libraries: IR systems are used to organize and retrieve documents in digital libraries,
archives, and repositories, enabling users to access scholarly articles, books, and other resources.

E-commerce Search: Online shopping platforms use IR techniques to help users find products
based on their preferences and search queries.

Enterprise Search: Organizations use IR systems to index and search internal documents,
emails, and other digital assets, facilitating knowledge management and information retrieval
within the organization.

Healthcare Information Retrieval: IR systems are used in healthcare to retrieve relevant
medical literature, patient records, and clinical guidelines to support clinical decision-making
and research.

Legal Information Retrieval: Legal professionals use IR systems to search and retrieve case
law, statutes, and legal documents relevant to their cases and research.

Social Media and Content Recommendation: Social media platforms and content
recommendation systems use IR techniques to deliver personalized content and
recommendations to users based on their interests and preferences.

Multimedia Retrieval: IR systems retrieve multimedia content such as images, videos, and
audio files based on user queries, enabling users to find relevant multimedia content.

Difference Between Information Retrieval and Data Retrieval


Information Retrieval | Data Retrieval
--------------------- | ---------------
The software process that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Deals with obtaining data from a database management system such as an ODBMS; a process of identifying and retrieving data from the database based on the query provided by the user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single erroneous object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are not ordered by relevance.
It is a probabilistic model. | It is a deterministic model.
2 - Retrieval Models:

Introduction
 An IR (Information Retrieval) model is a framework or system used to retrieve relevant
information from a large collection of data, typically text documents, in response to a
user query.
 The primary goal of an IR model is to efficiently and accurately match documents in the
collection to the user's information needs expressed in the query.

Types of IR Model
Boolean Model
 It is a simple retrieval model based on set theory and Boolean algebra. Queries are
expressed as Boolean expressions, which have precise semantics, and the retrieval strategy
is based on a binary decision criterion. The Boolean model only considers whether index
terms are present or absent in a document.
Problem:
Consider 5 documents with a vocabulary of 6 terms:
document 1 = 'term1 term3'
document 2 = 'term2 term4 term6'
document 3 = 'term1 term2 term3 term4 term5'
document 4 = 'term1 term3 term6'
document 5 = 'term3 term4'

Our documents in the Boolean model (term-document incidence matrix):

            term1  term2  term3  term4  term5  term6
document1     1      0      1      0      0      0
document2     0      1      0      1      0      1
document3     1      1      1      1      1      0
document4     1      0      1      0      0      1
document5     0      0      1      1      0      0

Consider the query: find the documents containing term1 and term3 and not term2
(term1 ∧ term3 ∧ ¬term2).

            term1  ¬term2  term3  term4  term5  term6
document1     1      1       1      0      0      0
document2     0      0       0      1      0      1
document3     1      0       1      1      1      0
document4     1      1       1      0      0      1
document5     0      1       1      1      0      0

document 1 : 1 ∧ 1∧ 1 = 1
document 2 : 0 ∧ 0 ∧ 0 = 0
document 3 : 1 ∧ 1 ∧ 0 = 0
document 4 : 1 ∧ 1 ∧ 1 = 1
document 5 : 0 ∧ 1 ∧ 1 = 0
Based on the above computation document1 and document4 are relevant to the given query.
The Boolean operators used in the Boolean model are AND, OR, and NOT. The Boolean
retrieval model is a model for information retrieval in which we can pose any query in the
form of a Boolean expression of terms, that is, one in which terms are combined with the
operators AND, OR, and NOT.
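The worked example above can be sketched with Python sets, one posting set of document IDs per term, mirroring the rows of the incidence matrix:

```python
# Boolean retrieval over the five example documents, using one posting set
# per term (taken from the incidence matrix above).
postings = {
    "term1": {1, 3, 4},
    "term2": {2, 3},
    "term3": {1, 3, 4, 5},
    "term4": {2, 3, 5},
    "term5": {3},
    "term6": {2, 4},
}
all_docs = {1, 2, 3, 4, 5}

# Query: term1 AND term3 AND NOT term2
# AND maps to set intersection; NOT maps to set complement.
result = postings["term1"] & postings["term3"] & (all_docs - postings["term2"])
print(sorted(result))  # [1, 4]
```

The intersection returns documents 1 and 4, matching the hand computation above.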

Vector Space Model


 The vector space model is an algebraic model that represents objects (like text) as
vectors. This makes it easy to determine the similarity between words or the relevance
between a search query and document.
 The vector space model uses linear algebra with non-binary term weights. This means the
continuous degree of similarity between two objects, like a query and documents, can be
calculated allowing for partial matching.
How the Vector Space Model works:
Document Representation: Each document in the collection is represented as a vector in the
term space. The dimensions of this space are defined by the entire vocabulary of terms across all
documents.
Query Representation: Similarly, the user query is also represented as a vector in the same term
space. The query vector typically represents the presence or absence of each term in the query.
Similarity Calculation: To find the relevance of documents to the query, the similarity between
the query vector and each document vector is calculated. One common similarity measure is the
cosine similarity, which computes the cosine of the angle between the query vector and the
document vector. Other similarity measures such as Euclidean distance or Manhattan distance
can also be used.
Ranking: Once the similarities between the query and all documents have been computed, the
documents are ranked based on their similarity scores. Documents with higher similarity scores
are considered more relevant to the query and are typically presented to the user first in the
search results.
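The four steps above can be sketched directly: represent query and documents as term-weight vectors, compute cosine similarity, and rank. The term weights below are illustrative, not taken from a real collection:

```python
import math

# Cosine similarity between a query vector and document vectors in the
# same term space. The term weights below are illustrative.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = [1, 1, 0]
docs = {"d1": [1, 1, 0], "d2": [1, 0, 0], "d3": [0, 1, 2]}

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Because cosine similarity is continuous rather than binary, d2 and d3 still receive partial scores even though they match the query only partially.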

TF-IDF(Term Frequency-Inverse Document Frequency)


TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a statistical measure
used in Information Retrieval (IR) and Text Mining to evaluate the importance of a term in a
document relative to a collection of documents. It's commonly used to weigh the importance of
terms in documents and queries for ranking and retrieval purposes.
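A common formulation weights a term by tf * idf, where tf is the term's count in the document and idf = log(N / df), with N the number of documents and df the number of documents containing the term. A minimal sketch using the D1-D3 example texts from the next section (this uses raw counts and the natural log; many weighting variants exist):

```python
import math

# TF-IDF sketch: tf = raw term count in a document; idf = log(N / df).
# Uses raw counts and the natural log; many weighting variants exist.
docs = {
    "d1": "the health observances for march".split(),
    "d2": "the health oriented calendar".split(),
    "d3": "the good news for march awareness".split(),
}
N = len(docs)

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)
    df = sum(1 for words in docs.values() if term in words)
    idf = math.log(N / df) if df else 0.0
    return tf * idf

print(tf_idf("health", "d1"))  # log(3/2): "health" appears in 2 of 3 docs
print(tf_idf("the", "d1"))     # 0.0: "the" appears in every document
```

Note how "the" gets weight zero: a term present in every document carries no discriminating power, which is exactly what idf is designed to capture.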

Solve Example
Consider an example to understand the Vector Space Model. Consider a total of 10 unique words
(w1, w2, …, w10) in three articles (d1, d2, d3). The statistical word frequency table shows the
word frequencies in each article. Using any vector space formula, it is possible to calculate the
similarity between two text documents.

Calculate the similarity between D1 and D2 articles. Take cosine as an example.


Solve Example
D1: "the Health Observances for March"
D2: "the Health oriented Calendar"
D3: "the good News for March Awareness"
The retrieval results are summarized in a table of term weights, where D is the number of
documents in the document collection and idf stands for inverse document frequency.
Q · D1 = 1*1 + 1*1 + 1*0 = 2
Q · D2 = 1*1 + 1*0 + 1*0 = 1
Q · D3 = 1*1 + 1*0 + 1*2 = 3
Since a dot product is defined as the product of the magnitudes of the vectors times the
cosine of the angle between them: dot product = (product of magnitudes) * (cosine of angle).
Probabilistic model
 In information retrieval, a probabilistic model is often used to estimate the relevance of
documents to a particular query. One popular example of a probabilistic model in
information retrieval is the Binary Independence Model (BIM).
 In the Binary Independence Model, each document and query are represented as a set of
binary terms (presence or absence of terms).
 The relevance of a document to a query is then estimated based on the presence or
absence of terms in both the document and the query.
 Here's a simplified example of how the Binary Independence Model works.
Suppose we have a collection of documents containing the terms "cat", "dog", "bird",
and "fish", and we have the query "cat dog".
Consider two documents:
Document 1: "The quick brown fox jumps over the lazy dog."
Document 2: "Cats are furry animals."
To determine the relevance of each document to the query using the Binary Independence Model,
we calculate the probability of relevance based on the presence or absence of terms in both the
document and the query.
For Document 1:

"cat" is absent
"dog" is present
"cat dog" (from the query) has one term present
P(relevant|document 1) = P("dog"|relevant) * (1 - P("cat"|relevant))

For Document 2:

"cat" is present
"dog" is absent
"cat dog" (from the query) has one term present
P(relevant|document 2) = P("cat"|relevant) * (1 - P("dog"|relevant))

These probabilities can be estimated from the training data or based on empirical observations.
The document with the highest probability of relevance is considered the most relevant to the
query.
Probabilistic models like the Binary Independence Model allow for flexible and efficient ranking
of documents based on their estimated relevance to a query, making them widely used in
information retrieval systems like search engines.
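The scoring for Document 1 and Document 2 above can be sketched as follows. The per-term probabilities P(term | relevant) are assumed values, standing in for estimates from training data:

```python
# Toy Binary Independence Model scoring. The per-term probabilities
# P(term present | relevant) are assumed values, as if estimated
# from training data.
p_given_relevant = {"cat": 0.7, "dog": 0.6}

def score(doc_terms, query_terms):
    # Term independence: multiply P(t|R) for present query terms
    # and (1 - P(t|R)) for absent ones.
    s = 1.0
    for t in query_terms:
        p = p_given_relevant[t]
        s *= p if t in doc_terms else (1.0 - p)
    return s

doc1 = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
doc2 = {"cat", "are", "furry", "animals"}  # "Cats" normalized to "cat"
query = ["cat", "dog"]

print(score(doc1, query))  # (1 - 0.7) * 0.6 = 0.18
print(score(doc2, query))  # 0.7 * (1 - 0.6) = 0.28
```

With these assumed probabilities, Document 2 scores higher, so it would be ranked as more likely relevant to "cat dog".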
3 - Web Information Retrieval

Web search architecture and challenges


Web search architecture involves a complex system of components working together to retrieve
relevant information from the vast expanse of the World Wide Web. Here's a simplified overview
of the typical architecture and some challenges in information retrieval (IR):
Web Search Architecture:
Crawling: The process of systematically browsing the web to discover and index web pages.
This involves web crawlers (also known as spiders or bots) that follow links from one page to
another, fetching content and storing it for indexing.
Indexing: Once web pages are crawled, they are processed to extract relevant information and
build an index. This index usually contains information about keywords, metadata, and other
features to facilitate efficient retrieval.
Ranking Algorithm: When a user enters a query, the search engine uses a ranking algorithm to
determine the relevance of indexed pages to the query. This typically involves various factors
such as keyword relevance, page popularity, user behavior signals, and more.
User Interface: The search results are presented to the user via a user interface, usually a search
engine website or application. This interface may include features such as autocomplete
suggestions, filters, and advanced search options.
Query Processing: The search engine processes the user's query, interpreting the intent and
identifying relevant documents from the index.
Retrieval: The search engine retrieves the relevant documents from the index based on the user's
query and presents them to the user in a ranked order.
Challenges in Information Retrieval:
Scale: The web is vast and constantly growing, presenting challenges in crawling, indexing, and
searching through a massive amount of data efficiently.

Freshness: Web content is continuously changing, requiring frequent updates to the index to
ensure that search results are up-to-date and relevant.
Multimedia Content: With the proliferation of multimedia content (images, videos, audio), search
engines need to develop effective algorithms for indexing and retrieving diverse types of content.
Personalization: Providing personalized search results tailored to individual users' preferences
and context while maintaining user privacy is a challenge for search engines.
Multilingualism: The web contains content in multiple languages, requiring support for
multilingual search and retrieval.

Web Crawlers and Indexing


1. Crawling: Crawling is the discovery process in which search engines send out a team of robots
(known as crawlers or spiders) to find newly updated content.

2. Indexing: Indexing is the process of storing the information crawlers find in an index, a huge
database of all the content they have discovered and deemed good enough to serve up to searchers.
What are Web Crawlers?
Web crawlers, also known as web spiders or web robots, are automated programs or scripts used
by search engines to systematically browse the internet, discover web pages, and gather
information for indexing. These crawlers navigate the web by following hyperlinks from one
webpage to another, retrieving content, and storing it for indexing by search engines.
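The link-following behavior of a crawler can be sketched as a breadth-first traversal. The sketch below walks a toy in-memory link graph (the page names are made up); a real crawler would instead fetch pages over HTTP, parse out their links, and respect politeness rules:

```python
from collections import deque

# Minimal crawler sketch: breadth-first traversal of a toy in-memory link
# graph. A real crawler fetches pages over HTTP and parses their links.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
}

def crawl(seed):
    seen, frontier = {seed}, deque([seed])
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)                 # "fetch" and record the page
        for link in links.get(page, []):   # follow its outgoing links
            if link not in seen:           # never revisit a page
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("a.html"))  # ['a.html', 'b.html', 'c.html', 'd.html']
```

The `seen` set is what keeps the crawler from looping forever on cyclic links such as c.html pointing back to a.html.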
What is Indexing?
Indexing in Information Retrieval (IR) refers to the process of creating and maintaining an
organized data structure (index) that facilitates efficient and effective retrieval of information in
response to user queries.
CRAWLING | INDEXING
-------- | --------
In the SEO world, crawling means "following your links". | Indexing is the process of "adding webpages into Google search".
Crawling is the process through which indexing is done. Google crawls through the web pages and indexes the pages. | When search engine crawlers visit any link, it is crawling; when crawlers save or index that link in the search engine database, it is called indexing.
When Google visits your website for tracking purposes. This process is done by Google's spiders or crawlers. | After crawling has been done, the result gets put into Google's index (i.e. web search), which means crawling and indexing is a step-by-step process.
Crawling is a process done by search engine bots to discover publicly available web pages. | Indexing means search engine bots crawl the web pages, save a copy of all the information on index servers, and search engines show the relevant results when a user performs a search query.
It finds web pages and queues them for indexing. | It analyses the web page content and saves the pages with quality content in the index.
It crawls the web pages. | It performs analysis on the page content and stores it in the index.
Crawling is simply when search engine bots are actively crawling your website. | Indexing is the process of placing a page into the index.
Crawling discovers URLs through the web crawler's recursive visits to web pages. | Indexing builds its index with every significant word on a web page found in the title, headings, meta tags, alt tags, subtitles and other important positions.
Crawling requires more resources than indexing. | Indexing is more resource-efficient, as the information gathered during the crawling process is analyzed.

Link analysis
Link analysis in Information Retrieval (IR) refers to the process of analyzing the relationships
between documents based on hyperlinks or citations. This analysis is commonly used in web
search engines and academic citation networks to understand the structure and authority of
documents, as well as to improve search relevance. Here's how link analysis works in IR:

1. Hyperlink Structure: In web search engines, documents (web pages) are connected through
hyperlinks. Each hyperlink from one document to another represents a relationship or reference
between the two documents.

2. PageRank Algorithm: One of the most famous link analysis algorithms is PageRank,
developed by Larry Page and Sergey Brin at Google. PageRank assigns a numerical weight to
each document in a hyperlinked collection of documents (such as the World Wide Web), with the
purpose of measuring its relative importance within the network. The algorithm considers both
the quantity and quality of inbound links to a page when determining its importance.
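The idea can be sketched with power iteration over a toy four-page link graph. The graph is illustrative, the damping factor 0.85 is the conventional choice, and the sketch assumes every page has at least one outgoing link (no dangling-node handling):

```python
# PageRank by power iteration over a toy link graph. Assumes every page
# has at least one outgoing link (no dangling-node handling).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
d = 0.85                       # damping factor, the conventional choice
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):            # iterate until ranks stabilise
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        share = rank[p] / len(outs)   # each page splits its rank evenly
        for q in outs:
            new[q] += d * share       # across its outgoing links
    rank = new

# C receives links from A, B and D, so it ends up with the highest rank.
best = max(rank, key=rank.get)
print(best)  # C
```

Note that C ranks highest because of both the quantity of its inbound links (three pages point to it) and their quality (it inherits rank from the well-linked A), exactly the two factors described above.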

3. HITS Algorithm: Another notable link analysis algorithm is the HITS (Hyperlink-Induced
Topic Search) algorithm. HITS identifies authoritative pages and hubs within a network of
hyperlinked documents. Authority pages are those that are highly relevant to a given topic, while
hub pages are those that point to many authority pages. HITS iteratively computes authority and
hub scores for each document based on the links pointing to and from it.
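The iterative hub/authority computation can be sketched over a toy link graph (the pages and links are illustrative): each round, a page's authority score sums the hub scores of pages linking to it, its hub score sums the authority scores of pages it links to, and both score vectors are normalised.

```python
import math

# HITS sketch over a toy link graph: authority scores sum the hub scores
# of in-linking pages; hub scores sum the authority scores of out-linked
# pages. Both vectors are normalised each round.
links = {"A": ["C", "D"], "B": ["D"], "C": ["D"], "D": []}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

def normalised(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

for _ in range(20):
    auth = normalised({p: sum(hub[q] for q in pages if p in links[q])
                       for p in pages})
    hub = normalised({p: sum(auth[q] for q in links[p]) for p in pages})

# D is pointed to by A, B and C, so it emerges as the top authority;
# A points to both C and D, so it emerges as the top hub.
print(max(auth, key=auth.get), max(hub, key=hub.get))  # D A
```

This mirrors the definitions above: D is an authority because many pages reference it, while A is a hub because it points at good authorities.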

4. Link-Based Relevance: In addition to traditional content-based methods, search engines often


incorporate link-based relevance signals into their ranking algorithms. Pages that are frequently
linked to by other reputable pages are often considered more relevant and authoritative, and thus
may be ranked higher in search results.

5. Anchor Text Analysis: Analyzing the anchor text of hyperlinks provides additional context
about the content of the linked documents. Search engines may use anchor text information to
improve the relevance of search results by matching query terms with anchor text in links.

6. Citation Analysis: In academic information retrieval, citation analysis examines the network of
citations between academic papers. By analyzing citation patterns, researchers can identify
influential papers, measure scholarly impact, and discover relationships between research topics.

Link analysis techniques play a crucial role in understanding the structure of interconnected
document collections, identifying authoritative sources of information, and improving the
relevance of search results in information retrieval systems.
Page Rank Algorithm
