Web Mining UNIT-II Chapter-01 - 02 - 03
User Interface: The user interface is the component through which users interact with the IR
system. It allows users to input their queries, view search results, and navigate through
retrieved documents. User interfaces can vary widely, ranging from simple keyword search
boxes to more advanced query builders and filtering options.
Query Processor: The query processor interprets user queries and translates them into a
format that the IR system can understand and use for searching. It may perform tasks such as
query parsing, expansion (e.g., adding synonyms or related terms), and normalization to
improve the accuracy and effectiveness of the search process.
Indexer: The indexer is responsible for creating and maintaining an index of the documents in
the collection. This index contains information about the terms (words or phrases) present in
each document and their locations. Indexing allows for efficient searching by enabling the
system to quickly identify relevant documents based on the terms in the user's query.
Search Engine: The search engine executes queries against the document index to retrieve
relevant documents. It ranks the retrieved documents based on their relevance to the user's
query, typically using algorithms that consider factors such as term frequency, document
length, and document popularity.
Ranking Model: The ranking model determines the order in which retrieved documents are
presented to the user. It assigns a relevance score to each document based on various factors,
such as the frequency of query terms in the document, the position of query terms, and the
authority of the document.
Retrieval Model: The retrieval model defines the principles and algorithms used to retrieve
relevant documents from the index. Common retrieval models include Boolean retrieval,
vector space models, and probabilistic models.
Relevance Feedback Mechanism: Relevance feedback allows users to provide feedback on
the relevance of retrieved documents, which the system can use to improve future search
results. This feedback can be explicit (e.g., user ratings or annotations)
Evaluation Module: The evaluation module assesses the performance of the IR system by
measuring metrics such as precision, recall, and relevance in response to a set of test queries.
Evaluation is essential for optimizing and fine-tuning the system's components to improve
overall effectiveness.
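A minimal sketch of how several of these components fit together: an indexer builds a term index, a query processor tokenizes the query, and a simple term-frequency ranking model orders the results. The documents and scoring scheme below are invented for illustration, not taken from any particular system.

```python
# Toy IR pipeline sketch: indexer + query processor + ranking model.
from collections import defaultdict

docs = {
    1: "web mining extracts knowledge from web data",
    2: "information retrieval finds relevant documents",
    3: "web search engines rank relevant web pages",
}

# Indexer: map each term to {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query):
    """Query processor + ranking model: score each document by the
    summed frequency of the query terms it contains."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Present results in decreasing order of relevance score.
    return sorted(scores, key=scores.get, reverse=True)

print(search("web relevant"))  # document 3 matches both query terms
```

Real systems replace the raw term-frequency score with weighting schemes such as TF-IDF or BM25, but the component roles are the same.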
Applications:
Web Search Engines: Web search engines like Google, Bing, and Yahoo use IR techniques to
retrieve relevant web pages in response to user queries.
Digital Libraries: IR systems are used to organize and retrieve documents in digital libraries,
archives, and repositories, enabling users to access scholarly articles, books, and other resources.
E-commerce Search: Online shopping platforms use IR techniques to help users find products
based on their preferences and search queries.
Enterprise Search: Organizations use IR systems to index and search internal documents,
emails, and other digital assets, facilitating knowledge management and information retrieval
within the organization.
Legal Information Retrieval: Legal professionals use IR systems to search and retrieve case
law, statutes, and legal documents relevant to their cases and research.
Social Media and Content Recommendation: Social media platforms and content
recommendation systems use IR techniques to deliver personalized content and
recommendations to users based on their interests and preferences.
Multimedia Retrieval: IR systems retrieve multimedia content such as images, videos, and
audio files based on user queries, enabling users to find relevant multimedia content.
Introduction
An IR (Information Retrieval) model is a framework or system used to retrieve relevant
information from a large collection of data, typically text documents, in response to a
user query.
The primary goal of an IR model is to efficiently and accurately match documents in the
collection to the user's information needs expressed in the query.
Types of IR Model
Boolean Model
It is a simple retrieval model based on set theory and Boolean algebra. Queries are expressed as Boolean expressions, which have precise semantics, and the retrieval strategy is based on a binary decision criterion: the Boolean model considers only whether index terms are present or absent in a document.
Problem:
Consider 5 documents with a vocabulary of 6 terms
document 1 = ‘term1 term3’
document 2 = ‘term2 term4 term6’
document 3 = ‘term1 term2 term3 term4 term5’
document 4 = ‘term1 term3 term6’
document 5 = ‘term3 term4’
Consider the query: find the documents containing term1 and term3 but not term2 (term1 ∧ term3 ∧ ¬term2)
            term1  ¬term2  term3  term4  term5  term6
document1     1      1       1      0      0      0
document2     0      0       0      1      0      1
document3     1      0       1      1      1      0
document4     1      1       1      0      0      1
document5     0      1       1      1      0      0
Evaluating term1 ∧ term3 ∧ ¬term2 for each document:
document 1 : 1 ∧ 1 ∧ 1 = 1
document 2 : 0 ∧ 0 ∧ 0 = 0
document 3 : 1 ∧ 1 ∧ 0 = 0
document 4 : 1 ∧ 1 ∧ 1 = 1
document 5 : 0 ∧ 1 ∧ 1 = 0
Based on the above computation, document1 and document4 are relevant to the given query.
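The truth-table evaluation above can be reproduced directly with set operations, since the Boolean model treats each document as a set of index terms:

```python
# Boolean retrieval for the example above: each document is a set of
# index terms, and the query term1 AND term3 AND NOT term2 becomes
# set-membership tests.
docs = {
    "document1": {"term1", "term3"},
    "document2": {"term2", "term4", "term6"},
    "document3": {"term1", "term2", "term3", "term4", "term5"},
    "document4": {"term1", "term3", "term6"},
    "document5": {"term3", "term4"},
}

matches = [name for name, terms in docs.items()
           if "term1" in terms and "term3" in terms and "term2" not in terms]
print(matches)  # ['document1', 'document4']
```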
The Boolean model uses the Boolean operators AND, OR, and NOT: the Boolean retrieval model is a model for information retrieval in which any query can be posed as a Boolean expression of terms, that is, one in which index terms are combined with these operators.
Solved Example
Consider an example to understand the Vector Space Model. Suppose a total of 10 unique words (w1, w2, …, w10) occur across three articles (d1, d2, d3), and a statistical word-frequency table records the frequency of each word in each article. Representing each article as a vector of these frequencies, a vector space similarity formula (such as cosine similarity) can be used to calculate the similarity between two text documents.
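A minimal cosine-similarity sketch in the spirit of this example. The frequency vectors below are invented for illustration, since the original statistical word-frequency table is not reproduced here:

```python
# Cosine similarity between term-frequency vectors, as used in the
# vector space model. Frequencies of w1..w10 are assumed values.
import math

d1 = [2, 0, 1, 3, 0, 0, 1, 0, 0, 2]   # frequencies of w1..w10 in d1
d2 = [1, 1, 0, 2, 0, 0, 1, 0, 1, 1]   # frequencies of w1..w10 in d2

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine(d1, d2), 3))  # similarity close to 1 means similar documents
```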
For Document 1:
"cat" is absent
"dog" is present
"cat dog" (from the query) has one term present
P(relevant|document 1) = P("dog"|relevant) * (1 - P("cat"|relevant))
For Document 2:
"cat" is present
"dog" is absent
"cat dog" (from the query) has one term present
P(relevant|document 2) = P("cat"|relevant) * (1 - P("dog"|relevant))
These probabilities can be estimated from the training data or based on empirical observations.
The document with the highest probability of relevance is considered the most relevant to the
query.
Probabilistic models like the Binary Independence Model allow for flexible and efficient ranking
of documents based on their estimated relevance to a query, making them widely used in
information retrieval systems like search engines.
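The two document scores above can be computed directly once the per-term probabilities are fixed. The probability values below are assumed for illustration, since the text notes they would normally be estimated from training data:

```python
# Binary Independence Model scoring for the "cat dog" query above.
# The per-term probabilities are assumed values, not estimated ones.
p_cat = 0.7  # P("cat" present | relevant), assumed
p_dog = 0.6  # P("dog" present | relevant), assumed

# Document 1 contains "dog" but not "cat".
score_doc1 = p_dog * (1 - p_cat)
# Document 2 contains "cat" but not "dog".
score_doc2 = p_cat * (1 - p_dog)

# Under these assumed probabilities, Document 2 scores higher and
# would be ranked as more relevant to the query.
print(score_doc1, score_doc2)
```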
3 - Web Information Retrieval
Freshness: Web content is continuously changing, requiring frequent updates to the index to
ensure that search results are up-to-date and relevant.
Multimedia Content: With the proliferation of multimedia content (images, videos, audio), search
engines need to develop effective algorithms for indexing and retrieving diverse types of content.
Personalization: Providing personalized search results tailored to individual users' preferences
and context while maintaining user privacy is a challenge for search engines.
Multilingualism: The web contains content in multiple languages, requiring support for
multilingual search and retrieval.
2. Indexing: Indexing is the process by which search engines store the information they discover in an index, a huge database of all the content they have found and deemed good enough to serve to searchers.
What are Web Crawlers?
Web crawlers, also known as web spiders or web robots, are automated programs or scripts used
by search engines to systematically browse the internet, discover web pages, and gather
information for indexing. These crawlers navigate the web by following hyperlinks from one
webpage to another, retrieving content, and storing it for indexing by search engines.
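The traversal logic described above can be sketched with a queue of URLs to visit and a set of pages already seen. The link graph below is an invented in-memory stand-in for the web; a real crawler would fetch pages over HTTP and respect robots.txt:

```python
# Sketch of a crawler's breadth-first traversal over a toy "web":
# a frontier queue of URLs to visit, a seen-set, and link-following.
from collections import deque

links = {  # hypothetical link graph: page -> pages it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
}

def crawl(seed):
    frontier, visited = deque([seed]), []
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        visited.append(url)              # "fetch" and store the page
        for out in links.get(url, []):   # follow its hyperlinks
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return visited

print(crawl("a.html"))  # breadth-first discovery order
```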
What is Indexing?
Indexing in Information Retrieval (IR) refers to the process of creating and maintaining an
organized data structure (index) that facilitates efficient and effective retrieval of information in
response to user queries.
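A minimal positional inverted index, matching the definition above: each term maps to the documents that contain it and the positions at which it occurs. The two documents are invented for illustration:

```python
# Positional inverted index: term -> {doc_id: [positions]}.
from collections import defaultdict

docs = {
    "d1": "information retrieval systems index documents",
    "d2": "search engines index web documents",
}

index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term][doc_id].append(pos)

# "index" occurs at position 3 in d1 and position 2 in d2.
print(dict(index["index"]))
print(dict(index["documents"]))
```

Storing positions, not just document identifiers, is what makes phrase and proximity queries possible at search time.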
CRAWLING vs INDEXING
1. In the SEO world, crawling means "following your links", while indexing is the process of "adding webpages into Google search".
2. Crawling is the process through which indexing is done: when search-engine crawlers visit a link, that is crawling; when the crawlers save (index) that link in the search engine's database, that is indexing.
3. Crawling is when Google visits your website for tracking purposes, carried out by Google's spiders or crawlers. After crawling has been done, the result is put into Google's index (i.e. web search), which makes crawling and indexing a step-by-step process.
4. Crawling is performed by search-engine bots to discover publicly available web pages. Indexing is when those bots save a copy of the page information on index servers, so the search engine can show relevant results when a user performs a search query.
5. Crawling finds web pages and queues them for indexing; indexing analyses the page content and saves pages with quality content in the index.
6. Crawling fetches the web pages; indexing performs analysis on the page content and stores it in the index.
7. Crawling is simply when search-engine bots are actively crawling your website; indexing is the process of placing a crawled page into the index.
8. Crawling discovers URLs through recursive visits of input web pages; indexing builds the index from every significant word on a web page found in the title, headings, meta tags, alt tags, subtitles and other important positions.
9. Crawling requires more resources than indexing; indexing is more resource-efficient because it analyses information already gathered during the crawling process.
Link analysis
Link analysis in Information Retrieval (IR) refers to the process of analyzing the relationships
between documents based on hyperlinks or citations. This analysis is commonly used in web
search engines and academic citation networks to understand the structure and authority of
documents, as well as to improve search relevance. Here's how link analysis works in IR:
1. Hyperlink Structure: In web search engines, documents (web pages) are connected through
hyperlinks. Each hyperlink from one document to another represents a relationship or reference
between the two documents.
2. PageRank Algorithm: One of the most famous link analysis algorithms is PageRank,
developed by Larry Page and Sergey Brin at Google. PageRank assigns a numerical weight to
each document in a hyperlinked collection of documents (such as the World Wide Web), with the
purpose of measuring its relative importance within the network. The algorithm considers both
the quantity and quality of inbound links to a page when determining its importance.
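The core of PageRank can be sketched as a power iteration over a link graph: each page repeatedly shares its rank equally among the pages it links to. The three-page graph below is invented for illustration; 0.85 is the commonly cited damping factor:

```python
# Power-iteration sketch of PageRank over a tiny hypothetical graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d, n = 0.85, len(pages)          # d = damping factor
rank = {p: 1.0 / n for p in pages}

for _ in range(50):              # iterate until scores stabilize
    new = {}
    for p in pages:
        # A page's rank is divided equally among its outgoing links.
        incoming = sum(rank[q] / len(links[q])
                       for q in pages if p in links[q])
        new[p] = (1 - d) / n + d * incoming
    rank = new

# C receives links from both A and B, so it ends up most important.
print(max(rank, key=rank.get))
```

This sketch ignores dangling pages (pages with no outgoing links), which a production implementation must handle.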
3. HITS Algorithm: Another notable link analysis algorithm is the HITS (Hyperlink-Induced
Topic Search) algorithm. HITS identifies authoritative pages and hubs within a network of
hyperlinked documents. Authority pages are those that are highly relevant to a given topic, while
hub pages are those that point to many authority pages. HITS iteratively computes authority and
hub scores for each document based on the links pointing to and from it.
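One way to sketch the HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, with normalization each round. The link graph is invented for illustration:

```python
# HITS iteration sketch over a hypothetical link graph.
import math

links = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"], "D": []}
pages = list(links)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority update: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    norm = math.sqrt(sum(v * v for v in auth.values()))
    auth = {p: v / norm for p, v in auth.items()}
    # Hub update: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm = math.sqrt(sum(v * v for v in hub.values()))
    hub = {p: v / norm for p, v in hub.items()}

# D is pointed to by A, B and C, so it is the strongest authority;
# A and B point to both authorities, so they are the strongest hubs.
print(max(auth, key=auth.get), max(hub, key=hub.get))
```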
4. Anchor Text Analysis: Analyzing the anchor text of hyperlinks provides additional context
about the content of the linked documents. Search engines may use anchor text information to
improve the relevance of search results by matching query terms with anchor text in links.
5. Citation Analysis: In academic information retrieval, citation analysis examines the network of
citations between academic papers. By analyzing citation patterns, researchers can identify
influential papers, measure scholarly impact, and discover relationships between research topics.
Link analysis techniques play a crucial role in understanding the structure of interconnected
document collections, identifying authoritative sources of information, and improving the
relevance of search results in information retrieval systems.
Page Rank Algorithm