Web Scraping

This document provides an overview of web scraping, including definitions, prerequisites in Python, methods for web scraping, and libraries like BeautifulSoup that can be used. Web scraping is a technique to extract large amounts of unstructured data from websites and transform it into structured data like databases or spreadsheets. It discusses Python basics needed like lists, dictionaries, files and regular expressions. It also covers techniques like using sockets and libraries like urllib and BeautifulSoup to extract and parse data from websites. Advantages include low cost and ease of implementation while disadvantages include difficulty analyzing extracted data and potential speed issues.


WEB SCRAPING

Contents

1. Definition
2. Prerequisites for Web Scraping
3. Python Basics
4. Regular Expressions Basics
5. What Are We Going to Do?
6. Methods for Web Scraping
7. Sockets
8. Urllib
9. BeautifulSoup
10. Advantages
11. Disadvantages
12. Conclusion
Web Scraping (also termed Screen Scraping, Web Data Extraction, or Web Harvesting) is a technique used to extract large amounts of data from websites; the extracted data is saved to a local file on your computer or to a database in table (spreadsheet) format.

This technique mostly focuses on transforming unstructured data on the web (HTML format) into structured data (a database or spreadsheet).
Prerequisites for Web Scraping
Python:

Python is a high-level programming language with applications in numerous areas, including web programming, scripting, scientific computing, and artificial intelligence. Python is processed at runtime by the interpreter; there is no need to compile your program before executing it.

Sample Program:

print('Hello world!')
Hello world!
Control Structures (Python Basics)

Lists are another type of object in Python. They are used to store an indexed list of items.

A list is created using square brackets, with commas separating items. An item in the list can be accessed by using its index in square brackets.

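A minimal sketch of creating and indexing a list (the values are illustrative):

```python
# A list is created with square brackets; items are comma-separated.
words = ["Hello", "world", "!"]

# An item is accessed by its zero-based index in square brackets.
print(words[0])   # Hello
print(words[2])   # !
```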
List Operations

To check whether an item is in a list, the in operator can be used. It returns True if the item occurs one or more times in the list, and False if it doesn't.

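A short sketch of the in operator (example values are illustrative):

```python
nums = [10, 20, 30]

# "in" returns True if the item occurs in the list, False otherwise.
print(20 in nums)   # True
print(45 in nums)   # False
```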
List Functions:
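A brief sketch of some common built-in list functions and methods (the slide did not specify which ones; these are typical examples):

```python
nums = [1, 2, 3]
nums.append(4)          # add an item to the end: [1, 2, 3, 4]
nums.insert(0, 0)       # insert at index 0: [0, 1, 2, 3, 4]

print(len(nums))        # 5  (number of items)
print(max(nums))        # 4  (largest value)
print(min(nums))        # 0  (smallest value)
print(nums.index(3))    # 3  (position of the value 3)

nums.remove(0)          # remove the first occurrence of the value 0
print(nums)             # [1, 2, 3, 4]
```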
Dictionaries:

Dictionaries are data structures used to map arbitrary keys to values. Lists can be thought of as dictionaries with integer keys within a certain range. Dictionaries can be indexed in the same way as lists, using square brackets containing keys.

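A minimal sketch of creating and indexing a dictionary (names and ages are illustrative):

```python
# Keys map to values; indexing uses square brackets, like lists.
ages = {"Dave": 24, "Mary": 42}
print(ages["Dave"])    # 24

# Assigning to a new key adds an entry.
ages["Lucy"] = 8
print(ages["Lucy"])    # 8
```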
Dictionary Functions:
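A brief sketch of some common dictionary methods (the slide did not specify which ones; these are typical examples):

```python
ages = {"Dave": 24, "Mary": 42}

print(ages.get("Dave"))        # 24
print(ages.get("Tom", -1))     # -1  (default returned when the key is missing)
print("Mary" in ages)          # True (membership tests against the keys)
print(list(ages.keys()))       # ['Dave', 'Mary']
print(list(ages.values()))     # [24, 42]
```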
Reading Files

The contents of a file that has been opened in text mode can be read using
the read method.

This will print all of the contents of the file "filename.txt".
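A minimal sketch; "filename.txt" is the placeholder name from the slide, and the file is created first so the example is self-contained:

```python
# Create a small file so there is something to read.
with open("filename.txt", "w") as f:
    f.write("Line one\nLine two\n")

# Open in text mode (the default) and read the whole contents.
with open("filename.txt") as f:
    contents = f.read()

print(contents)
```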
Regular Expressions

Regular expressions are a powerful tool for various kinds of string manipulation.
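A minimal sketch using re.match, which checks for a match only at the beginning of a string (the pattern and test strings are illustrative):

```python
import re

pattern = r"spam"

# "spamspamspam" starts with "spam", so re.match succeeds.
if re.match(pattern, "spamspamspam"):
    print("Match")

# "eggspam" does not start with "spam", so re.match returns None.
if not re.match(pattern, "eggspam"):
    print("No match")
```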
Metacharacters
.   any character except a newline
^   start of the string
$   end of the string
*   zero or more repetitions
+   one or more repetitions
?   zero or one repetition
{}  a specific number (or range) of repetitions
[]  a character class
\   escapes a metacharacter
|   alternation (or)
()  grouping
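A short sketch combining several of these metacharacters; the pattern is a deliberately simplified email shape, not a real email validator:

```python
import re

# ^ and $ anchor the match, [] is a character class,
# + means one or more, and \. escapes the dot metacharacter.
pattern = r"^[a-z]+@[a-z]+\.[a-z]+$"

print(bool(re.match(pattern, "user@example.com")))  # True
print(bool(re.match(pattern, "not an email")))      # False
```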
Our Task:
Webpage:
Sockets:

Sockets are the endpoints of a bidirectional communications channel. Sockets may communicate within a process, between processes on the same machine, or between processes on different continents.

Domain: the family of protocols that is used as the transport mechanism. These values are constants such as AF_INET, PF_INET, PF_UNIX, PF_X25, and so on.

s = socket.socket (socket_family, socket_type, protocol=0)

•socket_family − This is either AF_UNIX or AF_INET, as explained earlier.

•socket_type − This is either SOCK_STREAM or SOCK_DGRAM.

•protocol − This is usually left out, defaulting to 0.


General Socket Methods:

s.connect()
This method actively initiates a TCP server connection.

socket.gethostname()
Returns the hostname of the local machine.

s.recv()
This method receives a TCP message.

s.send()
This method transmits a TCP message.

s.close()
This method closes the socket.
Method-1: Using sockets

import socket

my = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
my.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()  # string -> bytes
my.send(cmd)

while True:
    data = my.recv(512)
    if len(data) < 1:
        break
    print(data.decode())
my.close()

Output:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
A list of some important modules in Python network/Internet programming:

Protocol  Common function      Port  Python module
HTTP      Web pages            80    httplib, urllib, xmlrpclib
NNTP      Usenet news          119   nntplib
FTP       File transfers       20    ftplib, urllib
SMTP      Sending email        25    smtplib
POP3      Fetching email       110   poplib
IMAP4     Fetching email       143   imaplib
Telnet    Command lines        23    telnetlib
Gopher    Document transfers   70    gopherlib, urllib


Urllib:

The urllib module has been split into parts and renamed in Python 3 to urllib.request, urllib.parse, and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.

Also note that the urllib.request.urlopen() function in Python 3 is equivalent to urllib2.urlopen(), and that urllib.urlopen() has been removed.
Code:

import urllib.request, urllib.parse, urllib.error

f = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in f:
    print(line.decode().strip())

Output:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter-')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Output: the value of the href attribute of each anchor (a) tag on the page.
Uses:

For Research
Data is an integral part of any research, be it academic, marketing, or scientific. A web scraper will help you gather structured data from multiple sources on the Internet with ease.

For Businesses / eCommerce: Market Analysis, Price Comparison
Companies offering products or services in a specific domain need comprehensive data on the similar products and services that appear in the market every day. Web scraper software can be used to keep a constant watch on this data, letting you get all the required information from a variety of sources at the click of a button.

For Marketing: Lead Generation
A web scraper can be used to gather contact details of businesses or individuals from websites like yellowpages.com or linkedin.com. Details like email address, phone, and website URL can be easily extracted using a web scraper.
The Advantages of Web Scraping

The major advantages of web scraping services are explained in the following points.

•Inexpensive: Web scraping services provide an essential service at a low cost. Data must be collected from websites and analyzed regularly, and web scraping services do the job in an efficient and budget-friendly manner.

•Easy to implement: Once a web scraping service deploys the proper mechanism to extract data, you are assured of getting data not just from a single page but from the entire domain. This means that with a one-time investment, a lot of data can be collected.

•Low maintenance and speed: One aspect that is often overlooked when installing new services is the maintenance cost. Long-term maintenance costs can cause a project budget to spiral out of control. Thankfully, web scraping technologies need little to no maintenance over a long period. Also worth mentioning is the speed with which web scraping services do their job: a job that could take a person weeks is finished in a matter of hours.

•Accuracy: Web scraping services are not only fast, they are accurate too. Simple errors in data extraction can cause major mistakes later on, so accurate extraction of any type of data is very important. On websites that deal in pricing data, sales prices, real estate numbers, or any kind of financial data, accuracy is extremely important.
The Disadvantages of Web Scraping

The major disadvantages of web scraping services are explained in the following points.

•Difficult to analyze: For anybody who is not an expert, the scraping processes are confusing to understand. This is not a major problem, but some errors could be fixed faster if the process were easier for more software developers to understand.

•Data analysis: The data that has been extracted will first need to be treated so that it can be easily understood. In certain cases, this might take a long time and a lot of energy to complete.

•Time: It is common for new data extraction applications to take some time in the beginning, as the software often has a learning curve. Web scraping services sometimes take time to become familiar with the core application and need to adjust to the scraping language. This means such services can take some days before they are up and running at full speed.

•Speed and protection policies: Most web scraping services are slower than API calls, and another problem is websites that do not allow screen scraping. In such cases, web scraping services are rendered useless. Also, if the developer of the website introduces changes in the code, the scraping service might stop working.
Note that websites protected by proxies or anti-scraping measures may make web scraping impossible.
