Web Scraping

This document provides an overview of web scraping, including definitions, prerequisites in Python, methods for web scraping, and libraries like BeautifulSoup that can be used. Web scraping is a technique to extract large amounts of unstructured data from websites and transform it into structured data like databases or spreadsheets. It discusses Python basics needed like lists, dictionaries, files and regular expressions. It also covers techniques like using sockets and libraries like urllib and BeautifulSoup to extract and parse data from websites. Advantages include low cost and ease of implementation while disadvantages include difficulty analyzing extracted data and potential speed issues.

Uploaded by

sai rohith


WEB SCRAPING

Contents

1. Definition
2. Prerequisites for web scraping
3. Python Basics
4. Regular Expressions Basics
5. What we are going to do?
6. Methods for doing web scraping
7. Sockets
8. Urllib
9. BeautifulSoup
10. Advantages
11. Disadvantages
12. Conclusion

Definition:

Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.

The technique mostly focuses on transforming unstructured data on the web (HTML format) into structured data (a database or spreadsheet).

Prerequisites for web scraping

Python:

Python is a high-level programming language with applications in numerous areas, including web programming, scripting, scientific computing, and artificial intelligence. Python is processed at runtime by the interpreter, so there is no need to compile your program before executing it.

Sample Program:

print('Hello world!')

Output:

Hello world!

Python Basics: Lists

Lists are another type of object in Python. They are used to store an indexed sequence of items.

A list is created using square brackets, with commas separating the items.

A particular item in the list can be accessed by putting its index in square brackets. Indexing starts at 0.

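The points above can be sketched as follows. This is a minimal illustration; the variable name words and its contents are made up for the example.

```python
# Create a list with square brackets and commas
words = ["Hello", "world", "!"]

# Access items by index; the first item has index 0
print(words[0])   # Hello
print(words[2])   # !

# Lists are mutable: an item can be replaced through its index
words[1] = "Python"
print(words)      # ['Hello', 'Python', '!']
```
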
List Operations

To check if an item is in a list, the in operator can be used. It returns True if the item occurs one or more times in the list, and False if it doesn't.

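A short sketch of the in operator, using a made-up list nums. The not in form is added here for completeness and is not mentioned on the slide.

```python
nums = [1, 2, 3, 2]

print(2 in nums)       # True  (2 occurs twice, which still counts)
print(7 in nums)       # False
print(7 not in nums)   # True  (the negated form)
```
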
List Functions:
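A minimal sketch of a few commonly used list functions. Which functions to show is an assumption made for this example; the list nums is made up.

```python
nums = [1, 2, 3]

nums.append(4)                # add an item at the end
print(nums)                   # [1, 2, 3, 4]
print(len(nums))              # 4

nums.insert(0, 0)             # insert the value 0 at index 0
print(nums)                   # [0, 1, 2, 3, 4]

print(nums.index(3))          # 3  (position of the first occurrence of 3)
print(max(nums), min(nums))   # 4 0
```
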
Dictionaries:

Dictionaries are data structures used to map arbitrary keys to values.

Lists can be thought of as dictionaries with integer keys within a certain range.

Dictionaries can be indexed in the same way as lists, using square brackets containing keys.

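As a rough illustration of the description above, with hypothetical names and values:

```python
# Map names (keys) to ages (values)
ages = {"Dave": 24, "Mary": 42, "John": 58}

# Index with a key in square brackets, just like a list index
print(ages["Dave"])   # 24

# Assigning to a key updates it, or adds it if it is new
ages["Mary"] = 43
ages["Anna"] = 31
print(ages)   # {'Dave': 24, 'Mary': 43, 'John': 58, 'Anna': 31}
```
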
Dictionary Functions:
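A minimal sketch of a few commonly used dictionary functions. The selection is an assumption made for this example; the dictionary pairs is made up.

```python
pairs = {1: "apple", 2: "orange"}

# in checks the keys, not the values
print(1 in pairs)                  # True
print("apple" in pairs)            # False

# get returns the value for a key, or a default if the key is missing
print(pairs.get(2))                # orange
print(pairs.get(5))                # None
print(pairs.get(5, "not found"))   # not found

print(list(pairs.keys()))          # [1, 2]
```
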
Reading Files

The contents of a file that has been opened in text mode can be read using the read method. For example, opening "filename.txt" and printing the result of read prints all of the contents of the file.
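A self-contained sketch of reading a file. The sample file is created in a temporary directory first so that the example runs anywhere; its contents are made up.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "filename.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Open in text mode ("r" is the default) and read everything at once
with open(path, "r") as f:
    contents = f.read()

print(contents)
```

Using with closes the file automatically once the block ends.
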
Regular Expressions

Regular expressions are a powerful tool for various kinds of string manipulation.

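A short sketch using Python's built-in re module. The sample text and patterns are made up for the example.

```python
import re

text = "spam and eggs"

# re.match only matches at the start of the string
print(bool(re.match(r"spam", text)))    # True
print(bool(re.match(r"eggs", text)))    # False

# re.search finds a match anywhere in the string
print(bool(re.search(r"eggs", text)))   # True

# re.findall returns all non-overlapping matches as a list
words = re.findall(r"\w+", text)
print(words)                            # ['spam', 'and', 'eggs']

# re.sub replaces every match
replaced = re.sub(r"spam", "ham", text)
print(replaced)                         # ham and eggs
```
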
Metacharacters
.
^
$
*
+
?
{}
[]
\
|
()
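
The metacharacters listed above can be tried out one at a time. This is an illustrative sketch; all sample strings are made up.

```python
import re

# .  matches any single character except a newline
print(bool(re.search(r"gr.y", "grey")))          # True

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^gr.y$", "stingray")))    # False

# *  zero or more, + one or more, ? zero or one of the previous item
print(bool(re.fullmatch(r"ab*", "a")))           # True
print(bool(re.fullmatch(r"ab+", "a")))           # False
print(bool(re.fullmatch(r"colou?r", "color")))   # True

# {} gives an exact repetition count
print(bool(re.fullmatch(r"\d{3}", "123")))       # True

# [] is a character class, | is alternation, () groups a sub-pattern
m = re.search(r"(gr[ae]y|red) cat", "a grey cat")
print(m.group(1))                                # grey

# \ escapes a metacharacter so it is matched literally
dots = re.findall(r"\.", "a.b.c")
print(dots)                                      # ['.', '.']
```
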
Our Task:
Webpage:
Sockets:

Sockets are the endpoints of a bidirectional communications channel. Sockets may communicate within a process, between processes on the same machine, or between processes on different continents.

Domain: the family of protocols that is used as the transport mechanism. These values are constants such as AF_INET, PF_INET, PF_UNIX, PF_X25, and so on.

A socket is created with:

s = socket.socket(socket_family, socket_type, protocol=0)

•socket_family − This is either AF_UNIX or AF_INET.

•socket_type − This is either SOCK_STREAM (TCP) or SOCK_DGRAM (UDP).

•protocol − This is usually left out, defaulting to 0.

General Socket Methods:

s.connect()
This method actively initiates a TCP connection to the server.

socket.gethostname()
Returns the hostname of the local machine.

s.recv()
This method receives a TCP message.

s.send()
This method transmits a TCP message.

s.close()
This method closes the socket.
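
The send, recv and close calls can be tried without any network access. This sketch assumes socket.socketpair(), which is not part of the original slides; it returns two sockets that are already connected to each other inside the same process, so connect() is not needed here.

```python
import socket

# socketpair() returns two already-connected sockets in the same process
a, b = socket.socketpair()

a.send(b"hello")        # send() takes bytes, not str
data = b.recv(1024)     # receive up to 1024 bytes
print(data.decode())    # hello

a.close()
b.close()

# Name of the local machine (the value depends on where this runs)
print(socket.gethostname())
```
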
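Method-1: Using sockets

import socket

# Open a TCP connection to the web server on port 80 (HTTP)
my = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
my.connect(('data.pr4e.org', 80))

# HTTP lines end with \r\n, and a blank line ends the request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()  # string -> bytes
my.send(cmd)

# Read the response in chunks until the server closes the connection
while True:
    data = my.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
my.close()

Output (shown without the HTTP response headers that precede it):

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
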
A list of some important modules in Python network/Internet programming:

Protocol  Common function     Port No                 Python module
HTTP      Web pages           80                      httplib, urllib, xmlrpclib
NNTP      Usenet news         119                     nntplib
FTP       File transfers      20 (data), 21 (control) ftplib, urllib
SMTP      Sending email       25                      smtplib
POP3      Fetching email      110                     poplib
IMAP4     Fetching email      143                     imaplib
Telnet    Command lines       23                      telnetlib
Gopher    Document transfers  70                      gopherlib, urllib

These are the Python 2 module names. In Python 3, httplib became http.client and xmlrpclib became xmlrpc.client, and gopherlib is no longer part of the standard library.

Urllib:

In Python 3, the urllib module has been split into parts and renamed to urllib.request, urllib.parse, and urllib.error. The 2to3 tool can automatically adapt imports when converting your sources to Python 3.

Also note that the urllib.request.urlopen() function in Python 3 is equivalent to urllib2.urlopen() in Python 2, and that urllib.urlopen() has been removed.

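Of the three parts named above, urllib.parse works without network access, so it makes a convenient first experiment. This is a minimal sketch; the query values and the page1.htm and page2.htm names are made up for illustration.

```python
from urllib.parse import urlparse, urlencode, urljoin

# Split a URL into its components
parts = urlparse("http://data.pr4e.org/romeo.txt")
print(parts.scheme)   # http
print(parts.netloc)   # data.pr4e.org
print(parts.path)     # /romeo.txt

# Build a query string from a dictionary
query = urlencode({"q": "web scraping", "page": 2})
print(query)          # q=web+scraping&page=2

# Resolve a relative link against the page it was found on
full = urljoin("http://data.pr4e.org/page1.htm", "page2.htm")
print(full)           # http://data.pr4e.org/page2.htm
```

urljoin is useful when scraping links, because href values on a page are often relative.
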
Code:

import urllib.request

# Open the URL; the response can be iterated line by line like a file
f = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in f:
    print(line.decode().strip())  # bytes -> string, drop the trailing newline

Output:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Code:

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter-')

# Download the page and parse it with Python's built-in HTML parser
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all anchor tags and print the link target of each
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

Uses:

For Research
Data is an integral part of any research, be it academic, marketing, or scientific. A web scraper helps you gather data from multiple sources on the Internet with ease.

For Businesses / eCommerce: Market Analysis, Price Comparison
Companies offering products or services in a specific domain need comprehensive data on the similar products and services that appear in the market every day. Web scraper software can be used to keep a constant watch on this data, collecting the required information from a variety of sources.

For Marketing: Lead Generation
A web scraper can be used to gather contact details of businesses or individuals from websites like yellowpages.com or linkedin.com. Details like email address, phone number, and website URL can be easily extracted using a web scraper.

The Advantages of Web Scraping

The major advantages of web scraping services are explained in the following points.
•Inexpensive – Web scraping services provide an essential service at a low cost. Collecting data from websites and analyzing it matters to many organizations, and web scraping services do the job in an efficient and budget-friendly manner.

•Easy to implement – Once a web scraping service deploys the proper mechanism to extract data, you can collect data not only from a single page but from the entire domain. This means that with just a one-time investment, a lot of data can be collected.

•Low maintenance and speed – One aspect that is often overlooked when installing new services is the maintenance cost. Long-term maintenance costs can cause the project budget to spiral out of control. Web scraping technologies typically need little maintenance as long as the target website's structure does not change. Another characteristic worth mentioning is the speed with which web scraping services do their job: a job that could take a person a week is finished in a matter of hours.

•Accuracy – Web scraping services are not only fast, they are accurate too. Simple errors in data extraction can cause major mistakes later on, so accurate extraction of any type of data is very important. For websites that deal in pricing data, sales prices, real estate numbers, or any kind of financial data, accuracy is extremely important.

The Disadvantages of Web Scraping

The major disadvantages of web scraping services are explained in the following points.
•Difficult to analyze – For anybody who is not an expert, scraping processes are confusing to understand. Although this is not a major problem, some errors could be fixed faster if the process were easier for more software developers to understand.

•Data analysis – The extracted data first needs to be processed so that it can be easily understood. In certain cases, this can take a long time and a lot of effort.

•Time – New data extraction applications commonly take some time to get going, as the software often has a learning curve. Sometimes web scraping services take time to become familiar with the core application and need to adjust to the scraping language. This means that such services can take some days before they are up and running at full speed.

•Speed and protection policies – Most web scraping services are slower than API calls, and some websites do not allow screen scraping at all. In such cases web scraping services are rendered useless. Also, if the developer of the website changes the page's code, the scraping service might stop working.

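Note: behind a network proxy, the direct socket connection used in Method-1 may fail, because the request has to be routed through the proxy instead.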
