Web Scraping
Contents…
1. Definition
2. Prerequisites for web scraping
3. Python Basics
4. Regular Expressions Basics
5. What are we going to do?
6. Methods for doing web scraping
7. Sockets
8. Urllib
9. BeautifulSoup
10. Advantages
11. Disadvantages
12. Conclusion
Web scraping (also called screen scraping, web data extraction, or
web harvesting) is a technique for extracting large amounts of data
from websites and saving it to a local file on your computer or to a
database in table (spreadsheet) format.
Sample Program:
print('Hello world!')
Hello world!
Control Structures: (Python Basics)
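A minimal sketch of the three common control structures: if/elif/else, for, and while. The function name classify and the sample numbers are illustrative assumptions.

```python
# Illustrative sketch; classify() and the sample values are assumptions.
def classify(n):
    # if / elif / else runs exactly one of its branches
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

# for loop: visit each item of a sequence in order
labels = []
for n in [-2, 0, 5]:
    labels.append(classify(n))
print(labels)  # ['negative', 'zero', 'positive']

# while loop: repeat until the condition becomes false
countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1
print(countdown)  # [3, 2, 1]
```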
List Operations
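A short sketch of the basic list operations: indexing, slicing, item assignment, concatenation, repetition, and membership tests. The list contents are made up for illustration.

```python
# Illustrative values; none of them come from the original slides.
words = ["Hello", "world", "!"]

first = words[0]               # indexing starts at 0
last = words[-1]               # negative indexes count from the end
words[1] = "Python"            # lists are mutable: replace an item in place
first_two = words[0:2]         # slicing returns a new list
combined = words + ["again"]   # + concatenates into a new list
zeros = [0] * 3                # * repeats a list
found = "Python" in words      # membership test
missing = "world" in words     # False, because "world" was replaced above

print(first, last, first_two, combined, zeros, found, missing)
```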
List Functions:
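A sketch of some frequently used built-in functions and list methods (len, max, min, append, insert, index, count, remove, reverse). The numbers are arbitrary sample data.

```python
# Arbitrary sample data for illustration.
nums = [1, 3, 5]

nums.append(7)            # add to the end        -> [1, 3, 5, 7]
nums.insert(0, 0)         # insert at position 0  -> [0, 1, 3, 5, 7]
length = len(nums)        # number of items
position = nums.index(5)  # index of the first occurrence of 5
largest = max(nums)
smallest = min(nums)
threes = nums.count(3)    # how many times 3 appears

nums.remove(3)            # remove the first 3    -> [0, 1, 5, 7]
nums.reverse()            # reverse in place      -> [7, 5, 1, 0]

print(length, position, largest, smallest, threes, nums)
```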
Dictionaries:
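A minimal sketch of how a dictionary maps keys to values: creating one, looking up a key, adding or updating entries, and what happens when a key is missing. The names and ages are invented sample data.

```python
# Invented sample data for illustration.
ages = {"Dave": 24, "Mary": 42}

mary_age = ages["Mary"]   # look up a value by its key
ages["John"] = 58         # assigning to a new key adds an entry
ages["Dave"] = 25         # assigning to an existing key updates it

# Looking up a key that does not exist raises KeyError.
try:
    ages["Nobody"]
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False

print(mary_age, ages, lookup_failed)
```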
Dictionary Functions:
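A sketch of common dictionary operations (in, get, keys, items), shown on a word count, which is a typical step after extracting text from a page. The sample sentence is an assumption.

```python
# The sample sentence is made up for illustration.
text = "the quick fox jumps over the lazy dog the end"

# get(key, default) returns the default instead of raising KeyError,
# which makes counting a one-liner.
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

has_cat = "cat" in counts            # membership tests look at the keys
cat_count = counts.get("cat")        # None when the key is missing
distinct_words = len(counts.keys())  # number of distinct words

# items() yields (key, value) pairs; pick the pair with the highest count.
top_word, top_count = max(counts.items(), key=lambda pair: pair[1])

print(counts["the"], has_cat, cat_count, distinct_words, top_word, top_count)
```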
Reading Files
The contents of a file that has been opened in text mode can be read using
the read method.
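A small sketch of the read method on a file opened in text mode. So that it runs anywhere, it first writes a temporary file; the file name sample.txt and its contents are assumptions.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("Line one\nLine two\n")

    # Mode "r" (the default) opens the file for reading in text mode.
    with open(path, "r") as f:
        contents = f.read()   # read() with no argument returns the whole file
        rest = f.read()       # the position is now at the end, so this is ""

    with open(path) as f:
        first_four = f.read(4)  # read(n) returns at most n characters

print(repr(contents), repr(rest), repr(first_four))
```

Using with closes the file automatically, even if an error occurs while reading.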
socket.gethostname()
Returns the hostname.
s.recv()
This method receives a TCP message.
s.send()
This method transmits a TCP message.
s.close()
This method closes the socket.
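The calls listed above can be tried without a network connection. The sketch below uses socket.socketpair(), which returns two already connected stream sockets on the local machine; that is an assumption made for the demonstration and stands in for a real TCP connection to a server.

```python
import socket

# Two connected stream sockets on the local machine (no network needed).
left, right = socket.socketpair()

left.sendall(b"hello")       # sendall() keeps calling send() until all bytes are sent
reply = right.recv(1024)     # receive up to 1024 bytes
left.close()                 # close the sending side

# Once the peer has closed, recv() returns an empty bytes object.
# Read loops use this as the signal to stop.
after_close = right.recv(1024)
right.close()

host = socket.gethostname()  # name of the local machine
print(reply, after_close, host)
```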
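Method-1: Using sockets
import socket

# Open a TCP connection to the web server on port 80 (HTTP)
my = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
my.connect(('data.pr4e.org', 80))

# HTTP lines end with \r\n; an empty line marks the end of the request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()  # string -> bytes
my.send(cmd)

# Read the response in chunks until the server closes the connection
while True:
    data = my.recv(307)
    if len(data) < 1:
        break
    print(data.decode(), end='')
my.close()

Output (HTTP response headers omitted):
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief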
A list of some important modules in Python Network/Internet programming.
The 2to3 tool will automatically adapt imports when converting your sources to
Python 3. Also note that the urllib.request.urlopen() function in Python 3 is
equivalent to urllib2.urlopen() and that urllib.urlopen() has been removed.
Code:
import urllib.request, urllib.parse, urllib.error

f = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in f:
    print(line.decode().strip())
Output:
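The code above imports urllib.parse and urllib.error but only uses urllib.request. As a hedged sketch of what the other two modules are for, the example below splits a URL into its parts and wraps urlopen() in error handling. The helper name fetch_lines and the 10 second timeout are assumptions, not part of the original program.

```python
import urllib.request
import urllib.parse
import urllib.error

def fetch_lines(url, timeout=10):
    """Return the decoded, stripped lines at url, or None if it cannot be opened.

    fetch_lines is a hypothetical helper written for this example.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return [line.decode().strip() for line in response]
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("Could not fetch", url, "->", err)
        return None

# urllib.parse splits a URL into its components.
parts = urllib.parse.urlparse("http://data.pr4e.org/romeo.txt")
print(parts.scheme, parts.netloc, parts.path)
```

Calling fetch_lines('http://data.pr4e.org/romeo.txt') would return the lines of the file as a list, or None if the request fails.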
The Advantages of Web Scraping
•Easy to implement – Once a web scraping service deploys the proper mechanism to extract data, you
get data not just from a single page but from the entire domain. This means that with just a one-time
investment, a lot of data can be collected.
•Low maintenance and speed – One aspect that is often overlooked when installing new services is the
maintenance cost. Long-term maintenance costs can cause the project budget to spiral out of control.
Thankfully, web scraping technologies need little to no maintenance over a long period. Another
characteristic worth mentioning is the speed with which web scraping services do their job:
a job that could take a person weeks is finished in a matter of hours.
•Accuracy – Web scraping services are not only fast, they are also accurate. Simple errors in data
extraction can cause major mistakes later on, so accurate extraction of any type of data is very
important. On websites that deal in pricing data, sales prices, real estate numbers, or any kind of financial
data, accuracy is extremely important.
The Disadvantages of Web Scraping
The major disadvantages of web scraping services are explained in the following points.
•Difficult to analyze – For anybody who is not an expert, the scraping processes are confusing to
understand. Although this is not a major problem, some errors could be fixed faster if the processes were
easier for more software developers to understand.
•Data analysis – The data that has been extracted first needs to be processed so that it can be easily
understood. In certain cases, this might take a long time and a lot of effort to complete.
•Time – It is common for new data extraction applications to take some time in the beginning, as the
software often has a learning curve. Sometimes web scraping services take time to become familiar with
the core application and need to adjust to the scraping language. This means that such services can take
some days before they are up and running at full speed.
•Speed and protection policies – Most web scraping services are slower than API calls. Another
problem is websites that do not allow screen scraping; in such cases web scraping services are
rendered useless. Also, if the developer of the website decides to introduce some changes in the code, the
scraping service might stop working.
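Many sites state which paths automated clients may visit in a robots.txt file. As a hedged sketch of how a scraper could check this before fetching a page, the example below uses the standard library's urllib.robotparser. The rules, the user agent name my-scraper, and the example.com URLs are all invented for illustration; a real scraper would load the site's own robots.txt instead.

```python
import urllib.robotparser

# Invented robots.txt rules for illustration: everything is allowed
# except paths under /private/.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)  # parse() accepts the lines of a robots.txt file

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_public = parser.can_fetch("my-scraper", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://example.com/private/data.html")

print(allowed_public, allowed_private)  # True False
```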
With a proxy, we cannot do web scraping.