
11Python Reading HTML Pages

Uploaded by

David Osei
Copyright
© © All Rights Reserved

PYTHON - READING HTML PAGES

https://www.tutorialspoint.com/python/python_reading_html_pages.htm Copyright © tutorialspoint.com

HTML pages can be read in Python with a library known as BeautifulSoup. Using this library, we can search for the values of HTML tags and get specific data,
such as the title of the page and the list of headers in the page.

Install BeautifulSoup
Use the Anaconda package manager to install the required package and its dependencies.

conda install beautifulsoup4
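
As a quick check that the installation worked, the package can be imported and its version printed. This is only a minimal sketch; it assumes the package was installed into the same environment that runs the script.

```python
# Confirm that the bs4 package (BeautifulSoup) can be imported.
import bs4

# __version__ holds the installed release as a string.
installed_version = bs4.__version__
print("BeautifulSoup version:", installed_version)
```

If the import fails with ModuleNotFoundError, the package is most likely installed in a different environment than the one running the script.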

Reading the HTML file


In the example below, we request a URL and load its contents into the Python environment. We then pass the html.parser
argument to BeautifulSoup to parse the entire HTML file. Finally, we print the first few characters of the formatted page.

# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the html file
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first 225 characters
print(strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
<!--<![endif]-->
<head>
<!-- Basic -->
<meta charset="utf-8"/>
<title>
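
The same parse-and-format steps can be tried without a network connection. The sketch below is illustrative only: the HTML string is a made-up stand-in for a downloaded page, not the actual content of the tutorial URL.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the tutorial's real HTML).
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the document as an indented string, one tag per line.
pretty = soup.prettify()

# Keep only the first three lines of the formatted output.
first_lines = pretty.splitlines()[:3]
print("\n".join(first_lines))
```

Slicing by lines instead of by characters avoids cutting the output in the middle of a tag.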

Extracting Tag Value


We can extract the value of the first instance of a tag using the following code.

# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

soup = BeautifulSoup(html_doc, 'html.parser')

# The complete first <title> element, including its tags
print(soup.title)
# Only the text inside the first <title>, <a> and <b> elements
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

When we execute the above code, it produces the following result. The first print statement also prints the complete title element, which is not shown here.

Python Overview
None
Python is Interpreted
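
The None in the output above comes from the .string attribute, which returns None when a tag does not contain exactly one text child. The sketch below illustrates one way this can happen, using a made-up HTML snippet; the structure of the first link on the actual page is an assumption here, not something the original states.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link holds an image plus text, so it has two children.
html_doc = """
<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/> Home</a>
<b>First bold</b>
<b>Second bold</b>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# <title> has a single text child, so .string returns that text.
title_text = soup.title.string

# <a> has more than one child, so .string returns None.
link_string = soup.a.string

# get_text() joins all text inside the tag, whatever its structure.
link_text = soup.a.get_text(strip=True)

# Attribute access such as soup.b returns only the first matching tag.
first_bold = soup.b.string

print(title_text, link_string, link_text, first_bold)
```

When a tag may contain nested elements, get_text() is usually the safer choice for reading its text.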

Extracting All Tags


We can extract the value of every instance of a tag using the following code.

# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all returns every <b> tag in document order
for x in soup.find_all('b'):
    print(x.string)

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
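
The same find_all call can also collect the list of headers in a page, as mentioned at the start of this chapter. The following is a minimal sketch on a made-up HTML snippet; the heading levels and text are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with headings and bold text.
html_doc = """
<html><body>
<h1>Main Heading</h1>
<h2>Section One</h2>
<p><b>Bold A</b> and <b>Bold B</b></p>
<h2>Section Two</h2>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <b> tag into a list instead of printing it.
bold_values = [tag.get_text(strip=True) for tag in soup.find_all("b")]

# Passing a list of names matches any of them, in document order.
headers = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(["h1", "h2"])
]

print(bold_values)
print(headers)
```

Storing the results in lists makes them easy to filter or count later, which printing inside the loop does not allow.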
