
Python Data Processing

Python - Data Operations


• Python handles data of various formats mainly through two libraries, Pandas
and NumPy.
• Data Operations in NumPy
– The most important object defined in NumPy is an N-dimensional array type called ndarray.
– It describes a collection of items of the same type.
– Items in the collection can be accessed using a zero-based index.
– An instance of the ndarray class can be constructed by different array creation routines.
– The basic ndarray is created using the array function in NumPy as follows −
numpy.array
– Following are some examples of NumPy data handling.
– Example 1
# more than one dimensions
import numpy as np
a = np.array([[1, 2], [3, 4]])
print (a)
The output is as follows −
[[1 2]
 [3 4]]
• Example 2
# minimum dimensions
import numpy as np
a = np.array([1, 2, 3,4,5], ndmin = 2)
print (a)
The output is as follows −
[[1 2 3 4 5]]
• Example 3
# dtype parameter
import numpy as np
a = np.array([1, 2, 3], dtype = complex)
print (a)
The output is as follows −
[1.+0.j 2.+0.j 3.+0.j]
Data Operations in Pandas
• Pandas handles data through Series, DataFrame, and Panel. We will see some examples from each
of these.
• Pandas Series
– Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python
objects, etc.).
– The axis labels are collectively called index. A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)
• Example
– Here we create a series from a Numpy Array.
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)
Its output is as follows −
0    a
1    b
2    c
3    d
dtype: object
Pandas DataFrame
• A Data frame is a two-dimensional data structure, i.e., data
is aligned in a tabular fashion in rows and columns.
• A pandas DataFrame can be created using the following
constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
• Let us now create an indexed DataFrame using a dict of lists.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)
• Its output is as follows −
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
Pandas Panel
• A panel is a 3D container of data.
• The term Panel data is derived from econometrics and is partially responsible for the name pandas
− pan(el)-da(ta)-s.
• Note that Panel was deprecated in pandas 0.20 and removed in pandas 1.0, so the example below runs
only on older pandas versions; a modern alternative is sketched after the example.
• A Panel can be created using the following constructor −
• pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
• In the below example we create a panel from a dict of DataFrame objects.
#creating an empty panel
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print (p)
• Its output is as follows −
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
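• Since Panel is gone from modern pandas, here is a minimal sketch of the usual replacement, assuming
pandas >= 1.0: the same dict of DataFrames combined into a single DataFrame with a two-level row index
via pd.concat.
# Panel-like 3D data on modern pandas
import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
# pd.concat over a dict builds a MultiIndex: level 0 plays the role of the
# items axis, level 1 the major axis
p = pd.concat(data)
print (p)
print (p.loc['Item1'])   # recover one "item" as a plain DataFrame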
Python - Data Cleansing
• Missing data is always a problem in real life
scenarios.
• Areas like machine learning and data mining
face severe issues in the accuracy of their
model predictions because of poor quality of
data caused by missing values.
• In these areas, missing value treatment is a
major point of focus to make their models
more accurate and valid.
When and Why Is Data Missed?
• Let us consider an online survey for a product.
• Many times, people do not share all the
information related to them.
• Some people share their experience but not how
long they have been using the product; others share
how long they have been using the product and their
experience, but not their contact information.
• Thus, one way or another, a part of the data is
always missing, and this is very common in real
life.
• Let us now see how we can handle missing values (say NA
or NaN) using Pandas.
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
• Because the values are randomly generated, the exact output varies from run to
run; the rows added by reindexing (b, d and g) are filled with NaN.
• Using reindexing, we have created a DataFrame with
missing values. In the output, NaN means Not a Number.
Check for Missing Values
• To make detecting missing values easier (and across
different array dtypes), Pandas provides
the isnull() and notnull() functions, which are also methods
on Series and DataFrame objects −
• Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].isnull())
• Its output is as follows −
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
Cleaning / Filling Missing Data
• Pandas provides various methods for cleaning the missing values.
• The fillna function can “fill in” NA values with non-null data in a couple of
ways, which we have illustrated in the following sections.
• Replace NaN with a Scalar Value
• The following program shows how you can replace "NaN" with "0".
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
print ("NaN replaced with '0':")
print (df.fillna(0))
• The exact numbers vary with the random data, but in the second printed frame
every NaN entry is replaced with 0.
Fill NA Forward and Backward
• Using the concepts of filling discussed in the
ReIndexing chapter, we will fill the missing
values. (In recent pandas versions, fillna(method='pad')
is deprecated in favour of df.ffill().)

• Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.fillna(method='pad'))
Drop Missing Values
• If you want to simply exclude the missing values, then
use the dropna function along with the axis argument.
• By default, axis=0, i.e., along row, which means that if
any value within a row is NA then the whole row is
excluded.
• Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.dropna())
Replace Missing (or) Generic Values
• Many times, we have to replace a generic value
with some specific value. We can achieve this by
applying the replace method.
• Replacing NA with a scalar value is equivalent to the
behavior of the fillna() function.
• Example
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
                   'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))
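• For this fixed input the output is deterministic −
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60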
Python - Processing CSV Data
• Reading data from CSV (comma-separated values)
files is a fundamental necessity in Data Science.
• Often, we get data from various sources which
can be exported to CSV format so that it can
be used by other systems.
• The Pandas library provides features using which
we can read the CSV file in full as well as in parts,
for only a selected group of columns and rows.
Input as CSV File
• The csv file is a text file in which the values in the
columns are separated by a comma.
• Let's consider the following data present in the file
named input.csv.
• You can create this file using Windows Notepad by
copying and pasting this data.
• Save the file as input.csv using the Save As "All Files (*.*)"
option in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
• The read_csv function of the pandas library is
used to read the content of a CSV file into the
Python environment as a pandas DataFrame.
• The function can read the files from the OS by
using the proper path to the file.
import pandas as pd
data = pd.read_csv('path/input.csv')
print (data)
Output
   id   name  salary  start_date        dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance
Reading Specific Rows
• The read_csv function of the pandas library can
also be used to read some specific rows for a
given column.
• We slice the result from the read_csv function
using the code shown below for first 5 rows for
the column named salary.
import pandas as pd
data = pd.read_csv('path/input.csv')
# Slice the result for first 5 rows
print (data[0:5]['salary'])
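• For the input.csv shown above, its output is as follows −
0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64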
Reading Specific Columns
• The read_csv function of the pandas library can also be
used to read some specific columns.
• We use the multi-axes indexing method called .loc[] for
this purpose. We choose to display the salary and
name column for all the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexing function
print (data.loc[:,['salary','name']])
• Its output is as follows −
   salary    name
0  623.30    Rick
1  515.20     Dan
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
7  722.50    Guru
Reading Specific Columns and Rows
• The read_csv function of the pandas library can
also be used to read some specific columns and
specific rows.
• We use the multi-axes indexing method
called .loc[] for this purpose.
• We choose to display the salary and name
column for some of the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexing function
print (data.loc[[1,3,5],['salary','name']])
• Its output is as follows −
   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi
Reading Specific Columns for a Range of Rows
• The read_csv function of the pandas library can also be
used to read some specific columns and a range of
rows.
• We use the multi-axes indexing method called .loc[] for
this purpose.
• We choose to display the salary and name column for
a range of the rows.
import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexing function
print (data.loc[2:6,['salary','name']])
• Its output is as follows −
   salary    name
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
Python - Processing JSON Data
• A JSON file stores data as text in a human-
readable format.
• JSON stands for JavaScript Object Notation.
• Pandas can read JSON files using
the read_json function.
Input Data
• Create a JSON file by copying the below data into a text
editor like Notepad. Save the file with the .json extension,
choosing the file type as "All Files (*.*)". (The names here
match the CSV example of the previous chapter, consistent
with the outputs shown below.)
{
  "ID": ["1","2","3","4","5","6","7","8"],
  "Name": ["Rick","Dan","Tusar","Ryan","Gary","Rasmi","Pranab","Guru"],
  "Salary": ["623.3","515.2","611","729","843.25","578","632.8","722.5"],
  "StartDate": ["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013","7/30/2013","6/17/2014"],
  "Dept": ["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
Read the JSON File
• The read_json function of the pandas library
can be used to read the JSON file into a
pandas DataFrame.
import pandas as pd
data = pd.read_json('path/input.json')
print (data)
• Its output is as follows −
         Dept ID    Name  Salary   StartDate
0          IT  1    Rick  623.30    1/1/2012
1  Operations  2     Dan  515.20   9/23/2013
2          IT  3   Tusar  611.00  11/15/2014
3          HR  4    Ryan  729.00   5/11/2014
4     Finance  5    Gary  843.25   3/27/2015
5          IT  6   Rasmi  578.00   5/21/2013
6  Operations  7  Pranab  632.80   7/30/2013
7     Finance  8    Guru  722.50   6/17/2014
Reading Specific Columns and Rows
• Similar to what we have already seen in the
previous chapter to read the CSV file,
the read_json function of the pandas library can
also be used to read some specific columns and
specific rows after the JSON file is read to a
DataFrame.
• We use the multi-axes indexing method
called .loc[] for this purpose. We choose to
display the Salary and Name columns for some of
the rows; note that the labels must match the
capitalisation used in the JSON file.
Example
import pandas as pd
data = pd.read_json('input.json')
# Use the multi-axes indexing function
print (data.loc[[1,3,5],['Salary','Name']])
• Its output is as follows −
   Salary   Name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi
Reading JSON file as Records
• We can also apply the to_json function, along with suitable parameters, to write the JSON
file content back out as individual records.
• import pandas as pd
• data = pd.read_json('path/input.json')
• print(data.to_json(orient='records', lines=True))

• {"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1\/1\/2012"}
{"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9\/23\/20
13"}
{"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11\/15\/2014"}
{"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5\/11\/2014"}
{"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3\/27\/201
5"} {"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5\/21\/2013"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7\/30\/
2013"}
{"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6\/17\/2014
"}
Python - Processing XLS Data
• Microsoft Excel is a very widely used spreadsheet
program.
• Its user friendliness and appealing features make
it a very frequently used tool in Data Science.
• The Pandas library provides features using which
we can read the Excel file in full as well as in parts,
for only a selected group of data.
• We can also read an Excel file with multiple
sheets in it.
• We use the read_excel function to read the data
from it.
Input as Excel File
• We create an Excel file with multiple sheets in the Windows OS. The data in the different sheets is as shown
below.
• You can create this file using the Excel program in the Windows OS. Save the file as input.xlsx (the name used by the code that follows).
• # Data in Sheet1

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

• # Data in Sheet2

id name zipcode
1 Rick 301224
2 Dan 341255
3 Tusar 297704
4 Ryan 216650
5 Gary 438700
6 Rasmi 665100
7 Pranab 341211
8 Guru 347480
Reading an Excel File
• The read_excel function of the pandas library is used to read the content of an Excel file into the
Python environment as a pandas DataFrame. The function can read the files from the OS by using
the proper path to the file. By default, the function will read Sheet1.
import pandas as pd
data = pd.read_excel('path/input.xlsx')
print (data)
• When we execute the above code, it produces the following result. Please note how an additional
column starting with zero as an index has been created by the function.
• id name salary start_date dept
• 0 1 Rick 623.30 2012-01-01 IT
• 1 2 Dan 515.20 2013-09-23 Operations
• 2 3 Tusar 611.00 2014-11-15 IT
• 3 4 Ryan 729.00 2014-05-11 HR
• 4 5 Gary 843.25 2015-03-27 Finance
• 5 6 Rasmi 578.00 2013-05-21 IT
• 6 7 Pranab 632.80 2013-07-30 Operations
• 7 8 Guru 722.50 2014-06-17 Finance
Reading Specific Columns and Rows
• Similar to what we have already seen in the previous chapter to read the
CSV file, the read_excel function of the pandas library can also be used to
read some specific columns and specific rows.
• We use the multi-axes indexing method called .loc[] for this purpose.
• We choose to display the salary and name column for some of the rows.
import pandas as pd
data = pd.read_excel('path/input.xlsx')
# Use the multi-axes indexing function
print (data.loc[[1,3,5],['salary','name']])
• When we execute the above code, it produces the following result.
• salary name
• 1 515.2 Dan
• 3 729.0 Ryan
• 5 578.0 Rasmi
Reading Multiple Excel Sheets
• Multiple sheets with different data formats can also be read by using the read_excel function with the help of a wrapper
class named ExcelFile.
• It will read the multiple sheets into memory only once. In the below example we read Sheet1 and Sheet2 into two
data frames and print them out individually.
• import pandas as pd
• with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
print("****Result Sheet 1****")
print (df1[0:5]['salary'])
print("")
print("***Result Sheet 2****")
print (df2[0:5]['zipcode'])
• When we execute the above code, it produces the following result.
****Result Sheet 1****
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64
***Result Sheet 2****
0 301224
1 341255
2 297704
3 216650
4 438700
Name: zipcode, dtype: int64
Python - Relational Databases
• We can connect to relational databases for
analysing data using the pandas library as well
as another additional library for implementing
database connectivity.
• This package is named sqlalchemy, and it provides
full SQL language functionality to be
used in Python.
Reading Relational Tables
• We will use SQLite as our relational database, as it is very lightweight and easy to use,
though the SQLAlchemy library can connect to a variety of relational sources, including MySQL,
Oracle, PostgreSQL and MSSQL.
• We first create a database engine and then write a DataFrame into the database using
the to_sql function.
• In the below example we create the relational table by using the to_sql function from a DataFrame
already created by reading a CSV file.
• Then we use the read_sql_query function from pandas to execute and capture the results of
various SQL queries.
from sqlalchemy import create_engine
import pandas as pd
data = pd.read_csv('/path/input.csv')
# Create the db engine
engine = create_engine('sqlite:///:memory:')
# Store the dataframe as a table
data.to_sql('data_table', engine)
# Query 1 on the relational table
res1 = pd.read_sql_query('SELECT * FROM data_table', engine)
print('Result 1')
print(res1)
print('')
# Query 2 on the relational table
res2 = pd.read_sql_query('SELECT dept,sum(salary) FROM data_table group by dept', engine)
print('Result 2')
print(res2)
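• For the input.csv shown earlier, the second result is deterministic; allowing for small formatting
differences between pandas versions, it resembles −
Result 2
         dept  sum(salary)
0     Finance      1565.75
1          HR       729.00
2          IT      1812.30
3  Operations      1148.00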
Inserting Data to Relational Tables
• We can also insert data into relational tables using the sql.execute function available in
pandas. (This helper was removed in recent pandas versions; a SQLAlchemy-based sketch follows the example.)
• In the below code we use the previous CSV file as the input data set, store it in a relational
table and then insert another record using sql.execute.
from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd
data = pd.read_csv(r'C:\Users\Kuldeep Singh\Desktop\input.csv')
engine = create_engine('sqlite:///:memory:')
# Store the Data in a relational table
data.to_sql('data_table', engine)
# Insert another row (the first value fills the auto-created index column)
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', engine,
            params=[(8, 9, 'Ruby', 711.20, '2015-03-27', 'IT')])
# Read from the relational table
res = pd.read_sql_query('SELECT ID,Dept,Name,Salary,start_date FROM data_table', engine)
print(res)
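• As a hedged alternative for newer pandas releases, where pandas.io.sql.execute is no longer
available, the same insert can be done directly with SQLAlchemy (a sketch assuming SQLAlchemy 1.4+;
the path and table name mirror the example above).
from sqlalchemy import create_engine, text
import pandas as pd
data = pd.read_csv('path/input.csv')
engine = create_engine('sqlite:///:memory:')
# index=False keeps the table to the five CSV columns
data.to_sql('data_table', engine, index=False)
# engine.begin() opens a transaction and commits on exit
with engine.begin() as conn:
    conn.execute(
        text('INSERT INTO data_table VALUES (:id, :name, :salary, :sd, :dept)'),
        {'id': 9, 'name': 'Ruby', 'salary': 711.20, 'sd': '2015-03-27', 'dept': 'IT'})
res = pd.read_sql_query('SELECT * FROM data_table', engine)
print(res)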
Deleting Data from Relational Tables
• We can also delete data from relational tables using the sql.execute
function available in pandas.
• The below code deletes a row based on the given input condition.
from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd
data = pd.read_csv(r'C:\Users\Kuldeep Singh\Desktop\input.csv')
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine)
sql.execute('DELETE FROM data_table WHERE name = (?)', engine,
            params=[('Gary')])
res = pd.read_sql_query('SELECT ID,Dept,Name,Salary,start_date FROM data_table', engine)
print(res)
Python - NoSQL Databases
• As more and more data becomes available in
unstructured or semi-structured form, the need for managing
it through NoSQL databases increases.
• Python can interact with NoSQL databases in a
similar way as it interacts with relational databases.
• In this chapter we will use Python to interact with
MongoDB as a NoSQL database.
• In order to connect to MongoDB, Python uses a library
known as pymongo.
• This library enables Python to connect to MongoDB
using a db client. Once connected, we select the db
name to be used for various operations.
Inserting Data
• To insert data into MongoDB we use the insert_one() method, which is available in the database environment.
• First we connect to the db using the Python code shown below, and then we provide the document details in the form of a
series of key-value pairs.
# Import the python libraries
from pymongo import MongoClient
from pprint import pprint
# Choose the appropriate client
client = MongoClient()
# Connect to the test db
db=client.test
# Use the employee collection
employee = db.employee
employee_details = {
    'Name': 'Raj Kumar',
    'Address': 'Sears Street, NZ',
    'Age': '42'
}
# Use the insert method
result = employee.insert_one(employee_details)
# Query for the inserted document.
Queryresult = employee.find_one({'Age': '42'})
pprint(Queryresult)
Output
{'Address': 'Sears Street, NZ', 'Age': '42',
 'Name': 'Raj Kumar',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}
Updating Data
• Updating existing MongoDB data is similar to inserting.
• We use the update_one() method, which is native to MongoDB.
• In the below code we are replacing the existing record with new key-value pairs.
• Please note how we are using the condition criteria to decide which record to update.
# Import the python libraries
from pymongo import MongoClient
from pprint import pprint
# Choose the appropriate client
client = MongoClient()
# Connect to db
db=client.test
employee = db.employee
# Use the condition to choose the record
# and use the update method
db.employee.update_one(
    {"Age": '42'},
    {
        "$set": {
            "Name": "Srinidhi",
            "Age": '35',
            "Address": "New Omsk, WC"
        }
    }
)
Queryresult = employee.find_one({'Age':'35'})
pprint(Queryresult)
When we execute the above code, it
produces the following result.
• {'Address': 'New Omsk, WC', 'Age': '35',
 'Name': 'Srinidhi',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}
Deleting Data
• Deleting a record is also straightforward: we use the delete_one() method. Here
also we mention the condition which is used to choose the record to be deleted.
# Import the python libraries
from pymongo import MongoClient
from pprint import pprint
# Choose the appropriate client
client = MongoClient()
# Connect to db
db=client.test
employee = db.employee
# Use the condition to choose the record
# and use the delete method
db.employee.delete_one({"Age":'35'})
Queryresult = employee.find_one({'Age':'35'})
pprint(Queryresult)
• When we execute the above code, it produces the following result.
• None
So we see that the particular record does not exist in the db any more.
Python - Date and Time
• Often in data science we need analysis which is
based on temporal values.
• Python can handle the various formats of date
and time gracefully.
• The datetime library provides necessary methods
and functions to handle the following scenarios.
– Date Time Representation
– Date Time Arithmetic
– Date Time Comparison
Date Time Representation
• A date and its various parts are represented by using different datetime
functions.
• Also, there are format specifiers which play a role in displaying the
alphabetical parts of a date like name of the month or week day.
• The following code shows today's date and various parts of the date.
import datetime
print ('The Date Today is :', datetime.datetime.today())
date_today = datetime.date.today()
print (date_today)
print ('This Year :', date_today.year)
print ('This Month :', date_today.month)
print ('Month Name:',date_today.strftime('%B'))
print ('This Day of Month :', date_today.day)
print ('Week Day Name:',date_today.strftime('%A'))
Date Time Arithmetic
• For calculations involving dates we store the various dates into variables and apply the relevant
mathematical operator to these variables.
import datetime
#Capture the First Date
day1 = datetime.date(2018, 2, 12)
print ('day1:', day1.ctime())
# Capture the Second Date
day2 = datetime.date(2017, 8, 18)
print ('day2:', day2.ctime())
# Find the difference between the dates
print ('Number of Days:', day1-day2)
date_today = datetime.date.today()
# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)
# Use Delta for Past Date
before_four_days = date_today - no_of_days
print ('Before Four Days:', before_four_days )
# Use Delta for future Date
after_four_days = date_today + no_of_days
print ('After Four Days:', after_four_days )
Date Time Comparison
• Date and time are compared using logical operators. But we must be careful in comparing the right
parts of the dates with each other. In the below examples we take the future and past dates and
compare them using the python if clause along with logical operators.
import datetime
date_today = datetime.date.today()
print ('Today is: ', date_today)
# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)
# Use Delta for Past Date
before_four_days = date_today - no_of_days
print ('Before Four Days:', before_four_days )
after_four_days = date_today + no_of_days
date1 = datetime.date(2018,4,4)
print ('date1:',date1)
if date1 == before_four_days:
    print ('Same Dates')
if date_today > date1:
    print ('Past Date')
if date1 < after_four_days:
    print ('Future Date')
Python - Data Wrangling
• Data wrangling involves processing the data in
various ways, such as merging, grouping and
concatenating, for the purpose of
analysing it or getting it ready to be used
with another set of data.
• Python has built-in features to apply these
wrangling methods to various data sets to
achieve the analytical goal.
Merging Data
• The Pandas library in python provides a single function, merge, as the entry point
for all standard database join operations between DataFrame objects −
• pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
• Let us now create two different DataFrames and perform a merging operation
on them; the merge call itself is shown after the example below.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print (left)
print (right)
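• The code above only builds the two frames; as a minimal sketch using them, an inner join on the
id column looks like this −
print (pd.merge(left, right, on='id'))
• Since both frames share the id values 1 to 5, every row pairs up, and the overlapping Name and
subject_id columns get the suffixes _x and _y −
   id  Name_x subject_id_x Name_y subject_id_y
0   1    Alex         sub1  Billy         sub2
1   2     Amy         sub2  Brian         sub4
2   3   Allen         sub4   Bran         sub3
3   4   Alice         sub6  Bryce         sub6
4   5  Ayoung         sub5  Betty         sub5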
Grouping Data
• Grouping data sets is a frequent need in data analysis, where we need the
result in terms of various groups present in the data set. Pandas has in-
built methods which can roll the data into various groups.
• In the below example we group the data by year and then get the result
for a specific year.
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print (grouped.get_group(2014))
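• With a recent pandas version (which keeps the dict's column order), the output resembles −
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701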
Concatenating Data
• Pandas provides various facilities for easily combining together Series, DataFrame,
and Panel objects.
• In the below example the concat function performs concatenation operations
along an axis.
• Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print (pd.concat([one,two]))
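• The concatenated frame keeps both sets of row labels; with a recent pandas version the output resembles −
     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88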
Python - Data Aggregation
• Python has several methods available to
perform aggregations on data.
• This is done using the pandas and numpy
libraries.
• The data must be available as, or converted to, a
dataframe before the aggregation functions can be applied.
Applying Aggregations on DataFrame
• Let us create a DataFrame and apply aggregations on it.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print (df)
r = df.rolling(window=3,min_periods=1)
print (r)

• We can aggregate by passing a function to the entire
DataFrame, or select a column via the standard get item
method.
Apply Aggregation on a Whole
Dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print (df)
r = df.rolling(window=3,min_periods=1)
print (r.aggregate(np.sum))
Apply Aggregation on a Single Column
of a Dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print (df)
r = df.rolling(window=3,min_periods=1)
print (r['A'].aggregate(np.sum))
Apply Aggregation on Multiple
Columns of a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print (df)
r = df.rolling(window=3,min_periods=1)
print (r[['A','B']].aggregate(np.sum))
Python - Reading HTML Pages
• We can read HTML pages using a
library known as beautifulsoup. Using this
library, we can search for the values of html
tags and get specific data like the title of the page
and the list of headers in the page.
Reading the HTML file
• In the below example we make a request to a
URL to be loaded into the Python environment.
We then use the html.parser parameter to read
the entire html file. Next, we print the title and the
first few headers of the html page.
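• The original slide omits the code, so the following is a minimal sketch, assuming the requests and
beautifulsoup4 packages are installed and using an illustrative URL.
import requests
from bs4 import BeautifulSoup
# Load the page into the Python environment (example URL, replace as needed)
response = requests.get('https://www.python.org')
# Parse the entire html file with the built-in html.parser
soup = BeautifulSoup(response.text, 'html.parser')
# Title of the page
print (soup.title.string)
# List the first few headers in the page
for header in soup.find_all(['h1', 'h2'])[:5]:
    print (header.get_text(strip=True))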
Python - Processing Unstructured Data
• The data that is already present in a row and column format or
which can be easily converted to rows and columns so that later it
can fit nicely into a database is known as structured data.
• Examples are CSV, TXT, XLS files etc.
• These files have a delimiter and either fixed or variable width where
the missing values are represented as blanks in between the
delimiters.
• But sometimes we get data where the lines are not of fixed width, or
the files are just HTML, image or pdf files.
• Such data is known as unstructured data. While an HTML file can
be handled by processing its HTML tags, a feed from Twitter or a
plain text document from a news feed has neither a
delimiter nor tags to handle.
• In such scenarios we use different built-in functions from various
Python libraries to process the file.
Reading Data
• In the below example we take a text file and read the file segregating each
of the lines in it.
• Next we can divide the output into further lines and words.
• The original file is a text file containing some paragraphs describing the
python language.
filename = r'C:\Users\Kuldeep Singh\Desktop\python.txt'
with open(filename) as fn:
    # Read each line
    ln = fn.readline()
    # Keep count of lines
    lncnt = 1
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1
Counting Word Frequency
from collections import Counter
with open(r'C:\Users\Kuldeep Singh\Desktop\python.txt') as f:
    p = Counter(f.read().split())
print(p)
Python - Word Tokenization
• Word tokenization is the process of splitting a
large sample of text into words.
• This is a requirement in natural language
processing tasks where each word needs to be
captured and subjected to further analysis like
classifying and counting them for a particular
sentiment etc.
• The Natural Language Toolkit (NLTK) is a library
used to achieve this.
• Install NLTK before proceeding with the Python
program for word tokenization.
• Next we use the word_tokenize method to
split the paragraph into individual words.
import nltk
# requires: nltk.download('punkt')
word_data = ("It originated from the idea that there are readers who prefer "
             "learning new skills from the comforts of their drawing rooms")
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)
Python - Stemming and Lemmatization
• In the areas of Natural Language Processing we come across situation where two or more words
have a common root.
• For example, the three words - agreed, agreeing and agreeable have the same root word agree.
• A search involving any of these words should treat them as the same word which is the root word.
• So it becomes essential to link all the words into their root word.
• The NLTK library has methods to do this linking and give the output showing the root word.
• The below program uses the Porter Stemming Algorithm for stemming.
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the root of each word
for w in nltk_tokens:
    print ("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))
Below, let's go over some of the
leading Python libraries in the field.
• Matplotlib: for data visualization.
• NumPy: for scientific computing with n-dimensional arrays.
• SciPy: builds on NumPy to provide more mathematical functions.
• Pandas: provides fast, flexible, and expressive data structures
designed to make working with "relational" or "labeled" data both
easy and intuitive.
• PyTorch: a machine learning library based on the Torch library, used
for applications such as computer vision and natural language
processing.
• Seaborn: a data visualization library based on matplotlib.
• Scikit-Learn: for data analysis and data mining.
• PySpark: for processing structured and semi-structured datasets;
it can process the data by making use of SQL as well as HiveQL.
