Saraswati
INFORMATICS PRACTICES
[A TEXTBOOK FOR CLASS XII]
By
New Saraswati House (India) Private Limited
New Delhi-110002 (INDIA)
Head Office : Second Floor, MGM Tower, 19 Ansari Road, Daryaganj, New Delhi–110 002 (India)
Registered Office : A-27, 2nd Floor, Mohan Co-operative Industrial Estate, New Delhi–110 044
Phone : +91-11-4355 6600
Fax : +91-11-4355 6688
E-mail : [email protected]
Website : www.saraswatihouse.com
CIN : U22110DL2013PTC262320
Import-Export Licence No. 0513086293
Branches:
• Ahmedabad: Ph. 079-2657 5018 • Bengaluru: Ph. 080-2675 6396
• Chennai: Ph. 044-2841 6531 • Bhubaneswar: Ph. +91-94370 05810
• Guwahati: Ph. 0361-2457 198 • Hyderabad: Ph. 040-4261 5566 • Jaipur: Ph. 0141-4006 022
• Jalandhar: Ph. 0181-4642 600, 4643 600 • Kochi: Ph. 0484-4033 369 • Kolkata: Ph. 033-4004 2314
• Lucknow: Ph. 0522-4062 517 • Mumbai: Ph. 022-2876 9871, 2873 7090
• Patna: Ph. 0612-2275 403 • Ranchi: Ph. 0651-2244 654 • Nagpur: Ph. +91 9371940224
ISBN: 978-93-53621-34-6
Jurisdiction: All disputes with respect to this publication shall be subject to the jurisdiction of the Courts, Tribunals and
Forums of New Delhi, India Only.
All rights reserved under the Copyright Act. No part of this publication may be reproduced, transcribed, transmitted,
stored in a retrieval system or translated into any language or computer, in any form or by any means, electronic,
mechanical, magnetic, optical, chemical, manual, photocopy or otherwise without the prior permission of the
copyright owner. Any person who does any unauthorised act in relation to this publication may be liable to criminal
prosecution and civil claims for damages.
This book is meant for educational and learning purposes. The author(s) of the book has/have taken all reasonable
care to ensure that the contents of the book do not violate any copyright or other intellectual property rights of any
person in any manner whatsoever. In the event the author(s) has/have been unable to track any source and if any
copyright has been inadvertently infringed, please notify the publisher in writing for any corrective action.
PRINTED IN INDIA
By Vikas Publishing House Private Limited, Plot 20/4, Site-IV, Industrial Area Sahibabad, Ghaziabad–201 010 and
published by New Saraswati House (India) Private Limited, 19 Ansari Road, Daryaganj, New Delhi–110 002 (India)
Preface
This Informatics Practices book is designed as per the new syllabus prescribed by CBSE for Class XII. The book covers advanced operations on the Pandas DataFrame, descriptive statistics, histograms, function application, etc. The NumPy array is used to handle multi-dimensional array objects. Software engineering, database management using MySQL, computing ethics and cyber safety are the other highlights of the book.
The salient features of the book are:
1. Easy-to-understand aggregation operations, descriptive statistics, and re-indexing columns in a DataFrame.
2. Learn to plot different graphs using a Pandas DataFrame.
3. Apply functions row-wise and element-wise on a DataFrame.
4. Detailed explanation of how the NumPy array is used for scientific computing applications including covariance, correlation and linear regression.
5. NumPy arrays discussed in detail to handle multi-dimensional array objects.
6. Understanding of basic software engineering: models, activities, business use-case diagrams, and version control systems.
7. Know-how to connect a Python program with a SQL database, and learn aggregation functions in SQL.
8. A clear understanding of cyber ethics and cybercrime; helps to understand the value of technology in societies, gender and disability issues, and the technology behind biometric IDs.
This new edition provides you with updated knowledge based on plenty of solved exercises and unsolved questions. It includes complete and easily understandable explanations of the commonly used features of the Python Pandas DataFrame, NumPy and MySQL. The questions and examples will help you practise these features. Many solved and unsolved examples are provided at the end of each chapter. The book also contains a CD with a folder called IPSource_XII, which includes chapter-wise solved examples and can be made available on demand.
Also, we would like to convey our sincere thanks to the dedicated team of New Saraswati House (India) Pvt. Ltd. for bringing out this book in an excellent form.
Suggestions for further improvement of the book will be gratefully acknowledged.
Phone: 011-42953418
Mobile No.: 9818588644
E-mail: [email protected]
SYLLABUS
UNIT–WISE MARKS (THEORY)

Unit No.  Unit Name                      Periods (Theory)  Periods (Practical)  Marks
1.        Data Handling - 2                     80                 70             30
2.        Basic Software Engineering            25                 10             15
3.        Data Management - 2                   20                 20             15
4.        Society, Law and Ethics - 2           15                  –             10
          Practicals                                                              30
          Total                                                                  100
Unit 1: Data Handling (DH-2)
Python Pandas
• Advanced operations on Data Frames: pivoting, sorting, and aggregation
• Descriptive statistics: min, max, mode, mean, count, sum, median, quartile, var
• Function application: pipe, apply, aggregation (group by), transform, and applymap.
• Reindexing, and altering labels.
NumPy
• 1D array, 2D array
• Write a minimal Django based web application that parses a GET and POST request, and writes
the fields to a file – flat file and CSV file.
• Interface Python with an SQL database
• SQL commands: aggregation functions, having, group by, order by.
Unit 4: Society, Law and Ethics (SLE-2)
• Intellectual property rights, plagiarism, digital rights management, and licensing (Creative
Commons, GPL and Apache), open source, open data, privacy.
• Privacy laws, fraud; cybercrime- phishing, illegal downloads, child pornography, scams; cyber
forensics, IT Act, 2000.
• Technology and society: understanding of societal issues and cultural changes induced by
technology.
• E-waste management: proper disposal of used electronic gadgets.
• Identity theft, unique ids, and biometrics.
• Gender and disability issues while teaching and using computers.
• Role of new media in society: online campaigns, crowdsourcing, smart mobs
• Issues with the internet: internet as an echo chamber, net neutrality, internet addiction
• Case studies - Arab Spring, WikiLeaks, Bitcoin
Practical
• Write a SQL query to order the (student ID, marks) table in descending order of the marks.
• Integrate SQL with Python by importing the MySQLdb module
• Write a Django based web server to parse a user request (POST), and write it to a CSV file.
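The database practicals above all follow Python's standard DB-API pattern: connect, get a cursor, execute SQL, fetch results. A minimal sketch of the ORDER BY task, using the built-in sqlite3 module purely for illustration (MySQLdb exposes the same cursor/execute/fetch interface with a different connect() call; the table and column names here are hypothetical):

```python
import sqlite3

# Build a hypothetical (student ID, marks) table in an in-memory database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (stu_id INTEGER, marks INTEGER)")
cur.executemany("INSERT INTO student VALUES (?, ?)",
                [(1, 78), (2, 91), (3, 85)])

# Order the (student ID, marks) table in descending order of marks
cur.execute("SELECT stu_id, marks FROM student ORDER BY marks DESC")
rows = cur.fetchall()
print(rows)
conn.close()
```

The same cursor-based flow applies once the connection object comes from a MySQL driver instead of sqlite3.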
Data handling using Python libraries
• Use map functions to convert all negative numbers in a DataFrame to the mean of all the numbers.
• Consider a DataFrame, where each row contains the item category, item name, and expenditure.
• Group the rows by the category, and print the total expenditure per category.
• Given a Series, print all the elements that are above the 75th percentile.
• Given a day’s worth of stock market data, aggregate it. Print the highest, lowest, and closing prices of each stock.
• Given sample data, plot a linear regression line.
• Take data from government websites, aggregate and summarize it. Then plot it using different plotting functions of the PyPlot library.
• Business use-case diagrams for an airline ticket booking system, train reservation system, stock
exchange
• Collaboratively write a program and manage the code with a version control system (GIT)
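Two of the Pandas tasks in the list above can be sketched in a few lines; the data values and column names here are made up for illustration only:

```python
import pandas as pd

# Given a Series, print all the elements that are above the 75th percentile
S = pd.Series([12, 45, 7, 88, 56, 23, 91, 64])
high = S[S > S.quantile(0.75)]   # quantile(0.75) is the 75th percentile
print(high)

# Group expenditure rows by category, and print the total per category
df = pd.DataFrame({'Category': ['Food', 'Travel', 'Food', 'Travel'],
                   'Item': ['Rice', 'Bus', 'Milk', 'Train'],
                   'Expenditure': [200, 50, 60, 300]})
totals = df.groupby('Category')['Expenditure'].sum()
print(totals)
```

Both tasks reduce to the same pattern: build a Boolean mask or a grouping key, then let Pandas do the filtering or aggregation.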
Project
The aim of the class project is to create something that is tangible and useful. This should be done in
groups of 2 to 3 students, and should be started by students at least 6 months before the submission
deadline. The aim here is to find a real world problem that is worthwhile to solve. Students are encouraged
to visit local businesses and ask them about the problems that they are facing. For example, if a business
is finding it hard to create invoices for filing GST claims, then students can do a project that takes the raw
data (list of transactions), groups the transactions by category, accounts for the GST tax rates, and creates
invoices in the appropriate format. Students can be extremely creative here. They can use a wide variety
of Python libraries to create user friendly applications such as games, software for their school, software for their disabled fellow students, and mobile applications. Of course, to do some of these projects, some additional learning is required; this should be encouraged. Students should know how to teach themselves.
If three people work on a project for 6 months, at least 500 lines of code is expected. The committee has
also been made aware about the degree of plagiarism in such projects. Teachers should take a very strict
look at this situation, and take very strict disciplinary action against students who are cheating on lab
assignments, or projects, or using pirated software to do the same. Everything that is proposed can be
achieved using absolutely free, and legitimate open source software.
CONTENTS
Unit 1: Data Handling - 2
Chapter–1 Review of Python Pandas .................................................................................. 1–40
1.1 Introduction ..................................................................................................................... 1
1.2 Pandas .............................................................................................................................. 1
1.3 Pandas Series ................................................................................................................... 2
1.4 Mathematical Operations on Series ............................................................................... 7
1.5 Vector Operations ............................................................................................................ 8
1.6 Comparison Operations on Series .................................................................................. 9
1.7 DataFrame ...................................................................................................................... 10
1.7.1 Creating DataFrame ........................................................................................... 11
1.7.2 Printing/Displaying DataFrame Data ............................................................... 13
1.7.3 Accessing and Slicing DataFrame ..................................................................... 16
1.8 Iterating Pandas DataFrame .......................................................................................... 24
1.9 Manipulating Pandas DataFrame Data ......................................................................... 25
1.9.1 Adding a Column to DataFrame ....................................................................... 25
1.9.2 Adding Rows into DataFrame ........................................................................... 28
1.9.3 Dropping Columns in DataFrame ..................................................................... 32
1.9.4 Dropping Rows in DataFrame ........................................................................... 36
Solved Exercises ...................................................................................................................... 91
Review Questions .................................................................................................................... 94
Chapter–4 Function Applications in Pandas ................................................................... 95–132
4.1 Introduction ................................................................................................................... 95
4.2 .pipe() Function .............................................................................................................. 95
4.3 .apply() Function ............................................................................................................ 97
4.3.1 Using Lambda Function .................................................................................. 100
4.4 Aggregation (groupby) ................................................................................................ 102
4.5 Data Transformation ................................................................................................... 114
4.6 .applymap() Function .................................................................................................. 118
4.7 Reindexing Pandas Dataframes .................................................................................. 120
4.8 Altering Labels or Changing Column/Row Names .................................................... 124
Points to Remember ............................................................................................................ 127
Solved Exercises ................................................................................................................... 128
Review Questions ................................................................................................................. 130
5.5 Finding the Shape and Size of the Array ................................................................... 148
Points to Remember ............................................................................................................ 150
Chapter–6 Indexing, Slicing and Arithmetic Operations in NumPy Array ................... 153–180
7.4 Linear Regression ........................................................................................................ 195
Points to Remember ............................................................................................................ 198
Solved Exercises ................................................................................................................... 198
Review Questions ................................................................................................................. 200
Chapter–8 Plotting with Pyplot ...................................................................................... 201–246
8.1 Introduction ................................................................................................................ 201
8.2 Introduction to Matplotlib ......................................................................................... 201
8.3 Matplotlib, Pylab and Pyplot ..................................................................................... 202
8.4 Creating and Showing Simple Line Plot ..................................................................... 202
8.5 Using Pyplot Methods ................................................................................................ 207
8.6 Formatting Graph Pyplot Attributes .......................................................................... 212
8.7 Changing Matplotlib Plot Figure Size and DPI .......................................................... 219
8.8 Plotting DataFrames ................................................................................................... 220
8.9 Matplotlib Plot Types ................................................................................................. 223
8.9.1 Plotting a Bar Plot ........................................................................................... 223
8.9.2 Plotting a Pie Plot ........................................................................................... 229
8.9.3 Histogram ....................................................................................................... 231
10.6.2 Spiral Model .................................................................................................... 270
10.7 Software Process Activities ........................................................................................ 272
10.8 Agile Methodology ...................................................................................................... 275
10.8.1 Pair Programming Methodology .................................................................... 277
10.8.2 Agile Scrum Methodology .............................................................................. 278
Points to Remember ............................................................................................................ 283
Solved Exercises ................................................................................................................... 283
Review Questions ................................................................................................................. 285
Chapter–11 Business Use-Case Diagrams ...................................................................... 287–300
11.1 Introduction ................................................................................................................ 287
11.2 Unified Modeling Language (UML) ............................................................................ 288
11.3 Use-case Diagram ........................................................................................................ 289
11.3.1 Relationships in Use-case Diagrams .............................................................. 291
11.3.2 Basic Principles to Draw Use-case Diagram .................................................. 294
11.4 Business Use-case Diagram ........................................................................................ 294
Points to Remember ............................................................................................................ 298
Solved Exercises ................................................................................................................... 298
14.1 Introduction ................................................................................................................ 329
14.2 SELECT Command ....................................................................................................... 329
14.2.1 Using SQL Clauses and Operators with SELECT Command ......................... 330
14.2.2 Defining a Column Alias ................................................................................ 337
14.3 The UPDATE Command .............................................................................................. 338
Points to Remember ............................................................................................................ 340
Solved Exercises ................................................................................................................... 341
Review Questions ................................................................................................................. 348
Chapter–15 Grouping Records using MySQL Database ................................................ 355–402
15.1 Introduction ................................................................................................................ 355
15.2 The Group Functions .................................................................................................. 355
15.3 The GROUP BY Clause ................................................................................................ 359
15.4 The HAVING Clause .................................................................................................... 362
15.5 Ordering the Database ............................................................................................... 363
Points to Remember ............................................................................................................ 365
Solved Exercises ................................................................................................................... 365
17.11 Cyber Forensics ......................................................................................................... 439
17.12 Identity Theft, Unique IDs and Biometrics .............................................................. 440
17.13 Information Technology Act, 2000 .......................................................................... 443
17.14 Society and Technology ........................................................................................... 444
17.15 Gender and Disability Issues while Teaching and Using Computers ..................... 447
Points to Remember ............................................................................................................ 450
Solved Exercises ................................................................................................................... 451
Review Questions ................................................................................................................. 453
Chapter–18 E-Waste Management ................................................................................ 455–462
18.1 Introduction ................................................................................................................ 455
18.2 Electronic Waste (E-waste) ........................................................................................ 455
18.3 Effects of E-Waste ...................................................................................................... 456
18.4 E-Waste Management ................................................................................................ 457
18.5 E-Waste Management in India .................................................................................. 461
Points to Remember ............................................................................................................ 461
Solved Exercises ................................................................................................................... 461
Review Questions ................................................................................................................. 462
Chapter – 1
Review of Python Pandas
1.1 Introduction
Data processing is an important part of analyzing data, as data is not always available in the desired format. Various processing techniques, such as cleaning, restructuring or merging, are required before analyzing the data. NumPy, SciPy, Cython and Pandas are tools available in Python which can be used for processing data. Further, Pandas is built on the NumPy package; thus the Pandas library relies heavily on the NumPy array for the implementation of Pandas data objects and shares many of its features. In this chapter, we will review the basic concepts of Python Pandas series and dataframes that we have learnt in Class XI.
1.2 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools through its powerful data structures. Pandas provides a rich set of functions to process various types of data. When doing data analysis, it is important to make sure you are using the correct data types; otherwise you may get unexpected results or errors. A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. Table 1.1 summarizes the data types used by Pandas. Most of the time, the default int64 and float64 types will work. If Pandas is already installed and compatible with your system, you can start using it right away.
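For instance, the dtype attribute of a series shows which type Pandas has chosen; a quick sketch (integer data usually defaults to int64, though some Windows builds of NumPy use int32):

```python
import pandas as pd

S1 = pd.Series([1, 2, 3])        # integer data
S2 = pd.Series([1.5, 2.5, 3.5])  # floating-point data
print(S1.dtype)  # typically int64
print(S2.dtype)  # float64
```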
2 Saraswati Informatics Practices XII
Pandas provides two very useful data structures to process the data, i.e., Series and DataFrame.

1.3 Pandas Series

A Series is a one-dimensional structure capable of holding data of any type (integer, string, float, Python objects, etc.). The Series data are mutable (can be changed), but the size of Series data is immutable. It can be seen as a data structure with two arrays: one functioning as the index, i.e., the labels, and the other one containing the actual data. The row labels in a Series are called the index. Let us see the following data which can be considered as series:
Num = [23, 54, 34, 44, 35, 66, 27, 88, 69, 54] # a list with homogeneous data
Emp = ['A V Raman', 35, 'Finance', 45670.00] # a list with heterogeneous data
Marks = {"ELENA JOSE" : 450, "PARAS GUPTA" : 467, "JOEFFIN JOSEPH" : 480} # a dictionary
Num1 = (23, 54, 34, 44, 35, 66, 27, 88, 69, 54) # a tuple with homogeneous data
Std = ('AKYHA KUMAR', 78.0, 79.0, 89.0, 88.0, 91.0) # a tuple with heterogeneous data
Any list, tuple or dictionary can be converted into a series using the Series() method.
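For example, the dictionary and tuple defined above can be converted directly; a short sketch (note that with a dictionary, the keys become the index labels):

```python
import pandas as pd

Marks = {"ELENA JOSE": 450, "PARAS GUPTA": 467, "JOEFFIN JOSEPH": 480}
M = pd.Series(Marks)    # dictionary keys become the index labels
print(M["PARAS GUPTA"])

Num1 = (23, 54, 34, 44, 35, 66, 27, 88, 69, 54)
N = pd.Series(Num1)     # a tuple gets the default integer index 0..9
print(N[0])
```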
Creating an Empty Series

A basic series which can be created is an empty Series. For example:
# import the Pandas library and aliasing as pd
>>> import pandas as pd
# pd as alternate name of the module pandas
>>> S = pd.Series()
>>> print (S)
Series([], dtype: float64)
>>> type(S)
<class 'pandas.core.series.Series'>
Here,
• S is the series variable.
• Printing S shows an empty list (default) and its default data type.
• The type() function shows the series data type.
We know that a list is a one-dimensional data type capable of holding any data type, with indices. A list can be converted into a series using the Series() method. The basic method to create a series is:
S = pd.Series([data], index=[index])
Here,
• The data can be a Python list, dictionary, an ndarray or a scalar value. If data is an ndarray, the index must be the same length as the data.
• The index contains the labels displayed with the respective data. If we do not pass any index, by default the indexes range from 0 to len(data)–1, i.e., the default index starts with 0, 1, 2, ... till the length of the data minus 1. If you want, you can mention your own index for the data.
For example;
>>> import pandas as pd # pd as alternate name of the module pandas
Review of Python Pandas 3
>>> Months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
>>> S = pd.Series(Months) # S is a pandas series
Or
>>> S = pd.Series(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December'])
Printing a Series

To print a series, you can use the print() function with the series, or simply type the series name. For example;
>>> print(S)
Index
0       January
1      February
2         March
3         April
4           May
5          June
6          July
7        August
8     September
9       October
10     November
11     December
dtype: object
Here, the list Months is converted into a Pandas series using the Series() method. The series result shows two columns. We haven't defined an index in our example, but we see two columns in our output. The right column contains our data, whereas the left column contains the index. Pandas created a default index and automatically assigned it to the series, starting with 0 and going to 11, which is the length of the data minus 1.
We can directly access the index and the values of our Series S:
>>> print(S.index)
RangeIndex(start=0, stop=12, step=1)
>>> print(S.values)
['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August' 'September' 'October' 'November'
 'December']
We have created a list called Months and a Series called S. In both cases, the indexes are the same. To clarify, let us access the value at a particular index position, i.e., the 3rd place:
>>> print("Element at 3rd place in list:", Months[2]) # 2 is the index position
Element at 3rd place in list: March
>>> print("Element at 3rd place in Series:", S[2])
Element at 3rd place in Series: March
Similarly, the heterogeneous list Emp can be converted into a series:
>>> print(pd.Series(Emp))
0    A V Raman
1           35
2      Finance
3      45670.0
dtype: object
Accessing Rows using the head() and tail() Functions

The Series.head() function, by default, shows the top 5 rows of data in the series. The opposite is Series.tail(), which gives the last 5 rows. In both functions, if you pass a number as a parameter, Pandas will print out that specified number of rows.
For example, let us print the first 5 rows of data from the series S of months:
>>> S.head()
0     January
1    February
2       March
3       April
4         May
dtype: object
To print only a specified number of rows, pass the number as a parameter:
>>> S.head(3)
0     January
1    February
2       March
dtype: object
>>> S.head(1)
0    January
dtype: object
>>> S.tail()
7        August
8     September
9       October
10     November
11     December
dtype: object
We can also create a series using one list as the data and another list as the index. For example:
>>> months = ['Jan','Apr','Mar','June']
>>> days = [31, 30, 31, 30]
>>> S2 = pd.Series(days, index=months)
>>> S2
Jan     31
Apr     30
Mar     31
June    30
dtype: int64
Declaring an Index
We can make a Series with an explicit index. For example:
>>> M = pd.Series([456, 478, 467, 477, 405], index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
>>> print(M)
Amit       456
Sneha      478
Manya      467
Pari       477
Lavanya    405
dtype: int64
Observe that the index we provided is on the left, with the values on the right. In a Pandas series, the index can be accessed directly:
>>> M.index
Index(['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya'], dtype='object')
Indexing and slicing concepts in a Pandas series are similar to those in lists and tuples. Using a series, you can access the value at any position by indexing with the corresponding number. For example, let us recall the series of months:
>>> S = pd.Series(['Jan', 'Apr', 'Mar', 'June'])
Example 1 - Printing the first element
>>> S[0] # Prints first element of the series
'Jan'
Example 2 - Printing the third element
>>> S[2] # Prints third element of the series
'Mar'
Example 3 - Printing the first three elements
>>> S[:3] # Prints first three elements of the series
0    Jan
1    Apr
2    Mar
dtype: object
Example 4 - Printing elements starting from 2nd till 3rd
>>> S[1:3] # Prints elements starting from 2nd till 3rd
1    Apr
2    Mar
dtype: object
Example 5 - Printing the last two elements
>>> S[-2:] # Prints last two elements of the series
2     Mar
3    June
dtype: object
Example 6 - Printing the value corresponding to a label index
Additionally, we can use an index label to return the value that it corresponds with:
>>> M = pd.Series([456, 478, 467, 477, 405], index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
>>> M['Manya']
467
Example 7 - Printing the values for a list of labels
>>> M[['Manya', 'Lavanya']]
Manya      467
Lavanya    405
dtype: int64
Example 8 - Printing slices with the values of the label index
We can also slice with the values of the index to return the corresponding values:
>>> M['Lavanya':]
Lavanya    405
dtype: int64
We can also filter the series with a Boolean condition:
>>> M[M > 460]
Sneha    478
Manya    467
Pari     477
dtype: int64
Here, M > 460 returns a Series of True/False values, which we then pass to our Series M, returning the corresponding True items.
Initializing a Series from a Scalar

You can also use a scalar to initialize a Series. In this case, all elements of the Series are initialized to the same value. When a scalar is used for initialization, an index array can be specified; the size of the Series is then the same as the size of the index array. Let us create a series for the scalar value 7.
>>> import pandas as pd
>>> S = pd.Series(7, index=[0, 1, 2, 3, 4])
>>> S
0    7
1    7
2    7
3    7
4    7
dtype: int64
You can also use the range() function to generate the series data (and thus the size of the Series). For example, the above series can be modified as:
>>> pd.Series(range(1, 10, 2))
0    1
1    3
2    5
3    7
4    9
dtype: int64
• Create an alphabetic index label with a series starting from 1 with intervals of 3. The command is:
>>> print (pd.Series(range(1, 15, 3), index=[x for x in 'ABCDE']))
A     1
B     4
C     7
D    10
E    13
dtype: int64
For example;
>>> Marks = [456, 478, 467, 477, 405]
>>> Names = ['Amit', 'Sneha', 'Manya', 'Pari','Lavanya']
>>> M1 = pd.Series(Marks, index=Names)
>>> print (M1)
Amit       456
Sneha      478
Manya      467
Pari       477
Lavanya    405
dtype: int32
1.4 Mathematical Operations on Series

The following example increases the marks of each student in the series M1 by 5.
# File name: ...\IPSource_XI\PyChap12\Mincrease.py
import pandas as pd
import numpy as np
Marks = np.array([456, 478, 467, 477, 405])
M1 = pd.Series(Marks, index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
for label, value in M1.items():
    M1.at[label] = value + 5    # increase each value
print (M1)
Output:
Amit       461
Sneha      483
Manya      472
Pari       482
Lavanya    410
Series support element-wise vector operations. For example, when a number is added to a series, the number is added to each element of the series. Assume a series created as:
>>> Series_Var = pd.Series([1, 2, 3, 4, 5])
>>> Series_Var + 5        # 5 is added to each series value
0     6
1     7
2     8
3     9
4    10
dtype: int64
>>> Series_Var * 2        # Each series value is multiplied by 2
0     2
1     4
2     6
3     8
4    10
dtype: int64
>>> Series_Var ** 3       # Each series value is raised to the power 3
0      1
1      8
2     27
3     64
4    125
dtype: int64
>>> Series_Var + Series_Var    # Each series value is added to itself
0     2
1     4
2     6
3     8
4    10
dtype: int64
1.6 Comparison Operations on Series

We can use all the Python comparison operators with a Pandas series. When we apply a comparison operator over a series, the operation compares all values in the series. Assume that we have a series called S with the following data:
>>> GradePoint = [9.5, 6.7, 8.8, 9.1, 8.7, 8.8, 9, 8.5, 9.4, 8.2]
>>> S = pd.Series(GradePoint, index=['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya',
                                     'Priyanka', 'Aanya', 'Ronald', 'Dipika', 'Akriti'])
>>> S
Amit        9.5
Sneha       6.7
Manya       8.8
Pari        9.1
Lavanya     8.7
Priyanka    8.8
Aanya       9.0
Ronald      8.5
Dipika      9.4
Akriti      8.2
dtype: float64
>>> S < 9.0 # is GradePoint less than 9.0
Amit        False
Sneha        True
Manya        True
Pari        False
Lavanya      True
Priyanka     True
Aanya       False
Ronald       True
Dipika      False
Akriti       True
dtype: bool
>>> S <= 9.0 # is GradePoint less than or equal to 9.0
Amit        False
Sneha        True
Manya        True
Pari        False
Lavanya      True
Priyanka     True
Aanya        True
Ronald       True
Dipika      False
Akriti       True
dtype: bool
>>> S != 9.0 # is GradePoint not equal to 9.0
Amit         True
Sneha        True
Manya        True
Pari         True
Lavanya      True
Priyanka     True
Aanya       False
Ronald       True
Dipika       True
Akriti       True
dtype: bool
Similarly, you can use the other comparison operators: ==, > and >=.
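A short runnable sketch of the remaining operators, on a smaller series (the four names are taken from the example; the variable names eq and ge are illustrative):

```python
import pandas as pd

S = pd.Series([9.5, 6.7, 8.8, 9.1],
              index=['Amit', 'Sneha', 'Manya', 'Pari'])

eq = S == 8.8   # True only where the value is exactly 8.8 (Manya)
ge = S >= 9.0   # True for Amit and Pari
print(eq)
print(ge)
```

Because each comparison returns a boolean Series, its results can be combined with & and | or passed back into S for filtering.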
1.7 DataFrame

The main Pandas object is called a DataFrame, and it is the most widely used data structure of Pandas. A DataFrame is a two-dimensional array with heterogeneous data, usually represented in tabular format. DataFrames handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. The DataFrame data and size are also mutable (can be changed). A DataFrame has two different indexes, i.e., a column index and a row index. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables. Note that Series are used to work with one-dimensional arrays, whereas DataFrames can be used with two-dimensional arrays.
Let us look at Table 1.2, which shows the population and birth rate of 6 countries in row and column format. Each column represents an attribute and each row represents a country.
Table 1.2 A DataFrame: a two-dimensional array with heterogeneous data.

Country          Population       BirthRate    UpdateDate
China            1,379,750,000    12.00        2016-08-11
India            1,330,780,000    21.76        2016-08-11
United States    324,882,000      13.21        2016-08-11
Indonesia        260,581,000      18.84        2016-01-07
Brazil           206,918,000      18.43        2016-08-11
Pakistan         194,754,000      27.62        2016-08-11
If you consider the data, then the following represents the data types of the columns.

Column        Data Type
Country       String
Population    Integer
BirthRate     Float
UpdateDate    Date
The most common way to create a DataFrame is by using a dictionary of equal-length lists, as shown below. Further, all spreadsheets and text files are read as DataFrames, which makes it an important data structure of Pandas. The Pandas DataFrame() constructor is used to create a DataFrame. The general format is:
pandas.DataFrame(data, index, columns, dtype)
Here,
• data can be an ndarray, a list, a dict, a Series or another DataFrame.
• You can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
• The dtype is the data type of each column. You can use the DataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
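As a quick sketch of the dtypes command just mentioned (the two-country data is a shortened version of Table 1.2):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India'],
                   'Population': [1379750000, 1330780000],
                   'BirthRate': [12.00, 21.76]})

# .dtypes returns one dtype per column: object for strings,
# int64 for whole numbers, float64 for decimals
print(df.dtypes)
```

Note that Pandas stores strings under the generic dtype object, not a dedicated String type.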
>>> import pandas as pd
>>> df = pd.DataFrame() # Here we define df as the name of the DataFrame
Or
>>> import pandas
>>> df = pandas.DataFrame()
Most of the time, people name the DataFrame df. You can give any valid name and create any number of DataFrame objects. In this text we use different names and examples according to the requirements.
To know the type of the DataFrame and its values:
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> print(df) # or df
which displays the following:
Empty DataFrame
Columns: []
Index: []
While a Series supports only a single dimension, data frames are 2-dimensional objects. The pandas.DataFrame(..) function has provisions for creating data frames from lists. Let us say we have two lists, one of them of string type (Months) and the other of int type (Days). We want to make a DataFrame with these lists as columns.
>>> Months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
>>> Days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor. From the above two lists, we can get a DataFrame in three different ways.
Let us make a dictionary with the two lists such that the column names are keys and the lists are values.
>>> d = {'Month':Months,'Day':Days} # Lists are converted into a dictionary of key-value pairs
From the above dictionary d, Month and Day are the keys and Months and Days are the values, respectively.
>>> print (d)
{'Month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], 'Day': [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]}
>>> df = pd.DataFrame(d) # Creates a data frame using the two lists
Or
>>> import pandas
>>> df = pandas.DataFrame(d) # Creates a data frame using the two lists
1.7.2 Printing/Displaying DataFrame Data

To print or display the data, simply type the DataFrame name or use print (DataFrame_name). For example, to display the previous DataFrame (df):
>>> df # or print(df), the following output will be produced.
    Day      Month
0    31    January
1    28   February
2    31      March
3    30      April
4    31        May
5    30       June
6    31       July
7    31     August
8    30  September
9    31    October
10   30   November
11   31   December

If you carefully observe the above data format, it looks just like spreadsheet data. The result of the above DataFrame creation is a simple 12-row, 2-column table with an automatically generated numeric index.
We can use the zip function to merge these two lists first. In Python 3, the zip function creates a zip object, which is a generator that we can use to produce one item at a time. To get a list of tuples, we can apply list() on it.
>>> DTuples = list(zip(Months, Days))
>>> DTuples
[('January', 31), ('February', 28), ('March', 31), ('April', 30), ('May', 31), ('June', 30), ('July', 31), ('August', 31), ('September', 30), ('October', 31), ('November', 30), ('December', 31)]
We can simply use pd.DataFrame on this list of tuples to get a Pandas DataFrame. We can also specify column names with the list of tuples.
>>> pd.DataFrame(DTuples, columns=['Month','Day'])
        Month  Day
0     January   31
1    February   28
2       March   31
3       April   30
4         May   31
5        June   30
6        July   31
7      August   31
8   September   30
9     October   31
10   November   30
11   December   31
The third way to make a Pandas DataFrame from multiple lists is to start from scratch and add columns manually. We will first create an empty Pandas DataFrame and then add columns to it. Let us create an empty DataFrame:
>>> df = pd.DataFrame()
Add the first column (Months) to the empty DataFrame and display it:
>>> df['Month'] = Months
>>> df['Month']
0       January
1      February
2         March
3         April
4           May
5          June
6          July
7        August
8     September
9       October
10     November
11     December
Name: Month, dtype: object
m
Add the second column to the empty DataFrame.
>>> df['Day'] = Days
Li
>>> df
Month Day
e
at
0 January 31
1 February 28
iv
2 March 31
Pr
3 April 30
4 May 31
a
5 June 30
6 July 31
di
In
7 August 31
8 September 30
se
9 October 31
ou
10 November 30
11 December 31
H
As you know, a DataFrame has a row index and a column index; it is like a dict of Series with a common index. For example:
>>> PData = { 'Country' : ['China', 'India', 'United States', 'Indonesia', 'Brazil', 'Pakistan'],
              'Population' : [1379750000, 1330780000, 324882000, 260581000, 206918000, 194754000],
              'BirthRate' : [12.00, 21.76, 13.21, 18.84, 18.43, 27.62],
              'UpdateDate' : ['2016-08-11', '2016-08-11', '2016-08-11', '2016-01-07', '2016-08-11', '2016-08-11'] }
>>> df = pd.DataFrame(PData, columns=['Country', 'Population', 'BirthRate', 'UpdateDate'])
>>> df
         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11
The DataFrame result is displayed with an integer-based index. Country, Population, BirthRate and UpdateDate are the attributes of the DataFrame df.
Accessing and slicing data from a DataFrame depends on the indexes of the DataFrame.
The DataFrame.head() function in Pandas, by default, shows the top 5 rows of data in the DataFrame. Its opposite is DataFrame.tail(), which gives the last 5 rows. In both functions, if you pass a number as a parameter, Pandas will print out that number of rows.
For example, let us print the first 5 rows of DataFrame df:
>>> df.head()
         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
>>> df.head(1)
  Country  Population  BirthRate  UpdateDate
0   China  1379750000       12.0  2016-08-11
>>> df.tail()
         Country  Population  BirthRate  UpdateDate
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11
To print the first 3 rows of the Population column of DataFrame df:
>>> df.Population.head(3)
0    1379750000
1    1330780000
2     324882000
Name: Population, dtype: int64
Note. If you type df.head(0) or df.tail(0), then the following result will be displayed:
Empty DataFrame
Columns: [Population, BirthRate, UpdateDate]
Index: []
There are three primary methods for selecting columns from a DataFrame in Pandas:
• using square brackets and the column name, e.g., data['column_name'];
• using dot notation, e.g., data.column_name;
• using numeric indexing and the iloc selector, e.g., data.iloc[:, <column_number>]. The "i" in iloc stands for "integer" and indicates that this selection expects a numerical position specification for both rows and columns.
For example, let us access the Country column of the population data using square brackets:
>>> df['Country'] # or df.Country
0            China
1            India
2    United States
3        Indonesia
4           Brazil
5         Pakistan
Name: Country, dtype: object
Notice that the Country column data is displayed as a Series. When we select a single column from a DataFrame, it always returns a Series, and selecting multiple columns from the DataFrame will return a DataFrame.
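The Series-versus-DataFrame distinction can be checked directly (a two-country cut of the data; the names one and many are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India'],
                   'Population': [1379750000, 1330780000]})

one = df['Country']                    # a single column name -> Series
many = df[['Country', 'Population']]   # a list of column names -> DataFrame
print(type(one).__name__, type(many).__name__)
```

Even df[['Country']], a one-element list, returns a DataFrame rather than a Series, because the list form always selects "multiple" columns.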
In the previous DataFrame (df), the population data has a default integer-based index, and you can use the .iloc method to access rows or columns.
For example, to access the Country and Population columns with all rows from the DataFrame df, the command is:
>>> df.iloc[:, [0, 1]]
         Country  Population
0          China  1379750000
1          India  1330780000
2  United States   324882000
3      Indonesia   260581000
4         Brazil   206918000
5       Pakistan   194754000
The above result displays a DataFrame, as we selected multiple columns. Here, the colon (:) represents all rows and [0, 1] represents the column numbers, i.e., Country and Population, respectively. If you want to display a range of columns, i.e., the first three columns of the DataFrame df, the command is:
>>> df.iloc[:, 0:3]
         Country  Population  BirthRate
0          China  1379750000      12.00
1          India  1330780000      21.76
2  United States   324882000      13.21
3      Indonesia   260581000      18.84
4         Brazil   206918000      18.43
5       Pakistan   194754000      27.62
Here, the colon (:) represents all rows and 0:3 represents the column numbers 0, 1 and 2, i.e., Country, Population and BirthRate, respectively. In a range of columns (for example, 0:3), the end position is excluded. With the .loc method, you have to specify rows and columns based on their row and column labels. The previous command can be written as:
>>> df.loc[:, 'Country':'BirthRate']
which will produce the same result as the above command: df.iloc[:, 0:3]
• loc. It is label based indexing and gets rows (or columns) with particular labels from the index.
• iloc. It is position based indexing and gets rows (or columns) at particular positions in the index
(so it only takes integers) which you learnt in previous section.
• ix usually tries to behave like .loc but falls back to behaving like .iloc if a label is not present in the
index.
These three methods belong to the index selection methods. The index is the identifier used for each row of the data set. A key thing to take into account is that indexing can take specific labels, and these labels can either be integers or any other value specified by the user (e.g., dates, names).
Before slicing data using any of the above three methods, let us see the following two sets of data:
Set-1: Integer-based index data:
         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11
Set-2: Label index data:
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
Here, in the above two sets, the data is exactly the same but the indexes are different. In Set-1, the index is integer based, i.e., 0, 1, 2, 3.... In Set-2, a set of strings, i.e., the Country column, identifies the rows. This distinction is important to take into account when using selection methods.
Let us access the first three elements of the data set using .loc[] and .iloc[]:
>>> df.loc[0:3]
>>> df.iloc[0:3]
Here, both .loc[] and .iloc[] work on the integer-indexed data, because an integer index can also be taken as a label, but we get a different result depending on the method we are using. In the case of .loc[], the selection includes the last term (i.e., row 3). With .iloc[], normal Python index selection rules apply, so the last term is excluded.
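The inclusive/exclusive difference can be demonstrated with a tiny DataFrame (the column A and the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})   # default integer index 0..3

with_loc = df.loc[0:2]    # label slice: end label 2 IS included  -> 3 rows
with_iloc = df.iloc[0:2]  # position slice: position 2 is excluded -> 2 rows
print(len(with_loc), len(with_iloc))
```

The same slice expression thus yields different row counts depending only on which indexer is used.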
So, it is obvious that the Pandas .loc indexer can be used with DataFrames for two different use cases:
• Selecting rows by label/index
• Selecting rows with a boolean/conditional lookup
The loc indexer is used with the same syntax as iloc. That is:
dataFrame.loc[<row selection>, <column selection>]
Adding an Index to an existing DataFrame

In Pandas, selections using the loc method are based on the index of the data frame (if any). The index is set with a DataFrame function called set_index(), which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). The syntax is:
dataFrame.set_index(['Column_name'])
After set_index(), the .loc method directly selects rows based on index values. For example, let us set the index of the population data frame to the country name "Country" and create a new, indexed DataFrame df:
>>> df = df.set_index(['Country'])
>>> df
Note that if you set a column as the index of a DataFrame, you can no longer access the indexed column (here Country) using the data['column_name'] or data.column_name commands; this will produce a KeyError (or an AttributeError for the dot form).
Rows in a DataFrame are typically selected using the iloc/loc selection methods, or using logical selectors (selecting based on the value of another column or variable). Now, with the index set, we can directly select rows for different "Country" values using .loc[<label>] – either singly, or in multiples. For example:
>>> df.loc['India']
Population    1330780000
BirthRate          21.76
UpdateDate    2016-08-11
Name: India, dtype: object
>>> df.loc[['India', 'Brazil']]
         Population  BirthRate  UpdateDate
Country
India    1330780000      21.76  2016-08-11
Brazil    206918000      18.43  2016-08-11
Note that the first example returns a Series, and the second returns a DataFrame. You can obtain a single-row DataFrame by passing a single-element list to the .loc operation.
With an indexed DataFrame, you can still access a range of rows with the .iloc method. For example, to access the first three rows:
>>> df.iloc[0:3]
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
But if you apply df.loc[0:3], it will produce a TypeError, because the rows are now indexed by labels, not integers.
You can select columns with .loc using the names of the columns. For example, to display the Population and UpdateDate columns for India and Brazil, the command is:
>>> df.loc[['India', 'Brazil'], ['Population', 'UpdateDate']]
         Population  UpdateDate
Country
India    1330780000  2016-08-11
Brazil    206918000  2016-08-11
Similarly, to display a range of rows and columns using labels:
>>> df.loc['China':'United States', 'Population':'BirthRate']
               Population  BirthRate
Country
China          1379750000      12.00
India          1330780000      21.76
United States   324882000      13.21
Note that when we used label-based indexing, both the start and the end labels were included in the subset. With position-based slicing, only the start index is included. So, in this case, China had an index of 0, India 1, and United States 2. The same goes for the columns.
One more thing you should know about indexing: when you have labels for either the rows or the columns, and you want to slice a portion of the DataFrame, you may not know whether to use loc or iloc. In this case, you can use ix, which is discussed in a later section.

Like Python, you can use Boolean or logical conditions with Pandas DataFrames. With Boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values.
For example, the statement df['BirthRate'] > 20 produces a Pandas Series with a True/False value for every row in the DataFrame, with "True" values for the rows where the BirthRate is greater than 20. This type of Boolean array can be passed directly to the .loc indexer. The command is:
>>> df.loc[df['BirthRate'] > 20]
          Population  BirthRate  UpdateDate
Country
India     1330780000      21.76  2016-08-11
Pakistan   194754000      27.62  2016-08-11
Similarly, to select rows with the BirthRate column between 18 and 30, and return just the Population and BirthRate columns, the command is:
>>> df.loc[(df['BirthRate'] > 18) & (df['BirthRate'] < 30), ['Population', 'BirthRate']]
           Population  BirthRate
Country
India      1330780000      21.76
Indonesia   260581000      18.84
Brazil      206918000      18.43
Pakistan    194754000      27.62
Similarly, you can access the data using Python functions on the respective columns.
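For instance, column-wise functions such as idxmax() and mean() can be applied directly (a three-country cut of the data; the variable names top and avg are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Population': [1379750000, 1330780000, 194754000],
                   'BirthRate': [12.00, 21.76, 27.62]},
                  index=['China', 'India', 'Pakistan'])

top = df['BirthRate'].idxmax()   # index label of the largest BirthRate
avg = df['BirthRate'].mean()     # average BirthRate across the rows
print(top, avg)
```

Such aggregates can also feed back into boolean selection, e.g. df.loc[df['BirthRate'] > avg].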
As seen before, when you have an integer-based index, confusion may arise between location-based and label-based methods. .ix[] resolves this by falling back to label-based access (i.e., like .loc[]), which might not be what you are looking for. The ix[] indexer is a hybrid of .loc and .iloc; generally, ix is label based and acts just like the .loc indexer, but it also supports integer-type selections (as in .iloc) when passed an integer. This positional fallback only works where the index of the DataFrame is not integer based. ix will accept any of the inputs of .loc and .iloc. The syntax is:
dataFrame.ix[<row selection>, <column selection>]
For example,
>>> df.ix[0:3, 1:3]
               BirthRate  UpdateDate
Country
China              12.00  2016-08-11
India              21.76  2016-08-11
United States      13.21  2016-08-11
This is position-based indexing. In the above output, the row slice 0:3 starts from China and, as in .iloc, excludes the end position; the column slice 1:3 selects the columns at positions 1 and 2 (BirthRate and UpdateDate), inclusive of the start position.
For example,
>>> df.ix[[3]] # Integer-type selection
           Population  BirthRate  UpdateDate
Country
Indonesia   260581000      18.84  2016-01-07
To display the row of India by its index label:
>>> df.ix["India"]
Population    1330780000
BirthRate          21.76
UpdateDate    2016-08-11
Name: India, dtype: object
>>> df.ix["China":"Indonesia"]
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
To display all rows for a column, e.g., the Population column, the command is:
>>> df.ix[:, 'Population'] # the colon takes all rows of DataFrame df
Country
China            1379750000
India            1330780000
United States     324882000
Indonesia         260581000
Brazil            206918000
Pakistan          194754000
Name: Population, dtype: int64
To display a specific element from the DataFrame, say the second row value of the Population column:
>>> df.ix[1, 'Population']
1330780000
1.8 Iterating Pandas DataFrame

In Pandas, you can iterate over the rows of a DataFrame. This is similar to iterating over a Python dictionary (think iteritems() or items() in Python 3). When iterated directly, a DataFrame yields its keys, i.e., the column names. Let us iterate the previous population (PData) DataFrame df:
>>> df
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
>>> for keys in df:
        print (keys)
Population
BirthRate
UpdateDate
A Pandas DataFrame can iterate over its rows using three different functions. These are:
• iterrows(). This function iterates over the rows of a DataFrame as (index, Series) pairs. In other words, it gives you (index, row) tuples as a result. It returns a generator that iterates over the rows of the DataFrame. Because the iterrows() function returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
If you need a formatted output, then you can apply the following command:
>>> for index, row in df.iterrows():
        print ("{0:<15} {1:>12} {2:>7.2f} {3:>12}".format(index, row["Population"], row["BirthRate"], row["UpdateDate"]))
And the output is:
China             1379750000   12.00   2016-08-11
India             1330780000   21.76   2016-08-11
United States      324882000   13.21   2016-08-11
Indonesia          260581000   18.84   2016-01-07
Brazil             206918000   18.43   2016-08-11
Pakistan           194754000   27.62   2016-08-11
• iteritems(). Iterates over (column name, Series) pairs. For example,
>>> for key, value in df.iteritems():
        print (key, value)
• itertuples(index=True). Iterates over the rows of the DataFrame as tuples, with the index value as the first element of each tuple. For example,
>>> for row in df.itertuples():
        print (row)
which will print the data index-wise, as given below:
Pandas(Index='China', Population=1379750000, BirthRate=12.0, UpdateDate='2016-08-11')
Pandas(Index='India', Population=1330780000, BirthRate=21.76, UpdateDate='2016-08-11')
Pandas(Index='United States', Population=324882000, BirthRate=13.21, UpdateDate='2016-08-11')
Pandas(Index='Indonesia', Population=260581000, BirthRate=18.84, UpdateDate='2016-01-07')
Pandas(Index='Brazil', Population=206918000, BirthRate=18.43, UpdateDate='2016-08-11')
Pandas(Index='Pakistan', Population=194754000, BirthRate=27.62, UpdateDate='2016-08-11')
Notice that the iteration result displays the index as the first element of each tuple.
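A small runnable sketch of itertuples() on a two-country cut of the data (the list name rows is illustrative). Note that itertuples() is also the fastest of the three iteration functions, since it avoids building a Series per row:

```python
import pandas as pd

df = pd.DataFrame({'Population': [1379750000, 1330780000],
                   'BirthRate': [12.00, 21.76]},
                  index=['China', 'India'])

rows = []
for row in df.itertuples():              # row.Index holds the index label
    rows.append((row.Index, row.Population))
print(rows)
```

Each yielded namedtuple exposes the columns as attributes, so row.Population reads the value without any string lookup.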
There are five ways to add a new column to a DataFrame: indexing, loc, assign(), insert() and concat(). The concat() function will be discussed in the Adding Rows section. Before applying any method, assume that we have a table with the following product information:

Product_Code    Product_Name    Company_Name    Unit_Price    Quantity
A01             Motherboard     Intel           12000         200
A02             Hard Disk       Seagate          6500         180
A03             Keyboard        Samsung           500         250
A04             Mouse           Logitech          500         350
A05             Motherboard     AMD             13000         120
A06             Hard Disk       HP               8800         130

Using the above data, create a dictionary called Product with only the Product_Code column, and also create a DataFrame called dFrame using the Product dictionary. The commands are:
>>> import pandas as pd
>>> Product = { 'Product_Code' : ['A01', 'A02', 'A03', 'A04', 'A05', 'A06']}
To create a new DataFrame with the first column (or with an empty DataFrame), apply the following:
>>> dFrame = pd.DataFrame(Product, columns=['Product_Code']) # A column is created
>>> dFrame
  Product_Code
0          A01
1          A02
2          A03
3          A04
4          A05
5          A06
Method 1: Using index
The DataFrame dFrame is integer-index based. To add a second column, i.e., Product_Name, using the indexing process, the command is:
>>> dFrame["Product_Name"] = ['Motherboard', 'Hard Disk', 'Keyboard', 'Mouse', 'Motherboard', 'Hard Disk']
>>> dFrame
  Product_Code Product_Name
0          A01  Motherboard
1          A02    Hard Disk
2          A03     Keyboard
3          A04        Mouse
4          A05  Motherboard
5          A06    Hard Disk
Method 2: Using loc
The .loc indexer can also assign a new column: select all rows with a colon and name the new column. For example, to add the Company_Name column:
>>> dFrame.loc[:, 'Company_Name'] = ['Intel', 'Seagate', 'Samsung', 'Logitech', 'AMD', 'HP']
>>> dFrame
which will display the following:
  Product_Code Product_Name Company_Name
0          A01  Motherboard        Intel
1          A02    Hard Disk      Seagate
2          A03     Keyboard      Samsung
3          A04        Mouse     Logitech
4          A05  Motherboard          AMD
5          A06    Hard Disk           HP
Method 3: Using assign() function
.loc has two limitations: it mutates the DataFrame in place, and it cannot be used with method chaining. If that is a problem for you, use the assign() function, which assigns a new column (i.e., a list) to the existing DataFrame. The syntax is:
DataFrame = DataFrame.assign(column_name = list)
Here,
• The assign() call on the right side returns a new DataFrame with the column added; the original DataFrame is not modified.
• The DataFrame on the left side of the assignment receives that new DataFrame.
• If both DataFrame names are the same, the DataFrame now holds the new column permanently.
For example, to assign/add a new column called Unit_Price to the DataFrame dFrame, the command is:
>>> dFrame = dFrame.assign(Unit_Price = [12000, 6500, 500, 500, 13000, 8800])
>>> dFrame
  Product_Code Product_Name Company_Name  Unit_Price
0          A01  Motherboard        Intel       12000
1          A02    Hard Disk      Seagate        6500
2          A03     Keyboard      Samsung         500
3          A04        Mouse     Logitech         500
4          A05  Motherboard          AMD       13000
5          A06    Hard Disk           HP        8800
Method 4: Using insert() function
The insert() function adds a column at a given column index position. In a DataFrame, the columns are numbered 0, 1, 2, . . . and so on. To add a column using the insert() function, the syntax is:
DataFrame.insert(loc, column, value)
Here,
• loc is the integer index and must satisfy 0 <= loc <= len(columns). If the index does not match the DataFrame column index, an IndexError is raised. For example, if dFrame has only 3 columns (0, 1, 2) and you mention the index as 4, it will raise an index error, i.e., "index 4 is out of bounds for axis 0 with size 3".
• column is the name of the column which will be inserted into the existing DataFrame.
• value is a list (it can also be a Series, an array or a scalar).
Let us add a column called Quantity at index position 4 (i.e., as the 5th column counting from 0, after the Unit_Price column) with the values [200, 180, 250, 350, 120, 130]:
>>> idx = 4 # Column index position where the new column Quantity will be inserted
>>> Qty = [200, 180, 250, 350, 120, 130] # can be a list, a Series, an array or a scalar
>>> dFrame.insert(loc=idx, column='Quantity', value=Qty) # New column added into dFrame
>>> dFrame
  Product_Code Product_Name Company_Name  Unit_Price  Quantity
0          A01  Motherboard        Intel       12000       200
1          A02    Hard Disk      Seagate        6500       180
2          A03     Keyboard      Samsung         500       250
3          A04        Mouse     Logitech         500       350
4          A05  Motherboard          AMD       13000       120
5          A06    Hard Disk           HP        8800       130
Notice that the above output shows the new column Quantity at index position 4 in the DataFrame dFrame.
There are two popular methods to add new rows to a DataFrame: append() and concat().
Method 1: Using append() function
A Pandas DataFrame has a straightforward method called append() which adds new rows to an existing DataFrame. The syntax is:
DataFrame.append(otherDataFrame, ignore_index=False)
Here,
• otherDataFrame is the DataFrame whose rows are appended to the existing DataFrame.
• If any columns are missing from the data we are trying to append, the appended rows will have NaN (not a number) values in the cells falling under the missing columns.
For example, in previous sections we created a DataFrame called df which contains country-wise population data. Let us append another DataFrame called df1 which contains 3 new rows of population data as given below:
df1 data:
Country       Population     BirthRate    UpdateDate
Nigeria       186,987,000    36.65        2016-01-07
Bangladesh    161,390,000    24.68        2016-01-08
Russia        146,691,020    11.10        2016-01-10
>>> df1 = pd.DataFrame({'Country' : ['Nigeria', 'Bangladesh', 'Russia'],
                        'Population' : [186987000, 161390000, 146691020],
                        'BirthRate' : [36.65, 24.68, 11.10],
                        'UpdateDate' : ['2016-01-07', '2016-01-08', '2016-01-10']},
                       columns=['Country', 'Population', 'BirthRate', 'UpdateDate'])
>>> df1
      Country  Population  BirthRate  UpdateDate
0     Nigeria   186987000      36.65  2016-01-07
1  Bangladesh   161390000      24.68  2016-01-08
2      Russia   146691020      11.10  2016-01-10
>>> df1 = df1.set_index(['Country'])
>>> df1
            Population  BirthRate  UpdateDate
Country
Nigeria      186987000      36.65  2016-01-07
Bangladesh   161390000      24.68  2016-01-08
Russia       146691020      11.10  2016-01-10
Let us append the rows of DataFrame df1 to DataFrame df:
>>> df = df.append(df1) # DataFrame df1 appended to DataFrame df
>>> df
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
Nigeria         186987000      36.65  2016-01-07
Bangladesh      161390000      24.68  2016-01-08
Russia          146691020      11.10  2016-01-10
Method 2: Using concat() function
Concatenation basically attaches DataFrames together. For simple operations where we need to add rows or columns of the same length, the pd.concat() function is perfect. The syntax is:
pd.concat([DataFrame1, DataFrame2, ...], axis=0, ignore_index=False)
Here,
• By default, the argument is set to axis=0, which means we are concatenating rows. For columns, set the axis to 1.
• ignore_index controls whether the original row labels are retained. By default it is False.
For example, assume that we have a new DataFrame called dFrame1 with the following product contents:
dFrame1 data:
Product_Code    Product_Name    Company_Name    Unit_Price
A07             Keyboard        TVS             2400
A08             LCD-21          LG              8000
A09             LCD-21          Samsung         8500
A10             Mouse           Dell             450
The commands to create the DataFrame dFrame1 are:
>>> dFrame1 = pd.DataFrame({'Product_Code':['A07', 'A08', 'A09', 'A10'],
                            'Product_Name':['Keyboard', 'LCD-21', 'LCD-21', 'Mouse'],
                            'Company_Name':['TVS', 'LG', 'Samsung', 'Dell'],
                            'Unit_Price':[2400, 8000, 8500, 450]},
                           index = [0, 1, 2, 3])
>>> dFrame1
  Product_Code Product_Name Company_Name  Unit_Price
0          A07     Keyboard          TVS        2400
1          A08       LCD-21           LG        8000
2          A09       LCD-21      Samsung        8500
3          A10        Mouse         Dell         450
Notice that here we deliberately did not add the Quantity column to dFrame1.
If you compare the two DataFrames (dFrame and dFrame1), you will see two major differences: different indexes and different columns. The first DataFrame (dFrame) has 6 rows and 5 columns, and the second DataFrame (dFrame1) has 4 rows and 4 columns. With concatenation, we can talk about various methods of bringing these together.
Let us create a third DataFrame called dFrame2 that concatenates the DataFrames dFrame and dFrame1:
>>> dFrame2 = pd.concat([dFrame, dFrame1])
When the rows are concatenated, the order of columns may change from the original. To rearrange the columns into their original sequence, for example in dFrame2, the command is:
>>> dFrame2 = dFrame2[['Product_Code', 'Product_Name', 'Company_Name', 'Unit_Price', 'Quantity']]
>>> dFrame2
  Product_Code Product_Name Company_Name  Unit_Price  Quantity
0          A01  Motherboard        Intel       12000     200.0
1          A02    Hard Disk      Seagate        6500     180.0
2          A03     Keyboard      Samsung         500     250.0
3          A04        Mouse     Logitech         500     350.0
4          A05  Motherboard          AMD       13000     120.0
5          A06    Hard Disk           HP        8800     130.0
0          A07     Keyboard          TVS        2400       NaN
1          A08       LCD-21           LG        8000       NaN
2          A09       LCD-21      Samsung        8500       NaN
3          A10        Mouse         Dell         450       NaN
As you can observe in the above output, the rows coming from dFrame have values in the Quantity column, but dFrame1 has no Quantity column, so NaN (Not a Number) is printed for its rows in the resulting DataFrame dFrame2. Also note that the index of the result is duplicated: each original index is repeated.
ignore_index = True
From the above dFrame2 results, notice that the index column is kept as it was in the two original DataFrames, so some index values are repeated. If you want the resulting object to have its own fresh indexing, set ignore_index to True:
>>> dFrame2 = pd.concat([dFrame, dFrame1], ignore_index=True)
Now the rows are renumbered with a new index from 0 to 9.
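The effect of ignore_index can be demonstrated on a minimal pair of DataFrames (the column X and the variable names are illustrative):

```python
import pandas as pd

a = pd.DataFrame({'X': [1, 2]})
b = pd.DataFrame({'X': [3, 4]})

kept = pd.concat([a, b])                      # index labels 0,1,0,1 (repeated)
fresh = pd.concat([a, b], ignore_index=True)  # index renumbered 0,1,2,3
print(list(kept.index), list(fresh.index))
```

The values are identical in both results; only the row labels differ, which matters as soon as you select rows by label.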
Adding a Column using concat() function
You can also specify axis=1 in order to join, merge or concatenate along the columns. Suppose we have an empty column called Total_Price in a DataFrame called Temp:
>>> Temp = pd.DataFrame(columns=['Total_Price'])
>>> dFrame2 = pd.concat([dFrame2, Temp], axis=1)
>>> dFrame2
  Company_Name Product_Code Product_Name  Quantity  Unit_Price Total_Price
0        Intel          A01  Motherboard     200.0       12000         NaN
1      Seagate          A02    Hard Disk     180.0        6500         NaN
2      Samsung          A03     Keyboard     250.0         500         NaN
3     Logitech          A04        Mouse     350.0         500         NaN
4          AMD          A05  Motherboard     120.0       13000         NaN
5           HP          A06    Hard Disk     130.0        8800         NaN
6          TVS          A07     Keyboard       NaN        2400         NaN
7           LG          A08       LCD-21       NaN        8000         NaN
8      Samsung          A09       LCD-21       NaN        8500         NaN
9         Dell          A10        Mouse       NaN         450         NaN
As you can see in the above result, the Total_Price column contains only missing values. This happens because the Temp DataFrame did not have any values for that column.
Do not apply the ignore_index=True option with the above concat() function. Otherwise, it will replace the column labels with integer indexes (0, 1, 2, 3, 4, ...) instead of the original names (Company_Name, Product_Code, Product_Name, Quantity, Unit_Price, Total_Price).
ou
In Pandas, we can drop or delete column by index, by name and by position. The syntax is:
i
Here,
w
• index. It refers to the column/row which will be deleted depending on the axis.
• axis. By default, the argument is set to axis=0, which means it denotes rows. For columns set the
s
axis as 1.
ra
• inplace. By default it is false. If True, do operation inplace and return None. To actually edit the
Sa
original DataFrame, the “inplace” parameter can be set to True, and there is no returned value.
Before applying any deletion operation, let us create a DataFrame for Table 1.4 data.
Name     Age   Score   Grade
Aashna   16    87      A2
Somya    15    64      B2
Ronald   16    58      C1
Jack     17    74      B1
Raghu    15    34      D
Mathew   16    77      B1
Review of Python Pandas 33
Nancy    14    87      A2
Bhavya   16    64      B2
Kumar    15    45      C2
Aashna   17    68      B2
Somya    16    92      A1
Mathew   16    93      A1
To create a DataFrame df for the dictionary, the commands are:
>>> import pandas as pd
>>> ClassXIIA = {'Name':['Aashna', 'Somya', 'Ronald', 'Jack', 'Raghu', 'Mathew',
                 'Nancy', 'Bhavya', 'Kumar', 'Aashna', 'Somya', 'Mathew'],
        'Age':[16, 15, 16, 17, 15, 16, 14, 16, 15, 17, 16, 16],
        'Score':[87, 64, 58, 74, 34, 77, 87, 64, 45, 68, 92, 93],
        'Grade':['A2', 'B2', 'C1', 'B1', 'D', 'B1', 'A2', 'B2', 'C2', 'B2', 'A1', 'A1']}
>>> df = pd.DataFrame(ClassXIIA, columns=['Name', 'Age', 'Score', 'Grade'])
Drop a column by name
To delete a column, or multiple columns, use the name of the column(s) and specify the axis as 1. Let's drop a column (i.e., Age) by name in Python Pandas.
>>> df.drop('Age', axis=1)  # drop a column based on name
The axis=1 denotes that we are referring to a column, not a row. The above deletion operation does not affect the original DataFrame df; however, you can assign the result to another variable (DataFrame). The deleted column(s) remain in the DataFrame until you apply inplace=True. Alternatively, if you apply a command like df = df.drop('Age', axis=1), the resultant columns are copied into the same DataFrame, so the column is permanently deleted.
>>> df
    Name    Score  Grade
0   Aashna  87     A2
1   Somya   64     B2
2   Ronald  58     C1
3   Jack    74     B1
4   Raghu   34     D
5   Mathew  77     B1
6   Nancy   87     A2
7   Bhavya  64     B2
8   Kumar   45     C2
9   Aashna  68     B2
10  Somya   92     A1
11  Mathew  93     A1
To drop more than one column using label name, mention the column names in a list, for example, df =
df.drop(['Age', 'Score'], axis=1). This will delete 'Age' and 'Score' columns.
Let us drop a column by its index in Python Pandas. For example, to delete the Grade column, whose column index is 3, the command is:
>>> df.drop(df.columns[3], axis=1)  # drop Grade column based on column index
In the above example, the column with index 3 (i.e., the Grade column) is dropped. So, the resultant DataFrame will be:
    Name    Age  Score
0   Aashna  16   87
1   Somya   15   64
2   Ronald  16   58
3   Jack    17   74
4   Raghu   15   34
5   Mathew  16   77
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
We can delete a column based on its name by using the del command. Let us drop a column (i.e., Age) by name:
>>> del df['Age']
>>> df
    Name    Score  Grade
0   Aashna  87     A2
1   Somya   64     B2
2   Ronald  58     C1
3   Jack    74     B1
4   Raghu   34     D
5   Mathew  77     B1
6   Nancy   87     A2
7   Bhavya  64     B2
8   Kumar   45     C2
9   Aashna  68     B2
10  Somya   92     A1
11  Mathew  93     A1
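The difference between drop() and del can be seen in a minimal sketch (the data is a shortened, hypothetical version of the chapter's DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Somya'],
                   'Age': [16, 15],
                   'Score': [87, 64]})

# drop() returns a new DataFrame; the original keeps the column.
dropped = df.drop('Age', axis=1)
print('Age' in df.columns)       # True  (df is unchanged)
print('Age' in dropped.columns)  # False

# del removes the column from the DataFrame itself, immediately.
del df['Age']
print('Age' in df.columns)       # False
```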
If you use inplace=True with the drop command for a column, the column will be permanently deleted from the DataFrame. For example, notice the following commands and their results given below.
Let us see the original DataFrame first:
>>> df
    Name    Age  Score  Grade
0   Aashna  16   87     A2
1   Somya   15   64     B2
2   Ronald  16   58     C1
3   Jack    17   74     B1
4   Raghu   15   34     D
5   Mathew  16   77     B1
6   Nancy   14   87     A2
7   Bhavya  16   64     B2
8   Kumar   15   45     C2
9   Aashna  17   68     B2
10  Somya   16   92     A1
11  Mathew  16   93     A1
If you apply inplace=True with the drop() function, i.e.,
>>> df.drop(df.columns[3], axis=1, inplace=True)
the deletion operation affects the original DataFrame df. The command permanently deletes the Grade column (column index 3) from df. So, the resultant DataFrame will be:
>>> df
    Name    Age  Score
0   Aashna  16   87
1   Somya   15   64
2   Ronald  16   58
3   Jack    17   74
4   Raghu   15   34
5   Mathew  16   77
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
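A small self-contained sketch of the inplace behaviour (the data is hypothetical): inplace=True edits the DataFrame itself and returns None, so the result should not be assigned back.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Somya'],
                   'Score': [87, 64],
                   'Grade': ['A2', 'B2']})

# inplace=True modifies df directly and returns None.
result = df.drop('Grade', axis=1, inplace=True)
print(result)               # None
print(df.columns.tolist())  # ['Name', 'Score']
```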
1.9.4 Dropping Rows in DataFrame
In Pandas, we can drop or delete a row by index, by condition and by position. Note that if the DataFrame is indexed by label, then you can also delete a row by name.
Drop a row by number
Let us delete the second and third rows (i.e., index 1 and 2) from the DataFrame df. The command is:
>>> df.drop([1, 2])  # Deletes rows whose indexes are 1 and 2.
Here, the axis defaults to 0, which means we are deleting row(s). The deleted rows remain in the original DataFrame until you apply inplace=True. Alternatively, if you apply a command like df = df.drop([1, 2]), the resultant rows are copied into the same DataFrame, so the rows are permanently deleted. So, the resultant DataFrame for the above command will be:
    Name    Age  Score
0   Aashna  16   87
3   Jack    17   74
4   Raghu   15   34
5   Mathew  16   77
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
Drop a row by condition
We can drop a row when it satisfies a specific condition. For example, let us delete the row(s) with the name 'Mathew':
>>> df = df[df.Name != 'Mathew']
The above code keeps all the names except Mathew, thereby dropping the row(s) with the name 'Mathew'; since the result is assigned back, the operation affects the original DataFrame df. So, the resultant DataFrame will be:
    Name    Age  Score
0   Aashna  16   87
1   Somya   15   64
2   Ronald  16   58
3   Jack    17   74
4   Raghu   15   34
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
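The condition-based deletion can be sketched as follows (hypothetical data; the second form shows an equivalent drop() call):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Mathew', 'Somya', 'Mathew'],
                   'Score': [87, 77, 64, 93]})

# Keep only the rows whose Name is not 'Mathew'; assigning the
# result back to df makes the deletion stick.
df = df[df.Name != 'Mathew']
print(df.Name.tolist())   # ['Aashna', 'Somya']

# An equivalent form uses drop() with the matching row labels.
df2 = pd.DataFrame({'Name': ['Aashna', 'Mathew'], 'Score': [87, 77]})
df2 = df2.drop(df2[df2.Name == 'Mathew'].index)
print(df2.Name.tolist())  # ['Aashna']
```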
Drop by index
We can drop a row by its index, i.e., the integer index. Let us delete the 3rd row (i.e., index 2) from the DataFrame df. The command is:
>>> df.drop(df.index[2])  # Drop a row by index
The above code drops the row with index number 2. So the resultant DataFrame will be:
    Name    Age  Score
0   Aashna  16   87
1   Somya   15   64
3   Jack    17   74
4   Raghu   15   34
5   Mathew  16   77
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
Drop by position
We can drop rows by position using the slicer (:) with indexes. For example, to drop the first 3 rows (i.e., the rows at positions 0, 1 and 2), the command is:
>>> df.drop(df.index[:3])
It will delete the top 3 rows from the DataFrame. So, the resultant DataFrame will be:
    Name    Age  Score
3   Jack    17   74
4   Raghu   15   34
5   Mathew  16   77
6   Nancy   14   87
7   Bhavya  16   64
8   Kumar   15   45
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
If you want to delete all rows except the last three rows, the above command can be written as:
>>> df.drop(df.index[:-3])
So, the resultant DataFrame will be:
    Name    Age  Score
9   Aashna  17   68
10  Somya   16   92
11  Mathew  16   93
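Both positional forms can be checked with a short sketch (hypothetical five-row data):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Somya', 'Ronald', 'Jack', 'Raghu'],
                   'Score': [87, 64, 58, 74, 34]})

first3_dropped = df.drop(df.index[:3])   # delete the first 3 rows
last3_kept = df.drop(df.index[:-3])      # keep only the last 3 rows

print(first3_dropped.Name.tolist())  # ['Jack', 'Raghu']
print(last3_kept.Name.tolist())      # ['Ronald', 'Jack', 'Raghu']
```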
Points to Remember
1. Pandas is a high-level data manipulation tool developed by Wes McKinney.
2. The Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
3. The Series data are mutable (can be changed), but the size of Series data is immutable.
4. Any list, tuple or dictionary can be converted into a Series.
7. DataFrame is a two-dimensional array with heterogeneous data, usually represented in the tabular format.
10. The DataFrame.head() function in Pandas, by default, shows you the top 5 rows of data in the DataFrame.
11. The DataFrame.tail() function in Pandas, by default, shows you the last 5 rows of data in the DataFrame.
12. The iterrows() function iterates over the rows of a DataFrame as (index, Series) pairs.
13. The itertuples(index=True) function iterates over the rows of a DataFrame as tuples, with the index value as the first element of each tuple.
14. The assign() function assigns a new column (i.e., a list) to the existing DataFrame.
15. The insert() function adds a column at the given column index position.
SOLVED EXERCISES

(ii) S[1 : 3]
Ans. (i)  1    10
          2    15
          dtype: int64
     (ii) 1    10
          2    15
          dtype: int64
4. What is a DataFrame?
Ans. A DataFrame is a two-dimensional array with heterogeneous data, usually represented in tabular format. The data is organized in rows and columns; each column represents an attribute and each row represents a record.
5. A dictionary s_marks contains the following data:
s_marks = {'name' : ['Rashmi', 'Harsh', 'Ganesh', 'Priya', 'Vivek'], 'Grade' : ['A1', 'A2', 'B1', 'A1', 'B2']}
Write a statement to create a DataFrame called df. Assume that Pandas has been imported as pd.
Ans. df = pd.DataFrame(s_marks)
6. What is the purpose of the axis option in the Pandas concat() function?
Ans. By default axis=0, so the new DataFrame is added row-wise; if a column is not present in one of the objects, its values are filled with NaN. With axis=1, the objects are concatenated column-wise.
(e) Add following 3 rows with the following data (Note: None is a null value):
    Ishan    86    B1
    Amrita   97    A1
2   B1   Ganesh
3   A1   Priya
4   B2   Vivek
(c) Gr["Percentage"] = [92, 89, None, 95, 68, None, 93]
(d) Gr = Gr[['Name', 'Percentage', 'Grade']]
(e) TGr = pd.DataFrame({'Name' : ['Ishan', 'Amrita', None],
         'Percentage' : [86, 97, None],
         'Grade' : ['B1', 'A1', None]},
         columns=['Name', 'Percentage', 'Grade'])
    Gr = Gr.append(TGr, ignore_index=True)
(f) Gr.drop('Grade', axis=1)
(g) Gr.drop([2, 4])
(h) (i) First row from the DataFrame Gr will be deleted.
    (ii) First row from the DataFrame Gr will be deleted.
    (iii) First four rows from the DataFrame Gr will be deleted.
REVIEW QUESTIONS
1. If a list contains the following elements:
L = ['a','b','c','d','e','f']
Write the statements to create a series from it using a NumPy array.
S = pd.Series(D, index=['b','c','d','a'])
print(S)
3. A dictionary contains the first 10 states with their Per Capita Income as follows:
d = {'Goa': 224138, 'Delhi': 212219, 'Sikkim': 176491, 'Chandigarh': None, 'Puducherry': 143677, 'Haryana': 133427, 'Maharashtra': None, 'Tamil Nadu': 112664, 'A. & N. Islands': 107418, 'Gujarat': 106831}
Answer the following:
(a) Create a series called Income.
(b) List the states with the income below 130000.
'Author_Name' : ['Lata Kapoor', 'William Hopkins', 'Brain & Brooke', 'A.W. Rossaine', 'Anna Roberts'],
Chapter – 2

Advanced Operations on Pandas DataFrames
2.1 Introduction
Pandas is a popular Python library for data analysis. One of the key skills for any data analyst is the ability to pivot data tables. Pandas can be used to create MS Excel style pivot tables, which will save you a lot of time by allowing you to quickly summarize large amounts of data into a meaningful report. A pivot table summarizes a large, detailed data set. Pivot tables are particularly useful if you have long rows or columns holding values that you need to track the sums of and easily compare to one another. They can automatically sort, count, total, or average data stored in one table. Then, they can show the results of those actions in a new table of that summarized data inside a DataFrame.
In a general sense, pivoting means to use unique values from specified index / columns to form axes of the resulting DataFrame. We can get pandas to form a pivot table for our DataFrame by calling the pivot() or pivot_table() methods and providing parameters about how we would like the resulting table to be organized.
The pivot() method reshapes data (produces a "pivot" table) based on column values and returns the reshaped DataFrame. The pivot() method takes a maximum of 3 arguments with the following names: index, columns, and values, of which at least two must be given. As a value for each of these parameters you need to specify a column name in the original table. Then the pivot() method will create a new table whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from the column given as the values parameter. The syntax of pivot() is:
pandas.pivot(index, columns, values)

What is a Pivot Table?
A Pivot Table enables you to summarize large amounts of data in a matter of minutes. You can transform endless rows and columns of numbers into a meaningful summary.

So, to create a pivot table, use unique values from index / columns and fill with values. To start, here is the data set that will be used to create a pivot table using the pivot() method in Python:
>>> import pandas as pd
>>> ClassXII = {'Name': ['Aashna', 'Ronald', 'Jack', 'Raghu', 'Somya'],
        'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts'],
        'Score': [87, 64, 58, 74, 87],
        'Grade' :['A2', 'B2', 'C1', 'B1', 'A2']}
>>> df = pd.DataFrame(ClassXII, columns=['Name', 'Subject', 'Score', 'Grade'])  # creating a DataFrame
>>> df
    Name    Subject    Score  Grade
0   Aashna  Accounts   87     A2
1   Ronald  Economics  64     B2
2   Jack    Accounts   58     C1
3   Raghu   Economics  74     B1
4   Somya   Accounts   87     A2
Notice that the DataFrame above is a row-by-row record of transactions. Each row in the DataFrame corresponds to one complete record of the information. We could think of the information as being organized by Name.
To create a pivot table:
>>> pv = df.pivot(index='Name', columns='Subject', values='Score')
>>> pv
Subject  Accounts  Economics
Name
Aashna       87.0        NaN
Jack         58.0        NaN
Raghu         NaN       74.0
Ronald        NaN       64.0
Somya        87.0        NaN
As can be seen, the value of Score for every row in the original table has been transferred to the new table, where its row and column match the Name and Subject of its original row. Also notice that many of the values are NaN. This is because many of the positions in the table do not have matching information in the original DataFrame, and such cells are set to NaN (None).
If you don't need a separate DataFrame, then you can use the following command:
>>> df.pivot(index='Name', columns='Subject', values='Score').fillna('')
Subject  Accounts  Economics
Name
Aashna       87.0
Jack         58.0
Raghu                   74.0
Ronald                  64.0
Somya        87.0
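As a compact, self-contained sketch of what pivot() does (using a shortened, hypothetical version of the ClassXII data):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Ronald', 'Jack'],
                   'Subject': ['Accounts', 'Economics', 'Accounts'],
                   'Score': [87, 64, 58]})

# Each unique Name becomes a row, each unique Subject a column;
# cells without a matching record become NaN.
pv = df.pivot(index='Name', columns='Subject', values='Score')
print(pv.loc['Aashna', 'Accounts'])           # 87.0
print(pd.isna(pv.loc['Ronald', 'Accounts']))  # True
```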
Whatever column you specify as the columns argument will be used to create new columns (each unique entry will form a new column). Remember, columns are optional; they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list, i.e., Score.
Let us create a pivot table to display the Grade for every row in the original table using values='Grade':
>>> df.pivot(index='Name', columns='Subject', values='Grade').fillna('')
Subject  Accounts  Economics
Name
Aashna   A2
Jack     C1
Raghu              B1
Ronald             B2
Somya    A2
Now, what if we want to extend the previous example to have the Score and Grade for each name on its row as well? This is actually easy: we just have to omit the values parameter as follows:
>>> pv = df.pivot(index='Name', columns='Subject')
>>> pv
           Score                Grade
Subject  Accounts  Economics  Accounts  Economics
Name
Aashna       87.0        NaN        A2        NaN
Jack         58.0        NaN        C1        NaN
Raghu         NaN       74.0       NaN         B1
Ronald        NaN       64.0       NaN         B2
Somya        87.0        NaN        A2        NaN
As shown above, pandas will create a hierarchical column index (MultiIndex) for the new table. You can think of a hierarchical index as a set of trees of indices. The first level of the column index defines all columns that we have not specified in the pivot invocation, in this case Score and Grade. The second level of the index defines the unique values of the corresponding column.
Using the above pivot table, we can use this hierarchical column index to filter the values of a single column from the original table. For example, pv.Score returns a pivoted DataFrame with the Score values only, and it is equivalent to the pivoted DataFrame from the previous section.
>>> pv.Score.fillna('')
Subject  Accounts  Economics
Name
Aashna       87
Jack         58
Raghu                   74
Ronald                  64
Somya        87
>>> pv.Score.Accounts.fillna('')
Name
Aashna    87
Jack      58
Raghu
Ronald
Somya     87
A pivot problem
But remember that when there are any index, columns combinations with multiple values, pivot() will raise a ValueError. Let us add one more entry to the previous records to demonstrate the problem; we will append another record and rebuild our DataFrame.
For example, let us create a new data set with a duplicate entry:
>>> Temp = {'Name': ['Aashna', 'Ronald', 'Jack', 'Raghu', 'Somya', 'Ronald'],
>>> df1
    Name    Subject    Score  Grade
0   Aashna  Accounts   87     A2
1   Ronald  Economics  64     B2
2   Jack    Accounts   58     C1
3   Raghu   Economics  74     B1
4   Somya   Accounts   87     A2
5   Ronald  Accounts   78     B1
As you can see from this example, in the 5th row of the DataFrame only the Name is duplicated. Here, if we apply the pivot() function, it won't produce any ValueError, as shown below:
>>> pv1 = df1.pivot(index='Name', columns='Subject', values='Score').fillna('')
>>> pv1
Subject  Accounts  Economics
Name
Aashna       87.0
Jack         58.0
Raghu                   74.0
Ronald       78.0       64.0
Somya        87.0
It is to be noted that if there is any index, columns combination with multiple values, then a ValueError is raised. Suppose df2 contains one more duplicate row:
>>> df2
    Name    Subject    Score  Grade
0   Aashna  Accounts   87     A2
1   Ronald  Economics  64     B2
2   Jack    Accounts   58     C1
3   Raghu   Economics  74     B1
4   Somya   Accounts   87     A2
5   Ronald  Accounts   78     B1
6   Aashna  Accounts   82     A2
As you can see above, the 0th and 6th rows in the DataFrame contain duplicate values in both the Name and Subject columns. Here, if we apply the pivot() function, it produces a ValueError as shown below:
>>> pv2 = df2.pivot(index='Name', columns='Subject', values='Score')
Traceback (most recent call last):
File "<pyshell#150>", line 1, in <module>
pv2 = df2.pivot(index='Name', columns='Subject', values='Score')
....
....
File "C:\Python36\lib\site-packages\pandas\core\reshape\reshape.py", line 154, in _make_selectors
raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape
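The duplicate-entry failure, and the way pivot_table() resolves it by aggregating, can be sketched with minimal hypothetical data:

```python
import pandas as pd

# ('Aashna', 'Accounts') appears twice, like the df2 example above.
df2 = pd.DataFrame({'Name': ['Aashna', 'Aashna', 'Ronald'],
                    'Subject': ['Accounts', 'Accounts', 'Economics'],
                    'Score': [87, 82, 64]})

# pivot() raises ValueError on the duplicate index/column pair.
try:
    df2.pivot(index='Name', columns='Subject', values='Score')
    failed = False
except ValueError:
    failed = True
print(failed)  # True

# pivot_table() aggregates the duplicates instead (mean by default).
pv = df2.pivot_table(index='Name', columns='Subject', values='Score')
print(pv.loc['Aashna', 'Accounts'])  # 84.5  (mean of 87 and 82)
```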
Using stack and unstack Methods
The stack() and unstack() methods flip the layout of a DataFrame by moving whole levels of columns to rows, or whole levels of rows to columns. Stacking a DataFrame means moving (also called rotating or pivoting) the innermost column index to become the innermost row index. The inverse operation is called unstacking; it means moving the innermost row index to become the innermost column index. These methods are particularly useful to help manipulate the hierarchies we form when making pivot tables, but they can be applied any time. To understand this, let us see the example given below:
>>> pv = df.pivot(index='Name', columns='Subject')
>>> pv
           Score                Grade
Subject  Accounts  Economics  Accounts  Economics
Name
Aashna       87.0        NaN        A2        NaN
Jack         58.0        NaN        C1        NaN
Raghu         NaN       74.0       NaN         B1
Ronald        NaN       64.0       NaN         B2
Somya        87.0        NaN        A2        NaN
As you can see from above, the pivot table has a hierarchy of column labels. The column labels are broken down into multiple levels, one for "Score" and "Grade", and another level for the Subject. Right now, the subjects are written across as column labels. If we would prefer that they were written downward as row labels, we can call stack on the DataFrame. That is:
>>> pv.stack()
                   Score  Grade
Name    Subject
Aashna  Accounts   87.0   A2
Jack    Accounts   58.0   C1
Raghu   Economics  74.0   B1
Ronald  Economics  64.0   B2
Somya   Accounts   87.0   A2
Here, the stack() method by default takes the last level in the column breakdown and turns it into the last row breakdown.
If we call stack again, it will move the remaining column level. This will result in there not being any more columns. This is possible to do, and returns something reasonable. Note, however, that this is no longer a DataFrame object; it is a Series object.
>>> pv.stack().stack()
Name    Subject
Aashna  Accounts   Score    87
                   Grade    A2
Jack    Accounts   Score    58
                   Grade    C1
Raghu   Economics  Score    74
                   Grade    B1
Ronald  Economics  Score    64
                   Grade    B2
Somya   Accounts   Score    87
                   Grade    A2
dtype: object
Unstack is similar to stack, but moves row levels to column levels. One more thing to note is that these methods can take a parameter to specify which level in the hierarchy to move. As mentioned above, by default they move the "last" level.
For example, if we start with our pivot table pv and this time stack the 0 level, we will move the "Score" and "Grade" labels down to the rows:
>>> pv.stack(0)
Subject           Accounts  Economics
Name
Aashna   Grade          A2        NaN
         Score          87        NaN
Jack     Grade          C1        NaN
         Score          58        NaN
Raghu    Grade         NaN         B1
         Score         NaN         74
Ronald   Grade         NaN         B2
         Score         NaN         64
Somya    Grade          A2        NaN
         Score          87        NaN
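A minimal round-trip sketch of stack() and unstack() (hypothetical two-row data): stacking moves the Subject level from the columns into the row index, and unstacking moves it back.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Ronald'],
                   'Subject': ['Accounts', 'Economics'],
                   'Score': [87, 64]})
pv = df.pivot(index='Name', columns='Subject', values='Score')

# Subject moves from the columns into the row index (NaN rows are dropped).
stacked = pv.stack()
print(stacked[('Aashna', 'Accounts')])  # 87.0

# unstack() moves it back; the round trip restores the pivot table.
unstacked = stacked.unstack()
print(unstacked.equals(pv))             # True
```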
A bit confusingly, pandas DataFrames also come with a pivot_table() method, which is a generalization of the pivot() method. Whenever you have duplicate values for one index/column pair, you need to use pivot_table() instead.
There are several reasons why you might want to automate pivot table operations. In order to use an interactive pivot table, you had to identify:
• what column(s) in the dataset to use to define the row groupings in the pivot table,
• what column(s) in the dataset to use to define the column groupings in the pivot table,
• what column in the dataset to use as the basis for the pivot table summary function, and
• what summary function to use.
A pivot table is composed of counts, sums, or other aggregations derived from a table of data. The pivot_table() method creates a spreadsheet-style pivot table as a DataFrame. It allows transforming data columns into rows and rows into columns, and it allows grouping by any data field. The syntax is:
pandas.pivot_table(DataFrame, values=None, index=None, columns=None,
                   aggfunc='mean', fill_value=None,
                   margins=False, dropna=True, margins_name='All')
The pivot_table() method does not necessarily need all these arguments, because it has some smart defaults. If we pivot with only an index column, it will by default use all other numeric columns as the values and take the average (mean) of them.
Here,
• aggfunc: the function used to aggregate the values that fall in each position. It basically shows how rows are summarized, such as sum, mean, max, min or count. The default aggfunc of pivot_table is numpy.mean.
• fill_value: By providing a fill_value parameter, we can set the value used when values are missing. The default is NaN (i.e., None).
• margins: This is a Boolean that defaults to False; if we set it to True, the resulting DataFrame will also include total sums along the rows and columns. The totals appear in entries whose label is given by margins_name (default 'All').
• dropna: The default is True, and it is used to drop columns whose entries are all NaN.
Before creating a pivot table, let us create a DataFrame with the following data:
>>> Data = {'Name': ['C Joseph', 'Sareen', 'Abhishek', 'Rughwani', 'C Joseph', 'Sareen', 'Abhishek',
        'Rughwani', 'C Joseph', 'Sareen', 'Abhishek', 'Rughwani', 'C Joseph', 'Sareen', 'Abhishek',
        'Rughwani'],
    'Test': ['Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1',
        'Semester 1', 'Semester 1', 'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2',
        'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2'],
    'Subject': ['Accounts', 'Accounts', 'Accounts', 'Accounts', 'Economics', 'Economics',
        'Economics', 'Economics', 'Accounts', 'Accounts', 'Accounts', 'Accounts', 'Economics',
        'Economics', 'Economics', 'Economics'],
    'Marks': [78, 87, 67, 69, 79, 80, 82, 78, 62, 59, 68, 73, 60, 70, 64, 84]}
>>> dfP = pd.DataFrame(Data, columns=['Name', 'Test', 'Subject', 'Marks'])
>>> dfP
    Name      Test        Subject    Marks
0   C Joseph  Semester 1  Accounts   78
1   Sareen    Semester 1  Accounts   87
2   Abhishek  Semester 1  Accounts   67
3   Rughwani  Semester 1  Accounts   69
4   C Joseph  Semester 1  Economics  79
5   Sareen    Semester 1  Economics  80
6   Abhishek  Semester 1  Economics  82
7   Rughwani  Semester 1  Economics  78
8   C Joseph  Semester 2  Accounts   62
9   Sareen    Semester 2  Accounts   59
10  Abhishek  Semester 2  Accounts   68
11  Rughwani  Semester 2  Accounts   73
12  C Joseph  Semester 2  Economics  60
13  Sareen    Semester 2  Economics  70
14  Abhishek  Semester 2  Economics  64
15  Rughwani  Semester 2  Economics  84
Note. You can also extract the above data from a .CSV file:
dfP = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Result.csv')
We know that the index and columns parameters for pivot_table() can take lists, not just single column labels. Let us create a pivot table using the pivot_table() method to find the student-wise mean of Marks.
>>> pv = dfP.pivot_table(index='Name', aggfunc='mean')
Or
>>> pv = pd.pivot_table(dfP, index='Name', aggfunc='mean')
>>> pv
          Marks
Name
Abhishek  70.25
C Joseph  69.75
Rughwani  76.00
Sareen    74.00
Here,
• the Marks column in the dataset is used as the basis for the pivot table summary function, and
• aggfunc='mean' since we want to find the average of all values in Marks that belong to a unique Name.
As shown in the previous command, if you don't need a new pivot table (pv), then you can write the following command to display the result of the pivot table immediately:
>>> pd.pivot_table(dfP, index='Name', aggfunc='mean')
          Marks
Name
Abhishek  70.25
C Joseph  69.75
Rughwani  76.00
Sareen    74.00
To find the subject-wise mean of each student's marks in DataFrame dfP, we have to add Subject as the columns argument:
>>> pv = dfP.pivot_table(index='Name', columns='Subject', aggfunc='mean')
>>> pv
            Marks
Subject  Accounts  Economics
Name
Abhishek     67.5       73.0
C Joseph     70.0       69.5
Rughwani     71.0       81.0
Sareen       73.0       75.0
Example: Using DataFrame dfP, create a pivot table (pv1) to find the group means by Name and Subject.
>>> pv1 = dfP.pivot_table(index=['Name', 'Subject'], aggfunc='mean')
>>> pv1
                     Marks
Name      Subject
Abhishek  Accounts    67.5
          Economics   73.0
C Joseph  Accounts    70.0
          Economics   69.5
Rughwani  Accounts    71.0
          Economics   81.0
Sareen    Accounts    73.0
          Economics   75.0
Example: Using DataFrame dfP, create a pivot table (pv2) to find the group-wise Marks counts by Name and Subject.
>>> pv2 = dfP.pivot_table(index='Name', columns='Subject', aggfunc='count')
>>> pv2
            Marks                 Test
Subject  Accounts  Economics  Accounts  Economics
Name
Abhishek        2          2         2          2
C Joseph        2          2         2          2
Rughwani        2          2         2          2
Sareen          2          2         2          2
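The effect of different aggfunc choices can be tried on a tiny, hypothetical data set:

```python
import pandas as pd

dfP = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                    'Subject': ['Acc', 'Eco', 'Acc', 'Eco'],
                    'Marks': [80, 60, 70, 90]})

# Sum of Marks per Name.
total = dfP.pivot_table(index='Name', values='Marks', aggfunc='sum')
print(total.loc['A', 'Marks'])           # 140

# Several aggregations at once produce a hierarchical column index.
both = dfP.pivot_table(index='Name', values='Marks', aggfunc=['max', 'min'])
print(both.loc['A', ('max', 'Marks')])   # 80
```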
Example: An Emp table contains the following data:
Empno  Name           Department  Salary  Commission  Job
100    Sunita Sharma  RESEARCH    45600   5600.0      CLERK
101    Ashok Singhal  SALES       43900   3900.0      SALESMAN
102    Sumit Avasti   SALES       27000   7000.0      SALESMAN
        41600, 47800, 43600],
    'Commission': [5600, 3900, 7000, 4900, 3500, 4200, 6800, 7000, 4900, 4500, 8200,
                   np.nan, np.nan],
    'Job': ['CLERK', 'SALESMAN', 'SALESMAN', 'MANAGER', 'SALESMAN', 'MANAGER',
            'MANAGER', 'ANALYST', 'CLERK', 'MANAGER', 'SR. MANAGER', 'SR. MANAGER', 'CLERK']}
dfE = pd.DataFrame(Emp, columns=['Empno', 'Name', 'Department', 'Salary',
                                 'Commission', 'Job'])
(b) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc='sum')
(c) pd.pivot_table(dfE, index='Department', values='Salary')
    Or
    pd.pivot_table(dfE, index='Department', values='Salary', aggfunc='mean')
(d) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc=['sum', 'mean'])
(e) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc=['max', 'min'])
(f) pd.pivot_table(dfE, index=['Department', 'Job'], values='Salary', aggfunc='max')
2.5 Sorting DataFrames
The data of a DataFrame can be sorted by rows or columns or by their respective values. By default, sorting is done on row labels in ascending order. A Pandas DataFrame has two useful sort functions:
• sort_values(): This function sorts a data frame in ascending or descending order of the passed column(s). The optional by parameter of DataFrame.sort_values() may be used to specify one or more columns to sort on.
• sort_index(): This function sorts a data frame by its row or column index labels.
Each of these functions comes with numerous options (parameters), like sorting the data frame in a specific order (ascending or descending), sorting in place, sorting with missing values, sorting by a specific algorithm and so on. Some common sorting options are given in the syntax:
    DataFrame.sort_values(by, ascending=True, inplace=False)
Here,
• ascending: The default sorting is ascending, whose value is True. If the value is False, the DataFrame is sorted in descending order.
• inplace: The default value is False. If you do not want a new DataFrame, mention the value as True.
Suppose we have the following DataFrame (dfS) where we can apply different types of sorting:
'Grade' :['A2', 'B2', 'C1', 'B1', 'D', 'B1', 'A2', 'B2', 'C2', 'B2', 'A1', 'A1']}
>>> dfS = pd.DataFrame(Data, columns=['Name', 'Age', 'Score', 'Grade'])
>>> dfS
    Name    Age  Score  Grade
0   Aashna  16   87     A2
1   Somya   15   64     B2
2   Ronald  16   58     C1
3   Jack    17   74     B1
4   Raghu   15   34     D
5   Mathew  16   77     B1
6   Nancy   14   87     A2
7   Bhavya  16   64     B2
8   Kumar   15   45     C2
9   Aashna  17   68     B2
10  Somya   16   92     A1
11  Mathew  16   93     A1
Sort by Value
To sort a DataFrame by value, mention the column name as an input argument to the sort_values() function. For example, we can sort by the values of the 'Name' column in the DataFrame dfS. The command is:
>>> dfN = dfS.sort_values(by='Name')
Or
>>> dfN = dfS.sort_values('Name')  # DataFrame sorted in ascending order by 'Name'
>>> dfN
    Name    Age  Score  Grade
0   Aashna  16   87     A2
9   Aashna  17   68     B2
7   Bhavya  16   64     B2
3   Jack    17   74     B1
8   Kumar   15   45     C2
5   Mathew  16   77     B1
11  Mathew  16   93     A1
6   Nancy   14   87     A2
4   Raghu   15   34     D
2   Ronald  16   58     C1
1   Somya   15   64     B2
10  Somya   16   92     A1
Note that by default sort_values() returns a new, sorted DataFrame (dfN). The new sorted DataFrame dfN is in ascending order (small values first and large values last).
To sort a DataFrame in descending order, we can use the argument ascending=False. In this example, we sort the DataFrame by the 'Score' column with ascending=False:
>>> dfA = dfN.sort_values('Score', ascending=False)
>>> dfA
It will produce the following output:
    Name    Age  Score  Grade
11  Mathew  16   93     A1
10  Somya   16   92     A1
0   Aashna  16   87     A2
6   Nancy   14   87     A2
5   Mathew  16   77     B1
3   Jack    17   74     B1
9   Aashna  17   68     B2
1   Somya   15   64     B2
7   Bhavya  16   64     B2
2   Ronald  16   58     C1
8   Kumar   15   45     C2
4   Raghu   15   34     D
Sort by multiple columns
We can specify the columns we want to sort by as a list in the argument for the sort_values() function. Note that when sorting by multiple columns, sort_values() uses the first column first and the second column next. Let us sort the DataFrame dfS by multiple columns (Name, Score) in descending order:
>>> dfS.sort_values(by=['Name', 'Score'], ascending=False)
    Name    Age  Score  Grade
10  Somya   16   92     A1
1   Somya   15   64     B2
2   Ronald  16   58     C1
4   Raghu   15   34     D
6   Nancy   14   87     A2
11  Mathew  16   93     A1
5   Mathew  16   77     B1
8   Kumar   15   45     C2
3   Jack    17   74     B1
7   Bhavya  16   64     B2
0   Aashna  16   87     A2
9   Aashna  17   68     B2
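The three sort_values() variants above can be checked on a small, hypothetical DataFrame:

```python
import pandas as pd

dfS = pd.DataFrame({'Name': ['Somya', 'Aashna', 'Mathew'],
                    'Score': [64, 87, 93]})

by_name = dfS.sort_values('Name')                # ascending by default
print(by_name.Name.tolist())   # ['Aashna', 'Mathew', 'Somya']

by_score = dfS.sort_values('Score', ascending=False)
print(by_score.Name.tolist())  # ['Mathew', 'Aashna', 'Somya']

multi = dfS.sort_values(by=['Name', 'Score'], ascending=False)
print(multi.Name.tolist())     # ['Somya', 'Mathew', 'Aashna']
```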
Sort by Index
In the previous section, we created a DataFrame called dfN. Let us sort the DataFrame dfN by index.
>>> dfN = dfN.sort_index()
>>> dfN
It will produce the following output:
    Name    Age  Score  Grade
0   Aashna  16   87     A2
1   Somya   15   64     B2
2   Ronald  16   58     C1
3   Jack    17   74     B1
4   Raghu   15   34     D
5   Mathew  16   77     B1
6   Nancy   14   87     A2
7   Bhavya  16   64     B2
8   Kumar   15   45     C2
9   Aashna  17   68     B2
10  Somya   16   92     A1
11  Mathew  16   93     A1
>>> dfN.sort_index(ascending=False)
It will produce the following output:
    Name    Age  Score  Grade
11  Mathew  16   93     A1
10  Somya   16   92     A1
9   Aashna  17   68     B2
8   Kumar   15   45     C2
7   Bhavya  16   64     B2
6   Nancy   14   87     A2
5   Mathew  16   77     B1
4   Raghu   15   34     D
3   Jack    17   74     B1
2   Ronald  16   58     C1
1   Somya   15   64     B2
0   Aashna  16   87     A2
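A minimal sketch of sort_index() on a DataFrame whose rows are out of index order (hypothetical data):

```python
import pandas as pd

dfN = pd.DataFrame({'Name': ['Ronald', 'Aashna', 'Somya'],
                    'Score': [58, 87, 64]},
                   index=[2, 0, 1])

asc = dfN.sort_index()               # rows re-ordered by ascending index
print(asc.index.tolist())    # [0, 1, 2]

desc = dfN.sort_index(ascending=False)
print(desc.index.tolist())   # [2, 1, 0]
```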
Points to Remember
1. The pivot() method creates a new table whose row and column indices are the unique values of the respective parameters.
2. The pivot() method is used for pivoting without aggregation.
3. Stacking a DataFrame means moving (also rotating or pivoting) the innermost column index to become the innermost row index.
4. A pivot table is a powerful tool to calculate, summarize, and analyze data that lets you see comparisons, patterns, and trends in your data.
5. The Pandas sort_values() function sorts a data frame in ascending or descending order of the passed column(s).
6. Using the sort_index() method, a DataFrame can be sorted by passing the axis argument and the order of sorting.
SOLVED EXERCISES
1. Define the pivot() method.
Ans. The pivot() method reshapes data (produces a "pivot" table) based on column values. The pivot() method takes a maximum of 3 arguments with the following names: index, columns, and values.
2. Define the pivot_table() method.
Ans. The pivot_table() method creates a spreadsheet-style pivot table as a DataFrame. It allows transforming data columns into rows and rows into columns.
3. What is the difference between pivot() and pivot_table()?
Ans. • pivot() raises an error when there are duplicate values for an index/column pair, whereas pivot_table() accepts duplicate values and aggregates them.
• pivot() allows both numeric and string types as "values=", whereas pivot_table() only allows numerical types as "values=".
• pivot() is used for pivoting without aggregation, whereas pivot_table() works with duplicate values by aggregating them.
    ItemName     Color   Price
2   Pencil       Blue     5.5
3   Ball Pen     Green   10.5
4   Gel Pen      Green   11.0
5   Notebook     Red     15.5
6   Ball Pen     Green   11.5
7   Highlighter  Blue     8.5
8   Gel Pen      Red     12.5
9   P Marker     Blue     8.6
10  Pencil       Green   11.5
11  Ball Pen     Green   10.5
Answer the following questions:
(a) Using the above table, create a DataFrame called df.
(b) Create a pivot table to display item name wise items from DataFrame df.
(c) Create a table to display item name and item number wise price for all rows.
(d) Create a table to display item name and item number wise price for all colors.
(e) Create a table to display item name and item number wise sum of color values.
(f) Create a table to display item name and item number wise total price of all color items along the rows and columns.
(g) Create a table to display item name wise total price of all color items.
Ans. (a) For data:
df = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Items.csv')
Or
data = {
'ItemName': ['Ball Pen', 'Pencil', 'Ball Pen', 'Gel Pen', 'Notebook', 'Ball Pen', 'Highlighter',
1	GBC P House	South	2017-08-23	1359000	August 2017
2	S Books Store	North	2016-10-11	1670000	October 2016
3	TM Books	West	2019-08-25	1490000	August 2019
4	IND Books Distributors	North	2017-09-04	1560000	September 2017
5	Aniket Pustak	West	2018-05-17	1180000	May 2018
6	M Pustak Bhandar	South	2018-11-28	2100000	November 2018
7	BOOKWELL Distributors	North	2017-01-22	1630000	January 2017
8	Jatin Book Agency	West	2016-12-21	1380000	December 2016
9	New India Agency	South	2018-09-12	1730000	September 2018
10	Libra Books Distributors	East	2016-10-04	1210000	October 2016
Answer the following questions:
(a) Find the region wise average sales for each year.
(b) Find the year wise average sales for each region.
(c) Find the year wise total sales for each region.
(d) Find the year wise maximum and minimum sales for each region.
Ans. For data:
print (pv1.stack())
Sales
Region Year
6. A data set is given with the sales of two products in four different regions.
Region	Year	Product	Units Sold
Southeast	2018	Air Purifier	87
Northwest	2019	Air conditioner	165
Southwest	2019	Air Purifier	122
Northeast	2019	Air conditioner	132
Southeast	2018	Air conditioner	98
Northeast	2019	Air Purifier	120
Northwest	2018	Air Purifier	137
Southeast	2019	Air conditioner	83
Northwest	2018	Air Purifier	128
Northwest	2019	Air conditioner	149
Southwest	2018	Air Purifier	167
Northeast	2018	Air conditioner	139
(a) Create a pivot table to summarize the data into region and product wise total sales.
(b) Print the summary report.
Region Product
Name	Position	City	Age	Sex
Dinesh	Programmer	Delhi	31	Male
Akshya	Manager	Delhi	26	Male
Megha	Manager	Mumbai	30	Female
Hemant	Manager	Kolkata	28	Male
Using the above data set, answer the following:
(a) Create a pivot table to print the average age for each position and city.
(b) Create a pivot table to print the average age for each position, by City and Sex.
(c) Create a pivot table to print the average age for each position and sex.
(d) What will be the output of the following:
import numpy as np
df.pivot_table(index='Position', aggfunc={'Age': np.mean})
Ans. For data:
df = pd.read_csv('E:/IPSource_XII/IPXIIChap02/EJob.csv')
(a) print (df.pivot_table(index='Position', columns='City', values='Age'))
(b) print (df.pivot_table(index='Position', columns=['City','Sex'], values='Age'))
(c) print (df.pivot_table(index=['Position','Sex'], columns='City', values='Age'))
(d) Age
Position
Manager 30.166667
Programmer 33.750000
REVIEW QUESTIONS
4. A sample dataset is given with different columns as given below:
Item_ID	ItemName	Manufacturer	Price	CustomerName	City
PC01	Personal Computer	HCL India	42000	N Roy	Delhi
LC05	Laptop	HP USA	55000	H Singh	Mumbai
PC03	Personal Computer	Dell USA	32000	R Pandey	Delhi
PC06	Personal Computer	Zenith USA	37000	C Sharma	Chennai
LC03	Laptop	Dell USA	57000	K Agarwal	Bengaluru
AL03	Monitor	HP USA	9800	S C Gandhi	Delhi
CS03	Hard Disk	Dell USA	5400	B S Arora	Mumbai
PR06	Motherboard	Intel USA	17500	A K Rawat	Delhi
BA03	UPS	Microtek India	4300	C K Naidu	Chennai
MC01	Monitor	HCL India	6800	P N Ghosh	Bengaluru
Write the command for the following (assume that the DataFrame name is dfA):
(a) Create a city wise customer table.
(b) Create a pivot table for manufacturer wise item names and their price.
(c) Arrange the DataFrame in ascending order of CustomerName.
Jaya Priya	282100	4	Kerala
Ryma Sen	369000	4	West Bengal
R Sahay	233100	4	Delhi
Write the command for the following (assume that the DataFrame name is dfQ):
(a) Find the total sales of each employee.
(b) Find the total sales by state.
(c) Find the total sales both employee wise and state wise.
(d) Find the maximum individual sale by state.
(e) Find the mean, median and minimum sales by state.
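One way to sketch these commands with pivot_table(), on a stand-in DataFrame (the column names Name, Sales and State, and the sample rows, are assumptions, since the exercise's full data and headers are not shown above):

```python
import pandas as pd

# Stand-in for dfQ; column names and rows are illustrative assumptions
dfQ = pd.DataFrame({'Name': ['Jaya Priya', 'Ryma Sen', 'R Sahay', 'Ryma Sen'],
                    'Sales': [282100, 369000, 233100, 50000],
                    'State': ['Kerala', 'West Bengal', 'Delhi', 'West Bengal']})

print(dfQ.pivot_table(index='Name', values='Sales', aggfunc='sum'))            # (a)
print(dfQ.pivot_table(index='State', values='Sales', aggfunc='sum'))           # (b)
print(dfQ.pivot_table(index=['Name', 'State'], values='Sales', aggfunc='sum')) # (c)
print(dfQ.pivot_table(index='State', values='Sales', aggfunc='max'))           # (d)
print(dfQ.pivot_table(index='State', values='Sales',
                      aggfunc=['mean', 'median', 'min']))                      # (e)
```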
Chapter – 3
Aggregation/Descriptive Statistics in Pandas
3.1 Introduction
Python is a great language for data analysis, and Python pandas supports a number of data aggregation functions to analyze data. An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number summarizes a large dataset. To demonstrate the aggregation functions, let us create a DataFrame from the Student.csv file with the following data:
>>> import pandas as pd
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Student.csv')
>>> df
If you carefully observe the above DataFrame (df), you will find that a number of column values are NaN. Remember that the NaN value represents a null value or None. In this chapter, we use the above DataFrame (df) to demonstrate the aggregation/descriptive statistics functions.
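If the Student.csv file is not available, a stand-in DataFrame with the same column layout can be built directly. The names and marks below are illustrative placeholders, not the textbook's actual data:

```python
import pandas as pd
import numpy as np

# Stand-in for Student.csv: same columns, placeholder values with some NaNs
df = pd.DataFrame({
    'Student_Name': ['Amit', 'Divya', None, 'Rohan'],
    'Age':    [15, 16, 16, np.nan],
    'Gender': ['M', 'F', 'F', 'M'],
    'Test1':  [7.2, np.nan, 8.1, 6.5],
    'Test2':  [8.5, 7.7, np.nan, 9.0],
    'Test3':  [7.6, 8.8, 8.8, np.nan],
})

# Non-null counts per column: Student_Name 3, Age 3, Gender 4,
# Test1 3, Test2 3, Test3 3 (NaN/None entries are not counted)
print(df.count())
```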
In the last chapter, we used a number of aggregate functions with pivot_table(). Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value: an aggregation function takes multiple values as input and returns a single value. The dataset is either a Series or a DataFrame. Table 3.1 shows the most commonly used built-in Pandas aggregation functions.
64 Saraswati Informatics Practices XII
Aggregation	Description
count()	Total number of items
sum()	Calculate the sum of a given set of numbers
mean()	Calculate the arithmetic mean or average of a given set of numbers
median()	Calculate the median or middle value of a given set of numbers
mode()	Calculate the mode or most repeated value of a given set of numbers
max()	Find the maximum value of a given set of numbers
min()	Find the minimum value of a given set of numbers
std()	Calculate the standard deviation of a given set of numbers
var()	Calculate the variance of a given set of numbers
DataFrame.count() Function
Pandas DataFrame.count() is used to count the number of non-null observations across the given axis of a DataFrame or a Series. It works with non-floating type data as well. It returns a Series (or a DataFrame if level is specified). The syntax is:
DataFrame.count(axis=0, level=None, numeric_only=False)
Here,
• axis. {0 or ‘index’, 1 or ‘columns’}, default 0.
• level. If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame.
Example 1 Count the number of items of the series S given below:
>>> S = pd . Series ( [ 5 , 10 , 15 , 20 , 25 ] ) # A numeric series
>>> S.count()
5
Example 2 Count the number of non-null values across the row axis for DataFrame df.
>>> df.count()
ew
Student_Name 6
Age 6
N
Gender 6
Test1 6
@
Test2 5
Test3 4
dtype: int64
As you have seen above, the NaN values are not counted in the above series output.
Example 3 Count the number of non-null observations in column Age of DataFrame df.
>>> df.Age.count()
Or
>>> df['Age'].count()
which prints: 6
Example 4 Count the number of non-null values across the column axis for DataFrame df.
>>> df.count(axis=1)
Or
>>> df.count(axis='columns')
which will print the following:
0 6
1 0
2 4
3 6
4 6
5 5
6 6
dtype: int64
DataFrame.sum() Function
Pandas DataFrame.sum() function is used to add all of the values in a particular column of a DataFrame (or a Series). For a DataFrame, by default, the axis is index (axis=0).
Example 1 Find the sum of all the values of the series S given below:
>>> S = pd . Series ( [ 5 , 10 , 15 , 20 , 25 ] ) # A numeric series
>>> S.sum()
75
Example 2 Find the sum of the non-null values across the row axis for DataFrame df.
Test1 45.5
Test2 41.8
Test3 33.1
dtype: float64
Example 3 Find the sum of the non-null values across the column axis for DataFrame df.
>>> df.sum(axis=1)
0 39.7
1 0.0
2 24.6
3 40.2
4 37.4
5 34.2
6 38.3
dtype: float64
Example 4 Find the sum of Test1, Test2, and Test3 across the column axis for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].sum(axis=1)
0 23.7
1 0.0
2 8.6
3 23.2
4 22.4
5 18.2
6 24.3
dtype: float64
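The effect of the axis argument can be sketched with a small, hypothetical marks table:

```python
import pandas as pd
import numpy as np

# Hypothetical marks table to contrast the two axes
marks = pd.DataFrame({'Test1': [5.0, 6.0, np.nan],
                      'Test2': [7.0, np.nan, 8.0]})

print(marks.sum())        # down each column: Test1 11.0, Test2 15.0
print(marks.sum(axis=1))  # across each row: 12.0, 6.0, 8.0 (NaN skipped)
```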
DataFrame.mean() Function
This function is used to find the arithmetic mean of a set of data, which is obtained by taking the sum of the data and then dividing the sum by the total number of values in the set. A mean is commonly referred to as an average. The mean() function in Python pandas is used to calculate the arithmetic mean of a given Series, of a DataFrame, of a column, or of rows. The syntax is:
DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None)
The parameters of the mean() function are the same as those of the count() function, but the skipna parameter is used to exclude NA/null values when computing the result. The default value is True.
Example 1 Find the mean of the series S given below:
>>> S.mean()
15.0
Example 2 Find the mean of the non-null values across the row axis for DataFrame df.
>>> df.mean()
Age 15.666667
Test1 7.583333
Test2 8.360000
Test3 8.275000
dtype: float64
Example 3 Calculate the mean of specific numeric columns like (Test1, Test2, Test3) for DataFrame
df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean()
Test1 7.583333
Test2 8.360000
Test3 8.275000
dtype: float64
Example 4 Calculate the mean of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df. Also display the result in 2 decimal format.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean(axis=1).round(decimals=2)
0 7.90
1 NaN
2 8.60
3 7.73
4 7.47
5 9.10
6 8.10
dtype: float64
Example 5 Calculate the mean of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df without skipping null values (skipna=False). Also display the result in 2 decimal format.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean(axis=1, skipna=False).round(decimals=2)
0 7.90
1 NaN
2 NaN
3 7.73
4 7.47
5 NaN
6 8.10
dtype: float64
From the above result, notice that when the mean operation is performed, any row that contains a NaN value results in NaN.
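The skipna behaviour can be sketched with a tiny series (hypothetical values):

```python
import pandas as pd
import numpy as np

s = pd.Series([4.0, np.nan, 8.0])   # hypothetical values with one NaN

print(s.mean())               # 6.0: NaN excluded (skipna=True is the default)
print(s.mean(skipna=False))   # nan: a single NaN makes the result NaN
```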
DataFrame.median() Function
The median of a set of data is the middlemost number in the set. The median is also the number that is halfway into the set. In Python pandas, the median() function is used to calculate the median or middle value of a given set of numbers. The dataset can be a Series, a DataFrame, a column or rows. To find the median, the dataset should first be arranged in order from least to greatest; Python pandas' median() takes care of this itself. The syntax is:
DataFrame.median(axis=None, skipna=None, level=None, numeric_only=None)
For example, look at the following set of numbers:
5 10 15 20 25
The above sequence is in sorted order and the middle value is 15.
Similarly, look at another set of numbers given below:
22 5 10 4 17 15 20 25 18 11
which is 10 unsorted numbers; ordering the data from least to greatest, we get:
4 5 10 11 15 17 18 20 22 25
Since there is an even number of items in the data set, we compute the median by taking the mean of the two middlemost numbers, i.e., (15 + 17) / 2 = 16.
Example 1 Find the middlemost number in the series S given below:
>>> S = pd . Series ( [ 5 , 10 , 15 , 20 , 25 ] ) # A numeric series
>>> S.median()
15.0
Example 2 Find the middlemost number in the unsorted series X given below:
>>> X = pd . Series ( [ 22 , 5 , 10 , 4 , 17 , 15 , 20 , 25 , 18 , 11 ] )
Note. If you want to sort the series, you can use the following command:
>>> X = X.sort_values()
>>> X.median()
16.0
Example 3 Find the median of the numeric columns of DataFrame df.
>>> df.median()
Age 16.00
Test1 7.20
Test2 8.50
Test3 8.35
dtype: float64
Example 4 Find the median of specific columns of DataFrame df.
>>> df['Test1'].median() # median of a single column
7.199999999999999
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].median() # median of columns
Test1 7.20
Test2 8.50
Test3 8.35
dtype: float64
Example 5 Calculate the median of columns of Test1, Test2, and Test3 row-wise for the DataFrame
df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].median(axis=1)
which will display the following:
0 7.6
1 NaN
2 8.6
3 7.9
4 7.7
5 9.1
6 8.7
dtype: float64
DataFrame.mode() Function
The mode of a set of data is the value in the set that occurs most often. The mode() function in Python pandas is used to calculate the mode or most repeated value of a given set of numbers. The mode can be found in a Series, a DataFrame, a column or a row. If more than one number occurs equally often, then all such numbers are modes. The syntax is:
DataFrame.mode(axis=0, numeric_only=False, dropna=True)
Example 1 Find the number that occurs most often in the series X given below:
>>> X.mode()
dtype: int64
Example 2 Find the number that occurs most often in the series Y given below:
>>> Y = pd . Series ( [ 22, 5 , 22 , 4, 7, 5 , 22, 5] )
>>> Y.mode()
0 5
1 22
dtype: int64
Example 3 Calculate the mode of DataFrame df.
>>> df.mode()
Student_Name Age Gender Test1 Test2 Test3
Example 4 Find the most repeated value for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].mode()
0 16.0
dtype: float64
Example 5 Find the repeated values of specific columns (Test1, Test2, Test3) of DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mode()
Test1 Test2 Test3
0 6.8 7.7 8.8
1 NaN 7.9 NaN
2 NaN 8.5 NaN
3 NaN 8.7 NaN
4 NaN 9.0 NaN
Example 6 Find the repeated values row-wise across columns (Test1, Test2, Test3) of DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mode(axis=1)
0 1 2
0 7.6 NaN NaN
1 NaN NaN NaN
DataFrame.max() and DataFrame.min() Functions
The max() function finds the maximum value from a column of a DataFrame or a Series. The min() function finds the minimum value from a column of a DataFrame or a Series. The syntax of the max() and min() functions is:
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None)
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None)
Example 1 Find the maximum value of the series S given below:
>>> S.max()
25
Example 2 Find the minimum value of the series S given below:
>>> S.min()
5
Example 3 Find the maximum values of DataFrame df.
>>> df.max() # get the maximum values of all numeric columns
Age 17.0
Test1 9.2
Test2 9.0
Test3 8.8
dtype: float64
Example 4 Find the minimum values of DataFrame df.
>>> df.min() # get the minimum values of all numeric columns
Age 14.0
Test1 6.5
Test2 7.7
Test3 7.6
dtype: float64
Example 5 Find the maximum value for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].max()
17.0
Example 6 Find the minimum age value of DataFrame df.
>>> df['Age'].min()
14.0
Example 7 Find the maximum values of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].max(axis=1)
0 8.5
1 NaN
2 8.6
3 8.8
4 7.9
5 9.2
6 8.8
dtype: float64
Example 8 Find the minimum values of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].min(axis=1)
0 7.6
1 NaN
2 8.6
3 6.5
4 6.8
5 9.0
6 6.8
dtype: float64
DataFrame.std() Function
This function is used to calculate the standard deviation of a given set of numbers. The standard deviation can be of a Series, a DataFrame, a column or a row. The syntax is:
DataFrame.std(axis=None, skipna=None, level=None, numeric_only=None)
Example 1 Find the standard deviation of the series S given below:
>>> S = pd . Series ( [ 5 , 10 , 15 , 20 , 25 ] ) # A numeric series
>>> S.std()
7.905694150420948
Example 2 Find the standard deviation of the numeric columns of DataFrame df.
>>> df.std() # get the standard deviation of all numeric columns
Age 1.032796
Test1 1.099848
Test2 0.545894
Test3 0.618466
dtype: float64
Example 3 Find the standard deviation for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].std()
Example 4 Find the standard deviation of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].std(axis=1)
0 0.519615
1 NaN
2 NaN
3 1.159023
4 0.585947
5 0.141421
6 1.126943
dtype: float64
DataFrame.var() Function
This function is used to calculate the variance of a given set of numbers. The variance can be of a Series, a DataFrame, a column or a row. The syntax is:
DataFrame.var(axis=None, skipna=None, level=None, numeric_only=None)
Example 1 Find the variance of the series S given below:
>>> S.var()
62.5
Example 2 Find the variance of the numeric columns of DataFrame df.
>>> df.var() # get the variance of all numeric columns
Age 1.066667
Test1 1.209667
Test2 0.298000
Test3 0.382500
dtype: float64
Example 3 Find the variance for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].var()
1.0666666666666669
Example 4 Find the variance of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].var(axis=1)
0 0.270000
1 NaN
2 NaN
3 1.343333
4 0.343333
5 0.020000
6 1.270000
dtype: float64
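Since the variance is the square of the standard deviation, the two functions can be cross-checked on the series S used in the examples:

```python
import pandas as pd

S = pd.Series([5, 10, 15, 20, 25])  # the series used in the examples

std = S.std()   # sample standard deviation (pandas uses ddof=1 by default)
var = S.var()   # sample variance (also ddof=1)

print(var)                          # 62.5
print(abs(std ** 2 - var) < 1e-9)   # True: var is the square of std
```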
State	Sales
Goa	650000
Delhi	692400
Odisha	750000
Haryana	867000
Bihar	920000
Kerala	939000
Tamil Nadu	1015000
West Bengal	1553000
Maharashtra	2176000
Write commands for the following (assume that the DataFrame name is dfA):
(a) Count the number of observations in the DataFrame.
(b) Count the number of observations in column State of the DataFrame.
(c) Find the sum of the non-null Sales values for the DataFrame.
(d) Calculate the mean of the Sales column in the DataFrame.
(e) Add a Commission column (i.e., 4% of Sales) into the DataFrame dfA.
Commission = Sales * 0.04
(f) Calculate the mean of all numeric columns in the DataFrame.
(g) Calculate the median of the Sales column.
(h) Find the maximum sales and commission values.
(i) Find the minimum sales and commission values.
(j) Find the standard deviation of commission.
Solution For data: dfA = pd.read_csv('E:/IPSource_XII/IPXIIChap03/State.csv')
(a) dfA.count()
(b) dfA.State.count()
(c) dfA.Sales.sum()
(d) dfA.Sales.mean()
(e) dfA['Commission'] = dfA['Sales']*.04
(f) dfA.mean().round(decimals=2)
(g) dfA['Sales'].median()
(h) dfA.max()
(i) dfA.min()
(j) dfA.Commission.std()
In statistics, three commonly used terms are quartile, quantile and percentile. All these produce approximate results. These terms are used in different statistical calculations over a list of numbers.
Quartiles
Quartiles in statistics are values that divide your data into quarters. The four quarters divide a data set into four equal parts.
Let us find the quartiles for the following numbers:
3, 6, 21, 11, 18, 13, 8, 15, 36, 29, 32, 16
To find the quartiles:
Step 1: Put the numbers in order: 3, 6, 8, 11, 13, 15, 16, 18, 21, 29, 32, 36
Step 2: The total count of numbers in the list is 12.
Step 3: Divide the total count by 4 to cut the list of numbers into quarters, i.e., 12 / 4 = 3
Step 4: There are 12 numbers in the list, so you would have 3 numbers in each quarter and these are:
1st Quarter	2nd Quarter	3rd Quarter	4th Quarter
3, 6, 8	11, 13, 15	16, 18, 21	29, 32, 36
Percentile
Percentile is a number where a certain percentage of scores fall below that number. Percentiles are commonly used to report scores in tests, like the SAT, GATE and LSAT. For example, in an examination, if you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank.
Let us see the test scores of 10 students, ordered by rank.
Score	35	39	45	55	63	67	75	82	86	95
Rank	1	2	3	4	5	6	7	8	9	10
To find where the 50th percentile lies in the above list, let us calculate its rank: R = (50/100) × (10 + 1) = 5.5. However, there is no rank of 5.5, so we must round it up to 6. This equals a score of 67.
Similarly, to find the 75th percentile, R = (75/100) × (10 + 1) = 8.25, which is rounded to 8. This equals a score of 82.
You can also calculate your percentile by using the following formula, if you know your rank:
P = ((N – Your rank) / N) * 100
Here, P is the percentile and N is the total number of candidates who appeared for the exam. (N – Your rank) indicates the number of candidates who have scored less than you.
For example, if 50 students are there in your class and your rank is 3, then the percentile will be calculated as:
P = ((50 – 3) / 50) * 100 = 94.0
Percentile rank is the proportion of values in a distribution that a particular value is greater than or equal to. For example, if a pupil is taller than or as tall as 79% of his classmates, then the percentile rank of his height is 79, i.e., he is in the 79th percentile of heights in his class.
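The rank-to-percentile formula translates directly into a small helper function (percentile_from_rank is a hypothetical name, not a pandas function):

```python
# Hypothetical helper implementing P = ((N - rank) / N) * 100
def percentile_from_rank(n, rank):
    """Percentage of the n candidates who scored below the given rank."""
    return (n - rank) / n * 100

print(percentile_from_rank(50, 3))   # 94.0, the worked example above
print(percentile_from_rank(10, 1))   # 90.0, rank 1 among 10 candidates
```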
Quantiles
The quantiles/percentiles/fractiles of a list of numbers are statistical values that partially illustrate the distribution of numbers in the list. Quantiles are points in a distribution that relate to the rank order of values in that distribution. For a sample, you can find any quantile by sorting the sample. The middle value of the sorted sample (middle quantile, i.e., 50th percentile) is known as the median.
The median is a special case of a quantile: it is the 50% quantile. It is usually estimated as follows:
- from an odd number of values, take the middle value;
- from an even number, take the average of the two midmost values.
For example, in the given set of seven numbers:
15, 16, 18, 27, 29, 32, 36
the middle value 27 is the median, i.e., the 50% quantile of the data points.
The limits are the minimum and maximum values; any other locations between these points can be described as quantiles. Centiles/percentiles are descriptions of quantiles relative to 100; so the 75th percentile (upper quartile) is 75% or three quarters of the way up an ascending list of sorted values of a sample, and the 25th percentile (lower quartile) is one quarter of the way up.
Certain types of quantiles, such as quartiles (4 groups), deciles (10 groups) and percentiles (100 groups), are used commonly enough to have specific names.
The pandas quantile() function returns float or Series values at the given quantile over the requested axis, similar to numpy.percentile. A list of probabilities can be given for which quantiles should be computed. For example,
• if percentile = 25, then it is the first quartile or lower quartile (LQ). The 0.25 quantile is basically saying that the 25th percentile (i.e., 25%) of the observations in the data set is below a given line.
• if percentile = 50, then it is in the second quartile or the median.
• if percentile = 75, then it is in the third quartile or upper quartile (UQ).
The syntax is:
DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
Here,
• q is float or array-like, default 0.5 (50% quantile); 0 <= q <= 1, the quantile(s) to compute.
− If q is an array, a DataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.
− If q is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.
• axis: {0 or ‘index’, 1 or ‘columns’}, default 0. If 0 or ‘index’, quantiles are computed for each column; if 1 or ‘columns’, for each row.
• numeric_only: boolean, default True. If False, the quantile of datetime and timedelta data will be computed as well.
• interpolation: specifies how to compute the quantile when the desired quantile lies between two data points i and j.
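The difference between passing a float and an array as q can be sketched as follows (hypothetical data):

```python
import pandas as pd

# Hypothetical data to contrast q as a float with q as an array
s = pd.Series([1, 2, 3, 4, 5])
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 30, 40, 50]})

print(s.quantile(0.5))                  # 3.0 (a plain float for a Series)
print(type(df.quantile(0.5)))           # a Series: index is the column names
print(type(df.quantile([0.25, 0.75])))  # a DataFrame: index is the q values
```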
Example 1 Find the quantiles of the series S given below:
>>> S = pd.Series([15, 16, 18, 27, 29, 32, 36])
>>> S.quantile([.25, .5, .75])
0.25 17.0
0.50 27.0
0.75 30.5
dtype: float64
Example 2 Find the quantiles of an odd series P given below:
>>> P = pd.Series([1, 10, 11, 100, 101, 110, 111])
>>> P.quantile([.25, .5, .75])
0.25 10.5
0.50 100.0
0.75 105.5
dtype: float64
Try this:
import numpy as np
x = [1, 10, 11, 100, 101, 110, 111]
print (np.percentile(x, [25, 50, 75]))
which prints:
[ 10.5 100. 105.5]
Example 4 Find the quantiles for the following series y:
>>> y = pd.Series([1, 2, 3, 4, 5, 6])
>>> y.quantile([.25, .5, .75])
0.25 2.25
0.50 3.50
0.75 4.75
dtype: float64
Try this:
import numpy as np
y = np.array([1, 2, 3, 4, 5, 6])
print (np.percentile(y, [25, 50, 75]))
which prints:
[2.25 3.5 4.75]
Example 5 Find the 0.25 quantile of DataFrame df.
>>> df.quantile(.25)
Age 15.250
Test1 6.800
Test2 7.900
Test3 7.825
Name: 0.25, dtype: float64
Example 7 Find the (0.05, 0.25, 0.5, 0.75, 0.95) quantiles of DataFrame df.
>>> quants = [0.05, 0.25, 0.5, 0.75, 0.95]
>>> q = df.quantile(quants)
>>> print (q)
DataFrame.describe() Function
The describe() function is useful when you are working with numeric columns. You can use .describe() to see a number of basic statistics about a column, such as the mean, min, max, and standard deviation.
• Generally, the describe() function excludes the character columns and gives summary statistics of the numeric columns.
• We need to add a parameter named include='all' to get the summary statistics or descriptive statistics of both numeric and character columns.
• For describing a DataFrame, by default only numeric fields are returned.
The describe() function on a Pandas DataFrame lists 8 statistical properties of each attribute, all of which also appear among the aggregation functions seen earlier. These are:
• count: Total number of items
• mean: The arithmetic mean or average of a given set of numbers
• std: The standard deviation of a given set of numbers
• min: The minimum value of a given set of numbers
• 25th Percentile
• 50th Percentile (Median)
• 75th Percentile
• max: The maximum value of a given set of numbers
When these methods are called on a DataFrame, they are applied over each row/column as specified, and the results are collated into a Series. Missing values are ignored by default by these methods. Calling the describe() function on categorical data returns summary information about the Series that includes the count, the number of unique values, the top (most common) value, and its frequency. The syntax is:
DataFrame.describe(percentiles=None, include=None, exclude=None)
Here,
• percentiles. The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .50, .75], which returns the 25th, 50th, and 75th percentiles.
• include. It is used to pass necessary information regarding what columns need to be considered for summarizing. The default is None, which means the result will include all numeric columns.
• exclude. Sometimes you do not need to include a column in the description, so mention the exclude option with the describe() function. The default is None.
For numeric data, the result’s index will include count, mean, std, min, max as well as the lower, 50 and upper percentiles. For object data (e.g., strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value, chosen from among those with the highest count. The freq is the most common value’s frequency. Timestamps also include the first and last items.
For example, let us implement this with two different series data: numeric and string. For a numeric series:
>>> S = pd . Series ( [ 5 , 10 , 15 , 20 , 25 ] ) # A numeric series
>>> S.describe()
count 5.000000
mean 15.000000
std 7.905694
min 5.000000
25% 10.000000
50% 15.000000
75% 20.000000
max 25.000000
dtype: float64
For a string series:
>>> str1 = pd.Series(['a', 'a', 'b', 'b', 'c', 'd']) # A string series
>>> str1.describe()
count 6
unique 4
top a
freq 2
dtype: object
Let us use the describe() function to summarize only the numeric fields of the previous DataFrame (df).
>>> df.describe()
You can also analyse the descriptive statistics of a single column of a DataFrame. For example,
>>> df['Age'].describe()
count 6.000000
mean 15.666667
std 1.032796
min 14.000000
25% 15.250000
50% 16.000000
75% 16.000000
max 17.000000
Name: Age, dtype: float64
To summarize all columns of DataFrame (df) regardless of data type:
>>> df.describe(include='all')
Student_Name Age Gender Test1 Test2 Test3
count 6 6.000000 6 6.000000 5.000000 4.000000
unique 6 NaN 2 NaN NaN NaN
9 Manali Sovani 452
Write commands for the following:
(a) Find the default quantile of the DataFrame.
(b) Find the [.25, .50, .75] quantiles of the DataFrame.
(c) Write the summary of statistics pertaining to the DataFrame columns.
(d) Get the full summary of statistics of the DataFrame.
(e) Find the 50th percentile of the DataFrame.
Solution For data: dfS = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Std7.csv')
(a) dfS.quantile()
(b) dfS.quantile([.25, .5, .75])
(c) dfS.describe()
(d) dfS.describe(include='all')
(e) dfS.describe(percentiles=[.50])
3.5 Histogram
Histograms are powerful tools for analyzing the distribution of data. A histogram plot is generally used to show the frequency across continuous, numeric or discrete data. It can give the user a clear understanding of how the data points of a dataset fall into each category, along with its median and range of values.
To create a histogram, first, we have to divide the entire range of values into a series of intervals, and second, we have to count how many values fall into each interval. Matplotlib calls those categories or intervals bins. The bins are consecutive and non-overlapping intervals of a variable; they must be adjacent.
• The x-axis represents the bins, i.e., the intervals of values.
• The y-axis represents the “frequency density” or count of the number of observations in each bin.
Matplotlib is the leading visualization library in Python. It is a powerful two-dimensional plotting library for the Python language. Matplotlib is a multi-platform data visualization library built on NumPy arrays. Matplotlib provides a context, one in which one or more plots can be drawn before the image is shown or saved to file; the context can be accessed via functions on pyplot. Matplotlib is capable of creating all manner of graphs, plots, charts, histograms and much more. Before using any method to draw a histogram, you must install the Matplotlib library on your system. If you are using Python 3.6, 3.7 or any later version, then it is easy to install Matplotlib using pip.
After installing Python Matplotlib, you are ready to analyse your data through different plots. To check the version that you installed, you can use the following command in Python interactive mode:
>>> import matplotlib
>>> print (matplotlib.__version__)
3.0.3
3.5.2 Creating Histogram
Creating histograms from Pandas data frames is one of the easiest operations in Pandas and data visualization that you will come across. Pandas DataFrames that contain our data come pre-equipped with methods for creating histograms, making preparation and presentation easy.
We can create histograms from Pandas DataFrames using pandas.DataFrame.hist(), which is a sub-method of pandas.DataFrame.plot(). Pandas uses the Python module Matplotlib to create and render all plots, and each plotting method from pandas.DataFrame.plot takes optional arguments that are passed to the Matplotlib functions. The syntax of the pandas.DataFrame.hist method is:
DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, **kwds)
As shown in the above syntax, the DataFrame.hist() function has many more options, but in this text we use some common options like:
• column: the DataFrame column name (or list of names) to create a histogram for.
• by: If passed, then used to form histograms for separate groups; it takes a column whose values define the groups.
• xlabelsize, ylabelsize: these options change the size of the x and y label text.
• sharex, sharey: to set both of the axes to the same range and scale.
• bins: A bin in a histogram is the block that you use to combine values before getting the frequency. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The default value is 10.
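Before plotting, it can help to see how binning works numerically. The sketch below uses numpy.histogram on the ten test scores from the percentile example earlier in this chapter, divided into 3 equal-width bins:

```python
import numpy as np

# The ten test scores from the percentile example, cut into 3 equal-width bins
marks = [35, 39, 45, 55, 63, 67, 75, 82, 86, 95]
counts, edges = np.histogram(marks, bins=3)

print(edges)   # bin boundaries: [35. 55. 75. 95.]
print(counts)  # observations per bin: [3 3 4]
```

A histogram of this data would therefore show three bars of heights 3, 3 and 4.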
>>> df.hist()
which prints:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x07D02830>,
<matplotlib.axes._subplots.AxesSubplot object at 0x08D42FD0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x09110FD0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x09133650>]],
dtype=object)
This command produced histograms for each of the 4 numeric columns (Age, Test1, Test2 and Test3). Once you have drawn that data or plot, you can then "show" that data. If you are using Matplotlib from within a Python script, you have to add the plt.show() method inside the file to be able to display your plot. The show() command looks for the current active drawing or figure and opens it in an interactive window as shown in Figure 3.1.
>>> plt.show()
Figure 3.1 Histograms for all the numeric columns of df.
If we wish to only examine a subset of the features, or even look at only one, then we can specify what we
want to plot using the column parameter of the df.hist() method. The column parameter takes either a
string or a list of strings of column names. For example, to plot a histogram for the Age column:
>>> df.hist(column='Age')
>>> plt.show()
Figure 3.2 A histogram with the most common age group.
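Under the hood, each histogram is just a count of values per bin. The following sketch (with made-up Age values, not the book's data set) reproduces the default 10-bin split that df.hist() draws, using numpy so that no plotting window is needed:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the chapter's Age column
df = pd.DataFrame({"Age": [14, 15, 15, 16, 14, 15, 17, 16]})

# numpy.histogram performs the same binning that df.hist() uses by default (bins=10)
counts, edges = np.histogram(df["Age"], bins=10)

print(len(counts))    # 10 bins
print(counts.sum())   # every one of the 8 values falls into exactly one bin
```

The edges array holds the 11 boundaries of the 10 consecutive, non-overlapping intervals described above.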
3.5.4 Modifying Histogram Bin Sizes
The bins, or bars of the histogram plot, can be adjusted using the bins option. This option can be tuned
depending on the data, to see either subtle distribution trends or broader phenomena. Which bin size to use
heavily depends on the data set, therefore it is important to know how to change this parameter. The
default number of bins is 10. For example, to create a histogram with bin size 5:
>>> df.hist(bins=5)
>>> plt.show()
Figure 3.3 A histogram with 5 bins.
Similarly, to create a histogram with 30 bins:
>>> df.hist(bins=30)
>>> plt.show()
Figure 3.4 A histogram with 30 bins.
To create separate histograms for two chosen columns, say Test1 and Test2:
>>> df.hist(column=["Test1", "Test2"])
>>> plt.show()
Figure 3.5 Separate histograms for the Test1 and Test2 columns.
Here, the histograms are plotted side-by-side for two different features. Notice that the axes are automatically
adjusted by default, so the scales may be different for each Pandas DataFrame histogram. Sometimes we
want both plots to use the same axis range and scale. We can do this with the sharex and sharey options.
These options accept Boolean values, which are False by default. If these options are set to True, then the
respective axis range and scale is shared between plots. For example, let us apply this feature with the
previous histogram command:
>>> df.hist(column=["Test1", "Test2"], sharex=True) # Share only x axis
>>> plt.show()
Figure 3.6 Histogram set X-axis to the same range.
>>> df.hist(column=["Test1", "Test2"], sharex=True, sharey=True) # Share x and y axis
>>> plt.show()
Figure 3.7 Histogram set both of the axes to the same range and scale.
Be careful when comparing histograms this way. The range over which bins are set in the Test1 data is
larger than that in the Test2 data, leading to wider bars in Test1 than in Test2. The result is that while both
plots have the same number of data points, Test1 appears “larger” because of the default bar widths.
Recall the DataFrame df which had 6 columns of data. To only plot the Test1, Test2 and Test3 data in
one plot, the command is:
>>> df[["Test1", "Test2", "Test3"]].plot.hist() # Note slicing is performed on df itself
>>> plt.show()
Figure 3.8 Histogram with multiple features on one plot.
This code snippet plotted the three histograms on the same plot, but the second and third plots “block”
the view of the first. We can solve this problem by adjusting the alpha transparency option, which takes a
value in the range [0,1], where 0 is fully transparent and 1 is fully opaque:
>>> df[["Test1", "Test2", "Test3"]].plot.hist(alpha=0.5)
>>> plt.show()
Figure 3.9 Histogram with multiple features using alpha transparency.
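A minimal, self-contained sketch of the alpha idea, using made-up scores (not the book's data) and the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Test1": [60, 65, 70, 75],
                   "Test2": [55, 60, 80, 85],
                   "Test3": [70, 72, 74, 90]})

# alpha=0.5 makes each of the three overlapping series half-transparent
ax = df.plot.hist(alpha=0.5)
ax.figure.savefig("overlap_hist.png")
print(len(ax.patches))         # one rectangle per bin per column (10 bins x 3 columns)
```

Saving the figure instead of calling plt.show() is handy inside scripts and automated jobs.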
For example, let us create a histogram for the previous DataFrame df using the plot() method:
>>> df.plot(kind='hist')
>>> plt.show()
Figure 3.10 Histogram using plot() method.
In the above command, the kind option specifies the plot type. The default is a ‘line’ plot; otherwise,
we can mention ‘bar’, ‘barh’, ‘hist’, ‘box’, ‘area’, ‘pie’, ‘scatter’, etc.
If we wish to only examine a subset of the features, for example, to plot a histogram for Age column using
plot() method, the command is:
>>> df['Age'].plot(kind='hist')
>>> plt.show()
Similarly, to plot a histogram using multiple columns, i.e., Test1, Test2, and Test3 using DataFrame
plot() method, the command is:
>>> df[['Test1', 'Test2', 'Test3' ]].plot(kind='hist')
>>> plt.show()
Figure 3.12 A histogram with multiple columns using plot() method.
Example Using previous DataFrame dfA, write the command to create the following histogram:
(c) Create a histogram plot to show Total Marks with bins [400, 420, 440, 460, 480, 500].
Ans. import pandas as pd
import matplotlib.pyplot as plt
(c) dfA.hist(column="Total Marks", bins=[400, 420, 440, 460, 480, 500])
plt.show()
Or
dfA[['Total Marks']].plot(kind='hist',bins=[400, 420, 440, 460, 480, 500])
plt.show()
Points to Remember
1. Aggregation is the process of turning the values of a dataset (or a subset of it) into one single
value.
2. A data aggregation function accepts multiple values and returns a single value.
3. In Python pandas, the median() function is used to calculate the median or middle value of a
given set of numbers.
4. The mode() function in Python pandas is used to calculate the mode or most repeated value of
a given set of numbers.
5. Quantiles are points in a distribution that relate to the rank order of values in that distribution.
6. The middle value of the sorted sample (middle quantile, i.e., 50th percentile) is known as the
median.
7. The describe() method is used to compute summary statistics for each numerical (default)
column.
8. The plot() is used to quickly visualize the data in different ways.
9. The available plotting types are: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’,
‘hexbin’.
SOLVED EXERCISES
1. What is the use of describe() function?
Ans. The .describe() function is a useful summarisation tool that will quickly display statistics for any
variable or group it is applied to. You can use .describe() to see a number of basic statistics about
the column, such as the mean, min, max, and standard deviation. This can give you a quick overview
of the shape of the data.
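A small runnable sketch of this answer, using made-up ages rather than any data set from the book:

```python
import pandas as pd

df = pd.DataFrame({"Age": [14, 15, 16, 17]})

# describe() returns count, mean, std, min, quartiles and max per numeric column
stats = df.describe()

print(stats.loc["count", "Age"])   # 4.0
print(stats.loc["mean", "Age"])    # 15.5
print(stats.loc["max", "Age"])     # 17.0
```

The result is itself a DataFrame, so individual statistics can be picked out with .loc as shown.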
2. A vector x is given with the following even numbers:
(a) x.quantile(0.50)
(b) x.quantile([.25, .5, .75])
which prints:
0.25 5.25
0.50 8.00
0.75 10.75
dtype: float64
0.75 11.0
dtype: float64
4. A vector A is given with the following even numbers:
15, 20, 32, 60
(a) Find the default quantile of the vector.
(b) Find the [.25, .5, .75] quantiles of the vector.
Ans. A = pd.Series([15, 20, 32, 60])
(a) A.quantile()
(b) A.quantile([.25, .5, .75])
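Running the answer above shows how pandas interpolates between the sorted values:

```python
import pandas as pd

A = pd.Series([15, 20, 32, 60])

# The default is the 0.5 quantile (the median); with 4 values it interpolates
# halfway between the 2nd and 3rd sorted values: 20 + 0.5*(32-20) = 26.0
print(A.quantile())                          # 26.0
print(A.quantile([.25, .5, .75]).tolist())   # [18.75, 26.0, 39.0]
```

The same linear interpolation gives 15 + 0.75*(20-15) = 18.75 for the 0.25 quantile and 32 + 0.25*(60-32) = 39.0 for the 0.75 quantile.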
5. Explain the functions mean() and median() with examples of each using pandas DataFrame.
Ans. A DataFrame called dfW has the following data:
Age Wage Rate
0 20 2.5
1 25 3.5
2 30 4.5
3 35 5.5
4 40 7.0
5 45 8.7
6 50 9.5
7 55 10.0
8 60 12.5
• mean(). The mean() function in Python pandas is used to calculate the arithmetic mean of a
given series or the mean of a DataFrame. For example, to calculate the mean of the Age column:
dfW['Age'].mean()
• median(). In Python pandas, the median() function is used to calculate the median or middle
value of a given set of numbers. For example, to calculate the median of the Wage Rate column:
dfW['Wage Rate'].median()
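The answer above can be run as-is by rebuilding dfW:

```python
import pandas as pd

# Rebuilding the dfW DataFrame shown in the answer
dfW = pd.DataFrame({
    "Age": [20, 25, 30, 35, 40, 45, 50, 55, 60],
    "Wage Rate": [2.5, 3.5, 4.5, 5.5, 7.0, 8.7, 9.5, 10.0, 12.5],
})

print(dfW["Age"].mean())           # 40.0 (sum 360 divided by 9 values)
print(dfW["Wage Rate"].median())   # 7.0 (the 5th of 9 sorted values)
```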
6. For class 12 students, the height and weight of 8 students are given below:
Student Name Height (inch) Weight (kg)
Assume that the above dataset is created with a DataFrame called dfC. Using dfC, answer the
following:
(a) Count the number of non-null values across the row axis for DataFrame dfC.
(b) Count the number of non-null observations in column Weight of DataFrame dfC.
(c) How many non-null observations are in column Height of DataFrame dfC?
(d) Count the number of non-null values across the column axis for DataFrame dfC.
(e) Find the sum of the height and weight columns for all students using DataFrame dfC.
(f) Find the mean of the height and weight columns for all students using DataFrame dfC.
(g) Find the median of the height and weight columns for all students using DataFrame dfC.
(h) Find the most repeated value for a specific column ‘Weight’ of DataFrame dfC.
(i) Find the maximum weight value of DataFrame dfC.
Ans. For data: dfC = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Heightweight.csv')
Or
dfC = pd.DataFrame({'Student Name' : ['TANVI GUPTA', 'MRIDUL KOHLI', 'DHRUV TYAGI', 'SAUMYA
PANDEY', 'ALEN RUJIS', 'MANALI SOVANI', 'AAKASH IRENGBAM', 'SHIVAM BHATIA'],
'Height' : [60.0, 62.9, np.nan, 58.3, 62.5, 58.4, 63.7, 61.4],
'Weight' : [54.3, 56.8, 60.4, 58.3, np.nan, 57.4, 58.3, 55.8]},
columns = ['Student Name', 'Height', 'Weight'])
(a) dfC.count()
(b) dfC['Weight'].count()
(c) 7
(d) dfC.count(axis='columns')
(e) dfC[['Height', 'Weight']].sum()
(f) dfC[['Height', 'Weight']].mean()
(g) dfC[['Height', 'Weight']].median()
(h) dfC['Weight'].mode()
(i) dfC['Weight'].max()
(d) dfC.describe()
8. Using previous DataFrame dfC, answer the following:
(a) Create two separate histogram plots for each chosen column, i.e., height and weight using
DataFrame dfC.
(b) Create two separate histogram plots for height and weight columns by sharing the axes to
the same range and scale using DataFrame dfC.
(c) Create a histogram to plot both the height and weight data in one plot using DataFrame dfC.
Ans. (a) dfC.hist(column=["Height", "Weight"])
plt.show()
(b) dfC.hist(column=["Height", "Weight"], sharex=True, sharey=True)
plt.show()
(c) dfC[["Height", "Weight"]].plot.hist()
plt.show()
REVIEW QUESTIONS
1. What are quantiles? How can you find the 0.25 quantile?
2. What do you mean by the middle value of the sorted sample list?
3. In a histogram, how can you set both of the axes to the same range and scale?
4. A DataFrame column called Discount contains the following values:
Discount
2560
3600
1250
NaN
1200
se
5. Find the .25, .50, and .75 quantiles for the following series x:
[3, 7, 8, 5, 12, 14, 21, 13, 18]
Chapter 4
Function Applications in Pandas
4.1 Introduction
Pandas is a vast storehouse of Python data tools. Whether for data visualization or data analysis, the practicality
and functionality that this tool offers is not found in any other module. Python pandas supports a number of
data aggregation/descriptive functions to analyze data. In this chapter, we will learn about Python pandas
function applications.
4.2 The pipe() Function
Pipes take the output from one function and feed it to the first argument of the next function. The pipe()
function performs a custom operation for the entire DataFrame. Pipe can be thought of as function chaining.
The syntax is:
DataFrame.pipe(func, *args, **kwargs)
To apply pipe(), the first argument of the function must be the data set.
>>> df
Name English Accounts Economics Bst IP
0 Aashna 87 76 82 72 78
1 Simran 64 76 69 56 75
2 Jack 58 68 78 63 82
3 Raghu 74 72 67 64 86
4 Somya 87 82 78 66 67
5 Ronald 78 68 68 71 71
Let us set the Name column as the index of df:
>>> df = df.set_index('Name')
>>> df
English Accounts Economics Bst IP
Name
Aashna 87 76 82 72 78
Simran 64 76 69 56 75
Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Somya 87 82 78 66 67
Ronald 78 68 68 71 71
Now using the above DataFrame df, apply the function .pipe() to add 2 marks to every numeric
column:
# Create a user-defined function to add two numbers
>>> def Add_Two(Data, aValue):
        return Data + aValue
>>> df.pipe(Add_Two, 2)
English Accounts Economics Bst IP
Name
Aashna 89 78 84 74 80
Simran 66 78 71 58 77
Jack 60 70 80 65 84
Raghu 76 74 69 66 88
Somya 89 84 80 68 69
Ronald 80 70 70 73 73
Notice that the first argument of .pipe() is the function Add_Two, and the DataFrame itself is passed as
the first argument of that function. Add_Two accepts two arguments, Add_Two(Data, aValue). As Data is the
first parameter that takes in the data set, we can use pipe() directly; we only need to supply the remaining
arguments (here, the value 2).
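A self-contained sketch of the same idea, with a cut-down two-student DataFrame (the function name add_marks is illustrative, not from the book):

```python
import pandas as pd

df = pd.DataFrame({"English": [87, 64], "Accounts": [76, 76]},
                  index=["Aashna", "Simran"])

def add_marks(data, value):
    # the DataFrame arrives as the first argument, so pipe() works directly
    return data + value

result = df.pipe(add_marks, 2)
print(result.loc["Aashna", "English"])    # 89
print(result.loc["Simran", "Accounts"])   # 78
```

df.pipe(add_marks, 2) is equivalent to add_marks(df, 2), which is why pipes chain cleanly: df.pipe(f).pipe(g) reads left to right instead of g(f(df)).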
4.3 The apply() Function
Basically, we can use custom functions when applying on Series and also when operating on chunks of data
frames in groupbys. This is useful when cleaning up data – converting formats, altering values, etc. This
method takes as arguments the following:
• a general or user-defined function
• any other parameters that the function would take
The syntax is:
DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None,
args=(), **kwds)
Here,
• func: the function which will be applied to each column or row.
• axis: the axis along which the function is applied:
− 0 or ‘index’: apply function to each column.
− 1 or ‘columns’: apply function to each row.
Let us see the following data set with two columns containing numeric data:
A B
1 6
2 7
3 8
4 9
5 10
6 11
When we use the apply() function, by default it will apply on each column. For example, let us create the
DataFrame and find the sum of each column using the aggregate function sum with apply():
>>> import pandas as pd
>>> tdf = pd.DataFrame({'A' : [1, 2, 3, 4, 5, 6], 'B' : [6, 7, 8, 9, 10, 11]})
>>> tdf
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
5 6 11
As the name suggests, the .apply() function applies a function along any axis of the DataFrame. For
example, to find the sum of all values of each column:
>>> tdf[['A', 'B']].apply(sum)
The above command will return the sum of all the values of column A and column B.
A 21
B 51
dtype: int64
Use .apply() with axis=1 to send every single row to a function
Similarly, to find the sum of all values of each row, the command is:
>>> tdf[['A', 'B']].apply(sum, axis=1) # axis = 1 applies to each row
0 7
1 9
2 11
3 13
4 15
5 17
dtype: int64
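The two directions of apply() can be checked side by side with a smaller, made-up frame:

```python
import pandas as pd

tdf = pd.DataFrame({"A": [1, 2, 3], "B": [6, 7, 8]})

col_sums = tdf.apply(sum)           # default axis=0: one result per column
row_sums = tdf.apply(sum, axis=1)   # axis=1: one result per row

print(col_sums.tolist())   # [6, 21]
print(row_sums.tolist())   # [7, 9, 11]
```

Column A sums to 1+2+3 = 6 and column B to 6+7+8 = 21, while each row sum pairs one value from A with one from B.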
Let us take another DataFrame example, dfA, with the following data set for applying the apply() function:
>>> dfA
So, to implement the tax calculation as a function for each row of the previous DataFrame (dfA), create a
function called Tax_calc() as follows:
>>> # we create a function to calculate the tax
>>> def Tax_calc(Price):
        Taxes = Price * 0.04 # 4% tax
        return Taxes
>>> dfA['Taxes'] = dfA.Price.apply(Tax_calc)
Here, the function .apply() calls the Tax_calc() function to calculate the taxes for the Price column.
The Tax_calc() function accepts an argument, Tax_calc(Price). The resulting data is stored as a new column
called Taxes in the existing DataFrame dfA. The new DataFrame is:
>>> print (dfA)
OrdNum Size Topping Price Taxes
0 PZ001 Small Margherita 356.65 14.266
1 PZ002 Large Peppy Paneer 545.70 21.828
2 PZ003 Extra Large Bell Pepper 756.90 30.276
3 PZ004 Extra Large Mexican Green Wave 654.00 26.160
4 PZ005 Extra Large Pepperoni Chicken 632.00 25.280
5 PZ006 Large Chicken Sausage 480.60 19.224
Let us apply the .apply() function to send every single row to a function to calculate a new column
Taxes:
>>> def Tax_calc(row): # row is a user-defined name that contains all the column values
        return row['Price'] * 0.04 # The Price value is used from the row
>>> dfA.apply(Tax_calc, axis=1)
0 14.266
1 21.828
2 30.276
3 26.160
4 25.280
5 19.224
6 15.000
dtype: float64
Or
If you want to add it as a new column using the .apply() function, then write the following:
>>> dfA['Taxes'] = dfA.apply(Tax_calc, axis=1) # to add a new column
The above command will again create a DataFrame with a new Taxes column.
The axis parameter of the apply() function is useful to travel through columns and rows. When an arbitrary
function like max, min, etc. is applied with the apply() function, it travels the entire columns and rows. If
axis = 0, it travels down each column. Similarly, when axis = 1, it travels across each row.
In the previous DataFrame dfA, we have 5 columns and 7 rows. Let us find the maximum value of the
‘Price’ and ‘Taxes’ columns using the apply() function.
>>> # apply() method travels axis=0
>>> dfA.loc[:, 'Price':'Taxes'].apply(max, axis=0)
which will display the maximum value of the Price and Taxes columns as given below:
Price 756.900
Taxes 30.276
dtype: float64
Similarly, to find row-wise maximum values of the ‘Price’ and ‘Taxes’ columns:
>>> dfA.loc[:, 'Price':'Taxes'].apply(max, axis=1)
0 356.65
1 545.70
2 756.90
3 654.00
4 632.00
5 480.60
6 375.00
dtype: float64
The keyword lambda in Python is used to create anonymous functions. Anonymous functions are functions
that are unnamed. That means you are defining a function without any name. A lambda
function is a shorthand way to define a quick function that you need once.
The basic syntax to create a lambda function is:
lambda arguments : expression
Lambda functions can have any number of arguments but only one expression. The expression is
evaluated and returned. A lambda function cannot contain any statements, and it returns a function object
which can be assigned to any variable.
For example,
>>> fun = lambda x: x*x
>>> sqr = fun(5) # Output: 25
To check the type of the lambda function, type the following:
>>> type(fun)
<class 'function'>
Similarly, let us see another example to add two numbers using a lambda function:
>>> SumTwo = lambda x, y : x + y
>>> print (SumTwo(10, 20)) # Output: 30
Here, in lambda x, y: x + y; x and y are arguments to the function and x + y is the expression which gets
executed and its value is returned as output.
Also, lambda x, y: x + y returns a function object which can be assigned to any variable; in this case
the function object is assigned to the SumTwo variable.
Using Lambda Function with .apply()
In the previous section, we used the .apply() function to find the tax for each row in DataFrame dfA. Instead
of using the above function (Tax_calc), we can use a lambda function. Let us do the same with a lambda
function:
>>> dfA.apply(lambda row: row[3] * 0.04, axis=1)
0 14.266
1 21.828
2 30.276
3 26.160
4 25.280
5 19.224
6 15.000
dtype: float64
Here, the row parameter takes the entire row of DataFrame dfA and row[3] is the Price column.
ra
Let us see the difference between a normal def defined function and lambda function. This is a program
Sa
>>> def Tax_calc(row): # A row is an user-define name contain all the column values
return row['Price'] * 0.04 # The price value is used from the row
ew
Here, while using def, we need to define a function with a name Tax_calc and need to pass a value to it.
After execution, we also need to return the result from where the function was called using the return
N
keyword.
On the other hand, in the lambda function, it does not include a “return” statement; it always contains
@
an expression which is returned. We can also put a lambda definition anywhere a function is expected, and
we don’t have to assign it to a variable at all.
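The lambda version of the tax calculation can be tried on a two-row stand-in for dfA (made-up prices from the chapter's table):

```python
import pandas as pd

# Hypothetical prices standing in for the chapter's dfA
dfA = pd.DataFrame({"Price": [356.65, 545.70]})

# lambda version of Tax_calc: 4% tax per row value, no named function needed
dfA["Taxes"] = dfA["Price"].apply(lambda p: p * 0.04)

print(round(dfA.loc[0, "Taxes"], 3))   # 14.266
print(round(dfA.loc[1, "Taxes"], 3))   # 21.828
```

Applying the lambda to the Price Series directly (rather than row-wise over the whole frame) avoids indexing the row by position.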
Example Using previous DataFrame dfA, write the commands for the following:
(a) Write a function to display the Topping column in capitalized form.
(b) Apply the toppingcapital function over the column ‘Topping’.
Ans. (a) def toppingcapital(t):
        return t.upper()
(b) dfA['Topping'].apply(toppingcapital)
which will display the following:
0 MARGHERITA
1 PEPPY PANEER
2 BELL PEPPER
3 MEXICAN GREEN WAVE
4 PEPPERONI CHICKEN
5 CHICKEN SAUSAGE
6 PERI-PERI CHICKEN
Name: Topping, dtype: object
4.4 Aggregation (groupby)
Pandas DataFrames have a .groupby() method that works in the same way as the SQL GROUP BY. The main
objective of this function is to split the data into sets and then apply some functionality on each subset. The
most important operations made available by a groupby() are aggregate, filter, transform, and apply.
• Splitting the data into groups based on some criteria, such as the levels of a categorical variable.
This is generally the simplest step. For example, a DataFrame can be split up by rows (axis=0) or
columns (axis=1) into groups.
• Applying a function to each group individually. A function is applied to each group using .agg()
or .apply():
− Aggregate – estimate/compute summary statistics (like counts, means) for each group. This
reduces each group to a single row of results.
− Transform – within-group standardization, imputation using group values. The size of the
data will not change.
− Filter – ignore rows that belong to a particular group. This discards some groups, according
to a group-wise computation.
Let us see the following data set with two columns, and how the rows are grouped:
Data                Data with groupby       Group result (sum, max, min, mean, count)
Black 120           Black 120
Red 130             Black 120
Black 120           Black 136               Black: 376 136 120 125.33 3
Green 110           Green 110
Red 125             Green 132
Green 132           Green 144               Green: 386 144 110 128.67 3
Red 115             Red 130
Black 136           Red 125
Green 144           Red 115
Red 165             Red 165                 Red: 535 165 115 133.75 4
When we apply the .groupby() method to a DataFrame object, it returns a GroupBy object, which is
then assigned to a single grouped variable or GroupBy variable. An important thing to note about a pandas
GroupBy object is that no splitting of the DataFrame has taken place at the point of creating the object. The
GroupBy object simply has all of the information it needs about the nature of the grouping. No aggregation
will take place until we explicitly call an aggregation function on the GroupBy object.
The syntax is:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True,
squeeze=False, **kwargs)
Here,
• by : Used to determine the groups for the groupby. If by is a function, it’s called on each value of
the object’s index.
• sort : Sort group keys. This does not influence the order of observations within each group; the
groupby() function preserves the order of rows within each group.
• group_keys : When calling apply, add group keys to the index to identify pieces.
• squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent
type.
Before using the groupby() function, let us create a DataFrame with following data:
>>> import pandas as pd
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Stock.csv')
>>> df
Category_Name Item_Num Unit_Price Sales_Quantity
0 Television T001_Panasonic 27800 8
1 Washing Machine W003_Samsung 9699 4
2 Refrigerator R001_LG 43800 6
3 Microwave M001_LG 13600 8
4 Television T002_Sony 42200 4
5 Air Conditioner A001_LG 23500 11
6 Microwave M002_Samsung 18750 4
7 Washing Machine W001_IFB 32600 12
8 Television T003_LG 32500 4
9 Refrigerator R002_Samsung 23300 4
10 Air Conditioner A002_Carrier 43700 6
11 Microwave M003_LG 28750 5
12 Television T004_Sony 65800 5
Using the above DataFrame (df), we create a grouping of categories and apply a function to the categories.
For example, let us apply the groupby() function to group the data on the Category_Name column:
>>> dfC = df.groupby('Category_Name')
Here, the groupby() function creates a GroupBy object called dfC. When we print the object, it will
display something like:
<pandas.core.groupby.DataFrameGroupBy object at 0x...>
This grouped variable (dfC) is now a GroupBy object. It has not actually computed anything yet except
for some intermediate data about the group key df['Category_Name']. The idea is that this object has all of
the information needed to then apply some operation to each of the groups. We can inspect the groups by
iterating over the object.
View Groups
>>> dfC.groups
From the above output, it displays the Category_Name-wise row indexes from DataFrame df.
Or
We can use the list() method to see the details of the GroupBy object values:
>>> list(df['Item_Num'].groupby(df['Category_Name']))
[('Air Conditioner', 5 A001_LG
10 A002_Carrier
14 A003_Samsung
Name: Item_Num, dtype: object), ('Microwave', 3 M001_LG
6 M002_Samsung
11 M003_LG
Name: Item_Num, dtype: object), ('Refrigerator', 2 R001_LG
9 R002_Samsung
15 R003_Onida
Name: Item_Num, dtype: object), ('Television', 0 T001_Panasonic
4 T002_Sony
8 T003_LG
12 T004_Sony
Name: Item_Num, dtype: object), ('Washing Machine', 1 W003_Samsung
7 W001_IFB
13 W002_LG
Name: Item_Num, dtype: object)]
Let us print the first row of each group of the pandas DataFrame using the first() method:
>>> dfC.first()
Item_Num Unit_Price Sales_Quantity
Category_Name
Air Conditioner A001_LG 23500 11
By default, the GroupBy object has the same label name as the group name.
The object returned by the call to groupby() can be used as an iterator. In the previous example, we created
a GroupBy object called dfC. Let us use a for loop to iterate over the object dfC:
>>> for key, group_df in dfC:
        print("The group for Category Name '{}' has {} rows".format(key, len(group_df)))
which prints the following:
The group for Category Name 'Air Conditioner' has 3 rows
The group for Category Name 'Microwave' has 3 rows
The group for Category Name 'Refrigerator' has 3 rows
The group for Category Name 'Television' has 4 rows
The group for Category Name 'Washing Machine' has 3 rows
From the above command:
• key contains the name of the grouped element, i.e., 'Air Conditioner', 'Microwave', 'Refrigerator',
'Television', 'Washing Machine'
• group_df is a normal DataFrame containing only the data referring to the key.
Select a Group
Using the get_group() method, we can select a single group. For example, to select a particular group
called ‘Television’ from the GroupBy object dfC:
>>> dfC.get_group('Television')
Category_Name Item_Num Unit_Price Sales_Quantity
0 Television T001_Panasonic 27800 8
4 Television T002_Sony 42200 4
8 Television T003_LG 32500 4
12 Television T004_Sony 65800 5
To produce a result, we can apply an aggregate to this DataFrameGroupBy object, which will perform the
appropriate apply/combine steps to produce the desired result. Let us simply use the aggregate function
sum():
>>> df.groupby('Category_Name').sum()
Unit_Price Sales_Quantity
Category_Name
Air Conditioner 90700 28
Microwave 61100 17
Refrigerator 90400 14
Television 168300 21
Washing Machine 66499 26
Here, the aggregate function sum() calculates the Category_Name-wise total of Unit_Price and
Sales_Quantity. The sum() method is just one possibility here; you can apply any common Pandas or
NumPy aggregation function, as well as any valid DataFrame operation.
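The split-apply-combine pattern can be checked on a cut-down stock table in the spirit of the chapter's df:

```python
import pandas as pd

# A small, made-up subset of the stock data
df = pd.DataFrame({
    "Category_Name": ["Television", "Television", "Microwave"],
    "Sales_Quantity": [8, 4, 5],
})

# One aggregated value per group: 8+4 for Television, 5 for Microwave
totals = df.groupby("Category_Name")["Sales_Quantity"].sum()

print(totals["Television"])   # 12
print(totals["Microwave"])    # 5
```

Selecting the column before calling sum() restricts the aggregation to Sales_Quantity alone.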
Resetting groupby rows index
We can reset the grouped row index in pandas with the reset_index() function to make the index start
from 0. For example,
>>> df.groupby('Category_Name').sum().reset_index()
Category_Name Unit_Price Sales_Quantity
0 Air Conditioner 90700 28
1 Microwave 61100 17
2 Refrigerator 90400 14
3 Television 168300 21
4 Washing Machine 66499 26
We can sort the GroupBy object result using the sort_values() method. For example, let us find the total
Unit_Price of each Category_Name and sort the result in ascending order:
>>> df1 = df.groupby('Category_Name')[['Unit_Price']].sum().reset_index().sort_values(by='Unit_Price')
>>> df1
Category_Name Unit_Price
1 Microwave 61100
4 Washing Machine 66499
2 Refrigerator 90400
0 Air Conditioner 90700
3 Television 168300
The groupby method has an option called as_index to display SQL-style grouped output. For example, to
display SQL-style Category_Name-wise aggregate sum of data:
>>> df.groupby('Category_Name', as_index=False).sum()
Category_Name Unit_Price Sales_Quantity
0 Air Conditioner 90700 28
1 Microwave 61100 17
2 Refrigerator 90400 14
3 Television 168300 21
4 Washing Machine 66499 26
Example: Write the command to find the maximum Unit_Price of each Category_Name of DataFrame
df.
# Find the maximum Unit_Price of each Category_Name
>>> df.groupby('Category_Name').Unit_Price.max()
Category_Name
Air Conditioner 43700
Microwave 28750
Refrigerator 43800
Television 65800
Washing Machine 32600
Name: Unit_Price, dtype: int64
To use two keys in the groupby() function, let us first group on "Category_Name" and then on "Item_Num"
within each category:
>>> dfD = df.groupby(['Category_Name', 'Item_Num'])
Or
>>> dfD = df.groupby([df['Category_Name'], df['Item_Num']])
>>> dfD.first()
Unit_Price Sales_Quantity
Category_Name Item_Num
Air Conditioner A001_LG 23500 11
A002_Carrier 43700 6
A003_Samsung 23500 11
Microwave M001_LG 13600 8
M002_Samsung 18750 4
M003_LG 28750 5
Refrigerator R001_LG 43800 6
R002_Samsung 23300 4
R003_Onida 23300 4
Television T001_Panasonic 27800 8
T002_Sony 42200 4
T003_LG 32500 4
T004_Sony 65800 5
Washing Machine W001_IFB 32600 12
W002_LG 24200 10
W003_Samsung 9699 4
Column-wise aggregations – optimized statistical methods
For simple statistical aggregations (of numeric columns of a DataFrame), we can call methods like sum(),
max(), min(), mean(), etc. Before applying the functions, let us create a new DataFrame called dfN by
adding a new column ‘Total_Amount’ to the previous DataFrame df:
# Build the Total_Amount column row by row
>>> Amount = [] # a blank list
>>> for index, row in df.iterrows():
        Amt = row['Unit_Price'] * row['Sales_Quantity'] # Calculating a row value
        Amount.append(Amt) # append current amount to the list
>>> dfN = df.assign(Total_Amount = Amount) # A new column ‘Total_Amount’ created
# Summing all numeric columns of dfN
>>> dfN[['Unit_Price', 'Sales_Quantity', 'Total_Amount']].sum()
Unit_Price 476999
Sales_Quantity 106
Total_Amount 3078146
dtype: int64
# Summing a particular column or series
>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).sum()
Category_Name
Air Conditioner 779200
Microwave 327550
Refrigerator 449200
Television 850200
Washing Machine 671996
Name: Total_Amount, dtype: int64
# Finding the mean of all series of a DataFrame
>>> print (dfN.groupby('Category_Name').mean())
We can find the unique column values per group by using the .nunique() method on a groupby() result.
For example, to find the number of unique Sales_Quantity values for each Category_Name of
DataFrame dfN:
>>> dfN.groupby('Category_Name')["Sales_Quantity"].nunique()
Category_Name
Air Conditioner 2
Microwave 3
Refrigerator 2
Television 3
Washing Machine 3
Name: Sales_Quantity, dtype: int64
Multiple aggregation functions can be applied at once with the .agg() method. Simply pass a list of the
functions that you would like to apply to your dataset. The .agg() method allows us to easily and flexibly
specify these details.
It takes arguments as given below:
• a list of function names to be applied to all selected columns
• tuples of (colname, function) to be applied to all selected columns
• a dict of (df.col, function) to be applied to each df.col
• more than one function for selected column(s), by passing the names of the functions to agg() as
a list
For example, to find the Category_Name-wise sum of Total_Amount, the command is:
>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).agg('sum')
Category_Name
Air Conditioner 779200
Microwave 327550
Refrigerator 449200
Television 850200
Washing Machine 671996
Name: Total_Amount, dtype: int64
Or
>>> dfN.groupby(dfN['Category_Name']).agg({'Total_Amount':['sum']}).reset_index()
Category_Name Total_Amount
sum
0 Air Conditioner 779200
1 Microwave 327550
2 Refrigerator 449200
3 Television 850200
4 Washing Machine 671996
Example Write the command to find the Category_Name-wise aggregates applying multiple
aggregation functions like count, min, mean and max for Total_Amount.
>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).agg(['count', 'min', 'mean', 'max'])
count min mean max
Category_Name
Air Conditioner 3 258500 259733.333333 262200
Microwave 3 75000 109183.333333 143750
Refrigerator 3 93200 149733.333333 262800
Television 4 130000 212550.000000 329000
Washing Machine 3 38796 223998.666667 391200
Example Write the command to create a hierarchical index to find the minimum and maximum of all
numeric columns for DataFrame dfN.
# Apply min and max to all numeric columns of dfN grouped by Category_Name
# A hierarchical index will be created
>>> dfN[['Unit_Price', 'Sales_Quantity', 'Total_Amount']].groupby(dfN['Category_Name']).agg(['min',
'max'])
Unit_Price Sales_Quantity Total_Amount
min max min max min max
Category_Name
Air Conditioner 23500 43700 6 11 258500 262200
Microwave 13600 28750 4 8 75000 143750
Refrigerator 23300 43800 4 6 93200 262800
Television 27800 65800 4 8 130000 329000
Washing Machine 9699 32600 4 12 38796 391200
Example Write the command to create a hierarchical index to find min and max of all numeric
columns for DataFrame dfN by flipping the above layout, i.e., by moving whole
levels of columns to rows.
>>> dfN[['Unit_Price', 'Sales_Quantity', 'Total_Amount']].groupby(dfN['Category_Name']).agg(['min',
'max']).stack()
Example Create a histogram for the Total_Amount of dfN with bin size 5.
# Creating histogram for Total_Amount
>>> import matplotlib.pyplot as plt
>>> dfN.hist(column="Total_Amount", bins=5) # Plotting 5 bins
>>> plt.show()
Figure 4.1 Histogram of Total_Amount with 5 bins.
Just as you can apply custom functions to a column in your data frame, you can do the same with groups.
Let us apply the .apply() method with the groupby() method to find the Item_Num-wise total amount for
each Category_Name in the original DataFrame df. That is:
>>> df
Category_Name Item_Num Unit_Price Sales_Quantity
0 Television T001_Panasonic 27800 8
1 Washing Machine W003_Samsung 9699 4
2 Refrigerator R001_LG 43800 6
3 Microwave M001_LG 13600 8
4 Television T002_Sony 42200 4
5 Air Conditioner A001_LG 23500 11
6 Microwave M002_Samsung 18750 4
7 Washing Machine W001_IFB 32600 12
8 Television T003_LG 32500 4
9 Refrigerator R002_Samsung 23300 4
10 Air Conditioner A002_Carrier 43700 6
11 Microwave M003_LG 28750 5
12 Television T004_Sony 65800 5
13 Washing Machine W002_LG 24200 10
14 Air Conditioner A003_Samsung 23500 11
15 Refrigerator R003_Onida 23300 4
# User-defined function to calculate the total amount
>>> def Calculate(ndf):
        return (ndf.Unit_Price * ndf.Sales_Quantity)
>>> df.groupby(['Category_Name', 'Item_Num']).apply(Calculate)
Category_Name Item_Num
Air Conditioner A001_LG 5 258500
A002_Carrier 10 262200
A003_Samsung 14 258500
Microwave M001_LG 3 108800
M002_Samsung 6 75000
M003_LG 11 143750
Refrigerator R001_LG 2 262800
R002_Samsung 9 93200
R003_Onida 15 93200
Television T001_Panasonic 0 222400
T002_Sony 4 168800
T003_LG 8 130000
T004_Sony 12 329000
Washing Machine W001_IFB 7 391200
W002_LG 13 242000
W003_Samsung 1 38796
dtype: int64
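A compact, runnable version of the same pattern with made-up rows (note that when the function returns a Series per group, the combined result carries a hierarchical index, as in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    "Category_Name": ["AC", "AC", "TV"],
    "Unit_Price": [23500, 43700, 27800],
    "Sales_Quantity": [11, 6, 8],
})

def calculate(g):
    # each group arrives as a DataFrame; return a Series of row-wise amounts
    return g["Unit_Price"] * g["Sales_Quantity"]

amounts = df.groupby("Category_Name").apply(calculate)
print(sorted(amounts.tolist()))   # [222400, 258500, 262200]
```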
While aggregation must return a reduced version of the data, transformation can return a transformed
version of the full data to recombine. For such a transformation, the output is the same shape as the input.
A transformation on a group or a column returns an object that is indexed the same as the data being
grouped. If you want to get a new value for each original row, use transform(). Thus, transform
should return a result that is the same size as that of a group chunk.
Before applying the transform() method, let us create a DataFrame dfT by sorting the previous
DataFrame dfN on Category_Name:
>>> dfT = dfN.sort_values(by='Category_Name')
>>> dfT
>>> dfT = dfN.sort_values(by='Category_Name')
>>> dfT
    Category_Name    Item_Num        Unit_Price  Sales_Quantity  Total_Amount
5   Air Conditioner  A001_LG              23500              11        258500
10  Air Conditioner  A002_Carrier         43700               6        262200
14  Air Conditioner  A003_Samsung         23500              11        258500
3   Microwave        M001_LG              13600               8        108800
6   Microwave        M002_Samsung         18750               4         75000
11  Microwave        M003_LG              28750               5        143750
2   Refrigerator     R001_LG              43800               6        262800
9   Refrigerator     R002_Samsung         23300               4         93200
15  Refrigerator     R003_Onida           23300               4         93200
0   Television       T001_Panasonic       27800               8        222400
4   Television       T002_Sony            42200               4        168800
8   Television       T003_LG              32500               4        130000
12  Television       T004_Sony            65800               5        329000
1   Washing Machine  W003_Samsung          9699               4         38796
>>> dfT.groupby('Category_Name')["Sales_Quantity"].transform('sum')
5     28
10    28
14    28
3     17
6     17
11    17
2     14
9     14
15    14
0     21
4     21
8     21
12    21
1     26
7     26
13    26
Name: Sales_Quantity, dtype: int64
116 Saraswati Informatics Practices XII
You will notice how this returns a different-sized data set from our normal groupby() functions. Instead
of only showing the totals for 5 category names, we retain the same number of rows as the original data
set. That is the unique feature of transform. Figure 4.2 shows the process of the transform() method.
[Figure 4.2: the input (Category_Name, Sales_Quantity) rows are split into five category groups; sum() is
applied to each group (Air Conditioner 28, Microwave 17, Refrigerator 14, Television 21, Washing Machine 26);
and each group total is then combined back by broadcasting it to every original row of that group.]
Figure 4.2 Combined data using transform() method.
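A minimal, self-contained sketch of the behaviour in Figure 4.2 (with hypothetical data, not the book's dfT) contrasts transform() with plain aggregation:

```python
import pandas as pd

# Two categories with hypothetical quantities
df = pd.DataFrame({'Category_Name': ['AC', 'AC', 'TV', 'TV'],
                   'Sales_Quantity': [11, 6, 8, 4]})

# Aggregation: one value per group
per_group = df.groupby('Category_Name')['Sales_Quantity'].sum()

# Transformation: the group sum broadcast back to every original row
per_row = df.groupby('Category_Name')['Sales_Quantity'].transform('sum')

print(per_group.tolist())   # [17, 12]
print(per_row.tolist())     # [17, 17, 12, 12]
```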
Example   Write the command to create a new value, the group mean of each row, for columns "Unit_Price"
          and "Sales_Quantity".
# Find the group-wise mean of Unit_Price and Sales_Quantity for each category
>>> dfT.groupby('Category_Name')[["Unit_Price", "Sales_Quantity"]].transform('mean')
      Unit_Price  Sales_Quantity
5   30233.333333        9.333333
10  30233.333333        9.333333
14  30233.333333        9.333333
3   20366.666667        5.666667
6   20366.666667        5.666667
11  20366.666667        5.666667
2   30133.333333        4.666667
9   30133.333333        4.666667
15  30133.333333        4.666667
0   42075.000000        5.250000
4   42075.000000        5.250000
8   42075.000000        5.250000
12  42075.000000        5.250000
1   22166.333333        8.666667
7   22166.333333        8.666667
13  22166.333333        8.666667
Example   Write the command to find the mean of all numeric columns as per Gender.
# For DataFrame
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Student.csv')
# Create a pipeline that applies the mean_age function to group according to a column
>>> df.pipe(mean_age, col='Gender')
which will display the following output:
              Age     Test1     Test2  Test3
Gender
F       15.666667  6.966667  8.366667    8.4
M       15.666667  8.200000  8.350000    7.9
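The definition of mean_age is not shown in this excerpt. A plausible minimal sketch (an assumption, not the book's exact code) that would produce group-wise means like the output above is:

```python
import pandas as pd

# Assumed definition: group on the given column and average the numeric columns
def mean_age(dataframe, col):
    return dataframe.groupby(col).mean(numeric_only=True)

# Tiny stand-in for the Student.csv data (hypothetical values)
df = pd.DataFrame({'Gender': ['F', 'M', 'F'],
                   'Age': [15, 16, 16],
                   'Test1': [7.0, 8.2, 6.9]})

print(df.pipe(mean_age, col='Gender'))
```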
4.6 .applymap() Function
The .applymap() function performs the specified operation on every element of a DataFrame. Remember
that all columns (except the index column) of the DataFrame must be of a type the applied function can
handle (numeric, in the examples below). The syntax is:
DataFrame.applymap(func)
Here,
• func: Python function; returns a single value from a single value.
For example, suppose we have a DataFrame called tdf with the following data:
>>> Test = {'A': [1, 2, 3, 4, 5, 6],
            'B': [6, 7, 8, 9, 10, 11]}
>>> tdf = pd.DataFrame(Test)
>>> tdf
   A   B
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10
5  6  11
# A function that multiplies each element by 5
>>> def func(x):
        return x * 5
>>> tdf.applymap(func)
    A   B
0   5  30
1  10  35
2  15  40
3  20  45
4  25  50
5  30  55
Let us create another DataFrame dfs with the following data and apply the applymap() function.
>>> import pandas as pd
>>> Data = {'Name': ['Aashna', 'Simran', 'Jack', 'Raghu', 'Somya', 'Ronald'],
            'English': [87, 64, 58, 74, 87, 78],
            'Accounts': [76, 76, 68, 72, 82, 68],
            'Economics': [82, 69, 78, 67, 78, 68],
            'Bst': [72, 56, 63, 64, 66, 71],
            'IP': [78, 75, 82, 86, 67, 71]}
>>> dfs = pd.DataFrame(Data, columns=['Name', 'English', 'Accounts', 'Economics', 'Bst', 'IP'])
>>> dfs
     Name  English  Accounts  Economics  Bst  IP
0  Aashna       87        76         82   72  78
1  Simran       64        76         69   56  75
2    Jack       58        68         78   63  82
3   Raghu       74        72         67   64  86
4   Somya       87        82         78   66  67
5  Ronald       78        68         68   71  71
Now, set the Name column as the row index to create DataFrame dfn:
>>> dfn = dfs.set_index('Name')
>>> dfn
        English  Accounts  Economics  Bst  IP
Name
Aashna       87        76         82   72  78
Simran       64        76         69   56  75
Jack         58        68         78   63  82
Raghu        74        72         67   64  86
Somya        87        82         78   66  67
Ronald       78        68         68   71  71
Now, using the above DataFrame dfn, apply the function .applymap() to convert all numeric column
cell values into float:
>>> dfn.applymap(float)
        English  Accounts  Economics   Bst    IP
Name
Aashna     87.0      76.0       82.0  72.0  78.0
Simran     64.0      76.0       69.0  56.0  75.0
Jack       58.0      68.0       78.0  63.0  82.0
Raghu      74.0      72.0       67.0  64.0  86.0
Somya      87.0      82.0       78.0  66.0  67.0
Ronald     78.0      68.0       68.0  71.0  71.0
Example   Write the command to give an increment of 5% to all students in DataFrame dfn using the
          applymap() function.
# Create a function to increase marks by 5%
>>> def increase5(x):
        return x + x*0.05
>>> dfn.applymap(increase5)    # Temporarily increases marks by 5%
        English  Accounts  Economics  Bst  IP
Name
Or, using a lambda function:
>>> dfn.applymap(lambda x: x + x*0.05)
4.7 Reindexing
Reindexing in pandas is a process that changes the row labels and column labels of a DataFrame. This is
core to the functionality of pandas, as it enables label alignment across multiple objects which may originally
have different indexing schemes. Performing a reindex includes the following steps:
• Reordering the existing data to match the new set of labels.
• Inserting missing-value (NaN) markers in label locations where no data existed for that label.
• Possibly, filling missing data for a label using some type of logic (defaulting to adding NaN values).
The syntax is:
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None,
                  copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Here,
• labels : New labels/index to conform the axis specified by 'axis' to.
• index, columns : New labels/index to conform to. Preferably an Index object, to avoid duplicating
  data.
• axis : Axis to target. Can be either the axis name ('index', 'columns') or number (0, 1).
• method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional. Method to use for filling
  holes in the reindexed DataFrame.
• copy : Return a new object, even if the passed indexes are the same.
• level : Broadcast across a level, matching Index values on the passed MultiIndex level.
• fill_value : Fill existing missing (NaN) values, and any new element needed for successful
  DataFrame alignment, with this value before computation. If data in both corresponding
  DataFrame locations is missing, the result will be missing.
• limit : Maximum number of consecutive elements to forward or backward fill.
• tolerance : Maximum distance between original and new labels for inexact matches. The values
  of the index at the matching locations must satisfy the equation abs(index[indexer] – target)
  <= tolerance.
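The method parameter listed above can fill the holes a reindex creates. A short sketch with hypothetical data:

```python
import pandas as pd

# A Series with gaps in its integer index
s = pd.Series([10, 20, 30], index=[0, 2, 4])

# 'ffill' (pad) propagates the last valid value forward into new labels 1 and 3
r = s.reindex(range(5), method='ffill')
print(r.tolist())   # [10, 10, 20, 20, 30]
```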
Changing the order of the rows
To change the order (the index) of the rows of the previous DataFrame dfs:
>>> dfs.reindex([5, 4, 3, 2, 1, 0])
So, the reindexed DataFrame will be:
     Name  English  Accounts  Economics  Bst  IP
5  Ronald       78        68         68   71  71
4   Somya       87        82         78   66  67
3   Raghu       74        72         67   64  86
2    Jack       58        68         78   63  82
1  Simran       64        76         69   56  75
0  Aashna       87        76         82   72  78
Similarly, the rows can be placed in any order:
>>> dfs.reindex([3, 4, 5, 2, 0, 1])
So, the reindexed DataFrame will be:
     Name  English  Accounts  Economics  Bst  IP
3   Raghu       74        72         67   64  86
4   Somya       87        82         78   66  67
5  Ronald       78        68         68   71  71
2    Jack       58        68         78   63  82
0  Aashna       87        76         82   72  78
1  Simran       64        76         69   56  75
Changing the order of the columns
>>> ChangeColumns = ['Name', 'Accounts', 'English', 'Bst', 'Economics', 'IP']
>>> dfs.reindex(columns=ChangeColumns)
So, the reindexed DataFrame will be:
     Name  Accounts  English  Bst  Economics  IP
0  Aashna        76       87   72         82  78
1  Simran        76       64   56         69  75
2    Jack        68       58   63         78  82
3   Raghu        72       74   64         67  86
4   Somya        82       87   66         78  67
5  Ronald        68       78   71         68  71
Reindexing with new index values
We can add new rows or columns by reindexing with new index values. By default, values in the new index
that do not have corresponding records in the DataFrame are assigned NaN. Let us create a label index
using the Name column of DataFrame dfs:
>>> ndf = dfs.set_index('Name')
>>> ndf
        English  Accounts  Economics  Bst  IP
Name
Aashna       87        76         82   72  78
Simran       64        76         69   56  75
Jack         58        68         78   63  82
Raghu        74        72         67   64  86
Somya        87        82         78   66  67
Ronald       78        68         68   71  71
Now, create a new DataFrame called df1 with a new row index 'Meghna' using DataFrame ndf:
>>> df1 = ndf.reindex(['Aashna', 'Simran', 'Jack', 'Raghu', 'Meghna', 'Somya', 'Ronald'])
>>> df1
        English  Accounts  Economics  Bst   IP
Name
Aashna       87        76         82   72   78
Simran       64        76         69   56   75
Jack         58        68         78   63   82
Raghu        74        72         67   64   86
Meghna      NaN       NaN        NaN  NaN  NaN
Somya        87        82         78   66   67
Ronald       78        68         68   71   71
Notice the above output, where the new index is populated with NaN values. Here, we can fill in the
missing values using the parameter fill_value:
>>> df1 = ndf.reindex(['Aashna', 'Simran', 'Jack', 'Raghu', 'Meghna', 'Somya', 'Ronald'], fill_value=73)
>>> df1
So, the reindexed DataFrame will be:
        English  Accounts  Economics  Bst  IP
Name
Aashna       87        76         82   72  78
Simran       64        76         69   56  75
Jack         58        68         78   63  82
Raghu        74        72         67   64  86
Meghna       73        73         73   73  73
Somya        87        82         78   66  67
Ronald       78        68         68   71  71
We can also add a new column, TotalMarks, by reindexing the columns of df1:
>>> df1 = df1.reindex(columns=['English', 'Accounts', 'Economics', 'Bst', 'IP', 'TotalMarks'])
>>> df1
        English  Accounts  Economics  Bst  IP  TotalMarks
Name
Aashna       87        76         82   72  78         NaN
Simran       64        76         69   56  75         NaN
Jack         58        68         78   63  82         NaN
Raghu        74        72         67   64  86         NaN
Meghna       73        73         73   73  73         NaN
Somya        87        82         78   66  67         NaN
Ronald       78        68         68   71  71         NaN
Now fill the new column with the row-wise total of the marks:
>>> df1['TotalMarks'] = df1.sum(axis=1)    # NaN values are skipped in the sum
>>> df1
        English  Accounts  Economics  Bst  IP  TotalMarks
Name
Aashna       87        76         82   72  78         395
Simran       64        76         69   56  75         340
Jack         58        68         78   63  82         349
Raghu        74        72         67   64  86         363
Meghna       73        73         73   73  73         365
Somya        87        82         78   66  67         380
Ronald       78        68         68   71  71         356
4.8 Altering Labels or Changing Column/Row Names
In pandas, there are two ways in which one can change the column names of a DataFrame. One way to
rename columns is to use df.columns and assign new names directly. To demonstrate this, let us create
a simple DataFrame as follows:
>>> import pandas as pd
>>> Data1 = {'Customer_id': ['C_01', 'C_02' , 'C_03', 'C_04', 'C_05', 'C_06'],
To change the columns of the dfm DataFrame, we can assign the list of new column names to dfm.columns
as:
>>> dfm.columns = ['CustomerID','Product_Choice','Fees']
>>> dfm
    CustomerID  Product_Choice  Fees
A problem with this approach to changing column names is that one has to change the names of all the
columns in the DataFrame. This approach would not work if we want to change the name of only one column.
Also, the above method is not applicable to index labels. Another way to change column names in pandas is
to use the .rename() function.
Changing column name using .rename() function
Using the .rename() function, one can change the names of specific columns easily; not all the column names
need to be changed. One of the biggest advantages of the .rename() function is that we can use it to
change as many column names as we want. The syntax is:
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True,
                 inplace=False, level=None)
a
Here,
•
di
mapper, index and columns: Dictionary value, key refers to the old name and value refers to new
In
name. Remember that only one of these parameters can be used at once.
• axis: int or string value, 0/’row’ for Rows and 1/’columns’ for Columns.
se
'TNAME' : ['Rakesh Sharma', 'Jugal Mittal', 'Sharmila Kaur', 'Sandeep Kaushik', 'Sangeeta Vats'],
'TADDRESS' : ['245-Malviya Nagar', '34 Ramesh Nagar', 'D-31 Ashok Vihar', 'MG-32 Shalimar Bagh',
>>> tdf
1  T02  Jugal Mittal     34 Ramesh Nagar      22000
2  T03  Sharmila Kaur    D-31 Ashok Vihar     21000
3  T04  Sandeep Kaushik  MG-32 Shalimar Bagh  15000
4  T05  Sangeeta Vats    G-35 Malviya Nagar   18900
From the above result, the first column is renamed as 'Teacher_No'.
Renaming multiple columns
Let us change column names TNAME to Teacher_Name, TADDRESS to Teacher_Address and SALARY to
Income in the above DataFrame tdf:
>>> tdf.rename(columns = {'TNAME': 'Teacher_Name',
                          'TADDRESS' : 'Teacher_Address',
                          'SALARY': 'Income'}, inplace=True)
From the above output:
• the second column is renamed as 'Teacher_Name'.
• the third column is renamed as 'Teacher_Address'.
• the fourth column is renamed as 'Income'.
>>> tdf
Using the .rename() function, we can also change the row index names. We just need to use the index
argument to specify that we want to change the index, not the columns. For example, to change row names
0 and 1 to 'zero' and 'one' in our tdf DataFrame, we will construct a dictionary with the old row index
names as keys and the new row index names as values.
>>> tdf.rename(index={0:'zero',1:'one'})
Note that the above result does not change the DataFrame (tdf) row names permanently, because we
omit the inplace=True option.
Renaming column name and row index simultaneously
With pandas' rename() function, one can also change both column names and row names simultaneously by
using both the columns and index arguments with corresponding mapper dictionaries.
>>> tdf.rename(columns={'Teacher_No':'T_Number'},
               index={0:'zero',1:'one'})
      T_Number  Teacher_Name     Teacher_Address      Income
zero  T01       Rakesh Sharma    245-Malviya Nagar     25600
one   T02       Jugal Mittal     34 Ramesh Nagar       22000
2     T03       Sharmila Kaur    D-31 Ashok Vihar      21000
3     T04       Sandeep Kaushik  MG-32 Shalimar Bagh   15000
4     T05       Sangeeta Vats    G-35 Malviya Nagar    18900
Note that the above result does not change the DataFrame (tdf) row and column names permanently,
because the inplace=True option is omitted.
The pandas rename() function can also take a function as input instead of a dictionary. For example, we can
write a lambda function that takes the current column names and keeps only the first six characters as the
new column names.
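A sketch of that lambda idea (with stand-in column names, not necessarily the book's exact example):

```python
import pandas as pd

# Empty frame; only the column labels matter for this illustration
tdf = pd.DataFrame(columns=['Teacher_No', 'Teacher_Name', 'Income'])

# Keep only the first six characters of every column name
renamed = tdf.rename(columns=lambda x: x[:6])
print(renamed.columns.tolist())   # ['Teache', 'Teache', 'Income']
```

Note that pandas allows the duplicate labels this produces, which is one reason to use such truncation carefully.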
Points to Remember
1. The pipe() function performs a custom operation on the entire DataFrame.
2. The apply() method allows us to pass a function that will run on every value in a column.
3. The applymap() method applies a function to each element of the DataFrame; the function should
   return a single value for each element.
4. The groupby() function splits the data into groups based on the levels of a categorical variable.
5. Reindexing in pandas is a process that changes the row labels and column labels of a DataFrame.
SOLVED EXERCISES
1. What is the use of the describe() function?
Ans. The .describe() function is a useful summarisation tool that will quickly display statistics for any
     variable or group it is applied to.
2. Differentiate between the .apply() and .applymap() functions.
Ans. The .apply() function applies a function along any axis of the DataFrame, whereas .applymap() applies
     a function to each element of the DataFrame.
3. How is the default grouping made using the pandas groupby() method?
Ans. By default, the grouping is made via the index (rows) axis.
4. A DataFrame dfW is given with following data:
   Age  Wage Rate
0   20        2.5
1   25        3.5
2   30        4.5
3   35        5.5
4   40        7.0
5   45        8.7
6   50        9.5
7   55       10.0
8   60       12.5
(a) Write a program using the pipe() function to add 2 to each numeric column of DataFrame dfW.
Ans. import pandas as pd
     dfW = pd.DataFrame({'Age' : [20, 25, 30, 35, 40, 45, 50, 55, 60],
                         'Wage Rate' : [2.5, 3.5, 4.5, 5.5, 7.0, 8.7, 9.5, 10.0, 12.5]})
     print(dfW.pipe(lambda d: d + 2))    # adds 2 to every numeric value
8   Gel Pen   Red    12.5
9   P Marker  Blue    8.6
10  Pencil    Green  11.5
11  Ball Pen  Green  10.5
Answer the following questions using the groupby function (assume that the DataFrame name is dfB):
(a) Display Color-wise item and price of each ItemName category.
(b) Find the maximum price of each ItemName.
(c) Find the minimum price of each ItemName.
(d) Count the number of items in each ItemName category.
Ans. (a) dfX = dfB.groupby(['ItemName', 'Color'])
         dfX.first()
     (b) dfB.groupby('ItemName').Price.max()
     (c) dfB.groupby('ItemName').Price.min()
     (d) dfB.groupby('ItemName')['Color'].apply(lambda x: x.count())
6. A DataFrame contains following information:
(c) Find the region-wise aggregates of sales applying multiple aggregation functions like count,
    max, min and mean.
(d) What will be the output of the following:
    (i)   dfN.groupby('Year').Sales.sum().round(decimals=2)
    (ii)  dfN.groupby('Year').Sales.max().round(decimals=2)
    (iii) dfN.groupby('Region').Sales.min().round(decimals=2)
(c) dfN.groupby('Region').Sales.agg(['count', 'max', 'min', 'mean']).round(decimals=2)
(d) (i)   Year
          2016    5516000
          2017    4549000
          2018    5010000
          2019    1490000
          Name: Sales, dtype: int64
    (ii)  Year
          2016    1670000
          2017    1630000
          2018    2100000
          2019    1490000
          Name: Sales, dtype: int64
    (iii) Region
          East     1210000
          North    1560000
          South    1359000
          West     1180000
REVIEW QUESTIONS
1   6500  180
2    500  250
3    500  350
4  13000  120
5   8800  130
6   2400  120
7   8000  170
8   8500  130
9    450  142
Write a command to find the Total Price (Quantity * Unit Price) using the lambda function.
1  Minu Arora     Graduate       11
2  Sharmila Kaur  Post Graduate   7
3  Sangeeta Vats  Masters         9
4  Ramesh Kumar   Graduate        6
5  Jatin Ghosh    Post Graduate   8
6  Yash Sharma    Masters        10
Write the command for the following using the pandas groupby() function:
(a) Find the average experience for each qualification.
(b) Find the total experience for each qualification.
(c) Find the average experience for each qualification and name.
5. A sample dataset is given with four quarter sales data for five employees:
Name of Employee  Sales   Quarter  State
R Sahay           125600  1        Delhi
George K          235600  1        Tamil Nadu
Write the command for the following using the pandas groupby() function (assume that the DataFrame
name is dfQ):
(e) Find the employee name-wise aggregates of sales applying multiple aggregation functions
    like count, max, min and mean.
(f) Find the output of the following:
    (i)  dfQ.groupby('Name of Employee').Sales.mean()
    (ii) dfQ.groupby('State').Sales.mean()
6. A sample dataset is given with different columns as given below:
Item_ID  ItemName           Manufacturer  Price  CustomerName  City
PC01     Personal Computer  HCL India     42000  N Roy         Delhi
LC05     Laptop             HP USA        55000  H Singh       Mumbai
PC03     Personal Computer  Dell USA      32000  R Pandey      Delhi
PC06     Personal Computer  Zenith USA    37000  C Sharma      Chennai
LC03     Laptop             Dell USA      57000  K Agarwal     Bengaluru
AL03     Monitor            HP USA         9800  S C Gandhi    Delhi
CS03     Hard Disk          Dell USA       5400  B S Arora     Mumbai
Write the command for the following (assume that the DataFrame name is dfA):
(a) Find city-wise total price.
Chapter – 5
Introduction to NumPy (Numeric Python)
5.1 Introduction
NumPy is a Python package whose name stands for 'Numerical Python' or 'Numeric Python'. It contains a
collection of tools and techniques that can be used to solve mathematical models of problems in Science
and Engineering on a computer. One of these tools is a high-performance multidimensional array object,
a powerful data structure for efficient computation on arrays and matrices.
NumPy and pandas are often used together, as the pandas library relies heavily on the NumPy array for
the implementation of pandas data objects and shares many of its features. NumPy is the core library for
scientific computing: it contains a powerful n-dimensional array object and provides tools for integrating
C, C++, etc. It is also useful for linear algebra, random number generation, etc. NumPy arrays can also be
used as an efficient multi-dimensional container of generic data.
NumPy is installed by default when we install Python pandas. If you want to install it separately, go to
your command prompt and type "pip install numpy", because the standard Python distribution doesn't come
bundled with NumPy.
Once the installation is completed, you can import the module in your IDLE. Before we can use NumPy,
we will have to import it like any other module:
import numpy as np
The above code renames the NumPy namespace to np. This permits us to prefix NumPy functions, methods
and attributes with "np" instead of typing "numpy".
After the import command, you can check the NumPy version by using the following command:
>>> print (np.__version__)
1.14.3
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers.
NumPy arrays are great alternatives to Python lists, but still very much different at the same time. We use
a NumPy array instead of a list for the following reasons:
• It efficiently implements multidimensional arrays whose elements are all of the same type. That is,
  NumPy arrays are homogeneous in nature, i.e., they comprise one data type (integer, float, double, etc.),
  unlike lists.
• It is more compact than a list (it doesn't need to store both value and type, as a list does).
• It occupies less memory as compared to a list.
• It is fast in terms of execution (reading/writing) and at the same time very convenient to work with.
• It is easy to work with and gives users the opportunity to perform calculations across entire arrays.
• It is capable of performing Fourier transforms and reshaping the data stored in multidimensional arrays.
• NumPy provides in-built functions for linear algebra and random number generation.
• You can also do the standard stuff, like indexing, comparisons and logical operations.
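A quick sketch of the list-versus-array contrast described above:

```python
import numpy as np

lst = [1, 2, 3, 4]
arr = np.array(lst)      # homogeneous: one dtype shared by all elements

# Arithmetic is element-wise on an array, but repetition on a list
print(arr * 2)           # [2 4 6 8]
print(lst * 2)           # [1, 2, 3, 4, 1, 2, 3, 4]
```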
The main data structure in NumPy is the ndarray, which is a shorthand name for the N-dimensional array
object, arranged in the form of rows and columns. Figure 5.1 shows the basic structure of the different
arrays managed in NumPy.
se
1D NumPy array:
Indices 0 1 2 3 4 5 6 7
ou
24 12 10 34 17 13 32 51
i H
at
axis = 1 x is = 2 `
Indices X: Z: a0 1
s
0 1 2 3 4
`
ra
2 4 9 3 10 Y: 0 2 4 9 3
Y: 0
Sa
axis = 0
1 7 6 8 3 5 1 7 6 8 3
axis = 0
2 4 2 5 1 9 2 4 2 5 1
ew `
3 6 7 2 7
`
`
X: 0 1 2 3
N
axis = 1
Figure 5.1 Structures of different arrays.
When working with NumPy, data in an ndarray is simply referred to as an array. NumPy's main object is
the homogeneous multi-dimensional array.
• It is a fixed-size array in memory that contains data of the same type, such as integers or
  floating point values.
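The fixed size, shape and single data type of an ndarray can be inspected through its attributes; a short sketch:

```python
import numpy as np

# A 2-row, 3-column array of integers
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.ndim)    # 2       (number of axes)
print(a.shape)   # (2, 3)  (rows, columns)
print(a.size)    # 6       (total number of elements)
print(a.dtype)   # the single dtype shared by every element
```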
5.3.1 Creating a NumPy Array
There are several ways to create a NumPy array. We can create a NumPy array using the numpy.array()
function or the np.array() function. To use the latter form (i.e., np.array()), you need to make sure that the
NumPy library is imported in your environment.
If we pass in a list of lists, it will automatically create a NumPy array with the same number of rows and
columns. It creates an ndarray from any object exposing the array interface, or from any method that returns
an array. The syntax to create a NumPy array is:
numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)
From the above syntax, except object, the remaining options are optional.
• object: It represents the collection object. It can be a list, tuple, dictionary, set, etc.
• dtype: It is used to mention the data type. We can change the data type of the array elements by
  changing this option to the specified type.
• copy: By default, it is True, which means the object is copied.
• order: Specify the memory layout of the array. If object is not an array, the newly created array
  will be in C (row-major) order unless 'F' (column-major) is specified.
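The dtype option described above can be seen in a short sketch:

```python
import numpy as np

# Same data, two element types chosen via dtype
a = np.array([1, 2, 3])                # integers by default
b = np.array([1, 2, 3], dtype=float)   # forced to float
print(a)   # [1 2 3]
print(b)   # [1. 2. 3.]
```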
A simple way to create an array from data or from a simple Python data structure like a list is to use the
array() function. The NumPy array elements must be of the same data type. For example,
>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> A
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print (A)     # Printing the numpy array
[1 2 3 4 5 6 7 8 9]
In practice, there is no need to declare a Python list first; the list can be written directly inside the
np.array() call, as above.
A two-dimensional array is sometimes called a matrix array. A two-dimensional array is one where numbers
are arranged into rows and columns. Every axis in a NumPy array has a number, starting with 0. In this way,
they are similar to Python indexes in that they start at 0, not 1. The first axis is axis 0, which refers to rows;
the second axis (in a 2D array) is axis 1, which refers to the columns. Let us see a practical example showing
a matrix with 4 rows and 3 columns of index values, where, e.g., index [1][1] is the element in row 1, column 1:
        col 0   col 1   col 2
row 0   [0, 0]  [0, 1]  [0, 2]
row 1   [1, 0]  [1, 1]  [1, 2]
row 2   [2, 0]  [2, 1]  [2, 2]
row 3   [3, 0]  [3, 1]  [3, 2]
Now consider a 2D array with 2 rows (axis 0) and 3 columns (axis 1):
5  4  7
6  3  6
To create a 2D array, each row is written as a bracketed list, with rows separated by commas, all within an
outer bracket []. Let us create the two-dimensional array using the above data:
>>> M = np.array([[5, 4, 7], [6, 3, 6]])
>>> M
array([[5, 4, 7],
       [6, 3, 6]])
To print the 2D array, the command is:
>>> print (M)
[[5 4 7]
 [6 3 6]]
Similarly, consider a matrix with 3 rows (axis 0) and 3 columns (axis 1):
5   4   7
11  3   6
8   17  9
>>> B1 = np.array([[5, 4, 7], [11, 3, 6], [8, 17, 9]])    # Create a 3x3 array of integers
>>> B1
array([[ 5,  4,  7],
       [11,  3,  6],
       [ 8, 17,  9]])
To print the 2D array, the command is:
>>> print (B1)
[[ 5  4  7]
 [11  3  6]
 [ 8 17  9]]
Example   A matrix is given with the following values, having 3 rows (axis 0) and 4 columns (axis 1):
5  4  7  3
6  3  6  4
8  6  9  5
Write the NumPy command to create an array B2 with the above data and print the array.
Solution  B2 = np.array([[5, 4, 7, 3], [6, 3, 6, 4], [8, 6, 9, 5]])    # array of integers
          print (B2)
[[5 4 7 3]
 [6 3 6 4]
 [8 6 9 5]]
Example   Suppose you have 3 friends, each having 5 different marks as given below:
56  43  48  65  54
65  54  76  34  54
48  67  54  56  31
Write the NumPy command to create an array Marks and print the array.
Solution  Marks = np.array([[56, 43, 48, 65, 54], [65, 54, 76, 34, 54], [48, 67, 54, 56, 31]])
          print (Marks)
[[56 43 48 65 54]
 [65 54 76 34 54]
 [48 67 54 56 31]]
Using arange()
On some occasions, you may want to create values evenly spaced within a given interval. The NumPy arange()
function (sometimes called np.arange) is a tool for creating numeric sequences in Python. This function
returns evenly spaced values within a given interval. The syntax is:
numpy.arange(start =, stop =, step =, dtype)
Here,
• start: The start parameter indicates the beginning value of the range. This parameter is optional,
  so if you omit it, it will automatically default to 0.
• stop: The stop parameter indicates the end of the range.
• step: The step parameter specifies the spacing between values in the sequence. If you don't
  specify a step value, by default the step value will be 1.
• dtype: The dtype parameter specifies the data type. This parameter is optional.
For example, to create a sequence of 10 evenly spaced values from 0 to 9:
>>> x = np.arange(10)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print (x)
[0 1 2 3 4 5 6 7 8 9]
The above np.arange() call holds only the stop parameter; the remaining two parameters take their
default values.
For example, to create values from 1 to 10, you can use the arange() function:
>>> y = np.arange(1, 11)    # y = np.arange(start = 1, stop = 11)
>>> y
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
If you want to change the step, you can add a third number in the parentheses. For example, to create
an array starting from 1 with an interval of 3 till 20, the arange() function will be written as:
>>> z = np.arange(1, 20, 3)
>>> z
array([ 1,  4,  7, 10, 13, 16, 19])
>>> p = np.arange(4, 11, dtype=float)
>>> print (p)
[ 4.  5.  6.  7.  8.  9. 10.]
Creating Arrays using Functions
NumPy also provides many functions to create arrays. You can initialise arrays with ones or zeros, and you
can also make arrays that get filled with evenly spaced, constant or random values.
• ones(). This function returns a new array of the specified size, filled with ones. The syntax is:
  numpy.ones(shape, dtype = float, order = 'C')
  Here,
  − shape: is the shape of the array.
  − dtype: is the data type. It is optional; the default value is float64.
  − order: Default is 'C', i.e., row-major (C-style) order.
  For example,
  >>> a = np.ones([2, 3], dtype=int)
  >>> print (a)
  [[1 1 1]
   [1 1 1]]
  Similarly, we can create a NumPy array with all values True using the dtype parameter. For example,
  to create a 2 x 2 array with all values True:
  >>> t = np.ones((2, 2), dtype=bool)
  >>> print (t)
  [[ True  True]
   [ True  True]]
• zeros(). This function returns a new array of the specified size, filled with zeros. These are also
  called null values. The syntax is:
  numpy.zeros(shape, dtype = float, order = 'C')
  For example,
  >>> X = np.zeros(5)
  >>> X
  array([0., 0., 0., 0., 0.])
  Similarly, using the dtype parameter, we can create a Boolean array of all False values. For example,
  to create a 3 x 3 array with all values False:
  >>> y = np.zeros((3, 3), dtype=bool)
  >>> print (y)
  [[False False False]
   [False False False]
   [False False False]]
• full(). This function returns a new array of the given shape and type, filled with a constant
  fill_value. The syntax is:
  numpy.full(shape, fill_value, dtype = None, order = 'C')
  Here, fill_value is the constant value which will be filled in as the array elements.
  For example,
  >>> c = np.full((3, 3), 9)    # Create a constant array
  >>> print (c)
  [[9 9 9]
   [9 9 9]
   [9 9 9]]
  Similarly, we can create a NumPy array with all Boolean values (True or False) using the dtype
  parameter.
• fill(). This function is used to fill scalar values into an existing array. The syntax is:
  ndarray.fill(value)
  Here, value is a scalar value. For example, let us create a 1D array containing the numeric values
  0 to 9 and then fill the array with the scalar value 5.
  >>> x = np.arange(10)    # create a 1D array x with values 0 to 9
  >>> x.fill(5)            # fill every element with 5
  >>> print (x)            # printing the array values
  [5 5 5 5 5 5 5 5 5 5]
• empty(). This function creates an uninitialised array of the specified shape and data type. The
  elements of the array will show whatever arbitrary values happen to be in memory. The default
  data type of the values is float. The syntax is:
  numpy.empty(shape, dtype = float, order = 'C')
  For example, to create a 1D array:
  >>> A = np.empty(4)
  >>> print (A)    # Prints random values
  [6.30731226e+202 4.73591267e+202 8.78952566e-315 0.00000000e+000]
  >>> B = np.empty(4, dtype=int)
  >>> print (B)    # Prints integer type random values
  [       209    4354560 1030881280 1912602624]
  For example, to create a 2D array:
  >>> d = np.empty([3, 3], dtype = int)
  >>> print (d)    # Again prints arbitrary integer values, e.g.,
  [[110   0   0]
   [  0   0   0]
   [404   0   0]]
  From the above results, it's not safe to assume that np.empty will return an array of all
  zeros. In many cases, as previously shown, it will return uninitialised garbage values.
Example   What will be the output of the following lines?
(a) X = np.array([1, 3])
    print (X)
    X.fill(0)
    print (X)
(b) arr = np.empty((2, 5), dtype=bool)
    arr.fill(1)
    print (arr)
Solution  (a) [1 3]
              [0 0]
          (b) [[ True  True  True  True  True]
               [ True  True  True  True  True]]
• eye(). This function is used to create a 2D array with ones on the diagonal and zeros elsewhere.
The syntax is:
numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')
Here,
− N is the number of rows in the output.
− M is the number of columns in the output; defaults to N.
− k is the index of the diagonal. 0 refers to the main diagonal; a positive value refers to an
upper diagonal, and a negative value to a lower diagonal.
For example,
>>> e = np.eye(2) # Create a 2x2 identity matrix
>>> print (e)
[[ 1. 0. ]
 [ 0. 1. ]]
>>> e1 = np.eye(2, 3) # Create a 2x3 array with ones on the main diagonal
>>> print (e1)
[[ 1. 0. 0. ]
 [ 0. 1. 0. ]]
>>> e2 = np.eye(3, 3) # Create a 3x3 identity matrix
>>> print (e2)
[[ 1. 0. 0. ]
 [ 0. 1. 0. ]
 [ 0. 0. 1. ]]
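The k parameter shifts the diagonal of ones; a short sketch:

```python
import numpy as np

# k=1 places the ones on the first upper diagonal,
# k=-1 on the first lower diagonal.
upper = np.eye(3, k=1, dtype=int)
lower = np.eye(3, k=-1, dtype=int)

print(upper)
# [[0 1 0]
#  [0 0 1]
#  [0 0 0]]
print(lower)
# [[0 0 0]
#  [1 0 0]
#  [0 1 0]]
```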
• rand(), random() and random_sample(). All of
these functions generate samples from the uniform distribution on [0, 1). Results are from the
“continuous uniform” distribution over the stated interval. The syntax for these functions is:
numpy.random.rand(d0, d1, ..., dn)
numpy.random.random(size=None)
numpy.random.random_sample(size=None)
Here, d0, d1, ..., dn are the dimensions of the returned array, and they should all be positive. If
no argument is given, a single random float is returned.
The only difference is in how the arguments are handled. With numpy.random.rand, the length
of each dimension of the output array is a separate argument.
With numpy.random.random_sample, the shape argument is a single tuple.
ew
For example, let us print 4 random values using the above three random commands:
>>> A = np.random.rand(4) # uniform in [0, 1]
N
>>> B = np.random.random_sample(5)
>>> print (B)
[0.26982835, 0.97150188, 0.35623911, 0.90855109, 0.80270268]
>>> C = np.random.random(5)
Introduction to NumPy (Numeric Python) 143
For example, to create a 2D array of samples with shape (2, 3), you can write any one of the
following:
A = np.random.random((2, 3)) # Create an array filled with random values.
Or
A = np.random.random_sample((2, 3)) # Create an array filled with random values.
Or
A = np.random.rand(2, 3) # Create an array filled with random values.
The only difference is that random() and random_sample() use a single tuple, whereas the
rand() function takes the dimensional values as separate arguments.
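When repeatable random arrays are needed (e.g., for testing), the generator can be seeded first; a minimal sketch:

```python
import numpy as np

np.random.seed(42)               # fix the generator state
r1 = np.random.random((2, 3))

np.random.seed(42)               # same seed -> same sequence
r2 = np.random.random((2, 3))

print(np.array_equal(r1, r2))    # the two arrays are identical
print(((r1 >= 0) & (r1 < 1)).all())  # all samples lie in [0, 1)
```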
• linspace(). A linspace array is an array of equally spaced values going from a start to an end
value. The NumPy linspace function (sometimes called np.linspace) is a tool in Python for creating
numeric sequences as an ndarray. The syntax is:
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
Here,
− start is the starting value of the sequence.
− stop is the end value of the sequence, unless endpoint is set to False. In that case, the sequence
consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note
that the step size changes when endpoint is False.
− endpoint is a Boolean value and optional. If True, the stop value
will be included in the returned array. Otherwise, the stop value will not be included as the
final value in the returned array.
− retstep is a Boolean value and optional. If True, return (samples, step), where step is the
spacing between samples.
− dtype is the type of the output array. If dtype is not given, infer the data type from the other
input arguments.
For example,
>>> L = np.linspace(0, 100, 5)
>>> print (L)
[  0.  25.  50.  75. 100.]
Notice that 5 equally spaced values are created, starting at the start value 0 and ending at the
stop value 100.
By declaring a start value, stop value, and the num of points in between those points, an array is
generated. Similarly, let us see another example:
>>> L1 = np.linspace(1, 10, 10)
>>> print (L1)
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
Notice that linspace returns floating-point
values. So, to create an array with integers instead of floats, you can use the dtype parameter
as int. For example,
>>> L1 = np.linspace(1, 10, 10, dtype=int)
>>> print (L1)
[ 1 2 3 4 5 6 7 8 9 10]
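The endpoint and retstep parameters described above can be seen in action in a short sketch:

```python
import numpy as np

# retstep=True returns the samples together with the computed step size.
samples, step = np.linspace(0, 1, 5, retstep=True)
print(samples)    # [0.   0.25 0.5  0.75 1.  ]
print(step)       # 0.25

# With endpoint=False the stop value 1.0 is excluded and the step shrinks.
open_end = np.linspace(0, 1, 5, endpoint=False)
print(open_end)   # [0.  0.2 0.4 0.6 0.8]
```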
• reshape(). This function is used to give a new shape to an array without changing its data. It
reshapes the array by changing the number of rows and columns of the multi-dimensional
array. The reshape function has two required inputs: first, an array; second, a shape. Remember
NumPy array shapes are in the form of tuples. For example, a shape tuple for an array with two
rows and three columns would look like this: (2, 3). Also, if the tuple is (2, 3), then the number of
elements in the input array must be 6 elements, i.e., 2 x 3 = 6.
The syntax is:
numpy.reshape(a, newshape, order='C')
Here,
− a is the array to be reshaped.
− newshape should be compatible with the original shape.
Let us first create a 1D array with 6 elements and reshape it into a 2D array with 2 rows and
3 columns.
>>> A = np.array([1, 2, 3, 4, 5, 6])
Now, we use numpy.reshape() to create a new array B by reshaping our initial array A. Notice we
pass numpy.reshape() the array A and a tuple for the new shape (2, 3).
>>> B = np.reshape(A, (2, 3))
Or
>>> B = np.reshape(A, (-1, 3)) # Setting to -1 automatically decides the number of rows
>>> print (B)
[[1 2 3]
 [4 5 6]]
We can also print the shape of B to make sure it matches the tuple we passed to reshape().
>>> print (B.shape)
(2, 3)
Similarly, to create an array with three rows and two columns, we can use the tuple as (3, 2):
>>> C = np.reshape(A, (3, 2))
>>> print (C)
[[1 2]
 [3 4]
 [5 6]]
Reshaping 2D Array
For example, let us see how a 4x3 array is reshaped into a 3x4 array:
 1  2  3                1  2  3  4
 4  5  6      -->       5  6  7  8
 7  8  9                9 10 11 12
10 11 12
>>> x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
>>> print (x)
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
>>> x = x.reshape(3, 4) # x = np.reshape(x, (3, 4))
>>> print (x)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
We can also create an array of sequence values by using the arange() and reshape() functions.
For example, to create a 3 x 5 array using values from 1 to 15, i.e., np.arange(1, 16):
>>> y = np.arange(1, 16).reshape(3, 5)
>>> print (y)
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
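As a quick check of the shape rules above, here is a sketch using -1 to let NumPy infer one dimension:

```python
import numpy as np

x = np.arange(1, 13)           # 12 values: 1 .. 12

a = x.reshape(3, -1)           # -1 infers 4 columns (12 / 3)
b = x.reshape(-1, 6)           # -1 infers 2 rows (12 / 6)
print(a.shape)                 # (3, 4)
print(b.shape)                 # (2, 6)

# reshape(-1) flattens back to a 1D array with the same values.
flat = a.reshape(-1)
print(flat.shape)              # (12,)
```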
The transpose of a matrix is a new matrix whose rows are the columns of the original. This makes the
columns of the new matrix the rows of the original. To transpose a
matrix, simply use the T attribute of an array object.
That is:
matrix.T
For example,
>>> a = np.array([3, 6, 9])
>>> b = a
>>> b.T
array([3, 6, 9]) # Here it didn't transpose because a is 1-dimensional
>>> b = np.array([a])
>>> b.T
array([[3], # Here it did transpose because b is now 2-dimensional
       [6],
       [9]])
Let us see the transpose of a matrix:
>>> MAT = np.array([ [4, 5, 8, 6],
                     [2, 3, 2, 4],
                     [7, 6, 4, 6] ])
>>> print (MAT)
[[4 5 8 6]
 [2 3 2 4]
 [7 6 4 6]]
>>> print (MAT.T)
[[4 2 7]
 [5 3 6]
 [8 2 4]
 [6 4 6]]
NumPy also provides a transpose() function. The syntax is:
numpy.transpose(a, axes=None)
For example,
>>> print (np.transpose(MAT))
[[4 2 7]
 [5 3 6]
 [8 2 4]
 [6 4 6]]
You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2).
Data types with numbers in their name, such as int32 and float64, indicate
the bit size of the type (i.e. how many bits are needed to represent a single value in memory).
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those
types and their limitations. Because NumPy is built in ‘C’, the types will be familiar to users of ‘C’, Fortran,
and other related languages. The standard NumPy data types are listed in the following Table 5.1.
Table 5.1 Standard NumPy data types.
Data type   Description
bool_       Boolean (True or False) stored as a byte
int_        Default integer type (same as C long; normally either int64 or int32)
intc        Identical to C int (normally int32 or int64)
intp        Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8        Byte (–128 to 127)
int16       Integer (–32768 to 32767)
int32       Integer (–2147483648 to 2147483647)
int64       Integer (–9223372036854775808 to 9223372036854775807)
uint8       Unsigned integer (0 to 255)
uint16      Unsigned integer (0 to 65535)
uint32      Unsigned integer (0 to 4294967295)
float32     Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64     Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
Different data types allow us to store data more compactly in memory, but most of the time we simply
work with floating point numbers. Note that NumPy auto-detects the data type from
the input.
You can explicitly specify which data type you want. For example,
>>> c = np.array([1, 2, 3], dtype=float)
>>> c.dtype
dtype('float64')
>>> Y = np.array([[5, 4, 7], [6, 3, 6]], dtype = float) # array of float type
>>> print (Y)
[[5. 4. 7.]
 [6. 3. 6.]]
>>> print(Y.dtype) # prints: float64
>>> Z = np.array([[5, 4, 7], [6, 3, 6]], dtype = 'complex') # array of complex type
>>> print (Z)
[[5.+0.j 4.+0.j 7.+0.j]
 [6.+0.j 3.+0.j 6.+0.j]]
>>> print(Z.dtype) # prints: complex128
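An existing array can also be converted to another type with the astype() method; a small sketch:

```python
import numpy as np

f = np.array([1.7, 2.2, 3.9])
i = f.astype(int)          # truncates toward zero, does not round
print(i)                   # [1 2 3]

b = np.array([0, 1, 2]).astype(bool)  # 0 -> False, non-zero -> True
print(b)                   # [False  True  True]
```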
5.5 Finding the Shape and Size of the Array
We know that a NumPy array is a grid of values, all of the same type, and is indexed by a tuple of
nonnegative integers. The number of dimensions is the rank of the array; the
shape of an array is a tuple of
integers giving the size of the array along each dimension.
To get the shape and size of the array, the shape and size attributes associated with the NumPy array are
used. For example, to print the total size (total number of elements) of an array:
>>> import numpy as np
>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print("Array Size:", A.size)
Array Size: 9
Similarly, to find the size of a 2D array:
>>> x = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> print("Array Size:", x.size)
Array Size: 8
A NumPy array has an attribute called shape that returns a tuple (no. of rows, no. of columns), where shape[0] gives
you the number of rows and shape[1] gives you the number of columns. The shape returned is a tuple of integers.
These numbers denote the lengths of the corresponding array dimensions.
E.g.,
• For a 1D array, the shape would be (n, ) where n is the number of elements in the array. Because,
if the array only has one row, then it returns (no. of columns, ), and shape[1] will be out of the
index.
• For a 2D array, the shape would be (n, m) where n is the number of rows and m is the number of
columns. For example,
>>> x = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> x.shape
(2, 4)
>>> x.shape[0] # prints: 2 as the number of rows
>>> x.shape[1] # prints: 4 as the number of columns
But if the array only has one row (1D array), then it returns (no. of columns, ), and shape[1] will be
out of the index. That is, the shape would simply be (n, ) instead of either (1, n) or (n, 1) for
row and column vectors respectively.
e
For example, let us find the shape of 1D array:
at
>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]) # A NumPy 1D array with rank 1
>>> print("Array Shape:", A.shape) # Returns number of columns, i.e., 9
iv
Array Shape: (9,)
Pr
Here, the array has 9 columns.
>>> A.shape[0] # Returns 9
a
>>> A.shape[1] # Error
Traceback (most recent call last):
di
In
File "<pyshell#31>", line 1, in <module>
A.shape[1]
se
For example,
>>> B = np.zeros(6)
>>> print (B)
Output [0. 0. 0. 0. 0. 0.]
Example What is the command to find the size and shape of the following array?
M = np.array([ [5, 4, 7, 3],
               [6, 3, 6, 4],
               [8, 6, 9, 5] ])
Solution print (M.size) # prints: 12
         print (M.shape) # prints: (3, 4)
We know that the shape of an array tells us something about the order in which the indices are
processed. At the same time, "shape" can also be used to change the shape of an array. For example,
>>> b = np.array([[1, 2, 3], [4, 5, 6]])
>>> b.shape
(2, 3)
>>> b.shape = (3, 2)
>>> print (b)
[[1 2]
 [3 4]
 [5 6]]
Points to Remember
1. NumPy is a Python library that can be used for scientific and numerical applications and is the
tool to use for linear algebra operations.
2. The numpy.full() function is used to create a numpy array of given shape with all elements
initialized with a given value.
3. The numpy.zeros() function is used to create a numpy array of given shape and type with all
values in it initialized with 0's.
4. The numpy.ones() function is used to create a numpy array of given shape and type with all values
in it initialized with 1's.
5. The numpy.linspace() function is used to create evenly spaced samples over a specified interval.
6. The reshape() method associated with the ndarray object is used to reshape the array.
7. The numpy.reshape() function takes the data in an existing array, and puts it into an array with the
given shape and returns it.
8. The number of dimensions is the rank of the array.
9. The shape of an array is a tuple of integers giving the size of the array along each dimension.
10. The numpy.empty(...) function returns an array filled with random/junk values.
11. numpy.random.random(...) actually uses a random number generator to fill each of the
spots in the array with a randomly sampled number from the interval [0, 1).
SOLVED EXERCISES
1. What is NumPy?
Ans. NumPy is a module for Python. NumPy stands for Numeric Python, which is a Python package for
the computation and processing of single and multi-dimensional array elements.
Ans. AR = A.reshape(4, 5)
print (AR)
print (AR.T)
4. Write the commands for the following (assume that the NumPy namespace is np):
(a) Create a sequence array A of 20 evenly spaced values from 0 to 20.
(b) Create an array of 1D B containing numeric values 0 to 9.
(c) Create a 3x4 floating-point array C filled with ones.
Ans. (a) A = np.arange(0, 20)
(b) B = np.arange(10)
(c) C = np.ones((3, 4), dtype=float)
(d) D = np.full((3, 5), 3.14)
(e) F = np.arange(0, 20, 2)
(f) G = np.linspace(0, 1, 5)
(g) H = np.eye(3)
5. Write the commands for the following (assume that the NumPy namespace is np):
(a) Create and print a 3 x 4 numpy array X with all values True.
(b) Create and print a 4 x 3 numpy array Y with all values False.
(c) Create a 3x3 matrix Z with values ranging from 0 to 8.
Ans. (a) X = np.ones((3, 4), dtype=bool)
print (X)
(b) Y = np.zeros((4, 3), dtype=bool)
print (Y)
(c) Z = np.arange(9).reshape(3, 3)
6. What will happen if the start option is missing in the arange() function?
Ans. If the 'start' parameter is not given or missing in the arange() function, then it will be set to 0.
7. What will be the output of the following?
( ) x = np.arange(0, 20, 2)
    print (x)
( ) print (L1)
( ) A = np.array([1, 2, 3, 4, 5, 6])
    B = np.reshape(A, (-1, 2))
    print (B)
Ans. (a) [0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]
(b) [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]
(c) [ 0 2 4 6 8 10 12 14 16 18]
(d) [1 1 2 3 4 5 6 7 8 9]
REVIEW QUESTIONS
1. If X is a 2D array with the following values (assume that np is used as the NumPy namespace):
X = np.array([[1, 2, 3], [4, 5, 6]])
What will it print if you type the following commands?
(a) print(X) (b) type(X)
(c) X.ndim (d) X.shape
2. What will be the output of the following (assume that np is used as the NumPy namespace)?
(a) A = np.zeros((2,5))
print(A)
(b) B = np.ones(7)
print(B)
(c) C = np.ones((3,2))
print(C)
3. What will be the output of the following NumPy command? Explain it.
D = np.zeros((5, 6))
4. If you type the following (assume that np is used as the NumPy namespace):
n = np.arange(10)
How many values will array n store and what are they?
5. Write the command for the following:
(a) Create an array starting from 0 with an interval of 2 till 20.
(b) Convert a 1D array to a 2D array with 2 rows.
(g) Create a 2D Numpy array of 4 rows and 5 columns with all elements initialized with value 9.
(h) Create a 2D numpy array with 3 rows and 4 columns, filled with 1's.
Chapter – 6
Indexing, Slicing and Arithmetic Operations in NumPy Array
6.1 Introduction
In Python you have already learnt the slice method to access list and tuple elements. Selecting a slice is
similar to selecting element(s) of a NumPy array. In this chapter, you will learn how to use indexing and slicing
to access NumPy array elements.
6.2 Indexing NumPy Array
Once your data is represented using a NumPy array, you can access it using indexing. Indexing is used to
obtain individual elements from an array, but it can also be used to obtain entire rows, columns or planes
from multi-dimensional arrays. The important thing to remember is that indexing in Python starts at zero.
Indexing 1D Array
Array indexing refers to any use of the square brackets ([ ]) to index array values. Single element indexing
for a 1D array works exactly like that of other standard Python sequences like list or tuple. It is 0-based,
and accepts negative indices for indexing from the end of the array. Figure 6.1 shows both positive and
negative indices of a 1D array:
Positive Indices    0   1   2   3   4   5   6   7
1D Array           24  12  10  34  17  13  32  51
Negative Indices   –8  –7  –6  –5  –4  –3  –2  –1
For example, let us create a 1D array A1:
>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
Remember that if you use an out of range index value, then NumPy will raise an IndexError. For example,
>>> print(A1[-10])
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    print(A1[-10])
IndexError: index -10 is out of bounds for axis 0 with size 8
Indexing 2D Array
Unlike lists and tuples, NumPy arrays support multi-dimensional indexing for multi-dimensional arrays.
That means it is not necessary to separate each dimension's index into its own set of square brackets.
Figure 6.2 shows a 2D array with its indexes (both positive and negative): positions such as
[0, 0] ... [0, 4] address the first row with positive indices, and [–3, –5] ... [–3, –1] address the same
row with negative indices.
Let us create a 2D array and access the array elements using the 2D array indexing method:
>>> B = np.array([ [2, 4, 9, 3, 10],
                   [7, 6, 8, 3, 5],
                   [4, 2, 5, 1, 9] ])
>>> B
array([[ 2,  4,  9,  3, 10],
       [ 7,  6,  8,  3,  5],
       [ 4,  2,  5,  1,  9]])
We can select an element of the array using two indices inside a square
bracket ([ ]), i.e., B[i, j], where i is the row index and j is the column index. For
example,
>>> print(B[1, 2]) # prints: 8
Indexing, Slicing and Arithmetic Operations in NumPy Array 155
If you notice the previous print() function, the i (1) and j (2) values are both inside the square brackets,
separated by a comma (,) operator. The print(B[1, 2]) picks row 1, column 2, which has the value 8. This
compares with the syntax you might use with a 2D list (i.e., a list of lists). That is:
>>> # A python list
>>> L1 = [[2, 4, 9, 3, 10],
          [7, 6, 8, 3, 5],
          [4, 2, 5, 1, 9]]
>>> print (L1[1][2])
which prints: 8
Indexing a Row or Column
To select elements from a 2D array by index, we must use the index positions of the array. Let us see how
to select elements from the following 2D array using an index.
>>> B = np.array([ [2, 4, 9, 3, 10],
                   [7, 6, 8, 3, 5],
                   [4, 2, 5, 1, 9] ])
1. To select a single element from a 2D Numpy array by index, we can use the [][] operator.
ndArray[row_index][column_index]
For example, to select the element at row 1 and column 2:
print(B[1][2]) # prints: 8
Or we can pass the comma separated list of indices representing row index and column index.
2. To select rows by index from a 2D Numpy array, we can call the [] operator to select a single or
multiple rows.
ndArray[row_index]
For example, to select the row at index 1:
print(B[1]) # prints: [7 6 8 3 5]
3. To select multiple rows, the format is:
ndArray[start_row : end_row, :]
For example, to
select multiple rows from index 1 to 2, the command is:
print (B[1:3, :])
[[7 6 8 3 5]
 [4 2 5 1 9]]
Similarly, to select multiple rows from index 1 to the last index, the
command is:
print (B[1: , :])
[[7 6 8 3 5]
 [4 2 5 1 9]]
4. NumPy allows us to select a single or multiple columns as well. To select columns by index from a
2D Numpy array, the format is:
ndArray[ : , column_index]
It will return a complete column at the given index. For example, to select the column at index 1, the
command is:
print(B[:, 1]) # prints: [4 6 2]
Or
print(B[:, -4]) # prints: [4 6 2]
The above indexing is just like slicing, and the B[:, 1] means:
• for the i or row value, it takes all values (: is a full slice, from start to end)
• for the j value take 1, i.e., all row values of column 1
To select multiple columns, the format is ndArray[ : , start_index : end_index].
It will return columns from start_index to end_index – 1 and will include all rows. For example, to
select multiple columns from index 1 to 2, the command is:
print (B[:, 1:3])
[[4 9]
 [6 8]
 [2 5]]
Similarly, to select multiple columns from index 1 to the last index, the command is:
print (B[:, 1:])
[[ 4  9  3 10]
 [ 6  8  3  5]
 [ 2  5  1  9]]
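The row and column selections above can be checked quickly in one script, using the same array B:

```python
import numpy as np

B = np.array([[2, 4, 9, 3, 10],
              [7, 6, 8, 3, 5],
              [4, 2, 5, 1, 9]])

row1 = B[1]           # single row   -> [7 6 8 3 5]
col1 = B[:, 1]        # single column -> [4 6 2]
block = B[1:3, 1:3]   # rows 1-2, columns 1-2 -> [[6 8], [2 5]]

print(row1)
print(col1)
print(block)
```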
6.3 Slicing NumPy Array
Assigning to and accessing the elements of an array is similar to other sequential data types of Python, i.e.,
lists and tuples. Slicing in the NumPy array is the way to extract a range of elements from an array. Slicing
in the array is performed in the same way as it is performed in the Python list, except you can do it in more
than one dimension.
Before slicing NumPy array elements, here is a quick recap of how slicing works with normal Python lists.
Suppose we have a list:
>>> A1 = [24, 12, 10, 34, 17, 13, 32, 51]
We can use slicing to take a sub-list, like this:
>>> x = A1[1:7]
>>> print (x) # or print(A1[1:7])
[12, 10, 34, 17, 13, 32]
Notice that the slice notation specifies a start and end value [start:end] and copies the list from the
start at index 1 (i.e., 2nd position in the list) up to 7 – 1 (i.e., 6th index position in the list). Note
that the index starts from 0, just like Python sequences such as list and tuple.
The basic format to slice NumPy array values is:
array[start : end : step]
Here, the start, end and step are all optional. These ranges work just like slices for lists. The start
specifies a range that starts, and stops one position before end, in step sizes of step. If you don't mention
any of them, the whole list is sliced:
>>> print (A1[::])
Or
>>> print (A1[:]) # Prints all the elements in the list
[24, 12, 10, 34, 17, 13, 32, 51]
We can slice list elements in different ways also. We can omit the start, in which case the slice would
start at the beginning of the list. For example, to access the first four elements:
>>> A1[:4] # Returns a list with four elements starting from 0th index
[24, 12, 10, 34]
We can omit the end, so the slice continues to the end of the list. For example, to print the list elements
starting from the 4th position till the end, the command is:
>>> print (A1[3:]) # Prints list elements starting from the third index (i.e., 4th position) till end
[34, 17, 13, 32, 51]
The important thing to note is the difference between an index and a slice of length 1. For example,
>>> print (A1[0]) # Prints the element at index 0, i.e., 24
>>> print (A1[0:1]) # Prints a slice of [0:1], i.e., [24]
Slicing 1D array
Slicing a 1D NumPy array is almost exactly the same as slicing a list. Slicing is specified using the colon
operator ':' with a [start:end:step] or 'from' and 'to' index before and after the colon, respectively. The
slice extends from the 'from' index and ends one item before the 'to' index. NumPy slicing is exactly the
same as for Python sequences, and the first index starts from the 0th position.
For example,
>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
>>> A1
array([24, 12, 10, 34, 17, 13, 32, 51])
>>> x = A1[1:4] # Creates another array using A1, with elements from index 1 to index 3
>>> print (x)
[12 10 34]
Or
>>> print (A1[0:4])
[24 12 10 34]
Here, the first items of the array are sliced by specifying a slice that starts at index 0 and ends at index 3
(one item before the 'to' index).
>>> print (A1[3:]) # Prints array elements starting from index 3 till end
[34 17 13 32 51]
>>> x = A1[1:7:2] # [start:end:step] - start is 1, i.e., 2nd position, end is index 7 and step is 2
>>> print (x)
[12 34 13]
Like a Python list, we can also use negative indexes in NumPy array slices. For example, to slice the last
four items in the array, start the slice at –4 and do not specify a 'to' index; that takes the slice to the
end of the dimension.
>>> print(A1[-4:]) # Prints the last four elements
[17 13 32 51]
>>> print (A1[::-1]) # Prints the array in reverse order
[51 32 13 17 34 10 12 24]
Example A 1D array x1 is given with the following values:
[8, 10, 21, –19, 4, 32, 45, –12, 66, 93, 11, –17]
Answer the following:
(a) Create the array x1.
(b) Print the value at index zero.
(c) Print the fifth value.
(d) Print the last value.
(e) Print the second last value.
(f) Print the values from the start to the 5th position.
Solution (a) x1 = np.array([8, 10, 21, -19, 4, 32, 45, -12, 66, 93, 11, -17])
(b) print (x1[0])
(c) print (x1[4])
(d) print (x1[-1])
(e) print (x1[-2])
(f) print (x1[:5])
Slicing 2D array
You can slice a 2D array in both axes to obtain a rectangular subset of the original array. For example,
>>> M = np.array([ [5, 4, 7, 3],
                   [6, 3, 6, 4],
                   [8, 6, 9, 5] ])
>>> M
array([[5, 4, 7, 3],
       [6, 3, 6, 4],
       [8, 6, 9, 5]])
Access the entries in a 2D array using the square brackets with 2 indices. In particular, access the entry
at row index 1 and column index 2:
>>> print (M[1, 2]) # Prints 6
To access the top left or first entry from the array:
>>> print (M[0, 0]) # Prints 5
Negative indices work for NumPy arrays as they do for Python sequences. For example, to access the
bottom right entry in the array:
>>> print (M[-1, -1]) # Prints 5
To access the row at index 2, use the colon : syntax. For example,
>>> print (M[2, :]) # Prints [8 6 9 5]
To access the column at index 2, use the colon : operator. For example,
>>> print (M[:, 2]) # Prints: [7 6 9]
You can slice a 2D array in both axes to obtain a rectangular subset of the original array. For example,
to select the sub array of rows at index 1 and 2, and columns at index 1, 2 and 3:
>>> subB = M[1:3, 1:4]
>>> print (subB)
[[3 6 4]
 [6 9 5]]
Similarly, to select rows 1: (index 1 to the bottom of the array) and columns 2:4 (columns 2 and 3):
>>> print (M[1:, 2:4])
[[6 4]
 [9 5]]
To reverse the order of the columns, use a negative step:
>>> print (M[:, ::-1])
[[3 7 4 5]
 [4 6 3 6]
 [5 9 6 8]]
Slices vs Indexing
As we saw earlier, you can use an index to select a particular plane, column or row. Here, we select row 1,
columns 2:4:
>>> print(M[1, 2:4]) # Prints: [6 4]
You can also use a slice of length 1 to do something similar (slice 1:2 instead of index 1):
>>> print(M[1:2, 2:4]) # Prints: [[6 4]]
Notice that the result of the slice is still a 2D array, while indexing with 1 returns a 1D array.
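The difference between indexing and a length-1 slice shows up in the result's shape; a quick sketch:

```python
import numpy as np

M = np.array([[5, 4, 7, 3],
              [6, 3, 6, 4],
              [8, 6, 9, 5]])

indexed = M[1, 2:4]    # index -> 1D result
sliced = M[1:2, 2:4]   # length-1 slice -> 2D result

print(indexed, indexed.shape)   # [6 4] (2,)
print(sliced, sliced.shape)     # [[6 4]] (1, 2)
```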
Example Write a program to print all odd numbers from a NumPy array. For example, if an array
A contains the following values:
[10, 21, 4, 45, 66, 93]
then the result will be:
The odd numbers are: 21 45 93
Solution # allodd.py
# Program to print odd numbers from a NumPy array
import numpy as np
A = np.array([10, 21, 4, 45, 66, 93])
# iterating each number in array
print ("The odd numbers are: ", end='')
for num in A: # A is the numpy array
    # checking condition for odd number
    if num % 2 != 0: # Note. For even numbers, the condition is: if num % 2 == 0:
        print(num, end = " ")
Example Write a program to copy the content of an array A to another array B, replacing all odd
numbers of array A with –1 without altering the original array A. Also, print both the arrays.
For example, if the input array A is:
The array A is [10 51 2 18 4 31 13 5 23 64 29]
The result will be:
The array B is [10 -1 2 18 4 -1 -1 -1 -1 64 -1]
Solution # copyarray.py
import numpy as np
A = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])
B = np.zeros(len(A), dtype=int) # array B of the same length to hold the result
ctr = 0
for num in A: # A is the numpy array & num extracts the values of A one by one
    # checking condition for odd number
    if num % 2 != 0:
        B[ctr] = -1
    else:
        B[ctr] = num
    ctr += 1
print("The array A is", A)
print("The array B is", B)
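The same replacement can be done without an explicit loop using np.where, a standard NumPy function; this vectorized version is an alternative sketch, not the textbook's loop-based solution:

```python
import numpy as np

A = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])

# Where A is odd take -1, otherwise keep the original value.
B = np.where(A % 2 != 0, -1, A)

print("The array A is", A)   # A is unchanged
print("The array B is", B)   # [10 -1 2 18 4 -1 -1 -1 -1 64 -1]
```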
6.4 Joining NumPy Arrays
Two or more arrays can be concatenated together using the NumPy concatenate() function. The syntax is:
numpy.concatenate((a1, a2, ...), axis=0)
Here,
• a1, a2, ... is the sequence (tuple) of arrays to be joined.
• axis denotes the axis along which the arrays will be
joined. Default is 0 (i.e., row join).
For example, to join two 1D arrays:
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])
print (np.concatenate((a1, a2))) # (a1, a2) is a tuple
which will print:
[1 2 3 4 5 6]
Try this:
import numpy as np
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])
a3 = np.array([7,8,9])
print (np.concatenate((a1, a2, a3)))
which prints:
[1 2 3 4 5 6 7 8 9]
Joining 2D NumPy Arrays
NumPy concatenate essentially combines together multiple NumPy arrays. If we are joining 2D arrays,
then they can be joined in two ways: row join (the default join) and column join. For example, to perform a
row join operation on two arrays A and B:
>>> A = np.array([[1, 2, 3], [4, 5, 6]])
>>> B = np.array([[11, 12, 13], [14, 15, 16]])
>>> print (np.concatenate((A, B)))
which prints:
[[ 1  2  3]
 [ 4  5  6]
 [11 12 13]
 [14 15 16]]
Similarly, to perform a column join operation on the two arrays A and B:
>>> print (np.concatenate((A, B), axis=1))
which prints:
[[ 1  2  3 11 12 13]
 [ 4  5  6 14 15 16]]
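NumPy also provides vstack() and hstack() as shorthands for these two joins; a small sketch:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[11, 12, 13], [14, 15, 16]])

v = np.vstack((A, B))   # same result as concatenate(..., axis=0)
h = np.hstack((A, B))   # same result as concatenate(..., axis=1)

print(v.shape)   # (4, 3)
print(h.shape)   # (2, 6)
```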
6.5 Creating Sub Array
In the last section, we used indexing and slicing methods on 1D and 2D array elements. We can also select
a sub array from a Numpy array using the [] operator. A sub array is just a view of the original array, i.e., the
data is not copied, but just a view of the sub array is created. When you modify the content of the sub array,
you will see that the original array is also modified/changed. The format is:
ndArray[first:last]
It will return a sub array from the original array with elements from index first to last – 1. Let's use a 1D
array A1 to select different sub arrays from the original NumPy array.
A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
Now let's see some examples,
subAr = A1[1:5] # sub array with elements from index 1 to 4
print ('The sub array is', subAr)
print ('The content of original array is', A1)
which prints:
The sub array is [12 10 34 17]
The content of original array is [24 12 10 34 17 13 32 51]
To modify the original array through the sub array, assign to an element of the sub array, for example
subAr[0] = -5; after this, both subAr and A1 show the changed value.
Try this:
X = np.array([24, 12, 10, 34, 17, 13, 32, 51])
subY = X[:4]
print (subY)
subY = X[:6]
print (subY)
Let us extract a sub array from
a 3 x 3 array that is given:
23 54 76
37 19 28
62 13 19
M = np.array([[23, 54, 76], [37, 19, 28], [62, 13, 19]])
The content of the array is:
>>> print (M)
[[23 54 76]
 [37 19 28]
 [62 13 19]]
Let us extract a 2x2 sub array from M:
subM = M[:2, :2] # Creating a 2 x 2 sub array
The content of the sub array is:
print (subM)
[[23 54]
 [37 19]]
Now, we can modify the original array through the sub array. For example, let us modify the [0, 1] index
position value with –39.
subM[0, 1] = -39
print (subM)
[[ 23 -39]
 [ 37  19]]
print (M)
[[ 23 -39  76]
 [ 37  19  28]
 [ 62  13  19]]
Notice that the change made through the sub array subM is also visible in the original array M.
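When an independent sub array is wanted instead of a view, call .copy() on the slice; a small sketch:

```python
import numpy as np

M = np.array([[23, 54, 76], [37, 19, 28], [62, 13, 19]])

view = M[:2, :2]           # a view: shares memory with M
indep = M[:2, :2].copy()   # an independent copy

indep[0, 0] = 0            # does NOT affect M
view[0, 1] = -39           # DOES affect M

print(M[0, 0])   # still 23
print(M[0, 1])   # now -39
```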
6.6 Arithmetic Operations on NumPy Arrays
NumPy arrays support element-wise operations, meaning that arithmetic operations on arrays are applied
to each value in the array. Arithmetic operators are commonly used to perform numeric calculations. Also,
each of these arithmetic operations is simply a convenient wrapper around a specific function built into
NumPy; for example, the + operator is a wrapper for the add function. NumPy has the following arithmetic
operators shown in Table 6.1 (the example column assumes y = np.array([8, 9, 11, 12, 10])).
Table 6.1 NumPy arithmetic operators.
Operator  Function         Description       Example                       Result
+         np.add           Addition          y+4 or np.add(y, 4)           [12, 13, 15, 16, 14]
–         np.subtract      Subtraction       y-4 or np.subtract(y, 4)      [4, 5, 7, 8, 6]
*         np.multiply      Multiplication    y*2 or np.multiply(y, 2)      [16, 18, 22, 24, 20]
/         np.divide        Division          y/2 or np.divide(y, 2)        [4. , 4.5, 5.5, 6. , 5. ]
%         np.mod           Modulus           y%2 or np.mod(y, 2)           [0, 1, 1, 0, 0]
**        np.power         Exponent (power)  y**2 or np.power(y, 2)        [64, 81, 121, 144, 100]
//        np.floor_divide  Floor Division    y//2 or np.floor_divide(y, 2) [4, 4, 5, 6, 5]
Let us see simple arithmetic operations for a 2D array:
Addition (+):
M                M + 2
5 4 7 3    5+2 4+2 7+2 3+2      7  6  9  5
6 3 6 4    6+2 3+2 6+2 4+2  =   8  5  8  6
8 6 9 5    8+2 6+2 9+2 5+2     10  8 11  7
>>> M = np.array([ [5, 4, 7, 3],
                   [6, 3, 6, 4],
                   [8, 6, 9, 5] ])
>>> print (M)
[[5 4 7 3]
 [6 3 6 4]
 [8 6 9 5]]
Let us perform an element-wise sum operation. Note that the original array does not change.
>>> print (M + 2)
Or
>>> print (np.add(M, 2))
[[ 7  6  9  5]
 [ 8  5  8  6]
 [10  8 11  7]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x + y)
print (np.add(x, y))
Subtraction (–):
Let us perform element-wise differences. Note that the original array does not change.
>>> print (M - 3)
Or
>>> print (np.subtract(M, 3))
[[2 1 4 0]
 [3 0 3 1]
 [5 3 6 2]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x - y)
print (np.subtract(x, y))
Multiplication (*):
M                M * 2
5 4 7 3    5*2 4*2 7*2 3*2     10  8 14  6
6 3 6 4    6*2 3*2 6*2 4*2  =  12  6 12  8
8 6 9 5    8*2 6*2 9*2 5*2     16 12 18 10
>>> print (M*2) # Performs element-wise multiplication.
Or
>>> print (np.multiply(M, 2))
[[10  8 14  6]
 [12  6 12  8]
 [16 12 18 10]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x * y)
print (np.multiply(x, y))
Division (/):
>>> print (M/2) # Performs element-wise division. Note that the original array does not change.
Or
>>> print (np.divide(M, 2))
[[2.5 2.  3.5 1.5]
 [3.  1.5 3.  2. ]
 [4.  3.  4.5 2.5]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x / y)
print (np.divide(x, y))
Modulus (%):
>>> print (M%2) # Performs an element-wise modulus operation.
Or
>>> print (np.mod(M, 2))
[[1 0 1 1]
 [0 1 0 0]
 [0 0 1 1]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x % y)
print (np.mod(x, y))
We can also apply the modulus operator as a conditional expression over the array values. For example,
a = np.array([10, 21, 4, 45, 66, 93, 7, 11, 13])
print (a%3 == 0)
[False  True False  True  True  True False False False]
Here, the expression a%3 == 0 evaluates the modulus for each element and reports the result as a logical value: True or False. That is:
a[0] % 3 = 10 % 3 = 1, i.e., False
a[1] % 3 = 21 % 3 = 0, i.e., True
a[2] % 3 = 4 % 3 = 1, i.e., False
....
and so on.
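Such a boolean result can also be used as a mask to pull out just the matching elements. A minimal sketch using the same array a:

```python
import numpy as np

a = np.array([10, 21, 4, 45, 66, 93, 7, 11, 13])
mask = (a % 3 == 0)      # boolean array: True where the element is divisible by 3
print(a[mask])           # boolean indexing keeps only the True positions
# prints: [21 45 66 93]
```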
Exponentiation (**):

>>> print (M**2)
Or
>>> print (np.power(M, 2))
[[25 16 49  9]
 [36  9 36 16]
 [64 36 81 25]]

Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x ** y)
print (np.power(x, y))
Floor division (//): Performs element-wise floor division and produces the integer part of the division.

>>> print (M//2)
Or
>>> print (np.floor_divide(M, 2))
[[2 2 3 1]
 [3 1 3 2]
 [4 3 4 2]]

Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x // y)
print (np.floor_divide(x, y))
The compound-assignment operators combine the simple assignment operator with another binary operator. A compound-assignment operator performs the operation specified by the additional operator and then assigns the result to the left operand. The compound operations act in place, modifying the existing array rather than creating a new one. Table 6.2 shows the compound assignment operators.

Table 6.2 Compound assignment operators

Operator    Example     Equivalent to
+=          s += x      s = s + x
-=          s -= x      s = s - x
*=          s *= x      s = s * x
168 Saraswati Informatics Practices XII
/=          s /= x      s = s / x
%=          s %= x      s = s % x
**=         s **= x     s = s ** x
//=         s //= x     s = s // x
For example, to perform an element-wise compound addition operation with a 2D array:

    P              P += 10
    23 54 76       23+10 54+10 76+10         33 64 86
    37 19 28       37+10 19+10 28+10    =    47 29 38
    62 13 19       62+10 13+10 19+10         72 23 29

Remember that all the compound operations change the original array in place.

>>> P = np.array( [ [23, 54, 76],
                    [37, 19, 28],
                    [62, 13, 19] ] )
>>> print (P)
[[23 54 76]
 [37 19 28]
 [62 13 19]]
>>> P += 10     # Performs element-wise sum; the change is made in the original array.
>>> print (P)
[[33 64 86]
 [47 29 38]
 [72 23 29]]
>>> P -= 5      # Element-wise compound subtraction, again in place.
>>> print (P)
[[28 59 81]
 [42 24 33]
 [67 18 24]]

Try this:
import numpy as np
x = np.array([1,2,3,4], float)
x += 2      # Don't write print (x += 2)
print (x)
x -= 2
print (x)
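A quick way to convince yourself that compound operators work in place is to compare the array's identity before and after the operation. This is a small sketch (the variable names are illustrative, not from the text):

```python
import numpy as np

s = np.array([3, 6, 9, 12])
original_id = id(s)

s *= 2                    # compound operator: modifies s in place
assert id(s) == original_id

t = s * 2                 # plain operator: builds a brand-new array
assert id(t) != original_id

print(s)   # the original array has changed: [ 6 12 18 24]
print(t)   # the new array:                  [12 24 36 48]
```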
The sum of two matrices x and y of the same order is written x + y and is defined to be the matrix of element-wise sums. For example:

The sum of x =  1 2 3   and y =  4 5 6   is x + y =  5 7 9
               3 4 5            2 3 4               5 7 9

>>> x = np.array([[1, 2, 3], [3, 4, 5]])
>>> y = np.array([[4, 5, 6], [2, 3, 4]])
>>> print (x+y)
Or
>>> print (np.add(x, y))
[[5 7 9]
 [5 7 9]]
Similarly, we can perform other operations such as subtraction, multiplication and division.

Subtraction (–):

    x          y          x–y
    1 2 3      4 5 6      –3 –3 –3
    3 4 5      2 3 4       1  1  1

>>> print (x-y)
Or
>>> print (np.subtract(x, y))
[[-3 -3 -3]
 [ 1  1  1]]
Multiplication (*):

    x          y          x*y
    1 2 3      4 5 6      4 10 18
    3 4 5      2 3 4      6 12 20

>>> print (x*y)
Or
>>> print (np.multiply(x, y))
[[ 4 10 18]
 [ 6 12 20]]
Division (/):

    x          y          x/y
    1 2 3      4 5 6      0.25 0.4        0.5
    3 4 5      2 3 4      1.5  1.33333333 1.25

>>> print (x/y)
Or
>>> print (np.divide(x, y))
[[0.25       0.4        0.5       ]
 [1.5        1.33333333 1.25      ]]

Exponentiation (**):

>>> print (x**y)
Or
>>> print (np.power(x, y))
[[  1  32 729]
 [  9  64 625]]
Let us practice the arithmetic operations with two 3 x 3 arrays:

The sum of p =  1 –3  0   and q =  6  3 –3   is p + q =   7 0 –3
               4  2 –1            7 –2  5               11 0  4
               5  3 –2            2  6  9                7 9  7

E.g.,
>>> p = np.array( [ [1, -3, 0],
                    [4, 2, -1],
                    [5, 3, -2] ] )
>>> q = np.array( [ [6, 3, -3],
                    [7, -2, 5],
                    [2, 6, 9] ] )
>>> print (p+q)     # element-wise sum
[[ 7  0 -3]
 [11  0  4]
 [ 7  9  7]]
Similarly, we can perform other operations such as subtraction, multiplication and division. Consider the examples below:

>>> print (p-q)     # element-wise difference
[[ -5  -6   3]
 [ -3   4  -6]
 [  3  -3 -11]]
>>> print (p*q)     # element-wise product
[[  6  -9   0]
 [ 28  -4  -5]
 [ 10  18 -18]]
Matrix Multiplication
In Python 3.5+, the symbol @ computes matrix multiplication for NumPy arrays.
If A =  M11 M12    and B =  N11 N12 N13
        M21 M22             N21 N22 N23

then the product will be:

A × B =  M11×N11 + M12×N21    M11×N12 + M12×N22    M11×N13 + M12×N23
         M21×N11 + M22×N21    M21×N12 + M22×N22    M21×N13 + M22×N23

For example, suppose two matrices A and B are of size 2 × 2 and 2 × 3, respectively. Here is a pictorial representation of the computation:

A =  1 2    B =  5 6  7
     3 4         8 9 10

A × B =  1×5 + 2×8    1×6 + 2×9    1×7 + 2×10
         3×5 + 4×8    3×6 + 4×9    3×7 + 4×10

Hence,

A × B =  21 24 27
         47 54 61

>>> A = np.array([[1, 2],
                  [3, 4]])
>>> B = np.array([[5, 6, 7],
                  [8, 9, 10]])
>>> print (A@B)     # matrix product
[[21 24 27]
 [47 54 61]]
To multiply two matrices, we can also use the dot() method. NumPy is a powerful library for matrix computation. For instance, you can compute the dot product with np.dot(). The syntax is:
numpy.dot(a, b, out=None)
Here, the function returns the dot product of a and b. If a and b are both scalars or both 1-D arrays then a scalar is returned; otherwise an array is returned.
For example, to generate the matrix multiplication of A and B using numpy.dot(), the command is:
>>> print (np.dot(A, B))
[[21 24 27]
 [47 54 61]]
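To see the difference between the element-wise * operator and the true matrix product (@ or np.dot()), here is a small self-contained sketch using the same A and B values as above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6, 7], [8, 9, 10]])

print(A @ B)           # matrix product, shape (2, 3)
print(np.dot(A, B))    # same result as @ for 2-D arrays

# Element-wise * needs equal (or broadcastable) shapes, so A * B would
# raise a ValueError here: shape (2, 2) cannot broadcast with (2, 3).
```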
6.7 NumPy Array Functions
There are many array functions we can use to compute with NumPy arrays. Since NumPy arrays are Python objects, they have methods associated with them.
Most NumPy array functions take an axis parameter. By default, the axis value is None; the parameter is optional, and if it is not provided the function operates on the flattened array and returns a single value (for example, the max value in the case of numpy.max()). If the axis parameter is provided, the function returns an array of values along that axis, i.e.:
• If axis=0, it returns an array containing the max value for each column.
• If axis=1, it returns an array containing the max value for each row.

numpy.sum() Function
numpy.sum() Function
The numpy.sum() function returns the sum of all the elements in the array. With an axis argument, the sums along the specified axis will be calculated.
When np.sum() operates on a 1-dimensional NumPy array, it simply sums up all the values. That is:

    24 12 10 34 17 13 32 51  →  193
Every axis in a NumPy array has a number, starting with 0. In this way, they are similar to Python indexes in that they start at 0, not 1. So the first axis is axis 0, the second axis (in a 2-D array) is axis 1, for multi-dimensional arrays the third axis is axis 2, and so on.

              axis = 1 →
              0    1    2    3
         0    2    4    9    3
axis = 0 1    7    6    8    1
    ↓    2    4    2    5   11
For example, if we set axis = 0, we are indicating that we want to sum up the rows. Remember, axis 0
refers to the row axis. Likewise, if we set axis = 1, we are indicating that we want to sum up the columns.
Remember, axis 1 refers to the column axis.
Next, let’s sum all of the elements in a 2-dimensional NumPy array P.
d
P = np.array([ [2, 4, 9, 3],
ite
[7, 6, 8, 1],
[4, 2, 5, 11]])
m
print (p)
Li
[ [ 2, 4, 9, 3 ]
[ 7, 6, 8, 1 ]
e
at
[ 4, 2, 5, 11] ]
iv
To find the total of all elements:
Pr
print ('Sum of entire array =',P.sum())
Sum of entire array = 62
a
Notice that when we use the NumPy sum function without specifying an axis, it will simply add together
di
all of the values and produce a single scalar value. In
2 4 9 3
se
7 6 8 1
ou
4 2 5 11 ` 62
H
But using the axis parameter is a little confusing. We might think that axis 0 is row-wise, and axis 1 is column-wise:

row-wise (axis 0) --->  [[ 8 5 ]
                         [ 4 6 ]]
                            |
                            |
                  column-wise (axis 1)

However, the result of numpy.sum() is the exact opposite of this intuition. Let us see the following two sums:

    axis 0            axis 1
     ↓  ↓
     8  5             8 5  → 13
     4  6             4 6  → 10
     ↓  ↓
    12 11

The way to understand the "axis" of numpy.sum() is that it collapses the specified axis. So, when it collapses axis 0 (the rows), only one row remains, holding the column-wise sums. That is, when we set axis = 0, we are basically saying, "sum the rows", i.e., to operate on the rows only. This is often called a row-wise operation.
Similarly, when it collapses axis 1 (the columns), only one column remains, holding the row-wise sums. That is, when we set the parameter axis = 1, we are telling the np.sum function to operate on the columns only. Specifically, we are telling the function to sum up the values across the columns.
For example, to find the sum along axis 0:
x = np.array([[8, 5], [4, 6]])
print (x)
[[8 5]
 [4 6]]
print ('Sum along axis 0 =', x.sum(axis = 0))
which will print:
Sum along axis 0 = [12 11]
Similarly, to find the sum along axis 1:
print ('Sum along axis 1 =', x.sum(axis = 1))
which will print:
Sum along axis 1 = [13 10]
Example   An array P contains the following:
          P = np.array([[2, 4, 9, 3],
                        [7, 6, 8, 1],
                        [4, 2, 5, 11]])
          Write the commands to print the row sum and column sum values.
Solution  print ('Row-sum =', P.sum(axis = 0))       # returns: Row-sum = [13 12 22 15]
          print ('Column-sum =', P.sum(axis = 1))    # returns: Column-sum = [18 22 22]
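The collapse behaviour described above can also be checked directly from the result shapes; a small sketch using the same array P:

```python
import numpy as np

P = np.array([[2, 4, 9, 3],
              [7, 6, 8, 1],
              [4, 2, 5, 11]])    # shape (3, 4): 3 rows, 4 columns

print(P.sum(axis=0))         # axis 0 collapsed: 4 column sums remain
print(P.sum(axis=0).shape)   # (4,)
print(P.sum(axis=1))         # axis 1 collapsed: 3 row sums remain
print(P.sum(axis=1).shape)   # (3,)
```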
numpy.max() and numpy.min() Functions
The numpy.max() function returns the scalar value which is the largest element in the entire array. If an axis is defined for an N-dimensional array, the maximum values along that axis are returned.
To find the maximum number in a 1D array:
>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
>>> A1.max()     # returns: 51
The numpy.min() function returns the scalar value which is the smallest element in the entire array. If an axis is defined for an N-dimensional array, the minimum values along that axis are returned.
To find the minimum number in a 1D array:
>>> A1.min()     # returns: 10
The axis parameter of the numpy.max() and numpy.min() functions operates in the same way as that of the numpy.sum() function.
When numpy.max(axis=0):                 When numpy.max(axis=1):

    axis 0                                  axis 1
     ↓ ↓ ↓  ↓
     2 4 9  3                           2 4 9  3  →  9
     7 6 8  1                           7 6 8  1  →  8
     4 2 5 11                           4 2 5 11  → 11
     ↓ ↓ ↓  ↓
     7 6 9 11

Here, numpy.max finds the maximum       Here, numpy.max finds the maximum
down the rows when we set axis = 0      across the columns when we set axis = 1

When numpy.min(axis=0):                 When numpy.min(axis=1):

    axis 0                                  axis 1
     ↓ ↓ ↓  ↓
     2 4 9  3                           2 4 9  3  →  2
     7 6 8  1                           7 6 8  1  →  1
     4 2 5 11                           4 2 5 11  →  2
     ↓ ↓ ↓  ↓
     2 2 5  1

Here, numpy.min finds the minimum       Here, numpy.min finds the minimum
down the rows when we set axis = 0      across the columns when we set axis = 1
For example, to find the maximum numbers row-wise and column-wise in the 2D array P:
print ('Row-wise maximum values =', P.max(axis = 0))
which will print:
Row-wise maximum values = [ 7  6  9 11]
print ('Column-wise maximum values =', P.max(axis = 1))
which will print:
Column-wise maximum values = [ 9  8 11]
Similarly, to find the minimum numbers row-wise and column-wise in the 2D array P:
print ('Row-wise minimum values =', P.min(axis = 0))
which will print:
Row-wise minimum values = [2 2 5 1]
print ('Column-wise minimum values =', P.min(axis = 1))
which will print:
Column-wise minimum values = [2 1 2]
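All of the row-wise and column-wise results above can be reproduced in one short script, using the same arrays A1 and P as in this section:

```python
import numpy as np

A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
P = np.array([[2, 4, 9, 3],
              [7, 6, 8, 1],
              [4, 2, 5, 11]])

print(A1.max(), A1.min())    # extremes of the whole 1D array
print(P.max(axis=0))         # maximum down each column
print(P.min(axis=1))         # minimum across each row
```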
Example   Write the commands to create a 1D array with 10 random values and print the maximum value.
Solution  X = np.random.random(10)
          print (X.max())
numpy.mean() Function
The numpy.mean() function returns the mean or average of all the values in a list/array, or along rows or columns.
To find the mean value of a 1D array:
>>> A1.mean()     # returns: 24.125, i.e., 193/8
The mean along columns and rows is calculated exactly as in the previous functions numpy.sum(), numpy.max(), etc. For example, let us find the mean using the previous 2D array P:
print ('Mean of entire array =', P.mean())        # returns: 5.166666666666667
print ('Mean along axis 0 =', P.mean(axis = 0))   # returns: [4.33333333 4.  7.33333333 5.]
print ('Mean along axis 1 =', P.mean(axis = 1))   # returns: [4.5 5.5 5.5]
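Since the mean is just the sum divided by the number of elements, the relationship between numpy.mean() and numpy.sum() can be verified directly:

```python
import numpy as np

P = np.array([[2, 4, 9, 3],
              [7, 6, 8, 1],
              [4, 2, 5, 11]])

# mean of the whole array = total sum / number of elements
assert np.isclose(P.mean(), P.sum() / P.size)

# the same identity holds along an axis: each column mean is the
# column sum divided by the number of rows
assert np.allclose(P.mean(axis=0), P.sum(axis=0) / P.shape[0])

print(P.mean(), P.sum() / P.size)
```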
numpy.sort() Function
The numpy.sort() function sorts the elements along the given axis in ascending order, the default axis being the last one (–1). Axis numbering starts with 0, and the axis parameter again behaves like that of the numpy.sum() function. For example,
>>> import numpy as np
>>> a = np.array([3, 7, 4, 8, 2, 15])
>>> np.sort(a)
array([ 2,  3,  4,  7,  8, 15])
>>> S1 = np.array( [ [23, 54, 76],
                     [37, 19, 28],
                     [62, 13, 19] ] )
# row-wise sorting (down each column)
>>> np.sort(S1, axis=0)
array([[23, 13, 19],
       [37, 19, 28],
       [62, 54, 76]])
# column-wise sorting (within each row)
>>> np.sort(S1, axis=1)
array([[23, 54, 76],
       [19, 28, 37],
       [13, 19, 62]])
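numpy.sort() has no built-in descending option, but reversing the sorted result with slicing gives the same effect; a small sketch:

```python
import numpy as np

a = np.array([3, 7, 4, 8, 2, 15])

ascending = np.sort(a)           # sorted copy; a itself is unchanged
descending = np.sort(a)[::-1]    # reverse the sorted copy with a slice
print(descending)                # prints: [15  8  7  4  3  2]
```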
Points to Remember
1. Array indexing refers to any use of the square brackets ([ ]) to index array values.
2. Single element indexing for a 1D array works exactly like that for other standard Python sequences.
4. Slicing is specified using the colon operator ':' in a [start:end] form, i.e., with the 'from' and 'to' indexes before and after the colon respectively.
SOLVED EXERCISES
1. What will be the output of the following program?
   import numpy as np
   X = np.arange(4, 20, 4)
   print("X =", X)
   print("X + 5 =", X + 5)
   print("X - 5 =", X - 5)
   print("X * 2 =", X * 2)
   print("X / 2 =", X / 2)
   print("X // 2 =", X // 2)
   Ans. The output is:
   X = [ 4  8 12 16]
   X + 5 = [ 9 13 17 21]
   X - 5 = [-1  3  7 11]
   X * 2 = [ 8 16 24 32]
   X / 2 = [2. 4. 6. 8.]
   X // 2 = [2 4 6 8]
2. There are two arrays a and b given (assume that np is used as the numpy namespace):
   a = np.array([[1,2],[3,4]])
   b = np.array([[5,10]])
   What will be the output of the following?
   (a) print (a + b)
   (b) d = np.array([5,10])
       dd = d.reshape(1,2)
       print (a + dd)
   Ans. The output is:
   (a) [[ 6 12]
        [ 8 14]]
   (b) [[ 6 12]
        [ 8 14]]
3. Two 2D arrays are given as below:
   x = np.array([[1, 3, 5], [3, 4, 2], [5, 2, 0]])
   y = np.array([[1], [5], [3]])
   Find the output of print (x*y).
   Ans. y is broadcast across the columns, so each row of x is multiplied by the corresponding value of y. The output is:
   [[ 1  3  5]
    [15 20 10]
    [15  6  0]]
4. An array X contains the following:
   X = np.full((3, 4), True, dtype=bool)
   D[::2] += 2
   print (D)
   [ 4  5  1 10]
   (b) Create a 2D array Z by multiplying a 5x3 matrix of all values 1 by a 3x2 matrix with all values 1 (real matrix product).
   (c) Create a 10x10 array with random values and find the minimum and maximum values.
   Ans. (a) X = np.zeros(10)
        X[4] = 1
        print(X)
8. An array Num is given with the values [10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29]. Write the commands to create the array Num and replace all odd numbers in Num with -1. Also print the array Num.
   Ans. The commands are:
   Num = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])
   Num[Num % 2 == 1] = -1
   print (Num)
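As an alternative (not required by the exercise), the same replacement can be done without modifying Num in place by using np.where(), which builds a new array from a condition:

```python
import numpy as np

Num = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])

# np.where(condition, value_if_true, value_if_false) returns a new array,
# leaving Num itself unchanged
result = np.where(Num % 2 == 1, -1, Num)
print(result)   # [10 -1  2 18  4 -1 -1 -1 -1 64 -1]
print(Num)      # original array is untouched
```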
9. What will be the output of the following program?
   import numpy as np
   a2dr = np.array([1, 2, 3, 4]*3).reshape((3, 4))
   print('The array is:')
   print(a2dr)
   print('')
   print('mean entire array =', np.mean(a2dr))
   print('mean along columns =', np.mean(a2dr, axis=0))
   print('mean along rows =', np.mean(a2dr, axis=1))
   Ans. The output is
   The array is:
   [[1 2 3 4]
    [1 2 3 4]
    [1 2 3 4]]

   mean entire array = 2.5
   mean along columns = [1. 2. 3. 4.]
   mean along rows = [2.5 2.5 2.5]
   (ii) C.max(axis = 1)
   (iii) C.mean(axis = 0)
REVIEW QUESTIONS
1. If x1 is an array with the following data, then write the command for each of the following:
   import numpy as np
   x1 = np.array([14, 13, 14, 15, 18, 12, 14])
   (a) Access the value at index zero.
   (b) Access the fifth value.
   (c) Get the last value.
   (d) Get the second last value.
2. A multi-dimensional array x2 contains the following values:
   import numpy as np
   x2 = np.array( [ [13, 17, 13, 15],
                    [10, 21, 15, 19],
                    [31, 14, 16, 30] ] )
   Using array indexing, write the command for each of the following:
   (a) Get the 1st row and 2nd column value.
   (b) Get the 3rd row and the last value from the 3rd column.
   (c) Get the third row.
   (d) Get the third column.
   (e) Replace the value at index (0, 0) with 17.
   (f) Print the array in reverse order of the rows.
3. A multi-dimensional 5 x 5 array contains the following values:
   array([[ 1,  3,  5,  7,  9],
          [11, 13, 15, 17, 19],
   b = np.sort(a, axis=1)
   print (b)
5. Write a program to extract all even numbers from a NumPy array. For example, if an array A contains