
Saraswati

INFORMATICS PRACTICES
[A TEXTBOOK FOR CLASS XII]

By

Reeta Sahoo, M.C.A.          Gagan Sahoo, M.C.A.

New Saraswati House (India) Private Limited
New Delhi-110002 (INDIA)
Head Office: Second Floor, MGM Tower, 19 Ansari Road, Daryaganj, New Delhi–110 002 (India)
Registered Office: A-27, 2nd Floor, Mohan Co-operative Industrial Estate, New Delhi–110 044

Phone: +91-11-4355 6600
Fax: +91-11-4355 6688
E-mail: [email protected]
Website: www.saraswatihouse.com
CIN: U22110DL2013PTC262320
Import-Export Licence No. 0513086293

Branches:
• Ahmedabad: Ph. 079-2657 5018 • Bengaluru: Ph. 080-2675 6396
• Chennai: Ph. 044-2841 6531 • Bhubaneswar: Ph. +91-94370 05810
• Guwahati: Ph. 0361-2457 198 • Hyderabad: Ph. 040-4261 5566 • Jaipur: Ph. 0141-4006 022
• Jalandhar: Ph. 0181-4642 600, 4643 600 • Kochi: Ph. 0484-4033 369 • Kolkata: Ph. 033-4004 2314
• Lucknow: Ph. 0522-4062 517 • Mumbai: Ph. 022-2876 9871, 2873 7090
• Patna: Ph. 0612-2275 403 • Ranchi: Ph. 0651-2244 654 • Nagpur: Ph. +91 9371940224

Revised edition 2019
Reprinted 2020

ISBN: 978-93-53621-34-6

The moral rights of the author have been asserted.

© New Saraswati House (India) Private Limited

Jurisdiction: All disputes with respect to this publication shall be subject to the jurisdiction of the Courts, Tribunals and Forums of New Delhi, India only.

All rights reserved under the Copyright Act. No part of this publication may be reproduced, transcribed, transmitted, stored in a retrieval system or translated into any language or computer, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual, photocopy or otherwise, without the prior permission of the copyright owner. Any person who does any unauthorised act in relation to this publication may be liable to criminal prosecution and civil claims for damages.

Product Code: NSS3IPC120CSCAB19CBN

This book is meant for educational and learning purposes. The author(s) of the book has/have taken all reasonable care to ensure that the contents of the book do not violate any copyright or other intellectual property rights of any person in any manner whatsoever. In the event the author(s) has/have been unable to track any source and if any copyright has been inadvertently infringed, please notify the publisher in writing for corrective action.

PRINTED IN INDIA
By Vikas Publishing House Private Limited, Plot 20/4, Site-IV, Industrial Area Sahibabad, Ghaziabad–201 010 and published by New Saraswati House (India) Private Limited, 19 Ansari Road, Daryaganj, New Delhi–110 002 (India)
Preface

Informatics Practices is designed as per the new syllabus prescribed by CBSE for Class XII. The book covers advanced operations on the Pandas DataFrame, descriptive statistics, histograms, function application, etc. The NumPy array is used to handle multi-dimensional array objects. Software engineering, database management using MySQL, computing ethics and cyber safety are the other highlights of the book.

The salient features of the book are:

1. Easy-to-understand aggregation operations, descriptive statistics, and re-indexing of columns in a DataFrame.
2. Learn to plot different graphs using a Pandas DataFrame.
3. Apply functions row-wise and element-wise on a DataFrame.
4. Detailed explanation of how the NumPy array is used in scientific computing applications, including covariance, correlation and linear regression.
5. NumPy arrays discussed in detail for handling multi-dimensional array objects.
6. Understanding of basic software engineering: models, activities, business use-case diagrams, and version control systems.
7. Know-how to connect a Python program with an SQL database, and learn aggregation functions in SQL.
8. A clear understanding of cyber ethics and cybercrime. Helps to understand the value of technology in societies, gender and disability issues, and the technology behind biometric IDs.

This new edition provides you with updated knowledge based on plenty of solved exercises and unsolved questions. It includes complete and easily understandable explanations of the commonly used features of the Python Pandas DataFrame, NumPy and MySQL. The questions and examples will enable students to test their knowledge and understanding.

Many solved and unsolved examples are provided at the end of each chapter. The book also comes with a CD containing a folder called IPSource_XII, which includes chapter-wise solved examples and can be made available on demand.

We would also like to convey our sincere thanks to the dedicated team of New Saraswati House (India) Pvt. Ltd. for bringing out this book in an excellent form.

Suggestions for further improvement of the book will be gratefully acknowledged.

Reeta Sahoo & Gagan Sahoo
Phone: 011-42953418
Mobile No.: 9818588644
E-mail: [email protected]
SYLLABUS

UNIT-WISE MARKS (THEORY)

Unit No.  Unit Name                        Periods             Marks
                                           Theory  Practical
1.        Data Handling - 2                  80       70        30
2.        Basic Software Engineering         25       10        15
3.        Data Management - 2                20       20        15
4.        Society, Law and Ethics - 2        15                 10
5.        Practicals                                            30
          Total                                                100
Unit 1: Data Handling (DH-2)

Python Pandas
• Advanced operations on DataFrames: pivoting, sorting, and aggregation
• Descriptive statistics: min, max, mode, mean, count, sum, median, quartile, var
• Create a histogram, and quantiles
• Function application: pipe, apply, aggregation (group by), transform, and applymap
• Reindexing, and altering labels
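The DataFrame topics above can be previewed in a small sketch (the column names and data are illustrative, not from the book):

```python
import pandas as pd

# Illustrative marks data
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "marks": [78, 91, 85]})

# Descriptive statistics on a column
print(df["marks"].min(), df["marks"].max(), df["marks"].mean())

# Function application: column-wise with apply
print(df[["marks"]].apply(lambda s: s + 5))

# Altering labels by renaming a column
renamed = df.rename(columns={"marks": "score"})
print(renamed.columns.tolist())
```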
NumPy
• 1D array, 2D array
• Arrays: slices, joins, and subsets
• Arithmetic operations on 2D arrays
• Covariance, correlation and linear regression
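A minimal sketch of these NumPy operations (sample arrays are made up for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Element-wise arithmetic on 2D arrays
print(a + b)                   # [[11 22] [33 44]]
print(a * b)

# Slicing and joining
print(a[0, :])                 # first row
print(np.concatenate((a, b)))  # join along rows

# Correlation between two 1D arrays
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])
print(np.corrcoef(x, y)[0, 1])  # perfectly correlated: 1.0
```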


Plotting with Pyplot
• Plot bar graphs, histograms, frequency polygons, box plots, and scatter plots
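A minimal Pyplot sketch (the marks list and file name are illustrative); bar, box and scatter plots follow the same pattern with plt.bar(), plt.boxplot() and plt.scatter():

```python
import matplotlib
matplotlib.use("Agg")          # render without a display window
import matplotlib.pyplot as plt

marks = [45, 67, 82, 90, 58, 73]

# Histogram of the marks
plt.hist(marks, bins=4)
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.savefig("hist.png")        # use plt.show() in an interactive session
```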
Unit 2: Basic Software Engineering (BSE)

• Introduction to software engineering
• Software processes: waterfall model, evolutionary model, and component based model
• Delivery models: incremental delivery, spiral delivery
• Process activities: specification, design/implementation, validation, evolution
• Agile methods: pair programming, and Scrum
• Business use-case diagrams
• Practical aspects: version control system (GIT), and case studies of software systems with use-case diagrams
Unit 3: Data Management (DM-2)

• Write a minimal Django based web application that parses a GET and POST request, and writes the fields to a file – flat file and CSV file
• Interface Python with an SQL database
• SQL commands: aggregation functions, having, group by, order by
Unit 4: Society, Law and Ethics (SLE-2)

• Intellectual property rights, plagiarism, digital rights management, and licensing (Creative Commons, GPL and Apache), open source, open data, privacy
• Privacy laws, fraud; cybercrime: phishing, illegal downloads, child pornography, scams; cyber forensics, IT Act, 2000
• Technology and society: understanding of societal issues and cultural changes induced by technology
• E-waste management: proper disposal of used electronic gadgets
• Identity theft, unique IDs, and biometrics
• Gender and disability issues while teaching and using computers
• Role of new media in society: online campaigns, crowdsourcing, smart mobs
• Issues with the internet: internet as an echo chamber, net neutrality, internet addiction
• Case studies: Arab Spring, WikiLeaks, Bitcoin
Practical

S. No.  Unit Name                                                         Marks (Total = 30)

1.      Lab Test (10 marks)
        Python programs for data handling (60% logic + 20%                      7
        documentation + 20% code quality)
        Small Python program that sends a SQL query to a database               3
        and displays the result. A stub program can be provided.

2.      Report File + viva (9 marks)
        Report file: Minimum 21 Python programs. Of these, at least             7
        4 programs should send SQL commands to a database and
        retrieve the result; at least 1 program should implement a
        web server to write user data to a CSV file.
        Viva voce based on the report file                                      2

3.      Project + viva (11 marks)
        Project (that uses most of the concepts that have been learnt)          8
        Project viva voce                                                       3
Data Management: SQL + web-server
• Find the min, max, sum, and average of the marks in a student marks table
• Find the total number of customers from each country in the table (customer ID, customer name, country) using group by
• Write a SQL query to order the (student ID, marks) table in descending order of the marks
• Integrate SQL with Python by importing the MySQLdb module
• Write a Django based web server to parse a user request (POST), and write it to a CSV file
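The SQL tasks above can be sketched as follows. For portability this sketch uses Python's built-in sqlite3 module with an in-memory database; with MySQLdb or mysql.connector the connect/cursor/execute/fetch pattern is the same. Table and column names are illustrative:

```python
import sqlite3

# In-memory database stands in for a MySQL server
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE student (sid INTEGER, marks INTEGER)")
cur.executemany("INSERT INTO student VALUES (?, ?)",
                [(1, 80), (2, 65), (3, 92)])

# Min, max, sum and average of the marks
cur.execute("SELECT MIN(marks), MAX(marks), SUM(marks), AVG(marks) FROM student")
stats = cur.fetchone()
print(stats)

# Rows in descending order of marks
cur.execute("SELECT sid, marks FROM student ORDER BY marks DESC")
rows = cur.fetchall()
print(rows)
con.close()
```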

Data handling using Python libraries
• Use map functions to convert all negative numbers in a DataFrame to the mean of all the numbers
• Consider a DataFrame where each row contains the item category, item name, and expenditure. Group the rows by the category, and print the total expenditure per category
• Given a Series, print all the elements that are above the 75th percentile
• Given a day's worth of stock market data, aggregate it. Print the highest, lowest, and closing prices of each stock
• Given sample data, plot a linear regression line
• Take data from government web sites, aggregate and summarize it. Then plot it using different plotting functions of the PyPlot library
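Three of the tasks above can be sketched in a few lines of Pandas (the sample values and column names are invented for illustration):

```python
import pandas as pd

# Replace negative numbers with the mean of all the numbers
s = pd.Series([4, -2, 6, -8, 10])
m = s.mean()                          # mean includes the negatives: 2.0
cleaned = s.map(lambda v: m if v < 0 else v)
print(cleaned.tolist())

# Elements above the 75th percentile
q3 = s.quantile(0.75)
print(s[s > q3].tolist())

# Total expenditure per category
df = pd.DataFrame({"category": ["food", "food", "travel"],
                   "expenditure": [120, 80, 300]})
print(df.groupby("category")["expenditure"].sum())
```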
Basic Software Engineering
• Business use-case diagrams for an airline ticket booking system, train reservation system, stock exchange
• Collaboratively write a program and manage the code with a version control system (GIT)
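A minimal solo GIT workflow looks like the sketch below (repository, file and identity names are illustrative); collaboration adds git clone, git pull and git push against a shared remote:

```shell
mkdir demo-repo && cd demo-repo
git init                                  # create a new repository
git config user.email "you@example.com"   # identity needed for commits
git config user.name "Student"
echo "print('hello')" > hello.py
git add hello.py                          # stage the file for tracking
git commit -m "Add hello.py"              # record the change
git log --oneline                         # view the history
```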
Project
The aim of the class project is to create something that is tangible and useful. This should be done in groups of 2 to 3 students, and should be started by students at least 6 months before the submission deadline. The aim here is to find a real-world problem that is worthwhile to solve. Students are encouraged to visit local businesses and ask them about the problems that they are facing. For example, if a business is finding it hard to create invoices for filing GST claims, then students can do a project that takes the raw data (list of transactions), groups the transactions by category, accounts for the GST tax rates, and creates invoices in the appropriate format. Students can be extremely creative here. They can use a wide variety of Python libraries to create user-friendly applications such as games, software for their school, software for their disabled fellow students, and mobile applications. Of course, to do some of these projects, some additional learning is required; this should be encouraged. Students should know how to teach themselves.

If three people work on a project for 6 months, at least 500 lines of code are expected. The committee has also been made aware of the degree of plagiarism in such projects. Teachers should take a very strict look at this situation, and take strict disciplinary action against students who cheat on lab assignments or projects, or use pirated software to do the same. Everything that is proposed can be achieved using absolutely free and legitimate open source software.
CONTENTS

Unit 1: Data Handling - 2

Chapter–1 Review of Python Pandas ........................................................ 1–40
1.1 Introduction ............................................................................ 1
1.2 Pandas .................................................................................. 1
1.3 Pandas Series ........................................................................... 2
1.4 Mathematical Operations on Series ....................................................... 7
1.5 Vector Operations ....................................................................... 8
1.6 Comparison Operations on Series ......................................................... 9
1.7 DataFrame .............................................................................. 10
    1.7.1 Creating DataFrame ............................................................... 11
    1.7.2 Printing/Displaying DataFrame Data ............................................... 13
    1.7.3 Accessing and Slicing DataFrame .................................................. 16
1.8 Iterating Pandas DataFrame ............................................................. 24
1.9 Manipulating Pandas DataFrame Data ..................................................... 25
    1.9.1 Adding a Column to DataFrame ..................................................... 25
    1.9.2 Adding Rows into DataFrame ....................................................... 28
    1.9.3 Dropping Columns in DataFrame .................................................... 32
    1.9.4 Dropping Rows in DataFrame ....................................................... 36
Points to Remember ......................................................................... 38
Solved Exercises ........................................................................... 38
Review Questions ........................................................................... 40

Chapter–2 Advanced Operations on Pandas DataFrames ....................................... 41–62
2.1 Introduction ........................................................................... 41
2.2 Pivoting DataFrame ..................................................................... 41
2.3 Pivot Table using pivot() Method ....................................................... 41
2.4 Pivot Table using pivot_table() Method ................................................. 47
2.5 Sorting DataFrames ..................................................................... 52
Points to Remember ......................................................................... 56
Solved Exercises ........................................................................... 56
Review Questions ........................................................................... 60
Chapter–3 Aggregation/Descriptive Statistics in Pandas ................................... 63–94
3.1 Introduction ........................................................................... 63
3.2 Data Aggregation ....................................................................... 63
3.3 Quartiles, Quantiles and Percentiles with Pandas ....................................... 74
3.4 Descriptive Statistics ................................................................. 79
3.5 Histogram .............................................................................. 82
    3.5.1 Understanding Matplotlib ......................................................... 82
    3.5.2 Creating Histogram ............................................................... 83
    3.5.3 Single Histogram from a Pandas DataFrame ......................................... 84
    3.5.4 Modifying Histogram Bin Sizes .................................................... 85
    3.5.5 Multiple Pandas Histograms from a DataFrame ...................................... 86
    3.5.6 Modifying Histogram Axes ......................................................... 86
    3.5.7 Plotting Multiple Features in One Plot ........................................... 87
    3.5.8 Plotting DataFrame Columns using DataFrame plot() Method ......................... 88
Points to Remember ......................................................................... 90
Solved Exercises ........................................................................... 91
Review Questions ........................................................................... 94
Chapter–4 Function Applications in Pandas ............................................... 95–132
4.1 Introduction ........................................................................... 95
4.2 .pipe() Function ....................................................................... 95
4.3 .apply() Function ...................................................................... 97
    4.3.1 Using Lambda Function ........................................................... 100
4.4 Aggregation (groupby) ................................................................. 102
4.5 Data Transformation ................................................................... 114
4.6 .applymap() Function .................................................................. 118
4.7 Reindexing Pandas DataFrames .......................................................... 120
4.8 Altering Labels or Changing Column/Row Names .......................................... 124
Points to Remember ........................................................................ 127
Solved Exercises .......................................................................... 128
Review Questions .......................................................................... 130

Chapter–5 Introduction to NumPy (Numeric Python) ....................................... 133–152
5.1 Introduction .......................................................................... 133
5.2 Installing NumPy ...................................................................... 133
5.3 NumPy Array ........................................................................... 134
    5.3.1 Creating a NumPy Array .......................................................... 135
    5.3.2 Transpose of NumPy Array ........................................................ 145
5.4 Checking Data Type .................................................................... 146
5.5 Finding the Shape and Size of the Array ............................................... 148
Points to Remember ........................................................................ 150
Solved Exercises .......................................................................... 150
Review Questions .......................................................................... 152
Chapter–6 Indexing, Slicing and Arithmetic Operations in NumPy Array ................... 153–180
6.1 Introduction .......................................................................... 153
6.2 Indexing NumPy Array .................................................................. 153
6.3 Slicing NumPy Array ................................................................... 157
6.4 Joining Arrays ........................................................................ 161
6.5 Creating Sub Array .................................................................... 163
6.6 Arithmetic Operation on 2D Arrays ..................................................... 164
6.7 NumPy Array Functions ................................................................. 172
Points to Remember ........................................................................ 177
Solved Exercises .......................................................................... 177
Review Questions .......................................................................... 180

Chapter–7 Covariance, Correlation and Linear Regression ................................ 181–200
7.1 Introduction .......................................................................... 181
7.2 Covariance ............................................................................ 181
7.3 Correlation ........................................................................... 190
7.4 Linear Regression ..................................................................... 195
Points to Remember ........................................................................ 198
Solved Exercises .......................................................................... 198
Review Questions .......................................................................... 200
Chapter–8 Plotting with Pyplot ......................................................... 201–246
8.1 Introduction .......................................................................... 201
8.2 Introduction to Matplotlib ............................................................ 201
8.3 Matplotlib, Pylab and Pyplot .......................................................... 202
8.4 Creating and Showing Simple Line Plot ................................................. 202
8.5 Using Pyplot Methods .................................................................. 207
8.6 Formatting Graph Pyplot Attributes .................................................... 212
8.7 Changing Matplotlib Plot Figure Size and DPI .......................................... 219
8.8 Plotting DataFrames ................................................................... 220
8.9 Matplotlib Plot Types ................................................................. 223
    8.9.1 Plotting a Bar Plot ............................................................. 223
    8.9.2 Plotting a Pie Plot ............................................................. 229
    8.9.3 Histogram ....................................................................... 231
    8.9.4 Box Plots ....................................................................... 232
    8.9.5 Scatter Plots ................................................................... 235
8.10 Frequency Polygons ................................................................... 238
Points to Remember ........................................................................ 239
Solved Exercises .......................................................................... 239
Review Questions .......................................................................... 244
Unit 2: Basic Software Engineering

Chapter–9 Introduction to Software Engineering ......................................... 247–256
9.1 Introduction .......................................................................... 247
9.2 What is Software Engineering? ......................................................... 247
9.3 Characteristics of a Good Software .................................................... 248
9.4 Need of Software Engineering .......................................................... 249
9.5 Software Engineering Tasks ............................................................ 250
Points to Remember ........................................................................ 255
Solved Exercises .......................................................................... 256
Review Questions .......................................................................... 256
Chapter–10 Software Process Models ..................................................... 257–286
10.1 Introduction ......................................................................... 257
10.2 Software Processes ................................................................... 257
10.3 Software Process Models .............................................................. 257
10.4 Waterfall Model ...................................................................... 258
10.5 Evolutionary Model ................................................................... 260
     10.5.1 Prototyping Model ............................................................. 260
     10.5.2 Component Based Model ......................................................... 262
10.6 Delivery Models ...................................................................... 269
     10.6.1 Incremental Model ............................................................. 269
     10.6.2 Spiral Model .................................................................. 270
10.7 Software Process Activities .......................................................... 272
10.8 Agile Methodology .................................................................... 275
     10.8.1 Pair Programming Methodology .................................................. 277
     10.8.2 Agile Scrum Methodology ....................................................... 278
Points to Remember ........................................................................ 283
Solved Exercises .......................................................................... 283
Review Questions .......................................................................... 285

Chapter–11 Business Use-Case Diagrams .................................................. 287–300
11.1 Introduction ......................................................................... 287
11.2 Unified Modeling Language (UML) ...................................................... 288
11.3 Use-case Diagram ..................................................................... 289
     11.3.1 Relationships in Use-case Diagrams ............................................ 291
     11.3.2 Basic Principles to Draw Use-case Diagram ..................................... 294
11.4 Business Use-case Diagram ............................................................ 294
Points to Remember ........................................................................ 298
Solved Exercises .......................................................................... 298
Review Questions .......................................................................... 300

Chapter–12 Introduction to Version Control System ...................................... 301–312
12.1 Introduction ......................................................................... 301
12.2 Version Control System (or Software) ................................................. 301
12.3 Global Information Tracker (GIT) ..................................................... 303
     12.3.1 GIT Commands .................................................................. 307
     12.3.2 Main Components of GIT ........................................................ 307
     12.3.3 Creating New Repository ....................................................... 308
     12.3.4 Adding a New File for Git to Track ............................................ 309
     12.3.5 Committing Changes ............................................................ 310
Points to Remember ........................................................................ 312
Solved Exercises .......................................................................... 312
Review Questions .......................................................................... 312

Unit 3: Data Management - 2
Chapter–1 Django Based Web Application ................................................................ 313–328


13.1 Introduction ............................................................................................................... 313
N

13.2 Why Django? .............................................................................................................. 313


13.3 Django Architecture/Structure .................................................................................. 313
@

13.4 Installing Virtual Environment and Django .............................................................. 315


13.5 Creating/Setting Django Project ............................................................................... 318
13.6 Running Django Project ............................................................................................. 320
13.7 Creating Django App .................................................................................................. 321
13.8 Outputting CSV with Django ..................................................................................... 326
Points to Remember ............................................................................................................ 3 28
Solved Exercises ................................................................................................................... 328
Review Questions ................................................................................................................. 328
Chapter–14 Review of SQL Statements ....................................................................... 329–354

d
ite
14.1 Introduction ............................................................................................................... 329
14.2 SELECT Command ....................................................................................................... 329

m
14.2.1 Using SQL Clauses and Operators with SELECT Command ....................... 330
14.2.2 Defining a Column Alias .............................................................................. 337

Li
14.3 The UPDATE Command .............................................................................................. 338
Points to Remember ............................................................................................................ 3 40

e
Solved Exercises ................................................................................................................... 341

at
Review Questions ................................................................................................................. 348

iv
Chapter–15 Grouping Records using MySQL Database ............................................. 355–402

Pr
15.1 Introduction ............................................................................................................... 355
15.2 The Group Functions ................................................................................................. 355

a
15.3 The GROUP BY Clause ................................................................................................ 359

di
15.4 The HAVING Clause .................................................................................................... 362
15.5 Ordering the Database ............................................................................................... 363
In 3\WKR
Points to Remember ............................................................................................................ 3 65
Solved Exercises ................................................................................................................... 365
se

Review Questions ................................................................................................................. 393

Chapter–16 Interface Python with SQL Database ....................................................... 403–424


ou

16.1 Introduction ............................................................................................................... 403


H

16.2 Communicating Python with MySQL ....................................................................... 403


16.3 Installing Python PyMySQL ....................................................................................... 404
i

16.4 Creating Database Connection .................................................................................. 405


at

16.5 Create Tables in Python Database ............................................................................ 408


w

16.6 Database Operations ................................................................................................. 410


16.6.1 Inserting Rows/Records into Table ............................................................. 411
s

16.6.2 Listing or Querying Records ........................................................................ 415


ra

16.6.3 Updating Database Record .......................................................................... 419


16.6.4 Deleting Database Record ........................................................................... 422
Sa

Points to Remember ............................................................................................................ 4 23


Solved Exercises ................................................................................................................... 423
Review Questions ................................................................................................................. 423
ew

Unit 4: Society, Law and Ethics - 2


N

Chapter–17 Society, Law and Ethics ............................................................................ 425–454


@

17.1 Introduction ............................................................................................................. 425


17.2 Computer Ethics ...................................................................................................... 425
17.3 Code of Ethics .......................................................................................................... 425
17.4 Intellectual Property Rights .................................................................................... 426
17.5 Plagiarism ................................................................................................................ 427
17.6 Digital Rights Management (DRM) ......................................................................... 427
17.7 Licensing ................................................................................................................... 431
17.8 Open Source, Open Data and Privacy .................................................................... 433
17.9 Fraud and Scams ..................................................................................................... 435
17.10 Cybercrime ............................................................................................................... 438

d
17.11 Cyber Forensics ....................................................................................................... 439

ite
17.12 Identity Theft, Unique IDs and Biometrics ............................................................. 440
17.13 Information Technology Act, 2000 ......................................................................... 443

m
17.14 Society and Technology ........................................................................................... 444
17.15 Gender and Disability Issues while Teaching and Using Computers .................... 447

Li
Points to Remember ............................................................................................................ 4 50
Solved Exercises ................................................................................................................... 451

e
Review Questions ................................................................................................................. 453

at
Chapter–1 E-Waste Management ................................................................................ 455–462

iv
18.1 Introduction ............................................................................................................. 455

Pr
18.2 Electronic Waste (E-waste) ..................................................................................... 455
18.3 Effects of E-Waste .................................................................................................... 456

a
18.4 E-Waste Management ............................................................................................. 457
18.5 E-Waste Management in India ............................................................................... 461

di
Points to Remember ............................................................................................................ 4 61
In
Solved Exercises ................................................................................................................... 461
Review Questions ................................................................................................................. 462
se

Chapter–1 Media in Society and Issues with Internet ............................................... 463–474


ou

19.1 Introduction ............................................................................................................. 463


19.2 New Media in Society ............................................................................................. 463
H

19.3 The Best Players of New Media .............................................................................. 463


19.4 Effects of New Media on Society ............................................................................ 464
i

19.4.1 Online Campaigns ....................................................................................... 465


at

19.4.2 Role in Politics ............................................................................................. 466


w

19.4.3 Social Media on Society ............................................................................. 467


19.4.4 Social Media on Crowdsourcing ................................................................ 467
s

19.4.5 Social Media on Smart Mobs ..................................................................... 468


ra

19.5 Issues with the Internet .......................................................................................... 468


19.5.1 Internet as an Echo Chamber ..................................................................... 468
Sa

19.5.2 Net Neutrality ............................................................................................. 468


19.5.3 Internet Addiction ....................................................................................... 469
19.6 Case Studies ............................................................................................................. 470
ew

19.6.1 The Arab Spring ........................................................................................... 470


19.6.2 WikiLeaks .................................................................................................... 472
N

19.6.3 Bitcoin .......................................................................................................... 472


Points to Remember ............................................................................................................ 4 73
@

Solved Exercises ................................................................................................................... 474


Review Questions ................................................................................................................. 474

Pr ec — Banking Transaction System ........................................................................... 1–12


UNIT—1
Chapter–1
Review of Python Pandas
1.1 Introduction

Data processing is an important part of analyzing data, as data is not always available in the desired format. Various processing techniques, such as cleaning, restructuring and merging, are required before analyzing the data. NumPy, SciPy, Cython and Pandas are the tools available in Python which can be used for processing data. Further, Pandas is built on the NumPy package; the Pandas library therefore relies heavily on the NumPy array for the implementation of its data objects and shares many of its features. In this chapter, we will review the basic concepts of Python Pandas series and dataframes that we learnt in Class XI.

1.2 Pandas

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools using its powerful data structures. Pandas provides a rich set of functions to process various types of data. When doing data analysis, it is important to make sure you are using the correct data types; otherwise you may get unexpected results or errors. A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. Table 1.1 summarizes the data types of Pandas, Python and NumPy.


Table 1.1 Summarizing Pandas, Python and NumPy data types.

Pandas dtype     Python type   NumPy type                                 Usage
object           str           string_, unicode_                          Text
int64            int           int_, int8, int16, int32, int64,           Integer numbers
                               uint8, uint16, uint32, uint64
float64          float         float_, float16, float32, float64          Floating point numbers
bool             bool          bool_                                      True/False values
datetime64       NA            datetime64[ns]                             Date and time values
timedelta[ns]    NA            NA                                         Differences between two datetimes
category         NA            NA                                         Finite list of text values

Most of the time, the default int64 and float64 types of Pandas will work. If Pandas is already installed on your system, you can start using it right away.
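As a quick illustration of the dtypes in Table 1.1 (a minimal sketch; the marks values here are made up, not taken from the text), we can inspect the dtype of a series and convert it with the astype() method:

```python
import pandas as pd

# Hypothetical marks data, used only to illustrate dtypes
S = pd.Series([450, 467, 480])
print(S.dtype)            # int64 (the default Pandas integer type)

F = S.astype('float64')   # convert the series to the float64 dtype
print(F.dtype)            # float64
```
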
2 Saraswati Informatics Practices XII

Pandas provides two very useful data structures to process data, i.e., Series and DataFrame.

1.3 Pandas Series


The Series is the primary building block of Pandas. A Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The Series data are mutable (can be changed), but the size of a Series is immutable. It can be seen as a data structure with two arrays: one functioning as the index, i.e., the labels, and the other containing the actual data. The row labels in a Series are called the index. Let us see the following data, each of which can be considered as a series:
Num = [23, 54, 34, 44, 35, 66, 27, 88, 69, 54] # a list with homogeneous data
Emp = ['A V Raman', 35, 'Finance', 45670.00] # a list with heterogeneous data
Marks = {"ELENA JOSE" : 450, "PARAS GUPTA" : 467, "JOEFFIN JOSEPH" : 480} # a dictionary
Num1 = (23, 54, 34, 44, 35, 66, 27, 88, 69, 54) # a tuple with homogeneous data
Std = ('AKYHA KUMAR', 78.0, 79.0, 89.0, 88.0, 91.0) # a tuple with heterogeneous data
Any list, tuple or dictionary can be converted into a series using the Series() method.
Creating an Empty Series

The most basic series which can be created is an empty Series. For example:
# import the Pandas library, aliasing it as pd
>>> import pandas as pd # pd is an alternate name for the module pandas
>>> S = pd.Series()
>>> print (S)
Series([], dtype: float64)
>>> type(S)
<class 'pandas.core.series.Series'>
Here,
• S is the series variable.
• The Series() method creates an empty series (by default) with a default data type.
• The type() function shows the type of the series object.
Creating a Series using List


We know that a list is a one-dimensional data type capable of holding any type of data, and it has indices. A list can be converted into a series using the Series() method. The basic syntax to create a series is:
S = pd.Series([data], index=[index])
Here,
• The data can be a Python list, dictionary, an ndarray or a scalar value. If data is an ndarray, index must be the same length as data.
• The index is the set of labels displayed with the respective data. If we do not pass any index, the default index ranging from 0 to len(data)–1 is assigned, i.e., 0, 1, 2, ... till the length of the data minus 1. If you want, you can specify your own index for the data.
For example;
>>> import pandas as pd # pd as alternate name of the module pandas

>>> Months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
>>> S = pd.Series(Months) # S is a pandas series
Or

>>> S = pd.Series(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December'])

Printing a Series

To print a series, you can use the print() function with the series, or simply type the series name. For example;
>>> print (S) # or simply type S at the prompt, i.e., >>> S

0       January
1      February
2         March
3         April
4           May
5          June
6          July
7        August
8     September
9       October
10     November
11     December
dtype: object
i

Here, the list Months is converted into a Pandas series using the Series() method. The result is shown in two columns: the right column contains our data, whereas the left column contains the index. Since we did not define an index in our example, Pandas created a default index and automatically assigned it to the series, starting at 0 and going to 11, which is the length of the data minus 1.
Accessing Series Index and Values

We can directly access the index and the values of our Series S:
>>> print(S.index)
RangeIndex(start=0, stop=12, step=1)
>>> print(S.values)
['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August' 'September' 'October' 'November'
 'December']
We have created a list called Months and a Series called S. In both cases the indexes are the same. To clarify this, let us access the value at a particular index position, i.e., the 3rd place:
>>> print("Element at 3rd place in list:", Months[2]) # 2 is the index position in the list
Element at 3rd place in list: March
>>> print("Element at 3rd place in Series:", S[2])
Element at 3rd place in Series: March

Let us see another example of a series with Emp:
>>> Emp = pd.Series(['A V Raman', 35, 'Finance', 45670.00])
>>> print (Emp)
0    A V Raman
1           35
2      Finance
3      45670.0
dtype: object

Accessing Rows using head() and tail() Functions

The Series.head() function, by default, shows you the top 5 rows of data in the series. The opposite is Series.tail(), which gives you the last 5 rows. In both functions, if you pass a number as a parameter, Pandas will print out that number of rows.

For example, let us print the first 5 rows of data from the series S of months:
>>> S.head()
0     January
1    February
2       March
3       April
4         May
dtype: object
To print the first 3 rows of the series S:
>>> S.head(3)
0     January
1    February
2       March
dtype: object
To print the first row of the series S:
>>> S.head(1)
0    January
dtype: object
To print the last 5 rows of the series S:
>>> S.tail()
7        August
8     September
9       October
10     November
11     December
dtype: object
To print the last 3 rows of the series S:
>>> S.tail(3)
9      October
10    November
11    December
dtype: object

Creating a Series using two different Lists


You can create a series using two different lists. Out of the two lists, one will be the index and the other will be the values. For example;
>>> months = ['Jan','Apr','Mar','June']
>>> days = [31, 30, 31, 30]
>>> S2 = pd.Series(days, index=months)
>>> S2
Jan     31
Apr     30
Mar     31
June    30
dtype: int64
Declaring an Index

We can make a Series with an explicit index. For example:
>>> M = pd.Series([456, 478, 467, 477, 405], index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
>>> print(M)
Amit       456
Sneha      478
Manya      467
Pari       477
Lavanya    405
dtype: int64
Observe that the index we provided is on the left with the values on the right, and that in a Pandas series the indexes can be of string type.
To print the indexes:
>>> M.index
Index(['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya'], dtype='object')
Indexing, Slicing and Accessing Data from Series

Indexing and slicing concepts in a Pandas series are similar to those of lists and tuples. Using a series, you can access the value at any position; we index by the corresponding number to retrieve values. For example, let us recall the series:
ew

Example 1 - Printing first element


>>> S = pd.Series(months) # Or S = pd.Series(['Jan','Apr','Mar','June'])
N

>>> S[0] # Prints first element of the list


@

'Jan'
Example 2 - Printing third element
>>> S[2] # Prints third element of the list
'Mar'

Example 3 - Printing first three elements
We can slice by index number to retrieve values.
>>> S[:3] # Prints first three elements of the series
0    Jan
1    Apr
2    Mar
dtype: object
Example 4 - Printing elements starting from 2nd till 3rd
>>> S[1:3] # Prints elements starting from 2nd till 3rd
1    Apr
2    Mar
dtype: object
Example 5 - Printing last two elements
>>> S[-2:] # Prints last two elements of the series
2     Mar
3    June
dtype: object
Example 6 - Printing the value corresponding to a label index
Additionally, we can use an index label to return the value that it corresponds with:
>>> M = pd.Series([456, 478, 467, 477, 405], index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
>>> M['Manya']
467
Example 7 - Printing multiple values corresponding to label indexes
We can retrieve multiple values by passing a list of index labels.
>>> print (M[['Amit','Manya','Lavanya']])
Amit       456
Manya      467
Lavanya    405
dtype: int64
Example 8 - Printing slices with the values of the label index
We can also slice with the values of the index to return the corresponding values:
>>> M['Manya' : 'Lavanya']
Manya      467
Pari       477
Lavanya    405
dtype: int64
Note that, unlike positional slicing, label-based slicing includes the end label ('Lavanya') in the result.
Example 9 - Printing the data using Boolean indexing
You can use Boolean indexing for selection. For example, to select marks of more than 460:
>>> M[M>460]
Sneha    478
Manya    467
Pari     477
dtype: int64
Here, M>460 returns a Series of True/False values, which we then pass to our Series M, returning the corresponding True items.
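Boolean conditions can also be combined. The following sketch goes a small step beyond the text: it assumes the element-wise operators & (and) and | (or), each comparison wrapped in parentheses, applied to the same M series:

```python
import pandas as pd

M = pd.Series([456, 478, 467, 477, 405],
              index=['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya'])

# & combines two boolean masks element-wise;
# each comparison must be wrapped in parentheses
mask = (M > 460) & (M < 475)
print(M[mask])   # only Manya (467) satisfies both conditions
```
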

Initializing Series from Scalar

You can also use a scalar to initialize a Series. In this case, all elements of the Series are initialized to the same value. When a scalar is used for initialization, an index array can be specified; the size of the Series is then the same as the size of the index array. Let us create a series from the scalar value 7.
>>> import pandas as pd
>>> S = pd.Series(7, index=[0, 1, 2, 3, 4])
>>> S
0    7
1    7
2    7
3    7
4    7
dtype: int64
You can use the range() function to specify the index (and thus the size of the Series). The above series can be written as:
>>> S = pd.Series(7, index=range(5))
which will produce the same output as above.


There are some additional ways of initializing a Series, using the range() function. These are:
• Create a series of odd numbers. The command is:
>>> print (pd.Series(range(1, 10, 2)))
0    1
1    3
2    5
3    7
4    9
dtype: int64
• Create a series starting from 1 with an interval of 3, with alphabetic index labels. The command is:
>>> print (pd.Series(range(1, 15, 3), index=[x for x in 'ABCDE']))
A     1
B     4
C     7
D    10
E    13
dtype: int64

1.4 Mathematical Operations on Series

Mathematical operations can be done using scalars and functions. You can perform mathematical operations using simple operators like +, –, *, /, etc.

For example;
>>> Marks = [456, 478, 467, 477, 405]
>>> Names = ['Amit', 'Sneha', 'Manya', 'Pari','Lavanya']
>>> M1 = pd.Series(Marks, index=Names)
>>> print (M1)
Amit       456
Sneha      478
Manya      467
Pari       477
Lavanya    405
dtype: int32
The following example increases the marks of each student in the series M1 by 5.
# File name: ...\IPSource_XI\PyChap12\Mincrease.py
import pandas as pd
import numpy as np
Marks = np.array([456, 478, 467, 477, 405])
M1 = pd.Series(Marks, index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
for label, value in M1.items():
    M1.at[label] = value + 5 # increase each value
print (M1)
Output: Amit       461
        Sneha      483
        Manya      472
        Pari       482
        Lavanya    410
        dtype: int32
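The loop above is shown for practice; because Pandas applies arithmetic element-wise (the subject of the next section), the same increase can be written without a loop. A minimal sketch:

```python
import pandas as pd
import numpy as np

Marks = np.array([456, 478, 467, 477, 405])
M1 = pd.Series(Marks, index=['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya'])

M2 = M1 + 5        # adds 5 to every element at once, no loop needed
print(M2['Amit'])  # 461
```
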
1.5 Vector Operations

Series support element-wise vector operations. For example, when a number is added to a series, the number is added to each element of the series. For example;
>>> List_Var = [1, 2, 3, 4, 5] # a list with 5 elements
>>> Series_Var = pd.Series(List_Var)
>>> print (Series_Var + 5) # 5 is added to each series value
0     6
1     7
2     8
3     9
4    10
dtype: int64
>>> print (Series_Var * 2) # Each series value multiplied by 2
0     2
1     4
2     6
3     8
4    10
dtype: int64

Similarly, you can do other vector operations like:
>>> Series_Var ** 3 # Calculates the cube (power 3) of each series value
0      1
1      8
2     27
3     64
4    125
dtype: int64
>>> Series_Var + Series_Var # Each series value is added to itself
0     2
1     4
2     6
3     8
4    10
dtype: int64
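One detail worth noting as an addition here (the series A and B below are made up for illustration): when a vector operation involves two series, Pandas matches the values by index label rather than by position, and labels present in only one of the two series produce NaN:

```python
import pandas as pd

A = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
B = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are aligned by index label, not by position;
# 'a' and 'd' exist in only one series, so their sums become NaN
print(A + B)
```
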

1.6 Comparison Operations on Series

We can use all the Python comparison operators on Pandas series. When we apply a comparison operator to a series, the operation compares every value in the series. Assume that we have a series called S with the following data:
>>> GradePoint = [9.5, 6.7, 8.8, 9.1, 8.7, 8.8, 9, 8.5, 9.4, 8.2]
>>> S = pd.Series(GradePoint, index=['Amit', 'Sneha', 'Manya', 'Pari','Lavanya', 'Priyanka', 'Aanya',
'Ronald', 'Dipika', 'Akriti'])
ou

>>> S
Amit        9.5
Sneha       6.7
Manya       8.8
Pari        9.1
Lavanya     8.7
Priyanka    8.8
Aanya       9.0
Ronald      8.5
Dipika      9.4
Akriti      8.2
dtype: float64

>>> S < 9.0 # is GradePoint less than 9.0
Amit        False
Sneha        True
Manya        True
Pari        False
Lavanya      True
Priyanka     True
Aanya       False
Ronald       True
Dipika      False
Akriti       True
dtype: bool

>>> S <= 9.0 # is GradePoint less than or equal to 9.0
Amit        False
Sneha        True
Manya        True
Pari        False
Lavanya      True
Priyanka     True
Aanya        True
Ronald       True
Dipika      False
Akriti       True
dtype: bool
>>> S != 9.0 # is GradePoint not equal to 9.0
Amit         True
Sneha        True
Manya        True
Pari         True
Lavanya      True
Priyanka     True
Aanya       False
Ronald       True
Dipika       True
Akriti       True
dtype: bool
Similarly, you can use the other comparison operators: ==, > and >=.
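Since a comparison returns a boolean series, it can be passed straight back to S as a filter, combining this section with the Boolean indexing seen earlier. A small sketch with the same GradePoint data:

```python
import pandas as pd

GradePoint = [9.5, 6.7, 8.8, 9.1, 8.7, 8.8, 9, 8.5, 9.4, 8.2]
S = pd.Series(GradePoint,
              index=['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya',
                     'Priyanka', 'Aanya', 'Ronald', 'Dipika', 'Akriti'])

# The boolean series S >= 9.0 keeps only the rows where it is True
print(S[S >= 9.0])   # Amit, Pari, Aanya and Dipika
```
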
1.7 DataFrame

The main Pandas object is called a DataFrame. The DataFrame is the most widely used data structure of Pandas. A DataFrame is a two-dimensional array with heterogeneous data, usually represented in tabular format. DataFrames handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. DataFrame data and size are both mutable (can be changed). A DataFrame has two different indexes, i.e., a column index and a row index. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables. Note that Series are used to work with one-dimensional arrays, whereas DataFrames are used with two-dimensional arrays.

Let us look at Table 1.2, which shows the population of 6 countries with their birth rates in row and column format. Each column represents an attribute and each row represents a country.

Table 1.2 A DataFrame as a two-dimensional array with heterogeneous data.

Country          Population       BirthRate    UpdateDate
China            1,379,750,000    12.00        2016-08-11
India            1,330,780,000    21.76        2016-08-11
United States    324,882,000      13.21        2016-08-11
Indonesia        260,581,000      18.84        2016-01-07
Brazil           206,918,000      18.43        2016-08-11
Pakistan         194,754,000      27.62        2016-08-11
If you consider the data, the following represents the data types of the columns.

Column        Data Type
Country       String
Population    Integer
BirthRate     Float
UpdateDate    Date

1.7.1 Creating DataFrame


The most common way to create a DataFrame is by using a dictionary of equal-length lists, as shown below. Further, all spreadsheets and text files are read as DataFrames, making it an important data structure of Pandas. The Pandas DataFrame() constructor is used to create a DataFrame. The general format is:
pandas.DataFrame(data, index, columns, dtype, copy)


Here,
• The DataFrame data accepts many different kinds of input:
  − Dict of 1D ndarrays, lists, dicts, or Series
  − 2-D numpy.ndarray
  − Structured or record ndarray
  − A Series
  − CSV (Comma Separated Value) file
  − Another DataFrame
• You can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.
• The dtype is the data type of each column. You can use the DataFrameName.dtypes command to extract the types of the variables stored in the data frame.
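As a short sketch of the dtypes command mentioned in the last point (the two-column month/day data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Day': [31, 28, 31]})

# dtypes reports the data type of every column of the DataFrame
print(df.dtypes)   # Month -> object, Day -> int64
```
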

Creating Empty DataFrame


To create an empty DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame() # Here we define df as the name of the DataFrame
Or
>>> import pandas
>>> df = pandas.DataFrame()
Most of the time, people name the DataFrame df. You can give it any valid name and create any number of DataFrame objects. In this text we use different names and examples according to the requirements.
To know the type of the DataFrame and its values:

>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> print(df) # or df
which displays the following:
Empty DataFrame
Columns: []
Index: []
Creating a DataFrame using List

While a series only supports a single dimension, data frames are 2-dimensional objects. The pandas.DataFrame(..) function has provisions for creating data frames from lists. Let us say we have two lists, one of string type (Months) and the other of int type (Days). We want to make a DataFrame with these lists as columns.
>>> Months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
>>> Days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor. From the above two lists, we can get a DataFrame in three different ways.
List using Dictionary

To create a DataFrame, you can pass a dictionary of lists to the DataFrame constructor:
• The key of the dictionary will be the column name.
• The associated list will be the values within that column.
Let us make a dictionary from the two lists, with names as keys and the lists as values.
>>> d = {'Month':Months,'Day':Days} # Lists are combined into a dictionary as key-value pairs
From the above dictionary d, Month and Day are the keys, and Months and Days are the values, respectively.
>>> print (d)
{'Month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October',
'November', 'December'], 'Day': [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]}

Here d is our dictionary with names “Day” and “Month” as keys.


Let us create a Pandas DataFrame from using pd.DataFrame function with our dictionary as input.
>>> import pandas as pd

>>> df = pd.DataFrame(d) # Creates a data frame using two lists
Or

>>> import pandas
>>> df = pandas.DataFrame(d) # Creates a data frame using two lists

1.7.2 Printing/Displaying DataFrame Data

To print or display the data, simply type the DataFrame name or use print(DataFrame_name). For example, to display the previous DataFrame (df):
>>> df # or print(df); the following output is produced:

    Day      Month
0    31    January
1    28   February
2    31      March
3    30      April
4    31        May
5    30       June
6    31       July
7    31     August
8    30  September
9    31    October
10   30   November
11   31   December

If you carefully observe the above data format, it looks just like spreadsheet data. The result
of the above DataFrame creation is a simple 12-row, 2-column table with automatically generated numeric
row indices and two labelled columns.


Also, see that an index (0, 1, 2, ...) has been automatically assigned to the DataFrame. This is integer-based
indexing. The indices are not particularly meaningful in this data set but are useful to Pandas. To confirm
the type of the object:
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
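To inspect the data type of each individual column (as opposed to the type of the DataFrame object itself), df.dtypes can be used. A minimal sketch, using a cut-down version of the Month/Day data above (the values are illustrative):

```python
import pandas as pd

# a small DataFrame like the Month/Day example above
df = pd.DataFrame({'Month': ['January', 'February'],
                   'Day': [31, 28]})

print(df.dtypes['Day'])    # int64
print(df.dtypes['Month'])  # object
```

Here df.dtypes returns a Series mapping each column name to its dtype, which is often what is wanted rather than type(df).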
Method 1: DataFrame from lists using zip

We can use the zip function to merge these two lists first. In Python 3, the zip function creates a zip object,
an iterator that produces one item at a time. To get a list of tuples, we can pass it to list().
14 Saraswati Informatics Practices XII

For this example, we can create a list of tuples like:


>>> DTuples = list(zip(Months, Days))
>>> DTuples
[('January', 31), ('February', 28), ('March', 31), ('April', 30), ('May', 31), ('June', 30), ('July', 31), ('August', 31), ('September', 30), ('October', 31), ('November', 30), ('December', 31)]

We can simply use pd.DataFrame on this list of tuples to get a Pandas DataFrame. We can also specify column names with the list of tuples.
>>> pd.DataFrame(DTuples, columns=['Month','Day'])

       Month  Day
0    January   31
1   February   28
2      March   31
3      April   30
4        May   31
5       June   30
6       July   31
7     August   31
8  September   30
9    October   31
10  November   30
11  December   31

Method 2: Creating DataFrame from scratch



The third way to make a Pandas DataFrame from multiple lists is to start from scratch and add columns manually. We will first create an empty Pandas DataFrame and then add columns to it. Let us create an empty DataFrame and add the first column (Months) to it:

# create empty data frame in pandas



>>> df = pd.DataFrame()

>>> df['Month'] = Months


>>> print (df['Month']) # prints Month column values
0       January
1      February
2         March
3         April
4           May
5          June
6          July
7        August
8     September
9       October
10     November
11     December
Name: Month, dtype: object

Add the second column to the empty DataFrame.
>>> df['Day'] = Days

>>> df
       Month  Day
0    January   31
1   February   28
2      March   31
3      April   30
4        May   31
5       June   30
6       July   31
7     August   31
8  September   30
9    October   31
10  November   30
11  December   31

Creating a DataFrame using Dictionary



As you know that a DataFrame has a row and column index; it's like a dict of Series with a common index.

Let us create a DataFrame called PData for Table 1.2.



>>> PData = { 'Country' : ['China', 'India', 'United States', 'Indonesia', 'Brazil', 'Pakistan'],
'Population' : [1379750000, 1330780000, 324882000, 260581000, 206918000, 194754000],
'BirthRate' : [12.00, 21.76, 13.21, 18.84, 18.43, 27.62],
'UpdateDate' : ['2016-08-11', '2016-08-11', '2016-08-11', '2016-01-07', '2016-08-11', '2016-08-11']}
>>> import pandas as pd
>>> df = pd.DataFrame(PData, columns=['Country', 'Population', 'BirthRate', 'UpdateDate'])
>>> df
         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11

The DataFrame result is displayed with an integer-based index. Country, Population, BirthRate and
UpdateDate are the attributes (columns) of the DataFrame df.

1.7.3 Accessing and Slicing DataFrame

Accessing and slicing data from a DataFrame depends on the indexes of the DataFrame.

Accessing Rows using head() and tail() function

The DataFrame.head() function in Pandas, by default, shows you the top 5 rows of data in the DataFrame.

The opposite is DataFrame.tail(), which gives you the last 5 rows. In both functions, if you pass in a
number as a parameter, Pandas will print out the specified number of rows.

For example, let us print the first 5 rows of DataFrame df:

>>> df.head()

         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11

To print the first 3 rows of DataFrame df:


>>> df.head(3)

         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11



To print the first row of DataFrame df:



>>> df.head(1)

  Country  Population  BirthRate  UpdateDate
0   China  1379750000      12.00  2016-08-11

To print the last 5 rows of DataFrame df:



>>> df.tail()
         Country  Population  BirthRate  UpdateDate
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11

To print the last 3 rows of DataFrame df:


>>> df.tail(3)
Which will display the following:

     Country  Population  BirthRate  UpdateDate
3  Indonesia   260581000      18.84  2016-01-07
4     Brazil   206918000      18.43  2016-08-11
5   Pakistan   194754000      27.62  2016-08-11

To print the first 3 rows of Population column of DataFrame df:

>>> df.Population.head(3)

0    1379750000
1    1330780000
2     324882000
Name: Population, dtype: int64

Note. If you type df.head(0) or df.tail(0), then the following result will be displayed:
Empty DataFrame
Columns: [Country, Population, BirthRate, UpdateDate]
Index: []
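The head()/tail() behaviour can be verified on a small throwaway DataFrame; the column name x below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})  # ten rows of sample data

top = df.head(3)     # first 3 rows
bottom = df.tail(2)  # last 2 rows
empty = df.head(0)   # no rows, but the columns are preserved

print(len(top), len(bottom), len(empty))  # 3 2 0
```

Note that even the empty result keeps the column labels, as in the book's df.head(0) output above.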

Selecting DataFrame Columns


ou

There are three primary methods of selecting columns from DataFrames in Pandas:
• using dot notation, e.g. data.column_name,
• using square brackets and the name of the column as a string, e.g. data['column_name'],
• or using numeric indexing and the iloc selector, e.g. data.iloc[:, <column_number>]. The "i" stands for "integer" and indicates that this selection expects a numerical position specification for both the row and the column area.



For example, let us access the population data (PData) using square brackets and dot notation. To access only the Country column, the command is:



>>> df['Country'] # or df.Country


0            China
1            India
2    United States
3        Indonesia
4           Brazil
5         Pakistan
Name: Country, dtype: object
Notice that the Country column data is displayed as a Series. When we select a single column from a
DataFrame, it always returns a Series, and selecting multiple columns from the DataFrame will return a
DataFrame.
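This Series-versus-DataFrame distinction can be checked directly; a minimal sketch (the data values are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India'],
                   'Population': [1379750000, 1330780000]})

single = df['Country']    # one column name -> Series
multi = df[['Country']]   # list of column names -> DataFrame

print(type(single).__name__)  # Series
print(type(multi).__name__)   # DataFrame
```

A list with one element still counts as "multiple columns", so df[['Country']] keeps the 2-dimensional DataFrame shape.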

In the previous DataFrame (df), the population data has the default integer-based index, so you can use
the .iloc method to access rows or columns.
For example, to access the Country and Population columns with all rows from the DataFrame df, the
command is:

>>> df.iloc[:, [0, 1]]

         Country  Population
0          China  1379750000
1          India  1330780000
2  United States   324882000
3      Indonesia   260581000
4         Brazil   206918000
5       Pakistan   194754000

The above result displays a DataFrame as we selected multiple columns. Here, the colon (:) represents all rows and [0, 1] represents the column numbers, i.e., Country and Population, respectively. For example, if you want to display a range of columns, i.e., the first three columns of the DataFrame df, the command is:
>>> df.iloc[:, 0:3]
         Country  Population  BirthRate
0          China  1379750000      12.00
1          India  1330780000      21.76
2  United States   324882000      13.21
3      Indonesia   260581000      18.84
4         Brazil   206918000      18.43
5       Pakistan   194754000      27.62



Here, from the above output, the colon (:) represents all rows and 0:3 represents the column numbers 0, 1 and 2, i.e., Country, Population and BirthRate, respectively. In a range of columns (for example, 0:3), iloc excludes the end position 3 (compared to loc, where the end label is included).


The same columns can be accessed by using .loc, because .loc is label-based, which means that you have to specify rows and columns based on their row and column labels. The previous command can be written as:
>>> df.loc[:, 'Country':'BirthRate']

which will produce the same result as above command: df.iloc[:, 0:3]
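The different end-point rules of loc and iloc can be demonstrated on a tiny labelled DataFrame (the column name a and labels w, x, y, z are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40]}, index=['w', 'x', 'y', 'z'])

by_pos = df.iloc[0:2]       # positions 0 and 1: the end is excluded
by_label = df.loc['w':'y']  # labels 'w' to 'y': the end is included

print(len(by_pos), len(by_label))  # 2 3
```

The same slice bounds thus return a different number of rows depending on which indexer is used.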

Label-based / Index-based indexing using .loc


Pandas can support this use case using three types of location based indexing: .loc[], .iloc[], and .ix[].

• loc. It is label based indexing and gets rows (or columns) with particular labels from the index.
• iloc. It is position based indexing and gets rows (or columns) at particular positions in the index
(so it only takes integers) which you learnt in previous section.
• ix usually tries to behave like .loc but falls back to behaving like .iloc if a label is not present in the
index.

These three methods belong to index selection methods. The index is the identifier used for each row of
the data set. A key thing to take into account is that indexing can use specific labels. These labels can
either be integers or any other value specified by the user (e.g., dates, names).
Before slicing data using any of the above three methods, let us see the following two sets of data:

Set-1: Integer Based data:

         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07
4         Brazil   206918000      18.43  2016-08-11
5       Pakistan   194754000      27.62  2016-08-11

Pr
Set-2: Label Index data:
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
H

Here, in the above two sets, the data is exactly the same but the indexes are changed. In Set-1, it is integer-based, i.e., 0, 1, 2, 3, .... In Set-2, a set of strings identifies the rows, i.e., the Country column.
This distinction is important to take into account when using selection methods.

Let us access the first three elements of the data set using .loc[] and .iloc[]:

>>> df.loc[0:3]

         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11
3      Indonesia   260581000      18.84  2016-01-07

>>> df.iloc[0:3]

         Country  Population  BirthRate  UpdateDate
0          China  1379750000      12.00  2016-08-11
1          India  1330780000      21.76  2016-08-11
2  United States   324882000      13.21  2016-08-11

Here, both .loc[] and .iloc[] work, because the integer index can also be taken as a label. We get a
different result depending on the method we are using. In the case of .loc[], the selection includes the last
term (i.e., last row). With .iloc[], normal Python index selection rules apply. Thus, the last term is excluded.
So, it is obvious that the Pandas .loc indexer can be used with DataFrames for two different use cases:

• Selecting rows by label/index

• Selecting rows with a boolean/conditional lookup

The loc indexer is used with the same syntax as iloc. That is:
dataFrame.loc[<row selection>, <column selection>]

Adding an Index to an existing DataFrame

In Pandas, selections using the loc method are based on the index of the data frame (if any). The index
is set with a DataFrame function called set_index(), which takes a column name (for a regular Index) or a
list of column names (for a MultiIndex). The syntax is:

dataFrame.set_index(['Column_name'])
Using df.set_index(), the .loc method directly selects rows based on index values. For example, set the
index of the population data frame (PData) to the country name “Country”. To create a new, indexed
DataFrame df:
In
>>> df = df.set_index(['Country'])
>>> df

               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11

Note that if you set a column as index on a DataFrame, you cannot access the indexed column (here
Country) using the data['column_name'] or data.column_name commands. These will produce a KeyError or an AttributeError, respectively.
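A minimal sketch of set_index() followed by label-based selection, using a cut-down version of the population data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India'],
                   'Population': [1379750000, 1330780000]})

indexed = df.set_index('Country')  # returns a new, re-indexed DataFrame
row = indexed.loc['India']         # select a row by its index label

print(row['Population'])  # 1330780000
```

Note that set_index() returns a new DataFrame rather than modifying df, which is why the book assigns the result back with df = df.set_index(...).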

Selecting DataFrame Rows



Rows in a DataFrame are selected, typically, using the iloc/loc selection methods, or using logical selectors
(selecting based on the value of another column or variable). Now, with the index set, we can directly select
rows for different “Country” values using .loc[<label>] – either singly, or in multiples.

Extracting a row of a Pandas DataFrame



For example, to display the details of row India:


>>> df.loc['India']
Population 1330780000
BirthRate 21.76
UpdateDate 2016-08-11

Extracting specific rows of a Pandas DataFrame


To display population details of India and Brazil:
>>> df.loc[['India', 'Brazil']]

         Population  BirthRate  UpdateDate
Country
India    1330780000      21.76  2016-08-11
Brazil    206918000      18.43  2016-08-11
Note that from the above two examples, the first returns a Series, and the second returns a DataFrame. You can obtain a single-row DataFrame by passing a single-element list to the .loc operation.
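The same point can be sketched with a two-row frame: a bare label gives a Series, while a single-element list keeps the DataFrame shape (the data values below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Population': [1330780000, 206918000]},
                  index=['India', 'Brazil'])

as_series = df.loc['India']   # plain label -> Series
as_frame = df.loc[['India']]  # one-element list -> one-row DataFrame

print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
```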

In a DataFrame with an index column, you can access a range of rows with the .iloc method. For example, to access the first three rows:

>>> df.iloc[0:3]

               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11

But if you apply df.loc[0:3], then it will produce a TypeError, which means you cannot slice by integer positions on rows that are indexed by labels.

Extracting specific rows and specific columns of a Pandas DataFrame



You can select columns with .loc using the names of the columns. For example, to display the Population
and UpdateDate columns for India and Brazil, the command is:

>>> df.loc[['India', 'Brazil'], ['Population', 'UpdateDate']]


         Population  UpdateDate
Country
India    1330780000  2016-08-11
Brazil    206918000  2016-08-11



Position based indexing


Sometimes, you don’t have row or column labels. In such cases, you will have to rely on position-based indexing, which is implemented with .iloc instead of .loc:


>>> df.iloc[0:3,0:2]

               Population  BirthRate
Country
China          1379750000      12.00
India          1330780000      21.76
United States   324882000      13.21

Note that when we used label-based indexing both the start and the end labels were included in the
subset. With position based slicing, only the start index is included. So, in this case China had an index of 0,
India 1, and United States 2. Same goes for the columns.
And one more thing you should know about indexing: when you have labels for either the rows or the columns, and you want to slice a portion of the DataFrame, you may not know whether to use loc or iloc. In this case, you would want to use ix.

Using Boolean/Logical indexing using .loc

Like Python, you can use Boolean or logical conditions with Pandas DataFrames. With Boolean indexing or logical selection, you pass an array or Series of True/False values to the .loc indexer to select the rows where your Series has True values.

For example, the statement df['BirthRate'] > 20 produces a Pandas Series with a True/False value for every row in the 'PData' DataFrame, with "True" values for the rows where the BirthRate is greater than 20. These Boolean arrays can be passed directly to the .loc indexer. The command is:

>>> df.loc[df['BirthRate'] > 20]
          Population  BirthRate  UpdateDate
Country
India     1330780000      21.76  2016-08-11
Pakistan   194754000      27.62  2016-08-11

Similarly, to select rows with the BirthRate column between 18 and 30, and return just the 'Population' and 'BirthRate' columns, the command is:


>>> df.loc[(df['BirthRate'] > 18) & (df['BirthRate'] < 30), ['Population', 'BirthRate']]

           Population  BirthRate
Country
India      1330780000      21.76
Indonesia   260581000      18.84
Brazil      206918000      18.43
Pakistan    194754000      27.62
Similarly, you can access the data using Python functions with respective columns.
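A self-contained sketch of Boolean selection, using a trimmed-down copy of the birth-rate data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India', 'Pakistan'],
                   'BirthRate': [12.00, 21.76, 27.62]}).set_index('Country')

mask = df['BirthRate'] > 20  # a True/False Series, one value per row
high = df.loc[mask]          # keeps only the rows where mask is True

print(list(high.index))  # ['India', 'Pakistan']
```

Building the mask as a separate variable makes it easy to inspect which rows a condition actually matches before applying it.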

Selecting DataFrame Data using ix



As seen before, when you have an integer-based index, confusion may arise between location-based and label-based methods. .ix[] resolves this by falling back to label-based access (i.e., like .loc[]), which might not be what you are looking for. The ix[] indexer is a hybrid of .loc and .iloc. The syntax is:

dataFrame.ix[<row selection>, <column selection>]


For row selection, .ix is inclusive from start to end; for column selection, .ix is exclusive of the end but inclusive of the start. For example:
>>> df.ix[0:3, 1:3] Or df.ix[0:3,"BirthRate":"UpdateDate"]

Which will display the following:


               BirthRate  UpdateDate
Country
China              12.00  2016-08-11
India              21.76  2016-08-11
United States      13.21  2016-08-11

This is called position-based indexing. From the above output, the row slice is inclusive of the start and begins from China; for the columns, the slice 1:3 selects positions 1 and 2 (BirthRate and UpdateDate), excluding the end position 3.

The ix[] indexer is a hybrid of .loc and .iloc. Generally, ix is label-based and acts just like the .loc indexer. However, .ix also supports integer-type selections (as in .iloc) when passed an integer. This only works where the index of the DataFrame is not integer-based. ix will accept any of the inputs of .loc and .iloc.

For example,

>>> df.ix[[3]] # Integer type selection

Population BirthRate UpdateDate
Country
Indonesia 260581000 18.84 2016-01-07

For example, to display the data of the row indexed India:
>>> df.ix["India"]
Population    1330780000
BirthRate          21.76
UpdateDate    2016-08-11
Name: India, dtype: object

To print the rows from China to Indonesia, the command is:



>>> df.ix["China":"Indonesia"]
               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07



The above command can also be written as: df.ix[0:4]

To display all rows for a column, i.e., for Population column, the command is:
>>> df.ix[:, 'Population'] # colon takes all rows of dataframe df
Country
China            1379750000
India            1330780000
United States     324882000
Indonesia         260581000
Brazil            206918000
Pakistan          194754000
Name: Population, dtype: int64

To display a specific element from a DataFrame, say the second row value of Population column,
>>> df.ix[1, 'Population']
1330780000

1.8 Iterating Pandas DataFrame

In Pandas, you can iterate over rows in a DataFrame. This is similar to iterating over Python dictionaries (think iteritems() or items() in Python 3). When iterated over directly, a DataFrame yields its column names, i.e., the keys of the object. Let us iterate the previous population (PData) DataFrame df:
>>> df

               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
>>> for keys in df:

print (keys)

which will print the following keys of DataFrame.


Population

BirthRate
UpdateDate

Pandas iterates the rows of a DataFrame using three different functions. These are:
• iterrows(). This function iterates over the rows of a DataFrame as (index, Series) pairs. In other words, it gives you (index, row) tuples as a result. It returns a generator that iterates over the rows of the DataFrame. Because the iterrows() function returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). It is best suited to be used with a for loop. For example,


>>> for index, row in df.iterrows():
print (index, row["Population"], row["BirthRate"], row["UpdateDate"])

which will print the following rows of the DataFrame:



China 1379750000 12.0 2016-08-11


India 1330780000 21.76 2016-08-11

United States 324882000 13.21 2016-08-11


Indonesia 260581000 18.84 2016-01-07
Brazil 206918000 18.43 2016-08-11
Pakistan 194754000 27.62 2016-08-11

If you need a formatted output, then you can apply the following command:
>>> for index, row in df.iterrows():
print ("{0:<15} {1:>15} {2:>5.2f}".format(index, row["Population"], row["BirthRate"]))
And the output is:

China            1379750000 12.00
India            1330780000 21.76
United States     324882000 13.21
Indonesia         260581000 18.84
Brazil            206918000 18.43
Pakistan          194754000 27.62

• iteritems(). Iterate over (column name, Series) pairs. For example,

>>> for key,value in df.iteritems():
print (key, value)

• itertuples(index=True). Iterate over the rows of the DataFrame as tuples, with the index value as the first element of each tuple. If the DataFrame is indexed, then it returns the index as the first element of the tuple. For example,

>>> for row in df.itertuples(): In
print (row)
which will print the data index-wise of the DataFrame as given below:

Pandas(Index='China', Population=1379750000, BirthRate=12.0, UpdateDate='2016-08-11')


Pandas(Index='India', Population=1330780000, BirthRate=21.76, UpdateDate='2016-08-11')

Pandas(Index='United States', Population=324882000, BirthRate=13.21, UpdateDate='2016-08-11')


Pandas(Index='Indonesia', Population=260581000, BirthRate=18.84, UpdateDate='2016-01-07')

Pandas(Index='Brazil', Population=206918000, BirthRate=18.43, UpdateDate='2016-08-11')


Pandas(Index='Pakistan', Population=194754000, BirthRate=27.62, UpdateDate='2016-08-11')

Notice that the iteration result displays the index column as the first column.
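Because itertuples() yields named tuples, column values can also be read as attributes. A small sketch with two illustrative rows:

```python
import pandas as pd

df = pd.DataFrame({'Day': [31, 28]}, index=['January', 'February'])

total = 0
for row in df.itertuples():  # each row is a named tuple (Index, Day)
    total += row.Day         # access the Day column as an attribute

print(total)  # 59
```

Attribute access (row.Day) is generally faster and more readable than building a Series per row as iterrows() does.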

1.9 Manipulating Pandas DataFrame Data



1.9.1 Adding a Column to DataFrame



There are five ways to add a new column into a DataFrame. These are: indexing, loc, assign(), insert() and
concat(). The concat() function will be discussed in the Adding Rows section. Before applying any method,
assume that we have a table with the following product information.
ew

Table 1.3 A table with product information.


Product_Code   Product_Name   Company_Name   Unit_Price   Quantity
A01            Motherboard    Intel               12000        200
A02            Hard Disk      Seagate              6500        180
A03            Keyboard       Samsung               500        250
A04            Mouse          Logitech              500        350
A05            Motherboard    AMD                 13000        120
A06            Hard Disk      HP                   8800        130

Using the above data, create a dictionary called Product with only the Product_Code column and also create
a DataFrame called dFrame using the Product dictionary. The commands are:
>>> import pandas as pd
>>> Product = { 'Product_Code' : ['A01', 'A02', 'A03', 'A04', 'A05', 'A06']}

To create a new DataFrame with the first column (or with an empty DataFrame), apply the following:
>>> dFrame = pd.DataFrame(Product, columns=['Product_Code']) # A column is created

>>> dFrame

  Product_Code
0          A01
1          A02
2          A03
3          A04
4          A05
5          A06

Method 1: Using index

The DataFrame dFrame is now integer-index based. To add a second column, i.e., Product_Name, using the indexing process, the command is:
>>> dFrame["Product_Name"] = ['Motherboard', 'Hard Disk', 'Keyboard', 'Mouse', 'Motherboard', 'Hard Disk']
>>> dFrame

  Product_Code Product_Name
0          A01  Motherboard
1          A02    Hard Disk
2          A03     Keyboard
3          A04        Mouse
4          A05  Motherboard
5          A06    Hard Disk



Method 2: Using .loc


loc gets rows (or columns) with particular labels from the index. To add a third column i.e., Company_Name
into DataFrame dFrame using .loc, the command is:

>>> dFrame.loc[:, "Company_Name"] = ['Intel', 'Seagate', 'Samsung', 'Logitech', 'AMD', 'HP']



>>> dFrame
Which will display the following:

  Product_Code Product_Name Company_Name
0          A01  Motherboard        Intel
1          A02    Hard Disk      Seagate
2          A03     Keyboard      Samsung
3          A04        Mouse     Logitech
4          A05  Motherboard          AMD
5          A06    Hard Disk           HP

Method 3: Using assign() function
.loc has two limitations: it mutates the DataFrame in place and it can't be used with method chaining. If that's a problem for you, use the assign() function. The assign() function assigns a new column (e.g., from a list) to the existing DataFrame. The syntax is:
DataFrame = DataFrame.assign(column_name = values)

e
Here,
• The DataFrame on the left side of the assignment sign receives a new DataFrame that includes the new column.
• The expression on the right side of the assignment sign returns a new DataFrame with the column attached; the original DataFrame is not modified.
• If both DataFrame names are the same, then the new DataFrame will hold the new column.

For example, to assign/add a new column called Unit_Price into DataFrame dFrame, the command is:

>>> dFrame = dFrame.assign(Unit_Price = [12000, 6500, 500, 500, 13000, 8800])
>>> dFrame
  Product_Code Product_Name Company_Name  Unit_Price
0          A01  Motherboard        Intel       12000
1          A02    Hard Disk      Seagate        6500
2          A03     Keyboard      Samsung         500
3          A04        Mouse     Logitech         500
4          A05  Motherboard          AMD       13000
5          A06    Hard Disk           HP        8800
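Because assign() returns a new DataFrame instead of mutating the original, calls can be chained. A minimal sketch (the Value and Discounted column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Unit_Price': [12000, 6500],
                   'Quantity': [200, 180]})

# two chained assign() calls; df itself is left unchanged
result = (df.assign(Value=df['Unit_Price'] * df['Quantity'])
            .assign(Discounted=lambda d: d['Value'] * 0.9))

print('Value' in df.columns, 'Value' in result.columns)  # False True
```

The lambda form lets the second assign() refer to the column created by the first one in the same chain.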



Method 4: Using insert() function



The insert() function adds a column at a given column index position. In a DataFrame, the column positions are numbered 0, 1, 2, and so on. To add a column using the insert() function, the syntax is:

DataFrame.insert(loc, col_name, value)

Here,
• loc is the integer index and must satisfy 0 <= loc <= len(columns). If the index is outside the DataFrame's column range, then an IndexError is raised.
For example, if dFrame has only 3 columns (0, 1, 2) and you mention the index as 4, then it will raise an index error, i.e., “index 4 is out of bounds for axis 0 with size 3”.
• col_name is the name of the column which will be inserted or added into the existing DataFrame.
• value can be a list, a Series, an array or a scalar.
Let us add a column called Quantity at column index 4 (the fifth column if you count from 1, i.e., after the Unit_Price column) with the following values [200, 180, 250, 350, 120, 130] into the existing DataFrame dFrame:

>>> idx = 4 # column index position where the new column Quantity will be inserted
>>> Qty = [200, 180, 250, 350, 120, 130] # can be a list, a Series, an array or a scalar
>>> dFrame.insert(loc=idx, column='Quantity', value=Qty) # new column added into dFrame
>>> dFrame

  Product_Code Product_Name Company_Name  Unit_Price  Quantity
0          A01  Motherboard        Intel       12000       200
1          A02    Hard Disk      Seagate        6500       180
2          A03     Keyboard      Samsung         500       250
3          A04        Mouse     Logitech         500       350
4          A05  Motherboard          AMD       13000       120
5          A06    Hard Disk           HP        8800       130
iv
Notice that the above output shows the new column called Quantity at column index 4 in the DataFrame dFrame.

1.9.2 Adding Rows into DataFrame

The two most popular methods to add new rows into a DataFrame are append() and concat().
Method 1: Using append() function
Pandas DataFrame has a straightforward method called append() which adds new rows to an existing DataFrame.
The syntax is:

DataFrame.append([DataFrame or list of DataFrames])



Here,
• The DataFrame is appended into an existing DataFrame.

• If any columns were missing from the data we are trying to append, they would result in those
rows having NaN (not a number) values in the cells falling under the missing columns.

• Once the DataFrames are appended, they are automatically reindexed.



For example, in previous sections we created a DataFrame called df which contains country-wise population data. Let us append another DataFrame called df1 which contains 3 new rows of population data as given below:

df1 data:
Nigeria      186,987,000   36.65   2016-01-07
Bangladesh   161,390,000   24.68   2016-01-08
Russia       146,691,020   11.10   2016-01-10

>>> df1 = pd.DataFrame({ 'Country' : ['Nigeria', 'Bangladesh', 'Russia'],
'Population' : [186987000, 161390000, 146691020],
'BirthRate' : [36.65, 24.68, 11.10],
'UpdateDate' : ['2016-01-07', '2016-01-08', '2016-01-10']},
columns=['Country', 'Population', 'BirthRate', 'UpdateDate'])

>>> df1
      Country  Population  BirthRate  UpdateDate
0     Nigeria   186987000      36.65  2016-01-07
1  Bangladesh   161390000      24.68  2016-01-08
2      Russia   146691020      11.10  2016-01-10

>>> df1 = df1.set_index(['Country'])
>>> df1

            Population  BirthRate  UpdateDate
Country
Nigeria      186987000      36.65  2016-01-07
Bangladesh   161390000      24.68  2016-01-08
Russia       146691020      11.10  2016-01-10

Let us append the df1 DataFrame rows into DataFrame df:
>>> df = df.append(df1) # DataFrame df1 appended with DataFrame df

>>> df

               Population  BirthRate  UpdateDate
Country
China          1379750000      12.00  2016-08-11
India          1330780000      21.76  2016-08-11
United States   324882000      13.21  2016-08-11
Indonesia       260581000      18.84  2016-01-07
Brazil          206918000      18.43  2016-08-11
Pakistan        194754000      27.62  2016-08-11
Nigeria         186987000      36.65  2016-01-07
Bangladesh      161390000      24.68  2016-01-08
Russia          146691020      11.10  2016-01-10

Method 2: Using concat() function



Concatenation basically attaches the DataFrames together. For simple operations where we need to add
rows or columns of the same length, the pd.concat() function is perfect. The syntax is:

result = pd.concat([list of DataFrames], axis=0, join='outer', ignore_index=False)



Here,

• List of DataFrames are the DataFrames which you want to concatenate.



• By default, the argument is set to axis=0, which means we are concatenating rows. For columns, set axis to 1.
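A minimal sketch of both arguments (the column name x is made up): axis=0 stacks rows and keeps the original labels, while ignore_index=True re-numbers them:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

rows = pd.concat([a, b], axis=0)              # duplicate labels 0, 1, 0, 1
renum = pd.concat([a, b], ignore_index=True)  # fresh labels 0, 1, 2, 3

print(list(rows.index), list(renum.index))  # [0, 1, 0, 1] [0, 1, 2, 3]
```

The duplicate labels in the first result mirror the repeated 0–3 index visible in the dFrame2 output later in this section.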

Using ignore_index
This option controls whether or not the original row labels are retained. By default it is False.
For example, assume that we have a new DataFrame called dFrame1 with the following product contents:

dFrame1 data:
  Product_Code Product_Name Company_Name  Unit_Price
0          A07     Keyboard          TVS        2400
1          A08       LCD-21           LG        8000
2          A09       LCD-21      Samsung        8500
3          A10        Mouse         Dell         450

The commands to create the DataFrame dFrame1 are:

>>> dFrame1 = pd.DataFrame({'Product_Code':['A07', 'A08', 'A09', 'A10'],
'Product_Name':['Keyboard', 'LCD-21', 'LCD-21', 'Mouse'],
'Company_Name':['TVS', 'LG', 'Samsung', 'Dell'],
'Unit_Price':[2400, 8000, 8500, 450]},
index = [0, 1, 2, 3])
>>> dFrame1
  Product_Code Product_Name Company_Name  Unit_Price
0          A07     Keyboard          TVS        2400
1          A08       LCD-21           LG        8000
2          A09       LCD-21      Samsung        8500
3          A10        Mouse         Dell         450

Notice that here we deliberately did not add the Quantity column to dFrame1. In the previous section, we already created a DataFrame called dFrame with the following product information.


>>> dFrame

Product_Code Product_Name Company_Name Unit_Price Quantity


0 A01 Motherboard Intel 12000 200

1 A02 Hard Disk Seagate 6500 180



2 A03 Keyboard Samsung 500 250



3 A04 Mouse Logitech 500 350



4 A05 Motherboard AMD 13000 120


5 A06 Hard Disk HP 8800 130

If you compare the two DataFrames (dFrame and dFrame1), you will see two major differences: different indexes and different columns. That is, the first DataFrame (dFrame) has 6 rows and 5 columns and the second DataFrame (dFrame1) has 4 rows and 4 columns. With concatenation, we can talk about various methods of bringing these together.

Let's create a third DataFrame called dFrame2 which concatenates the DataFrames dFrame and dFrame1:

>>> dFrame2 = pd.concat([dFrame, dFrame1])



When the rows are concatenated, the order of the columns may differ from the original. To rearrange the columns into their original sequence, for example in dFrame2, the command is:
>>> dFrame2 = dFrame2[['Product_Code', 'Product_Name', 'Company_Name', 'Quantity', 'Unit_Price']]

>>> dFrame2
  Product_Code Product_Name Company_Name  Quantity  Unit_Price
0          A01  Motherboard        Intel     200.0       12000
1          A02    Hard Disk      Seagate     180.0        6500
2          A03     Keyboard      Samsung     250.0         500
3          A04        Mouse     Logitech     350.0         500
4          A05  Motherboard          AMD     120.0       13000
5          A06    Hard Disk           HP     130.0        8800
0          A07     Keyboard          TVS       NaN        2400
1          A08       LCD-21           LG       NaN        8000
2          A09       LCD-21      Samsung       NaN        8500
3          A10        Mouse         Dell       NaN         450

The last four rows (with the repeated indexes 0 to 3) are the newly appended rows.

As you can observe in the above output, for the rows coming from the first DataFrame dFrame, the Quantity column values are available and have been printed, but dFrame1 has no Quantity column, and therefore NaN (Not a Number) is printed for those rows in the resulting DataFrame dFrame2. The index of the result is also duplicated; each index is repeated.
ignore_index = True

Also, from the above dFrame2 result, notice that the index values of the two DataFrames are retained as they are, so the index of the result is duplicated: some of the indexes are repeated. To avoid such a situation, or if you want the resulting object to follow its own indexing, set ignore_index to True.

>>> dFrame2 = pd.concat([dFrame, dFrame1], ignore_index=True)


>>> dFrame2

  Product_Code Product_Name Company_Name  Unit_Price  Quantity
0          A01  Motherboard        Intel       12000     200.0
1          A02    Hard Disk      Seagate        6500     180.0
2          A03     Keyboard      Samsung         500     250.0
3          A04        Mouse     Logitech         500     350.0
4          A05  Motherboard          AMD       13000     120.0
5          A06    Hard Disk           HP        8800     130.0
6          A07     Keyboard          TVS        2400       NaN
7          A08       LCD-21           LG        8000       NaN
8          A09       LCD-21      Samsung        8500       NaN
9          A10        Mouse         Dell         450       NaN

Notice the new index (0 to 9) in the result.
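The behaviour of ignore_index can also be sketched on two tiny made-up DataFrames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

dup = pd.concat([a, b])                       # keeps original labels: 0, 1, 0
fresh = pd.concat([a, b], ignore_index=True)  # builds a new index: 0, 1, 2
```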
Adding Column using concat() function
Next, you can also specify axis=1 in order to join, merge or concatenate along the columns. Suppose we
have a column called Total_Price in a DataFrame called Temp as given:

>>> Temp = pd.DataFrame({"Total_Price":[]}) # a DataFrame with a Total_Price column and no value


To concatenate the Temp DataFrame (i.e., the Total_Price column):
>>> dFrame2 = pd.concat([dFrame2, Temp], axis=1)
>>> dFrame2

  Company_Name Product_Code Product_Name  Quantity  Unit_Price  Total_Price
0        Intel          A01  Motherboard     200.0       12000          NaN
1      Seagate          A02    Hard Disk     180.0        6500          NaN
2      Samsung          A03     Keyboard     250.0         500          NaN
3     Logitech          A04        Mouse     350.0         500          NaN
4          AMD          A05  Motherboard     120.0       13000          NaN
5           HP          A06    Hard Disk     130.0        8800          NaN
6          TVS          A07     Keyboard       NaN        2400          NaN
7           LG          A08       LCD-21       NaN        8000          NaN
8      Samsung          A09       LCD-21       NaN        8500          NaN
9         Dell          A10        Mouse       NaN         450          NaN

As you can see in the above DataFrame result, the Total_Price column contains only missing values. This happens because the Temp DataFrame didn't have values for that particular column.
Don't apply the ignore_index=True option with the above concat() function. Otherwise, it will label the columns as (0, 1, 2, 3, 4, 5) instead of their original names (Company_Name, Product_Code, Product_Name, Quantity, Unit_Price, Total_Price).

1.9.3 Dropping Columns in DataFrame



In Pandas, we can drop or delete a column by index, by name and by position. The syntax is:

DataFrame.drop(index, axis=0, inplace=False)



Here,

• index. It refers to the column/row which will be deleted depending on the axis.
• axis. By default, the argument is set to axis=0, which means it denotes rows. For columns, set the axis as 1.

• inplace. By default it is False. If True, the operation is done in place and None is returned. To actually edit the original DataFrame, set the inplace parameter to True; there is no returned value.
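The difference between the copy-returning and the in-place behaviour of drop() can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

out = df.drop('B', axis=1)           # returns a new DataFrame; df is unchanged
still_there = 'B' in df.columns      # True: the copy-based drop left df intact
df.drop('B', axis=1, inplace=True)   # edits df itself and returns None
```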
Before applying any deletion operation, let us create a DataFrame for Table 1.4 data.

Table 1.4 A table with students' information.


Name Age Score Grade

Aashna 16 87 A2
Somya 15 64 B2

Ronald 16 58 C1
Jack 17 74 B1
Raghu 15 34 D
Mathew 16 77 B1

Nancy 14 87 A2
Bhavya 16 64 B2
Kumar 15 45 C2

Aashna 17 68 B2

Somya 16 92 A1

Mathew 16 93 A1

To create a DataFrame df from the dictionary, the commands are:
>>> import pandas as pd

>>> ClassXIIA = {'Name':['Aashna', 'Somya', 'Ronald', 'Jack', 'Raghu', 'Mathew',

'Nancy', 'Bhavya', 'Kumar', 'Aashna', 'Somya', 'Mathew'],

'Age':[16, 15, 16, 17, 15, 16, 14, 16, 15, 17, 16, 16],
'Score':[87, 64, 58, 74, 34, 77, 87, 64, 45, 68, 92, 93],

'Grade' :['A2', 'B2', 'C1', 'B1', 'D', 'B1', 'A2', 'B2', 'C2', 'B2', 'A1', 'A1']}
>>> df = pd.DataFrame(ClassXIIA, columns=['Name', 'Age', 'Score', 'Grade'] )

Drop a column by name

To delete a column, or multiple columns, use the name of the column(s), and specify the axis as 1. Let's drop a column (i.e., Age) by name in Python Pandas.
>>> df.drop('Age',axis=1) # drop a column based on name

The axis=1 denotes that we are referring to a column, not a row. The above deletion operation does not affect the original DataFrame df; however, you can put the result in another variable (DataFrame). The deleted column remains in the original DataFrame unless you apply inplace=True. Alternatively, if you apply a command like df = df.drop('Age', axis=1), the resultant columns are copied back into the same DataFrame, so the column is effectively deleted permanently.
So the resultant DataFrame will be:



>>> df

Name Score Grade



0 Aashna 87 A2

1 Somya 64 B2

2 Ronald 58 C1
3 Jack 74 B1

4 Raghu 34 D
5 Mathew 77 B1

6 Nancy 87 A2
7 Bhavya 64 B2

8 Kumar 45 C2
9 Aashna 68 B2
10 Somya 92 A1
11 Mathew 93 A1

To drop more than one column using label name, mention the column names in a list, for example, df =
df.drop(['Age', 'Score'], axis=1). This will delete the 'Age' and 'Score' columns.

Drop a column by column index

Let us drop a column by its index in Python Pandas. For example, to delete the Grade column, whose column index is 3, the command is:
>>> df.drop(df.columns[3],axis=1) # drop Grade column based on column index

In the above example, the column with index 3 is dropped (i.e., the Grade column). So, the resultant DataFrame will be:

Name Age Score

0 Aashna 16 87

1 Somya 15 64
2 Ronald 16 58

3 Jack 17 74

4 Raghu 15 34

5 Mathew 16 77
6 Nancy 14 87
7 Bhavya 16 64

8 Kumar 15 45
9 Aashna 17 68

10 Somya 16 92
11 Mathew 16 93

Deleting a column based on column name using del



We can delete a column based on its name by using the del command. Note that, unlike drop(), del always works in place: the column is removed from the DataFrame immediately. Let us delete a column (i.e., Age) by name in Python Pandas.

>>> del df['Age'] # delete the Age column in place
>>> df



Name Score Grade



0 Aashna 87 A2
1 Somya 64 B2
2 Ronald 58 C1

3 Jack 74 B1
4 Raghu 34 D

5 Mathew 77 B1

6 Nancy 87 A2
7 Bhavya 64 B2
8 Kumar 45 C2
9 Aashna 68 B2

10 Somya 92 A1
11 Mathew 93 A1
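The in-place behaviour of del can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Somya'],
                   'Age': [16, 15],
                   'Score': [87, 64]})

del df['Age']   # removes the column immediately; nothing is returned
```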

Dropping column inplace

If you use the inplace = True with the drop command for column, the column will be permanently deleted
from the DataFrame. For example, notice the following commands and their results given below:

Let us see the original DataFrame first (assume df has been re-created with all four columns):

>>> df
Name Age Score Grade

0 Aashna 16 87 A2

1 Somya 15 64 B2

2 Ronald 16 58 C1

3 Jack 17 74 B1
4 Raghu 15 34 D

5 Mathew 16 77 B1

6 Nancy 14 87 A2
7 Bhavya 16 64 B2
8 Kumar 15 45 C2

9 Aashna 17 68 B2
10 Somya 16 92 A1

11 Mathew 16 93 A1

If you apply the inplace = True with the drop() function, i.e.,

>>> df.drop(df.columns[3],axis=1, inplace=True)



The above deletion operation affects the original DataFrame df. The command deletes the Grade column (whose index is 3) from the DataFrame permanently. So, the resultant DataFrame will be:

>>> df

Name Age Score



0 Aashna 16 87
1 Somya 15 64

2 Ronald 16 58
3 Jack 17 74

4 Raghu 15 34
5 Mathew 16 77

6 Nancy 14 87
7 Bhavya 16 64
8 Kumar 15 45
9 Aashna 17 68

10 Somya 16 92
11 Mathew 16 93

1.9.4 Dropping Rows in DataFrame

In Pandas, we can drop or delete a row by index, by condition and by position. Note that if the DataFrame is indexed by label, then you can delete a row by name also.

Drop a row by number

Let us delete the second and third rows (i.e., index 1 and 2) from the DataFrame df. The command is:

>>> df.drop([1,2]) # Deletes rows whose indexes are 1 and 2.

Here, axis defaults to 0, which means we are deleting row(s). The deleted rows remain in the original DataFrame unless you apply inplace=True. Alternatively, if you apply a command like df = df.drop([1,2]), the resultant rows are copied back into the same DataFrame, so the rows are effectively deleted permanently. So, the resultant DataFrame for the above command will be:

Name Age Score
0 Aashna 16 87
In
3 Jack 17 74
4 Raghu 15 34

5 Mathew 16 77

6 Nancy 14 87
7 Bhavya 16 64

8 Kumar 15 45
9 Aashna 17 68

10 Somya 16 92

11 Mathew 16 93

Drop a row or observation by condition



We can drop a row when it satisfies a specific condition. For example, let us delete the row with the name 'Mathew'. The command is:


>>> df = df[df.Name != 'Mathew'] # Drop rows by condition (keep rows where Name is not 'Mathew')

The above code selects all the rows except those with the name 'Mathew', thereby dropping the row(s) with that name. Because the result is assigned back to df, the operation affects the original DataFrame. So, the resultant DataFrame will be:

Name Age Score



0 Aashna 16 87
1 Somya 15 64
2 Ronald 16 58
3 Jack 17 74

4 Raghu 15 34
6 Nancy 14 87
7 Bhavya 16 64

8 Kumar 15 45

9 Aashna 17 68

10 Somya 16 92
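The condition-based deletion can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Aashna', 'Mathew', 'Somya'],
                   'Score': [87, 77, 64]})

# Keep only the rows where the condition is True, i.e. drop the 'Mathew' rows
df = df[df.Name != 'Mathew']
```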

Drop by index

We can drop a row by index, i.e., integer index. Let us delete the 3rd row (i.e., index 2) from the DataFrame df. The command is:
>>> df.drop(df.index[2]) # Drop a row by index

The above code drops the row with index number 2. So, the resultant DataFrame will be:

Name Age Score
0 Aashna 16 87

1 Somya 15 64
3 Jack 17 74
4 Raghu 15 34

5 Mathew 16 77
6 Nancy 14 87

7 Bhavya 16 64
8 Kumar 15 45

9 Aashna 17 68

10 Somya 16 92

11 Mathew 16 93

Drop by position

We can drop rows by position using the slicer (:) with indexes. For example, to drop the first 3 rows (i.e., row indexes 0, 1 and 2), the command is:


>>> df.drop(df.index[:3]) # drop the first 3 rows

It will delete the top 3 rows from the DataFrame. So, the resultant DataFrame will be:
Name Age Score

3 Jack 17 74
4 Raghu 15 34

5 Mathew 16 77
6 Nancy 14 87
7 Bhavya 16 64
8 Kumar 15 45

9 Aashna 17 68
10 Somya 16 92
11 Mathew 16 93

If you want to delete all rows except the last three rows, then the above command can be written as:
>>> df.drop(df.index[:-3])

So, the resultant DataFrame will be

Name Age Score
9 Aashna 17 68

10 Somya 16 92
11 Mathew 16 93
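The positional slicing used above can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5]})

first_dropped = df.drop(df.index[:3])   # drop the first 3 rows
last_three = df.drop(df.index[:-3])     # drop everything except the last 3 rows
```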

Points to Remember

1. Pandas is a high-level data manipulation tool developed by Wes McKinney.

2. The Series is a one-dimensional labeled array capable of holding data of any type (integer, string,
float, python objects, etc.)
3. The series data are mutable (can be changed). But the size of Series data is immutable.
4. A Series can be created from the elements of a list, tuple or dictionary.

5. Dictionary keys are used to construct index in a series.


6. NumPy is a Python package which stands for ‘Numerical Python’.

7. DataFrame is a two-dimensional array with heterogeneous data, usually represented in the tabular
format.

8. DataFrame has two different index i.e., column-index and row-index.


9. The Pandas DataFrame() constructor is used to create a DataFrame.

10. The DataFrame.head() function in Pandas, by default, shows you the top 5 rows of data in the DataFrame.
11. The DataFrame.tail() function in Pandas, by default, shows you the last 5 rows of data in the DataFrame.

12. iterrows() function iterate over the rows of a DataFrame as (index, Series) pairs.

13. itertuples(index=True) function iterate over the rows of DataFrame as tuples, with index value as
first element of the tuple.

14. The assign() function in python, assigns the new column (i.e., a list) to the existing DataFrame.
15. The insert() function adds a column at the column index position.

SOLVED EXERCISES

1. What is pandas Series?


Ans. Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,

string, float, python objects, etc.).


2. In pandas, S is a Series created as follows:
S = pd.Series([5, 10, 15, 20, 25])
The Series object is automatically indexed as 0, 1, 2, 3, 4. Write a statement to assign the series
as a, b, c, d, e explicitly.
Ans. S.index = ['a' , 'b' , 'c' , 'd' , 'e']
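The answer above can be verified with a short sketch:

```python
import pandas as pd

S = pd.Series([5, 10, 15, 20, 25])
S.index = ['a', 'b', 'c', 'd', 'e']   # assign explicit labels
```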

3. In pandas, S is a Series with the following data:


S = pd.Series([5 , 10 , 15 , 20 , 25])
Find the output of the following:
(i) S[[1 , 2]]

(ii) S[1 : 3]

Ans. (i) 1 10
2 15

dtype: int64
(ii) 1 10

2 15
dtype: int64

4. What is DataFrame?

Ans. DataFrame is a two-dimensional array with heterogeneous data, usually represented in the tabular format. The data is represented in rows and columns. Each column represents an attribute and each row represents a record.

5. A dictionary s_marks contains the following data:
s_marks = {'name' : ['Rashmi', 'Harsh', 'Ganesh', 'Priya', 'Vivek'], 'Grade' : ['A1', 'A2', 'B1', 'A1',
'B2']}

Write a statement to create a DataFrame called df. Assume that Pandas has been imported as pd.
Ans. df = pd.DataFrame(s_marks)
In
6. What is the purpose of the axis option in the Pandas concat() function?
Ans. By default axis=0, thus the new DataFrame will be added row-wise. If a column is not present in one of the DataFrames, it creates NaNs. To join column-wise, we set axis=1.


7. A dictionary Grade contains the following data:
Grade = {'Name' : ['Rashmi', 'Harsh', 'Ganesh', 'Priya', 'Vivek', 'Anita', 'Karthik'], 'Grade' : ['A1', 'A2', 'B1', 'A1', 'B2', 'A2', 'A1']}


Write statements for the following:

(a) Create a DataFrame called Gr.


(b) Find the output of Gr.iloc[0:5] and Gr[0:5]
(c) Add a column called Percentage with the following data:


[92, 89, None, 95, 68, None, 93]

(d) Rearrange the columns as Name, Percentage and Grade.



(e) Add the following 3 rows of data (Note: None is a null value):

Ishan 86 B1
Amrita 97 A1

None None None


(f) Drop the column (i.e., Grade) by name.
(g) Delete the 3rd and 5th rows.

(h) What will the following do?


(i) Gr.drop(0,axis = 0)

(ii) Gr.drop(0,axis = "index")


(iii) Gr.drop([0,1,2,3],axis = 0)

Ans. (a) Gr = pd.DataFrame(Grade)


(b) Output for both the commands are same:
Grade Name
0 A1 Rashmi
1 A2 Harsh

2 B1 Ganesh
3 A1 Priya
4 B2 Vivek
(c) Gr["Percentage"] = [92, 89, None, 95, 68, None, 93]

(d) Gr = Gr[['Name', 'Percentage', 'Grade']]

(e) TGr = pd.DataFrame({ 'Name' : ['Ishan', 'Amrita', None],
'Percentage' : [86, 97, None],

'Grade' : ['B1', 'A1', None]},
columns=['Name', 'Percentage', 'Grade'])

Gr = Gr.append(TGr, ignore_index=True)

(f) Gr.drop('Grade', axis=1)

(g) Gr.drop([2, 4])
(h) (i) The first row from the DataFrame Gr will be deleted.

(ii) The first row from the DataFrame Gr will be deleted.

(iii) The first four rows from the DataFrame Gr will be deleted.

REVIEW QUESTIONS

1. If a list contains the following elements:
L = ['a','b','c','d','e','f']
Write the statements to create a Series from it using a NumPy array.

2. What will be the output of the following:


D = {'a' : 0., 'b' : 1., 'c' : 2.}

S = pd.Series(D,index=['b','c','d','a'])
print(S)
3. A dictionary contains first 10 states with their Per Capita Income as follows:

d = {'Goa': 224138, 'Delhi': 212219, 'Sikkim': 176491, 'Chandigarh': None, 'Puducherry': 143677,
'Haryana': 133427, 'Maharashtra': None, 'Tamil Nadu': 112664, 'A. & N. Islands': 107418, 'Gujarat': 106831}
Answer the following:

(a) Create a series called Income. (b) List the states with the income below 130000.

(c) What will be the output of the commands?



Less_Than = Income < 130000


print (Less_Than)

4. Suppose we make a DataFrame as (assume that pd is used as pandas namespace)


df = pd.DataFrame({'Book_ID' : ['C0001', 'F0001', 'T0001', 'T0002', 'F0002'],
'Book_Name' : ['Fast Cook', 'The Tears', 'My First C++', 'C++ Brainworks', 'Thunderbolts'],

'Author_Name' : ['Lata Kapoor', 'William Hopkins', 'Brain & Brooke', 'A.W. Rossaine', 'Anna
Roberts'],

'Price' : [540, 450, 670, 548, 750]},


columns = ['Book_ID', 'Book_Name', 'Author_Name', 'Price'])

What are the purposes of the following commands?


(i) print(df.loc[:, 'Book_Name']) (ii) print(df['Book_Name'])
(iii) print(df[['Book_Name']]) (iv) print(df[['Book_Name', 'Price']])
(v) print(df[0:3]) (vi) print(df[2:4])
(vii) print(df.iloc[2])

Chapter – 2

Advanced Operations on Pandas DataFrames
2.1 Introduction

Pandas is a popular Python library for data analysis. One of the key actions for any data analyst is to be able to pivot data tables. Pandas can be used to create MS Excel style pivot tables, which will save you a lot of time by allowing you to quickly summarize large amounts of data into a meaningful report.

2.2 Pivoting DataFrame


Pivot tables are useful for summarizing data. A pivot table allows us to extract the significance from a large, detailed data set. Pivot tables are particularly useful if you have long rows or columns holding values that you need to track the sums of and easily compare to one another. They can automatically sort, count, total, or average data stored in one table. Then, they can show the results of those actions in a new table of that summarized data inside a DataFrame.

In a general sense, pivoting means to use unique values from a specified index/columns to form the axes of the resulting DataFrame. We can get pandas to form a pivot table for our DataFrame by calling the pivot() or pivot_table() methods and providing parameters about how we would like the resulting table to be organized.

2.3 Pivot Table using pivot() Method



The pivot() method reshapes data (produces a “pivot” table) based on column values and returns the reshaped DataFrame. The pivot() method takes a maximum of 3 arguments with the following names: index, columns, and values, but at least two must be supplied. As a value for each of these parameters you need to specify a column name in the original table. Then the pivot() method will create a new table whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from the column given as the values parameter. The syntax of pivot() is:

pandas.pivot(index, columns, values)

What is a Pivot Table?
A Pivot Table enables you to summarize large amounts of data in a matter of minutes. You can transform endless rows and columns of numbers into a meaningful presentation of the data.

Here,
• index is the column name to use in order to make new DataFrame’s index.
• columns is the column name to use to make new DataFrame’s columns.
• values is the column name to use to make new DataFrame’s values. The values parameter accepts
both numeric and string.

So, to create a pivot table, pivot() uses unique values from the index/columns and fills the table with values. To start, here is the data set that will be used to create a pivot table using the pivot() method in Python:
>>> import pandas as pd
>>> ClassXII = {'Name': ['Aashna', 'Ronald', 'Jack', 'Raghu', 'Somya'],

'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts'],

'Score': [87, 64, 58, 74, 87],
'Grade' :['A2', 'B2', 'C1', 'B1', 'A2']}

>>> df = pd.DataFrame(ClassXII, columns=['Name', 'Subject', 'Score', 'Grade']) # creating a DataFrame

>>> df
Name Subject Score Grade

0 Aashna Accounts 87 A2

1 Ronald Economics 64 B2

2 Jack Accounts 58 C1

3 Raghu Economics 74 B1
4 Somya Accounts 87 A2

Notice that the DataFrame above is a row-by-row record of transactions. Each row in the DataFrame corresponds to one complete record of the information. We could think of the information as being organized by Name.
To create a pivot table:

>>> pv = df.pivot(index='Name', columns='Subject', values='Score') # pv is the pivot table


>>> pv

Subject Accounts Economics



Name
Aashna 87.0 NaN

Jack 58.0 NaN



Raghu NaN 74.0



Ronald NaN 64.0



Somya 87.0 NaN



As can be seen, the value of Score for every row in the original table has been transferred to the new table, where its row and column match the Name and Subject of its original row. Also notice that many of the values are NaN. This is because many of the positions of the table do not have matching information from the original DataFrame, and such data are set to NaN (None).
If you don’t need a separate DataFrame, then you can use the following command:

>>> df.pivot(index='Name', columns='Subject', values='Score')


The above command will display the same result as the DataFrame pv.

Using pivot() with .fillna()


If you want to avoid the NaN value in your new DataFrame, you can use pivot() with the .fillna('') function.
That is:
>>> pv = df.pivot(index='Name', columns='Subject', values='Score').fillna('')

>>> pv
Subject Accounts Economics
Name

Aashna 87.0

Jack 58.0
Raghu 74.0

Ronald 64.0

Somya 87.0

Whatever column you specify as the columns argument will be used to create new columns (each unique entry will form a new column). Remember, columns are optional and they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list, i.e., Score.

Let us create a pivot table to display the Grade for every row in the original table using values='Grade':
>>> df.pivot(index='Name', columns='Subject', values='Grade').fillna('')

Subject Accounts Economics
Name
Aashna A2
Jack C1

Raghu B1

Ronald B2
Somya A2

Pivoting By Multiple Columns



Now, what if we want to extend the previous example to have the Score and Grade for each name on its row as well? This is actually easy - we just have to omit the values parameter as follows:

>>> pv = df.pivot(index='Name', columns='Subject')



>>> pv

Score Grade
Subject Accounts Economics Accounts Economics

Name
Aashna 87.0 NaN A2 NaN

Jack 58.0 NaN C1 NaN


Raghu NaN 74.0 None B1

Ronald NaN 64.0 None B2


Somya 87.0 NaN A2 NaN

As shown above, pandas will create a hierarchical column index (MultiIndex) for the new table. You can think of a hierarchical index as a set of nested indices. The first level of the column index defines all columns that we have not specified in the pivot invocation - in this case Score and Grade. The second level of the index defines the unique values of the corresponding column.

Using the above pivot table, we can use this hierarchical column index to filter the values of a single column from the original table. For example, pv.Score returns a pivoted DataFrame with the Score values only, and it is equivalent to the pivoted DataFrame from the previous section.

>>> pv.Score.fillna('')

Subject Accounts Economics
Name

Aashna 87

Jack 58

Raghu 74

Ronald 64
Somya 87

>>> pv.Score.Accounts.fillna('')
Name
In
Aashna 87

Jack 58
Raghu

Ronald

Somya 87

Name: Accounts, dtype: float64



A pivot problem

But remember that when there are any index, columns combinations with multiple values, pivot() will raise a ValueError. Let us add one more entry to the previous records to demonstrate the problem, and rebuild our DataFrame.

For example, let us create a new data set with duplicate entries:
>>> Temp = {'Name': ['Aashna', 'Ronald', 'Jack', 'Raghu', 'Somya', 'Ronald'],

'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts', 'Accounts'],


'Score': [87, 64, 58, 74, 87, 78],
'Grade' :['A2', 'B2', 'C1', 'B1', 'A2', 'B1']}

>>> df1 = pd.DataFrame(Temp, columns=['Name', 'Subject', 'Score', 'Grade'])


>>> df1

Name Subject Score Grade

0 Aashna Accounts 87 A2
1 Ronald Economics 64 B2

2 Jack Accounts 58 C1
3 Raghu Economics 74 B1
4 Somya Accounts 87 A2

5 Ronald Accounts 78 B1

As you can see from this example, in the 5th row of the DataFrame only the Name is duplicated. Here, if we apply the pivot() function, it won't produce any ValueError, as shown below:

>>> pv1 = df1.pivot(index='Name', columns='Subject', values='Score').fillna('')
>>> pv1

Subject Accounts Economics

Name

Aashna 87.0

Jack 58.0
Raghu 74.0

Ronald 78.0 64.0
Somya 87.0
In
It is to be noted that if there is any index, columns combination with multiple values, a ValueError will be raised. To see this, follow the example:


>>> Temp1 = {'Name': ['Aashna', 'Ronald', 'Jack', 'Raghu', 'Somya', 'Ronald', 'Aashna'],

'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts', 'Accounts', 'Accounts'],


'Score': [87, 64, 58, 74, 87, 78, 82],

'Grade' :['A2', 'B2', 'C1', 'B1', 'A2', 'B1', 'A2']}


>>> df2 = pd.DataFrame(Temp1, columns=['Name', 'Subject', 'Score', 'Grade'])

>>> df2

Name Subject Score Grade



0 Aashna Accounts 87 A2

1 Ronald Economics 64 B2
2 Jack Accounts 58 C1

3 Raghu Economics 74 B1
4 Somya Accounts 87 A2

5 Ronald Accounts 78 B1
6 Aashna Accounts 82 A2

As you can see above, the 0th and 6th rows in the DataFrame contain duplicate values in both the Name and Subject columns. Here, if we apply the pivot() function, it produces a ValueError as shown below:
>>> pv2 = df2.pivot(index='Name', columns='Subject', values='Score')
Traceback (most recent call last):
File "<pyshell#150>", line 1, in <module>
pv2 = df2.pivot(index='Name', columns='Subject', values='Score')

....
....
File "C:\Python36\lib\site-packages\pandas\core\reshape\reshape.py", line 154, in _make_selectors
raise ValueError('Index contains duplicate entries, '

ValueError: Index contains duplicate entries, cannot reshape

Using stack and unstack Methods

The stack() and unstack() methods flip the layout of a DataFrame by moving whole levels of columns to rows, or whole levels of rows to columns. Stacking a DataFrame means moving (also rotating or pivoting) the innermost column index to become the innermost row index. The inverse operation is called unstacking. It means moving the innermost row index to become the innermost column index. These are particularly useful to help manipulate the hierarchies we form when making pivot tables, but they can be applied any time. To understand this, let us see the example given below:

>>> pv = df.pivot(index='Name', columns='Subject')

>>> pv
Score Grade

a
Subject Accounts Economics Accounts Economics
Name
In
Aashna 87.0 NaN A2 NaN

Jack 58.0 NaN C1 NaN


Raghu NaN 74.0 NaN B1

Ronald NaN 64.0 NaN B2


Somya 87.0 NaN A2 NaN

As you can see from above, the pivot table has a hierarchy of column labels. In this case, the column labels were broken down into multiple levels, one for "Score" and "Grade", and another level for the subject. Right now, the subjects are written across as column labels. If we would prefer that they were written downward as row labels, we can call stack on the DataFrame. That is:

>>> pv.stack()

Score Grade

Name Subject
Aashna Accounts 87.0 A2
Jack Accounts 58.0 C1

Raghu Economics 74.0 B1
Ronald Economics 64.0 B2


Somya Accounts 87.0 A2

Here, the stack() method by default takes the last level in the column breakdown and turns it into the
last row breakdown.
If we call stack again, it will move the remaining column level. This will result in there not being any
more columns. This is possible to do, and returns something reasonable. Note, however, that the result is no longer a DataFrame object; it is a Series object.

>>> pv.stack().stack()

Name Subject
Aashna Accounts Score 87

Grade A2

Jack Accounts Score 58
Grade C1

Raghu Economics Score 74

Grade B1

Ronald Economics Score 64

Grade B2
Somya Accounts Score 87

Grade A2

dtype: object

Unstack is similar to stack, but moves row levels to column levels. One more thing to note is that these methods can take a parameter to specify which level in the hierarchy to move. As mentioned above, by default they will move the "last" level.
For example, if we start with our original DataFrame df, and this time use stack on the 0 level, we will
move the "Score" and "Grade" labels.

>>> pv = df.pivot(index="Name", columns="Subject").stack(0).stack()



>>> pv

Name Subject

Aashna Grade Accounts A2



Score Accounts 87

Jack Grade Accounts C1



Score Accounts 58

Raghu Grade Economics B1



Score Economics 74

Ronald Grade Economics B2


Score Economics 64

Somya Grade Accounts A2


Score Accounts 87

dtype: object
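Because stack() and unstack() are inverses, a stack followed by an unstack restores the original layout. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

s = df.stack()        # Series indexed by (row label, column label) pairs
back = s.unstack()    # move the innermost row level back to columns
```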

2.4 Pivot Table using pivot_table() Method

A bit confusingly, pandas DataFrames also come with a pivot_table() method, which is a generalization of
the pivot() method. Whenever you have duplicate values for one index/column pair, you need to use the
pivot_table().
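The difference can be sketched with a small made-up DataFrame containing a duplicate Name/Subject pair: pivot() raises a ValueError on it, while pivot_table() aggregates the duplicate values.

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Aashna', 'Aashna', 'Jack'],
                    'Subject': ['Accounts', 'Accounts', 'Accounts'],
                    'Score': [87, 82, 58]})

# pivot() cannot decide which of Aashna's two scores to keep;
# pivot_table() averages them instead (the default aggfunc is the mean)
pv = df2.pivot_table(index='Name', columns='Subject', values='Score')
```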

There are several reasons why you might want to automate pivot table operations. In order to use the
interactive pivot table, you had to identify:
• what column(s) in the dataset to use to define the row groupings in the pivot table?
• what column(s) in the dataset to use to define the column groupings in the pivot table?

• what column in the dataset to use as the basis for the pivot table summary function?

• what summary function to use?

A pivot table is composed of counts, sums, or other aggregations derived from a table of data. The pivot_table() method creates a spreadsheet-style pivot table as a DataFrame. It allows transforming data columns into rows and rows into columns, and allows grouping by any data field. The syntax is:

pandas.pivot_table(DataFrame, values=None, index=None, columns=None,
                   aggfunc='mean', fill_value=None,
                   margins=False, dropna=True, margins_name='All')

The pivot_table() method does not necessarily need all four arguments, because it has some smart defaults. If we pivot on one column as the index, it will by default aggregate all the other numeric columns as the values and take their average (mean).

a
Here,
• DataFrame: It is a pandas DataFrame.
• values: The column(s) to aggregate; optional.
• index: The name of the column, Grouper, array, or list of the previous.
• columns: column, Grouper, array, or list of the previous.
• aggfunc: The aggfunc (aggregation function) will apply to values even if there is only one value for that position. It basically shows how rows are summarized, such as sum, mean, max, min or count. The default aggfunc of pivot_table is numpy.mean.
• fill_value: By providing a fill_value parameter, we can set the default used when values are missing. The default is NaN (i.e., None).
• margins: This is a Boolean that defaults to False, but if we set it to True, the resulting DataFrame will also include total sums along the rows and columns. The totals appear in entries whose new column or row labels are "All".
• dropna: Defaults to True; do not include columns whose entries are all NaN.
• margins_name: Defaults to 'All'. Name of the row/column that will contain the totals when margins is True.

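A minimal sketch of how fill_value and margins interact, using a small invented dataset (not the book's dfP): missing Name/Subject pairs are filled with 0, while margins=True appends "All" totals computed from the real data.

```python
import pandas as pd

# Hypothetical mini dataset with one missing Name/Subject combination.
d = pd.DataFrame({"Name": ["Aashna", "Aashna", "Jack"],
                  "Subject": ["Accounts", "Economics", "Accounts"],
                  "Marks": [87, 80, 58]})

pt = pd.pivot_table(d, index="Name", columns="Subject", values="Marks",
                    aggfunc="sum", fill_value=0, margins=True)
print(pt)
# Jack has no Economics row, so that cell shows the fill_value 0,
# while the "All" row/column holds the genuine totals.
```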
Before creating a pivot table, let us create a DataFrame with the following data:

>>> Data = {'Name': ['C Joseph', 'Sareen', 'Abhishek', 'Rughwani', 'C Joseph', 'Sareen', 'Abhishek',
                     'Rughwani', 'C Joseph', 'Sareen', 'Abhishek', 'Rughwani', 'C Joseph', 'Sareen',
                     'Abhishek', 'Rughwani'],
            'Test': ['Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1', 'Semester 1',
                     'Semester 1', 'Semester 1', 'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2',
                     'Semester 2', 'Semester 2', 'Semester 2', 'Semester 2'],
            'Subject': ['Accounts', 'Accounts', 'Accounts', 'Accounts', 'Economics', 'Economics',
                        'Economics', 'Economics', 'Accounts', 'Accounts', 'Accounts', 'Accounts',
                        'Economics', 'Economics', 'Economics', 'Economics'],
            'Marks': [78, 87, 67, 69, 79, 80, 82, 78, 62, 59, 68, 73, 60, 70, 64, 84]}
Advanced Operations on Pandas DataFrames 49

>>> dfP = pd.DataFrame(Data, columns=['Name', 'Test', 'Subject', 'Marks'])


>>> dfP
        Name        Test    Subject  Marks
0   C Joseph  Semester 1   Accounts     78
1     Sareen  Semester 1   Accounts     87
2   Abhishek  Semester 1   Accounts     67
3   Rughwani  Semester 1   Accounts     69
4   C Joseph  Semester 1  Economics     79
5     Sareen  Semester 1  Economics     80
6   Abhishek  Semester 1  Economics     82
7   Rughwani  Semester 1  Economics     78
8   C Joseph  Semester 2   Accounts     62
9     Sareen  Semester 2   Accounts     59
10  Abhishek  Semester 2   Accounts     68
11  Rughwani  Semester 2   Accounts     73
12  C Joseph  Semester 2  Economics     60
13    Sareen  Semester 2  Economics     70
14  Abhishek  Semester 2  Economics     64
15  Rughwani  Semester 2  Economics     84

Note. You can extract the above data from the .CSV file:
dfP = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Result.csv')
We know that the index and columns parameters for pivot_table() can take lists, not just single column labels. Let us create a pivot table using the pivot_table() method to find the student-wise mean of Marks.

>>> pv = dfP.pivot_table(index='Name', aggfunc='mean')
Or
>>> pv = dfP.pivot_table(index='Name', values='Marks', aggfunc='mean')
>>> pv
          Marks
Name
Abhishek  70.25
C Joseph  69.75
Rughwani  76.00
Sareen    74.00

Here,
• DataFrame is the dfP.
• index is set to Name since that is the column from dfP we want to appear as a unique value in each row. That is, the Name column in the dataset is used to define the row groupings in the pivot table.
• The Marks column in the dataset is used as the basis for the pivot table summary function.
• aggfunc is set to 'mean' since we want to find the average of all values in Marks that belong to a unique Name.
As shown in the previous command, if you don't need a new pivot table (pv), then you can write the following command to display the result of the pivot table immediately:
>>> pd.pivot_table(dfP, index='Name', aggfunc='mean')
          Marks
Name
Abhishek  70.25
C Joseph  69.75
Rughwani  76.00
Sareen    74.00
To find the subject-wise mean of each student's marks in DataFrame dfP, we have to add the columns parameter as Subject:
>>> pv = dfP.pivot_table(index='Name', columns='Subject', aggfunc='mean')
>>> pv
             Marks
Subject   Accounts Economics
Name
Abhishek      67.5      73.0
C Joseph      70.0      69.5
Rughwani      71.0      81.0
Sareen        73.0      75.0
Example Using DataFrame dfP, create a pivot table (pv1) to find the group means by Name and Subject.
>>> pv1 = dfP.pivot_table(index=['Name', 'Subject'], columns='Subject', aggfunc='mean')
>>> pv1
                       Marks
Subject             Accounts Economics
Name     Subject
Abhishek Accounts       67.5       NaN
         Economics       NaN      73.0
C Joseph Accounts       70.0       NaN
         Economics       NaN      69.5
Rughwani Accounts       71.0       NaN
         Economics       NaN      81.0
Sareen   Accounts       73.0       NaN
         Economics       NaN      75.0

Example Using DataFrame dfP, create a pivot table (pv2) to find the group Marks counts by Name and Subject.
>>> pv2 = dfP.pivot_table(index='Name', columns='Subject', aggfunc='count')
>>> pv2
            Marks               Test
Subject  Accounts Economics Accounts Economics
Name
Abhishek        2         2        2         2
C Joseph        2         2        2         2
Rughwani        2         2        2         2
Sareen          2         2        2         2

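Before the next example, note that aggfunc also accepts a list of functions, or a dict mapping each value column to its own function. A sketch with a small invented frame (not the book's dfP):

```python
import pandas as pd

# Hypothetical two-student frame.
marks = pd.DataFrame({"Name": ["A", "A", "B", "B"],
                      "Marks": [70, 80, 60, 90]})

# A list of functions produces one column block per function...
pt1 = marks.pivot_table(index="Name", values="Marks", aggfunc=["min", "max"])
# ...while a dict picks a function per value column.
pt2 = marks.pivot_table(index="Name", aggfunc={"Marks": "mean"})
print(pt1)
print(pt2)
```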
Example An Emp table contains the following data:
Empno  Name           Department   Salary  Commission  Job
100    Sunita Sharma  RESEARCH      45600      5600.0  CLERK
101    Ashok Singhal  SALES         43900      3900.0  SALESMAN
102    Sumit Avasti   SALES         27000      7000.0  SALESMAN
103    Jyoti Lamba    RESEARCH      45900      4900.0  MANAGER
104    Martin S.      SALES         32500      3500.0  SALESMAN
105    Binod Goel     SALES         45200      4200.0  MANAGER
106    Chetan Gupta   ACCOUNTS      36800      6800.0  MANAGER
107    Sudhir Rawat   RESEARCH      37000      7000.0  ANALYST
108    Kavita Sharma  ACCOUNTS      42900      4900.0  CLERK
109    Tushar Tiwari  SALES         49500      4500.0  MANAGER
110    Anand Rathi    OPERATIONS    41600      8200.0  SR. MANAGER
111    Sumit Vats     RESEARCH      47800         NaN  SR. MANAGER
112    Manoj Kaushik  OPERATIONS    43600         NaN  CLERK
(a) Using the above table, create a DataFrame called dfE.
(b) Display the department wise total salary.
(c) Display the department wise average salary.
(d) Display the department wise total and average salary.
(e) Display the department wise maximum and minimum salary.
(f) Display the department and job wise maximum salary.

Solution: (a) For data: dfE = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Emp.csv', index_col=0)
Or
Emp = { 'Empno': [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],
        'Name': ['Sunita Sharma', 'Ashok Singhal', 'Sumit Avasti', 'Jyoti Lamba', 'Martin S.',
                 'Binod Goel', 'Chetan Gupta', 'Sudhir Rawat', 'Kavita Sharma', 'Tushar Tiwari',
                 'Anand Rathi', 'Sumit Vats', 'Manoj Kaushik'],
        'Department': ['RESEARCH', 'SALES', 'SALES', 'RESEARCH', 'SALES', 'SALES', 'ACCOUNTS',
                       'RESEARCH', 'ACCOUNTS', 'SALES', 'OPERATIONS', 'RESEARCH', 'OPERATIONS'],
        'Salary': [45600, 43900, 27000, 45900, 32500, 45200, 36800, 37000, 42900, 49500,
                   41600, 47800, 43600],
        'Commission': [5600, 3900, 7000, 4900, 3500, 4200, 6800, 7000, 4900, 4500, 8200,
                       np.nan, np.nan],
        'Job': ['CLERK', 'SALESMAN', 'SALESMAN', 'MANAGER', 'SALESMAN', 'MANAGER',
                'MANAGER', 'ANALYST', 'CLERK', 'MANAGER', 'SR. MANAGER', 'SR. MANAGER', 'CLERK']}
dfE = pd.DataFrame(Emp, columns=['Empno', 'Name', 'Department', 'Salary', 'Commission', 'Job'])
(b) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc='sum')
(c) pd.pivot_table(dfE, index='Department', values='Salary')
    Or
    pd.pivot_table(dfE, index='Department', values='Salary', aggfunc='mean')
(d) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc=['sum', 'mean'])
(e) pd.pivot_table(dfE, index='Department', values='Salary', aggfunc=['max', 'min'])
(f) pd.pivot_table(dfE, index=['Department', 'Job'], values='Salary', aggfunc='max')
2.5 Sorting DataFrames

The data of a DataFrame can also be sorted by rows or columns or by their respective values. By default, sorting is done on row labels in ascending order. Pandas DataFrames have two useful sort functions:
• sort_values(): This function sorts a data frame in ascending or descending order of the passed column(s). The by parameter of DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
• sort_index(): This function allows the data to be sorted by rows (axis=0) or columns (axis=1); by means of ascending=False the order of sorting can be reversed.
Each of these functions comes with numerous options (parameters), like sorting the data frame in a specific order (ascending or descending), sorting in place, sorting with missing values, sorting by a specific algorithm and so on. Some common sorting options are given in the syntax:

DataFrame.sort_values(by=None, axis=0, ascending=True, inplace=False)
DataFrame.sort_index(axis=0, ascending=True, inplace=False)

Here,
• by : Name(s) of the column(s) to sort by (used with sort_values()).
• axis : The axis value 0 means sort by row labels and 1 means sort by column labels. The default axis value is 0.
• ascending : The default sorting is ascending (True). If the value is False, then it sorts the DataFrame in descending order.
• inplace : The default value is False. If you do not want a new DataFrame, then mention the value as True.
Suppose we have the following DataFrame (dfS) where we can apply different types of sorting:

>>> import pandas as pd



>>> Data = {'Name':['Aashna', 'Somya', 'Ronald', 'Jack', 'Raghu', 'Mathew',
                    'Nancy', 'Bhavya', 'Kumar', 'Aashna', 'Somya', 'Mathew'],
            'Age':[16, 15, 16, 17, 15, 16, 14, 16, 15, 17, 16, 16],
            'Score':[87, 64, 58, 74, 34, 77, 87, 64, 45, 68, 92, 93],
            'Grade':['A2', 'B2', 'C1', 'B1', 'D', 'B1', 'A2', 'B2', 'C2', 'B2', 'A1', 'A1']}
>>> dfS = pd.DataFrame(Data, columns=['Name', 'Age', 'Score', 'Grade'])
>>> dfS
      Name  Age  Score Grade
0   Aashna   16     87    A2
1    Somya   15     64    B2
2   Ronald   16     58    C1
3     Jack   17     74    B1
4    Raghu   15     34     D
5   Mathew   16     77    B1
6    Nancy   14     87    A2
7   Bhavya   16     64    B2
8    Kumar   15     45    C2
9   Aashna   17     68    B2
10   Somya   16     92    A1
11  Mathew   16     93    A1
Sort by Value
To sort a DataFrame by value, mention the column as an input argument to the sort_values() function. For example, we can sort by the values of the 'Name' column in the DataFrame dfS. The command is:
>>> dfN = dfS.sort_values(by='Name')                    # DataFrame sorted in ascending order by 'Name'
Or
>>> dfN = dfS.sort_values('Name')                       # DataFrame sorted in ascending order by 'Name'
Or
>>> dfN = dfS.sort_values(by='Name', ascending=True)    # DataFrame sorted by 'Name'
>>> dfN
It will produce the following output:
      Name  Age  Score Grade
0   Aashna   16     87    A2
9   Aashna   17     68    B2
7   Bhavya   16     64    B2
3     Jack   17     74    B1
8    Kumar   15     45    C2
5   Mathew   16     77    B1
11  Mathew   16     93    A1
6    Nancy   14     87    A2
4    Raghu   15     34     D
2   Ronald   16     58    C1
1    Somya   15     64    B2
10   Somya   16     92    A1
Note that by default sort_values() sorts and returns a new DataFrame dfN. The new sorted DataFrame dfN is in ascending order (small values first and large values last).

Sorting in Descending Order by Value
To sort a DataFrame in descending order, we can use the argument ascending=False. In this example, we sort the DataFrame by the 'Score' column with ascending=False:
>>> dfA = dfN.sort_values('Score', ascending=False)
>>> dfA
It will produce the following output:
      Name  Age  Score Grade
11  Mathew   16     93    A1
10   Somya   16     92    A1
0   Aashna   16     87    A2
6    Nancy   14     87    A2
5   Mathew   16     77    B1
3     Jack   17     74    B1
9   Aashna   17     68    B2
1    Somya   15     64    B2
7   Bhavya   16     64    B2
2   Ronald   16     58    C1
8    Kumar   15     45    C2
4    Raghu   15     34     D

Sorting Based on Multiple Columns
We can specify the columns we want to sort by as a list in the argument to the sort_values() function. Note that when sorting by multiple columns, pandas sort_values() uses the first column first and the second column next. Let us sort the DataFrame dfS on multiple columns (Name, Score):
>>> dfS.sort_values(['Name', 'Score'], ascending=False)
It will produce the following output:
      Name  Age  Score Grade
10   Somya   16     92    A1
1    Somya   15     64    B2
2   Ronald   16     58    C1
4    Raghu   15     34     D
6    Nancy   14     87    A2
11  Mathew   16     93    A1
5   Mathew   16     77    B1
8    Kumar   15     45    C2
3     Jack   17     74    B1
7   Bhavya   16     64    B2
0   Aashna   16     87    A2
9   Aashna   17     68    B2

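The ascending argument can also be a list with one flag per sort column. A small sketch with invented rows: Name ascending, but Score descending within each name.

```python
import pandas as pd

# Hypothetical three-row frame.
d = pd.DataFrame({"Name": ["Somya", "Aashna", "Somya"],
                  "Score": [64, 87, 92]})

# One ascending flag per column: Name A-to-Z, Score high-to-low within a name.
d3 = d.sort_values(["Name", "Score"], ascending=[True, False])
print(d3["Score"].tolist())   # [87, 92, 64]
```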
Sort by Index
In the previous section, we created a DataFrame called dfN. Let us sort the DataFrame dfN by index.
>>> dfN = dfN.sort_index()
>>> dfN
It will produce the following output:
      Name  Age  Score Grade
0   Aashna   16     87    A2
1    Somya   15     64    B2
2   Ronald   16     58    C1
3     Jack   17     74    B1
4    Raghu   15     34     D
5   Mathew   16     77    B1
6    Nancy   14     87    A2
7   Bhavya   16     64    B2
8    Kumar   15     45    C2
9   Aashna   17     68    B2
10   Somya   16     92    A1
11  Mathew   16     93    A1
Sort by Index in Descending Order
Let us sort the DataFrame dfN by index in descending order:
>>> dfN.sort_index(ascending=False)
It will produce the following output:
      Name  Age  Score Grade
11  Mathew   16     93    A1
10   Somya   16     92    A1
9   Aashna   17     68    B2
8    Kumar   15     45    C2
7   Bhavya   16     64    B2
6    Nancy   14     87    A2
5   Mathew   16     77    B1
4    Raghu   15     34     D
3     Jack   17     74    B1
2   Ronald   16     58    C1
1    Somya   15     64    B2
0   Aashna   16     87    A2

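sort_index() can also sort the column labels themselves with axis=1, and both sort functions accept inplace=True. A sketch on a small invented frame:

```python
import pandas as pd

# Hypothetical frame with unordered column labels.
d = pd.DataFrame({"Score": [87, 64], "Age": [16, 15], "Name": ["Aashna", "Somya"]})

d2 = d.sort_index(axis=1)             # sort the column labels alphabetically
print(d2.columns.tolist())            # ['Age', 'Name', 'Score']

d.sort_values("Score", inplace=True)  # sorts d itself; the call returns None
print(d["Name"].tolist())             # lowest Score first
```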
Points to Remember
1. The pivot() method creates a new table, whose row and column indices are the unique values of the respective parameters.
2. The pivot() method is used for pivoting without aggregation.
3. Stacking a DataFrame means moving (also rotating or pivoting) the innermost column index to become the innermost row index.
4. A pivot table is a powerful tool to calculate, summarize, and analyze data that lets you see comparisons, patterns, and trends in your data.
5. The pandas sort_values() function sorts a data frame in ascending or descending order of the passed column.
6. Using the sort_index() method, by passing the axis argument and the order of sorting, a DataFrame can be sorted by its labels.

SOLVED EXERCISES
1. What is the pivot() method?
Ans. The pivot() method reshapes data (produces a "pivot" table) based on column values. The pivot() method takes a maximum of 3 arguments with the following names: index, columns, and values.
2. Define the pivot_table() method.
Ans. The pivot_table() method creates a spreadsheet-style pivot table as a DataFrame. It allows transforming data columns into rows and rows into columns.
3. What is the difference between pivot() and pivot_table()?
Ans. The differences are:
• The index column of the pivot() method does not accept duplicate rows with values for the specified columns, whereas pivot_table() accepts duplicate values for an index/column pair.
• pivot() allows both numeric and string types as "values=", whereas pivot_table() only allows numerical types as "values=".
• pivot() is used for pivoting without aggregation whereas pivot_table() works with duplicate values by aggregating them.

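The first difference above can be seen directly. A sketch with an invented duplicate ("A", "x") index/column pair: pivot() raises an error, while pivot_table() quietly aggregates.

```python
import pandas as pd

# Hypothetical frame with a duplicate index/column pair ("A", "x").
d = pd.DataFrame({"k": ["A", "A", "B"],
                  "c": ["x", "x", "x"],
                  "v": [1, 3, 5]})

try:
    d.pivot(index="k", columns="c", values="v")       # duplicates -> error
except ValueError as e:
    print("pivot() raised:", type(e).__name__)

pt = d.pivot_table(index="k", columns="c", values="v")  # mean of duplicates
print(pt.loc["A", "x"])
```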
4. A sample dataset is given with different columns as given below:
Itemno  ItemName     Color  Price
1       Ball Pen     Black   15.0
2       Pencil       Blue     5.5
3       Ball Pen     Green   10.5
4       Gel Pen      Green   11.0
5       Notebook     Red     15.5
6       Ball Pen     Green   11.5
7       Highlighter  Blue     8.5
8       Gel Pen      Red     12.5
9       P Marker     Blue     8.6
10      Pencil       Green   11.5
11      Ball Pen     Green   10.5
Answer the following questions (assume that the DataFrame name is df):
(a) Using the above table, create a DataFrame called df.
(b) Create a pivot table to display item name wise items from DataFrame df.
(c) Create a table to display item name and item number wise price for all rows.
(d) Create a table to display item name and item number wise price for all colors.
(e) Create a table to display item name wise sum of values for each color.
(f) Create a table to display item name and item number wise total price of all color items along the rows and columns.
(g) Create a table to display item name wise total price of all color items.
Ans. (a) For data:
df = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Items.csv')
Or
data = {'Itemno': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        'ItemName': ['Ball Pen', 'Pencil', 'Ball Pen', 'Gel Pen', 'Notebook', 'Ball Pen', 'Highlighter',
                     'Gel Pen', 'P Marker', 'Pencil', 'Ball Pen'],
        'Color': ['Black', 'Blue', 'Green', 'Green', 'Red', 'Green', 'Blue', 'Red', 'Blue', 'Green', 'Green'],
        'Price': [15.0, 5.50, 10.50, 11.0, 15.50, 11.50, 8.50, 12.50, 8.60, 11.50, 10.50]}
df = pd.DataFrame(data, columns=['Itemno', 'ItemName', 'Color', 'Price'])
(b) df.pivot_table(index="ItemName", columns="Itemno", fill_value="")
(c) df.pivot_table(index=["ItemName", "Itemno"], fill_value="")
(d) df.pivot_table(index=["ItemName", "Itemno"], columns="Color", fill_value="")
(e) df.pivot_table(index="ItemName", columns="Color", fill_value="", aggfunc='sum')
(f) df.pivot_table(index=["ItemName", "Itemno"], columns="Color", fill_value="", aggfunc=sum, margins=True)
(g) df.pivot_table(index="ItemName", columns="Color", values="Price", fill_value="", aggfunc=sum, margins=True)
5. A DataFrame contains the following information:
    Customer                  Region  Order_Date    Sales      Month  Year
0   K Books Distributers      East    2016-04-13  1256000      April  2016
1   GBC P House               South   2017-08-23  1359000     August  2017
2   S Books Store             North   2016-10-11  1670000    October  2016
3   TM Books                  West    2019-08-25  1490000     August  2019
4   IND Books Distributors    North   2017-09-04  1560000  September  2017
5   Aniket Pustak             West    2018-05-17  1180000        May  2018
6   M Pustak Bhandar          South   2018-11-28  2100000   November  2018
7   BOOKWELL Distributors     North   2017-01-22  1630000    January  2017
8   Jatin Book Agency         West    2016-12-21  1380000   December  2016
9   New India Agency          South   2018-09-12  1730000  September  2018
10  Libra Books Distributors  East    2016-10-04  1210000    October  2016
Answer the following questions:
(a) Find the region wise average sales for each year.
(b) Find the year wise average sales for each region.
(c) Find the year wise total sales for each region.
(d) Find the year wise maximum and minimum sales for each region.
(e) Create a pivot table to count the number of customers in each region.
(f) What will the following command produce?
    pv5 = df.pivot_table(values='Sales', columns='Region', index='Year')
(g) What will be the output of the following?
    pv1 = dfN.pivot_table(index='Region', columns='Year', aggfunc='mean')
    print (pv1.stack())
Ans. For data:
dfN = pd.read_csv('E:/IPSource_XII/IPXIIChap02/Distributors.csv', index_col=0)
(a) pv1 = dfN.pivot_table(index='Region', columns='Year', aggfunc='mean')
(b) pv2 = dfN.pivot_table(index='Year', columns='Region', aggfunc='mean')
(c) pv3 = dfN.pivot_table(index='Year', values='Sales', columns='Region', aggfunc='sum')
(d) pv4 = dfN.pivot_table(index='Year', values='Sales', columns='Region', aggfunc=[max, min])
(e) dfN.pivot_table(index='Region', aggfunc='count')
(f) This will display the year wise average sales of each region.
(g) The output is:
                  Sales
Region  Year
East    2016  1233000.0
North   2016  1670000.0
        2017  1595000.0
South   2017  1359000.0
        2018  1915000.0
West    2016  1380000.0
        2018  1180000.0
        2019  1490000.0
6. A data set gives the sales of two products in four different regions.
Region     Year  Product          Units Sold
Southeast  2018  Air Purifier             87
Northwest  2019  Air conditioner         165
Southwest  2019  Air Purifier            122
Northeast  2019  Air conditioner         132
Southeast  2018  Air conditioner          98
Northeast  2019  Air Purifier            120
Northwest  2018  Air Purifier            137
Southeast  2019  Air conditioner          83
Northwest  2018  Air Purifier            128
Northwest  2019  Air conditioner         149
Southwest  2018  Air Purifier            167
Northeast  2018  Air conditioner         139
Using the above data set, answer the following:
(a) Create a pivot table to summarize the data into region and product wise total sales.
(b) Print the summary report.
Ans. For data:
dfP = pd.read_csv('E:/IPSource_XII/IPXIIChap02/RSales.csv')
(a) pv = pd.pivot_table(dfP, index=['Region', 'Product'], values='Units Sold', aggfunc='sum')
(b) print (pv)
                            Units Sold
Region     Product
Northeast  Air Purifier            120
           Air conditioner         271
Northwest  Air Purifier            265
           Air conditioner         314
Southeast  Air Purifier             87
           Air conditioner         181
Southwest  Air Purifier            289
7. A data set is given with 10 employees in different cities as given below:
Name    Position    City     Age  Sex
Aakash  Manager     Delhi     35  Male
Arpita  Programmer  Mumbai    37  Female
Sanket  Manager     Kolkata   33  Male
Prince  Programmer  Mumbai    40  Male
Sahil   Manager     Kolkata   29  Male
Sakshi  Programmer  Kolkata   27  Female
Dinesh  Programmer  Delhi     31  Male
Akshya  Manager     Delhi     26  Male
Megha   Manager     Mumbai    30  Female
Hemant  Manager     Kolkata   28  Male
Using the above data set, answer the following:
(a) Create a pivot table to print the position wise average age for each city.
(b) Create a pivot table to print the position wise average age by City and Sex.
(c) Create a pivot table to print the average age for each position and sex.
(d) What will be the output of the following:
    import numpy as np
    df.pivot_table(index='Position', aggfunc={'Age': np.mean})
Ans. For data:
df = pd.read_csv('E:/IPSource_XII/IPXIIChap02/EJob.csv')
(a) print (df.pivot_table(index='Position', columns='City', values='Age'))
(b) print (df.pivot_table(index='Position', columns=['City','Sex'], values='Age'))
(c) print (df.pivot_table(index=['Position','Sex'], columns='City', values='Age'))
(d)                   Age
    Position
    Manager     30.166667
    Programmer  33.750000
REVIEW QUESTIONS
1. Why is the pivot() method more restrictive than the pivot_table() method in pandas?
2. State one difference between stacking and unstacking.
3. A DataFrame contains the following data:
   Name           Qualification  Experience
0  Ms. Mittal     Masters                 8
1  Minu Arora     Graduate               11
2  Sharmila Kaur  Post Graduate           7
3  Sangeeta Vats  Masters                 9
4  Ramesh Kumar   Graduate                6
5  Jatin Ghosh    Post Graduate           8
6  Yash Sharma    Masters                10
Write the command for the following (assume that the DataFrame name is dfT):
(a) Find the average experience for each qualification.
(b) Find the total experience for each qualification.
(c) Find the average experience for each qualification and name.
(d) What will the following display:
    pva = pd.pivot_table(dfT, index ='Qualification', columns ='Qualification',
                         values ='Experience', aggfunc = np.sum)
    pva.stack()
4. A sample dataset is given with different columns as given below:
Item_ID  ItemName           Manufacturer    Price  CustomerName  City
PC01     Personal Computer  HCL India       42000  N Roy         Delhi
LC05     Laptop             HP USA          55000  H Singh       Mumbai
PC03     Personal Computer  Dell USA        32000  R Pandey      Delhi
PC06     Personal Computer  Zenith USA      37000  C Sharma      Chennai
LC03     Laptop             Dell USA        57000  K Agarwal     Bengaluru
AL03     Monitor            HP USA           9800  S C Gandhi    Delhi
CS03     Hard Disk          Dell USA         5400  B S Arora     Mumbai
PR06     Motherboard        Intel USA       17500  A K Rawat     Delhi
BA03     UPS                Microtek India   4300  C K Naidu     Chennai
MC01     Monitor            HCL India        6800  P N Ghosh     Bengaluru
Write the command for the following (assume that the DataFrame name is dfA):
(a) Create a city wise customer table.
(b) Create a pivot table for manufacturer wise item names and their price.
(c) Arrange the DataFrame in ascending order of CustomerName.
(d) Arrange the DataFrame in ascending order of City and Price.
5. A sample dataset is given with four quarters' sales data for five employees:
Name of Employee   Sales  Quarter  State
R Sahay           125600        1  Delhi
George K          235600        1  Tamil Nadu
Jaya Priya        213400        1  Kerala
Manila Sahai      189000        1  Haryana
Ryma Sen          456000        1  West Bengal
Manila Sahai      172000        2  Haryana
Jaya Priya        201400        2  Kerala
George K          225400        2  Tamil Nadu
R Sahay           140600        2  Delhi
Ryma Sen          389000        2  West Bengal
Jaya Priya        242100        3  Kerala
George K          262000        3  Tamil Nadu
Ryma Sen          339000        3  West Bengal
Manila Sahai      228000        3  Haryana
R Sahay           193100        3  Delhi
George K          292000        4  Tamil Nadu
Manila Sahai      278000        4  Haryana
Jaya Priya        282100        4  Kerala
Ryma Sen          369000        4  West Bengal
R Sahay           233100        4  Delhi
Write the command for the following (assume that the DataFrame name is dfQ):
(a) Find the total sales of each employee.
(b) Find the total sales by state.
(c) Find the total sales both employee wise and state wise.
(d) Find the maximum individual sale by state.
(e) Find the mean, median and minimum sales by state.
Chapter – 3
Aggregation/Descriptive Statistics in Pandas
3.1 Introduction
Python is a great language for data analysis. Python pandas supports a number of data aggregation functions to analyze data. An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number summarizes a large dataset. To demonstrate the aggregation functions, let us create a DataFrame from the Student.csv file with the following data:
>>> import pandas as pd
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Student.csv')
>>> df
  Student_Name   Age Gender  Test1  Test2  Test3
0       Aashna  16.0      F    7.6    8.5    7.6
1          NaN   NaN    NaN    NaN    NaN    NaN
2         Jack  16.0      M    8.6    NaN    NaN
3        Somya  17.0      F    6.5    7.9    8.8
4        Raghu  15.0      M    6.8    7.7    7.9
5       Mathew  16.0      M    9.2    9.0    NaN
6        Nancy  14.0      F    6.8    8.7    8.8
If you carefully observe the above DataFrame (df), you will find a number of column values are NaN. Remember that the NaN value represents a null value or None. In this chapter, we use the above DataFrame (df) to demonstrate the aggregation/descriptive statistics functions.

3.2 Data Aggregation
In the last chapter, we used a number of aggregate functions with pivot_table(). Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value. Data aggregation accepts multivalued functions, which in turn return only a single value. The dataset is either a Series or a DataFrame. Table 3.1 below shows the most commonly used built-in Pandas aggregation functions.


Table 3.1 Built-in pandas aggregate functions.
Aggregation   Description
count()       Total number of items
sum()         Calculate the sum of a given set of numbers
mean()        Calculate the arithmetic mean or average of a given set of numbers
median()      Calculate the median or middle value of a given set of numbers
mode()        Calculate the mode or most repeated value of a given set of numbers
max()         Find the maximum value of a given set of numbers
min()         Find the minimum value of a given set of numbers
std()         Calculate the standard deviation of a given set of numbers
var()         Calculate the variance of a given set of numbers

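Each function in Table 3.1 collapses a set of numbers into one value. A quick sketch on a small series:

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20, 25])

print(s.count(), s.sum())    # 5 75
print(s.mean(), s.median())  # 15.0 15.0
print(s.min(), s.max())      # 5 25
print(s.var())               # 62.5 (sample variance)
print(round(s.std(), 2))     # 7.91 (square root of the variance)
```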
DataFrame.count() Function
Pandas DataFrame.count() is used to count the number of non-null observations across the given axis of a DataFrame or a Series. It works with non-floating type data as well. It returns a Series (or a DataFrame if level is specified). The syntax is:
DataFrame.count(axis=0, level=None, numeric_only=False)
Here,
• axis. {0 or 'index', 1 or 'columns'}, default 0.
  - axis=0 [will give you the calculated value per column]
  - axis=1 [will give you the calculated value per row]
• level. If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame.
• numeric_only. Include only float, int, boolean data.
Example 1 Count the number of observations in a series S given below:
>>> S = pd.Series([5, 10, 15, 20, 25])    # A numeric series
>>> S.count()
5
Example 2 Count the number of non-null values in each column of DataFrame df.
>>> df.count()
Student_Name    6
Age             6
Gender          6
Test1           6
Test2           5
Test3           4
dtype: int64
As you have seen, the NaN values are not counted in the above series output.

Example 3 Count the number of non-null observations in column Age of DataFrame df.
>>> df.Age.count()
Or
>>> df['Age'].count()
which prints: 6
Example 4 Count the number of non-null values across the columns (one count per row) of DataFrame df.
>>> df.count(axis=1)
Or
>>> df.count(axis='columns')
which will print the following:
0    6
1    0
2    4
3    6
4    6
5    5
6    6
dtype: int64
DataFrame.sum() Function
The pandas DataFrame.sum() function is used to add all of the values in a particular column of a DataFrame (or a Series). For a DataFrame, by default, axis is index (axis=0).
Example 1 Find the sum of all the values of a series S given below:
>>> S = pd.Series([5, 10, 15, 20, 25])    # A numeric series
>>> S.sum()
75
Example 2 Find the sum of the non-null values in each numeric column of DataFrame df.
>>> df.sum()                              # Find the sum of each numeric column
Age      94.0
Test1    45.5
Test2    41.8
Test3    33.1
dtype: float64
Notice that each individual numeric column is added individually.
>>> df.Test1.sum()                        # sum of the values in the Test1 column
45.5
Example 3 Find the sum of the non-null values across the columns (one total per row) of DataFrame df.
>>> df.sum(axis=1)

0    39.7
1     0.0
2    24.6
3    40.2
4    37.4
5    34.2
6    38.3
dtype: float64
Example 4 Find the sum of Test1, Test2, and Test3 across the columns for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].sum(axis=1)
0    23.7
1     0.0
2     8.6
3    23.2
4    22.4
5    18.2
6    24.3
dtype: float64
DataFrame.mean() Function
This function is used to find the arithmetic mean of a set of data, which is obtained by taking the sum of the data and then dividing the sum by the total number of values in the set. A mean is commonly referred to as an average. The mean() function in Python pandas is used to calculate the arithmetic mean of a given Series, or the mean of a DataFrame, of a column, or of rows. The syntax is:
DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None)
The parameters of mean() are the same as those of the count() function, plus the skipna parameter, which is used to exclude NA/null values when computing the result. Its default value is True.
Example 1 Find the arithmetic mean of a series S given below:
>>> S = pd.Series([5, 10, 15, 20, 25])    # A numeric series
>>> S.mean()
15.0
Example 2 Find the mean of the non-null values in each numeric column of DataFrame df.
>>> df.mean()                             # Find the average of numeric columns.
Age      15.666667
Test1     7.583333
Test2     8.360000
Test3     8.275000
dtype: float64

Example 3 Calculate the mean of specific numeric columns like (Test1, Test2, Test3) for DataFrame
df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean()
Test1 7.583333

d
ite
Test2 8.360000
Test3 8.275000

m
dtype: float64

Li
Example 4 Calculate the mean of specific numeric columns (Test1, Test2, Test3) row-wise for
DataFrame df. Aslo display the result in 2 decimal format.

e
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean(axis=1).round(decimals=2)

at
0 7.90

iv
1 NaN
2 8.60

Pr
3 7.73
4 7.47

a
5 9.10
6 8.10
di
In
dtype: float64
Example 5 Calculate the mean of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame df without skipping null values (skipna=False). Also display the result rounded to 2 decimal places.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mean(axis=1, skipna=False).round(decimals=2)
0    7.90
1     NaN
2     NaN
3    7.73
4    7.47
5     NaN
6    8.10
dtype: float64

From the above result, notice that with skipna=False, any row which contains a NaN value produces NaN when the mean operation is performed.

DataFrame.median() Function
The median of a set of data is the middlemost number in the set; it is the number that is halfway into the set. In Python pandas, the median() function is used to calculate the median or middle value of a given set of numbers. The dataset can be a series, a data frame, a column or rows. To find the median, the dataset should first be arranged in order from least to greatest, but in Python pandas, median() takes care of this itself. The syntax is:

DataFrame.median(axis=None, skipna=None, level=None, numeric_only=None)


68 Saraswati Informatics Practices XII

For example, let us see the series given below:

5 10 15 20 25

The above sequence is in sorted order and middle value is 15.

Similarly, look at another set of numbers given below:

22 5 10 4 17 15 20 25 18 11

which is an unsorted list of 10 numbers; ordering the data from least to greatest, we get:

4 5 10 11 15 17 18 20 22 25

Since there is an even number of items in the data set, we compute the median by taking the mean of the
two middlemost numbers, i.e., (15 + 17) / 2 = 16.

Example 1 Find the middlemost number in the series S given below:

>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.median()
15.0
Example 2 Find the middlemost number in an unsorted series X given below:

>>> X = pd.Series([22, 5, 10, 4, 17, 15, 20, 25, 18, 11])   # A numeric series


>>> X.median() # prints: 16.0

Note. If you want to sort the series, you can use the following command:
>>> X = X.sort_values()

>>> X.median()
16.0

Example 3 Find the median of DataFrame df.



>>> df.median()
Age      16.00
Test1     7.20
Test2     8.50
Test3     8.35
dtype: float64

Example 4 Calculate the median of Test1 column of DataFrame df.


>>> df['Test1'].median()   # median of specific column
7.199999999999999
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].median()   # median of columns
Test1    7.20
Test2    8.50
Test3    8.35
dtype: float64

Example 5 Calculate the median of columns of Test1, Test2, and Test3 row-wise for the DataFrame
df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].median(axis=1)
which will display the following:

0    7.6
1    NaN
2    8.6
3    7.9
4    7.7
5    9.1
6    8.7
dtype: float64

DataFrame.mode() Function

The mode of a set of data is the value in the set that occurs most often. The mode() function in Python pandas is used to calculate the mode, or most repeated value, of a given set of numbers. The numbers can be in a series, a data frame, a column or a row. If more than one number occurs most often, then all such numbers are modes. The syntax is:
DataFrame.mode(axis=0, numeric_only=False, dropna=True)
se

Example 1 Find the number that occurs most often in the series X given below:
ou

>>> X = pd.Series([22, 5, 22, 4, 7, 5, 22])
>>> X.mode()
0    22
dtype: int64

Example 2 Find the number that occurs most often in the series Y given below:
>>> Y = pd.Series([22, 5, 22, 4, 7, 5, 22, 5])
>>> Y.mode()
0     5
1    22
dtype: int64
Example 3 Calculate the mode of DataFrame df.

>>> df.mode()
  Student_Name   Age Gender  Test1  Test2  Test3
0       Aashna  16.0      F    6.8    7.7    8.8
1         Jack   NaN      M    NaN    7.9    NaN
2       Mathew   NaN    NaN    NaN    8.5    NaN
3        Nancy   NaN    NaN    NaN    8.7    NaN
4        Raghu   NaN    NaN    NaN    9.0    NaN
5        Somya   NaN    NaN    NaN    NaN    NaN

Example 4 Find the most repeated value for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].mode()
0 16.0
dtype: float64

Example 5 Find the repeated values of specific columns (Test1, Test2, Test3) of DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mode()
   Test1  Test2  Test3
0    6.8    7.7    8.8
1    NaN    7.9    NaN
2    NaN    8.5    NaN
3    NaN    8.7    NaN
4    NaN    9.0    NaN

Example 6 Find the repeated values of specific columns (Test1, Test2, Test3) row-wise for DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].mode(axis=1)
     0    1    2
0  7.6  NaN  NaN
1  NaN  NaN  NaN
2  8.6  NaN  NaN
3  6.5  7.9  8.8
4  6.8  7.7  7.9
5  9.0  9.2  NaN
6  6.8  8.7  8.8



DataFrame.max() and DataFrame.min() Functions



The max() function finds the maximum value from a column of a DataFrame or a Series. The min() function finds the minimum value from a column of a DataFrame or a Series. The syntax of the max() and min() functions are:
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None)

DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None)


Example 1 Find the maximum value of series S given below:

>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.max()
25
Example 2 Find the minimum value of series S given below:

>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.min()
5
Example 3 Find the maximum values of DataFrame df.
>>> df.max()   # get the maximum values of all numeric columns
Age      17.0
Test1     9.2
Test2     9.0
Test3     8.8
dtype: float64
Example 4 Find the minimum values of DataFrame df.

>>> df.min()   # get the minimum values of all numeric columns
Age      14.0
Test1     6.5
Test2     7.7
Test3     7.6
dtype: float64

Example 5 Find the maximum value for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].max()

17.0
Example 6 Find the minimum age value of DataFrame df.
>>> df['Age'].min()
14.0

Example 7 Find the maximum values of specific numeric columns (Test1, Test2, Test3) row-wise for
DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].max(axis=1)
0    8.5
1    NaN
2    8.6
3    8.8
4    7.9
5    9.2
6    8.8
dtype: float64

Example 8 Find the minimum values of specific numeric columns (Test1, Test2, Test3) row-wise for
DataFrame df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].min(axis=1)
0    7.6
1    NaN
2    8.6
3    6.5
4    6.8
5    9.0
6    6.8
dtype: float64

DataFrame.std() Function
This function is used to calculate the standard deviation of a given set of numbers. The standard deviation can be of a series, a data frame, a column or a row. The syntax is:

DataFrame.std(axis=None, skipna=None, level=None, numeric_only=None)
Example 1 Find the standard deviation of series S given below:

>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.std()
7.905694150420948

Example 2 Find the standard deviation of numeric columns of DataFrame df.

at
>>> df.std()   # get the standard deviation of all numeric columns
Age      1.032796
Test1    1.099848
Test2    0.545894
Test3    0.618466
dtype: float64
Example 3 Find the standard deviation for a specific column ‘Age’ of DataFrame df.

>>> df['Age'].std()
1.0327955589886446
>>> df['Age'].std().round(decimals=2)
1.03

Example 4 Find the standard deviation of specific numeric columns (Test1, Test2, Test3) row-wise for
DataFrame df.

>>> df.loc[:, ['Test1', 'Test2', 'Test3']].std(axis=1)
0    0.519615
1         NaN
2         NaN
3    1.159023
4    0.585947
5    0.141421
6    1.126943
dtype: float64

DataFrame.var() Function

This function is used to calculate variance of a given set of numbers. The variance can be of a series, a data
frame, a column or row. The syntax is:

DataFrame.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None)


By default the sample variance is calculated, i.e., the divisor is N – 1, where N is the number of observations (an unbiased estimate). This can be changed using the ddof argument.
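As a small sketch of the ddof argument, using the same series as in the examples:

```python
import pandas as pd

S = pd.Series([5, 10, 15, 20, 25])

# Sample variance (default ddof=1): sum of squared deviations / (N - 1)
print(S.var())         # 62.5

# Population variance (ddof=0): sum of squared deviations / N
print(S.var(ddof=0))   # 50.0
```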

Example 1 Find the variance of series S given below:


>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.var()
62.5

Example 2 Find the variance of numeric columns of DataFrame df.

>>> df.var()   # get the variance of all numeric columns
Age      1.066667
Test1    1.209667
Test2    0.298000
Test3    0.382500
dtype: float64

Example 3 Find the variance for a specific column ‘Age’ of DataFrame df.
>>> df['Age'].var()
1.0666666666666669

Example 4 Find the variance of specific numeric columns (Test1, Test2, Test3) row-wise for DataFrame
df.
>>> df.loc[:, ['Test1', 'Test2', 'Test3']].var(axis=1)
0    0.270000
1         NaN
2         NaN
3    1.343333
4    0.343333
5    0.020000
6    1.270000
dtype: float64

Example 5 State-wise sales value of an item is given below:



State           Sales
Goa            650000
Delhi          692400
Odisha         750000
Haryana        867000
Bihar          920000
Kerala         939000
Tamil Nadu    1015000
West Bengal   1553000
Maharashtra   2176000

Write commands for the following (Assume that the DataFrame name is dfA):
(a) Count the number of observations in the DataFrame.
(b) Count the number of observations in column State of the DataFrame.
(c) Find the sum of the non-null values of the Sales column of the DataFrame.
(d) Calculate the mean of the Sales column in the DataFrame.
(e) Add a Commission column (i.e., 4% of Sales) into the DataFrame dfA.
    Commission = Sales * 0.04
(f) Calculate the mean of all numeric columns in the DataFrame.
(g) Calculate the median of the Sales column.
(h) Find the maximum sales and commission values.
(i) Find the minimum sales and commission values.
(j) Find the standard deviation of commission.
Solution For data: dfA = pd.read_csv('E:/IPSource_XII/IPXIIChap03/State.csv')

(a) dfA.count()
(b) dfA.State.count()
(c) dfA.Sales.sum()

(d) dfA.Sales.mean()
(e) dfA['Commission'] = dfA['Sales']*.04
(f) dfA.mean().round(decimals=2)
(g) dfA['Sales'].median()

(h) dfA.max()
(i) dfA.min()

(j) dfA.Commission.std()

3.3 Quartiles, Quantiles and Percentiles with Pandas



In statistics, the three commonly used terms are quartile, quantile and percentile. All these produce approximate results. These terms are used in different statistical calculations over lists of numbers. The differences are given below:



• 0 quartile = 0 quantile = 0 percentile



• 1 quartile = 0.25 quantile = 25 percentile



• 2 quartile = 0.5 quantile = 50 percentile (median)


• 3 quartile = 0.75 quantile = 75 percentile
• 4 quartile = 1 quantile = 100 percentile
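This correspondence can be checked directly in pandas; the 0.5 quantile, the 50th percentile and the median all give the same value (a minimal sketch):

```python
import numpy as np
import pandas as pd

S = pd.Series([5, 10, 15, 20, 25])

print(S.quantile(0.5))       # 15.0 -> 2nd quartile
print(np.percentile(S, 50))  # 15.0 -> 50th percentile
print(S.median())            # 15.0 -> median
```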

Quartiles

Quartiles in statistics are values that divide your data into quarters. The four quarters that divide a data set

into quartiles are:


• The lowest 25% of numbers.
• The next lowest 25% of numbers (up to the median).
• The second highest 25% of numbers (above the median).
• The highest 25% of numbers.

For example, let us see a number line to represent the quartiles:


1st Quarter 2nd Quarter 3rd Quarter 4th Quarter
–4 –3 –2 –1 0 1 2 3 4

Let us find the quartiles for the following numbers:

3, 6, 21, 11, 18, 13, 8, 15, 36, 29, 32, 16

To find the quartiles:

Step 1: Put the numbers in order: 3, 6, 8, 11, 13, 15, 16, 18, 21, 29, 32, 36
Step 2: Total numbers in the list is 12.

Step 3: Divide the total number by 4 to cut the list of numbers into quarters, i.e., 12 / 4 = 3

Step 4: There are 12 numbers in the list, so you would have 3 numbers in each quartile, and these are:

1st Quarter 2nd Quarter 3rd Quarter 4th Quarter
3, 6, 8, 11, 13, 15, 16, 18, 21, 29, 32, 36
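pandas can perform this quartile split itself with pd.qcut(), which cuts a dataset into equal-sized quantile bins (the Q1 to Q4 labels below are just illustrative names):

```python
import pandas as pd

data = pd.Series([3, 6, 21, 11, 18, 13, 8, 15, 36, 29, 32, 16])

# Cut the values into 4 quantile-based bins, i.e., quartiles
quarters = pd.qcut(data, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quarters.value_counts().sort_index())   # 12 / 4 = 3 values per quarter
```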

Percentile
Percentile is a number where a certain percentage of scores fall below that number. Percentiles are commonly
used to report scores in tests, like the SAT, GATE and LSAT. For example, in an examination, if you score in

the 25th percentile, then 25% of test takers are below your score. The “25” is called the percentile rank.
Let us see 10 students, test scores ordered by Rank.

Score 35 39 45 55 63 67 75 82 86 95

Rank 1 2 3 4 5 6 7 8 9 10

To find the 25th percentile in the above list, let us calculate the rank at which the 25th percentile lies. The formula is:



Rank = (Percentile / 100) * (Number of items + 1)
     = (25 / 100) * (10 + 1)
     = 0.25 * 11
     = 2.75

Try this:
import numpy as np
X = np.array([35, 39, 45, 55, 63, 67, 75, 82, 86, 95])
P25 = np.percentile(X, 25)
print (P25)
returns: 47.5

Here, it is observed that there is no rank of 2.75. So, we must either round up or round down the rank. As 2.75 is closer to 3 than to 2, we round up to a rank of 3. This equals a score of 45 (or greater than 45) on this list (a rank of 3).


Let us find the 50th percentile. The 50th percentile corresponds to the second quartile or the median; its rank is (50 / 100) * 11 = 5.5. However, there is no rank of 5.5, so we round it up to 6. This equals a score of 67.
Similarly, to find the 75th percentile, we search in the third quartile. The rank works out to (75 / 100) * 11 = 8.25, which is rounded down to 8. This equals a score of 82.
You can also calculate your percentile by using the following formula, if you know your rank:
P = ((N – Your rank) / N) * 100

Here, P is the percentile and N is the total number of candidates who appeared for the exam. (N – Your rank) indicates the number of candidates who have scored less than you.
For example, 50 students are there in your class, and your rank is 3, then the percentile will be calculated
as:

P = ((50 – 3) / 50) * 100 = 94.0

Percentile rank is the proportion of values in a distribution where a particular value is greater than or

equal to it. For example, if a pupil is taller than or as tall as 79% of his classmates then the percentile rank
of his height is 79, i.e., he is in the 79th percentile of heights in his class.
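The rank-to-percentile formula above can be wrapped in a small helper function (percentile_rank is a hypothetical name used only for this sketch):

```python
def percentile_rank(total, rank):
    # P = ((N - Your rank) / N) * 100
    return (total - rank) / total * 100

# 50 students in the class, your rank is 3
print(percentile_rank(50, 3))   # 94.0
```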

Quantiles

The quantiles/percentiles/fractiles of a list of numbers are statistical values that partially illustrate the
distribution of numbers in the list. Quantiles are points in a distribution that relate to the rank order of

values in that distribution. For a sample, you can find any quantile by sorting the sample. The middle value

of the sorted sample (middle quantile, i.e., 50th percentile) is known as the median.
The median is a special case of a quantile: it is the 50% quantile. It is usually estimated as follows:
- from an odd number of values, take the middle value;

- from an even number, take the average of the two midmost values.

For example, in the given set of seven numbers:
15, 16, 18, 27, 29, 32, 36

27 is the median and is the 0.5 quantile value. If our values are:



15, 16, 18, 29, 32, 36


then the median is (18 + 29) / 2 = 23.5. This is just an interpolation value, because it lies between two

data points.

The limits are the minimum and maximum values. Any other locations between these points can be

described in terms of centiles/percentiles.



Centiles/percentiles are descriptions of quantiles relative to 100; so the 75th percentile (upper quartile)
is 75% or three quarters of the way up an ascending list of sorted values of a sample. The 25th percentile

(lower quartile) is one quarter of the way up this rank order.



Certain types of quantiles are used commonly enough to have specific names. Here is a list of these:

• The 2-quantile is called the median


• The 3 quantiles are called terciles
• The 4 quantiles are called quartiles

• The 5 quantiles are called quintiles


• The 6 quantiles are called sextiles

• The 7 quantiles are called septiles


• The 8 quantiles are called octiles

• The 10 quantiles are called deciles


• The 12 quantiles are called duodeciles
• The 20 quantiles are called vigintiles
• The 100 quantiles are called percentiles
• The 1000 quantiles are called permilles

Using quantile() Function in Pandas

The pandas quantile() function returns float or series values at the given quantile over the requested axis, similar to numpy.percentile. The q argument is the list of probabilities for which quantiles should be computed. For example,

• if percentile = 25, then it is the first quartile or lower quartile (LQ). The 0.25 quantile basically says that 25% of the observations in the data set lie below a given value.

• if percentile = 50, then it is in the second quartile or the median.
• if percentile = 75, then it is in third quartile or upper quartile (UQ).

The syntax is:

DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')

Here,

• q is float or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute.
− If q is an array, a DataFrame will be returned where the index is q, the columns are the

columns of self, and the values are the quantiles.

− If q is a float, a Series will be returned where the index is the columns of self and the values
are the quantiles.
• axis : {0 or 'index', 1 or 'columns'}, default 0. If 0 or 'index', the quantiles are computed for each column. If 1 or 'columns', they are computed for each row.

• numeric_only : boolean, default True. If False, the quantile of datetime and timedelta data will
be computed as well.

• interpolation : {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}. This optional parameter


specifies the interpolation method to use, when the desired quantile lies between two data
points i and j.
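A minimal sketch comparing the interpolation choices on a small series; the 0.25 quantile falls halfway between the values 16 and 18:

```python
import pandas as pd

S = pd.Series([15, 16, 18, 27, 29, 32, 36])

print(S.quantile(0.25))                          # 17.0 (linear, the default)
print(S.quantile(0.25, interpolation='lower'))   # 16
print(S.quantile(0.25, interpolation='higher'))  # 18
```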
Example 1 Find the quantiles of a series S with an odd number of values, given below:
>>> S = pd.Series([15, 16, 18, 27, 29, 32, 36])


>>> S.quantile(.25)
17.0
>>> S.quantile([.25, .5, .75])
0.25    17.0
0.50    27.0
0.75    30.5
dtype: float64

Try this:
import numpy as np
S = pd.Series([15, 16, 18, 27, 29, 32, 36])
np.percentile(S, 25)
Returns: 17.0
Example 2 Find the quantiles of a series P with an even number of values, given below:

>>> P = pd.Series([15, 16, 18, 29, 32, 36])
>>> P.quantile([.25, .5, .75])
0.25    16.50
0.50    23.50
0.75    31.25
dtype: float64

Try this:
import numpy as np
P = np.array([15, 16, 18, 29, 32, 36])
print (np.percentile(P, [25, 50, 75]))
which prints:
[16.5  23.5  31.25]

Example 3 Find the quantiles for the following series x:


>>> x = pd.Series([1, 10, 11, 100, 101, 110, 111])
>>> x.quantile([.25, .5, .75])
0.25     10.5
0.50    100.0
0.75    105.5
dtype: float64

Try this:
import numpy as np
x = [1, 10, 11, 100, 101, 110, 111]
print (np.percentile(x, [25, 50, 75]))
which prints:
[ 10.5 100.  105.5]

Example 4 Find the quantiles for the following series y:

>>> y = pd.Series([1, 2, 3, 4, 5, 6])
>>> y.quantile([.25, .5, .75])
0.25    2.25
0.50    3.50
0.75    4.75
dtype: float64

Try this:
import numpy as np
y = np.array([1, 2, 3, 4, 5, 6])
print (np.percentile(y, [25, 50, 75]))
which prints:
[2.25 3.5  4.75]

Example 5 Find the 0.25 quantile of DataFrame df.
>>> df.quantile(.25)
Age      15.250
Test1     6.800
Test2     7.900
Test3     7.825
Name: 0.25, dtype: float64

Example 6 Find the (.25, .5, .75) quantiles of DataFrame df.



>>> df.quantile([.25, .5, .75])
        Age  Test1  Test2  Test3
0.25  15.25   6.80    7.9  7.825
0.50  16.00   7.20    8.5  8.350
0.75  16.00   8.35    8.7  8.800

Example 7 Find the (0.05, 0.25, 0.5, 0.75, 0.95) quantiles of DataFrame df.
>>> quants = [0.05, 0.25, 0.5, 0.75, 0.95]
>>> q = df.quantile(quants)
>>> print (q)
        Age  Test1  Test2  Test3
0.05  14.25  6.575   7.74  7.645
0.25  15.25  6.800   7.90  7.825
0.50  16.00  7.200   8.50  8.350
0.75  16.00  8.350   8.70  8.800
0.95  16.75  9.050   8.94  8.800

3.4 Descriptive Statistics


Descriptive or summary statistics in Python pandas can be obtained by using the describe() function. The describe() function gives the mean, std and interquartile range (IQR) values. describe() is a handy function when you are working with numeric columns: you can use it to see a number of basic statistics about a column, such as the mean, min, max, and standard deviation.
• Generally describe() function excludes the character columns and gives summary statistics of

numeric columns.
• We need to add a variable named include=’all’ to get the summary statistics or descriptive

statistics of both numeric and character column.
• For describing a DataFrame, by default only numeric fields are returned.

The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. These functions are also included among the aggregation functions. They are:

• count(): Total number of items

• mean(): Calculate the arithmetic mean or average of a given set of numbers
• std(): Calculate standard deviation of a given set of numbers

• min(): Find the minimum value of a given set of numbers

• 25th Percentile
• 50th Percentile (Median)
• 75th Percentile
• max(): Find the maximum value of a given set of numbers

When these methods are called on a DataFrame, they are applied over each row/column as specified
and results collated into a Series. Missing values are ignored by default by these methods. Calling the
describe() function on categorical data returns summary information about the Series that includes the

• count of non-null values,


• the number of unique values,

• the mode of the data


• the frequency of the mode

The syntax of .describe() function is:



DataFrame.describe(percentiles=None, include=None, exclude=None)



Here,
• percentiles. The percentiles to include in the output. All should fall between 0 and 1. The default
is [.25, .50, .75], which returns the 25th, 50th, and 75th percentiles.

• include. It is used to pass necessary information regarding what columns need to be considered
for summarizing.

- ‘all’ : All columns of the input will be included in the output.


- object - Summarizes String columns

- The default is None, which means the result will include all numeric columns.
• exclude. Sometimes you do not need to include any column to describe, so mention the exclude
option with describe() function. The default is None.
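A minimal sketch of the include/exclude options on a small mixed-type DataFrame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'Marks': [450, 476, 426], 'Grade': ['A', 'A', 'B']})

# Default: only the numeric column Marks is summarized
print(d.describe())

# Exclude numeric columns to summarize only the string column Grade
print(d.describe(exclude=[np.number]))
```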
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and
upper percentiles. For object data (e.g., strings or timestamps), the result’s index will include count, unique,
top, and freq. The top is the most common value and chosen from among those with the highest count. The
freq is the most common value’s frequency. Timestamps also include the first and last items.
For example, let us implement this with two different series data: Numeric and String. For a numeric
series:

>>> S = pd.Series([5, 10, 15, 20, 25])   # A numeric series
>>> S.describe()
count     5.000000
mean     15.000000
std       7.905694
min       5.000000
25%      10.000000
50%      15.000000
75%      20.000000
max      25.000000
dtype: float64

For a String series:
>>> str1 = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'])   # A string series
>>> str1.describe()
count     6
unique    4
top       a
freq      2
dtype: object

dtype: object
i
at

Using describe() Function in Pandas DataFrame



Let us use the describe() function to summarize only the numeric fields of a DataFrame. To summarize the previous DataFrame (df):
>>> df.describe()
             Age     Test1     Test2     Test3
count   6.000000  6.000000  5.000000  4.000000
mean   15.666667  7.583333  8.360000  8.275000
std     1.032796  1.099848  0.545894  0.618466
min    14.000000  6.500000  7.700000  7.600000
25%    15.250000  6.800000  7.900000  7.825000
50%    16.000000  7.200000  8.500000  8.350000
75%    16.000000  8.350000  8.700000  8.800000
max    17.000000  9.200000  9.000000  8.800000

You can also analyse the descriptive statistics of a single column of a DataFrame as well. For example,
>>> df['Age'].describe()
count     6.000000
mean     15.666667
std       1.032796
min      14.000000
25%      15.250000
50%      16.000000
75%      16.000000
max      17.000000
Name: Age, dtype: float64

To summarize all columns of DataFrame (df) regardless of data type.

>>> df.describe(include='all')
       Student_Name        Age Gender     Test1     Test2     Test3
count             6   6.000000      6  6.000000  5.000000  4.000000
unique            6        NaN      2       NaN       NaN       NaN
top           Somya        NaN      M       NaN       NaN       NaN
freq              1        NaN      3       NaN       NaN       NaN
mean            NaN  15.666667    NaN  7.583333  8.360000  8.275000
std             NaN   1.032796    NaN  1.099848  0.545894  0.618466
min             NaN  14.000000    NaN  6.500000  7.700000  7.600000
25%             NaN  15.250000    NaN  6.800000  7.900000  7.825000
50%             NaN  16.000000    NaN  7.200000  8.500000  8.350000
75%             NaN  16.000000    NaN  8.350000  8.700000  8.800000
max             NaN  17.000000    NaN  9.200000  9.000000  8.800000

Example A DataFrame dfS is given with the following data for 10 students:



   Name             Total Marks
0  Amit Kumar               450
1  Vinamr Katyal            476
2  Aasha Goel               426
3  Naina Rawat              476
4  Dawn Sebastian           458
5  Riya Mary                446
6  Aashna Sharma            461
7  Piush Cocher             464
8  Om Berma                 476
9  Manali Sovani            452

Write commands for the following:

(a) Find the default quantile of the DataFrame.

(b) Find the [.25, .50, .75] quantiles of the DataFrame.
(c) Write the summary of statistics pertaining to the DataFrame columns.

(d) Get the full summary of statistics to the DataFrame.

(e) Find the 50th percentile of the DataFrame.

Solution For data: dfS = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Std7.csv')
(a) dfS.quantile()

(b) dfS.quantile([.25, .5, .75])
(c) dfS.describe()

(d) dfS.describe(include='all')

(e) dfS.describe(percentiles=[.50])
3.5 Histogram

Histograms are powerful tools for analyzing the distribution of data. A histogram plot is generally used to show the frequency across continuous, numeric or discrete data. It gives the user a clear understanding of how the data points of a dataset fall into each category, and of the median and range of values.
To create a histogram, first, we have to divide the entire range of values into a series of intervals, and

second, we have to count how many values fall into each interval. Matplotlib calls those categories or
intervals as bins. The bins are consecutive and non-overlapping intervals of a variable. They must be adjacent

and are often of equal size. So, we understand that in a histogram:


• The x-axis represents discrete bins or intervals for the observations.

• The y-axis represents the “frequency density” or count of the number of observations in the

dataset that belong to each bin.
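The bin-counting step can be sketched with NumPy before any plotting is done (the ages list is illustrative):

```python
import numpy as np

ages = [14, 15, 16, 16, 16, 17]

# Divide the range 14-17 into 3 equal bins and count the values in each
counts, edges = np.histogram(ages, bins=3)
print(edges)    # bin boundaries: [14. 15. 16. 17.]
print(counts)   # observations per bin: [1 1 4]
```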



3.5.1 Understanding Matplotlib



Matplotlib is the leading visualization library in Python. It is a powerful two-dimensional plotting library
for the Python language. Matplotlib is a multi-platform data visualization library built on NumPy arrays.

The matplotlib library provides a context in which one or more plots can be drawn before the image is shown or saved to a file. The context can be accessed via functions on pyplot. Matplotlib is capable of creating all manner of graphs, plots, charts, histograms and much more. Before using any method to draw a histogram,

you must install Matplotlib library in your system. If you are using Python 3.6, 3.7 or any latest version,
then it is easy to install Matplotlib using pip.

To install Python Matplotlib library:


• Start Windows command prompt.
• Change the Python installation folder, i.e., C:\Python37\Scripts
• Type: pip install matplotlib

After installing the Python Matplotlib, now you are ready to analyse your data through different plots.
To check the version that you installed, you can use the following command in Python interactive mode:
>>> import matplotlib
>>> print (matplotlib.__version__)

3.0.3

3.5.2 Creating Histogram

Creating histograms from Pandas data frames is one of the easiest operations in Pandas and data visualization

that you will come across. Pandas DataFrames that contain our data come pre-equipped with methods for
creating histograms, making preparation and presentation easy.

We can create histograms from Pandas DataFrames using the pandas.DataFrame.hist(), which is a

sub-method of pandas.DataFrame.plot(). Pandas uses the Python module Matplotlib to create and render
all plots, and each plotting method from pandas.DataFrame.plot takes optional arguments that are passed

to the Matplotlib functions. The syntax of pandas.DataFrame.hist DataFrame method is:

DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None,
yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, **kwds)

As shown in the above syntax, the DataFrame.hist() function has many more options, but in this text
we use some common options like:
• column: the DataFrame column name for which to create a histogram.
• by: If passed, then used to form histograms for separate groups. The by option will take an
se

object by which the data can be grouped.


• grid: takes a Boolean value, i.e., to enable (if True) or disable (if False) the grid.
ou

• xlabelsize, ylabelsize: these options change the size of x and y label text size.
• sharex, sharey: to set both of the axes to the same range and scale.
H

• bin: A bin in a histogram is the block that you use to combine values before getting the frequency.
The bins are usually specified as consecutive, non-overlapping intervals of a variable. The default
i
at

value is 10.
• fill: used to fill the bars of the histogram.

For example, to create a histogram using DataFrame df:



>>> import pandas as pd



>>> import matplotlib.pyplot as plt


>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Student.csv')
>>> df   # Source DataFrame
  Student_Name   Age Gender  Test1  Test2  Test3
0       Aashna  16.0      F    7.6    8.5    7.6
1          NaN   NaN    NaN    NaN    NaN    NaN
2         Jack  16.0      M    8.6    NaN    NaN
3        Somya  17.0      F    6.5    7.9    8.8
4        Raghu  15.0      M    6.8    7.7    7.9
5       Mathew  16.0      M    9.2    9.0    NaN
6        Nancy  14.0      F    6.8    8.7    8.8

>>> df.hist()
which prints:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x07D02830>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x08D42FD0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x09110FD0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x09133650>]],
      dtype=object)

This command produced histograms for each of the 4 numeric columns (Age, Test1, Test2 and Test3). Once you have drawn the plot, you can then "show" it. If you are using Matplotlib from within a Python script, you have to add the plt.show() method inside the file to be able to display your plot. The show() command looks for the currently active drawing or figure and opens it in an interactive window as shown in Figure 3.1.
>>> plt.show()


Figure 3.1 A simple histogram with four features of DataFrame df.



3.5.3 Single Histogram from a Pandas DataFrame


The simple df.hist() method shown above has plotted histograms of every feature in the DataFrame. If we

wish to only examine a subset of the features, or even look at only one, then we can specify what we want to plot using the column parameter of the df.hist() method. The column parameter takes either a string or a list of strings of column names. For example, to plot a histogram for the Age column:

>>> df.hist(column='Age') # Plot a single column



array([[<matplotlib.axes._subplots.AxesSubplot object at 0x09357E30>]],
      dtype=object)
>>> plt.show()
When you execute the above script, a histogram will appear with the most common age group (i.e., 16) as
given in Figure 3.2 on the next page.

Figure 3.2 A histogram with the most common age group.
3.5.4 Modifying Histogram Bin Sizes

The bins, or bars, of the histogram plot can be adjusted using the bins option. This option can be tuned depending on the data, to see either subtle distribution trends or broader phenomena. Which bin size to use depends heavily on the data set, therefore it is important to know how to change this parameter. The default number of bins is 10. For example, to create a histogram with bin size 5:

>>> df.hist(column="Age", bins=5) # Plotting 5 bins



array([[<matplotlib.axes._subplots.AxesSubplot object at 0x09D93990>]],
      dtype=object)

>>> plt.show()

Figure 3.3 A histogram with 5 bins.



>>> df.hist(column="Age", bins=30) # Plotting 30 bins


>>> plt.show()

Figure 3.4 A histogram with 30 bins.

3.5.5 Multiple Pandas Histograms from a DataFrame


The column parameter can also take a list of column names to produce separate plots for each chosen column. For example, to plot the Test1 and Test2 data as histograms:

>>> df.hist(column=["Test1", "Test2"]) # Plot specific columns



>>> plt.show()

Figure 3.5 A histogram with multiple columns.



Here, the histograms of the two features are plotted side by side. Notice that the axes are adjusted automatically by default, so the scales may be different for each Pandas DataFrame histogram.

3.5.6 Modifying Histogram Axes


Again, you may notice in the above plots (Figure 3.5) that the X and Y axes are not the same. Different scales can complicate side-by-side data comparisons, so we would prefer to set both of the axes to the same range and scale. We can do this with the sharex and sharey options. These options accept Boolean values, which are False by default. If these options are set to True, then the respective axis range and scale is shared between plots. For example, let us apply this feature to the previous histogram command:
>>> df.hist(column=["Test1", "Test2"], sharex=True) # Share only x axis

>>> plt.show()

Figure 3.6 Histogram set X-axis to the same range.
>>> df.hist(column=["Test1", "Test2"], sharex=True, sharey=True) # Share x and y axis

>>> plt.show()

Figure 3.7 Histogram set both of the axes to the same range and scale.

Be careful when comparing histograms this way. The range over which bins are set in the Test1 data is larger than in the Test2 data, leading to wider bars in Test1 than in Test2. The result is that while both plots have the same number of data points, Test1 appears “larger” because of the default bar widths.
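One way to make such a comparison fairer, sketched below with made-up Test1/Test2 values (the real df from this chapter's CSV is assumed to have the same columns), is to pass the same explicit bin edges to both columns so that every bar spans an identical range:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up scores standing in for the Test1/Test2 columns used above
df = pd.DataFrame({'Test1': [6.5, 7.2, 8.1, 5.9, 7.8, 6.8],
                   'Test2': [8.7, 8.9, 9.1, 8.5, 8.8, 9.0]})

# Common bin edges covering the full range of both columns
edges = np.linspace(5, 10, 11)   # 10 equal bins of width 0.5

# Identical bins plus shared axes make the two panels directly comparable
df.hist(column=['Test1', 'Test2'], bins=edges, sharex=True, sharey=True)
plt.show()
```

With explicit edges, differences between the panels reflect the data rather than the automatic bin choice.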

3.5.7 Plotting Multiple Features in One Plot


Suppose we wanted to present the histograms on the same plot in different colors. To do this, we will have
to slightly change our syntax and use the pandas.DataFrame.plot.hist method. This plot.hist() method
contains more specific options for plotting. It does not, however, contain a column option; therefore we
will have to slice the DataFrame prior to calling the method.

Recall the DataFrame df which had 6 columns of data. To only plot the Test1, Test2 and Test3 data in
one plot, the command is:
>>> df[["Test1", "Test2", "Test3"]].plot.hist() # Note slicing is performed on df itself
>>> plt.show()

Figure 3.8 Histogram with multiple features on one plot.
This code snippet plotted the three histograms on the same plot, but the second and third plots “block” the view of the first. We can solve this problem by adjusting the alpha transparency option, which takes a value in the range [0, 1], where 0 is fully transparent and 1 is fully opaque.

>>> df[["Test1", "Test2", "Test3"]].plot.hist(alpha=0.4) # Plot at 40% opacity


>>> plt.show()

Figure 3.9 Histogram plotting at 40% opacity.

3.5.8 Plotting DataFrame Columns using DataFrame plot() Method


The plot() method makes plots of DataFrame using matplotlib/pylab. The pandas.DataFrame.plot() takes
optional arguments that are passed to the Matplotlib functions. Note that the details about
pandas.DataFrame.plot() method will be discussed in Chapter-8.

For example, let us create a histogram for previous DataFrame df using plot() method:
>>> df.plot(kind='hist')
>>> plt.show()

Figure 3.10 Histogram using plot() method.
In the above command, the kind option specifies the type of plot. The default is a ‘line’ plot; otherwise, we can mention ‘bar’, ‘barh’, ‘hist’, ‘box’, ‘area’, ‘pie’, ‘scatter’, etc.

Histogram using Single/Multiple column



If we wish to only examine a subset of the features, for example, to plot a histogram for Age column using
plot() method, the command is:

>>> df[['Age']].plot(kind='hist',bins=[0, 5, 10, 15, 20, 25], rwidth=0.8)


>>> plt.show()

Figure 3.11 Single column histogram using plot() method.



Similarly, to plot a histogram using multiple columns, i.e., Test1, Test2, and Test3 using DataFrame
plot() method, the command is:
>>> df[['Test1', 'Test2', 'Test3' ]].plot(kind='hist')
>>> plt.show()

Figure 3.12 A histogram with multiple columns using plot() method.

Example Using the previous DataFrame dfA, write the commands to create the following histograms:
(a) Create a histogram plot to show Total Marks.
(b) Create a histogram plot to show Total Marks with 100 bins.
(c) Create a histogram plot to show Total Marks with bins [400, 420, 440, 460, 480, 500].
H

Solution For data: dfA = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Std7.csv')


Assume that the following modules are imported.

import pandas as pd
import matplotlib.pyplot as plt

(a) dfA.hist(column='Total Marks')
plt.show()

(b) dfA.hist(column='Total Marks', bins=100)
plt.show()
(c) dfA.hist(column="Total Marks", bins=[400, 420, 440, 460, 480, 500])
plt.show()

Or
dfA[['Total Marks']].plot(kind='hist',bins=[400, 420, 440, 460, 480, 500])

plt.show()

Points to Remember
1. Aggregation is the process of turning the values of a dataset (or a subset of it) into one single
value.
2. Data aggregation accepts multiple values and in turn returns a single value.

3. In Python pandas, the median() function is used to calculate the median or middle value of a
given set of numbers.
4. The mode() function in Python pandas is used to calculate the mode, or most repeated value, of a given set of numbers.

5. Quantiles are points in a distribution that relate to the rank order of values in that distribution.

6. The middle value of the sorted sample (middle quantile, i.e., 50th percentile) is known as the
median.

7. The describe() method is used to compute summary statistics for each numerical (default) column.

8. The plot() is used to quickly visualize the data in different ways.
9. The available plotting types are: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’,

‘hexbin’.

SOLVED EXERCISES

1. What is the use of describe() function?

Ans. The .describe() function is a useful summarisation tool that will quickly display statistics for any

variable or group it is applied to. You can use .describe() to see a number of basic statistics about
the column, such as the mean, min, max, and standard deviation. This can give you a quick overview
of the shape of the data.
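A minimal sketch of this, using a made-up marks column for illustration:

```python
import pandas as pd

# Hypothetical marks column, for illustration only
s = pd.Series([56, 64, 72, 72, 80, 88])

stats = s.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(stats['count'])  # 6.0
print(stats['mean'])   # 72.0
print(stats['50%'])    # 72.0 (the median)
```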
2. A vector x is given with an even number of values:
x = [2, 5, 6, 10, 11, 13]
(a) Find the second quartile or median of x.
(b) Find all quartiles of x.


Ans. import pandas as pd

x = pd.Series( [2, 5, 6, 10, 11, 13])



(a) x.quantile(0.50)

which prints: 8.0



(b) x.quantile([0.25, 0.50, 0.75])



which prints:

0.25 5.25
0.50 8.00

0.75 10.75
dtype: float64
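These values follow from the linear interpolation that quantile() performs by default: quantile q lies at position (n − 1) × q in the sorted data, and a fractional position is interpolated between the two neighbouring values. A small sketch for the 0.25 quantile of x:

```python
import pandas as pd

x = pd.Series([2, 5, 6, 10, 11, 13])     # already sorted, n = 6

q = 0.25
pos = (len(x) - 1) * q                    # 5 * 0.25 = 1.25
lower = x.iloc[int(pos)]                  # value at index 1 -> 5
upper = x.iloc[int(pos) + 1]              # value at index 2 -> 6
manual = lower + (pos - int(pos)) * (upper - lower)   # 5 + 0.25 * 1 = 5.25

print(manual)            # 5.25
print(x.quantile(q))     # 5.25, matching pandas
```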

3. A vector y is given with an odd number of values:
y = [2, 4, 6, 8, 10, 12, 14]
(a) Find the second quartile or median of y.
(b) Find all quartiles of y.

Ans. import pandas as pd


y = pd.Series( [2, 4, 6, 8, 10, 12, 14])
(a) y.quantile(0.50)
which prints:
8.0

(b) y.quantile([0.25, 0.50, 0.75])


which prints:
0.25 5.0
0.50 8.0

0.75 11.0

dtype: float64
4. A vector A is given with an even number of values:
15, 20, 32, 60
(a) Find the default quantile of the vector.
(b) Find the [.25, .5, .75] quantiles of the vector.
Ans. A = pd.Series([15, 20, 32, 60])

(a) A.quantile()

(b) A.quantile([.25, .5, .75])
5. Explain the functions mean() and median() with examples of each using pandas DataFrame.

Ans. Consider a DataFrame called dfW with the following data:

     Age   Wage Rate
0    20    2.5
1    25    3.5
2    30    4.5
3    35    5.5
4    40    7.0
5    45    8.7
6    50    9.5
7    55    10.0
8    60    12.5

• mean(). The mean() function in Python pandas is used to calculate the arithmetic mean of a
given series or mean of a DataFrame. For example, to calculate the mean of Age column:

dfW['Age'].mean()

• median(). In Python pandas, the median() function is used to calculate the median or middle

value of a given set of numbers. For example, to calculate the median of Wage Rate column:
dfW['Wage Rate'].median()

6. For class 12 students, the height and weight of 8 students are given below:
Student Name         Height (inch)   Weight (kg)
TANVI GUPTA          60.0            54.3
MRIDUL KOHLI         62.9            56.8
DHRUV TYAGI                          60.4
SAUMYA PANDEY        58.3            58.3
ALEN RUJIS           62.5
MANALI SOVANI        58.4            57.4
AAKASH IRENGBAM      63.7            58.3
SHIVAM BHATIA        61.4            55.8

Assume that the above dataset is created with a DataFrame called dfC. Using dfC, answer the
following:
(a) Count the number of non-null values across the row axis for DataFrame dfC.
(b) Count the number of non-null observations in column Weight of DataFrame dfC.
(c) How many non-null observations are there in column Height of DataFrame dfC?
(d) Count the number of non-null values across the column axis for DataFrame dfC.
(e) Find the sum of the Height and Weight columns for all students using DataFrame dfC.
(f) Find the mean of the Height and Weight columns for all students using DataFrame dfC.
(g) Find the median of the Height and Weight columns for all students using DataFrame dfC.
(h) Find the most repeated value for the specific column ‘Weight’ of DataFrame dfC.
(i) Find the maximum Weight value of DataFrame dfC.
Ans. For data: dfC = pd.read_csv('E:/IPSource_XII/IPXIIChap03/Heightweight.csv')

Or

import numpy as np
dfC = pd.DataFrame({'Student Name' : ['TANVI GUPTA', 'MRIDUL KOHLI', 'DHRUV TYAGI', 'SAUMYA PANDEY', 'ALEN RUJIS', 'MANALI SOVANI', 'AAKASH IRENGBAM', 'SHIVAM BHATIA'],
                    'Height' : [60.0, 62.9, np.nan, 58.3, 62.5, 58.4, 63.7, 61.4],
                    'Weight' : [54.3, 56.8, 60.4, 58.3, np.nan, 57.4, 58.3, 55.8]},
                    columns = ['Student Name', 'Height', 'Weight'])
(a) dfC.count()
(b) dfC['Weight'].count()

(c) 7
(d) dfC.count(axis='columns')

(e) dfC.loc[:, ['Height', 'Weight']].sum()



(f) dfC.loc[:, ['Height', 'Weight']].mean()


(g) dfC.loc[:, ['Height', 'Weight']].median()

(h) dfC['Weight'].mode()

(i) dfC['Weight'].max()

7. Using the previous DataFrame dfC, answer the following:
(a) Find the 0.25 quantile of DataFrame dfC.
(b) Find the (.25, .50, .75) quantiles of DataFrame dfC.
(c) Find the (0.05, 0.25, 0.50, 0.75, 0.95) quantiles of DataFrame dfC.
(d) Find the basic statistics values using DataFrame dfC.


Ans. (a) dfC.quantile(.25)

(b) dfC.quantile([.25, .50, .75])


(c) dfC.quantile([0.05, 0.25, 0.50, 0.75, 0.95])

(d) dfC.describe()
8. Using the previous DataFrame dfC, answer the following:
(a) Create two separate histogram plots for each chosen column, i.e., Height and Weight, using DataFrame dfC.
(b) Create two separate histogram plots for the Height and Weight columns by sharing the axes to the same range and scale, using DataFrame dfC.
(c) Create a histogram to plot both Height and Weight data in one plot using DataFrame dfC.

Ans. import pandas as pd


import matplotlib.pyplot as plt
(a) dfC.hist(column=["Height", "Weight"])
plt.show()

(b) dfC.hist(column=["Height", "Weight"], sharex=True, sharey=True)

plt.show()
(c) dfC[["Height", "Weight"]].plot.hist()
plt.show()

REVIEW QUESTIONS

1. What are quantiles? How can you find the 0.25 quantile?

2. What do you mean by the middle value of the sorted sample list?
3. In a histogram, how can you set both of the axes to the same range and scale?

4. A DataFrame column called discount contains the following values:

Discount
2560

3600

1250
NaN
1200

(a) What will be the mean value of the column discount?


(b) What will be the median value of the column discount?

5. Find the .25, .50, and .75 quantiles for the following series x:
[3, 7, 8, 5, 12, 14, 21, 13, 18]
Chapter – 4

Function Applications in Pandas

4.1 Introduction
Pandas is a big store house within the Python library ecosystem. Whether for data visualization or data analysis, the practicality and functionality that this tool offers is not found in any other module. Python pandas supports a number of data aggregation/descriptive functions to analyze data. In this chapter, we will learn about Python pandas function applications.

4.2 .pipe() Function



Pipes take the output from one function and feed it to the first argument of the next function. The pipe() function performs a custom operation on the entire DataFrame, and can be thought of as function chaining. The syntax is:

DataFrame.pipe(func, *args, **kwargs)



To apply pipe(), the first argument of the function must be the data set.

Here,

• func : the function to apply; the data set will be passed to it as its first argument.



• args : positional arguments passed into func.



• kwargs : a dictionary of keyword arguments passed into func.



Let us create a DataFrame with the following data:


>>> import pandas as pd
>>> Data = {'Name': ['Aashna', 'Simran', 'Jack', 'Raghu', 'Somya', 'Ronald'],

'English': [87, 64, 58, 74, 87, 78],


'Accounts': [76, 76, 68, 72, 82, 68],

'Economics': [82, 69, 78, 67, 78, 68],


'Bst': [72, 56, 63, 64, 66, 71],

'IP': [78, 75, 82, 86, 67, 71]}


>>> df = pd.DataFrame(Data, columns=['Name', 'English', 'Accounts', 'Economics', 'Bst', 'IP'])
Or
For data: df = pd.read_csv('E:/IPSource_XII/IPXIIChap04/MResult.csv')


>>> df
Name English Accounts Economics Bst IP
0 Aashna 87 76 82 72 78
1 Simran 64 76 69 56 75

2 Jack 58 68 78 63 82
3 Raghu 74 72 67 64 86

4 Somya 87 82 78 66 67

5 Ronald 78 68 68 71 71

>>> df = df.set_index(['Name']) # DataFrame indexed permanently

>>> df
English Accounts Economics Bst IP

Name

Aashna 87 76 82 72 78
Simran 64 76 69 56 75

Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Somya 87 82 78 66 67
Ronald 78 68 68 71 71

Now, using the above DataFrame df, apply the .pipe() function to add 2 marks to every numeric column:
# Create a user-defined function to add two numbers

>>> def Add_Two(Data, aValue):


return Data + aValue

>>> df.pipe(Add_Two, 2)

English Accounts Economics Bst IP



Name

Aashna 89 78 84 74 80

Simran 66 78 71 58 77
Jack 60 70 80 65 84
Raghu 76 74 69 66 88

Somya 89 84 80 68 69
Ronald 80 70 70 73 73

From the above command,



df.pipe(Add_Two, 2)
Notice that the first argument of .pipe() is the function Add_Two, and the data set is fed into it. The function Add_Two accepts two arguments, Add_Two(Data, aValue). As Data is the first parameter, which takes in the data set, we can directly use pipe(). We only need to specify to pipe the name of the function whose first argument refers to the data set.
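Since pipe() returns the transformed DataFrame, calls can be chained so that each step feeds the next. The helper functions below are our own illustrations, not part of pandas:

```python
import pandas as pd

df2 = pd.DataFrame({'English': [87, 64], 'IP': [78, 75]},
                   index=['Aashna', 'Simran'])

def add_marks(data, n):          # hypothetical helper: add n to every value
    return data + n

def scale(data, factor=1.0):     # hypothetical helper: multiply every value by factor
    return data * factor

# Each pipe() feeds its result into the first argument of the next function
result = df2.pipe(add_marks, 2).pipe(scale, factor=0.5)
print(result)
```

Here Aashna's English becomes (87 + 2) * 0.5 = 44.5, showing how positional and keyword arguments are forwarded to each piped function.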

4.3 .apply() Function


The .apply() function applies a function to each element in a Series or along an axis of the DataFrame, i.e., either row-wise or column-wise. We can use .apply() to send a single column or a single row to a function. Basically, we can use custom functions when applying on Series and also when operating on chunks of DataFrames in groupbys. This is useful when cleaning up data, such as converting formats or altering values. This method takes as arguments the following:

• a general or user defined function
• any other parameters that the function would take

The syntax is:

DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)

Here,

• func: the function which will be applied to each column or row.
• axis: along which the function is applied:
− 0 or ‘index’: apply function to each column.

− 1 or ‘columns’: apply function to each row.

Let us see the following data set with two columns containing numeric data:

A B

1 6

2 7
3 8

4 9
5 10

6 11

When we use the apply() function, by default it will apply to each column. For example, let us create the DataFrame and find the sum of each column using the aggregate function sum with apply().

>>> Test = {'A': [1, 2, 3, 4, 5, 6],



'B': [6, 7, 8, 9, 10, 11]}


>>> tdf = pd.DataFrame(Test, columns=['A', 'B'])
>>> tdf

A B
0 1 6

1 2 7

2 3 8
3 4 9
4 5 10
5 6 11

As the name suggests, the .apply() function applies a function along any axis of the DataFrame. For
example, to find the sum of all values of each column:
>>> tdf[['A', 'B']].apply(sum)

The above command will return the sum of all the values of column A and column B.

A 21
B 51

dtype: int64

Use .apply() with axis=1 to send every single row to a function

at
Similarly, to find the sum of all values of each row, the command is:

>>> tdf[['A', 'B']].apply(sum, axis=1) # axis = 1 applies to each row

0     7
1     9
2    11
3    13
4    15
5    17
dtype: int64

Let us take another DataFrame example with the following data set for applying the apply() function.

>>> import pandas as pd


>>> dfA = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Pizza.csv')

>>> dfA

OrdNum Size Topping Price



0 PZ001 Small Margherita 356.65



1 PZ002 Large Peppy Paneer 545.70



2 PZ003 Extra Large Bell Pepper 756.90


3 PZ004 Extra Large Mexican Green Wave 654.00

4 PZ005 Extra Large Pepperoni chicken 632.00


5 PZ006 Large Chicken Sausage 480.60

6 PZ007 Small Peri-Peri chicken 375.00



Use .apply() to send a column of every row to a function


Let us apply the .apply() function to send a column of every row to a function to calculate 4% tax as a new column called Taxes, using the following expression:
Taxes = Price * 0.04

So, to implement the expression as a function for each row of the previous DataFrame (dfA), create a function called Tax_calc() as follows:
>>> # we create a function to retrieve
>>> def Tax_calc(Price):

Taxes = Price * 0.04 # 4% tax

return Taxes
>>> dfA['Taxes'] = dfA.Price.apply(Tax_calc)

Here, the .apply() function calls the Tax_calc() function to calculate the taxes for the Price column. The Tax_calc() function accepts one argument, Tax_calc(Price). The resulting data is stored as a new column called Taxes in the existing DataFrame dfA. The resulting DataFrame is:

>>> print (dfA)

OrdNum Size Topping Price Taxes

0 PZ001 Small Margherita 356.65 14.266

1 PZ002 Large Peppy Paneer 545.70 21.828

2 PZ003 Extra Large Bell Pepper 756.90 30.276

3 PZ004 Extra Large Mexican Green Wave 654.00 26.160
4 PZ005 Extra Large Pepperoni chicken 632.00 25.280
5 PZ006 Large Chicken Sausage 480.60 19.224

6 PZ007 Small Peri-Peri chicken 375.00 15.000



Let us apply the .apply() function to send every single row to a function to calculate a new column
Taxes:

>>> # we create a function to retrieve



>>> def Tax_calc(row): # row is a user-defined name that contains all the column values

return row['Price'] * 0.04 # The price value is used from the row

>>> dfA.apply(Tax_calc, axis=1) # a series is displayed with taxes



0 14.266

1 21.828

2 30.276
3 26.160
4 25.280

5 19.224
6 15.000

dtype: float64

Or
If you want to add it as a new column using the .apply() function, then write the following:
>>> dfA['Taxes'] = dfA.apply(Tax_calc, axis=1) # to add a new column
The above command will again create a DataFrame with a new Taxes column.

Using Arbitrary function with apply()

The axis parameter of the apply() function is useful to travel through columns and rows. When an arbitrary function like max, min, etc. is applied with the apply() function, it travels the entire columns and rows. If axis = 0, it travels downwards through each column. Similarly, when axis = 1, it travels row-wise to the right.
In the previous DataFrame dfA, we have 5 columns and 7 rows. Let us find the maximum value of the ‘Price’ and ‘Taxes’ columns using the apply() function.
>>> # apply() method travel axis=0

>>> dfA.loc[:, 'Price':'Taxes'].apply(max, axis=0)

which will display the maximum value of Price and Taxes column as given below:

Price 756.900

Taxes 30.276

dtype: float64
Similarly, to find row-wise maximum values of ‘Price’ and ‘Taxes’ columns:

>>> dfA.loc[:, 'Price':'Taxes'].apply(max, axis=1)
0 356.65
1 545.70
2 756.90

3 654.00

4 632.00
5 480.60

6 375.00
dtype: float64

4.3.1 Using Lambda Function



The keyword lambda in Python is used to create anonymous functions. Anonymous functions are functions that are unnamed; that means you are defining a function without giving it a name. A lambda function is a shorthand way to define a quick function that you need once.
The basic syntax to create a lambda function is:
lambda arguments : expression

Lambda functions can have any number of arguments but only one expression. The expression is evaluated and returned. Lambda functions cannot contain any statements, and a lambda returns a function object which can be assigned to any variable.

For example:

>>> fun = lambda x: x*x


Here, in lambda x: x*x; x is an argument to the function and x*x is the expression which gets executed
and its value is returned as output. When we call the function fun with an argument, it will pass it as x and
perform the expression x*x.

For example,
>>> fun = lambda x: x*x
>>> sqr = fun(5) # Output: 25

To check the type of the lambda function, type the following:

>>> type(fun)
<class 'function'>

Similarly, let us see another example to add two numbers using lambda function:

>>> SumTwo = lambda x, y : x + y

>>> print (SumTwo(10, 20)) # Output: 30

Here, in lambda x, y: x + y; x and y are arguments to the function and x + y is the expression which gets
executed and its value is returned as output.

Also, the lambda x, y: x + y returns a function object which can be assigned to any variable, in this case

function object is assigned to the SumTwo variable.
Using Lambda Function with .apply()

In the previous section, we used the .apply() function to find the tax for each row in DataFrame dfA. Instead of using the above function (Tax_calc), we can use a lambda function. Let us do the same with a lambda function:
>>> dfA.apply(lambda row: row[3] * 0.04, axis=1)

0 14.266
1 21.828

2 30.276

3 26.160
4 25.280

5 19.224

6 15.000

dtype: float64

Here, the row parameter takes an entire row of DataFrame dfA, and row[3] is the Price column.

Let us see the difference between a normal function defined with def and a lambda function. This is a program that returns the Taxes for every single row sent to the function.

>>> def Tax_calc(row): # row is a user-defined name that contains all the column values
return row['Price'] * 0.04 # The price value is used from the row

Here, while using def, we need to define a function with the name Tax_calc and need to pass a value to it. After execution, we also need to return the result to the place where the function was called, using the return keyword.
On the other hand, a lambda function does not include a “return” statement; it always contains an expression whose value is returned. We can also put a lambda definition anywhere a function is expected, and we don’t have to assign it to a variable at all.

Example Using the previous DataFrame dfA, write the commands for the following:
(a) Write a function to display the Topping column in capitalized form.
(b) Apply the toppingcapital function over the column ‘Topping’.

Solution The commands are:


(a) Command to create a lambda function toppingcapital:
toppingcapital = lambda x: x.upper()
(b) Apply the toppingcapital function over the column ‘Topping’

dfA['Topping'].apply(toppingcapital)

which will display the following:

0 MARGHERITA

1 PEPPY PANEER
2 BELL PEPPER

3 MEXICAN GREEN WAVE

4 PEPPERONI CHICKEN

5 CHICKEN SAUSAGE
6 PERI-PERI CHICKEN

Name: Topping, dtype: object

4.4 Aggregation (groupby)

Pandas DataFrames have a .groupby() method that works in the same way as the SQL Group By. The main
objective of this function is to split the data into sets and then apply some functionality on each subset. The

most important operations made available by a groupby() are aggregate, filter, transform, and apply.
• Splitting the data into groups based on some criteria with the levels of a categorical variable.

This is generally the simplest step. For example, a DataFrame can be split up by rows(axis=0) or
columns(axis=1) into groups.

• Applying a function to each group individually. A function is applied to each group using .agg() or

.apply(). There are 3 classes of functions we might consider:



− Aggregate – estimate/compute summary statistics (like counts, means) for each group. This

will reduce the size of the data. For example:



− Compute group sum(), max(), min(), mean(), etc.



− Compute group sizes / count().



− Transform – within group standardization, imputation using group values. The size of the
data will not change.
For example:

− Standardize data (zscore) within a group.


− Filling NAs within groups with a value derived from each group.

− Filter – ignore rows that belong to a particular group. This discards some groups, according

to a group-wise computation that evaluates True or False. For example:


− Discard data that belongs to groups with only a few members.
− Filter out data based on the group sum or mean.
− A combination of these 3
• Combining the results into a data structure like series or DataFrame.
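Of these operations, only aggregation is demonstrated at length in this chapter; the transform and filter steps can be sketched on a small made-up DataFrame like this:

```python
import pandas as pd

sdata = pd.DataFrame({'Product': ['Black', 'Red', 'Black', 'Red', 'Red'],
                      'Sales': [120, 130, 136, 125, 165]})

# transform: broadcast each group's mean back onto the original rows
sdata['GroupMean'] = sdata.groupby('Product')['Sales'].transform('mean')

# filter: keep only the rows whose group has at least 3 members
big_groups = sdata.groupby('Product').filter(lambda grp: len(grp) >= 3)
print(sdata)
print(big_groups)
```

Note that transform keeps the DataFrame the same size (each row receives its own group's mean), while filter drops the 'Black' rows entirely because that group has only two members.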

Let us see the following data set with two columns, and how its rows are grouped.

Data                Data with groupby        Group function results
Product   Sales     Product   Sales          sum    max    min    mean     count
Black     120       Black     120
Red       130       Black     120
Black     120       Black     136            376    136    120    125.33   3
Green     110       Green     110
Red       125       Green     132
Green     132       Green     144            386    144    110    128.67   3
Red       115       Red       130
Black     136       Red       125
Green     144       Red       115
Red       165       Red       165            535    165    115    133.75   4
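The group results shown in the table above can be reproduced with groupby() followed by agg(), which accepts a list of function names:

```python
import pandas as pd

sales = pd.DataFrame({'Product': ['Black', 'Red', 'Black', 'Green', 'Red',
                                  'Green', 'Red', 'Black', 'Green', 'Red'],
                      'Sales': [120, 130, 120, 110, 125,
                                132, 115, 136, 144, 165]})

# One row per product with the same statistics as in the table
result = sales.groupby('Product')['Sales'].agg(['sum', 'max', 'min', 'mean', 'count'])
print(result)
```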
When we apply the .groupby() method to a DataFrame object, it returns a GroupBy object, which is
then assigned to the grouped single variable or GroupBy variable. An important thing to note about a pandas
In
GroupBy object is that no splitting of the DataFrame has taken place at the point of creating the object. The
GroupBy object simply has all of the information it needs about the nature of the grouping. No aggregation
se

will take place until we explicitly call an aggregation function on the GroupBy object.
ou

The syntax is:


DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True,
H

squeeze=False, observed=False, **kwargs)


i

Here,
at

• by : Used to determine the groups for the groupby. If by is a function, it’s called on each value of
w

the object’s index.



s

axis: axis along which the function is applied:


ra

− 0 or ‘index’: apply function to each column.


− 1 or ‘columns’: apply function to each row.
Sa

• level : If the axis is a multiIndex (hierarchical), group by a particular level or levels.


• as_index : For aggregated output, return object with group labels as the index. Only relevant for
ew

DataFrame input. as_index=False is effectively “SQL-style” grouped output. If you mention


as_Index=False, then it does not show the GroupBy index column(s). A new numeric index will
N

display. Otherwise all the index columns will be displayed in as_Index=True.


• sort : Sort group keys. Get better performance by turning this off. Note this does not influence
@

the order of observations within each group. groupby() function preserves the order of rows
within each group.
• group_keys : When calling apply, add group keys to index to identify pieces.
• squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent
type.
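The effect of as_index can be sketched on a small made-up DataFrame: with the default as_index=True the group labels become the index, while as_index=False keeps them as an ordinary column with a fresh numeric index:

```python
import pandas as pd

dstock = pd.DataFrame({'Category': ['TV', 'TV', 'AC'],
                       'Qty': [8, 4, 11]})

with_index = dstock.groupby('Category').sum()              # labels form the index
flat = dstock.groupby('Category', as_index=False).sum()    # labels stay as a column
print(with_index)
print(flat)
```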

Before using the groupby() function, let us create a DataFrame with the following data:
>>> import pandas as pd
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Stock.csv')
>>> df

Category_Name Item_Num Unit_Price Sales_Quantity
0 Television T001_Panasonic 27800 8

1 Washing Machine W003_Samsung 9699 4

2 Refrigerator R001_LG 43800 6
3 Microwave M001_LG 13600 8

4 Television T002_Sony 42200 4
5 Air Conditioner A001_LG 23500 11

6 Microwave M002_Samsung 18750 4

7 Washing Machine W001_IFB 32600 12
8 Television T003_LG 32500 4

9 Refrigerator R002_Samsung 23300 4
10 Air Conditioner A002_Carrier 43700 6
11 Microwave M003_LG 28750 5
12 Television T004_Sony 65800 5

13 Washing Machine W002_LG 24200 10



14 Air Conditioner A003_Samsung 23500 11


15 Refrigerator R003_Onida 23300 4

Using the above DataFrame (df) we create a grouping of categories and apply a function to the categories.

For example, let us apply the groupby() function to group the data on Category_Name column.

# Create a groupBy object


>>> dfC = df.groupby('Category_Name')

Here, the groupby() function creates a groupby object called dfC. When we print the object, it will
display the following:

>>> print (dfC)


<pandas.core.groupby.DataFrameGroupBy object at 0x07370430>

This grouped variable (dfC) is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key, Category_Name. The idea is that this object has all of the information needed to then apply some operation to each of the groups. We can print the information through iteration only.

View Groups

To view the GroupBy object result:


>>> print (df.groupby('Category_Name').groups)
{'Air Conditioner': Int64Index([5, 10, 14], dtype='int64'),
'Microwave': Int64Index([3, 6, 11], dtype='int64'),

'Refrigerator': Int64Index([2, 9, 15], dtype='int64'),


'Television': Int64Index([0, 4, 8, 12], dtype='int64'),
'Washing Machine': Int64Index([1, 7, 13], dtype='int64')}

The above output displays, for each Category_Name, the row indexes from DataFrame df.

Or
We can use list() method to see the details of the GroupBy object values:

>>> list(df['Item_Num'].groupby(df['Category_Name']))

[('Air Conditioner', 5 A001_LG

10 A002_Carrier

14 A003_Samsung
Name: Item Num, dtype: object), ('Microwave', 3 M001_LG

6 M002_Samsung

11 M003_LG
Name: Item Num, dtype: object), ('Refrigerator', 2 R001_LG

9 R002_Samsung
15 R003_Onida
Name: Item Num, dtype: object), ('Television', 0 T001_Panasonic
4 T002_Sony

8 T003_LG

12 T004_Sony
Name: Item Num, dtype: object), ('Washing Machine', 1 W003_Samsung
H

7 W001_IFB
13 W002_LG

Name: Item Num, dtype: object)]



Printing First row of each Group



Let us print the first row of each group of the pandas DataFrame using the GroupBy first() method.

>>> dfC.first()
Item_Num Unit_Price Sales_Quantity
Category_Name
Air Conditioner A001_LG 23500 11
Microwave M001_LG 13600 8
Refrigerator R001_LG 43800 6
Television T001_Panasonic 27800 8
Washing Machine W003_Samsung 9699 4

By default, the grouping column (Category_Name) becomes the row label of the result.
106 Saraswati Informatics Practices XII

Iterate over DataFrame groups

Object returned by the call to groupby() function can be used as an iterator. In previous example, we created
a GroupBy object called dfC. Let us use the for loop to iterate the object dfC:

>>> for key, group_df in dfC:
print("The group for Category Name '{}' has {} rows".format(key,len(group_df)))

which prints the following:

The group for Category Name 'Air Conditioner' has 3 rows
The group for Category Name 'Microwave' has 3 rows

The group for Category Name 'Refrigerator' has 3 rows

The group for Category Name 'Television' has 4 rows
The group for Category Name 'Washing Machine' has 3 rows

From the above command:
• key contains the name of the grouped element i.e. 'Air Conditioner', 'Microwave', 'Refrigerator',
'Television', 'Washing Machine'

• group_df is a normal DataFrame containing only the data referring to the key.

Select a Group

Using the get_group() method, we can select a single group. For example, to select a particular group
called ‘Television’ from the GroupBy object dfC:

>>> print (dfC.get_group('Television'))



Item_Num Unit_Price Sales_Quantity
0 T001_Panasonic 27800 8
4 T002_Sony 42200 4
8 T003_LG 32500 4
12 T004_Sony 65800 5

Grouping by One Key



To produce a result, we can apply an aggregate function to this DataFrameGroupBy object, which will perform the appropriate apply/combine steps to produce the desired result. To keep it simple, let us use the aggregate function sum() with the groupby key.


# Finding the values contained in the "Category_Name" group

>>> df.groupby('Category_Name').sum()

Unit_Price Sales_Quantity
Category_Name
Air Conditioner 90700 28
Microwave 61100 17
Refrigerator 90400 14
Television 168300 21
Washing Machine 66499 26

Here, the aggregate function sum() calculates the Category_Name-wise totals of Unit_Price and Sales_Quantity. The sum() method is just one possibility here; you can apply any common Pandas or NumPy aggregation function, as well as any valid DataFrame operation.
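A minimal, runnable sketch of this split-apply-combine step, using a hypothetical three-row subset of the sales data:

```python
import pandas as pd

# Hypothetical three-row subset of the sales data
df = pd.DataFrame({
    'Category_Name': ['Television', 'Television', 'Microwave'],
    'Unit_Price': [27800, 42200, 13600],
    'Sales_Quantity': [8, 4, 8]})

# Split-apply-combine: sum() produces one aggregated row per category,
# indexed by the (sorted) group keys
totals = df.groupby('Category_Name').sum()
print(totals)
```

Note that the grouping keys are sorted in the result by default; pass sort=False to groupby() to keep the order of first appearance.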

Resetting the groupby row index

We can reset the grouped row index in pandas with the reset_index() function to make the index start from 0.
For example,

>>> df.groupby('Category_Name').sum().reset_index()

Category_Name Unit_Price Sales_Quantity
0 Air Conditioner 90700 28
1 Microwave 61100 17
2 Refrigerator 90400 14
3 Television 168300 21
4 Washing Machine 66499 26

Ordering groupby results



We can sort the GroupBy result using the sort_values() method. For example, let us find the total Unit_Price for each Category_Name and store it in a new DataFrame df1:



>>> df1 = df.groupby('Category_Name')['Unit_Price'].sum().reset_index().sort_values(by='Unit_Price')



>>> df1

Category_Name Unit_Price
1 Microwave 61100
4 Washing Machine 66499
2 Refrigerator 90400
0 Air Conditioner 90700
3 Television 168300
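The same aggregate-then-sort chain can be sketched end to end with a small hypothetical table:

```python
import pandas as pd

# Hypothetical prices across three categories
df = pd.DataFrame({
    'Category_Name': ['Television', 'Microwave', 'Television', 'Refrigerator'],
    'Unit_Price': [27800, 13600, 42200, 43800]})

# Aggregate, restore a 0-based index, then order by the total (ascending)
df1 = (df.groupby('Category_Name')['Unit_Price'].sum()
         .reset_index()
         .sort_values(by='Unit_Price'))
print(df1)
```

Passing ascending=False to sort_values() would give the largest total first instead.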

Displaying SQL-style grouped output



The groupby() method uses an option called as_index to display SQL-style grouped output. For example, to display the SQL-style Category_Name-wise aggregate sum of the data:

>>> df.groupby('Category_Name', as_index=False).sum()



Category_Name Unit_Price Sales_Quantity
0 Air Conditioner 90700 28
1 Microwave 61100 17
2 Refrigerator 90400 14
3 Television 168300 21
4 Washing Machine 66499 26

Example: Write the command to find the maximum Unit_Price of each Category_Name of DataFrame
df.

# Find the maximum Unit_Price of each Category_Name

>>> df.groupby('Category_Name').Unit_Price.max()

Category_Name
Air Conditioner 43700
Microwave 28750
Refrigerator 43800
Television 65800
Washing Machine 32600
Name: Unit_Price, dtype: int64

Grouping by Two Keys



To use two keys in the groupby() function, let us group first on "Category_Name" and then, within each category, on "Item_Num":


# Grouping on two keys: Category_Name and Item_Num

>>> dfD = df.groupby(['Category_Name', 'Item_Num'])



Or
>>> dfD = df.groupby([df['Category_Name'], df['Item_Num']])

>>> dfD.first()

Unit_Price Sales_Quantity
Category_Name Item_Num
Air Conditioner A001_LG 23500 11
A002_Carrier 43700 6
A003_Samsung 23500 11
Microwave M001_LG 13600 8
M002_Samsung 18750 4
M003_LG 28750 5
Refrigerator R001_LG 43800 6
R002_Samsung 23300 4
R003_Onida 23300 4
Television T001_Panasonic 27800 8
T002_Sony 42200 4
T003_LG 32500 4
T004_Sony 65800 5
Washing Machine W001_IFB 32600 12
W002_LG 24200 10
W003_Samsung 9699 4
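Grouping by a list of columns produces a hierarchical (MultiIndex) result, as this minimal sketch with hypothetical item numbers shows:

```python
import pandas as pd

# Hypothetical subset with two candidate grouping columns
df = pd.DataFrame({
    'Category_Name': ['Television', 'Television', 'Microwave'],
    'Item_Num': ['T001', 'T002', 'M001'],
    'Unit_Price': [27800, 42200, 13600]})

# Grouping by a list of columns produces a two-level (hierarchical) index
dfD = df.groupby(['Category_Name', 'Item_Num']).first()
print(dfD.index.nlevels)
```

Individual rows of such a result are addressed with a tuple label, e.g. dfD.loc[('Television', 'T002')].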

Column-wise aggregations – optimized statistical methods

For simple statistical aggregations (of numeric columns of a DataFrame), we can call methods like sum(), max(), min(), mean(), etc. Before applying the functions, let us create a new DataFrame called dfN by adding a new column 'Total_Amount' to the previous DataFrame df:

# Compute a Total_Amount value for each row of df
>>> Amount = [] # a blank list

>>> for index, row in df.iterrows():
        Amt = row['Unit_Price'] * row['Sales_Quantity'] # Calculating a row value
        Amount.append(Amt) # append current amount to the list
>>> dfN = df.assign(Total_Amount = Amount) # A new column 'Total_Amount' created

>>> print (dfN)



Category_Name Item_Num Unit_Price Sales_Quantity Total_Amount
0 Television T001_Panasonic 27800 8 222400
1 Washing Machine W003_Samsung 9699 4 38796
2 Refrigerator R001_LG 43800 6 262800
3 Microwave M001_LG 13600 8 108800
4 Television T002_Sony 42200 4 168800
5 Air Conditioner A001_LG 23500 11 258500
6 Microwave M002_Samsung 18750 4 75000
7 Washing Machine W001_IFB 32600 12 391200
8 Television T003_LG 32500 4 130000
9 Refrigerator R002_Samsung 23300 4 93200
10 Air Conditioner A002_Carrier 43700 6 262200
11 Microwave M003_LG 28750 5 143750
12 Television T004_Sony 65800 5 329000
13 Washing Machine W002_LG 24200 10 242000
14 Air Conditioner A003_Samsung 23500 11 258500
15 Refrigerator R003_Onida 23300 4 93200

>>> df1 = dfN.groupby(['Category_Name', 'Item_Num']).cumsum()


>>> df1.sum() # Summing of all numeric columns or series
Unit_Price 476999

Sales_Quantity 106

Total_Amount 3078146
dtype: int64

# Summing a particular column or series

>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).sum()
Category_Name

Air Conditioner 779200
Microwave 327550

Refrigerator 449200

Television 850200
Washing Machine 671996

Name: Total_Amount, dtype: int64 In
# Finding the mean of all series of a DataFrame
>>> print (dfN.groupby('Category_Name').mean())
se

Unit_Price Sales_Quantity Total_Amount
Category_Name
Air Conditioner 30233.333333 9.333333 259733.333333
Microwave 20366.666667 5.666667 109183.333333
Refrigerator 30133.333333 4.666667 149733.333333
Television 42075.000000 5.250000 212550.000000
Washing Machine 22166.333333 8.666667 223998.666667



Number of unique column values per group



We can find the number of unique column values per group by using the .nunique() method on a GroupBy object. For example, to find the number of unique Sales_Quantity values for each Category_Name of DataFrame dfN:

>>> dfN.groupby('Category_Name')["Sales_Quantity"].nunique()
Category_Name
Air Conditioner 2
Microwave 3
Refrigerator 2
Television 3
Washing Machine 3
Name: Sales_Quantity, dtype: int64
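A tiny sketch makes the counting behaviour clear — duplicates within a group are counted once (the category names below are hypothetical shorthand):

```python
import pandas as pd

# Hypothetical quantities with a duplicate inside one group
df = pd.DataFrame({
    'Category_Name': ['AC', 'AC', 'AC', 'TV'],
    'Sales_Quantity': [11, 6, 11, 8]})

# nunique() counts the distinct values per group: 'AC' has {11, 6} -> 2
uq = df.groupby('Category_Name')['Sales_Quantity'].nunique()
print(uq)
```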

Using the .agg() Method with groupby()


When we have a GroupBy object, we may choose to apply one or more functions to one or more columns, even different functions to individual columns. We can aggregate by multiple functions using the .agg() method. Simply pass a list of the functions that you would like to apply to your dataset. The .agg() method allows us to easily and flexibly specify these details.
It takes arguments as given below:

• list of function names to be applied to all selected columns

• tuples of (colname, function) to be applied to all selected columns
• dict of (df.col, function) to be applied to each df.col
• Apply >1 functions to selected column(s) by passing names of functions to agg() as a list

For example, to find the Category_Name-wise sum of Total_Amount, the command is:

>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).agg('sum')

Category_Name
Air Conditioner 779200

Microwave 327550
Refrigerator 449200
Television 850200
Washing Machine 671996

Or

>>> dfN.groupby(dfN['Category_Name']).agg({'Total_Amount':['sum']}).reset_index()
Category_Name Total_Amount
sum
0 Air Conditioner 779200
1 Microwave 327550
2 Refrigerator 449200
3 Television 850200
4 Washing Machine 671996
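Passing a list of function names to .agg() is the most common form; a self-contained sketch (with hypothetical amounts) shows the result has one column per function:

```python
import pandas as pd

# Hypothetical amounts for two categories
df = pd.DataFrame({
    'Category_Name': ['TV', 'TV', 'Microwave'],
    'Total_Amount': [222400, 168800, 108800]})

# A list of function names applies every function to each group
stats = df.groupby('Category_Name')['Total_Amount'].agg(['count', 'min', 'max'])
print(stats)
```

The dict form, e.g. .agg({'Total_Amount': ['sum']}), applies the listed functions only to the named column.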



Example: Write the command to find the Category_Name-wise aggregates, applying multiple aggregation functions (count, min, mean and max) to Total_Amount.

# Apply min, mean and max to Total_Amount grouped by Category_Name
>>> dfN['Total_Amount'].groupby(dfN['Category_Name']).agg(['min', 'mean', 'max'])

Or

>>> dfN.groupby('Category_Name').Total_Amount.agg(['count','min', 'mean', 'max'])


count min mean max
Category_Name
Air Conditioner 3 258500 259733.333333 262200

Microwave 3 75000 109183.333333 143750
Refrigerator 3 93200 149733.333333 262800
Television 4 130000 212550.000000 329000
Washing Machine 3 38796 223998.666667 391200

Example: Write the command to create a hierarchical index by finding the minimum and maximum of all numeric columns of DataFrame dfN.

# Apply min and max to all numeric columns of dfN grouped by Category_Name
# A Hierarchical index will be created

>>> dfN[['Unit_Price', 'Sales_Quantity', 'Total_Amount']].groupby(dfN['Category_Name']).agg(['min', 'max'])
Unit_Price Sales_Quantity Total_Amount

min max min max min max
Category_Name
Air Conditioner 23500 43700 6 11 258500 262200
Microwave 13600 28750 4 8 75000 143750
Refrigerator 23300 43800 4 6 93200 262800
Television 27800 65800 4 8 130000 329000
Washing Machine 9699 32600 4 12 38796 391200



Example: Write the command to find the min and max of all numeric columns of DataFrame dfN, flipping the layout above by moving a whole level of columns to rows.

# We can call .stack() on the returned object!


>>> dfN[['Unit_Price', 'Sales_Quantity', 'Total_Amount']].groupby(dfN['Category_Name']).agg(['min', 'max']).stack()

Unit_Price Sales_Quantity Total_Amount



Category_Name

Air Conditioner min 23500 6 258500
max 43700 11 262200
Microwave min 13600 4 75000
max 28750 8 143750
Refrigerator min 23300 4 93200
max 43800 6 262800
Television min 27800 4 130000
max 65800 8 329000
Washing Machine min 9699 4 38796
max 32600 12 391200

Example: Create a histogram for the Total_Amount of dfN with bin size 5.
# Creating histogram for Total_Amount
>>> import matplotlib.pyplot as plt
>>> dfN.hist(column="Total_Amount", bins=5) # Plotting 5 bins

>>> plt.show()


Figure 4.1 Histogram of grouped data.

Grouping with Custom .apply() Functions



Just as you can apply custom functions to a column in your DataFrame, you can do the same with groups. As we know, the .apply() method takes as arguments the following:



• a general or user defined function



• any other parameters that the function would take



Let us apply the .apply() method with the groupby() method to find the Item_Num-wise total amount for each Category_Name of the original DataFrame df. That is:

>>> df

Category_Name Item_Num Unit_Price Sales_Quantity
0 Television T001_Panasonic 27800 8
1 Washing Machine W003_Samsung 9699 4
2 Refrigerator R001_LG 43800 6
3 Microwave M001_LG 13600 8
4 Television T002_Sony 42200 4
5 Air Conditioner A001_LG 23500 11
6 Microwave M002_Samsung 18750 4
7 Washing Machine W001_IFB 32600 12
8 Television T003_LG 32500 4
9 Refrigerator R002_Samsung 23300 4
10 Air Conditioner A002_Carrier 43700 6
11 Microwave M003_LG 28750 5
12 Television T004_Sony 65800 5
13 Washing Machine W002_LG 24200 10
14 Air Conditioner A003_Samsung 23500 11
15 Refrigerator R003_Onida 23300 4

# User defined function to calculate the total amount

>>> def Calculate(ndf):
        return (ndf.Unit_Price * ndf.Sales_Quantity)
>>> df.groupby(['Category_Name', 'Item_Num']).apply(Calculate)

Category_Name Item_Num

Air Conditioner A001_LG 5 258500
A002_Carrier 10 262200
A003_Samsung 14 258500
Microwave M001_LG 3 108800
M002_Samsung 6 75000
M003_LG 11 143750
Refrigerator R001_LG 2 262800
R002_Samsung 9 93200
R003_Onida 15 93200
Television T001_Panasonic 0 222400
T002_Sony 4 168800
T003_LG 8 130000
T004_Sony 12 329000
Washing Machine W001_IFB 7 391200
W002_LG 13 242000
W003_Samsung 1 38796
dtype: int64

4.5 Data Transformation



While aggregation must return a reduced version of the data, transformation can return a transformed version of the full data to recombine. For such a transformation, the output has the same shape as the input. A transformation on a group or a column returns an object that is indexed the same as the data being grouped. If you want to get a new value for each original row, use transform(); the transform function must return a result that is the same size as the group chunk.
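The contrast with aggregation can be shown in a few lines — the sketch below (hypothetical shorthand category names) broadcasts each group total back onto every original row:

```python
import pandas as pd

# Hypothetical quantities; two rows share the same category
df = pd.DataFrame({
    'Category_Name': ['AC', 'AC', 'TV'],
    'Sales_Quantity': [11, 6, 8]})

# transform() broadcasts each group total back onto every original row,
# so the result has the same length as the input (3 rows, not 2 groups)
t = df.groupby('Category_Name')['Sales_Quantity'].transform('sum')
print(list(t))
```

An aggregation with .sum() would instead return only two rows, one per group.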
Before applying the transform() method, let us create a DataFrame dfT by sorting the previous DataFrame dfN on Category_Name.

>>> dfT = dfN.sort_values('Category_Name') # a new DataFrame dfT is created


>>> dfT

Category_Name Item_Num Unit_Price Sales_Quantity Total_Amount
5 Air Conditioner A001_LG 23500 11 258500
10 Air Conditioner A002_Carrier 43700 6 262200
14 Air Conditioner A003_Samsung 23500 11 258500
3 Microwave M001_LG 13600 8 108800
6 Microwave M002_Samsung 18750 4 75000
11 Microwave M003_LG 28750 5 143750
2 Refrigerator R001_LG 43800 6 262800
9 Refrigerator R002_Samsung 23300 4 93200
15 Refrigerator R003_Onida 23300 4 93200
0 Television T001_Panasonic 27800 8 222400
4 Television T002_Sony 42200 4 168800
8 Television T003_LG 32500 4 130000
12 Television T004_Sony 65800 5 329000
1 Washing Machine W003_Samsung 9699 4 38796
7 Washing Machine W001_IFB 32600 12 391200
13 Washing Machine W002_LG 24200 10 242000

>>> dfT.groupby('Category_Name')["Sales_Quantity"].transform('sum')

5 28
10 28
14 28
3 17
6 17
11 17
2 14
9 14
15 14
0 21
4 21
8 21
12 21
1 26
7 26
13 26
Name: Sales_Quantity, dtype: int64

You will notice how this returns a different-size result from our normal groupby() aggregations. Instead of only showing the totals for the 5 category names, we retain the same number of rows as the original data set. That is the unique feature of transform(). Figure 4.2 shows the process of the transform() method.
[Figure: Split → Apply (sum) → Combine (Transform). The input rows are split into one table per Category_Name, each group's Sales_Quantity is summed, and the group total is then broadcast back onto every original row.]
Figure 4.2 Combined data using transform() method.

Example: Write the command to create, for each row, the group mean of the columns "Unit_Price" and "Sales_Quantity".
# Find the group-wise mean for each row
>>> dfT.groupby('Category_Name')["Unit_Price", "Sales_Quantity"].transform('mean')

Unit_Price Sales_Quantity
5 30233.333333 9.333333
10 30233.333333 9.333333
14 30233.333333 9.333333
3 20366.666667 5.666667
6 20366.666667 5.666667
11 20366.666667 5.666667
2 30133.333333 4.666667
9 30133.333333 4.666667
15 30133.333333 4.666667
0 42075.000000 5.250000
4 42075.000000 5.250000
8 42075.000000 5.250000
12 42075.000000 5.250000
1 22166.333333 8.666667
7 22166.333333 8.666667
13 22166.333333 8.666667

Example: A DataFrame df contains the following data:


at

>>> df

Student_Name Age Gender Test1 Test2 Test3



0 Aashna 16.0 F 7.6 8.5 7.6
1 NaN NaN NaN NaN NaN NaN
2 Jack 16.0 M 8.6 NaN NaN
3 Somya 17.0 F 6.5 7.9 8.8
4 Raghu 15.0 M 6.8 7.7 7.9
5 Mathew 16.0 M 9.2 9.0 NaN
6 Nancy 14.0 F 6.8 8.7 8.8



Write the command to find the mean of all numeric columns as per Gender.
# Read the student data from a CSV file into DataFrame df
>>> df = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Student.csv')

# Create a function that groups the data by a column and returns the per-group means
>>> def mean_age(ndf, col): # here ndf is a DataFrame and col is the column name
        # groups the data by a column (i.e., Gender) and returns the mean per group
        return ndf.groupby(col).mean()

# Use pipe() to apply the mean_age function, grouping on the given column
>>> df.pipe(mean_age, col='Gender')
which will display the following output:

Age Test1 Test2 Test3
Gender
F 15.666667 6.966667 8.366667 8.4
M 15.666667 8.200000 8.350000 7.9
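The pipe() call above can be condensed into a minimal, self-contained sketch (hypothetical two-column data, and `group_mean` is an illustrative name, not a pandas function):

```python
import pandas as pd

# Hypothetical scores with a grouping column
df = pd.DataFrame({'Gender': ['F', 'M', 'F'], 'Test1': [8.0, 6.0, 6.0]})

def group_mean(ndf, col):
    # group by `col` and return the per-group means of the numeric columns
    return ndf.groupby(col).mean()

# pipe() passes the DataFrame as the first argument of the given function,
# which lets such helpers be chained like built-in methods
out = df.pipe(group_mean, col='Gender')
print(out)
```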

4.6 .applymap() Function
The .applymap() function performs the specified operation on every element of a DataFrame. Remember that, for numeric functions, all columns (except the index column) of the DataFrame must be of numeric type. The syntax is:

DataFrame.applymap(func)
Here,
• func: Python function, returns a single value from a single value.

• Returns: Transformed DataFrame.



For example, suppose we have a DataFrame called tdf with the following data:
>>> Test = {'A': [1, 2, 3, 4, 5, 6],

'B': [6, 7, 8, 9, 10, 11]}


>>> tdf = pd.DataFrame(Test, columns=['A', 'B'])

>>> tdf

A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
5 6 11

Let us multiply each element of the DataFrame tdf by 5:


>>> func = lambda x: x*5

>>> tdf.applymap(func)
A B
0 5 30
1 10 35
2 15 40
3 20 45
4 25 50
5 30 55
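A minimal, self-contained version of the same element-wise idea:

```python
import pandas as pd

# applymap() visits every element of the DataFrame, one value at a time
tdf = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
doubled = tdf.applymap(lambda x: x * 2)
print(doubled)
```

Note that applymap() works element-wise, unlike apply(), which receives a whole row or column (a Series) at a time.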

Let us create another DataFrame dfs with the following data and apply the applymap() function.

>>> import pandas as pd

>>> Data = {'Name': ['Aashna', 'Simran', 'Jack', 'Raghu', 'Somya', 'Ronald'],
'English': [87, 64, 58, 74, 87, 78],

'Accounts': [76, 76, 68, 72, 82, 68],

'Economics': [82, 69, 78, 67, 78, 68],

'Bst': [72, 56, 63, 64, 66, 71],
'IP': [78, 75, 82, 86, 67, 71]}

>>> dfs = pd.DataFrame(Data, columns=['Name', 'English', 'Accounts', 'Economics', 'Bst', 'IP'])
>>> dfs

Name English Accounts Economics Bst IP
0 Aashna 87 76 82 72 78
1 Simran 64 76 69 56 75
2 Jack 58 68 78 63 82
3 Raghu 74 72 67 64 86
4 Somya 87 82 78 66 67
5 Ronald 78 68 68 71 71

>>> dfn = dfs.set_index(['Name']) # Create a new DataFrame indexed permanently



>>> dfn

English Accounts Economics Bst IP



Name
Aashna 87 76 82 72 78
Simran 64 76 69 56 75
Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Somya 87 82 78 66 67
Ronald 78 68 68 71 71

Now, using the above DataFrame dfn, apply the .applymap() function to convert every numeric cell value into float:
>>> dfn.applymap(float)

English Accounts Economics Bst IP
Name
Aashna 87.0 76.0 82.0 72.0 78.0
Simran 64.0 76.0 69.0 56.0 75.0
Jack 58.0 68.0 78.0 63.0 82.0
Raghu 74.0 72.0 67.0 64.0 86.0
Somya 87.0 82.0 78.0 66.0 67.0
Ronald 78.0 68.0 68.0 71.0 71.0

Example: Write the command to give an increment of 5% to all students' marks in DataFrame dfn using the applymap() function.

# Create a function to increase marks by 5%
>>> def increase5(x):
        return x + x*0.05
>>> dfn.applymap(increase5) # Temporarily increases marks by 5%
English Accounts Economics Bst IP
Name

Aashna 91.35 79.8 86.10 75.60 81.90
Simran 67.20 79.8 72.45 58.80 78.75
Jack 60.90 71.4 81.90 66.15 86.10
Raghu 77.70 75.6 70.35 67.20 90.30
Somya 91.35 86.1 81.90 69.30 70.35
Ronald 81.90 71.4 71.40 74.55 74.55



Or

# Use the lambda function



>>> dfn.applymap(lambda x:x + x*0.05)



4.7 Reindexing Pandas DataFrames

Reindexing in pandas is a process that changes the row labels and column labels of a DataFrame. This is core to the functionality of pandas, as it enables label alignment across multiple objects which may originally have different indexing schemes. Performing a reindex includes the following steps:

• Reordering existing data to match a new set of labels.


• Inserting NaN markers where no data exists for a label.

• Possibly, filling missing data for a label using some type of logic (defaulting to adding NaN values).
The syntax is:
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None,
copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Function Applications in Pandas 121

Here,
• labels : New labels/index to conform the axis specified by ‘axis’ to.
• index, columns : New labels/index to conform to. Preferably an Index object, to avoid duplicating data.

• axis : Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
• method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional.

• copy : Return a new object, even if the passed indexes are the same.

• level : Broadcast across a level, matching Index values on the passed MultiIndex level.
• fill_value : Fill existing missing (NaN) values, and any new element needed for successful

DataFrame alignment, with this value before computation. If data in both corresponding

at
DataFrame locations is missing the result will be missing.
• limit : Maximum number of consecutive elements to forward or backward fill.

• tolerance : Maximum distance between the original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] – target) <= tolerance.
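Before walking through the larger examples, the core reindex behaviour can be sketched with a hypothetical two-row table:

```python
import pandas as pd

# Hypothetical marks table
dfs = pd.DataFrame({'Name': ['Aashna', 'Jack'], 'English': [87, 58]})

# Reorder the rows; label 2 does not exist in dfs, so without fill_value
# its cells would become NaN -- here fill_value supplies a default instead
r = dfs.reindex([1, 0, 2], fill_value=0)
print(r)
```

The original dfs is left untouched; reindex() returns a new object.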

Changing the order of the rows

In
To change the order (the index) of the rows of the previous DataFrame dfs:
>>> dfs.reindex([5, 4, 3, 2, 1, 0])

So, the reindexed DataFrame will be:



Name English Accounts Economics Bst IP

5 Ronald 78 68 68 71 71
4 Somya 87 82 78 66 67
3 Raghu 74 72 67 64 86
2 Jack 58 68 78 63 82
1 Simran 64 76 69 56 75
0 Aashna 87 76 82 72 78

>>> dfs.reindex([3, 4, 5, 2, 0, 1])


So, the reindexed DataFrame will be:

Name English Accounts Economics Bst IP

3 Raghu 74 72 67 64 86
4 Somya 87 82 78 66 67

5 Ronald 78 68 68 71 71
2 Jack 58 68 78 63 82
0 Aashna 87 76 82 72 78
1 Simran 64 76 69 56 75

Changing the order of the columns


To change the order (the index) of the columns of the previous DataFrame dfs:

>>> ChangeColumns = ['Name', 'Accounts', 'English', 'Bst', 'Economics', 'IP']

>>> dfs.reindex(columns=ChangeColumns)
So, the reindexed DataFrame will be:

Name Accounts English Bst Economics IP

0 Aashna 76 87 72 82 78
1 Simran 76 64 56 69 75
2 Jack 68 58 63 78 82
3 Raghu 72 74 64 67 86
4 Somya 82 87 66 78 67
5 Ronald 68 78 71 68 71

Reindexing with new index values

We can add new rows or columns by reindexing with new index values. By default, values in the new index that do not have corresponding records in the DataFrame are assigned NaN. Let us create a label-indexed DataFrame from DataFrame dfs:


>>> ndf = dfs.set_index(['Name'])

>>> ndf
English Accounts Economics Bst IP

Name

Aashna 87 76 82 72 78
Simran 64 76 69 56 75
Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Somya 87 82 78 66 67
Ronald 78 68 68 71 71

Now, create a new DataFrame called df1 with a new row index ‘Meghna’ using DataFrame ndf:

>>> df1 = ndf.reindex(['Aashna', 'Simran', 'Jack', 'Raghu', 'Meghna', 'Somya', 'Ronald'])


>>> df1

So, the reindexed DataFrame will be:



English Accounts Economics Bst IP


Name
Aashna 87 76 82 72 78
Simran 64 76 69 56 75

Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Meghna NaN NaN NaN NaN NaN

Somya 87 82 78 66 67

Ronald 78 68 68 71 71

Notice the above output, where the new index is populated with NaN values. Here, we can fill in the missing values using the fill_value parameter:
>>> df1 = ndf.reindex(['Aashna', 'Simran', 'Jack', 'Raghu', 'Meghna', 'Somya', 'Ronald'], fill_value=73)

>>> df1

So, the reindexed DataFrame will be:

English Accounts Economics Bst IP

Name
Aashna 87 76 82 72 78

Simran 64 76 69 56 75
Jack 58 68 78 63 82
Raghu 74 72 67 64 86
Meghna 73 73 73 73 73
Somya 87 82 78 66 67
Ronald 78 68 68 71 71

Similarly, create a new column 'TotalMarks' by reindexing the columns of DataFrame df1:



>>> ChangeColumns = ['Name', 'Accounts', 'English', 'Bst', 'Economics', 'IP', 'TotalMarks']



>>> df1 = df1.reindex(columns=ChangeColumns)


>>> df1

So, the reindexed DataFrame will be:



English Accounts Economics Bst IP TotalMarks



Name
Aashna 87 76 82 72 78 NaN

Simran 64 76 69 56 75 NaN
Jack 58 68 78 63 82 NaN

Raghu 74 72 67 64 86 NaN
Meghna 73 73 73 73 73 NaN

Somya 87 82 78 66 67 NaN
Ronald 78 68 68 71 71 NaN

# To calculate the TotalMarks column values


>>> df1.TotalMarks = df1.Accounts + df1.English + df1.Bst + df1.Economics + df1.IP

>>> df1
English Accounts Economics Bst IP TotalMarks
Name

Aashna 87 76 82 72 78 395

Simran 64 76 69 56 75 340

Jack 58 68 78 63 82 349
Raghu 74 72 67 64 86 363

Meghna 73 73 73 73 73 365

Somya 87 82 78 66 67 380

Ronald 78 68 68 71 71 356

4.8 Altering Labels or Changing Column/Row Names

In pandas, there are two ways in which one can change the column names of a DataFrame. One way to rename columns is to use df.columns and assign new names directly. To demonstrate this, let us create a simple DataFrame as follows:
>>> import pandas as pd
>>> Data1 = {'Customer_id': ['C_01', 'C_02', 'C_03', 'C_04', 'C_05', 'C_06'],
        'Product': ['Makeup', 'Mascara', 'Foundation', 'Lip Gloss', 'Eyeshadow', 'Eyeliner'],
        'Charges': [650, 450, 550, 250, 150, 125]}

>>> dfm = pd.DataFrame(Data1, columns=['Customer_id', 'Product', 'Charges'])


>>> dfm

Customer_id Product Charges
0 C_01 Makeup 650
1 C_02 Mascara 450
2 C_03 Foundation 550
3 C_04 Lip Gloss 250
4 C_05 Eyeshadow 150
5 C_06 Eyeliner 125

To change the columns of the dfm DataFrame, we can assign the list of new column names to dfm.columns as:
>>> dfm.columns = ['CustomerID','Product_Choice','Fees']

>>> dfm
CustomerID Product_Choice Fees
0 C_01 Makeup 650
1 C_02 Mascara 450
2 C_03 Foundation 550
3 C_04 Lip Gloss 250
4 C_05 Eyeshadow 150
5 C_06 Eyeliner 125

A problem with this approach is that one has to supply names for all the columns in the DataFrame; it would not work if we want to change the name of just one column. Also, the above method is not applicable to index labels. Another way to change column names in pandas is to use the .rename() function.

Changing column name using .rename() function

Using the .rename() function, one can change the names of specific columns easily; not all the column names need to be changed. One of the biggest advantages of the .rename() function is that we can rename as many or as few columns as we want. The syntax is:

DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True,
inplace=False, level=None)

Here,

• mapper, index and columns: Dictionary value; the key refers to the old name and the value refers to the new name. Remember that only one of these parameters can be used at once.
• axis: int or string value, 0/’row’ for Rows and 1/’columns’ for Columns.

• copy: Copies underlying data if True.


• inplace: Makes changes in the original DataFrame if True.
• level: Used to specify the level when the DataFrame has a multi-level index.
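A minimal sketch of these parameters in action (hypothetical two-column table):

```python
import pandas as pd

# Hypothetical teacher table
tdf = pd.DataFrame({'TNO': ['T01', 'T02'], 'SALARY': [25600, 22000]})

# Only the labels mentioned in the mappers change; with the default
# inplace=False a new DataFrame is returned and tdf itself is untouched
new = tdf.rename(columns={'TNO': 'Teacher_No'}, index={0: 'zero'})
print(list(new.columns))
```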

Let us create a DataFrame as given below:



>>> import pandas as pd


>>> Teacher = {'TNO' : ['T01', 'T02', 'T03', 'T04', 'T05'],
        'TNAME' : ['Rakesh Sharma', 'Jugal Mittal', 'Sharmila Kaur', 'Sandeep Kaushik', 'Sangeeta Vats'],
        'TADDRESS' : ['245-Malviya Nagar', '34 Ramesh Nagar', 'D-31 Ashok Vihar', 'MG-32 Shalimar Bagh', 'G-35 Malviya Nagar'],
        'SALARY' : [25600, 22000, 21000, 15000, 18900]}



>>> tdf = pd.DataFrame(Teacher, columns=['TNO', 'TNAME', 'TADDRESS', 'SALARY'])


>>> tdf

TNO TNAME TADDRESS SALARY
0 T01 Rakesh Sharma 245-Malviya Nagar 25600
1 T02 Jugal Mittal 34 Ramesh Nagar 22000
2 T03 Sharmila Kaur D-31 Ashok Vihar 21000
3 T04 Sandeep Kaushik MG-32 Shalimar Bagh 15000
4 T05 Sangeeta Vats G-35 Malviya Nagar 18900

Renaming a single column


For example, let us change a column name (TNO to Teacher_No) of above DataFrame tdf:
>>> tdf.rename(columns = {'TNO': 'Teacher_No'}, inplace=True) # inplace=True to affect DataFrame

>>> tdf

Teacher_No TNAME TADDRESS SALARY
0 T01 Rakesh Sharma 245-Malviya Nagar 25600
1 T02 Jugal Mittal 34 Ramesh Nagar 22000
2 T03 Sharmila Kaur D-31 Ashok Vihar 21000
3 T04 Sandeep Kaushik MG-32 Shalimar Bagh 15000
4 T05 Sangeeta Vats G-35 Malviya Nagar 18900

From the above result, the first column is renamed as ‘Teacher_No’.

Renaming multiple columns

Let us change column names TNAME to Teacher_Name, TADDRESS to Teacher_Address and SALARY to
Income of above DataFrame tdf:

>>> tdf.rename(columns = {'TNAME': 'Teacher_Name',
        'TADDRESS': 'Teacher_Address',
        'SALARY': 'Income'}, inplace=True)
From the above output:
di
In
• second column is renamed as ‘Teacher_Name’.
• third column is renamed as 'Teacher_Address'.

• fourth column is renamed as ‘Income’.


So, the resultant DataFrame will be:

>>> tdf

Teacher_No Teacher_Name Teacher_Address Income
0 T01 Rakesh Sharma 245-Malviya Nagar 25600
1 T02 Jugal Mittal 34 Ramesh Nagar 22000
2 T03 Sharmila Kaur D-31 Ashok Vihar 21000
3 T04 Sandeep Kaushik MG-32 Shalimar Bagh 15000
4 T05 Sangeeta Vats G-35 Malviya Nagar 18900



Renaming row index or row names


Another good thing about the pandas rename() function is that we can also use it to change row indexes or row
names. We just need to use the index argument to specify that we want to change the index, not columns.
For example, to change row names 0 and 1 to 'zero' and 'one' in our tdf DataFrame, we will construct
a dictionary with old row index names as keys and new row index names as values.
>>> tdf.rename(index={0:'zero',1:'one'})
     Teacher_No     Teacher_Name      Teacher_Address  Income
zero        T01    Rakesh Sharma    245-Malviya Nagar   25600
one         T02     Jugal Mittal      34 Ramesh Nagar   22000
2           T03    Sharmila Kaur     D-31 Ashok Vihar   21000
Function Applications in Pandas 127

3           T04  Sandeep Kaushik  MG-32 Shalimar Bagh   15000
4           T05    Sangeeta Vats   G-35 Malviya Nagar   18900
Note that the above result does not change the DataFrame (tdf) row names permanently, because we
omitted the inplace=True option.
Renaming column name and row index simultaneously

With pandas' rename() function, one can also change both column names and row names simultaneously by
using both the columns and index arguments with corresponding mapper dictionaries.
>>> tdf.rename(columns={'Teacher_No':'T_Number'},
               index={0:'zero',1:'one'})
     T_Number     Teacher_Name      Teacher_Address  Income
zero      T01    Rakesh Sharma    245-Malviya Nagar   25600
one       T02     Jugal Mittal      34 Ramesh Nagar   22000
2         T03    Sharmila Kaur     D-31 Ashok Vihar   21000
3         T04  Sandeep Kaushik  MG-32 Shalimar Bagh   15000
4         T05    Sangeeta Vats   G-35 Malviya Nagar   18900
Note that the above result does not change the DataFrame (tdf) row and column names permanently,
because we omitted the inplace=True option.

Using function input


Pandas rename() function can also take a function as input instead of a dictionary. For example, we can
write a lambda function to take the current column names and consider only the first six characters for the
new column names.
>>> tdf.rename(columns=lambda x: x[0:6])
  Teache           Teache               Teache  Income
0    T01    Rakesh Sharma    245-Malviya Nagar   25600
1    T02     Jugal Mittal      34 Ramesh Nagar   22000
2    T03    Sharmila Kaur     D-31 Ashok Vihar   21000
3    T04  Sandeep Kaushik  MG-32 Shalimar Bagh   15000
4    T05    Sangeeta Vats   G-35 Malviya Nagar   18900
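The function-input form can be checked with a minimal sketch on a small stand-in DataFrame; the truncating lambda matches the example above, and str.upper is shown as a second illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({'Teacher_No': ['T01'], 'Income': [25600]})

# Truncate every column name to its first six characters
short = df.rename(columns=lambda x: x[0:6])
print(list(short.columns))    # ['Teache', 'Income']

# Any function of one string works, e.g. upper-casing the names
upper = df.rename(columns=str.upper)
print(list(upper.columns))    # ['TEACHER_NO', 'INCOME']
```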
ew

Points to Remember
1. The pipe() function applies a custom operation to the entire DataFrame.
2. The apply() method allows us to pass a function that will run on every value in a column.
3. The applymap() method applies a function to each element of the DataFrame, returning a
   scalar for every element of the DataFrame.
4. The groupby() function splits the data into groups based on the levels of a categorical variable.
5. Reindexing in pandas is a process that changes the row labels and column labels of a DataFrame.
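Points 2 and 3 can be contrasted with a tiny sketch (on newer pandas versions DataFrame.map is the preferred spelling of applymap, but the behaviour shown here is the same):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# apply() runs the function once per column (axis=0 is the default)
col_sums = df.apply(sum)
print(col_sums['A'], col_sums['B'])   # 3 7

# applymap() runs the function on every individual element
squared = df.applymap(lambda v: v * v)
print(squared['B'].tolist())          # [9, 16]
```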

SOLVED EXERCISES
1. What is the use of describe() function?

Ans. The .describe() function is a useful summarisation tool that will quickly display statistics for any
     variable or group it is applied to.
2. Differentiate between .apply() and .applymap() functions.
Ans. The .apply() applies a function along any axis of the DataFrame, whereas .applymap() applies a function
     to each element of the DataFrame.
3. How is the default grouping made using the pandas groupby() method?
Ans. By default, the grouping is made via the index (rows) axis.
4. A DataFrame dfW is given with the following data:
      Age  Wage Rate
   0   20        2.5
   1   25        3.5
   2   30        4.5
   3   35        5.5
   4   40        7.0
   5   45        8.7
   6   50        9.5
   7   55       10.0
   8   60       12.5
   (a) Write a program using the pipe() function to add 2 to each numeric column of DataFrame dfW.
   (b) Find the maximum value of each column using the apply() function.
   (c) Find the row-wise maximum value using the apply() function.
Ans. (a) # Wageplus.py
         import pandas as pd
         dfW = pd.DataFrame({'Age' : [20, 25, 30, 35, 40, 45, 50, 55, 60],
                             'Wage Rate' : [2.5, 3.5, 4.5, 5.5, 7.0, 8.7, 9.5, 10.0, 12.5]},
                            columns = ['Age', 'Wage Rate'])
         def Add_Two(Data, aValue):
             return Data + aValue
         print(dfW.pipe(Add_Two, 2))
     (b) dfW.loc[:, 'Age':'Wage Rate'].apply(max, axis=0)
     (c) dfW.loc[:, 'Age':'Wage Rate'].apply(max, axis=1)


5. A DataFrame dfB is given with the following data:
   Itemno  ItemName     Color  Price
   1       Ball Pen     Black   15.0
   2       Pencil       Blue     5.5
   3       Ball Pen     Green   10.5
   4       Gel Pen      Green   11.0
   5       Notebook     Red     15.5
   6       Ball Pen     Green   11.5
   7       Highlighter  Blue     8.5
   8       Gel Pen      Red     12.5
   9       P Marker     Blue     8.6
   10      Pencil       Green   11.5
   11      Ball Pen     Green   10.5
   Answer the following questions using the groupby() function (assume that the DataFrame name is dfB):
   (a) Display Color-wise item and price of each ItemName category.
   (b) Find the maximum price of each ItemName.
   (c) Find the minimum price of each ItemName.
   (d) Count the number of items in each ItemName category.
Ans. (a) dfX = dfB.groupby(['ItemName', 'Color'])
         dfX.first()
     (b) dfB.groupby('ItemName').Price.max()
     (c) dfB.groupby('ItemName').Price.min()
     (d) dfB.groupby('ItemName')['Color'].apply(lambda x: x.count())
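Answers (b) and (d) can be verified with a reduced, hypothetical subset of the dfB rows; the grouped results here are only for the four rows sketched:

```python
import pandas as pd

# A reduced stand-in for dfB with a few of its rows
dfB = pd.DataFrame({'ItemName': ['Ball Pen', 'Pencil', 'Ball Pen', 'Gel Pen'],
                    'Color':    ['Black',    'Blue',   'Green',    'Green'],
                    'Price':    [15.0,       5.5,      10.5,       11.0]})

# (b) maximum price per ItemName
print(dfB.groupby('ItemName').Price.max()['Ball Pen'])   # 15.0

# (d) count of items per ItemName
counts = dfB.groupby('ItemName')['Color'].apply(lambda x: x.count())
print(counts['Ball Pen'])                                # 2
```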
6. A DataFrame contains the following information:
       Customer                  Region  Order_Date    Sales  Month      Year
   0   K Books Distributers      East    2016-04-13  1256000  April      2016
   1   GBC P House               South   2017-08-23  1359000  August     2017
   2   S Books Store             North   2016-10-11  1670000  October    2016
   3   TM Books                  West    2019-08-25  1490000  August     2019
   4   IND Books Distributors    North   2017-09-04  1560000  September  2017
   5   Aniket Pustak             West    2018-05-17  1180000  May        2018
   6   M Pustak Bhandar          South   2018-11-28  2100000  November   2018
   7   BOOKWELL Distributors     North   2017-01-22  1630000  January    2017
   8   Jatin Book Agency         West    2016-12-21  1380000  December   2016
   9   New India Agency          South   2018-09-12  1730000  September  2018
   10  Libra Books Distributors  East    2016-10-04  1210000  October    2016
   Answer the following questions using the groupby() function:
   (a) Find the region-wise average sales.
   (b) Find the year-wise average sales.
   (c) Find the region-wise aggregates of sales applying multiple aggregation functions like count,
       max, min and mean.
   (d) What will be the output of the following:
       (i) dfN.groupby('Year').Sales.sum().round(decimals=2)
       (ii) dfN.groupby('Year').Sales.max().round(decimals=2)
       (iii) dfN.groupby('Region').Sales.min().round(decimals=2)

Ans. For data:
     dfN = pd.read_csv('E:/IPSource_XII/IPXIIChap04/Distributors.csv', index_col=0)
     (a) dfN.groupby('Region').Sales.mean().round(decimals=2)
     (b) dfN.groupby('Year').Sales.mean().round(decimals=2)
     (c) dfN.groupby('Region').Sales.agg(['count', 'max', 'min', 'mean']).round(decimals=2)
     (d) (i) Year
             2016    5516000
             2017    4549000
             2018    5010000
             2019    1490000
             Name: Sales, dtype: int64
         (ii) Year
             2016    1670000
             2017    1630000
             2018    2100000
             2019    1490000
             Name: Sales, dtype: int64
         (iii) Region
             East     1210000
             North    1560000
             South    1359000
             West     1180000
             Name: Sales, dtype: int64
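Part (c)'s multi-function aggregation can also be sketched on a reduced, hypothetical three-row subset of the data above, so the numbers below hold only for those rows:

```python
import pandas as pd

# Three sample rows standing in for dfN
dfN = pd.DataFrame({'Region': ['East', 'East', 'North'],
                    'Sales':  [1256000, 1210000, 1670000]})

# Several aggregates at once, one column per function
stats = dfN.groupby('Region').Sales.agg(['count', 'max', 'min', 'mean']).round(decimals=2)
print(stats.loc['East', 'mean'])          # 1233000.0
print(int(stats.loc['North', 'count']))   # 1
```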


REVIEW QUESTIONS
1. What is the use of the pandas groupby() method?
2. Define the apply() function with an example.


3. A DataFrame called dfD is given with the following data set:
      Quantity  Unit Price
   0     12000         200
   1      6500         180
   2       500         250
   3       500         350
   4     13000         120
   5      8800         130
   6      2400         120
   7      8000         170
   8      8500         130
   9       450         142
Write a command to find the Total Price (Quantity * Unit Price) using the lambda function.
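One possible solution, sketched here on a reduced three-row stand-in for dfD; it uses apply() with a row-wise lambda (one of several workable approaches), and the column name 'Total Price' is our own choice:

```python
import pandas as pd

# A reduced stand-in for dfD
dfD = pd.DataFrame({'Quantity': [12000, 6500, 500],
                    'Unit Price': [200, 180, 250]})

# Row-wise lambda: Total Price = Quantity * Unit Price
dfD['Total Price'] = dfD.apply(lambda row: row['Quantity'] * row['Unit Price'], axis=1)
print(dfD['Total Price'].tolist())   # [2400000, 1170000, 125000]
```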
4. A DataFrame dfT is given with the following data:
      Name           Qualification  Experience
   0  Ms. Mittal     Masters                 8
   1  Minu Arora     Graduate               11
   2  Sharmila Kaur  Post Graduate           7
   3  Sangeeta Vats  Masters                 9
   4  Ramesh Kumar   Graduate                6
   5  Jatin Ghosh    Post Graduate           8
   6  Yash Sharma    Masters                10
   Write the command for the following using the pandas groupby() function:
   (a) Find the average experience for each qualification.
   (b) Find the total experience for each qualification.
   (c) Find the average experience for each qualification and name.
5. A sample dataset is given with four quarter sales data for five employees:
   Name of Employee    Sales   Quarter  State
   R Sahay            125600         1  Delhi
   George K           235600         1  Tamil Nadu
   Jaya Priya         213400         1  Kerala
   Manila Sahai       189000         1  Haryana
   Ryma Sen           456000         1  West Bengal
   Manila Sahai       172000         2  Haryana
   Jaya Priya         201400         2  Kerala
   George K           225400         2  Tamil Nadu
   R Sahay            140600         2  Delhi
   Ryma Sen           389000         2  West Bengal
   Jaya Priya         242100         3  Kerala
   George K           262000         3  Tamil Nadu
   Ryma Sen           339000         3  West Bengal
   Manila Sahai       228000         3  Haryana
   R Sahay            193100         3  Delhi
   George K           292000         4  Tamil Nadu
   Manila Sahai       278000         4  Haryana
   Jaya Priya         282100         4  Kerala
   Ryma Sen           369000         4  West Bengal
   R Sahay            233100         4  Delhi
   Write the command for the following using pandas groupby() function (assume that the DataFrame
   name is dfQ):

(a) Find the total sales of each employee.
(b) Find the state-wise total sales.
(c) Find the total sales by both employee name-wise and state-wise.
(d) Find the maximum individual sale by state.
(e) Find the employee name-wise aggregates of sales applying multiple aggregation functions
    like count, max, min and mean.
(f) Find the output of the following:
    (i) dfQ.groupby('Name of Employee').Sales.mean()
    (ii) dfQ.groupby('State').Sales.mean()
6. A sample dataset is given with different columns as given below:

Item_ID  ItemName           Manufacturer    Price  CustomerName  City
PC01     Personal Computer  HCL India       42000  N Roy         Delhi
LC05     Laptop             HP USA          55000  H Singh       Mumbai
PC03     Personal Computer  Dell USA        32000  R Pandey      Delhi
PC06     Personal Computer  Zenith USA      37000  C Sharma      Chennai
LC03     Laptop             Dell USA        57000  K Agarwal     Bengaluru
AL03     Monitor            HP USA           9800  S C Gandhi    Delhi
CS03     Hard Disk          Dell USA         5400  B S Arora     Mumbai
PR06     Motherboard        Intel USA       17500  A K Rawat     Delhi
BA03     UPS                Microtek India   4300  C K Naidu     Chennai
MC01     Monitor            HCL India        6800  P N Ghosh     Bengaluru
Write the command for the following (assume that the DataFrame name is dfA):
(a) Find city-wise total price.
(b) Find manufacturer-wise total price.


Introduction to NumPy
(Numeric Python)
Chapter – 5
5.1 Introduction
NumPy is a Python package which stands for 'Numerical Python' or 'Numeric Python'. It contains a
collection of tools and techniques that can be used to solve mathematical models of problems in
Science and Engineering on a computer. One of these tools is a high-performance multidimensional
array object that is a powerful data structure for efficient computation of arrays and matrices.
Both NumPy and pandas are often used together, as the pandas library relies heavily on the NumPy
array for the implementation of pandas data objects and shares many of its features. NumPy is the core
library for scientific computing, which contains a powerful n-dimensional array object and provides tools for
integrating C, C++, etc. It is also useful for linear algebra, random number generation, etc. A NumPy array can
also be used as an efficient multi-dimensional container for generic data.
5.2 Installing NumPy


NumPy is installed by default when we install Python pandas. If you want to install it separately, go to your
command prompt and type "pip install numpy", because the standard Python distribution doesn't come bundled
with the NumPy module.
pip install numpy
Once the installation is completed, you can import the module in your IDLE by typing "import numpy as np".
Before we can use NumPy we will have to import it. It has to be imported like any other module:
import numpy
But you will hardly ever see this. NumPy is usually renamed to np:
import numpy as np
The above code renames the NumPy namespace to np. This permits us to prefix NumPy functions, methods,
and attributes with "np" instead of typing "numpy".

What is NumPy?
NumPy is a module for Python. NumPy stands for Numeric Python, which is a Python package for the
computation and processing of single and multi-dimensional array elements.
After the import command, you can check the NumPy version by using the following command:
>>> print (np.__version__)
1.14.3

5.3 NumPy Array

A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers.
NumPy arrays are great alternatives to Python lists, but still very much different at the same time. We use
a Python NumPy array instead of a list for the following reasons:
• It efficiently implements multidimensional arrays whose elements are all of the same type. That is,
  NumPy arrays are homogeneous in nature, i.e., they comprise one data type (integer, float, double,
  etc.), unlike lists.
• It is more compact than a list (it doesn't need to store both value and type as a list does).
• It occupies less memory as compared to a list.
• It is fast in terms of execution (reading/writing) and at the same time very convenient to work with.
• It is easy to work with and gives users the opportunity to perform calculations across entire arrays.
• It is capable of performing Fourier Transforms and reshaping the data stored in multi-dimensional
  arrays.
• NumPy provides in-built functions for linear algebra and random number generation.
• You can also do the standard stuff, like indexing, comparisons and logical operations.
The main data structure in NumPy is the ndarray, which is a shorthand name for the N-dimensional array
object, which is in the form of rows and columns. Figure 5.1 shows the basic structure of different arrays
that are managed in NumPy.
1D NumPy array:
Indices:   0   1   2   3   4   5   6   7
Values:   24  12  10  34  17  13  32  51
[The figure also shows a 2D NumPy array, with axis 0 running down the rows and axis 1 across the
columns, and a 3D NumPy array with axes 0, 1 and 2.]
Figure 5.1 Structures of different arrays.
When working with NumPy, data in an ndarray is simply referred to as an array. NumPy's main object is
the homogeneous multi-dimensional array.
• It is a fixed-sized array in memory that contains data of the same type, such as integers or
  floating point values.
• In NumPy, dimensions are called axes. The number of axes is the rank.
• NumPy's array class is called ndarray. It is also known by the alias array.
• Every item in an ndarray takes the same size of block in memory.
• Each element in an ndarray is an object of a data-type object (called dtype).
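These properties can be inspected directly through ndarray attributes; a minimal sketch:

```python
import numpy as np

a = np.array([[5, 4, 7], [6, 3, 6]])
print(a.ndim)       # 2  -- number of axes (rank)
print(a.shape)      # (2, 3)  -- rows and columns
print(a.dtype)      # the data type shared by every element
print(a.itemsize)   # size in bytes of each element's block in memory
```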

5.3.1 Creating a NumPy Array

There are several ways to create a NumPy array. We can create a NumPy array using the numpy.array()
function or the np.array() function. To use the latter form (i.e., np.array()), you need to make sure that the
NumPy library is imported in your environment.
If we pass in a list of lists, it will automatically create a NumPy array with the same number of rows and
columns. It creates an ndarray from any object exposing the array interface, or from any method that returns
an array. The syntax to create a NumPy array is:
numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)

From the above syntax, except object, the remaining options are optional.
• object: It represents the collection object. It can be a list, tuple, dictionary, set, etc.
• dtype: It is used to mention the data type. We can change the data type of the array elements by
  changing this option to the specified type.
• copy: By default, it is True, which means the object is copied.
• order: Specifies the memory layout of the array. If object is not an array, the newly created array
  will be in 'C' order (row major) unless 'F' is specified.
• subok: If True, then sub-classes will be passed through, otherwise the returned array will be
  forced to be a base-class array (default).
• ndmin: It specifies the minimum number of axes (dimensions) that the resulting array should have.
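The dtype and ndmin options can be seen in action in this short sketch (the variable names are just for illustration):

```python
import numpy as np

# dtype forces the element type of the new array
f = np.array([1, 2, 3], dtype=float)
print(f)          # [1. 2. 3.]

# ndmin guarantees a minimum number of dimensions
m = np.array([1, 2, 3], ndmin=2)
print(m.shape)    # (1, 3) -- the 1D input was promoted to 2D
```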
H

Creating Single-dimensional (1D) NumPy array


A simple way to create an array from data or simple Python data structures like a list is to use the array()
function. The NumPy array elements must be of the same data type. For example,
>>> import numpy as np
>>> AList = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # A Python list with homogeneous elements
>>> A = np.array(AList)                   # The list passed into the function array()
                                          # converts the Python list to a NumPy array
>>> A                                     # Display the content of the NumPy array
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print (A)                             # Printing the NumPy array
[1 2 3 4 5 6 7 8 9]
In practice, there is no need to declare a Python list. The operation can be combined with the NumPy
.array() function. For example,
>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])   # Create a NumPy array
>>> print (A)                                   # Printing the NumPy array
[1 2 3 4 5 6 7 8 9]

>>> print(type(A)) # Output: <class 'numpy.ndarray'>


As you can see, NumPy's array class is called ndarray.

Creating Two-dimensional (2D) NumPy array

A two-dimensional array is sometimes called a matrix array. A two-dimensional array is one where numbers are
arranged into rows and columns. Every axis in a NumPy array has a number, starting with 0. In this way, they
are similar to Python indexes in that they start at 0, not 1. So the first axis is axis 0, which refers to rows.
The second axis (in a 2D array) is axis 1, which refers to the columns. Let us see a practical example
showing the index values of a matrix with 4 rows and 3 columns, as shown below:
          col 0   col 1   col 2
row 0    [0, 0]  [0, 1]  [0, 2]
row 1    [1, 0]  [1, 1]  [1, 2]
row 2    [2, 0]  [2, 1]  [2, 2]
row 3    [3, 0]  [3, 1]  [3, 2]
For multi-dimensional arrays, the third axis is axis 2. The following is a 2 x 3 matrix because it has 2
rows and 3 columns.

            axis = 1 (3 columns)
axis = 0        5  4  7
(2 rows)        6  3  6
To create a 2D array, each dimension will be added with a comma (",") separator and it has to be within
the bracket []. Let us create the two-dimensional array using the above data:
>>> M = np.array([[5, 4, 7], [6, 3, 6]])   # Create a rank 2 array of integers
>>> M
array([[5, 4, 7],
       [6, 3, 6]])
To print the 2D array, the command is:
>>> print (M)
[[5 4 7]
 [6 3 6]]
The following is a 3 x 3 matrix, meaning there are 3 rows and 3 columns:
            axis = 1 (3 columns)
axis = 0        5   4   7
(3 rows)       11   3   6
                8  17   9
>>> B1 = np.array([[5, 4, 7], [11, 3, 6], [8, 17, 9]])   # Create a 3x3 2D array of integers
>>> B1
array([[ 5,  4,  7],
       [11,  3,  6],
       [ 8, 17,  9]])
To print the 2D array, the command is:
>>> print (B1)
[[ 5  4  7]
 [11  3  6]
 [ 8 17  9]]

Example: A matrix is given with following values:
            axis = 1 (4 columns)
axis = 0        5  4  7  3
(3 rows)        6  3  6  4
                8  6  9  5
Write the NumPy command to create an array B2 with the above data and print the array.
Solution: B2 = np.array([[5, 4, 7, 3], [6, 3, 6, 4], [8, 6, 9, 5]])   # array of integers
          print (B2)
[[5 4 7 3]
 [6 3 6 4]
 [8 6 9 5]]
Example: Suppose you have 3 friends, each having 5 different marks as given below:
56 43 48 65 54
65 54 76 34 54
48 67 54 56 31
Write the NumPy command to create an array Marks and print the array.
Solution: Marks = np.array([[56, 43, 48, 65, 54],
                            [65, 54, 76, 34, 54],
                            [48, 67, 54, 56, 31]])
          print (Marks)
[[56 43 48 65 54]
 [65 54 76 34 54]
 [48 67 54 56 31]]

Printing Data using Index


When you use the print() command with the matrix index, then it will print the individual position values.
For example, let us print some positional array data using the previous matrix Marks:
>>> print (Marks[2, 3])   # Prints: 56
>>> print (Marks[1, 0])   # Prints: 65
>>> print (Marks[1, :])   # Prints: 2nd row, i.e., [65 54 76 34 54]
>>> print (Marks[:, 2])   # Prints: 3rd column, i.e., [48 76 54]

Using arange()

In some occasions, you may want to create values evenly spaced within a given interval. The NumPy arange()
function (sometimes called np.arange) is a tool for creating numeric sequences in Python. This function
returns evenly spaced values within a given interval. The syntax is:
numpy.arange(start=, stop=, step=, dtype)
Here,
• start: The start parameter indicates the beginning value of the range. This parameter is optional,
  so if you omit it, it will automatically default to 0.
• stop: The stop parameter indicates the end of the range.
• step: The step parameter specifies the spacing between values in the sequence. If you don't
  specify a step value, by default the step value will be 1.
• dtype: The dtype parameter specifies the data type. This parameter is optional.
For example, to create a sequence of 10 evenly spaced values from 0 to 9:
>>> x = np.arange(10)   # x = np.arange(start = 0, stop = 10)
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print (x)
[0 1 2 3 4 5 6 7 8 9]
The above np.arange() function holds only the stop parameter. The remaining two parameters take their
default values.
For example, to create values from 1 to 10, you can use the arange() function:
>>> y = np.arange(1, 11)   # y = np.arange(start = 1, stop = 11)
>>> y
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> print (y)
[ 1  2  3  4  5  6  7  8  9 10]
If you want to change the step, you can add a third number in the parentheses. It will change the step
parameter. For example, to create an array starting from 1 with an interval of 3 till 20, the arange()
function will be written as:
>>> z = np.arange(1, 20, 3)

>>> print (z)
[ 1  4  7 10 13 16 19]
To create an array with values from 4 to 10 with float data type:
>>> p = np.arange(4, 11, dtype=float)
>>> print (p)
[ 4.  5.  6.  7.  8.  9. 10.]

Creating Array using Functions

NumPy also provides many functions to create arrays. You can initialize arrays with ones or zeros, but you
can also make arrays that get filled up with evenly spaced, constant or random values.
• ones(). This function returns a new array of specified size, filled with ones. The syntax is:
  numpy.ones(shape, dtype = float, order = 'C')
  Here,
  − shape: is the shape of the array.
  − dtype: is the data type. It is optional. The default value is float64.
  − order: Default is 'C', which is the essential row-major style.
  For example,
  >>> b = np.ones((1, 2, 3), dtype=np.int16)   # Create an array of all ones
  >>> print (b)
  [[[1 1 1]
    [1 1 1]]]
  Similarly, we can create a NumPy array with all values True using the dtype parameter. For example,
  to create a 2 x 2 array with all values True:
  >>> x = np.ones((2, 2), dtype=bool)
  >>> print (x)
  [[ True  True]
   [ True  True]]
• zeros(). This function returns a new array of specified size, filled with zeros. These are also called
  null values. The syntax is:
  numpy.zeros(shape, dtype = float, order = 'C')
  For example, to create a null value 1D array X of size 5, the command is:
  >>> X = np.zeros(5)
  >>> X
  array([0., 0., 0., 0., 0.])
  To create a 2D null value array a of 2x2 size:
  >>> a = np.zeros((2, 2))   # Create an array of all zeros

>>> print (a)
[[0. 0.]
 [0. 0.]]
Similarly, we can create a NumPy array with all values False using the dtype parameter. For example,
to create a 3 x 3 array with all values False:
>>> y = np.zeros((3, 3), dtype=bool)
>>> print (y)
[[False False False]
 [False False False]
 [False False False]]
• full(). This function returns a new array with constant values, with the same shape and type as a
  given array, filled with a fill_value. The syntax is:
  numpy.full(shape, fill_value, dtype = None, order = 'C')
  Here, fill_value is the constant value which will be filled as array elements.
  For example,
  >>> c = np.full((3, 3), 9)   # Create a constant array
  >>> print (c)
  [[9 9 9]
   [9 9 9]
   [9 9 9]]
  Similarly, we can create a NumPy array with all Boolean values like True or False using the dtype
  parameter.
  For example, to create a 3 x 3 array A with all values True:
  >>> A = np.full((3, 3), True)   # Create a Boolean True array
  >>> print (A)
  [[ True  True  True]
   [ True  True  True]
   [ True  True  True]]
  For example, to create a 3 x 4 array B with all values False:
  >>> B = np.full((3, 4), False)   # Create a Boolean False array
  >>> print (B)
  [[False False False False]
   [False False False False]
   [False False False False]]
• fill(). This function is used to fill scalar values into an existing array. The syntax is:
  ndarray.fill(value)
  Here, value is a scalar value. For example, let us create a 1D array containing numeric values
  0 to 9 and then fill the array with a scalar value 5.
  >>> x = np.arange(10)   # create a 1D array x with values 0 to 9

>>> print (x)   # printing the array values
[0 1 2 3 4 5 6 7 8 9]
>>> x.fill(5)   # Filling all array values with 5
>>> print (x)   # printing the array values
[5 5 5 5 5 5 5 5 5 5]
• empty(). This function creates an uninitialized array of specified shape and data type. The elements
  in the array will show random values. The default data type of the values is float. The syntax is:
  numpy.empty(shape, dtype = float, order = 'C')
  For example, to create a 1D array:
  >>> A = np.empty(4)
  >>> print (A)   # Prints random values
  [6.30731226e+202 4.73591267e+202 8.78952566e-315 0.00000000e+000]
  >>> B = np.empty(4, dtype=int)
  >>> print (B)   # Prints integer type random values
  [       209    4354560 1030881280 1912602624]
  For example, to create 2D arrays:
  >>> d = np.empty([3, 3], dtype = int)
  >>> print (d)
  [[110   0   0]
   [  0   0   0]
   [128   0   0]]
  >>> d = np.empty([3, 3], dtype = int)
  >>> print (d)
  [[65535     0     0]
   [    0     0     0]
   [  404     0     0]]
  From the above two results, it's not safe to assume that np.empty will return an array of all
  zeros. In many cases, as previously shown, it will return uninitialized garbage values.
Example: What will be the output of the following lines?
(a) X = np.array([1, 3])
    print (X)
    X.fill(0)
    print (X)
(b) arr = np.empty((2, 5), dtype=bool)
    arr.fill(1)
    print (arr)
Solution: (a) [1 3]
              [0 0]
          (b) [[ True  True  True  True  True]
               [ True  True  True  True  True]]
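The initializers covered so far (ones(), zeros(), full(), empty() and fill()) can be compared side by side in a short sketch:

```python
import numpy as np

ones = np.ones((2, 2), dtype=int)    # every element is 1
zeros = np.zeros((2, 2), dtype=int)  # every element is 0
nines = np.full((2, 2), 9)           # every element is the given fill_value

filled = np.empty(3, dtype=int)      # contents are uninitialized garbage
filled.fill(7)                       # overwrite every element with 7

print(ones.sum(), zeros.sum(), nines.sum())   # 4 0 36
print(filled.tolist())                        # [7, 7, 7]
```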

• eye(). This function is used to create a 2D array with ones on the diagonal and zeros elsewhere.
  The syntax is:
  numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')
  Here,
  − N is the number of rows in the output.
  − M is the number of columns in the output; defaults to N.
  − k is the index of the diagonal. 0 refers to the main diagonal; a positive value refers to an
    upper diagonal, and a negative value to a lower diagonal.
  For example,
  >>> e = np.eye(2)   # Create a 2x2 identity matrix
  >>> print (e)
  [[1. 0.]
   [0. 1.]]
  >>> e1 = np.eye(2, 3)   # Create a 2x3 identity matrix
  >>> print (e1)
  [[1. 0. 0.]
   [0. 1. 0.]]
  >>> e2 = np.eye(3, 3)   # Create a 3x3 identity matrix
  >>> print (e2)
  [[1. 0. 0.]
   [0. 1. 0.]
   [0. 0. 1.]]
• numpy.random.rand(), numpy.random.random() and numpy.random.random_sample(). All
  these functions generate samples from the uniform distribution on [0, 1). Results are from the
  "continuous uniform" distribution over the stated interval. The syntax for these functions is:
  numpy.random.rand(d0, d1, ..., dn)
  numpy.random.random(size=None)
  numpy.random.random_sample(size=None)
  Here, d0, d1, ..., dn are the dimensions of the returned array, and they should all be positive. If
  no argument is given, a single Python float is returned.
  The only difference is in how the arguments are handled. With numpy.random.rand, the length
  of each dimension of the output array is a separate argument.
  With numpy.random.random_sample, the shape argument is a single tuple.
  For example, let us print random values using the above three random commands:
  >>> A = np.random.rand(4)   # uniform in [0, 1)
  >>> print (A)
  [0.60963991 0.37459676 0.97760841 0.27236068]
  >>> B = np.random.random_sample(5)
  >>> print (B)
  [0.26982835 0.97150188 0.35623911 0.90855109 0.80270268]
  >>> C = np.random.random(5)

>>> print (C)


[0.5463147 0.70647374 0.00679288 0.23957108 0.18720496]

For example, to create a 2D array of samples with shape (2, 3), you can write any one of the
following:
A = np.random.random((2, 3))          # Create an array filled with random values
Or
A = np.random.random_sample((2, 3))   # Create an array filled with random values
Or
A = np.random.rand(2, 3)              # Create an array filled with random values
The only difference is that random and random_sample use a single tuple, whereas the rand()
function takes the dimensions as separate values.
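The difference in how the three functions take their shape arguments can be sketched as follows; since the values themselves are random, only the shapes and the [0, 1) range are checked:

```python
import numpy as np

a = np.random.rand(2, 3)               # separate dimension arguments
b = np.random.random((2, 3))           # single tuple argument
c = np.random.random_sample((2, 3))    # single tuple argument

print(a.shape, b.shape, c.shape)       # (2, 3) (2, 3) (2, 3)
print(bool((a >= 0).all() and (a < 1).all()))   # True -- samples lie in [0, 1)
```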

iv
• linspace(). A linspace array is an array of equally spaced values going from a start to an end
  value. The NumPy linspace function (sometimes called np.linspace) is a tool in Python for creating
  numeric sequences as an ndarray. The syntax is:
  numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
  Here,
  − start is the starting value of the sequence.
  − stop is the end value of the sequence, unless endpoint is set to False. In that case, the sequence
    consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note
    that the step size changes when endpoint is False.
  − num is the number of samples to generate. Default is 50 and must be non-negative.
  − endpoint is a Boolean value and optional. The default is True. If it is True, then the stop value
    will be included in the returned array. Otherwise, the stop value will not be included as the
    final value in the returned array.
  − retstep is a Boolean value and optional. If True, return (samples, step), where step is the
    spacing between samples.
  − dtype is the type of the output array. If dtype is not given, the data type is inferred from the
    other input arguments.
  For example,
  >>> L = np.linspace(start = 0, stop = 100, num = 5)
  >>> print (L)
  [  0.  25.  50.  75. 100.]
  Notice that 5 values are created and they are equally spaced from the start (0.) to the stop (100.).
  By declaring a start value, a stop value, and the num of points in between, an array is generated.
  Similarly, let us see another example:
  >>> L1 = np.linspace(1, 10, 10)

>>> print (L1)   # Here, it creates 10 values
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
From the above two arrays, if you observe, all the array results are produced as float type
values. So, to create an array with integers instead of floats, you can use the dtype parameter
as int. For example,
>>> L1 = np.linspace(1, 10, 10, dtype=int)
>>> print (L1)
[ 1  2  3  4  5  6  7  8  9 10]
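The endpoint option described above can be checked with a quick sketch:

```python
import numpy as np

# endpoint=True (the default): stop is included as the last value
inc = np.linspace(0, 1, 5)
print(inc)   # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False: stop is excluded and the step shrinks to (stop-start)/num
exc = np.linspace(0, 1, 5, endpoint=False)
print(exc)   # [0.  0.2 0.4 0.6 0.8]
```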
• reshape(). This function used to give a new shape to an array without changing its data. This is

e
just reshape the array by changing the number of rows and columns of the multi-dimensional

at
array. It accepts the two parameters indicating the row and columns of the new shape of the
array. The reshape function has two required inputs. First, an array. Second, a shape. Remember

iv
NumPy array shapes are in the form of tuples. For example, a shape tuple for an array with two

Pr
rows and three columns would look like this: (2, 3). Also, if the tuple is (2, 3), then the number of
elements in the input array must be 6 elements, i.e., 2 x 3 = 6.
The syntax is:

numpy.reshape(a, newshape, order='C')
Here,
− a is the array to be reshaped.
− newshape should be compatible with the original shape.
Let us first create a 1D array with 6 elements and reshape it into a 2D array with 2 rows and
3 columns.
A = np.array([1, 2, 3, 4, 5, 6])
Now, we use numpy.reshape() to create a new array B by reshaping our initial array A. Notice we
pass numpy.reshape() the array A and a tuple for the new shape (2, 3).
>>> B = np.reshape(A, (2, 3))


Or
>>> B = np.reshape(A, (-1, 3)) # Setting to -1 automatically decides the number of rows
>>> print (B)


[[1 2 3]
 [4 5 6]]
We can also print the shape of B to make sure it matches the tuple we passed to reshape().
>>> print (B.shape)
(2, 3)
Similarly, to create an array with three rows and two columns, we can use the tuple as (3, 2):
>>> B = np.reshape(A, (3, 2))


Or
>>> B = np.reshape(A, (3, -1)) # Setting to -1 automatically decides the number of columns
Introduction to NumPy (Numeric Python) 145

>>> print (B)


[[1 2]
 [3 4]
 [5 6]]

Reshaping 2D Array

For example, let us see how a 4x3 array reshapes into a 3x4 array:

[[ 1  2  3]           [[ 1  2  3  4]
 [ 4  5  6]    -->     [ 5  6  7  8]
 [ 7  8  9]            [ 9 10 11 12]]
 [10 11 12]]

>>> x = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9], [10, 11, 12]])
>>> print (x)
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
>>> x = x.reshape(3,4) # x = np.reshape(x, (3, 4))
>>> print (x)


[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
We can also create an array of sequence values by using the arange() and reshape() functions together.
For example, to create a 3 x 5 array using values from 1 to 15:
>>> a = np.arange(1, 16).reshape(3, 5)


>>> print (a)


[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
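The order='C' argument shown in the reshape syntax above controls how elements are read and placed; a brief sketch of the difference (the values here are chosen only for illustration):

```python
import numpy as np

v = np.arange(6)                          # [0 1 2 3 4 5]
print(np.reshape(v, (2, 3)))              # default 'C' (row-major) order:
                                          # [[0 1 2]
                                          #  [3 4 5]]
print(np.reshape(v, (2, 3), order='F'))   # 'F' (Fortran, column-major) order:
                                          # [[0 2 4]
                                          #  [1 3 5]]
```

Most of the time the default 'C' order is what you want; order='F' fills column by column instead of row by row.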
5.3.2 Transpose of NumPy Array


The transpose of a matrix is a new matrix whose rows are the columns of the original. This makes the
columns of the new matrix the rows of the original. To transpose a matrix, simply use the T attribute
of an array object.
That is:
matrix.T
This will return the transpose of the matrix.


For example,
>>> a = [3, 6, 9]
>>> b = np.array(a)

>>> b.T
array([3, 6, 9]) #Here it didn't transpose because 'a' is 1 dimensional
>>> b = np.array([a])
>>> b.T

array([[3],   # Here it did transpose because the array is 2 dimensional
       [6],
       [9]])

Let us see the transpose of a matrix:

>>> MAT = np.array([ [4, 5, 8, 6],
[2, 3, 2, 4],

[7, 6, 4, 6]])

>>> print (MAT)

[[4 5 8 6]
 [2 3 2 4]
 [7 6 4 6]]
>>> print (MAT.T)

[[4 2 7]
 [5 3 6]
 [8 2 4]
 [6 4 6]]
We can transpose a matrix using numpy.transpose() function. The syntax is:


numpy.transpose(a, axes=None)
For example,
>>> print (np.transpose(MAT))
[[4 2 7]
 [5 3 6]
 [8 2 4]
 [6 4 6]]
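As the earlier example showed, T has no effect on a 1D array. One common workaround, sketched below with illustrative values, is to first reshape the 1D array into a column vector:

```python
import numpy as np

a = np.array([3, 6, 9])
print(a.T.shape)          # (3,) -- unchanged, because a is 1 dimensional

col = a.reshape(-1, 1)    # reshape into a 3x1 column vector
print(col.shape)          # (3, 1)
print(col.T)              # transposing now gives a 1x3 row vector: [[3 6 9]]
```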

5.4 Checking Data Type

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2).
This is due to a difference in the data-type used.


You can find the data type of the elements that are stored in an array. So, if you want to know the data
type of the elements, you can use the dtype attribute, which gives the data type along with its size.
For example, to know the data type of above array:


>>> a = np.array([1, 2, 3])


>>> a.dtype
dtype('int32')
>>> b = np.array([1., 2., 3.])
>>> b.dtype
dtype('float64')

Notice that the above two data types show int32 and float64. Those with numbers in their name indicate
the bit size of the type (i.e. how many bits are needed to represent a single value in memory).
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those
types and their limitations. Because NumPy is built in ‘C’, the types will be familiar to users of ‘C’, Fortran,

and other related languages. The standard NumPy data types are listed in the following Table 5.1.

Table 5.1 Standard NumPy data types.

Data Types   Description
bool_        Boolean (True or False) stored as a byte
int_         Default integer type (same as C long; normally either int64 or int32)
intc         Identical to C int (normally int32 or int64)
intp         Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8         Byte (-128 to 127)
int16        Integer (-32768 to 32767)
int32        Integer (-2147483648 to 2147483647)
int64        Integer (-9223372036854775808 to 9223372036854775807)
uint8        Unsigned integer (0 to 255)
uint16       Unsigned integer (0 to 65535)
uint32       Unsigned integer (0 to 4294967295)
uint64       Unsigned integer (0 to 18446744073709551615)
float_       Shorthand for float64
float16      Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32      Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64      Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_     Shorthand for complex128
complex64    Complex number, represented by two 32-bit floats
complex128   Complex number, represented by two 64-bit floats

Different data types allow us to store data more compactly in memory, but most of the time we simply
work with floating point numbers. Note that in the example above, NumPy auto-detects the data type from
the input.
You can explicitly specify which data type you want. For example,
>>> c = np.array([1, 2, 3], dtype=float)
>>> c.dtype
dtype('float64')
>>> M = np.array([[5, 4, 7], [6, 3, 6]]) #Create a rank 2 array of integers


>>> print (M.dtype)
int32
Type of array can be explicitly defined while creating array. For example,
>>> X = np.array([[5.4, 4, 7], [6, 3, 6]]) # array of float type

>>> print (X)


[[ 5.4 4. 7. ]
[ 6. 3. 6. ]]
>>> Y = np.array([[5, 4, 7], [6, 3, 6]], dtype = 'float') # array of float type

>>> print (Y)

[[5. 4. 7.]
[6. 3. 6.]]

>>> print(Y.dtype) # prints: float64

>>> Z = np.array([[5, 4, 7], [6, 3, 6]], dtype = 'complex') # array of complex type
>>> print (Z)

[[5.+0.j 4.+0.j 7.+0.j]

[6.+0.j 3.+0.j 6.+0.j]]

>>> print(Z.dtype) # prints: complex128
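Besides setting dtype at creation time, an existing array can be converted to another type with the astype() method; a small sketch (the values below are chosen only for illustration):

```python
import numpy as np

f = np.array([5.4, 4.0, 7.9])
i = f.astype(int)      # converts float -> int, truncating towards zero (no rounding)
print(i)               # [5 4 7]
print(i.dtype)         # an integer type; the exact name (int32/int64) is platform dependent
```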

5.5 Finding the Shape and Size of the Array

We know that a NumPy array is a grid of values, all of the same type, and is indexed by a tuple of
nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of
integers giving the size of the array along each dimension.
To get the shape and size of the array, the shape and size attributes associated with the NumPy array are
used. For example, to print the total size (total number of elements) of an array:
>>> import numpy as np
>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]) # Creates a numpy 1D array with rank 1


>>> print("Array Size:", A.size)
Array Size: 9
Similarly, to find the size of a 2D array:
>>> x = np.array([[1, 2, 3, 4], [2, 3, 4, 5]])


>>> x.size # prints: 8


A NumPy array has an attribute called shape that returns [No. of rows, No. of columns]; shape[0] gives
you the number of rows and shape[1] gives you the number of columns. The shape returned is a tuple of integers.
These numbers denote the lengths of the corresponding array dimensions.
E.g.,
• For a 1D array, the shape would be (n, ) where n is the number of elements in the array. Because
the array has only one dimension, shape holds only [No. of columns, ], and shape[1] will be out of the
index range.
• For a 2D array, the shape would be (n, m) where n is the number of rows and m is the number of
columns in the array.


For example, see the 2D array given below:


>>> x = np.array([[1, 2, 3, 4], [2, 3, 4, 5]])
>>> x
array( [ [1, 2, 3, 4 ],
[2, 3, 4, 5 ] ] )

The shape of the array x is:


>>> x.shape
(2, 4)
Here, it shows that the array x has 2 rows and 4 columns. That is

>>> x.shape[0] # prints: 2 as the number of rows
>>> x.shape[1] # prints: 4 as the number of columns

But if the array has only one row (a 1D array), then shape returns [No. of columns, ], and shape[1] will be
out of the index range. That is, the shape would simply be (n, ) rather than either (1, n) or (n, 1) for
row and column vectors respectively.

For example, let us find the shape of 1D array:

>>> A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]) # A NumPy 1D array with rank 1
>>> print("Array Shape:", A.shape) # Returns number of columns, i.e., 9

Array Shape: (9,)

Here, the array has 9 columns.
>>> A.shape[0] # Returns 9

>>> A.shape[1] # Error
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
A.shape[1]
IndexError: tuple index out of range
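If you do want an explicit row or column vector instead of the plain (n, ) shape, reshape can produce one; a short sketch reusing the array A from above:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
row = A.reshape(1, -1)        # shape (1, 9): a single row
col = A.reshape(-1, 1)        # shape (9, 1): a single column
print(row.shape, col.shape)   # (1, 9) (9, 1)
print(row.shape[1])           # 9 -- shape[1] now exists, so no IndexError
```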


Example What will be the output of following program lines?
A = np.array([10, 21, 4, 45, 66, 93])


B = np.zeros(A.size)
print (B)
Output: [0. 0. 0. 0. 0. 0.]
Example What is the command to find the size and shape of the following array?
M = np.array( [[5, 4, 7, 3],


[6, 3, 6, 4],
[8, 6, 9, 5]])
Solution: print("Array size is:", M.size)


print("Array shape is:", M.shape)
We know that the shape of an array also tells us something about the order in which the indices are
processed. At the same time, "shape" can also be used to change the shape of an array. For example,
>>> b = np.array([[1, 2, 3], [4, 5, 6]]) # Create a rank 2 array


>>> b.shape
(2, 3)
>>> b.shape = (3, 2)
>>> print (b)
[[1 2]
 [3 4]
 [5 6]]
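The attributes discussed in this section can all be read from the same array; a quick sketch that ties them together:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])
print(b.ndim)     # 2 -> the rank (number of dimensions)
print(b.shape)    # (2, 3)
print(b.size)     # 6 -> total number of elements, equal to 2 * 3

b.shape = (3, 2)  # assigning to shape reshapes the array in place
print(b.shape)    # (3, 2)
```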

Points to Remember
1. NumPy is a Python library that can be used for scientific and numerical applications and is the

tool to use for linear algebra operations.

2. The numpy.full() function is used to create a numpy array of given shape and all elements
initialized with a given value.

3. The numpy.zeros() function is used to create a numpy array of given shape and type with all
values in it initialized with 0's.
4. The numpy.ones() function is used to create a numpy array of given shape and type with all values
in it initialized with 1's.

5. The numpy.linspace() function is used to create evenly spaced samples over a specified interval.
6. The numpy.reshape() method associated with the ndarray object is used to reshape the array.

7. The numpy.reshape() method takes the data in an existing array, and puts it into an array with the

given shape and returns it.
8. The number of dimensions is the rank of the array.
9. The shape of an array is a tuple of integers giving the size of the array along each dimension.

10. An array created with numpy.empty(...) is uninitialized, i.e., filled with random/junk values.

11. numpy.random.random(...) actually uses a random number generator to fill each of the spots in
the array with a randomly sampled number from 0 to 1.
SOLVED EXERCISES
1. What is NumPy?
Ans. NumPy is a module for Python. NumPy stands for Numeric Python which is a Python package for
the computation and processing of single and multi-dimensional array elements.
2. Write the command for the following:


(a) Print the NumPy version.


(b) Create a null array X of size 10.


(c) Create a 1D array Y with values ranging from 10 to 49.
Ans. (a) print(np.__version__)


(b) X = np.zeros(10)
(c) Y = np.arange(10, 50)


3. Write the commands for the following (assume that the NumPy namespace is np):
- Give a new shape to a 1D array A, i.e., a 2D (4 x 5) array AR, without changing its data.
- Print the array.


- Print the array after transposing the AR data.
Ans. AR = A.reshape(4, 5)
print (AR)
print (AR.T)
4. Write the commands for the following (assume that the NumPy namespace is np):
(a) Create a sequence array A of 20 evenly spaced values from 0 to 20.
(b) Create a 1D array B containing numeric values 0 to 9.
(c) Create a 3x4 floating-point array C filled with ones.

(d) Create a 4x3 array D filled with 3.14.


(e) Create an array F filled with a linear sequence starting at 0, ending at 20, stepping by 2.
(f) Create an array G of evenly spaced values between the given range of values.
(g) Create a 3x3 identity matrix H.

Ans. (a) A = np.arange(0, 20)

(b) B = np.arange(10)
(c) C = np.ones((3, 4), dtype=float)

(d) D = np.full((4, 3), 3.14)

(e) F = np.arange(0, 20, 2)
(f) G = np.linspace(0, 1, 5)

(g) H = np.eye(3)

5. Write the commands for the following (assume that the NumPy namespace is np):
(a) Create and print a 3 x 4 numpy array X with all values True.

(b) Create and print a 4 x 3 numpy array Y with all values False.

(c) Create a 3x3 matrix Z with values ranging from 0 to 8.
Ans. (a) X = np.ones((3, 4), dtype=bool)
print (X)

(b) Y = np.zeros((4, 3), dtype=bool)
print (Y)
(c) Z = np.arange(9).reshape(3,3)
6. What will happen if the start option is missing in arange() function?
Ans. If the 'start' parameter is not given or missing in arange() function, then it will be set to 0.
7. What will be the output of the following?
(a) x = np.arange(0.5, 10, 0.5)


print (x)
(b) >>> x = np.arange(0.5, 10, 0.5, int)


>>> print (x)
(c) x = np.arange(0, 20, 2)
print (x)
(d) L1 = np.linspace(1, 10, 10, dtype=int, endpoint=False)


print (L1)
(e) A1 = np.linspace(2, 11, 10, dtype=int, endpoint=True)


print (A1)
(f) A = np.array([1, 2, 3, 4, 5, 6])
B = np.reshape(A, (-1, 2))
print (B)
Ans. (a) [0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]
(b) [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]
(c) [ 0 2 4 6 8 10 12 14 16 18]
(d) [1 1 2 3 4 5 6 7 8 9]
(e) [ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


(f) [[1 2]
[3 4]
[5 6]]

REVIEW QUESTIONS
1. If X is 2D array with following values (assume that np is used as NumPy namespace):
X = np.array([[1, 2, 3], [4, 5, 6]])

What will it print if you type the following two commands?

(a) print(X) (b) type(X)
(c) X.ndim (d) X.shape

2. What will be the output of the following (assume that np is used as NumPy namespace)?
(a) A = np.zeros((2,5))

print(A)
(b) B = np.ones(7)

print(B)

(c) C = np.ones((3,2))

print(C)
3. What will be the output of the following NumPy command? Explain it.

D = np.zeros((5, 6))
4. If you type following (assume that np is used as NumPy namespace):

n = np.arange(10)

How many values will the array n store, and what are they?
5. Write the command for the following:
(a) Create an array starting from 0 with an interval of 2 till 20.
(b) Convert a 1D array to a 2D array with 2 rows.
(c) Create 3 x 3 identity matrix.


(d) Create a 3 x 3 matrix with values ranging from 0 to 8
(e) Create a 3×3 NumPy array of all True’s.


(f) Create a 1D Numpy Array of length 10 & all elements initialized with value 7.
(g) Create a 2D Numpy Array of 4 rows and 5 columns and all elements initialized with value 9.
(h) Create a 2D numpy array with 3 rows and 4 columns, filled with 1’s.
(i) Create 5 evenly spaced samples in interval [30, 70).


(j) Create evenly spaced samples in interval [30, 70) and also get the step size. Also print the
array and its step size.



Chapter – 6

Indexing, Slicing and Arithmetic Operations in NumPy Array
6.1 Introduction

In Python you have already learnt the slice method to access list and tuple elements. Selecting a slice is

similar to selecting element(s) of a NumPy array. In this text, you will learn how to use indexing and slicing

to access NumPy array elements. In
6.2 Indexing NumPy Array
Once your data is represented using a NumPy array, you can access it using indexing. Indexing is used to
obtain individual elements from an array, but it can also be used to obtain entire rows, columns or planes
from multi-dimensional arrays. The important thing to remember is that indexing in python starts at zero.
Indexing 1D Array
Array indexing refers to any use of the square brackets ([ ]) to index array values. Single element indexing
for a 1D array works exactly like that of other standard Python sequences like list or tuple. It is 0-based,
and accepts negative indices for indexing from the end of the array. Figure 6.1 shows both positive and
negative indices in a 1D array.


Positive Indices:    0    1    2    3    4    5    6    7
1D Array:           24   12   10   34   17   13   32   51
Negative Indices:   -8   -7   -6   -5   -4   -3   -2   -1

Figure 6.1 A 1D array with indices.


For example, to access the array elements:


>>> import numpy as np
>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])


>>> print(A1) # prints: [24 12 10 34 17 13 32 51]
>>> print(A1[0]) # prints: 24
>>> print(A1[5]) # prints: 13


>>> print(A1[-3]) # prints: 13
>>> print(A1[-7]) # prints: 12

Remember that if you use an out-of-range index value, then NumPy will raise an IndexError. For example,

>>> print(A1[-10])
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    print(A1[-10])
IndexError: index -10 is out of bounds for axis 0 with size 8
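A negative index -k is simply shorthand for size - k; a small sketch confirming this with the array A1 from above:

```python
import numpy as np

A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
print(A1[-1], A1[A1.size - 1])   # 51 51 -- the same element
print(A1[-3], A1[5])             # 13 13 -- index -3 is index 8 - 3 = 5
```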

Indexing 2D Array

Unlike lists and tuples, NumPy arrays support multi-dimensional indexing for multi-dimensional arrays.

That means it is not necessary to separate each dimension’s index into its own set of square brackets.
Figure 6.2 shows a 2D array with its indexes (both positive and negative) as given below:

          Col 0      Col 1      Col 2      Col 3      Col 4
Row 0       2          4          9          3         10
         [0, 0]     [0, 1]     [0, 2]     [0, 3]     [0, 4]
        [-3, -5]   [-3, -4]   [-3, -3]   [-3, -2]   [-3, -1]
Row 1       7          6          8          3          5
         [1, 0]     [1, 1]     [1, 2]     [1, 3]     [1, 4]
        [-2, -5]   [-2, -4]   [-2, -3]   [-2, -2]   [-2, -1]
Row 2       4          2          5          1          9
         [2, 0]     [2, 1]     [2, 2]     [2, 3]     [2, 4]
        [-1, -5]   [-1, -4]   [-1, -3]   [-1, -2]   [-1, -1]

Figure 6.2 A 2D array with indices.


Let us create a 2D array and access the array elements using 2D array indexing method:
>>> B = np.array([ [2, 4, 9, 3, 10],


[7, 6, 8, 3, 5],
[4, 2, 5, 1, 9]])
>>> B
array([ [ 2, 4, 9, 3, 10 ],
[ 7, 6, 8, 3, 5 ],
[ 4, 2, 5, 1, 9 ]])
We can select an element of the array using two indices inside a square bracket ([ ]), i.e., B[i, j],
where i is the row index and j is the column index. For example,
>>> print(B[1, 2]) # prints: 8

If you notice the previous print() function, the i (1) and j (2) values are both inside the square brackets,
separated by a comma (,) operator. The print(B[1, 2]) picks row 1, column 2, which has the value 8. This
compares with the syntax you might use with a 2D list (i.e., a list of lists). That is:
>>> # A python list
>>> L1 = [[2, 4, 9, 3, 10],
[7, 6, 8, 3, 5],
[4, 2, 5, 1, 9]]
>>> print (L1[1][2])
which prints: 8

Indexing a Row or Column

To select elements from a 2D array by index, we must use the index positions of the array. Let us see how

to select elements from the following 2D array using index.

>>> B = np.array([ [2, 4, 9, 3, 10],
[7, 6, 8, 3, 5],

[4, 2, 5, 1, 9]])

1. To select a single element from 2D Numpy array by index, we can use [][] operator.
In
ndArray[row_index][column_index]
For example, to select the row 1 and column 2:

print(B[1][2]) # prints: 8

Or we can pass the comma separated list of indices representing row index and column index.
2. To select rows by index from a 2D Numpy array, we can call [] operator to select a single or

multiple row.
ndArray[row_index]

For example, to select the first row, i.e., at the index 0:



>>> B[0]   # get first row
array([ 2, 4, 9, 3, 10])
>>> B[-3]  # get first row using a negative index
array([ 2, 4, 9, 3, 10])
array([ 2, 4, 9, 3, 10]) array([ 2, 4, 9, 3, 10])


Similarly, to select the second row, i.e., the row at index 1, the command is:
print (B[1])  # prints: [7 6 8 3 5]
print (B[-2]) # prints: [7 6 8 3 5]

3. To select multiple rows:


ndArray[start_index: end_index , :]
It will return rows from start_index to end_index – 1 and will include all columns. For example,

select multiple rows from index 1 to 2, the command is:

print (B[1:3, :])
[[7 6 8 3 5]
 [4 2 5 1 9]]

Similarly, to select multiple rows from index 1 to the last index, the command is:
print (B[1: , :])
[[7 6 8 3 5]
 [4 2 5 1 9]]

4. NumPy allows us to select a single or multiple column as well. To select columns by index from a

2D Numpy array, the format is:
ndArray[ : , column_index]

It will return a complete column at the given index. For example, to select the column at index 1, the
command is:
print(B[:, 1])  # prints: [4 6 2]
print(B[:, -4]) # prints: [4 6 2]
The above indexing is just like slicing, and B[:, 1] means:
• for the i or row value, it takes all values (: is a full slice, from start to end)
• for the j value, take 1, i.e., the values of column 1 from all rows
5. To select multiple columns:


ndArray[ : , start_index: end_index]


It will return columns from start_index to end_index - 1 and will include all rows. For example, to
select multiple columns from index 1 to 2, the command is:


print (B[: , 1:3])
[[4 9]
 [6 8]
 [2 5]]
Similarly, to select multiple columns from index 1 to last index, the command is:
print (B[:, 1:])
[[4 9 3 10]
 [6 8 3 5]
 [2 5 1 9]]
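The same row and column selections can also appear on the left side of an assignment to update the array in place; a small sketch reusing the array B from above:

```python
import numpy as np

B = np.array([[2, 4, 9, 3, 10],
              [7, 6, 8, 3, 5],
              [4, 2, 5, 1, 9]])
B[0] = 0        # overwrite the whole first row with zeros
B[:, 1] = 99    # overwrite column 1 in every row with 99
print(B)
# [[ 0 99  0  0  0]
#  [ 7 99  8  3  5]
#  [ 4 99  5  1  9]]
```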

6.3 Slicing NumPy Array

Assigning to and accessing the elements of an array is similar to other sequential data types of Python, i.e.,
lists and tuples. Slicing in the NumPy array is the way to extract a range of elements from an array. Slicing

in the array is performed in the same way as it is performed in the python list except you can do it in more

than one dimension.
Before slicing NumPy array elements, just a quick recap how slicing works with normal Python lists.

Suppose we have a list:

>>> A1 = [24, 12, 10, 34, 17, 13, 32, 51]

We can use slicing to take a sub-list, like this:

>>> x = A1[1:7]
>>> print (x) # or print(A1[1:7])

[12 10 34 17 13 32]

Index:             0    1    2    3    4    5    6    7
List:             24   12   10   34   17   13   32   51
Negative Index:   -8   -7   -6   -5   -4   -3   -2   -1
Notice that the slice notation specifies a start and end value [start:end] and copies the elements
starting at index 1 (i.e., 2nd position in the list) up to 7 - 1 (i.e., 6th index position in the list). Note
that the index starts from 0, just like Python sequences such as list and tuple.
The basic format to slice NumPy array values is:

ndarray[start : end : step]

Here, start is the starting index of the array, end is the end index position, and step is the step
between the indices.
Here, the start, end and step are all optional. These ranges work just like slices for lists. The slice
starts at start and stops one position before end, in steps of size step. If you don't mention
any of the parameters, then the complete array is returned.


Let us see some examples, how list elements are accessed. For example, to access all elements:
>>> print (A1)

Or
>>> print (A1[:]) # Prints all the elements in the list

[24 12 10 34 17 13 32 51]

We can slice list elements in different ways also. We can omit the start, in which case the slice would
start at the beginning of the list. For example, to access first four elements:

>>> A1[:4] # Returns a list with four elements starting from 0th index
[24, 12, 10, 34]

To create another list x using A1, starting from first element:


>>> x = A1[:4] # Creates another list x using A1 from 0th position till 3rd position
>>> print (x) # Prints the array

[24 12 10 34]

We can omit the end, so the slice continues to the end of the list. For example, to print the list elements

starting from 4th position till end, the command is:
>>> print (A1[3:]) # Prints list elements starting from third element (i.e., 4th position) till end

Li
[34 17 13 32 51]

e
The important thing to note is the difference between an index and a slice of length 1. For example,

at
>>> print (A1[0]) # Prints index of 0, i.e., 24
>>> print (A1[0:1]) # Prints slice of [0:1], i.e., [24]

Slicing 1D array

Slicing a 1D NumPy array is almost exactly the same as slicing a list. Slicing is specified using the colon

operator ‘:’ with a [start:end:stop] or ‘from‘ and ‘to‘ index before and after the column, respectively. The
slice extends from the ‘from’ index and ends one item before the ‘to’ index. The NumPy slicing is exactly
same as Python sequences and first index starts from 0th position.
For example,
>>> import numpy as np


>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
>>> A1
array([24, 12, 10, 34, 17, 13, 32, 51])
>>> print (A1[:]) # Prints all the elements in the array


[24 12 10 34 17 13 32 51]

Let us see some slicing examples:



>>> x = A1[1:4] # Creates another array using A1 with elements from index 1 to index 3
>>> print (x) # Prints the array


[12 10 34]
To print first four elements, the command is:


>>> print (A1[:4]) # Prints array elements starting from index 0th position till 3rd position
[24 12 10 34]
Or
>>> print (A1[0:4])
[24 12 10 34]
Here, the first item of the array is sliced by specifying a slice that starts at index 0 and ends at index 3
(one item before the ‘to’ index).
>>> print (A1[3:]) # Prints array elements starting from index 3rd till end
[34 17 13 32 51]
>>> x = A1[1:7:2] # [start:end:step] - start is 1, i.e., 2nd position, end is 7th position and step 2.

>>> print (x) # prints: [12, 34, 13]


>>> print (A1[[2, 3, 6, 7]]) # prints 3rd, 4th, 7th and 8th position values from the array.
[10 34 32 51]

Like Python list, we can also use negative indexes in NumPy array slices. For example, to slice the last

four items in the array, start the slice at -4 and do not specify a 'to' index; that takes the slice to the
end of the dimension.

>>> print(A1[-4:]) # Prints the last four elements
[17 13 32 51]

>>> print (A1[::-1]) # Prints the array in reverse order
[51 32 13 17 34 10 12 24]

e
at
Example A 1D array x1 is given with following values:
[8, 10, 21, -19, 4, 32, 45, -12, 66, 93, 11, -17]
Answer the following:
(a) Create the array x1.
(b) Print the value at index zero.
(c) Print the fifth value.
(d) Print the last value.
(e) Print the second last value.
(f) Print the values from start to 5th position.
(g) Print the values from 5th position to end.
(h) Access the values from 4th to 6th position.
(i) Access the elements at only even places.
(j) Print the elements from first position step by 2.
(k) Print the array in reverse order.

( ) Print the array in reverse order.


i

Solution (a) x1 = np.array([8, 10, 21, -19, 4, 32, 45, -12, 66, 93, 11, -17])
(b) print (x1[0])     (c) print (x1[4])
(d) print (x1[-1])    (e) print (x1[-2])
(f) print (x1[:6])    (g) print (x1[5:])
(h) x1[4:7]           (i) x1[ : : 2]
(j) print (x1[1::2])  (k) print (x1[::-1])


Slicing 2D array
You can slice a 2D array in both axes to obtain a rectangular subset of the original array. For example,
>>> M = np.array([ [5, 4, 7, 3],
[6, 3, 6, 4],
[8, 6, 9, 5] ] )
>>> M
array([ [ 5, 4, 7, 3 ],
[ 6, 3, 6, 4 ],
[ 8, 6, 9, 5 ]])

Access the entries in a 2D array using the square brackets with 2 indices. In particular, access the entry
at row index 1 and column index 2:
>>> print (M[1,2]) # Prints 6

To access the top left or first entry from the array:

>>> print (M[0, 0]) # Prints 5

Negative indices work for NumPy arrays as they do for Python sequences. For example, to access the

bottom right entry in the array:

>>> print (M[-1, -1]) # Prints 5
To access a row at index 2, use the colon : syntax. For example,

at
>>> print (M[2, :]) # Prints [8 6 9 5]

iv
To access a column at index 3 using the colon : operator. For example,
>>> print (M[:, 2]) # Prints: [7 6 9]

Pr
You can slice a 2D array in both axes to obtain a rectangular subset of the original array. For example,

a
to select the sub array of rows at index 1 and 2, and columns at index 1, 2 and 3:

di
>>> subB = M[1:3, 1:4]
>>> print (subB)
In
[[ 3 6 4 ]
[ 6 9 5 ]]
se

Similarly, to select rows 1: (1 to the end of bottom of the array) and columns 2:4 (columns 2 and 3):
ou

>>> print(M[1:, 2:4])


[[ 6 4 ]
H

[ 9 5 ]]
i
at

To print the array in reverse order of column,


>>> print (M[:, ::-1]) # for row order: print (M[::-1])
w

[[ 3 7 4 5 ]
s

[ 4 6 3 6 ]
ra

[ 5 9 6 8 ]]
Sa

Slices vs Indexing
ew

As we saw earlier, you can use an index to select a particular plane column or row. Here, we select row 1,
columns 2:4:
>>> print(M[1, 2:4]) # Prints: [6 4]
N

You can also use a slice of length 1 to do something similar (slice 1:2 instead of index 1):
@

>>> print(M[1:2, 2:4])


[ [ 6 4 ]]
Notice the subtle difference. The first creates a 1D array, the second creates a 2D array with only one
row.
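The difference is easy to confirm by printing the shapes of the two results; a short sketch using the array M from above:

```python
import numpy as np

M = np.array([[5, 4, 7, 3],
              [6, 3, 6, 4],
              [8, 6, 9, 5]])
print(M[1, 2:4].shape)     # (2,)   -- indexing row 1 gives a 1D array
print(M[1:2, 2:4].shape)   # (1, 2) -- the length-1 slice keeps the 2D shape
```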

Example Write a program to print all odd numbers from a NumPy array. For example, if an array
A contains the following values:
[10, 21, 4, 45, 66, 93]
then the result will be:
The odd numbers are: 21 45 93
Solution # allodd.py
# Program to print odd numbers from a NumPy array
import numpy as np
A = np.array([10, 21, 4, 45, 66, 93])
# iterating each number in array
print ("The odd numbers are: ", end='')
for num in A: # A is the numpy array
    # checking condition for odd number
    if num % 2 != 0: # Note. For even number, the condition is: if num % 2 == 0:
        print(num, end = " ")
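The loop above can also be written without explicit iteration: NumPy accepts a Boolean condition as an index. A vectorized sketch of the same idea:

```python
import numpy as np

A = np.array([10, 21, 4, 45, 66, 93])
odd = A[A % 2 != 0]   # the Boolean mask selects only the odd elements
print("The odd numbers are:", odd)   # [21 45 93]
```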
Example Write a program to copy the content of an array A to another array B, replacing all odd
numbers of array A with -1 without altering the original array A. Also, print both the arrays.
For example, if the input array A is:
The array A is [10 51 2 18 4 31 13 5 23 64 29]
The result will be:
The array B is [10 -1 2 18 4 -1 -1 -1 -1 64 -1]
Solution # copyarray.py
import numpy as np
A = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])
B = np.zeros(A.size, dtype=int) # Create array B of integer type with the same size as A
ctr = 0 # to set the index position in array B.
for num in A: # num extracts the values of array A one by one
    # checking condition for odd number
    if num % 2 != 0: # if the number is odd
        B[ctr] = -1
    else:
        B[ctr] = num
    ctr+=1
print("The array A is", A)
print("The array B is", B)
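The same copy-and-replace can be done in one step with np.where(), which picks between two values element-wise based on a condition; a vectorized sketch of the same program:

```python
import numpy as np

A = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])
B = np.where(A % 2 != 0, -1, A)   # -1 where the element is odd, else the original value
print("The array A is", A)
print("The array B is", B)        # [10 -1 2 18 4 -1 -1 -1 -1 64 -1]
```

np.where() returns a new array, so the original array A is left unchanged.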
6.4 Joining Arrays


Two or more arrays can be concatenated together using the NumPy concatenate() function. The syntax is:
numpy.concatenate((a1, a2, ...), axis=0)


Here,
• a1, a2, ... are the arrays and must have the same shape. They must be enclosed in parentheses,
as they are essentially being passed to the concatenate function as a Python tuple.

• axis denotes the axis along which the arrays will be joined. Default is 0 (i.e., row join).
For example, to join two 1D arrays:
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])
print (np.concatenate((a1, a2))) # (a1, a2) is a tuple
which will print:
[1 2 3 4 5 6]
Try this:
import numpy as np
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])
a3 = np.array([7,8,9])
print (np.concatenate((a1, a2, a3)))
Or
print (np.concatenate(([a1, a2, a3])))
which prints:
[1 2 3 4 5 6 7 8 9]
Joining 2D NumPy Arrays

NumPy concatenate essentially combines together multiple NumPy arrays. If we are joining 2D arrays,
then they can be joined in two ways: row join (the default) and column join. For example, to perform a row
join operation of two arrays A and B:

1 2 3

a
A 1 2 3

di
4 5 6
np.concatenate((A, B)) 4 5 6
In
` 11 12 13
se

11 12 13
B 14 15 16
ou

14 15 16
H

And the code is:


A = np.array([[1, 2, 3], [4, 5, 6]])
i
at

B = np.array([[11, 12, 13], [14, 15, 16]])


print (np.concatenate((A, B)))
w

which prints:
s

[[ 1 2 3 ]
ra

[ 4 5 6 ]
[ 11 12 13 ]
Sa

[ 14 15 16 ] ]
Similarly to perform column join operations of two array A and B:
ew

1 2 3
N

A
4 5 6
np.concatenate((A, B), axis=1) 1 2 3 11 12 13
@

` 4 5 6 14 15 16
11 12 13
B
14 15 16
Indexing, Slicing and Arithmetic Operations in NumPy Array 163

And the code is:


print (np.concatenate((A, B), axis=1))
which prints:
[ [ 1 2 3 11 12 13 ]

d
[ 4 5 6 14 15 16 ] ]
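As a related convenience (not covered above), NumPy also provides vstack() and hstack(), which behave like the two concatenate() calls shown in this section. A small sketch:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[11, 12, 13], [14, 15, 16]])

V = np.vstack((A, B))   # row join, same result as np.concatenate((A, B))
H = np.hstack((A, B))   # column join, same as np.concatenate((A, B), axis=1)

print(V.shape)   # (4, 3)
print(H.shape)   # (2, 6)
```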

6.5 Creating Sub Array

In the last section, we used indexing and slicing methods on 1D and 2D array elements. We can also select
a sub array from a NumPy array using the [] operator. A sub array is just a view of the original array, i.e., the data is not
copied; only a view of the sub array is created. When you modify the content of the sub array, you will see that
the original array is also modified/changed.
ndArray[first:last]
It will return a sub array from the original array with elements from index first to last – 1. Let us use a 1D
array A1 to select different sub arrays from the original NumPy array.

A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
Now let's see some examples. The content of the array is:
print ('The content of original array is', A1)
The content of original array is [24 12 10 34 17 13 32 51]

• Select a sub array with elements from index 1 to 5



subAr = A1[1:6] # subAr is a sub array



The content of the sub array is:



print ('The sub array is', subAr)



The sub array is [12 10 34 17 13]



Try this:
import numpy as np
x = np.arange(10)
subY = x[:6]
print (subY)
subY[3] = -200
print (subY)
print (x)
To modify the original array through the sub array, for example, modify the index [1] position value with 100:
subAr[1] = 100
Now the contents of the sub array and the original array are:
print ('The sub array is', subAr)
print ('The original array is', A1)
which will print the following:
The sub array is [12 100 34 17 13]
The original array is [24 12 100 34 17 13 32 51]
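If the original array must stay unchanged, take an explicit copy instead of a view. A small sketch using the copy() method:

```python
import numpy as np

A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
subCopy = A1[1:6].copy()   # copy() returns an independent array, not a view
subCopy[1] = 100           # modifies only the copy
print('The sub array is', subCopy)
print('The original array is', A1)   # A1 is unchanged
```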

Creating Sub Array using 2D Array


Like a 1D array, we can create a 2D sub array by slicing another NumPy 2D array. For example, we have
a 3 x 3 array that is given:
23 54 76
37 19 28
62 13 19
M = np.array([[23, 54, 76], [37, 19, 28], [62, 13, 19]])
The content of the array is:
>>> print (M)
[ [ 23 54 76 ]
  [ 37 19 28 ]
  [ 62 13 19 ] ]
Let us extract a 2x2 sub array from M:
subM = M[:2, :2]   # Creating a 2 x 2 sub array
The content of the sub array is:
print (subM)
[ [ 23 54 ]
  [ 37 19 ] ]
Now, we can modify the original array through the sub array. For example, let us modify the [0, 1] index
position value with -39.
subM[0, 1] = -39   # sub array modified
Now, the contents of the sub array and the original array are:
print (subM)
[ [ 23 -39 ]
  [ 37 19 ] ]
print (M)
[ [ 23 -39 76 ]
  [ 37 19 28 ]
  [ 62 13 19 ] ]

6.6 Arithmetic Operation on 2D Arrays



NumPy arrays support element-wise operations, meaning that arithmetic operations on arrays are applied
to each value in the array. Arithmetic operators are commonly used to perform numeric calculations. Also,
each of these arithmetic operations are simply convenient wrappers around specific functions built into
NumPy; for example, the + operator is a wrapper for the add function. NumPy has following arithmetic
operators shown in Table 6.1.

Table 6.1 Arithmetic operators with 1D array operation.
Operator   Function          Name               y = [8, 9, 11, 12, 10]           Result
+          np.add            Addition           y+2 or np.add(y, 2)              [10, 11, 13, 14, 12]
–          np.subtract       Subtraction        y-4 or np.subtract(y, 4)         [4, 5, 7, 8, 6]
*          np.multiply       Multiplication     y*2 or np.multiply(y, 2)         [16, 18, 22, 24, 20]
/          np.divide         Division           y/2 or np.divide(y, 2)           [4. , 4.5, 5.5, 6. , 5. ]
%          np.mod            Modulus            y%2 or np.mod(y, 2)              [0, 1, 1, 0, 0]
**         np.power          Exponent (power)   y**2 or np.power(y, 2)           [64, 81, 121, 144, 100]
//         np.floor_divide   Floor Division     y//2 or np.floor_divide(y, 2)    [4, 4, 5, 6, 5]

Let us see simple arithmetic operations for a 2D array:

Addition (+):
M            M+2
5 4 7 3      5+2 4+2 7+2 3+2      7  6  9  5
6 3 6 4      6+2 3+2 6+2 4+2  =   8  5  8  6
8 6 9 5      8+2 6+2 9+2 5+2     10  8 11  7
>>> M = np.array( [ [5, 4, 7, 3],
[6, 3, 6, 4],

[8, 6, 9, 5]])

>>> print (M)


[ [ 5, 4, 7, 3 ]

[ 6, 3, 6, 4 ]
[ 8, 6, 9, 5 ] ]

Let us perform element-wise sum operation. Note that the original array does not change.
>>> print (M+2)
Or
>>> print (np.add(M, 2))
[[ 7 6 9 5 ]
 [ 8 5 8 6 ]
 [ 10 8 11 7 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x + y)
print (np.add(x, y))

Subtraction (–):
M            M–3
5 4 7 3      5–3 4–3 7–3 3–3      2 1 4 0
6 3 6 4      6–3 3–3 6–3 4–3  =   3 0 3 1
8 6 9 5      8–3 6–3 9–3 5–3      5 3 6 2

>>> M = np.array( [ [5, 4, 7, 3],


[6, 3, 6, 4],
[8, 6, 9, 5]])

Let us perform element-wise differences. Note that the original array does not change.
>>> print (M-3)
Or
>>> print (np.subtract(M, 3))
[[ 2 1 4 0 ]
 [ 3 0 3 1 ]
 [ 5 3 6 2 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x - y)
print (np.subtract(x, y))
Multiplication (*):
M            M*2
5 4 7 3      5*2 4*2 7*2 3*2      10  8 14  6
6 3 6 4      6*2 3*2 6*2 4*2  =   12  6 12  8
8 6 9 5      8*2 6*2 9*2 5*2      16 12 18 10
>>> M = np.array( [ [5, 4, 7, 3],
                    [6, 3, 6, 4],
                    [8, 6, 9, 5]])
>>> print (M*2) # Performs element-wise multiplication.
Or
>>> print (np.multiply(M, 2))
[[ 10 8 14 6 ]
 [ 12 6 12 8 ]
 [ 16 12 18 10 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x * y)
print (np.multiply(x, y))

Division (/):
>>> print (M/2) # Performs element-wise division. Note that the original array does not change.
Or
>>> print (np.divide(M, 2))
[[ 2.5 2. 3.5 1.5 ]
 [ 3. 1.5 3. 2. ]
 [ 4. 3. 4.5 2.5 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x / y)
print (np.divide(x, y))
Modulus (%):
>>> print (M%2)
Or
>>> print (np.mod(M, 2))
[[ 1 0 1 1 ]
 [ 0 1 0 0 ]
 [ 0 0 1 1 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x % y)
print (np.mod(x, y))

We can apply the modulus operator as a conditional expression to each array value. For example,
a = np.array([10, 21, 4, 45, 66, 93, 7, 11, 13])
print (a%3 == 0)
[False True False True True True False False False]

Here, from the above result, the expression a%3 == 0 finds the modulus result as logical values: True
or False. That is:

a[0] % 3 = 10 % 3 = 1, i.e., False

a[1] % 3 = 21 % 3 = 0, i.e., True
a[2] % 3 = 4 % 3 = 1, i.e., False

....

....
and so on.
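Such a Boolean result can itself be used as an index, so that only the elements where the condition is True are selected. A small sketch:

```python
import numpy as np

a = np.array([10, 21, 4, 45, 66, 93, 7, 11, 13])
mask = (a % 3 == 0)    # Boolean array: True where the value is divisible by 3
print(a[mask])         # only the True positions are kept
```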

Exponentiation (**):
>>> print (M**2)
Or
>>> print (np.power(M, 2))
[[ 25 16 49 9 ]
 [ 36 9 36 16 ]
 [ 64 36 81 25 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x ** y)
print (np.power(x, y))

Floor division (//): Performs element-wise floor division and produces the integer part of the division.
>>> print (M//2)
Or
>>> print (np.floor_divide(M, 2))
[[ 2 2 3 1 ]
 [ 3 1 3 2 ]
 [ 4 3 4 2 ]]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
y = np.array([5,6,7,8], float)
print (x // y)
print (np.floor_divide(x, y))

Compound Assignment Operators



The compound-assignment operators combine the simple-assignment operator with another binary operator.
Compound-assignment operators perform the operation specified by the additional operator and then assign

the result to the left operand. The compound operations act in place to modify an existing array rather than
create a new one. Table 6.2 shows the compound assignment operators.

Table 6.2 Compound-assignment operators or shorthand operators.
Operator    Shorthand    Expression
+=          s += x       s = s + x
–=          s –= x       s = s – x
*=          s *= x       s = s * x
/=          s /= x       s = s / x
%=          s %= x       s = s % x
**=         s **= x      s = s ** x
//=         s //= x      s = s // x
For example, to perform element wise compound addition operation with a 2D array:

P             P += 10
23 54 76      23+10 54+10 76+10      33 64 86
37 19 28      37+10 19+10 28+10  =   47 29 38
62 13 19      62+10 13+10 19+10      72 23 29

Remember that all the compound operations change the original array in place.

>>> P = np.array( [[23, 54, 76],
[37, 19, 28],

[62, 13, 19]])
>>> print (P)
[ [ 23 54 76 ]
[ 37 19 28 ]

[ 62 13 19 ] ]

P += 10 # Performs element-wise sum; the change is also made in the original array.
>>> print (P)
[ [ 33 64 86 ]
  [ 47 29 38 ]
  [ 72 23 29 ] ]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
x+=2 # Don't write print (x+=2)
>>> P-=5 # Subtracts 5 from each element of the previous array
>>> print (P)
[ [ 28 59 81 ]
  [ 42 24 33 ]
  [ 67 18 24 ] ]
Try this:
import numpy as np
x = np.array([1,2,3,4], float)
x-=2 # Don't write print (x-=2)
Similarly, you can apply other compound operators with 2D array.
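For instance, the following sketch applies two compound operators one after another to a 2D array; each one modifies the same array object in place:

```python
import numpy as np

P = np.array([[23, 54, 76],
              [37, 19, 28],
              [62, 13, 19]])
P *= 2     # every element is doubled, in place
P //= 4    # then every element is floor-divided by 4, still in place
print(P)
```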



Arithmetic Operations on Matrices



The sum of two matrices x and y of the same order is written x + y and is defined element-wise. For
example:

The sum of x = 1 2 3   and y = 4 5 6   is x + y = 5 7 9
               3 4 5           2 3 4              5 7 9

The code is:


>>> x= np.array([(1, 2, 3), (3, 4, 5)])
>>> y= np.array([(4, 5, 6), (2, 3, 4)])
>>> print (x+y)
Or
>>> print (np.add(x, y))
[[ 5 7 9 ]
 [ 5 7 9 ]]
Similarly, we can perform other operations such as subtraction, multiplication and division.

Subtraction (–):
x        y        x - y
1 2 3    4 5 6    -3 -3 -3
3 4 5    2 3 4     1  1  1
>>> print (x-y)
Or
>>> print (np.subtract(x, y))
[[ -3 -3 -3 ]
 [ 1 1 1 ]]

Multiplication (*):
x        y        x * y
1 2 3    4 5 6    4 10 18
3 4 5    2 3 4    6 12 20
>>> print (x*y)
Or
>>> print (np.multiply(x, y))
[[ 4 10 18 ]
 [ 6 12 20 ]]

Division (/):
x        y        x / y
1 2 3    4 5 6    0.25 0.4        0.5
3 4 5    2 3 4    1.5  1.33333333 1.25
>>> print (x/y)
Or
>>> print (np.divide(x, y))
[[ 0.25 0.4 0.5 ]
 [ 1.5 1.33333333 1.25 ]]


Exponent (**): x**y
x        y        x ** y
1 2 3    4 5 6    1 32 729
3 4 5    2 3 4    9 64 625
>>> print (x**y)
Or
>>> print (np.power(x, y))
[[ 1 32 729 ]
 [ 9 64 625 ]]
Let us practice the arithmetic operation with two 3 x 3 arrays:

The sum of p = 1 -3  0   and q = 6  3 -3   is p + q =  7 0 -3
               4  2 -1           7 -2  5              11 0  4
               5  3 -2           2  6  9               7 9  7

E.g.,
>>> p = np.array( [[1, -3, 0],
                   [4, 2, -1],
                   [5, 3, -2]])
>>> q = np.array( [[6, 3, -3],
                   [7, -2, 5],
                   [2, 6, 9]])
>>> print (p+q) # element-wise sum
[[ 7 0 -3 ]
 [ 11 0 4 ]
 [ 7 9 7 ]]

Similarly, we can perform other operations such as subtraction, multiplication and division. Consider
the below example:
>>> print (p-q) # element-wise difference
[[ -5 -6 3 ]
 [ -3 4 -6 ]
 [ 3 -3 -11 ]]
>>> print (p*q) # element-wise product
[[ 6 -9 0 ]
 [ 28 -4 -5 ]
 [ 10 18 -18 ]]
>>> print (p/q) # element-wise division
[[ 0.16666667 -1. -0. ]
 [ 0.57142857 -1. -0.2 ]
 [ 2.5 0.5 -0.22222222 ]]

Matrix Multiplication
In Python 3.5+, the symbol @ computes matrix multiplication for NumPy arrays.

For example, if A and B are two matrices with the following elements:

Matrix - A        Matrix - B
M11 M12           N11 N12 N13
M21 M22           N21 N22 N23

then the product will be:

Matrix - A × B
M11×N11 + M12×N21    M11×N12 + M12×N22    M11×N13 + M12×N23
M21×N11 + M22×N21    M21×N12 + M22×N22    M21×N13 + M22×N23

For example, suppose two matrices A and B are of size 2 × 2 and 2 × 3, respectively. Here is a
pictorial representation:

A = 1 2        B = 5 6 7
    3 4            8 9 10

Multiplication of these two matrices is:

A × B = 1×5 + 2×8    1×6 + 2×9    1×7 + 2×10
        3×5 + 4×8    3×6 + 4×9    3×7 + 4×10

Hence,

A × B = 21 24 27
        47 54 61
Let us calculate the matrix multiplication in NumPy:
>>> A = np.array([[1, 2],
                  [3, 4]])
>>> B = np.array([[5, 6, 7],
                  [8, 9, 10]])
>>> print (A@B) # matrix product
[[ 21 24 27 ]
 [ 47 54 61 ]]

To multiply two matrices, we can also use the dot() function. NumPy is a powerful library for matrix
computation. For instance, you can compute the matrix product with np.dot().
The syntax is:
numpy.dot(a, b, out=None)

Here, the function returns the dot product of a and b. If a and b are both scalars or both 1-D arrays then
a scalar is returned; otherwise an array is returned.
For example, to generate the matrix multiplication of A and B using numpy.dot(), the command is:
>>> print (np.dot(A, B))

[ [ 21 24 27 ]

[ 47 54 61 ]]
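Note that on 2D arrays * and @ are different operations: * multiplies element by element (the shapes must be compatible), while @ performs true matrix multiplication. A small sketch with two 2 x 2 arrays:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
C = np.array([[10, 20],
              [30, 40]])

print(A * C)   # element-wise product
print(A @ C)   # matrix product
```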

6.7 NumPy Array Functions

There are many array functions we can use to compute with NumPy arrays. Since NumPy arrays are Python
objects, they have methods associated with them.
Most of the NumPy array functions accept an axis parameter. By default, the axis value is None. It is
optional, and if it is not provided the function flattens the passed NumPy array and operates on all of its
values (for example, numpy.max() returns the maximum of the entire array).
If the axis parameter is provided, the function returns an array of values computed along that axis, i.e.,
• If axis=0, it returns an array containing one result value for each column.
• If axis=1, it returns an array containing one result value for each row.

numpy.sum() Function

The numpy.sum() function returns the sum of all the elements in the array. With an axis argument, the sums
along the specified axis will be calculated.

For example, to find the total of a 1D array:


>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])

>>> A1.sum() # returns: 193



Because np.sum() is operating on a 1-dimensional NumPy array, it will just sum up the values. That is:
24 12 10 34 17 13 32 51  →  193

How does the numpy.sum() function work with the axis parameter in a 2D array?



Every axis in a NumPy array has a number, starting with 0. In this way, they are similar to Python indexes in

that they start at 0, not 1. So the first axis is axis 0. The second axis (in a 2-d array) is axis 1. For multi-
dimensional arrays, the third axis is axis 2. And so on.
              axis = 1 →
              0  1  2  3
axis = 0   0  2  4  9  3
    ↓      1  7  6  8  1
           2  4  2  5 11

For example, if we set axis = 0, we are indicating that we want to sum up the rows. Remember, axis 0
refers to the row axis. Likewise, if we set axis = 1, we are indicating that we want to sum up the columns.
Remember, axis 1 refers to the column axis.
Next, let’s sum all of the elements in a 2-dimensional NumPy array P.

P = np.array([ [2, 4, 9, 3],

[7, 6, 8, 1],
[4, 2, 5, 11]])

print (P)

[ [ 2, 4, 9, 3 ]
[ 7, 6, 8, 1 ]

[ 4, 2, 5, 11] ]

To find the total of all elements:

print ('Sum of entire array =',P.sum())
Sum of entire array = 62

Notice that when we use the NumPy sum function without specifying an axis, it will simply add together
all of the values and produce a single scalar value.
2 4 9 3
7 6 8 1
4 2 5 11   →  62

But using the axis parameter is a little confusing. We always thought that axis 0 is row-wise, and 1 is
column-wise.
row-wise (axis 0) ---> [[ 8 5 ]
                        [ 4 6 ]]
                           |
                           |
                  column-wise (axis 1)
However, the result of numpy.sum() is the exact opposite of what we were thinking. Let us see the
following two sums:
axis 0 ↓                       axis 1 →
 8  5                           8 5  →  13
 4  6                           4 6  →  10
 ↓  ↓
12 11
Here, numpy.sum sums down            Here, numpy.sum sums across the
the rows when we set axis = 0        columns when we set axis = 1

The way to understand the "axis" of numpy.sum() is that it collapses the specified axis. So, when it
collapses the axis 0 (row), it becomes just one row and a column-wise sum. That is, when we set axis = 0, we
are basically saying, "sum the rows", i.e., to operate on the rows only. This is often called a row-wise
operation.
Similarly, when it collapses the axis 1 (column), it becomes just one column and a row-wise sum. That
is, when we set the parameter axis = 1, we are telling the np.sum function to operate on the columns only.
Specifically, we are telling the function to sum up the values across the columns.

m
For example, to find the sum along rows:
x = np.array([[8, 5], [4, 6]])
print (x)
[[ 8 5 ]
 [ 4 6 ]]
print ('Sum along rows =', x.sum(axis = 0))
which will print:
Sum along rows = [ 12 11]
Similarly, to find the sum along columns:
print ('Sum along columns =', x.sum(axis = 1))
which will print:
Sum along columns = [ 13 10]



Example A NumPy array P contains following values:



P = np.array([ [2, 4, 9, 3],


[7, 6, 8, 1],

[4, 2, 5, 11]])
Write the commands to print the row sum and column sum values.

Solution print ('Row-sum =', P.sum(axis = 0)) # returns: Row-sum = [13 12 22 15]

print ('Column sum =', P.sum(axis = 1)) # returns: Column sum = [18 22 22]

numpy.max() and numpy.min() Functions

The numpy.max() function returns the scalar value which is the largest element in the entire array. If an axis

is defined for an N-dimensional array, the maximum values along that axis are returned.
To find the maximum number in a 1D array:

>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])


>>> A1.max() # returns: 51

The numpy.min() function returns the scalar value which is the smallest element in the entire array. If
an axis is defined for an N-dimensional array, the minimum values along that axis are returned.
To find the minimum number in a 1D array:
>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
>>> A1.min() # returns: 10

The axis parameter of the numpy.max() and numpy.min() functions operates in the same way as that of
the numpy.sum() function.
When numpy.max(axis=0):              When numpy.max(axis=1):
axis 0 ↓                             axis 1 →
2 4 9  3                             2 4 9  3   →  9
7 6 8  1                             7 6 8  1   →  8
4 2 5 11                             4 2 5 11   →  11
↓ ↓ ↓  ↓
7 6 9 11
Here, numpy.max finds maximum        Here, numpy.max finds maximum across
down the rows when we set axis = 0   the columns when we set axis = 1

When numpy.min(axis=0):              When numpy.min(axis=1):
axis 0 ↓                             axis 1 →
2 4 9  3                             2 4 9  3   →  2
7 6 8  1                             7 6 8  1   →  1
4 2 5 11                             4 2 5 11   →  2
↓ ↓ ↓  ↓
2 2 5  1
Here, numpy.min finds minimum        Here, numpy.min finds minimum across
down the rows when we set axis = 0   the columns when we set axis = 1

To find the maximum and minimum in the 2D array P:


print ('Maximum number =', P.max())

which will print:


Maximum number = 11

print ('Minimum number =', P.min())



which will print:


Minimum number = 1
To find the maximum numbers row-wise and column-wise in the 2D array P:
print ('Row-wise maximum values =', P.max(axis = 0))

which will print:


Row-wise maximum values = [ 7 6 9 11]
print ('Column-wise maximum values =', P.max(axis = 1))

d
which will print:

ite
Column-wise maximum values = [ 9 8 11]

m
Similarly, to find the minimum numbers row-wise and column-wise in the 2D array P:
print ('Row-wise minimum values =', P.min(axis = 0))

Li
which will print:

e
Row-wise minimum values = [2 2 5 1]

at
print ('Column-wise minimum values =', P.min(axis = 1))

iv
which will print:

Pr
Column-wise minimum values = [2 1 2]
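Two closely related functions, numpy.argmax() and numpy.argmin() (not covered above), return the index position of the largest and smallest element rather than the value itself. A small sketch:

```python
import numpy as np

A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])
print(A1.argmax())   # index of the maximum value 51
print(A1.argmin())   # index of the minimum value 10
```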
Example	Write the command to create a 1D array with 10 random values and print the maximum
	value.
Solution	X = np.random.random(10)
print (X.max())

numpy.mean() Function

The numpy.mean() function returns mean or average of all values in list/array or along rows or columns.
To find the mean value of a 1D array:

>>> A1 = np.array([24, 12, 10, 34, 17, 13, 32, 51])


>>> A1.mean() # returns: 24.125

The mean operation along columns and rows is calculated exactly like the previous functions
numpy.sum(), numpy.max(), etc. For example, let us find the mean using the previous 2D array P:
>>> print ('Mean of entire array =', P.mean())
Mean of entire array = 5.166666666666667
>>> print ('Mean along columns =', P.mean(axis = 0))
Mean along columns = [4.33333333 4. 7.33333333 5. ]
>>> print ('Mean along rows =', P.mean(axis = 1))
Mean along rows = [4.5 5.5 5.5]

numpy.sort() Function

The numpy.sort() function sorts the elements, in ascending order, along the specified dimension, with the
default being the last one (axis = –1). Dimension numbering starts with 0, and the axis parameter
operates in the same way as in the numpy.sum() function. For example,
>>> import numpy as np
>>> a = np.array([3, 7, 4, 8, 2, 15])

>>> a.sort() # arrange in ascending order


>>> a
array([ 2, 3, 4, 7, 8, 15])
To perform the sort operation on a 2D array:
>>> S1 = np.array( [ [23, 54, 76 ],
                     [37, 19, 28 ],
                     [62, 13, 19 ] ] )
# row-wise sorting
>>> np.sort(S1, axis=0)
array([ [ 23, 13, 19 ],
        [ 37, 19, 28 ],
        [ 62, 54, 76 ] ] )
# Column-wise sorting
>>> np.sort(S1, axis=1)
array([ [ 23, 54, 76 ],
        [ 19, 28, 37 ],
        [ 13, 19, 62 ] ] )
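numpy.sort() always sorts in ascending order; to obtain a descending order, the sorted result can be reversed with slicing. A small sketch:

```python
import numpy as np

a = np.array([3, 7, 4, 8, 2, 15])
desc = np.sort(a)[::-1]   # sort ascending, then reverse with a step of -1
print(desc)
```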
Points to Remember

1. Array indexing refers to any use of the square brackets ([ ]) to index array values.
2. Single element indexing for a 1D array works exactly like that for other standard Python sequences
like list or tuple.


3. Slicing in the NumPy array is the way to extract a range of elements from an array.

4. Slicing is specified using the colon operator ':' with a [start:end] or 'from' and 'to' index before
and after the colon respectively.

SOLVED EXERCISES

1. What will be the output of the following code?



import numpy as np
X = np.arange(4, 20, 4)
print("X =", X)
print("X + 5 =", X + 5)
print("X - 5 =", X - 5)
print("X * 2 =", X * 2)
print("X / 2 =", X / 2)
print("X // 2 =", X // 2) # floor division
Ans. The output is:
X = [ 4 8 12 16]
X + 5 = [ 9 13 17 21]
X - 5 = [-1 3 7 11]
X * 2 = [ 8 16 24 32]
X / 2 = [2. 4. 6. 8.]
X // 2 = [2 4 6 8]

2. There are two arrays a and b given (assume that np is used as numpy namespace):
a = np.array([[1,2],[3,4]])
b = np.array([[5,10]])
What will be the output of the following?

(a) print (a + b)

(b) d = np.array([5,10])
dd = d.reshape(1,2)

print (a + d)
Ans. The output is:

Li
(a) [ [ 6 12 ]
[ 8 14 ] ]

e
(b) [ [ 6 12 ]

at
[ 8 14 ] ]
3. Two 2D arrays are given as below:

x = np.array([[1, 3, -5], [3, 4, 2], [-5, 2, 0]])

y = np.array([[1], [5], [3]])
Find the output of print (x*y).
Ans. The output is:

[ [ 1 3 -5 ]
  [ 15 20 10 ]
  [ -15 6 0 ] ]
In
4. An array X contains the following:
X = np.full((3, 4), True, dtype=bool)
se

What will be the output of print (X[2][3])?


Ans. True

5. What will be the output of the following?


D = np.arange(10)

D[::2] +=2
print (D)

print (D[[2, 5, 1, 8]])



Ans. The output is:


[ 2 1 4 3 6 5 8 7 10 9]

[ 4 5 1 10]

6. An array X is given with following values:



array([ [23, 54],


[37, 19],
[62, 13]])

Write the command to print the array X in reverse order.


Ans. The command is: X[::-1]

7. Write the command for the following:


(a) Create an array X of 10 zeros, but with the fifth value as 1.

(b) Create a 2D array Z by multiplying a 5x3 matrix of all values 1 by a 3x2 matrix with all values
1 (real matrix product)
(c) Create a 10x10 array with random values and find the minimum and maximum values.
Ans. (a) X = np.zeros(10)
X[4] = 1
print(X)

(b) Z = np.dot(np.ones((5,3)), np.ones((3,2)))


(c) Z = np.random.random((10,10))
Zmin, Zmax = Z.min(), Z.max()
print(Zmin, Zmax)

8. An array Num is given with [10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29] values. Write the commands to

create the array Num and replace all odd numbers in Num with -1. Also print the array Num.
Ans. The commands are:

Num = np.array([10, 51, 2, 18, 4, 31, 13, 5, 23, 64, 29])
Num[Num % 2 == 1] = -1

print (Num)
9. What will be the output of the following program?

import numpy as np

a2dr = np.array([1, 2, 3, 4]*3).reshape((3, 4))

print('The array is:')
print(a2dr)

print('')
print('mean entire array =', np.mean(a2dr))

print('mean along columns =', np.mean(a2dr, axis=0))

print('mean along rows =', np.mean(a2dr, axis=1))
Ans. The output is
The array is:

[[ 1 2 3 4 ]
[ 1 2 3 4 ]

[ 1 2 3 4 ]]
mean entire array = 2.5
mean along columns = [1. 2. 3. 4.]
mean along rows = [2.5 2.5 2.5]
i
at

10. A 2D array C is given with following values:


C = np.array([[1, 3, -5], [3, 4, 2], [-5, 2, 0]])

(a) Write the commands for the following:



(i) Row-wise sum.



(ii) column-wise maximum.


(iii) Row-wise mean.

(b) Write the output for the following:


(i) C.sort(axis = 0)
print (C)

(ii) np.sum(C.max(axis = 1))


Ans. (a) (i) C.sum(axis = 0)

(ii) C.max(axis = 1)
(iii) C.mean(axis = 0)

(b) (i) [[-5 2 -5]


[ 1 3 0]
[ 3 4 2]]
(ii) 9

REVIEW QUESTIONS

1. If x1 is an array with following data, then write the command for following:

import numpy as np

x1 = np.array([14, 13, 14, 15, 18, 12, 14])
(a) Access the value at index zero.
(b) Access the fifth value.

m
(c) Get the last value

Li
(d) Get the second last value
2. A multi-dimensional array x2 contains following values:

import numpy as np

x2 = np.array( [ [ 13, 17, 13, 15 ],
[ 10, 21, 15, 19 ],

[ 31, 14, 16, 30 ] ] )
Using array indexing, write the command for the following:

(a) Get 1st row and 2nd column value.
(b) Get 3rd row and last value from the 3rd column.

(c) Get the third row.

(d) Get the third column.
(e) Replace value 17 at 0,0 index.
(f) Print the array in reverse order of the rows.
3. A multi-dimensional array 5 x 5 contains following values:

array([ [ 1, 3, 5, 7, 9],
[11, 13, 15, 17, 19],

[21, 23, 25, 27, 29],


[31, 33, 35, 37, 39],

[41, 43, 45, 47, 49]])


Create the array N using the reshape() function and write the output for the following:

(a) print (N[0, 3:5])



(b) print (N[4:, 4:])



(c) print (N[:, 2])


(d) print (N[::, 2])

(e) print (N[2::2, ::2])



4. What will be the output of the following?


a = np.array([[4, 3, 5], [1, 2, 1]])

b = np.sort(a, axis=1)
print (b)
5. Write a program to extract all even numbers from NumPy array. For example, if an array A contains

the following values:


[24 12 10 34 17 13 32 51]

then result will be:



The even numbers are: 24 12 10 34 32
