
Data Science

21CSS303T
UNIT - 1
Unit I
Unit-1: INTRODUCTION TO DATA SCIENCE 10 hours
Benefits and uses of Data science, Facets of data, The data
science process

Introduction to Numpy: Numpy, creating array, attributes,


Numpy Arrays objects: Creating Arrays, basic operations (Array
Join, split, search, sort), Indexing, Slicing and iterating, copying
arrays, Arrays shape manipulation, Identity array, eye function
Exploring Data using Series, Exploring Data using Data Frames,
Index objects, Re-index, Drop Entry, Selecting Entries, Data
Alignment, Rank and Sort, Summary Statistics, Index Hierarchy

Data Acquisition: Gather information from different sources,


Web APIs, Open Data Sources, Web Scraping.
Big Data vs Data Science
• Big data is a blanket term for any collection of data sets so
large or complex that it becomes difficult to process them
using traditional data management techniques such as
relational database management systems (RDBMS).

• Data science involves using methods to analyze massive


amounts of data and extract the knowledge it contains.

You can think of the relationship between big data and data
science as being like the relationship between crude oil and
an oil refinery.
Characteristics of Big Data
• Volume—How much data is there?
• Variety—How diverse are different types of data?
• Velocity—At what speed is new data generated?
Benefits and uses of data science and
big data
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
5. Data Science Makes Data Better
6. Data Scientists are Highly Prestigious
7. No More Boring Tasks
8. Data Science Makes Products Smarter
9. Data Science can Save Lives
Facets of data
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured Data
• Structured data is data that depends on a data model and
resides in a fixed field within a record.
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data
model because the content is context-specific or varying.
Natural language
• Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of
specific data science techniques and linguistics.

• The natural language processing community has had


success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but
models trained in one domain don’t generalize well to other
domains.
Machine-generated data
• Machine-generated data is information that’s automatically
created by a computer, process, application, or other
machine without human intervention.
• Machine-generated data is becoming a major data resource
and will continue to do so.
Machine-generated data
Graph-based or network data
• “Graph data” can be a confusing term because any data can
be shown in a graph.
• “Graph” in this case points to mathematical graph theory.
• In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to
represent and store graphical data.
• Graph-based data is a natural way to represent social
networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest
path between two people.
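A minimal sketch of working with graph data, assuming the third-party networkx library is available (the tiny network below is hypothetical):

import networkx as nx

# a tiny social network: nodes are people, edges are friendships
g = nx.Graph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"), ("carol", "dave")])

# shortest path between two people
print(nx.shortest_path(g, "alice", "dave"))   # ['alice', 'bob', 'carol', 'dave']

# a simple influence-style metric for each person
print(nx.degree_centrality(g))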
Graph-based or network data
Audio, video and image
• Audio, image, and video are data types that pose specific
challenges to a data scientist.
Streaming
• While streaming data can take almost any of the previous
forms, it has an extra property.
• The data flows into the system when an event happens
instead of being loaded into a data store in a batch.
The Data Science
Process (Life cycle of
Data Science)
The Data Science Process
• The data science process typically consists of six steps, as
you can see in the mind map
The Data Science Process
Setting the research goal
Setting the research goal
• Data science is mostly applied in the context of an
organization.
• A clear research goal
• The project mission and context
• How you’re going to perform your analysis
• What resources you expect to use
• Proof that it’s an achievable project, or proof of concepts
• Deliverables and a measure of success
• A timeline
Retrieving data
Retrieving data
• Data can be stored in many forms, ranging from simple text
files to tables in a database.
• The objective now is acquiring all the data you need.

• Start with data stored within the company


• Databases
• Data marts
• Data warehouses
• Data lakes
Data Lakes
• A data lake is a centralized storage repository that holds a
massive amount of structured and unstructured data.
• According to Gartner, “it is a collection of storage instances
of various data assets additional to the originating data
sources.”
Data warehouse
• Data warehousing is about the collection of data from varied
sources for meaningful business insights.
• An electronic storage of a massive amount of information, it
is a blend of technologies that enable the strategic use of
data!
Data Mart
DWH vs DM
• Data Warehouse is a large repository of data collected from
different sources whereas Data Mart is only subtype of a data
warehouse.
• Data Warehouse is focused on all departments in an organization
whereas Data Mart focuses on a specific group.
• Data Warehouse designing process is complicated whereas the
Data Mart process is easy to design.
• Data Warehouse takes a long time for data handling whereas Data
Mart takes a short time for data handling.
• Comparing Data Warehouse vs Data Mart, Data Warehouse size
range is 100 GB to 1 TB+ whereas Data Mart size is less than 100
GB.
• When we differentiate Data Warehouse and Data Mart, Data
Warehouse implementation process takes 1 month to 1 year
whereas Data Mart takes a few months to complete the
implementation process.
DWH vs DL
Data Lakes
• Data lakes are a fairly new concept, and some experts have
predicted that they may eventually replace data warehouses
and data marts.
• With the increase of unstructured data, data lakes will likely
become more popular, but you will probably still prefer to keep
your structured data in a data warehouse.
Data Providers
Cleansing, integration and
transformation
Cleansing data
• Data cleansing is a sub process of the data science
process that focuses on removing errors in your data so
your data becomes a true and consistent
representation of the processes it originates from.
• True and consistent representation
• interpretation error
• inconsistencies
Outliers
Data Entry Errors
• Data collection and data entry are error-prone processes.
• They often require human intervention, and because
humans are only human, they make typos or lose their
concentration for a second and introduce an error into the
chain. But data collected by machines or computers isn’t
free from errors either.
• Errors can arise from human sloppiness, whereas others
are due to machine or hardware failure.
Data Entry Errors
Redundant Whitespaces
• Whitespaces tend to be hard to detect but cause errors like
other redundant characters would.
• Capital letter mismatches are common.
• Most programming languages make a distinction between
"Brazil" and "brazil". In this case you can solve the problem
by applying a function that returns both strings in
lowercase, such as .lower() in Python. "Brazil".lower() ==
"brazil".lower() evaluates to True.
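A minimal sketch of fixing redundant whitespace and capital-letter mismatches in plain Python (the value is hypothetical):

city = "  Brazil "
# strip() removes the redundant whitespace, lower() fixes the capitalization
print(city.strip().lower() == "brazil")   # True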
Impossible values and Sanity checks
• Sanity checks are another valuable type of data check.
• Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
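A small sketch of the same sanity check applied to a whole column, assuming the data sits in a hypothetical pandas DataFrame with an age column:

import pandas as pd

df = pd.DataFrame({"age": [25, -3, 47, 140]})   # hypothetical sample data
valid = df["age"].between(0, 120)               # sanity check: 0 <= age <= 120
print(df[~valid])                               # rows that violate the rule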
Outliers
• An outlier is an observation that seems to be distant from
other observations or, more specifically, one observation
that follows a different logic or generative process than the
other observations.
• Find outliers ➔ Use a plot or table
Outliers
Handle missing data
Deviations from a code book
• A code book is a description of your data, a form of
metadata.
• It contains things such as the number of variables per
observation, the number of observations, and what each
encoding within a variable means.(For instance “0” equals
“negative”, “5” stands for “very positive”.)
Combining data from different data
sources
• Joining ➔ enriching an observation from one table with
information from another table
• Appending or Stacking ➔adding the observations of one
table to those of another table.
Joining
• Joining ➔ focus on enriching a single observation
• To join tables, you use variables that represent the same
object in both tables, such as a date, a country name, or a
Social Security number. These common fields are known as
keys.
• When these keys also uniquely define the records in the table
they are called Primary Keys
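A minimal pandas sketch of joining on a key, assuming two hypothetical tables that share a customer_id column:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "country": ["US", "NL"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 25, 40]})

# customer_id is the key; each order row is enriched with the customer's country
enriched = orders.merge(customers, on="customer_id", how="left")
print(enriched)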
Appending
• Appending ➔ effectively adding observations from one
table to another table.
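A minimal sketch of appending/stacking, assuming two hypothetical tables with the same columns:

import pandas as pd

jan = pd.DataFrame({"amount": [10, 25]})
feb = pd.DataFrame({"amount": [40]})

# the observations of feb are added below those of jan
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)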
Views
• To avoid duplication of data, you virtually combine data
with views
• Existing ➔ needed more storage space
• A view behaves as if you’re working on a table, but this table
is nothing but a virtual layer that combines the tables for
you.
Views
Enriching aggregated measures
• Data enrichment can also be done by adding calculated
information to the table, such as the total number of sales or
what percentage of total stock has been sold in a certain
region
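A small sketch of adding such an aggregated measure, assuming a hypothetical sales table with region and units columns:

import pandas as pd

sales = pd.DataFrame({"region": ["N", "N", "S"], "units": [10, 30, 20]})

# share of each row within the total of its region
sales["pct_of_region"] = sales["units"] / sales.groupby("region")["units"].transform("sum")
print(sales)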
Transforming data
• Certain models require their data to be in a certain shape.
• Transforming your data so it takes a suitable form for data
modeling.
Reducing the number of variables
• Too many variables
➔don’t add new information to the model
➔model difficult to handle
➔certain techniques don’t perform well when you overload them
with too many input variables
• Data scientists use special methods to reduce the number of
variables but retain the maximum amount of data.
Turning variables into dummies
• Dummy variables can only take two values: true(1) or
false(0). They’re used to indicate the absence of a
categorical effect that may explain the observation.
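A minimal sketch of turning a categorical variable into dummies with pandas (the data is hypothetical):

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
dummies = pd.get_dummies(df["color"])   # one true(1)/false(0) column per category
print(dummies)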
Data Exploration
Data Exploration
• Information becomes much easier to grasp when shown in a
picture, therefore you mainly use graphical techniques to
gain an understanding of your data and the interactions
between variables.
• Visualization Techniques
• Simple graphs
• Histograms
• Sankey
• Network graphs
Bar Chart
Line Chart
Distribution
Overlaying
Brushing and Linking
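A minimal sketch of one such exploratory graph (a histogram), assuming matplotlib and some hypothetical numeric data:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(size=1000)   # hypothetical variable
plt.hist(data, bins=30)              # the histogram shows the distribution
plt.title("Distribution of a variable")
plt.show()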
STEP 5: BUILD THE MODELS
Data modeling
Data modeling
• Building a model is an iterative process.
• The way you build your model depends on whether you go
with classic statistics or the somewhat more recent machine
learning school, and the type of technique you want to use.
• Models consist of the following main steps:
• 1 Selection of a modeling technique and variables to enter
in the model
• 2 Execution of the model
• 3 Diagnosis and model comparison
Model and variable selection
❖ Must the model be moved to a production environment
and, if so, would it be easy to implement?
❖ How difficult is the maintenance on the model: how long
will it remain relevant if left untouched?
❖ Does the model need to be easy to explain?
Model execution
Model execution
Model execution
Introduction to
Numpy
NumPy Arrays
NumPy
• Numerical Python
• General-purpose array-processing package.
• High-performance multidimensional array object, and
tools for working with these arrays.
• Fundamental package for scientific computing with
Python.
• It is open-source software.
NumPy - Features
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random
number capabilities
Choosing NumPy over Python list
Array
• An array is a data type used to store multiple values
using a single identifier (variable name).
• An array contains an ordered collection of data
elements where each element is of the same type and
can be referenced by its index (position)
Array
• Similar to the indexing of lists
• Zero-based indexing
• [10, 9, 99, 71, 90 ]
NumPy Array
• Store lists of numerical data, vectors and matrices
• Large set of routines (built-in functions) for creating,
manipulating, and transforming NumPy arrays.
• NumPy array is officially called ndarray but commonly
known as array
Creation of NumPy Arrays from List
• First we need to import the NumPy library
import numpy as np
Creation of Arrays
1. Using the NumPy functions

a. Creating one-dimensional array in NumPy


import numpy as np
array=np.arange(20)
array

Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
1. Using the NumPy functions

a. Check the dimensions by using array.shape:

array.shape

Output:
(20,)
1. Using the NumPy functions

b. Creating two-dimensional arrays in NumPy


array=np.arange(20).reshape(4,5)

Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
1. Using the NumPy functions

c. Using other NumPy functions


np.zeros((2,4))
np.ones((3,6))
np.full((2,2), 3)

Output:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])
array([[1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.]])
array([[3, 3],
       [3, 3]])
1. Using the NumPy functions

c. Using other NumPy functions

import numpy as np

a = np.zeros((2,4))
b = np.ones((3,6))
c = np.empty((2,3))
d = np.full((2,2), 3)
e = np.eye(3,3)
f = np.linspace(0, 10, num=4)

print(a)
print(b)
print(c)
print(d)
print(e)
print(f)

Output:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]

[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]

[[1.14137702e-316 0.00000000e+000 6.91583610e-310]
 [6.91583609e-310 6.91583601e-310 6.91583601e-310]]

[[3 3]
 [3 3]]

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

[ 0.          3.33333333  6.66666667 10.        ]
1. Using the NumPy functions

Sr No. | Function | Description
1 | empty_like() | Return a new array with the same shape and type
2 | ones_like() | Return an array of ones with the same shape and type
3 | zeros_like() | Return an array of zeros with the same shape and type
4 | full_like() | Return a full array with the same shape and type
5 | asarray() | Convert the input to an array
6 | geomspace() | Return evenly spaced numbers on a log scale
7 | copy() | Return a copy of the given object

1. Using the NumPy functions

Sr No. | Function | Description
8 | diag() | Construct a diagonal array
9 | frombuffer() | Interpret a buffer as a 1-D array
10 | fromfile() | Construct an array from a text or binary file
11 | bmat() | Build a matrix object from a string, nested sequence, or array
12 | mat() | Interpret the input as a matrix
13 | vander() | Generate a Vandermonde matrix
14 | triu() | Upper triangle of an array

1. Using the NumPy functions

Sr No. | Function | Description
15 | tril() | Lower triangle of an array
16 | tri() | An array with ones at and below the given diagonal and zeros elsewhere
17 | diagflat() | A two-dimensional array with the flattened input as a diagonal
18 | fromfunction() | Construct an array by executing a function over each coordinate
19 | logspace() | Return numbers spaced evenly on a log scale
20 | meshgrid() | Return coordinate matrices from coordinate vectors
2. Conversion from Python structures like lists

import numpy as np
array = np.array([4,5,6])
print(array)
list = [4,5,6]
print(list)

Output:
[4 5 6]
[4, 5, 6]
Working with Ndarray
• np.ndarray(shape, dtype)
• Creates an array of the given shape with uninitialized (arbitrary) values.
• np.array(array_object)
• Creates an array of the given shape from the list or tuple.
• np.zeros(shape)
• Creates an array of the given shape with all zeros.
• np.ones(shape)
• Creates an array of the given shape with all ones.
• np.full(shape, fill_value, dtype)
• Creates an array of the given shape filled with the given value.
• np.arange(range)
• Creates an array with the specified range.
NumPy Basic Array Operations
There is a vast range of built-in operations that we can
perform on these arrays.
1. ndim – It returns the dimensions of the array.
2. itemsize – It calculates the byte size of each element.
3. dtype – It can determine the data type of the element.
4. reshape – It provides a new view.
5. slicing – It extracts a particular set of elements.
6. linspace – Returns evenly spaced elements.
7. max/min , sum, sqrt
8. ravel – It converts the array into a single line.
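A short sketch exercising these attributes and operations on a small array (output values such as itemsize depend on the platform):

import numpy as np

arr = np.arange(12).reshape(3, 4)   # reshape gives a new 3x4 view
print(arr.ndim)                     # 2
print(arr.itemsize)                 # bytes per element, e.g. 8 for int64
print(arr.dtype)                    # e.g. int64
print(arr[0, 1:3])                  # slicing: [1 2]
print(np.linspace(0, 1, 5))         # evenly spaced values
print(arr.max(), arr.min(), arr.sum())
print(np.sqrt(arr))                 # element-wise square root
print(arr.ravel())                  # flattened into a single line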
Arrays in NumPy
Checking Array Dimensions in NumPy

import numpy as np
a = np.array(10)
b = np.array([1,1,1,1])
c = np.array([[1, 1, 1], [2,2,2]])
d = np.array([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
print(a.ndim) #0
print(b.ndim) #1
print(c.ndim) #2
print(d.ndim) #3
Higher Dimensional Arrays in NumPy

import numpy as np
arr = np.array([1, 1, 1, 1, 1], ndmin=10)
print(arr)
print('number of dimensions :', arr.ndim)

[[[[[[[[[[1 1 1 1 1]]]]]]]]]]
number of dimensions : 10
Indexing and Slicing in NumPy
Indexing & Slicing
Indexing
import numpy as np
arr = np.array([1,2,5,6,7])
print(arr[3]) # 6

Slicing
import numpy as np
arr = np.array([1,2,5,6,7])
print(arr[2:5]) # [5 6 7]
Indexing and Slicing
Indexing and Slicing in 2-D
Copying Arrays
Copy from one array to another
• Method 1: Using np.empty_like() function
• Method 2: Using np.copy() function
• Method 3: Using Assignment Operator
Using np.empty_like( )
• This function returns a new array with the same shape and
type as a given array.
Syntax:
• numpy.empty_like(a, dtype = None, order = 'K', subok = True)
Using np.empty_like( )
import numpy as np
ary = np.array([13,99,100,34,65,11,66,81,632,44])

print("Original array: ")
# printing the Numpy array
print(ary)

# Creating an empty Numpy array similar to ary
copy = np.empty_like(ary)

# Now assign ary to copy
copy = ary

print("\nCopy of the given array: ")
# printing the copied array
print(copy)
Using np.empty_like( )
Using np.copy() function
• This function returns an array copy of the given object.
Syntax :
• numpy.copy(a, order='K', subok=False)

# importing Numpy package


import numpy as np
org_array = np.array([1.54, 2.99, 3.42, 4.87, 6.94, 8.21, 7.65, 10.50, 77.5])
print("Original array: ")
print(org_array)
# Now copying the org_array to copy_array using np.copy() function
copy_array = np.copy(org_array)
print("\nCopied array: ")
# printing the copied Numpy array
print(copy_array)
Using np.copy() function
# importing Numpy package
import numpy as np
org_array = np.array([1.54, 2.99, 3.42, 4.87, 6.94, 8.21, 7.65, 10.50,
77.5])
print("Original array: ")
print(org_array)
copy_array = np.copy(org_array)
print("\nCopied array: ")
# printing the copied Numpy array
print(copy_array)
Using Assignment Operator
import numpy as np
org_array = np.array([[99, 22, 33], [44, 77, 66]])

# Copying org_array to copy_array using the assignment operator
copy_array = org_array

# modifying org_array
org_array[1, 2] = 13

# checking if copy_array has remained the same:
# both arrays now show the change, because assignment only creates
# another name (an alias) for the same array, not an independent copy

# printing original array
print('Original Array: \n', org_array)

# printing copied array
print('\nCopied Array: \n', copy_array)
Iterating Arrays
• Iterating means going through elements one by one.
• As we deal with multi-dimensional arrays in numpy, we can do
this using basic for loop of python.
• If we iterate on a 1-D array it will go through each element one by
one.
• Iterate on the elements of the following 1-D array:
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
print(x)
Output:
1
2
3
Iterating Arrays
• Iterating 2-D Arrays
• In a 2-D array it will go through all the rows.
• If we iterate on a n-D array it will go through (n-1)th dimension
one by one.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
print(x)
Output:
[1 2 3]
[4 5 6]
Iterating Arrays
• To return the actual values, the scalars, we have to iterate
the arrays in each dimension.
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
for y in x:
print(y)

1
2
3
4
5
6
Iterating Arrays
• Iterating 3-D Arrays
• In a 3-D array it will go through all the 2-D arrays.

• import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
print(x)

[[1 2 3] [4 5 6]]
[[ 7 8 9] [10 11 12]]
Iterating Arrays
• Iterating 3-D Arrays
• To return the actual values, the scalars, we have to iterate the
arrays in each dimension.

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
for y in x:
for z in y:
print(z)
1 2 3 4 5 6 7 8 9 10 11 12
(each value is printed on its own line)

Iterating Arrays Using nditer()

• The function nditer() is a helping function that can be
used from very basic to very advanced iterations.
• Iterating on Each Scalar Element
• In basic for loops, iterating through each scalar of an array
we need to use n for loops, which can be difficult to write for
arrays with very high dimensionality.

import numpy as np

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(arr):
    print(x)

Output:
1
2
3
4
5
6
7
8
Identity array
• The identity array is a square array with ones on the main
diagonal.
• The identity() function return the identity array.
Identity
• numpy.identity(n, dtype = None) : Return an identity
matrix, i.e. a square matrix with ones on the main diagonal

• Parameters:
• n : [int] Dimension n x n of output array
• dtype : [optional, float(by Default)] Data type of returned array
Identity array
# 2x2 matrix with 1's on main diagonal
b = np.identity(2, dtype = float)
print("Matrix b : \n", b)
a = np.identity(4)
print("\nMatrix a : \n", a)

Output:
Matrix b :
[[ 1. 0.]
[ 0. 1.]]
Matrix a :
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]
eye( )
• numpy.eye(R, C = None, k = 0, dtype = type
<‘float’>) : Return a matrix having 1’s on the diagonal and
0’s elsewhere w.r.t. k.
• R : Number of rows
C : [optional] Number of columns; by default C = R
k : [int, optional, 0 by default]
Diagonal we require; k > 0 means a diagonal above the main
diagonal, and k < 0 a diagonal below it.
dtype : [optional, float(by default)] Data type of returned
array.
eye( )
Identity( ) vs eye( )
• np.identity returns a square matrix (special case of a 2D-
array) which is an identity matrix with the main diagonal
(i.e. 'k=0') as 1's and the other values as 0's. you can't
change the diagonal k here.
• np.eye returns a 2D-array, which fills the diagonal, i.e. 'k'
which can be set, with 1's and rest with 0's.
• So, the main advantage depends on the requirement. If you
want an identity matrix, you can go for identity right away,
or can call the np.eye leaving the rest to defaults.
• But, if you need a 1's and 0's matrix of a particular shape/
size or have a control over the diagonal you can go
for eye method.
Identity( ) vs eye( )
import numpy as np
print(np.eye(3,5,1))
print(np.eye(8,4,0))
print(np.eye(8,4,-1))
print(np.eye(8,4,-2))
print(np.identity(4))
Shape of an Array
• import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

• Output: (2,4)
Reshaping arrays
• Reshaping means changing the shape of an array.
• The shape of an array is the number of elements in each
dimension.
• By reshaping we can add or remove dimensions or change
number of elements in each dimension.
Reshape From 1-D to 2-D
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(4, 3)

print(newarr)

• Output:
• [[ 1 2 3]
• [ 4 5 6]
• [ 7 8 9]
• [10 11 12]]
Reshape From 1-D to 3-D
• The outermost dimension will have 2 arrays that contains 3 arrays, each with
2 elements
• import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

Output:
[[[ 1 2]
[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]]
Can we Reshape into any Shape?
• Yes, as long as the elements required for reshaping are equal in
both shapes.
• We can reshape an 8 elements 1D array into 4 elements in 2 rows
2D array but we cannot reshape it into a 3 elements 3 rows 2D
array as that would require 3x3 = 9 elements.
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

newarr = arr.reshape(3, 3)

print(newarr)

• Traceback (most recent call last): File


"demo_numpy_array_reshape_error.py", line 5, in <module>
ValueError: cannot reshape array of size 8 into shape (3,3)
Flattening the arrays
• Flattening array means converting a multidimensional array
into a 1D array.
• import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

newarr = arr.reshape(-1)

print(newarr)
• Output: [1 2 3 4 5 6]
• There are a lot of functions for changing the shapes of arrays
in NumPy, such as flatten and ravel, and also for rearranging the
elements, such as rot90, flip, fliplr, and flipud. These fall under
the intermediate to advanced sections of NumPy.
ADDITIONAL EXAMPLES FOR NUMPY
OPERATIONS
• Access the element on the first row, second column:
import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('2nd element on 1st row: ', arr[0, 1])


• ANSWER: 2
• Access the element on the 2nd row, 5th column:
import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('5th element on 2nd row: ', arr[1, 4])


• ANSWER: 10
• Access the third element of the second array of the first array:
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr[0, 1, 2])
• Example Explained
arr[0, 1, 2] prints the value 6.
And this is why:
• The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
• The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
• The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
• ARRAY JOIN:
• Joining NumPy Arrays
• Joining means putting contents of two or more arrays in a single array.
• In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
• We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis. If
axis is not explicitly passed, it is taken as 0.
• Example
• Join two arrays
import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))

print(arr)
• ANSWER: 1,2,3,4,5,6
• Example
• Join two 2-D arrays along rows (axis=1):
• import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

arr = np.concatenate((arr1, arr2), axis=1)

print(arr)
• ANSWER: [[1 2 5 6]
• [3 4 7 8]]
• Joining Arrays Using Stack Functions
• Stacking is same as concatenation, the only difference is that stacking is done along a new axis.
• We can concatenate two 1-D arrays along the second axis which would result in putting them one over the other,
ie. stacking.
• We pass a sequence of arrays that we want to join to the stack() method along with the axis. If axis is not
explicitly passed it is taken as 0.
• Example
import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.stack((arr1, arr2), axis=1)

print(arr)
• ANSWER: [[1 4]
• [2 5]
• [3 6]]
• Stacking Along Rows
• NumPy provides a helper function: hstack() to stack along rows.
• Example
• import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.hstack((arr1, arr2))

print(arr)
• ANSWER: [1 2 3 4 5 6]
• Stacking Along Columns
• NumPy provides a helper function: vstack() to stack along columns.
• Example
import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.vstack((arr1, arr2))

print(arr)
• ANSWER: [[1 2 3]
• [4 5 6]]
• Stacking Along Height (depth)
• NumPy provides a helper function: dstack() to stack along height, which is the same
as depth.
• Example
import numpy as np

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.dstack((arr1, arr2))

print(arr)
• ANSWER: [[[1 4]
• [2 5]
• [3 6]]]
• Splitting NumPy Arrays
• Splitting is reverse operation of Joining.
• Joining merges multiple arrays into one and Splitting breaks one array into multiple.
• We use array_split() for splitting arrays, we pass it the array we want to split and the
number of splits.
• Example
• Split the array in 3 parts:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)
• ANSWER: [array([1, 2]), array([3, 4]), array([5, 6])]

• If the array has less elements than required, it will adjust from the end
accordingly.
• Example
• Split the array in 4 parts:
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 4)

print(newarr)
• ANSWER: [array([1, 2]), array([3, 4]), array([5]), array([6])]
• Split Into Arrays
• The return value of the array_split() method is an array containing each of the split as an array.
• If you split an array into 3 arrays, you can access them from the result just like any array element:
• Example
• Access the splitted arrays:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr[0])
print(newarr[1])
print(newarr[2])
• ANSWER: [1 2]
[3 4]
[5 6]
• From both elements, slice index 1 to index 4 (not included), this will return
a 2-D array:
• import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 1:4])
• ANSWER: [[2 3 4]
[7 8 9]]
• Splitting 2-D Arrays
• Use the same syntax when splitting 2-D arrays.
• Use the array_split() method, pass in the array you want to split and the number of splits you want to
do.
• Example
• Split the 2-D array into three 2-D arrays.
• import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])

newarr = np.array_split(arr, 3)

print(newarr)
• ANSWER: [array([[1, 2],
• [3, 4]]), array([[5, 6],
• [7, 8]]), array([[ 9, 10],
• [11, 12]])]
• Split the 2-D array into three 2-D arrays.
• import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])

newarr = np.array_split(arr, 3)

print(newarr)
• ANSWER:
[array([[1, 2, 3],
        [4, 5, 6]]), array([[ 7,  8,  9],
        [10, 11, 12]]), array([[13, 14, 15],
        [16, 17, 18]])]
• Split the 2-D array into three 2-D arrays along rows.
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])

newarr = np.array_split(arr, 3, axis=1)

print(newarr)

• ANSWER:
[array([[ 1], [ 4], [ 7], [10], [13], [16]]),
 array([[ 2], [ 5], [ 8], [11], [14], [17]]),
 array([[ 3], [ 6], [ 9], [12], [15], [18]])]
• An alternate solution is using hsplit() opposite of hstack()

• Example

• Use the hsplit() method to split the 2-D array into three 2-D arrays along rows.

import numpy as np

• arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])

• newarr = np.hsplit(arr, 3)

• print(newarr)

• ANSWER:
[array([[ 1], [ 4], [ 7], [10], [13], [16]]),
 array([[ 2], [ 5], [ 8], [11], [14], [17]]),
 array([[ 3], [ 6], [ 9], [12], [15], [18]])]
• Searching Arrays
• You can search an array for a certain value, and return the indexes that get a match.
• To search an array, use the where() method.
• Example
• Find the indexes where the value is 4:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

x = np.where(arr == 4)

print(x)
• ANSWER: (array([3, 5, 6]),)
• The example above will return a tuple: (array([3, 5, 6]),)
• Which means that the value 4 is present at index 3, 5, and 6.
• Find the indexes where the values are even:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

x = np.where(arr%2 == 0)

print(x)
• ANSWER: (array([1, 3, 5, 7]),)

• Find the indexes where the values are odd:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

x = np.where(arr%2 == 1)

print(x)
• ANSWER: (array([0, 2, 4, 6]),)
• Search Sorted
• There is a method called searchsorted() which performs a binary search in the array, and returns the
index where the specified value would be inserted to maintain the search order.
• The searchsorted() method is assumed to be used on sorted arrays.
• Example
• Find the indexes where the value 7 should be inserted:
• import numpy as np

arr = np.array([6, 7, 8, 9])

x = np.searchsorted(arr, 7)

print(x)
• ANSWER: 1
• Example explained: The number 7 should be inserted at index 1 to maintain the sort order.
• The method starts the search from the left and returns the first index where the number 7 is no longer
larger than the next value.
• Search From the Right Side
• By default the left most index is returned, but we can give side='right' to return the right
most index instead.
• Example
• Find the indexes where the value 7 should be inserted, starting from the right:
• import numpy as np

arr = np.array([6, 7, 8, 9])

x = np.searchsorted(arr, 7, side='right')

print(x)
• ANSWER: 2
• Example explained: The number 7 should be inserted at index 2 to maintain the sort order.
• The method starts the search from the right and returns the first index where the number
7 is no longer less than the next value.
• Multiple Values
• To search for more than one value, use an array with the specified values.
• Example
• Find the indexes where the values 2, 4, and 6 should be inserted:
import numpy as np

arr = np.array([1, 3, 5, 7])

x = np.searchsorted(arr, [2, 4, 6])

print(x)
• ANSWER: [1 2 3]
• Sorting Arrays
• Sorting means putting elements in an ordered sequence.
• Ordered sequence is any sequence that has an order corresponding to elements, like
numeric or alphabetical, ascending or descending.
• The NumPy ndarray object has a function called sort(), that will sort a specified array.
• Example
• Sort the array:
import numpy as np

arr = np.array([3, 2, 0, 1])

print(np.sort(arr))
• ANSWER: [0 1 2 3]
• You can also sort arrays of strings, or any other data type:
• Example
• Sort the array alphabetically:
import numpy as np

arr = np.array(['banana', 'cherry', 'apple'])

print(np.sort(arr))
• ANSWER: ['apple' 'banana' 'cherry']

• Example
• Sort a boolean array:
import numpy as np

arr = np.array([True, False, True])

print(np.sort(arr))
• ANSWER: [False True True]
• Sorting a 2-D Array
• If you use the sort() method on a 2-D array, each row will be sorted:
• Example
• Sort a 2-D array:
import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])

print(np.sort(arr))
• ANSWER: [[2 3 4]
• [0 1 5]]
• Slicing arrays
• Slicing in python means taking elements from one given index to another given index.
• We pass slice instead of index like this: [start:end].
• We can also define the step, like this: [start:end:step].
• If we don't pass start its considered 0
• If we don't pass end its considered length of array in that dimension
• If we don't pass step its considered 1
• Example
• Slice elements from index 1 to index 5 from the following array:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])
• ANSWER: [2 3 4 5]
• Slice elements from index 4 to the end of the array:
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[4:])
• ANSWER: 5,6,7

• Slice elements from the beginning to index 4 (not included):
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[:4])
• ANSWER: 1,2,3,4
• Negative Slicing
• Use the minus operator to refer to an index from the end:
• Example
• Slice from the index 3 from the end to index 1 from the
end:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[-3:-1])
• ANSWER: 5,6

• STEP
• Use the step value to determine the step of the slicing:
• Example
• Return every other element from index 1 to index 5:
• import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5:2])
• ANSWER: 2,4

Return every other element from the entire array:
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[::2])
• ANSWER: 1,3,5,7
• Slicing 2-D Arrays
• Example
• From the second element, slice elements from index 1 to index 4 (not
included):
• import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[1, 1:4])
• ANSWER: 7,8,9
• From both elements, return index 2:
• import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr[0:2, 2])
• ANSWER: 3,8
• Iterating Arrays
• Iterating means going through elements one by one.
• As we deal with multi-dimensional arrays in numpy, we can do this using basic for loop of
python.
• If we iterate on a 1-D array it will go through each element one by one.
• Example
• Iterate on the elements of the following 1-D array:
• import numpy as np

arr = np.array([1, 2, 3])

for x in arr:
print(x)
• ANSWER: 1
• 2
• 3
• Iterating 2-D Arrays
• In a 2-D array it will go through all the rows.
• Example
• Iterate on the elements of the following 2-D array:
• import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
print(x)
• ANSWER: [1 2 3]
[4 5 6]
• Example
• Iterate on each scalar element of the 2-D array:
• import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
for y in x:
print(y)
• ANSWER: 1
2
3
4
5
6
• Iterating 3-D Arrays
• In a 3-D array it will go through all the 2-D arrays.
• Example
• Iterate on the elements of the following 3-D array:
• import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
print(x)
• ANSWER: x represents the 2-D array:
• [[1 2 3]
• [4 5 6]]
• x represents the 2-D array:
• [[ 7 8 9]
• [10 11 12]]
• Iterating Arrays Using nditer()
• The function nditer() is a helping function that can be used from very basic to very advanced iterations. It solves some basic issues which
we face in iteration; let's go through it with examples.
• Iterating on Each Scalar Element
• In basic for loops, iterating through each scalar of an array we need to use n for loops which can be difficult to write for arrays with very
high dimensionality.
• Example
Iterate through the following 3-D array:
import numpy as np

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(arr):
print(x)
• ANSWER: 1
• 2
• 3
• 4
• 5
• 6
• 7
• 8
The Difference Between Copy and View

The main difference between a copy and a view of an array is that the copy is a new array, and the
view is just a view of the original array.

The copy owns the data and any changes made to the copy will not affect original array, and any
changes made to the original array will not affect the copy.

The view does not own the data and any changes made to the view will affect the original array, and
any changes made to the original array will affect the view.
• COPY:
• Example
• Make a copy, change the original array, and display both arrays:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])


x = arr.copy()
arr[0] = 42

print(arr)
print(x)
• ANSWER: [42 2 3 4 5]
• [1 2 3 4 5]

• The copy SHOULD NOT be affected by the changes made to the original array.
• VIEW:
• Example
• Make a view, change the original array, and display both arrays:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])


x = arr.view()
arr[0] = 42

print(arr)
print(x)
• ANSWER: [42 2 3 4 5]
• [42 2 3 4 5]

• The view SHOULD be affected by the changes made to the original array.
• Make Changes in the VIEW:
• Example
• Make a view, change the view, and display both arrays:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])


x = arr.view()
x[0] = 31

print(arr)
print(x)
• ANSWER:
• [31 2 3 4 5]
• [31 2 3 4 5]

• The original array SHOULD be affected by the changes made to the view.
• Check if Array Owns its Data
• As mentioned above, copies own the data, and views do not own the data, but how can we check
this?
• Every NumPy array has the attribute base that returns None if the array owns the data.
• Otherwise, the base attribute refers to the original object.
• Example
• Print the value of the base attribute to check if an array owns its data or not:
• import numpy as np

arr = np.array([1, 2, 3, 4, 5])

x = arr.copy()
y = arr.view()

print(x.base)
print(y.base)
• ANSWER: None
[1 2 3 4 5]
Shape of an Array
The shape of an array is the number of elements in each dimension.
________________________________________
Get the Shape of an Array
NumPy arrays have an attribute called shape that returns a tuple with
each index having the number of corresponding elements.
Example:
Print the shape of a 2-D array:
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)
ANSWER: (2, 4)
The example above returns (2, 4), which means that the array has 2 dimensions, where the
first dimension has 2 elements and the second has 4.
• Example
• Create an array with 5 dimensions using ndmin using a vector with values
1,2,3,4 and verify that last dimension has value 4:
• import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('shape of array :', arr.shape)
• ANSWER:
• [[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)
Reshaping arrays
Reshaping means changing the shape of an array.
The shape of an array is the number of elements in each dimension.
By reshaping we can add or remove dimensions or change number of elements in each dimension.
________________________________________
Reshape From 1-D to 2-D
• Reshape From 1-D to 2-D
• Example
• Convert the following 1-D array with 12 elements into a 2-D array.
• The outermost dimension will have 4 arrays, each with 3 elements:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(4, 3)

print(newarr)
• ANSWER: [[ 1 2 3]
• [ 4 5 6]
• [ 7 8 9]
• [10 11 12]]
• Reshape From 1-D to 3-D
• Example
• Convert the following 1-D array with 12 elements into a 3-D array.
• The outermost dimension will have 2 arrays that contains 3 arrays, each with 2 elements:
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(2, 3, 2)

print(newarr)
• ANSWER: [[[ 1 2]
• [ 3 4]
• [ 5 6]]
• [[ 7 8]
• [ 9 10]
• [11 12]]]
• Example
• Try converting 1D array with 8 elements to a 2D array with 3 elements in
each dimension (will raise an error):
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

newarr = arr.reshape(3, 3)

print(newarr)
• ANSWER: Traceback (most recent call last):
• File "demo_numpy_array_reshape_error.py", line 5, in <module>
• ValueError: cannot reshape array of size 8 into shape (3,3)
Identity matrix
• In this program, we will print an identity matrix of size nxn where n will be taken as an input from the user. We shall use
the identity() function in the numpy library which takes in the dimension and the data type of the elements as
parameters
• Algorithm
• Step 1: Import numpy.
• Step 2: Take dimensions as input from the user.
• Step 3: Print the identity matrix using numpy.identity() function.
• Example Code
import numpy as np

• dimension = int(input("Enter the dimension of identity matrix: "))
• identity_matrix = np.identity(dimension, dtype="int")
• print(identity_matrix)

• Output
• Enter the dimension of identity matrix: 5
• [[1 0 0 0 0]
• [0 1 0 0 0]
• [0 0 1 0 0]
• [0 0 0 1 0]
• [0 0 0 0 1]]
Eye( ) function
• Introduction
• The eye() function in Python's NumPy library is essential for creating identity
matrices, which are crucial in linear algebra and other mathematical computations.
An identity matrix is a square matrix with ones on the diagonal and zeros elsewhere,
serving as the neutral element in matrix multiplication.
• In this article, you will learn how to utilize the eye() function to generate identity
matrices of various sizes. Discover the simplicity of configuring matrix dimensions
and understand how this function can be applied to different matrix-related
operations.
• Basic Usage of numpy.eye()
• Create a Simple Identity Matrix
1. Import the NumPy library.
2. Use the eye() function to create an identity matrix of a specified size.
import numpy as np
• identity_matrix = np.eye(3)
• print(identity_matrix)
• This code creates a 3x3 identity matrix. Each row in the output contains exactly one
element with a value of 1 (aligned diagonally), and all other elements are 0.
• Specifying Data Type
1. Identify the desired data type for the matrix elements.
2. Pass the dtype argument to the eye() function to specify the data type.
import numpy as np
• identity_matrix_float = np.eye(3, dtype=float)
• print(identity_matrix_float)
• By setting dtype to float, each element in the matrix is of type float. This is particularly
useful when the identity matrix is used in computations needing floating point precision.
• Customizing the Identity Matrix
• Adjusting the Width of the Matrix
1. Use the N and M parameters to define the dimensions of the matrix.
2. Generate a matrix that is not strictly square.
import numpy as np
• rectangular_identity = np.eye(3, 4)
• print(rectangular_identity)
• This generates a 3x4 matrix where the identity diagonal is still present, but
the matrix is not square. This type of matrix is useful when dealing with
transformation matrices in graphics and other applications.
Introduction to
Pandas
Pandas
• Pandas is a popular open-source data manipulation and
analysis library for Python.
• It provides easy-to-use data structures like DataFrame
and Series, which are designed to make working with
structured data fast, easy, and expressive.
• Pandas is widely used in data science, machine
learning, and data analysis for tasks such as data
cleaning, transformation, and exploration.
Series
• A Pandas Series is a one-dimensional array-like object that
can hold data of any type (integer, float, string, etc.).
• It is labelled, meaning each element has a unique identifier
called an index.
• Series is defined as a column in a spreadsheet or a single
column of a database table.
• Series are a fundamental data structure in Pandas and are
commonly used for data manipulation and analysis tasks.
• They can be created from lists, arrays, dictionaries, and
existing Series objects.
• Series are also a building block for the more complex Pandas
DataFrame, which is a two-dimensional table-like structure
consisting of multiple Series objects.
Series

import pandas as pd

# Initializing a Series from a list
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)

# Initializing a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)

# Initializing a Series with a custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)

Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
Series - Indexing
• Each element in a Series has a corresponding index,
which can be used to access or manipulate the data.

print(series_from_list[0])
print(series_from_dict['b'])

Output
1
2
Series – Vectorized Operations
• Series supports vectorized operations, allowing you to
perform arithmetic operations on the entire series efficiently.

series_a = pd.Series([1, 2, 3])


series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b
print(sum_series)

Output
0 5
1 7
2 9
dtype: int64
Series – Alignment
• When performing operations between two Series
objects, Pandas automatically aligns the data based on
the index labels.
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
sum_series = series_a + series_b
print(sum_series)

Output
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
Series – NaN Handling
• Missing values, represented by NaN (Not a Number), can
be handled gracefully in Series operations.

series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])


series_b = pd.Series([4, 5], index=['b', 'c'])
sum_series = series_a + series_b
print(sum_series)

Output
a NaN
b 6.0
c 8.0
dtype: float64
DataFrame
• A Pandas DataFrame is a two-dimensional, tabular data
structure with rows and columns.
• It is similar to a spreadsheet or a table in a relational
database.
• The DataFrame has three main components:
• data, which is stored in rows and columns;
• rows, which are labeled by an index;
• columns, which are labeled and contain the actual data.
DataFrame
• The DataFrame has three main components:
• data, which is stored in rows and columns;
• rows, which are labeled by an index;
• columns, which are labeled and contain the actual data.
DataFrames

import pandas as pd

# Initializing a DataFrame from a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# Initializing a DataFrame from a list of lists
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)

Output (the same for both):
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago
DataFrames - Indexing
• DataFrame provides flexible indexing options, allowing access to
rows, columns, or individual elements based on labels or integer
positions.

# Accessing a column
print(df['Name'])

# Accessing a row by label
print(df.loc[0])

# Accessing a row by integer position
print(df.iloc[0])

# Accessing an individual element
print(df.at[0, 'Name'])

Output:
0     John
1    Alice
2      Bob
Name: Name, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
John
DataFrame – Column Operations
• Columns in a DataFrame are Series objects, enabling
various operations such as arithmetic operations, filtering,
and sorting.
# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Filtering rows based on a condition


high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)

# Sorting DataFrame by a column


sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
DataFrames – Column Operations
• Columns in a DataFrame are Series objects, enabling
various operations such as arithmetic operations, filtering,
and sorting.

# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Filtering rows based on a condition
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)

# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

Output:
  Name  Age     City  Salary
2  Bob   35  Chicago   70000

    Name  Age         City  Salary
2    Bob   35      Chicago   70000
1  Alice   30  Los Angeles   60000
0   John   25     New York   50000
DataFrames – Handling NaN
• DataFrames provide methods for handling missing or
NaN values, including dropping or filling missing values.

# Dropping rows with missing values
df.dropna()
print(df)

# Filling missing values with a specified value
df.fillna(0)
print(df)

Output (this DataFrame has no missing values, so it is unchanged in both cases):
    Name  Age         City  Salary
0   John   25     New York   50000
1  Alice   30  Los Angeles   60000
2    Bob   35      Chicago   70000
DataFrames – Grouping and
Aggregation
• DataFrames support group-by operations for
summarizing data and applying aggregation functions.

# Grouping by a column and calculating mean


avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)

City
Chicago 35.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64
Indexing
• Indexing is a fundamental operation for accessing and
manipulating data efficiently.
• It involves assigning unique identifiers or labels to data
elements, allowing for rapid retrieval and modification.
Indexing - Features
• Immutability: Once created, an index cannot be
modified.
• Alignment: Index objects are used to align data
structures like Series and DataFrames.
• Flexibility: Pandas offers various index types,
including integer-based, datetime, and custom
indices.
Index - Creation
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
Re-index
• Reindexing is the process of creating a new DataFrame
or Series with a different index.

• The reindex() method is used for this purpose.


import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
# Create a new index
new_index = ['A', 'B', 'D', 'E']

# Reindex the DataFrame


df_reindexed = df.reindex(new_index)

df_reindexed
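With this new index the reindexed frame should keep the existing rows 'A' and 'B', introduce 'D' and 'E' filled with NaN, and drop 'C' because it does not appear in new_index.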
Drop Entry
• Dropping entries in data science refers to removing
specific rows or columns from a dataset.
• This is a common operation in data cleaning and
preprocessing to handle missing values, outliers, or
irrelevant information.
Drop Entry
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df
# Drop column
newdf = df.drop("Age", axis='columns')

newdf
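For completeness, a row can be dropped the same way by its index label; a small sketch continuing the example above:

# Drop the first row (index label 0)
newdf_rows = df.drop(0, axis='index')
newdf_rows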
Selecting Entries – Selecting by
Position Created DataFrame

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Select the second row Selecting data by Position
df.iloc[1]
Selecting Entries – Selecting by
Condition Created DataFrame

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Select rows where Age is greater than 30 Selecting data by Condition
df[df['Age'] > 30]
Data Alignment
• Data alignment is intrinsic, which means that it's
inherent to the operations you perform.
• Align data in them by their labels and not by their
position
• align( ) function is used to align
• Used to align two data objects with each other according
to their labels.
• Used on both Series and DataFrame objects
• Returns a new object of the same type with labels
compared and aligned.
Data Alignment
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9] })
df2 = pd.DataFrame({
'A': [10, 11],
'B': [12, 13],
'D': [14, 15] })
Data Alignment
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9] })
df2 = pd.DataFrame({
'A': [10, 11],
'B': [12, 13],
'D': [14, 15] })
df1_aligned, df2_aligned = df1.align(df2, fill_value=np.nan)
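After the call, both aligned frames should share the union of the row labels (0, 1, 2) and of the columns (A, B, C, D), with NaN in every position that was missing from the original frame:

print(df1_aligned)   # columns A, B, C plus a new all-NaN column D
print(df2_aligned)   # gains column C and row 2, both filled with NaN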
Rank
• Ranking is assigning ranks or positions to data elements
based on their values.
• Rank is returned based on position after sorting.
• Used when analyzing data with repetitive values or when you
need to identify the top or bottom entries.
Rank
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'Animal': ['fox', 'Kangaroo', 'deer', 'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})
df
Rank
Rank
df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)
df
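For this data the ranks should come out roughly as follows: default_rank (average method) gives 2.5, 1.0, 2.5, 4.0 and NaN for the missing value; max_rank gives 3.0, 1.0, 3.0, 4.0, NaN; NA_bottom places the missing value last with rank 5.0; and pct_rank rescales the default ranks to 0.625, 0.25, 0.625, 1.0, NaN.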
Rank
Rank
Rank
Rank
Sort
• Sort by the values along the axis
• Sort a pandas DataFrame by the values of one or more
columns
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data while sorting values
• Sort a DataFrame in place using inplace set to True
Sort
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df
Sort by Ascending Order
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year', 'Population', 'Continent'])
df = df.sort_values(by=['Country'])   # sorting in ascending order
df
Sort by Descending Order
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year', 'Population', 'Continent'])
df = df.sort_values(by=['Population'], ascending=False)   # sorting in descending order
df
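• As noted in the Sort bullets, sorting by index, missing-value placement, and in-place
sorting are not demonstrated above; a minimal sketch on the same df using the standard
sort_index, na_position, and inplace parameters:

# Sort by the row index instead of a column
df.sort_index()

# Put missing values first when sorting (the default is 'last')
df.sort_values(by=['Population'], na_position='first')

# Sort the DataFrame in place instead of returning a sorted copy
df.sort_values(by=['Year', 'Country'], inplace=True)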
ADDITIONAL EXAMPLES FOR PANDAS
DATA SERIES
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

OUTPUT:
day1 420
day2 380
day3 390
dtype: int64
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index=["day1", "day2"])

print(myvar)
OUTPUT:
day1 420
day2 380
dtype: int64
DATA FRAMES
• Data sets in Pandas are usually multi-dimensional tables,
called DataFrames.
• A Series is like a column; a DataFrame is the whole table.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
OUTPUT:
calories duration
0 420 50
1 380 40
2 390 45
LOCATE ROW IN DATA FRAMES
Locate Row
As you can see from the result above, the DataFrame is
like a table with rows and columns.
Pandas uses the loc attribute to return one or more
specified row(s).
CONTINUING FROM PREVIOUS EXAMPLE:
#refer to the row index:
print(df.loc[0])
OUTPUT:
calories 420
duration 50
Name: 0, dtype: int64
• Return row 0 and 1:
• #use a list of indexes:
print(df.loc[[0, 1]])
OUTPUT:
calories duration
0 420 50
1 380 40
NAMED INDEXES
With the index argument, you can name your own indexes.
• Add a list of names to give each row a name:
• import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
• OUTPUT:
      calories  duration
day1       420        50
day2       380        40
day3       390        45
Locate Named Indexes
Use the named index in the loc attribute to return the
specified row(s).
• Return "day2":
• #refer to the named index:
print(df.loc["day2"])
OUTPUT:
calories 380
duration 40
Name: day2, dtype: int64
• Load Files Into a DataFrame
• If your data sets are stored in a file, Pandas can load them into a DataFrame.
• Example

• Load a comma separated file (CSV file) into a DataFrame:

• import pandas as pd
df = pd.read_csv('data.csv')
print(df)

OUTPUT:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
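• print(df) truncates a long DataFrame, as the output above shows. A minimal sketch of
two standard ways to inspect the loaded data instead:

print(df.head())        # only the first 5 rows
print(df.to_string())   # render every row of the DataFrame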
• Reset the index back to 0, 1, 2:
• import pandas as pd
data = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30],
"qualified": [True, False, False]
}
idx = ["X", "Y", "Z"]
df = pd.DataFrame(data, index=idx)
newdf = df.reset_index()
print(newdf)
OUTPUT:
index name age qualified
0 X Sally 50 True
1 Y Mary 40 False
2 Z John 30 False
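• By default reset_index() keeps the old labels as a new "index" column, as shown above.
A minimal sketch of the drop=True variant, which discards them instead:

newdf = df.reset_index(drop=True)
print(newdf)    # columns name, age, qualified with a fresh 0, 1, 2 index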
DROP METHOD
The drop() method removes the specified row or column.

By specifying the column axis (axis='columns'), the drop() method removes the
specified column.

By specifying the row axis (axis='index'), the drop() method removes the
specified row.

Syntax
dataframe.drop(labels, axis, index, columns, level, inplace, errors)
DROP ENTRY
• Example
• Remove the "age" column from the DataFrame:

import pandas as pd

data = {
  "name": ["Sally", "Mary", "John"],
  "age": [50, 40, 30],
  "qualified": [True, False, False]
}

df = pd.DataFrame(data)

newdf = df.drop("age", axis='columns')

print(newdf)

OUTPUT:
    name  qualified
0  Sally       True
1   Mary      False
2   John      False
• Parameters
• The axis, index, columns, level, inplace, and errors parameters are keyword
arguments.

Parameter   Value(s)                Description

labels      label or list           Optional. The labels or indexes to drop. If more than
                                    one, specify them in a list.

axis        0, 1, 'index',          Optional. Which axis to check; default 0.
            'columns'

index       string or list          Optional. The name(s) of the rows to drop. Can be
                                    used instead of the labels parameter.

columns     string or list          Optional. The name(s) of the columns to drop. Can be
                                    used instead of the labels parameter.

level       number or level name    Optional, default None. Which level of a hierarchical
                                    (multi-level) index to check along.

inplace     True, False             Optional, default False. If True, the removal is done
                                    on the current DataFrame; if False, a copy with the
                                    removal applied is returned.

errors      'ignore', 'raise'       Optional, default 'raise'. Whether to raise an error
                                    for missing labels or ignore them.
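• The index and columns parameters listed above offer an alternative to labels + axis,
and inplace=True modifies the DataFrame directly. A minimal sketch reusing the df from
the previous example:

# Equivalent to df.drop("age", axis='columns'), but modifying df itself
df.drop(columns=["age"], inplace=True)

# Drop the rows with index labels 0 and 2, also in place
df.drop(index=[0, 2], inplace=True)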
Pandas DataFrame get() Method
• Extract the "firstname" column from the DataFrame:
• import pandas as pd
data = {
"firstname": ["Sally", "Mary", "John"],
"age": [50, 40, 30],
"qualified": [True, False, False]
}
df = pd.DataFrame(data)
print(df.get("firstname"))
OUTPUT:
0 Sally
1 Mary
2 John
Name: firstname, dtype: object
The get() method returns the specified column(s) from the DataFrame.

If you specify only one column, the return value is a Pandas Series object.

To specify more than one column, pass the column names inside a list. The
result will be a new DataFrame object.

Syntax
dataframe.get(key)
key - Optional. A String or object representing the column(s)
you want to return
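• Passing a list of column names to get() returns a DataFrame, and a missing key returns
None instead of raising an error; a minimal sketch with the same df:

print(df.get(["firstname", "age"]))   # two columns -> a new DataFrame
print(df.get("salary"))               # column does not exist -> None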
Pandas DataFrame filter() Method
• Return a DataFrame with only the "name" and "age" columns:

import pandas as pd
data = {
  "name": ["Sally", "Mary", "John"],
  "age": [50, 40, 30],
  "qualified": [True, False, False]
}
df = pd.DataFrame(data)
newdf = df.filter(items=["name", "age"])
print(newdf)

OUTPUT:
    name  age
0  Sally   50
1   Mary   40
2   John   30
The filter() method filters the DataFrame, and returns
only the rows or columns that are specified in the filter.

Syntax
dataframe.filter(items, like, regex, axis)
Parameters
The items, like, regex, and axis parameters are keyword
arguments.
Parameter   Value(s)                Description

items       list                    Optional. A list of labels or indexes of the rows or
                                    columns to keep.

like        string                  Optional. A string that the indexes or column labels
                                    should contain.

regex       regular expression      Optional. A regular expression that the indexes or
                                    column labels should match.

axis        0, 1, 'index',          Optional, default None (the column axis for a
            'columns', None         DataFrame). The axis to filter on.
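• A minimal sketch of the like, regex, and axis parameters on the same df (the items
form was shown above):

print(df.filter(like="ag", axis='columns'))    # keeps the 'age' column
print(df.filter(regex="^q", axis='columns'))   # keeps the 'qualified' column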
WEB API
• A web API, or Application Programming Interface, is a way for web browsers
and web servers to communicate with each other. Web APIs allow applications to
perform specific functions by exchanging data.
• API stands for Application Programming Interface. An API is an interface that
exposes a set of functions; this set of functions allows programmers to use
specific features or data of an application.
• A web API, as the name suggests, is an API that can be accessed over the web
using the HTTP protocol. It is a framework that helps you create and develop
HTTP-based RESTful services. A web API can be developed using different
technologies such as Java, ASP.NET, etc., and it is used in either a web server
or a web browser.
• Basically, Web API is a web development concept. It is limited to the client
side of a web application and does not cover web server or web browser details.
• If an application is to be used on a distributed system and must provide
services to different devices such as laptops and mobile phones, web API
services are used. A web API can be seen as an enhanced form of a web
application.
• ASP.NET Web API: ASP.NET stands for Active Server Pages .NET. It is mostly
used for creating web pages and web technologies, and is an important tool for
building dynamic web pages with languages like C# and Visual Basic.
• ASP.NET Web API is a framework that helps you build services that can reach a
wide range of clients, including browsers, mobiles, and tablets. With ASP.NET
you can use the same framework and the same patterns for creating both web
pages and services.
• Where to use Web API?
1. Web APIs are very useful for implementing RESTful web services using the
.NET framework.
2. Web API enables the development of HTTP services that reach client entities
such as browsers, devices, or tablets.
3. ASP.NET Web API can be used with MVC for any type of application.
4. A web API can help you develop an ASP.NET application via AJAX.
5. Hence, Web API makes it easier for developers to build an ASP.NET
application that is compatible with any browser and almost any device.
• Why Choose Web API?
• Web API services are preferable over other services when used with a native
application that does not support SOAP but requires web services.
• For creating resource-oriented services, web API services are the best
choice; these services are established using HTTP as RESTful services.
• If you want good performance and fast development of services, web API
services are very helpful.
• Web API services are well suited to developing lightweight, maintainable web
services, and they support any text format such as JSON, XML, etc.
• For devices with tight bandwidth or bandwidth limitations, web API services
are the best option.
• An API provides data to programmers, which is made available to outside
users. When programmers decide to make some of their data available to the
public, they "expose endpoints," meaning they publish a portion of the language
they have used to build their program. Other programmers can then extract the
data from the application by building URLs or using HTTP clients to request
data from those endpoints.
• Server side: A server-side web API is a programmatic interface consisting of
one or more publicly exposed endpoints that define a request-response message
system. A mashup is a web application that combines several server-side APIs; a
webhook is a server-side API that takes a uniform resource identifier as input.
• Client side: Client-side web APIs target standardized JavaScript bindings.
Google created its Native Client architecture to replace native plug-ins with
secure, sandboxed native extensions and applications.
• Steps to use Web API:
• Most APIs require an API key. Once you find an API you want to play with,
look in the documentation for access requirements. Most APIs will ask you
to complete an identity verification, like signing in with your Google
account. You’ll get a unique string of letters and numbers to use when
accessing the API.
• The easiest way to start using an API is by finding an HTTP client online,
like REST-Client, Postman, or Paw. These ready-made tools help you structure
your requests to access existing APIs with the API key you received. You will
still need to know some of the syntax from the documentation, but very little
coding knowledge is required.
• The next best way to pull data from an API is by building a URL from
existing API documentation.
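• A minimal sketch of pulling data from a web API in Python using the requests library;
the endpoint URL and the api_key parameter name are hypothetical placeholders, since
every real API documents its own URL, parameters, and authentication scheme:

import requests

url = "https://api.example.com/v1/data"                    # hypothetical endpoint
params = {"api_key": "YOUR_API_KEY", "q": "data science"}  # hypothetical parameters

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()    # raise an error for 4xx/5xx responses
data = response.json()         # most web APIs return JSON
print(data)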
• Popular API examples:
1. Google Maps APIs: Google Maps APIs allow developers to use Google Maps on
web pages through a JavaScript or Flash interface.
2. YouTube APIs: Google's APIs let developers integrate YouTube videos and
functionality into websites or applications. YouTube APIs include the YouTube
Analytics API, YouTube Data API, YouTube Live Streaming API, YouTube Player
APIs, and others.
3. The Flickr APIs: used by developers to access the Flickr photo-sharing
community's data.
4. Twitter APIs: Twitter offers two APIs; the REST API allows developers to
access core Twitter data, and the Search API provides methods for developers to
interact with Twitter search and trends data.
OPEN DATA SOURCES
• Open data is data which is openly accessible to all, including companies, citizens,
the media, and consumers. A popular definition: "Open data and content can be freely
used, modified, and shared by anyone for any purpose."
• REAL-WORLD EXAMPLES:
• Open Data Catalog
• DataBank
• Microdata Library
• Global Data Facility
• International Debt Statistics
• Open Finances
• World Development Indicators
• Projects and Operations
• Global Consumption Database
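• Many open data portals publish their datasets as direct CSV downloads, which pandas
can read straight from a URL; a minimal sketch (the URL below is a hypothetical
placeholder for a real download link listed on a portal):

import pandas as pd

url = "https://data.example.org/downloads/population.csv"  # hypothetical open-data link
df = pd.read_csv(url)
print(df.head())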
WEB SCRAPING
• Web scraping is a technique to extract a large amount of data from several
websites.
• The term "scraping" refers to obtaining information from another source
(web pages) and saving it into a local file. For example, suppose you are
working on a project called "phone comparing website," where you require the
prices, ratings, and model names of mobile phones to make comparisons between
different phones.
• If you collect these details by checking various sites manually, it will take
much time. In that case, web scraping plays an important role: by writing a few
lines of code you can get the desired results.
WEB SCRAPING
• Web scraping extracts data from websites in an unstructured format; it helps
collect this unstructured data and convert it into a structured form.
• Startups prefer web scraping because it is a cheap and effective way to get a
large amount of data without any partnership with a data-selling company.
• Is web scraping legal?
• The question arises whether web scraping is legal or not. The answer is that
some sites allow it when it is used legally. Web scraping is just a tool; you
can use it in the right way or the wrong way.
• Web scraping is illegal if someone tries to scrape non-public data.
Non-public data is not reachable by everyone; if you try to extract such data,
it is a violation of the legal terms.
WEB SCRAPING TOOLS
• There are several tools available to scrape data from websites, such as:
• Scraping-Bot
• Scraper API
• Octoparse
• Import.io
• Webhose.io
• Dexi.io
• Outwit
• Diffbot
• Content Grabber
• Mozenda
• Web Scraper Chrome Extension
WEB SCRAPING PROCESS
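• The original slide shows the process as a diagram; in outline it is typically: request
the page over HTTP, parse the returned HTML, extract the required elements, and store
them in a structured form. A minimal sketch using the requests and BeautifulSoup (bs4)
libraries; the URL and the CSS class name are hypothetical placeholders:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# 1. Request the page (hypothetical URL)
page = requests.get("https://www.example.com/phones", timeout=10)

# 2. Parse the returned HTML
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extract the elements of interest (hypothetical class name)
prices = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="price")]

# 4. Store the result in a structured form
pd.DataFrame({"price": prices}).to_csv("phone_prices.csv", index=False)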
BENEFITS OF WEB SCRAPING
• As discussed above, web scraping is used to extract data from websites, but we
should also know how to use that raw data. The raw data can be used in various
fields. Let's have a look at the uses of web scraping:
• Dynamic price monitoring
• Web scraping is widely used to collect data from several online shopping sites,
compare the prices of products, and make profitable pricing decisions. Price
monitoring using scraped data gives companies the ability to know the market
condition and facilitates dynamic pricing, helping them stay ahead of their
competitors.
• Market research
• Web scraping is well suited for market trend analysis and for gaining insights
into a particular market. Large organizations require a great deal of data, and
web scraping provides it with a guaranteed level of reliability and accuracy.
• Email gathering
• Many companies use personal e-mail data for email marketing, so that they can
target a specific audience for their marketing.
THANK YOU
