Data Science Workshop - Day 1

The document outlines a 3-day workshop on Data Science, led by Sathishkumar Kannan, covering topics such as the definition of data science, its life cycle, and practical programming with Python, Anaconda, and Jupyter. It includes detailed phases of the data science process, from business understanding to model deployment, as well as hands-on exercises with Python libraries like Pandas and NumPy. Participants will learn essential data manipulation and analysis techniques, along with data visualization methods.


DATA SCIENCE

(3 DAYS WORKSHOP)

Day 1
Workshop
By

Sathishkumar Kannan MSc (UK)

Founder and MD of WHY Global Services


CEO, Abhis Overseas Educampus Pvt Ltd
Agenda

• What is Data Science?
• Types of Data Science
• Facets of Data
• Data Science Process/Life Cycle
• Install Python, Anaconda and Jupyter
• Python Basic Programming
• Work with Pandas
• Work with NumPy
What is Data Science?

Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques.

Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains.
Facets of Data

• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured Data
Unstructured data
Machine-generated data
Graph-based or network data
Data Science Life Cycle
Phase 1: Business Understanding

• The first phase consists of defining the business problem, because a well-defined problem statement sets a specific goal and is key to the success of the project.
• The main goal is to understand the business problem, its domain, and the kind of solution the business seeks.
Phase 2: Data Collection

• This stage involves gathering raw structured and unstructured data through data acquisition, data entry, signal reception, and data extraction.
• Data should be collected from a reliable source to ensure that it is correct, because poor-quality data produces poor-quality results.
Phase 3: Data Preparation

Data preparation is a crucial step in a data science project: it cleans the data and brings it into the shape required for further analysis and modeling. This step is also referred to as data cleaning. As part of data preparation, we treat issues such as missing values and outliers, and transform the data into the required format.
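As a minimal sketch of these steps, assuming a small made-up table (the column names age and income, the median fill, and the 1.5 * IQR outlier rule are illustrative choices, not part of the workshop material), pandas can convert a column, fill missing values, and filter outliers:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing age and an obvious income outlier
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": ["30000", "42000", "39000", "1000000", "35000"],
})

# 1. Transform: convert income from text to numbers
raw["income"] = raw["income"].astype(int)

# 2. Missing values: fill the missing age with the column median
raw["age"] = raw["age"].fillna(raw["age"].median())

# 3. Outliers: keep rows within 1.5 * IQR of the income quartiles
q1, q3 = raw["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = raw[raw["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)
```

Other strategies (dropping rows with missing values, capping outliers instead of removing them) are equally valid; the right choice depends on the data and the business problem.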
Phase 4: Exploratory Data Analysis

• Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
• Data is analyzed using summary statistics and graphs to understand key patterns.
• Exploratory analysis also establishes the relationships among different variables in the form of correlations.
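A short sketch of the summary statistics and correlations mentioned above, assuming a made-up data set of study hours and exam scores (the data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data set: hours studied and exam score for six students
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 68, 74, 80],
})

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = df.describe()
print(summary)

# Pairwise Pearson correlations between the numeric columns
corr = df.corr()
print(corr)
```

A correlation close to 1 between hours and score would suggest a strong positive linear relationship, which is the kind of pattern EDA is meant to surface before any model is built.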
Exploratory Data Analysis Tools
Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together.
R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data scientists for statistical computing and data analysis.
Phase 5: Model Building

• There are two types of data modeling:

1. descriptive analytics, which draws insights from historical data, and
2. predictive modeling, which makes predictions about the future.
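The difference between the two can be sketched with NumPy, assuming made-up monthly sales figures. The straight-line fit with np.polyfit is only one simple illustrative choice of predictive model, not the method prescribed by the workshop:

```python
import numpy as np

# Hypothetical monthly sales for months 1..6
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0])

# Descriptive analytics: summarise what has already happened
average_sales = sales.mean()
print("Average monthly sales so far:", average_sales)

# Predictive modeling: fit a straight line and extrapolate to month 7
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * 7 + intercept
print("Forecast for month 7:", round(forecast, 1))
```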


Phase 6: Deployment

• Once the model is built, it is ready to be deployed in the real world. Deployment can occur offline, on the web, in the cloud, or in an Android or iOS app.
• The data science project is monitored and maintained so that it keeps working in the long run. If performance degrades, relevant changes can be made as part of maintenance.
Data Scientist
Python

• Python is an open-source, interpreted, high-level language with good support for object-oriented programming.
• It is one of the most widely used languages among data scientists for data science projects and applications.
• It is well suited to data analysis, data visualization, and machine learning tasks.
Download

• The latest version of Python can be downloaded from the link below:
• https://www.python.org/downloads/

Install

• Download and install Anaconda
• Launch Anaconda
• Launch Jupyter Notebook
Anaconda

• The world’s most popular open-source Python distribution platform
• https://repo.anaconda.com/archive/Anaconda3-2022.05-Windows-x86_64.exe

The Anaconda repository features over 8,000 open-source data science and machine learning packages, Anaconda-built and compiled for all major operating systems and architectures.
Jupyter Notebook

• The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

Uses include:

1. data cleaning and transformation,
2. numerical simulation,
3. statistical modeling,
4. data visualization,
5. machine learning.
Python Programming Basics

1. print("Welcome to Data Science Workshop")
2. type(3)
3. type(3.14)
4. pi = 3.14
   type(pi)
Add two numbers
# Python 3 program to add two numbers

num1 = 15
num2 = 12

# Add the two numbers (named total to avoid shadowing the built-in sum)
total = num1 + num2

# Print the values
print("Sum of {0} and {1} is {2}".format(num1, num2, total))
Comments

• In Python, a single-line comment begins with a hash (#) symbol followed by the comment. For example:

# This is a single line comment in Python

Variable in Python

message = 'Hello World!'
print(message)

message = 'Good Bye!'
print(message)
String

• A string is a series of characters. In Python, anything inside quotes is a string, and you can use either single or double quotes. For example:

message = 'This is a string in Python'
message = "This is also a string"
f-strings

name = 'John'
message = f'Hi {name}'
print(message)
Concatenating Python strings

# Adjacent string literals are joined automatically
greeting = 'Good ' 'Morning!'
print(greeting)

# The + operator joins string variables
greeting = 'Good '
time = 'Afternoon'
greeting = greeting + time + '!'
print(greeting)
Length of the string

# Named text rather than str, which would shadow the built-in str type
text = "Python String"
text_len = len(text)
print(text_len)
Integers
• Integers are whole numbers such as -1, 0, 1, 2, and 3, and they have the type int.

>>> 20 + 10
30
>>> 20 - 10
10
>>> 20 * 10
200
>>> 20 / 10
2.0
Calculate Exponents

• To calculate exponents, use the exponentiation operator, written as two asterisks (**). For example:

>>> 3**3
27
Booleans

>>> 10 == 10
True
>>> 10 == 11
False
>>> "jack" == "jack"
True
>>> "jack" == "jake"
False
Inequality

>>> 10 != 10
False
>>> 10 != 11
True
>>> "jack" != "jack"
False
>>> "jack" != "jake"
True
Dictionaries

>>> words={'apple':'red','lemon':'yellow'}
>>> words
{'apple': 'red', 'lemon': 'yellow'}
>>> words['apple']
'red'
>>> words['lemon']
'yellow'
Function

# A simple Python function
def fun():
    print("Welcome to Data Science Workshop")

# Driver code to call the function
fun()
Adding Two Numbers
def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    num3 = num1 + num2
    return num3

# Driver code
num1, num2 = 5, 15
ans = add(num1, num2)
print(f"The addition of {num1} and {num2} results in {ans}.")
Pandas with Python

• Pandas is an open-source Python library used for high-performance data manipulation and data analysis using its powerful data structures.
• Python with pandas is used in a variety of academic and commercial domains, including finance, economics, statistics, advertising, web analytics, and more.
Let’s get started

• Import NumPy and load pandas into your namespace:

import numpy as np
import pandas as pd
Series

• Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

• Here, data can be many different things:
• a Python dict
• an ndarray
• a scalar value (like 5)


Series can be instantiated from dicts:

d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

Output
b 1
a 0
c 2
dtype: int64

A Series can also be created from a list:

import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a)

print(myvar)
Pass index

d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)

Output
a 0.0
b 1.0
c 2.0
dtype: float64

pd.Series(d, index=["b", "c", "d", "a"])

Output
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

NaN (not a number) is the standard missing data marker used in pandas.
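The examples so far build a Series from a dict or a list. The other two input kinds named earlier, an ndarray and a scalar value, can be sketched as follows (the values and index labels are made up for illustration):

```python
import numpy as np
import pandas as pd

# From an ndarray: the index must have the same length as the data
s_array = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])
print(s_array)

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(5.0, index=["a", "b", "c"])
print(s_scalar)
```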
Pandas Data Frames

A Pandas Data Frame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
A simple Pandas Data Frame

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)
Locate Row

# Refer to a row by its index label:
print(df.loc[0])

# Use a list of index labels to select several rows:
print(df.loc[[0, 1]])
Creating a dataframe using List
# Import pandas as pd
import pandas as pd

# List of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Call the DataFrame constructor on the list
df = pd.DataFrame(lst)
print(df)
Creating DataFrame from dict of ndarray/lists
import pandas as pd

# Initialise data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming.
Dealing with Rows and Columns

• A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming.
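As a brief sketch of the adding, renaming, and deleting operations on columns, assuming a small in-memory table (the City values and the new column name FirstName are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "nick", "krish", "jack"],
                   "Age": [20, 21, 19, 18]})

# Adding: assign a list with one value per row to a new column name
df["City"] = ["Chennai", "Delhi", "Mumbai", "Pune"]  # made-up values

# Renaming: map old column names to new ones
df = df.rename(columns={"Name": "FirstName"})

# Deleting: drop a column by name
df = df.drop(columns=["Age"])

print(df)
```

Both rename and drop return a new DataFrame by default, which is why the result is assigned back to df.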
Column Selection
# Import pandas package
import pandas as pd

# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# Select two columns
print(df[['Name', 'Qualification']])
Row Selection
# Import pandas package
import pandas as pd

# Make a data frame from a CSV file, using the Name column as the index
data = pd.read_csv("nba.csv", index_col="Name")

# Retrieve rows by index label with the loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)
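The example above needs the nba.csv file. A self-contained sketch of the same idea, assuming a tiny made-up table with placeholder player names in place of the CSV:

```python
import pandas as pd

# Hypothetical stand-in for nba.csv, indexed by a "Name" column
data = pd.DataFrame(
    {"Team": ["Boston", "Boston"], "Age": [25, 22]},
    index=pd.Index(["Player A", "Player B"], name="Name"),
)

# loc selects by index label and returns the row as a Series
first = data.loc["Player A"]
print(first)

# iloc selects by integer position instead of by label
same_row = data.iloc[0]
print(same_row.equals(first))
```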


Row Addition
# Import the pandas module
import pandas as pd

# Make the data frame; Name is kept as a regular column so that
# it lines up with the Name value of the new row
df = pd.read_csv("nba.csv")

df.head(10)

# Build a one-row DataFrame with the same columns
new_row = pd.DataFrame({'Name': 'Geeks', 'Team': 'Boston', 'Number': 3,
                        'Position': 'PG', 'Age': 33, 'Height': '6-2',
                        'Weight': 189, 'College': 'MIT', 'Salary': 99999},
                       index=[0])

# Concatenate both dataframes and renumber the rows
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head(5)
Working with text data

There are two ways to store text data in pandas:

• object-dtype NumPy array.
• StringDtype extension type.

Working with text data

pd.Series(["a", "b", "c"])


Output
0 a
1 b
2 c
dtype: object

pd.Series(["a", "b", "c"], dtype="string")
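A small sketch comparing the two, under the assumption that the default dtype printed for the first Series depends on the installed pandas version (object in the releases this slide was written for). The .str accessor used at the end is the standard pandas way to apply string methods element-wise:

```python
import pandas as pd

# Default inference vs. explicitly requesting the string extension type
s_default = pd.Series(["a", "b", "c"])
s_string = pd.Series(["a", "b", "c"], dtype="string")

print(s_default.dtype)
print(s_string.dtype)

# The .str accessor applies string methods to every element
upper = s_string.str.upper()
print(upper.tolist())
```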
Chart Visualisation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range("1/1/2000", periods=1000))

ts = ts.cumsum()

ts.plot();
Plot

On DataFrame, plot() is a convenience to plot all of the columns with labels:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=list("ABCD"))

df = df.cumsum()

plt.figure();

df.plot();
Bar Plot

plt.figure();

df.iloc[5].plot(kind="bar");
Scatter Matrix

from pandas.plotting import scatter_matrix

df = pd.DataFrame(np.random.randn(1000, 4),
                  columns=["a", "b", "c", "d"])

scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="kde");
Density Plot

ser = pd.Series(np.random.randn(1000))
ser.plot.kde();
Numpy
Numpy Operations
Size and Shape
Reshape and Slicing
Minimum, Maximum and Sum
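The operations named in the headings above can be sketched on one small array. The example values are made up; size, shape, reshape, min, max and sum are standard NumPy array attributes and methods:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Size and shape
print(arr.size)    # 6 elements in total
print(arr.shape)   # (6,)

# Reshape into 2 rows and 3 columns, then slice out the first row
grid = arr.reshape(2, 3)
first_row = grid[0, :]
print(grid)
print(first_row)   # [1 2 3]

# Minimum, maximum and sum
print(arr.min(), arr.max(), arr.sum())  # 1 6 21
```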
Basic function
import numpy

arr = numpy.array([1, 2, 3, 4, 5])

print(arr)

Numpy as np

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
Checking NumPy Version

import numpy as np

print(np.__version__)
Create a NumPy ndarray Object

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
print(type(arr))

type(): This built-in Python function tells us the type of the object passed to it. In the code above, it shows that arr is of type numpy.ndarray.

An ndarray can also be created from a tuple:

import numpy as np

arr = np.array((1, 2, 3, 4, 5))

print(arr)
Dimensions in Arrays

0-D Arrays
import numpy as np
arr = np.array(42)
print(arr)

1-D Arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

2-D Arrays
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

3-D Arrays
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
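To check how many dimensions an array has, NumPy arrays expose the ndim attribute. A short sketch using the four arrays from this slide (the variable names a to d are chosen here for illustration):

```python
import numpy as np

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

# ndim is the number of dimensions (axes) of each array
print(a.ndim, b.ndim, c.ndim, d.ndim)  # 0 1 2 3
```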
Access Array Elements

import numpy as np

arr = np.array([1, 2, 3, 4])

# First element (indexing starts at 0)
print(arr[0])

# Second element
print(arr[1])

# Sum of the third and fourth elements
print(arr[2] + arr[3])
NumPy Array Slicing

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Elements from index 1 up to, but not including, index 5
print(arr[1:5])

# Elements from index 4 to the end
print(arr[4:])

# Elements from the start up to, but not including, index 4
print(arr[:4])
Checking the Data Type of an Array

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.dtype)

import numpy as np

arr = np.array(['apple', 'banana', 'cherry'])

print(arr.dtype)
Shape of an Array

import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)
NumPy Array Iterating

Iterating means going through elements one by one.

Iterating over a 1-D array:

import numpy as np

arr = np.array([1, 2, 3])

for x in arr:
    print(x)

Iterating over a 2-D array (each x is one row):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
    print(x)
Joining NumPy Arrays

Joining means putting contents of two or more arrays in a single array.

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Splitting NumPy Arrays

Split the array into 3 parts:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

Split the array into 4 parts:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 4)

print(newarr)

If the array has fewer elements than required for an even split, array_split adjusts the size of the parts from the end accordingly.
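To make the adjustment concrete, the following sketch prints both splits as plain lists (the variable names three_parts and four_parts are chosen here for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# 6 elements split evenly into 3 parts of 2
three_parts = np.array_split(arr, 3)
print([part.tolist() for part in three_parts])  # [[1, 2], [3, 4], [5, 6]]

# 6 elements cannot be split evenly into 4 parts:
# the first parts keep 2 elements, the last parts get 1 each
four_parts = np.array_split(arr, 4)
print([part.tolist() for part in four_parts])   # [[1, 2], [3, 4], [5], [6]]
```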
THANK YOU FOR LISTENING!!!!

Any Queries, please ask…!

7299119900

www.whyglobalservices.com
www.whytap.in
www.abhisoverseas.com
