Data Science Workshop - Day 1

The document outlines a 3-day workshop on Data Science, led by Sathishkumar Kannan, covering topics such as the definition of data science, its life cycle, and practical programming with Python, Anaconda, and Jupyter. It includes detailed phases of the data science process, from business understanding to model deployment, as well as hands-on exercises with Python libraries like Pandas and NumPy. Participants will learn essential data manipulation and analysis techniques, along with data visualization methods.

DATA SCIENCE

(3 DAYS WORKSHOP)

Day 1
Workshop
By

Sathishkumar Kannan MSc (UK)

Founder and MD of WHY Global Services

CEO, Abhis Overseas Educampus Pvt Ltd
Agenda

 What is Data Science?

 Types of Data Science
 Facets of Data
 Data Science Process/Life Cycle
 Install Python, Anaconda and Jupyter
 Python Basic Programming
 Work with Pandas
 Work with Numpy
What is Data Science?
Big data is a blanket term for any
collection of data sets so large or
complex that it becomes difficult to
process them using traditional data
management techniques.

Data science involves using

methods to analyze massive
amounts of data and extract the
knowledge it contains.
Facets of Data

• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured Data
Unstructured data
Machine-generated data
Graph-based or network data
Data Science Life Cycle
Phase 1: Business Understanding

 The first phase consists of defining the

business problem because a well-defined
problem statement defines a specific goal
and is the key to the success of the
project.

 The main goal is to get an understanding

of the business problem, the domain of
the business problem, and the kind of
solution the business seeks.
Phase 2: Data Collection

 This stage covers data acquisition, data entry, signal reception, and data extraction. It involves gathering raw structured and unstructured data.

 Collect data from reliable sources so that it is correct, because bad input data can only produce bad results.
Phase 3: Data Preparation

Data preparation is a crucial step in a Data Science

project as it helps in cleaning and bringing the data
into the shape, which is required for further analysis
and modeling. This may also be referred to as data
cleaning. As part of data preparation, we treat
issues like missing values, outliers and also transform
the data into the required format.
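As a minimal sketch of the steps just described, the following uses pandas on a small made-up table. The column names, the values, and the choice of median imputation and the 1.5 × IQR rule are assumptions for illustration, not part of the workshop material.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing age and one implausible salary
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "salary": ["30000", "42000", "39000", "1000000", "35000"],
})

# Transform: salary arrives as text, so convert it to a numeric column
raw["salary"] = pd.to_numeric(raw["salary"])

# Missing values: fill the gap in age with the column median
raw["age"] = raw["age"].fillna(raw["age"].median())

# Outliers: keep rows whose salary lies within 1.5 * IQR of the quartiles
q1, q3 = raw["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = raw["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = raw[in_range].reset_index(drop=True)

print(clean)
```

Whether to drop, cap, or keep an outlier depends on the business problem; dropping is used here only because it is the shortest to show.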
Phase 4: Exploratory Data Analysis

 Exploratory data analysis (EDA) is used by data scientists to analyze and

investigate data sets and summarize their main characteristics, often
employing data visualization methods.

 Data is analyzed using summary statistics and graphically to understand

key patterns.

 The exploratory analysis also establishes the relationship among different

variables in the form of correlations.
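A minimal sketch of the two ideas above, summary statistics and correlations, using pandas. The data set (hours studied against exam score) is made up for illustration.

```python
import pandas as pd

# Hypothetical data set: hours studied and exam score for six students
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 68, 70, 79],
})

# Summary statistics: count, mean, std, min, quartiles, max per column
summary = df.describe()
print(summary)

# Pairwise Pearson correlations between the numeric columns
corr = df.corr()
print(corr)
```

A correlation close to 1 between hours and score suggests a strong positive linear relationship, which a scatter plot would confirm visually.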
Exploratory Data Analysis Tools
Python: An interpreted, object-oriented programming
language with dynamic semantics. Its high-level,
built-in data structures, combined with dynamic
typing and dynamic binding, make it very attractive
for rapid application development, as well as for use
as a scripting or glue language to connect existing
components together.
R: An open-source programming language and free
software environment for statistical computing and
graphics supported by the R Foundation for Statistical
Computing. The R language is widely used among
statisticians and data scientists for statistical
computing and data analysis.
Phase 5: Model Building

 There are two types of data modeling: descriptive analytics, which derives insights from historical data, and predictive modeling, which makes predictions about the future.
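To make the distinction concrete, here is a minimal sketch with NumPy. The monthly sales figures are made up, and a straight-line fit is just one simple predictive technique chosen for brevity, not a method prescribed by the workshop.

```python
import numpy as np

# Hypothetical monthly sales for months 1..6
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 110, 120, 130, 140, 150])

# Descriptive: summarize what already happened
average_sales = sales.mean()
print("Average monthly sales:", average_sales)

# Predictive: fit a straight line and extrapolate to month 7
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * 7 + intercept
print("Forecast for month 7:", round(forecast, 1))
```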

Phase 6: Deployment

 Once the model is built, it is ready to be deployed in the real world. Deployment can happen offline, on the web, in the cloud, or in an Android or iOS app.

 The data science project is then monitored and maintained so that it keeps working in the long run. If performance degrades, the relevant changes can be made as part of maintenance.
Data Scientist
Python

 Python is an open-source, interpreted, high-level language with good support for object-oriented programming.

 It is one of the languages most widely used by data scientists for data science projects and applications.

 It is well suited to data analysis, data visualization, and machine learning tasks.
Download

 The latest version of Python can be downloaded from the link below:

 https://www.python.org/downloads/
 Install
 Download and install Anaconda
 Launch Anaconda
 Launch Jupyter Notebook
Anaconda

 The world’s most popular

open-source Python
distribution platform
 https://repo.anaconda.com/archive/Anaconda3-2022.05-Windows-x86_64.exe
Our repository features over 8,000 open-source data
science and machine learning packages,
Anaconda-built and compiled for all major operating
systems and architectures.
Jupyter Notebook

 The Jupyter Notebook is an open-source web application that allows

you to create and share documents that contain live code,
equations, visualizations, and narrative text.

Uses include:

1. data cleaning and transformation,
2. numerical simulation,
3. statistical modeling,
4. data visualization,
5. machine learning.
Python Programming Basics

1. print("Welcome to Data Science Workshop")
2. type(3)
3. type(3.14)
4. pi = 3.14
   type(pi)
Add two numbers
# Python3 program to add two numbers

num1 = 15
num2 = 12

# Adding the two numbers
# (named total rather than sum to avoid shadowing the built-in sum())

total = num1 + num2

# printing values
print("Sum of {0} and {1} is {2}".format(num1, num2, total))
Comments

 In Python, a single-line comment begins with a hash (#) symbol

followed by the comment. For example:

 # This is a single line comment in Python

Variable in Python

 message = 'Hello World!'

 print(message)

 message = 'Good Bye!'

 print(message)
String

 A string is a series of characters. In Python, anything inside quotes is

a string. And you can use either single or double quotes. For
example:

 message = 'This is a string in Python'

 message = "This is also a string"
f-strings

 name = 'John'
 message = f'Hi {name}'
 print(message)
Concatenating Python strings

greeting = 'Good ' 'Morning!'

print(greeting)

greeting = 'Good '

time = 'Afternoon'

greeting = greeting + time + '!'

print(greeting)
Length of the string

 text = "Python String"   # named text to avoid shadowing the built-in str

 text_len = len(text)
 print(text_len)
Integers
 Integers are whole numbers such as -1, 0, 1, 2, and 3, and
they have type int.
 >>> 20 + 10
 30
 >>> 20 - 10
 10
 >>> 20 * 10
 200
 >>> 20 / 10
 2.0
Calculate Exponents

 To calculate exponents, you use two multiplication symbols (**). For

example:
 >>> 3**3
 27
Booleans

 >>> 10 == 10
 True
 >>> 10 == 11
 False
 >>> "jack" == "jack"
 True
 >>> "jack" == "jake"
 False
Inequality

 >>> 10 != 10
 False
 >>> 10 != 11
 True
 >>> "jack" != "jack"
 False
 >>> "jack" != "jake"
 True
Dictionaries

>>> words={'apple':'red','lemon':'yellow'}
>>> words
{'apple': 'red', 'lemon': 'yellow'}
>>> words['apple']
'red'
>>> words['lemon']
'yellow'
Function

# A simple Python function

def fun():
    print("Welcome to Data Science Workshop")

# Driver code to call a function

fun()
Adding Two Numbers
def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    num3 = num1 + num2

    return num3

# Driver code
num1, num2 = 5, 15
ans = add(num1, num2)
print(f"The addition of {num1} and {num2} results {ans}.")
Pandas with Python

 Pandas is an open-source Python Library used for high-performance

data manipulation and data analysis using its powerful data
structures.
 Python with pandas is in use in a variety of academic and
commercial domains, including Finance, Economics, Statistics,
Advertising, Web Analytics, and more.
Let’s get started

 Import NumPy and load pandas into your namespace:

 import numpy as np
 import pandas as pd
Series

 Series is a one-dimensional labeled array capable of holding any data type (integers,

strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred

to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

 Here, data can be many different things:

 a Python dict

 an ndarray

 a scalar value (like 5)
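For the ndarray and scalar inputs, a minimal sketch (the values and index labels are made up for illustration):

```python
import numpy as np
import pandas as pd

# From an ndarray: the index must have the same length as the data
from_array = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_array)
print(from_scalar)
```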

Series can be instantiated from dicts:

d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

Output

b 1
a 0
c 2
dtype: int64

A Series can also be built from a list:

import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a)

print(myvar)
Pass index

d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)

Output

a 0.0
b 1.0
c 2.0
dtype: float64

pd.Series(d, index=["b", "c", "d", "a"])

Output

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

NaN (not a number) is the standard missing data marker used in pandas.
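Once a Series contains NaN, the usual next step is to detect or handle it. A minimal sketch, reusing the same dict and index as above, with three pandas methods that exist for this purpose (filling with 0.0 is an arbitrary choice for illustration):

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(d, index=["b", "c", "d", "a"])

# isna() flags the labels that had no matching key in the dict
missing = s.isna()
print(missing)

# dropna() removes missing entries; fillna() replaces them
print(s.dropna())
filled = s.fillna(0.0)
print(filled)
```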
Pandas Data Frames

A pandas DataFrame is a two-dimensional
data structure, like a two-dimensional
array or a table with rows and columns.
A simple Pandas Data Frame

 import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:

df = pd.DataFrame(data)

print(df)
Locate Row

#refer to the row index:

print(df.loc[0])

#use a list of indexes:

print(df.loc[[0, 1]])
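loc selects by index label. In the example above the labels happen to coincide with row positions because the default index is 0, 1, 2. The sketch below uses custom labels (day1 to day3, made up for illustration) to make the difference visible, and also shows iloc, the position-based counterpart.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
# Custom labels (hypothetical) make the label/position difference visible
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# loc selects by index label
by_label = df.loc["day2"]

# iloc selects by integer position (0-based)
by_position = df.iloc[1]

print(by_label)
print(by_label.equals(by_position))
```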
Creating a dataframe using List
# import pandas as pd
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list

df = pd.DataFrame(lst)
print(df)
Creating DataFrame from dict of
ndarray/lists
import pandas as pd

# initialise data of lists.

data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}

# Create DataFrame

df = pd.DataFrame(data)

# Print the output.

print(df)
Dealing with Rows and Columns

 A Data frame is a two-dimensional data structure, i.e., data is

aligned in a tabular fashion in rows and columns. We can perform
basic operations on rows/columns like selecting, deleting, adding,
and renaming.
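The four operations named above can be sketched on columns as follows. The data and the new column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Name": ["Tom", "Nick", "Krish"],
    "Age": [20, 21, 19],
})

# Selecting: a single column comes back as a Series
ages = df["Age"]

# Adding: assign a list with one value per row
df["City"] = ["Delhi", "Pune", "Chennai"]

# Renaming: rename() returns a new DataFrame by default
df = df.rename(columns={"Age": "AgeYears"})

# Deleting: drop() also returns a new DataFrame
df = df.drop(columns=["City"])

print(df)
```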
Column Selection
# Import pandas package
import pandas as pd

# Define a dictionary containing employee data

data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame

df = pd.DataFrame(data)

# select two columns

print(df[['Name', 'Qualification']])
Row Selection
# importing pandas package
import pandas as pd

# making data frame from csv file

data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving row by loc method

first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)

Row Addition
# importing pandas module
import pandas as pd

# making data frame; Name stays a regular column so it lines up with new_row

df = pd.read_csv("nba.csv")

df.head(10)

# the new row as a one-row DataFrame
new_row = pd.DataFrame({'Name':'Geeks', 'Team':'Boston', 'Number':3,
                        'Position':'PG', 'Age':33, 'Height':'6-2',
                        'Weight':189, 'College':'MIT', 'Salary':99999},
                       index=[0])

# concatenate both dataframes and renumber the rows
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head(5)
Working with text data

There are two ways to store text data in pandas:

 object-dtype NumPy array.

 StringDtype extension type.

Working with text data

pd.Series(["a", "b", "c"])

Output
0 a
1 b
2 c
dtype: object

pd.Series(["a", "b", "c"], dtype="string")
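Text columns are usually processed through the .str accessor, which applies string methods element by element. A minimal sketch with made-up values, including one missing entry:

```python
import pandas as pd

# Hypothetical values; pd.NA marks a missing string
names = pd.Series(["alice", "Bob", pd.NA], dtype="string")

# Methods under .str apply element-wise and skip missing values
upper = names.str.upper()
lengths = names.str.len()

print(upper)
print(lengths)
```

The missing entry stays missing in both results instead of raising an error.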
Chart Visualisation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000),
index=pd.date_range("1/1/2000", periods=1000))

ts = ts.cumsum()

ts.plot();
Plot

On DataFrame, plot() is a convenience to plot all of the columns with

labels:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=list("ABCD"))

df = df.cumsum()

plt.figure();

df.plot();
Bar Plot

 plt.figure();

 df.iloc[5].plot(kind="bar");
Scatter Matrix

from pandas.plotting import scatter_matrix

df = pd.DataFrame(np.random.randn(1000, 4),
                  columns=["a", "b", "c", "d"])

scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="kde");
Density Plot

ser = pd.Series(np.random.randn(1000))
ser.plot.kde();
Numpy
Numpy Operations
Size and Shape
Reshape and Slicing
Minimum, Maximum and Sum
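The slide titles above name several array operations. A minimal sketch of what they refer to, using an array invented for illustration:

```python
import numpy as np

arr = np.arange(1, 13)  # the integers 1 through 12

# Size and shape
print(arr.size)    # total number of elements
print(arr.shape)   # length along each dimension

# Reshape into 3 rows and 4 columns, then slice the first two rows
grid = arr.reshape(3, 4)
top = grid[:2, :]

# Minimum, maximum and sum over the whole array
print(grid.min(), grid.max(), grid.sum())

# Sum down each column (axis=0)
column_sums = grid.sum(axis=0)
print(column_sums)
```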
Basic function
import numpy

arr = numpy.array([1, 2, 3, 4, 5])

print(arr)

Numpy as np

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
Checking NumPy Version

import numpy as np

print(np.__version__)
Create a NumPy ndarray Object

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
print(type(arr))

type(): This built-in Python function tells us the type of the object passed to it. In the code above, it shows that arr is of type numpy.ndarray.

import numpy as np

arr = np.array((1, 2, 3, 4, 5))

print(arr)
Dimensions in Arrays
0-D Arrays

import numpy as np
arr = np.array(42)
print(arr)

1-D Arrays

import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)

2-D Arrays

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

3-D Arrays

import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Access Array Elements

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[0])

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[1])

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[2] + arr[3])
NumPy Array Slicing

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[4:])

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[:4])
Checking the Data Type of an
Array
import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.dtype)

import numpy as np

arr = np.array(['apple', 'banana', 'cherry'])

print(arr.dtype)
Shape of an Array

import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)
NumPy Array Iterating

Iterating means going through elements one by one.

import numpy as np

arr = np.array([1, 2, 3])

for x in arr:
    print(x)

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
    print(x)
Joining NumPy Arrays

Joining means putting contents of two or more arrays in a single array.

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
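For arrays with more than one dimension, concatenate takes an axis argument that decides the direction of the join. A minimal sketch with two made-up 2-D arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 (the default) appends b's rows below a's rows
rows_joined = np.concatenate((a, b), axis=0)

# axis=1 places b's columns to the right of a's columns
cols_joined = np.concatenate((a, b), axis=1)

print(rows_joined)
print(cols_joined)
```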
Splitting NumPy Arrays
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 4)

print(newarr)

If the array has fewer elements than required, it will adjust
from the end accordingly.
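To see what this adjustment means in practice, the sketch below prints the size of each part for the 4-way split and contrasts array_split with np.split, which insists on an even division.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# 6 elements into 4 parts: the first parts receive the extra elements
parts = np.array_split(arr, 4)
sizes = [len(p) for p in parts]
print(sizes)

# np.split, by contrast, requires an even division and raises otherwise
try:
    np.split(arr, 4)
except ValueError as err:
    print("np.split failed:", err)
```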
THANK YOU FOR LISTENING!!!!

Any Queries, please ask…!

7299119900

www.whyglobalservices.com
www.whytap.in
www.abhisoverseas.com
