fds lab manual

The document outlines a series of exercises focused on installing and using Python packages for data science, including NumPy, Pandas, and Jupyter. It provides step-by-step instructions for setting up the environment, performing operations with NumPy arrays, working with Pandas data frames, and conducting descriptive analysis on datasets like Iris and Diabetes. The exercises culminate in practical applications of data analysis techniques, including univariate analysis and data visualization.


Ex no: 1 Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages

1. Install Anaconda
Anaconda puts nearly all of the tools that we're going to need into a neat little package: the Python core
language, an improved REPL environment called Jupyter, numeric computing libraries (NumPy, pandas),
plotting libraries (seaborn, matplotlib), and statistics and machine learning libraries (SciPy, scikit-learn,
statsmodels). We'll use Anaconda's installer to handle setting up the environment that we'll work in.
To keep the size of the download small, we'll actually use a minimal distribution called
Miniconda.

 Miniconda installer packages:


o Windows
o Mac OSX
 Once this downloads, you can follow the instructions for installing on your operating system at this
link.
 Note: It's easiest just to use Anaconda's defaults in the installer. You don't have to change anything
unless you're sure you want something different.

2. Download and install common packages for data science in Python

 Click the link below to download an environment file. This file contains a list of common packages and
libraries for doing data science in Python. Remember where you save the file environment.yml.
You'll need that path shortly. You don't need to open that file right now.
o Windows
o OSX
 Once the download finishes, open the command line by doing the following:
o Windows - Hit "Start" and then type "Command Prompt" and use that terminal.
o OSX - Type Cmd+Space and then enter Terminal in the search box to open the terminal.
 Run the following commands, which will install the package and put you in the tutorial environment.
o conda env create -f <PATH_TO_ENVIRONMENT.YML> - You'll need to replace
<PATH_TO_ENVIRONMENT.YML> with the actual path where the file was downloaded.
For OSX, that's often
(/Users/<USERNAME>/Downloads/environment.yml). For Windows, it is usually
C:/Users/<USERNAME>/Downloads/environment.yml. You'll have to replace
<USERNAME> with your username on your machine.

 That will download the set of packages that are commonly used for data science in Python. When it
finishes, you can activate the environment with the following command (a quick sanity check is sketched after this list):
o Windows - activate tutorial
o OSX - source activate tutorial
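
Once the environment is active, it's worth confirming that the core packages import cleanly. A minimal sanity-check sketch (it assumes the environment file installed NumPy, SciPy, pandas, statsmodels and matplotlib; run python from the same prompt):

import numpy, scipy, pandas, statsmodels, matplotlib
# If any import fails here, the environment was not created or activated correctly
for pkg in (numpy, scipy, pandas, statsmodels, matplotlib):
    print(pkg.__name__, pkg.__version__)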

3. Run Jupyter notebook!


In this step, we'll make sure everything is working by running the Jupyter Notebook. Jupyter
Notebook is a tool for doing interactive data science work in your browser.

 In your command prompt with the tutorial environment activated (you'll be able to tell because
your command prompt will say (tutorial) at the start of it), type the following command: jupyter notebook
 A browser window will open, showing the Jupyter environment. By default, you will be in a file
browser view.
 In the file browser, find where you have a Jupyter notebook. If you don't have materials for a
course or tutorial that you have downloaded, you can download this fun Jupyter notebook and then
open it in the file browser.
 Click on one of the notebook (*.ipynb) files to get started!

4. To stop Jupyter notebook:

 Hit Ctrl+c to stop the Jupyter notebook server running on your machine. (Make sure to use Ctrl+s
in the notebook to save it first!)

5. To leave the tutorial environment (with all our fun packages) and go back to your normal
environment:

 Windows - deactivate tutorial


 OSX - source deactivate tutorial

Result:
Thus the above process to install and explore NumPy, SciPy, Jupyter, Statsmodels and Pandas was executed and verified successfully.
Ex no: 2 Working with Numpy arrays

AIM

Working with Numpy arrays

ALGORITHM
Step1:Start
Step2:Import numpy module
Step3:Print the basic characteristics and operations of array
Step4:Stop

PROGRAM

import numpy as np
# Creating an array object
arr=np.array([[1,2,3],[4,2,5]])
#Printing type of arr object
print("Array is of type: ", type(arr))
#Printing array dimensions(axes)
print("No.of dimensions:",arr.ndim)
#Printing shape of array
print("Shape of array:",arr.shape)
#Printing size (total number of elements) of array
print("Size of array:",arr.size)
#Printing type of elements in array
print("Array stores elements of type:",arr.dtype)

OUTPUT
Array is of type: <class 'numpy.ndarray'>
No.of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int32
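
Note: the dtype line above is platform-dependent; NumPy's default integer is int32 on Windows and usually int64 on Linux/macOS. A small sketch that pins the element type at creation, so the output is identical everywhere:

import numpy as np
# Passing dtype explicitly overrides the platform default integer type
arr64=np.array([[1,2,3],[4,2,5]],dtype=np.int64)
print("Array stores elements of type:",arr64.dtype) # int64 on every platform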

To Perform Array Slicing


import numpy as np
a=np.array([[1,2,3],[3,4,5],[4,5,6]])
print('Our array is:')
print(a)
#this returns array of items in the second column
print('The items in the second column are:')
print(a[...,1])
print('\n')
print('The items in the second row are:')
print(a[1,...])
print('\n')
print('The items column 1 onwards are:')
print(a[...,1:])

Output:
Our array is:
[[1 2 3]
[3 4 5]
[4 5 6]]
The items in the second column are:
[2 4 5]
The items in the second row are:
[3 4 5]
The items column 1 onwards are:
[[2 3]
[4 5]
[5 6]]

Write a NumPy program to convert an array to a float type.


Program:
import numpy as np
a = [1, 2, 3, 4]
print("Original array")
print(a)
x = np.asfarray(a)
print("Array converted to a float type:")
print(x)

output:
Original array
[1, 2, 3, 4]
Array converted to a float type:
[ 1. 2. 3. 4.]
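
Note: np.asfarray was deprecated in NumPy 1.25 and removed in NumPy 2.0. A sketch of the equivalent conversion that works on all NumPy versions:

import numpy as np
a = [1, 2, 3, 4]
# asarray with an explicit float dtype is the portable replacement for asfarray
x = np.asarray(a, dtype=np.float64)
print(x)  # [1. 2. 3. 4.]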

Write a NumPy program to create an empty and a full array

Program:
import numpy as np
# Create an empty array
x = np.empty((3,4))
print(x)
# Create a full array
y = np.full((3,3),6)
print(y)

output:
[[ 6.93643969e-310 8.76783124e-317 6.93643881e-310 6.79038654e-313]
[ 2.22809558e-312 2.14321575e-312 2.35541533e-312 2.42092166e-322]
[ 7.46824097e-317 9.08479214e-317 2.46151512e-312 2.41907520e-312]]
[[6 6 6]
[6 6 6]
[6 6 6]]
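
The values in the empty array above are whatever bytes happened to be in memory, so they differ on every run: np.empty only allocates, it does not initialize. When reproducible contents are wanted, a zeros-based sketch:

import numpy as np
# zeros allocates the same (3,4) shape but fills it deterministically
x = np.zeros((3,4))
print(x)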

Write a NumPy program to convert a list and tuple into arrays


Program:
import numpy as np
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
print("List to array: ")
print(np.asarray(my_list))
my_tuple = ([8, 4, 6], [1, 2, 3])
print("Tuple to array: ")
print(np.asarray(my_tuple))

output:

List to array:
[1 2 3 4 5 6 7 8]
Tuple to array:
[[8 4 6]
[1 2 3]]

Write a NumPy program to find the real and imaginary parts of an array of complex
numbers

Program:

import numpy as np
x = np.sqrt([1+0j])
y = np.sqrt([0+1j])
print("Original array:x ",x)
print("Original array:y ",y)
print("Real part of the array:")
print(x.real)
print(y.real)
print("Imaginary part of the array:")
print(x.imag)
print(y.imag)

output:

Original array:x [ 1.+0.j]


Original array:y [ 0.70710678+0.70710678j]
Real part of the array:
[ 1.]
[ 0.70710678]

Imaginary part of the array:


[ 0.]
[ 0.70710678]
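
A related sketch for the same complex values, showing the magnitude and phase (an illustrative extension, not part of the exercise):

import numpy as np
y = np.sqrt([0+1j])
# abs gives the modulus, angle gives the phase in radians
print("Magnitude:", np.abs(y))
print("Phase:", np.angle(y))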

Result:
Thus the working with Numpy arrays was successfully completed.
Ex no: 3 WORKING WITH PANDAS DATA FRAMES

Aim:

To work with Pandas data frames

ALGORITHM

Step1:Start
Step2: import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4:Perform the basic operations and print the result
Step5:Stop

PROGRAM
import numpy as np
import pandas as pd
data=np.array([['','Col1','Col2'],['Row1',1,2], ['Row2',3,4]])
print(pd.DataFrame(data=data[1:,1:],index = data[1:,0],columns= data[0,1:]))
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))
my_dict={1:['1','3'],2:['1','2'],3:['2','4']}
print(pd.DataFrame(my_dict))
my_df=pd.DataFrame(data=[4,5,6,7],index=range(0,4),columns=['A'])
print(pd.DataFrame(my_df))
my_series=pd.Series({"UnitedKingdom":"London","India":"NewDelhi","UnitedStates":"Washington",
"Belgium":"Brussels"})
print(pd.DataFrame(my_series))
df=pd.DataFrame(np.array([[1,2,3],[4,5,6]]))
print(df.shape)
print(len(df.index))

Output:
     Col1 Col2
Row1    1    2
Row2    3    4
   0  1  2
0  1  2  3
1  4  5  6
   1  2  3
0  1  1  2
1  3  2  4
   A
0  4
1  5
2  6
3  7
                        0
UnitedKingdom      London
India            NewDelhi
UnitedStates   Washington
Belgium          Brussels
(2, 3)
2

Write a Pandas program to select the rows where the score is missing, i.e. is NaN
Sample DataFrame:
Sample Python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Program:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])

output:

Rows where score is missing:


attempts name qualify score
d 3 James no NaN
h 1 Laura no NaN
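
The complementary selection, rows where the score is present, follows the same pattern with notnull() (an illustrative extension, not part of the exercise):

# Uses the df built in the program above; notnull() is the complement of isnull()
print("Rows where score is present:")
print(df[df['score'].notnull()])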

Result:

Thus the working with Pandas data frames was successfully completed.
Ex no: 4 DESCRIPTIVE ANALYSES ON THE IRIS DATASET

Aim:
To read data from text files, excel and the web and exploring various Commands for doing
descriptive analysis on iris dataset.

Algorithm:
Step1: Start
Step2: Import pandas
Step3: Download the iris.csv dataset from https://datahub.io/machine-learning/iris
Step4: Perform basic operations
Step5: Print the output
Step6: Stop

Program:
import pandas as pd
df=pd.read_csv("iris.csv")
df.head()
df.shape
df.info()
df.describe()
df.isnull().sum()
print(df.value_counts("Species"))

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #  Column        Non-Null Count  Dtype
--- ------        --------------  -----
 0  Unnamed: 0    150 non-null    int64
 1  Sepal.Length  150 non-null    float64
 2  Sepal.Width   150 non-null    float64
 3  Petal.Length  150 non-null    float64
 4  Petal.Width   150 non-null    float64
 5  Species       150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64
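
The aim also mentions reading from Excel and the web, which the program above doesn't show. A minimal sketch, assuming a local iris.xlsx copy exists and that you substitute the direct CSV link from the datahub page given in the algorithm:

import pandas as pd
# From a text/CSV file (a sep= argument handles other delimiters)
df_txt = pd.read_csv("iris.csv")
# From Excel (assumes iris.xlsx exists locally; requires the openpyxl package)
df_xl = pd.read_excel("iris.xlsx")
# From the web: read_csv accepts a URL directly; paste the direct .csv link
# from https://datahub.io/machine-learning/iris here
# df_web = pd.read_csv("<direct-csv-url>")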

Result:
Thus the working with iris dataset was successfully completed.
Ex no: 5 a) WORKING WITH DIABETES DATASET FROM UCI AND PIMA
INDIANS DATA SET
Univariate analysis:

Aim:
To work with univariate analysis to find mean, median, variance, standard deviation, skewness
and kurtosis using the diabetes dataset

Algorithm:

Step1:start
Step2: download diabetes dataset from https://datahub.io/machine-learning/diabetes
Step3: perform the basic operations; calculate mean, median, standard deviation, skewness and
kurtosis, plot a histogram and print the result
Step4:stop

Program:
import pandas as pd
import numpy as np
df=pd.read_csv("diabetes.csv")
print('MEAN:\n',df.mean())
print('MEDIAN:\n',df.median())
print('MODE:\n',df.mode())
print('STANDARD DEVIATION:\n',df.std())
print('VARIANCE:\n',df.var())
print('SKEWNESS:\n',df.skew())
print('KURTOSIS:\n',df.kurtosis())
df.describe()

output:
MEAN:
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
MEDIAN:
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
MODE:
Pregnancies Glucose BloodPressure ...
DiabetesPedigreeFunction Age Outcome
0 1.0 99 70.0 ... 0.254 22.0 0.0
1 NaN 100 NaN ... 0.258 NaN NaN

#HISTOGRAM
Program:
import pandas as pd
import numpy as np
import statistics as st
import matplotlib.pyplot as plt
df=pd.read_csv("diabetes.csv")
df=df.copy(deep=True)
df=df.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
df.hist(bins=40)
plt.show()

output:

Result:
Thus the working with diabetes dataset was successfully completed.
Ex no: 5 b) Use the diabetes dataset from UCI: Bivariate analysis: linear and logistic
regression modeling

Aim:
To work with bivariate analysis to find linear and logistic regression using diabetes dataset

Program:
#LINEAR REGRESSION MODEL

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error
diabetes=datasets.load_diabetes()
diabetes.keys()
df=pd.DataFrame(diabetes["data"],columns=diabetes["feature_names"])
x=df
y=diabetes["target"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)
model=linear_model.LinearRegression()
model.fit(x_train,y_train)
y_pre=model.predict(x_test)
from sklearn.model_selection import cross_val_score
scores=cross_val_score(model,x,y,scoring="neg_mean_squared_error",cv=10)
rmse_scores=np.sqrt(-scores).mean()
print("cross validation:",rmse_scores)
from sklearn.metrics import r2_score
print('r^2:',r2_score(y_test,y_pre))
mse=mean_squared_error(y_test,y_pre)
rmse=np.sqrt(mse)
print('RMSE:',rmse)
print("Weight:",model.coef_)
print("\nIntercept",model.intercept_)
output:

cross validation: 54.40461553640237
r^2: 0.45767674177195616
RMSE: 58.009275047551974
Weight: [  -8.02566358 -308.83945001  583.63074324  299.9976184  -360.68940198
   95.14235214  -93.03306818  118.15005596  662.12887711   26.07401648]

Intercept 153.72029738615726

#LOGISTIC REGRESSION MODEL

Program:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_squared_error
diabetes=datasets.load_diabetes()
diabetes.keys()
df=pd.DataFrame(diabetes['data'],columns=diabetes['feature_names'])
x=df
y=diabetes['target']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)
y_pre=model.predict(x_test)
from sklearn.metrics import r2_score
print('r^2:',r2_score(y_test,y_pre))
mse=mean_squared_error(y_test,y_pre)
rmse=np.sqrt(mse)
print('RMSE:',rmse)

output:
r^2: -0.44401265478624397
RMSE: 94.65723681369009
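
A note on this result: load_diabetes() has a continuous target, so LogisticRegression treats every distinct value as a separate class, which is why r^2 comes out negative. A sketch of a more meaningful logistic fit, binarizing the target around its median (an illustrative variant, not part of the original exercise):

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
diabetes=datasets.load_diabetes()
x=diabetes["data"]
# 1 = disease progression above the median, 0 = below: a genuine two-class target
y=(diabetes["target"]>np.median(diabetes["target"])).astype(int)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)
model=LogisticRegression(max_iter=1000)
model.fit(x_train,y_train)
print("accuracy:",accuracy_score(y_test,model.predict(x_test)))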

Result:
Thus working with bivariate analysis to find linear and logistic regression using diabetes dataset was
executed successfully.
Ex no: 6 APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS USING UCI DATASET

Aim:
To work with various plotting functions using a UCI dataset

Algorithm:
STEP1: start
STEP2: read the UCI dataset
STEP3: compute heatmap, boxplot, normal curve, density and contour plots, correlation and scatter
plot, histogram and three-dimensional plotting using the UCI dataset
STEP4: print the result
STEP5: stop

#BOXPLOT
Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.boxplot()
plt.xticks(rotation=90)
plt.show()

output
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.plot(kind='box',subplots=True,layout=(5,3),figsize=(12,12))
plt.show()

OUTPUT:

#HEATMAP
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
sns.heatmap(df.corr(numeric_only=True)) #numeric_only skips the text columns (needs pandas>=1.5)

output:

#HISTOGRAMS
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.hist(figsize=(12,12),layout=(5,3))

Output:
#HEATMAP
Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='terrain')

output:

#NORMAL CURVE
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
x=df['Age']
sns.displot(x,bins=10,kde=True) #kde=True overlays the smooth curve on the histogram
plt.show()

output:

Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
f,ax=plt.subplots(figsize=(10,6))
x=df['Age']
x=pd.Series(x,name="Age variable")
ax=sns.kdeplot(x,fill=True,color='r') #'shade' was removed in newer seaborn; fill is the replacement
plt.show()

output:
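
The algorithm also lists correlation/scatter and contour plots, which don't appear above. A minimal sketch using the Age and Chol columns already used elsewhere in this exercise:

#SCATTER AND CONTOUR PLOT
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
# Scatter plot of cholesterol against age
sns.scatterplot(data=df,x='Age',y='Chol')
plt.show()
# Bivariate density contours for the same pair
sns.kdeplot(data=df,x='Age',y='Chol')
plt.show()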

#THREE DIMENSIONAL PLOTTING


Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
fig=plt.figure()
ax=plt.axes(projection='3d')
x=df['Age']
x=pd.Series(x,name="Age variable")
y=df["Sex"]
y=pd.Series(y,name="Sex variable")
z=df["Chol"]
z=pd.Series(z,name="Cholesterol variable")
ax.plot3D(x,y,z,'green')
ax.set_title('3D line plot Heart disease dataset')
plt.show()
output:
Ex no: 7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

Aim:
To work and visualize geographic data using basemap
Algorithm:

Step 1: Import Libraries


import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np

Step 2: Prepare Geographic Data

Example with some sample lat/lon points:

# Example data
lats = [34.05, 40.71, 37.77] # Los Angeles, New York, San Francisco
lons = [-118.25, -74.00, -122.42]
labels = ['LA', 'NYC', 'SF']

Step 3: Initialize Basemap

Choose the appropriate projection and map boundaries.

m = Basemap(projection='merc',
llcrnrlat=20, urcrnrlat=50,
llcrnrlon=-130, urcrnrlon=-60,
resolution='i')

Step 4: Draw Map Features


m.drawcoastlines()
m.drawcountries()
m.drawstates()
m.drawmapboundary(fill_color='lightblue')
m.fillcontinents(color='lightgreen', lake_color='lightblue')
m.drawparallels(np.arange(-90., 91., 10.), labels=[1,0,0,0])
m.drawmeridians(np.arange(-180., 181., 10.), labels=[0,0,0,1])

Step 5: Convert Lat/Lon to Map Coordinates


x, y = m(lons, lats)
Step 6: Plot the Data

m.scatter(x, y, marker='o', color='red', zorder=5)

# Optionally annotate the points


for i, label in enumerate(labels):
    plt.text(x[i], y[i], label, fontsize=12, ha='left', va='bottom')

Step 7: Show the Map


plt.title('Sample Geographic Data Visualization')
plt.show()
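
Note: Basemap is not included in the environment set up in Exercise 1; it usually has to be installed separately, e.g. with conda install -c conda-forge basemap. If Basemap proves hard to install, Cartopy is the commonly recommended successor for this kind of map plotting.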

Output:
