fds lab manual[1]
1 Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages
1. Install Anaconda
Anaconda puts nearly all of the tools that we're going to need into a neat little package: the Python core
language, an improved REPL environment called Jupyter, numeric computing libraries (NumPy, pandas),
plotting libraries (seaborn, matplotlib), and statistics and machine learning libraries (SciPy, scikit-learn,
statsmodels). We'll use Anaconda's installer to handle setting up the environment that we'll work in.
In order to keep the size of the download small, we actually use a minimum set of packages called
Miniconda.
Click the link below to download an environment file. This file contains a list of common packages and
libraries for doing data science in Python. Remember where you save the file environment.yml.
You'll need that path shortly. You don't need to open that file right now.
o Windows
o OSX
Once the download finishes, open the command line by doing the following:
o Windows - Hit "Start" and then type "Command Prompt" and use that terminal.
o OSX - Type Cmd+Space and then enter Terminal in the search box to open the terminal.
Run the following commands, which will install the package and put you in the tutorial environment.
o conda env create -f <PATH_TO_ENVIRONMENT.YML> - You'll need to replace
<PATH_TO_ENVIRONMENT.YML> with the actual path where the file was downloaded.
For OSX, that's often
(/Users/<USERNAME>/Downloads/environment.yml). For Windows, it is usually
C:/Users/<USERNAME>/Downloads/environment.yml. You'll have to replace
<USERNAME> with your username on your machine.
That will download a set of packages that are commonly used for data science in Python. When it
finishes, you can activate the environment with the following command:
o Windows - activate tutorial
o OSX - source activate tutorial
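Once the environment is active, a quick way to confirm that the packages are importable is to print their versions from Python. The following is a minimal sketch, not part of the original tutorial; the package list and the helper name installed_versions are assumptions chosen to match the libraries named above.

```python
import importlib

# Import names of the libraries mentioned in this exercise (sklearn = scikit-learn).
PACKAGES = ["numpy", "scipy", "pandas", "statsmodels", "matplotlib", "seaborn", "sklearn"]

def installed_versions(names):
    """Map each package name to its version string, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions

versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name:12s} {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED suggests that the environment was not created or activated as expected.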
Hit Ctrl+c to stop the Jupyter notebook server running on your machine. (Make sure to use Ctrl+s
in the notebook to save it first!)
5. To leave the tutorial environment (with all our fun packages) and go back to your normal
environment:
Result:
Thus the above process to Install and Explore Numpy was executed and verified successfully.
Ex no: 2 Working with Numpy arrays
AIM
ALGORITHM
Step1:Start
Step2:Import numpy module
Step3:Print the basic characteristics and operations of array
Step4:Stop
PROGRAM
import numpy as np
# Creating array object
arr=np.array([[1,2,3],[4,2,5]])
#Printing type of arr object
print("Array is of type: ", type(arr))
#Printing array dimensions(axes)
print("No.of dimensions:",arr.ndim)
#Printing shape of array
print("Shape of array:",arr.shape)
print("Size of array:",arr.size)
#Printing type of elements in array
print("Array stores elements of type:",arr.dtype)
OUTPUT
Array is of type:  <class 'numpy.ndarray'>
No.of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int32
Output:
Our array is:
[[1 2 3]
 [3 4 5]
 [4 5 6]]
The items in the second column are:
[2 4 5]
The items in the second row are:
[3 4 5]
The items column 1 onwards are:
[[2 3]
 [4 5]
 [5 6]]
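The program for this output is not included in the manual. The following is a minimal sketch that reproduces it, assuming the array was sliced with the Ellipsis (...) notation; the variable names are illustrative.

```python
import numpy as np

# Same 3x3 values as in the output listing above.
a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print("Our array is:")
print(a)

# Ellipsis (...) stands for "all remaining axes".
second_column = a[..., 1]    # every row, column index 1
second_row = a[1, ...]       # row index 1, every column
from_column_1 = a[..., 1:]   # every row, columns 1 onwards

print("The items in the second column are:")
print(second_column)
print("The items in the second row are:")
print(second_row)
print("The items column 1 onwards are:")
print(from_column_1)
```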
output:
Original array
[1, 2, 3, 4]
Array converted to a float type:
[ 1. 2. 3. 4.]
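The program for this conversion is also missing. One possible sketch, assuming the list shown above is converted with np.asarray and an explicit float dtype:

```python
import numpy as np

original = [1, 2, 3, 4]
print("Original array")
print(original)

# dtype=float asks NumPy for a float64 array built from the integer list.
converted = np.asarray(original, dtype=float)
print("Array converted to a float type:")
print(converted)
```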
Program:
import numpy as np
# Create an empty array (entries are uninitialized memory, so the values printed below vary between runs)
x = np.empty((3,4))
print(x)
# Create a full array
y = np.full((3,3),6)
print(y)
output:
[[ 6.93643969e-310  8.76783124e-317  6.93643881e-310  6.79038654e-313]
 [ 2.22809558e-312  2.14321575e-312  2.35541533e-312  2.42092166e-322]
 [ 7.46824097e-317  9.08479214e-317  2.46151512e-312  2.41907520e-312]]
[[6 6 6]
[6 6 6]
[6 6 6]]
output:
List to array:
[1 2 3 4 5 6 7 8]
Tuple to array:
[[8 4 6]
 [1 2 3]]
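No program accompanies this output either. A minimal sketch that would print it, assuming the inputs were a flat list and a tuple of two lists (the names my_list and my_tuple are illustrative):

```python
import numpy as np

my_list = [1, 2, 3, 4, 5, 6, 7, 8]
my_tuple = ([8, 4, 6], [1, 2, 3])

# np.asarray accepts lists and tuples; nested sequences become 2-D arrays.
list_array = np.asarray(my_list)
tuple_array = np.asarray(my_tuple)

print("List to array:")
print(list_array)
print("Tuple to array:")
print(tuple_array)
```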
Write a NumPy program to find the real and imaginary parts of an array of complex
numbers
Program:
import numpy as np
x = np.sqrt([1+0j])
y = np.sqrt([0+1j])
print("Original array:x ",x)
print("Original array:y ",y)
print("Real part of the array:")
print(x.real)
print(y.real)
print("Imaginary part of the array:")
print(x.imag)
print(y.imag)
output:
Result:
Thus the working with Numpy arrays was successfully completed.
Ex no: 3 WORKING WITH PANDAS DATA FRAMES
Aim:
ALGORITHM
Step1:Start
Step2: import numpy and pandas module
Step3:Create a dataframe using the dictionary
Step4:Perform the basic operations and print the result
Step5:Stop
PROGRAM
import numpy as np
import pandas as pd
data=np.array([['','Col1','Col2'],['Row1',1,2], ['Row2',3,4]])
print(pd.DataFrame(data=data[1:,1:],index = data[1:,0],columns= data[0,1:]))
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))
my_dict={1:['1','3'],2:['1','2'],3:['2','4']}
print(pd.DataFrame(my_dict))
my_df=pd.DataFrame(data=[4,5,6,7],index=range(0,4),columns=['A'])
print(pd.DataFrame(my_df))
my_series=pd.Series({"United Kingdom":"London","India":"New Delhi","United States":"Washington",
"Belgium":"Brussels"})
print(pd.DataFrame(my_series))
df=pd.DataFrame(np.array([[1,2,3],[4,5,6]]))
print(df.shape)
print(len(df.index))
Output:
     Col1 Col2
Row1    1    2
Row2    3    4
   0  1  2
0  1  2  3
1  4  5  6
   1  2  3
0  1  1  2
1  3  2  4
   A
0  4
1  5
2  6
3  7
                         0
United Kingdom      London
India            New Delhi
United States   Washington
Belgium           Brussels
(2, 3)
2
Write a Pandas program to select the rows where the score is missing, i.e. is NaN
Sample DataFrame:
Sample Python dictionary data and list labels:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Program:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])
output:
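For this sample data the score is NaN for the rows labelled d and h (James and Laura). The following sketch is a cross-check on the selection, assuming the same dictionary and labels as the program above; it lists the selected labels and counts the missing scores.

```python
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                      'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)

# Boolean mask: True where the score is NaN.
mask = df['score'].isna()
missing = df[mask]

print(list(missing.index))      # labels of the rows with a missing score
print(list(missing['name']))    # names in those rows
print(int(mask.sum()))          # number of missing scores
```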
Result:
Thus the working with Pandas data frames was successfully completed.
Ex no: 4 DESCRIPTIVE ANALYSES ON THE IRIS DATASET
Aim:
To read data from text files, Excel and the web, and to explore various commands for
descriptive analysis on the iris dataset.
Algorithm:
Step1:Start
Step2: import pandas
Step3: download iris.csv dataset from
https://datahub.io/machine-learning/iris
Step4: Perform basic operations
Step5:Print the output
Step6:Stop
Program:
import pandas as pd
df=pd.read_csv("iris.csv")
df.head()
df.shape
df.info()
df.describe()
df.isnull().sum()
print(df.value_counts("Species"))
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0    150 non-null    int64
 1   Sepal.Length  150 non-null    float64
 2   Sepal.Width   150 non-null    float64
 3   Petal.Length  150 non-null    float64
 4   Petal.Width   150 non-null    float64
 5   Species       150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
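The aim also mentions text files, Excel and the web, while the program above reads only a CSV file. The following sketch shows how the same pandas readers could cover those sources. The two-row sample text, the URL and the workbook name are placeholders, not files supplied with this manual.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a tab-separated text file.
sample_text = (
    "Sepal.Length\tSepal.Width\tSpecies\n"
    "5.1\t3.5\tIris-setosa\n"
    "7.0\t3.2\tIris-versicolor\n"
)

# read_csv accepts any file-like object; sep selects the delimiter.
df_text = pd.read_csv(io.StringIO(sample_text), sep="\t")
print(df_text.shape)
print(df_text.dtypes)

# read_csv also accepts a URL string, and read_excel reads workbooks.
# Both calls are left commented out because the URL and file name are placeholders:
# df_web = pd.read_csv("https://example.com/iris.csv")
# df_xlsx = pd.read_excel("iris.xlsx")  # needs an Excel engine such as openpyxl
```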
Result:
Thus the working with iris dataset was successfully completed.
Ex no: 5 a) WORKING WITH DIABETES DATASET FROM UCI AND PIMA
INDIANS DATA SET
Univariate analysis:
Aim:
To perform univariate analysis (mean, median, mode, variance, standard deviation, skewness
and kurtosis) on the diabetes dataset
Algorithm:
Step1:start
Step2: download diabetes dataset from https://datahub.io/machine-learning/diabetes
Step3: perform the basic operations, calculate mean,median,standard deviation,skewness,histogram and
print the result
Step4:stop
Program:
import pandas as pd
import numpy as np
df=pd.read_csv("diabetes.csv")
print('MEAN:\n',df.mean())
print('MEDIAN:\n',df.median())
print('MODE:\n',df.mode())
print('STANDARD DEVIATION:\n',df.std())
print('VARIANCE:\n',df.var())
print('SKEWNESS:\n',df.skew())
print('KURTOSIS:\n',df.kurtosis())
df.describe()
output:
MEAN:
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
MEDIAN:
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
MODE:
Pregnancies Glucose BloodPressure ...
DiabetesPedigreeFunction Age Outcome
0 1.0 99 70.0 ... 0.254 22.0 0.0
1 NaN 100 NaN ... 0.258 NaN NaN
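To make clearer what the pandas statistics compute, the following sketch reproduces the variance and skewness by hand. The six sample values are hypothetical and are not taken from diabetes.csv. The points it illustrates are that pandas divides by n - 1 by default whereas NumPy divides by n, and that DataFrame.skew returns the bias-corrected skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only to illustrate the formulas.
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 14.0])
n = len(values)

# pandas: sample variance (ddof=1); NumPy default: population variance (ddof=0).
sample_var = values.var()
population_var = np.var(values.to_numpy())
print('pandas var:', sample_var)
print('numpy var :', population_var)

# Bias-corrected skewness: G1 = sqrt(n(n-1)) / (n-2) * m3 / m2**1.5,
# where m2 and m3 are the second and third central moments (divided by n).
deviations = values.to_numpy() - values.mean()
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
manual_skew = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
print('manual skew:', manual_skew)
print('pandas skew:', values.skew())
```

Passing ddof=1 to np.var or np.std makes the NumPy results agree with pandas.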
#histogram
Program:
import pandas as pd
import numpy as np
import statistics as st
import matplotlib.pyplot as plt
df=pd.read_csv("diabetes.csv")
df=df.copy(deep=True)
df=df.drop(['Outcome'],axis=1)
plt.rcParams['figure.figsize']=[40,40]
df.hist(bins=40)
output:
Result:
Thus the working with diabetes dataset was successfully completed.
Ex no: 5.b) Use the diabetes dataset from UCI : Bivariate analysis: linear and logistic
regression modeling
Aim:
To work with bivariate analysis to find linear and logistic regression using diabetes dataset
#LINEAR REGRESSION MODEL
Program:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_squared_error
diabetes=datasets.load_diabetes()
diabetes.keys()
df=pd.DataFrame(diabetes['data'],columns=diabetes['feature_names'])
x=df
y=diabetes['target']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
# LogisticRegression is a classifier: it treats every distinct target value as a
# separate class, so the r^2 and RMSE printed below are not those of a linear fit
model=LogisticRegression()
model.fit(x_train,y_train)
y_pre=model.predict(x_test)
from sklearn.metrics import r2_score
print('r^2:',r2_score(y_test,y_pre))
mse=mean_squared_error(y_test,y_pre)
rmse=np.sqrt(mse)
print('RMSE:',rmse)
output:
r^2: -0.44401265478624397
RMSE: 94.65723681369009
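The scores above come from fitting a classifier to a continuous target. As a hedged alternative, the sketch below fits LinearRegression to the continuous target and then fits LogisticRegression to a binary label. The binary label (target above the training-set median) is an assumption introduced only for illustration and is not part of the original exercise. No expected scores are shown because they depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
x, y = diabetes.data, diabetes.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101)

# Linear regression: predicts the continuous disease-progression score.
linear = LinearRegression().fit(x_train, y_train)
y_pred = linear.predict(x_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('r^2:', r2)
print('RMSE:', rmse)

# Logistic regression needs class labels. ASSUMPTION: label a patient 1 when
# the target exceeds the training-set median, otherwise 0.
threshold = np.median(y_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, (y_train > threshold).astype(int))
accuracy = accuracy_score((y_test > threshold).astype(int), clf.predict(x_test))
print('Accuracy:', accuracy)
```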
Result:
Thus working with bivariate analysis to find linear and logistic regression using diabetes dataset was
executed successfully.
Ex no:6 APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS USING UCI DATASET
Aim:
To work with various plotting functions using a UCI dataset
Algorithm:
STEP1:start
STEP2:read uci dataset
STEP 3:compute heatmap,boxplot,normal curve,density and contour plots, correlation and scatter
plot,histogram and three dimensional plotting using uci dataset
STEP4: print the result
STEP5:stop
#BOXPLOT
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.boxplot()
plt.xticks(rotation=90)
output
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.plot(kind='box',subplots=True,layout=(5,3),figsize=(12,12))
plt.show()
OUTPUT:
#HEATMAP
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
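# numeric_only=True skips text columns, which would otherwise make corr() fail in recent pandas
sns.heatmap(df.corr(numeric_only=True))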
output:
#HISTOGRAMS
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
df.hist(figsize=(12,12),layout=(5,3))
Output:
#HEATMAP
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
plt.figure(figsize=(20,10))
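# numeric_only=True skips text columns, which would otherwise make corr() fail in recent pandas
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='terrain')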
output:
#NORMAL CURVE
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
f,ax=plt.subplots(figsize=(10,6))
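x=df['Age']
# histplot draws on the axes created above; kde=True overlays a smoothed density curve
ax=sns.histplot(x,bins=10,kde=True,ax=ax)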
plt.show()
output:
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
f,ax=plt.subplots(figsize=(10,6))
x=df['Age']
x=pd.Series(x,name="Age variable")
# fill=True replaces the deprecated shade=True argument
ax=sns.kdeplot(x,fill=True,color='r')
plt.show()
output:
#THREE DIMENSIONAL PLOTTING
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Heart.csv")
fig=plt.figure()
ax=plt.axes(projection='3d')
x=df['Age']
x=pd.Series(x,name="Age variable")
y=df["Sex"]
y=pd.Series(y,name="Sex variable")
z=df["Chol"]
z=pd.Series(z,name="Cholesterol variable")
ax.plot3D(x,y,z,'green')
ax.set_title('3D line plot Heart disease dataset')
plt.show()
output:
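The algorithm for this exercise lists contour plots, but no program for them appears above. The following is a minimal sketch using a synthetic surface rather than the heart dataset. The function z = sin(x) * cos(y), the grid size and the colour map are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic surface z = sin(x) * cos(y); not taken from the heart dataset.
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) * np.cos(Y)

fig, ax = plt.subplots(figsize=(8, 6))
filled = ax.contourf(X, Y, Z, levels=20, cmap='viridis')        # filled colour bands
ax.contour(X, Y, Z, levels=20, colors='black', linewidths=0.3)  # thin outline per level
fig.colorbar(filled, ax=ax, label='z')
ax.set_title('Contour plot of sin(x) * cos(y)')
```

In a Jupyter notebook, remove the matplotlib.use("Agg") line and finish with plt.show() to display the figure.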
Ex no: 7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
Aim:
To visualize geographic data using Basemap
Algorithm:
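Program:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# Example data
lats = [34.05, 40.71, 37.77] # Los Angeles, New York, San Francisco
lons = [-118.25, -74.00, -122.42]
labels = ['LA', 'NYC', 'SF']
# Mercator map covering the continental United States
m = Basemap(projection='merc',
            llcrnrlat=20, urcrnrlat=50,
            llcrnrlon=-130, urcrnrlon=-60,
            resolution='i')
m.drawcoastlines()
# Convert longitude/latitude in degrees to map projection coordinates
x, y = m(lons, lats)
m.scatter(x, y, marker='o', color='red', zorder=5)
plt.show()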
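The x and y passed to m.scatter are map projection coordinates in metres, not raw degrees. To illustrate the idea, the sketch below applies the textbook spherical Mercator formula to the example cities. The helper name mercator and the sphere radius are assumptions. Basemap's own values may differ, since it measures coordinates relative to the map's corner and uses its own projection settings.

```python
import numpy as np

EARTH_RADIUS_M = 6370997.0  # assumed sphere radius in metres

def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Textbook spherical Mercator: degrees in, metres out."""
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    x = radius * lon                                   # east-west distance
    y = radius * np.log(np.tan(np.pi / 4 + lat / 2))   # stretched north-south distance
    return x, y

lats = [34.05, 40.71, 37.77]  # Los Angeles, New York, San Francisco
lons = [-118.25, -74.00, -122.42]
x, y = mercator(lons, lats)
for name, xi, yi in zip(['LA', 'NYC', 'SF'], x, y):
    print(name, round(float(xi)), round(float(yi)))
```

The ordering of the results follows the geography: San Francisco has the smallest x (furthest west) and New York the largest y (furthest north).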
Output: