📑Table of Contents📑

This notebook can also be checked on kaggle https://www.kaggle.com/yogidsba/must-know-pandas-functions-data-analysis?select=IN_youtube_trending_data.csv

📑Table of Contents📑

🤷‍Purpose of Analysis
🌐 Basic Pandas functions
📖 Reading Dataset
👀 Exploring Dataset
🎭 Converting Datatypes
🩸 Add/Rename/Drop/Filter Columns
❓ Missing Values
📄 Descriptive Statistics
🔪 Slicing/Subsetting
👨‍👨‍👯‍Data Aggregation ,Combining data frame and Groupby

🤷‍ Purpose of Analysis 🤷‍

The biggest motivation for this notebook was a simple thought to get better at Pandas.The best way to learn or improve is to write about it .With this thought process, I have created a simple tutorial along with a Case study on a Youtube trending Dataset to provide examples how these Panda functions can be used during analysis.Questions which I want to answer with this analysis :

Which is most watched video in India. ❓(Most watch doesnot mean trending)

Number of videos published in 2021❓

How many videos did T-Series publish❓

Top 30 videos that has been most watched❓

Top 3 Categories having maximum pubished videos in 2020❓

Top 10 Most liked videos in India❓

Getting Dataset: 🧧

To Read Dataset From Method

CSV pd.read_csv(filename)

TAB/Delimited /TSV pd.read_table(filename)

EXCEL d.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None)

SQL Table pd.read_sql(query,connection_object)

JSON formatted string< pd.read_json(json_string)

HTML url pd.read_html(url)

Clipboard pd.read_clipboard()

Analyzing Dataset🔭🔭 : When you begin working with a new data set, it is best to check out the first few rows ,look at datatypes, check for any datatype mismatch in variables. This will show you what kind of data is in the dataset, what data types you are working with, and help you in your analysis.

Description Method

Displays top 5 row or n rows; df_india.head()/df_india.head(n)

Displays bottom 5 row or n rows df_india.tail()/df_india.tail(n)

Displays randomly selected rows df_india.sample(n)

Selecting certain columns col=['title','trending_date','view_count'] df_india[col].head()

Select and order top n entries df_india.nlargest(10, 'view_count')

Select and order bottom n entries df_india.nsmallest(10, 'view_count')

Distinct values in a column df_india['trending_date'].nunique()

Count number of rows with each unique value of variable df_india['trending_date'].value_counts()

To display datatypes of variables df_india.dtypes

Displays numbers of rows and column our dataset
contains.Its a tuple representing the dimensions df_india.shape

This method print info about a dataset including the index dtype
and columns,non-null values and memory usage df_india.info()

A selection of numeric dtypes to be included/excluded. df_india.select_dtypes(include = [np.number]).dtypes

A selection of object dtypes to be included/excluded. df_india.select_dtypes(include = [np.object]).dtypes

A selection of datetime dtypes to be included/excluded. df_india.select_dtypes(include = [np.datetime64]).dtypes

To select Pandas categorical dtypes df_india.select_dtypes(include = 'category').dtypes

Converting Datatypes 🎭: Pandas automatically assign datatype to the columns,depending on what data the columns hold.For eg weight column may have data in form of 60.5kg so the datatype here would be object. But we want weight to be float . We can remove kg and change data type to float.

Description Method

To convert object datatype to categorical datatype df_india.description=df_india.description.astype("category")

To convert object datatype to Integer datatype df_india.description=df_india.description.astype("int")

To convert object datatype to float datatype df['DataFrame Column'] = df['DataFrame Column'].astype(float)

To Convert object to datetime datatype df_india.publishedAt=pd.to_datetime(df_india.publishedAt)

To make a copy of dataset df_copy=df_india.copy()

Adding /Renaming/Dropping Columns ➕🩸🩸

Description Method

To add new columns to a dataframe. df_india['count_max_view'] =df_india.groupby(['video_id'])['view_count'].transform(max)

To rename a column df_india.rename(columns={'description':'desc'})

To drop a column df_india.drop()

Descriptive Statistics📄📄
Descriptive statistics, which include things like the mean, median,Quartiles,Minimum,Standard Deviation etc of the data , can be useful to understand the spread of the variables are and what variables may be the most important. Descriptive statistics on pandas objects exclude missing data by default

Description Method

To get statistical summary for numeric value, to understand the distrubution df_india.describe()(n)

To get summary for Categorical value df_india.describe(include='category')

To get Minimum value for all numeric columns df_india.min()

To get Mean value for all numeric columnss df_india.mean()

To get exact middle value among a dataset. Useful for skewed distribution or data with outliers df_india.median()

To get most frequent element in a dataset. df_india.mode()

To get standard deviation of numeric column df_india.std()

Checking and Handling Missing Values.:❓❓
.When no data value is stored for feature for a particular observation, we say this feature has a missing value.Missing value handling is a post in itself and will be covered later

Description Method

This function returns boolean value indicating whether values are missing (NaN) df.isna()

This function check for not null or non-missing values. df.notna()

This function is used to replace values with other values dynamically. df.Replace()

This function returns boolean value indicating whether values are missing (NaN) pd.isnull()
This function checks for not null or non-missing values. pd.notnull()

This function is used to remove missing value. pd.dropna()

Return a copy of the data with missing values filled or imputed pd.fillna()

Slicing /Subsetting DataSet :🔪🔪
One of the most effective ways to get more information out of a dataset is to divide it into smaller, more uniform subsets, and analyze each of these "subset" on its own.

Description Method

To Select rows by position use ilocs .
It allows integer-location based indexing for selections . df_india.iloc[10:4]

To select data based on labels, which means
that we have to specify the name of the rows and columns that we need to filter out df_india.loc[df_india['title','view_count]]

Query the columns of a DataFrame with a boolean expression. df_india.query['category == 24']

Data Aggregation,comibing dataframe and Groupby:👯‍♂️👯‍👨‍👨‍👦‍👦
One of the most basic analysis functions is grouping and aggregating data. An aggregation function is one which takes multiple individual values and returns a summary.In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data.

Description Method

To find total count,unique observatoin agg_func_count = {'view_count': ['count', 'nunique']}
df_merge.groupby(['category']).agg(agg_func_count)

To find increased number of likes agg_func_selection = {'likes': ['first', 'last']}
df_merge.sort_values(by=['channelTitle'], ascending=False).groupby(['channelTitle' ]).agg(agg_func_selection)

To merge category dataset with trending dataset df_merge = df_india.merge(df_category, on = 'categoryId', how = 'inner')

To concatenate Us and canada dataset df_con_us_ca=pd.concat([df_us,df_ca])

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Basic pandas functions&data analysis.ipynb		Basic pandas functions&data analysis.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

📑Table of Contents📑

🤷‍ Purpose of Analysis 🤷‍

About

Uh oh!

Releases

Packages

Uh oh!

Languages

To Read Dataset From	Method
CSV	pd.read_csv(filename)
TAB/Delimited /TSV	pd.read_table(filename)
EXCEL	d.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None)
SQL Table	pd.read_sql(query,connection_object)
JSON formatted string<	pd.read_json(json_string)
HTML url	pd.read_html(url)
Clipboard	pd.read_clipboard()

Description	Method
Displays top 5 row or n rows;	df_india.head()/df_india.head(n)
Displays bottom 5 row or n rows	df_india.tail()/df_india.tail(n)
Displays randomly selected rows	df_india.sample(n)
Selecting certain columns	col=['title','trending_date','view_count'] df_india[col].head()
Select and order top n entries	df_india.nlargest(10, 'view_count')
Select and order bottom n entries	df_india.nsmallest(10, 'view_count')
Distinct values in a column	df_india['trending_date'].nunique()
Count number of rows with each unique value of variable	df_india['trending_date'].value_counts()
To display datatypes of variables	df_india.dtypes
Displays numbers of rows and column our dataset contains.Its a tuple representing the dimensions	df_india.shape
This method print info about a dataset including the index dtype and columns,non-null values and memory usage	df_india.info()
A selection of numeric dtypes to be included/excluded.	df_india.select_dtypes(include = [np.number]).dtypes
A selection of object dtypes to be included/excluded.	df_india.select_dtypes(include = [np.object]).dtypes
A selection of datetime dtypes to be included/excluded.	df_india.select_dtypes(include = [np.datetime64]).dtypes
To select Pandas categorical dtypes	df_india.select_dtypes(include = 'category').dtypes

Description	Method
To convert object datatype to categorical datatype	df_india.description=df_india.description.astype("category")
To convert object datatype to Integer datatype	df_india.description=df_india.description.astype("int")
To convert object datatype to float datatype	df['DataFrame Column'] = df['DataFrame Column'].astype(float)
To Convert object to datetime datatype	df_india.publishedAt=pd.to_datetime(df_india.publishedAt)
To make a copy of dataset	df_copy=df_india.copy()

Description	Method
To add new columns to a dataframe.	df_india['count_max_view'] =df_india.groupby(['video_id'])['view_count'].transform(max)
To rename a column	df_india.rename(columns={'description':'desc'})
To drop a column	df_india.drop()

Description	Method
To get statistical summary for numeric value, to understand the distrubution	df_india.describe()(n)
To get summary for Categorical value	df_india.describe(include='category')
To get Minimum value for all numeric columns	df_india.min()
To get Mean value for all numeric columnss	df_india.mean()
To get exact middle value among a dataset. Useful for skewed distribution or data with outliers	df_india.median()
To get most frequent element in a dataset.	df_india.mode()
To get standard deviation of numeric column	df_india.std()

Description	Method
This function returns boolean value indicating whether values are missing (NaN)	df.isna()
This function check for not null or non-missing values.	df.notna()
This function is used to replace values with other values dynamically.	df.Replace()
This function returns boolean value indicating whether values are missing (NaN)	pd.isnull()
This function checks for not null or non-missing values.	pd.notnull()
This function is used to remove missing value.	pd.dropna()
Return a copy of the data with missing values filled or imputed	pd.fillna()

Description	Method
To Select rows by position use ilocs . It allows integer-location based indexing for selections .	df_india.iloc[10:4]
To select data based on labels, which means that we have to specify the name of the rows and columns that we need to filter out	df_india.loc[df_india['title','view_count]]
Query the columns of a DataFrame with a boolean expression.	df_india.query['category == 24']

Description	Method
To find total count,unique observatoin	agg_func_count = {'view_count': ['count', 'nunique']} df_merge.groupby(['category']).agg(agg_func_count)
To find increased number of likes	agg_func_selection = {'likes': ['first', 'last']} df_merge.sort_values(by=['channelTitle'], ascending=False).groupby(['channelTitle' ]).agg(agg_func_selection)
To merge category dataset with trending dataset	df_merge = df_india.merge(df_category, on = 'categoryId', how = 'inner')
To concatenate Us and canada dataset	df_con_us_ca=pd.concat([df_us,df_ca])

yogitadarade/LearningPandas

Folders and files

Latest commit

History

Repository files navigation

📑Table of Contents📑

🤷‍ Purpose of Analysis 🤷‍

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages