Skip to content

yogitadarade/LearningPandas

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 

Repository files navigation

This notebook can also be checked on kaggle https://www.kaggle.com/yogidsba/must-know-pandas-functions-data-analysis?select=IN_youtube_trending_data.csv

📑Table of Contents📑


🤷‍Purpose of Analysis
🌐 Basic Pandas functions
📖 Reading Dataset
👀 Exploring Dataset
🎭 Converting Datatypes
🩸 Add/Rename/Drop/Filter Columns
Missing Values
📄 Descriptive Statistics
🔪 Slicing/Subsetting
👨‍👨‍👯‍Data Aggregation ,Combining data frame and Groupby

🤷‍ Purpose of Analysis 🤷‍

The biggest motivation for this notebook was a simple thought to get better at Pandas.The best way to learn or improve is to write about it .With this thought process, I have created a simple tutorial along with a Case study on a Youtube trending Dataset to provide examples how these Panda functions can be used during analysis.Questions which I want to answer with this analysis :

  • Which is most watched video in India. ❓(Most watch doesnot mean trending)
  • Number of videos published in 2021❓
  • How many videos did T-Series publish❓
  • Top 30 videos that has been most watched❓
  • Top 3 Categories having maximum pubished videos in 2020❓
  • Top 10 Most liked videos in India❓

Getting Dataset: 🧧
To Read Dataset From Method
CSV  pd.read_csv(filename)
TAB/Delimited /TSV pd.read_table(filename)
EXCEL d.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None)
SQL Table pd.read_sql(query,connection_object)
JSON formatted string< pd.read_json(json_string)
HTML url pd.read_html(url)
Clipboard pd.read_clipboard()

Analyzing Dataset🔭🔭 : When you begin working with a new data set, it is best to check out the first few rows ,look at datatypes, check for any datatype mismatch in variables. This will show you what kind of data is in the dataset, what data types you are working with, and help you in your analysis.
Description Method
Displays top 5 row or n rows; df_india.head()/df_india.head(n)
Displays bottom 5 row or n rows df_india.tail()/df_india.tail(n)
Displays randomly selected rows df_india.sample(n)
Selecting certain columns col=['title','trending_date','view_count'] df_india[col].head()
Select and order top n entries df_india.nlargest(10, 'view_count')
Select and order bottom n entries df_india.nsmallest(10, 'view_count')
Distinct values in a column df_india['trending_date'].nunique()
Count number of rows with each unique value of variable df_india['trending_date'].value_counts()
To display datatypes of variables df_india.dtypes
Displays numbers of rows and column our dataset
contains.Its a tuple representing the dimensions
df_india.shape
This method print info about a dataset including the index dtype
and columns,non-null values and memory usage
df_india.info()
A selection of numeric dtypes to be included/excluded. df_india.select_dtypes(include = [np.number]).dtypes
A selection of object dtypes to be included/excluded. df_india.select_dtypes(include = [np.object]).dtypes
A selection of datetime dtypes to be included/excluded. df_india.select_dtypes(include = [np.datetime64]).dtypes
To select Pandas categorical dtypes df_india.select_dtypes(include = 'category').dtypes

Converting Datatypes 🎭: Pandas automatically assign datatype to the columns,depending on what data the columns hold.For eg weight column may have data in form of 60.5kg so the datatype here would be object. But we want weight to be float . We can remove kg and change data type to float.
Description Method
To convert object datatype to categorical datatype df_india.description=df_india.description.astype("category")
To convert object datatype to Integer datatype df_india.description=df_india.description.astype("int")
To convert object datatype to float datatype df['DataFrame Column'] = df['DataFrame Column'].astype(float)
To Convert object to datetime datatype df_india.publishedAt=pd.to_datetime(df_india.publishedAt)
To make a copy of dataset df_copy=df_india.copy()

Adding /Renaming/Dropping Columns ➕🩸🩸
Description Method
To add new columns to a dataframe. df_india['count_max_view'] =df_india.groupby(['video_id'])['view_count'].transform(max)
To rename a column df_india.rename(columns={'description':'desc'})
To drop a column df_india.drop()

Descriptive Statistics📄📄
Descriptive statistics, which include things like the mean, median,Quartiles,Minimum,Standard Deviation etc of the data , can be useful to understand the spread of the variables are and what variables may be the most important. Descriptive statistics on pandas objects exclude missing data by default

Description Method
To get statistical summary for numeric value, to understand the distrubution df_india.describe()(n)
To get summary for Categorical value df_india.describe(include='category')
To get Minimum value for all numeric columns df_india.min()
To get Mean value for all numeric columnss df_india.mean()
To get exact middle value among a dataset. Useful for skewed distribution or data with outliers df_india.median()
To get most frequent element in a dataset. df_india.mode()
To get standard deviation of numeric column df_india.std()

Checking and Handling Missing Values.:❓❓
.When no data value is stored for feature for a particular observation, we say this feature has a missing value.Missing value handling is a post in itself and will be covered later

Description Method
This function returns boolean value indicating whether values are missing (NaN) df.isna()
This function check for not null or non-missing values. df.notna()
This function is used to replace values with other values dynamically. df.Replace()
This function returns boolean value indicating whether values are missing (NaN) pd.isnull()
This function checks for not null or non-missing values. pd.notnull()
This function is used to remove missing value. pd.dropna()
Return a copy of the data with missing values filled or imputed pd.fillna()

Slicing /Subsetting DataSet :🔪🔪
One of the most effective ways to get more information out of a dataset is to divide it into smaller, more uniform subsets, and analyze each of these "subset" on its own.

Description Method
To Select rows by position use ilocs .
It allows integer-location based indexing for selections .
df_india.iloc[10:4]
To select data based on labels, which means
that we have to specify the name of the rows and columns that we need to filter out
df_india.loc[df_india['title','view_count]]
Query the columns of a DataFrame with a boolean expression. df_india.query['category == 24']

Data Aggregation,comibing dataframe and Groupby:👯‍♂️👯‍👨‍👨‍👦‍👦
One of the most basic analysis functions is grouping and aggregating data. An aggregation function is one which takes multiple individual values and returns a summary.In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data.

Description Method
To find total count,unique observatoin agg_func_count = {'view_count': ['count', 'nunique']}
df_merge.groupby(['category']).agg(agg_func_count)
To find increased number of likes agg_func_selection = {'likes': ['first', 'last']}
df_merge.sort_values(by=['channelTitle'], ascending=False).groupby(['channelTitle' ]).agg(agg_func_selection)
To merge category dataset with trending dataset df_merge = df_india.merge(df_category, on = 'categoryId', how = 'inner')
To concatenate Us and canada dataset df_con_us_ca=pd.concat([df_us,df_ca])

About

LearningPandas

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published