This notebook can also be checked on kaggle
https://www.kaggle.com/yogidsba/must-know-pandas-functions-data-analysis?select=IN_youtube_trending_data.csv
🤷Purpose of Analysis
🌐 Basic Pandas functions
📖 Reading Dataset
👀 Exploring Dataset
🎭 Converting Datatypes
🩸 Add/Rename/Drop/Filter Columns
❓ Missing Values
📄 Descriptive Statistics
🔪 Slicing/Subsetting
👨👨👯Data Aggregation ,Combining data frame and Groupby
The biggest motivation for this notebook was a simple thought to get better at Pandas.The best way to learn or improve is to write about it .With this thought process, I have created a simple tutorial along with a Case study on a Youtube trending Dataset to provide examples how these Panda functions can be used during analysis.Questions which I want to answer with this analysis :
- Which is most watched video in India. ❓(Most watch doesnot mean trending)
- Number of videos published in 2021❓
- How many videos did T-Series publish❓
- Top 30 videos that has been most watched❓
- Top 3 Categories having maximum pubished videos in 2020❓
- Top 10 Most liked videos in India❓
Getting Dataset: 🧧
To Read Dataset From | Method |
---|---|
CSV | pd.read_csv(filename) |
TAB/Delimited /TSV | pd.read_table(filename) |
EXCEL | d.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None) |
SQL Table | pd.read_sql(query,connection_object) |
JSON formatted string< | pd.read_json(json_string) |
HTML url | pd.read_html(url) |
Clipboard | pd.read_clipboard() |
Analyzing Dataset🔭🔭 : When you begin working with a new data set, it is best to check out the first few rows ,look at datatypes, check for any datatype mismatch in variables. This will show you what kind of data is in the dataset, what data types you are working with, and help you in your analysis.
Description | Method |
---|---|
Displays top 5 row or n rows; | df_india.head()/df_india.head(n) |
Displays bottom 5 row or n rows | df_india.tail()/df_india.tail(n) |
Displays randomly selected rows | df_india.sample(n) |
Selecting certain columns | col=['title','trending_date','view_count'] df_india[col].head() |
Select and order top n entries | df_india.nlargest(10, 'view_count') |
Select and order bottom n entries | df_india.nsmallest(10, 'view_count') |
Distinct values in a column | df_india['trending_date'].nunique() |
Count number of rows with each unique value of variable | df_india['trending_date'].value_counts() |
To display datatypes of variables | df_india.dtypes |
Displays numbers of rows and column our dataset contains.Its a tuple representing the dimensions |
df_india.shape |
This method print info about a dataset including the index dtype and columns,non-null values and memory usage |
df_india.info() |
A selection of numeric dtypes to be included/excluded. | df_india.select_dtypes(include = [np.number]).dtypes |
A selection of object dtypes to be included/excluded. | df_india.select_dtypes(include = [np.object]).dtypes |
A selection of datetime dtypes to be included/excluded. | df_india.select_dtypes(include = [np.datetime64]).dtypes |
To select Pandas categorical dtypes | df_india.select_dtypes(include = 'category').dtypes |
Converting Datatypes 🎭: Pandas automatically assign datatype to the columns,depending on what data the columns hold.For eg weight column may have data in form of 60.5kg so the datatype here would be object. But we want weight to be float . We can remove kg and change data type to float.
Description | Method |
---|---|
To convert object datatype to categorical datatype | df_india.description=df_india.description.astype("category") |
To convert object datatype to Integer datatype | df_india.description=df_india.description.astype("int") |
To convert object datatype to float datatype | df['DataFrame Column'] = df['DataFrame Column'].astype(float) |
To Convert object to datetime datatype | df_india.publishedAt=pd.to_datetime(df_india.publishedAt) |
To make a copy of dataset | df_copy=df_india.copy() |
Adding /Renaming/Dropping Columns ➕🩸🩸
Description | Method |
---|---|
To add new columns to a dataframe. | df_india['count_max_view'] =df_india.groupby(['video_id'])['view_count'].transform(max) |
To rename a column | df_india.rename(columns={'description':'desc'}) |
To drop a column | df_india.drop() |
Descriptive Statistics📄📄
Descriptive statistics, which include things like the mean, median,Quartiles,Minimum,Standard Deviation etc of the data , can be useful to understand the spread of the variables are and what variables may be the most important. Descriptive statistics on pandas objects exclude missing data by default
Description | Method |
---|---|
To get statistical summary for numeric value, to understand the distrubution | df_india.describe()(n) |
To get summary for Categorical value | df_india.describe(include='category') |
To get Minimum value for all numeric columns | df_india.min() |
To get Mean value for all numeric columnss | df_india.mean() |
To get exact middle value among a dataset. Useful for skewed distribution or data with outliers | df_india.median() |
To get most frequent element in a dataset. | df_india.mode() |
To get standard deviation of numeric column | df_india.std() |
Checking and Handling Missing Values.:❓❓
.When no data value is stored for feature for a particular observation, we say this feature has a missing value.Missing value handling is a post in itself and will be covered later
Description | Method |
---|---|
This function returns boolean value indicating whether values are missing (NaN) | df.isna() |
This function check for not null or non-missing values. | df.notna() |
This function is used to replace values with other values dynamically. | df.Replace() |
This function returns boolean value indicating whether values are missing (NaN) | pd.isnull() |
This function checks for not null or non-missing values. | pd.notnull() |
This function is used to remove missing value. | pd.dropna() |
Return a copy of the data with missing values filled or imputed | pd.fillna() |
Slicing /Subsetting DataSet :🔪🔪
One of the most effective ways to get more information out of a dataset is to divide it into smaller, more uniform subsets, and analyze each of these "subset" on its own.
Description | Method |
---|---|
To Select rows by position use ilocs . It allows integer-location based indexing for selections . |
df_india.iloc[10:4] |
To select data based on labels, which means that we have to specify the name of the rows and columns that we need to filter out |
df_india.loc[df_india['title','view_count]] |
Query the columns of a DataFrame with a boolean expression. | df_india.query['category == 24'] |
Data Aggregation,comibing dataframe and Groupby:👯♂️👯👨👨👦👦
One of the most basic analysis functions is grouping and aggregating data. An aggregation function is one which takes multiple individual values and returns a summary.In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data.
Description | Method |
---|---|
To find total count,unique observatoin |
agg_func_count = {'view_count': ['count', 'nunique']} df_merge.groupby(['category']).agg(agg_func_count) |
To find increased number of likes |
agg_func_selection = {'likes': ['first', 'last']} df_merge.sort_values(by=['channelTitle'], ascending=False).groupby(['channelTitle' ]).agg(agg_func_selection) |
To merge category dataset with trending dataset | df_merge = df_india.merge(df_category, on = 'categoryId', how = 'inner') |
To concatenate Us and canada dataset | df_con_us_ca=pd.concat([df_us,df_ca]) |