Exploratory Data Analysis in R (introduction)

Exploratory data analysis (EDA) is the very first step in a data project.

Setting-up

install.packages("tidyverse")

install.packages("funModeling")

install.packages("Hmisc")

install.packages("DataExplorer")

Load the needed libraries:

library(funModeling)

library(tidyverse)

library(Hmisc)

library(DataExplorer)

tl;dr (code)

Run all the functions in this post in one shot with the following function:

basic_eda <- function(data) {
  glimpse(data)
  df_status(data)
  freq(data)
  profiling_num(data)
  plot_num(data)
  describe(data)
}

Replace my_amazing_data with your own data frame, and that's it:

basic_eda(my_amazing_data)

Creating the data for this example

data <- heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)

Step 1 - First approach to data

glimpse(data)

Get metrics about data types, zeros, infinite numbers, and missing values:

df_status(data)
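To make the kinds of metrics listed above concrete, here is a rough base-R sketch that computes similar counts by hand. It is not funModeling's implementation: the helper name status_sketch, its column names, and the toy data frame are all assumptions made for illustration.

```r
# Made-up data with one zero, one NA and one infinite value in column a.
toy <- data.frame(
  a = c(0, 1, NA, Inf),
  b = c("x", NA, "y", "z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-column type plus counts of zeros, NAs and infinite values.
status_sketch <- function(df) {
  data.frame(
    variable = names(df),
    type     = vapply(df, function(x) class(x)[1], character(1)),
    q_zeros  = vapply(df, function(x) sum(x == 0, na.rm = TRUE), integer(1)),
    q_na     = vapply(df, function(x) sum(is.na(x)), integer(1)),
    q_inf    = vapply(df, function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

status <- status_sketch(toy)
print(status)
```

Columns with a large share of zeros or missing values are usually the first ones worth a closer look.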

Step 2 - Analyzing categorical variables

freq(data)

Export the plots as JPEG files to the current directory: freq(data, path_out = ".")
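For a single categorical variable, roughly the same counts and percentages can be obtained in base R. This is only a sketch of the idea using the built-in InsectSprays dataset, not a replacement for freq(); the column names category, frequency and percentage are assumptions chosen for illustration.

```r
# Count each level, then express the counts as percentages.
counts <- table(InsectSprays$spray)

freq_table <- data.frame(
  category   = names(counts),
  frequency  = as.vector(counts),
  percentage = round(100 * as.vector(prop.table(counts)), 2)
)

# Most frequent categories first.
freq_table <- freq_table[order(-freq_table$frequency), ]
print(freq_table)
```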

Step 3 - Analyzing numerical variables

Graphically

plot_num(data)

Tips

Try to identify highly unbalanced variables
Visually check any variable with outliers
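A visual check can be backed by a quick numeric one. The sketch below uses Tukey's boxplot rule (values more than 1.5 times the interquartile range beyond the hinges) through base R's boxplot.stats(). The vector x is made-up illustration data, and this rule is one common heuristic rather than anything plot_num() itself applies.

```r
# Made-up values with one obviously extreme observation.
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns the points lying beyond the whiskers in $out.
outliers <- boxplot.stats(x)$out
print(outliers)
```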

Quantitatively

data_prof <- profiling_num(data)

Tips

Describe each variable based on its distribution (also useful for reporting)

Pay attention to variables with a high standard deviation.

Use the metrics you are most familiar with: data_prof %>% select(variable, variation_coef, range_98). A high value in variation_coef may indicate outliers; range_98 shows where most of the values lie.
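To show what these two metrics capture, here is a hand computation in base R. It assumes that variation_coef is the standard deviation divided by the mean and that range_98 spans the 1st to 99th percentiles; check the funModeling documentation before relying on those definitions. The vector x is made-up illustration data.

```r
# Made-up numeric variable.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Assumed definition: sample standard deviation relative to the mean.
variation_coef <- sd(x) / mean(x)

# Assumed definition: interval between the 1st and 99th percentiles.
range_98 <- quantile(x, probs = c(0.01, 0.99), names = FALSE)

print(variation_coef)
print(range_98)
```

Because the coefficient of variation is unit-free, it can be compared across variables measured on different scales, which a raw standard deviation cannot.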

Step 4 - Analyzing numerical and categorical at the same time

library(Hmisc)

describe(data)

TIPS:

Check min and max values (outliers)
Check Distributions
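The min and max check can also be done in one pass over all numeric columns with base R. This sketch uses the built-in mtcars dataset as stand-in data; the object name extremes is an assumption for illustration.

```r
# Keep only the numeric columns.
numeric_cols <- vapply(mtcars, is.numeric, logical(1))

# range() returns c(min, max); transpose so each variable becomes a row.
extremes <- t(vapply(mtcars[numeric_cols], range, numeric(2), na.rm = TRUE))
colnames(extremes) <- c("min", "max")
print(extremes)
```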

Data Cleaning

choco$Cocoa.Percent = as.numeric(gsub('%','',choco$Cocoa.Percent))

choco$Review.Date = as.character(choco$Review.Date)
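The same two conversions on a tiny made-up data frame show what they do. The toy_choco data below is hypothetical and only mimics the shape of the two columns involved.

```r
# Hypothetical stand-in for the two columns being cleaned.
toy_choco <- data.frame(
  Cocoa.Percent = c("63%", "70%", "72.5%"),
  Review.Date   = c(2016, 2015, 2014),
  stringsAsFactors = FALSE
)

# Strip the percent sign so the column can be treated as numeric.
toy_choco$Cocoa.Percent <- as.numeric(gsub("%", "", toy_choco$Cocoa.Percent, fixed = TRUE))

# Store the review year as text so it is handled as a category, not a quantity.
toy_choco$Review.Date <- as.character(toy_choco$Review.Date)

str(toy_choco)
```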

Variables

The very first thing you’d want to do in your EDA is check the dimensions of the input dataset and the types of its variables.

plot_str(choco)

Search for missing values

plot_missing(choco)

Continuous Variables

plot_histogram(choco)

DataExplorer also has a function for density plots.

plot_density(choco)

Multivariate Analysis

plot_correlation(choco, type = 'continuous')
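The numbers behind a correlation heatmap of continuous variables can be inspected directly with base R's cor(). This sketch uses three columns of the built-in mtcars dataset as stand-in data; use = "pairwise.complete.obs" is included so the call also tolerates missing values.

```r
# Stand-in continuous variables.
num_data <- mtcars[, c("mpg", "wt", "hp")]

# Pearson correlation matrix, computed pairwise on complete observations.
cor_matrix <- cor(num_data, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```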

Categorical Variables — Barplots

plot_bar(choco)

create_report

create_report(choco)

Use nrow() and ncol()

Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second element of the output vector from dim().
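A short illustration with the built-in InsectSprays dataset, which has 72 rows and 2 columns:

```r
n_rows <- nrow(InsectSprays)
n_cols <- ncol(InsectSprays)

# dim() returns both values at once: rows first, columns second.
dims <- dim(InsectSprays)

print(c(n_rows, n_cols))
print(dims)
```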

How to see the last observations.

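For example, the following command returns the last 10 observations. InsectSprays has 72 rows, so the negative n = -62 drops the first 62 of them, which is equivalent to n = 10.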

tail(InsectSprays, n = -62)
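The two ways of calling tail() can be compared side by side. This is a small sketch; the object names are assumptions for illustration.

```r
# Positive n: keep the last n rows.
last_ten_a <- tail(InsectSprays, n = 10)

# Negative n: drop the first |n| rows and keep the rest.
last_ten_b <- tail(InsectSprays, n = -62)

print(nrow(last_ten_a))
print(nrow(last_ten_b))
print(all.equal(last_ten_a, last_ten_b))
```

The positive form is usually easier to read because it does not depend on knowing the total number of rows.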

The str() function

The str() function returns many useful pieces of information, including the row and column counts above and the data type of each column. In this example, “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories or levels.

str(InsectSprays)

'data.frame': 72 obs. of 2 variables:

$ count: num 10 7 20 14 14 12 10 23 17 20 ...

$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
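When the column types are needed programmatically rather than as printed text, they can be extracted with base R. This is a small sketch; the object name col_classes is an assumption for illustration.

```r
# First class of each column, as a named character vector.
col_classes <- vapply(InsectSprays, function(x) class(x)[1], character(1))
print(col_classes)

# Number of levels of the categorical column.
print(nlevels(InsectSprays$spray))
```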

Categories or levels of a categorical variable

To obtain all of the categories or levels of a categorical variable, use the levels() function.

levels(InsectSprays$spray)

[1] "A" "B" "C" "D" "E" "F"

Missing values

The summary() function returns summary statistics for each column. If there are any missing values (denoted by “NA” for a particular datum), it also provides a count of them. In this example, there are no missing values for “count”, so the number of NA’s is not displayed. For a categorical variable like “spray”, it returns the levels and the number of observations in each level.

summary(InsectSprays)

count spray

Min. : 0.00 A:12

1st Qu.: 3.00 B:12

Median : 7.00 C:12

Mean : 9.50 D:12

3rd Qu.:14.25 E:12

Max. :26.00 F:12
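Because InsectSprays has no missing values, the NA count never appears in its summary. The built-in airquality dataset, used here purely as a stand-in, does contain missing values, so it shows what that extra entry looks like and how to count NAs per column directly.

```r
# summary() on a numeric vector with missing values adds an "NA's" entry.
ozone_summary <- summary(airquality$Ozone)
print(ozone_summary)

# Count of missing values in every column.
na_per_column <- colSums(is.na(airquality))
print(na_per_column)
```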
