Data Science
Visualizations are one of the easiest ways to analyze and absorb information. Visuals make
complex problems easier to understand: they help in identifying patterns, relationships,
and outliers in data, and in understanding business problems better and more quickly. They
help to build a compelling story, and the insights gathered from them help in building
strategies for businesses. Visualization is also a precursor to much higher-level data
analysis, such as Exploratory Data Analysis (EDA) and Machine Learning (ML).
Human beings are visual creatures. Studies suggest that our brains are wired for the
visual and process information faster when it arrives through the eye.
. . .
Matplotlib
Matplotlib is a 2-D plotting library for Python that helps in visualizing figures. It
emulates MATLAB-like graphs and visualizations. MATLAB is not free, can be difficult to
scale, and can be tedious as a programming language. So, matplotlib is used in Python as a
robust, free, and easy-to-use library for data visualization.
Anatomy of Matplotlib Figure
Figure: Anatomy of Matplotlib
The figure is the overall window in which plotting happens. Contained within the figure
are the axes, where the actual graphs are plotted. Every axes has an x-axis and a y-axis
for plotting, and contained within the axes are the titles, ticks, and labels associated
with each axis. An essential feature of matplotlib is that we can have more than one axes
in a figure, which helps in building multiple plots. In matplotlib, pyplot is used to
create figures and change their characteristics.
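The figure and axes hierarchy described above can be sketched roughly as follows. The data
values, titles, and labels are made up for illustration and are not taken from the
original example.

```python
import matplotlib.pyplot as plt

# One figure that holds two axes side by side (1 row, 2 columns).
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Illustrative values only.
x = [1, 2, 3, 4]

# Each axes has its own title, x-axis label, and y-axis label.
axes[0].plot(x, [1, 4, 9, 16])
axes[0].set_title("Left axes")
axes[0].set_xlabel("x")
axes[0].set_ylabel("x squared")

axes[1].plot(x, [1, 8, 27, 64])
axes[1].set_title("Right axes")
axes[1].set_xlabel("x")
axes[1].set_ylabel("x cubed")

# Avoid overlapping labels between the two axes.
fig.tight_layout()
```

In a Jupyter notebook the figure is rendered at the end of the cell; in a plain script,
plt.show() would be needed to display it.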
Installing Matplotlib
Type !pip install matplotlib in a Jupyter Notebook cell or, if that doesn’t work, type
conda install -c conda-forge matplotlib in cmd. This should work in most cases.
Things to follow
Plotting with Matplotlib is quite easy, and generally the same steps are followed for each
and every plot. Matplotlib has a module called pyplot which aids in plotting figures. The
Jupyter notebook is used for running the plots. We import matplotlib.pyplot as plt and
call plt.plot() to plot a line chart; for other chart types, other functions are used in
place of plot. All plotting functions require data, which is provided to the function
through its parameters.
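As a minimal sketch of these steps, assuming some made-up yearly values (the variable
names and numbers are hypothetical):

```python
import matplotlib.pyplot as plt

# Step 1: prepare the data (hypothetical values).
years = [2015, 2016, 2017, 2018, 2019]
sales = [10, 12, 15, 14, 18]

# Step 2: create a figure and pass the data to the plotting function.
plt.figure()
lines = plt.plot(years, sales, marker="o")

# Step 3: describe the plot with a title and axis labels.
plt.title("Sales per year")
plt.xlabel("Year")
plt.ylabel("Sales")
```

Swapping plt.plot() for another pyplot function, such as plt.bar() or plt.scatter(),
follows the same pattern.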
Histogram
A histogram takes in a series of data and divides the data into a number of bins. It then
plots the frequency of data points in each bin (i.e. the interval of points). It is useful
in understanding how many data points fall in each range of values.
When to use: We should use a histogram when we need the count of values of a variable in
each range.
In the histogram of GrandCanyon visitors by year, plt.hist() takes numeric data for the
horizontal axis as its first argument, i.e. the GrandCanyon visitors. bins=10 is used to
create 10 bins between the values of visitors in GrandCanyon.
plt.hist() also returns the components that make up a histogram, with n holding the count
of values in each bin of the histogram, i.e. 5, 9, and so on.
The cumulative property gives a running total, so the last bin holds the total count, and
it helps us understand the increase in value at each bin.
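The original visitor data is not reproduced here, so the following sketch uses randomly
generated numbers as a stand-in. It shows the values returned by hist() and the effect of
cumulative=True.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: 50 made-up yearly visitor counts (in millions).
rng = np.random.default_rng(0)
visitors = rng.normal(loc=4.0, scale=0.8, size=50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# n holds the count per bin, bins holds the 11 edges of the 10 bins,
# and patches holds the drawn rectangles.
n, bins, patches = ax1.hist(visitors, bins=10)
ax1.set_title("Histogram")
ax1.set_xlabel("Visitors (millions)")
ax1.set_ylabel("Count")

# cumulative=True turns each bar into a running total of the counts.
n_cum, _, _ = ax2.hist(visitors, bins=10, cumulative=True)
ax2.set_title("Cumulative histogram")
ax2.set_xlabel("Visitors (millions)")
ax2.set_ylabel("Cumulative count")

fig.tight_layout()
```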
Implementation: Histogram
Pie Chart
It is a circular plot which is divided into slices to illustrate numerical proportion.
Each slice of a pie chart shows the proportion of a part out of the whole.
When to use: A pie chart should seldom be used, as it is difficult to compare sections of
the chart. A bar plot is used instead, as comparing sections is easy.
plt.pie() returns the components that make up a pie chart: the wedge objects, the label
texts, and so on.
A pie chart can be easily customized, for example by formatting the colors and the label
values.
explode is used to separate out slices from the pie, similar to a pizza piece being cut.
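A small sketch of a customized pie chart, assuming four made-up categories (the labels,
sizes, and colors are illustrative only):

```python
import matplotlib.pyplot as plt

# Hypothetical categories and their shares (they sum to 100).
labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]
colors = ["gold", "lightcoral", "lightskyblue", "yellowgreen"]

# Pull the first slice out of the pie; the others stay in place.
explode = (0.1, 0, 0, 0)

fig, ax = plt.subplots()
# With autopct set, pie() returns the wedges, the label texts,
# and the percentage texts drawn inside the slices.
wedges, texts, autotexts = ax.pie(
    sizes,
    labels=labels,
    colors=colors,
    explode=explode,
    autopct="%1.1f%%",
)
ax.axis("equal")  # keep the pie circular
```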
Time Series
When to use: A time series plot should be used when single or multiple variables are to
be plotted over time.
We can see that the tech company stocks are following an upward trend, which traders may
read as a positive sign for investing in the stocks.
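The stock data from the original example is not available here, so this sketch plots two
invented price series over twelve months. The company names and prices are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Twelve monthly dates in 2020 (first day of each month).
months = np.arange("2020-01", "2021-01", dtype="datetime64[M]").astype("datetime64[D]")

# Invented closing prices with an upward trend.
company_a = np.linspace(100, 160, len(months))
company_b = np.linspace(80, 150, len(months))

fig, ax = plt.subplots()
ax.plot(months, company_a, label="Company A")
ax.plot(months, company_b, label="Company B")
ax.set_xlabel("Date")
ax.set_ylabel("Closing price")
ax.legend()

# Rotate the date labels so they do not overlap.
fig.autofmt_xdate()
```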
Boxplot
A boxplot gives a nice summary of the data. It helps in understanding our distribution
better.
When to use: It should be used when we require the overall statistical information on the
distribution of the data. It can be used to detect outliers in the data. E.g. the credit
score of customers: we can get the max, min, and much more information about the scores.
Understanding Boxplot
In a boxplot diagram, the line that divides the box into 2 parts represents the median of
the data. The end of the box shows the upper quartile (75%) and the start of the box
represents the lower quartile (25%). The upper quartile is also called the 3rd quartile
and, similarly, the lower quartile is also called the 1st quartile. The region between the
lower quartile and the upper quartile is called the Inter Quartile Range (IQR) and it
covers the middle 50% of the data (75% - 25% = 50%). The maximum is the highest value in
the data and the minimum is the lowest, outliers excluded; the marks at these values are
called caps. The lines that extend from the box to the minimum and the maximum are called
whiskers; they show the range of values in the data. The extreme points beyond the
whiskers are outliers. A commonly used rule is that a value is an outlier if it is less
than the lower quartile - 1.5 * IQR or higher than the upper quartile + 1.5 * IQR.
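The 1.5 * IQR rule can be worked through with NumPy. The credit scores below are invented
for illustration, and np.percentile is used with its default linear interpolation.

```python
import numpy as np

# Hypothetical credit scores, including one very low and one very high value.
scores = np.array([300, 600, 610, 620, 630, 640, 650, 660, 670, 850])

# Lower (1st) and upper (3rd) quartiles.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Fences given by the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, q3, iqr)               # 612.5 657.5 45.0
print(lower_fence, upper_fence)  # 545.0 725.0
print(outliers)                  # [300 850]
```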
bp contains the boxplot components like boxes, whiskers, medians, and caps. Seaborn,
another plotting library, makes it easier to build custom plots than matplotlib.
patch_artist=True draws the boxes as patches that can be filled with color, which makes
the customization possible. notch makes the median look more prominent.
A caveat of the boxplot is that it does not show how many observations lie at each value.
A jitter plot in Seaborn can overcome this caveat, and a violin plot is also useful.
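A sketch of a customized boxplot, assuming two groups of randomly generated scores (the
group names, colors, and data are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np

# Two made-up groups of 100 scores each.
rng = np.random.default_rng(1)
scores = [rng.normal(650, 50, 100), rng.normal(700, 30, 100)]

fig, ax = plt.subplots()
# bp is a dictionary of the drawn components:
# boxes, whiskers, medians, caps, fliers, and means.
bp = ax.boxplot(scores, patch_artist=True, notch=True)

# patch_artist=True makes each box a patch, so it can be filled.
for box, color in zip(bp["boxes"], ["lightblue", "lightgreen"]):
    box.set_facecolor(color)

ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Score")
```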
Violin plot
A violin plot can be a better chart than a boxplot as it gives a much broader
understanding of the distribution. It resembles a violin, and its wide areas point to
where the data is more densely distributed, detail otherwise hidden by box plots.
When to use: It is an extension of the boxplot. It should be used when we require a more
intuitive understanding of the data.
The density of points in the middle seems higher, as students tend to score around the
average in most subjects.
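The student marks from the original example are not available, so this sketch draws
violins for two subjects with randomly generated marks. The subject names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up marks for two subjects, 200 students each.
rng = np.random.default_rng(2)
marks = [rng.normal(60, 10, 200), rng.normal(70, 8, 200)]

fig, ax = plt.subplots()
# The width of each violin follows the estimated density of the marks;
# showmedians=True adds a line at the median of each group.
parts = ax.violinplot(marks, showmedians=True)

ax.set_xticks([1, 2])
ax.set_xticklabels(["Maths", "Science"])
ax.set_ylabel("Marks")
```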
TwinAxis
TwinAxis helps in visualizing 2 plots with separate y-axes on the same x-axis.
When to use: It should be used when we require 2 plots or grouped data in the same
direction.
There is only 1 x-axis: twinx() creates a second axes that shares it, so the left y-axis
is used for Temp and the right y-axis is used for WindMPH.
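A minimal sketch of twinx(), assuming a few made-up temperature and wind speed readings
over one day:

```python
import matplotlib.pyplot as plt

# Hypothetical readings taken every six hours.
hours = [0, 6, 12, 18, 24]
temp = [55, 60, 75, 70, 58]     # degrees Fahrenheit
wind_mph = [5, 12, 8, 15, 10]   # miles per hour

fig, ax_temp = plt.subplots()
ax_temp.plot(hours, temp, color="tab:red")
ax_temp.set_xlabel("Hour of day")
ax_temp.set_ylabel("Temp", color="tab:red")

# Second axes sharing the same x-axis, with its own y-axis on the right.
ax_wind = ax_temp.twinx()
ax_wind.plot(hours, wind_mph, color="tab:blue")
ax_wind.set_ylabel("WindMPH", color="tab:blue")
```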
Plotting the same data in different units and the same x-axis
A function is defined for converting the data to a different unit, i.e. from Fahrenheit
to Celsius.
Temp in Fahrenheit is plotted against the left y-axis and Temp in Celsius is plotted
against the right y-axis.
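One simple way to sketch this is to plot the Fahrenheit data once and set the limits of
the twin axis to the converted limits of the left axis. The helper name
fahrenheit_to_celsius and the readings are hypothetical, and the original example may
have been implemented differently.

```python
import matplotlib.pyplot as plt

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9

# Hypothetical readings.
hours = [0, 6, 12, 18, 24]
temp_f = [55, 60, 75, 70, 58]

fig, ax_f = plt.subplots()
ax_f.plot(hours, temp_f)
ax_f.set_xlabel("Hour of day")
ax_f.set_ylabel("Temp (Fahrenheit)")

# The right axis shows the same data, labelled in Celsius.
ax_c = ax_f.twinx()
f_min, f_max = ax_f.get_ylim()
ax_c.set_ylim(fahrenheit_to_celsius(f_min), fahrenheit_to_celsius(f_max))
ax_c.set_ylabel("Temp (Celsius)")
```

Because the right-hand limits are computed from the left-hand ones, both scales line up
with the single plotted line.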
Implementation: TwinAxis
Stack Plot
A stack plot visualizes data in stacks and shows the distribution of data over time.
When to use: It is used for showing the area plots of multiple variables in a single plot.
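A short sketch of a stack plot, assuming made-up hours spent on three activities over
five days:

```python
import matplotlib.pyplot as plt

# Hypothetical hours per activity on each day.
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 9, 7]
eating = [2, 3, 3, 2, 2]
working = [8, 7, 9, 6, 8]

fig, ax = plt.subplots()
# Each series is drawn as an area stacked on top of the previous one.
polys = ax.stackplot(
    days, sleeping, eating, working,
    labels=["Sleeping", "Eating", "Working"],
)
ax.set_xlabel("Day")
ax.set_ylabel("Hours")
ax.legend(loc="upper left")
```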
Stem Plot
A stem plot can show negative values, so the difference between consecutive data points
is taken and plotted over time.
When to use: It is similar to a stack plot, but plotting the difference helps in
comparing the data points.
diff() is used to find the difference from the previous data point, and the result is
stored in a copy of the data. The first data point is NaN (Not a Number) as there is no
previous data point for calculating the difference.
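The following sketch assumes the data is held in a pandas Series, since diff() on a
Series produces the leading NaN described above. The values are invented, and the NaN is
dropped before plotting.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical yearly values.
values = pd.Series([10, 12, 9, 15, 14], index=[2015, 2016, 2017, 2018, 2019])

# Difference from the previous data point; the first entry is NaN.
diffs = values.diff()
print(diffs.tolist())  # [nan, 2.0, -3.0, 6.0, -1.0]

# Drop the NaN so only real differences are plotted.
changes = diffs.dropna()

fig, ax = plt.subplots()
# Stems extend above or below the baseline depending on the sign.
ax.stem(changes.index, changes.values)
ax.set_xlabel("Year")
ax.set_ylabel("Change from previous year")
```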
Bar Plot
A bar plot shows the distribution of data over several groups. It is commonly confused
with a histogram, but a histogram only takes numerical data for plotting, whereas a bar
plot compares values across categories. It helps in comparing multiple numeric values.
plt.bar() takes the positions of the bars as its 1st argument (here, the labels in
numeric format) and the values they represent as its 2nd argument.
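A minimal sketch of a bar plot with numeric positions and text tick labels; the
categories and values are made up:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical groups and the value for each group.
labels = ["A", "B", "C"]
values = [5, 7, 3]

# Numeric positions 0, 1, 2 stand in for the labels.
positions = np.arange(len(labels))

fig, ax = plt.subplots()
bars = ax.bar(positions, values)

# Show the group names at the bar positions.
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_ylabel("Value")
```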
Scatter Plot
A scatter plot helps in visualizing 2 numeric variables. It helps in identifying the
relationship between the variables, i.e. correlation or trend patterns. It also helps in
detecting outliers in the plot.
When to use: It is used in Machine Learning concepts like regression, where x and y are
continuous variables. It is also used for visualizing clusters or for outlier detection.
plt.scatter() takes 2 numeric arguments for scattering data points in the plot. It is
similar to a line plot except without the connecting straight lines. By corr we mean
correlation, i.e. how correlated GDP is with life expectancy. As it is positive, life
expectancy increases as the GDP of a country increases.
By taking the log of GDP, we can see that there is a much better correlation, as a line
fits the points better. This converts GDP to a log scale, i.e. log10($1000) = 3.
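The effect of the log transform can be sketched with invented GDP and life expectancy
values. These numbers are constructed so that life expectancy grows roughly with the log
of GDP; they are not real data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented GDP per capita (in dollars) and life expectancy (in years).
gdp = np.array([500, 1000, 2000, 5000, 10000, 20000, 50000], dtype=float)
noise = np.array([0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.1])
life_exp = 40 + 8 * np.log10(gdp) + noise

# Pearson correlation with raw GDP and with log10 of GDP.
corr_raw = np.corrcoef(gdp, life_exp)[0, 1]
corr_log = np.corrcoef(np.log10(gdp), life_exp)[0, 1]
print(corr_raw, corr_log)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.scatter(gdp, life_exp)
ax1.set_xlabel("GDP")
ax1.set_ylabel("Life expectancy")

ax2.scatter(np.log10(gdp), life_exp)
ax2.set_xlabel("log10(GDP)")
ax2.set_ylabel("Life expectancy")

fig.tight_layout()
```

With this constructed data the points fall close to a straight line only on the log
scale, so the second correlation is the higher of the two.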
3D Scatterplot
A 3D scatterplot helps in visualizing 3 numerical variables in a three-dimensional plot.
It is similar to scatter except that we plot 3 numerical variables this time. By looking
at the plot we can infer that as the year and GDP increase, life expectancy increases
too.
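A small sketch of a 3D scatter plot using matplotlib's built-in "3d" projection. The
year, GDP, and life expectancy values are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented values for a single hypothetical country.
years = [1980, 1990, 2000, 2010, 2020]
gdp = [2000, 4000, 8000, 15000, 25000]
life_exp = [62, 66, 70, 74, 77]

fig = plt.figure()
# projection="3d" creates axes with x, y, and z dimensions.
ax = fig.add_subplot(projection="3d")
ax.scatter(years, gdp, life_exp)

ax.set_xlabel("Year")
ax.set_ylabel("GDP")
ax.set_zlabel("Life expectancy")
```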
Conclusion
In summary, we learned how to build data visualization plots using one numeric variable
and multiple variables. We can now easily build plots for understanding our data
intuitively through visualizations.