Datascienece

Data visualization is an important part of analyzing large amounts of collected data. Matplotlib is a Python library that can be used to create various types of visualizations, including histograms, pie charts, line plots, boxplots, and violin plots. These visualizations help identify patterns, relationships, and outliers in data to gain insights. Key steps in creating visualizations with Matplotlib include importing libraries, preparing data, using plotting functions, and customizing aspects like labels, titles, and legends.

Uploaded by

ajus ady

Data Visualization using Matplotlib

Badreesh Shetty


Nov 12, 2018 · 11 min read

Data visualization is an important part of business activities, as organizations nowadays
collect huge amounts of data. Sensors all over the world are collecting climate data,
user data through clicks, car data for steering prediction, and so on. All of this collected
data holds key insights for businesses, and visualizations make these insights easy to
interpret.

Data is only as good as it’s presented.

Why are visualizations important?

Visualizations are one of the easiest ways to analyze and absorb information. Visuals make
a complex problem easier to understand. They help in identifying patterns, relationships,
and outliers in data. They help in understanding business problems better and faster, and
in building a compelling story. Insights gathered from the visuals help in building
strategies for businesses. Visualization is also a precursor to higher-level data analysis
such as Exploratory Data Analysis (EDA) and Machine Learning (ML).

Human beings are visual creatures. Studies suggest that our brain is wired for the visual
and processes information faster when it arrives through the eye.

. . .

“Even if your role does not directly involve the nuts and bolts of data science, it is
useful to know what data visualization can do and how it is realized in the real world.”
- Ramie Jacobson

Data visualization in Python can be done with many packages. Here we discuss the
matplotlib package. It can be used in Python scripts, Jupyter notebooks, and web
application servers.

Matplotlib
Matplotlib is a 2-D plotting library for building figures and visualizations. It emulates
Matlab-like graphs and visualizations. Matlab is not free, is difficult to scale, and is
tedious as a programming language. So matplotlib in Python is used instead, as it is a
robust, free, and easy library for data visualization.
Anatomy of Matplotlib Figure

Anatomy of Matplotlib

The figure is the overall window where plotting happens. Contained within the figure are
the axes, where the actual graphs are plotted. Every axes has an x-axis and a y-axis for
plotting, and contained within the axes are the titles, ticks, and labels associated with
each axis. An essential feature of matplotlib is that we can have more than one axes in a
figure, which helps in building multiple plots. In matplotlib, pyplot is used to create
figures and change the characteristics of figures.
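
The relationship between a figure and its axes can be sketched as follows. This is a minimal illustration with made-up points, assuming only that matplotlib is installed; the variable names are hypothetical.

```python
import matplotlib.pyplot as plt

# One figure holding two axes side by side (1 row, 2 columns).
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Each axes has its own x-axis, y-axis, title, and labels.
ax_left.plot([1, 2, 3], [1, 4, 9])
ax_left.set_title("Left axes")
ax_left.set_xlabel("x")
ax_left.set_ylabel("y")

ax_right.plot([1, 2, 3], [9, 4, 1])
ax_right.set_title("Right axes")
ax_right.set_xlabel("x")

fig.suptitle("One figure, two axes")  # title for the whole figure
```

Calling plt.show() afterwards displays the figure when running outside a notebook.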

Installing Matplotlib
Type !pip install matplotlib in the Jupyter Notebook. If that doesn't work, type
conda install -c conda-forge matplotlib in cmd. This should work in most cases.

Things to follow
Plotting with Matplotlib is quite easy. Generally, every plot follows the same steps.
Matplotlib has a module called pyplot which aids in plotting figures. The Jupyter notebook
is used for running the plots. We import matplotlib.pyplot as plt so that the module can
be called through the short name plt.

Importing required libraries and the dataset to plot using Pandas pd.read_csv()

Extracting important parts for plots using conditions on Pandas Dataframes.

plt.plot() for plotting a line chart; for other plot types, other functions are used in
place of plot. All plotting functions require data, and it is provided to the function
through parameters.

plt.xlabel , plt.ylabel for labeling the x and y-axis respectively.

plt.xticks , plt.yticks for labeling the x and y-axis observation tick points
respectively.

plt.legend() for signifying the observation variables.

plt.title() for setting the title of the plot.

plt.show() for displaying the plot.
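
The steps above can be sketched end to end as follows. The CSV content, column names, and park names are hypothetical stand-ins, read from an in-memory string so that the example is self-contained; a real script would pass a file path to pd.read_csv().

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical CSV content; normally pd.read_csv() receives a file path.
csv_text = """year,park,visitors
2015,ParkA,40
2016,ParkA,45
2017,ParkA,52
2015,ParkB,20
2016,ParkB,22
2017,ParkB,21
"""
df = pd.read_csv(io.StringIO(csv_text))

# Extract the part to plot with a condition on the DataFrame.
park_a = df[df["park"] == "ParkA"]

fig = plt.figure()
plt.plot(park_a["year"], park_a["visitors"], label="ParkA")  # line chart
plt.xlabel("Year")                   # x-axis label
plt.ylabel("Visitors")               # y-axis label
plt.xticks(park_a["year"])           # one tick per observed year
plt.legend()                         # explains what each line represents
plt.title("Visitors per year")
# plt.show() would display the figure when run as a script.
```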

Histogram
A histogram takes in a series of data and divides the data into a number of bins. It then
plots the frequency of data points in each bin (i.e. the interval of points). It is useful
in understanding how many values fall into each range.

When to use: We should use a histogram when we need to see how often the values of a
numeric variable occur in each range.

eg: Number of particular games sold in a store.

From above we can see the histogram of GrandCanyon visitors over the years. plt.hist()
takes the numeric data for the horizontal axis as its first argument, i.e. the GrandCanyon
visitors. bins=10 is used to create 10 bins across the range of GrandCanyon visitor values.

From above, we can see the components that make up a histogram: n holds the count of
values in each bin, i.e. 5, 9, and so on. plt.hist() also returns the bin edges and the
drawn patches.
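
A minimal sketch of the call and its return values is shown here. The visitor numbers are randomly generated stand-ins, not the real GrandCanyon data, and the variable names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up yearly visitor counts (in millions); NOT real park data.
rng = np.random.default_rng(0)
visitors = rng.normal(loc=4.5, scale=0.6, size=100)

fig, ax = plt.subplots()
# n: count per bin, bins: the 11 edges of the 10 bins, patches: the drawn bars
n, bins, patches = ax.hist(visitors, bins=10)
ax.set_xlabel("Visitors (millions)")
ax.set_ylabel("Number of years")
ax.set_title("Histogram with 10 bins")
```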

With the cumulative property, each bin shows the running total of all values up to and
including that bin, which helps us understand how the count builds up from bin to bin.

Range restricts the histogram to the values between specified lower and upper limits,
which helps in understanding the distribution within that interval.


Multiple histograms are useful in comparing the distributions of 2 variables. We can see
that GrandCanyon has comparably more visitors than BryceCanyon.
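
The three variations (cumulative, range, and multiple histograms) might look like this in code. The data is synthetic and only chosen so that one group is larger than the other; the limits passed to range are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the two parks' visitor counts (millions).
rng = np.random.default_rng(0)
grand = rng.normal(4.5, 0.6, 100)
bryce = rng.normal(1.5, 0.4, 100)

fig, (ax_cum, ax_rng, ax_multi) = plt.subplots(1, 3, figsize=(12, 3))

# cumulative=True: each bar is the running total up to that bin.
n_cum, _, _ = ax_cum.hist(grand, bins=10, cumulative=True)
ax_cum.set_title("cumulative=True")

# range=(low, high): only values inside the interval are binned.
n_rng, edges, _ = ax_rng.hist(grand, bins=10, range=(4.0, 5.0))
ax_rng.set_title("range=(4.0, 5.0)")

# A list of datasets draws the histograms side by side.
ax_multi.hist([grand, bryce], bins=10, label=["GrandCanyon", "BryceCanyon"])
ax_multi.legend()
ax_multi.set_title("Multiple histograms")
```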

Implementation: Histogram

Pie Chart
It is a circular plot which is divided into slices to illustrate numerical proportion.
Each slice of a pie chart shows the proportion of one part out of the whole.

When to use: A pie chart should seldom be used, as it is difficult to compare sections
of the chart. A bar plot is used instead, as comparing sections is easy.

eg: Market share in Films.

Note: A pie chart is rarely a good chart for illustrating information.


Above, plt.pie() takes the numeric data as its first argument, i.e. Percentage, and the
labels to display through the labels parameter, i.e. Sector. Ultimately, it shows the
distribution of data in proportion to the pie.

From above we can see the components that make up a pie chart: it returns the wedge
objects, the label texts, and so on.

A pie chart can be easily customized; above, the colors and the label values are
formatted.

From above, explode is used to separate slices out from the pie, similar to a pizza piece
being cut.
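
A small sketch of these options follows. The sector names and percentages are invented for illustration; the names Percentage and Sector from the text are mirrored by the hypothetical variables percentage and sectors.

```python
import matplotlib.pyplot as plt

# Invented market-share figures; they sum to 100.
sectors = ["Sci-Fi", "Drama", "Action", "Comedy"]
percentage = [20, 35, 25, 20]
explode = [0, 0.1, 0, 0]  # pull the second slice slightly out of the pie

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    percentage,
    labels=sectors,
    explode=explode,
    colors=["tab:blue", "tab:orange", "tab:green", "tab:red"],
    autopct="%1.1f%%",  # format of the value printed inside each wedge
)
ax.set_title("Market share in films (made-up numbers)")
```

With autopct set, the call returns three lists: the wedge patches, the label texts, and the percentage texts.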

Implementation: Pie Chart

Time Series by line plot


A time series is drawn as a line plot, which basically connects data points with straight
lines. It is useful in understanding the trend over time. The trend shows how the variable
correlates with time: an upward trend means a positive correlation and a downward trend
means a negative correlation. It is mostly used in forecasting and in monitoring models.

When to use: Time Series should be used when single or multiple variables are to be
plotted over time.

eg: Stock Market Analysis of Companies, Weather Forecasting.

First, convert Date to pandas DateTime for easier plotting of data.


From above, fig.add_axes is used to add an axes to the figure canvas at a given position.
See "What are the differences between add_axes and add_subplot?" to understand axes and
subplots. plt.plot() takes the x values, i.e. Date, as the 1st argument and the numeric
stock data as the 2nd argument. AAPL stock is plotted on ax1, which is the outer axes, and
IBM stock is plotted on ax2, which is the inset.
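
The date conversion and the inset layout could be sketched as below. The prices are made up and the axes positions are arbitrary choices, so treat this as an illustration rather than the original notebook code.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices; NOT real AAPL or IBM quotes.
df = pd.DataFrame({
    "Date": ["2018-01-02", "2018-02-01", "2018-03-01", "2018-04-02", "2018-05-01"],
    "AAPL": [100.0, 104.0, 103.0, 110.0, 115.0],
    "IBM": [50.0, 51.0, 49.5, 52.0, 53.0],
})
df["Date"] = pd.to_datetime(df["Date"])  # strings -> pandas datetimes

fig = plt.figure(figsize=(8, 5))
# add_axes takes [left, bottom, width, height] as fractions of the figure.
ax1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])      # outer axes
ax2 = fig.add_axes([0.15, 0.6, 0.3, 0.25])    # inset axes drawn on top

ax1.plot(df["Date"], df["AAPL"], color="green")
ax1.set_title("AAPL (outer axes)")
ax2.plot(df["Date"], df["IBM"], color="blue")
ax2.set_title("IBM (inset)")
```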
In the earlier figure, add_axes is used to add an axes to a figure, whereas from
above add_subplot adds multiple subplots to a figure. fig.add_subplot(237) cannot be
done, as a grid of 2 rows and 3 columns has only 6 subplot positions.
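
A quick way to see the 6-subplot limit is to fill a 2 x 3 grid and then ask for a seventh position. This is an illustrative sketch; the three-argument form add_subplot(2, 3, 7) is equivalent to add_subplot(237).

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 5))

# Fill all six positions of a grid with 2 rows and 3 columns.
for index in range(1, 7):
    ax = fig.add_subplot(2, 3, index)
    ax.set_title(f"subplot {index}")

# Position 7 does not exist in a 2 x 3 grid, so matplotlib raises ValueError.
error_message = None
try:
    fig.add_subplot(2, 3, 7)
except ValueError as err:
    error_message = str(err)
```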

We can see that the tech company stocks are following an upward trend showing positive
results for traders to invest in stocks.

Implementation: Time Series

Boxplot and Violinplot

Boxplot
Boxplot gives a nice summary of the data. It helps in understanding our distribution
better.

When to use: It should be used when we need the overall statistical information on the
distribution of the data. It can be used to detect outliers in the data.

eg: Credit Score of Customer. We can get the max, min, and much more information
about the score.

Understanding Boxplot

Source: How to Read and Use a Box-and-Whisker Plot

From the above diagram, the line that divides the box into 2 parts represents the median
of the data. The end of the box shows the upper quartile (75%) and the start of the box
represents the lower quartile (25%). The upper quartile is also called the 3rd quartile
and, similarly, the lower quartile is also called the 1st quartile. The region between the
lower quartile and the upper quartile is called the Inter Quartile Range (IQR), and it
covers the middle 50% of the data (75 - 25 = 50%). The maximum is the highest value in the
data and the minimum is the lowest value; they are marked by the caps. The lines that
extend from the box to the minimum and maximum are called whiskers; they show the range of
values in the data. The extreme points beyond them are outliers. A commonly used rule is
that a value is an outlier if it is less than lower quartile - 1.5 * IQR or higher than
upper quartile + 1.5 * IQR.
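
The 1.5 * IQR rule can be worked through numerically. The scores below are invented, and np.percentile with its default linear interpolation is one of several common ways to compute quartiles.

```python
import numpy as np

# Invented scores with one obviously extreme value.
scores = np.array([52, 55, 58, 60, 61, 63, 65, 66, 70, 98])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1                      # spread of the middle 50% of the data

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(q1, median, q3, iqr)         # 58.5 62.0 65.75 7.25
print(outliers)                    # [98]
```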

bp contains the boxplot components like boxes, whiskers, medians, and caps. Seaborn,
another plotting library, makes it easier to build custom plots than matplotlib.
patch_artist draws the boxes as filled patches, which makes customization such as fill
colors possible. notch makes the median look more prominent.
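
A hedged sketch of such a customized boxplot follows. The marks are randomly generated, and the subject names and colors are arbitrary choices made for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated marks for three hypothetical subjects.
rng = np.random.default_rng(1)
marks = [rng.normal(60, 10, 50), rng.normal(70, 8, 50), rng.normal(65, 12, 50)]

fig, ax = plt.subplots()
# patch_artist=True gives fillable boxes; notch=True narrows the box at the median.
bp = ax.boxplot(marks, patch_artist=True, notch=True)
ax.set_xticklabels(["Math", "Physics", "Chemistry"])

# bp is a dict of artists; color each box individually.
for box, color in zip(bp["boxes"], ["lightblue", "lightgreen", "salmon"]):
    box.set_facecolor(color)
for median_line in bp["medians"]:
    median_line.set_color("black")
```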

A caveat of using a boxplot is that it does not show the number of observations behind
each value. A jitter plot in Seaborn can overcome this caveat, and a violin plot is also
useful.

Violin plot
A violin plot is often a more informative chart than a boxplot, as it gives a much broader
understanding of the distribution. It resembles a violin, and its wider areas indicate
where the data is more densely distributed, which is otherwise hidden by box plots.

When to use: It is an extension of the boxplot. It should be used when we need a more
intuitive understanding of the data.

The density of points in the middle seems more as students tend to score around average
mostly in the subjects.
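
A violin plot of the same kind of data can be sketched with ax.violinplot. The marks are synthetic and the subject names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic marks clustered around each subject's average.
rng = np.random.default_rng(2)
marks = [rng.normal(60, 10, 200), rng.normal(70, 8, 200), rng.normal(65, 12, 200)]

fig, ax = plt.subplots()
# The width of each violin follows the estimated density of the data.
parts = ax.violinplot(marks, showmedians=True)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["Math", "Physics", "Chemistry"])
ax.set_ylabel("Marks")
```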

Implementation: Boxplot & Violinplot

TwinAxis
TwinAxis helps in drawing 2 plots that share the same x-axis but have separate y-axes.

When to use: It should be used when we require 2 plots or grouped data along the same
direction.

Eg: Population, GDP data in the same x-axis (Date).

Plotting 2 Plots w.r.t the y-axis and same x-axis


Extracting important details, i.e. Date for the x-axis, and TempAvgF and WindAvgMPH for the
two different y-axes.

As we can see, there is only 1 axes, so twinx() is used to create a twin that shares the
x-axis. The left y-axis is used for Temp and the right y-axis is used for WindMPH.
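
In code, the twin axes setup might look like this. The temperature and wind readings are invented; only the column names TempAvgF and WindAvgMPH come from the text.

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented daily readings for one week.
dates = np.arange("2018-01-01", "2018-01-08", dtype="datetime64[D]")
temp_avg_f = np.array([50, 54, 59, 63, 61, 57, 52])
wind_avg_mph = np.array([8, 6, 5, 9, 12, 7, 4])

fig, ax_temp = plt.subplots()
ax_wind = ax_temp.twinx()  # second axes sharing the same x-axis

ax_temp.plot(dates, temp_avg_f, color="tab:red")
ax_temp.set_ylabel("TempAvgF", color="tab:red")      # left y-axis

ax_wind.plot(dates, wind_avg_mph, color="tab:blue")
ax_wind.set_ylabel("WindAvgMPH", color="tab:blue")   # right y-axis
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```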

Plotting the same data in different units and the same x-axis
A function is defined for converting the data to a different unit, i.e. from Fahrenheit
to Celsius.

We can see that Temp in Fahrenheit is plotted against the left y-axis and Temp in Celsius
against the right y-axis.
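
One way to show the same data in two units is to convert the limits of the left axis and apply them to the twin. This is a sketch under that assumption; the helper name fahrenheit_to_celsius and the readings are hypothetical.

```python
import matplotlib.pyplot as plt


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


# Invented readings.
days = [1, 2, 3, 4, 5, 6, 7]
temp_f = [50.0, 54.5, 59.0, 63.5, 61.0, 57.2, 52.7]

fig, ax_f = plt.subplots()
ax_f.plot(days, temp_f)
ax_f.set_ylabel("Temp (°F)")

# The twin shows the same vertical range, expressed in Celsius.
ax_c = ax_f.twinx()
f_low, f_high = ax_f.get_ylim()
ax_c.set_ylim(fahrenheit_to_celsius(f_low), fahrenheit_to_celsius(f_high))
ax_c.set_ylabel("Temp (°C)")
```

Because the conversion is linear, matching the axis limits is enough for the single line to read correctly on both scales.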

Implementation: TwinAxis

Stack Plot and Stem Plot

Stack Plot
Stack plot visualizes data in stacks and shows the distribution of data over time.

When to use: It is used for showing the area plots of multiple variables in a single plot.

Eg: It is useful in understanding the change of distribution in multiple variables over an
interval.

As a stack plot requires stacked data, the stacking is done using np.vstack()

plt.stackplot takes the x values, i.e. year, as the 1st argument and the vertically
stacked data, i.e. the Nationalparks, as the 2nd argument.
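
A minimal sketch of stacking and plotting, using made-up visitor counts for three hypothetical parks:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up visitor counts (millions) per year for three parks.
years = np.array([2014, 2015, 2016, 2017, 2018])
park_a = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
park_b = np.array([1.0, 1.2, 1.5, 1.7, 2.0])
park_c = np.array([0.8, 0.9, 0.9, 1.0, 1.1])

# np.vstack puts one park per row: shape (3 parks, 5 years).
stacked = np.vstack([park_a, park_b, park_c])

fig, ax = plt.subplots()
polys = ax.stackplot(years, stacked, labels=["Park A", "Park B", "Park C"])
ax.set_xticks(years)
ax.legend(loc="upper left")
```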

Percentage Stacked plot


Similar to the stack plot, but each value is converted into the percentage of the total
that it holds.

data_prec holds each value as a share of its row total.
s= np_data.sum(axis=1) sums across the columns, giving one total per row, and
np_data.divide(s,axis=0) divides every row by its total.
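
The row-wise normalization can be sketched as follows. The DataFrame contents are invented, the third column name is hypothetical, and multiplying by 100 to get percentages is an assumption about how the shares are scaled.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented visitor counts; rows are years, columns are parks.
np_data = pd.DataFrame(
    {
        "GrandCanyon": [4.0, 4.5, 5.0, 6.0],
        "BryceCanyon": [1.0, 1.5, 2.0, 2.0],
        "OtherPark": [3.0, 4.0, 3.0, 2.0],
    },
    index=[2015, 2016, 2017, 2018],
)

s = np_data.sum(axis=1)                        # one total per row (year)
data_prec = np_data.divide(s, axis=0) * 100    # each row now sums to 100

fig, ax = plt.subplots()
ax.stackplot(
    data_prec.index.to_numpy(),
    data_prec.to_numpy().T,                    # one row per park for stackplot
    labels=list(data_prec.columns),
)
ax.set_ylabel("Share of visitors (%)")
ax.legend(loc="lower left")
```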

Stem Plot
A stem plot can also show negative values, so the difference between consecutive data
points is taken and plotted over time.

When to use: It is similar to a stack plot, but plotting the difference helps in comparing
the data points.

diff() is used to find the difference from the previous data point, and the result is
stored in another copy of the data. The first data point is NaN (Not a Number) as it has
no previous data point for calculating the difference.

Subplots numbered 31n (311, 312, 313) are created to accommodate 3 rows and 1 column of
subplots in the figure.

plt.stem() takes the x values, i.e. year, as the 1st argument and the numeric data of the
National Park visitors as the 2nd argument.
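
Putting diff(), the 31n subplots, and stem() together might look like the sketch below. The visitor counts and the third park name are made up, and the NaN first row is skipped before plotting.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up visitor counts (hundreds of thousands) per year.
visitors = pd.DataFrame(
    {
        "GrandCanyon": [45, 48, 55, 60],
        "BryceCanyon": [15, 17, 16, 20],
        "OtherPark": [9, 8, 10, 10],
    },
    index=[2015, 2016, 2017, 2018],
)

change = visitors.diff()  # difference from the previous year; first row is NaN

fig = plt.figure(figsize=(6, 8))
for position, park in enumerate(change.columns, start=1):
    ax = fig.add_subplot(3, 1, position)       # same as 311, 312, 313
    # Skip the NaN first row; stems can point below zero for decreases.
    ax.stem(change.index[1:].to_numpy(), change[park].iloc[1:].to_numpy())
    ax.set_title(park)
fig.tight_layout()
```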
Implementation: Stack Plot & Stem Plot

Bar Plot
Bar Plot shows the distribution of data over several groups. It is commonly confused with
a histogram, but a histogram only takes numerical data for plotting, whereas a bar plot
compares numeric values across categories.

When to use: It is used when we need to compare several groups.

Eg: Student marks in an exam.

plt.bar() takes the positions of the bars, i.e. the labels in numeric format, as the 1st
argument and the value each bar represents as the 2nd argument.
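
A short sketch with invented student names and marks shows the numeric positions and how the text labels are attached afterwards:

```python
import matplotlib.pyplot as plt

# Invented names and marks.
students = ["Asha", "Ben", "Chen", "Dara"]
marks = [72, 85, 64, 90]
positions = list(range(len(students)))  # numeric x position for each bar

fig, ax = plt.subplots()
bars = ax.bar(positions, marks)
ax.set_xticks(positions)
ax.set_xticklabels(students)            # replace the numbers with names
ax.set_ylabel("Marks")
ax.set_title("Student marks in an exam")
```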

Implementation: Bar Plot

Scatter Plot
Scatter plot helps in visualizing 2 numeric variables. It helps in identifying the
relationship of the data with each variable i.e correlation or trend patterns. It also helps
in detecting outliers in the plot.

When to use: It is used in Machine learning concepts like regression, where x and y are
continuous variables. It is also used in clustering or outlier detection.

plt.scatter() takes 2 numeric arguments for scattering data points in the plot. It is
similar to a line plot, except without the connecting straight lines. By corr we mean
correlation, i.e. how strongly GDP is correlated with life expectancy. As it is positive,
it means that as the GDP of a country increases, life expectancy increases too.

By taking the log of GDP, we can see that there is a much better correlation, as the
points fit a line better. It converts GDP to a log scale, i.e. log($1000)=3.
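
The effect of the log transform on the correlation can be sketched with synthetic data. The GDP and life expectancy values below are generated so that life expectancy is roughly linear in log10(GDP); they are not real country data, and np.corrcoef is used here as a stand-in for whatever corr function the original code calls.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: life expectancy is linear in log10(GDP) plus noise.
rng = np.random.default_rng(42)
gdp = 10 ** rng.uniform(2.5, 5.0, 80)
life_exp = 20 * np.log10(gdp) - 10 + rng.normal(0, 3, 80)

# Pearson correlation before and after the log transform.
corr_raw = np.corrcoef(gdp, life_exp)[0, 1]
corr_log = np.corrcoef(np.log10(gdp), life_exp)[0, 1]

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_raw.scatter(gdp, life_exp)
ax_raw.set_xlabel("GDP")
ax_raw.set_title(f"corr = {corr_raw:.2f}")

ax_log.scatter(np.log10(gdp), life_exp)   # log10(1000) = 3
ax_log.set_xlabel("log10(GDP)")
ax_log.set_title(f"corr = {corr_log:.2f}")
```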

3D Scatterplot
3D Scatterplot helps in visualizing 3 numerical variables in a three-dimensional plot.

It is similar to scatter except we add 3 numerical variables this time. By looking at the
plot we can make an inference that as the year and GDP increases, life expectancy too
increases.
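
A 3D scatter plot can be sketched by requesting a 3d projection when adding the axes. The data is synthetic, and the example assumes a Matplotlib version (3.2 or later) in which the 3d projection is available without an extra import.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values in which life expectancy rises with year and GDP.
rng = np.random.default_rng(7)
year = rng.integers(1960, 2011, 60)
log_gdp = rng.uniform(2.5, 5.0, 60)
life_exp = 0.2 * (year - 1960) + 10 * log_gdp + rng.normal(0, 2, 60)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # axes with x, y, and z
ax.scatter(year, log_gdp, life_exp)
ax.set_xlabel("Year")
ax.set_ylabel("log10(GDP)")
ax.set_zlabel("Life expectancy")
```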

Implementation: Scatter Plot

Find the above code in this Github Repo.

Conclusion
In summary, we learned how to build data visualization plots using one numeric variable
and multiple variables. We can now easily build plots for understanding our data
intuitively through visualizations.
