MGNM801 Ca2 Final
Declaration:
I declare that this Assignment is my individual work. I have not copied it
from any other student’s work or from any other source except where due
acknowledgement is made explicitly in the text, nor has any part been written
for me by any other person.
Evaluator’s comments (For Instructor’s use only)
Use case: Pandas helps uncover hidden patterns and segment your customer
base into distinct groups with shared characteristics. This lets you craft targeted
campaigns, personalize messaging, and deliver experiences that resonate,
ultimately driving engagement and sales.
Example: -
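A minimal sketch of this kind of segmentation, assuming a small made-up orders table. The column names, values, and spend thresholds are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical order data; names and values are illustrative only.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "amount": [20.0, 35.0, 250.0, 15.0, 10.0, 5.0, 120.0],
})

# One row per customer: how often they buy and how much they spend in total.
profile = orders.groupby("customer_id")["amount"].agg(
    order_count="count", total_spend="sum"
)

# Bucket customers by total spend; the thresholds are arbitrary assumptions.
profile["segment"] = pd.cut(
    profile["total_spend"],
    bins=[0, 50, 150, float("inf")],
    labels=["low", "medium", "high"],
)

print(profile)
```

Each segment can then be filtered, for instance with profile[profile["segment"] == "high"], to build a targeted campaign list.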
Use case: Pandas empowers you to dissect the effectiveness of your marketing
campaigns across different channels. Analyse key metrics like impressions,
clicks, and conversion rates to identify top performers, optimize budget
allocation, and maximize ROI.
Example:
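One possible way to compare channels is sketched here on invented campaign figures. The channel names, metrics, and numbers are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical campaign results; every value here is made up.
campaigns = pd.DataFrame({
    "channel": ["email", "email", "social", "search"],
    "impressions": [1000, 3000, 5000, 2000],
    "clicks": [50, 150, 100, 200],
    "conversions": [5, 15, 5, 40],
    "spend": [100.0, 300.0, 250.0, 400.0],
})

# Total the raw counts per channel before computing any ratios.
metric_cols = ["impressions", "clicks", "conversions", "spend"]
by_channel = campaigns.groupby("channel")[metric_cols].sum()

# Derived metrics: click-through rate, conversion rate, cost per conversion.
by_channel["ctr"] = by_channel["clicks"] / by_channel["impressions"]
by_channel["conversion_rate"] = by_channel["conversions"] / by_channel["clicks"]
by_channel["cost_per_conversion"] = by_channel["spend"] / by_channel["conversions"]

# Rank channels from best to worst conversion rate.
ranked = by_channel.sort_values("conversion_rate", ascending=False)
print(ranked)
```

Summing the counts first and dividing afterwards avoids averaging ratios across campaigns of very different sizes.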
Scenario 3: Predicting Customer Churn Before It's Too Late
Use case: Pandas helps you identify customers at risk of churning (leaving your
brand) based on their past behaviour and purchase patterns. This allows you to
take proactive measures like offering personalized incentives or resolving
potential issues, ultimately decreasing customer loss and boosting lifetime
value.
Example:
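A simple rule-based sketch of flagging at-risk customers, assuming that "no purchase in the last 90 days" is a reasonable warning sign. The data, the reference date, and the 90-day threshold are all assumptions. This is a heuristic, not a trained predictive model.

```python
import pandas as pd

# Hypothetical purchase history; dates and IDs are illustrative only.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "purchase_date": pd.to_datetime(
        ["2024-01-05", "2024-03-20", "2023-11-10", "2024-02-01", "2024-03-28"]
    ),
})

# Assumed "today" for the analysis, fixed so the result is reproducible.
as_of = pd.Timestamp("2024-04-01")

# Most recent purchase per customer, then days elapsed since that purchase.
last_purchase = purchases.groupby("customer_id")["purchase_date"].max()
days_inactive = (as_of - last_purchase).dt.days

# Flag customers who have been inactive longer than the assumed threshold.
at_risk = days_inactive[days_inactive > 90]
print(at_risk)
```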
2. Describe the primary data structures in Pandas, namely Series and
DataFrame. Explain the differences and use cases for each.
Ans)
Series:
Structure:
▪ One-dimensional array like a list or column in a spreadsheet.
▪ Holds an array of data values and an associated array of labels,
called an index.
Key characteristics:
▪ Can hold any data type, including numbers, strings, dates, and
Booleans.
▪ Data is homogeneous: all elements share a single dtype (mixed values
fall back to the generic object dtype).
▪ Labelled with an index that can be used for selection and
alignment.
Use cases:
▪ Representing a single column of data in a dataset.
▪ Storing time series data or sequences of values.
▪ Creating simple statistical summaries of data.
DataFrame:
Structure:
▪ Two-dimensional labelled data structure with rows and
columns, resembling a spreadsheet or SQL table.
▪ Can be thought of as a collection of Series objects, each
representing a column.
Key characteristics:
▪ Columns can hold different data types.
▪ Labelled with both row and column indices for flexible
access and manipulation.
Use cases:
▪ Representing tabular datasets with multiple columns and
rows.
▪ Loading and storing data from various file formats (CSV,
Excel, databases).
▪ Performing complex data cleaning, transformation, and
analysis tasks.
Feature          Series                          DataFrame
Dimensionality   One-dimensional                 Two-dimensional
Data types       Homogeneous                     Heterogeneous (different types per column)
Structure        Array of values + index         Collection of Series objects (columns)
Use cases        Single-column data, sequences   Tabular datasets, multiple columns and rows
Example:
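A short sketch contrasting the two structures. The fruit names, prices, and column names are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([9.5, 12.0, 7.25], index=["apple", "mango", "pear"], name="price")

# A DataFrame: several columns sharing one row index, each with its own dtype.
inventory = pd.DataFrame({
    "price": prices,                   # float column, taken from the Series
    "stock": [30, 12, 45],             # integer column
    "organic": [True, False, True],    # Boolean column
})

print(prices["mango"])              # label-based access on a Series
print(inventory.dtypes)             # one dtype per column
print(type(inventory["stock"]))     # selecting a single column yields a Series
```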
Part 2: NumPy
Key reasons for its importance in scientific computing and data analysis:
Performance:
▪ Vectorized operations: NumPy arrays enable you to perform
operations on entire arrays at once, rather than element-by-
element, leading to significant speed gains.
▪ Optimized for numerical computations: NumPy's arrays are
optimized for numerical operations, making them much faster
than Python lists for large datasets.
Mathematical capabilities:
▪ Comprehensive toolkit: NumPy offers a rich set of
mathematical functions for common tasks in scientific
computing, eliminating the need to write custom code for
many operations.
In essence, NumPy's efficient array structures, fast computations, and extensive
mathematical functions make it an indispensable tool for anyone working with
numerical data in Python, especially in the fields of scientific computing, data
analysis, machine learning, and engineering.
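As a small illustration of the points above, the sketch below computes the same result with an explicit Python loop and with a single vectorized expression, then calls two built-in mathematical functions. The input values are arbitrary.

```python
import numpy as np

values = np.arange(1, 6, dtype=np.float64)   # 1.0, 2.0, 3.0, 4.0, 5.0

# Element-by-element version using a plain Python loop over a list.
loop_result = [v * 2 + 1 for v in values.tolist()]

# Vectorized version: the arithmetic is applied to the whole array at once.
vector_result = values * 2 + 1

# Built-in mathematical functions replace hand-written code.
mean_value = np.mean(values)
std_value = np.std(values)

print(loop_result)
print(vector_result)
print(mean_value, std_value)
```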
2. Explain the significance of NumPy in terms of performance and
efficiency when working with large datasets and numerical computations.
Ans)
NumPy (Numerical Python) is a powerful library in the Python programming
language that provides support for large, multi-dimensional arrays and matrices,
along with a collection of mathematical functions to operate on these elements.
It is a fundamental package for scientific computing in Python and is widely
used in various domains such as data science, machine learning, signal
processing, and more. The significance of NumPy, particularly in terms of
performance and efficiency when working with large datasets and numerical
computations, can be explained through several key aspects:
1. Array Representation:
• NumPy introduces the ndarray (N-dimensional array) data
structure, which allows for efficient representation of large
datasets. This array is a contiguous block of memory containing
elements of the same type, enabling fast and memory-efficient
operations.
2. Vectorized Operations:
• NumPy provides a set of highly optimized functions that
operate on entire arrays at once, eliminating the need for
explicit looping in Python. This vectorized approach takes
advantage of low-level optimizations in the underlying C and
Fortran code, resulting in significantly faster computations.
3. Broadcasting:
• NumPy allows for implicit element-wise operations on arrays of
different shapes and sizes through a feature called broadcasting.
This enables more concise and readable code, without the need
to explicitly reshape or replicate arrays.
4. Memory Efficiency:
• NumPy arrays are more memory-efficient compared to Python
lists, especially for large datasets. The array's homogeneous
data type ensures that memory is allocated in a contiguous
block, reducing memory overhead, and allowing for better
cache utilization.
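The aspects listed above can be sketched on small arrays as follows. The shapes and values are arbitrary and chosen only to keep the output easy to check.

```python
import numpy as np

# Array representation: a homogeneous, contiguous block of float64 values.
matrix = np.arange(6, dtype=np.float64).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Vectorized operation: column means computed without an explicit loop.
col_means = matrix.mean(axis=0)                          # shape (3,)

# Broadcasting: the (3,) array is stretched across both rows of the (2, 3) array.
centered = matrix - col_means

# Memory efficiency: a fixed-size element type gives a predictable footprint.
counts = np.zeros(1000, dtype=np.int32)
print(counts.itemsize)                 # bytes per element
print(counts.nbytes)                   # total bytes for the data buffer
print(counts.flags["C_CONTIGUOUS"])    # stored as one contiguous block
print(centered)
```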
2. Provide at least three examples of data visualization scenarios where
Seaborn is the preferred library over Matplotlib. Describe the type of plots
or charts involved and why Seaborn is a better choice.
Ans)
1. Statistical Relationships
Plot Type: lmplot, jointplot, pairplot
Scenario: When exploring relationships between variables or performing
regression analysis, Seaborn's specialized functions make it simpler to
create scatter plots with fitted trend lines and accompanying distribution
plots. lmplot draws a scatter plot with a fitted regression line and its
confidence interval, jointplot pairs a bivariate plot with the marginal
distribution of each variable (optionally as kernel density estimates), and
pairplot shows every pairwise relationship in a dataset at once.
Why Seaborn: Seaborn streamlines the process of creating complex
statistical visualizations by providing convenient high-level functions that
directly handle these tasks, making it easier to visualize relationships in
data without the need for extensive customization.
2. Categorical Data Analysis
Plot Type: catplot, boxplot, violinplot
Scenario: Analysing categorical variables involves visualizing distributions,
relationships, or comparisons across categories. Seaborn's catplot, boxplot,
and violinplot functions offer a concise way to display categorical data
distributions, especially when dealing with multiple categories or
comparing distributions across different groups.
Why Seaborn: Seaborn provides specialized functions specifically designed
for categorical data visualization, offering better aesthetics, flexibility, and
ease of use compared to manually customizing Matplotlib plots for
categorical data analysis.
3. Distribution Visualization
Plot Type: histplot or displot (the replacements for distplot, which has been
deprecated since Seaborn 0.11), kdeplot, rugplot
Scenario: Visualizing distributions of variables is crucial in understanding
the underlying data patterns. These functions allow easy plotting of
univariate histograms, kernel density estimates, and rug plots that mark
individual data points along the distribution axis.
Why Seaborn: Seaborn simplifies the creation of distribution plots by
providing intuitive functions that handle both the creation of the histogram-
like representation and the estimation of the underlying probability density
function (PDF) simultaneously, offering a more streamlined approach
compared to Matplotlib.
Additional advantages of Seaborn:
Aesthetically pleasing defaults: Seaborn's default styles and colour palettes
create visually appealing and informative plots.
Close integration with Pandas: Seaborn works effortlessly with Pandas
DataFrames, making it convenient for data analysis workflows.
Focus on statistical visualization: Seaborn is designed to create informative
statistical graphics, making it a valuable tool for data exploration and
communication.
Unit 6
1. Describe the three key structures in Plotly: Figure, Data, and Layout.
Explain the purpose of each structure in creating visualizations.
Ans)
- The key structures in Plotly and their purposes in creating visualizations:
Figure -
The overall container, which houses all the visualization's components, including
the data and layout.
Serves as a canvas: This is where the visual components are put together and
coordinated.
Crucial to interaction: it makes functions like panning, zooming, and hovering
over data points possible.
Data –
The core content of the visualization: It contains the actual data that you wish
to visualize.
Several traces: Multiple traces (data sets) can be included in a figure, and each
one can be seen as a separate visual entity (e.g., lines, bars, scatter points).
Trace-specific properties: A trace's look can be defined by its own attributes, such
as type, name, mode, marker style, line style, etc.
Layout –
Manages visual presentation: It oversees the visualization's non-data components,
including titles, labels, and annotations.
• Gridlines and axes
• Legend and colour bar
• Margins and spacing
• Colour and style of the background
Working Together:
Figure orchestrates: It combines layout and data to provide the entire
representation.
Data provides content: The visual elements are formed from this raw
material.
Context is created by layout: It sets the general look and feel, provides labels and
annotations, and arranges the visual elements.
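The relationship between the three structures can be sketched with plain Python dictionaries, since a Plotly figure is at heart a nested structure with "data" and "layout" keys. The trace values, titles, and styling below are made-up assumptions. The sketch only builds the structure, and the final comment shows how it could be rendered when Plotly is installed.

```python
import json

# Data: a list of traces; each trace carries its own values and styling.
trace = {
    "type": "scatter",
    "mode": "lines+markers",
    "name": "Monthly sales",          # hypothetical series name
    "x": ["Jan", "Feb", "Mar", "Apr"],
    "y": [120, 135, 128, 150],        # made-up values
    "line": {"color": "royalblue", "width": 2},
}

# Layout: everything that is not data (titles, axes, legend).
layout = {
    "title": {"text": "Illustrative sales trend"},
    "xaxis": {"title": {"text": "Month"}},
    "yaxis": {"title": {"text": "Sales"}},
    "showlegend": True,
}

# Figure: the container that combines data and layout.
figure_spec = {"data": [trace], "layout": layout}

# The whole figure serializes to JSON, the form the browser renderer consumes.
figure_json = json.dumps(figure_spec, indent=2)
print(figure_json)

# With Plotly installed, the same structure could be rendered with:
#   import plotly.graph_objects as go
#   go.Figure(figure_spec).show()
```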
2. Load a sales dataset with columns 'Sales,' create a Plotly line chart to
visualize the total sales trend. Include axis labels, a title, and customize the
appearance.