Big Mart Sales Analysis DOCUMENT
ABSTRACT
In the modern era of rapid advancement, every company and enterprise works to
meet customer demand and to manage its inventory. The models they use help
them predict future demand by learning patterns from past sales records.
Retailers are increasingly abandoning traditional prediction models for sales
forecasting because those models take a prolonged amount of time to produce the
expected results. Instead, retailers now keep track of their sales records in the
form of a dataset comprising attributes such as price tag, outlet type, outlet
location, item visibility, and item outlet sales.
1. INTRODUCTION
Customer satisfaction and keeping up with the demand for products are vital for
any store that wants to survive in the market and compete with other stores, and
both can only be achieved with a forecast of future demand to plan a flourishing
business. With a growing population, the number of stores and shopping malls is
also increasing, creating competition between different enterprises for bigger
sales and greater popularity. Along with grocery shops and stores, larger
enterprises also need analysis to uncover patterns and predict future sales. Many
businesses and enterprises keep track of statistical data about their products in
order to analyze future demand in the market. The stored statistical data consists
of the number of items sold, their categories, and other attribute details. It
reveals trends and patterns that help the organization manage supply, grow the
business, and improve sales strategies; for example, it can be used seasonally to
create short-term discount offers that attract more customers to the brand. The
historical data is also important because it supports the management of logistics,
inventory, transport services, and so on. To carry out these tasks we use machine
learning algorithms such as random forest, linear regression, decision trees, and
the XGBoost regressor. The aim of our project is to build a machine learning
model that can predict the sales of a product and understand its patterns and
trends, which is an important part of a big mart's management.
The operating system is one of the first requirements mentioned when defining
system (software) requirements. Software may not be compatible with different
versions of the same line of operating systems, although some measure of
backward compatibility is often maintained. For example, most software designed
for Microsoft Windows XP does not run on Microsoft Windows 98, although the
converse is not always true. Similarly, software designed using newer features of
Linux kernel v2.6 generally does not run or compile properly (or at all) on Linux
distributions using kernel v2.2 or v2.4.
APIs and drivers – Software making extensive use of special hardware devices,
like high-end display adapters, needs a special API or newer device drivers. A
good example is DirectX, a collection of APIs for handling tasks related to
multimedia, especially game programming, on Microsoft platforms.
Web browser – Most web applications and software that depend heavily on
Internet technologies make use of the default browser installed on the system.
Microsoft Internet Explorer is a frequent choice for software running on Microsoft
Windows, which makes use of ActiveX controls, despite their vulnerabilities.
Memory – All software, when run, resides in the random access memory (RAM)
of a computer. Memory requirements are defined after considering demands of the
application, operating system, supporting software and files, and other running
processes. Optimal performance of other unrelated software running on a
multi-tasking computer system is also considered when defining this requirement.
4) Hard Disk: 50 GB
2. FEASIBILITY STUDY
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
This study is carried out to check the economic impact the system will have on
the organization. The amount of funds the company can pour into the research
and development of the system is limited, so the expenditures must be justified.
The developed system is well within the budget, which was achieved because
most of the technologies used are freely available; only the customized products
had to be purchased.
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand
on the available technical resources, as this would in turn place high demands on
the client. The developed system therefore has modest requirements, and only
minimal or no changes are required for implementing it.
This aspect of the study checks the level of acceptance of the system by the user,
which includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance by the users depends on the methods employed
to educate them about the system and to make them familiar with it. Their
confidence must be raised so that they are also able to offer constructive
criticism, which is welcomed, since they are the final users of the system.
3. LITERATURE SURVEY
https://ijcrt.org/papers/IJCRT2106802.pdf
ABSTRACT: Nowadays shopping malls and big marts keep track of the sales
data of each individual item in order to predict future customer demand and
update inventory management. These data stores basically contain a large
volume of customer data and individual item attributes in a data warehouse.
Anomalies and frequent patterns are then detected by mining this data from the
warehouse, and the resulting data can be used to predict future sales volume with
the help of different machine learning techniques for retailers such as Big Mart.
In this paper, we propose a predictive model using the XGBoost regressor
technique for predicting the sales of a company like Big Mart, and we found that
the model produces better performance compared to existing models.
https://www.irjmets.com/uploadedfiles/paper/volume_3/issue_9_september_2021/16025/final/fin_irjmets1630829142.pdf
https://www.ijedr.org/papers/IJEDR1804010.pdf
ABSTRACT: In today's world, big malls and marts record the sales data of
individual items for predicting future demand and managing inventory. This data
store holds a large number of item attributes together with individual customer
data in a data warehouse. The data is mined to detect frequent patterns as well as
anomalies, and it can be used to forecast future sales volume with the help of
random forests and a multiple linear regression model.
3.4 Grid Search Optimization (GSO) Based Future Sales Prediction for Big
Mart:
https://ieeexplore.ieee.org/abstract/document/9067927
ABSTRACT: In the retail domain, predicting sales before the actual sales occur
plays a vital role for any retail company, like Big Mart or a mall, in maintaining a
successful business. Traditional forecasting models, such as statistical models, are
commonly used for future sales prediction, but these techniques take much more
time to estimate sales and are not capable of handling non-linear data. Therefore,
machine learning (ML) techniques are employed to handle both non-linear and
linear data. ML techniques can also efficiently process large volumes of data,
such as the Big Mart dataset, which contains a large number of customer records
and individual item attributes. A retail company wants a model that can predict
sales accurately, so that it can keep track of customers' future demand and update
the sales inventory in advance. In this work, we propose a Grid Search
Optimization (GSO) technique to optimize the parameters and select the best
hyperparameters, further ensembled with XGBoost techniques for forecasting the
future sales of a retail company such as Big Mart, and we found that our model
produces better results.
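To make the grid-search idea above concrete, the following is a minimal sketch, not the paper's implementation: it tunes an XGBoost regressor over a small hyperparameter grid with scikit-learn's GridSearchCV, and synthetic regression data stands in for the preprocessed Big Mart features.

# Hedged sketch of grid search over XGBoost hyperparameters.
# Synthetic data stands in for the preprocessed Big Mart features.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Small illustrative grid; a real study would search a wider space
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
}

# 5-fold cross-validated search scored by negative RMSE
search = GridSearchCV(XGBRegressor(objective='reg:squarederror'),
                      param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=5)
search.fit(X, y)
print(search.best_params_)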
https://ieeexplore.ieee.org/document/8675060
4. SYSTEM ANALYSIS
4.1 EXISTING SYSTEM:
Rohit Sav, Pratiksha Shinde, and Saurabh Gaikwad in [1] implemented predictive
models to estimate Big Mart sales. They first cleaned the gathered data and then
applied the XGBoost algorithm. It was observed that the XGBoost regressor
showed the highest accuracy when compared with the other algorithms, which led
them to conclude that XGBoost should be used for predicting Big Mart sales.
Inedi Theresa, Dr. Venkata Reddy Medikonda, and K.V. Narasimha Reddy in [2]
discuss sales prediction using the methodology of exploratory machine learning.
They carried out the whole process through well-defined steps that included data
collection, hypothesis generation to understand the data, and further cleaning and
processing of the data. Models such as linear regression, decision tree regression,
ridge regression, and random forest were used to predict the sales outcome. They
concluded that implementing multiple models led to better predictions compared
to a single-model prediction technique.
4.2 PROPOSED SYSTEM:
The aim of our project is to build a machine learning model which can predict the
sales of a product and understand its patterns and trends, which is an important
part of a big mart's management.
1. More accurate.
2. Data Preprocessing
4. Modeling
5. Predicting
The non-functional requirements of the system include:
Usability requirement
Serviceability requirement
Manageability requirement
Recoverability requirement
Security requirement
Data Integrity requirement
Capacity requirement
Availability requirement
Scalability requirement
Interoperability requirement
Reliability requirement
Maintainability requirement
Regulatory requirement
Environmental requirement
5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE:
[System architecture diagram: XGBoost]
1. The DFD is also called a bubble chart. It is a simple graphical formalism that
can be used to represent a system in terms of the input data to the system, the
various processing carried out on this data, and the output data generated by
the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It
is used to model the system components: the system processes, the data used
by the processes, the external entities that interact with the system, and the
information flows in the system.
3. The DFD shows how information moves through the system and how it is
modified by a series of transformations. It is a graphical technique that
depicts information flow and the transformations applied as data moves from
input to output.
4. A DFD may be used to represent a system at any level of abstraction and
may be partitioned into levels that represent increasing information flow and
functional detail.
[Data flow diagram: import libraries, verify (yes/no), data processing, data visualization, user input, final outcome, end process]
GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so
that they can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations,
frameworks, patterns, and components.
7. Integrate best practices.
Use case diagram:
[Use case diagram: data processing, user input, final outcome]
Class diagram:
The class diagram is used to refine the use case diagram and define a detailed design of
the system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a" or
"has-a" relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.
[Class diagram: methods build(), train(), signup(), signin(), choose(), predict(); user input and final outcome]
Activity diagram:
The process flows in the system are captured in the activity diagram. Similar to a state
diagram, an activity diagram also consists of activities, actions, transitions, initial and final
states, and guard conditions.
Fig.5.2.3 Activity diagram (importing libraries → data preprocessing → data visualization → user input → final outcome)
Sequence diagram:
A sequence diagram represents the interaction between different objects in the system. The
important aspect of a sequence diagram is that it is time-ordered. This means that the exact
sequence of the interactions between the objects is represented step by step. Different objects in
the sequence diagram interact with each other by passing "messages".
[Sequence diagram: the User and the System exchange messages for data processing, algorithm generation, user input, and prediction]
Collaboration diagram:
A collaboration diagram groups together the interactions between different objects. The
interactions are listed as numbered interactions that help to trace the sequence of the interactions.
The collaboration diagram helps to identify all the possible interactions that each object has with
other objects.
Component diagram:
The component diagram represents the high-level parts that make up the system. This
diagram depicts, at a high level, what components form part of the system and how they are
interrelated. A component diagram depicts the components culled after the system has undergone
the development or construction phase.
[Component diagram: importing libraries, exploring the dataset, data processing, splitting the dataset into train & test, building the model, training the model, user signup & signin, user input, prediction]
Deployment diagram:
The deployment diagram captures the configuration of the runtime elements of the
application. This diagram is by far the most useful when a system is built and ready to be deployed.
[Deployment diagram: User and System nodes]
MODULES:
Data exploration: using this module we load the data into the system.
Splitting data into train & test: using this module the data is divided into train
and test sets (see the sketch after this list).
User signup & login: using this module users register and log in.
User input: using this module the user supplies the input for prediction.
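As a hedged illustration of the splitting module, the sketch below uses scikit-learn's train_test_split; the 80/20 ratio, the random_state, and the reliance on the Big Mart train.csv with its Item_Outlet_Sales target are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the training records and separate features from the sales target
data = pd.read_csv("train.csv")
X = data.drop('Item_Outlet_Sales', axis=1)   # feature columns
y = data['Item_Outlet_Sales']                # target: sales to predict

# Assumed 80/20 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)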
ALGORITHMS:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the Big Mart train and test datasets
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Combine test and train into one frame so preprocessing is applied uniformly
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
print(train.shape, test.shape, data.shape)

# Check missing values per column
data.apply(lambda x: sum(x.isnull()))

# Fill missing values before encoding (LabelEncoder cannot handle NaN)
data['Item_Weight'].fillna(data['Item_Weight'].mean(), inplace=True)
data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0], inplace=True)

# Item type combine: derive a broad category from the identifier prefix
data['Item_Identifier'].value_counts()
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD': 'Food',
                                                             'NC': 'Non-Consumable',
                                                             'DR': 'Drinks'})
data['Item_Type_Combined'].value_counts()

# Import library for label encoding of categorical variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# New numeric variable for the outlet identifier
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size',
           'Item_Type_Combined', 'Outlet_Type', 'Outlet']
for i in var_mod:
    data[i] = le.fit_transform(data[i])

import warnings
warnings.filterwarnings('ignore')

# Drop the columns which have been converted to different types
data.drop(['Item_Type', 'Outlet_Establishment_Year'], axis=1, inplace=True)
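Continuing from the preprocessing above, a minimal sketch of building and evaluating the XGBoost regressor named throughout this document could look as follows; the dropped identifier columns, the split ratio, and the hyperparameters are illustrative assumptions, not values fixed by this document.

# Illustrative continuation: train and evaluate an XGBoost regressor on
# the preprocessed frame built above; hyperparameters are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Recover the training portion and separate the features from the target
train_df = data[data['source'] == 'train'].drop('source', axis=1)
X = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier',
                   'Outlet_Identifier'], axis=1)
y = train_df['Item_Outlet_Sales']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=5)
model.fit(X_train, y_train)

# Root mean squared error on the held-out validation set
preds = model.predict(X_val)
print("Validation RMSE:", np.sqrt(mean_squared_error(y_val, preds)))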
PYTHON LANGUAGE:
Python is a dynamic, high-level, free, open-source, interpreted programming
language. It supports object-oriented as well as procedural programming. In
Python, we do not need to declare the type of a variable because it is a
dynamically typed language. For example, in x = 10, x can hold anything, such as
a string, an int, and so on.
Features in Python:
There are many features in Python, some of which are discussed below as
follows:
1. Free and Open Source
The Python language is freely available on the official website. Since it is open
source, the source code is also available to the public, so you can download
Python, use it, and share it.
2. Easy to code
3. Easy to Read
Learning Python is quite simple, and as already noted, its syntax is really
straightforward. Code blocks are defined by indentation rather than by
semicolons or brackets.
4. Object-Oriented Language
5. GUI Programming Support
Graphical user interfaces can be made using modules such as PyQt5, PyQt4,
wxPython, or Tk in Python. PyQt5 is the most popular option for creating
graphical apps with Python.
6. High-Level Language
7. Extensible Feature
Python is an extensible language: parts of a program can be written in C or C++,
compiled, and then used from Python.
8. Easy to Debug
Python provides excellent information for tracing mistakes. Once you understand
how to interpret Python's error traces, you will be able to quickly identify and
correct most of your program's issues, often just by glancing at the code to see
what it is designed to do.
Python has a large standard library that provides a rich set of modules and
functions, so you do not have to write your own code for every single task. There
are many libraries in Python for regular expressions, unit testing, web browsers,
and so on.
Python is a dynamically typed language: the type of a variable (for example int,
double, or long) is decided at run time, not in advance. Because of this feature,
we do not need to specify the type of a variable.
With the new PyScript project, you can write and run Python code in HTML with
the help of simple tags such as <py-script> and <py-env>, which helps you do
frontend development work in Python as you would with JavaScript. The backend
remains Python's strong forte, and it is used extensively there because of
frameworks like Django and Flask.
In Python, the variable's data type does not need to be specified. Memory is
automatically allocated to a variable at runtime when it is given a value:
developers do not need to write int y = 18 to store the integer value 18 in y; they
may just type y = 18.
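The same point in two lines of code:

# Dynamic typing: no declaration needed, and a name can be rebound
# to a value of a different type at runtime.
y = 18
print(type(y))   # <class 'int'>
y = "eighteen"
print(type(y))   # <class 'str'>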
LIBRARIES/PACKAGES:
Tensorflow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used
for machine learning applications such as neural networks. It is used for both research and
production at Google.
TensorFlow was developed by the Google Brain team for internal Google use. It was released
under the Apache 2.0 open-source license on November 9, 2015.
Numpy
It is the fundamental package for scientific computing with Python. It contains
various features, including these important ones: a powerful N-dimensional array
object; sophisticated (broadcasting) functions; tools for integrating C/C++ and
Fortran code; and useful linear algebra, Fourier transform, and random number
capabilities.
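A small illustration of the N-dimensional array object and its vectorized operations:

import numpy as np

# A 2-D array; operations apply element-wise without explicit loops
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)    # (2, 3)
print(a.mean())   # 3.5
print(a * 2)      # every element doubled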
Pandas
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. Python had
previously been used mostly for data munging and preparation, contributing little
to data analysis itself; Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of
its origin: load, prepare, manipulate, model, and analyze. Python with Pandas is
used in a wide range of academic and commercial domains, including finance,
economics, statistics, and analytics.
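A short sketch of that load/prepare/analyze workflow, assuming the Big Mart train.csv with its Item_Weight, Outlet_Type, and Item_Outlet_Sales columns:

import pandas as pd

# Load: read the sales records into a DataFrame
df = pd.read_csv("train.csv")

# Prepare: count missing values and fill a numeric gap with the mean
print(df.isnull().sum())
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())

# Analyze: average sales per outlet type
print(df.groupby('Outlet_Type')['Item_Outlet_Sales'].mean())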
Matplotlib
For simple plotting, the pyplot module provides a MATLAB-like interface,
particularly when combined with IPython. For the power user, there is full
control of line styles, font properties, axes properties, etc., via an object-oriented
interface or via a set of functions familiar to MATLAB users.
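For example, a histogram of the sales target takes only a few pyplot calls (the column name is assumed from the Big Mart dataset):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Distribution of the target variable
plt.hist(df['Item_Outlet_Sales'], bins=50)
plt.xlabel('Item outlet sales')
plt.ylabel('Frequency')
plt.title('Distribution of Big Mart sales')
plt.show()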
Scikit – learn
Scikit-learn is a free, open-source machine learning library for Python, providing
simple and efficient tools for classification, regression, clustering, and
preprocessing; in this project it supplies, for example, the LabelEncoder used
above.
SYSTEM TESTING
System testing examines every component of an application to make sure that
they work as a complete and unified whole. A QA team typically conducts system
testing after it checks individual modules with functional or user-story testing
and then each component through integration testing.
If a software build achieves the desired results in system testing, it gets a final
check via acceptance testing before it goes to production, where users consume
the software. An app-dev team logs all defects and establishes what kinds and
amounts of defects are tolerable.
Structural Testing:
It is not possible to effectively test software without running it. Structural testing,
also known as white-box testing, is required to detect and fix bugs and errors
emerging during the pre-production stage of the software development process. At
this stage, unit testing based on the software structure is performed using
regression testing. In most cases, it is an automated process working within the test
automation framework to speed up the development process at this stage.
Developers and QA engineers have full access to the software's structure and data
flows (data-flow testing), so they can track any changes (mutation testing) in the
system's behavior by comparing the tests' outcomes with the results of previous
iterations (control-flow testing).
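As a hedged sketch of such automated structural tests, a unit test for a hypothetical preprocessing helper might look like this; map_item_category is an illustration, not a function defined elsewhere in this project:

import unittest

def map_item_category(item_identifier: str) -> str:
    """Hypothetical helper: map an identifier prefix to a broad category."""
    prefix = item_identifier[0:2]
    return {'FD': 'Food', 'NC': 'Non-Consumable', 'DR': 'Drinks'}.get(
        prefix, 'Unknown')

class TestMapItemCategory(unittest.TestCase):
    def test_known_prefixes(self):
        self.assertEqual(map_item_category('FDA15'), 'Food')
        self.assertEqual(map_item_category('DRC01'), 'Drinks')
        self.assertEqual(map_item_category('NCD19'), 'Non-Consumable')

    def test_unknown_prefix(self):
        self.assertEqual(map_item_category('XYZ99'), 'Unknown')

if __name__ == '__main__':
    unittest.main()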
Behavioral Testing:
The final stage of testing focuses on the software’s reactions to various activities
rather than on the mechanisms behind these reactions. In other words, behavioral
testing, also known as black-box testing, presupposes running numerous tests,
mostly manual, to see the product from the user’s point of view. QA engineers
usually have some specific information about a business or other purposes of the
software (‘the black box’) to run usability tests, for example, and react to bugs as
regular users of the product would do. Behavioral testing may also include
automation (regression tests) to eliminate human error when repetitive activities
are required. For example, you may need to fill in 100 registration forms on a
website to see how the product copes with such activity, so automating this test is
preferable.
REFERENCES
3. Heramb Kadam, Rahul Shevade, Prof. Deven Ketkar, Mr. Sufiyan Rajguru (2018). A Forecast
for Big Mart Sales Based on Random Forests and Multiple Linear Regression. IJEDR.
4. Gopal Behere, Neeta Nain (2019). Grid Search Optimization (GSO) Based Future Sales
Prediction for Big Mart. 2019 International Conference on Signal-Image Technology & Internet-
Based Systems (SITIS).
5. Kumari Punam, Rajendra Pamula, Praphula Kumar Jain (2018, September 28-29). A Two-
Level Statistical Model for Big Mart Sales Prediction. 2018 International Conference on
Computing, Power and Communication Technologies.
6. Ranjitha P, Spandana M. (2021). Predictive Analysis for Big Mart Sales Using Machine
Learning Algorithms. Fifth International Conference on Intelligent Computing and Control
Systems (ICICCS 2021).
7. Bohdan M. Pavlyshenko (2018, August 25). Using Stacking Approaches for Machine
Learning Models. 2018 IEEE Second International Conference on Data Stream Mining &
Processing (DSMP).
8. Nikita Malik, Karan Singh (2020, June). Sales Prediction Model for Big Mart.
9. Gopal Behere, Neeta Nain. (2019, September). A Comparative Study of Big Mart Sales
Prediction.
10. Archisha Chandel, Akanksha Dubey, Saurabh Dhawale, Madhuri Ghuge (2019, April). Sales
Prediction System using Machine Learning. International Journal of Scientific Research and
Engineering Development.
11. https://www.javatpoint.com/machine-learning-random-forestalgorithm
12. https://www.javatpoint.com/machine-learning-decision-treeclassification-algorithm
13. A. Chandel, A. Dubey, S. Dhawale and M. Ghuge, "Sales Prediction System using Machine
Learning," International Journal of Scientific Research and Engineering Development, vol. 2,
no. 2, pp. 1-4, 2019. [Accessed 27 January 2020].