
BIG MART SALES ANALYSIS

ABSTRACT

In the modern era of reaching new heights of advancement, every company and
enterprise works on its customer demands as well as its inventory management.
The models they use help them predict future demand by learning the patterns in
old sales records. Lately, the traditional prediction models for sales forecasting
are being abandoned because they take a prolonged amount of time to produce the
expected results. Therefore, retailers now keep track of their sales records in the
form of a dataset, which comprises attributes such as price tag, outlet type,
outlet location, item visibility, item outlet sales, etc.
TABLE OF CONTENTS

S.NO CONTENT


1 Introduction
1.1 Software requirements
1.2 Hardware requirements
2 Feasibility study
2.1 Economic feasibility
2.2 Technical feasibility
2.3 Social feasibility
3 Literature survey
4 System analysis
4.1 Existing system
4.1.1 Disadvantages of existing system
4.2 Proposed system
4.2.1 Advantages of proposed system
4.3 Functional requirements
4.4 Non-Functional requirements
5 System design
5.1 System architecture
5.2 UML diagrams
6 Implementation
6.1 Modules
6.2 Sample code
7 Software environment
8 System testing
8.1 Testing strategies
8.2 Test cases
9 Screens
10 Conclusion
11 References
LIST OF FIGURES

FIG.NO FIG.NAME


5.1.1 System architecture
5.1.2 Dataflow diagram
5.2.1 Usecase diagram
5.2.2 Class diagram
5.2.3 Activity diagram
5.2.4 Sequence diagram
5.2.5 Collaboration diagram
5.2.6 Component diagram
5.2.7 Deployment diagram
INTRODUCTION

1.INTRODUCTION

Customer satisfaction and keeping up with the demand for products are very
important for any store that wants to survive in the market and compete with other
stores. Both can only be achieved with a reliable estimate of future demand to
support new plans for a flourishing business. With an increasing population, the
number of stores and shopping malls is also increasing, creating competition
between different enterprises for bigger sales and popularity. Along with grocery
shops and stores, enterprises too need analysis to examine patterns and predict
future sales. Many businesses and enterprises keep track of the statistical data of
their products so as to analyse their future demand in the market. The stored
statistical data consist of the number of items sold, their categories, and other
attribute details, and provide the organization with trends and patterns regarding
its supply, helping it grow the business and improve its sales strategies. This can
come in handy in the near future or seasonally, for example to create short-term
discount offers that attract more customers to the brand. The previous data are
also important for the management of logistics, inventory, transport services, etc.
To carry out all these tasks we use machine learning algorithms such as random
forest, linear regression, decision trees, and the XGBoost regressor. The aim of
our project is to build a machine learning model which can predict the sales of a
product and understand its patterns and trends, which is an important part of a
big mart's management.

1.1 SOFTWARE REQUIREMENTS

Software requirements deal with defining software resource requirements and


prerequisites that need to be installed on a computer to provide optimal functioning
of an application. These requirements or prerequisites are generally not included in
the software installation package and need to be installed separately before the
software is installed.

Platform – In computing, a platform describes some sort of framework, either in


hardware or software, which allows software to run. Typical platforms include a
computer’s architecture, operating system, or programming languages and their
runtime libraries.

Operating system is one of the first requirements mentioned when defining system
requirements (software). Software may not be compatible with different versions
of the same line of operating systems, although some measure of backward
compatibility is often maintained. For example, most software designed for
Microsoft Windows XP does not run on Microsoft Windows 98, although the
converse is not always true. Similarly, software designed using newer features of
Linux Kernel v2.6 generally does not run or compile properly (or at all) on Linux
distributions using Kernel v2.2 or v2.4.

APIs and drivers – Software making extensive use of special hardware devices,
like high-end display adapters, needs special API or newer device drivers. A good
example is DirectX, which is a collection of APIs for handling tasks related to
multimedia, especially game programming, on Microsoft platforms.

Web browser – Most web applications and software depending heavily on Internet
technologies make use of the default browser installed on system. Microsoft
Internet Explorer is a frequent choice of software running on Microsoft Windows,
which makes use of ActiveX controls, despite their vulnerabilities.

1) Visual Studio Community Version

2) Node.js (Version 12.3.1)

3) Python IDLE (Python 3.7)

1.2 HARDWARE REQUIREMENTS


The most common set of requirements defined by any operating system or
software application is the physical computer resources, also known as hardware. A
hardware requirements list is often accompanied by a hardware compatibility list
(HCL), especially in case of operating systems. An HCL lists tested, compatible,
and sometimes incompatible hardware devices for a particular operating system or
application. The following sub-sections discuss the various aspects of hardware
requirements.

Architecture – All computer operating systems are designed for a particular


computer architecture. Most software applications are limited to particular
operating systems running on particular architectures. Although architecture-
independent operating systems and applications exist, most need to be recompiled
to run on a new architecture. See also a list of common operating systems and their
supporting architectures.

Processing power – The power of the central processing unit (CPU) is a


fundamental system requirement for any software. Most software running on x86
architecture defines processing power as the model and the clock speed of the
CPU. Many other features of a CPU that influence its speed and power, like bus
speed, cache, and MIPS are often ignored. This definition of power is often
erroneous, as AMD Athlon and Intel Pentium CPUs at similar clock speeds often
have different throughput speeds. Intel Pentium CPUs have enjoyed a considerable
degree of popularity, and are often mentioned in this category.

Memory – All software, when run, resides in the random access memory (RAM)
of a computer. Memory requirements are defined after considering demands of the
application, operating system, supporting software and files, and other running
processes. Optimal performance of other unrelated software running on a multi-
tasking computer system is also considered when defining this requirement.

Secondary storage – Hard-disk requirements vary, depending on the size of


software installation, temporary files created and maintained while installing or
running the software, and possible use of swap space (if RAM is insufficient).

Display adapter – Software requiring a better than average computer graphics


display, like graphics editors and high-end games, often define high-end display
adapters in the system requirements.

Peripherals – Some software applications need to make extensive and/or special


use of some peripherals, demanding the higher performance or functionality of
such peripherals. Such peripherals include CD-ROM drives, keyboards, pointing
devices, network devices, etc.

1) Operating System: Windows only

2) Processor: i5 and above

3) RAM: 4 GB and above

4) Hard Disk: 50 GB
FEASIBILITY STUDY

2. FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase and a business
proposal is put forth with a very general plan for the project and some cost
estimates. During system analysis the feasibility study of the proposed system is
carried out. This is to ensure that the proposed system is not a burden to
the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.

Three key considerations involved in the feasibility analysis are

 ECONOMICAL FEASIBILITY
 TECHNICAL FEASIBILITY
 SOCIAL FEASIBILITY

2.1 ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into the
research and development of the system is limited. The expenditures must be
justified. The developed system is well within the budget, which was achieved
because most of the technologies used are freely available. Only the customized
products had to be purchased.

2.2 TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not place a high
demand on the available technical resources, as this would lead to high demands
being placed on the client. The developed system must have modest requirements,
as only minimal or no changes are required for implementing this system.

2.3 SOCIAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently.
The user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance by the users solely depends on the methods that
are employed to educate the user about the system and to make the user familiar with
it. The user's level of confidence must be raised so that the user is also able to
offer constructive criticism, which is welcomed, as the user is the final user of the system.
LITERATURE SURVEY

3.LITERATURE SURVEY

3.1 Big Mart Sales Prediction using Machine Learning:

https://ijcrt.org/papers/IJCRT2106802.pdf
ABSTRACT: Nowadays shopping malls and Big Marts keep the track of their
sales data of each and every individual item for predicting future demand of the
customer and update the inventory management as well. These data stores basically
contain a large number of customer data and individual item attributes in a data
warehouse. Further, anomalies and frequent patterns are detected by mining the
data store from the data warehouse. The resultant data can be used for predicting
future sales volume with the help of different machine learning techniques for the
retailers like Big Mart. In this paper, we propose a predictive model using XG
boost Regressor technique for predicting the sales of a company like Big Mart and
found that the model produces better performance as compared to existing models.

3.2 Prediction of Big Mart Sales using Exploratory Machine Learning


Techniques:

https://www.irjmets.com/uploadedfiles/paper/volume_3/issue_9_september_2021/16025/final/fin_irj
mets1630829142.pdf

ABSTRACT: In this study, exploratory machine learning approaches are used to


forecast big-box store sales. In general, sales forecasting is crucial for advertising,
merchandising, warehousing, and production, and it is done in a variety of
organizations. To modify the business strategy to predicted results, the sales
estimate is based on Big Mart sales from different stores. Different machine
learning approaches may then be applied to forecast possible sales volumes for
stores like Big Mart. Machine Learning models such as Linear, Ridge and lasso
regression model, Random Forest, Gradient Boosted Decision Tree, AdaBoost
regressor, Xgboost, Light Gradient Boosting Machine are used in detailed research
of sales prediction. In order to anticipate correct outcomes, data exploration, data
transformation, and feature engineering are essential. The method is tested using
Big Mart Sales data from the year 2013.
3.3 A Forecast for Big Mart Sales Based on Random Forests and Multiple
Linear Regression:

https://www.ijedr.org/papers/IJEDR1804010.pdf

ABSTRACT: In today’s world, big malls and marts record sales data of individual
items for predicting future demand and inventory management. This data stores a
large number of attributes of the item as well as the individual customer data
together in a data warehouse. This data is mined for detecting frequent patterns as
well as anomalies. This data can be used for forecasting future sales volume with
the help of random forests and multiple linear regression model.

3.4 Grid Search Optimization (GSO) Based Future Sales Prediction for Big
Mart:

https://ieeexplore.ieee.org/abstract/document/9067927

ABSTRACT: In the retail domain, predicting sales before the actual sales occur plays a vital
role for any retail company like Big Mart or a mall in maintaining a successful
business. Traditional forecasting models such as statistical models are commonly
used as the methodology for future sales prediction, but these techniques take much
more time to estimate the sales and are not capable of handling non-linear
data. Therefore, Machine Learning (ML) techniques are employed to handle both
non-linear and linear data. ML techniques can also efficiently handle a large volume of data
like the Big Mart dataset, containing a large number of customer records and individual
item attributes. A retail company wants a model that can predict accurate
sales so that it can keep track of customers' future demand and update in advance
the sales inventory. In this work, we propose a Grid Search Optimization (GSO)
technique to optimize the parameters and select the best tuning hyperparameters,
further ensembled with XGBoost techniques, for forecasting the future sales of a
retail company such as Big Mart, and we found that our model produces better
results.

3.5 A Two-Level Statistical Model for Big Mart Sales Prediction:

https://ieeexplore.ieee.org/document/8675060

ABSTRACT: Sales forecasting is an important aspect of different companies


engaged in retailing, logistics, manufacturing, marketing and wholesaling. It allows
companies to efficiently allocate resources, to estimate achievable sales revenue
and to plan a better strategy for future growth of the company. In this paper,
prediction of sales of a product from a particular outlet is performed via a two-
level approach that produces better predictive performance compared to any of the
popular single model predictive learning algorithms. The approach is performed on
Big Mart Sales data of the year 2013. Data exploration, data transformation and
feature engineering play a vital role in predicting accurate results. The result
demonstrated that the two-level statistical approach performed better than a single
model approach as the former provided more information that leads to better
prediction.
SYSTEM ANALYSIS

4.SYSTEM ANALYSIS
4.1 EXISTING SYSTEM:

Rohit Sav, Pratiksha Shinde and Saurabh Gaikwad in [1] implemented predictive
models to measure Big Mart sales. They first cleaned the gathered data and then applied
the XGBoost algorithm. It was observed that the XGBoost regressor showed the
highest accuracy rate when compared with the other algorithms, which led them to
conclude that XGBoost should be used for the prediction of Big Mart sales. Inedi Theresa,
Dr. Venkata Reddy Medikonda and K.V. Narasimha Reddy in [2] discuss sales
prediction using the methodology of exploratory machine learning. They
carried out the whole process through well-defined steps that included the collection
of data, hypothesis generation to efficiently understand the problem, and further cleaning and
processing of the data. Models such as Linear Regression, Decision Tree
Regression, Ridge Regression, and Random Forest were used to predict the
outcome of the sales. They concluded that the multiple-model implementation led
them to a better prediction as compared to the single-model prediction

technique.

4.2 Proposed System:

The aim of our project is to build a machine learning model which can predict the
sales of a product and understand its patterns and trends which is an important part
of a big mart’s management.

4.2.1 Advantages of proposed system:

1. More accurate.

2. Our model outperformed the existing models.

4.3 FUNCTIONAL REQUIREMENTS


1.Data Collection

2.Data Preprocessing

3.Training And Testing

4.Modeling
5.Predicting

4.4 NON FUNCTIONAL REQUIREMENTS


NON-FUNCTIONAL REQUIREMENTS (NFRs) specify the quality attributes of a software system.
They judge the software system based on responsiveness, usability, security, portability and
other non-functional standards that are critical to the success of the software system.
An example of a non-functional requirement is "how fast does the website load?" Failing to meet
non-functional requirements can result in systems that fail to satisfy user needs. Non-
functional requirements allow you to impose constraints or restrictions on the design of
the system across the various agile backlogs. For example, the site should load within 3 seconds
when the number of simultaneous users is greater than 10,000. Describing non-functional
requirements is just as critical as describing functional requirements.

 Usability requirement
 Serviceability requirement
 Manageability requirement
 Recoverability requirement
 Security requirement
 Data Integrity requirement
 Capacity requirement
 Availability requirement
 Scalability requirement
 Interoperability requirement
 Reliability requirement
 Maintainability requirement
 Regulatory requirement
 Environmental requirement
SYSTEM DESIGN

5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE:

[Figure not reproduced: system architecture centred on the XGBoost model.]

Fig.5.1.1 System architecture

DATA FLOW DIAGRAM:

1. The DFD is also called a bubble chart. It is a simple graphical formalism that
can be used to represent a system in terms of the input data to the system,
the various processing carried out on this data, and the output data
generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It
is used to model the system components. These components are the
system process, the data used by the process, the external entities that
interact with the system and the information flows in the system.
3. The DFD shows how information moves through the system and how it is
modified by a series of transformations. It is a graphical technique that
depicts information flow and the transformations that are applied as data
moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD
may be partitioned into levels that represent increasing information flow and
functional detail.
[Figure not reproduced. Data flow: import libraries → explore the dataset → data
processing → data visualization → split the data into train & test → build the
models (Linear Regression, Random Forest, XGBoost) → train the model → signup &
signin, with a verify step (if verification fails there is no further process) →
user input → final outcome → end process.]

Fig.5.1.2 Dataflow diagram


5.2 UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized


general-purpose modeling language in the field of object-oriented software
engineering. The standard is managed, and was created by, the Object Management
Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises two
major components: a meta-model and a notation. In the future, some form of
method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of software systems, as
well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have
proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software
and the software development process. The UML uses mostly graphical notations
to express the design of software projects.

GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that
they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.

Use case diagram:

A use case diagram in the Unified Modeling Language (UML) is a type of


behavioral diagram defined by and created from a Use-case analysis. Its purpose is
to present a graphical overview of the functionality provided by a system in terms
of actors, their goals (represented as use cases), and any dependencies between
those use cases. The main purpose of a use case diagram is to show what system
functions are performed for which actor. Roles of the actors in the system can be
depicted.
[Figure not reproduced. Use cases for the actor User: importing the packages,
exploring the dataset, data processing, splitting the data into train & test,
building the model, training the model, user signup & signin, user input, final
outcome.]

Fig.5.2.1 Usecase diagram


Class diagram:

The class diagram is used to refine the use case diagram and define a detailed design of
the system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a" or
"has-a" relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.

Splitting the data into train & test


Importing packages Exploring the dataset Data processing train
test
import() upload() process()
split()

Building the model

build()

Training the model

train()

User saignup & signin

signup()
signin()

User input

choose()

Final outcome

predict()

Fig.5.2.2 Class diagram


Activity diagram:

The process flows in the system are captured in the activity diagram. Similar to a state
diagram, an activity diagram also consists of activities, actions, transitions, initial and final
states, and guard conditions.

[Figure not reproduced. Activity flow: importing libraries → exploring the dataset →
data preprocessing → data visualization → splitting the data into train & test →
building the models → training the model → user signup & signin → user input →
final outcome.]

Fig.5.2.3 Activity diagram

Sequence diagram:

A sequence diagram represents the interaction between different objects in the system. The
important aspect of a sequence diagram is that it is time-ordered. This means that the exact
sequence of the interactions between the objects is represented step by step. Different objects in
the sequence diagram interact with each other by passing "messages".
[Figure not reproduced. Time-ordered messages between the User and the System:
importing the packages, exploring the dataset, data processing, splitting the data
into train & test, algorithm generation, training the model, user signup & signin,
user input, prediction.]

Fig.5.2.4 Sequence diagram

Collaboration diagram:
A collaboration diagram groups together the interactions between different objects. The
interactions are listed as numbered interactions that help to trace the sequence of the interactions.
The collaboration diagram helps to identify all the possible interactions that each object has with
other objects.

[Figure not reproduced. Numbered interactions between the User and the System:
1: Importing the packages
2: Exploring the dataset
3: Data processing
4: Splitting the data into train & test
5: Algorithm generation
6: Training the model
7: User signup & signin
8: User input
9: Prediction result]

Fig.5.2.5 Collaboration diagram

Component diagram:

The component diagram represents the high-level parts that make up the system. This
diagram depicts, at a high level, what components form part of the system and how they are
interrelated. A component diagram depicts the components culled after the system has undergone
the development or construction phase.
[Figure not reproduced. Components: importing libraries, exploring the dataset,
data processing, splitting the dataset into train & test, building the model,
training the model, user signup & signin, user input, prediction.]

Fig.5.2.6 Component diagram

Deployment diagram:

The deployment diagram captures the configuration of the runtime elements of the
application. This diagram is by far most useful when a system is built and ready to be deployed.
[Figure not reproduced. Deployment nodes: User and System.]

Fig.5.2.7 Deployment diagram


IMPLEMENTATION
6. IMPLEMENTATION

MODULES:

 Data exploration: using this module we load the data into the system

 Processing: using this module we read the data for processing

 Splitting data into train & test: using this module the data is divided into
train & test sets

 Model generation: the models are built and the accuracy of the algorithms is calculated

 User signup & login: using this module the user registers and logs in

 User input: using this module the user provides input for prediction

 Prediction: the final prediction is displayed

ALGORITHMS:

Linear Regression: Linear regression analysis is used to predict the value of a


variable based on the value of another variable. The variable you want to predict is
called the dependent variable. The variable you are using to predict the other
variable's value is called the independent variable.
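
As a minimal single-variable illustration on the training file (Item_MRP, the item's price tag, is assumed here as the numeric predictor; it is not the full feature set used in section 6.2):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Sales (dependent variable) predicted from one assumed numeric column (independent variable)
train = pd.read_csv("train.csv")
model = LinearRegression().fit(train[['Item_MRP']], train['Item_Outlet_Sales'])
print("slope:", model.coef_[0], "intercept:", model.intercept_)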

Random Forest: Random forest is a commonly-used machine learning algorithm


trademarked by Leo Breiman and Adele Cutler, which combines the output of
multiple decision trees to reach a single result. Its ease of use and flexibility have
fueled its adoption, as it handles both classification and regression problems.

Xgboost: XGBoost is a popular and efficient open-source implementation of the


gradient boosted trees algorithm. Gradient boosting is a supervised learning
algorithm, which attempts to accurately predict a target variable by combining the
estimates of a set of simpler, weaker models.
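
A short sketch of how the three regressors described above could be compared by cross-validated error on the training file. The feature columns chosen here (Item_Visibility and Item_MRP) are assumptions made only for illustration; the project's full preprocessing is shown in section 6.2.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
X = train[['Item_Visibility', 'Item_MRP']]   # assumed numeric columns, for illustration only
y = train['Item_Outlet_Sales']

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "XGBoost": XGBRegressor(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    # negated MSE is returned by scikit-learn, so flip the sign before taking the root
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(name, "cross-validated RMSE:", np.sqrt(-scores).mean())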

6.2 SAMPLE CODE:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
#Combine test and train into one file
train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, data.shape)
#Check missing values:
data.apply(lambda x: sum(x.isnull()))
#Fill missing values so the models can be fit (simple imputation; assumed defaults
#for the columns with gaps in the Big Mart data):
data['Item_Weight'].fillna(data['Item_Weight'].mean(), inplace=True)
data['Outlet_Size'].fillna('Medium', inplace=True)
#Item type combine:
data['Item_Identifier'].value_counts()
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD':'Food',
'NC':'Non-
Consumable',
'DR':'Drinks'})
data['Item_Type_Combined'].value_counts()
#Import library:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#New variable for outlet
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
#Label-encode the remaining categorical columns:
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size',
           'Item_Type_Combined','Outlet_Type','Outlet']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])
import warnings
warnings.filterwarnings('ignore')
#Drop the columns which have been converted to different types:
data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

#Divide into test and train:


train = data.loc[data['source']=="train"].copy()
test = data.loc[data['source']=="test"].copy()

#Drop unnecessary columns:


test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)
#Export files as modified versions:
train.to_csv("train_modified.csv",index=False)
test.to_csv("test_modified.csv",index=False)
# Reading modified data
train2 = pd.read_csv("train_modified.csv")
test2 = pd.read_csv("test_modified.csv")
# Build the feature matrix and the target (identifier columns are kept aside for the submission file):
X_train = train2.drop(['Item_Outlet_Sales','Item_Identifier','Outlet_Identifier'], axis=1)
y_train = train2['Item_Outlet_Sales']
X_test = test2.drop(['Item_Identifier','Outlet_Identifier'], axis=1)
# Fitting Multiple Linear Regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Measuring accuracy (R^2 and mean squared error are the regression metrics used)
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score
#Perform cross-validation:
cv_score = cross_val_score(regressor, X_train, y_train, cv=5,
                           scoring='neg_mean_squared_error')
#Predict on the test set and build the submission file:
y_pred = regressor.predict(X_test)
submission = pd.DataFrame({
    'Item_Identifier':test2['Item_Identifier'],
    'Outlet_Identifier':test2['Outlet_Identifier'],
    'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])
# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=15,min_samples_leaf=300)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
submission = pd.DataFrame({
    'Item_Identifier':test2['Item_Identifier'],
    'Outlet_Identifier':test2['Outlet_Identifier'],
    'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])
# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100,max_depth=6,
                                  min_samples_leaf=50,n_jobs=4)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
submission = pd.DataFrame({
    'Item_Identifier':test2['Item_Identifier'],
    'Outlet_Identifier':test2['Outlet_Identifier'],
    'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])
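
The listing above stops at the Random Forest model; the XGBoost step referenced in the system architecture and in the conclusion is not shown. A minimal sketch of that step, assuming the xgboost package is installed and reusing the X_train, y_train, X_test and test2 frames defined above (the project's exact hyperparameters are not given, so the values below are illustrative):

# Fitting an XGBoost regressor to the dataset (illustrative parameters)
from xgboost import XGBRegressor
regressor = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
submission = pd.DataFrame({
    'Item_Identifier':test2['Item_Identifier'],
    'Outlet_Identifier':test2['Outlet_Identifier'],
    'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])
submission.to_csv("submission_xgboost.csv", index=False)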
SOFTWARE ENVIRONMENT
7.SOFTWARE ENVIRONMENT

PYTHON LANGUAGE:

Python is an interpreted, object-oriented, high-level programming language with dynamic


semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple,
easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code reuse.
The Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed. Often, programmers fall in
love with Python because of the increased productivity it provides. Since there is no compilation
step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or
bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error,
it raises an exception. When the program doesn't catch the exception, the interpreter prints a
stack trace. A source level debugger allows inspection of local and global variables, evaluation of
arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on.
The debugger is written in Python itself, testifying to Python's introspective power. On the other
hand, often the quickest way to debug a program is to add a few print statements to the source:
the fast edit-test-debug cycle makes this simple approach very effective.

Python is a dynamic, high-level, free open source, and interpreted programming language. It
supports object-oriented programming as well as procedural-oriented programming. In Python,
we don’t need to declare the type of variable because it is a dynamically typed language. For
example, x = 10 Here, x can be anything such as String, int, etc.

Features in Python:

There are many features in Python, some of which are discussed below as
follows:
1. Free and Open Source

The Python language is freely available on the official website and can be
downloaded from there. Since it is open source, the source code is also available
to the public, so you can download it, use it, and share it.

2. Easy to code

Python is a high-level programming language. Python is very easy to learn
compared to other languages like C, C#, JavaScript, Java, etc. It is
very easy to code in the Python language and anybody can learn Python basics in
a few hours or days. It is also a developer-friendly language.

3. Easy to Read

As you will see, learning Python is quite simple. As was already established,
Python’s syntax is really straightforward. The code block is defined by the
indentations rather than by semicolons or brackets.

4. Object-Oriented Language

One of the key features of Python is object-oriented programming. Python
supports object-oriented programming and concepts such as classes, object
encapsulation, etc.
5. GUI Programming Support

Graphical user interfaces can be made using a module such as PyQt5, PyQt4,
wxPython, or Tk in Python. PyQt5 is the most popular option for creating
graphical apps with Python.

6. High-Level Language

Python is a high-level language. When we write programs in Python, we do not


need to remember the system architecture, nor do we need to manage the
memory.

7. Extensible feature

Python is an extensible language: parts of the program can be written in C or C++,
compiled, and then used from Python.

8. Easy to Debug

Python provides excellent information for error tracing. You will be able to quickly
identify and correct the majority of your program's issues once you understand how
to interpret Python's error traces. Often, simply by glancing at the code, you can
determine what it is designed to do.

9. Python is a Portable language

Python is also a portable language. For example, if we have Python
code for Windows and we want to run it on other platforms such
as Linux, Unix, or macOS, we do not need to change it; we can run the same code
on any platform.
10. Python is an Integrated language

Python is also an Integrated language because we can easily integrate Python


with other languages like C, C++, etc.

11. Interpreted Language:

Python is an interpreted language because Python code is executed line by line.
Unlike languages such as C, C++ and Java, there is no separate compilation step,
which makes it easier to debug our code. The source code of Python is
converted into an intermediate form called bytecode.

12. Large Standard Library

Python has a large standard library that provides a rich set of modules and
functions so you do not have to write your own code for every single thing. There
are many libraries present in Python such as regular expressions, unit-testing,
web browsers, etc.

13. Dynamically Typed Language

Python is a dynamically typed language. That means the type (for example int,
double, long, etc.) of a variable is decided at run time, not in advance; because of
this feature we do not need to specify the type of a variable.

14. Frontend and backend development

With the new PyScript project, you can write and run Python code in HTML with
the help of some simple tags such as <py-script> and <py-env>. This helps you do
frontend development work in Python, like JavaScript. The backend is Python's strong
forte; it is extensively used for this work because of its frameworks
like Django and Flask.

15. Allocating Memory Dynamically

In Python, the variable data type does not need to be specified. The memory is
automatically allocated to a variable at runtime when it is given a value.
Developers do not need to write int y = 18 to set the integer value 18 to y; you
may just type y = 18.

LIBRARIES/PACKAGES:

Tensorflow

TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used
for machine learning applications such as neural networks. It is used for both research and
production at Google.

TensorFlow was developed by the Google Brain team for internal Google use. It was released
under the Apache 2.0 open-source license on November 9, 2015.

Numpy

Numpy is a general-purpose array-processing package. It provides a high-performance


multidimensional array object, and tools for working with these arrays.

It is the fundamental package for scientific computing with Python. It contains various features
including these important ones:

 A powerful N-dimensional array object


 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, Numpy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined using Numpy which allows
Numpy to seamlessly and speedily integrate with a wide variety of databases.
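
A small broadcasting illustration with made-up numbers (the values are not taken from the Big Mart dataset):

import numpy as np

# 2 outlets x 3 items of made-up prices
prices = np.array([[249.8, 48.2, 141.6],
                   [182.1, 53.9, 214.4]])
# one adjustment factor per outlet; broadcasting stretches it across the item columns
factors = np.array([[1.10], [0.95]])
print(prices * factors)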

Pandas

Pandas is an open-source Python library providing high-performance data manipulation and analysis
tools using its powerful data structures. Python was previously used mainly for data munging and
preparation and contributed very little to data analysis; Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data:
load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of
academic and commercial domains, including finance, economics, statistics, analytics, etc.
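
A short sketch of the load, prepare, manipulate and analyze steps on the Big Mart training file (column names as used elsewhere in this document; the imputation default is an assumption):

import pandas as pd

df = pd.read_csv("train.csv")                                    # load
df['Outlet_Size'].fillna('Medium', inplace=True)                 # prepare: simple imputation (assumed default)
avg_sales = df.groupby('Item_Type')['Item_Outlet_Sales'].mean()  # manipulate
print(avg_sales.sort_values(ascending=False).head())             # analyze the top-selling item types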

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a


variety of hardcopy formats and interactive environments across platforms. Matplotlib can be
used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application
servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible. You can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, etc., with just a few lines of code. For examples, see the sample
plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc, via an object oriented interface or via a set of functions
familiar to MATLAB users.
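
For instance, a bar chart of average sales per outlet type can be produced from the training file with a few lines (a sketch, not one of the project's screenshots):

import pandas as pd
import matplotlib.pyplot as plt

# Average Item_Outlet_Sales for each Outlet_Type, drawn as a bar chart
train = pd.read_csv("train.csv")
train.groupby('Outlet_Type')['Item_Outlet_Sales'].mean().plot(kind='bar')
plt.ylabel('Mean Item_Outlet_Sales')
plt.title('Average sales by outlet type')
plt.show()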

Scikit – learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a


consistent interface in Python. It is licensed under a permissive simplified BSD license and is
distributed under many Linux distributions, encouraging academic and commercial use.
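
A minimal example of its consistent estimator interface, holding out part of the training data for validation (a sketch using the train_modified.csv file produced in section 6.2; the project itself splits by the 'source' column rather than with train_test_split):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Hold out 20% of the prepared training data to validate a model
train = pd.read_csv("train_modified.csv")
X = train.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
y = train['Item_Outlet_Sales']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Validation R^2:", r2_score(y_val, model.predict(X_val)))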
SYSTEM TESTING
8.SYSTEM TESTING
System testing, also referred to as system-level tests or system-integration testing,
is the process in which a quality assurance (QA) team evaluates how the various
components of an application interact together in the full, integrated system or
application. System testing verifies that an application performs tasks as designed.
This step, a kind of black box testing, focuses on the functionality of an
application. System testing, for example, might check that every kind of user input
produces the intended output across the application.

Phases of system testing:

System testing examines every component of
an application to make sure that they work as a complete and unified whole. A QA
team typically conducts system testing after it checks individual modules with
functional or user-story testing and then each component through integration
testing.

If a software build achieves the desired results in system testing, it gets a final
check via acceptance testing before it goes to production, where users consume the
software. An app-dev team logs all defects and establishes what kinds and amounts
of defects are tolerable.

8.1 Software Testing Strategies:

Optimization of the approach to testing in software engineering is the best way to


make it effective. A software testing strategy defines what, when, and how to do
whatever is necessary to make an end-product of high quality. Usually, the
following software testing strategies and their combinations are used to achieve
this major objective:
Static Testing:

The early-stage testing strategy is static testing: it is performed without actually


running the developing product. Basically, such desk-checking is required to detect
bugs and issues that are present in the code itself. Such a check-up is important at
the pre-deployment stage as it helps avoid problems caused by errors in the code
and software structure deficits.

Structural Testing:
It is not possible to effectively test software without running it. Structural testing,
also known as white-box testing, is required to detect and fix bugs and errors
emerging during the pre-production stage of the software development process. At
this stage, unit testing based on the software structure is performed using
regression testing. In most cases, it is an automated process working within the test
automation framework to speed up the development process at this stage.
Developers and QA engineers have full access to the software’s structure and data
flows (data flows testing), so they could track any changes (mutation testing) in the
system’s behavior by comparing the tests’ outcomes with the results of previous
iterations (control flow testing).
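
As a small illustration of such an automated check, a pytest-style regression test for the identifier-prefix mapping used in section 6.2 (the helper function here is hypothetical and written only for this test):

# Hypothetical helper mirroring the Item_Type_Combined mapping from section 6.2
def combine_item_type(item_identifier):
    prefixes = {'FD': 'Food', 'DR': 'Drinks', 'NC': 'Non-Consumable'}
    return prefixes[item_identifier[:2]]

# pytest collects and runs functions whose names start with test_
def test_combine_item_type():
    assert combine_item_type('FDA15') == 'Food'
    assert combine_item_type('DRC01') == 'Drinks'
    assert combine_item_type('NCD19') == 'Non-Consumable'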

Behavioral Testing:

The final stage of testing focuses on the software’s reactions to various activities
rather than on the mechanisms behind these reactions. In other words, behavioral
testing, also known as black-box testing, presupposes running numerous tests,
mostly manual, to see the product from the user’s point of view. QA engineers
usually have some specific information about a business or other purposes of the
software (‘the black box’) to run usability tests, for example, and react to bugs as
regular users of the product will do. Behavioral testing also may include
automation (regression tests) to eliminate human error if repetitive activities are
required. For example, you may need to fill 100 registration forms on the website
to see how the product copes with such an activity, so the automation of this test is
preferable.

8.2 TEST CASES:

S.NO  INPUT                        If available                                 If not available

1     User signup                  User gets registered into the application    There is no process

2     User signin                  User gets logged into the application        There is no process

3     Enter input for prediction   Prediction result is displayed               There is no process
SCREENS
9. SCREENSHOTS
CONCLUSION
10.CONCLUSION

We have applied four algorithms: XGBoost, Random Forest, Decision Tree and Linear Regression.


From the results, we can conclude that among all four algorithms XGBoost has
the highest accuracy of 61.14% when the models are compared against one another. Hence, we can say
that XGBoost is the better algorithm for efficient sales analysis. This methodology
is primarily used by shopping marts, groceries, Brand outlets etc. The data analysis
applied to the predictive machine learning models provides a very effective way to
manage sales, it also generously contributes to better decisions and plan strategies
based on future demands. This approach is very much encouraged in today’s world
since it aids many companies, enterprises, researchers and brands for outcomes that
lead to management of their profits, sales, inventory management, data research
and customer demand.
BIBLIOGRAPHY
11. REFERENCES
1. Rohit Sav, Pratiksha Shinde, Saurabh Gaikwad (2021, June). Big Mart Sales Prediction using
Machine Learning. 2021 International Journal of Creative Research Thoughts (IJCRT).

2. Inedi Theresa, Dr. Venkata Reddy Medikonda, K.V. Narasimha Reddy (2020, March).
Prediction of Big Mart Sales using Exploratory Machine Learning Techniques. 2020 International
Journal of Advanced Science and Technology (IJAST).

3. Heramb Kadam, Rahul Shevade, Prof. Deven Ketkar, Mr. Sufiyan Rajguru (2018). A Forecast
for Big Mart Sales Based on Random Forests and Multiple Linear Regression. (IJEDR).

4. Gopal Behere, Neeta Nain (2019). Grid Search Optimization (GSO) Based Future Sales
Prediction for Big Mart. 2019 International Conference on Signal-Image Technology & Internet-
Based Systems (SITIS).

5. Kumari Punam, Rajendra Pamula, Praphula Kumar Jain (2018, September 28-29). A Two-
Level Statistical Model for Big Mart Sales Prediction. 2018 International Conference on
Computing, Power and Communication Technologies.

6. Ranjitha P, Spandana M. (2021). Predictive Analysis for Big Mart Sales Using Machine
Learning Algorithms.Fifth International Conference on Intelligent Computing and Control
Systems (ICICCS 2021).

7. Bohdan M. Pavlyshenko (2018, August 25). Rainfall Predictive Approach for La Trinidad,
Benguet using Machine Learning Classification. 2018 IEEE Second International Conference on
Data Stream Mining & Processing (DSMP).

8. Nikita Malik, Karan Singh (2020, June). Sales Prediction Model for Big Mart.

9. Gopal Behere, Neeta Nain. (2019, September). A Comparative Study of Big Mart Sales
Prediction.

10. Archisha Chandel, Akanksha Dubey, Saurabh Dhawale, Madhuri Ghuge (2019, April). Sales
Prediction System using Machine Learning. International Journal of Scientific Research and
Engineering Development.
11. https://www.javatpoint.com/machine-learning-random-forestalgorithm

12. https://www.javatpoint.com/machine-learning-decision-treeclassification-algorithm

13. A. Chandel, A. Dubey, S. Dhawale and M. Ghuge, "Sales Prediction System using Machine
Learning", International Journal of Scientific Research and Engineering Development, vol. 2, no.
2, pp. 1-4, 2019. [Accessed 27 January 2020].
