Business: Capstone Project House Price Prediction Project Note-1

This document provides an overview and outline for a project aimed at predicting house prices. It introduces the problem, which is to understand how various house features can predict price. It then outlines the data, which was collected from 2014-2015 and includes 23 variables like bedrooms, bathrooms, floors, quality, and condition. Finally, it discusses exploring the data through analysis of distributions, relationships between variables, and outliers to gain business insights like how features contribute to price and if the data is balanced. This will help buyers, sellers, and real estate companies set fair and profitable prices.

Uploaded by

Srinidhi A E


BUSINESS

REPORT
Capstone Project

HOUSE PRICE PREDICTION

Project Note-1

SONAL SINGH
01/05/2022

CONTENT
1) Introduction of the Problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity

2) Data Report
a) Understanding how data was collected in terms of time, frequency
and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)

3) Exploratory data analysis

a) Duplicates
b) Univariate analysis (distribution and spread for every continuous
attribute, distribution of data in categories for categorical ones)
c) Bivariate analysis
d) Removal of unwanted variables
e) Missing value treatment
f) Outlier treatment
g) Variable transformation
h) Scaling of data
i) Log Transformation
j) Encoding

4) Business Insights from EDA

a) Is the data unbalanced? If so, what can be done? Please explain in
the context of the business
b) Any business insights using clustering
c) Any other business insights

1) INTRODUCTION
This section introduces the project, its objectives and the basic context of the analysis. The analysis deals with predicting house prices from the attributes of a house given in the data set. In other words, it aims to understand the real estate market of the given geographical location. The price of a house does not depend only on the square footage it occupies. Other factors, such as the number of bedrooms, bathrooms and floors, basement area, condition and quality of the house, year built, waterfront/beachfront location, age of the house and time since renovation, also play a major role in determining its cost. Through this project we try to derive patterns in the data and answer further questions by applying the learning and models from the past 11 months of study.

Defining Problem Statement

The goal of this analysis is to understand the relationship between the features of a house and how those features can be used to predict its price.
A house's value depends on more than location and square footage. Like the features that make up a person, an educated party would want to know all the aspects that give a house its value. For example, suppose you want to sell a house but do not know what price to expect; it cannot be too low or too high. To find a price, you would usually look for similar properties in your neighborhood and assess your house price based on the gathered data.

Assumptions
This section states the assumptions made about attributes in the data set that are not explained well in the problem.

Ceil – 1 indicates the lowest number of levels/floors of a house in the data and 3.5 indicates the maximum number of levels/floors.

Coast – 1 indicates a waterfront property and 0 indicates a non-waterfront property

Condition – 1 indicates Poor Condition and 5 indicates Best Condition

Quality – 1 indicates Poor Quality and 13 indicates Best Quality

Furnished – 0 indicates not furnished and 1 indicates furnished

Scope of Project
This section looks at the scope of this project in the real world from a data scientist's point of view. Real estate is an always-active market. It is also one of the markets hit hardest in times of economic distress. As per research, real estate generates almost 35 percent of the total revenue of the country's economy. For the young population, real estate is the most viable option to invest in. Even during the Corona pandemic this market kept working, although it saw some crashes and booms in parallel with stock market movements.
A seller often cannot estimate the price of a house alone. The features of the house can help evaluate its price, and different houses have different features. Comparing the features of several houses helps evaluate relevant prices. Hence, analyzing the bulk of the data can help predict the house price.
How do we get profitable pricing for houses and buildings so that neither the seller nor the buyer is at a loss? That is where the factors affecting the price of a house come into the picture. If a fair evaluation is made of all the factors, how they contribute and why they contribute, then a profitable figure can be derived that leads to a win-win situation for both parties. Given the contribution of real estate to the economy and to an individual's standard of living, it is essential for us to contribute our data skills towards a fair and profitable future.

Understanding business/social opportunity

This section looks at how a project or study of this kind can generate business profitability or social benefits.
Real estate is a booming sector that contributes hugely to the country's economy. It is also one of the sectors that contributes substantially to generating employment. That employment is not only for brokers and builders; it also includes the laborers who help with the construction of the buildings. If a sector contributes so heavily to the economy and employment, it is only fair to have honest and viable pricing of the product that the sector generates, in our case houses. Any unfair pricing is an injustice not only to the buyer and the seller but also to the workers who contribute to building the real estate. In addition, there are big companies in the business of building, buying and selling properties, which means that the major turnover of these companies comes from the pricing of houses. These may be newly built houses or existing houses being resold. Real estate is also the investment option chosen by the majority of the public. Hence, as data scientists, it is our duty to provide fair pricing and a sound understanding of the factors that contribute to the pricing of properties. This project is therefore important to the lives of people, as well as to the profits of companies in the nation and abroad.

2) Data Report

Understanding how data was collected in terms of time, frequency and methodology
This section describes how the data was collected.
This is a Capstone Project run by Great Learning, so the “House Price Prediction” data was provided to us through the learning platform.
The data was collected from 2014 to 2015.

Visual inspection of data (rows, columns, descriptive details)

The various attributes provided are

1) cid: a notation for a house
2) dayhours: Date house was sold
3) price: Price is prediction target
4) room_bed: Number of Bedrooms/House
5) room_bath: Number of bathrooms/bedrooms
6) living_measure: square footage of the home
7) lot_measure: square footage of the lot
8) ceil: Total floors (levels) in house
9) coast: House which has a view to a waterfront
10) sight: Has been viewed
11) condition: How good the condition is (Overall)
12) quality: grade given to the housing unit, based on grading system
13) ceil_measure: square footage of house apart from basement
14) basement_measure: square footage of the basement
15) yr_built: Built Year
16) yr_renovated: Year when house was renovated
17) zipcode: zip
18) lat: Latitude coordinate
19) long: Longitude coordinate
20) living_measure15: Living room area in 2015 (implies some renovations). This might or might not have affected the lotsize area
21) lot_measure15: lotSize area in 2015 (implies some renovations)
22) furnished: Based on the quality of room
23) total_area: Measure of both living and lot

In Fig 1 we can see the initial look of the data. It shows that the data has 23 columns.

These columns are the different factors that impact the price of the house, such as the number of bedrooms, number of bathrooms, number of floors, quality of the house, condition of the house, etc. Each column has a different name and a different meaning.

Number of Rows & Columns

We see that there are 21613 rows and 23 columns in the dataset. This tallies with the Fig. above, where we also saw 23 columns. The 21613 rows correspond to 21613 recorded instances. These rows may contain missing data or duplicates. They can also contain unwanted inputs, such as an object (text) value in a float/integer column.

Data Info

In the dataset, we have more than 21k records and 23 columns, out of which
• 12 features are of float type
• 4 features are of integer type
• 7 features are of object type

We see that room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, condition, quality, ceil_measure, basement, yr_built, living_measure15, lot_measure15, furnished and total_area have null values. We know this because the above Fig shows that the bedroom and bathroom columns have only 21505 non-null values. This means that the remaining (21613 – 21505) = 108 entries are null or NaN. The same pattern appears in the other columns listed above. It is important to identify and then treat the null values.
Another observation is the type of data in each column. The data type is either object, float64 or int64. The object datatype appears when letters or symbols creep into a column. float64 is used when there are decimal values and int64 when there are integer values. One notable point is that dayhours is of object type. This is due to the presence of “T” within its values. Other features such as total_area, long, yr_built, condition, coast and ceil are numerical but are shown as object because of bad data that needs to be treated. In conclusion, 12 columns are float64, 4 columns are int64 and 7 columns are object. Bad data and missing data need to be treated for accurate results.
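The inspection described above can be sketched in pandas as follows. This is only an illustration on a tiny made-up sample: the values, the "$" marker for bad data and the date layout "YYYYMMDDTHHMMSS" are assumptions, not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; "$" stands in for bad data (assumption).
raw = pd.DataFrame({
    "dayhours": ["20150427T000000", "20141013T000000", "20150225T000000"],
    "room_bed": [3.0, np.nan, 4.0],
    "ceil": ["1", "$", "2.5"],
    "total_area": ["7200", "9373", "$"],
})

print(raw.shape)   # (rows, columns)
raw.info()         # dtype and non-null count per column

# Numeric columns read as object: coerce them, turning bad entries into NaN.
for col in ["ceil", "total_area"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# Parse the sale date; the "T" separates the date part from the time part.
raw["dayhours"] = pd.to_datetime(raw["dayhours"], format="%Y%m%dT%H%M%S")

# Null count per column after the bad data has been replaced with NaN.
null_counts = raw.isnull().sum()
print(null_counts)
```

On the real data, the same calls would be run on the full DataFrame and on every column that shows up as object without being genuinely textual.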

Data Description

Besides graphs, statistics that summarize the distribution of the data are used to transform data into information. The five-number summary, which forms the basis for a boxplot, is a good example of summarizing data. The above table gives the summary statistics of the dataset.

• cid: House ID/Property ID. Not used for analysis.
• price: Our target column; values are in the 75k - 7700k range. As Mean > Median, it's Right-Skewed.
• room_bed: Number of bedrooms ranges from 0 - 33. As Mean is slightly > Median, it's slightly Right-Skewed.
• room_bath: Number of bathrooms ranges from 0 - 8. As Mean is slightly < Median, it's slightly Left-Skewed.
• living_measure: Square footage of the house ranges from 290 - 13,540. As Mean > Median, it's Right-Skewed.
• lot_measure: Square footage of the lot ranges from 520 - 1,651,359. As Mean is almost double the Median, it's Highly Right-Skewed.
• ceil: Number of floors ranges from 1 - 3.5. As Mean ~ Median, it's almost Normally Distributed.
• coast: This value represents whether the house has a waterfront view or not. It's a categorical column. From the above analysis we see that very few houses have a waterfront view.
• sight: Value ranges from 0 - 4. As Mean > Median, it's Right-Skewed.
• condition: Represents the rating of the house, which ranges from 1 - 5. As Mean > Median, it's Right-Skewed.
• quality: Represents the grade given to the house, which ranges from 1 - 13. As Mean > Median, it's Right-Skewed.
• ceil_measure: Square footage of the house apart from the basement ranges from 290 - 9,410. As Mean > Median, it's Right-Skewed.
• basement: Square footage of the basement ranges from 0 - 4,820. As Mean is much > Median, it's Highly Right-Skewed.
• yr_built: Year built ranges from 1900 - 2015. As Mean < Median, it's Left-Skewed.
• yr_renovated: Year of renovation, with values up to 2015. This column can be used as a Categorical Variable indicating whether the house was renovated or not.
• zipcode: House Zip Code ranges from 98001 - 98199. As Mean > Median, it's Right-Skewed.
• lat: Latitude ranges from 47.1559 - 47.7776. As Mean < Median, it's Left-Skewed.
• long: Longitude ranges from -122.5190 to -121.315. As Mean > Median, it's Right-Skewed.
• living_measure15: Value ranges from 399 to 6,210. As Mean > Median, it's Right-Skewed.
• lot_measure15: Value ranges from 651 to 871,200. As Mean is much > Median, it's Highly Right-Skewed.
• furnished: Represents whether the house is furnished or not. It's a Categorical Variable.
• total_area: Total area of the house ranges from 1,423 to 1,652,659. As Mean is almost double the Median, it's Highly Right-Skewed.

From the above analysis we see that most column distributions are Right-Skewed and only a few features are Left-Skewed (like room_bath, yr_built, lat).
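The mean-versus-median rule used above is a quick heuristic; the sample skewness gives a direct check. Below is a minimal sketch on made-up numbers (the values are illustrative assumptions, only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Illustrative values only, not the real data.
sample = pd.DataFrame({
    "price": [200_000, 250_000, 300_000, 350_000, 2_000_000],
    "yr_built": [1900, 1990, 2000, 2005, 2010],
})

summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),   # > 0: right-skewed, < 0: left-skewed
})
summary["direction"] = np.where(summary["skew"] > 0, "right", "left")
print(summary)
```

Here price has one very large value, so its mean exceeds its median and the skewness is positive, while yr_built has one very old value and is negatively skewed.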

3) Exploratory Data Analysis

This section aims at a deeper level of data cleaning for the dataset. It covers univariate analysis, bivariate analysis, removal of unwanted variables, missing value treatment, outlier treatment, variable transformation and addition of any new variables. This is essential because we cannot work on unclean data; Exploratory Data Analysis aims at cleaning the data to make it ready for processing. Unclean data, filled with missing values, outliers and unwanted variables, can make the analysis erroneous and the outcome misleading.

Removal of unwanted variable

There may also be miscellaneous columns, like the ID, that we can drop from the analysis, as it is a mere identifier and does not contribute to our analysis.

Missing value treatment

This tells us that, out of 21613 total entries, the maximum number of missing values in any column is 166, in living_measure15. The next highest counts are in the number of bedrooms and bathrooms. The rest of the columns have substantially fewer missing values: lot_measure15, furnished and total_area have only 29 null values, and sight and condition have just 57. Note that 166, the highest null count, is far below 30 percent of the 21613 rows; 166 / 21613 is approximately 0.77 percent. This implies that at most about 0.8 percent of the data in any column is missing or null, which needs to be treated to get more accurate results.

Bad data and missing data were treated. Bad data was replaced with NaN, and the null values were then imputed using a simple imputer with the mode method.
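As a rough sketch of the two steps above, the snippet below computes the missing share from the counts quoted in the text and then applies mode imputation. It assumes the "simple imputer" is scikit-learn's SimpleImputer with strategy="most_frequent"; the small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Share of missing values for the worst column, using the counts in the text.
total_rows = 21613
max_missing = 166
pct_missing = max_missing / total_rows * 100
print(round(pct_missing, 2))

# Mode imputation on a tiny made-up frame (values are assumptions).
sample = pd.DataFrame({
    "room_bed": [3.0, 3.0, np.nan, 4.0],
    "furnished": [0.0, np.nan, 0.0, 1.0],
})
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)
print(imputed)
```

Mode imputation suits low-cardinality columns such as furnished; for continuous measures a median strategy is a common alternative.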

Duplicates

As per the code, we see that there are no duplicates.
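A duplicate check of this kind can be done in pandas along the following lines. The three rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; on the real data this would be the full DataFrame.
sample = pd.DataFrame({"cid": [1, 2, 3], "price": [100, 200, 300]})

n_duplicates = int(sample.duplicated().sum())  # fully identical rows
deduplicated = sample.drop_duplicates()        # no-op when there are none
print(n_duplicates)
```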

Univariate analysis (distribution and spread for every continuous attribute, distribution of
data in categories for categorical ones)

This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it is a single variable, it doesn't deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

• Very few houses are renovated: only 914 out of the total 21613 records.
• Houses with a sight value of 0 are the most common, followed by a smaller number of houses with a sight value of 2. Houses with a sight value of 1 or 4 are very rare.
• Most of the houses in the dataset have between 0 and 5 bedrooms.
• More houses were built from the year 2000 onwards. Fewer houses were constructed from 1900 to 1950.
• There are more unfurnished houses in the data set: about 17500 houses are unfurnished and only about 4000 are furnished.
• Most of the houses in the dataset are non-coast, and a negligible number of houses are near the coast.

Bivariate analysis (relationship between different variables, correlations)

To plot multiple pairwise bivariate distributions in a dataset, you can use seaborn's pairplot() function. For n variables in a DataFrame, it shows the relationship for each of the C(n, 2) pairs of variables as a matrix of plots, and the diagonal plots are the univariate plots.

From the above pair plot, we observed/deduced the following:

• room_bed: The plot of our target variable (price) against room_bed is not linear. Its distribution has several peaks (a mixture of Gaussians).
• room_bath: Its plot against price shows a somewhat linear relationship. The distribution has several peaks (a mixture of Gaussians).
• living_measure: The plot against price shows a strong linear relationship. It also has a linear relationship with the room_bath variable, so we might remove one of these two. Distribution is Right-Skewed.
• lot_measure: No clear relationship with price.
• ceil: No clear relationship with price. It has only 6 unique values, so we can convert this column into a categorical column.
• coast: No clear relationship with price. Clearly a categorical variable with 2 unique values.
• sight: No clear relationship with price. This has 5 unique values. Can be converted to a Categorical variable.
• condition: No clear relationship with price. This has 5 unique values. Can be converted to a Categorical variable.
• quality: Somewhat linear relationship with price. Has discrete values from 1 - 13. Can be converted to a Categorical variable.
• ceil_measure: Strong linear relationship with price, and also with the room_bath and living_measure features. Distribution is Right-Skewed.
• basement: No clear relationship with price.
• yr_built: No clear relationship with price.
• yr_renovated: No clear relationship with price. Has 2 unique values. Can be converted to a Categorical Variable which tells whether the house is renovated or not.
• zipcode, lat, long: No clear relationship with price or any other feature.
• living_measure15: Somewhat linear relationship with the target feature. It behaves the same as living_measure, so we can drop this variable.
• lot_measure15: No clear relationship with price or any other feature.
• furnished: No clear relationship with price or any other feature. It has 2 unique values, so it can be converted to a Categorical Variable.
• total_area: No clear relationship with price, but it has a very strong linear relationship with lot_measure, so one of them can be dropped.
• There is a linear relationship between lot_measure and total_area, and also some linear relationship between ceil_measure and living_measure.

Analysing Bivariate

Feature: room_bed

There is a clear increasing trend: price increases with the number of bedrooms.

Feature: room_bath

There is an upward trend: price increases with the number of bathrooms.

Feature: living_measure

There is a clear increase in the price of the property with an increase in living_measure, but there seems to be one outlier to this trend, which needs to be evaluated.

Feature: lot_measure

There doesn’t seem to be any relation between lot_measure and price.

For lot_measure <25000

Almost 95% of the houses have lot_measure < 25000, but there is no clear trend between lot_measure and price.

For lot_measure >100000

Price increases with an increase in living measure.

Feature: ceil

There is a slight upward trend in price with ceil, after which the price falls.

Feature: coast

House properties with a waterfront tend to have a higher price than non-waterfront properties.

Feature: sight

Properties with a higher price have a higher number of sights than houses with a lower price.

Sight - Viewed in relation with price and living_measure

The above graph also supports this: properties with a higher price have a higher number of sights than houses with a lower price.

Feature: condition

The price of the house increases with the condition rating of the house.

Condition - Viewed in relation with price and living_measure

Most houses are rated as 3 or more.

So we found that smaller houses are in better condition and that houses in better condition have higher prices.

Feature: quality

As the grade increases, price and living_measure increase (mean and median).
There is a clear increase in the price of the house with a higher quality rating.

quality - Viewed in relation with price and living_measure.

Most houses are graded as 6 or more.

We can see some outliers as well.

Feature: ceil_measure

There is upward trend in price with ceil_measure

Feature: basement

We will create a categorical variable 'has_basement' to separate houses with a basement from houses without one. This categorical variable will be used for further analysis.

basement - after binning, the data shows that houses with a basement are costlier and have a higher living measure (mean & median).

Houses with a basement have a better price than houses without a basement.

basement - houses with a basement have higher price & living measure

Feature: yr_built

As per the graph, most of the houses were built around the 2000s and the fewest around the 1900s. There is a slightly increasing trend: as the year increases, the number of houses built increases.

Feature: yr_renovated

Most renovated houses were renovated after the 1980s. We will create a new categorical variable 'has_renovated' to categorize properties as renovated or non-renovated, and use it for further analysis.

Renovated properties have a higher price than others with the same living measure.
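The two binary flags mentioned in this analysis, 'has_basement' and 'has_renovated', could be derived roughly as below. This is a sketch on made-up rows; it assumes that a basement area of 0 means no basement and that yr_renovated equal to 0 means the house was never renovated.

```python
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "basement": [0, 400, 0, 910],
    "yr_renovated": [0, 1991, 0, 0],
    "price": [300_000, 500_000, 250_000, 650_000],
})

# Assumption: 0 encodes "no basement" / "never renovated".
sample["has_basement"] = (sample["basement"] > 0).astype(int)
sample["has_renovated"] = (sample["yr_renovated"] > 0).astype(int)

# Median price per group, as used in the comparisons in the text.
median_by_basement = sample.groupby("has_basement")["price"].median()
print(median_by_basement)
```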

Feature: furnished

Furnished houses have a higher price than non-furnished houses.

Analyzing Feature: Zipcode, Lat, Long

The above figure aims at understanding how the combination of latitude, longitude and zip code affects the price. This is essentially a geographical study of price differences. We see that the highest prices are concentrated in the center of the map and the lowest prices occur for houses at the coast line. This is probably the reason that most of our houses don’t have coast lines.
The highest prices are found between longitude -122.2 and -122.3 and latitude 47.5 and 47.6. This location also has a high number of houses.
The lowest price is around longitude -121.906 and latitude 47.26.

Correlation

From the above matrix, we found linear relationships among the features below:

1. price: room_bath, living_measure, quality, living_measure15, furnished
2. living_measure: price, room_bath. So we can consider dropping the 'room_bath' variable.
3. quality: price, room_bath, living_measure
4. ceil_measure: price, room_bath, living_measure, quality
5. living_measure15: price, living_measure, quality.
6. lot_measure15: lot_measure.
7. furnished: quality
8. total_area: lot_measure, lot_measure15.

We can plot a heatmap to easily confirm the above findings.

Heat Map

Outlier

We see that all the variables have outliers. The black dots lying far from the boxes mark the outliers. We now need to treat them. Using the remove-outlier code in Python, we removed all the outliers.

Outlier Treatment

This is the boxplot of the cleaner data. We can now see that there are no stray black dots in the boxplots, which confirms that there are no outliers left in the data. We have successfully removed all the outliers.
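The report does not show the outlier code itself. One common way to do this with the IQR rule is to cap values at the boxplot whiskers, sketched below. The helper names iqr_bounds and cap_outliers are hypothetical, the 1.5 x IQR multiplier is the usual boxplot convention, and capping (rather than dropping rows) is an assumption.

```python
import pandas as pd

def iqr_bounds(series: pd.Series) -> tuple[float, float]:
    """Return the lower and upper whisker limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def cap_outliers(series: pd.Series) -> pd.Series:
    """Clip values outside the whiskers to the nearest whisker limit."""
    lower, upper = iqr_bounds(series)
    return series.clip(lower=lower, upper=upper)

# Made-up values with one extreme point.
values = pd.Series([100, 200, 300, 400, 5000])
lower, upper = iqr_bounds(values)
capped = cap_outliers(values)
print(lower, upper)
print(capped.tolist())
```

After capping, no point lies beyond the whiskers, which matches a boxplot without stray dots. Dropping the rows instead would shrink the dataset, so the choice depends on how many records are affected.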

Multicollinearity

The dataset has only 22 features and 1 target column. Since the number of features is not high, we should be able to use all of the features given in the dataset. Out of the 22 features, cid and dayhours are just audit columns and do not add any value to the model in terms of prediction. Based on the above correlation heat map/VIF and the summary listed, we identify the features below, which have a correlation of less than 0.25 with the target variable 'price', as potential columns to be excluded from the data. However, we will experiment with different features to be used/eliminated in the models before finalizing the feature list.

cid, dayhours, zipcode, lot_measure, lot_measure15, long, condition, yr_built, yr_renovated, total_area.
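The 0.25 correlation screen described above can be sketched as follows. The numbers are made up so that one feature is strongly related to price and the other is not; only the threshold comes from the text, and using the absolute correlation is an assumption.

```python
import pandas as pd

# Made-up values: living_measure tracks price, lot_measure does not.
sample = pd.DataFrame({
    "price": [1, 2, 3, 4, 5, 6],
    "living_measure": [1, 2, 3, 4, 5, 7],
    "lot_measure": [1000, 2000, 2000, 1000, 1000, 2000],
})

# Pearson correlation of every feature with the target.
corr_with_price = sample.corr()["price"].drop("price")

# Candidates for exclusion: absolute correlation below the 0.25 threshold.
threshold = 0.25
low_corr_features = corr_with_price[corr_with_price.abs() < threshold].index.tolist()
print(corr_with_price.round(3))
print(low_corr_features)
```

A low linear correlation only flags a candidate for removal; a feature may still matter through a non-linear effect, which is why the final list is decided during modelling.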

Scaling
Scaling is needed because the values of the variables are on different scales. For example, price and the area measures take very different ranges of values, and variables with larger values may get more weightage. Scaling brings all the values into a relatively similar range. Below is the snapshot of the scaled data.

Unscaled data

Scaled data using z score method for scaling

Before Scaling After Scaling

We can see from the above figure that there is no major difference in the shape of the graph before and after scaling.
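Z-score scaling replaces each value x with (x - mean) / standard deviation of its column. A minimal sketch with scikit-learn's StandardScaler on made-up values follows; whether the report used this class or another z-score routine is not stated, so treat it as one possible implementation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales.
sample = pd.DataFrame({
    "price": [300_000.0, 450_000.0, 520_000.0, 1_200_000.0],
    "room_bed": [2.0, 3.0, 3.0, 5.0],
})

scaler = StandardScaler()  # subtracts the column mean, divides by the column std
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)
print(scaled.round(3))
```

Because this is a linear rescaling, each column ends up with mean 0 and unit standard deviation while the shape of its distribution stays the same, which is why the before and after graphs look alike.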

Transformation
A log transformation compresses large values much more than small ones, so it reduces the right skew of a distribution.
In situations where the data is highly skewed and the algorithm we plan to use for prediction assumes that the data is normally distributed, the transformation below can be applied.
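The effect of a log transformation on skewness can be checked numerically. The sketch below uses made-up prices spanning the 75k to 7700k range mentioned earlier, and uses log1p (log of 1 + x) so that zero values would not break the transform; the report does not say which log variant was used.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices.
price = pd.Series([75_000, 200_000, 250_000, 300_000, 350_000,
                   450_000, 600_000, 900_000, 2_000_000, 7_700_000])

log_price = np.log1p(price)  # log(1 + x)

skew_before = price.skew()
skew_after = log_price.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

The skewness after the transform is much closer to 0, meaning the long right tail has been pulled in.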

Graphs of the distribution before and after log transformation are shown for: Price, Living measure, Lot measure, Quality, Ceil measure, Year built, Living measure15, Lot measure15 and Total area.

As per the above graphs of the various numeric fields, the distribution is more spread out and even the boxplot is more spread. Because of the scaled property, the outliers lie nearer to the boxplot. With the transformation the distribution has changed and even achieved better skewness & kurtosis.

Label Encoding

Encoding deals with categorical features

Encoded with numerical values
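Label encoding maps each category to an integer code. The report does not list which columns were encoded, so the column name and categories below are purely hypothetical; the sketch uses scikit-learn's LabelEncoder, which assigns codes in sorted order of the class labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column, for illustration only.
sample = pd.DataFrame({"condition_label": ["good", "fair", "good", "excellent"]})

encoder = LabelEncoder()
sample["condition_code"] = encoder.fit_transform(sample["condition_label"])

print(list(encoder.classes_))             # categories in sorted order
print(sample["condition_code"].tolist())  # integer code per row
```

The codes follow alphabetical order, not quality order, so for ordinal features an explicit mapping may be preferable.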

4) Business Insights from EDA

This section aims at giving certain business insights from the analysis done above. This
takes into account all the pointers that have been coded/ discussed/ elaborated above.

a) I) Not applicable, as we are going to develop a regression model, not a classification model.
II) The value counts of the independent variables do seem to be unbalanced due to the outliers. The IQR method is used for treating the outliers. Outliers can degrade the analysis, resulting in overestimation or underestimation.

b) We have used KMeans clustering to obtain the WSS plot, and it is clearly noticeable that the elbow is at clusters = 3. Hence the optimal number of clusters will be 3.

After running the silhouette score analysis, we arrived at the K-Means clusters. The Fig. shows the clustering using K-Means and the head of the data.
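The elbow (WSS) and silhouette checks can be sketched with scikit-learn as below. The data here is synthetic, generated with three well-separated groups on purpose, so it only illustrates the procedure and does not reproduce the report's result; the range of k values tried is also an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

wss = {}         # within-cluster sum of squares per k (elbow plot values)
silhouette = {}  # mean silhouette score per k
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = model.inertia_
    silhouette[k] = silhouette_score(X, model.labels_)

best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

On the real data the features would be scaled first, and the elbow in the WSS values would be read from a plot of wss against k.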

The above fig gives us the train and test dataset split.

c) I) The missing data in the data set has already been imputed based on the data type.
II) Living measure is the most significant variable in our analysis. Since living measure, lot measure and ceil measure are proportional, we need not spend a lot of time analyzing each of these variables; analysis based on living measure would provide much greater insights.
III) It is evident from the EDA that an ideal house would be one with 2-3 bedrooms and 3 bathrooms. Even though houses with 8 or more bedrooms and bathrooms have sold for a higher price, not many people seem to be buying them. The highest number of records sold are three-bedroom houses, so an equal or even greater revenue could be obtained by selling more houses with three bedrooms and bathrooms.
IV) Although the majority of houses are not furnished, the bivariate analysis shows that furnished houses produce more revenue than unfurnished ones.
V) From the above analysis, we can conclude that high-quality houses have the highest house prices.
VI) These features combined can help estimate the house price.

20 pages
Report - Project8 - FRA - Surabhi - Report
100% (2)
Report - Project8 - FRA - Surabhi - Report
15 pages
Data Visvilization Project Boston-Condo-Sales
No ratings yet
Data Visvilization Project Boston-Condo-Sales
61 pages
Grocery Project
100% (5)
Grocery Project
40 pages
MRA Project Milestone 1 PDF
No ratings yet
MRA Project Milestone 1 PDF
1 page
Time Series Project
50% (4)
Time Series Project
2 pages
MRA Project Milestone 1 - Maminulislam
83% (6)
MRA Project Milestone 1 - Maminulislam
30 pages
Facebook Comment Volume Prediction
100% (1)
Facebook Comment Volume Prediction
12 pages
Financial Risk Analysis Project Report Financial Risk Analysis Project Report
100% (2)
Financial Risk Analysis Project Report Financial Risk Analysis Project Report
29 pages
Mra Project
No ratings yet
Mra Project
12 pages
FRA Assignment
100% (1)
FRA Assignment
31 pages
MRA Project 2: Sudesh Yadav
100% (2)
MRA Project 2: Sudesh Yadav
23 pages
MRA Project - (RFM Analysis Using Python)
No ratings yet
MRA Project - (RFM Analysis Using Python)
8 pages
Mra Project Milestone 2: Kirtesh Tiwari PGP - Data Science and Business Analytics - Pgpdsba Online Sep - C 2021
100% (4)
Mra Project Milestone 2: Kirtesh Tiwari PGP - Data Science and Business Analytics - Pgpdsba Online Sep - C 2021
24 pages
Predictive Modeling
100% (1)
Predictive Modeling
22 pages
Project Report - Advanced - Stats - Final PDF
No ratings yet
Project Report - Advanced - Stats - Final PDF
25 pages
MRA Milestone-1 Graded Project
100% (2)
MRA Milestone-1 Graded Project
41 pages
Final Report Submission Capstone Project House Price Prediction.docx
No ratings yet
Final Report Submission Capstone Project House Price Prediction.docx
30 pages
18BCS115
No ratings yet
18BCS115
25 pages
Project_Report___Vishal_Pradeep
No ratings yet
Project_Report___Vishal_Pradeep
97 pages
FML PROJECT diya (1) (1)
No ratings yet
FML PROJECT diya (1) (1)
9 pages
Machine Learning Projects Python
94% (18)
Machine Learning Projects Python
134 pages
Dynamics
100% (1)
Dynamics
36 pages
Hands On Machine Learning With Python Concepts and Applications For Beginners - John Anderson 2018
91% (11)
Hands On Machine Learning With Python Concepts and Applications For Beginners - John Anderson 2018
166 pages
Lecture 2. Portland Cement: CIV-E2020 Concrete Technology (5 CR)
No ratings yet
Lecture 2. Portland Cement: CIV-E2020 Concrete Technology (5 CR)
31 pages
Statistical Regression and Classification - From Linear Models To Machine Learning
100% (10)
Statistical Regression and Classification - From Linear Models To Machine Learning
532 pages
Machine Learning Projects in Python
100% (16)
Machine Learning Projects in Python
135 pages
Capstone Project Vivek
100% (4)
Capstone Project Vivek
145 pages
Car Transport Machine Learning
89% (9)
Car Transport Machine Learning
28 pages
Full Course of Machine Learning
100% (16)
Full Course of Machine Learning
660 pages
Python For Science and Engineering
100% (14)
Python For Science and Engineering
304 pages
Regression Modeling Strategies - With Applications To Linear Models by Frank E. Harrell
100% (4)
Regression Modeling Strategies - With Applications To Linear Models by Frank E. Harrell
598 pages
Passmedicine Statistics Note 2021: Prepared by DR - Abohaneen Mrcpase Telegram Group
No ratings yet
Passmedicine Statistics Note 2021: Prepared by DR - Abohaneen Mrcpase Telegram Group
25 pages
Statistical Techniques in Business and e
50% (2)
Statistical Techniques in Business and e
29 pages
Test Bank for Business Statistics Communicating with Numbers 2nd Edition by Jaggia and Kelly ISBN 0078020557 9780078020551 - All Chapters Are Available In PDF Format For Download
100% (4)
Test Bank for Business Statistics Communicating with Numbers 2nd Edition by Jaggia and Kelly ISBN 0078020557 9780078020551 - All Chapters Are Available In PDF Format For Download
50 pages
Analysis and Approaches Higher May 2022 Paper 2 TZ1
No ratings yet
Analysis and Approaches Higher May 2022 Paper 2 TZ1
16 pages
Cot 3 PPT
No ratings yet
Cot 3 PPT
65 pages
احصاء ٢
No ratings yet
احصاء ٢
13 pages
12 Data Interpretation
No ratings yet
12 Data Interpretation
36 pages
University of Cambridge International Examinations General Certificate of Education Advanced Level
No ratings yet
University of Cambridge International Examinations General Certificate of Education Advanced Level
4 pages
Sch2105: Chemometrics and Classical Techniques of Chemical Analysis
50% (2)
Sch2105: Chemometrics and Classical Techniques of Chemical Analysis
48 pages
Biostat Exam For Anaesthesia
No ratings yet
Biostat Exam For Anaesthesia
7 pages
Sample Exam 1 FKB 20302
No ratings yet
Sample Exam 1 FKB 20302
9 pages
Valuation With Multiples A Conceptual Analysis
No ratings yet
Valuation With Multiples A Conceptual Analysis
13 pages
Central Tendency
No ratings yet
Central Tendency
17 pages
9th Sample Papers Merged Papers
No ratings yet
9th Sample Papers Merged Papers
74 pages
Assignment 2 Outliers and Normality
No ratings yet
Assignment 2 Outliers and Normality
24 pages
Midterm TQ (Prob Stat)
No ratings yet
Midterm TQ (Prob Stat)
2 pages
Psychological Assessment (Midterms)
No ratings yet
Psychological Assessment (Midterms)
20 pages
Simulatedt Practice Test-8 Professional Education
No ratings yet
Simulatedt Practice Test-8 Professional Education
18 pages
Lec. 2 Statistical Analysis
No ratings yet
Lec. 2 Statistical Analysis
12 pages
Normal Distribution
No ratings yet
Normal Distribution
3 pages
Business Statistics: Shalabh Singh Room No: 231 Shalabhsingh@iim Raipur - Ac.in
No ratings yet
Business Statistics: Shalabh Singh Room No: 231 Shalabhsingh@iim Raipur - Ac.in
58 pages
IMSS31-URD-TemplateQIMTIntoLabWareLIMS v2 20240124125901.110 X
No ratings yet
IMSS31-URD-TemplateQIMTIntoLabWareLIMS v2 20240124125901.110 X
47 pages
Book 1
No ratings yet
Book 1
8 pages
Thesis Edited V
No ratings yet
Thesis Edited V
30 pages
4. Lecture Note 04_ Measures of Central Tendency
No ratings yet
4. Lecture Note 04_ Measures of Central Tendency
15 pages
ps7 Sol
No ratings yet
ps7 Sol
7 pages
I B Maths Standard Notes
100% (1)
I B Maths Standard Notes
159 pages
3.practice Assignment 3.1 - Not Graded
No ratings yet
3.practice Assignment 3.1 - Not Graded
16 pages
High School Statistics and Probability
100% (1)
High School Statistics and Probability
6 pages
3) Exploratory data analysis

4) Business Insights from EDA

Defining Problem Statement

Coast – 0 indicates closer to the waterfront and 1 indicates farther from the waterfront

Condition – 1 indicates Poor Condition and 4 indicates Best Condition

Quality – 1 indicates Poor Quality and 13 indicates Best Quality

Furnished – 0 indicates not furnished and 1 indicates furnished

Understanding business/social opportunity

Visual inspection of data (rows, columns, descriptive details)

The various attributes provided are

1) cid: a notation for a house

Number of Rows & Columns

We see that number of bedrooms and number of bathrooms, living_measure, lot_measure,

• CID: House ID/Property ID. Not used for analysis

3) Exploratory Data Analysis

Removal of unwanted variables

Missing value treatment

The duplicate check in the code shows that there are no duplicate rows
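
A minimal sketch of such a duplicate check, using only the Python standard library. The sample rows and the helper name count_duplicate_rows are hypothetical and are not taken from the report; with pandas, the equivalent check would typically be DataFrame.duplicated().sum().

```python
from collections import Counter

# Hypothetical sample rows; the real dataset has 23 columns.
rows = [
    {"cid": 101, "room_bed": 3, "price": 450000},
    {"cid": 102, "room_bed": 4, "price": 610000},
    {"cid": 103, "room_bed": 2, "price": 300000},
]


def count_duplicate_rows(records):
    """Return how many rows exactly repeat an earlier row."""
    # Turn each row into a hashable, order-independent key.
    counts = Counter(tuple(sorted(r.items())) for r in records)
    # Every occurrence beyond the first counts as a duplicate.
    return sum(n - 1 for n in counts.values())


n_duplicates = count_duplicate_rows(rows)
print(n_duplicates)
```

A result of zero corresponds to the "no duplicates" finding stated above.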

Bivariate analysis (relationship between different variables, correlations)

From the pair plot above, we observed the following

for Feature: room_bed

for Feature: room_bath

For lot_measure <25000

For lot_measure >100000

Price increases with increase in living measure

Sight - Viewed in relation with price and living_measure

Condition - Viewed in relation with price and living_measure

Most houses are rated as 3 or more.

Most houses are graded as 6 or more.

There is an upward trend in price with ceil_measure

Analyzing Feature: Zipcode, Lat, Long

1. price: room_bath, living_measure, quality, living_measure15, furnished

cid, dayhours, zipcode, lot_measure, lot_measure15, long, condition, yr_built, yr_renovated,

The data was scaled using the z-score method

Before Scaling After Scaling
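
Z-score scaling can be sketched as follows: each value has the column mean subtracted and is then divided by the column standard deviation, so the scaled column has mean 0 and standard deviation 1. The living_measure values and the function name z_score_scale are hypothetical. The population standard deviation is assumed here, which is also what scikit-learn's StandardScaler uses.

```python
import statistics


def z_score_scale(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical living_measure values (square feet), for illustration only.
living_measure = [1180.0, 2570.0, 770.0, 1960.0, 1680.0]
scaled = z_score_scale(living_measure)

print([round(v, 3) for v in scaled])
```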

Graph after log transformation
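
A small sketch of why a log transformation is applied to right-skewed variables such as price. The price values are hypothetical, and the use of log(1 + x) rather than a plain log is an assumption; the report does not state which variant was used.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x), which is also defined for zero values."""
    return [math.log1p(v) for v in values]


def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / std ** 3


# Hypothetical right-skewed prices with two expensive outliers.
prices = [250000, 320000, 400000, 450000, 540000, 650000, 1200000, 3000000]
logged = log_transform(prices)

print(round(skewness(prices), 2), round(skewness(logged), 2))
```

The transformed values are less skewed than the raw ones, which is the effect the before and after graphs are meant to show. The transform can be reversed with math.expm1 when predictions need to be reported on the original price scale.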

Encoding deals with categorical features

Encoded with numerical values
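
As an illustration of encoding a categorical feature with numerical values, the sketch below maps each distinct category to an integer code. The text labels and the function name label_encode are hypothetical; the report does not say which columns were encoded or which encoder was used.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Hypothetical text version of the furnished flag, for illustration only.
furnished_text = ["no", "yes", "no", "no", "yes"]
codes, mapping = label_encode(furnished_text)

print(mapping)
print(codes)
```

Integer codes imply an ordering, so this approach suits binary or ordinal features; nominal features with many categories are usually one-hot encoded instead.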

a) I) Not applicable, as we are going to develop a regression model, not a classification model