Business: Capstone Project House Price Prediction Project Note-1
REPORT
Capstone Project
Project Note-1
SONAL SINGH
01/05/2022
CONTENT
1) Introduction of the Problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity
2) Data Report
a) Understanding how data was collected in terms of time, frequency
and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)
1) INTRODUCTION
This section introduces the project, gives a basic understanding of it, and states the
objectives of the analysis. The analysis deals with predicting house prices from the factors
given in the data set that describe the attributes of a house. In other words, it aims to
understand the real estate market of the given geographical location. The price of a house does
not depend only on the square footage it occupies; other factors such as the number of
bedrooms, bathrooms and floors, basement area, condition and quality of the house, year built,
waterfront/beachfront location, age of the house and time since renovation also play a major
role in determining its cost. Through this project we try to derive patterns in the data,
explore further questions, and answer them by applying the learning and models from the past
11 months of study.
Assumptions
This section explains the attributes in the data set that are not explained well in the
problem statement.
ceil: the number of levels/floors of the house; 1 is the lowest value in the data and 3.5 is
the highest.
Scope of Project
This section looks at the real-world scope of this project from a data scientist's point of
view. Real estate is an always-active market. It is also one of the markets hit hardest in
times of economic distress. As per research, real estate generates almost 35 percent of the
total revenue of the country’s economy. For the young population, real estate is the most
viable option to invest in. The market kept working during the Corona pandemic too, although
it saw crashes and booms in parallel with stock market movements.
A seller cannot easily estimate the price of a house. The features of the house can help
evaluate its price, and different houses have different features. Comparing the features of
many houses can help evaluate relevant prices. Hence, analyzing the bulk of the data can help
predict the house price.
How can houses and buildings be priced profitably, so that neither the seller nor the buyer is
at a loss? That is where the factors affecting the price of a house come into the picture. If a
fair evaluation is made of all the factors, how they contribute and why they contribute, then a
profitable figure can be derived that leads to a win-win situation for both parties. Given the
contribution of real estate to the economy and to an individual's standard of living, it is
essential for us to contribute our data skills towards a fair and profitable future.
2) Data Report
Understanding how data was collected in terms of time, frequency and methodology
This section describes how the data was collected.
This is the Capstone Project run by Great Learning, so the “House Price Prediction” data was
provided to us through the learning platform.
The data was collected from 2014 to 2015.
8) ceil: Total floors (levels) in house
9) coast: House which has a view to a waterfront
10) sight: Has been viewed
11) condition: How good the condition is (Overall)
12) quality: grade given to the housing unit, based on grading system
13) ceil_measure: square footage of house apart from basement
14) basement_measure: square footage of the basement
15) yr_built: Built Year
16) yr_renovated: Year when house was renovated
17) zipcode: zip
18) lat: Latitude coordinate
19) long: Longitude coordinate
20) living_measure15: Living room area in 2015 (implies some renovations). This might or
might not have affected the lot size area
21) lot_measure15: Lot size area in 2015 (implies some renovations)
22) furnished: Based on the quality of room
23) total_area: Measure of both living and lot
Fig. 1 gives an initial look at the data. It tells us that the data has 23 columns. These
columns are the different factors that impact the price of a house, such as the number of
bedrooms, number of bathrooms, number of floors, quality of the house and condition of the
house. Each column has a different name and a different meaning.
The dataset has 21613 rows and 23 columns. This tallies with the figure above, where we also
counted 23 columns. The 21613 rows mean there are 21613 recorded instances. These rows may
contain missing data or duplicates. They may also contain unwanted inputs, such as an object
(string) value in a float/integer column.
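The row, column and duplicate checks described above can be sketched in pandas. This is a
minimal illustration on a made-up four-row frame, not the project's actual code; only the
column names cid, price and room_bed come from the report, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real data set (values are made up).
df = pd.DataFrame({
    "cid": [101, 102, 102, 104],
    "price": [221900.0, 538000.0, 538000.0, 604000.0],
    "room_bed": [3, 3, 3, 4],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

print(n_rows, n_cols, n_duplicates)
```

On the real data the same two calls would report the 21613 rows and 23 columns quoted above.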
Data Info
The dataset has more than 21k records and 23 columns, out of which
• 12 features are of float type
• 4 features are of integer type
• 7 features are of object type
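A count of columns per data type, like the one listed above, can be obtained as in the
following sketch. The frame is a hypothetical three-column example, and the "$" entry is an
assumed placeholder for the kind of bad value that forces a numeric column to object type.

```python
import pandas as pd

# Hypothetical sample; "$" is a made-up bad entry in an otherwise numeric column.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "room_bed": [3, 2, 4],
    "ceil": pd.Series(["1", "2", "$"], dtype=object),
})

# Number of columns per dtype (float, integer, object).
dtype_counts = df.dtypes.astype(str).value_counts()

# Object columns are the ones worth inspecting for stray non-numeric entries.
object_cols = df.select_dtypes(include="object").columns.tolist()

print(dtype_counts)
print(object_cols)
```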
Data Description
Besides graphs, statistics that summarize the distribution of the data are used to turn data
into information. The five-number summary (minimum, first quartile, median, third quartile,
maximum), which forms the basis of a boxplot, is a good example of summarizing data. The table
above shows the summary statistics of the dataset.
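As a small worked example of the five-number summary, the sketch below computes it for a
made-up price series; the numbers are illustrative only. For a whole frame, pandas'
describe() reports the same quantiles together with the count, mean and standard deviation.

```python
import pandas as pd

# Hypothetical prices chosen so the quartiles are easy to read off.
price = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0], name="price")

# Minimum, first quartile, median, third quartile, maximum.
five_num = price.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(five_num)
```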
From the above analysis we learn that the distribution of most columns is right-skewed and only
a few features are left-skewed (such as room_bath, yr_built, lat).
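The direction of skew can be read from the sign of the sample skewness: positive for a long
right tail, negative for a long left tail. The sketch below shows this on two made-up columns
whose names are hypothetical and do not belong to the data set.

```python
import pandas as pd

# Two hypothetical columns: one with a long right tail, one mirrored to the left.
df = pd.DataFrame({
    "right_tail": [1, 1, 2, 2, 3, 20],
    "left_tail": [-20, -3, -2, -2, -1, -1],
})

# Sample skewness per column: > 0 means right-skewed, < 0 means left-skewed.
skewness = df.skew()

print(skewness)
```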
This tells us that, out of 21613 entries, the highest count of missing (null) values is 166, in
living_measure15. The columns with the next highest counts of missing data are the number of
bedrooms and the number of bathrooms. The rest of the columns have substantially fewer missing
values: lot_measure15, furnished and total_area have only 29 nulls each, and sight and
condition have just 57 each. Notably, the highest null count of 166 is far below 30 percent of
the 21613 rows: 166 / 21613 is approximately 0.77 percent. This implies that at most about 0.8
percent of the values in any column are missing, and these need to be treated to get more
accurate results.
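The share of missing values follows directly from the counts quoted above: 166 out of 21613
rows works out to well under 1 percent. The sketch below does that arithmetic and then shows
how the same percentage can be computed per column in pandas; the four-row frame is a
hypothetical example, not the real data.

```python
import numpy as np
import pandas as pd

# Arithmetic from the report: largest null count over total rows.
total_rows = 21613
max_missing = 166
pct_missing = 100 * max_missing / total_rows  # roughly 0.77 percent

# Per-column version on a made-up frame: mean of the null mask times 100.
df = pd.DataFrame({
    "living_measure15": [1800.0, np.nan, 1690.0, 2720.0],
    "furnished": [0.0, 1.0, 0.0, 0.0],
})
missing_pct_by_column = df.isna().mean() * 100

print(round(pct_missing, 2))
print(missing_pct_by_column)
```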
Bad data and missing data were treated. Bad data was replaced with NaN, and the null values
were then imputed using a simple imputer and the mode method.
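A minimal sketch of this two-step treatment is given below. It is an illustration under
assumptions, not the project's code: the "$" token is a hypothetical bad entry, and plain
pandas fillna with the median and the mode is used here as a stand-in for the simple imputer
mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "$" is a made-up bad entry, NaN is a missing value.
df = pd.DataFrame({
    "ceil": pd.Series(["1", "2", "$", "1"], dtype=object),
    "furnished": [0.0, np.nan, 0.0, 1.0],
})

# Step 1: anything that cannot be parsed as a number becomes NaN.
df["ceil"] = pd.to_numeric(df["ceil"], errors="coerce")

# Step 2: impute. Median for the numeric column, mode for the category-like flag.
df["ceil"] = df["ceil"].fillna(df["ceil"].median())
df["furnished"] = df["furnished"].fillna(df["furnished"].mode()[0])

print(df)
```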
Duplicates
Univariate analysis (distribution and spread for every continuous attribute, distribution of
data in categories for categorical ones)
This is the simplest form of data analysis, where the data being analyzed consists of just one
variable. Since it is a single variable, it doesn’t deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
Very few houses are renovated: only 914 out of the total 21613 records.
Houses with no sight (a value of 0) are the most common, followed by a smaller number of houses
with 2 sights; houses with 1 or 4 sights are very rare.
Most of the houses in the dataset have between 0 and 5 bedrooms.
More houses were built from the year 2000 onwards; fewer houses were constructed from 1900 to
1950.
Unfurnished houses dominate the data set: about 17500 houses are unfurnished and only about
4000 are furnished.
Most of the houses in the dataset are not on the coast; only a negligible number are near the
coast.
• room_bed: The plot of our target variable (price) against room_bed is not linear. Its
distribution has several Gaussian-like peaks.
• room_bath: Its plot against price shows a somewhat linear relationship. The distribution has
several Gaussian-like peaks.
• living_measure: The plot against price shows a strong linear relationship. It also has a
linear relationship with room_bath, so we might remove one of the two. The distribution is
right-skewed.
• lot_measure: No clear relationship with price.
• ceil: No clear relationship with price. It has only 6 unique values, so we can convert this
column into a categorical column.
• coast: No clear relationship with price. It is clearly a categorical variable with 2 unique
values.
• sight: No clear relationship with price. It has 5 unique values and can be converted to a
categorical variable.
• condition: No clear relationship with price. It has 5 unique values and can be converted to
a categorical variable.
• quality: Somewhat linear relationship with price. It has discrete values from 1 to 13 and
can be converted to a categorical variable.
• ceil_measure: Strong linear relationship with price, and also with the room_bath and
living_measure features. The distribution is right-skewed.
• basement: No clear relationship with price.
• yr_built: No clear relationship with price.
• yr_renovated: No clear relationship with price. It has 2 unique values and can be converted
to a categorical variable that tells whether the house is renovated or not.
• zipcode, lat, long: No clear relationship with price or any other feature.
• living_measure15: Somewhat linear relationship with the target feature. It behaves the same
as living_measure, so we can drop this variable.
• lot_measure15: No clear relationship with price or any other feature.
• furnished: No clear relationship with price or any other feature. It has 2 unique values and
can be converted to a categorical variable.
• total_area: No clear relationship with price, but it has a very strong linear relationship
with lot_measure, so one of them can be dropped.
• There is a linear relationship between lot_measure and total_area, and also some linear
relationship between ceil_measure and living_measure.
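Several of the points above suggest converting columns with only a handful of unique values
into categorical variables. A hedged sketch of that step follows; the five-row frame is made
up, and the cut-off max_levels is a hypothetical choice for this toy data, not a value taken
from the project.

```python
import pandas as pd

# Hypothetical sample using three column names from the report.
df = pd.DataFrame({
    "ceil": [1.0, 1.5, 2.0, 1.0, 3.5],
    "coast": [0, 0, 1, 0, 0],
    "living_measure": [1180, 2570, 770, 1960, 1680],
})

# Assumed cut-off: treat a column as categorical if it has at most this many levels.
max_levels = 4
low_card = [c for c in df.columns if df[c].nunique() <= max_levels]

# Convert only the low-cardinality columns; the others keep their numeric dtype.
df = df.astype({c: "category" for c in low_card})

print(low_card)
print(df.dtypes)
```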
Bivariate Analysis
There is a clear increasing trend in price with room_bed: price increases with the number of
bedrooms.
There is an upward trend in price with room_bath: price increases with the number of
bathrooms.
Feature: living_measure
There is a clear increase in the price of the property as living_measure increases, but there
seems to be one outlier to this trend, which needs to be evaluated.
Feature: lot_measure
There doesn’t seem to be any relation between lot_measure and price.
Almost 95% of the houses have a lot_measure below 25000, but there is no clear trend between
lot_measure and price.
Feature: ceil
There is a slight upward trend in price with ceil, after which the price falls.
Feature: coast
Properties with a waterfront tend to have a higher price than non-waterfront properties.
Feature: sight
Properties with a higher price have more sights than houses with a lower price.
The above graph also supports this observation.
Feature: condition
The price of the house increases with condition rating of the house
We found that smaller houses are in better condition, and houses in better condition have
higher prices.
Feature: quality
As the grade increases, price and living_measure (mean and median) increase.
There is a clear increase in the price of the house with a higher quality rating.
quality - Viewed in relation with price and living_measure.
Feature: ceil_measure
Feature: basement
We will create a categorical variable, 'has_basement', that distinguishes houses with a
basement from houses without one. This categorical variable will be used for further analysis.
basement - after binning, the data shows that houses with a basement are costlier and have a
higher living measure (mean & median)
Houses with a basement have a better price than houses without a basement
basement - houses with a basement have higher price & living measure
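The 'has_basement' flag and the mean/median comparison described above can be sketched as
follows. The five rows are made up, and the sketch assumes that a basement area of 0 means
the house has no basement; the column name basement follows the usage in this section.

```python
import pandas as pd

# Hypothetical sample; a basement area of 0 is assumed to mean "no basement".
df = pd.DataFrame({
    "basement": [0, 400, 0, 910, 0],
    "price": [221900, 604000, 180000, 510000, 257500],
    "living_measure": [1180, 1960, 770, 1680, 1715],
})

# Binary flag: 1 if the house has any basement area, else 0.
df["has_basement"] = (df["basement"] > 0).astype(int)

# Mean and median of price and living_measure for each group.
summary = df.groupby("has_basement")[["price", "living_measure"]].agg(["mean", "median"])

print(summary)
```

The same pattern would apply to a renovated/non-renovated flag built from yr_renovated.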
Feature: yr_built
As per the graph, most of the houses were built around the 2000s and the fewest around the
1900s. There is a slightly increasing trend: the number of houses built increases with the
year.
Feature: yr_renovated
Most renovated houses were renovated after the 1980s. We will create a new categorical
variable, 'has_renovated', to categorize each property as renovated or non-renovated, and use
it for further analysis.
Renovated properties have a higher price than others with the same living measure.
Feature: furnished
Furnished houses have higher price than that of the Non-furnished houses
The above figure helps us understand how the combination of latitude, longitude and zip code
affects the price. This is essentially a geographical study of how price varies. We see that
the highest prices are concentrated in the center of the map and the lowest occur at the
coastline houses. This is probably the reason that most of our houses are not on the coast.
The highest prices are found around longitude -122.2 to -122.3 and latitude 47.5 to 47.6. This
location also has a high number of houses.
The lowest prices are around longitude -121.906 and latitude 47.26.
Correlation
From the above matrix, we know that the features below have linear relationships.
We can plot a heatmap to easily confirm these findings.
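The correlation matrix behind such a heatmap can be computed as in the sketch below. The five
rows are hypothetical values, so the coefficients are illustrative and not the project's
results. The resulting matrix is what would typically be passed to a plotting call such as
seaborn's heatmap.

```python
import pandas as pd

# Hypothetical sample with three column names from the report.
df = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000, 510000],
    "living_measure": [1180, 2570, 770, 1960, 1680],
    "lot_measure": [5650, 7242, 10000, 5000, 8080],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Correlation of every feature with the target, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(price_corr)
```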
Heat Map
Outlier
We see that all the variables have outliers. The black dots lying far from the boxes mark the
outliers. We now need to treat them. Using outlier-removal code in Python, we removed all the
outliers.
Outlier Treatment
The boxplots of the cleaned data show no outliers: there are no stray black dots in the
boxplots any more. This confirms that there are no outliers left in the data and that we have
successfully removed all of them.
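The report does not show the outlier-removal code, so the following is only a sketch of one
common approach, assuming the usual boxplot rule: points more than 1.5 times the interquartile
range beyond the quartiles count as outliers. The helper name iqr_bounds and the price values
are hypothetical.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) whisker limits of a boxplot for series s."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices with one obvious outlier.
price = pd.Series([100.0, 200.0, 300.0, 400.0, 5000.0])

low, high = iqr_bounds(price)

# Keep only the values that fall inside the whiskers.
kept = price[price.between(low, high)]

print(low, high)
print(kept.tolist())
```

An alternative to dropping rows is to cap values at the two limits, which keeps the row count
unchanged.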
Multicollinearity
The dataset has only 22 features and 1 target column. Since 22 features is not a high number,
we should be able to use all the features given in the dataset. Of the 22 features, cid and
dayhours are clearly just audit columns and add no value to the model in terms of prediction.
Based on the above correlation heat map, VIF and the summary listed, we identify the features
below, which have a correlation of less than 0.25 with the target variable 'price', as
potential columns to exclude from the data. However, we will experiment with using or
eliminating different features in the models before finalizing the feature list.
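The variance inflation factor (VIF) mentioned above measures how well each feature is
explained by the others: VIF = 1 / (1 - R^2), which equals the corresponding diagonal entry of
the inverse of the feature correlation matrix. The sketch below uses that identity on
synthetic data; the generated values are assumptions for illustration and do not reproduce
the project's VIF table.

```python
import numpy as np
import pandas as pd

# Synthetic features: ceil_measure is built to be nearly collinear with living_measure.
rng = np.random.default_rng(0)
living = rng.normal(2000, 500, 200)
X = pd.DataFrame({
    "living_measure": living,
    "ceil_measure": living * 0.8 + rng.normal(0, 50, 200),
    "lot_measure": rng.normal(8000, 2000, 200),
})

# VIF of each feature is the matching diagonal entry of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(X.corr().to_numpy())), index=X.columns)

print(vif.round(2))
```

A large VIF for two features that move together suggests keeping only one of them.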
Scaling
Scaling is needed when variables are measured on different scales, as they are here. Variables
such as price take much larger values than others and may therefore get more weight. Scaling
brings all the values into a comparable range. Below is a snapshot of the scaled data.
Unscaled data
From the above figure we can see that there is no major difference in the shape of the graphs
before and after scaling.
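The report does not state which scaler was used, so the sketch below assumes z-score
standardization (subtract the mean, divide by the standard deviation) on made-up values.
Because this is a linear rescaling, it changes the units but not the shape of a distribution,
which is why the graphs look alike before and after scaling.

```python
import pandas as pd

# Hypothetical sample with two columns on very different scales.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "room_bed": [3.0, 3.0, 2.0, 4.0, 3.0],
})

# z-score: each column ends up with mean 0 and standard deviation 1.
scaled = (df - df.mean()) / df.std(ddof=0)

# Skewness is unchanged by a linear rescaling, so the shape is preserved.
skew_change = (scaled.skew() - df.skew()).abs().max()

print(scaled.round(3))
print(skew_change)
```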
Transformation
A log transformation compresses large values much more than small ones, which pulls in the
long right tail of a skewed variable.
In situations where the data is highly skewed and the algorithm we plan to use for prediction
requires the data to be approximately normally distributed, the transformation below can be
applied.
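The effect of a log transformation on a right-skewed variable can be checked numerically. The
sketch below uses made-up prices with one very large value and NumPy's log1p, which is an
assumed choice here because it also handles zeros safely; the report does not say which
logarithm variant was used.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: one value is far larger than the rest.
price = pd.Series([180000, 221900, 257500, 510000, 538000, 604000, 2500000], dtype=float)

# log1p computes log(1 + x); expm1 is its exact inverse.
log_price = np.log1p(price)

print(round(price.skew(), 2), round(log_price.skew(), 2))
```

Predictions made on the log scale can be mapped back to prices with np.expm1.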
Price
Lot measures
Graph after log transformation
Quality
Ceil measure
Year built
Graph after log transformation
Living measure15
Total area
Graph after log transformation
As per the above graphs of the various numeric fields, the distributions are more spread out
and the boxplots are wider; because of the scaling, the outliers lie nearer to the boxes. With
the transformation, the distributions have changed and achieve better skewness and kurtosis.
Label Encoding
4) Business Insights from EDA
This section aims at giving certain business insights from the analysis done above. This
takes into account all the pointers that have been coded/ discussed/ elaborated above.
b) We used KMeans clustering to obtain the WSS plot, and it is clearly noticeable
that the elbow is at clusters = 3. Hence the optimal number of clusters is 3.
After running the silhouette score analysis, we fitted the K-Means clusters; the Fig.
shows the clustering using K-Means and the head of the data.
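The WSS (within-cluster sum of squares) values behind an elbow plot can be collected as in
the sketch below, which uses scikit-learn's KMeans and its inertia_ attribute. The data is
synthetic, generated around three assumed centers so that the elbow falls at 3; it does not
reproduce the project's clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# WSS for k = 1..6; inertia_ is the sum of squared distances to the nearest center.
wss = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(model.inertia_)

print([round(w, 1) for w in wss])
```

The elbow is the value of k after which adding another cluster barely reduces the WSS.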
The above fig. shows the split of the dataset into train and test sets.
c) I) The missing data in the data set has already been imputed based on the data type.
II) Living measure is the most significant variable in our analysis. Since living
measure, lot measure and ceil measure are proportional, we need not spend a lot
of time analyzing each of these variables; analysis based on living measure provides
much greater insight.
III) It is evident from the EDA that an ideal house would have 2-3 bedrooms
and 3 bathrooms. Even though houses with 8 or more bedrooms and bathrooms
have sold for a higher price, few people seem to buy them. Three-bedroom houses
account for a higher number of sales, so equal or even more revenue could be
obtained by selling more houses with three bedrooms and bathrooms.
IV) Although the majority of houses are not furnished, the bivariate analysis shows that
furnished houses produce more revenue than unfurnished ones.
V) From the above analysis, we can conclude that high-quality houses have the
highest prices.
VI) These features combined can help estimate the house price.