Housing Price Model - Data Preprocessing

Author: Paul A. Beata
GitHub: pbeata


The original data set comes from the Ames, Iowa housing data on Kaggle.

Description of the columns (features): Each column represents a different feature of the home. Some of the features are numerical (such as "lot area" and "year built) and others are categorical (e.g., "neighboorhood" and "Building Type"). There are some features with absolute measurements, such as areas or distances, but there are also subjective ones like overall quality at the time of sale. A description each feature is provided in the "Ames_Housing_Data_Feature_Description.txt" file included with this project and referenced in this notebook.

Description of the targets: The target is the sale price of the house. Our ultimate goal is to build regression models to predict the sale price of a home based on its features. In this notebook, we will only focus on the data preprocessing.

Description of the rows: Each row of the data set is a single observation (i.e., a single house). Each house is identified in the "PID" column which is a unique identifier for each property.

Summary of data preprocessing performed in this notebook: Since this data set has missing values and outliers, we will first perform the following data preprocessing steps before building the regression models in a separate notebook.

  1. Identify the potential outliers in the data set
  2. Handle missing data using row-wise and/or column-wise solutions as appropriate
  3. Create data checkpoints throughout the preprocessing
  4. Handle the categorical data by making dummy variables
  5. Save the final (cleaned) data set at the end of the notebook

Part 0: Load the Original Data

Here we can see a detailed description of the features (columns) in the data set before we begin preprocessing:

Observe Correlations

Here we look at the correlation of the various features to see which ones have large positive or negative correlations with the overall house price. In this heat map, we only look at the continuous variables for now since we cannot compute the correlation of categorical values.

Same correlation analysis but with a few adjustments:

  1. Show the correlation values above in a bar chart
  2. Sale price is 100% correlated to itself, so we remove it from the plot
  3. PID is simply the property ID number, so we remove it from the plot

In the correlation plot below, we can see the numerical features with the highest positive (to the right) and negative (to the left) correlation to the final sale price.

Part 1: Handling Outliers

From the scatter plots above, we can see that there are potentially a few outliers in the original data set. For example, there are some houses with an overall rating of 9 or 10, but a sale price less than $200,000. Clearly there are other factors affecting the prices of these homes, but these few data points could be considered outliers since the general trend seen above is: the greater the rating, the greater the sale price. We can confirm this here with a box plot (below).

We choose to drop the observations that have a living area greater than 4000 square feet and a final sale price of less than $400,000. This corresponds to 3 rows being dropped out of 2930 total rows (0.1% of the data being dropped).

Part 2: Handling Missing Data

In order to get a better idea of the missing values that we have in the data set, we can use a function that will compute the percentage of missing values for each feature. The percentages computed by the function are sorted and filtered such that we only focus on features (columns) where there are at least 1 missing value.

By looking at the percent of missing values above (where the percent missing is less than 1%), we can see that many of these features are related to basements: the smallest six values in the cell above are all associated with basements ("Bsmt"). It seems that these missing values are related to the fact that some homes do not have basements. We can use this knowledge to fill in such missing values with zero (e.g., for features like total basement area: if there is no basement, then the area can logically be set to zero).

Fixing Data in Columns (Features)

Two approaches to consider:

The final feature with missing values now is the "lot frontage" feature.

For the missing values in the lot frontage area, we will use the average frontage area for homes in the same neighborhood.

At this point, we have now handled all the missing values in the Ames housing data set.

Original Data: 2930 rows with 81 columns

NEW Data with No Missing Values: 2925 rows with 76 columns

Part 3: Handling Categorical Data