Housing Price Model - Regression Analysis

Author: Paul A. Beata
GitHub: pbeata


The original data set comes from the Ames, Iowa housing data on Kaggle.

Load the Processed Data

In the data preprocessing notebook, we took care of the outliers, missing values, and categorical data in order to prepare our data set for these machine learning models.
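As a minimal sketch (the file name below is a placeholder for whatever the preprocessing notebook saved), loading the processed data and separating the features from the target looks like this:

```python
import pandas as pd

# Load the processed data set saved by the preprocessing notebook.
# NOTE: "ames_processed.csv" is a placeholder file name.
df = pd.read_csv("ames_processed.csv")

# Separate the features from the target (the sale price).
X = df.drop("SalePrice", axis=1)
y = df["SalePrice"]
```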

Split the Data for Training and Testing

Scale the features using the standard scaler (we do not need to scale the targets):
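A sketch of the split and scaling steps, assuming a 10% test split (matching the withheld split used in the summary below) and an arbitrary random seed:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Withhold 10% of the data for testing; the seed is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=101
)

# Fit the scaler on the training features only, then transform both splits.
# The targets (sale prices) are left unscaled.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Fitting the scaler on the training split alone prevents information from the test set from leaking into the model.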

Model 1: Linear Regression using Elastic Net

Combination of Ridge + Lasso Regression

We will use a grid search to find the best alpha values and the L1-ratio for the Elastic Net model.
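A sketch of this grid search; the exact candidate values are an assumption, chosen to include the alphas (8, 16, 32) discussed below:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters (assumed grid).
param_grid = {
    "alpha": [0.1, 1, 8, 16, 32, 100],
    "l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
}

# Score each combination with 5-fold cross-validation on the training set,
# using (negative) mean squared error as the metric.
grid_search = GridSearchCV(
    estimator=ElasticNet(max_iter=10_000),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```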

Using the mean squared error as our scoring metric, the best hyperparameters found during the grid search are the following:

$\alpha = 16$

$L_1 \text{ ratio} = 1.0$

Since we explored alpha values of 8, 16, and 32, we can try to focus on this range of [8, 32] to see if there is a better alpha that we missed:
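A sketch of the refined search, sweeping every integer alpha in [8, 32] with the L1 ratio held fixed at 1.0:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Sweep integer alphas in [8, 32]; keep the L1 ratio fixed at 1.0.
param_grid = {"alpha": np.arange(8, 33), "l1_ratio": [1.0]}
grid_search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```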

Therefore, we will use an alpha value of either 14 or 16 in the Elastic Net regression model. Alpha is simply the constant that multiplies the penalty terms; an alpha of 0 is equivalent to ordinary linear regression.

For reference, the scikit-learn documentation describes l1_ratio as "the ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2." Therefore, with an l1_ratio of 1.0, we are using the pure L1 penalty method (Lasso).

Elastic Net Model Predictions

Plot the predicted values versus the actual known values for the target prices (y_test):
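A sketch of this comparison plot, assuming matplotlib:

```python
import matplotlib.pyplot as plt

# Predict on the withheld test set using the tuned Elastic Net model.
y_pred = grid_search.best_estimator_.predict(X_test)

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)

# Reference line: perfect predictions would fall on y = x.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, "r--")
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.show()
```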

In order to test various models, I created this function so that we can produce the same plot as the one above and automatically compute the mean absolute error and mean squared error:
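A minimal sketch of such a helper (the original function's exact signature is not shown here); it reuses the plot above and reports MAE alongside RMSE:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model(model, X_test, y_test, label=""):
    """Plot predicted vs. actual sale prices and report MAE and RMSE."""
    y_pred = model.predict(X_test)

    # Same plot as above: predictions against known prices.
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
    plt.plot(lims, lims, "r--")
    plt.xlabel("Actual Sale Price")
    plt.ylabel("Predicted Sale Price")
    plt.title(label)
    plt.show()

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{label}: MAE = ${mae:,.0f}, RMSE = ${rmse:,.0f}")
    return mae, rmse
```

For example, `evaluate_model(grid_search.best_estimator_, X_test, y_test, label="Elastic Net")` reproduces the plot and metrics for Model 1.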

Model 2: Ordinary Linear Regression
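As a baseline, a plain least-squares fit with no regularization, evaluated with the same helper:

```python
from sklearn.linear_model import LinearRegression

# Ordinary least squares: no penalty term at all.
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
evaluate_model(linear_model, X_test, y_test, label="Linear Regression")
```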

Model 3: Lasso Regression Only

Lasso regularization allows for a sort of "automatic" feature selection as some of the model coefficients could become exactly zero when using Lasso for regression.
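A sketch of the Lasso fit; selecting alpha by cross-validation is an assumption here, but the zero-coefficient count at the end matches the check described below:

```python
import pandas as pd
from sklearn.linear_model import LassoCV

# Pick the penalty strength alpha by 5-fold cross-validation (assumed approach).
lasso_model = LassoCV(cv=5, max_iter=10_000)
lasso_model.fit(X_train, y_train)

# Tabulate the fitted coefficients and count how many are exactly zero.
coef_table = pd.DataFrame({"feature": X.columns, "coefficient": lasso_model.coef_})
print(coef_table)
print("Zero coefficients:", (coef_table["coefficient"] == 0).sum())
```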

We can see from the table above that 68 coefficients are exactly zero. This implies that 68 of the 273 features (including the dummy variables created during preprocessing) are not used in the model's predictions.

Model 4: Ridge Regression Only
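A sketch of the Ridge fit, using cross-validation over a small (assumed) alpha grid:

```python
from sklearn.linear_model import RidgeCV

# Pure L2 penalty; alpha is selected by cross-validation over the given grid.
ridge_model = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring="neg_mean_absolute_error")
ridge_model.fit(X_train, y_train)
evaluate_model(ridge_model, X_test, y_test, label="Ridge Regression")
```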

Model 5: Random Forest Regressor
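A sketch of the Random Forest grid search; the grid values are assumptions, chosen to include the max depth of 10 and max features of 32 discussed in the summary below:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumed grid; includes max_depth=10 and max_features=32.
param_grid = {
    "n_estimators": [64, 128],
    "max_depth": [4, 8, 10],
    "max_features": [16, 32, 64],
}
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=101),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
rf_grid.fit(X_train, y_train)
evaluate_model(rf_grid.best_estimator_, X_test, y_test, label="Random Forest")
```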

Summary of Model Performance

| Model | MAE | RMSE | Notes |
| --- | --- | --- | --- |
| 1. Elastic Net | \$14,197 | \$20,172 | Lowest RMSE |
| 2. Linear | \$14,577 | \$20,849 | |
| 3. Lasso | \$14,191 | \$20,554 | Lowest MAE |
| 4. Ridge | \$14,275 | \$20,867 | |
| 5. Random Forest | \$15,366 | \$21,807 | |

The best regression model in terms of mean absolute error (measured on the 10% of the data withheld for testing) was the Lasso model. The lowest root mean squared error came from the Elastic Net with an L1 ratio of 1, which is essentially pure Lasso regression. We can confirm this by comparing the mean absolute errors of the Elastic Net and Lasso models above, which are quite similar.

For a mean house sale price of \$180,815 across the full data set, the mean absolute error of the Lasso model is only 7.8\% of the average price.

The Random Forest Regressor performed the worst out of this group of regression models (relatively speaking). While Random Forests are commonly used for classification problems, scikit-learn provides a Random Forest regressor as well. However, a closer look at the grid search results for the Random Forest Regressor shows that with a max depth of 10 and max features of 32, the mean absolute error was \$15,615: the error only increased from \$14,191 (Lasso) to \$15,615 (Random Forest), yet the forest needed only 32 features compared to the 205 used by Lasso. Recall that 68 of the Lasso coefficients dropped to zero, so only 273 - 68 = 205 features contributed to its predictions.

For a mean house sale price of \$180,815 across the full data set, the mean absolute error of the Random Forest model is only 8.5\% of the average price.