Regression Model: Housing Price Prediction with R and Tidymodels

A Comprehensive Guide from EDA to Deployment

1 Introduction

This document provides a comprehensive, step-by-step guide to building a regression model using the tidymodels ecosystem in R. We will use a housing dataset to predict house prices, a classic regression problem. This project will walk you through the entire machine learning pipeline, from initial data loading and exploratory data analysis (EDA) to sophisticated model training, hyperparameter tuning, and final model evaluation.

The primary goal is to demonstrate a robust and reproducible workflow for regression tasks. We will explore various algorithms, compare their performance, and select the best model for making predictions on unseen data.

2 Installing and Loading Libraries

2.1 Step 1: Install Necessary Libraries

First, we ensure that all the required R packages are installed. We use the pak package for efficient and reliable package management.

Code
# This chunk installs all necessary packages. 
# It's set to not evaluate by default to avoid re-installing every time.
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}
# pak::pak handles dependencies gracefully.
pak::pak(c("tidymodels", "tidyverse", "here", "reactable", "ranger", "xgboost", "glmnet", "future", "skimr", "naniar"))

2.2 Step 2: Load Libraries

Next, we load the libraries that will be used throughout the analysis. Each library serves a specific purpose, from data manipulation (tidyverse) to modeling (tidymodels) and creating interactive tables (reactable).

Code
library(tidymodels)      # Core package for modeling
library(tidyverse)       # Data manipulation and visualization
library(here)            # For reproducible file paths
library(reactable)       # For interactive tables
library(glmnet)          # For regularized regression models
library(ranger)          # For Random Forest
library(xgboost)         # For XGBoost
library(future)          # For parallel processing
library(skimr)           # For summary statistics
library(naniar)          # For visualizing missing data

3 Data Preparation

3.1 Step 3: Read Data

We load the housing dataset, which is stored in a CSV file. The here package helps in locating the file relative to the project’s root directory, making the code more portable.

Code
housing_df <- read_csv(here("../data", "Housing.csv"))
Code
# Glimpse the data to see its structure and variable types
glimpse(housing_df)
Rows: 545
Columns: 13
$ price            <dbl> 13300000, 12250000, 12250000, 12215000, 11410000, 108…
$ area             <dbl> 7420, 8960, 9960, 7500, 7420, 7500, 8580, 16200, 8100…
$ bedrooms         <dbl> 4, 4, 3, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3,…
$ bathrooms        <dbl> 2, 4, 2, 2, 1, 3, 3, 3, 1, 2, 1, 3, 2, 2, 2, 1, 2, 2,…
$ stories          <dbl> 3, 4, 2, 2, 2, 1, 4, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4,…
$ mainroad         <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
$ guestroom        <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "yes…
$ basement         <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "…
$ hotwaterheating  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no",…
$ airconditioning  <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no",…
$ parking          <dbl> 2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2,…
$ prefarea         <chr> "yes", "no", "yes", "yes", "no", "yes", "yes", "no", …
$ furnishingstatus <chr> "furnished", "furnished", "semi-furnished", "furnishe…

3.2 Step 4: Exploratory Data Analysis (EDA)

EDA is a crucial step to understand the data’s underlying structure, identify missing values, and uncover relationships between variables and the target, price.

Code
# Get a quick summary of the dataset
summary(housing_df)
     price               area          bedrooms       bathrooms    
 Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
 1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
 Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
 Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
 3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
 Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
    stories        mainroad          guestroom           basement        
 Min.   :1.000   Length:545         Length:545         Length:545        
 1st Qu.:1.000   Class :character   Class :character   Class :character  
 Median :2.000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1.806                                                           
 3rd Qu.:2.000                                                           
 Max.   :4.000                                                           
 hotwaterheating    airconditioning       parking         prefarea        
 Length:545         Length:545         Min.   :0.0000   Length:545        
 Class :character   Class :character   1st Qu.:0.0000   Class :character  
 Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
                                       Mean   :0.6936                     
                                       3rd Qu.:1.0000                     
                                       Max.   :3.0000                     
 furnishingstatus  
 Length:545        
 Class :character  
 Mode  :character  
                   
                   
                   
Code
# Use skimr for more detailed and user-friendly summary statistics
skim(housing_df)
Data summary
Name housing_df
Number of rows 545
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
mainroad 0 1 2 3 0 2 0
guestroom 0 1 2 3 0 2 0
basement 0 1 2 3 0 2 0
hotwaterheating 0 1 2 3 0 2 0
airconditioning 0 1 2 3 0 2 0
prefarea 0 1 2 3 0 2 0
furnishingstatus 0 1 9 14 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price 0 1 4766729.25 1870439.62 1750000 3430000 4340000 5740000 13300000 ▇▇▂▁▁
area 0 1 5150.54 2170.14 1650 3600 4600 6360 16200 ▇▆▂▁▁
bedrooms 0 1 2.97 0.74 1 2 3 3 6 ▃▇▂▁▁
bathrooms 0 1 1.29 0.50 1 1 1 2 4 ▇▂▁▁▁
stories 0 1 1.81 0.87 1 1 2 2 4 ▇▇▁▁▂
parking 0 1 0.69 0.86 0 0 0 1 3 ▇▃▁▃▁
Code
# Visualize missing data to understand its extent
# In this case, there is no missing data, but this is a good practice.
# gg_miss_var(housing_df)
Code
# Distribution of the target variable
ggplot(housing_df, aes(x = price)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  labs(title = "Distribution of Housing Prices")

Code
# Price vs. Area
ggplot(housing_df, aes(x = area, y = price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs. Living Area")

Code
# Price by number of bedrooms
ggplot(housing_df, aes(x = as.factor(bedrooms), y = price)) +
  geom_boxplot() +
  labs(title = "Price by Number of Bedrooms", x = "Bedrooms")

3.3 Step 5: Data Cleaning

For this dataset, the categorical variables are binary (‘yes’/‘no’). We will convert them to factors so the recipe can handle them correctly.

Code
housing_clean <- housing_df %>%
  mutate(across(where(is.character), as.factor))

glimpse(housing_clean)
Rows: 545
Columns: 13
$ price            <dbl> 13300000, 12250000, 12250000, 12215000, 11410000, 108…
$ area             <dbl> 7420, 8960, 9960, 7500, 7420, 7500, 8580, 16200, 8100…
$ bedrooms         <dbl> 4, 4, 3, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3,…
$ bathrooms        <dbl> 2, 4, 2, 2, 1, 3, 3, 3, 1, 2, 1, 3, 2, 2, 2, 1, 2, 2,…
$ stories          <dbl> 3, 4, 2, 2, 2, 1, 4, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4,…
$ mainroad         <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
$ guestroom        <fct> no, no, no, no, yes, no, no, no, yes, yes, no, yes, n…
$ basement         <fct> no, no, yes, yes, yes, yes, no, no, yes, no, yes, yes…
$ hotwaterheating  <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, …
$ airconditioning  <fct> yes, yes, no, yes, yes, yes, yes, no, yes, yes, yes, …
$ parking          <dbl> 2, 3, 2, 3, 2, 2, 2, 0, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2,…
$ prefarea         <fct> yes, no, yes, yes, no, yes, yes, no, yes, yes, yes, n…
$ furnishingstatus <fct> furnished, furnished, semi-furnished, furnished, furn…

3.4 Step 6: Data Splitting

To properly evaluate our models, we split the data into three sets: a training set (for model building), a testing set (for tuning and initial evaluation), and a final hold-out set (for unbiased final validation).

Code
set.seed(123) # for reproducibility

# First, split into training/testing (90%) and hold-out (10%)
split_ho <- initial_split(housing_clean, prop = 0.9, strata = price)
train_test_data <- training(split_ho)
holdout_data <- testing(split_ho)

# Next, split the 90% into training (7/9) and testing (2/9)
# This results in overall proportions of 70% train, 20% test
split_tt <- initial_split(train_test_data, prop = 7/9, strata = price)
train_data <- training(split_tt)
test_data <- testing(split_tt)

# Create cross-validation folds from the training data for robust evaluation
folds <- vfold_cv(train_data, v = 5, strata = price)

# Check the proportions of each dataset
cat("Training data:", nrow(train_data), "rows\n")
Training data: 378 rows
Code
cat("Testing data:", nrow(test_data), "rows\n")
Testing data: 111 rows
Code
cat("Hold-out data:", nrow(holdout_data), "rows\n")
Hold-out data: 56 rows

4 Modeling

4.1 Step 7: Create a Preprocessing Recipe

A recipe is a powerful tool from tidymodels for defining a sequence of data preprocessing steps. This ensures that the same transformations are applied consistently to all data splits.

Our recipe will: - Dummy code all factor predictors. - Normalize all numeric predictors to have a mean of 0 and standard deviation of 1.

Code
housing_recipe <- recipe(price ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Check the recipe to see the transformations
summary(prep(housing_recipe))
# A tibble: 14 × 4
   variable                        type      role      source  
   <chr>                           <list>    <chr>     <chr>   
 1 area                            <chr [2]> predictor original
 2 bedrooms                        <chr [2]> predictor original
 3 bathrooms                       <chr [2]> predictor original
 4 stories                         <chr [2]> predictor original
 5 parking                         <chr [2]> predictor original
 6 price                           <chr [2]> outcome   original
 7 mainroad_yes                    <chr [2]> predictor derived 
 8 guestroom_yes                   <chr [2]> predictor derived 
 9 basement_yes                    <chr [2]> predictor derived 
10 hotwaterheating_yes             <chr [2]> predictor derived 
11 airconditioning_yes             <chr [2]> predictor derived 
12 prefarea_yes                    <chr [2]> predictor derived 
13 furnishingstatus_semi.furnished <chr [2]> predictor derived 
14 furnishingstatus_unfurnished    <chr [2]> predictor derived 

4.2 Step 8: Define Models

We will define a suite of regression models to compare their performance on this dataset.

4.2.1 Model Introductions

  • Linear Regression: A straightforward model that finds a linear relationship between predictors and the target. It serves as an excellent, simple baseline.
  • Random Forest: An ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction. It’s robust and handles non-linearities well.
  • XGBoost (Extreme Gradient Boosting): A highly efficient and powerful implementation of gradient boosting. It’s known for its top-tier performance and speed in a wide range of problems.
  • Regularized Regression (Lasso): A linear model that includes a penalty to shrink less important feature coefficients towards zero. This helps prevent overfitting and can perform feature selection. We will use the glmnet engine.

4.2.2 Model Comparison Table

Model Key Strengths Key Weaknesses Tunable
Linear Regression Highly interpretable, fast, great baseline Assumes linearity, sensitive to outliers No
Random Forest High accuracy, robust to overfitting Less interpretable, can be slow Yes
XGBoost Top-tier performance, fast, flexible Complex, many parameters to tune Yes
Regularized Regression Prevents overfitting, performs feature selection Can be less powerful than tree ensembles Yes

4.2.3 Model Specifications

Code
# 1. Linear Regression
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# 2. Random Forest (Tunable)
rf_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# 3. XGBoost (Tunable)
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# 4. Lasso Regression (Tunable)
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

4.3 Step 9: Create a WorkflowSet

A workflow_set conveniently bundles our recipe and model specifications, making it easy to train and compare them.

Code
regression_models <- list(
  lm = lm_spec,
  rf = rf_spec,
  xgb = xgb_spec,
  lasso = lasso_spec
)

all_workflows <- workflow_set(
  preproc = list(recipe = housing_recipe),
  models = regression_models
)

4.4 Step 10: Training and Hyperparameter Tuning

We use workflow_map() to efficiently fit the baseline model and tune the hyperparameters for the others using our cross-validation folds. We’ll leverage parallel processing to speed up this step.

Code
# Set up parallel processing to speed up computation
plan(multisession)

# Define the metrics we want to track
reg_metrics <- metric_set(rmse, rsq, mae)

# Tune the workflows using the cross-validation folds
set.seed(234)
tune_results <- workflow_map(
  all_workflows,
  "tune_grid",
  resamples = folds,
  grid = 20, # Use a grid of 20 candidate parameter sets for tuning
  metrics = reg_metrics,
  control = control_grid(save_pred = TRUE) # Save predictions for later analysis
)

5 Model Evaluation

5.1 Step 11: Compare Model Performance

Let’s visualize the results and examine the performance metrics.

Code
autoplot(tune_results)

5.1.1 Understanding the Metrics

  • RMSE (Root Mean Squared Error): This is the square root of the average of the squared differences between the actual and predicted values. It’s in the same units as the target variable (price), making it interpretable. Lower values are better.
    • Formula: sqrt(mean((actual - predicted)^2))
  • R-squared (R²): This represents the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, with higher values indicating a better fit.
    • Formula: 1 - (Sum of Squared Residuals / Total Sum of Squares)
  • MAE (Mean Absolute Error): This is the average of the absolute differences between the actual and predicted values. Like RMSE, it’s in the same units as the target variable and is less sensitive to large outliers than RMSE. Lower values are better.
    • Formula: mean(abs(actual - predicted))

5.1.2 Ranked Model Results

Code
rank_results(tune_results, rank_metric = "rmse") %>%
  reactable(
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    compact = TRUE,
    columns = list(
      wflow_id = colDef(name = "Workflow"),
      .metric = colDef(name = "Metric"),
      mean = colDef(name = "Mean", format = colFormat(digits = 2)),
      std_err = colDef(name = "Std Error", format = colFormat(digits = 2)),
      rank = colDef(name = "Rank")
    ),
    defaultSorted = list(rank = "asc")
  )

6 Finalizing and Deploying the Model

6.1 Step 12: Select and Save the Best Model

Based on the RMSE, we select the best model (XGBoost in this case), finalize the workflow with the best hyperparameters, and save the fitted object for future use.

Code
# Select the best workflow by RMSE
best_wflow_id <- rank_results(tune_results, rank_metric = "rmse") %>%
  dplyr::slice(1) %>%
  pull(wflow_id)
best_workflow <- extract_workflow(tune_results, id = best_wflow_id)

# Find the best hyperparameters
best_result <- extract_workflow_set_result(tune_results, id = best_wflow_id)
best_params <- select_best(best_result, metric = "rmse")

# Finalize the workflow
final_workflow <- finalize_workflow(best_workflow, best_params)

# Fit the final model on the combined training and testing data
final_fit <- fit(final_workflow, data = train_test_data)

# Save the final model object
saveRDS(final_fit, file = here("../data", "best_housing_model.rds"))
print("Best model saved to data/best_housing_model.rds")
[1] "Best model saved to data/best_housing_model.rds"

6.2 Step 13: Load the Saved Model

We can now load the saved model object for deployment or future predictions.

Code
loaded_model <- readRDS(here("../data", "best_housing_model.rds"))
print(loaded_model)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",      alpha = ~1) 

   Df  %Dev  Lambda
1   0  0.00 1018000
2   2  5.98  927800
3   2 12.89  845400
4   3 19.96  770300
5   3 25.89  701800
6   4 31.35  639500
7   4 36.04  582700
8   4 39.94  530900
9   5 43.40  483800
10  6 46.52  440800
11  6 49.50  401600
12  6 51.96  365900
13  7 54.12  333400
14  8 56.10  303800
15 10 57.88  276800
16 10 59.48  252200
17 11 60.86  229800
18 11 62.04  209400
19 11 63.02  190800
20 11 63.83  173900
21 11 64.51  158400
22 11 65.07  144300
23 12 65.62  131500
24 12 66.09  119800
25 12 66.48  109200
26 12 66.81   99480
27 12 67.08   90650
28 12 67.30   82590
29 12 67.49   75260
30 12 67.65   68570
31 12 67.77   62480
32 12 67.88   56930
33 12 67.97   51870
34 12 68.04   47260
35 12 68.10   43060
36 12 68.15   39240
37 12 68.20   35750
38 12 68.23   32580
39 12 68.26   29680
40 12 68.29   27050
41 12 68.31   24640
42 12 68.32   22450
43 12 68.34   20460
44 12 68.35   18640
45 12 68.36   16990
46 12 68.36   15480

...
and 16 more lines.

6.3 Step 14: Make Predictions on Hold-out Data

Finally, we evaluate our best model on the hold-out data, which it has never seen before. This provides the most realistic estimate of its performance on new, unseen data.

Code
# Make predictions on the hold-out set
holdout_predictions <- predict(loaded_model, new_data = holdout_data)

# Combine with actual values
results_df <- bind_cols(
  holdout_data %>%
  select(price),
  holdout_predictions
)

# Calculate final metrics
rmse(results_df, truth = price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard     993652.
Code
rsq(results_df, truth = price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.652
Code
mae(results_df, truth = price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard     746531.
Code
# Plot predictions vs. actuals to visualize performance
ggplot(results_df, aes(x = price, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs. Predicted Prices on Hold-out Data",
    x = "Actual Price",
    y = "Predicted Price"
  ) +
  coord_obs_pred()

7 Conclusion

In this analysis, we successfully built and evaluated multiple machine learning models to predict housing prices. The tidymodels framework provided a powerful and organized workflow for this task. Our final XGBoost model demonstrated strong performance on the hold-out data, indicating its effectiveness in this regression problem. This project serves as a template for tackling similar regression challenges in a structured and reproducible manner.