<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Case Studies | Nitin Gupta</title>
    <link>https://www.nitingupta.com/casestudies/</link>
      <atom:link href="https://www.nitingupta.com/casestudies/index.xml" rel="self" type="application/rss+xml" />
    <description>Case Studies</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© Nitin Gupta. All Rights Reserved.</copyright><lastBuildDate>Mon, 26 Dec 2016 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.nitingupta.com/img/icon-192.png</url>
      <title>Case Studies</title>
      <link>https://www.nitingupta.com/casestudies/</link>
    </image>
    
    <item>
      <title>Ames Housing - Part 2 - Building Models</title>
      <link>https://www.nitingupta.com/casestudies/ames-housing-part2-models/</link>
      <pubDate>Mon, 26 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/ames-housing-part2-models/</guid>
      <description>


&lt;p&gt;In a &lt;a href=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/&#34;&gt;previous post&lt;/a&gt; in this series, we did an exploratory data analysis of the &lt;a href=&#34;http://www.amstat.org/publications/jse/v19n3/decock.pdf&#34;&gt;Ames Housing dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we will build linear and non-linear models and see how well they predict the &lt;code&gt;SalePrice&lt;/code&gt; of properties.&lt;/p&gt;
&lt;div id=&#34;evaluation-criteria&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Evaluation Criteria&lt;/h2&gt;
&lt;p&gt;Root mean squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed &lt;code&gt;SalePrice&lt;/code&gt; will be our evaluation criterion. Taking logs ensures that errors in predicting expensive and cheap houses affect the result equally.&lt;/p&gt;
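&lt;p&gt;As a minimal base-R sketch of the metric (the numbers here are made up, not taken from the dataset):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# RMSE between log(predicted) and log(actual)
actual    = c(100000, 200000, 400000)
predicted = c(110000, 190000, 380000)
rmse_log  = sqrt(mean((log(predicted) - log(actual))^2))
# A given percentage error contributes the same on the log scale,
# whether the house is cheap or expensive&lt;/code&gt;&lt;/pre&gt;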
&lt;/div&gt;
&lt;div id=&#34;steps-for-building-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Steps for Building Models&lt;/h2&gt;
&lt;p&gt;Here are the steps for building models and determining the best hyperparameter combinations by K-fold cross validation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the training dataset into model training and validation sets. Use stratified sampling so that each partition has a similar distribution of the target variable, &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Define linear and non-linear models.&lt;/li&gt;
&lt;li&gt;For each model, create a grid of candidate hyperparameter combinations.&lt;/li&gt;
&lt;li&gt;For each hyperparameter combination, fit a model on the training set and make predictions on the validation set. Repeat the process for all folds.&lt;/li&gt;
&lt;li&gt;Determine root mean squared errors (RMSE) and choose the best hyperparameter combination that corresponds to the minimum RMSE.&lt;/li&gt;
&lt;li&gt;Train each model with its best hyperparameter combination on the entire training set.&lt;/li&gt;
&lt;li&gt;Calculate the RMSE of each finalized model on the testing set.&lt;/li&gt;
&lt;li&gt;Finally, choose the model with the lowest test RMSE.&lt;/li&gt;
&lt;/ul&gt;
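&lt;p&gt;The steps above can be sketched in base R. This is a simplified stand-in, assuming synthetic data, a single hypothetical hyperparameter (polynomial degree) and &lt;code&gt;lm&lt;/code&gt; in place of the models tuned below:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
n  = 200
df = data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y = 3 * df$x1 - 2 * df$x2 + rnorm(n, sd = 0.5)

folds = sample(rep(1:4, length.out = n))   # assign each row to one of 4 folds
grid  = 1:3                                # hypothetical hyperparameter grid

cv_rmse = sapply(grid, function(d) {
  mean(sapply(1:4, function(k) {
    fit  = lm(y ~ poly(x1, d) + x2, data = df[folds != k, ])
    pred = predict(fit, newdata = df[folds == k, ])
    sqrt(mean((df$y[folds == k] - pred)^2))  # RMSE on the held-out fold
  }))
})
best = grid[which.min(cv_rmse)]                     # best hyperparameter value
final_fit = lm(y ~ poly(x1, best) + x2, data = df)  # refit on all training data&lt;/code&gt;&lt;/pre&gt;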
&lt;/div&gt;
&lt;div id=&#34;partitioning-training-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Partitioning Training Data&lt;/h2&gt;
&lt;p&gt;We split the training data into 4 folds. Within each fold, 75% of the data is used for training models and 25% for validating the predicted values against the actual values.&lt;/p&gt;
&lt;p&gt;Let’s look at the distribution of the target variable across all folds:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_target_partitioning-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;By using stratified sampling, we ensure that the training and validation distributions of the target variable are similar.&lt;/p&gt;
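&lt;p&gt;A minimal base-R sketch of the idea, assuming a hypothetical log-normal target in place of &lt;code&gt;SalePrice&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
sale_price = exp(rnorm(1000, mean = 12, sd = 0.4))  # hypothetical target

# Bin the target into quartiles, then sample 75% within each bin
strata    = cut(sale_price, quantile(sale_price, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE)
train_idx = unlist(lapply(split(seq_along(sale_price), strata),
                          function(i) sample(i, floor(0.75 * length(i)))))
valid_idx = setdiff(seq_along(sale_price), train_idx)
# Every quartile of the target is now represented 75/25 in train/validation&lt;/code&gt;&lt;/pre&gt;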
&lt;/div&gt;
&lt;div id=&#34;linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Linear Models&lt;/h2&gt;
&lt;div id=&#34;ordinary-least-squares-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Ordinary Least Squares Regression&lt;/h3&gt;
&lt;p&gt;Before creating any new features or turning to more complex modelling methods, we will cross validate a simple linear model on the training data to establish a benchmark. If more complex approaches do not significantly improve the validation metrics, they are not worth pursuing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Linear Regression Model Specification (regression)

Computational engine: lm &lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After training a linear model on all predictors, we get an RMSE of &lt;strong&gt;0.1468&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This is the simplest and fastest model with no hyperparameters to tune.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;regularized-linear-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Regularized Linear Model&lt;/h3&gt;
&lt;p&gt;We will use &lt;code&gt;glmnet&lt;/code&gt;, which fits an elastic net: a blend of LASSO (L1) and Ridge (L2) regularization. We will grid search the following hyperparameters to minimize RMSE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;penalty&lt;/code&gt;: The total amount of regularization in the model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mixture&lt;/code&gt;: The proportion of L1 regularization in the model.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 3
    penalty mixture mean_rmse
      &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 4.83e- 3  0.922      0.127
 2 3.79e- 2  0.0518     0.129
 3 1.36e- 3  0.659      0.132
 4 1.60e- 3  0.431      0.133
 5 3.50e- 3  0.177      0.133
 6 4.17e- 2  0.288      0.133
 7 5.67e- 4  0.970      0.133
 8 6.79e- 9  0.0193     0.138
 9 4.32e-10  0.337      0.138
10 1.95e- 6  0.991      0.138&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-1&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After hyperparameter tuning with cross validation, &lt;code&gt;glmnet&lt;/code&gt; gives the best RMSE of 0.127 with penalty = 0.0048 and mixture = 0.9216.&lt;/li&gt;
&lt;li&gt;It is a significant improvement over Ordinary Least Squares regression, which had an RMSE of 0.1468.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glmnet&lt;/code&gt; cross validation takes under a minute to execute.&lt;/li&gt;
&lt;li&gt;But the presence of outliers can significantly affect its performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is a plot of the &lt;code&gt;glmnet&lt;/code&gt; hyperparameter grid along with the best hyperparameter combination:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_glmnet-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;non-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Non-linear Models&lt;/h2&gt;
&lt;p&gt;Next, we will train a couple of tree-based algorithms, which are not very sensitive to outliers and skewed data.&lt;/p&gt;
&lt;div id=&#34;randomforest&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;em&gt;randomForest&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;In each ensemble, we have 1000 trees and do a grid search of the following hyperparameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mtry&lt;/code&gt;: The number of predictors to randomly sample at each split.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_n&lt;/code&gt;: The minimum number of data points in a node required to further split the node.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Computational engine: randomForest &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 3
   min_n  mtry mean_rmse
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
 1     4    85     0.134
 2     3   140     0.135
 3    14    90     0.135
 4     6    45     0.136
 5     9   138     0.136
 6    13   158     0.137
 7     9   183     0.137
 8    19    56     0.138
 9    21   130     0.138
10     5   218     0.138&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-2&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After cross validation, we get the best RMSE of 0.134 with mtry = 85 and min_n = 4.&lt;/li&gt;
&lt;li&gt;This is no improvement in RMSE over &lt;code&gt;glmnet&lt;/code&gt;, and &lt;code&gt;randomForest&lt;/code&gt; cross validation takes much longer to execute.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is a plot of the &lt;code&gt;randomForest&lt;/code&gt; hyperparameter grid along with the best hyperparameter combination:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_randomForest-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;xgboost&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;em&gt;xgboost&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;In each ensemble we have 1000 trees and do a grid search of the following hyperparameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;min_n&lt;/code&gt;: The minimum number of data points in a node required to further split the node.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tree_depth&lt;/code&gt;: The maximum depth or the number of splits of the tree.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;learn_rate&lt;/code&gt;: The rate at which the boosting algorithm adapts from one iteration to another.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 1000
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()

Engine-Specific Arguments:
  objective = reg:squarederror

Computational engine: xgboost &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 4
   min_n tree_depth learn_rate mean_rmse
   &amp;lt;int&amp;gt;      &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1    13          3  0.0309        0.124
 2    40          4  0.0350        0.126
 3     6          8  0.0469        0.126
 4    34         15  0.0172        0.127
 5    28         10  0.0336        0.128
 6    20         14  0.00348       0.389
 7    22          7  0.000953      4.46 
 8     3          2  0.000528      6.81 
 9    10         12  0.000401      7.73 
10    34          3  0.0000802    10.6  &lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-3&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After cross validation, we get the best RMSE of 0.124 with min_n = 13, tree_depth = 3 and learn_rate = 0.0309.&lt;/li&gt;
&lt;li&gt;This is the best RMSE of the three models, ahead of both &lt;code&gt;glmnet&lt;/code&gt; and &lt;code&gt;randomForest&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;However, &lt;code&gt;xgboost&lt;/code&gt; cross validation takes longer to execute than that of &lt;code&gt;glmnet&lt;/code&gt;, though it is faster than that of &lt;code&gt;randomForest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;finalizing-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finalizing Models&lt;/h2&gt;
&lt;p&gt;For each model, we found the combination of hyperparameters that minimizes RMSE. Using those hyperparameters, we can now train the same models on the entire training dataset. Finally, we can use the trained models to predict log(SalePrice) on the entire training set and compare actual vs. predicted values.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_train-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-4&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Both &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; models do a fantastic job of predicting log(SalePrice) with the tuned parameters, as the predictions lie close to the straight line drawn at 45 degrees.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;glmnet&lt;/code&gt; model shows a couple of outliers with Ids &lt;strong&gt;524&lt;/strong&gt; and &lt;strong&gt;1299&lt;/strong&gt; whose predicted values are far in excess of their actual values. Even properties whose &lt;code&gt;SalePrice&lt;/code&gt; is at the lower end show a wide dispersion in predicted values.&lt;/li&gt;
&lt;li&gt;But the true performance can only be measured on unseen testing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;performance-on-test-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Performance on Test Data&lt;/h2&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_test-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 3
  model        test_rmse cv_rmse
  &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 glmnet           0.129   0.127
2 randomForest     0.139   0.134
3 xgboost          0.128   0.124&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-5&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Each model’s RMSE on the unseen testing set is close to its cross validated RMSE, which shows that the cross validation process and hyperparameter tuning worked well.&lt;/li&gt;
&lt;li&gt;Records with Ids &lt;strong&gt;1537&lt;/strong&gt; and &lt;strong&gt;2217&lt;/strong&gt; are outliers, as none of the models are able to predict close to actual values.&lt;/li&gt;
&lt;li&gt;Looking at the test RMSE, we can finalize &lt;code&gt;xgboost&lt;/code&gt; as the model that generalizes best on this dataset.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;feature-importance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Feature Importance&lt;/h2&gt;
&lt;p&gt;Even though &lt;code&gt;xgboost&lt;/code&gt; is not as easily interpretable as a linear model, we could use variable importance plots to determine the most important features selected by the model.&lt;/p&gt;
&lt;p&gt;Let’s take a look at the top 10 most important features of our finalized &lt;code&gt;xgboost&lt;/code&gt; model:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/feature_importance-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Correlations of numerical features are plotted side-by-side. All features have a correlation of 0.5 or more with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;All of the top 10 features make sense. To evaluate &lt;code&gt;SalePrice&lt;/code&gt;, a buyer would definitely look at total square footage, overall quality, neighborhood, number of bathrooms, kitchen quality, age of property, etc.&lt;/li&gt;
&lt;li&gt;This shows that our finalized model generalizes well and makes very reasonable choices in terms of features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;new-property-premium&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;New Property Premium&lt;/h2&gt;
&lt;p&gt;Among the top 10 features by importance in our final model, most features, like square footage, neighborhood and number of bathrooms, remain the same throughout the life of the property. The quality and condition of a property do change, but their evaluation is mostly subjective. The only other feature that indisputably changes over time is &lt;code&gt;PropertyAge&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So, how would the predicted &lt;code&gt;SalePrice&lt;/code&gt; of a newly constructed property compare with that of the same property had it been built up to 30 or more years earlier?&lt;/p&gt;
&lt;p&gt;We could pick a couple of properties at random, change &lt;code&gt;PropertyAge&lt;/code&gt; and see its impact on &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
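&lt;p&gt;Such an experiment can be sketched in base R. Here a hypothetical linear fit on synthetic data stands in for our tuned model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
n  = 500
df = data.frame(sqft = runif(n, 800, 3000), age = runif(n, 0, 60))
df$log_price = 10 + 0.0004 * df$sqft - 0.004 * df$age + rnorm(n, sd = 0.1)
fit = lm(log_price ~ sqft + age, data = df)

prop = df[1, ]                    # pick one property
preds = sapply(0:30, function(a) {
  prop$age = a                    # vary only the property age
  predict(fit, newdata = prop)
})
# Premium of a brand-new build over the same property at 30 years old
premium = exp(preds[1]) / exp(preds[31]) - 1&lt;/code&gt;&lt;/pre&gt;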
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/property_appreciation-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see there’s a small premium for a newly constructed property vs. an older property of the same build, quality and condition. This premium isn’t large in a place like Ames, IA, but we’d expect it to be much higher in a large metropolitan city.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ames Housing - Part 1 - Exploratory Data Analysis</title>
      <link>https://www.nitingupta.com/casestudies/ames-housing-part1-eda/</link>
      <pubDate>Sun, 25 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/ames-housing-part1-eda/</guid>
      <description>


&lt;p&gt;In this case study, we will use the &lt;a href=&#34;http://www.amstat.org/publications/jse/v19n3/decock.pdf&#34;&gt;Ames Housing dataset&lt;/a&gt; to explore regression techniques and predict the sale price of houses.&lt;/p&gt;
&lt;div id=&#34;data-summaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Summaries&lt;/h2&gt;
&lt;p&gt;The Ames Housing dataset contains the sale prices of properties in Ames, Iowa along with 80 other features. Each property has an &lt;strong&gt;Id&lt;/strong&gt; associated with it. Here are the dimensions of the training and testing sets respectively:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the training set&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1460   81&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the testing set&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1459   81&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s combine training and testing into a single dataset and take a look at the count of missing values:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/missing_values-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The combined dataset has 2919 property records.&lt;/li&gt;
&lt;li&gt;Very few properties have a pool, fence or an alley access to the property.&lt;/li&gt;
&lt;li&gt;Very few properties have a miscellaneous feature that has not been covered by other features.&lt;/li&gt;
&lt;li&gt;More than a dozen features have at least 1 missing value. Since we have a tiny dataset, we will try to impute the missing values.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data-cleaning-transformation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Cleaning &amp;amp; Transformation&lt;/h2&gt;
&lt;p&gt;We will visualize features of the complete dataset and create a data cleaning pipeline.&lt;/p&gt;
&lt;div id=&#34;fixing-data-errors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Fixing Data Errors&lt;/h3&gt;
&lt;p&gt;First, a few data integrity checks need to be done to ensure the quality of the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;YearRemodAdd&lt;/code&gt; should not be earlier than &lt;code&gt;YearBuilt&lt;/code&gt;: 1 record to be fixed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;YrSold&lt;/code&gt; should not be earlier than &lt;code&gt;YearRemodAdd&lt;/code&gt;: 3 records to be fixed&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id YearBuilt YearRemodAdd YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1  1877      2002         2001   2009&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 4
     Id YearBuilt YearRemodAdd YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1   524      2007         2008   2007
2  2296      2007         2008   2007
3  2550      2008         2009   2007&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageYrBlt&lt;/code&gt; should not be earlier than &lt;code&gt;YearBuilt&lt;/code&gt;: 18 records to be fixed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageYrBlt&lt;/code&gt; should not be later than &lt;code&gt;YrSold&lt;/code&gt;: 1 record to be fixed&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 18 x 4
      Id YearBuilt GarageYrBlt YrSold
   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
 1    30      1927        1920   2008
 2    94      1910        1900   2007
 3   325      1967        1961   2010
 4   601      2005        2003   2006
 5   737      1950        1949   2006
 6  1104      1959        1954   2006
 7  1377      1930        1925   2008
 8  1415      1923        1922   2008
 9  1419      1963        1962   2008
10  1522      1959        1956   2010
11  1577      2010        2009   2010
12  1806      1935        1920   2009
13  1841      1978        1960   2009
14  1896      1941        1940   2009
15  1898      1935        1926   2009
16  2123      1945        1925   2008
17  2264      2006        2005   2007
18  2510      2006        2005   2007&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id YearBuilt GarageYrBlt YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1  2593      2006        2207   2007&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;imputing-missing-values-new-features&#34; class=&#34;section level3 tabset tabset-fade tabset-pills&#34;&gt;
&lt;h3&gt;Imputing Missing Values &amp;amp; New Features&lt;/h3&gt;
&lt;!-- #### &lt;span style=&#34;color:red&#34;&gt;Basement Features&lt;/span&gt; --&gt;
&lt;div id=&#34;basement-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Basement Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;There is one property (&lt;code&gt;Id&lt;/code&gt; = 2121) where all the basement features are NA. &lt;code&gt;TotalBsmtSF&lt;/code&gt; is replaced by 0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now there are 79 properties which have no basement (&lt;code&gt;TotalBsmtSF&lt;/code&gt; = 0). All other basement features having NA values are changed to &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since qualitative features do not have the same distribution across neighborhoods, any remaining &lt;strong&gt;NA&lt;/strong&gt; values are imputed to be the most common value in that &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 13
     Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
1  2121 BrkSide      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                 NA &amp;lt;NA&amp;gt;                 NA        NA          NA           NA           NA&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 79 x 13
      Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
 1    18 Sawyer       &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 2    40 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 3    91 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 4   103 SawyerW      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 5   157 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 6   183 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 7   260 OldTown      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 8   343 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 9   363 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
10   372 ClearCr      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
# ... with 69 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 9 x 13
     Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
1   333 NridgHt      Gd       TA       No           GLQ                1124 &amp;lt;NA&amp;gt;                479      1603        3206            1            0
2   949 CollgCr      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0       936         936            0            0
3  1488 Somerst      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0      1595        1595            0            0
4  2041 Veenker      Gd       &amp;lt;NA&amp;gt;     Mn           GLQ                1044 Rec                 382         0        1426            1            0
5  2186 Edwards      TA       &amp;lt;NA&amp;gt;     No           BLQ                1033 Unf                   0        94        1127            0            1
6  2218 IDOTRR       &amp;lt;NA&amp;gt;     Fa       No           Unf                   0 Unf                   0       173         173            0            0
7  2219 IDOTRR       &amp;lt;NA&amp;gt;     TA       No           Unf                   0 Unf                   0       356         356            0            0
8  2349 Somerst      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0       725         725            0            0
9  2525 CollgCr      TA       &amp;lt;NA&amp;gt;     Av           ALQ                 755 Unf                   0       240         995            0            0&lt;/code&gt;&lt;/pre&gt;
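&lt;p&gt;Step 3 above can be sketched in base R as follows (the column names follow the post, but the data frame is a toy example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;df = data.frame(
  Neighborhood = c('A', 'A', 'A', 'B', 'B'),
  BsmtQual     = c('Gd', 'Gd', NA, 'TA', NA),
  stringsAsFactors = FALSE
)
mode_of = function(x) names(which.max(table(x)))  # most frequent non-NA value
df$BsmtQual = ave(df$BsmtQual, df$Neighborhood,
                  FUN = function(x) ifelse(is.na(x), mode_of(x), x))
# NA values are now filled with the modal value of their Neighborhood&lt;/code&gt;&lt;/pre&gt;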
&lt;p&gt;Histograms of numerical basement features and their correlations with &lt;code&gt;SalePrice&lt;/code&gt; are plotted below.&lt;/p&gt;
&lt;p&gt;It can be verified that &lt;code&gt;TotalBsmtSF&lt;/code&gt; = &lt;code&gt;BsmtFinSF1&lt;/code&gt; + &lt;code&gt;BsmtFinSF2&lt;/code&gt; + &lt;code&gt;BsmtUnfSF&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, new features are generated where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;BsmtBath&lt;/code&gt; = &lt;code&gt;BsmtFullBath&lt;/code&gt; + 0.5 * &lt;code&gt;BsmtHalfBath&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HasBsmt&lt;/code&gt; = &lt;code&gt;TotalBsmtSF&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;/ul&gt;
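&lt;p&gt;In base R, the derived features above amount to (toy values, column names from the post):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;df = data.frame(BsmtFullBath = c(1, 0, 2), BsmtHalfBath = c(1, 0, 0),
                TotalBsmtSF  = c(900, 0, 1200))
df$BsmtBath = df$BsmtFullBath + 0.5 * df$BsmtHalfBath  # half baths count as 0.5
df$HasBsmt  = df$TotalBsmtSF &amp;gt; 0                        # TRUE if any basement area&lt;/code&gt;&lt;/pre&gt;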
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_num_bsmt-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most properties have a basement.&lt;/li&gt;
&lt;li&gt;Column plots show that &lt;code&gt;BsmtFinType2&lt;/code&gt; and &lt;code&gt;BsmtCond&lt;/code&gt; values are dominated by a single category.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_chr_bsmt-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bathroom-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Bathroom Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated to determine the total number of bathrooms: &lt;code&gt;TotalBath&lt;/code&gt; = &lt;code&gt;FullBath&lt;/code&gt; + &lt;code&gt;HalfBath&lt;/code&gt; + &lt;code&gt;BsmtBath&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_bath-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fireplace-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Fireplace Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;There are 1420 properties that have no fireplaces. &lt;code&gt;FireplaceQu&lt;/code&gt; is changed to &lt;strong&gt;None&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1,420 x 4
      Id Neighborhood Fireplaces FireplaceQu
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      
 1     1 CollgCr               0 &amp;lt;NA&amp;gt;       
 2     6 Mitchel               0 &amp;lt;NA&amp;gt;       
 3    11 Sawyer                0 &amp;lt;NA&amp;gt;       
 4    13 Sawyer                0 &amp;lt;NA&amp;gt;       
 5    16 BrkSide               0 &amp;lt;NA&amp;gt;       
 6    18 Sawyer                0 &amp;lt;NA&amp;gt;       
 7    19 SawyerW               0 &amp;lt;NA&amp;gt;       
 8    20 NAmes                 0 &amp;lt;NA&amp;gt;       
 9    27 NAmes                 0 &amp;lt;NA&amp;gt;       
10    30 BrkSide               0 &amp;lt;NA&amp;gt;       
# ... with 1,410 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasFireplace&lt;/code&gt; = &lt;code&gt;Fireplaces&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;A significant number of properties have fireplaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_fireplace-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;garage-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Garage Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Where &lt;code&gt;GarageYrBlt&lt;/code&gt; is NA, it is set to &lt;code&gt;YearBuilt&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are 157 properties with no garage. In these records, &lt;code&gt;GarageType&lt;/code&gt;, &lt;code&gt;GarageFinish&lt;/code&gt;, &lt;code&gt;GarageQual&lt;/code&gt; and &lt;code&gt;GarageCond&lt;/code&gt; are recorded as &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since qualitative features do not have the same distribution across neighborhoods, any remaining &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value (or the median, for numeric features) within the &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 157 x 9
      Id Neighborhood GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;     
 1    40 Edwards      &amp;lt;NA&amp;gt;              1955 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 2    49 OldTown      &amp;lt;NA&amp;gt;              1920 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 3    79 Sawyer       &amp;lt;NA&amp;gt;              1968 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 4    89 IDOTRR       &amp;lt;NA&amp;gt;              1915 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 5    90 CollgCr      &amp;lt;NA&amp;gt;              1994 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 6   100 NAmes        &amp;lt;NA&amp;gt;              1959 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 7   109 IDOTRR       &amp;lt;NA&amp;gt;              1919 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 8   126 IDOTRR       &amp;lt;NA&amp;gt;              1935 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 9   128 OldTown      &amp;lt;NA&amp;gt;              1930 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
10   141 NAmes        &amp;lt;NA&amp;gt;              1971 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
# ... with 147 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 2 x 9
     Id Neighborhood GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;     
1  2127 OldTown      Detchd            1910 &amp;lt;NA&amp;gt;                  1        360 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
2  2577 IDOTRR       Detchd            1923 &amp;lt;NA&amp;gt;                 NA         NA &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      &lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageArea&lt;/code&gt; and &lt;code&gt;GarageCars&lt;/code&gt; have very similar correlations with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasGarage&lt;/code&gt; = &lt;code&gt;GarageArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;Most properties have a garage.&lt;/li&gt;
&lt;li&gt;Column plots show that &lt;code&gt;GarageQual&lt;/code&gt; and &lt;code&gt;GarageCond&lt;/code&gt; values are dominated by a single category.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_garage-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
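&lt;p&gt;The group-wise imputation in step 3 can be sketched in pandas (the post’s code is R/tidyverse; the records here are hypothetical):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical garage records with missing quality values
df = pd.DataFrame({
    "Neighborhood": ["OldTown", "OldTown", "OldTown", "IDOTRR", "IDOTRR"],
    "GarageType":   ["Detchd"] * 5,
    "GarageQual":   ["TA", "Fa", "TA", None, "TA"],
})

# Fill each NA with the most common value in its Neighborhood x GarageType group
def fill_mode(s):
    return s.fillna(s.mode().iloc[0]) if s.notna().any() else s

df["GarageQual"] = (
    df.groupby(["Neighborhood", "GarageType"])["GarageQual"].transform(fill_mode)
)
```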
&lt;/div&gt;
&lt;div id=&#34;masonry-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Masonry Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;There is one property (&lt;code&gt;Id&lt;/code&gt; = 2611) where &lt;code&gt;MasVnrArea&lt;/code&gt; = 198 but &lt;code&gt;MasVnrType&lt;/code&gt; = NA. Impute &lt;code&gt;MasVnrType&lt;/code&gt; with the most common value among properties in the neighborhood where &lt;code&gt;MasVnrArea&lt;/code&gt; &amp;gt; 0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;MasVnrType&lt;/code&gt; to be the most common values by &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;YearRemodAdd&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;MasVnrArea&lt;/code&gt; to be the median values by &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;MasVnrType&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id Neighborhood MasVnrType MasVnrArea
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
1  2611 Mitchel      &amp;lt;NA&amp;gt;              198&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 23 x 4
      Id Neighborhood MasVnrType MasVnrArea
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1   235 Gilbert      &amp;lt;NA&amp;gt;               NA
 2   530 Crawfor      &amp;lt;NA&amp;gt;               NA
 3   651 Somerst      &amp;lt;NA&amp;gt;               NA
 4   937 SawyerW      &amp;lt;NA&amp;gt;               NA
 5   974 Somerst      &amp;lt;NA&amp;gt;               NA
 6   978 Somerst      &amp;lt;NA&amp;gt;               NA
 7  1244 NridgHt      &amp;lt;NA&amp;gt;               NA
 8  1279 CollgCr      &amp;lt;NA&amp;gt;               NA
 9  1692 Gilbert      &amp;lt;NA&amp;gt;               NA
10  1707 Somerst      &amp;lt;NA&amp;gt;               NA
# ... with 13 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 23 x 4
      Id Neighborhood MasVnrType MasVnrArea
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1   235 Gilbert      None               NA
 2   530 Crawfor      None               NA
 3   651 Somerst      None               NA
 4   937 SawyerW      None               NA
 5   974 Somerst      Stone              NA
 6   978 Somerst      None               NA
 7  1244 NridgHt      Stone              NA
 8  1279 CollgCr      BrkFace            NA
 9  1692 Gilbert      None               NA
10  1707 Somerst      Stone              NA
# ... with 13 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasMasVnr&lt;/code&gt; = &lt;code&gt;MasVnrArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;A significant number of properties have masonry veneer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_masvnr-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;pool-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Pool Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Change values in &lt;code&gt;PoolQC&lt;/code&gt; to &lt;strong&gt;None&lt;/strong&gt; if the property has no pool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute remaining &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;PoolQC&lt;/code&gt; with the most common value among properties in the &lt;code&gt;Neighborhood&lt;/code&gt; that have a pool.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 2,906 x 4
      Id Neighborhood PoolArea PoolQC
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; 
 1     1 CollgCr             0 &amp;lt;NA&amp;gt;  
 2     2 Veenker             0 &amp;lt;NA&amp;gt;  
 3     3 CollgCr             0 &amp;lt;NA&amp;gt;  
 4     4 Crawfor             0 &amp;lt;NA&amp;gt;  
 5     5 NoRidge             0 &amp;lt;NA&amp;gt;  
 6     6 Mitchel             0 &amp;lt;NA&amp;gt;  
 7     7 Somerst             0 &amp;lt;NA&amp;gt;  
 8     8 NWAmes              0 &amp;lt;NA&amp;gt;  
 9     9 OldTown             0 &amp;lt;NA&amp;gt;  
10    10 BrkSide             0 &amp;lt;NA&amp;gt;  
# ... with 2,896 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 4
     Id Neighborhood PoolArea PoolQC
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; 
1  2421 NAmes             368 &amp;lt;NA&amp;gt;  
2  2504 SawyerW           444 &amp;lt;NA&amp;gt;  
3  2600 Mitchel           561 &amp;lt;NA&amp;gt;  &lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasPool&lt;/code&gt; = &lt;code&gt;PoolArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;Most properties do not have a pool.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_pool-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;porch-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Porch Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;New features are generated for:
&lt;ul&gt;
&lt;li&gt;Total porch area: &lt;code&gt;PorchSF&lt;/code&gt; = &lt;code&gt;OpenPorchSF&lt;/code&gt; + &lt;code&gt;EnclosedPorch&lt;/code&gt; + &lt;code&gt;3SsnPorch&lt;/code&gt; + &lt;code&gt;ScreenPorch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether property has a porch: &lt;code&gt;HasPorch&lt;/code&gt; = &lt;code&gt;PorchSF&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_porch-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;built-area-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Built Area Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is added to determine the total square footage of built area: &lt;code&gt;TotalSF&lt;/code&gt; = &lt;code&gt;GrLivArea&lt;/code&gt; + &lt;code&gt;TotalBsmtSF&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_area-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;construction-year-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Construction Year Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;New features are generated for:
&lt;ul&gt;
&lt;li&gt;Vintage of year built: &lt;strong&gt;1945 or earlier, 1946-1999, 2000 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Age of the property from its construction (or last remodel) to the time it was sold: &lt;code&gt;PropertyAge&lt;/code&gt; = &lt;code&gt;YrSold&lt;/code&gt; - &lt;code&gt;YearRemodAdd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Indicate if the property is new or newly renovated: &lt;code&gt;IsNew&lt;/code&gt; = &lt;code&gt;YearRemodAdd&lt;/code&gt; == &lt;code&gt;YrSold&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Indicate if the property has been remodelled: &lt;code&gt;IsRemodAdd&lt;/code&gt; = &lt;code&gt;YearRemodAdd&lt;/code&gt; &amp;gt; &lt;code&gt;YearBuilt&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
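&lt;p&gt;These derived features can be sketched in pandas (hypothetical rows; the post’s code is in R):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical records with the construction-year columns
df = pd.DataFrame({
    "YearBuilt":    [1940, 1995, 2008],
    "YearRemodAdd": [1970, 1995, 2008],
    "YrSold":       [2008, 2007, 2008],
})

# Vintage buckets: 1945 or earlier, 1946-1999, 2000 or later
df["Vintage"] = pd.cut(
    df["YearBuilt"],
    bins=[-float("inf"), 1945, 1999, float("inf")],
    labels=["1945 or earlier", "1946-1999", "2000 or later"],
)
df["PropertyAge"] = df["YrSold"] - df["YearRemodAdd"]
df["IsNew"] = df["YearRemodAdd"] == df["YrSold"]
df["IsRemodAdd"] = df["YearRemodAdd"] > df["YearBuilt"]
```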
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_construction_years-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;neighborhood-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Neighborhood Features&lt;/h4&gt;
&lt;p&gt;Type of Neighborhood: There are 25 neighborhoods in the dataset.
As the saying goes, real estate is all about location, location, location. Clearly, some neighborhoods command higher prices than others.&lt;/p&gt;
&lt;p&gt;Neighborhoods can be grouped into fewer categories by ranking them on their median &lt;code&gt;SalePrice&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Type1&lt;/strong&gt;: StoneBr, NridgHt, NoRidge&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type2&lt;/strong&gt;: Veenker, Timber, Somerst&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type3&lt;/strong&gt;: Crawfor, CollgCr, ClearCr, Blmngtn, Gilbert, NWAmes, SawyerW&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type4&lt;/strong&gt;: Mitchel, NPkVill, NAmes, SWISU, Sawyer, Blueste, BrkSide, Edwards, OldTown&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type5&lt;/strong&gt;: IDOTRR, BrDale, MeadowV&lt;/li&gt;
&lt;/ul&gt;
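&lt;p&gt;One way to derive such a grouping, sketched in pandas with hypothetical prices (three tiers here for brevity; the post uses five):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sales; rank neighborhoods by median SalePrice
df = pd.DataFrame({
    "Neighborhood": ["StoneBr", "StoneBr", "MeadowV", "MeadowV", "Sawyer"],
    "SalePrice":    [350000, 400000, 90000, 100000, 135000],
})

medians = df.groupby("Neighborhood")["SalePrice"].median()
# Bucket the medians into price tiers (Type1 = highest)
tiers = pd.qcut(medians, q=3, labels=["Type3", "Type2", "Type1"])
```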
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_neighborhood-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-missing-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Other Missing Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;MiscFeature&lt;/code&gt;, &lt;code&gt;Alley&lt;/code&gt; and &lt;code&gt;Fence&lt;/code&gt; &lt;strong&gt;NA&lt;/strong&gt; values are recoded as &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;Utilities&lt;/code&gt;, &lt;code&gt;Functional&lt;/code&gt; and &lt;code&gt;SaleType&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value of each feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;LotFrontage&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the median value in the &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;MSZoning&lt;/code&gt;, &lt;code&gt;KitchenQual&lt;/code&gt;, &lt;code&gt;Exterior1st&lt;/code&gt;, &lt;code&gt;Exterior2nd&lt;/code&gt; and &lt;code&gt;Electrical&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value in the &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;label-encoding&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Label Encoding&lt;/h3&gt;
&lt;p&gt;A quick look at the data description shows many features have categories that follow a specific order. These features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LotShape&lt;/code&gt;: Reg, IR1, IR2, IR3&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LandSlope&lt;/code&gt;: Gtl, Mod, Sev&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExterQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExterCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtExposure&lt;/code&gt;: Gd, Av, Mn, No, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtFinType1&lt;/code&gt;: GLQ, ALQ, BLQ, Rec, LwQ, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtFinType2&lt;/code&gt;: GLQ, ALQ, BLQ, Rec, LwQ, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HeatingQC&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CentralAir&lt;/code&gt;: Y, N&lt;/li&gt;
&lt;li&gt;&lt;code&gt;KitchenQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Functional&lt;/code&gt;: Typ, Min1, Min2, Mod, Maj1, Maj2, Sev, Sal&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FireplaceQu&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageFinish&lt;/code&gt;: Fin, RFn, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Street&lt;/code&gt;: Grvl, Pave&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PavedDrive&lt;/code&gt;: Y, P, N&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most of these features share the order &lt;strong&gt;Ex, Gd, TA, Fa, Po&lt;/strong&gt;, except that some lack &lt;strong&gt;None&lt;/strong&gt; as a category. They can all be encoded with a common ordered set of categories: &lt;strong&gt;Ex, Gd, TA, Fa, Po, None&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Some categorical features are already encoded as ordered integers. These features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;OverallQual&lt;/code&gt;: 10 to 1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OverallCond&lt;/code&gt;: 10 to 1&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;MoSold&lt;/code&gt; is cyclical and should be recoded as a factor.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;YrSold&lt;/code&gt; has only 5 values from 2006-2010 and should also be recoded as a factor.&lt;/p&gt;
&lt;p&gt;In each categorical feature, categories with fewer than 10 observations are lumped into a single category named &lt;strong&gt;Other&lt;/strong&gt;.&lt;/p&gt;
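&lt;p&gt;The common ordered encoding and the lumping of rare categories can be sketched in pandas (hypothetical data; the post’s code is in R):&lt;/p&gt;

```python
import pandas as pd

# Common ordered quality scale; None means the feature is absent
levels = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
s = pd.Series(["Gd", "TA", None, "Ex"]).fillna("None")
codes = pd.Categorical(s, categories=levels, ordered=True).codes.tolist()

# Lump categories with fewer than 10 observations into "Other"
t = pd.Series(["A"] * 50 + ["B"] * 12 + ["C"] * 3 + ["D"] * 2)
counts = t.value_counts()
t = t.where(~t.isin(counts[counts < 10].index), "Other")
```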
&lt;/div&gt;
&lt;div id=&#34;features-to-drop&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Features to Drop&lt;/h3&gt;
&lt;div id=&#34;highly-correlated-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Highly Correlated Features&lt;/h4&gt;
&lt;p&gt;Some features can be dropped from further analysis because they are either highly correlated with another feature or have been superseded by a derived feature.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; [1] &amp;quot;BsmtFullBath&amp;quot;  &amp;quot;GarageCars&amp;quot;    &amp;quot;GarageYrBlt&amp;quot;   &amp;quot;GrLivArea&amp;quot;     &amp;quot;PoolArea&amp;quot;      &amp;quot;YearBuilt&amp;quot;     &amp;quot;YearRemodAdd&amp;quot;  &amp;quot;Neighborhood&amp;quot;  &amp;quot;OpenPorchSF&amp;quot;  
[10] &amp;quot;EnclosedPorch&amp;quot; &amp;quot;3SsnPorch&amp;quot;     &amp;quot;ScreenPorch&amp;quot;  &lt;/code&gt;&lt;/pre&gt;
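&lt;p&gt;The correlation part of this pruning can be sketched in pandas (hypothetical, deliberately collinear numbers; the post’s drop list also includes features superseded by derived ones):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical example: GarageCars is nearly collinear with GarageArea
df = pd.DataFrame({
    "GarageArea": [0, 240, 480, 720],
    "GarageCars": [0, 1, 2, 3],
    "SalePrice":  [100000, 140000, 180000, 260000],
})

# Of each highly correlated pair, keep one feature and drop the other
if abs(df["GarageArea"].corr(df["GarageCars"])) > 0.9:
    df = df.drop(columns=["GarageCars"])
```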
&lt;/div&gt;
&lt;div id=&#34;skewed-categorical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Skewed Categorical Features&lt;/h4&gt;
&lt;p&gt;Any feature where more than 95% of the records have the same category probably has no predictive value. An extreme case is &lt;code&gt;Utilities&lt;/code&gt;, which has only 2 categories in the dataset - &lt;strong&gt;AllPub&lt;/strong&gt; and &lt;strong&gt;NoSeWa&lt;/strong&gt; - and only 1 record has &lt;strong&gt;NoSeWa&lt;/strong&gt;.&lt;/p&gt;
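&lt;p&gt;Such near-constant features can be flagged programmatically; a pandas sketch with hypothetical counts:&lt;/p&gt;

```python
import pandas as pd

# Flag features where a single category covers more than 95% of records
df = pd.DataFrame({
    "Utilities": ["AllPub"] * 99 + ["NoSeWa"],
    "Street":    ["Pave"] * 60 + ["Grvl"] * 40,
})

skewed = [
    c for c in df.columns
    if df[c].value_counts(normalize=True).iloc[0] > 0.95
]
```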
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_skewed_features-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;finalized-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finalized Data&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the finalized dataset&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 2919   73&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Excluding Id, there are 72 features in the finalized dataset.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;There are 26 numerical, 26 ordinal and 20 nominal features.&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;univariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Univariate Analysis&lt;/h2&gt;
&lt;p&gt;Let us look at each feature in the dataset in detail.&lt;/p&gt;
&lt;div id=&#34;numerical-features&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical Features&lt;/h3&gt;
&lt;p&gt;First let’s plot all the features that are measured as area in square feet:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/univariate_continuous1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-1&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;All area features have outliers.&lt;/li&gt;
&lt;li&gt;Many features are heavily skewed so they need to be normalized before fitting models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now let’s see other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/univariate_continuous2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;whats-notable-2&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Most of the properties have been built less than 20 years prior to their sale.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s plot the distribution of &lt;code&gt;SalePrice&lt;/code&gt; in log scale:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/SalePrice_distribution-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;whats-notable-3&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The distribution has long tails on both sides.&lt;/li&gt;
&lt;li&gt;There are 11 properties below USD 50,000 and 17 above USD 500,000.&lt;/li&gt;
&lt;li&gt;Linear models are very sensitive to the presence of outliers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-features&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Categorical Features&lt;/h2&gt;
&lt;div id=&#34;ordinal-features&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Ordinal Features&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Univariate_Cat1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-4&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Categorical imbalances exist in many features where 1 or 2 categories are dominant. This limits their usefulness as predictors, since the minority categories have too few observations to estimate reliable effects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;nominal-features.&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Nominal Features&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Univariate_Cat2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-5&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Categorical imbalances exist in many features where 1 or 2 categories are dominant.&lt;/li&gt;
&lt;li&gt;Most of the properties are sold during the summer months, and the fewest during the winter months.&lt;/li&gt;
&lt;li&gt;The effects of the housing market crisis are visible in the data: the fewest properties were sold in 2010.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bivariate Analysis&lt;/h2&gt;
&lt;div id=&#34;numerical-numerical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Numerical&lt;/h3&gt;
&lt;p&gt;Let’s examine the relationship of &lt;code&gt;SalePrice&lt;/code&gt; with other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/bivariate_scatterplots-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-6&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;From the scatterplot of &lt;code&gt;TotalSF&lt;/code&gt; vs. &lt;code&gt;SalePrice&lt;/code&gt;, it is clear there are &lt;em&gt;high leverage&lt;/em&gt; points where the target &lt;code&gt;SalePrice&lt;/code&gt; is unusually low relative to the area in sq. ft. These points have an outsized impact on the slope of the regression line, which would otherwise be steeper.&lt;/li&gt;
&lt;li&gt;The same set of points impact &lt;code&gt;TotalBsmtSF&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The Ids of these records are 524, 1299 and 2550; of these, 524 and 1299 are in the training set.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;correlations-with-saleprice&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Correlations with SalePrice&lt;/h3&gt;
&lt;p&gt;We isolate the features that have an absolute correlation of 0.1 or more with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
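&lt;p&gt;A sketch of this selection in pandas (hypothetical data; the post computes the correlations in R):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical numeric features; keep those with |corr| >= 0.1 against SalePrice
df = pd.DataFrame({
    "TotalSF":     [1500, 2000, 2500, 3000],
    "Noise":       [1, 2, 2, 1],
    "PropertyAge": [45, 30, 15, 0],
    "SalePrice":   [100000, 200000, 300000, 400000],
})

corrs = df.corr()["SalePrice"].drop("SalePrice")
selected = corrs[corrs.abs() >= 0.1].index.tolist()
```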
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_correlated-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-7&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The top 5 features are &lt;code&gt;TotalSF&lt;/code&gt;, &lt;code&gt;GarageArea&lt;/code&gt;, &lt;code&gt;TotalBath&lt;/code&gt;, &lt;code&gt;TotalBsmtSF&lt;/code&gt; and &lt;code&gt;1stFlrSF&lt;/code&gt;. Quite reasonably, these are the features a buyer would look at to evaluate a property and its &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It is notable that &lt;code&gt;PropertyAge&lt;/code&gt; shows a strong negative correlation with &lt;code&gt;SalePrice&lt;/code&gt;: properties that were more recently built or remodelled sell for higher prices than older ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical-ordinal&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Categorical (Ordinal)&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Bivariate_Num_Cat1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-8&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;We can spot clear trends in &lt;code&gt;SalePrice&lt;/code&gt; vs. the order of the categories in almost all of these features.&lt;/li&gt;
&lt;li&gt;Overall quality and external quality show some of the strongest trends.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical-nominal&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Categorical (Nominal)&lt;/h3&gt;
&lt;p&gt;Let’s examine &lt;code&gt;SalePrice&lt;/code&gt; with respect to the nominal features in the dataset. None of these features have a natural order, but we can identify trends within categories by sorting them by median &lt;code&gt;SalePrice&lt;/code&gt;. The &lt;code&gt;SalePrice&lt;/code&gt; axis is truncated to exclude outliers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Bivariate_Num_Cat2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-9&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageType&lt;/code&gt;: Built-in and attached garages are preferred over detached or other types of garages.&lt;/li&gt;
&lt;li&gt;From the &lt;code&gt;MSSubClass&lt;/code&gt; categories, it is evident that houses built in 1946 or later are priced higher than older houses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;multivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multivariate Analysis&lt;/h2&gt;
&lt;p&gt;We will check variation of some related features with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;numerical-numerical-categorical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Numerical-Categorical&lt;/h3&gt;
&lt;p&gt;We have determined &lt;code&gt;TotalSF&lt;/code&gt; and &lt;code&gt;GarageArea&lt;/code&gt; have among the strongest correlations with &lt;code&gt;SalePrice&lt;/code&gt;. Let’s see how they vary by &lt;code&gt;NeighborhoodType&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt; respectively:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/multivariate_num_num_cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For the same total area, there are neighborhoods where &lt;code&gt;SalePrice&lt;/code&gt; is higher than others.&lt;/li&gt;
&lt;li&gt;Properties with no garage are distinctly separated.&lt;/li&gt;
&lt;li&gt;Properties with built-in or attached garages tend to have higher &lt;code&gt;SalePrice&lt;/code&gt; for the same &lt;code&gt;GarageArea&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Therefore, &lt;code&gt;NeighborhoodType&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt; explain some variance in &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-categorical-numerical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Categorical-Categorical-Numerical&lt;/h3&gt;
&lt;p&gt;We want to see if there is any interaction of &lt;code&gt;SalePrice&lt;/code&gt; with a combination of categorical features, that could provide any additional explanatory power for prediction:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/multivariate_cat_cat_num-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is evident that some neighborhoods have higher &lt;code&gt;OverallQual&lt;/code&gt; and therefore command higher prices. However, in Type4 neighborhoods, we can see a clear variation in &lt;code&gt;SalePrice&lt;/code&gt; by quality of property.&lt;/li&gt;
&lt;li&gt;It is less clear if &lt;code&gt;GarageType&lt;/code&gt; has a major impact by itself. Even though built-in and attached garages seem to be preferred, most of the variation can be explained by &lt;code&gt;NeighborhoodType&lt;/code&gt; itself.&lt;/li&gt;
&lt;li&gt;Low density and floating village residential properties tend to be higher priced in both single and multi-storied properties built after 1946.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 3 - A polished gem - Building Non-linear Models</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/</link>
      <pubDate>Thu, 22 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/</guid>
      <description>


&lt;div id=&#34;other-posts-in-this-series&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Other posts in this series:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/&#34;&gt;Diamonds - Part 1 - In the rough - An Exploratory Data Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/&#34;&gt;Diamonds - Part 2 - A cut above - Building Linear Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a couple of previous posts, we tried to understand which attributes of diamonds are important in determining their prices. We showed that &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; are the most important predictors of &lt;code&gt;price&lt;/code&gt;. We arrived at this conclusion after a detailed exploratory data analysis. Finally, we fit linear models to predict prices and determined the best model from the metrics.&lt;/p&gt;
&lt;p&gt;In this post, we will use non-linear regression models to predict diamond prices and compare them with those from linear models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;training-non-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Training Non-linear Models&lt;/h2&gt;
&lt;p&gt;We’ll follow some of the same steps as we did for linear models, while transforming some predictors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the dataset into training and testing sets in the proportion 75% and 25% respectively.&lt;/li&gt;
&lt;li&gt;Stratify the partitioning by &lt;code&gt;clarity&lt;/code&gt;, so both training and testing sets have the same distributions of this feature.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;cut&lt;/code&gt; have ordered categories from lowest to highest grades. The &lt;code&gt;randomForest&lt;/code&gt; method requires no change in how this data is represented before training; however, the &lt;code&gt;xgboost&lt;/code&gt; and &lt;code&gt;keras&lt;/code&gt; methods require all predictors to be in numerical form. &lt;a href=&#34;https://statmodeling.stat.columbia.edu/2009/10/06/coding_ordinal/&#34;&gt;Two methods&lt;/a&gt; could be used for transforming the categorical data:
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Use one-hot encoding to convert categorical data to sparse data with 0s and 1s. This way, each category in &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;cut&lt;/code&gt; becomes a new binary predictor. A disadvantage of this method is that it treats ordered categorical data the same as unordered categorical data, so the ordinality is lost in the transformation. However, non-linear models should be able to infer the ordinality, as our training sample is sufficiently large.&lt;/li&gt;
&lt;li&gt;Represent the ordinal categories from lowest to highest grades in integer form. However, this creates a linear gradation from one category to another, which may not be a suitable choice here.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;Center and scale all values in the training set and build a matrix of predictors.&lt;/li&gt;
&lt;li&gt;Fit a non-linear model with the training set.&lt;/li&gt;
&lt;li&gt;Make predictions on the testing set and determine model metrics.&lt;/li&gt;
&lt;li&gt;Wrap all the steps above inside a function to which the model formula and a seed can be passed; the seed randomizes the partitioning into training and testing sets.&lt;/li&gt;
&lt;li&gt;Run multiple iterations of the models with different seeds, and compute their average metrics, which should better reflect results on unseen data.&lt;/li&gt;
&lt;/ul&gt;
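&lt;p&gt;The two encoding options discussed above can be sketched as follows (Python/pandas purely for illustration, since the post’s analysis is in R; the sample values below mirror the grades of the &lt;code&gt;cut&lt;/code&gt; feature):&lt;/p&gt;

```python
# Sketch of the two encodings of an ordered categorical feature.
import pandas as pd

df = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Good"]})
grades = ["Fair", "Good", "Very Good", "Premium", "Ideal"]  # lowest to highest

# Method 1: one-hot encoding. Each observed category becomes a binary
# column, but the ordering of the grades is lost.
one_hot = pd.get_dummies(df["cut"], prefix="cut")

# Method 2: integer codes. Ordinality is kept, but the gradation
# between adjacent grades is forced to be linear.
df["cut_ord"] = pd.Categorical(df["cut"], categories=grades, ordered=True).codes
```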
&lt;p&gt;Here are the average metrics for all the models trained with &lt;code&gt;keras&lt;/code&gt;, &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; regression methods:&lt;/p&gt;
&lt;table class=&#34;gmisc_table&#34; style=&#34;border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;border-top: 2px solid grey;&#34;&gt;
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
mae
&lt;/th&gt;
&lt;th style=&#34;border-top: 2px solid grey;; border-bottom: hidden;&#34;&gt;
 
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
rmse
&lt;/th&gt;
&lt;th style=&#34;border-top: 2px solid grey;; border-bottom: hidden;&#34;&gt;
 
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
rsq
&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ .
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
360.55
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
262.35
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
280.49
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
989.71
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
529.28
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
540.76
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.93
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
860.29
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
816.1
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
815.76
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1499.2
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1427.25
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
1427.35
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.86
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.87
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.87
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
590.32
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
548.67
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
544.48
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1040.69
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1006.61
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
992.46
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.93
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.94
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.94
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity + color
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
358.85
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
305.17
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
306.86
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
645.4
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
571.73
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
575.3
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.97
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity + color + cut
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
347.99
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
285.96
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: right;&#34;&gt;
282.38
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
626.78
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
545.02
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: right;&#34;&gt;
541.63
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Looking at the r-squared terms, it is remarkable how well all the models have inferred the complex relationship between &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;carat&lt;/code&gt;. To fit linear models, we needed to log-transform &lt;code&gt;price&lt;/code&gt; and take the cube root of &lt;code&gt;carat&lt;/code&gt;; the neural network as well as the decision-tree based models do this all on their own. The root mean squared error is in $ terms, so it is easier to interpret. Considering that both the mean and the standard deviation of &lt;code&gt;price&lt;/code&gt; in the dataset are about $4000, the root mean squared errors of the models are very low.&lt;/p&gt;
&lt;p&gt;Exploratory data analysis adds value here, as the models with &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; give excellent results. Including &lt;code&gt;cut&lt;/code&gt; in the models does not provide any significant benefits and results in overfitted models.&lt;/p&gt;
&lt;p&gt;Even the base models with all predictors, &lt;strong&gt;price ~ .&lt;/strong&gt; (some of which are confounders), do a very good job of explaining the variance. Decision tree and neural network models are largely unaffected by multi-collinearity. We can use local model interpretations to determine the most important predictors from these models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;local-interpretable-model-agnostic-explanations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Local Interpretable Model-agnostic Explanations&lt;/h2&gt;
&lt;p&gt;LIME is a method for explaining black-box machine learning models, helping visualize and explain individual predictions. It assumes that every complex model is linear on a local scale, so it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.&lt;/p&gt;
&lt;p&gt;The generalized algorithm LIME applies is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Given an observation, permute it to create replicated feature data with slight value modifications.&lt;/li&gt;
&lt;li&gt;Compute similarity distance measure between original observation and permuted observations.&lt;/li&gt;
&lt;li&gt;Apply selected machine learning model to predict outcomes of permuted data.&lt;/li&gt;
&lt;li&gt;Select m features that best describe the predicted outcomes.&lt;/li&gt;
&lt;li&gt;Fit a simple model to the permuted data, explaining the complex model outcome with the m features from the permuted data, weighted by their similarity to the original observation.&lt;/li&gt;
&lt;li&gt;Use the resulting feature weights to explain local behavior.&lt;/li&gt;
&lt;/ul&gt;
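&lt;p&gt;The steps above can be sketched with a toy, from-scratch local surrogate (Python/numpy purely for illustration, since the post’s analysis is in R; the kernel width and perturbation scale below are arbitrary assumptions):&lt;/p&gt;

```python
# Minimal LIME-style explanation of one observation: perturb the point,
# weight perturbations by proximity, fit a weighted linear surrogate.
import numpy as np

def explain_locally(predict_fn, x, n_samples=5000, kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Permute the observation into many slightly modified copies.
    X_pert = x + rng.normal(scale=0.3, size=(n_samples, x.size))
    # 2. Similarity between original and permuted points (RBF kernel).
    dist = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Predictions of the complex model on the permuted data.
    y = predict_fn(X_pert)
    # 4-5. Weighted least squares: a simple local linear model.
    A = np.hstack([np.ones((n_samples, 1)), X_pert])
    W = np.sqrt(weights)[:, None]
    coefs, *_ = np.linalg.lstsq(A * W, y * W[:, 0], rcond=None)
    return coefs[1:]  # 6. feature weights explain local behaviour

# A non-linear "black box": y = x0^2 + 3*x1, explained near (1, 1).
black_box = lambda X: X[:, 0] ** 2 + 3 * X[:, 1]
w = explain_locally(black_box, np.array([1.0, 1.0]))
```

&lt;p&gt;For the quadratic feature, the local slope near the observation comes out close to the true gradient of 2, while the linear feature recovers its global coefficient of 3.&lt;/p&gt;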
&lt;p&gt;Here we will select 5 features that best describe the predicted outcomes for 6 random observations from the testing set.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_importance-1.png&#34; width=&#34;960&#34; /&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_importance-2.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The features by importance that best explain the predictions in these 6 random samples are &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_heatmap-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_heatmap-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We know that &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are co-linear with &lt;code&gt;carat&lt;/code&gt;, which is why it is good practice to remove any redundant features from the training data before applying any machine learning algorithm. We find the model with the best metrics turns out to be the one using &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;actual-vs-predicted&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Actual v/s Predicted&lt;/h2&gt;
&lt;p&gt;Finally, here are the scatterplots of actual v/s predicted &lt;code&gt;price&lt;/code&gt; from the best model on the testing set, using the 3 regression methods:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/best_model_plot-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The scatterplots are shown with both linear and logarithmic axes. Even though the results from all 3 methods have roughly similar &lt;strong&gt;r-squared&lt;/strong&gt; and &lt;strong&gt;rmse&lt;/strong&gt; values, we can see that predicted prices from keras have more dispersion at the higher end than those from the two decision-tree methods. The decision-tree based methods also appear to do a better job of predicting prices at the lower end, with less dispersion.&lt;/p&gt;
&lt;p&gt;As in the case with linear models, the variance in predicted diamond prices increases with &lt;code&gt;price&lt;/code&gt;. But unlike linear models, the non-linear models do not produce extreme outliers in predicted prices. So, not only do non-linear methods do a fantastic job in inferring the relationships between &lt;code&gt;price&lt;/code&gt; and its predictors, they also predict prices within a reasonable range.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;All 3 non-linear regression methods can infer the complex relationship between &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;carat&lt;/code&gt; and the other predictors, without the need for feature engineering.&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis is useful in removing the redundant features from the training dataset, resulting in both faster execution, as well as much better metrics.&lt;/li&gt;
&lt;li&gt;In terms of time taken to train the models, &lt;code&gt;keras&lt;/code&gt; neural network models execute the fastest by virtue of being able to use GPUs.&lt;/li&gt;
&lt;li&gt;Among the decision-tree based methods, &lt;code&gt;xgboost&lt;/code&gt; models train much faster than &lt;code&gt;randomForest&lt;/code&gt; models.&lt;/li&gt;
&lt;li&gt;Multiple CPUs can be used to run the &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; methods; RAM is the main limiting constraint when training on a local machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 2 - A cut above - Building Linear Models</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/</link>
      <pubDate>Wed, 21 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/</guid>
      <description>


&lt;p&gt;In a &lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/&#34;&gt;previous post&lt;/a&gt; in this series, we did an exploratory data analysis of the &lt;code&gt;diamonds&lt;/code&gt; dataset and found that &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt; were strongly correlated with &lt;code&gt;price&lt;/code&gt;. To some extent, &lt;code&gt;clarity&lt;/code&gt; also appeared to provide some predictive ability.&lt;/p&gt;
&lt;p&gt;In this post, we will build linear models and see how well they predict the &lt;code&gt;price&lt;/code&gt; of diamonds.&lt;/p&gt;
&lt;p&gt;Before we do any transformations, feature engineering or feature selections for our model, let’s see what kind of results we get from a base linear model, that uses all the features to predict &lt;code&gt;price&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
-21376   -592   -183    376  10694 

Coefficients:
            Estimate Std. Error t value             Pr(&amp;gt;|t|)    
(Intercept)  5753.76     396.63   14.51 &amp;lt; 0.0000000000000002 ***
carat       11256.98      48.63  231.49 &amp;lt; 0.0000000000000002 ***
cut.L         584.46      22.48   26.00 &amp;lt; 0.0000000000000002 ***
cut.Q        -301.91      17.99  -16.78 &amp;lt; 0.0000000000000002 ***
cut.C         148.03      15.48    9.56 &amp;lt; 0.0000000000000002 ***
cut^4         -20.79      12.38   -1.68               0.0929 .  
color.L     -1952.16      17.34 -112.57 &amp;lt; 0.0000000000000002 ***
color.Q      -672.05      15.78  -42.60 &amp;lt; 0.0000000000000002 ***
color.C      -165.28      14.72  -11.22 &amp;lt; 0.0000000000000002 ***
color^4        38.20      13.53    2.82               0.0047 ** 
color^5       -95.79      12.78   -7.50    0.000000000000066 ***
color^6       -48.47      11.61   -4.17    0.000030090737193 ***
clarity.L    4097.43      30.26  135.41 &amp;lt; 0.0000000000000002 ***
clarity.Q   -1925.00      28.23  -68.20 &amp;lt; 0.0000000000000002 ***
clarity.C     982.20      24.15   40.67 &amp;lt; 0.0000000000000002 ***
clarity^4    -364.92      19.29  -18.92 &amp;lt; 0.0000000000000002 ***
clarity^5     233.56      15.75   14.83 &amp;lt; 0.0000000000000002 ***
clarity^6       6.88      13.72    0.50               0.6157    
clarity^7      90.64      12.10    7.49    0.000000000000071 ***
depth         -63.81       4.53  -14.07 &amp;lt; 0.0000000000000002 ***
table         -26.47       2.91   -9.09 &amp;lt; 0.0000000000000002 ***
x           -1008.26      32.90  -30.65 &amp;lt; 0.0000000000000002 ***
y               9.61      19.33    0.50               0.6192    
z             -50.12      33.49   -1.50               0.1345    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.92,  Adjusted R-squared:  0.92 
F-statistic: 2.69e+04 on 23 and 53916 DF,  p-value: &amp;lt;0.0000000000000002&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 3
  .metric .estimator .estimate
  &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 rmse    standard    1130.   
2 rsq     standard       0.920
3 mae     standard     740.   &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model summary shows that this is an overfitted model. Among other things, we know that &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt; have no impact on &lt;code&gt;price&lt;/code&gt;, yet they are shown to be highly significant. The Root Mean Squared Error (rmse) and other metrics are also shown above.&lt;/p&gt;
&lt;p&gt;Let’s make a plot of actual v/s predicted prices to visualize how well this base model performs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/simple_lm_model_plot-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If the predictions are good, the points should lie close to a straight line drawn at 45 degrees. We can see this base model does a poor job of predicting prices. Worst of all, the model predicts negative prices at the lower end, which shows that &lt;code&gt;price&lt;/code&gt; has to be log-transformed to avoid such absurdities.&lt;/p&gt;
&lt;div id=&#34;feature-engineering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Feature Engineering&lt;/h2&gt;
&lt;p&gt;We know the price of a diamond is strongly correlated with its size. All things equal, the larger the diamond, the greater its price.&lt;/p&gt;
&lt;p&gt;As a first approximation, we can assume a diamond is a cuboid with dimensions &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt;. Then, we can compute its &lt;code&gt;volume&lt;/code&gt; as x * y * z.
As these 3 dimensions are highly correlated, we can compute a geometrical average dimension by taking the cube root of &lt;code&gt;volume&lt;/code&gt;, and retain a linear relationship with &lt;code&gt;log(price)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Another way to calculate an average dimension is by using high school chemistry. Mass, volume and density are related to each other by the equation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(density = mass / volume\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We know that 1 carat = 0.2 gms. Dividing the mass by the density of diamond (3.51 gms/cc) gives us the volume in cc, which can be converted to a geometrical average dimension by taking the cube root.&lt;/p&gt;
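&lt;p&gt;As a quick sanity check of this arithmetic (Python, purely for illustration), a 1-carat diamond works out to an average dimension of roughly 3.85 mm:&lt;/p&gt;

```python
# Convert carat weight to a geometrical average dimension via density.
CARAT_TO_GRAMS = 0.2   # 1 carat = 0.2 g
DENSITY = 3.51         # g per cubic cm, for diamond

def avg_dimension_mm(carat):
    volume_cc = carat * CARAT_TO_GRAMS / DENSITY   # volume in cm^3
    volume_mm3 = volume_cc * 1000                  # 1 cm^3 = 1000 mm^3
    return volume_mm3 ** (1 / 3)                   # cube root: average side

dim = avg_dimension_mm(1.0)   # about 3.85 mm for a 1-carat diamond
```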
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/feature_engineering-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Even though both methods yield similar results, we can see that the density method results in a narrower range. But which method would be more robust?
Keep in mind that there are 20 &lt;code&gt;z&lt;/code&gt; values that are 0. In 7 of these records, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are 0 too, which means these values were not recorded reliably.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 20 x 10
   carat cut       color clarity depth table price     x     y     z
   &amp;lt;dbl&amp;gt; &amp;lt;ord&amp;gt;     &amp;lt;ord&amp;gt; &amp;lt;ord&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
 1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
 2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
 3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
 4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
 5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
 6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
 7  1    Very Good H     VS2      63.3    53  5139  0     0        0
 8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
 9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In all of these records, the &lt;code&gt;carat&lt;/code&gt; values were recorded reliably and are probably more accurate than the dimensions.
Hence, we might prefer the density method of generating this feature.&lt;/p&gt;
&lt;p&gt;Furthermore, since density is a constant, dividing by a constant to calculate volume isn’t really necessary. Instead, a cube root transformation can be applied to &lt;code&gt;carat&lt;/code&gt; itself for the purposes of predictive modelling, resulting in a linear relationship between &lt;span class=&#34;math inline&#34;&gt;\(log(price)\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(carat^{1/3}\)&lt;/span&gt;.
This is why we can still fit a linear model: the model remains linear in its parameters.&lt;/p&gt;
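&lt;p&gt;A minimal numerical illustration of this point (Python; the coefficient values below are made up for the example): once &lt;code&gt;carat&lt;/code&gt; is cube-root transformed, an ordinary least squares fit recovers the relationship exactly, because the model is linear in its parameters:&lt;/p&gt;

```python
# log(price) assumed linear in carat**(1/3); a straight-line fit on the
# transformed predictor recovers the coefficients exactly.
import numpy as np

carat = np.linspace(0.3, 2.0, 200)
log_price = 2.1 + 6.2 * carat ** (1 / 3)   # assumed true relationship

slope, intercept = np.polyfit(carat ** (1 / 3), log_price, 1)
```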
&lt;/div&gt;
&lt;div id=&#34;training-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Training Linear Models&lt;/h2&gt;
&lt;p&gt;Here are the steps for building linear models and computing metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the dataset into training and testing sets in the proportion 75% and 25% respectively.&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;clarity&lt;/code&gt; is one of the main predictors, stratify the partitioning by &lt;code&gt;clarity&lt;/code&gt;, so both training and testing sets have the same distributions of this feature.&lt;/li&gt;
&lt;li&gt;Fit a linear model with the training set.&lt;/li&gt;
&lt;li&gt;Make predictions on the testing set and determine model metrics.&lt;/li&gt;
&lt;li&gt;Wrap all the steps above inside a function to which the model formula and a seed can be passed. Since the seed determines the random partitioning, this helps minimize the vagaries of partitioning the training and testing sets before fitting models.&lt;/li&gt;
&lt;li&gt;Run multiple iterations of a model with different seeds, and compute its average metrics, which should better reflect the results on unseen data.&lt;/li&gt;
&lt;/ul&gt;
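&lt;p&gt;The loop above can be sketched as follows (Python/scikit-learn purely for illustration, since the post’s analysis is in R; the toy data and model below are assumptions):&lt;/p&gt;

```python
# Stratified 75/25 split, fit, evaluate; repeat over seeds and average.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def run_once(X, y, strata, seed):
    # Stratify so training and testing sets share the strata distribution.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=strata, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5  # rmse

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=400)
strata = (X[:, 0] > 0).astype(int)   # stand-in for the clarity grades
avg_rmse = float(np.mean([run_once(X, y, strata, s) for s in range(5)]))
```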
&lt;p&gt;Here’s a sample split of training and testing set, stratified by &lt;code&gt;clarity&lt;/code&gt;. As we can see, the training and testing sets have similar distributions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dfTrain$clarity 
       n  missing distinct 
   40457        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    552  6895  9826  9222  6125  3780  2722  1335
Proportion 0.014 0.170 0.243 0.228 0.151 0.093 0.067 0.033&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;dfTest$clarity 
       n  missing distinct 
   13483        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    189  2299  3239  3036  2046  1286   933   455
Proportion 0.014 0.171 0.240 0.225 0.152 0.095 0.069 0.034&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running 5 iterations of each model with a different seed, here are the average metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 4
  model                                                 rmse   rsq   mae
  &amp;lt;chr&amp;gt;                                                &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 log(price) ~ .                                      11055. 0.670  570.
2 log(price) ~ I(carat^(1/3))                          2893. 0.687 1039.
3 log(price) ~ I(carat^(1/3)) + clarity                2312. 0.807  881.
4 log(price) ~ I(carat^(1/3)) + clarity + color        1870. 0.870  631.
5 log(price) ~ I(carat^(1/3)) + clarity + color + cut  1848. 0.875  625.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first model with all predictors is an overfitted one.&lt;/p&gt;
&lt;p&gt;The model with &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; provides the best combination of root mean squared error and r-squared, explaining the most variance.
This is our final model.
Including &lt;code&gt;cut&lt;/code&gt; in the model has diminishing benefits, and tends to overfit the data.&lt;/p&gt;
&lt;p&gt;Here’s the summary of our final model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
Call:
lm(formula = model_formula, data = dfTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6022 -0.1034  0.0145  0.1066  1.7941 

Coefficients:
                Estimate Std. Error t value             Pr(&amp;gt;|t|)    
(Intercept)     2.147009   0.004993  429.99 &amp;lt; 0.0000000000000002 ***
I(carat^(1/3))  6.246412   0.005365 1164.27 &amp;lt; 0.0000000000000002 ***
clarity.L       0.922295   0.005036  183.15 &amp;lt; 0.0000000000000002 ***
clarity.Q      -0.295539   0.004734  -62.43 &amp;lt; 0.0000000000000002 ***
clarity.C       0.166979   0.004068   41.05 &amp;lt; 0.0000000000000002 ***
clarity^4      -0.068591   0.003260  -21.04 &amp;lt; 0.0000000000000002 ***
clarity^5       0.032833   0.002669   12.30 &amp;lt; 0.0000000000000002 ***
clarity^6      -0.001904   0.002325   -0.82              0.41288    
clarity^7       0.025508   0.002049   12.45 &amp;lt; 0.0000000000000002 ***
color.L        -0.488882   0.002927 -167.05 &amp;lt; 0.0000000000000002 ***
color.Q        -0.117319   0.002680  -43.78 &amp;lt; 0.0000000000000002 ***
color.C        -0.012230   0.002497   -4.90           0.00000098 ***
color^4         0.019007   0.002288    8.31 &amp;lt; 0.0000000000000002 ***
color^5        -0.008110   0.002159   -3.76              0.00017 ***
color^6        -0.000396   0.001967   -0.20              0.84055    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

Residual standard error: 0.166 on 40442 degrees of freedom
Multiple R-squared:  0.973, Adjusted R-squared:  0.973 
F-statistic: 1.05e+05 on 14 and 40442 DF,  p-value: &amp;lt;0.0000000000000002&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/final_model_summary-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here’s a scatterplot of actual v/s predicted log(price) from our final model on the testing set:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/final_model_plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The points lie close to the 45-degree line. However, on the high end there are many outliers where the actual and predicted values diverge substantially.
Nevertheless, this is about as good as a linear model gets on this dataset.&lt;/p&gt;
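&lt;p&gt;The coefficient table above corresponds to a model of the form &lt;code&gt;log(price) ~ I(carat^(1/3)) + clarity + color&lt;/code&gt;. As a minimal sketch, a fit of this form can be reproduced as below; the 75/25 split proportion and seed are assumptions here, and the post’s own preprocessing (such as outlier removal) is not reproduced:&lt;/p&gt;

```r
library(ggplot2)  # provides the diamonds dataset

# Assumed 75/25 train/test split; seed chosen arbitrarily
set.seed(42)
idx   <- sample(nrow(diamonds), 0.75 * nrow(diamonds))
train <- diamonds[idx, ]
test  <- diamonds[-idx, ]

# clarity and color are ordered factors, so lm() expands them into
# the polynomial contrasts (.L, .Q, .C, ^4, ...) seen in the summary
fit  <- lm(log(price) ~ I(carat^(1/3)) + clarity + color, data = train)

# RMSE between actual and predicted log(price) on the held-out set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((log(test$price) - pred)^2))
```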
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 1 - In the rough - An Exploratory Data Analysis</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part1-eda/</link>
      <pubDate>Tue, 20 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part1-eda/</guid>
      <description>


&lt;p&gt;In this case study, we will explore the &lt;code&gt;diamonds&lt;/code&gt; dataset, then build linear and non-linear regression models to predict the price of diamonds.&lt;/p&gt;
&lt;div id=&#34;data-description&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Description&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;diamonds&lt;/code&gt; dataset contains prices (in 2008 USD) and other attributes of almost 54,000 diamonds.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Attribute&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;price&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;price in 2008 USD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;carat&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;weight of a diamond (1 carat = 0.2 gms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cut&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;quality of the cut (Fair, Good, Very Good, Premium, Ideal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;color&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;diamond color from D (best) to J (worst)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;clarity&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;x&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;length in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;y&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;width in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;z&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;depth in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;depth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;total depth percentage = z / mean(x, y) = 2 * z / (x + y)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;table&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;width of the top of diamond relative to widest point&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;center&gt;
&lt;p&gt;&lt;img src=&#34;xyz.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;color.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;clarity.png&#34; /&gt;&lt;/p&gt;
&lt;/center&gt;
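&lt;p&gt;The dataset ships with the &lt;code&gt;ggplot2&lt;/code&gt; R package, so it can be loaded and inspected directly:&lt;/p&gt;

```r
library(ggplot2)  # the diamonds data frame is bundled with ggplot2

dim(diamonds)  # 53940 rows, 10 columns
str(diamonds)  # cut, color and clarity are stored as ordered factors
```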
&lt;/div&gt;
&lt;div id=&#34;data-summaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Summaries&lt;/h2&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/summary_visual-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A preliminary visual summary of the whole dataset shows all the features and their types. There are no missing values (NAs) in this dataset.&lt;/p&gt;
&lt;p&gt;Let’s examine each feature numerically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dfInput 

 10  Variables      53940  Observations
----------------------------------------------------------------------------------------------------------------------------------------------------------------
price 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0    11602        1     3933     4012      544      646      950     2401     5324     9821    13107 

lowest :   326   327   334   335   336, highest: 18803 18804 18806 18818 18823
----------------------------------------------------------------------------------------------------------------------------------------------------------------
carat 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      273    0.999   0.7979   0.5122     0.30     0.31     0.40     0.70     1.04     1.51     1.70 

lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
----------------------------------------------------------------------------------------------------------------------------------------------------------------
cut 
       n  missing distinct 
   53940        0        5 

lowest : Fair      Good      Very Good Premium   Ideal    , highest: Fair      Good      Very Good Premium   Ideal    
                                                            
Value           Fair      Good Very Good   Premium     Ideal
Frequency       1610      4906     12082     13791     21551
Proportion     0.030     0.091     0.224     0.256     0.400
----------------------------------------------------------------------------------------------------------------------------------------------------------------
color 
       n  missing distinct 
   53940        0        7 

lowest : J I H G F, highest: H G F E D
                                                    
Value          J     I     H     G     F     E     D
Frequency   2808  5422  8304 11292  9542  9797  6775
Proportion 0.052 0.101 0.154 0.209 0.177 0.182 0.126
----------------------------------------------------------------------------------------------------------------------------------------------------------------
clarity 
       n  missing distinct 
   53940        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    741  9194 13065 12258  8171  5066  3655  1790
Proportion 0.014 0.170 0.242 0.227 0.151 0.094 0.068 0.033
----------------------------------------------------------------------------------------------------------------------------------------------------------------
depth 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      184    0.999    61.75    1.515     59.3     60.0     61.0     61.8     62.5     63.3     63.8 

lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------
table 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      127     0.98    57.46    2.448       54       55       56       57       59       60       61 

lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------
x 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      554        1    5.731    1.276     4.29     4.36     4.71     5.70     6.54     7.31     7.66 

lowest :  0.00  3.73  3.74  3.76  3.77, highest: 10.01 10.02 10.14 10.23 10.74
----------------------------------------------------------------------------------------------------------------------------------------------------------------
y 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      552        1    5.735    1.269     4.30     4.36     4.72     5.71     6.54     7.30     7.65 

lowest :  0.00  3.68  3.71  3.72  3.73, highest: 10.10 10.16 10.54 31.80 58.90
                                                                                                                      
Value        0.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5   8.0   8.5   9.0   9.5  10.0  10.5  32.0  59.0
Frequency      7     5  1731 12305  7817  5994  6742  9260  4298  3402  1635   652    69    14     6     1     1     1
Proportion 0.000 0.000 0.032 0.228 0.145 0.111 0.125 0.172 0.080 0.063 0.030 0.012 0.001 0.000 0.000 0.000 0.000 0.000

For the frequency table, variable is rounded to the nearest 0.5
----------------------------------------------------------------------------------------------------------------------------------------------------------------
z 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      375        1    3.539   0.7901     2.65     2.69     2.91     3.53     4.04     4.52     4.73 

lowest :  0.00  1.07  1.41  1.53  2.06, highest:  6.43  6.72  6.98  8.06 31.80
                                                                                                          
Value        0.0   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   8.0  32.0
Frequency     20     1     2     3  8807 13809  9474 13682  5525  2352   237    20     5     1     1     1
Proportion 0.000 0.000 0.000 0.000 0.163 0.256 0.176 0.254 0.102 0.044 0.004 0.000 0.000 0.000 0.000 0.000

For the frequency table, variable is rounded to the nearest 0.5
----------------------------------------------------------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt;: The average price of a diamond in this dataset is ~ USD 4000. There are many outliers on the high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;carat&lt;/code&gt;: The average carat weight is ~ 0.8. About 75% of the diamonds are under 1 carat. The top 5 values show presence of many outliers on the high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cut&lt;/code&gt;: About 40% of the diamonds are of &lt;em&gt;Ideal&lt;/em&gt; cut. Only 3% are &lt;em&gt;Fair&lt;/em&gt; cut. So there is a lot of imbalance in the categories.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;color&lt;/code&gt;: Most of the diamonds are rated &lt;em&gt;E&lt;/em&gt; to &lt;em&gt;H&lt;/em&gt; color. Relatively fewer are rated &lt;em&gt;J&lt;/em&gt; color.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt;: Most of the diamonds are rated &lt;em&gt;SI2&lt;/em&gt; to &lt;em&gt;VS1&lt;/em&gt; clarity. About 1% are rated the worst &lt;em&gt;I1&lt;/em&gt; clarity, whereas only ~ 3% are rated &lt;em&gt;IF&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;depth&lt;/code&gt;: Most of the depth values are between 60 and 64. There are outliers on both low end and high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table&lt;/code&gt;: Most of the table values are between 54 and 65. There are outliers on both ends.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;: Denotes the dimension along the x-axis. Most values are between 4 and 8. There are also some 0 values, which likely means those dimensions were not recorded.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;y&lt;/code&gt;: Denotes the dimension along the y-axis. Most values are between 3.5 and 8. There are 7 records where the values are 0.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z&lt;/code&gt;: Denotes the dimension along the z-axis. Most values are between 2.5 and 8.5. There are 20 records where the values are 0.&lt;/li&gt;
&lt;/ul&gt;
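&lt;p&gt;The suspicious 0 values in the dimension columns noted above can be counted directly; a quick sketch:&lt;/p&gt;

```r
library(ggplot2)

# Dimensions recorded as 0 are physically impossible, so they are
# effectively missing values in disguise
zero_dims <- colSums(diamonds[, c("x", "y", "z")] == 0)
zero_dims
```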
&lt;/div&gt;
&lt;div id=&#34;univariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Univariate Analysis&lt;/h2&gt;
&lt;p&gt;Let us look at each feature in the dataset in detail.&lt;/p&gt;
&lt;div id=&#34;numerical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical Features&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/univariate_continuous-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The plots show presence of outliers within each feature. Let’s exclude the outliers and plot them again.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/univariate_continuous_ex_outliers-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Excluding outliers, the ranges of values are more reasonable. We can see that &lt;code&gt;carat&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt; are heavily right-skewed.&lt;/p&gt;
&lt;p&gt;Let’s plot the distribution of &lt;code&gt;price&lt;/code&gt; in log scale:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/price_distribution-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The two peaks in the log-transformed plot show a bimodal distribution of prices. This implies two price points are most popular among customers -
one just below USD 1000 and the other around USD 5000. Intriguingly, there are no diamonds in the dataset priced around USD 1500, so a big gap is visible at that price.&lt;/p&gt;
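&lt;p&gt;A sketch of the log-scale plot above (the bin count is an assumption for illustration):&lt;/p&gt;

```r
library(ggplot2)

# Price histogram on a log10 x-axis reveals the bimodal shape
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 100) +
  scale_x_log10() +
  labs(x = "price (USD, log scale)", y = "count")

# The gap near USD 1500 can also be checked numerically
near_1500 <- sum(diamonds$price >= 1480 & diamonds$price <= 1520)
```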
&lt;/div&gt;
&lt;div id=&#34;categorical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Categorical Features&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/Univariate_Categorical-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The categorical imbalance in &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; can be clearly noticed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bivariate Analysis&lt;/h2&gt;
&lt;p&gt;Let’s examine the relationship of &lt;code&gt;price&lt;/code&gt; with other features.&lt;/p&gt;
&lt;div id=&#34;numerical-numerical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-numerical&lt;/h4&gt;
&lt;p&gt;First and foremost, let’s do a correlation analysis to see how &lt;code&gt;price&lt;/code&gt; is correlated with other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/bivariate_correlations-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that &lt;code&gt;price&lt;/code&gt; is very strongly correlated with &lt;code&gt;carat&lt;/code&gt; and the &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; dimensions. Since these features are also strongly correlated with each other, a linear regression model built on all of them would suffer from multicollinearity. &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; have almost no correlation with &lt;code&gt;price&lt;/code&gt;, so they are not so interesting for
predictive modelling.&lt;/p&gt;
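&lt;p&gt;The correlations underlying the plot can be computed directly:&lt;/p&gt;

```r
library(ggplot2)

# Pearson correlation of price with every numerical feature
num_cols <- c("price", "carat", "depth", "table", "x", "y", "z")
cors <- cor(as.data.frame(diamonds)[, num_cols])["price", ]
round(cors, 2)
# carat, x, y and z correlate strongly with price, while
# depth and table are close to zero
```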
&lt;p&gt;Now let’s see the scatter plots:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/bivariate_scatterplots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After removing outliers, it can be noted that &lt;code&gt;price&lt;/code&gt; increases exponentially with &lt;code&gt;carat&lt;/code&gt;, as well as with the &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; dimensions. So &lt;code&gt;price&lt;/code&gt; should be plotted with a log transformation. Let’s do that:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/log_scatterplots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the relationship of &lt;code&gt;log(price)&lt;/code&gt; with &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; appears to be linear, but less so with &lt;code&gt;carat&lt;/code&gt;. The variance in &lt;code&gt;price&lt;/code&gt; tends to
increase with both &lt;code&gt;carat&lt;/code&gt; and the dimensions. Log-transforming &lt;code&gt;carat&lt;/code&gt; wouldn’t help because &lt;code&gt;carat&lt;/code&gt; does not span a wide range.
We will find ways to deal with this when we do Feature Engineering.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-Categorical&lt;/h4&gt;
&lt;p&gt;Let’s examine &lt;code&gt;price&lt;/code&gt; with respect to the categorical features in the dataset:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/Bivariate_Cont_Cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The boxplots above are plotted with a truncated &lt;code&gt;price&lt;/code&gt; axis for better visualization of trends. All the boxplots are counter-intuitive - median prices tend to decline as we move from the lowest grade to the highest grade of &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;. This is very odd.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The median &lt;code&gt;price&lt;/code&gt; generally declines from &lt;em&gt;Fair&lt;/em&gt; &lt;code&gt;cut&lt;/code&gt; to &lt;em&gt;Ideal&lt;/em&gt; &lt;code&gt;cut&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In terms of &lt;code&gt;color&lt;/code&gt;, the median &lt;code&gt;price&lt;/code&gt; decreases from &lt;em&gt;J&lt;/em&gt; (worst) to &lt;em&gt;G&lt;/em&gt; (mid-grade), then increases and finally decreases for &lt;em&gt;D&lt;/em&gt; (best).&lt;/li&gt;
&lt;li&gt;The median &lt;code&gt;price&lt;/code&gt; increases when &lt;code&gt;clarity&lt;/code&gt; improves from &lt;em&gt;I1&lt;/em&gt; to &lt;em&gt;SI2&lt;/em&gt;, and then decreases monotonically to &lt;em&gt;IF&lt;/em&gt; grade.&lt;/li&gt;
&lt;/ul&gt;
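&lt;p&gt;Boxplots of this kind, along with the medians behind them, can be reproduced along these lines (the USD 8000 axis limit is an assumption for illustration):&lt;/p&gt;

```r
library(ggplot2)

# coord_cartesian() truncates the view without dropping data,
# so the boxplot statistics stay intact
p <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 8000))

# Median price per cut grade, to see the counter-intuitive ordering
med_by_cut <- tapply(diamonds$price, diamonds$cut, median)
```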
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;multivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multivariate Analysis&lt;/h2&gt;
&lt;p&gt;So far, we have determined &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; have the strongest relationship with &lt;code&gt;price.&lt;/code&gt; Different grades of &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; also seem to have some impact on median &lt;code&gt;price&lt;/code&gt;. So let’s make some scatter plots to see these relationships:&lt;/p&gt;
&lt;div id=&#34;numerical-numerical-categorical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-Numerical-Categorical&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_num_num_cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Although there is a lot of overlap, there is a clear trend of &lt;code&gt;price&lt;/code&gt; increasing with &lt;code&gt;clarity&lt;/code&gt; at a given &lt;code&gt;carat&lt;/code&gt; weight. The same pattern can be observed, though less strongly, across increasing grades of &lt;code&gt;color&lt;/code&gt;. There is no comparable pattern for &lt;code&gt;cut&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We can conclude both &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; explain some variance in &lt;code&gt;price&lt;/code&gt; at a given &lt;code&gt;carat&lt;/code&gt; weight.&lt;/p&gt;
&lt;p&gt;To check for any interaction of &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; with &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, let’s plot these:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_plots_other-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There is no pattern in &lt;code&gt;price&lt;/code&gt; v/s &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt; when plotted by &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;. So these features appear to have little predictive power for &lt;code&gt;price&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-categorical-numerical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Categorical-Categorical-Numerical&lt;/h4&gt;
&lt;p&gt;We want to see if there is any interaction of &lt;code&gt;clarity&lt;/code&gt; with &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt;, that could provide any additional explanatory power to predict &lt;code&gt;price&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_cat_cat_num-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The second heatmap is more interesting. From bottom left to top right, with increasing grades of &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt; tends to decrease on average. Once again, this runs counter to our intuition; after all, diamonds with the best &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; should command the highest prices. Nevertheless, this counter-trend persists in the dataset.&lt;/p&gt;
&lt;p&gt;With respect to &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, the mean prices do not show any discernible pattern.&lt;/p&gt;
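&lt;p&gt;The quantity behind the second heatmap - mean &lt;code&gt;price&lt;/code&gt; per &lt;code&gt;color&lt;/code&gt;/&lt;code&gt;clarity&lt;/code&gt; cell - can be computed and tiled as follows:&lt;/p&gt;

```r
library(ggplot2)

# Mean price for each color x clarity combination
cell_means <- aggregate(price ~ color + clarity, data = diamonds, FUN = mean)

# Heatmap: each tile is one color/clarity cell, filled by mean price
p <- ggplot(cell_means, aes(x = color, y = clarity, fill = price)) +
  geom_tile()
```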
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;To summarize, here’s what we found interesting in this dataset, after doing an exploratory data analysis:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt; is heavily right-skewed, and when log-transformed, has a bimodal distribution, which implies there is demand in two different price ranges.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;carat&lt;/code&gt;: about 75% of the diamonds are below 1 carat. The variance in price increases with carat weight.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cut&lt;/code&gt; is imbalanced with about 40% of the diamonds rated &lt;em&gt;Ideal&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;color&lt;/code&gt; is imbalanced with about 5% of the diamonds rated &lt;em&gt;J&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt; is imbalanced at the extremes, with only 1.5% of the diamonds rated &lt;em&gt;I1&lt;/em&gt; and 3.3% of the diamonds rated &lt;em&gt;IF&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt; is strongly correlated with &lt;code&gt;carat&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt; dimensions of the diamonds. &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; have almost no correlation with &lt;code&gt;price&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Both &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; appear to explain some variance in &lt;code&gt;price&lt;/code&gt; for a given &lt;code&gt;carat&lt;/code&gt; weight.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
