Diamonds - Part 2 - A cut above - Building Linear Models

In a previous post in this series, we did an exploratory data analysis of the diamonds dataset and found that carat, x, y, z were strongly correlated with price. To some extent, clarity also appeared to provide some predictive ability.

In this post, we will build linear models and see how well they predict the price of diamonds.

Before we do any transformations, feature engineering or feature selections for our model, let’s see what kind of results we get from a base linear model, that uses all the features to predict price:


Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
-21376   -592   -183    376  10694 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  5753.76     396.63   14.51 < 0.0000000000000002 ***
carat       11256.98      48.63  231.49 < 0.0000000000000002 ***
cut.L         584.46      22.48   26.00 < 0.0000000000000002 ***
cut.Q        -301.91      17.99  -16.78 < 0.0000000000000002 ***
cut.C         148.03      15.48    9.56 < 0.0000000000000002 ***
cut^4         -20.79      12.38   -1.68               0.0929 .  
color.L     -1952.16      17.34 -112.57 < 0.0000000000000002 ***
color.Q      -672.05      15.78  -42.60 < 0.0000000000000002 ***
color.C      -165.28      14.72  -11.22 < 0.0000000000000002 ***
color^4        38.20      13.53    2.82               0.0047 ** 
color^5       -95.79      12.78   -7.50    0.000000000000066 ***
color^6       -48.47      11.61   -4.17    0.000030090737193 ***
clarity.L    4097.43      30.26  135.41 < 0.0000000000000002 ***
clarity.Q   -1925.00      28.23  -68.20 < 0.0000000000000002 ***
clarity.C     982.20      24.15   40.67 < 0.0000000000000002 ***
clarity^4    -364.92      19.29  -18.92 < 0.0000000000000002 ***
clarity^5     233.56      15.75   14.83 < 0.0000000000000002 ***
clarity^6       6.88      13.72    0.50               0.6157    
clarity^7      90.64      12.10    7.49    0.000000000000071 ***
depth         -63.81       4.53  -14.07 < 0.0000000000000002 ***
table         -26.47       2.91   -9.09 < 0.0000000000000002 ***
x           -1008.26      32.90  -30.65 < 0.0000000000000002 ***
y               9.61      19.33    0.50               0.6192    
z             -50.12      33.49   -1.50               0.1345    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.92,  Adjusted R-squared:  0.92 
F-statistic: 2.69e+04 on 23 and 53916 DF,  p-value: <0.0000000000000002
# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    1130.   
2 rsq     standard       0.920
3 mae     standard     740.   

The model summary shows it is an overfitted model. Among other things, we know that depth and table have no impact on price, yet these are shown to be highly significant. Root Mean Squared Error (rmse) and other metrics are also shown above.

Let’s make a plot of actual v/s predicted prices to visualize how well this base model performs.

If the predictions are good, the points should lie close to a straight line drawn at 45 degrees. We can see this base model does a poor job of predicting prices. Worst of all, the model predicts negative prices on the lower end. It shows that price has to be log transformed to avoid these absurdities.

Feature Engineering

We know the price of a diamond is strongly correlated with its size. All things equal, the larger the diamond, the greater its price.

As a first approximation, we can assume a diamond is a cuboid with dimensions x, y and z. Then, we can compute its volume as x * y * z. As these 3 dimensions are highly correlated, we can compute a geometrical average dimension by taking the cube root of volume, and retain a linear relationship with log(price).

Another way to calculate an average dimension is by using high school chemistry. Mass, volume and density are related to each other by the equation:

$ density = mass/volume $

We can find out that 1 carat = 0.2 gms. Dividing by the density of diamond (3.51 gms/cc) would give us its volume in cc, which could be converted to a geometrical average dimension by taking the cube root.

Even though both methods yield similar results, we could see that the density method results in a narrower range. But which method would be more robust? Keep in mind there are 20 z values that are 0. In 7 of these records both x and y are 0 too, which means these values were not recorded reliably.

# A tibble: 20 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
 2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
 3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
 4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
 5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
 6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
 7  1    Very Good H     VS2      63.3    53  5139  0     0        0
 8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
 9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0

In all of these records, the carat values were recorded reliably and are probably more accurate than the dimensions. Hence, we might prefer the density method of generating this feature.

Furthermore, since density is a constant, dividing by a constant to calculate volume isn’t really necessary. Instead, a cube root transformation could be applied to carat itself for the purposes of predictive modelling that would result in a linear relationship between \(log(price)\) and \(carat^{1/3}\). It is the reason why we’re fitting a linear model because the model is linear in its parameters.

Training Linear Models

Here are the steps for building linear models and computing metrics:

  • Partition the dataset into training and testing sets in the proportion 75% and 25% respectively.
  • Since clarity is one of the main predictors, stratify the partitioning by clarity, so both training and testing sets have the same distributions of this feature.
  • Fit a linear model with the training set.
  • Make predictions on the testing set and determine model metrics.
  • Wrap all the steps above inside a function in which the model formula and a seed could be passed. Since the seed determines the random partitioning, it helps to minimize vagaries in partitioning the training and testing sets before fitting models.
  • Run multiple iterations of a model with different seeds, and compute its average metrics, that would reflect the results on unseen data.

Here’s a sample split of training and testing set, stratified by clarity. As we can see, the training and testing sets have similar distributions.

dfTrain$clarity 
       n  missing distinct 
   40457        0        8 
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    552  6895  9826  9222  6125  3780  2722  1335
Proportion 0.014 0.170 0.243 0.228 0.151 0.093 0.067 0.033
dfTest$clarity 
       n  missing distinct 
   13483        0        8 
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    189  2299  3239  3036  2046  1286   933   455
Proportion 0.014 0.171 0.240 0.225 0.152 0.095 0.069 0.034

After running 5 iterations of each model with a different seed, here are the average metrics:

# A tibble: 5 x 4
  model                                                rmse   rsq   mae
  <chr>                                               <dbl> <dbl> <dbl>
1 log(price) ~ .                                      0.170 0.972 0.112
2 log(price) ~ I(carat^(1/3))                         0.282 0.923 0.216
3 log(price) ~ I(carat^(1/3)) + clarity               0.220 0.953 0.173
4 log(price) ~ I(carat^(1/3)) + clarity + color       0.167 0.973 0.127
5 log(price) ~ I(carat^(1/3)) + clarity + color + cut 0.164 0.974 0.125

The first model with all predictors is an overfitted one.

The model with carat, clarity and color provides the best combination of root mean squared error and r-squared, that explains the most variance. This is our final model. Including cut in the model has diminishing benefits, and tends to overfit the data.

Here’s the summary of our final model:


Call:
lm(formula = model_formula, data = dfTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6022 -0.1034  0.0145  0.1066  1.7941 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     2.147009   0.004993  429.99 < 0.0000000000000002 ***
I(carat^(1/3))  6.246412   0.005365 1164.27 < 0.0000000000000002 ***
clarity.L       0.922295   0.005036  183.15 < 0.0000000000000002 ***
clarity.Q      -0.295539   0.004734  -62.43 < 0.0000000000000002 ***
clarity.C       0.166979   0.004068   41.05 < 0.0000000000000002 ***
clarity^4      -0.068591   0.003260  -21.04 < 0.0000000000000002 ***
clarity^5       0.032833   0.002669   12.30 < 0.0000000000000002 ***
clarity^6      -0.001904   0.002325   -0.82              0.41288    
clarity^7       0.025508   0.002049   12.45 < 0.0000000000000002 ***
color.L        -0.488882   0.002927 -167.05 < 0.0000000000000002 ***
color.Q        -0.117319   0.002680  -43.78 < 0.0000000000000002 ***
color.C        -0.012230   0.002497   -4.90           0.00000098 ***
color^4         0.019007   0.002288    8.31 < 0.0000000000000002 ***
color^5        -0.008110   0.002159   -3.76              0.00017 ***
color^6        -0.000396   0.001967   -0.20              0.84055    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.166 on 40442 degrees of freedom
Multiple R-squared:  0.973, Adjusted R-squared:  0.973 
F-statistic: 1.05e+05 on 14 and 40442 DF,  p-value: <0.0000000000000002

Here’s a scatterplot of actual v/s predicted log(price) from our final model on the testing set:

The points lie close to the 45 degress line. However, on the high end, there are many outliers where actual and predicted values have very high variance. Nevertheless, this is as good as it gets.

Nitin Gupta
Founder

Quantitative Data Technologies

Related