<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Case Studies | Nitin Gupta</title>
    <link>https://www.nitingupta.com/casestudies/</link>
      <atom:link href="https://www.nitingupta.com/casestudies/index.xml" rel="self" type="application/rss+xml" />
    <description>Case Studies</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© Nitin Gupta. All Rights Reserved.</copyright><lastBuildDate>Mon, 26 Dec 2016 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.nitingupta.com/img/icon-192.png</url>
      <title>Case Studies</title>
      <link>https://www.nitingupta.com/casestudies/</link>
    </image>
    
    <item>
      <title>Ames Housing - Part 2 - Building Models</title>
      <link>https://www.nitingupta.com/casestudies/ames-housing-part2-models/</link>
      <pubDate>Mon, 26 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/ames-housing-part2-models/</guid>
      <description>


&lt;p&gt;In a &lt;a href=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/&#34;&gt;previous post&lt;/a&gt; in this series, we did an exploratory data analysis of the &lt;a href=&#34;http://www.amstat.org/publications/jse/v19n3/decock.pdf&#34;&gt;Ames Housing dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we will build linear and non-linear models and see how well they predict the &lt;code&gt;SalePrice&lt;/code&gt; of properties.&lt;/p&gt;
&lt;div id=&#34;evaluation-criteria&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Evaluation Criteria&lt;/h2&gt;
&lt;p&gt;Root mean squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed &lt;code&gt;SalePrice&lt;/code&gt; will be our evaluation criterion. Taking logs ensures that errors in predicting expensive and cheap houses affect the result equally.&lt;/p&gt;
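&lt;p&gt;As a minimal base-R sketch of the metric (the numbers here are made up, not taken from the dataset):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# RMSE between log(predicted) and log(actual)
actual    = c(100000, 200000, 400000)
predicted = c(110000, 190000, 380000)
rmse_log  = sqrt(mean((log(predicted) - log(actual))^2))
# A given percentage error contributes the same on the log scale,
# whether the house is cheap or expensive&lt;/code&gt;&lt;/pre&gt;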
&lt;/div&gt;
&lt;div id=&#34;steps-for-building-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Steps for Building Models&lt;/h2&gt;
&lt;p&gt;Here are the steps for building models and determining the best hyperparameter combinations by K-fold cross validation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the training dataset into model training and validation sets. Use stratified sampling so that each partition has a similar distribution of the target variable, &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Define linear and non-linear models.&lt;/li&gt;
&lt;li&gt;For each model, create a grid of candidate hyperparameter combinations.&lt;/li&gt;
&lt;li&gt;For each hyperparameter combination, fit a model on the training set and make predictions on the validation set. Repeat the process for all folds.&lt;/li&gt;
&lt;li&gt;Determine root mean squared errors (RMSE) and choose the best hyperparameter combination that corresponds to the minimum RMSE.&lt;/li&gt;
&lt;li&gt;Train each model with its best hyperparameter combination on the entire training set.&lt;/li&gt;
&lt;li&gt;Calculate the RMSE of each finalized model on the testing set.&lt;/li&gt;
&lt;li&gt;Finally, choose the model with the lowest test RMSE.&lt;/li&gt;
&lt;/ul&gt;
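&lt;p&gt;The steps above can be sketched in base R. This is a simplified stand-in, assuming synthetic data, a single hypothetical hyperparameter (polynomial degree) and &lt;code&gt;lm&lt;/code&gt; in place of the models tuned below:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
n  = 200
df = data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y = 3 * df$x1 - 2 * df$x2 + rnorm(n, sd = 0.5)

folds = sample(rep(1:4, length.out = n))   # assign each row to one of 4 folds
grid  = 1:3                                # hypothetical hyperparameter grid

cv_rmse = sapply(grid, function(d) {
  mean(sapply(1:4, function(k) {
    fit  = lm(y ~ poly(x1, d) + x2, data = df[folds != k, ])
    pred = predict(fit, newdata = df[folds == k, ])
    sqrt(mean((df$y[folds == k] - pred)^2))  # RMSE on the held-out fold
  }))
})
best = grid[which.min(cv_rmse)]                     # best hyperparameter value
final_fit = lm(y ~ poly(x1, best) + x2, data = df)  # refit on all training data&lt;/code&gt;&lt;/pre&gt;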
&lt;/div&gt;
&lt;div id=&#34;partitioning-training-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Partitioning Training Data&lt;/h2&gt;
&lt;p&gt;We split the training data into 4 folds. Within each fold, 75% of the data is used for training models and 25% for validating the predicted values against the actual values.&lt;/p&gt;
&lt;p&gt;Let’s look at the distribution of the target variable across all folds:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_target_partitioning-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;By using stratified sampling, we ensure that the training and validation distributions of the target variable are similar.&lt;/p&gt;
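&lt;p&gt;A minimal base-R sketch of the idea, assuming a hypothetical log-normal target in place of &lt;code&gt;SalePrice&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
sale_price = exp(rnorm(1000, mean = 12, sd = 0.4))  # hypothetical target

# Bin the target into quartiles, then sample 75% within each bin
strata    = cut(sale_price, quantile(sale_price, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE)
train_idx = unlist(lapply(split(seq_along(sale_price), strata),
                          function(i) sample(i, floor(0.75 * length(i)))))
valid_idx = setdiff(seq_along(sale_price), train_idx)
# Every quartile of the target is now represented 75/25 in train/validation&lt;/code&gt;&lt;/pre&gt;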
&lt;/div&gt;
&lt;div id=&#34;linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Linear Models&lt;/h2&gt;
&lt;div id=&#34;ordinary-least-squares-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Ordinary Least Squares Regression&lt;/h3&gt;
&lt;p&gt;Before creating any new features or turning to more complex modelling methods, we will cross validate a simple linear model on the training data to establish a benchmark. If more complex approaches do not significantly improve the validation metrics, they are not worth pursuing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Linear Regression Model Specification (regression)

Computational engine: lm &lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After training a linear model on all predictors, we get an RMSE of &lt;strong&gt;0.1468&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This is the simplest and fastest model with no hyperparameters to tune.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;regularized-linear-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Regularized Linear Model&lt;/h3&gt;
&lt;p&gt;We will use &lt;code&gt;glmnet&lt;/code&gt;, which fits an elastic net: a blend of LASSO (L1) and Ridge (L2) regularization. We will grid search the following hyperparameters to minimize RMSE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;penalty&lt;/code&gt;: The total amount of regularization in the model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mixture&lt;/code&gt;: The proportion of L1 regularization in the model.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 3
    penalty mixture mean_rmse
      &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 4.83e- 3  0.922      0.127
 2 3.79e- 2  0.0518     0.129
 3 1.36e- 3  0.659      0.132
 4 1.60e- 3  0.431      0.133
 5 3.50e- 3  0.177      0.133
 6 4.17e- 2  0.288      0.133
 7 5.67e- 4  0.970      0.133
 8 6.79e- 9  0.0193     0.138
 9 4.32e-10  0.337      0.138
10 1.95e- 6  0.991      0.138&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-1&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After hyperparameter tuning with cross validation, &lt;code&gt;glmnet&lt;/code&gt; gives the best RMSE of 0.127 with penalty = 0.0048 and mixture = 0.9216.&lt;/li&gt;
&lt;li&gt;It is a significant improvement over Ordinary Least Squares regression, which had an RMSE of 0.1468.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glmnet&lt;/code&gt; cross validation takes under a minute to execute.&lt;/li&gt;
&lt;li&gt;But the presence of outliers can significantly affect its performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is a plot of the &lt;code&gt;glmnet&lt;/code&gt; hyperparameter grid along with the best hyperparameter combination:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_glmnet-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;non-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Non-linear Models&lt;/h2&gt;
&lt;p&gt;Next, we will train a couple of tree-based algorithms, which are not very sensitive to outliers and skewed data.&lt;/p&gt;
&lt;div id=&#34;randomforest&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;em&gt;randomForest&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;In each ensemble, we have 1000 trees and do a grid search of the following hyperparameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mtry&lt;/code&gt;: The number of predictors to randomly sample at each split.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min_n&lt;/code&gt;: The minimum number of data points in a node required to further split the node.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Computational engine: randomForest &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 3
   min_n  mtry mean_rmse
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
 1     4    85     0.134
 2     3   140     0.135
 3    14    90     0.135
 4     6    45     0.136
 5     9   138     0.136
 6    13   158     0.137
 7     9   183     0.137
 8    19    56     0.138
 9    21   130     0.138
10     5   218     0.138&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-2&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After cross validation, we get the best RMSE of 0.134 with mtry = 85 and min_n = 4.&lt;/li&gt;
&lt;li&gt;This is no improvement in RMSE over &lt;code&gt;glmnet&lt;/code&gt;, and &lt;code&gt;randomForest&lt;/code&gt; cross validation takes much longer to execute.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is a plot of the &lt;code&gt;randomForest&lt;/code&gt; hyperparameter grid along with the best hyperparameter combination:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_randomForest-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;xgboost&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;em&gt;xgboost&lt;/em&gt;&lt;/h3&gt;
&lt;p&gt;In each ensemble we have 1000 trees and do a grid search of the following hyperparameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;min_n&lt;/code&gt;: The minimum number of data points in a node required to further split the node.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tree_depth&lt;/code&gt;: The maximum depth or the number of splits of the tree.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;learn_rate&lt;/code&gt;: The rate at which the boosting algorithm adapts from one iteration to another.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 1000
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()

Engine-Specific Arguments:
  objective = reg:squarederror

Computational engine: xgboost &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at the top 10 RMSE values and hyperparameter combinations:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 4
   min_n tree_depth learn_rate mean_rmse
   &amp;lt;int&amp;gt;      &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1    13          3  0.0309        0.124
 2    40          4  0.0350        0.126
 3     6          8  0.0469        0.126
 4    34         15  0.0172        0.127
 5    28         10  0.0336        0.128
 6    20         14  0.00348       0.389
 7    22          7  0.000953      4.46 
 8     3          2  0.000528      6.81 
 9    10         12  0.000401      7.73 
10    34          3  0.0000802    10.6  &lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-3&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After cross validation, we get the best RMSE of 0.124 with min_n = 13, tree_depth = 3 and learn_rate = 0.0309.&lt;/li&gt;
&lt;li&gt;This is the best RMSE of the three models, ahead of both &lt;code&gt;glmnet&lt;/code&gt; and &lt;code&gt;randomForest&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;However, &lt;code&gt;xgboost&lt;/code&gt; cross validation takes longer to execute than that of &lt;code&gt;glmnet&lt;/code&gt;, though it is faster than that of &lt;code&gt;randomForest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;finalizing-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finalizing Models&lt;/h2&gt;
&lt;p&gt;For each model, we found the combination of hyperparameters that minimizes RMSE. Using those hyperparameters, we can now train the same models on the entire training dataset. Finally, we can use the trained models to predict log(SalePrice) on the entire training set and compare actual vs. predicted values.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_train-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-4&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Both &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; models do a fantastic job of predicting log(SalePrice) with the tuned parameters, as the predictions lie close to the straight line drawn at 45 degrees.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;glmnet&lt;/code&gt; model shows a couple of outliers with Ids &lt;strong&gt;524&lt;/strong&gt; and &lt;strong&gt;1299&lt;/strong&gt; whose predicted values are far in excess of their actual values. Even properties whose &lt;code&gt;SalePrice&lt;/code&gt; is at the lower end show a wide dispersion in predicted values.&lt;/li&gt;
&lt;li&gt;But the true performance can only be measured on unseen testing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;performance-on-test-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Performance on Test Data&lt;/h2&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/plot_test-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 3
  model        test_rmse cv_rmse
  &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 glmnet           0.129   0.127
2 randomForest     0.139   0.134
3 xgboost          0.128   0.124&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;whats-notable-5&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Each model’s RMSE on the unseen testing set is close to its cross validated RMSE, which shows that the cross validation process and hyperparameter tuning worked well.&lt;/li&gt;
&lt;li&gt;Records with Ids &lt;strong&gt;1537&lt;/strong&gt; and &lt;strong&gt;2217&lt;/strong&gt; are outliers, as none of the models are able to predict close to actual values.&lt;/li&gt;
&lt;li&gt;Looking at the test RMSE, we can finalize &lt;code&gt;xgboost&lt;/code&gt; as the model that generalizes best on this dataset.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;feature-importance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Feature Importance&lt;/h2&gt;
&lt;p&gt;Even though &lt;code&gt;xgboost&lt;/code&gt; is not as easily interpretable as a linear model, we could use variable importance plots to determine the most important features selected by the model.&lt;/p&gt;
&lt;p&gt;Let’s take a look at the top 10 most important features of our finalized &lt;code&gt;xgboost&lt;/code&gt; model:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/feature_importance-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Correlations of numerical features are plotted side-by-side. All features have a correlation of 0.5 or more with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;All of the top 10 features make sense. To evaluate &lt;code&gt;SalePrice&lt;/code&gt;, a buyer would definitely look at total square footage, overall quality, neighborhood, number of bathrooms, kitchen quality, age of property, etc.&lt;/li&gt;
&lt;li&gt;This shows that our finalized model generalizes well and makes very reasonable choices in terms of features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;new-property-premium&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;New Property Premium&lt;/h2&gt;
&lt;p&gt;Among the top 10 features by importance in our final model, most features, like square footage, neighborhood and number of bathrooms, remain the same throughout the life of the property. The quality and condition of a property do change, but their evaluation is mostly subjective. The only other feature that indisputably changes over time is &lt;code&gt;PropertyAge&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So, how would the predicted &lt;code&gt;SalePrice&lt;/code&gt; of a newly constructed property compare with that of the same property had it been built up to 30 or more years earlier?&lt;/p&gt;
&lt;p&gt;We could pick a couple of properties at random, change &lt;code&gt;PropertyAge&lt;/code&gt; and see its impact on &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
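&lt;p&gt;Such an experiment can be sketched in base R. Here a hypothetical linear fit on synthetic data stands in for our tuned model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set.seed(42)
n  = 500
df = data.frame(sqft = runif(n, 800, 3000), age = runif(n, 0, 60))
df$log_price = 10 + 0.0004 * df$sqft - 0.004 * df$age + rnorm(n, sd = 0.1)
fit = lm(log_price ~ sqft + age, data = df)

prop = df[1, ]                    # pick one property
preds = sapply(0:30, function(a) {
  prop$age = a                    # vary only the property age
  predict(fit, newdata = prop)
})
# Premium of a brand-new build over the same property at 30 years old
premium = exp(preds[1]) / exp(preds[31]) - 1&lt;/code&gt;&lt;/pre&gt;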
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part2-models/index_files/figure-html/property_appreciation-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see there’s a small premium for a newly constructed property vs. an older property of the same build, quality and condition. This premium isn’t large in a place like Ames, IA, but we’d expect it to be much higher in a large metropolitan city.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ames Housing - Part 1 - Exploratory Data Analysis</title>
      <link>https://www.nitingupta.com/casestudies/ames-housing-part1-eda/</link>
      <pubDate>Sun, 25 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/ames-housing-part1-eda/</guid>
      <description>


&lt;p&gt;In this case study, we will use the &lt;a href=&#34;http://www.amstat.org/publications/jse/v19n3/decock.pdf&#34;&gt;Ames Housing dataset&lt;/a&gt; to explore regression techniques and predict the sale price of houses.&lt;/p&gt;
&lt;div id=&#34;data-summaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Summaries&lt;/h2&gt;
&lt;p&gt;The Ames Housing dataset contains the sale prices of properties in Ames, Iowa along with 80 other features. Each property has an &lt;strong&gt;Id&lt;/strong&gt; associated with it. Here are the dimensions of the training and testing sets respectively:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the training set&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1460   81&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the testing set&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1459   81&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s combine training and testing into a single dataset and take a look at the count of missing values:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/missing_values-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The combined dataset has 2919 property records.&lt;/li&gt;
&lt;li&gt;Very few properties have a pool, fence or an alley access to the property.&lt;/li&gt;
&lt;li&gt;Very few properties have a miscellaneous feature that has not been covered by other features.&lt;/li&gt;
&lt;li&gt;More than a dozen features have at least 1 missing value. Since we have a tiny dataset, we will try to impute the missing values.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data-cleaning-transformation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Cleaning &amp;amp; Transformation&lt;/h2&gt;
&lt;p&gt;We will visualize features of the complete dataset and create a data cleaning pipeline.&lt;/p&gt;
&lt;div id=&#34;fixing-data-errors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Fixing Data Errors&lt;/h3&gt;
&lt;p&gt;First, a few data integrity checks need to be done to ensure the quality of the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;YearRemodAdd&lt;/code&gt; should not be earlier than &lt;code&gt;YearBuilt&lt;/code&gt;: 1 record to be fixed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;YrSold&lt;/code&gt; should not be earlier than &lt;code&gt;YearRemodAdd&lt;/code&gt;: 3 records to be fixed&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id YearBuilt YearRemodAdd YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1  1877      2002         2001   2009&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 4
     Id YearBuilt YearRemodAdd YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1   524      2007         2008   2007
2  2296      2007         2008   2007
3  2550      2008         2009   2007&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageYrBlt&lt;/code&gt; should not be earlier than &lt;code&gt;YearBuilt&lt;/code&gt;: 18 records to be fixed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageYrBlt&lt;/code&gt; should not be later than &lt;code&gt;YrSold&lt;/code&gt;: 1 record to be fixed&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 18 x 4
      Id YearBuilt GarageYrBlt YrSold
   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
 1    30      1927        1920   2008
 2    94      1910        1900   2007
 3   325      1967        1961   2010
 4   601      2005        2003   2006
 5   737      1950        1949   2006
 6  1104      1959        1954   2006
 7  1377      1930        1925   2008
 8  1415      1923        1922   2008
 9  1419      1963        1962   2008
10  1522      1959        1956   2010
11  1577      2010        2009   2010
12  1806      1935        1920   2009
13  1841      1978        1960   2009
14  1896      1941        1940   2009
15  1898      1935        1926   2009
16  2123      1945        1925   2008
17  2264      2006        2005   2007
18  2510      2006        2005   2007&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id YearBuilt GarageYrBlt YrSold
  &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
1  2593      2006        2207   2007&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;imputing-missing-values-new-features&#34; class=&#34;section level3 tabset tabset-fade tabset-pills&#34;&gt;
&lt;h3&gt;Imputing Missing Values &amp;amp; New Features&lt;/h3&gt;
&lt;!-- #### &lt;span style=&#34;color:red&#34;&gt;Basement Features&lt;/span&gt; --&gt;
&lt;div id=&#34;basement-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Basement Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;There is one property (&lt;code&gt;Id&lt;/code&gt; = 2121) where all the basement features are NA. &lt;code&gt;TotalBsmtSF&lt;/code&gt; is replaced by 0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now there are 79 properties which have no basement (&lt;code&gt;TotalBsmtSF&lt;/code&gt; = 0). All other basement features having NA values are changed to &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since qualitative features do not have the same distribution across neighborhoods, any remaining &lt;strong&gt;NA&lt;/strong&gt; values are imputed to be the most common value in that &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 13
     Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
1  2121 BrkSide      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                 NA &amp;lt;NA&amp;gt;                 NA        NA          NA           NA           NA&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 79 x 13
      Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
 1    18 Sawyer       &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 2    40 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 3    91 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 4   103 SawyerW      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 5   157 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 6   183 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 7   260 OldTown      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 8   343 NAmes        &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
 9   363 Edwards      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
10   372 ClearCr      &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;     &amp;lt;NA&amp;gt;         &amp;lt;NA&amp;gt;                  0 &amp;lt;NA&amp;gt;                  0         0           0            0            0
# ... with 69 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 9 x 13
     Id Neighborhood BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
1   333 NridgHt      Gd       TA       No           GLQ                1124 &amp;lt;NA&amp;gt;                479      1603        3206            1            0
2   949 CollgCr      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0       936         936            0            0
3  1488 Somerst      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0      1595        1595            0            0
4  2041 Veenker      Gd       &amp;lt;NA&amp;gt;     Mn           GLQ                1044 Rec                 382         0        1426            1            0
5  2186 Edwards      TA       &amp;lt;NA&amp;gt;     No           BLQ                1033 Unf                   0        94        1127            0            1
6  2218 IDOTRR       &amp;lt;NA&amp;gt;     Fa       No           Unf                   0 Unf                   0       173         173            0            0
7  2219 IDOTRR       &amp;lt;NA&amp;gt;     TA       No           Unf                   0 Unf                   0       356         356            0            0
8  2349 Somerst      Gd       TA       &amp;lt;NA&amp;gt;         Unf                   0 Unf                   0       725         725            0            0
9  2525 CollgCr      TA       &amp;lt;NA&amp;gt;     Av           ALQ                 755 Unf                   0       240         995            0            0&lt;/code&gt;&lt;/pre&gt;
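&lt;p&gt;Step 3 above can be sketched in base R as follows (the column names follow the post, but the data frame is a toy example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;df = data.frame(
  Neighborhood = c('A', 'A', 'A', 'B', 'B'),
  BsmtQual     = c('Gd', 'Gd', NA, 'TA', NA),
  stringsAsFactors = FALSE
)
mode_of = function(x) names(which.max(table(x)))  # most frequent non-NA value
df$BsmtQual = ave(df$BsmtQual, df$Neighborhood,
                  FUN = function(x) ifelse(is.na(x), mode_of(x), x))
# NA values are now filled with the modal value of their Neighborhood&lt;/code&gt;&lt;/pre&gt;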
&lt;p&gt;Histograms of numerical basement features and their correlations with &lt;code&gt;SalePrice&lt;/code&gt; are plotted below.&lt;/p&gt;
&lt;p&gt;It can be verified that &lt;code&gt;TotalBsmtSF&lt;/code&gt; = &lt;code&gt;BsmtFinSF1&lt;/code&gt; + &lt;code&gt;BsmtFinSF2&lt;/code&gt; + &lt;code&gt;BsmtUnfSF&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Additionally, new features are generated where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;BsmtBath&lt;/code&gt; = &lt;code&gt;BsmtFullBath&lt;/code&gt; + 0.5 * &lt;code&gt;BsmtHalfBath&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HasBsmt&lt;/code&gt; = &lt;code&gt;TotalBsmtSF&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;/ul&gt;
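&lt;p&gt;In base R, the derived features above amount to (toy values, column names from the post):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;df = data.frame(BsmtFullBath = c(1, 0, 2), BsmtHalfBath = c(1, 0, 0),
                TotalBsmtSF  = c(900, 0, 1200))
df$BsmtBath = df$BsmtFullBath + 0.5 * df$BsmtHalfBath  # half baths count as 0.5
df$HasBsmt  = df$TotalBsmtSF &amp;gt; 0                        # TRUE if any basement area&lt;/code&gt;&lt;/pre&gt;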
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_num_bsmt-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most properties have a basement.&lt;/li&gt;
&lt;li&gt;Column plots show that &lt;code&gt;BsmtFinType2&lt;/code&gt; and &lt;code&gt;BsmtCond&lt;/code&gt; values are dominated by a single category.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_chr_bsmt-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bathroom-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Bathroom Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated to determine the total number of bathrooms: &lt;code&gt;TotalBath&lt;/code&gt; = &lt;code&gt;FullBath&lt;/code&gt; + &lt;code&gt;HalfBath&lt;/code&gt; + &lt;code&gt;BsmtBath&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_bath-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fireplace-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Fireplace Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;There are 1420 properties that have no fireplaces. &lt;code&gt;FireplaceQu&lt;/code&gt; is changed to &lt;strong&gt;None&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1,420 x 4
      Id Neighborhood Fireplaces FireplaceQu
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      
 1     1 CollgCr               0 &amp;lt;NA&amp;gt;       
 2     6 Mitchel               0 &amp;lt;NA&amp;gt;       
 3    11 Sawyer                0 &amp;lt;NA&amp;gt;       
 4    13 Sawyer                0 &amp;lt;NA&amp;gt;       
 5    16 BrkSide               0 &amp;lt;NA&amp;gt;       
 6    18 Sawyer                0 &amp;lt;NA&amp;gt;       
 7    19 SawyerW               0 &amp;lt;NA&amp;gt;       
 8    20 NAmes                 0 &amp;lt;NA&amp;gt;       
 9    27 NAmes                 0 &amp;lt;NA&amp;gt;       
10    30 BrkSide               0 &amp;lt;NA&amp;gt;       
# ... with 1,410 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasFireplace&lt;/code&gt; = &lt;code&gt;Fireplaces&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;A significant number of properties have fireplaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_fireplace-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;garage-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Garage Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Where &lt;code&gt;GarageYrBlt&lt;/code&gt; is NA, it is set to &lt;code&gt;YearBuilt&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are 157 properties with no garage. In these records, &lt;code&gt;GarageType&lt;/code&gt;, &lt;code&gt;GarageFinish&lt;/code&gt;, &lt;code&gt;GarageQual&lt;/code&gt; and &lt;code&gt;GarageCond&lt;/code&gt; are recorded as &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since qualitative features do not have the same distribution across neighborhoods, any remaining &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value (or the median, for numeric features) within the &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 157 x 9
      Id Neighborhood GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;     
 1    40 Edwards      &amp;lt;NA&amp;gt;              1955 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 2    49 OldTown      &amp;lt;NA&amp;gt;              1920 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 3    79 Sawyer       &amp;lt;NA&amp;gt;              1968 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 4    89 IDOTRR       &amp;lt;NA&amp;gt;              1915 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 5    90 CollgCr      &amp;lt;NA&amp;gt;              1994 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 6   100 NAmes        &amp;lt;NA&amp;gt;              1959 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 7   109 IDOTRR       &amp;lt;NA&amp;gt;              1919 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 8   126 IDOTRR       &amp;lt;NA&amp;gt;              1935 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
 9   128 OldTown      &amp;lt;NA&amp;gt;              1930 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
10   141 NAmes        &amp;lt;NA&amp;gt;              1971 &amp;lt;NA&amp;gt;                  0          0 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
# ... with 147 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 2 x 9
     Id Neighborhood GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;     
1  2127 OldTown      Detchd            1910 &amp;lt;NA&amp;gt;                  1        360 &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      
2  2577 IDOTRR       Detchd            1923 &amp;lt;NA&amp;gt;                 NA         NA &amp;lt;NA&amp;gt;       &amp;lt;NA&amp;gt;      &lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageArea&lt;/code&gt; and &lt;code&gt;GarageCars&lt;/code&gt; have very similar correlations with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasGarage&lt;/code&gt; = &lt;code&gt;GarageArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;Most properties have a garage.&lt;/li&gt;
&lt;li&gt;Column plots show that &lt;code&gt;GarageQual&lt;/code&gt; and &lt;code&gt;GarageCond&lt;/code&gt; values are dominated by a single category.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_garage-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
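&lt;p&gt;The group-wise imputation in step 3 can be sketched in pandas (the post’s code is R/tidyverse; the records here are hypothetical):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical garage records with missing quality values
df = pd.DataFrame({
    "Neighborhood": ["OldTown", "OldTown", "OldTown", "IDOTRR", "IDOTRR"],
    "GarageType":   ["Detchd"] * 5,
    "GarageQual":   ["TA", "Fa", "TA", None, "TA"],
})

# Fill each NA with the most common value in its Neighborhood x GarageType group
def fill_mode(s):
    return s.fillna(s.mode().iloc[0]) if s.notna().any() else s

df["GarageQual"] = (
    df.groupby(["Neighborhood", "GarageType"])["GarageQual"].transform(fill_mode)
)
```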
&lt;/div&gt;
&lt;div id=&#34;masonry-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Masonry Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;There is one property (&lt;code&gt;Id&lt;/code&gt; = 2611) where &lt;code&gt;MasVnrArea&lt;/code&gt; = 198 but &lt;code&gt;MasVnrType&lt;/code&gt; = NA. Impute &lt;code&gt;MasVnrType&lt;/code&gt; with the most common value among properties in the neighborhood where &lt;code&gt;MasVnrArea&lt;/code&gt; &amp;gt; 0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;MasVnrType&lt;/code&gt; to be the most common values by &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;YearRemodAdd&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;MasVnrArea&lt;/code&gt; to be the median values by &lt;code&gt;Neighborhood&lt;/code&gt; and &lt;code&gt;MasVnrType&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
     Id Neighborhood MasVnrType MasVnrArea
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
1  2611 Mitchel      &amp;lt;NA&amp;gt;              198&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 23 x 4
      Id Neighborhood MasVnrType MasVnrArea
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1   235 Gilbert      &amp;lt;NA&amp;gt;               NA
 2   530 Crawfor      &amp;lt;NA&amp;gt;               NA
 3   651 Somerst      &amp;lt;NA&amp;gt;               NA
 4   937 SawyerW      &amp;lt;NA&amp;gt;               NA
 5   974 Somerst      &amp;lt;NA&amp;gt;               NA
 6   978 Somerst      &amp;lt;NA&amp;gt;               NA
 7  1244 NridgHt      &amp;lt;NA&amp;gt;               NA
 8  1279 CollgCr      &amp;lt;NA&amp;gt;               NA
 9  1692 Gilbert      &amp;lt;NA&amp;gt;               NA
10  1707 Somerst      &amp;lt;NA&amp;gt;               NA
# ... with 13 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 23 x 4
      Id Neighborhood MasVnrType MasVnrArea
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1   235 Gilbert      None               NA
 2   530 Crawfor      None               NA
 3   651 Somerst      None               NA
 4   937 SawyerW      None               NA
 5   974 Somerst      Stone              NA
 6   978 Somerst      None               NA
 7  1244 NridgHt      Stone              NA
 8  1279 CollgCr      BrkFace            NA
 9  1692 Gilbert      None               NA
10  1707 Somerst      Stone              NA
# ... with 13 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasMasVnr&lt;/code&gt; = &lt;code&gt;MasVnrArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;A significant number of properties have masonry veneer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_masvnr-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;pool-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Pool Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Change values in &lt;code&gt;PoolQC&lt;/code&gt; to &lt;strong&gt;None&lt;/strong&gt; if the property has no pool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Impute remaining &lt;strong&gt;NA&lt;/strong&gt; values in &lt;code&gt;PoolQC&lt;/code&gt; with the most common value among properties in the &lt;code&gt;Neighborhood&lt;/code&gt; that have a pool.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 2,906 x 4
      Id Neighborhood PoolArea PoolQC
   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; 
 1     1 CollgCr             0 &amp;lt;NA&amp;gt;  
 2     2 Veenker             0 &amp;lt;NA&amp;gt;  
 3     3 CollgCr             0 &amp;lt;NA&amp;gt;  
 4     4 Crawfor             0 &amp;lt;NA&amp;gt;  
 5     5 NoRidge             0 &amp;lt;NA&amp;gt;  
 6     6 Mitchel             0 &amp;lt;NA&amp;gt;  
 7     7 Somerst             0 &amp;lt;NA&amp;gt;  
 8     8 NWAmes              0 &amp;lt;NA&amp;gt;  
 9     9 OldTown             0 &amp;lt;NA&amp;gt;  
10    10 BrkSide             0 &amp;lt;NA&amp;gt;  
# ... with 2,896 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 4
     Id Neighborhood PoolArea PoolQC
  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; 
1  2421 NAmes             368 &amp;lt;NA&amp;gt;  
2  2504 SawyerW           444 &amp;lt;NA&amp;gt;  
3  2600 Mitchel           561 &amp;lt;NA&amp;gt;  &lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is generated where: &lt;code&gt;HasPool&lt;/code&gt; = &lt;code&gt;PoolArea&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;li&gt;Most properties do not have a pool.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_pool-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;porch-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Porch Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;New features are generated for:
&lt;ul&gt;
&lt;li&gt;Total porch area: &lt;code&gt;PorchSF&lt;/code&gt; = &lt;code&gt;OpenPorchSF&lt;/code&gt; + &lt;code&gt;EnclosedPorch&lt;/code&gt; + &lt;code&gt;3SsnPorch&lt;/code&gt; + &lt;code&gt;ScreenPorch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether property has a porch: &lt;code&gt;HasPorch&lt;/code&gt; = &lt;code&gt;PorchSF&lt;/code&gt; &amp;gt; 0&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_porch-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;built-area-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Built Area Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;A new feature is added to determine the total square footage of built area: &lt;code&gt;TotalSF&lt;/code&gt; = &lt;code&gt;GrLivArea&lt;/code&gt; + &lt;code&gt;TotalBsmtSF&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_area-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;construction-year-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Construction Year Features&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;New features are generated for:
&lt;ul&gt;
&lt;li&gt;Vintage of year built: &lt;strong&gt;1945 or earlier, 1946-1999, 2000 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Age of the property from its construction (or last remodel) to the time it was sold: &lt;code&gt;PropertyAge&lt;/code&gt; = &lt;code&gt;YrSold&lt;/code&gt; - &lt;code&gt;YearRemodAdd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Indicate if the property is new or newly renovated: &lt;code&gt;IsNew&lt;/code&gt; = &lt;code&gt;YearRemodAdd&lt;/code&gt; == &lt;code&gt;YrSold&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Indicate if the property has been remodelled: &lt;code&gt;IsRemodAdd&lt;/code&gt; = &lt;code&gt;YearRemodAdd&lt;/code&gt; &amp;gt; &lt;code&gt;YearBuilt&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
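&lt;p&gt;These derived features can be sketched in pandas (hypothetical rows; the post’s code is in R):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical records with the construction-year columns
df = pd.DataFrame({
    "YearBuilt":    [1940, 1995, 2008],
    "YearRemodAdd": [1970, 1995, 2008],
    "YrSold":       [2008, 2007, 2008],
})

# Vintage buckets: 1945 or earlier, 1946-1999, 2000 or later
df["Vintage"] = pd.cut(
    df["YearBuilt"],
    bins=[-float("inf"), 1945, 1999, float("inf")],
    labels=["1945 or earlier", "1946-1999", "2000 or later"],
)
df["PropertyAge"] = df["YrSold"] - df["YearRemodAdd"]
df["IsNew"] = df["YearRemodAdd"] == df["YrSold"]
df["IsRemodAdd"] = df["YearRemodAdd"] > df["YearBuilt"]
```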
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_construction_years-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;neighborhood-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Neighborhood Features&lt;/h4&gt;
&lt;p&gt;Type of Neighborhood: There are 25 neighborhoods in the dataset.
As the saying goes, real estate is all about location, location, location. Clearly, some neighborhoods command higher prices than others.&lt;/p&gt;
&lt;p&gt;Neighborhoods can be grouped into fewer categories by ranking them on their median &lt;code&gt;SalePrice&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Type1&lt;/strong&gt;: StoneBr, NridgHt, NoRidge&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type2&lt;/strong&gt;: Veenker, Timber, Somerst&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type3&lt;/strong&gt;: Crawfor, CollgCr, ClearCr, Blmngtn, Gilbert, NWAmes, SawyerW&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type4&lt;/strong&gt;: Mitchel, NPkVill, NAmes, SWISU, Sawyer, Blueste, BrkSide, Edwards, OldTown&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type5&lt;/strong&gt;: IDOTRR, BrDale, MeadowV&lt;/li&gt;
&lt;/ul&gt;
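&lt;p&gt;One way to derive such a grouping, sketched in pandas with hypothetical prices (three tiers here for brevity; the post uses five):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sales; rank neighborhoods by median SalePrice
df = pd.DataFrame({
    "Neighborhood": ["StoneBr", "StoneBr", "MeadowV", "MeadowV", "Sawyer"],
    "SalePrice":    [350000, 400000, 90000, 100000, 135000],
})

medians = df.groupby("Neighborhood")["SalePrice"].median()
# Bucket the medians into price tiers (Type1 = highest)
tiers = pd.qcut(medians, q=3, labels=["Type3", "Type2", "Type1"])
```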
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_neighborhood-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-missing-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Other Missing Features&lt;/h4&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;MiscFeature&lt;/code&gt;, &lt;code&gt;Alley&lt;/code&gt; and &lt;code&gt;Fence&lt;/code&gt; &lt;strong&gt;NA&lt;/strong&gt; values are recoded as &lt;strong&gt;None&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;Utilities&lt;/code&gt;, &lt;code&gt;Functional&lt;/code&gt; and &lt;code&gt;SaleType&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value of each feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;LotFrontage&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the median value in the &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In &lt;code&gt;MSZoning&lt;/code&gt;, &lt;code&gt;KitchenQual&lt;/code&gt;, &lt;code&gt;Exterior1st&lt;/code&gt;, &lt;code&gt;Exterior2nd&lt;/code&gt; and &lt;code&gt;Electrical&lt;/code&gt;, &lt;strong&gt;NA&lt;/strong&gt; values are imputed with the most common value in the &lt;code&gt;Neighborhood&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;label-encoding&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Label Encoding&lt;/h3&gt;
&lt;p&gt;A quick look at the data description shows many features have categories that follow a specific order. These features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LotShape&lt;/code&gt;: Reg, IR1, IR2, IR3&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LandSlope&lt;/code&gt;: Gtl, Mod, Sev&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExterQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExterCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtExposure&lt;/code&gt;: Gd, Av, Mn, No, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtFinType1&lt;/code&gt;: GLQ, ALQ, BLQ, Rec, LwQ, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BsmtFinType2&lt;/code&gt;: GLQ, ALQ, BLQ, Rec, LwQ, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HeatingQC&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CentralAir&lt;/code&gt;: Y, N&lt;/li&gt;
&lt;li&gt;&lt;code&gt;KitchenQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Functional&lt;/code&gt;: Typ, Min1, Min2, Mod, Maj1, Maj2, Sev, Sal&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FireplaceQu&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageFinish&lt;/code&gt;: Fin, RFn, Unf, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageQual&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GarageCond&lt;/code&gt;: Ex, Gd, TA, Fa, Po, None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Street&lt;/code&gt;: Grvl, Pave&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PavedDrive&lt;/code&gt;: Y, P, N&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most of these features share the order &lt;strong&gt;Ex, Gd, TA, Fa, Po&lt;/strong&gt;, except that some lack &lt;strong&gt;None&lt;/strong&gt; as a category. They can all be encoded with a common ordered set of categories: &lt;strong&gt;Ex, Gd, TA, Fa, Po, None&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Some categorical features are already encoded as ordered integers. These features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;OverallQual&lt;/code&gt;: 10 to 1&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OverallCond&lt;/code&gt;: 10 to 1&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;MoSold&lt;/code&gt; is cyclical and should be recoded as a factor.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;YrSold&lt;/code&gt; has only 5 values from 2006-2010 and should also be recoded as a factor.&lt;/p&gt;
&lt;p&gt;In each categorical feature, categories with fewer than 10 observations are lumped into a single category named &lt;strong&gt;Other&lt;/strong&gt;.&lt;/p&gt;
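&lt;p&gt;The common ordered encoding and the lumping of rare categories can be sketched in pandas (hypothetical data; the post’s code is in R):&lt;/p&gt;

```python
import pandas as pd

# Common ordered quality scale; None means the feature is absent
levels = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
s = pd.Series(["Gd", "TA", None, "Ex"]).fillna("None")
codes = pd.Categorical(s, categories=levels, ordered=True).codes.tolist()

# Lump categories with fewer than 10 observations into "Other"
t = pd.Series(["A"] * 50 + ["B"] * 12 + ["C"] * 3 + ["D"] * 2)
counts = t.value_counts()
t = t.where(~t.isin(counts[counts < 10].index), "Other")
```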
&lt;/div&gt;
&lt;div id=&#34;features-to-drop&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Features to Drop&lt;/h3&gt;
&lt;div id=&#34;highly-correlated-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Highly Correlated Features&lt;/h4&gt;
&lt;p&gt;Some features can be dropped from further analysis because they are either highly correlated with another feature or have been superseded by a derived feature.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; [1] &amp;quot;BsmtFullBath&amp;quot;  &amp;quot;GarageCars&amp;quot;    &amp;quot;GarageYrBlt&amp;quot;   &amp;quot;GrLivArea&amp;quot;     &amp;quot;PoolArea&amp;quot;      &amp;quot;YearBuilt&amp;quot;     &amp;quot;YearRemodAdd&amp;quot;  &amp;quot;Neighborhood&amp;quot;  &amp;quot;OpenPorchSF&amp;quot;  
[10] &amp;quot;EnclosedPorch&amp;quot; &amp;quot;3SsnPorch&amp;quot;     &amp;quot;ScreenPorch&amp;quot;  &lt;/code&gt;&lt;/pre&gt;
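&lt;p&gt;The correlation part of this pruning can be sketched in pandas (hypothetical, deliberately collinear numbers; the post’s drop list also includes features superseded by derived ones):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical example: GarageCars is nearly collinear with GarageArea
df = pd.DataFrame({
    "GarageArea": [0, 240, 480, 720],
    "GarageCars": [0, 1, 2, 3],
    "SalePrice":  [100000, 140000, 180000, 260000],
})

# Of each highly correlated pair, keep one feature and drop the other
if abs(df["GarageArea"].corr(df["GarageCars"])) > 0.9:
    df = df.drop(columns=["GarageCars"])
```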
&lt;/div&gt;
&lt;div id=&#34;skewed-categorical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Skewed Categorical Features&lt;/h4&gt;
&lt;p&gt;Any feature where more than 95% of the records have the same category probably has no predictive value. An extreme case is &lt;code&gt;Utilities&lt;/code&gt;, which has only 2 categories in the dataset - &lt;strong&gt;AllPub&lt;/strong&gt; and &lt;strong&gt;NoSeWa&lt;/strong&gt; - and only 1 record has &lt;strong&gt;NoSeWa&lt;/strong&gt;.&lt;/p&gt;
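&lt;p&gt;Such near-constant features can be flagged programmatically; a pandas sketch with hypothetical counts:&lt;/p&gt;

```python
import pandas as pd

# Flag features where a single category covers more than 95% of records
df = pd.DataFrame({
    "Utilities": ["AllPub"] * 99 + ["NoSeWa"],
    "Street":    ["Pave"] * 60 + ["Grvl"] * 40,
})

skewed = [
    c for c in df.columns
    if df[c].value_counts(normalize=True).iloc[0] > 0.95
]
```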
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_skewed_features-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;finalized-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finalized Data&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;[1] &amp;quot;Dimensions of the finalized dataset&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 2919   73&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Excluding Id, there are 72 features in the finalized dataset.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;There are 26 numerical, 26 ordinal and 20 nominal features.&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;univariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Univariate Analysis&lt;/h2&gt;
&lt;p&gt;Let us look at each feature in the dataset in detail.&lt;/p&gt;
&lt;div id=&#34;numerical-features&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical Features&lt;/h3&gt;
&lt;p&gt;First let’s plot all the features that are measured as area in square feet:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/univariate_continuous1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-1&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;All area features have outliers.&lt;/li&gt;
&lt;li&gt;Many features are heavily skewed so they need to be normalized before fitting models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now let’s see other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/univariate_continuous2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;whats-notable-2&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Most of the properties have been built less than 20 years prior to their sale.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s plot the distribution of &lt;code&gt;SalePrice&lt;/code&gt; in log scale:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/SalePrice_distribution-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;whats-notable-3&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The distribution has long tails on both sides.&lt;/li&gt;
&lt;li&gt;There are 11 properties below USD 50,000 and 17 above USD 500,000.&lt;/li&gt;
&lt;li&gt;Linear models are very sensitive to the presence of outliers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-features&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Categorical Features&lt;/h2&gt;
&lt;div id=&#34;ordinal-features&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Ordinal Features&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Univariate_Cat1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-4&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Categorical imbalances exist in many features where 1 or 2 categories are dominant. This limits their usefulness as predictors, since the minority categories have too few observations to estimate reliable effects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;nominal-features.&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Nominal Features&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Univariate_Cat2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-5&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Categorical imbalances exist in many features where 1 or 2 categories are dominant.&lt;/li&gt;
&lt;li&gt;Most of the properties are sold during the summer months, and the fewest during the winter months.&lt;/li&gt;
&lt;li&gt;The effects of the housing market crisis are visible in the data: the fewest properties were sold in 2010.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bivariate Analysis&lt;/h2&gt;
&lt;div id=&#34;numerical-numerical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Numerical&lt;/h3&gt;
&lt;p&gt;Let’s examine the relationship of &lt;code&gt;SalePrice&lt;/code&gt; with other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/bivariate_scatterplots-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-6&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;From the scatterplot of &lt;code&gt;TotalSF&lt;/code&gt; vs. &lt;code&gt;SalePrice&lt;/code&gt;, it is clear there are &lt;em&gt;high leverage&lt;/em&gt; points where the target &lt;code&gt;SalePrice&lt;/code&gt; is unusually low relative to the area in sq. ft. These points have an outsized impact on the slope of the regression line, which would otherwise be steeper.&lt;/li&gt;
&lt;li&gt;The same set of points impact &lt;code&gt;TotalBsmtSF&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The Ids of these records are 524, 1299 and 2550; of these, 524 and 1299 are in the training set.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;correlations-with-saleprice&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Correlations with SalePrice&lt;/h3&gt;
&lt;p&gt;We isolate the features that have an absolute correlation of 0.1 or more with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
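&lt;p&gt;A sketch of this selection in pandas (hypothetical data; the post computes the correlations in R):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical numeric features; keep those with |corr| >= 0.1 against SalePrice
df = pd.DataFrame({
    "TotalSF":     [1500, 2000, 2500, 3000],
    "Noise":       [1, 2, 2, 1],
    "PropertyAge": [45, 30, 15, 0],
    "SalePrice":   [100000, 200000, 300000, 400000],
})

corrs = df.corr()["SalePrice"].drop("SalePrice")
selected = corrs[corrs.abs() >= 0.1].index.tolist()
```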
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/plot_correlated-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-7&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The top 5 features are &lt;code&gt;TotalSF&lt;/code&gt;, &lt;code&gt;GarageArea&lt;/code&gt;, &lt;code&gt;TotalBath&lt;/code&gt;, &lt;code&gt;TotalBsmtSF&lt;/code&gt; and &lt;code&gt;1stFlrSF&lt;/code&gt;. Quite reasonably, these are the features a buyer would look at to evaluate a property and its &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It is notable that &lt;code&gt;PropertyAge&lt;/code&gt; shows a strong negative correlation with &lt;code&gt;SalePrice&lt;/code&gt;: properties that were more recently built or remodelled sell for higher prices than older ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical-ordinal&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Categorical (Ordinal)&lt;/h3&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Bivariate_Num_Cat1-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-8&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;We can spot clear trends in &lt;code&gt;SalePrice&lt;/code&gt; vs. the order of the categories in almost all of these features.&lt;/li&gt;
&lt;li&gt;Overall quality and external quality show some of the strongest trends.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical-nominal&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Categorical (Nominal)&lt;/h3&gt;
&lt;p&gt;Let’s examine &lt;code&gt;SalePrice&lt;/code&gt; with respect to the nominal features in the dataset. None of these features have a natural order, but we can identify trends within categories by sorting them by median &lt;code&gt;SalePrice&lt;/code&gt;. The &lt;code&gt;SalePrice&lt;/code&gt; axis is truncated to exclude outliers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/Bivariate_Num_Cat2-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;whats-notable-9&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;What’s notable?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GarageType&lt;/code&gt;: Built-in and attached garages are preferred over detached or other types of garages.&lt;/li&gt;
&lt;li&gt;From the &lt;code&gt;MSSubClass&lt;/code&gt; categories, it is evident that houses built in 1946 or later are priced higher than older houses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;multivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multivariate Analysis&lt;/h2&gt;
&lt;p&gt;We will check variation of some related features with &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;numerical-numerical-categorical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Numerical-Numerical-Categorical&lt;/h3&gt;
&lt;p&gt;We have determined &lt;code&gt;TotalSF&lt;/code&gt; and &lt;code&gt;GarageArea&lt;/code&gt; have among the strongest correlations with &lt;code&gt;SalePrice&lt;/code&gt;. Let’s see how they vary by &lt;code&gt;NeighborhoodType&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt; respectively:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/multivariate_num_num_cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For the same total area, there are neighborhoods where &lt;code&gt;SalePrice&lt;/code&gt; is higher than others.&lt;/li&gt;
&lt;li&gt;Properties with no garage are distinctly separated.&lt;/li&gt;
&lt;li&gt;Properties with built-in or attached garages tend to have higher &lt;code&gt;SalePrice&lt;/code&gt; for the same &lt;code&gt;GarageArea&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Therefore, &lt;code&gt;NeighborhoodType&lt;/code&gt; and &lt;code&gt;GarageType&lt;/code&gt; explain some variance in &lt;code&gt;SalePrice&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-categorical-numerical&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Categorical-Categorical-Numerical&lt;/h3&gt;
&lt;p&gt;We want to see if there is any interaction of &lt;code&gt;SalePrice&lt;/code&gt; with a combination of categorical features, that could provide any additional explanatory power for prediction:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/ames-housing-part1-eda/index_files/figure-html/multivariate_cat_cat_num-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is evident that some neighborhoods have higher &lt;code&gt;OverallQual&lt;/code&gt; and therefore command higher prices. However, in Type4 neighborhoods, we can see a clear variation in &lt;code&gt;SalePrice&lt;/code&gt; by quality of property.&lt;/li&gt;
&lt;li&gt;It is less clear if &lt;code&gt;GarageType&lt;/code&gt; has a major impact by itself. Even though built-in and attached garages seem to be preferred, most of the variation can be explained by &lt;code&gt;NeighborhoodType&lt;/code&gt; itself.&lt;/li&gt;
&lt;li&gt;Low density and floating village residential properties tend to be higher priced in both single and multi-storied properties built after 1946.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 3 - A polished gem - Building Non-linear Models</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/</link>
      <pubDate>Thu, 22 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/</guid>
      <description>


&lt;div id=&#34;other-posts-in-this-series&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Other posts in this series:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/&#34;&gt;Diamonds - Part 1 - In the rough - An Exploratory Data Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/&#34;&gt;Diamonds - Part 2 - A cut above - Building Linear Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a couple of previous posts, we tried to understand which attributes of diamonds are important in determining their prices. We showed that &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; are the most important predictors of &lt;code&gt;price&lt;/code&gt;. We arrived at this conclusion after a detailed exploratory data analysis. Finally, we fit linear models to predict prices and determined the best model from the metrics.&lt;/p&gt;
&lt;p&gt;In this post, we will use non-linear regression models to predict diamond prices and compare them with those from linear models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;training-non-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Training Non-linear Models&lt;/h2&gt;
&lt;p&gt;We’ll follow some of the same steps as we did for linear models, while transforming some predictors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the dataset into training and testing sets in the proportion 75% and 25% respectively.&lt;/li&gt;
&lt;li&gt;Stratify the partitioning by &lt;code&gt;clarity&lt;/code&gt;, so both training and testing sets have the same distributions of this feature.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;cut&lt;/code&gt; have ordered categories from lowest to highest grades. The &lt;code&gt;randomForest&lt;/code&gt; method requires no change in how this data is represented before training; however, the &lt;code&gt;xgboost&lt;/code&gt; and &lt;code&gt;keras&lt;/code&gt; methods require all predictors to be in numerical form. &lt;a href=&#34;https://statmodeling.stat.columbia.edu/2009/10/06/coding_ordinal/&#34;&gt;Two methods&lt;/a&gt; could be used for transforming the categorical data:
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Use one-hot encoding to convert categorical data to sparse data with 0s and 1s. This way, each category in &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;cut&lt;/code&gt; becomes a new binary predictor. A disadvantage of this method is that it treats ordered categorical data the same as unordered categorical data, so the ordinality is lost in the transformation. However, non-linear models should be able to infer the ordinality, as our training sample is sufficiently large.&lt;/li&gt;
&lt;li&gt;Represent the ordinal categories from lowest to highest grades in integer form. However, this creates a linear gradation from one category to another, which may not be a suitable choice here.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;Center and scale all values in the training set and build a matrix of predictors.&lt;/li&gt;
&lt;li&gt;Fit a non-linear model with the training set.&lt;/li&gt;
&lt;li&gt;Make predictions on the testing set and determine model metrics.&lt;/li&gt;
&lt;li&gt;Wrap all the steps above inside a function to which the model formula and a seed can be passed; the seed randomizes the partitioning into training and testing sets.&lt;/li&gt;
&lt;li&gt;Run multiple iterations of the models with different seeds, and compute their average metrics, which should better reflect results on unseen data.&lt;/li&gt;
&lt;/ul&gt;
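&lt;p&gt;The two encoding options discussed above can be sketched as follows (Python/pandas purely for illustration, since the post’s analysis is in R; the sample values below mirror the grades of the &lt;code&gt;cut&lt;/code&gt; feature):&lt;/p&gt;

```python
# Sketch of the two encodings of an ordered categorical feature.
import pandas as pd

df = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Good"]})
grades = ["Fair", "Good", "Very Good", "Premium", "Ideal"]  # lowest to highest

# Method 1: one-hot encoding. Each observed category becomes a binary
# column, but the ordering of the grades is lost.
one_hot = pd.get_dummies(df["cut"], prefix="cut")

# Method 2: integer codes. Ordinality is kept, but the gradation
# between adjacent grades is forced to be linear.
df["cut_ord"] = pd.Categorical(df["cut"], categories=grades, ordered=True).codes
```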
&lt;p&gt;Here are the average metrics for all the models trained with &lt;code&gt;keras&lt;/code&gt;, &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; regression methods:&lt;/p&gt;
&lt;table class=&#34;gmisc_table&#34; style=&#34;border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;border-top: 2px solid grey;&#34;&gt;
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
mae
&lt;/th&gt;
&lt;th style=&#34;border-top: 2px solid grey;; border-bottom: hidden;&#34;&gt;
 
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
rmse
&lt;/th&gt;
&lt;th style=&#34;border-top: 2px solid grey;; border-bottom: hidden;&#34;&gt;
 
&lt;/th&gt;
&lt;th colspan=&#34;3&#34; style=&#34;font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;&#34;&gt;
rsq
&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
keras
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
randomForest
&lt;/th&gt;
&lt;th style=&#34;border-bottom: 1px solid grey; text-align: center;&#34;&gt;
xgboost
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ .
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
360.55
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
262.35
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
280.49
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
989.71
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
529.28
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
540.76
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.93
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
860.29
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
816.1
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
815.76
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1499.2
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1427.25
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
1427.35
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.86
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.87
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.87
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
590.32
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
548.67
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
544.48
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1040.69
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
1006.61
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
992.46
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.93
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.94
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.94
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity + color
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
358.85
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
305.17
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
306.86
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
645.4
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
571.73
&lt;/td&gt;
&lt;td style=&#34;border-right: 1px solid black; text-align: right;&#34;&gt;
575.3
&lt;/td&gt;
&lt;td style colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.97
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: left;&#34;&gt;
price ~ carat + clarity + color + cut
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
347.99
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
285.96
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: right;&#34;&gt;
282.38
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
626.78
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
545.02
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; border-right: 1px solid black; text-align: right;&#34;&gt;
541.63
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey;&#34; colspan=&#34;1&#34;&gt;
 
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;border-bottom: 2px solid grey; text-align: right;&#34;&gt;
0.98
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Looking at the r-squared terms, it is remarkable how well all the models have inferred the complex relationship between &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;carat&lt;/code&gt;. To fit linear models, we needed to log-transform &lt;code&gt;price&lt;/code&gt; and take the cube root of &lt;code&gt;carat&lt;/code&gt;; the neural network as well as the decision-tree based models do this all on their own. The root mean squared error is in $ terms, so it is easier to interpret. Considering that both the mean and the standard deviation of &lt;code&gt;price&lt;/code&gt; in the dataset are about $4000, the root mean squared errors of the models are very low.&lt;/p&gt;
&lt;p&gt;Exploratory data analysis adds value here, as the models with &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; give excellent results. Including &lt;code&gt;cut&lt;/code&gt; in the models does not provide any significant benefits and results in overfitted models.&lt;/p&gt;
&lt;p&gt;Even the base models with all predictors, &lt;strong&gt;price ~ .&lt;/strong&gt; (some of which are confounders), do a very good job of explaining the variance. Decision tree and neural network models are largely unaffected by multi-collinearity. We can use local model interpretations to determine the most important predictors from these models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;local-interpretable-model-agnostic-explanations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Local Interpretable Model-agnostic Explanations&lt;/h2&gt;
&lt;p&gt;LIME is a method for explaining black-box machine learning models, helping visualize and explain individual predictions. It assumes that every complex model is linear on a local scale, so it is possible to fit a simple model around a single observation that mimics how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.&lt;/p&gt;
&lt;p&gt;The generalized algorithm LIME applies is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Given an observation, permute it to create replicated feature data with slight value modifications.&lt;/li&gt;
&lt;li&gt;Compute similarity distance measure between original observation and permuted observations.&lt;/li&gt;
&lt;li&gt;Apply selected machine learning model to predict outcomes of permuted data.&lt;/li&gt;
&lt;li&gt;Select m features that best describe the predicted outcomes.&lt;/li&gt;
&lt;li&gt;Fit a simple model to the permuted data, explaining the complex model outcome with the m features from the permuted data, weighted by their similarity to the original observation.&lt;/li&gt;
&lt;li&gt;Use the resulting feature weights to explain local behavior.&lt;/li&gt;
&lt;/ul&gt;
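&lt;p&gt;The steps above can be sketched with a toy, from-scratch local surrogate (Python/numpy purely for illustration, since the post’s analysis is in R; the kernel width and perturbation scale below are arbitrary assumptions):&lt;/p&gt;

```python
# Minimal LIME-style explanation of one observation: perturb the point,
# weight perturbations by proximity, fit a weighted linear surrogate.
import numpy as np

def explain_locally(predict_fn, x, n_samples=5000, kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Permute the observation into many slightly modified copies.
    X_pert = x + rng.normal(scale=0.3, size=(n_samples, x.size))
    # 2. Similarity between original and permuted points (RBF kernel).
    dist = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Predictions of the complex model on the permuted data.
    y = predict_fn(X_pert)
    # 4-5. Weighted least squares: a simple local linear model.
    A = np.hstack([np.ones((n_samples, 1)), X_pert])
    W = np.sqrt(weights)[:, None]
    coefs, *_ = np.linalg.lstsq(A * W, y * W[:, 0], rcond=None)
    return coefs[1:]  # 6. feature weights explain local behaviour

# A non-linear "black box": y = x0^2 + 3*x1, explained near (1, 1).
black_box = lambda X: X[:, 0] ** 2 + 3 * X[:, 1]
w = explain_locally(black_box, np.array([1.0, 1.0]))
```

&lt;p&gt;For the quadratic feature, the local slope near the observation comes out close to the true gradient of 2, while the linear feature recovers its global coefficient of 3.&lt;/p&gt;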
&lt;p&gt;Here we will select 5 features that best describe the predicted outcomes for 6 random observations from the testing set.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_importance-1.png&#34; width=&#34;960&#34; /&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_importance-2.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The features by importance that best explain the predictions in these 6 random samples are &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_heatmap-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/plot_feature_heatmap-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We know that &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are co-linear with &lt;code&gt;carat&lt;/code&gt;, which is why it is good practice to remove any redundant features from the training data before applying any machine learning algorithm. We find the model with the best metrics turns out to be the one using &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;actual-vs-predicted&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Actual v/s Predicted&lt;/h2&gt;
&lt;p&gt;Finally, here are the scatterplots of actual v/s predicted &lt;code&gt;price&lt;/code&gt; from the best model on the testing set, using the 3 regression methods:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part3-non-linear-models/index_files/figure-html/best_model_plot-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The scatterplots are shown with both linear and logarithmic axes. Even though the results from all 3 methods have roughly similar &lt;strong&gt;r-squared&lt;/strong&gt; and &lt;strong&gt;rmse&lt;/strong&gt; values, we can see that predicted prices from keras have more dispersion at the higher end than those from the two decision-tree methods. The decision-tree based methods also appear to do a better job of predicting prices at the lower end, with less dispersion.&lt;/p&gt;
&lt;p&gt;As in the case with linear models, the variance in predicted diamond prices increases with &lt;code&gt;price&lt;/code&gt;. But unlike linear models, the non-linear models do not produce extreme outliers in predicted prices. So, not only do non-linear methods do a fantastic job in inferring the relationships between &lt;code&gt;price&lt;/code&gt; and its predictors, they also predict prices within a reasonable range.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;All 3 non-linear regression methods can infer the complex relationship between &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;carat&lt;/code&gt; and the other predictors, without the need for feature engineering.&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis is useful in removing the redundant features from the training dataset, resulting in both faster execution, as well as much better metrics.&lt;/li&gt;
&lt;li&gt;In terms of time taken to train the models, &lt;code&gt;keras&lt;/code&gt; neural network models execute the fastest by virtue of being able to use GPUs.&lt;/li&gt;
&lt;li&gt;Among the decision-tree based methods, &lt;code&gt;xgboost&lt;/code&gt; models train much faster than &lt;code&gt;randomForest&lt;/code&gt; models.&lt;/li&gt;
&lt;li&gt;Multiple CPUs can be used to run the &lt;code&gt;randomForest&lt;/code&gt; and &lt;code&gt;xgboost&lt;/code&gt; methods; RAM is the main limiting constraint when training on a local machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 2 - A cut above - Building Linear Models</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/</link>
      <pubDate>Wed, 21 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/</guid>
      <description>


&lt;p&gt;In a &lt;a href=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/&#34;&gt;previous post&lt;/a&gt; in this series, we did an exploratory data analysis of the &lt;code&gt;diamonds&lt;/code&gt; dataset and found that &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt; were strongly correlated with &lt;code&gt;price&lt;/code&gt;. To some extent, &lt;code&gt;clarity&lt;/code&gt; also appeared to provide some predictive ability.&lt;/p&gt;
&lt;p&gt;In this post, we will build linear models and see how well they predict the &lt;code&gt;price&lt;/code&gt; of diamonds.&lt;/p&gt;
&lt;p&gt;Before we do any transformations, feature engineering or feature selections for our model, let’s see what kind of results we get from a base linear model, that uses all the features to predict &lt;code&gt;price&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
-21376   -592   -183    376  10694 

Coefficients:
            Estimate Std. Error t value             Pr(&amp;gt;|t|)    
(Intercept)  5753.76     396.63   14.51 &amp;lt; 0.0000000000000002 ***
carat       11256.98      48.63  231.49 &amp;lt; 0.0000000000000002 ***
cut.L         584.46      22.48   26.00 &amp;lt; 0.0000000000000002 ***
cut.Q        -301.91      17.99  -16.78 &amp;lt; 0.0000000000000002 ***
cut.C         148.03      15.48    9.56 &amp;lt; 0.0000000000000002 ***
cut^4         -20.79      12.38   -1.68               0.0929 .  
color.L     -1952.16      17.34 -112.57 &amp;lt; 0.0000000000000002 ***
color.Q      -672.05      15.78  -42.60 &amp;lt; 0.0000000000000002 ***
color.C      -165.28      14.72  -11.22 &amp;lt; 0.0000000000000002 ***
color^4        38.20      13.53    2.82               0.0047 ** 
color^5       -95.79      12.78   -7.50    0.000000000000066 ***
color^6       -48.47      11.61   -4.17    0.000030090737193 ***
clarity.L    4097.43      30.26  135.41 &amp;lt; 0.0000000000000002 ***
clarity.Q   -1925.00      28.23  -68.20 &amp;lt; 0.0000000000000002 ***
clarity.C     982.20      24.15   40.67 &amp;lt; 0.0000000000000002 ***
clarity^4    -364.92      19.29  -18.92 &amp;lt; 0.0000000000000002 ***
clarity^5     233.56      15.75   14.83 &amp;lt; 0.0000000000000002 ***
clarity^6       6.88      13.72    0.50               0.6157    
clarity^7      90.64      12.10    7.49    0.000000000000071 ***
depth         -63.81       4.53  -14.07 &amp;lt; 0.0000000000000002 ***
table         -26.47       2.91   -9.09 &amp;lt; 0.0000000000000002 ***
x           -1008.26      32.90  -30.65 &amp;lt; 0.0000000000000002 ***
y               9.61      19.33    0.50               0.6192    
z             -50.12      33.49   -1.50               0.1345    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.92,  Adjusted R-squared:  0.92 
F-statistic: 2.69e+04 on 23 and 53916 DF,  p-value: &amp;lt;0.0000000000000002&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 3 x 3
  .metric .estimator .estimate
  &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 rmse    standard    1130.   
2 rsq     standard       0.920
3 mae     standard     740.   &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model summary shows that this is an overfitted model. Among other things, we know that &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt; have no impact on &lt;code&gt;price&lt;/code&gt;, yet they are shown to be highly significant. The Root Mean Squared Error (rmse) and other metrics are also shown above.&lt;/p&gt;
&lt;p&gt;Let’s make a plot of actual v/s predicted prices to visualize how well this base model performs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/simple_lm_model_plot-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If the predictions are good, the points should lie close to a straight line drawn at 45 degrees. We can see this base model does a poor job of predicting prices. Worst of all, the model predicts negative prices at the lower end, which shows that &lt;code&gt;price&lt;/code&gt; has to be log-transformed to avoid such absurdities.&lt;/p&gt;
&lt;div id=&#34;feature-engineering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Feature Engineering&lt;/h2&gt;
&lt;p&gt;We know the price of a diamond is strongly correlated with its size. All things equal, the larger the diamond, the greater its price.&lt;/p&gt;
&lt;p&gt;As a first approximation, we can assume a diamond is a cuboid with dimensions &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt;. Then, we can compute its &lt;code&gt;volume&lt;/code&gt; as x * y * z.
As these 3 dimensions are highly correlated, we can compute a geometrical average dimension by taking the cube root of &lt;code&gt;volume&lt;/code&gt;, and retain a linear relationship with &lt;code&gt;log(price)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Another way to calculate an average dimension is by using high school chemistry. Mass, volume and density are related to each other by the equation:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(density = mass / volume\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We know that 1 carat = 0.2 gms. Dividing the mass by the density of diamond (3.51 gms/cc) gives us the volume in cc, which can be converted to a geometrical average dimension by taking the cube root.&lt;/p&gt;
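&lt;p&gt;As a quick sanity check of this arithmetic (Python, purely for illustration), a 1-carat diamond works out to an average dimension of roughly 3.85 mm:&lt;/p&gt;

```python
# Convert carat weight to a geometrical average dimension via density.
CARAT_TO_GRAMS = 0.2   # 1 carat = 0.2 g
DENSITY = 3.51         # g per cubic cm, for diamond

def avg_dimension_mm(carat):
    volume_cc = carat * CARAT_TO_GRAMS / DENSITY   # volume in cm^3
    volume_mm3 = volume_cc * 1000                  # 1 cm^3 = 1000 mm^3
    return volume_mm3 ** (1 / 3)                   # cube root: average side

dim = avg_dimension_mm(1.0)   # about 3.85 mm for a 1-carat diamond
```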
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/feature_engineering-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Even though both methods yield similar results, we can see that the density method results in a narrower range. But which method would be more robust?
Keep in mind that there are 20 &lt;code&gt;z&lt;/code&gt; values that are 0. In 7 of these records, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are 0 too, which means these values were not recorded reliably.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 20 x 10
   carat cut       color clarity depth table price     x     y     z
   &amp;lt;dbl&amp;gt; &amp;lt;ord&amp;gt;     &amp;lt;ord&amp;gt; &amp;lt;ord&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
 1  1    Premium   G     SI2      59.1    59  3142  6.55  6.48     0
 2  1.01 Premium   H     I1       58.1    59  3167  6.66  6.6      0
 3  1.1  Premium   G     SI2      63      59  3696  6.5   6.47     0
 4  1.01 Premium   F     SI2      59.2    58  3837  6.5   6.47     0
 5  1.5  Good      G     I1       64      61  4731  7.15  7.04     0
 6  1.07 Ideal     F     SI2      61.6    56  4954  0     6.62     0
 7  1    Very Good H     VS2      63.3    53  5139  0     0        0
 8  1.15 Ideal     G     VS2      59.2    56  5564  6.88  6.83     0
 9  1.14 Fair      G     VS1      57.5    67  6381  0     0        0
10  2.18 Premium   H     SI2      59.4    61 12631  8.49  8.45     0
11  1.56 Ideal     G     VS2      62.2    54 12800  0     0        0
12  2.25 Premium   I     SI1      61.3    58 15397  8.52  8.42     0
13  1.2  Premium   D     VVS1     62.1    59 15686  0     0        0
14  2.2  Premium   H     SI1      61.2    59 17265  8.42  8.37     0
15  2.25 Premium   H     SI2      62.8    59 18034  0     0        0
16  2.02 Premium   H     VS2      62.7    53 18207  8.02  7.95     0
17  2.8  Good      G     SI2      63.8    58 18788  8.9   8.85     0
18  0.71 Good      F     SI2      64.1    60  2130  0     0        0
19  0.71 Good      F     SI2      64.1    60  2130  0     0        0
20  1.12 Premium   G     I1       60.4    59  2383  6.71  6.67     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In all of these records, the &lt;code&gt;carat&lt;/code&gt; values were recorded reliably and are probably more accurate than the dimensions.
Hence, we might prefer the density method of generating this feature.&lt;/p&gt;
&lt;p&gt;Furthermore, since density is a constant, dividing by a constant to calculate volume isn’t really necessary. Instead, a cube root transformation can be applied to &lt;code&gt;carat&lt;/code&gt; itself for the purposes of predictive modelling, resulting in a linear relationship between &lt;span class=&#34;math inline&#34;&gt;\(log(price)\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(carat^{1/3}\)&lt;/span&gt;.
This is why we can still fit a linear model: the model remains linear in its parameters.&lt;/p&gt;
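&lt;p&gt;A minimal numerical illustration of this point (Python; the coefficient values below are made up for the example): once &lt;code&gt;carat&lt;/code&gt; is cube-root transformed, an ordinary least squares fit recovers the relationship exactly, because the model is linear in its parameters:&lt;/p&gt;

```python
# log(price) assumed linear in carat**(1/3); a straight-line fit on the
# transformed predictor recovers the coefficients exactly.
import numpy as np

carat = np.linspace(0.3, 2.0, 200)
log_price = 2.1 + 6.2 * carat ** (1 / 3)   # assumed true relationship

slope, intercept = np.polyfit(carat ** (1 / 3), log_price, 1)
```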
&lt;/div&gt;
&lt;div id=&#34;training-linear-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Training Linear Models&lt;/h2&gt;
&lt;p&gt;Here are the steps for building linear models and computing metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition the dataset into training and testing sets in the proportion 75% and 25% respectively.&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;clarity&lt;/code&gt; is one of the main predictors, stratify the partitioning by &lt;code&gt;clarity&lt;/code&gt;, so both training and testing sets have the same distributions of this feature.&lt;/li&gt;
&lt;li&gt;Fit a linear model with the training set.&lt;/li&gt;
&lt;li&gt;Make predictions on the testing set and determine model metrics.&lt;/li&gt;
&lt;li&gt;Wrap all the steps above inside a function to which the model formula and a seed can be passed. Since the seed determines the random partitioning, this helps minimize the vagaries of partitioning the training and testing sets before fitting models.&lt;/li&gt;
&lt;li&gt;Run multiple iterations of a model with different seeds, and compute its average metrics, which should better reflect the results on unseen data.&lt;/li&gt;
&lt;/ul&gt;
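&lt;p&gt;The loop above can be sketched as follows (Python/scikit-learn purely for illustration, since the post’s analysis is in R; the toy data and model below are assumptions):&lt;/p&gt;

```python
# Stratified 75/25 split, fit, evaluate; repeat over seeds and average.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def run_once(X, y, strata, seed):
    # Stratify so training and testing sets share the strata distribution.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=strata, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5  # rmse

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=400)
strata = (X[:, 0] > 0).astype(int)   # stand-in for the clarity grades
avg_rmse = float(np.mean([run_once(X, y, strata, s) for s in range(5)]))
```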
&lt;p&gt;Here’s a sample split of training and testing set, stratified by &lt;code&gt;clarity&lt;/code&gt;. As we can see, the training and testing sets have similar distributions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dfTrain$clarity 
       n  missing distinct 
   40457        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    552  6895  9826  9222  6125  3780  2722  1335
Proportion 0.014 0.170 0.243 0.228 0.151 0.093 0.067 0.033&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;dfTest$clarity 
       n  missing distinct 
   13483        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    189  2299  3239  3036  2046  1286   933   455
Proportion 0.014 0.171 0.240 0.225 0.152 0.095 0.069 0.034&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running 5 iterations of each model with a different seed, here are the average metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 4
  model                                                 rmse   rsq   mae
  &amp;lt;chr&amp;gt;                                                &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1 log(price) ~ .                                      11055. 0.670  570.
2 log(price) ~ I(carat^(1/3))                          2893. 0.687 1039.
3 log(price) ~ I(carat^(1/3)) + clarity                2312. 0.807  881.
4 log(price) ~ I(carat^(1/3)) + clarity + color        1870. 0.870  631.
5 log(price) ~ I(carat^(1/3)) + clarity + color + cut  1848. 0.875  625.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first model with all predictors is an overfitted one.&lt;/p&gt;
&lt;p&gt;The model with &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; provides the best combination of root mean squared error and r-squared, explaining the most variance.
This is our final model.
Including &lt;code&gt;cut&lt;/code&gt; in the model has diminishing benefits, and tends to overfit the data.&lt;/p&gt;
&lt;p&gt;Here’s the summary of our final model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
Call:
lm(formula = model_formula, data = dfTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6022 -0.1034  0.0145  0.1066  1.7941 

Coefficients:
                Estimate Std. Error t value             Pr(&amp;gt;|t|)    
(Intercept)     2.147009   0.004993  429.99 &amp;lt; 0.0000000000000002 ***
I(carat^(1/3))  6.246412   0.005365 1164.27 &amp;lt; 0.0000000000000002 ***
clarity.L       0.922295   0.005036  183.15 &amp;lt; 0.0000000000000002 ***
clarity.Q      -0.295539   0.004734  -62.43 &amp;lt; 0.0000000000000002 ***
clarity.C       0.166979   0.004068   41.05 &amp;lt; 0.0000000000000002 ***
clarity^4      -0.068591   0.003260  -21.04 &amp;lt; 0.0000000000000002 ***
clarity^5       0.032833   0.002669   12.30 &amp;lt; 0.0000000000000002 ***
clarity^6      -0.001904   0.002325   -0.82              0.41288    
clarity^7       0.025508   0.002049   12.45 &amp;lt; 0.0000000000000002 ***
color.L        -0.488882   0.002927 -167.05 &amp;lt; 0.0000000000000002 ***
color.Q        -0.117319   0.002680  -43.78 &amp;lt; 0.0000000000000002 ***
color.C        -0.012230   0.002497   -4.90           0.00000098 ***
color^4         0.019007   0.002288    8.31 &amp;lt; 0.0000000000000002 ***
color^5        -0.008110   0.002159   -3.76              0.00017 ***
color^6        -0.000396   0.001967   -0.20              0.84055    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

Residual standard error: 0.166 on 40442 degrees of freedom
Multiple R-squared:  0.973, Adjusted R-squared:  0.973 
F-statistic: 1.05e+05 on 14 and 40442 DF,  p-value: &amp;lt;0.0000000000000002&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/final_model_summary-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here’s a scatterplot of actual v/s predicted log(price) from our final model on the testing set:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part2-linear-models/index_files/figure-html/final_model_plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The points lie close to the 45-degree line. However, on the high end there are many outliers where the actual and predicted values diverge substantially.
Nevertheless, this is about as good as a linear model gets on this dataset.&lt;/p&gt;
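&lt;p&gt;The coefficient table above corresponds to a model of the form &lt;code&gt;log(price) ~ I(carat^(1/3)) + clarity + color&lt;/code&gt;. As a minimal sketch, a fit of this form can be reproduced as below; the 75/25 split proportion and seed are assumptions here, and the post’s own preprocessing (such as outlier removal) is not reproduced:&lt;/p&gt;

```r
library(ggplot2)  # provides the diamonds dataset

# Assumed 75/25 train/test split; seed chosen arbitrarily
set.seed(42)
idx   <- sample(nrow(diamonds), 0.75 * nrow(diamonds))
train <- diamonds[idx, ]
test  <- diamonds[-idx, ]

# clarity and color are ordered factors, so lm() expands them into
# the polynomial contrasts (.L, .Q, .C, ^4, ...) seen in the summary
fit  <- lm(log(price) ~ I(carat^(1/3)) + clarity + color, data = train)

# RMSE between actual and predicted log(price) on the held-out set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((log(test$price) - pred)^2))
```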
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Diamonds - Part 1 - In the rough - An Exploratory Data Analysis</title>
      <link>https://www.nitingupta.com/casestudies/diamonds-part1-eda/</link>
      <pubDate>Tue, 20 Dec 2016 00:00:00 +0000</pubDate>
      <guid>https://www.nitingupta.com/casestudies/diamonds-part1-eda/</guid>
      <description>


&lt;p&gt;In this case study, we will explore the &lt;code&gt;diamonds&lt;/code&gt; dataset, then build linear and non-linear regression models to predict the price of diamonds.&lt;/p&gt;
&lt;div id=&#34;data-description&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Description&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;diamonds&lt;/code&gt; dataset contains prices (in 2008 USD) and other attributes of almost 54,000 diamonds.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Attribute&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;price&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;price in 2008 USD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;carat&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;weight of a diamond (1 carat = 0.2 gms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cut&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;quality of the cut (Fair, Good, Very Good, Premium, Ideal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;color&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;diamond color from D (best) to J (worst)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;clarity&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;x&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;length in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;y&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;width in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;z&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;depth in mm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;depth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;total depth percentage = z / mean(x, y) = 2 * z / (x + y)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;table&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;width of the top of diamond relative to widest point&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;center&gt;
&lt;p&gt;&lt;img src=&#34;xyz.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;color.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;clarity.png&#34; /&gt;&lt;/p&gt;
&lt;/center&gt;
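&lt;p&gt;The dataset ships with the &lt;code&gt;ggplot2&lt;/code&gt; R package, so it can be loaded and inspected directly:&lt;/p&gt;

```r
library(ggplot2)  # the diamonds data frame is bundled with ggplot2

dim(diamonds)  # 53940 rows, 10 columns
str(diamonds)  # cut, color and clarity are stored as ordered factors
```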
&lt;/div&gt;
&lt;div id=&#34;data-summaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Summaries&lt;/h2&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/summary_visual-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A preliminary visual summary of the whole dataset shows all the features and their types. There are no missing values (NAs) in this dataset.&lt;/p&gt;
&lt;p&gt;Let’s examine each feature numerically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dfInput 

 10  Variables      53940  Observations
----------------------------------------------------------------------------------------------------------------------------------------------------------------
price 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0    11602        1     3933     4012      544      646      950     2401     5324     9821    13107 

lowest :   326   327   334   335   336, highest: 18803 18804 18806 18818 18823
----------------------------------------------------------------------------------------------------------------------------------------------------------------
carat 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      273    0.999   0.7979   0.5122     0.30     0.31     0.40     0.70     1.04     1.51     1.70 

lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
----------------------------------------------------------------------------------------------------------------------------------------------------------------
cut 
       n  missing distinct 
   53940        0        5 

lowest : Fair      Good      Very Good Premium   Ideal    , highest: Fair      Good      Very Good Premium   Ideal    
                                                            
Value           Fair      Good Very Good   Premium     Ideal
Frequency       1610      4906     12082     13791     21551
Proportion     0.030     0.091     0.224     0.256     0.400
----------------------------------------------------------------------------------------------------------------------------------------------------------------
color 
       n  missing distinct 
   53940        0        7 

lowest : J I H G F, highest: H G F E D
                                                    
Value          J     I     H     G     F     E     D
Frequency   2808  5422  8304 11292  9542  9797  6775
Proportion 0.052 0.101 0.154 0.209 0.177 0.182 0.126
----------------------------------------------------------------------------------------------------------------------------------------------------------------
clarity 
       n  missing distinct 
   53940        0        8 

lowest : I1   SI2  SI1  VS2  VS1 , highest: VS2  VS1  VVS2 VVS1 IF  
                                                          
Value         I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
Frequency    741  9194 13065 12258  8171  5066  3655  1790
Proportion 0.014 0.170 0.242 0.227 0.151 0.094 0.068 0.033
----------------------------------------------------------------------------------------------------------------------------------------------------------------
depth 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      184    0.999    61.75    1.515     59.3     60.0     61.0     61.8     62.5     63.3     63.8 

lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------
table 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      127     0.98    57.46    2.448       54       55       56       57       59       60       61 

lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------
x 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      554        1    5.731    1.276     4.29     4.36     4.71     5.70     6.54     7.31     7.66 

lowest :  0.00  3.73  3.74  3.76  3.77, highest: 10.01 10.02 10.14 10.23 10.74
----------------------------------------------------------------------------------------------------------------------------------------------------------------
y 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      552        1    5.735    1.269     4.30     4.36     4.72     5.71     6.54     7.30     7.65 

lowest :  0.00  3.68  3.71  3.72  3.73, highest: 10.10 10.16 10.54 31.80 58.90
                                                                                                                      
Value        0.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5   8.0   8.5   9.0   9.5  10.0  10.5  32.0  59.0
Frequency      7     5  1731 12305  7817  5994  6742  9260  4298  3402  1635   652    69    14     6     1     1     1
Proportion 0.000 0.000 0.032 0.228 0.145 0.111 0.125 0.172 0.080 0.063 0.030 0.012 0.001 0.000 0.000 0.000 0.000 0.000

For the frequency table, variable is rounded to the nearest 0.5
----------------------------------------------------------------------------------------------------------------------------------------------------------------
z 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   53940        0      375        1    3.539   0.7901     2.65     2.69     2.91     3.53     4.04     4.52     4.73 

lowest :  0.00  1.07  1.41  1.53  2.06, highest:  6.43  6.72  6.98  8.06 31.80
                                                                                                          
Value        0.0   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   8.0  32.0
Frequency     20     1     2     3  8807 13809  9474 13682  5525  2352   237    20     5     1     1     1
Proportion 0.000 0.000 0.000 0.000 0.163 0.256 0.176 0.254 0.102 0.044 0.004 0.000 0.000 0.000 0.000 0.000

For the frequency table, variable is rounded to the nearest 0.5
----------------------------------------------------------------------------------------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt;: The average price of a diamond in this dataset is ~ USD 4000. There are many outliers on the high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;carat&lt;/code&gt;: The average carat weight is ~ 0.8. About 75% of the diamonds are under 1 carat. The top 5 values show presence of many outliers on the high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cut&lt;/code&gt;: About 40% of the diamonds are of &lt;em&gt;Ideal&lt;/em&gt; cut. Only 3% are &lt;em&gt;Fair&lt;/em&gt; cut. So there is a lot of imbalance in the categories.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;color&lt;/code&gt;: Most of the diamonds are rated &lt;em&gt;E&lt;/em&gt; to &lt;em&gt;H&lt;/em&gt; color. Relatively fewer are rated &lt;em&gt;J&lt;/em&gt; color.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt;: Most of the diamonds are rated &lt;em&gt;SI2&lt;/em&gt; to &lt;em&gt;VS1&lt;/em&gt; clarity. About 1% are rated the worst &lt;em&gt;I1&lt;/em&gt; clarity, whereas only ~ 3% are rated &lt;em&gt;IF&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;depth&lt;/code&gt;: Most of the depth values are between 60 and 64. There are outliers on both low end and high end.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table&lt;/code&gt;: Most of the table values are between 54 and 65. There are outliers on both ends.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;: Denotes the dimension along the x-axis. Most values are between 4 and 8. There are also some 0 values, which likely means those dimensions were not recorded.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;y&lt;/code&gt;: Denotes the dimension along the y-axis. Most values are between 3.5 and 8. There are 7 records where the values are 0.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z&lt;/code&gt;: Denotes the dimension along the z-axis. Most values are between 2.5 and 8.5. There are 20 records where the values are 0.&lt;/li&gt;
&lt;/ul&gt;
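&lt;p&gt;The suspicious 0 values in the dimension columns noted above can be counted directly; a quick sketch:&lt;/p&gt;

```r
library(ggplot2)

# Dimensions recorded as 0 are physically impossible, so they are
# effectively missing values in disguise
zero_dims <- colSums(diamonds[, c("x", "y", "z")] == 0)
zero_dims
```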
&lt;/div&gt;
&lt;div id=&#34;univariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Univariate Analysis&lt;/h2&gt;
&lt;p&gt;Let us look at each feature in the dataset in detail.&lt;/p&gt;
&lt;div id=&#34;numerical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical Features&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/univariate_continuous-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The plots show presence of outliers within each feature. Let’s exclude the outliers and plot them again.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/univariate_continuous_ex_outliers-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Excluding outliers, the ranges of values are more reasonable. We can see that &lt;code&gt;carat&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt; are heavily right-skewed.&lt;/p&gt;
&lt;p&gt;Let’s plot the distribution of &lt;code&gt;price&lt;/code&gt; in log scale:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/price_distribution-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The two peaks in the log-transformed plot show a bimodal distribution of prices. This implies two price points are most popular among customers -
one just below USD 1000 and the other around USD 5000. Intriguingly, there are no diamonds in the dataset priced around USD 1500, so a big gap is visible at that price.&lt;/p&gt;
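&lt;p&gt;A sketch of the log-scale plot above (the bin count is an assumption for illustration):&lt;/p&gt;

```r
library(ggplot2)

# Price histogram on a log10 x-axis reveals the bimodal shape
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 100) +
  scale_x_log10() +
  labs(x = "price (USD, log scale)", y = "count")

# The gap near USD 1500 can also be checked numerically
near_1500 <- sum(diamonds$price >= 1480 & diamonds$price <= 1520)
```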
&lt;/div&gt;
&lt;div id=&#34;categorical-features&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Categorical Features&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/Univariate_Categorical-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The categorical imbalance in &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; can be clearly noticed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;bivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Bivariate Analysis&lt;/h2&gt;
&lt;p&gt;Let’s examine the relationship of &lt;code&gt;price&lt;/code&gt; with other features.&lt;/p&gt;
&lt;div id=&#34;numerical-numerical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-numerical&lt;/h4&gt;
&lt;p&gt;First and foremost, let’s do a correlation analysis to see how &lt;code&gt;price&lt;/code&gt; is correlated with other numerical features:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/bivariate_correlations-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see that &lt;code&gt;price&lt;/code&gt; is very strongly correlated with &lt;code&gt;carat&lt;/code&gt; and the &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; dimensions. Since these features are also strongly correlated with each other, a linear regression model built on all of them would suffer from multicollinearity. &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; have almost no correlation with &lt;code&gt;price&lt;/code&gt;, so they are not so interesting for
predictive modelling.&lt;/p&gt;
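&lt;p&gt;The correlations underlying the plot can be computed directly:&lt;/p&gt;

```r
library(ggplot2)

# Pearson correlation of price with every numerical feature
num_cols <- c("price", "carat", "depth", "table", "x", "y", "z")
cors <- cor(as.data.frame(diamonds)[, num_cols])["price", ]
round(cors, 2)
# carat, x, y and z correlate strongly with price, while
# depth and table are close to zero
```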
&lt;p&gt;Now let’s see the scatter plots:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/bivariate_scatterplots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After removing outliers, it can be noted that &lt;code&gt;price&lt;/code&gt; increases exponentially with &lt;code&gt;carat&lt;/code&gt;, as well as with the &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; dimensions. So &lt;code&gt;price&lt;/code&gt; should be plotted with a log transformation. Let’s do that:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/log_scatterplots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the relationship of &lt;code&gt;log(price)&lt;/code&gt; with &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; appears to be linear, but less so with &lt;code&gt;carat&lt;/code&gt;. The variance in &lt;code&gt;price&lt;/code&gt; tends to
increase with both &lt;code&gt;carat&lt;/code&gt; and the dimensions. Log-transforming &lt;code&gt;carat&lt;/code&gt; wouldn’t help because &lt;code&gt;carat&lt;/code&gt; does not span a wide range.
We will find ways to deal with this when we do Feature Engineering.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;numerical-categorical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-Categorical&lt;/h4&gt;
&lt;p&gt;Let’s examine &lt;code&gt;price&lt;/code&gt; with respect to the categorical features in the dataset:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/Bivariate_Cont_Cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The boxplots above are plotted with a truncated &lt;code&gt;price&lt;/code&gt; axis for better visualization of trends. All the boxplots are counter-intuitive - median prices tend to decline as we move from the lowest grade to the highest grade of &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;. This is very odd.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The median &lt;code&gt;price&lt;/code&gt; generally declines from &lt;em&gt;Fair&lt;/em&gt; &lt;code&gt;cut&lt;/code&gt; to &lt;em&gt;Ideal&lt;/em&gt; &lt;code&gt;cut&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In terms of &lt;code&gt;color&lt;/code&gt;, the median &lt;code&gt;price&lt;/code&gt; decreases from &lt;em&gt;J&lt;/em&gt; (worst) to &lt;em&gt;G&lt;/em&gt; (mid-grade), then increases and finally decreases for &lt;em&gt;D&lt;/em&gt; (best).&lt;/li&gt;
&lt;li&gt;The median &lt;code&gt;price&lt;/code&gt; increases when &lt;code&gt;clarity&lt;/code&gt; improves from &lt;em&gt;I1&lt;/em&gt; to &lt;em&gt;SI2&lt;/em&gt;, and then decreases monotonically to &lt;em&gt;IF&lt;/em&gt; grade.&lt;/li&gt;
&lt;/ul&gt;
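&lt;p&gt;Boxplots of this kind, along with the medians behind them, can be reproduced along these lines (the USD 8000 axis limit is an assumption for illustration):&lt;/p&gt;

```r
library(ggplot2)

# coord_cartesian() truncates the view without dropping data,
# so the boxplot statistics stay intact
p <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 8000))

# Median price per cut grade, to see the counter-intuitive ordering
med_by_cut <- tapply(diamonds$price, diamonds$cut, median)
```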
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;multivariate-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multivariate Analysis&lt;/h2&gt;
&lt;p&gt;So far, we have determined &lt;code&gt;carat&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; have the strongest relationship with &lt;code&gt;price.&lt;/code&gt; Different grades of &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; also seem to have some impact on median &lt;code&gt;price&lt;/code&gt;. So let’s make some scatter plots to see these relationships:&lt;/p&gt;
&lt;div id=&#34;numerical-numerical-categorical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Numerical-Numerical-Categorical&lt;/h4&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_num_num_cat-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Although there is a lot of overlap, there is a clear trend of &lt;code&gt;price&lt;/code&gt; increasing with &lt;code&gt;clarity&lt;/code&gt; at a given &lt;code&gt;carat&lt;/code&gt; weight. The same pattern can be observed, though less strongly, across increasing grades of &lt;code&gt;color&lt;/code&gt;. There is no comparable pattern for &lt;code&gt;cut&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We can conclude both &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; explain some variance in &lt;code&gt;price&lt;/code&gt; at a given &lt;code&gt;carat&lt;/code&gt; weight.&lt;/p&gt;
&lt;p&gt;To check for any interaction of &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; with &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, let’s plot these:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_plots_other-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There is no pattern in &lt;code&gt;price&lt;/code&gt; v/s &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;table&lt;/code&gt; when plotted by &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;. So these features appear to have little predictive power for &lt;code&gt;price&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;categorical-categorical-numerical&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Categorical-Categorical-Numerical&lt;/h4&gt;
&lt;p&gt;We want to see if there is any interaction of &lt;code&gt;clarity&lt;/code&gt; with &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt;, that could provide any additional explanatory power to predict &lt;code&gt;price&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.nitingupta.com/casestudies/diamonds-part1-eda/index_files/figure-html/multivariate_cat_cat_num-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The second heatmap is more interesting. From bottom left to top right, with increasing grades of &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt; tends to decrease on average. Once again, this runs counter to our intuition; after all, diamonds with the best &lt;code&gt;color&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt; should command the highest prices. Nevertheless, this counter-trend persists in the dataset.&lt;/p&gt;
&lt;p&gt;With respect to &lt;code&gt;cut&lt;/code&gt; and &lt;code&gt;clarity&lt;/code&gt;, the mean prices do not show any discernible pattern.&lt;/p&gt;
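&lt;p&gt;The quantity behind the second heatmap - mean &lt;code&gt;price&lt;/code&gt; per &lt;code&gt;color&lt;/code&gt;/&lt;code&gt;clarity&lt;/code&gt; cell - can be computed and tiled as follows:&lt;/p&gt;

```r
library(ggplot2)

# Mean price for each color x clarity combination
cell_means <- aggregate(price ~ color + clarity, data = diamonds, FUN = mean)

# Heatmap: each tile is one color/clarity cell, filled by mean price
p <- ggplot(cell_means, aes(x = color, y = clarity, fill = price)) +
  geom_tile()
```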
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;To summarize, here’s what we found interesting in this dataset, after doing an exploratory data analysis:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt; is heavily right-skewed, and when log-transformed, has a bimodal distribution, which implies there is demand in two different price ranges.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;carat&lt;/code&gt;: about 75% of the diamonds are below 1 carat. The variance in price increases with carat weight.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cut&lt;/code&gt; is imbalanced with about 40% of the diamonds rated &lt;em&gt;Ideal&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;color&lt;/code&gt; is imbalanced with about 5% of the diamonds rated &lt;em&gt;J&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarity&lt;/code&gt; is imbalanced at the extremes, with only 1.5% of the diamonds rated &lt;em&gt;I1&lt;/em&gt; and 3.3% of the diamonds rated &lt;em&gt;IF&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;price&lt;/code&gt; is strongly correlated with &lt;code&gt;carat&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt; dimensions of the diamonds. &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; have almost no correlation with &lt;code&gt;price&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Both &lt;code&gt;clarity&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; appear to explain some variance in &lt;code&gt;price&lt;/code&gt; for a given &lt;code&gt;carat&lt;/code&gt; weight.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
