ggplot2 | Nitin Gupta

Ames Housing - Part 2 - Building Models

In a previous post in this series, we did an exploratory data analysis of the Ames Housing dataset. In this post, we will build linear and non-linear models and see how well they predict the SalePrice of properties.

Evaluation Criteria

The root-mean-squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed SalePrice will be our evaluation criterion. Taking the log means the error depends on the ratio of predicted to observed price, so errors in predicting expensive and cheap houses affect the result equally.
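To make the log-based RMSE concrete, here is a minimal sketch in Python, used purely for illustration. The function name `rmse_log` and the prices are hypothetical and do not come from the post or the dataset. It shows that a 10% overestimate contributes the same error for a cheap house as for an expensive one.

```python
import math


def rmse_log(predicted, observed):
    """RMSE between log(predicted) and log(observed).

    Hypothetical helper for illustration; assumes all prices are positive.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices: both predictions are 10% too high.
cheap = rmse_log([110_000], [100_000])
expensive = rmse_log([1_100_000], [1_000_000])

# Both values are approximately log(1.1), about 0.0953.
print(cheap, expensive)
```

On the raw price scale the second error would be ten times larger than the first; on the log scale the two are the same up to floating-point rounding.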

Ames Housing - Part 1 - Exploratory Data Analysis

In this case study, we will use the Ames Housing dataset to explore regression techniques and predict the sale price of houses.

Data Summaries

The Ames Housing dataset contains the sale prices of properties in Ames, Iowa along with 80 other features. Each property has an Id associated with it. Here are the dimensions of the training and testing sets respectively:

[1] "Dimensions of the training set"
[1] 1460   81
[1] "Dimensions of the testing set"
[1] 1459   81

Now, let’s combine training and testing into a single dataset and take a look at the count of missing values:

Diamonds - Part 3 - A polished gem - Building Non-linear Models

Other posts in this series:

Diamonds - Part 1 - In the rough - An Exploratory Data Analysis
Diamonds - Part 2 - A cut above - Building Linear Models

In a couple of previous posts, we tried to understand which attributes of diamonds are important in determining their prices. We showed that carat, clarity and color are the most important predictors of price. We arrived at this conclusion after doing a detailed exploratory data analysis.

Diamonds - Part 2 - A cut above - Building Linear Models

In a previous post in this series, we did an exploratory data analysis of the diamonds dataset and found that carat, x, y and z were strongly correlated with price. To some extent, clarity also appeared to provide some predictive ability. In this post, we will build linear models and see how well they predict the price of diamonds. Before we do any transformations, feature engineering or feature selection for our model, let’s see what kind of results we get from a base linear model that uses all the features to predict price:

Diamonds - Part 1 - In the rough - An Exploratory Data Analysis

In this case study, we will explore the diamonds dataset, then build linear and non-linear regression models to predict the price of diamonds.

Data Description

The diamonds dataset contains the prices in 2008 USD terms, and other attributes of almost 54,000 diamonds.

Attribute   Description
price       price in 2008 USD
carat       weight of a diamond (1 carat = 0.2 g)
cut         quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color       diamond color from D (best) to J (worst)
clarity     a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x           length in mm
y           width in mm
z           depth in mm
depth       total depth percentage = z/mean(x, y)
table       width of the top of diamond relative to widest point

Data Summaries

A preliminary visual summary of the whole dataset shows all the features and their types.
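The depth attribute in the data description is derived from x, y and z, which is worth keeping in mind when all four are used as predictors. As a rough sketch in Python (illustration only), the formula z/mean(x, y) can be computed as follows. The function name `depth_percentage` and the measurements are hypothetical, and the multiplication by 100 is an assumption made here so that the ratio reads as a percentage.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z divided by the mean of x and y.

    x, y, z are the length, width and depth in mm.
    Assumption: the ratio is scaled by 100 to express it as a percentage.
    """
    mean_xy = (x + y) / 2
    if mean_xy <= 0:
        raise ValueError("x and y must give a positive mean")
    return 100 * z / mean_xy


# Made-up measurements in mm, not taken from the dataset.
example_depth = depth_percentage(4.0, 4.0, 2.5)
print(example_depth)  # 62.5
```

Because depth is a function of the three dimensions, it adds little independent information once x, y and z are already in a model.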