Analysis.Rmd
1096 lines (787 loc) · 49.1 KB
---
title: "House Prices"
author: "Gabriel Lapointe"
date: "September 18, 2016"
output:
  html_document:
    highlight: pygments
    keep_md: yes
    number_sections: yes
    toc: yes
  pdf_document:
    toc: yes
variant: markdown_github
---
# Requirements
The requirements are taken from [the Kaggle competition page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
## Business Requirement
We have to answer the following question: how do a home's features add up to its price tag?
## Functional Requirement
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this analysis shall predict the final price of each home.
# Data Acquisition
In this section, we will ask questions on the dataset and establish a methodology to solve the problem.
## Data Source
The data is provided by Kaggle and can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
## Dataset Questions
Before we start the exploration of the dataset, we need to write a list of questions about this dataset considering the problem we have to solve.
* How big is the dataset?
* Does the dataset contain 'NA' or missing values? Can we replace them with another value? Why?
* Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?
* What does the data look like and what are the relationships between features if they exist?
* What are the measures used?
* Does the dataset contain abnormal data?
* Can we solve the problem with this dataset?
## Evaluation Metrics
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
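This metric can be sketched in R; the helper name `rmsle` is ours, not part of any competition tooling.

```{r}
## Root mean squared error between log prices; the +1 shift guards against
## a (hypothetical) price of 0 producing -Inf.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
```

On the log scale, over-predicting a \$100,000 house by 10% costs about the same as over-predicting a \$1,000,000 house by 10%.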
## Methodology
In this document, we start by cleaning and exploring the dataset to build the data story behind it. This will give us important insights and answer our questions about the dataset. The next step is feature engineering, which consists of creating, removing or replacing features based on the insights gained during exploration. We will ensure the new dataset is a valid input for each of our prediction models. We will fine-tune each model's parameters by cross-validating on the train set to obtain the optimal parameters. After applying our model to the test set, we will visualize the predictions and explain the results. Finally, we will conclude on the most useful features to fulfill the business objective of this project.
## Loading Dataset
We load 'train.csv' and 'test.csv', then merge them so that cleaning and exploration can be done on the entire dataset.
```{r message=FALSE, warning=FALSE, comment=NA}
library(data.table) # setDT, set
library(dplyr) # select, filter, %>%
library(scales) # Scaling functions used for ggplot
library(gridExtra) # Grid of ggplot to save space
library(ggplot2) # ggplot functions for visualization and exploration
library(caret)
library(corrplot)
library(moments) # For skewness
library(Matrix)
#library(mice) # To replace NA values by a predicted one
library(Hmisc) # To impute features having NA values to replace
library(VIM)
library(randomForest)
library(xgboost)
library(glmnet)
library(microbenchmark) # benchmarking functions
library(knitr) # opts_chunk
setwd("/home/gabriel/Documents/Projects/HousePrices")
set.seed(1234)
source("Dataset.R")
## Remove scientific notation (e.g. E-005).
options(scipen = 999)
## Remove hash symbols when printing results and do not show message or warning everywhere in this document.
opts_chunk$set(message = FALSE,
warning = FALSE,
comment = NA)
'%nin%' <- Negate('%in%')
## Read csv files and ensure NA strings are converted to real NA.
system.time({
na.strings <- c("NA", "", " ")
train <- fread(input = "train.csv",
showProgress = FALSE,
stringsAsFactors = FALSE,
na.strings = na.strings,
header = TRUE)
test <- fread(input = "test.csv",
showProgress = FALSE,
stringsAsFactors = FALSE,
na.strings = na.strings,
header = TRUE)
## Merge the train and test sets in a data.table object.
test$SalePrice <- -1
dataset <- rbindlist(list(train, test), use.names = TRUE)
})
```
| Dataset | File Size (Kb) | # Houses | # Features |
| ------------------ | --------------- | --------------------- | --------------------- |
| train.csv | 460.7 | `r nrow(train)` | `r ncol(train)` |
| test.csv | 451.4 | `r nrow(test)` | `r ncol(test) - 1` |
| **Total(dataset)** | **912.1** | **`r nrow(dataset)`** | **`r ncol(dataset)`** |
These datasets are very small. Each observation (row) is a house; for the houses of the test set, we want to predict the sale price.
<!------------------------------------------------------------DATASET CLEANING------------------------------------------------------------------------------>
# Dataset Cleaning
The objective of this section is to detect all inconsistencies in the dataset and fix as many as possible to gain coherence and accuracy. We have to check the dataset against the possible values given in the code book: no misspelled words and no values outside the code book. Also, all numerical values should be coherent with their description, meaning that their bounds have to be logically correct. Per the code book, none of the categorical features has more than 25 unique values. We will then compare the values mentioned in the code book with the values we have in the dataset. Finally, we have to detect anomalies and determine techniques to replace missing values with the most accurate ones.
```{r echo=FALSE}
sapply(dataset, getUniqueValues)
```
## Feature Names Harmonization
We start by harmonizing the feature names to be coherent with the code book. Comparing manually with the code book's possible codes, the following features have differences:
| Feature | Dataset | CodeBook |
| ------------------ | ------------ | --------------- |
| MSZoning | C (all) | C |
| MSZoning | NA | No corresponding value |
| Alley | Empty string | No corresponding value |
| PoolQC | Empty string | No corresponding value |
| Utilities | NA | No corresponding value |
| Neighborhood | NAmes | Names (likely a typo for NAmes) |
| BldgType | 2fmCon | 2FmCon |
| BldgType | Duplex | Duplx |
| BldgType | Twnhs | TwnhsI |
| Exterior1st | NA | No corresponding value |
| Exterior2nd | NA | No corresponding value |
| Exterior2nd | Wd Shng | WdShing |
| MasVnrType | NA | No corresponding value |
| Electrical | NA | No corresponding value |
| KitchenQual | NA | No corresponding value |
| Functional | NA | No corresponding value |
| MiscFeature | Empty string | No corresponding value |
| SaleType | NA | No corresponding value |
| Bedroom | Named 'BedroomAbvGr' | Named 'Bedroom'; 'BedroomAbvGr' follows the naming convention |
| Kitchen | Named 'KitchenAbvGr' | Named 'Kitchen'; 'KitchenAbvGr' follows the naming convention |
The code book seems to have a naming convention, but it is not always respected, so it will be hard to achieve complete coherence. Since we do not know the reason behind each code and each feature name, we will not change anything in the code book; the changes will be done in the dataset only.
To be coherent with the code book (assuming the code book is the truth), we replace misspelled categories in the dataset by their corresponding code from the code book. Note that we deduce that the string 'Twnhs' corresponds to 'TwnhsI' in the code book, since all the other codes can easily be associated.
```{r}
dataset <- dataset[MSZoning == "C (all)", MSZoning := "C"]
dataset <- dataset[BldgType == "2fmCon", BldgType := "2FmCon"]
dataset <- dataset[BldgType == "Duplex", BldgType := "Duplx"]
dataset <- dataset[BldgType == "Twnhs", BldgType := "TwnhsI"]
dataset <- dataset[Exterior2nd == "Wd Shng", Exterior2nd := "WdShing"]
```
Since some feature names start with a digit, which is not allowed as an identifier in many programming languages, we rename them with their full name.
```{r}
colnames(dataset)[colnames(dataset) == "1stFlrSF"] <- "FirstFloorArea"
colnames(dataset)[colnames(dataset) == "2ndFlrSF"] <- "SecondFloorArea"
colnames(dataset)[colnames(dataset) == "3SsnPorch"] <- "ThreeSeasonPorchArea"
```
## Data Coherence
We also need to check the logic in the dataset to make sure the data makes sense. We will enumerate facts coming from the code book and from logic to detect anomalies in this dataset.
**1. The feature 'FirstFloorArea' must not have an area of 0 ft². Otherwise, there would be no first floor, thus no stories at all, and therefore no house.**
The minimum area of the first floor is `r min(dataset$FirstFloorArea)` ft². Looking at the features 'HouseStyle' and 'MSSubClass' in the code book, there is neither an NA value nor any other value indicating that a house has no story. Indeed, we have `r length(dataset$HouseStyle[is.na(dataset$HouseStyle)])` NA values for 'HouseStyle' and `r length(dataset$MSSubClass[is.na(dataset$MSSubClass)])` NA values for 'MSSubClass'.
**2. The HouseStyle feature values must match with the values of the feature MSSubClass.**
To check this fact, we map the values of 'HouseStyle' to the values of 'MSSubClass'. We have to be careful with 'SLvl' and 'SFoyer' because they can be used with all types; since we are not sure about them, we validate only against the values we know to be mismatched.
| HouseStyle | MSSubClass |
| -----------| ---------- |
| 1Story | 20 |
| 1Story | 30 |
| 1Story | 40 |
| 1Story | 120 |
| 1.5Fin | 50 |
| 1.5Unf | 45 |
| 2Story | 60 |
| 2Story | 70 |
| 2Story | 160 |
| 2.5Fin | 75 |
| 2.5Unf | 75 |
| SFoyer | 85 |
| SFoyer | 180 |
| SLvl | 80 |
| SLvl | 180 |
```{r echo=FALSE}
cols <- c("Id", "HouseStyle", "BldgType", "MSSubClass")
houses <- dataset[HouseStyle %nin% c("SFoyer", "SLvl"), ]
rows <- houses[HouseStyle != "1Story" & MSSubClass %in% c(20, 30, 40, 120), cols, with = FALSE]
rows <- bind_rows(rows, houses[HouseStyle != "1.5Fin" & MSSubClass == 50, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "1.5Unf" & MSSubClass == 45, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2Story" & MSSubClass %in% c(60, 70, 160), cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2.5Fin" & MSSubClass == 75, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2.5Unf" & MSSubClass == 75, cols, with = FALSE])
print(rows)
```
**3. Per the code book, values of MSSubClass for 1 and 2 stories must match with the YearBuilt.**
To verify this fact, we compare the values of 'MSSubClass' with the values of 'YearBuilt'. The fact is violated when the year built is before 1946 while 'MSSubClass' is 20, 60, 120 or 160, and also when the year built is 1946 or newer while 'MSSubClass' is 30 or 70.
```{r echo=FALSE}
cols <- c("Id", "YearBuilt", "MSSubClass", "BldgType", "HouseStyle")
rows <- dataset[YearBuilt < 1946 & MSSubClass %in% c(20, 60, 120, 160), cols, with = FALSE]
rows <- bind_rows(rows, dataset[YearBuilt >= 1946 & MSSubClass %in% c(30, 70), cols, with = FALSE])
print(rows)
```
These houses represent `r nrow(rows) / nrow(dataset) * 100`% of the dataset.
**4. If there is no garage with the house, then GarageType = NA, GarageYrBlt = NA, GarageFinish = NA, GarageCars = 0, GarageArea = 0, GarageQual = NA and GarageCond = NA.**
We need to get all houses where 'GarageType' is NA and check whether this fact's conditions are respected.
```{r echo=FALSE}
cols <- c("Id", "GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond", "GarageArea", "GarageCars")
garage.none <- dataset[is.na(GarageType) & is.na(GarageYrBlt) & is.na(GarageFinish) & is.na(GarageQual) & is.na(GarageCond) & GarageArea == 0 & GarageCars == 0, cols, with = FALSE]
garage <- dataset[!is.na(GarageType) & !is.na(GarageYrBlt) & !is.na(GarageFinish) & !is.na(GarageQual) & !is.na(GarageCond) & GarageArea > 0 & GarageCars > 0, cols, with = FALSE]
garage <- setdiff(dataset[, cols, with = FALSE], bind_rows(garage.none, garage))
print(garage)
garage <- garage[is.na(GarageQual) & is.na(GarageCond) & is.na(GarageArea), cols, with = FALSE]
dataset <- dataset[Id %in% garage$Id, GarageType := NA]
```
**5. If there is no basement in the house, then TotalBsmtSF = 0, BsmtUnfSF = 0, BsmtFinSF2 = 0, BsmtHalfBath = 0, BsmtFullBath = 0, BsmtQual = NA and BsmtCond = NA, BsmtExposure = NA, BsmtFinType1 = NA, BsmtFinSF1 = 0, BsmtFinType2 = NA.**
```{r echo=FALSE}
cols <- c("Id", "TotalBsmtSF", "BsmtUnfSF", "BsmtFinSF2", "BsmtHalfBath", "BsmtFullBath", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2")
basement.none <- dataset[is.na(BsmtQual) & is.na(BsmtCond) & is.na(BsmtExposure) & is.na(BsmtFinType1) & is.na(BsmtFinType2) & TotalBsmtSF == 0 & BsmtUnfSF == 0 & BsmtFinSF1 == 0 & BsmtFinSF2 == 0 & BsmtHalfBath == 0 & BsmtFullBath == 0, cols, with = FALSE]
basement <- dataset[!is.na(BsmtQual) & !is.na(BsmtCond) & TotalBsmtSF > 0, cols, with = FALSE]
basement <- setdiff(subset(dataset, select = cols), bind_rows(basement.none, basement))
print(basement)
## For houses with no basement, we replace area features having NA by 0.
basement.none <- basement[is.na(BsmtQual) & is.na(BsmtCond) & TotalBsmtSF == 0, cols, with = FALSE]
print(basement.none)
dataset <- dataset[Id %in% basement.none$Id, `:=`(BsmtHalfBath = 0, BsmtFullBath = 0)]
```
**6. Per the code book, if there are no fireplaces, then FireplaceQu = NA and Fireplaces = 0.**
```{r echo=FALSE}
dataset[Fireplaces > 0 & is.na(FireplaceQu), c("Id", "Fireplaces", "FireplaceQu"), with = FALSE]
dataset[Fireplaces == 0 & !is.na(FireplaceQu), c("Id", "Fireplaces", "FireplaceQu"), with = FALSE]
```
**7. Per the code book, if there are no Pool, then PoolQC = NA and PoolArea = 0.**
```{r echo=FALSE}
dataset[PoolArea > 0 & is.na(PoolQC), c("Id", "PoolArea", "PoolQC"), with = FALSE]
dataset[PoolArea == 0 & !is.na(PoolQC), c("Id", "PoolArea", "PoolQC"), with = FALSE]
```
**8. Per the code book, the Remodel year is the same as the year built if no remodeling or additions. Then, it is true to say that YearRemodAdd $\geq$ YearBuilt.**
Houses violating this fact are detected by filtering those whose remodel year is earlier than the year built. For these houses, we can also check the year the garage was built, if any, against the house's year built and remodel year.
```{r echo=FALSE}
dataset[YearRemodAdd < YearBuilt, c("Id", "YearBuilt", "YearRemodAdd", "GarageYrBlt"), with = FALSE]
```
```{r}
dataset <- dataset[which(YearRemodAdd < YearBuilt), YearRemodAdd := YearBuilt]
```
**9. We verify that if the Garage Cars is 0, then the Garage Area is also 0. The converse is true since a Garage area of 0 means that there is no garage, thus no cars.**
```{r echo=FALSE}
dataset[GarageArea == 0 & GarageCars > 0, c("Id", "GarageArea", "GarageCars"), with = FALSE]
```
**10. BsmtCond = NA if and only if BsmtQual = NA; per the code book, both mean there is no basement.**
```{r echo=FALSE}
dataset[is.na(BsmtCond) & !is.na(BsmtQual), c("Id", "BsmtCond", "BsmtQual"), with = FALSE]
dataset[!is.na(BsmtCond) & is.na(BsmtQual), c("Id", "BsmtCond", "BsmtQual"), with = FALSE]
```
```{r}
dataset <- dataset[which(!is.na(BsmtCond) & is.na(BsmtQual)), BsmtQual := BsmtCond]
dataset <- dataset[which(is.na(BsmtCond) & !is.na(BsmtQual)), BsmtCond := BsmtQual]
```
**11. We have MasVnrType = None if and only if MasVnrArea = 0 ft².**
There are two cases where it is hard to tell which value is right.
* Case when MasVnrType = 'None' and MasVnrArea $\neq 0$ ft²
* Case when MasVnrType $\neq$ 'None' and MasVnrArea $= 0$ ft²
```{r echo=FALSE}
dataset[MasVnrType == "None" & MasVnrArea > 0, c("Id", "MasVnrType", "MasVnrArea"), with = FALSE]
dataset[MasVnrType != "None" & MasVnrArea == 0, c("Id", "MasVnrType", "MasVnrArea"), with = FALSE]
```
```{r}
dataset <- dataset[which(MasVnrType != "None" & MasVnrArea == 0), MasVnrType := "None"]
dataset <- dataset[which(MasVnrType == "None" & MasVnrArea <= 10), MasVnrArea := 0]
```
## Missing Values
Per the code book of this dataset, NA values generally mean 'No' or 'None', and they are used only for some categorical features. The NA values that are not covered by the code book will be explained case by case. The same applies to empty strings, which will be replaced by NA.
* Case when NA means 'None' or 'No'
* Case when an integer feature has 0 and NA as possible values
* Case when a numeric value has 0 and NA as possible values
* Case when a category is NA where NA means 'No', and the numeric feature is not zero
* Case when a category is not NA where NA means 'No', and the numeric feature is NA where 0 has a clear meaning
Features having NA values where NA means 'None' or 'No' will be replaced by 0.
However, some NA values can be resolved by analyzing the values of strongly related features. For example, the integer features GarageCars and GarageArea have NA values. At first glance, we cannot state that NA means 0, since 0 already has a meaning; it could stand for 'no information'. But looking at the GarageQual and GarageCond features, we notice that their value is NA as well, which means this house has no garage per the code book. Therefore, we will replace NA by 0 for GarageArea and GarageCars.
For a feature like "BsmtFullBath", the value 0 means there is no full bathroom in the basement. Thus, we cannot replace NA by 0 if there is a basement. If the house has no basement, it has no full bathroom in the basement either, and only in that case can we replace NA by 0.
For some numeric features, we expect the value 0 to mean the same thing as NA: a garage area of 0, for example, means there is no garage with this house. However, when 0 represents an amount of money or a geometric measure (e.g. an area), it is a real 0.
For "year" features (e.g. GarageYrBlt), a NA value could be replaced by 0; a year 0 is theoretically possible but clearly impossible in our context. However, using 0 would lower the mean and add noise to the data, since the gap between the minimum year, `r min(dataset$GarageYrBlt, na.rm = TRUE)`, and zero is large.
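Given the noise concern, one alternative (our own suggestion, not prescribed by the code book, shown here only as a sketch and not applied in this analysis) is to fall back on the house's own build year:

```r
## Sketch only, not applied in this analysis: keep GarageYrBlt on the year
## scale by substituting the house's YearBuilt for missing garage years.
dataset <- dataset[is.na(GarageYrBlt), GarageYrBlt := YearBuilt]
```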
Another case is when a feature uses NA to indicate that the information is missing. For example, the feature "KitchenQual" is not supposed to take the value NA per the code book. If NA is used, it really means "no information" and we cannot replace it by 0. Normally, we would exclude such a house from the dataset, but this house comes from the test set, so we must not remove it.
For those cases, we need to impute the missing values. We could replace the NA values of a feature by its mean, but since we have many other features, it is more accurate to predict the replacement value from them.
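As a minimal sketch of the simple option, mean imputation with `impute` from the Hmisc package (loaded above) looks like this on a toy vector; the same call applies to a column such as MasVnrArea.

```{r}
## NA entries are replaced by the mean of the observed values.
x <- c(120, NA, 340, 0, NA)
x.imputed <- impute(x, mean)
as.numeric(x.imputed)
```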
```{r echo=FALSE}
sapply(dataset, function(x) sum(is.na(x)))
#md.pattern(dataset)
aggr(dataset,
col = c('navyblue', 'red'),
numbers = TRUE,
sortVars = TRUE,
labels = names(dataset),
cex.axis = .7,
gap = 3,
ylab = c("Histogram of missing data", "Pattern"))
```
For the masonry veneer type (MasVnrType) feature, the value "None" means that the house has no masonry veneer per the code book. If a house has the value NA, it means the information is missing.
Note that it is possible to have information on the masonry veneer area but not on the type (and vice versa). In that case, we cannot deduce with certainty which value should replace NA. We cannot replace the area's NA by 0, because 0 means *None*, which is a valid choice. The best choice we can make is to replace the NA value by the mean of the feature.
<!------------------------------------------------------------ANOMALIES DETECTION------------------------------------------------------------------------------>
# Anomaly Detection
In this section, the objective is to detect houses or features carrying wrong or illogical information, and to fix them where possible.
We define a house as an anomaly if $\left\lVert Y - P \right\rVert > \epsilon$, where $Y = (x, y)$ is the point on the linear regression model and $P = (x, z)$ is the observed point; here $x$ is the ground living area, $y$ and $z$ are sale prices, and $\epsilon > 0$ is the threshold.
Regarding the overall quality, the sale price and the ground living area, we expect the sale price to increase as the overall quality and the ground living area increase. This is verified in the exploratory section.
Taking houses with an overall quality of 10 and a ground living area greater than 4000 ft², the sale price should be among the highest. A sale price more than \$240,000 above the regression model's estimate may still be plausible, but one that far below it is exceptional, so we treat it as an anomaly.
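The $\epsilon$-rule above can be sketched as a small helper; `epsilon` is the threshold from the definition, and the model is assumed to be fitted with a `data` argument so that `predict` accepts new points.

```{r}
## Sketch of the anomaly rule: a house is an anomaly when the absolute gap
## between its observed price z and the fitted price at area x exceeds epsilon.
is.anomaly <- function(model, x, z, epsilon) {
  fitted.price <- predict(model, newdata = data.frame(GrLivArea = x))
  abs(z - fitted.price) > epsilon
}
## e.g. model <- lm(SalePrice ~ GrLivArea, data = train)
##      is.anomaly(model, train$GrLivArea, train$SalePrice, 240000)
```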
```{r echo=FALSE}
anomalies <- train[OverallQual == 10 & GrLivArea > 4000, c("Id", "GrLivArea", "SalePrice"), with = FALSE]
print(anomalies)
model <- lm(formula = train$SalePrice ~ train$GrLivArea)
price.eq <- coef(model)["(Intercept)"] + coef(model)["train$GrLivArea"] * anomalies$GrLivArea
prices <- data.table(Id = anomalies$Id,
ApproxPrice = price.eq,
SalePrice = anomalies$SalePrice,
PriceDifference = abs(anomalies$SalePrice - price.eq))
print(prices)
ids <- prices$Id[prices$PriceDifference > 240000]
dataset <- dataset[Id %nin% ids, ]
```
After visualization, we detect another anomaly concerning the garage year built. Since it cannot be later than `r max(dataset$YrSold)`, years greater than that will be treated as anomalies.
```{r echo=FALSE}
dataset[GarageYrBlt > max(YrSold), c("Id", "GarageYrBlt", "YearBuilt", "YrSold"), with = FALSE]
```
```{r}
dataset <- dataset[GarageYrBlt > max(YrSold), GarageYrBlt := YrSold]
```
<!------------------------------------------------------------DATA EXPLORATORY------------------------------------------------------------------------------>
# Exploratory Data Analysis
The objective is to visualize and understand the relationships between the features of the dataset we use to solve the problem. We will also compare the changes we make to this dataset to validate whether they have a significant influence on the sale price.
## Features
Here is the list of features with their type.
```{r echo=FALSE}
str(dataset)
train <- dataset[SalePrice > -1, ]
test <- dataset[SalePrice == -1, ]
```
We now plot the correlation between the numeric features of the train set.
```{r echo=FALSE}
features.numeric <- names(train)[which(sapply(train, is.numeric))]
train.numeric <- train[, features.numeric, with = FALSE]
correlations <- cor(na.omit(train.numeric))
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)
correlations <- correlations[row_indic, row_indic]
corrplot(correlations, method = "pie")
sale.price <- data.frame(SalePriceCorrelation = sort(correlations[, "SalePrice"], decreasing = TRUE))
print(sale.price)
```
We note that some features are strongly correlated with the sale price or other features. We will produce plots for each of them to get insights.
## Dependent vs Independent Features
With the current features in this dataset, we have to check which features depend on other features and which are independent. At first glance, features representing totals and overalls seem dependent:
* $GrLivArea = FirstFloorArea + SecondFloorArea + LowQualFinSF$
* $TotalBsmtSF = BsmtUnfSF + BsmtFinSF1 + BsmtFinSF2$
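A quick sanity check, sketched below with the renamed columns, confirms whether these identities hold exactly (the basement check drops the few rows whose basement areas are NA in the test set):

```{r}
all(dataset$GrLivArea ==
      dataset$FirstFloorArea + dataset$SecondFloorArea + dataset$LowQualFinSF)
all(dataset$TotalBsmtSF ==
      dataset$BsmtUnfSF + dataset$BsmtFinSF1 + dataset$BsmtFinSF2, na.rm = TRUE)
```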
## Sale Price
Ideally, the sale price would follow a normal distribution. Its actual distribution is right-skewed, so we normalize it by taking its logarithm.
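The asymmetry can be quantified with `skewness` from the moments package (loaded earlier); values near 0 indicate a roughly symmetric distribution.

```{r}
skewness(train$SalePrice)
skewness(log(train$SalePrice + 1))
```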
```{r echo=FALSE}
local({
plot.saleprice <- ggplot(train, aes(x = SalePrice)) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the Sale Price") +
labs(x = "Sale Price ($)")
plot.logsaleprice <- ggplot(train, aes(x = log(SalePrice + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of Sale Price") +
labs(x = "Log Sale Price (log$)")
grid.arrange(plot.saleprice, plot.logsaleprice, ncol = 2)
})
summary(train$SalePrice)
```
## Overall Quality Rate
The overall quality rate is the most correlated feature to the sale price as seen previously. We look at the average sale price for each overall quality rate and try to figure out an equation that will best approximate our data.
```{r echo=FALSE}
local({
data <- train[, list(MeanSalePrice = mean(SalePrice)), by = OverallQual]
data <- setorder(data, OverallQual)
print(data)
ggplot(data, aes(x = OverallQual, y = MeanSalePrice)) +
geom_line(aes(colour = "Right")) +
geom_line(aes(x = OverallQual,
y = 939113/180*OverallQual*OverallQual - 2561483/180*OverallQual + 354979/6,
colour = "Approx.")) +
ggtitle("Distribution of Average Sale Price in function of the overall quality rate") +
labs(y = "Average sale price ($)", x = "Overall Quality Rate") +
scale_colour_manual("Legend",
breaks = c("Approx.", "Right"),
values = c("red", "black"))
})
```
Note that the equation used to approximate is a parabola where the equation has been built from 3 points (OverallQual, MeanSalePrice) where the overall quality rates chosen are 1, 6 and 10 with their corresponding average sale price. The equation used to approximate the polyline is $M(Q) = \dfrac{939113}{180}Q^2-\dfrac{2561483}{180}Q+\dfrac{354979}{6}$ where $Q$ is the overall quality rate and $M(Q)$ is the mean sale price in function of $Q$.
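These coefficients can also be recovered programmatically by solving the 3×3 system through the three chosen points; `m1`, `m6` and `m10` below are placeholders for the corresponding average sale prices.

```{r}
## Solve a*Q^2 + b*Q + c = M(Q) through three points (q[i], m[i]).
fit.parabola <- function(q, m) {
  A <- cbind(q^2, q, 1)
  setNames(solve(A, m), c("a", "b", "c"))
}
## e.g. fit.parabola(c(1, 6, 10), c(m1, m6, m10))
```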
Here is a frequency table and a histogram of these frequencies.
```{r echo=FALSE}
local({
table.freq <- table(dataset$OverallQual)
print(cbind(Freq = table.freq,
Cumul = cumsum(table.freq),
Relative = prop.table(table.freq)))
print(ggplot(dataset, aes(x = OverallQual)) +
geom_bar(aes(y = ..count..)) +
scale_x_continuous(breaks = seq(min(dataset$OverallQual), max(dataset$OverallQual), by = 1)) +
geom_text(aes(y = ..count.. ,
label = scales::percent(..count.. / sum(..count..))),
stat = "count",
vjust = -0.25) +
ggtitle("Percentage of Houses by Overall Quality") +
labs(y = "Percentage of Houses", x = "Overall Quality"))
})
```
## Above Ground Living Area
This feature is the second most correlated with the sale price per the correlation plot.
```{r echo=FALSE}
local({
plot.grlivarea <- ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price in function \n of the Above Grade Living Area") +
labs(x = "Above Grade Living Area (ft²)", y = "Sale Price ($)")
plot.loggrlivarea <- ggplot(train, aes(x = log(GrLivArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the GrLivArea") +
labs(x = "log(GrLivArea + 1) (log(ft²))")
plot.rooms <- ggplot(train, aes(x = TotRmsAbvGrd, y = GrLivArea)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Above Grade Living Area \n in function of the total rooms above grade") +
labs(x = "Total rooms above grade", y = "Above grade living area (ft²)")
grid.arrange(plot.grlivarea, plot.loggrlivarea, plot.rooms, ncol = 2, nrow = 2)
})
```
## Garage Cars
```{r echo=FALSE}
local({
data <- train[, list(MinGarageArea = min(GarageArea),
MeanGarageArea = mean(GarageArea),
MaxGarageArea = max(GarageArea),
MeanSalePrice = mean(SalePrice)), by = GarageCars]
data <- setorder(data, GarageCars)
print(data)
plot.garage.price <- ggplot(data, aes(x = GarageCars, y = MeanSalePrice)) +
geom_line() +
ggtitle("Distribution of Mean Sale Prices \n in function of Garage Cars") +
labs(x = "Garage Cars", y = "Average sale price ($)")
plot.garage.cars <- ggplot(data, aes(x = GarageCars, y = MeanGarageArea)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Mean Garage Area \n as a function of Garage Cars") +
labs(x = "Garage Cars", y = "Mean of Garage Area (ft²)")
grid.arrange(plot.garage.price, plot.garage.cars, ncol = 2)
})
```
Here is the list of houses in the dataset whose garage can hold four or more cars.
```{r echo=FALSE}
dataset[GarageCars >= 4, c("Id", "OverallQual", "GarageCars", "GarageArea", "SalePrice"), with = FALSE]
```
## Garage Area
```{r echo=FALSE}
local({
plot.garagearea <- ggplot(train, aes(x = GarageArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price \n as a function of the Garage Area") +
labs(x = "Garage Area (ft²)", y = "Sale Price ($)")
plot.loggaragearea <- ggplot(train, aes(x = log(GarageArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log \n of Garage Area") +
labs(x = "log(GarageArea + 1) (log(ft²))")
grid.arrange(plot.garagearea, plot.loggaragearea, ncol = 2)
})
```
## Total Basement Area
```{r echo=FALSE}
local({
plot.basementarea <- ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price as a function \n of the Total Basement Area") +
labs(x = "Total Basement Area (ft²)", y = "Sale Price ($)")
plot.logbasementarea <- ggplot(train, aes(x = log(TotalBsmtSF + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of \n TotalBsmtSF") +
labs(x = "log(TotalBsmtSF + 1) (log(ft²))")
grid.arrange(plot.basementarea, plot.logbasementarea, ncol = 2)
})
```
## First Floor Area
```{r echo=FALSE}
local({
plot.firstfloorarea <- ggplot(train, aes(x = FirstFloorArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price as a function \n of the First Floor Area") +
labs(x = "First Floor Area (ft²)", y = "Sale Price ($)")
plot.logfirstfloorarea <- ggplot(train, aes(x = log(FirstFloorArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of \n FirstFloorArea") +
labs(x = "log(FirstFloorArea + 1) (log(ft²))")
grid.arrange(plot.firstfloorarea, plot.logfirstfloorarea, ncol = 2)
})
```
## Year Built
We compare the house year built and the garage year built.
```{r echo=FALSE}
local({
ggplot(dataset, aes(x = YearBuilt, y = GarageYrBlt)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Garage Year Built as a function \n of the House Year Built") +
labs(x = "House Year Built", y = "Garage Year Built")
})
```
We can see that a few houses were built many years after their garage. A plausible explanation is that a detached garage or workshop stood on the lot first, and a house was built beside it much later.
```{r echo=FALSE}
dataset[GarageYrBlt < YearBuilt, c("Id", "GarageYrBlt", "YearBuilt", "GarageType"), with = FALSE]
```
<!------------------------------------------------------------FEATURE ENGINEERING------------------------------------------------------------------------------>
# Feature Engineering
In this section, we create, modify and delete features to help the prediction. We will impute missing values and rescale features such as the quality and condition ones. Then we will check for skewed features and normalize them.
## Feature Replacement
The categorical features will be encoded as 1-based numeric values, except that values meaning 'No' or 'None' will be set to 0. Since the feature 'MasVnrType' has both 'None' and NA values, we replace 'None' by 0 here, and the NA values will be replaced by the median in the missing values imputation section. There are two reasons behind these replacements:
1. Values meaning 'Empty' or 'Nothing' are logically equivalent to zero.
2. We may want to convert the dataset to a sparse matrix to save memory, and a 0-based encoding makes the sparse representation more compact.
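As a quick illustration of the second point, the Matrix package stores only the non-zero entries of a sparse matrix (the toy matrix below is hypothetical, not part of the dataset):

```{r}
library(Matrix)

## A 2x3 matrix containing four zeros.
m <- Matrix(c(0, 0, 5, 0, 2, 0), nrow = 2, sparse = TRUE)

## The @x slot holds only the 2 non-zero values; the zeros cost no storage.
length(m@x)
```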
```{r}
## Replace by NA or NaN first. Otherwise, the factor-to-numeric conversion below would map the
## value 0 to the 1-based encoding as well. NA and NaN are not affected by that conversion.
dataset <- dataset[MasVnrType == "None", MasVnrType := NaN]
dataset <- dataset[CentralAir == "N", CentralAir := NA]
## Transform all categorical features from string to numeric 1-base.
features.string <- which(sapply(dataset, function(x) is.character(x)))
for(feature in features.string)
{
set(dataset, i = NULL, j = feature, value = as.numeric(factor(dataset[[feature]])))
}
dataset <- dataset[is.nan(MasVnrType), MasVnrType := 0]
```
## Missing Values Imputation
Features whose NA values mean 'None' or 'No' have those values replaced by 0, as specified in the previous section.
```{r}
dataset <- dataset[is.na(Alley), Alley := 0]
dataset <- dataset[is.na(BsmtQual), BsmtQual := 0]
dataset <- dataset[is.na(BsmtCond), BsmtCond := 0]
dataset <- dataset[is.na(BsmtExposure), BsmtExposure := 0]
dataset <- dataset[is.na(BsmtFinType1), BsmtFinType1 := 0]
dataset <- dataset[is.na(BsmtFinType2), BsmtFinType2 := 0]
dataset <- dataset[is.na(FireplaceQu), FireplaceQu := 0]
dataset <- dataset[is.na(GarageType), GarageType := 0]
dataset <- dataset[is.na(GarageFinish), GarageFinish := 0]
dataset <- dataset[is.na(GarageQual), GarageQual := 0]
dataset <- dataset[is.na(GarageCond), GarageCond := 0]
dataset <- dataset[is.na(PoolQC), PoolQC := 0]
dataset <- dataset[is.na(Fence), Fence := 0]
dataset <- dataset[is.na(MiscFeature), MiscFeature := 0]
dataset <- dataset[is.na(CentralAir), CentralAir := 0]
```
All other NA values, which require more than replacement by a fixed constant, will be replaced by either the mean or the median: features containing real values have their NAs replaced by the mean, while features with integer values have their NAs replaced by the median.
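The `impute` calls below come from the Hmisc package. As a rough sketch of what such a helper does (a simplified stand-in, not Hmisc's actual implementation), it replaces NA values with a summary statistic of the non-missing values:

```{r}
## Simplified imputation helper: replace NAs by fun applied to the rest.
impute.constant <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

impute.constant(c(1, NA, 3), mean)       ## the NA becomes 2
impute.constant(c(1, NA, 3, 3), median)  ## the NA becomes 3
```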
```{r}
dataset$MSZoning <- impute(dataset$MSZoning, median)
dataset$LotFrontage <- impute(dataset$LotFrontage, mean)
dataset$Utilities <- impute(dataset$Utilities, median)
dataset$Exterior1st <- impute(dataset$Exterior1st, median)
dataset$Exterior2nd <- impute(dataset$Exterior2nd, median)
dataset$MasVnrType <- impute(dataset$MasVnrType, median)
dataset$MasVnrArea <- impute(dataset$MasVnrArea, mean)
dataset$BsmtFinSF1 <- impute(dataset$BsmtFinSF1, mean)
dataset$BsmtFinSF2 <- impute(dataset$BsmtFinSF2, mean)
dataset$BsmtUnfSF <- impute(dataset$BsmtUnfSF, mean)
dataset <- dataset[is.na(TotalBsmtSF), TotalBsmtSF := BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF]
dataset$Electrical <- impute(dataset$Electrical, median)
dataset$BsmtFullBath <- impute(dataset$BsmtFullBath, median)
dataset$BsmtHalfBath <- impute(dataset$BsmtHalfBath, median)
dataset$KitchenQual <- impute(dataset$KitchenQual, median)
dataset$Functional <- impute(dataset$Functional, median)
dataset$GarageYrBlt <- impute(dataset$GarageYrBlt, median)
dataset$GarageCars <- impute(dataset$GarageCars, median)
dataset$GarageArea <- impute(dataset$GarageArea, mean)
dataset$SaleType <- impute(dataset$SaleType, median)
# imputation.start <- mice(dataset, maxit = 0, print = FALSE)
# method <- imputation.start$method
# predictors <- imputation.start$predictorMatrix
#
# ## Exclude from prediction since these features will not help.
# predictors[, c("SalePrice")] <- 0
#
# imputed <- mice(dataset,
# method = "mean",
# predictorMatrix = predictors,
# m = 5,
# print = FALSE)
#
# dataset <- complete(imputed, 1)
#
# densityplot(imputed)
```
## Feature Scaling
The quality and condition features are not on the same scale as the most important feature, the overall quality. Indeed, the overall quality takes integer values from 1 to 10, while the other quality features were previously mapped to values from 0 to 4 or 5. If $Q$ represents any quality feature except the overall quality, the scaling function is $f(Q) = 2Q$ where $Q \in \{0, 1, 2, 3, 4, 5\}$.
```{r}
dataset$ExterQual <- dataset$ExterQual * 2
dataset$FireplaceQu <- dataset$FireplaceQu * 2
dataset$BsmtQual <- dataset$BsmtQual * 2
dataset$KitchenQual <- dataset$KitchenQual * 2
dataset$GarageQual <- dataset$GarageQual * 2
dataset$BsmtCond <- dataset$BsmtCond * 2
dataset$GarageCond <- dataset$GarageCond * 2
dataset$ExterCond <- dataset$ExterCond * 2
```
For Pool, Heating and Fence quality / condition features, we apply the function $f(Q) = 2.5Q$ where $Q \in \{0, 1, 2, 3, 4\}$.
```{r}
dataset$PoolQC <- dataset$PoolQC * 2.5
dataset$HeatingQC <- dataset$HeatingQC * 2.5
dataset$Fence <- dataset$Fence * 2.5
```
All area features are given in square feet, thus no need to convert any of them.
## Skewed Features
We transform skewed features so that their distributions are closer to normal. We use the function $f(A) = \log{(A + 1)}$, where $A \in \mathbb{R}_+^n$ is a vector representing a feature of the dataset and $n$ is the number of values in this vector. We add 1 to avoid $\log{0}$, which is undefined.
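As a side note, base R provides `log1p`, which computes the same quantity as `log(A + 1)` with better accuracy for values near zero; a quick sanity check on toy values:

```{r}
a <- c(0, 10, 100)
all.equal(log(a + 1), log1p(a))
```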
We compute the skewness of every feature, keep the features above a skewness threshold, and exclude the categorical features from the transformation.
```{r echo=FALSE}
skewed <- apply(dataset, 2, function(feature) skewness(feature))
print(skewed)
skewed <- setdiff(names(skewed[skewed > 0.8]),
c("SalePrice", "MSSubClass", "FirstFloorArea", "BsmtFinSF2", "Utilities", "Condition1",
"Condition2", "BldgType", "RoofStyle", "RoofMatl", "Heating")) #, "PoolQC", "Fence", "Alley"))
print(skewed)
```
Let's apply the formula to the remaining features.
```{r}
indices <- which(colnames(dataset) %in% skewed)
for(index in indices)
{
dataset[[index]] <- log(dataset[[index]] + 1)
}
```
## Features Construction
The objective is to add features that will be good predictors for the models created in the Models Building section. Clients may ask:
* How old is the house? Subtract the year the house was built from the year it was sold.
* How many years since the house was remodeled? Subtract the remodeling year from the year the house was sold.
* How many bathrooms are there in the house, including the basement? Sum the basement bathrooms and the ones above grade.
* What is the total house area? Add the basement area to the above ground living area.
```{r}
dataset <- dataset %>%
mutate(YearsSinceBuilt = YrSold - YearBuilt) %>%
mutate(YearsSinceRemodeled = YrSold - YearRemodAdd) %>%
mutate(OverallQualExp = exp(OverallQual) - 1) %>%
mutate(TotalBaths = FullBath + HalfBath + BsmtFullBath + BsmtHalfBath) %>%
mutate(TotalArea = TotalBsmtSF + GrLivArea)
```
## Noisy Features
We remove features that add noise to the predictions. The 3 models used in the Models Building section each report feature importance; the first elimination method is to take the intersection of the least important features across the 3 models.
The second method is to eliminate features with a high percentage of NA values, as determined in the dataset cleaning section. This assumes that buyers will not really look at the fence, the alley, or the pool quality and condition.
Finally, we remove the Id feature since it is only a unique identifier of a house and should have no predictive value for the sale price.
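The intersection of least important features mentioned above can be sketched as follows; the three vectors here are hypothetical placeholders, which in practice would be built from the importance outputs of the 3 models in the Models Building section:

```{r}
## Hypothetical least-important features reported by each model.
least.xgb   <- c("PoolQC", "ThreeSeasonPorchArea", "Utilities")
least.rf    <- c("PoolQC", "ThreeSeasonPorchArea", "MiscVal")
least.lasso <- c("ThreeSeasonPorchArea", "PoolQC", "Street")

## Features that all 3 models agree are noisy.
Reduce(intersect, list(least.xgb, least.rf, least.lasso))
```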
```{r echo=FALSE}
dataset$ThreeSeasonPorchArea <- NULL
dataset$PoolQC <- NULL
#dataset$Alley <- NULL
#dataset$Fence <- NULL
test.id <- test$Id
dataset$Id <- NULL
```
<!------------------------------------------------------------MODELS BUILDING------------------------------------------------------------------------------>
# Models Building
In this section, we train different models and predict the sale price of each house. We will use the extreme gradient boosting trees, random forest and LASSO algorithms to build models.
These algorithms need two inputs: the dataset as a matrix and the real sale prices from the training set. Since many NA and 'None' values have been replaced by 0, a sparse matrix representation of the dataset is more memory-efficient.
```{r echo=FALSE}
## Need them for the random forest only.
train.original <- dataset[dataset$SalePrice != -1, ]
test.original <- dataset[dataset$SalePrice == -1, ]
## Keep the sale price in a numeric vector since this is not a predictor.
sale.price <- dataset$SalePrice[dataset$SalePrice != -1]
dataset$SalePrice <- NULL
dataset.zeros <- sum(dataset == 0L)
dataset.cells <- nrow(dataset) * ncol(dataset)
cat("Dataset contains", dataset.zeros, "zeros which is", dataset.zeros / dataset.cells * 100, "% of the dataset.")
## Transform the dataset to a sparse matrix.
dataset <- sparse.model.matrix(~ ., data = dataset)
train <- dataset[1:nrow(train), ]
test <- dataset[(nrow(train)+1) : nrow(dataset), ]
```
## Extreme Gradient Boosted Regression Trees
We perform a 10-fold cross-validation to find the optimal number of trees and the RMSE score, the metric used to measure the accuracy of our model. The training set is randomly split into 10 folds, each containing `r as.integer(nrow(train) / 10)` observations (houses).
For each boosting round, we average the 10 error estimates to obtain a more robust estimate of the true prediction error. Doing this across all rounds gives the optimal number of trees to use for the test set.
We also display 2 curves showing the progression of the mean test and train RMSE. The vertical dotted line marks the optimal number of trees. This plot shows whether the model overfits or underfits.
```{r}
cv.nfolds <- 10
cv.nrounds <- 400
sale.price.log <- log(sale.price + 1)
train.matrix <- xgb.DMatrix(train, label = sale.price.log)
param <- list(objective = "reg:linear",
eta = 0.12,
subsample = 0.75,
colsample_bytree = 0.75,
min_child_weight = 2,
max_depth = 2)
model.cv <- xgb.cv(data = train.matrix,
nfold = cv.nfolds,
param = param,
nrounds = cv.nrounds,
verbose = 0)
model.cv$names <- as.integer(rownames(model.cv))
best <- model.cv[model.cv$test.rmse.mean == min(model.cv$test.rmse.mean), ]
cv.plot.title <- paste("Training RMSE using", cv.nfolds, "folds CV")
print(ggplot(model.cv, aes(x = names)) +
geom_line(aes(y = test.rmse.mean, colour = "test")) +
geom_line(aes(y = train.rmse.mean, colour = "train")) +
geom_vline(xintercept = best$names, linetype = "dotted") +
ggtitle(cv.plot.title) +
xlab("Number of trees") +
ylab("RMSE"))
print(model.cv)
cat("\nOptimal testing set RMSE score:", best$test.rmse.mean)
cat("\nAssociated training set RMSE score:", best$train.rmse.mean)
cat("\nInterval testing set RMSE score: [", best$test.rmse.mean - best$test.rmse.std, ",", best$test.rmse.mean + best$test.rmse.std, "]")
cat("\nDifference between optimal training and testing sets RMSE:", abs(best$train.rmse.mean - best$test.rmse.mean))
cat("\nOptimal number of trees:", best$names)
```
Using the optimal number of trees given by the cross-validation, we train the final model on the training set and predict on the test set.
```{r}
nrounds <- as.integer(best$names)
model <- xgboost(param = param,
train.matrix,
nrounds = nrounds,
verbose = 0)
test.matrix <- xgb.DMatrix(test)
xgb.prediction.test <- exp(predict(model, test.matrix)) - 1
prediction.train <- predict(model, train.matrix)
# Check which features are the most important.
names <- dimnames(train)[[2]]
importance.matrix <- xgb.importance(names, model = model)
print(importance.matrix)
# Display the 35 most important features.
print(xgb.plot.importance(importance.matrix[1:35]))
rmse <- printRMSEInformation(prediction.train, sale.price)
```
We can see that the model overfits: the cross-validated RMSE on the test folds is `r best$test.rmse.mean`, while the RMSE on the training set is only `r rmse`.
## Random Forest
```{r}
# rf.model <- randomForest(log(SalePrice + 1) ~ .,
# data = train.original,
# importance = TRUE,
# proximity = TRUE,
# ntree = 130,
# do.trace = 5)
#
# plot(rf.model, ylim = c(0, 1))