Analysis.Rmd
1096 lines (787 loc) · 49.1 KB
---
title: "House Prices"
author: "Gabriel Lapointe"
date: "September 18, 2016"
output:
  html_document:
    highlight: pygments
    keep_md: yes
    number_sections: yes
    toc: yes
  pdf_document:
    toc: yes
variant: markdown_github
---
# Requirements
The requirements are taken from [the Kaggle competition page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
## Business Requirement
We have to answer the following question: how do a home's features add up to its price tag?
## Functional Requirement
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this analysis shall predict the final price of each home.
# Data Acquisition
In this section, we will ask questions on the dataset and establish a methodology to solve the problem.
## Data Source
The data is provided by Kaggle and can be found [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
## Dataset Questions
Before we start the exploration of the dataset, we need to write a list of questions about this dataset considering the problem we have to solve.
* How big is the dataset?
* Does the dataset contain 'NA' or missing values? Can we replace them with another value? Why?
* Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?
* What does the data look like and what are the relationships between features if they exist?
* What are the measures used?
* Does the dataset contain abnormal data?
* Can we solve the problem with this dataset?
## Evaluation Metrics
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
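This metric can be sketched in R; the helper name `rmsle` is ours, not part of any competition tooling.

```{r}
## Root mean squared error between log prices; the +1 shift guards against
## a (hypothetical) price of 0 producing -Inf.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
```

On the log scale, over-predicting a \$100,000 house by 10% costs about the same as over-predicting a \$1,000,000 house by 10%.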
## Methodology
In this document, we start by cleaning and exploring the dataset to build the data story behind it. This will give us important insights and answer our questions about the dataset. The next step is feature engineering, which consists of creating, removing or replacing features based on the insights gained during exploration. We will ensure the new dataset is a valid input for each of our prediction models. We will fine-tune each model's parameters by cross-validating on the train set to obtain the optimal parameters. After applying our model to the test set, we will visualize the predictions and explain the results. Finally, we will conclude on the most useful features to fulfill the business objective of this project.
## Loading Dataset
We load 'train.csv' and 'test.csv', then merge them so that cleaning and exploration can be done on the entire dataset.
```{r message=FALSE, warning=FALSE, comment=NA}
library(data.table) # setDT, set
library(dplyr) # select, filter, %>%
library(scales) # Scaling functions used for ggplot
library(gridExtra) # Grid of ggplot to save space
library(ggplot2) # ggplot functions for visualization and exploration
library(caret)
library(corrplot)
library(moments) # For skewness
library(Matrix)
#library(mice) # To replace NA values by a predicted one
library(Hmisc) # To impute features having NA values to replace
library(VIM)
library(randomForest)
library(xgboost)
library(glmnet)
library(microbenchmark) # benchmarking functions
library(knitr) # opts_chunk
setwd("/home/gabriel/Documents/Projects/HousePrices")
set.seed(1234)
source("Dataset.R")
## Remove scientific notation (e.g. E-005).
options(scipen = 999)
## Remove hash symbols when printing results and do not show message or warning everywhere in this document.
opts_chunk$set(message = FALSE,
warning = FALSE,
comment = NA)
'%nin%' <- Negate('%in%')
## Read csv files and ensure NA strings are converted to real NA.
system.time({
na.strings <- c("NA", "", " ")
train <- fread(input = "train.csv",
showProgress = FALSE,
stringsAsFactors = FALSE,
na.strings = na.strings,
header = TRUE)
test <- fread(input = "test.csv",
showProgress = FALSE,
stringsAsFactors = FALSE,
na.strings = na.strings,
header = TRUE)
## Merge the train and test sets in a data.table object.
test$SalePrice <- -1
dataset <- rbindlist(list(train, test), use.names = TRUE)
})
```
| Dataset | File Size (Kb) | # Houses | # Features |
| ------------------ | --------------- | --------------------- | --------------------- |
| train.csv | 460.7 | `r nrow(train)` | `r ncol(train)` |
| test.csv | 451.4 | `r nrow(test)` | `r ncol(test) - 1` |
| **Total(dataset)** | **912.1** | **`r nrow(dataset)`** | **`r ncol(dataset)`** |
These datasets are very small. Each observation (row) is a house; for the houses of the test set, we want to predict the sale price.
<!------------------------------------------------------------DATASET CLEANING------------------------------------------------------------------------------>
# Dataset Cleaning
The objective of this section is to detect all inconsistencies in the dataset and fix as many as possible to gain coherence and accuracy. We have to check the dataset against the possible values given in the code book: no misspelled words and no values outside the code book. Also, all numerical values should be coherent with their description, meaning that their bounds have to be logically correct. Per the code book, none of the categorical features has more than 25 unique values. We will then compare the values mentioned in the code book with the values we have in the dataset. Finally, we have to detect anomalies and determine techniques to replace missing values with the most accurate ones.
```{r echo=FALSE}
sapply(dataset, getUniqueValues)
```
## Feature Names Harmonization
We start by harmonizing the feature names to be coherent with the code book. Comparing manually with the code book's possible codes, the following features have differences:
| Feature | Dataset | CodeBook |
| ------------------ | ------------ | --------------- |
| MSZoning | C (all) | C |
| MSZoning | NA | No corresponding value |
| Alley | Empty string | No corresponding value |
| PoolQC | Empty string | No corresponding value |
| Utilities | NA | No corresponding value |
| Neighborhood | NAmes | Names (likely a typo for NAmes) |
| BldgType | 2fmCon | 2FmCon |
| BldgType | Duplex | Duplx |
| BldgType | Twnhs | TwnhsI |
| Exterior1st | NA | No corresponding value |
| Exterior2nd | NA | No corresponding value |
| Exterior2nd | Wd Shng | WdShing |
| MasVnrType | NA | No corresponding value |
| Electrical | NA | No corresponding value |
| KitchenQual | NA | No corresponding value |
| Functional | NA | No corresponding value |
| MiscFeature | Empty string | No corresponding value |
| SaleType | NA | No corresponding value |
| Bedroom | Named 'BedroomAbvGr' | Named 'Bedroom'; 'BedroomAbvGr' follows the naming convention |
| Kitchen | Named 'KitchenAbvGr' | Named 'Kitchen'; 'KitchenAbvGr' follows the naming convention |
The code book seems to have a naming convention, but it is not always respected, so it will be hard to achieve complete coherence. Since we do not know the reason behind each code and each feature name, we will not change anything in the code book; the changes will be done in the dataset only.
To be coherent with the code book (assuming the code book is the truth), we replace misspelled categories in the dataset by their corresponding code from the code book. Note that we deduce that the string 'Twnhs' corresponds to 'TwnhsI' in the code book, since all the other codes can easily be associated.
```{r}
dataset <- dataset[MSZoning == "C (all)", MSZoning := "C"]
dataset <- dataset[BldgType == "2fmCon", BldgType := "2FmCon"]
dataset <- dataset[BldgType == "Duplex", BldgType := "Duplx"]
dataset <- dataset[BldgType == "Twnhs", BldgType := "TwnhsI"]
dataset <- dataset[Exterior2nd == "Wd Shng", Exterior2nd := "WdShing"]
```
Since some feature names start with a digit, which is not allowed as an identifier in many programming languages, we rename them with their full name.
```{r}
colnames(dataset)[colnames(dataset) == "1stFlrSF"] <- "FirstFloorArea"
colnames(dataset)[colnames(dataset) == "2ndFlrSF"] <- "SecondFloorArea"
colnames(dataset)[colnames(dataset) == "3SsnPorch"] <- "ThreeSeasonPorchArea"
```
## Data Coherence
We also need to check the logic in the dataset to make sure the data makes sense. We will enumerate facts coming from the code book and from logic to detect anomalies in this dataset.
**1. The feature 'FirstFloorArea' must not have an area of 0 ft². Otherwise, there would be no first floor, thus no stories at all, and therefore no house.**
The minimum area of the first floor is `r min(dataset$FirstFloorArea)` ft². Looking at the features 'HouseStyle' and 'MSSubClass' in the code book, there is neither an NA value nor any other value indicating that a house has no story. Indeed, we have `r length(dataset$HouseStyle[is.na(dataset$HouseStyle)])` NA values for 'HouseStyle' and `r length(dataset$MSSubClass[is.na(dataset$MSSubClass)])` NA values for 'MSSubClass'.
**2. The HouseStyle feature values must match with the values of the feature MSSubClass.**
To check this fact, we map the values of 'HouseStyle' to the values of 'MSSubClass'. We have to be careful with 'SLvl' and 'SFoyer' because they can be used with all types; since we are not sure about them, we validate only against the values we know to be mismatched.
| HouseStyle | MSSubClass |
| -----------| ---------- |
| 1Story | 20 |
| 1Story | 30 |
| 1Story | 40 |
| 1Story | 120 |
| 1.5Fin | 50 |
| 1.5Unf | 45 |
| 2Story | 60 |
| 2Story | 70 |
| 2Story | 160 |
| 2.5Fin | 75 |
| 2.5Unf | 75 |
| SFoyer | 85 |
| SFoyer | 180 |
| SLvl | 80 |
| SLvl | 180 |
```{r echo=FALSE}
cols <- c("Id", "HouseStyle", "BldgType", "MSSubClass")
houses <- dataset[HouseStyle %nin% c("SFoyer", "SLvl"), ]
rows <- houses[HouseStyle != "1Story" & MSSubClass %in% c(20, 30, 40, 120), cols, with = FALSE]
rows <- bind_rows(rows, houses[HouseStyle != "1.5Fin" & MSSubClass == 50, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "1.5Unf" & MSSubClass == 45, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2Story" & MSSubClass %in% c(60, 70, 160), cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2.5Fin" & MSSubClass == 75, cols, with = FALSE])
rows <- bind_rows(rows, houses[HouseStyle != "2.5Unf" & MSSubClass == 75, cols, with = FALSE])
print(rows)
```
**3. Per the code book, values of MSSubClass for 1 and 2 stories must match with the YearBuilt.**
To verify this fact, we compare the values of 'MSSubClass' with the values of 'YearBuilt'. The fact is violated when the year built is before 1946 while 'MSSubClass' is 20, 60, 120 or 160, and also when the year built is 1946 or newer while 'MSSubClass' is 30 or 70.
```{r echo=FALSE}
cols <- c("Id", "YearBuilt", "MSSubClass", "BldgType", "HouseStyle")
rows <- dataset[YearBuilt < 1946 & MSSubClass %in% c(20, 60, 120, 160), cols, with = FALSE]
rows <- bind_rows(rows, dataset[YearBuilt >= 1946 & MSSubClass %in% c(30, 70), cols, with = FALSE])
print(rows)
```
These houses represent `r nrow(rows) / nrow(dataset) * 100`% of the dataset.
**4. If there is no garage with the house, then GarageType = NA, GarageYrBlt = NA, GarageFinish = NA, GarageCars = 0, GarageArea = 0, GarageQual = NA and GarageCond = NA.**
We need to get all houses where 'GarageType' is NA and check whether this fact's conditions are respected.
```{r echo=FALSE}
cols <- c("Id", "GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond", "GarageArea", "GarageCars")
garage.none <- dataset[is.na(GarageType) & is.na(GarageYrBlt) & is.na(GarageFinish) & is.na(GarageQual) & is.na(GarageCond) & GarageArea == 0 & GarageCars == 0, cols, with = FALSE]
garage <- dataset[!is.na(GarageType) & !is.na(GarageYrBlt) & !is.na(GarageFinish) & !is.na(GarageQual) & !is.na(GarageCond) & GarageArea > 0 & GarageCars > 0, cols, with = FALSE]
garage <- setdiff(dataset[, cols, with = FALSE], bind_rows(garage.none, garage))
print(garage)
garage <- garage[is.na(GarageQual) & is.na(GarageCond) & is.na(GarageArea), cols, with = FALSE]
dataset <- dataset[Id %in% garage$Id, GarageType := NA]
```
**5. If there is no basement in the house, then TotalBsmtSF = 0, BsmtUnfSF = 0, BsmtFinSF2 = 0, BsmtHalfBath = 0, BsmtFullBath = 0, BsmtQual = NA and BsmtCond = NA, BsmtExposure = NA, BsmtFinType1 = NA, BsmtFinSF1 = 0, BsmtFinType2 = NA.**
```{r echo=FALSE}
cols <- c("Id", "TotalBsmtSF", "BsmtUnfSF", "BsmtFinSF2", "BsmtHalfBath", "BsmtFullBath", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2")
basement.none <- dataset[is.na(BsmtQual) & is.na(BsmtCond) & is.na(BsmtExposure) & is.na(BsmtFinType1) & is.na(BsmtFinType2) & TotalBsmtSF == 0 & BsmtUnfSF == 0 & BsmtFinSF1 == 0 & BsmtFinSF2 == 0 & BsmtHalfBath == 0 & BsmtFullBath == 0, cols, with = FALSE]
basement <- dataset[!is.na(BsmtQual) & !is.na(BsmtCond) & TotalBsmtSF > 0, cols, with = FALSE]
basement <- setdiff(subset(dataset, select = cols), bind_rows(basement.none, basement))
print(basement)
## For houses with no basement, we replace area features having NA by 0.
basement.none <- basement[is.na(BsmtQual) & is.na(BsmtCond) & TotalBsmtSF == 0, cols, with = FALSE]
print(basement.none)
dataset <- dataset[Id %in% basement.none$Id, `:=`(BsmtHalfBath = 0, BsmtFullBath = 0)]
```
**6. Per the code book, if there are no fireplaces, then FireplaceQu = NA and Fireplaces = 0.**
```{r echo=FALSE}
dataset[Fireplaces > 0 & is.na(FireplaceQu), c("Id", "Fireplaces", "FireplaceQu"), with = FALSE]
dataset[Fireplaces == 0 & !is.na(FireplaceQu), c("Id", "Fireplaces", "FireplaceQu"), with = FALSE]
```
**7. Per the code book, if there are no Pool, then PoolQC = NA and PoolArea = 0.**
```{r echo=FALSE}
dataset[PoolArea > 0 & is.na(PoolQC), c("Id", "PoolArea", "PoolQC"), with = FALSE]
dataset[PoolArea == 0 & !is.na(PoolQC), c("Id", "PoolArea", "PoolQC"), with = FALSE]
```
**8. Per the code book, the Remodel year is the same as the year built if no remodeling or additions. Then, it is true to say that YearRemodAdd $\geq$ YearBuilt.**
Houses violating this fact are detected by filtering those whose remodel year is earlier than the year built. For these houses, we can also check the year the garage was built, if any, against the house's year built and remodel year.
```{r echo=FALSE}
dataset[YearRemodAdd < YearBuilt, c("Id", "YearBuilt", "YearRemodAdd", "GarageYrBlt"), with = FALSE]
```
```{r}
dataset <- dataset[which(YearRemodAdd < YearBuilt), YearRemodAdd := YearBuilt]
```
**9. We verify that if the Garage Cars is 0, then the Garage Area is also 0. The converse is true since a Garage area of 0 means that there is no garage, thus no cars.**
```{r echo=FALSE}
dataset[GarageArea == 0 & GarageCars > 0, c("Id", "GarageArea", "GarageCars"), with = FALSE]
```
**10. BsmtCond = NA if and only if BsmtQual = NA; per the code book, both mean there is no basement.**
```{r echo=FALSE}
dataset[is.na(BsmtCond) & !is.na(BsmtQual), c("Id", "BsmtCond", "BsmtQual"), with = FALSE]
dataset[!is.na(BsmtCond) & is.na(BsmtQual), c("Id", "BsmtCond", "BsmtQual"), with = FALSE]
```
```{r}
dataset <- dataset[which(!is.na(BsmtCond) & is.na(BsmtQual)), BsmtQual := BsmtCond]
dataset <- dataset[which(is.na(BsmtCond) & !is.na(BsmtQual)), BsmtCond := BsmtQual]
```
**11. We have MasVnrType = None if and only if MasVnrArea = 0 ft².**
There are two cases where it is hard to tell which value is right.
* Case when MasVnrType = 'None' and MasVnrArea $\neq 0$ ft²
* Case when MasVnrType $\neq$ 'None' and MasVnrArea $= 0$ ft²
```{r echo=FALSE}
dataset[MasVnrType == "None" & MasVnrArea > 0, c("Id", "MasVnrType", "MasVnrArea"), with = FALSE]
dataset[MasVnrType != "None" & MasVnrArea == 0, c("Id", "MasVnrType", "MasVnrArea"), with = FALSE]
```
```{r}
dataset <- dataset[which(MasVnrType != "None" & MasVnrArea == 0), MasVnrType := "None"]
dataset <- dataset[which(MasVnrType == "None" & MasVnrArea <= 10), MasVnrArea := 0]
```
## Missing Values
Per the code book of this dataset, NA values generally mean 'No' or 'None', and they are used only for some categorical features. The NA values that are not covered by the code book will be explained case by case. The same applies to empty strings, which will be replaced by NA.
* Case when NA means 'None' or 'No'
* Case when an integer feature has 0 and NA as possible values
* Case when a numeric value has 0 and NA as possible values
* Case when a category is NA where NA means 'No', and the numeric feature is not zero
* Case when a category is not NA where NA means 'No', and the numeric feature is NA where 0 has a clear meaning
Features having NA values where NA means 'None' or 'No' will be replaced by 0.
However, some NA values can be resolved by analyzing the values of strongly related features. For example, the integer features GarageCars and GarageArea have NA values. At first glance, we cannot state that NA means 0, since 0 already has a meaning; it could stand for 'no information'. But looking at the GarageQual and GarageCond features, we notice that their value is NA as well, which means this house has no garage per the code book. Therefore, we will replace NA by 0 for GarageArea and GarageCars.
For a feature like "BsmtFullBath", the value 0 means there is no full bathroom in the basement. Thus, we cannot replace NA by 0 if there is a basement. If the house has no basement, it has no full bathroom in the basement either, and only in that case can we replace NA by 0.
For some numeric features, we expect the value 0 to mean the same thing as NA: a garage area of 0, for example, means there is no garage with this house. However, when 0 represents an amount of money or a geometric measure (e.g. an area), it is a real 0.
For "year" features (e.g. GarageYrBlt), a NA value could be replaced by 0; a year 0 is theoretically possible but clearly impossible in our context. However, using 0 would lower the mean and add noise to the data, since the gap between the minimum year, `r min(dataset$GarageYrBlt, na.rm = TRUE)`, and zero is large.
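Given the noise concern, one alternative (our own suggestion, not prescribed by the code book, shown here only as a sketch and not applied in this analysis) is to fall back on the house's own build year:

```r
## Sketch only, not applied in this analysis: keep GarageYrBlt on the year
## scale by substituting the house's YearBuilt for missing garage years.
dataset <- dataset[is.na(GarageYrBlt), GarageYrBlt := YearBuilt]
```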
Another case is when a feature uses NA to indicate that the information is missing. For example, the feature "KitchenQual" is not supposed to take the value NA per the code book. If NA is used, it really means "no information" and we cannot replace it by 0. Normally, we would exclude such a house from the dataset, but this house comes from the test set, so we must not remove it.
For those cases, we need to impute the missing values. We could replace the NA values of a feature by its mean, but since we have many other features, it is more accurate to predict the replacement value from them.
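As a minimal sketch of the simple option, mean imputation with `impute` from the Hmisc package (loaded above) looks like this on a toy vector; the same call applies to a column such as MasVnrArea.

```{r}
## NA entries are replaced by the mean of the observed values.
x <- c(120, NA, 340, 0, NA)
x.imputed <- impute(x, mean)
as.numeric(x.imputed)
```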
```{r echo=FALSE}
sapply(dataset, function(x) sum(is.na(x)))
#md.pattern(dataset)
aggr(dataset,
col = c('navyblue', 'red'),
numbers = TRUE,
sortVars = TRUE,
labels = names(dataset),
cex.axis = .7,
gap = 3,
ylab = c("Histogram of missing data", "Pattern"))
```
For the masonry veneer type (MasVnrType) feature, the value "None" means that the house has no masonry veneer per the code book. If a house has the value NA, it means the information is missing.
Note that it is possible to have information on the masonry veneer area but not on the type (and vice versa). In that case, we cannot deduce with certainty which value should replace NA. We cannot replace the area's NA by 0, because 0 means *None*, which is a valid choice. The best choice we can make is to replace the NA value by the mean of the feature.
<!------------------------------------------------------------ANOMALIES DETECTION------------------------------------------------------------------------------>
# Anomaly Detection
In this section, the objective is to detect houses or features carrying wrong or illogical information, and to fix them where possible.
We define a house as an anomaly if $\left\lVert Y - P \right\rVert > \epsilon$, where $Y = (x, y)$ is the point on the linear regression model and $P = (x, z)$ is the observed point; here $x$ is the ground living area, $y$ and $z$ are sale prices, and $\epsilon > 0$ is the threshold.
Regarding the overall quality, the sale price and the ground living area, we expect the sale price to increase as the overall quality and the ground living area increase. This is verified in the exploratory section.
Taking houses with an overall quality of 10 and a ground living area greater than 4000 ft², the sale price should be among the highest. A sale price more than \$240,000 above the regression model's estimate may still be plausible, but one that far below it is exceptional, so we treat it as an anomaly.
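The $\epsilon$-rule above can be sketched as a small helper; `epsilon` is the threshold from the definition, and the model is assumed to be fitted with a `data` argument so that `predict` accepts new points.

```{r}
## Sketch of the anomaly rule: a house is an anomaly when the absolute gap
## between its observed price z and the fitted price at area x exceeds epsilon.
is.anomaly <- function(model, x, z, epsilon) {
  fitted.price <- predict(model, newdata = data.frame(GrLivArea = x))
  abs(z - fitted.price) > epsilon
}
## e.g. model <- lm(SalePrice ~ GrLivArea, data = train)
##      is.anomaly(model, train$GrLivArea, train$SalePrice, 240000)
```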
```{r echo=FALSE}
anomalies <- train[OverallQual == 10 & GrLivArea > 4000, c("Id", "GrLivArea", "SalePrice"), with = FALSE]
print(anomalies)
model <- lm(formula = train$SalePrice ~ train$GrLivArea)
price.eq <- coef(model)["(Intercept)"] + coef(model)["train$GrLivArea"] * anomalies$GrLivArea
prices <- data.table(Id = anomalies$Id,
ApproxPrice = price.eq,
SalePrice = anomalies$SalePrice,
PriceDifference = abs(anomalies$SalePrice - price.eq))
print(prices)
ids <- prices$Id[prices$PriceDifference > 240000]
dataset <- dataset[Id %nin% ids, ]
```
After visualization, we detect another anomaly concerning the garage year built. Since it cannot be later than `r max(dataset$YrSold)`, years greater than that will be treated as anomalies.
```{r echo=FALSE}
dataset[GarageYrBlt > max(YrSold), c("Id", "GarageYrBlt", "YearBuilt", "YrSold"), with = FALSE]
```
```{r}
dataset <- dataset[GarageYrBlt > max(YrSold), GarageYrBlt := YrSold]
```
<!------------------------------------------------------------DATA EXPLORATORY------------------------------------------------------------------------------>
# Exploratory Data Analysis
The objective is to visualize and understand the relationships between the features of the dataset we use to solve the problem. We will also compare the changes we make to this dataset to validate whether they have a significant influence on the sale price.
## Features
Here is the list of features with their type.
```{r echo=FALSE}
str(dataset)
train <- dataset[SalePrice > -1, ]
test <- dataset[SalePrice == -1, ]
```
We now plot the correlation between the numeric features of the train set.
```{r echo=FALSE}
features.numeric <- names(train)[which(sapply(train, is.numeric))]
train.numeric <- train[, features.numeric, with = FALSE]
correlations <- cor(na.omit(train.numeric))
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)
correlations <- correlations[row_indic, row_indic]
corrplot(correlations, method = "pie")
sale.price <- data.frame(SalePriceCorrelation = sort(correlations[, "SalePrice"], decreasing = TRUE))
print(sale.price)
```
We note that some features are strongly correlated with the sale price or other features. We will produce plots for each of them to get insights.
## Dependent vs Independent Features
With the current features in this dataset, we have to check which features depend on other features and which are independent. At first glance, features representing totals and overalls seem dependent:
* $GrLivArea = FirstFloorArea + SecondFloorArea + LowQualFinSF$
* $TotalBsmtSF = BsmtUnfSF + BsmtFinSF1 + BsmtFinSF2$
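A quick sanity check, sketched below with the renamed columns, confirms whether these identities hold exactly (the basement check drops the few rows whose basement areas are NA in the test set):

```{r}
all(dataset$GrLivArea ==
      dataset$FirstFloorArea + dataset$SecondFloorArea + dataset$LowQualFinSF)
all(dataset$TotalBsmtSF ==
      dataset$BsmtUnfSF + dataset$BsmtFinSF1 + dataset$BsmtFinSF2, na.rm = TRUE)
```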
## Sale Price
Ideally, the sale price would follow a normal distribution. Its actual distribution is right-skewed, so we normalize it by taking its logarithm.
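The asymmetry can be quantified with `skewness` from the moments package (loaded earlier); values near 0 indicate a roughly symmetric distribution.

```{r}
skewness(train$SalePrice)
skewness(log(train$SalePrice + 1))
```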
```{r echo=FALSE}
local({
plot.saleprice <- ggplot(train, aes(x = SalePrice)) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the Sale Price") +
labs(x = "Sale Price ($)")
plot.logsaleprice <- ggplot(train, aes(x = log(SalePrice + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of Sale Price") +
labs(x = "Log Sale Price (log$)")
grid.arrange(plot.saleprice, plot.logsaleprice, ncol = 2)
})
summary(train$SalePrice)
```
## Overall Quality Rate
The overall quality rate is the most correlated feature to the sale price as seen previously. We look at the average sale price for each overall quality rate and try to figure out an equation that will best approximate our data.
```{r echo=FALSE}
local({
data <- train[, list(MeanSalePrice = mean(SalePrice)), by = OverallQual]
data <- setorder(data, OverallQual)
print(data)
ggplot(data, aes(x = OverallQual, y = MeanSalePrice)) +
geom_line(aes(colour = "Right")) +
geom_line(aes(x = OverallQual,
y = 939113/180*OverallQual*OverallQual - 2561483/180*OverallQual + 354979/6,
colour = "Approx.")) +
ggtitle("Distribution of Average Sale Price in function of the overall quality rate") +
labs(y = "Average sale price ($)", x = "Overall Quality Rate") +
scale_colour_manual("Legend",
breaks = c("Approx.", "Right"),
values = c("red", "black"))
})
```
Note that the equation used to approximate is a parabola where the equation has been built from 3 points (OverallQual, MeanSalePrice) where the overall quality rates chosen are 1, 6 and 10 with their corresponding average sale price. The equation used to approximate the polyline is $M(Q) = \dfrac{939113}{180}Q^2-\dfrac{2561483}{180}Q+\dfrac{354979}{6}$ where $Q$ is the overall quality rate and $M(Q)$ is the mean sale price in function of $Q$.
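These coefficients can also be recovered programmatically by solving the 3×3 system through the three chosen points; `m1`, `m6` and `m10` below are placeholders for the corresponding average sale prices.

```{r}
## Solve a*Q^2 + b*Q + c = M(Q) through three points (q[i], m[i]).
fit.parabola <- function(q, m) {
  A <- cbind(q^2, q, 1)
  setNames(solve(A, m), c("a", "b", "c"))
}
## e.g. fit.parabola(c(1, 6, 10), c(m1, m6, m10))
```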
Here is a frequency table and a histogram of these frequencies.
```{r echo=FALSE}
local({
table.freq <- table(dataset$OverallQual)
print(cbind(Freq = table.freq,
Cumul = cumsum(table.freq),
Relative = prop.table(table.freq)))
print(ggplot(dataset, aes(x = OverallQual)) +
geom_bar(aes(y = ..count..)) +
scale_x_continuous(breaks = seq(min(dataset$OverallQual), max(dataset$OverallQual), by = 1)) +
geom_text(aes(y = ..count.. ,
label = scales::percent(..count.. / sum(..count..))),
stat = "count",
vjust = -0.25) +
ggtitle("Percentage of Houses by Overall Quality") +
labs(y = "Percentage of Houses", x = "Overall Quality"))
})
```
## Above Ground Living Area
This feature is the second most correlated with the sale price per the correlation plot.
```{r echo=FALSE}
local({
plot.grlivarea <- ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price in function \n of the Above Grade Living Area") +
labs(x = "Above Grade Living Area (ft²)", y = "Sale Price ($)")
plot.loggrlivarea <- ggplot(train, aes(x = log(GrLivArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the GrLivArea") +
labs(x = "log(GrLivArea + 1) (log(ft²))")
plot.rooms <- ggplot(train, aes(x = TotRmsAbvGrd, y = GrLivArea)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Above Grade Living Area \n in function of the total rooms above grade") +
labs(x = "Total rooms above grade", y = "Above grade living area (ft²)")
grid.arrange(plot.grlivarea, plot.loggrlivarea, plot.rooms, ncol = 2, nrow = 2)
})
```
## Garage Cars
```{r echo=FALSE}
local({
data <- train[, list(MinGarageArea = min(GarageArea),
MeanGarageArea = mean(GarageArea),
MaxGarageArea = max(GarageArea),
MeanSalePrice = mean(SalePrice)), by = GarageCars]
data <- setorder(data, GarageCars)
print(data)
plot.garage.price <- ggplot(data, aes(x = GarageCars, y = MeanSalePrice)) +
geom_line() +
ggtitle("Distribution of Mean Sale Prices \n in function of Garage Cars") +
labs(x = "Garage Cars", y = "Average sale price ($)")
plot.garage.cars <- ggplot(data, aes(x = GarageCars, y = MeanGarageArea)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Mean Garage Area \n as a function of Garage Cars") +
labs(x = "Garage Cars", y = "Mean of Garage Area (ft²)")
grid.arrange(plot.garage.price, plot.garage.cars, ncol = 2)
})
```
Here is the list of houses in the dataset whose garage can hold four or more cars.
```{r echo=FALSE}
dataset[GarageCars >= 4, c("Id", "OverallQual", "GarageCars", "GarageArea", "SalePrice"), with = FALSE]
```
## Garage Area
```{r echo=FALSE}
local({
plot.garagearea <- ggplot(train, aes(x = GarageArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price \n as a function of the Garage Area") +
labs(x = "Garage Area (ft²)", y = "Sale Price ($)")
plot.loggaragearea <- ggplot(train, aes(x = log(GarageArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log \n of Garage Area") +
labs(x = "log(GarageArea + 1) (log(ft²))")
grid.arrange(plot.garagearea, plot.loggaragearea, ncol = 2)
})
```
## Total Basement Area
```{r echo=FALSE}
local({
plot.basementarea <- ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price as a function \n of the Total Basement Area") +
labs(x = "Total Basement Area (ft²)", y = "Sale Price ($)")
plot.logbasementarea <- ggplot(train, aes(x = log(TotalBsmtSF + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of \n TotalBsmtSF") +
labs(x = "log(TotalBsmtSF + 1) (log(ft²))")
grid.arrange(plot.basementarea, plot.logbasementarea, ncol = 2)
})
```
## First Floor Area
```{r echo=FALSE}
local({
plot.firstfloorarea <- ggplot(train, aes(x = FirstFloorArea, y = SalePrice)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Distribution of Sale Price as a function \n of the First Floor Area") +
labs(x = "First Floor Area (ft²)", y = "Sale Price ($)")
plot.logfirstfloorarea <- ggplot(train, aes(x = log(FirstFloorArea + 1))) +
geom_histogram(col = 'white') +
theme_light() +
ggtitle("Distribution of the log of \n FirstFloorArea") +
labs(x = "log(FirstFloorArea + 1) (log(ft²))")
grid.arrange(plot.firstfloorarea, plot.logfirstfloorarea, ncol = 2)
})
```
## Year Built
We compare the house year built and the garage year built.
```{r echo=FALSE}
local({
ggplot(dataset, aes(x = YearBuilt, y = GarageYrBlt)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
ggtitle("Garage Year Built as a function \n of the House Year Built") +
labs(x = "House Year Built", y = "Garage Year Built")
})
```
We can see that a few houses were built many years after their garage. A plausible explanation is that a detached garage or workshop stood on the lot first, and a house was built beside it much later.
```{r echo=FALSE}
dataset[GarageYrBlt < YearBuilt, c("Id", "GarageYrBlt", "YearBuilt", "GarageType"), with = FALSE]
```
<!------------------------------------------------------------FEATURE ENGINEERING------------------------------------------------------------------------------>
# Feature Engineering
In this section, we create, modify and delete features to help the prediction. We will impute missing values and rescale features such as the quality and condition ones. Then we will check for skewed features and normalize them.
## Feature Replacement
The categorical features will be encoded as 1-based numeric values, except that values meaning 'No' or 'None' will be set to 0. Since the feature 'MasVnrType' has both 'None' and NA values, we replace 'None' by 0 here, and the NA values will be replaced by the median in the missing values imputation section. There are two reasons behind these replacements:
1. Values meaning 'Empty' or 'Nothing' are logically equivalent to zero.
2. We may want to convert the dataset to a sparse matrix to save memory, and a 0-based encoding makes the sparse representation more compact.
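As a quick illustration of the second point, the Matrix package stores only the non-zero entries of a sparse matrix (the toy matrix below is hypothetical, not part of the dataset):

```{r}
library(Matrix)

## A 2x3 matrix containing four zeros.
m <- Matrix(c(0, 0, 5, 0, 2, 0), nrow = 2, sparse = TRUE)

## The @x slot holds only the 2 non-zero values; the zeros cost no storage.
length(m@x)
```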
```{r}
## Replace by NA or NaN first. Otherwise, the factor-to-numeric conversion below would map the
## value 0 to the 1-based encoding as well. NA and NaN are not affected by that conversion.
dataset <- dataset[MasVnrType == "None", MasVnrType := NaN]
dataset <- dataset[CentralAir == "N", CentralAir := NA]
## Transform all categorical features from string to numeric 1-base.
features.string <- which(sapply(dataset, function(x) is.character(x)))
for(feature in features.string)
{
set(dataset, i = NULL, j = feature, value = as.numeric(factor(dataset[[feature]])))
}
dataset <- dataset[is.nan(MasVnrType), MasVnrType := 0]
```
## Missing Values Imputation
Features whose NA values mean 'None' or 'No' have those values replaced by 0, as specified in the previous section.
```{r}
dataset <- dataset[is.na(Alley), Alley := 0]
dataset <- dataset[is.na(BsmtQual), BsmtQual := 0]
dataset <- dataset[is.na(BsmtCond), BsmtCond := 0]
dataset <- dataset[is.na(BsmtExposure), BsmtExposure := 0]
dataset <- dataset[is.na(BsmtFinType1), BsmtFinType1 := 0]
dataset <- dataset[is.na(BsmtFinType2), BsmtFinType2 := 0]
dataset <- dataset[is.na(FireplaceQu), FireplaceQu := 0]
dataset <- dataset[is.na(GarageType), GarageType := 0]
dataset <- dataset[is.na(GarageFinish), GarageFinish := 0]
dataset <- dataset[is.na(GarageQual), GarageQual := 0]
dataset <- dataset[is.na(GarageCond), GarageCond := 0]
dataset <- dataset[is.na(PoolQC), PoolQC := 0]
dataset <- dataset[is.na(Fence), Fence := 0]
dataset <- dataset[is.na(MiscFeature), MiscFeature := 0]
dataset <- dataset[is.na(CentralAir), CentralAir := 0]
```
All other NA values, which require more than replacement by a fixed constant, will be replaced by either the mean or the median: features containing real values have their NAs replaced by the mean, while features with integer values have their NAs replaced by the median.
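The `impute` calls below come from the Hmisc package. As a rough sketch of what such a helper does (a simplified stand-in, not Hmisc's actual implementation), it replaces NA values with a summary statistic of the non-missing values:

```{r}
## Simplified imputation helper: replace NAs by fun applied to the rest.
impute.constant <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

impute.constant(c(1, NA, 3), mean)       ## the NA becomes 2
impute.constant(c(1, NA, 3, 3), median)  ## the NA becomes 3
```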
```{r}
dataset$MSZoning <- impute(dataset$MSZoning, median)
dataset$LotFrontage <- impute(dataset$LotFrontage, mean)
dataset$Utilities <- impute(dataset$Utilities, median)
dataset$Exterior1st <- impute(dataset$Exterior1st, median)
dataset$Exterior2nd <- impute(dataset$Exterior2nd, median)
dataset$MasVnrType <- impute(dataset$MasVnrType, median)
dataset$MasVnrArea <- impute(dataset$MasVnrArea, mean)
dataset$BsmtFinSF1 <- impute(dataset$BsmtFinSF1, mean)
dataset$BsmtFinSF2 <- impute(dataset$BsmtFinSF2, mean)
dataset$BsmtUnfSF <- impute(dataset$BsmtUnfSF, mean)
dataset <- dataset[is.na(TotalBsmtSF), TotalBsmtSF := BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF]
dataset$Electrical <- impute(dataset$Electrical, median)
dataset$BsmtFullBath <- impute(dataset$BsmtFullBath, median)
dataset$BsmtHalfBath <- impute(dataset$BsmtHalfBath, median)
dataset$KitchenQual <- impute(dataset$KitchenQual, median)
dataset$Functional <- impute(dataset$Functional, median)
dataset$GarageYrBlt <- impute(dataset$GarageYrBlt, median)
dataset$GarageCars <- impute(dataset$GarageCars, median)
dataset$GarageArea <- impute(dataset$GarageArea, mean)
dataset$SaleType <- impute(dataset$SaleType, median)
# imputation.start <- mice(dataset, maxit = 0, print = FALSE)
# method <- imputation.start$method
# predictors <- imputation.start$predictorMatrix
#
# ## Exclude from prediction since these features will not help.
# predictors[, c("SalePrice")] <- 0
#
# imputed <- mice(dataset,
# method = "mean",
# predictorMatrix = predictors,
# m = 5,
# print = FALSE)
#
# dataset <- complete(imputed, 1)
#
# densityplot(imputed)
```
## Feature Scaling
The quality and condition features are not on the same scale as the most important feature, the overall quality. Indeed, the overall quality takes integer values from 1 to 10, while the other quality features were previously mapped to values from 0 to 4 or 5. If $Q$ represents any quality feature except the overall quality, the scaling function is $f(Q) = 2Q$ where $Q \in \{0, 1, 2, 3, 4, 5\}$.
```{r}
dataset$ExterQual <- dataset$ExterQual * 2
dataset$FireplaceQu <- dataset$FireplaceQu * 2
dataset$BsmtQual <- dataset$BsmtQual * 2
dataset$KitchenQual <- dataset$KitchenQual * 2
dataset$GarageQual <- dataset$GarageQual * 2
dataset$BsmtCond <- dataset$BsmtCond * 2
dataset$GarageCond <- dataset$GarageCond * 2
dataset$ExterCond <- dataset$ExterCond * 2
```
For Pool, Heating and Fence quality / condition features, we apply the function $f(Q) = 2.5Q$ where $Q \in \{0, 1, 2, 3, 4\}$.
```{r}
dataset$PoolQC <- dataset$PoolQC * 2.5
dataset$HeatingQC <- dataset$HeatingQC * 2.5
dataset$Fence <- dataset$Fence * 2.5
```
All area features are given in square feet, thus no need to convert any of them.
## Skewed Features
We transform skewed features so that their distributions are closer to normal. We use the function $f(A) = \log{(A + 1)}$, where $A \in \mathbb{R}_+^n$ is a vector representing a feature of the dataset and $n$ is the number of values in this vector. We add 1 to avoid $\log{0}$, which is undefined.
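As a side note, base R provides `log1p`, which computes the same quantity as `log(A + 1)` with better accuracy for values near zero; a quick sanity check on toy values:

```{r}
a <- c(0, 10, 100)
all.equal(log(a + 1), log1p(a))
```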
We compute the skewness of every feature, keep the features above a skewness threshold, and exclude the categorical features from the transformation.
```{r echo=FALSE}
skewed <- apply(dataset, 2, function(feature) skewness(feature))
print(skewed)
skewed <- setdiff(names(skewed[skewed > 0.8]),
c("SalePrice", "MSSubClass", "FirstFloorArea", "BsmtFinSF2", "Utilities", "Condition1",
"Condition2", "BldgType", "RoofStyle", "RoofMatl", "Heating")) #, "PoolQC", "Fence", "Alley"))
print(skewed)
```
Let's apply the formula to the remaining features.
```{r}
indices <- which(colnames(dataset) %in% skewed)
for(index in indices)
{
dataset[[index]] <- log(dataset[[index]] + 1)
}
```
## Features Construction
The objective is to add features that will be good predictors for the models created in the Models Building section. Clients may ask:
* How old is the house? Subtract the year the house was built from the year it was sold.
* How many years since the house was remodeled? Subtract the remodeling year from the year the house was sold.
* How many bathrooms are there in the house, including the basement? Sum the basement bathrooms and the ones above grade.
* What is the total house area? Add the basement area to the above ground living area.
```{r}
dataset <- dataset %>%
mutate(YearsSinceBuilt = YrSold - YearBuilt) %>%
mutate(YearsSinceRemodeled = YrSold - YearRemodAdd) %>%
mutate(OverallQualExp = exp(OverallQual) - 1) %>%
mutate(TotalBaths = FullBath + HalfBath + BsmtFullBath + BsmtHalfBath) %>%
mutate(TotalArea = TotalBsmtSF + GrLivArea)
```
## Noisy Features
We remove features that add noise to the predictions. The 3 models used in the Models Building section each report feature importance; the first elimination method is to take the intersection of the least important features across the 3 models.
The second method is to eliminate features with a high percentage of NA values, as determined in the dataset cleaning section. This assumes that buyers will not really look at the fence, the alley, or the pool quality and condition.
Finally, we remove the Id feature since it is only a unique identifier of a house and should have no predictive value for the sale price.
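The intersection of least important features mentioned above can be sketched as follows; the three vectors here are hypothetical placeholders, which in practice would be built from the importance outputs of the 3 models in the Models Building section:

```{r}
## Hypothetical least-important features reported by each model.
least.xgb   <- c("PoolQC", "ThreeSeasonPorchArea", "Utilities")
least.rf    <- c("PoolQC", "ThreeSeasonPorchArea", "MiscVal")
least.lasso <- c("ThreeSeasonPorchArea", "PoolQC", "Street")

## Features that all 3 models agree are noisy.
Reduce(intersect, list(least.xgb, least.rf, least.lasso))
```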
```{r echo=FALSE}
dataset$ThreeSeasonPorchArea <- NULL
dataset$PoolQC <- NULL
#dataset$Alley <- NULL
#dataset$Fence <- NULL
test.id <- test$Id
dataset$Id <- NULL
```
<!------------------------------------------------------------MODELS BUILDING------------------------------------------------------------------------------>
# Models Building
In this section, we train different models and predict the sale price of each house. We will use the extreme gradient boosting trees, random forest and LASSO algorithms to build models.
These algorithms need two inputs: the dataset as a matrix and the real sale prices from the training set. Since many NA and 'None' values have been replaced by 0, a sparse matrix representation of the dataset is more memory-efficient.
```{r echo=FALSE}
## Need them for the random forest only.
train.original <- dataset[dataset$SalePrice != -1, ]
test.original <- dataset[dataset$SalePrice == -1, ]
## Keep the sale price in a numeric vector since this is not a predictor.
sale.price <- dataset$SalePrice[dataset$SalePrice != -1]
dataset$SalePrice <- NULL
dataset.zeros <- sum(dataset == 0L)
dataset.cells <- nrow(dataset) * ncol(dataset)
cat("Dataset contains", dataset.zeros, "zeros which is", dataset.zeros / dataset.cells * 100, "% of the dataset.")
## Transform the dataset to a sparse matrix.
dataset <- sparse.model.matrix(~ ., data = dataset)
train <- dataset[1:nrow(train), ]
test <- dataset[(nrow(train)+1) : nrow(dataset), ]
```
## Extreme Gradient Boosted Regression Trees
We perform a 10-fold cross-validation to find the optimal number of trees and the RMSE score, the metric used to measure the accuracy of our model. The training set is randomly split into 10 folds, each containing `r as.integer(nrow(train) / 10)` observations (houses).
For each boosting round, we average the 10 error estimates to obtain a more robust estimate of the true prediction error. Doing this across all rounds gives the optimal number of trees to use for the test set.
We also display 2 curves showing the progression of the mean test and train RMSE. The vertical dotted line marks the optimal number of trees. This plot shows whether the model overfits or underfits.
```{r}
cv.nfolds <- 10
cv.nrounds <- 400
sale.price.log <- log(sale.price + 1)
train.matrix <- xgb.DMatrix(train, label = sale.price.log)
param <- list(objective = "reg:linear",
eta = 0.12,
subsample = 0.75,
colsample_bytree = 0.75,
min_child_weight = 2,
max_depth = 2)
model.cv <- xgb.cv(data = train.matrix,
nfold = cv.nfolds,
param = param,
nrounds = cv.nrounds,
verbose = 0)
model.cv$names <- as.integer(rownames(model.cv))
best <- model.cv[model.cv$test.rmse.mean == min(model.cv$test.rmse.mean), ]
cv.plot.title <- paste("Training RMSE using", cv.nfolds, "folds CV")
print(ggplot(model.cv, aes(x = names)) +
geom_line(aes(y = test.rmse.mean, colour = "test")) +
geom_line(aes(y = train.rmse.mean, colour = "train")) +
geom_vline(xintercept = best$names, linetype = "dotted") +
ggtitle(cv.plot.title) +
xlab("Number of trees") +
ylab("RMSE"))
print(model.cv)
cat("\nOptimal testing set RMSE score:", best$test.rmse.mean)
cat("\nAssociated training set RMSE score:", best$train.rmse.mean)
cat("\nInterval testing set RMSE score: [", best$test.rmse.mean - best$test.rmse.std, ",", best$test.rmse.mean + best$test.rmse.std, "]")
cat("\nDifference between optimal training and testing sets RMSE:", abs(best$train.rmse.mean - best$test.rmse.mean))
cat("\nOptimal number of trees:", best$names)
```
Using the optimal number of trees given by the cross-validation, we train the final model on the training set and predict on the test set.
```{r}
nrounds <- as.integer(best$names)
model <- xgboost(param = param,
train.matrix,
nrounds = nrounds,
verbose = 0)
test.matrix <- xgb.DMatrix(test)
xgb.prediction.test <- exp(predict(model, test.matrix)) - 1
prediction.train <- predict(model, train.matrix)
# Check which features are the most important.
names <- dimnames(train)[[2]]
importance.matrix <- xgb.importance(names, model = model)
print(importance.matrix)
# Display the 35 most important features.
print(xgb.plot.importance(importance.matrix[1:35]))
rmse <- printRMSEInformation(prediction.train, sale.price)
```
We can see that the model overfits: the cross-validated RMSE on the test folds is `r best$test.rmse.mean`, while the RMSE on the training set is only `r rmse`.
## Random Forest
```{r}
# rf.model <- randomForest(log(SalePrice + 1) ~ .,
# data = train.original,
# importance = TRUE,
# proximity = TRUE,
# ntree = 130,
# do.trace = 5)
#
# plot(rf.model, ylim = c(0, 1))