rstuido-housing-prices-analysis-prediction

Explore and Predict Housing Sales Price in Ames, IA

This project explores and forecasts housing sale prices in Ames, Iowa. The goal is to identify significant predictors, develop a linear regression model, and determine how well the model fits the data and how accurate its predictions are. The training data set of 1,000 observations and 25 variables is used to fit the linear regression model. The testing data set of 460 observations and 25 variables is used to determine how well the predicted prices match the observed prices.

Interpretation

The summary statistics show the maximum and minimum sale prices. The median price is $161,750. The median is a better indicator of typical prices than the mean because it is less affected by the many high outliers. The testing set’s histogram and boxplot are right-skewed, showing that most housing prices are on the lower end, at or below $210,000. The majority are between approximately $100,000 and $200,000. The boxplot shows multiple outliers. After combining the training and testing data sets and replotting, the data are still right-skewed with multiple outliers. The boxplot is narrower in the combined graph given the larger number of observations, with most prices falling at or below $200,000. Again, most houses fall between $100,000 and $200,000.
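To illustrate why the median resists high outliers and how a boxplot flags them, here is a minimal sketch. It is written in Python for illustration only and is not the project's R code; the prices are hypothetical, and the 1.5 × IQR fence is the common boxplot convention, assumed here rather than taken from the project.

```python
import statistics

# Hypothetical sale prices (not from the Ames data): mostly modest, two expensive homes.
prices = [105_000, 120_000, 135_000, 150_000, 160_000,
          175_000, 190_000, 410_000, 620_000]

mean_price = statistics.mean(prices)      # pulled upward by the expensive homes
median_price = statistics.median(prices)  # middle value, barely affected by them

# Common boxplot rule (assumed): flag points above Q3 + 1.5 * IQR.
q1, _, q3 = statistics.quantiles(prices, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [p for p in prices if p > upper_fence]

print(f"mean:        {mean_price:,.0f}")
print(f"median:      {median_price:,.0f}")
print(f"upper fence: {upper_fence:,.0f}")
print(f"outliers:    {outliers}")
```

In a right-skewed sample like this one, the mean sits well above the median, which is the pattern described for the sale prices.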

A linear regression model uses the characteristics of each sale to predict housing prices. A summary of the model shows several significant predictors of SalePrice. These predictors include LotArea, OverallQual (overall quality), OverallCond (overall condition), YearBuilt, MasVnrArea (masonry veneer area), TotalBsmtSF (total basement square footage), GrLivArea (above-ground living area), BedroomAbvGr (bedrooms above ground), KitchenAbvGr (kitchens above ground), TotRmsAbvGrd (total rooms above ground), and GarageArea. All except BedroomAbvGr and KitchenAbvGr have positive coefficients. Therefore, with all other variables held constant, the predicted SalePrice decreases as BedroomAbvGr or KitchenAbvGr increases, and it increases as any of the variables with a positive coefficient increases.
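The sign interpretation above can be sketched with a toy linear predictor. This is a Python illustration, not the project's R model: the intercept and coefficients are hypothetical values chosen only to show how a negative coefficient lowers the prediction when everything else is held fixed.

```python
# Hypothetical intercept and coefficients; NOT the fitted values from this project.
intercept = -50_000.0
coefs = {"GrLivArea": 60.0, "OverallQual": 15_000.0, "BedroomAbvGr": -8_000.0}


def predict(house):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefs[name] * house[name] for name in coefs)


base = {"GrLivArea": 1500, "OverallQual": 6, "BedroomAbvGr": 3}
more_bedrooms = {**base, "BedroomAbvGr": 4}  # same size and quality, one more bedroom
more_area = {**base, "GrLivArea": 1600}      # same bedrooms and quality, 100 more sq ft

bedroom_effect = predict(more_bedrooms) - predict(base)  # equals the BedroomAbvGr coefficient
area_effect = predict(more_area) - predict(base)         # 100 * the GrLivArea coefficient

print(predict(base), bedroom_effect, area_effect)
```

One plausible reading of a negative bedroom coefficient is that, for a fixed living area, more bedrooms means smaller rooms; the source does not test this, so treat it as an assumption.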

The significant predictor variables are those with p-values below 0.05. Overfitting can be an issue with regression models; however, the training sample is large and includes all of the variables needed for testing. With overfitting, results may be overly optimistic and findings difficult or impossible to replicate on other data sets, but that does not appear to be the case here. The R-squared and adjusted R-squared values indicate how much of the variation in SalePrice the model explains. They equal 0.8473 and 0.8423, respectively, meaning the model explains roughly 84% of the variation in the training data, which indicates a good fit. The predicted prices in Figure 6 are for the first 20 observations of the testing data once records with missing values have been removed. A data frame and line graph compare the actual with the predicted values. The actual prices are visually very similar to the predicted prices, another indication of a good fit.
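For readers unfamiliar with the two fit statistics, the sketch below computes R-squared and adjusted R-squared from their standard definitions. It is a Python illustration on hypothetical numbers, not a reproduction of the project's R output; the function names and the toy data are assumptions.

```python
def r_squared(actual, predicted):
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical prices in thousands of dollars (not the Ames data).
actual = [100, 150, 200, 250, 300]
predicted = [110, 140, 205, 245, 300]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=2)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R-squared is never larger than R-squared, and the small gap reported above (0.8473 versus 0.8423) suggests the predictors are not merely inflating the fit.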

Summary

Overall, the training set was large enough to create a reliable linear regression model to predict housing sale prices. The median prices in the summary data are the best indicators of typical prices because they are less affected by the significant outliers. Outliers in this set are values over $350,000, as shown on the boxplots for the testing and combined data sets. The training set was used to fit the linear regression model, and the testing set helped determine how well that model fit the data, which was judged to be a good fit based on sample size and R-squared values. The model identified the significant predictors, most of which had positive coefficients. That is, as those variables increased, the SalePrice also increased. The two variables with negative coefficients, BedroomAbvGr and KitchenAbvGr (bedrooms and kitchens above ground), indicate that as they increased, the SalePrice decreased.

Finally, once records with missing data were omitted from the test set, the first 20 rows were used to predict values with the linear regression model. These were then compared to the actual prices in a table. Because the table alone was hard to interpret, a line graph was used to visualize the comparison. Visually, the prices were very similar, another indication of a good fit.
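A visual comparison can be complemented with a numeric error summary. The sketch below is a Python illustration with hypothetical actual and predicted prices, not the project's test data; it computes the mean absolute error and root mean squared error, two common measures that the source does not report.

```python
import math

# Hypothetical actual and predicted sale prices (not the project's test rows).
actual = [120_000, 155_000, 180_000, 210_000, 305_000]
predicted = [126_000, 150_000, 184_000, 201_000, 290_000]

errors = [p - a for a, p in zip(actual, predicted)]

# Mean absolute error: typical size of a miss, in dollars.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: like MAE, but large misses weigh more.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"{'actual':>10} {'predicted':>10} {'error':>10}")
for a, p, e in zip(actual, predicted, errors):
    print(f"{a:>10,} {p:>10,} {e:>+10,}")
print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Reporting an error in dollars alongside the line graph makes it easier to judge whether "very similar" is close enough for a given use.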
