## Overview

This project involves the collection, cleaning, and analysis of Toronto real estate listings data. The goal is to extract valuable insights from the data and build a machine learning model to predict property prices.

## Table of Contents

- Data Collection
- Data Cleaning
- Database Creation
- Modeling
- Deployment
- Web Development
- Contributors
- References
  - General
  - Python Libraries
## Data Collection

The project collects Toronto real estate listings data from multiple sources, including web scraping and API requests. The collected data includes property details such as address, price, baths, beds, and geographical coordinates.
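The scraping step could be sketched with Python's standard-library `HTMLParser`. The markup and class names below are invented for illustration and are not Listing.ca's actual HTML:

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collect text from elements whose class attribute names a listing field.

    The field/class names here are assumptions for the sketch, not the
    real site's markup.
    """
    FIELDS = {"address", "price", "beds", "baths"}

    def __init__(self):
        super().__init__()
        self.current = None   # field whose text we are waiting for
        self.listing = {}     # extracted field -> value

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.FIELDS:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.listing[self.current] = data.strip()
            self.current = None

# Hypothetical listing card, stand-in for a fetched page.
html_snippet = """
<div class="card">
  <span class="address">123 Example Ave, Toronto</span>
  <span class="price">$899,000</span>
  <span class="beds">3</span>
  <span class="baths">2</span>
</div>
"""
parser = ListingParser()
parser.feed(html_snippet)
```

In practice each card parsed this way would become one row of the raw listings dataset.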
## Data Cleaning

The data cleaning process handles missing values and formatting issues, and extracts latitude and longitude using the Geoapify API. Additionally, luxury listings with more than 5 bathrooms or more than 4 beds were removed, and outliers were addressed using the Interquartile Range (IQR) method. The cleaned data is stored in CSV files in the `data` folder.
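The luxury filter and the IQR rule described above could be sketched in Pandas as follows; the column names and the 1.5 × IQR cutoff on price are assumptions:

```python
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop luxury listings (>5 baths or >4 beds), then remove price
    outliers with the 1.5 * IQR rule. Column names are assumptions."""
    df = df[(df["baths"] <= 5) & (df["beds"] <= 4)]
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[df["price"].between(lower, upper)]

# Small invented sample: one luxury listing and one price outlier.
raw = pd.DataFrame({
    "beds":  [2, 3, 4, 6, 3],
    "baths": [1, 2, 3, 7, 2],
    "price": [500_000, 700_000, 900_000, 5_000_000, 10_000_000],
})
cleaned = clean_listings(raw)  # the 6-bed/7-bath row and the $10M row are dropped
```

The cleaned frame would then be written out with `cleaned.to_csv(...)` into the `data` folder.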
## Database Creation

The cleaned data is imported into a PostgreSQL database named `listings_db` using SQLAlchemy. The database has a table named `toronto_listings` with columns such as `mls_id`, `property_type`, `address`, and more.
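The import step could look like the sketch below. An in-memory SQLite engine stands in for the project's PostgreSQL database so the sketch runs anywhere, and the sample rows and credentials-style URL in the comment are invented:

```python
import pandas as pd
from sqlalchemy import create_engine

# The project connects to PostgreSQL with a URL along the lines of
# "postgresql://user:password@localhost:5432/listings_db" (credentials
# hypothetical); SQLite in memory is used here for a self-contained demo.
engine = create_engine("sqlite://")

listings = pd.DataFrame({
    "mls_id": ["C1234567", "W7654321"],           # invented sample values
    "property_type": ["Condo Apt", "Detached"],
    "address": ["1 Yonge St", "25 Queen St W"],
    "price": [650_000, 1_250_000],
})

# Load the cleaned data into the toronto_listings table.
listings.to_sql("toronto_listings", engine, if_exists="replace", index=False)
```

Once loaded, the table can be queried back with `pd.read_sql("SELECT * FROM toronto_listings", engine)`.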
## Modeling

A Random Forest Regressor model has been implemented to predict property prices from features such as baths, beds, dens, relative latitude, and relative longitude. Model performance is evaluated with cross-validation, reporting a Mean Absolute Error (MAE) score for each fold. A neighbourhood-level analysis revealed varying prediction-error ratios across neighbourhoods, with particular attention given to those with only a small number of listings.
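The model and its cross-validated MAE evaluation could be sketched as below; the synthetic data and hyperparameters are placeholders, not the project's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned listings features:
# baths, beds, dens, relative latitude, relative longitude.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(1, 4, n),       # baths
    rng.integers(1, 5, n),       # beds
    rng.integers(0, 2, n),       # dens
    rng.normal(0, 0.1, n),       # relative latitude
    rng.normal(0, 0.1, n),       # relative longitude
])
y = 400_000 + 150_000 * X[:, 1] + 80_000 * X[:, 0] + rng.normal(0, 30_000, n)

model = RandomForestRegressor(n_estimators=100, random_state=42)

# scoring="neg_mean_absolute_error" returns negated MAE; flip the sign
# to get one positive MAE per fold.
mae_per_fold = -cross_val_score(model, X, y, cv=5,
                                scoring="neg_mean_absolute_error")
```

Per-fold MAE makes it easy to spot folds (and, grouped by neighbourhood, areas) where predictions are unusually far off.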
## Deployment

The machine learning model is deployed on cloud infrastructure, specifically Microsoft Azure. The deployment process involves the following steps:
1. **Model Serialization**: The trained Random Forest Regressor model is serialized using the `joblib` library.
2. **Flask API Endpoint**: A Flask web application serves as the API endpoint for the machine learning model. It uses Flask and Flask-CORS to handle HTTP requests and responses, providing seamless interaction with the deployed model.
3. **PostgreSQL Database Interaction**: SQLAlchemy is used to interact with the PostgreSQL database named `listings_db`, which stores the Toronto real estate listings.
4. **API Usage**: Users make HTTP POST requests to the Flask API endpoint, providing property features in the request body; the API responds with the predicted property price.
5. **Containerization**: The serialized model, database creation script, and Flask application are packaged in a Docker container, with dependencies specified in `requirements.txt` to ensure consistent and reproducible deployment across environments. When the container starts, the database is created and the Flask app is served with Gunicorn. The Docker image is then pushed to Docker Hub.
6. **Azure Web App**: The application is deployed using Azure Web App Services and the Docker image available on Docker Hub.
## Contributors

- Fanny Sigouin
- Jorge Nardy
- Kamal Farran
- Tania Barrera
## References

### General

- Geoapify API - Used for geocoding addresses and obtaining latitude and longitude.
- Listing.ca - Source of real estate data for Toronto listings.
- Pandas Documentation - Reference for data manipulation using Pandas.
- Regular Expressions in Python - Guide for using regular expressions in Python.
- Pathlib Documentation - Documentation for working with file paths using Pathlib.
- PostgreSQL Documentation - Official documentation for PostgreSQL.
- Scikit-learn Documentation - Documentation for the Scikit-learn machine learning library.
- Azure Documentation - Azure App Service documentation for setting up and deploying the application.
- Flask Documentation - Flask documentation for setting up API endpoint.
- Docker Documentation - Docker documentation for containerization in deployment.
- Bootstrap Documentation - Bootstrap documentation for setting up the HTML, CSS and JavaScript framework.
- SQLAlchemy Documentation - Reference for using SQLAlchemy for database interactions.
- Side Navigation - Code used to create the side navigation.
### Python Libraries

- Pandas - Powerful data manipulation library for Python.
- NumPy - Library for numerical operations in Python.
- Scikit-learn - Machine learning library for Python.