This tutorial develops a prediction model for the New York City Taxi Trip Duration on Kaggle. You may refer to the trip_time_prediction.pdf
for a detailed description of the model and features used for this task.
- Clone the repo
git clone https://github.com/ifiaposto/Taxi-Trip-Duration-Prediction.git
- Install the requirements
pip install -r requirements.txt
- Download the data
You can download the data from Kaggle. After downloading, unzip and save the CSV files to a directory called data
in the root of this repository.
For a detailed description of the files mentioned below, you may refer to the report trip_time_prediction.pdf
provided in the repository.
python3 traffic_binarization.py
This step computes the discretized traffic level per location and hour of the day and per location and day of the week. It generates the files:
traffic_location_hour.csv
traffic_location_weekday.csv
python3 zone_popularity.py
This step computes the discretized popularity level of each sub-area. It generates the files:
zone_populariy.csv
python3 graph_part.py
This step constructs the location graphs, partitions the graph given their popularity and computes the distance between each pair of sub areas. The discretization information used for the graph is provided in the directory called NYC_Maps
in the root of this repository. It generates the files:
taxi_zones_clusters.csv
distances.csv
python3 location_features.py
This step keeps the most popular clusters and locations that will be used as features for the regressor, and it creates the one hot representation for each subarea. It generates the generates the files:
popular_taxi_zones_binary_clusters.csv
one_hot_popular_taxi_zones_lookup.csv
python3 test_features.py
This step computes the features for the testing dataset. It generates the file:
X_test.csv
python3 learn.py
This file splits the dataset in the validation/ training dataset. It computes the traffic and location features for them. Subsequently, a random forest regressor is fitted. Finally, it predicts the trip duration for the testing dataset. It prints a log file with the importance of the features, the validation error achieved and the learning time.
Any questions and requests for access to the data can be directed to [email protected]