Python utilities to process and analyze travel behavior data collected by OneBusAway and exported as CSV files by oba-firebase-export.
This utility matches ground truth activities recorded in Excel (xlsx) format with the activity data generated by oba-firebase-export in CSV format. The algorithm matches each activity in the ground truth dataset with the nearest activity in the file generated by oba-firebase-export.
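As a rough illustration of the matching idea, the sketch below pairs each ground truth start time with the closest OBA start time and rejects pairs that differ by more than a tolerance (see the --tolerance option documented below). It is a minimal sketch, not the code in matchAndMerge.py: the function name `match_nearest`, the use of start times as the distance measure, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def match_nearest(gt_starts, oba_starts, tolerance_ms=3000):
    """Return, for each ground truth start, the index of the nearest OBA start.

    A ground truth start gets None when its nearest OBA start is farther away
    than tolerance_ms. Hypothetical helper, not part of matchAndMerge.py.
    """
    tolerance = timedelta(milliseconds=tolerance_ms)
    matches = []
    for gt_start in gt_starts:
        best_index, best_diff = None, None
        for index, oba_start in enumerate(oba_starts):
            diff = abs(oba_start - gt_start)
            if best_diff is None or diff < best_diff:
                best_index, best_diff = index, diff
        # Keep the nearest candidate only if it is within the tolerance.
        if best_diff is not None and best_diff <= tolerance:
            matches.append(best_index)
        else:
            matches.append(None)
    return matches


# Made-up sample start times, all in UTC.
base = datetime(2021, 3, 4, 21, 28, 15, tzinfo=timezone.utc)
gt_starts = [base, base + timedelta(minutes=13)]
oba_starts = [base + timedelta(seconds=2), base + timedelta(minutes=20)]

result = match_nearest(gt_starts, oba_starts, tolerance_ms=3000)
print(result)
```

In this sample the first ground truth activity is matched to the first OBA activity (2 seconds apart), while the second one stays unmatched because its nearest OBA activity starts 7 minutes later.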
To run the matchAndMerge script, use the python command and pass the desired arguments:
python matchAndMerge.py --obaFile obaFile.csv --gtFile gtFile.xlsx
matchAndMerge works on two input data files, one generated by OBA firebase export and one containing the ground truth data:
--obaFile <oba csv file>
A csv file generated by OBA firebase export. The csv file must include the following columns:
- Activity Start Date and Time* (UTC): Date and time recorded for the start of the activity, in the UTC time zone.
- Origin location Date and Time (*best) (UTC): Date and time recorded for the location where the activity started, in the UTC time zone.
- Duration* (minutes): Duration of the activity in minutes.
- Origin-Destination Bird-Eye Distance* (meters): Euclidean distance (meters) between the origin and destination recorded for the activity.
- Google Activity: Detected activity, including the Android supported activities plus the additional 'OBA firebase export' activities ('IN_VEHICLE', 'ON_BICYCLE', 'RUNNING', 'WALKING', 'WALKING/RUNNING', 'STILL').

--gtFile <ground truth xlsx file>
An xlsx file that must be formatted as shown below. The main (required) column descriptions are:
- GT_Collector: User name of the GT data collector.
- GT_Mode: Activity mode ('WALKING', 'IN_VEHICLE', 'STILL', 'ON_BICYCLE', 'IN_BUS').
- GT_Date: Date of the recorded activity.
- GT_TimeOrig: Time recorded at the origin of the recorded activity.
- GT_TimeMinuteRounded: One (1) if the GT_TimeOrig value was rounded to the closest minute while recording the activity, zero (0) otherwise.
- GT_TimeZone: Time zone for the recorded activity.
- GT_TimeDest: Time recorded at the destination of the recorded activity.
- GT_TimeDestMinuteRounded: One (1) if the GT_TimeDest value was rounded to the closest minute while recording the activity, zero (0) otherwise.

GT_Collector | GT_TourID | GT_TripID | GT_Mode | GT_Date | GT_TimeOrig | GT_TimeMinuteRounded | GT_TimeZone | GT_LatOrig | GT_LonOrig | GT_LocationOrig | GT_TimeDest | GT_TimeDestMinuteRounded | GT_LatDest | GT_LonDest | GT_LocDest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DoeJohn | 1 | 1 | IN_VEHICLE | 3/4/2021 | 3:28:15 PM | 0 | America/Chicago | 33.588713 | -76.33308 | 2045 Small St | 3:40:10 PM | 1 | 35.617885 | -76.312499 | 305 Large Dr |
DoeJohn | 1 | 2 | WALKING | 3/4/2021 | 3:41:51 PM | 0 | America/Chicago | 23.617885 | -86.312499 | 305 Holly Dr | 3:58:01 PM | 0 | 43.615829 | -86.305452 | Red Pen River |
DoeJohn | 1 | 3 | STILL | 3/4/2021 | 3:58:20 PM | 0 | America/Chicago | 35.615829 | -61.305452 | Red Pen River | 4:19:05 PM | 0 | 56.615829 | -61.305452 | Red Pen River |
DoeJohn | 1 | 4 | WALKING | 3/4/2021 | 4:20:00 PM | 1 | America/Chicago | 43.615829 | -67.305452 | Red Pen River | 4:59:15 PM | 0 | 65.617885 | -67.312499 | 305 Holly Dr |
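
Because the OBA timestamps are in UTC while the ground truth file stores a local date, a local time and a time zone name, the ground truth values have to be converted to UTC before they can be compared. The sketch below shows one way to do that with the Python standard library. It assumes the cells arrive as strings formatted like the sample table above (a spreadsheet reader may instead return date and time objects), and the helper name `gt_to_utc` is hypothetical, not taken from matchAndMerge.py.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def gt_to_utc(gt_date, gt_time, gt_timezone):
    """Combine GT_Date, GT_TimeOrig (or GT_TimeDest) and GT_TimeZone into UTC.

    Assumes string inputs such as "3/4/2021", "3:28:15 PM", "America/Chicago".
    """
    # Parse the local wall-clock date and time (month/day/year, 12-hour clock).
    local = datetime.strptime(f"{gt_date} {gt_time}", "%m/%d/%Y %I:%M:%S %p")
    # Attach the recorded time zone, then convert to UTC.
    return local.replace(tzinfo=ZoneInfo(gt_timezone)).astimezone(timezone.utc)


# First row of the sample ground truth table above.
start_utc = gt_to_utc("3/4/2021", "3:28:15 PM", "America/Chicago")
print(start_utc.isoformat())
```

On March 4, 2021 America/Chicago was on standard time (UTC-6), so the sample start time converts to 21:28:15 UTC.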
--outputDir <data folder>
Takes a string with the name of the folder where the merged data and log files will be stored. If the folder does not exist, the application will try to create it. The default value is merger_output. Example usage: --outputDir outputData will look for the folder outputData.

--minActivityDuration <minutes>
Minimum activity time span (in minutes). Shorter activities will be dropped before merging. The default value is 5 minutes. For example, --minActivityDuration 3 will remove, from the OBA generated data, activities whose duration is less than 3 minutes.

--minTripLength <meters>
Minimum distance (in meters) for a trip. Shorter trips will be dropped before merging. The default value is 50 meters. Example usage: --minTripLength 60 will remove, from the OBA generated data, activities whose Origin-Destination Bird-Eye Distance* (meters) is less than 60 meters.

--tolerance <milliseconds>
Maximum tolerated difference (in milliseconds) between the start of a ground truth activity and the start of the matched OBA activity. By default, it is 3000 milliseconds. Example usage: --tolerance 5000 will consider only a difference less than or equal to 5000 milliseconds while looking for a match between a ground truth activity start and an OBA activity start.

--iterateOverTol
When used, the merging process is applied over tolerances iterating from 30000 to tolerance in steps of 30000. By default, the merging process is applied only once, using the tolerance defined by tolerance. Example usage: --no-iterateOverTol.

--no-removeStillMode
When used, preprocessing of the input datasets will not eliminate the records with activity mode equal to STILL. By default, preprocessing eliminates the records with activity mode equal to STILL. Example usage: --no-removeStillMode.

--mergeOneToOne
This flag forces the merging system to merge each Ground Truth trip with one and only one OBA record, according to the other command line parameters. By default, this flag is set to False. In that case, the merger will match each Ground Truth trip with all the OBA records that start after the Ground Truth trip starts and before the Ground Truth trip ends. Example usage: --mergeOneToOne

--repeatGtRows
This flag forces the merging system to repeat a GT trip in as many rows as matches are found before exporting the output. By default, this flag is set to False. In that case, the merger will include only one GT data row per trip while merging with a device. Example usage: --repeatGtRows

--deviceList <User ID txt file>
Takes a string with the name of a txt file containing the IDs of the devices to be used for match and merge. The whole list of devices must go in the first row of the txt file and must be comma separated. Example usage: --deviceList "fileWithDeviceIDs.txt".
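
The preprocessing options above (minimum duration, minimum trip length and removal of STILL records) can be pictured as a simple row filter over the OBA data. The following is a minimal sketch under that reading, not the code used by matchAndMerge.py: the function name `preprocess`, its parameter names and the sample rows are assumptions, while the column names and default values are the ones documented above.

```python
DURATION_COL = "Duration* (minutes)"
DISTANCE_COL = "Origin-Destination Bird-Eye Distance* (meters)"
ACTIVITY_COL = "Google Activity"


def preprocess(rows, min_activity_duration=5, min_trip_length=50,
               remove_still_mode=True):
    """Drop OBA rows that are too short, too close, or (optionally) STILL.

    Hypothetical helper; rows are dicts keyed by the OBA csv column names.
    """
    kept = []
    for row in rows:
        if float(row[DURATION_COL]) < min_activity_duration:
            continue  # shorter than the minimum activity duration
        if float(row[DISTANCE_COL]) < min_trip_length:
            continue  # shorter than the minimum trip length
        if remove_still_mode and row[ACTIVITY_COL] == "STILL":
            continue  # STILL records are removed unless disabled
        kept.append(row)
    return kept


# Made-up sample rows for illustration only.
rows = [
    {ACTIVITY_COL: "WALKING", DURATION_COL: "10.78", DISTANCE_COL: "1980.3"},
    {ACTIVITY_COL: "WALKING", DURATION_COL: "3.03", DISTANCE_COL: "660.25"},
    {ACTIVITY_COL: "STILL", DURATION_COL: "20.0", DISTANCE_COL: "120.0"},
    {ACTIVITY_COL: "IN_VEHICLE", DURATION_COL: "12.0", DISTANCE_COL: "30.0"},
]

default_kept = preprocess(rows)
relaxed_kept = preprocess(rows, min_activity_duration=3, remove_still_mode=False)
print(len(default_kept), len(relaxed_kept))
```

With the documented defaults only the first sample row survives. Lowering the minimum duration to 3 minutes and keeping STILL records retains the first three rows, and the last row is still dropped because its distance is below 50 meters.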
The output csv file generated by the matchAndMerge.py script has the following format:
GT_Collector | GT_Date | GT_TimeOrig | GT_TimeOrigMinuteRounded | GT_TimeZone | GT_LatOrig | GT_LonOrig | GT_LocationOrig | GT_TimeDest | GT_TimeDestMinuteRounded | GT_LatDest | GT_LonDest | GT_LocDest | GT_Comments | GT_DateTimeCombined | GT_DateTimeDestCombined | GT_TourID | GT_TripID | GT_Mode | GT_DateTimeOrigUTC_Backup | GT_DateTimeDestUTC | Google Activity | Activity Start Date and Time* (UTC) | Activity Destination Date and Time* (UTC) | Manual Assignment | Trip ID | User ID | Device Trip ID | Google Activity Confidence | Time_Difference | Distance_Difference | Vehicle type | Region ID | Origin location Date and Time (*best) (UTC) | Activity Start/Origin Time Diff* (minutes) | Origin latitude (*best) | Origin longitude (*best) | Origin Horizontal Accuracy (meters) (*best) | Origin Location Provider (*best) | Destination Location Date and Time (*best) (UTC) | Activity End/Destination Time Diff* (minutes) | Destination latitude (*best) | Destination longitude (*best) | Destination Horizontal Accuracy (meters) (*best) | Destination Location Provider (*best) | Duration* (minutes) | Origin-Destination Bird-Eye Distance* (meters) | Chain ID | Chain Index | Tour ID | Tour Index | Ignoring Battery Optimizations | Talk Back Enabled | Power Save Mode Enabled | Origin fused Date and Time (UTC) | Origin fused latitude | Origin fused longitude | Origin fused Horizontal Accuracy (meters) | Origin gps Date and Time (UTC) | Origin gps latitude | Origin gps longitude | Origin gps Horizontal Accuracy (meters) | Origin network Date and Time (UTC) | Origin network latitude | Origin network longitude | Origin network Horizontal Accuracy (meters) | Destination fused Date and Time (UTC) | Destination fused latitude | Destination fused longitude | Destination fused Horizontal Accuracy (meters) | Destination gps Date and Time (UTC) | Destination gps latitude | Destination gps longitude | Destination gps Horizontal Accuracy (meters) | Destination network Date and Time (UTC) | Destination network latitude | Destination network longitude | Destination network Horizontal Accuracy (meters) | GT_DateTimeOrigUTC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DoeJohn | 8/14/21 | 14:39:00 | 1 | America/New_York | 36.147942 | -82.476045 | NoHo Flats - Home | 14:46:00 | 1 | 36.1475913 | -82.4718322 | N Newport and W Fig | 2021-08-24 14:39:00-04:00 | 2021-08-24 14:46:00-04:00 | 1 | 1 | WALKING | 2021-08-24 18:39:00+00:00 | 2021-08-24 18:46:00+00:00 | WALKING | 2021-08-24 18:43:49+00:00 | 2021-08-24T18:46:52Z | 234 | asieEWEfej2aejfh3r4wsp0s343q | 289 | 0.83 | 289 | 209.7216397 | 0 | 2021-08-24 18:44:27+00:00 | 0.6166667 | 36.1481099 | -82.4739184 | 30.714 | network | 2021-08-24 18:50:33+00:00 | 3.6666667 | 36.15018044 | -82.46762726 | 4.7475247 | gps | 3.0333333 | 660.2517 | 73 | 3 | FALSE | FALSE | FALSE | 2021-08-24T18:42:27Z | 36.1476199 | -82.4746442 | 15.102 | 2021-08-24T18:50:33Z | 36.15018044 | -82.46762726 | 4.7475247 | 2021-08-24T18:44:27Z | 36.1481099 | -82.4739184 | 30.714 | 2021-08-24T18:42:27Z | 36.1476199 | -82.4746442 | 15.102 | 2021-08-24T18:50:33Z | 36.15018044 | -82.46762726 | 4.7475247 | 2021-08-24T18:44:27Z | 36.1481099 | -82.4739184 | 30.714 | 2021-08-24 18:39:00+00:00 | |||||
DoeJohn | 8/14/21 | 14:46:00 | 1 | America/New_York | 36.1475913 | -82.4718322 | N Newport and W Fig | 14:57:00 | 1 | 36.1522225 | -70.4284092 | Publix Channelside | 2021-08-24 14:46:00-04:00 | 2021-08-24 14:57:00-04:00 | 1 | 2 | SCOOTER | 2021-08-24 18:46:00+00:00 | 2021-08-24 18:57:00+00:00 | ON_BICYCLE | 2021-08-24 18:47:21+00:00 | 2021-08-24T18:58:08Z | 235 | asieEWEfej2aejfh3r4wsp0s343q | 307 | 0.99 | 81 | 503.471453 | 0 | 2021-08-24 18:50:33+00:00 | 3.1833334 | 36.15018044 | -82.46762726 | 4.7475247 | gps | 2021-08-24 18:58:41+00:00 | 0.53333336 | 36.152163 | -70.4276277 | 17.765 | network | 10.783334 | 1980.3054 | 73 | 5 | FALSE | FALSE | FALSE | 2021-08-24T18:46:28Z | 36.1480456 | -82.4719809 | 79.973 | 2021-08-24T18:50:33Z | 36.15018044 | -82.46762726 | 4.7475247 | 2021-08-24T18:48:30Z | 36.1502028 | -82.4716052 | 87.6 | 2021-08-24T18:56:40Z | 36.152351 | -70.4286857 | 32.03 | 2021-08-24T19:00:33Z | 36.15139218 | -70.42861502 | 8.689676 | 2021-08-24T18:58:41Z | 36.152163 | -70.4276277 | 17.765 | 2021-08-24 18:46:00+00:00 | |||||
DoeJohn | 8/14/21 | 14:57:00 | 1 | America/New_York | 36.1522225 | -70.4284092 | Publix Channelside | 15:00:00 | 1 | 36.1512789 | -70.4287438 | Grand Central | 2021-08-24 14:57:00-04:00 | 2021-08-24 15:00:00-04:00 | 1 | 3 | WALKING | 2021-08-24 18:57:00+00:00 | 2021-08-24 19:00:00+00:00 | WALKING | 2021-08-24 18:58:08+00:00 | 2021-08-24T19:01:08Z | 236 | asieEWEfej2aejfh3r4wsp0s343q | 308 | 0.76 | 68 | 77.04583302 | 0 | 2021-08-24 18:58:41+00:00 | 0.53333336 | 36.152163 | -70.4276277 | 17.765 | network | 2021-08-24 19:00:41+00:00 | 0.43333334 | 36.1513342 | -70.4287049 | 10.911 | fused | 2.9833333 | 140.25807 | 73 | 6 | FALSE | FALSE | FALSE | 2021-08-24T18:56:40Z | 36.152351 | -70.4286857 | 32.03 | 2021-08-24T19:00:33Z | 36.15139218 | -70.42861502 | 8.689676 | 2021-08-24T18:58:41Z | 36.152163 | -70.4276277 | 17.765 | 2021-08-24T18:56:40Z | 36.152351 | -70.4286857 | 32.03 | 2021-08-24T19:00:33Z | 36.15139218 | -70.42861502 | 8.689676 | 2021-08-24T18:58:41Z | 36.152163 | -70.4276277 | 17.765 | 2021-08-24 18:57:00+00:00 |
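
Once the merged file exists, a common first check is how often the detected activity agrees with the ground truth mode. The sketch below counts exact matches between the GT_Mode and Google Activity columns using the standard csv module. It is only an illustration: the helper name `mode_agreement` is hypothetical, and the in-memory sample reproduces just those two columns from the three example rows above rather than a real output file.

```python
import csv
import io


def mode_agreement(rows):
    """Count rows whose GT_Mode equals Google Activity exactly.

    Returns (agreeing_rows, total_rows). Hypothetical helper for illustration.
    """
    agree = 0
    total = 0
    for row in rows:
        total += 1
        if row["GT_Mode"] == row["Google Activity"]:
            agree += 1
    return agree, total


# Two columns taken from the three sample output rows shown above.
sample = io.StringIO(
    "GT_Mode,Google Activity\n"
    "WALKING,WALKING\n"
    "SCOOTER,ON_BICYCLE\n"
    "WALKING,WALKING\n"
)

agree, total = mode_agreement(csv.DictReader(sample))
print(f"{agree} of {total} rows agree")
```

A strict string comparison treats SCOOTER versus ON_BICYCLE as a disagreement. Whether such pairs should count as equivalent is an analysis decision that this sketch does not make.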
This project was funded under the National Institute for Congestion Reduction (NICR).
/*
* Copyright (C) 2021 University of South Florida
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/