Input Checker
Activity-based travel models rely on data from a variety of sources (zonal data, highway networks, transit networks, synthetic population, etc.). A problem in any of these inputs can affect the accuracy of model outputs and/or cause run-time errors during the model run. It is therefore very important that the analyst carefully prepare and review all inputs prior to running the model. However, even with the best of efforts, errors in input data sometimes remain undetected. To aid the analyst in the input checking process, an automated Input Checker Tool was developed for use with the AB Model. The following sections describe the setup and application of this tool.
The Input Checker Tool (input checker) was implemented in Python and makes heavy use of the pandas and numpy packages. The input checker was integrated into the overall SANDAG AB model as an Emme tool; specifically, the input_checker.py Python script is called by the master_run.py Python script. The main inputs to the input checker are a list of ABM inputs, a list of QA/QC checks to be performed on those inputs, and the actual AB Model inputs. All inputs are read and loaded into memory as pandas DataFrames (2-dimensional data tables). The input checks are specified by the user as pandas expressions, which the input checker evaluates against the input DataFrames. The input checker generates a LOG file summarizing the results of all of the input checks.
When a scenario folder is created, an input checker (input_checker) directory is created within it. The input_checker directory initially contains only a config directory with two configuration files: a list of inputs and a list of checks. After the first successful input checker run, a logs directory is created where the LOG and summary text files are stored.
Input checker executes the following steps:
First, input checker reads all the inputs specified in the list of inputs and loads them to memory as pandas DataFrames.
Then, the list of input checks is read. The input checker loops through the list of input checks and evaluates each check as either True (passed) or False (failed). The result of each check is sent to the logging module. The user must specify the severity level of each check as Fatal, Logical, or Warning.
Next, an input checker log file is generated. The log includes the results of all checks; failed checks are moved up in order of the severity level specified for the test. A summary of input checker results is also generated that lists the number of passed and failed checks (per severity level).
The final step is to check for any fatal errors. When generating the log files, the input checker keeps track of the number of failed checks per severity level. If even a single check with a severity level of Fatal has failed, the model run is terminated and the user is notified that the input checker has failed.
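The steps above can be sketched as follows. This is an illustrative sketch, not the actual input_checker.py implementation; the function and column names are assumptions:

```python
import sys
import pandas as pd

SEVERITIES = ("Fatal", "Logical", "Warning")

def run_input_checker(tables: dict, checks: pd.DataFrame) -> dict:
    """Evaluate every check against the loaded DataFrames and count failures.

    `tables` maps input table names to DataFrames; `checks` mirrors rows of
    the checks configuration (hypothetical column names).
    """
    fails = {sev: 0 for sev in SEVERITIES}
    for _, check in checks.iterrows():
        # The expression references tables by name, e.g. "households.persons > 0"
        result = eval(check["Expression"], {}, tables)
        # A vector result passes only if every element is True
        passed = bool(pd.Series(result).all())
        if not passed:
            fails[check["Severity"]] += 1
    return fails

def terminate_on_fatal(fails: dict) -> None:
    # A single failed Fatal check halts the whole ABM run
    if fails["Fatal"] > 0:
        sys.exit(2)
```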
Configuring the input checker involves specifying both the inputs and the checks to be performed on them. This section describes the configuration details of the two settings files: `config/inputs_list.csv` and `config/checks_list.csv`.
Inputs on which QA/QC checks are to be performed are specified in the `config/inputs_list.csv` file. Each row in `inputs_list.csv` represents an ABM input. The attributes that the user must specify for each input are described in the table below:
Attribute | Description |
---|---|
Input_Table | The name of the input table. The inputs are loaded into input checker memory as data-frames under this name. |
Property_Token | For each input, its corresponding property token as listed in the SANDAG ABM Properties file. For each input's property token, input checker looks up the corresponding file path within the properties file. For Emme objects, this field should be specified as 'NA'. |
Emme_Object | The name of the Emme network object whose attributes must be exported. Must be specified as 'NA' for non-Emme network objects (i.e. CSV or DBF). The input checker is currently capable of reading the following Emme network objects: NODE, LINK, TRANSIT_LINE, and TRANSIT_SEGMENT. |
Fields | The list of attributes to be exported from the Emme network object. Refer to the Network Object Attributes page for a complete list of attributes per Emme network object. If all Emme network object attributes are desired, the user must specify 'All' for this field. All fields are read for CSV and DBF inputs. |
Column_Map | A mapping of column names can be specified if some columns must be renamed for easier reference. |
Input_Description | The description of the input file. |
All non-Emme network object inputs must be in either CSV or DBF format. For the Emme-based SANDAG ABM network inputs, an Emme database (i.e. Emmebank) contains the combined traffic and transit networks along with their associated attributes (refer to the Setup and Configuration page for more information on the Emme database). Since the input checker is called from an open Emme Modeller instance, it has full access to the Emme database, scenarios, and network. The input checker loads the Emme database, base scenario, and network, and then obtains the attributes of the specified Emme network objects. The user must specify each input as either an Emme object (e.g. LINK) or a CSV or DBF file in the `inputs` or `uec` sub-directories. The CSV inputs are read into memory from the specified sub-directory. If defined, columns are renamed per the user's specifications in the Column_Map column.
The user has the option to comment out inputs that should not be loaded. To comment out a line in `inputs_list.csv`, add a "#" in front of the table name. All inputs whose table name starts with "#" are ignored by the input checker.
The QA/QC checks to be performed on the ABM inputs are specified in the `config/checks_list.csv` file. Each row in `checks_list.csv` represents a specific operation to be performed on a specific input listed in `inputs_list.csv`. The listed operations are evaluated from top to bottom, and each operation is classified as either a Test or a Calculation. For Test operations, the pandas expression is evaluated and the result is sent to the logging module of the input checker for logging. For Calculation operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. The table below describes the various tokens that the user must specify for each Test or Calculation operation:
Attribute | Description |
---|---|
Test | The name of the QA/QC check. The check results are referenced using this name in the log file. For calculation operations, this becomes the name of the resulting object. |
Input_Table | The name of the input table on which the check is to be performed. The name must match the name specified under the Input_Table field in `inputs_list.csv`. |
Input_ID_Column | The name of the unique ID column. This serves as the input table's (i.e. DataFrame) row index by which checks and results are carried out and stored, respectively. |
Severity | The severity level of the test: Fatal, Logical, or Warning. |
Type | The type of operation: Test or Calculation. |
Expression | The pandas expression to be evaluated. |
Test_Vals | A list of values over which the test is repeated. The list must be comma-separated. The test for each value is logged separately. |
Report_Statistics | Any additional statistics from the test that must be reported to the log file. |
Test_Description | The description of the check that is being performed. |
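One way the Test/Calculation distinction can work in practice is for a Calculation's result to join the evaluation namespace under its Test name, so that later expressions can reference it. The sketch below illustrates this mechanism with the document's MGRA employment example; it is an assumption about the design, not the checker's actual code:

```python
import pandas as pd

def evaluate_checks(checks: pd.DataFrame, tables: dict) -> dict:
    """Run operations top to bottom; Calculation results feed later Tests."""
    namespace = dict(tables)  # input DataFrames, keyed by Input_Table
    outcomes = {}
    for _, op in checks.iterrows():
        result = eval(op["Expression"], {}, namespace)
        if op["Type"] == "Calculation":
            namespace[op["Test"]] = result   # reusable by later expressions
        else:
            outcomes[op["Test"]] = bool(pd.Series(result).all())
    return outcomes
```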
An important step in specifying checks is assigning a severity level to each check. The input checker allows the user to assign one of three severity levels to each QA/QC check: Fatal, Logical, or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:
- Fatal: If the input checker fails a Fatal check, it returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, the analyst should reserve the Fatal severity level for checks that must pass in order to proceed with a model run.
- Logical: The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in the inputs, the ABM outputs may not be very meaningful.
- Warning: The failure of these checks indicates an issue in the input data that is not significant enough to cause a run-time error or affect model outputs. However, these checks might reveal other problems related to data processing or data quality.
At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test expression must evaluate to a single logical value (`True` or `False`) or a vector of logical values; in other words, the Test expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities, and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.). The length of a vector result must equal the length of the input on which the check was performed. The result of a Calculation expression can be any Python data type to be used by a subsequent expression.
The success or failure of a check is decided based on the test result. In the case of a single-value result, the check fails if the result is `False`. In the case of a vector result, the test is declared failed if any value in the vector is `False`. Therefore, the expression must be designed to evaluate to `True` if there are no problems in the input data.
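This pass/fail rule can be written compactly. The helper below is a sketch of the logic, not the checker's actual code:

```python
import pandas as pd

def check_passed(result) -> bool:
    """A single False fails the check; a vector fails if ANY element is False."""
    return bool(pd.Series(result).all())
```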
Rules and conventions for writing input checker expressions are summarized below:

- Expressions must be valid Python/pandas expressions
- Expressions must be designed to evaluate to `False` to indicate any errors in the data
- Each expression must evaluate to logical value(s) (i.e. `True` or `False`)
- Each expression must be applied to a valid input table specified in `inputs_list.csv` or make use of intermediate tables created by preceding Calculation expressions
- Expressions must use the same table names as specified in `inputs_list.csv` or the Test name of the Calculation object
- Expressions must use the same field names as specified in `inputs_list.csv`. If a column map was specified, then the new names must be used
- Expressions can be looped over a list of Test_Vals to reduce the number of expressions
- The Report_Statistic must also be a valid Python/pandas expression and must evaluate to a single numeric value
- Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with "#" are ignored by the input checker
Below are some example expressions for different types of checks:
Check if the household income field exists in the input synthetic population. To perform this check for multiple fields, write the expression as follows and specify the list of field names under the Test_Vals token (separated by commas):
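A single-field version of this existence check might look like the following. Note that `hinc` is a hypothetical column name used for illustration, not necessarily the actual SANDAG synthetic-population field:

```python
import pandas as pd

# Illustrative households table; 'hinc' stands in for the income field.
households = pd.DataFrame({"hhid": [1, 2], "hinc": [52000, 31000]})

# Evaluates to a single True/False
income_field_exists = "hinc" in households.columns
```

For the Test_Vals variant, the same expression would be written once with the field name replaced by the checker's test-value placeholder; the exact placeholder syntax is defined by the input checker's implementation.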
Check if number of household inhabitants ('persons') is greater than zero for each household:
`households.persons > 0`
Check if each person's employment status ('pemploy') matches the predefined employment status categories (1,2,3,4):
`persons.pemploy.apply(lambda x: True if x in [1,2,3,4] else False)`
It is possible that all person records pass the above test but one of the employment status categories may not have a single person record. To check for such cases, the following expression can be used:
`set(persons.pemploy) == {1,2,3,4}`
Check if total employment across occupation categories sums to total employment for each Master Geographical Reference Area (MGRA). Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a Calculation expression:

`mgra_data[[col for col in mgra_data if (col.startswith('emp')) and not (col.endswith('total'))]].sum(axis=1)`
The result of the above expression is an MGRA-level vector: `mgra_total_employment`. Next, the total employment field can be compared against `mgra_total_employment`:

`mgra_data.emp_total == mgra_total_employment`
Check if household IDs start from 1 and are sequential:
`(min(households.hhid)==1) & (max(households.hhid)==len(households.hhid))`
To ensure that ABM outputs are meaningful, it is important to perform logical checks on the input data. One such check is to compare the sum of the employment land-use category fields per MGRA with the total employment field per MGRA; these values should match exactly. For this check, first the sum of workers per land-use category is calculated per MGRA. Then, this value is compared against the `emp_total` (i.e. total employment for the MGRA) field. This can be achieved with a Calculation operation and a subsequent Test operation whose test expression uses the calculated value, as shown below:
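Using the expressions from the earlier MGRA example, the Calculation/Test pair can be reproduced outside the checker as follows. The table contents are illustrative; the Calculation result corresponds to the object named `mgra_total_employment`:

```python
import pandas as pd

# Illustrative MGRA table with two employment categories and a total field
mgra_data = pd.DataFrame({
    "emp_office": [10, 4],
    "emp_retail": [5, 6],
    "emp_total":  [15, 10],
})

# Calculation operation: sum all 'emp*' columns except the total
mgra_total_employment = mgra_data[
    [col for col in mgra_data
     if col.startswith("emp") and not col.endswith("total")]
].sum(axis=1)

# Test operation: must hold for every MGRA
emp_totals_match = bool((mgra_data.emp_total == mgra_total_employment).all())
```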
While most of the above checks apply to link and node level attributes, some checks might be unique to other network objects such as transit routes. In Emme, transit line segment IDs must be unique. This requires performing a check on the transit line segment data as follows:

`len(set(transit_segments.id)) == len(transit_segments.id)`
The design of network level checks will depend on the transportation modeling software being used.
Since the input checker was integrated as an Emme tool, users may decide to run or skip the input checker prior to running the SANDAG AB model via the Master Run interface (as described in Run the Model). Alternatively, users may launch the input checker independently via its stand-alone Emme tool (SANDAG toolbox > Import > Input checker). To launch the tool, users simply need to click Run on the input checker GUI, as shown in the figure below.
The final output from the input checker is a log file written to the `input_checker/logs` directory. The log file is named `inputCheckerLog[RUN_DATE].LOG` and can be opened with any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.
The log file summarizes the results of all checks. However, the order in which they are presented depends on the severity level and the outcome of each check. The input checker organizes the check results under the following headings:
- IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
- ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
- WARNINGS: All failed WARNING checks are logged under this heading
- LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
A standard check log is generated for each check. The table below shows the elements of a check LOG:
Attribute | Description |
---|---|
Input File Name | The name of the input file on which the check was evaluated |
Input File Location | Path to the location of the input file. Not applicable to Emme Objects. |
Emme Object | The name of the Emme object, if applicable |
Input Description | The description of the input as specified in `inputs_list.csv` |
Test Name | The name of the test as specified in `checks_list.csv` |
Test Description | The description of the test |
Test Severity | The severity level of the test |
TEST RESULT | The result of the test - PASSED or FAILED |
TEST results for each test val | Test result for each Test_Vals (i.e. test value) on which the test was repeated |
Test Statistics | The value of the expression specified under the Report_Statistic token of `checks_list.csv`. The first 25 values are printed in case of vector results. |
ID Column | The name of the unique ID column of the input data table |
List of failed IDs | The first 25 IDs for which the test failed. This is generated in case of vector results. |
Number of failures | Total number of failures in case of vector result |
In addition to the log file, the input checker also produces a text file (`inputCheckerSummary.txt`) containing a summary of the number of input checker failures by severity level.
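A summary of this kind could be tallied from per-check results as sketched below; the result tuples (test name, severity, pass flag) and the check names are hypothetical:

```python
from collections import Counter

# Hypothetical logged results: (test name, severity, passed)
results = [
    ("net_speed_range", "Warning", False),
    ("hh_persons_gt0", "Logical", True),
    ("taz_ids_sequential", "Fatal", True),
]

n_passed = sum(1 for _, _, ok in results if ok)
fails = Counter(sev for _, sev, ok in results if not ok)
summary = f"Passed: {n_passed}; " + "; ".join(
    f"{sev} fails: {fails.get(sev, 0)}" for sev in ("Fatal", "Logical", "Warning")
)
```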