Input Checker
Activity-based travel models rely on data from a variety of sources (zonal data, highway networks, transit networks, synthetic population, etc.). A problem in any of these inputs can affect the accuracy of model outputs and/or cause run-time errors during the model run. It is therefore very important that the analyst carefully prepare and review all inputs prior to running the model. However, even with the best of efforts, errors in input data sometimes remain undetected. To aid the analyst in the input checking process, an automated Input Checker Tool was developed for use with the AB Model. The following sections describe the setup and application of this tool.
The Input Checker Tool (input checker) was implemented in Python and makes heavy use of the pandas and numpy packages. The input checker was integrated into the overall SANDAG AB model as an Emme tool; specifically, the input_checker.py Python script is called by the master_run.py Python script. The main inputs to the input checker are a list of ABM inputs, a list of QA/QC checks to be performed on those inputs, and the actual AB Model inputs. All inputs are read and loaded into memory as pandas DataFrames (2-dimensional data tables). The input checks are specified by the user as pandas expressions, which the input checker evaluates against the input DataFrames. The input checker generates a LOG file summarizing the results of all of the input checks.
When a scenario folder is created, an input checker (input_checker) directory is created within it. The input_checker directory initially contains only a config directory with two configuration files: a list of inputs and a list of checks. After the first successful input checker run, a logs directory is created where the LOG and summary text files are stored.
Input checker executes the following steps:
First, input checker reads all the inputs specified in the list of inputs and loads them to memory as pandas DataFrames.
Then, the list of input checks is read. The input checker loops through the list of input checks and evaluates each check as either True (passed) or False (failed). The result of each check is sent to the logging module. The user must specify the severity level of each check as Fatal, Logical, or Warning.
Next, an input checker log file is generated. The log includes the results of all checks; failed checks are moved up in order of the severity level specified for the test. A summary of input checker results is also generated that lists the number of passed and failed checks (per severity level).
The final step is to check for any fatal errors. When generating the log files, the input checker keeps track of the number of failed checks per severity level. If even a single check with a severity level of Fatal has failed, the model run is terminated and the user is notified that the input checker has failed.
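The steps above can be sketched as follows. This is an illustrative sketch, not the actual input_checker.py implementation; the function and column names are assumptions:

```python
import sys
import pandas as pd

SEVERITIES = ("Fatal", "Logical", "Warning")

def run_input_checker(tables: dict, checks: pd.DataFrame) -> dict:
    """Evaluate every check against the loaded DataFrames and count failures.

    `tables` maps input table names to DataFrames; `checks` mirrors rows of
    the checks configuration (hypothetical column names).
    """
    fails = {sev: 0 for sev in SEVERITIES}
    for _, check in checks.iterrows():
        # The expression references tables by name, e.g. "households.persons > 0"
        result = eval(check["Expression"], {}, tables)
        # A vector result passes only if every element is True
        passed = bool(pd.Series(result).all())
        if not passed:
            fails[check["Severity"]] += 1
    return fails

def terminate_on_fatal(fails: dict) -> None:
    # A single failed Fatal check halts the whole ABM run
    if fails["Fatal"] > 0:
        sys.exit(2)
```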
Configuring the input checker involves specifying both the inputs and the checks to be performed on them. This section describes the configuration details of the two settings files: `config/inputs_list.csv` and `config/checks_list.csv`.
Inputs on which QA/QC checks are to be performed are specified in the `config/inputs_list.csv` file. Each row in `inputs_list.csv` represents an ABM input. The attributes that the user must specify for each input are described in the table below:
Attribute | Description |
---|---|
Input_Table | The name of the input table. The inputs are loaded into input checker memory as data-frames under this name. |
Property_Token | For each input, its corresponding property token as listed in the SANDAG ABM Properties file. For each input's property token, input checker looks up the corresponding file path within the properties file. For Emme objects, this field should be specified as 'NA'. |
Emme_Object | The name of the Emme network object whose attributes must be exported. Must be specified as 'NA' for non-Emme network objects (i.e. CSV or DBF). The input checker is currently capable of reading the following Emme network objects: NODE, LINK, TRANSIT_LINE, and TRANSIT_SEGMENT. |
Fields | The list of attributes to be exported from the Emme network object. Refer to the Network Object Attributes page for a complete list of attributes per Emme network object. If all Emme network object attributes are desired, the user must specify 'All' for this field. All fields are read for CSV and DBF inputs. |
Column_Map | A mapping of column names can be specified if some columns must be renamed for easier reference. |
Input_Description | The description of the input file. |
All non-Emme network object inputs must be in either CSV or DBF format. For the Emme-based SANDAG ABM network inputs, an Emme database (i.e. Emmebank) contains the combined traffic and transit networks along with their associated attributes (refer to the Setup and Configuration page for more information on the Emme database). Since the input checker is called from an open Emme Modeller instance, it has full access to the Emme database, scenarios, and network. The input checker loads the Emme database, base scenario, and network, and then obtains the attributes of the specified Emme network objects. The user must specify each input as either an Emme object (e.g. LINK) or a CSV or DBF file in the `inputs` or `uec` sub-directories. The CSV inputs are read into memory from the specified sub-directory. If defined, columns are renamed per the user's specifications in the Column_Map column.
The user has the option to comment out inputs that should not be loaded. To comment out a line in `inputs_list.csv`, add a "#" in front of the table name. All inputs whose table name starts with "#" are ignored by the input checker.
The QA/QC checks to be performed on the ABM inputs are specified in the `config/checks_list.csv` file. Each row in `checks_list.csv` represents a specific operation to be performed on a specific input listed in `inputs_list.csv`. The listed operations are evaluated from top to bottom, and each operation is classified as either a Test or a Calculation. For Test operations, the pandas expression is evaluated and the result is sent to the logging module of the input checker for logging. For Calculation operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. The table below describes the various tokens that the user must specify for each Test or Calculation operation:
Attribute | Description |
---|---|
Test | The name of the QA/QC check. The check results are referenced using this name in the log file. For calculation operations, this becomes the name of the resulting object. |
Input_Table | The name of the input table on which the check is to be performed. The name must match the name specified under the Input_Table field in `inputs_list.csv`. |
Input_ID_Column | The name of the unique ID column. This serves as the input table's (i.e. DataFrame) row index by which checks and results are carried out and stored, respectively. |
Severity | The severity level of the test: Fatal, Logical, or Warning. |
Type | The type of operation: Test or Calculation. |
Expression | The pandas expression to be evaluated. |
Test_Vals | A list of values over which the test is repeated. The list must be comma-separated. The test for each value is logged separately. |
Report_Statistics | Any additional statistics from the test that must be reported to the log file. |
Test_Description | The description of the check that is being performed. |
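One way the Test/Calculation distinction can work in practice is for a Calculation's result to join the evaluation namespace under its Test name, so that later expressions can reference it. The sketch below illustrates this mechanism with the document's MGRA employment example; it is an assumption about the design, not the checker's actual code:

```python
import pandas as pd

def evaluate_checks(checks: pd.DataFrame, tables: dict) -> dict:
    """Run operations top to bottom; Calculation results feed later Tests."""
    namespace = dict(tables)  # input DataFrames, keyed by Input_Table
    outcomes = {}
    for _, op in checks.iterrows():
        result = eval(op["Expression"], {}, namespace)
        if op["Type"] == "Calculation":
            namespace[op["Test"]] = result   # reusable by later expressions
        else:
            outcomes[op["Test"]] = bool(pd.Series(result).all())
    return outcomes
```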
An important step in specifying checks is assigning a severity level to each check. The input checker allows the user to assign one of three severity levels to each QA/QC check: Fatal, Logical, or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:
- Fatal: If the input checker fails a Fatal check, it returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, the analyst should reserve the Fatal severity level for checks that must pass in order to proceed with a model run.
- Logical: The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in the inputs, the ABM outputs may not be very meaningful.
- Warning: The failure of these checks indicates an issue in the input data that is not significant enough to cause a run-time error or affect model outputs. However, these checks might reveal other problems related to data processing or data quality.
At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test expression must evaluate to a single logical value (`True` or `False`) or a vector of logical values; in other words, the Test expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities, and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.). The length of a vector result must equal the length of the input on which the check was performed. The result of a Calculation expression can be any Python data type to be used by a subsequent expression.
The success or failure of a check is decided based on the test result. In the case of a single-value result, the check fails if the result is `False`. In the case of a vector result, the test is declared failed if any value in the vector is `False`. Therefore, the expression must be designed to evaluate to `True` if there are no problems in the input data.
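This pass/fail rule can be written compactly. The helper below is a sketch of the logic, not the checker's actual code:

```python
import pandas as pd

def check_passed(result) -> bool:
    """A single False fails the check; a vector fails if ANY element is False."""
    return bool(pd.Series(result).all())
```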
Rules and conventions for writing input checker expressions are summarized below:

- Expressions must be valid Python/pandas expressions
- Expressions must be designed to evaluate to `False` to indicate any errors in the data
- Each expression must evaluate to logical value(s) (i.e. `True` or `False`)
- Each expression must be applied to a valid input table specified in `inputs_list.csv` or make use of intermediate tables created by preceding Calculation expressions
- Expressions must use the same table names as specified in `inputs_list.csv` or the Test name of the Calculation object
- Expressions must use the same field names as specified in `inputs_list.csv`. If a column map was specified, then the new names must be used
- Expressions can be looped over a list of Test_Vals to reduce the number of expressions
- The Report_Statistic must also be a valid Python/pandas expression and must evaluate to a single numeric value
- Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with "#" are ignored by the input checker
Below are some example expressions for different types of checks:
Check if the household income field exists in the input synthetic population. To perform this check for multiple fields, write the expression as follows and specify the list of field names under the Test_Vals token (separated by commas):
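A single-field version of this existence check might look like the following. Note that `hinc` is a hypothetical column name used for illustration, not necessarily the actual SANDAG synthetic-population field:

```python
import pandas as pd

# Illustrative households table; 'hinc' stands in for the income field.
households = pd.DataFrame({"hhid": [1, 2], "hinc": [52000, 31000]})

# Evaluates to a single True/False
income_field_exists = "hinc" in households.columns
```

For the Test_Vals variant, the same expression would be written once with the field name replaced by the checker's test-value placeholder; the exact placeholder syntax is defined by the input checker's implementation.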
Check if number of household inhabitants ('persons') is greater than zero for each household:
`households.persons > 0`
Check if each person's employment status ('pemploy') matches the predefined employment status categories (1,2,3,4):
`persons.pemploy.apply(lambda x: True if x in [1,2,3,4] else False)`
It is possible that all person records pass the above test but one of the employment status categories may not have a single person record. To check for such cases, the following expression can be used:
`set(persons.pemploy) == {1,2,3,4}`
Check if total employment across occupation categories sums to total employment for each Master Geographical Reference Area (MGRA). Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a Calculation expression:

`mgra_data[[col for col in mgra_data if (col.startswith('emp')) and not (col.endswith('total'))]].sum(axis=1)`
The result of the above expression is an MGRA-level vector: `mgra_total_employment`. Next, the total employment field can be compared against `mgra_total_employment`:

`mgra_data.emp_total == mgra_total_employment`
Check if household IDs start from 1 and are sequential:
`(min(households.hhid)==1) & (max(households.hhid)==len(households.hhid))`
To ensure that ABM outputs are meaningful, it is important to perform logical checks on the input data. One such check is to compare the sum of the employment land-use category fields per MGRA with the total employment field per MGRA; these values should match exactly. For this check, first the sum of workers per land-use category is calculated per MGRA. Then, this value is compared against the `emp_total` (i.e. total employment for the MGRA) field. This can be achieved with a Calculation operation and a subsequent Test operation whose test expression uses the calculated value, as shown below:
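Using the expressions from the earlier MGRA example, the Calculation/Test pair can be reproduced outside the checker as follows. The table contents are illustrative; the Calculation result corresponds to the object named `mgra_total_employment`:

```python
import pandas as pd

# Illustrative MGRA table with two employment categories and a total field
mgra_data = pd.DataFrame({
    "emp_office": [10, 4],
    "emp_retail": [5, 6],
    "emp_total":  [15, 10],
})

# Calculation operation: sum all 'emp*' columns except the total
mgra_total_employment = mgra_data[
    [col for col in mgra_data
     if col.startswith("emp") and not col.endswith("total")]
].sum(axis=1)

# Test operation: must hold for every MGRA
emp_totals_match = bool((mgra_data.emp_total == mgra_total_employment).all())
```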
While most of the above checks apply to link and node level attributes, some checks might be unique to other network objects such as transit routes. In Emme, transit line segment IDs must be unique. This requires performing a check on the transit line segment data as follows:

`len(set(transit_segments.id)) == len(transit_segments.id)`
The design of network level checks will depend on the transportation modeling software being used.
Since the input checker was integrated as an Emme tool, users may decide to run or skip the input checker prior to running the SANDAG AB model via the Master Run interface (as described in Run the Model). Alternatively, users may launch the input checker independently via its stand-alone Emme tool (SANDAG toolbox > Import > Input checker). To launch the tool, users simply need to click Run on the input checker GUI, as shown in the figure below.
The final output from the input checker is a log file written to the `input_checker/logs` directory. The log file is named `inputCheckerLog[RUN_DATE].LOG` and can be opened with any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.
The log file summarizes the results of all checks. However, the order in which they are presented depends on the severity level and the outcome of each check. The input checker organizes the check results under the following headings:
- IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
- ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
- WARNINGS: All failed WARNING checks are logged under this heading
- LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
A standard check log is generated for each check. The table below shows the elements of a check LOG:
Attribute | Description |
---|---|
Input File Name | The name of the input file on which the check was evaluated |
Input File Location | Path to the location of the input file. Not applicable to Emme Objects. |
Emme Object | The name of the Emme object, if applicable |
Input Description | The description of the input as specified in `inputs_list.csv` |
Test Name | The name of the test as specified in `checks_list.csv` |
Test Description | The description of the test |
Test Severity | The severity level of the test |
TEST RESULT | The result of the test - PASSED or FAILED |
TEST results for each test val | Test result for each Test_Vals (i.e. test value) on which the test was repeated |
Test Statistics | The value of the expression specified under the Report_Statistic token of `checks_list.csv`. The first 25 values are printed in case of vector results. |
ID Column | The name of the unique ID column of the input data table |
List of failed IDs | The first 25 IDs for which the test failed. This is generated in case of vector results. |
Number of failures | Total number of failures in case of vector result |
In addition to the log file, the input checker also produces a text file (`inputCheckerSummary.txt`) containing a summary of the number of input checker failures by severity level.
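A summary of this kind could be tallied from per-check results as sketched below; the result tuples (test name, severity, pass flag) and the check names are hypothetical:

```python
from collections import Counter

# Hypothetical logged results: (test name, severity, passed)
results = [
    ("net_speed_range", "Warning", False),
    ("hh_persons_gt0", "Logical", True),
    ("taz_ids_sequential", "Fatal", True),
]

n_passed = sum(1 for _, _, ok in results if ok)
fails = Counter(sev for _, sev, ok in results if not ok)
summary = f"Passed: {n_passed}; " + "; ".join(
    f"{sev} fails: {fails.get(sev, 0)}" for sev in ("Fatal", "Logical", "Warning")
)
```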