Poetry is an alternative package management and environment management system. It differs from conda in that it manages the project package dependencies as well as the environment. https://python-poetry.org/
Install poetry:
pip install poetry
Install the project dependencies:
poetry install
If a lockfile exists, it will try to use that first. A lockfile provides an exact snapshot of the dependencies that were installed when the lockfile was created and avoids needing to resolve dependencies again. If you don't have a lockfile, you can create one after installing the dependencies:
poetry lock
This will create a poetry.lock file by resolving the dependencies declared in the project's pyproject.toml.
I created an environment setup batch script to create a conda environment. You can use whichever Python environment management system you prefer, but I will demonstrate using Mamba, a faster conda alternative. The script essentially runs the following steps:
- Create a new conda environment: You can install and run PopulationSim in the default “base” Python environment, or you can set up a new environment:
conda create --name bts_popS python=3.11.5
Use the following command to activate the environment:
conda activate bts_popS
- Install all dependencies and PopulationSim: After cloning this repository, install all dependencies and the forked version of PopulationSim either directly from GitHub or with Conda/Mamba:
2a. Installing from GitHub: Install the PopulationSim fork directly from GitHub using pip. This will install all dependencies and the forked version of PopulationSim into your current Python environment.
pip install git+https://github.com/nick-fournier-rsg/[email protected]#egg=populationsim
2b. Installing with Conda/Mamba: The easiest way to install the fork is to use Conda or Mamba. This will install all dependencies and the forked version of PopulationSim.
Detailed installation directions can be found at this link:
https://github.com/nick-fournier-rsg/populationsim
- Install required Python packages: Install all the Python packages required to run populationsim in the environment using the following command:
pip install -r requirements.txt
I created a set of data preparation scripts that fetch Census PUMS and ACS data and prepare it for use with populationsim. The scripts are located in the ./setup folder.
To configure the setup scripts (e.g., selecting states, ACS years, fields), edit the settings.py file. The settings file is a Python module that is imported by the setup scripts. An example settings file is shown below:
import os
from us import states  # assumption: 'states' comes from the 'us' package

# User-defined constants
YEAR = 2019
POPSIM_DIR = 'C:/gitclones/bts_populationsim/populationsim'
SETUP_DIR = 'C:/gitclones/bts_populationsim/setup'
RAW_DATA_DIR = os.path.join(SETUP_DIR, 'raw')

# Specific states, or just list all states in the 'states' module!
# STATES = ['VT', 'AK', 'ND', 'SD', 'WY', 'RI']  # Explicitly list specific states
STATES = [x.abbr for x in states.STATES]  # Lists all states in the states module
ACS_TYPE = 'acs5'

# You must define the PUMS fields you want to use for households and persons,
# grouped in a nested dictionary by table. The fields must also specify the
# data type (int, float, str, etc.) to ensure that the data is read in correctly.
PUMS_FIELDS = {
    'HH': {
        'SERIALNO': str,
        'PUMA': int,
        'WGTP': int,
        'NP': int,
        'HINCP': int,
        'VEH': int,
        'HUPAC': int,
    },
    'PER': {
        'SPORDER': int,
        'PWGTP': int,
        'JWTRNS': int,
        'ESR': int,
        'SCH': int,
        'SCHG': int,
        'AGEP': int,
        'SEX': int,
        'RAC1P': int,
        'HISP': int,
        'WKHP': int,
    }
}
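As a rough illustration of why each field declares a type, the sketch below applies a type map like PUMS_FIELDS while reading a CSV. It uses only the standard library and is not the setup scripts' actual reader; the cast_row helper and the sample rows are hypothetical.

```python
import csv
import io

# Subset of the PUMS_FIELDS mapping from the example settings file.
PUMS_FIELDS = {
    'HH': {'SERIALNO': str, 'PUMA': int, 'WGTP': int, 'NP': int},
}


def cast_row(row, field_types):
    """Keep only the configured fields and cast each value to its declared type."""
    return {name: cast(row[name]) for name, cast in field_types.items()}


# Hypothetical sample data standing in for a downloaded PUMS household file.
sample_csv = io.StringIO(
    "SERIALNO,PUMA,WGTP,NP,EXTRA\n"
    "2019HU0000001,00100,25,3,x\n"
    "2019HU0000002,00200,40,1,y\n"
)

households = [cast_row(r, PUMS_FIELDS['HH']) for r in csv.DictReader(sample_csv)]
print(households[0])
```

Fields not listed in the mapping (EXTRA here) are dropped. Note that declaring PUMA as int discards leading zeros, whereas SERIALNO stays a string.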
The setup scripts require a Census API key. You can get one here: https://api.census.gov/data/key_signup.html. Once you have one, set it as an environment variable in a .env file in the root of the repository, where it is read by the setup scripts. The .env file should look like this:
CENSUS_API_KEY="YOUR_API_KEY"
The .env file is ignored by git, which keeps your API key private, so you will need to create one yourself in your local repository. The key is then loaded into the Python environment in the settings.py file using the python-dotenv package with load_dotenv().
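A minimal sketch of how the key might be read is shown below. The get_census_api_key helper is hypothetical and not part of the repository; only load_dotenv() and the CENSUS_API_KEY variable name come from the description above.

```python
import os


def get_census_api_key():
    """Return the Census API key from the environment, loading .env first if possible."""
    try:
        # python-dotenv reads KEY=VALUE pairs from a .env file into os.environ.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        # Without python-dotenv, fall back to variables already set in the shell.
        pass

    key = os.environ.get("CENSUS_API_KEY")
    if not key:
        raise RuntimeError("CENSUS_API_KEY is not set; add it to your .env file.")
    return key
```

Failing early with a clear message is a design choice for this sketch; it avoids a less obvious error later when the first Census API request is made.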
In addition to the populationsim-specific configs in the configs folder, there is a controls_aggregator.csv file that is used to aggregate the PUMS data to the control totals. The controls_aggregator.csv file is a CSV file with the following columns:
- field: [mandatory] The field code in the Census (e.g., B01001_003E)
- type: [mandatory] The data type of the field (e.g., int or str)
- geography: [mandatory] The geography at which the control total is available (e.g., BG or TRACT)
- control_field: [user defined] The control field to aggregate to (e.g., H_RACE_AAPI or H_INC_1)
- original_label: [optional] The original Census label of the field
- concept: [optional] The Census concept of the field (e.g., SEX BY AGE)
- group: [optional] The table that the field belongs to (e.g., B01001)
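To make the role of these columns concrete, here is a small sketch of the aggregation idea: every Census field mapped to the same control_field is summed within each geographic unit. It is an illustration under assumptions, not the setup scripts' implementation. The aggregate_controls function, the control field names P_AGE_0_4 and P_MALE, the GEOIDs, and all values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical aggregator rows using the columns described above.
aggregator_csv = io.StringIO(
    "field,type,geography,control_field\n"
    "B01001_003E,int,BG,P_AGE_0_4\n"
    "B01001_027E,int,BG,P_AGE_0_4\n"
    "B01001_002E,int,BG,P_MALE\n"
)

# Hypothetical ACS values for two block groups, keyed by GEOID.
acs_values = {
    "500010001001": {"B01001_003E": "12", "B01001_027E": "9", "B01001_002E": "310"},
    "500010001002": {"B01001_003E": "5", "B01001_027E": "7", "B01001_002E": "150"},
}

# Only numeric types can be summed in this sketch.
CASTS = {"int": int, "float": float}


def aggregate_controls(aggregator_rows, values_by_geoid, geography):
    """Sum every field mapped to the same control_field within each geographic unit."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in aggregator_rows:
        if row["geography"] != geography:
            continue
        cast = CASTS[row["type"]]
        for geoid, values in values_by_geoid.items():
            totals[geoid][row["control_field"]] += cast(values[row["field"]])
    return {geoid: dict(fields) for geoid, fields in totals.items()}


controls = aggregate_controls(list(csv.DictReader(aggregator_csv)), acs_values, "BG")
print(controls)
```

Here the male and female under-5 fields of table B01001 both map to one control, so their counts are added together for each block group.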
You may populate this file from the raw/acs_fields.csv and raw/acs_tables.csv files, which simply list the available tables and their fields.
The setup scripts have three main components:
- create_acs_targets(): This function fetches the ACS data from the Census API and caches it into local parquet files. It then aggregates the fields and saves the aggregated data to control_totals CSV files in the populationsim/data folder.
- create_seeds(): This function fetches the PUMS data from the Census API and caches it into local parquet files. It then formats the fields and saves the seed data to seed_household and seed_person CSV files in the populationsim/data folder.
- create_crosswalk(): This function fetches the relevant geography files (e.g., block groups, tracts, PUMAs, etc.), saves them locally in the setup/raw/shp folder, and creates a crosswalk between the PUMS and ACS geographies. The crosswalk is saved to the populationsim/data folder.
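The crosswalk idea can be sketched as follows. This is not the code in create_crosswalk(), which works from downloaded geography files. It only illustrates the nesting the crosswalk encodes, relying on the fact that a 12-digit block group GEOID begins with its 11-digit tract GEOID. The function names, the BG/TRACT/PUMA column names, and the sample identifiers are assumptions for illustration.

```python
def tract_from_block_group(bg_geoid):
    """Return the 11-digit tract GEOID that prefixes a 12-digit block group GEOID."""
    return bg_geoid[:11]


def build_crosswalk(block_groups, tract_to_puma):
    """Return one row per block group linking BG -> TRACT -> PUMA."""
    rows = []
    for bg in block_groups:
        tract = tract_from_block_group(bg)
        rows.append({"BG": bg, "TRACT": tract, "PUMA": tract_to_puma[tract]})
    return rows


# Hypothetical inputs: block group GEOIDs and a tract-to-PUMA lookup.
block_groups = ["500010960001", "500010960002", "500070001001"]
tract_to_puma = {"50001096000": "00400", "50007000100": "00100"}

crosswalk = build_crosswalk(block_groups, tract_to_puma)
for row in crosswalk:
    print(row)
```

Identifiers are kept as strings in this sketch so that leading zeros in state and PUMA codes are preserved.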
The entire process can be run from the batch_run.py script. This script acts as the single point of entry for the data preparation, populationsim, and validation steps.
It can be run from the command line with the following command:
python batch_run.py