cc.data

The objective of cc.data is to build and process datasets that are ready to use in a new Curbcut instance, or to update data in an existing Curbcut instance. cc.data offers functions to write the processed data to, and read it from, AWS services such as RDS (Amazon Aurora Serverless - MySQL) and S3 buckets.

Installation

You can install the development version of cc.data from GitHub with:

# install.packages("devtools")
devtools::install_github("Curbcut/cc.data")

Usage

For census data building and processing, cc.data creates and works from long lists of tibbles. For faster building, the package relies on the future_*apply family of functions from the future.apply package for any computationally intensive work. These implementations of the apply family can run on any backend supported by future (parallelization). We therefore recommend defining a future backend at the beginning of the data building script.

future::plan(future::multisession)

Let’s also enable the global progression handler to get a progress bar during the more computationally intensive operations.

progressr::handlers(global = TRUE)

Database connections

Amazon Aurora Serverless - MySQL

To provide the ability to download subsets of the pre-processed data, cc.data uploads the data to a MySQL serverless database on AWS. To do so, the builder/uploader must have a database administrator user account. To read the data, a user account with read-only privileges must be created. To get one, please contact Max at [email protected]. The Curbcut team will create the user account using the db_create_user function.

db_create_user(user, password)

Once a user account is created, the credentials must be stored as environment variables in the .Renviron file, which can be opened with usethis::edit_r_environ().

# AWS Curbcut db user with read only privileges
CURBCUT_DB_USER = "user"
CURBCUT_DB_PASSWORD = "password"

S3 bucket

Every Curbcut instance has its own data folder in which unique pre-computed datasets are stored for the live application (built through the cc.buildr package). To allow for easier sharing of this and any other folder, cc.data has write/read functions for AWS S3 buckets. To write to and read from these buckets, a user account must be created through the AWS console under Security credentials > Access management > Users > Add users. The credential type is Access key - Programmatic access, and the attached policy must be either AmazonS3FullAccess to allow writing, or AmazonS3ReadOnlyAccess to allow reading only. The credentials must then be saved in the .Renviron file.

# AWS Curbcut bucket access for the 'x' user
CURBCUT_BUCKET_ACCESS_ID = "access_id"
CURBCUT_BUCKET_ACCESS_KEY = "access_pw"
CURBCUT_BUCKET_DEFAULT_REGION = "us-east-1"
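
cc.data's bucket read/write functions rely on these environment variables. As an illustration only (not cc.data's own helpers), here is a minimal sketch of using the same credentials with the aws.s3 package; the bucket name is hypothetical.

# Minimal sketch (assumes the aws.s3 package is installed): list the contents
# of a bucket using the credentials stored in .Renviron.
library(aws.s3)

bucket_contents <- get_bucket(
  bucket = "curbcut.instance.data",  # hypothetical bucket name
  key    = Sys.getenv("CURBCUT_BUCKET_ACCESS_ID"),
  secret = Sys.getenv("CURBCUT_BUCKET_ACCESS_KEY"),
  region = Sys.getenv("CURBCUT_BUCKET_DEFAULT_REGION")
)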

Build and process census data

Global cc.data census datasets

A set of pre-built tibbles ships with cc.data and, by default, informs the census building and processing functions:

  • cc.data::census_years: All available census years, starting in 1996;
  • cc.data::census_scales: All scales for which data should be built and processed (CSD, CT, DA);
  • cc.data::census_vectors_table: A Curbcut selection of census variables or aggregates of census variables. It holds a column for each census year with the CensusMapper codes of the variables to be downloaded, columns for the “parent” variables (e.g. the denominator when a Curbcut variable is a percentage), and the title and explanation of each variable;
  • cc.data::census_vectors: Only the variable codes of the census vectors. This is the vector fed to most function arguments asking for census vectors, and it retrieves and subsets census_vectors_table behind the scenes.

These are the only objects that need to be updated when a new census comes out, when a scale is added to the census data building, or when variables are added. The data building and processing functions use them by default; a quick way to inspect them is shown below.
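
All four objects are exported by the package and can be printed directly:

# Inspect the defaults that drive the census building and processing functions
cc.data::census_years
cc.data::census_scales
head(cc.data::census_vectors)
head(cc.data::census_vectors_table)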

Usage

Census data is built and processed through a depth-2 list. The first depth is the census scales, and the second depth is the census years for each census scale. For this reason, we can speed up operations by using two layers of futures. If memory and the number of threads allow, the optimal configuration is an outer layer as long as the number of census scales and an inner layer as long as the number of census years.

# Tweak futures
future::plan(list(future::tweak(future::multisession,
                                workers = length(cc.data::census_scales)),
                  future::tweak(future::multisession,
                                workers = length(cc.data::census_years))))

Census data building and processing workflow

Here is the census data building and processing workflow:

empty_geometries <- census_empty_geometries()
data_raw <- census_data_raw(empty_geometries)
interpolated <- census_interpolate(data_raw)
normalized <- census_normalize(interpolated)
parent_dropped <- census_drop_parents(normalized)
processed_census <- census_reduce_years(parent_dropped)
  1. Download empty geometries, for each census scale and each of their census years;
  2. Populate the empty geometries with raw data downloaded directly through the cancensus package, using the variable codes from census_vectors_table as well as the parent variables;
  3. Interpolate all data of all years to the most recent census year;
  4. Normalize data depending on whether it is classed as a number or a percentage, using the parent variable;
  5. Drop the parent variables;
  6. Reduce all years into one dataframe. The output of the last function is a list with only one depth, containing as many sf data.frames as there are census scales in cc.data::census_scales (a quick inspection sketch follows).
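
A minimal sketch of that inspection, assuming the list elements are named after the default scales:

# Quick sanity check on the final output (element names assume the default scales)
names(processed_census)      # e.g. "CSD" "CT" "DA"
class(processed_census$DA)   # an sf data.frame, one row per dissemination area
nrow(processed_census$DA)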

Census data upload to Amazon Aurora Serverless - MySQL

To provide the ability to download subsets of the pre-processed data, we store the raw census data of dissemination areas and the processed census data of all scales in a MySQL database. The raw census data is used when a builder of a Curbcut instance needs to interpolate census data to custom boundaries; interpolating from the raw census data provides the closest degree of accuracy possible. The two following functions are tailored exactly to the output of census_data_raw()$DA and the output of census_reduce_years().

db_write_processed_data(processed_census)
db_write_raw_data(DA_data_raw = data_raw$DA)

Build and process buildings across the country

Usage

Most of the operations for the building dataset are performed iteratively on each province. The work is very spatial-feature heavy and can hog a lot of memory: each worker of the future backend will use between 16 and 20 GB of memory.
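
Given that footprint, it can be worth capping the number of workers by available memory rather than by available cores; a minimal sketch, assuming roughly 20 GB per worker:

# Cap the number of parallel workers by memory rather than cores
# (assumption: a machine with 64 GB of RAM and ~20 GB needed per worker).
available_gb <- 64
future::plan(future::multisession,
             workers = max(1, floor(available_gb / 20)))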

Download and combine buildings datasets

Our building dataset is built using both OpenStreetMap and Microsoft's Canadian Building Footprints, the latter for places where OSM doesn't have great coverage. To free up memory between operations, the following function writes the building dataset for each province to a destination folder.

buildings_folder <- "buildings/"

buildings_sf(DA_processed_table = processed_census$DA,
             dest_folder = buildings_folder,
             OSM_cache = TRUE)

Reverse geocode

The reverse geocoding can be sped up by building our own docker container of Nominatim, which, when mounted using a .pbf (OpenStreetMap) file, can be used to find the nearest address of a latitude and longitude coordinate. This option is more than 100x faster than querying external servers to reverse geocode a large dataset like all the buildings in the country. The following will download and mount the docker image:

rev_geocode_create_local_nominatim()

Using a combination of the National Open Database of Addresses and the local instance of Nominatim, we reverse geocode all the buildings in the country. This takes an even harder toll on memory usage, as the local Nominatim instance also uses a lot of computational power to serve the reverse geocoding.

buildings_rev_geo <- rev_geocode_building(prov_folder = buildings_folder)

Buildings upload to Amazon Aurora Serverless - MySQL

The processed building dataset is stored in the MySQL database under the buildings table name, indexed by the dissemination area identifier. As the table is huge, we append data in waves of 50k buildings (last argument).

db_write_table(df = buildings_rev_geo, 
               tb_name = "buildings", 
               index = "DA_ID",
               rows_per_append_wave = 50000)

Build and process streets across the country

The function creating the streets downloads the most up-to-date road network from the Canadian government's website. The network is then cleaned and street names are standardized. In cases where a street name can't be put together, we use the local Nominatim instance created at the buildings step.

streets <- streets_sf(DA_processed_table = processed_census$DA)
db_write_table(df = streets, 
               tb_name = "streets", 
               index = "DA_ID",
               rows_per_append_wave = 25000)

Postal codes

A csv containing country-wide postal codes is available in Curbcut's S3 bucket. The following simply imports it, converts it to spatial features, and attaches a dissemination area identifier. It's saved in the MySQL database with a DA_ID index, so that only the postal codes of a particular territory under study need to be imported.

postal_codes <- postal_codes_process(processed_census$DA)
db_write_table(df = postal_codes, tb_name = "postal_codes", index = "DA_ID",
               rows_per_append_wave = 50000)

Travel time matrices

The travel time matrices are calculated using two different techniques depending on the mode. Walking, biking and driving are calculated using a local docker container of the modern C++ routing engine OSRM. The transit travel time matrices are calculated using a mix of OSRM and of General Transit Feed Specification (GTFS) files. They can all be solved with any backend supported by future (parallelization).

Foot, bicycle, car

We start by getting the centroids of every dissemination area and specifying where the routing data will be downloaded and saved. Inside this folder, we create subfolders for foot, bike and car.

DA_table <- processed_census$DA["ID"]
DA_table <- suppressWarnings(sf::st_transform(DA_table, 3347))
DA_table <- suppressWarnings(sf::st_centroid(DA_table))
DA_table <- suppressWarnings(sf::st_transform(DA_table, 4326))

dest_folder <- "routing/"
dir.create(dest_folder)
foot_folder <- paste0(dest_folder, "foot")
bike_folder <- paste0(dest_folder, "bike")
car_folder <- paste0(dest_folder, "car")
dir.create(foot_folder)
dir.create(bike_folder)
dir.create(car_folder)

Foot

For every mode, we will have to create a local OSRM docker instance. For further information, read the Project-OSRM/osrm-backend documentation. The following code sets it up by downloading the pbf files and opening a command-line terminal (only works on Windows right now).

tt_local_osrm(dest_folder = foot_folder, mode = "foot")

Once the docker container is properly set up, we calculate travel time matrices from every DA to every DA within a 10 km radius (the maximum one could possibly walk in an hour). The routing can be calculated in parallel and, for each pair of dissemination areas, makes a call to the local OSRM container. The output is a named list of two-column dataframes: each dataframe is a DA, the first column lists all the other DAs it can reach in an hour, and the second column is the time in seconds it takes to reach them on foot. We then save each individual dataframe as its own table in the MySQL database.

foot <- tt_calculate(DA_table_centroids = DA_table, max_dist = 10000)
db_write_ttm(ttm = foot, mode = "foot")
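
Before writing, the shape of the output can be checked; the DA identifier below is hypothetical:

# Each element of the list is named after a DA; its dataframe lists the DAs
# reachable within an hour and the travel time in seconds.
names(foot)[1:5]
str(foot[[1]])
head(foot[["24663358"]])  # hypothetical DA identifier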

To keep things clean, we can then remove the container from the active docker containers. We suggest not doing so for the walking mode, as it will be re-used when calculating travel times for transit.

shell(paste0('docker rm --force foot_osrm'))

Bicycle and car

The exact same process is repeated for the bicycle and car modes. We change the mode and update the maximum distance within which travel times between DAs should be calculated.

# Bike
tt_local_osrm(dest_folder = bike_folder, mode = "bicycle")
bicycle <- tt_calculate(DA_table_centroids = DA_table, max_dist = 30000,
                        routing_server = routing_server)
db_write_ttm(ttm = bicycle, mode = "bicycle")
shell(paste0('docker rm --force bicycle_osrm'))

# Car
tt_local_osrm(dest_folder = car_folder, mode = "car")
car <- tt_calculate(DA_table_centroids = DA_table, max_dist = 120000,
                    routing_server = routing_server)
db_write_ttm(ttm = car, mode = "car")
shell(paste0('docker rm --force car_osrm'))

Transit

TKTKTKTKTKTK TRANSIT

Additional datasets

The following are simple one-variable datasets used in any version of Curbcut. Their processing code takes a csv file living in an S3 bucket, cleans it, and saves it to the MySQL database.

Can-ALE

canale <- canale_process()
db_write_table(df = canale, 
               tb_name = "canale", 
               primary_key = "DA_ID",
               index = "DA_ID")

Can-BICS

canbics <- canbics_process()
db_write_table(df = canbics, 
               tb_name = "canbics", 
               primary_key = "DA_ID",
               index = "DA_ID")
