USCensus 2020 Package and Modernization

Shreshtha edited this page Apr 18, 2022 · 15 revisions

Background

The United States conducts a decennial census, and that data is used pervasively in academia, industry and government. Although there have been some delays, the 2020 data is finally going to be published this summer.

Related work

There are two types of packages for using US census data:

  • Prepackaged Data

    • UScensus
    • UScensus2010
  • API Based

    • tidycensus
    • censusapi

Limitations: The prepackaged data packages do not include the 2020 data and are fairly old. There have also been many advances in the R ecosystem, particularly in spatial tooling, that a new package should leverage. On the other hand, API-based packages require account registration, which is undesirable in a teaching context, and network connectivity for incremental data access, which is undesirable in an HPC/cluster computing context.

Details of your coding project

This project will develop a new family of packages, UScensus2020, that will provide easy access to the demographic data and shapefiles released by the US Census Bureau.

The majority of the suite will be packages containing data and shapes at various levels of reporting, such as state, county, tract, and block. The student should automate as much of the build as is practical. If possible, the 2010 data sets should also be rebuilt using modern libraries (e.g., sf). These data packages can be quite large, and will be hosted on R-Forge or elsewhere.

The main package will provide helper functions for the data packages and will be submitted to CRAN.

Data: Package a cleansed version of 2020 SF1 at different reporting levels.

Functions: Helper functions for retrieving different data packages. Ideally, they are also roughly compatible with the 2010 data.
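As a rough illustration of what such a helper could look like, the sketch below maps a reporting level and census year to a data package name and loads a data set from it. The function names (`census_package_name`, `load_census_data`) and the one-package-per-year-and-level naming scheme are assumptions made for this example, not a settled design.

```r
# Hypothetical naming scheme: one data package per census year and reporting
# level, e.g. "UScensus2020county". These names are assumptions.
census_package_name <- function(level = c("state", "county", "tract", "block"),
                                year = 2020) {
  level <- match.arg(level)
  stopifnot(year %in% c(2010, 2020))
  paste0("UScensus", year, level)
}

# Load one data set from the matching data package, failing with a clear
# message when that package is not installed (no network access is needed).
load_census_data <- function(name, level = "county", year = 2020) {
  pkg <- census_package_name(level, year)
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Data package '", pkg, "' is not installed.", call. = FALSE)
  }
  env <- new.env()
  utils::data(list = name, package = pkg, envir = env)
  env[[name]]
}

census_package_name("tract")
```

Keeping the year as an argument is one way to let the same call work against both 2010 and 2020 data packages.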

Docs: Each function should be documented such that the package passes R CMD check --as-cran. Each column in the underlying data sets should be documented.
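One common way to document data set columns is a roxygen2 block placed above the data set's name in an R source file. The sketch below assumes a data set called `county_2020` with columns `geoid` and `pop_total`; all three names are hypothetical.

```r
#' County-level population counts (hypothetical example data set)
#'
#' @format A data frame with one row per county:
#' \describe{
#'   \item{geoid}{County identifier, stored as a character string.}
#'   \item{pop_total}{Total population count.}
#' }
#' @source US Census Bureau
"county_2020"
```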

Tests: Underlying data sets should be validated (e.g., populations cannot be negative).
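A minimal sketch of such a validation step follows, using base R only. The function name `validate_census_table` and the column names `geoid` and `pop_total` are hypothetical, and the toy data frame is made up solely to trigger the checks.

```r
# Return a character vector describing problems in one data set
# (empty when every check passes). Column names are hypothetical.
validate_census_table <- function(df, id_col = "geoid",
                                  count_cols = "pop_total") {
  missing_cols <- setdiff(c(id_col, count_cols), names(df))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  if (anyDuplicated(df[[id_col]]) > 0) {
    problems <- c(problems, paste("duplicated values in", id_col))
  }
  for (col in count_cols) {
    x <- df[[col]]
    if (!is.numeric(x)) {
      problems <- c(problems, paste("non-numeric column", col))
      next
    }
    if (anyNA(x)) {
      problems <- c(problems, paste("missing values in", col))
    }
    if (any(x < 0, na.rm = TRUE)) {
      problems <- c(problems, paste("negative values in", col))
    }
    if (any(x != round(x), na.rm = TRUE)) {
      problems <- c(problems, paste("non-integer values in", col))
    }
  }
  problems
}

# Made-up rows, not census data: a duplicated id, a negative count, and an NA.
toy <- data.frame(geoid = c("A", "B", "B"), pop_total = c(10, -5, NA))
validate_census_table(toy)
```

Returning a vector of messages rather than stopping at the first failure lets a package test report every problem in a data set at once, for example by asserting that the result has length zero.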

Vignettes: There should be a vignette that works through an example using 2020 data in sufficient detail to run a short workshop. If the mentee is on the academic track, we can also coach them through preparing an article and submitting it to an appropriate venue such as JSS.

Expected impact

This project potentially has an extremely broad impact because of the widespread use of census data. Social scientists and industry will use it for demographic data, and governments will use it to set policy.

For example, census data is used in the BISG methodology to audit credit models for fair lending compliance.

Mentors

Contributors, please contact mentors below after completing at least one of the tests below.

  • EVALUATING MENTOR: Zack Almquist [email protected] is the author of the UScensus2010 suite of R packages and a professor of sociology and statistics. He is also Senior Data Scientist at the eScience Institute at the University of Washington and the Training Core PI for the Center for Studies in Demography and Ecology at UW.
  • Neal Fultz [email protected] previously mentored a GSoC project for the grpc R package and is the author of grpc and other packages. He is the data science lead at UCLA Social Science Computing and the principal of NJNM consulting, and was co-organizer of the Los Angeles R User Group from 2010 to 2014.
  • Mike Tzen [email protected] is Director of Statistics for UCLA California Center for Population Research, and was previously a mathematical statistician at the US Census.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

  • Easy: Choose a city in the USA and plot an age pyramid for it using the 2010 census data. Compare it with the age pyramid for the capital of the city's state.
  • Medium: Create a choropleth (by county) of the median age for the corresponding state.
  • Hard: Review the source code of UScensus2010 (see, e.g., the rOpenSci reviewer guide) and make suggestions for improvement. Optionally, implement those suggestions and open a pull request against the repository.

Solutions of tests

Contributors, please post a link to your test results here.

  • EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
  1. Abdussamad Muhammad, https://github.com/abdussamadma, https://github.com/abdussamadma/Age-pyramid- , https://github.com/abdussamadma/Chloropleth-of-Wisconsin
  2. Tejasvi Gupta, Github Profile, Test Links
  3. Arepalli Yashwanth Reddy, Github profile, Medium Test, Easy Test
  4. Modi Shreshtha, GitHub: https://github.com/shreshtha48, easy test: https://rpubs.com/shreshtha2002/easy_gsoc, medium test: https://rpubs.com/shreshtha2002/medium_gsoc, hard test: https://docs.google.com/document/d/136fySM_quKUbhAEBxTZQlwt_ljcJimdUzXrVclIVdVY/edit?usp=sharing