Support for Current Population Survey (CPS) #40

Open
wants to merge 13 commits into
base: main
from
53 changes: 47 additions & 6 deletions README.md
@@ -53,8 +53,17 @@ pip install -r requirements.txt
Folktables contains a suite of prediction tasks derived from US Census data that
can be easily downloaded and used for a variety of benchmarking tasks.
For information about the features, response, or group membership coding for any
-of the datasets, please refer to the [ACS PUMS
+of the American Community Survey (ACS) datasets, please refer to the [ACS PUMS
documentation](https://www.census.gov/programs-surveys/acs/microdata/documentation.html).
For the [Current Population Survey](https://www.census.gov/programs-surveys/cps.html)
(CPS) datasets, browse the [CPS datasets directory](https://www2.census.gov/programs-surveys/cps/datasets/)
and open the IO Code List `.txt` file in the `basic/` folder
of the year you are pulling data from. For example, the [2023 record layout and IO code list](https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/2023_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt) contains the variable
explanations for the CPS as of January 2023. Variable explanations for any
Census Bureau survey are also available in the survey's variable list in the [Census Data API Discovery Tool](https://api.census.gov/data.html).
The ACS is conducted annually and offers more variables that
can be used as features, targets, or groups, while the CPS is conducted monthly and focuses
on labor force statistics.


### Evaluating algorithms for fair machine learning
@@ -98,7 +107,7 @@ The ACS data source contains data for all fifty states, each of which has a
slightly different distribution of features and response. This increases the
diversity of environments in which we can evaluate our methods. For instance, we
can generate another `ACSEmployment` task using data from Texas and repeat the
-experiment
+experiment.
```py
acs_tx = data_source.get_data(states=["TX"], download=True)
tx_features, tx_label, tx_group = ACSEmployment.df_to_numpy(acs_tx)
@@ -117,6 +126,29 @@ black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])
# Equality of opportunity violation: 0.0397
white_tpr - black_tpr
```
The CPS data source contains more specific labor force and employment data. It
can also be filtered by state, including the District of Columbia. Below is the
`CPSEmployment` task, which uses different features from `ACSEmployment`, instantiated on data from DC.
```py
from folktables import CPSDataSource, CPSEmployment

data_source = CPSDataSource(survey_year=2023, survey_month='jan')
cps_data = data_source.get_data(states=["DC"], download=True)
features, label, group = CPSEmployment.df_to_numpy(cps_data)

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
features, label, group, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

yhat = model.predict(X_test)
white_tpr = np.mean(yhat[(y_test == 1) & (group_test == 1)])
black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])

# Equality of opportunity violation: 0.0783
white_tpr - black_tpr
```
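
The quantity reported in both examples is the gap in true positive rates between two groups. As a minimal, self-contained sketch on made-up arrays (the helper name `true_positive_rate` and the toy values are illustrative assumptions, not part of folktables), the same computation looks like this:

```python
import numpy as np


def true_positive_rate(yhat, y_true, group, group_value):
    """Share of actual positives in one group that the model predicts as positive."""
    mask = (y_true == 1) & (group == group_value)
    return float(np.mean(yhat[mask]))


# Toy labels, predictions, and group codes (hypothetical values for illustration)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1])
yhat = np.array([1, 1, 0, 1, 0, 1, 1, 0])
group = np.array([1, 1, 1, 1, 1, 2, 2, 2])

tpr_group_1 = true_positive_rate(yhat, y_true, group, 1)
tpr_group_2 = true_positive_rate(yhat, y_true, group, 2)

# Equality of opportunity violation on the toy data
gap = tpr_group_1 - tpr_group_2
print(gap)
```

A gap near zero means the classifier recovers positives at a similar rate in both groups; the sign shows which group is favored.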

### Distribution shift across states
Each prediction problem in Folktables can be instantiated on data from every US
@@ -195,12 +227,20 @@ Folktables provides the following pre-defined prediction tasks:

- **ACSPublicCoverage**: predict whether an individual is covered by public health insurance, after filtering the ACS PUMS data sample to only include individuals under the age of 65, and those with an income of less than \$30,000. This filtering focuses the prediction problem on low-income individuals who are not eligible for Medicare.

- **ACSHealthInsurance**: predict whether an individual has purchased insurance directly from an insurance company or not.

- **ACSMobility**: predict whether an individual had the same residential address one year ago, after filtering the ACS PUMS data sample to only include individuals between the ages of 18 and 35. This filtering increases the difficulty of the prediction task, as the base rate of staying at the same address is above 90\% for the general population.

-- **ACSEmployment**: predict whether an individual is employed, after filtering the ACS PUMS data sample to only include individuals between the ages of 16 and 90.
+- **ACSEmployment**: predict whether an individual is employed.

- **ACSEmploymentFiltered**: predict whether an individual is employed, after filtering the ACS PUMS data sample to only include individuals between the ages of 16 and 90.

- **ACSTravelTime**: predict whether an individual has a commute to work that is longer than 20 minutes, after filtering the ACS PUMS data sample to only include individuals who are employed and above the age of 16. The threshold of 20 minutes was chosen as it is the US-wide median travel time to work in the 2018 ACS PUMS data release.

- **ACSIncomePovertyRatio**: predict an individual's income-to-poverty ratio, i.e., income as a ratio of the poverty threshold.

- **CPSEmployment**: predict whether an individual is employed using data from the Current Population Survey.

Each of the ACS tasks can be instantiated on different ACS PUMS data samples, as
illustrated in the [quick start examples](#quick-start-examples), and
`CPSEmployment` can be instantiated on different CPS monthly samples. Further
details about each task can also be found in `acs.py` and `cps.py`, where they are defined.
@@ -303,9 +343,10 @@ need for the North American fairness community to engage with it more strongly

## License and terms of use
Folktables provides code to download data from the American Community Survey
-(ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau.
-The data itself is governed by the terms of use provided by the Census Bureau.
-For more information, see https://www.census.gov/data/developers/about/terms-of-service.html
+(ACS) Public Use Microdata Sample (PUMS) or the Current Population Survey (CPS)
+microdata files managed by the US Census Bureau. The data itself is governed by
+the terms of use provided by the Census Bureau. For more information, see
+https://www.census.gov/data/developers/about/terms-of-service.html

The Adult reconstruction dataset is a subsample of the IPUMS CPS data available
from https://cps.ipums.org/. The data are intended for replication purposes only.
3 changes: 2 additions & 1 deletion folktables/__init__.py
@@ -1,7 +1,8 @@
-__version__ = "0.0.12"
+__version__ = "0.1.0"

from .folktables import *
from .acs import *
from .cps import *
from .load_acs import state_list
from .load_acs import generate_categories
from .utils import *
70 changes: 70 additions & 0 deletions folktables/cps.py
@@ -0,0 +1,70 @@
"""Data source and problem definitions for the Current Population Survey (CPS)."""
import numpy as np
import pandas as pd
from datetime import datetime

from . import folktables
from .load_cps import load_cps

class CPSDataSource(folktables.DataSource):
    """Data source implementation for CPS monthly microdata."""

    def __init__(self, survey_year, survey_month, root_dir="data"):
        """Create data source for the microdata of a specific month and year.

        Args:
            survey_year: int. Year of CPS microdata, e.g., 2023
            survey_month: String. First 3 letters of the month of the survey, e.g., 'jan' or 'jul'
            root_dir: String. Directory where the downloaded `.csv` files are stored.

        Returns:
            CPSDataSource
        """
        assert isinstance(survey_month, str)
        survey_year = int(survey_year)
        # Data back through 1994 is provided, but the older files are raw text (not .csv)
        # and would need dedicated parsers.
        if survey_year not in range(2020, datetime.now().year + 1):
            raise ValueError("Data not available for the specified year")
        months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
        if survey_month.lower() not in months:
            raise ValueError(f'Please specify the month with its first three letters, available options are: {months}')
        self._survey_year = str(survey_year)
        self._survey_month = survey_month.lower()
        self._root_dir = root_dir

    def get_data(self, states=None, download=False):
        """Get data from a given list of states, or all states if no `states` argument.

        Args:
            states: List of Strings. Two letter codes for states, including the District of
                Columbia, e.g., ['RI', 'NY', 'DC']
            download: Boolean. True will download the `.csv` file for the specified survey year &
                month. Use False if this data has already been downloaded.

        Returns:
            A pandas DataFrame of the requested data
        """
        data = load_cps(root_dir=self._root_dir,
                        year=self._survey_year,
                        month=self._survey_month,
                        states=states,
                        download=download)
        return data

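CPSEmployment = folktables.BasicProblem(
    features=[
        'PRTAGE',
        'PEEDUCA',
        'PESEX',
        'PEMARITL',
        'PRDASIAN',
        'PRDTHSP',
        'PENATVTY',
        'HEHOUSUT',
        'HEFAMINC'
    ],
    target='PEMLR',
    # Treat PEMLR codes 1 and 2 as "employed"
    target_transform=lambda x: (x == 1) | (x == 2),
    group='PTDTRACE',
    preprocess=lambda x: x,
    # Pass the fill value by keyword: the second positional argument of
    # np.nan_to_num is `copy`, so np.nan_to_num(x, -1) would fill NaN with 0.
    postprocess=lambda x: np.nan_to_num(x, nan=-1),
)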
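
The two transformations in `CPSEmployment` can be tried in isolation. The sketch below uses made-up `PEMLR` values and a made-up feature array (both are assumptions for illustration, not real CPS records). It also shows why the fill value for `np.nan_to_num` has to be passed as `nan=`: the function's second positional parameter is `copy`.

```python
import numpy as np
import pandas as pd

# Hypothetical PEMLR values; the task treats codes 1 and 2 as "employed"
pemlr = pd.Series([1, 2, 3, 5, -1])
label = ((pemlr == 1) | (pemlr == 2)).to_numpy()

# Hypothetical feature column with one missing value
x = np.array([1.0, np.nan, 3.0])

# Positional -1 is interpreted as `copy`, so NaN is filled with the default 0.0
filled_positional = np.nan_to_num(x, -1)

# Keyword form replaces NaN with -1.0, a sentinel outside the usual code range
filled_keyword = np.nan_to_num(x, nan=-1)

print(label)
print(filled_positional)
print(filled_keyword)
```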
63 changes: 63 additions & 0 deletions folktables/load_cps.py
@@ -0,0 +1,63 @@
"""Load CPS microdata from Census CSV files."""
import os
import io
import requests
import pandas as pd

state_list = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI',
'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI',
'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC',
'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT',
'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC']


_STATE_CODES = {'AL': '01', 'AK': '02', 'AZ': '04', 'AR': '05', 'CA': '06',
'CO': '08', 'CT': '09', 'DE': '10', 'FL': '12', 'GA': '13',
'HI': '15', 'ID': '16', 'IL': '17', 'IN': '18', 'IA': '19',
'KS': '20', 'KY': '21', 'LA': '22', 'ME': '23', 'MD': '24',
'MA': '25', 'MI': '26', 'MN': '27', 'MS': '28', 'MO': '29',
'MT': '30', 'NE': '31', 'NV': '32', 'NH': '33', 'NJ': '34',
'NM': '35', 'NY': '36', 'NC': '37', 'ND': '38', 'OH': '39',
'OK': '40', 'OR': '41', 'PA': '42', 'RI': '44', 'SC': '45',
'SD': '46', 'TN': '47', 'TX': '48', 'UT': '49', 'VT': '50',
'VA': '51', 'WA': '53', 'WV': '54', 'WI': '55', 'WY': '56',
'DC': '11'}

def load_cps(root_dir, year, month, states=None, download=False):
    """
    Load sample of CPS microdata from Census csv files into a DataFrame.

    If download is False, the csv for the requested month and year is assumed to have already
    been downloaded and root_dir is checked for it. Pass download=True if this is not the case.
    """
    df = retrieve_data(root_dir, year, month, states, download)
    return df

def retrieve_data(root_dir, year, month, states=None, download=False):
    """Download the csv from the Census Bureau website if needed and return the data as a DataFrame."""
    year = str(year)  # the file name uses the last two digits of the year
    datadir = os.path.join(root_dir, year, str(month))
    os.makedirs(datadir, exist_ok=True)
    filename = f'{month}{year[-2:]}pub.csv'
    filepath = os.path.join(datadir, filename)
    if os.path.isfile(filepath):
        df = pd.read_csv(filepath).replace(' ', '')
    elif not download:
        raise FileNotFoundError(f'Could not find survey data for {month} {year}. Call get_data with download=True to download the dataset.')
    else:
        df = download_data(filepath, year, month)
    if states is not None:
        df = filter_by_state(df, states)
    return df

def download_data(filepath, year, month):
    """Download the csv from the Census Bureau website and convert it to a DataFrame."""
    print(f'Downloading CPS data for {month} {year}...')
    url = f'https://www2.census.gov/programs-surveys/cps/datasets/{year}/basic/{month}{year[-2:]}pub.csv'
    response = requests.get(url)
    # Fail early instead of saving an HTTP error page as the csv
    response.raise_for_status()
    with open(filepath, 'wb') as handle:
        handle.write(response.content)
    return pd.read_csv(filepath).replace(' ', '')

def filter_by_state(df, state_list):
    """Keep only the rows whose GESTFIPS code matches one of the given two-letter state codes."""
    return df[df['GESTFIPS'].isin([int(_STATE_CODES[state]) for state in state_list])]
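
The loader derives the download URL from the year and month and filters rows by integer FIPS code. The sketch below reproduces both steps without any network access. The function name `cps_basic_url`, the three-entry code table, the toy DataFrame, and the explicit check for unknown state codes are assumptions added for illustration; the code in this diff raises a `KeyError` for an unknown code instead.

```python
import pandas as pd

# Subset of the FIPS table, for illustration only
STATE_CODES = {"DC": "11", "RI": "44", "TX": "48"}


def cps_basic_url(year, month):
    """Build the basic monthly CPS csv URL using the same pattern as download_data."""
    year = str(year)
    return (
        "https://www2.census.gov/programs-surveys/cps/datasets/"
        f"{year}/basic/{month}{year[-2:]}pub.csv"
    )


def filter_by_state(df, states):
    """Keep rows whose GESTFIPS matches one of the two-letter state codes."""
    unknown = [s for s in states if s not in STATE_CODES]
    if unknown:
        # Hypothetical friendlier error than the bare KeyError
        raise ValueError(f"Unknown state codes: {unknown}")
    codes = [int(STATE_CODES[s]) for s in states]
    return df[df["GESTFIPS"].isin(codes)]


# Toy frame standing in for a CPS extract (made-up values)
toy = pd.DataFrame({"GESTFIPS": [11, 48, 44, 11], "PRTAGE": [30, 41, 25, 67]})
dc_rows = filter_by_state(toy, ["DC"])

print(cps_basic_url(2023, "jan"))
print(dc_rows)
```

The comparison is done on integers because `pd.read_csv` parses a numeric `GESTFIPS` column as integers, so a zero-padded string such as `'01'` would not match.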

2 changes: 1 addition & 1 deletion setup.py
@@ -2,7 +2,7 @@

setup(
name="folktables",
-version="0.0.12",
+version="0.1.0",
author="John Miller, Frances Ding, Ludwig Schmidt, Moritz Hardt",
author_email="[email protected]",
description="New machine learning benchmarks from tabular datasets.",
27 changes: 27 additions & 0 deletions tests/dev_tests/api_tests.py
@@ -0,0 +1,27 @@
from folktables import ACSDataSource, ACSPublicCoverage
import requests
import datetime

'''
This test shows that using the Census Bureau's web API to get prefiltered ACS data in JSON format
is 12-44x slower than downloading the CSV of the survey data in its entirety, at least as of
when this test was written (January 4, 2024).
'''

def req():
    start = datetime.datetime.now()
    data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
    acs_data = data_source.get_data(states=["CA"], download=True)
    features, label, group = ACSPublicCoverage.df_to_numpy(acs_data)
    delta = datetime.datetime.now() - start
    print(delta)

def req_api():
    start = datetime.datetime.now()
    resp = requests.get('https://api.census.gov/data/2018/acs/acs1/pums?get=AGEP,SCHL,MAR,SEX,DIS,ESP,CIT,MIG,MIL,ANC,NATIVITY,DEAR,DEYE,DREM,PINCP,ESR,ST,FER,RAC1P,PUBCOV&in=state:06')
    delta = datetime.datetime.now() - start
    print(delta)

req()

req_api()
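
For timing comparisons like this one, a small helper built on `time.perf_counter` may be a more robust option than subtracting `datetime.now()` values, since `perf_counter` is a monotonic clock intended for measuring intervals. The helper name `timed` and the stand-in workload are assumptions for illustration; no network request is made here.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the test above this would be req or req_api
result, elapsed = timed(sum, range(1000))
print(f"result={result}, elapsed={elapsed:.6f}s")
```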