diff --git a/README.md b/README.md
index 3dfe628..3d7884e 100644
--- a/README.md
+++ b/README.md
@@ -53,8 +53,17 @@ pip install -r requirements.txt
 Folktables contains a suite of prediction tasks derived from US Census data
 that can be easily downloaded and used for a variety of benchmarking tasks.
 For information about the features, response, or group membership coding for any
-of the datasets, please refer to the [ACS PUMS
+of the American Community Survey (ACS) datasets, please refer to the [ACS PUMS
 documentation](https://www.census.gov/programs-surveys/acs/microdata/documentation.html).
+To see this information for [Current Population Survey](https://www.census.gov/programs-surveys/cps.html)
+(CPS) datasets, refer [here](https://www2.census.gov/programs-surveys/cps/datasets/)
+and navigate to the IO Code List `.txt` file in the `basic/` folder
+of the year you are pulling data from. For example, [here](https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/2023_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt) are the variable
+explanations for the CPS as of January 2023. You can also find the variable explanations for any
+Census Bureau survey in the corresponding variable list for the survey in the [Census Data API Discovery Tool](https://api.census.gov/data.html).
+The ACS is conducted annually and has more variables that
+can be used as features/targets/groups while the CPS is conducted monthly and focuses
+on labor force statistics.
 
 ### Evaluating algorithms for fair machine learning
 
@@ -98,7 +107,7 @@ The ACS data source contains data for all fifty states, each of which has a
 slightly different distribution of features and response. This increases the
 diversity of environments in which we can evaluate our methods. For instance, we
 can generate another `ACSEmployment` task using data from Texas and repeat the
-experiment
+experiment.
 ```py
 acs_tx = data_source.get_data(states=["TX"], download=True)
 tx_features, tx_label, tx_group = ACSEmployment.df_to_numpy(acs_tx)
@@ -117,6 +126,29 @@ black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])
 # Equality of opportunity violation: 0.0397
 white_tpr - black_tpr
 ```
+The CPS data source contains more specific labor force and employment data. It
+is also separable by state as well as the District of Columbia. Below is the
+`CPSEmployment` task, which has different features from `ACSEmployment`, using data from DC.
+```py
+from folktables import CPSDataSource, CPSEmployment
+
+data_source = CPSDataSource(survey_year=2023, survey_month='jan')
+cps_data = data_source.get_data(states=["DC"], download=True)
+features, label, group = CPSEmployment.df_to_numpy(cps_data)
+
+X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
+    features, label, group, test_size=0.2, random_state=0)
+
+model = make_pipeline(StandardScaler(), LogisticRegression())
+model.fit(X_train, y_train)
+
+yhat = model.predict(X_test)
+white_tpr = np.mean(yhat[(y_test == 1) & (group_test == 1)])
+black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])
+
+# Equality of opportunity violation: 0.0783
+white_tpr - black_tpr
+```
 
 ### Distribution shift across states
 Each prediction problem in Folktables can be instantiated on data from every US
@@ -195,12 +227,20 @@ Folktables provides the following pre-defined prediction tasks:
 
 - **ACSPublicCoverage**: predict whether an individual is covered by public health insurance, after filtering the ACS PUMS data sample to only include individuals under the age of 65, and those with an income of less than \$30,000. This filtering focuses the prediction problem on low-income individuals who are not eligible for Medicare.
 
+- **ACSHealthInsurance**: predict whether an individual has purchased insurance directly from an insurance company or not.
+
 - **ACSMobility**: predict whether an individual had the same residential address one year ago, after filtering the ACS PUMS data sample to only include individuals between the ages of 18 and 35. This filtering increases the difficulty of the prediction task, as the base rate of staying at the same address is above 90\% for the general population.
 
-- **ACSEmployment**: predict whether an individual is employed, after filtering the ACS PUMS data sample to only include individuals between the ages of 16 and 90.
+- **ACSEmployment**: predict whether an individual is employed.
+
+- **ACSEmploymentFiltered**: predict whether an individual is employed, after filtering the ACS PUMS data sample to only include individuals between the ages of 16 and 90.
 
 - **ACSTravelTime**: predict whether an individual has a commute to work that is longer than 20 minutes, after filtering the ACS PUMS data sample to only include individuals who are employed and above the age of 16. The threshold of 20 minutes was chosen as it is the US-wide median travel time to work in the 2018 ACS PUMS data release.
 
+- **ACSIncomePovertyRatio**: predict an individual's income as a ratio of the poverty threshold.
+
+- **CPSEmployment**: predict whether an individual is employed using data from the Current Population Survey.
+
 Each of these tasks can be instantiated on different ACS PUMS data samples, as
 illustrated in the [quick start examples](#quick-start-examples). Further
 details about each task can also be found in `acs.py`, where they are defined.
@@ -303,9 +343,10 @@ need for the North American fairness community to engage with it more strongly
 
 ## License and terms of use
 Folktables provides code to download data from the American Community Survey
-(ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau.
-The data itself is governed by the terms of use provided by the Census Bureau.
-For more information, see https://www.census.gov/data/developers/about/terms-of-service.html
+(ACS) Public Use Microdata Sample (PUMS) or the Current Population Survey (CPS)
+microdata files managed by the US Census Bureau. The data itself is governed by
+the terms of use provided by the Census Bureau. For more information, see
+https://www.census.gov/data/developers/about/terms-of-service.html
 
 The Adult reconstruction dataset is a subsample of the IPUMS CPS data available
 from https://cps.ipums.org/. The data are intended for replication purposes only.
diff --git a/folktables/__init__.py b/folktables/__init__.py
index b0e438a..977c7af 100644
--- a/folktables/__init__.py
+++ b/folktables/__init__.py
@@ -1,7 +1,8 @@
-__version__ = "0.0.12"
+__version__ = "0.1.0"
 
 from .folktables import *
 from .acs import *
+from .cps import *
 from .load_acs import state_list
 from .load_acs import generate_categories
 from .utils import *
diff --git a/folktables/cps.py b/folktables/cps.py
new file mode 100644
index 0000000..91ac528
--- /dev/null
+++ b/folktables/cps.py
@@ -0,0 +1,70 @@
+"""Data source and problem definitions for Current Population Survey (CPS)"""
+import numpy as np
+import pandas as pd
+from datetime import datetime
+
+from . import folktables
+from .load_cps import load_cps
+
+class CPSDataSource(folktables.DataSource):
+    """Data source implementation for CPS monthly microdata."""
+
+    def __init__(self, survey_year, survey_month, root_dir="data"):
+        """Create data source for the microdata of a specific month and year
+
+        Args:
+            survey_year: int. Year of CPS microdata, e.g., 2023
+            survey_month: String. First 3 letters of the month of the survey, e.g., 'jan' or 'jul'
+
+        Returns:
+            CPSDataSource
+        """
+        assert isinstance(survey_month, str)
+        survey_year = int(survey_year)
+        # back through 1994 is provided, but files are raw text (not .csv) and need specific parsers written
+        if survey_year not in range(2020, datetime.now().year+1):
+            raise ValueError("Data not available for the specified year")
+        months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
+        if survey_month.lower() not in months:
+            raise ValueError(f'Please specify the month with its first three letters, available options are: {months}')
+        self._survey_year = str(survey_year)
+        self._survey_month = survey_month.lower()
+        self._root_dir = root_dir
+
+    def get_data(self, states=None, download=False):
+        """Get data from a given list of states, or all states if no `states` argument.
+
+        Args:
+            states: List of Strings. Two letter codes for states, including the District of
+                Columbia, e.g., ['RI', 'NY', 'DC']
+            download: Boolean. True will download the `.csv` file for the specified survey year &
+                month. Use False if this data has already been downloaded.
+
+        Returns:
+            A pandas DataFrame of the requested data
+        """
+        data = load_cps(root_dir=self._root_dir,
+                        year=self._survey_year,
+                        month=self._survey_month,
+                        states=states,
+                        download=download)
+        return data
+
+CPSEmployment = folktables.BasicProblem(
+    features=[
+        'PRTAGE',
+        'PEEDUCA',
+        'PESEX',
+        'PEMARITL',
+        'PRDASIAN',
+        'PRDTHSP',
+        'PENATVTY',
+        'HEHOUSUT',
+        'HEFAMINC'
+    ],
+    target='PEMLR',
+    target_transform=lambda x: (x==1) | (x==2),
+    group='PTDTRACE',
+    preprocess=lambda x: x,
+    postprocess=lambda x: np.nan_to_num(x, -1),
+)
diff --git a/folktables/load_cps.py b/folktables/load_cps.py
new file mode 100644
index 0000000..8c68940
--- /dev/null
+++ b/folktables/load_cps.py
@@ -0,0 +1,63 @@
+"""Load CPS microdata from Census CSV files."""
+import os
+import io
+import requests
+import pandas as pd
+
+state_list = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI',
+              'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI',
+              'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC',
+              'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT',
+              'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC']
+
+
+_STATE_CODES = {'AL': '01', 'AK': '02', 'AZ': '04', 'AR': '05', 'CA': '06',
+                'CO': '08', 'CT': '09', 'DE': '10', 'FL': '12', 'GA': '13',
+                'HI': '15', 'ID': '16', 'IL': '17', 'IN': '18', 'IA': '19',
+                'KS': '20', 'KY': '21', 'LA': '22', 'ME': '23', 'MD': '24',
+                'MA': '25', 'MI': '26', 'MN': '27', 'MS': '28', 'MO': '29',
+                'MT': '30', 'NE': '31', 'NV': '32', 'NH': '33', 'NJ': '34',
+                'NM': '35', 'NY': '36', 'NC': '37', 'ND': '38', 'OH': '39',
+                'OK': '40', 'OR': '41', 'PA': '42', 'RI': '44', 'SC': '45',
+                'SD': '46', 'TN': '47', 'TX': '48', 'UT': '49', 'VT': '50',
+                'VA': '51', 'WA': '53', 'WV': '54', 'WI': '55', 'WY': '56',
+                'DC': '11'}
+
+def load_cps(root_dir, year, month, states=None, download=False):
+    """
+    Load sample of CPS microdata from Census csv files into DataFrame.
+
+    If download is False it is assumed the csv for the requested month and year has already been
+    downloaded and root_dir will be checked. Pass True for download if this is not the case.
+    """
+    df = retrieve_data(root_dir, year, month, states, download)
+    return df
+
+def retrieve_data(root_dir, year, month, states=None, download=False):
+    """Actually download the csv from the Census Bureau website if needed, return data as DataFrame"""
+    datadir = os.path.join(root_dir, str(year), str(month))
+    os.makedirs(datadir, exist_ok=True)
+    filename = f'{month}{year[-2:]}pub.csv'
+    filepath = os.path.join(datadir, filename)
+    if os.path.isfile(filepath):
+        df = pd.read_csv(filepath).replace(' ','')
+    elif not download:
+        raise FileNotFoundError(f'Could not find survey data for {month} {year}. Call get_data with download=True to download the dataset.')
+    else:
+        df = download_data(filepath, year, month)
+    if states is not None:
+        df = filter_by_state(df, states)
+    return df
+
+def download_data(filepath, year, month):
+    """Download the csv from Census Bureau website and convert to dataframe"""
+    print(f'Downloading CPS data for {month} {year}...')
+    url = f'https://www2.census.gov/programs-surveys/cps/datasets/{year}/basic/{month}{year[-2:]}pub.csv'
+    response = requests.get(url)
+    with open(filepath, 'wb') as handle:
+        handle.write(response.content)
+    return pd.read_csv(filepath).replace(' ','')
+
+def filter_by_state(df, state_list):
+    return df[df['GESTFIPS'].isin([int(_STATE_CODES[state]) for state in state_list])]
+    
\ No newline at end of file
diff --git a/setup.py b/setup.py
index c7be517..426e3ec 100644
--- a/setup.py
+++ b/setup.py
@@ -2,7 +2,7 @@
 
 setup(
     name="folktables",
-    version="0.0.12",
+    version="0.1.0",
     author="John Miller, Frances Ding, Ludwig Schmidt, Moritz Hardt",
     author_email="hardt@is.mpg.de",
     description="New machine learning benchmarks from tabular datasets.",
diff --git a/tests/dev_tests/api_tests.py b/tests/dev_tests/api_tests.py
new file mode 100644
index 0000000..f06cae0
--- /dev/null
+++ b/tests/dev_tests/api_tests.py
@@ -0,0 +1,27 @@
+from folktables import ACSDataSource, ACSPublicCoverage
+import requests
+import datetime
+
+'''
+This test shows how using the Census Bureau's web API to get prefiltered ACS data in JSON format
+is 12-44x slower than just downloading the CSV of the survey data in its entirety, at least at
+the time of creation for this test (January 4, 2024).
+'''
+
+def req():
+    start = datetime.datetime.now()
+    data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
+    acs_data = data_source.get_data(states=["CA"], download=True)
+    features, label, group = ACSPublicCoverage.df_to_numpy(acs_data)
+    delta = datetime.datetime.now() - start
+    print(delta)
+
+def req_api():
+    start = datetime.datetime.now()
+    resp = requests.get('https://api.census.gov/data/2018/acs/acs1/pums?get=AGEP,SCHL,MAR,SEX,DIS,ESP,CIT,MIG,MIL,ANC,NATIVITY,DEAR,DEYE,DREM,PINCP,ESR,ST,FER,RAC1P,PUBCOV&in=state:06')
+    delta = datetime.datetime.now() - start
+    print(delta)
+
+req()
+
+req_api()
\ No newline at end of file
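
Note on the CPS file layout: `CPSDataSource` validates the year and month, and `load_cps.py` then builds the CSV name, download URL, and cache path from them. The sketch below collects that logic into pure functions so it can be checked without network access. It is only an illustration: the helper names (`cps_csv_name`, `cps_csv_url`, `cps_local_path`) are hypothetical and not part of this patch, and the URL pattern and the 2020 lower bound are copied from the diff above rather than taken from Census Bureau documentation.

```python
import os
from datetime import datetime

# Month abbreviations accepted by CPSDataSource in the patch.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
# Base URL as it appears in download_data() in the patch.
BASE_URL = 'https://www2.census.gov/programs-surveys/cps/datasets'


def cps_csv_name(year, month):
    """Return the CSV file name for one survey month, e.g. 'jan23pub.csv'."""
    year = int(year)
    if not isinstance(month, str) or month.lower() not in MONTHS:
        raise ValueError(f'month must be one of {MONTHS}')
    # Same availability window as the patch: 2020 through the current year.
    if year not in range(2020, datetime.now().year + 1):
        raise ValueError('Data not available for the specified year')
    return f'{month.lower()}{str(year)[-2:]}pub.csv'


def cps_csv_url(year, month):
    """Return the download URL for one survey month."""
    name = cps_csv_name(year, month)
    return f'{BASE_URL}/{int(year)}/basic/{name}'


def cps_local_path(root_dir, year, month):
    """Return the cache path <root_dir>/<year>/<month>/<file name>."""
    name = cps_csv_name(year, month)
    return os.path.join(root_dir, str(int(year)), month.lower(), name)


if __name__ == '__main__':
    print(cps_csv_url(2023, 'jan'))
    print(cps_local_path('data', 2023, 'jan'))
```

Keeping these steps free of I/O would let a test cover the naming rules directly, while the actual download stays in one place where an HTTP status check could be added.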