Merge pull request #76 from uchicago-dsi/main #80

Open · wants to merge 21 commits into base: main

Commits (21)

- `f420f5c` Merge pull request #76 from uchicago-dsi/main (trevorspreadbury, Mar 22, 2024)
- `866e465` add-texas-data (cXu-01, Apr 10, 2024)
- `7c88945` Use Form to process data (cXu-01, Apr 12, 2024)
- `31f908a` include essential information for contributors (cXu-01, May 2, 2024)
- `31d01cd` Major refactor of Texas transformer moving most logic to Form class (trevorspreadbury, May 8, 2024)
- `5be4a63` score match (cXu-01, May 14, 2024)
- `92efae2` Merge branch 'code-migration-texas' of https://github.com/uchicago-ds… (cXu-01, May 14, 2024)
- `34ec949` start branch with big updates (trevorspreadbury, Sep 5, 2024)
- `565760e` update table manipulation and remove dead code (trevorspreadbury, Sep 18, 2024)
- `7bd6c6e` update texas standardizeer (trevorspreadbury, Sep 18, 2024)
- `bff93c1` remove unused code (trevorspreadbury, Sep 18, 2024)
- `4df8563` start documenting process for new states (trevorspreadbury, Sep 18, 2024)
- `3977028` add package docstring for finance (trevorspreadbury, Sep 23, 2024)
- `e2f2c8c` add package docstring for finance.states (trevorspreadbury, Sep 23, 2024)
- `3f7d240` update PA scraper for new site (trevorspreadbury, Sep 26, 2024)
- `2ef891b` handle erroneously entered datatimes in datasource class (trevorspreadbury, Sep 26, 2024)
- `567e9d2` add todo to texas data source reader (trevorspreadbury, Sep 26, 2024)
- `20817f5` add party affiliation to yaml (trevorspreadbury, Sep 26, 2024)
- `2cfd930` first version of PA datasource -- has some ID bugs (trevorspreadbury, Sep 26, 2024)
- `baee70a` refactoring id handling (trevorspreadbury, Sep 26, 2024)
- `2aa19ff` Merge pull request #110 from dsi-clinic/yaml-refactor (trevorspreadbury, Sep 26, 2024)

11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -79,3 +79,14 @@ By default we have a few nice features:
- VS Code extensions are installed by default. This will allow us to use the Python Debugger, and lint and format documents with ruff automatically.

To run a pipeline using the debugger, we will want to run the relevant file in the `scripts` directory. Open the desired file, click the Python Debugger icon in the left sidebar (a play button with a bug), click the play button, and select the current Python file.


## Adding a new state

### Scraper

Write a scraper in the scraper package, if it makes sense. If the data is only available as a bulk download, it is not worth writing a scraper; instead, document the process for finding the bulk download (TODO: where) and save the bulk download somewhere publicly accessible.
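
If the source is a simple bulk file, the scraper can be as small as the sketch below. Everything in it is hypothetical (the URL, the output path, and the use of `requests`); it only illustrates the shape of the task.

```python
"""Hypothetical example: fetch a state's bulk campaign finance export."""
from pathlib import Path

import requests  # assumed to be available in the environment

# Hypothetical URL and destination; document the real bulk-download location
# alongside the scraper.
BULK_URL = "https://example.state.gov/campaign_finance/bulk_export.zip"
OUTPUT_PATH = Path("data/raw/example_state/bulk_export.zip")


def download_bulk_file(url: str = BULK_URL, output_path: Path = OUTPUT_PATH) -> Path:
    """Download the bulk export and save it under the raw data directory."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    output_path.write_bytes(response.content)
    return output_path
```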

### Standardization

To standardize state data, the `finance.source.DataSource` class is used. Each unique information source the state provides should be handled by a subclass of `DataSource`. A unique source of information is any file with a consistent format from which information is retrieved. For example, Pennsylvania provides campaign finance data in
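
For illustration, a rough sketch of what a `DataSource` subclass might look like. The method names (`read_table`, `standardize`), the import path, and the Pennsylvania column names are assumptions made for this example, not the project's actual interface; see the real `finance.source` module for the authoritative API.

```python
"""Illustrative sketch only; interface and column names are assumptions."""
import pandas as pd

from utils.finance.source import DataSource  # actual import path may differ


class PAContributionsSource(DataSource):
    """One consistently formatted contributions file published by Pennsylvania."""

    # Hypothetical mapping from raw column names to the standardized schema.
    column_map = {
        "FILERID": "recipient_id",
        "CONTRIBUTOR": "donor_name",
        "CONTDATE1": "date",
        "CONTAMT1": "amount",
    }

    def read_table(self, path: str) -> pd.DataFrame:
        """Read one raw file into a DataFrame, keeping everything as strings."""
        return pd.read_csv(path, dtype=str)

    def standardize(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Rename columns and coerce types into the standard transactions schema."""
        table = raw.rename(columns=self.column_map)[list(self.column_map.values())]
        table["date"] = pd.to_datetime(table["date"], errors="coerce")
        table["amount"] = pd.to_numeric(table["amount"], errors="coerce")
        return table
```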
37 changes: 37 additions & 0 deletions README.md
@@ -52,6 +52,43 @@ This folder is empty by default. The final outputs of make commands will be plac



[Schema diagram fragment: `transactor-election-year`, with `name`, `type`, and `address` fields.]

## Steps

### Step 1: Collect Data
#### Implemented in `collect`
Retrieve data from state agencies and store in flat files

### Step 2: Normalize Transaction Data
#### Implemented in `normalize`
Convert raw data into a standardized simple schema centered around a `transactions` table.
The `transactions` table represents monetary transactions; each row, at minimum, specifies a donor, recipient, date, and amount. Donor and recipient are foreign keys to a `transactors` table, which is related to the `organizations` and `individuals` tables. See schema here TODO.
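
Pending that schema document, a rough sketch of the relationships described above; any names beyond donor, recipient, date, and amount are assumptions.

```python
"""Sketch of the normalized schema described above; extra names are assumptions."""
from dataclasses import dataclass
from datetime import date


@dataclass
class Transactor:
    """Row in `transactors`; details live in `organizations` or `individuals`."""

    id: str
    transactor_type: str  # e.g. "individual" or "organization"


@dataclass
class Transaction:
    """Row in `transactions`; donor_id and recipient_id reference `transactors`."""

    donor_id: str
    recipient_id: str
    transaction_date: date
    amount: float
```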

The only modifications to source data at this stage are dropping invalid rows and changing data types (e.g., 20240627 and June 27, 2024 will both be standardized as datetimes).
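
A sketch of that datetime standardization with pandas (assuming pandas 2.0 or later; the actual normalization code may differ):

```python
import pandas as pd

raw_dates = pd.Series(["20240627", "June 27, 2024"])

# format="mixed" infers the format per element (pandas >= 2.0);
# errors="coerce" turns unparseable values into NaT so those rows can be dropped.
parsed = pd.to_datetime(raw_dates, format="mixed", errors="coerce")
print(parsed)  # both values become 2024-06-27 with dtype datetime64[ns]
```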

### Step 3: Clean Transaction Data
#### Implemented in `clean`
Modify raw data where appropriate to fix mistakes with high confidence.
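
A hypothetical example of the kind of high-confidence fix meant here (illustrative only; the real cleaning rules live in `clean`):

```python
import pandas as pd

transactions = pd.DataFrame({"state": ["tx ", "PA", " pa"]})

# Stray whitespace and inconsistent casing can be corrected with high confidence.
transactions["state"] = transactions["state"].str.strip().str.upper()
```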

### Step 4: Record Linkage
#### Implemented in `link`
Perform record linkage for individuals and organizations. Further normalize the tables to include `memberships` and `addresses` tables.
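
requirements.txt lists splink and networkx; below is a minimal sketch, assuming pairwise matches arrive as id pairs (for example, splink predictions above a probability threshold), of collapsing them into entity clusters with networkx. The actual linkage code may differ.

```python
import networkx as nx

# Hypothetical pairwise matches produced by a record-linkage model.
matched_pairs = [("ind_1", "ind_7"), ("ind_7", "ind_9"), ("org_2", "org_5")]

graph = nx.Graph()
graph.add_edges_from(matched_pairs)

# Each connected component is one resolved entity; map every record id to a
# canonical cluster id (ind_1/ind_7/ind_9 -> cluster_0, org_2/org_5 -> cluster_1).
cluster_of = {
    record_id: f"cluster_{i}"
    for i, component in enumerate(nx.connected_components(graph))
    for record_id in component
}
```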

### Step 5: Incorporate Elections
#### Implemented in


## Team Member

Student Name: Nicolas Posner
4 changes: 4 additions & 0 deletions data/README.md
@@ -160,3 +160,7 @@ contribution data and READMEs in a Google Drive for the duration of this project
3. The Finance Report states that a record must be kept for any contribution over \$10.00, but “Contributions and receipts of \$50.00 or less per contributor, during the reporting period, need not be itemized on the report” … this might mean that if, for instance, 1,000 people each donate \$50 or less, potentially thousands or tens of thousands of dollars would not appear in the data, even though this information is recorded. This means that the total contributions filers itemize do not necessarily reflect the total contributions they received.

4. Transparency USA has aggregated data on the contributions of individuals and committees. This could be a helpful source to cross-check the data and potentially help alleviate the debt-contribution issue. Pennsylvania's Dept. of State also offers a detailed website that shows all the aggregated contributions made and received, expenditures made, debts, and receipts. The catch is that one must know which candidate they are looking for, as it is a searchable database, but it can be very helpful for cross-matching and verification. Here's the link: https://www.campaignfinanceonline.pa.gov/Pages/CFReportSearch.aspx

## Texas

Texas data is retrieved from the 'Campaign Finance CSV Database' provided by the [Texas Ethics Commission](https://www.ethics.state.tx.us/search/cf/).
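
The bulk download is distributed as a zip archive of CSV files; a minimal loading sketch follows (the archive path and member filename below are hypothetical placeholders, not the actual TEC file layout):

```python
import zipfile

import pandas as pd

# Hypothetical paths; substitute the real archive and CSV names from the download.
with zipfile.ZipFile("data/raw/TX/campaign_finance_csvs.zip") as archive:
    with archive.open("contributions.csv") as csv_file:
        contributions = pd.read_csv(csv_file, dtype=str)
```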
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -49,3 +49,6 @@ convention = "google"

[tool.pytest.ini_options]
testpaths = "tests"

[tool.ruff]
extend-include = ["*.ipynb"]
2 changes: 2 additions & 0 deletions requirements.txt
@@ -23,3 +23,5 @@ nameparser==1.1.3
networkx~=3.1
splink==3.9.12
scipy
dask
dask[dataframe]
19 changes: 8 additions & 11 deletions scripts/transform_pipeline.py
@@ -32,14 +32,11 @@
input_directory.mkdir(parents=True, exist_ok=True)
output_directory.mkdir(parents=True, exist_ok=True)

-individuals_output_path = output_directory / "individuals_table.csv"
-organizations_output_path = output_directory / "organizations_table.csv"
-transactions_output_path = output_directory / "transactions_table.csv"
-(
-    complete_individuals_table,
-    complete_organizations_table,
-    complete_transactions_table,
-) = transform_and_merge()
-complete_individuals_table.to_csv(individuals_output_path)
-complete_organizations_table.to_csv(organizations_output_path)
-complete_transactions_table.to_csv(transactions_output_path)
+individuals_output_path = output_directory / "individuals_table-*.csv"
+organizations_output_path = output_directory / "organizations_table-*.csv"
+transactions_output_path = output_directory / "transactions_table-*.csv"
+id_table_output_path = output_directory / "id_map-*.csv"
+database = transform_and_merge()
+for table_type in database:
+    database[table_type].to_csv(output_directory / f"{table_type}.csv")
+print("pipeline finished and saved data to csv.")
2 changes: 2 additions & 0 deletions src/utils/constants.py
@@ -8,6 +8,8 @@
BASE_FILEPATH = Path(__file__).resolve().parent.parent.parent
# returns the base_path to the directory

source_metadata_directory = BASE_FILEPATH / "src" / "utils" / "static"

COMPANY_TYPES = {
"CORP": "CORPORATION",
"CO": "CORPORATION",
6 changes: 6 additions & 0 deletions src/utils/finance/__init__.py
@@ -0,0 +1,6 @@
"""Package for reading and standardizing state campaign finance data

The DataSource class is subclassed for each unique source of data. For more
information on adding additional states, see TODO and CONTRIBUTING.md.
"""