Merge pull request #76 from uchicago-dsi/main #80

Open · wants to merge 21 commits into base: main

Commits (21)

- `f420f5c` Merge pull request #76 from uchicago-dsi/main (trevorspreadbury, Mar 22, 2024)
- `866e465` add-texas-data (cXu-01, Apr 10, 2024)
- `7c88945` Use Form to process data (cXu-01, Apr 12, 2024)
- `31f908a` include essential information for contributors (cXu-01, May 2, 2024)
- `31d01cd` Major refactor of Texas transformer moving most logic to Form class (trevorspreadbury, May 8, 2024)
- `5be4a63` score match (cXu-01, May 14, 2024)
- `92efae2` Merge branch 'code-migration-texas' of https://github.com/uchicago-ds… (cXu-01, May 14, 2024)
- `34ec949` start branch with big updates (trevorspreadbury, Sep 5, 2024)
- `565760e` update table manipulation and remove dead code (trevorspreadbury, Sep 18, 2024)
- `7bd6c6e` update texas standardizeer (trevorspreadbury, Sep 18, 2024)
- `bff93c1` remove unused code (trevorspreadbury, Sep 18, 2024)
- `4df8563` start documenting process for new states (trevorspreadbury, Sep 18, 2024)
- `3977028` add package docstring for finance (trevorspreadbury, Sep 23, 2024)
- `e2f2c8c` add package docstring for finance.states (trevorspreadbury, Sep 23, 2024)
- `3f7d240` update PA scraper for new site (trevorspreadbury, Sep 26, 2024)
- `2ef891b` handle erroneously entered datatimes in datasource class (trevorspreadbury, Sep 26, 2024)
- `567e9d2` add todo to texas data source reader (trevorspreadbury, Sep 26, 2024)
- `20817f5` add party affiliation to yaml (trevorspreadbury, Sep 26, 2024)
- `2cfd930` first version of PA datasource -- has some ID bugs (trevorspreadbury, Sep 26, 2024)
- `baee70a` refactoring id handling (trevorspreadbury, Sep 26, 2024)
- `2aa19ff` Merge pull request #110 from dsi-clinic/yaml-refactor (trevorspreadbury, Sep 26, 2024)

11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -79,3 +79,14 @@ By default we have a few nice features:
- VS Code extensions are installed by default. This will allow us to use the Python Debugger, and lint and format documents with ruff automatically.

To run a pipeline using the debugger, we will want to run the relevant file in the `scripts` directory. Open the desired file, click the Python Debugger icon in the left sidebar (a play button with a bug), click the play button, and select the current Python file.


## Adding a new state

### Scraper

Write a scraper in the scraper package, if it makes sense. If the data is only available as a bulk download, it is not worth writing a scraper; instead, document the process for finding the bulk download (TODO: where) and save the bulk download somewhere publicly accessible.
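
If the source is a simple bulk file, the scraper can be as small as the sketch below. Everything in it is hypothetical (the URL, the output path, and the use of `requests`); it only illustrates the shape of the task.

```python
"""Hypothetical example: fetch a state's bulk campaign finance export."""
from pathlib import Path

import requests  # assumed to be available in the environment

# Hypothetical URL and destination; document the real bulk-download location
# alongside the scraper.
BULK_URL = "https://example.state.gov/campaign_finance/bulk_export.zip"
OUTPUT_PATH = Path("data/raw/example_state/bulk_export.zip")


def download_bulk_file(url: str = BULK_URL, output_path: Path = OUTPUT_PATH) -> Path:
    """Download the bulk export and save it under the raw data directory."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    output_path.write_bytes(response.content)
    return output_path
```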

### Standardization

To standardize state data, the `finance.source.DataSource` class is used. Each unique information source the state provides should be handled by a subclass of `DataSource`. A unique source of information is any file with a consistent format from which information is retrieved. For example, Pennsylvania provides campaign finance data in
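
For illustration, a rough sketch of what a `DataSource` subclass might look like. The method names (`read_table`, `standardize`), the import path, and the Pennsylvania column names are assumptions made for this example, not the project's actual interface; see the real `finance.source` module for the authoritative API.

```python
"""Illustrative sketch only; interface and column names are assumptions."""
import pandas as pd

from utils.finance.source import DataSource  # actual import path may differ


class PAContributionsSource(DataSource):
    """One consistently formatted contributions file published by Pennsylvania."""

    # Hypothetical mapping from raw column names to the standardized schema.
    column_map = {
        "FILERID": "recipient_id",
        "CONTRIBUTOR": "donor_name",
        "CONTDATE1": "date",
        "CONTAMT1": "amount",
    }

    def read_table(self, path: str) -> pd.DataFrame:
        """Read one raw file into a DataFrame, keeping everything as strings."""
        return pd.read_csv(path, dtype=str)

    def standardize(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Rename columns and coerce types into the standard transactions schema."""
        table = raw.rename(columns=self.column_map)[list(self.column_map.values())]
        table["date"] = pd.to_datetime(table["date"], errors="coerce")
        table["amount"] = pd.to_numeric(table["amount"], errors="coerce")
        return table
```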
37 changes: 37 additions & 0 deletions README.md
@@ -52,6 +52,43 @@ This folder is empty by default. The final outputs of make commands will be plac



[Schema diagram fragment: `transactor-election-year`, with `name`, `type`, and `address` fields.]

## Steps

### Step 1: Collect Data
#### Implemented in `collect`
Retrieve data from state agencies and store in flat files

### Step 2: Normalize Transaction Data
#### Implemented in `normalize`
Convert raw data into a standardized simple schema centered around a `transactions` table.
The `transactions` table represents monetary transactions; each row, at minimum, specifies a donor, recipient, date, and amount. Donor and recipient are foreign keys to a `transactors` table, which is related to the `organizations` and `individuals` tables. See schema here TODO.
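
Pending that schema document, a rough sketch of the relationships described above; any names beyond donor, recipient, date, and amount are assumptions.

```python
"""Sketch of the normalized schema described above; extra names are assumptions."""
from dataclasses import dataclass
from datetime import date


@dataclass
class Transactor:
    """Row in `transactors`; details live in `organizations` or `individuals`."""

    id: str
    transactor_type: str  # e.g. "individual" or "organization"


@dataclass
class Transaction:
    """Row in `transactions`; donor_id and recipient_id reference `transactors`."""

    donor_id: str
    recipient_id: str
    transaction_date: date
    amount: float
```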

The only modifications to source data at this stage are dropping invalid rows and changing data types (e.g., 20240627 and June 27, 2024 will both be standardized as datetimes).
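
A sketch of that datetime standardization with pandas (assuming pandas 2.0 or later; the actual normalization code may differ):

```python
import pandas as pd

raw_dates = pd.Series(["20240627", "June 27, 2024"])

# format="mixed" infers the format per element (pandas >= 2.0);
# errors="coerce" turns unparseable values into NaT so those rows can be dropped.
parsed = pd.to_datetime(raw_dates, format="mixed", errors="coerce")
print(parsed)  # both values become 2024-06-27 with dtype datetime64[ns]
```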

### Step 3: Clean Transaction Data
#### Implemented in `clean`
Modify raw data where appropriate to fix mistakes with high confidence.
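
A hypothetical example of the kind of high-confidence fix meant here (illustrative only; the real cleaning rules live in `clean`):

```python
import pandas as pd

transactions = pd.DataFrame({"state": ["tx ", "PA", " pa"]})

# Stray whitespace and inconsistent casing can be corrected with high confidence.
transactions["state"] = transactions["state"].str.strip().str.upper()
```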

### Step 4: Record Linkage
#### Implemented in `link`
Perform record linkage for individuals and organizations. Further normalize the tables to include `memberships` and `addresses` tables.
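
requirements.txt lists splink and networkx; below is a minimal sketch, assuming pairwise matches arrive as id pairs (for example, splink predictions above a probability threshold), of collapsing them into entity clusters with networkx. The actual linkage code may differ.

```python
import networkx as nx

# Hypothetical pairwise matches produced by a record-linkage model.
matched_pairs = [("ind_1", "ind_7"), ("ind_7", "ind_9"), ("org_2", "org_5")]

graph = nx.Graph()
graph.add_edges_from(matched_pairs)

# Each connected component is one resolved entity; map every record id to a
# canonical cluster id (ind_1/ind_7/ind_9 -> cluster_0, org_2/org_5 -> cluster_1).
cluster_of = {
    record_id: f"cluster_{i}"
    for i, component in enumerate(nx.connected_components(graph))
    for record_id in component
}
```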

### Step 5: Incorporate Elections
#### Implemented in


## Team Member

Student Name: Nicolas Posner
4 changes: 4 additions & 0 deletions data/README.md
@@ -160,3 +160,7 @@ contribution data and READMEs in a Google Drive for the duration of this project
3. The Finance Report states that a record must be kept for any contribution over \$10.00, but “Contributions and receipts of \$50.00 or less per contributor, during the reporting period, need not be itemized on the report” … this might mean that if, for instance, 1,000 people each donate \$50 or less, potentially thousands or tens of thousands of dollars would not appear in the data, even though this information is recorded. This means that the total contributions filers itemize do not necessarily reflect the total contributions they received.

4. Transparency USA has aggregated data on the contributions of individuals and committees. This could be a helpful source to cross-check the data and potentially help alleviate the debt-contribution issue. Pennsylvania's Dept. of State also offers a detailed website that shows all the aggregated contributions made and received, expenditures made, debts, and receipts. The catch is that one must know which candidate they are looking for, as it is a searchable database, but it can be very helpful for cross-matching and verification. Here's the link: https://www.campaignfinanceonline.pa.gov/Pages/CFReportSearch.aspx

## Texas

Texas data is retrieved from the 'Campaign Finance CSV Database' provided by the [Texas Ethics Commission](https://www.ethics.state.tx.us/search/cf/).
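
The bulk download is distributed as a zip archive of CSV files; a minimal loading sketch follows (the archive path and member filename below are hypothetical placeholders, not the actual TEC file layout):

```python
import zipfile

import pandas as pd

# Hypothetical paths; substitute the real archive and CSV names from the download.
with zipfile.ZipFile("data/raw/TX/campaign_finance_csvs.zip") as archive:
    with archive.open("contributions.csv") as csv_file:
        contributions = pd.read_csv(csv_file, dtype=str)
```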
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -49,3 +49,6 @@ convention = "google"

[tool.pytest.ini_options]
testpaths = "tests"

[tool.ruff]
extend-include = ["*.ipynb"]
2 changes: 2 additions & 0 deletions requirements.txt
@@ -23,3 +23,5 @@ nameparser==1.1.3
networkx~=3.1
splink==3.9.12
scipy
dask
dask[dataframe]
19 changes: 8 additions & 11 deletions scripts/transform_pipeline.py
@@ -32,14 +32,11 @@
input_directory.mkdir(parents=True, exist_ok=True)
output_directory.mkdir(parents=True, exist_ok=True)

-individuals_output_path = output_directory / "individuals_table.csv"
-organizations_output_path = output_directory / "organizations_table.csv"
-transactions_output_path = output_directory / "transactions_table.csv"
-(
-    complete_individuals_table,
-    complete_organizations_table,
-    complete_transactions_table,
-) = transform_and_merge()
-complete_individuals_table.to_csv(individuals_output_path)
-complete_organizations_table.to_csv(organizations_output_path)
-complete_transactions_table.to_csv(transactions_output_path)
+individuals_output_path = output_directory / "individuals_table-*.csv"
+organizations_output_path = output_directory / "organizations_table-*.csv"
+transactions_output_path = output_directory / "transactions_table-*.csv"
+id_table_output_path = output_directory / "id_map-*.csv"
+database = transform_and_merge()
+for table_type in database:
+    database[table_type].to_csv(output_directory / f"{table_type}.csv")
+print("pipeline finished and saved data to csv.")
2 changes: 2 additions & 0 deletions src/utils/constants.py
@@ -8,6 +8,8 @@
BASE_FILEPATH = Path(__file__).resolve().parent.parent.parent
# returns the base_path to the directory

source_metadata_directory = BASE_FILEPATH / "src" / "utils" / "static"

COMPANY_TYPES = {
"CORP": "CORPORATION",
"CO": "CORPORATION",
6 changes: 6 additions & 0 deletions src/utils/finance/__init__.py
@@ -0,0 +1,6 @@
"""Package for reading and standardizing state campaign finance data

The DataSource class is subclassed for each unique source of data. For more
information on adding additional states, see TODO and CONTRIBUTING.md.
"""