Create Python script to clean 2015 through 2019 data #1844

Open
1 of 11 tasks
ryanfchase opened this issue Oct 17, 2024 · 1 comment
ryanfchase commented Oct 17, 2024

Overview

We need to clean the 2015 through 2019 data from the city's 311 Data Service Request APIs so that it can be accessed through our Search and Filters modal.

Action Items

We want to know how to clean the data from 2015 to 2019 in order to make it consistent with our 2021-2024 data. To achieve this, complete the following:

  • Review relevant code from our scripts folder (see R1)
  • For each year (2015 through 2019), do the following:
    1. Download the dataset for that year and compare it to our 2024 data
    2. In a comment, answer the following questions:
      • q1: Does the dataset for this year have the same columns as the 2024 dataset? (can use check_column_count.py from R1)
      • q2: Does the dataset for this year contain any "problematic rows", e.g. rows with cells that might confuse our DuckDB interpreter? (see inspect_csv.py from R1)
    3. Based on the answers to q1 and q2, determine what steps, if any, should be taken to clean the dataset for that year
  • If any significant differences or problem rows were identified in the previous step, write a Python script to fix those issues, creating a cleaned dataset as a byproduct
    • commit those scripts under a new folder: 311-data/scripts/clean-2015-through-2020
    • attend a 311 Data weekly general meeting or 311 Data weekly engineering meeting to review your work with the other devs
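
As a starting point for q1 and q2, the checks could be sketched roughly as below. This is only an illustration using the standard library: the function names are hypothetical, and it is not the code in check_column_count.py or inspect_csv.py, which should be read first. It assumes each year's file is a UTF-8 CSV with a header row.

```python
import csv


def read_header(path):
    """Return the list of column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


def compare_columns(year_csv, reference_csv):
    """q1: report columns missing from or extra in year_csv relative to reference_csv."""
    year_cols = read_header(year_csv)
    ref_cols = read_header(reference_csv)
    return {
        "missing": [c for c in ref_cols if c not in year_cols],
        "extra": [c for c in year_cols if c not in ref_cols],
        "same_order": year_cols == ref_cols,
    }


def find_ragged_rows(path, limit=20):
    """q2 (partial): return (line_number, field_count) for rows whose
    field count differs from the header's, up to `limit` rows."""
    bad = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        expected = len(next(reader))
        for row in reader:
            if len(row) != expected:
                # line_num is the last physical line consumed for this row
                bad.append((reader.line_num, len(row)))
                if len(bad) >= limit:
                    break
    return bad
```

A row with the wrong number of fields is only one kind of "problematic row"; stray quotes or unexpected escape sequences may parse to the right width here and still trip up DuckDB, so the results should be cross-checked with inspect_csv.py.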

Resources/Instructions

R1: Relevant files and functions used in our build process, as well as tools used to determine where and how our datasets needed to be cleaned:

  • Review 311-data/scripts/migrateOldHfDataset.py
    • Note: this script is very similar to 311-data/scripts/updateHfDataset.py, except that it allows you to pass in a year as an argument.
    • Review dlData(year): you'll notice we are drawing data from a personal repo, see resources to know where that data comes from
    • Review hfClean(year): this method has two parts:
      1. it opens a local file (e.g. 2024.csv), creates a new output file (e.g. 2024-fixed.csv), and does a string replacement on the input before writing to the output file
      2. it opens a connection to DuckDB, creates a temporary table with the data from our fixed csv, then converts it to Parquet
    • Skip hfUpload(year), this won't be needed for the purposes of this ticket
    • Skip process_data(...), this is our main control flow for the script, which is determined by command line arguments
  • Review 311-data/scripts/csv_debug_tools
    • Review check_column_count.py, read documentation at the top of the file
    • Review inspect_csv.py, read documentation at the top of the file
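
For orientation, the two parts of hfClean(year) described above could look roughly like the sketch below. It is not the actual implementation in migrateOldHfDataset.py: the function names, the table name `requests`, and the replacement strings passed in are all placeholders, and the second function assumes the third-party `duckdb` package is installed.

```python
def write_fixed_csv(src, dst, old, new):
    """Part 1: copy src to dst line by line, replacing `old` with `new`.

    Returns the number of lines that were changed. Because the file is
    processed one line at a time, a pattern that spans a line break
    will not be matched.
    """
    changed = 0
    with open(src, encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fixed = line.replace(old, new)
            changed += fixed != line
            fout.write(fixed)
    return changed


def csv_to_parquet(csv_path, parquet_path):
    """Part 2: load the fixed CSV into a temporary DuckDB table, then export it as Parquet."""
    import duckdb  # third-party; imported here so part 1 works without it

    con = duckdb.connect()  # in-memory database
    try:
        # Paths are interpolated directly, so they must not contain single quotes.
        con.execute(
            f"CREATE TEMP TABLE requests AS SELECT * FROM read_csv_auto('{csv_path}')"
        )
        con.execute(f"COPY requests TO '{parquet_path}' (FORMAT PARQUET)")
    finally:
        con.close()
```

For example, `write_fixed_csv("2015.csv", "2015-fixed.csv", old, new)` followed by `csv_to_parquet("2015-fixed.csv", "2015.parquet")` would mirror the 2024.csv to 2024-fixed.csv flow, with `old` and `new` chosen from whatever q2 turns up for that year.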

R2: Data Sources

@github-project-automation github-project-automation bot moved this to New Issue Approval in P: 311: Project Board Oct 17, 2024
@ryanfchase ryanfchase changed the title Create Python script to clean 2015 through 2020 data Create Python script to clean 2015 through 2019 data Oct 19, 2024
@ryanfchase ryanfchase moved this from New Issue Approval to In progress in P: 311: Project Board Oct 19, 2024
@ryanfchase ryanfchase moved this from In progress to Prioritized Backlog in P: 311: Project Board Oct 19, 2024
@ryanfchase

This ticket is ready to be picked up

@mru-hub mru-hub self-assigned this Oct 26, 2024
@mru-hub mru-hub moved this from Prioritized Backlog to In progress in P: 311: Project Board Oct 26, 2024
Status: In progress