Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add multi-line JSON reading methods #3

Merged
merged 1 commit into from
Jul 2, 2024
Merged

feat: add multi-line JSON reading methods #3

merged 1 commit into from
Jul 2, 2024

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Jul 1, 2024

Specifically:

  • list_multiline_json_in_dir: scans for files & their resource types
  • read_multiline_json: returns line by line results from a file
  • read_multiline_json_from_dir: returns line by line results from a folder

This hides all the I/O error handling & odd format challenges (like .jsonl vs .ndjson or handling new lines).

This effort will require two other PRs: one for the ETL and one for the Library:

@@ -1,7 +1,7 @@
repos:
- repo: https://github.com/psf/black
#this version is synced with the black mentioned in .github/workflows/ci.yml
rev: 23.10.0
rev: 24.4.2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort of an aside, but how do we feel about more ruff cutover?

Copy link
Contributor Author

@mikix mikix Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed - I do like moving from 3 or 4 different tools to 1.

Ongoing concerns:

  • I still don't love how there's not really an obviously-correct default config - our config sections so far seem large and opinionated. But 🤷 it's fine
  • I'm also lightly leery of its stability right now - like they are still in 0.x mode and changing CLI args etc.
  • Relatedly, ruff has no format-change guarantees like black does (with the year-based major version).

So I'm cautious, but I'd be down for more migrations.

return {}


def read_ndjson(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i realize this is for best compatibility with what we've been doing to date, but with the scope change: consider read_multiline_json here and in the following function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, the whole ndjson / jsonlines thing is weird. Well, I guess just the ndjson side is weird.

I had kept with ndjson for consistency yeah. But I'm open to a better name. But multiline is so corporate, man. 🤔

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_json_but_like_what_if_there_was_a_ton_of_it_man

but, more seriously, read_delimited_json? read_json_lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK we couldn't come up with better names - read_multiline_json it is

Specifically:
- list_multiline_json_in_dir: scans for files & their resource types
- read_multiline_json: returns line by line results from a file
- read_multiline_json_from_dir: returns line by line results from a
  folder

This hides all the I/O error handling & odd format challenges
(like .jsonl vs .ndjson or handling new lines).
@mikix mikix changed the title feat: add ndjson reading methods feat: add multi-line JSON reading methods Jul 1, 2024
@mikix mikix merged commit e153b3f into main Jul 2, 2024
5 checks passed
@mikix mikix deleted the mikix/ndjson branch July 2, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants