Skip to content

Conversation

RX14
Copy link
Member

@RX14 RX14 commented Sep 8, 2025

Addresses #12946


This is a WIP, mostly to discuss the general approach now something is compiling. I'd like to avoid nitpicks about the code and focus on if I've missed an i32 anywhere, and the resulting API. This code has been tested parsing a 150GiB JSON file, but I can't be sure I didn't miss something.

If saturating the line/column numbers is the consensus, the JSON.saturate_to_i32 helper would be nice to generalise to IntXX#saturate_to_ixx.

@yxhuvud
Copy link
Contributor

yxhuvud commented Sep 8, 2025

150GiB JSON

People do actual JSON that large without resorting to JSONL or similar? Yikes, that is the kind of size where a streaming parser would be nice to have (like SAX parsers for XML. Dunno what is possible for JSON).

@RX14
Copy link
Member Author

RX14 commented Sep 8, 2025

JSON::PullParser is a streaming JSON parser, the file was an array of 9 million items and I was categorising each entry as it came in and splitting it out into smaller more manageable files.

@jneen
Copy link
Contributor

jneen commented Sep 10, 2025

Is there a reason not to use unsigned here?

@straight-shoota
Copy link
Member

Unsigned types are more susceptible to errors on math operations. Technically, the domain range would be covered by unsigned values. But they propagate into other calculations and cause issues there.
We won't need the full range of UInt64 here for parsing JSON files over 8 exabytes.

@RX14 RX14 force-pushed the feature/json-i64 branch from 7554e0e to 94aa387 Compare October 5, 2025 10:19
@RX14
Copy link
Member Author

RX14 commented Oct 5, 2025

I've marked the line/column accessors experimental, and the i32 methods now raise on overflow instead of saturate. I think this PR can probably be reviewed in more detail unless there's still concerns about the overall approach.

@RX14 RX14 marked this pull request as ready for review October 5, 2025 10:20
@straight-shoota straight-shoota changed the title WIP: support large JSON files Add support for large JSON documents (> Int32::MAX) Oct 8, 2025
@RX14 RX14 merged commit c4b8bbf into crystal-lang:master Oct 11, 2025
41 checks passed
@straight-shoota
Copy link
Member

@RX14 We're in the middle of the freeze period for 1.18.0. This is a feature addition and shouldn't have been merged into master until after the release.

It seems I forgot adding it to the milestone after I approved it. It should've been added to 1.19.0 though.
But we should never merge anything before adding it to a milestone (see https://github.com/crystal-lang/crystal/blob/master/CONTRIBUTING.md#accepting-a-pull-request).

straight-shoota added a commit that referenced this pull request Oct 11, 2025
ysbaddaden pushed a commit that referenced this pull request Oct 11, 2025
@straight-shoota straight-shoota added the status:reverted PR was reverted or reverts another one, and is not part of a milestone label Oct 11, 2025
@straight-shoota
Copy link
Member

The commit has been reverted.
Please resubmit this PR so we can merge it for 1.19.

@RX14
Copy link
Member Author

RX14 commented Oct 14, 2025

Sorry, that's my bad for not reading the processes. I've re-made the PR in #16211.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:feature status:reverted PR was reverted or reverts another one, and is not part of a milestone topic:stdlib:serialization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants