Add parse_json() function to omm #129

Open
wants to merge 1 commit into master
Conversation

CDWimmer

Added a test case as well, following the example set by the xml and csv tests - I think it's ready for review.

@@ -21,6 +22,13 @@ def parse_xml(file):
            fields.update((field.tag, field.text) for field in element)
        yield fields

def parse_json(file):
Owner

Happily, I think we can have users call json.load() directly. It looks like here you take the dictionary returned by json.load(), and then build a second identical dictionary, and then throw the first one away, which I think users might find a little slower than just using the original dictionary?

See if you can update this to make no change to omm.py, but instead make the test test_omm_json_matches_old_tle() just call json.load() itself. If it works and passes, then after this lands, I'll update the docs to show the same maneuver.
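Roughly, something like this (the fixture name is just a placeholder, the test keeps whatever assertions it already has, and this assumes the sample JSON is an array of records):

import json

# Parse the sample OMM JSON directly with the standard library:
with open('sample_omm.json') as f:    # placeholder fixture path
    fields = json.load(f)[0]          # first satellite record in the array
# ...then feed `fields` into the same satellite-building and TLE-comparison
# code that the XML and CSV tests already use.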

Owner

Alternately, since I apparently chose to alias the function csv.DictReader as parse_csv in this file, we could do something like parse_json = json.load. But I suspect that we should simply avoid hiding the lower-level routines, in case folks need to use their options for things like encodings.

Author
@CDWimmer Mar 13, 2024

This response accidentally became three times longer than I would have liked and comes with the largest pinch of salt ever seen by god or man; I can only apologise:

It looks like here you take the dictionary returned by json.load(), and then build a second identical dictionary, and then throw the first one away, which I think users might find a little slower than just using the original dictionary?

Good spot, I think this thought was trying to wrestle its way into my head earlier today!

However, I do believe wrapping it in a def provides a benefit: your parse_xml() produces a generator, whereas a raw json.load returns a list, so wrapping keeps the output type consistent across the parsers. A correction would be

def parse_json(file):
    yield json.load(file)

to achieve the effect without throwing anything away.

A few more thoughts I've had materialise:

  1. json.load takes a bunch of extra arguments that I don't think are very helpful here, and it would be good to not pollute users' IDEs with things that don't make sense to be present in this context.

  2. json.load/loads() has the annoying (for us) property that it will convert whatever JSON it is given into its Python equivalent without complaint. If a user forgot to wrap their satellite data object in an array, they would get no error about their perfectly valid JSON, just an unexpected return type from json.load(): a plain dict instead of a list. One could build in a check to guarantee a consistent return type, or raise a more helpful error telling the user to reformat their JSON (see the sketch after this list).

  3. A fairly important note I think: I'm not sure either of us has used yield properly? I think we need to move yield inside the for block to actually take advantage of it, then construct the return value on the fly instead of building a list, e.g.

def parse_json(file):
    json_data = json.load(file)
    for item in json_data:
        yield {k: v for k, v in item.items()}

This way, we keep generator behaviour, don't throw away any data, and only generate a single dict at a time, while still supporting for fields in parse_json(file) usage.
It's slightly slower for a handful of datasets, but with almost 200,000 lines of JSON covering tens of thousands of satellites, and that number only increasing, it'll be vastly more memory efficient! Who knows what devices will be running this stuff, after all.
I'm sure something similar could be done for parse_xml, but I am actually falling asleep :(

  4. Getting slightly off-topic:
    It could be made to accept both a file path/file object (e.g. StringIO) and a raw JSON string, with a simple use of isinstance() and either swapping to json.loads() (loads = load string) or internally re-wrapping with StringIO; I'm not sure of the best approach (also sketched below). The same could be applied to the xml and csv parsers, because lord knows users will find incredible new ways to get their data into memory (I got here due to wanting to use a database!)
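To make points 2 and 4 concrete, a rough sketch of that combined behaviour might look like this; it's only an illustration of the idea, not code from this PR, the error message wording is invented, and a plain string is treated as raw JSON text rather than a path:

import json

def parse_json(file):
    # Accept either a raw JSON string or an open file-like object.
    if isinstance(file, str):
        data = json.loads(file)
    else:
        data = json.load(file)
    # Valid JSON that isn't an array is almost certainly a user mistake here.
    if not isinstance(data, list):
        raise ValueError('expected a JSON array of satellite records; '
                         'wrap the single object in [...]')
    # Hand back one dict of OMM fields at a time, like parse_xml does.
    for fields in data:
        yield fields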

since I apparently chose to alias the function csv.DictReader as parse_csv in this file...

I'm tempted to suggest that normalising this to also produce a generator would be good :P

Author
@CDWimmer Mar 13, 2024

Coming back to this after some sleep: I'm a dullard. I think it just makes more sense to return a list rather than use yield in all the parse functions; the entire active-satellite JSON array only comes to about 3.7 MB. Even with added Python overheads, if whatever is running the application can't handle loading that, then it probably can't handle the rest of the application either!

I'm back on board Team "Just Wrap json.load", except swapping yield for return, given we're not doing any extra processing, unlike with xml where you obviously do more processing within a loop first.

I suppose it just bothers me that you get inconsistent results depending on the parser used, even if practically they are interchangeable in most use cases.

For comparison's sake I'm running some timing tests on each function, will edit this again when done

Contributor

I get just under 4MB of "compact" json (jq -c) from the combined active and analyst catalogs. Doing some quick'n'dirty tests, it seems that on my particular machine (linux x86_64), with my particular build of python (3.10.2), loading orbital elements takes about 1.15MB per thousand records. Let's maybe bump that up to 1.25MB per thousand records to accommodate all objects having non-empty OBJECT_ID and descriptive OBJECT_NAME. I don't have any 32-bit machines handy to test with.

This would be easier if the catalog were "ndjson", so you could have a generator that yields a line at a time and parses it. Alas, I know of no existing incremental/streaming JSON parser that would return one dict at a time from a list of dicts.
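For what it's worth, here is roughly what that would look like if the catalog were newline-delimited JSON; the file name is made up and this assumes one complete JSON object per line:

import json

def parse_ndjson(file):
    # One satellite record per line: parse lazily, one dict at a time.
    for line in file:
        line = line.strip()
        if line:
            yield json.loads(line)

# hypothetical usage:
# with open('active.ndjson') as f:
#     for fields in parse_ndjson(f):
#         ...  # build a satellite from `fields`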

Owner

Alas, it will be this weekend before I can again look at these details, but I'm happy to see y'all engaged in an entertaining discussion of different approaches. I'm also happy to see CSV winning, as I think it's by far the most sensible (as well as most widely supported) format, since it makes no sense to me to repeat every field name for every single satellite.

So the only aside I'll make at the moment is: have you tried the Pandas read_csv() routine? If we are going to have a documentation section for folks who need to load up very large numbers of satellites, we will probably want to outline—in case they can install and use Pandas—how to load satellites, sort them, and select them from a very large CSV using a Pandas dataframe.
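Something along these lines, perhaps; the file name and the exact column names are assumptions about the CelesTrak OMM CSV, not tested documentation:

import pandas as pd

# Load the whole catalog in one go; the fast C parser handles this easily.
df = pd.read_csv('active.csv')    # hypothetical local copy of the catalog

# Sort and select with ordinary dataframe operations (column names assumed).
df = df.sort_values('MEAN_MOTION')
subset = df[df['OBJECT_NAME'].str.contains('STARLINK', na=False)]

# Each selected row can then be handed to the OMM loader as a plain dict.
for fields in subset.to_dict('records'):
    ...  # build a satellite from `fields`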

Author

I think I agree about csv; ultimately it is very predictable tabular data, so of course it should be in a table! It makes perfect sense to use pandas DataFrames too.

Though I don't know how consistent pandas has been across major Python revisions - this library supports 2.6, but the latest version of pandas requires >=3.9, so it's worth checking whether there are version-specific caveats for examples/documentation. (Or release a final 2.x/3.5 version and then drop support; I am begging for type annotations 😢 3.5 is 9 years old!)

Generally I think docs should be an outline of "you can use this set of built in parsers provided by the library that are guaranteed to work with CelesTrak data (examples) [...] or if you want more advanced control of the data you can construct your own parser that has output in XYZ format... (pandas example that uses some DataFrame feature)"

Contributor

Well, that sure is cursed, but if it works... Run it through tracemalloc? I'm now emotionally invested in beating the CSV parser

"Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should"

I doubt that json loading will ever be faster than csv, but it was fun to see if I could do it without reading the whole file in.

Owner

I don't know how consistent pandas has been across major Python revisions…

Do you mean its performance? I have found that, if I'm careful, I can get it to switch into pure-C mode for CSV parsing in all versions that I try. Let me know if you hit a snag where it tries using Python parsing, and I'll try finding a workaround for you. Note that people using old Python versions usually just download and install the Pandas version that was current at that time, and can use Pandas just fine as long as they don't rely on recent tweaks and features.
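For reference (and hedging that this is general pandas behaviour rather than anything specific to this project): asking for the C engine explicitly makes read_csv raise an error rather than silently fall back to the slower Python parser when an option isn't supported, which is one way to be sure which mode you're in:

import pandas as pd

# engine='c' insists on the fast C parser; options it doesn't support then
# raise an error instead of quietly switching to the Python engine.
df = pd.read_csv('active.csv', engine='c')   # hypothetical catalog file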

…this library supports 2.6

Just to make things simple, I've dropped the "2.6" from setup.py and from now on plan to only advertise 2.7 and later. I think it's fair at this point to tell Python 2.6 users to consider themselves permanent users of Skyfield 1.48.

Author
@CDWimmer Mar 14, 2024

Do you mean its performance?

Sorry, I was unclear - I just mean whether there have been changes to the "api", how things are used, feature changes, etc.

install the Pandas version that was current at that time

Of course, yeah :) I was just worried that if there were changes to how pandas does things, documentation examples might want to reflect that - but an exercise for the reader never hurt ;)

I can get it to switch into pure-C mode for CSV parsing

That is great news about the C code; I'll see if I can put a quick something together to do that. Would you just do df = read_csv(...) and then for row in df.iterrows() as a fair(ish) comparison? Also, how do you tell when it's doing pure C? Just the time it takes?

only advertise 2.7

Sounds good, though I did read a survey suggesting only about 1% or so of Python users are on 2.x now - though I suspect some physics/astronomy people aren't going to follow suit with that stat. Do you know what the userbase of this library looks like?

edit:

Results

Running CSV builtin test
Current memory usage is 5.048KB; Peak was 49.077KB; Diff = 44.029KB
CSV builtin avg time: 0.1141 s
Running CSV pandas test with default(?) backend
Current memory usage is 93.833KB; Peak was 8145.391KB; Diff = 8051.558KB
CSV pandas avg time: 0.6323 s
Running CSV pandas test with numpy_nullable backend
Current memory usage is 195.294KB; Peak was 7555.298KB; Diff = 7360.004KB
CSV pandas avg time: 0.8052 s
Running CSV pandas test with pyarrow backend - no idea why they're planning to make this default
Current memory usage is 381.754KB; Peak was 7420.537KB; Diff = 7038.783KB
CSV pandas avg time: 0.9581 s

I must be doing something wrong...
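For anyone wanting to reproduce numbers like these, a minimal harness might look roughly like the following; this is a guess at the shape of the test, not the exact script that produced the figures above:

import csv
import time
import tracemalloc

def run(label, fn, repeats=10):
    # Time the parser over several runs, then report tracemalloc's view of
    # current and peak allocations (tracing does add some overhead).
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = (time.perf_counter() - start) / repeats
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print('%s: %.4f s avg; current %.3f KB, peak %.3f KB'
          % (label, elapsed, current / 1e3, peak / 1e3))

def csv_builtin():
    with open('active.csv') as f:      # hypothetical local catalog
        list(csv.DictReader(f))

run('CSV builtin', csv_builtin)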
