-
Notifications
You must be signed in to change notification settings - Fork 30
Implement ST_Dump #269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ST_Dump #269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
This crashes R session, so probably I'm doing something wrong.
When the R session crashes, please assume that I am the one that caused it 🙂 . This one is probably nanoarrow having a hard time converting an list of an extension type. Try |> as_nanoarrow_array_stream() |> as_arrow_table() (the conversion of geoarrow.wkb to extension type attempts to do a zero-copy conversion and that is probably doing something very fishy).
rust/sedona-functions/src/st_dump.rs
Outdated
| // TODO: what's the decent default value for the capacity? | ||
| BinaryBuilder::with_capacity(8, WKB_MIN_PROBABLE_BYTES * 8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WKB_MIN_PROBABLE_BYTES * executor.num_iterations() is probably fine. This will be very close to the input binary array in capacity and size.
rust/sedona-functions/src/st_dump.rs
Outdated
|
|
||
| let input = create_array(&[Some("POINT (1 2)")], &sedona_type); | ||
|
|
||
| let expected = create_array(&[Some("POINT (1 2)")], &WKB_GEOMETRY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that this will be pretty verbose to test. This is a unique one but luckily byte-for-byte equality shouldn't be an issue. You can probably use create_array() here and assemble a ListArray using offsets, then use assert_eq!(). The failure message will be terrible but it will get the job done.
|
Thanks so much for the tip! |
Co-authored-by: Dewey Dunnington <[email protected]>
…ona-db into feat/st_dump
|
Thanks to your help, I think I figured out how to handle and test these complex types. I'll work on the actual implementation... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a stab at a Python test since I think this is going to be a pain. Feel free to take or leave, but
@pytest.mark.parametrize("eng", [SedonaDB, PostGIS])
def test_st_dump(eng):
wkt = "MULTIPOINT (0 1, 2 3)"
expected = [
{
"path": [1],
"geom": shapely.from_wkt("POINT (0 1)").wkb,
},
{
"path": [2],
"geom": shapely.from_wkt("POINT (1 2)").wkb,
},
]
if eng == PostGIS:
eng = eng.create_or_skip()
result = eng.execute_and_collect(f"SELECT ST_Dump({geom_or_null(wkt)})")
else:
eng = eng.create_or_skip()
result = eng.execute_and_collect(f"SELECT unnest(ST_Dump({geom_or_null(wkt)}))")
df = eng.result_to_pandas(result)
actual = df.iloc[0]
for item in actual:
assert list(item.keys()) == ["path", "geom"]...can hopefully get you started if you haven't gotten there yet.
rust/sedona-functions/src/st_dump.rs
Outdated
| if let Some(wkb) = maybe_wkb { | ||
| let mut struct_builder = builder.struct_builder(); | ||
|
|
||
| let mut cur_path: Vec<u32> = Vec::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't make much of a difference here but you could declare this outside the loop and reset the size to zero to reuse any previous heap allocation.
|
Thanks for the Python test! I hope I can get started soon with it. |
|
Benchmark result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool!
One optional suggestion that you should feel free to punt on (this is also great as is!)
rust/sedona-functions/src/st_dump.rs
Outdated
| let path_builder = self | ||
| .struct_builder | ||
| .field_builder::<ListBuilder<UInt32Builder>>(0) | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it matters here given that you've nicely demonstrated that this is pretty fast, but we have tried in other places to avoid downcasting in a loop. Something like:
struct STDumpBuilder {
path_array_builder: UInt32Builder,
path_array_offsets_builder: OffsetBufferBuilder,
geom_builder: BinaryBuilder,
}...and then implement a finish() -> StructArray that assembles at the end.
(optional!)
|
Thanks, I simply didn't know! It looks cleaner. I'm trying to switch to that approach, but probably it takes some more time. Currently, I'm struggling to figure out how |
|
This should be ready for review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! (This is probably one of the hardest functions to implement 🚀 )
I have a tiny comment about CRS propagation but I'll follow-up with a PR for that one 🙂
| fn return_type(&self, args: &[SedonaType]) -> Result<Option<SedonaType>> { | ||
| let matcher = ArgMatcher::new(vec![ArgMatcher::is_geometry()], geometry_dump_type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should to propagate the CRS of args[0] into the geometry_dump_type(). The reason this isn't done in any other type matcher code is because the type matcher does this automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for catching!
Note: Sedona's
ST_Dumpreturns a different result from PostGIS and DuckDB. This pull request implements the PostGIS's version ([]Struct { path: [int], geom: GEOMETRY })https://sedona.apache.org/latest/api/sql/Function/#st_dump
https://postgis.net/docs/en/ST_Dump.html