Skip to content

Conversation

@yutannihilation
Copy link
Contributor

@yutannihilation yutannihilation commented Nov 1, 2025

Note: Sedona's ST_Dump returns a different result from PostGIS and DuckDB. This pull request implements the PostGIS's version ([]Struct { path: [int], geom: GEOMETRY })

https://sedona.apache.org/latest/api/sql/Function/#st_dump
https://postgis.net/docs/en/ST_Dump.html

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

This crashes R session, so probably I'm doing something wrong.

When the R session crashes, please assume that I am the one that caused it 🙂 . This one is probably nanoarrow having a hard time converting an list of an extension type. Try |> as_nanoarrow_array_stream() |> as_arrow_table() (the conversion of geoarrow.wkb to extension type attempts to do a zero-copy conversion and that is probably doing something very fishy).

Comment on lines 79 to 80
// TODO: what's the decent default value for the capacity?
BinaryBuilder::with_capacity(8, WKB_MIN_PROBABLE_BYTES * 8),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WKB_MIN_PROBABLE_BYTES * executor.num_iterations() is probably fine. This will be very close to the input binary array in capacity and size.


let input = create_array(&[Some("POINT (1 2)")], &sedona_type);

let expected = create_array(&[Some("POINT (1 2)")], &WKB_GEOMETRY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that this will be pretty verbose to test. This is a unique one but luckily byte-for-byte equality shouldn't be an issue. You can probably use create_array() here and assemble a ListArray using offsets, then use assert_eq!(). The failure message will be terrible but it will get the job done.

@yutannihilation
Copy link
Contributor Author

Thanks so much for the tip!

@yutannihilation
Copy link
Contributor Author

Thanks to your help, I think I figured out how to handle and test these complex types. I'll work on the actual implementation...

@yutannihilation yutannihilation marked this pull request as ready for review November 2, 2025 13:31
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a stab at a Python test since I think this is going to be a pain. Feel free to take or leave, but

@pytest.mark.parametrize("eng", [SedonaDB, PostGIS])
def test_st_dump(eng):
    wkt = "MULTIPOINT (0 1, 2 3)"
    expected = [
        {
            "path": [1],
            "geom": shapely.from_wkt("POINT (0 1)").wkb,
        },
        {
            "path": [2],
            "geom": shapely.from_wkt("POINT (1 2)").wkb,
        },
    ]

    if eng == PostGIS:
        eng = eng.create_or_skip()
        result = eng.execute_and_collect(f"SELECT ST_Dump({geom_or_null(wkt)})")
    else:
        eng = eng.create_or_skip()
        result = eng.execute_and_collect(f"SELECT unnest(ST_Dump({geom_or_null(wkt)}))")
    df = eng.result_to_pandas(result)
    actual = df.iloc[0]

    for item in actual:
        assert list(item.keys()) == ["path", "geom"]

...can hopefully get you started if you haven't gotten there yet.

if let Some(wkb) = maybe_wkb {
let mut struct_builder = builder.struct_builder();

let mut cur_path: Vec<u32> = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't make much of a difference here but you could declare this outside the loop and reset the size to zero to reuse any previous heap allocation.

@yutannihilation
Copy link
Contributor Author

Thanks for the Python test! I hope I can get started soon with it.

@yutannihilation
Copy link
Contributor Author

Benchmark result:

------------------------------------------------------------------------------------------------------------- benchmark: 6 tests -------------------------------------------------------------------------------------------------------------
Name (time in ms)                                                  Min                    Max                   Mean              StdDev                 Median                 IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_st_dump[collections_simple-SedonaDBSingleThread]          25.8595 (1.0)          28.0133 (1.0)          26.1220 (1.0)        0.4491 (1.0)          25.9570 (1.0)        0.1710 (1.0)           2;5  38.2819 (1.0)          26           1
test_st_dump[collections_simple-DuckDBSingleThread]            54.1694 (2.09)         61.4570 (2.19)         56.3276 (2.16)       2.9164 (6.49)         55.2865 (2.13)       2.1671 (12.68)         1;1  17.7533 (0.46)          5           1
test_st_dump[collections_complex-SedonaDBSingleThread]        382.6457 (14.80)       472.4213 (16.86)       402.0678 (15.39)     39.3457 (87.60)       385.3617 (14.85)     23.3447 (136.55)        1;1   2.4871 (0.06)          5           1
test_st_dump[collections_complex-DuckDBSingleThread]          749.3529 (28.98)       763.6902 (27.26)       757.9570 (29.02)      5.9036 (13.14)       760.3898 (29.29)      9.0438 (52.90)         1;0   1.3193 (0.03)          5           1
test_st_dump[collections_simple-PostGISSingleThread]        3,297.8760 (127.53)    3,372.9212 (120.40)    3,340.3870 (127.88)    32.3133 (71.94)     3,355.1089 (129.26)    53.6658 (313.91)        2;0   0.2994 (0.01)          5           1
test_st_dump[collections_complex-PostGISSingleThread]      37,008.9275 (>1000.0)  37,411.0606 (>1000.0)  37,149.3238 (>1000.0)  184.6306 (411.07)   37,029.6383 (>1000.0)  295.5638 (>1000.0)       1;0   0.0269 (0.00)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool!

One optional suggestion that you should feel free to punt on (this is also great as is!)

Comment on lines 93 to 96
let path_builder = self
.struct_builder
.field_builder::<ListBuilder<UInt32Builder>>(0)
.unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it matters here given that you've nicely demonstrated that this is pretty fast, but we have tried in other places to avoid downcasting in a loop. Something like:

struct STDumpBuilder {
    path_array_builder: UInt32Builder,
    path_array_offsets_builder: OffsetBufferBuilder,
    geom_builder: BinaryBuilder,
}

...and then implement a finish() -> StructArray that assembles at the end.

(optional!)

@yutannihilation yutannihilation marked this pull request as draft November 4, 2025 23:36
@yutannihilation
Copy link
Contributor Author

Thanks, I simply didn't know! It looks cleaner. I'm trying to switch to that approach, but probably it takes some more time. Currently, I'm struggling to figure out how append_null() should be handled, but probably there are several more things that I still don't understand...

@paleolimbot paleolimbot marked this pull request as ready for review November 5, 2025 15:44
@paleolimbot paleolimbot marked this pull request as draft November 5, 2025 15:44
@yutannihilation yutannihilation marked this pull request as ready for review November 5, 2025 15:52
@yutannihilation
Copy link
Contributor Author

This should be ready for review again.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! (This is probably one of the hardest functions to implement 🚀 )

I have a tiny comment about CRS propagation but I'll follow-up with a PR for that one 🙂

Comment on lines +241 to +242
fn return_type(&self, args: &[SedonaType]) -> Result<Option<SedonaType>> {
let matcher = ArgMatcher::new(vec![ArgMatcher::is_geometry()], geometry_dump_type());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should to propagate the CRS of args[0] into the geometry_dump_type(). The reason this isn't done in any other type matcher code is because the type matcher does this automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for catching!

@paleolimbot paleolimbot merged commit 2939063 into apache:main Nov 5, 2025
12 checks passed
@yutannihilation yutannihilation deleted the feat/st_dump branch November 5, 2025 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants