
Conversation

Kontinuation
Member

SedonaDB may run queries using multiple threads, so the results are not directly comparable with other engines. See comment #10 (comment) for details. This patch configures both SedonaDB and DuckDB to run queries using a single thread.
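
For reference, a minimal sketch of what pinning DuckDB to one thread can look like from Python (this only illustrates DuckDB's threads setting, not necessarily how this patch does it; the SedonaDB side is omitted since it depends on the option this patch adds):

import duckdb

# Pin DuckDB to a single thread before running benchmark queries.
con = duckdb.connect()
con.execute("SET threads TO 1")
print(con.execute("SELECT current_setting('threads')").fetchone())  # (1,)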

@paleolimbot
Member

I haven't run into DuckDB running queries on a single thread before, although I have dealt with wildly different timings between the CLI and Python. It's worth checking something other than ST_Contains, too, and also ensuring we're benchmarking against the forthcoming DuckDB release (pip install duckdb --upgrade --pre), since these numbers will all change in a week.

The benchmark we probably want to focus on is the "user-facing default", since that is how people will perceive the speed of our engine. We might also want to tune the PostGIS instance to be smarter, since the defaults are very bad for geometry (the last time I did this was https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/#postgis ).
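
For context, a hedged sketch of the kind of PostGIS tuning that post applies (these are standard PostgreSQL settings, but the values and the connection string below are placeholders, not recommendations):

import psycopg2  # assumes the benchmark's PostGIS instance is reachable like this

conn = psycopg2.connect("host=localhost dbname=benchmark user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET shared_buffers = '4GB'")  # requires a server restart
    cur.execute("ALTER SYSTEM SET work_mem = '256MB'")
    cur.execute("ALTER SYSTEM SET max_parallel_workers_per_gather = 4")
    cur.execute("SELECT pg_reload_conf()")  # picks up the reloadable settings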

Some other issues with the current benchmarks:

  • They benchmark on a "table" and not a Parquet scan. We have the edge on a Parquet scan, while PostGIS and DuckDB have an edge with their native table formats. The Parquet scan is probably more realistic.
  • We don't benchmark predicates with realistic input (they are ST_Contains with two identical inputs). The array/scalar case is probably the best one to focus on, since it is more likely to affect the perceived speed of our engine's first release (see the sketch below).
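
As a concrete illustration of both points, a hedged sketch of an array/scalar ST_Contains over a Parquet scan in DuckDB (the polygon is a placeholder; the Parquet path is the one used later in this thread):

import duckdb

duckdb.load_extension("spatial")
# One constant polygon tested against every scanned geometry, instead of
# ST_Contains(geometry, geometry) on two identical inputs.
duckdb.sql("""
    SELECT count(*)
    FROM 'submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet'
    WHERE ST_Contains(
        ST_GeomFromText('POLYGON ((-80 25, -80 26, -79 26, -79 25, -80 25))'),
        geometry)
""").show()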

@Kontinuation
Member Author

Kontinuation commented Sep 5, 2025

I was using my local build of duckdb-spatial from the latest trunk; duckdb-python is also my locally built 1.4.1-dev version.

I tried reinstalling duckdb using pip install duckdb --upgrade --pre and ran another benchmark (test_st_buffer); the behavior is basically the same:

----------------------------------------------------------------------------------- benchmark 'table=collections_complex': 2 tests ----------------------------------------------------------------------------------
Name (time in ms)                                     Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_st_buffer[collections_complex-SedonaDB]     115.0520 (1.0)      157.6825 (1.0)      128.0455 (1.0)      13.9384 (1.73)     122.8113 (1.0)      10.1471 (1.0)           2;1  7.8097 (1.0)           9           1
test_st_buffer[collections_complex-DuckDB]       611.7650 (5.32)     633.0960 (4.02)     623.6101 (4.87)      8.0625 (1.0)      623.5183 (5.08)     10.8149 (1.07)          2;0  1.6036 (0.21)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------- benchmark 'table=collections_simple': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)                                    Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_st_buffer[collections_simple-SedonaDB]     111.6307 (1.0)      128.6885 (1.0)      120.1398 (1.0)       5.7616 (1.0)      121.7260 (1.0)       8.5031 (1.0)           3;0  8.3236 (1.0)           9           1
test_st_buffer[collections_simple-DuckDB]       619.3594 (5.55)     698.7082 (5.43)     655.4287 (5.46)     34.3422 (5.96)     657.1137 (5.40)     60.9305 (7.17)          2;0  1.5257 (0.18)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@Kontinuation
Member Author

  • We don't benchmark predicates with realistic input (they are ST_Contains with two identical inputs). The array/scalar case is probably the best one to focus on, since it is more likely to affect the perceived speed of our engine's first release.

I also noticed this problem. It drastically affects the benchmark results for predicates such as ST_Intersects, since some implementations may return early when the two inputs are identical. I'll fix the benchmark data by generating uniformly distributed pairs.
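
A hedged sketch of the kind of data generation I have in mind (the bounds and row counts are placeholders): each side of the predicate gets its own independent, uniformly distributed geometries instead of reusing one column twice.

import random

random.seed(42)

def random_point_wkt(xmin=-180.0, xmax=180.0, ymin=-90.0, ymax=90.0):
    # A uniformly distributed point in the given bounding box, as WKT.
    return f"POINT ({random.uniform(xmin, xmax)} {random.uniform(ymin, ymax)})"

# Two independent columns, so predicates like ST_Intersects cannot
# short-circuit on identical inputs.
geom1 = [random_point_wkt() for _ in range(100_000)]
geom2 = [random_point_wkt() for _ in range(100_000)]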

@Kontinuation
Member Author

You can run the test_st_buffer benchmark yourself to see whether this single-thread behavior happens only on my side.

The benchmark results for ST_Contains reported by Peter seem to reflect single-threaded performance. I'm not sure whether I'm doing something wrong that causes this multi-core concurrency when running SedonaDB.

[screenshot: ST_Contains benchmark results]

@paleolimbot
Member

When I run

import duckdb

duckdb.load_extension("spatial")
duckdb.sql(
    "SELECT count(*) FROM 'submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet' WHERE ST_Intersects(geometry, geometry)"
).to_arrow_table()

...I see Activity Monitor reporting 900% CPU usage (the query runs in about 30 s). I'll check the benchmark too in a bit.

@petern48
Collaborator

petern48 commented Sep 5, 2025

  • We don't benchmark predicates with realistic input (they are ST_Contains with two identical inputs). The array/scalar case is probably the best one to focus on, since it is more likely to affect the perceived speed of our engine's first release.

Yeah, I knowingly made geom1 and geom2 identical here. I was primarily focused on the non-predicate functions, so I just put something together quickly for binary functions. I meant to circle back to it later, but ran into other things.

  • They benchmark on a "table" and not a Parquet scan. We have the edge on a Parquet scan, PostGIS and DuckDB have an edge with their native table format. The Parquet scan probably is more realistic.

I knowingly did this too, actually. When I started, this was just supposed to be a rough development tool rather than something we'd use for presenting results publicly. I created these benchmarks to compare purely the function implementations (ignoring Parquet-reading optimizations), since that is what I've been focused on.

I would still argue that using a "table" is more useful for a developer optimizing functions, since it is a raw 1-to-1 comparison. We could make it configurable (potentially later). For now, it's totally fine if we want to change it to use Parquet scans.

Sorry for all of these issues 😬

@paleolimbot
Member

I don't see any issues running these without the benchmark suite:

from sedonadb.testing import SedonaDB, DuckDB

sedonadb = SedonaDB()
duckdb = DuckDB()

random = sedonadb.con.sql("""
    SELECT geometry FROM sd_random_geometry('{
                    "geom_type": "Point",
                    "target_rows": 10000000,
                    "vertices_per_linestring_range": [2, 2]
    }')""")


duckdb.create_table_arrow("random", random)
sedonadb.create_table_arrow("random", random)

%time duckdb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
#> CPU times: user 1min 20s, sys: 2.04 s, total: 1min 22s
#> Wall time: 7.53 s

%time sedonadb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
#> CPU times: user 1min 31s, sys: 3.05 s, total: 1min 35s
#> Wall time: 9.75 s

Maybe DuckDB always runs single-threaded under pytest? Probably it makes sense to roll our own benchmark suite.
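
If we do roll our own, even a tiny harness like this sketch would sidestep whatever pytest-benchmark is doing differently (it reuses the sedonadb/duckdb objects and the random table created above):

import statistics
import time

def bench(engine, sql, rounds=5):
    # Time the same execute-and-collect path the pytest benchmarks measure.
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        engine.execute_and_collect(sql)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)

for name, engine in [("SedonaDB", sedonadb), ("DuckDB", duckdb)]:
    best, median = bench(engine, "SELECT ST_Buffer(geometry, 0.1) FROM random")
    print(f"{name}: min={best:.3f}s median={median:.3f}s")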

@jiayuasu
Member

jiayuasu commented Sep 5, 2025

+1 to adding a Parquet scan. We can have both (DuckDB Parquet and DuckDB internal) and default to comparing against the Parquet scan. The internal storage format always has a huge impact on performance, and we are targeting the lakehouse architecture, which uses open formats like Parquet for storage.
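
For example, a hedged sketch of running the same query both ways in DuckDB (the path is the Parquet file used earlier in this thread; the predicate is only a stand-in):

import duckdb

duckdb.load_extension("spatial")
path = "submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet"

# Materialize a copy into DuckDB's internal storage format.
duckdb.sql(f"CREATE OR REPLACE TABLE buildings AS SELECT * FROM '{path}'")

query = "SELECT count(*) FROM {src} WHERE ST_Intersects(geometry, geometry)"
duckdb.sql(query.format(src=f"'{path}'")).show()  # Parquet scan
duckdb.sql(query.format(src="buildings")).show()  # internal table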

@Kontinuation
Member Author

Kontinuation commented Sep 10, 2025

I don't see any issues running these without the benchmark suite:

[code and timings quoted from the comment above omitted]

Maybe DuckDB always runs single-threaded under pytest? Probably it makes sense to roll our own benchmark suite.

I did some experiments and found that it is not related to whether we are using a test/benchmark framework; it is related to the size of the dataset.

I tried this workload with various target_rows configurations when generating the test dataset. Here are the results:

  • target_rows = 100000: DuckDB uses a single thread, SedonaDB uses 10 threads (this is the data size in our pytest benchmarks)
  • target_rows = 500000: DuckDB uses 4 threads, SedonaDB uses 10 threads
  • target_rows = 1000000: DuckDB uses 5 threads, SedonaDB uses 10 threads

Generally, we observe less parallelism when the base table contains less data. This is probably related to how DuckDB chooses its thread count: DuckDB parallelizes over row groups of roughly 122,880 rows, so a 100,000-row table fits in a single row group and gets a single thread. This behavior is also documented in the DuckDB documentation.

@jiayuasu
Member

Interesting. I wonder whether this also happens when we run a spatial join against DuckDB.

This estimation gives us non-deterministic performance. But it looks like there is a way to force DuckDB to use a fixed number of threads as well?

@paleolimbot
Member

I can see how there are two types of benchmarks that are equally valuable:

  • Integration-style benchmarks that use the defaults and read from Parquet (e.g., check the perceived speed of a relatively realistic query)
  • Unit-style benchmarks that are just a way to check whether our particular implementation of a scalar function (and its iteration overhead) is reasonable. Running these from memory on one thread (or the same number of threads) is possibly more comparable, but forcing a single thread is maybe unrealistic: in practice some of our per-batch and per-item overhead is amortized over multiple threads and is possibly not something we should spend time optimizing yet.

@Kontinuation
Member Author

Kontinuation commented Sep 11, 2025

I wonder whether this also happens when we run a spatial join against DuckDB.

I investigated the performance characteristics using a profiler; both engines use all available processors when running the spatial join benchmarks. But this could become a problem if we run the benchmarks in a different environment or with a different dataset. We should always understand the reasons behind the numbers a benchmark produces.

forcing a single thread is maybe unrealistic because in practice some of our per-batch and per-item overhead is amortized over multiple threads

How does multithreading amortize per-batch and per-item overhead? Could you explain that a bit more?

@paleolimbot
Member

How does multithreading amortize per-batch and per-item overhead? Could you explain that a bit more?

It might not! I am mostly just skeptical of creating too much of an artificial situation for the sake of fairness (although we can certainly do it for ourselves if it's giving us useful information or finding performance issues in specific benchmarks). Parts of Sedona and SedonaDB also adapt various configurations based on statistics... it's all part of the magic of the engine(s) that we want to measure.
