Run queries in python benchmarks using only one thread #24
base: main
Conversation
I haven't run into DuckDB running things on a single thread before, although I have dealt with wildly different timings between the CLI and Python. It's worth checking a function other than ST_Contains, too, and also ensuring we're benchmarking against forthcoming DuckDB. The benchmark we probably want to focus on is the "user-facing default", since that is how people will perceive the speed of our engine. We might also want to tune the PostGIS instance to be smarter, since the defaults are very bad for geometry (the last time I did this was https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/#postgis ). Some other issues with the current benchmarks:
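For reference, a hedged sketch of the kind of `postgresql.conf` tuning the linked post is talking about; the settings are real PostgreSQL parameters, but the values here are illustrative placeholders, not recommendations:

```
# Illustrative postgresql.conf fragment -- values are placeholders, not tuned advice
shared_buffers = 4GB                  # the 128MB default is far too small for large geometry tables
work_mem = 256MB                      # per-operation sort/hash memory
maintenance_work_mem = 1GB            # speeds up CREATE INDEX on geometry columns
max_parallel_workers_per_gather = 4   # allow parallel sequential scans
```

Any serious comparison against PostGIS should probably document whichever values it actually uses.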
I was using my local build of duckdb-spatial from the latest trunk; the duckdb Python package is also my locally built 1.4.1-dev version. I have tried re-installing duckdb using
I also found this problem. It will drastically affect the benchmarking results for predicates such as ST_Intersects, since some implementations may return early in such cases. I'll fix the benchmark data by generating uniformly distributed pairs.
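As a sketch of what "uniformly distributed pairs" could look like; the bounding box, WKT formatting, and function name are my assumptions, not the benchmark's actual generator:

```python
import random

def uniform_point_pairs(n, bounds=(-180.0, -90.0, 180.0, 90.0), seed=42):
    """Generate n pairs of uniformly distributed points as WKT strings.

    Both points in a pair are drawn independently over the whole bounding
    box, so most pairs will NOT intersect -- avoiding the early-return bias
    described above.
    """
    xmin, ymin, xmax, ymax = bounds
    rng = random.Random(seed)  # fixed seed keeps benchmark runs reproducible

    def point():
        return f"POINT ({rng.uniform(xmin, xmax)} {rng.uniform(ymin, ymax)})"

    return [(point(), point()) for _ in range(n)]

pairs = uniform_point_pairs(10)
```

The fixed seed matters here: without it, every benchmark run would exercise a different dataset and the numbers would not be comparable across runs.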
When I run

```python
import duckdb

duckdb.load_extension("spatial")
duckdb.sql(
    "SELECT count(*) FROM 'submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet' WHERE ST_Intersects(geometry, geometry)"
).to_arrow_table()
```

...I see Activity Monitor reporting 900% CPU usage (the query runs in 30 s). I'll check the benchmark too in a bit.
Yeah, I knowingly made
I knowingly did this too, actually. When I started, this was just supposed to be a scuffed development tool rather than something we'd use for presenting results publicly. I created these benchmarks with the intention of comparing purely the function implementations (ignoring Parquet reading optimizations), since that is what I've been focused on. I would still argue that using "table" is probably more useful as a developer when it comes to optimizing functions, since it gives a raw 1-to-1 comparison. We could make it configurable (later, potentially). For now, it's totally fine if we want to change it to using Parquet scans. Sorry for all of these issues 😬
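Making the scan source configurable could be as small as parameterizing the FROM clause; the helper and argument names below are hypothetical, not part of the existing harness:

```python
def benchmark_query(select_sql, source="table", table_name="random",
                    parquet_path="data/random.parquet"):
    """Build benchmark SQL against either a native/in-memory table
    (isolates function implementation cost) or a Parquet scan
    (includes the I/O path users actually hit).
    """
    if source == "table":
        from_clause = table_name
    elif source == "parquet":
        from_clause = f"'{parquet_path}'"
    else:
        raise ValueError(f"unknown source: {source}")
    return f"SELECT {select_sql} FROM {from_clause}"
```

Running both variants side by side would also quantify exactly how much of the gap is attributable to storage format rather than to the function implementations.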
I don't see any issues running these without the benchmark suite:

```python
from sedonadb.testing import SedonaDB, DuckDB

sedonadb = SedonaDB()
duckdb = DuckDB()
random = sedonadb.con.sql("""
    SELECT geometry FROM sd_random_geometry('{
        "geom_type": "Point",
        "target_rows": 10000000,
        "vertices_per_linestring_range": [2, 2]
    }')""")
duckdb.create_table_arrow("random", random)
sedonadb.create_table_arrow("random", random)

%time duckdb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
#> CPU times: user 1min 20s, sys: 2.04 s, total: 1min 22s
#> Wall time: 7.53 s

%time sedonadb.execute_and_collect("SELECT ST_Buffer(geometry, 0.1) FROM random")
#> CPU times: user 1min 31s, sys: 3.05 s, total: 1min 35s
#> Wall time: 9.75 s
```

Maybe
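The CPU-to-wall ratio in the timings above gives a rough estimate of achieved parallelism; this is my back-of-envelope calculation, not something the harness reports:

```python
def achieved_parallelism(cpu_seconds, wall_seconds):
    """Rough effective thread count: total CPU time divided by wall time."""
    return cpu_seconds / wall_seconds

# Plugging in the totals from the %time output above
# (1min 22s = 82 s, 1min 35s = 95 s):
duckdb_par = achieved_parallelism(82.0, 7.53)    # ~10.9 effective threads
sedonadb_par = achieved_parallelism(95.0, 9.75)  # ~9.7 effective threads
```

Both engines are clearly multi-threaded on this workload, which is consistent with the claim that the single-thread behavior only shows up on smaller inputs.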
+1 to adding a Parquet scan. We can have both (DuckDB Parquet and DuckDB internal) and default to comparing against the Parquet scan. Internal storage formats always have a huge impact on performance, and we are targeting the lakehouse architecture, which uses open formats like Parquet for storage.
I did some experiments and found that it is not related to whether we are using a test/benchmark framework or not; it is related to the size of the dataset. I tried this workload with various
Generally, we observe less parallelism when the base table contains less data. This is probably related to how DuckDB estimates the thread count. This behavior is also documented here.
Interesting. I wonder whether this also happens if we run a spatial join against DuckDB; this estimation gives us non-deterministic performance. It looks like there is a way to force DuckDB to use a fixed number of threads as well?
I can see how there are two types of benchmarks that are equally valuable:
I have investigated the performance characteristics using a profiler; both engines use all available processors when running the spatial join benchmarks. This could still be a problem if we run the benchmarks in a different environment or with a different dataset, though. We should always understand the reasons behind the numbers the benchmarks produce.
How does multithreading amortize per-batch and per-item overhead? Could you explain that a bit more?
It might not! I am mostly just skeptical of creating too artificial a situation for the sake of fairness (although we can certainly do it for ourselves if it's giving us useful information or finding performance issues in specific benchmarks). Parts of Sedona and SedonaDB also adapt various configuration based on statistics...it's all part of the magic of the engine(s) that we want to measure.
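One way to read the amortization question above is as a toy cost model; all constants, and the assumption that batches split evenly across threads, are invented for illustration:

```python
import math

def batch_runtime(n_items, n_threads, per_item=1e-6, per_batch=1e-3, batch_size=2048):
    """Toy cost model: each batch pays a fixed overhead plus per-item work,
    and batches are divided evenly across threads (ignoring skew, contention,
    and any non-parallelizable setup cost).
    """
    n_batches = math.ceil(n_items / batch_size)
    total = n_batches * per_batch + n_items * per_item
    return total / n_threads
```

Under this model, adding threads shrinks the wall-clock share of per-batch overhead along with everything else, but only if the batches actually parallelize; any serial setup cost would be unaffected, which is the sense in which multithreading "might not" amortize the overhead.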
SedonaDB may run queries using multiple threads, so the results are not directly comparable with other engines; see comment #10 (comment) for details. This patch configures SedonaDB and DuckDB to run queries using a single thread.