-
Notifications
You must be signed in to change notification settings - Fork 18
Improve the random benchmark data generator to generate non-identical pairs of geometries #70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ef3dbab
to
f9302ff
Compare
f9302ff
to
954a634
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the random benchmark data generator to create non-identical pairs of geometries for spatial predicate testing, addressing concerns that identical geometry pairs don't reflect typical real-world workloads with high selectivity.
- Modified
generate_circular_vertices
function to accept an RNG parameter for randomized starting angles - Updated benchmark data generation to create two separate geometry sets with overlapping spatial distributions
- Changed test seeds to ensure different geometry pairs are generated
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
rust/sedona-testing/src/datagen.rs | Added RNG parameter to generate_circular_vertices for randomized geometry generation |
benchmarks/test_bench_base.py | Refactored to generate two distinct geometry sets with ~2% intersection rate |
python/sedonadb/tests/test_sjoin.py | Updated test seeds to generate different geometry pairs |
python/sedonadb/tests/functions/test_distance.py | Added numeric epsilon for distance test precision |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Randomizing the angle is a good idea, although it makes your sjoin tests not be perfect diamonds anymore. As long as that's OK with you!
…sing `with` statement (#71) This patch depends on #70. The pytest may hang when some of the test fails. This is a workflow run that exhibited this problem: https://github.com/apache/sedona-db/actions/runs/17668086933/job/50213528921 Pytest does not free up resources held by local variables by calling `__del__` in time when a test fails. When pytest captured the AssertionError, it effectively adds a reference to the local c object, preventing it from being __del__-ed when the function ends. This leaves the transaction started by the failed test active when the next test starts running. We generate test data by dropping a temporary table, inserting data into the temporary table, and finally renaming the temporary table. Dropping the temporary table will be blocked by a lock held by the still-active transaction started by the previously failed test: ``` postgres=# select pid, query, state, wait_event, wait_event_type, backend_type from pg_stat_activity; pid | query | state | wait_event | wait_event_type | backend_type -------+--------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+------------------------------ 99895 | | | AutoVacuumMain | Activity | autovacuum launcher 99896 | | | LogicalLauncherMain | Activity | logical replication launcher 83813 | select pid, query, state, wait_event, wait_event_type, backend_type from pg_stat_activity; | active | | | client backend 47073 | DROP TABLE IF EXISTS "public" . "sjoin_polygon_xtemp" | idle in transaction | ClientRead | Client | client backend 51565 | DROP TABLE IF EXISTS "public" . "sjoin_point_xtemp" | active | relation | Lock | client backend 66330 | DROP TABLE IF EXISTS "public" . "sjoin_point_xtemp" | active | relation | Lock | client backend 71494 | DROP TABLE IF EXISTS "public" . "sjoin_point_xtemp" | active | relation | Lock | client backend 99892 | | | BgWriterHibernate | Activity | background writer 99891 | | | CheckpointerMain | Activity | checkpointer 99894 | | | WalWriterMain | Activity | walwriter ``` This patch fixes this problem by managing connections using `with` statement. The transaction will be committed or rolled back when the connection is closed. This prevents us from leaking transactions and blocking ourselves on table locks.
As mentioned in #24 (comment), the geometry pairs for benchmarking spatial predicates are identical geometries. This may not reflect the typical real-world workload where predicates are usually applied to a pair of different objects and have high selectivity.
This patch generates benchmark data containing pair of non-identical geometries. The rate of intersected pairs is near 2%, which mimics the workload of high-selectivity spatial filters.