Speed up repness 11x #2316
base: edge
Conversation
Replace the row-by-row for-loop in update_votes with a vectorized pivot_table approach. This dramatically speeds up vote loading for large datasets.

Performance on bg2050 dataset (1M+ votes, 7.8k participants, 7.7k comments):
- Before: 18.5s average, 56k votes/sec
- After: 3.5s average, 295k votes/sec
- Speedup: 5.3x overall, 16x for the batch update step

The optimization:
1. Use pivot_table to reshape long-form votes into a wide-form matrix
2. Use DataFrame.where() to merge with the existing matrix
3. Use float32 for the intermediate matrix to halve memory usage

Also adds a benchmark script at polismath/benchmarks/bench_update_votes.py for measuring update_votes performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
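A minimal sketch of the pivot_table + where() merge described in the commit above. The column names (`voter_id`, `comment_id`, `vote`), the `aggfunc="last"` de-duplication, and the merge direction are assumptions for illustration, not the actual code in conversation.py:

```python
import pandas as pd

def merge_votes_sketch(existing: pd.DataFrame, new_votes: pd.DataFrame) -> pd.DataFrame:
    """Fold a long-form batch of votes into a wide participant x comment matrix."""
    # Long -> wide in one shot; "last" keeps the most recent vote if a
    # participant voted twice on the same comment within the batch.
    incoming = new_votes.pivot_table(
        index="voter_id", columns="comment_id", values="vote", aggfunc="last"
    )
    # Align both frames on the union of participants and comments seen so far.
    rows = existing.index.union(incoming.index)
    cols = existing.columns.union(incoming.columns)
    existing = existing.reindex(index=rows, columns=cols)
    incoming = incoming.reindex(index=rows, columns=cols)
    # Keep the new vote where the batch has one, otherwise the old value.
    return incoming.where(incoming.notna(), existing)
```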
- _compute_vote_stats: Replace per-row/per-column loops with numpy vectorized operations using boolean masks and axis-based sums. This eliminates O(rows + cols) Python loops.
- bench_update_votes.py: Make standalone by accepting a CSV path directly instead of depending on the datasets package. Add TODO for using the datasets package once PR compdemocracy#2312 is merged.

Combined with the pivot_table optimization, achieves ~10x speedup on the bg2050 dataset (1M votes): 18-30s -> 2.5s (~400k votes/sec).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
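A rough sketch of the boolean-mask / axis-sum style the first bullet describes. The stat names, the matrix orientation (participants as rows, comments as columns), and the 1/-1/0 vote encoding are assumptions, not the real _compute_vote_stats():

```python
import numpy as np

def vote_stats_sketch(matrix: np.ndarray) -> dict:
    """Per-participant and per-comment tallies via masks and axis sums.

    Assumes rows = participants, columns = comments, NaN = no vote,
    and votes encoded as 1 (agree), -1 (disagree), 0 (pass).
    """
    voted = ~np.isnan(matrix)      # True wherever a vote was cast
    agreed = matrix == 1           # NaN compares False, so the masks stay clean
    disagreed = matrix == -1
    return {
        "votes_per_participant": voted.sum(axis=1),   # row sums
        "votes_per_comment": voted.sum(axis=0),       # column sums
        "agrees_per_comment": agreed.sum(axis=0),
        "disagrees_per_comment": disagreed.sum(axis=0),
    }
```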
Addresses GitHub Copilot review comments on PR compdemocracy#2313:
- Removed float32 conversion that only provided temporary memory savings
- The comment was misleading, as the savings didn't persist after .where()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replace iterrows() with rename() + to_dict('records') for efficiency,
as suggested by GitHub Copilot review.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
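For context, a generic before/after of the iterrows-to-to_dict('records') change mentioned in the commit above; the DataFrame and column names are placeholders, not the ones in the PR:

```python
import pandas as pd

df = pd.DataFrame({"pid": [1, 2], "n_votes": [10, 12]})

# Before: build dicts row by row with iterrows() (slow; allocates a Series per row).
records_slow = [
    {"participant": row["pid"], "votes": row["n_votes"]} for _, row in df.iterrows()
]

# After: rename the columns once, then emit every row as a dict in a single call.
records_fast = df.rename(columns={"pid": "participant", "n_votes": "votes"}).to_dict("records")
```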
Convert the wide-format vote matrix to long format using melt() and use vectorized pandas groupby operations instead of nested loops.

Key changes:
- Add compute_group_comment_stats_df() for vectorized (group, comment) stats
- Add prop_test_vectorized() and two_prop_test_vectorized() for batch z-tests
- Add select_rep_comments_df() and select_consensus_comments_df() for DataFrame-native selection, converting to dicts only at the end
- Compute "other" stats as total - group instead of recalculating
- Use MultiIndex.from_product() to ensure all (group, comment) combinations

Test changes:
- Add test_old_format_repness.py to preserve backwards compatibility tests
- Add TestVectorizedFunctions class with 8 tests for the new DataFrame interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
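A condensed sketch of the melt-and-groupby pattern this commit describes. The column names, the 1/-1 vote encoding, and the exact stats are assumptions for illustration, not the functions added in repness.py:

```python
import pandas as pd

def group_comment_stats_sketch(vote_matrix: pd.DataFrame, group_of: pd.Series) -> pd.DataFrame:
    """Count votes/agrees/disagrees per (group, comment) without Python loops.

    Assumptions: vote_matrix is participants x comments with NaN for missing
    votes, group_of maps participant id -> cluster id, and agree/disagree are
    encoded as 1/-1 (the real encoding lives in polismath.utils.general).
    """
    # Wide -> long: one row per cast vote.
    long = (
        vote_matrix.rename_axis(index="pid", columns="cid")
        .reset_index()
        .melt(id_vars="pid", var_name="cid", value_name="vote")
        .dropna(subset=["vote"])
    )
    long["gid"] = long["pid"].map(group_of)
    long["agree"] = (long["vote"] == 1).astype(int)
    long["disagree"] = (long["vote"] == -1).astype(int)

    # One groupby pass for all (group, comment) counts.
    stats = long.groupby(["gid", "cid"]).agg(
        n=("vote", "size"), agrees=("agree", "sum"), disagrees=("disagree", "sum")
    )

    # Make sure every (group, comment) pair exists, even when nobody voted.
    full_index = pd.MultiIndex.from_product(
        [sorted(long["gid"].unique()), vote_matrix.columns], names=["gid", "cid"]
    )
    stats = stats.reindex(full_index, fill_value=0)

    # "Other" stats (everyone outside the group) as total - group, no recount.
    totals = stats.groupby(level="cid").sum()
    other = totals.loc[stats.index.get_level_values("cid")].set_axis(stats.index) - stats
    return stats.join(other.add_prefix("other_"))
```

This keeps the whole computation as column arithmetic; the real compute_group_comment_stats_df() presumably differs in details (pass counts, smoothing), so treat this as a shape sketch only.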
Pull request overview
This PR introduces significant performance optimizations to the representativeness (repness) computation, achieving an 11x speedup through vectorization. The changes transform the codebase from row-by-row iteration to vectorized pandas/numpy operations, while maintaining backward compatibility and adding comprehensive testing and benchmarking infrastructure.
Key Changes:
- Vectorized core repness computation using DataFrame-native operations with long-format data transformations
- Optimized vote matrix updates using pivot_table instead of row-by-row iteration (5x speedup)
- Added comprehensive test suite for vectorized functions and backward compatibility verification
- Introduced benchmarking infrastructure with profiling support for performance monitoring
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| delphi/polismath/pca_kmeans_rep/repness.py | Core refactor: added vectorized functions (prop_test_vectorized, two_prop_test_vectorized, compute_group_comment_stats_df) and refactored conv_repness to use long-format DataFrames for 11x speedup |
| delphi/polismath/conversation/conversation.py | Optimized update_votes with pivot_table approach and vectorized _compute_vote_stats; added timing logs for PCA, clustering, and repness computation |
| delphi/polismath/utils/general.py | Removed deprecated helper functions (agree, disagree, pass_vote) and added vote encoding constants (AGREE, DISAGREE, PASS) |
| delphi/tests/test_repness_unit.py | Added comprehensive unit tests for new vectorized functions with edge case coverage and comparison against scalar implementations |
| delphi/tests/test_old_format_repness.py | New test file ensuring backward compatibility of the old single-group/single-comment API that wraps the vectorized implementation |
| delphi/polismath/benchmarks/benchmark_utils.py | New utility module for benchmark scripts with CSV loading, dataset name extraction, and benchmark execution helpers |
| delphi/polismath/benchmarks/bench_update_votes.py | Benchmark script for measuring update_votes performance with throughput reporting |
| delphi/polismath/benchmarks/bench_repness.py | Benchmark script for repness computation with optional line profiler support |
| delphi/polismath/benchmarks/__init__.py | Package initialization for benchmark module |
| delphi/pyproject.toml | Added line_profiler as optional dev dependency for performance profiling |
Comments suppressed due to low confidence (2)
delphi/tests/test_old_format_repness.py:8
- Import of 'pytest' is not used.
import pytest
delphi/tests/test_old_format_repness.py:13
- Import of 'math' is not used.
import math
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
delphi/polismath/pca_kmeans_rep/repness.py:15
- Import of 'PASS' is not used.
from polismath.utils.general import AGREE, DISAGREE, PASS
Co-authored-by: Copilot <[email protected]>
@copilot open a new pull request to apply changes based on the comments in this thread
Work in progress on speeding up Repness computation.
To be merged only after #2313, which is merged into this PR.
Vectorized the whole repness computation, and switched it to work on a dataframe representing the actually-sparse matrix of votes rather than the full matrix.
On BG2050: down from 26 seconds to 2.2 seconds :)
At the moment there's still a conversion step from full matrix to sparse representation, but this lays the preliminary work to move everything except the PCA to sparse representation.
The statistics are now expressed as DataFrame operations, which are much easier to understand for volunteers from the data science, machine learning, and statistics communities.
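To make the "DataFrame operations" point concrete, here is a minimal, hedged sketch of a batched two-proportion z-test computed for every comment at once; the function name, inputs, and the lack of smoothing are assumptions, not the actual two_prop_test_vectorized() in repness.py:

```python
import numpy as np
import pandas as pd

def two_prop_z_sketch(succ_a: pd.Series, n_a: pd.Series,
                      succ_b: pd.Series, n_b: pd.Series) -> pd.Series:
    """Textbook two-proportion z statistic, broadcast across all comments.

    succ_a/n_a: agree count and vote count per comment inside a group;
    succ_b/n_b: the same for everyone outside the group. All four Series
    share the same comment-id index, so the arithmetic needs no loop.
    """
    p_a = succ_a / n_a
    p_b = succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

The real implementation presumably adds pseudocounts to avoid division by zero for comments with no votes; that detail is omitted here.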
And because the NamedMatrix had been replaced by a DataFrame in #2282, we removed tons of index-juggling code and now use the participant and comment IDs directly.
The PR looks like it adds a thousand lines, but most of those are extra tests and a benchmark script.
I've kept wrappers for the old single-group non-vectorized version, for now, but I recommend we nuke them eventually, to avoid cruft.