Skip to content

Conversation

@jucor
Copy link
Collaborator

@jucor jucor commented Nov 27, 2025

Work in progress on speeding up Repness computation.

To be merged only after #2313 , which is merged into this PR

Vectorized the whole repness computation, and switched it to work on a dataframe representing the actually-sparse matrix of votes rather than the full matrix.

On BG2050: down from 26 seconds to 2.2 seconds :)

At the moment there's still a conversion step from full matrix to sparse representation, but this lays the preliminary work to move everything except the PCA to sparse representation.

The statistics are now expressed in Data frame operations, much easier to understand for volunteers from the data science, machine learning, and statistic communities.

And because the NamedMatrix had been replaced by a DataFrame in #2282 , we removed tons of index-juggling code, directly using the participants and comment IDs.

The PR looks like it adds a thousand lines, but most of those are extra tests and a benchmark script.

I've kept wrappers for the old single-group non-vectorized version, for now, but I recommend we nuke them eventually, to avoid cruft.

jucor and others added 13 commits November 26, 2025 13:04
Replace the row-by-row for-loop in update_votes with a vectorized
pivot_table approach. This dramatically speeds up vote loading for
large datasets.

Performance on bg2050 dataset (1M+ votes, 7.8k participants, 7.7k comments):
- Before: 18.5s average, 56k votes/sec
- After:  3.5s average, 295k votes/sec
- Speedup: 5.3x overall, 16x for the batch update step

The optimization:
1. Use pivot_table to reshape long-form votes to wide-form matrix
2. Use DataFrame.where() to merge with existing matrix
3. Use float32 for intermediate matrix to halve memory usage

Also adds a benchmark script at polismath/benchmarks/bench_update_votes.py
for measuring update_votes performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- _compute_vote_stats: Replace per-row/per-column loops with numpy
  vectorized operations using boolean masks and axis-based sums.
  This eliminates O(rows + cols) Python loops.

- bench_update_votes.py: Make standalone by accepting CSV path directly
  instead of depending on datasets package. Add TODO for using datasets
  package once PR compdemocracy#2312 is merged.

Combined with pivot_table optimization, achieves ~10x speedup on bg2050
dataset (1M votes): 18-30s -> 2.5s (~400k votes/sec).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Addresses GitHub Copilot review comments on PR compdemocracy#2313:
- Removed float32 conversion that only provided temporary memory savings
- The comment was misleading as savings didn't persist after .where()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replace iterrows() with rename() + to_dict('records') for efficiency,
as suggested by GitHub Copilot review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Convert wide-format vote matrix to long-format using melt() and use
vectorized pandas groupby operations instead of nested loops.

Key changes:
- Add compute_group_comment_stats_df() for vectorized (group, comment) stats
- Add prop_test_vectorized() and two_prop_test_vectorized() for batch z-tests
- Add select_rep_comments_df() and select_consensus_comments_df() for
  DataFrame-native selection, converting to dicts only at the end
- Compute "other" stats as total - group instead of recalculating
- Use MultiIndex.from_product() to ensure all (group, comment) combinations

Test changes:
- Add test_old_format_repness.py to preserve backwards compatibility tests
- Add TestVectorizedFunctions class with 8 tests for new DataFrame interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@jucor jucor changed the title speed up repness Speed up repness 11x Nov 27, 2025
@jucor jucor requested a review from Copilot November 27, 2025 14:14
Copilot finished reviewing on behalf of jucor November 27, 2025 14:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces significant performance optimizations to the representativeness (repness) computation, achieving an 11x speedup through vectorization. The changes transform the codebase from row-by-row iteration to vectorized pandas/numpy operations, while maintaining backward compatibility and adding comprehensive testing and benchmarking infrastructure.

Key Changes:

  • Vectorized core repness computation using DataFrame-native operations with long-format data transformations
  • Optimized vote matrix updates using pivot_table instead of row-by-row iteration (5x speedup)
  • Added comprehensive test suite for vectorized functions and backward compatibility verification
  • Introduced benchmarking infrastructure with profiling support for performance monitoring

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
delphi/polismath/pca_kmeans_rep/repness.py Core refactor: added vectorized functions (prop_test_vectorized, two_prop_test_vectorized, compute_group_comment_stats_df) and refactored conv_repness to use long-format DataFrames for 11x speedup
delphi/polismath/conversation/conversation.py Optimized update_votes with pivot_table approach and vectorized _compute_vote_stats; added timing logs for PCA, clustering, and repness computation
delphi/polismath/utils/general.py Removed deprecated helper functions (agree, disagree, pass_vote) and added vote encoding constants (AGREE, DISAGREE, PASS)
delphi/tests/test_repness_unit.py Added comprehensive unit tests for new vectorized functions with edge case coverage and comparison against scalar implementations
delphi/tests/test_old_format_repness.py New test file ensuring backward compatibility of the old single-group/single-comment API that wraps the vectorized implementation
delphi/polismath/benchmarks/benchmark_utils.py New utility module for benchmark scripts with CSV loading, dataset name extraction, and benchmark execution helpers
delphi/polismath/benchmarks/bench_update_votes.py Benchmark script for measuring update_votes performance with throughput reporting
delphi/polismath/benchmarks/bench_repness.py Benchmark script for repness computation with optional line profiler support
delphi/polismath/benchmarks/__init__.py Package initialization for benchmark module
delphi/pyproject.toml Added line_profiler as optional dev dependency for performance profiling
Comments suppressed due to low confidence (2)

delphi/tests/test_old_format_repness.py:8

  • Import of 'pytest' is not used.
import pytest

delphi/tests/test_old_format_repness.py:13

  • Import of 'math' is not used.
import math

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jucor jucor marked this pull request as ready for review November 27, 2025 15:50
@jucor jucor requested a review from Copilot November 27, 2025 15:50
Copilot finished reviewing on behalf of jucor November 27, 2025 15:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

delphi/polismath/pca_kmeans_rep/repness.py:15

  • Import of 'PASS' is not used.
from polismath.utils.general import AGREE, DISAGREE, PASS

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jucor
Copy link
Collaborator Author

jucor commented Nov 27, 2025

@copilot open a new pull request to apply changes based on the comments in this thread

@jucor jucor closed this Nov 27, 2025
@jucor jucor reopened this Nov 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants