Replace handwritten PCA with sklearn by jucor · Pull Request #2416 · compdemocracy/polis

jucor · 2026-03-05T13:05:44Z

Summary

Stacked on #2415 (comparer improvements). Please review and merge #2413 → #2414 → #2415 first.
Stack: #2413 (NaN fix) → #2414 (test infra) → #2415 (comparer) → This PR → #2417 (test cleanup)

The main deliverable of the #2311 split. Replaces ~900 lines of hand-rolled power iteration PCA with ~50 lines using sklearn's SVD-based PCA.

Add PCA benchmark script for comparing implementations
Replace `powerit_pca` with `sklearn.decomposition.PCA`
Replace manual projection with `fit_transform`
Remove unused `start_vectors` argument and `wrapped_pca` function
Update golden snapshots to reflect sklearn PCA output
Fix PCA-related warnings in `test_conversation.py`
Address Copilot review feedback (division-by-zero fix, type annotations)

Net deletion: ~1000+ lines of hand-rolled linear algebra replaced by battle-tested sklearn.

Reviewer note

The golden record JSON diffs (~15k lines) can be skipped — they are the expected output changes from switching to sklearn PCA. The projection metrics from #2415 validate equivalence. Focus review on `pca.py`.

Test plan

198 passed, 7 skipped, 2 xfailed (17 fewer tests vs previous PR because obsolete power-iteration-specific tests were removed)

🤖 Generated with Claude Code

ballPointPenguin

NOTE: pca.py’s module docstring still describes the deleted handwritten PCA implementation, even though this PR replaces it with sklearn.

NOTE: the new benchmark helper’s usage examples assume ../.venv/bin/python, which doesn’t match the repo’s documented/canonical delphi-env workflow in delphi/Makefile

-> we should pick either delphi-env/bin/python OR .venv/bin/python and use that everywhere

otherwise: looks good, tests pass. Nice cleanup 🧽

Note: we will update the golden records in a later commit.

I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.

It's not needed anymore: the PCA is done by Sklearn, and the edge cases are already handled by wrapped_pca caller, i.e. pca_project_dataframe. Less code, less complexity, less maintenance.

Re-recorded golden snapshots using the new sklearn SVD-based PCA. These snapshots now serve as the baseline for regression tests.

Tests that only verify sorting/moderation now use recompute=False to skip unnecessary PCA computation. Manager tests that need PCA but use minimal data have targeted filterwarnings with explanations. Changes: - Add recompute=False to sorting tests (no PCA needed for ID ordering) - Add recompute=False to test_moderation (only checks filtering) - Add @pytest.mark.filterwarnings to TestConversationManager tests that use minimal data through the manager interface 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix division by zero when participant has no votes (pca.py) Mirrors Clojure's (max n-votes 1) approach - Fix return type annotation for _log_projection_metrics (comparer.py) - Add explanatory comment for intentional except pass (comparer.py) - Remove unused variable in test (test_regression.py) - Add unit test for no-votes edge case (test_pca_unit.py) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use pca.mean_ instead of manual centering (sklearn handles centering internally) - Replace np.random.seed(42) with random_state=42 on PCA constructor - Update docstring to reflect sklearn PCA + column-mean imputation - Fix _pca_sign_flips type annotation to Dict[str, Dict[int, int]] - Scope outlier_fraction to PCA-related paths only (non-PCA data stays strict) - Add try/except around symlink creation for cross-platform robustness - Remove unused normalize_vector/proj_vec test helpers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- pca.py: describe sklearn wrapper instead of deleted power iteration impl - bench_pca.py: use tool-agnostic `python` instead of hardcoded venv path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jucor · 2026-03-10T11:54:09Z

Thanks @ballPointPenguin !

docstring in pca.py modified
good catch on the venv. I've removed it from the docstring, using just the venv-agnostic "python -m"

As to what should be a "canonical" venv place, in a way that would be "least surprising" to most python developers likely to work on delphi, or even might be tool-agnostic, this deserves a discussion of its own, and eventually its own PR.

github-actions · 2026-03-10T11:58:41Z

Delphi Coverage Report

File	Stmts	Miss	Cover
init.py	3	0	100%
main.py	55	55	0%
benchmarks/bench_pca.py	76	76	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/init.py	2	0	100%
components/config.py	165	133	19%
components/server.py	116	72	38%
conversation/init.py	2	0	100%
conversation/conversation.py	1036	353	66%
conversation/manager.py	131	42	68%
database/init.py	1	0	100%
database/dynamodb.py	387	234	40%
database/postgres.py	306	205	33%
pca_kmeans_rep/init.py	5	0	100%
pca_kmeans_rep/clusters.py	234	8	97%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	50	15	70%
pca_kmeans_rep/repness.py	361	48	87%
pca_kmeans_rep/stats.py	107	22	79%
poller.py	224	188	16%
regression/init.py	4	0	100%
regression/comparer.py	887	406	54%
regression/datasets.py	95	21	78%
regression/recorder.py	36	27	25%
regression/utils.py	137	38	72%
run_math_pipeline.py	260	239	8%
system.py	85	55	35%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	54	52%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	787	787	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/init.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/init.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/init.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	110	110	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/init.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/init.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	585	477	18%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	63	41	35%
Total	11105	7736	30%

ballPointPenguin approved these changes Mar 8, 2026

View reviewed changes

jucor force-pushed the jc/pca_comparer branch from 2798a2f to 503762b Compare March 10, 2026 11:12

jucor force-pushed the jc/pca_sklearn branch from 47e6bc4 to 70e7dea Compare March 10, 2026 11:12

Base automatically changed from jc/pca_comparer to edge March 10, 2026 11:25

jucor and others added 9 commits March 10, 2026 11:27

Add PCA benchmark script

89b76a7

Replace handwritten PCA by SKLearn's implementation

ec8c338

Note: we will update the golden records in a later commit.

Replace projection by Sklearn's fit_transform for better performance.

02b4446

I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.

Remove unused start_vectors argument

d83d778

Remove function wrapped_pca

e8d773d

It's not needed anymore: the PCA is done by Sklearn, and the edge cases are already handled by wrapped_pca caller, i.e. pca_project_dataframe. Less code, less complexity, less maintenance.

Update biodiversity and vw golden snapshots with sklearn PCA

585fc64

Re-recorded golden snapshots using the new sklearn SVD-based PCA. These snapshots now serve as the baseline for regression tests.

jucor force-pushed the jc/pca_sklearn branch from 70e7dea to 33a9749 Compare March 10, 2026 11:27

Address review feedback: update stale docstrings

553dcef

- pca.py: describe sklearn wrapper instead of deleted power iteration impl - bench_pca.py: use tool-agnostic `python` instead of hardcoded venv path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jucor merged commit 59f7f95 into edge Mar 10, 2026
4 checks passed

jucor deleted the jc/pca_sklearn branch March 10, 2026 11:54

jucor restored the jc/pca_sklearn branch March 19, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace handwritten PCA with sklearn#2416

Replace handwritten PCA with sklearn#2416
jucor merged 10 commits intoedgefrom
jc/pca_sklearn

jucor commented Mar 5, 2026 •

edited

Loading

Uh oh!

ballPointPenguin left a comment

Uh oh!

jucor commented Mar 10, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jucor commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reviewer note

Test plan

Uh oh!

ballPointPenguin left a comment

Choose a reason for hiding this comment

Uh oh!

jucor commented Mar 10, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 10, 2026

Delphi Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jucor commented Mar 5, 2026 •

edited

Loading