Skip to content

Replace handwritten PCA with sklearn#2416

Merged
jucor merged 10 commits intoedgefrom
jc/pca_sklearn
Mar 10, 2026
Merged

Replace handwritten PCA with sklearn#2416
jucor merged 10 commits intoedgefrom
jc/pca_sklearn

Conversation

@jucor
Copy link
Copy Markdown
Collaborator

@jucor jucor commented Mar 5, 2026

Summary

Stacked on #2415 (comparer improvements). Please review and merge #2413#2414#2415 first.
Stack: #2413 (NaN fix) → #2414 (test infra) → #2415 (comparer) → This PR#2417 (test cleanup)

The main deliverable of the #2311 split. Replaces ~900 lines of hand-rolled power iteration PCA with ~50 lines using sklearn's SVD-based PCA.

  • Add PCA benchmark script for comparing implementations
  • Replace `powerit_pca` with `sklearn.decomposition.PCA`
  • Replace manual projection with `fit_transform`
  • Remove unused `start_vectors` argument and `wrapped_pca` function
  • Update golden snapshots to reflect sklearn PCA output
  • Fix PCA-related warnings in `test_conversation.py`
  • Address Copilot review feedback (division-by-zero fix, type annotations)

Net deletion: ~1000+ lines of hand-rolled linear algebra replaced by battle-tested sklearn.

Reviewer note

The golden record JSON diffs (~15k lines) can be skipped — they are the expected output changes from switching to sklearn PCA. The projection metrics from #2415 validate equivalence. Focus review on `pca.py`.

Test plan

  • 198 passed, 7 skipped, 2 xfailed (17 fewer tests vs previous PR because obsolete power-iteration-specific tests were removed)

🤖 Generated with Claude Code

Copy link
Copy Markdown
Member

@ballPointPenguin ballPointPenguin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: pca.py’s module docstring still describes the deleted handwritten PCA implementation, even though this PR replaces it with sklearn.

NOTE: the new benchmark helper’s usage examples assume ../.venv/bin/python, which doesn’t match the repo’s documented/canonical delphi-env workflow in delphi/Makefile

-> we should pick either delphi-env/bin/python OR .venv/bin/python and use that everywhere

otherwise: looks good, tests pass. Nice cleanup 🧽

@jucor jucor force-pushed the jc/pca_comparer branch from 2798a2f to 503762b Compare March 10, 2026 11:12
Base automatically changed from jc/pca_comparer to edge March 10, 2026 11:25
jucor and others added 9 commits March 10, 2026 11:27
Note: we will update the golden records in a later commit.
I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.
It's not needed anymore: the PCA is done by Sklearn, and the edge cases
are already handled by wrapped_pca caller, i.e. pca_project_dataframe.

Less code, less complexity, less maintenance.
Re-recorded golden snapshots using the new sklearn SVD-based PCA.
These snapshots now serve as the baseline for regression tests.
Tests that only verify sorting/moderation now use recompute=False to
skip unnecessary PCA computation. Manager tests that need PCA but use
minimal data have targeted filterwarnings with explanations.

Changes:
- Add recompute=False to sorting tests (no PCA needed for ID ordering)
- Add recompute=False to test_moderation (only checks filtering)
- Add @pytest.mark.filterwarnings to TestConversationManager tests
  that use minimal data through the manager interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix division by zero when participant has no votes (pca.py)
  Mirrors Clojure's (max n-votes 1) approach
- Fix return type annotation for _log_projection_metrics (comparer.py)
- Add explanatory comment for intentional except pass (comparer.py)
- Remove unused variable in test (test_regression.py)
- Add unit test for no-votes edge case (test_pca_unit.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use pca.mean_ instead of manual centering (sklearn handles centering internally)
- Replace np.random.seed(42) with random_state=42 on PCA constructor
- Update docstring to reflect sklearn PCA + column-mean imputation
- Fix _pca_sign_flips type annotation to Dict[str, Dict[int, int]]
- Scope outlier_fraction to PCA-related paths only (non-PCA data stays strict)
- Add try/except around symlink creation for cross-platform robustness
- Remove unused normalize_vector/proj_vec test helpers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pca.py: describe sklearn wrapper instead of deleted power iteration impl
- bench_pca.py: use tool-agnostic `python` instead of hardcoded venv path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jucor
Copy link
Copy Markdown
Collaborator Author

jucor commented Mar 10, 2026

Thanks @ballPointPenguin !

  1. docstring in pca.py modified
  2. good catch on the venv. I've removed it from the docstring, using just the venv-agnostic "python -m"

As to what should be a "canonical" venv place, in a way that would be "least surprising" to most python developers likely to work on delphi, or even might be tool-agnostic, this deserves a discussion of its own, and eventually its own PR.

@jucor jucor merged commit 59f7f95 into edge Mar 10, 2026
4 checks passed
@jucor jucor deleted the jc/pca_sklearn branch March 10, 2026 11:54
@github-actions
Copy link
Copy Markdown

Delphi Coverage Report

File Stmts Miss Cover
init.py 3 0 100%
main.py 55 55 0%
benchmarks/bench_pca.py 76 76 0%
benchmarks/bench_repness.py 81 81 0%
benchmarks/bench_update_votes.py 38 38 0%
benchmarks/benchmark_utils.py 34 34 0%
components/init.py 2 0 100%
components/config.py 165 133 19%
components/server.py 116 72 38%
conversation/init.py 2 0 100%
conversation/conversation.py 1036 353 66%
conversation/manager.py 131 42 68%
database/init.py 1 0 100%
database/dynamodb.py 387 234 40%
database/postgres.py 306 205 33%
pca_kmeans_rep/init.py 5 0 100%
pca_kmeans_rep/clusters.py 234 8 97%
pca_kmeans_rep/corr.py 98 17 83%
pca_kmeans_rep/pca.py 50 15 70%
pca_kmeans_rep/repness.py 361 48 87%
pca_kmeans_rep/stats.py 107 22 79%
poller.py 224 188 16%
regression/init.py 4 0 100%
regression/comparer.py 887 406 54%
regression/datasets.py 95 21 78%
regression/recorder.py 36 27 25%
regression/utils.py 137 38 72%
run_math_pipeline.py 260 239 8%
system.py 85 55 35%
umap_narrative/500_generate_embedding_umap_cluster.py 210 109 48%
umap_narrative/501_calculate_comment_extremity.py 112 54 52%
umap_narrative/502_calculate_priorities.py 135 135 0%
umap_narrative/700_datamapplot_for_layer.py 502 502 0%
umap_narrative/701_static_datamapplot_for_layer.py 310 310 0%
umap_narrative/702_consensus_divisive_datamapplot.py 432 432 0%
umap_narrative/801_narrative_report_batch.py 787 787 0%
umap_narrative/802_process_batch_results.py 265 265 0%
umap_narrative/803_check_batch_status.py 175 175 0%
umap_narrative/llm_factory_constructor/init.py 2 2 0%
umap_narrative/llm_factory_constructor/model_provider.py 157 157 0%
umap_narrative/polismath_commentgraph/init.py 1 0 100%
umap_narrative/polismath_commentgraph/cli.py 270 270 0%
umap_narrative/polismath_commentgraph/core/init.py 3 3 0%
umap_narrative/polismath_commentgraph/core/clustering.py 110 110 0%
umap_narrative/polismath_commentgraph/core/embedding.py 104 104 0%
umap_narrative/polismath_commentgraph/lambda_handler.py 219 219 0%
umap_narrative/polismath_commentgraph/schemas/init.py 2 0 100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py 160 9 94%
umap_narrative/polismath_commentgraph/tests/conftest.py 17 17 0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py 74 74 0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py 55 55 0%
umap_narrative/polismath_commentgraph/tests/test_storage.py 87 87 0%
umap_narrative/polismath_commentgraph/utils/init.py 3 0 100%
umap_narrative/polismath_commentgraph/utils/converter.py 283 237 16%
umap_narrative/polismath_commentgraph/utils/group_data.py 354 336 5%
umap_narrative/polismath_commentgraph/utils/storage.py 585 477 18%
umap_narrative/reset_conversation.py 159 50 69%
umap_narrative/run_pipeline.py 453 312 31%
utils/general.py 63 41 35%
Total 11105 7736 30%

@jucor jucor restored the jc/pca_sklearn branch March 19, 2026 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants