
60 smart meter analysis pricing simulation #61

Open
griffinsharps wants to merge 149 commits into main from
60-smart-meter-analysis-pricing-simulation

Conversation

@griffinsharps
Contributor

No description provided.

Griffin Sharps and others added 23 commits January 26, 2026 23:22
Co-authored-by: Cursor <cursoragent@cursor.com>
…onth validator

Enhance validate_month_output.py with three preflight checks needed before
scaling to full-month execution:
- Duplicate (zip_code, account_identifier, datetime) detection per batch file
- Row count reporting (total + per-file) in validation report JSON
- Run artifact integrity via --run-dir flag (plan.json, run_summary.json,
  manifests, batch summaries)

Add PREFLIGHT_200.md checklist for 200-file EC2 validation run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
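The duplicate-key preflight described above reduces to counting composite (zip_code, account_identifier, datetime) keys within each batch file. A minimal stand-in in plain Python — the real check reads parquet via Polars, and the values below are illustrative:

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return composite keys that appear more than once in one batch file.

    `rows` stands in for the (zip_code, account_identifier, datetime)
    tuples read from a batch parquet; the values are illustrative.
    """
    counts = Counter(rows)
    return sorted(key for key, n in counts.items() if n > 1)

batch = [
    ("60601", "acct-1", "2023-07-01T00:00"),
    ("60601", "acct-1", "2023-07-01T00:00"),  # duplicate key
    ("60601", "acct-2", "2023-07-01T00:00"),
]
dups = find_duplicate_keys(batch)
```

Running this per batch file (rather than across the whole month) keeps the counter bounded by one file's row count.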
…s_sorted()

Polars 1.38 removed is_sorted() from Expr. Collect the composite key first,
then check sortedness on the resulting Series, which retains the method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
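The workaround amounts to collecting first, then checking order on the materialized values. A plain-Python model of the check (the real code collects the composite key with Polars and calls `Series.is_sorted()` on the result):

```python
def composite_keys_sorted(keys):
    """Non-decreasing check over already-collected composite key tuples.

    Models the post-1.38 pattern: since Expr.is_sorted() is gone,
    collect the key column first (e.g. lf.select(key).collect()
    .to_series()) and check sortedness on the Series/values you get back.
    """
    return all(a <= b for a, b in zip(keys, keys[1:]))

ok = composite_keys_sorted([("60601", 1), ("60601", 2), ("60602", 1)])
bad = composite_keys_sorted([("60602", 1), ("60601", 2)])
```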
Restores Justfile from main and adds migrate-month recipe.
Usage: just migrate-month 202307
- batch-size 100, workers 6, lazy_sink, --resume
- Reads ~/s3_paths_<YYYYMM>_full.txt, writes to /ebs/.../out_<YYYYMM>_production
- Uses bare python (no uv) for EC2 compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.gitignore: block *.txt, .tmp/, archive_quarantine/, tmp_polars_run_*/,
subagent_packages/ from being tracked.

pre-commit: add detect-private-key hook and a local forbid-secrets hook
that blocks .env, .secrets, credentials.json, .pem, .key, .p12, .pfx,
.jks files from being committed (even via git add -f).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hash-based duplicate detection (n_unique ~400MB/file) with
adjacent-key streaming that leverages the required global sort order.
Sortedness and uniqueness now share a single PyArrow iter_batches
pass in full mode.

Key changes:
- _streaming_sort_and_dup_check: combined sort+dup via PyArrow
  batch iteration, O(batch_size) memory, cross-file boundary state
- Per-file datetime stats with merge (_DtStats dataclass)
- Per-file DST stats with merge (_DstFileStats dataclass)
- Enhanced sample mode: strict-increasing check (catches dups in windows)
- Row counts from parquet metadata (O(1), no data scan)
- Phase-based main() architecture (discovery -> metadata -> streaming
  -> datetime -> DST -> artifacts -> report)
- _fail() typed as NoReturn for mypy narrowing
- Add pyarrow mypy override in pyproject.toml

Removed dead functions: _check_sorted_full, _validate_no_duplicates_file,
_validate_datetime_invariants_partition, _validate_dst_option_b_partition,
_keys_is_sorted_df

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
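The adjacent-key pass above can be sketched without PyArrow: iterate batches in order, carry only the previous key across batch and file boundaries, and flag equal keys (duplicates) or decreasing keys (order breaks) in the same pass. File names and key values here are illustrative:

```python
def scan_batches(files):
    """One streaming pass over sorted key batches.

    Flags duplicate keys and order breaks while holding only the last
    key as carry-over state, so memory stays O(batch_size) and the
    checks span file boundaries — the shape of the combined
    sort+duplicate check, minus the PyArrow iter_batches plumbing.
    """
    prev = None  # last key seen, carried across batches and files
    dups, breaks = [], []
    for fname, batches in files:
        for batch in batches:
            for key in batch:
                if prev is not None:
                    if key == prev:
                        dups.append((fname, key))
                    elif key < prev:
                        breaks.append((fname, key))
                prev = key
    return dups, breaks

files = [
    ("batch_0.parquet", [[("60601", "a", 1), ("60601", "a", 2)]]),
    ("batch_1.parquet", [[("60601", "a", 2)]]),  # dup across file boundary
]
dups, breaks = scan_batches(files)
```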
Two fixes in validate_month_output.py:

1. _slice_keys: use lf.collect() instead of streaming engine for slice
   reads — streaming may reorder rows, defeating sortedness validation.
   Slices are small (5K rows x 3 cols) so default engine is correct and fast.

2. _check_sorted_sample: track prev_end and only perform cross-slice
   boundary comparison when off >= prev_end (non-overlapping). Random
   windows can overlap head/tail/each other, making boundary checks
   invalid under overlap. Within-slice strict-monotonic checks still
   run unconditionally.

Also updates remaining collect(streaming=True) calls to
collect(engine="streaming") to fix Polars deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
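The overlap guard in fix 2 can be modeled over (offset, keys) windows: strict monotonicity is always enforced within a slice, but the cross-slice boundary comparison runs only when the next slice starts at or past the previous slice's end. A hedged sketch with illustrative integer keys:

```python
def check_sample_windows(windows):
    """Sampled sortedness check over (offset, keys) slices.

    Within each slice, keys must be strictly increasing. Across slices,
    the boundary comparison runs only when off >= prev_end, because
    overlapping random windows make boundary checks invalid.
    """
    prev_end = 0
    prev_last = None
    for off, keys in sorted(windows):
        for a, b in zip(keys, keys[1:]):
            if not a < b:          # within-slice check: unconditional
                return False
        if prev_last is not None and off >= prev_end:
            if not prev_last < keys[0]:  # boundary check: non-overlap only
                return False
        prev_end = max(prev_end, off + len(keys))
        prev_last = keys[-1]
    return True

# Second window overlaps the first, so its repeated boundary key (3)
# is correctly tolerated; the third window is disjoint and is checked.
ok = check_sample_windows([(0, [1, 2, 3]), (2, [3, 4, 5]), (10, [20, 21])])
```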
Restores Justfile from main and adds migrate-month YEAR_MONTH recipe:
- Guards against non-EC2 environments (checks /ebs mount)
- Auto-generates S3 input list via aws s3 ls + awk + sort
- Validates non-empty input list before running
- Runs migrate_month_runner.py with standard production params
  (batch-size 100, workers 6, --resume, lazy_sink)

Usage: just migrate-month 202307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Annotate migrate_month_runner.py, validate_month_output.py, and Justfile
with industry-standard "why" comments for senior code review. Additions
include module-level architecture docstrings, function-level design
rationale, and parameter tuning explanations. No logic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor migrate-month to use configurable variables (S3_PREFIX,
MIGRATE_OUT_BASE, etc.) instead of hardcoded bucket names and
usernames, preparing the repo for open-source. Add five recipes:
months-from-s3, migrate-months, validate-month, validate-months,
and migration-status. Multi-month recipes support fail-fast (default)
or continue-on-error mode with per-invocation UTC-timestamped logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hard-coded RTP/flat spread logic with dual tariff inputs
(--tariff-prices-a, --tariff-prices-b) that each use the standard
price_cents_per_kwh schema from build_tariff_hourly_prices.py.
Adds fail-loud join guards (null check + row-count check per tariff).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
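The fail-loud join guards amount to two assertions around the tariff join: no load hour may come back without a price (null guard), and the joined row count must equal the input row count (cardinality guard). A dict-based sketch — names and shapes are illustrative, not the actual script schema:

```python
def bill_with_tariff(load_rows, tariff_prices):
    """Fail-loud tariff join sketch.

    load_rows:      list of (hour_key, kwh) tuples.
    tariff_prices:  hour_key -> price_cents_per_kwh mapping.
    Raises instead of silently dropping or null-filling rows.
    """
    billed = []
    for hour, kwh in load_rows:
        price = tariff_prices.get(hour)
        if price is None:  # null guard: unmatched join key
            raise ValueError(f"no price_cents_per_kwh for hour {hour!r}")
        billed.append((hour, kwh, kwh * price / 100.0))
    if len(billed) != len(load_rows):  # cardinality guard
        raise ValueError("tariff join changed the row count")
    return billed

rows = bill_with_tariff([("00", 1.5), ("01", 2.0)],
                        {"00": 10.0, "01": 20.0})
```

In the real Polars join the cardinality guard matters more than it does here, since a many-to-many join can silently multiply rows.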
Generates one row per local hour in America/Chicago for a given year,
mapping each hour to its TOU season and period with the associated
price. Handles DST by keeping the first UTC occurrence of fall-back
duplicates. Validates full hour coverage per season.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
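The fall-back rule above ("keep the first UTC occurrence") can be demonstrated with stdlib zoneinfo: walking UTC hours across the 2023 fall-back in America/Chicago, the local hour 01:00 occurs twice, and only its first UTC instant is kept. The function is an illustrative reduction of the calendar build, not the script itself:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def local_hour_rows(start_utc, n_hours, tz="America/Chicago"):
    """One row per naive local hour; on fall-back, the duplicated local
    hour keeps its first UTC occurrence and later duplicates are dropped."""
    seen = set()
    rows = []
    for i in range(n_hours):
        utc = start_utc + timedelta(hours=i)
        local = utc.astimezone(ZoneInfo(tz)).replace(tzinfo=None)
        if local in seen:
            continue  # fall-back duplicate: first UTC occurrence wins
        seen.add(local)
        rows.append((local, utc))
    return rows

# 2023-11-05: clocks fall back at 2:00 CDT, so local 01:00 repeats.
start = datetime(2023, 11, 5, 5, 0, tzinfo=timezone.utc)  # 00:00 CDT
rows = local_hour_rows(start, 6)  # 6 UTC hours -> 5 distinct local hours
```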
Validates uniqueness and join coverage across the full chain:
interval data -> hourly loads -> tariff calendars -> household bills.
Synthetic tests exercise spring-forward, fall-back, and normal months;
sample-data tests validate the real 202308 artefacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
build_regression_dataset.py: maps household bills to Census block groups
via ZIP+4 crosswalk (deterministic 1:1), aggregates to BG-level outcomes,
and fits two OLS regressions (savings + bill diff ~ demographics). Supports
auto/core/explicit predictor modes and graceful outcome column fallback.

run_billing_pipeline.py: multi-month orchestrator that chains hourly loads,
tariff billing, annual aggregation, and regression via subprocess. Supports
--months/--months-file with {yyyymm} path patterns and writes per-month
outputs plus annual_household_aggregate.parquet with a full run manifest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
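The regression step reduces to ordinary least squares of a BG-level outcome on demographic predictors. A plain-numpy sketch under that assumption — the real script's predictor selection, column names, and fallback logic are not reproduced here:

```python
import numpy as np

def fit_ols(X, y):
    """Minimal OLS (outcome ~ predictors) with an intercept, via lstsq.

    Stands in for the savings / bill-diff regressions; returns the
    coefficient vector (intercept first) and R-squared.
    """
    A = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tss = float((y - y.mean()) @ (y - y.mean()))
    r2 = 1.0 - float(resid @ resid) / tss
    return coef, r2

# Synthetic exact-fit data: y = 2 + 3x, so OLS must recover (2, 3).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
coef, r2 = fit_ols(X, y)
```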
Annotate 12 files with inline comments explaining design decisions,
trade-offs, and non-obvious rationale for a senior reviewer. No logic
changes—comments only (+121/-12 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@griffinsharps griffinsharps linked an issue Feb 11, 2026 that may be closed by this pull request
Griffin Sharps and others added 6 commits February 16, 2026 22:26
Replace all references to annual_household_aggregate.parquet with
all_months_household_bills.parquet. Expand the test suite from 25 tests
to 93 tests covering:

- All-months bills: schema, YYYYMM month format, no nulls, set equality
  of months, additive totals matching per-month outputs
- Regression artifacts: existence of all 7 files (bg_month_outcomes,
  bg_annual_outcomes, bg_season_outcomes, regression_dataset_bg,
  regression_results.json, regression_summary.txt, regression_metadata.json)
- Schema assertions for all BG outcome parquets
- Mathematical invariants: pct_savings_weighted definition, annual rollup
  equals sum-of-months, season values and mapping, null handling
- Crosswalk coverage: n_zip4, n_bg, n_zip4_multi_bg, pct_dropped
- Regression results JSON: model keys, r_squared, coefficients with const
- Both regression modes (annual, bg_month) with schema consistency checks
- Skip-regression mode: no artifacts, manifest flags, bills still produced
- Manifest: all_months_bills_rows, steps_completed, regression_level

Also adds --regression-level pass-through to the orchestrator CLI so
both annual and bg_month modes can be tested end-to-end.

Test data is augmented with synthetic households at diverse ZIP+4 values
to ensure >= 6 census block groups for OLS regression coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
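The "additive totals" invariant in the list above — per-month bills must sum to the all-months totals, household by household — can be stated compactly. Dict shapes here are illustrative stand-ins for the per-month and all_months_household_bills parquets:

```python
from collections import defaultdict

def additive_totals_ok(per_month_bills, all_months_bills, tol=1e-9):
    """Check that each household's all-months total equals the sum of
    its per-month totals, and that the household sets match exactly.

    per_month_bills:  (household, yyyymm) -> bill total
    all_months_bills: household -> bill total
    """
    sums = defaultdict(float)
    for (household, _yyyymm), total in per_month_bills.items():
        sums[household] += total
    return set(sums) == set(all_months_bills) and all(
        abs(sums[h] - all_months_bills[h]) <= tol for h in sums
    )

ok = additive_totals_ok(
    {("h1", "202307"): 10.0, ("h1", "202308"): 5.0, ("h2", "202307"): 7.0},
    {"h1": 15.0, "h2": 7.0},
)
bad = additive_totals_ok({("h1", "202307"): 10.0}, {"h1": 11.0})
```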
…alidation

Co-authored-by: Cursor <cursoragent@cursor.com>
Griffin Sharps and others added 3 commits March 23, 2026 21:19
Add plotnine, compute_delivery_deltas, gspread, dotenv, bs4, and IPython
to the appropriate deptry ignore lists. plotnine is a notebook-only
dependency; the others are local scripts or transitive imports in lib/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update pct_save computations to use the actual parquet column names:
STOU → total_delta_dollars, DTOU → dtou_total_delta_dollars.

Fix regression extraction to filter on "rate" and "dep_var" (not
"rate_type"/"outcome"); hardcode stou_jan_mean_pct = 25.66 since
there is no mean_pct column in regression_summary.csv.

Update callout notes and prose to reflect the real schema.
@griffinsharps griffinsharps force-pushed the 60-smart-meter-analysis-pricing-simulation branch 2 times, most recently from 49ae9d9 to 473b47a Compare March 24, 2026 17:02
Griffin Sharps and others added 8 commits March 24, 2026 17:07
…ort targets

- Add _quarto.yml: manuscript project type, Switchbox theme, SVG figures
- Add references.bib with ICC Final Order and Order on Rehearing entries
- Add render/draft/clean targets to Justfile
- Fix bibliography path in index.qmd (../references.bib → references.bib)
- Fix typo in index.qmd: "chargesa" → "charges a"
@griffinsharps griffinsharps force-pushed the 60-smart-meter-analysis-pricing-simulation branch from 5924afc to d2ad44b Compare March 24, 2026 18:38
Max Shron (Switchbox) and others added 16 commits March 24, 2026 11:43
integrity (default): checks duplicates, order breaks, row counts, file
counts, dir size — skips acct_day_counts entirely for a fast ~1-2 hour
full pass. Outputs to /tmp/phase1_integrity_audit.{tsv,json,md}.

acct-day: only accumulates rows-per-acct-day stats. Accepts --months
flag to target specific YYYYMM values instead of all 49 months.
Outputs to /tmp/phase1_acct_day_audit.{tsv,json,md}.
… ~20M)

Replace within-batch numpy row-by-row comparisons with boundary-only checks:
compare only the first row of each batch against the last row of the previous
batch. Drops the numpy import entirely. The 202508 double-ingestion defect
shows up at file boundaries, so internal per-row scanning adds no detection
value. Expected speedup: 11+ min/month → <30 s/month.
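The boundary-only check above compares just two rows per batch: the first key of each batch against the last key of the one before it. A sketch with illustrative integer keys (the real audit compares composite key rows read from parquet batches):

```python
def boundary_issues(batches):
    """Boundary-only audit over ordered key batches.

    An equal key at a boundary signals double ingestion (the 202508
    defect pattern); a decrease signals an order break. Interior rows
    are never scanned, which is what buys the speedup.
    """
    issues = []
    prev_last = None
    for i, batch in enumerate(batches):
        if not batch:
            continue
        if prev_last is not None:
            if batch[0] == prev_last:
                issues.append((i, "duplicate"))
            elif batch[0] < prev_last:
                issues.append((i, "order_break"))
        prev_last = batch[-1]  # only the last key is retained
    return issues

issues = boundary_issues([[1, 2, 3], [3, 4], [2, 5]])
```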
Scans all out_*_production dirs, skips already-compacted months (no
batch_* files), and runs compact_month_output.run_compaction() on each.
Supports --dry-run (default), --execute, and --months filter flags.
…ction

batch_*.parquet was the only pattern matched; compacted months use part-*.parquet.
Switch both audit_month_integrity and audit_month_acct_day to *.parquet so all
parquet files are included regardless of naming convention.


Development

Successfully merging this pull request may close these issues.

[smart-meter-analysis] Pricing simulation