
60 smart meter analysis pricing simulation #61

Open
griffinsharps wants to merge 149 commits into main from
60-smart-meter-analysis-pricing-simulation

Conversation

@griffinsharps
Contributor

No description provided.

Griffin Sharps and others added 23 commits January 26, 2026 23:22
Co-authored-by: Cursor <cursoragent@cursor.com>
…onth validator

Enhance validate_month_output.py with three preflight checks needed before
scaling to full-month execution:
- Duplicate (zip_code, account_identifier, datetime) detection per batch file
- Row count reporting (total + per-file) in validation report JSON
- Run artifact integrity via --run-dir flag (plan.json, run_summary.json,
  manifests, batch summaries)

Add PREFLIGHT_200.md checklist for 200-file EC2 validation run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
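The duplicate-key preflight described above reduces to counting composite (zip_code, account_identifier, datetime) keys within each batch file. A minimal stand-in in plain Python — the real check reads parquet via Polars, and the values below are illustrative:

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return composite keys that appear more than once in one batch file.

    `rows` stands in for the (zip_code, account_identifier, datetime)
    tuples read from a batch parquet; the values are illustrative.
    """
    counts = Counter(rows)
    return sorted(key for key, n in counts.items() if n > 1)

batch = [
    ("60601", "acct-1", "2023-07-01T00:00"),
    ("60601", "acct-1", "2023-07-01T00:00"),  # duplicate key
    ("60601", "acct-2", "2023-07-01T00:00"),
]
dups = find_duplicate_keys(batch)
```

Running this per batch file (rather than across the whole month) keeps the counter bounded by one file's row count.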
…s_sorted()

Polars 1.38 removed is_sorted() from Expr. Collect the composite key first,
then check sortedness on the resulting Series, which retains the method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
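The workaround amounts to collecting first, then checking order on the materialized values. A plain-Python model of the check (the real code collects the composite key with Polars and calls `Series.is_sorted()` on the result):

```python
def composite_keys_sorted(keys):
    """Non-decreasing check over already-collected composite key tuples.

    Models the post-1.38 pattern: since Expr.is_sorted() is gone,
    collect the key column first (e.g. lf.select(key).collect()
    .to_series()) and check sortedness on the Series/values you get back.
    """
    return all(a <= b for a, b in zip(keys, keys[1:]))

ok = composite_keys_sorted([("60601", 1), ("60601", 2), ("60602", 1)])
bad = composite_keys_sorted([("60602", 1), ("60601", 2)])
```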
Restores Justfile from main and adds migrate-month recipe.
Usage: just migrate-month 202307
- batch-size 100, workers 6, lazy_sink, --resume
- Reads ~/s3_paths_<YYYYMM>_full.txt, writes to /ebs/.../out_<YYYYMM>_production
- Uses bare python (no uv) for EC2 compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.gitignore: block *.txt, .tmp/, archive_quarantine/, tmp_polars_run_*/,
subagent_packages/ from being tracked.

pre-commit: add detect-private-key hook and a local forbid-secrets hook
that blocks .env, .secrets, credentials.json, .pem, .key, .p12, .pfx,
.jks files from being committed (even via git add -f).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hash-based duplicate detection (n_unique ~400MB/file) with
adjacent-key streaming that leverages the required global sort order.
Sortedness and uniqueness now share a single PyArrow iter_batches
pass in full mode.

Key changes:
- _streaming_sort_and_dup_check: combined sort+dup via PyArrow
  batch iteration, O(batch_size) memory, cross-file boundary state
- Per-file datetime stats with merge (_DtStats dataclass)
- Per-file DST stats with merge (_DstFileStats dataclass)
- Enhanced sample mode: strict-increasing check (catches dups in windows)
- Row counts from parquet metadata (O(1), no data scan)
- Phase-based main() architecture (discovery -> metadata -> streaming
  -> datetime -> DST -> artifacts -> report)
- _fail() typed as NoReturn for mypy narrowing
- Add pyarrow mypy override in pyproject.toml

Removed dead functions: _check_sorted_full, _validate_no_duplicates_file,
_validate_datetime_invariants_partition, _validate_dst_option_b_partition,
_keys_is_sorted_df

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
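The adjacent-key pass above can be sketched without PyArrow: iterate batches in order, carry only the previous key across batch and file boundaries, and flag equal keys (duplicates) or decreasing keys (order breaks) in the same pass. File names and key values here are illustrative:

```python
def scan_batches(files):
    """One streaming pass over sorted key batches.

    Flags duplicate keys and order breaks while holding only the last
    key as carry-over state, so memory stays O(batch_size) and the
    checks span file boundaries — the shape of the combined
    sort+duplicate check, minus the PyArrow iter_batches plumbing.
    """
    prev = None  # last key seen, carried across batches and files
    dups, breaks = [], []
    for fname, batches in files:
        for batch in batches:
            for key in batch:
                if prev is not None:
                    if key == prev:
                        dups.append((fname, key))
                    elif key < prev:
                        breaks.append((fname, key))
                prev = key
    return dups, breaks

files = [
    ("batch_0.parquet", [[("60601", "a", 1), ("60601", "a", 2)]]),
    ("batch_1.parquet", [[("60601", "a", 2)]]),  # dup across file boundary
]
dups, breaks = scan_batches(files)
```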
Two fixes in validate_month_output.py:

1. _slice_keys: use lf.collect() instead of streaming engine for slice
   reads — streaming may reorder rows, defeating sortedness validation.
   Slices are small (5K rows x 3 cols) so default engine is correct and fast.

2. _check_sorted_sample: track prev_end and only perform cross-slice
   boundary comparison when off >= prev_end (non-overlapping). Random
   windows can overlap head/tail/each other, making boundary checks
   invalid under overlap. Within-slice strict-monotonic checks still
   run unconditionally.

Also updates remaining collect(streaming=True) calls to
collect(engine="streaming") to fix Polars deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
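The overlap guard in fix 2 can be modeled over (offset, keys) windows: strict monotonicity is always enforced within a slice, but the cross-slice boundary comparison runs only when the next slice starts at or past the previous slice's end. A hedged sketch with illustrative integer keys:

```python
def check_sample_windows(windows):
    """Sampled sortedness check over (offset, keys) slices.

    Within each slice, keys must be strictly increasing. Across slices,
    the boundary comparison runs only when off >= prev_end, because
    overlapping random windows make boundary checks invalid.
    """
    prev_end = 0
    prev_last = None
    for off, keys in sorted(windows):
        for a, b in zip(keys, keys[1:]):
            if not a < b:          # within-slice check: unconditional
                return False
        if prev_last is not None and off >= prev_end:
            if not prev_last < keys[0]:  # boundary check: non-overlap only
                return False
        prev_end = max(prev_end, off + len(keys))
        prev_last = keys[-1]
    return True

# Second window overlaps the first, so its repeated boundary key (3)
# is correctly tolerated; the third window is disjoint and is checked.
ok = check_sample_windows([(0, [1, 2, 3]), (2, [3, 4, 5]), (10, [20, 21])])
```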
Restores Justfile from main and adds migrate-month YEAR_MONTH recipe:
- Guards against non-EC2 environments (checks /ebs mount)
- Auto-generates S3 input list via aws s3 ls + awk + sort
- Validates non-empty input list before running
- Runs migrate_month_runner.py with standard production params
  (batch-size 100, workers 6, --resume, lazy_sink)

Usage: just migrate-month 202307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Annotate migrate_month_runner.py, validate_month_output.py, and Justfile
with industry-standard "why" comments for senior code review. Additions
include module-level architecture docstrings, function-level design
rationale, and parameter tuning explanations. No logic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor migrate-month to use configurable variables (S3_PREFIX,
MIGRATE_OUT_BASE, etc.) instead of hardcoded bucket names and
usernames, preparing the repo for open-source. Add five recipes:
months-from-s3, migrate-months, validate-month, validate-months,
and migration-status. Multi-month recipes support fail-fast (default)
or continue-on-error mode with per-invocation UTC-timestamped logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hard-coded RTP/flat spread logic with dual tariff inputs
(--tariff-prices-a, --tariff-prices-b) that each use the standard
price_cents_per_kwh schema from build_tariff_hourly_prices.py.
Adds fail-loud join guards (null check + row-count check per tariff).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
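The fail-loud join guards amount to two assertions around the tariff join: no load hour may come back without a price (null guard), and the joined row count must equal the input row count (cardinality guard). A dict-based sketch — names and shapes are illustrative, not the actual script schema:

```python
def bill_with_tariff(load_rows, tariff_prices):
    """Fail-loud tariff join sketch.

    load_rows:      list of (hour_key, kwh) tuples.
    tariff_prices:  hour_key -> price_cents_per_kwh mapping.
    Raises instead of silently dropping or null-filling rows.
    """
    billed = []
    for hour, kwh in load_rows:
        price = tariff_prices.get(hour)
        if price is None:  # null guard: unmatched join key
            raise ValueError(f"no price_cents_per_kwh for hour {hour!r}")
        billed.append((hour, kwh, kwh * price / 100.0))
    if len(billed) != len(load_rows):  # cardinality guard
        raise ValueError("tariff join changed the row count")
    return billed

rows = bill_with_tariff([("00", 1.5), ("01", 2.0)],
                        {"00": 10.0, "01": 20.0})
```

In the real Polars join the cardinality guard matters more than it does here, since a many-to-many join can silently multiply rows.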
Generates one row per local hour in America/Chicago for a given year,
mapping each hour to its TOU season and period with the associated
price. Handles DST by keeping the first UTC occurrence of fall-back
duplicates. Validates full hour coverage per season.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
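The fall-back rule above ("keep the first UTC occurrence") can be demonstrated with stdlib zoneinfo: walking UTC hours across the 2023 fall-back in America/Chicago, the local hour 01:00 occurs twice, and only its first UTC instant is kept. The function is an illustrative reduction of the calendar build, not the script itself:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def local_hour_rows(start_utc, n_hours, tz="America/Chicago"):
    """One row per naive local hour; on fall-back, the duplicated local
    hour keeps its first UTC occurrence and later duplicates are dropped."""
    seen = set()
    rows = []
    for i in range(n_hours):
        utc = start_utc + timedelta(hours=i)
        local = utc.astimezone(ZoneInfo(tz)).replace(tzinfo=None)
        if local in seen:
            continue  # fall-back duplicate: first UTC occurrence wins
        seen.add(local)
        rows.append((local, utc))
    return rows

# 2023-11-05: clocks fall back at 2:00 CDT, so local 01:00 repeats.
start = datetime(2023, 11, 5, 5, 0, tzinfo=timezone.utc)  # 00:00 CDT
rows = local_hour_rows(start, 6)  # 6 UTC hours -> 5 distinct local hours
```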
Validates uniqueness and join coverage across the full chain:
interval data -> hourly loads -> tariff calendars -> household bills.
Synthetic tests exercise spring-forward, fall-back, and normal months;
sample-data tests validate the real 202308 artefacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
build_regression_dataset.py: maps household bills to Census block groups
via ZIP+4 crosswalk (deterministic 1:1), aggregates to BG-level outcomes,
and fits two OLS regressions (savings + bill diff ~ demographics). Supports
auto/core/explicit predictor modes and graceful outcome column fallback.

run_billing_pipeline.py: multi-month orchestrator that chains hourly loads,
tariff billing, annual aggregation, and regression via subprocess. Supports
--months/--months-file with {yyyymm} path patterns and writes per-month
outputs plus annual_household_aggregate.parquet with a full run manifest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
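The regression step reduces to ordinary least squares of a BG-level outcome on demographic predictors. A plain-numpy sketch under that assumption — the real script's predictor selection, column names, and fallback logic are not reproduced here:

```python
import numpy as np

def fit_ols(X, y):
    """Minimal OLS (outcome ~ predictors) with an intercept, via lstsq.

    Stands in for the savings / bill-diff regressions; returns the
    coefficient vector (intercept first) and R-squared.
    """
    A = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tss = float((y - y.mean()) @ (y - y.mean()))
    r2 = 1.0 - float(resid @ resid) / tss
    return coef, r2

# Synthetic exact-fit data: y = 2 + 3x, so OLS must recover (2, 3).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
coef, r2 = fit_ols(X, y)
```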
Annotate 12 files with inline comments explaining design decisions,
trade-offs, and non-obvious rationale for a senior reviewer. No logic
changes—comments only (+121/-12 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@griffinsharps griffinsharps linked an issue Feb 11, 2026 that may be closed by this pull request
Griffin Sharps and others added 6 commits February 16, 2026 22:26
Replace all references to annual_household_aggregate.parquet with
all_months_household_bills.parquet. Expand the test suite from 25 tests
to 93 tests covering:

- All-months bills: schema, YYYYMM month format, no nulls, set equality
  of months, additive totals matching per-month outputs
- Regression artifacts: existence of all 7 files (bg_month_outcomes,
  bg_annual_outcomes, bg_season_outcomes, regression_dataset_bg,
  regression_results.json, regression_summary.txt, regression_metadata.json)
- Schema assertions for all BG outcome parquets
- Mathematical invariants: pct_savings_weighted definition, annual rollup
  equals sum-of-months, season values and mapping, null handling
- Crosswalk coverage: n_zip4, n_bg, n_zip4_multi_bg, pct_dropped
- Regression results JSON: model keys, r_squared, coefficients with const
- Both regression modes (annual, bg_month) with schema consistency checks
- Skip-regression mode: no artifacts, manifest flags, bills still produced
- Manifest: all_months_bills_rows, steps_completed, regression_level

Also adds --regression-level pass-through to the orchestrator CLI so
both annual and bg_month modes can be tested end-to-end.

Test data is augmented with synthetic households at diverse ZIP+4 values
to ensure >= 6 census block groups for OLS regression coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
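The "additive totals" invariant in the list above — per-month bills must sum to the all-months totals, household by household — can be stated compactly. Dict shapes here are illustrative stand-ins for the per-month and all_months_household_bills parquets:

```python
from collections import defaultdict

def additive_totals_ok(per_month_bills, all_months_bills, tol=1e-9):
    """Check that each household's all-months total equals the sum of
    its per-month totals, and that the household sets match exactly.

    per_month_bills:  (household, yyyymm) -> bill total
    all_months_bills: household -> bill total
    """
    sums = defaultdict(float)
    for (household, _yyyymm), total in per_month_bills.items():
        sums[household] += total
    return set(sums) == set(all_months_bills) and all(
        abs(sums[h] - all_months_bills[h]) <= tol for h in sums
    )

ok = additive_totals_ok(
    {("h1", "202307"): 10.0, ("h1", "202308"): 5.0, ("h2", "202307"): 7.0},
    {"h1": 15.0, "h2": 7.0},
)
bad = additive_totals_ok({("h1", "202307"): 10.0}, {"h1": 11.0})
```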
…alidation

Co-authored-by: Cursor <cursoragent@cursor.com>
Griffin Sharps and others added 3 commits March 23, 2026 21:19
Add plotnine, compute_delivery_deltas, gspread, dotenv, bs4, and IPython
to the appropriate deptry ignore lists. plotnine is a notebook-only
dependency; the others are local scripts or transitive imports in lib/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update pct_save computations to use the actual parquet column names:
STOU → total_delta_dollars, DTOU → dtou_total_delta_dollars.

Fix regression extraction to filter on "rate" and "dep_var" (not
"rate_type"/"outcome"); hardcode stou_jan_mean_pct = 25.66 since
there is no mean_pct column in regression_summary.csv.

Update callout notes and prose to reflect the real schema.
@griffinsharps griffinsharps force-pushed the 60-smart-meter-analysis-pricing-simulation branch 2 times, most recently from 49ae9d9 to 473b47a Compare March 24, 2026 17:02
Griffin Sharps and others added 8 commits March 24, 2026 17:07
…ort targets

- Add _quarto.yml: manuscript project type, Switchbox theme, SVG figures
- Add references.bib with ICC Final Order and Order on Rehearing entries
- Add render/draft/clean targets to Justfile
- Fix bibliography path in index.qmd (../references.bib → references.bib)
- Fix typo in index.qmd: "chargesa" → "charges a"
@griffinsharps griffinsharps force-pushed the 60-smart-meter-analysis-pricing-simulation branch from 5924afc to d2ad44b Compare March 24, 2026 18:38
Max Shron (Switchbox) and others added 16 commits March 24, 2026 11:43
integrity (default): checks duplicates, order breaks, row counts, file
counts, dir size — skips acct_day_counts entirely for a fast ~1-2 hour
full pass. Outputs to /tmp/phase1_integrity_audit.{tsv,json,md}.

acct-day: only accumulates rows-per-acct-day stats. Accepts --months
flag to target specific YYYYMM values instead of all 49 months.
Outputs to /tmp/phase1_acct_day_audit.{tsv,json,md}.
… ~20M)

Replace within-batch numpy row-by-row comparisons with boundary-only checks:
compare only the first row of each batch against the last row of the previous
batch. Drops the numpy import entirely. The 202508 double-ingestion defect
shows up at file boundaries, so internal per-row scanning adds no detection
value. Expected speedup: 11+ min/month → <30 s/month.
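The boundary-only check above compares just two rows per batch: the first key of each batch against the last key of the one before it. A sketch with illustrative integer keys (the real audit compares composite key rows read from parquet batches):

```python
def boundary_issues(batches):
    """Boundary-only audit over ordered key batches.

    An equal key at a boundary signals double ingestion (the 202508
    defect pattern); a decrease signals an order break. Interior rows
    are never scanned, which is what buys the speedup.
    """
    issues = []
    prev_last = None
    for i, batch in enumerate(batches):
        if not batch:
            continue
        if prev_last is not None:
            if batch[0] == prev_last:
                issues.append((i, "duplicate"))
            elif batch[0] < prev_last:
                issues.append((i, "order_break"))
        prev_last = batch[-1]  # only the last key is retained
    return issues

issues = boundary_issues([[1, 2, 3], [3, 4], [2, 5]])
```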
Scans all out_*_production dirs, skips already-compacted months (no
batch_* files), and runs compact_month_output.run_compaction() on each.
Supports --dry-run (default), --execute, and --months filter flags.
…ction

batch_*.parquet was the only pattern matched; compacted months use part-*.parquet.
Switch both audit_month_integrity and audit_month_acct_day to *.parquet so all
parquet files are included regardless of naming convention.


Development

Successfully merging this pull request may close these issues.

[smart-meter-analysis] Pricing simulation