switchbox-data · alexhyunminlee · Mar 27, 2026 · Mar 12, 2026
diff --git a/.cursor/commands/extract-pdf-to-markdown.md b/.cursor/commands/extract-pdf-to-markdown.md
@@ -17,6 +17,14 @@ You are extracting a technical PDF into a **standalone, fully-formatted markdown
 6. **Standalone design**: A reader should be able to work from this markdown alone; the PDF is emergency reference only
 7. **LLM-friendly markers**: Use clear, parseable markers when you must indicate "see PDF for visual"
 
+### Long Documents (30+ pages)
+
+For source PDFs longer than ~30 pages, **maintain the same fidelity throughout the entire document**. Do not allow quality to degrade in later sections — every chapter deserves the same level of verbatim transcription, footnote capture, and figure description as the first. Specifically:
+
+- Work section by section at a consistent pace. If the Executive Summary was extracted near-verbatim, body chapters must be too.
+- Do not switch from transcription to summarization partway through. If you find yourself writing "the section discusses X" instead of reproducing the actual text, stop and transcribe.
+- For very long documents (80+ pages), it is acceptable to use multiple passes — extract the first half, save, then continue — rather than compressing later content.
+
 ## Extraction Instructions
 
 ### Structure & Hierarchy
@@ -33,7 +41,17 @@ You are extracting a technical PDF into a **standalone, fully-formatted markdown
 - `code` for technical terms, file paths, code snippets
 - Preserve numbered and bulleted lists with correct nesting. **When the source has an inline numbered list** (e.g. "commissions should 1) … 2) … 5)" or "(1) … (2)" in one paragraph), **convert it to a markdown list** (numbered or bullet) unless that would break the flow of the paragraph.
 - Keep paragraph structure and grouping exactly as in original
-- Convert hyperlinks to markdown format: `[link text](URL)`
+- Convert hyperlinks to markdown format: `[link text](URL)`. When a URL appears as a bare link without surrounding descriptive text (common in footnotes of legal/regulatory documents), wrap it with a short descriptive label: `[Short Description](URL)` (e.g. `[2022 NYISO Gold Book](https://...pdf)`).
+- **Sidebar boxes, callout panels, and inset text** (visually set apart from the main body — e.g., case studies, worked examples, "Challenge and Opportunity" boxes, marginal definitions): Preserve these in full. Format each as a blockquote or with a clear header indicating the sidebar title (e.g., `> **Remote Disconnection and Reconnection: Challenge and Opportunity**`). Do not silently drop sidebars; they often contain substantive content not duplicated in the main text.
+
+### Source Errors, Typos, and Cross-References
+
+Source PDFs often contain typos, missing words, or erroneous internal cross-references (e.g. "Section 4.1" when the content is clearly in Section 5.1). The goal is to make the extract easy for agents to reason about while preserving traceability to the original.
+
+- **Minor typos** (misspellings, missing prepositions, obvious letter transpositions like "fro" for "for"): Correct them silently for readability, but add a blanket note in the front matter: _"Note: Minor typographical errors in the source (e.g. "an" for "and", "fro" for "for") have been corrected for readability. Substantive corrections are annotated inline."_
+- **Erroneous cross-references** (section numbers, figure numbers, or other internal references that are clearly wrong given the document's own structure): Correct the reference to what it clearly means, and add an inline bracketed annotation preserving the original: `Section 5.1 [source says "Section 4.1"]`. This lets agents follow the correct reference while knowing the source differed.
+- **Truncated text** (e.g. a footnote or sentence cut off at a page break): Include whatever text is present, then add a bracketed note: `[text truncated at page break in source]`.
+- **Ambiguous cases** (where it's unclear whether the source text is an error or intentional): Preserve the original text exactly and do not correct it. If warranted, add a bracketed note: `[sic]` or `[sic; possibly intended "X"]`.
 
 ### Tables
 
@@ -99,6 +117,8 @@ You are extracting a technical PDF into a **standalone, fully-formatted markdown
 
 5. **For flowcharts/process diagrams**: Describe flow path, decision points, inputs/outputs, process steps in order
 
+6. **Preserve figure-referencing transitional sentences.** When the source text introduces a figure with a sentence like "Figure 3 shows the historical values for X" or "Figure 7 presents the projected installations," preserve that sentence as standalone text **before** the `[DIAGRAM DESCRIPTION]` block. Do not drop these sentences or fold them into the diagram description — they are part of the source prose and connect the narrative to the visual. The pattern is: transitional sentence → `[DIAGRAM DESCRIPTION]` block → any post-figure discussion.
+
 ### Citations & References
 
 - Preserve citation format exactly: `(Author et al., Year)` or `[1]`, `[2]`, etc.
@@ -112,6 +132,15 @@ You are extracting a technical PDF into a **standalone, fully-formatted markdown
 - Preserve ALL footnote content—nothing drops. If a footnote reference triggers a linter (e.g. "unused reference definition"), you may **inline the footnote content** into the body at the reference point and remove the footnote definition, provided no content is lost.
 - Keep numbering/order from original (or renumber from 1 if the source uses different numbering).
 
+### OCR-Based Extractions
+
+Some PDFs are scanned images with no embedded text, requiring OCR (e.g. via tesseract or PyMuPDF's `get_textpage_ocr`). When OCR is needed:
+
+- **Note it in the front matter.** Add a line to the metadata block: `**Extraction method**: OCR (source PDF is image-based)`. Also include a blanket note: _"This document was extracted via OCR from a scanned PDF. Minor recognition errors may remain; low-confidence values are marked with `[?]`."_
+- **Mark low-confidence values with `[?]`.** When OCR produces ambiguous or garbled text — especially in table cells, numbers, or proper nouns — include your best reconstruction followed by `[?]`. For example: `0.58 [?]` or `Smith [?]`. This signals to downstream consumers that the value may need manual verification against the original PDF.
+- **Use the highest practical DPI** (300+ recommended) for OCR to maximize recognition quality.
+- **For tables that OCR cannot extract** (e.g. complex spreadsheet images), do not invent cell data. Add: _"Table content was not extractable from the PDF; description below is inferred from surrounding text. See original PDF for data."_ Then describe the table's purpose and structure based on the narrative.
+
 ### Content You Cannot Fully Extract
 
 **Logos/branding graphics**: Skip unless essential
@@ -159,6 +188,8 @@ When the source PDF contains them, include short back-matter sections such as **
 - [ ] Equations preserved in original notation
 - [ ] All citations and references complete and linked
 - [ ] All footnotes and endnotes preserved
+- [ ] **Footnote count verified (mandatory)**: Before finalizing, count footnote **reference marks** in the source PDF (superscript numbers, symbols, or `[n]` markers in body text) and count footnote **definitions** in the extract (`[^n]:` lines or inlined equivalents). The counts **must** match. If the source has N footnotes, the extract must have exactly N footnote definitions. When footnotes are numbered per-chapter (resetting to 1 in each chapter), count per chapter and verify each. This is the single most commonly failed check — do not skip it.
+- [ ] Source errors handled: minor typos corrected with blanket note in front matter; erroneous cross-references corrected with inline `[source says "…"]` annotations; truncated text noted
 - [ ] No orphaned content (mentioned but not included)
 - [ ] Section numbers, titles, labels preserved exactly
 - [ ] Document reads as complete, standalone resource
@@ -189,7 +220,7 @@ When the source PDF contains them, include short back-matter sections such as **
 ## Output filename and location
 
 - **Filename**: The markdown file **must** use the same base name as the PDF, with a `.md` extension. Example: `bill_alignment_test.pdf` → `bill_alignment_test.md`.
-- **Location**: Save under `context/docs/` for technical documentation (e.g. Cambium, ResStock) or `context/papers/` for academic papers (e.g. Bill Alignment Test). Update `context/README.md` when adding or changing files.
+- **Location**: Save under `context/docs/` for technical documentation (e.g. Cambium, ResStock) or `context/sources/papers/` for academic papers (e.g. Bill Alignment Test). Update `context/README.md` when adding or changing files.
 
 ## Process
 
@@ -203,6 +234,6 @@ The PDF file path is provided as: **$ARGUMENTS**
 6. Build complete References section
 7. Do final quality check against checklist
 8. Output complete markdown as ready-to-use document
-9. **Save the file** under `context/docs/` or `context/papers/` using the **same base name as the PDF** (e.g. `path/to/foo.pdf` → `context/.../foo.md`). Update `context/README.md` if needed.
+9. **Save the file** under `context/docs/` or `context/sources/papers/` using the **same base name as the PDF** (e.g. `path/to/foo.pdf` → `context/.../foo.md`). Update `context/README.md` if needed.
 
 **Provide the extracted markdown in full, ready to save to context/ with the matching filename and commit.**
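
The mandatory footnote-count check added to the checklist above can be partly automated on the extract side. The following is a minimal sketch, assuming the extract uses `[^n]` reference marks and `[^n]:` definitions; the function name `footnote_report` and the sample text are hypothetical. It only checks the extract's internal consistency: the count of reference marks in the source PDF still has to be obtained separately, and footnotes that were inlined into the body are not counted.

```python
import re

# Footnote definition at the start of a line, e.g. "[^12]: text".
DEFINITION = re.compile(r"^\[\^([^\]]+)\]:", re.MULTILINE)
# Footnote reference in body text, e.g. "claim[^12]"; the lookahead skips definitions.
REFERENCE = re.compile(r"\[\^([^\]]+)\](?!:)")


def footnote_report(markdown: str) -> dict:
    """Compare the footnote labels referenced in the body with the labels defined."""
    defined = set(DEFINITION.findall(markdown))
    referenced = set(REFERENCE.findall(markdown))
    return {
        "definitions": len(defined),
        "references": len(referenced),
        "undefined": sorted(referenced - defined),
        "unreferenced": sorted(defined - referenced),
    }


# Hypothetical extract: two reference marks, but only one definition.
sample = "Costs rose[^1] while load fell[^2].\n\n[^1]: Source A.\n"
report = footnote_report(sample)
print(report)
```

A non-empty `undefined` list points at footnotes that were dropped during extraction, which is the failure this checklist item is meant to catch. For per-chapter numbering, the same function could be run on each chapter's text separately.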
diff --git a/.cursor/commands/validate-pdf-to-markdown-extraction.md b/.cursor/commands/validate-pdf-to-markdown-extraction.md
@@ -7,7 +7,7 @@ argument-hint: <path-to-extract.md> [path-to-source.pdf]
 
 You are comparing an **existing markdown extract** to its **source PDF** to assess how well the extraction succeeded across all major categories. Your output is a **structured validation report** that other agents or humans can use to judge quality and decide on follow-up actions.
 
-**Inputs:** The user provides the path to the markdown file (e.g. `context/papers/bill_alignment_test.md`). The source PDF is either provided as a second argument or inferred from the markdown path by replacing the file with the same base name and `.pdf` extension in the same directory (e.g. `context/papers/bill_alignment_test.pdf`). Read both the PDF (or its text representation) and the markdown to perform the comparison.
+**Inputs:** The user provides the path to the markdown file (e.g. `context/sources/papers/bill_alignment_test.md`). The source PDF is either provided as a second argument or inferred from the markdown path by replacing the file with the same base name and `.pdf` extension in the same directory (e.g. `context/sources/papers/bill_alignment_test.pdf`). Read both the PDF (or its text representation) and the markdown to perform the comparison.
 
 **Output:** Produce the validation report **only in the chat** (in your response). **Do not** write the report to a file or save it to disk. The user reads the report in the conversation; they can copy or save it themselves if needed.
 

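The input rule in the validation command (explicit PDF argument, otherwise the same base name with a `.pdf` extension in the same directory) can be sketched as follows. This is an illustration only; `resolve_source_pdf` is a hypothetical helper, not something defined in the repository.

```python
from pathlib import Path
from typing import Optional


def resolve_source_pdf(markdown_path: str, pdf_path: Optional[str] = None) -> Path:
    """Return the explicit PDF path if given, else swap the .md suffix for .pdf."""
    if pdf_path is not None:
        return Path(pdf_path)
    # Same directory, same base name, different extension.
    return Path(markdown_path).with_suffix(".pdf")


inferred = resolve_source_pdf("context/sources/papers/bill_alignment_test.md")
print(inferred.as_posix())
```
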
diff --git a/.devcontainer/devpod/aws.sh b/.devcontainer/devpod/aws.sh
@@ -20,10 +20,11 @@ if ! command -v aws >/dev/null 2>&1; then
   exit 1
 fi
 
-# Check if credentials are already valid (early exit if so)
-# Test with an actual EC2 API call since DevPod uses EC2
-if aws sts get-caller-identity &>/dev/null &&
-  aws ec2 describe-instances --max-results 5 &>/dev/null; then
+# Check if SSO credentials are already valid (early exit if so).
+# aws configure export-credentials exercises the full SSO credential chain,
+# so it fails if the SSO token is expired — unlike aws sts get-caller-identity,
+# which can succeed via env vars or static credentials even when SSO is stale.
+if aws configure export-credentials --format process &>/dev/null; then
   echo "✅ AWS credentials are already valid"
   echo
   exit 0
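
The early-exit check in this hunk relies only on the exit status of the credential export. A minimal Python sketch of the same pattern, assuming the AWS CLI v2 is on `PATH`; `command_succeeds` and `SSO_CHECK` are hypothetical names, and the `--format` flag is omitted here so the CLI's default output format applies.

```python
import subprocess
from typing import Sequence

# Assumed command, mirroring the check in aws.sh (default output format).
SSO_CHECK = ("aws", "configure", "export-credentials")


def command_succeeds(command: Sequence[str]) -> bool:
    """Return True when the command exits with status 0; its output is discarded."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The executable is missing or cannot be started.
        return False
    return result.returncode == 0


status = "already valid" if command_succeeds(SSO_CHECK) else "missing or expired"
print(f"AWS credentials are {status}")
```

Discarding stdout matters here because the command prints the resolved credentials; only the exit status is needed for the decision.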

diff --git a/.gitignore b/.gitignore
@@ -88,3 +88,5 @@ CLAUDE.md
 dev_plots/
 run_logs/
 utils/pre/tou_window/*.csv
+
+/.luarc.json
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -12,7 +12,7 @@ repos:
       - id: check-added-large-files
         name: "Git: block files over 600KB (check-added-large-files)"
         args: ["--maxkb=600"]
-        exclude: uv.lock|rate_design/.*/config/marginal_costs/.*\.csv|context/papers/nyiso_gold_book_2025\.md
+        exclude: uv.lock|rate_design/.*/config/marginal_costs/.*\.csv|context/sources/nyiso_gold_book_2025\.md
       - id: check-merge-conflict
         name: "Git: detect merge conflict markers (check-merge-conflict)"
       - id: check-case-conflict
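
Since pre-commit applies `exclude` as a Python regular expression with `re.search` semantics, the updated pattern can be checked quickly against candidate paths. A small sketch follows; the paths in `candidates` are hypothetical examples, not a listing of the repository.

```python
import re

# Pattern copied from the updated `exclude` entry of check-added-large-files.
EXCLUDE = re.compile(
    r"uv.lock|rate_design/.*/config/marginal_costs/.*\.csv"
    r"|context/sources/nyiso_gold_book_2025\.md"
)


def is_excluded(path: str) -> bool:
    """Return True if the large-file hook would skip this path."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths used only to exercise the pattern.
candidates = [
    "uv.lock",
    "context/sources/nyiso_gold_book_2025.md",
    "context/sources/papers/nyiso_gold_book_2025.md",
    "rate_design/ny/config/marginal_costs/example.csv",
]
for path in candidates:
    print(f"{path}: {'excluded' if is_excluded(path) else 'checked'}")
```

Note that the pattern matches the file directly under `context/sources/` but not under `context/sources/papers/`, so it is worth confirming where the Gold Book extract actually lives after the move.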

diff --git a/.vscode/extensions.json b/.vscode/extensions.json
@@ -8,6 +8,7 @@
     "dvirtz.parquet-viewer",
     "hashicorp.terraform",
     "davidanson.vscode-markdownlint",
+    "goessner.mdmath",
     "tombi-toml.tombi",
     "nefrob.vscode-just-syntax",
     "christian-kohler.path-intellisense"

diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -18,6 +18,7 @@
   ],
   "python.testing.unittestEnabled": false,
   "python.testing.pytestEnabled": true,
+  "python.analysis.diagnosticMode": "openFilesOnly",
   "python.analysis.exclude": [
     "run_logs",
     "**/config/tariffs/**",