From 8313ee3e25ca9bda407bbaccea7d6b6bb258f5c1 Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Fri, 30 Jan 2026 10:49:33 -0500
Subject: [PATCH 1/8] Copy YODA talk for the future webinar slides

---
 2026-repronim-YODA-BIDS-webinar.html | 882 +++++++++++++++++++++++++++
 1 file changed, 882 insertions(+)
 create mode 100644 2026-repronim-YODA-BIDS-webinar.html

diff --git a/2026-repronim-YODA-BIDS-webinar.html b/2026-repronim-YODA-BIDS-webinar.html
new file mode 100644
index 0000000..4f051d4
--- /dev/null
+++ b/2026-repronim-YODA-BIDS-webinar.html
@@ -0,0 +1,882 @@
+ Pragmatic YODA: overview of YODA principles and their wild life encounters

Pragmatic YODA: overview of YODA principles
and their wild life encounters

+ Live slides/Sources:
+ https://datasets.datalad.org/centerforopenneuroscience/talks/2025-distribits-YODA.html
+ YouTube: https://www.youtube.com/watch?v=EuKVapscUQ4.

Acknowledgements


A YODA principle a day keeps gray hair rate at bay


Why version control?

  • collaborate
  • keep things synchronized
  • keep track of changes
  • use as a backup
  • search (grep)
  • ...

+ Borrowed from PhD Comics 1531.

+ YODA style OpenNeuro Derivatives from OpenNeuro

Explore/obtain yourself from https://github.com/OpenNeuroDerivatives

From b0a35302f4206f23ef5bd6c9c230171546fd871b Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Sun, 1 Feb 2026 21:42:01 -0500
Subject: [PATCH 2/8] Add ReproNim 2026 webinar on ReproFlow & YODA with planning materials

New presentation: ReproFlow & YODA: Structure your studies, observable and
reproducible they become

- Title slide and metadata updated for Feb 6, 2026 ReproNim webinar
- Abstract emphasizes observability and reproducibility themes
- QR code generated for slides URL
- Planning materials organized in YODA-compliant structure:
  - notes/act2-refinement-notes.md: Research on BEP028, BABS, Nipoppy,
    BIDS-flux, FAIRly big framework, and SciOps principles
  - planning/proposed-structure.md: 5-act narrative structure proposal
  - README.md: Overview and entry point for all materials

Theme: YODA principles + BIDS composition + ReproFlow/reprostim tooling
enable observable and reproducible neuroimaging workflows from acquisition
to publication. Emphasis on provenance (BEP028), dashboard separation
pattern, and AI as amplifier of structured data.
---
 2026-repronim-YODA-BIDS-webinar.html          |  18 +-
 2026-repronim-YODA-BIDS-webinar/README.md     |  56 ++++++
 .../notes/act2-refinement-notes.md            | 161 ++++++++++++++++++
 .../planning/proposed-structure.md            | 151 ++++++++++++++++
 ...2026-repronim-YODA-BIDS-webinar-qrcode.png |   1 +
 5 files changed, 381 insertions(+), 6 deletions(-)
 create mode 100644 2026-repronim-YODA-BIDS-webinar/README.md
 create mode 100644 2026-repronim-YODA-BIDS-webinar/notes/act2-refinement-notes.md
 create mode 100644 2026-repronim-YODA-BIDS-webinar/planning/proposed-structure.md
 create mode 100644 pics/2026-repronim-YODA-BIDS-webinar-qrcode.png

diff --git a/2026-repronim-YODA-BIDS-webinar.html b/2026-repronim-YODA-BIDS-webinar.html
index 4f051d4..57363cb 100644
--- a/2026-repronim-YODA-BIDS-webinar.html
+++ b/2026-repronim-YODA-BIDS-webinar.html
@@ -6,8 +6,8 @@
- Pragmatic YODA: overview of YODA principles and their wild life encounters
-
+ ReproFlow & YODA: Structure your studies, observable and reproducible they become
+
@@ -30,7 +30,12 @@
-Pragmatic YODA: overview of YODA principles and their wild life encounters
+ReproFlow & YODA: Structure your studies, observable and reproducible they become
+
+ Version control everything. Look up you must not. Compose modularly you shall.
+ Discover how YODA principles, BIDS composition, and ReproFlow/reprostim tooling
+ bring observable and reproducible workflows to neuroimaging—from acquisition to publication.
+
@@ -50,7 +55,7 @@ Pragmatic YODA: overview of YODA principles and their wild life encount
 Dartmouth College
 New Hampshire, USA
-
+
@@ -68,9 +73,10 @@ Pragmatic YODA: overview of YODA principles and their wild life encount
 -->
+ ReproNim Webinar — Friday, February 6th, 2026
 Live slides/Sources:
- https://datasets.datalad.org/centerforopenneuroscience/talks/2025-distribits-YODA.html
- YouTube: https://www.youtube.com/watch?v=EuKVapscUQ4.
+ https://datasets.datalad.org/centerforopenneuroscience/talks/2026-repronim-YODA-BIDS-webinar.html
+ More info: https://repronim.org/about/webinars/
diff --git a/2026-repronim-YODA-BIDS-webinar/README.md b/2026-repronim-YODA-BIDS-webinar/README.md
new file mode 100644
index 0000000..84f760c
--- /dev/null
+++ b/2026-repronim-YODA-BIDS-webinar/README.md
@@ -0,0 +1,56 @@
+# ReproFlow & YODA: Structure your studies, observable and reproducible they become
+
+**ReproNim Webinar — Friday, February 6th, 2026**
+
+> Version control everything. Look up you must not. Compose modularly you shall.
+> Discover how YODA principles, BIDS composition, and ReproFlow/reprostim tooling
+> bring observable and reproducible workflows to neuroimaging—from acquisition to publication.
+
+## Contents
+
+- `../2026-repronim-YODA-BIDS-webinar.html` - Main presentation slides (reveal.js)
+- `notes/` - Planning and refinement notes
+- `planning/` - Structural planning documents
+
+## Live Slides
+
+- **URL**: https://datasets.datalad.org/centerforopenneuroscience/talks/2026-repronim-YODA-BIDS-webinar.html
+- **Sources**: https://datasets.datalad.org/centerforopenneuroscience/talks/.git
+
+## More Information
+
+- **ReproNim Webinars**: https://repronim.org/about/webinars/
+- **Center for Open Neuroscience**: https://centerforopenneuroscience.org/
+- **Previous ReproFlow webinar** (June 2024): https://datasets.datalad.org/repronim/artwork/talks/webinar-2024-reproflow/#/
+
+## Presentation Overview
+
+### Core Themes
+
+1. **Observability** - duct, bash history, zoom recordings, ReproStim
+2. **Hierarchical composition** - BIDS, submodules, OpenNeuroDerivatives, condensed frontiers
+3. **AI as amplifier, not replacement** - structured data enables better AI
+4. **Provenance everywhere** - git commits, run records, CI logs, BEP028
+5. **Independence through standardization** - BIDS/YODA enable federation without centralization
+
+### Structure (Proposed)
+
+- **Act I**: YODA Foundation (principles 1-3)
+- **Act II**: Execution & Workflows (SciOps, ReproFlow, tools)
+- **Act III**: Hierarchical Composition (BIDS as YODA exemplar)
+- **Act IV**: AI Frontier (structure enables intelligence)
+- **Act V**: The Vision (every lab, a YODA)
+
+## Development
+
+To view locally:
+```bash
+cd /home/yoh/proj/CON/talks
+python -m http.server 8081
+# Visit: http://localhost:8081/2026-repronim-YODA-BIDS-webinar.html
+```
+
+## License
+
+Talk materials: CC-BY 4.0
+Code examples: MIT License (where applicable)
diff --git a/2026-repronim-YODA-BIDS-webinar/notes/act2-refinement-notes.md b/2026-repronim-YODA-BIDS-webinar/notes/act2-refinement-notes.md
new file mode 100644
index 0000000..aa6ebec
--- /dev/null
+++ b/2026-repronim-YODA-BIDS-webinar/notes/act2-refinement-notes.md
@@ -0,0 +1,161 @@
+# Act II Refinement Notes - Execution & Workflows
+
+## Key Additions for Tomorrow's Discussion
+
+### 1. Provenance Standards - BEP028
+- [BEP028 (BIDS Provenance)](https://github.com/bids-standard/BEP028_BIDSprov/blob/master/bep028spec.md)
+- Based on W3C PROV-O ontology
+- Defines Activities, Entities, Agents, Environments
+- JSON-LD format for machine-readable provenance
+- Can be stored in `/prov/` directory or sidecars
+- **Key Point**: YODA + BEP028 = complete computational provenance chain
+
+### 2. Dashboard Separation Pattern
+**Nipoppy Example:**
+- Data layer: `.tsv` files in BIDS hierarchy (YODA-compliant)
+- Visualization layer: [Neurobagel digest dashboard](https://digest.neurobagel.org/)
+- Upload tracker files → interactive dashboard
+- **Principle**: Data remains version-controlled, dashboards consume but don't own
+
+**Other dashboards to mention:**
+- BABS processing status tracking
+- ReproMan job monitoring
+- DataLad-Registry search interface
+
+### 3. YODA-Compliant Workflow Tools
+
+#### BABS (BIDS App Bootstrap)
+- [Paper](https://direct.mit.edu/imag/article/doi/10.1162/imag_a_00074/119046)
+- [Docs](https://pennlinc-babs.readthedocs.io/)
+- Uses DataLad + FAIRly big framework
+- HPC-scale (demonstrated on n=2,565 Healthy Brain Network)
+- Automatic provenance tracking
+- Supports SGE and Slurm
+- **Felix Hoffstaedter** co-author (Forschungszentrum Jülich, INM-7)
+
+#### FAIRly big
+- [Framework paper](https://www.researchgate.net/publication/355214162_FAIRly_big_A_framework_for_computationally_reproducible_processing_of_large-scale_data)
+- DataLad-based, domain-agnostic
+- Full audit trail via DataLad
+- Used by BABS
+
+#### Nipoppy
+- [GitHub](https://github.com/nipoppy/nipoppy)
+- [Tutorial](https://repronim.org/resources/tutorials/nipoppy/)
+- Lightweight framework for neuroimaging-clinical data
+- BIDS + phenotypic data organization
+- Integration with DataLad
+- Dashboard for processing status via Neurobagel
+
+#### BIDS-flux
+- [Docs](https://bids-flux-docs.readthedocs.io/)
+- Scalable FAIR data management platform
+- Built on BIDS + DataLad
+- GitLab for workflow orchestration
+- MinIO for object storage
+- Containerized BIDSApps
+- Multi-site neuroimaging research focus
+
+#### ReproMan
+- Already covered in original slides
+- Need to emphasize: orchestration across compute environments
+- YODA-compliant job specifications
+
+### 4. SciOps Framework (from June 2024 webinar)
+Reference: [ReproFlow webinar slides](https://datasets.datalad.org/repronim/artwork/talks/webinar-2024-reproflow/#/)
+
+**Core Principles:**
+1. **Be thorough**: Automate provenance information collection
+2. **Be efficient**: Automate as much as feasible
+3. **Be formal**: Use standardized approaches
+
+**Effort Rebalancing:**
+- Current: >80% on data collection/processing
+- Goal: >80% upfront planning, automate execution
+
+**ReproFlow Components:**
+- **ReproIn/HeuDiConv**: DICOM → BIDS automation
+- **ReproStim**: Capture audio/video stimuli presented
+- **ReproEvents**: Behavioral event tracking
+- **Con/noisseur**: Scanner console input capture
+- **ReproMon**: Real-time operator feedback
+- **phys2bids**: Physiological data automation
+
+**YODA Connection:**
+All of these produce version-controlled outputs that fit into YODA hierarchy
+
+### 5. Proposed Act II Structure
+
+**Act II: Execution & Workflows - The SciOps Way**
+
+1. **The Automation Imperative**
+   - 80/20 rule: plan upfront, automate execution
+   - SciOps principles (thorough, efficient, formal)
+   - Reference to June 2024 webinar
+
+2. **Provenance as First-Class Citizen**
+   - BEP028 specification
+   - Activities, Entities, Agents
+   - `datalad run` → BEP028-compliant records
+   - `con/duct` → execution traces
+   - `tinuous` → CI/CD provenance
+
+3. **YODA-Compliant Workflow Tools**
+   - **ReproMan**: Multi-environment orchestration
+   - **BABS**: HPC-scale BIDS Apps (FAIRly big) - Zhao et al., with Felix Hoffstaedter as co-author
+   - **Nipoppy**: Clinical-imaging integration with dashboard
+   - **BIDS-flux**: Multi-site platform with GitLab orchestration
+   - All share: YODA hierarchy + provenance tracking
+
+4. **The Dashboard Pattern**
+   - Data ≠ Visualization
+   - Data: Version-controlled .tsv files in YODA structure
+   - Dashboards: Consume data, provide insights
+   - Examples:
+     - Nipoppy → Neurobagel digest
+     - BABS → processing status
+     - DataLad-Registry → dataset search
+   - **Principle**: Dashboards are regenerable views, not truth
+
+5. **Complete Capture - ReproFlow Ecosystem**
+   - Scanner → ReproIn → BIDS
+   - Stimuli → ReproStim → BIDS derivatives
+   - Events → ReproEvents → BIDS events.tsv
+   - Console → Con/noisseur → metadata
+   - Monitor → ReproMon → QC feedback
+   - Processing → datalad run + duct → BEP028 prov
+   - CI/CD → tinuous → build artifacts
+
+6. **From Execution to Archive**
+   - YODA structure enables: re-execution, sharing, composition
+   - Every step captured, every input/output tracked
+   - "Do not look up" preserved at every level
+
+### 6. Key Messages for Act II
+
+- **Automation ≠ Black Box**: SciOps automates capture, not decisions
+- **Provenance Everywhere**: From scanner to publication
+- **Separation of Concerns**: Data in YODA, views via dashboards
+- **Standard Tools**: Don't write custom scripts, compose standard tools
+- **Scale Through Structure**: YODA enables HPC/cloud/local equivalence
+
+## Additional Resources
+
+### Papers to Reference
+- [BABS paper (2024)](https://direct.mit.edu/imag/article/doi/10.1162/imag_a_00074/119046) - Zhao et al. (Felix Hoffstaedter is a co-author)
+- [FAIRly big framework](https://www.researchgate.net/publication/355214162_FAIRly_big_A_framework_for_computationally_reproducible_processing_of_large-scale_data)
+- [BEP028 spec](https://github.com/bids-standard/BEP028_BIDSprov/blob/master/bep028spec.md)
+
+### Tools Documentation
+- [Nipoppy tutorial](https://repronim.org/resources/tutorials/nipoppy/)
+- [BABS docs](https://pennlinc-babs.readthedocs.io/)
+- [BIDS-flux docs](https://bids-flux-docs.readthedocs.io/)
+- [Neurobagel digest](https://digest.neurobagel.org/)
+
+## Questions for Tomorrow
+1. Should we create a unified "ReproFlow ecosystem diagram" showing all components?
+2. How much detail on each tool vs. overview?
+3. Should Felix's BABS work be a detailed case study or just mentioned?
+4. Do we need a slide specifically on "data vs. dashboard" separation?
+5. How to best visualize the SciOps 80/20 principle?
+6. Should we have a comparative slide showing BABS vs. Nipoppy vs. BIDS-flux use cases?
diff --git a/2026-repronim-YODA-BIDS-webinar/planning/proposed-structure.md b/2026-repronim-YODA-BIDS-webinar/planning/proposed-structure.md
new file mode 100644
index 0000000..fff9ff1
--- /dev/null
+++ b/2026-repronim-YODA-BIDS-webinar/planning/proposed-structure.md
@@ -0,0 +1,151 @@
+# Proposed Slide Structure & Narrative Arc
+
+## Act I: The YODA Foundation (Keep & Enhance)
+*Current slides work well - minor enhancements*
+
+1. **Title Slide** ✓ (already updated)
+2. **YODA Principles Overview** ✓ (keep as-is)
+3. **NEW: Evolution of YODA**
+   - From "research data organization" (2018) → "comprehensive digital material management" (2026)
+   - Beyond code/data: bash history, zoom recordings, meeting notes, experimental stimuli
+   - Theme: *"If it's digital and matters, version control it shall be"*
+
+## Act II: YODA Meets Real-World Workflows (Expand)
+*Connect YODA principles to ReproFlow ecosystem*
+
+4. **Principle 1: Version Control** ✓ (keep current content)
+   - Add: `con/duct` for execution tracing → *observability as provenance*
+   - Add: `tinuous` for CI logs → even build artifacts become referenceable
+
+5. **NEW: Beyond Code - The Unrealized YODA**
+   - Bash history capture in datasets
+   - Zoom/Teams recordings as subdatasets
+   - Lab meeting notes, presentations (this very slide deck!)
+   - ReproStim: video capture of actual stimuli shown
+   - Theme: *"Capture you must what actually happened, not what should have"*
+
+6. **Principle 2: Portable Environments** ✓ (keep)
+   - Enhance with ReproMan for orchestrating across HPC/cloud
+
+7. **Principle 3: Modular Composition** ✓ (keep core)
+   - **EXPAND significantly**: This is where BIDS comes in
+
+## Act III: Hierarchical Composition - BIDS as YODA Exemplar (NEW)
+*The breakthrough insight: BIDS is YODA at scale*
+
+8. **NEW: BIDS Studies as Composed Datasets**
+   ```
+   study/
+   ├── sub-01/                    (subdataset - raw data)
+   ├── sub-02/                    (subdataset)
+   ├── derivatives/
+   │   ├── mriqc/                 (subdataset - QC results)
+   │   ├── fmriprep/              (subdataset - preprocessed)
+   │   └── analysis-smith2025/    (subdataset - paper)
+   └── code/                      (stimuli, protocols)
+   ```
+   - Each level knows only what's below
+   - "Do not look up" enables independent reuse
+
+9. **NEW: Real Examples - OpenNeuroDerivatives**
+   - Show the actual structure (you have slides for this already around line 718)
+   - Highlight: ReproMan job specs capture *how* each derivative was created
+   - Each derivative = independent dataset but composed into study
+
+10. **NEW: From ReproIn to ReproFlow**
+    - **ReproIn**: MRI scanner → BIDS (automated)
+    - **ReproStim**: Capture actual stimuli presented
+    - **ReproMon**: Monitor acquisition in real-time
+    - **ReproMan**: Run processing pipelines
+    - **con/duct**: Trace execution details
+    - **tinuous**: Capture CI/CD artifacts
+    - Full provenance chain: scanner → preprocessing → analysis → paper
+
+## Act IV: The AI Frontier - Structure Enables Intelligence (NEW)
+*This is the novel contribution for ReproNim audience*
+
+11. **NEW: The Problem with Unstructured "RAG"**
+    - LLMs hallucinate
+    - Web content disappears
+    - No version control of sources
+    - No provenance
+    - Image: *"Depends on untracked web"* meme
+
+12. **NEW: Structured Collections as AI Substrate**
+    - **BIDS datasets** → AI can learn cross-study patterns
+    - **DataLad-Registry** → 10,000+ datasets with metadata
+    - **OpenNeuroStudies** → Aggregated structured knowledge
+    - But: Structure preserved, provenance intact
+
+13. **NEW: Condensed Frontiers with Deep Links**
+    ```
+    study-summary/                 (AI-generated, version controlled)
+    ├── README.md                  (NotebookLM summary)
+    ├── key-findings.md
+    ├── anomalies-detected.md
+    └── .datalad/
+        └── config                 (links to full subdatasets)
+    ```
+    - AI tools (NotebookLM, Claude, etc.) generate *summaries*
+    - Summaries are version controlled
+    - Links to source subdatasets preserved
+    - Can always drill down to raw data
+    - Theme: *"Surface you create, depth you preserve"*
+
+14. **NEW: Observability → Insight**
+    - `duct` traces + LLM analysis = "Why did this fail?"
+    - Bash history + code = "What actually ran?"
+    - Zoom recordings + meeting notes = "What was decided?"
+    - All timestamped, all linked, all version controlled
+
+15. **NEW: Agentic Workflows on YODA**
+    - AI agents can:
+      - Traverse subdatasets programmatically
+      - Generate derivative summaries
+      - Detect anomalies across studies
+      - Propose new analyses
+    - But: Always within YODA structure
+    - Agents don't replace structure, they leverage it
+    - Example (hypothetical syntax; `datalad run` has no `--agent` option): `datalad run --agent=claude "Summarize QC failures across all subjects"`
+
+## Act V: The Vision (NEW)
+
+16. **NEW: Every Lab, A YODA**
+    - Lab notebook → DataLad dataset
+    - Weekly meetings → Zoom recording subdataset + notes
+    - Every experiment → BIDS + ReproStim stimuli
+    - Every analysis → `datalad run` with `con/duct`
+    - Every paper → Subdataset linking to all of above
+
+17. **NEW: OpenNeuro-Scale Insights**
+    - When 1000 studies share structure:
+      - Cross-study meta-analysis becomes `datalad get` + `pandas merge`
+      - AI can learn "what normal looks like"
+      - Outliers auto-detected
+      - Replication attempts auto-verified
+    - But: No centralization required
+    - Each dataset remains independent, composable module
+
+18. **Take Home Messages** (Enhanced)
+    - YODA isn't just for data - it's for *everything digital*
+    - Structure doesn't constrain - it enables
+    - AI tools amplify structured collections, don't replace them
+    - Observability + reproducibility = trustworthy science
+    - Start small: version control your next meeting notes
+
+## New Slides to Create
+
+1. **Evolution of YODA slide** - timeline graphic
+2. **Unrealized YODA slide** - examples of digital artifacts we should capture
+3. **BIDS as YODA slide** - show hierarchy with "do not look up" arrows
+4. **ReproFlow ecosystem diagram** - all components connected
+5. **AI on unstructured vs structured slide** - contrasting approaches
+6. **Condensed frontiers diagram** - show AI summary layer with links down
+7. **Agentic workflows slide** - example commands
+8. **Vision slide** - "Every Lab, A YODA" graphic
+
+## Suggested Deletions/Condensations
+
+- Keep but condense: detailed git-annex stats (slide ~325) - just mention briefly
+- Keep but reduce: container families details (slide ~469) - focus on ReproNim/containers
+- Merge: some datalad-run examples are repetitive
diff --git a/pics/2026-repronim-YODA-BIDS-webinar-qrcode.png b/pics/2026-repronim-YODA-BIDS-webinar-qrcode.png
new file mode 100644
index 0000000..eaf6c77
--- /dev/null
+++ b/pics/2026-repronim-YODA-BIDS-webinar-qrcode.png
@@ -0,0 +1 @@
+/annex/objects/MD5E-s860--b7d485c308479377b875439655c6ee24.png
From 87120ecda18dc2bc0dadbec9b0ed2d9deef736cd Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Sun, 1 Feb 2026 21:43:51 -0500
Subject: [PATCH 3/8] Add frontier condensation concept to webinar notes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Elaborate on how modular composition creates "condensed frontiers" -
transformed, summarized, or extracted forms that are more appropriate for
downstream use while maintaining exact version-controlled links to source
materials.

Key insight: Each module/subdataset serves as both:
- A stopping point ("stopping the bleeding" of data/complexity)
- A usable interface for next level (condensed/transformed)
- A versioned link back to full source (reproducible)

Examples across domains:
- Neuroscience: TB of ephys recordings → spike trains (1000x smaller)
- Software: Source code → compiled binaries (platform-appropriate)
- BIDS: Multi-stage cascade (DICOM → BIDS → derivatives → paper)
- Data analysis: Individual measurements → summary statistics
- AI/ML: Full corpora → embeddings/indices
- Meetings: Video recordings → minutes
- Genomics: Full genomes → variant calls (1000x smaller)
- Dashboards: .tsv data → interactive visualizations

Pattern enables:
- Cognitive load reduction (work at appropriate level)
- Performance (smaller, transformed data)
- Reproducibility (exact source association via git hexsha)
- Flexibility (multiple frontiers from same source)
- Evolvability (regenerate as methods improve)

Anti-pattern: Orphaned frontiers without source links
Best practice: Version control both source and frontier as modules

Visual metaphor: "Surface you create, depth you preserve"

This concept integrates throughout Acts II-IV of the presentation.
---
 .../notes/frontier-condensation.md            | 451 ++++++++++++++++++
 1 file changed, 451 insertions(+)
 create mode 100644 2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md

diff --git a/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md b/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md
new file mode 100644
index 0000000..6498f48
--- /dev/null
+++ b/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md
@@ -0,0 +1,451 @@
+# Frontier Condensation: Modular Composition as Hierarchical Transformation
+
+## The Core Concept
+
+Modular composition in YODA isn't just about organizing files—it's about creating **condensed frontiers** at each level of the hierarchy. Each module/subdataset serves as:
+
+1. **A stopping point** ("stopping the bleeding" of data/complexity growth)
+2. **A transformation** (extraction, compilation, summarization)
+3. **A usable interface** for the next level up
+4. **A versioned link** back to the full source
+
+This pattern appears across all domains of computation and science, and YODA+version control makes it explicit and traceable.
+
+## The Pattern
+
+```
+Original Source (large, raw, detailed)
+        ↓
+  [Transformation]
+        ↓
+Condensed Frontier (smaller, processed, appropriate)
+        ↓
+  [Used by next level]
+```
+
+**Key insight**: The frontier is version-controlled and maintains exact association with its source, allowing bidirectional traversal:
+- **Forward**: Use the condensed form (efficient)
+- **Backward**: Retrieve exact original (reproducible)
+
+## Examples Across Domains
+
+### 1. Neuroscience - Electrophysiology
+
+**Source**: Terabytes of continuous high-sampling-rate recordings (30 kHz, weeks of data)
+
+**Transformation**: Spike detection and sorting
+
+**Frontier**:
+- Spike times (timestamps, milliseconds precision)
+- Cluster assignments (which neuron)
+- Average waveforms
+- **Size reduction**: 1000x-10000x smaller
+
+**YODA implementation**:
+```
+study/
+├── raw-ephys/               (subdataset: TB of continuous data)
+└── derivatives/
+    └── spike-sorted/        (subdataset: spike trains, MB-GB)
+        └── .datalad/
+            └── config       (links to exact raw-ephys hexsha)
+```
+
+Work with spike trains daily, drill down to raw traces when needed.
+
+### 2. Software Development - Compilation
+
+**Source**: Human-readable source code (text files, with comments, dependencies)
+
+**Transformation**: Compilation and linking
+
+**Frontier**:
+- Platform-specific binaries (x86_64, ARM, etc.)
+- Optimized for execution
+- Stripped of debug symbols (often)
+- **Size**: Often smaller, always more appropriate for deployment
+
+**YODA implementation**:
+```
+software-project/
+├── src/                     (subdataset: source code)
+├── dependencies/            (subdatasets: libraries)
+└── releases/
+    └── v1.2.3/              (subdataset: compiled binaries)
+        └── .datalad/
+            └── config       (links to exact src hexsha)
+```
+
+Distribute binaries, but can always rebuild from exact source.
+
+### 3. Neuroimaging - BIDS Pipeline
+
+**Multi-stage transformation cascade:**
+
+**Stage 1: DICOM → BIDS**
+- Source: Scanner raw (DICOM, proprietary formats, ~GB per subject)
+- Frontier: Organized BIDS (NIfTI, JSON sidecars)
+- Reduction: Similar size, but structured and standardized
+
+**Stage 2: BIDS → Preprocessed**
+- Source: Raw BIDS
+- Frontier: Preprocessed (fMRIPrep, MRIQC outputs)
+- Transformation: Motion correction, normalization, quality metrics
+- Reduction: Similar size, but ready for analysis
+
+**Stage 3: Preprocessed → Analysis**
+- Source: Preprocessed volumes
+- Frontier: Statistical maps, ROI timecourses
+- Reduction: Massive (GB → MB)
+
+**Stage 4: Analysis → Publication**
+- Source: All statistical maps
+- Frontier: Select figures and tables
+- Reduction: Extreme (GB → KB)
+
+**YODA implementation**:
+```
+study/
+├── inputs/
+│   └── raw-bids/            (subdataset: DICOM → BIDS)
+├── derivatives/
+│   ├── mriqc/               (subdataset: QC metrics)
+│   ├── fmriprep/            (subdataset: preprocessed)
+│   └── analysis-2025/       (subdataset: stats)
+└── paper/                   (subdataset: figures, manuscript)
+    └── .datalad/
+        └── config           (links to all source subdatasets)
+```
+
+Paper cites exact versions of all upstream data.
+
+### 4. Data Analysis - Summary Statistics
+
+**Source**:
+- Individual subject measurements (1000s of subjects × 100s of variables)
+- Full timeseries data
+- Raw questionnaires
+
+**Transformation**: Aggregation and statistical analysis
+
+**Frontier**:
+- Summary statistics (means, SDs, correlations)
+- Demographic tables
+- Group-level plots
+- **Size reduction**: 10000x typical
+
+**Example**:
+- Source: 5000 subjects × 200 brain regions × 1200 timepoints = 1.2 billion datapoints
+- Frontier: Correlation matrix (200×200) + summary stats = ~40k values
+
+**YODA implementation**:
+```
+cohort-study/
+├── subjects/
+│   ├── sub-0001/            (subdataset)
+│   ├── sub-0002/            (subdataset)
+│   └── ...
+└── group-analysis/          (subdataset)
+    ├── summary-stats.tsv    (condensed frontier)
+    ├── figures/
+    └── .datalad/
+        └── config           (links to all subject subdatasets)
+```
+
+### 5. AI/ML - Embeddings and Indices
+
+**Source**: Full text corpus (billions of tokens, TB)
+
+**Transformation**: Embedding generation, vector indexing
+
+**Frontier**:
+- Vector embeddings (dense representations)
+- Similarity indices
+- **Size**: Often smaller, always faster to query
+
+**Example**:
+- Source: 1M documents, 500 words each = 500M tokens
+- Frontier: 1M × 768-dim embeddings = ~3GB dense vectors
+- Query time: milliseconds vs. hours
+
+**YODA implementation**:
+```
+knowledge-base/
+├── documents/               (subdataset: full text)
+└── embeddings/              (subdataset: vectors)
+    ├── index.faiss
+    ├── metadata.json        (which model, when created)
+    └── .datalad/
+        └── config           (links to exact documents hexsha)
+```
+
+When embedding model improves, regenerate from same source.
+
+### 6. Meeting Documentation
+
+**Source**:
+- Full Zoom recording (hours, GB)
+- Complete audio transcript
+- Chat logs
+- Shared screens
+
+**Transformation**: Human curation + AI assistance
+
+**Frontier**:
+- Meeting minutes (key decisions, action items)
+- Summary document
+- **Size reduction**: ~1,000,000x (2 GB video → 2 KB notes)
+
+**YODA implementation**:
+```
+lab-meetings/
+├── 2026-02-01/
+│   ├── recording/           (subdataset: video, audio)
+│   │   ├── zoom-recording.mp4
+│   │   └── transcript.txt
+│   └── notes/               (subdataset: summary)
+│       ├── minutes.md       (frontier)
+│       └── .datalad/
+│           └── config       (links to recording subdataset)
+```
+
+Read minutes daily, watch recording when clarification needed.
+
+### 7. Genomics - Reference to Variants
+
+**Source**: Full genomes (3 billion base pairs per individual, ~200 GB BAM files)
+
+**Transformation**: Variant calling against reference
+
+**Frontier**:
+- VCF files (only differences from reference)
+- **Size reduction**: 1000x (200 GB → 200 MB)
+
+**YODA implementation**:
+```
+genomics-study/
+├── reference/               (subdataset: reference genome)
+├── alignments/
+│   └── sample-001.bam       (subdataset: full alignment)
+└── variants/
+    └── sample-001.vcf       (subdataset: condensed frontier)
+        └── .datalad/
+            └── config       (links to alignments + reference)
+```
+
+### 8. Literature - From Papers to Knowledge Graphs
+
+**Source**: Full-text PDFs (millions of papers, TB)
+
+**Transformation**: Entity extraction, relationship mining
+
+**Frontier**:
+- Knowledge graph (entities + relationships)
+- Citation network
+- **Representation**: More queryable, interconnected
+
+**YODA implementation**:
+```
+literature-corpus/
+├── pdfs/                    (subdataset: full papers)
+├── fulltext/                (subdataset: extracted text)
+└── knowledge-graph/         (subdataset: structured knowledge)
+    ├── entities.jsonld
+    ├── relationships.ttl
+    └── .datalad/
+        └── config           (links to pdfs, fulltext)
+```
+
+### 9. Simulation - From Runs to Figures
+
+**Source**: Massive simulation output (timesteps, spatial grids, TB)
+
+**Transformation**: Post-processing, visualization
+
+**Frontier**:
+- Key summary metrics over time
+- Select visualizations
+- **Size reduction**: 10000x typical
+
+**Example**: Climate model
+- Source: 100 years × 365 days × 24 hours × global grid = PB
+- Frontier: Annual averages for key regions = GB
+
+**YODA implementation**:
+```
+climate-model/
+├── runs/
+│   └── scenario-RCP85/      (subdataset: raw output)
+└── analysis/
+    └── temperature-trends/  (subdataset: condensed)
+        ├── regional-means.nc
+        └── .datalad/
+            └── config       (links to runs)
+```
+
+## The YODA+Git Advantage
+
+### 1. Exact Association
+Every frontier knows its exact source via git hexsha:
+```bash
+# Which exact version of the derivative does the study record?
+git submodule status derivatives/spike-sorted
+# a3f9d1c2... derivatives/spike-sorted (v1.2.0)
+
+# What raw data version does it link to?
+cd derivatives/spike-sorted
+git submodule status
+# 7b8e4a5d... inputs/raw-ephys (heads/main)
+```
+
+### 2. Reproducible Regeneration
+If processing improves, regenerate frontier from **exact same source**:
+```bash
+# Update processing code
+git commit -m "Improved spike detection algorithm"
+
+# Regenerate from exact same raw data
+datalad run -m "Reprocess with improved algorithm" \
+  --input raw-ephys \
+  --output derivatives/spike-sorted-v2 \
+  spike_sort_v2.py
+```
+
+### 3. Multiple Frontiers from Same Source
+Different use cases need different condensations:
+```
+raw-ephys/                   (source: TB)
+├── spike-sorted/            (frontier 1: spike trains, MB)
+├── lfp-filtered/            (frontier 2: LFP bands, GB)
+└── spectrograms/            (frontier 3: power spectra, GB)
+```
+
+All link to same source, all independently version controlled.
+
+### 4. Frontier of Frontiers
+Meta-analyses work on condensed data:
+```
+meta-analysis/
+├── study-001/summary-stats/ (frontier of a frontier)
+├── study-002/summary-stats/
+└── aggregated/              (frontier of frontiers)
+```
+
+Can still drill down to individual subject in study-001.
+
+## Dashboard Pattern Revisited
+
+Dashboards are **visualization frontiers**:
+
+**Source**: .tsv files, JSON metadata (YODA-structured data)
+
+**Transformation**: Web rendering, interactive visualization
+
+**Frontier**: Dashboard UI (ephemeral, regenerable)
+
+**Key**: Dashboard is not the source of truth
+- Data remains version-controlled
+- Dashboard can be regenerated anytime
+- Multiple dashboards can visualize same data
+
+**Example: Nipoppy**
+```
+study/
+├── tabular/
+│   ├── demographics.tsv         (data: version controlled)
+│   ├── processing-status.tsv    (data: version controlled)
+│   └── qc-metrics.tsv           (data: version controlled)
+└── dashboards/
+    └── neurobagel-config.json   (points to tabular/)
+    # Dashboard runs separately, reads from tabular/
+```
+
+## AI-Generated Frontiers
+
+With LLMs, we can create new types of condensed frontiers:
+
+### NotebookLM-style Summaries
+**Source**: Full dataset with README, code, results
+
+**Transformation**: LLM analysis
+
+**Frontier**:
+- Executive summary
+- Key findings document
+- FAQ about the dataset
+
+**YODA pattern**:
+```bash
+datalad run -m "Generate AI summary with NotebookLM" \
+  --input . \
+  --output ai-summary/README-summary.md \
+  generate_summary.py --model=notebooklm
+```
+
+The summary is version controlled and links to exact dataset version.
+
+### Anomaly Detection Frontiers
+**Source**: 1000s of QC metrics across subjects
+
+**Transformation**: AI-based outlier detection
+
+**Frontier**:
+- List of flagged subjects
+- Anomaly explanations
+- Confidence scores
+
+**Regenerable**: As models improve, re-scan same data.
+
+## Benefits of Frontier Condensation
+
+1. **Cognitive Load Reduction**: Work with appropriate level of detail
+2. **Performance**: Query/process smaller, transformed data
+3. **Sharing**: Distribute condensed forms, source on demand
+4. **Privacy**: Share aggregates while controlling access to raw data
+5. **Specialization**: Different teams work at different frontiers
+6. **Reproducibility**: Exact links enable verification
+7. **Flexibility**: Multiple frontiers from same source
+8. **Evolvability**: Regenerate as methods improve
+
+## Anti-Pattern: Orphaned Frontiers
+
+Without version control + modularity:
+
+```
+❌ final_analysis_v3.xlsx (source unknown)
+❌ paper_figure_2_revised.png (which data? which code?)
+❌ summary_stats_march.csv (from which subjects?)
+```
+
+**Problems**:
+- Can't verify results
+- Can't reproduce with updated methods
+- Can't trace back to source
+- Frontiers become truth (dangerous!)
+
+## Best Practices
+
+1. **Always link frontier to source** (subdataset references)
+2. **Document transformation** (`datalad run` or BEP028 provenance)
+3. **Version control both** (source and frontier as separate modules)
+4. **Make frontiers regenerable** (script the transformation)
+5. **Test bidirectional traversal** (can you get from frontier to source and back?)
+6. **Don't delete sources** (storage is cheap, lost provenance isn't)
+7. **Multiple frontiers OK** (different condensations for different uses)
+8. **Dashboards consume, don't own** (data in version control, viz is ephemeral)
+
+## Slides Integration
+
+This concept should appear in:
+
+1. **Act II**: When discussing modularity and "do not look up"
+2. **Act III**: BIDS as cascade of frontiers (DICOM → BIDS → derivatives → paper)
+3. **Act IV**: AI summaries as condensed frontiers with deep links
+4. **Examples throughout**: Every real-world case is frontier condensation
+
+**Visual metaphor**:
+- Iceberg: Tip (frontier) is visible/usable, mass below (source) is accessible
+- Pyramid: Condensed summit, detailed base, but connected
+- Tree: Leaves (frontiers) are visible, roots (sources) provide depth
+
+**Yoda wisdom**: *"Surface you create, depth you preserve"*
From 47cafda53854f3e82a3bf066f339dc8278a8fe62 Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Mon, 2 Feb 2026 10:10:46 -0500
Subject: [PATCH 4/8] Enhance frontier condensation with concrete real-world examples
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Software section:
- Add NeuroDebian as example of source → package transformation
- Add reproducible-builds.org for bit-identical binaries
- Add snapshot.debian.org (~20PB) as non-git archival approach
- Emphasize pattern is universal, not DataLad/git-specific

Literature section (complete rewrite):
- Replace generic example with DANDI Archive citation workflow
- Detail dandi-bib: metadata → BibTeX/RIS/Zotero (daily automation)
- Detail citations-collector: DOI → citation discovery (WiP)
  - 8 citation types (Publication, Preprint, Protocol, etc.)
  - 11 relationship types (Cites, Uses, IsDocumentedBy, etc.)
- Show multi-layer frontier condensation in action
- Zotero as "dashboard" - regenerable view of version-controlled data

Meetings section:
- Add real-world practice of maintaining local Zoom archive
- Emphasize reusable resource (decisions, training, quotes)
- Storage cheap, context priceless

Universal pattern section:
- New section: "The Pattern is Universal, Not Tool-Specific"
- Compare git/DataLad, snapshot.debian.org, container registries,
  data repositories, academic citations
- Emphasize principles over tools: explicit linking, retrievability,
  versioning, automation, modularity
- Message: "Pattern is ancient, tools evolve—embrace principles"

All examples now concrete, traceable projects with URLs.
--- .../notes/frontier-condensation.md | 147 +++++++++++++++--- 1 file changed, 128 insertions(+), 19 deletions(-) diff --git a/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md b/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md index 6498f48..5a3b688 100644 --- a/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md +++ b/2026-repronim-YODA-BIDS-webinar/notes/frontier-condensation.md @@ -65,7 +65,29 @@ Work with spike trains daily, drill down to raw traces when needed. - Stripped of debug symbols (often) - **Size**: Often smaller, always more appropriate for deployment -**YODA implementation**: +**Real-world examples**: + +**NeuroDebian** (http://neuro.debian.net): +- Source: Upstream neuroimaging software repositories +- Frontier: Debian packages (.deb) for easy installation +- Maintains links to exact source versions +- Distribution across Debian releases (stable, testing, unstable) + +**Reproducible Builds** (https://reproducible-builds.org/): +- Takes source → binary transformation to next level +- Goal: Bit-for-bit identical binaries from same source +- Eliminates timestamps, build paths, randomness +- Enables verification: different builders get identical results +- **Frontier condensation + verifiability** + +**Snapshot Archive** (https://snapshot.debian.org/): +- Different approach to preservation (not git-based) +- Archives every version of every Debian package +- ~20PB of historical builds +- Can retrieve exact binary from specific timestamp +- **Shows pattern is universal, not DataLad/git-specific** + +**YODA/DataLad implementation**: ``` software-project/ ├── src/ (subdataset: source code) @@ -76,7 +98,7 @@ software-project/ └── config (links to exact src hexsha) ``` -Distribute binaries, but can always rebuild from exact source. +Distribute binaries, but can always rebuild from exact source. Reproducible-builds ensures same source → same binary. ### 3. 
Neuroimaging - BIDS Pipeline @@ -196,6 +218,17 @@ When embedding model improves, regenerate from same source. - Summary document - **Size reduction**: 1000x (2 GB video → 2 KB notes) +**Real-world practice**: +- Maintain local archive of all Zoom meetings with recordings +- Becomes **reusable resource** for: + - Clarifying decisions made months ago + - Training new team members + - Extracting quotes for papers/grants + - Understanding evolution of ideas +- Most common use: minutes +- Occasional use: full video review +- **Frontier (minutes) + Source (video) both valuable** + **YODA implementation**: ``` lab-meetings/ @@ -209,7 +242,7 @@ lab-meetings/ │ └── config (links to recording subdataset) ``` -Read minutes daily, watch recording when clarification needed. +Read minutes daily, watch recording when clarification needed. Archive never deleted—storage cheap, context priceless. ### 7. Genomics - Reference to Variants @@ -233,29 +266,55 @@ genomics-study/ └── config (links to alignments + reference) ``` -### 8. Literature - From Papers to Knowledge Graphs +### 8. 
Literature - From Papers to Citations Database -**Source**: Full-text PDFs (millions of papers, TB) +**Real-world example: DANDI Archive Citations** -**Transformation**: Entity extraction, relationship mining +**Project**: https://github.com/dandi/dandi-bib + https://github.com/con/citations-collector -**Frontier**: -- Knowledge graph (entities + relationships) -- Citation network -- **Representation**: More queryable, interconnected +**Source Layer 1**: DANDI Archive metadata +- 1000s of neuroscience datasets +- Dataset DOIs, descriptions, contributors -**YODA implementation**: +**Frontier 1**: Structured bibliography +- **Transformation**: Daily GitHub Actions workflow (3:22 AM UTC) +- Fetch metadata from DANDI API → Generate BibTeX/RIS files +- Sync to public Zotero collection +- **Output**: `dandi.bib`, `dandi.ris`, Zotero library +- Use case: Easy citation of DANDI datasets + +**Source Layer 2**: Academic citation databases +- OpenAlex, DataCite, CrossRef, OpenCitations + +**Frontier 2**: Citation discovery (WiP) +- **Transformation**: `citations-collector` queries with dataset DOIs +- Discovers papers citing DANDI datasets +- Merges preprint/published versions +- Classifies citation types (8 types defined in schema): + - Publication, Preprint, Protocol, Thesis, Book, Software, Dataset, Other +- Classifies citation relationships (11 types): + - Cites, CitesAsDataSource, Uses, IsDocumentedBy, Reviews, etc. 
+- Fetches open-access PDFs +- **Output**: TSV file + Zotero subcollection +- Use case: Understand dataset impact, discover related work + +**Schema**: https://github.com/con/citations-collector/blob/master/schema/citations.yaml + +**Workflow visualization**: https://github.com/dandi/dandi-bib#workflow + +**YODA-style organization**: ``` -literature-corpus/ -├── pdfs/ (subdataset: full papers) -├── fulltext/ (subdataset: extracted text) -└── knowledge-graph/ (subdataset: structured knowledge) - ├── entities.jsonld - ├── relationships.ttl - └── .datalad/ - └── config (links to pdfs, fulltext) +dandi-bib/ +├── scripts/ (transformation code) +├── outputs/ +│ ├── dandi.bib (frontier: bibliography) +│ ├── dandi.ris +│ └── citations.tsv (frontier: citation graph) +└── .github/workflows/ (automation: daily updates) ``` +**Key insight**: Automated transformation from archive metadata → structured citations → queryable knowledge. Zotero serves as "dashboard" - regenerable view of version-controlled data. + ### 9. Simulation - From Runs to Figures **Source**: Massive simulation output (timesteps, spatial grids, TB) @@ -333,6 +392,56 @@ meta-analysis/ Can still drill down to individual subject in study-001. 
+## The Pattern is Universal, Not Tool-Specific + +**Critical insight**: Frontier condensation is a **general pattern**, not limited to DataLad/git: + +### Different Approaches to the Same Pattern + +**Git/DataLad approach**: +- Explicit submodule references (hexsha) +- Distributed, federated +- Local clone contains full history + +**Debian snapshot.debian.org approach**: +- Centralized archive of all package versions +- ~20PB of historical builds +- Timestamp-based retrieval +- Different mechanism, same goal: exact source retrieval + +**Container registries (Docker Hub, etc.)**: +- Layer-based content addressing +- Tag → specific image hash +- Multi-stage builds as frontier cascade + +**Data repositories (Zenodo, Figshare)**: +- DOI → specific version +- Can't update published versions (new DOI instead) +- Less granular than git, but same principle + +**Academic citations**: +- DOI/ISBN as "permanent" identifier +- Reference list links papers together +- Different granularity than code, same graph structure + +### What Matters: The Principles + +1. **Explicit linking**: Frontier knows its source (hexsha, DOI, timestamp, hash) +2. **Retrievability**: Can get exact source when needed +3. **Versioning**: Changes create new versions, don't overwrite +4. **Automation**: Transformations are scripted/repeatable +5. **Modularity**: Components can be used independently + +**YODA/DataLad advantages**: +- Fine-grained (file-level) tracking +- Distributed (no central chokepoint) +- Unified interface across domains +- Computational provenance (`datalad run`) + +**But**: The conceptual pattern predates and transcends any specific tool. + +**Message for slides**: "Pattern is ancient, tools evolve—choose what fits your domain, but embrace the principles." 
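The five principles above do not depend on git. As a minimal sketch, assuming GNU coreutils (`sha256sum`) is available, explicit linking and later verification can be illustrated in plain shell. The file names (`source.csv`, `frontier.txt`) and the `.source.sha256` sidecar are hypothetical conventions chosen for illustration; none of the tools named in these notes prescribes them.

```sh
#!/bin/sh
# Sketch only: file names and the .source.sha256 sidecar are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Source: detailed data (stand-in for a large raw file).
printf 'subject,value\nsub-01,3\nsub-02,5\nsub-03,7\n' > source.csv

# Transformation: condense the value column to a single total.
awk -F, 'NR > 1 { total += $2 } END { print total }' source.csv > frontier.txt

# Explicit link: record which exact source content produced the frontier.
sha256sum source.csv > frontier.txt.source.sha256

# Verification: succeeds only while the source is byte-identical to the record.
sha256sum -c frontier.txt.source.sha256
```

If `source.csv` is modified later, `sha256sum -c` reports a mismatch and exits non-zero, signalling that the frontier no longer corresponds to the source at hand. Git and DataLad record the same kind of association through commit hexshas rather than a sidecar file.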
+ ## Dashboard Pattern Revisited Dashboards are **visualization frontiers**: From 64fa4ea205431a186645a0571f1c992c802f06db Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Mon, 2 Feb 2026 10:12:15 -0500 Subject: [PATCH 5/8] Add integration plan for frontier condensation concept MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Document how to weave frontier condensation throughout all 5 acts: - Act I: Introduce concept with Principle 3 (modular composition) - Act II: Show in practice (ReproFlow, tools, dashboards) - Act III: BIDS as 4-stage frontier cascade - Act IV: AI as frontier generator (structured vs unstructured) - Act V: Universal pattern across domains Key reframings: - BIDS pipeline = cascade of frontiers (DICOM → derivatives → paper) - Dashboards = visualization frontiers (consume, don't own data) - AI summaries = version-controlled frontiers with source links - Tools comparison = different condensation strategies Visual motif: Two-layer diagrams (frontier ⇅ source) Terminology: Surface/depth, frontier/source, condensation/link New slides proposed: - Frontier condensation pattern intro - Software example: NeuroDebian + reproducible-builds + snapshot - Literature example: dandi-bib workflow - Meeting archives as resource - Universal pattern comparison table Narrative thread: 'Surface you create, depth you preserve' Questions for refinement discussion tomorrow morning. 
--- .../frontier-condensation-integration.md | 325 ++++++++++++++++++ 1 file changed, 325 insertions(+) create mode 100644 2026-repronim-YODA-BIDS-webinar/planning/frontier-condensation-integration.md diff --git a/2026-repronim-YODA-BIDS-webinar/planning/frontier-condensation-integration.md b/2026-repronim-YODA-BIDS-webinar/planning/frontier-condensation-integration.md new file mode 100644 index 0000000..ec6d360 --- /dev/null +++ b/2026-repronim-YODA-BIDS-webinar/planning/frontier-condensation-integration.md @@ -0,0 +1,325 @@ +# Integrating Frontier Condensation Throughout the Presentation + +## Core Message Shift + +**Previous framing**: YODA as organizational principles + tools +**New framing**: YODA enables **hierarchical transformation through condensed frontiers** + +Every module/subdataset is: +1. A stopping point for complexity growth +2. A transformation to more appropriate form +3. A usable interface (frontier) with links to full source (depth) + +**Tagline**: *"Surface you create, depth you preserve"* + +## How Frontier Condensation Appears in Each Act + +### Act I: YODA Foundation +*Introduce the concept* + +**Slide: "YODA Principle 3: Modular Composition" (expand)** +- Current: Just shows submodule hierarchy +- **Add**: Each level is a **condensed frontier** + - Not just organization + - Active transformation + - Size reduction + format appropriateness + - Examples: TB → GB → MB → KB + +**Visual metaphor options**: +- Iceberg: Visible tip (frontier), accessible depth (source) +- Pyramid: Condensed summit, detailed base, connected +- Telescope: Zoom in (drill down) or out (work at frontier) + +**Key message**: "Modular ≠ just folders. Modular = hierarchy of transformations." + +### Act II: Execution & Workflows (SciOps) +*Show frontier condensation in practice* + +**Reorganize around transformation stages:** + +#### 1. 
Data Acquisition Frontiers +- **ReproIn**: DICOM (scanner raw) → BIDS (structured) + - Same data, more appropriate format + - Metadata extraction as frontier creation +- **ReproStim**: Video stream → event-synchronized recordings + - Capture what actually happened (not just planned) + - Original stimulus files + actual presentation record + +#### 2. Processing Frontiers +- **BIDS → Derivatives**: fMRIPrep, MRIQC + - Multi-hour scans → quality metrics (seconds to review) + - Volume-by-volume → summary statistics + - **Dashboard pattern**: Nipoppy tracker .tsv → Neurobagel visualization + - Data (frontier) = version controlled + - Dashboard = regenerable view + +#### 3. Provenance Frontiers +- **BEP028**: Execution details → structured prov records + - Full logs → key Activities/Entities/Agents + - Machine-readable, queryable +- **con/duct**: System calls → execution graph + - Thousands of operations → flame graph visualization + - Raw JSON → plotted insights + +#### 4. Workflow Tool Comparison (Frontier Condensation Strategies) + +**Table/comparison slide:** + +| Tool | Source | Frontier | Scale | Approach | +|------|---------|----------|-------|----------| +| **BABS** | Raw BIDS | Derivatives | 1000s subjects, HPC | FAIRly big + DataLad | +| **Nipoppy** | Multi-modal raw | BIDS + phenotype .tsv | Clinical integration | DataLad + Dashboard | +| **BIDS-flux** | Multi-site data | Harmonized datasets | Federated | GitLab + MinIO | +| **ReproMan** | Any compute | Job specs + results | Cross-platform | Environment orchestration | + +**Common pattern**: All create frontiers while maintaining source links. 
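The dashboard pattern listed under Processing Frontiers (tabular data under version control, dashboard as a regenerable view) can be sketched in plain shell. This is an illustration under stated assumptions: the column names and status values are hypothetical and are not taken from Nipoppy's or Neurobagel's actual schemas.

```sh
#!/bin/sh
# Sketch only: column names and status values are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Data layer: the tabular file that would live under version control.
printf 'participant_id\tpipeline\tstatus\n'  > processing-status.tsv
printf 'sub-01\tfmriprep\tsuccess\n'        >> processing-status.tsv
printf 'sub-02\tfmriprep\tfail\n'           >> processing-status.tsv
printf 'sub-03\tfmriprep\tsuccess\n'        >> processing-status.tsv

# View layer: a throwaway per-status count derived from the data.
regenerate_view() {
    awk -F'\t' 'NR > 1 { count[$3]++ }
                END    { for (s in count) printf "%s\t%d\n", s, count[s] }' \
        processing-status.tsv | sort > dashboard-summary.tsv
}

regenerate_view
cat dashboard-summary.tsv
```

Because the view is rebuilt from the TSV on every call, only the TSV needs to be committed; the summary, or a full dashboard in its place, can be deleted and regenerated at any time.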
+ +### Act III: Hierarchical Composition (BIDS as YODA Exemplar) +*BIDS as cascade of frontiers* + +**Slide: "BIDS Pipeline as Frontier Cascade"** + +Show 4-stage transformation: +``` +Scanner raw (DICOM, GB/subject) + ↓ ReproIn +BIDS (NIfTI+JSON, GB/subject) ← Frontier 1 + ↓ fMRIPrep/MRIQC +Derivatives (processed, GB/subject) ← Frontier 2 + ↓ Analysis +Statistical maps (MB/subject) ← Frontier 3 + ↓ Publication +Figures & tables (KB) ← Frontier 4 +``` + +**Each arrow is**: +- Version-controlled transformation (`datalad run` or BEP028) +- Size/complexity reduction +- Format change for next use +- Reversible (can drill down) + +**OpenNeuroDerivatives example**: +- Each derivative = frontier +- Each maintains link to raw BIDS (source) +- Multiple analysis groups can create competing frontiers +- Same source, different transformations, all preserved + +### Act IV: AI Frontier (Structure Enables Intelligence) +*AI tools as frontier generators* + +**Slide: "AI-Generated Frontiers"** + +**Traditional problem**: +``` +Papers (unstructured) → LLM → Summary (unverifiable) + ↓ + Hallucinations, no links +``` + +**Structured approach**: +``` +Structured corpus (BIDS, DANDI, papers+metadata) + ↓ AI transformation (NotebookLM, LLMs) +Version-controlled summary ← Frontier + ├── README-summary.md (condensed) + └── .datalad/config (links to source) +``` + +**Examples**: + +1. **Literature: dandi-bib** + - Source: DANDI metadata (API) + - AI step: Citation discovery via citations-collector + - Query OpenAlex, CrossRef, DataCite + - Classify citation types (8 types) + - Merge preprint/published versions + - Frontier: citations.tsv + Zotero collection + - Use: Understand dataset impact, find related work + +2. **Quality Control** + - Source: 1000s of MRIQC metrics (.tsv files) + - AI step: Anomaly detection + - Frontier: Flagged subjects with explanations + - Regenerable as models improve + +3. 
**Meeting Documentation** + - Source: Zoom recordings (GB, archived locally) + - AI step: Whisper transcription + summarization + - Frontier: Minutes with timestamps (KB) + - Reusable resource: Training, quotes, decision archaeology + +**Key difference from RAG**: +- RAG: Query web → synthesize → ephemeral answer +- Structured: Query local corpus → generate frontier → version control → verify + +**Slide: "Observability → Insight"** +- duct traces + LLM = "Why did pipeline fail?" +- Bash history + code + LLM = "What actually ran?" +- Zoom + notes + LLM = "What was decided in Q3 about X?" + +All possible because sources are preserved, frontiers are regenerable. + +### Act V: The Vision +*Universal frontier condensation* + +**Slide: "Every Lab, A Frontier Factory"** + +``` +Lab operation Frontier condensation +───────────────── ───────────────────── +Daily meetings → Minutes (KB) + Archive (GB) +Experiments → BIDS (GB) + Summaries (MB) +Analysis → Figures (KB) + Full results (GB) +Literature review → Annotated bib + PDF archive +Grant writing → Submitted grant + Research notes +Teaching → Lecture slides + Full recordings +``` + +**Every activity**: +- Creates usable frontier (work at this level) +- Preserves full source (drill down when needed) +- Version controlled (reproducible) +- Transformations automated (SciOps principles) + +**Slide: "Pattern Transcends Tools"** + +Show comparison table: + +| Domain | Tool/Approach | Pattern | +|--------|---------------|---------| +| Code | Git + submodules | Frontier via modularity | +| Software | snapshot.debian.org | Frontier via timestamps | +| Papers | DOI + references | Frontier via citations | +| Data | DataLad + subdatasets | Frontier via git-annex | +| Containers | Docker + layers | Frontier via content addressing | +| Meetings | Zoom + minutes | Frontier via summarization | + +**Message**: "Choose tools that fit your domain, but embrace the principles." + +**Principles**: +1. 
Explicit linking (hexsha, DOI, timestamp, hash) +2. Retrievability (can get exact source) +3. Versioning (new versions, don't overwrite) +4. Automation (scripted transformations) +5. Modularity (independent components) + +## Visual Consistency Throughout + +**Recommended visual motif**: Two-layer diagrams + +``` +[Frontier Layer] ← Small, fast, usable + ⇅ +[Source Layer] ← Large, detailed, preserved +``` + +Use this pattern in: +- Software compilation slides +- BIDS pipeline slides +- AI summary slides +- Dashboard explanation slides +- Every example + +**Color coding**: +- Frontier: Green (ready to use) +- Source: Blue (preserved depth) +- Transformation: Orange arrows +- Links: Dashed lines (bidirectional) + +## Key Terminology Consistency + +**Use these terms consistently**: + +- **Frontier**: The condensed, transformed, usable form +- **Source**: The detailed, raw, original form +- **Condensation**: The transformation process +- **Link**: The version-controlled association (hexsha, DOI, etc.) 
+- **Drill down**: Going from frontier to source +- **Surface**: Work at the frontier level +- **Depth**: Preserved source accessibility + +**Avoid**: +- "Derived data" (too passive) +- "Processed data" (unclear what happened) +- "Summary" (too limited - transformations aren't always summarization) + +## Recommended Slide Additions + +### New slide: "Frontier Condensation Pattern" (Act II intro) +- Define the concept +- Show the basic diagram +- List examples across domains +- "This pattern appears everywhere YODA succeeds" + +### New slide: "Software Example: Reproducible Builds" (Act II) +- NeuroDebian: source → .deb packages +- reproducible-builds.org: Bit-identical binaries +- snapshot.debian.org: 20PB archive (different approach, same principle) +- "Pattern transcends tools" + +### New slide: "Literature Example: DANDI Citations" (Act IV) +- Show dandi-bib workflow diagram +- Metadata → BibTeX/RIS → Zotero (automated daily) +- DOI → citation discovery → classified relationships +- "From archive to knowledge graph" + +### New slide: "Meeting Archives as Resource" (Act V) +- Personal practice: Archive all Zoom recordings +- Use cases: Decision archaeology, training, quotes +- Frontier (minutes) used daily, source (video) used occasionally +- "Storage cheap, context priceless" + +### Enhanced slide: "Dashboard ≠ Data" (Act II) +- Current: Mention dashboards +- **Enhance**: Show data layer vs. 
visualization layer +- Nipoppy: .tsv (data, version controlled) → Neurobagel (viz, regenerable) +- "Dashboards consume frontiers, don't own them" + +## Narrative Thread + +**Opening** (Act I): "YODA enables hierarchical transformation" + +**Middle** (Act II-III): "Every tool creates frontiers while preserving sources" + +**Climax** (Act IV): "AI amplifies structured frontiers, doesn't replace them" + +**Resolution** (Act V): "Pattern is universal—embrace principles, choose tools" + +**Closing**: "Surface you create, depth you preserve—this is the YODA way" + +## Questions for Tomorrow's Refinement + +1. Should we make "frontier condensation" an explicit title for a slide section? +2. How much to emphasize non-DataLad examples (snapshot.debian.org, etc.)? +3. Should the meeting archive practice be personal anecdote or general recommendation? +4. Do we need a "frontier condensation anti-patterns" slide? +5. Should we create a visual "frontier condensation cheatsheet" as final slide? + +## Integration with Existing Content + +**Keep from original slides**: +- YODA principles overview (enhance Principle 3) +- datalad run examples (frame as "frontier generation") +- Container slides (frame as "environment frontiers") +- OpenNeuroDerivatives examples (perfect frontier cascade) + +**Reduce/merge from original slides**: +- Detailed git-annex stats (mention briefly) +- Multiple datalad run examples (consolidate) +- Container technology details (focus on concept) + +**Add new**: +- Frontier condensation intro slide +- Software examples (NeuroDebian, reproducible-builds, snapshot) +- Literature examples (dandi-bib, citations-collector) +- Meeting archive example +- Universal pattern comparison table +- AI-generated frontiers section + +## Backup Slides (if time) + +- Deep dive: BEP028 provenance structure +- Deep dive: citations-collector schema (8 types, 11 relationships) +- Deep dive: con/duct execution graphs +- Case study: OpenNeuroDerivatives workflow +- Tutorial: 
Creating your first frontier with datalad run From 13c35df64183502e23eaa90c97e0497c574218ca Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Mon, 2 Feb 2026 10:13:28 -0500 Subject: [PATCH 6/8] Add summary document for tomorrow's refinement session Comprehensive overview of current status and tomorrow's agenda: Status: - Presentation header updated and committed - Materials organized in YODA-compliant structure - Research complete on BEP028, BABS, Nipoppy, BIDS-flux, etc. - Frontier condensation concept developed and documented Key breakthrough: - Frontier condensation = hierarchical transformation pattern - Each module: stopping point, transformation, usable interface, linked source - Tagline: 'Surface you create, depth you preserve' - Unifies YODA, BIDS, ReproFlow, dashboards, AI under one framework Tomorrow's agenda: 1. Review/refine frontier condensation concept 2. Decide presentation structure (explicit theme vs. woven throughout) 3. Prioritize new slides (high/medium/low) 4. Content decisions (keep/reduce/enhance) 5. Visual design (two-layer diagrams) 6. Time allocation (~45 min webinar) Questions to resolve: - Terminology: 'frontier condensation' or alternative? - Emphasis: DataLad-specific vs. universal pattern? - Depth: Tool details vs. conceptual overview? - Personal anecdotes: Include Zoom archive practice? - Slide count: Realistic for Feb 6 deadline? 
Resources ready: 4 planning docs, all references documented Timeline: 4 days to Feb 6 (realistic but tight) Differentiator: YODA as transformation framework, not just organization --- .../planning/SUMMARY-for-tomorrow.md | 237 ++++++++++++++++++ 1 file changed, 237 insertions(+) create mode 100644 2026-repronim-YODA-BIDS-webinar/planning/SUMMARY-for-tomorrow.md diff --git a/2026-repronim-YODA-BIDS-webinar/planning/SUMMARY-for-tomorrow.md b/2026-repronim-YODA-BIDS-webinar/planning/SUMMARY-for-tomorrow.md new file mode 100644 index 0000000..9bac83c --- /dev/null +++ b/2026-repronim-YODA-BIDS-webinar/planning/SUMMARY-for-tomorrow.md @@ -0,0 +1,237 @@ +# Summary for Tomorrow Morning (Feb 2, 2026) + +## What We've Done + +✅ **Updated presentation header** (committed) +- Title: "ReproFlow & YODA: Structure your studies, observable and reproducible they become" +- Abstract/motto with Yoda commandments +- ReproNim webinar date: Friday, Feb 6, 2026 +- QR code generated and linked + +✅ **Organized materials in YODA-compliant structure** (committed) +``` +2026-repronim-YODA-BIDS-webinar/ +├── README.md # Entry point +├── notes/ +│ ├── act2-refinement-notes.md # Research findings +│ └── frontier-condensation.md # Core concept +└── planning/ + ├── proposed-structure.md # 5-act structure + ├── frontier-condensation-integration.md # Integration plan + └── SUMMARY-for-tomorrow.md # This file +``` + +✅ **Researched and documented** (committed): +- BEP028 provenance standard +- BABS (Felix Hoffstaedter, FAIRly big) +- Nipoppy + Neurobagel dashboard +- BIDS-flux platform +- SciOps framework from June 2024 webinar +- dandi-bib + citations-collector workflow +- NeuroDebian + reproducible-builds.org +- snapshot.debian.org archival approach + +✅ **Developed core concept: Frontier Condensation** (committed) +- Modular composition = hierarchical transformation +- Each module creates "condensed frontier" with links to source +- Pattern universal (not DataLad/git-specific) +- 9 detailed 
cross-domain examples with real projects +- Integration plan for all 5 acts + +## Current Status + +**Main slides**: `../2026-repronim-YODA-BIDS-webinar.html` +- Still based on 2025-distribits-YODA.html structure +- Title slide updated +- Content needs expansion/restructuring + +**Planning materials**: Complete and committed +- Ready to guide slide creation +- All references documented +- Integration strategy defined + +## Key Conceptual Breakthrough + +### Frontier Condensation Pattern + +Every module/subdataset serves as: +1. **Stopping point** for complexity/data growth +2. **Transformation** to more appropriate form (1000x-10000x size reduction typical) +3. **Usable interface** (frontier) for next level +4. **Versioned link** back to source (depth preservation) + +**Tagline**: *"Surface you create, depth you preserve"* + +**Why it matters**: +- Unifies YODA, BIDS, ReproFlow, dashboards, AI under single conceptual framework +- Makes "modular composition" concrete and compelling +- Shows pattern transcends tools (git, Debian snapshot, DOIs, etc.) +- Connects to audience experience (they already do this!) + +## Tomorrow's Agenda + +### 1. Review & Refine Frontier Condensation Concept +**Questions to discuss:** +- Is "frontier condensation" the right term? +- Should it be explicit section title or woven throughout? +- Balance: DataLad examples vs. universal pattern? +- Which examples resonate most for ReproNim audience? + +### 2. 
Decide on Presentation Structure + +**Option A: Explicit Frontier Theme** +- Act I: YODA + Frontier Condensation intro +- Act II: Frontiers in Practice (tools, workflows) +- Act III: BIDS as Frontier Cascade +- Act IV: AI-Generated Frontiers +- Act V: Universal Pattern + +**Option B: Weave Throughout** (recommended in integration plan) +- Keep 5-act structure from proposed-structure.md +- Inject frontier condensation language everywhere +- Don't make it a separate "thing", make it the lens + +**Option C: Hybrid** +- Introduce briefly in Act I +- Examples throughout Acts II-III +- Synthesize in Act IV (AI) and V (vision) + +### 3. Prioritize New Slides + +**High priority** (must add): +- [ ] Frontier condensation intro/definition +- [ ] BIDS as 4-stage cascade diagram +- [ ] Dashboard ≠ Data (Nipoppy example) +- [ ] dandi-bib workflow visualization +- [ ] Universal pattern comparison table + +**Medium priority** (should add): +- [ ] Software examples (NeuroDebian, reproducible-builds, snapshot) +- [ ] SciOps 80/20 principle visualization +- [ ] BEP028 provenance integration +- [ ] Tools comparison (BABS, Nipoppy, BIDS-flux, ReproMan) +- [ ] Meeting archives as resource + +**Low priority** (nice to have): +- [ ] AI-generated frontiers detailed examples +- [ ] Anti-patterns slide +- [ ] Frontier condensation cheatsheet + +### 4. Content to Keep/Reduce/Enhance + +**Keep & enhance**: +- YODA principles (especially Principle 3) +- datalad run examples (frame as frontier generation) +- OpenNeuroDerivatives (perfect cascade example) +- con/duct execution tracing + +**Reduce/consolidate**: +- git-annex stats (brief mention only) +- Container technology details +- Multiple repetitive datalad run examples + +**Add new**: +- All high-priority slides above +- Concrete project examples (dandi-bib, BABS, etc.) +- Cross-domain comparisons + +### 5. 
Visual Design Decisions + +**Proposed visual motif**: Two-layer diagrams +``` +[Frontier Layer] ← Green, small, usable + ⇅ Links +[Source Layer] ← Blue, large, preserved +``` + +**Questions**: +- Use this consistently or vary by domain? +- Need custom graphics or can use text diagrams? +- Include Yoda imagery throughout or just title/end? + +### 6. Time Allocation + +**Available**: ~45 minutes typical ReproNim webinar +- Title/intro: 2 min +- Act I (YODA foundation): 8 min +- Act II (Workflows/SciOps): 12 min (STRENGTHEN) +- Act III (BIDS composition): 10 min +- Act IV (AI frontier): 8 min (NEW) +- Act V (Vision): 3 min +- Q&A: flexible + +**Act II needs most expansion** - this is where tools live. + +## Resources Ready for Tomorrow + +1. **act2-refinement-notes.md**: All tool research, links, descriptions +2. **frontier-condensation.md**: 9 detailed examples, principles, best practices +3. **frontier-condensation-integration.md**: How to weave concept throughout +4. **proposed-structure.md**: Original 5-act structure with new slides listed + +## Key References to Have Handy + +- BEP028: https://github.com/bids-standard/BEP028_BIDSprov/blob/master/bep028spec.md +- BABS paper: https://direct.mit.edu/imag/article/doi/10.1162/imag_a_00074/119046 +- dandi-bib: https://github.com/dandi/dandi-bib +- citations-collector: https://github.com/con/citations-collector +- Your June 2024 webinar: https://datasets.datalad.org/repronim/artwork/talks/webinar-2024-reproflow/#/ +- Nipoppy: https://github.com/nipoppy/nipoppy +- BIDS-flux: https://bids-flux-docs.readthedocs.io/ +- reproducible-builds.org: https://reproducible-builds.org/ +- snapshot.debian.org: https://snapshot.debian.org/ + +## Questions to Resolve Tomorrow + +1. **Terminology**: "Frontier condensation", "hierarchical transformation", or something else? +2. **Emphasis**: Tool-specific (DataLad) vs. universal pattern? +3. **Depth**: How much detail on each tool vs. conceptual overview? +4. 
**Personal anecdotes**: Include meeting archives practice, or keep general? +5. **Slide count**: How many new slides are realistic for the Feb 6 deadline? +6. **Audience assumption**: How familiar is the ReproNim audience with YODA already? + +## Next Steps + +**Tomorrow morning**: +1. Review these materials +2. Discuss questions above +3. Decide on structure (Option A/B/C) +4. Create prioritized todo list for slide creation +5. Start with high-priority slides + +**Rest of week**: +- Create new slides based on priorities +- Reorganize existing content +- Test presentation flow +- Practice timing +- Generate any needed graphics/diagrams + +## What Makes This Talk Different + +**Previous YODA talks**: Organizational principles + DataLad tools +**This talk**: YODA as **transformation framework** enabling: +- Hierarchical size reduction (TB → GB → MB → KB) +- Format appropriateness (raw → processed → analyzed → published) +- Observability (provenance everywhere) +- AI amplification (structured → queryable → verifiable) +- Universal pattern (transcends specific tools) + +**Resonance for ReproNim audience**: +- They build tools that create frontiers (BIDS Apps, pipelines) +- They manage data at multiple granularities +- They need to explain the value of structured data to PIs +- They face "why version control data?" questions +- This framework answers all of the above + +## Confidence Level + +**Concept**: Very high - frontier condensation is solid, universal, compelling +**Examples**: High - all concrete, documented, real projects +**Integration**: Medium-high - clear plan, needs execution +**Timeline**: Medium - 4 days to Feb 6, realistic but tight + +## Final Thought + +The frontier condensation concept is the "aha!" that ties everything together. It makes YODA not just "good practice" but essential infrastructure for modern science. Every example in the talk becomes an instance of the same pattern. + +**This is the talk's superpower.** Tomorrow: decide how bold to be with it.
From 45450887069ddf145e719156d21138cfc165c0b6 Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Mon, 2 Feb 2026 17:56:31 -0500 Subject: [PATCH 7/8] Add a note to README per @asmacdo request --- README.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/README.md b/README.md index c62f914..d2097f4 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,16 @@ tools/cast_live casts/ A number of casts from previous workshops can be found in ``casts/``. To find out how to create casts on your own machine, check out the [contributing instructions for the book for casts](http://handbook.datalad.org/en/latest/contributing.html#directives), or write them by hand - everything that starts within a ``run ''`` statement is executed on ``Enter``, everything within a ``say ''`` is written to your private terminal as a note. Note that ``cast_live`` may configure your keyboard layout to ``en-us``. If you are usually using a different keyboard layout, e.g., German, reset it using ``setxkbmap de``. +## DataLad/git-annex + +If you cloned this dataset from anywhere other than its https://datasets.datalad.org copy, you might need to add the remote that holds its git-annex content via + +```sh +git remote add --fetch datasets.datalad.org https://datasets.datalad.org/centerforopenneuroscience/talks/.git +``` + +to get access to the files stored on this remote (it is not yet set up as an auto-enabled git-annex type=git remote). + ## Advice for creating presentations - ``clone`` the repository to your local computer and ``datalad get`` all subdatasets (``datalad get -n -r .``).
From 4de1d5b060a1be0c5be2af566f497867b744a709 Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Mon, 2 Feb 2026 18:54:51 -0500 Subject: [PATCH 8/8] Apply suggestions from @asmacdo review Co-authored-by: Austin Macdonald --- 2026-repronim-YODA-BIDS-webinar.html | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/2026-repronim-YODA-BIDS-webinar.html b/2026-repronim-YODA-BIDS-webinar.html index 57363cb..5619d82 100644 --- a/2026-repronim-YODA-BIDS-webinar.html +++ b/2026-repronim-YODA-BIDS-webinar.html @@ -196,6 +196,7 @@

Why version control?

#### Handbook example on "Brain Extraction" +`datalad run` executes any command and, once it finishes successfully, automatically commits the results, recording the exact command and the changes it made! ```shell $ datalad run -m "run brain extract workflow" \ --input "inputs/ds*/sub-01/anat/sub-01_T1w.nii.gz" \ @@ -264,7 +265,7 @@

---- -### How you got here know you do! +### How the state came to be, know you will! @@ -355,7 +356,7 @@

---- -### YODA wanted git-annex complete not +### YODA wanted git-annex, complete not [![Distribits 2024: "git annex is complete, right?"](pics/distribits-2024-git-annex-complete.png)](https://www.youtube.com/watch?v=pp8IeGXpRRI) @@ -503,7 +504,7 @@

- ready to use Singularity containers of popular neuroimaging containers - (local) "backup" of containers (in case removed from a Hub) -- WiP to accompany with OCI images +- WiP to include OCI images ---- @@ -780,7 +781,7 @@

- containers: - need to bind mount only (elements of) current folder - can map entire $PWD as $PWD inside -- no full paths specification in CLI or hard-coding in scripts +- no hardcoding full paths (CLI or scripts!) - the entire "module" becomes portable