
Compendium Scribe

Supports Python 3.12+.

Compendium Scribe is a Click-driven command line tool and library that uses OpenAI's deep research models to assemble a comprehensive research compendium for any topic. The workflow combines optional prompt refinement, a "deep research" call with web search tooling, and deterministic post-processing. It produces human-readable Markdown by default, backed by a rich XML data model that can also be exported.


Features

  • 🔍 Deep research pipeline — Orchestrates prompt planning, background execution, and tool-call capture with o3-deep-research.
  • 🧱 Rich data model — Includes sections, insights, and citations for cross-format rendering.
  • 🧾 Structured XML output — Produces a schema-friendly document ready for downstream conversion (HTML, Markdown, PDF pipelines, etc.).
  • 🌐 HTML Site Export — Generates a static, multi-page HTML site with navigation and semantic structure.
  • 🧩 Skill Export — Emits an AI agent skill folder with SKILL.md plus the compendium Markdown in references/.
  • 🔄 Re-rendering — Ingest existing XML compendiums to generate new output formats without re-running costly research.
  • ⚙️ Configurable CLI — Control background execution, tool call limits, and output paths via a unified command structure.
  • 🧪 Testable architecture — Research orchestration is decoupled from the OpenAI client, making it simple to stub in tests.

Quick Start

1. Install

pdm install --group dev

Ensure PDM_HOME points to a writable location when developing within a sandboxed environment.

2. Configure credentials

Create a .env file (untracked) with your OpenAI credentials:

OPENAI_API_KEY=sk-...
PROMPT_REFINER_MODEL=gpt-5.2
DEEP_RESEARCH_MODEL=o3-deep-research
SKILL_NAMER_MODEL=gpt-5.2
SKILL_WRITER_MODEL=gpt-5.2
POLLING_INTERVAL_IN_SECONDS=10
MAX_POLL_TIME_IN_MINUTES=60

Deep research requires an OpenAI account with the browsing tooling enabled. If additional tooling needs new environment keys, document them in the repo as you add them.
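
If you drive the library from your own scripts rather than through the CLI, loading this file explicitly before anything else is a safe default. A minimal sketch using python-dotenv (whether compendiumscribe loads .env on its own is not documented here):

import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # reads OPENAI_API_KEY, DEEP_RESEARCH_MODEL, etc. from .env

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY must be set before running research"
print("Deep research model:", os.getenv("DEEP_RESEARCH_MODEL", "o3-deep-research"))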

3. Generate a compendium

Use the create subcommand to specify a topic and run the research process:

pdm run compendium create "Lithium-ion battery recycling"

Options:

  • --output PATH — Base path/filename for the output (extension is ignored).
  • --no-background — Force synchronous execution (useful for short or restricted queries).
  • --max-tool-calls N — Cap the total number of tool calls for cost control.
  • --format FORMAT — Output format (defaults to md). Available: md, xml, html, pdf, skill. Can be repeated for multiple outputs.

Example output filename: lithium-ion-battery-recycling.md. The skill format writes a folder named after the skill, containing SKILL.md plus the compendium Markdown in references/ under the standard output filename.
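
The default filename is the topic slugified into a lowercase, hyphenated form. A rough illustration of that behavior (not necessarily the package's exact implementation):

import re

def slugify(topic: str) -> str:
    # Lowercase the topic and collapse runs of non-alphanumeric characters into hyphens.
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")

print(slugify("Lithium-ion battery recycling") + ".md")
# -> lithium-ion-battery-recycling.md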

4. Render formats from existing XML

If you have an existing XML compendium (e.g., my-topic.xml), you can re-render it into other formats:

pdm run compendium render my-topic.xml --format html

Options:

  • --format FORMAT — Output format(s) to generate (md, xml, html, pdf, skill).
  • --output PATH — Base path/filename for the output.
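
For batch work you can drive the same subcommand from a script. A minimal sketch that shells out to the CLI for every XML compendium in a directory (this assumes --format can be repeated here, as it can for create):

import subprocess
from pathlib import Path

# Re-render every saved compendium XML into HTML and Markdown.
for xml_file in Path("compendiums").glob("*.xml"):
    subprocess.run(
        ["pdm", "run", "compendium", "render", str(xml_file),
         "--format", "html", "--format", "md"],
        check=True,
    )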

5. Recover from a timeout

If a research task times out (exceeding MAX_POLL_TIME_IN_MINUTES), recovery information is saved to timed_out_research.json. You can resume checking for its completion without starting over:

pdm run compendium recover

Options:

  • --input PATH — Path to the recovery JSON file (defaults to timed_out_research.json).
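
The recovery file is plain JSON, so it can also be inspected directly. A minimal sketch that pretty-prints whatever the package stored (the exact keys are an implementation detail):

import json
from pathlib import Path

recovery = json.loads(Path("timed_out_research.json").read_text(encoding="utf-8"))
print(json.dumps(recovery, indent=2))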

Library Usage

from compendiumscribe import build_compendium, ResearchConfig, DeepResearchError

try:
    compendium = build_compendium(
        "Emerging pathogen surveillance",
        config=ResearchConfig(
            background=False, 
            max_tool_calls=30,
            max_poll_time_minutes=15,
        ),
    )
except DeepResearchError as exc:
    # Handle or log deep research failures
    raise

xml_payload = compendium.to_xml_string()

# Alternate exports
markdown_doc = compendium.to_markdown()
html_files = compendium.to_html_site()  # Returns dict of filename -> content
pdf_bytes = compendium.to_pdf_bytes()

The returned Compendium object contains structured sections, insights, citations, and open questions.
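
Continuing the snippet above, persisting these exports to disk is straightforward. A short sketch; the directory and file names are illustrative:

from pathlib import Path

out_dir = Path("emerging-pathogen-surveillance")
out_dir.mkdir(exist_ok=True)

(out_dir / "compendium.xml").write_text(xml_payload, encoding="utf-8")
(out_dir / "compendium.md").write_text(markdown_doc, encoding="utf-8")
(out_dir / "compendium.pdf").write_bytes(pdf_bytes)

# to_html_site() returns a dict of filename -> content, so write each page out.
for filename, content in html_files.items():
    (out_dir / filename).write_text(content, encoding="utf-8")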


Data Model Overview

Compendium Scribe produces XML shaped like:

<compendium topic="Lithium-ion Battery Recycling" generated_at="2025-01-07T14:32:33+00:00">
  <overview><![CDATA[Comprehensive synthesis of the state of lithium-ion recycling...]]></overview>
  <methodology>
    <step><![CDATA[Surveyed peer-reviewed literature from 2022–2025]]></step>
    <step><![CDATA[Corroborated industrial capacity data with regulatory filings]]></step>
  </methodology>
  <sections>
    <section id="S01">
      <title><![CDATA[Technology Landscape]]></title>
      <summary><![CDATA[Dominant recycling modalities and throughput metrics...]]></summary>
      <key_terms>
        <term><![CDATA[hydrometallurgy]]></term>
        <term><![CDATA[direct recycling]]></term>
      </key_terms>
      <guiding_questions>
        <question><![CDATA[Which processes yield the highest cobalt recovery rates?]]></question>
      </guiding_questions>
      <insights>
        <insight>
          <title><![CDATA[Hydrometallurgy remains the throughput leader]]></title>
          <evidence><![CDATA[EPRI 2024 data shows >95% cobalt recovery in commercial plants.]]></evidence>
          <implications><![CDATA[Capital efficiency favors hydrometallurgy for near-term scaling.]]></implications>
          <citations>
            <ref>C1</ref>
          </citations>
        </insight>
      </insights>
    </section>
  </sections>
  <citations>
    <citation id="C1">
      <title><![CDATA[EPRI Lithium-ion Recycling Benchmarking 2024]]></title>
      <url><![CDATA[https://example.com/epri-li-benchmark]]></url>
      <publisher><![CDATA[EPRI]]></publisher>
      <published_at><![CDATA[2024-09-01]]></published_at>
      <summary><![CDATA[Performance metrics for recycling modalities across 12 facilities.]]></summary>
    </citation>
  </citations>
  <open_questions>
    <question><![CDATA[How will policy incentives shape regional plant siting post-2025?]]></question>
  </open_questions>
</compendium>

This format is intentionally verbose to support downstream transformation. Markdown links within text (e.g., [Label](URL)) are preserved in the XML to ensure they render correctly in final outputs.
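
As a concrete example of downstream transformation, the standard library is enough to walk this structure. A minimal sketch that prints each section's insights together with their resolved citation titles (the input filename is illustrative):

import xml.etree.ElementTree as ET

root = ET.parse("lithium-ion-battery-recycling.xml").getroot()

# Index citation titles by id so insight <ref> elements can be resolved.
citations = {
    c.get("id"): c.findtext("title", default="").strip()
    for c in root.findall("./citations/citation")
}

for section in root.findall("./sections/section"):
    print(section.findtext("title", default="").strip())
    for insight in section.findall("./insights/insight"):
        refs = [citations.get(ref.text, ref.text) for ref in insight.findall("./citations/ref")]
        print("  -", insight.findtext("title", default="").strip(), "|", ", ".join(refs))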


Testing & Quality

  • pdm run test — Executes the unit suite. Tests stub the OpenAI client, so they run offline.
  • pdm run lint — Runs linting.
  • pdm build — Produces distributable artifacts.

If pdm fails to write log files in restricted environments, set PDM_HOME to a writable directory (for example, export PDM_HOME=.pdm_home).


Contributing

  1. Fork and clone the repository.
  2. Run pdm install --group dev.
  3. Make changes following the style guide and update/add tests.
  4. Run pdm run test and pdm run lint.
  5. Raise a pull request with:
    • A concise description of the change.
    • Verification commands executed locally.
    • Representative XML samples if the user-facing structure changes.
