Conversation

@nrfulton nrfulton commented Dec 11, 2025

A Span is a contiguous piece of KV Cache that typically also has some conceptual/semantic content. Examples of Spans include RAG documents, other artifacts (such as code, execution traces, error logs), or chat messages.

Most spans play both roles -- they are KV boundaries because they are conceptually self-contained entities. For example: we pre-compute KV cache for all of the documents in a RAG database. So we have the KV blocks associated with each document, and each of those KV blocks corresponds to a conceptually whole entity (the document).

It is useful to distinguish between the two roles that a Span plays when discussing implementation details. So we refer to all the KV caching semantics as "KV spans" or "KV blocks", and we refer to the conceptual grouping semantics as "conceptual spans".

This PR is about conceptual spans; it focuses on re-introducing "conceptual spans" into mellea from one of our earlier experimental code bases. There is a corresponding PR, also currently open, on the KV span / KV block aspect. See issue #111. The two PRs will be merged together.

Background

The Mellea tutorial uses the stdlib MelleaSession and mfunc abstractions to hide Mellea's core from the user. In this section we peel back the Session and mfunc abstractions so that we can see how Mellea works under the hood.

Mellea represents data using three types: Component | CBlock | ModelOutputThunk.

  • CBlocks are a wrapper around inputs to an LLM.
  • ModelOutputThunks are outputs from LLMs. These are created prior to any LLM call actually happening.
  • Components are composite types that implement a protocol that explains how the Component should be represented to an LLM.

Let's review each of these.

CBlocks and Thunks

CBlocks (and Components) are passed into a model via a Backend. The Backend emits a ModelOutputThunk (with a new Context, which we will talk about in a moment). For example,

async def main():
    in_0 = CBlock("What is 1+1? Reply with only the number.")

    out_0, _ = await backend.generate_from_context(in_0, ctx=SimpleContext())  # out_0 is a ModelOutputThunk
    print(f"Note: right now out_0 is not computed, so out_0.value is None (proof: {out_0.value})")
    next_int = await out_0.avalue()
    print(next_int)  # out_0.value == next_int.

    in_1 = CBlock(value=f"What is {next_int} + {next_int}? Reply with only the number.")
    out_1, _ = await backend.generate_from_context(in_1, ctx=SimpleContext())
    print(await out_1.avalue())

asyncio.run(main())

Notice how a ModelOutputThunk can be uncomputed (mot.value is None) or computed (mot.value is not None, which is guaranteed after await mot.avalue()).
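This lifecycle can be mimicked with a stripped-down stand-in. TinyThunk and fake_llm below are hypothetical illustrations, not Mellea API:

```python
import asyncio

# Hypothetical minimal thunk illustrating the uncomputed -> computed lifecycle.
class TinyThunk:
    def __init__(self, coro):
        self.value = None                         # None until computed
        self._task = asyncio.ensure_future(coro)  # schedule, don't wait

    async def avalue(self):
        if self.value is None:
            self.value = await self._task         # force the computation
        return self.value

async def fake_llm():
    await asyncio.sleep(0)  # stands in for a real model call
    return "2"

async def main():
    mot = TinyThunk(fake_llm())
    before = mot.value          # None: uncomputed
    after = await mot.avalue()  # "2": computed
    return before, after

assert asyncio.run(main()) == (None, "2")
```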

Important

We need to think about intermediate MoT states, such as where a mot has been computed but has a tool call that is pending.

Components

Components can be composed of both CBlocks and ModelOutputThunks. For example,

class SimpleComponent(Component):
    """ aka tagless spans """
    def __init__(self, parts):
        self._parts = parts
    
    def parts(self):
        return self._parts

Let's extend this component a bit so that we can print it out and see which of its thunks are computed:

    @staticmethod
    def part_to_string(part) -> str:
        match part:
            case ModelOutputThunk() if part.value is not None:
                return part.value
            case ModelOutputThunk() if part.value is None:
                return "uncomputed!"
            case CBlock():
                return part.value
            case Component():
                formatted = part.format_for_llm()
                assert type(formatted) == str, "sic: actually need a formatter because this could be a template repr or a str but we're simplifying for now."
                return formatted

    def format_for_llm(self):
        str_parts : list[str] = [SimpleComponent.part_to_string(x) for x in self._parts]
        return " :: ".join(str_parts) + " :: EndList"     
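Putting the pieces together: with hypothetical stand-ins for CBlock and ModelOutputThunk (no backend involved), formatting a component with one computed and one uncomputed part renders as shown below. This is a sketch of the protocol above, not the real mellea classes:

```python
# Hypothetical stand-ins so the formatting logic runs without a backend.
class CBlock:
    def __init__(self, value):
        self.value = value

class ModelOutputThunk(CBlock):
    def __init__(self):
        super().__init__(None)  # uncomputed: value is None

class SimpleComponent:
    def __init__(self, parts):
        self._parts = parts

    @staticmethod
    def part_to_string(part) -> str:
        if isinstance(part, ModelOutputThunk):
            return part.value if part.value is not None else "uncomputed!"
        if isinstance(part, SimpleComponent):
            return part.format_for_llm()
        return part.value

    def format_for_llm(self):
        return " :: ".join(self.part_to_string(x) for x in self._parts) + " :: EndList"

c = SimpleComponent([CBlock("2"), ModelOutputThunk()])
assert c.format_for_llm() == "2 :: uncomputed! :: EndList"
```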

(Aside: Recall that in the first example we had to await the value of out_0 before computing next_int.
One of the things we need to change is automatic awaiting on MoTs that are constituents of Components as part of the generate call. This existed in our first couple of codebases and we need to add that back here.)

Notably, Components can be constructed using ModelOutputThunks that are not yet computed. So, in our core data structure we have a data dependency graph. E.g.,

async def main():
    in_0 = CBlock("This is an input query")
    out_0, _ = await backend.generate_from_context(in_0, ctx=SimpleContext())

    # nb: out_0 is NOT necessarily computed at this program point! Notice the None.
    component = SimpleComponent(parts=[in_0, out_0])
    print(backend.formatter.print(component))

    # again: out_0 is not yet computed so this component is also not computed.
    another_component = SimpleComponent(parts=[in_0, out_0, component])

    # we can generate into the component after forcing out_0 to complete.
    await out_0.avalue()
    out_component, _ = await backend.generate_from_context(component, ctx=SimpleContext())

    # now out_0 IS computed, but the output from component (out_component) is not!
    third_component = SimpleComponent(parts=[in_0, out_0, out_component])
    print(backend.formatter.print(third_component))

    # so let's resolve everything and make sure there's nothing uncomputed in the output from third_component.
    await out_component.avalue()
    third_out, _ = await backend.generate_from_context(third_component, ctx=SimpleContext())
    assert "uncomputed!" not in await third_out.avalue()

asyncio.run(main())

mergify bot commented Dec 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

TODO-nrf: we need to add generate walks to every generation call.
Comment on lines 92 to 101
def generate_walk(c: CBlock | Component | ModelOutputThunk) -> list[ModelOutputThunk]:
    """Returns the generation walk ordering for a Span."""
    match c:
        case ModelOutputThunk() if not c.is_computed():
            return [c]
        case CBlock():
            return []
        case Component():
            parts_walk = [generate_walk(p) for p in c.parts()]
            return list(itertools.chain.from_iterable(parts_walk))  # aka flatten
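For context, here is a sketch of how the walk might be consumed by a generate call, paired with asyncio.gather. The stub classes and the isinstance-based dispatch are hypothetical simplifications of the mellea types, not the actual implementation:

```python
import asyncio
import itertools

# Hypothetical stubs mirroring the types generate_walk dispatches on.
class CBlock:
    def __init__(self, value=None):
        self.value = value

class ModelOutputThunk(CBlock):
    def is_computed(self):
        return self.value is not None

    async def avalue(self):
        if self.value is None:
            self.value = "computed"  # stand-in for awaiting a model call
        return self.value

class Component:
    def __init__(self, parts):
        self._parts = parts

    def parts(self):
        return self._parts

def generate_walk(c):
    # Note: MoT is checked before CBlock, since it subclasses CBlock.
    if isinstance(c, ModelOutputThunk):
        return [c] if not c.is_computed() else []
    if isinstance(c, Component):
        return list(itertools.chain.from_iterable(generate_walk(p) for p in c.parts()))
    return []  # a plain CBlock has nothing to compute

async def main():
    mot = ModelOutputThunk()
    comp = Component([CBlock("q"), Component([mot])])
    to_compute = generate_walk(comp)
    await asyncio.gather(*[x.avalue() for x in to_compute])  # note the *list unpacking
    return mot.value

assert asyncio.run(main()) == "computed"
```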
@nrfulton nrfulton Dec 12, 2025

FYI @jakelorocco

  • we'll have to start doing this in the backend generate calls.
  • This also means that we need to go back through stdlib and use parts() correctly. (No action on your part atm)
  • We probably want some sort of linting rule for third party code that warns the developer when they've got data in a Component class which has type CBlock | Component but which does not appear in parts().
  • I think we might want to make ModelOutputThunk NOT be a subtype of CBlock because Python pattern matching is first-match not most-specific-match.
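The parts() linting rule from the third bullet could be approximated at runtime. check_parts and the stub classes below are hypothetical, not a proposed API:

```python
# Hypothetical runtime check approximating the proposed lint: warn when a
# Component instance holds CBlock/Component-typed attributes that are not
# returned by parts().
class CBlock:
    def __init__(self, value):
        self.value = value

class Component:
    def parts(self):
        return []

def check_parts(component) -> list[str]:
    declared = {id(p) for p in component.parts()}
    missing = []
    for name, val in vars(component).items():
        if isinstance(val, (CBlock, Component)) and id(val) not in declared:
            missing.append(name)
    return missing

class Bad(Component):
    def __init__(self):
        self.a = CBlock("in parts")
        self.b = CBlock("forgotten")  # bug: never exposed via parts()

    def parts(self):
        return [self.a]

assert check_parts(Bad()) == ["b"]
```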

Contributor

@nrfulton, should we also add some sort of computed / non-computed flag to Components because they will now suffer a similar situation as ModelOutputThunks?

And is it up to the Component owner what happens when not all parts of a Component are computed? For example, with a ModelOutputThunk, its value is None until it is fully computed. Should we specify a similar default behavior for components?

Contributor

I think we might want to make ModelOutputThunk NOT be a subtype of CBlock because Python pattern matching is first-match not most-specific-match.

I think that's fine. It's yet to be seen / fully implemented, but in the work for adding return types and parsing functions to Components, a CBlock is really just a Component with no parts (or one part?) that has a str return type.

Contributor Author

should we also add some sort of computed / non-computed flag to Components because they will now suffer a similar situation as ModelOutputThunks?

I need to think about this. It's not quite the same as ModelOutputThunks. And I think it can be a computed method rather than a flag.
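A sketch of what a computed method (rather than a stored flag) could look like: a component is computed iff every part is. All class names here are hypothetical stand-ins:

```python
# Hypothetical sketch: Component "computed-ness" derived from its parts.
class CBlock:
    def __init__(self, value):
        self.value = value  # a plain CBlock is always "computed"

class Thunk:
    def __init__(self):
        self.value = None

    def is_computed(self):
        return self.value is not None

class Component:
    def __init__(self, parts):
        self._parts = parts

    def parts(self):
        return self._parts

    def is_computed(self) -> bool:
        # Recursively true only when no constituent thunk is pending.
        for p in self.parts():
            if isinstance(p, (Thunk, Component)) and not p.is_computed():
                return False
        return True

t = Thunk()
c = Component([CBlock("x"), t])
assert not c.is_computed()  # pending thunk -> component not computed
t.value = "done"
assert c.is_computed()
```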

@nrfulton nrfulton Dec 12, 2025

Should we specify a similar default behavior for components?

We need to think about this. It's different from what happens with mots.

Things can go wrong. In particular: Component.format_for_llm should only be called when component prefillable judgement is derivable. But to your question regarding "similar behavior": format_for_llm can't ensure this contract holds itself because it doesn't have a backend in context (and shouldn't!).

NB: the problem isn't introduced by this PR, it already exists, right?

Contributor

Yes, it already exists but doesn't manifest since we pretty much always use computed stuff right now.

Comment on lines 1 to 46
import mellea
from mellea.stdlib.base import CBlock, Context, SimpleContext
from mellea.stdlib.span import Span, SimpleComponent
from mellea.backends import Backend
from mellea.backends.ollama import OllamaModelBackend
import asyncio


async def main(backend: Backend, ctx: Context):
    a_states = "Alaska,Arizona,Arkansas".split(",")
    m_states = "Missouri", "Minnesota", "Montana", "Massachusetts"

    a_state_pops = dict()
    for state in a_states:
        a_state_pops[state], _ = await backend.generate_from_context(
            CBlock(f"What is the population of {state}? Respond with an integer only."),
            SimpleContext(),
        )
    a_total_pop = SimpleComponent(
        instruction=CBlock(
            "What is the total population of these states? Respond with an integer only."
        ),
        **a_state_pops,
    )
    a_state_total, _ = await backend.generate_from_context(a_total_pop, SimpleContext())

    m_state_pops = dict()
    for state in m_states:
        m_state_pops[state], _ = await backend.generate_from_context(
            CBlock(f"What is the population of {state}? Respond with an integer only."),
            SimpleContext(),
        )
    m_total_pop = SimpleComponent(
        instruction=CBlock(
            "What is the total population of these states? Respond with an integer only."
        ),
        **m_state_pops,
    )
    m_state_total, _ = await backend.generate_from_context(m_total_pop, SimpleContext())

    print(await a_state_total.avalue())
    print(await m_state_total.avalue())


backend = OllamaModelBackend(model_id="granite4:latest")
asyncio.run(main(backend, SimpleContext()))
@nrfulton nrfulton Dec 12, 2025

FYI @HendrikStrobelt this is what lazy spans look like now.

Remember that await backend.generate_from_context doesn't actually await the computation of the result. It merely awaits the triggering of the generate call. So the full lifecycle of a call that looks sync has two awaits:

mot, new_ctx = await backend.generate_from_context(...)
result: str = await mot.avalue()

It's not the prettiest code in the world, but it's nice to see that lazy spans still work after our long sojourn into devexp land.

Contributor

Remember that await backend.generate_from_context doesn't actually await the computation of the result. It merely awaits the triggering of the generate call.

Just wanted to call this out since Python async is weird. Since backend.generate_from_context() can always do work immediately (i.e., processing the model opts / context, queueing up the API call, ...), Python should never actually pause the control flow at that await boundary. It will always immediately do the work to get you the ModelOutputThunk, since none of the backends (currently) have await statements inside their backend.generate_from_context() functions that actually have to await asynchronous work being done.
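The two-await lifecycle can be mimicked with plain asyncio; generate_from_context_stub below is a hypothetical stand-in for the backend call, not the Mellea implementation:

```python
import asyncio

async def generate_from_context_stub():
    # Synchronous bookkeeping happens here; the real work is only scheduled.
    return asyncio.create_task(asyncio.sleep(0, result="42"))

async def main():
    task = await generate_from_context_stub()  # first await: returns immediately
    pending = not task.done()                  # the scheduled work hasn't run yet
    result = await task                        # second await: actually waits
    return pending, result

assert asyncio.run(main()) == (True, "42")
```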

Contributor Author

Another gotcha: we should use asyncio.gather() rather than sequential awaits when awaiting multiple things. There was a bug in my version of the generate walk: gather takes its awaitables as positional arguments, so the list must be unpacked:

_to_compute = generate_walk(action)
await asyncio.gather(*[x.avalue() for x in _to_compute])
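A minimal, self-contained illustration of the unpacking requirement (work is a hypothetical coroutine):

```python
import asyncio

async def work(n):
    await asyncio.sleep(0)  # yield to the event loop, like a real API call
    return n * 2

async def main():
    coros = [work(i) for i in range(3)]
    # gather takes *awaitables, not a list; passing the bare list raises TypeError.
    return await asyncio.gather(*coros)

assert asyncio.run(main()) == [0, 2, 4]
```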


nrfulton commented Dec 12, 2025

Related stuff coming out of today's standup:

  • Refactor Backend so that the generate calls implement a protocol and individual steps in that protocol are overridden by specific implementations while others aren't <- opened issue and did it the manual way for now.
  • Similarly, we need to define a lifecycle for spans (new -> prefilled -> computed -> post-processing). <- this is being addressed in melp.

Deletes the stdlib.span package and moves simplecomponent into base.

Fixes a bug in call to gather (should be *list not list)
@nrfulton

Backend cleanup debt captured in #253
