
Commit 42dc75e

mpangrazzi and sjrl authored

Add support to automatic "hybrid" streaming (#182)

* Allow async_streaming_generator to also support sync-only streaming components by assigning the proper streaming_callback type
* Refactoring; remove unneeded comments
* Use only 'auto' and False values; refactoring; add docs
* Add extra deps for tests
* Improve the check for streaming-capable components (inspect the run/run_async method signature rather than component attributes)
* Lint
* Update tests and docs without HuggingFace components
* Update docs
* Update docs/concepts/pipeline-wrapper.md (suggestions co-authored by Sebastian Husch Lee <[email protected]>)
* Simplify the find_all_streaming_components check
* Remove duplicate test
* Use allow_sync_streaming_callbacks=True/False; refactoring + add example

Co-authored-by: Sebastian Husch Lee <[email protected]>

1 parent 5f5f5d7 commit 42dc75e

File tree

9 files changed: +657 −30 lines changed

docs/concepts/pipeline-wrapper.md

Lines changed: 140 additions & 0 deletions
@@ -176,6 +176,146 @@ async def run_chat_completion_async(self, model: str, messages: List[dict], body
## Hybrid Streaming: Mixing Async and Sync Components

!!! tip "Compatibility for Legacy Components"
    When working with legacy pipelines or components that only support sync streaming callbacks (like `OpenAIGenerator`), use `allow_sync_streaming_callbacks=True` to enable hybrid mode. For new code, prefer async-compatible components and use the default strict mode.
Some Haystack components only support synchronous streaming callbacks and don't have async equivalents. Examples include:

- `OpenAIGenerator` - legacy OpenAI text generation (⚠️ Note: `OpenAIChatGenerator` IS async-compatible)
- Other components without `run_async()` support
### The Problem

By default, `async_streaming_generator` requires all streaming components to support async callbacks:

```python
async def run_chat_completion_async(self, model: str, messages: list[dict], body: dict) -> AsyncGenerator:
    question = get_last_user_message(messages)

    # This will FAIL if the pipeline contains OpenAIGenerator
    return async_streaming_generator(
        pipeline=self.pipeline,  # AsyncPipeline with OpenAIGenerator
        pipeline_run_args={"prompt": {"query": question}},
    )
```
**Error:**

```text
ValueError: Component 'llm' of type 'OpenAIGenerator' seems to not support
async streaming callbacks...
```
### The Solution: Hybrid Streaming Mode

Enable hybrid streaming mode to automatically handle both async and sync components:

```python
async def run_chat_completion_async(self, model: str, messages: list[dict], body: dict) -> AsyncGenerator:
    question = get_last_user_message(messages)
    return async_streaming_generator(
        pipeline=self.pipeline,
        pipeline_run_args={"prompt": {"query": question}},
        allow_sync_streaming_callbacks=True,  # ✅ Auto-detect and enable hybrid mode
    )
```
### What `allow_sync_streaming_callbacks=True` Does

When you set `allow_sync_streaming_callbacks=True`, the system enables **intelligent auto-detection**:

1. **Scans Components**: Automatically inspects all streaming components in your pipeline
2. **Detects Capabilities**: Checks whether each component has `run_async()` support (a simplified sketch of this check appears below)
3. **Enables Hybrid Mode Only If Needed**:
    - ✅ If **all components support async** → uses pure async mode (no overhead)
    - ✅ If **any component is sync-only** → automatically enables hybrid mode
4. **Bridges Sync to Async**: For sync-only components, wraps their callbacks to work seamlessly with the async event loop
5. **Zero Configuration**: You don't need to know which components are sync or async - it figures this out automatically

!!! success "Smart Behavior"
    Setting `allow_sync_streaming_callbacks=True` does NOT force hybrid mode. It only enables it when actually needed. If your pipeline is fully async-capable, you get pure async performance with no overhead!
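To make the auto-detection concrete, here is a minimal, hypothetical sketch of such a capability check. Per the commit message, the real check inspects the `run`/`run_async` method signatures rather than component attributes; the helper names below are illustrative, not hayhooks' actual internals:

```python
import inspect

def is_streaming_component(component) -> bool:
    # Hypothetical: a streaming component is one whose run() accepts
    # a streaming_callback parameter.
    return "streaming_callback" in inspect.signature(component.run).parameters

def supports_async_streaming(component) -> bool:
    # Hypothetical: a component can stream natively in async mode if it
    # exposes run_async() and that method accepts a streaming_callback.
    run_async = getattr(component, "run_async", None)
    if run_async is None or not callable(run_async):
        return False
    return "streaming_callback" in inspect.signature(run_async).parameters
```

With checks like these, hybrid mode only kicks in when some streaming component fails the `supports_async_streaming` test; otherwise the pipeline runs in pure async mode.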
### Configuration Options

```python
# Option 1: Strict mode (default - recommended)
allow_sync_streaming_callbacks=False
# → Raises an error if sync-only components are found
# → Best for: new code, ensuring proper async components, best performance

# Option 2: Auto-detection (compatibility mode)
allow_sync_streaming_callbacks=True
# → Automatically detects and enables hybrid mode only when needed
# → Best for: legacy pipelines, components without async support, gradual migration
```
### Example: Legacy OpenAI Generator with Async Pipeline

```python
from collections.abc import AsyncGenerator

from haystack import AsyncPipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

from hayhooks import BasePipelineWrapper, get_last_user_message, async_streaming_generator


class LegacyOpenAIWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        # OpenAIGenerator only supports sync streaming (legacy component)
        llm = OpenAIGenerator(
            api_key=Secret.from_env_var("OPENAI_API_KEY"),
            model="gpt-4o-mini",
        )

        prompt_builder = PromptBuilder(
            template="Answer this question: {{question}}"
        )

        self.pipeline = AsyncPipeline()
        self.pipeline.add_component("prompt", prompt_builder)
        self.pipeline.add_component("llm", llm)
        self.pipeline.connect("prompt.prompt", "llm.prompt")

    async def run_chat_completion_async(
        self, model: str, messages: list[dict], body: dict
    ) -> AsyncGenerator:
        question = get_last_user_message(messages)

        # Enable hybrid mode for OpenAIGenerator
        return async_streaming_generator(
            pipeline=self.pipeline,
            pipeline_run_args={"prompt": {"question": question}},
            allow_sync_streaming_callbacks=True,  # ✅ Handles sync component
        )
```
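Once deployed, a wrapper like this can be exercised through hayhooks' OpenAI-compatible endpoint. Below is an illustrative client-side sketch using the official `openai` Python package; the deployment name `legacy_openai_wrapper` is an assumption for this example:

```python
from openai import OpenAI

# The base URL matches hayhooks' default local address; the model name is
# whatever the wrapper above was deployed under (hypothetical here).
client = OpenAI(base_url="http://localhost:1416/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="legacy_openai_wrapper",  # hypothetical deployment name
    messages=[{"role": "user", "content": "What is machine learning?"}],
    stream=True,
)
for chunk in stream:
    # Streamed tokens arrive as deltas; chunks without content are skipped.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

From the client's point of view, hybrid mode is invisible: the stream behaves exactly as it would with a fully async pipeline.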
### When to Use Each Mode

**Use strict mode (default) when:**

- Building new pipelines (recommended default)
- You want to ensure all components are **async-compatible**
- Performance is critical (pure async is **~1-2μs faster** per chunk)
- You're building a production system with controlled dependencies

**Use `allow_sync_streaming_callbacks=True` when:**

- Working with legacy pipelines that use `OpenAIGenerator` or other sync-only components
- Deploying YAML pipelines with unknown or legacy component types
- Migrating old code that doesn't have async equivalents yet
- Using third-party components without async support
### Performance Considerations

- **Pure async pipeline**: No overhead
- **Hybrid mode (auto-detected)**: Minimal overhead (~1-2 microseconds per streaming chunk for sync components; see the micro-benchmark sketch below)
- **Network-bound operations**: The overhead is negligible compared to LLM generation time
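To put the per-chunk figure in perspective, the thread-safe handoff that hybrid mode relies on can be timed in isolation. A rough, hypothetical micro-benchmark (actual numbers vary by machine and event loop; this measures scheduling cost only, not end-to-end delivery):

```python
import asyncio
import time

async def main() -> None:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    n = 100_000

    start = time.perf_counter()
    for i in range(n):
        # The same thread-safe handoff a sync-to-async bridge would use.
        loop.call_soon_threadsafe(queue.put_nowait, i)
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n * 1e6:.2f} µs per handoff (scheduling only)")

asyncio.run(main())
```

On typical hardware this lands in the low single-digit microseconds, which is why the overhead disappears next to network latency and token generation time.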
!!! success "Best Practice"
    **For new code**: Use the default strict mode (`allow_sync_streaming_callbacks=False`) to ensure you're using proper async components.

    **For legacy/compatibility**: Use `allow_sync_streaming_callbacks=True` when working with older pipelines or components that don't support async streaming yet.
## Streaming from Multiple Components

!!! info "Smart Streaming Behavior"

docs/examples/async-operations.md

Lines changed: 3 additions & 1 deletion
@@ -28,7 +28,9 @@ curl -X POST http://localhost:1416/v1/chat/completions \

 !!! tip "Best Practices"
     - Prefer `run_chat_completion_async` for streaming and concurrency
-    - Ensure components support async streaming callbacks; otherwise use the sync `streaming_generator`
+    - Use async-compatible components (e.g., `OpenAIChatGenerator`) for best performance
+    - For legacy pipelines with sync-only components (like `OpenAIGenerator`), use `allow_sync_streaming_callbacks=True` to enable hybrid mode
+    - See [Hybrid Streaming](../concepts/pipeline-wrapper.md#hybrid-streaming-mixing-async-and-sync-components) for handling legacy components

 ## Related

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ This directory contains various examples demonstrating different use cases and f
 |---------|-------------|--------------|----------|
 | [multi_llm_streaming](./pipeline_wrappers/multi_llm_streaming/) | Multiple LLM components with automatic streaming | • Two sequential LLMs<br/>• Automatic multi-component streaming<br/>• No special configuration needed<br/>• Shows default streaming behavior | Demonstrating how hayhooks automatically streams from all components in a pipeline |
 | [async_question_answer](./pipeline_wrappers/async_question_answer/) | Async question-answering pipeline with streaming support | • Async pipeline execution<br/>• Streaming responses<br/>• OpenAI Chat Generator<br/>• Both API and chat completion interfaces | Building conversational AI systems that need async processing and real-time streaming responses |
+| [async_hybrid_streaming](./pipeline_wrappers/async_hybrid_streaming/) | AsyncPipeline with legacy sync-only components using hybrid mode | • AsyncPipeline with OpenAIGenerator<br/>• `allow_sync_streaming_callbacks=True`<br/>• Automatic sync-to-async bridging<br/>• Migration example | Using legacy components (OpenAIGenerator) in async pipelines, migrating from sync to async gradually, handling third-party sync-only components |
 | [chat_with_website](./pipeline_wrappers/chat_with_website/) | Answer questions about website content | • Web content fetching<br/>• HTML to document conversion<br/>• Content-based Q&A<br/>• Configurable URLs | Creating AI assistants that can answer questions about specific websites or web-based documentation |
 | [chat_with_website_mcp](./pipeline_wrappers/chat_with_website_mcp/) | MCP-compatible website chat pipeline | • MCP (Model Context Protocol) support<br/>• Website content analysis<br/>• API-only interface<br/>• Simplified deployment | Integrating website analysis capabilities into MCP-compatible AI systems and tools |
 | [chat_with_website_streaming](./pipeline_wrappers/chat_with_website_streaming/) | Streaming website chat responses | • Real-time streaming<br/>• Website content processing<br/>• Progressive response generation<br/>• Enhanced user experience | Building responsive web applications that provide real-time AI responses about website content |
examples/pipeline_wrappers/async_hybrid_streaming/README.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@

# Async Hybrid Streaming Example

This example demonstrates using `allow_sync_streaming_callbacks=True` to enable hybrid streaming mode with an AsyncPipeline and legacy sync-only components.

## Overview

This example shows how to use an **AsyncPipeline** with **OpenAIGenerator** (a legacy component that only supports synchronous streaming callbacks) by enabling hybrid mode with `allow_sync_streaming_callbacks=True`.

## The Problem

Some Haystack components, like `OpenAIGenerator`, only support **synchronous** streaming callbacks and don't have `run_async()` support. When you try to use them with `async_streaming_generator` in an AsyncPipeline, you'll get an error:

```text
ValueError: Component 'llm' of type 'OpenAIGenerator' seems to not support async streaming callbacks
```

## The Solution

Set `allow_sync_streaming_callbacks=True` to enable **hybrid mode**:

```python
async_streaming_generator(
    pipeline=self.pipeline,
    pipeline_run_args={...},
    allow_sync_streaming_callbacks=True,  # ✅ Enables hybrid mode
)
```

### What Hybrid Mode Does

When `allow_sync_streaming_callbacks=True`, the system automatically detects components with sync-only streaming callbacks (e.g., `OpenAIGenerator`) and bridges their callbacks so they work in the async context. If all components support async, no bridging is applied (pure async mode).
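For intuition, the bridging can be pictured as forwarding chunks from the sync callback (running in a worker thread) onto the event loop. The sketch below is illustrative only; names like `make_bridge` are hypothetical and hayhooks' actual internals may differ:

```python
import asyncio

def make_bridge(loop: asyncio.AbstractEventLoop, queue: asyncio.Queue):
    """Hypothetical sketch: adapt a sync streaming_callback for async consumption."""
    def sync_streaming_callback(chunk) -> None:
        # Runs in the worker thread executing the sync component;
        # hand the chunk to the event loop thread-safely.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)
    return sync_streaming_callback

async def consume(queue: asyncio.Queue):
    # The async generator side awaits chunks as they arrive.
    while (chunk := await queue.get()) is not None:  # None = "done" sentinel
        yield chunk
```

The key property is that the sync component never blocks the event loop: it only schedules a callback, and the async side consumes chunks at its own pace.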
## When to Use This

**Use `allow_sync_streaming_callbacks=True` when:**

- Working with **legacy components** like `OpenAIGenerator` that don't have async equivalents
- Deploying **YAML pipelines** where you don't control which components are used
- **Migrating** from sync to async pipelines gradually
- Using **third-party components** without async support

**For new code, prefer:**

- Async-compatible components (e.g., `OpenAIChatGenerator` instead of `OpenAIGenerator`)
- The default strict mode (`allow_sync_streaming_callbacks=False`) to ensure proper async components
## Usage

### Deploy with Hayhooks

```bash
# Set your OpenAI API key
export OPENAI_API_KEY=your_api_key_here

# Deploy the pipeline
hayhooks deploy examples/pipeline_wrappers/async_hybrid_streaming

# Test it via the OpenAI-compatible API
curl -X POST http://localhost:1416/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "async_hybrid_streaming",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "stream": true
  }'
```
## Performance

Hybrid mode might add minimal overhead (~1-2 microseconds per streaming chunk for sync components). This is negligible compared to network latency and LLM generation time.

## Related Documentation

- [Hybrid Streaming Concept](https://deepset-ai.github.io/hayhooks/concepts/pipeline-wrapper/#hybrid-streaming-mixing-async-and-sync-components)
- [Async Operations](https://deepset-ai.github.io/hayhooks/examples/async-operations/)
- [Pipeline Wrapper Guide](https://deepset-ai.github.io/hayhooks/concepts/pipeline-wrapper/)
examples/pipeline_wrappers/async_hybrid_streaming/hybrid_streaming.yml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@

components:
  llm:
    init_parameters:
      api_base_url: null
      api_key:
        env_vars:
        - OPENAI_API_KEY
        strict: true
        type: env_var
      generation_kwargs: {}
      model: gpt-4o-mini
      streaming_callback: null
      system_prompt: null
    type: haystack.components.generators.openai.OpenAIGenerator
  prompt:
    init_parameters:
      required_variables: "*"
      template: |
        Answer the following question concisely and accurately:
        {{query}}

        Answer:
      variables: null
    type: haystack.components.builders.prompt_builder.PromptBuilder
connection_type_validation: true
connections:
- receiver: llm.prompt
  sender: prompt.prompt
max_runs_per_component: 100
metadata: {}
examples/pipeline_wrappers/async_hybrid_streaming/pipeline_wrapper.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@

```python
from collections.abc import AsyncGenerator
from pathlib import Path

from haystack import AsyncPipeline

from hayhooks import BasePipelineWrapper, async_streaming_generator, get_last_user_message, log


class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        """
        Set up an AsyncPipeline with a legacy OpenAIGenerator component.

        OpenAIGenerator only supports sync streaming callbacks (no run_async() method).
        To use it with AsyncPipeline and async_streaming_generator, we need to enable
        hybrid mode with allow_sync_streaming_callbacks=True.
        """
        pipeline_yaml = (Path(__file__).parent / "hybrid_streaming.yml").read_text()
        self.pipeline = AsyncPipeline.loads(pipeline_yaml)

    async def run_api_async(self, question: str) -> str:
        """
        Simple async API endpoint that returns the final answer.

        Args:
            question: The user's question

        Returns:
            The LLM's answer as a string
        """
        log.trace(f"Running pipeline with question: {question}")

        result = await self.pipeline.run_async({"prompt": {"query": question}})
        return result["llm"]["replies"][0]

    async def run_chat_completion_async(self, model: str, messages: list[dict], body: dict) -> AsyncGenerator:
        """
        OpenAI-compatible chat completion endpoint with streaming support.

        This demonstrates using allow_sync_streaming_callbacks=True to enable hybrid mode,
        which allows the sync-only OpenAIGenerator to work with async_streaming_generator.

        Args:
            model: The model name (ignored in this example)
            messages: Chat messages in OpenAI format
            body: Additional request parameters

        Yields:
            Streaming chunks from the pipeline execution
        """
        log.trace(f"Running pipeline with model: {model}, messages: {messages}, body: {body}")

        question = get_last_user_message(messages)
        log.trace(f"Question: {question}")

        # ✅ Enable hybrid mode with allow_sync_streaming_callbacks=True
        # This is required because OpenAIGenerator (legacy component) only supports
        # sync streaming callbacks. Hybrid mode automatically detects this and
        # bridges the sync callback to work with the async event loop.
        #
        # If all components supported async, this would use pure async mode with no overhead.
        return async_streaming_generator(
            pipeline=self.pipeline,
            pipeline_run_args={"prompt": {"query": question}},
            allow_sync_streaming_callbacks=True,
        )
```
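As a quick sanity check after deployment, the non-streaming `run_api_async` endpoint can be called directly. A hedged sketch, assuming hayhooks' standard `/{pipeline_name}/run` route and the deployment name `async_hybrid_streaming`:

```python
import httpx

# Hypothetical quick test of the run_api_async endpoint exposed by hayhooks;
# the request body mirrors run_api_async's "question" parameter.
response = httpx.post(
    "http://localhost:1416/async_hybrid_streaming/run",
    json={"question": "What is machine learning?"},
    timeout=60.0,
)
response.raise_for_status()
print(response.json())
```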
