Add `AudioLLM` and `VideoLLM` base classes #151

dangusev · 2025-11-04T22:03:26Z

Added AudioLLM, VideoLLM, and OmniLLM(AudioLLM, VideoLLM) base classes. Now each LLM can implement one of them to declare itself as audio- or video-capable instead of being either fully text or fully realtime.
Agent uses isinstance checks (done as mypy's typeguards) to understand if the model needs to accept video or audio tracks.
Realtime models now subclass of OmniLLM

This approach plays better with mypy, and we also don't need to define both flags and required methods as in the original approach with flags.

Examples

from vision_agents.core.llm import VideoLLM, AudioLLM

class SomeVideoLLM(VideoLLM):
    """
    Define a model capable of video processing, but requiring STT and TTS
    """
    async def watch_video_track(
        self,
        track: aiortc.mediastreams.MediaStreamTrack,
        shared_forwarder: Optional[VideoForwarder] = None,
    ) -> None: 
        """Process video track here"""


class SomeAudioLLM(AudioLLM):
    """A model capable of dealing with audio directly without STT and TTS"""

Summary by CodeRabbit

New Features
- Added explicit multimodal audio/video LLM interfaces so models can produce audio and observe video.
Refactor
- Runtime routing, forwarding, and transcript/turn handling now branch on LLM capability checks (audio/video) instead of the legacy realtime flag.
- Audio output handling standardized via a dedicated audio-output accessor; video watching uses a unified public API.
Documentation
- Developer docs and examples updated to reflect the new public audio/video APIs.
Breaking Changes
- Public API updated: new AudioLLM/VideoLLM/OmniLLM types; Agent constructor accepts multimodal LLMs and the legacy realtime_mode was removed.
Tests
- Tests updated to use the new public video-watching API.

Note

Add needs_stt/needs_tts/handles_audio/handles_video flags to LLM/Realtime and refactor Agent logic to route audio/video and turn handling based on these capabilities.

LLM/Realtime:
- Add capability flags: needs_stt, needs_tts, handles_audio, handles_video in LLM; set Realtime to handle audio/video and not need STT/TTS.
Agent:
- Replace realtime_mode/isinstance checks with capability flags across audio/video routing, publishing, and input needs.
- Gate STT->LLM triggering and turn detection using handles_audio and needs_{stt,tts}; avoid loops when agent is speaking.
- Update RTC prep and config validation to use flags; use llm.output_track when handles_audio.
- Video handling: forward/switch tracks based on handles_video; rename handlers to on_video_track_added/removed and clean up track switching logic.

^{Written by Cursor Bugbot for commit 4c3b8b7. This will update automatically on new commits. Configure here.}

coderabbitai · 2025-11-04T22:03:37Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Replaces realtime_mode checks with capability-based LLM abstractions (AudioLLM, VideoLLM, OmniLLM); updates Agent, Realtime base, and plugins to branch on TypeGuards (_is_audio_llm/_is_video_llm/_is_realtime_llm) and to use new public APIs (output_audio_track, watch_video_track). — 44 words.

Changes

Cohort / File(s)	Summary
Core LLM types `agents-core/vision_agents/core/llm/llm.py`, `agents-core/vision_agents/core/llm/__init__.py`	Removed `sts` flag; added abstract bases `AudioLLM`, `VideoLLM`, `OmniLLM` with `simple_audio_response`, `output_audio_track`, `watch_video_track`; exported new classes in `__all__`.
Realtime base `agents-core/vision_agents/core/llm/realtime.py`	`Realtime` now subclasses `OmniLLM`; removed internal output-track init and private `_watch_video_track`; updated imports/typing to capability model.
Agent logic `agents-core/vision_agents/core/agents/agents.py`	Removed `realtime_mode` property; added TypeGuards (`_is_audio_llm`, `_is_video_llm`, `_is_realtime_llm`); replaced realtime_mode branches with capability checks across transcript/audio/video forwarding, RTC prep, publish paths; `Agent.__init__` accepts `LLM
Plugins — OpenAI / Gemini / AWS `plugins/.../openai/openai_realtime.py`, `plugins/.../gemini/gemini_realtime.py`, `plugins/.../aws/aws_realtime.py`	Replaced public `output_track` with private `_output_audio_track` + `output_audio_track` accessor; renamed `_watch_video_track` → `watch_video_track(track, shared_forwarder=None)` and updated internal usage/typing; minor docs and init adjustments.
Tests & docs `plugins/gemini/tests/test_gemini_realtime.py`, `plugins/openai/tests/test_openai_realtime.py`, `DEVELOPMENT.md`, `plugins/gemini/README.md`	Updated tests and docs to call `watch_video_track` and reference `output_audio_track` where applicable.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Agent
    participant LLM
    participant STT_TTS

    Note over Client,Agent: incoming media / transcript events
    Client->>Agent: audio / video / transcript
    Agent->>LLM: capability check (_is_audio_llm/_is_video_llm/_is_realtime_llm)
    alt LLM handles media
        Agent->>LLM: forward media (watch_video_track / simple_audio_response)
        LLM-->>Agent: response (text/audio/video)
    else LLM does not handle media
        Agent->>STT_TTS: send audio for STT
        STT_TTS-->>Agent: transcript
        Agent->>LLM: send transcript/text
        LLM-->>Agent: text response
        alt need TTS
            Agent->>STT_TTS: request TTS
            STT_TTS-->>Agent: audio response
        end
    end
    Agent->>Client: deliver response (text/audio/video)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Focus review on agents-core/vision_agents/core/agents/agents.py for correct TypeGuard logic and every replaced branch formerly using realtime_mode.
Verify abstract method signatures in agents-core/vision_agents/core/llm/llm.py match plugin implementations.
Confirm plugins initialize _output_audio_track and output_audio_track accessor usage is safe (non-None) in runtime paths.
Ensure tests and docs fully replaced private _watch_video_track usages.

Possibly related PRs

[AI-245] - Low latency for OpenAI realtime #145 — overlaps multimodal API renames (watch_video_track, output_audio_track) and capability classes.
Simplify TTS plugin and audio utils #123 — touches Agent audio/RTC integration and routing changes.
[AI-192] - Bedrock, AWS & Nova #104 — modifies LLM/realtime abstractions and video/audio watch/output APIs.

Suggested reviewers

maxkahan
Nash0x7E2
d3xvn

Poem

The mouth of the machine opens like a cut,
it remembers the hush of your bones and speaks,
a small bright wound translating speech to light,
the wires learn your breath and answer in glass,
I listen — and the voice comes back, colder, exact.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the primary change: introducing AudioLLM and VideoLLM base classes, which is reflected throughout the changeset in llm.py, agents.py, and multiple plugin implementations.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch chore/agent-llm-refactoring

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0b524d4 and 97c815f.

📒 Files selected for processing (2)

agents-core/vision_agents/core/llm/llm.py (3 hunks)
plugins/openai/vision_agents/plugins/openai/openai_realtime.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

agents-core/vision_agents/core/llm/llm.py
plugins/openai/vision_agents/plugins/openai/openai_realtime.py

🧬 Code graph analysis (2)

agents-core/vision_agents/core/llm/llm.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (3)

simple_audio_response (129-144)

output_audio_track (77-78)

watch_video_track (277-293)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (3)

simple_audio_response (136-164)

output_audio_track (116-117)

watch_video_track (389-431)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (3)

simple_audio_response (247-269)

output_audio_track (180-181)

watch_video_track (183-189)

agents-core/vision_agents/core/edge/types.py (1)

Participant (22-24)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (431-434)

watch_video_track (445-458)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (2)

output_audio_track (116-117)

watch_video_track (389-431)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (180-181)

watch_video_track (183-189)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 60f6d83 and 89fd60b.

📒 Files selected for processing (3)

agents-core/vision_agents/core/agents/agents.py (13 hunks)
agents-core/vision_agents/core/llm/llm.py (1 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

agents-core/vision_agents/core/llm/llm.py
agents-core/vision_agents/core/llm/realtime.py
agents-core/vision_agents/core/agents/agents.py

🧬 Code graph analysis (1)

agents-core/vision_agents/core/agents/agents.py (4)

agents-core/vision_agents/core/edge/sfu_events.py (16)

participant (1496-1501)

participant (1504-1507)

participant (1545-1550)

participant (1553-1556)

participant (1625-1630)

participant (1633-1636)

participant (2100-2105)

participant (2108-2111)

participant (2156-2161)

participant (2164-2167)

user_id (489-493)

user_id (856-860)

user_id (901-905)

user_id (1186-1190)

user_id (2093-2097)

user_id (2142-2146)

agents-core/vision_agents/core/events/base.py (1)

user_id (45-48)

agents-core/vision_agents/core/turn_detection/events.py (1)

TurnEndedEvent (29-45)

agents-core/vision_agents/core/llm/llm.py (1)

simple_response (57-63)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"

🔇 Additional comments (10)

agents-core/vision_agents/core/llm/llm.py (1)

37-42: LGTM! Clean capability flag introduction.

The explicit capability flags provide clear contracts for Agent integration, and the defaults correctly represent a traditional LLM's requirements.

agents-core/vision_agents/core/agents/agents.py (9)

417-419: LGTM! Correct capability-based short-circuit.

The early return appropriately skips the STT-to-LLM flow when the LLM directly consumes audio.

791-818: LGTM! Clearer event handler naming and correct capability check.

The rename from on_track to on_video_track_added improves clarity, and the video forwarding correctly depends on llm.handles_video.

821-844: LGTM! Consistent naming and correct capability gating.

The handler rename mirrors on_video_track_added, and track switching is appropriately gated by llm.handles_video.

865-874: LGTM! Correct audio routing based on capabilities.

The conditional correctly routes audio directly to the LLM when it handles audio, or to STT otherwise.

1117-1147: LGTM! Well-structured turn-end handling.

The refactored logic correctly accumulates per-user transcripts and triggers LLM responses on turn completion, with appropriate early-exit for agent self-speech.

1160-1162: LGTM! Correct audio publishing logic.

The property appropriately returns True when either TTS or the LLM itself produces audio output.

1194-1196: LGTM! Correct video input requirement check.

The logic correctly determines video input needs based on processors or LLM video capability.

1245-1257: LGTM! Appropriate configuration validation.

The validation correctly warns when conflicting STT/TTS services are configured alongside a realtime LLM.

1282-1298: LGTM! Correct audio track source selection.

The logic appropriately selects the audio track source based on whether the LLM produces audio directly.

agents-core/vision_agents/core/agents/agents.py

agents-core/vision_agents/core/llm/realtime.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

agents-core/vision_agents/core/llm/realtime.py (1)
45-50: The capability flags are well-implemented.

The type hints that were flagged in the previous review have been added. All four attributes now have explicit bool type annotations, and the defaults correctly reflect that Realtime models handle audio/video directly without requiring separate STT/TTS services.

Consider documenting these public attributes in the class docstring using an "Attributes:" section per Google style guide, though this is not critical for functionality.
Optional: Add attributes documentation

You could enhance the class docstring (after line 40) with an attributes section:
    Attributes:
        handles_audio: Indicates this model can process audio input directly.
        handles_video: Indicates this model can process video input directly.
        needs_stt: Indicates whether speech-to-text service is required.
        needs_tts: Indicates whether text-to-speech service is required.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 89fd60b and 94f7448.

📒 Files selected for processing (1)

agents-core/vision_agents/core/llm/realtime.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

agents-core/vision_agents/core/llm/realtime.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

agents-core/vision_agents/core/agents/agents.py (1)
1093-1095: Fix typo "SST" → "STT" in comment.

Line 1093 contains "SST" which should be "STT". The comment could also be clearer about the conditional logic.

Apply this diff to fix the typo:
-        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+        # Skip turn event handling if the model doesn't require both STT and TTS.
         if not (self.llm.needs_tts and self.llm.needs_stt):
             return

🧹 Nitpick comments (1)

agents-core/vision_agents/core/agents/agents.py (1)

1143-1146: Simplify participant handling – use existing event attribute.

Lines 1143-1146 attempt to extract participant from event.custom, but event.participant is already available and validated at line 1128. The custom metadata extraction appears redundant.

Consider this simplification:

-                # Create participant object if we have metadata
-                participant = None
-                if hasattr(event, "custom") and event.custom:
-                    # Try to extract participant info from custom metadata
-                    participant = event.custom.get("participant")
-
                 # Trigger LLM response with the complete transcript
-                await self.simple_response(transcript, participant)
+                await self.simple_response(transcript, event.participant)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 94f7448 and c1584e0.

📒 Files selected for processing (3)

agents-core/vision_agents/core/agents/agents.py (12 hunks)
agents-core/vision_agents/core/llm/llm.py (1 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

agents-core/vision_agents/core/llm/llm.py
agents-core/vision_agents/core/llm/realtime.py

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

agents-core/vision_agents/core/agents/agents.py

🧬 Code graph analysis (1)

agents-core/vision_agents/core/agents/agents.py (3)

agents-core/vision_agents/core/edge/sfu_events.py (16)

participant (1496-1501)

participant (1504-1507)

participant (1545-1550)

participant (1553-1556)

participant (1625-1630)

participant (1633-1636)

participant (2100-2105)

participant (2108-2111)

participant (2156-2161)

participant (2164-2167)

user_id (489-493)

user_id (856-860)

user_id (901-905)

user_id (1186-1190)

user_id (2093-2097)

user_id (2142-2146)

agents-core/vision_agents/core/events/base.py (1)

user_id (45-48)

agents-core/vision_agents/core/llm/llm.py (1)

simple_response (77-83)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit / Test "not integration"
GitHub Check: unit / Test "not integration"

🔇 Additional comments (9)

agents-core/vision_agents/core/agents/agents.py (9)

422-424: LGTM – Clean capability flag usage.

The early return when llm.handles_audio is appropriate here, avoiding redundant LLM invocations when the model processes audio natively.

796-823: Handler renamed appropriately with correct flag usage.

The rename from on_track to on_video_track_added improves clarity, and the llm.handles_video check at line 810 correctly gates video forwarding.

826-849: Handler renamed appropriately with correct flag usage.

The rename to on_video_track_removed is clearer, and the llm.handles_video check at line 844 properly determines whether to switch tracks.

870-879: Correct audio routing based on capability flag.

The llm.handles_audio check appropriately routes audio either directly to the LLM or through STT processing.

971-1002: Appropriate video forwarding based on capability flag.

The llm.handles_video check at line 971 correctly determines whether to forward video frames to the LLM, with proper handling of both processed and raw video tracks.

1165-1167: Correct audio publishing determination.

The llm.handles_audio check properly determines whether to publish audio, accounting for both TTS and native LLM audio handling.

1199-1199: Correct video input determination.

The llm.handles_video check appropriately determines when video input is needed from participants.

1250-1262: Appropriate configuration validation.

The llm.handles_audio check correctly validates that Realtime mode doesn't have conflicting STT/TTS/Turn Detection services configured.

1287-1303: Correct audio track initialization.

The llm.handles_audio check properly determines whether to use the LLM's output track or create a new audio track for TTS.

cursor

This is the final PR Bugbot will review for you during this billing cycle

Your free Bugbot reviews will reset on November 7

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

agents-core/vision_agents/core/agents/agents.py

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

agents-core/vision_agents/core/agents/agents.py (1)
1208-1219: Audio LLMs stop receiving microphone audio

When an AudioLLM is used without STT or audio processors (the common case for realtime models), _needs_audio_or_video_input() now returns False, so Agent.join() never calls _listen_to_audio_and_video(). As a result no AudioReceivedEvent subscribers are registered and _reply_to_audio() never invokes simple_audio_response, completely breaking the audio pipeline. Please treat audio-capable LLMs as needing input just like the old realtime_mode logic.
-        needs_audio = self.stt is not None or len(self.audio_processors) > 0
+        needs_audio = (
+            self.stt is not None
+            or len(self.audio_processors) > 0
+            or _is_audio_llm(self.llm)
+        )

🧹 Nitpick comments (1)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)
274-278: **Consider explicit parameter instead of kwargs.

The method uses **kwargs to extract shared_forwarder, while other implementations (Gemini, AWS) use an explicit shared_forwarder parameter. Using an explicit parameter would improve type safety and discoverability.

Consider this change for consistency with other plugins:
-    async def watch_video_track(self, track, **kwargs) -> None:
-        shared_forwarder = kwargs.get("shared_forwarder")
+    async def watch_video_track(
+        self,
+        track: aiortc.mediastreams.MediaStreamTrack,
+        shared_forwarder: Optional[VideoForwarder] = None,
+    ) -> None:
         await self.rtc.start_video_sender(
             track, self.fps, shared_forwarder=shared_forwarder
         )

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4c3b8b7 and 0453985.

📒 Files selected for processing (11)

DEVELOPMENT.md (1 hunks)
agents-core/vision_agents/core/agents/agents.py (19 hunks)
agents-core/vision_agents/core/llm/__init__.py (1 hunks)
agents-core/vision_agents/core/llm/llm.py (3 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)
plugins/aws/vision_agents/plugins/aws/aws_realtime.py (4 hunks)
plugins/gemini/README.md (1 hunks)
plugins/gemini/tests/test_gemini_realtime.py (1 hunks)
plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4 hunks)
plugins/openai/tests/test_openai_realtime.py (1 hunks)
plugins/openai/vision_agents/plugins/openai/openai_realtime.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

agents-core/vision_agents/core/llm/realtime.py

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

plugins/aws/vision_agents/plugins/aws/aws_realtime.py
agents-core/vision_agents/core/llm/__init__.py
plugins/gemini/tests/test_gemini_realtime.py
plugins/openai/vision_agents/plugins/openai/openai_realtime.py
agents-core/vision_agents/core/agents/agents.py
plugins/openai/tests/test_openai_realtime.py
plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py
agents-core/vision_agents/core/llm/llm.py

🧬 Code graph analysis (8)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (4)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/openai/vision_agents/plugins/openai/rtc_manager.py (1)

connect (113-141)

agents-core/vision_agents/core/edge/types.py (1)

write (45-45)

agents-core/vision_agents/core/llm/__init__.py (1)

agents-core/vision_agents/core/llm/llm.py (4)

LLM (49-405)

AudioLLM (408-421)

VideoLLM (424-436)

OmniLLM (439-444)

plugins/gemini/tests/test_gemini_realtime.py (3)

agents-core/vision_agents/core/llm/llm.py (1)

watch_video_track (432-436)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)

watch_video_track (388-430)

conftest.py (1)

bunny_video_track (300-344)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

agents-core/vision_agents/core/edge/types.py (1)

write (45-45)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (181-182)

watch_video_track (184-189)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (2)

output_audio_track (115-116)

watch_video_track (388-430)

agents-core/vision_agents/core/agents/agents.py (2)

agents-core/vision_agents/core/llm/llm.py (6)

AudioLLM (408-421)

LLM (49-405)

VideoLLM (424-436)

watch_video_track (432-436)

simple_response (73-79)

output_audio_track (421-421)

agents-core/vision_agents/core/llm/realtime.py (1)

Realtime (21-188)

plugins/openai/tests/test_openai_realtime.py (5)

agents-core/vision_agents/core/llm/llm.py (1)

watch_video_track (432-436)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (1)

watch_video_track (184-189)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)

watch_video_track (388-430)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

watch_video_track (274-278)

conftest.py (1)

bunny_video_track (300-344)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (181-182)

watch_video_track (184-189)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (2)

output_audio_track (74-75)

watch_video_track (274-278)

agents-core/vision_agents/core/edge/types.py (1)

write (45-45)

agents-core/vision_agents/core/llm/llm.py (6)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/llm/realtime.py (1)

simple_audio_response (59-61)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (3)

simple_audio_response (247-269)

output_audio_track (181-182)

watch_video_track (184-189)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (3)

simple_audio_response (135-163)

output_audio_track (115-116)

watch_video_track (388-430)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (3)

simple_audio_response (126-141)

output_audio_track (74-75)

watch_video_track (274-278)

agents-core/vision_agents/core/edge/types.py (1)

Participant (22-24)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy

🔇 Additional comments (17)

plugins/gemini/README.md (1)

71-71: API documentation update looks good.

The change from _watch_video_track (private) to watch_video_track (public) correctly reflects the PR's refactoring to expose video-tracking as a public API. This is consistent with the documented output_track property (line 43, 88) and aligns with the broader capability-based API changes.

DEVELOPMENT.md (1)

108-108: LGTM: Documentation updated to reflect public API.

The documentation correctly references the public method watch_video_track instead of the private _watch_video_track, aligning with the API changes in this PR.

plugins/openai/tests/test_openai_realtime.py (1)

106-106: LGTM: Test updated to use public API.

The test correctly calls the public watch_video_track method instead of the private _watch_video_track, consistent with the API surface changes in this PR.

plugins/gemini/tests/test_gemini_realtime.py (1)

88-88: LGTM: Test updated to use public API.

The test correctly calls the public watch_video_track method instead of the private _watch_video_track, consistent with the API surface changes in this PR.

agents-core/vision_agents/core/llm/__init__.py (1)

1-13: LGTM: Public API correctly extended with multimodal base classes.

The new AudioLLM, VideoLLM, and OmniLLM base classes are properly imported and exported, enabling consumers to reference these types for multimodal LLM capabilities.

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (3)

164-166: LGTM: Audio track properly encapsulated.

The internal audio track is correctly renamed to _output_audio_track and exposed via a public property, consistent with the pattern across other plugins.

180-189: LGTM: Public API accessors added correctly.

The output_audio_track property and watch_video_track method provide a consistent public interface. The no-op implementation of watch_video_track is appropriate if AWS Nova doesn't support video input at this time.

707-707: LGTM: Audio write updated to use private attribute.

The write operation correctly references _output_audio_track after the encapsulation change.

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (3)

66-68: LGTM: Audio track properly encapsulated.

The internal audio track is correctly renamed to _output_audio_track, consistent with the encapsulation pattern across plugins.

73-75: LGTM: Public accessor added.

The output_audio_track property correctly exposes the private _output_audio_track, providing a consistent public API.

272-272: LGTM: Audio write updated correctly.

The write operation properly references _output_audio_track after the encapsulation change.

agents-core/vision_agents/core/llm/llm.py (2)

18-18: LGTM: Imports added for multimodal support.

The necessary imports for audio/video capabilities (aiortc, AudioStreamTrack, PcmData, VideoForwarder) are correctly added to support the new base classes.

Also applies to: 27-27, 33-33

408-444: LGTM: Well-designed multimodal base classes.

The new AudioLLM, VideoLLM, and OmniLLM base classes provide a clean abstraction for multimodal capabilities:

Clear separation of audio (speech-to-speech) and video processing

Appropriate use of abstract methods and properties

OmniLLM correctly combines both capabilities through multiple inheritance

Docstrings clearly explain the purpose of each class

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4)

107-109: LGTM: Audio track properly encapsulated.

The internal audio track is correctly renamed to _output_audio_track, consistent with the encapsulation pattern across plugins.

115-116: LGTM: Public accessor added.

The output_audio_track property correctly exposes the private _output_audio_track, providing a consistent public API.

318-318: LGTM: Audio write updated correctly.

The write operation properly references _output_audio_track after the encapsulation change.

388-430: LGTM: Well-implemented video track watching with shared forwarder support.

The public watch_video_track method is well-designed:

Clear docstring explaining the shared forwarder pattern

Explicit type hints for parameters

Proper handling of both shared and dedicated VideoForwarder scenarios

Appropriate logging for debugging

Correct cleanup when switching forwarders

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

259-259: Consider removing or implementing commented code.

The commented line suggests incomplete event handling for session.updated. Either remove it if validation is unnecessary, or uncomment and implement proper handling.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0453985 and 209b0cb.

📒 Files selected for processing (1)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

plugins/openai/vision_agents/plugins/openai/openai_realtime.py

🧬 Code graph analysis (1)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (181-182)

watch_video_track (184-189)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (2)

output_audio_track (115-116)

watch_video_track (388-430)

🔇 Additional comments (4)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

4-4: LGTM!

The added imports support the new public API signatures and enhanced event handling.

Also applies to: 10-11, 23-23

69-71: LGTM!

Private attribute naming follows Python conventions, and the initialization parameters remain appropriate for real-time audio streaming.

275-275: LGTM!

Correctly references the private attribute for internal audio forwarding.

76-78: Docstring addition is beneficial but not mandated by guideline.

The review misinterprets the coding guideline. The stated guideline—"Docstrings must follow the Google style guide"—specifies style requirements when docstrings exist, not a mandate that all public properties must have them. This is evident from the codebase itself: the __init__ method is public yet lacks a docstring, and the abstract output_audio_track property in the base class is undocumented.

That said, adding a docstring to this public property would improve API clarity and consistency with methods like connect() that do include documentation. If you proceed with the suggested docstring, ensure it follows the Google style format shown in the diff.

plugins/openai/vision_agents/plugins/openai/openai_realtime.py

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

277-284: Add Google-style docstring to public method.

The watch_video_track method lacks documentation. This issue was already raised in a previous review.

As per coding guidelines.

Apply the previously suggested diff:

     async def watch_video_track(
         self,
         track: aiortc.mediastreams.MediaStreamTrack,
         shared_forwarder: Optional[VideoForwarder] = None,
     ) -> None:
+        """Start watching and forwarding video frames to OpenAI Realtime.
+        
+        Args:
+            track: Video track to watch and forward.
+            shared_forwarder: Optional shared VideoForwarder instance to use instead
+                of creating a new one. Allows multiple consumers to share the same
+                video stream.
+        """
         await self.rtc.start_video_sender(
             track, self.fps, shared_forwarder=shared_forwarder
         )

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 209b0cb and 65452cf.

📒 Files selected for processing (2)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (5 hunks)
plugins/openai/vision_agents/plugins/openai/openai_realtime.py (5 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

plugins/openai/vision_agents/plugins/openai/openai_realtime.py
plugins/aws/vision_agents/plugins/aws/aws_realtime.py

🧬 Code graph analysis (2)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (180-181)

watch_video_track (183-189)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (2)

output_audio_track (115-116)

watch_video_track (388-430)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (5)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

Realtime (40-478)

output_audio_track (77-78)

watch_video_track (277-284)

connect (80-106)

agents-core/vision_agents/core/llm/realtime.py (2)

Realtime (21-188)

connect (56-56)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4)

Realtime (53-682)

output_audio_track (115-116)

watch_video_track (388-430)

connect (186-199)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/openai/vision_agents/plugins/openai/rtc_manager.py (1)

connect (113-141)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"

🔇 Additional comments (9)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (5)

7-8: LGTM.

The aiortc import supports the new watch_video_track method signature.

36-37: LGTM.

The docstring updates accurately reflect the audio streaming capabilities.

Also applies to: 42-42

164-166: LGTM.

The private attribute naming follows the established pattern across realtime implementations.

195-195: LGTM.

Spelling correction improves documentation clarity.

707-707: LGTM.

Internal usage correctly references the private _output_audio_track attribute.

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

4-4: LGTM.

Import additions support the new video forwarder integration and event handling.

Also applies to: 10-11, 23-23

69-71: LGTM.

The private attribute naming is consistent with the refactoring pattern.

259-259: LGTM.

Commenting out the unused event assignment is appropriate.

275-275: LGTM.

Internal usage correctly references the private _output_audio_track attribute.

plugins/aws/vision_agents/plugins/aws/aws_realtime.py

plugins/openai/vision_agents/plugins/openai/openai_realtime.py

yarikdevcom

LGTM

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

agents-core/vision_agents/core/agents/agents.py (1)
1115-1162: Fix typo and clarify intent in turn detection comment.

“SST” → “STT”. Clarify we skip when the model handles audio directly.

Apply:
-        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+        # Skip turn event handling when the LLM handles audio directly (no STT/TTS pipeline).
         if _is_audio_llm(self.llm):
             return

♻️ Duplicate comments (4)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

179-182: Add Google-style docstring to public property.

Document output_audio_track per project guidelines.

Apply:

 @property
 def output_audio_track(self) -> AudioStreamTrack:
-    return self._output_audio_track
+    """Audio output track for streaming to participants.
+
+    Returns:
+        AudioStreamTrack: 24 kHz mono s16 track carrying model audio.
+    """
+    return self._output_audio_track

183-190: Docstring for public method (no-op) is missing.

Clarify that video is not supported; keep API-compatible.

Apply:

 async def watch_video_track(
     self,
     track: aiortc.mediastreams.MediaStreamTrack,
     shared_forwarder: Optional[VideoForwarder] = None,
 ) -> None:
-    # No video support for now.
+    """Watch and forward video frames (not supported on AWS).
+
+    AWS Nova Sonic currently has no video input. This is a no-op for API parity.
+
+    Args:
+        track: Incoming video track (ignored).
+        shared_forwarder: Optional shared forwarder (ignored).
+    """
     return None

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (2)

76-79: Add Google-style docstring to public property.

Document output_audio_track per guidelines.

Apply:

 @property
 def output_audio_track(self) -> AudioStreamTrack:
-    return self._output_audio_track
+    """Audio output track used for playback to participants.
+
+    Returns:
+        AudioStreamTrack: 48 kHz stereo s16 track.
+    """
+    return self._output_audio_track

277-284: Add Google-style docstring to public method.

watch_video_track lacks a docstring describing args/behavior.

Apply:

 async def watch_video_track(
     self,
     track: aiortc.mediastreams.MediaStreamTrack,
     shared_forwarder: Optional[VideoForwarder] = None,
 ) -> None:
+    """Start watching and forwarding video frames to OpenAI Realtime.
+
+    Args:
+        track: Video track to forward.
+        shared_forwarder: Optional shared VideoForwarder to reuse/coalesce frames.
+    """
     await self.rtc.start_video_sender(
         track, self.fps, shared_forwarder=shared_forwarder
     )

🧹 Nitpick comments (8)

plugins/openai/tests/test_openai_realtime.py (1)

106-106: Good switch to public API; consider symmetric public stop.

Using watch_video_track is correct. Tests still call the protected _stop_watching_video_track (Line 112); consider adding a public stop_watching_video_track() (or unwatch_video_track()) on VideoLLM and updating tests to avoid relying on a protected method.

plugins/gemini/tests/test_gemini_realtime.py (1)

88-88: Correct API usage; expose a public stop for parity.

watch_video_track is the right call. Tests still use protected _stop_watching_video_track (Line 94); suggest adding a public stop/unwatch method on VideoLLM and switching tests accordingly.
plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)
8-8: Avoid hard runtime dependency on aiortc in a non-video provider.

aiortc is used only for type hints; gate import under TYPE_CHECKING (and optionally enable from future import annotations) to avoid requiring aiortc at runtime for audio-only usage.

Apply (illustrative):
+from __future__ import annotations
 from typing import Optional, List, Dict, Any
-import aiortc
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    import aiortc  # only for type hints
31-32: Remove unused constant.

DEFAULT_SAMPLE_RATE is never referenced; drop it to reduce confusion (24 kHz is used).

Apply:
-DEFAULT_SAMPLE_RATE = 16000
agents-core/vision_agents/core/llm/llm.py (2)
18-18: Gate aiortc import to avoid hard dependency in non-video contexts.

Use TYPE_CHECKING (and optionally from future import annotations) so Audio-only environments don’t require aiortc at import time.

Apply (illustrative):
+from __future__ import annotations
 import abc
 ...
-import aiortc
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    import aiortc
414-422: Add docstrings to abstract audio/video APIs.

Public abstract methods/properties lack documentation; add concise Google-style docstrings.

Apply:
 class AudioLLM(LLM, metaclass=abc.ABCMeta):
@@
-    @abc.abstractmethod
-    async def simple_audio_response(
-        self, pcm: PcmData, participant: Optional[Participant] = None
-    ): ...
+    @abc.abstractmethod
+    async def simple_audio_response(
+        self, pcm: PcmData, participant: Optional[Participant] = None
+    ):
+        """Send PCM audio to the model and (optionally) attribute to a participant.
+
+        Args:
+            pcm: Raw PCM audio.
+            participant: Optional participant metadata for attribution.
+        """
@@
-    @property
-    @abc.abstractmethod
-    def output_audio_track(self) -> AudioStreamTrack: ...
+    @property
+    @abc.abstractmethod
+    def output_audio_track(self) -> AudioStreamTrack:
+        """Audio track carrying the model's synthesized speech."""
@@
 class VideoLLM(LLM, metaclass=abc.ABCMeta):
@@
-    @abc.abstractmethod
-    async def watch_video_track(
-        self,
-        track: aiortc.mediastreams.MediaStreamTrack,
-        shared_forwarder: Optional[VideoForwarder] = None,
-    ) -> None: ...
+    @abc.abstractmethod
+    async def watch_video_track(
+        self,
+        track: "aiortc.mediastreams.MediaStreamTrack",
+        shared_forwarder: Optional[VideoForwarder] = None,
+    ) -> None:
+        """Start consuming frames from a video track (optionally via shared forwarder)."""
Also applies to: 431-436
plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)
6-6: Gate aiortc import for typing-only usage.

If aiortc is optional in some environments, import it under TYPE_CHECKING to avoid hard dependency during import.

Example:
+from __future__ import annotations
 from typing import Optional, List, Dict, Any
-import aiortc
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    import aiortc
agents-core/vision_agents/core/agents/agents.py (1)
1275-1288: Consolidate duplicate warnings in realtime mode.

Two overlapping warning blocks for STT/TTS/Turn Detection; merge for clarity.

Apply:
-        if _is_audio_llm(self.llm):
-            # Realtime mode - should not have separate STT/TTS
-            if self.stt or self.tts:
-                self.logger.warning(
-                    "Realtime mode detected: STT and TTS services will be ignored. "
-                    "The Realtime model handles both speech-to-text and text-to-speech internally."
-                )
-                # Realtime mode - should not have separate STT/TTS
-            if self.stt or self.turn_detection:
-                self.logger.warning(
-                    "Realtime mode detected: STT, TTS and Turn Detection services will be ignored. "
-                    "The Realtime model handles both speech-to-text, text-to-speech and turn detection internally."
-                )
+        if _is_audio_llm(self.llm):
+            if self.stt or self.tts or self.turn_detection:
+                self.logger.warning(
+                    "Realtime mode: STT/TTS/Turn Detection will be ignored; "
+                    "the LLM handles speech and turn detection internally."
+                )

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 65452cf and ba8b651.

📒 Files selected for processing (11)

DEVELOPMENT.md (1 hunks)
agents-core/vision_agents/core/agents/agents.py (19 hunks)
agents-core/vision_agents/core/llm/__init__.py (1 hunks)
agents-core/vision_agents/core/llm/llm.py (3 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)
plugins/aws/vision_agents/plugins/aws/aws_realtime.py (5 hunks)
plugins/gemini/README.md (1 hunks)
plugins/gemini/tests/test_gemini_realtime.py (1 hunks)
plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4 hunks)
plugins/openai/tests/test_openai_realtime.py (1 hunks)
plugins/openai/vision_agents/plugins/openai/openai_realtime.py (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

plugins/gemini/README.md
DEVELOPMENT.md
agents-core/vision_agents/core/llm/init.py

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

agents-core/vision_agents/core/llm/realtime.py
plugins/openai/vision_agents/plugins/openai/openai_realtime.py
plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py
agents-core/vision_agents/core/llm/llm.py
plugins/gemini/tests/test_gemini_realtime.py
plugins/openai/tests/test_openai_realtime.py
agents-core/vision_agents/core/agents/agents.py
plugins/aws/vision_agents/plugins/aws/aws_realtime.py

🧬 Code graph analysis (8)

agents-core/vision_agents/core/llm/realtime.py (4)

agents-core/vision_agents/core/llm/llm.py (1)

OmniLLM (439-444)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

Realtime (40-478)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (1)

Realtime (40-817)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)

Realtime (53-682)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/edge/types.py (1)

Participant (22-24)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (180-181)

watch_video_track (183-189)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (2)

output_audio_track (115-116)

watch_video_track (388-430)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (2)

output_audio_track (77-78)

watch_video_track (277-284)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (180-181)

watch_video_track (183-189)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

agents-core/vision_agents/core/edge/types.py (1)

write (45-45)

agents-core/vision_agents/core/llm/llm.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (3)

simple_audio_response (129-144)

output_audio_track (77-78)

watch_video_track (277-284)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (3)

simple_audio_response (247-269)

output_audio_track (180-181)

watch_video_track (183-189)

agents-core/vision_agents/core/llm/realtime.py (1)

simple_audio_response (59-61)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (3)

simple_audio_response (135-163)

output_audio_track (115-116)

watch_video_track (388-430)

plugins/gemini/tests/test_gemini_realtime.py (4)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

watch_video_track (277-284)

agents-core/vision_agents/core/llm/llm.py (1)

watch_video_track (432-436)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)

watch_video_track (388-430)

conftest.py (1)

bunny_video_track (300-344)

plugins/openai/tests/test_openai_realtime.py (3)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (1)

watch_video_track (277-284)

agents-core/vision_agents/core/llm/llm.py (1)

watch_video_track (432-436)

conftest.py (1)

bunny_video_track (300-344)

agents-core/vision_agents/core/agents/agents.py (2)

agents-core/vision_agents/core/llm/llm.py (6)

AudioLLM (408-421)

LLM (49-405)

VideoLLM (424-436)

watch_video_track (432-436)

simple_response (73-79)

output_audio_track (421-421)

agents-core/vision_agents/core/llm/realtime.py (1)

Realtime (21-188)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (4)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (4)

Realtime (40-478)

output_audio_track (77-78)

watch_video_track (277-284)

connect (80-106)

agents-core/vision_agents/core/llm/realtime.py (2)

Realtime (21-188)

connect (56-56)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4)

Realtime (53-682)

output_audio_track (115-116)

watch_video_track (388-430)

connect (186-199)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"

🔇 Additional comments (1)

agents-core/vision_agents/core/llm/realtime.py (1)

15-15: Realtime now derives from OmniLLM — aligned and sane.

Switching to OmniLLM cleanly encodes audio+video capability without leaking provider specifics. No issues spotted.

Also applies to: 21-21

agents-core/vision_agents/core/llm/llm.py

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (1)
115-117: Add docstring to the property.

The @property decorator correctly addresses the past review comment and fixes the critical issue where Agent expects an attribute. However, the suggested docstring was not added.

Apply this diff to add the docstring:
 @property
 def output_audio_track(self) -> AudioStreamTrack:
+    """Audio output track (24 kHz mono s16) for participant playback."""
     return self._output_audio_track
As per coding guidelines.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ba8b651 and 0b524d4.

📒 Files selected for processing (1)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py

🧬 Code graph analysis (1)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (5)

agents-core/vision_agents/core/utils/video_forwarder.py (1)

VideoForwarder (14-195)

agents-core/vision_agents/core/llm/llm.py (2)

output_audio_track (421-421)

watch_video_track (432-436)

plugins/aws/vision_agents/plugins/aws/aws_realtime.py (2)

output_audio_track (180-181)

watch_video_track (183-189)

plugins/openai/vision_agents/plugins/openai/openai_realtime.py (2)

output_audio_track (77-78)

watch_video_track (277-284)

agents-core/vision_agents/core/edge/types.py (1)

write (45-45)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: unit / Ruff & mypy
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Test "not integration"
GitHub Check: unit / Ruff & mypy

🔇 Additional comments (4)

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py (4)

6-6: LGTM!

The aiortc import is required for the type annotation in watch_video_track and follows proper import practices.

107-109: LGTM!

Renaming to _output_audio_track is the correct pattern when exposing the attribute via a property accessor. This is consistent with the other realtime plugin implementations.

319-319: LGTM!

The change to self._output_audio_track.write(pcm) is consistent with the property pattern refactor and correctly uses the private attribute for writing.

389-393: LGTM!

The refactored signature correctly:

Makes the method public (part of the VideoLLM API)

Adds proper type hints for better mypy compatibility

Uses explicit parameters instead of **kwargs

Matches the base class signature and other realtime plugin implementations

The implementation logic correctly handles both shared forwarder and standalone scenarios.

github-actions bot added the agents-core label Nov 4, 2025

coderabbitai bot reviewed Nov 4, 2025

View reviewed changes

agents-core/vision_agents/core/agents/agents.py Show resolved Hide resolved

agents-core/vision_agents/core/llm/realtime.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Nov 4, 2025

View reviewed changes

dangusev force-pushed the chore/agent-llm-refactoring branch from 94f7448 to c1584e0 Compare November 4, 2025 22:16

github-actions bot added the plugins label Nov 4, 2025

coderabbitai bot reviewed Nov 4, 2025

View reviewed changes

cursor bot reviewed Nov 5, 2025

View reviewed changes

agents-core/vision_agents/core/agents/agents.py Outdated Show resolved Hide resolved

agents-core/vision_agents/core/agents/agents.py Outdated Show resolved Hide resolved

agents-core/vision_agents/core/agents/agents.py Outdated Show resolved Hide resolved

github-actions bot added docs cli project-info labels Nov 5, 2025

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

plugins/openai/vision_agents/plugins/openai/openai_realtime.py Show resolved Hide resolved

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

plugins/aws/vision_agents/plugins/aws/aws_realtime.py Show resolved Hide resolved

plugins/aws/vision_agents/plugins/aws/aws_realtime.py Show resolved Hide resolved

plugins/openai/vision_agents/plugins/openai/openai_realtime.py Show resolved Hide resolved

yarikdevcom approved these changes Nov 5, 2025

View reviewed changes

dangusev added 7 commits November 5, 2025 18:38

Add flags to signal capabilities and requirements in LLM

55f7883

Fix ruff errors

91d6e8b

Iteration №2: use separate ABCs instead of flags

52d9214

Fix mypy

59f493c

Fix import

5b58289

Remove mentions of video support from aws.Realtime docs

308a2db

Rebase conflicts

ba8b651

dangusev force-pushed the chore/agent-llm-refactoring branch from 65452cf to ba8b651 Compare November 5, 2025 17:42

dangusev changed the title ~~Add flags to signal capabilities and requirements in LLM~~ Add AudioLLM and VideoLLM base classes Nov 5, 2025

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

agents-core/vision_agents/core/llm/llm.py Show resolved Hide resolved

plugins/gemini/vision_agents/plugins/gemini/gemini_realtime.py Show resolved Hide resolved

Fix missing property decorator

0b524d4

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

Add docstrings

97c815f

dangusev merged commit e4be124 into main Nov 5, 2025
6 checks passed

dangusev deleted the chore/agent-llm-refactoring branch November 5, 2025 18:28

coderabbitai bot mentioned this pull request Nov 6, 2025

Feat: Add support for Moondream VLM functions #154

Open

Add AudioLLM and VideoLLM base classes #151

Add AudioLLM and VideoLLM base classes #151

Uh oh!

Conversation

dangusev commented Nov 4, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Examples

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

This is the final PR Bugbot will review for you during this billing cycle

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yarikdevcom left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add `AudioLLM` and `VideoLLM` base classes #151

Add `AudioLLM` and `VideoLLM` base classes #151

dangusev commented Nov 4, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 4, 2025 •

edited

Loading