Skip to content

feat(memory_sources): add Memory Sources domain — user-configurable data connectors#2893

Merged
graycyrus merged 16 commits into
tinyhumansai:mainfrom
senamakel:feat/memory-sources
May 29, 2026
Merged

feat(memory_sources): add Memory Sources domain — user-configurable data connectors#2893
graycyrus merged 16 commits into
tinyhumansai:mainfrom
senamakel:feat/memory-sources

Conversation

@senamakel
Copy link
Copy Markdown
Member

@senamakel senamakel commented May 29, 2026

Summary

  • New memory_sources domain — a unified registry of data connectors that feed memory: Composio integrations, local folders, GitHub repos (commits/issues/PRs), RSS feeds, web pages, Twitter queries
  • Single "Memory Sources" UI panel replaces the old Composio-only sync panel; users add/toggle/sync/remove any source kind from one place
  • Per-source sync RPC (memory_sources_sync) dispatches by kind: Composio → existing engine, others → reader-based ingest into the memory tree
  • Per-source status RPC (memory_sources_status_list) returns chunks ingested, freshness label, last-sync timestamp; UI polls every 5s
  • Composio connections with sync providers (gmail/slack/github/notion/linear/clickup) are auto-seeded into the registry at startup and on every list call

Problem

Memory ingestion was previously a black box of "sync everything Composio is connected to." Users couldn't:

  • See what was feeding memory (no source registry)
  • Choose which Composio connections to sync
  • Add non-Composio sources (a folder of markdown, a GitHub repo's activity, an RSS feed)
  • Trigger sync per source with live status

The composio sync layer also had no consistent identity at the user-facing layer — "Gmail" rolled up multiple accounts and there was no API for the UI to enumerate or manage sources.

Solution

Backend: src/openhuman/memory_sources/

  • types.rsSourceKind enum (composio/folder/github_repo/twitter_query/rss_feed/web_page), MemorySourceEntry with kind-specific fields flat on the struct
  • registry.rs — CRUD over Config.memory_sources (TOML-persisted list)
  • readers/{composio,folder,github,rss,twitter,web_page}.rsSourceReader trait with list_items + read_item per kind. GitHub reader pulls project activity (commits/issues/PRs via gh CLI when available, falls back to public REST API), not source code.
  • sync.rssync_source(id) dispatches by kind. Composio → memory_sync::composio::run_connection_sync. Reader-backed kinds → walk items and ingest each via ingest_document (chunks tagged mem_src:{source_id}:{item_id}). Emits MemorySyncStageChanged events keyed by source id.
  • status.rs — Per-source SQL query against mem_tree_chunks for chunk counts + freshness; toolkit-prefix matching for composio, mem_src:{id}:% for everything else.
  • reconcile.rs — Auto-seeds the registry with active Composio connections that have a sync provider; called at startup and lazily on every memory_sources_list RPC.

Frontend:

  • memorySourcesService.ts — typed client for all 9 RPCs (list/get/add/update/remove/list_items/read_item/sync/status_list)
  • AddMemorySourceDialog.tsx — two-step picker (kind → form). Folder kind has a native folder picker via <input type=\"file\" webkitdirectory>. Composio kind fetches active connections via composio_list_connections and presents them as a dropdown.
  • MemorySourcesRegistry.tsx — single unified panel: Sync button on every row, polls status every 5s, shows chunk count / freshness pill / last-sync timestamp per row
  • Removed: the old MemorySources.tsx (composio-only panel), MemorySyncConnections.tsx, memorySyncService.ts — all consolidated under the new domain

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • N/A: Heavy net-new code lands as Rust unit tests (27 passing) + JSON-RPC E2E tests (4 passing). UI E2E coverage for the new panel is a follow-up. Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via `diff-cover`) meet the gate enforced by `.github/workflows/coverage.yml`. Run `pnpm test:coverage` and `pnpm test:rust` locally; PRs below 80% on changed lines will not merge.
  • N/A: behaviour-only change for coverage matrix — new RPCs are catalog additions, not new product features. Coverage matrix updated — added/removed/renamed feature rows in `docs/TEST-COVERAGE-MATRIX.md` reflect this change (or `N/A: behaviour-only change`)
  • N/A: no matrix updates so no IDs to list. All affected feature IDs from the matrix are listed in the PR description under `## Related`
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: no release-cut surfaces touched. Manual smoke checklist updated if this touches release-cut surfaces (`docs/RELEASE-MANUAL-SMOKE.md`)
  • N/A: no linked issue. Linked issue closed via `Closes #NNN` in the `## Related` section

Impact

  • Runtime: desktop only. New TOML field `[[memory_sources]]` on `Config`; `#[serde(default)]` so existing configs load unchanged.
  • Migration: none required. On first launch with the new build, Composio connections are auto-upserted into the registry.
  • Performance: status polling is one SQL roundtrip per source every 5s — cheap. Sync runs in a spawned task so the RPC returns immediately.
  • Security: folder reader has path-traversal prevention via canonicalize-and-prefix-check; web page reader strips HTML; no new external network deps.

Related

  • Closes:
  • Follow-up PR(s)/TODOs: UI E2E coverage for the new panel; Twitter API credential wiring; SSE-based push for sync progress (currently 5s poll).

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: `feat/memory-sources`
  • Commit SHA: `92cf2a7f36e8f538fac00fe486d1cd689cd3b6f2`

Validation Run

  • `pnpm --filter openhuman-app format:check`
  • `pnpm typecheck`
  • Focused tests: `cargo test --lib -- memory_sources` (27 passing), `pnpm vitest run src/services/memorySourcesService.test.ts` (6 passing), `cargo test --test memory_sources_e2e` (4 passing)
  • Rust fmt/check (if changed): `cargo fmt --check` + `cargo check` clean
  • N/A — only `app/src/` and Rust core touched. Tauri fmt/check (if changed):

Validation Blocked

  • `command:` N/A
  • `error:` N/A
  • `impact:` N/A

Behavior Changes

  • Intended behavior change: introduce per-source memory sources registry + unified Sync UI; auto-seed Composio connections so users see their integrations as memory sources by default
  • User-visible effect: new "Memory Sources" panel in the Memory tab; "Add Source" dialog for non-Composio kinds; per-row Sync button and live chunk/freshness display

Parity Contract

  • Legacy behavior preserved: the existing `memory_sync_status_list` per-toolkit RPC is removed at the UI layer but the underlying `memory_sync::composio::run_connection_sync` engine and its events are unchanged; existing Composio sync providers continue to drive ingestion identically.
  • Guard/fallback/dispatch parity checks: `memory_sync::composio::list_sync_targets` now reads from the registry first and falls back to scanning all connections when the registry is empty — preserves the pre-registry path for first-launch / not-yet-reconciled state.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this
  • Resolution: N/A

Summary by CodeRabbit

  • New Features

    • Memory Sources Registry: unified UI to add, list, enable/disable, remove, and sync memory sources; two‑step add dialog for choosing kind and configuring.
    • Adds support for Composio, local folders, GitHub repos, RSS feeds, web pages, and Twitter queries; Composio connections are auto-registered and can be synced.
  • Tests

    • New unit and end-to-end tests covering memory sources flows and service RPCs.
  • Documentation

    • New i18n strings for memory-sources across many languages.

@senamakel senamakel requested a review from a team May 29, 2026 04:13
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

Walkthrough

Adds a Memory Sources domain (types, readers, registry, RPC/schemas, sync/status), frontend registry and add-source dialog, TypeScript client and tests, i18n keys, Composio reconciliation on startup/post-OAuth, and Rust e2e tests exercising CRUD, listing, reads, ingestion, and Composio registry flows.

Changes

Memory Sources Feature

Layer / File(s) Summary
Core model, registry, RPC & schemas
src/openhuman/memory_sources/types.rs, src/openhuman/memory_sources/registry.rs, src/openhuman/memory_sources/rpc.rs, src/openhuman/memory_sources/schemas.rs, src/openhuman/config/schema/types.rs, src/core/all.rs
Defines SourceKind, MemorySourceEntry, SourceItem/SourceContent, registry CRUD (config-backed), RPC handlers (list/get/add/update/remove/list_items/read_item/sync/status_list), controller schemas, and registers controllers/config.
Readers implementations
src/openhuman/memory_sources/readers/{mod,composio,folder,github,rss,twitter,web_page}.rs
Adds SourceReader trait and concrete readers: Composio (placeholder items), Folder (glob + safe reads), Github (gh/REST commits/issues/PRs), RSS (manual XML parsing), Twitter (not-configured stub), WebPage (fetch + selector/strip).
Sync orchestration & status
src/openhuman/memory_sources/sync.rs, src/openhuman/memory_sources/status.rs
Adds background sync orchestration (staged events, reader ingestion, Composio delegation) and status computation (synced/pending chunk counts, freshness labels) with DB queries and unit tests.
Reconciliation & Composio integration
src/openhuman/memory_sources/reconcile.rs, src/openhuman/memory_sync/composio/{bus,mod}.rs, src/core/jsonrpc.rs
Boot-time and post-OAuth upsert of Composio connections into memory_sources; scan_active_sync_targets fallback and registration wiring.
Frontend: Registry UI and Add dialog
app/src/components/intelligence/AddMemorySourceDialog.tsx, app/src/components/intelligence/MemorySourcesRegistry.tsx, app/src/components/intelligence/MemoryWorkspace.tsx
New AddMemorySourceDialog (two-step kind selection and kind-specific inputs, Composio picker) and MemorySourcesRegistry listing with sync/enable/remove controls; workspace switched to registry component.
Client service, tests, i18n & e2e
app/src/services/memorySourcesService.ts, app/src/services/memorySourcesService.test.ts, app/src/lib/i18n/chunks/*-3.ts, tests/memory_sources_e2e.rs
TypeScript RPC client wrappers and unit tests; i18n entries added across locale chunks; comprehensive Rust e2e tests covering folder/GitHub/Composio flows; related frontend test updates and removals.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Frontend
  participant TSClient as memorySourcesService
  participant CoreRpc
  participant Registry
  participant Reader
  participant DB as MemTree
  User->>Frontend: click "Add" / "Sync"
  Frontend->>TSClient: addMemorySource() / syncMemorySource()
  TSClient->>CoreRpc: openhuman.memory_sources_add / openhuman.memory_sources_sync
  CoreRpc->>Registry: add_source() / sync_source()
  Registry-->>CoreRpc: source entry / ok
  CoreRpc->>Reader: reader_for(kind).list_items() / read_item()
  Reader-->>DB: ingest_document(...) (via sync)
  DB-->>CoreRpc: ingestion results
  CoreRpc-->>TSClient: RpcOutcome
  TSClient-->>Frontend: success / status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • sanil-23
  • graycyrus
  • M3gA-Mind

"🐰 I hopped through code to bind each source,
Composio, folders, feeds — a handy force,
Dialogs pop and syncs take flight,
Memories stored by day and night,
~A rabbit cheers the registry's course!"

@coderabbitai coderabbitai Bot added feature Net-new user-facing capability or product behavior. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. working A PR that is being worked on by the team. labels May 29, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/memory_sync/composio/bus.rs (1)

558-563: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Auto-registration is skipped for toolkits without a provider.

The early return in the get_provider miss path exits before upsert_composio_source(...), so many valid connections never appear in memory_sources.

💡 Proposed fix
-            let Some(provider) = get_provider(&toolkit) else {
+            let provider = get_provider(&toolkit);
+            if provider.is_none() {
                 tracing::debug!(
                     toolkit = %toolkit,
                     "[composio:bus] no provider registered for toolkit; cache refreshed, no provider hook to dispatch"
                 );
-                return;
-            };
+            }

-            if let Err(e) = provider.on_connection_created(&ctx).await {
-                tracing::warn!(
-                    toolkit = %toolkit,
-                    connection_id = %connection_id,
-                    error = %e,
-                    "[composio:bus] provider on_connection_created failed"
-                );
-            } else {
-                // Successful connection-created sync — record the
-                // timestamp so the periodic scheduler doesn't
-                // immediately re-fire for this connection.
-                super::periodic::record_sync_success(&toolkit, &connection_id);
+            if let Some(provider) = provider {
+                if let Err(e) = provider.on_connection_created(&ctx).await {
+                    tracing::warn!(
+                        toolkit = %toolkit,
+                        connection_id = %connection_id,
+                        error = %e,
+                        "[composio:bus] provider on_connection_created failed"
+                    );
+                } else {
+                    super::periodic::record_sync_success(&toolkit, &connection_id);
+                }
             }

Also applies to: 580-596

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory_sync/composio/bus.rs` around lines 558 - 563, The early
return in the composio bus path when get_provider(&toolkit) returns None
prevents calling upsert_composio_source and thus skips auto-registration of
valid connections; modify the branch so that when get_provider(&toolkit) is None
you still call upsert_composio_source(...) to insert/update the connection into
memory_sources (preserve the existing tracing::debug log), and then continue
without attempting to dispatch provider hooks. Apply the same change to the
analogous block around the 580-596 region so both miss-paths auto-register via
upsert_composio_source while avoiding the provider dispatch.
🧹 Nitpick comments (5)
src/openhuman/memory_sources/readers/composio.rs (1)

46-65: ⚡ Quick win

Add a structured debug log in read_item for flow parity.

read_item is a new/changed flow but currently has no diagnostic logging, while list_items already does. Adding one namespaced debug event here improves traceability when source sync behavior is investigated.

Proposed patch
     async fn read_item(
         &self,
         source: &MemorySourceEntry,
         item_id: &str,
         _config: &Config,
     ) -> Result<SourceContent, String> {
         let toolkit = source.toolkit.as_deref().unwrap_or("unknown");
+        tracing::debug!(
+            toolkit = %toolkit,
+            item_id = %item_id,
+            connection_id = ?source.connection_id,
+            "[memory_sources:composio] read_item"
+        );
         Ok(SourceContent {
             id: item_id.to_string(),
             title: format!("{toolkit} sync data"),

As per coding guidelines: "Use log / tracing at debug / trace levels for verbose diagnostics on new/changed flows in Rust; never log secrets or full PII—redact sensitive data."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory_sources/readers/composio.rs` around lines 46 - 65, Add a
namespaced debug log at the start of the async fn read_item to mirror
list_items' diagnostics: use the tracing (or log) crate at debug or trace level,
include non-sensitive identifiers like toolkit and item_id, and redact or omit
sensitive fields such as full connection_id (e.g., log a truncated or hashed
connection_id). Target the read_item function and ensure the log message clearly
states it's entering read_item for the given toolkit and item_id to aid
traceability.
src/openhuman/memory_sources/readers/web_page.rs (1)

92-97: ⚖️ Poor tradeoff

Title extraction is fragile.

The extract_title function uses simple string searching without handling HTML entities or attributes on the <title> tag. For example:

  • <title lang="en">Page</title> will fail (finds <title but the > search includes the attribute)
  • HTML entities like &lt; in the title won't be decoded.

Consider using a proper HTML parsing library (like scraper mentioned in the comments) or add attribute handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory_sources/readers/web_page.rs` around lines 92 - 97, The
extract_title function is too fragile: update extract_title to use a robust HTML
parser (e.g., the scraper crate) instead of manual string searches so it
correctly handles attributes on <title> (e.g., <title lang="en">) and decodes
HTML entities; locate the extract_title function in web_page.rs and replace the
manual find-based logic with parsing that selects the <title> element (or falls
back safely) and returns its text content with HTML entities decoded.
src/openhuman/memory_sources/types.rs (2)

97-99: ⚡ Quick win

Consider path validation for security.

The Folder kind requires a path field but does not validate it here. Ensure the folder reader (e.g., readers/folder.rs) prevents path traversal attacks by canonicalizing paths and checking they don't escape allowed boundaries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory_sources/types.rs` around lines 97 - 99, The Folder arm
(SourceKind::Folder) only requires a path via require_field(&self.path, "path")?
but lacks validation; update the folder reader (readers/folder.rs) to
canonicalize the provided path (e.g., std::fs::canonicalize) and verify it is
inside an allowed base directory (using Path::starts_with or strip_prefix to
ensure it does not escape the boundary) and reject or return an error if
canonicalization fails or the path is outside the allowed root to prevent path
traversal.

100-111: ⚡ Quick win

Consider URL validation for external source kinds.

Source kinds that use url (GithubRepo, TwitterQuery, RssFeed, WebPage) do not validate URL format at the type level. Confirm that readers validate URLs before making network requests to prevent SSRF or invalid URL errors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory_sources/types.rs` around lines 100 - 111, The SourceKind
branches that call require_field(&self.url, "url") (GithubRepo, RssFeed,
WebPage) need to also validate that self.url is a well-formed, safe URL before
any network use: inside the same validation function where require_field is used
(the impl that validates SourceKind / the struct containing url), parse the
value with a strict URL parser (e.g., url::Url::parse) and ensure the scheme is
http or https (reject other schemes and invalid parses), returning a clear
validation error; do not change TwitterQuery handling (it uses query, not url)
but add a unit test for each invalid/unsupported URL case and for a valid
http/https URL to ensure the new checks work.
app/src/components/settings/panels/__tests__/MemoryDataPanel.test.tsx (1)

95-103: ⚡ Quick win

Restore this error-path test instead of skipping it.

Skipping this case removes coverage for a user-facing resilience requirement (preset controls should remain accessible during sync errors). Please re-enable it with deterministic stubbing of the current memory-sources status path.

As per coding guidelines: “keep tests deterministic with no real network calls or time-sensitive flakes” and “prefer behavior over implementation in unit tests”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/components/settings/panels/__tests__/MemoryDataPanel.test.tsx` around
lines 95 - 103, Unskip the test (change it.skip to it) and make the
sync-connections/status path deterministic by stubbing the network/hook that
returns memory-sources status to produce the error case ("network timeout")
while keeping other dependencies normal (keep resolveConfigWith('balanced') and
renderWithProviders(<MemoryDataPanel />) as-is); implement this by mocking the
module or function that MemoryDataPanel uses to fetch sync connections (e.g.,
the memory-sources status fetch hook or API client) with jest.mock/jest.spyOn or
MSW to return a rejected promise or error response, then assert the UI shows the
error and that preset buttons remain accessible. Ensure the mock only affects
the sync connections status endpoint and does not perform real network calls so
the test is deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src/components/intelligence/AddMemorySourceDialog.tsx`:
- Around line 319-320: In AddMemorySourceDialog, replace all hard-coded
user-visible placeholders (e.g. "/Users/you/notes", "**/*.md", URL examples,
"article", "from:user AI safety", etc.) with i18n lookups using the useT() hook
(e.g. t('addMemorySource.placeholderPath')) and pass those t(...) results into
the JSX placeholder props; add corresponding keys to the locale resource files
and ensure useT() is imported and invoked in the AddMemorySourceDialog component
so inputs like the path/file/glob/type/search placeholders call t('...') instead
of embedding literals.
- Around line 61-75: The effect currently calls setLoadingConnections(true)
synchronously in the effect body; move that state update into the async IIFE
(make it the first statement inside the async function) so state changes happen
asynchronously, and keep the existing calls to listConnections(),
setConnections(resp.connections),
setError(t('memorySources.composioListFailed')), and
setLoadingConnections(false) inside that async block; additionally, add a
mounted flag (e.g., let mounted = true; and check mounted before calling
setConnections/setError/setLoadingConnections, and set mounted = false in a
returned cleanup) to avoid updating state after unmount — target the useEffect
that references kind, listConnections, setLoadingConnections, setConnections,
and setError.

In `@app/src/components/intelligence/MemorySourcesRegistry.tsx`:
- Around line 313-321: The relative-time string literals in the time formatting
block (e.g., 'just now', `${seconds}s ago`, `${minutes}m ago`, `${hours}h ago`,
`${days}d ago`) must be replaced with i18n strings via the useT() hook used in
this component (MemorySourcesRegistry). Import/use useT() (or the existing t
instance) and replace each literal with t('...') keys (for example:
t('relative.justNow'), t('relative.secondsAgo', {count: seconds}),
t('relative.minutesAgo', {count: minutes}), etc.), formatting the numeric
interpolation through the t call; ensure you update the keys consistently and
use the same unique function/block (the relative-time formatter) so all
UI-visible strings come from i18n rather than hard-coded literals.
- Around line 120-130: Replace hard-coded toast strings in MemorySourcesRegistry
with localized strings via the useT() hook: import/use useT() in the component
(e.g. const t = useT()), then change the success title/message and error title
to use t(...) with interpolation of source.label (for example
t('syncing_with_label', { label: source.label }) and
t('progress_will_appear_shortly')) and t('sync_failed_with_label', { label:
source.label }); keep the error.message as the toast message value but ensure
the toast title and static message call t(...) instead of literal text; update
the onToast calls that reference onToast, refresh, and source.label accordingly.

In `@app/src/services/memorySourcesService.ts`:
- Around line 162-169: SOURCE_KIND_LABELS currently contains hard-coded English
UI strings; replace these values with i18n keys (e.g.,
'memorySources.kind.folder') or move the mapping into the component layer so
labels are resolved with useT(); specifically update the SOURCE_KIND_LABELS
constant in memorySourcesService (or remove it) and ensure any component that
reads it calls useT() from I18nContext to translate the keys before rendering
(use the same keys shown in the comment like 'memorySources.kind.folder',
'memorySources.kind.github_repo', etc.).

In `@src/openhuman/memory_sources/readers/folder.rs`:
- Around line 96-112: In read_item, the file is read without enforcing
MAX_FILE_SIZE (unlike list_items), allowing large reads; fix by checking the
file size via tokio::fs::metadata or std::fs::metadata on file_path (after
canonicalization) and returning an error if metadata.len() > MAX_FILE_SIZE
before calling tokio::fs::read_to_string; reference the read_item function,
MAX_FILE_SIZE constant, file_path and canonical_file/canonical_base checks and
ensure the error message matches existing style (e.g., "file too large" or
similar).

In `@src/openhuman/memory_sources/readers/github.rs`:
- Around line 23-30: The parse_github_url function currently infers owner/repo
from trailing segments and accepts extra path parts; change it to strictly
validate the URL shape: parse the input with Url (or equivalent), ensure scheme
is http or https and the host is "github.com" (allow optional "www"), split the
path into segments and require exactly two non-empty segments (owner and repo,
trimming a trailing ".git" on the repo), and return Err if the URL contains
extra segments like "tree" or "blob" or has fewer/more than two segments; update
the error message to clearly state invalid GitHub repo URL when rejecting.
- Around line 94-100: The GitHub reader currently creates the reqwest client
with reqwest::Client::new() and calls .send().await without any timeout, so wrap
the request with a timeout: either build the client with
reqwest::Client::builder().timeout(Duration::from_secs(10)).build()? and replace
reqwest::Client::new() or call .timeout(Duration::from_secs(10)) on the
RequestBuilder before .send(); add use std::time::Duration and pick a sensible
timeout (e.g., 10s) so the .send().await for the client/request will error
instead of blocking indefinitely.

In `@src/openhuman/memory_sources/readers/rss.rs`:
- Around line 35-38: The logs are printing the full url variable (using %url)
which can leak query params/credentials; update the logging in the RSS reader
calls that include "%url" (the log that currently prints url alongside max_items
and the later log at lines ~57-60) to instead log a redacted host-only context:
parse the url variable (e.g., via url::Url) and emit only the scheme+host (or
host and port) or a redacted placeholder, then use that redacted value in the
existing log messages (preserving the same log text like "[memory_sources:rss]
listing items") so sensitive query strings/credentials are never logged.
- Around line 89-95: The fetch_url function creates a reqwest client with
Client::new() and lacks a deadline; change client creation in fetch_url to use
Client::builder() with an explicit timeout (e.g., Duration::from_secs(10)) or
set a per-request timeout via the RequestBuilder::timeout call on the get()
request so slow feeds don't hang; ensure you add the std::time::Duration import
and update the client/request usage in fetch_url accordingly.

In `@src/openhuman/memory_sources/readers/twitter.rs`:
- Around line 32-36: The current reader accepts Some("") or whitespace-only
values because it only checks for None; after retrieving the query string (the
expression using source.query.as_deref() -> query) trim the string and reject
empty results: replace the simple ok_or(...) with logic that obtains the &str,
trims it, and returns an error if the trimmed string is empty (so the reader
returns the same "twitter source requires a query" error for None or
whitespace-only input). Keep the rest of the function (including _since_days /
DEFAULT_SINCE_DAYS) unchanged.

In `@src/openhuman/memory_sources/readers/web_page.rs`:
- Around line 59-75: Validate and constrain fetched URLs and response size:
parse and validate the url variable (use url::Url::parse) and return an error
unless the scheme is "http" or "https"; after sending the request with
reqwest::Client (the client/get(&url)/send() flow) check
resp.headers().get(CONTENT_LENGTH) and fail if content length exceeds a
configured max, and when reading the body avoid resp.text().await for unbounded
reads—read via a streaming API (bytes_stream or resp.chunk()/resp.take) and
enforce a maximum byte limit while accumulating into body so giant responses are
rejected; keep the existing error messages but include early rejections for
invalid schemes and oversized responses.

In `@src/openhuman/memory_sources/reconcile.rs`:
- Around line 80-83: short_id currently slices by bytes which can break on
UTF-8; change it to slice on a char boundary: compute the number of chars to
skip (e.g., id.chars().count().saturating_sub(8)), get the byte offset of that
char with id.char_indices().nth(skip).map(|(idx,_)| idx) and then return
&id[offset..] (or id if nth returns None). Update the short_id function to use
this char-aware approach so slicing is UTF-8 safe.

In `@src/openhuman/memory_sources/status.rs`:
- Around line 63-80: The query_row call in source_status that binds to (synced,
pending, last_ts) currently uses .unwrap_or((0, 0, None)), which swallows real
SQL errors and returns misleading zeros; change the code to handle the Result
from conn.query_row explicitly (via match/try/? or map_err) so SQL errors are
propagated or logged instead of masked — update the query_row usage that
constructs (synced, pending, last_ts) in function source_status and apply the
same fix to the similar block around lines 91-95, returning or bubbling up the
error (or converting it to an appropriate Result) rather than defaulting to
(0,0,None).

In `@src/openhuman/memory_sources/sync.rs`:
- Around line 24-80: Add debug-level tracing to the sync flow to improve
correlation and visibility: in sync_source, emit a tracing::debug when
queueing/spawning the task that includes source_id and kind; inside the spawned
task log a debug before dispatching the match (include which arm will run, i.e.,
sync_composio or sync_via_reader or Twitter fallback); inside the
sync_via_reader path add a debug with the number of readers/urls being processed
(reader list size) and any per-reader dispatch info; and finally emit a debug
after outcome OK with the ingested item count and after Err with the error (in
addition to existing emit_sync_stage), always attaching structured fields
source_id and kind; use tracing::debug and keep messages concise and
development-oriented while leaving emit_sync_stage logic unchanged.

In `@src/openhuman/memory_sync/composio/mod.rs`:
- Around line 60-76: The current branch returns early whenever registry_sources
is non-empty, even if filtering yields zero valid SyncTarget entries; change the
logic in the block that references registry_sources, get_composio_sync_provider,
and SyncTarget to first build a Vec<SyncTarget> by mapping/filtering
registry_sources (collect into filtered_targets), then if filtered_targets is
non-empty return Ok(filtered_targets), otherwise do not short-circuit and allow
the fallback path to run; update the tracing::debug to report
filtered_targets.len() (or only log when returning) so behavior is preserved and
empty filtered results fall through to the active Composio connections fallback.

In `@tests/memory_sources_e2e.rs`:
- Around line 485-489: The live GitHub E2E test
memory_sources_github_repo_activity_flow is currently always run and causes CI
flakiness; either mark the test ignored by default by adding the #[ignore]
attribute above the async fn memory_sources_github_repo_activity_flow or gate it
behind an explicit environment flag (e.g., check std::env::var("RUN_GITHUB_E2E")
at the start of memory_sources_github_repo_activity_flow and early-return or
call tokio::spawn/skip when not set), and update the test doc comment to note
how to run it locally; apply the same change to the other tests in the same
block (lines covering the github flow).

---

Outside diff comments:
In `@src/openhuman/memory_sync/composio/bus.rs`:
- Around line 558-563: The early return in the composio bus path when
get_provider(&toolkit) returns None prevents calling upsert_composio_source and
thus skips auto-registration of valid connections; modify the branch so that
when get_provider(&toolkit) is None you still call upsert_composio_source(...)
to insert/update the connection into memory_sources (preserve the existing
tracing::debug log), and then continue without attempting to dispatch provider
hooks. Apply the same change to the analogous block around the 580-596 region so
both miss-paths auto-register via upsert_composio_source while avoiding the
provider dispatch.

---

Nitpick comments:
In `@app/src/components/settings/panels/__tests__/MemoryDataPanel.test.tsx`:
- Around line 95-103: Unskip the test (change it.skip to it) and make the
sync-connections/status path deterministic by stubbing the network/hook that
returns memory-sources status to produce the error case ("network timeout")
while keeping other dependencies normal (keep resolveConfigWith('balanced') and
renderWithProviders(<MemoryDataPanel />) as-is); implement this by mocking the
module or function that MemoryDataPanel uses to fetch sync connections (e.g.,
the memory-sources status fetch hook or API client) with jest.mock/jest.spyOn or
MSW to return a rejected promise or error response, then assert the UI shows the
error and that preset buttons remain accessible. Ensure the mock only affects
the sync connections status endpoint and does not perform real network calls so
the test is deterministic.

In `@src/openhuman/memory_sources/readers/composio.rs`:
- Around line 46-65: Add a namespaced debug log at the start of the async fn
read_item to mirror list_items' diagnostics: use the tracing (or log) crate at
debug or trace level, include non-sensitive identifiers like toolkit and
item_id, and redact or omit sensitive fields such as full connection_id (e.g.,
log a truncated or hashed connection_id). Target the read_item function and
ensure the log message clearly states it's entering read_item for the given
toolkit and item_id to aid traceability.

In `@src/openhuman/memory_sources/readers/web_page.rs`:
- Around line 92-97: The extract_title function is too fragile: update
extract_title to use a robust HTML parser (e.g., the scraper crate) instead of
manual string searches so it correctly handles attributes on <title> (e.g.,
<title lang="en">) and decodes HTML entities; locate the extract_title function
in web_page.rs and replace the manual find-based logic with parsing that selects
the <title> element (or falls back safely) and returns its text content with
HTML entities decoded.

In `@src/openhuman/memory_sources/types.rs`:
- Around line 97-99: The Folder arm (SourceKind::Folder) only requires a path
via require_field(&self.path, "path")? but lacks validation; update the folder
reader (readers/folder.rs) to canonicalize the provided path (e.g.,
std::fs::canonicalize) and verify it is inside an allowed base directory (using
Path::starts_with or strip_prefix to ensure it does not escape the boundary) and
reject or return an error if canonicalization fails or the path is outside the
allowed root to prevent path traversal.
- Around line 100-111: The SourceKind branches that call
require_field(&self.url, "url") (GithubRepo, RssFeed, WebPage) need to also
validate that self.url is a well-formed, safe URL before any network use: inside
the same validation function where require_field is used (the impl that
validates SourceKind / the struct containing url), parse the value with a strict
URL parser (e.g., url::Url::parse) and ensure the scheme is http or https
(reject other schemes and invalid parses), returning a clear validation error;
do not change TwitterQuery handling (it uses query, not url) but add a unit test
for each invalid/unsupported URL case and for a valid http/https URL to ensure
the new checks work.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8ef641b7-c820-4ee0-aa3c-4941f152ec2b

📥 Commits

Reviewing files that changed from the base of the PR and between a41d913 and 92cf2a7.

📒 Files selected for processing (49)
  • app/src/components/intelligence/AddMemorySourceDialog.tsx
  • app/src/components/intelligence/MemorySources.tsx
  • app/src/components/intelligence/MemorySourcesRegistry.tsx
  • app/src/components/intelligence/MemorySyncConnections.test.tsx
  • app/src/components/intelligence/MemorySyncConnections.tsx
  • app/src/components/intelligence/MemoryWorkspace.tsx
  • app/src/components/intelligence/__tests__/MemoryWorkspace.test.tsx
  • app/src/components/settings/panels/__tests__/MemoryDataPanel.test.tsx
  • app/src/lib/i18n/chunks/ar-3.ts
  • app/src/lib/i18n/chunks/bn-3.ts
  • app/src/lib/i18n/chunks/de-3.ts
  • app/src/lib/i18n/chunks/en-3.ts
  • app/src/lib/i18n/chunks/es-3.ts
  • app/src/lib/i18n/chunks/fr-3.ts
  • app/src/lib/i18n/chunks/hi-3.ts
  • app/src/lib/i18n/chunks/id-3.ts
  • app/src/lib/i18n/chunks/it-3.ts
  • app/src/lib/i18n/chunks/ko-3.ts
  • app/src/lib/i18n/chunks/pl-3.ts
  • app/src/lib/i18n/chunks/pt-3.ts
  • app/src/lib/i18n/chunks/ru-3.ts
  • app/src/lib/i18n/chunks/zh-CN-3.ts
  • app/src/lib/i18n/en.ts
  • app/src/services/memorySourcesService.test.ts
  • app/src/services/memorySourcesService.ts
  • app/src/services/memorySyncService.test.ts
  • app/src/services/memorySyncService.ts
  • src/core/all.rs
  • src/core/jsonrpc.rs
  • src/openhuman/config/schema/types.rs
  • src/openhuman/memory_sources/mod.rs
  • src/openhuman/memory_sources/readers/composio.rs
  • src/openhuman/memory_sources/readers/folder.rs
  • src/openhuman/memory_sources/readers/github.rs
  • src/openhuman/memory_sources/readers/mod.rs
  • src/openhuman/memory_sources/readers/rss.rs
  • src/openhuman/memory_sources/readers/twitter.rs
  • src/openhuman/memory_sources/readers/web_page.rs
  • src/openhuman/memory_sources/reconcile.rs
  • src/openhuman/memory_sources/registry.rs
  • src/openhuman/memory_sources/rpc.rs
  • src/openhuman/memory_sources/schemas.rs
  • src/openhuman/memory_sources/status.rs
  • src/openhuman/memory_sources/sync.rs
  • src/openhuman/memory_sources/types.rs
  • src/openhuman/memory_sync/composio/bus.rs
  • src/openhuman/memory_sync/composio/mod.rs
  • src/openhuman/mod.rs
  • tests/memory_sources_e2e.rs
💤 Files with no reviewable changes (5)
  • app/src/components/intelligence/MemorySources.tsx
  • app/src/services/memorySyncService.test.ts
  • app/src/components/intelligence/MemorySyncConnections.test.tsx
  • app/src/services/memorySyncService.ts
  • app/src/components/intelligence/MemorySyncConnections.tsx

Comment thread app/src/components/intelligence/AddMemorySourceDialog.tsx
Comment thread app/src/components/intelligence/AddMemorySourceDialog.tsx Outdated
Comment thread app/src/components/intelligence/MemorySourcesRegistry.tsx
Comment thread app/src/components/intelligence/MemorySourcesRegistry.tsx Outdated
Comment thread app/src/services/memorySourcesService.ts Outdated
Comment thread src/openhuman/memory_sources/reconcile.rs Outdated
Comment thread src/openhuman/memory_sources/status.rs Outdated
Comment thread src/openhuman/memory_sources/sync.rs
Comment thread src/openhuman/memory_sync/composio/mod.rs
Comment thread tests/memory_sources_e2e.rs
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 29, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good chunk of work — the architecture is solid and the reader abstraction is clean. CodeRabbit covered the low-hanging stuff (timeouts, i18n gaps, SSRF scheme check, SQL error swallowing). Two things I want fixed before merge that weren't covered.

Blocking

TOCTOU in folder reader

read_item resolves a canonical path for the traversal check, then reads from the original file_path. Between canonicalize and read_to_string, a symlink in the source directory can be swapped to point outside the base, bypassing the guard. One-liner fix: use canonical_file for the read.

// current
let body = tokio::fs::read_to_string(&file_path).await ...

// fix
let body = tokio::fs::read_to_string(&canonical_file).await ...

The PR description explicitly calls out the traversal prevention as a security property — the implementation doesn't fully deliver on that claim.

No persistent sync error state

SourceStatus has chunks_synced, chunks_pending, last_chunk_at_ms, freshness — but no last_error or sync_failed field. After a sync failure, the UI shows Idle with unchanged chunk counts, which is indistinguishable from "never synced". The MemorySyncStageChanged(Failed) event is the only signal, and it's lost on reconnect or the next 5s poll cycle.

Users who add a GitHub source with bad credentials, or an RSS feed that's down, get no persistent feedback about why nothing is updating. The Sync button just goes from spinner back to idle. This is a core promise of the feature — "live chunk count / freshness display per row" — and it silently breaks under error conditions.

Add a last_sync_error: Option<String> to SourceStatus (or a sync_state enum). Persist it somewhere the status query can return it: a lightweight in-memory map keyed by source_id updated by the spawned sync task would do, or write it to the DB. Without this, the feature ships broken for any source that can fail.

Non-blocking

gh_available() blocks the async executor

github.rs calls std::process::Command::new("gh").arg("--version").status() synchronously from an async context. It blocks the current tokio worker thread for the subprocess lifetime on every list_items/read_item call. Wrap in tokio::task::spawn_blocking or switch to tokio::process::Command.

Label format inconsistency on auto-register

When a Composio connection is first created (bus event in bus.rs), it's registered with "{toolkit} connection" — lowercase, no short ID. The reconcile path in reconcile.rs generates "{Title_case(toolkit)} · {short_id(connection_id)}". Users who connect a new integration post-launch get a different label format than existing integrations that went through reconcile. Pick one format and use it in both places.


fn folder_source(path: &str) -> MemorySourceEntry {
MemorySourceEntry {
id: "src_folder".into(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] TOCTOU: traversal guard uses canonical_file but the read here uses the original non-canonical file_path. A symlink in the source directory can be swapped between canonicalize and this call, redirecting the read outside the base.

Fix:

let body = tokio::fs::read_to_string(&canonical_file)
    .await
    .map_err(|e| format!("failed to read {}: {e}", canonical_file.display()))?;

pub struct SourceStatus {
pub source_id: String,
pub chunks_synced: u64,
pub chunks_pending: u64,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] SourceStatus has no field for sync errors. After a failed sync the UI shows Idle / unchanged chunk count — identical to "never synced". The MemorySyncStageChanged(Failed) event is transient; it's lost before the next 5s poll.

Add last_sync_error: Option<String> here and wire it from the spawned task in sync.rs. An in-memory DashMap<source_id, last_error> populated by sync_source and read by source_status would be the lightest-weight fix without a schema change.

.status()
.map(|s| s.success())
.unwrap_or(false)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] std::process::Command::status() is synchronous and blocks the tokio worker thread on every list_items/read_item call. Use tokio::process::Command or wrap in spawn_blocking:

async fn gh_available() -> bool {
    tokio::process::Command::new("gh")
        .arg("--version")
        .stdout(std::process::Stdio::null())
        .stderr(std::process::Stdio::null())
        .status()
        .await
        .map(|s| s.success())
        .unwrap_or(false)
}

&toolkit,
&connection_id,
&label,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] Label format here is "{toolkit} connection" (lowercase, no short ID). reconcile.rs uses "{Title_case(toolkit)} · {short_id(connection_id)}".

Connections created before the new build and connections created after it get different label formats. Use the reconcile format (title_case + short_id) here too for consistency.

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeRabbit-style Review — PR #2893 (Memory Sources)

Overall: Well-structured domain that correctly follows project conventions. 2 blockers, 3 major issues, 2 refactor suggestions, 6 nitpicks.

Verified / Looks Good

  • Path traversal guard in folder.rs (canonicalize + starts_with) is correct in principle
  • GitHub URL parser correctly rejects deep links, bare org paths, non-GitHub hostnames
  • MemorySourceEntry.validate() enforces per-kind required fields at add/update
  • SSRF guard in web_page.rs rejects file://, data:// and non-HTTP schemes
  • Freshness label thresholds tested with fixed timestamps — no time flakes
  • upsert_composio_source matches by connection_id, not id/toolkit — correct
  • All 13 locale chunk files include the 56 new keys
  • sync_via_reader progress fires every 5 items — good UI feedback without flooding
  • cargo check and cargo fmt --check pass clean
  • Typecheck failures (recharts, rehype-katex) are pre-existing on main

Questions

  1. Was the removal of VaultPanel and ObsidianVaultSection from MemoryWorkspace.tsx intentional? (See blocker comment below)
  2. github.rs list_items fetches commits, issues, PRs in 3 serial API calls — was tokio::join! considered for concurrency?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Blocker: TOCTOU race — read uses original path, not canonicalized

The traversal guard canonicalizes file_path at line ~103, but read_to_string at line ~121 uses the original file_path. Between the guard passing and the read, a symlink swap could redirect the read outside the canonical base.

Fix: read from canonical_file, not file_path:

// before
let body = tokio::fs::read_to_string(&file_path)
    .await
    .map_err(|e| format!("failed to read {}: {e}", file_path.display()))?;

// after — use the canonicalized path
let body = tokio::fs::read_to_string(&canonical_file)
    .await
    .map_err(|e| format!("failed to read {}: {e}", canonical_file.display()))?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Blocker: VaultPanel and ObsidianVaultSection silently removed

This PR removes <VaultPanel onToast={onToast} /> and <ObsidianVaultSection contentRootAbs={...} /> from MemoryWorkspace. These are the Obsidian vault viewer and "View vault in Obsidian" link — they're unrelated to memory sources and were user-visible features.

The PR description makes no mention of intentionally removing them. If accidental, please restore:

<VaultPanel onToast={onToast} />
{graph && (
  <ObsidianVaultSection contentRootAbs={graph.content_root_abs} onToast={onToast} />
)}

If intentional, it should be a separate commit with justification.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Major: RSS fetch_url has no body size cap — potential OOM

The web page reader correctly caps at 10 MiB (MAX_BODY_BYTES), but fetch_url here does resp.text().await with no size guard. A hostile or misconfigured feed could buffer hundreds of MBs into memory.

Fix:

const MAX_FEED_BYTES: u64 = 5 * 1024 * 1024; // 5 MiB
let bytes = resp
    .bytes()
    .await
    .map_err(|e| format!("failed to read feed body: {e}"))?;
if bytes.len() as u64 > MAX_FEED_BYTES {
    return Err(format!(
        "feed body exceeds {MAX_FEED_BYTES}-byte limit ({} bytes)",
        bytes.len()
    ));
}
let text = String::from_utf8_lossy(&bytes).into_owned();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Major: gh CLI invoked without timeout — can block sync task indefinitely

gh_json calls tokio::process::Command::new("gh").output().await with no timeout. If gh hangs (auth flow, network stall, rate-limit backoff), the sync task parks forever and MemorySyncStage::Completed never fires.

Fix:

async fn gh_json(args: &[&str]) -> Result<String, String> {
    let output = tokio::time::timeout(
        std::time::Duration::from_secs(30),
        tokio::process::Command::new("gh").args(args).output(),
    )
    .await
    .map_err(|_| "gh command timed out after 30s".to_string())?
    .map_err(|e| format!("gh command failed: {e}"))?;
    // ...
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Major: tokio::spawn result dropped — panics in sync tasks are invisible

The spawned background task's JoinHandle is dropped immediately. Any panic inside it produces no log, and at shutdown the runtime may cancel an in-flight ingest mid-way.

Fix: Track the handle or attach a panic handler:

let source_id_for_log = source_id.clone();
let handle = tokio::spawn(async move {
    // ... sync body ...
});
tokio::spawn(async move {
    if let Err(panic) = handle.await {
        tracing::error!(
            source_id = %source_id_for_log,
            "[memory_sources:sync] sync task panicked: {:?}", panic
        );
    }
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: Composio chunk count is per-toolkit, not per-connection

For Composio sources, source_id_prefix returns "{toolkit}:%" (e.g. gmail:%). If a user has two Gmail connections, both registry entries report the same aggregate chunk count — misleading in the UI.

Long-term fix: include connection_id in Composio chunk source IDs. Short-term: add a comment documenting the known limitation so future readers aren't confused.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 Nitpicks (low priority)

  1. default_true() duplicate — defined again here; already exists in types.rs. Deduplicate to a shared location.
  2. folder.rsbase_path.trim_end_matches('/') won't fire on Windows-style \ paths. Low risk (macOS/Linux primary).
  3. rss.rs — RSS vs Atom detection by substring <feed is fragile if the string appears in description before the root element. A starts_with check on first non-whitespace tag would be more robust.
  4. AddMemorySourceDialog.tsx — Back button uses raw character instead of SVG icon. Minor a11y/RTL concern.
  5. MemorySourcesRegistry.tsxSOURCE_KIND_ICONS[source.kind] ?? '📄' fallback is dead code (type is Record<SourceKind, string>).
  6. reconcile.rstitle_case produces "Clickup" not "ClickUp". Cosmetic but may look odd to users.

senamakel added 12 commits May 29, 2026 18:19
…s, and RPC surface

New domain `src/openhuman/memory_sources/` that defines user-configurable
data connectors feeding memory. Sources are persisted as flat entries in
`config.toml` under `[[memory_sources]]`.

- Types: SourceKind (Composio/Folder/GithubRepo/TwitterQuery/RssFeed/WebPage),
  MemorySourceEntry with flat kind-specific fields, validation per kind
- Registry: CRUD operations reading/writing Config.memory_sources
- Readers: SourceReader trait with implementations for each kind
  - Composio: delegates to existing sync layer
  - Folder: walkdir + glob matching with path traversal prevention
  - GitHub: public repo tree listing via API, raw content fetch
  - Twitter: scaffold with clear "not yet configured" error
  - RSS: lightweight XML parser for RSS 2.0 and Atom feeds
  - WebPage: HTTP fetch with optional CSS selector extraction
- RPC surface: list/get/add/update/remove + list_items/read_item
- Controller wiring in core/all.rs

24 unit tests covering types, validation, readers, and schemas.
- Auto-register memory_sources entry when a Composio connection is
  created (bus.rs subscriber, non-fatal on failure)
- Add reconcile module that syncs existing Composio connections into
  the registry on startup
- Wire list_sync_targets() to prefer memory_sources registry (enabled
  Composio entries) over scanning all connections. Falls back to legacy
  full-scan when no registry entries exist.
Tests boot a real axum JSON-RPC server against an isolated workspace and
exercise: add folder source → list → list_items → read_item → ingest
document into memory tree → verify chunks indexed → update → disable →
remove. Also tests validation rejection of bad input.

Fix: declare kind-specific fields (toolkit, path, glob, url, branch,
paths, query, since_days, max_items, selector) in the add/update
controller schemas so the JSON-RPC param validator accepts them.
… E2E tests

GitHub reader now pulls commits, issues, and PRs (not source code) from
a repository. Uses `gh` CLI when available for authenticated access,
falls back to unauthenticated GitHub REST API for public repos.

Each item is formatted as structured markdown with metadata (author,
date, state, labels, comments) suitable for memory tree ingestion.

New E2E tests (JSON-RPC against a real axum server):
- memory_sources_github_repo_activity_flow: add repo → list commits/
  issues/PRs → read commit → ingest into memory tree → verify chunks
- memory_sources_composio_registry_flow: add gmail + slack composio
  sources → CRUD lifecycle → list_items/read_item → disable → remove

Also fixes controller schemas to declare all kind-specific fields so
the JSON-RPC param validator accepts them on add/update.
New components:
- memorySourcesService: typed client for openhuman.memory_sources_* RPCs
  (list/get/add/update/remove/list_items/read_item)
- AddMemorySourceDialog: two-step picker (kind → form) for adding
  folder/github_repo/rss_feed/web_page/twitter_query sources
- MemorySourcesRegistry: list panel with toggle/remove controls and
  "Add Source" button, integrated into MemoryWorkspace below the
  existing composio-driven MemorySources panel

Composio-backed sources are filtered out of the registry view since
they're already managed by the existing MemorySources component (which
joins composio.list_connections with sync status).

i18n: 24 new memorySources.* keys added to en.ts and all 14 locale
chunk files (English placeholders for non-English locales).

Tests: 6 unit tests for the service layer covering envelope unwrapping,
flat-field add params, partial updates, and constant exposure.
…icker

- Remove the old Composio-driven MemorySources panel; the registry-based
  panel is now the single source of truth for "what feeds memory"
- Rename header to "Memory Sources" (was "Custom Sources")
- Include Composio in the source kind picker — Add Source dialog now
  has 6 kinds: Integration, Folder, GitHub, RSS, Web, Twitter
- For Composio kind, fetch active connections via composio.list_connections
  and let the user pick from a dropdown (instead of typing toolkit + id)
- Composio source rows get a Sync button that calls composio_sync
- Folder kind now has a "Browse…" button using <input type="file"
  webkitdirectory> for native folder picking in CEF
- Remove the Knowledge Vaults (Experimental) panel — superseded by
  the Folder source kind in Memory Sources
- Remove ObsidianVaultSection from the actions row
Backend (src/openhuman/memory_sources/):
- sync.rs: sync_source(source_id) dispatches by kind
  - Composio → memory_sync::composio::run_connection_sync
  - Folder/GitHub/RSS/WebPage → reader.list_items + reader.read_item
    + ingest_document (chunks tagged mem_src:{source_id}:{item_id})
  - Twitter → returns "not configured" error
  - Emits MemorySyncStageChanged events (requested/fetching/stored/
    ingesting/completed/failed) keyed by source_id so the UI can
    stream per-source progress
- status.rs: per-source chunk counts + freshness label via SQL
  query against mem_tree_chunks (matched by source_id prefix —
  toolkit-based for composio, mem_src:{id}:% for everything else)
- New RPCs: memory_sources_sync, memory_sources_status_list
  registered in the controller registry

Frontend:
- memorySourcesService.ts: syncMemorySource + memorySourcesStatusList
  + SourceStatus type
- MemorySourcesRegistry.tsx: Sync button on every row (not just
  composio), polls status every 5s, shows chunk count / freshness
  pill / last-sync timestamp per row, dispatches the unified
  memory_sources_sync RPC

Removed (consolidated into memory_sources):
- app/src/components/intelligence/MemorySources.tsx (old composio
  panel) + its test
- app/src/components/intelligence/MemorySyncConnections.tsx +
  test (already dead)
- app/src/services/memorySyncService.ts + test (replaced by
  memorySourcesService.statusList)

Tests:
- 27 Rust unit tests (added freshness + prefix dispatch)
- 6 frontend service tests still pass
…ions

Default behavior: every active Composio connection whose toolkit has
a memory-sync provider (gmail, slack, github, notion, linear, clickup)
now shows up as a memory source automatically — users don't have to
manually add their integrations.

How it works:
- New `composio::scan_active_sync_targets()` always hits Composio
  directly (vs. `list_sync_targets` which short-circuits through the
  registry once it's populated)
- `memory_sources::reconcile::ensure_composio_sources()` calls it and
  upserts one MemorySourceEntry per (toolkit, connection_id), labelled
  "Toolkit · <last-8-of-connection-id>" so multiple Gmail accounts stay
  distinguishable
- Wired at startup in core::jsonrpc bootstrap (best-effort spawn)
- Also called lazily on every memory_sources_list RPC so freshly-
  connected integrations appear without waiting for a restart
…r input

ESLint react/no-unknown-property rejects 'directory' (deprecated alias).
Modern Chromium/CEF accepts the standard 'webkitdirectory' alone.
Rust:
- folder reader: enforce MAX_FILE_SIZE on read_item to match list_items
- github reader: harden parse_github_url to reject /tree/<branch> etc and
  non-GitHub hosts; add 20s reqwest timeout
- rss reader: 20s reqwest timeout; redact URLs in debug logs (host only)
- twitter reader: trim + reject empty queries at the boundary
- web_page reader: SSRF guard (only http/https), 10 MiB body cap, 20s
  timeout
- reconcile: UTF-8 safe short_id (char-boundary slicing)
- status: surface SQL errors instead of swallowing them as zero-rows
- sync: debug tracing at queueing/dispatch/list-size/completion with
  source correlation
- composio: fallback to active-connection scan when registry yields zero
  valid targets (not just when registry is empty)
- e2e: gate live GitHub network test behind OPENHUMAN_E2E_NETWORK=1

Frontend i18n (all per-component user-visible strings):
- 9 placeholder strings moved to memorySources.*Placeholder keys
- 6 source kind labels moved to memorySources.kind.* keys (constant
  becomes SOURCE_KIND_LABEL_KEYS, resolved via t() at render sites)
- Sync toast title/message + 5 relative-time labels (just now, s/m/h/d
  ago) moved to time.*Suffix and memorySources.sync.* keys
- AddMemorySourceDialog Composio-load effect refactored to avoid
  setState directly in the effect body (react-hooks/set-state-in-effect)

All 14 locale chunk-3 files updated to keep i18n:check parity.
27 → 31 Rust unit tests; 6 frontend service tests still pass.
…eout, panic handler

- folder reader: read from `canonical_file` instead of original `file_path`
  to close the TOCTOU window between canonicalize() and the actual read
- rss reader: add MAX_FEED_BYTES (5 MiB) cap in fetch_url() — checked
  from Content-Length header before buffering, then from actual byte count
- github reader: wrap gh CLI .output().await in tokio::time::timeout(30s)
  to prevent hung sync tasks on unresponsive gh auth/server
- sync: wrap inner tokio::spawn in an outer spawn that awaits the JoinHandle
  and logs tracing::error! on panic so task panics are never silently dropped
- MemoryWorkspace: restore accidentally removed VaultPanel and
  ObsidianVaultSection imports + render sites; update module-level JSDoc
  to describe the new MemorySourcesRegistry component tree
@graycyrus graycyrus force-pushed the feat/memory-sources branch from d1c9acd to 4f4eb47 Compare May 29, 2026 12:51
…deleted modules

MemorySources.tsx and memorySyncService.ts were deleted as part of the
memory_sources domain consolidation, but the test file that imported
from them was not removed. This caused TypeScript compile failures in CI.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src/components/intelligence/__tests__/MemoryWorkspace.test.tsx`:
- Around line 21-23: The global mock for
listMemorySources/memorySourcesStatusList is returning [] and causing tests that
expect source rows/buttons to fail; remove the hardcoded mockResolvedValue([])
from the top-level vi.mock and instead set realistic defaults inside the test
suite's beforeEach (call listMemorySources.mockResolvedValue(<defaultSources>)
and memorySourcesStatusList.mockResolvedValue(<defaultStatuses>)) so individual
tests can override those via listMemorySources.mockResolvedValueOnce(...) as
needed; update references to the mocked functions (listMemorySources and
memorySourcesStatusList) in beforeEach and in tests that need different data so
the MemoryWorkspace tests no longer assume an empty source list.

In `@app/src/components/intelligence/MemorySourcesRegistry.tsx`:
- Around line 43-44: The component currently uses a single state variable
syncingId (useState<string | null>) which cannot track multiple concurrent
syncs; change this to track multiple in-flight syncs (e.g.,
useState<Set<string>> or a string[]/Record) and update all places that read or
set syncingId (e.g., wherever setSyncingId is called—likely the sync handler
like handleSync/onSyncClick and the cleanup after dispatch completes) to add the
id when a sync starts and remove it when it finishes or errors; then update the
row rendering logic to check membership (e.g., syncingIds.has(id) or
syncingIds.includes(id)) instead of comparing to a single value so multiple rows
can show locked/spinning state simultaneously.
- Around line 48-58: The refresh implementation currently uses
listMemorySources() and memorySourcesStatusList() with catch blocks that return
empty arrays, which causes transient API failures to wipe UI state; change
refresh to call both APIs with Promise.allSettled (or handle each promise
separately), and only call setSources(list) and setStatuses(stats) when the
corresponding promise fulfilled; if a promise is rejected, retain the existing
sources/statuses (do not replace with []), and log the error via console.warn or
the existing logger; reference listMemorySources, memorySourcesStatusList,
setSources, setStatuses, and the refresh function in MemorySourcesRegistry.tsx
when applying this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f608ece-1584-49dc-a050-f5335dd897d0

📥 Commits

Reviewing files that changed from the base of the PR and between 92cf2a7 and 3fa10fe.

📒 Files selected for processing (28)
  • app/src/components/intelligence/AddMemorySourceDialog.tsx
  • app/src/components/intelligence/MemorySources.tsx
  • app/src/components/intelligence/MemorySourcesRegistry.tsx
  • app/src/components/intelligence/MemorySyncConnections.test.tsx
  • app/src/components/intelligence/MemorySyncConnections.tsx
  • app/src/components/intelligence/MemoryWorkspace.tsx
  • app/src/components/intelligence/__tests__/MemoryWorkspace.test.tsx
  • app/src/components/settings/panels/__tests__/MemoryDataPanel.test.tsx
  • app/src/lib/i18n/chunks/ar-3.ts
  • app/src/lib/i18n/chunks/bn-3.ts
  • app/src/lib/i18n/chunks/de-3.ts
  • app/src/lib/i18n/chunks/en-3.ts
  • app/src/lib/i18n/chunks/es-3.ts
  • app/src/lib/i18n/chunks/fr-3.ts
  • app/src/lib/i18n/chunks/hi-3.ts
  • app/src/lib/i18n/chunks/id-3.ts
  • app/src/lib/i18n/chunks/it-3.ts
  • app/src/lib/i18n/chunks/ko-3.ts
  • app/src/lib/i18n/chunks/pl-3.ts
  • app/src/lib/i18n/chunks/pt-3.ts
  • app/src/lib/i18n/chunks/ru-3.ts
  • app/src/lib/i18n/chunks/zh-CN-3.ts
  • app/src/lib/i18n/en.ts
  • app/src/services/memorySourcesService.test.ts
  • app/src/services/memorySourcesService.ts
  • app/src/services/memorySyncService.test.ts
  • app/src/services/memorySyncService.ts
  • src/core/all.rs
💤 Files with no reviewable changes (23)
  • app/src/lib/i18n/chunks/pl-3.ts
  • app/src/lib/i18n/chunks/pt-3.ts
  • app/src/lib/i18n/chunks/it-3.ts
  • app/src/lib/i18n/en.ts
  • app/src/lib/i18n/chunks/en-3.ts
  • app/src/components/intelligence/MemorySources.tsx
  • app/src/lib/i18n/chunks/ru-3.ts
  • app/src/services/memorySyncService.ts
  • app/src/lib/i18n/chunks/de-3.ts
  • app/src/lib/i18n/chunks/zh-CN-3.ts
  • app/src/lib/i18n/chunks/ko-3.ts
  • app/src/lib/i18n/chunks/ar-3.ts
  • app/src/lib/i18n/chunks/hi-3.ts
  • app/src/services/memorySourcesService.test.ts
  • app/src/services/memorySourcesService.ts
  • app/src/lib/i18n/chunks/bn-3.ts
  • app/src/lib/i18n/chunks/es-3.ts
  • app/src/components/intelligence/MemorySyncConnections.test.tsx
  • app/src/components/intelligence/MemorySyncConnections.tsx
  • src/core/all.rs
  • app/src/lib/i18n/chunks/id-3.ts
  • app/src/lib/i18n/chunks/fr-3.ts
  • app/src/services/memorySyncService.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/src/components/settings/panels/tests/MemoryDataPanel.test.tsx
  • app/src/components/intelligence/AddMemorySourceDialog.tsx

Comment thread app/src/components/intelligence/__tests__/MemoryWorkspace.test.tsx
Comment on lines +43 to +44
const [syncingId, setSyncingId] = useState<string | null>(null);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Single syncingId cannot safely model multiple in-flight sync requests.

If users trigger sync on different rows quickly, only one row stays locked/spinning and earlier rows can be re-clicked, risking duplicate non-idempotent sync dispatches.

Suggested fix
-  const [syncingId, setSyncingId] = useState<string | null>(null);
+  const [syncingIds, setSyncingIds] = useState<Set<string>>(new Set());

   const handleSync = useCallback(
     async (source: MemorySourceEntry) => {
-      setSyncingId(source.id);
+      setSyncingIds(prev => new Set(prev).add(source.id));
       try {
         await syncMemorySource(source.id);
         onToast?.({
           type: 'success',
           title: `${t('memorySources.sync.successTitle')} ${source.label}`,
           message: t('memorySources.sync.successMessage'),
         });
         void refresh();
       } catch (err) {
         onToast?.({
           type: 'error',
           title: `${t('memorySources.sync.failedTitle')} ${source.label}`,
           message: err instanceof Error ? err.message : String(err),
         });
       } finally {
-        setSyncingId(prev => (prev === source.id ? null : prev));
+        setSyncingIds(prev => {
+          const next = new Set(prev);
+          next.delete(source.id);
+          return next;
+        });
       }
     },
     [onToast, refresh, t]
   );

And when rendering rows:

- isSyncing={syncingId === source.id}
+ isSyncing={syncingIds.has(source.id)}

Also applies to: 115-134

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/components/intelligence/MemorySourcesRegistry.tsx` around lines 43 -
44, The component currently uses a single state variable syncingId
(useState<string | null>) which cannot track multiple concurrent syncs; change
this to track multiple in-flight syncs (e.g., useState<Set<string>> or a
string[]/Record) and update all places that read or set syncingId (e.g.,
wherever setSyncingId is called—likely the sync handler like
handleSync/onSyncClick and the cleanup after dispatch completes) to add the id
when a sync starts and remove it when it finishes or errors; then update the row
rendering logic to check membership (e.g., syncingIds.has(id) or
syncingIds.includes(id)) instead of comparing to a single value so multiple rows
can show locked/spinning state simultaneously.

Comment on lines +48 to +58
listMemorySources().catch(err => {
console.warn('[ui-flow][memory-sources] list failed', err);
return [] as MemorySourceEntry[];
}),
memorySourcesStatusList().catch(err => {
console.warn('[ui-flow][memory-sources] status_list failed', err);
return [] as SourceStatus[];
}),
]);
setSources(list);
setStatuses(stats);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Transient poll failures should not clear the entire registry UI state.

On any temporary API error, refresh() currently replaces sources/statuses with empty arrays, causing visible data loss/flicker and disabling valid actions.

Suggested fix
-      const [list, stats] = await Promise.all([
-        listMemorySources().catch(err => {
-          console.warn('[ui-flow][memory-sources] list failed', err);
-          return [] as MemorySourceEntry[];
-        }),
-        memorySourcesStatusList().catch(err => {
-          console.warn('[ui-flow][memory-sources] status_list failed', err);
-          return [] as SourceStatus[];
-        }),
-      ]);
-      setSources(list);
-      setStatuses(stats);
+      const [listResult, statusResult] = await Promise.allSettled([
+        listMemorySources(),
+        memorySourcesStatusList(),
+      ]);
+      if (listResult.status === 'fulfilled') setSources(listResult.value);
+      if (statusResult.status === 'fulfilled') setStatuses(statusResult.value);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/components/intelligence/MemorySourcesRegistry.tsx` around lines 48 -
58, The refresh implementation currently uses listMemorySources() and
memorySourcesStatusList() with catch blocks that return empty arrays, which
causes transient API failures to wipe UI state; change refresh to call both APIs
with Promise.allSettled (or handle each promise separately), and only call
setSources(list) and setStatuses(stats) when the corresponding promise
fulfilled; if a promise is rejected, retain the existing sources/statuses (do
not replace with []), and log the error via console.warn or the existing logger;
reference listMemorySources, memorySourcesStatusList, setSources, setStatuses,
and the refresh function in MemorySourcesRegistry.tsx when applying this change.

@graycyrus
Copy link
Copy Markdown
Contributor

Great progress — the TOCTOU fix in folder.rs and label formatting are solid. Three things still need addressing before I can sign off:

  1. gh_available() blocking sync call (github.rs) — still using std::process::Command from async context. This will block the tokio thread. Switch to tokio::process::Command or spawn_blocking.

  2. No persistent sync error state (status.rs) — when a sync fails, the UI still shows Idle with no way to distinguish "never synced" vs "sync failed". The MemorySyncStageChanged(Failed) event is transient. Add last_sync_error: Option<String> to SourceStatus or track it per-source in the spawned sync task so users see what went wrong.

  3. CI is failing — PR Submission Checklist and Rust Core Coverage jobs are red. Fix those before I can approve.

Let me know once those are squared away.

graycyrus added 2 commits May 29, 2026 18:59
…nd update tests

- Add data-testid="memory-source-row-{kind}" to SourceRow <li> so tests can
  assert that rows for each source kind are rendered.
- Add data-testid="memory-source-sync-{toolkit ?? kind}" to the Sync button so
  per-source sync buttons are addressable from tests.
- Update MemoryWorkspace.test.tsx: switch mocks from listConnections/syncConnection
  (deleted composioApi) to listMemorySources/syncMemorySource (memorySourcesService);
  rename "per-connection Sync button" → "per-source Sync button"; use
  queryAllByTestId for the multi-composio-row assertion.
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Progress update — you fixed the TOCTOU in folder.rs and the label format inconsistency, thanks. However, two blocking issues from the prior review remain unresolved:

  1. No persistent sync error stateSourceStatus struct is still unchanged. After a failed sync, the UI shows Idle with unchanged chunk counts, indistinguishable from "never synced". Users get no feedback about why nothing is updating (e.g., bad credentials, unreachable feed). This is a core UX promise of the feature — add last_sync_error: Option<String> to SourceStatus or implement per-source error tracking. Required before merge.

  2. gh_available() still blocks the async executor — still using std::process::Command::status() synchronously from async context, blocking a tokio thread per call. Wrap in tokio::task::spawn_blocking or switch to tokio::process::Command. Non-blocking severity but should be fixed before merge.

Also, CI is failing — PR Submission Checklist and Rust Core Coverage jobs are red. Fix those first, then the above, and I'll review again.

@graycyrus
Copy link
Copy Markdown
Contributor

Hey @senamakel — just checking in. I see CI is still settling (Rust Core Coverage + Windows/macOS E2E tests are pending), and the two blockers from my prior review haven't been addressed yet:

  1. src/openhuman/memory_sources/readers/github.rsgh_available() still uses synchronous std::process::Command from async context, which blocks the tokio executor. Switch to tokio::process::Command or spawn_blocking.
  2. src/openhuman/memory_sources/status.rsSourceStatus still has no last_sync_error field. After a failed sync, the UI can't distinguish "never synced" from "sync failed". Add error state tracking.

I'm holding my approval until:

  • CI is 100% green (all checks pass, not pending)
  • Both of these issues are fixed

Once you push those fixes, just let me know and I'll do a final review pass. Thanks!

@graycyrus graycyrus merged commit b728774 into tinyhumansai:main May 29, 2026
33 of 35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Net-new user-facing capability or product behavior. memory Memory store, memory tree, recall, summarization, and embeddings in src/openhuman/memory/. rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants