Add low-latency raw memory search mode#195
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0a4ce247c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cached = self._profile_catalog_cache.get(user_id) | ||
| if cached and now - cached[0] <= PROFILE_CATALOG_CACHE_TTL_SECONDS: | ||
| return cached[1], cached[2] |
There was a problem hiding this comment.
Invalidate profile catalog cache on fresh memory writes
This cache returns previously fetched profile records for up to 60 seconds without any ingest-triggered invalidation, so a user who ingests new profile facts and immediately calls /v1/memory/retrieve or /v1/memory/search can receive stale results that omit the just-written memory. Because _search_profile and raw profile search both reuse these cached records, this directly regresses read-after-write correctness for normal product flows.
Useful? React with 👍 / 👎.
| for entry in profile_catalog | ||
| ) | ||
| ) | ||
| return (user_id, query.strip().lower(), top_k, catalog_key) |
There was a problem hiding this comment.
Expand retrieval-plan cache key to reflect memory changes
The retrieval-plan cache key only includes user_id, normalized query, top_k, and profile catalog topics/subtopics, so repeated queries within TTL will reuse old tool calls even after non-profile memory changes (e.g., new temporal/summary/snippet data). In that case run() skips tool-selection LLM and can continue querying the wrong domains, producing stale or incomplete answers until the cache expires.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request enhances the memory search functionality by introducing a "code" domain and optional LLM-synthesized answers. Key changes include refactoring the RetrievalPipeline to support raw_search and synthesize_answer methods, implementing caching for profile catalogs and retrieval plans, and adding a latency tracking system with Prometheus metrics. Feedback suggests addressing a potential memory leak in the profile cache, improving encapsulation by avoiding protected method calls, and ensuring that search results are truncated to respect top_k limits and LLM context window constraints.
| self._profile_catalog_cache: Dict[ | ||
| str, Tuple[float, List[Dict[str, str]], List[Any]] | ||
| ] = {} |
There was a problem hiding this comment.
The _profile_catalog_cache is a dictionary keyed by user_id that grows indefinitely as new users perform searches. Since RetrievalPipeline is a long-lived singleton, this could lead to a memory leak in a production environment with many users. Consider using an LRU cache or setting a maximum size for this dictionary.
| code_results = await _search_code( | ||
| query=req.query, | ||
| user_id=user_id, | ||
| org_id=req.org_id, | ||
| repo=req.repo or "", | ||
| top_k=req.top_k, | ||
| ) |
There was a problem hiding this comment.
|
|
||
| if req.answer: | ||
| answer_start = time.perf_counter() | ||
| answer = await pipeline.synthesize_answer(req.query, all_results) |
There was a problem hiding this comment.
When req.answer is true, the all_results list is passed to synthesize_answer without any truncation. If the combined results from memory and code domains are large, this could exceed the LLM's context window. Consider truncating all_results to the top k hits before synthesis.
References
- Ensure that context provided to LLMs is within reasonable token limits to avoid truncation or API errors.
| memory_results, raw_latency = await pipeline.raw_search( | ||
| query=req.query, | ||
| user_id=user_id, | ||
| domains=memory_domains, | ||
| top_k=req.top_k, | ||
| ) |
There was a problem hiding this comment.
The raw_search method returns up to top_k results per domain. With multiple domains selected, the final all_results list can contain significantly more than top_k items. If the API consumer expects at most top_k results total, the final list should be truncated after sorting.
References
- Ensure search results respect the requested limit (top_k) across all aggregated domains.
|
Follow-up pushed in e555bea:
Validation:
GitHub currently reports 1 check on the new head commit, with 0 failed and 0 pending. |
|
Followed up on the remaining review note about the memory route calling the protected code retrieval dispatcher directly. Changes in
Validation run locally:
|
Closes #163.
Summary
/v1/memory/searchonto a low-latency raw retrieval path that skips LLM tool selection and synthesis by defaultanswer=trueto synthesize from already-retrieved raw hits when callers need the slower answer path/retrievepathValidation
\.venv\Scripts\python.exe -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py tests/unit/test_schemas.py\.venv\Scripts\python.exe -m ruff check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py\.venv\Scripts\python.exe -m black --target-version py311 --check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.pygit diff --check