feat: L1 extraction quality improvements — reduce LLM dependency#83
yuanrengu wants to merge 2 commits
Four optimizations to reduce LLM dependency in the memory pipeline:

1. Restore & enhance L1 quality gate (sanitize.ts)
   - Re-enable commented-out length filters (CJK >= 4, alpha >= 10)
   - Re-enable prompt injection detection
   - Add conversational filler filter (好的/OK/thanks/got it)
2. Add rule-based pre-extraction layer (pre-extractor.ts, new)
   - 10 persona patterns (喜欢/是/职业/擅长/认为)
   - 8 instruction patterns (以后/记住/禁止/语言切换)
   - Date+verb episodic detection
   - HIGH-confidence items bypass the LLM entirely; MEDIUM-confidence items serve as hints
3. Self-correction retry on JSON parse failure (l1-extractor.ts)
   - Parse failures trigger one retry with error feedback
   - Reduces silent memory loss from malformed LLM output
4. Post-LLM confidence check (l1-extractor.ts)
   - Source traceability: >=30% of keywords must appear in the source messages
   - Type consistency: persona must reference the user, instruction must reference the AI
   - Trivial content rejection: filter vague episodic statements

Fixes: non-greedy regex in pre-extractor patterns, broader CJK injection detection
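For readers unfamiliar with the quality gate in item 1, here is a minimal sketch of how the length and filler filters could fit together. It is not the code in sanitize.ts: the function name `passesQualityGate`, the character ranges, and the choice to accept a message when either threshold is met are all assumptions, and prompt injection detection is left out.

```typescript
// Hypothetical sketch of an L1 quality gate; NOT the actual sanitize.ts code.
// Assumption: a message passes if it has >= 4 CJK characters OR >= 10 letters.

const CJK_CHAR = /[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/g;
const ALPHA_CHAR = /[A-Za-z]/g;

// Assumed filler list, taken from the examples in the commit message.
const FILLERS = new Set(["好的", "ok", "thanks", "got it"]);

function passesQualityGate(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length === 0) return false;

  // Strip trailing punctuation so "OK!" and "好的。" still count as filler.
  const normalized = trimmed.toLowerCase().replace(/[\s!.,?！。，？~]+$/u, "");
  if (FILLERS.has(normalized)) return false;

  const cjkCount = (trimmed.match(CJK_CHAR) ?? []).length;
  const alphaCount = (trimmed.match(ALPHA_CHAR) ?? []).length;

  // Keep the message if either script clears its minimum length.
  return cjkCount >= 4 || alphaCount >= 10;
}
```

Under these assumptions, `passesQualityGate("OK!")` is rejected as filler, while `passesQualityGate("我喜欢喝咖啡")` passes on its CJK character count.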
Reviewer triage notes from a local pass:
Verification I ran locally on this branch:
Blocking / high-risk findings:
The idea is valuable, but I would not merge this as-is because the background extraction issue changes memory semantics.
Thanks for the thorough review and for catching these issues. I believe the main points are now addressed.
Moved the call to after the split, and now it scans only
I also removed the hints-only logging block since it had no functional effect.
Removed the inaccurate claim from the PR description. The
Dropped
Added
All 17 tests pass with
Let me know if anything else needs attention.
- Move preExtractMemories to newMessages only (after background/new split) to prevent extracting memories from background context that should only serve as conversational context for the LLM
- Remove MEDIUM-confidence hints logging (hints not wired to LLM prompt; keeping types as interface for follow-up PR)
- Remove src/ from package.json files field to fix Size Guard limit (matches pattern from Tencent#76 and Tencent#71)
- Export callLlmExtraction and passesConfidenceCheck for testability
- Add pre-extractor.test.ts covering:
  - Background messages not pre-extracted
  - HIGH-confidence dedup via mergeExtractedMemories
  - Malformed JSON triggers exactly one retry
  - Confidence filtering does not reject valid persona/instruction
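The "malformed JSON triggers exactly one retry" behavior covered by the tests above can be sketched roughly as follows. This is an illustration only: `extractWithRetry`, the `LlmCall` type, and the wording of the feedback prompt are hypothetical and may differ from what callLlmExtraction actually does.

```typescript
// Hypothetical sketch of a one-shot self-correction retry; names and prompt
// wording are assumptions, not the code in l1-extractor.ts.
type LlmCall = (prompt: string) => Promise<string>;

interface ExtractResult {
  memories: unknown[];
  attempts: number;
}

async function extractWithRetry(callLlm: LlmCall, prompt: string): Promise<ExtractResult> {
  const maxAttempts = 2; // the first call plus exactly one retry
  let currentPrompt = prompt;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callLlm(currentPrompt);
    try {
      const parsed: unknown = JSON.parse(raw);
      if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
      return { memories: parsed, attempts: attempt };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Feed the parse error back so the model can correct its own output.
      currentPrompt =
        `${prompt}\n\nYour previous reply was not valid JSON (${message}). ` +
        "Reply with a JSON array only.";
    }
  }

  // Both attempts failed: return an empty result rather than throwing.
  return { memories: [], attempts: maxAttempts };
}
```

Capping the loop at two attempts keeps the extra LLM cost bounded: one malformed reply costs one additional call, and a second failure is given up on instead of being retried indefinitely.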
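Thank you for the quality improvements to the L1 extraction pipeline! We will review this internally as soon as possible and get back to you.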
Closes #82
Summary
Four improvements to the L1 memory extraction pipeline to reduce LLM dependency and improve extraction quality.
1. Restore & enhance L1 quality gate (sanitize.ts)
(… .{0,10} matching)

2. Rule-based pre-extraction layer (pre-extractor.ts, new)

3. Self-correction retry on JSON parse failure (l1-extractor.ts)

4. Post-LLM confidence check (l1-extractor.ts)

Additional fixes
(… .{0,N} → .{0,N}?)

Testing
54/55 unit tests pass.
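As background on the quantifier change listed under Additional fixes (.{0,N} → .{0,N}?), the short example below shows how a greedy bounded quantifier can run past the first terminator while the lazy form stops at it. The sample sentence and the pattern are made up for illustration and are not among the PR's actual patterns.

```typescript
// Illustrative only: this pattern is NOT one of the pre-extractor's patterns.
const sample = "I like tea. I like chess.";

// Greedy: .{0,30} grabs as much as it can, then backtracks to the LAST period.
const greedyMatch = /I like (.{0,30})\./.exec(sample);

// Lazy: .{0,30}? grows one character at a time and stops at the FIRST period.
const lazyMatch = /I like (.{0,30}?)\./.exec(sample);

const greedyCapture = greedyMatch ? greedyMatch[1] : null; // "tea. I like chess"
const lazyCapture = lazyMatch ? lazyMatch[1] : null;       // "tea"
```

For rule-based extraction the lazy form is usually the safer default, since it keeps each captured value confined to one clause.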