-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(assistant): compress insights output #28113
base: master
Are you sure you want to change the base?
Conversation
🔍 Existing Issues For ReviewYour pull request is modifying functions with the following pre-existing issues: 📄 File: ee/hogai/summarizer/nodes.py
Did you find this useful? React with a 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR introduces a comprehensive data compression and formatting system for PostHog's insights summarizer, optimizing token usage and improving LLM readability.
- Added
/ee/hogai/summarizer/format.py
with specialized formatters for trends, funnels, and retention data using pipe-delimited table formats - Implemented smart value formatting in
/ee/hogai/summarizer/format.py
for numbers, percentages, durations, and dates with proper rounding and unit handling - Enhanced error handling in
/ee/hogai/summarizer/nodes.py
with JSON fallback when compression fails - Added query-specific example prompts in
/ee/hogai/summarizer/prompts.py
to guide LLM interpretation of compressed data - Added comprehensive test coverage in
/ee/hogai/summarizer/test/test_format.py
for all formatting scenarios
5 file(s) reviewed, 7 comment(s)
Edit PR Review Bot Settings | Greptile
this_step_count = series["count"] | ||
first_step_count = matrix[1][1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: first_step_count is accessing matrix[1][1] which assumes there's always a first step at index 1, but this could fail if the matrix is empty or malformed
this_step_count = series["count"] | |
first_step_count = matrix[1][1] | |
this_step_count = series["count"] | |
first_step_count = matrix[1][1] if len(matrix) > 1 and len(matrix[1]) > 1 else 0 |
ee/hogai/summarizer/format.py
Outdated
|
||
formatted_matrix = _format_matrix(matrix) | ||
if results[0].get("breakdown_value") is not None: | ||
breakdown_value = series["breakdown_value"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: using 'series' variable from the last iteration of the for loop is unsafe - should use results[0] consistently
breakdown_value = series["breakdown_value"] | |
breakdown_value = results[0]["breakdown_value"] |
for idx, val in enumerate(series["values"]): | ||
initial_count = series["values"][0]["count"] | ||
count = val["count"] | ||
if idx == 0: | ||
row.append(_format_number(count)) | ||
row.append("100%") | ||
elif initial_count != 0: | ||
row.append(_format_percentage(count / initial_count)) | ||
else: | ||
row.append("0%") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: retention calculation adds both count and 100% for idx=0, causing misalignment in the matrix columns
Conversion rate|100%|50% | ||
Dropoff rate|0%|50% | ||
Average conversion time|-|1d | ||
Median conversion time|-|1d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax: missing closing ``` in the example code block
Median conversion time|-|1d | |
Median conversion time|-|1d |
The current date and time is {utc_datetime_display} UTC, which is {project_datetime_display} in this project's timezone ({project_timezone}). | ||
It's expected that the data point for the current period can have a drop in value, as it's not complete yet - don't point this out to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: consider moving the timezone and incomplete period notes into the example prompts since they are query-specific
self.assertEqual(_format_number(1.123456789), "1.12346") | ||
self.assertEqual(_format_number(1.123456789), "1.12346") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: duplicate assertion testing the same value twice
self.assertEqual(_format_number(1.123456789), "1.12346") | |
self.assertEqual(_format_number(1.123456789), "1.12346") | |
self.assertEqual(_format_number(1.123456789), "1.12346") |
self.assertIn( | ||
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s", | ||
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: assertIn is less strict than assertEqual - could miss formatting issues. Consider using assertEqual with exact expected output
self.assertIn( | |
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s", | |
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"), | |
) | |
self.assertEqual( | |
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"), | |
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s" | |
) |
Problem
The summarizer node has multiple problems:
Additionally, we want to implement a conversational agent, so the proper compression will allow us to orchestrate the analysis of multiple insights.
Changes
Funnels are the most unoptimized. I think it's okay to leave them in the following format for now and see if it becomes a problem. A funnel with a breakdown is a unique insight, as it has multiple somewhat related insights that are complicated to compress. The horizontal compression didn't work, and the vertical will have almost the same output. I'm open to any suggestions🤔
Compression examples
Trends
Funnels
Retention
With the summarizer changes above, our Gen Z founder goes nuts:
Does this work well for both Cloud and self-hosted?
No
How did you test this code?
Unit tests, manual testing.