Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(assistant): compress insights output #28113

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

skoob13
Copy link
Contributor

@skoob13 skoob13 commented Jan 30, 2025

Problem

The summarizer node has multiple problems:

  • Tokens usage is high due to JSON output.
  • The output format is not LLM-friendly.
  • Values are not formatted for specific insights (percentages, for example).
  • The summarizer output is somewhat distant from customer expectations.

Additionally, we want to implement a conversational agent, so the proper compression will allow us to orchestrate the analysis of multiple insights.

Changes

  • Use a self-made format that looks similar to a markdown table.
  • Compress values and keys for specific insight types.
  • Format values for the summarizer so it can provide nicer feedback.
  • Inject query-specific prompts.
  • Fallback to JSON if something goes wrong, as insights have many combinations.

Funnels are the most unoptimized. I think it's okay to leave them in the following format for now and see if it becomes a problem. A funnel with a breakdown is a unique insight, as it has multiple somewhat related insights that are complicated to compress. The horizontal compression didn't work, and the vertical will have almost the same output. I'm open to any suggestions🤔

Compression examples

Trends
Date|Chrome (breakdown)|Firefox (breakdown)|Safari (breakdown)|Opera (breakdown)|Mobile Safari (breakdown)|Android Mobile (breakdown)|Samsung Internet (breakdown)|Internet Explorer (breakdown)|Microsoft Edge (breakdown)|Firefox iOS (breakdown)|Chrome iOS (breakdown)
2024-01-01|0|0|0|0|0|0|0|0|0|0|0
2024-02-01|0|0|0|0|0|0|0|0|0|0|0
2024-03-01|0|0|0|0|0|0|0|0|0|0|0
2024-04-01|0|0|0|0|0|0|0|0|0|0|0
2024-05-01|0|0|0|0|0|0|0|0|0|0|0
2024-06-01|0|0|0|0|0|0|0|0|0|0|0
2024-07-01|0|0|0|0|0|0|0|0|0|0|0
2024-08-01|9|2|1|4|2|0|0|0|0|1|0
2024-09-01|782|264|356|134|126|92|38|21|5|22|8
2024-10-01|2316|1153|768|434|403|236|121|67|52|98|29
2024-11-01|5168|2226|1130|894|814|287|233|98|77|68|117
2024-12-01|4583|2053|939|738|682|265|248|104|127|56|66
2025-01-01|2902|0|0|0|0|0|0|0|0|0|0
Funnels
Date range: 2025-01-23 00:00:00 to 2025-01-30 23:59:59

---Australia
Metric|$pageview|$ai_span
Total person count|24|0
Conversion rate|100%|0%
Dropoff rate|0%|100%
Average conversion time|-|-
Median conversion time|-|-
---United States
Metric|$pageview|$ai_span
Total person count|12|0
Conversion rate|100%|0%
Dropoff rate|0%|100%
Average conversion time|-|-
Median conversion time|-|-

Conversion and drop-off rates are calculated in overall. For example, "Conversion rate: 9%" means that 9% of users from the first step completed the funnel.
Retention
Date range: 2025-01-02 00:00 to 2025-01-31 00:00
Granularity: Day
Date|Number of persons on date|Day 0|Day 1|Day 2|Day 3|Day 4|Day 5|Day 6|Day 7|Day 8|Day 9|Day 10|Day 11|Day 12|Day 13|Day 14|Day 15|Day 16|Day 17|Day 18|Day 19|Day 20|Day 21|Day 22|Day 23|Day 24|Day 25|Day 26|Day 27|Day 28|Day 29
2025-01-02 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-03 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-04 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-05 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-06 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-07 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-08 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-09 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-10 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-11 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-12 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-13 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-14 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-15 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-16 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-17 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-18 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-19 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-20 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-21 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-22 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-23 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-24 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%
2025-01-25 00:00|0|100%|0%|0%|0%|0%|0%|0%
2025-01-26 00:00|0|100%|0%|0%|0%|0%|0%
2025-01-27 00:00|0|100%|0%|0%|0%|0%
2025-01-28 00:00|0|100%|0%|0%|0%
2025-01-29 00:00|0|100%|0%|0%
2025-01-30 00:00|0|100%|0%
2025-01-31 00:00|0|100%

With the summarizer changes above, our Gen Z founder goes nuts:

Screenshot 2025-01-30 at 20 37 30

Does this work well for both Cloud and self-hosted?

No

How did you test this code?

Unit tests, manual testing.

Copy link

sentry-io bot commented Jan 30, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: ee/hogai/summarizer/nodes.py

Function Unhandled Issue
run ValueError: Can only run summarization with a visualization message as the last one in the state /api/env...
Event Count: 33
run ValueError: Can only run summarization with a visualization message as the last one in the state ...
Event Count: 23
run KeyError: 'results' /api/environments/{parent_loo...
Event Count: 6
run KeyError: 'results' ee.hogai.summarizer.nodes in run
Event Count: 3
run ValueError: Did not found query in the visualization message ee.hogai.summarizer....
Event Count: 1

Did you find this useful? React with a 👍 or 👎

@skoob13 skoob13 changed the title feat: trends feat(assistant): compress insights output Jan 30, 2025
@skoob13 skoob13 requested a review from Twixes January 31, 2025 12:46
@skoob13 skoob13 marked this pull request as ready for review January 31, 2025 12:46
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR introduces a comprehensive data compression and formatting system for PostHog's insights summarizer, optimizing token usage and improving LLM readability.

  • Added /ee/hogai/summarizer/format.py with specialized formatters for trends, funnels, and retention data using pipe-delimited table formats
  • Implemented smart value formatting in /ee/hogai/summarizer/format.py for numbers, percentages, durations, and dates with proper rounding and unit handling
  • Enhanced error handling in /ee/hogai/summarizer/nodes.py with JSON fallback when compression fails
  • Added query-specific example prompts in /ee/hogai/summarizer/prompts.py to guide LLM interpretation of compressed data
  • Added comprehensive test coverage in /ee/hogai/summarizer/test/test_format.py for all formatting scenarios

5 file(s) reviewed, 7 comment(s)
Edit PR Review Bot Settings | Greptile

Comment on lines +199 to +200
this_step_count = series["count"]
first_step_count = matrix[1][1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: first_step_count is accessing matrix[1][1] which assumes there's always a first step at index 1, but this could fail if the matrix is empty or malformed

Suggested change
this_step_count = series["count"]
first_step_count = matrix[1][1]
this_step_count = series["count"]
first_step_count = matrix[1][1] if len(matrix) > 1 and len(matrix[1]) > 1 else 0


formatted_matrix = _format_matrix(matrix)
if results[0].get("breakdown_value") is not None:
breakdown_value = series["breakdown_value"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: using 'series' variable from the last iteration of the for loop is unsafe - should use results[0] consistently

Suggested change
breakdown_value = series["breakdown_value"]
breakdown_value = results[0]["breakdown_value"]

Comment on lines +307 to +316
for idx, val in enumerate(series["values"]):
initial_count = series["values"][0]["count"]
count = val["count"]
if idx == 0:
row.append(_format_number(count))
row.append("100%")
elif initial_count != 0:
row.append(_format_percentage(count / initial_count))
else:
row.append("0%")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: retention calculation adds both count and 100% for idx=0, causing misalignment in the matrix columns

Conversion rate|100%|50%
Dropoff rate|0%|50%
Average conversion time|-|1d
Median conversion time|-|1d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: missing closing ``` in the example code block

Suggested change
Median conversion time|-|1d
Median conversion time|-|1d

Comment on lines +26 to 27
The current date and time is {utc_datetime_display} UTC, which is {project_datetime_display} in this project's timezone ({project_timezone}).
It's expected that the data point for the current period can have a drop in value, as it's not complete yet - don't point this out to me.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: consider moving the timezone and incomplete period notes into the example prompts since they are query-specific

Comment on lines 21 to 22
self.assertEqual(_format_number(1.123456789), "1.12346")
self.assertEqual(_format_number(1.123456789), "1.12346")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: duplicate assertion testing the same value twice

Suggested change
self.assertEqual(_format_number(1.123456789), "1.12346")
self.assertEqual(_format_number(1.123456789), "1.12346")
self.assertEqual(_format_number(1.123456789), "1.12346")

Comment on lines +304 to +307
self.assertIn(
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s",
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: assertIn is less strict than assertEqual - could miss formatting issues. Consider using assertEqual with exact expected output

Suggested change
self.assertIn(
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s",
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),
)
self.assertEqual(
compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),
"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s"
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant