feat(assistant): compress insights output #28113

skoob13 · 2025-01-30T19:50:32Z

Problem

The summarizer node has multiple problems:

Tokens usage is high due to JSON output.
The output format is not LLM-friendly.
Values are not formatted for specific insights (percentages, for example).
The summarizer output is somewhat distant from customer expectations.

Additionally, we want to implement a conversational agent, so the proper compression will allow us to orchestrate the analysis of multiple insights.

Changes

Use a self-made format that looks similar to a markdown table.
Compress values and keys for specific insight types.
Format values for the summarizer so it can provide nicer feedback.
Inject query-specific prompts.
Fallback to JSON if something goes wrong, as insights have many combinations.

Funnels are the most unoptimized. I think it's okay to leave them in the following format for now and see if it becomes a problem. A funnel with a breakdown is a unique insight, as it has multiple somewhat related insights that are complicated to compress. The horizontal compression didn't work, and the vertical will have almost the same output. I'm open to any suggestions🤔

Compression examples

Trends

Date|Chrome (breakdown)|Firefox (breakdown)|Safari (breakdown)|Opera (breakdown)|Mobile Safari (breakdown)|Android Mobile (breakdown)|Samsung Internet (breakdown)|Internet Explorer (breakdown)|Microsoft Edge (breakdown)|Firefox iOS (breakdown)|Chrome iOS (breakdown)
2024-01-01|0|0|0|0|0|0|0|0|0|0|0
2024-02-01|0|0|0|0|0|0|0|0|0|0|0
2024-03-01|0|0|0|0|0|0|0|0|0|0|0
2024-04-01|0|0|0|0|0|0|0|0|0|0|0
2024-05-01|0|0|0|0|0|0|0|0|0|0|0
2024-06-01|0|0|0|0|0|0|0|0|0|0|0
2024-07-01|0|0|0|0|0|0|0|0|0|0|0
2024-08-01|9|2|1|4|2|0|0|0|0|1|0
2024-09-01|782|264|356|134|126|92|38|21|5|22|8
2024-10-01|2316|1153|768|434|403|236|121|67|52|98|29
2024-11-01|5168|2226|1130|894|814|287|233|98|77|68|117
2024-12-01|4583|2053|939|738|682|265|248|104|127|56|66
2025-01-01|2902|0|0|0|0|0|0|0|0|0|0

Funnels

Date range: 2025-01-23 00:00:00 to 2025-01-30 23:59:59

---Australia
Metric|$pageview|$ai_span
Total person count|24|0
Conversion rate|100%|0%
Dropoff rate|0%|100%
Average conversion time|-|-
Median conversion time|-|-
---United States
Metric|$pageview|$ai_span
Total person count|12|0
Conversion rate|100%|0%
Dropoff rate|0%|100%
Average conversion time|-|-
Median conversion time|-|-

Conversion and drop-off rates are calculated in overall. For example, "Conversion rate: 9%" means that 9% of users from the first step completed the funnel.

Retention

Date range: 2025-01-02 00:00 to 2025-01-31 00:00
Granularity: Day
Date|Number of persons on date|Day 0|Day 1|Day 2|Day 3|Day 4|Day 5|Day 6|Day 7|Day 8|Day 9|Day 10|Day 11|Day 12|Day 13|Day 14|Day 15|Day 16|Day 17|Day 18|Day 19|Day 20|Day 21|Day 22|Day 23|Day 24|Day 25|Day 26|Day 27|Day 28|Day 29
2025-01-02 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-03 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-04 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-05 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-06 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-07 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-08 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-09 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-10 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-11 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-12 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-13 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-14 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-15 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-16 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-17 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-18 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-19 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-20 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-21 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-22 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-23 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%|0%
2025-01-24 00:00|0|100%|0%|0%|0%|0%|0%|0%|0%
2025-01-25 00:00|0|100%|0%|0%|0%|0%|0%|0%
2025-01-26 00:00|0|100%|0%|0%|0%|0%|0%
2025-01-27 00:00|0|100%|0%|0%|0%|0%
2025-01-28 00:00|0|100%|0%|0%|0%
2025-01-29 00:00|0|100%|0%|0%
2025-01-30 00:00|0|100%|0%
2025-01-31 00:00|0|100%

With the summarizer changes above, our Gen Z founder goes nuts:

Does this work well for both Cloud and self-hosted?

No

How did you test this code?

Unit tests, manual testing.

sentry-io · 2025-01-30T19:50:46Z

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: ee/hogai/summarizer/nodes.py

Function	Unhandled Issue
`run`	ValueError: Can only run summarization with a visualization message as the last one in the state /api/env... `Event Count:` 33
`run`	ValueError: Can only run summarization with a visualization message as the last one in the state ... `Event Count:` 23
`run`	KeyError: 'results' /api/environments/{parent_loo... `Event Count:` 6
`run`	KeyError: 'results' ee.hogai.summarizer.nodes in run `Event Count:` 3
`run`	ValueError: Did not found query in the visualization message ee.hogai.summarizer.... `Event Count:` 1

_{Did you find this useful? React with a 👍 or 👎}

greptile-apps

PR Summary

This PR introduces a comprehensive data compression and formatting system for PostHog's insights summarizer, optimizing token usage and improving LLM readability.

Added /ee/hogai/summarizer/format.py with specialized formatters for trends, funnels, and retention data using pipe-delimited table formats
Implemented smart value formatting in /ee/hogai/summarizer/format.py for numbers, percentages, durations, and dates with proper rounding and unit handling
Enhanced error handling in /ee/hogai/summarizer/nodes.py with JSON fallback when compression fails
Added query-specific example prompts in /ee/hogai/summarizer/prompts.py to guide LLM interpretation of compressed data
Added comprehensive test coverage in /ee/hogai/summarizer/test/test_format.py for all formatting scenarios

_{5 file(s) reviewed, 7 comment(s)}
_{Edit PR Review Bot Settings | Greptile}

greptile-apps · 2025-01-31T12:47:58Z

ee/hogai/summarizer/format.py

+        this_step_count = series["count"]
+        first_step_count = matrix[1][1]


logic: first_step_count is accessing matrix[1][1] which assumes there's always a first step at index 1, but this could fail if the matrix is empty or malformed

Suggested change

this_step_count = series["count"]

first_step_count = matrix[1][1]

this_step_count = series["count"]

first_step_count = matrix[1][1] if len(matrix) > 1 and len(matrix[1]) > 1 else 0

greptile-apps · 2025-01-31T12:47:58Z

ee/hogai/summarizer/format.py

+
+    formatted_matrix = _format_matrix(matrix)
+    if results[0].get("breakdown_value") is not None:
+        breakdown_value = series["breakdown_value"]


logic: using 'series' variable from the last iteration of the for loop is unsafe - should use results[0] consistently

Suggested change

breakdown_value = series["breakdown_value"]

breakdown_value = results[0]["breakdown_value"]

greptile-apps · 2025-01-31T12:47:59Z

ee/hogai/summarizer/format.py

+        for idx, val in enumerate(series["values"]):
+            initial_count = series["values"][0]["count"]
+            count = val["count"]
+            if idx == 0:
+                row.append(_format_number(count))
+                row.append("100%")
+            elif initial_count != 0:
+                row.append(_format_percentage(count / initial_count))
+            else:
+                row.append("0%")


logic: retention calculation adds both count and 100% for idx=0, causing misalignment in the matrix columns

greptile-apps · 2025-01-31T12:47:59Z

ee/hogai/summarizer/prompts.py

+Conversion rate|100%|50%
+Dropoff rate|0%|50%
+Average conversion time|-|1d
+Median conversion time|-|1d


syntax: missing closing ``` in the example code block

Suggested change

Median conversion time|-|1d

Median conversion time|-|1d

greptile-apps · 2025-01-31T12:48:00Z

ee/hogai/summarizer/prompts.py

+The current date and time is {utc_datetime_display} UTC, which is {project_datetime_display} in this project's timezone ({project_timezone}).
 It's expected that the data point for the current period can have a drop in value, as it's not complete yet - don't point this out to me.


style: consider moving the timezone and incomplete period notes into the example prompts since they are query-specific

greptile-apps · 2025-01-31T12:48:04Z

ee/hogai/summarizer/test/test_format.py

+        self.assertEqual(_format_number(1.123456789), "1.12346")
+        self.assertEqual(_format_number(1.123456789), "1.12346")


logic: duplicate assertion testing the same value twice

Suggested change

self.assertEqual(_format_number(1.123456789), "1.12346")

self.assertEqual(_format_number(1.123456789), "1.12346")

self.assertEqual(_format_number(1.123456789), "1.12346")

greptile-apps · 2025-01-31T12:48:05Z

ee/hogai/summarizer/test/test_format.py

+        self.assertIn(
+            "Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s",
+            compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),
+        )


style: assertIn is less strict than assertEqual - could miss formatting issues. Consider using assertEqual with exact expected output

Suggested change

self.assertIn(

"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s",

compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),

)

self.assertEqual(

compress_and_format_funnels_results(results, date_from="2025-01-20", date_to="2025-01-22"),

"Date range: 2025-01-20 to 2025-01-22\n\n---au\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s\n\n---us\nMetric|$pageview|signup\nTotal person count|5|2\nConversion rate|100%|40%\nDropoff rate|0%|60%\nAverage conversion time|-|10s\nMedian conversion time|-|11s"

)

…zer-compression

skoob13 added 7 commits January 30, 2025 16:37

feat: trends

7e063dd

feat: custom_names and breakdown labels for trends

4b922fe

feat: prompts

4f7b8b6

feat: format funnel

92759d2

feat: funnels

e99f23c

fix: missing date range

86223a0

feat: format dates

f09f15c

skoob13 changed the title ~~feat: trends~~ feat(assistant): compress insights output Jan 30, 2025

skoob13 added 4 commits January 31, 2025 12:11

feat: retention

1756683

feat: fallback strategy

dac8c1b

chore: prompts spelling

8c12934

test: fallback

bc208dd

skoob13 requested a review from Twixes January 31, 2025 12:46

skoob13 marked this pull request as ready for review January 31, 2025 12:46

greptile-apps bot reviewed Jan 31, 2025

View reviewed changes

skoob13 added 2 commits January 31, 2025 13:52

fix: greptile comments

c614ad6

Merge branch 'master' of github.com:PostHog/posthog into feat/summari…

203319a

…zer-compression

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(assistant): compress insights output #28113

feat(assistant): compress insights output #28113

skoob13 commented Jan 30, 2025 •

edited

Loading

sentry-io bot commented Jan 30, 2025

greptile-apps bot left a comment

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

greptile-apps bot Jan 31, 2025

		this_step_count = series["count"]
		first_step_count = matrix[1][1]

	breakdown_value = series["breakdown_value"]
	breakdown_value = results[0]["breakdown_value"]

		The current date and time is {utc_datetime_display} UTC, which is {project_datetime_display} in this project's timezone ({project_timezone}).
		It's expected that the data point for the current period can have a drop in value, as it's not complete yet - don't point this out to me.

		self.assertEqual(_format_number(1.123456789), "1.12346")
		self.assertEqual(_format_number(1.123456789), "1.12346")

feat(assistant): compress insights output #28113

Are you sure you want to change the base?

feat(assistant): compress insights output #28113

Conversation

skoob13 commented Jan 30, 2025 • edited Loading

Problem

Changes

Does this work well for both Cloud and self-hosted?

How did you test this code?

sentry-io bot commented Jan 30, 2025

🔍 Existing Issues For Review

greptile-apps bot left a comment

Choose a reason for hiding this comment

PR Summary

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

greptile-apps bot Jan 31, 2025

Choose a reason for hiding this comment

skoob13 commented Jan 30, 2025 •

edited

Loading