readme update
aaryanpunia authored and densumesh committed Jul 23, 2024
1 parent 9510c1b commit dfc3349
Showing 1 changed file with 11 additions and 6 deletions.
17 changes: 11 additions & 6 deletions docker/collapse-query-script/README.md
@@ -1,11 +1,11 @@
# Search Query Collapse Script

This script is designed to optimize search query analytics by collapsing similar queries in a ClickHouse database. It addresses the issue of storing redundant partial queries (e.g., "a", "ap", "app", "apple") which can skew analytics results.
This script optimizes search query analytics by collapsing similar queries in a ClickHouse database. It addresses the issue of storing redundant partial queries (e.g., "a", "ap", "app", "apple") which can skew analytics results, while also considering the timing of these queries.

## Purpose

The main purpose of this script is to:
1. Identify and remove partial queries that are prefixes of longer, more complete queries.
1. Identify and remove partial queries that are prefixes of longer, more complete queries, but only if they occur within a 10-second window of each other.
2. Process queries across multiple datasets stored in ClickHouse.
3. Keep track of the last processed timestamp for each dataset to allow for incremental updates.

@@ -16,18 +16,23 @@ The main purpose of this script is to:
3. For each dataset:
- It fetches the timestamp of the last collapse operation from the `last_collapsed_dataset` table.
- It retrieves search queries in batches of 5000, starting from the last collapsed timestamp.
- The `collapse_queries` function identifies queries that are prefixes of longer queries.
- The `collapse_queries` function identifies queries that are prefixes of longer queries and occur within 10 seconds of each other.
- Identified partial queries are deleted from the database.
- The process continues until all queries in the dataset are processed or no new queries are found.
- The last processed timestamp is updated in the `last_collapsed_dataset` table.
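
The per-dataset loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `InMemoryStore` is a hypothetical stand-in for the ClickHouse tables, the method signatures are assumptions modeled on the function names listed in this README, and rows are assumed to be `(id, query, created_at)` tuples.

```python
from datetime import datetime, timedelta

BATCH_SIZE = 5000  # batch size stated in the README


class InMemoryStore:
    """Hypothetical stand-in for one dataset's ClickHouse tables."""

    def __init__(self, rows):
        # rows: (id, query, created_at) tuples, kept sorted by timestamp
        self.rows = sorted(rows, key=lambda r: r[2])
        self.last_collapsed = {}

    def get_dataset_last_collapsed(self, dataset_id):
        return self.last_collapsed.get(dataset_id)

    def set_dataset_last_collapsed(self, dataset_id, ts):
        self.last_collapsed[dataset_id] = ts

    def get_search_queries(self, dataset_id, since, limit):
        # Return up to `limit` rows strictly newer than `since`.
        newer = [r for r in self.rows if since is None or r[2] > since]
        return newer[:limit]

    def delete_queries(self, ids):
        ids = set(ids)
        before = len(self.rows)
        self.rows = [r for r in self.rows if r[0] not in ids]
        return before - len(self.rows)


def process_dataset(store, dataset_id, collapse, batch_size=BATCH_SIZE):
    """Collapse one dataset batch by batch; return the number of deleted rows."""
    cursor = store.get_dataset_last_collapsed(dataset_id)
    deleted = 0
    while True:
        batch = store.get_search_queries(dataset_id, cursor, batch_size)
        if not batch:
            break  # no new queries since the last run
        deleted += store.delete_queries(collapse(batch))
        cursor = batch[-1][2]  # timestamp of the newest row seen so far
        if len(batch) < batch_size:
            break  # a short batch means the dataset is exhausted
    if cursor is not None:
        store.set_dataset_last_collapsed(dataset_id, cursor)
    return deleted
```

In this simplified version, a prefix pair that straddles a batch boundary is not collapsed, and rows sharing the exact boundary timestamp could be skipped; whether the real script handles those cases is not stated here.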

## Main Functions

- `get_search_queries`: Retrieves search queries for a specific dataset.
- `get_search_queries`: Retrieves search queries for a specific dataset, converting timestamps to datetime objects.
- `get_datasets`: Gets a list of all dataset IDs.
- `get_dataset_last_collapsed`: Retrieves the timestamp of the last collapse operation for a dataset.
- `set_dataset_last_collapsed`: Updates the last collapse timestamp for a dataset.
- `collapse_queries`: Identifies partial queries that should be removed.
- `collapse_queries`: Identifies partial queries that should be removed, considering a 10-second time window.
- `delete_queries`: Removes identified partial queries from the database.

The script will process all datasets, collapse queries, and provide output on the number of deleted rows for each dataset.
## Query Collapse Logic

The script now only collapses queries that meet the following criteria:
1. A query is a prefix of a subsequent query.
2. The subsequent query occurs within 10 seconds of the prefix query.
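
As an illustration of the two criteria, here is a minimal sketch of what a `collapse_queries`-style function might look like. It is not taken from the script: the `(id, query, created_at)` row shape, the comparison of each query only with the one immediately after it, and the treatment of identical queries as non-prefixes are all assumptions made for this example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=10)  # time window stated in the README


def collapse_queries(rows, window=WINDOW):
    """Return ids of rows whose query is a strict prefix of the next row's
    query, when that next row arrives within `window`.

    rows: iterable of (id, query, created_at) tuples (assumed shape).
    """
    ordered = sorted(rows, key=lambda r: r[2])
    to_delete = []
    for (cur_id, cur_q, cur_ts), (_, next_q, next_ts) in zip(ordered, ordered[1:]):
        is_strict_prefix = next_q.startswith(cur_q) and next_q != cur_q
        if is_strict_prefix and next_ts - cur_ts <= window:
            to_delete.append(cur_id)
    return to_delete


t0 = datetime(2024, 7, 23, 12, 0, 0)
rows = [
    (1, "a", t0),
    (2, "ap", t0 + timedelta(seconds=1)),
    (3, "app", t0 + timedelta(seconds=2)),
    (4, "apple", t0 + timedelta(seconds=3)),
    (5, "app", t0 + timedelta(seconds=60)),        # a new, later search
    (6, "apple pie", t0 + timedelta(seconds=75)),  # 15 s later: outside the window
]
ids_to_delete = collapse_queries(rows)  # ids 1, 2 and 3
```

The keystroke-by-keystroke rows for "a", "ap", and "app" are marked for deletion, while the later "app" search is kept because its longer follow-up arrives more than 10 seconds afterwards.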
