
[DONT MERGE] Draft query stat API #2148

Draft · wants to merge 4 commits into base: master
Conversation

@phoebusm phoebusm commented Jan 28, 2025

Reference Issues/PRs

What does this implement or fix?

Draft query stat API for discussion and as placeholder

Todo:

  • Storage specific stats
  • Update API

Sample use cases

(Unit of time below is ms)

Very slow list_symbols due to a very outdated symbol list cache

When a library has frequent symbol adds/deletes, many SymbolList keys are written as journal entries.
This list of keys is usually compacted by

  • list_symbols call if caller has write permission
  • Symbol list compaction job if there is any

If neither of the above happens, then when users call list_symbols all of those keys must be iterated to build the symbol list, which can be very time-consuming when the list is long.
With query stats, users can identify the cause by checking the count in the entry:

  arcticdb_call stage key_type library     storage_op count total_time time_count_140 time_count_150 time_count_160 time_count_170 time_count_180
0  list_symbols  list       sl       a  ListObjectsV2   562      91044             20            200            200            100             42
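A minimal sketch of how a caller might flag this case from the stats. The list-of-dicts shape and the threshold below are illustrative assumptions, not the draft API proposed in this PR:

```python
# Hypothetical representation of one query-stat entry, mirroring the column
# names in the sample output above. The real API here is still a draft.
stats = [
    {"arcticdb_call": "list_symbols", "stage": "list", "key_type": "sl",
     "storage_op": "ListObjectsV2", "count": 562, "total_time": 91044},
]

# A large number of SymbolList journal keys iterated by a single
# list_symbols call suggests the symbol list cache is badly outdated.
STALE_KEY_THRESHOLD = 100  # illustrative cutoff, not an ArcticDB constant

def stale_symbol_list_entries(rows):
    return [r for r in rows
            if r["key_type"] == "sl" and r["count"] > STALE_KEY_THRESHOLD]

print(len(stale_symbol_list_entries(stats)))  # 1 entry flagged
```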
Very slow S3 API call

When the S3 endpoint performs poorly, the per-call time (total_time divided by count) in the stats is very helpful for pinpointing the culprit of the poor performance:

  arcticdb_call stage key_type library     storage_op  count  total_time result_count time_count_16200
0  list_symbols  list       sl       a  ListObjectsV2      1       16200            1                1 

In this case, the user can enable S3 logging to find out which exact step is slow.
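Computing the per-call average from such an entry is straightforward. A sketch under the same assumed dict layout as above (not the draft API itself):

```python
# Hypothetical query-stat entry for a single slow ListObjectsV2 call; the
# column names mirror the sample output above, but the API shape is assumed.
row = {"storage_op": "ListObjectsV2", "count": 1, "total_time": 16200}

def avg_time_ms(r):
    # total_time is in ms, so the per-call average is total / count.
    return r["total_time"] / r["count"]

print(avg_time_ms(row))  # 16200.0 ms per call: the endpoint itself is slow
```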

Slow read due to fragmented segments

The library has a very small setting for the number of columns per segment, so a simple read requires reading many segments.
This can be discovered by checking count in the stats, and verified by a disproportionately high ratio of tdata key count to vref key count. Below is an example of a simple read of 9 identical symbols, each with 1000 segments in a version:

  arcticdb_call         stage key_type library storage_op count total_time uncompressed_size compressed_size time_count_140 time_count_150 time_count_160 time_count_40 time_count_50 time_count_60
0          read          read    tdata       a  GetObject  9000    1458000               100              10           4000           4000           1000           NaN           NaN           NaN
1          read          read   tindex       a  GetObject     9        450               NaN             NaN            NaN            NaN            NaN             3             3             3
2          read  find_version     vref       a  GetObject     9        450               NaN             NaN            NaN            NaN            NaN             2             4             3     
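The fragmentation check reduces to a ratio of two counts. A sketch using the numbers from the table above (the dict layout is an illustrative assumption about how the draft stats could be consumed):

```python
# Key counts taken from the sample table: 9 symbols read, 9000 tdata
# (data segment) fetches vs 9 vref fetches.
counts = {"tdata": 9000, "vref": 9}

def segments_per_read(c):
    # A healthy read touches a handful of tdata keys per vref key; a very
    # high ratio points at over-fragmented segments.
    return c["tdata"] / c["vref"]

print(segments_per_read(counts))  # 1000.0 segments fetched per version read
```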
Traversing a long version chain

When a symbol has accumulated numerous small appends, a simple read of the latest version can be slow because the version chain must be iterated. This is compounded when many old versions have been deleted, which means even more version keys need to be traversed. For example, if 10 versions are appended and the last 5 versions are deleted, the version keys will be traversed in this order (T = tombstone):

vref -> ver 5T -> ver 6T -> ver 7T -> ver 8T -> ver 9T -> ver 9 -> ver 8 -> ver 7 -> ver 6 -> ver 5 -> ver 4
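The ordering above can be reconstructed with a small model: the tombstone keys for the deleted versions are met first in ascending order, then the chain is walked from newest to oldest. This is an illustrative sketch of the order described here, not ArcticDB's actual traversal code:

```python
def traversal_order(n_appended=10, n_deleted=5):
    # Versions are numbered 0..n_appended-1; the last n_deleted of them
    # carry tombstones. This only models the key ordering described above.
    newest = n_appended - 1
    first_deleted = n_appended - n_deleted
    tombstones = [f"ver {v}T" for v in range(first_deleted, n_appended)]
    chain = [f"ver {v}" for v in range(newest, first_deleted - 2, -1)]
    return ["vref"] + tombstones + chain

print(" -> ".join(traversal_order()))
```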

This can be discovered from a disproportionately high ratio of ver key count to vref key count, with the number of tombstone keys highlighted in the tombstone column for ver keys:

  arcticdb_call         stage key_type library storage_op count total_time tombstone uncompressed_size compressed_size time_count_50 time_count_140 time_count_150 time_count_160 time_count_170
0          read  find_version      ver       a  GetObject    11       1782         5               NaN             NaN           NaN              3              2              3              3
1          read  find_version     vref       a  GetObject     1         50       NaN               NaN             NaN             1            NaN            NaN            NaN            NaN
2          read          read    tdata       a  GetObject     5        810       NaN               100              10           NaN              1              2              1              1
3          read          read   tindex       a  GetObject     1         50       NaN               NaN             NaN             1            NaN            NaN            NaN            NaN                                                                  
Slow response when trying to read a tombstoned/deleted version of a symbol

When a user tries to read a specific tombstoned/deleted version of a symbol, the process iterates all existing snapshots of the library. This is a last-resort attempt to find the version, and it can be painstakingly slow because every snapshot must be checked. Below is an example of a user trying to read a non-existent version number larger than the latest version, with 100 snapshots in the library:

  arcticdb_call             stage key_type library     storage_op count total_time tombstone time_count_30 time_count_40 time_count_50 time_count_90 time_count_100 time_count_110
0          read      find_version      ver       a      GetObject     1         43         0           NaN             1           NaN           NaN            NaN            NaN
1          read      find_version     vref       a      GetObject     1         50       NaN           NaN           NaN             1           NaN            NaN            NaN
2          read  iterate_snapshot     tref       a  ListObjectsV2     1         37       NaN             1           NaN           NaN           NaN            NaN            NaN
3          read  iterate_snapshot     snap       a  ListObjectsV2     1         37       NaN             1           NaN           NaN           NaN            NaN            NaN
4          read  iterate_snapshot     tref       a      GetObject   100       1000       NaN           NaN           NaN           NaN            10             80             10
Additional grouping

The pre-designated groupings may not be enough to identify which part of a long-running process is the culprit of the slowness. User-supplied additional grouping makes the investigation easier. Below is an example mixing a read of the latest version of a symbol with a read of the same symbol's version in a particular snapshot:

  user_group arcticdb_call          stage key_type library  storage_op count total_time uncompressed_size compressed_size time_count_30 time_count_50
0     latest          read   find_version     vref       a   GetObject     1         50               NaN             NaN           NaN             1
1     latest          read           read    tdata       a   GetObject     1         36               100              10             1           NaN
2     latest          read           read   tindex       a   GetObject     1         50               NaN             NaN           NaN             1
3   snapshot          read  find_snapshot     tref       a  HeadObject     1         50               NaN             NaN           NaN             1
4   snapshot          read  find_snapshot     tref       a   GetObject     1         50               NaN             NaN           NaN             1
5   snapshot          read           read   tindex       a   GetObject     1         50               NaN             NaN           NaN             1
6   snapshot          read           read    tdata       a   GetObject     1         36               100              10             1           NaN
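Aggregating by the user-defined group then shows which code path dominates. A sketch with total_time values condensed from the table above (the list-of-dicts shape is an assumption about how the draft stats could be consumed):

```python
from collections import defaultdict

# total_time (ms) per row, condensed from the sample table above.
rows = [
    {"user_group": "latest",   "total_time": 50},
    {"user_group": "latest",   "total_time": 36},
    {"user_group": "latest",   "total_time": 50},
    {"user_group": "snapshot", "total_time": 50},
    {"user_group": "snapshot", "total_time": 50},
    {"user_group": "snapshot", "total_time": 50},
    {"user_group": "snapshot", "total_time": 36},
]

def time_by_group(rs):
    # Sum total_time per user-defined group to compare the two code paths.
    totals = defaultdict(int)
    for r in rs:
        totals[r["user_group"]] += r["total_time"]
    return dict(totals)

print(time_by_group(rows))  # {'latest': 136, 'snapshot': 186}
```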
Analyzing why the script is slow

A very common request. With query stats, the user can pinpoint exactly which step is slow. Below is a common use case:

  1. Check symbol exist
  2. Read symbol latest version
  3. Update symbol
   arcticdb_call         stage key_type library     storage_op count total_time uncompressed_size compressed_size time_count_30 time_count_50
0   list_symbols          list       sl       a  ListObjectsV2     1         50               NaN             NaN             1           NaN
1           read  find_version     vref       a      GetObject     1         50               NaN             NaN           NaN             1
2           read          read    tdata       a      GetObject     1         36               100              10             1           NaN
3           read          read   tindex       a      GetObject     1         50               NaN             NaN           NaN             1
4          write  find_version     vref       a      GetObject     1         50               NaN             NaN           NaN             1
5          write  find_version      ver       a      GetObject     1         50               NaN             NaN           NaN             1
6          write         write    tdata       a      GetObject     1         36               100              10             1           NaN
7          write         write   tindex       a      GetObject     1         50               NaN             NaN           NaN             1
8          write         write      ver       a      GetObject     3         50               NaN             NaN           NaN             3
9          write         write      ver       a      PutObject     1         50               NaN             NaN           NaN             3
10         write         write     vref       a      GetObject     1         50               NaN             NaN           NaN             1
11         write        delete     tref       a  ListObjectsV2     1         50               NaN             NaN           NaN             1
12         write        delete     snap       a  ListObjectsV2     1         50               NaN             NaN           NaN             1
13         write        delete     tref       a      GetObject   100       5000               NaN             NaN            50            50
14         write        delete   cstats       a  DeleteObjects     1         50               NaN             NaN           NaN             1
15         write        delete   tindex       a  DeleteObjects     1         50               NaN             NaN           NaN             1
16         write        delete    tdata       a  DeleteObjects     1         50               NaN             NaN           NaN             1

With the above stats, the user can tell that the script run is slow because the library has delayed delete turned OFF. Each overwrite therefore has to iterate all snapshots to find out whether the version being overwritten is safe to delete.
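Reaching that conclusion is a matter of aggregating total_time per (arcticdb_call, stage). A sketch using triples condensed from the table above (an illustrative consumer of the draft stats, not the proposed API):

```python
from collections import defaultdict

# (arcticdb_call, stage, total_time) triples condensed from the table above.
rows = [
    ("list_symbols", "list", 50),
    ("read", "find_version", 50), ("read", "read", 36), ("read", "read", 50),
    ("write", "find_version", 50), ("write", "find_version", 50),
    ("write", "write", 36), ("write", "write", 50), ("write", "write", 50),
    ("write", "write", 50), ("write", "write", 50),
    ("write", "delete", 50), ("write", "delete", 50),
    ("write", "delete", 5000), ("write", "delete", 50),
    ("write", "delete", 50), ("write", "delete", 50),
]

def slowest_stage(rs):
    # Sum total_time (ms) per (call, stage) and return the dominant one.
    totals = defaultdict(int)
    for call, stage, t in rs:
        totals[(call, stage)] += t
    return max(totals.items(), key=lambda kv: kv[1])

print(slowest_stage(rows))  # (('write', 'delete'), 5250): snapshot iteration dominates
```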

@phoebusm phoebusm marked this pull request as draft January 28, 2025 10:27
@phoebusm phoebusm changed the title Draft query stat API [DONT MERGE] Draft query stat API Jan 29, 2025
'count': [1, 5],
'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
Collaborator:
I think a histogram will be better than min max avg

'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
'uncompressed_size': [10, 1000],
Collaborator:
What does 10 mean for the list here? You need to give realistic output for anyone to understand this proposal
