[DONT MERGE] Draft query stat API #2148
Conversation
```python
'count': [1, 5],
'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
```
I think a histogram would be better than min/max/avg.
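For illustration only, a sketch of what a histogram-shaped entry could look like in place of min/max/avg; all field names and bucket edges below are hypothetical and not part of this draft:

```python
# Hypothetical alternative to min_time/max_time/avg_time: a fixed-bucket
# latency histogram per grouping (field names and bucket edges are
# illustrative, not part of the draft API).
stats_entry = {
    'count': [1, 5],
    # Upper bucket bounds in ms, shared by every histogram below.
    'time_buckets_ms': [1, 5, 10, 50],
    # One histogram per grouping: how many operations fell in each bucket.
    'time_histogram': [
        [1, 0, 0, 0],
        [2, 1, 1, 1],
    ],
}
```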
```python
'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
'uncompressed_size': [10, 1000],
```
What does `10` mean in the list here? You need to give realistic output for anyone to understand this proposal.
Reference Issues/PRs
What does this implement or fix?
Draft query stat API, for discussion and as a placeholder.
Todo:
Sample use cases
(Unit of time below is ms.)
Very slow list_symbols because of a very outdated symbol list cache
In a situation where the library has frequent symbol adds/deletes, many SymbolList keys will be added as journal entries. This list of keys is usually compacted by a list_symbols call if the caller has write permission. If that compaction never happens, then when users call list_symbols, all of those keys have to be iterated to build the symbol list, which can be very time-consuming if the list is long. With query stats, users can learn the cause by checking the result_count in the entry:
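A hedged sketch of the kind of entry this could surface; the shape, key-type names, and numbers are illustrative only, not the draft's actual output:

```python
# Illustrative only: a list_symbols call that had to walk a long,
# uncompacted symbol-list journal. The very high result_count against a
# single logical operation is the tell-tale sign of an outdated cache.
list_symbols_stats = {
    'SYMBOL_LIST': {
        'count': 1,             # one listing operation
        'result_count': 50000,  # journal keys iterated to rebuild the list
        'total_time': 12000,    # ms spent iterating them
    },
}
```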
Very slow S3 API call
In a situation where the S3 endpoint has poor performance, the avg_time in the stats is very helpful for pinpointing the culprit; the user can then turn on the S3 log to understand which exact step is slow. A sketch of such output:
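Again the shape and numbers below are illustrative only:

```python
# Illustrative only: when the S3 endpoint itself is slow, avg_time is
# uniformly high across every key type instead of on one hot spot.
read_stats = {
    'VERSION_REF': {'count': 1, 'avg_time': 900},
    'VERSION':     {'count': 2, 'avg_time': 850},
    'TABLE_DATA':  {'count': 4, 'avg_time': 880},
}
```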
Slow read due to fragmented segments
The library has a very small setting for the number of columns per segment, so a simple read requires reading many segments. This can be discovered by checking the count in the stats, and verified by a disproportionately high ratio of tdata key count to vref key count. Below is an example of a simple read of 9 identical symbols with 1000 segments in a version:
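A hedged sketch of what those counts could look like (key-type names are assumptions, not the draft's confirmed output):

```python
# Illustrative only: 9 symbols read, each version fragmented into 1000
# segments, so data-key reads outnumber version-ref reads 9000 : 9.
read_stats = {
    'TABLE_DATA':  {'count': 9000},  # one storage read per segment
    'VERSION_REF': {'count': 9},     # one per symbol
}
```

Traversing a long version chain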
In a situation where a symbol has had numerous small appends, a simple read of the latest version can be slow because the version chain has to be iterated. This can be compounded by many old versions having been deleted, which means more v keys need to be traversed. For example, if 10 versions are appended and the last 5 versions are deleted, the v keys will be traversed in this order (T = tombstone):
vref -> ver 5T -> ver 6T -> ver 7T -> ver 8T -> ver 9T -> ver 9 -> ver 8 -> ver 7 -> ver 6 -> ver 5 -> ver 4
This can be discovered by a disproportionately high ratio of v key count to vref key count, with the counts of tombstone keys highlighted in the v key entry:
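A hedged sketch matching the traversal above (field names are illustrative assumptions):

```python
# Illustrative only: 10 appends with the last 5 tombstoned, so the read
# walks 11 v keys (5 of them tombstones) before reaching live version 4.
read_stats = {
    'VERSION':     {'count': 11, 'tombstone_count': 5},
    'VERSION_REF': {'count': 1},
}
```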
Slow response when trying to read a tombstoned/deleted version of a symbol
When a user tries to specifically read a tombstoned/deleted version of a symbol, the process iterates all existing snapshots of the library. This is a last-resort attempt to find the version, but it can be painstakingly slow since every snapshot has to be iterated. Below is an example of a user trying to read a non-existent version number larger than the latest version, with 100 snapshots in the library:
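A hedged sketch of the resulting stats (shape and numbers illustrative only):

```python
# Illustrative only: the last-resort path touches every snapshot in the
# library (100 here) while failing to find the requested version.
read_stats = {
    'SNAPSHOT_REF': {'count': 100, 'total_time': 4000},
    'VERSION':      {'count': 11},
    'VERSION_REF':  {'count': 1},
}
```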
Additional grouping
The pre-designated groupings may not be enough to identify which part of a long-running process is the culprit of the slowness. Additional, user-supplied grouping makes the investigation easier. Below is an example of a mixed workload on the same symbol: reading the latest version and reading the version in a particular snapshot:
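A hedged sketch of grouped output; the grouping mechanism and labels here are assumptions for illustration:

```python
# Illustrative only: a user-supplied label splits the stats, separating a
# latest-version read from a snapshot read of the same symbol.
grouped_stats = {
    'latest_read':   {'VERSION_REF':  {'count': 1, 'total_time': 5}},
    'snapshot_read': {'SNAPSHOT_REF': {'count': 1, 'total_time': 7}},
}
```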
Analyzing why the script is slow
A very common request. With query stats, the user can pinpoint which exact step is slow. Below is a common use case:
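A hedged sketch of the stats such a script could produce (key-type names and numbers are illustrative assumptions):

```python
# Illustrative only: each overwrite in the script has to scan all 100
# snapshots (delayed delete OFF) before the old version can be deleted,
# dwarfing the time spent writing the data itself.
write_stats = {
    'SNAPSHOT_REF': {'count': 100, 'total_time': 5000},
    'TABLE_DATA':   {'count': 10,  'total_time': 40},
}
```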
With the above stats, the user can tell that the script run is slow because the library has delayed delete turned OFF: every overwrite therefore has to iterate all snapshots to find out whether the version being overwritten is safe to delete.