add prometheus metrics #514
base: develop
Conversation
Vercel deployment URL: https://bitcoin-indexer-ajnyrbzi5-hirosystems.vercel.app 🚀
Thanks for this! Looking forward to exposing this data. Left some questions/comments.
pub rune_parsing_time: Histogram,
pub rune_computation_time: Histogram,
If I understand this correctly, parsing and computation describe two phases involved in processing a rune, is that correct? If so, are there any other phases we should track?
Yes, that's correct. I am looking into it and will note any other phases there might be. The db_write_time is also part of this; I appended rune_ to that as well.
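For context, here is a minimal sketch of how per-phase histograms like these might be defined and registered with the prometheus crate. The struct name, metric names, and bucket boundaries are illustrative assumptions, not the PR's actual definitions.

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

// Hypothetical grouping of the per-phase rune timings discussed above.
pub struct RunePhaseTimers {
    pub rune_parsing_time: Histogram,
    pub rune_computation_time: Histogram,
    pub rune_db_write_time: Histogram,
}

impl RunePhaseTimers {
    pub fn new(registry: &Registry) -> prometheus::Result<Self> {
        // Bucket boundaries (milliseconds) are placeholder assumptions.
        let buckets = vec![1.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0];
        let rune_parsing_time = Histogram::with_opts(
            HistogramOpts::new("rune_parsing_time_ms", "Time spent parsing runestones per block")
                .buckets(buckets.clone()),
        )?;
        let rune_computation_time = Histogram::with_opts(
            HistogramOpts::new("rune_computation_time_ms", "Time spent computing rune state per block")
                .buckets(buckets.clone()),
        )?;
        let rune_db_write_time = Histogram::with_opts(
            HistogramOpts::new("rune_db_write_time_ms", "Time spent writing rune results to the DB")
                .buckets(buckets),
        )?;
        registry.register(Box::new(rune_parsing_time.clone()))?;
        registry.register(Box::new(rune_computation_time.clone()))?;
        registry.register(Box::new(rune_db_write_time.clone()))?;
        Ok(Self { rune_parsing_time, rune_computation_time, rune_db_write_time })
    }
}
```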
// Update cache size metric
prometheus.metrics_update_cache_size(cache_l2.len() as u64);

// Memory usage metric - include all caches
let mut total_memory = cache_l2.len() as f64 * 0.1; // L2 cache estimate
total_memory += cache_l1.len() as f64 * 0.05; // L1 cache estimate
if brc20_cache.is_some() {
    // Add BRC20 cache memory estimate based on config's lru_cache_size
    let lru_size = config
        .ordinals_brc20_config()
        .map(|c| c.lru_cache_size)
        .unwrap_or(0);
    total_memory += lru_size as f64 * 0.02; // Estimate based on configured cache size
}
prometheus.metrics_update_memory_usage(total_memory);
Is this the best way to track memory/cache, or are there other resources we should be looking at for metrics?
Hmm I don't think this would be a good way because you'll only be measuring memory on this thread IIUC... There should be a way to sample the VM's memory use but that should already be covered by the grafana graphs (is that right @CharlieC3 ?)
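If process-level sampling were still wanted in the indexer itself, one rough sketch (assuming a Linux deployment and a plain prometheus Gauge; none of this is in the PR) would be to read the process's resident set size rather than estimating per-cache byte counts:

```rust
use prometheus::Gauge;

/// Read the process's resident set size (VmRSS) in bytes from /proc/self/status.
/// Linux-only; returns None if the file or field is unavailable.
fn current_rss_bytes() -> Option<f64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    // Line format: "VmRSS:    123456 kB"
    let kb: f64 = line.split_whitespace().nth(1)?.parse().ok()?;
    Some(kb * 1024.0)
}

/// Hypothetical helper: update a process-memory gauge once per block.
fn update_memory_gauge(gauge: &Gauge) {
    if let Some(rss) = current_rss_bytes() {
        gauge.set(rss);
    }
}
```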
components/runes/src/db/index.rs (Outdated)
// TODO: is there no way to have processing errors as we have on ordinals?
// if so, delete the processing errors metrics
TBD before the review is finalized
prometheus
    .metrics_record_block_processing_time(process_start_time.elapsed().as_millis() as f64);
This measurement should occur once per block so we can measure it both by block (with a label for block height or something) and as a histogram to check stats on overall block processing
I reviewed the Prometheus documentation, and it seems that using labels for each block_height isn't recommended, as it would create a new time series for every block, leading to high cardinality issues. Unless I'm misunderstanding something, this approach isn't feasible for Prometheus:
Each labelset is an additional time series that has RAM, CPU, disk, and network costs. Usually the overhead is negligible, but in scenarios with lots of metrics and hundreds of labelsets across hundreds of servers, this can add up quickly.
As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.
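One way to reconcile both points, sketched below with the prometheus crate (metric names and buckets are assumptions): record the per-block duration in an unlabeled histogram to get the distribution, and track the latest processed height in a single gauge instead of a label, so cardinality stays constant.

```rust
use prometheus::{Histogram, HistogramOpts, IntGauge, Registry};

fn register_block_metrics(registry: &Registry) -> prometheus::Result<(Histogram, IntGauge)> {
    // One time series total, regardless of how many blocks are processed.
    let block_processing_time = Histogram::with_opts(
        HistogramOpts::new("block_processing_time_ms", "Per-block processing time")
            .buckets(vec![50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0, 10000.0]),
    )?;
    let last_processed_height =
        IntGauge::new("last_processed_block_height", "Height of the last processed block")?;
    registry.register(Box::new(block_processing_time.clone()))?;
    registry.register(Box::new(last_processed_height.clone()))?;
    Ok((block_processing_time, last_processed_height))
}

// Per block:
// block_processing_time.observe(process_start_time.elapsed().as_millis() as f64);
// last_processed_height.set(block_height as i64);
```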
import { FastifyInstance, FastifyReply, FastifyRequest } from 'fastify';
import { ApiMetrics } from '../../metrics/metrics';

export function registerMetricsMiddleware(fastify: FastifyInstance, metrics: ApiMetrics) {
Hey Alin, apologies, I should've explained this yesterday. The prom-client already takes care of per-request timing measurements and all kinds of default stats, so you shouldn't worry about adding these to your metrics.
The same goes for DB queries, because those get measured by another tool we use called PgHero.
You should only focus on making sure the Prometheus metrics are being applied correctly to this API and the Runes API via the use of prom-client.
Ordinals
Health Metrics
update_chain_tip_distance
record_processing_error
Performance Metrics
record_block_processing_time
record_inscription_parsing_time
record_ordinal_computation_time
record_db_write_time
Volumetric Metrics
record_inscriptions_in_block
record_brc20_operations_in_block
BRC-20 Specific Metrics
record_brc20_deploy
record_brc20_mint
record_brc20_transfer
record_brc20_transfer_send
Runes
Health Metrics
update_chain_tip_distance
record_processing_error
Performance Metrics
record_block_processing_time
record_rune_parsing_time
record_rune_computation_time
record_db_write_time
Volumetric Metrics
record_runes_in_block
record_rune_operations_in_block
Runes Specific Metrics
record_rune_etching
record_rune_mint
record_rune_transfer
record_rune_burn
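For illustration only (the method bodies aren't shown here, and field names are assumptions), recording helpers like the ones listed above would presumably wrap counter increments, gauge updates, and histogram observations along these lines:

```rust
use prometheus::{Histogram, IntCounter, IntGauge};

// Hypothetical subset of the Runes metrics struct.
pub struct RunesMetrics {
    pub chain_tip_distance: IntGauge,
    pub block_processing_time: Histogram,
    pub rune_mints: IntCounter,
}

impl RunesMetrics {
    pub fn update_chain_tip_distance(&self, distance: u64) {
        self.chain_tip_distance.set(distance as i64);
    }

    pub fn record_block_processing_time(&self, millis: f64) {
        self.block_processing_time.observe(millis);
    }

    pub fn record_rune_mint(&self) {
        self.rune_mints.inc();
    }
}
```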