-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: reduce syncv2 conn pool usage and add pool utilization metrics #6596
base: develop
Are you sure you want to change the base?
Conversation
Database.WithConnection is always used by syncv2 for any probe/sync operations against peers, but the actual connection it acquires from the pool is not actually used sometimes, when all the necessary information can be retrieved from the FPTree. Also, while holding the connection for the duration of sync/probe works best in absolute majority of cases, in case of a very slow peer it may cause unwanted extra utilization of the connection pool. Another problem is that there's currently no metrics for the actual database connection pool utilization, and connection pool wait latency metrics for the main state database pool, api pool and the local database are not collected separately. This change makes `Database.WithConnection` acquire the connection from the pool lazily, that is, upon the first query, and release it back if there are no queries executed for certain amount of time (defaults to 10ms), which will prevent slow syncv2 peers from holding connections for too long. In the latter case, the connection is re-acquired from the pool upon the following query. Also, this change adds connection pool utilization metric, and separates the connection wait latency metrics and pool utilization metrics for th main state database pool, api pool and the local database.
bec23c3
to
8496629
Compare
Codecov ReportAttention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files@@ Coverage Diff @@
## develop #6596 +/- ##
=======================================
Coverage 79.8% 79.9%
=======================================
Files 356 356
Lines 47339 47417 +78
=======================================
+ Hits 37820 37902 +82
+ Misses 7376 7374 -2
+ Partials 2143 2141 -2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newLazyCon
seems to me quite convoluted. Why does sync2
call WithDBCon
when it doesn't need a DB connection, such that WithDBCon
then needs additional complexity with the lazy fetching and time based release of the connection?
if conn != nil && db.connWaitLatency != nil { | ||
db.connWaitLatency.Observe(time.Since(start).Seconds()) | ||
db.poolUsage.Inc() | ||
} | ||
return conn | ||
} | ||
|
||
func (db *sqliteDatabase) putConn(conn *sqlite.Conn) { | ||
db.pool.Put(conn) | ||
if db.poolUsage != nil { | ||
db.poolUsage.Dec() | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having to check those fields against nil
how about setting a default metric name? The only thing this would impact is tests. There we don't read the metrics but collecting them might again affect timings or cause races that we won't detect if we are not using them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also sql.Version()
which does not receive any config yet it uses an sqliteDatabase
. I've added default database name but if we remove these checks we'll be getting unneeded db
metrics entries from a single sql.Version()
call early during startup.
func (db *sqliteDatabase) putConn(conn *sqlite.Conn) { | ||
db.pool.Put(conn) | ||
if db.poolUsage != nil { | ||
db.poolUsage.Dec() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this poolUsage
metric cause a lot of data to be generated? Aren't connections fetched and put back into the pool basically every second at the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a gauge. Inc
/ Dec
operations are trivial: https://github.com/prometheus/client_golang/blob/v1.20.5/prometheus/gauge.go#L114-L130
numConns := func() int { | ||
return int(testutil.ToFloat64(PoolUsage.WithLabelValues(dbName))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be inlined, if the type check of require.Equal
is the reason for this function you can also use require.EqualValues
which accepts float64(1) == int(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done b/c ToFloat64
will always return integer values as float64
here, but if we treat them as such we get a linter error:
./bin/golangci-lint run --config .golangci.yml
sql/database_test.go:672:6: float-compare: use require.InEpsilon (or InDelta) (testifylint)
require.EqualValues(t, 1, testutil.ToFloat64(PoolUsage.WithLabelValues(dbName)))
^
sql/database_test.go:677:5: float-compare: use require.InEpsilon (or InDelta) (testifylint)
require.EqualValues(t, 1, testutil.ToFloat64(PoolUsage.WithLabelValues(dbName)))
^
There's no sense in using require.InEpsilon
/ require.InDelta
here, and silencing the linter warning explicitly is not much better than numConns
func, which also makes the intent of reading the metric more clear.
sql/database.go
Outdated
defer func() { | ||
c.timer.Reset(c.db.connIdleTimeout) | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does c
, c.timer
, c.db
or c.db.connIdleTimeout
change between this line and exec
returning? If not why not just
defer func() { | |
c.timer.Reset(c.db.connIdleTimeout) | |
}() | |
defer c.timer.Reset(c.db.connIdleTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
sql/database.go
Outdated
conCtx, cancel := context.WithCancel(ctx) | ||
defer cancel() | ||
conn := db.getConn(conCtx) | ||
if conn == nil { | ||
return ErrNoConnection | ||
} | ||
defer db.pool.Put(conn) | ||
return exec(&sqliteConn{queryCache: db.queryCache, db: db, conn: conn}) | ||
c := newLazyConn(conCtx, db) | ||
defer func() { | ||
cancel() | ||
c.release() | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the anonymous function?
conCtx, cancel := context.WithCancel(ctx)
defer cancel()
c := newLazyConn(conCtx, db)
defer c.release()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
sql/database.go
Outdated
} | ||
} | ||
}) | ||
c.doneCh = make(chan struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer if getCon
inits the timer and the doneCh
and releaseCon
stops the timer, sets it to nil and also closes the doneCh
instead of spreading this across 3 methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which getConn
? c.getConn
is a function field that's needed so that we don't have to store the context and that's it sole job. sqliteDatabase.getConn
probably should not be dealing with doneCh
inside the lazyConn
.
I don't think that re-creating the timer makes much sense, it may add race with the background goroutine. We could also restart the goroutine, but that's also extra hassle.
And doneCh
is only used in the background goroutine in ensureConn
and in release
where it tells the goroutine to stop when we don't need this particular lazyConn
anymore.
When a sync/probe operation against a peer begins, DBSet.WithCopy method is used to create a temporary shallow (very fast) copy of a As noted in the PR description, this approach has its downsides, namely that sometimes the allocated connections are not actually used (which is impossible to predict at the point in which |
Motivation
Database.WithConnection
is always used by syncv2 for any probe/sync operations against peers, but the actual connection it acquires from the pool is not actually used sometimes, when all the necessary information can be retrieved from theFPTree
. Also, while holding the connection for the duration of sync/probe works best in absolute majority of cases, in case of a very slow peer it may cause unwanted extra utilization of the connection pool.Another problem is that there's currently no metrics for the actual database connection pool utilization, and connection pool wait latency metrics for the main state database pool, api pool and the local database are not collected separately.
Description
This change makes
Database.WithConnection
acquire the connection from the pool lazily, that is, upon the first query, and release it back if there are no queries executed for certain amount of time (defaults to 10ms), which will prevent slow syncv2 peers from holding connections for too long. In the latter case, the connection is re-acquired from the pool upon the following query.Also, this change adds connection pool utilization metric, and separates the connection wait latency metrics and pool utilization metrics for th main state database pool, api pool and the local database.
Test Plan
Verify on a mainnet node.