Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: reduce syncv2 conn pool usage and add pool utilization metrics #6596

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

ivan4th
Copy link
Contributor

@ivan4th ivan4th commented Jan 8, 2025

Motivation

Database.WithConnection is always used by syncv2 for any probe/sync operations against peers, but the actual connection it acquires from the pool is not actually used sometimes, when all the necessary information can be retrieved from the FPTree. Also, while holding the connection for the duration of sync/probe works best in absolute majority of cases, in case of a very slow peer it may cause unwanted extra utilization of the connection pool.

Another problem is that there's currently no metrics for the actual database connection pool utilization, and connection pool wait latency metrics for the main state database pool, api pool and the local database are not collected separately.

Description

This change makes Database.WithConnection acquire the connection from the pool lazily, that is, upon the first query, and release it back if there are no queries executed for certain amount of time (defaults to 10ms), which will prevent slow syncv2 peers from holding connections for too long. In the latter case, the connection is re-acquired from the pool upon the following query.

Also, this change adds connection pool utilization metric, and separates the connection wait latency metrics and pool utilization metrics for th main state database pool, api pool and the local database.

Test Plan

Verify on a mainnet node.

Database.WithConnection is always used by syncv2 for any probe/sync
operations against peers, but the actual connection it acquires from
the pool is not actually used sometimes, when all the necessary
information can be retrieved from the FPTree. Also, while holding the
connection for the duration of sync/probe works best in absolute
majority of cases, in case of a very slow peer it may cause unwanted
extra utilization of the connection pool.

Another problem is that there's currently no metrics for the actual
database connection pool utilization, and connection pool wait latency
metrics for the main state database pool, api pool and the local
database are not collected separately.

This change makes `Database.WithConnection` acquire the connection
from the pool lazily, that is, upon the first query, and release it
back if there are no queries executed for certain amount of
time (defaults to 10ms), which will prevent slow syncv2 peers from
holding connections for too long. In the latter case, the connection
is re-acquired from the pool upon the following query.

Also, this change adds connection pool utilization metric, and
separates the connection wait latency metrics and pool utilization
metrics for th main state database pool, api pool and the local
database.
@ivan4th ivan4th force-pushed the sql/release-conns branch from bec23c3 to 8496629 Compare January 8, 2025 16:37
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.9%. Comparing base (cdee34f) to head (f31ab62).
Report is 2 commits behind head on develop.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
sql/database.go 98.8% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6596   +/-   ##
=======================================
  Coverage     79.8%   79.9%           
=======================================
  Files          356     356           
  Lines        47339   47417   +78     
=======================================
+ Hits         37820   37902   +82     
+ Misses        7376    7374    -2     
+ Partials      2143    2141    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@fasmat fasmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newLazyCon seems to me quite convoluted. Why does sync2 call WithDBCon when it doesn't need a DB connection, such that WithDBCon then needs additional complexity with the lazy fetching and time based release of the connection?

Comment on lines +694 to +707
if conn != nil && db.connWaitLatency != nil {
db.connWaitLatency.Observe(time.Since(start).Seconds())
db.poolUsage.Inc()
}
return conn
}

func (db *sqliteDatabase) putConn(conn *sqlite.Conn) {
db.pool.Put(conn)
if db.poolUsage != nil {
db.poolUsage.Dec()
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having to check those fields against nil how about setting a default metric name? The only thing this would impact is tests. There we don't read the metrics but collecting them might again affect timings or cause races that we won't detect if we are not using them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also sql.Version() which does not receive any config yet it uses an sqliteDatabase. I've added default database name but if we remove these checks we'll be getting unneeded db metrics entries from a single sql.Version() call early during startup.

func (db *sqliteDatabase) putConn(conn *sqlite.Conn) {
db.pool.Put(conn)
if db.poolUsage != nil {
db.poolUsage.Dec()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this poolUsage metric cause a lot of data to be generated? Aren't connections fetched and put back into the pool basically every second at the moment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +666 to +668
numConns := func() int {
return int(testutil.ToFloat64(PoolUsage.WithLabelValues(dbName)))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be inlined, if the type check of require.Equal is the reason for this function you can also use require.EqualValues which accepts float64(1) == int(1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done b/c ToFloat64 will always return integer values as float64 here, but if we treat them as such we get a linter error:

./bin/golangci-lint run --config .golangci.yml
sql/database_test.go:672:6: float-compare: use require.InEpsilon (or InDelta) (testifylint)
					require.EqualValues(t, 1, testutil.ToFloat64(PoolUsage.WithLabelValues(dbName)))
					^
sql/database_test.go:677:5: float-compare: use require.InEpsilon (or InDelta) (testifylint)
				require.EqualValues(t, 1, testutil.ToFloat64(PoolUsage.WithLabelValues(dbName)))
				^

There's no sense in using require.InEpsilon / require.InDelta here, and silencing the linter warning explicitly is not much better than numConns func, which also makes the intent of reading the metric more clear.

sql/database.go Outdated
Comment on lines 1278 to 1280
defer func() {
c.timer.Reset(c.db.connIdleTimeout)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does c, c.timer, c.db or c.db.connIdleTimeout change between this line and exec returning? If not why not just

Suggested change
defer func() {
c.timer.Reset(c.db.connIdleTimeout)
}()
defer c.timer.Reset(c.db.connIdleTimeout)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

sql/database.go Outdated
Comment on lines 857 to 862
conCtx, cancel := context.WithCancel(ctx)
defer cancel()
conn := db.getConn(conCtx)
if conn == nil {
return ErrNoConnection
}
defer db.pool.Put(conn)
return exec(&sqliteConn{queryCache: db.queryCache, db: db, conn: conn})
c := newLazyConn(conCtx, db)
defer func() {
cancel()
c.release()
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the anonymous function?

conCtx, cancel := context.WithCancel(ctx)
defer cancel()
c := newLazyConn(conCtx, db)
defer c.release()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

sql/database.go Outdated
}
}
})
c.doneCh = make(chan struct{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be clearer if getCon inits the timer and the doneCh and releaseCon stops the timer, sets it to nil and also closes the doneCh instead of spreading this across 3 methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which getConn? c.getConn is a function field that's needed so that we don't have to store the context and that's it sole job. sqliteDatabase.getConn probably should not be dealing with doneCh inside the lazyConn.
I don't think that re-creating the timer makes much sense, it may add race with the background goroutine. We could also restart the goroutine, but that's also extra hassle.
And doneCh is only used in the background goroutine in ensureConn and in release where it tells the goroutine to stop when we don't need this particular lazyConn anymore.

@ivan4th
Copy link
Contributor Author

ivan4th commented Jan 8, 2025

newLazyCon seems to me quite convoluted. Why does sync2 call WithDBCon when it doesn't need a DB connection, such that WithDBCon then needs additional complexity with the lazy fetching and time based release of the connection?

When a sync/probe operation against a peer begins, DBSet.WithCopy method is used to create a temporary shallow (very fast) copy of a DBSet that only lives for the duration of execution of the callback passed to DBSet.WithCopy. This copy of DBSet may or may not need to execute database queries during sync/probe and this cannot be known ahead of time before the actual sync interaction begins. To provide the request-scoped DBSet clone with a database connection, DBSet.WithCopy uses Database.WithConnection. This needs to be done b/c if we just pass the database itself as the Executor, a connection will be retrieved from the pool before executing each SQL query, and put back to the pool after each query, and sometimes syncv2 needs to do tens of thousands of trivial queries, and SQLite pool latency adds up to substantial delays in this case, even if very few connections are used. We could also use a read transaction, but this would add extra burden b/c long-running read transactions prevent SQLite from doing WAL checkpointing, leading to unneeded growth of the WAL file, and syncv2 has a rowid-based "snapshotting" mechanism which makes explicit read transactions unnecessary. So, WithConnection is like "WithTransaction without the actual transaction".

As noted in the PR description, this approach has its downsides, namely that sometimes the allocated connections are not actually used (which is impossible to predict at the point in which WithConnection is called) and also that a slow peer may hold a connection for too long, making it more likely for the pool to be exhausted. In theory, the logic to alleviate these issues, namely, lazy conn acquisition and putting idle conns back to the pool could be done on the syncv2 side, but I think that the resulting code will be much more convoluted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants