Description
First, thank you for maintaining the pg library. I'm having an issue dealing with large field values that I'd like to get community guidance on.
Problem Description
When working with PostgreSQL tables that contain very large field values (text columns with hundreds of MB or more), we face a risk of out-of-memory (OOM) errors. Even when using pagination techniques like cursors, the library loads entire rows into memory, which becomes problematic with extremely large fields.
Our ideal solution would be a way to set a maximum threshold for query result size and automatically abort when exceeded, preventing memory issues.
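For context, the closest partial workaround I have found is to ask the server for the field sizes before transferring the data, using octet_length(), and to skip or truncate oversized values. Below is a minimal sketch of that idea, assuming the memory_test table from the reproduction case further down; fetchRowSafely and the 10 MB cap are illustrative names and values, and only one column is checked for brevity:

// Sketch: pre-check field sizes server-side before selecting the data.
// Assumes the memory_test table from the reproduction case below;
// MAX_FIELD_BYTES and fetchRowSafely are illustrative.
const MAX_FIELD_BYTES = 10 * 1024 * 1024 // arbitrary 10 MB cap per field

async function fetchRowSafely(pool, id) {
  // Ask PostgreSQL how large the field is without transferring it
  const sizes = await pool.query(
    'SELECT octet_length(large_data1) AS len FROM memory_test WHERE id = $1',
    [id]
  )
  const len = sizes.rows[0] ? sizes.rows[0].len : 0
  if (len > MAX_FIELD_BYTES) {
    // Fetch only a small prefix instead of the full value
    return pool.query(
      'SELECT id, name, left(large_data1, 1024) AS large_data1_preview FROM memory_test WHERE id = $1',
      [id]
    )
  }
  return pool.query('SELECT * FROM memory_test WHERE id = $1', [id])
}

This only mitigates the problem (it costs an extra round trip and cannot enforce anything at the protocol level), which is why a library-level cap would be preferable.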
Reproduction Case
Here's a minimal script demonstrating the issue:
const pg = require('pg')
const Cursor = require('pg-cursor')

async function demonstrateMemoryIssue() {
  // Create pool
  const pool = new pg.Pool({
    connectionString: 'postgresql://postgres:postgres@localhost:5432/postgres',
  })
  try {
    // Create test table and insert a large row
    await pool.query('DROP TABLE IF EXISTS memory_test')
    await pool.query(`
      CREATE TABLE memory_test (
        id SERIAL PRIMARY KEY,
        name TEXT,
        large_data1 TEXT,
        large_data2 TEXT,
        large_data3 TEXT,
        large_data4 TEXT,
        large_data5 TEXT,
        large_data6 TEXT,
        large_data7 TEXT,
        large_data8 TEXT,
        large_data9 TEXT,
        large_data10 TEXT,
        large_data11 TEXT,
        large_data12 TEXT,
        large_data13 TEXT,
        large_data14 TEXT,
        large_data15 TEXT,
        large_data16 TEXT,
        large_data17 TEXT,
        large_data18 TEXT,
        large_data19 TEXT,
        large_data20 TEXT
      )
    `)

    // Insert a single row with 20 columns of 500MB each (total 10GB)
    console.log('Inserting large row...')
    await pool.query(`
      INSERT INTO memory_test (
        name,
        large_data1, large_data2, large_data3, large_data4, large_data5,
        large_data6, large_data7, large_data8, large_data9, large_data10,
        large_data11, large_data12, large_data13, large_data14, large_data15,
        large_data16, large_data17, large_data18, large_data19, large_data20
      )
      VALUES (
        'Ultra Large Row',
        repeat('A', 500 * 1024 * 1024), repeat('B', 500 * 1024 * 1024),
        repeat('C', 500 * 1024 * 1024), repeat('D', 500 * 1024 * 1024),
        repeat('E', 500 * 1024 * 1024), repeat('F', 500 * 1024 * 1024),
        repeat('G', 500 * 1024 * 1024), repeat('H', 500 * 1024 * 1024),
        repeat('I', 500 * 1024 * 1024), repeat('J', 500 * 1024 * 1024),
        repeat('K', 500 * 1024 * 1024), repeat('L', 500 * 1024 * 1024),
        repeat('M', 500 * 1024 * 1024), repeat('N', 500 * 1024 * 1024),
        repeat('O', 500 * 1024 * 1024), repeat('P', 500 * 1024 * 1024),
        repeat('Q', 500 * 1024 * 1024), repeat('R', 500 * 1024 * 1024),
        repeat('S', 500 * 1024 * 1024), repeat('T', 500 * 1024 * 1024)
      )
    `)

    console.log('Fetching all rows...')
    // Even with a cursor, this will cause memory issues
    const client = await pool.connect()

    // Try with cursor approach
    const cursor = client.query(new Cursor('SELECT * FROM memory_test'))

    // Read in small batches (still causes memory issues)
    cursor.read(10, (err, rows) => {
      if (err) {
        console.error('Error reading rows:', err)
        process.exit(1)
      } else {
        console.log(`Retrieved ${rows.length} rows`)
        console.log(rows)
        process.exit(0)
      }
    })
  } catch (error) {
    console.error('Error:', error)
  }
}

demonstrateMemoryIssue()

// run with: /usr/bin/time -l node script.js
Attempted Solutions
I've tried:
- Using cursors with small batch sizes, but as the reproduction shows, cursors work row by row, and a single row can be gigabytes in size, so this does not prevent the program from running out of memory
- Plugging into pg events to keep track of the current query buffer size (I couldn't find a way to do this; see the sketch after this list)
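For reference, the event-style API I ended up looking at is the submittable pg.Query, which emits a 'row' event per row. Here is a minimal sketch (connection string and size estimate are illustrative); it still does not solve the problem, because each 'row' event only fires after the whole row has already been parsed into memory:

// Sketch: row-by-row events via a submittable pg.Query.
// Each 'row' event delivers a fully parsed row, so a single
// multi-gigabyte row is already in memory before the event fires.
const { Client, Query } = require('pg')

async function streamRows() {
  const client = new Client({
    connectionString: 'postgresql://postgres:postgres@localhost:5432/postgres',
  })
  await client.connect()

  const query = client.query(new Query('SELECT * FROM memory_test'))

  query.on('row', (row) => {
    // Rough per-row size estimate; at this point the row is already
    // allocated, so this can only react after the fact.
    const approxBytes = Object.values(row).reduce((n, v) => n + String(v).length, 0)
    console.log(`row of roughly ${approxBytes} bytes`)
  })
  query.on('error', (err) => console.error('Query error:', err))
  query.on('end', () => client.end())
}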
Questions
- Is there a recommended approach to cap the maximum amount of data returned by a query, so the caller can keep any query result below a specific memory limit and abort gracefully before it triggers an OOM error?
Thank you for taking the time to answer my question and point me in the right direction.
Edit: After searching in older issues, I discovered that this issue is actually a duplicate of: #2336.
In addition to what was mentioned there, my use case involves an application where the server running the query and the queried database belong to different actors.
If there is no newer known way to handle this, and if the community is still open to reviewing contributions along those lines, I could try to develop a solution for this use case by integrating at the pool connection level, as mentioned in the comments there.
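Purely to illustrate the direction I have in mind (nothing below exists in pg today; maxResultSize and the error behaviour are hypothetical), the caller-facing side of such a contribution could look roughly like this:

// Hypothetical API sketch only: maxResultSize is NOT an existing pg option.
const pg = require('pg')

async function main() {
  const pool = new pg.Pool({
    connectionString: 'postgresql://postgres:postgres@localhost:5432/postgres',
    maxResultSize: 100 * 1024 * 1024, // hypothetical: abort once the result exceeds 100 MB
  })

  try {
    await pool.query('SELECT * FROM memory_test')
  } catch (err) {
    // Instead of the process running out of memory, the connection would be
    // cleaned up and a dedicated error surfaced to the caller.
    console.error(err.message)
  } finally {
    await pool.end()
  }
}

main()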