
feat: Result set streaming #702

Draft · wants to merge 19 commits into main
Conversation

BobdenOs (Contributor)

Result set streaming

Currently it is possible to stream Readable objects into INSERT queries. This enables fast bulk inserts, but it is often also required to serve large result sets, for example for download or Excel export requests. Currently the biggest restriction for these endpoints is the default limit of 1,000 results, while the default batch size for an Excel export from UI5 is 5,000 rows. This means that CAP currently receives 5x the requests for a single Excel export. The most important reason for the default limit of 1,000 is to keep the application from running out of memory. Therefore result set streaming enforces a highWaterMark, following Node.js's standard streaming implementation. While this does not fully prevent a result set stream from using more memory than the highWaterMark, it is a soft limit enforced on the stream and provides a balance between throughput and memory usage.
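
As a rough illustration of that soft limit, the sketch below shows how an object-mode Readable bounds its internal buffer; this is not the actual @sap/cds implementation, and the cursor API is hypothetical.

const { Readable } = require('stream')

// Sketch only: `cursor` stands for a hypothetical async database cursor
class ResultSetStream extends Readable {
  constructor (cursor) {
    super({ objectMode: true, highWaterMark: 64 }) // buffer at most ~64 rows
    this.cursor = cursor
  }
  async _read () {
    const row = await this.cursor.fetch() // assumed to resolve null when done
    // push() returns false once the buffer reaches the highWaterMark;
    // Node then stops calling _read() until the consumer drains rows
    this.push(row ?? null) // null ends the stream
  }
}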

Raw stream

Until now CAP has always loaded all data into JavaScript objects and processed it inside the JavaScript layer. With raw result set streaming the JSON result provided by the database connection is never parsed and is kept in a raw Buffer format. This greatly improves memory and CPU usage.

The big drawback is that the result cannot be manipulated inside the JavaScript layer. Therefore it is not possible to call after handlers on the result set. Additionally, the protocol adapter is required to be able to handle the Readable and write it correctly onto the res body stream.
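
A minimal sketch of how a protocol adapter might branch on a raw result; the names and shapes are illustrative, not the actual adapter code.

const { Readable } = require('stream')

function writeResult (result, res) {
  if (result instanceof Readable) {
    // Raw path: the JSON bytes from the database pass through untouched
    res.setHeader('content-type', 'application/json')
    result.pipe(res)
  } else {
    // Classic path: the result was parsed into JavaScript objects
    res.end(JSON.stringify(result))
  }
}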

Object stream

For the cases where it is required to modify the results using JavaScript, the Object stream can be used. Instead of loading the whole result as an Array into memory, the same results are passed through as single Objects, allowing the protocol adapters to serialize them back to JSON and write them into the res body as they are processed.

While this does not come with the CPU usage benefits of Raw streams, it still has the memory usage benefits, which can also result in reduced response times, as V8 has garbage collection optimizations for short-lived objects.
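
As an illustration, a protocol adapter could serialize an Object stream back to JSON with a Transform that holds only one row at a time; this is a sketch under assumed shapes, not the actual @sap/cds code.

const { Transform } = require('stream')

const toJsonArray = () => {
  let first = true
  return new Transform({
    writableObjectMode: true, // rows in, JSON text out
    transform (row, _enc, done) {
      done(null, (first ? (first = false, '[') : ',') + JSON.stringify(row))
    },
    flush (done) {
      done(null, first ? '[]' : ']') // close the array, or emit [] when empty
    }
  })
}
// e.g. await pipeline(objectStream, toJsonArray(), res)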

Expand streams

When using Object streams it might be the case that the root result set contains only a few rows, but each row has a large number of children, which would still all be loaded into memory and be counted as a single Object against the highWaterMark. To prevent this, it is possible to apply a recursive streaming approach that handles all expand columns as Object streams as well.

Depending on the database connection protocol it might still be required to load all results into the Readable buffer, as the order of the root and child rows is related: the child rows are interlaced between the root entries in the result set, so reading all the root entries requires loading all children into the buffer.

Usage examples

These code examples are for illustration only, as the APIs are not yet implemented in @sap/cds.

Raw stream

The most common use case for Raw result set streams are protocol adapters, where the final result has to be written directly onto the response stream.

const { pipeline } = require('stream/promises')

app.get('*', async (req, res) => {
  const result = await SELECT.from(entity) // resolves to a raw Readable of JSON bytes
  await pipeline(result, res) // propagates errors to both the database and the HTTP res stream
})

Object stream

The most common place for Object stream usage would be in custom handlers that need to modify the data or perform additional calculations.

let total = 0
for await(const row of SELECT.from(entity)) {
  total += row.amount
}

Expand stream

This is an extension of the previous use case, but with child rows.

let total = 0
for await(const row of SELECT.from(entity)) {
  // for await works on both async and sync iterators, including plain Arrays
  for await(const child of row.children) {
    total += child.amount
  }
}

PR Status

  • Raw JSON result stream
  • Object result stream
  • Expand Object result streams


BobdenOs commented Jul 9, 2024

Performance

The main target of result set streaming is to reduce memory consumption, but it is important that clock cycles are not wasted, as the ultimate metric is request throughput. When the initial result set streaming was implemented, its performance was measured (see the results below). For certain implementations the performance was unacceptable. This PR also addresses the extreme outlier's fundamental performance issues while optimizing the other implementations where possible.

Results

These are the results from the test stream.perf-test.js running on all the different database services and their drivers. The scenario inserts 100,000 rows into the Books entity of the Bookshop scenario, after which all the rows are read 20 times, either one after another or all in parallel. In the results, sync refers to single-connection performance and async refers to throughput performance.

| mode | hana-client | hdb | SQLite | Postgres |
| --- | --- | --- | --- | --- |
| **sync** | | | | |
| Raw | ~80 rows/ms | 100-150 rows/ms | 320-340 rows/ms | 100-120 rows/ms |
| Object | ~70 rows/ms | 60-80 rows/ms | 140-150 rows/ms | ~100 rows/ms |
| Array | ~80 rows/ms | 70-80 rows/ms | 140-150 rows/ms | ~80 rows/ms |
| **async** | | | | |
| Raw | 140-180 rows/ms | 270-360 rows/ms | N/A | ~500 rows/ms |
| Object | 100-110 rows/ms | 100-120 rows/ms | N/A | ~160 rows/ms |
| Array | 120-140 rows/ms | 110-120 rows/ms | 140-150 rows/ms (disk) | ~160 rows/ms |

The current standard is that all queries are processed in Array mode, which for this test consumed 1-1.5 GiB of memory in the async scenario, while Raw and Object modes consume <100 MiB of memory for both scenarios with all implementations.

The variance of hana-client is very large for the async scenario, as it is heavily impacted by the HDB_NODEJS_THREADPOOL_SIZE environment variable. The highest result achieved with hana-client in Raw mode used a thread pool size of 512, as this test was run on a machine with more than one CPU core. The performance of hana-client specifically might be greatly diminished when deployed with heavily restricted resources (e.g. 128 MiB memory and 1-2 vCPU cores). As all other drivers run inside the main thread, they won't be impacted by these limitations.
