Redshift batch inserts using COPY FROM operation #25866
Conversation
@ebyhr Added requested documentation. We need the ENV var set on the repo for the tests to pass.
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
ConnectorPageSinkId pageSinkId,
TrinoFileSystemFactory fileSystemFactory,
TypeOperators typeOperators,
Location copyLocationWithPrefix,
How about renaming this to copyLocation?
return Optional.ofNullable(batchedInsertsCopyIamRole);
}

@Config("redshift.batched-inserts-copy-iam-role")
Do we need a config toggle to control whether COPY FROM is enabled, or do we only control it at the session level?
Use the `batched_inserts_copy_enabled` [catalog session property](/sql/set-session) to enable or disable the use of COPY FROM for batched inserts in the current session.
We should mention that the current implementation sets the value to true by default.
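For illustration, toggling the property from SQL might look like the following; the catalog name `redshift` is an assumption, while the property name comes from this PR:

```sql
-- Catalog name "redshift" is assumed; batched_inserts_copy_enabled is the
-- session property added by this PR (defaults to true per the review note above)
SET SESSION redshift.batched_inserts_copy_enabled = false;
```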
Description
Fixes #24546
This PR aims to allow the use of `COPY FROM` statements when sinking data into Redshift. The Redshift connector inherits `BaseJdbcConnector`, which uses batched INSERT statements to execute sink operations. Even in non-transactional mode, this can only push about 1,000 rows per second. This change stages the rows to a Parquet file first, then issues a `COPY FROM` statement to load the table. We are seeing 250K rows per second or more using this method. This has been running in production on our own branch for over two months.
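For context, a Redshift COPY statement loading staged Parquet files from S3 looks roughly like this; the table name, staging prefix, and role ARN are placeholders, and the exact statement this PR generates may differ:

```sql
-- Illustrative sketch only; all identifiers below are placeholders
COPY sales_target
FROM 's3://my-staging-bucket/trino-copy/query-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
```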
This functionality needs to be enabled by specifying the following config option:
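The original snippet is not preserved here; a hedged sketch, assuming the enable flag mirrors the `batched_inserts_copy_enabled` session property:

```properties
# Assumed property name, mirroring the batched_inserts_copy_enabled session property
redshift.batched-inserts-copy-enabled=true
```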
The following options are also required when specifying the above:
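Again the original snippet is missing; `redshift.batched-inserts-copy-iam-role` appears verbatim in this PR's diff, while the staging-location property name and both values below are assumptions:

```properties
redshift.batched-inserts-copy-iam-role=arn:aws:iam::123456789012:role/redshift-copy-role
# Assumed property name for the S3 staging location
redshift.batched-inserts-copy-location=s3://my-staging-bucket/trino-copy/
```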
A suggested IAM policy for this role and user:
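The author's suggested policy is not preserved; below is a minimal sketch granting the staging-bucket access COPY typically needs, with a placeholder bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StagingBucketList",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-staging-bucket"
    },
    {
      "Sid": "StagingObjectReadWrite",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-staging-bucket/*"
    }
  ]
}
```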
Additional context and related issues
Tests require a new `REDSHIFT_S3_COPY_ROOT` environment variable, similar to `REDSHIFT_S3_UNLOAD_ROOT`.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: