Improvement/s3 c 9769/ignore trailing checksum #5751
Conversation
Hello fredmnl, my role is to assist you with the merge of this pull request.

Status report is not available.

Incorrect Jira project: the Jira issue S3C-9769 specified in the source …
Force-pushed from 76addf8 to bc68eb3
@jonathan-gramain Thanks again for the review, I reworked the whole class to address your concerns. I thought that my tests were covering all of the tricky splits that could happen, but the HTTP request module I used was actually squashing the chunks together before sending. I resorted to writing unit tests for the class, which is much neater. I might have overdone it, but they run quickly so we could keep them all. I reworked the logic in a way which I think is much neater. I'm only detecting … I think that eventually we will want to refactor V4Transform to integrate this as well as checksum checking.
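A sketch of what such a split-coverage unit test can look like. This is not the PR's actual test file: the constructor signature `(log, errCb)` is taken from the diffs below, the module path is elided, and the wire payload follows the aws-chunked format with an unsigned trailer.

```js
const assert = require('assert');
const { Writable } = require('stream');
// const TrailingChecksumTransform = require('...'); // actual path elided

const log = { debug() {}, info() {}, error() {} }; // stub logger

// Collect everything the transform emits, then hand it to `done`.
function collect(transform, done) {
    const out = [];
    transform.pipe(new Writable({
        write(chunk, encoding, cb) { out.push(chunk); cb(); },
        final(cb) { done(Buffer.concat(out)); cb(); },
    }));
}

// aws-chunked body carrying "Wiki" plus a trailing checksum line.
const wire = Buffer.from('4\r\nWiki\r\n0\r\nx-amz-checksum-crc32:AAAAAA==\r\n\r\n');

// Cut the wire bytes at every possible position so that the size field,
// the '\r\n' delimiters, and the trailer each get split across writes.
for (let cut = 1; cut < wire.length; cut++) {
    const transform = new TrailingChecksumTransform(log, () => {});
    collect(transform, out => assert.strictEqual(out.toString(), 'Wiki'));
    transform.write(wire.subarray(0, cut));
    transform.write(wire.subarray(cut));
    transform.end();
}
```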
```js
    continue;
}

const lineBreakIndex2 = chunk.indexOf('\r');
```
It looks functional to me. I still have a slight preference for finding the full sequence `\r\n` in the `chunkSizeBuffer`, because it also validates the stream contents more strongly. But it's not blocking for me, I'm fine with this approach too.
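A hedged sketch of that alternative, as it might look inside the class's `_transform` (names match the diffs in this thread; surrounding error handling elided):

```js
// Accumulate first, then look for the complete '\r\n' delimiter rather
// than '\r' alone: a lone '\r' inside the size field is then rejected
// (or left pending) instead of being treated as a terminator.
this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk]);
const delimIndex = this.chunkSizeBuffer.indexOf('\r\n');
if (delimIndex === -1) {
    // No full delimiter yet: wait for more data, after checking that the
    // accumulated buffer has not already exceeded the supported field size.
    return;
}
const sizeField = this.chunkSizeBuffer.subarray(0, delimIndex);
```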
I think I'd like to refactor both Transforms (the AuthV4 one too) into a single processor that would work on a configurable FSM to both validate and process the stream. I think it would also make the V4Transform much more readable. With that in mind, we could deliver this iteration as is (we're aiming for a timely release on this fix). There will at some point be a further iteration to validate the checksums (prioritization aside).
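One possible shape for such an FSM-based processor; everything here is hypothetical (the class name, states, and handler contract are illustrative, not from the codebase):

```js
const { Transform } = require('stream');

// Hypothetical FSM-driven stream processor: each state owns a handler
// that consumes bytes from the front of the buffer and may switch state.
const State = {
    CHUNK_SIZE: 'CHUNK_SIZE', // reading the hex size line up to '\r\n'
    CHUNK_DATA: 'CHUNK_DATA', // forwarding `dataSize` payload bytes
    TRAILER: 'TRAILER',       // reading (and eventually verifying) trailers
};

class FsmStreamProcessor extends Transform {
    constructor(handlers) {
        super({});
        this.state = State.CHUNK_SIZE;
        this.handlers = handlers; // { [state]: (processor, buf) => restBuf }
    }

    _transform(chunk, encoding, callback) {
        let rest = chunk;
        // A handler returns the bytes it did not consume; returning an
        // empty buffer (after stashing any partial field internally)
        // means "wait for the next incoming chunk".
        while (rest.length > 0) {
            rest = this.handlers[this.state](this, rest);
        }
        callback();
    }
}
```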
Force-pushed from c34a49d to dba1a83
Force-pushed from dba1a83 to 7862de5
@jonathan-gramain I think we're ready for another review here
LGTM with some minor suggestions
```js
const lineBreakIndex = chunk.indexOf('\r');
if (lineBreakIndex === -1) {
    if (this.chunkSizeBuffer.byteLength + chunk.byteLength > 10) {
        this.log.debug('chunk size field too big', {
```
Two suggestions:

- Turn this into `info` level, as it can be useful for debugging on production platforms in case their app sends chunks that are too big for us to support.
- You could first do `this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk]);`, then check that the size is below what we support. Then you can log `this.chunkSizeBuffer.subarray(0, 16).toString()` directly (see the sketch just below):
  - You can remove the hex encoding since it's already JSON-encoded in logs.
  - I think 16 chars allow seeing the full size, while 8 chars may mask the least significant digits of the hex-encoded length.
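Put together, the second suggestion reads roughly like this (a sketch, not the final diff):

```js
// Accumulate first, then bound-check, then log a 16-byte prefix of the
// accumulated buffer directly (no hex encoding needed: the logger
// JSON-encodes the string anyway).
this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk]);
if (this.chunkSizeBuffer.byteLength > 10) {
    this.log.info('chunk size field too big', {
        chunkSizeBuffer: this.chunkSizeBuffer.subarray(0, 16).toString(),
    });
    // abort the stream here (error propagation elided)
}
```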
(Also see comment below, which I wrote first.)

5GB is an AWS limit: it's forbidden to perform a single upload of more than 5GB, hence the hard specification limit here. 5GB needs 9 hex characters to be written, and we give it an extra character for `\r`.

`this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk]);` could potentially make us perform a huge buffer concatenation (although I think that Node would rechunk things before reaching this part of the code).

👍 on hex

👍 Yes, 16 is better: in case the chunkSizeBuffer is empty, we would not see the whole 10 characters that we could have been interested in.
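The arithmetic behind that 10-byte bound, as a quick check:

```js
// 5GB (here taken as 5 GiB) is the AWS single-upload limit; its hex
// encoding takes 9 characters, plus one byte for the '\r' terminator,
// hence the 10-byte maximum for the size field.
const fiveGiB = 5 * 2 ** 30;
console.log(fiveGiB.toString(16));        // '140000000'
console.log(fiveGiB.toString(16).length); // 9
```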
```js
}
const dataSize = Number.parseInt(this.chunkSizeBuffer.toString(), 16);
if (Number.isNaN(dataSize)) {
    this.log.debug('unable to parse chunk size', {
```
This case now looks unlikely due to the sanity check, so I think you can log it at the `error` level.
Sounds good, will do 👍
Force-pushed from f8c4593 to 20011c1
Force-pushed from 20011c1 to 7b413ba
/approve
Incorrect Jira project: the Jira issue S3C-9769 specified in the source …

The following options are set: approve
First review iteration; I will need to complete my review of the `TrailingChecksumTransform` class.
```js
if (stream.headers['x-amz-trailer'] === undefined ||
    stream.headers['x-amz-trailer'] === '') {
    return stream;
```
We can just check if the value is set (both cases would return `true` here, and I believe we want to handle `null` the same way).
Suggested change:

```diff
-if (stream.headers['x-amz-trailer'] === undefined ||
-    stream.headers['x-amz-trailer'] === '') {
+if (!stream.headers['x-amz-trailer']) {
     return stream;
```
```js
if (headers['x-amz-trailer'] !== undefined &&
    headers['x-amz-content-sha256'] !== 'STREAMING-UNSIGNED-PAYLOAD-TRAILER') {
    return errors.BadRequest.customizeDescription('signed trailing checksum is not supported');
```
Usually `BadRequest` (error 400) is for errors from the client side; here, I would instead return a `NotImplemented` (error code 501).
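Applied to the hunk above, the suggestion would read something like this, assuming the same `errors` utility also exposes a `NotImplemented` error:

```js
if (headers['x-amz-trailer'] !== undefined &&
    headers['x-amz-content-sha256'] !== 'STREAMING-UNSIGNED-PAYLOAD-TRAILER') {
    // 501: the request is valid per the S3 API, we just do not support it yet.
    return errors.NotImplemented
        .customizeDescription('signed trailing checksum is not supported');
}
```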
```js
constructor(log, errCb) {
    super({});
    this.log = log;
    this.errCb = errCb;
```
We store this error callback in the class, which is a bit strange: passing such an error function as a member of a class makes it complex to know when it will be called, and what the impacts on the call path for the API are. But the main point is that this callback is not used anywhere: either we expect no error and we should remove this callback, or we can have errors (handling of the `Transform` events?) and we need to use it somehow...
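For illustration, a Transform can surface its failures through the stream itself instead of a stored callback; `parseChunk` below is a hypothetical helper standing in for the real parsing logic:

```js
const { Transform } = require('stream');

class ExampleTransform extends Transform {
    _transform(chunk, encoding, callback) {
        const err = this.parseChunk(chunk); // hypothetical parsing helper
        if (err) {
            // Passing an error to callback() emits 'error' on the stream,
            // so consumers handle it where the pipeline is wired up and no
            // errCb member is needed.
            return callback(err);
        }
        return callback(null, chunk);
    }
}
```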
```js
    return stream;
}

const trailingChecksumTransform = new TrailingChecksumTransform(log, errCb);
```
We should not pass a callback for error handling here: we just pipe two streams, and then we can handle the errors in the same place as before.

Note: piping streams has, in the past, led to memory leaks. We must ensure all errors/events are properly handled so as not to miss something here. See here for an example.
```js
this.log.info('chunk size is not a valid hex number', {
    chunkSizeBuffer: this.chunkSizeBuffer.toString(),
});
```
Suggested change:

```diff
-this.log.info('chunk size is not a valid hex number', {
+this.log.error('chunk size is not a valid hex number', {
     chunkSizeBuffer: this.chunkSizeBuffer.toString(),
 });
```
```js
this.log.info('chunk size field too big', {
    chunkSizeBuffer: this.chunkSizeBuffer.toString(),
    truncatedChunk: chunk.subarray(0, 16).toString(),
```
Suggested change:

```diff
-this.log.info('chunk size field too big', {
+this.log.error('chunk size field too big', {
     chunkSizeBuffer: this.chunkSizeBuffer.toString(),
     truncatedChunk: chunk.subarray(0, 16).toString(),
```
```js
}

this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk.subarray(0, lineBreakIndex)]);
chunk = chunk.subarray(lineBreakIndex);
```
I'm not sure this is correct (I may be wrong): the delimiter here is `\r`, but don't we have `\r\n` at the end of each chunk? In that case, should we do something like:

```js
if (lineBreakIndex === -1 || lineBreakIndex + 1 >= chunk.length ||
    chunk[lineBreakIndex + 1] !== '\n'.charCodeAt(0)) {
    // Handle error or wait for more data
}
this.chunkSizeBuffer = Buffer.concat([this.chunkSizeBuffer, chunk.subarray(0, lineBreakIndex)]);
chunk = chunk.subarray(lineBreakIndex + 2);
```
Closing this PR in favor of #5757
Motivation
Since aws-cli 2.23 (and the associated SDK versions), `put-object` now defaults to using an unsigned payload with trailing checksums for integrity.
This is particularly nice from a computational point of view, because the client only needs to read their file once to perform the upload. With an AuthV4 signed payload, you would need to first compute the checksum of the object in order to include it in the signature contained in the header. With trailing checksums (and an unsigned payload), the checksum is computed while the upload is happening and is sent at the end of the body.
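For context, the body of such an upload is aws-chunked with the checksum carried in a trailer. Roughly (header names per AWS's STREAMING-UNSIGNED-PAYLOAD-TRAILER scheme; the digest value is a placeholder):

```
PUT /bucket/key HTTP/1.1
content-encoding: aws-chunked
x-amz-content-sha256: STREAMING-UNSIGNED-PAYLOAD-TRAILER
x-amz-trailer: x-amz-checksum-crc32
x-amz-decoded-content-length: 17

11\r\n
Hello trailing...\r\n
0\r\n
x-amz-checksum-crc32:<base64 digest>\r\n
\r\n
```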
Implementation
In this first iteration we are only interested in accepting incoming requests that use trailing checksums with an unsigned payload. We are not yet verifying the checksum, although the current proposal is extensible to perform this verification as well.
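To make the idea concrete, here is a condensed, hypothetical sketch of what accepting and ignoring the trailing checksum amounts to. It is simplified relative to the PR's TrailingChecksumTransform (no size-field bound, no `\r\n` validation) and only illustrates the state handling:

```js
const { Transform } = require('stream');

// Pass chunk payloads through untouched and silently drop the trailer
// lines that follow the final zero-length chunk.
class IgnoreTrailingChecksum extends Transform {
    constructor() {
        super();
        this.sizeBuffer = Buffer.alloc(0); // partial size line across writes
        this.remaining = 0;                // payload bytes left in current chunk
        this.done = false;                 // saw the terminal zero-size chunk
    }

    _transform(chunk, encoding, callback) {
        while (chunk.length > 0 && !this.done) {
            if (this.remaining > 0) {
                // Forward payload bytes untouched.
                const data = chunk.subarray(0, this.remaining);
                this.push(data);
                this.remaining -= data.length;
                chunk = chunk.subarray(data.length);
                continue;
            }
            const nl = chunk.indexOf('\n');
            if (nl === -1) {
                // Size line split across incoming writes: keep the prefix.
                this.sizeBuffer = Buffer.concat([this.sizeBuffer, chunk]);
                break;
            }
            const line = Buffer.concat([this.sizeBuffer, chunk.subarray(0, nl)])
                .toString().trim();
            this.sizeBuffer = Buffer.alloc(0);
            chunk = chunk.subarray(nl + 1);
            if (line === '') {
                continue; // blank line between a payload block and the next size
            }
            const size = Number.parseInt(line, 16);
            if (Number.isNaN(size)) {
                return callback(new Error('invalid chunk size'));
            }
            if (size === 0) {
                this.done = true; // everything after this is trailer: drop it
            }
            this.remaining = size;
        }
        return callback();
    }
}
```

A real implementation would additionally bound the size-line length and validate the `\r\n` delimiters, as discussed in the review threads above.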