[RFC][CELEBORN-894] Add support for end to end Integrity Checks #3062

Open
wants to merge 1 commit into
base: main

Conversation

gauravkm

@gauravkm gauravkm commented Jan 10, 2025

What changes were proposed in this pull request?

CELEBORN-894
End-to-end integrity checks reference implementation. Not looking to merge this yet; looking for high-level feedback on the approach.

Why are the changes needed?

https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0

Does this PR introduce any user-facing change?

How was this patch tested?

@gauravkm gauravkm changed the title [CELEBORN-894] Add support for end to end Integrity Checks [Draft][RFC][CELEBORN-894] Add support for end to end Integrity Checks Jan 10, 2025
@FMX
Contributor

FMX commented Jan 12, 2025

Glad to see your PR. If it's ready you can remove the draft label.

@gauravkm gauravkm changed the title [Draft][RFC][CELEBORN-894] Add support for end to end Integrity Checks [RFC][CELEBORN-894] Add support for end to end Integrity Checks Jan 14, 2025
@gauravkm
Author

gauravkm commented Feb 3, 2025

Hi @FMX
Could you please take a look? I removed the draft tag.

@FMX
Contributor

FMX commented Feb 5, 2025

The code review will be done within this week.

Contributor

@FMX FMX left a comment

@gauravkm Looks like this PR is incomplete. Are there more PRs for this Jira Ticket?

@@ -377,7 +377,8 @@ private void close() throws IOException, InterruptedException {
updateRecordsWrittenMetrics();

long waitStartTime = System.nanoTime();
- shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers);
+ int bytesWritten = shuffleClient.mapperEnd(shuffleId, mapId, encodedAttemptId, numMappers, numMappers);
+ writeMetrics.incBytesWritten(bytesWritten);
Contributor

This should not add the bytes written to the Spark metrics, because the write metric already has the correct value of written bytes.

@@ -32,6 +33,7 @@ public class PushState {
private final int pushBufferMaxSize;
public AtomicReference<IOException> exception = new AtomicReference<>();
private final InFlightRequestTracker inFlightRequestTracker;
+ private final ConcurrentHashMap<Integer, CommitMetadata> commitMetadataMap = new ConcurrentHashMap<>();
Contributor

This map is always empty. It seems you forgot to update it.
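For illustration only, here is a minimal sketch of how a per-partition map like this might be populated on every push. Everything in it is an assumption rather than the PR's code: `PushStateSketch`, `CommitMetadataSketch`, `recordPush`, and the choice of summing per-batch CRC32 values (picked here only because addition is order-independent) are hypothetical stand-ins for whatever `CommitMetadata` actually tracks.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.CRC32;

public class PushStateSketch {
  // Hypothetical stand-in for the PR's CommitMetadata; its fields are assumptions.
  static final class CommitMetadataSketch {
    private long bytes;
    private long checksum;

    // Fold one pushed batch into the running totals.
    synchronized void addBatch(byte[] data) {
      CRC32 crc = new CRC32();
      crc.update(data, 0, data.length);
      bytes += data.length;
      // Summing per-batch CRCs makes the result independent of push order.
      checksum += crc.getValue();
    }

    synchronized long getBytes() { return bytes; }
    synchronized long getChecksum() { return checksum; }
  }

  private final ConcurrentHashMap<Integer, CommitMetadataSketch> commitMetadataMap =
      new ConcurrentHashMap<>();

  // Hypothetical hook: call this for every batch that is pushed,
  // so the map holds an entry for each partition that received data.
  void recordPush(int partitionId, byte[] data) {
    commitMetadataMap
        .computeIfAbsent(partitionId, id -> new CommitMetadataSketch())
        .addBatch(data);
  }

  long bytesFor(int partitionId) {
    CommitMetadataSketch m = commitMetadataMap.get(partitionId);
    return m == null ? 0L : m.getBytes();
  }

  long checksumFor(int partitionId) {
    CommitMetadataSketch m = commitMetadataMap.get(partitionId);
    return m == null ? 0L : m.getChecksum();
  }

  int trackedPartitions() { return commitMetadataMap.size(); }

  public static void main(String[] args) {
    PushStateSketch state = new PushStateSketch();
    state.recordPush(3, new byte[] {1, 2, 3});
    state.recordPush(3, new byte[] {4, 5});
    System.out.println("bytes for partition 3: " + state.bytesFor(3));
    System.out.println("tracked partitions: " + state.trackedPartitions());
  }
}
```

Without a call like `recordPush` somewhere on the push path, any later lookup in the map can only return a default value.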

int bytes = 0;

for (int partitionId = 0; partitionId < numPartitions; partitionId++) {
CommitMetadata metadata = metadataMap.getOrDefault(partitionId, new CommitMetadata());
Contributor

This will always return empty commit metadata.

@gaoyajun02

Gentle ping, @gauravkm.
Thank you for your PR. It is very valuable to us, but it seems incomplete. Are you still working on it?

@gauravkm
Author

gauravkm commented Feb 26, 2025

@gaoyajun02 Yes. I am still working on this. We (at Stripe) realized that the implementation needs to be a lot more comprehensive and thorough for the checks to be meaningful and provide confidence. We are internally testing out the new implementation which I will then open for review in OSS as well.
We also found that the current approach introduces a lot of overhead for apps that provision a high partition count but only write to a few of them. So we have altered the design to be able to accommodate such apps.
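To make the overhead concern concrete, here is a rough sketch contrasting a dense per-partition report with a sparse one. It does not reflect the revised design, which is not described here. `SparseCommitReportSketch`, `denseReport`, and `sparseReport` are hypothetical names, and the map of bytes written per partition is an assumed input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SparseCommitReportSketch {
  // Dense form: one slot per provisioned partition, even if nothing was written to it.
  static long[] denseReport(Map<Integer, Long> written, int numPartitions) {
    long[] report = new long[numPartitions];
    for (int p = 0; p < numPartitions; p++) {
      report[p] = written.getOrDefault(p, 0L);
    }
    return report;
  }

  // Sparse form: only partitions that actually received data are reported.
  static Map<Integer, Long> sparseReport(Map<Integer, Long> written) {
    Map<Integer, Long> report = new TreeMap<>();
    for (Map.Entry<Integer, Long> e : written.entrySet()) {
      if (e.getValue() > 0) {
        report.put(e.getKey(), e.getValue());
      }
    }
    return report;
  }

  // A partition missing from the sparse report is read as "zero bytes written".
  static long bytesFor(Map<Integer, Long> sparse, int partitionId) {
    return sparse.getOrDefault(partitionId, 0L);
  }

  public static void main(String[] args) {
    Map<Integer, Long> written = new HashMap<>();
    written.put(2, 10L);
    written.put(9000, 5L);

    int numPartitions = 10_000;
    System.out.println("dense entries: " + denseReport(written, numPartitions).length);
    System.out.println("sparse entries: " + sparseReport(written).size());
  }
}
```

In this sketch the dense report grows with the provisioned partition count, while the sparse report grows only with the number of partitions that were written to.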


4 participants