Skip to content

Conversation

@liuguoqingfz
Copy link
Contributor

Description

The original code was walking the source data path while it’s still being mutated by engine/shard cleanup, and Files.walk(...) is lazy: as it iterates it calls readAttributes on each entry to decide file/dir. One of those entries (translog-7.ckp) was deleted between directory listing and attribute read → NoSuchFileException, which then gets wrapped as UncheckedIOException

Related Issues

Resolves #19731

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

✅ Gradle check result for 1a29cb1: SUCCESS

@codecov
Copy link

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.24%. Comparing base (0c89456) to head (1a29cb1).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19758      +/-   ##
============================================
+ Coverage     73.19%   73.24%   +0.04%     
- Complexity    70946    70983      +37     
============================================
  Files          5735     5735              
  Lines        324654   324654              
  Branches      46962    46962              
============================================
+ Hits         237643   237789     +146     
+ Misses        67875    67702     -173     
- Partials      19136    19163      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fixed formatting issue with extra empty line

Signed-off-by: Joe Liu <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 196fb16: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 196fb16: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

try {
if (path.endsWith(".lock") == false) {
Files.copy(path, newIndexDataPath.resolve(indexDataPath.relativize(path)));
Files.walkFileTree(indexDataPath, new SimpleFileVisitor<Path>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this same race condition was attempted to be fixed by this code:

// race condition: async flush may cause translog file deletion resulting in an inconsistent stream from
// Files.walk below during copy phase
// temporarily disable refresh to avoid any flushes or syncs that may inadvertently cause the deletion
assertAcked(
client().admin()
.indices()
.prepareUpdateSettings(index)
.setSettings(Settings.builder().put(IndexSettings.INDEX_REFRESH_INTERVAL_SETTING.getKey(), "-1").build())
);

Is it possible there is still an in-flight refresh that causes the inconsistency even after disabling refresh? If so, could we just force a refresh after disabling and that might ensure everything is quiescent?

My concern with the approach here is that if things are being mutated while traversing the files then we can't know that we've copied a consistent view of the flies for the shard, which seems like a requirement of this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for IndexShardIT

2 participants