
@ColdL commented Sep 24, 2025

Recently I was testing the integration between Lance and Spark.

I found that the Lance writer occasionally hangs. After some troubleshooting, I traced the problem to the LanceArrowWriter.setFinished method.

The original code appears to have a bug: it sets the finished flag only after releasing loadToken, which is what wakes up loadNextBatch. With that ordering, loadNextBatch can hang.
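A minimal sketch of the handshake, assuming a simplified model of the writer. Only `loadToken`, `finished`, `count`, `setFinished` and `loadNextBatch` come from this description; `HandoffModel`, `write()`, `setFinishedBuggy()` and `setFinishedFixed()` are hypothetical names, and the real class also manages a write-side semaphore and the Arrow buffers. It assumes the fix is to write `finished` before releasing the permit.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of the LanceArrowWriter handshake (not the real class).
class HandoffModel {
  final Semaphore loadToken = new Semaphore(0);
  volatile boolean finished = false;
  final AtomicInteger count = new AtomicInteger(0);

  // Producer: buffer one row (a full batch would also release loadToken; omitted here).
  void write() {
    count.incrementAndGet();
  }

  // Original order: a consumer woken by release() may still observe finished == false.
  void setFinishedBuggy() {
    loadToken.release();
    finished = true;
  }

  // Assumed fix: publish finished first, then wake the consumer. Semaphore.release()
  // happens-before a successful acquire(), so the woken consumer sees finished == true.
  void setFinishedFixed() {
    finished = true;
    loadToken.release();
  }

  // Consumer: true if a batch was handed over, false at end of stream.
  boolean loadNextBatch() {
    if (finished && count.get() == 0) {
      return false; // early exit: nothing left and no more rows coming
    }
    try {
      loadToken.acquire(); // parks until a full batch or setFinished releases a permit
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    if (!finished) {
      count.set(0); // hand over a full batch; the caller will come back for more
      return true;
    }
    return count.getAndSet(0) > 0; // trailing partial batch, if any
  }
}
```

On a single thread both orderings behave the same: after three `write()` calls and `setFinishedFixed()`, `loadNextBatch()` returns true once for the trailing rows and then false. The difference only shows up when the consumer runs between the two statements of `setFinished`.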

Root Cause

The ideal flow should be:

  1. (thread 1) loadToken.release
  2. (thread 1) finished = true
  3. (thread 2) loadNextBatch
  4. (thread 2) finished is true and count is 0, so it returns false

However, the following interleaving is also possible:

  1. (thread 1) loadToken.release
  2. (thread 2) loadNextBatch
  3. (thread 2) finished is still false, so it returns true; its next loadNextBatch call then waits on loadToken
  4. (thread 1) finished = true, too late for thread 2
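The second interleaving can be replayed step by step on a single thread. This is a sketch rather than a test of the real writer: `BadInterleavingReplay`, `threadTwoIsStranded` and the 5 buffered rows are hypothetical, and the control flow of loadNextBatch is assumed from the steps above (early exit when `finished` is true and `count` is 0, otherwise wait on `loadToken`).

```java
import java.util.concurrent.Semaphore;

// Hypothetical single-threaded replay of the second interleaving. Each step stands for
// one action of thread 1 or thread 2; the permit count is inspected instead of calling
// acquire(), so the replay can report "would wait" instead of parking forever.
class BadInterleavingReplay {

  // Returns true if thread 2 ends up waiting for a permit that nobody will release.
  static boolean threadTwoIsStranded() {
    Semaphore loadToken = new Semaphore(0);
    boolean finished = false;
    int count = 5; // assumed: 5 trailing rows that never filled a batch

    // 1. (thread 1) loadToken.release
    loadToken.release();

    // 2. (thread 2) loadNextBatch: finished is false, so it takes the only permit
    boolean gotPermit = loadToken.tryAcquire();

    // 3. (thread 2) finished is still false: it resets count and returns true ...
    boolean returnedBatch = gotPermit && !finished;
    count = 0;
    // ... then calls loadNextBatch again; the early exit needs finished == true,
    // so it is skipped and thread 2 waits on loadToken
    boolean tookEarlyExit = finished && count == 0;

    // 4. (thread 1) finished = true; setFinished returns without another release
    finished = true;

    // Thread 2 is parked in acquire(), and no permit is (or will be) available.
    return returnedBatch && !tookEarlyExit && loadToken.availablePermits() == 0;
  }
}
```

`threadTwoIsStranded()` returns true. If the flag were written before the release, step 3 would instead see finished == true and return the 5 trailing rows, and the following call would take the early exit.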

If the second interleaving occurs, thread 2 hangs indefinitely: thread 1 has already released its last loadToken permit, so no further notification arrives. jstack shows the stacks hanging in LanceDataWriter.commit.

Reproduction

This issue is hard to reproduce. I encountered it in a very low-resource environment (a Spark executor with only 1 core and 4 GB of memory) when creating a new table and writing 600 rows at once, one column of which is a 1024-dimensional vector.

Even there, it occurs only intermittently.
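Because the window between the two statements of setFinished is only a few instructions wide, one way to make the race show up reliably is to widen it artificially, which roughly mimics thread 1 being descheduled on a single-core executor. This harness is a sketch under stated assumptions: `ForcedRaceRepro`, its nested `Model`, `consumerHangs`, the `Thread.sleep` pause and the 5 trailing rows are all hypothetical, and `Model` is a simplified stand-in for LanceArrowWriter rather than its real code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness: a sleep inside setFinished widens the race window.
class ForcedRaceRepro {
  // Simplified stand-in for the writer's loadToken/finished/count handshake.
  static final class Model {
    final Semaphore loadToken = new Semaphore(0);
    volatile boolean finished = false;
    final AtomicInteger count = new AtomicInteger(0);

    boolean loadNextBatch() {
      if (finished && count.get() == 0) {
        return false;
      }
      try {
        loadToken.acquire();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false; // treat interruption as end of stream in this sketch
      }
      if (!finished) {
        count.set(0);
        return true;
      }
      return count.getAndSet(0) > 0;
    }
  }

  // Returns true if the consumer is still blocked one second after the producer finished.
  static boolean consumerHangs(boolean fixedOrder, long windowMillis) {
    Model m = new Model();
    m.count.set(5); // assumed: 5 trailing rows that never filled a batch
    Thread consumer = new Thread(() -> {
      while (m.loadNextBatch()) {
        // drain batches until end of stream
      }
    });
    consumer.setDaemon(true); // a stranded consumer must not keep the JVM alive
    consumer.start();
    try {
      if (fixedOrder) {
        m.finished = true;
        Thread.sleep(windowMillis); // artificial preemption point inside setFinished
        m.loadToken.release();
      } else {
        m.loadToken.release();
        Thread.sleep(windowMillis); // artificial preemption point inside setFinished
        m.finished = true;
      }
      consumer.join(1000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return consumer.isAlive();
  }
}
```

With a 200 ms window, `consumerHangs(false, 200)` is expected to return true (original order) and `consumerHangs(true, 200)` to return false (fixed order). The result is still timing-dependent: if the consumer thread is not scheduled at all during the window, the original order can also complete normally.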

Further Confirmation

Although the current fix seems reasonable, I would like confirmation from the maintainers that it does not introduce new issues.

Any comments on this, @jackye1995?

github-actions bot added the bug (Something isn't working) label on Sep 24, 2025
@jackye1995 (Collaborator)

@ColdL force-pushed the fix-lance-arrow-writer branch from 0f0dfea to 232f394 on September 24, 2025 08:27
@ColdL (Author) commented Sep 24, 2025

DONE

@ColdL force-pushed the fix-lance-arrow-writer branch from 232f394 to 871556b on September 24, 2025 08:35
@jackye1995 (Collaborator)

Looks like there are still some code style issues.

@ColdL force-pushed the fix-lance-arrow-writer branch from 871556b to 2477cdc on September 25, 2025 02:05
@ColdL (Author) commented Sep 25, 2025

DONE

@ColdL (Author) commented Oct 9, 2025

Also, this commit might be worth another look; it hasn't been merged yet 😉 @jackye1995
