
Should deny further write if ParquetWriter is aborted #3011

Open · hellishfire opened this issue Sep 11, 2024 · 6 comments

@hellishfire (Contributor)

Describe the bug, including details regarding any error messages, version, and platform.

After PARQUET-2437, InternalParquetRecordWriter switches to an aborted state if a write throws, but it still accepts further writes, even though those writes will not be flushed during close, leaving the file incomplete. Users will not notice the corrupted file until they actually try to read it.

Should we check for the aborted state and reject further writes to avoid such surprises?
@gszadovszky
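
For illustration, a minimal sketch of what such a guard could look like (hypothetical, not current parquet-java code; the exception type and message are placeholders, not an agreed design):

  public void write(T value) throws IOException, InterruptedException {
    // hypothetical guard: fail fast once the writer is aborted, instead of
    // silently accepting records that close() will never flush
    if (aborted) {
      throw new IOException("writer was aborted by a previous error; further writes are rejected");
    }
    try {
      writeSupport.write(value);
      ++recordCount;
      checkBlockSizeReached();
    } catch (Throwable t) {
      aborted = true;
      throw t;
    }
  }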

Component(s)

No response

hellishfire changed the title from "Should deny further write if writer is aborted" to "Should deny further write if ParquetWriter is aborted" on Sep 11, 2024
@gszadovszky (Contributor)

@hellishfire, how would users not notice the corruption when the related exception is thrown? My change is about auto-closing: nothing should be written automatically (at closing) after an exception has happened. Users of the API should not continue writing data after an exception.

@hellishfire (Contributor, Author) commented Sep 20, 2024

> @hellishfire, how would users not notice the corruption when the related exception is thrown? My change is about auto-closing: nothing should be written automatically (at closing) after an exception has happened. Users of the API should not continue writing data after an exception.

We found that some users still try to write after an exception, then successfully close the file, only to find out later that the file cannot be opened for reading because of the missing file footer.
Perhaps it's better to deny any further writes to make it fool-proof?

@hellishfire (Contributor, Author) commented Sep 20, 2024

@gszadovszky On another note, in InternalParquetRecordWriter.java:

  public void write(T value) throws IOException, InterruptedException {
    try {
      writeSupport.write(value);
      ++recordCount;
      checkBlockSizeReached();
    } catch (Throwable t) {
      // any failure, including one thrown by writeSupport.write for an
      // illegal input value, marks the whole writer as aborted
      aborted = true;
      throw t;
    }
  }

Should we narrow the try-catch scope to exclude exceptions thrown by writeSupport (i.e. illegal input values rejected by writeSupport), which are irrelevant to the actual file I/O operations?
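
For example, a sketch of that narrowing, assuming an exception from writeSupport.write leaves the file untouched (an assumption questioned below):

  public void write(T value) throws IOException, InterruptedException {
    // hypothetical narrowing: an illegal input value rejected by
    // writeSupport propagates without marking the writer aborted...
    writeSupport.write(value);
    try {
      ++recordCount;
      checkBlockSizeReached();
    } catch (Throwable t) {
      // ...while a failure in the block-size check / flush path still aborts
      aborted = true;
      throw t;
    }
  }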

@gszadovszky (Contributor)

> We found that some users still try to write after an exception, then successfully close the file, only to find out later that the file cannot be opened for reading because of the missing file footer.
> Perhaps it's better to deny any further writes to make it fool-proof?

I don't think parquet-java is fool-proof in any way. And how do you think we should handle the case of continuing to write with a writer that is already "aborted"? Throw yet another exception? I am not against it, but I cannot see the point. Also, it should not significantly impact write performance.

> Should we narrow the try-catch scope to exclude exceptions thrown by writeSupport (i.e. illegal input values rejected by writeSupport), which are irrelevant to the actual file I/O operations?

It would make sense, but I am not sure we can differentiate the potential exceptions with 100% certainty, or whether the file writing was already affected at the time of the throw.

Feel free to put up a PR if you would like to work on this. I'm happy to review.

@hellishfire (Contributor, Author)

> We found that some users still try to write after an exception, then successfully close the file, only to find out later that the file cannot be opened for reading because of the missing file footer.
> Perhaps it's better to deny any further writes to make it fool-proof?
>
> I don't think parquet-java is fool-proof in any way. And how do you think we should handle the case of continuing to write with a writer that is already "aborted"? Throw yet another exception? I am not against it, but I cannot see the point. Also, it should not significantly impact write performance.
>
> Should we narrow the try-catch scope to exclude exceptions thrown by writeSupport (i.e. illegal input values rejected by writeSupport), which are irrelevant to the actual file I/O operations?
>
> It would make sense, but I am not sure we can differentiate the potential exceptions with 100% certainty, or whether the file writing was already affected at the time of the throw.
>
> Feel free to put up a PR if you would like to work on this. I'm happy to review.

OK. Would it be possible for you to elaborate a bit on why the aborted state was introduced in the first place? Which use case triggered the bug that led you to fix it? It was not explained clearly in your commit message, so I somewhat lack the context of that commit.

@gszadovszky (Contributor)

@hellishfire, my bad, I should have explained in more detail. In many cases, Parquet writers are used in a try-with-resources construct, which invokes close in any case, including when exceptions are thrown during the writes. close invokes flush by default, which actually does some writing. This may cause issues such as exceptions thrown while closing, which is not always a problem. But if you use, e.g., direct memory via the ByteBufferAllocator interface, it needs to be released, and if that fails, it may lead to memory leaks.
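
A sketch of that pattern, with the schema, path, and records as placeholders:

  // using org.apache.parquet.hadoop.ParquetWriter, org.apache.parquet.hadoop.example.ExampleParquetWriter,
  // org.apache.parquet.example.data.Group, and org.apache.hadoop.fs.Path
  // try-with-resources always invokes close(), even when write() throws;
  // before PARQUET-2437, close() would still flush buffered data in that case
  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
      .withType(schema)
      .build()) {
    for (Group record : records) {
      writer.write(record); // if this throws, close() below still runs
    }
  } // close() is invoked here in any case; with the aborted flag set, the final flush is skipped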
