Skip to content

Core: validate file format compatibility with v3 #13060

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

danielcweeks
Copy link
Contributor

Parquet is currently the only file format that supports all the required capabilities for v3 compatibility. This PR adds validation/tests to prevent unintentionally creating or upgrading a table to v3 in an incompatible way.

@danielcweeks danielcweeks requested a review from nastra May 14, 2025 21:23
@github-actions github-actions bot added the core label May 14, 2025
@danielcweeks danielcweeks force-pushed the v4-validate-file-format-2 branch from b9c7d75 to d0f29a7 Compare May 14, 2025 21:41
properties.getOrDefault(DEFAULT_FILE_FORMAT, DEFAULT_FILE_FORMAT_DEFAULT));

Preconditions.checkArgument(
V3_SUPPORTED_FILE_FORMATS.contains(targetFormat) || formatVersion <= 2,
Copy link
Contributor

@stevenzwu stevenzwu May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably fine for now.

but it is tied specifically to v3. if we are adding v4 in the future, this is another another we need to keep up. we probably need find some more elegant way to manage the scattered version specific validations. maybe out of the scope of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Maybe the reverse is more cleaner, like maximum supported table format version in FileFormat.java for each format[1]. So, here validation can be generic for all file format and table format.

[1]
Parquet - 3
avro - 2
ORC - 2
PUFFIN- 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this as well, but it's a little more complicated than just having a max supported version. For example, Avro is supported in 1,2 and 3 as a metadata format (we still write manifests using Avro), but currently only 1,2 for data format. You also can't write formats other than puffin for deletes with 3 and beyond.

I'll think about this and see if there's a clearer way to present this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good approach for now. We don't need to build too much here since we intend to remove this and resolve the compatibility issues. If that doesn't happen then we actually need to address it in a different way, by working as a community to deprecate support and make this more permanent.

PUFFIN- 2

Puffin should be v3 and beyond. We can't write Puffin files for DVs in v2.

@danielcweeks danielcweeks force-pushed the v4-validate-file-format-2 branch from d0f29a7 to 67c6b87 Compare May 15, 2025 22:47
@danielcweeks danielcweeks force-pushed the v4-validate-file-format-2 branch from 67c6b87 to 2724673 Compare May 15, 2025 22:48
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes LGTM but we'll need to adjust a few Spark tests that run with v3 and those file formats

@@ -61,6 +65,9 @@ public class TableMetadata implements Serializable {
static final int INITIAL_SCHEMA_ID = 0;
static final int INITIAL_ROW_ID = 0;

@VisibleForTesting
static final Set<FileFormat> V3_SUPPORTED_FILE_FORMATS = Set.of(FileFormat.PARQUET);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't PUFFIN be here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for the data files, so puffin wouldn't apply (not a valid option).

* for features like default values, type support, etc. which makes creating or updating to v3
* tables unsafe. Supported formats should be expanded as support is implemented.
*/
private static void validateFileFormatCompatibility(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:nit: Could be validateDataFileFormatCompatibility
Just to remove ambiguity

@RussellSpitzer
Copy link
Member

The tests are failing because we have a few tests that use a parameterization that uses "avro" as the data file format. Currently we do this for all versions

@danielcweeks
Copy link
Contributor Author

The tests are failing because we have a few tests that use a parameterization that uses "avro" as the data file format. Currently we do this for all versions

I'll chase these other tests down and make sure we're handling/skipping appropriately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants