Core: validate file format compatibility with v3 #13060
Force-pushed from b9c7d75 to d0f29a7
        properties.getOrDefault(DEFAULT_FILE_FORMAT, DEFAULT_FILE_FORMAT_DEFAULT));

    Preconditions.checkArgument(
        V3_SUPPORTED_FILE_FORMATS.contains(targetFormat) || formatVersion <= 2,
This is probably fine for now, but it is tied specifically to v3. If we add v4 in the future, this is another place we'll need to keep up to date. We probably need to find a more elegant way to manage the scattered version-specific validations, though that may be out of scope for this PR.
Agree. Maybe the reverse is cleaner: a maximum supported table format version in FileFormat.java for each format [1]. Then the validation here could be generic across all file formats and table format versions.
[1]
Parquet - 3
Avro - 2
ORC - 2
Puffin - 2
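A minimal sketch of that proposal (hypothetical; `FileFormat` below is a stand-in for Iceberg's `org.apache.iceberg.FileFormat`, extended with the per-format versions listed above):

```java
// Stand-in for org.apache.iceberg.FileFormat, extended with the proposed
// maximum supported table format version per file format (hypothetical).
enum FileFormat {
  PARQUET(3),
  AVRO(2),
  ORC(2),
  PUFFIN(2);

  private final int maxTableFormatVersion;

  FileFormat(int maxTableFormatVersion) {
    this.maxTableFormatVersion = maxTableFormatVersion;
  }

  int maxTableFormatVersion() {
    return maxTableFormatVersion;
  }
}

public class FormatVersionCheck {
  // Generic check: nothing version-specific to update when v4 is added.
  static void validateCompatibility(FileFormat format, int formatVersion) {
    if (formatVersion > format.maxTableFormatVersion()) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot use %s with table format version %d (max supported: %d)",
              format, formatVersion, format.maxTableFormatVersion()));
    }
  }

  public static void main(String[] args) {
    validateCompatibility(FileFormat.PARQUET, 3); // passes
    try {
      validateCompatibility(FileFormat.AVRO, 3);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With this shape, adding v4 only means bumping the numbers on the formats that gain support, rather than adding another version-specific set.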
I considered this as well, but it's a little more complicated than just having a max supported version. For example, Avro is supported in v1, v2, and v3 as a metadata format (we still write manifests using Avro), but currently only v1 and v2 as a data format. You also can't write formats other than Puffin for deletes with v3 and beyond.
I'll think about this and see if there's a clearer way to present this.
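To make that complication concrete, one hypothetical way to model it is to key supported versions by how a format is used rather than by format alone. All names below are illustrative, not Iceberg APIs, and the version sets only reflect the cases mentioned in this thread:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: supported table format versions vary by usage
// (data vs. deletes vs. metadata), not just by file format.
public class FormatUsageSupport {
  enum Usage { DATA, DELETES, METADATA }
  enum Format { PARQUET, AVRO, ORC, PUFFIN }

  // Per the discussion: Avro works for metadata in v1-3 but only v1-2 for
  // data, and in v3 only Puffin can be written for deletes (DVs).
  // The entries here are illustrative, not an exhaustive support matrix.
  static final Map<Usage, Map<Format, Set<Integer>>> SUPPORTED =
      Map.of(
          Usage.DATA, Map.of(
              Format.PARQUET, Set.of(1, 2, 3),
              Format.AVRO, Set.of(1, 2),
              Format.ORC, Set.of(1, 2)),
          Usage.DELETES, Map.of(
              Format.PARQUET, Set.of(2),
              Format.AVRO, Set.of(2),
              Format.ORC, Set.of(2),
              Format.PUFFIN, Set.of(3)),
          Usage.METADATA, Map.of(
              Format.AVRO, Set.of(1, 2, 3)));

  static boolean isSupported(Usage usage, Format format, int version) {
    return SUPPORTED.getOrDefault(usage, Map.of())
        .getOrDefault(format, Set.of())
        .contains(version);
  }
}
```

A per-usage table like this is more to maintain than a single max version, which is the trade-off being weighed here.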
I think this is a good approach for now. We don't need to build too much here since we intend to remove this and resolve the compatibility issues. If that doesn't happen then we actually need to address it in a different way, by working as a community to deprecate support and make this more permanent.
> Puffin - 2

Puffin should be v3 and beyond. We can't write Puffin files for DVs in v2.
Force-pushed from d0f29a7 to 67c6b87
Force-pushed from 67c6b87 to 2724673
Changes LGTM, but we'll need to adjust a few Spark tests that run with v3 and those file formats.
@@ -61,6 +65,9 @@ public class TableMetadata implements Serializable {
  static final int INITIAL_SCHEMA_ID = 0;
  static final int INITIAL_ROW_ID = 0;

  @VisibleForTesting
  static final Set<FileFormat> V3_SUPPORTED_FILE_FORMATS = Set.of(FileFormat.PARQUET);
Shouldn't PUFFIN be here as well?
This is only for the data files, so puffin wouldn't apply (not a valid option).
   * for features like default values, type support, etc. which makes creating or updating to v3
   * tables unsafe. Supported formats should be expanded as support is implemented.
   */
  private static void validateFileFormatCompatibility(
Nit: this could be validateDataFileFormatCompatibility, just to remove ambiguity.
The tests are failing because we have a few tests that use a parameterization with "avro" as the data file format; currently we do this for all versions.
I'll chase those other tests down and make sure we're handling/skipping them appropriately.
Parquet is currently the only file format that supports all the required capabilities for v3 compatibility. This PR adds validation and tests to prevent unintentionally creating or upgrading a table to v3 in an incompatible way.
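The shape of the added check, sketched from the diff fragments above (simplified and self-contained; the real code lives in `TableMetadata` and uses Iceberg's `FileFormat` and Guava's `Preconditions.checkArgument`):

```java
import java.util.Set;

public class V3FormatValidation {
  // Stand-in for org.apache.iceberg.FileFormat (data file formats only).
  enum FileFormat { PARQUET, AVRO, ORC }

  // Mirrors the constant added in TableMetadata: Parquet is currently the
  // only data file format considered v3-compatible.
  static final Set<FileFormat> V3_SUPPORTED_FILE_FORMATS = Set.of(FileFormat.PARQUET);

  // Simplified version of the validation: allow any format for v1/v2, but
  // reject v3+ tables whose default data file format is not yet supported.
  static void validateFileFormatCompatibility(int formatVersion, FileFormat targetFormat) {
    if (!(V3_SUPPORTED_FILE_FORMATS.contains(targetFormat) || formatVersion <= 2)) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot use file format %s with format version %d",
              targetFormat, formatVersion));
    }
  }

  public static void main(String[] args) {
    validateFileFormatCompatibility(2, FileFormat.AVRO);    // v2 + Avro: fine
    validateFileFormatCompatibility(3, FileFormat.PARQUET); // v3 + Parquet: fine
    try {
      validateFileFormatCompatibility(3, FileFormat.AVRO);  // v3 + Avro: rejected
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The condition is written to match the `checkArgument` fragment in the diff: the argument is valid when the format is in the v3-supported set or the table is still on v1/v2.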