-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8455][VL] Port encrypted file checks to shim layer #8501
Conversation
ArnavBalyan
commented
Jan 10, 2025
•
edited
Loading
edited
- Moved encrypted parquet file checks to shim, since parquet version changes across spark versions.
- Added unit tests to cover the following cases:
- No encrytpion.
- Footer encrypted, and file encrypted
- Footer plaintext, file encrypted
- The behaviour continues to remain the same for any other case, and assume native scan is safe if we are unable to validate.
- 3.5 can be checked in similar way, but provides better metadata for encryption checking, will add the support in next PR. Currently 3.5 retains the previous behaviour of always offloading.
Run Gluten Clickhouse CI on x86 |
cc @Yohahaha, @jackylee-ch, can you please take a look. The exception checks also works for spark 3.5 but will use the footer metadata since it's more efficient |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the work! left some comments.
*/ | ||
package org.apache.gluten.utils | ||
|
||
object ExceptionUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this util could be added in gluten-core
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes makes sense, do you think we can do this with 35, I just wanted to be sure once the logic is set in thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let's move this util to common modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
override def isParquetFileEncrypted( | ||
fileStatus: LocatedFileStatus, | ||
conf: Configuration): Boolean = { | ||
return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add TODO with current PR link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* - Ensures the file is still detected as encrypted despite the plaintext footer. | ||
*/ | ||
|
||
class ParquetEncryptionDetectionSuite extends AnyFunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this suite can be moved to backends-velox test module after Spark35 shims check was done, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion
.
And also, we can use existing parquet files instead of writing new ones each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes was planning to move it, but testWithSpecifiedSparkVersion
works as well, added support thanks for the tip
@@ -296,4 +300,19 @@ class Spark32Shims extends SparkShims { | |||
override def unsetOperatorId(plan: QueryPlan[_]): Unit = { | |||
plan.unsetTagValue(QueryPlan.OP_ID_TAG) | |||
} | |||
|
|||
override def isParquetFileEncrypted( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this function be implemented in it's parent class: shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala
?
If it has a different implement, just to override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic will be different for 3.5 shim, will check if consolidation can be done when that is added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a common issue in our shim code. Perhaps we should develop a better way than the current one to manage these code duplications in shim layer in future.
*/ | ||
package org.apache.gluten.utils | ||
|
||
object ExceptionUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put it in a common class? As the code is much redundant in spark32/spark33/34/35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code which needs to change with 35, updated for now
ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString | ||
false | ||
} catch { | ||
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use ExceptionUtils.hasCause
instead of case _: ParquetCryptoRuntimeException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the exception may wrap ParquetCryptoRuntimeException, and may not directly expose it. This handles all cases thanks
ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString | ||
false | ||
} catch { | ||
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed above
* - Ensures the file is still detected as encrypted despite the plaintext footer. | ||
*/ | ||
|
||
class ParquetEncryptionDetectionSuite extends AnyFunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion
.
And also, we can use existing parquet files instead of writing new ones each time.
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
fs.listFiles(new Path(path), false).next() | ||
} | ||
|
||
// private def withTempDir(testCode: File => Any): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this since we don't need it.
|
||
testWithSpecifiedSparkVersion( | ||
"Detect encrypted Parquet without encrypted footer (plaintext footer)", | ||
Array("3.2", "3.3")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this test skip 3.4?
} | ||
} | ||
|
||
testWithSpecifiedSparkVersion("Detect plain (unencrypted) Parquet file", Array("3.3", "3.4")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is also needed for 3.2?
Basically LGTM, left few comments |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |