Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8455][VL] Port encrypted file checks to shim layer #8501

Merged
merged 10 commits into from
Jan 16, 2025

Conversation

ArnavBalyan
Copy link
Contributor

@ArnavBalyan ArnavBalyan commented Jan 10, 2025

  • Moved encrypted parquet file checks to shim, since parquet version changes across spark versions.
  • Added unit tests to cover the following cases:
    • No encrytpion.
    • Footer encrypted, and file encrypted
    • Footer plaintext, file encrypted
  • The behaviour continues to remain the same for any other case, and assume native scan is safe if we are unable to validate.
  • 3.5 can be checked in similar way, but provides better metadata for encryption checking, will add the support in next PR. Currently 3.5 retains the previous behaviour of always offloading.

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Jan 10, 2025
Copy link

#8455

Copy link

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Copy link
Contributor Author

cc @Yohahaha, @jackylee-ch, can you please take a look. The exception checks also works for spark 3.5 but will use the footer metadata since it's more efficient

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

update
Copy link

Run Gluten Clickhouse CI on x86

Copy link
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the work! left some comments.

*/
package org.apache.gluten.utils

object ExceptionUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this util could be added in gluten-core

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes makes sense, do you think we can do this with 35, I just wanted to be sure once the logic is set in thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's move this util to common modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

override def isParquetFileEncrypted(
fileStatus: LocatedFileStatus,
conf: Configuration): Boolean = {
return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add TODO with current PR link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* - Ensures the file is still detected as encrypted despite the plaintext footer.
*/

class ParquetEncryptionDetectionSuite extends AnyFunSuite {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this suite can be moved to backends-velox test module after Spark35 shims check was done, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion.

And also, we can use existing parquet files instead of writing new ones each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes was planning to move it, but testWithSpecifiedSparkVersion works as well, added support thanks for the tip

@@ -296,4 +300,19 @@ class Spark32Shims extends SparkShims {
override def unsetOperatorId(plan: QueryPlan[_]): Unit = {
plan.unsetTagValue(QueryPlan.OP_ID_TAG)
}

override def isParquetFileEncrypted(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this function be implemented in it's parent class: shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala
?
If it has a different implement, just to override it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic will be different for 3.5 shim, will check if consolidation can be done when that is added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common issue in our shim code. Perhaps we should develop a better way than the current one to manage these code duplications in shim layer in future.

*/
package org.apache.gluten.utils

object ExceptionUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it in a common class? As the code is much redundant in spark32/spark33/34/35

Copy link
Contributor Author

@ArnavBalyan ArnavBalyan Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code which needs to change with 35, updated for now

ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString
false
} catch {
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ExceptionUtils.hasCause instead of case _: ParquetCryptoRuntimeException ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the exception may wrap ParquetCryptoRuntimeException, and may not directly expose it. This handles all cases thanks

ParquetFileReader.readFooter(new Configuration(), fileStatus.getPath).toString
false
} catch {
case e: Exception if ExceptionUtils.hasCause(e, classOf[ParquetCryptoRuntimeException]) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed above

* - Ensures the file is still detected as encrypted despite the plaintext footer.
*/

class ParquetEncryptionDetectionSuite extends AnyFunSuite {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be moved to backends-velox and tested with testWithSpecifiedSparkVersion.

And also, we can use existing parquet files instead of writing new ones each time.

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link
Contributor

@Yohahaha Yohahaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

fs.listFiles(new Path(path), false).next()
}

// private def withTempDir(testCode: File => Any): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this since we don't need it.


testWithSpecifiedSparkVersion(
"Detect encrypted Parquet without encrypted footer (plaintext footer)",
Array("3.2", "3.3")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this test skip 3.4?

}
}

testWithSpecifiedSparkVersion("Detect plain (unencrypted) Parquet file", Array("3.3", "3.4")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also needed for 3.2?

@jackylee-ch
Copy link
Contributor

Basically LGTM, left few comments

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@Yohahaha Yohahaha merged commit 67ebbc8 into apache:main Jan 16, 2025
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants