feat: Parquet Modular Encryption with Spark KMS for native readers #2447
Conversation
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@             Coverage Diff              @@
##               main     #2447       +/-  ##
============================================
+ Coverage      56.12%    58.30%    +2.17%
- Complexity       976      1436      +460
============================================
  Files            119       147       +28
  Lines          11743     13564     +1821
  Branches        2251      2357      +106
============================================
+ Hits            6591      7908     +1317
- Misses          4012      4426      +414
- Partials        1140      1230       +90
Also look at https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java to see if there are any tests that might be relevant here.
# Conflicts:
#   spark/src/main/scala/org/apache/comet/CometExecIterator.scala
// Each hadoopConf yields a unique DecryptionPropertiesFactory. While it's unlikely that
// this Comet plan contains more than one hadoopConf, we don't want to assume that. So we'll
// provide the ability to cache more than one Factory with a map.
private final ConcurrentHashMap<Configuration, DecryptionPropertiesFactory> factoryCache =
There is only one Hadoop conf in a Spark session, so this may be overkill.
Session hadoopConf is not what the scans use though. They add all the relation options (Parquet options like encryption keys) to the hadoopConf, so each scan can have a unique hadoopConf. Whether we could have a Comet plan with multiple Parquet scans is the real question.
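For context, this is roughly how Spark builds a per-scan conf (an illustrative sketch, not code from this PR; it assumes a HadoopFsRelation named relation is in scope):

import org.apache.hadoop.conf.Configuration

// Spark copies the session conf and layers the relation's options (including Parquet
// encryption properties) on top, so two scans configured with different options end up
// with different Configuration instances.
val scanHadoopConf: Configuration =
  relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options)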
Whether we could have a Comet plan with multiple Parquet scans is the real question.
I don't know what you mean by this. What exactly are you calling a Parquet scan?
public class CometFileKeyUnwrapper {

  // Each file path gets a unique DecryptionKeyRetriever
  private final ConcurrentHashMap<String, DecryptionKeyRetriever> retrieverCache =
Every file path? This can get rather large when the number of files starts to reach 100K or more.
Spark must also be keeping the same number in memory for its scans. Also, it's only whatever subset of files this plan is responsible for.
public void storeDecryptionKeyRetriever(final String filePath, final Configuration hadoopConf) {
  // Use DecryptionPropertiesFactory.loadFactory to get the factory and then call
  // getFileDecryptionProperties
  DecryptionPropertiesFactory factory = factoryCache.get(hadoopConf);
Is this hadoopConf the entire Hadoop configuration (which can have a thousand entries) or just the incremental properties specified for the session? Hashing this can become time-consuming.
The Parquet library decides which fields it needs. If we want to limit the versions we support, I could start hard-coding the values, but that creates future work for ourselves any time a new config is added.
if (encryptionEnabled) {
  // hadoopConf isn't serializable, so we have to do a broadcasted config.
  val broadcastedConf =
    scan.relation.sparkSession.sparkContext
Can you explain a little what you are doing here? What is the additional hadoop conf information that needs to be broadcast per file path (as opposed to encryption properties that are defined once per table)?
It's only broadcasting one hadoopConf per relation (table); what gets broadcast per file is a mapping from file path to that hadoopConf.
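A minimal sketch of the broadcast pattern being discussed, assuming Spark's SerializableConfiguration wrapper is usable here and that scan and hadoopConf are in scope (the actual PR code may differ):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.util.SerializableConfiguration

// Hadoop's Configuration is not serializable, so wrap it and broadcast it once per relation.
val broadcastedConf =
  scan.relation.sparkSession.sparkContext
    .broadcast(new SerializableConfiguration(hadoopConf))

// On the executors, unwrap it to recover the Configuration used for key unwrapping.
val executorConf: Configuration = broadcastedConf.value.value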
Why does this need to be broadcast? Won't each executor instance already have its own copy of scan.relation.options?
// Call instance method FileKeyUnwrapper.getKey(String, byte[]) -> byte[]
let result = unsafe {
    env.call_method_unchecked(
This can throw an exception, can it not? Elsewhere we use try_unwrap_or_throw (errors.rs) to report the error.
DecryptionPropertiesFactory factory = factoryCache.get(hadoopConf);
if (factory == null) {
  factory = DecryptionPropertiesFactory.loadFactory(hadoopConf);
  factoryCache.put(hadoopConf, factory);
Since you use ConcurrentMap, you probably want to use its #computeIfAbsent() method instead of #get() + a null check + #put().
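A minimal sketch of that suggestion, assuming the factoryCache field shown above:

// computeIfAbsent does the lookup and the (at most one) factory load in a single atomic step.
DecryptionPropertiesFactory factory =
    factoryCache.computeIfAbsent(hadoopConf, DecryptionPropertiesFactory::loadFactory);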
impl CometKeyRetriever {
    pub fn new(file_path: &str, key_unwrapper: GlobalRef) -> Result<Self, ExecutionError> {
        // Get JNI environment
        let mut env = JVMClasses::get_env()?;
Below you use
let mut env = JVMClasses::get_env()
    .map_err(|e| datafusion::parquet::errors::ParquetError::General(e.to_string()))?;
Is there no need for error mapping here?
        )
    };

    let result = result.unwrap();
Can you use result? instead, so we return an Err rather than panicking? Or use a pattern match and return a custom error. There are a few more unwraps below.
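For example, something along these lines (a sketch that reuses the error mapping already quoted in this review rather than introducing a custom error type):

// Propagate the JNI failure instead of panicking; this assumes the surrounding function
// returns a parquet Result, matching the map_err pattern shown earlier.
let result = result
    .map_err(|e| datafusion::parquet::errors::ParquetError::General(e.to_string()))?;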
Which issue does this PR close?
Closes #.
Rationale for this change
We want to add Parquet Modular Encryption support for the native readers when using a Spark KMS. We use the encryption factory features added in DataFusion 50 to register an encryption factory that uses JNI to get decryption keys from Spark.
What changes are included in this PR?
How are these changes tested?