[parquet] Allow more encryption algorithms by hsiang-c · Pull Request #9203 · apache/arrow-rs

hsiang-c · 2026-01-16T07:20:16Z

Which issue does this PR close?

Closes [Parquet] Support other encryption/decryption key size #9202.

Rationale for this change

Iceberg spec supports AES key sizes of 128, 192 and 256 bits. Iceberg Rust depends on arrow-rs for Parquet I/O, I'd like to start supporting AES 256 with this PR.

What changes are included in this PR?

RingGcmBlockEncryptor and RingGcmBlockDecryptor will pick AES-128 or AES-256 based on key size
Refactor encryption_async.rs and encryption.rs to test both AES-128 and AES-256 encrypted parquet files

Are these changes tested?

Yes, unit test and on AES-256 encrypted Parquet files.

Are there any user-facing changes?

No

xanderbailey

When I exposed quote style here I exported the underlying QuoteStyle is that something we might want to do here also for Algorithm?

hsiang-c · 2026-01-23T23:42:04Z

@xanderbailey

Thank you for the review, I took a look at your PR. You made QuoteStyle part of Writer and expose as part of the API.

Do you mean you'd like to make Algorithm a field of RingGcmBlockDecryptor and RingGcmBlockEncryptor and expose them instead of importing it from ring::aead?

alamb · 2026-02-03T22:07:21Z

Thanks @hsiang-c and @mbutrovich -- looks like there are some CI failures on this PR.

Also, it seems like we should have some tests for the new features

alamb

Thanks @hsiang-c

FYI @adamreeve and @rok as the original authors of the parquet encryption feature

Perhaps you have some time to review this PR?

alamb · 2026-02-11T19:43:19Z

 }

 impl RingGcmBlockDecryptor {
+    #[allow(dead_code)]


why do we need to allow dead code? Maybe we should remove it instead?

alamb · 2026-02-11T19:46:26Z


+        #[cfg(feature = "encryption")]
+        let expected_size_with_decryptor = 3080;
+        #[cfg(not(feature = "encryption"))]


Why is this needed? The whole test is gated by

#[cfg(feature = "encryption")] fn test_memory_size_with_decryptor() {

it seems like you just need to update the 3072 to 3080 to reflect the additional size?

alamb · 2026-02-11T19:47:07Z

+    #[test]
+    fn test_round_trip_with_incorrect_key_length() {
+        let key = [0u8; 16];
+        assert!(RingGcmBlockEncryptor::new_with_algorithm(&CHACHA20_POLY1305, &key).is_err());


It isn't clear to me that CHACHA is a valid Parquet encryption: https://github.com/apache/parquet-format/blob/master/Encryption.md :

alamb · 2026-02-11T19:48:13Z

    }

+    /// The AEAD decryption algorithm to be used.
+    pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self {


this effectively makes Algorithm a part of the public API I think -- this might be ok, but it also means we would not be able to change the underlying crypto library without breaking the API in the future

Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.

Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).

I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.

adamreeve

Thanks for addressing this @hsiang-c! I've left a few comments

adamreeve · 2026-02-11T21:35:29Z

+    pub(crate) fn new_with_algorithm(
+        algorithm: &'static Algorithm,
+        key_bytes: &[u8],
+    ) -> Result<Self> {


I don't think there's any need to add this new constructor, we can reuse the existing new method above and change the algorithm based on the length of the key.

This is also a bit semantically confusing as you could create a new "RingGcmBlockDecryptor" with a non-GCM algorithm.

adamreeve · 2026-02-11T21:38:54Z

+    pub(crate) fn new_with_algorithm(
+        algorithm: &'static Algorithm,
+        key_bytes: &[u8],
+    ) -> Result<Self> {


As above, I think we should reuse the existing new method.

adamreeve · 2026-02-11T21:42:50Z

    }

+    /// The AEAD decryption algorithm to be used.
+    pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self {


Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.

Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).

I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.

tom-s-powell · 2026-02-18T10:07:25Z

Curious if there's a rough timeline on getting this change in?

tom-s-powell

Any chance that this'll get reviewed?

adamreeve · 2026-02-25T20:24:49Z

Any chance that this'll get reviewed?

This isn't ready for another review yet. The tests are currently failing because they need apache/parquet-testing#102. I'm happy to review this again once @hsiang-c says it's ready.

alamb · 2026-02-27T19:49:56Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

hsiang-c · 2026-03-06T01:15:48Z

@adamreeve @rok

I refactored unit tests in encryption.rs and encryption_async.rs and they are reading both AES-128 and AES-256 encrypted Parquet files now.

I have a question for your regarding the validate_encrypted_column_names function, please see my comments.

hsiang-c · 2026-03-06T01:18:33Z

-        decryption_properties,
-        file_encryption_properties,
+    // AES-256
+    // The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt]


@rok @adamreeve

Please check my comments here. I am not sure if this is an expected behavior. If not, I'll submit a patch after this PR.

Hi @hsiang-c. This doesn't look like expected behaviour to me, you shouldn't have to specify the same encryption key for both int64_field and int64_field.list.int64_field. I would expect that the full path should be used everywhere.

I'm also surprised that the path is int64_field.list.int64_field, the path should be int64_field.list.element according to the spec for the list logical type. Is this another bug related to encryption, or does arrow-rs differ from the spec here?

We need to support the use case where different struct fields can be encrypted with different keys, so I don't think your suggested fix to change to using c.name() instead of c.path() is correct, although I'm not sure where you're suggesting to change that.

But we may want to allow simpler encryption configuration using the top level field names as a new feature. C++ Arrow has recently added that for example: apache/arrow#45462

I agree that this isn't related to the current PR so can be fixed separately.

I looked closer into why the different "int64_field" and "int64_field.list.int64_field" column names are needed in the tests. This is mainly because the test file doesn't use the Parquet list logical type, but just has a top-level schema node named "int64_field" that's a repeated field. So it's schema path is simply "int64_field".

When reading this data as Arrow, the "int64_field" name gets used when creating the Arrow schema, which ends up with an "int64_field" list typed field, and the element/item field name is also "int64_field".

When writing the Arrow data back to Parquet, arrow-rs does use the 3-level Parquet list logical type, but rather than using the standard "element" name, it copies the "int64_field" name from the Arrow schema.

So to read the original example file, you need an "int64_field" column key specified, but to write the file with arrow-rs, and also to read it back after it's re-written, you need a key for "int64_field.list.int64_field".

I don't think there's any problem with the way the encryption code works, but arrow-rs using a non-standard naming scheme for the Parquet schema nodes might be something that should be fixed.

Digging a bit further, the use of the standard "element" name can be controlled by enabling the coerce_types option on the writer properties (see set_coerce_types). So this is expected behaviour.

Updating the test file to use the 3-level list schema structure would avoid this issue but probably isn't necessary.

I think we can just update the comments with an explanation of why the config is the way it is.

xanderbailey · 2026-04-01T18:51:04Z

Anything I can do to help here? Very excited to be able to use this!

adamreeve · 2026-04-13T23:19:19Z

Hi @hsiang-c, this seems pretty close to being mergable. Do you have time to finish this? Otherwise I'm happy to help get this over the line.

hsiang-c · 2026-04-14T00:50:19Z

@adamreeve @xanderbailey I'll wrap it up this week, thanks!

hsiang-c · 2026-04-23T22:41:03Z

@adamreeve Please take a look when you have a chance, thank you.

adamreeve

Thanks for the updates @hsiang-c. The ciphers.rs changes look good to me, but the tests are now a lot more complex. I think some of this can be improved by factoring out common key configuration constants. But we probably don't need 256 bit key variations for every single encryption test, as the key size won't be relevant to each test scenario. As well as increasing code complexity, these extra scenarios will also increase the time taken to run the tests for everyone.

Can we reduce the set of tests that have 256 bit key variations? I haven't gone through every test, but I think at least these tests can be kept as they were, and probably more:

test_multi_threaded_encrypted_writing_deprecated
test_multi_threaded_encrypted_writing
test_concurrent_encrypted_writing_over_multiple_row_groups
test_decrypting_without_decryption_properties_fails

adamreeve · 2026-04-24T02:27:45Z

-        .build()
-        .unwrap();
+    non_uniform_encryption_plaintext_footer(
+        b"0123456789012345", // 128bit/16


Can we pull these repeated footer keys and column key arrays out as constants and put them in the encryption_util module to reduce the duplication?

adamreeve · 2026-04-24T02:29:43Z

+    // AES-128: there is always a footer key even with a plaintext footer,
+    // but this is used for signing the footer.


I don't think this comment is relevant here, this test uses an encrypted footer.

adamreeve · 2026-04-24T02:33:35Z

+    // The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt]
+    // The column path of the repeated field (int64_field) is int64_field.list.int64_field
+    // Switching from `c.path().string()` to `c.name().to_string()` can fix the asymmetricity


This comment doesn't seem relevant here. Because this uses a key retriever, we don't need to provide column names to configure decryption.

adamreeve · 2026-04-24T02:39:41Z

 #[test]
 fn test_uniform_encryption_plaintext_footer_with_key_retriever() {
-    let test_data = arrow::util::test_util::parquet_test_data();
+    fn uniform_encryption_plaintext_footer_with_key_retriever(


This is a pre-existing problem, but this test says it's testing "uniform_encryption", but actually specifies column encryption keys. There's a test_non_uniform_encryption_plaintext_footer_with_key_retriever test above already. It looks like the difference is this one rewrites the file. Maybe this should be called test_roundtrip_non_uniform_encryption_plaintext_footer_with_key_retriever

adamreeve · 2026-04-24T03:08:49Z

It looks like the parquet-testing submodule still needs to be updated too to pick up the new test files.

Allow more encryption algorithms

c017804

github-actions Bot added the parquet Changes to the parquet crate label Jan 16, 2026

hsiang-c mentioned this pull request Jan 16, 2026

feat(encryption) [1/N] Support encryption: Add crypto for AES-GCM apache/iceberg-rust#2026

Merged

xanderbailey reviewed Jan 16, 2026

View reviewed changes

Merge branch 'main' into allow_more_enc_algo

1d12eb4

hsiang-c marked this pull request as ready for review January 23, 2026 07:27

hsiang-c added 2 commits January 27, 2026 11:20

Expose new API

e8da4b9

Fix API

3e1b8a7

mbutrovich reviewed Jan 27, 2026

View reviewed changes

Comment thread parquet/src/encryption/ciphers.rs Outdated

mbutrovich reviewed Jan 27, 2026

View reviewed changes

Comment thread parquet/src/encryption/decrypt.rs Outdated

hsiang-c added 2 commits February 4, 2026 20:36

Fix format and clippy

1f1f0da

Document new api

92b4094

alamb changed the title ~~Allow more encryption algorithms~~ [parquet] Allow more encryption algorithms Feb 11, 2026

alamb reviewed Feb 11, 2026

View reviewed changes

adamreeve reviewed Feb 11, 2026

View reviewed changes

hsiang-c added 2 commits February 13, 2026 16:52

Pick algorithm by key size

12c1888

Remove unnecessary tests

3ec6b70

hsiang-c marked this pull request as draft February 14, 2026 00:58

hsiang-c mentioned this pull request Feb 19, 2026

Added AES-256 encrypted parquet generated from parquet-java apache/parquet-testing#102

Merged

Round-trip tests w/ AES-256 encrypted parquet

deec81e

hsiang-c marked this pull request as ready for review February 19, 2026 00:52

hsiang-c and others added 2 commits February 18, 2026 17:26

Merge branch 'main' into allow_more_enc_algo

35b482f

Fix comment

f55fe3d

tom-s-powell reviewed Feb 25, 2026

View reviewed changes

rok reviewed Feb 25, 2026

View reviewed changes

Comment thread parquet/tests/encryption/encryption_util.rs Outdated

alamb marked this pull request as draft February 27, 2026 19:49

hsiang-c and others added 5 commits March 1, 2026 11:07

More test cases

d92014b

Merge branch 'main' into allow_more_enc_algo

0fa4888

Align field names

bd79c92

Refactor async encryption tests

e95eb4d

Refactor sync encryption tests

e8999d3

hsiang-c force-pushed the allow_more_enc_algo branch from b0140be to e8999d3 Compare March 6, 2026 00:44

hsiang-c and others added 2 commits March 5, 2026 16:44

Merge branch 'main' into allow_more_enc_algo

928ec10

Comment & clean up

0abe16d

hsiang-c commented Mar 6, 2026

View reviewed changes

hsiang-c added 3 commits April 22, 2026 16:39

Update comment

206c6c6

Merge branch 'main' into allow_more_enc_algo

36fd995

Format

efe04f6

hsiang-c marked this pull request as ready for review April 23, 2026 00:01

adamreeve reviewed Apr 24, 2026

View reviewed changes

		// AES-128: there is always a footer key even with a plaintext footer,
		// but this is used for signing the footer.

Conversation

hsiang-c commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

xanderbailey left a comment

Choose a reason for hiding this comment

Uh oh!

hsiang-c commented Jan 23, 2026

Uh oh!

Uh oh!

Uh oh!

alamb commented Feb 3, 2026

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamreeve Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamreeve left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamreeve Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tom-s-powell commented Feb 18, 2026

Uh oh!

tom-s-powell left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adamreeve commented Feb 25, 2026

Uh oh!

alamb commented Feb 27, 2026

Uh oh!

hsiang-c commented Mar 6, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamreeve Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xanderbailey commented Apr 1, 2026

Uh oh!

adamreeve commented Apr 13, 2026

Uh oh!

hsiang-c commented Apr 14, 2026

Uh oh!

hsiang-c commented Apr 23, 2026

Uh oh!

adamreeve left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hsiang-c commented Jan 16, 2026 •

edited

Loading

adamreeve Feb 11, 2026 •

edited

Loading

adamreeve Feb 11, 2026 •

edited

Loading

adamreeve Apr 9, 2026 •

edited

Loading