Skip to content

[parquet] Allow more encryption algorithms#9203

Open
hsiang-c wants to merge 21 commits intoapache:mainfrom
hsiang-c:allow_more_enc_algo
Open

[parquet] Allow more encryption algorithms#9203
hsiang-c wants to merge 21 commits intoapache:mainfrom
hsiang-c:allow_more_enc_algo

Conversation

@hsiang-c
Copy link
Copy Markdown

@hsiang-c hsiang-c commented Jan 16, 2026

Which issue does this PR close?

Rationale for this change

  • Iceberg spec supports AES key sizes of 128, 192 and 256 bits. Iceberg Rust depends on arrow-rs for Parquet I/O, I'd like to start supporting AES 256 with this PR.

What changes are included in this PR?

  • RingGcmBlockEncryptor and RingGcmBlockDecryptor will pick AES-128 or AES-256 based on key size
  • Refactor encryption_async.rs and encryption.rs to test both AES-128 and AES-256 encrypted parquet files

Are these changes tested?

Yes, unit test and on AES-256 encrypted Parquet files.

Are there any user-facing changes?

No

Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I exposed quote style here I exported the underlying QuoteStyle is that something we might want to do here also for Algorithm?

@hsiang-c hsiang-c marked this pull request as ready for review January 23, 2026 07:27
@hsiang-c
Copy link
Copy Markdown
Author

@xanderbailey

Thank you for the review, I took a look at your PR. You made QuoteStyle part of Writer and expose as part of the API.

Do you mean you'd like to make Algorithm a field of RingGcmBlockDecryptor and RingGcmBlockEncryptor and expose them instead of importing it from ring::aead?

Comment thread parquet/src/encryption/ciphers.rs Outdated
Comment thread parquet/src/encryption/decrypt.rs Outdated
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Feb 3, 2026

Thanks @hsiang-c and @mbutrovich -- looks like there are some CI failures on this PR.

Also, it seems like we should have some tests for the new features

@alamb alamb changed the title Allow more encryption algorithms [parquet] Allow more encryption algorithms Feb 11, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hsiang-c

FYI @adamreeve and @rok as the original authors of the parquet encryption feature

Perhaps you have some time to review this PR?

Comment thread parquet/src/encryption/ciphers.rs Outdated
}

impl RingGcmBlockDecryptor {
#[allow(dead_code)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to allow dead code? Maybe we should remove it instead?

Comment thread parquet/src/file/metadata/mod.rs Outdated

#[cfg(feature = "encryption")]
let expected_size_with_decryptor = 3080;
#[cfg(not(feature = "encryption"))]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? The whole test is gated by

    #[cfg(feature = "encryption")]
    fn test_memory_size_with_decryptor() {

it seems like you just need to update the 3072 to 3080 to reflect the additional size?

Comment thread parquet/src/encryption/ciphers.rs Outdated
#[test]
fn test_round_trip_with_incorrect_key_length() {
let key = [0u8; 16];
assert!(RingGcmBlockEncryptor::new_with_algorithm(&CHACHA20_POLY1305, &key).is_err());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear to me that CHACHA is a valid Parquet encryption: https://github.com/apache/parquet-format/blob/master/Encryption.md :

Comment thread parquet/src/encryption/decrypt.rs Outdated
}

/// The AEAD decryption algorithm to be used.
pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this effectively makes Algorithm a part of the public API I think -- this might be ok, but it also means we would not be able to change the underlying crypto library without breaking the API in the future

Copy link
Copy Markdown
Contributor

@adamreeve adamreeve Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.

Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).

I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.

Copy link
Copy Markdown
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing this @hsiang-c! I've left a few comments

Comment thread parquet/src/encryption/ciphers.rs Outdated
Comment on lines +49 to +52
pub(crate) fn new_with_algorithm(
algorithm: &'static Algorithm,
key_bytes: &[u8],
) -> Result<Self> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any need to add this new constructor, we can reuse the existing new method above and change the algorithm based on the length of the key.

This is also a bit semantically confusing as you could create a new "RingGcmBlockDecryptor" with a non-GCM algorithm.

Comment thread parquet/src/encryption/ciphers.rs Outdated
Comment on lines +158 to +161
pub(crate) fn new_with_algorithm(
algorithm: &'static Algorithm,
key_bytes: &[u8],
) -> Result<Self> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think we should reuse the existing new method.

Comment thread parquet/src/encryption/decrypt.rs Outdated
}

/// The AEAD decryption algorithm to be used.
pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self {
Copy link
Copy Markdown
Contributor

@adamreeve adamreeve Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.

Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).

I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.

Comment thread parquet/tests/encryption/encryption_async.rs
@hsiang-c hsiang-c marked this pull request as draft February 14, 2026 00:58
@tom-s-powell
Copy link
Copy Markdown

Curious if there's a rough timeline on getting this change in?

@hsiang-c hsiang-c marked this pull request as ready for review February 19, 2026 00:52
Copy link
Copy Markdown

@tom-s-powell tom-s-powell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance that this'll get reviewed?

Comment thread parquet/tests/encryption/encryption_util.rs Outdated
@adamreeve
Copy link
Copy Markdown
Contributor

Any chance that this'll get reviewed?

This isn't ready for another review yet. The tests are currently failing because they need apache/parquet-testing#102. I'm happy to review this again once @hsiang-c says it's ready.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Feb 27, 2026

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft February 27, 2026 19:49
@hsiang-c hsiang-c force-pushed the allow_more_enc_algo branch from b0140be to e8999d3 Compare March 6, 2026 00:44
@hsiang-c
Copy link
Copy Markdown
Author

hsiang-c commented Mar 6, 2026

@adamreeve @rok

I refactored unit tests in encryption.rs and encryption_async.rs and they are reading both AES-128 and AES-256 encrypted Parquet files now.

I have a question for your regarding the validate_encrypted_column_names function, please see my comments.

decryption_properties,
file_encryption_properties,
// AES-256
// The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt]
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rok @adamreeve

Please check my comments here. I am not sure if this is an expected behavior. If not, I'll submit a patch after this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hsiang-c. This doesn't look like expected behaviour to me, you shouldn't have to specify the same encryption key for both int64_field and int64_field.list.int64_field. I would expect that the full path should be used everywhere.

I'm also surprised that the path is int64_field.list.int64_field, the path should be int64_field.list.element according to the spec for the list logical type. Is this another bug related to encryption, or does arrow-rs differ from the spec here?

We need to support the use case where different struct fields can be encrypted with different keys, so I don't think your suggested fix to change to using c.name() instead of c.path() is correct, although I'm not sure where you're suggesting to change that.

But we may want to allow simpler encryption configuration using the top level field names as a new feature. C++ Arrow has recently added that for example: apache/arrow#45462

I agree that this isn't related to the current PR so can be fixed separately.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked closer into why the different "int64_field" and "int64_field.list.int64_field" column names are needed in the tests. This is mainly because the test file doesn't use the Parquet list logical type, but just has a top-level schema node named "int64_field" that's a repeated field. So it's schema path is simply "int64_field".

When reading this data as Arrow, the "int64_field" name gets used when creating the Arrow schema, which ends up with an "int64_field" list typed field, and the element/item field name is also "int64_field".

When writing the Arrow data back to Parquet, arrow-rs does use the 3-level Parquet list logical type, but rather than using the standard "element" name, it copies the "int64_field" name from the Arrow schema.

So to read the original example file, you need an "int64_field" column key specified, but to write the file with arrow-rs, and also to read it back after it's re-written, you need a key for "int64_field.list.int64_field".

I don't think there's any problem with the way the encryption code works, but arrow-rs using a non-standard naming scheme for the Parquet schema nodes might be something that should be fixed.

Copy link
Copy Markdown
Contributor

@adamreeve adamreeve Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging a bit further, the use of the standard "element" name can be controlled by enabling the coerce_types option on the writer properties (see set_coerce_types). So this is expected behaviour.

Updating the test file to use the 3-level list schema structure would avoid this issue but probably isn't necessary.

I think we can just update the comments with an explanation of why the config is the way it is.

@xanderbailey
Copy link
Copy Markdown
Contributor

Anything I can do to help here? Very excited to be able to use this!

@adamreeve
Copy link
Copy Markdown
Contributor

Hi @hsiang-c, this seems pretty close to being mergable. Do you have time to finish this? Otherwise I'm happy to help get this over the line.

@hsiang-c
Copy link
Copy Markdown
Author

@adamreeve @xanderbailey I'll wrap it up this week, thanks!

@hsiang-c hsiang-c marked this pull request as ready for review April 23, 2026 00:01
@hsiang-c
Copy link
Copy Markdown
Author

@adamreeve Please take a look when you have a chance, thank you.

Copy link
Copy Markdown
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @hsiang-c. The ciphers.rs changes look good to me, but the tests are now a lot more complex. I think some of this can be improved by factoring out common key configuration constants. But we probably don't need 256 bit key variations for every single encryption test, as the key size won't be relevant to each test scenario. As well as increasing code complexity, these extra scenarios will also increase the time taken to run the tests for everyone.

Can we reduce the set of tests that have 256 bit key variations? I haven't gone through every test, but I think at least these tests can be kept as they were, and probably more:

  • test_multi_threaded_encrypted_writing_deprecated
  • test_multi_threaded_encrypted_writing
  • test_concurrent_encrypted_writing_over_multiple_row_groups
  • test_decrypting_without_decryption_properties_fails

.build()
.unwrap();
non_uniform_encryption_plaintext_footer(
b"0123456789012345", // 128bit/16
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull these repeated footer keys and column key arrays out as constants and put them in the encryption_util module to reduce the duplication?

Comment on lines +372 to +373
// AES-128: there is always a footer key even with a plaintext footer,
// but this is used for signing the footer.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is relevant here, this test uses an encrypted footer.

Comment on lines +623 to +625
// The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt]
// The column path of the repeated field (int64_field) is int64_field.list.int64_field
// Switching from `c.path().string()` to `c.name().to_string()` can fix the asymmetricity
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem relevant here. Because this uses a key retriever, we don't need to provide column names to configure decryption.

#[test]
fn test_uniform_encryption_plaintext_footer_with_key_retriever() {
let test_data = arrow::util::test_util::parquet_test_data();
fn uniform_encryption_plaintext_footer_with_key_retriever(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-existing problem, but this test says it's testing "uniform_encryption", but actually specifies column encryption keys. There's a test_non_uniform_encryption_plaintext_footer_with_key_retriever test above already. It looks like the difference is this one rewrites the file. Maybe this should be called test_roundtrip_non_uniform_encryption_plaintext_footer_with_key_retriever

@adamreeve
Copy link
Copy Markdown
Contributor

It looks like the parquet-testing submodule still needs to be updated too to pick up the new test files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Support other encryption/decryption key size

7 participants