[parquet] Allow more encryption algorithms#9203
[parquet] Allow more encryption algorithms#9203hsiang-c wants to merge 21 commits intoapache:mainfrom
Conversation
xanderbailey
left a comment
There was a problem hiding this comment.
When I exposed quote style here I exported the underlying QuoteStyle is that something we might want to do here also for Algorithm?
|
Thank you for the review, I took a look at your PR. You made Do you mean you'd like to make |
|
Thanks @hsiang-c and @mbutrovich -- looks like there are some CI failures on this PR. Also, it seems like we should have some tests for the new features |
alamb
left a comment
There was a problem hiding this comment.
Thanks @hsiang-c
FYI @adamreeve and @rok as the original authors of the parquet encryption feature
Perhaps you have some time to review this PR?
| } | ||
|
|
||
| impl RingGcmBlockDecryptor { | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
why do we need to allow dead code? Maybe we should remove it instead?
|
|
||
| #[cfg(feature = "encryption")] | ||
| let expected_size_with_decryptor = 3080; | ||
| #[cfg(not(feature = "encryption"))] |
There was a problem hiding this comment.
Why is this needed? The whole test is gated by
#[cfg(feature = "encryption")]
fn test_memory_size_with_decryptor() {it seems like you just need to update the 3072 to 3080 to reflect the additional size?
| #[test] | ||
| fn test_round_trip_with_incorrect_key_length() { | ||
| let key = [0u8; 16]; | ||
| assert!(RingGcmBlockEncryptor::new_with_algorithm(&CHACHA20_POLY1305, &key).is_err()); |
There was a problem hiding this comment.
It isn't clear to me that CHACHA is a valid Parquet encryption: https://github.com/apache/parquet-format/blob/master/Encryption.md :
| } | ||
|
|
||
| /// The AEAD decryption algorithm to be used. | ||
| pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self { |
There was a problem hiding this comment.
this effectively makes Algorithm a part of the public API I think -- this might be ok, but it also means we would not be able to change the underlying crypto library without breaking the API in the future
There was a problem hiding this comment.
Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.
Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).
I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.
| pub(crate) fn new_with_algorithm( | ||
| algorithm: &'static Algorithm, | ||
| key_bytes: &[u8], | ||
| ) -> Result<Self> { |
There was a problem hiding this comment.
I don't think there's any need to add this new constructor, we can reuse the existing new method above and change the algorithm based on the length of the key.
This is also a bit semantically confusing as you could create a new "RingGcmBlockDecryptor" with a non-GCM algorithm.
| pub(crate) fn new_with_algorithm( | ||
| algorithm: &'static Algorithm, | ||
| key_bytes: &[u8], | ||
| ) -> Result<Self> { |
There was a problem hiding this comment.
As above, I think we should reuse the existing new method.
| } | ||
|
|
||
| /// The AEAD decryption algorithm to be used. | ||
| pub fn with_algorithm(mut self, algorithm: &'static Algorithm) -> Self { |
There was a problem hiding this comment.
Yes I don't like this and I don't think it's necessary, see my comment on new_with_algorithm. This also makes it possible for users to easily create Parquet files that aren't compliant with the spec by using a non-standard algorithm.
Changing the crypto library is something we might potentially want to do to be able to add AES_GCM_CTR support for example (#7258 (comment)).
I think if we want to support new algorithms like AES_GCM_CTR in the future the user should provide an enum value from a supported set of algorithms like is done in the C++ implementation, but we can infer the key length from the provided key so this isn't necessary for this change.
|
Curious if there's a rough timeline on getting this change in? |
tom-s-powell
left a comment
There was a problem hiding this comment.
Any chance that this'll get reviewed?
This isn't ready for another review yet. The tests are currently failing because they need apache/parquet-testing#102. I'm happy to review this again once @hsiang-c says it's ready. |
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
b0140be to
e8999d3
Compare
|
I refactored unit tests in I have a question for your regarding the |
| decryption_properties, | ||
| file_encryption_properties, | ||
| // AES-256 | ||
| // The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt] |
There was a problem hiding this comment.
Please check my comments here. I am not sure if this is an expected behavior. If not, I'll submit a patch after this PR.
There was a problem hiding this comment.
Hi @hsiang-c. This doesn't look like expected behaviour to me, you shouldn't have to specify the same encryption key for both int64_field and int64_field.list.int64_field. I would expect that the full path should be used everywhere.
I'm also surprised that the path is int64_field.list.int64_field, the path should be int64_field.list.element according to the spec for the list logical type. Is this another bug related to encryption, or does arrow-rs differ from the spec here?
We need to support the use case where different struct fields can be encrypted with different keys, so I don't think your suggested fix to change to using c.name() instead of c.path() is correct, although I'm not sure where you're suggesting to change that.
But we may want to allow simpler encryption configuration using the top level field names as a new feature. C++ Arrow has recently added that for example: apache/arrow#45462
I agree that this isn't related to the current PR so can be fixed separately.
There was a problem hiding this comment.
I looked closer into why the different "int64_field" and "int64_field.list.int64_field" column names are needed in the tests. This is mainly because the test file doesn't use the Parquet list logical type, but just has a top-level schema node named "int64_field" that's a repeated field. So it's schema path is simply "int64_field".
When reading this data as Arrow, the "int64_field" name gets used when creating the Arrow schema, which ends up with an "int64_field" list typed field, and the element/item field name is also "int64_field".
When writing the Arrow data back to Parquet, arrow-rs does use the 3-level Parquet list logical type, but rather than using the standard "element" name, it copies the "int64_field" name from the Arrow schema.
So to read the original example file, you need an "int64_field" column key specified, but to write the file with arrow-rs, and also to read it back after it's re-written, you need a key for "int64_field.list.int64_field".
I don't think there's any problem with the way the encryption code works, but arrow-rs using a non-standard naming scheme for the Parquet schema nodes might be something that should be fixed.
There was a problem hiding this comment.
Digging a bit further, the use of the standard "element" name can be controlled by enabling the coerce_types option on the writer properties (see set_coerce_types). So this is expected behaviour.
Updating the test file to use the 3-level list schema structure would avoid this issue but probably isn't necessary.
I think we can just update the comments with an explanation of why the config is the way it is.
|
Anything I can do to help here? Very excited to be able to use this! |
|
Hi @hsiang-c, this seems pretty close to being mergable. Do you have time to finish this? Otherwise I'm happy to help get this over the line. |
|
@adamreeve @xanderbailey I'll wrap it up this week, thanks! |
|
@adamreeve Please take a look when you have a chance, thank you. |
adamreeve
left a comment
There was a problem hiding this comment.
Thanks for the updates @hsiang-c. The ciphers.rs changes look good to me, but the tests are now a lot more complex. I think some of this can be improved by factoring out common key configuration constants. But we probably don't need 256 bit key variations for every single encryption test, as the key size won't be relevant to each test scenario. As well as increasing code complexity, these extra scenarios will also increase the time taken to run the tests for everyone.
Can we reduce the set of tests that have 256 bit key variations? I haven't gone through every test, but I think at least these tests can be kept as they were, and probably more:
test_multi_threaded_encrypted_writing_deprecatedtest_multi_threaded_encrypted_writingtest_concurrent_encrypted_writing_over_multiple_row_groupstest_decrypting_without_decryption_properties_fails
| .build() | ||
| .unwrap(); | ||
| non_uniform_encryption_plaintext_footer( | ||
| b"0123456789012345", // 128bit/16 |
There was a problem hiding this comment.
Can we pull these repeated footer keys and column key arrays out as constants and put them in the encryption_util module to reduce the duplication?
| // AES-128: there is always a footer key even with a plaintext footer, | ||
| // but this is used for signing the footer. |
There was a problem hiding this comment.
I don't think this comment is relevant here, this test uses an encrypted footer.
| // The asymmetric column names is because we check column paths in the validate_encrypted_column_names function of [encryption::encrypt] | ||
| // The column path of the repeated field (int64_field) is int64_field.list.int64_field | ||
| // Switching from `c.path().string()` to `c.name().to_string()` can fix the asymmetricity |
There was a problem hiding this comment.
This comment doesn't seem relevant here. Because this uses a key retriever, we don't need to provide column names to configure decryption.
| #[test] | ||
| fn test_uniform_encryption_plaintext_footer_with_key_retriever() { | ||
| let test_data = arrow::util::test_util::parquet_test_data(); | ||
| fn uniform_encryption_plaintext_footer_with_key_retriever( |
There was a problem hiding this comment.
This is a pre-existing problem, but this test says it's testing "uniform_encryption", but actually specifies column encryption keys. There's a test_non_uniform_encryption_plaintext_footer_with_key_retriever test above already. It looks like the difference is this one rewrites the file. Maybe this should be called test_roundtrip_non_uniform_encryption_plaintext_footer_with_key_retriever
|
It looks like the parquet-testing submodule still needs to be updated too to pick up the new test files. |
Which issue does this PR close?
Rationale for this change
arrow-rsfor Parquet I/O, I'd like to start supporting AES 256 with this PR.What changes are included in this PR?
RingGcmBlockEncryptorandRingGcmBlockDecryptorwill pick AES-128 or AES-256 based on key sizeencryption_async.rsandencryption.rsto test both AES-128 and AES-256 encrypted parquet filesAre these changes tested?
Yes, unit test and on AES-256 encrypted Parquet files.
Are there any user-facing changes?
No