Add tests and fixes for schema resolution bug #9237
Conversation
mzabaluev
left a comment
It works! The changes look good; I have a few comments on code reuse.
arrow-avro/src/codec.rs
Outdated
let null_position = reader_variants
    .iter()
    .position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
if let (2, Some(null_idx)) = (reader_variants.len(), null_position) {
Would it be good to reuse the nullable_union_variants helper here?
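For context, a minimal self-contained sketch of what a `nullable_union_variants`-style helper could look like. The types here are simplified stand-ins for illustration, not arrow-avro's actual definitions:

```rust
// Simplified stand-ins for arrow-avro's schema types (hypothetical).
#[derive(Debug, PartialEq)]
enum Schema {
    Null,
    Int,
    String,
}

/// Which position the `null` branch occupies in a two-branch union.
#[derive(Debug, PartialEq, Clone, Copy)]
enum NullOrder {
    NullFirst,
    NullSecond,
}

/// If `variants` is exactly a two-branch union containing `null`, return the
/// null ordering together with the non-null branch; otherwise `None`.
fn nullable_union_variants(variants: &[Schema]) -> Option<(NullOrder, &Schema)> {
    if variants.len() != 2 {
        return None;
    }
    let null_idx = variants.iter().position(|v| *v == Schema::Null)?;
    let order = if null_idx == 0 {
        NullOrder::NullFirst
    } else {
        NullOrder::NullSecond
    };
    Some((order, &variants[1 - null_idx]))
}

fn main() {
    // ["null", "int"] is nullable with null first; the non-null branch is Int.
    let reader = [Schema::Null, Schema::Int];
    assert_eq!(
        nullable_union_variants(&reader),
        Some((NullOrder::NullFirst, &Schema::Int))
    );
    // A three-branch union is not "nullable" in this sense.
    let tri = [Schema::Null, Schema::Int, Schema::String];
    assert!(nullable_union_variants(&tri).is_none());
}
```

Reusing one helper keeps the "two branches, one of which is null" rule in a single place, so the null-ordering convention stays consistent across call sites.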
let manifest_dir = env!("CARGO_MANIFEST_DIR");
let gz_path = format!("{manifest_dir}/test/data/bad-varint-bug.avro.gz");
let gz_file = File::open(&gz_path).expect("test file should exist");
let mut decoder = GzDecoder::new(gz_file);
I actually had the file unpacked in my test suite branch; it's quite small. I only gzipped it to be able to post it to GitHub. But since there's already a dependency on flate2, we might as well use it.
arrow-avro/src/codec.rs
Outdated
let non_null_idx = 1 - null_idx;
let non_null_branch = &reader_variants[non_null_idx];
let mut dt =
    self.make_data_type(writer_non_union, Some(non_null_branch), namespace)?;
I'd call resolve_type directly here because that's what it always amounts to.
Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@flarion.io>
@mzabaluev Thank you for the review! I just pushed up a commit that addresses your comments. Let me know what you think when you get a chance.
Force-pushed 6f13808 to b2f475d
Force-pushed b2f475d to 919bdf4
alamb
left a comment
Thanks @jecsand838 and @mzabaluev -- I gave this a quick review and it looks fine to me (though I didn't review the contents in detail)
I unzipped this file and it looks fine to me
00000000 4f 62 6a 01 06 16 61 76 72 6f 2e 73 63 68 65 6d |Obj...avro.schem|
00000010 61 84 02 7b 22 74 79 70 65 22 3a 22 72 65 63 6f |a..{"type":"reco|
00000020 72 64 22 2c 22 6e 61 6d 65 22 3a 22 74 61 62 6c |rd","name":"tabl|
00000030 65 22 2c 22 66 69 65 6c 64 73 22 3a 5b 7b 22 6e |e","fields":[{"n|
00000040 61 6d 65 22 3a 22 69 6e 74 5f 61 72 72 61 79 22 |ame":"int_array"|
00000050 2c 22 74 79 70 65 22 3a 7b 22 74 79 70 65 22 3a |,"type":{"type":|
00000060 22 61 72 72 61 79 22 2c 22 69 74 65 6d 73 22 3a |"array","items":|
00000070 22 69 6e 74 22 2c 22 65 6c 65 6d 65 6e 74 2d 69 |"int","element-i|
00000080 64 22 3a 32 7d 2c 22 66 69 65 6c 64 2d 69 64 22 |d":2},"field-id"|
00000090 3a 31 7d 5d 7d 14 61 76 72 6f 2e 63 6f 64 65 63 |:1}]}.avro.codec|
000000a0 0e 64 65 66 6c 61 74 65 1c 69 63 65 62 65 72 67 |.deflate.iceberg|
000000b0 2e 73 63 68 65 6d 61 c8 02 7b 22 74 79 70 65 22 |.schema..{"type"|
000000c0 3a 22 73 74 72 75 63 74 22 2c 22 73 63 68 65 6d |:"struct","schem|
000000d0 61 2d 69 64 22 3a 30 2c 22 66 69 65 6c 64 73 22 |a-id":0,"fields"|
000000e0 3a 5b 7b 22 69 64 22 3a 31 2c 22 6e 61 6d 65 22 |:[{"id":1,"name"|
000000f0 3a 22 69 6e 74 5f 61 72 72 61 79 22 2c 22 72 65 |:"int_array","re|
00000100 71 75 69 72 65 64 22 3a 74 72 75 65 2c 22 74 79 |quired":true,"ty|
00000110 70 65 22 3a 7b 22 74 79 70 65 22 3a 22 6c 69 73 |pe":{"type":"lis|
00000120 74 22 2c 22 65 6c 65 6d 65 6e 74 2d 69 64 22 3a |t","element-id":|
00000130 32 2c 22 65 6c 65 6d 65 6e 74 22 3a 22 69 6e 74 |2,"element":"int|
00000140 22 2c 22 65 6c 65 6d 65 6e 74 2d 72 65 71 75 69 |","element-requi|
00000150 72 65 64 22 3a 74 72 75 65 7d 7d 5d 7d 00 24 70 |red":true}}]}.$p|
00000160 9c 93 05 f1 5e 13 2f c5 a0 60 cd 7a a0 35 02 0c |....^./..`.z.5..|
00000170 63 61 62 61 00 00 24 70 9c 93 05 f1 5e 13 2f c5 |caba..$p....^./.|
00000180 a0 60 cd 7a a0 35 |.`.z.5|
Thanks @jecsand838
Which issue does this PR close?
Closes `ParseError("bad varint")` on reading array field with nullable elements in the reader schema #9231.
Rationale for this change
Avro schema resolution allows a reader schema to represent "nullable" values using a two-branch union (`["null", T]` or `[T, "null"]`) while still reading data written with the non-union schema `T` (i.e. without union discriminants in the encoded data).
In arrow-avro, resolving a non-union writer type against a reader union (notably for array/list item schemas like `items: ["null", "int"]`) could incorrectly treat the encoded stream as a union and attempt to decode a union discriminant. This would misalign decoding and could surface as `ParseError("bad varint")` for certain files (see #9231).
What changes are included in this PR?
Fix schema resolution when the writer schema is non-union and the reader schema is a union:
- Allow a two-branch reader union containing `null` to be treated as "nullable" (capturing whether `null` is first or second), and resolve against the non-null branch.
Add regression coverage for the bug and the fixed behavior:
- `test_resolve_array_writer_nonunion_items_reader_nullable_items` (schema resolution / codec)
- `test_array_decoding_writer_nonunion_items_reader_nullable_items` (record decoding; ensures correct byte consumption and decoded values)
- `test_bad_varint_bug_nullable_array_items` (end-to-end reader regression using a small Avro fixture)
Add a small compressed Avro fixture under `arrow-avro/test/data/bad-varint-bug.avro.gz` used by the regression test.
Are these changes tested?
Yes. This PR adds targeted unit/integration tests that reproduce the prior failure mode and validate correct schema resolution and decoding for nullable-union array items.
Are there any user-facing changes?
Yes (bug fix): reading Avro files with arrays whose element type is represented as a nullable union in the reader schema (e.g. `items: ["null", "int"]`) now succeeds instead of failing with `ParseError("bad varint")`. No public API changes are intended.