Skip to content

Add tests and fixes for schema resolution bug#9237

Merged
alamb merged 4 commits intoapache:mainfrom
jecsand838:arrow-avro-bug
Jan 26, 2026
Merged

Add tests and fixes for schema resolution bug#9237
alamb merged 4 commits intoapache:mainfrom
jecsand838:arrow-avro-bug

Conversation

@jecsand838
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Avro schema resolution allows a reader schema to represent “nullable” values using a two-branch union (["null", T] or [T, "null"]) while still reading data written with the non-union schema T (i.e. without union discriminants in the encoded data).

In arrow-avro, resolving a non-union writer type against a reader union (notably for array/list item schemas like items: ["null", "int"]) could incorrectly treat the encoded stream as a union and attempt to decode a union discriminant. This would misalign decoding and could surface as ParseError("bad varint") for certain files (see #9231).

What changes are included in this PR?

  • Fix schema resolution when the writer schema is non-union and the reader schema is a union:

    • Special-case two-branch unions containing null to be treated as “nullable” (capturing whether null is first or second), and resolve against the non-null branch.
    • Improve matching for general reader unions by attempting to resolve against each union variant, preferring a direct match, and constructing the appropriate union resolution mapping for the selected branch.
    • Ensure promotions are represented at the union-resolution level (avoiding nested promotion resolution on the selected union child).
  • Add regression coverage for the bug and the fixed behavior:

    • test_resolve_array_writer_nonunion_items_reader_nullable_items (schema resolution / codec)
    • test_array_decoding_writer_nonunion_items_reader_nullable_items (record decoding; ensures correct byte consumption and decoded values)
    • test_bad_varint_bug_nullable_array_items (end-to-end reader regression using a small Avro fixture)
  • Add a small compressed Avro fixture under arrow-avro/test/data/bad-varint-bug.avro.gz used by the regression test.

Are these changes tested?

Yes. This PR adds targeted unit/integration tests that reproduce the prior failure mode and validate correct schema resolution and decoding for nullable-union array items.

Are there any user-facing changes?

Yes (bug fix): reading Avro files with arrays whose element type is represented as a nullable union in the reader schema (e.g. items: ["null", "int"]) now succeeds instead of failing with ParseError("bad varint"). No public API changes are intended.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 21, 2026
Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! The changes look good, I have few comments on code reuse.

Comment on lines +1532 to +1535
let null_position = reader_variants
.iter()
.position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
if let (2, Some(null_idx)) = (reader_variants.len(), null_position) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to reuse the nullable_union_variants helper here?

let manifest_dir = env!("CARGO_MANIFEST_DIR");
let gz_path = format!("{manifest_dir}/test/data/bad-varint-bug.avro.gz");
let gz_file = File::open(&gz_path).expect("test file should exist");
let mut decoder = GzDecoder::new(gz_file);
Copy link
Contributor

@mzabaluev mzabaluev Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had the file unpacked in my test suite branch, it's quite small. I only gzipped it to be able to post to GitHub. But since there's already a dependency on flate2, might as well use it.

let non_null_idx = 1 - null_idx;
let non_null_branch = &reader_variants[non_null_idx];
let mut dt =
self.make_data_type(writer_non_union, Some(non_null_branch), namespace)?;
Copy link
Contributor

@mzabaluev mzabaluev Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call resolve_type directly here because that's what it always amounts to.

Co-authored-by: Mikhail Zabaluev <mikhail.zabaluev@flarion.io>
@jecsand838
Copy link
Contributor Author

@mzabaluev Thank you for the review! I just pushed up a commit that addresses your comments. Let me know what you think when you get a chance.

Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jecsand838
Copy link
Contributor Author

@alamb @mbrobbel

Would it be possible to get this bug fix in for the 58.0.0 release as well?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jecsand838 and @mzabaluev -- I gave this a quick review and it looks fine to me (though I didn't review the contents in detail)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unzipped this file and it looks fine to me

00000000  4f 62 6a 01 06 16 61 76  72 6f 2e 73 63 68 65 6d  |Obj...avro.schem|
00000010  61 84 02 7b 22 74 79 70  65 22 3a 22 72 65 63 6f  |a..{"type":"reco|
00000020  72 64 22 2c 22 6e 61 6d  65 22 3a 22 74 61 62 6c  |rd","name":"tabl|
00000030  65 22 2c 22 66 69 65 6c  64 73 22 3a 5b 7b 22 6e  |e","fields":[{"n|
00000040  61 6d 65 22 3a 22 69 6e  74 5f 61 72 72 61 79 22  |ame":"int_array"|
00000050  2c 22 74 79 70 65 22 3a  7b 22 74 79 70 65 22 3a  |,"type":{"type":|
00000060  22 61 72 72 61 79 22 2c  22 69 74 65 6d 73 22 3a  |"array","items":|
00000070  22 69 6e 74 22 2c 22 65  6c 65 6d 65 6e 74 2d 69  |"int","element-i|
00000080  64 22 3a 32 7d 2c 22 66  69 65 6c 64 2d 69 64 22  |d":2},"field-id"|
00000090  3a 31 7d 5d 7d 14 61 76  72 6f 2e 63 6f 64 65 63  |:1}]}.avro.codec|
000000a0  0e 64 65 66 6c 61 74 65  1c 69 63 65 62 65 72 67  |.deflate.iceberg|
000000b0  2e 73 63 68 65 6d 61 c8  02 7b 22 74 79 70 65 22  |.schema..{"type"|
000000c0  3a 22 73 74 72 75 63 74  22 2c 22 73 63 68 65 6d  |:"struct","schem|
000000d0  61 2d 69 64 22 3a 30 2c  22 66 69 65 6c 64 73 22  |a-id":0,"fields"|
000000e0  3a 5b 7b 22 69 64 22 3a  31 2c 22 6e 61 6d 65 22  |:[{"id":1,"name"|
000000f0  3a 22 69 6e 74 5f 61 72  72 61 79 22 2c 22 72 65  |:"int_array","re|
00000100  71 75 69 72 65 64 22 3a  74 72 75 65 2c 22 74 79  |quired":true,"ty|
00000110  70 65 22 3a 7b 22 74 79  70 65 22 3a 22 6c 69 73  |pe":{"type":"lis|
00000120  74 22 2c 22 65 6c 65 6d  65 6e 74 2d 69 64 22 3a  |t","element-id":|
00000130  32 2c 22 65 6c 65 6d 65  6e 74 22 3a 22 69 6e 74  |2,"element":"int|
00000140  22 2c 22 65 6c 65 6d 65  6e 74 2d 72 65 71 75 69  |","element-requi|
00000150  72 65 64 22 3a 74 72 75  65 7d 7d 5d 7d 00 24 70  |red":true}}]}.$p|
00000160  9c 93 05 f1 5e 13 2f c5  a0 60 cd 7a a0 35 02 0c  |....^./..`.z.5..|
00000170  63 61 62 61 00 00 24 70  9c 93 05 f1 5e 13 2f c5  |caba..$p....^./.|
00000180  a0 60 cd 7a a0 35                                 |.`.z.5|

@alamb alamb merged commit c10fe77 into apache:main Jan 26, 2026
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 26, 2026

Thanks @jecsand838

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ParseError("bad varint") on reading array field with nullable elements in the reader schema

4 participants