Implement direct serialization of .bpte:s by Erik-Lundell · Pull Request #20217 · pytorch/executorch

Erik-Lundell · 2026-06-11T08:45:22Z

This patch is heavily inspired by
#17333

Similarly to the old serialization of pte:s,
bundled program serialization went through a
program -> string json -> flatbuffer path, which
is heavily inefficient. As in the PR above,
generate python flatbuffer bindings from the fbs schema and use them to cut out the json step.

Checked in flatbuffer bindings and fbs schema coherence is enforced by the validate_flatbuffer_gen job.

e6586533af

Testing:
Tested by devtools/bundled_program/test/test_end2end.py, Additionally, did the following benchmark locally:

/usr/bin/time -v -- aot_arm_compiler.py --delegate --bundleio -t vgf [args]

Time metric: Elapsed (wall clock) time
Memory metric: Maximum resident set size

Model       Baseline (s / MiB)    With-patch (s / MiB)    Speed-up   Relative memory   same.bpte  bpte size (MiB)
---------   -------------------   ---------------------   --------   ---------------   --------   ---------
mv2         31.36 / 1349.7        22.37 / 845.5           1.40x      0.63x (-37.4%)    yes        13.91
resnet50    128.57 / 10380.9      28.24 / 2003.4          4.55x      0.19x (-80.7%)    yes        97.99
w2l         730.23 / 12102.9      15.85 / 2376.2          46.07x     0.20x (-80.4%)    yes        131.45
deit_tiny   87.57 / 2669.5        70.23 / 1037.6          1.25x      0.39x (-61.1%)    yes        22.88

cc @Gasoonjia @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

pytorch-bot · 2026-06-11T08:45:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20217

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 9bf8f07 with merge base 6911e9f ():

NEW FAILURES - The following jobs have failed:

pull / test-lora-multimethod-linux / linux-job (gh)
RuntimeError: Command docker exec -t 6a334651fb1abfeeb6574fd78104cb85fff6ac2cc3fe8fb62d7b4eb9613648c7 /exec failed with exit code 1
trunk / test-qnn-model (fp32, dl3) / linux-job (gh)
RuntimeError: Command docker exec -t 43f52defd560a9c31672372a8f7d6a56b2039e75a10615e1d64f071c0c4078e9 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This patch is heavily inspired by pytorch#17333 Similarly to the old serialization of pte:s, bundled program serialization went through a program -> string json -> flatbuffer path, which is heavily inefficient. As in the PR above, generate python flatbuffer bindings from the fbs schema and use them to cut out the json step. Checked in flatbuffer bindings and fbs schema coherence is enforced by the validate_flatbuffer_gen job. e6586533af Testing: Tested by devtools/bundled_program/test/test_end2end.py, Additionally, did the following benchmark locally: /usr/bin/time -v -- aot_arm_compiler.py --delegate --bundleio -t vgf [args] Time metric: Elapsed (wall clock) time Memory metric: Maximum resident set size Model Baseline (s / MiB) With-patch (s / MiB) Speed-up Relative memory same.bpte bpte size (MiB) --------- ------------------- --------------------- -------- --------------- -------- --------- mv2 31.36 / 1349.7 22.37 / 845.5 1.40x 0.63x (-37.4%) yes 13.91 resnet50 128.57 / 10380.9 28.24 / 2003.4 4.55x 0.19x (-80.7%) yes 97.99 w2l 730.23 / 12102.9 15.85 / 2376.2 46.07x 0.20x (-80.4%) yes 131.45 deit_tiny 87.57 / 2669.5 70.23 / 1037.6 1.25x 0.39x (-61.1%) yes 22.88 Notice that the peak memory consumption is still a factor 20-50 x the final artifact size. Signed-off-by: Erik Lundell <erik.lundell@arm.com> Change-Id: I98d24ad8e3d2537567e8f4b873d5aacae619eaf5

Erik-Lundell · 2026-06-11T13:52:45Z

@claude review this PR.

claude · 2026-06-11T13:53:52Z

Claude finished @Erik-Lundell's task in 3m 59s —— View job

PR Review: Direct serialization of `.bpte`s

Read the PR diff
Review serialization logic (__init__.py)
Review the flatbuffer generation tooling
Review CI / schema-coherence enforcement
Post review feedback

Nice work — this mirrors the approach from #17333 cleanly, reuses the existing _flatbuffer_program helpers instead of duplicating them, and the benchmarks (esp. the 46x/4.5x wins on w2l/resnet50 and the byte-identical .bpte output) make a strong case. Overall the design is sound. A few observations below.

Correctness

Custom packers exist for alignment, and they look correct. _pack_tensor / _pack_bundled_program override the generated Pack specifically to honor force_align: 16 (tensor data) and force_align: 32 (program) via _create_aligned_byte_vector, which the stock generated Pack does byte-by-byte without alignment. Vector construction is fully closed before each TensorStart/BundledProgramStart, so ordering is valid. Good.
dim_order round-trips correctly but via two different representations. core.py:181 builds dim_order=list(spec.dim_order) (list of ints), while the schema dataclass (bundled_program_schema.py:25) annotates it List[bytes]. _convert_tensor (__init__.py:179) coerces it with _coerce_bytes, so [0,1,2] → b"\x00\x01\x02", matching the old json→flatc path. It works, but the dataclass annotation is misleading vs. the actual runtime type. Not introduced by this PR — just worth a note.

Asymmetry: serialize is direct, deserialize is not

deserialize_from_flatbuffer_to_bundled_program (__init__.py:286) still routes through convert_from_flatbuffer → _flatc_decompile → json → dataclass. Only the serialize path was made direct. That's a reasonable scope choice (serialize is the hot path in export), but the PR title/description say "direct serialization of .bpte:s" without calling out that deserialize remains flatc-dependent. Worth a one-line note in the description or a TODO in the code so the asymmetry is intentional and discoverable.

Possible dead code

serialize_from_bundled_program_to_json (__init__.py:76) and convert_to_flatbuffer (__init__.py:89) appear to have no remaining in-tree callers (convert_from_flatbuffer is still used by the deserialize path; the other two are not). If they're kept only for backward-compat of a public API, fine — otherwise consider removing serialize_from_bundled_program_to_json/convert_to_flatbuffer to avoid two divergent serialize paths.

CI / generation robustness

flatc version pinning. validate_flatbuffer_gen.yml regenerates the bindings and fails if git diff is non-empty. This is a good guard, but it's only as stable as the flatc version. If the flatc shipped by install_executorch.sh -e ever differs from the version used to produce the checked-in files, this job will fail on unrelated PRs. Recommend confirming the flatc version is pinned (the exir generator has the same exposure, so at minimum it should match that one).
The two reported CI failures (test-lora-multimethod-linux, test-qnn-model fp32/dl3) look like infra/unrelated failures (generic docker exec ... exit code 1), not obviously caused by this change. Worth a re-run to confirm, since this PR doesn't touch those paths.

Nits

devtools/bundled_program/schema/README.md — missing trailing newline (\ No newline at end of file).
_bundled_program_file_identifier re-parses file_identifier from the .fbs via regex, while the generated Tensor.py:26 hardcodes b"BP08". The dynamic read is fine (and lru_cached), just slightly redundant with the generated constant.

Summary

Clean, well-scoped change with a real performance/memory payoff and good reuse of existing infra. No blocking correctness issues spotted. The main things I'd ask before merge: confirm flatc is version-pinned for the validate job, and clarify (description or code comment) that deserialize intentionally still uses flatc. Removing the now-unused json serialize helpers would be a nice cleanup.
· branch bundled-proper

Erik-Lundell requested review from Gasoonjia and digantdesai as code owners June 11, 2026 08:45

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 11, 2026

github-actions Bot added ciflow/trunk module: arm Issues related to arm backend labels Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement direct serialization of .bpte:s#20217

Implement direct serialization of .bpte:s#20217
Erik-Lundell wants to merge 1 commit into
pytorch:mainfrom
Erik-Lundell:bundled-proper

Erik-Lundell commented Jun 11, 2026 •

edited by pytorch-bot Bot

Loading

Uh oh!

pytorch-bot Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Erik-Lundell commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Erik-Lundell commented Jun 11, 2026 • edited by pytorch-bot Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20217

❌ 2 New Failures

Uh oh!

Erik-Lundell commented Jun 11, 2026

Uh oh!

claude Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Direct serialization of .bptes

Correctness

Asymmetry: serialize is direct, deserialize is not

Possible dead code

CI / generation robustness

Nits

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Erik-Lundell commented Jun 11, 2026 •

edited by pytorch-bot Bot

Loading

pytorch-bot Bot commented Jun 11, 2026 •

edited

Loading

claude Bot commented Jun 11, 2026 •

edited

Loading

PR Review: Direct serialization of `.bpte`s