Skip to content

Implement direct serialization of .bpte:s#20217

Open
Erik-Lundell wants to merge 1 commit into
pytorch:mainfrom
Erik-Lundell:bundled-proper
Open

Implement direct serialization of .bpte:s#20217
Erik-Lundell wants to merge 1 commit into
pytorch:mainfrom
Erik-Lundell:bundled-proper

Conversation

@Erik-Lundell

@Erik-Lundell Erik-Lundell commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

This patch is heavily inspired by
#17333

Similarly to the old serialization of pte:s,
bundled program serialization went through a
program -> string json -> flatbuffer path, which
is heavily inefficient. As in the PR above,
generate python flatbuffer bindings from the fbs schema and use them to cut out the json step.

Checked in flatbuffer bindings and fbs schema coherence is enforced by the validate_flatbuffer_gen job.

e6586533af

Testing:
Tested by devtools/bundled_program/test/test_end2end.py, Additionally, did the following benchmark locally:

/usr/bin/time -v -- aot_arm_compiler.py --delegate --bundleio -t vgf [args]

Time metric: Elapsed (wall clock) time
Memory metric: Maximum resident set size

Model       Baseline (s / MiB)    With-patch (s / MiB)    Speed-up   Relative memory   same.bpte  bpte size (MiB)
---------   -------------------   ---------------------   --------   ---------------   --------   ---------
mv2         31.36 / 1349.7        22.37 / 845.5           1.40x      0.63x (-37.4%)    yes        13.91
resnet50    128.57 / 10380.9      28.24 / 2003.4          4.55x      0.19x (-80.7%)    yes        97.99
w2l         730.23 / 12102.9      15.85 / 2376.2          46.07x     0.20x (-80.4%)    yes        131.45
deit_tiny   87.57 / 2669.5        70.23 / 1037.6          1.25x      0.39x (-61.1%)    yes        22.88

cc @Gasoonjia @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

@pytorch-bot

pytorch-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20217

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 9bf8f07 with merge base 6911e9f (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 11, 2026
This patch is heavily inspired by
pytorch#17333

Similarly to the old serialization of pte:s,
bundled program serialization went through a
program -> string json -> flatbuffer path, which
is heavily inefficient. As in the PR above,
generate python flatbuffer bindings from the fbs schema
and use them to cut out the json step.

Checked in flatbuffer bindings and fbs schema coherence
is enforced by the validate_flatbuffer_gen job.

e6586533af

Testing:
Tested by devtools/bundled_program/test/test_end2end.py,
Additionally, did the following benchmark locally:

/usr/bin/time -v -- aot_arm_compiler.py --delegate --bundleio -t vgf [args]

Time metric: Elapsed (wall clock) time
Memory metric: Maximum resident set size

Model       Baseline (s / MiB)    With-patch (s / MiB)    Speed-up   Relative memory   same.bpte  bpte size (MiB)
---------   -------------------   ---------------------   --------   ---------------   --------   ---------
mv2         31.36 / 1349.7        22.37 / 845.5           1.40x      0.63x (-37.4%)    yes        13.91
resnet50    128.57 / 10380.9      28.24 / 2003.4          4.55x      0.19x (-80.7%)    yes        97.99
w2l         730.23 / 12102.9      15.85 / 2376.2          46.07x     0.20x (-80.4%)    yes        131.45
deit_tiny   87.57 / 2669.5        70.23 / 1037.6          1.25x      0.39x (-61.1%)    yes        22.88

Notice that the peak memory consumption is still a factor 20-50 x the final artifact size.

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Change-Id: I98d24ad8e3d2537567e8f4b873d5aacae619eaf5
@github-actions github-actions Bot added ciflow/trunk module: arm Issues related to arm backend labels Jun 11, 2026
@Erik-Lundell Erik-Lundell added ciflow/trunk help wanted Extra attention is needed partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm module: devtools Issues related to developer tools and code under devtools/ release notes: none Do not include this in the release notes and removed ciflow/trunk labels Jun 11, 2026
@Erik-Lundell

Copy link
Copy Markdown
Collaborator Author

@claude review this PR.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude finished @Erik-Lundell's task in 3m 59s —— View job


PR Review: Direct serialization of .bptes

  • Read the PR diff
  • Review serialization logic (__init__.py)
  • Review the flatbuffer generation tooling
  • Review CI / schema-coherence enforcement
  • Post review feedback

Nice work — this mirrors the approach from #17333 cleanly, reuses the existing _flatbuffer_program helpers instead of duplicating them, and the benchmarks (esp. the 46x/4.5x wins on w2l/resnet50 and the byte-identical .bpte output) make a strong case. Overall the design is sound. A few observations below.

Correctness

  • Custom packers exist for alignment, and they look correct. _pack_tensor / _pack_bundled_program override the generated Pack specifically to honor force_align: 16 (tensor data) and force_align: 32 (program) via _create_aligned_byte_vector, which the stock generated Pack does byte-by-byte without alignment. Vector construction is fully closed before each TensorStart/BundledProgramStart, so ordering is valid. Good.

  • dim_order round-trips correctly but via two different representations. core.py:181 builds dim_order=list(spec.dim_order) (list of ints), while the schema dataclass (bundled_program_schema.py:25) annotates it List[bytes]. _convert_tensor (__init__.py:179) coerces it with _coerce_bytes, so [0,1,2] → b"\x00\x01\x02", matching the old json→flatc path. It works, but the dataclass annotation is misleading vs. the actual runtime type. Not introduced by this PR — just worth a note.

Asymmetry: serialize is direct, deserialize is not

deserialize_from_flatbuffer_to_bundled_program (__init__.py:286) still routes through convert_from_flatbuffer_flatc_decompile → json → dataclass. Only the serialize path was made direct. That's a reasonable scope choice (serialize is the hot path in export), but the PR title/description say "direct serialization of .bpte:s" without calling out that deserialize remains flatc-dependent. Worth a one-line note in the description or a TODO in the code so the asymmetry is intentional and discoverable.

Possible dead code

serialize_from_bundled_program_to_json (__init__.py:76) and convert_to_flatbuffer (__init__.py:89) appear to have no remaining in-tree callers (convert_from_flatbuffer is still used by the deserialize path; the other two are not). If they're kept only for backward-compat of a public API, fine — otherwise consider removing serialize_from_bundled_program_to_json/convert_to_flatbuffer to avoid two divergent serialize paths.

CI / generation robustness

  • flatc version pinning. validate_flatbuffer_gen.yml regenerates the bindings and fails if git diff is non-empty. This is a good guard, but it's only as stable as the flatc version. If the flatc shipped by install_executorch.sh -e ever differs from the version used to produce the checked-in files, this job will fail on unrelated PRs. Recommend confirming the flatc version is pinned (the exir generator has the same exposure, so at minimum it should match that one).

  • The two reported CI failures (test-lora-multimethod-linux, test-qnn-model fp32/dl3) look like infra/unrelated failures (generic docker exec ... exit code 1), not obviously caused by this change. Worth a re-run to confirm, since this PR doesn't touch those paths.

Nits

  • devtools/bundled_program/schema/README.md — missing trailing newline (\ No newline at end of file).
  • _bundled_program_file_identifier re-parses file_identifier from the .fbs via regex, while the generated Tensor.py:26 hardcodes b"BP08". The dynamic read is fine (and lru_cached), just slightly redundant with the generated constant.

Summary

Clean, well-scoped change with a real performance/memory payoff and good reuse of existing infra. No blocking correctness issues spotted. The main things I'd ask before merge: confirm flatc is version-pinned for the validate job, and clarify (description or code comment) that deserialize intentionally still uses flatc. Removing the now-unused json serialize helpers would be a nice cleanup.
· branch bundled-proper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. help wanted Extra attention is needed module: arm Issues related to arm backend module: devtools Issues related to developer tools and code under devtools/ partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm release notes: none Do not include this in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant