Continue support for 4GB+ packs/clones/objects #6289
Open
dscho wants to merge 52 commits into
Same theme as the preceding pack-objects series: get_size_by_pos() returns an unsigned long but reads its size out of packed_object_info() / odb_read_object_info_extended() via a size_t out-parameter, so on Windows it would silently truncate the very sizes filter_bitmap_blob_limit() then compares against the --filter=blob:limit threshold to decide which blobs to elide from the bitmap-backed traversal. Drop the cast_size_t_to_ulong() and return size_t directly. The two callers' limit comparison promotes to size_t cleanly. limit itself stays unsigned long; it is part of a filter API ripple of its own. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
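The truncation described above can be reproduced on any platform by modelling the 32-bit `unsigned long` of 64-bit Windows with `uint32_t`. The following is a minimal sketch under that assumption; `exceeds_limit_narrow()` and `exceeds_limit_wide()` are hypothetical names for illustration, not functions from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins: uint32_t models a 32-bit `unsigned long`
 * (LLP64, e.g. 64-bit Windows); uint64_t models a 64-bit size_t. */
typedef uint32_t narrow_ulong;
typedef uint64_t wide_size;

/* Old shape: the size is squeezed through the narrow type first. */
static int exceeds_limit_narrow(wide_size object_size, narrow_ulong limit)
{
	narrow_ulong truncated = (narrow_ulong)object_size; /* drops high bits */
	return truncated > limit;
}

/* New shape: the size stays wide; the narrow limit promotes. */
static int exceeds_limit_wide(wide_size object_size, narrow_ulong limit)
{
	return object_size > limit;
}
```

With a 1 MiB limit, a blob of 4 GiB plus 10 bytes is over the limit in the wide comparison but looks like a 10-byte blob in the narrow one, so it would not be elided.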
Smallest piece of the tree topic. link_len is only used as strbuf_splice()'s size_t length and as an array index; widening it outright removes the cast_size_t_to_ulong() shim and the bridge local that fed it. odb_read_object() now writes straight into link_len. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prep for dropping the cast_size_t_to_ulong() shim in add_preferred_base() (pack-objects.c), and aligns the public API with the size_t shape the rest of the tree topic is moving toward. struct tree_desc.size stays unsigned int -- the on-disk tree format hard-caps each tree at 4 GiB, so the field is intentionally narrow and the assignment in init_tree_desc_internal() already truncated unsigned long inputs the same way it now truncates size_t inputs. The widening is purely about the call-side type-correctness; the internal cap is unchanged. All 30+ callers pass values that promote cleanly (unsigned long, size_t, or smaller integer types). Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
With init_tree_desc() widened in the prior commit, the size_t-returning odb_read_object_peeled() call in add_preferred_base() and odb_read_object() call in pbase_tree_get() can both flow straight through to init_tree_desc() and into the pbase_tree_cache. Widen pbase_tree_cache.tree_size and the two local size variables to size_t, drop the size_st bridges, and drop the two cast_size_t_to_ulong() shims. This was the last pair of cast_size_t_to_ulong() call sites in builtin/pack-objects.c, completing the >4 GiB-objects work in that file that this branch and its predecessors have been pursuing. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Last piece of the delta API to still expose unsigned long. The function literally returns struct delta_index.memsize, which became size_t in the first commit of this series. The sole caller (free_unpacked() in builtin/pack-objects.c) already accepts size_t via its freed_mem local, so the widening only removes the implicit size_t -> unsigned long narrowing inside the function body. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Final piece of the tree topic. struct tree.size already receives its values from size_t-shaped sources (odb_read_object() in repo_parse_tree_gently() and in reflog.c::tree_is_complete()), so on Windows it was already silently truncating anything past 4 GiB. Switch the field and parse_tree_buffer()'s size parameter to size_t. All readers feed tree->size into init_tree_desc(), which was widened earlier in this topic; the existing parse_object_buffer() caller in object.c keeps its unsigned long parameter, which promotes cleanly. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prep for the upcoming blame_scoreboard.final_buf_size widening: prepare_lines() will pass a size_t through to find_line_starts(), and the other caller (fill_origin_blob() via o->file.size) already goes through long->size_t promotion. The function is file-static and only uses len as a loop bound. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Mirror of the preceding fast-import sweep. anonymize_blob() writes strbuf.len (size_t) into its out-parameter, and export_blob()'s non-anonymize branch reads odb_read_object()'s size_t out-parameter through a size_st + cast_size_t_to_ulong() bridge into an unsigned long local; both have been silent on Windows past 4 GiB. Widen the helper signature and the local, and drop the bridge. check_object_signature() and parse_object_buffer() still take unsigned long, so the silent narrowing on Windows just moves from the local assignment to those call sites; both are separate topics. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. count_objects() feeds the inflated size from odb_read_object_info_extended()'s size_t out-parameter into struct object_values (size_t) and check_largest() (size_t) through an unsigned long bridge with a cast_size_t_to_ulong() shim. The bridge was the only narrow link in the chain. Widen the local, point oi.sizep at it directly, and drop the cast. parse_object_buffer() still takes unsigned long, so a Windows narrowing remains at that one call; that is its own follow-up topic. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the migration from `unsigned long` to `size_t`. The `size` attribute of `struct commit_buffer` is fed either from `odb_read_object()`'s return value (`size_t`, handled with `cast_size_t_to_ulong()`) or from `strbuf.len` in `fake_working_tree_commit()` (silently narrowed today). Widen the field and a couple of function signatures together, drop the shim in `repo_get_commit_buffer()`, and move the matching `unsigned long` locals at the in-tree callers in commit.c (three sites), builtin/replace.c, and builtin/stash.c (two sites). The remaining callers pass NULL or already pass a size_t-compatible variable. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit continues the migration from `unsigned long` to `size_t`, converting `grep_buffer()` and its helpers. The callers are already prepared for this change. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Drop the last cast_size_t_to_ulong() in builtin/unpack-objects.c. With size_t-typed object sizes already coming in via odb_read_object() and the per-byte varint decode in unpack_one() (widened by f206385), the rest of the file was the only thing left that still threaded sizes through unsigned long: struct obj_buffer.size and struct delta_info.size, get_data() and add_object_buffer(), add_delta_to_list(), resolve_delta(), resolve_against_held(), added_object(), write_object(), unpack_non_delta_entry(), unpack_delta_entry(), and stream_blob(). Widen all of them together. None of those types had a downstream narrow consumer once odb_write_object() and patch_delta() were widened earlier, so the change is mechanical: parameter and field types change, the base_size_st bridge in unpack_delta_entry() and its cast go away, and odb_read_object() now writes into base_size directly. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Preparation for widening the delta-encoding API to size_t in subsequent commits, which is what lets pack-objects drop the cast_size_t_to_ulong() shims that 606c192 (odb, packfile: use size_t for streaming object sizes, 2026-05-08) had to leave behind in get_delta() and try_delta() because their downstream consumers were still narrow. The struct is private to diff-delta.c, so widening its fields in isolation is a no-op at runtime: the values stored continue to fit in 32 bits on Windows because the public API around it still truncates. Splitting it out keeps the API-change commit focused on caller updates. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The sole caller (try_delta() in builtin/pack-objects.c) passes an unsigned long, which promotes safely, so no caller fixups are needed. Splitting it out keeps the diff_delta() / create_delta() widening, which does ripple to several callers, in its own commit. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
These three are a single accounting tuple (the globals tracking cumulative cached-delta bytes, plus the helper that compares them against an incoming delta size) and are latently 32-bit on Windows where unsigned long != size_t: a pack with many large cached deltas could wrap silently. The widening is internally consistent on its own: the additions and subtractions against delta_cache_size already come from size_t sources (DELTA_SIZE() returns size_t), and delta_cacheable()'s sole caller in try_delta() still passes unsigned long, which promotes. Prerequisite for dropping try_delta()'s cast_size_t_to_ulong() shims, which becomes possible once create_delta() and diff_delta() are widened in a later commit. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
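The silent wrap mentioned above is easy to see with a cumulative counter. This is only a sketch: `uint32_t` stands in for a 32-bit `unsigned long`, and both helper names are hypothetical rather than taken from builtin/pack-objects.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not git code: uint32_t stands in for a 32-bit
 * `unsigned long`, uint64_t for a 64-bit size_t. */
static uint32_t total_cached_narrow(uint64_t delta_size, unsigned count)
{
	uint32_t total = 0;
	for (unsigned i = 0; i < count; i++)
		total += (uint32_t)delta_size; /* wraps modulo 2^32 */
	return total;
}

/* Same accumulation with a wide counter: no wraparound for these sizes. */
static uint64_t total_cached_wide(uint64_t delta_size, unsigned count)
{
	uint64_t total = 0;
	for (unsigned i = 0; i < count; i++)
		total += delta_size;
	return total;
}
```

Five cached deltas of 1 GiB each add up to 5 GiB in the wide counter, while the narrow counter ends at 1 GiB, which would make a later "is there still room in the cache" comparison pass when it should not.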
free_unpacked() sums two byte counts: sizeof_delta_index() and SIZE(n->entry). The latter has been size_t since the prior topic "More work supporting objects larger than 4GB on Windows" widened SIZE() / oe_size() to size_t, so accumulating it into an unsigned long return was a silent Windows-only truncation on a packing run with many large objects. The sole caller (find_deltas()) holds its own mem_usage in an unsigned long for now and subtracts the return into it, so the new narrowing happens at that subtraction. find_deltas() and the matching try_delta() out-parameter are widened next. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The pair must move together because find_deltas() passes &mem_usage to try_delta(): widening either alone breaks the type match. mem_usage accumulates per-object byte counts already computed in size_t (SIZE() and sizeof_delta_index() reach here through free_unpacked(), now size_t), and was the last 32-bit-on-Windows narrowing point in the delta-window memory accounting chain. With this commit, that chain is internally size_t end-to-end except for sizeof_delta_index()'s still-narrow return, whose value is bounded by create_delta_index()'s entries cap. window_memory_limit (config-driven via git_config_ulong()) stays unsigned long: it is only compared against mem_usage and promotes. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Last stop in the delta-encoding API widening for >4 GiB blobs on
Windows: with create_delta_index() done in the prior commit and
create_delta()/diff_delta() finished here, every byte count that
crosses delta.h is now size_t. The struct fields they store into
have been size_t since the diff-delta struct widening.
The API change must move with all callers in the same commit (the
build only passes when every &delta_size matches the new size_t*).
Caller updates are kept minimal:
* builtin/pack-objects.c get_delta() and try_delta(): widen only
  the local delta_size variable; the surrounding unsigned-long
  locals and their cast_size_t_to_ulong() shims are out of scope
  here and will be cleaned up in their own commits.
* builtin/fast-import.c, diff.c, t/helper/test-pack-deltas.c:
  keep the local unsigned-long delta size (each feeds a
  still-unsigned-long downstream consumer: zlib's avail_in,
  deflate_it(), the test helper's own do_compress()), and bridge
  via a temporary size_t plus cast_size_t_to_ulong(). The new
  casts are paid back in later topics that widen those consumers.
* t/helper/test-delta.c: widen the local outright (no downstream
  consumer beyond the test's own out_size, which is already
  size_t).
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Bundling the two widenings: four call sites pass &stream.avail_in directly to use_pack(), and widening either type fencepost alone would force a bridge variable at each. Doing both together is the simpler end state and is the prerequisite for the do_compress() widening in the next commit, which is what lets write_no_reuse_object() lose its last cast_size_t_to_ulong() shim. The unsigned-long locals widened at the other use_pack() callers (avail / remaining / left) hold pack-window sizes bounded by core.packedGitWindowSize, so the change is type consistency rather than a new >4GB capability. git_zstream.avail_in / avail_out likewise reach zlib's uInt fields only after zlib_buf_cap()'s 1 GiB cap, so the wrapper already accepted size_t-shaped inputs in practice. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
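To illustrate why a cap in front of zlib's narrow fields lets the wrapper accept wide sizes, here is a minimal sketch. The 1 GiB figure is taken from the message above; `buf_cap_sketch()` and `chunks_needed()` are hypothetical helpers written for this example and are not the code in git's zlib wrapper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cap of 1 GiB, as stated in the commit message above. */
#define BUF_CAP_SKETCH ((uint64_t)1 << 30)

/* Hypothetical helper modelled on the description of zlib_buf_cap(). */
static uint32_t buf_cap_sketch(uint64_t len)
{
	return (uint32_t)(len > BUF_CAP_SKETCH ? BUF_CAP_SKETCH : len);
}

/* Count how many capped chunks it takes to hand `total` bytes to a
 * consumer whose per-call counter is only 32 bits wide. */
static unsigned chunks_needed(uint64_t total)
{
	unsigned chunks = 0;
	while (total) {
		uint32_t step = buf_cap_sketch(total);
		assert(step > 0); /* a non-zero total always makes progress */
		total -= step;
		chunks++;
	}
	return chunks;
}
```

Because every step is at most 1 GiB, the value handed to the 32-bit consumer never overflows, regardless of how large the wide total is.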
Prep for the upcoming git_deflate_bound() widening to size_t: the local that catches its return needs to be size_t too, otherwise the widening would introduce a silent Windows narrowing here. No semantic effect with the current unsigned-long-returning git_deflate_bound() (size_t == unsigned long on this caller's platforms today). Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Fixes a pre-existing silent narrowing from git_deflate_bound()'s unsigned long return into an int local: anything past 2 GiB has always wrapped negative here and then been re-extended to size_t inside xmalloc(). Also prep for the upcoming git_deflate_bound() widening to size_t, which would extend the narrowing further if bound stayed int. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
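The wrap-then-re-extend sequence described above can be sketched as follows. This is a model, not the code being fixed: `bound_via_int()` is a hypothetical name, fixed-width types replace `int` and `size_t`, and the narrowing conversion is assumed to wrap in the usual two's-complement way (it is implementation-defined before C23).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "unsigned bound -> int local -> allocation size".
 * The uint32_t -> int32_t conversion is implementation-defined before C23;
 * this sketch assumes the usual two's-complement wraparound. */
static uint64_t bound_via_int(uint64_t bound)
{
	int32_t local = (int32_t)(uint32_t)bound;  /* may wrap negative */
	return (uint64_t)(int64_t)local;           /* sign-extends to a huge value */
}
```

A bound of 3 GiB goes in and an allocation request close to 2^64 comes out, while small bounds pass through unchanged, which is why the problem stays invisible below 2 GiB.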
Continue the size_t evacuation around large object handling: with deflate_it() and the locals around it widened, the cast_size_t_to_ulong() shim the prior diff_delta() widening had to leave behind in emit_binary_diff_body() goes away. deflate_it() is file-static; the only callers are the two in emit_binary_diff_body() already touched here. emit_diff_symbol() formats the resulting sizes via uintmax_t / %"PRIuMAX", so the diff output is not affected; only the per-process upper bound on a binary patch chunk that this function can address grows beyond 4 GiB on Windows. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The local is initialised from git_deflate_bound() (an unsigned upper bound on the deflated output, never negative) and used in exactly three places: the initialising assignment, strbuf_grow(buf, size) whose parameter is already size_t, and stream.avail_out which became size_t in the prior commit. There is no comparison against zero or a negative value, no subtraction, no arithmetic that depends on signedness, and no path that would assign a signed quantity to it. The original ssize_t was the wrong type to begin with: a git_deflate_bound() result above SSIZE_MAX would have wrapped negative on assignment and then implicitly re-extended to a huge size_t at strbuf_grow() / stream.avail_out, requesting an absurd allocation. That is not a real-world concern for the object sizes http-push pushes today, but it is also the reason the type needs to move to size_t before git_deflate_bound() itself is widened. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prep for the upcoming read_blob_data_from_index() widening, whose callers in convert.c feed the size they receive straight into these two helpers. Both are file-static, so the change is contained. Also fixes a small pre-existing narrowing on the get_wt_convert_stats_ascii() path, where strbuf.len (size_t) was passed to an unsigned long parameter. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prep for the upcoming git_deflate_bound() widening to size_t. The local is only ever the return value of git_deflate_bound() and the xmalloc() / stream.avail_out sizes derived from it; widening it has no semantic effect today. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. read_blob_data_from_index() reads the blob through the size_t odb_read_object() API but writes the size back through an unsigned long out-parameter, silently truncating anything past 4 GiB on Windows. Widen the out-parameter, drop the cast_size_t_to_ulong() shim, and move the matching locals in the two convert.c callers and the one in attr.c. Their downstream consumers (gather_convert_stats() widened in the prior commit and read_attr_from_buf() already size_t) take the new type directly. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
All four `unsigned long` / `int` / `ssize_t` receivers across archive-zip, diff, http-push and t/helper/test-pack-deltas were widened to size_t in the prior commits, and remote-curl and fast-import were already there. With every caller prepared, both the parameter and the return type can now move without introducing any silent narrowing. For inputs above zlib's uLong range (i.e. >4 GiB on platforms where uLong is 32-bit, notably 64-bit Windows), defer to zlib's stored-block formula (the same fallback it would itself use for an unknown stream state) plus the worst-case wrapper overhead. The existing path through deflateBound() is unchanged for inputs that fit. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Prep for the widenings of its callers, where size-receiving locals will become size_t (combine-diff's result_size in the immediately following commit, struct diff_filespec.size in a later topic). Body caps the parameter at 8000 anyway, so the type change is mechanical. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
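Why the type change is mechanical can be seen in a small sketch of a capped check. Only the 8000-byte cap comes from the message above; the NUL-byte heuristic and the `_sketch` names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 8000-byte cap comes from the commit message above; the NUL-byte
 * heuristic is an assumption made for this sketch. */
#define FIRST_FEW_BYTES_SKETCH 8000

static int buffer_is_binary_sketch(const char *ptr, size_t size)
{
	/* Whatever width `size` arrives in, only a small prefix is scanned. */
	if (size > FIRST_FEW_BYTES_SKETCH)
		size = FIRST_FEW_BYTES_SKETCH;
	return memchr(ptr, 0, size) != NULL;
}
```

Since the value is clamped before it is used, widening the parameter cannot change the result for any input; it only removes a narrowing at the call sites.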
Continue the size_t evacuation. With buffer_is_binary() widened in the prior commit, every consumer that the size flows into in combine-diff.c is size_t-ready, so widen grab_blob()'s out-param outright and move the matching locals at its three call sites together. grab_blob()'s body collapses to a direct odb_read_object(&size) since the bridge variable is no longer needed. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. textconv_object() fills its out-parameter from fill_textconv()'s size_t return through an unsigned long*; widen the API to match, then take advantage of the new shape where callers can. cat-file's 'c' and batch-mode 'c' branches lose their size_ul bridge variables (one site becomes a direct call, the other collapses an if/else into a single negated condition that reads as "try textconv, fall back to a raw read"). blame.c likewise drops the file_size_st bridge in fill_origin_blob() and hoists final_buf_size_st to bracket both branches in setup_scoreboard(). The latter keeps a cast_size_t_to_ulong() shim because struct blame_scoreboard.final_buf_size is still unsigned long; that field is its own topic. log.c just widens its local from unsigned long to size_t. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. The struct field already receives its writes from a size_t-shaped source (xsize_t(st.st_size), strbuf.len, fill_textconv()'s return, odb_read_object_info_extended() via oi.sizep), so on Windows it was already truncating anything past 4 GiB silently on the strbuf and textconv paths and loudly through cast_size_t_to_ulong() on the odb path. Switch the field to size_t. In diff_populate_filespec(), point oi.sizep at the field directly and drop both cast_size_t_to_ulong() shims and the size_st bridge they fed. Downstream consumers that still read .size into unsigned long locals will now silently narrow on Windows where the field exceeds 4 GiB. Each of those is its own follow-up; the writer side is the prerequisite for ever putting a >4 GiB value in the field in the first place. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
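The difference between the silent and the loud narrowing mentioned above can be sketched like this. Both helpers are hypothetical; `uint32_t` plays the role of a 32-bit `unsigned long`, and the checked variant returns -1 where a real helper might abort with an error instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: uint32_t plays a 32-bit `unsigned long`.
 * Silent narrowing simply drops the high bits. */
static uint32_t narrow_silently(uint64_t size)
{
	return (uint32_t)size;
}

/* "Loud" narrowing in the spirit of a checked cast: report failure
 * instead of returning a wrong value. Returns the value on success,
 * -1 if it does not fit (a real helper might abort instead). */
static int64_t narrow_checked(uint64_t size)
{
	if (size > UINT32_MAX)
		return -1;
	return (int64_t)size;
}
```

For a size of 4 GiB plus 7 bytes, the silent path yields 7 and carries on, whereas the checked path refuses. Widening the destination removes the need for either.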
The two shims that 606c192 (odb, packfile: use size_t for streaming object sizes, 2026-05-08) and the subsequent odb_read_object() widening introduced as scaffolding around get_delta()'s reads can now disappear: the previous commit widened diff_delta() to size_t, which was the last narrow consumer in this function. Widen size and base_size to size_t outright, drop the size_st / base_size_st bridging temporaries, and drop the two cast_size_t_to_ulong() calls. Net change is 4 lines smaller and one read-then-cast indirection gone from each odb read. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
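The shape of the cleanup, reduced to its essentials, might look like the sketch below. Everything here is hypothetical: `read_size_sketch()` stands in for an odb read with a wide out-parameter, `uint32_t` models a 32-bit `unsigned long`, and a plain cast stands in for the shim.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reader standing in for an odb read with a wide
 * out-parameter; it reports a fixed 5 GiB size for illustration. */
static void read_size_sketch(uint64_t *size_out)
{
	*size_out = (uint64_t)5 << 30;
}

/* Before: wide bridge variable, then a cast into a narrow local
 * (uint32_t models a 32-bit `unsigned long`). */
static uint64_t size_before(void)
{
	uint64_t size_st;
	uint32_t size;

	read_size_sketch(&size_st);
	size = (uint32_t)size_st;   /* plain cast standing in for the shim */
	return size;
}

/* After: the local is wide, so the reader writes into it directly. */
static uint64_t size_after(void)
{
	uint64_t size;

	read_size_sketch(&size);
	return size;
}
```

The "after" version has one variable and one statement fewer per read, and the 5 GiB value survives intact instead of being reduced to 1 GiB.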
Companion to the prior get_delta() cleanup, and the last try_delta() piece of the >4 GiB delta-path topic. Every consumer that the function's locals fed has now been widened: SIZE() / DELTA_SIZE() to size_t (prior topic), the mem_usage out-parameter and delta_cacheable() earlier in this series, and create_delta() / create_delta_index() in the immediately preceding commits. Widen the declaration of trg_size, src_size, sizediff, max_size and sz to size_t (delta_size joins them on the same line, removing the size_t delta_size line that the create_delta() widening commit added as a stop-gap), and drop the two sz_st bridge variables together with the surrounding cast_size_t_to_ulong() calls. The result is just "odb_read_object(&sz)" on both reads. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation that this series and the merged js/objects-larger-than-4gb-on-windows topic are advancing for >4 GiB objects on Windows: with the odb readers and the zlib helpers reached from do_compress() now widened end-to-end, the last cast_size_t_to_ulong() shim in this function can be removed, and do_compress() itself can carry the new size type through. Two cast_size_t_to_ulong() shims remain in this file; they feed the tree-walk API, which is still narrow and is a separate widening topic. write_no_reuse_object()'s return type and the hashfile API are still narrow but unchanged in observable behaviour: on 64-bit Linux ulong coincides with size_t, and on Windows these were the narrow fenceposts the prior topics deliberately left in place. Their widening is left to follow-ups touching the hashfile API and the write_object() caller chain. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. final_buf_size is fed either from textconv_object()'s now-size_t out-parameter, from odb_read_object()'s size_t out-parameter (both bridged today through a final_buf_size_st local + cast_size_t_to_ulong()), or from o->file.size (mmfile_t, long). Widen the struct field, point both producers straight at it, and drop the bridge variable along with the cast. builtin/blame.c only reads the field for pointer arithmetic and comparisons, which promote cleanly. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Continue the size_t evacuation. fast-import's helper gfi_unpack_entry() and the six size-handling sites that feed off it (store_object()'s deltalen, load_tree(), parse_from_existing(), the inline gfi_unpack_entry() caller in parse_objectish(), cat_blob(), and dereference()) all carry size_t-shaped values from the odb / unpack_entry() APIs through cast_size_t_to_ulong() bridges into unsigned long locals. With the producers (odb_read_object(), odb_read_object_peeled(), unpack_entry()) and the consumers it feeds (the zlib avail_in field from a prior commit, encode_in_pack_object_header()'s uintmax_t parameter, parse_from_commit()'s widened size parameter) all size_t-ready, the bridges and casts go away in one pass. gfi_unpack_entry() now writes into the caller's size_t directly, and the six locals collapse to plain size_t declarations. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Tidies up the bridge variable introduced in the create_delta() / diff_delta() widening commit earlier in this series. With the test helper's local do_compress() also widened to size_t in passing, the narrowing into the unsigned long delta_size local that compress expected is gone, the size_st bridge is unnecessary, and the cast goes away. encode_in_pack_object_header() takes uintmax_t and hashwrite() takes uint32_t, both unchanged. Assisted-by: Opus 4.7 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Now that all of the call sites of this helper (which I used as a kind of "NEEDSWORK" marker) are eliminated, we can drop that helper altogether. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
9f937ee to b391157
This PR contains a branch thicket on top of v2.55.0-rc1 (i.e. ready to go upstream) to continue the bulk of the `unsigned long` -> `size_t` transformation.

Since all of these changes have no impact on the currently-working functionality for <4GB objects/packs/clones (modulo bugs, that is 😄), I would still like to merge this before v2.55.0-rc2: the risk of introducing a regression is negligible, and the chance of fixing the majority of problems with large clones is high.