Skip to content

Implement sorting by hash#112

Open
aneubeck wants to merge 21 commits into
mainfrom
aneubeck/merge
Open

Implement sorting by hash#112
aneubeck wants to merge 21 commits into
mainfrom
aneubeck/merge

Conversation

@aneubeck
Copy link
Copy Markdown
Collaborator

@aneubeck aneubeck commented May 8, 2026

This PR

  • adds iterators
  • adds sort_by_hash feature that data is iterated according to their hash
  • adds benchmarks for sort/merge use cases
    Note: hashmap remains intact after sorting now.

Copilot AI review requested due to automatic review settings May 8, 2026 07:57
@aneubeck aneubeck requested a review from a team as a code owner May 8, 2026 07:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends hash-sorted-map to support deterministic hash-ordered iteration by introducing explicit iterator types and a sort_by_hash operation that reorders entries in-place (within hash-prefix group chains) before returning an owning HashSortedContainer. It also updates documentation around performance trade-offs and adds new Criterion benchmarks for iter/sort/merge scenarios.

Changes:

  • Refactors core storage into HashSortedContainer + Group and re-exports the container.
  • Adds Iter / IterMut / IntoIter for both HashSortedContainer and HashSortedMap, including IntoIterator impls.
  • Adds HashSortedMap::sort_by_hash() plus additional benchmarks and updated optimization notes.
Show a summary per file
File Description
crates/hash-sorted-map/src/lib.rs Wires up new modules and re-exports container + iterator types.
crates/hash-sorted-map/src/iter.rs Implements shared cursor-based iteration for container/map (immutable, mutable, and owning).
crates/hash-sorted-map/src/hash_sorted_map.rs Refactors map to wrap HashSortedContainer, adds sort_by_hash, and updates insertion/lookup paths accordingly.
crates/hash-sorted-map/src/group.rs Extracts Group struct and NO_OVERFLOW constant.
crates/hash-sorted-map/src/container.rs Adds HashSortedContainer with allocation helpers and Drop that owns entry destruction.
crates/hash-sorted-map/OPTIMIZATIONS.md Updates design notes and publishes new benchmark results/interpretation.
crates/hash-sorted-map/Cargo.toml Adds smallvec dependency for sorting scratch storage.
crates/hash-sorted-map/benchmarks/performance.rs Adds new benchmarks for iter/sort/merge, including k-way merge comparison.
crates/hash-sorted-map/benchmarks/Cargo.toml Adds itertools dependency for k-way merge benchmark.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

crates/hash-sorted-map/src/hash_sorted_map.rs:946

  • Same issue as above: this test assumes hashes are strictly increasing (h > prev_hash), but hash collisions can occur (especially with strings), so the correct property is non-decreasing order. Update the assertion to tolerate equal hashes or validate ordering with a deterministic tie-breaker.
        let mut prev_hash = 0u64;
        let mut first = true;
        for (k, _) in &container {
            let h = hasher.hash_one(k);
            if !first {
                assert!(h > prev_hash, "hash order violated: {prev_hash:#x} > {h:#x}");
            }
  • Files reviewed: 9/9 changed files
  • Comments generated: 3

Comment thread crates/hash-sorted-map/src/hash_sorted_map.rs Outdated
Comment thread crates/hash-sorted-map/src/hash_sorted_map.rs
Comment thread crates/hash-sorted-map/src/hash_sorted_map.rs Outdated

hashbrown does **not** have this optimization — it always does a full SIMD
group scan. The reason why the performance is different is probably due to the different overflow strategies and the different load factors.
**Experimental finding**: This scalar check **hurts performance** on random
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes so much sense. And removing it from lookup paths explains why we can sort the map and it's still a map. Pretty great outcome!

How valuable is it for resizing? Even if it usually hits, surely it's equally fast to find the first empty slot in the group with SIMD, and that will hit even more often.


| Scenario | HashSortedMap | Comparison | Result |
| :------------------------------------------- | ------------: | :------------------------------------- | :---------- |
| Insert 1000 trigrams (pre-sized) | 7.34 µs | hashbrown::HashMap: 12.88 µs | ~43% faster |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't agree with the other file, which puts us 32% slower than hashbrown on the same microbenchmark. Different architecture?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to have all the benchmarks in one place and up to date, and preferably all on Intel if that's what we think most cloud servers have.

fn insert_after_grow<K: Hash + Eq, V, S: BuildHasher>(
map: &mut HashSortedMap<K, V, S>,
hash: u64,
_hash: u64,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean this up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants