Fix Chinese Whispers clustering to process all nodes each iteration #3133
SamareshSingh wants to merge 1 commit into davisking:master
- Changed algorithm from random node selection to guaranteed sequential processing using Fisher-Yates shuffle per iteration
- Each iteration now shuffles all node indices and processes each sequentially, ensuring complete label propagation
Thanks, but this isn't what the algorithm is supposed to be doing. I.e. the way it's written in dlib isn't a bug. What is here in this PR is a different, but related, algorithm.

Although I agree this is what the original Chinese Whispers paper said to do. I forget at this point why the version in dlib deviates from that paper, but what's in dlib works really well for the applications it's used for, so I don't want to go changing it. Might not be as good for existing users.
Warning: this issue has been inactive for 35 days and will be automatically closed on 2026-03-22 if there is no further activity. If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.
Summary
Fixed a critical bug in the Chinese Whispers clustering algorithm where vectors within the distance threshold were not being grouped together correctly.
The Problem
The algorithm was supposed to perform `num_iterations` complete passes over all nodes in the graph, but instead it was using random node selection. This meant some nodes could be skipped entirely during an iteration, breaking the label propagation required for correct clustering.

For example, vectors with a distance of 0.371814 (below the 0.38 threshold) were ending up in different clusters when they should have been grouped together.
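To make the skipped-node effect concrete, here is a minimal standalone sketch that compares the two visiting strategies. It is not dlib code: the function names `distinct_visits_random` and `distinct_visits_shuffled` are hypothetical, and it assumes "random node selection" means drawing one index uniformly at random, with replacement, for each of the `n` updates in an iteration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: draw n node indices uniformly at random WITH
// replacement (the assumed "random node selection") and count how many
// distinct nodes were actually touched.
std::size_t distinct_visits_random(std::size_t n, std::mt19937& rng) {
    if (n == 0) return 0;
    std::vector<bool> seen(n, false);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    for (std::size_t i = 0; i < n; ++i) seen[pick(rng)] = true;
    return static_cast<std::size_t>(std::count(seen.begin(), seen.end(), true));
}

// Hypothetical helper: walk a shuffled permutation of all indices once and
// count how many distinct nodes were touched (always n).
std::size_t distinct_visits_shuffled(std::size_t n, std::mt19937& rng) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), rng);  // uniformly random permutation
    std::vector<bool> seen(n, false);
    for (std::size_t idx : order) seen[idx] = true;
    return static_cast<std::size_t>(std::count(seen.begin(), seen.end(), true));
}
```

Under that assumption, `n` draws with replacement reach on average only about 63% of the nodes for large `n` (a fraction of 1 - 1/e), while the shuffled pass always reaches all of them.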
The Solution
Changed the algorithm to guarantee that every node is processed exactly once per iteration using a Fisher-Yates shuffle approach:
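A minimal sketch of a shuffled full pass, written against the standard library only so it stands alone. It is not dlib's `chinese_whispers` and not the exact diff in this PR: the `Edge` struct, the function name `chinese_whispers_full_pass`, and the tie-breaking rule (lowest label wins) are assumptions made for illustration, and edge weights are assumed to be non-negative.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Hypothetical edge type for this sketch (not dlib's sample_pair).
struct Edge { std::size_t a; std::size_t b; double weight; };

// Sketch: label propagation where every iteration visits each node exactly
// once, in a freshly shuffled order. Assumes all edge endpoints are
// < num_nodes and all weights are non-negative.
std::vector<std::size_t> chinese_whispers_full_pass(
    std::size_t num_nodes, const std::vector<Edge>& edges,
    std::size_t num_iterations, std::mt19937& rng)
{
    // Build an undirected adjacency list of (neighbor, weight) pairs.
    std::vector<std::vector<std::pair<std::size_t, double>>> nbrs(num_nodes);
    for (const Edge& e : edges) {
        nbrs[e.a].push_back(std::make_pair(e.b, e.weight));
        nbrs[e.b].push_back(std::make_pair(e.a, e.weight));
    }

    // Every node starts in its own cluster; order holds the visiting order.
    std::vector<std::size_t> labels(num_nodes), order(num_nodes);
    for (std::size_t i = 0; i < num_nodes; ++i) { labels[i] = i; order[i] = i; }

    for (std::size_t iter = 0; iter < num_iterations; ++iter) {
        // Fisher-Yates shuffle: swap position i-1 with a random j in [0, i-1].
        for (std::size_t i = num_nodes; i > 1; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(order[i - 1], order[pick(rng)]);
        }

        // Visit every node once, adopting the label with the largest total
        // edge weight among its neighbors (using already-updated labels).
        for (std::size_t idx : order) {
            if (nbrs[idx].empty()) continue;  // isolated nodes keep their label
            std::map<std::size_t, double> votes;
            for (const auto& nw : nbrs[idx]) votes[labels[nw.first]] += nw.second;

            std::size_t best_label = labels[idx];
            double best_weight = -1.0;
            for (const auto& kv : votes) {
                // Strict '>' means the lowest label wins ties (an assumption).
                if (kv.second > best_weight) { best_weight = kv.second; best_label = kv.first; }
            }
            labels[idx] = best_label;
        }
    }
    return labels;
}
```

For instance, with three nodes and a single edge between nodes 0 and 1, the first pass already puts nodes 0 and 1 under the same label, while the isolated node 2 keeps its own.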