Fix copyright detection for URLs containing (c) symbol #4726
+49
−38
Fixes #4724
Problem
URLs containing (c) in their path or query parameters were incorrectly detected as copyright statements.
For example, such a URL was being detected as a copyright statement.
Solution
This fix addresses the issue by:
- Matching URL patterns before the (C) and (c) copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
- Filtering out URL-like detections that contain (c)
The tokenizer splits URLs on = and ; characters, which can cause (c) to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.
Testing
- urls.10K file from the issue: now shows 0 false positives (previously had 2)
- tests/cluecode/data/copyrights/url_with_c_symbol.txt with URLs containing (c): all pass without false detections
Changes
- src/cluecode/copyrights.py
- tests/cluecode/data/copyrights/url_with_c_symbol.txt
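To make the Solution section concrete, here is a minimal, self-contained sketch of the two ideas: ordering URL patterns ahead of the (c) pattern, and skipping URL-like lines. It is an illustration only, not the code in src/cluecode/copyrights.py. The function names, the regular expressions, and the example URL are all assumptions made up for this sketch.

```python
import re

# Hypothetical URL for illustration; it is not taken from the linked issue.
URL = "https://example.com/page?a=(c);b=2"

# Simplified stand-in for a tokenizer that also splits on "=" and ";".
def naive_tokens(text):
    return [t for t in re.split(r"[\s=;]+", text) if t]

URL_RX = (re.compile(r"^https?://", re.I), "URL")
COPY_RX = (re.compile(r"\(c\)", re.I), "COPY")

# Ordered pattern lists: the first pattern that matches decides the label.
COPY_FIRST = [COPY_RX, URL_RX]
URL_FIRST = [URL_RX, COPY_RX]

def label(token, patterns):
    for rx, name in patterns:
        if rx.search(token):
            return name
    return "OTHER"

# Assumed filter: treat a line that is a single URL as not a copyright.
def looks_like_url(line):
    return re.match(r"^\s*(https?|ftp)://\S+\s*$", line, re.I) is not None

def detect(line):
    if looks_like_url(line):
        return []
    return [t for t in naive_tokens(line) if label(t, URL_FIRST) == "COPY"]

# Splitting on "=" and ";" isolates "(c)" as its own token.
print(naive_tokens(URL))
# Pattern order decides how the unsplit URL token is labelled.
print(label(URL, COPY_FIRST), label(URL, URL_FIRST))
# The URL line is filtered out; a real statement is still detected.
print(detect(URL), detect("(c) 2020 Example Corp"))
```

Pattern ordering alone does not help once (c) has been split off into its own token, which is why the sketch also includes the URL-like filter.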