Fix copyright detection for URLs containing (c) symbol #4726

gyanranjanpanda · 2026-02-03T18:10:02Z

Problem

URLs containing (c) in their path or query parameters were incorrectly detected as copyright statements.

Example:

http://biblio.cesga.es:81/search*gag/aXove,+Xosé/axove+xose/7,-1,0,B/frameset&F=axuntanza&1,,3

This URL was being detected as a copyright statement.

Solution

This fix addresses the issue by:

1. Reordering URL/email patterns to appear before (C) and (c) copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
2. Adding junk copyright patterns to filter out false positives from URL fragments containing (c)

The tokenizer splits URLs on = and ; characters, which can cause (c) to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.
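Why the ordering matters can be sketched with a minimal first-match-wins tagger. This is an illustration only, not the code in src/cluecode/copyrights.py: the pattern table, the labels, and the `tag` helper are all hypothetical and far simpler than the project's real patterns.

```python
import re

# Hypothetical, simplified pattern table: (regex, label) pairs tried in order.
# These are NOT the actual patterns from src/cluecode/copyrights.py.
URL_PATTERN = (re.compile(r'^https?://\S+$', re.IGNORECASE), 'URL')
COPY_PATTERN = (re.compile(r'^.*\(c\).*$', re.IGNORECASE), 'COPY')
CATCH_ALL = (re.compile(r'^.*$'), 'NN')

# Ordering before the fix (copyright first) and after it (URL first).
COPY_FIRST = [COPY_PATTERN, URL_PATTERN, CATCH_ALL]
URL_FIRST = [URL_PATTERN, COPY_PATTERN, CATCH_ALL]


def tag(token, patterns):
    """Return the label of the first pattern that matches the token."""
    for regex, label in patterns:
        if regex.match(token):
            return label
    return 'NN'


url = 'http://example.com/path/(c)/test'
print(tag(url, COPY_FIRST))   # the (c) pattern wins, so the URL is mislabelled
print(tag(url, URL_FIRST))    # the URL pattern wins
print(tag('(c)', URL_FIRST))  # a bare (c) token is still a copyright marker
```

With a first-match-wins table, moving the URL pattern ahead of the (c) pattern changes the label of a whole-URL token without affecting a standalone (c) token. Fragments that the tokenizer has already split off from a URL are not covered by reordering alone, which is why the junk patterns are also needed.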

Testing

- Tested with the original urls.10K file from the issue: it now shows 0 false positives (previously 2)
- Created test file tests/cluecode/data/copyrights/url_with_c_symbol.txt with URLs containing (c); all pass without false detections
- Code follows natural coding style without excessive comments

Changes

Modified src/cluecode/copyrights.py:
- Moved URL/email patterns from line ~2304 to line 707 (before copyright patterns)
- Added 3 junk patterns to filter URL fragments
Added test file tests/cluecode/data/copyrights/url_with_c_symbol.txt
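The junk-pattern step can be pictured as a post-filter over candidate detections. The sketch below is an assumption about the general shape of such a filter: the two regexes, `is_junk`, and `filter_detections` are hypothetical names and patterns chosen for illustration, not the 3 patterns this PR actually adds.

```python
import re

# Hypothetical junk patterns, for illustration only.
JUNK_PATTERNS = [
    # A detection that is itself a URL.
    re.compile(r'^https?://', re.IGNORECASE),
    # A (c) immediately followed by URL punctuation, e.g. "(c)/test".
    re.compile(r'^\(c\)[/&?=,]', re.IGNORECASE),
]


def is_junk(detection):
    """Return True if the candidate detection looks like a URL fragment."""
    return any(regex.search(detection) for regex in JUNK_PATTERNS)


def filter_detections(detections):
    """Keep only the detections that match no junk pattern."""
    return [d for d in detections if not is_junk(d)]


candidates = [
    '(c)/test',
    'Copyright (c) 2020 Example Corp',
    'http://example.com/(c)',
]
print(filter_detections(candidates))
```

Only the genuine copyright statement survives the filter here; the URL and the URL fragment are both discarded.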

Fixes aboutcode-org#4724

URLs containing (c) in their path or query parameters were incorrectly detected as copyright statements. For example: http://example.com/path/(c)/test

This fix addresses the issue by:

1. Reordering URL/email patterns to appear before (C) and (c) copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
2. Adding junk copyright patterns to filter out false positives from URL fragments containing (c)

The tokenizer splits URLs on = and ; characters, which can cause (c) to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.

Tested with the original urls.10K file from the issue - now shows 0 false positives (previously had 2).

Signed-off-by: Gyan Ranjan Panda <gyanranjanpanda@gmail.com>

gyanranjanpanda force-pushed the fix/url-copyright-detection-4724 branch from 48b5dd8 to 24d43f1 Compare February 3, 2026 18:14
