Fix copyright detection for URLs containing (c) symbol #4726

Open
gyanranjanpanda wants to merge 1 commit into aboutcode-org:develop from gyanranjanpanda:fix/url-copyright-detection-4724

Fixes #4724

Problem

URLs containing (c) in their path or query parameters were incorrectly detected as copyright statements.

Example:

http://biblio.cesga.es:81/search*gag/aXove,+Xosé/axove+xose/7,-1,0,B/frameset&F=axuntanza&1,,3

This URL was being detected as a copyright statement.

Solution

This fix addresses the issue by:

  1. Reordering URL/email patterns to appear before (C) and (c) copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
  2. Adding junk copyright patterns to filter out false positives from URL fragments containing (c)

The tokenizer splits URLs on = and ; characters, which can cause (c) to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.
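
To illustrate why pattern order matters, here is a minimal sketch of a first-match-wins tagger. The two regexes, the `tag_token` helper, and the query-string URL are hypothetical and deliberately simplified; they are not the patterns or functions in src/cluecode/copyrights.py, and the split below assumes only the = and ; behavior described above.

```python
import re

# Hypothetical, simplified patterns; NOT the actual regexes in copyrights.py.
COPY_PATTERN = (re.compile(r".*\(c\)", re.IGNORECASE), "COPY")
URL_PATTERN = (re.compile(r"https?://\S+$", re.IGNORECASE), "URL")


def tag_token(token, patterns):
    """Return the tag of the first pattern that matches the token."""
    for regex, tag in patterns:
        if regex.match(token):
            return tag
    return "OTHER"


token = "http://example.com/path/(c)/test"

# Copyright pattern listed first: the URL token is tagged as a copyright marker.
before = tag_token(token, [COPY_PATTERN, URL_PATTERN])
# URL pattern listed first: the same token is tagged as a URL.
after = tag_token(token, [URL_PATTERN, COPY_PATTERN])

# Splitting on = and ; can still isolate "(c)" as its own token, which no
# ordering can recognize as part of a URL.
pieces = re.split(r"[=;]", "http://example.com/page?x=(c);y=1")
isolated = tag_token(pieces[1], [URL_PATTERN, COPY_PATTERN])
```

In this sketch `before` is `"COPY"` and `after` is `"URL"`, while the isolated `(c)` piece is tagged `"COPY"` in either order. That second case is why reordering alone is not enough and a junk filter is also needed.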

Testing

  • Tested with the original urls.10K file from the issue - now shows 0 false positives (previously had 2)
  • Created test file tests/cluecode/data/copyrights/url_with_c_symbol.txt with URLs containing (c) - all pass without false detections
  • Code follows natural coding style without excessive comments

Changes

  • Modified src/cluecode/copyrights.py:
    • Moved URL/email patterns from line ~2304 to line 707 (before copyright patterns)
    • Added 3 junk patterns to filter URL fragments
  • Added test file tests/cluecode/data/copyrights/url_with_c_symbol.txt
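
The junk-pattern step can be sketched as a post-detection filter. The three patterns added by this PR are not listed in the description, so the regexes and the `is_junk_copyright` helper below are assumptions made for illustration only.

```python
import re

# Hypothetical junk patterns; the PR's actual three patterns are not shown here.
JUNK_PATTERNS = [
    # A detection that starts with a URL scheme.
    re.compile(r"^https?://", re.IGNORECASE),
    # "(c)" followed directly by URL-like path or query punctuation.
    re.compile(r"^\(c\)\s*[\w%+.,-]*[/&?]", re.IGNORECASE),
]


def is_junk_copyright(detection):
    """Return True if a candidate detection looks like a URL fragment."""
    return any(pattern.search(detection) for pattern in JUNK_PATTERNS)


candidates = [
    "(c) 2024 Example Corp.",
    "(c) &F=axuntanza&1,,3",
    "http://example.com/path/(c)/test",
]
kept = [c for c in candidates if not is_junk_copyright(c)]
```

Here only `"(c) 2024 Example Corp."` survives the filter. Anchoring both patterns at the start of the detection is a deliberate choice in this sketch, so that a genuine statement that merely ends with a URL is not discarded.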

Fixes aboutcode-org#4724

URLs containing (c) in their path or query parameters were incorrectly
detected as copyright statements. For example:
http://example.com/path/(c)/test

This fix addresses the issue by:
1. Reordering URL/email patterns to appear before (C) and (c) copyright
   patterns in the lexer, ensuring URL tokens are matched as URLs first
2. Adding junk copyright patterns to filter out false positives from
   URL fragments containing (c)

The tokenizer splits URLs on = and ; characters, which can cause (c)
to appear as a separate token. By prioritizing URL pattern matching
and filtering URL-like detections, we prevent these false positives.

Tested with the original urls.10K file from the issue - now shows 0
false positives (previously had 2).

Signed-off-by: Gyan Ranjan Panda <gyanranjanpanda@gmail.com>

gyanranjanpanda force-pushed the fix/url-copyright-detection-4724 branch from 48b5dd8 to 24d43f1 on February 3, 2026 18:14

Development

Successfully merging this pull request may close these issues.

Copyright detection sees URLs containing copyright symbols as copyright statements
