Fix copyright detection for URLs containing (c) symbol #4726
Open
gyanranjanpanda wants to merge 1 commit into aboutcode-org:develop
Fixes aboutcode-org#4724

URLs containing (c) in their path or query parameters were incorrectly detected as copyright statements. For example:

http://example.com/path/(c)/test

This fix addresses the issue by:

1. Reordering URL/email patterns to appear before (C) and (c) copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
2. Adding junk copyright patterns to filter out false positives from URL fragments containing (c)

The tokenizer splits URLs on = and ; characters, which can cause (c) to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.

Tested with the original urls.10K file from the issue - now shows 0 false positives (previously had 2).

Signed-off-by: Gyan Ranjan Panda <gyanranjanpanda@gmail.com>
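The tokenizer behaviour mentioned in the commit message can be illustrated with a minimal sketch. The splitter below is a hypothetical simplification (whitespace, `=` and `;` only), not the tokenizer in src/cluecode/copyrights.py, and the query-string URL is made up for the example.

```python
import re

# Hypothetical splitter: breaks a line on whitespace, "=" and ";".
# A simplification for illustration, not ScanCode's actual tokenizer.
SPLITTER = re.compile(r"[\s=;]+")


def tokenize(line):
    """Return the non-empty tokens of a line."""
    return [tok for tok in SPLITTER.split(line) if tok]


# Made-up URL with "(c)" as a query parameter value.
tokens = tokenize("http://example.com/page?sym=(c);x=1")
print(tokens)
# ['http://example.com/page?sym', '(c)', 'x', '1']
```

Once `(c)` stands alone as a token, it no longer looks like part of a URL, which is presumably why reordering alone is not enough and the junk patterns are also needed.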
48b5dd8 to 24d43f1
Fixes #4724

Problem

URLs containing `(c)` in their path or query parameters were incorrectly detected as copyright statements.

Example:

`http://example.com/path/(c)/test`

This URL was being detected as a copyright statement.

Solution

This fix addresses the issue by:

1. Reordering URL/email patterns to appear before `(C)` and `(c)` copyright patterns in the lexer, ensuring URL tokens are matched as URLs first
2. Adding junk copyright patterns to filter out false positives from URL fragments containing `(c)`

The tokenizer splits URLs on `=` and `;` characters, which can cause `(c)` to appear as a separate token. By prioritizing URL pattern matching and filtering URL-like detections, we prevent these false positives.

Testing

- Tested with the original `urls.10K` file from the issue - now shows 0 false positives (previously had 2)
- `tests/cluecode/data/copyrights/url_with_c_symbol.txt` with URLs containing `(c)` - all pass without false detections

Changes

- `src/cluecode/copyrights.py`
- `tests/cluecode/data/copyrights/url_with_c_symbol.txt`
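
The two steps listed under Solution can be sketched roughly as follows. This is an illustrative first-match-wins tagger with hypothetical, simplified patterns and helper names (`tag`, `is_junk`, `URL`, `COPY`, `JUNK`); the actual patterns in `src/cluecode/copyrights.py` are not reproduced here and may differ.

```python
import re

# Hypothetical, simplified patterns for illustration only.
URL = (re.compile(r"^(?:https?|ftp)://\S+$", re.IGNORECASE), "URL")
COPY = (re.compile(r"\(c\)", re.IGNORECASE), "COPY")


def tag(token, patterns):
    """Return the label of the first pattern that matches the token."""
    for regex, label in patterns:
        if regex.search(token):
            return label
    return "OTHER"


# Hypothetical junk filter: "(c)" directly adjacent to a URL delimiter.
JUNK = re.compile(r"\(c\)[/?&=;]|[/?&=;]\(c\)", re.IGNORECASE)


def is_junk(detection):
    """Return True if a detected string looks like a URL fragment."""
    return bool(JUNK.search(detection))


token = "http://example.com/path/(c)/test"
print(tag(token, [COPY, URL]))  # COPY (copyright pattern tried first)
print(tag(token, [URL, COPY]))  # URL (URL pattern tried first)
print(is_junk("(c)/test"))  # True
print(is_junk("Copyright (c) 2024 Example Corp"))  # False (made-up notice)
```

With ordered patterns, whichever rule comes first claims the token, so moving the URL rule ahead of the `(c)` rule changes the tag without touching either regex. The junk filter is a second line of defence for fragments that were already split away from their URL.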