Copyright detection sees URLs containing copyright symbols as copyright statements #4724

@petergardfjall

Description

Scancode sometimes reports copyright statements where none actually exist.
In this case it appears to be thrown off by URLs containing a copyright symbol.
(I'm filing this as a bug, although it could also be considered a feature/improvement.)

One case is this test file of 10,000 URLs (see urls.10K), where Scancode v32.5.0 reports "copyright statements" such as:

{
  "copyright": "(c) /axove+xose/7,-1,0,B/frameset&F axuntanza&1,,3 http://biblio.cesga.es:81/search",
  "start_line": 4569,
  "end_line": 4570
}

and

{
  "copyright": "(c) /axove+xose/7,-1,0,B/frameset&F axuntanza&3,,3 http://db.zaq.ne.jp/asp/bbs/jttk_baasc506_1/article/36 http://db.zaq.ne.jp/asp/bbs/jttk_baasc506_1/article/37 http://db.zaq.ne.jp/asp/bbs/jttk_baasc506_1/article/35 http://db.zaq.ne.jp/asp/bbs/jttk_baasc506_1/article/21 http://www.tiroler-adler.com/1Root/Kontinent/6/Staat/7/Bundesland/16/Ort/691/Homepage/f_homepage...1.html http://yp.gates96.com/4/52/90/87.html http://yp.gates96.com/4/52/90/95.html http://yp.gates96.com/4/52/91/4.html http://yp.gates96.com/4/52/91/39.html http://yp.gates96.com/4/52/91/42.html http://yp.gates96.com/4/52/92/33.html http://yp.gates96.com/4/52/92/93.html http://yp.gates96.com/4/52/93/6.html http://yp.gates96.com/4/52/93/98.html http://yp.gates96.com/4/52/94/8.html http://yp.gates96.com/4/52/94/14.html http://yp.gates96.com/4/52/95/92.html http://yp.gates96.com/4/52/96/16.html http://yp.gates96.com/4/52/96/32.html http://yp.gates96.com/4/52/96/72.html http://yp.gates96.com/4/52/96/90.html http://yp.gates96.com/4/52/97/1.html http://yp.gates96.com/4/52/97/53.html http://yp.gates96.com/4/52/98/34.html http://yp.gates96.com/4/52/98/84.html http://yp.gates96.com/4/52/98/97.html http://yp.gates96.com/4/52/99/55.html http://yp.gates96.com/4/52/99/68.html http://yp.gates96.com/14/40/10/3.html http://yp.gates96.com/14/40/10/86.html http://yp.gates96.com/14/40/11/53.html http://yp.gates96.com/14/40/12/74.html http://yp.gates96.com/14/40/13/11.html http://yp.gates96.com/14/40/13/34.html http://yp.gates96.com/14/40/13/45.html http://yp.gates96.com/14/40/13/79.html http://yp.gates96.com/14/40/14/3.html http://yp.gates96.com/14/40/14/6.html http://yp.gates96.com/14/40/14/25.html http://yp.gates96.com/14/40/14/84.html http://yp.gates96.com/14/40/14/88.html http://yp.gates96.com/14/40/15/39.html http://yp.gates96.com/14/40/15/40.html http://yp.gates96.com/14/40/16/64.html http://yp.gates96.com/14/40/16/92.html http://yp.gates96.com/14/40/17/69.html 
http://yp.gates96.com/14/40/18/42.html http://yp.gates96.com/14/40/18/82.html http://yp.gates96.com/14/40/19/36.html http://www.4positiveimages.com/4positiveimages/727410225/IconBar http://www.teacherformation.org/html/od/facilitators.cfm/task1,login/discussion_id,2/xid,6559/yid,6157439 http://www.secinfo.com/dSU5m.74.htm http://www.secinfo.com/dSU5m.7v.htm http://www.secinfo.com/dSU5m.6y.htm http://www.secinfo.com/dSU5m.5c.htm http://linux.softhouse.com.cn/linux/knowledge/tech/qs/linux5.htm http://linux.softhouse.com.cn/linux/knowledge/tech/qs/linux10.htm http://freesoftware.subportal.com/sn/Programming/Visual_Basic_Components_H-P/993.html http://dk.egroups.com/message/NGHILUAN/2881 http://dk.egroups.com/message/NGHILUAN/2889 http://www.cga.state.ct.us/ps98/cbs/H/hj-0084.htm http://apple.excite.com/entertainment/music/artists_and_genres/jazz/new_world_jazz/afro_cuban/ http://www.euronet.nl/users/hiroshi/ksweb/interest.htm http://library.bangor.ac.uk/search/aMollica,+Anthony/amollica+anthony/-5,-1,0,B/browse http://www.fogdog.com/cedroID/ssd3040183308040/nav/products/featured_brands/14t/all/ http://www.brio.de/BRIO.catalog/39fe2f8d0912d4962740d472aa780701/UserTemplate/9 http://www.hig.se/(accessed,autoformat,referrer,sqloutput,tablify)/~jackson/roxen/ http://www.newstimescybermall.com/Mall/Catalog/Product/ASP/product-id/206059/store-id/1000010991.html http://www6.163.com/news/p-item/0,1587,economy_1916,00.html http://ftp.uni-stuttgart.de/pub/systems/sgi/graphics/lib/?D A http://preview.egroups.com/message/abdou3/152 http://ch.php.net/manual/it/function.pg-loimport.php http://yp.gates96.com/1/94/30/39.html http://yp.gates96.com/1/94/30/78.html http://yp.gates96.com/1/94/31/11.html http://yp.gates96.com/1/94/31/72.html http://yp.gates96.com/1/94/31/85.html http://yp.gates96.com/1/94/32/25.html http://yp.gates96.com/1/94/32/45.html http://yp.gates96.com/1/94/32/74.html http://yp.gates96.com/1/94/33/20.html http://yp.gates96.com/1/94/33/68.html 
http://yp.gates96.com/1/94/34/92.html http://yp.gates96.com/1/94/35/1.html http://yp.gates96.com/1/94/35/50.html http://yp.gates96.com/1/94/35/60.html http://yp.gates96.com/1/94/37/0.html http://yp.gates96.com/1/94/37/46.html http://yp.gates96.com/1/94/37/47.html http://yp.gates96.com/1/94/37/61.html http://yp.gates96.com/1/94/38/19.html http://yp.gates96.com/1/94/39/49.html http://yp.gates96.com/1/94/39/57.html http://www.gartenfachmarkt.de/beratung_garten/duengen_und_kompostieren/anlage_und_vorarbeiten/fertig.htm http://support.dell.com/docs/storage/4955r/en/Hw/setup.htm http://www.hig.se/(clientname,header,sort,sqlquery,sqltable)/~jackson/roxen/ http://www.mic.hr/PGBURZA:423870 http://www.mic.hr/PGNEWS:423870 http://members.tripod.com/Tess_Tom/my_photoalbum/page12.html http://legalminds.lp.findlaw.com/list/courtinterp-spanish/nav05815.html http://www.fogdog.com/cedroID/ssd3040183239698/crs/pn /wld/fogdog_sports/pearl_izumi/road_cycling/apparel/classic_vest.html http://www.fogdog.com/cedroID/ssd3040183239698/nav/stores/wakeboarding/ http://troy.lib.sfu.ca/search/slogos/slogos/-5,-1,0,E/frameset&F slogistics+and+transportation+review&1,,0 http://themes.tucows.dia.dk/skins/icq/preview/54718.html http://php.nic.fi/manual/html/function.shm_open.html http://ftp.fi.debian.org/debian/dists/woody/non-free/binary-m68k/misc/?S A http://www.csupomona.edu/reference/java/jdk1.2/docs/api/org/omg/CORBA/class-use/CompletionStatus.html http://www.trax.nilex.co.uk/trax.cgi/A1C/B1U/A1D/C1R/A1D/B1R/ http://www.uni-duesseldorf.de/ftp/ftp/software/opt/cpio-2.4.2/?M A http://ep.com/js/about/c9079/b0/250918.html http://polygraph.ircache.net:8181/busi/html/http_-2www.dirtsports.com/index.html-ssi http://www.chabadlibrary.org/ecatalog/EC06/EC06232.HTM http://chat.hani.co.kr/NetBBS/Bbs.dll/brief/lst/qqa/f/qqo/PRMY/zka/B23qB2Bm http://209.207.239.212/bkindex/c1047/f1128.html http://tv.thevines.com/leaf/AA0000364429/4/1 http://tv.thevines.com/leaf/AA0000364429/4//&order_by WORST 
http://www2.hgo.se/Kurskatalogen.nsf/a49e2126c83c4922412566f60052f831/f24cc45affc890eec12565d80055e0b9!OpenDocument&ExpandSection",
  "start_line": 4570,
  "end_line": 4687
}

How To Reproduce

scancode --json-pp - --copyright urls.10K.txt

(Find the file attached to the issue.)
urls.10K.txt
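
To help triage results like the ones above, here is a minimal sketch of a post-processing filter over the scan output. It assumes the JSON has a top-level `files` list whose entries carry a `copyrights` list with the `copyright`, `start_line` and `end_line` fields shown above. The function name `suspicious_copyrights` and its thresholds are hypothetical and not part of Scancode; the heuristic is only a workaround for spotting likely false positives, not a fix for the detector itself.

```python
import re

# Matches the start of an http(s) URL anywhere in a detected statement.
URL_RE = re.compile(r"https?://", re.IGNORECASE)


def suspicious_copyrights(scan, max_lines=5, max_urls=0):
    """Yield (path, detection) pairs that look like URL-induced false positives.

    A detection is flagged when it spans more than `max_lines` lines or
    contains more than `max_urls` URLs. Both thresholds are illustrative
    assumptions and will misclassify some legitimate statements.
    """
    for file_entry in scan.get("files", []):
        for detection in file_entry.get("copyrights", []):
            text = detection.get("copyright", "")
            start = detection.get("start_line", 0)
            end = detection.get("end_line", 0)
            line_span = end - start + 1
            url_count = len(URL_RE.findall(text))
            if line_span > max_lines or url_count > max_urls:
                yield file_entry.get("path"), detection
```

One way to use it: write the scan to a file with `scancode --json-pp scan.json --copyright urls.10K.txt`, load that file with `json.load`, and iterate over `suspicious_copyrights(scan)`. Both detections quoted above would be flagged, the first because it contains a URL and the second because it also spans more than 100 lines.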

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? (Windows/MacOS/Linux): Linux

  • What version of scancode-toolkit was used to generate the scan file? 32.5.0

  • What installation method was used to install/run scancode? (pip/source download/other): pip
