Removing url matching from tokenizer #13685
Answered by chelle-rr
chelle-rr asked this question in Help: Coding & Implementations
-
Hi! I'm trying to use spaCy to work with a set of filepaths and names, so I've needed to set some specific tokenization rules. However, I'm getting some unexpected results, and nlp.tokenizer.explain shows that the tokenizer is occasionally matching parts of a filename as a URL. Is there a way to disable this?
Output:
Desired output:
I'm very new to this, so please excuse me if I'm missing something obvious. Thank you!
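For anyone reproducing the diagnosis, a minimal sketch along these lines shows which rule produced each piece (the path is a made-up stand-in and spacy.blank is an assumption, not the original pipeline):

import spacy

nlp = spacy.blank("en")  # assumed blank pipeline; any pipeline with the default tokenizer behaves similarly

text = "backup/report.v2.io"  # hypothetical filepath; dotted names like this may be picked up by the URL pattern
for rule, substring in nlp.tokenizer.explain(text):
    print(rule, substring)  # URL_MATCH entries mark spans caught by the tokenizer's URL pattern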
Answered by chelle-rr on Nov 12, 2024
Replies: 1 comment
-
If anyone needs it later, here's the answer:
nlp.tokenizer.url_match = None
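For a slightly fuller sketch of the fix in context (again assuming a blank pipeline and a made-up path rather than the original setup):

import spacy

nlp = spacy.blank("en")  # assumed setup; a loaded model works the same way

# Drop the tokenizer's built-in URL matching so filepath-like strings
# are no longer carved up as URLs.
nlp.tokenizer.url_match = None

text = "backup/report.v2.io"  # hypothetical filepath for illustration
print([t.text for t in nlp(text)])
print(nlp.tokenizer.explain(text))  # with url_match disabled, no URL_MATCH entries appear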
Answer selected by chelle-rr