spaCy is a library for advanced natural language processing in Python. It features state-of-the-art speed and accuracy, a concise API, and excellent documentation. It enables users to quickly and accurately create linguistic annotations for text, such as part-of-speech tagging, dependency parsing, and named entity recognition. spaCy is widely used in industry for applications such as information extraction, question answering, and summarization.
Load statistical model: spacy.load(model)
import spacy
# nlp = spacy.load("ru_core_news_sm")
# nlp = spacy.load("ru_core_news_lg")
nlp = spacy.load('en_core_web_sm')
Process a string: nlp(string)
doc = nlp('Evidently someone with the authority to make decisions has arrived.')
Upon calling nlp(), we have access to:
- .text : tokens
- .lemma_ : lemmatised tokens
- .sents : sentences
- .pos_ : coarse POS tags
- .tag_ : fine POS tags
- .dep_ : dependency labels
- .head.text : syntactic head token
- .ents (.text, .label_) : named entities
- .noun_chunks : base noun phrases
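As a quick sketch (using the doc defined above), we can print several of these attributes side by side for each token:
for token in doc:
    # text, lemma, coarse & fine POS tags, dependency label & syntactic head
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text)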
- Tokenization is the process of breaking down a text string into individual words, phrases, symbols, or other meaningful elements called tokens
- Tokenization is an important step in natural language processing (NLP) when analyzing text data
- Tokenization helps to identify the structure of a sentence, and it also helps to identify the parts of speech for each word
[token.text for token in doc]
['Evidently',
'someone',
'with',
'the',
'authority',
'to',
'make',
'decisions',
'has',
'arrived',
'.']
- Lemmatisation is a process of reducing inflected words to their base forms
- It is used to reduce the complexity of natural language processing by converting words into their dictionary form
- This helps computers to understand the meaning of a sentence more accurately
[token.lemma_ for token in doc]
['evidently',
'someone',
'with',
'the',
'authority',
'to',
'make',
'decision',
'have',
'arrive',
'.']
- Part-of-speech (POS) tagging in natural language processing (NLP) is the process of classifying words into their parts of speech, such as noun, verb, adjective, adverb, etc.
- It is used to analyse the structure of sentences and to identify relationships between words
- This helps computers to understand the meaning of a sentence and to interpret the context of a conversation
- We have access to coarser (.pos_) and finer (.tag_) methods; POS tagging is used to assign tags to words, such as nouns, verbs, etc.
# coarse POS tags
[token.pos_ for token in doc]
['ADV',
'PRON',
'ADP',
'DET',
'NOUN',
'PART',
'VERB',
'NOUN',
'AUX',
'VERB',
'PUNCT']
# fine POS tags
[token.tag_ for token in doc]
['RB', 'NN', 'IN', 'DT', 'NN', 'TO', 'VB', 'NNS', 'VBZ', 'VBN', '.']
# dependency labels
[token.dep_ for token in doc]
['advmod',
'ROOT',
'prep',
'det',
'pobj',
'aux',
'acl',
'nsubj',
'aux',
'ccomp',
'punct']
# syntactic head tokens
[token.head.text for token in doc]
['someone',
'someone',
'someone',
'authority',
'with',
'make',
'authority',
'arrived',
'arrived',
'make',
'someone']
# Named Entities
[(ent.text, ent.label_) for ent in doc.ents]
[]
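The example sentence contains no named entities, hence the empty list. As a hedged sketch, a sentence mentioning people and places should yield entities (the exact spans and labels depend on the model):
doc_ner = nlp('Agent Cooper arrived in Twin Peaks on Monday.')
# with en_core_web_sm we would typically expect labels such as
# PERSON, GPE & DATE, though exact predictions vary by model version
print([(ent.text, ent.label_) for ent in doc_ner.ents])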
[sent.text for sent in doc.sents]
['Evidently someone with the authority to make decisions has arrived.']
[chunk.text for chunk in doc.noun_chunks]
['Evidently someone', 'the authority', 'decisions']
The content of the output actually depends on the pipeline:
- Each model has a preloaded pipeline of NLP operations
- Upon calling nlp() on a string, these pipeline components are run
- We can visualise the pipeline steps using .pipe_names & their contents with .pipeline
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print(nlp.pipeline)
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7fab27982670>),
('tagger', <spacy.pipeline.tagger.Tagger object at 0x7fab2fe42130>),
('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7fab290957d0>),
('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7fab0100d280>),
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7fab01005230>),
('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7fab2908eed0>)]
Depending on the pipeline, the content can slightly vary:
- tagger : assign part-of-speech tags
- parser : assign dependency labels
- ner : assign named entities
- entity_linker : assign knowledge base IDs to named entities; should be added after the entity recognizer
- entity_ruler : assign named entities based on pattern rules and dictionaries
- textcat : assign text categories: exactly one category is predicted per document
- textcat_multilabel : assign text categories in a multi-label setting: zero, one or more labels per document
- lemmatizer : assign base forms to words using rules and lookups
- trainable_lemmatizer : assign base forms to words
- morphologizer : assign morphological features and coarse-grained POS tags
- attribute_ruler : assign token attribute mappings and rule-based exceptions
- senter : assign sentence boundaries
- sentencizer : add rule-based sentence segmentation without the dependency parse
- tok2vec : assign token-to-vector embeddings
- transformer : assign the tokens and outputs of a transformer model
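As a small sketch of assembling a pipeline ourselves, we can add one of these components by name, e.g. the rule-based sentencizer on a blank English pipeline:
import spacy
# start from a blank pipeline containing only the tokenizer
nlp_blank = spacy.blank('en')
# add rule-based sentence segmentation (no dependency parse needed)
nlp_blank.add_pipe('sentencizer')
doc_blank = nlp_blank('Smells like hubris. I think I smell cologne.')
print([sent.text for sent in doc_blank.sents])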
- If we don't want to run all components in the pipeline, we can disable them when loading the model
- For example, if we didn't want the parser & ner components of the pipeline:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
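Alternatively, spaCy v3 lets us disable components temporarily with nlp.select_pipes; a minimal sketch, using a freshly loaded full pipeline:
# a fresh, full pipeline for this sketch
nlp_full = spacy.load('en_core_web_sm')
# temporarily disable the parser & ner for this block only
with nlp_full.select_pipes(disable=['parser', 'ner']):
    doc = nlp_full('Evidently someone with the authority to make decisions has arrived.')
    print(nlp_full.pipe_names)  # parser & ner excluded inside the block
# outside the block, the full pipeline is restored
print(nlp_full.pipe_names)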
Let's look at an example of how we can implement the spaCy pipeline:
- We will utilise spaCy to tokenise a corpus & work with the tokenised & lemmatised words (which we can access from .lemma_)
We need some extra libraries:
- We will also utilise nltk, which contains a list of so-called stop words, which we will remove
- We will also utilise string, which contains a list of punctuation characters; anything in this list we will also remove
# list of strings (our input corpus)
corpus = ['Evidently someone with the authority to make decisions has arrived.',
'I think I smell the stench of your cologne, Agent Cooper.',
'Smells like hubris.']
import spacy
import string
from nltk.corpus import stopwords
# get english stopwords (may require nltk.download('stopwords') on first use)
stopwords = stopwords.words('english')
# get punctuation characters
punctuations = string.punctuation
# load the statistical model once, outside the loop,
# disabling the components we don't need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Define function to clean up text by removing pronouns, stopwords & punctuation
def cleanup_text(docs, verbose=False):
    texts = []
    for ii, doc in enumerate(docs):
        if ii % 1000 == 0 and verbose:
            print(f"Processed {ii} out of {len(docs)} documents.")
        doc = nlp(doc)
        # keep lemmas of tokens which are not pronouns (pos_)
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.pos_ != 'PRON']
        # remove stopwords & punctuation
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        texts.append(' '.join(tokens))
    return texts
print(f'unprocessed: \n{corpus}')
processed = cleanup_text(corpus)
print(f'\nprocessed: \n{processed}')
unprocessed:
['Evidently someone with the authority to make decisions has arrived.', 'I think I smell the stench of your cologne, Agent Cooper.', 'Smells like hubris.']
processed:
['evidently authority make decision arrive', 'think smell stench cologne agent cooper', 'smell like hubris']
As with strings, we can index into the tokenised text; selecting a range of tokens is referred to as a span, and we can use : to select multiple tokens
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I like cats too.')
doc[1]
like
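A minimal sketch of spanning, using the doc defined above; slicing with : returns a Span of multiple tokens:
span = doc[1:3]  # a Span covering tokens 1 & 2
print(span.text)
# like cats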
We can also use spaCy to check the similarity between two input strings & compare them
doc1 = nlp('I like cats too')
doc2 = nlp('I prefer cats over dogs')
doc3 = nlp('Tom and John went to the library')
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
0.585439442022622
-0.012077887409129722
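Note that en_core_web_sm ships without static word vectors, so .similarity here is computed from context-sensitive tensors and spaCy will warn about it. For more meaningful scores we can use a model with real word vectors; a minimal sketch, assuming en_core_web_md has been downloaded:
# en_core_web_md includes static word vectors
# (install with: python -m spacy download en_core_web_md)
nlp_md = spacy.load('en_core_web_md')
doc1_md = nlp_md('I like cats too')
doc2_md = nlp_md('I prefer cats over dogs')
print(doc1_md.similarity(doc2_md))  # similarity of averaged word vectors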
- Words can be represented in vector format, which should allow us to capture semantic and syntactic similarity
- Select one of the tokenised words, and visualise the vector form
print(doc1[2].vector) # word vector for 'cats'
print(f'normalised {doc1.vector_norm}') # L2 norm of the document vector
[ 0.67897564 0.12110139 -0.6604409 0.0776536 -1.9418607 0.6940666
1.19576 0.6619939 0.50233465 0.05248592 -0.19295473 1.2842398
0.723488 0.64765275 -0.46113864 -0.12381127 0.31568098 0.5463978
-1.376126 -0.2555286 0.6541349 -0.71979237 -0.14713815 0.79841524
-0.9361088 -0.14297551 1.2247958 0.5354835 -0.5443332 -0.42707154
0.55424935 0.8715273 -0.25182363 -1.5841036 -0.14750908 -0.65078586
-0.17169908 -0.5729357 -0.13728389 -0.1380899 0.12419317 -0.25010562
0.06765506 0.1825014 -0.6063776 -0.7749779 1.1444601 0.5669737
-1.0870733 -0.39480096 -0.3147017 -0.10073815 -1.1867158 -1.7028933
0.72262895 0.49310595 0.600273 0.16741765 0.57921666 -1.0983374
-0.5445302 -1.3918273 -0.4013725 2.2526665 1.0592192 -0.45277885
-0.6982554 -0.31604335 0.6993128 0.4112054 -0.02864948 0.13155304
-0.77090514 -0.19796075 -0.30412257 1.1847382 -0.0515812 -0.14376783
0.39952287 0.81643397 1.226792 -0.323776 -0.27826166 -1.7565243
-0.2628545 0.1729832 -0.6630485 -0.49979892 -0.40806353 -1.7938248
-0.56411684 -0.5260589 0.8754898 2.4937406 0.05924536 -1.4038779 ]
normalised 4.218990020889313
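As a quick sanity check, .vector_norm is the L2 norm of the document's vector, which we can verify with numpy:
import numpy as np
# should match the 'normalised' value printed above (~4.219)
print(np.linalg.norm(doc1.vector))
print(doc1.vector_norm)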