Zero score for Spancat - debug data looks to be fine #12412

goonhoon · 2023-03-13T23:21:55Z

I am having trouble training a fresh spancat model. According to debug data, my number of samples is fairly low. However, it is far larger than any I have used before (although those earlier models did not have the spancat component). I have a large dataset of .txt files that I plan to process and annotate, but before I do so, I would like to know whether my current method is flawed or whether the low number of samples is the sole reason this isn't working.

To clarify: spans in the "sentences" key are whole sentences, hence the high token count (see below in data debug).

Any help is appreciated! (apologies for the wall of text; if there is a way to create code dropdowns or other functions to promote visibility, please do let me know).

=========================== Initializing pipeline ===========================
[2023-03-13 22:50:25,275] [INFO] Set up nlp object from config
[2023-03-13 22:50:25,283] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-03-13 22:50:25,285] [INFO] Created vocabulary
[2023-03-13 22:50:25,286] [INFO] Finished initializing nlp object
[2023-03-13 22:50:37,959] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SENT...  SPANS_SENT...  SPANS_SENT...  SCORE
---  ------  ------------  ------------  -------------  -------------  -------------  ------
  0       0        777.72       5629.39           0.00           0.00           0.00    0.00
 10     200       1249.04      15467.39           0.00           0.00           0.00    0.00
 20     400          0.00          0.15           0.00           0.00           0.00    0.00
 30     600          0.00          0.07           0.00           0.00           0.00    0.00

config.cfg:

[paths]
train = "spacy_training"
dev = "spacy_dev"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sentences"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sentences_f = 1.0
spans_sentences_p = 0.0
spans_sentences_r = 0.0
spans_sc_f = null
spans_sc_p = null
spans_sc_r = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Debug data:

=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, spancat
20 training docs
5 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (20)

============================== Vocab & Vectors ==============================
ℹ 204048 total word(s) in the data (9281 unique)
ℹ No word vectors present in the package

============================ Span Categorization ============================

Spans Key   Labels
---------   ------------------------------
sentences   {'Term', 'Governing Law', 'Pricing', 'Notices', 'Parties', 'Assignment'}

⚠ Low number of examples for label 'Pricing' in key 'sentences'
(43)
⚠ Low number of examples for label 'Parties' in key 'sentences'
(18)
⚠ Low number of examples for label 'Governing Law' in key 'sentences'
(27)
⚠ Low number of examples for label 'Term' in key 'sentences' (25)
⚠ Low number of examples for label 'Assignment' in key 'sentences'
(5)
⚠ Low number of examples for label 'Notices' in key 'sentences' (1)
ℹ Span characteristics for spans_key 'sentences'
ℹ SD = Span Distinctiveness, BD = Boundary Distinctiveness

Span Type       Length     SD     BD    N
-------------   ------   ----   ----   --
Pricing          52.81   0.90   2.71   43
Parties          70.76   1.81   3.43   18
Governing Law    52.52   1.11   3.11   27
Term             59.24   0.90   3.09   25
Assignment       51.72   1.43   2.81    5
Notices         105.00   2.56   4.59    1
-------------   ------   ----   ----   --
Wgt. Average     57.20   1.12   3.01    -

ℹ Over 90% of spans have lengths of 1 -- 373 (min=14, max=373). The
most common span lengths are: 14 (1.68%), 19 (0.84%), 20 (0.84%), 21 (5.88%), 22
(0.84%), 23 (4.2%), 24 (1.68%), 27 (1.68%), 28 (1.68%), 29 (1.68%), 35 (1.68%),
37 (1.68%), 38 (0.84%), 39 (1.68%), 40 (1.68%), 41 (1.68%), 42 (1.68%), 43
(1.68%), 44 (1.68%), 45 (0.84%), 46 (0.84%), 48 (1.68%), 50 (0.84%), 52 (2.52%),
53 (1.68%), 54 (3.36%), 56 (0.84%), 58 (1.68%), 59 (1.68%), 61 (2.52%), 62
(1.68%), 65 (1.68%), 66 (2.52%), 68 (0.84%), 72 (0.84%), 75 (0.84%), 78 (1.68%),
79 (0.84%), 80 (2.52%), 87 (0.84%), 92 (1.68%), 97 (1.68%), 99 (4.2%), 100
(0.84%), 103 (0.84%), 105 (2.52%), 114 (0.84%), 118 (1.68%), 131 (1.68%), 137
(0.84%), 164 (0.84%), 168 (0.84%), 199 (0.84%), 240 (1.68%), 355 (0.84%), 373
(0.84%). If you are using the n-gram suggester, note that omitting infrequent
n-gram lengths can greatly improve speed and memory usage.
✔ Spans are distinct from the rest of the corpus
✔ Boundary tokens are distinct from the rest of the corpus
✔ Examples without ocurrences available for all labels

================================== Summary ==================================
✔ 6 checks passed
⚠ 6 warnings
✘ 1 error

The docbin is created like this:

import spacy
from spacy.tokens import DocBin, Span, SpanGroup

# Blank English pipeline: span_ruler finds the patterns, sentencizer sets sentence boundaries
nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
nlp.add_pipe("sentencizer")

patterns = [{"label": "Governing Law", "pattern": "governed by"}]

# other patterns not included

ruler.add_patterns(patterns)

# The context manager closes the file after reading
with open("test.txt", "r") as text_open:
    text = text_open.read()

doc = nlp(text)

print([(span.text, span.label_) for span in doc.spans["ruler"]])

# Expand each ruler match to the full sentence that contains it
doc.spans["sentences"] = SpanGroup(doc)
for sentence in doc.sents:
    for span in doc.spans["ruler"]:
        if span.start >= sentence.start and span.end <= sentence.end:
            doc.spans["sentences"] += [
                Span(doc, start=sentence.start, end=sentence.end, label=span.label_)
            ]

print(doc.spans["sentences"])

# Serialize the annotated doc for training
doc_bin = DocBin()
doc_bin.add(doc)
doc_bin.to_disk("./spacy_dev/data25.spacy")
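The expansion step above can be checked in isolation without spaCy. The following is only a sketch on plain token offsets: the function name `expand_to_sentences` and the sample offsets are made up for illustration, and the duplicate check is an addition that the script above does not perform (there, a sentence containing two matches with the same label is added twice).

```python
from typing import List, Tuple

# (start_token, end_token, label), with an exclusive end like spaCy spans
LabeledSpan = Tuple[int, int, str]


def expand_to_sentences(
    sentences: List[Tuple[int, int]], matches: List[LabeledSpan]
) -> List[LabeledSpan]:
    """Give each sentence the label of every match it fully contains."""
    expanded: List[LabeledSpan] = []
    seen = set()
    for sent_start, sent_end in sentences:
        for start, end, label in matches:
            inside = start >= sent_start and end <= sent_end
            key = (sent_start, sent_end, label)
            # Skip repeats so a sentence gets each label at most once
            # (an assumption; the original script keeps duplicates).
            if inside and key not in seen:
                seen.add(key)
                expanded.append(key)
    return expanded


# Hypothetical offsets: three sentences, two matches in the second one
sentences = [(0, 12), (12, 30), (30, 41)]
matches = [(14, 16, "Governing Law"), (20, 22, "Governing Law"), (33, 34, "Term")]
result = expand_to_sentences(sentences, matches)
print(result)
```

A match that crosses a sentence boundary is dropped by this containment test, just as in the script above.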

EDIT: Still the same issue. I now tried adding spacy-experimental.sentence_suggester.v1, and get the following error when training. I have added sentencizer in the code above, so not sure where the issue lies:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

Answered by adrianeboyd

Mar 17, 2023

To use the sentence suggester, you need to add a sentencizer (or other component that annotates sentences) to your pipeline and add that to [training.annotating_components]:

[training]
annotating_components = ["sentencizer"]

sentencizer is rule-based and the easiest to start with. If you used sentencizer when creating the .spacy files from your expanded spans, then use sentencizer here, too.
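As a rough illustration of why the component has to be listed under annotating_components, here is a toy model in plain Python. It is not spaCy's training loop: `toy_sentencizer`, `toy_sentence_suggester` and `suggest_during_training` are hypothetical stand-ins, and the only assumption carried over is that components outside the annotating list do not write their annotations onto the docs that the suggester sees during training.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, list]  # toy stand-in for a spaCy Doc


def toy_sentencizer(doc: Doc) -> Doc:
    """Start a new sentence after every period (crude stand-in)."""
    tokens = doc["tokens"]
    doc["sent_starts"] = [0] + [
        i + 1 for i, tok in enumerate(tokens[:-1]) if tok == "."
    ]
    return doc


def toy_sentence_suggester(doc: Doc) -> List[Tuple[int, int]]:
    """Propose one candidate span per sentence; needs sentence boundaries."""
    if "sent_starts" not in doc:
        raise ValueError("Sentence boundaries unset")
    starts = doc["sent_starts"]
    ends = starts[1:] + [len(doc["tokens"])]
    return list(zip(starts, ends))


def suggest_during_training(
    doc: Doc, components: Dict[str, Callable[[Doc], Doc]], annotating: List[str]
) -> List[Tuple[int, int]]:
    """Only components listed in `annotating` get to annotate the doc."""
    for name, component in components.items():
        if name in annotating:
            doc = component(doc)
    return toy_sentence_suggester(doc)


components = {"sentencizer": toy_sentencizer}
tokens = ["A", "b", ".", "C", "d", "."]

spans = suggest_during_training({"tokens": list(tokens)}, components, ["sentencizer"])
print(spans)

try:
    suggest_during_training({"tokens": list(tokens)}, components, [])
except ValueError as err:
    print("without annotating component:", err)
```

In this toy, being in the pipeline is not enough: the second call fails in the same way as the E030 error, because the sentencizer never annotated the doc.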

10 replies

goonhoon Mar 20, 2023
Author

Since this relates to the training and model above, I thought it better not to start a new thread:

The model now finishes training at a score of around 0.35, although with a fairly low loss, which is good given how rough it is in its current state. However, I noticed that the most commonly occurring span label (present in about 50% of my 510 documents) never gets predicted, even when tested on documents the model was trained on.

These are the patterns for my ruler annotation of the 'effective date' sentence, usually structured in a very similar way.

{"label": "Effective Date", "pattern": "This Agreement shall become effective "},
{"label": "Effective Date", "pattern": "Effective Date means the date"},
{"label": "Effective Date", "pattern": '" Effective Date "'},
{"label": "Effective Date", "pattern": '"Effective Date"'},
{"label": "Effective Date", "pattern": '"EFFECTIVE DATE'},
{"label": "Effective Date", "pattern": '" EFFECTIVE DATE "'},

This is the sentence in displacy it did not pick up (I tested this on about 40 documents, including those it was trained on, and nowhere does it get predicted):

Could this be an issue with sentencizer and expanding my spans onto the sentence level?

ruler.add_patterns(patterns)

for i in range(1, 510):

    # The context manager closes each file after reading
    with open(f"inputfiles/ ({i}).txt", "r", encoding='utf8') as text_open:
        text = text_open.read()
    doc = nlp(text)

    print([(span.text, span.label_) for span in doc.spans["ruler"]])

    # Expand each ruler match to the full sentence that contains it
    doc.spans["sentences"] = SpanGroup(doc)
    db = DocBin()
    for sentence in doc.sents:
        for span in doc.spans["ruler"]:
            if span.start >= sentence.start and span.end <= sentence.end:
                doc.spans["sentences"] += [
                    Span(doc, start=sentence.start, end=sentence.end, label=span.label_)
                ]

    print(doc.spans["sentences"])

Thanks!

adrianeboyd Mar 20, 2023

You do have to make sure that the sentence spans used while annotating are the same spans used while predicting. Is it possible that you used en_core_web_sm (sentence boundaries from parser) when annotating but sentencizer while training?

The faster/simpler option is to also use sentencizer when creating the training corpus. (You can also use parser for both, but it's slower and trickier to set up in the config.)
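To see why mismatched sentence boundaries hurt, recall that spancat only scores the spans its suggester proposes. A gold span that no candidate matches exactly can never be predicted. The sketch below uses made-up token offsets and a hypothetical helper `reachable_recall` to compute that upper bound on recall.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # (start_token, end_token), end exclusive


def reachable_recall(gold: List[Offsets], candidates: List[Offsets]) -> float:
    """Fraction of gold spans that exactly match some suggested candidate."""
    if not gold:
        return 0.0
    candidate_set = set(candidates)
    return sum(1 for span in gold if span in candidate_set) / len(gold)


# Gold sentence spans as segmented while annotating (hypothetical offsets)
gold_spans = [(0, 12), (12, 30)]

# Same segmenter at training time: every gold span is a candidate
same_segmenter = [(0, 12), (12, 30), (30, 41)]

# A different segmenter splits the second sentence in two
other_segmenter = [(0, 12), (12, 25), (25, 30), (30, 41)]

print(reachable_recall(gold_spans, same_segmenter))
print(reachable_recall(gold_spans, other_segmenter))
```

With the second segmenter, the gold span (12, 30) is never offered to the model, so that label cannot be learned or predicted for it regardless of how long training runs.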

goonhoon Mar 21, 2023
Author

This was indeed the case. I loaded a blank "en" model and it worked, and it also returned much better scores (around 0.97 overall). Training took about 2 hours and kept freezing my machine with 16 GB of RAM, but other than that I am happy to finally get promising results.

Thanks again for all the help @adrianeboyd !

goonhoon Apr 2, 2023
Author

To follow up with the same question, this time for the ngram range suggester:

Is there any component I need to add to [components], to annotating_components = [], or to the pipeline? I set my ngram suggester to cover the same range of span lengths as reported in my debug data, but I get 0.00 scores and a loss pattern very similar to the one in the original output posted above. I do not use sentencizer, and I do load the en_core_web_sm model.

adrianeboyd Apr 17, 2023

If you want to use parser from en_core_web_sm for sentence boundaries, you need to source tok2vec and parser and then add both to frozen_components and annotating_components.

In general, having longer spans makes this difficult and something like span_finder is going to be easier for you to use than the ngram suggester. We're currently moving it from spacy-experimental to spacy, so hopefully it'll be available as a standard component soon.
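A back-of-the-envelope count shows why long spans are hard on an n-gram suggester, which proposes every contiguous token span whose length falls in the configured range. The sketch below assumes a hypothetical 8,000-token document and the span length range from the debug data above (14 to 373). The helper `ngram_candidate_count` is illustrative and not part of spaCy.

```python
def ngram_candidate_count(n_tokens: int, min_size: int, max_size: int) -> int:
    """Number of contiguous spans with min_size <= length <= max_size."""
    return sum(
        max(n_tokens - size + 1, 0) for size in range(min_size, max_size + 1)
    )


# Hypothetical document length; span length range taken from the debug data
n_tokens = 8000
candidates = ngram_candidate_count(n_tokens, min_size=14, max_size=373)
print(f"{candidates:,} candidate spans for one {n_tokens}-token document")
```

Under these assumptions a single document yields millions of candidates, of which only a handful are gold spans, whereas a sentence-level suggester proposes one candidate per sentence.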
