Utilize CLS-Token of transformers in textcat component #7178
Replies: 4 comments
-
Yes, you can either build the whole model yourself in PyTorch (including using the transformer directly), or define the model from spaCy and Thinc layers. Here's how to do it the latter way. This turned out to be a great question, because it pointed to a few gaps in the layers we provide. I've drafted the solution, but I haven't run it yet.

### Getting one [CLS] per Doc

The transformer component processes texts as batches of (possibly overlapping) spans rather than whole documents. Using this overlapping span-based approach, you'll have multiple spans, and therefore multiple class tokens, per `Doc`. If you can configure the span getter to produce one span per doc, each doc will have exactly one class token. In case it's not practical to do this, you'll need to define the model such that it can handle multiple class tokens per doc.

### Model definition and config

Next you'll need to define a model for your textcat. Here's a model that connects to a shared transformer, gets the class tokens for each span, mean-pools them, and passes the result through a linear layer with softmax activation to predict the class probabilities.

**Model definition**

```python
from typing import List

from thinc.api import Model, Softmax, chain, reduce_mean, list2ragged, array_getitem
from thinc.types import Floats2d
from spacy.tokens import Doc
from spacy.util import registry
from spacy_transformers.data_classes import TransformerData
from spacy_transformers.layers import TransformerListener


@registry.architectures.register("TransformerListenerClassTokenTextcat.v1")
def transformer_listener_class_tok2vec_v1(
    tensor_index: int,
    class_index: int,
) -> Model[List[Doc], Floats2d]:
    # I'm assuming that we can have more than one span per doc, and we're going
    # to average the class vectors for the spans if there are multiple.
    # trf2tensor and foreach are defined further down.
    return chain(
        TransformerListener(upstream_name="*"),  # List[Doc] -> List[TransformerData]
        trf2tensor(tensor_index),  # List[TransformerData] -> List[Floats3d]
        foreach(
            # This does array[:, class_index],
            # i.e. we're getting the class array for each span.
            array_getitem((slice(0, None), class_index))  # Floats3d -> Floats2d
        ),  # List[Floats3d] -> List[Floats2d]
        list2ragged(),  # List[Floats2d] -> Ragged
        reduce_mean(),  # Ragged -> Floats2d
        Softmax(),  # Floats2d -> Floats2d
    )
```

### Config for transformer and textcat components
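The config section wasn't included in the draft. Here's a sketch of what it might look like, assuming a shared `transformer` component and the architecture registered above; `tensor_index = -1` (last hidden-state tensor) and `class_index = 0` (BERT-style [CLS] as the first wordpiece) are assumptions you'd adjust for your model:

```ini
[components.transformer]
factory = "transformer"

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "TransformerListenerClassTokenTextcat.v1"
tensor_index = -1
class_index = 0
```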
### trf2tensor layer (we should add this to spacy-transformers)

```python
from typing import Callable, List, Tuple, TypeVar

from thinc.api import Model
from thinc.types import FloatsXd
from spacy_transformers.data_classes import TransformerData

OutT = TypeVar("OutT", bound=FloatsXd)


def trf2tensor(index: int) -> Model[List[TransformerData], OutT]:
    """Extract just one tensor from each TransformerData."""
    return Model(
        "trf2tensor",
        forward,
        attrs={"index": index}
    )


def forward(model: Model, Xs: List[TransformerData], is_train: bool) -> Tuple[OutT, Callable]:
    index = model.attrs["index"]
    Ys = [x.tensors[index] for x in Xs]

    def backprop_trfs2tensor(dYs: List[OutT]) -> List[TransformerData]:
        dXs = []
        for X, dY in zip(Xs, dYs):
            d_tensors = []
            for j, tensor in enumerate(X.tensors):
                if j == index:
                    d_tensors.append(dY)
                else:
                    # Zero gradient for the tensors we didn't use.
                    d_tensors.append(model.ops.alloc(tensor.shape, dtype=tensor.dtype))
            dXs.append(
                TransformerData(
                    tensors=d_tensors,
                    wordpieces=X.wordpieces,
                    align=X.align
                )
            )
        return dXs

    return Ys, backprop_trfs2tensor
```

### foreach layer (should go in Thinc)

```python
from typing import Callable, List, Tuple, TypeVar

from thinc.api import Model

InT = TypeVar("InT")
OutT = TypeVar("OutT")
# I could've sworn I implemented this in thinc already =/. Maybe it was in a branch
# that got abandoned or something?
# In any case, it maps a layer across a list.
def foreach(layer: Model[InT, OutT]) -> Model[List[InT], List[OutT]]:
return Model("foreach", forward_foreach, layers=[layer])
def forward_foreach(
    model: Model[List[InT], List[OutT]],
    Xs: List[InT],
    is_train: bool
) -> Tuple[List[OutT], Callable[[List[OutT]], List[InT]]]:
    layer = model.layers[0]
    Ys = []
    callbacks = []
    for X in Xs:
        Y, get_dX = layer(X, is_train)
        Ys.append(Y)
        callbacks.append(get_dX)

    def backprop_foreach(dYs: List[OutT]) -> List[InT]:
        return [callback(dY) for callback, dY in zip(callbacks, dYs)]
    return Ys, backprop_foreach
```

### Other tips

You would need to ensure that the code that registers your architecture actually runs before the config is loaded, e.g. by passing the file to `spacy train` via the `--code` argument.
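The forward/backprop pairing that `foreach` relies on can be illustrated without Thinc at all. This is just a sketch with a made-up toy "doubling layer" (not part of the draft above), showing how one backprop callback is collected per list item:

```python
import numpy as np


def make_double_layer():
    """A toy 'layer': forward doubles the input, backprop doubles the gradient."""
    def forward(X):
        def backprop(dY):
            return dY * 2.0
        return X * 2.0, backprop
    return forward


def foreach(layer):
    """Map a forward/backprop-style layer over a list of inputs,
    collecting one backprop callback per item."""
    def forward(Xs):
        Ys, callbacks = [], []
        for X in Xs:
            Y, get_dX = layer(X)
            Ys.append(Y)
            callbacks.append(get_dX)

        def backprop(dYs):
            # Route each item's gradient through its own callback.
            return [cb(dY) for cb, dY in zip(callbacks, dYs)]

        return Ys, backprop
    return forward


mapped = foreach(make_double_layer())
Xs = [np.ones((2, 3)), np.ones((4, 3))]
Ys, backprop = mapped(Xs)
dXs = backprop([np.ones_like(Y) for Y in Ys])
```

Each input keeps its own shape through the round trip, which is exactly what the spans-per-doc lists need.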
-
There are a couple of follow-up tasks someone could help with here.
-
This is so helpful. Thank you!
-
The "foreach" layer is now in thinc v8.0.2 as `map_list`.
-
I was going over spaCy 3.0 and building a classification model with transformer + textcat components. I just realized that the textcat inputs are the aligned outputs of the transformer tokens. Since the CLS token encodes document-wide information, is there a way to utilize the CLS token of the transformer and pass it to the textcat component? Or is it already being utilized and am I missing something?
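To illustrate what "the CLS token" refers to here (a toy NumPy sketch with made-up shapes): for BERT-style models the class token is the first wordpiece, so given a per-doc tensor of shape `(n_spans, n_wordpieces, width)`, taking index 0 along the wordpiece axis gives one class vector per span, which can then be pooled into a single doc vector.

```python
import numpy as np

# Toy last-layer hidden states for one doc processed as 2 spans:
# shape (n_spans, n_wordpieces, width); the class token is wordpiece 0.
tensor = np.random.rand(2, 6, 4)

class_vectors = tensor[:, 0]             # (n_spans, width): one [CLS] vector per span
doc_vector = class_vectors.mean(axis=0)  # (width,): mean-pooled over the doc's spans
```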