Document-Level Embeddings in Transformer Model #11715
Unanswered · sunnyifan asked this question in Help: Model Advice
While using a Transformer-based model like `en_core_web_trf`, two tensors are exposed from `trf_data`:

- `num_docs * num_tokens * hidden_size`;
- `num_docs * hidden_size`.

How should we interpret the per-document embeddings?
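For concreteness, here is a minimal sketch of how those two tensors can be inspected. It assumes the `en_core_web_trf` pipeline is installed; `doc._.trf_data` and its `tensors` list come from spacy-transformers.

```python
# Minimal sketch: inspecting the tensors exposed on trf_data.
# Assumes the en_core_web_trf pipeline is installed.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Document-level embeddings are exposed on the transformer output.")

trf_data = doc._.trf_data
# tensors[0]: token-level output, (batch, num_wordpiece_tokens, hidden_size)
# tensors[1]: pooled output, (batch, hidden_size)
for i, tensor in enumerate(trf_data.tensors):
    print(i, tensor.shape)
```

Note that the first dimension counts the sequences actually sent to the transformer, which (per the reply below) need not equal the number of Docs once long texts are split into spans.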
From the model structure of RoBERTa, it's likely that the per-document embeddings are the last-layer embedding of `[CLS]` passed through a linear layer and then a tanh. Was this final Linear-tanh layer trained for a specific task?
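For reference, the sketch below re-implements the first-token Linear-plus-tanh pooler as it appears in HuggingFace transformers' BERT-style models; it illustrates the hypothesized computation and is not spaCy's or the pipeline's actual code.

```python
# Simplified re-implementation of a HuggingFace-style pooler head:
# take the last-layer embedding of the first token (<s>, RoBERTa's
# [CLS] equivalent) and pass it through a Linear layer and a tanh.
import torch
from torch import nn

class Pooler(nn.Module):
    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size)
        first_token = last_hidden_state[:, 0]  # (batch, hidden_size)
        return self.activation(self.dense(first_token))

# Example: pooled output for a batch of 2 sequences of length 10.
pooled = Pooler(768)(torch.randn(2, 10, 768))
print(pooled.shape)  # torch.Size([2, 768])
```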
Replies: 1 comment

Just double-checking: you're aware of how longer texts are split into overlapping strided spans with the span getter (https://spacy.io/api/transformer#span_getters)? So your … With the default …
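To make the span getter's behavior concrete, here is a conceptual sketch of strided slicing. The `strided_spans` helper is hypothetical (a paraphrase, not spacy-transformers' implementation), and the `window=128` / `stride=96` values are an assumption about the trained pipelines' defaults; the authoritative values live in the pipeline's config under `[components.transformer.model.get_spans]`.

```python
# Conceptual sketch of a strided span getter: slice a Doc into
# overlapping windows of `window` tokens, advancing by `stride`.
# window/stride values are assumed defaults; check your pipeline's config.
import spacy

def strided_spans(doc, window=128, stride=96):
    spans = []
    start = 0
    while start < len(doc):
        spans.append(doc[start : start + window])  # clipped at doc end
        if start + window >= len(doc):
            break
        start += stride
    return spans

nlp = spacy.blank("en")
doc = nlp(" ".join(["token"] * 300))
for span in strided_spans(doc):
    print(span.start, span.end)  # windows overlap by window - stride tokens
```

Each resulting span becomes one row in the batch fed to the transformer, which is presumably why the first dimension of `trf_data.tensors` counts spans rather than documents for long texts.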