Flair Embeddings

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. They differ in two key ways: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.
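To make the two properties above concrete, here is a toy sketch in plain Python. It is not Flair code and does not use a language model: the "state" is simply the run of characters a forward or backward reader would have seen when it reaches the word. The helper names (`forward_context`, `backward_context`) and the choice of where each state is read out are assumptions made for illustration.

```python
def forward_context(text, word):
    """Characters a left-to-right reader has seen once it finishes `word`."""
    end = text.index(word) + len(word)
    return text[:end]


def backward_context(text, word):
    """Characters a right-to-left reader has seen once it reaches the start of `word`."""
    start = text.index(word)
    return text[start:]


def toy_embedding(text, word):
    """Pair the forward and backward views as a stand-in for a contextual embedding."""
    return (forward_context(text, word), backward_context(text, word))


river = toy_embedding("the bank of the river", "bank")
loan = toy_embedding("the bank approved the loan", "bank")

print(river)  # ('the bank', 'bank of the river')
print(loan)   # ('the bank', 'bank approved the loan')
```

In this toy, the forward view of "bank" is identical in both sentences because the preceding text is the same, while the backward view differs. A real character language model would produce dense vectors instead of strings, but the dependence on surrounding characters is the same idea.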

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class. Currently, the following contextual string embeddings are provided (note: replace 'X' with either 'forward' or 'backward'):

| ID | Language | Embedding |
| --- | --- | --- |
| 'multi-X' | 300+ | JW300 corpus, as proposed by Agić and Vulić (2019). The corpus is licensed under CC-BY-NC-SA |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by @stefan-it: LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'de-historic-ha-X' | German (historical) | Added by @stefan-it: Historical German trained over Hamburger Anzeiger |
| 'de-historic-wz-X' | German (historical) | Added by @stefan-it: Historical German trained over Wiener Zeitung |
| 'de-historic-rw-X' | German (historical) | Added by @redewiedergabe: Historical German trained over 100 million tokens |
| 'es-X' | Spanish | Added by @iamyihwa: Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by @iamyihwa: Trained with Wikipedia, CPU-friendly |
| 'es-clinical-X' | Spanish (clinical) | Added by @matirojasg: Trained with Wikipedia |
| 'eu-X' | Basque | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by @stefan-it: LM embeddings (earlier version) |
| 'fa-X' | Persian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by @mhham: Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| 'nl-X' | Dutch | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by @stefan-it: LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by @ericlief: LM embeddings |
| 'sl-X' | Slovenian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'ta-X' | Tamil | Added by @stefan-it |
| 'pubmed-X' | English | Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |
| 'de-impresso-hipe-v1-X' | German (historical) | In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| 'en-impresso-hipe-v1-X' | English (historical) | In-domain data (Chronicling America material) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| 'fr-impresso-hipe-v1-X' | French (historical) | In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| 'am-X' | Amharic | Based on 6.5m Amharic text corpus crawled from different sources. See this paper and the official GitHub Repository for more information. |

So, if you want to load embeddings from the German forward language model, instantiate the class as follows:

flair_de_forward = FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward language model, instantiate the class as follows:

flair_bg_backward = FlairEmbeddings('bg-backward')
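The 'X' placeholder in the IDs listed above can be expanded mechanically. The following is a minimal sketch using a hypothetical helper, `expand_model_id`, which is not part of Flair; it only performs the string substitution described in the note about 'X' and does not check that a model actually exists.

```python
def expand_model_id(pattern):
    """Expand an ID such as 'news-X-fast' into its (forward, backward) variants.

    Hypothetical helper for illustration; it only rewrites the 'X' segment.
    """
    parts = pattern.split("-")
    if "X" not in parts:
        raise ValueError(f"no 'X' placeholder in {pattern!r}")
    forward = "-".join("forward" if part == "X" else part for part in parts)
    backward = "-".join("backward" if part == "X" else part for part in parts)
    return forward, backward


print(expand_model_id("de-X"))         # ('de-forward', 'de-backward')
print(expand_model_id("news-X-fast"))  # ('news-forward-fast', 'news-backward-fast')
```

Each resulting string can then be passed to the FlairEmbeddings constructor, as in the German and Bulgarian examples.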

Recommended Flair Usage

We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbeddings for most English tasks is:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbeddings object that combines GloVe and forward/backward Flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbeddings as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
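As a rough illustration of what "concatenation" means here, the sketch below joins three toy per-token vectors into one. It is plain Python and does not call Flair; the `stack` helper and the tiny vector sizes are assumptions chosen for readability, and real embeddings are far larger.

```python
def stack(*vectors):
    """Concatenate per-token vectors from several embeddings into a single vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Toy vectors standing in for one token's GloVe, forward and backward embeddings.
glove_vec = [0.1, 0.2]
forward_vec = [0.3, 0.4, 0.5]
backward_vec = [0.6, 0.7, 0.8]

token_vec = stack(glove_vec, forward_vec, backward_vec)
print(len(token_vec))  # 8
```

The length of the stacked vector is the sum of the lengths of its parts, so adding more embeddings to the stack increases the size of every token's representation.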

Pooled Flair Embeddings

We also developed a pooled variant of the FlairEmbeddings. These embeddings differ in that they constantly evolve over time, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings.

PooledFlairEmbeddings manage a 'global' representation of each distinct word by using a pooling operation over all past occurrences. More details on how this works may be found in Akbik et al. (2019).
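The pooling idea can be sketched with a toy memory in plain Python. This is not Flair's implementation: the class name `ToyPooledMemory`, the choice to append the pooled vector to the contextual one, and the use of a running aggregate instead of storing every past vector are all simplifying assumptions for illustration.

```python
class ToyPooledMemory:
    """Keeps one pooled vector per distinct word and updates it on every call."""

    def __init__(self, pooling="mean"):
        if pooling not in ("mean", "min", "max"):
            raise ValueError(f"unsupported pooling: {pooling!r}")
        self.pooling = pooling
        self.pooled = {}  # word -> pooled vector over all occurrences so far
        self.counts = {}  # word -> number of occurrences seen so far

    def embed(self, word, contextual_vec):
        """Update the memory for `word`, then return contextual + pooled vector."""
        if word not in self.pooled:
            self.pooled[word] = list(contextual_vec)
            self.counts[word] = 1
        else:
            old = self.pooled[word]
            n = self.counts[word]
            if self.pooling == "mean":
                new = [(o * n + c) / (n + 1) for o, c in zip(old, contextual_vec)]
            elif self.pooling == "min":
                new = [min(o, c) for o, c in zip(old, contextual_vec)]
            else:
                new = [max(o, c) for o, c in zip(old, contextual_vec)]
            self.pooled[word] = new
            self.counts[word] = n + 1
        return list(contextual_vec) + self.pooled[word]


memory = ToyPooledMemory(pooling="mean")
first = memory.embed("green", [1.0, 3.0])
second = memory.embed("green", [3.0, 1.0])
third = memory.embed("green", [1.0, 3.0])

print(first)           # [1.0, 3.0, 1.0, 3.0]
print(second)          # [3.0, 1.0, 2.0, 2.0]
print(first == third)  # False
```

The first and third calls pass the same contextual vector, yet the results differ because the pooled part has changed in between. The memory also gains one entry per distinct word, which hints at why the real embeddings are memory-hungry.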

You can instantiate and use PooledFlairEmbeddings like any other embedding:

from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# init embedding
flair_embedding_forward = PooledFlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

Note that while we get some of our best results with PooledFlairEmbeddings, they are very memory-inefficient since they keep past embeddings of all words in memory. In many cases, regular FlairEmbeddings will be nearly as good but with much lower memory requirements.