Releases: huggingface/transformers

Patch v2.5.1: AutoTokenizer slow by default, bug fixes

24 Feb 23:53
b90745c

AutoTokenizer

The use_fast argument of AutoTokenizer has been set back to False by default, so that there is no breaking change between 2.4.x and 2.5.x.

Fast tokenizers

Bug fixes

Slow tokenizers

Bug fixes related to batch_encode_plus

Rust Tokenizers, DistilBERT base cased, Model cards

19 Feb 16:54

Rust tokenizers (@mfuntowicz, @n1t0 )

  • Tokenizers for Bert, Roberta, OpenAI GPT, OpenAI GPT2, TransformerXL are now leveraging tokenizers library for fast tokenization 🚀
  • AutoTokenizer now defaults to fast tokenizers implementation when available
  • Calling batch_encode_plus on fast version of tokenizers will make better usage of CPU-cores.
  • Tokenizers leveraging native implementation will use all the CPU-cores by default when calling batch_encode_plus. You can change this behavior by setting the environment variable RAYON_NUM_THREADS = N
  • An exception is raised when tokenizing an input with pad_to_max_length=True but no padding token is defined.
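As a minimal sketch, the thread count can also be pinned from inside a Python program instead of the shell. This assumes the variable is set before any fast tokenizer starts parallel work in the process; the value 4 is an arbitrary example.

```python
import os

# Cap the native thread pool at 4 threads (4 is an arbitrary example value).
# Assumption: this runs before any fast tokenizer is used in this process.
os.environ["RAYON_NUM_THREADS"] = "4"

print(os.environ["RAYON_NUM_THREADS"])
```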

Known Issues:

  • RoBERTa fast tokenizer implementation has slightly different output when compared to the original Python tokenizer (< 1%).
  • The SQuAD example is not yet compatible with the new fast tokenizers, so it defaults to the plain Python tokenizer.

DistilBERT base cased (@VictorSanh)

The distilled version of the bert-base-cased BERT checkpoint has been released.

Model cards (@julien-c)

Model cards are now stored directly in the repository

CLI script for environment information (@BramVanroy)

We now host a CLI script that gathers all the environment information when reporting an issue. The issue templates have been updated accordingly.

Contributors visible on repository (@clmnt)

The main contributors as identified by Sourcerer are now visible directly on the repository.

From fine-tuning to pre-training (@julien-c )

The language fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling as it is now also able to train language models from scratch.

Extracting archives now available from cached_path (@thomwolf )

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

  • archives are extracted in the same directory as the (possibly downloaded) archive, in a newly created extraction directory named after the archive.
  • automatic extraction is activated by setting extract_compressed_file=True when calling cached_path.
  • the extraction directory is re-used to avoid extracting the archive again unless we set force_extract=True, in which case the cached extraction directory is removed and the archive is extracted again.
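The extract-once, re-use, and force behavior described above can be sketched with the standard library alone. This is not the library's implementation: the helper name extract_archive and the "-extracted" directory suffix are hypothetical choices made for illustration.

```python
import os
import shutil
import tarfile
import tempfile
import zipfile


def extract_archive(archive_path, force_extract=False):
    """Extract a zip/tar archive next to itself and return the extraction directory."""
    directory, name = os.path.split(archive_path)
    # Extraction directory named after the archive (hypothetical naming scheme).
    target = os.path.join(directory, name.replace(".", "-") + "-extracted")

    # Re-use a previous extraction unless the caller forces a fresh one.
    if os.path.isdir(target) and os.listdir(target) and not force_extract:
        return target

    # Remove any stale extraction directory, then extract again.
    shutil.rmtree(target, ignore_errors=True)
    os.makedirs(target)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as archive:
            archive.extractall(target)
    else:
        raise ValueError("{} is neither a zip nor a tar archive".format(archive_path))
    return target


# Usage: build a tiny zip archive in a temporary directory and extract it twice.
workdir = tempfile.mkdtemp()
archive_file = os.path.join(workdir, "data.zip")
with zipfile.ZipFile(archive_file, "w") as archive:
    archive.writestr("hello.txt", "hello")

first = extract_archive(archive_file)
second = extract_archive(archive_file)  # re-uses the existing directory
```

Passing force_extract=True in the second call would delete the directory and extract the archive again.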

New activations file (@sshleifer )

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and be used in the different PyTorch models.
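For reference, here is a dependency-free sketch of what these activations compute on a single float. The library versions operate on tensors; the ACTIVATIONS lookup table below is an illustrative name, not necessarily the one used in activations.py.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


# Name-to-function lookup, in the spirit of picking an activation from a config string.
ACTIVATIONS = {
    "relu": relu,
    "swish": swish,
    "gelu": gelu,
    "tanh": math.tanh,
    "gelu_new": gelu_new,
}

for name in sorted(ACTIVATIONS):
    print(name, ACTIVATIONS[name](1.0))
```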

Community additions/bug-fixes/improvements

  • Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik )
  • Cleaner and more readable code in test_attention_weights (@sshleifer)
  • XLM can be trained on SQuAD in different languages (@yuvalpinter)
  • Improve test coverage on several models that were ill-tested (@LysandreJik)
  • Fix issue where TFGPT2 could not be saved (@neonbjb )
  • Multi-GPU evaluation on run_glue now behaves correctly (@peteriz )
  • Fix issue with TransfoXL tokenizer that couldn't be saved (@dchurchwell)
  • More Robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (@monologg )
  • FlauBERT bug fix; only add langs embeddings when there is more than one language handled by the model (@LysandreJik )
  • Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz )
  • More specific testing advice in contributing (@sshleifer )
  • BERT decoder: Fix failure with the default attention mask (@asivokon )
  • Fix a few issues regarding the data preprocessing in run_language_modeling (@LysandreJik )
  • Fix an issue with leading spaces and the RobertaTokenizer (@joeddav )
  • Added pipeline: TokenClassificationPipeline, which is an alias over NerPipeline (@julien-c )

Patch v2.4.1: FlauBERT for AutoModel and AutoTokenizer

31 Jan 19:58

Patched an issue where FlauBERT couldn't be loaded with AutoModel and AutoTokenizer classes.

FlauBERT, MMBT, UmBERTo, Dutch model, improved documentation, training from scratch, clean Python code

31 Jan 14:55

FlauBERT, MMBT, UmBERTo

New TF architectures (@jplu)

  • TensorFlow XLM-RoBERTa was added (@jplu )
  • TensorFlow CamemBERT was added (@jplu )

Python best practices (@aaugustin)

  • Greatly improved the quality of the source code by leveraging black, isort and flake8. A test was added, check_code_quality, which checks that the contributions respect the contribution guidelines related to those tools.
  • Similarly, optional imports are better handled and raise more precise errors.
  • Cleaned up several requirements files, updated the contribution guidelines and rely on setup.py for the necessary dev dependencies.
  • You can clean up your code for a PR with the following commands (more details in CONTRIBUTING.md):
make style
make quality

Documentation (@LysandreJik)

The documentation was made more consistent and clearer guidelines have been defined. This work is part of an ongoing effort to make transformers accessible to a larger audience. A glossary has been added, with definitions for the most frequently used inputs.

Furthermore, some tips are given concerning each model in their documentation pages.

The code samples are now tested on a weekly basis alongside other slow tests.

Improved repository structure (@aaugustin)

The source code was moved from ./transformers to ./src/transformers. Since it changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.

Python 2 is not supported anymore (@aaugustin )

Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.

Parallel testing (@aaugustin)

Tests can now be run in parallel

Sampling sequence generator (@rlouf, @thomwolf )

An abstract method, generate, was added to PreTrainedModel and is implemented in all models trained with a causal language modeling (CLM) objective. It offers an API for text generation:

  • with/without a prompt
  • with/without beam search
  • with/without greedy decoding/sampling
  • with any (and combination) of top-k/top-p/penalized repetitions
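To make the top-k and top-p options concrete, here is a small sketch of the filtering step applied to a next-token probability distribution before sampling. It works on plain lists and is not the library's generate code; the function name top_k_top_p_filter is chosen for illustration.

```python
import random


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-k / nucleus set and renormalize the rest."""
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # top-k: keep only the k most probable tokens

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


# Usage: filter a toy 4-token distribution, then sample one token id from it.
filtered = top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.75)
random.seed(0)
token_id = random.choices(range(len(filtered)), weights=filtered)[0]
```

With these inputs only the two most probable tokens survive the filter, so the sampled id is always 0 or 1.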

Resuming training when interrupted (@bkkaggle )

Previously, when training was stopped, only the model weights and configuration were saved. The scripts now also save the global step, the current epoch, and the number of steps trained in the current epoch. When training is resumed, these values are used to pick up where it left off.

This applies to the following scripts: run_glue, run_squad, run_ner, run_xnli.
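As a rough sketch of the bookkeeping involved, the position to resume from can be derived from a saved global step. The helper resume_position and its parameter names are hypothetical and only illustrate the arithmetic; the scripts may organize this differently.

```python
def resume_position(global_step, batches_per_epoch, gradient_accumulation_steps=1):
    """Return (epochs_trained, updates_done_in_current_epoch) for a saved global step."""
    # One optimizer update consumes `gradient_accumulation_steps` batches.
    updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
    epochs_trained = global_step // updates_per_epoch
    updates_in_current_epoch = global_step % updates_per_epoch
    return epochs_trained, updates_in_current_epoch


# Usage: a checkpoint saved at global step 2500 with 1000 batches per epoch.
epochs_done, updates_to_skip = resume_position(2500, 1000)
print(epochs_done, updates_to_skip)
```

On restart, the training loop would skip the completed epochs and the first updates_to_skip updates of the current epoch.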

CLI (@julien-c , @mfuntowicz )

Model upload

  • The CLI now has better documentation.
  • Files can now be removed.

Pipelines

  • Expose the number of underlying FastAPI workers
  • Async forward methods
  • Fixed the environment variables so that they don't fight each other anymore (USE_TF, USE_TORCH)

Training from scratch (@julien-c )

The run_lm_finetuning.py script now handles training from scratch.

Changes in the configuration (@julien-c )

The configuration files now contain the architecture they're referring to. There is no need to have the architecture in the file name as it was necessary before. This should ease the naming of community models.
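A short sketch of how a tool could read the architecture from such a configuration. The JSON content below is a made-up example rather than a real checkpoint, and it assumes the architecture is stored under an "architectures" key.

```python
import json

# Illustrative configuration content; the values are examples, not a real checkpoint.
config_text = '{"architectures": ["BertForMaskedLM"], "hidden_size": 768}'

config = json.loads(config_text)
# Fall back to None when the key is absent, e.g. for older configuration files.
architecture = config.get("architectures", [None])[0]
print(architecture)
```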

New Auto models (@thomwolf )

A new type of AutoModel was added: AutoModelForPreTraining. This model returns the base model that was used during the pre-training. For most models it is the base model alongside a language modeling head, whereas for others it is a different model, e.g. BertForPreTraining for BERT.

HANS dataset (@ns-moosavi)

The HANS dataset was added to the examples. It allows for testing a model with adversarial evaluation of natural language.

[BREAKING CHANGES]

Ignored indices in PyTorch loss computing (@LysandreJik)

When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to -1. We decided to set this value to -100 instead as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.

Further help from @r0mainK.
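The effect of the ignored index can be illustrated without PyTorch. The sketch below is a plain-Python cross-entropy that skips positions labeled -100; it mirrors the idea of an ignore index and is not the code used by the models.

```python
import math

IGNORE_INDEX = -100  # the value the models now expect for ignored positions


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(value) for value in row))
        losses.append(log_norm - row[label])
    # Assumes at least one position is not ignored.
    return sum(losses) / len(losses)


# Usage: masking the middle position gives the same loss as dropping it entirely.
logits = [[2.0, 0.5], [0.1, 1.2], [3.0, -1.0]]
masked = cross_entropy(logits, [0, IGNORE_INDEX, 0])
dropped = cross_entropy([logits[0], logits[2]], [0, 0])
```

Labels that were previously set to -1 for this purpose need to be changed to -100.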

Community additions/bug-fixes/improvements

  • Can now save and load PreTrainedEncoderDecoder objects (@TheEdoardo93)
  • RoBERTa now bears more similarity to the FairSeq implementation (@DomHudson, @thomwolf)
  • Examples now better reflect the defaults of the encoding methods (@enzoampil)
  • TFXLNet now has a correct input mask (@thomwolf)
  • run_squad was fixed to allow better training for XLNet (@importpandas )
  • tokenization performance improvement (3-8x) (@mandubian)
  • RoBERTa was added to the run_squad script (@erenup)
  • Fixed the special and added tokens tokenization (@vitaliyradchenko)
  • Fixed an issue with language generation for XLM when using a batch size greater than 1 (@patrickvonplaten)
  • Fixed an issue with the generate method which did not correctly handle the repetition penalty (@patrickvonplaten)
  • Completed the documentation for repeating_words_penalty_for_language_generation (@patrickvonplaten)
  • run_generation now leverages cached past input for models that have access to it (@patrickvonplaten)
  • Finally managed to patch a rarely occurring bug with DistilBERT, eventually named DistilHeisenBug or HeisenDistilBug (@LysandreJik, with the help of @julien-c and @thomwolf).
  • Fixed an import error in run_tf_ner (@karajan1001).
  • Feature conversion for GLUE now has improved logging messages (@simonepri)
  • Patched an issue with GPUs and run_generation (@alberduris)
  • Added support for ALBERT and XLMRoBERTa to run_glue
  • Fixed an issue with the DistilBERT tokenizer not loading correct configurations (@LysandreJik)
  • Updated the SQuAD for distillation script to leverage the new SQuAD API (@LysandreJik)
  • Fixed an issue with T5 related to its rp_bucket (@mschrimpf)
  • PPLM now supports repetition penalties (@IWillPull)
  • Modified the QA pipeline to consider all features for each example (@Perseus14)
  • Patched an issue with a file lock (@dimagalat @aaugustin)
  • The bias should be resized with the weights when resizing a vocabulary projection layer with a new vocabulary size (@LysandreJik)
  • Fixed misleading token type IDs for RoBERTa. It doesn't leverage token type IDs and this has been clarified in the documentation (@LysandreJik ) Same for XLM-R (@maksym-del).
  • Fixed the prepare_for_model when tensorizing and returning token type IDs (@LysandreJik).
  • Fixed the XLNet model which wouldn't work with torch 1.4 (@julien-c)
  • Fetch all possible files remotely (@julien-c )
  • BERT's BasicTokenizer respects never_split parameters (@DeNeutoy)
  • Add lower bound to tqdm dependency (@brendan-ai2)
  • Fixed glue processors failing on tensorflow datasets (@neonbjb)
  • XLMRobertaTokenizer can now be serialized (@brandenchan)
  • A classifier dropout was added to ALBERT (@peteriz)
  • The ALBERT configurations for v2 models were fixed to be identical to those output by Google (@LysandreJik )

Downstream NLP task API (feature extraction, text classification, NER, QA), Command-Line Interface and Serving – models: T5 – community-added models: Japanese & Finnish BERT, PPLM, XLM-R

20 Dec 21:40

New class Pipeline (beta): easily run and use models on down-stream NLP tasks

We have added a new class called Pipeline to simply run and use models for several down-stream NLP tasks.

A Pipeline is just a tokenizer + model wrapped so they can take human-readable inputs and output human-readable results.

The Pipeline takes care of:
tokenizing input strings => converting them to tensors => running the model => post-processing the output

Currently, we have added the following pipelines with a default model for each:

  • feature extraction (can be used with any pretrained and finetuned models)
    inputs: strings/list of strings – output: list of floats (last hidden-states of the model for each token)
  • sentiment classification (DistilBert model fine-tuned on SST-2)
    inputs: strings/list of strings – output: list of dict with label/score of the top class
  • Named Entity Recognition (XLM-R finetuned on CoNLL2003 by the awesome @stefan-it)
    inputs: strings/list of strings – output: list of dict with label/entities/position of the named-entities
  • Question Answering (Bert Large whole-word version fine-tuned on SQuAD 1.0)
    inputs: dict of strings/list of dict of strings – output: list of dict with text/position of the answers

There are three ways to use pipelines:

  • in python:
from transformers import pipeline

# Test the default model for QA (Bert large finetuned on SQuAD 1.0)
nlp = pipeline('question-answering')
nlp(question= "Where does Amy live ?", context="Amy lives in Amsterdam.")
>>> {'answer': 'Amsterdam', 'score': 0.9657156007786263, 'start': 13, 'end': 21}

# Test a specific model for NER (XLM-R finetuned by @stefan-it on CoNLL03 English)
nlp = pipeline('ner', model='xlm-roberta-large-finetuned-conll03-english')
nlp("My name is Amy. I live in Paris.")
>>> [{'word': 'Amy', 'score': 0.9999586939811707, 'entity': 'I-PER'},
     {'word': 'Paris', 'score': 0.9999983310699463, 'entity': 'I-LOC'}]
  • in bash (using the command-line interface)
$ echo -e "Where does Amy live?\tAmy lives in Amsterdam" | transformers-cli run --task question-answering
{'score': 0.9657156007786263, 'start': 13, 'end': 22, 'answer': 'Amsterdam'}
  • as a REST API
transformers-cli serve --task question-answering

This new feature is currently in beta and will evolve in the coming weeks.

CLI tool to upload and share community models

Users can now create accounts on the huggingface.co website and then login using the transformers CLI. Doing so allows users to upload their models to our S3 in their respective directories, so that other users may download said models and use them in their tasks.

Users may upload files or directories.

It's been tested by @stefan-it for a German BERT and by @singletongue for a Japanese BERT.

New model architectures: T5, Japanese BERT, PPLM, XLM-RoBERTa, Finnish BERT

Refactoring the SQuAD example

The run_squad script has been massively refactored. The reasons are the following:

  • it was made to work with only a few models (BERT, XLNet, XLM and DistilBERT), which had three different ways of encoding sequences. The script had to be individually modified in order to train different models, which would not scale as other models are added to the library.
  • the utilities did not rely on the QOL adjustments that were made to the encoding methods these past months.

It now leverages the full capacity of encode_plus, easing the addition of new models to the script. A new method squad_convert_examples_to_features encapsulates all of the tokenization.
This method can handle tensorflow_datasets as well as squad v1 json files and squad v2 json files.

  • ALBERT was added to the SQuAD script

BertAbs summarization

A contribution by @rlouf building on the encoder-decoder mechanism to do abstractive summarization.

  • Utilities to load the CNN/DailyMail dataset
  • BertAbs now usable as a traditional library model (using from_pretrained())
  • ROUGE evaluation

New Models

Additional architectures

@alexzubiaga added XLNetForTokenClassification and TFXLNetForTokenClassification

New model cards

Community additions/bug-fixes/improvements

  • Added mish activation function @digantamisra98
  • run_bertology.py was updated with correct imports and the ability to overwrite the cache
  • Training can be exited and relaunched safely, while keeping the epochs, global steps, scheduler steps and other variables in run_lm_finetuning.py @bkkaggle
  • Tests now run on cuda @aaugustin @julien-c
  • Cleaned up the pytorch to tf conversion script @thomwolf
  • Progress indicator improvements when downloading pre-trained models @leopd
  • from_pretrained() can now load from urls directly.
  • New tests to check that all files are accessible on HuggingFace's S3 @rlouf
  • Updated tf.shape and tensor.shape to all use shape_list @thomwolf
  • Valohai integration @thomwolf
  • Always use SequentialSampler in run_squad.py @ethanjperez
  • Stop using GPU when importing transformers @ondewo
  • Fixed the XLNet attention output @roskoN
  • Several QOL adjustments: removed dead code, deep cleaned tests and removed pytest dependency @aaugustin
  • Fixed an issue with the Camembert tokenization @thomwolf
  • Correctly create an encoder attention mask from the shape of the hidden states @rlouf
  • Fixed a non-deterministic behavior when encoding and decoding empty strings @pglock
  • Fixing tensor creation in encode_plus @LysandreJik
  • Remove usage of tf.mean which does not exist in TF2 @LysandreJik
  • A segmentation fault error was fixed (due to scipy 1.4.0) @LysandreJik
  • Start sunsetting support of Python 2
  • An example usage of Model2Model was added to the quickstart.

Bug fixes

20 Dec 14:53

Patched error where the tokenizers would split the special tokens.

Bug fixes related to input shape in TensorFlow and tokenization messages

03 Dec 16:23

Input shapes

This patch fixes a bug related to the input shape in several models in TensorFlow.

Tokenization message

A tokenization message was displayed too often and flooded the output, hiding the relevant information. It was removed.

ALBERT, CamemBERT, DistilRoberta, GPT-2 XL, and Encoder-Decoder architectures

26 Nov 19:26

New model architectures: ALBERT, CamemBERT, GPT2-XL, DistilRoberta

Four new models have been added in v2.2.0

  • ALBERT (Pytorch & TF) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  • CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), as the first large-scale Transformer language model for French. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
  • DistilRoberta (Pytorch & TF) from @VictorSanh as the third distilled model after DistilBERT and DistilGPT-2.
  • GPT-2 XL (Pytorch & TF) as the last GPT-2 checkpoint released by OpenAI

Encoder-Decoder architectures

It is now possible to create full seq2seq models with Encoder-Decoder architectures, using a PreTrainedEncoderDecoder class that can be initialized from pre-trained models. The base BERT class has been modified so that it may behave as a decoder.

Furthermore, a Model2Model class that simplifies the definition of an encoder-decoder when both encoder and decoder are based on the same model has been added. @rlouf

Benchmarks and performance improvements

Work by @tlkh and @LysandreJik to benchmark the library's models with different technologies: with TensorFlow and Pytorch, with mixed precision (AMP and FP-16) and with model tracing (Torchscript and XLA). A new Benchmarks section was added to the documentation, pointing to Google Sheets with the results.

Breaking changes

Tokenizers now add special tokens by default. @LysandreJik

New model templates

Model templates to ease the addition of new models to the library have been added. @thomwolf

Inputs Embeddings

A new input has been added to all models' forward (for Pytorch) and call (for TensorFlow) methods. These inputs_embeds are a direct embedded representation. This is useful as it gives more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. @julien-c
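The contract between the two inputs can be sketched with a toy model. Everything below (the embedding table, the forward function, the summing "model") is invented for illustration and only shows that passing ids or passing their embeddings directly leads to the same computation.

```python
# Toy embedding table: one 2-dimensional vector per vocabulary id (made-up values).
EMBEDDINGS = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]


def forward(input_ids=None, inputs_embeds=None):
    """Accept either token ids or ready-made embeddings, but not both."""
    if (input_ids is None) == (inputs_embeds is None):
        raise ValueError("Specify exactly one of input_ids or inputs_embeds")
    if inputs_embeds is None:
        # Default path: look the vectors up in the internal embedding table.
        inputs_embeds = [EMBEDDINGS[i] for i in input_ids]
    # Stand-in for the rest of the model: sum each embedding vector.
    return [sum(vector) for vector in inputs_embeds]


# Usage: ids and their pre-computed embeddings give the same result.
from_ids = forward(input_ids=[1, 3])
from_embeds = forward(inputs_embeds=[[0.1, 0.2], [0.5, 0.6]])
```

Supplying inputs_embeds lets the caller substitute modified or externally computed vectors without touching the model's own lookup table.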

Getters and setters for input and output embeddings

A new API for the input and output embeddings is available. These methods are model-independent and allow easy acquisition/modification of the models' embeddings. @thomwolf

Additional architectures

New model architectures are available, namely: DistilBertForTokenClassification, CamembertForTokenClassification @stefan-it

Community additions/bug-fixes/improvements

  • The Fairseq RoBERTa model conversion script has been patched. @louismartin
  • einsum now runs in FP-16 in the library's examples @slayton58
  • In-depth work on the squad script for XLNet to reproduce the original paper's results @hlums
  • Additional improvements on the run_squad script by @WilliamTambellini, @orena1
  • The run_generation script has seen several improvements by @leo-du
  • The RoBERTaTensorFlow model has been patched for several use-cases: TPU and keras.fit @LysandreJik
  • The documentation is now versioned, links are available on the github readme @LysandreJik
  • The run_ner script has seen several improvements @mmaybeno, @oneraghavan, @manansanghi
  • The run_tf_glue script now works for all GLUE tasks @LysandreJik
  • The run_lm_finetuning script now correctly evaluates perplexity on MLM tasks @altsoph
  • An issue related to the XLM TensorFlow implementation's training has been fixed @tlkh
  • run_bertology has been updated to be closer to the run_glue example @adrianbg
  • Fixed added special tokens in decoded sequences @LysandreJik
  • Several performance improvements have been done to the tokenizers @iedmrc
  • A memory leak has been identified and patched in the library's schedulers @rlouf
  • Correct warning when encoding a sequence too long while specifying a maximum length @LysandreJik
  • Resizing the token embeddings now works as expected in the run_lm_finetuning script @iedmrc
  • The difference in versions between Pypi/source in order to run the examples has been clarified @rlouf

CTRL, DistilGPT-2, Pytorch TPU, tokenizer enhancements, guideline requirements

11 Oct 14:50

New model architectures: CTRL, DistilGPT-2

Two new models have been added since release 2.0.

Distillation

Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.

Pytorch TPU support

The run_glue.py example script can now run on a Pytorch TPU.

Updates to example scripts

Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions.

QOL enhancements on the tokenizer

Enhancements have been made on the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences .

The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a strategy.

Both of those methods are called by the encode_plus method, which itself is called by the encode method. The encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
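As an illustration of what a special tokens mask looks like, here is a sketch for a BERT-style layout. The helper names and the ids 101 and 102 are assumptions for the example; real tokenizers define their own special tokens and layouts.

```python
CLS_ID, SEP_ID = 101, 102  # illustrative ids for the [CLS] and [SEP] tokens


def build_inputs(ids_a, ids_b=None):
    """BERT-style layout: [CLS] A [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair."""
    ids = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        ids += ids_b + [SEP_ID]
    return ids


def special_tokens_mask(ids_a, ids_b=None):
    """1 marks a special token, 0 marks a token from the original sequences."""
    mask = [1] + [0] * len(ids_a) + [1]
    if ids_b is not None:
        mask += [0] * len(ids_b) + [1]
    return mask


# Usage: a pair of sequences with two tokens and one token respectively.
pair_ids = build_inputs([7, 8], [9])
pair_mask = special_tokens_mask([7, 8], [9])
```

The mask has the same length as the encoded input, so it can be used to skip special tokens during post-processing.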

Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.

New German BERT models

Breaking changes

  • The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens which has a more comprehensible name and manages both sequence singletons and pairs.

  • The boolean parameter truncate_first_sequence has been removed in tokenizers' encode and encode_plus methods, being replaced by a strategy in the form of a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate' are accepted strategies.

  • When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated or throw an error if overflowing.
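The four strategies can be sketched on plain lists of token ids. The truncate function below is a hypothetical illustration of the behavior described above, not the tokenizer's own truncate_sequences method.

```python
def truncate(ids_a, ids_b, num_tokens_to_remove, strategy="longest_first"):
    """Drop tokens from the end of one or both sequences according to `strategy`."""
    ids_a, ids_b = list(ids_a), list(ids_b)
    if num_tokens_to_remove <= 0:
        return ids_a, ids_b  # nothing overflows
    if strategy == "longest_first":
        # Remove one token at a time from whichever sequence is currently longer.
        for _ in range(num_tokens_to_remove):
            if len(ids_a) > len(ids_b):
                ids_a.pop()
            else:
                ids_b.pop()
    elif strategy == "only_first":
        ids_a = ids_a[:-num_tokens_to_remove]
    elif strategy == "only_second":
        ids_b = ids_b[:-num_tokens_to_remove]
    elif strategy == "do_not_truncate":
        raise ValueError("Input is too long and truncation is disabled")
    else:
        raise ValueError("Unknown strategy: {}".format(strategy))
    return ids_a, ids_b


# Usage: remove three overflowing tokens from a pair of sequences.
short_a, short_b = truncate([1, 2, 3, 4], [5, 6], 3)
```

With 'longest_first', the longer first sequence loses two tokens before the second sequence loses one.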

Guidelines and requirements

New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.

Community additions/bug-fixes/improvements

  • GLUE Processors have been refactored to handle inputs for all tasks coming from the tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
  • The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
  • The documentation CSS has been adapted to work on older browsers. @TimYagan
  • An addition concerning the management of hidden states has been added to the README by @BramVanroy.
  • Integration of TF 2.0 models with other Keras modules @thomwolf
  • Past values can be opted-out @thomwolf

Superseded by v2.1.1

11 Oct 14:47
v2.1.0

Adds version 2.1.0 for PyPi