The Jupyter notebook for this article lives in the Chapter 4 code repository.
You can also open this tutorial directly as a Google Colab notebook, which downloads the relevant datasets and models. If you are opening this notebook in Google Colab, you will probably need to install the Transformers and 🤗 Datasets libraries. Uncomment the following command to install them.
!pip install transformers datasets
If you are running this notebook locally, make sure the dependencies above are installed. You can also find a multi-GPU distributed-training version of this notebook here.
We will show how to use a model from the 🤗 Transformers library to solve a text classification task from the GLUE Benchmark.
The GLUE benchmark consists of nine sentence-level classification tasks:
- CoLA (Corpus of Linguistic Acceptability): determine whether a sentence is grammatically acceptable.
- MNLI (Multi-Genre Natural Language Inference): given a hypothesis, determine whether another sentence entails, contradicts, or is unrelated to it.
- MRPC (Microsoft Research Paraphrase Corpus): determine whether two sentences are paraphrases of each other.
- QNLI (Question-answering Natural Language Inference): determine whether the second sentence contains the answer to the question in the first sentence.
- QQP (Quora Question Pairs2): determine whether two questions are semantically equivalent.
- RTE (Recognizing Textual Entailment): determine whether a sentence entails a given hypothesis.
- SST-2 (Stanford Sentiment Treebank): determine whether the sentiment of a sentence is positive or negative.
- STS-B (Semantic Textual Similarity Benchmark): rate the similarity of two sentences on a scale from 1 to 5.
- WNLI (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun is entailed by the version with the pronoun replaced.
For each of these tasks, we will show how to load the dataset with the 🤗 Datasets library and how to fine-tune a pretrained model with the Trainer API from Transformers.
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
In principle this notebook works with a wide variety of Transformer models (see the model hub) and can solve any text classification task. If the task you are working on differs somewhat, chances are only small changes to this notebook are needed. You should also adjust the fine-tuning batch size to your GPU memory to avoid out-of-memory errors.
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
We will use the 🤗 Datasets library to load the data and the matching evaluation metric. Each takes a single call: load_dataset and load_metric respectively.
from datasets import load_dataset, load_metric
Apart from mnli-mm, every task can be loaded directly by its name. The data is cached automatically after loading.
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)
The resulting dataset object is a DatasetDict: use the corresponding key (train, validation, test) to access each split.
dataset
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
Given the key of a split (train, validation, or test) and an index, you can look at an individual example:
dataset["train"][0]
{'idx': 0,
'label': 1,
'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
To get a better sense of what the data looks like, the following function shows a few examples picked at random from the dataset.
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Pick num_examples distinct random indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # Map ClassLabel integers back to their human-readable names.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])
| | sentence | label | idx |
|---|---|---|---|
| 0 | The more I talk to Joe, the less about linguistics I am inclined to think Sally has taught him to appreciate. | acceptable | 196 |
| 1 | Have in our class the kids arrived safely? | unacceptable | 3748 |
| 2 | I gave Mary a book. | acceptable | 5302 |
| 3 | Every student, who attended the party, had a good time. | unacceptable | 4944 |
| 4 | Bill pounded the metal fiat. | acceptable | 2178 |
| 5 | It bit me on the leg. | acceptable | 5908 |
| 6 | The boys were made a good mother by Aunt Mary. | unacceptable | 736 |
| 7 | More of a man is here. | unacceptable | 5403 |
| 8 | My mother baked me a birthday cake. | acceptable | 3761 |
| 9 | Gregory appears to have wanted to be loyal to the company. | acceptable | 4334 |
The evaluation metric is an instance of datasets.Metric:
metric
Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
predictions: list of predictions to score.
Each translation should be tokenized into a list of tokens.
references: list of lists of references for each translation.
Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
"accuracy": Accuracy
"f1": F1 score
"pearson": Pearson Correlation
"spearmanr": Spearman Correlation
"matthews_correlation": Matthew Correlation
Examples:
>>> glue_metric = datasets.load_metric('glue', 'sst2') # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'mrpc') # 'mrpc' or 'qqp'
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0, 'f1': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'stsb')
>>> references = [0., 1., 2., 3., 4., 5.]
>>> predictions = [0., 1., 2., 3., 4., 5.]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
{'pearson': 1.0, 'spearmanr': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'cola')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'matthews_correlation': 1.0}
""", stored examples: 0)
To get a metric value, simply call the metric's compute method with the predictions and the reference labels:
import numpy as np
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
{'matthews_correlation': 0.1513518081969605}
Each text classification task comes with its own metric:
- for CoLA: Matthews Correlation Coefficient
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and F1 score
- for QNLI: Accuracy
- for QQP: Accuracy and F1 score
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: Pearson Correlation Coefficient and Spearman's Rank Correlation Coefficient
- for WNLI: Accuracy
So make sure the metric matches the task.
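As a quick reference, the mapping above can be written down as a dict. This is a hypothetical helper for illustration (task_to_metric_keys is not part of the original notebook); the values are the keys that metric.compute returns for each task:

task_to_metric_keys = {
    "cola": ("matthews_correlation",),
    "mnli": ("accuracy",),
    "mnli-mm": ("accuracy",),
    "mrpc": ("accuracy", "f1"),
    "qnli": ("accuracy",),
    "qqp": ("accuracy", "f1"),
    "rte": ("accuracy",),
    "sst2": ("accuracy",),
    "stsb": ("pearson", "spearmanr"),
    "wnli": ("accuracy",),
}
assert task in task_to_metric_keys  # fail early if the task name is misspelled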
Before feeding the data to the model, we need to preprocess it. This is done with a Tokenizer, which first tokenizes the input, then converts the tokens into the token IDs found in the pretrained model's vocabulary, and finally into the input format the model expects.
To do that, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:
- we get a tokenizer that corresponds one-to-one to the pretrained model we want to use;
- when using the tokenizer of a given model checkpoint, we also download the vocabulary the model needs (more precisely, its tokens vocabulary).
The downloaded tokens vocabulary is cached, so it is not downloaded again on later runs.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
Note: use_fast=True requires the tokenizer to be a transformers.PreTrainedTokenizerFast, because we rely on some special features of fast tokenizers during preprocessing (such as multi-threaded batch tokenization). If your model does not come with a fast tokenizer, just drop this option.
Almost all models have a matching fast tokenizer; the model/tokenizer compatibility table lists the features each pretrained model's tokenizer supports.
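If you are unsure whether the tokenizer you loaded is a fast one, a quick sanity check (not in the original notebook) is to look at its is_fast attribute:

print(type(tokenizer))   # e.g. DistilBertTokenizerFast for distilbert-base-uncased
print(tokenizer.is_fast) # True for a transformers.PreTrainedTokenizerFast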
The tokenizer can preprocess a single text or a pair of texts; the output it produces matches the pretrained model's input format:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Depending on the pretrained model we pick, the tokenizer will return different fields; tokenizer and pretrained model correspond one-to-one, and more information is available here.
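To see where those IDs come from, you can map them back to tokens; a small illustrative sketch (note the [CLS] and [SEP] special tokens that DistilBERT's tokenizer inserts):

encoded = tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', ',', 'this', 'one', 'sentence', '!', '[SEP]',
#  'and', 'this', 'sentence', 'goes', 'with', 'it', '.', '[SEP]']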
To preprocess our data, we need to know which column(s) each task uses as input, so we define the following dict:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
We can double-check the format on the first training example:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
Sentence: Our friends won't buy this analysis, let alone the next one we propose.
We then put the preprocessing code into a function:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
The preprocessing function can handle a single example as well as several examples at once. With several examples, it returns a list of values for each field:
preprocess_function(dataset['train'][:5])
{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Next we preprocess all the examples in the dataset by applying preprocess_function to every split with the map method.
encoded_dataset = dataset.map(preprocess_function, batched=True)
Even better, the results are automatically cached so they are not recomputed the next time you run this (though watch out: if your input changes, the cache can trip you up!). The 🤗 Datasets library checks the arguments passed to map and reuses the cache when nothing has changed. If the arguments are identical but you want to re-run the preprocessing anyway, pass load_from_cache_file=False to ignore the cache. Also note that batched=True, used above, lets the fast tokenizer process batches of texts in parallel threads.
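For example, to force the preprocessing to run again instead of reusing a stale cache, you could re-run map like this (a sketch, assuming the same preprocess_function as above):

# Recompute the preprocessing instead of loading the cached result.
encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)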
Now that the data is ready, we can download, load, and fine-tune the pretrained model. Since this is a sequence classification task, we use the AutoModelForSequenceClassification class. As with the tokenizer, the from_pretrained method downloads the model and caches it, so it is not downloaded twice.
One thing to keep in mind: STS-B is a regression problem and MNLI is a 3-class classification problem:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Since our fine-tuning task is text classification while the checkpoint we loaded was pretrained as a language model, the warning tells us that some mismatched weights were dropped when loading the model (the language model's pretraining head was discarded, and a randomly initialized text classification head was added in its place).
To build a Trainer, we need three more ingredients, the most important being the training configuration, TrainingArguments. It holds all the attributes that define the training run.
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
The evaluation_strategy="epoch" argument above tells the trainer to evaluate on the validation set at the end of every epoch. batch_size was defined earlier in this notebook.
Finally, since different tasks need different evaluation metrics, we define a function that computes the right metric for the current task:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)  # class with the highest logit
    else:
        predictions = predictions[:, 0]               # STS-B is a regression task
    return metric.compute(predictions=predictions, references=labels)
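As a quick sanity check before handing this function to the Trainer, we can call it on random fake logits (an illustrative sketch, not from the original notebook; remember that task is "cola" here):

fake_logits = np.random.randn(64, 2)              # shape (batch, num_labels)
fake_refs = np.random.randint(0, 2, size=(64,))
print(compute_metrics((fake_logits, fake_refs)))  # e.g. {'matthews_correlation': ...}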
Then we just pass all of this to the Trainer:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
Now we can start training:
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
Num examples = 8551
Num Epochs = 5
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 2675
[2675/2675 02:49, Epoch 5/5]
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
Saving model checkpoint to test-glue/checkpoint-535
Configuration saved in test-glue/checkpoint-535/config.json
Model weights saved in test-glue/checkpoint-535/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-535/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-535/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1070
Configuration saved in test-glue/checkpoint-1070/config.json
Model weights saved in test-glue/checkpoint-1070/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1070/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1070/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1605
Configuration saved in test-glue/checkpoint-1605/config.json
Model weights saved in test-glue/checkpoint-1605/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-1605/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-1605/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2140
Configuration saved in test-glue/checkpoint-2140/config.json
Model weights saved in test-glue/checkpoint-2140/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2140/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2140/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
Saving model checkpoint to test-glue/checkpoint-2675
Configuration saved in test-glue/checkpoint-2675/config.json
Model weights saved in test-glue/checkpoint-2675/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-2675/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-2675/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from test-glue/checkpoint-2675 (score: 0.5138995234247261).
TrainOutput(global_step=2675, training_loss=0.27181456521292713, metrics={'train_runtime': 169.649, 'train_samples_per_second': 252.02, 'train_steps_per_second': 15.768, 'total_flos': 229537542078168.0, 'train_loss': 0.27181456521292713, 'epoch': 5.0})
After training completes, we evaluate the (best) model:
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
Num examples = 1043
Batch size = 16
[66/66 00:00]
{'epoch': 5.0,
'eval_loss': 0.8822253346443176,
'eval_matthews_correlation': 0.5138995234247261,
'eval_runtime': 0.9319,
'eval_samples_per_second': 1119.255,
'eval_steps_per_second': 70.825}
To see how your model fared you can compare it to the GLUE Benchmark leaderboard.
The Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries. Uncomment the two lines below to install these dependencies:
! pip install optuna
! pip install ray[tune]
During hyperparameter search, the Trainer trains the model several times over, so instead of a fixed model it takes a function that returns a freshly initialized model, which it calls at the start of every trial:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
We then instantiate the Trainer much like before:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361
Model config DistilBertConfig {
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.9.1",
"vocab_size": 30522
}
loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Now call the hyperparameter_search method. Note that this can take a long time; it is often a good idea to search on a subset of the dataset first and then run the full training with the best hyperparameters.
For example, search using 1/10 of the data:
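One way to do that (a sketch, not part of the original code, using the Datasets shard method) is to temporarily swap the trainer's training set for a 1/10 shard:

# Search on roughly 1/10 of the training data.
trainer.train_dataset = encoded_dataset["train"].shard(index=0, num_shards=10)
# Restore the full training set after the search:
# trainer.train_dataset = encoded_dataset["train"]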
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
hyperparameter_search returns the parameters of the best run it found:
best_run
Set the Trainer's arguments to the best hyperparameters found and train one last time:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
Finally, don't forget to check how to upload your model to the 🤗 Model Hub (https://huggingface.co/transformers/model_sharing.html). Once uploaded, you can use your own model simply by its name, just as we did at the beginning of this notebook.
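A minimal sketch of uploading straight from the Trainer (assuming your transformers version supports it and you are authenticated with the Hub, e.g. via huggingface-cli login):

# Push the fine-tuned model, tokenizer, and training metadata to the Hub.
trainer.push_to_hub()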