
Commit 079eda4

Author: milen
Commit message: add transform courses
Parent: 80d6656

File tree: 17 files changed, +852 -3 lines changed

README.md

Lines changed: 6 additions & 3 deletions

@@ -297,10 +297,13 @@

  | Section | Notebook link | Python implementation | Course overview |
  | --- | --- | --- | --- |
- | Applications of Transformers in image classification | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2154618) | [Python implementation](./transformer_courses/Application_of_transformer_in_image_classification) | This section introduces two classic Transformer algorithms in computer vision, ViT and DeiT, and walks through how the Transformer architecture is applied to image classification. |
- | Classic pre-trained language models | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2110336) | [Python implementation](./transformer_courses/Transformer_Machine_Translation) | This section introduces the Transformer in NLP: its history and lineage, covering classic models such as ELMo, GPT, the Transformer, and BERT, as well as its application to machine translation. |
- | Slimming pre-trained models: efficient architectures | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2138857) | [Python implementation](./transformer_courses/Transformer_Punctuation_Restoration) | This section introduces techniques for slimming Transformer-based NLP models, including ELECTRA, ALBERT, and Performer, together with a code case study: ELECTRA-based Chinese punctuation prediction for ASR post-processing. |
+ | Classic pre-trained language models | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2110336) | [Python implementation](./transformer_courses/Transformer_Machine_Translation) | This section introduces the Transformer in NLP: its history and lineage, covering classic models such as ELMo, GPT, the Transformer, and BERT, as well as its application to machine translation. |
+ | Improvements of pre-trained models for natural language understanding | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2166195) | [Python implementation](./transformer_courses/reading_comprehension_based_on_ernie) | ERNIE, RoBERTa, KBERT, Tsinghua's ERNIE, and others: a broad look at improvements over classic pre-trained models. |
+ | Improvements of pre-trained models for long-sequence modeling | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2166197) | [Python implementation](./transformer_courses/sentiment_analysis_based_on_xlnet) | Transformer-XL, XLNet, Longformer, and others: an analysis of the length limitations of BERT and the Transformer, and how these methods address them. |
  | BERT distillation | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2177549) | [Python implementation](./transformer_courses/BERT_distillation) | This section introduces distillation algorithms for BERT, including Patient-KD, DistilBERT, TinyBERT, and DynaBERT, and shows in code how to distill TinyBERT using DynaBERT's training strategy. |
+ | Slimming pre-trained models: efficient architectures | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2138857) | [Python implementation](./transformer_courses/Transformer_Punctuation_Restoration) | This section introduces techniques for slimming Transformer-based NLP models, including ELECTRA, ALBERT, and Performer, together with a code case study: ELECTRA-based Chinese punctuation prediction for ASR post-processing. |
+ | Applications of Transformers in image classification | [notebook link](https://aistudio.baidu.com/aistudio/projectdetail/2154618) | [Python implementation](./transformer_courses/Application_of_transformer_in_image_classification) | This section introduces two classic Transformer algorithms in computer vision, ViT and DeiT, and walks through how the Transformer architecture is applied to image classification. |
  | | | | |

  # V. Classic deep learning case collection (in development)
Lines changed: 63 additions & 0 deletions

@@ -0,0 +1,63 @@

# Reading comprehension based on ERNIE

## Dependencies

* python3
* paddlepaddle-gpu==2.0.0.post101
* paddlenlp==2.0.1

## Project overview

```
|-data_proessor.py: data processing code
|-train.py: model training code
|-evaluate.py: model evaluation code
|-utilis.py: components used during model training
```

This project performs Chinese reading comprehension with the pre-trained model ERNIE, using the DuReader-robust dataset.

### Model

ERNIE is a pre-trained model released by Baidu. By introducing three levels of knowledge masking, it helps the model learn linguistic knowledge, and it outperforms BERT on a range of tasks.
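For orientation before the training script, here is a minimal sketch of loading ERNIE for extractive question answering with paddlenlp. The model and tokenizer names follow the `ernie-1.0` weights used below; the sample question and context strings are made up for illustration and are not part of this project.

```python
import paddle
from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer

# Sketch only: ERNIE with a span-prediction (start/end) head for QA.
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
model = ErnieForQuestionAnswering.from_pretrained("ernie-1.0")

# Encode a (question, context) pair into one sequence; the strings are illustrative.
encoded = tokenizer("谁发布了ERNIE?", "ERNIE是百度发布的预训练模型。", max_seq_len=512)
input_ids = paddle.to_tensor([encoded["input_ids"]])
token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])

# The model returns start/end logits over the tokens of the sequence; the answer
# span is recovered by picking the best start/end positions inside the context.
start_logits, end_logits = model(input_ids, token_type_ids=token_type_ids)
```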
## Model training

```shell
export CUDA_VISIBLE_DEVICES=0

python ./train.py --model_name ernie-1.0 \
    --epochs 1 \
    --learning_rate 3e-5 \
    --max_seq_length 512 \
    --batch_size 12 \
    --warmup_proportion 0.1 \
    --weight_decay 0.01 \
    --save_model_path ./ernie_rc.pdparams \
    --save_opt_path ./ernie_rc.pdopt
```
The parameters are as follows (a sketch of how `train.py` might declare these flags follows the list):

- `model_name`: name of the pre-trained model to load.
- `epochs`: number of training epochs.
- `learning_rate`: learning rate.
- `max_seq_length`: maximum sequence length; longer inputs are truncated.
- `batch_size`: number of samples per card per step.
- `warmup_proportion`: proportion of total training steps used for learning-rate warmup.
- `weight_decay`: weight decay value.
- `save_model_path`: path for saving the model weights.
- `save_opt_path`: path for saving the optimizer state.
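`train.py` itself is not included in this diff, so the following is only an assumed sketch of an argparse setup matching the flags above; the defaults mirror the example command and are not taken from the script.

```python
# Hypothetical argument parsing for train.py; only the flag names come from
# this commit, the defaults and help text are assumptions.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(
        description="Fine-tune ERNIE for reading comprehension")
    parser.add_argument("--model_name", default="ernie-1.0",
                        help="Name of the pre-trained model to load.")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Number of training epochs.")
    parser.add_argument("--learning_rate", type=float, default=3e-5,
                        help="Learning rate.")
    parser.add_argument("--max_seq_length", type=int, default=512,
                        help="Maximum sequence length; longer inputs are truncated.")
    parser.add_argument("--batch_size", type=int, default=12,
                        help="Number of samples per card per step.")
    parser.add_argument("--warmup_proportion", type=float, default=0.1,
                        help="Proportion of training steps used for warmup.")
    parser.add_argument("--weight_decay", type=float, default=0.01,
                        help="Weight decay value.")
    parser.add_argument("--save_model_path", default="./ernie_rc.pdparams",
                        help="Path for saving the model weights.")
    parser.add_argument("--save_opt_path", default="./ernie_rc.pdopt",
                        help="Path for saving the optimizer state.")
    return parser.parse_args()
```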
## Model evaluation

Run the evaluate.py script to evaluate the model:

```shell
export CUDA_VISIBLE_DEVICES=0

python ./evaluate.py --model_path ./ernie_rc.pdparams \
    --max_seq_length 512 \
    --batch_size 12
```
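The body of evaluate.py is not shown in this diff. As a rough sketch of the post-processing it presumably performs, the helpers imported in the data-processing file below (`compute_prediction` and `squad_evaluate` from `paddlenlp.metrics.squad`) can turn per-feature start/end logits into text answers and score them; apart from those two helpers, everything here (function name, argument values) is an assumption.

```python
from paddlenlp.metrics.squad import compute_prediction, squad_evaluate


def evaluate_predictions(dev_examples, dev_features, all_start_logits, all_end_logits):
    """Sketch: convert per-feature logits into answer strings and score them.

    dev_examples: raw examples (id, question, context, answers);
    dev_features: output of prepare_validation_features;
    all_*_logits: one numpy array per feature from the model's forward pass.
    """
    all_predictions, _, _ = compute_prediction(
        examples=dev_examples,
        features=dev_features,
        predictions=(all_start_logits, all_end_logits),
        max_answer_length=50)  # assumed limit, not taken from this commit

    # DuReader-robust is Chinese, so answers are not whitespace-tokenised.
    squad_evaluate(examples=dev_examples, preds=all_predictions,
                   is_whitespace_splited=False)
```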
Lines changed: 103 additions & 0 deletions

@@ -0,0 +1,103 @@

import collections
import time
import json

import paddle
from paddlenlp.metrics.squad import squad_evaluate, compute_prediction


def prepare_train_features(examples, tokenizer, doc_stride, max_seq_length):
    # Tokenize our examples with truncation and maybe padding, but keep the
    # overflows using a stride. This results in one example possibly giving
    # several features when a context is long, each of those features having
    # a context that overlaps a bit the context of the previous feature.
    contexts = [examples[i]['context'] for i in range(len(examples))]
    questions = [examples[i]['question'] for i in range(len(examples))]

    tokenized_examples = tokenizer(
        questions,
        contexts,
        stride=doc_stride,
        max_seq_len=max_seq_length)

    # Let's label those examples!
    for i, tokenized_example in enumerate(tokenized_examples):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_example["input_ids"]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # The offset mappings will give us a map from token to character
        # position in the original context. This will help us compute the
        # start_positions and end_positions.
        offsets = tokenized_example['offset_mapping']

        # Grab the sequence corresponding to that example (to know what is
        # the context and what is the question).
        sequence_ids = tokenized_example['token_type_ids']

        # One example can give several spans, this is the index of the
        # example containing this span of text.
        sample_index = tokenized_example['overflow_to_sample']
        answers = examples[sample_index]['answers']
        answer_starts = examples[sample_index]['answer_starts']

        # Start/end character index of the answer in the text.
        start_char = answer_starts[0]
        end_char = start_char + len(answers[0])

        # Start token index of the current span in the text.
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        # End token index of the current span in the text.
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1
        # Minus one more to reach actual text
        token_end_index -= 1

        # Detect if the answer is out of the span (in which case this feature
        # is labeled with the CLS index).
        if not (offsets[token_start_index][0] <= start_char and
                offsets[token_end_index][1] >= end_char):
            tokenized_examples[i]["start_positions"] = cls_index
            tokenized_examples[i]["end_positions"] = cls_index
        else:
            # Otherwise move the token_start_index and token_end_index to the
            # two ends of the answer.
            # Note: we could go after the last offset if the answer is the
            # last word (edge case).
            while token_start_index < len(offsets) and offsets[
                    token_start_index][0] <= start_char:
                token_start_index += 1
            tokenized_examples[i]["start_positions"] = token_start_index - 1
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            tokenized_examples[i]["end_positions"] = token_end_index + 1

    return tokenized_examples


def prepare_validation_features(examples, tokenizer, doc_stride, max_seq_length):
    # Tokenize our examples with truncation and maybe padding, but keep the
    # overflows using a stride. This results in one example possibly giving
    # several features when a context is long, each of those features having
    # a context that overlaps a bit the context of the previous feature.
    contexts = [examples[i]['context'] for i in range(len(examples))]
    questions = [examples[i]['question'] for i in range(len(examples))]

    tokenized_examples = tokenizer(
        questions,
        contexts,
        stride=doc_stride,
        max_seq_len=max_seq_length)

    # For validation, there is no need to compute start and end positions.
    for i, tokenized_example in enumerate(tokenized_examples):
        # Grab the sequence corresponding to that example (to know what is
        # the context and what is the question).
        sequence_ids = tokenized_example['token_type_ids']

        # One example can give several spans, this is the index of the
        # example containing this span of text.
        sample_index = tokenized_example['overflow_to_sample']
        tokenized_examples[i]["example_id"] = examples[sample_index]['id']

        # Set to None the offset_mapping that are not part of the context so
        # it's easy to determine if a token position is part of the context
        # or not.
        tokenized_examples[i]["offset_mapping"] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(tokenized_example["offset_mapping"])
        ]

    return tokenized_examples
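For orientation, here is a minimal sketch of how these two functions are typically applied to DuReader-robust with paddlenlp. The import path (`data_proessor`, per the README's file listing), the `doc_stride` value, and the use of `load_dataset` / `map(batched=True)` are assumptions based on paddlenlp 2.0's dataset API, not code from this commit.

```python
# Sketch only: wiring the feature builders above to the DuReader-robust data.
from functools import partial

from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer

from data_proessor import prepare_train_features, prepare_validation_features

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
train_ds, dev_ds = load_dataset("dureader_robust", splits=("train", "dev"))

# Each raw example can expand into several overlapping features; doc_stride
# here is an illustrative value, not one taken from the commit.
train_ds.map(
    partial(prepare_train_features, tokenizer=tokenizer,
            doc_stride=128, max_seq_length=512),
    batched=True)
dev_ds.map(
    partial(prepare_validation_features, tokenizer=tokenizer,
            doc_stride=128, max_seq_length=512),
    batched=True)
```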
