title: "Train fastText or floret vectors"
description: |
This project downloads, extracts and preprocesses texts from a number of
sources and trains vectors with [floret](https://github.com/explosion/floret).
By default, the project trains floret vectors for Korean for use in `md` and
`lg` spaCy pipelines.
Prerequisites:
- Linux (it may largely work on macOS, but this is not tested or maintained)
- a large amount of hard drive space (e.g. ~100GB total for Korean, which has
  15GB of data in OSCAR 21.09; for English, Russian, Chinese, Spanish, etc.
  you would need multiple TB with the provided defaults)
- a workstation with a good CPU, or a lot of patience
Adjust the variables `n_process_tokenize` and `vector_thread` for your CPU.
> For a Python-only cross-platform alternative, try out the simpler
> [`pipelines/floret_wiki_oscar_vectors`](https://github.com/explosion/projects/tree/v3/pipelines/floret_wiki_oscar_vectors)
> project using Wikipedia and OSCAR 2019.
## Text Sources
- Wikipedia: https://dumps.wikimedia.org
- OpenSubtitles: https://opus.nlpl.eu/OpenSubtitles-v2018.php (https://www.opensubtitles.org)
- WMT Newscrawl: https://data.statmt.org/news-crawl/
- OSCAR 21.09: https://oscar-corpus.com/post/oscar-v21-09/
OpenSubtitles and WMT Newscrawl only cover a small subset of the languages
included in Wikipedia or OSCAR, so to work with a subset of the sources you
may need to remove the corresponding assets and adjust or remove the related
workflow steps.
### Source Requirements
#### Wikipedia
Install `Wikiparsec`: https://github.com/rspeer/wikiparsec
Set `wikipedia_version` to a dump version that is currently available at
https://dumps.wikimedia.org for your language, or switch to `"latest"`.
#### OSCAR 21.09
The dataset [`oscar-corpus/OSCAR-2109`](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) requires you to:
- create a Hugging Face Hub account
- agree to the dataset terms to access: https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
- authenticate with `huggingface-cli login`
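For reference, the dataset access these steps rely on looks roughly like the
sketch below (illustrative only; it assumes the Hugging Face `datasets`
library and a completed `huggingface-cli login`):

```python
# Rough sketch: stream the gated OSCAR 21.09 subset with Hugging Face `datasets`.
# The literal values mirror the project vars for Korean and are illustrative.
from datasets import load_dataset

dataset = load_dataset(
    "oscar-corpus/OSCAR-2109",   # vars.oscar_dataset
    "deduplicated_ko",           # vars.oscar_dataset_subset
    split="train",               # vars.oscar_dataset_split
    streaming=True,              # avoid materializing the full dump up front
    use_auth_token=True,         # requires a prior `huggingface-cli login`
)

for record in dataset:
    text = record["text"]        # one raw document per record
    break
```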
#### OSCAR 2019
As an alternative to OSCAR 21.09, you can stream from
[`oscar`](https://huggingface.co/datasets/oscar) without authentication.
## floret Parameters
[floret](https://github.com/explosion/floret) has a large number of
parameters and it's difficult to give advice for every configuration, but the
parameters described here are the ones most worth customizing for a new
language and experimenting with initially.
Be aware that if you're using more than one thread, the results of each run
with fastText or floret will be slightly different.
### `vector_minn` / `vector_maxn`
The minimum and maximum character n-gram lengths should be adapted for the
language and writing system. The n-grams should capture common grammatical
affixes like English `-ing`, without making the number of n-grams per word
too large. Very short n-grams aren't meaningful, while very long n-grams are
too sparse and don't help with misspellings and noise.
A good rule of thumb is that `maxn` should correspond to the length of the
longest common affix + `1`, so for many languages with alphabets, `minn
4`/`maxn 5` can be a good starting point, similar to `minn 5`/`maxn 5`, which
was shown to be a reasonable default for the [original fastText
vectors](https://fasttext.cc/docs/en/crawl-vectors.html).
For writing systems where one character corresponds to a syllable, shorter
n-grams are typically more suitable. For Korean, where each (normalized)
character is a syllable and most grammatical affixes are 1-2 characters,
`minn 2`/`maxn 3` seems to perform well.
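As a quick illustration of how `minn`/`maxn` affect the subword units, the
sketch below enumerates character n-grams in roughly the way fastText and
floret do, with the word wrapped in `<`/`>` boundary markers (a simplified
helper, not part of the project scripts):

```python
# Illustrative only: list the character n-grams considered for a word at a
# given minn/maxn, with "<" and ">" marking the word boundaries.
def char_ngrams(word: str, minn: int, maxn: int) -> list[str]:
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("running", 4, 5))   # includes affix-like pieces such as "ing>"
print(char_ngrams("했습니다", 2, 3))   # Korean: syllable-level pieces, fewer n-grams
```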
### `vector_bucket_md` / `vector_bucket_lg`
The bucket size is the number of rows in the floret vector table. For
tagging and parsing, a bucket size of 50k performs well, but larger sizes may
still lead to small improvements. For NER, the performance continues to
improve for bucket sizes up to at least 200k.
In a spaCy pipeline package, 50k 300-dim vectors are ~60MB and 200k 300-dim
vectors are ~230MB.
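Those package sizes are roughly what you get from rows × dims × 4 bytes for
float32 vectors, plus packaging overhead (a back-of-the-envelope estimate, not
an exact measurement):

```python
# Back-of-the-envelope size of the raw vector table (float32, no overhead).
def table_size_mb(rows: int, dim: int, bytes_per_float: int = 4) -> float:
    return rows * dim * bytes_per_float / 1024 ** 2

print(f"md: {table_size_mb(50_000, 300):.0f} MB")   # ~57 MB  -> ~60MB packaged
print(f"lg: {table_size_mb(200_000, 300):.0f} MB")  # ~229 MB -> ~230MB packaged
```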
### `vector_hash_count`
The recommended hash count is `2`, especially for smaller bucket sizes.
Larger hash counts make training with floret slower and inference in spaCy
slightly slower, but may lead to slightly improved performance, especially
with larger bucket sizes.
### `vector_epoch`
You may want to reduce the number of epochs for larger training input sizes.
### `vector_min_count`
You may want to increase the minimum word count for larger training input
sizes.
### `vector_lr`
You may need to decrease the learning rate for larger training input sizes to
avoid NaN errors, see:
https://fasttext.cc/docs/en/faqs.html#im-encountering-a-nan-why-could-this-be
### `vector_thread`
Adjust the number of threads for your CPU. With a larger number of threads,
you may need more epochs to reach the same performance.
## Notes
The project does not currently clean up any intermediate files, so that it
remains possible to resume from any point in the workflow. The overall disk
usage could be reduced by cleaning up files after each step, keeping only the
final floret input text file. Note that floret requires the input file to be
on disk during training.
floret always writes the full `.bin` and `.vec` files after training. These
may be 5GB+ each even though the final `.floret` table is much smaller.
spacy_version: ">=3.2.0,<4.0.0"
vars:
name: "vectors"
lang: "ko"
n_process_tokenize: 16
# The defaults assume that you have a large hard drive mounted under /scratch.
downloaded_dir: "/scratch/vectors/downloaded"
extracted_dir: "/scratch/vectors/extracted"
tokenized_dir: "/scratch/vectors/tokenized"
wikipedia_version: 20220201
newscrawl_year: 2020
oscar_dataset: "oscar-corpus/OSCAR-2109"
oscar_dataset_subset: "deduplicated_${vars.lang}"
# For "oscar" instead of OSCAR-2109 (no auth required).
#oscar_dataset: "oscar"
#oscar_dataset_subset: "unshuffled_deduplicated_${vars.lang}"
oscar_dataset_split: "train"
oscar_max_texts: -1
vector_input_dir: "/scratch/vectors/input"
vector_model: "cbow"
# For languages with alphabets: minn/maxn 4/5 or 5/5 is a good starting point.
vector_minn: 2
vector_maxn: 3
vector_epoch: 5
vector_dim: 300
vector_neg: 10
vector_bucket_md: 50000
vector_bucket_lg: 200000
vector_min_count: 20
vector_hash_count: 2
vector_thread: 16
vector_lr: 0.05
directories: ["software", "vectors"]
assets:
- dest: "software/floret"
git:
repo: "https://github.com/explosion/floret"
branch: "v0.10.2"
path: ""
- dest: "${vars.downloaded_dir}/wikipedia/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2"
url: "https://dumps.wikimedia.org/${vars.lang}wiki/${vars.wikipedia_version}/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2"
- dest: "${vars.downloaded_dir}/opensubtitles/${vars.lang}.txt.gz"
url: "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.${vars.lang}.gz"
- dest: "${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
url: "https://data.statmt.org/news-crawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
workflows:
prepare-text:
- extract-wikipedia
- tokenize-wikipedia
- extract-opensubtitles
- tokenize-opensubtitles
- extract-newscrawl
- tokenize-newscrawl
- tokenize-oscar
- create-input
train-vectors:
- compile-floret
- train-floret-vectors-md
- train-floret-vectors-lg
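# Typical usage from the project root (fetch the assets, then run the workflows):
#   python -m spacy project assets
#   python -m spacy project run prepare-text
#   python -m spacy project run train-vectors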
commands:
- name: "extract-wikipedia"
help: "Convert Wikipedia XML to plain text with Wikiparsec"
script:
- "mkdir -p ${vars.extracted_dir}/wikipedia/"
- "scripts/extract_wikipedia.sh ${vars.downloaded_dir}/wikipedia/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2 ${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
deps:
- "scripts/extract_wikipedia.sh"
outputs:
- "${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
- name: "tokenize-wikipedia"
help: "Tokenize Wikipedia"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt
--input-file ${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt"
- name: "extract-opensubtitles"
help: "Extract OpenSubtitles data"
script:
- "mkdir -p ${vars.extracted_dir}/opensubtitles/"
- "scripts/extract_opensubtitles.sh ${vars.downloaded_dir}/opensubtitles/${vars.lang}.txt.gz ${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
deps:
- "scripts/extract_opensubtitles.sh"
outputs:
- "${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
- name: "tokenize-opensubtitles"
help: "Tokenize OpenSubtitles"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt
--input-file ${vars.extracted_dir}/opensubtitles/${vars.lang}.txt
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt"
- name: "extract-newscrawl"
help: "Extract newscrawl data"
script:
- "mkdir -p ${vars.extracted_dir}/newscrawl/"
- "pigz -d -k ${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
- "mv ${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped ${vars.extracted_dir}/newscrawl"
outputs:
- "${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped"
- name: "tokenize-newscrawl"
help: "Tokenize newscrawl"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt
--input-file ${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt"
- name: "tokenize-oscar"
help: "Tokenize and sentencize oscar dataset"
script:
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt
--input-dataset ${vars.oscar_dataset}
--dataset-subset ${vars.oscar_dataset_subset}
--dataset-split ${vars.oscar_dataset_split}
--dataset-auth-token
--n-process=${vars.n_process_tokenize}
--max-texts=${vars.oscar_max_texts}
deps:
- "scripts/tokenize_resource.py"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt"
- name: "create-input"
help: "Concatenate tokenized input texts"
script:
- >-
python scripts/concat_files.py
--input-file ${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt
${vars.vector_input_dir}/${vars.lang}.txt
deps:
- "scripts/concat_files.py"
- "${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt"
- "${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt"
- "${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt"
- "${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt"
outputs:
- "${vars.vector_input_dir}/${vars.lang}.txt"
- name: "compile-floret"
help: "Compile floret"
script:
- "make -C software/floret"
outputs:
- "software/floret/floret"
- name: "train-floret-vectors-md"
help: "Train floret md vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode floret
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-hashCount ${vars.vector_hash_count}
-bucket ${vars.vector_bucket_md}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}_md
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}_md.floret"
- name: "train-floret-vectors-lg"
help: "Train floret lg vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode floret
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-hashCount ${vars.vector_hash_count}
-bucket ${vars.vector_bucket_lg}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}_lg
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}_lg.floret"
- name: "train-fasttext-vectors"
help: "Train fastText vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode fasttext
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}.fasttext
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}.fasttext.vec"