title: "Train fastText or floret vectors"
description: |
This project downloads, extracts and preprocesses texts from a number of
sources and trains vectors with [floret](https://github.com/explosion/floret).
By default, the project trains floret vectors for Korean for use in `md` and
`lg` spaCy pipelines.
Prerequisites:
- Linux (it may largely work on macOS, but this is not tested or maintained)
- a large amount of hard drive space (e.g. ~100GB total for Korean, which has
  15GB of data in OSCAR 21.09; for English, Russian, Chinese, Spanish, etc.
  you would need multiple TB with the provided defaults)
- a workstation with a good CPU, or a lot of patience
Adjust the variables `n_process_tokenize` and `vector_thread` for your CPU.
> For a Python-only cross-platform alternative, try out the simpler
> [`pipelines/floret_wiki_oscar_vectors`](https://github.com/explosion/projects/tree/v3/pipelines/floret_wiki_oscar_vectors)
> project using Wikipedia and OSCAR 2019.
## Text Sources
- Wikipedia: https://dumps.wikimedia.org
- OpenSubtitles: https://opus.nlpl.eu/OpenSubtitles-v2018.php (https://www.opensubtitles.org)
- WMT Newscrawl: https://data.statmt.org/news-crawl/
- OSCAR 21.09: https://oscar-corpus.com/post/oscar-v21-09/
OpenSubtitles and WMT Newscrawl only cover a small subset of the languages
included in Wikipedia or OSCAR, so to work with a subset of the sources you
may need to remove the corresponding assets and adjust or remove the related
workflow steps.
### Source Requirements
#### Wikipedia
Install `Wikiparsec`: https://github.com/rspeer/wikiparsec
Set `wikipedia_version` to a dump version that is currently available at
https://dumps.wikimedia.org for your language, or switch to `"latest"`.
#### OSCAR 21.09
The dataset [`oscar-corpus/OSCAR-2109`](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) requires you to:
- create a Hugging Face Hub account
- agree to the dataset terms to access: https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
- authenticate with `huggingface-cli login`
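For reference, the dataset access these steps rely on looks roughly like the
sketch below (illustrative only; it assumes the Hugging Face `datasets`
library and a completed `huggingface-cli login`):

```python
# Rough sketch: stream the gated OSCAR 21.09 subset with Hugging Face `datasets`.
# The literal values mirror the project vars for Korean and are illustrative.
from datasets import load_dataset

dataset = load_dataset(
    "oscar-corpus/OSCAR-2109",   # vars.oscar_dataset
    "deduplicated_ko",           # vars.oscar_dataset_subset
    split="train",               # vars.oscar_dataset_split
    streaming=True,              # avoid materializing the full dump up front
    use_auth_token=True,         # requires a prior `huggingface-cli login`
)

for record in dataset:
    text = record["text"]        # one raw document per record
    break
```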
#### OSCAR 2019
As an alternative to OSCAR 21.09, you can stream from
[`oscar`](https://huggingface.co/datasets/oscar) without authentication.
## floret Parameters
[floret](https://github.com/explosion/floret) has a large number of
parameters and it's difficult to give advice for every configuration, but the
parameters described here are the ones most worth customizing for a new
language and experimenting with initially.
Be aware that if you're using more than one thread, the results of each run
with fastText or floret will be slightly different.
### `vector_minn` / `vector_maxn`
The minimum and maximum character n-gram lengths should be adapted for the
language and writing system. The n-grams should capture common grammatical
affixes like English `-ing`, without making the number of n-grams per word
too large. Very short n-grams aren't meaningful, while very long n-grams are
too sparse and don't help with misspellings and noise.
A good rule of thumb is that `maxn` should correspond to the length of the
longest common affix + `1`, so for many languages with alphabets, `minn
4`/`maxn 5` can be a good starting point, similar to `minn 5`/`maxn 5`, which
was shown to be a reasonable default for the [original fastText
vectors](https://fasttext.cc/docs/en/crawl-vectors.html).
For writing systems where one character corresponds to a syllable, shorter
n-grams are typically more suitable. For Korean, where each (normalized)
character is a syllable and most grammatical affixes are 1-2 characters,
`minn 2`/`maxn 3` seems to perform well.
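As a quick illustration of how `minn`/`maxn` affect the subword units, the
sketch below enumerates character n-grams in roughly the way fastText and
floret do, with the word wrapped in `<`/`>` boundary markers (a simplified
helper, not part of the project scripts):

```python
# Illustrative only: list the character n-grams considered for a word at a
# given minn/maxn, with "<" and ">" marking the word boundaries.
def char_ngrams(word: str, minn: int, maxn: int) -> list[str]:
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("running", 4, 5))   # includes affix-like pieces such as "ing>"
print(char_ngrams("했습니다", 2, 3))   # Korean: syllable-level pieces, fewer n-grams
```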
### `vector_bucket_md` / `vector_bucket_lg`
The bucket size is the number of rows in the floret vector table. For
tagging and parsing, a bucket size of 50k performs well, but larger sizes may
still lead to small improvements. For NER, the performance continues to
improve for bucket sizes up to at least 200k.
In a spaCy pipeline package, 50k 300-dim vectors are ~60MB and 200k 300-dim
vectors are ~230MB.
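Those package sizes are roughly what you get from rows × dims × 4 bytes for
float32 vectors, plus packaging overhead (a back-of-the-envelope estimate, not
an exact measurement):

```python
# Back-of-the-envelope size of the raw vector table (float32, no overhead).
def table_size_mb(rows: int, dim: int, bytes_per_float: int = 4) -> float:
    return rows * dim * bytes_per_float / 1024 ** 2

print(f"md: {table_size_mb(50_000, 300):.0f} MB")   # ~57 MB  -> ~60MB packaged
print(f"lg: {table_size_mb(200_000, 300):.0f} MB")  # ~229 MB -> ~230MB packaged
```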
### `vector_hash_count`
The recommended hash count is `2`, especially for smaller bucket sizes.
Larger hash counts make training with floret slower and inference in spaCy
slightly slower, but may lead to slightly improved performance, especially
with larger bucket sizes.
### `vector_epoch`
You may want to reduce the number of epochs for larger training input sizes.
### `vector_min_count`
You may want to increase the minimum word count for larger training input
sizes.
### `vector_lr`
You may need to decrease the learning rate for larger training input sizes to
avoid NaN errors, see:
https://fasttext.cc/docs/en/faqs.html#im-encountering-a-nan-why-could-this-be
### `vector_thread`
Adjust the number of threads for your CPU. With a larger number of threads,
you may need more epochs to reach the same performance.
## Notes
The project does not currently clean up any intermediate files, so that it
remains possible to resume from any point in the workflow. The overall disk
usage could be reduced by cleaning up files after each step, keeping only the
final floret input text file. Note that floret requires the input file to be
on disk during training.
floret always writes the full `.bin` and `.vec` files after training. These
may be 5GB+ each even though the final `.floret` table is much smaller.
spacy_version: ">=3.2.0,<4.0.0"
vars:
name: "vectors"
lang: "ko"
n_process_tokenize: 16
# The defaults assume that you have a large hard drive mounted under /scratch.
downloaded_dir: "/scratch/vectors/downloaded"
extracted_dir: "/scratch/vectors/extracted"
tokenized_dir: "/scratch/vectors/tokenized"
wikipedia_version: 20220201
newscrawl_year: 2020
oscar_dataset: "oscar-corpus/OSCAR-2109"
oscar_dataset_subset: "deduplicated_${vars.lang}"
# For "oscar" instead of OSCAR-2109 (no auth required).
#oscar_dataset: "oscar"
#oscar_dataset_subset: "unshuffled_deduplicated_${vars.lang}"
oscar_dataset_split: "train"
oscar_max_texts: -1
vector_input_dir: "/scratch/vectors/input"
vector_model: "cbow"
# For languages with alphabets: minn/maxn 4/5 or 5/5 is a good starting point.
vector_minn: 2
vector_maxn: 3
vector_epoch: 5
vector_dim: 300
vector_neg: 10
vector_bucket_md: 50000
vector_bucket_lg: 200000
vector_min_count: 20
vector_hash_count: 2
vector_thread: 16
vector_lr: 0.05
directories: ["software", "vectors"]
assets:
- dest: "software/floret"
git:
repo: "https://github.com/explosion/floret"
branch: "v0.10.2"
path: ""
- dest: "${vars.downloaded_dir}/wikipedia/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2"
url: "https://dumps.wikimedia.org/${vars.lang}wiki/${vars.wikipedia_version}/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2"
- dest: "${vars.downloaded_dir}/opensubtitles/${vars.lang}.txt.gz"
url: "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.${vars.lang}.gz"
- dest: "${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
url: "https://data.statmt.org/news-crawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
workflows:
prepare-text:
- extract-wikipedia
- tokenize-wikipedia
- extract-opensubtitles
- tokenize-opensubtitles
- extract-newscrawl
- tokenize-newscrawl
- tokenize-oscar
- create-input
train-vectors:
- compile-floret
- train-floret-vectors-md
- train-floret-vectors-lg
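# Typical usage from the project root (fetch the assets, then run the workflows):
#   python -m spacy project assets
#   python -m spacy project run prepare-text
#   python -m spacy project run train-vectors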
commands:
- name: "extract-wikipedia"
help: "Convert Wikipedia XML to plain text with Wikiparsec"
script:
- "mkdir -p ${vars.extracted_dir}/wikipedia/"
- "scripts/extract_wikipedia.sh ${vars.downloaded_dir}/wikipedia/${vars.lang}wiki-${vars.wikipedia_version}-pages-articles.xml.bz2 ${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
deps:
- "scripts/extract_wikipedia.sh"
outputs:
- "${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
- name: "tokenize-wikipedia"
help: "Tokenize Wikipedia"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt
--input-file ${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/wikipedia/${vars.lang}wiki_${vars.wikipedia_version}.txt"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt"
- name: "extract-opensubtitles"
help: "Extract OpenSubtitles data"
script:
- "mkdir -p ${vars.extracted_dir}/opensubtitles/"
- "scripts/extract_opensubtitles.sh ${vars.downloaded_dir}/opensubtitles/${vars.lang}.txt.gz ${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
deps:
- "scripts/extract_opensubtitles.sh"
outputs:
- "${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
- name: "tokenize-opensubtitles"
help: "Tokenize OpenSubtitles"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt
--input-file ${vars.extracted_dir}/opensubtitles/${vars.lang}.txt
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/opensubtitles/${vars.lang}.txt"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt"
- name: "extract-newscrawl"
help: "Extract newscrawl data"
script:
- "mkdir -p ${vars.extracted_dir}/newscrawl/"
- "pigz -d -k ${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped.gz"
- "mv ${vars.downloaded_dir}/newscrawl/${vars.lang}/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped ${vars.extracted_dir}/newscrawl"
outputs:
- "${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped"
- name: "tokenize-newscrawl"
help: "Tokenize newscrawl"
script:
- "mkdir -p ${vars.tokenized_dir}"
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt
--input-file ${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped
--n-process ${vars.n_process_tokenize}
deps:
- "scripts/tokenize_resource.py"
- "${vars.extracted_dir}/newscrawl/news.${vars.newscrawl_year}.${vars.lang}.shuffled.deduped"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt"
- name: "tokenize-oscar"
help: "Tokenize and sentencize oscar dataset"
script:
- >-
python scripts/tokenize_resource.py ${vars.lang}
${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt
--input-dataset ${vars.oscar_dataset}
--dataset-subset ${vars.oscar_dataset_subset}
--dataset-split ${vars.oscar_dataset_split}
--dataset-auth-token
--n-process=${vars.n_process_tokenize}
--max-texts=${vars.oscar_max_texts}
deps:
- "scripts/tokenize_resource.py"
outputs:
- "${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt"
- name: "create-input"
help: "Concatenate tokenized input texts"
script:
- >-
python scripts/concat_files.py
--input-file ${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt
--input-file ${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt
${vars.vector_input_dir}/${vars.lang}.txt
deps:
- "scripts/concat_files.py"
- "${vars.tokenized_dir}/${vars.lang}_wiki_${vars.wikipedia_version}.txt"
- "${vars.tokenized_dir}/${vars.lang}_opensubtitles.txt"
- "${vars.tokenized_dir}/${vars.lang}_newscrawl_${vars.newscrawl_year}.txt"
- "${vars.tokenized_dir}/${vars.lang}_oscar_${vars.oscar_dataset_subset}.txt"
outputs:
- "${vars.vector_input_dir}/${vars.lang}.txt"
- name: "compile-floret"
help: "Compile floret"
script:
- "make -C software/floret"
outputs:
- "software/floret/floret"
- name: "train-floret-vectors-md"
help: "Train floret md vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode floret
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-hashCount ${vars.vector_hash_count}
-bucket ${vars.vector_bucket_md}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}_md
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}_md.floret"
- name: "train-floret-vectors-lg"
help: "Train floret lg vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode floret
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-hashCount ${vars.vector_hash_count}
-bucket ${vars.vector_bucket_lg}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}_lg
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}_lg.floret"
- name: "train-fasttext-vectors"
help: "Train fastText vectors"
script:
- >-
software/floret/floret ${vars.vector_model}
-dim ${vars.vector_dim}
-mode fasttext
-epoch ${vars.vector_epoch}
-minCount ${vars.vector_min_count}
-minn ${vars.vector_minn}
-maxn ${vars.vector_maxn}
-neg ${vars.vector_neg}
-thread ${vars.vector_thread}
-lr ${vars.vector_lr}
-input ${vars.vector_input_dir}/${vars.lang}.txt
-output vectors/${vars.lang}.fasttext
deps:
- "software/floret"
- "${vars.vector_input_dir}/${vars.lang}.txt"
outputs:
- "vectors/${vars.lang}.fasttext.vec"