Onboard Project: Usage

Matthew Beech edited this page Oct 28, 2025 · 9 revisions

Onboarding Projects

This page details the silnlp.common.onboard_project script's usage and the configuration options available.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
[--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
projects ...

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| projects | List of Paratext project names (Required) | These projects will be stored on the bucket at Paratext/projects. |
| --copy-from [local_dir] | Path to a directory with a Paratext project | The local project(s) will be copied to the bucket. Default if included without a local_dir is the user's Downloads folder. |
| --config path_to_config | Path to a config.yml file | This is used to configure what optional Onboarding tasks will run. |
| --extract-corpora | Runs silnlp.common.extract_corpora | Extracts corpora. See here for more information. |
| --collect-verse-counts | Runs silnlp.common.collect_verse_counts | Collects verse counts. |
| --clean-project | Runs silnlp.common.clean_projects | Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if --copy-from is provided. |
| --timestamp | Appends a current timestamp to the project name | Adds a timestamp to the project folder name when creating a new Paratext project folder. |
| --wildebeest | Runs a Wildebeest analysis on the extracted corpora | Produces a Wildebeest report for the project. |
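
As an illustration of how the arguments above combine, the following is a minimal sketch that assembles a command line for the script. The helper name build_onboard_command and its keyword parameters are hypothetical and not part of silnlp; only the flags themselves come from the usage text above. The sketch only builds the argument list and does not run anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_onboard_command(
    projects: Sequence[str],
    copy_from: Optional[str] = None,
    config: Optional[str] = None,
    extract_corpora: bool = False,
    collect_verse_counts: bool = False,
    clean_project: bool = False,
    timestamp: bool = False,
    wildebeest: bool = False,
) -> List[str]:
    """Assemble an argument list for silnlp.common.onboard_project.

    Hypothetical helper for illustration; it is not part of silnlp.
    """
    if not projects:
        raise ValueError("at least one Paratext project name is required")

    cmd = ["python", "-m", "silnlp.common.onboard_project"]
    if copy_from is not None:
        cmd += ["--copy-from", copy_from]
    if config is not None:
        cmd += ["--config", config]

    # Boolean switches, in the order they appear in the usage text.
    # Per the table, --clean-project only has an effect with --copy-from.
    switches = [
        ("--extract-corpora", extract_corpora),
        ("--collect-verse-counts", collect_verse_counts),
        ("--clean-project", clean_project),
        ("--timestamp", timestamp),
        ("--wildebeest", wildebeest),
    ]
    cmd += [flag for flag, enabled in switches if enabled]

    # Project names are positional and come last.
    cmd += list(projects)
    return cmd


if __name__ == "__main__":
    example = build_onboard_command(
        ["MyProject"],  # placeholder project name
        copy_from="/path/to/projects",  # placeholder directory
        config="config.yml",
        extract_corpora=True,
    )
    print(shlex.join(example))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems when a path contains spaces.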

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: /root/M/MT/experiments/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
zip_password:
  project_name_1: password_1
  project_name_2: password_2
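
Because a misspelled section or parameter name is easy to miss in a config like this, it may help to check the parsed config before running the script. The following is a minimal sketch, assuming the config has already been loaded into a dictionary (for example with a YAML parser). The function name find_config_problems is hypothetical, and the accepted names are taken from this page rather than from the silnlp source, so they may be incomplete.

```python
from typing import Any, Dict, List, Set

# Section and parameter names as they appear on this page (an assumption;
# they are not read from the silnlp source code).
KNOWN_PARAMETERS: Dict[str, Set[str]] = {
    "extract_corpora": {"include", "exclude", "markers", "lemmas", "project_vrefs"},
    "verse_counts": {"output_folder", "files", "deutero", "recount"},
    "wildebeest": {"x", "n", "r"},
}

# zip_password maps arbitrary project names to passwords, so its keys are not checked.
FREE_FORM_SECTIONS: Set[str] = {"zip_password"}


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable messages for unrecognized sections or parameters."""
    problems: List[str] = []
    for section, values in config.items():
        if section in FREE_FORM_SECTIONS:
            continue
        if section not in KNOWN_PARAMETERS:
            problems.append(f"unknown section: {section}")
            continue
        if not isinstance(values, dict):
            problems.append(f"section {section} should be a mapping")
            continue
        for key in values:
            if key not in KNOWN_PARAMETERS[section]:
                problems.append(f"unknown parameter: {section}.{key}")
    return problems


if __name__ == "__main__":
    parsed = {
        "extract_corpora": {"include": "NT", "exclude": "OT"},
        "verse_counts": {"files": "*.txt", "deutero": False},
    }
    for message in find_config_problems(parsed):
        print(message)
```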

Parameter Definitions

extract_corpora

  • include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
  • exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
  • markers=False: If true, include USFM markers in extraction.
  • lemmas=False: If true, extract lemmas.
  • project_vrefs=False: If true, extract project_vrefs.

collect_verse_counts

  • output_folder=path_to_output_folder: Folder to store the verse counts.
  • files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
  • deutero=False: If true, include counts for Deuterocanon books.
  • recount=False: If true, force recount of verse counts.
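
To make the semicolon-delimited files value concrete, here is a minimal sketch of how such a pattern list could be applied to a set of extract file names. It uses shell-style wildcard matching from Python's standard library, which is an assumption; the matching done by silnlp.common.collect_verse_counts may differ. The function name select_extract_files is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_extract_files(file_names: Iterable[str], patterns: str) -> List[str]:
    """Return the file names that match any pattern in a ';'-delimited list.

    Illustrative only: assumes shell-style wildcards and case-sensitive matching.
    """
    pattern_list = [p.strip() for p in patterns.split(";") if p.strip()]
    return [
        name
        for name in file_names
        if any(fnmatchcase(name, pattern) for pattern in pattern_list)
    ]


if __name__ == "__main__":
    names = ["en-NT.txt", "en-OT.txt", "fr-NT.txt", "fr-OT.txt"]
    # 'en-*.txt' matches both English files; 'fr-NT.txt' matches one French file.
    print(select_extract_files(names, "en-*.txt;fr-NT.txt"))
```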

wildebeest

  • x=500: Maximum number of examples per line.
  • n=500: Maximum number of cases per group.
  • r=vref.txt: File with sentence reference IDs.
  • See the Wildebeest Repo for more info

zip_password

  • Stores the project names and the respective passwords for any encrypted zip files
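
As a rough illustration of how a zip_password mapping might be consumed, the sketch below looks up a project's password and passes it to Python's standard zipfile module. This is not the script's actual implementation. The function names are hypothetical, and the convention that the zip file is named after the project is an assumption. Note that the standard zipfile module can only decrypt legacy ZIP encryption, not AES-encrypted archives.

```python
import zipfile
from pathlib import Path
from typing import Any, Dict, Optional


def lookup_zip_password(config: Dict[str, Any], project_name: str) -> Optional[bytes]:
    """Return the configured password for a project as bytes, or None if absent."""
    passwords = config.get("zip_password") or {}
    password = passwords.get(project_name)
    return None if password is None else str(password).encode("utf-8")


def extract_project_zip(zip_path: Path, dest_dir: Path, config: Dict[str, Any]) -> None:
    """Extract a project zip, supplying a password when one is configured.

    Assumption for illustration: the zip file stem equals the project name.
    """
    password = lookup_zip_password(config, zip_path.stem)
    with zipfile.ZipFile(zip_path) as archive:
        # pwd is ignored for members that are not encrypted.
        archive.extractall(dest_dir, pwd=password)
```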