Onboard Project: Usage
Matthew Beech edited this page Oct 28, 2025 · 9 revisions
This page details the usage of the silnlp.common.onboard_project script and its available configuration options.
The script cleans and uploads a Paratext project from a local machine to the MinIO bucket, and can optionally perform other onboarding tasks.
usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
                    [--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
                    projects ...
Arguments:
| Argument | Purpose | Description |
|---|---|---|
| `projects` | List of Paratext project names | (Required) These projects will be stored on the bucket at Paratext/projects. |
| `--copy-from [local_dir]` | Path to a directory with a Paratext project | The local project(s) will be copied to the bucket. If the flag is given without a local_dir, the user's Downloads folder is used. |
| `--config path_to_config` | Path to a config.yml file | Configures which optional onboarding tasks will run. |
| `--extract-corpora` | Runs silnlp.common.extract_corpora | Extracts corpora. See the silnlp.common.extract_corpora documentation for more information. |
| `--collect-verse-counts` | Runs silnlp.common.collect_verse_counts | Collects verse counts. |
| `--clean-project` | Runs silnlp.common.clean_projects | Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if --copy-from is provided. |
| `--timestamp` | Appends the current timestamp to the project name | Adds a timestamp to the project folder name when creating a new Paratext project folder. |
| `--wildebeest` | Runs a Wildebeest analysis on the extracted corpora | Produces a Wildebeest report for the project. |
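The optional-value behaviour of --copy-from can be illustrated with a small argparse sketch. This is not the script's actual source: the function name `build_parser` and the use of `nargs="?"` with a `const` value are assumptions chosen to reproduce the documented behaviour (flag without a value falls back to the Downloads folder).

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line."""
    parser = argparse.ArgumentParser(prog="onboard_project")
    parser.add_argument("projects", nargs="+", help="Paratext project names")
    # nargs="?" plus const models "flag given without a value":
    # the Downloads folder is used. Omitting the flag yields None.
    parser.add_argument(
        "--copy-from",
        nargs="?",
        const=str(Path.home() / "Downloads"),
        default=None,
        metavar="local_dir",
    )
    parser.add_argument("--config", metavar="path_to_config", default=None)
    for flag in (
        "--extract-corpora",
        "--collect-verse-counts",
        "--clean-project",
        "--timestamp",
        "--wildebeest",
    ):
        parser.add_argument(flag, action="store_true")
    return parser


# A bare --copy-from followed by another flag picks up the const value.
args = build_parser().parse_args(["PROJ_A", "--copy-from", "--clean-project"])
print(args.projects, args.copy_from, args.clean_project)
```

With a parser built this way, a bare --copy-from placed directly before a project name would consume that name as local_dir, so in this sketch it is safer to list project names first or to follow the flag with another option.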
The config file contains the parameters for all of the optional onboarding tasks this script can execute.
Below is an example of an onboarding config:
extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: /root/M/MT/experiments/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
zip_password:
  project_name_1: password_1
  project_name_2: password_2
extract_corpora options:
- include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
- exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
- markers=False: If true, include USFM markers in extraction.
- lemmas=False: If true, extract lemmas.
- project_vrefs=False: If true, extract project_vrefs.
verse_counts options:
- output_folder=path_to_output_folder: Folder to store the verse counts.
- files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
- deutero=False: If true, include counts for Deuterocanon books.
- recount=False: If true, force recount of verse counts.
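As an illustration of how a semicolon-delimited pattern list such as the `files` value could select extract files, here is a minimal sketch. The helper name `matches_any` and the use of `fnmatchcase` are assumptions; the actual script may match patterns differently.

```python
from fnmatch import fnmatchcase


def matches_any(filename: str, patterns: str) -> bool:
    """Return True if filename matches any pattern in a semicolon-delimited list."""
    return any(
        fnmatchcase(filename, pattern.strip())
        for pattern in patterns.split(";")
        if pattern.strip()  # ignore empty entries such as a trailing ';'
    )


# Hypothetical extract file names, used only for illustration.
names = ["en-KJV.txt", "fr-NT.txt", "fr-OT.txt", "notes.md"]
selected = [name for name in names if matches_any(name, "en-*.txt;fr-NT.txt")]
print(selected)  # ['en-KJV.txt', 'fr-NT.txt']
```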
wildebeest options:
- x=500: max number of examples per line
- n=500: max number of cases per group
- r=vref.txt: file with sentence reference IDs

See the Wildebeest repo for more info.
 
- zip_password: Stores the project names and their respective passwords for any encrypted zip files.
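To show how the documented defaults relate to a partial config, the sketch below fills in any option the user leaves out. The default values are copied from the option lists on this page; the `DEFAULTS` table, the `with_defaults` helper, and the merge logic itself are assumptions for illustration, not the script's implementation. output_folder is omitted because no concrete default is documented for it.

```python
# Defaults taken from the option lists on this page.
# output_folder has no documented default, so it is left out.
DEFAULTS = {
    "extract_corpora": {
        "include": [],
        "exclude": [],
        "markers": False,
        "lemmas": False,
        "project_vrefs": False,
    },
    "verse_counts": {"files": "*.txt", "deutero": False, "recount": False},
    "wildebeest": {"x": 500, "n": 500, "r": "vref.txt"},
    "zip_password": {},
}


def with_defaults(config: dict) -> dict:
    """Return a copy of config where missing options fall back to DEFAULTS."""
    merged = {}
    for section, defaults in DEFAULTS.items():
        # A section may be absent or empty (None) in the parsed config.
        merged[section] = {**defaults, **(config.get(section) or {})}
    return merged


# Hypothetical parsed config, equivalent to a small config.yml.
user_config = {
    "extract_corpora": {"include": "NT", "exclude": "OT"},
    "zip_password": {"project_name_1": "password_1"},
}
print(with_defaults(user_config)["wildebeest"])  # {'x': 500, 'n': 500, 'r': 'vref.txt'}
```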