Onboard Project: Usage

Onboarding Projects

This page details the silnlp.common.onboard_project script's usage and the configuration options available.

onboard_project

Cleans and uploads a Paratext project from a local machine to the MinIO bucket. Optionally performs other Onboarding tasks.

usage: python -m silnlp.common.onboard_project [--copy-from [local_dir]] [--config path_to_config]
[--extract-corpora] [--collect-verse-counts] [--clean-project] [--timestamp] [--wildebeest]
projects ...

Arguments:

Argument	Purpose	Description
`projects`	list of Paratext project names	(Required) These projects will be stored on the bucket at `Paratext/projects`.
`--copy-from [local_dir]`	Path to a directory with a Paratext project.	The local project(s) will be copied to the bucket. If the flag is given without a local_dir, the user's `Downloads` folder is used.
`--config path_to_config`	Path to a config.yml file	This is used to configure what optional Onboarding tasks will run.
`--extract-corpora`	Runs silnlp.common.extract_corpora	Extracts corpora. See here for more information.
`--collect-verse-counts`	Runs silnlp.common.collect_verse_counts	Collects verse counts.
`--clean-project`	Runs silnlp.common.clean_projects	Cleans the Paratext project folder by removing unnecessary files and folders before copying. Only used if --copy-from is provided.
`--timestamp`	Appends a current timestamp to the project name	Adds a timestamp to the project folder name when creating a new Paratext project folder.
`--wildebeest`	Runs a Wildebeest analysis on the extracted corpora.	Produces a Wildebeest report for the project.
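
The optional-value behaviour of `--copy-from` can be illustrated with a small argparse sketch. This is not the script's actual parser; the function name `build_parser` and the exact defaults are assumptions made for illustration, based only on the usage line and table above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="onboard_project")
    parser.add_argument("projects", nargs="+", help="Paratext project names")
    # nargs="?" allows the flag to appear with or without a value;
    # const is the value used when the flag is given alone.
    parser.add_argument(
        "--copy-from",
        nargs="?",
        const=str(Path.home() / "Downloads"),
        default=None,
        metavar="local_dir",
    )
    parser.add_argument("--config", default=None, metavar="path_to_config")
    for flag in (
        "--extract-corpora",
        "--collect-verse-counts",
        "--clean-project",
        "--timestamp",
        "--wildebeest",
    ):
        parser.add_argument(flag, action="store_true")
    return parser


# A bare --copy-from followed by another flag falls back to the Downloads folder.
args = build_parser().parse_args(["--copy-from", "--clean-project", "ProjA", "ProjB"])
print(args.copy_from, args.clean_project, args.projects)
```

With a parser built this way, a bare `--copy-from` placed directly before the project names would consume the first name as `local_dir`, so it is safer to follow it with another flag or to list the projects first. Whether the real script behaves the same way is not confirmed here.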

config file

The config file contains the parameters for all of the optional onboarding tasks this script can execute.

Below is an example of an onboarding config:

extract_corpora:
  include: NT
  exclude: OT
verse_counts:
  output_folder: /root/M/MT/experiments/test_onboard_project
  files: "*.txt"
  deutero: false
  recount: false
wildebeest:
  x: 500
  n: 500
  r: vref.txt
zip_password:
  project_name_1: password_1
  project_name_2: password_2
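
As a rough illustration of the structure above, the following sketch treats the config as the dictionary a YAML parser would produce and flags any top-level section it does not recognise. The section names are taken from the example; the helper `unknown_sections` is hypothetical and is not part of silnlp.

```python
# Top-level sections as they appear in the example config above.
KNOWN_SECTIONS = {"extract_corpora", "verse_counts", "wildebeest", "zip_password"}


def unknown_sections(config: dict) -> list:
    """Return the top-level keys that are not recognised sections (hypothetical check)."""
    return sorted(set(config) - KNOWN_SECTIONS)


# The example config, written as the dict a YAML loader would return.
config = {
    "extract_corpora": {"include": "NT", "exclude": "OT"},
    "verse_counts": {
        "output_folder": "/root/M/MT/experiments/test_onboard_project",
        "files": "*.txt",
        "deutero": False,
        "recount": False,
    },
    "wildebeest": {"x": 500, "n": 500, "r": "vref.txt"},
    "zip_password": {"project_name_1": "password_1", "project_name_2": "password_2"},
}

print(unknown_sections(config))
```

A check like this catches misspelled section names before any task runs; how the script itself treats unrecognised keys is not stated on this page.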

Parameter Definitions

extract_corpora

include=[]: A list of books to include; e.g., 'NT', 'OT', 'GEN'.
exclude=[]: A list of books to exclude; e.g., 'NT', 'OT', 'GEN'.
markers=False: If true, include USFM markers in extraction.
lemmas=False: If true, extract lemmas.
project_vrefs=False: If true, extract project_vrefs.

collect_verse_counts

output_folder=path_to_output_folder: Folder to store the verse counts.
files=*.txt: Semicolon-delimited list of patterns of extract file names to count (e.g. 'en-*.txt;fr-NT.txt').
deutero=False: If true, include counts for Deuterocanon books.
recount=False: If true, force recount of verse counts.
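
The semicolon-delimited `files` value can be read as a list of glob-style patterns. The sketch below shows one way such a value might be matched against file names; it assumes shell-style wildcards and uses the standard-library `fnmatch` module, which may differ from what `collect_verse_counts` does internally. The helper name `matches_any` is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(file_name: str, patterns: str) -> bool:
    """Return True if file_name matches any pattern in a ';'-delimited list."""
    # Empty entries (e.g. from a trailing ';') are skipped.
    return any(fnmatchcase(file_name, p) for p in patterns.split(";") if p)


patterns = "en-*.txt;fr-NT.txt"
for name in ("en-NT.txt", "fr-NT.txt", "fr-OT.txt"):
    print(name, matches_any(name, patterns))
```

`fnmatchcase` is used rather than `fnmatch` so that matching is case-sensitive on every platform.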

wildebeest

x=500: Maximum number of examples per line.
n=500: Maximum number of cases per group.
r=vref.txt: File with sentence reference IDs.
See the Wildebeest repo for more information.

zip_password

Stores the project names and their respective passwords for any encrypted zip files.
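
A minimal sketch of how such a mapping could be used when unpacking a project archive is shown below. It assumes the zip file is named after the project and relies on the standard-library `zipfile` module, which can decrypt only the legacy ZIP encryption scheme. The function `extract_project_zip` is hypothetical; the script's actual handling of encrypted zips may differ.

```python
import zipfile
from pathlib import Path


def extract_project_zip(zip_path, dest, zip_passwords):
    """Extract zip_path into dest, using a password from zip_passwords if one is listed."""
    # Assumption: the archive is named <project_name>.zip.
    project = Path(zip_path).stem
    password = zip_passwords.get(project)
    # zipfile expects the password as bytes, or None when no password is needed.
    pwd = password.encode("utf-8") if password is not None else None
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest, pwd=pwd)
    return Path(dest)
```

Here `zip_passwords` would be the `zip_password` section of the config, for example `{"project_name_1": "password_1"}`. Projects without an entry are extracted with no password.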
