Preferences for the onboard_project and other scripts. #874

@davidbaines

These are just my preferences for the onboard_project.py script, including an ideal order of operations based on my current practice, along with the rationale.

Order of operations.

  1. Check that the folder exists - else exit with the expected path shown in the error message.
  2. Check that the folder contains Settings.xml file - else exit with the expected path shown in the error message.
  3. Parse the Settings.xml file and ensure that at least one SFM file can be found using the naming convention specified, else exit with a detailed error message. The error message can look for common SFM files by extension (case insensitive: 'SFM', 'USFM', 'PT8') and describe the apparent naming system of the SFM files and how the Settings.xml file should specify them.
  4. Clean the data on the local disk; make this the default behaviour.
  5. Rename the folder on the local disk.
  6. Copy the folder to MT/Paratext/projects
  7. Run the other steps: extract, verse-counts, stats.
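
The first three checks above could be sketched roughly as follows. This is only an illustration, not the existing onboard_project.py code: the function names `check_project_folder` and `find_sfm_like_files` are hypothetical, and the sketch only does the fallback extension scan from step 3. It does not parse the naming convention out of Settings.xml, since the layout of that file isn't described here.

```python
from pathlib import Path
from typing import List, Optional

# Extensions mentioned in step 3, compared case-insensitively.
SFM_EXTENSIONS = {".sfm", ".usfm", ".pt8"}


def find_sfm_like_files(folder: Path) -> List[Path]:
    """Return files in the folder whose extension looks like an SFM file."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SFM_EXTENSIONS
    )


def check_project_folder(folder: Path) -> Optional[str]:
    """Return an error message for the first failed check, or None if all pass."""
    # Step 1: the folder must exist; show the expected path in the message.
    if not folder.is_dir():
        return f"Project folder not found. Expected: {folder}"
    # Step 2: Settings.xml must be present; show the expected path.
    settings = folder / "Settings.xml"
    if not settings.is_file():
        return f"Settings.xml not found. Expected: {settings}"
    # Step 3 (fallback part only): at least one SFM-like file must exist.
    if not find_sfm_like_files(folder):
        return f"No SFM files (.SFM, .USFM, .PT8) found in {folder}"
    return None
```

Running all of these checks before any cleaning, renaming or copying means a bad path fails fast and nothing on disk is touched.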

Some changes are simply shorter and simpler names to save typing.

  1. Change onboard_project.py to onboard.py
  2. It would be great to be able to specify a per-user default folder as the source folder.
  3. Change the argument name to folder rather than --copy-from to remove the hyphen.
  4. Change --extract-corpora to --extract : I'll ask the other EITL team members to see whether they like this idea or not, since extract_corpora is the name of the existing script. Ideally we'd change the name of extract-corpora to extract.py too.
  5. Change --timestamp to --datestamp since it's adding a date.
  6. Change --clean-project to --no-clean (unless we can think of a better hyphenless alternative).
  7. Change --collect-verse-counts to --countverses and change the name of the collect_verse_counts script to countverses.py
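
To make the proposed names concrete, here is a minimal argparse sketch of how the renamed interface might look. It assumes the existing options are simple on/off flags, which I haven't checked against the current script, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser using the proposed shorter argument names."""
    parser = argparse.ArgumentParser(prog="onboard.py")
    # Positional argument replacing --copy-from.
    parser.add_argument("folder", help="project folder to onboard")
    parser.add_argument("--extract", action="store_true",
                        help="run the extract step")
    parser.add_argument("--datestamp", action="store_true",
                        help="add a date to the project name")
    # Cleaning is on by default; --no-clean switches it off.
    parser.add_argument("--no-clean", dest="clean", action="store_false",
                        help="skip cleaning the project")
    parser.add_argument("--countverses", action="store_true",
                        help="run the verse counting step")
    return parser
```

With `dest="clean"` and `action="store_false"`, `args.clean` is `True` unless `--no-clean` is given, so a call such as `onboard.py MyProject --extract` would still clean the data.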

Move slow imports to where they are needed.

  1. In particular, not every script uses NLTK, and checking that it has been downloaded delays the feedback for errors in the arguments that were given.
  2. Maybe there are other 'slow imports' that could be moved so that they are loaded only when required.
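
A rough sketch of the deferred import idea: parse the arguments first, then load the slow module only on the code path that needs it. The `--stats` flag and the `heavy_module` parameter are only there for illustration (the parameter lets the sketch run without NLTK installed); in a real script a plain `import nltk` inside the function that uses it would do the same job.

```python
import argparse
import importlib


def main(argv, heavy_module="nltk"):
    """Parse arguments first, then import the slow module only if needed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("folder")
    parser.add_argument("--stats", action="store_true")
    # Argument errors are reported here, before any slow import runs.
    args = parser.parse_args(argv)

    if not args.stats:
        # The slow module is never loaded on this path.
        return None

    # Deferred import: the cost is paid only by runs that need the module.
    return importlib.import_module(heavy_module)
```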

Rationales.

Removing hyphens and underscores from script and argument names saves a lot of mental effort and therefore saves a significant amount of time.
I often have to check the code, or use help or trial and error to discover whether a hyphen or an underscore is required and that feels like a waste of time.
It should be more effort not to clean the data than it is to clean it. It would be better if we don't have PII on our home machines or laptops.
It would be great if we could move the import and download check for NLTK so that it only happens when NLTK is about to be used. I don't think that many of our scripts need NLTK, and that short delay on every script run adds up over all users and over time.
