test compute/time requirements for different train tasks #26

Open · 4 of 5 tasks
cmroughan opened this issue Nov 18, 2024 · 6 comments

cmroughan commented Nov 18, 2024

To better automate the creation of slurm batch scripts, we will want a clearer sense of how variations in training data size and type affect the compute and time requirements for a training job. With that in mind, we will want to evaluate the following:

How does a given factor impact the time requirements for a training task? What is the optimal number of CPU cores? Run tests addressing these questions for the following factors, across both segmentation and transcription training tasks (a rough sketch of the kind of batch script we want to generate follows the list):

  • File count
  • File sizes
  • Complexity of training data
  • Training from scratch vs fine-tuning
  • Worker counts
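
For reference, a minimal sketch of the kind of batch-script generation these measurements are meant to inform. The `#SBATCH` directives, the example `ketos segtrain` invocation, and all resource values are illustrative placeholders to be filled in from the test results, not settled choices; command names and flags should be checked against the installed kraken version.

```python
# Minimal sketch: render a slurm batch script for a kraken training job.
# All resource values and the example ketos command are assumptions/placeholders.
from pathlib import Path

SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task={workers}
#SBATCH --mem-per-cpu={mem_per_cpu_gb}G
#SBATCH --time={time_limit}

{train_command}
"""

def write_batch_script(job_name: str, train_command: str, workers: int,
                       mem_per_cpu_gb: int, time_limit: str,
                       out_dir: Path = Path(".")) -> Path:
    """Render an sbatch script and return its path."""
    script = SBATCH_TEMPLATE.format(
        job_name=job_name, train_command=train_command, workers=workers,
        mem_per_cpu_gb=mem_per_cpu_gb, time_limit=time_limit)
    path = out_dir / f"{job_name}.sh"
    path.write_text(script)
    return path

# Example: a segmentation fine-tuning job (paths, flags, and limits are illustrative).
write_batch_script(
    job_name="segtrain-100img",
    train_command="ketos segtrain --workers 8 -i blla.mlmodel data/*.xml",
    workers=8, mem_per_cpu_gb=6, time_limit="01:00:00",
)
```
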
cmroughan self-assigned this Nov 18, 2024
cmroughan commented:

Consider calculating compile time for producing .arrow files for transcription training. Is this always efficient, or only for certain datasets?
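
A rough sketch of how that timing could be captured, assuming kraken's `ketos compile` is used to build the .arrow dataset (flags shown from memory; they should be checked against the installed version):

```python
# Time the ketos compile step so its cost can be weighed against the per-epoch
# savings of training from a binary (.arrow) dataset.
import subprocess
import time
from glob import glob

xml_files = sorted(glob("data/*.xml"))  # transcribed ALTO/PAGE files (placeholder path)

start = time.perf_counter()
subprocess.run(["ketos", "compile", "-f", "xml", "-o", "dataset.arrow", *xml_files],
               check=True)
print(f"compiled {len(xml_files)} files in {time.perf_counter() - start:.1f} s")
```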

cmroughan commented Nov 27, 2024

First round of tests is done, measuring five epochs of segmentation and transcription training on differently sized batches of training data, with all images normalized to approximately 612.7 KB. Training data batch sizes were 10, 20, 50, 100, 200, and 500 images. Tests were run with one GPU and 8 CPUs with 6 GB of memory each. Full details are in the GSheet; summary below:
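
For clarity, the "time increase / img increase" column in the tables below compares growth in train duration against growth in image count, relative to the 10-image baseline; 1.00 means time scales exactly with image count, and lower values mean sub-linear scaling. For example:

```python
# How the "time increase / img increase" column is derived: growth in train duration
# relative to the 10-image baseline, divided by growth in image count.
def scaling_ratio(duration_s, baseline_s, n_images, baseline_images=10):
    return (duration_s / baseline_s) / (n_images / baseline_images)

# Segmentation "from scratch", 20 images: 0:03:13 vs. the 10-image 0:02:07 baseline.
print(round(scaling_ratio(193, 127, 20), 2))  # 0.76
```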

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:07 | 1.00 | 13.7 GB | 4.4 GiB |
| 20 | 0:03:13 | 0.76 | 17.1 GB | 4.4 GiB |
| 50 | 0:04:37 | 0.44 | 23.8 GB | 5.2 GiB |
| 100 | 0:09:33 | 0.45 | 38.4 GB | 6.9 GiB |
| 200 | 0:15:17 | 0.36 | 35.6 GB | 7.4 GiB |
| 500 | 0:37:31 | 0.35 | 43.1 GB | 8.3 GiB |

Training by refining on existing model:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:01:47 | 1.00 | 10.7 GB | 4.4 GiB |
| 20 | 0:02:18 | 0.64 | 15.8 GB | 4.4 GiB |
| 50 | 0:04:21 | 0.49 | 20.8 GB | 5.1 GiB |
| 100 | 0:10:32 | 0.59 | 35.4 GB | 6.6 GiB |
| 200 | 0:14:54 | 0.42 | 36.0 GB | 7.4 GiB |
| 500 | 0:38:12 | 0.43 | 49.3 GB | 7.7 GiB |

Transcription

Training from scratch, straight XML:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:26 | 1.00 | 4.1 GB | 1.9 GiB |
| 20 | 0:03:44 | 0.77 | 4.5 GB | 2.0 GiB |
| 50 | 0:08:30 | 0.70 | 5.7 GB | 2.2 GiB |
| 100 | 0:18:30 | 0.76 | 6.0 GB | 2.2 GiB |
| 200 | 0:37:47 | 0.78 | 6.6 GB | 2.1 GiB |
| 500 | 1:37:50 | 0.80 | 8.0 GB | 2.3 GiB |

Training by refining on existing model, straight XML:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:43 | 1.00 | 4.7 GB | 1.5 GiB |
| 20 | 0:04:11 | 0.77 | 5.5 GB | 1.5 GiB |
| 50 | 0:08:41 | 0.64 | 5.7 GB | 1.5 GiB |
| 100 | 0:18:07 | 0.67 | 6.4 GB | 1.6 GiB |
| 200 | 0:36:34 | 0.67 | 6.7 GB | 1.6 GiB |
| 500 | interrupted (to be rerun) | | | |

Training from scratch, binary:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:46 | 1.00 | 3.9 GB | 2.3 GiB |
| 20 | 0:03:28 | 0.63 | 4.6 GB | 2.3 GiB |
| 50 | 0:07:27 | 0.54 | 5.0 GB | 2.3 GiB |
| 100 | 0:16:02 | 0.58 | 5.9 GB | 2.8 GiB |
| 200 | 0:32:53 | 0.59 | 6.8 GB | 2.8 GiB |
| 500 | 1:23:17 | 0.60 | 7.8 GB | 2.5 GiB |

Training by refining on existing model, binary:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:03:05 | 1.00 | 5.7 GB | 1.5 GiB |
| 20 | 0:03:20 | 0.54 | 6.3 GB | 1.5 GiB |
| 50 | 0:07:26 | 0.48 | 7.0 GB | 1.5 GiB |
| 100 | 0:15:12 | 0.49 | 8.0 GB | 1.6 GiB |
| 200 | 0:29:45 | 0.48 | 8.9 GB | 1.6 GiB |
| 500 | 1:15:44 | 0.49 | 8.8 GB | 1.6 GiB |

Future rounds of testing will examine the impact of different file sizes and of different worker counts.
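
Since the end goal is filling in slurm --time estimates automatically, here is a sketch of how these measurements could feed a duration estimate. A linear fit is an assumption, and it only covers the 5-epoch, fixed-image-size setup tested above.

```python
# Fit duration ~ a + b * n_images to the segmentation fine-tuning (blla.mlmodel)
# rows above and use it to estimate walltime for other dataset sizes.
import numpy as np

n_images = np.array([10, 20, 50, 100, 200, 500])
duration_s = np.array([127, 193, 277, 573, 917, 2251])  # 0:02:07 ... 0:37:31

slope, intercept = np.polyfit(n_images, duration_s, deg=1)
print(f"~{slope:.1f} s/image + {intercept:.0f} s overhead")
print(f"estimated duration for 300 images: {intercept + slope * 300:.0f} s")
```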

cmroughan commented:

Another round of tests examined what impact different values of the --workers parameter have on resource usage and train time. Full details are again in the GSheet (see the rows where the worker count varies).

Tests were run by changing the --workers parameter in the kraken train command, setting slurm --cpus-per-task to match that value, and adjusting slurm --mem-per-cpu so that the total memory across all CPUs stayed approximately the same; see the small sketch below.
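
The bookkeeping, roughly (the 48 GB total matches the 8 × 6 GB configuration from the first round):

```python
# Keep the total memory budget fixed and derive --mem-per-cpu from the worker count
# (48 GB total here matches the 8 x 6 GB configuration used in the first round).
import math

def mem_per_cpu_gb(total_mem_gb: int, workers: int) -> int:
    return math.ceil(total_mem_gb / workers)

for workers in (4, 8, 16, 24):
    print(f"--cpus-per-task={workers} --mem-per-cpu={mem_per_cpu_gb(48, workers)}G")
```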

Varying the worker count seems to have no appreciable impact on transcription training, but it does have an impact on segmentation training.

Transcription

Training from scratch, straight XML:

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:17:09 |
| 100 | 8 | 2 GB | 0:16:27 |
| 100 | 16 | 1 GB | 0:16:27 |

Training from scratch, binary:

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:16:16 |
| 100 | 8 | 2 GB | 0:16:35 |
| 100 | 16 | 1 GB | 0:16:39 |

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 12 GB | 0:13:51 |
| 100 | 8 | 6 GB | 0:08:49 |
| 100 | 16 | 3 GB | 0:06:01 |
| 100 | 24 | 2 GB | OOM killed* |
| 100 | 24 | 3 GB | 0:05:08 |

* The 24-worker / 2 GB entry above never completed an epoch: it hit an OOM kill before training began. There appears to be a lower limit on memory per worker required for the job to succeed, depending on the size of the input training data.
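
For a quick sense of the scaling, speedups of the 100-image runs relative to the 4-worker baseline, computed from the durations in the table above (note the 24-worker run used 3 GB per worker, so its total memory was higher):

```python
# Speedup of the 100-image segmentation runs relative to the 4-worker baseline.
durations_s = {4: 13 * 60 + 51, 8: 8 * 60 + 49, 16: 6 * 60 + 1, 24: 5 * 60 + 8}

baseline = durations_s[4]
for workers, secs in durations_s.items():
    print(f"{workers:>2} workers: speedup x{baseline / secs:.2f}")
```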

Relatedly, for 500 images, training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | average epoch |
| --- | --- | --- | --- |
| 500 | 4 | 12 GB | 724.0 s |
| 500 | 8 | 6 GB | 491.4 s |
| 500 | 16 | 3 GB | OOM killed |
| 500 | 16 | 4 GB | OOM killed |

Review of the kraken code might help illuminate what minimums need to be met to avoid the OOM killed error.

rlskoeser commented:

@cmroughan this is great. Could you also test increasing workers without decreasing memory?

When I was fighting the OOM errors due to the input data problem, I found some comments on GitHub issues about that and vaguely recall seeing 3 GB as a minimum for segmentation. I may be able to find them again if that would be helpful.

cmroughan commented Dec 5, 2024

> Could you also test increasing workers without decreasing memory?

Ran the tests: increasing workers without decreasing memory does not produce tangible benefits, as can be seen in the comparisons below. The job does not end up using the extra resources, which makes for an overly expensive slurm request and, if done too often, can lead to future jobs being ranked lower in the queue.

Transcription

Training from scratch, straight XML:

| image count | worker count | mem per CPU | train duration | CPU memory used / requested | memory efficiency |
| --- | --- | --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:17:09 | 4.2 / 16.0 GB | 26.25% |
| 100 | 8 | 4 GB | 0:17:13 | 5.3 / 32.0 GB | 16.56% |
| 100 | 8 | 2 GB | 0:16:27 | 5.0 / 16.0 GB | 31.25% |

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | train duration | CPU memory used / requested | memory efficiency |
| --- | --- | --- | --- | --- | --- |
| 100 | 4 | 12 GB | 0:13:51 | 19.7 / 48.0 GB | 41.04% |
| 100 | 8 | 12 GB | 0:09:03 | 25.1 / 96.0 GB | 26.15% |
| 100 | 8 | 6 GB | 0:08:49 | 26.7 / 48.0 GB | 55.63% |

cmroughan commented Dec 5, 2024

Ran further tests to track how the complexity of the training data impacts resource usage, starting with transcription training tasks. For transcription, line count is a more relevant metric than page count (which makes sense, considering the lines are the input training data). Line length, tracked here as character count, also has an impact.

Tracking line counts

Training from scratch, straight XML, same CPU counts + memory:

| image count | line count | avg epoch | full duration |
| --- | --- | --- | --- |
| 10 | 524 | 14.4 s | 0:02:24 |
| 10 | 261 | 6.6 s | 0:02:30 |
| 5 | 269 | 7.4 s | 0:02:13 |
| 100 | 5767 | 161.2 s | 0:15:26 |
| 100 | 2870 | 83.6 s | 0:08:16 |
| 47 | 2845 | 84.4 s | 0:08:17 |
| 500 | 29520 | 937.8 s | 1:22:25 |
| 500 | 14649 | 464.2 s | 0:41:39 |
| 241 | 14563 | 475.8 s | 0:42:37 |
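
A quick per-line check on the table above supports reading line count as the main driver: epoch time divided by line count stays in a fairly narrow band (values taken from the rows above).

```python
# Per-line epoch cost from the rows above: roughly constant, so line count (not
# page count) is the better predictor of transcription training time.
rows = [  # (image count, line count, avg epoch seconds)
    (10, 524, 14.4), (10, 261, 6.6), (5, 269, 7.4),
    (100, 5767, 161.2), (100, 2870, 83.6), (47, 2845, 84.4),
    (500, 29520, 937.8), (500, 14649, 464.2), (241, 14563, 475.8),
]

for images, lines, epoch_s in rows:
    print(f"{lines:>5} lines: {epoch_s / lines * 1000:.1f} ms/line ({images} images)")
```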

Tracking line lengths:

| image count | line count | line length | avg epoch |
| --- | --- | --- | --- |
| 500 | 7538 | < 11 chars | 113.4 s |
| 500 | 7537 | > 67 chars | 361.8 s |

In addition to line counts, I'll have to go back and extract the average line lengths across the datasets that have been tested.
