test compute/time requirements for different train tasks #26

Open · 4 of 5 tasks
cmroughan opened this issue Nov 18, 2024 · 6 comments

cmroughan commented Nov 18, 2024

To better automate the creation of slurm batch scripts, we will want a clearer sense of how variations in training data size and type affect the compute and time requirements for a training job. With that in mind, we will want to evaluate the following:

How does a given factor impact the time requirements for a training task? What is the optimal number of CPU cores? Run tests addressing these questions for the following factors, across both segmentation and transcription training tasks (a rough sketch of the kind of batch script we want to generate follows the list):

  • File count
  • File sizes
  • Complexity of training data
  • Training from scratch vs fine-tuning
  • Worker counts
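
For reference, a minimal sketch of the kind of batch-script generation these measurements are meant to inform. The `#SBATCH` directives, the example `ketos segtrain` invocation, and all resource values are illustrative placeholders to be filled in from the test results, not settled choices; command names and flags should be checked against the installed kraken version.

```python
# Minimal sketch: render a slurm batch script for a kraken training job.
# All resource values and the example ketos command are assumptions/placeholders.
from pathlib import Path

SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task={workers}
#SBATCH --mem-per-cpu={mem_per_cpu_gb}G
#SBATCH --time={time_limit}

{train_command}
"""

def write_batch_script(job_name: str, train_command: str, workers: int,
                       mem_per_cpu_gb: int, time_limit: str,
                       out_dir: Path = Path(".")) -> Path:
    """Render an sbatch script and return its path."""
    script = SBATCH_TEMPLATE.format(
        job_name=job_name, train_command=train_command, workers=workers,
        mem_per_cpu_gb=mem_per_cpu_gb, time_limit=time_limit)
    path = out_dir / f"{job_name}.sh"
    path.write_text(script)
    return path

# Example: a segmentation fine-tuning job (paths, flags, and limits are illustrative).
write_batch_script(
    job_name="segtrain-100img",
    train_command="ketos segtrain --workers 8 -i blla.mlmodel data/*.xml",
    workers=8, mem_per_cpu_gb=6, time_limit="01:00:00",
)
```
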
cmroughan self-assigned this Nov 18, 2024
cmroughan commented:

Consider calculating compile time for producing .arrow files for transcription training. Is this always efficient, or only for certain datasets?
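
A rough sketch of how that timing could be captured, assuming kraken's `ketos compile` is used to build the .arrow dataset (flags shown from memory; they should be checked against the installed version):

```python
# Time the ketos compile step so its cost can be weighed against the per-epoch
# savings of training from a binary (.arrow) dataset.
import subprocess
import time
from glob import glob

xml_files = sorted(glob("data/*.xml"))  # transcribed ALTO/PAGE files (placeholder path)

start = time.perf_counter()
subprocess.run(["ketos", "compile", "-f", "xml", "-o", "dataset.arrow", *xml_files],
               check=True)
print(f"compiled {len(xml_files)} files in {time.perf_counter() - start:.1f} s")
```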

cmroughan commented Nov 27, 2024

First round of tests is done, measuring five epochs of segmentation and transcription training on differently sized batches of training data, with all images normalized to approximately 612.7 KB. Training data batch sizes were 10, 20, 50, 100, 200, and 500 images. Tests were run with one GPU and 8 CPUs with 6 GB of memory each. Full details are in the GSheet; summary below:
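
For clarity, the "time increase / img increase" column in the tables below compares growth in train duration against growth in image count, relative to the 10-image baseline; 1.00 means time scales exactly with image count, and lower values mean sub-linear scaling. For example:

```python
# How the "time increase / img increase" column is derived: growth in train duration
# relative to the 10-image baseline, divided by growth in image count.
def scaling_ratio(duration_s, baseline_s, n_images, baseline_images=10):
    return (duration_s / baseline_s) / (n_images / baseline_images)

# Segmentation "from scratch", 20 images: 0:03:13 vs. the 10-image 0:02:07 baseline.
print(round(scaling_ratio(193, 127, 20), 2))  # 0.76
```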

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:07 | 1.00 | 13.7 GB | 4.4 GiB |
| 20 | 0:03:13 | 0.76 | 17.1 GB | 4.4 GiB |
| 50 | 0:04:37 | 0.44 | 23.8 GB | 5.2 GiB |
| 100 | 0:09:33 | 0.45 | 38.4 GB | 6.9 GiB |
| 200 | 0:15:17 | 0.36 | 35.6 GB | 7.4 GiB |
| 500 | 0:37:31 | 0.35 | 43.1 GB | 8.3 GiB |

Training by refining on existing model:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:01:47 | 1.00 | 10.7 GB | 4.4 GiB |
| 20 | 0:02:18 | 0.64 | 15.8 GB | 4.4 GiB |
| 50 | 0:04:21 | 0.49 | 20.8 GB | 5.1 GiB |
| 100 | 0:10:32 | 0.59 | 35.4 GB | 6.6 GiB |
| 200 | 0:14:54 | 0.42 | 36.0 GB | 7.4 GiB |
| 500 | 0:38:12 | 0.43 | 49.3 GB | 7.7 GiB |

Transcription

Training from scratch, straight XML:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:26 | 1.00 | 4.1 GB | 1.9 GiB |
| 20 | 0:03:44 | 0.77 | 4.5 GB | 2.0 GiB |
| 50 | 0:08:30 | 0.70 | 5.7 GB | 2.2 GiB |
| 100 | 0:18:30 | 0.76 | 6.0 GB | 2.2 GiB |
| 200 | 0:37:47 | 0.78 | 6.6 GB | 2.1 GiB |
| 500 | 1:37:50 | 0.80 | 8.0 GB | 2.3 GiB |

Training by refining on existing model, straight XML:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:43 | 1.00 | 4.7 GB | 1.5 GiB |
| 20 | 0:04:11 | 0.77 | 5.5 GB | 1.5 GiB |
| 50 | 0:08:41 | 0.64 | 5.7 GB | 1.5 GiB |
| 100 | 0:18:07 | 0.67 | 6.4 GB | 1.6 GiB |
| 200 | 0:36:34 | 0.67 | 6.7 GB | 1.6 GiB |
| 500 | interrupted (to be rerun) | | | |

Training from scratch, binary:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:02:46 | 1.00 | 3.9 GB | 2.3 GiB |
| 20 | 0:03:28 | 0.63 | 4.6 GB | 2.3 GiB |
| 50 | 0:07:27 | 0.54 | 5.0 GB | 2.3 GiB |
| 100 | 0:16:02 | 0.58 | 5.9 GB | 2.8 GiB |
| 200 | 0:32:53 | 0.59 | 6.8 GB | 2.8 GiB |
| 500 | 1:23:17 | 0.60 | 7.8 GB | 2.5 GiB |

Training by refining on existing model, binary:

| image count | train duration | time increase / img increase | max CPU memory | max GPU memory |
| --- | --- | --- | --- | --- |
| 10 | 0:03:05 | 1.00 | 5.7 GB | 1.5 GiB |
| 20 | 0:03:20 | 0.54 | 6.3 GB | 1.5 GiB |
| 50 | 0:07:26 | 0.48 | 7.0 GB | 1.5 GiB |
| 100 | 0:15:12 | 0.49 | 8.0 GB | 1.6 GiB |
| 200 | 0:29:45 | 0.48 | 8.9 GB | 1.6 GiB |
| 500 | 1:15:44 | 0.49 | 8.8 GB | 1.6 GiB |

Future rounds of testing will examine the impact of different file sizes and of different worker counts.
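
Since the end goal is filling in slurm --time estimates automatically, here is a sketch of how these measurements could feed a duration estimate. A linear fit is an assumption, and it only covers the 5-epoch, fixed-image-size setup tested above.

```python
# Fit duration ~ a + b * n_images to the segmentation fine-tuning (blla.mlmodel)
# rows above and use it to estimate walltime for other dataset sizes.
import numpy as np

n_images = np.array([10, 20, 50, 100, 200, 500])
duration_s = np.array([127, 193, 277, 573, 917, 2251])  # 0:02:07 ... 0:37:31

slope, intercept = np.polyfit(n_images, duration_s, deg=1)
print(f"~{slope:.1f} s/image + {intercept:.0f} s overhead")
print(f"estimated duration for 300 images: {intercept + slope * 300:.0f} s")
```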

cmroughan commented:

Another round of tests examined what impact different values of the --workers parameter have on resource usage and train time. Full details are again in the GSheet (see the rows where the worker count varies).

Tests were run by changing the --workers parameter in the kraken train command, setting slurm --cpus-per-task to match that value, and adjusting slurm --mem-per-cpu so that the total memory across all CPUs stayed approximately the same; see the small sketch below.
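
The bookkeeping, roughly (the 48 GB total matches the 8 × 6 GB configuration from the first round):

```python
# Keep the total memory budget fixed and derive --mem-per-cpu from the worker count
# (48 GB total here matches the 8 x 6 GB configuration used in the first round).
import math

def mem_per_cpu_gb(total_mem_gb: int, workers: int) -> int:
    return math.ceil(total_mem_gb / workers)

for workers in (4, 8, 16, 24):
    print(f"--cpus-per-task={workers} --mem-per-cpu={mem_per_cpu_gb(48, workers)}G")
```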

Varying the worker count seems to have no appreciable impact on transcription training, but it does have an impact on segmentation training.

Transcription

Training from scratch, straight XML:

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:17:09 |
| 100 | 8 | 2 GB | 0:16:27 |
| 100 | 16 | 1 GB | 0:16:27 |

Training from scratch, binary:

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:16:16 |
| 100 | 8 | 2 GB | 0:16:35 |
| 100 | 16 | 1 GB | 0:16:39 |

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | train duration |
| --- | --- | --- | --- |
| 100 | 4 | 12 GB | 0:13:51 |
| 100 | 8 | 6 GB | 0:08:49 |
| 100 | 16 | 3 GB | 0:06:01 |
| 100 | 24 | 2 GB | OOM killed* |
| 100 | 24 | 3 GB | 0:05:08 |

* The 24-worker / 2 GB entry above never completed an epoch: it hit an OOM kill before training began. There appears to be a lower limit on memory per worker required for the job to succeed, depending on the size of the input training data.
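
For a quick sense of the scaling, speedups of the 100-image runs relative to the 4-worker baseline, computed from the durations in the table above (note the 24-worker run used 3 GB per worker, so its total memory was higher):

```python
# Speedup of the 100-image segmentation runs relative to the 4-worker baseline.
durations_s = {4: 13 * 60 + 51, 8: 8 * 60 + 49, 16: 6 * 60 + 1, 24: 5 * 60 + 8}

baseline = durations_s[4]
for workers, secs in durations_s.items():
    print(f"{workers:>2} workers: speedup x{baseline / secs:.2f}")
```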

Relatedly, for 500 images, training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | average epoch |
| --- | --- | --- | --- |
| 500 | 4 | 12 GB | 724.0 s |
| 500 | 8 | 6 GB | 491.4 s |
| 500 | 16 | 3 GB | OOM killed |
| 500 | 16 | 4 GB | OOM killed |

Review of the kraken code might help illuminate what minimums need to be met to avoid the OOM killed error.

rlskoeser commented:

@cmroughan this is great. Could you also test increasing workers without decreasing memory?

When I was fighting the OOM errors due to the input data problem, I found some comments on GitHub issues about that and vaguely recall seeing 3 GB as a minimum for segmentation. I may be able to find them again if that would be helpful.

cmroughan commented Dec 5, 2024

> Could you also test increasing workers without decreasing memory?

Ran the tests: increasing workers without decreasing memory does not produce tangible benefits, as can be seen in the comparisons below. The job does not end up using the extra resources, which makes for an overly expensive slurm request and, if done too often, can lead to future jobs being ranked lower in the queue.

Transcription

Training from scratch, straight XML:

| image count | worker count | mem per CPU | train duration | CPU memory used / requested | memory efficiency |
| --- | --- | --- | --- | --- | --- |
| 100 | 4 | 4 GB | 0:17:09 | 4.2 / 16.0 GB | 26.25% |
| 100 | 8 | 4 GB | 0:17:13 | 5.3 / 32.0 GB | 16.56% |
| 100 | 8 | 2 GB | 0:16:27 | 5.0 / 16.0 GB | 31.25% |

Segmentation

Training "from scratch" (refining on blla.mlmodel):

| image count | worker count | mem per CPU | train duration | CPU memory used / requested | memory efficiency |
| --- | --- | --- | --- | --- | --- |
| 100 | 4 | 12 GB | 0:13:51 | 19.7 / 48.0 GB | 41.04% |
| 100 | 8 | 12 GB | 0:09:03 | 25.1 / 96.0 GB | 26.15% |
| 100 | 8 | 6 GB | 0:08:49 | 26.7 / 48.0 GB | 55.63% |

cmroughan commented Dec 5, 2024

Ran further tests to track how the complexity of the training data impacts resource usage, starting with transcription training tasks. For transcription, line count is a more relevant metric than page count (which makes sense, considering the lines are the input training data). Line length, tracked here as character count, also has an impact.

Tracking line counts

Training from scratch, straight XML, same CPU counts + memory:

| image count | line count | avg epoch | full duration |
| --- | --- | --- | --- |
| 10 | 524 | 14.4 s | 0:02:24 |
| 10 | 261 | 6.6 s | 0:02:30 |
| 5 | 269 | 7.4 s | 0:02:13 |
| 100 | 5767 | 161.2 s | 0:15:26 |
| 100 | 2870 | 83.6 s | 0:08:16 |
| 47 | 2845 | 84.4 s | 0:08:17 |
| 500 | 29520 | 937.8 s | 1:22:25 |
| 500 | 14649 | 464.2 s | 0:41:39 |
| 241 | 14563 | 475.8 s | 0:42:37 |
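
A quick per-line check on the table above supports reading line count as the main driver: epoch time divided by line count stays in a fairly narrow band (values taken from the rows above).

```python
# Per-line epoch cost from the rows above: roughly constant, so line count (not
# page count) is the better predictor of transcription training time.
rows = [  # (image count, line count, avg epoch seconds)
    (10, 524, 14.4), (10, 261, 6.6), (5, 269, 7.4),
    (100, 5767, 161.2), (100, 2870, 83.6), (47, 2845, 84.4),
    (500, 29520, 937.8), (500, 14649, 464.2), (241, 14563, 475.8),
]

for images, lines, epoch_s in rows:
    print(f"{lines:>5} lines: {epoch_s / lines * 1000:.1f} ms/line ({images} images)")
```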

Tracking line lengths:

| image count | line count | line length | avg epoch |
| --- | --- | --- | --- |
| 500 | 7538 | < 11 chars | 113.4 s |
| 500 | 7537 | > 67 chars | 361.8 s |

In addition to line counts, I'll have to go back and extract the average line lengths across the datasets that have been tested.
