Review performance issues when writing shards #3

Closed
sbesson opened this issue Aug 26, 2024 · 3 comments

@sbesson
Member

sbesson commented Aug 26, 2024

See #2 (comment)

Enabling sharding was found to increase the Zarr v2->v3 conversion time by up to a factor of ten. For the same dataset, the conversion time depends on the specified shard size and increases with the number of chunks per shard.

Sharding was always expected to slow conversion, notably because chunks must be written in a specific order and because of the overhead of writing the sharding header. Even so, the timings reported above feel unreasonable and probably require some investigation.

@melissalinkert
Member

Note too that the build time approximately triples with #2 merged (https://github.com/glencoesoftware/zarr2zarr/actions/runs/10542609120 vs https://github.com/glencoesoftware/zarr2zarr/actions/runs/10562307818) due to the additional shard writing tests.

Using artificial data constructed similarly to what was used in #2 (comment), but with fewer planes so we can test in less than a day:

$ time bin/bioformats2raw -p "test&sizeX=2947&sizeY=5192&sizeZ=70&sizeC=3.fake" ~/perf-test.zarr
[0/0] 100% │█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 3780/3780 (0:01:17 / 0:00:00) 
[0/1] 100% │█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 1260/1260 (0:00:30 / 0:00:00) 
[0/2] 100% │███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 420/420 (0:00:08 / 0:00:00) 
[0/3] 100% │███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 210/210 (0:00:01 / 0:00:00) 
[0/4] 100% │███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 210/210 (0:00:00 / 0:00:00) 
[0/5] 100% │███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 210/210 (0:00:00 / 0:00:00) 

real	2m2.539s
user	1m58.269s
sys	0m1.087s

and then converting to v3 with the defaults vs. the worst-case shard size:

$ time bin/zarr2zarr ~/perf-test.zarr/ perf-default.zarr
15:15:52.538 [main] INFO com.glencoesoftware.zarr.Convert -- opened /home/melissa/perf-test.zarr/0
15:15:52.553 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
15:15:53.015 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
15:15:53.095 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/0
15:16:23.786 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/1
15:16:33.312 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/2
15:16:35.456 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/3
15:16:35.777 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/4
15:16:35.983 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/5

real	0m45.507s
user	0m30.566s
sys	0m6.780s
$ time bin/zarr2zarr ~/perf-test.zarr/ perf-big-shard.zarr --shard 1,1,4,4096,4096
15:17:11.972 [main] INFO com.glencoesoftware.zarr.Convert -- opened /home/melissa/perf-test.zarr/0
15:17:11.989 [main] INFO com.glencoesoftware.zarr.Convert -- got 2 series attributes
15:17:12.625 [main] INFO com.glencoesoftware.zarr.Convert -- found 6 resolutions
15:17:12.716 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/0
...
15:57:25.985 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/1
16:08:48.285 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/2
16:08:48.293 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
16:08:50.003 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/3
16:08:50.003 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
16:08:50.316 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/4
16:08:50.317 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes
16:08:50.535 [main] INFO com.glencoesoftware.zarr.Convert -- opened array /home/melissa/perf-test.zarr/0/5
16:08:50.535 [main] WARN com.glencoesoftware.zarr.Convert -- Skipping sharding due to incompatible sizes

real	51m44.941s
user	46m45.110s
sys	3m9.911s

can definitely confirm that's not good. Taking a few intermediate stack traces, I see a lot of:

"main" #1 prio=5 os_prio=0 cpu=601084.25ms elapsed=863.64s tid=0x00007ff654018800 nid=0x1f59 runnable  [0x00007ff65e975000]
   java.lang.Thread.State: RUNNABLE
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.lambda$decodeInternal$3(ShardingIndexedCodec.java:230)
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec$$Lambda$96/0x0000000800230440.accept(Unknown Source)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining([email protected]/Spliterators.java:948)
	at java.util.stream.ReferencePipeline$Head.forEach([email protected]/ReferencePipeline.java:658)
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodeInternal(ShardingIndexedCodec.java:207)
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decode(ShardingIndexedCodec.java:98)
	at dev.zarr.zarrjava.v3.codec.CodecPipeline.decode(CodecPipeline.java:120)
	at dev.zarr.zarrjava.v3.Array.readChunk(Array.java:225)
	at dev.zarr.zarrjava.v3.Array.lambda$write$1(Array.java:274)
	at dev.zarr.zarrjava.v3.Array$$Lambda$80/0x000000080020f440.accept(Unknown Source)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining([email protected]/Spliterators.java:948)
	at java.util.stream.ReferencePipeline$Head.forEach([email protected]/ReferencePipeline.java:658)
	at dev.zarr.zarrjava.v3.Array.write(Array.java:257)
	at com.glencoesoftware.zarr.Convert.convertToV3(Convert.java:405)
	at com.glencoesoftware.zarr.Convert.call(Convert.java:173)
	at com.glencoesoftware.zarr.Convert.call(Convert.java:58)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
	at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2264)
	at picocli.CommandLine.parseWithHandlers(CommandLine.java:2664)
	at picocli.CommandLine.parseWithHandler(CommandLine.java:2599)
	at picocli.CommandLine.call(CommandLine.java:2875)
	at com.glencoesoftware.zarr.Convert.main(Convert.java:636)

which suggests that a lot of time is being spent reading the partially-written shards, so that's a place to continue investigating.
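
As a rough illustration of why this could get so much worse with larger shards, here is a back-of-the-envelope sketch. It is not zarr2zarr or zarr-java code: `ShardWriteCost` and its methods are hypothetical names, the 1024x1024 chunk size is inferred from the 3780 tiles reported for resolution 0, and the cost model (each chunk written into a shard first decodes the chunks already stored there) is an assumption based on the `Array.write` -> `readChunk` -> `ShardingIndexedCodec.decode` frames in the stack trace.

```java
// Back-of-the-envelope model; a hypothetical helper, not zarr2zarr or zarr-java code.
final class ShardWriteCost {

    // Chunks needed to cover one resolution level (ceiling division in X and Y).
    static long chunkCount(int sizeX, int sizeY, int sizeZ, int sizeC, int chunkXY) {
        long nx = (sizeX + chunkXY - 1) / chunkXY;
        long ny = (sizeY + chunkXY - 1) / chunkXY;
        return nx * ny * sizeZ * sizeC;
    }

    // Chunks held by one full shard; each shard dimension must be a multiple
    // of the matching chunk dimension.
    static long chunksPerShard(int[] shard, int[] chunk) {
        if (shard.length != chunk.length) {
            throw new IllegalArgumentException("rank mismatch");
        }
        long n = 1;
        for (int d = 0; d < shard.length; d++) {
            if (shard[d] % chunk[d] != 0) {
                throw new IllegalArgumentException("shard not a multiple of chunk in dimension " + d);
            }
            n *= shard[d] / chunk[d];
        }
        return n;
    }

    // Assumed model: the k-th chunk written into a shard (k = 0..n-1) first
    // decodes the k chunks already stored there, giving n(n-1)/2 decodes in total.
    static long decodesChunkByChunk(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // 3780, matching the [0/0] progress bar above
        System.out.println(chunkCount(2947, 5192, 70, 3, 1024));

        // Shard shape from --shard 1,1,4,4096,4096; chunk shape is the inferred 1,1,1,1024,1024
        long n = chunksPerShard(new int[] {1, 1, 4, 4096, 4096}, new int[] {1, 1, 1, 1024, 1024});
        System.out.println(n);                      // 64
        System.out.println(decodesChunkByChunk(n)); // 2016
    }
}
```

Under these assumptions a full shard costs 2016 extra chunk decodes when filled one chunk at a time, and none when the whole shard is written in a single call. The quadratic growth in chunks per shard would be consistent with conversion time increasing with shard size, though the real cost depends on how zarr-java handles partial shards.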

melissalinkert added a commit to melissalinkert/zarr2zarr that referenced this issue Aug 27, 2024
See glencoesoftware#3. This dramatically reduces conversion time when sharding is used.
@sbesson
Member Author

sbesson commented Aug 30, 2024

With #6 merged, should this be closed or are there additional investigations we want to make?

@melissalinkert
Member

I don't think there is anything else to investigate here.
