[BUG] Unable to format CSI volumes on Nomad using juicefs-csi-driver >= v0.23.3 #1116

Open
kmott opened this issue Sep 17, 2024 · 9 comments
Labels
kind/bug Something isn't working

Comments


kmott commented Sep 17, 2024

What happened:

I deployed the latest version of the CSI image for Controller and Node (v0.24.7), and created a volume. That all worked fine.

However, when I then deployed a Nomad Job that mounts that previously created CSI volume, the Node job threw a series of errors:

I0917 01:41:15.835499       8 main.go:121] Run CSI node
I0917 01:41:15.836540       8 driver.go:50] Driver: csi.juicefs.com version v0.24.7-dirty commit ebd4ee6686fab6ec9655b39362a5363454058d3d date 2024-09-04T06:00:07Z
I0917 01:41:16.019738       8 driver.go:115] Listening for connection on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0917 01:41:29.375543       8 node.go:107] NodePublishVolume: volume_id is database-juicefs
I0917 01:41:29.375591       8 node.go:118] NodePublishVolume: volume_capability is mount:<fs_type:"ext4" mount_flags:"noatime" > access_mode:<mode:SINGLE_NODE_WRITER >
I0917 01:41:29.375713       8 node.go:124] NodePublishVolume: creating dir /local/csi/per-alloc/43574280-6cea-0311-6996-268521cdc791/database-juicefs/rw-file-system-single-node-writer
I0917 01:41:29.375930       8 node.go:139] NodePublishVolume: volume context: map[capacity:1000000000 subPath:database-juicefs]
I0917 01:41:29.375978       8 node.go:149] NodePublishVolume: mounting juicefs with secret [bucket metaurl name secret-key storage access-key], options [noatime]
W0917 01:41:29.376019       8 juicefs.go:352] Get PV with volumeID database-juicefs error: k8s client is nil
I0917 01:41:29.378418       8 juicefs.go:984] ceFormat cmd: [/usr/local/bin/juicefs format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --secret-key=${secretkey} ${metaurl} database-juicefs]
I0917 01:41:29.487111       8 juicefs.go:1004] Format output is 2024/09/17 01:41:29.480359 juicefs[19] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:41:29.480516 juicefs[19] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
I0917 01:41:29.487135       8 juicefs.go:1007] Format error: exit status 1
E0917 01:41:29.487206       8 driver.go:102] GRPC error: rpc error: code = Internal desc = Could not mount juicefs: juicefs format error: 2024/09/17 01:41:29.480359 juicefs[19] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:41:29.480516 juicefs[19] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
: exit status 1
I0917 01:41:30.189715       8 node.go:212] NodeUnpublishVolume: volume_id is database-juicefs
I0917 01:41:30.190200       8 process_mount.go:257] ProcessUmount: /local/csi/per-alloc/43574280-6cea-0311-6996-268521cdc791/database-juicefs/rw-file-system-single-node-writer target not mounted
I0917 01:41:30.190563       8 process_mount.go:302] ProcessUmount: /local/csi/per-alloc/43574280-6cea-0311-6996-268521cdc791/database-juicefs/rw-file-system-single-node-writer target not mounted
I0917 01:42:01.284579       8 node.go:107] NodePublishVolume: volume_id is database-juicefs
I0917 01:42:01.284605       8 node.go:118] NodePublishVolume: volume_capability is mount:<fs_type:"ext4" mount_flags:"noatime" > access_mode:<mode:SINGLE_NODE_WRITER >
I0917 01:42:01.284655       8 node.go:124] NodePublishVolume: creating dir /local/csi/per-alloc/b0f64a60-ff22-a829-72f0-6c8025507b85/database-juicefs/rw-file-system-single-node-writer
I0917 01:42:01.284917       8 node.go:139] NodePublishVolume: volume context: map[capacity:1000000000 subPath:database-juicefs]
I0917 01:42:01.284938       8 node.go:149] NodePublishVolume: mounting juicefs with secret [bucket metaurl name secret-key storage access-key], options [noatime]
W0917 01:42:01.284952       8 juicefs.go:352] Get PV with volumeID database-juicefs error: k8s client is nil
I0917 01:42:01.285226       8 juicefs.go:984] ceFormat cmd: [/usr/local/bin/juicefs format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --secret-key=${secretkey} ${metaurl} database-juicefs]
I0917 01:42:01.385773       8 juicefs.go:1004] Format output is 2024/09/17 01:42:01.374316 juicefs[26] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:42:01.374730 juicefs[26] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
I0917 01:42:01.385796       8 juicefs.go:1007] Format error: exit status 1
E0917 01:42:01.385854       8 driver.go:102] GRPC error: rpc error: code = Internal desc = Could not mount juicefs: juicefs format error: 2024/09/17 01:42:01.374316 juicefs[26] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:42:01.374730 juicefs[26] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
: exit status 1
I0917 01:42:02.129180       8 node.go:212] NodeUnpublishVolume: volume_id is database-juicefs
I0917 01:42:02.129530       8 process_mount.go:257] ProcessUmount: /local/csi/per-alloc/b0f64a60-ff22-a829-72f0-6c8025507b85/database-juicefs/rw-file-system-single-node-writer target not mounted
I0917 01:42:02.129817       8 process_mount.go:302] ProcessUmount: /local/csi/per-alloc/b0f64a60-ff22-a829-72f0-6c8025507b85/database-juicefs/rw-file-system-single-node-writer target not mounted
I0917 01:42:32.952950       8 node.go:107] NodePublishVolume: volume_id is database-juicefs
I0917 01:42:32.952970       8 node.go:118] NodePublishVolume: volume_capability is mount:<fs_type:"ext4" mount_flags:"noatime" > access_mode:<mode:SINGLE_NODE_WRITER >
I0917 01:42:32.953003       8 node.go:124] NodePublishVolume: creating dir /local/csi/per-alloc/f9ffab7c-602d-7a5b-2232-c34b3f368ed2/database-juicefs/rw-file-system-single-node-writer
I0917 01:42:32.953170       8 node.go:139] NodePublishVolume: volume context: map[capacity:1000000000 subPath:database-juicefs]
I0917 01:42:32.953189       8 node.go:149] NodePublishVolume: mounting juicefs with secret [metaurl name secret-key storage access-key bucket], options [noatime]
W0917 01:42:32.953208       8 juicefs.go:352] Get PV with volumeID database-juicefs error: k8s client is nil
I0917 01:42:32.953441       8 juicefs.go:984] ceFormat cmd: [/usr/local/bin/juicefs format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --secret-key=${secretkey} ${metaurl} database-juicefs]
I0917 01:42:33.070526       8 juicefs.go:1004] Format output is 2024/09/17 01:42:33.065408 juicefs[32] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:42:33.065661 juicefs[32] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
I0917 01:42:33.070557       8 juicefs.go:1007] Format error: exit status 1
E0917 01:42:33.070598       8 driver.go:102] GRPC error: rpc error: code = Internal desc = Could not mount juicefs: juicefs format error: 2024/09/17 01:42:33.065408 juicefs[32] <INFO>: Meta address: $'etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' [interface.go:504]
2024/09/17 01:42:33.065661 juicefs[32] <FATAL>: Invalid meta driver: $'etcd [interface.go:507]
: exit status 1
I0917 01:42:33.912007       8 node.go:212] NodeUnpublishVolume: volume_id is database-juicefs
I0917 01:42:33.912299       8 process_mount.go:257] ProcessUmount: /local/csi/per-alloc/f9ffab7c-602d-7a5b-2232-c34b3f368ed2/database-juicefs/rw-file-system-single-node-writer target not mounted
I0917 01:42:33.912497       8 process_mount.go:302] ProcessUmount: /local/csi/per-alloc/f9ffab7c-602d-7a5b-2232-c34b3f368ed2/database-juicefs/rw-file-system-single-node-writer target not mounted

What you expected to happen:

Formatting and mounting the volume through Nomad CSI should succeed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?

Environment:

  • JuiceFS CSI Driver version (which image tag did your CSI Driver use):
    • Anything >= v0.23.3 (<= v0.23.2 works okay)
  • Kubernetes version (e.g. kubectl version):
    • N/A, using Nomad (v1.6.3)
  • Object storage (cloud provider and region):
    • Minio locally
  • Metadata engine info (version, cloud provider managed or self maintained):
    • ETCD locally
  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage):
    • localhost

Additional Information

I dug around a little bit, and I think v0.23.3 includes a fix for #843 that introduced a regression.

Specifically, there is a lot going on in the ceFormat func; however, I think around L965 the text ${metaurl} is getting escaped to $'<whatever your metaurl text is>', which obviously won't work, as reported by L998 just a bit later, after the command is invoked.

I would take a stab at submitting a PR, however, there's some stuff going on that I'm not entirely sure about--namely why cmdArgs + args are declared (L934 + L935) and filled with identical information, but cmdArgs is never actually used anywhere except logging statements (from what I can tell)--maybe it's a different codepath for ce vs ee versions? The actual invocation of the cmd happens on L988 using args only.

At any rate, I'm happy to help submit a PR, if it would be useful. Thank you!
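The failure mode described above can be reproduced outside the driver. The following is a minimal sketch, not the driver's code: `escape_bash_str` is a hypothetical stand-in for the escaping helper, the host names are placeholders, and reading the scheme as "everything before `://`" is an assumption based on the `Invalid meta driver: $'etcd` message in the log. It shows that when an argument wrapped in bash ANSI-C quoting is passed straight to a child process without a shell, the wrapper is never stripped.

```python
import subprocess
import sys


def escape_bash_str(s: str) -> str:
    """Hypothetical stand-in for the driver's escaping helper: bash ANSI-C quoting."""
    escaped = s.replace("\\", "\\\\").replace("'", "\\'")
    return "$'" + escaped + "'"


def meta_driver(uri: str) -> str:
    """Return whatever precedes '://' (assumed to be how the scheme is read)."""
    return uri.split("://", 1)[0]


# Placeholder host names; only the shape of the URL matters here.
META_URL = (
    "etcd://root:dead-beef@node1.example.org:2379/database-juicefs"
    "?insecure-skip-verify=1&server-name=juicefs.example.org"
)

quoted = escape_bash_str(META_URL)

# Exec-style invocation: no shell runs, so nothing strips the $'...' wrapper.
child = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", quoted],
    capture_output=True,
    text=True,
    check=True,
)
received = child.stdout.strip()

print(meta_driver(META_URL))  # scheme of the raw URL
print(meta_driver(received))  # scheme as seen by the child process
```

The child receives the quoted string unchanged, so the apparent scheme becomes `$'etcd` rather than `etcd`. Shell quoting of this kind only helps when the command line is actually interpreted by a shell.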

@kmott kmott added the kind/bug Something isn't working label Sep 17, 2024
Member

zxh326 commented Sep 18, 2024

try URL Encoding for metaurl

Author

kmott commented Sep 18, 2024

try URL Encoding for metaurl

Hi @zxh326, are you saying the security.EscapeBashStr call should use URL encoding for metaurl? Or that I should pre-escape my metaurl in my spec?

Member

zxh326 commented Sep 19, 2024

pre-escape metaurl in secret spec

If the password in metaurl contains special characters, they need to be URL-encoded; for example, | needs to be replaced with %7C.

Author

kmott commented Sep 19, 2024

pre-escape metaurl in secret spec

If the password in metaurl contains special characters, they need to be URL-encoded; for example, | needs to be replaced with %7C.

Hi, I am still not following, sorry. The metaurl I am using is:

etcd://root:dead-beef@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org

Which has a password of dead-beef. If I pass that thru rawurlencode, it doesn't change, so I am assuming it doesn't need further escaping in this case?
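The encoding question can be checked quickly. This is a small sketch using Python's `urllib.parse.quote` as a stand-in for `rawurlencode`; the helper name `encode_userinfo` and the sample password `pa|ss` are made up for illustration.

```python
from urllib.parse import quote


def encode_userinfo(value: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    return quote(value, safe="")


print(encode_userinfo("pa|ss"))      # pa%7Css
print(encode_userinfo("dead-beef"))  # dead-beef
```

The hyphen is an unreserved character, so `dead-beef` passes through unchanged, while `|` becomes `%7C`.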

Member

zxh326 commented Sep 19, 2024

Which has a password of dead-beef. If I pass that thru rawurlencode, it doesn't change

yes!

could you try changing insecure-skip-verify=1&server-name to insecure-skip-verify=1%26server-name

Author

kmott commented Sep 19, 2024

could you try changing insecure-skip-verify=1&server-name to insecure-skip-verify=1%26server-name

Okay, I gave that a try, and while it doesn't seem to error out on the formatting of metaurl, it does not succeed in formatting the actual volume:

I0919 17:00:32.441384       7 node.go:107] NodePublishVolume: volume_id is database-juicefs
I0919 17:00:32.441416       7 node.go:118] NodePublishVolume: volume_capability is mount:<fs_type:"ext4" mount_flags:"noatime" > access_mode:<mode:MULTI_NODE_MULTI_WRITER >
I0919 17:00:32.441483       7 node.go:124] NodePublishVolume: creating dir /local/csi/per-alloc/d98ff0e3-09d8-bb47-388a-695fd8be4afb/database-juicefs/rw-file-system-multi-node-multi-writer
I0919 17:00:32.441627       7 node.go:139] NodePublishVolume: volume context: map[capacity:0 subPath:database-juicefs]
I0919 17:00:32.441656       7 node.go:149] NodePublishVolume: mounting juicefs with secret [name secret-key storage trash-days access-key bucket capacity metaurl], options [noatime]
W0919 17:00:32.441700       7 juicefs.go:352] Get PV with volumeID database-juicefs error: k8s client is nil
I0919 17:00:32.446697       7 juicefs.go:984] ceFormat cmd: [/usr/local/bin/juicefs format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --trash-days=0 --capacity=1 --secret-key=${secretkey} ${metaurl} database-juicefs]
I0919 17:00:48.458570       7 juicefs.go:1004] Format output is 2024/09/19 17:00:32.544408 juicefs[20] <INFO>: Meta address: etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/01j80mr2r201qkeyzkesbc2s0w?insecure-skip-verify=1%26server-name=juicefs.kitchen.example.org [interface.go:504]
I0919 17:00:48.458655       7 juicefs.go:1007] Format error: signal: killed
E0919 17:00:48.458902       7 driver.go:102] GRPC error: rpc error: code = Internal desc = Could not mount juicefs: juicefs format error: juicefs format 16s timed out
I0919 17:00:49.794729       7 node.go:212] NodeUnpublishVolume: volume_id is database-juicefs
I0919 17:00:49.795228       7 process_mount.go:257] ProcessUmount: /local/csi/per-alloc/d98ff0e3-09d8-bb47-388a-695fd8be4afb/database-juicefs/rw-file-system-multi-node-multi-writer target not mounted
I0919 17:00:49.795601       7 process_mount.go:302] ProcessUmount: /local/csi/per-alloc/d98ff0e3-09d8-bb47-388a-695fd8be4afb/database-juicefs/rw-file-system-multi-node-multi-writer target not mounted

Member

zxh326 commented Sep 20, 2024

can you check *.kitchen.example.org is reachable in the container?

Member

zxh326 commented Sep 20, 2024

I noticed that your other PR uses the same metaurl, but there it can be mounted normally.

Author

kmott commented Sep 20, 2024

can you check *.kitchen.example.org is reachable in the container?

Yes, this is accessible from the container (I have a custom juicefs-csi-driver image I run that bypasses the escaping, and the volume can get formatted and mounted just fine).

I ran the command manually after execing into the juicefs-node alloc, and got this output (with the pre-escaped & in the metaurl)--note that it stayed there for a long time (~30 mins) before I manually did ctrl+c.

Let me know if you need anything else.

Failure

root@nomad-n3-debian12:/app# juicefs -v --trace format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --trash-days=0 --secret-key=dead-beef 'etcd://root:dead-beef@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1%26server-name=juicefs.kitchen.example.org' database-juicefs
2024/09/20 19:07:40.270276 juicefs[59] <DEBUG>: maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined [maxprocs.go:47]
2024/09/20 19:07:40.270482 juicefs[59] <INFO>: Meta address: etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1%26server-name=juicefs.kitchen.example.org [interface.go:504]
2024/09/20 19:07:40.270612 juicefs[59] <DEBUG>: Debug agent listening on 127.0.0.1:6060 [main.go:321]
2024/09/20 19:07:40.272739 juicefs[59] <DEBUG>: Debug agent listening on 127.0.0.1:6061 [main.go:321]
^C

root@nomad-n3-debian12:/app# date --utc
Fri Sep 20 19:38:12 UTC 2024

Success

Using my custom image, if I re-run the command with the &, it works just fine:

root@nomad-n3-debian12:/app# juicefs -v --trace format --storage=minio --bucket=http://minio.nomad.kitchen.example.org:9000/kitchen --access-key=administrator --trash-days=0 --secret-key=dead-beef 'etcd://root:dead-beef@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org' database-juicefs
2024/09/20 19:40:36.385106 juicefs[68] <DEBUG>: maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined [maxprocs.go:47]
2024/09/20 19:40:36.385448 juicefs[68] <INFO>: Meta address: etcd://root:****@node1.nomad.kitchen.example.org:2379,node2.nomad.kitchen.example.org:2379,node3.nomad.kitchen.example.org:2379/database-juicefs?insecure-skip-verify=1&server-name=juicefs.kitchen.example.org [interface.go:504]
2024/09/20 19:40:36.385492 juicefs[68] <DEBUG>: Debug agent listening on 127.0.0.1:6060 [main.go:321]
2024/09/20 19:40:36.386385 juicefs[68] <DEBUG>: Debug agent listening on 127.0.0.1:6061 [main.go:321]
2024/09/20 19:40:36.556775 juicefs[68] <DEBUG>: Creating minio storage at endpoint http://minio.nomad.kitchen.example.org:9000/kitchen [object_storage.go:167]
2024/09/20 19:40:36.557066 juicefs[68] <INFO>: Data use minio://minio.nomad.kitchen.example.org:9000/kitchen/database-juicefs/ [format.go:484]
2024/09/20 19:40:36.687929 juicefs[68] <DEBUG>: txn with 0 conds and 1 ops took 22.522412ms [tkv_etcd.go:191]
2024/09/20 19:40:36.688121 juicefs[68] <INFO>: Volume is formatted as {
  "Name": "database-juicefs",
  "UUID": "bcd747a6-339a-484a-8eee-c44922cae96d",
  "Storage": "minio",
  "Bucket": "http://minio.nomad.kitchen.example.org:9000/kitchen",
  "AccessKey": "administrator",
  "SecretKey": "removed",
  "BlockSize": 4096,
  "Compression": "none",
  "EncryptAlgo": "aes256gcm-rsa",
  "KeyEncrypted": true,
  "TrashDays": 0,
  "MetaVersion": 1,
  "MinClientVersion": "1.1.0-A",
  "DirStats": true,
  "EnableACL": false
} [format.go:521]
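One possible reading of the difference between the two runs above, not confirmed anywhere in this thread, is that replacing `&` with `%26` changes how the query string is split. The sketch below uses Python's `parse_qs` as a stand-in; whether juicefs parses the query the same way is an assumption, and the host names are placeholders.

```python
from urllib.parse import parse_qs


def query_params(meta_url: str) -> dict:
    """Split the part after '?' into a name -> list-of-values mapping."""
    _, _, query = meta_url.partition("?")
    return parse_qs(query)


# Placeholder host name; only the query string matters here.
BASE = "etcd://root:****@node1.example.org:2379/database-juicefs"

plain = query_params(BASE + "?insecure-skip-verify=1&server-name=juicefs.example.org")
encoded = query_params(BASE + "?insecure-skip-verify=1%26server-name=juicefs.example.org")

print(sorted(plain))    # two separate parameters
print(sorted(encoded))  # a single parameter whose value swallowed the rest
```

With a literal `&` the parser yields both `insecure-skip-verify` and `server-name`. With `%26` there is only `insecure-skip-verify`, whose value is `1&server-name=juicefs.example.org`, so under this assumption `server-name` would never be set.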
