
Error: error updating VM: received an HTTP 500 response - Reason: can't lock file '/var/lock/qemu-server/lock-101.conf' - got timeout #995

Closed
CultureLinux opened this issue Jan 31, 2024 · 6 comments
Labels
🐛 bug Something isn't working

Comments

@CultureLinux

Hello,

First of all, thanks for your amazing provider 👍

Describe the bug
When cloning multiple VMs, the apply fails with an HTTP 500 error while acquiring the VM lock, even with all timeout_* settings set to 600.
Using the same main.tf always works when creating only one VM (< 20 sec).

Tested on the latest versions of OpenTofu, Terraform, Proxmox and the bpg provider.

To Reproduce
Steps to reproduce the behavior:

  1. Create a resource (see below)
  2. Run tofu init && tofu plan && tofu apply
  3. See the error
resource "proxmox_virtual_environment_vm" "debian_vm" {
  for_each = {
    for index,vm in var.all_vm_config:
    vm.name => vm
  }

  # Warning default timeout seems to be around 20sec
  timeout_clone = 600
  timeout_create = 600
  timeout_start_vm = 600
  timeout_shutdown_vm = 600
  timeout_stop_vm = 600
  timeout_reboot = 600

  name        = each.value.name
  description = "Managed by opentofu"
  tags        = ["opentofu", each.value.name]
  node_name = var.target_node
  clone {
    vm_id = each.value.vmid2clone
  }

  cpu {
    cores = each.value.cpu
    type = "host"
    numa = true
  }
  memory {
    dedicated = each.value.ram
  }
  network_device {
    bridge = "vmbr0"
    model = "virtio"
  }

  efi_disk {
    datastore_id = "ssd-front"
    file_format = "raw"
    type    = "4m"
  }

  disk {
    datastore_id = "ssd-front"
    file_format = "raw"
    interface = "scsi0"
    size = each.value.size
  }


  operating_system {
    type = "l26"
  }
  machine = "q35"
  agent {
    enabled = false
  }

  initialization {
    ip_config {
      ipv4 {
        #address = "192.168.1.180/24"
        address = format("%s%s%s","192.168.1.",180 + tonumber(each.value.idx), "/24")
        gateway = "192.168.1.1"
      }
    }
    user_account {
      keys     = [var.ssh_key]
      password = "tofu"
      username = each.value.name
    }
  }
}
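
For reference, the variable driving the for_each is shaped roughly like this (illustrative only; the field names match the references above, the values are made up):

variable "all_vm_config" {
  type = list(object({
    name       = string
    vmid2clone = number
    cpu        = number
    ram        = number
    size       = number
    idx        = number
  }))
  # Example entries, not the real ones
  default = [
    { name = "vm1", vmid2clone = 9000, cpu = 2, ram = 2048, size = 20, idx = 0 },
    { name = "vm2", vmid2clone = 9000, cpu = 2, ram = 2048, size = 20, idx = 1 }
  ]
}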

Expected behavior
The apply should time out after 600 seconds, not after ~20 seconds.

Screenshots
Screenshot from 2024-01-31 14-17-25

Additional context

  • Single or clustered Proxmox: single instance 8.1.4
  • Provider version (ideally it should be the latest version): bpg/proxmox v0.46.1 (signed, key ID DAA1958557A27403)
  • Terraform version: Terraform v1.7.1 on linux_amd64
  • OpenTofu version: OpenTofu v1.6.1 on linux_amd64
  • OS (where you run Terraform from): rocky9 amd64
  • Debug logs (TF_LOG=DEBUG terraform apply):
@CultureLinux added the 🐛 bug Something isn't working label on Jan 31, 2024
@Qarasique

Same behavior here. Maybe we could add a "sequential" option, or a parallelism setting? Especially for slow disk setups, where cloning a VM from a template can take ages.

@bpg
Owner

bpg commented Feb 2, 2024

@Qarasique Terraform and OpenTofu support a -parallelism CLI argument that controls how many resources are applied concurrently.
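
For example, to limit a run to two concurrent operations (the default is 10):

tofu apply -parallelism=2
# or, with Terraform
terraform apply -parallelism=2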

@CultureLinux Thanks for the report! The 20s timeout is suspicious, and I think other people have mentioned something like that in the past, though I can't find a reference. We should definitely take a look at how it is propagated.

The issue itself could also be specific to the underlying storage type. Other users had similar issues on ZFS-backed storage in #831 and #868, so there could be an I/O bottleneck on the PVE host, especially if the cloned image and the VMs are on the same physical drive.

@CultureLinux
Author

CultureLinux commented Feb 6, 2024

A small update on this issue.

I switched the storage destination from ssd-front (directory type, SSD) to lvm-lvm (LVM-Thin, HDD).

Screenshot from 2024-02-06 09-53-10
The timeout is not hit and the 2 VMs started successfully!

It seems really weird that the faster storage breaks and the slower one doesn't.
I'm exploring the storage section of the wiki, especially the preallocation option, which says:
"When using network storages in combination with large qcow2 images, using off can help to avoid timeouts."

https://pve.proxmox.com/wiki/Storage#_storage_configuration
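
If someone wants to try that, the option goes in the storage definition on the PVE host, e.g. in /etc/pve/storage.cfg (a sketch only, with a placeholder path):

dir: ssd-front
    path /mnt/ssd-front
    content images
    preallocation off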

This is still an open question in my mind, but tell me if you see the same behavior.

@CultureLinux
Author

A final update!
I just switched the disk interface from scsi to virtio and: 7 VMs in 1 minute and 2 seconds!
The hint was in #868.
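
In provider terms, that is roughly this change to the disk block (same values as in my config above):

disk {
  datastore_id = "ssd-front"
  file_format  = "raw"
  interface    = "virtio0"
  size         = each.value.size
}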

Sorry for creating a ticket for this.

Screenshot from 2024-02-07 12-12-50

@ratiborusx

I don't know, lads. It seems to me that this issue was closed prematurely. Just today I tried to spin up 7 VMs simultaneously via a for_each loop and got the same-looking errors after just 2 were created.

Terraform:

╷
│ Error: error waiting for VM start: task "UPID:prox-srv2:003EB904:2E8895E0:65C0FA88:qmstart:129:root@pam:" failed to complete with exit code: can't lock file '/var/lock/qemu-server/lock-129.conf' - got timeout
│ 
│   with module.database_px.proxmox_virtual_environment_vm.px_vm["database-test-rupost-px-007"],
│   on ../modules/proxmox-infra-local/main.tf line 22, in resource "proxmox_virtual_environment_vm" "px_vm":
│   22: resource "proxmox_virtual_environment_vm" "px_vm" {
│ 
╵

PVE GUI:

UPID:prox-srv2:003EC09F:2E88DFB3:65C0FB45:resize:129:root@pam: 65C0FB51 command '/usr/bin/qemu-img resize -f qcow2 /vzdata/images/129/vm-129-disk-0.qcow2 107374182400' failed: got timeout

I can only say that all resources were created successfully once we set parallelism=2. None of the solutions mentioned in this issue or in #868 is really a solution, but coincidentally we went with the parallelism setting too, independently, as our current workaround.
Choosing "virtio" as the disk interface may also be sub-optimal, even though it somehow helped achieve the desired result here. I think it's just a coincidence and that it happened to play nicely with some other variables, like the storage type or the way the deployment is implemented. It also doesn't look storage-type related: we're using plain local storage, not ZFS.
@bpg mentioned that going with the "virtio" interface should increase performance, but according to the manual that may not be the case:

The VirtIO Block controller, often just called VirtIO or virtio-blk, is an older type of paravirtualized controller.
It has been superseded by the VirtIO SCSI Controller, in terms of features.
...
A SCSI controller of type VirtIO SCSI single and enabling the IO Thread setting for the attached disks is recommended if you aim for performance. This is the default for newly created Linux VMs since Proxmox VE 7.3.


This contradicts what QEMU says in the virtio-blk vs. virtio-scsi discussion, but they didn't mention virtio-scsi-single, so maybe it just didn't exist at that time?

These are the default timeouts, by the way (at least as shown to me on plan; I did not specify any of them in my manifests), yet our tasks crashed in less than a minute:

      + timeout_clone           = 1800
      + timeout_create          = 1800
      + timeout_migrate         = 1800
      + timeout_move_disk       = 1800
      + timeout_reboot          = 1800
      + timeout_shutdown_vm     = 1800
      + timeout_start_vm        = 1800
      + timeout_stop_vm         = 300

I do not fully understand this stuff myself, but it looks like there are two disk-related hardware settings, scsi_hardware and disk.interface, and it's a bit confusing. When you attach a disk in the GUI, it clearly shows the chosen SCSI Controller type if the disk's interface is "SCSI [scsi0]", but when you choose the "VirtIO Block [virtio0]" interface it shows nothing, even though the previously chosen SCSI Controller type is still present. Does that mean that "virtio" is an interface and a controller type at the same time, and that if you choose it, scsi_hardware simply has no effect? No clue.

I didn't want to open a new issue, but all in all it looks like there's still some kind of problem. For what it's worth, it may be some kind of I/O congestion, because our SSDs are slow or because all clone operations run on the same node (we want these VMs on that specific node). It could also be related to the older versions of Terraform (1.5.7), the provider (0.46.1) and PVE (8.0.3) we use, but I don't think so.

On a side note, maybe the provider's defaults should be changed to match PVE's defaults (since 7.3). Maybe not right now, but I think this disparity will only grow with time. I'm talking about these (see the sketch after the list):

scsi_hardware = "virtio-scsi-single" ("virtio-scsi-pci" as of now)
disk.iothread = true (false as of now)
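
In resource terms, matching the PVE 7.3+ recommendation would look roughly like this (not tested here; datastore and size are placeholders, just to show which knobs are involved):

scsi_hardware = "virtio-scsi-single"

disk {
  datastore_id = "local"
  interface    = "scsi0"
  iothread     = true
  size         = 100
}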

@bpg
Owner

bpg commented Jul 8, 2024

Changing the defaults is not super straightforward in the current implementation and may lead to various side effects. I'm planning to address most of this in #1231 and reduce the provider-defined defaults to a minimum.

@bpg bpg mentioned this issue Sep 8, 2024