
Error: error updating VM: received an HTTP 500 response - Reason: can't lock file '/var/lock/qemu-server/lock-101.conf' - got timeout #995

Closed
CultureLinux opened this issue Jan 31, 2024 · 6 comments
Labels
🐛 bug Something isn't working

Comments

@CultureLinux

Hello,

First of all, thanks for your amazing provider 👍

Describe the bug
When cloning multiple VMs, the apply fails with an HTTP 500 error while acquiring the VM lock, even with all timeout_* settings set to 600.
Using the same main.tf always works when creating only one VM (< 20 sec).

Tested on the latest versions of OpenTofu, Terraform, Proxmox and the bpg provider.

To Reproduce
Steps to reproduce the behavior:

  1. Create a resource (see below)
  2. Run tofu init && tofu plan && tofu apply
  3. See the error
resource "proxmox_virtual_environment_vm" "debian_vm" {
  for_each = {
    for index,vm in var.all_vm_config:
    vm.name => vm
  }

  # Warning default timeout seems to be around 20sec
  timeout_clone = 600
  timeout_create = 600
  timeout_start_vm = 600
  timeout_shutdown_vm = 600
  timeout_stop_vm = 600
  timeout_reboot = 600

  name        = each.value.name
  description = "Managed by opentofu"
  tags        = ["opentofu", each.value.name]
  node_name = var.target_node
  clone {
    vm_id = each.value.vmid2clone
  }

  cpu {
    cores = each.value.cpu
    type = "host"
    numa = true
  }
  memory {
    dedicated = each.value.ram
  }
  network_device {
    bridge = "vmbr0"
    model = "virtio"
  }

  efi_disk {
    datastore_id = "ssd-front"
    file_format = "raw"
    type    = "4m"
  }

  disk {
    datastore_id = "ssd-front"
    file_format = "raw"
    interface = "scsi0"
    size = each.value.size
  }


  operating_system {
    type = "l26"
  }
  machine = "q35"
  agent {
    enabled = false
  }

  initialization {
    ip_config {
      ipv4 {
        #address = "192.168.1.180/24"
        address = format("%s%s%s","192.168.1.",180 + tonumber(each.value.idx), "/24")
        gateway = "192.168.1.1"
      }
    }
    user_account {
      keys     = [var.ssh_key]
      password = "tofu"
      username = each.value.name
    }
  }
}
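
For reference, the variable driving the for_each is shaped roughly like this (illustrative only; the field names match the references above, the values are made up):

variable "all_vm_config" {
  type = list(object({
    name       = string
    vmid2clone = number
    cpu        = number
    ram        = number
    size       = number
    idx        = number
  }))
  # Example entries, not the real ones
  default = [
    { name = "vm1", vmid2clone = 9000, cpu = 2, ram = 2048, size = 20, idx = 0 },
    { name = "vm2", vmid2clone = 9000, cpu = 2, ram = 2048, size = 20, idx = 1 }
  ]
}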

Expected behavior
The apply should time out after 600 seconds, not after ~20 seconds.

Screenshots
Screenshot from 2024-01-31 14-17-25

Additional context

  • Single or clustered Proxmox: single instance 8.1.4
  • Provider version (ideally it should be the latest version): bpg/proxmox v0.46.1 (signed, key ID DAA1958557A27403)
  • Terraform version: Terraform v1.7.1 on linux_amd64
  • OpenTofu version: OpenTofu v1.6.1 on linux_amd64
  • OS (where you run Terraform from): rocky9 amd64
  • Debug logs (TF_LOG=DEBUG terraform apply):
@CultureLinux added the 🐛 bug Something isn't working label on Jan 31, 2024
@Qarasique

Same behavior here. Maybe we could add a "sequential" option, or a parallelism setting? Especially for slow disk setups, where cloning a VM from a template can take ages.

@bpg
Owner

bpg commented Feb 2, 2024

@Qarasique Terraform and OpenTofu support a -parallelism CLI argument that controls how many resources are applied concurrently.
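
For example, to limit a run to two concurrent operations (the default is 10):

tofu apply -parallelism=2
# or, with Terraform
terraform apply -parallelism=2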

@CultureLinux Thanks for the report! The 20s timeout is suspicious, and I think other people have mentioned something like that in the past, though I can't find a reference. We should definitely take a look at how it is propagated.

The issue itself could also be specific to the underlying storage type. Other users had similar issues on ZFS-backed storage in #831 and #868, so there could be an I/O bottleneck on the PVE host, especially if the cloned image and the VMs are on the same physical drive.

@CultureLinux
Author

CultureLinux commented Feb 6, 2024

A small update on this issue.

I switched the storage destination from ssd-front (directory type, SSD) to lvm-lvm (LVM-Thin, HDD).

Screenshot from 2024-02-06 09-53-10
The timeout is not hit and the 2 VMs started successfully!

It seems really weird that the faster storage breaks and the slower one doesn't.
I'm exploring the storage section of the wiki, especially the preallocation option, which says:
"When using network storages in combination with large qcow2 images, using off can help to avoid timeouts."

https://pve.proxmox.com/wiki/Storage#_storage_configuration
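
If someone wants to try that, the option goes in the storage definition on the PVE host, e.g. in /etc/pve/storage.cfg (a sketch only, with a placeholder path):

dir: ssd-front
    path /mnt/ssd-front
    content images
    preallocation off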

This is still an open question in my mind, but tell me if you see the same behavior.

@CultureLinux
Author

A final update!
I just switched the disk interface from scsi to virtio and: 7 VMs in 1 minute and 2 seconds!
The hint was in #868.
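
In provider terms, that is roughly this change to the disk block (same values as in my config above):

disk {
  datastore_id = "ssd-front"
  file_format  = "raw"
  interface    = "virtio0"
  size         = each.value.size
}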

Sorry for creating a ticket for this.

Screenshot from 2024-02-07 12-12-50

@ratiborusx

I don't know, lads. It seems to me that this issue was closed prematurely. Just today I tried to spin up 7 VMs simultaneously via a for_each loop and got the same-looking errors after just 2 were created.

Terraform:

╷
│ Error: error waiting for VM start: task "UPID:prox-srv2:003EB904:2E8895E0:65C0FA88:qmstart:129:root@pam:" failed to complete with exit code: can't lock file '/var/lock/qemu-server/lock-129.conf' - got timeout
│ 
│   with module.database_px.proxmox_virtual_environment_vm.px_vm["database-test-rupost-px-007"],
│   on ../modules/proxmox-infra-local/main.tf line 22, in resource "proxmox_virtual_environment_vm" "px_vm":
│   22: resource "proxmox_virtual_environment_vm" "px_vm" {
│ 
╵

PVE GUI:

UPID:prox-srv2:003EC09F:2E88DFB3:65C0FB45:resize:129:root@pam: 65C0FB51 command '/usr/bin/qemu-img resize -f qcow2 /vzdata/images/129/vm-129-disk-0.qcow2 107374182400' failed: got timeout

I can only say that all resources were created successfully once we set parallelism=2. None of the solutions mentioned in this issue or in #868 is really a solution, but coincidentally we went with the parallelism setting too, independently, as our current workaround.
Choosing "virtio" as the disk interface may also be sub-optimal, even though it somehow helped achieve the desired result here. I think it's just a coincidence and that it happened to play nicely with some other variables, like the storage type or the way the deployment is implemented. It also doesn't look storage-type related: we're using plain local storage, not ZFS.
@bpg mentioned that going with the "virtio" interface should increase performance, but according to the manual that may not be the case:

The VirtIO Block controller, often just called VirtIO or virtio-blk, is an older type of paravirtualized controller.
It has been superseded by the VirtIO SCSI Controller, in terms of features.
...
A SCSI controller of type VirtIO SCSI single and enabling the IO Thread setting for the attached disks is recommended if you aim for performance. This is the default for newly created Linux VMs since Proxmox VE 7.3.


This contradicts what QEMU says in the virtio-blk vs. virtio-scsi discussion, but they didn't mention virtio-scsi-single, so maybe it just didn't exist at that time?

These are the default timeouts, by the way (at least as shown to me on plan; I did not specify any of them in my manifests), yet our tasks crashed in less than a minute:

      + timeout_clone           = 1800
      + timeout_create          = 1800
      + timeout_migrate         = 1800
      + timeout_move_disk       = 1800
      + timeout_reboot          = 1800
      + timeout_shutdown_vm     = 1800
      + timeout_start_vm        = 1800
      + timeout_stop_vm         = 300

I do not fully understand this stuff myself, but it looks like there are two disk-related hardware settings, scsi_hardware and disk.interface, and it's a bit confusing. When you attach a disk in the GUI, it clearly shows the chosen SCSI Controller type if the disk's interface is "SCSI [scsi0]", but when you choose the "VirtIO Block [virtio0]" interface it shows nothing, even though the previously chosen SCSI Controller type is still present. Does that mean that "virtio" is an interface and a controller type at the same time, and that if you choose it, scsi_hardware simply has no effect? No clue.

I didn't want to open a new issue, but all in all it looks like there's still some kind of problem. For what it's worth, it may be some kind of I/O congestion, because our SSDs are slow or because all clone operations run on the same node (we want these VMs on that specific node). It could also be related to the older versions of Terraform (1.5.7), the provider (0.46.1) and PVE (8.0.3) we use, but I don't think so.

On a side note, maybe the provider's defaults should be changed to match PVE's defaults (since 7.3). Maybe not right now, but I think this disparity will only grow with time. I'm talking about these (see the sketch after the list):

scsi_hardware = "virtio-scsi-single" ("virtio-scsi-pci" as of now)
disk.iothread = true (false as of now)
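
In resource terms, matching the PVE 7.3+ recommendation would look roughly like this (not tested here; datastore and size are placeholders, just to show which knobs are involved):

scsi_hardware = "virtio-scsi-single"

disk {
  datastore_id = "local"
  interface    = "scsi0"
  iothread     = true
  size         = 100
}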

@bpg
Owner

bpg commented Jul 8, 2024

Changing the defaults is not super straightforward in the current implementation and may lead to various side effects. I'm planning to address most of this in #1231 and reduce the provider-defined defaults to a minimum.

@bpg bpg mentioned this issue Sep 8, 2024