rdcore bind-boot fails with "Error: Expected one vendor dir" #1116

Closed
iluminae opened this issue Mar 7, 2022 · 14 comments · Fixed by coreos/coreos-installer#802

@iluminae

iluminae commented Mar 7, 2022

Describe the bug
After installing via PXE on some host models, I get "Error: Expected one vendor dir on /dev/sda2, got 2" after reboot. I have 4 hosts: 2 PowerEdge R415 and 2 PowerEdge R210 II. The issue happens on both R210s, but the same process works on the R415s.

Reproduction steps
Steps to reproduce the behavior:

  1. PXE boot server with kernel and initrd and kernel commandline: console=ttyS1,115200n8 coreos.live.rootfs_url=%s coreos.inst.install_dev=%s coreos.inst.ignition_url=%s
  2. I tried setting coreos.inst.install_dev to a /dev/disk/by-id path and to a plain /dev/sdX path; both fail the same way.

Expected behavior
After reboot, I expect it to boot.

Actual behavior
Install appears fine, triggers reboot, boots with no output and reboots again, then boots into an error: coreos-boot-edit[900]: Error: Expected one vendor dir on /dev/sda2, got 2.

System details

  • Bare Metal (PXE)
  • Fedora CoreOS version 35.20220213.3.0
  • Dell Poweredge r210 II

Ignition config

variant: fcos
version: 1.4.0
kernel_arguments:
  should_exist:
  - mitigations=off
  - selinux=0
  - console=ttyS1,115200n8
  should_not_exist:
  - console=ttyS0,115200n8
  - mitigations=auto,nosmt
passwd:
  users:
  - name: REDACTED
    password_hash: REDACTED
    ssh_authorized_keys:
    - REDACTED
    groups:
    - "sudo"
    - "wheel"
storage:
  files:
  - path: /etc/hostname
    mode: 420
    overwrite: true
    contents:
      inline: "dosd-00"
  - path: /etc/ssh/sshd_config.d/20-enable-passwords.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        PasswordAuthentication yes
  - path: /etc/NetworkManager/system-connections/eno2.nmconnection
    mode: 0600
    overwrite: true
    contents:
      inline: |
        [connection]
        id=eno2
        type=ethernet
        interface-name=eno2
        [ethernet]
        mtu=9000
        [ipv4]
        address1=192.168.6.0/24
        dns-search=
        may-fail=false
        method=manual
        never-default=true
        route1=192.168.6.0/24,192.168.6.0
jlebon added a commit to jlebon/coreos-installer that referenced this issue Mar 7, 2022
This can help figuring out what went wrong.

Related: coreos/fedora-coreos-tracker#1116
@jlebon
Member

jlebon commented Mar 7, 2022

Can you provide the full logs of the failing boot?

Install appears fine, triggers reboot, boots with no output and reboots again, then boots into an error

The double reboot here is expected. During the first reboot, no logs are available because the console kargs haven't been added yet. Very early, Ignition applies them and immediately reboots again. You can avoid the double reboot by having coreos-installer itself perform the karg modifications (using pxe customize for example). (It'd make sense to have it automatically apply kargs from the target Ignition config as an optimization. I think this was discussed $somewhere but I can't find it right now; will look for it and otherwise file an RFE. Edit: coreos/coreos-installer#797)
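
To illustrate why the second reboot happens, here is a minimal Python sketch of how the should_exist / should_not_exist lists could be applied to a kernel command line. This is a simplified model and not Ignition's actual implementation: `apply_kargs` is a hypothetical helper, the starting command line is made up for illustration, and only whole single-token arguments are handled. The two karg lists are taken from the config in this issue.

```python
# Karg lists copied from the config posted in this issue.
SHOULD_EXIST = ["mitigations=off", "selinux=0", "console=ttyS1,115200n8"]
SHOULD_NOT_EXIST = ["console=ttyS0,115200n8", "mitigations=auto,nosmt"]


def apply_kargs(cmdline, should_exist, should_not_exist):
    """Return (new_cmdline, changed) for a simplified karg update.

    Hypothetical helper: drops every argument listed in should_not_exist,
    then appends every argument from should_exist that is still missing.
    """
    args = cmdline.split()
    result = [arg for arg in args if arg not in should_not_exist]
    for karg in should_exist:
        if karg not in result:
            result.append(karg)
    return " ".join(result), result != args


# Hypothetical command line of the freshly installed system (assumption).
first_boot = "mitigations=auto,nosmt console=tty0 console=ttyS0,115200n8"

new_cmdline, needs_reboot = apply_kargs(first_boot, SHOULD_EXIST, SHOULD_NOT_EXIST)
print(new_cmdline)
print("reboot needed:", needs_reboot)

# Running the same update again changes nothing, so no further reboot.
print("reboot needed again:", apply_kargs(new_cmdline, SHOULD_EXIST, SHOULD_NOT_EXIST)[1])
```

In this model the first pass changes the command line, which is what forces the extra reboot. If the installer has already written the desired kargs, the first pass is a no-op.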

Expected one vendor dir

This error comes from the first boot trying to "bind" the bootloader with the bootfs. The error message here is not very helpful (improvement in coreos/coreos-installer#796), but it's saying that it found more than one vendor directory in the EFI partition. That should not happen with a stock FCOS EFI partition.

Are there any other disks connected to the systems with a boot filesystem label? E.g. a previous OS installation. Usually, the error message in that condition should've been clearer, so it's possible there's something more subtle going on here.

@jlebon changed the title from 'new install results in Error: Expected one vendor dir' to 'rdcore bind-boot fails with "Error: Expected one vendor dir"' on Mar 7, 2022
@iluminae
Author

iluminae commented Mar 8, 2022

Yea I figured the double boot was not the issue. There are 3 disks I have attached to the system, but I have also tried completely disconnecting the other 2 before boot, same result.

I will try to get the logs, I am on an iDRAC with serial redirection and they look pretty rough.

From the emergency shell in the initrd, I see only the labels made by the installer:

:/root# ls /dev/disk/by-label/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

:/root# ls /dev/disk/by-partlabel/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 BIOS-BOOT -> ../../sda1
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

We have 3 partitions there related to booting, and the error is referring only to /dev/sda2. I have tried booting the system both with EFI and with BIOS booting, same outcome.

@jlebon
Member

jlebon commented Mar 8, 2022

Yea I figured the double boot was not the issue. There are 3 disks I have attached to the system, but I have also tried completely disconnecting the other 2 before boot, same result.

Interesting, thanks for testing that.

I will try to get the logs, I am on an iDRAC with serial redirection and they look pretty rough.

From the emergency shell in the initrd, I see only the labels made by the installer:

:/root# ls /dev/disk/by-label/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

:/root# ls /dev/disk/by-partlabel/ -l
total 0
lrwxrwxrwx 1 root root  10 Mar  8 02:43 BIOS-BOOT -> ../../sda1
lrwxrwxrwx 1 root root  10 Mar  8 02:43 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root  10 Mar  8 02:43 boot -> ../../sda3
lrwxrwxrwx 1 root root  10 Mar  8 02:43 root -> ../../sda4

We have 3 partitions there related to booting, and the error is referring only to /dev/sda2. I have tried booting the system both with EFI and with BIOS booting, same outcome.

From the emergency shell, can you mount /dev/sda2 somewhere and show the output of ls $mnt/EFI?

@iluminae
Author

iluminae commented Mar 8, 2022

Here you go:

:/root# find mnt
mnt
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/Dell
mnt/EFI/Dell/BootOptionCache
mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat

Sorry I've had a silly time getting the boot log off the box, I just need to find a USB stick or something.

EDIT:
I checked the same on the 2 servers that did work and yes, they are missing the Dell directory.

# on _working_ boxes
# find mnt/
mnt/
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/fedora/bootuuid.cfg

So - what adds that?
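
The difference between the two listings can be made explicit with a small Python sketch. The paths are copied from the two `find mnt` outputs above; `compare` is a hypothetical helper written for this illustration.

```python
# Paths copied from the `find mnt` output on a failing box.
FAILING = """
mnt
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/Dell
mnt/EFI/Dell/BootOptionCache
mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat
""".split()

# Paths copied from the `find mnt/` output on a working box.
WORKING = """
mnt/
mnt/EFI
mnt/EFI/BOOT
mnt/EFI/BOOT/BOOTX64.EFI
mnt/EFI/BOOT/fbx64.efi
mnt/EFI/fedora
mnt/EFI/fedora/BOOTX64.CSV
mnt/EFI/fedora/shim.efi
mnt/EFI/fedora/shimx64.efi
mnt/EFI/fedora/grubx64.efi
mnt/EFI/fedora/mmx64.efi
mnt/EFI/fedora/grub.cfg
mnt/EFI/fedora/bootuuid.cfg
""".split()


def compare(failing, working):
    """Return (paths only on the failing box, paths only on the working box)."""
    # Strip trailing slashes so "mnt/" and "mnt" compare equal.
    f = {p.rstrip("/") for p in failing}
    w = {p.rstrip("/") for p in working}
    return sorted(f - w), sorted(w - f)


only_failing, only_working = compare(FAILING, WORKING)
print("only on failing boxes:", only_failing)
print("only on working boxes:", only_working)
```

Besides the extra Dell directory on the failing boxes, the working boxes carry an EFI/fedora/bootuuid.cfg, which would be consistent with the bind step having completed there.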

@iluminae
Author

iluminae commented Mar 9, 2022

I have tried to just delete that EFI/Dell directory from /dev/sda2 but it comes back every time it reboots. This is not some magic dell thing is it?

@jlebon
Member

jlebon commented Mar 9, 2022

mnt/EFI/Dell
mnt/EFI/Dell/BootOptionCache
mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat

Ahh nice. Yup, this is the issue. We currently don't expect that.

This is not some magic dell thing is it?

Information on this is really scarce, so I can't say for sure, but yes it does smell a lot like a magic Dell thing.

Anyway, we should be more lax on our side here given this information, so I'll look at tweaking our heuristics.

jlebon added a commit to jlebon/coreos-installer that referenced this issue Mar 9, 2022
On some Dell machines at least, something (UEFI firmware?) creates a
`Dell` directory in the `EFI` dir. This throws off our logic here which
expects only a single vendor dir.

Let's tweak the logic so that we only consider a "vendor dir" a
directory which has a `grub.cfg`.

Closes: coreos/fedora-coreos-tracker#1116
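
The heuristic change described in this commit message can be sketched in Python as follows. This is an illustration only, not the coreos-installer code (which is written in Rust). The "strict" variant is an assumption inferred from the listings in this thread (BOOT does not seem to be counted, since the failing box reports 2 directories), and both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def vendor_dirs_strict(efi_dir):
    """Assumed pre-fix behaviour: every directory under EFI/ except BOOT."""
    return sorted(
        p.name for p in Path(efi_dir).iterdir() if p.is_dir() and p.name != "BOOT"
    )


def vendor_dirs_relaxed(efi_dir):
    """Behaviour described in the commit: only directories containing grub.cfg."""
    return sorted(
        p.name
        for p in Path(efi_dir).iterdir()
        if p.is_dir() and (p / "grub.cfg").is_file()
    )


def build_failing_layout(root):
    """Recreate a subset of the failing box's EFI layout as empty files."""
    for rel in (
        "EFI/BOOT/BOOTX64.EFI",
        "EFI/fedora/grub.cfg",
        "EFI/fedora/shimx64.efi",
        "EFI/Dell/BootOptionCache/BootOptionCache.dat",
    ):
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return Path(root, "EFI")


with tempfile.TemporaryDirectory() as tmp:
    efi = build_failing_layout(tmp)
    strict = vendor_dirs_strict(efi)
    relaxed = vendor_dirs_relaxed(efi)

print("strict heuristic: ", strict)
print("relaxed heuristic:", relaxed)
```

With the Dell directory present, the strict variant finds two candidates, which matches the "got 2" in the error, while the relaxed variant finds only the fedora directory.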
@jlebon
Member

jlebon commented Mar 9, 2022

Anyway, we should be more lax on our side here given this information, so I'll look at tweaking our heuristics.

coreos/coreos-installer#802

@wkruse

wkruse commented Mar 18, 2022

This issue is also happening on Dell PowerEdge R630 and R640.

@iluminae
Author

Hey @jlebon, any word on when coreos/coreos-installer#802 is getting in? I had to fall back to a different OS for this class of host and I would like to get them all homogeneous.

@jlebon
Member

jlebon commented Mar 23, 2022

Hi @iluminae, apologies for the delay. We had some CI issues but they should be fixed now.
Once the patch is in, I'll make sure it ends up in next week's releases.

@dustymabe
Member

The fix for this went into testing stream release 35.20220327.2.0. Please try out the new release and report issues.

@dustymabe added the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) and removed the status/pending-testing-release label (Fixed upstream. Waiting on a testing release.) on Mar 30, 2022
@log1cb0mb

This issue also occurs on the FC640 and MX740c. I tested the testing release mentioned above on the FC640 and it seems to be fixed.

@dustymabe
Member

Thanks for reporting the info and success @log1cb0mb!

@dustymabe
Member

The fix for this went into stable stream release 35.20220327.3.0.

@dustymabe removed the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) on Apr 13, 2022