This repository was archived by the owner on Apr 14, 2021. It is now read-only.


Sebastien Boeuf edited this page Nov 1, 2018 · 2 revisions

CPU Hotplug

This document explains in technical detail the interactions between the NEMU hypervisor (relying on the new virt machine type) and the Linux guest OS running inside the virtual machine. It highlights how the ACPI tables are used to allow communication between the hypervisor and the guest OS, and also explains how CPU hotplug is performed.

The same documentation applies to any machine type leveraging the Hardware-Reduced ACPI specification.

Overview

Here is a quick overview of the different components involved in the CPU hotplug mechanism:

  • The user triggers the insertion or removal of a CPU through the command line provided by the hypervisor.
  • Once the hypervisor learns that a hotplug operation needs to be applied to a CPU, it notifies the guest OS about it.
  • The guest OS relies on the ACPI tables, particularly the DSDT, to evaluate the ACPI method associated with the event received. At some point, this method notifies the guest OS itself in order to trigger the appropriate driver handling CPU hotplug internally. That method is CSCN, which the hypervisor uses as the mechanism to notify the guest OS.
  • A lot of back and forth happens between the guest OS and the hypervisor through the range of I/O ports defined through the ACPI tables. The goal for the hypervisor is to provide information about the CPU status, and the goal for the guest OS is to signal the status of the hotplug operation back to the hypervisor.

(Figure: cpu-hotplug-overview)

Range of I/O ports

A range of I/O ports is a convenient way for the hypervisor to establish some communication with the guest OS. By creating those regions and the ACPI methods accessing them, the hypervisor defines the expected memory accesses the guest OS will perform when evaluating one of those ACPI methods. Anytime a memory access to one of those regions is performed by the guest OS, the hypervisor traps it and takes the appropriate actions.

Here are the details of the range of I/O ports used by NEMU to communicate regarding its CPUs:

  • Base address: 0x0CD8
  • Size of the region: 0x0C

Here is how this range of I/O ports is defined through ACPI tables:

DSDT table

    OperationRegion (PRST, SystemIO, 0x0CD8, 0x0C)
    Field (PRST, ByteAcc, NoLock, WriteAsZeros)
    {
	Offset (0x04), 
	CPEN,   1, 
	CINS,   1, 
	CRMV,   1, 
	CEJ0,   1, 
	Offset (0x05), 
	CCMD,   8
    }

    Field (PRST, DWordAcc, NoLock, Preserve)
    {
	CSEL,   32, 
	Offset (0x08), 
	CDAT,   32
    }

CPEN: CPU enable. It is a flag indicating if the CPU is enabled, defined by the bit 0 of the byte accessible at offset 0x04 of the base address 0x0CD8.

CINS: CPU insert. It is a flag indicating if the CPU needs to be inserted, defined by the bit 1 of the byte accessible at offset 0x04 of the base address 0x0CD8.

CRMV: CPU remove. It is a flag indicating if the CPU needs to be removed, defined by the bit 2 of the byte accessible at offset 0x04 of the base address 0x0CD8.

CEJ0: CPU eject. It is a flag indicating if the CPU has been ejected, defined by the bit 3 of the byte accessible at offset 0x04 of the base address 0x0CD8.

CCMD: CPU command. It is a value indicating the type of command associated with the data passed through CDAT, defined by the byte accessible at offset 0x05 of the base address 0x0CD8.

CSEL: CPU selector. It is a value indicating the CPU index, defined by the double word accessible at offset 0x00 of the base address 0x0CD8.

CDAT: CPU data. It's the data that needs to be associated with the command type specified by CCMD. It's defined by the double word accessible at offset 0x08 of the base address 0x0CD8.

Note: Both ByteAcc and DWordAcc specify the type of access for each field of the whole range of I/O ports.
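To make the bit layout of the flags byte at offset 0x04 concrete, here is a minimal sketch (this is not NEMU code; the enum and function names are illustrative) of how CPEN, CINS, CRMV and CEJ0 pack into the single byte the hypervisor returns on a read:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the PRST flags byte at offset 0x04:
 * bit 0 = CPEN, bit 1 = CINS, bit 2 = CRMV, bit 3 = CEJ0. */
enum {
    CPU_FLAG_ENABLED   = 1 << 0, /* CPEN */
    CPU_FLAG_INSERTING = 1 << 1, /* CINS */
    CPU_FLAG_REMOVING  = 1 << 2, /* CRMV */
    CPU_FLAG_EJECT     = 1 << 3, /* CEJ0 */
};

/* Packs the per-CPU state into the flags byte, the way
 * cpu_hotplug_rd() does for ACPI_CPU_FLAGS_OFFSET_RW. */
static uint8_t pack_cpu_flags(int enabled, int inserting, int removing)
{
    uint8_t flags = 0;
    if (enabled)   flags |= CPU_FLAG_ENABLED;
    if (inserting) flags |= CPU_FLAG_INSERTING;
    if (removing)  flags |= CPU_FLAG_REMOVING;
    return flags;
}
```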

And here is the code from NEMU taking care of handling any read/write operation from/to this range:

hw/acpi/cpu.c: Callbacks declaration

static const MemoryRegionOps cpu_hotplug_ops = {
    .read = cpu_hotplug_rd,
    .write = cpu_hotplug_wr,
    .endianness = DEVICE_LITTLE_ENDIAN,
    .valid = {
        .min_access_size = 1,
        .max_access_size = 4,
    },
};

Here is the important structure CPUHotplugState used by those two callbacks:

typedef struct CPUHotplugState {
    MemoryRegion ctrl_reg;
    uint32_t selector;
    uint8_t command;
    uint32_t dev_count;
    AcpiCpuStatus *devs;
} CPUHotplugState;

hw/acpi/cpu.c: Read operations

static uint64_t cpu_hotplug_rd(void *opaque, hwaddr addr, unsigned size)
{
    uint64_t val = 0;
    CPUHotplugState *cpu_st = opaque;
    AcpiCpuStatus *cdev;

    if (cpu_st->selector >= cpu_st->dev_count) {
        return val;
    }

    cdev = &cpu_st->devs[cpu_st->selector];
    switch (addr) {
    case ACPI_CPU_FLAGS_OFFSET_RW: /* pack and return is_* fields */
        val |= cdev->cpu ? 1 : 0;
        val |= cdev->is_inserting ? 2 : 0;
        val |= cdev->is_removing  ? 4 : 0;
        trace_cpuhp_acpi_read_flags(cpu_st->selector, val);
        break;
    case ACPI_CPU_CMD_DATA_OFFSET_RW:
        switch (cpu_st->command) {
        case CPHP_GET_NEXT_CPU_WITH_EVENT_CMD:
           val = cpu_st->selector;
           break;
        default:
           break;
        }
        trace_cpuhp_acpi_read_cmd_data(cpu_st->selector, val);
        break;
    default:
        break;
    }
    return val;
}

This function is the callback handling any read to the range of I/O ports defined above. Depending on the address offset to be read, the hypervisor will return different values:

  • ACPI_CPU_FLAGS_OFFSET_RW: Accessed when one of the flags CPEN, CINS or CRMV is read. The hypervisor will simply return the value based on its internal structures.
  • ACPI_CPU_CMD_DATA_OFFSET_RW: Accessed when CDAT is read. The hypervisor will return the CPU selector value only when the command type set through CCMD matches CPHP_GET_NEXT_CPU_WITH_EVENT_CMD (0x0).

For any of these operations, the target CPU device cdev has to be specified. This happens by writing the CPU index to the CPU selector field CSEL. This way, when a read is performed, we make sure it applies to the right CPU.
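This select-then-read protocol can be sketched with a small in-memory model (purely illustrative; the structure and function names are assumptions, not the real NEMU types):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal in-memory model of the select-then-read protocol:
 * the guest first writes a CPU index to CSEL (offset 0x00), then
 * reads the flags byte (offset 0x04) for that CPU. */
#define MAX_CPUS 4

struct hotplug_model {
    uint32_t selector;        /* mirrors CPUHotplugState.selector */
    uint8_t  flags[MAX_CPUS]; /* per-CPU CPEN/CINS/CRMV bits */
};

/* Guest write to CSEL: pick the CPU the next accesses apply to */
static void model_write_csel(struct hotplug_model *m, uint32_t idx)
{
    m->selector = idx;
}

/* Guest read of the flags byte for the currently selected CPU */
static uint8_t model_read_flags(const struct hotplug_model *m)
{
    if (m->selector >= MAX_CPUS)
        return 0; /* out-of-range selector reads as zero, like cpu_hotplug_rd() */
    return m->flags[m->selector];
}
```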

hw/acpi/cpu.c: Write operations

static void cpu_hotplug_wr(void *opaque, hwaddr addr, uint64_t data,
                           unsigned int size)
{
    CPUHotplugState *cpu_st = opaque;
    AcpiCpuStatus *cdev;
    ACPIOSTInfo *info;

    assert(cpu_st->dev_count);

    if (addr) {
        if (cpu_st->selector >= cpu_st->dev_count) {
            trace_cpuhp_acpi_invalid_idx_selected(cpu_st->selector);
            return;
        }
    }

    switch (addr) {
    case ACPI_CPU_SELECTOR_OFFSET_WR: /* current CPU selector */
        cpu_st->selector = data;
        trace_cpuhp_acpi_write_idx(cpu_st->selector);
        break;
    case ACPI_CPU_FLAGS_OFFSET_RW: /* set is_* fields  */
        cdev = &cpu_st->devs[cpu_st->selector];
        if (data & 2) { /* clear insert event */
            cdev->is_inserting = false;
            trace_cpuhp_acpi_clear_inserting_evt(cpu_st->selector);
        } else if (data & 4) { /* clear remove event */
            cdev->is_removing = false;
            trace_cpuhp_acpi_clear_remove_evt(cpu_st->selector);
        } else if (data & 8) {
            DeviceState *dev = NULL;
            HotplugHandler *hotplug_ctrl = NULL;

            if (!cdev->cpu) {
                trace_cpuhp_acpi_ejecting_invalid_cpu(cpu_st->selector);
                break;
            }

            trace_cpuhp_acpi_ejecting_cpu(cpu_st->selector);
            dev = DEVICE(cdev->cpu);
            hotplug_ctrl = qdev_get_hotplug_handler(dev);
            hotplug_handler_unplug(hotplug_ctrl, dev, NULL);
        }
        break;
    case ACPI_CPU_CMD_OFFSET_WR:
        trace_cpuhp_acpi_write_cmd(cpu_st->selector, data);
        if (data < CPHP_CMD_MAX) {
            cpu_st->command = data;
            if (cpu_st->command == CPHP_GET_NEXT_CPU_WITH_EVENT_CMD) {
                uint32_t iter = cpu_st->selector;

                do {
                    cdev = &cpu_st->devs[iter];
                    if (cdev->is_inserting || cdev->is_removing) {
                        cpu_st->selector = iter;
                        trace_cpuhp_acpi_cpu_has_events(cpu_st->selector,
                            cdev->is_inserting, cdev->is_removing);
                        break;
                    }
                    iter = iter + 1 < cpu_st->dev_count ? iter + 1 : 0;
                } while (iter != cpu_st->selector);
            }
        }
        break;
    case ACPI_CPU_CMD_DATA_OFFSET_RW:
        switch (cpu_st->command) {
        case CPHP_OST_EVENT_CMD: {
           cdev = &cpu_st->devs[cpu_st->selector];
           cdev->ost_event = data;
           trace_cpuhp_acpi_write_ost_ev(cpu_st->selector, cdev->ost_event);
           break;
        }
        case CPHP_OST_STATUS_CMD: {
           cdev = &cpu_st->devs[cpu_st->selector];
           cdev->ost_status = data;
           info = acpi_cpu_device_status(cpu_st->selector, cdev);
           qapi_event_send_acpi_device_ost(info, &error_abort);
           qapi_free_ACPIOSTInfo(info);
           trace_cpuhp_acpi_write_ost_status(cpu_st->selector,
                                             cdev->ost_status);
           break;
        }
        default:
           break;
        }
        break;
    default:
        break;
    }
}

This function is the callback handling any write to the memory IO region defined above. Here are the possible operations when writing to the IO region:

  • ACPI_CPU_SELECTOR_OFFSET_WR: Accessed when writing to CSEL, setting the CPU selector with the value from the write access.
  • ACPI_CPU_FLAGS_OFFSET_RW: Accessed when one of the flags CINS, CRMV or CEJ0 is written. Writing 1 to CINS or CRMV will actually clear the flag indicating that the CPU needs to be inserted or removed. Writing 1 to CEJ0 will trigger the ejection of the CPU, and the hypervisor will take care of completing the CPU removal.
  • ACPI_CPU_CMD_OFFSET_WR: Accessed when writing to CCMD. The interesting use case here is when the data being written matches CPHP_GET_NEXT_CPU_WITH_EVENT_CMD (0x0): the hypervisor is asked to provide the index of the next CPU that needs to be inserted or removed. Otherwise, it simply sets the command for a future data access through CDAT.
  • ACPI_CPU_CMD_DATA_OFFSET_RW: Accessed when writing to CDAT. The hypervisor will read and store the data written, based on the type of command previously set by a write to CCMD. The command can be either CPHP_OST_EVENT_CMD, which specifies the type of OST event, or CPHP_OST_STATUS_CMD, which specifies the status of the event.
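The round-robin scan triggered by CPHP_GET_NEXT_CPU_WITH_EVENT_CMD can be isolated into a small testable sketch (illustrative code mirroring the do/while loop in cpu_hotplug_wr() above, not the real NEMU function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Starting from the current selector, find the first CPU with a pending
 * insert/remove event, wrapping around at the end of the list. Returns
 * the starting selector unchanged when no CPU has an event, which
 * matches the behavior of the loop in cpu_hotplug_wr(). */
static uint32_t next_cpu_with_event(const bool *has_event,
                                    uint32_t dev_count, uint32_t selector)
{
    uint32_t iter = selector;

    do {
        if (has_event[iter])
            return iter; /* found: this becomes the new selector */
        iter = iter + 1 < dev_count ? iter + 1 : 0;
    } while (iter != selector);

    return selector; /* full loop, no pending event anywhere */
}
```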

Hotplug flow

This section will focus on the interactions between the components, from the moment the CPU is added by the user through the NEMU CLI up to the guest OS.

From NEMU

Let's start where everything starts, the main() function in vl.c:

vl.c: main() -> monitor_init_globals()
monitor.c: monitor_init_globals() -> monitor_init_qmp_commands() -> qmp_register_command()

When calling into qmp_register_command(), NEMU registers the callback qmp_device_add() that will be called whenever a device is hotplugged using the monitor (QMP).

After the VM has been started, using the QMP command query-hotpluggable-cpus will return some output similar to:

{"return": [{"props": {"core-id": 1, "thread-id": 1, "node-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 1, "thread-id": 0, "node-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 0, "thread-id": 1, "node-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 0, "thread-id": 0, "node-id": 0, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"}, {"props": {"core-id": 1, "thread-id": 1, "node-id": 0, "socket-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[5]", "type": "host-x86_64-cpu"}, {"props": {"core-id": 1, "thread-id": 0, "node-id": 0, "socket-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[4]", "type": "host-x86_64-cpu"}, {"props": {"core-id": 0, "thread-id": 1, "node-id": 0, "socket-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[3]", "type": "host-x86_64-cpu"}, {"props": {"core-id": 0, "thread-id": 0, "node-id": 0, "socket-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[2]", "type": "host-x86_64-cpu"}]}

Based on this output, we can determine which CPUs can be hotplugged, which leads to the following example of a command that can be used from the monitor to insert a new CPU into the running VM:

QEMU 3.0.0 monitor - type 'help' for more information
(qemu) device_add host-x86_64-cpu,id=core4,socket-id=1,core-id=1,thread-id=0

Once the command is issued, the callback previously registered is triggered:

qdev-monitor.c: qmp_device_add() -> qdev_device_add()

qdev_device_add() is the central piece of code where the parsing of the device options is performed, which eventually triggers the creation of this device based on those parameters.

Once the driver name has been retrieved (host-x86_64-cpu in this case), here is the sequence creating the device internally:

qdev-monitor.c: qdev_device_add() -> object_new(driver)
object.c: object_new() -> object_new_with_type() -> object_initialize_with_type()
object.c: object_initialize_with_type() -> type_initialize() -> class_init(), which calls into the callback provided by the specific driver. In this case, x86_cpu_common_class_init() from target/i386/cpu.c is the one that matters.
object.c: object_initialize_with_type() -> object_init_with_type() -> instance_init(), which calls into the callback provided by the specific driver. In this case, x86_cpu_initfn() from target/i386/cpu.c is the one that matters.

At this point, the device exists internally, and we're missing the part where it triggers the hotplug code path.

Because every object in NEMU is tied to others through parent/child relationships, the CPU device being added here has TYPE_DEVICE as a parent. Therefore, this CPU is also considered a device, and because the parent has to be initialized too, the callback device_initfn() (from hw/core/qdev.c) gets called whenever instance_init() is invoked.

static const TypeInfo device_type_info = {
    .name = TYPE_DEVICE,
    .parent = TYPE_OBJECT,
    .instance_size = sizeof(DeviceState),
    .instance_init = device_initfn,
    ...
};

hw/core/qdev.c: device_initfn() creates an interesting boolean property called realized for which it registers the callback device_set_realized():

static void device_initfn(Object *obj)
{
    ...
    object_property_add_bool(obj, "realized",
                             device_get_realized, device_set_realized, NULL);

So whenever the property realized is set, it triggers device_set_realized(), which will eventually call into qdev_get_hotplug_handler(). The point being to retrieve the hotplug handler that will be used to hotplug the device:

HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev)
{
    HotplugHandler *hotplug_ctrl;

    if (dev->parent_bus && dev->parent_bus->hotplug_handler) {
        hotplug_ctrl = dev->parent_bus->hotplug_handler;
    } else {
        hotplug_ctrl = qdev_get_machine_hotplug_handler(dev);
    }
    return hotplug_ctrl;
}

This handler is directly retrieved from what has been registered by the machine type (virt machine type from hw/i386/virt/virt.c):

static void virt_machine_class_init(MachineClass *mc)
{
    VirtMachineClass *vmc = VIRT_MACHINE_CLASS(mc);
    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(mc);
    ...
    /* Hotplug handlers */
    hc->pre_plug = virt_machine_device_pre_plug_cb;
    hc->plug = virt_machine_device_plug_cb;
    hc->unplug_request = virt_machine_device_unplug_request_cb;
    hc->unplug = virt_machine_device_unplug_cb;
    ...
}

And right after retrieving this handler, it is used to call into pre_plug and plug callbacks:

static void device_set_realized(Object *obj, bool value, Error **errp)
{
    ...
    hotplug_ctrl = qdev_get_hotplug_handler(dev);
    if (hotplug_ctrl) {
        hotplug_handler_pre_plug(hotplug_ctrl, dev, &local_err);
        if (local_err != NULL) {
            goto fail;
        }
    }
    ...
    if (hotplug_ctrl) {
        hotplug_handler_plug(hotplug_ctrl, dev, &local_err);
    }
    ...
}

So this is the point where the device has been created, and where all the callbacks and handlers have been previously registered. Now, remember the function qdev_device_add() from qdev-monitor.c: it creates the new device, but no hotplug handler has been triggered so far. The missing piece comes from the last bit of the function qdev_device_add():

DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
{
    ...
    /* create device */
    dev = DEVICE(object_new(driver));
    ...
    object_property_set_bool(OBJECT(dev), true, "realized", &err);
    if (err != NULL) {
        dev->opts = NULL;
        goto err_del_dev;
    }
    ...

After the device and its parents have been created and properly instantiated, the property realized is set to true, triggering the entire hotplug chain and calling the hotplug callbacks registered by the virt machine type. Let's look in detail at plug, which calls into virt_machine_device_plug_cb():

static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
                                        DeviceState *dev, Error **errp)
{
    if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
        virt_cpu_plug(hotplug_dev, dev, errp);
    } else if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
        virt_dimm_plug(hotplug_dev, dev, errp);
    } else {
        error_setg(errp, "virt: device plug for unsupported device"
                   " type: %s", object_get_typename(OBJECT(dev)));
    }
}

Depending on the type of device being plugged here, and because those callbacks are generic and can be used both for CPU and memory, different functions might be triggered. In case of CPU, the type TYPE_CPU is the one being used, which calls into virt_cpu_plug(). Most of what this function does is to call into the plug callback defined by its ACPI implementation in hw/i386/virt/acpi.c:

static void virt_acpi_class_init(ObjectClass *class, void *data)
{
    ...
    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(class);
    ...
    hc->plug = virt_device_plug_cb;
    hc->unplug_request = virt_device_unplug_request_cb;
    hc->unplug = virt_device_unplug_cb;
    ...
}

hw/i386/virt/acpi.c: virt_device_plug_cb() -> acpi_cpu_plug_cb() -> acpi_send_event()
hw/acpi/acpi_interface.c: acpi_send_event() -> send_event()

send_event() refers to another callback that has been registered earlier with the machine type and its ACPI implementation:

static void virt_acpi_class_init(ObjectClass *class, void *data)
{
    ...
    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_CLASS(class);
    ...
    adevc->send_event = virt_send_ged;
    ...
}

Eventually, this is going to call the function virt_send_ged() responsible for sending an interrupt to the guest OS using GED events defined in the DSDT table with the GED object:

static void virt_send_ged(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
{
    VirtAcpiState *s = VIRT_ACPI(adev);

    if (ev & ACPI_CPU_HOTPLUG_STATUS) {
        /* We inject the CPU hotplug interrupt */
        qemu_irq_pulse(s->gsi[VIRT_GED_CPU_HOTPLUG_IRQ]);
    } else if (ev & ACPI_MEMORY_HOTPLUG_STATUS) {
        /* We inject the memory hotplug interrupt */
        qemu_irq_pulse(s->gsi[VIRT_GED_MEMORY_HOTPLUG_IRQ]);
    } else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
        qemu_irq_pulse(s->gsi[VIRT_GED_NVDIMM_HOTPLUG_IRQ]);
    } else if (ev & ACPI_PCI_HOTPLUG_STATUS) {
        /* Inject PCI hotplug interrupt */
        qemu_irq_pulse(s->gsi[VIRT_GED_PCI_HOTPLUG_IRQ]);
    }
}

From ACPI

In case of CPU, the ACPI table that matters is the DSDT table.

The starting point here will be the definition of the GED object. GED stands for Generic Event Device, and describes all interrupts associated with event generation. When an interrupt is asserted, the guest OS will execute the event method _EVT declared in the GED object:

    Device (\_SB.GED)
    {
        Name (_HID, "ACPI0013")  // _HID: Hardware ID
        Name (_UID, Zero)  // _UID: Unique ID
        Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
        {
            Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
            {
                0x00000010,
            }
            Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
            {
                0x00000011,
            }
            Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
            {
                0x00000013,
            }
            Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
            {
                0x00000012,
            }
        })
        Method (_EVT, 1, Serialized)  // _EVT: Event
        {
            Local0 = One
            While ((Local0 == One))
            {
                Local0 = Zero
                If ((Arg0 == 0x10))
                {
                    \_SB.CPUS.CSCN ()
                }
                ElseIf ((Arg0 == 0x11))
                {
                    \_SB.MHPC.MSCN ()
                }
                ElseIf ((Arg0 == 0x13))
                {
                    Notify (\_SB.NVDR, 0x80) // Status Change
                }
                ElseIf ((Arg0 == 0x12))
                {
                    Acquire (\_SB.PCI0.BLCK, 0xFFFF)
                    \_SB.PCI0.PCNT ()
                    Release (\_SB.PCI0.BLCK)
                }
            }
        }
    }

After the hypervisor sends the interrupt registered for the GED device, the method \_SB.CPUS.CSCN() is invoked:

    Method (CSCN, 0, Serialized)
    {
	Acquire (\_SB.PRES.CPLK, 0xFFFF)
	Local0 = One
	While ((Local0 == One))
	{
	    Local0 = Zero
	    \_SB.PRES.CCMD = Zero
	    If ((\_SB.PRES.CINS == One))
	    {
		CTFY (\_SB.PRES.CDAT, One)
		\_SB.PRES.CINS = One
		Local0 = One
	    }
	    ElseIf ((\_SB.PRES.CRMV == One))
	    {
		CTFY (\_SB.PRES.CDAT, 0x03)
		\_SB.PRES.CRMV = One
		Local0 = One
	    }
	}

	Release (\_SB.PRES.CPLK)
    }

\_SB.PRES.CPLK lock is acquired.

The Local0 variable is set to 0x1 in order to loop through each CPU.

The loop sets Local0 to 0x0 in order to break if no CPU needs to be inserted or removed.

\_SB.PRES.CCMD is written with the value 0x0, at the base address 0x0cd8 with an offset of 0x05. The value 0x0 translates into the CPHP_GET_NEXT_CPU_WITH_EVENT_CMD command. This triggers the hypervisor to loop over the list of CPUs and set the CPU selector to the first one with an is_inserting or is_removing flag set.

Right after, the \_SB.PRES.CINS value is read from the PRST region at base address 0x0cd8. Because it is bit 1 of the 5th byte, there is a 4-byte offset to reach the byte value. The flag returned applies to the CPU that was selected when CCMD was written with 0x0.

If \_SB.PRES.CINS == 0x1, this means the CPU needs to be inserted, and the CTFY method is invoked with \_SB.PRES.CDAT as Arg0 and 0x1 as Arg1. The value of CDAT is read from PRST, and returns the value of the CPU selector previously set by the write of 0x0 to CCMD:

    Method (CTFY, 2, NotSerialized)
    {
	If ((Arg0 == Zero))
	{
	    Notify (C000, Arg1)
	}

	If ((Arg0 == One))
	{
	    Notify (C001, Arg1)
	}

	If ((Arg0 == 0x02))
	{
	    Notify (C002, Arg1)
	}

	If ((Arg0 == 0x03))
	{
	    Notify (C003, Arg1)
	}
    }

Note: The number of CPUs listed here is the maximum number of processors that can be supported by the VM. This number differs from the number of processors enabled when the VM is started.

CTFY will notify the guest OS with the value of the appropriate CPU device as Arg0 and 0x1 as Arg1. Here is an example of CPU device definition in case Arg0 is C000:

    Device (C000)
    {
	Name (_HID, "ACPI0007" /* Processor Device */)  // _HID: Hardware ID
	Name (_UID, Zero)  // _UID: Unique ID
	Method (_STA, 0, Serialized)  // _STA: Status
	{
	    Return (CSTA (Zero))
	}

	Name (_MAT, Buffer (0x08)  // _MAT: Multiple APIC Table Entry
	{
	     0x00, 0x08, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00   /* ........ */
	})
	Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device
	{
	    CEJ0 (Zero)
	}

	Method (_OST, 3, Serialized)  // _OST: OSPM Status Indication
	{
	    COST (Zero, Arg0, Arg1, Arg2)
	}

	Name (_PXM, Zero)  // _PXM: Device Proximity
    }

When the guest OS receives the notification from the hypervisor, it will insert the new CPU corresponding to the data provided through the Notify() method.

\_SB.PRES.CINS is written with the value 0x1, which will be handled by the hypervisor and will end up clearing the flag is_inserting, since it knows the CPU has been added at this point:

    if (data & 2) { /* clear insert event */
        cdev->is_inserting = false;
        trace_cpuhp_acpi_clear_inserting_evt(cpu_st->selector);
    }

And the Local0 variable is set to 0x1 to ensure the loop does not get broken, as there might be another CPU to add or remove.

If the current CPU pointed to by the CPU selector didn't need to be added, the code will try to determine whether it needs to be removed instead. If \_SB.PRES.CRMV == 0x1, this means the CPU needs to be removed, and the CTFY method is invoked with \_SB.PRES.CDAT as Arg0 and 0x3 as Arg1. The value of CDAT is read from PRST, and returns the value of the CPU selector (same as before).

CTFY will notify the guest OS with the value of the appropriate CPU device as Arg0 and 0x3 as Arg1. When the guest OS receives the notification from the hypervisor, it will remove the CPU corresponding to the data provided through the Notify() method. In addition to what is done in the insert code path, it will also call into the _EJ0 method defined by the CPU device (C000 in this case). This will be detailed as part of the guest OS hotplug flow.

\_SB.PRES.CRMV is written with the value 0x1, which will be handled by the hypervisor and will end up clearing the flag is_removing, since it knows the CPU has been removed at this point:

    else if (data & 4) { /* clear remove event */
        cdev->is_removing = false;
        trace_cpuhp_acpi_clear_remove_evt(cpu_st->selector);
    }

And the Local0 variable is set to 0x1 to ensure the loop does not get broken, as there might be another CPU to add or remove.

If at this point in the code, Local0 is 0x0, the loop will be broken, as no other CPU needs to be added or removed.

\_SB.PRES.CPLK lock is released.
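Putting the steps above together, here is a hedged end-to-end model of the CSCN handshake (all names are illustrative assumptions; in reality the logic is split between the DSDT's CSCN method on the guest side and cpu_hotplug_wr() on the hypervisor side):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 4

/* Shared state the two sides communicate through */
struct hp {
    uint32_t selector;
    bool is_inserting[NCPUS];
    bool is_removing[NCPUS];
};

/* Hypervisor side, CCMD = 0: round-robin to the next CPU with a
 * pending event, leaving the selector unchanged when there is none. */
static void hv_write_ccmd_next_event(struct hp *s)
{
    uint32_t iter = s->selector;
    do {
        if (s->is_inserting[iter] || s->is_removing[iter]) {
            s->selector = iter;
            return;
        }
        iter = iter + 1 < NCPUS ? iter + 1 : 0;
    } while (iter != s->selector);
}

/* Guest side: replays the CSCN loop, returning the number of
 * hotplug events handled. */
static int guest_cscn(struct hp *s)
{
    int handled = 0;
    bool again = true;            /* Local0 */

    while (again) {
        again = false;
        hv_write_ccmd_next_event(s);          /* \_SB.PRES.CCMD = Zero */
        if (s->is_inserting[s->selector]) {   /* \_SB.PRES.CINS == One */
            /* CTFY(CDAT, 1), i.e. Notify(Cxxx, 1), would run here */
            s->is_inserting[s->selector] = false; /* CINS = One clears it */
            handled++;
            again = true;
        } else if (s->is_removing[s->selector]) { /* \_SB.PRES.CRMV == One */
            /* CTFY(CDAT, 3) and eventually _EJ0 would run here */
            s->is_removing[s->selector] = false;  /* CRMV = One clears it */
            handled++;
            again = true;
        }
    }
    return handled;
}
```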

From guest OS (OSPM)

The guest OS, referred to as OSPM (Operating System-directed configuration and Power Management) by the ACPI specification, is the operating system consuming the tables provided by the firmware and running on top of the hardware they describe. All the code mentioned in this section belongs to the implementation of the different ACPI drivers coming from the Linux kernel sources under drivers/acpi.

Even before any hotplug event is actually being triggered, the Linux kernel will parse every single ACPI table, including the one we're interested in here, the DSDT table. Part of this table is the description of a generic device taking care of CPU hotplug:

    Device (\_SB.CPUS)
    {
        Name (_HID, "ACPI0010")  // _HID: Hardware ID
        Name (_CID, EisaId ("PNP0A05") /* Generic Container Device */)  // _CID: Compatible ID
        ...
    }

The ACPI processor driver gets initialized unconditionally by the kernel, as long as ACPI support is built in. This driver, among other things, will register the processor handler that will be called anytime a CPU is being added or removed:

bus.c: acpi_init() --> acpi_scan_init()
scan.c: acpi_scan_init() --> acpi_processor_init()

acpi_processor.c: acpi_processor_init() registers the processor hotplug handler through acpi_scan_add_handler_with_hotplug(), passing the structure processor_handler:

void __init acpi_processor_init(void)
{
	acpi_processor_check_duplicates();
	acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
	acpi_scan_add_handler(&processor_container_handler);
}

The structure processor_handler defines two handlers, attach and detach, for CPU insertion and removal:

static struct acpi_scan_handler processor_handler = {
	.ids = processor_device_ids,
	.attach = acpi_processor_add,
#ifdef CONFIG_ACPI_HOTPLUG_CPU
	.detach = acpi_processor_remove,
#endif
	.hotplug = {
		.enabled = true,
	},
};

Prior to that, acpi_init() also initialized a generic notification handler by calling acpi_bus_init():

bus.c: acpi_bus_init() --> acpi_install_notify_handler()

This provides acpi_bus_notify() as the generic callback to be triggered in case of a notification.

As a side note, the Notify() method is triggered from the guest OS. Because the guest OS is the one evaluating, hence executing, the code defined in the ACPI tables, it is the one responsible for calling into this method. This simply means it notifies itself in order to reach the specific driver which previously registered handlers for those notifications.

Whenever the guest OS receives a notification following the execution of a Notify() method from the DSDT, the following flow applies:

Because the CPU hotplug device is a generic device, it will trigger the handling of the notification through the generic handler acpi_bus_notify():

bus.c: acpi_bus_notify() --> acpi_hotplug_schedule()
osl.c: acpi_hotplug_schedule() --> acpi_hotplug_work_fn()
scan.c: acpi_hotplug_work_fn() --> acpi_device_hotplug() --> acpi_generic_hotplug_event()

acpi_generic_hotplug_event() gets invoked with the type of notification passed as argument.

CPU insertion

In case of CPU insertion (ACPI_NOTIFY_DEVICE_CHECK type):

case ACPI_NOTIFY_DEVICE_CHECK:
	return acpi_scan_device_check(adev);

scan.c: acpi_scan_device_check() --> acpi_bus_scan() --> acpi_bus_attach() --> acpi_scan_attach_handler() --> handler->attach(device, devid)

As described by the flow above, the notification will end up calling into every attach() handler registered for the current device. In case of processor_handler previously registered, the following callback is going to be invoked:

static int acpi_processor_add(struct acpi_device *device, const struct acpi_device_id *id)

This function will add the CPU to the guest OS, retrieving _UID and other information about the current CPU from the MADT table. It also looks for a slot ID defined by _SUN, but this is not mandatory and is not defined in the case of NEMU.

CPU removal

In case of CPU removal (ACPI_NOTIFY_EJECT_REQUEST type):

case ACPI_NOTIFY_EJECT_REQUEST:
	if (adev->handler && !adev->handler->hotplug.enabled) {
		dev_info(&adev->dev, "Eject disabled\n");
		return -EPERM;
	}
	acpi_evaluate_ost(adev->handle, ACPI_NOTIFY_EJECT_REQUEST,
			  ACPI_OST_SC_EJECT_IN_PROGRESS, NULL);
	return acpi_scan_hot_remove(adev);

The call to acpi_evaluate_ost() is the first important step. The guest OS sends a message to the hypervisor by evaluating the _OST method of the CPU device. The message notifies the hypervisor that the ejection of a specific CPU is in progress.

    Device (C000)
    {
        Method (_OST, 3, Serialized)  // _OST: OSPM Status Indication
        {
            COST (Zero, Arg0, Arg1, Arg2)
        }
        ...

This method will call into a DSDT internal function COST, taking all the parameters from _OST in addition to the CPU selector as Arg0:

    Method (COST, 4, Serialized)
    {
	Acquire (\_SB.PRES.CPLK, 0xFFFF)
	\_SB.PRES.CSEL = Arg0
	\_SB.PRES.CCMD = One
	\_SB.PRES.CDAT = Arg1
	\_SB.PRES.CCMD = 0x02
	\_SB.PRES.CDAT = Arg2
	Release (\_SB.PRES.CPLK)
    }

\_SB.PRES.CPLK lock is acquired.

\_SB.PRES.CSEL is written with Arg0 (0x0 in the example of device C000) to set the CPU selector index.

\_SB.PRES.CCMD is written with 0x1, the CPHP_OST_EVENT_CMD command, which determines how the next data write will be handled.

\_SB.PRES.CDAT is written with the _OST event type coming from the data being passed by Arg1.

case CPHP_OST_EVENT_CMD: {
    cdev = &cpu_st->devs[cpu_st->selector];
    cdev->ost_event = data;

\_SB.PRES.CCMD is written with 0x2, the CPHP_OST_STATUS_CMD command, which determines how the next data write will be handled.

\_SB.PRES.CDAT is written with the _OST status coming from the data being passed by Arg2.

case CPHP_OST_STATUS_CMD: {
    cdev = &cpu_st->devs[cpu_st->selector];
    cdev->ost_status = data;
    info = acpi_cpu_device_status(cpu_st->selector, cdev);

\_SB.PRES.CPLK lock is released.

Here is the second important step: the actual ejection of the CPU device.

scan.c: acpi_scan_hot_remove() --> acpi_bus_trim() --> handler->detach(adev)

As described by the flow above, the notification ends up calling into every detach() handler registered for the current device. Since processor_handler was previously registered, the following callback is invoked:

static void acpi_processor_remove(struct acpi_device *device)
{
	...
	/*
	 * The only reason why we ever get here is CPU hot-removal.  The CPU is
	 * already offline and the ACPI device removal locking prevents it from
	 * being put back online at this point.
	 *
	 * Unbind the driver from the processor device and detach it from the
	 * ACPI companion object.
	 */
	device_release_driver(pr->dev);
	acpi_unbind_one(pr->dev);

	/* Clean up. */
	per_cpu(processor_device_array, pr->id) = NULL;
	per_cpu(processors, pr->id) = NULL;

	cpu_maps_update_begin();
	cpu_hotplug_begin();

	/* Remove the CPU. */
	arch_unregister_cpu(pr->id);
	acpi_unmap_cpu(pr->id);

	cpu_hotplug_done();
	cpu_maps_update_done();

	try_offline_node(cpu_to_node(pr->id));
	...
}

This function will take care of removing the CPU from the guest OS by releasing the related structures.

Now, getting back to acpi_scan_hot_remove(), when acpi_bus_trim() returns, a few ACPI methods are invoked to complete the removal notification.

The guest OS will try to evaluate the _LCK method if present. In our case, NEMU does not define such a method, meaning the guest has no way to lock or unlock the device.

Then, _EJ0 method is evaluated:

acpi_status acpi_evaluate_ej0(acpi_handle handle)
{
	acpi_status status;

	status = acpi_execute_simple_method(handle, "_EJ0", 1);
	if (status == AE_NOT_FOUND)
		acpi_handle_warn(handle, "No _EJ0 support for device\n");
	else if (ACPI_FAILURE(status))
		acpi_handle_warn(handle, "Eject failed (0x%x)\n", status);

	return status;
}
    Device (C000)
    {
        ...
        Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device
        {
            CEJ0 (Zero)
        }
        ...

_EJ0 calls into the internal CEJ0 method:

    Method (CEJ0, 1, Serialized)
    {
        Acquire (\_SB.PRES.CPLK, 0xFFFF)
        \_SB.PRES.CSEL = Arg0
        \_SB.PRES.CEJ0 = One
        Release (\_SB.PRES.CPLK)
    }

\_SB.PRES.CPLK lock is acquired.

\_SB.PRES.CSEL is written with the CPU index corresponding to the current device. This updates the CPU selector value on the hypervisor side:

case ACPI_CPU_SELECTOR_OFFSET_WR: /* current CPU selector */
    cpu_st->selector = data;

\_SB.PRES.CEJ0 is written with the value 0x1. This means the hypervisor will complete the removal of this CPU by calling into the appropriate unplug handler, using hotplug_handler_unplug():

case ACPI_CPU_FLAGS_OFFSET_RW: /* set is_* fields  */
    cdev = &cpu_st->devs[cpu_st->selector];
    if (data & 2) { /* clear insert event */
        ...
    } else if (data & 8) {
        ...
        dev = DEVICE(cdev->cpu);
        hotplug_ctrl = qdev_get_hotplug_handler(dev);
        hotplug_handler_unplug(hotplug_ctrl, dev, NULL);    
    }

\_SB.PRES.CPLK lock is released.

Once _EJ0 successfully returns, the guest OS needs to check the device status. It evaluates the _STA method associated with the current CPU. This check verifies the state of the CPU, and a warning is logged if it is not in the expected state:

static int acpi_scan_hot_remove(struct acpi_device *device)
{
	...
	/*
	 * Verify if eject was indeed successful.  If not, log an error
	 * message.  No need to call _OST since _EJ0 call was made OK.
	 */
	status = acpi_evaluate_integer(handle, "_STA", NULL, &sta);
	if (ACPI_FAILURE(status)) {
		acpi_handle_warn(handle,
			"Status check after eject failed (0x%x)\n", status);
	} else if (sta & ACPI_STA_DEVICE_ENABLED) {
		acpi_handle_warn(handle,
			"Eject incomplete - status 0x%llx\n", sta);
	}
	...
}
    Device (C000)
    {
	...
	Method (_STA, 0, Serialized)  // _STA: Status
	{
	    Return (CSTA (Zero))
	}
	...

_STA calls into the internal CSTA method:

    Method (CSTA, 1, Serialized)
    {
	Acquire (\_SB.PRES.CPLK, 0xFFFF)
	\_SB.PRES.CSEL = Arg0
	Local0 = Zero
	If ((\_SB.PRES.CPEN == One))
	{
	    Local0 = 0x0F
	}

	Release (\_SB.PRES.CPLK)
	Return (Local0)
    }

\_SB.PRES.CPLK lock is acquired.

\_SB.PRES.CSEL is written with the CPU index corresponding to the current device. This updates the CPU selector value on the hypervisor side:

case ACPI_CPU_SELECTOR_OFFSET_WR: /* current CPU selector */
    cpu_st->selector = data;

The Local0 variable is initialized to 0.

The \_SB.PRES.CPEN value is read from the hypervisor, indicating whether the CPU is enabled:

case ACPI_CPU_FLAGS_OFFSET_RW: /* pack and return is_* fields */
    val |= cdev->cpu ? 1 : 0;

If the CPU is still enabled, Local0 is set to 0xF, which means all of the following flags are set:

include/acpi/actypes.h

/* Flags for _STA method */

#define ACPI_STA_DEVICE_PRESENT         0x01
#define ACPI_STA_DEVICE_ENABLED         0x02
#define ACPI_STA_DEVICE_UI              0x04
#define ACPI_STA_DEVICE_FUNCTIONING     0x08
#define ACPI_STA_DEVICE_OK              0x08	/* Synonym */

\_SB.PRES.CPLK lock is released.

The Local0 variable, representing the device status, is returned to the guest OS.

Error handling

Once the insertion or the removal of a CPU is done, the status is sent from the guest OS to the hypervisor. The error value returned by the hotplug action is translated into an OST status code:

void acpi_device_hotplug(struct acpi_device *adev, u32 src)
{
	...
	switch (error) {
	case 0:
		ost_code = ACPI_OST_SC_SUCCESS;
		break;
	case -EPERM:
		ost_code = ACPI_OST_SC_EJECT_NOT_SUPPORTED;
		break;
	case -EBUSY:
		ost_code = ACPI_OST_SC_DEVICE_BUSY;
		break;
	default:
		ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
		break;
	}
err_out:
	acpi_evaluate_ost(adev->handle, src, ost_code, NULL);
	...
}

As before, evaluating the _OST method invokes the internal COST method, notifying the hypervisor about the status of the hotplug action.
