
Commit 6a447b0

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

  ARM:
   - PSCI relay at EL2 when "protected KVM" is enabled
   - New exception injection code
   - Simplification of AArch32 system register handling
   - Fix PMU accesses when no PMU is enabled
   - Expose CSV3 on non-Meltdown hosts
   - Cache hierarchy discovery fixes
   - PV steal-time cleanups
   - Allow function pointers at EL2
   - Various host EL2 entry cleanups
   - Simplification of the EL2 vector allocation

  s390:
   - memcg accounting for s390-specific parts of kvm and gmap
   - selftest for diag318
   - new kvm_stat for when async_pf falls back to sync

  x86:
   - Tracepoints for the new pagetable code from 5.10
   - Catch VFIO and KVM irqfd events before userspace
   - Reporting dirty pages to userspace with a ring buffer
   - SEV-ES host support
   - Nested VMX support for wait-for-SIPI activity state
   - New feature flag (AVX512 FP16)
   - New system ioctl to report Hyper-V-compatible paravirtualization features

  Generic:
   - Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: SVM: fix 32-bit compilation
  KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
  KVM: SVM: Provide support to launch and run an SEV-ES guest
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Do not report support for SMM for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Support string IO operations for an SEV-ES guest
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Create trace events for VMGEXIT processing
  ...
2 parents f4a2f78 + d45f89f commit 6a447b0

File tree: 171 files changed, +7242 additions, -2750 deletions


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 10 additions & 0 deletions
@@ -2254,6 +2254,16 @@
 			for all guests.
 			Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
 
+	kvm-arm.mode=
+			[KVM,ARM] Select one of KVM/arm64's modes of operation.
+
+			protected: nVHE-based mode with support for guests whose
+				   state is kept private from the host.
+				   Not valid if the kernel is running in EL2.
+
+			Defaults to VHE/nVHE based on hardware support and
+			the value of CONFIG_ARM64_VHE.
+
 	kvm-arm.vgic_v3_group0_trap=
 			[KVM,ARM] Trap guest accesses to GICv3 group-0
 			system registers
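As a usage illustration of the parameter documented in this hunk (not part of the patch; the surrounding options are placeholders), a host could select the protected mode on its boot command line:

    console=ttyAMA0 root=/dev/vda2 kvm-arm.mode=protected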

Documentation/arm64/memory.rst

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ hypervisor maps kernel pages in EL2 at a fixed (and potentially
 random) offset from the linear mapping. See the kern_hyp_va macro and
 kvm_update_va_mask function for more details. MMIO devices such as
 GICv2 gets mapped next to the HYP idmap page, as do vectors when
-ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.
+ARM64_SPECTRE_V3A is enabled for particular CPUs.
 
 When using KVM with the Virtualization Host Extensions, no additional
 mappings are created, since the host kernel runs directly in EL2.

Documentation/virt/kvm/api.rst

Lines changed: 111 additions & 5 deletions
@@ -262,6 +262,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region. This ioctl returns the size of that region. See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
 
 4.6 KVM_SET_MEMORY_REGION
 -------------------------
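As a rough userspace sketch of the mmap layout described in the hunk above (not part of the patch; error handling is trimmed, vcpu_fd is assumed to come from KVM_CREATE_VCPU, and KVM_DIRTY_LOG_PAGE_OFFSET is the x86 value introduced by this series):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the shared kvm_run structure; it lives at offset 0 of the vCPU fd. */
    static struct kvm_run *map_kvm_run(int kvm_fd, int vcpu_fd)
    {
            long size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);

            if (size < 0)
                    return MAP_FAILED;

            return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        vcpu_fd, 0);
    }

    /* Map the per-vCPU dirty ring, assuming KVM_CAP_DIRTY_LOG_RING was enabled. */
    static void *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
    {
            long page_size = sysconf(_SC_PAGESIZE);

            return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                        vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
    }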
@@ -4455,9 +4467,9 @@ that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
 4.118 KVM_GET_SUPPORTED_HV_CPUID
 --------------------------------
 
-:Capability: KVM_CAP_HYPERV_CPUID
+:Capability: KVM_CAP_HYPERV_CPUID (vcpu), KVM_CAP_SYS_HYPERV_CPUID (system)
 :Architectures: x86
-:Type: vcpu ioctl
+:Type: system ioctl, vcpu ioctl
 :Parameters: struct kvm_cpuid2 (in/out)
 :Returns: 0 on success, -1 on error
 
@@ -4502,9 +4514,6 @@ Currently, the following list of CPUID leaves are returned:
 - HYPERV_CPUID_SYNDBG_INTERFACE
 - HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES
 
-HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
-enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
-
 Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure
 with the 'nent' field indicating the number of entries in the variable-size
 array 'entries'. If the number of entries is too low to describe all Hyper-V
@@ -4515,6 +4524,15 @@ number of valid entries in the 'entries' array, which is then filled.
 'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
 userspace should not expect to get any particular value there.
 
+Note, vcpu version of KVM_GET_SUPPORTED_HV_CPUID is currently deprecated. Unlike
+system ioctl which exposes all supported feature bits unconditionally, vcpu
+version has the following quirks:
+- HYPERV_CPUID_NESTED_FEATURES leaf and HV_X64_ENLIGHTENED_VMCS_RECOMMENDED
+  feature bit are only exposed when Enlightened VMCS was previously enabled
+  on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
+- HV_STIMER_DIRECT_MODE_AVAILABLE bit is only exposed with in-kernel LAPIC
+  (presumes KVM_CREATE_IRQCHIP has already been called).
+
 4.119 KVM_ARM_VCPU_FINALIZE
 ---------------------------
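A minimal userspace sketch of the system-level query described above (not part of the patch; the entry count of 64 is an arbitrary assumption, and a real caller would grow the buffer and retry on E2BIG):

    #include <linux/kvm.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>

    static struct kvm_cpuid2 *get_hv_cpuid(int kvm_fd)
    {
            const int nent = 64;    /* assumed large enough for this sketch */
            struct kvm_cpuid2 *cpuid;

            cpuid = calloc(1, sizeof(*cpuid) +
                              nent * sizeof(struct kvm_cpuid_entry2));
            if (!cpuid)
                    return NULL;

            cpuid->nent = nent;

            /* Needs KVM_CAP_SYS_HYPERV_CPUID; older kernels only take the vCPU ioctl. */
            if (ioctl(kvm_fd, KVM_GET_SUPPORTED_HV_CPUID, cpuid) < 0) {
                    free(cpuid);
                    return NULL;
            }

            return cpuid;   /* cpuid->nent now holds the number of valid entries */
    }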

@@ -6390,3 +6408,91 @@ When enabled, KVM will disable paravirtual features provided to the
 guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
 (0x40000001). Otherwise, a guest may use the paravirtual features
 regardless of what has actually been exposed through the CPUID leaf.
+
+
+8.29 KVM_CAP_DIRTY_LOG_RING
+---------------------------
+
+:Architectures: x86
+:Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+The dirty ring is available to userspace as an array of
+``struct kvm_dirty_gfn``.  Each dirty entry is defined as::
+
+  struct kvm_dirty_gfn {
+          __u32 flags;
+          __u32 slot; /* as_id | slot_id */
+          __u64 offset;
+  };
+
+The following values are defined for the flags field to define the
+current state of the entry::
+
+  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
+  #define KVM_DIRTY_GFN_F_MASK            0x3
+
+Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
+ioctl to enable this capability for the new guest and set the size of
+the rings.  Enabling the capability is only allowed before creating any
+vCPU, and the size of the ring must be a power of two.  The larger the
+ring buffer, the less likely the ring is full and the VM is forced to
+exit to userspace. The optimal size depends on the workload, but it is
+recommended that it be at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+An entry in the ring buffer can be unused (flag bits ``00``),
+dirty (flag bits ``01``) or harvested (flag bits ``1X``).  The
+state machine for the entry is as follows::
+
+          dirtied         harvested        reset
+     00 -----------> 01 -------------> 1X -------+
+      ^                                          |
+      |                                          |
+      +------------------------------------------+
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs.  If the flags field has the DIRTY bit set (at this
+stage the RESET bit must be cleared), then it means this GFN is a dirty GFN.
+The userspace should harvest this GFN and mark the flags from state
+``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
+to show that this GFN is harvested and waiting for a reset), and move
+on to the next GFN.  The userspace should continue to do this until the
+flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
+all the dirty GFNs that were available.
+
+It's not necessary for userspace to harvest all the dirty GFNs at once.
+However it must collect the dirty GFNs in sequence, i.e., the userspace
+program cannot skip one dirty GFN to collect the one next to it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring can get full.  When it happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
+
+The dirty ring interface has a major difference compared to the
+KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
+userspace, it's still possible that the kernel has not yet flushed the
+processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
+flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
+needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
+vmexit ensures that all dirty GFNs are flushed to the dirty rings.
+
+NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
+ioctl KVM_RESET_DIRTY_RINGS are mutually exclusive with the existing ioctls
+KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
+KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
+machine will switch to ring-buffer dirty page tracking and further
+KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
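To make the harvesting protocol above concrete, here is a rough userspace sketch (not part of the patch): the ring pointer is assumed to come from mmap-ing KVM_DIRTY_LOG_PAGE_OFFSET of the vCPU fd, ring_entries from the size passed to KVM_ENABLE_CAP, and the fetch index is bookkeeping invented for this example:

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Next entry to examine; persistent across calls (example-only state). */
    static uint32_t fetch_index;

    static int harvest_dirty_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                                  uint32_t ring_entries)
    {
            int harvested = 0;

            for (;;) {
                    struct kvm_dirty_gfn *e = &ring[fetch_index % ring_entries];

                    /* Stop at the first entry that is not in the dirty (01) state. */
                    if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
                            break;

                    /* ... record e->slot (as_id | slot_id) and e->offset here ... */

                    /* Mark the entry harvested (1X) so the kernel may reset it. */
                    e->flags |= KVM_DIRTY_GFN_F_RESET;
                    fetch_index++;
                    harvested++;
            }

            if (!harvested)
                    return 0;

            /*
             * Let the kernel reprotect the collected GFNs; per the text above,
             * this must happen before the contents of the dirty pages are read.
             */
            return ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
    }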

Documentation/virt/kvm/arm/pvtime.rst

Lines changed: 2 additions & 2 deletions
@@ -19,8 +19,8 @@ Two new SMCCC compatible hypercalls are defined:
 
 These are only available in the SMC64/HVC64 calling convention as
 paravirtualized time is not available to 32 bit Arm guests. The existence of
-the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES
-mechanism before calling it.
+the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
+ARCH_FEATURES mechanism before calling it.
 
 PV_TIME_FEATURES
 ============= ======== ==========
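For context, a guest-side (kernel) probe along the lines described above might look like the following sketch, assuming the SMCCC 1.1 helpers and the ARM_SMCCC_HV_PV_TIME_* constants from include/linux/arm-smccc.h; this is an illustration, not code added by this commit:

    #include <linux/arm-smccc.h>
    #include <linux/types.h>

    static bool has_pv_steal_clock(void)
    {
            struct arm_smccc_res res;

            /* Probe PV_TIME_FEATURES through SMCCC 1.1 ARCH_FEATURES first. */
            arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
                                 ARM_SMCCC_HV_PV_TIME_FEATURES, &res);
            if (res.a0 != SMCCC_RET_SUCCESS)
                    return false;

            /* Then ask PV_TIME_FEATURES whether stolen time (PV_TIME_ST) exists. */
            arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_FEATURES,
                                 ARM_SMCCC_HV_PV_TIME_ST, &res);

            return res.a0 == SMCCC_RET_SUCCESS;
    }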

arch/arm64/include/asm/cpucaps.h

Lines changed: 3 additions & 2 deletions
@@ -19,7 +19,7 @@
 #define ARM64_HAS_VIRT_HOST_EXTN		11
 #define ARM64_WORKAROUND_CAVIUM_27456		12
 #define ARM64_HAS_32BIT_EL0			13
-#define ARM64_HARDEN_EL2_VECTORS		14
+#define ARM64_SPECTRE_V3A			14
 #define ARM64_HAS_CNP				15
 #define ARM64_HAS_NO_FPSIMD			16
 #define ARM64_WORKAROUND_REPEAT_TLBI		17
@@ -65,7 +65,8 @@
 #define ARM64_MTE				57
 #define ARM64_WORKAROUND_1508412		58
 #define ARM64_HAS_LDAPR			59
+#define ARM64_KVM_PROTECTED_MODE		60
 
-#define ARM64_NCAPS				60
+#define ARM64_NCAPS				61
 
 #endif /* __ASM_CPUCAPS_H */

arch/arm64/include/asm/cpufeature.h

Lines changed: 5 additions & 0 deletions
@@ -705,6 +705,11 @@ static inline bool system_supports_generic_auth(void)
 	       cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH);
 }
 
+static inline bool system_has_full_ptr_auth(void)
+{
+	return system_supports_address_auth() && system_supports_generic_auth();
+}
+
 static __always_inline bool system_uses_irq_prio_masking(void)
 {
 	return IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) &&

arch/arm64/include/asm/el2_setup.h

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012,2013 - ARM Ltd
+ * Author: Marc Zyngier <[email protected]>
+ */
+
+#ifndef __ARM_KVM_INIT_H__
+#define __ARM_KVM_INIT_H__
+
+#ifndef __ASSEMBLY__
+#error Assembly-only header
+#endif
+
+#include <asm/kvm_arm.h>
+#include <asm/ptrace.h>
+#include <asm/sysreg.h>
+#include <linux/irqchip/arm-gic-v3.h>
+
+.macro __init_el2_sctlr
+	mov_q	x0, INIT_SCTLR_EL2_MMU_OFF
+	msr	sctlr_el2, x0
+	isb
+.endm
+
+/*
+ * Allow Non-secure EL1 and EL0 to access physical timer and counter.
+ * This is not necessary for VHE, since the host kernel runs in EL2,
+ * and EL0 accesses are configured in the later stage of boot process.
+ * Note that when HCR_EL2.E2H == 1, CNTHCTL_EL2 has the same bit layout
+ * as CNTKCTL_EL1, and CNTKCTL_EL1 accessing instructions are redefined
+ * to access CNTHCTL_EL2. This allows the kernel designed to run at EL1
+ * to transparently mess with the EL0 bits via CNTKCTL_EL1 access in
+ * EL2.
+ */
+.macro __init_el2_timers mode
+.ifeqs "\mode", "nvhe"
+	mrs	x0, cnthctl_el2
+	orr	x0, x0, #3			// Enable EL1 physical timers
+	msr	cnthctl_el2, x0
+.endif
+	msr	cntvoff_el2, xzr		// Clear virtual offset
+.endm
+
+.macro __init_el2_debug mode
+	mrs	x1, id_aa64dfr0_el1
+	sbfx	x0, x1, #ID_AA64DFR0_PMUVER_SHIFT, #4
+	cmp	x0, #1
+	b.lt	1f				// Skip if no PMU present
+	mrs	x0, pmcr_el0			// Disable debug access traps
+	ubfx	x0, x0, #11, #5			// to EL2 and allow access to
+1:
+	csel	x2, xzr, x0, lt			// all PMU counters from EL1
+
+	/* Statistical profiling */
+	ubfx	x0, x1, #ID_AA64DFR0_PMSVER_SHIFT, #4
+	cbz	x0, 3f				// Skip if SPE not present
+
+.ifeqs "\mode", "nvhe"
+	mrs_s	x0, SYS_PMBIDR_EL1		// If SPE available at EL2,
+	and	x0, x0, #(1 << SYS_PMBIDR_EL1_P_SHIFT)
+	cbnz	x0, 2f				// then permit sampling of physical
+	mov	x0, #(1 << SYS_PMSCR_EL2_PCT_SHIFT | \
+		      1 << SYS_PMSCR_EL2_PA_SHIFT)
+	msr_s	SYS_PMSCR_EL2, x0		// addresses and physical counter
+2:
+	mov	x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
+	orr	x2, x2, x0			// If we don't have VHE, then
+						// use EL1&0 translation.
+.else
+	orr	x2, x2, #MDCR_EL2_TPMS		// For VHE, use EL2 translation
+						// and disable access from EL1
+.endif
+
+3:
+	msr	mdcr_el2, x2			// Configure debug traps
+.endm
+
+/* LORegions */
+.macro __init_el2_lor
+	mrs	x1, id_aa64mmfr1_el1
+	ubfx	x0, x1, #ID_AA64MMFR1_LOR_SHIFT, 4
+	cbz	x0, 1f
+	msr_s	SYS_LORC_EL1, xzr
+1:
+.endm
+
+/* Stage-2 translation */
+.macro __init_el2_stage2
+	msr	vttbr_el2, xzr
+.endm
+
+/* GICv3 system register access */
+.macro __init_el2_gicv3
+	mrs	x0, id_aa64pfr0_el1
+	ubfx	x0, x0, #ID_AA64PFR0_GIC_SHIFT, #4
+	cbz	x0, 1f
+
+	mrs_s	x0, SYS_ICC_SRE_EL2
+	orr	x0, x0, #ICC_SRE_EL2_SRE	// Set ICC_SRE_EL2.SRE==1
+	orr	x0, x0, #ICC_SRE_EL2_ENABLE	// Set ICC_SRE_EL2.Enable==1
+	msr_s	SYS_ICC_SRE_EL2, x0
+	isb					// Make sure SRE is now set
+	mrs_s	x0, SYS_ICC_SRE_EL2		// Read SRE back,
+	tbz	x0, #0, 1f			// and check that it sticks
+	msr_s	SYS_ICH_HCR_EL2, xzr		// Reset ICC_HCR_EL2 to defaults
+1:
+.endm
+
+.macro __init_el2_hstr
+	msr	hstr_el2, xzr			// Disable CP15 traps to EL2
+.endm
+
+/* Virtual CPU ID registers */
+.macro __init_el2_nvhe_idregs
+	mrs	x0, midr_el1
+	mrs	x1, mpidr_el1
+	msr	vpidr_el2, x0
+	msr	vmpidr_el2, x1
+.endm
+
+/* Coprocessor traps */
+.macro __init_el2_nvhe_cptr
+	mov	x0, #0x33ff
+	msr	cptr_el2, x0			// Disable copro. traps to EL2
+.endm
+
+/* SVE register access */
+.macro __init_el2_nvhe_sve
+	mrs	x1, id_aa64pfr0_el1
+	ubfx	x1, x1, #ID_AA64PFR0_SVE_SHIFT, #4
+	cbz	x1, 1f
+
+	bic	x0, x0, #CPTR_EL2_TZ		// Also disable SVE traps
+	msr	cptr_el2, x0			// Disable copro. traps to EL2
+	isb
+	mov	x1, #ZCR_ELx_LEN_MASK		// SVE: Enable full vector
+	msr_s	SYS_ZCR_EL2, x1			// length for EL1.
+1:
+.endm
+
+.macro __init_el2_nvhe_prepare_eret
+	mov	x0, #INIT_PSTATE_EL1
+	msr	spsr_el2, x0
+.endm
+
+/**
+ * Initialize EL2 registers to sane values. This should be called early on all
+ * cores that were booted in EL2.
+ *
+ * Regs: x0, x1 and x2 are clobbered.
+ */
+.macro init_el2_state mode
+.ifnes "\mode", "vhe"
+.ifnes "\mode", "nvhe"
+.error "Invalid 'mode' argument"
+.endif
+.endif
+
+	__init_el2_sctlr
+	__init_el2_timers \mode
+	__init_el2_debug \mode
+	__init_el2_lor
+	__init_el2_stage2
+	__init_el2_gicv3
+	__init_el2_hstr
+
+	/*
+	 * When VHE is not in use, early init of EL2 needs to be done here.
+	 * When VHE _is_ in use, EL1 will not be used in the host and
+	 * requires no configuration, and all non-hyp-specific EL2 setup
+	 * will be done via the _EL1 system register aliases in __cpu_setup.
+	 */
+.ifeqs "\mode", "nvhe"
+	__init_el2_nvhe_idregs
+	__init_el2_nvhe_cptr
+	__init_el2_nvhe_sve
+	__init_el2_nvhe_prepare_eret
+.endif
+.endm
+
+#endif /* __ARM_KVM_INIT_H__ */
