Count the pages used by KVM mmu on x86 in memory stats under secondary
pagetable stats (e.g. "SecPageTables" in /proc/meminfo) to give better
visibility into the memory consumption of KVM mmu in a similar way to
how normal user page tables are accounted.
Add the inner helper in common KVM, ARM will also use it to count stats
in a future commit.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org> # generic KVM changes
Link: https://lore.kernel.org/r/20220823004639.2387269-3-yosryahmed@google.com
Link: https://lore.kernel.org/r/20220823004639.2387269-4-yosryahmed@google.com
[sean: squash x86 usage to workaround modpost issues]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Bug: 222044477
(cherry picked from commit 43a063cab325ee7cc50349967e536b3cd4e57f03)
[vdonnefort@: Fix conflicts in mmu.c and tdp_mmu.c]
Change-Id: I9b81155758e513504a87ea2d634f341652ed0630
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Changes in 5.15.76
r8152: add PID for the Lenovo OneLink+ Dock
arm64/mm: Consolidate TCR_EL1 fields
usb: gadget: uvc: consistently use define for headerlen
usb: gadget: uvc: use on returned header len in video_encode_isoc_sg
usb: gadget: uvc: rework uvcg_queue_next_buffer to uvcg_complete_buffer
usb: gadget: uvc: giveback vb2 buffer on req complete
usb: gadget: uvc: improve sg exit condition
arm64: errata: Remove AES hwcap for COMPAT tasks
perf/x86/intel/pt: Relax address filter validation
btrfs: enhance unsupported compat RO flags handling
ocfs2: clear dinode links count in case of error
ocfs2: fix BUG when iput after ocfs2_mknod fails
selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context()
cpufreq: qcom: fix writes in read-only memory region
i2c: qcom-cci: Fix ordering of pm_runtime_xx and i2c_add_adapter
x86/microcode/AMD: Apply the patch early on every logical thread
hwmon/coretemp: Handle large core ID value
ata: ahci-imx: Fix MODULE_ALIAS
ata: ahci: Match EM_MAX_SLOTS with SATA_PMP_MAX_PORTS
x86/resctrl: Fix min_cbm_bits for AMD
cpufreq: qcom: fix memory leak in error path
drm/amdgpu: fix sdma doorbell init ordering on APUs
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
kvm: Add support for arch compat vm ioctls
KVM: arm64: vgic: Fix exit condition in scan_its_table()
media: ipu3-imgu: Fix NULL pointer dereference in active selection access
media: mceusb: set timeout to at least timeout provided
media: venus: dec: Handle the case where find_format fails
x86/topology: Fix multiple packages shown on a single-package system
x86/topology: Fix duplicated core ID within a package
btrfs: fix processing of delayed data refs during backref walking
btrfs: fix processing of delayed tree block refs during backref walking
drm/vc4: Add module dependency on hdmi-codec
ACPI: extlog: Handle multiple records
tipc: Fix recognition of trial period
tipc: fix an information leak in tipc_topsrv_kern_subscr
i40e: Fix DMA mappings leak
HID: magicmouse: Do not set BTN_MOUSE on double report
sfc: Change VF mac via PF as first preference if available.
net/atm: fix proc_mpc_write incorrect return value
net: phy: dp83867: Extend RX strap quirk for SGMII mode
net: phylink: add mac_managed_pm in phylink_config structure
scsi: lpfc: Fix memory leak in lpfc_create_port()
udp: Update reuse->has_conns under reuseport_lock.
cifs: Fix xid leak in cifs_create()
cifs: Fix xid leak in cifs_copy_file_range()
cifs: Fix xid leak in cifs_flock()
cifs: Fix xid leak in cifs_ses_add_channel()
dm: remove unnecessary assignment statement in alloc_dev()
net: hsr: avoid possible NULL deref in skb_clone()
ionic: catch NULL pointer issue on reconfig
netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements
nvme-hwmon: consistently ignore errors from nvme_hwmon_init
nvme-hwmon: kmalloc the NVME SMART log buffer
nvmet: fix workqueue MEM_RECLAIM flushing dependency
net: sched: cake: fix null pointer access issue when cake_init() fails
net: sched: delete duplicate cleanup of backlog and qlen
net: sched: sfb: fix null pointer access issue when sfb_init() fails
sfc: include vport_id in filter spec hash and equal()
wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new()
net: hns: fix possible memory leak in hnae_ae_register()
net: sched: fix race condition in qdisc_graft()
net: phy: dp83822: disable MDI crossover status change interrupt
iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
iommu/vt-d: Clean up si_domain in the init_dmars() error path
fs: dlm: fix invalid derefence of sb_lvbptr
arm64: mte: move register initialization to C
ksmbd: handle smb2 query dir request for OutputBufferLength that is too small
ksmbd: fix incorrect handling of iterate_dir
tracing: Simplify conditional compilation code in tracing_set_tracer()
tracing: Do not free snapshot if tracer is on cmdline
mmc: sdhci-tegra: Use actual clock rate for SW tuning correction
perf: Skip and warn on unknown format 'configN' attrs
ACPI: video: Force backlight native for more TongFang devices
x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB
Makefile.debug: re-enable debug info for .S files
mmc: core: Add SD card quirk for broken discard
mm: /proc/pid/smaps_rollup: fix no vma's null-deref
Linux 5.15.76
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ica5b3f26c36900ff31ccac63f4fb55b52bff0ec2
commit ed51862f2f57cbce6fed2d4278cfe70a490899fd upstream.
We will introduce the first architecture specific compat vm ioctl in the
next patch. Add all necessary boilerplate to allow architectures to
override compat vm ioctls when necessary.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.15.70
drm/tegra: vic: Fix build warning when CONFIG_PM=n
serial: atmel: remove redundant assignment in rs485_config
tty: serial: atmel: Preserve previous USART mode if RS485 disabled
of: fdt: fix off-by-one error in unflatten_dt_nodes()
pinctrl: qcom: sc8180x: Fix gpio_wakeirq_map
pinctrl: qcom: sc8180x: Fix wrong pin numbers
pinctrl: rockchip: Enhance support for IRQ_TYPE_EDGE_BOTH
pinctrl: sunxi: Fix name for A100 R_PIO
NFSv4: Turn off open-by-filehandle and NFS re-export for NFSv4.0
gpio: mpc8xxx: Fix support for IRQ_TYPE_LEVEL_LOW flow_type in mpc85xx
drm/meson: Correct OSD1 global alpha value
drm/meson: Fix OSD1 RGB to YCbCr coefficient
block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
parisc: ccio-dma: Add missing iounmap in error path in ccio_probe()
of/device: Fix up of_dma_configure_id() stub
cifs: revalidate mapping when doing direct writes
cifs: don't send down the destination address to sendmsg for a SOCK_STREAM
cifs: always initialize struct msghdr smb_msg completely
parisc: Allow CONFIG_64BIT with ARCH=parisc
tools/include/uapi: Fix <asm/errno.h> for parisc and xtensa
drm/amdgpu: Don't enable LTR if not supported
drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega
drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
binder: remove inaccurate mmap_assert_locked()
video: fbdev: i740fb: Error out if 'pixclock' equals zero
arm64: dts: juno: Add missing MHU secure-irq
ASoC: nau8824: Fix semaphore unbalance at error paths
regulator: pfuze100: Fix the global-out-of-bounds access in pfuze100_regulator_probe()
scsi: lpfc: Return DID_TRANSPORT_DISRUPTED instead of DID_REQUEUE
rxrpc: Fix local destruction being repeated
rxrpc: Fix calc of resend age
wifi: mac80211_hwsim: check length for virtio packets
ALSA: hda/sigmatel: Keep power up while beep is enabled
ALSA: hda/tegra: Align BDL entry to 4KB boundary
net: usb: qmi_wwan: add Quectel RM520N
afs: Return -EAGAIN, not -EREMOTEIO, when a file already locked
MIPS: OCTEON: irq: Fix octeon_irq_force_ciu_mapping()
drm/panfrost: devfreq: set opp to the recommended one to configure regulator
mksysmap: Fix the mismatch of 'L0' symbols in System.map
video: fbdev: pxa3xx-gcu: Fix integer overflow in pxa3xx_gcu_write
net: Find dst with sk's xfrm policy not ctl_sk
KVM: SEV: add cache flush to solve SEV cache incoherency issues
cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
ALSA: hda/sigmatel: Fix unused variable warning for beep power change
Linux 5.15.70
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iea16cee2475ff8bca607e57fc8b0c4b71b0a6f56
Changes in 5.15.69
NFS: Fix WARN_ON due to unionization of nfs_inode.nrequests
ACPI: resource: skip IRQ override on AMD Zen platforms
ARM: dts: imx: align SPI NOR node name with dtschema
ARM: dts: imx6qdl-kontron-samx6i: fix spi-flash compatible
ARM: dts: at91: fix low limit for CPU regulator
ARM: dts: at91: sama7g5ek: specify proper regulator output ranges
lockdep: Fix -Wunused-parameter for _THIS_IP_
x86/mm: Force-inline __phys_addr_nodebug()
task_stack, x86/cea: Force-inline stack helpers
tracing: hold caller_addr to hardirq_{enable,disable}_ip
tracefs: Only clobber mode/uid/gid on remount if asked
iommu/vt-d: Fix kdump kernels boot failure with scalable mode
Input: goodix - add support for GT1158
platform/surface: aggregator_registry: Add support for Surface Laptop Go 2
drm/msm/rd: Fix FIFO-full deadlock
dt-bindings: iio: gyroscope: bosch,bmg160: correct number of pins
HID: ishtp-hid-clientHID: ishtp-hid-client: Fix comment typo
hid: intel-ish-hid: ishtp: Fix ishtp client sending disordered message
tg3: Disable tg3 device on system reboot to avoid triggering AER
gpio: mockup: remove gpio debugfs when remove device
ieee802154: cc2520: add rc code in cc2520_tx()
Input: iforce - add support for Boeder Force Feedback Wheel
nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
drm/amd/amdgpu: skip ucode loading if ucode_size == 0
net: dsa: hellcreek: Print warning only once
perf/arm_pmu_platform: fix tests for platform_get_irq() failure
platform/x86: acer-wmi: Acer Aspire One AOD270/Packard Bell Dot keymap fixes
usb: storage: Add ASUS <0x0b05:0x1932> to IGNORE_UAS
mm: Fix TLB flush for not-first PFNMAP mappings in unmap_region()
soc: fsl: select FSL_GUTS driver for DPIO
usb: gadget: f_uac2: clean up some inconsistent indenting
usb: gadget: f_uac2: fix superspeed transfer
RDMA/irdma: Use s/g array in post send only when its valid
Input: goodix - add compatible string for GT1158
Linux 5.15.69
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifcadf79f34eb6093489fb3faf5e42c9739e56522
commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.
Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots). Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.
Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.
KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.
Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.
Cc: stable@vger.kernel.org
Suggested-by: Sean Christpherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied
kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ]
While looking into a bug related to the compiler's handling of addresses
of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
Drive by cleanup.
-Wunused-parameter:
kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.
Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com
(cherry picked from commit e1bfc24577cc65c95dc519d7621a9c985b97e567)
[willdeacon@: Fix context conflict in x86 asm/kvm_host.h]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Id8184005f4ccc370512bdbf3d3f18612dcee1cd8
Now that the vcpu array is backed by an xarray, use the optimised
iterator that matches the underlying data structure.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-8-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 214bd3a6f46981b7867946e1b4f628a06bcf2091)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I262005faec3a52c5f2f698d943920c6136bb8144
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.
Let's bite the bullet and repaint all of it in one go.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 46808a4cb89708c2e5b264eb9d1035762581921b)
[willdeacon@: Drop riscv hunks; drop x86 hunks for code that isn't present;
convert kvm_hyperv_tsc_notifier() as well]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I78a3ae244de22575684dd47a9823b9fc61b6fa7d
At least on arm64 and x86, the vcpus array is pretty huge (up to
1024 entries on x86) and is mostly empty in the majority of the cases
(running 1k vcpu VMs is not that common).
This mean that we end-up with a 4kB block of unused memory in the
middle of the kvm structure.
Instead of wasting away this memory, let's use an xarray instead,
which gives us almost the same flexibility as a normal array, but
with a reduced memory usage with smaller VMs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-6-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c5b077549136584618a66258f09d8d4b41e7409c)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I5c89adb8caf5b167536b0e51590c7ee7ec0363d9
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.
Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there is no further references on the kvm structure.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-2-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 27592ae8dbe41033261b6fdf27d78998aabd2665)
[willdeacon@: Drop riscv changes; fix conflict in arm64/kvm/arm.c the
same way as upstream merge commit 7fd55a02a426 ("Merge tag 'kvmarm-5.17'
of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: If63c359dc841fca1a9baf73283a7a014058b2c97
kvm_is_transparent_hugepage() was removed in commit 205d76ff06 ("KVM:
Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its
declaration in include/linux/kvm_host.h persisted. Drop it.
Fixes: 205d76ff06 (""KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com
(cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2
Changes in 5.15.57
x86/traps: Use pt_regs directly in fixup_bad_iret()
x86/entry: Switch the stack after error_entry() returns
x86/entry: Move PUSH_AND_CLEAR_REGS out of error_entry()
x86/entry: Don't call error_entry() for XENPV
objtool: Classify symbols
objtool: Explicitly avoid self modifying code in .altinstr_replacement
objtool: Shrink struct instruction
objtool,x86: Replace alternatives with .retpoline_sites
objtool: Introduce CFI hash
x86/retpoline: Remove unused replacement symbols
x86/asm: Fix register order
x86/asm: Fixup odd GEN-for-each-reg.h usage
x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
x86/retpoline: Create a retpoline thunk array
x86/alternative: Implement .retpoline_sites support
x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
x86/alternative: Try inline spectre_v2=retpoline,amd
x86/alternative: Add debug prints to apply_retpolines()
bpf,x86: Simplify computing label offsets
bpf,x86: Respect X86_FEATURE_RETPOLINE*
objtool: Default ignore INT3 for unreachable
x86/entry: Remove skip_r11rcx
x86/realmode: build with -D__DISABLE_EXPORTS
x86/kvm/vmx: Make noinstr clean
x86/cpufeatures: Move RETPOLINE flags to word 11
x86/retpoline: Cleanup some #ifdefery
x86/retpoline: Swizzle retpoline thunk
x86/retpoline: Use -mfunction-return
x86: Undo return-thunk damage
x86,objtool: Create .return_sites
objtool: skip non-text sections when adding return-thunk sites
x86,static_call: Use alternative RET encoding
x86/ftrace: Use alternative RET encoding
x86/bpf: Use alternative RET encoding
x86/kvm: Fix SETcc emulation for return thunks
x86/vsyscall_emu/64: Don't use RET in vsyscall emulation
x86/sev: Avoid using __x86_return_thunk
x86: Use return-thunk in asm code
x86/entry: Avoid very early RET
objtool: Treat .text.__x86.* as noinstr
x86: Add magic AMD return-thunk
x86/bugs: Report AMD retbleed vulnerability
x86/bugs: Add AMD retbleed= boot parameter
x86/bugs: Enable STIBP for JMP2RET
x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value
x86/entry: Add kernel IBRS implementation
x86/bugs: Optimize SPEC_CTRL MSR writes
x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS
x86/bugs: Split spectre_v2_select_mitigation() and spectre_v2_user_select_mitigation()
x86/bugs: Report Intel retbleed vulnerability
intel_idle: Disable IBRS during long idle
objtool: Update Retpoline validation
x86/xen: Rename SYS* entry points
x86/xen: Add UNTRAIN_RET
x86/bugs: Add retbleed=ibpb
x86/bugs: Do IBPB fallback check only once
objtool: Add entry UNRET validation
x86/cpu/amd: Add Spectral Chicken
x86/speculation: Fix RSB filling with CONFIG_RETPOLINE=n
x86/speculation: Fix firmware entry SPEC_CTRL handling
x86/speculation: Fix SPEC_CTRL write on SMT state change
x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit
x86/speculation: Remove x86_spec_ctrl_mask
objtool: Re-add UNWIND_HINT_{SAVE_RESTORE}
KVM: VMX: Flatten __vmx_vcpu_run()
KVM: VMX: Convert launched argument to flags
KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS
KVM: VMX: Fix IBRS handling after vmexit
x86/speculation: Fill RSB on vmexit for IBRS
x86/common: Stamp out the stepping madness
x86/cpu/amd: Enumerate BTC_NO
x86/retbleed: Add fine grained Kconfig knobs
x86/bugs: Add Cannon lake to RETBleed affected CPU list
x86/entry: Move PUSH_AND_CLEAR_REGS() back into error_entry
x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported
x86/kexec: Disable RET on kexec
x86/speculation: Disable RRSBA behavior
x86/static_call: Serialize __static_call_fixup() properly
x86/xen: Fix initialisation in hypercall_page after rethunk
x86/asm/32: Fix ANNOTATE_UNRET_SAFE use on 32-bit
x86/speculation: Use DECLARE_PER_CPU for x86_spec_ctrl_current
efi/x86: use naked RET on mixed mode call wrapper
x86/kvm: fix FASTOP_SIZE when return thunks are enabled
KVM: emulate: do not adjust size of fastop and setcc subroutines
tools arch x86: Sync the msr-index.h copy with the kernel sources
tools headers cpufeatures: Sync with the kernel sources
x86/bugs: Remove apostrophe typo
um: Add missing apply_returns()
x86: Use -mindirect-branch-cs-prefix for RETPOLINE builds
Linux 5.15.57
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7d0a3c3eb4be1e5401c2678fdb6229523486146f
Changes in 5.15.22
drm/i915: Disable DSB usage for now
selinux: fix double free of cond_list on error paths
audit: improve audit queue handling when "audit=1" on cmdline
ipc/sem: do not sleep with a spin lock held
spi: stm32-qspi: Update spi registering
ASoC: hdmi-codec: Fix OOB memory accesses
ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()
ASoC: ops: Reject out of bounds values in snd_soc_put_volsw_sx()
ASoC: ops: Reject out of bounds values in snd_soc_put_xr_sx()
ALSA: usb-audio: Correct quirk for VF0770
ALSA: hda: Fix UAF of leds class devs at unbinding
ALSA: hda: realtek: Fix race at concurrent COEF updates
ALSA: hda/realtek: Add quirk for ASUS GU603
ALSA: hda/realtek: Add missing fixup-model entry for Gigabyte X570 ALC1220 quirks
ALSA: hda/realtek: Fix silent output on Gigabyte X570S Aorus Master (newer chipset)
ALSA: hda/realtek: Fix silent output on Gigabyte X570 Aorus Xtreme after reboot from Windows
btrfs: don't start transaction for scrub if the fs is mounted read-only
btrfs: fix deadlock between quota disable and qgroup rescan worker
btrfs: fix use-after-free after failure to create a snapshot
Revert "fs/9p: search open fids first"
drm/nouveau: fix off by one in BIOS boundary checking
drm/i915/adlp: Fix TypeC PHY-ready status readout
drm/amd/pm: correct the MGpuFanBoost support for Beige Goby
drm/amd/display: watermark latencies is not enough on DCN31
drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina panels
nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()
mm/debug_vm_pgtable: remove pte entry from the page table
mm/pgtable: define pte_index so that preprocessor could recognize it
mm/kmemleak: avoid scanning potential huge holes
block: bio-integrity: Advance seed correctly for larger interval sizes
dma-buf: heaps: Fix potential spectre v1 gadget
IB/hfi1: Fix AIP early init panic
Revert "fbcon: Disable accelerated scrolling"
fbcon: Add option to enable legacy hardware acceleration
mptcp: fix msk traversal in mptcp_nl_cmd_set_flags()
Revert "ASoC: mediatek: Check for error clk pointer"
KVM: arm64: Avoid consuming a stale esr value when SError occur
KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs
RDMA/cma: Use correct address when leaving multicast group
RDMA/ucma: Protect mc during concurrent multicast leaves
RDMA/siw: Fix refcounting leak in siw_create_qp()
IB/rdmavt: Validate remote_addr during loopback atomic tests
RDMA/siw: Fix broken RDMA Read Fence/Resume logic.
RDMA/mlx4: Don't continue event handler after memory allocation failure
ALSA: usb-audio: initialize variables that could ignore errors
ALSA: hda: Fix signedness of sscanf() arguments
ALSA: hda: Skip codec shutdown in case the codec is not registered
iommu/vt-d: Fix potential memory leak in intel_setup_irq_remapping()
iommu/amd: Fix loop timeout issue in iommu_ga_log_enable()
spi: bcm-qspi: check for valid cs before applying chip select
spi: mediatek: Avoid NULL pointer crash in interrupt
spi: meson-spicc: add IRQ check in meson_spicc_probe
spi: uniphier: fix reference count leak in uniphier_spi_probe()
IB/hfi1: Fix tstats alloc and dealloc
IB/cm: Release previously acquired reference counter in the cm_id_priv
net: ieee802154: hwsim: Ensure proper channel selection at probe time
net: ieee802154: mcr20a: Fix lifs/sifs periods
net: ieee802154: ca8210: Stop leaking skb's
netfilter: nft_reject_bridge: Fix for missing reply from prerouting
net: ieee802154: Return meaningful error codes from the netlink helpers
net/smc: Forward wakeup to smc socket waitqueue after fallback
net: stmmac: dwmac-visconti: No change to ETHER_CLOCK_SEL for unexpected speed request.
net: stmmac: properly handle with runtime pm in stmmac_dvr_remove()
net: macsec: Fix offload support for NETDEV_UNREGISTER event
net: macsec: Verify that send_sci is on when setting Tx sci explicitly
net: stmmac: dump gmac4 DMA registers correctly
net: stmmac: ensure PTP time register reads are consistent
drm/kmb: Fix for build errors with Warray-bounds
drm/i915/overlay: Prevent divide by zero bugs in scaling
drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled
ASoC: fsl: Add missing error handling in pcm030_fabric_probe
ASoC: xilinx: xlnx_formatter_pcm: Make buffer bytes multiple of period bytes
ASoC: simple-card: fix probe failure on platform component
ASoC: cpcap: Check for NULL pointer after calling of_get_child_by_name
ASoC: max9759: fix underflow in speaker_gain_control_put()
ASoC: codecs: wcd938x: fix incorrect used of portid
ASoC: codecs: lpass-rx-macro: fix sidetone register offsets
ASoC: codecs: wcd938x: fix return value of mixer put function
pinctrl: sunxi: Fix H616 I2S3 pin data
pinctrl: intel: Fix a glitch when updating IRQ flags on a preconfigured line
pinctrl: intel: fix unexpected interrupt
pinctrl: bcm2835: Fix a few error paths
scsi: bnx2fc: Make bnx2fc_recv_frame() mp safe
nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client.
gve: fix the wrong AdminQ buffer queue index check
bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
selftests/exec: Remove pipe from TEST_GEN_FILES
selftests: futex: Use variable MAKE instead of make
tools/resolve_btfids: Do not print any commands when building silently
e1000e: Separate ADP board type from TGP
rtc: cmos: Evaluate century appropriate
kvm: add guest_state_{enter,exit}_irqoff()
kvm/arm64: rework guest entry logic
perf: Copy perf_event_attr::sig_data on modification
perf stat: Fix display of grouped aliased events
perf/x86/intel/pt: Fix crash with stop filters in single-range mode
x86/perf: Default set FREEZE_ON_SMI for all
EDAC/altera: Fix deferred probing
EDAC/xgene: Fix deferred probing
ext4: prevent used blocks from being allocated during fast commit replay
ext4: modify the logic of ext4_mb_new_blocks_simple
ext4: fix error handling in ext4_restore_inline_data()
ext4: fix error handling in ext4_fc_record_modified_inode()
ext4: fix incorrect type issue during replay_del_range
net: dsa: mt7530: make NET_DSA_MT7530 select MEDIATEK_GE_PHY
cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
tools include UAPI: Sync sound/asound.h copy with the kernel sources
gpio: idt3243x: Fix an ignored error return from platform_get_irq()
gpio: mpc8xxx: Fix an ignored error return from platform_get_irq()
selftests: nft_concat_range: add test for reload with no element add/del
selftests: netfilter: check stateless nat udp checksum fixup
Linux 5.15.22
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9143b858b768a8497c1df9440a74d8c105c32271
commit ef9989afda73332df566852d6e9ca695c05f10ce upstream.
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and a have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98 ("x86/kvm: Move context tracking where it belongs")
0642391e21 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9ef ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7 ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
1604571401 ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
heleprs are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
manageent. These are inteneded to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequences handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kvm_is_transparent_hugepage() was removed in commit 205d76ff06 ("KVM:
Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its
declaration in include/linux/kvm_host.h persisted. Drop it.
Fixes: 205d76ff06 (""KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com
(cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2
Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper. The wrapper is a
remnant of the original implementation and serves no purpose; remove it
before it gains more users.
Back when kvm_vcpu_get_idx() was added by commit 497d72d80a ("KVM: Add
kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation
was more than just a simple wrapper as vcpu->vcpu_idx did not exist and
retrieving the index meant walking over the vCPU array to find the given
vCPU.
When vcpu_idx was introduced by commit 8750e72a79 ("KVM: remember
position in kvm->vcpus array"), the helper was left behind, likely to
avoid extra thrash (but even then there were only two users, the original
arm usage having been removed at some point in the past).
No functional change intended.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm64 updates for 5.15
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
Add a new stat that counts the number of times a remote TLB flush is
requested, regardless of whether it kicks vCPUs out of guest mode. This
allows us to look at how often flushes are initiated.
Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
TLB flush implementation, so apply it there too.
Original-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210817002639.3856694-1-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add three log histogram stats to record the distribution of time spent
on successful polling, failed polling and VCPU wait.
halt_poll_success_hist: Distribution of spent time for a successful poll.
halt_poll_fail_hist: Distribution of spent time for a failed poll.
halt_wait_hist: Distribution of time a VCPU has spent on waiting.
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-6-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory
besides the stats fields. The new interface kvm_arch_create_vm_debugfs() is
defined but not yet used. It's called after kvm->debugfs_dentry is created, so
it can be referenced directly in kvm_arch_create_vm_debugfs(). Arch should
define their own versions when they want to create extra debugfs nodes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memslot for a given gfn is looked up multiple times during page
fault handling. Avoid binary searching for it multiple times by caching
the most recently used slot. There is an existing VM-wide last_used_slot
but that does not work well for cases where vCPUs are accessing memory
in different slots (see performance data below).
Another benefit of caching the most recently use slot (versus looking
up the slot once and passing around a pointer) is speeding up memslot
lookups *across* faults and during spte prefetching.
To measure the performance of this change I ran dirty_log_perf_test with
64 vCPUs and 64 memslots and measured "Populate memory time" and
"Iteration 2 dirty memory time". Tests were ran with eptad=N to force
dirty logging to use fast_page_fault so its performance could be
measured.
Config | Metric | Before | After
---------- | ----------------------------- | ------ | ------
tdp_mmu=Y | Populate memory time | 6.76s | 5.47s
tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s | 0.31s
tdp_mmu=N | Populate memory time | 20.4s | 18.7s
tdp_mmu=N | Iteration 2 dirty memory time | 2.65s | 0.30s
The "Iteration 2 dirty memory time" results are especially compelling
because they are equivalent to running the same test with a single
memslot. In other words, fast_page_fault performance no longer scales
with the number of memslots.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make search_memslots unconditionally search all memslots and move the
last_used_slot logic up one level to __gfn_to_memslot. This is in
preparation for introducing a per-vCPU last_used_slot.
As part of this change convert existing callers of search_memslots to
__gfn_to_memslot to avoid making any functional changes.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
lru_slot is used to keep track of the index of the most-recently used
memslot. The correct acronym would be "mru" but that is not a common
acronym. So call it last_used_slot which is a bit more obvious.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
invalidate_range_start
take pseudo lock
down_read() (*)
release pseudo lock
invalidate_range_end
take pseudo lock (**)
up_read()
release pseudo lock
At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes
- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical section in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce this safe version of kvm_get_kvm() so that it can be called even
during vm destruction. Use it in kvm_debugfs_open() and remove the verbose
comment. Prepare to be used elsewhere.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-3-peterx@redhat.com>
[Preserve the comment in kvm_debugfs_open. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Export kvm_make_all_cpus_request() and hoist the request helper
declarations of request up to the KVM_REQ_* definitions in preparation
for adding a "VM bugged" framework. The framework will add KVM_BUG()
and KVM_BUG_ON() as alternatives to full BUG()/BUG_ON() for cases where
KVM has definitely hit a bug (in itself or in silicon) and the VM is all
but guaranteed to be hosed. Marking a VM bugged will trigger a request
to all vCPUs to allow arch code to forcefully evict each vCPU from its
run loop.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <1d8cbbc8065d831343e70b5dcaea92268145eef1.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit defines the API for userspace and prepare the common
functionalities to support per VM/VCPU binary stats data readings.
The KVM stats now is only accessible by debugfs, which has some
shortcomings this change series are supposed to fix:
1. The current debugfs stats solution in KVM could be disabled
when kernel Lockdown mode is enabled, which is a potential
rick for production.
2. The current debugfs stats solution in KVM is organized as "one
stats per file", it is good for debugging, but not efficient
for production.
3. The stats read/clear in current debugfs solution in KVM are
protected by the global kvm_lock.
Besides that, there are some other benefits with this change:
1. All KVM VM/VCPU stats can be read out in a bulk by one copy
to userspace.
2. A schema is used to describe KVM statistics. From userspace's
perspective, the KVM statistics are self-describing.
3. With the fd-based solution, a separate telemetry would be able
to read KVM stats in a less privileged environment.
4. After the initial setup by reading in stats descriptors, a
telemetry only needs to read the stats data itself, no more
parsing or setup is needed.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-3-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-5-bgardon@google.com>
[Add Documentation/ hunk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
array_index_nospec does not work for uint64_t on 32-bit builds.
However, the size of a memory slot must be less than 20 bits wide
on those system, since the memory slot must fit in the user
address space. So just store it in an unsigned long.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot. The translation is
performed in __gfn_to_hva_memslot using the following formula:
hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
It is expected that gfn falls within the boundaries of the guest's
physical memory. However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.
__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva. If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.
Right now it's not clear if there are any cases in which this is
exploitable. One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86. Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier. However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space. Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets. Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached. This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.
Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_REQ_UNBLOCK will be used to exit a vcpu from
its inner vcpu halt emulation loop.
Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
PowerPC to arch specific request bit.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both lock holder vCPU and IPI receiver that has halted are condidate for
boost. However, the PLE handler was originally designed to deal with the
lock holder preemption problem. The Intel PLE occurs when the spinlock
waiter is in kernel mode. This assumption doesn't hold for IPI receiver,
they can be in either kernel or user mode. the vCPU candidate in user mode
will not be boosted even if they should respond to IPIs. Some benchmarks
like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most
of the time they are running in user mode. It can lead to a large number
of continuous PLE events because the IPI sender causes PLE events
repeatedly until the receiver is scheduled while the receiver is not
candidate for a boost.
This patch boosts the vCPU candidiate in user mode which is delivery
interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs
VM in over-subscribe scenario (The host machine is 2 socket, 48 cores,
96 HTs Intel CLX box). There is no performance regression for other
benchmarks like Unixbench spawn (most of the time contend read/write
lock in kernel mode), ebizzy (most of the time contend read/write sem
and TLB shoodtdown in kernel mode).
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.
The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).
This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.
This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.
For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.
Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
fails to allocate memory for the new instance of the bus. If it can't
instantiate a new bus, unregister_dev() destroys all devices _except_ the
target device. But, it doesn't tell the caller that it obliterated the
bus and invoked the destructor for all devices that were on the bus. In
the coalesced MMIO case, this can result in a deleted list entry
dereference due to attempting to continue iterating on coalesced_zones
after future entries (in the walk) have been deleted.
Opportunistically add curly braces to the for-loop, which encompasses
many lines but sneaks by without braces due to the guts being a single
if statement.
Fixes: f65886606c ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210412222050.876100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>