Commit Graph

647 Commits

Author SHA1 Message Date
Yosry Ahmed
2af25795b7 BACKPORT: KVM: x86/mmu: count KVM mmu usage in secondary pagetable stats.
Count the pages used by KVM mmu on x86 in memory stats under secondary
pagetable stats (e.g. "SecPageTables" in /proc/meminfo) to give better
visibility into the memory consumption of KVM mmu in a similar way to
how normal user page tables are accounted.

Add the inner helper in common KVM, ARM will also use it to count stats
in a future commit.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org> # generic KVM changes
Link: https://lore.kernel.org/r/20220823004639.2387269-3-yosryahmed@google.com
Link: https://lore.kernel.org/r/20220823004639.2387269-4-yosryahmed@google.com
[sean: squash x86 usage to workaround modpost issues]
Signed-off-by: Sean Christopherson <seanjc@google.com>

Bug: 222044477
(cherry picked from commit 43a063cab325ee7cc50349967e536b3cd4e57f03)
[vdonnefort@: Fix conflicts in mmu.c and tdp_mmu.c]
Change-Id: I9b81155758e513504a87ea2d634f341652ed0630
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
2022-11-23 17:11:25 +00:00
Greg Kroah-Hartman
05dba81225 Merge 5.15.76 into android14-5.15
Changes in 5.15.76
	r8152: add PID for the Lenovo OneLink+ Dock
	arm64/mm: Consolidate TCR_EL1 fields
	usb: gadget: uvc: consistently use define for headerlen
	usb: gadget: uvc: use on returned header len in video_encode_isoc_sg
	usb: gadget: uvc: rework uvcg_queue_next_buffer to uvcg_complete_buffer
	usb: gadget: uvc: giveback vb2 buffer on req complete
	usb: gadget: uvc: improve sg exit condition
	arm64: errata: Remove AES hwcap for COMPAT tasks
	perf/x86/intel/pt: Relax address filter validation
	btrfs: enhance unsupported compat RO flags handling
	ocfs2: clear dinode links count in case of error
	ocfs2: fix BUG when iput after ocfs2_mknod fails
	selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context()
	cpufreq: qcom: fix writes in read-only memory region
	i2c: qcom-cci: Fix ordering of pm_runtime_xx and i2c_add_adapter
	x86/microcode/AMD: Apply the patch early on every logical thread
	hwmon/coretemp: Handle large core ID value
	ata: ahci-imx: Fix MODULE_ALIAS
	ata: ahci: Match EM_MAX_SLOTS with SATA_PMP_MAX_PORTS
	x86/resctrl: Fix min_cbm_bits for AMD
	cpufreq: qcom: fix memory leak in error path
	drm/amdgpu: fix sdma doorbell init ordering on APUs
	mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
	kvm: Add support for arch compat vm ioctls
	KVM: arm64: vgic: Fix exit condition in scan_its_table()
	media: ipu3-imgu: Fix NULL pointer dereference in active selection access
	media: mceusb: set timeout to at least timeout provided
	media: venus: dec: Handle the case where find_format fails
	x86/topology: Fix multiple packages shown on a single-package system
	x86/topology: Fix duplicated core ID within a package
	btrfs: fix processing of delayed data refs during backref walking
	btrfs: fix processing of delayed tree block refs during backref walking
	drm/vc4: Add module dependency on hdmi-codec
	ACPI: extlog: Handle multiple records
	tipc: Fix recognition of trial period
	tipc: fix an information leak in tipc_topsrv_kern_subscr
	i40e: Fix DMA mappings leak
	HID: magicmouse: Do not set BTN_MOUSE on double report
	sfc: Change VF mac via PF as first preference if available.
	net/atm: fix proc_mpc_write incorrect return value
	net: phy: dp83867: Extend RX strap quirk for SGMII mode
	net: phylink: add mac_managed_pm in phylink_config structure
	scsi: lpfc: Fix memory leak in lpfc_create_port()
	udp: Update reuse->has_conns under reuseport_lock.
	cifs: Fix xid leak in cifs_create()
	cifs: Fix xid leak in cifs_copy_file_range()
	cifs: Fix xid leak in cifs_flock()
	cifs: Fix xid leak in cifs_ses_add_channel()
	dm: remove unnecessary assignment statement in alloc_dev()
	net: hsr: avoid possible NULL deref in skb_clone()
	ionic: catch NULL pointer issue on reconfig
	netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements
	nvme-hwmon: consistently ignore errors from nvme_hwmon_init
	nvme-hwmon: kmalloc the NVME SMART log buffer
	nvmet: fix workqueue MEM_RECLAIM flushing dependency
	net: sched: cake: fix null pointer access issue when cake_init() fails
	net: sched: delete duplicate cleanup of backlog and qlen
	net: sched: sfb: fix null pointer access issue when sfb_init() fails
	sfc: include vport_id in filter spec hash and equal()
	wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new()
	net: hns: fix possible memory leak in hnae_ae_register()
	net: sched: fix race condition in qdisc_graft()
	net: phy: dp83822: disable MDI crossover status change interrupt
	iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check()
	iommu/vt-d: Clean up si_domain in the init_dmars() error path
	fs: dlm: fix invalid derefence of sb_lvbptr
	arm64: mte: move register initialization to C
	ksmbd: handle smb2 query dir request for OutputBufferLength that is too small
	ksmbd: fix incorrect handling of iterate_dir
	tracing: Simplify conditional compilation code in tracing_set_tracer()
	tracing: Do not free snapshot if tracer is on cmdline
	mmc: sdhci-tegra: Use actual clock rate for SW tuning correction
	perf: Skip and warn on unknown format 'configN' attrs
	ACPI: video: Force backlight native for more TongFang devices
	x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB
	Makefile.debug: re-enable debug info for .S files
	mmc: core: Add SD card quirk for broken discard
	mm: /proc/pid/smaps_rollup: fix no vma's null-deref
	Linux 5.15.76

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ica5b3f26c36900ff31ccac63f4fb55b52bff0ec2
2022-11-03 14:21:38 +09:00
Alexander Graf
5bf2fda26a kvm: Add support for arch compat vm ioctls
commit ed51862f2f57cbce6fed2d4278cfe70a490899fd upstream.

We will introduce the first architecture specific compat vm ioctl in the
next patch. Add all necessary boilerplate to allow architectures to
override compat vm ioctls when necessary.

Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221017184541.2658-2-graf@amazon.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29 10:12:54 +02:00
Greg Kroah-Hartman
74ca15c523 Merge 5.15.70 into android14-5.15
Changes in 5.15.70
	drm/tegra: vic: Fix build warning when CONFIG_PM=n
	serial: atmel: remove redundant assignment in rs485_config
	tty: serial: atmel: Preserve previous USART mode if RS485 disabled
	of: fdt: fix off-by-one error in unflatten_dt_nodes()
	pinctrl: qcom: sc8180x: Fix gpio_wakeirq_map
	pinctrl: qcom: sc8180x: Fix wrong pin numbers
	pinctrl: rockchip: Enhance support for IRQ_TYPE_EDGE_BOTH
	pinctrl: sunxi: Fix name for A100 R_PIO
	NFSv4: Turn off open-by-filehandle and NFS re-export for NFSv4.0
	gpio: mpc8xxx: Fix support for IRQ_TYPE_LEVEL_LOW flow_type in mpc85xx
	drm/meson: Correct OSD1 global alpha value
	drm/meson: Fix OSD1 RGB to YCbCr coefficient
	block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait
	parisc: ccio-dma: Add missing iounmap in error path in ccio_probe()
	of/device: Fix up of_dma_configure_id() stub
	cifs: revalidate mapping when doing direct writes
	cifs: don't send down the destination address to sendmsg for a SOCK_STREAM
	cifs: always initialize struct msghdr smb_msg completely
	parisc: Allow CONFIG_64BIT with ARCH=parisc
	tools/include/uapi: Fix <asm/errno.h> for parisc and xtensa
	drm/amdgpu: Don't enable LTR if not supported
	drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega
	drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
	binder: remove inaccurate mmap_assert_locked()
	video: fbdev: i740fb: Error out if 'pixclock' equals zero
	arm64: dts: juno: Add missing MHU secure-irq
	ASoC: nau8824: Fix semaphore unbalance at error paths
	regulator: pfuze100: Fix the global-out-of-bounds access in pfuze100_regulator_probe()
	scsi: lpfc: Return DID_TRANSPORT_DISRUPTED instead of DID_REQUEUE
	rxrpc: Fix local destruction being repeated
	rxrpc: Fix calc of resend age
	wifi: mac80211_hwsim: check length for virtio packets
	ALSA: hda/sigmatel: Keep power up while beep is enabled
	ALSA: hda/tegra: Align BDL entry to 4KB boundary
	net: usb: qmi_wwan: add Quectel RM520N
	afs: Return -EAGAIN, not -EREMOTEIO, when a file already locked
	MIPS: OCTEON: irq: Fix octeon_irq_force_ciu_mapping()
	drm/panfrost: devfreq: set opp to the recommended one to configure regulator
	mksysmap: Fix the mismatch of 'L0' symbols in System.map
	video: fbdev: pxa3xx-gcu: Fix integer overflow in pxa3xx_gcu_write
	net: Find dst with sk's xfrm policy not ctl_sk
	KVM: SEV: add cache flush to solve SEV cache incoherency issues
	cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
	ALSA: hda/sigmatel: Fix unused variable warning for beep power change
	Linux 5.15.70

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iea16cee2475ff8bca607e57fc8b0c4b71b0a6f56
2022-09-24 14:22:45 +02:00
Greg Kroah-Hartman
a449b299e8 Merge 5.15.69 into android14-5.15
Changes in 5.15.69
	NFS: Fix WARN_ON due to unionization of nfs_inode.nrequests
	ACPI: resource: skip IRQ override on AMD Zen platforms
	ARM: dts: imx: align SPI NOR node name with dtschema
	ARM: dts: imx6qdl-kontron-samx6i: fix spi-flash compatible
	ARM: dts: at91: fix low limit for CPU regulator
	ARM: dts: at91: sama7g5ek: specify proper regulator output ranges
	lockdep: Fix -Wunused-parameter for _THIS_IP_
	x86/mm: Force-inline __phys_addr_nodebug()
	task_stack, x86/cea: Force-inline stack helpers
	tracing: hold caller_addr to hardirq_{enable,disable}_ip
	tracefs: Only clobber mode/uid/gid on remount if asked
	iommu/vt-d: Fix kdump kernels boot failure with scalable mode
	Input: goodix - add support for GT1158
	platform/surface: aggregator_registry: Add support for Surface Laptop Go 2
	drm/msm/rd: Fix FIFO-full deadlock
	dt-bindings: iio: gyroscope: bosch,bmg160: correct number of pins
	HID: ishtp-hid-clientHID: ishtp-hid-client: Fix comment typo
	hid: intel-ish-hid: ishtp: Fix ishtp client sending disordered message
	tg3: Disable tg3 device on system reboot to avoid triggering AER
	gpio: mockup: remove gpio debugfs when remove device
	ieee802154: cc2520: add rc code in cc2520_tx()
	Input: iforce - add support for Boeder Force Feedback Wheel
	nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
	drm/amd/amdgpu: skip ucode loading if ucode_size == 0
	net: dsa: hellcreek: Print warning only once
	perf/arm_pmu_platform: fix tests for platform_get_irq() failure
	platform/x86: acer-wmi: Acer Aspire One AOD270/Packard Bell Dot keymap fixes
	usb: storage: Add ASUS <0x0b05:0x1932> to IGNORE_UAS
	mm: Fix TLB flush for not-first PFNMAP mappings in unmap_region()
	soc: fsl: select FSL_GUTS driver for DPIO
	usb: gadget: f_uac2: clean up some inconsistent indenting
	usb: gadget: f_uac2: fix superspeed transfer
	RDMA/irdma: Use s/g array in post send only when its valid
	Input: goodix - add compatible string for GT1158
	Linux 5.15.69

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifcadf79f34eb6093489fb3faf5e42c9739e56522
2022-09-24 14:14:08 +02:00
Mingwei Zhang
39b0235284 KVM: SEV: add cache flush to solve SEV cache incoherency issues
commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots).  Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.

Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.

KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.

Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.

Cc: stable@vger.kernel.org
Suggested-by: Sean Christpherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied
kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-23 14:15:52 +02:00
Nick Desaulniers
f9571a9699 lockdep: Fix -Wunused-parameter for _THIS_IP_
[ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ]

While looking into a bug related to the compiler's handling of addresses
of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
Drive by cleanup.

-Wunused-parameter:
kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-20 12:39:42 +02:00
Sean Christopherson
a2aa2a6c94 BACKPORT: KVM: Move x86's perf guest info callbacks to generic KVM
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com
(cherry picked from commit e1bfc24577cc65c95dc519d7621a9c985b97e567)
[willdeacon@: Fix context conflict in x86 asm/kvm_host.h]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: Id8184005f4ccc370512bdbf3d3f18612dcee1cd8
2022-08-10 07:29:13 +00:00
Marc Zyngier
c437aada2a UPSTREAM: KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range()
Now that the vcpu array is backed by an xarray, use the optimised
iterator that matches the underlying data structure.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-8-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 214bd3a6f46981b7867946e1b4f628a06bcf2091)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I262005faec3a52c5f2f698d943920c6136bb8144
2022-08-10 07:29:13 +00:00
Marc Zyngier
e3031d06f7 BACKPORT: KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.

Let's bite the bullet and repaint all of it in one go.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 46808a4cb89708c2e5b264eb9d1035762581921b)
[willdeacon@: Drop riscv hunks; drop x86 hunks for code that isn't present;
 convert kvm_hyperv_tsc_notifier() as well]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I78a3ae244de22575684dd47a9823b9fc61b6fa7d
2022-08-10 07:29:13 +00:00
Marc Zyngier
37766cea76 UPSTREAM: KVM: Convert the kvm->vcpus array to a xarray
At least on arm64 and x86, the vcpus array is pretty huge (up to
1024 entries on x86) and is mostly empty in the majority of the cases
(running 1k vcpu VMs is not that common).

This mean that we end-up with a 4kB block of unused memory in the
middle of the kvm structure.

Instead of wasting away this memory, let's use an xarray instead,
which gives us almost the same flexibility as a normal array, but
with a reduced memory usage with smaller VMs.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-6-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c5b077549136584618a66258f09d8d4b41e7409c)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I5c89adb8caf5b167536b0e51590c7ee7ec0363d9
2022-08-10 07:29:13 +00:00
Marc Zyngier
25f5ae814c BACKPORT: KVM: Move wiping of the kvm->vcpus array to common code
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.

Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there is no further references on the kvm structure.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-2-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 27592ae8dbe41033261b6fdf27d78998aabd2665)
[willdeacon@: Drop riscv changes; fix conflict in arm64/kvm/arm.c the
 same way as upstream merge commit 7fd55a02a426 ("Merge tag 'kvmarm-5.17'
 of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: If63c359dc841fca1a9baf73283a7a014058b2c97
2022-08-10 07:29:13 +00:00
Vitaly Kuznetsov
2adcfdc6a6 UPSTREAM: KVM: Drop stale kvm_is_transparent_hugepage() declaration
kvm_is_transparent_hugepage() was removed in commit 205d76ff06 ("KVM:
Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its
declaration in include/linux/kvm_host.h persisted. Drop it.

Fixes: 205d76ff06 (""KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com
(cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2
2022-08-08 14:57:53 +01:00
Will Deacon
f884005542 Revert "FROMGIT: KVM: Drop stale kvm_is_transparent_hugepage() declaration"
This reverts commit 03761cf7c7.

Bug: 233587962
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I8d3741c7173b36b4e8f0d651a3f6600b0bbd9b72
2022-08-04 13:03:53 +00:00
Greg Kroah-Hartman
4f868bc314 Merge 5.15.57 into android14-5.15
Changes in 5.15.57
	x86/traps: Use pt_regs directly in fixup_bad_iret()
	x86/entry: Switch the stack after error_entry() returns
	x86/entry: Move PUSH_AND_CLEAR_REGS out of error_entry()
	x86/entry: Don't call error_entry() for XENPV
	objtool: Classify symbols
	objtool: Explicitly avoid self modifying code in .altinstr_replacement
	objtool: Shrink struct instruction
	objtool,x86: Replace alternatives with .retpoline_sites
	objtool: Introduce CFI hash
	x86/retpoline: Remove unused replacement symbols
	x86/asm: Fix register order
	x86/asm: Fixup odd GEN-for-each-reg.h usage
	x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
	x86/retpoline: Create a retpoline thunk array
	x86/alternative: Implement .retpoline_sites support
	x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
	x86/alternative: Try inline spectre_v2=retpoline,amd
	x86/alternative: Add debug prints to apply_retpolines()
	bpf,x86: Simplify computing label offsets
	bpf,x86: Respect X86_FEATURE_RETPOLINE*
	objtool: Default ignore INT3 for unreachable
	x86/entry: Remove skip_r11rcx
	x86/realmode: build with -D__DISABLE_EXPORTS
	x86/kvm/vmx: Make noinstr clean
	x86/cpufeatures: Move RETPOLINE flags to word 11
	x86/retpoline: Cleanup some #ifdefery
	x86/retpoline: Swizzle retpoline thunk
	x86/retpoline: Use -mfunction-return
	x86: Undo return-thunk damage
	x86,objtool: Create .return_sites
	objtool: skip non-text sections when adding return-thunk sites
	x86,static_call: Use alternative RET encoding
	x86/ftrace: Use alternative RET encoding
	x86/bpf: Use alternative RET encoding
	x86/kvm: Fix SETcc emulation for return thunks
	x86/vsyscall_emu/64: Don't use RET in vsyscall emulation
	x86/sev: Avoid using __x86_return_thunk
	x86: Use return-thunk in asm code
	x86/entry: Avoid very early RET
	objtool: Treat .text.__x86.* as noinstr
	x86: Add magic AMD return-thunk
	x86/bugs: Report AMD retbleed vulnerability
	x86/bugs: Add AMD retbleed= boot parameter
	x86/bugs: Enable STIBP for JMP2RET
	x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value
	x86/entry: Add kernel IBRS implementation
	x86/bugs: Optimize SPEC_CTRL MSR writes
	x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS
	x86/bugs: Split spectre_v2_select_mitigation() and spectre_v2_user_select_mitigation()
	x86/bugs: Report Intel retbleed vulnerability
	intel_idle: Disable IBRS during long idle
	objtool: Update Retpoline validation
	x86/xen: Rename SYS* entry points
	x86/xen: Add UNTRAIN_RET
	x86/bugs: Add retbleed=ibpb
	x86/bugs: Do IBPB fallback check only once
	objtool: Add entry UNRET validation
	x86/cpu/amd: Add Spectral Chicken
	x86/speculation: Fix RSB filling with CONFIG_RETPOLINE=n
	x86/speculation: Fix firmware entry SPEC_CTRL handling
	x86/speculation: Fix SPEC_CTRL write on SMT state change
	x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit
	x86/speculation: Remove x86_spec_ctrl_mask
	objtool: Re-add UNWIND_HINT_{SAVE_RESTORE}
	KVM: VMX: Flatten __vmx_vcpu_run()
	KVM: VMX: Convert launched argument to flags
	KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS
	KVM: VMX: Fix IBRS handling after vmexit
	x86/speculation: Fill RSB on vmexit for IBRS
	x86/common: Stamp out the stepping madness
	x86/cpu/amd: Enumerate BTC_NO
	x86/retbleed: Add fine grained Kconfig knobs
	x86/bugs: Add Cannon lake to RETBleed affected CPU list
	x86/entry: Move PUSH_AND_CLEAR_REGS() back into error_entry
	x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported
	x86/kexec: Disable RET on kexec
	x86/speculation: Disable RRSBA behavior
	x86/static_call: Serialize __static_call_fixup() properly
	x86/xen: Fix initialisation in hypercall_page after rethunk
	x86/asm/32: Fix ANNOTATE_UNRET_SAFE use on 32-bit
	x86/speculation: Use DECLARE_PER_CPU for x86_spec_ctrl_current
	efi/x86: use naked RET on mixed mode call wrapper
	x86/kvm: fix FASTOP_SIZE when return thunks are enabled
	KVM: emulate: do not adjust size of fastop and setcc subroutines
	tools arch x86: Sync the msr-index.h copy with the kernel sources
	tools headers cpufeatures: Sync with the kernel sources
	x86/bugs: Remove apostrophe typo
	um: Add missing apply_returns()
	x86: Use -mindirect-branch-cs-prefix for RETPOLINE builds
	Linux 5.15.57

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7d0a3c3eb4be1e5401c2678fdb6229523486146f
2022-07-23 13:51:05 +02:00
Peter Zijlstra
ccb25d7db1 x86/kvm/vmx: Make noinstr clean
commit 742ab6df974ae8384a2dd213db1a3a06cf6d8936 upstream.

The recent mmio_stale_data fixes broke the noinstr constraints:

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x15b: call to wrmsrl.constprop.0() leaves .noinstr.text section
  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x1bf: call to kvm_arch_has_assigned_device() leaves .noinstr.text section

make it all happy again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-23 12:53:57 +02:00
Greg Kroah-Hartman
a74d4e284c Merge 5.15.22 into android13-5.15
Changes in 5.15.22
	drm/i915: Disable DSB usage for now
	selinux: fix double free of cond_list on error paths
	audit: improve audit queue handling when "audit=1" on cmdline
	ipc/sem: do not sleep with a spin lock held
	spi: stm32-qspi: Update spi registering
	ASoC: hdmi-codec: Fix OOB memory accesses
	ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()
	ASoC: ops: Reject out of bounds values in snd_soc_put_volsw_sx()
	ASoC: ops: Reject out of bounds values in snd_soc_put_xr_sx()
	ALSA: usb-audio: Correct quirk for VF0770
	ALSA: hda: Fix UAF of leds class devs at unbinding
	ALSA: hda: realtek: Fix race at concurrent COEF updates
	ALSA: hda/realtek: Add quirk for ASUS GU603
	ALSA: hda/realtek: Add missing fixup-model entry for Gigabyte X570 ALC1220 quirks
	ALSA: hda/realtek: Fix silent output on Gigabyte X570S Aorus Master (newer chipset)
	ALSA: hda/realtek: Fix silent output on Gigabyte X570 Aorus Xtreme after reboot from Windows
	btrfs: don't start transaction for scrub if the fs is mounted read-only
	btrfs: fix deadlock between quota disable and qgroup rescan worker
	btrfs: fix use-after-free after failure to create a snapshot
	Revert "fs/9p: search open fids first"
	drm/nouveau: fix off by one in BIOS boundary checking
	drm/i915/adlp: Fix TypeC PHY-ready status readout
	drm/amd/pm: correct the MGpuFanBoost support for Beige Goby
	drm/amd/display: watermark latencies is not enough on DCN31
	drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina panels
	nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()
	mm/debug_vm_pgtable: remove pte entry from the page table
	mm/pgtable: define pte_index so that preprocessor could recognize it
	mm/kmemleak: avoid scanning potential huge holes
	block: bio-integrity: Advance seed correctly for larger interval sizes
	dma-buf: heaps: Fix potential spectre v1 gadget
	IB/hfi1: Fix AIP early init panic
	Revert "fbcon: Disable accelerated scrolling"
	fbcon: Add option to enable legacy hardware acceleration
	mptcp: fix msk traversal in mptcp_nl_cmd_set_flags()
	Revert "ASoC: mediatek: Check for error clk pointer"
	KVM: arm64: Avoid consuming a stale esr value when SError occur
	KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs
	RDMA/cma: Use correct address when leaving multicast group
	RDMA/ucma: Protect mc during concurrent multicast leaves
	RDMA/siw: Fix refcounting leak in siw_create_qp()
	IB/rdmavt: Validate remote_addr during loopback atomic tests
	RDMA/siw: Fix broken RDMA Read Fence/Resume logic.
	RDMA/mlx4: Don't continue event handler after memory allocation failure
	ALSA: usb-audio: initialize variables that could ignore errors
	ALSA: hda: Fix signedness of sscanf() arguments
	ALSA: hda: Skip codec shutdown in case the codec is not registered
	iommu/vt-d: Fix potential memory leak in intel_setup_irq_remapping()
	iommu/amd: Fix loop timeout issue in iommu_ga_log_enable()
	spi: bcm-qspi: check for valid cs before applying chip select
	spi: mediatek: Avoid NULL pointer crash in interrupt
	spi: meson-spicc: add IRQ check in meson_spicc_probe
	spi: uniphier: fix reference count leak in uniphier_spi_probe()
	IB/hfi1: Fix tstats alloc and dealloc
	IB/cm: Release previously acquired reference counter in the cm_id_priv
	net: ieee802154: hwsim: Ensure proper channel selection at probe time
	net: ieee802154: mcr20a: Fix lifs/sifs periods
	net: ieee802154: ca8210: Stop leaking skb's
	netfilter: nft_reject_bridge: Fix for missing reply from prerouting
	net: ieee802154: Return meaningful error codes from the netlink helpers
	net/smc: Forward wakeup to smc socket waitqueue after fallback
	net: stmmac: dwmac-visconti: No change to ETHER_CLOCK_SEL for unexpected speed request.
	net: stmmac: properly handle with runtime pm in stmmac_dvr_remove()
	net: macsec: Fix offload support for NETDEV_UNREGISTER event
	net: macsec: Verify that send_sci is on when setting Tx sci explicitly
	net: stmmac: dump gmac4 DMA registers correctly
	net: stmmac: ensure PTP time register reads are consistent
	drm/kmb: Fix for build errors with Warray-bounds
	drm/i915/overlay: Prevent divide by zero bugs in scaling
	drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled
	ASoC: fsl: Add missing error handling in pcm030_fabric_probe
	ASoC: xilinx: xlnx_formatter_pcm: Make buffer bytes multiple of period bytes
	ASoC: simple-card: fix probe failure on platform component
	ASoC: cpcap: Check for NULL pointer after calling of_get_child_by_name
	ASoC: max9759: fix underflow in speaker_gain_control_put()
	ASoC: codecs: wcd938x: fix incorrect used of portid
	ASoC: codecs: lpass-rx-macro: fix sidetone register offsets
	ASoC: codecs: wcd938x: fix return value of mixer put function
	pinctrl: sunxi: Fix H616 I2S3 pin data
	pinctrl: intel: Fix a glitch when updating IRQ flags on a preconfigured line
	pinctrl: intel: fix unexpected interrupt
	pinctrl: bcm2835: Fix a few error paths
	scsi: bnx2fc: Make bnx2fc_recv_frame() mp safe
	nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client.
	gve: fix the wrong AdminQ buffer queue index check
	bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
	selftests/exec: Remove pipe from TEST_GEN_FILES
	selftests: futex: Use variable MAKE instead of make
	tools/resolve_btfids: Do not print any commands when building silently
	e1000e: Separate ADP board type from TGP
	rtc: cmos: Evaluate century appropriate
	kvm: add guest_state_{enter,exit}_irqoff()
	kvm/arm64: rework guest entry logic
	perf: Copy perf_event_attr::sig_data on modification
	perf stat: Fix display of grouped aliased events
	perf/x86/intel/pt: Fix crash with stop filters in single-range mode
	x86/perf: Default set FREEZE_ON_SMI for all
	EDAC/altera: Fix deferred probing
	EDAC/xgene: Fix deferred probing
	ext4: prevent used blocks from being allocated during fast commit replay
	ext4: modify the logic of ext4_mb_new_blocks_simple
	ext4: fix error handling in ext4_restore_inline_data()
	ext4: fix error handling in ext4_fc_record_modified_inode()
	ext4: fix incorrect type issue during replay_del_range
	net: dsa: mt7530: make NET_DSA_MT7530 select MEDIATEK_GE_PHY
	cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
	tools include UAPI: Sync sound/asound.h copy with the kernel sources
	gpio: idt3243x: Fix an ignored error return from platform_get_irq()
	gpio: mpc8xxx: Fix an ignored error return from platform_get_irq()
	selftests: nft_concat_range: add test for reload with no element add/del
	selftests: netfilter: check stateless nat udp checksum fixup
	Linux 5.15.22

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9143b858b768a8497c1df9440a74d8c105c32271
2022-02-09 08:15:44 +01:00
Mark Rutland
83071e2dad kvm: add guest_state_{enter,exit}_irqoff()
commit ef9989afda73332df566852d6e9ca695c05f10ce upstream.

When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.

Most architectures don't handle all the necessary pieces, and a have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().

On x86, this was dealt with across commits:

  87fa7f3e98 ("x86/kvm: Move context tracking where it belongs")
  0642391e21 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
  9fc975e9ef ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
  3ebccdf373 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
  135961e0a7 ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
  1604571401 ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
  bc908e091b ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.

This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
heleprs are left as wrappers of these.

Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
manageent. These are inteneded to mirror exit_to_user_mode() and
enter_from_user_mode().

Subsequent patches will migrate architectures over to the new helpers,
following a sequence:

	guest_timing_enter_irqoff();

	guest_state_enter_irqoff();
	< run the vcpu >
	guest_state_exit_irqoff();

	< take any pending IRQs >

	guest_timing_exit_irqoff();

This sequences handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.

The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-08 18:34:12 +01:00
Vitaly Kuznetsov
03761cf7c7 FROMGIT: KVM: Drop stale kvm_is_transparent_hugepage() declaration
kvm_is_transparent_hugepage() was removed in commit 205d76ff06 ("KVM:
Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its
declaration in include/linux/kvm_host.h persisted. Drop it.

Fixes: 205d76ff06 (""KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com
(cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8
 git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2
2021-12-09 09:42:19 +00:00
Lai Jiangshan
6bc6db0002 KVM: Remove tlbs_dirty
There is no user of tlbs_dirty.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 11:01:12 -04:00
Sean Christopherson
4eeef24241 KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor
Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper.  The wrapper is a
remnant of the original implementation and serves no purpose; remove it
before it gains more users.

Back when kvm_vcpu_get_idx() was added by commit 497d72d80a ("KVM: Add
kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation
was more than just a simple wrapper as vcpu->vcpu_idx did not exist and
retrieving the index meant walking over the vCPU array to find the given
vCPU.

When vcpu_idx was introduced by commit 8750e72a79 ("KVM: remember
position in kvm->vcpus array"), the helper was left behind, likely to
avoid extra thrash (but even then there were only two users, the original
arm usage having been removed at some point in the past).

No functional change intended.

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:11 -04:00
Paolo Bonzini
e99314a340 Merge tag 'kvmarm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.15

- Page ownership tracking between host EL1 and EL2

- Rely on userspace page tables to create large stage-2 mappings

- Fix incompatibility between pKVM and kmemleak

- Fix the PMU reset state, and improve the performance of the virtual PMU

- Move over to the generic KVM entry code

- Address PSCI reset issues w.r.t. save/restore

- Preliminary rework for the upcoming pKVM fixed feature

- A bunch of MM cleanups

- a vGIC fix for timer spurious interrupts

- Various cleanups
2021-09-06 06:34:48 -04:00
Jing Zhang
3cc4e148b9 KVM: stats: Add VM stat for remote tlb flush requests
Add a new stat that counts the number of times a remote TLB flush is
requested, regardless of whether it kicks vCPUs out of guest mode. This
allows us to look at how often flushes are initiated.

Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
TLB flush implementation, so apply it there too.

Original-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210817002639.3856694-1-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06 06:30:45 -04:00
Jing Zhang
8ccba534a1 KVM: stats: Add halt polling related histogram stats
Add three log histogram stats to record the distribution of time spent
on successful polling, failed polling and VCPU wait.
halt_poll_success_hist: Distribution of spent time for a successful poll.
halt_poll_fail_hist: Distribution of spent time for a failed poll.
halt_wait_hist: Distribution of time a VCPU has spent on waiting.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-6-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:33 -04:00
Jing Zhang
87bcc5fa09 KVM: stats: Add halt_wait_ns stats for all architectures
Add simple stats halt_wait_ns to record the time a VCPU has spent on
waiting for all architectures (not just powerpc).

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:33 -04:00
Jing Zhang
f95937ccf5 KVM: stats: Support linear and logarithmic histogram statistics
Add new types of KVM stats, linear and logarithmic histogram.
Histogram are very useful for observing the value distribution
of time or size related stats.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:32 -04:00
Maxim Levitsky
edb298c663 KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range
This together with previous patch, ensures that
kvm_zap_gfn_range doesn't race with page fault
running on another vcpu, and will make this page fault code
retry instead.

This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:19 -04:00
Peter Xu
3165af738e KVM: Allow to have arch-specific per-vm debugfs files
Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory
besides the stats fields.  The new interface kvm_arch_create_vm_debugfs() is
defined but not yet used.  It's called after kvm->debugfs_dentry is created, so
it can be referenced directly in kvm_arch_create_vm_debugfs().  Arch should
define their own versions when they want to create extra debugfs nodes.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:17 -04:00
David Matlack
fe22ed827c KVM: Cache the last used slot index per vCPU
The memslot for a given gfn is looked up multiple times during page
fault handling. Avoid binary searching for it multiple times by caching
the most recently used slot. There is an existing VM-wide last_used_slot
but that does not work well for cases where vCPUs are accessing memory
in different slots (see performance data below).

Another benefit of caching the most recently use slot (versus looking
up the slot once and passing around a pointer) is speeding up memslot
lookups *across* faults and during spte prefetching.

To measure the performance of this change I ran dirty_log_perf_test with
64 vCPUs and 64 memslots and measured "Populate memory time" and
"Iteration 2 dirty memory time".  Tests were ran with eptad=N to force
dirty logging to use fast_page_fault so its performance could be
measured.

Config     | Metric                        | Before | After
---------- | ----------------------------- | ------ | ------
tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s

The "Iteration 2 dirty memory time" results are especially compelling
because they are equivalent to running the same test with a single
memslot. In other words, fast_page_fault performance no longer scales
with the number of memslots.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:29 -04:00
David Matlack
0f22af940d KVM: Move last_used_slot logic out of search_memslots
Make search_memslots unconditionally search all memslots and move the
last_used_slot logic up one level to __gfn_to_memslot. This is in
preparation for introducing a per-vCPU last_used_slot.

As part of this change convert existing callers of search_memslots to
__gfn_to_memslot to avoid making any functional changes.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:28 -04:00
David Matlack
87689270b1 KVM: Rename lru_slot to last_used_slot
lru_slot is used to keep track of the index of the most-recently used
memslot. The correct acronym would be "mru" but that is not a common
acronym. So call it last_used_slot which is a bit more obvious.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:28 -04:00
Paolo Bonzini
52ac8b358b KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM.  Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none.  Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.

A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier run, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".

   ======================================================
   WARNING: possible circular locking dependency detected
   5.12.0-rc3+ #6 Tainted: G           OE
   ------------------------------------------------------
   qemu-system-x86/3069 is trying to acquire lock:
   ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

   but task is already holding lock:
   ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

   which lock already depends on the new lock.

This corresponds to the following MMU notifier logic:

    invalidate_range_start
      take pseudo lock
      down_read()           (*)
      release pseudo lock
    invalidate_range_end
      take pseudo lock      (**)
      up_read()
      release pseudo lock

At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end sits waits on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock.  This also allows handling blockable and non-blockable
critical section in the same way.

Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 03:44:03 -04:00
Peter Xu
605c713023 KVM: Introduce kvm_get_kvm_safe()
Introduce this safe version of kvm_get_kvm() so that it can be called even
during vm destruction.  Use it in kvm_debugfs_open() and remove the verbose
comment.  Prepare to be used elsewhere.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-3-peterx@redhat.com>
[Preserve the comment in kvm_debugfs_open. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:46 -04:00
Sean Christopherson
7ee3e8c39d KVM: Export kvm_make_all_cpus_request() for use in marking VMs as bugged
Export kvm_make_all_cpus_request() and hoist the request helper
declarations of request up to the KVM_REQ_* definitions in preparation
for adding a "VM bugged" framework.  The framework will add KVM_BUG()
and KVM_BUG_ON() as alternatives to full BUG()/BUG_ON() for cases where
KVM has definitely hit a bug (in itself or in silicon) and the VM is all
but guaranteed to be hosed.  Marking a VM bugged will trigger a request
to all vCPUs to allow arch code to forcefully evict each vCPU from its
run loop.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <1d8cbbc8065d831343e70b5dcaea92268145eef1.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:36 -04:00
Sean Christopherson
0b8f11737c KVM: Add infrastructure and macro to mark VM as bugged
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 09:36:35 -04:00
Marc Zyngier
36c3ce6c0d KVM: Get rid of kvm_get_pfn()
Nobody is using kvm_get_pfn() anymore. Get rid of it.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210726153552.1535838-7-maz@kernel.org
2021-08-02 14:05:58 +01:00
Jing Zhang
bc9e9e672d KVM: debugfs: Reuse binary stats descriptors
To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:29 -04:00
Jing Zhang
ce55c04945 KVM: stats: Support binary stats retrieval for a VCPU
Add a VCPU ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VCPU stats header,
descriptors and data.
Define VCPU statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:19 -04:00
Jing Zhang
fcfe1baedd KVM: stats: Support binary stats retrieval for a VM
Add a VM ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VM stats header,
descriptors and data.
Define VM statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:10 -04:00
Jing Zhang
cb082bfab5 KVM: stats: Add fd-based API to read binary stats data
This commit defines the API for userspace and prepare the common
functionalities to support per VM/VCPU binary stats data readings.

The KVM stats now is only accessible by debugfs, which has some
shortcomings this change series are supposed to fix:
1. The current debugfs stats solution in KVM could be disabled
   when kernel Lockdown mode is enabled, which is a potential
   rick for production.
2. The current debugfs stats solution in KVM is organized as "one
   stats per file", it is good for debugging, but not efficient
   for production.
3. The stats read/clear in current debugfs solution in KVM are
   protected by the global kvm_lock.

Besides that, there are some other benefits with this change:
1. All KVM VM/VCPU stats can be read out in a bulk by one copy
   to userspace.
2. A schema is used to describe KVM statistics. From userspace's
   perspective, the KVM statistics are self-describing.
3. With the fd-based solution, a separate telemetry would be able
   to read KVM stats in a less privileged environment.
4. After the initial setup by reading in stats descriptors, a
   telemetry only needs to read the stats data itself, no more
   parsing or setup is needed.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-3-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 11:47:57 -04:00
Sergey Senozhatsky
2fdef3a2ae kvm: add PM-notifier
Add KVM PM-notifier so that architectures can have arch-specific
VM suspend/resume routines. Such architectures need to select
CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier().

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20210606021045.14159-1-senozhatsky@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:32 -04:00
Ben Gardon
b10a038e84 KVM: mmu: Add slots_arch_lock for memslot arch fields
Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-5-bgardon@google.com>
[Add Documentation/ hunk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Paolo Bonzini
4422829e80 kvm: fix previous commit for 32-bit builds
array_index_nospec does not work for uint64_t on 32-bit builds.
However, the size of a memory slot must be less than 20 bits wide
on those system, since the memory slot must fit in the user
address space.  So just store it in an unsigned long.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-09 01:49:13 -04:00
Paolo Bonzini
da27a83fd6 kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot.  The translation is
performed in __gfn_to_hva_memslot using the following formula:

      hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE

It is expected that gfn falls within the boundaries of the guest's
physical memory.  However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.

__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva.  If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.

Right now it's not clear if there are any cases in which this is
exploitable.  One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86.  Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier.  However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space.  Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets.  Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.

Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached.  This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.

Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 17:12:05 -04:00
Marcelo Tosatti
084071d5e9 KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM_REQ_UNBLOCK will be used to exit a vcpu from
its inner vcpu halt emulation loop.

Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
PowerPC to arch specific request bit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:57:38 -04:00
Wanpeng Li
6bd5b74368 KVM: PPC: exit halt polling on need_resched()
This is inspired by commit 262de4102c (kvm: exit halt polling on
need_resched() as well). Due to PPC implements an arch specific halt
polling logic, we have to the need_resched() check there as well. This
patch adds a helper function that can be shared between book3s and generic
halt-polling loops.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Jim Mattson <jmattson@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com>
[Make the function inline. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27 07:45:50 -04:00
Sean Christopherson
1ca0016c14 context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage
its context tracking vs. vtime accounting without bleeding too many KVM
details into the context tracking code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210505002735.1684165-8-seanjc@google.com
2021-05-05 22:54:12 +02:00
Wanpeng Li
52acd22faa KVM: Boost vCPU candidate in user mode which is delivering interrupt
Both lock holder vCPU and IPI receiver that has halted are condidate for
boost. However, the PLE handler was originally designed to deal with the
lock holder preemption problem. The Intel PLE occurs when the spinlock
waiter is in kernel mode. This assumption doesn't hold for IPI receiver,
they can be in either kernel or user mode. the vCPU candidate in user mode
will not be boosted even if they should respond to IPIs. Some benchmarks
like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most
of the time they are running in user mode. It can lead to a large number
of continuous PLE events because the IPI sender causes PLE events
repeatedly until the receiver is scheduled while the receiver is not
candidate for a boost.

This patch boosts the vCPU candidiate in user mode which is delivery
interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs
VM in over-subscribe scenario (The host machine is 2 socket, 48 cores,
96 HTs Intel CLX box). There is no performance regression for other
benchmarks like Unixbench spawn (most of the time contend read/write
lock in kernel mode), ebizzy (most of the time contend read/write sem
and TLB shoodtdown in kernel mode).

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-21 12:20:03 -04:00
Nathan Tempelman
54526d1fd5 KVM: x86: Support KVM VMs sharing SEV context
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.

For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.

Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-21 12:20:02 -04:00
Sean Christopherson
5d3c4c7938 KVM: Stop looking for coalesced MMIO zones if the bus is destroyed
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
fails to allocate memory for the new instance of the bus.  If it can't
instantiate a new bus, unregister_dev() destroys all devices _except_ the
target device.   But, it doesn't tell the caller that it obliterated the
bus and invoked the destructor for all devices that were on the bus.  In
the coalesced MMIO case, this can result in a deleted list entry
dereference due to attempting to continue iterating on coalesced_zones
after future entries (in the walk) have been deleted.

Opportunistically add curly braces to the for-loop, which encompasses
many lines but sneaks by without braces due to the guts being a single
if statement.

Fixes: f65886606c ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210412222050.876100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:51 -04:00