kernel_arpi

Author	SHA1	Message	Date
Yosry Ahmed	2af25795b7	BACKPORT: KVM: x86/mmu: count KVM mmu usage in secondary pagetable stats. Count the pages used by KVM mmu on x86 in memory stats under secondary pagetable stats (e.g. "SecPageTables" in /proc/meminfo) to give better visibility into the memory consumption of KVM mmu in a similar way to how normal user page tables are accounted. Add the inner helper in common KVM, ARM will also use it to count stats in a future commit. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> # generic KVM changes Link: https://lore.kernel.org/r/20220823004639.2387269-3-yosryahmed@google.com Link: https://lore.kernel.org/r/20220823004639.2387269-4-yosryahmed@google.com [sean: squash x86 usage to workaround modpost issues] Signed-off-by: Sean Christopherson <seanjc@google.com> Bug: 222044477 (cherry picked from commit 43a063cab325ee7cc50349967e536b3cd4e57f03) [vdonnefort@: Fix conflicts in mmu.c and tdp_mmu.c] Change-Id: I9b81155758e513504a87ea2d634f341652ed0630 Signed-off-by: Vincent Donnefort <vdonnefort@google.com>	2022-11-23 17:11:25 +00:00
Greg Kroah-Hartman	05dba81225	Merge 5.15.76 into android14-5.15 Changes in 5.15.76 r8152: add PID for the Lenovo OneLink+ Dock arm64/mm: Consolidate TCR_EL1 fields usb: gadget: uvc: consistently use define for headerlen usb: gadget: uvc: use on returned header len in video_encode_isoc_sg usb: gadget: uvc: rework uvcg_queue_next_buffer to uvcg_complete_buffer usb: gadget: uvc: giveback vb2 buffer on req complete usb: gadget: uvc: improve sg exit condition arm64: errata: Remove AES hwcap for COMPAT tasks perf/x86/intel/pt: Relax address filter validation btrfs: enhance unsupported compat RO flags handling ocfs2: clear dinode links count in case of error ocfs2: fix BUG when iput after ocfs2_mknod fails selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context() cpufreq: qcom: fix writes in read-only memory region i2c: qcom-cci: Fix ordering of pm_runtime_xx and i2c_add_adapter x86/microcode/AMD: Apply the patch early on every logical thread hwmon/coretemp: Handle large core ID value ata: ahci-imx: Fix MODULE_ALIAS ata: ahci: Match EM_MAX_SLOTS with SATA_PMP_MAX_PORTS x86/resctrl: Fix min_cbm_bits for AMD cpufreq: qcom: fix memory leak in error path drm/amdgpu: fix sdma doorbell init ordering on APUs mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages kvm: Add support for arch compat vm ioctls KVM: arm64: vgic: Fix exit condition in scan_its_table() media: ipu3-imgu: Fix NULL pointer dereference in active selection access media: mceusb: set timeout to at least timeout provided media: venus: dec: Handle the case where find_format fails x86/topology: Fix multiple packages shown on a single-package system x86/topology: Fix duplicated core ID within a package btrfs: fix processing of delayed data refs during backref walking btrfs: fix processing of delayed tree block refs during backref walking drm/vc4: Add module dependency on hdmi-codec ACPI: extlog: Handle multiple records tipc: Fix recognition of trial period tipc: fix an information 
leak in tipc_topsrv_kern_subscr i40e: Fix DMA mappings leak HID: magicmouse: Do not set BTN_MOUSE on double report sfc: Change VF mac via PF as first preference if available. net/atm: fix proc_mpc_write incorrect return value net: phy: dp83867: Extend RX strap quirk for SGMII mode net: phylink: add mac_managed_pm in phylink_config structure scsi: lpfc: Fix memory leak in lpfc_create_port() udp: Update reuse->has_conns under reuseport_lock. cifs: Fix xid leak in cifs_create() cifs: Fix xid leak in cifs_copy_file_range() cifs: Fix xid leak in cifs_flock() cifs: Fix xid leak in cifs_ses_add_channel() dm: remove unnecessary assignment statement in alloc_dev() net: hsr: avoid possible NULL deref in skb_clone() ionic: catch NULL pointer issue on reconfig netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements nvme-hwmon: consistently ignore errors from nvme_hwmon_init nvme-hwmon: kmalloc the NVME SMART log buffer nvmet: fix workqueue MEM_RECLAIM flushing dependency net: sched: cake: fix null pointer access issue when cake_init() fails net: sched: delete duplicate cleanup of backlog and qlen net: sched: sfb: fix null pointer access issue when sfb_init() fails sfc: include vport_id in filter spec hash and equal() wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new() net: hns: fix possible memory leak in hnae_ae_register() net: sched: fix race condition in qdisc_graft() net: phy: dp83822: disable MDI crossover status change interrupt iommu/vt-d: Allow NVS regions in arch_rmrr_sanity_check() iommu/vt-d: Clean up si_domain in the init_dmars() error path fs: dlm: fix invalid derefence of sb_lvbptr arm64: mte: move register initialization to C ksmbd: handle smb2 query dir request for OutputBufferLength that is too small ksmbd: fix incorrect handling of iterate_dir tracing: Simplify conditional compilation code in tracing_set_tracer() tracing: Do not free snapshot if tracer is on cmdline mmc: sdhci-tegra: Use actual clock rate for SW tuning correction 
perf: Skip and warn on unknown format 'configN' attrs ACPI: video: Force backlight native for more TongFang devices x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB Makefile.debug: re-enable debug info for .S files mmc: core: Add SD card quirk for broken discard mm: /proc/pid/smaps_rollup: fix no vma's null-deref Linux 5.15.76 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ica5b3f26c36900ff31ccac63f4fb55b52bff0ec2	2022-11-03 14:21:38 +09:00
Alexander Graf	5bf2fda26a	kvm: Add support for arch compat vm ioctls commit ed51862f2f57cbce6fed2d4278cfe70a490899fd upstream. We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-10-29 10:12:54 +02:00
Greg Kroah-Hartman	74ca15c523	Merge 5.15.70 into android14-5.15 Changes in 5.15.70 drm/tegra: vic: Fix build warning when CONFIG_PM=n serial: atmel: remove redundant assignment in rs485_config tty: serial: atmel: Preserve previous USART mode if RS485 disabled of: fdt: fix off-by-one error in unflatten_dt_nodes() pinctrl: qcom: sc8180x: Fix gpio_wakeirq_map pinctrl: qcom: sc8180x: Fix wrong pin numbers pinctrl: rockchip: Enhance support for IRQ_TYPE_EDGE_BOTH pinctrl: sunxi: Fix name for A100 R_PIO NFSv4: Turn off open-by-filehandle and NFS re-export for NFSv4.0 gpio: mpc8xxx: Fix support for IRQ_TYPE_LEVEL_LOW flow_type in mpc85xx drm/meson: Correct OSD1 global alpha value drm/meson: Fix OSD1 RGB to YCbCr coefficient block: blk_queue_enter() / __bio_queue_enter() must return -EAGAIN for nowait parisc: ccio-dma: Add missing iounmap in error path in ccio_probe() of/device: Fix up of_dma_configure_id() stub cifs: revalidate mapping when doing direct writes cifs: don't send down the destination address to sendmsg for a SOCK_STREAM cifs: always initialize struct msghdr smb_msg completely parisc: Allow CONFIG_64BIT with ARCH=parisc tools/include/uapi: Fix <asm/errno.h> for parisc and xtensa drm/amdgpu: Don't enable LTR if not supported drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega binder: remove inaccurate mmap_assert_locked() video: fbdev: i740fb: Error out if 'pixclock' equals zero arm64: dts: juno: Add missing MHU secure-irq ASoC: nau8824: Fix semaphore unbalance at error paths regulator: pfuze100: Fix the global-out-of-bounds access in pfuze100_regulator_probe() scsi: lpfc: Return DID_TRANSPORT_DISRUPTED instead of DID_REQUEUE rxrpc: Fix local destruction being repeated rxrpc: Fix calc of resend age wifi: mac80211_hwsim: check length for virtio packets ALSA: hda/sigmatel: Keep power up while beep is enabled ALSA: hda/tegra: Align BDL entry to 4KB boundary net: usb: qmi_wwan: add 
Quectel RM520N afs: Return -EAGAIN, not -EREMOTEIO, when a file already locked MIPS: OCTEON: irq: Fix octeon_irq_force_ciu_mapping() drm/panfrost: devfreq: set opp to the recommended one to configure regulator mksysmap: Fix the mismatch of 'L0' symbols in System.map video: fbdev: pxa3xx-gcu: Fix integer overflow in pxa3xx_gcu_write net: Find dst with sk's xfrm policy not ctl_sk KVM: SEV: add cache flush to solve SEV cache incoherency issues cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all() ALSA: hda/sigmatel: Fix unused variable warning for beep power change Linux 5.15.70 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Iea16cee2475ff8bca607e57fc8b0c4b71b0a6f56	2022-09-24 14:22:45 +02:00
Greg Kroah-Hartman	a449b299e8	Merge 5.15.69 into android14-5.15 Changes in 5.15.69 NFS: Fix WARN_ON due to unionization of nfs_inode.nrequests ACPI: resource: skip IRQ override on AMD Zen platforms ARM: dts: imx: align SPI NOR node name with dtschema ARM: dts: imx6qdl-kontron-samx6i: fix spi-flash compatible ARM: dts: at91: fix low limit for CPU regulator ARM: dts: at91: sama7g5ek: specify proper regulator output ranges lockdep: Fix -Wunused-parameter for _THIS_IP_ x86/mm: Force-inline __phys_addr_nodebug() task_stack, x86/cea: Force-inline stack helpers tracing: hold caller_addr to hardirq_{enable,disable}_ip tracefs: Only clobber mode/uid/gid on remount if asked iommu/vt-d: Fix kdump kernels boot failure with scalable mode Input: goodix - add support for GT1158 platform/surface: aggregator_registry: Add support for Surface Laptop Go 2 drm/msm/rd: Fix FIFO-full deadlock dt-bindings: iio: gyroscope: bosch,bmg160: correct number of pins HID: ishtp-hid-clientHID: ishtp-hid-client: Fix comment typo hid: intel-ish-hid: ishtp: Fix ishtp client sending disordered message tg3: Disable tg3 device on system reboot to avoid triggering AER gpio: mockup: remove gpio debugfs when remove device ieee802154: cc2520: add rc code in cc2520_tx() Input: iforce - add support for Boeder Force Feedback Wheel nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change() drm/amd/amdgpu: skip ucode loading if ucode_size == 0 net: dsa: hellcreek: Print warning only once perf/arm_pmu_platform: fix tests for platform_get_irq() failure platform/x86: acer-wmi: Acer Aspire One AOD270/Packard Bell Dot keymap fixes usb: storage: Add ASUS <0x0b05:0x1932> to IGNORE_UAS mm: Fix TLB flush for not-first PFNMAP mappings in unmap_region() soc: fsl: select FSL_GUTS driver for DPIO usb: gadget: f_uac2: clean up some inconsistent indenting usb: gadget: f_uac2: fix superspeed transfer RDMA/irdma: Use s/g array in post send only when its valid Input: goodix - add compatible string for GT1158 Linux 5.15.69 
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ifcadf79f34eb6093489fb3faf5e42c9739e56522	2022-09-24 14:14:08 +02:00
Mingwei Zhang	39b0235284	KVM: SEV: add cache flush to solve SEV cache incoherency issues commit 683412ccf61294d727ead4a73d97397396e69a6b upstream. Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests, have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by a VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remains pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicious userspace can crash the host kernel by creating a malicious VM that continuously allocates/releases unpinned confidential memory pages while the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM gets flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook runs after releasing the mmu lock to avoid contention with other vCPUs. 
Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-09-23 14:15:52 +02:00
Nick Desaulniers	f9571a9699	lockdep: Fix -Wunused-parameter for _THIS_IP_ [ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ] While looking into a bug related to the compiler's handling of addresses of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep. Drive by cleanup. -Wunused-parameter: kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip' kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip' kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip' Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip") Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-09-20 12:39:42 +02:00
Sean Christopherson	a2aa2a6c94	BACKPORT: KVM: Move x86's perf guest info callbacks to generic KVM Move x86's perf guest callbacks into common KVM, as they are semantically identical to arm64's callbacks (the only other such KVM callbacks). arm64 will convert to the common versions in a future patch. Implement the necessary arm64 arch hooks now to avoid having to provide stubs or a temporary #define (from x86) to avoid arm64 compilation errors when CONFIG_GUEST_PERF_EVENTS=y. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com (cherry picked from commit e1bfc24577cc65c95dc519d7621a9c985b97e567) [willdeacon@: Fix context conflict in x86 asm/kvm_host.h] Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: Id8184005f4ccc370512bdbf3d3f18612dcee1cd8	2022-08-10 07:29:13 +00:00
Marc Zyngier	c437aada2a	UPSTREAM: KVM: Convert kvm_for_each_vcpu() to using xa_for_each_range() Now that the vcpu array is backed by an xarray, use the optimised iterator that matches the underlying data structure. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-8-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit 214bd3a6f46981b7867946e1b4f628a06bcf2091) Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: I262005faec3a52c5f2f698d943920c6136bb8144	2022-08-10 07:29:13 +00:00
Marc Zyngier	e3031d06f7	BACKPORT: KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index Everywhere we use kvm_for_each_vcpu(), we use an int as the vcpu index. Unfortunately, we're about to rework the iterator, which requires this to be upgraded to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit 46808a4cb89708c2e5b264eb9d1035762581921b) [willdeacon@: Drop riscv hunks; drop x86 hunks for code that isn't present; convert kvm_hyperv_tsc_notifier() as well] Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: I78a3ae244de22575684dd47a9823b9fc61b6fa7d	2022-08-10 07:29:13 +00:00
Marc Zyngier	37766cea76	UPSTREAM: KVM: Convert the kvm->vcpus array to a xarray At least on arm64 and x86, the vcpus array is pretty huge (up to 1024 entries on x86) and is mostly empty in the majority of the cases (running 1k vcpu VMs is not that common). This means that we end up with a 4kB block of unused memory in the middle of the kvm structure. Instead of wasting away this memory, let's use an xarray instead, which gives us almost the same flexibility as a normal array, but with a reduced memory usage with smaller VMs. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-6-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit c5b077549136584618a66258f09d8d4b41e7409c) Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: I5c89adb8caf5b167536b0e51590c7ee7ec0363d9	2022-08-10 07:29:13 +00:00
Marc Zyngier	25f5ae814c	BACKPORT: KVM: Move wiping of the kvm->vcpus array to common code All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there are no further references on the kvm structure. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-2-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit 27592ae8dbe41033261b6fdf27d78998aabd2665) [willdeacon@: Drop riscv changes; fix conflict in arm64/kvm/arm.c the same way as upstream merge commit 7fd55a02a426 ("Merge tag 'kvmarm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")] Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: If63c359dc841fca1a9baf73283a7a014058b2c97	2022-08-10 07:29:13 +00:00
Vitaly Kuznetsov	2adcfdc6a6	UPSTREAM: KVM: Drop stale kvm_is_transparent_hugepage() declaration kvm_is_transparent_hugepage() was removed in commit `205d76ff06` ("KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its declaration in include/linux/kvm_host.h persisted. Drop it. Fixes: `205d76ff06` (""KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com (cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8) Signed-off-by: Will Deacon <willdeacon@google.com> Bug: 233587962 Bug: 233588291 Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2	2022-08-08 14:57:53 +01:00
Will Deacon	f884005542	Revert "FROMGIT: KVM: Drop stale kvm_is_transparent_hugepage() declaration" This reverts commit `03761cf7c7`. Bug: 233587962 Signed-off-by: Will Deacon <willdeacon@google.com> Change-Id: I8d3741c7173b36b4e8f0d651a3f6600b0bbd9b72	2022-08-04 13:03:53 +00:00
Greg Kroah-Hartman	4f868bc314	Merge 5.15.57 into android14-5.15 Changes in 5.15.57 x86/traps: Use pt_regs directly in fixup_bad_iret() x86/entry: Switch the stack after error_entry() returns x86/entry: Move PUSH_AND_CLEAR_REGS out of error_entry() x86/entry: Don't call error_entry() for XENPV objtool: Classify symbols objtool: Explicitly avoid self modifying code in .altinstr_replacement objtool: Shrink struct instruction objtool,x86: Replace alternatives with .retpoline_sites objtool: Introduce CFI hash x86/retpoline: Remove unused replacement symbols x86/asm: Fix register order x86/asm: Fixup odd GEN-for-each-reg.h usage x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h x86/retpoline: Create a retpoline thunk array x86/alternative: Implement .retpoline_sites support x86/alternative: Handle Jcc __x86_indirect_thunk_\reg x86/alternative: Try inline spectre_v2=retpoline,amd x86/alternative: Add debug prints to apply_retpolines() bpf,x86: Simplify computing label offsets bpf,x86: Respect X86_FEATURE_RETPOLINE* objtool: Default ignore INT3 for unreachable x86/entry: Remove skip_r11rcx x86/realmode: build with -D__DISABLE_EXPORTS x86/kvm/vmx: Make noinstr clean x86/cpufeatures: Move RETPOLINE flags to word 11 x86/retpoline: Cleanup some #ifdefery x86/retpoline: Swizzle retpoline thunk x86/retpoline: Use -mfunction-return x86: Undo return-thunk damage x86,objtool: Create .return_sites objtool: skip non-text sections when adding return-thunk sites x86,static_call: Use alternative RET encoding x86/ftrace: Use alternative RET encoding x86/bpf: Use alternative RET encoding x86/kvm: Fix SETcc emulation for return thunks x86/vsyscall_emu/64: Don't use RET in vsyscall emulation x86/sev: Avoid using __x86_return_thunk x86: Use return-thunk in asm code x86/entry: Avoid very early RET objtool: Treat .text.__x86.* as noinstr x86: Add magic AMD return-thunk x86/bugs: Report AMD retbleed vulnerability x86/bugs: Add AMD retbleed= boot parameter x86/bugs: Enable 
STIBP for JMP2RET x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value x86/entry: Add kernel IBRS implementation x86/bugs: Optimize SPEC_CTRL MSR writes x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS x86/bugs: Split spectre_v2_select_mitigation() and spectre_v2_user_select_mitigation() x86/bugs: Report Intel retbleed vulnerability intel_idle: Disable IBRS during long idle objtool: Update Retpoline validation x86/xen: Rename SYS* entry points x86/xen: Add UNTRAIN_RET x86/bugs: Add retbleed=ibpb x86/bugs: Do IBPB fallback check only once objtool: Add entry UNRET validation x86/cpu/amd: Add Spectral Chicken x86/speculation: Fix RSB filling with CONFIG_RETPOLINE=n x86/speculation: Fix firmware entry SPEC_CTRL handling x86/speculation: Fix SPEC_CTRL write on SMT state change x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit x86/speculation: Remove x86_spec_ctrl_mask objtool: Re-add UNWIND_HINT_{SAVE_RESTORE} KVM: VMX: Flatten __vmx_vcpu_run() KVM: VMX: Convert launched argument to flags KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS KVM: VMX: Fix IBRS handling after vmexit x86/speculation: Fill RSB on vmexit for IBRS x86/common: Stamp out the stepping madness x86/cpu/amd: Enumerate BTC_NO x86/retbleed: Add fine grained Kconfig knobs x86/bugs: Add Cannon lake to RETBleed affected CPU list x86/entry: Move PUSH_AND_CLEAR_REGS() back into error_entry x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported x86/kexec: Disable RET on kexec x86/speculation: Disable RRSBA behavior x86/static_call: Serialize __static_call_fixup() properly x86/xen: Fix initialisation in hypercall_page after rethunk x86/asm/32: Fix ANNOTATE_UNRET_SAFE use on 32-bit x86/speculation: Use DECLARE_PER_CPU for x86_spec_ctrl_current efi/x86: use naked RET on mixed mode call wrapper x86/kvm: fix FASTOP_SIZE when return thunks are enabled KVM: emulate: do not adjust size of fastop and setcc subroutines tools arch x86: Sync the msr-index.h copy with the 
kernel sources tools headers cpufeatures: Sync with the kernel sources x86/bugs: Remove apostrophe typo um: Add missing apply_returns() x86: Use -mindirect-branch-cs-prefix for RETPOLINE builds Linux 5.15.57 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I7d0a3c3eb4be1e5401c2678fdb6229523486146f	2022-07-23 13:51:05 +02:00
Peter Zijlstra	ccb25d7db1	x86/kvm/vmx: Make noinstr clean commit 742ab6df974ae8384a2dd213db1a3a06cf6d8936 upstream. The recent mmio_stale_data fixes broke the noinstr constraints: vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x15b: call to wrmsrl.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x1bf: call to kvm_arch_has_assigned_device() leaves .noinstr.text section make it all happy again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-07-23 12:53:57 +02:00
Greg Kroah-Hartman	a74d4e284c	Merge 5.15.22 into android13-5.15 Changes in 5.15.22 drm/i915: Disable DSB usage for now selinux: fix double free of cond_list on error paths audit: improve audit queue handling when "audit=1" on cmdline ipc/sem: do not sleep with a spin lock held spi: stm32-qspi: Update spi registering ASoC: hdmi-codec: Fix OOB memory accesses ASoC: ops: Reject out of bounds values in snd_soc_put_volsw() ASoC: ops: Reject out of bounds values in snd_soc_put_volsw_sx() ASoC: ops: Reject out of bounds values in snd_soc_put_xr_sx() ALSA: usb-audio: Correct quirk for VF0770 ALSA: hda: Fix UAF of leds class devs at unbinding ALSA: hda: realtek: Fix race at concurrent COEF updates ALSA: hda/realtek: Add quirk for ASUS GU603 ALSA: hda/realtek: Add missing fixup-model entry for Gigabyte X570 ALC1220 quirks ALSA: hda/realtek: Fix silent output on Gigabyte X570S Aorus Master (newer chipset) ALSA: hda/realtek: Fix silent output on Gigabyte X570 Aorus Xtreme after reboot from Windows btrfs: don't start transaction for scrub if the fs is mounted read-only btrfs: fix deadlock between quota disable and qgroup rescan worker btrfs: fix use-after-free after failure to create a snapshot Revert "fs/9p: search open fids first" drm/nouveau: fix off by one in BIOS boundary checking drm/i915/adlp: Fix TypeC PHY-ready status readout drm/amd/pm: correct the MGpuFanBoost support for Beige Goby drm/amd/display: watermark latencies is not enough on DCN31 drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina panels nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts() mm/debug_vm_pgtable: remove pte entry from the page table mm/pgtable: define pte_index so that preprocessor could recognize it mm/kmemleak: avoid scanning potential huge holes block: bio-integrity: Advance seed correctly for larger interval sizes dma-buf: heaps: Fix potential spectre v1 gadget IB/hfi1: Fix AIP early init panic Revert "fbcon: Disable accelerated scrolling" fbcon: Add 
option to enable legacy hardware acceleration mptcp: fix msk traversal in mptcp_nl_cmd_set_flags() Revert "ASoC: mediatek: Check for error clk pointer" KVM: arm64: Avoid consuming a stale esr value when SError occur KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs RDMA/cma: Use correct address when leaving multicast group RDMA/ucma: Protect mc during concurrent multicast leaves RDMA/siw: Fix refcounting leak in siw_create_qp() IB/rdmavt: Validate remote_addr during loopback atomic tests RDMA/siw: Fix broken RDMA Read Fence/Resume logic. RDMA/mlx4: Don't continue event handler after memory allocation failure ALSA: usb-audio: initialize variables that could ignore errors ALSA: hda: Fix signedness of sscanf() arguments ALSA: hda: Skip codec shutdown in case the codec is not registered iommu/vt-d: Fix potential memory leak in intel_setup_irq_remapping() iommu/amd: Fix loop timeout issue in iommu_ga_log_enable() spi: bcm-qspi: check for valid cs before applying chip select spi: mediatek: Avoid NULL pointer crash in interrupt spi: meson-spicc: add IRQ check in meson_spicc_probe spi: uniphier: fix reference count leak in uniphier_spi_probe() IB/hfi1: Fix tstats alloc and dealloc IB/cm: Release previously acquired reference counter in the cm_id_priv net: ieee802154: hwsim: Ensure proper channel selection at probe time net: ieee802154: mcr20a: Fix lifs/sifs periods net: ieee802154: ca8210: Stop leaking skb's netfilter: nft_reject_bridge: Fix for missing reply from prerouting net: ieee802154: Return meaningful error codes from the netlink helpers net/smc: Forward wakeup to smc socket waitqueue after fallback net: stmmac: dwmac-visconti: No change to ETHER_CLOCK_SEL for unexpected speed request. 
net: stmmac: properly handle with runtime pm in stmmac_dvr_remove() net: macsec: Fix offload support for NETDEV_UNREGISTER event net: macsec: Verify that send_sci is on when setting Tx sci explicitly net: stmmac: dump gmac4 DMA registers correctly net: stmmac: ensure PTP time register reads are consistent drm/kmb: Fix for build errors with Warray-bounds drm/i915/overlay: Prevent divide by zero bugs in scaling drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled ASoC: fsl: Add missing error handling in pcm030_fabric_probe ASoC: xilinx: xlnx_formatter_pcm: Make buffer bytes multiple of period bytes ASoC: simple-card: fix probe failure on platform component ASoC: cpcap: Check for NULL pointer after calling of_get_child_by_name ASoC: max9759: fix underflow in speaker_gain_control_put() ASoC: codecs: wcd938x: fix incorrect used of portid ASoC: codecs: lpass-rx-macro: fix sidetone register offsets ASoC: codecs: wcd938x: fix return value of mixer put function pinctrl: sunxi: Fix H616 I2S3 pin data pinctrl: intel: Fix a glitch when updating IRQ flags on a preconfigured line pinctrl: intel: fix unexpected interrupt pinctrl: bcm2835: Fix a few error paths scsi: bnx2fc: Make bnx2fc_recv_frame() mp safe nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client. 
gve: fix the wrong AdminQ buffer queue index check bpf: Use VM_MAP instead of VM_ALLOC for ringbuf selftests/exec: Remove pipe from TEST_GEN_FILES selftests: futex: Use variable MAKE instead of make tools/resolve_btfids: Do not print any commands when building silently e1000e: Separate ADP board type from TGP rtc: cmos: Evaluate century appropriate kvm: add guest_state_{enter,exit}_irqoff() kvm/arm64: rework guest entry logic perf: Copy perf_event_attr::sig_data on modification perf stat: Fix display of grouped aliased events perf/x86/intel/pt: Fix crash with stop filters in single-range mode x86/perf: Default set FREEZE_ON_SMI for all EDAC/altera: Fix deferred probing EDAC/xgene: Fix deferred probing ext4: prevent used blocks from being allocated during fast commit replay ext4: modify the logic of ext4_mb_new_blocks_simple ext4: fix error handling in ext4_restore_inline_data() ext4: fix error handling in ext4_fc_record_modified_inode() ext4: fix incorrect type issue during replay_del_range net: dsa: mt7530: make NET_DSA_MT7530 select MEDIATEK_GE_PHY cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning tools include UAPI: Sync sound/asound.h copy with the kernel sources gpio: idt3243x: Fix an ignored error return from platform_get_irq() gpio: mpc8xxx: Fix an ignored error return from platform_get_irq() selftests: nft_concat_range: add test for reload with no element add/del selftests: netfilter: check stateless nat udp checksum fixup Linux 5.15.22 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I9143b858b768a8497c1df9440a74d8c105c32271	2022-02-09 08:15:44 +01:00
Mark Rutland	83071e2dad	kvm: add guest_state_{enter,exit}_irqoff() commit ef9989afda73332df566852d6e9ca695c05f10ce upstream. When transitioning to/from guest mode, it is necessary to inform lockdep, tracing, and RCU in a specific order, similar to the requirements for transitions to/from user mode. Additionally, it is necessary to perform vtime accounting for a window around running the guest, with RCU enabled, such that timer interrupts taken from the guest can be accounted as guest time. Most architectures don't handle all the necessary pieces, and have a number of common bugs, including unsafe usage of RCU during the window between guest_enter() and guest_exit(). On x86, this was dealt with across commits: `87fa7f3e98` ("x86/kvm: Move context tracking where it belongs") `0642391e21` ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit") `9fc975e9ef` ("x86/kvm/svm: Add hardirq tracing on guest enter/exit") `3ebccdf373` ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text") `135961e0a7` ("x86/kvm/svm: Move guest enter/exit into .noinstr.text") `1604571401` ("KVM: x86: Defer vtime accounting 'til after IRQ handling") `bc908e091b` ("KVM: x86: Consolidate guest enter/exit logic to common helpers") ... but those fixes are specific to x86, and as the resulting logic (while correct) is split across generic helper functions and x86-specific helper functions, it is difficult to see that the entry/exit accounting is balanced. This patch adds generic helpers which architectures can use to handle guest entry/exit consistently and correctly. The guest_{enter,exit}() helpers are split into guest_timing_{enter,exit}() to perform vtime accounting, and guest_context_{enter,exit}() to perform the necessary context tracking and RCU management. The existing guest_{enter,exit}() helpers are left as wrappers of these. Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff() helpers are added to handle the ordering of lockdep, tracing, and RCU management. 
These are intended to mirror exit_to_user_mode() and enter_from_user_mode(). Subsequent patches will migrate architectures over to the new helpers, following a sequence: guest_timing_enter_irqoff(); guest_state_enter_irqoff(); < run the vcpu > guest_state_exit_irqoff(); < take any pending IRQs > guest_timing_exit_irqoff(); This sequence handles all of the above correctly, and more clearly balances the entry and exit portions, making it easier to understand. The existing helpers are marked as deprecated, and will be removed once all architectures have been converted. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-02-08 18:34:12 +01:00
Vitaly Kuznetsov	03761cf7c7	FROMGIT: KVM: Drop stale kvm_is_transparent_hugepage() declaration kvm_is_transparent_hugepage() was removed in commit `205d76ff06` ("KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") but its declaration in include/linux/kvm_host.h persisted. Drop it. Fixes: `205d76ff06` ("KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211018151407.2107363-1-vkuznets@redhat.com (cherry picked from commit f0e6e6fa41b3d2aa1dcb61dd4ed6d7be004bb5a8 git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next) Bug: 209777660 Signed-off-by: Will Deacon <willdeacon@google.com> Change-Id: I9078ab62be40bc843ca2959f929ed22c1b8888e2	2021-12-09 09:42:19 +00:00
Lai Jiangshan	6bc6db0002	KVM: Remove tlbs_dirty There is no user of tlbs_dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-23 11:01:12 -04:00
Sean Christopherson	4eeef24241	KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor Read vcpu->vcpu_idx directly instead of bouncing through the one-line wrapper, kvm_vcpu_get_idx(), and drop the wrapper. The wrapper is a remnant of the original implementation and serves no purpose; remove it before it gains more users. Back when kvm_vcpu_get_idx() was added by commit `497d72d80a` ("KVM: Add kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation was more than just a simple wrapper as vcpu->vcpu_idx did not exist and retrieving the index meant walking over the vCPU array to find the given vCPU. When vcpu_idx was introduced by commit `8750e72a79` ("KVM: remember position in kvm->vcpus array"), the helper was left behind, likely to avoid extra thrash (but even then there were only two users, the original arm usage having been removed at some point in the past). No functional change intended. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210910183220.2397812-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-22 10:33:11 -04:00
Paolo Bonzini	e99314a340	Merge tag 'kvmarm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 5.15 - Page ownership tracking between host EL1 and EL2 - Rely on userspace page tables to create large stage-2 mappings - Fix incompatibility between pKVM and kmemleak - Fix the PMU reset state, and improve the performance of the virtual PMU - Move over to the generic KVM entry code - Address PSCI reset issues w.r.t. save/restore - Preliminary rework for the upcoming pKVM fixed feature - A bunch of MM cleanups - a vGIC fix for timer spurious interrupts - Various cleanups	2021-09-06 06:34:48 -04:00
Jing Zhang	3cc4e148b9	KVM: stats: Add VM stat for remote tlb flush requests Add a new stat that counts the number of times a remote TLB flush is requested, regardless of whether it kicks vCPUs out of guest mode. This allows us to look at how often flushes are initiated. Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based TLB flush implementation, so apply it there too. Original-by: David Matlack <dmatlack@google.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210817002639.3856694-1-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-06 06:30:45 -04:00
Jing Zhang	8ccba534a1	KVM: stats: Add halt polling related histogram stats Add three log histogram stats to record the distribution of time spent on successful polling, failed polling and VCPU wait. halt_poll_success_hist: Distribution of spent time for a successful poll. halt_poll_fail_hist: Distribution of spent time for a failed poll. halt_wait_hist: Distribution of time a VCPU has spent on waiting. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-6-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-20 16:06:33 -04:00
Jing Zhang	87bcc5fa09	KVM: stats: Add halt_wait_ns stats for all architectures Add simple stats halt_wait_ns to record the time a VCPU has spent on waiting for all architectures (not just powerpc). Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-20 16:06:33 -04:00
Jing Zhang	f95937ccf5	KVM: stats: Support linear and logarithmic histogram statistics Add new types of KVM stats, linear and logarithmic histogram. Histograms are very useful for observing the value distribution of time or size related stats. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-20 16:06:32 -04:00
Maxim Levitsky	edb298c663	KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range This together with previous patch, ensures that kvm_zap_gfn_range doesn't race with page fault running on another vcpu, and will make this page fault code retry instead. This is based on a patch suggested by Sean Christopherson: https://lkml.org/lkml/2021/7/22/1025 Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-20 16:06:19 -04:00
Peter Xu	3165af738e	KVM: Allow to have arch-specific per-vm debugfs files Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory besides the stats fields. The new interface kvm_arch_create_vm_debugfs() is defined but not yet used. It's called after kvm->debugfs_dentry is created, so it can be referenced directly in kvm_arch_create_vm_debugfs(). Arch should define their own versions when they want to create extra debugfs nodes. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-13 03:35:17 -04:00
David Matlack	fe22ed827c	KVM: Cache the last used slot index per vCPU The memslot for a given gfn is looked up multiple times during page fault handling. Avoid binary searching for it multiple times by caching the most recently used slot. There is an existing VM-wide last_used_slot but that does not work well for cases where vCPUs are accessing memory in different slots (see performance data below). Another benefit of caching the most recently used slot (versus looking up the slot once and passing around a pointer) is speeding up memslot lookups across faults and during spte prefetching. To measure the performance of this change I ran dirty_log_perf_test with 64 vCPUs and 64 memslots and measured "Populate memory time" and "Iteration 2 dirty memory time". Tests were run with eptad=N to force dirty logging to use fast_page_fault so its performance could be measured. Config \| Metric \| Before \| After ---------- \| ----------------------------- \| ------ \| ------ tdp_mmu=Y \| Populate memory time \| 6.76s \| 5.47s tdp_mmu=Y \| Iteration 2 dirty memory time \| 2.83s \| 0.31s tdp_mmu=N \| Populate memory time \| 20.4s \| 18.7s tdp_mmu=N \| Iteration 2 dirty memory time \| 2.65s \| 0.30s The "Iteration 2 dirty memory time" results are especially compelling because they are equivalent to running the same test with a single memslot. In other words, fast_page_fault performance no longer scales with the number of memslots. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-06 07:52:29 -04:00
David Matlack	0f22af940d	KVM: Move last_used_slot logic out of search_memslots Make search_memslots unconditionally search all memslots and move the last_used_slot logic up one level to __gfn_to_memslot. This is in preparation for introducing a per-vCPU last_used_slot. As part of this change convert existing callers of search_memslots to __gfn_to_memslot to avoid making any functional changes. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-06 07:52:28 -04:00
David Matlack	87689270b1	KVM: Rename lru_slot to last_used_slot lru_slot is used to keep track of the index of the most-recently used memslot. The correct acronym would be "mru" but that is not a common acronym. So call it last_used_slot which is a bit more obvious. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-06 07:52:28 -04:00
Paolo Bonzini	52ac8b358b	KVM: Block memslot updates across range_start() and range_end() We would like to avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. Because mmu_notifier_count must be modified while holding mmu_lock for write, and must always be paired across start->end to stay balanced, lock elision must happen in both or none. Therefore, in preparation for this change, this patch prevents memslot updates across range_start() and range_end(). Note, technically flag-only memslot updates could be allowed in parallel, but stalling a memslot update for a relatively short amount of time is not a scalability issue, and this is all more than complex enough. A long note on the locking: a previous version of the patch used an rwsem to block the memslot update while the MMU notifier runs, but this resulted in the following deadlock involving the pseudo-lock tagged as "mmu_notifier_invalidate_range_start". ====================================================== WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE ------------------------------------------------------ qemu-system-x86/3069 is trying to acquire lock: ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] which lock already depends on the new lock. This corresponds to the following MMU notifier logic: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the shared rwsem readers into shared recursive readers), so open-code the wait using a readers count and a spinlock. This also allows handling blockable and non-blockable critical section in the same way. Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-03 03:44:03 -04:00
Peter Xu	605c713023	KVM: Introduce kvm_get_kvm_safe() Introduce this safe version of kvm_get_kvm() so that it can be called even during vm destruction. Use it in kvm_debugfs_open() and remove the verbose comment. Prepare to be used elsewhere. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210625153214.43106-3-peterx@redhat.com> [Preserve the comment in kvm_debugfs_open. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-02 11:01:46 -04:00
Sean Christopherson	7ee3e8c39d	KVM: Export kvm_make_all_cpus_request() for use in marking VMs as bugged Export kvm_make_all_cpus_request() and hoist the request helper declarations of request up to the KVM_REQ_* definitions in preparation for adding a "VM bugged" framework. The framework will add KVM_BUG() and KVM_BUG_ON() as alternatives to full BUG()/BUG_ON() for cases where KVM has definitely hit a bug (in itself or in silicon) and the VM is all but guaranteed to be hosed. Marking a VM bugged will trigger a request to all vCPUs to allow arch code to forcefully evict each vCPU from its run loop. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <1d8cbbc8065d831343e70b5dcaea92268145eef1.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-02 09:36:36 -04:00
Sean Christopherson	0b8f11737c	KVM: Add infrastructure and macro to mark VM as bugged Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-08-02 09:36:35 -04:00
Marc Zyngier	36c3ce6c0d	KVM: Get rid of kvm_get_pfn() Nobody is using kvm_get_pfn() anymore. Get rid of it. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-7-maz@kernel.org	2021-08-02 14:05:58 +01:00
Jing Zhang	bc9e9e672d	KVM: debugfs: Reuse binary stats descriptors To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-8-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:29 -04:00
Jing Zhang	ce55c04945	KVM: stats: Support binary stats retrieval for a VCPU Add a VCPU ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:19 -04:00
Jing Zhang	fcfe1baedd	KVM: stats: Support binary stats retrieval for a VM Add a VM ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VM stats header, descriptors and data. Define VM statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-4-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:10 -04:00
Jing Zhang	cb082bfab5	KVM: stats: Add fd-based API to read binary stats data This commit defines the API for userspace and prepares the common functionalities to support per VM/VCPU binary stats data readings. KVM stats are currently accessible only through debugfs, which has some shortcomings this change series is supposed to fix: 1. The current debugfs stats solution in KVM could be disabled when kernel Lockdown mode is enabled, which is a potential risk for production. 2. The current debugfs stats solution in KVM is organized as "one stat per file"; it is good for debugging, but not efficient for production. 3. The stats read/clear in current debugfs solution in KVM are protected by the global kvm_lock. Besides that, there are some other benefits with this change: 1. All KVM VM/VCPU stats can be read out in a bulk by one copy to userspace. 2. A schema is used to describe KVM statistics. From userspace's perspective, the KVM statistics are self-describing. 3. With the fd-based solution, a separate telemetry would be able to read KVM stats in a less privileged environment. 4. After the initial setup by reading in stats descriptors, a telemetry only needs to read the stats data itself, no more parsing or setup is needed. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-3-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 11:47:57 -04:00
Sergey Senozhatsky	2fdef3a2ae	kvm: add PM-notifier Add KVM PM-notifier so that architectures can have arch-specific VM suspend/resume routines. Such architectures need to select CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier(). Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20210606021045.14159-1-senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:32 -04:00
Ben Gardon	b10a038e84	KVM: mmu: Add slots_arch_lock for memslot arch fields Add a new lock to protect the arch-specific fields of memslots if they need to be modified in a kvm->srcu read critical section. A future commit will use this lock to lazily allocate memslot rmaps for x86. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210518173414.450044-5-bgardon@google.com> [Add Documentation/ hunk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:26 -04:00
Paolo Bonzini	4422829e80	kvm: fix previous commit for 32-bit builds array_index_nospec does not work for uint64_t on 32-bit builds. However, the size of a memory slot must be less than 20 bits wide on those systems, since the memory slot must fit in the user address space. So just store it in an unsigned long. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-09 01:49:13 -04:00
Paolo Bonzini	da27a83fd6	kvm: avoid speculation-based attacks from out-of-range memslot accesses KVM's mechanism for accessing guest memory translates a guest physical address (gpa) to a host virtual address using the right-shifted gpa (also known as gfn) and a struct kvm_memory_slot. The translation is performed in __gfn_to_hva_memslot using the following formula: hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE It is expected that gfn falls within the boundaries of the guest's physical memory. However, a guest can access invalid physical addresses in such a way that the gfn is invalid. __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot does check that the gfn falls within the boundaries of the guest's physical memory or not, a CPU can speculate the result of the check and continue execution speculatively using an illegal gfn. The speculation can result in calculating an out-of-bounds hva. If the resulting host virtual address is used to load another guest physical address, this is effectively a Spectre gadget consisting of two consecutive reads, the second of which is data dependent on the first. Right now it's not clear if there are any cases in which this is exploitable. One interesting case was reported by the original author of this patch, and involves visiting guest page tables on x86. Right now these are not vulnerable because the hva read goes through get_user(), which contains an LFENCE speculation barrier. However, there are patches in progress for x86 uaccess.h to mask kernel addresses instead of using LFENCE; once these land, a guest could use speculation to read from the VMM's ring 3 address space. Other architectures such as ARM already use the address masking method, and would be susceptible to this same kind of data-dependent access gadgets. 
Therefore, this patch proactively protects from these attacks by masking out-of-bounds gfns in __gfn_to_hva_memslot, which blocks speculation of invalid hvas. Sean Christopherson noted that this patch does not cover kvm_read_guest_offset_cached. This however is limited to a few bytes past the end of the cache, and therefore it is unlikely to be useful in the context of building a chain of data dependent accesses. Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-08 17:12:05 -04:00
Marcelo Tosatti	084071d5e9	KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt emulation loop. Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch PowerPC to arch specific request bit. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210525134321.303768132@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-05-27 07:57:38 -04:00
Wanpeng Li	6bd5b74368	KVM: PPC: exit halt polling on need_resched() This is inspired by commit `262de4102c` (kvm: exit halt polling on need_resched() as well). Because PPC implements arch-specific halt polling logic, we have to add the need_resched() check there as well. This patch adds a helper function that can be shared between book3s and generic halt-polling loops. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Ben Segall <bsegall@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Jim Mattson <jmattson@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com> [Make the function inline. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-05-27 07:45:50 -04:00
Sean Christopherson	1ca0016c14	context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage its context tracking vs. vtime accounting without bleeding too many KVM details into the context tracking code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210505002735.1684165-8-seanjc@google.com	2021-05-05 22:54:12 +02:00
Wanpeng Li	52acd22faa	KVM: Boost vCPU candidate in user mode which is delivering interrupt Both the lock holder vCPU and an IPI receiver that has halted are candidates for boost. However, the PLE handler was originally designed to deal with the lock holder preemption problem. The Intel PLE occurs when the spinlock waiter is in kernel mode. This assumption doesn't hold for IPI receivers; they can be in either kernel or user mode. A vCPU candidate in user mode will not be boosted even if it should respond to IPIs. Some benchmarks like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most of the time they are running in user mode. It can lead to a large number of continuous PLE events because the IPI sender causes PLE events repeatedly until the receiver is scheduled while the receiver is not a candidate for a boost. This patch boosts the vCPU candidate in user mode which is delivering an interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs VM in over-subscribe scenario (The host machine is 2 socket, 48 cores, 96 HTs Intel CLX box). There is no performance regression for other benchmarks like Unixbench spawn (most of the time contend read/write lock in kernel mode), ebizzy (most of the time contend read/write sem and TLB shootdown in kernel mode). Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-04-21 12:20:03 -04:00
Nathan Tempelman	54526d1fd5	KVM: x86: Support KVM VMs sharing SEV context Add a capability for userspace to mirror SEV encryption context from one vm to another. On our side, this is intended to support a Migration Helper vCPU, but it can also be used generically to support other in-guest workloads scheduled by the host. The intention is for the primary guest and the mirror to have nearly identical memslots. The primary benefits of this are that: 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they can't accidentally clobber each other. 2) The VMs can have different memory-views, which is necessary for post-copy migration (the migration vCPUs on the target need to read and write to pages, when the primary guest would VMEXIT). This does not change the threat model for AMD SEV. Any memory involved is still owned by the primary guest and its initial state is still attested to through the normal SEV_LAUNCH_* flows. If userspace wanted to circumvent SEV, they could achieve the same effect by simply attaching a vCPU to the primary VM. This patch deliberately leaves userspace in charge of the memslots for the mirror, as it already has the power to mess with them in the primary guest. This patch does not support SEV-ES (much less SNP), as it does not handle handing off attested VMSAs to the mirror. For additional context, we need a Migration Helper because SEV PSP migration is far too slow for our live migration on its own. Using an in-guest migrator lets us speed this up significantly. Signed-off-by: Nathan Tempelman <natet@google.com> Message-Id: <20210408223214.2582277-1-natet@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-04-21 12:20:02 -04:00
Sean Christopherson	5d3c4c7938	KVM: Stop looking for coalesced MMIO zones if the bus is destroyed Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev() fails to allocate memory for the new instance of the bus. If it can't instantiate a new bus, unregister_dev() destroys all devices _except_ the target device. But, it doesn't tell the caller that it obliterated the bus and invoked the destructor for all devices that were on the bus. In the coalesced MMIO case, this can result in a deleted list entry dereference due to attempting to continue iterating on coalesced_zones after future entries (in the walk) have been deleted. Opportunistically add curly braces to the for-loop, which encompasses many lines but sneaks by without braces due to the guts being a single if statement. Fixes: `f65886606c` ("KVM: fix memory leak in kvm_io_bus_unregister_dev()") Cc: stable@vger.kernel.org Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210412222050.876100-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-04-20 04:18:51 -04:00

647 Commits