Changes in 5.15.91
memory: tegra: Remove clients SID override programming
memory: atmel-sdramc: Fix missing clk_disable_unprepare in atmel_ramc_probe()
memory: mvebu-devbus: Fix missing clk_disable_unprepare in mvebu_devbus_probe()
dmaengine: ti: k3-udma: Do conditional decrement of UDMA_CHAN_RT_PEER_BCNT_REG
arm64: dts: imx8mp-phycore-som: Remove invalid PMIC property
ARM: dts: imx6ul-pico-dwarf: Use 'clock-frequency'
ARM: dts: imx7d-pico: Use 'clock-frequency'
ARM: dts: imx6qdl-gw560x: Remove incorrect 'uart-has-rtscts'
arm64: dts: imx8mm-beacon: Fix ecspi2 pinmux
ARM: imx: add missing of_node_put()
HID: intel_ish-hid: Add check for ishtp_dma_tx_map
arm64: dts: imx8mm-venice-gw7901: fix USB2 controller OC polarity
soc: imx8m: Fix incorrect check for of_clk_get_by_name()
reset: uniphier-glue: Use reset_control_bulk API
reset: uniphier-glue: Fix possible null-ptr-deref
EDAC/highbank: Fix memory leak in highbank_mc_probe()
firmware: arm_scmi: Harden shared memory access in fetch_response
firmware: arm_scmi: Harden shared memory access in fetch_notification
tomoyo: fix broken dependency on *.conf.default
RDMA/core: Fix ib block iterator counter overflow
IB/hfi1: Reject a zero-length user expected buffer
IB/hfi1: Reserve user expected TIDs
IB/hfi1: Fix expected receive setup error exit issues
IB/hfi1: Immediately remove invalid memory from hardware
IB/hfi1: Remove user expected buffer invalidate race
affs: initialize fsdata in affs_truncate()
PM: AVS: qcom-cpr: Fix an error handling path in cpr_probe()
arm64: dts: qcom: msm8992: Don't use sfpb mutex
arm64: dts: qcom: msm8992-libra: Add CPU regulators
arm64: dts: qcom: msm8992-libra: Fix the memory map
phy: ti: fix Kconfig warning and operator precedence
NFSD: fix use-after-free in nfsd4_ssc_setup_dul()
ARM: dts: at91: sam9x60: fix the ddr clock for sam9x60
amd-xgbe: TX Flow Ctrl Registers are h/w ver dependent
amd-xgbe: Delay AN timeout during KR training
bpf: Fix pointer-leak due to insufficient speculative store bypass mitigation
phy: rockchip-inno-usb2: Fix missing clk_disable_unprepare() in rockchip_usb2phy_power_on()
net: nfc: Fix use-after-free in local_cleanup()
net: wan: Add checks for NULL for utdm in undo_uhdlc_init and unmap_si_regs
net: enetc: avoid deadlock in enetc_tx_onestep_tstamp()
sch_htb: Avoid grafting on htb_destroy_class_offload when destroying htb
gpio: use raw spinlock for gpio chip shadowed data
gpio: mxc: Protect GPIO irqchip RMW with bgpio spinlock
gpio: mxc: Always set GPIOs used as interrupt source to INPUT mode
wifi: rndis_wlan: Prevent buffer overflow in rndis_query_oid
pinctrl/rockchip: Use temporary variable for struct device
pinctrl/rockchip: add error handling for pull/drive register getters
pinctrl: rockchip: fix reading pull type on rk3568
net: stmmac: Fix queue statistics reading
net/sched: sch_taprio: fix possible use-after-free
l2tp: Serialize access to sk_user_data with sk_callback_lock
l2tp: Don't sleep and disable BH under writer-side sk_callback_lock
l2tp: convert l2tp_tunnel_list to idr
l2tp: close all race conditions in l2tp_tunnel_register()
octeontx2-pf: Avoid use of GFP_KERNEL in atomic context
net: usb: sr9700: Handle negative len
net: mdio: validate parameter addr in mdiobus_get_phy()
HID: check empty report_list in hid_validate_values()
HID: check empty report_list in bigben_probe()
net: stmmac: fix invalid call to mdiobus_get_phy()
pinctrl: rockchip: fix mux route data for rk3568
HID: revert CHERRY_MOUSE_000C quirk
usb: gadget: f_fs: Prevent race during ffs_ep0_queue_wait
usb: gadget: f_fs: Ensure ep0req is dequeued before free_request
Bluetooth: Fix possible deadlock in rfcomm_sk_state_change
net: ipa: disable ipa interrupt during suspend
net/mlx5: E-switch, Fix setting of reserved fields on MODIFY_SCHEDULING_ELEMENT
net: mlx5: eliminate anonymous module_init & module_exit
drm/panfrost: fix GENERIC_ATOMIC64 dependency
dmaengine: Fix double increment of client_count in dma_chan_get()
net: macb: fix PTP TX timestamp failure due to packet padding
virtio-net: correctly enable callback during start_xmit
l2tp: prevent lockdep issue in l2tp_tunnel_register()
HID: betop: check shape of output reports
cifs: fix potential deadlock in cache_refresh_path()
dmaengine: xilinx_dma: call of_node_put() when breaking out of for_each_child_of_node()
phy: phy-can-transceiver: Skip warning if no "max-bitrate"
drm/amd/display: fix issues with driver unload
nvme-pci: fix timeout request state check
tcp: avoid the lookup process failing to get sk in ehash table
octeontx2-pf: Fix the use of GFP_KERNEL in atomic context on rt
ptdma: pt_core_execute_cmd() should use spinlock
device property: fix of node refcount leak in fwnode_graph_get_next_endpoint()
w1: fix deadloop in __w1_remove_master_device()
w1: fix WARNING after calling w1_process()
driver core: Fix test_async_probe_init saves device in wrong array
selftests/net: toeplitz: fix race on tpacket_v3 block close
net: dsa: microchip: ksz9477: port map correction in ALU table entry register
thermal/core: Remove duplicate information when an error occurs
thermal/core: Rename 'trips' to 'num_trips'
thermal: Validate new state in cur_state_store()
thermal/core: fix error code in __thermal_cooling_device_register()
thermal: core: call put_device() only after device_register() fails
net: stmmac: enable all safety features by default
tcp: fix rate_app_limited to default to 1
scsi: iscsi: Fix multiple iSCSI session unbind events sent to userspace
cpufreq: Add Tegra234 to cpufreq-dt-platdev blocklist
kcsan: test: don't put the expect array on the stack
cpufreq: Add SM6375 to cpufreq-dt-platdev blocklist
ASoC: fsl_micfil: Correct the number of steps on SX controls
net: usb: cdc_ether: add support for Thales Cinterion PLS62-W modem
drm: Add orientation quirk for Lenovo ideapad D330-10IGL
s390/debug: add _ASM_S390_ prefix to header guard
s390: expicitly align _edata and _end symbols on page boundary
perf/x86/msr: Add Emerald Rapids
perf/x86/intel/uncore: Add Emerald Rapids
cpufreq: armada-37xx: stop using 0 as NULL pointer
ASoC: fsl_ssi: Rename AC'97 streams to avoid collisions with AC'97 CODEC
ASoC: fsl-asoc-card: Fix naming of AC'97 CODEC widgets
spi: spidev: remove debug messages that access spidev->spi without locking
KVM: s390: interrupt: use READ_ONCE() before cmpxchg()
scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id
r8152: add vendor/device ID pair for Microsoft Devkit
platform/x86: touchscreen_dmi: Add info for the CSL Panther Tab HD
platform/x86: asus-nb-wmi: Add alternate mapping for KEY_SCREENLOCK
lockref: stop doing cpu_relax in the cmpxchg loop
firmware: coreboot: Check size of table entry and use flex-array
drm/i915: Allow switching away via vga-switcheroo if uninitialized
Revert "selftests/bpf: check null propagation only neither reg is PTR_TO_BTF_ID"
drm/i915: Remove unused variable
x86: ACPI: cstate: Optimize C3 entry on AMD CPUs
fs: reiserfs: remove useless new_opts in reiserfs_remount
sysctl: add a new register_sysctl_init() interface
kernel/panic: move panic sysctls to its own file
panic: unset panic_on_warn inside panic()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
kasan: no need to unset panic_on_warn in end_report()
exit: Add and use make_task_dead.
objtool: Add a missing comma to avoid string concatenation
hexagon: Fix function name in die()
h8300: Fix build errors from do_exit() to make_task_dead() transition
csky: Fix function name in csky_alignment() and die()
ia64: make IA64_MCA_RECOVERY bool instead of tristate
panic: Separate sysctl logic from CONFIG_SMP
exit: Put an upper limit on how often we can oops
exit: Expose "oops_count" to sysfs
exit: Allow oops_limit to be disabled
panic: Consolidate open-coded panic_on_warn checks
panic: Introduce warn_limit
panic: Expose "warn_count" to sysfs
docs: Fix path paste-o for /sys/kernel/warn_count
exit: Use READ_ONCE() for all oops/warn limit reads
Bluetooth: hci_sync: cancel cmd_timer if hci_open failed
drm/amdgpu: complete gfxoff allow signal during suspend without delay
scsi: hpsa: Fix allocation size for scsi_host_alloc()
KVM: SVM: fix tsc scaling cache logic
module: Don't wait for GOING modules
tracing: Make sure trace_printk() can output as soon as it can be used
trace_events_hist: add check for return value of 'create_hist_field'
ftrace/scripts: Update the instructions for ftrace-bisect.sh
cifs: Fix oops due to uncleared server->smbd_conn in reconnect
i2c: mv64xxx: Remove shutdown method from driver
i2c: mv64xxx: Add atomic_xfer method to driver
ksmbd: add smbd max io size parameter
ksmbd: add max connections parameter
ksmbd: do not sign response to session request for guest login
ksmbd: downgrade ndr version error message to debug
ksmbd: limit pdu length size according to connection status
ovl: fail on invalid uid/gid mapping at copy up
KVM: x86/vmx: Do not skip segment attributes if unusable bit is set
KVM: arm64: GICv4.1: Fix race with doorbell on VPE activation/deactivation
thermal: intel: int340x: Protect trip temperature from concurrent updates
ipv6: fix reachability confirmation with proxy_ndp
ARM: 9280/1: mm: fix warning on phys_addr_t to void pointer assignment
EDAC/device: Respect any driver-supplied workqueue polling value
EDAC/qcom: Do not pass llcc_driv_data as edac_device_ctl_info's pvt_info
net: mana: Fix IRQ name - add PCI and queue number
scsi: ufs: core: Fix devfreq deadlocks
i2c: designware: use casting of u64 in clock multiplication to avoid overflow
netlink: prevent potential spectre v1 gadgets
net: fix UaF in netns ops registration error path
drm/i915/selftest: fix intel_selftest_modify_policy argument types
netfilter: nft_set_rbtree: Switch to node list walk for overlap detection
netfilter: nft_set_rbtree: skip elements in transaction from garbage collection
netlink: annotate data races around nlk->portid
netlink: annotate data races around dst_portid and dst_group
netlink: annotate data races around sk_state
ipv4: prevent potential spectre v1 gadget in ip_metrics_convert()
ipv4: prevent potential spectre v1 gadget in fib_metrics_match()
netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE
netrom: Fix use-after-free of a listening socket.
net/sched: sch_taprio: do not schedule in taprio_reset()
sctp: fail if no bound addresses can be used for a given scope
riscv/kprobe: Fix instruction simulation of JALR
nvme: fix passthrough csi check
gpio: mxc: Unlock on error path in mxc_flip_edge()
ravb: Rename "no_ptp_cfg_active" and "ptp_cfg_active" variables
net: ravb: Fix lack of register setting after system resumed for Gen3
net: ravb: Fix possible hang if RIS2_QFF1 happen
net: mctp: mark socks as dead on unhash, prevent re-add
thermal: intel: int340x: Add locking to int340x_thermal_get_trip_type()
net/tg3: resolve deadlock in tg3_reset_task() during EEH
net: mdio-mux-meson-g12a: force internal PHY off on mux switch
treewide: fix up files incorrectly marked executable
tools: gpio: fix -c option of gpio-event-mon
Revert "Input: synaptics - switch touchpad on HP Laptop 15-da3001TU to RMI mode"
cpufreq: Move to_gov_attr_set() to cpufreq.h
cpufreq: governor: Use kobject release() method to free dbs_data
kbuild: Allow kernel installation packaging to override pkg-config
block: fix and cleanup bio_check_ro
x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
netfilter: conntrack: unify established states for SCTP paths
perf/x86/amd: fix potential integer overflow on shift of a int
Linux 5.15.91
Change-Id: I3349d802533097ac86e5c680fbd40c00c9719ec7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.
Let's bite the bullet and repaint all of it in one go.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 46808a4cb89708c2e5b264eb9d1035762581921b)
[willdeacon@: Drop riscv hunks; drop x86 hunks for code that isn't present;
convert kvm_hyperv_tsc_notifier() as well]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Bug: 233588291
Change-Id: I78a3ae244de22575684dd47a9823b9fc61b6fa7d
commit 812de04661c4daa7ac385c0dfd62594540538034 upstream.
With KVM_CAP_S390_USER_SIGP, there are only five Signal Processor
orders (CONDITIONAL EMERGENCY SIGNAL, EMERGENCY SIGNAL, EXTERNAL CALL,
SENSE, and SENSE RUNNING STATUS) which are intended for frequent use
and thus are processed in-kernel. The remainder are sent to userspace
with the KVM_CAP_S390_USER_SIGP capability. Of those, three orders
(RESTART, STOP, and STOP AND STORE STATUS) have the potential to
inject work back into the kernel, and thus are asynchronous.
Let's look for those pending IRQs when processing one of the in-kernel
SIGP orders, and return BUSY (CC2) if one is in process. This is in
agreement with the Principles of Operation, which states that only one
order can be "active" on a CPU at a time.
Cc: stable@vger.kernel.org
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20211213210550.856213-2-farman@linux.ibm.com
[borntraeger@linux.ibm.com: add stable tag]
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper. The wrapper is a
remnant of the original implementation and serves no purpose; remove it
before it gains more users.
Back when kvm_vcpu_get_idx() was added by commit 497d72d80a ("KVM: Add
kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation
was more than just a simple wrapper as vcpu->vcpu_idx did not exist and
retrieving the index meant walking over the vCPU array to find the given
vCPU.
When vcpu_idx was introduced by commit 8750e72a79 ("KVM: remember
position in kvm->vcpus array"), the helper was left behind, likely to
avoid extra thrash (but even then there were only two users, the original
arm usage having been removed at some point in the past).
No functional change intended.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210910183220.2397812-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While in practice vcpu->vcpu_idx == vcpu->vcp_id is often true, it may
not always be, and we must not rely on this. Reason is that KVM decides
the vcpu_idx, userspace decides the vcpu_id, thus the two might not
match.
Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
that code like
for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
vcpu = kvm_get_vcpu(kvm, vcpu_id);
do_stuff(vcpu);
}
is not legit. Reason is that kvm_get_vcpu expects an vcpu_idx, not an
vcpu_id. The trouble is, we do actually use kvm->arch.idle_mask like
this. To fix this problem we have two options. Either use
kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
or switch to indexing via vcpu_idx. The latter is preferable for obvious
reasons.
Let us make switch from indexing kvm->arch.idle_mask by vcpu_id to
indexing it by vcpu_idx. To keep gisa_int.kicked_mask indexed by the
same index as idle_mask lets make the same change for it as well.
Fixes: 1ee0bc559d ("KVM: s390: get rid of local_int array")
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Christian Bornträger <borntraeger@de.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 3.15+
Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Get rid of unsigned long long, and use unsigned long instead
everywhere. The usage of unsigned long long is a leftover from
31 bit kernel support.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Almost all kvm allocations in the s390x KVM code can be attributed to
the process that triggers the allocation (in other words, no global
allocation for other guests). This will help the memcg controller to
make the right decisions.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
The diag 0x44 handler, which handles a directed yield, goes into a
a codepath that does a kvm_for_each_vcpu() and ultimately
deliverable_irqs(). The new check for kvm_s390_pv_cpu_is_protected()
contains an assertion that the vcpu->mutex is held, which isn't going
to be the case in this scenario.
The result is a plethora of these messages if the lock debugging
is enabled, and thus an implication that we have a problem.
WARNING: CPU: 9 PID: 16167 at arch/s390/kvm/kvm-s390.h:239 deliverable_irqs+0x1c6/0x1d0 [kvm]
...snip...
Call Trace:
[<000003ff80429bf2>] deliverable_irqs+0x1ca/0x1d0 [kvm]
([<000003ff80429b34>] deliverable_irqs+0x10c/0x1d0 [kvm])
[<000003ff8042ba82>] kvm_s390_vcpu_has_irq+0x2a/0xa8 [kvm]
[<000003ff804101e2>] kvm_arch_dy_runnable+0x22/0x38 [kvm]
[<000003ff80410284>] kvm_vcpu_on_spin+0x8c/0x1d0 [kvm]
[<000003ff80436888>] kvm_s390_handle_diag+0x3b0/0x768 [kvm]
[<000003ff80425af4>] kvm_handle_sie_intercept+0x1cc/0xcd0 [kvm]
[<000003ff80422bb0>] __vcpu_run+0x7b8/0xfd0 [kvm]
[<000003ff80423de6>] kvm_arch_vcpu_ioctl_run+0xee/0x3e0 [kvm]
[<000003ff8040ccd8>] kvm_vcpu_ioctl+0x2c8/0x8d0 [kvm]
[<00000001504ced06>] ksys_ioctl+0xae/0xe8
[<00000001504cedaa>] __s390x_sys_ioctl+0x2a/0x38
[<0000000150cb9034>] system_call+0xd8/0x2d8
2 locks held by CPU 2/KVM/16167:
#0: 00000001951980c0 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x90/0x8d0 [kvm]
#1: 000000019599c0f0 (&kvm->srcu){....}, at: __vcpu_run+0x4bc/0xfd0 [kvm]
Last Breaking-Event-Address:
[<000003ff80429b34>] deliverable_irqs+0x10c/0x1d0 [kvm]
irq event stamp: 11967
hardirqs last enabled at (11975): [<00000001502992f2>] console_unlock+0x4ca/0x650
hardirqs last disabled at (11982): [<0000000150298ee8>] console_unlock+0xc0/0x650
softirqs last enabled at (7940): [<0000000150cba6ca>] __do_softirq+0x422/0x4d8
softirqs last disabled at (7929): [<00000001501cd688>] do_softirq_own_stack+0x70/0x80
Considering what's being done here, let's fix this by removing the
mutex assertion rather than acquiring the mutex for every other vcpu.
Fixes: 201ae986ea ("KVM: s390: protvirt: Implement interrupt injection")
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20200415190353.63625-1-farman@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Only two program exceptions can be injected for a protected guest:
specification and operand.
For both, a code needs to be specified in the interrupt injection
control of the state description, as the guest prefix page is not
accessible to KVM for such guests.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[borntraeger@de.ibm.com: patch merging, splitting, fixing]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The sclp interrupt is kind of special. The ultravisor polices that we
do not inject an sclp interrupt with payload if no sccb is outstanding.
On the other hand we have "asynchronous" event interrupts, e.g. for
console input.
We separate both variants into sclp interrupt and sclp event interrupt.
The sclp interrupt is masked until a previous servc instruction has
finished (sie exit 108).
[frankja@linux.ibm.com: factoring out write_sclp]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This defines the necessary data structures in the SIE control block to
inject machine checks,external and I/O interrupts. We first define the
the interrupt injection control, which defines the next interrupt to
inject. Then we define the fields that contain the payload for machine
checks,external and I/O interrupts.
This is then used to implement interruption injection for the following
list of interruption types:
- I/O (uses inject io interruption)
__deliver_io
- External (uses inject external interruption)
__deliver_cpu_timer
__deliver_ckc
__deliver_emergency_signal
__deliver_external_call
- cpu restart (uses inject restart interruption)
__deliver_restart
- machine checks (uses mcic, failing address and external damage)
__write_machine_check
Please note that posted interrupts (GISA) are not used for protected
guests as of today.
The service interrupt is handled in a followup patch.
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[borntraeger@de.ibm.com: patch merging, splitting, fixing]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The adapter interrupt page containing the indicator bits is currently
pinned. That means that a guest with many devices can pin a lot of
memory pages in the host. This also complicates the reference tracking
which is needed for memory management handling of protected virtual
machines. It might also have some strange side effects for madvise
MADV_DONTNEED and other things.
We can simply try to get the userspace page set the bits and free the
page. By storing the userspace address in the irq routing entry instead
of the guest address we can actually avoid many lookups and list walks
so that this variant is very likely not slower.
If userspace messes around with the memory slots the worst thing that
can happen is that we write to some other memory within that process.
As we get the the page with FOLL_WRITE this can also not be used to
write to shared read-only pages.
Signed-off-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[borntraeger@de.ibm.com: patch simplification]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
When the userspace program runs the KVM_S390_INTERRUPT ioctl to inject
an interrupt, we convert them from the legacy struct kvm_s390_interrupt
to the new struct kvm_s390_irq via the s390int_to_s390irq() function.
However, this function does not take care of all types of interrupts
that we can inject into the guest later (see do_inject_vcpu()). Since we
do not clear out the s390irq values before calling s390int_to_s390irq(),
there is a chance that we copy random data from the kernel stack which
could be leaked to the userspace later.
Specifically, the problem exists with the KVM_S390_INT_PFAULT_INIT
interrupt: s390int_to_s390irq() does not handle it, and the function
__inject_pfault_init() later copies irq->u.ext which contains the
random kernel stack data. This data can then be leaked either to
the guest memory in __deliver_pfault_init(), or the userspace might
retrieve it directly with the KVM_S390_GET_IRQ_STATE ioctl.
Fix it by handling that interrupt type in s390int_to_s390irq(), too,
and by making sure that the s390irq struct is properly pre-initialized.
And while we're at it, make sure that s390int_to_s390irq() now
directly returns -EINVAL for unknown interrupt types, so that we
immediately get a proper error code in case we add more interrupt
types to do_inject_vcpu() without updating s390int_to_s390irq()
sometime in the future.
Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/kvm/20190912115438.25761-1-thuth@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull KVM updates from Paolo Bonzini:
"ARM:
- support for SVE and Pointer Authentication in guests
- PMU improvements
POWER:
- support for direct access to the POWER9 XIVE interrupt controller
- memory and performance optimizations
x86:
- support for accessing memory not backed by struct page
- fixes and refactoring
Generic:
- dirty page tracking improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
kvm: fix compilation on aarch64
Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
kvm: x86: Fix L1TF mitigation for shadow MMU
KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
tests: kvm: Add tests for KVM_SET_NESTED_STATE
KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
tests: kvm: Add tests to .gitignore
KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
KVM: Fix the bitmap range to copy during clear dirty
KVM: arm64: Fix ptrauth ID register masking logic
KVM: x86: use direct accessors for RIP and RSP
KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
KVM: x86: Omit caching logic for always-available GPRs
kvm, x86: Properly check whether a pfn is an MMIO or not
...
Add an extra parameter for airq handlers to recognize
floating vs. directed interrupts.
Signed-off-by: Sebastian Ott <sebott@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The patch implements a handler for GIB alert interruptions
on the host. Its task is to alert guests that interrupts are
pending for them.
A GIB alert interrupt statistic counter is added as well:
$ cat /proc/interrupts
CPU0 CPU1
...
GAL: 23 37 [I/O] GIB Alert
...
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Message-Id: <20190131085247.13826-14-mimu@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Add the Interruption Alert Mask (IAM) to the architecture specific
kvm struct. This mask in the GISA is used to define for which ISC
a GIB alert will be issued.
The functions kvm_s390_gisc_register() and kvm_s390_gisc_unregister()
are used to (un)register a GISC (guest ISC) with a virtual machine and
its GISA.
Upon successful completion, kvm_s390_gisc_register() returns the
ISC to be used for GIB alert interruptions. A negative return code
indicates an error during registration.
Theses functions will be used by other adapter types like AP and PCI to
request pass-through interruption support.
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20190131085247.13826-12-mimu@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The Guest Information Block (GIB) links the GISA of all guests
that have adapter interrupts pending. These interrupts cannot be
delivered because all vcpus of these guests are currently in WAIT
state or have masked the respective Interruption Sub Class (ISC).
If enabled, a GIB alert is issued on the host to schedule these
guests to run suitable vcpus to consume the pending interruptions.
This mechanism allows to process adapter interrupts for currently
not running guests.
The GIB is created during host initialization and associated with
the Adapter Interruption Facility in case an Adapter Interruption
Virtualization Facility is available.
The GIB initialization and thus the activation of the related code
will be done in an upcoming patch of this series.
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Message-Id: <20190131085247.13826-10-mimu@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
In KVM code we use masks to test/set control registers.
Let's define the ones we use in arch/s390/include/asm/ctl_reg.h and
replace all occurrences in KVM code.
As we will be needing the define for Clock-comparator sign control soon,
let's also add it.
Suggested-by: Collin L. Walling <walling@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Collin Walling <walling@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For testing the exitless interrupt support it turned out useful to
have separate counters for inject and delivery of I/O interrupt.
While at it do the same for all interrupt types. For timer
related interrupts (clock comparator and cpu timer) we even had
no delivery counters. Fix this as well. On this way some counters
are being renamed to have a similar name.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Missed when enabling the Multiple-epoch facility. If the facility is
installed and the control is set, a sign based comaprison has to be
performed.
Right now we would inject wrong interrupts and ignore interrupt
conditions. Also the sleep time is calculated in a wrong way.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20180207114647.6220-2-david@redhat.com>
Fixes: 8fa1696ea7 ("KVM: s390: Multiple Epoch Facility support")
Cc: stable@vger.kernel.org
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
If GISA is available, we do not have to kick CPUs out of SIE to deliver
interrupts. The hardware can deliver such interrupts while running.
Cc: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For interrupt injection of floating interrupts we queue the interrupt
either in the GISA or in the floating interrupt list. The first CPU
that looks at these data structures - either in KVM code or hardware
will then deliver that interrupt. To minimize latency we also:
-a: choose a VCPU to deliver that interrupt. We prefer idle CPUs
-b: we wake up the host thread that runs the VCPU
-c: set an I/O intervention bit for that CPU so that it exits guest
context as soon as the PSW I/O mask is enabled
This will make sure that this CPU will execute the interrupt delivery
code of KVM very soon.
We can now optimize the injection case if we have exitless interrupts.
The wakeup is still necessary in case the target CPU sleeps. We can
avoid the I/O intervention request bit though. Whenever this
intervention request would be handled, the hardware could also directly
inject the interrupt on that CPU, no need to go through the interrupt
injection loop of KVM.
Cc: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The patch modifies the previously defined GISA data structure to be
able to store two GISA formats, format-0 and format-1. Additionally,
it verifies the availability of the GISA format facility and enables
the use of a format-1 GISA in the SIE control block accordingly.
A format-1 can do everything that format-0 can and we will need it
for real HW passthrough. As there are systems with only format-0
we keep both variants.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The function returns a pending I/O interrupt with the highest
priority defined by its ISC.
Together with AIV activation, pending adapter interrupts are
managed by the GISA IPM. Thus kvm_s390_get_io_int() needs to
inspect the IPM as well when the interrupt with the highest
priority has to be identified.
In case classic and adapter interrupts with the same ISC are
pending, the classic interrupt will be returned first.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>