Add a hook to apply the vendor's performance tuning for the owner
of an rwsem.
Add a hook on the rwsem waiter list to allow vendors to perform
waiting-queue enhancements.
ANDROID_VENDOR_DATA is added to struct rw_semaphore.
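A minimal sketch of what such declarations look like, following the
Android vendor-hook convention (the hook names and prototypes here
are illustrative; alter_rwsem_list_add is referenced later in this
series):

  DECLARE_HOOK(android_vh_rwsem_init,
      TP_PROTO(struct rw_semaphore *sem),
      TP_ARGS(sem));
  DECLARE_HOOK(android_vh_alter_rwsem_list_add,
      TP_PROTO(struct rwsem_waiter *waiter,
               struct rw_semaphore *sem,
               bool *already_on_list),
      TP_ARGS(waiter, sem, already_on_list));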
Bug: 222402411
Signed-off-by: JianMin Liu <jian-min.liu@mediatek.com>
Signed-off-by: Jino Hsu <jino.hsu@mediatek.com>
Change-Id: I007a5e26f3db2adaeaf4e5ccea414ce7abfa83b8
Changes in 5.15.25
drm/nouveau/pmu/gm200-: use alternate falcon reset sequence
fs/proc: task_mmu.c: don't read mapcount for migration entry
btrfs: zoned: cache reported zone during mount
scsi: lpfc: Fix mailbox command failure during driver initialization
HID: Add support for UGTABLET WP5540
Revert "svm: Add warning message for AVIC IPI invalid target"
parisc: Show error if wrong 32/64-bit compiler is being used
serial: parisc: GSC: fix build when IOSAPIC is not set
parisc: Drop __init from map_pages declaration
parisc: Fix data TLB miss in sba_unmap_sg
parisc: Fix sglist access in ccio-dma.c
mmc: block: fix read single on recovery logic
mm: don't try to NUMA-migrate COW pages that have other uses
HID: amd_sfh: Add illuminance mask to limit ALS max value
HID: i2c-hid: goodix: Fix a lockdep splat
HID: amd_sfh: Increase sensor command timeout
HID: amd_sfh: Correct the structure field name
PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology
parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
btrfs: send: in case of IO error log it
platform/x86: touchscreen_dmi: Add info for the RWC NANOTE P8 AY07J 2-in-1
platform/x86: ISST: Fix possible circular locking dependency detected
kunit: tool: Import missing importlib.abc
selftests: rtc: Increase test timeout so that all tests run
kselftest: signal all child processes
net: ieee802154: at86rf230: Stop leaking skb's
selftests/zram: Skip max_comp_streams interface on newer kernel
selftests/zram01.sh: Fix compression ratio calculation
selftests/zram: Adapt the situation that /dev/zram0 is being used
selftests: openat2: Print also errno in failure messages
selftests: openat2: Add missing dependency in Makefile
selftests: openat2: Skip testcases that fail with EOPNOTSUPP
selftests: skip mincore.check_file_mmap when fs lacks needed support
ax25: improve the incomplete fix to avoid UAF and NPD bugs
pinctrl: bcm63xx: fix unmet dependency on REGMAP for GPIO_REGMAP
vfs: make freeze_super abort when sync_filesystem returns error
quota: make dquot_quota_sync return errors from ->sync_fs
scsi: pm80xx: Fix double completion for SATA devices
kselftest: Fix vdso_test_abi return status
scsi: core: Reallocate device's budget map on queue depth change
scsi: pm8001: Fix use-after-free for aborted TMF sas_task
scsi: pm8001: Fix use-after-free for aborted SSP/STP sas_task
drm/amd: Warn users about potential s0ix problems
nvme: fix a possible use-after-free in controller reset during load
nvme-tcp: fix possible use-after-free in transport error_recovery work
nvme-rdma: fix possible use-after-free in transport error_recovery work
net: sparx5: do not refer to skb after passing it on
drm/amd: add support to check whether the system is set to s3
drm/amd: Only run s3 or s0ix if system is configured properly
drm/amdgpu: fix logic inversion in check
x86/Xen: streamline (and fix) PV CPU enumeration
Revert "module, async: async_synchronize_full() on module init iff async is used"
gcc-plugins/stackleak: Use noinstr in favor of notrace
random: wake up /dev/random writers after zap
KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
KVM: x86: nSVM: fix potential NULL dereference on nested migration
KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
iwlwifi: fix use-after-free
drm/radeon: Fix backlight control on iMac 12,1
drm/atomic: Don't pollute crtc_state->mode_blob with error pointers
drm/amd/pm: correct the sequence of sending gpu reset msg
drm/amdgpu: skipping SDMA hw_init and hw_fini for S0ix.
drm/i915/opregion: check port number bounds for SWSCI display power state
drm/i915: Fix dbuf slice config lookup
drm/i915: Fix mbus join config lookup
vsock: remove vsock from connected table when connect is interrupted by a signal
drm/cma-helper: Set VM_DONTEXPAND for mmap
drm/i915/gvt: Make DRM_I915_GVT depend on X86
drm/i915/ttm: tweak priority hint selection
iwlwifi: pcie: fix locking when "HW not ready"
iwlwifi: pcie: gen2: fix locking when "HW not ready"
iwlwifi: mvm: don't send SAR GEO command for 3160 devices
selftests: netfilter: fix exit value for nft_concat_range
netfilter: nft_synproxy: unregister hooks on init error path
selftests: netfilter: disable rp_filter on router
ipv4: fix data races in fib_alias_hw_flags_set
ipv6: fix data-race in fib6_info_hw_flags_set / fib6_purge_rt
ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()
ipv6: per-netns exclusive flowlabel checks
Revert "net: ethernet: bgmac: Use devm_platform_ioremap_resource_byname"
mac80211: mlme: check for null after calling kmemdup
brcmfmac: firmware: Fix crash in brcm_alt_fw_path
cfg80211: fix race in netlink owner interface destruction
net: dsa: lan9303: fix reset on probe
net: dsa: mv88e6xxx: flush switchdev FDB workqueue before removing VLAN
net: dsa: lantiq_gswip: fix use after free in gswip_remove()
net: dsa: lan9303: handle hwaccel VLAN tags
net: dsa: lan9303: add VLAN IDs to master device
net: ieee802154: ca8210: Fix lifs/sifs periods
ping: fix the dif and sdif check in ping_lookup
bonding: force carrier update when releasing slave
drop_monitor: fix data-race in dropmon_net_event / trace_napi_poll_hit
net_sched: add __rcu annotation to netdev->qdisc
bonding: fix data-races around agg_select_timer
libsubcmd: Fix use-after-free for realloc(..., 0)
net/smc: Avoid overwriting the copies of clcsock callback functions
net: phy: mediatek: remove PHY mode check on MT7531
atl1c: fix tx timeout after link flap on Mikrotik 10/25G NIC
tipc: fix wrong publisher node address in link publications
dpaa2-switch: fix default return of dpaa2_switch_flower_parse_mirror_key
dpaa2-eth: Initialize mutex used in one step timestamping path
net: bridge: multicast: notify switchdev driver whenever MC processing gets disabled
perf bpf: Defer freeing string after possible strlen() on it
selftests/exec: Add non-regular to TEST_GEN_PROGS
arm64: Correct wrong label in macro __init_el2_gicv3
ALSA: usb-audio: revert to IMPLICIT_FB_FIXED_DEV for M-Audio FastTrack Ultra
ALSA: hda/realtek: Add quirk for Legion Y9000X 2019
ALSA: hda/realtek: Fix deadlock by COEF mutex
ALSA: hda: Fix regression on forced probe mask option
ALSA: hda: Fix missing codec probe on Shenker Dock 15
ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw()
ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw_range()
ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw_sx()
ASoC: ops: Fix stereo change notifications in snd_soc_put_xr_sx()
cifs: fix set of group SID via NTSD xattrs
powerpc/603: Fix boot failure with DEBUG_PAGEALLOC and KFENCE
powerpc/lib/sstep: fix 'ptesync' build error
mtd: rawnand: gpmi: don't leak PM reference in error path
smb3: fix snapshot mount option
tipc: fix wrong notification node addresses
scsi: ufs: Remove dead code
scsi: ufs: Fix a deadlock in the error handler
ASoC: tas2770: Insert post reset delay
ASoC: qcom: Actually clear DMA interrupt register for HDMI
block/wbt: fix negative inflight counter when remove scsi device
NFS: Remove an incorrect revalidation in nfs4_update_changeattr_locked()
NFS: LOOKUP_DIRECTORY is also ok with symlinks
NFS: Do not report writeback errors in nfs_getattr()
tty: n_tty: do not look ahead for EOL character past the end of the buffer
block: fix surprise removal for drivers calling blk_set_queue_dying
mtd: rawnand: qcom: Fix clock sequencing in qcom_nandc_probe()
mtd: parsers: qcom: Fix kernel panic on skipped partition
mtd: parsers: qcom: Fix missing free for pparts in cleanup
mtd: phram: Prevent divide by zero bug in phram_setup()
mtd: rawnand: brcmnand: Fixed incorrect sub-page ECC status
HID: elo: fix memory leak in elo_probe
mtd: rawnand: ingenic: Fix missing put_device in ingenic_ecc_get
Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id()
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
ARM: OMAP2+: hwmod: Add of_node_put() before break
ARM: OMAP2+: adjust the location of put_device() call in omapdss_init_of
phy: usb: Leave some clocks running during suspend
staging: vc04_services: Fix RCU dereference check
phy: phy-mtk-tphy: Fix duplicated argument in phy-mtk-tphy
irqchip/sifive-plic: Add missing thead,c900-plic match string
x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm
netfilter: conntrack: don't refresh sctp entries in closed state
ksmbd: fix same UniqueId for dot and dotdot entries
ksmbd: don't align last entry offset in smb2 query directory
arm64: dts: meson-gx: add ATF BL32 reserved-memory region
arm64: dts: meson-g12: add ATF BL32 reserved-memory region
arm64: dts: meson-g12: drop BL32 region from SEI510/SEI610
pidfd: fix test failure due to stack overflow on some arches
selftests: fixup build warnings in pidfd / clone3 tests
mm: io_uring: allow oom-killer from io_uring_setup
ACPI: PM: Revert "Only mark EC GPE for wakeup on Intel systems"
kconfig: let 'shell' return enough output for deep path names
ata: libata-core: Disable TRIM on M88V29
soc: aspeed: lpc-ctrl: Block error printing on probe defer cases
xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
drm/rockchip: dw_hdmi: Do not leave clock enabled in error case
tracing: Fix tp_printk option related with tp_printk_stop_on_boot
display/amd: decrease message verbosity about watermarks table failure
drm/amd/display: Cap pflip irqs per max otg number
drm/amd/display: fix yellow carp wm clamping
net: usb: qmi_wwan: Add support for Dell DW5829e
net: macb: Align the dma and coherent dma masks
kconfig: fix failing to generate auto.conf
scsi: lpfc: Fix pt2pt NVMe PRLI reject LOGO loop
EDAC: Fix calculation of returned address and next offset in edac_align_ptr()
ucounts: Handle wrapping in is_ucounts_overlimit
ucounts: In set_cred_ucounts assume new->ucounts is non-NULL
ucounts: Base set_cred_ucounts changes on the real user
ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
lib/iov_iter: initialize "flags" in new pipe_buffer
rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
ucounts: Move RLIMIT_NPROC handling after set_user
net: sched: limit TC_ACT_REPEAT loops
dmaengine: sh: rcar-dmac: Check for error num after setting mask
dmaengine: stm32-dmamux: Fix PM disable depth imbalance in stm32_dmamux_probe
dmaengine: sh: rcar-dmac: Check for error num after dma_set_max_seg_size
tests: fix idmapped mount_setattr test
i2c: qcom-cci: don't delete an unregistered adapter
i2c: qcom-cci: don't put a device tree node before i2c_add_adapter()
dmaengine: ptdma: Fix the error handling path in pt_core_init()
copy_process(): Move fd_install() out of sighand->siglock critical section
scsi: qedi: Fix ABBA deadlock in qedi_process_tmf_resp() and qedi_process_cmd_cleanup_resp()
ice: enable parsing IPSEC SPI headers for RSS
i2c: brcmstb: fix support for DSL and CM variants
lockdep: Correct lock_classes index mapping
Linux 5.15.25
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib129a0e11f5e82d67563329a5de1b0aef1d87928
commit 28df029d53a2fd80c1b8674d47895648ad26dcfb upstream.
A kernel exception was hit when trying to dump /proc/lockdep_chains after
the lockdep report "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!":
Unable to handle kernel paging request at virtual address 00054005450e05c3
...
00054005450e05c3] address between user and kernel address ranges
...
pc : [0xffffffece769b3a8] string+0x50/0x10c
lr : [0xffffffece769ac88] vsnprintf+0x468/0x69c
...
Call trace:
string+0x50/0x10c
vsnprintf+0x468/0x69c
seq_printf+0x8c/0xd8
print_name+0x64/0xf4
lc_show+0xb8/0x128
seq_read_iter+0x3cc/0x5fc
proc_reg_read_iter+0xdc/0x1d4
The cause of the problem is that lock_chain_get_class() will shift
the lock_classes index by 1, but the index no longer needs to be
shifted since commit 01bb6f0af9 ("locking/lockdep: Change the range
of class_idx in held_lock struct") already changed the index to start
from 0.
With the shift, lock_classes[-1] lands in the chain_hlocks array, so
when lock_classes[-1] is printed after the chain_hlocks entries have
been modified, the exception occurs.
The output of lockdep_chains is also incorrect due to this problem.
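A sketch of the corrected helper, based on the 5.15 lockdep internals
(the bug was the trailing "- 1"):

  static inline struct lock_class *
  lock_chain_get_class(struct lock_chain *chain, int i)
  {
      u16 chain_hlock = chain_hlocks[chain->base + i];
      unsigned int class_idx = chain_hlock_class_idx(chain_hlock);

      /* class_idx already starts at 0; do not shift it by 1 */
      return lock_classes + class_idx;
  }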
Fixes: f611e8cf98 ("lockdep: Take read/write status in consideration when generate chainkey")
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20220210105011.21712-1-cheng-jui.wang@mediatek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit bf2290a48a (Revert "ANDROID: vendor_hooks: set
debugging data when rt_mutex is working")
The original patch was reverted to resolve merge issues.
This patch adds the vendor hooks back for their original purpose.
Bug: 216016261
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Change-Id: I00162d88e2a446e9ece4804def098fcdc63fceb9
(cherry picked from commit d497887b00ac3e5e380123cac4a303d009b570ee)
This reverts commit 31c9ccb138 (Revert "ANDROID: vendor_hooks: add
waiting information for blocked tasks")
And also revert portions of 396a501b1743 (Revert "ANDROID: rwsem: Add
vendor hook to the rw-semaphore")
The original patches were reverted to resolve merge issues.
This patch adds the vendor hooks back for their original purpose.
Bug: 216016261
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Change-Id: I04ed7b055eee40f7975bd5d74fb73dd080cd76bf
(cherry picked from commit c23da05eac0d35441695caff0b7d0220f92ca8a0)
The rwsem_waiter struct is needed in the vendor hook
alter_rwsem_list_add. Its parameter sem is a struct rw_semaphore
(already exported in rwsem.h) whose wait_list links
"struct rwsem_waiter" items. The task information in each item of the
wait_list needs to be referenced in vendor loadable modules.
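For reference, the structure being exposed looks roughly like this in
5.15 (the exact field list may vary across stable versions):

  struct rwsem_waiter {
      struct list_head list;
      struct task_struct *task;
      enum rwsem_waiter_type type;
      unsigned long timeout;
      bool handoff_set;
  };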
Bug: 174902706
Change-Id: Ic7d21ffdd795eaa203989751d26f8b1f32134d8b
Signed-off-by: Huang Yiwei <hyiwei@codeaurora.org>
Signed-off-by: Vamsi Krishna Lanka <quic_vamslank@quicinc.com>
[ Upstream commit d257cc8cb8d5355ffc43a96bab94db7b5a324803 ]
There are some inconsistencies in the way that the handoff bit is
being handled by readers and writers, which lead to a race condition.
Firstly, when a queue head writer sets the handoff bit, it will clear
it when the writer is killed or interrupted on its way out without
acquiring the lock. That is not the case for a queue head reader: the
handoff bit will simply be inherited by the next waiter.
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially leave the handoff bit set with an empty wait queue.
Worse, the handling in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation. This leads to a race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.
To make the handoff bit handling more consistent and robust, extract
the handoff-bit clearing code into a new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.
The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty, to avoid a possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.
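A sketch of the new helper as described (modeled on the upstream
change; treat the details as illustrative):

  static inline void
  rwsem_del_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
  {
      lockdep_assert_held(&sem->wait_lock);
      list_del(&waiter->list);
      if (likely(!list_empty(&sem->wait_list)))
          return;

      /* Only clear the bits once the wait queue goes empty */
      atomic_long_andnot(RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS,
                         &sem->count);
  }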
Fixes: 4f23dbc1e6 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2507003a1d10917c9158077bf6030719d02c941e ]
lock_is_held_type(, 1) detects acquired read locks. It only
recognises locks acquired with lock_acquire_shared(). Read locks
acquired with lock_acquire_shared_recursive() are not recognised
because a `2' is stored as the read value.
Rework the check to additionally recognise a lock's read value of one
or two as a read-held lock.
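Illustratively, the test in __lock_is_held() becomes (a sketch of the
reworked check):

  if (match_held_lock(hlock, lock)) {
      /* hlock->read is 1 or 2 for read-held locks; normalize it */
      if (read == -1 || !!hlock->read == read)
          return LOCK_STATE_HELD;

      return LOCK_STATE_NOT_HELD;
  }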
Fixes: e918188611 ("locking: More accurate annotations for read_lock()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20210903084001.lblecrvz4esl4mrr@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
Readers of rwbase can lock and unlock without taking any inner lock;
if that happens, we need the ordering provided by atomic operations
to satisfy the ordering semantics of lock/unlock. Without that,
consider the following case:
{ X = 0 initially }

CPU 0                                   CPU 1
=====                                   =====
                                        rt_write_lock();
                                        X = 1
                                        rt_write_unlock():
                                          atomic_add(READER_BIAS - WRITER_BIAS, ->readers);
                                          // ->readers is READER_BIAS.
rt_read_lock():
  if ((r = atomic_read(->readers)) < 0) // True
    atomic_try_cmpxchg(->readers, r, r + 1); // succeed.
  <acquire the read lock via fast path>

r1 = X; // r1 may be 0, because nothing prevents the reordering
        // of "X = 1" and atomic_add() on CPU 1.
Therefore audit every usage of atomic operations that may happen in a
fast path, and add necessary barriers.
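As one example of the pairing required, the reader fast path can use
an acquire-ordered cmpxchg (a sketch assuming the 5.15 rwbase
internals; the matching RELEASE sits in the writer's unlock path):

  static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
  {
      int r;

      for (r = atomic_read(&rwb->readers); r < 0;) {
          /* Provides ACQUIRE on success, ordering the critical section */
          if (likely(atomic_try_cmpxchg_acquire(&rwb->readers, &r, r + 1)))
              return 1;
      }
      return 0;
  }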
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210909110203.953991276@infradead.org
Dan reported that rt_mutex_adjust_prio_chain() can be called with
.orig_waiter == NULL; however, commit a055fcc132 ("locking/rtmutex:
Return success on deadlock for ww_mutex waiters") unconditionally
dereferences it.
Since both call-sites that have .orig_waiter == NULL don't care for the
return value, simply disable the deadlock squash by adding the NULL check.
Notably, both callers use the deadlock condition as a termination condition
for the iteration; once detected, it is certain that (de)boosting is
done.
Arguably step [3] would be a more natural termination point, but it's
dubious whether adding a third deadlock detection state would improve the
code.
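A sketch of the guard (abbreviated; the surrounding chain-walk code
is omitted):

  if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
      ret = -EDEADLK;

      /*
       * Skip the ww_mutex deadlock squash when orig_waiter is NULL;
       * both such call sites ignore the return value anyway.
       */
      if (build_ww_mutex() && orig_waiter && orig_waiter->ww_ctx)
          ret = 0;
  }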
Fixes: a055fcc132 ("locking/rtmutex: Return success on deadlock for ww_mutex waiters")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/YS9La56fHMiCCo75@hirez.programming.kicks-ass.net
Pull locking and atomics updates from Thomas Gleixner:
"The regular pile:
- A few improvements to the mutex code
- Documentation updates for atomics to clarify the difference between
cmpxchg() and try_cmpxchg() and to explain the forward progress
expectations.
- Simplification of the atomics fallback generator
- The addition of arch_atomic_long*() variants and generic arch_*()
bitops based on them.
- Add the missing might_sleep() invocations to the down*() operations
of semaphores.
The PREEMPT_RT locking core:
- Scheduler updates to support the state preserving mechanism for
'sleeping' spin- and rwlocks on RT.
This mechanism is carefully preserving the state of the task when
blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups
targeted at the same task into account. The preserved or updated
(via a regular wakeup) state is restored when the lock has been
acquired.
- Restructuring of the rtmutex code so it can be utilized and
extended for the RT specific lock variants.
- Restructuring of the ww_mutex code to allow sharing of the ww_mutex
specific functionality for rtmutex based ww_mutexes.
- Header file disentangling to allow substitution of the regular lock
implementations with the PREEMPT_RT variants without creating an
unmaintainable #ifdef mess.
- Shared base code for the PREEMPT_RT specific rw_semaphore and
rwlock implementations.
Contrary to the regular rw_semaphores and rwlocks, the PREEMPT_RT
implementation is writer unfair because it is infeasible to do
priority inheritance on multiple readers. Experience over the years
has shown that real-time workloads are not the typical workloads
which are sensitive to writer starvation.
The alternative solution would be to allow only a single reader
which has been tried and discarded as it is a major bottleneck
especially for mmap_sem. Aside from that, many of the writer
starvation critical usage sites have been converted to a writer
side mutex/spinlock and RCU read side protections in the past
decade so that the issue is less prominent than it used to be.
- The actual rtmutex based lock substitutions for PREEMPT_RT enabled
kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
rwlock_t. The spin/rw_lock*() functions disable migration across
the critical section to preserve the existing semantics vs per-CPU
variables.
- Rework of the futex REQUEUE_PI mechanism to handle the case of
early wake-ups which interleave with a re-queue operation to
prevent the situation that a task would be blocked on both the
rtmutex associated to the outer futex and the rtmutex based hash
bucket spinlock.
While this situation cannot happen on !RT enabled kernels the
changes make the underlying concurrency problems easier to
understand in general. As a result the difference between !RT and
RT kernels is reduced to the handling of waiting for the critical
section. !RT kernels simply spin-wait as before and RT kernels
utilize rcu_wait().
- The substitution of local_lock for PREEMPT_RT with a spinlock which
protects the critical section while staying preemptible. The CPU
locality is established by disabling migration (a sketch follows
the quoted summary below).
The underlying concepts of this code have been in use in PREEMPT_RT for
way more than a decade. The code has been refactored several times over
the years and this final incarnation has been optimized once again to be
as non-intrusive as possible, i.e. the RT specific parts are mostly
isolated.
It has been extensively tested in the 5.14-rt patch series and it has
been verified that !RT kernels are not affected by these changes"
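As a concrete illustration of the local_lock substitution mentioned
above, on PREEMPT_RT the lock becomes a 'sleeping' spinlock and CPU
locality comes from migrate_disable() (a sketch, assuming the RT
local_lock internals):

  typedef spinlock_t local_lock_t;

  #define __local_lock(__lock)                \
      do {                                    \
          migrate_disable();                  \
          spin_lock(this_cpu_ptr((__lock)));  \
      } while (0)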
* tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits)
locking/rtmutex: Return success on deadlock for ww_mutex waiters
locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
locking/rtmutex: Dequeue waiter on ww_mutex deadlock
locking/rtmutex: Dont dereference waiter lockless
locking/semaphore: Add might_sleep() to down_*() family
locking/ww_mutex: Initialize waiter.ww_ctx properly
static_call: Update API documentation
locking/local_lock: Add PREEMPT_RT support
locking/spinlock/rt: Prepare for RT local_lock
locking/rtmutex: Add adaptive spinwait mechanism
locking/rtmutex: Implement equal priority lock stealing
preempt: Adjust PREEMPT_LOCK_OFFSET for RT
locking/rtmutex: Prevent lockdep false positive with PI futexes
futex: Prevent requeue_pi() lock nesting issue on RT
futex: Simplify handle_early_requeue_pi_wakeup()
futex: Reorder sanity checks in futex_requeue()
futex: Clarify comment in futex_requeue()
futex: Restructure futex_requeue()
futex: Correct the number of requeued waiters for PI
futex: Remove bogus condition for requeue PI
...
Pull RCU updates from Paul McKenney:
"RCU changes for this cycle were:
- Documentation updates
- Miscellaneous fixes
- Offloaded-callbacks updates
- Updates to the nolibc library
- Tasks-RCU updates
- In-kernel torture-test updates
- Torture-test scripting, perhaps most notably the pinning of
torture-test guest OSes so as to force differences in memory
latency. For example, in a two-socket system, a four-CPU guest OS
will have one pair of its CPUs pinned to threads in a single core
on one socket and the other pair pinned to threads in a single core
on the other socket. This approach proved able to force race
conditions that earlier testing missed. Some of these race
conditions are still being tracked down"
* 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (61 commits)
torture: Replace deprecated CPU-hotplug functions.
rcu: Replace deprecated CPU-hotplug functions
rcu: Print human-readable message for schedule() in RCU reader
rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU
rcu: Use per_cpu_ptr to get the pointer of per_cpu variable
rcu: Remove useless "ret" update in rcu_gp_fqs_loop()
rcu: Mark accesses in tree_stall.h
rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack
rcu: Mark lockless ->qsmask read in rcu_check_boost_fail()
srcutiny: Mark read-side data races
rcu: Start timing stall repetitions after warning complete
rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()
rcu/tree: Handle VM stoppage in stall detection
rculist: Unify documentation about missing list_empty_rcu()
rcu: Mark accesses to ->rcu_read_lock_nesting
rcu: Weaken ->dynticks accesses and updates
rcu: Remove special bit at the bottom of the ->dynticks counter
rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
rcu: Fix to include first blocked task in stall warning
torture: Make kvm-test-1-run-qemu.sh check for reboot loops
...
ww_mutexes can legitimately cause a deadlock situation in the lock graph
which is resolved afterwards by the wait/wound mechanics. The rtmutex chain
walk can detect such a deadlock and returns EDEADLK which in turn skips the
wait/wound mechanism and returns EDEADLK to the caller. That's wrong
because both lock chains might get EDEADLK or the wrong waiter would back
out.
Detect that situation and return 'success' in case that the waiter which
initiated the chain walk is a ww_mutex with context. This allows the
wait/wound mechanics to resolve the situation according to the rules.
[ tglx: Split it apart and added changelog ]
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
rtmutex based ww_mutexes can legitimately create a cycle in the lock graph
which can be observed by a blocker which didn't cause the problem:
P1: A, ww_A, ww_B
P2: ww_B, ww_A
P3: A
P3 might therefore be trapped in the ww_mutex induced cycle and run into
the lock depth limitation of rt_mutex_adjust_prio_chain() which returns
-EDEADLK to the caller.
Disable the deadlock detection walk when the chain walk observes a
ww_mutex to prevent this looping.
[ tglx: Split it apart and added changelog ]
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
The rt_mutex based ww_mutex variant queues the new waiter first in the
lock's rbtree before evaluating the ww_mutex specific conditions which
might decide that the waiter should back out. This check and conditional
exit happens before the waiter is enqueued into the PI chain.
The failure handling at the call site assumes that the waiter, if it is the
top most waiter on the lock, is queued in the PI chain and then proceeds to
adjust the unmodified PI chain, which results in RB tree corruption.
Dequeue the waiter from the lock waiter list in the ww_mutex error exit
path to prevent this.
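A sketch of the error-exit fix in the blocking path (abbreviated;
treat the details as illustrative):

  if (build_ww_mutex() && ww_ctx) {
      struct rt_mutex *rtm;

      /* Check whether the waiter should back out immediately */
      rtm = container_of(lock, struct rt_mutex, rtmutex);
      res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
      if (res) {
          /* Undo the rbtree queueing before the PI chain sees it */
          raw_spin_lock(&task->pi_lock);
          rt_mutex_dequeue(lock, waiter);
          task->pi_blocked_on = NULL;
          raw_spin_unlock(&task->pi_lock);
          return res;
      }
  }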
Fixes: add461325e ("locking/rtmutex: Extend the rtmutex core to support ww_mutex")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102454.042280541@linutronix.de
The new rt_mutex_spin_on_owner() loop checks whether the spinning
waiter is still the top waiter on the lock by utilizing
rt_mutex_top_waiter(), which is broken because that function contains
a sanity check which dereferences the top waiter pointer to check
whether the waiter belongs to the lock. That's wrong in the lockless
spinwait case:
CPU 0                                           CPU 1
rt_mutex_lock(lock)                             rt_mutex_lock(lock);
  queue(waiter0)
  waiter0 == rt_mutex_top_waiter(lock)
  rt_mutex_spin_on_owner(lock, waiter0) {         queue(waiter1)
                                                  waiter1 == rt_mutex_top_waiter(lock)
                                                  ...
    top_waiter = rt_mutex_top_waiter(lock)
    leftmost = rb_first_cached(&lock->waiters);
                                                -> signal
                                                dequeue(waiter1)
                                                destroy(waiter1)
    w = rb_entry(leftmost, ....)
    BUG_ON(w->lock != lock)  <- UAF
The BUG_ON() is correct for the case where the caller holds lock->wait_lock
which guarantees that the leftmost waiter entry cannot vanish. For the
lockless spinwait case it's broken.
Create a new helper function which avoids the pointer dereference
and just compares the leftmost entry pointer with current's waiter
pointer to validate that current is still eligible for spinning.
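A sketch of such a helper (pointer comparison only, no dereference):

  static __always_inline bool
  rt_mutex_waiter_is_top_waiter(struct rt_mutex_base *lock,
                                struct rt_mutex_waiter *waiter)
  {
      /* No rb_entry()/dereference: safe in the lockless spinwait */
      return rb_first_cached(&lock->waiters) == &waiter->tree_entry;
  }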
Fixes: 992caf7f17 ("locking/rtmutex: Add adaptive spinwait mechanism")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210825102453.981720644@linutronix.de
The consolidation of the debug code for mutex waiter initialization
sets waiter::ww_ctx to a poison value unconditionally. For regular
mutexes this is intended to catch the case where waiter::ww_ctx is
dereferenced accidentally.
For ww_mutex the poison value has to be overwritten either with a
context pointer or NULL for ww_mutexes without context.
The rework broke this as it made the store conditional on the context
pointer instead of the argument which signals whether the ww_mutex
code should be compiled in or optimized out. As a result waiter::ww_ctx
ends up with the poison pointer for contextless ww_mutexes, which
causes a later dereference of the poison pointer because it is
!= NULL.
Use the build argument instead so that for ww_mutex the poison value
is always overwritten.
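Sketch of the corrected conditional store (the argument name follows
the 5.15 mutex code; treat it as illustrative):

  /*
   * Key the store off the compile-time argument, not the runtime
   * ww_ctx pointer, so the poison is always overwritten, even for
   * contextless ww_mutexes where ww_ctx is NULL.
   */
  if (use_ww_ctx)
      waiter.ww_ctx = ww_ctx;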
Fixes: c0afb0ffc0 ("locking/ww_mutex: Gather mutex_waiter initialization")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210819193030.zpwrpvvrmy7xxxiy@linutronix.de
Going to sleep when locks are contended can be quite inefficient when the
contention time is short and the lock owner is running on a different CPU.
The MCS mechanism cannot be used because MCS is strictly FIFO ordered while
for rtmutex based locks the waiter ordering is priority based.
Provide a simple adaptive spinwait mechanism which currently restricts the
spinning to the top priority waiter.
[ tglx: Provide a contemporary changelog, extended it to all rtmutex based
locks and updated it to match the other spin on owner implementations ]
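The mechanism is roughly the following loop (a sketch patterned on
the mutex spin-on-owner code; note that the top-waiter check shown
here is the one later replaced by the pointer-comparison helper
described earlier in this series):

  static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
                                    struct rt_mutex_waiter *waiter,
                                    struct task_struct *owner)
  {
      bool res = true;

      rcu_read_lock();
      for (;;) {
          /* If the owner changed, go back and try to acquire the lock */
          if (owner != rt_mutex_owner(lock))
              break;
          barrier();
          /*
           * Stop spinning when the owner went off CPU, a resched is
           * due, or we are no longer the top-priority waiter.
           */
          if (!owner->on_cpu || need_resched() ||
              waiter != rt_mutex_top_waiter(lock)) {
              res = false;
              break;
          }
          cpu_relax();
      }
      rcu_read_unlock();
      return res;
  }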
Originally-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.912050691@linutronix.de
The current logic only allows lock stealing to occur if the current
task is of higher priority than the pending owner.
Significant throughput improvements can be gained by allowing the
lock stealing to include tasks of equal priority when the contended
lock is a spin_lock or a rw_lock and the tasks are not RT tasks.
The assumption was that the system will make faster progress by allowing
the task already on the CPU to take the lock rather than waiting for the
system to wake up a different task.
This does add a degree of unfairness, but in reality no negative side
effects have been observed in the many years that this has been used in the
RT kernel.
[ tglx: Refactored and rewritten several times by Steve Rostedt, Sebastian
Siewior and myself ]
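A sketch of the resulting steal condition (close to the upstream
helper; treat it as illustrative):

  static __always_inline int rt_mutex_steal(struct rt_mutex_waiter *waiter,
                                            struct rt_mutex_waiter *top_waiter)
  {
      if (rt_mutex_waiter_less(waiter, top_waiter))
          return 1;

      /*
       * RT (and DL) tasks are excluded from same-priority (lateral)
       * steals to preserve round-robin scheduling.
       */
      if (rt_prio(waiter->prio) || dl_prio(waiter->prio))
          return 0;

      return rt_mutex_waiter_equal(waiter, top_waiter);
  }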
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.857240222@linutronix.de
On PREEMPT_RT the futex hashbucket spinlock becomes 'sleeping' and rtmutex
based. That causes a lockdep false positive because some of the futex
functions invoke spin_unlock(&hb->lock) with the wait_lock of the rtmutex
associated to the pi_futex held. spin_unlock() in turn takes wait_lock of
the rtmutex on which the spinlock is based which makes lockdep notice a
lock recursion.
Give the futex/rtmutex wait_lock a separate key.
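A generic illustration of the technique (not the actual upstream
diff; the key name is hypothetical):

  static struct lock_class_key pi_futex_wait_lock_key; /* hypothetical */

  /* Put the pi_futex rtmutex wait_lock in its own lockdep class */
  lockdep_set_class(&pi_mutex->wait_lock, &pi_futex_wait_lock_key);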
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211305.750701219@linutronix.de
Ensure all !RT tasks have the same prio such that they end up in FIFO
order and aren't split up according to nice level.
The reason why nice levels were taken into account so far is historical. In
the early days of the rtmutex code it was done to give the PI boosting and
deboosting a larger coverage.
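Sketch of the waiter-priority squash (mirroring the upstream helper):

  static __always_inline int __waiter_prio(struct task_struct *task)
  {
      int prio = task->prio;

      /* Squash all !RT tasks to DEFAULT_PRIO for FIFO ordering */
      if (!rt_prio(prio))
          return DEFAULT_PRIO;

      return prio;
  }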
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.938676930@linutronix.de
Similar to rw_semaphores, on RT the rwlock substitution is not writer
fair, because it is not feasible to have a writer's priority
inherited by multiple readers. Readers blocked on a writer follow the
normal rules of priority inheritance. Like RT spinlocks, RT rwlocks
are state preserving across the slow lock operations (contended
case).
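On RT the substituted type is built on the shared rwbase core
(a sketch of the layout, assuming the RT headers):

  typedef struct {
      struct rwbase_rt rwbase;
  #ifdef CONFIG_DEBUG_LOCK_ALLOC
      struct lockdep_map dep_map;
  #endif
  } rwlock_t;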
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.882793524@linutronix.de