4a4073a2e2fe392db08c00c39fdc5e2f8f198547
1057304 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
4a4073a2e2 |
ipc/sem: Fix dangling sem_array access in semtimedop race
commit b52be557e24c47286738276121177a41f54e3b83 upstream.
When __do_semtimedop() goes to sleep because it has to wait for a
semaphore value becoming zero or becoming bigger than some threshold, it
links the on-stack sem_queue to the sem_array, then goes to sleep
without holding a reference on the sem_array.
When __do_semtimedop() comes back out of sleep, one of two things must
happen:
a) We prove that the on-stack sem_queue has been disconnected from the
(possibly freed) sem_array, making it safe to return from the stack
frame that the sem_queue exists in.
b) We stabilize our reference to the sem_array, lock the sem_array, and
detach the sem_queue from the sem_array ourselves.
sem_array has RCU lifetime, so for case (b), the reference can be
stabilized inside an RCU read-side critical section by locklessly
checking whether the sem_queue is still connected to the sem_array.
However, the current code does the lockless check on sem_queue before
starting an RCU read-side critical section, so the result of the
lockless check immediately becomes useless.
Fix it by doing rcu_read_lock() before the lockless check. Now RCU
ensures that if we observe the object being on our queue, the object
can't be freed until rcu_read_unlock().
This bug is only hittable on kernel builds with full preemption support
(either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full).
Fixes:
|
||
|
|
53b9b1201e |
Input: raydium_ts_i2c - fix memory leak in raydium_i2c_send()
commit 8c9a59939deb4bfafdc451100c03d1e848b4169b upstream.
There is a kmemleak when test the raydium_i2c_ts with bpf mock device:
unreferenced object 0xffff88812d3675a0 (size 8):
comm "python3", pid 349, jiffies 4294741067 (age 95.695s)
hex dump (first 8 bytes):
11 0e 10 c0 01 00 04 00 ........
backtrace:
[<0000000068427125>] __kmalloc+0x46/0x1b0
[<0000000090180f91>] raydium_i2c_send+0xd4/0x2bf [raydium_i2c_ts]
[<000000006e631aee>] raydium_i2c_initialize.cold+0xbc/0x3e4 [raydium_i2c_ts]
[<00000000dc6fcf38>] raydium_i2c_probe+0x3cd/0x6bc [raydium_i2c_ts]
[<00000000a310de16>] i2c_device_probe+0x651/0x680
[<00000000f5a96bf3>] really_probe+0x17c/0x3f0
[<00000000096ba499>] __driver_probe_device+0xe3/0x170
[<00000000c5acb4d9>] driver_probe_device+0x49/0x120
[<00000000264fe082>] __device_attach_driver+0xf7/0x150
[<00000000f919423c>] bus_for_each_drv+0x114/0x180
[<00000000e067feca>] __device_attach+0x1e5/0x2d0
[<0000000054301fc2>] bus_probe_device+0x126/0x140
[<00000000aad93b22>] device_add+0x810/0x1130
[<00000000c086a53f>] i2c_new_client_device+0x352/0x4e0
[<000000003c2c248c>] of_i2c_register_device+0xf1/0x110
[<00000000ffec4177>] of_i2c_notify+0x100/0x160
unreferenced object 0xffff88812d3675c8 (size 8):
comm "python3", pid 349, jiffies 4294741070 (age 95.692s)
hex dump (first 8 bytes):
22 00 36 2d 81 88 ff ff ".6-....
backtrace:
[<0000000068427125>] __kmalloc+0x46/0x1b0
[<0000000090180f91>] raydium_i2c_send+0xd4/0x2bf [raydium_i2c_ts]
[<000000001d5c9620>] raydium_i2c_initialize.cold+0x223/0x3e4 [raydium_i2c_ts]
[<00000000dc6fcf38>] raydium_i2c_probe+0x3cd/0x6bc [raydium_i2c_ts]
[<00000000a310de16>] i2c_device_probe+0x651/0x680
[<00000000f5a96bf3>] really_probe+0x17c/0x3f0
[<00000000096ba499>] __driver_probe_device+0xe3/0x170
[<00000000c5acb4d9>] driver_probe_device+0x49/0x120
[<00000000264fe082>] __device_attach_driver+0xf7/0x150
[<00000000f919423c>] bus_for_each_drv+0x114/0x180
[<00000000e067feca>] __device_attach+0x1e5/0x2d0
[<0000000054301fc2>] bus_probe_device+0x126/0x140
[<00000000aad93b22>] device_add+0x810/0x1130
[<00000000c086a53f>] i2c_new_client_device+0x352/0x4e0
[<000000003c2c248c>] of_i2c_register_device+0xf1/0x110
[<00000000ffec4177>] of_i2c_notify+0x100/0x160
After BANK_SWITCH command from i2c BUS, no matter success or error
happened, the tx_buf should be freed.
Fixes:
|
||
|
|
571b6bbbf5 |
char: tpm: Protect tpm_pm_suspend with locks
commit 23393c6461422df5bf8084a086ada9a7e17dc2ba upstream.
Currently tpm transactions are executed unconditionally in
tpm_pm_suspend() function, which may lead to races with other tpm
accessors in the system.
Specifically, the hw_random tpm driver makes use of tpm_get_random(),
and this function is called in a loop from a kthread, which means it's
not frozen alongside userspace, and so can race with the work done
during system suspend:
tpm tpm0: tpm_transmit: tpm_recv: error -52
tpm tpm0: invalid TPM_STS.x 0xff, dumping stack for forensics
CPU: 0 PID: 1 Comm: init Not tainted 6.1.0-rc5+ #135
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
Call Trace:
tpm_tis_status.cold+0x19/0x20
tpm_transmit+0x13b/0x390
tpm_transmit_cmd+0x20/0x80
tpm1_pm_suspend+0xa6/0x110
tpm_pm_suspend+0x53/0x80
__pnp_bus_suspend+0x35/0xe0
__device_suspend+0x10f/0x350
Fix this by calling tpm_try_get_ops(), which itself is a wrapper around
tpm_chip_start(), but takes the appropriate mutex.
Signed-off-by: Jan Dabros <jsd@semihalf.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/all/c5ba47ef-393f-1fba-30bd-1230d1b4b592@suse.cz/
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
f39891cfe7 |
Revert "clocksource/drivers/riscv: Events are stopped during CPU suspend"
[ Upstream commit d9f15a9de44affe733e34f93bc184945ba277e6d ]
This reverts commit 232ccac1bd9b5bfe73895f527c08623e7fa0752d.
On the subject of suspend, the RISC-V SBI spec states:
This does not cover whether any given events actually reach the hart or
not, just what the hart will do if it receives an event. On PolarFire
SoC, and potentially other SiFive based implementations, events from the
RISC-V timer do reach a hart during suspend. This is not the case for the
implementation on the Allwinner D1 - there timer events are not received
during suspend.
To fix this, the CLOCK_EVT_FEAT_C3STOP (mis)feature was enabled for the
timer driver - but this has broken both RCU stall detection and timers
generally on PolarFire SoC and potentially other SiFive based
implementations.
If an AXI read to the PCIe controller on PolarFire SoC times out, the
system will stall, however, with CLOCK_EVT_FEAT_C3STOP active, the system
just locks up without RCU stalling:
io scheduler mq-deadline registered
io scheduler kyber registered
microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
microchip-pcie 2000000000.pcie: axi read request error
microchip-pcie 2000000000.pcie: axi read timeout
microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
Freeing initrd memory: 7332K
Similarly issues were reported with clock_nanosleep() - with a test app
that sleeps each cpu for 6, 5, 4, 3 ms respectively, HZ=250 & the blamed
commit in place, the sleep times are rounded up to the next jiffy:
== CPU: 1 == == CPU: 2 == == CPU: 3 == == CPU: 4 ==
Mean: 7.974992 Mean: 7.976534 Mean: 7.962591 Mean: 3.952179
Std Dev: 0.154374 Std Dev: 0.156082 Std Dev: 0.171018 Std Dev: 0.076193
Hi: 9.472000 Hi: 10.495000 Hi: 8.864000 Hi: 4.736000
Lo: 6.087000 Lo: 6.380000 Lo: 4.872000 Lo: 3.403000
Samples: 521 Samples: 521 Samples: 521 Samples: 521
Fortunately, the D1 has a second timer, which is "currently used in
preference to the RISC-V/SBI timer driver" so a revert here does not
hurt operation of D1 in its current form.
Ultimately, a DeviceTree property (or node) will be added to encode the
behaviour of the timers, but until then revert the addition of
CLOCK_EVT_FEAT_C3STOP.
Fixes: 232ccac1bd9b ("clocksource/drivers/riscv: Events are stopped during CPU suspend")
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Samuel Holland <samuel@sholland.org>
Link: https://lore.kernel.org/linux-riscv/YzYTNQRxLr7Q9JR0@spud/
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/issues/98/
Link: https://lore.kernel.org/linux-riscv/bf6d3b1f-f703-4a25-833e-972a44a04114@sholland.org/
Link: https://lore.kernel.org/r/20221122121620.3522431-1-conor.dooley@microchip.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
a759057af7 |
ACPI: HMAT: Fix initiator registration for single-initiator systems
[ Upstream commit 48d4180939e12c4bd2846f984436d895bb9699ed ] In a system with a single initiator node, and one or more memory-only 'target' nodes, the memory-only node(s) would fail to register their initiator node correctly. i.e. in sysfs: # ls /sys/devices/system/node/node0/access0/targets/ node0 Where as the correct behavior should be: # ls /sys/devices/system/node/node0/access0/targets/ node0 node1 This happened because hmat_register_target_initiators() uses list_sort() to sort the initiator list, but the sort comparision function (initiator_cmp()) is overloaded to also set the node mask's bits. In a system with a single initiator, the list is singular, and list_sort elides the comparision helper call. Thus the node mask never gets set, and the subsequent search for the best initiator comes up empty. Add a new helper to consume the sorted initiator list, and generate the nodemask, decoupling it from the overloaded initiator_cmp() comparision callback. This prevents the singular list corner case naturally, and makes the code easier to follow as well. Cc: <stable@vger.kernel.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Chris Piper <chris.d.piper@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20221116-acpi_hmat_fix-v2-2-3712569be691@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
da8a794d71 |
ACPI: HMAT: remove unnecessary variable initialization
[ Upstream commit 14f16d47561ba9249efc6c2db9d47ed56841f070 ] In hmat_register_target_initiators(), the variable 'best' gets initialized in the outer per-locality-type for loop. The initialization just before setting up 'Access 1' targets was unnecessary. Remove it. Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20221116-acpi_hmat_fix-v2-1-3712569be691@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Stable-dep-of: 48d4180939e1 ("ACPI: HMAT: Fix initiator registration for single-initiator systems") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
2d16161a2c |
i2c: imx: Only DMA messages with I2C_M_DMA_SAFE flag set
[ Upstream commit d36678f7905cbd1dc55a8a96e066dafd749d4600 ]
Recent changes to the DMA code has resulting in the IMX driver failing
I2C transfers when the buffer has been vmalloc. Only perform DMA
transfers if the message has the I2C_M_DMA_SAFE flag set, indicating
the client is providing a buffer which is DMA safe.
This is a minimal fix for stable. The I2C core provides helpers to
allocate a bounce buffer. For a fuller fix the master should make use
of these helpers.
Fixes:
|
||
|
|
950a05cb15 |
i2c: npcm7xx: Fix error handling in npcm_i2c_init()
[ Upstream commit 145900cf91c4b32ac05dbc8675a0c7f4a278749d ]
A problem about i2c-npcm7xx create debugfs failed is triggered with the
following log given:
[ 173.827310] debugfs: Directory 'npcm_i2c' with parent '/' already present!
The reason is that npcm_i2c_init() returns platform_driver_register()
directly without checking its return value, if platform_driver_register()
failed, it returns without destroy the newly created debugfs, resulting
the debugfs of npcm_i2c can never be created later.
npcm_i2c_init()
debugfs_create_dir() # create debugfs directory
platform_driver_register()
driver_register()
bus_add_driver()
priv = kzalloc(...) # OOM happened
# return without destroy debugfs directory
Fix by removing debugfs when platform_driver_register() returns error.
Fixes:
|
||
|
|
db3f8da033 |
serial: stm32: Deassert Transmit Enable on ->rs485_config()
[ Upstream commit adafbbf6895eb0ce41a313c6ee68870ab9aa93cd ] The STM32 USART can control RS-485 Transmit Enable in hardware. Since commit |
||
|
|
45f628f4fd |
serial: stm32: Use TC interrupt to deassert GPIO RTS in RS485 mode
[ Upstream commit d7c76716169ddc37cf6316ff381d34ea807fbfd7 ] In case the RS485 mode is emulated using GPIO RTS, use the TC interrupt to deassert the GPIO RTS, otherwise the GPIO RTS stays asserted after a transmission ended and the RS485 cannot work. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Erwan Le Ray <erwan.leray@foss.st.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jean Philippe Romain <jean-philippe.romain@foss.st.com> Cc: Valentin Caron <valentin.caron@foss.st.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-stm32@st-md-mailman.stormreply.com To: linux-serial@vger.kernel.org Link: https://lore.kernel.org/r/20220430162845.244655-2-marex@denx.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
c60eae5b1d |
serial: stm32: Factor out GPIO RTS toggling into separate function
[ Upstream commit 3bcea529b295a993b1b05db63f245ae8030c5acf ] Pull out the GPIO RTS enable and disable handling into separate function. Limit the scope of GPIO RTS toggling only to GPIO emulated RS485 too. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Erwan Le Ray <erwan.leray@foss.st.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jean Philippe Romain <jean-philippe.romain@foss.st.com> Cc: Valentin Caron <valentin.caron@foss.st.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-stm32@st-md-mailman.stormreply.com To: linux-serial@vger.kernel.org Link: https://lore.kernel.org/r/20220430162845.244655-1-marex@denx.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
041f8dc882 |
ipv4: Fix route deletion when nexthop info is not specified
[ Upstream commit d5082d386eee7e8ec46fa8581932c81a4961dcef ]
When the kernel receives a route deletion request from user space it
tries to delete a route that matches the route attributes specified in
the request.
If only prefix information is specified in the request, the kernel
should delete the first matching FIB alias regardless of its associated
FIB info. However, an error is currently returned when the FIB info is
backed by a nexthop object:
# ip nexthop add id 1 via 192.0.2.2 dev dummy10
# ip route add 198.51.100.0/24 nhid 1
# ip route del 198.51.100.0/24
RTNETLINK answers: No such process
Fix by matching on such a FIB info when legacy nexthop attributes are
not specified in the request. An earlier check already covers the case
where a nexthop ID is specified in the request.
Add tests that cover these flows. Before the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [FAIL]
Tests passed: 11
Tests failed: 1
After the fix:
# ./fib_nexthops.sh -t ipv4_fcnal
...
TEST: Delete route when not specifying nexthop attributes [ OK ]
Tests passed: 12
Tests failed: 0
No regressions in other tests:
# ./fib_nexthops.sh
...
Tests passed: 228
Tests failed: 0
# ./fib_tests.sh
...
Tests passed: 186
Tests failed: 0
Cc: stable@vger.kernel.org
Reported-by: Jonas Gorski <jonas.gorski@gmail.com>
Tested-by: Jonas Gorski <jonas.gorski@gmail.com>
Fixes:
|
||
|
|
25174d91e4 |
ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference
[ Upstream commit 61b91eb33a69c3be11b259c5ea484505cd79f883 ]
Gwangun Jung reported a slab-out-of-bounds access in fib_nh_match:
fib_nh_match+0xf98/0x1130 linux-6.0-rc7/net/ipv4/fib_semantics.c:961
fib_table_delete+0x5f3/0xa40 linux-6.0-rc7/net/ipv4/fib_trie.c:1753
inet_rtm_delroute+0x2b3/0x380 linux-6.0-rc7/net/ipv4/fib_frontend.c:874
Separate nexthop objects are mutually exclusive with the legacy
multipath spec. Fix fib_nh_match to return if the config for the
to be deleted route contains a multipath spec while the fib_info
is using a nexthop object.
Fixes:
|
||
|
|
a0ad247e55 |
selftests: net: fix nexthop warning cleanup double ip typo
[ Upstream commit 692930cc435099580a4b9e32fa781b0688c18439 ]
I made a stupid typo when adding the nexthop route warning selftest and
added both $IP and ip after it (double ip) on the cleanup path. The
error doesn't show up when running the test, but obviously it doesn't
cleanup properly after it.
Fixes: 392baa339c6a ("selftests: net: add delete nexthop route warning test")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d5082d386eee ("ipv4: Fix route deletion when nexthop info is not specified")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
532847b69c |
selftests: net: add delete nexthop route warning test
[ Upstream commit 392baa339c6a42a2cb088e5e5df2b59b8f89be24 ] Add a test which causes a WARNING on kernels which treat a nexthop route like a normal route when comparing for deletion and a device is specified. That is, a route is found but we hit a warning while matching it. The warning is from fib_info_nh() in include/net/nexthop.h because we run it on a fib_info with nexthop object. The call chain is: inet_rtm_delroute -> fib_table_delete -> fib_nh_match (called with a nexthop fib_info and also with fc_oif set thus calling fib_info_nh on the fib_info and triggering the warning). Repro steps: $ ip nexthop add id 12 via 172.16.1.3 dev veth1 $ ip route add 172.16.101.1/32 nhid 12 $ ip route delete 172.16.101.1/32 dev veth1 Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: d5082d386eee ("ipv4: Fix route deletion when nexthop info is not specified") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
e078355881 |
Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
[ Upstream commit 152fe65f300e1819d59b80477d3e0999b4d5d7d2 ] When enabled, KASAN enlarges function's stack-frames. Pushing quite a few over the current threshold. This can mainly be seen on 32-bit architectures where the present limit (when !GCC) is a lowly 1024-Bytes. Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org Signed-off-by: Lee Jones <lee@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Leo Li <sunpeng.li@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
723fa02e0e |
parisc: Increase FRAME_WARN to 2048 bytes on parisc
[ Upstream commit 8d192bec534bd5b778135769a12e5f04580771f7 ] PA-RISC uses a much bigger frame size for functions than other architectures. So increase it to 2048 for 32- and 64-bit kernels. This fixes e.g. a warning in lib/xxhash.c. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Helge Deller <deller@gmx.de> Stable-dep-of: 152fe65f300e ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
b951ab4b35 |
mm: migrate: fix THP's mapcount on isolation
[ Upstream commit 829ae0f81ce093d674ff2256f66a714753e9ce32 ]
The issue is reported when removing memory through virtio_mem device. The
transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page can't
be migrated and the corresponding memory block can't be put into offline
state.
Fix it by replacing page_mapcount() with total_mapcount(). With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into offline state. Besides, The page's refcount is increased
a bit earlier to avoid the page is released when the check is executed.
Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes:
|
||
|
|
c5eda6029c |
mm: __isolate_lru_page_prepare() in isolate_migratepages_block()
[ Upstream commit 89f6c88a6ab4a11deb14c270f7f1454cda4f73d6 ] __isolate_lru_page_prepare() conflates two unrelated functions, with the flags to one disjoint from the flags to the other; and hides some of the important checks outside of isolate_migratepages_block(), where the sequence is better to be visible. It comes from the days of lumpy reclaim, before compaction, when the combination made more sense. Move what's needed by mm/compaction.c isolate_migratepages_block() inline there, and what's needed by mm/vmscan.c isolate_lru_pages() inline there. Shorten "isolate_mode" to "mode", so the sequence of conditions is easier to read. Declare a "mapping" variable, to save one call to page_mapping() (but not another: calling again after page is locked is necessary). Simplify isolate_lru_pages() with a "move_to" list pointer. Link: https://lkml.kernel.org/r/879d62a8-91cc-d3c6-fb3b-69768236df68@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Alex Shi <alexs@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 829ae0f81ce0 ("mm: migrate: fix THP's mapcount on isolation") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
bdb613ef17 |
iommu/vt-d: Fix PCI device refcount leak in dmar_dev_scope_init()
[ Upstream commit 4bedbbd782ebbe7287231fea862c158d4f08a9e3 ]
for_each_pci_dev() is implemented by pci_get_device(). The comment of
pci_get_device() says that it will increase the reference count for the
returned pci_dev and also decrease the reference count for the input
pci_dev @from if it is not NULL.
If we break for_each_pci_dev() loop with pdev not NULL, we need to call
pci_dev_put() to decrease the reference count. Add the missing
pci_dev_put() for the error path to avoid reference count leak.
Fixes:
|
||
|
|
b6eea8b2e8 |
iommu/vt-d: Fix PCI device refcount leak in has_external_pci()
[ Upstream commit afca9e19cc720bfafc75dc5ce429c185ca93f31d ]
for_each_pci_dev() is implemented by pci_get_device(). The comment of
pci_get_device() says that it will increase the reference count for the
returned pci_dev and also decrease the reference count for the input
pci_dev @from if it is not NULL.
If we break for_each_pci_dev() loop with pdev not NULL, we need to call
pci_dev_put() to decrease the reference count. Add the missing
pci_dev_put() before 'return true' to avoid reference count leak.
Fixes:
|
||
|
|
787d81d4eb |
nvme: fix SRCU protection of nvme_ns_head list
[ Upstream commit 899d2a05dc14733cfba6224083c6b0dd5a738590 ]
Walking the nvme_ns_head siblings list is protected by the head's srcu
in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths().
Removing namespaces from the list also fails to synchronize the srcu.
Concurrent scan work can therefore cause use-after-frees.
Hold the head's srcu lock in nvme_mpath_revalidate_paths() and
synchronize with the srcu, not the global RCU, in nvme_ns_remove().
Observed the following panic when making NVMe/RDMA connections
with native multipath on the Rocky Linux 8.6 kernel
(it seems the upstream kernel has the same race condition).
Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx;
computing capacity != get_capacity(ns->disk).
Address 0x50 is dereferenced because ns->disk is NULL.
The NULL disk appears to be the result of concurrent scan work
freeing the namespace (note the log line in the middle of the panic).
[37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
[37314.299753] PGD 0 P4D 0
[37314.299756] Oops: 0000 [#1] SMP PTI
[37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G W X --------- - - 4.18.0-372.32.1.el8test86.x86_64 #1
[37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
[37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
[37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
[37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
[37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
[37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
[37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
[37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
[37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
[37315.548286] FS: 0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
[37315.645111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
[37315.799267] Call Trace:
[37315.828515] nvme_update_ns_info+0x1ac/0x250 [nvme_core]
[37315.892075] nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
[37315.961871] ? __blk_mq_free_request+0x6b/0x90
[37316.015021] nvme_scan_work+0x151/0x240 [nvme_core]
[37316.073371] process_one_work+0x1a7/0x360
[37316.121318] ? create_worker+0x1a0/0x1a0
[37316.168227] worker_thread+0x30/0x390
[37316.212024] ? create_worker+0x1a0/0x1a0
[37316.258939] kthread+0x10a/0x120
[37316.297557] ? set_kthread_struct+0x50/0x50
[37316.347590] ret_from_fork+0x35/0x40
[37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
[37316.390419] sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
[37317.645908] CR2: 0000000000000050
Fixes:
|
||
|
|
12f237200c |
riscv: kexec: Fixup irq controller broken in kexec crash path
[ Upstream commit b17d19a5314a37f7197afd1a0200affd21a7227d ] If a crash happens on cpu3 and all interrupts are binding on cpu0, the bad irq routing will cause a crash kernel which can't receive any irq. Because crash kernel won't clean up all harts' PLIC enable bits in enable registers. This patch is similar to |
||
|
|
ac00301adb |
riscv: fix race when vmap stack overflow
[ Upstream commit 7e1864332fbc1b993659eab7974da9fe8bf8c128 ]
Currently, when detecting vmap stack overflow, riscv firstly switches
to the so called shadow stack, then use this shadow stack to call the
get_overflow_stack() to get the overflow stack. However, there's
a race here if two or more harts use the same shadow stack at the same
time.
To solve this race, we introduce spin_shadow_stack atomic var, which
will be swap between its own address and 0 in atomic way, when the
var is set, it means the shadow_stack is being used; when the var
is cleared, it means the shadow_stack isn't being used.
Fixes:
|
||
|
|
fa7a7d185e |
riscv: Sync efi page table's kernel mappings before switching
[ Upstream commit 3f105a742725a1b78766a55169f1d827732e62b8 ]
The EFI page table is initially created as a copy of the kernel page table.
With VMAP_STACK enabled, kernel stacks are allocated in the vmalloc area:
if the stack is allocated in a new PGD (one that was not present at the
moment of the efi page table creation or not synced in a previous vmalloc
fault), the kernel will take a trap when switching to the efi page table
when the vmalloc kernel stack is accessed, resulting in a kernel panic.
Fix that by updating the efi kernel mappings before switching to the efi
page table.
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Fixes:
|
||
|
|
d86d698925 |
pinctrl: single: Fix potential division by zero
[ Upstream commit 64c150339e7f6c5cbbe8c17a56ef2b3902612798 ]
There is a possibility of dividing by zero due to the pcs->bits_per_pin
if pcs->fmask() also has a value of zero and called fls
from asm-generic/bitops/builtin-fls.h or arch/x86/include/asm/bitops.h.
The function pcs_probe() has the branch that assigned to fmask 0 before
pcs_allocate_pin_table() was called
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes:
|
||
|
|
98b15c7066 |
ASoC: ops: Fix bounds check for _sx controls
[ Upstream commit 698813ba8c580efb356ace8dbf55f61dac6063a8 ]
For _sx controls the semantics of the max field is not the usual one, max
is the number of steps rather than the maximum value. This means that our
check in snd_soc_put_volsw_sx() needs to just check against the maximum
value.
Fixes: 4f1e50d6a9cf9c1b ("ASoC: ops: Reject out of bounds values in snd_soc_put_volsw_sx()")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220511134137.169575-1-broonie@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
f88a6977f8 |
KVM: x86/mmu: Fix race condition in direct_page_fault
commit 47b0c2e4c220f2251fd8dcfbb44479819c715e15 upstream.
make_mmu_pages_available() must be called with mmu_lock held for write.
However, if the TDP MMU is used, it will be called with mmu_lock held for
read.
This function does nothing unless shadow pages are used, so there is no
race unless nested TDP is used.
Since nested TDP uses shadow pages, old shadow pages may be zapped by this
function even when the TDP MMU is enabled.
Since shadow pages are never allocated by kvm_tdp_mmu_map(), a race
condition can be avoided by not calling make_mmu_pages_available() if the
TDP MMU is currently in use.
I encountered this when repeatedly starting and stopping nested VM.
It can be artificially caused by allocating a large number of nested TDP
SPTEs.
For example, the following BUG and general protection fault are caused in
the host kernel.
pte_list_remove: 00000000cd54fc10 many->many
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/mmu.c:963!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:pte_list_remove.cold+0x16/0x48 [kvm]
Call Trace:
<TASK>
drop_spte+0xe0/0x180 [kvm]
mmu_page_zap_pte+0x4f/0x140 [kvm]
__kvm_mmu_prepare_zap_page+0x62/0x3e0 [kvm]
kvm_mmu_zap_oldest_mmu_pages+0x7d/0xf0 [kvm]
direct_page_fault+0x3cb/0x9b0 [kvm]
kvm_tdp_page_fault+0x2c/0xa0 [kvm]
kvm_mmu_page_fault+0x207/0x930 [kvm]
npf_interception+0x47/0xb0 [kvm_amd]
svm_invoke_exit_handler+0x13c/0x1a0 [kvm_amd]
svm_handle_exit+0xfc/0x2c0 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0xa79/0x1780 [kvm]
kvm_vcpu_ioctl+0x29b/0x6f0 [kvm]
__x64_sys_ioctl+0x95/0xd0
do_syscall_64+0x5c/0x90
general protection fault, probably for non-canonical address
0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:kvm_mmu_commit_zap_page.part.0+0x4b/0xe0 [kvm]
Call Trace:
<TASK>
kvm_mmu_zap_oldest_mmu_pages+0xae/0xf0 [kvm]
direct_page_fault+0x3cb/0x9b0 [kvm]
kvm_tdp_page_fault+0x2c/0xa0 [kvm]
kvm_mmu_page_fault+0x207/0x930 [kvm]
npf_interception+0x47/0xb0 [kvm_amd]
CVE: CVE-2022-45869
Fixes:
|
||
|
|
df4b177b48 |
io_uring/poll: fix poll_refs race with cancelation
[ upstream commit 12ad3d2d6c5b0131a6052de91360849e3e154846 ] There is an interesting race condition of poll_refs which could result in a NULL pointer dereference. The crash trace is like: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline] RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190 Code: ... RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000 RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781 R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60 R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000 FS: 00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299 handle_tw_list io_uring/io_uring.c:1037 [inline] tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090 task_work_run+0x13a/0x1b0 kernel/task_work.c:177 get_signal+0x2402/0x25a0 kernel/signal.c:2635 arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869 exit_to_user_mode_loop kernel/entry/common.c:166 [inline] exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause for this is a tiny overlooking in io_poll_check_events() when cocurrently run with poll cancel routine io_poll_cancel_req(). The interleaving to trigger use-after-free: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership v = atomic_read(...) | // poll_refs not change | | if (v & IO_POLL_CANCEL_FLAG) | return -ECANCELED; | // io_poll_check_events return | // will go into | // io_req_complete_failed() free req | | | io_apoll_task_func() | // also go into io_req_complete_failed() And the interleaving to trigger the kernel WARNING: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | v = atomic_read(...) | // v = IO_POLL_CANCEL_FLAG | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership | WARN_ON_ONCE(!(v & IO_POLL_REF_MASK))) | // v & IO_POLL_REF_MASK = 0 WARN | | | io_apoll_task_func() | // also go into io_req_complete_failed() By looking up the source code and communicating with Pavel, the implementation of this atomic poll refs should continue the loop of io_poll_check_events() just to avoid somewhere else to grab the ownership. Therefore, this patch simply adds another AND operation to make sure the loop will stop if it finds the poll_refs is exactly equal to IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and will finally make its way to io_req_complete_failed(), the req will be reclaimed as expected. Fixes: aa43477b0402 ("io_uring: poll rework") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: tweak description and code style] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
4b702b7d11 |
io_uring: make poll refs more robust
[ upstream commit a26a35e9019fd70bf3cf647dcfdae87abc7bacea ] poll_refs carry two functions, the first is ownership over the request. The second is notifying the io_poll_check_events() that there was an event but wake up couldn't grab the ownership, so io_poll_check_events() should retry. We want to make poll_refs more robust against overflows. Instead of always incrementing it, which covers two purposes with one atomic, check if poll_refs is elevated enough and if so set a retry flag without attempts to grab ownership. The gap between the bias check and following atomics may seem racy, but we don't need it to be strict. Moreover there might only be maximum 4 parallel updates: by the first and the second poll entries, __io_arm_poll_handler() and cancellation. From those four, only poll wake ups may be executed multiple times, but they're protected by a spin. Cc: stable@vger.kernel.org Reported-by: Lin Ma <linma@zju.edu.cn> Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c762bc31f8683b3270f3587691348a7119ef9c9d.1668963050.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
1d58849ac2 |
io_uring: cmpxchg for poll arm refs release
[ upstream commit 2f3893437a4ebf2e892ca172e9e122841319d675 ] Replace atomically substracting the ownership reference at the end of arming a poll with a cmpxchg. We try to release ownership by setting 0 assuming that poll_refs didn't change while we were arming. If it did change, we keep the ownership and use it to queue a tw, which is fully capable to process all events and (even tolerates spurious wake ups). It's a bit more elegant as we reduce races b/w setting the cancellation flag and getting refs with this release, and with that we don't have to worry about any kinds of underflows. It's not the fastest path for polling. The performance difference b/w cmpxchg and atomic dec is usually negligible and it's not the fastest path. Cc: stable@vger.kernel.org Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0c95251624397ea6def568ff040cad2d7926fd51.1668963050.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
cd1981a8c3 |
io_uring: fix tw losing poll events
[ upstream commit 539bcb57da2f58886d7d5c17134236b0ec9cd15d ] We may never try to process a poll wake and its mask if there was multiple wake ups racing for queueing up a tw. Force io_poll_check_events() to update the mask by vfs_poll(). Cc: stable@vger.kernel.org Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/00344d60f8b18907171178d7cf598de71d127b0b.1668710222.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
62321dc7b0 |
io_uring: update res mask in io_poll_check_events
[ upstream commit b98186aee22fa593bc8c6b2c5d839c2ee518bc8c ] When io_poll_check_events() collides with someone attempting to queue a task work, it'll spin for one more time. However, it'll continue to use the mask from the first iteration instead of updating it. For example, if the first wake up was a EPOLLIN and the second EPOLLOUT, the userspace will not get EPOLLOUT in time. Clear the mask for all subsequent iterations to force vfs_poll(). Cc: stable@vger.kernel.org Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2dac97e8f691231049cb259c4ae57e79e40b537c.1668710222.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
417d5ea6e7 |
tracing: Free buffers when a used dynamic event is removed
commit 4313e5a613049dfc1819a6dfb5f94cf2caff9452 upstream.
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.
The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.
To show how this can be an issue, the following can crash the kernel:
# cd /sys/kernel/tracing
# for i in `seq 65536`; do
echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
# done
For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.
Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.
# echo 1 > kprobes/foo/enable
# cat /etc/passwd > /dev/null
# cat trace
cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
# echo 0 > kprobes/foo/enable
Now if we delete the kprobe and create a new one that reads a string:
# echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events
And now we can the trace:
# cat trace
sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U
And dmesg has:
==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049
CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
print_report+0x17f/0x47b
kasan_report+0xad/0x130
string+0xd4/0x1c0
vsnprintf+0x500/0x840
seq_buf_vprintf+0x62/0xc0
trace_seq_printf+0x10e/0x1e0
print_type_string+0x90/0xa0
print_kprobe_event+0x16b/0x290
print_trace_line+0x451/0x8e0
s_show+0x72/0x1f0
seq_read_iter+0x58e/0x750
seq_read+0x115/0x160
vfs_read+0x11d/0x460
ksys_read+0xa9/0x130
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc2e972ade2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
</TASK>
The buggy address belongs to the physical page:
page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).
As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.
If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on:
|
||
|
|
52fc245d15 |
tracing: Fix race where histograms can be called before the event
commit ef38c79a522b660f7f71d45dad2d6244bc741841 upstream.
commit 94eedf3dded5 ("tracing: Fix race where eprobes can be called before
the event") fixed an issue where if an event is soft disabled, and the
trigger is being added, there's a small window where the event sees that
there's a trigger but does not see that it requires reading the event yet,
and then calls the trigger with the record == NULL.
This could be solved with adding memory barriers in the hot path, or to
make sure that all the triggers requiring a record check for NULL. The
latter was chosen.
Commit 94eedf3dded5 set the eprobe trigger handle to check for NULL, but
the same needs to be done with histograms.
Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
cb2b0612cd |
tracing/osnoise: Fix duration type
commit 022632f6c43a86f2135642dccd5686de318e861d upstream.
The duration type is a 64 long value, not an int. This was
causing some long noise to report wrong values.
Change the duration to a 64 bits value.
Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes:
|
||
|
|
615a996ff3 |
drm/i915: Never return 0 if not all requests retired
commit 12b8b046e4c9de40fa59b6f067d6826f4e688f68 upstream.
Users of intel_gt_retire_requests_timeout() expect 0 return value on
success. However, we have no protection from passing back 0 potentially
returned by a call to dma_fence_wait_timeout() when it succedes right
after its timeout has expired.
Replace 0 with -ETIME before potentially using the timeout value as return
code, so -ETIME is returned if there are still some requests not retired
after timeout, 0 otherwise.
v3: Use conditional expression, more compact but also better reflecting
intention standing behind the change.
v2: Move the added lines down so flush_submission() is not affected.
Fixes:
|
||
|
|
01a2b25ef2 |
drm/i915: Fix negative value passed as remaining time
commit a8899b8728013c7b2456f0bfa20e5fea85ee0fd1 upstream. Commit |
||
|
|
ff1591ba33 |
drm/amdgpu: enable Vangogh VCN indirect sram mode
commit 9a8cc8cabc1e351614fd7f9e774757a5143b6fe8 upstream.
So that uses PSP to initialize HW.
Fixes:
|
||
|
|
ac2d7fa908 |
drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
commit 6f6cb1714365a07dbc66851879538df9f6969288 upstream.
Patch series "Fix a bunch of allmodconfig errors", v2.
Since
|
||
|
|
57ee7bc4c6 |
mmc: sdhci: Fix voltage switch delay
commit c981cdfb9925f64a364f13c2b4f98f877308a408 upstream. Commit |
||
|
|
bb8f809514 |
mmc: sdhci-sprd: Fix no reset data and command after voltage switch
commit dd30dcfa7a74a06f8dcdab260d8d5adf32f17333 upstream.
After switching the voltage, no reset data and command will cause
CMD2 timeout.
Fixes:
|
||
|
|
4c7681c1a5 |
mmc: sdhci-esdhc-imx: correct CQHCI exit halt state check
commit a3cab1d2132474969871b5d7f915c5c0167b48b0 upstream.
With the current logic the "failed to exit halt state" error would be
shown even if any other bit than CQHCI_HALT was set in the CQHCI_CTL
register, since the right hand side is always true. Fix this by using
the correct operator (bit-wise instead of logical AND) to only check for
the halt bit flag, which was obviously intended here.
Fixes:
|
||
|
|
01dbe4db59 |
mmc: core: Fix ambiguous TRIM and DISCARD arg
commit 489d144563f23911262a652234b80c70c89c978b upstream.
Clean up the MMC_TRIM_ARGS define that became ambiguous with DISCARD
introduction. While at it, let's fix one usage where MMC_TRIM_ARGS falsely
included DISCARD too.
Fixes:
|
||
|
|
738946e355 |
mmc: mmc_test: Fix removal of debugfs file
commit f4307b4df1c28842bb1950ff0e1b97e17031b17f upstream.
In __mmc_test_register_dbgfs_file(), we need to assign 'file', as it's
being used when removing the debugfs files when the mmc_test module is
removed.
Fixes:
|
||
|
|
635d051734 |
net: stmmac: Set MAC's flow control register to reflect current settings
commit cc3d2b5fc0d6f8ad8a52da5ea679e5c2ec2adbd4 upstream.
Currently, pause frame register GMAC_RX_FLOW_CTRL_RFE is not updated
correctly when 'ethtool -A <IFACE> autoneg off rx off tx off' command
is issued. This fix ensures the flow control change is reflected directly
in the GMAC_RX_FLOW_CTRL_RFE register.
Fixes:
|
||
|
|
9132dcdf3b |
v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails
commit 6647e76ab623b2b3fb2efe03a86e9c9046c52c33 upstream.
The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be
used (and is discouraged for any modern v4l drivers). And Seth Jenkins
points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy
and dangerous.
Note that it's not even a case that should trigger, since any normal
user pointer logic ends up just using the pin_user_pages_fast() call
that does the proper page reference counting. That's not the problem
case, only if you try to use special device mappings do you have any
issues.
Normally I'd just remove this during the merge window, but since Seth
pointed out the problem cases, we really want to know as soon as
possible if there are actually any users of this odd special case of a
legacy interface. Neither Hans nor Mauro seem to think that such
mis-uses of the old legacy interface should exist. As Mauro says:
"See, V4L2 has actually 4 streaming APIs:
- Kernel-allocated mmap (usually referred simply as just mmap);
- USERPTR mmap;
- read();
- dmabuf;
The USERPTR is one of the oldest way to use it, coming from V4L
version 1 times, and by far the least used one"
And Hans chimed in on the USERPTR interface:
"To be honest, I wouldn't mind if it goes away completely, but that's a
bit of a pipe dream right now"
but while removing this legacy interface entirely may be a pipe dream we
can at least try to remove the unlikely (and actively broken) case of
using special device mappings for USERPTR accesses.
This replaces it with a WARN_ONCE() that we can remove once we've
hopefully confirmed that no actual users exist.
NOTE! Longer term, this means that a 'struct frame_vector' only ever
contains proper page pointers, and all the games we have with converting
them to pages can go away (grep for 'frame_vector_to_pages()' and the
uses of 'vec->is_pfns'). But this is just the first step, to verify
that this code really is all dead, and do so as quickly as possible.
Reported-by: Seth Jenkins <sethjenkins@google.com>
Acked-by: Hans Verkuil <hverkuil@xs4all.nl>
Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
76ad884be0 |
pinctrl: intel: Save and restore pins in "direct IRQ" mode
commit 6989ea4881c8944fbf04378418bb1af63d875ef8 upstream. The firmware on some systems may configure GPIO pins to be an interrupt source in so called "direct IRQ" mode. In such cases the GPIO controller driver has no idea if those pins are being used or not. At the same time, there is a known bug in the firmwares that don't restore the pin settings correctly after suspend, i.e. by an unknown reason the Rx value becomes inverted. Hence, let's save and restore the pins that are configured as GPIOs in the input mode with GPIROUTIOXAPIC bit set. Cc: stable@vger.kernel.org Reported-and-tested-by: Dale Smith <dalepsmith@gmail.com> Reported-and-tested-by: John Harris <jmharris@gmail.com> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=214749 Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://lore.kernel.org/r/20221124222926.72326-1-andriy.shevchenko@linux.intel.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
41296b85fa |
x86/bugs: Make sure MSR_SPEC_CTRL is updated properly upon resume from S3
commit 66065157420c5b9b3f078f43d313c153e1ff7f83 upstream.
The "force" argument to write_spec_ctrl_current() is currently ambiguous
as it does not guarantee the MSR write. This is due to the optimization
that writes to the MSR happen only when the new value differs from the
cached value.
This is fine in most cases, but breaks for S3 resume when the cached MSR
value gets out of sync with the hardware MSR value due to S3 resetting
it.
When x86_spec_ctrl_current is same as x86_spec_ctrl_base, the MSR write
is skipped. Which results in SPEC_CTRL mitigations not getting restored.
Move the MSR write from write_spec_ctrl_current() to a new function that
unconditionally writes to the MSR. Update the callers accordingly and
rename functions.
[ bp: Rework a bit. ]
Fixes: caa0ff24d5d0 ("x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value")
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/806d39b0bfec2fe8f50dc5446dff20f5bb24a959.1669821572.git.pawan.kumar.gupta@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
33021419fd |
nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
commit f0a0ccda18d6fd826d7c7e7ad48a6ed61c20f8b4 upstream. Syzbot reported a null-ptr-deref bug: NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 1 PID: 3603 Comm: segctord Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0 fs/nilfs2/alloc.c:608 Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7 RSP: 0018:ffffc90003dff830 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010 RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158 R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0 Call Trace: <TASK> nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline] nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193 nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236 nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940 nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline] nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline] nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018 nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline] nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045 nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline] nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> ... If DAT metadata file is corrupted on disk, there is a case where req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during a b-tree operation that cascadingly updates ancestor nodes of the b-tree, because nilfs_dat_commit_alloc() for a lower level block can initialize the blocknr on the same DAT entry between nilfs_dat_prepare_end() and nilfs_dat_commit_end(). If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free() without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and causes the NULL pointer dereference above in nilfs_palloc_commit_free_entry() function, which leads to a crash. Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free(). This also calls nilfs_error() in that case to notify that there is a fatal flaw in the filesystem metadata and prevent further operations. Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |