Commit Graph

38569 Commits

Author SHA1 Message Date
Connor O'Brien
31f50e8abc FROMGIT: bpf: btf: limit logging of ignored BTF mismatches
Enabling CONFIG_MODULE_ALLOW_BTF_MISMATCH is an indication that BTF
mismatches are expected and module loading should proceed
anyway. Logging with pr_warn() on every one of these "benign"
mismatches creates unnecessary noise when many such modules are
loaded. Instead, handle this case with a single log warning that BTF
info may be unavailable.

Mismatches also result in calls to __btf_verifier_log() via
__btf_verifier_log_type() or btf_verifier_log_member(), adding several
additional lines of logging per mismatched module. Add checks to these
paths to skip logging for module BTF mismatches in the "allow
mismatch" case.

All existing logging behavior is preserved in the default
CONFIG_MODULE_ALLOW_BTF_MISMATCH=n case.

Signed-off-by: Connor O'Brien <connoro@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230107025331.3240536-1-connoro@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Bug: 260025057
Bug: 256578296
Bug: 258989318
(cherry picked from commit 9cb61e50bf6bf54db712bba6cf20badca4383f96
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
master)
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: I2f9d19a2027610af2e825fc38fd08f56f4cd7091
2023-01-17 23:18:26 +00:00
Minchan Kim
eebff8eab2 FROMLIST: mm: cma: introduce gfp flag in cma_alloc instead of no_warn
The upcoming patch will introduce __GFP_NORETRY semantic
in alloc_contig_range which is a failfast mode of the API.
Instead of adding a additional parameter for gfp, replace
no_warn with gfp flag.

To keep old behaviors, it follows the rule below.

  no_warn 			gfp_flags

  false         		GFP_KERNEL
  true          		GFP_KERNEL|__GFP_NOWARN
  gfp & __GFP_NOWARN		GFP_KERNEL | (gfp & __GFP_NOWARN)

Bug: 170340257
Bug: 120293424
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m36b144ff81fe0a8f0ecaf6813de4819ecc41f8fe
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I1ce020ab5d5fff34eb6464be4632ddef72fb43eb
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 23ba990a3e)
2023-01-05 02:16:50 +00:00
Minchan Kim
abafa1328b BACKPORT: locking: Add missing __sched attributes
This patch adds __sched attributes to a few missing places
to show blocked function rather than locking function
in get_wchan.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220115231657.84828-1-minchan@kernel.org

Conflicts:
	kernel/locking/percpu-rwsem.c

1. conflict <linux/sched/debug.h>

Bug: 228243692
Change-Id: Ifb50c13cfdd7484269d9a291a8da515e1cce6a7b
(cherry picked from commit c441e934b604a3b5f350a9104124cf6a3ba07a34)
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit da358e264cbed0240debe37fa6867c81f0748bf8)
2023-01-04 02:28:50 +00:00
Greg Kroah-Hartman
5a91f1aa85 Merge 5.15.84 into android14-5.15
Changes in 5.15.84
	x86/vdso: Conditionally export __vdso_sgx_enter_enclave()
	vfs: fix copy_file_range() averts filesystem freeze protection
	nfp: fix use-after-free in area_cache_get()
	ASoC: fsl_micfil: explicitly clear software reset bit
	ASoC: fsl_micfil: explicitly clear CHnF flags
	ASoC: ops: Check bounds for second channel in snd_soc_put_volsw_sx()
	libbpf: Use page size as max_entries when probing ring buffer map
	pinctrl: meditatek: Startup with the IRQs disabled
	can: sja1000: fix size of OCR_MODE_MASK define
	can: mcba_usb: Fix termination command argument
	net: fec: don't reset irq coalesce settings to defaults on "ip link up"
	ASoC: cs42l51: Correct PGA Volume minimum value
	perf: Fix perf_pending_task() UaF
	nvme-pci: clear the prp2 field when not used
	ASoC: ops: Correct bounds check for second channel on SX controls
	net: fec: properly guard irq coalesce setup
	Linux 5.15.84

Change-Id: I34ef5e73fca9da9a77c89b1f0c7ad4af37b63a79
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-12-19 17:29:08 +01:00
Ramji Jiyani
a6eaf3db80 ANDROID: GKI: Only protect exports if KMI symbols are present
Only enforce export protection if there are symbols in the
unprotected list for the Kernel Module Interface (KMI).

This is only relevant for targets like arm64 that have
defined ABI symbol lists. This allows non-GKI targets
like arm and x86 to continue using GKI source code
without disabling the feature for those targets.

Bug: 232430739
Test: TH
Fixes: fd1e768866 ("ANDROID: GKI: Protect exports of protected GKI modules")
Change-Id: Ie89e8f63eda99d9b7aacd1bb76d036b3ff4ba37c
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
2022-12-19 16:16:00 +00:00
Peter Zijlstra
8bffa95ac1 perf: Fix perf_pending_task() UaF
[ Upstream commit 517e6a301f34613bff24a8e35b5455884f2d83d8 ]

Per syzbot it is possible for perf_pending_task() to run after the
event is free()'d. There are two related but distinct cases:

 - the task_work was already queued before destroying the event;
 - destroying the event itself queues the task_work.

The first cannot be solved using task_work_cancel() since
perf_release() itself might be called from a task_work (____fput),
which means the current->task_works list is already empty and
task_work_cancel() won't be able to find the perf_pending_task()
entry.

The simplest alternative is extending the perf_event lifetime to cover
the task_work.

The second is just silly, queueing a task_work while you know the
event is going away makes no sense and is easily avoided by
re-arranging how the event is marked STATE_DEAD and ensuring it goes
through STATE_OFF on the way down.

Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-19 12:36:43 +01:00
Ramji Jiyani
fd1e768866 ANDROID: GKI: Protect exports of protected GKI modules
Implement support for protecting the exported symbols of
protected GKI modules.

Only signed GKI modules are permitted to export symbols
listed in the android/abi_gki_protected_exports file.
Attempting to export these symbols from an unsigned module
will result in the module failing to load, with a
'Permission denied' error message.

Bug: 232430739
Test: TH
Change-Id: I3e8b330938e116bb2e022d356ac0d55108a84a01
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
2022-12-16 16:44:54 +00:00
Greg Kroah-Hartman
63696badd9 Merge 5.15.83 into android14-5.15
Changes in 5.15.83
	clk: generalize devm_clk_get() a bit
	clk: Provide new devm_clk helpers for prepared and enabled clocks
	mmc: mtk-sd: Fix missing clk_disable_unprepare in msdc_of_clock_parse()
	arm64: dts: rockchip: keep I2S1 disabled for GPIO function on ROCK Pi 4 series
	arm: dts: rockchip: fix node name for hym8563 rtc
	arm: dts: rockchip: remove clock-frequency from rtc
	ARM: dts: rockchip: fix ir-receiver node names
	arm64: dts: rockchip: fix ir-receiver node names
	ARM: dts: rockchip: rk3188: fix lcdc1-rgb24 node name
	fs: use acquire ordering in __fget_light()
	ARM: 9251/1: perf: Fix stacktraces for tracepoint events in THUMB2 kernels
	ARM: 9266/1: mm: fix no-MMU ZERO_PAGE() implementation
	ASoC: wm8962: Wait for updated value of WM8962_CLOCKING1 register
	spi: mediatek: Fix DEVAPC Violation at KO Remove
	ARM: dts: rockchip: disable arm_global_timer on rk3066 and rk3188
	ASoC: rt711-sdca: fix the latency time of clock stop prepare state machine transitions
	9p/fd: Use P9_HDRSZ for header size
	regulator: slg51000: Wait after asserting CS pin
	ALSA: seq: Fix function prototype mismatch in snd_seq_expand_var_event
	selftests/net: Find nettest in current directory
	btrfs: send: avoid unaligned encoded writes when attempting to clone range
	ASoC: soc-pcm: Add NULL check in BE reparenting
	regulator: twl6030: fix get status of twl6032 regulators
	fbcon: Use kzalloc() in fbcon_prepare_logo()
	usb: dwc3: gadget: Disable GUSB2PHYCFG.SUSPHY for End Transfer
	9p/xen: check logical size for buffer size
	net: usb: qmi_wwan: add u-blox 0x1342 composition
	mm/khugepaged: take the right locks for page table retraction
	mm/khugepaged: fix GUP-fast interaction by sending IPI
	mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
	rtc: mc146818-lib: extract mc146818_avoid_UIP
	rtc: cmos: avoid UIP when writing alarm time
	rtc: cmos: avoid UIP when reading alarm time
	cifs: fix use-after-free caused by invalid pointer `hostname`
	drm/bridge: anx7625: Fix edid_read break case in sp_tx_edid_read()
	xen/netback: Ensure protocol headers don't fall in the non-linear area
	xen/netback: do some code cleanup
	xen/netback: don't call kfree_skb() with interrupts disabled
	media: videobuf2-core: take mmap_lock in vb2_get_unmapped_area()
	soundwire: intel: Initialize clock stop timeout
	Revert "ARM: dts: imx7: Fix NAND controller size-cells"
	media: v4l2-dv-timings.c: fix too strict blanking sanity checks
	memcg: fix possible use-after-free in memcg_write_event_control()
	mm/gup: fix gup_pud_range() for dax
	Bluetooth: btusb: Add debug message for CSR controllers
	Bluetooth: Fix crash when replugging CSR fake controllers
	net: mana: Fix race on per-CQ variable napi work_done
	KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) field
	drm/vmwgfx: Don't use screen objects when SEV is active
	drm/amdgpu/sdma_v4_0: turn off SDMA ring buffer in the s2idle suspend
	drm/shmem-helper: Remove errant put in error path
	drm/shmem-helper: Avoid vm_open error paths
	net: dsa: sja1105: avoid out of bounds access in sja1105_init_l2_policing()
	HID: usbhid: Add ALWAYS_POLL quirk for some mice
	HID: hid-lg4ff: Add check for empty lbuf
	HID: core: fix shift-out-of-bounds in hid_report_raw_event
	HID: ite: Enable QUIRK_TOUCHPAD_ON_OFF_REPORT on Acer Aspire Switch V 10
	can: af_can: fix NULL pointer dereference in can_rcv_filter
	clk: Fix pointer casting to prevent oops in devm_clk_release()
	gpiolib: improve coding style for local variables
	gpiolib: check the 'ngpios' property in core gpiolib code
	gpiolib: fix memory leak in gpiochip_setup_dev()
	netfilter: nft_set_pipapo: Actually validate intervals in fields after the first one
	drm/vmwgfx: Fix race issue calling pin_user_pages
	ieee802154: cc2520: Fix error return code in cc2520_hw_init()
	ca8210: Fix crash by zero initializing data
	netfilter: ctnetlink: fix compilation warning after data race fixes in ct mark
	drm/bridge: ti-sn65dsi86: Fix output polarity setting bug
	gpio: amd8111: Fix PCI device reference count leak
	e1000e: Fix TX dispatch condition
	igb: Allocate MSI-X vector when testing
	net: broadcom: Add PTP_1588_CLOCK_OPTIONAL dependency for BCMGENET under ARCH_BCM2835
	drm: bridge: dw_hdmi: fix preference of RGB modes over YUV420
	af_unix: Get user_ns from in_skb in unix_diag_get_exact().
	vmxnet3: correctly report encapsulated LRO packet
	vmxnet3: use correct intrConf reference when using extended queues
	Bluetooth: 6LoWPAN: add missing hci_dev_put() in get_l2cap_conn()
	Bluetooth: Fix not cleanup led when bt_init fails
	net: dsa: ksz: Check return value
	net: dsa: hellcreek: Check return value
	net: dsa: sja1105: Check return value
	selftests: rtnetlink: correct xfrm policy rule in kci_test_ipsec_offload
	mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add()
	net: encx24j600: Add parentheses to fix precedence
	net: encx24j600: Fix invalid logic in reading of MISTAT register
	net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
	net: mdiobus: fix double put fwnode in the error path
	octeontx2-pf: Fix potential memory leak in otx2_init_tc()
	xen-netfront: Fix NULL sring after live migration
	net: mvneta: Prevent out of bounds read in mvneta_config_rss()
	i40e: Fix not setting default xps_cpus after reset
	i40e: Fix for VF MAC address 0
	i40e: Disallow ip4 and ip6 l4_4_bytes
	NFC: nci: Bounds check struct nfc_target arrays
	nvme initialize core quirks before calling nvme_init_subsystem
	gpio/rockchip: fix refcount leak in rockchip_gpiolib_register()
	net: stmmac: fix "snps,axi-config" node property parsing
	ip_gre: do not report erspan version on GRE interface
	net: microchip: sparx5: Fix missing destroy_workqueue of mact_queue
	net: thunderx: Fix missing destroy_workqueue of nicvf_rx_mode_wq
	net: hisilicon: Fix potential use-after-free in hisi_femac_rx()
	net: mdio: fix unbalanced fwnode reference count in mdio_device_release()
	net: hisilicon: Fix potential use-after-free in hix5hd2_rx()
	tipc: Fix potential OOB in tipc_link_proto_rcv()
	ipv4: Fix incorrect route flushing when source address is deleted
	ipv4: Fix incorrect route flushing when table ID 0 is used
	net: dsa: sja1105: fix memory leak in sja1105_setup_devlink_regions()
	tipc: call tipc_lxc_xmit without holding node_read_lock
	ethernet: aeroflex: fix potential skb leak in greth_init_rings()
	dpaa2-switch: Fix memory leak in dpaa2_switch_acl_entry_add() and dpaa2_switch_acl_entry_remove()
	xen/netback: fix build warning
	net: phy: mxl-gpy: fix version reporting
	net: plip: don't call kfree_skb/dev_kfree_skb() under spin_lock_irq()
	ipv6: avoid use-after-free in ip6_fragment()
	net: thunderbolt: fix memory leak in tbnet_open()
	net: mvneta: Fix an out of bounds check
	macsec: add missing attribute validation for offload
	s390/qeth: fix various format strings
	s390/qeth: fix use-after-free in hsci
	can: esd_usb: Allow REC and TEC to return to zero
	block: move CONFIG_BLOCK guard to top Makefile
	io_uring: move to separate directory
	io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()
	Linux 5.15.83

Change-Id: I08ef74d6ad8786c191050294dcbf1090908e7c4d
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-12-14 13:04:18 +01:00
Jens Axboe
f435c66d23 io_uring: move to separate directory
[ Upstream commit ed29b0b4fd835b058ddd151c49d021e28d631ee6 ]

In preparation for splitting io_uring up a bit, move it into its own
top level directory. It didn't really belong in fs/ anyway, as it's
not a file system only API.

This adds io_uring/ and moves the core files in there, and updates the
MAINTAINERS file for the new location.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 998b30c3948e ("io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14 11:37:31 +01:00
Tejun Heo
aad8bbd17a memcg: fix possible use-after-free in memcg_write_event_control()
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream.

memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call.  As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file.  Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a8747 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses.  The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through.  With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft().  Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection.  Instead, let's check the superblock
and dentry type.

Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
Fixes: 347c4a8747 ("memcg: remove cgroup_event->cft")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>	[3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-14 11:37:19 +01:00
Rick Yiu
757d548ca6 ANDROID: kernel: sched: Export reweight_task
Export reweight_task for vendor usage when they are trying to manipulate
task prio. After the prio changed, it will need to update its load
weight to take effect. Therefore, this function needs to be called
from vendor kernel module. It could be used with
trace_android_rvh_set_user_nice and trace_android_rvh_setscheduler.

Bug: 245675204
Change-Id: I0033518bf1cbd0a8129795743b95340f439d5fe8
Signed-off-by: Rick Yiu <rickyiu@google.com>
(cherry picked from commit db144888f871fa149c0d18844faa971a3bf8f4fc)
2022-12-09 08:34:21 +00:00
Greg Kroah-Hartman
b3d82fd3de Merge 5.15.82 into android14-5.15
Changes in 5.15.82
	arm64: mte: Avoid setting PG_mte_tagged if no tags cleared or restored
	drm/i915: Create a dummy object for gen6 ppgtt
	drm/i915/gt: Use i915_vm_put on ppgtt_create error paths
	erofs: fix order >= MAX_ORDER warning due to crafted negative i_size
	btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino
	btrfs: free btrfs_path before copying inodes to userspace
	spi: spi-imx: Fix spi_bus_clk if requested clock is higher than input clock
	btrfs: move QUOTA_ENABLED check to rescan_should_stop from btrfs_qgroup_rescan_worker
	btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()
	drm/display/dp_mst: Fix drm_dp_mst_add_affected_dsc_crtcs() return code
	drm/amdgpu: update drm_display_info correctly when the edid is read
	drm/amdgpu: Partially revert "drm/amdgpu: update drm_display_info correctly when the edid is read"
	iio: health: afe4403: Fix oob read in afe4403_read_raw
	iio: health: afe4404: Fix oob read in afe4404_[read|write]_raw
	iio: light: rpr0521: add missing Kconfig dependencies
	bpf, perf: Use subprog name when reporting subprog ksymbol
	scripts/faddr2line: Fix regression in name resolution on ppc64le
	ARM: at91: rm9200: fix usb device clock id
	libbpf: Handle size overflow for ringbuf mmap
	hwmon: (ltc2947) fix temperature scaling
	hwmon: (ina3221) Fix shunt sum critical calculation
	hwmon: (i5500_temp) fix missing pci_disable_device()
	hwmon: (ibmpex) Fix possible UAF when ibmpex_register_bmc() fails
	bpf: Do not copy spin lock field from user in bpf_selem_alloc
	nvmem: rmem: Fix return value check in rmem_read()
	of: property: decrement node refcount in of_fwnode_get_reference_args()
	ixgbevf: Fix resource leak in ixgbevf_init_module()
	i40e: Fix error handling in i40e_init_module()
	fm10k: Fix error handling in fm10k_init_module()
	iavf: remove redundant ret variable
	iavf: Fix error handling in iavf_init_module()
	e100: Fix possible use after free in e100_xmit_prepare
	net/mlx5: DR, Rename list field in matcher struct to list_node
	net/mlx5: DR, Fix uninitialized var warning
	net/mlx5: Fix uninitialized variable bug in outlen_write()
	net/mlx5e: Fix use-after-free when reverting termination table
	can: sja1000_isa: sja1000_isa_probe(): add missing free_sja1000dev()
	can: cc770: cc770_isa_probe(): add missing free_cc770dev()
	can: etas_es58x: es58x_init_netdev(): free netdev when register_candev()
	can: m_can: pci: add missing m_can_class_free_dev() in probe/remove methods
	can: m_can: Add check for devm_clk_get
	qlcnic: fix sleep-in-atomic-context bugs caused by msleep
	aquantia: Do not purge addresses when setting the number of rings
	wifi: cfg80211: fix buffer overflow in elem comparison
	wifi: cfg80211: don't allow multi-BSSID in S1G
	wifi: mac8021: fix possible oob access in ieee80211_get_rate_duration
	net: phy: fix null-ptr-deref while probe() failed
	net: ethernet: ti: am65-cpsw: fix error handling in am65_cpsw_nuss_probe()
	net: net_netdev: Fix error handling in ntb_netdev_init_module()
	net/9p: Fix a potential socket leak in p9_socket_open
	net: ethernet: nixge: fix NULL dereference
	net: wwan: iosm: fix kernel test robot reported error
	net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
	dsa: lan9303: Correct stat name
	tipc: re-fetch skb cb after tipc_msg_validate
	net: hsr: Fix potential use-after-free
	net: mdiobus: fix unbalanced node reference count
	afs: Fix fileserver probe RTT handling
	net: tun: Fix use-after-free in tun_detach()
	packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
	sctp: fix memory leak in sctp_stream_outq_migrate()
	net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
	hwmon: (coretemp) Check for null before removing sysfs attrs
	hwmon: (coretemp) fix pci device refcount leak in nv1a_ram_new()
	riscv: vdso: fix section overlapping under some conditions
	riscv: mm: Proper page permissions after initmem free
	ALSA: dice: fix regression for Lexicon I-ONIX FW810S
	error-injection: Add prompt for function error injection
	tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
	nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
	x86/bugs: Make sure MSR_SPEC_CTRL is updated properly upon resume from S3
	pinctrl: intel: Save and restore pins in "direct IRQ" mode
	v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails
	net: stmmac: Set MAC's flow control register to reflect current settings
	mmc: mmc_test: Fix removal of debugfs file
	mmc: core: Fix ambiguous TRIM and DISCARD arg
	mmc: sdhci-esdhc-imx: correct CQHCI exit halt state check
	mmc: sdhci-sprd: Fix no reset data and command after voltage switch
	mmc: sdhci: Fix voltage switch delay
	drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
	drm/amdgpu: enable Vangogh VCN indirect sram mode
	drm/i915: Fix negative value passed as remaining time
	drm/i915: Never return 0 if not all requests retired
	tracing/osnoise: Fix duration type
	tracing: Fix race where histograms can be called before the event
	tracing: Free buffers when a used dynamic event is removed
	io_uring: update res mask in io_poll_check_events
	io_uring: fix tw losing poll events
	io_uring: cmpxchg for poll arm refs release
	io_uring: make poll refs more robust
	io_uring/poll: fix poll_refs race with cancelation
	KVM: x86/mmu: Fix race condition in direct_page_fault
	ASoC: ops: Fix bounds check for _sx controls
	pinctrl: single: Fix potential division by zero
	riscv: Sync efi page table's kernel mappings before switching
	riscv: fix race when vmap stack overflow
	riscv: kexec: Fixup irq controller broken in kexec crash path
	nvme: fix SRCU protection of nvme_ns_head list
	iommu/vt-d: Fix PCI device refcount leak in has_external_pci()
	iommu/vt-d: Fix PCI device refcount leak in dmar_dev_scope_init()
	mm: __isolate_lru_page_prepare() in isolate_migratepages_block()
	mm: migrate: fix THP's mapcount on isolation
	parisc: Increase FRAME_WARN to 2048 bytes on parisc
	Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
	selftests: net: add delete nexthop route warning test
	selftests: net: fix nexthop warning cleanup double ip typo
	ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference
	ipv4: Fix route deletion when nexthop info is not specified
	serial: stm32: Factor out GPIO RTS toggling into separate function
	serial: stm32: Use TC interrupt to deassert GPIO RTS in RS485 mode
	serial: stm32: Deassert Transmit Enable on ->rs485_config()
	i2c: npcm7xx: Fix error handling in npcm_i2c_init()
	i2c: imx: Only DMA messages with I2C_M_DMA_SAFE flag set
	ACPI: HMAT: remove unnecessary variable initialization
	ACPI: HMAT: Fix initiator registration for single-initiator systems
	Revert "clocksource/drivers/riscv: Events are stopped during CPU suspend"
	char: tpm: Protect tpm_pm_suspend with locks
	Input: raydium_ts_i2c - fix memory leak in raydium_i2c_send()
	ipc/sem: Fix dangling sem_array access in semtimedop race
	proc: avoid integer type confusion in get_proc_long
	proc: proc_skip_spaces() shouldn't think it is working on C strings
	Linux 5.15.82

Change-Id: I1631261fcfd321674c546c41628bb3283e02f1a5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-12-08 16:43:35 +00:00
Linus Torvalds
48642f9431 proc: proc_skip_spaces() shouldn't think it is working on C strings
commit bce9332220bd677d83b19d21502776ad555a0e73 upstream.

proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.

Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace).  So use that as
inspiration, odd coding and all.

Now the calling convention actually makes sense and works for the
intended purpose.

Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:45 +01:00
Linus Torvalds
3eb9213f66 proc: avoid integer type confusion in get_proc_long
commit e6cfaf34be9fcd1a8285a294e18986bfc41a409c upstream.

proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length.  Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).

So do the proper test in the rigth type.

Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:45 +01:00
Steven Rostedt (Google)
417d5ea6e7 tracing: Free buffers when a used dynamic event is removed
commit 4313e5a613049dfc1819a6dfb5f94cf2caff9452 upstream.

After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.

The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.

To show how this can be an issue, the following can crash the kernel:

 # cd /sys/kernel/tracing
 # for i in `seq 65536`; do
     echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
 # done

For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.

Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.

 # echo 1 > kprobes/foo/enable
 # cat /etc/passwd > /dev/null
 # cat trace
             cat-2211    [005] ....  2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 # echo 0 > kprobes/foo/enable

Now if we delete the kprobe and create a new one that reads a string:

 # echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events

And now we can the trace:

 # cat trace
        sendmail-1942    [002] .....   530.136320: foo: (do_sys_openat2+0x0/0x240) arg1=             cat-2046    [004] .....   530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
            bash-1515    [007] .....   534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U

And dmesg has:

==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049

 CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x77
  print_report+0x17f/0x47b
  kasan_report+0xad/0x130
  string+0xd4/0x1c0
  vsnprintf+0x500/0x840
  seq_buf_vprintf+0x62/0xc0
  trace_seq_printf+0x10e/0x1e0
  print_type_string+0x90/0xa0
  print_kprobe_event+0x16b/0x290
  print_trace_line+0x451/0x8e0
  s_show+0x72/0x1f0
  seq_read_iter+0x58e/0x750
  seq_read+0x115/0x160
  vfs_read+0x11d/0x460
  ksys_read+0xa9/0x130
  do_syscall_64+0x3a/0x90
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7fc2e972ade2
 Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
 RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
 RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
 RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
  </TASK>

 The buggy address belongs to the physical page:
 page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 >ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                ^
  ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ==================================================================

This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).

As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.

If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.

Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/

Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Depends-on: e18eb8783ec49 ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: kernel test robot <yujie.liu@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:43 +01:00
Steven Rostedt (Google)
52fc245d15 tracing: Fix race where histograms can be called before the event
commit ef38c79a522b660f7f71d45dad2d6244bc741841 upstream.

commit 94eedf3dded5 ("tracing: Fix race where eprobes can be called before
the event") fixed an issue where if an event is soft disabled, and the
trigger is being added, there's a small window where the event sees that
there's a trigger but does not see that it requires reading the event yet,
and then calls the trigger with the record == NULL.

This could be solved with adding memory barriers in the hot path, or to
make sure that all the triggers requiring a record check for NULL. The
latter was chosen.

Commit 94eedf3dded5 set the eprobe trigger handle to check for NULL, but
the same needs to be done with histograms.

Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:43 +01:00
Daniel Bristot de Oliveira
cb2b0612cd tracing/osnoise: Fix duration type
commit 022632f6c43a86f2135642dccd5686de318e861d upstream.

The duration type is a 64 long value, not an int. This was
causing some long noise to report wrong values.

Change the duration to a 64 bits value.

Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org

Cc: stable@vger.kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-08 11:28:43 +01:00
Xu Kuohai
e376183167 bpf: Do not copy spin lock field from user in bpf_selem_alloc
[ Upstream commit 836e49e103dfeeff670c934b7d563cbd982fce87 ]

bpf_selem_alloc function is used by inode_storage, sk_storage and
task_storage maps to set map value, for these map types, there may
be a spin lock in the map value, so if we use memcpy to copy the whole
map value from user, the spin lock field may be initialized incorrectly.

Since the spin lock field is zeroed by kzalloc, call copy_map_value
instead of memcpy to skip copying the spin lock field to fix it.

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-08 11:28:39 +01:00
Hou Tao
ee3d37d796 bpf, perf: Use subprog name when reporting subprog ksymbol
[ Upstream commit 47df8a2f78bc34ff170d147d05b121f84e252b85 ]

Since commit bfea9a8574 ("bpf: Add name to struct bpf_ksym"), when
reporting subprog ksymbol to perf, prog name instead of subprog name is
used. The backtrace of bpf program with subprogs will be incorrect as
shown below:

  ffffffffc02deace bpf_prog_e44a3057dcb151f8_overwrite+0x66
  ffffffffc02de9f7 bpf_prog_e44a3057dcb151f8_overwrite+0x9f
  ffffffffa71d8d4e trace_call_bpf+0xce
  ffffffffa71c2938 perf_call_bpf_enter.isra.0+0x48

overwrite is the entry program and it invokes the overwrite_htab subprog
through bpf_loop, but in above backtrace, overwrite program just jumps
inside itself.

Fixing it by using subprog name when reporting subprog ksymbol. After
the fix, the output of perf script will be correct as shown below:

  ffffffffc031aad2 bpf_prog_37c0bec7d7c764a4_overwrite_htab+0x66
  ffffffffc031a9e7 bpf_prog_c7eb827ef4f23e71_overwrite+0x9f
  ffffffffa3dd8d4e trace_call_bpf+0xce
  ffffffffa3dc2938 perf_call_bpf_enter.isra.0+0x48

Fixes: bfea9a8574 ("bpf: Add name to struct bpf_ksym")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20221114095733.158588-1-houtao@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-08 11:28:38 +01:00
Qais Yousef
cd65f53a58 ANDROID: sched/uclamp: Don't enable uclamp_is_used static key by in-kernel requests
We do have now in-kernel users of uclamp to implement inheritance. The
static_branch_enable() path unconditionally holds the cpus_read_lock()
which might_sleep(). The path in binder that implements inheritance
happens from in_atomic() context which leads to a splat like this one:

	[  147.529960] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:56
	[  147.530196] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2586, name: RenderThread
	[  147.530410] INFO: lockdep is turned off.
	[  147.530518] Preemption disabled at:
	[  147.530521] [<ffffffc008ca2cec>] binder_proc_transaction+0x78/0x41c
	[  147.530793] CPU: 8 PID: 2586 Comm: RenderThread Tainted: G S      W  O      5.15.76-android14-5-00086-gc01afe5d262f #1
	[  147.531214] Call trace:
	[  147.531288] dump_backtrace+0xe8/0x134
	[  147.531444] show_stack+0x1c/0x4c
	[  147.531598] dump_stack_lvl+0x74/0x94
	[  147.531766] dump_stack+0x14/0x3c
	[  147.531920] ___might_sleep+0x210/0x230
	[  147.532094] __might_sleep+0x54/0x84
	[  147.532259] cpus_read_lock+0x2c/0x160
	[  147.532429] static_key_enable+0x1c/0x34
	[  147.532608] __sched_setscheduler+0x2a8/0x99c
	[  147.532802] sched_setattr_nocheck+0x1c/0x24
	[  147.532994] binder_do_set_priority+0x31c/0x4a4
	[  147.533195] binder_transaction_priority+0x200/0x3f4
	[  147.533413] binder_proc_transaction+0x220/0x41c
	[  147.533618] binder_transaction+0x1df0/0x234c
	[  147.533812] binder_thread_write+0xd84/0x2398
	[  147.534007] binder_ioctl_write_read+0x19c/0xb28
	[  147.534212] binder_ioctl+0x344/0x1a3c
	[  147.534382] __arm64_sys_ioctl+0x94/0xc8
	[  147.534561] invoke_syscall+0x44/0xf8
	[  147.534729] el0_svc_common+0xc8/0x10c
	[  147.534900] do_el0_svc+0x20/0x28
	[  147.535053] el0_svc+0x58/0xe0
	[  147.535198] el0t_64_sync_handler+0x7c/0xe4
	[  147.535386] el0t_64_sync+0x188/0x18c

Prevent enabling the lock for !user initiated sched_setattr()
operations. Generally we don't expect in-kernel uclamp users.

Bug: 259145692
Signed-off-by: Qais Yousef <qyousef@google.com>
Change-Id: Iac5be139b5ffd39f5e1c0431ce253133d81b98cf
2022-12-05 20:29:35 +00:00
Greg Kroah-Hartman
880a2689a3 Merge 5.15.81 into android14-5.15
Changes in 5.15.81
	ASoC: fsl_sai: use local device pointer
	ASoC: fsl_asrc fsl_esai fsl_sai: allow CONFIG_PM=N
	serial: Add rs485_supported to uart_port
	serial: fsl_lpuart: Fill in rs485_supported
	tty: serial: fsl_lpuart: don't break the on-going transfer when global reset
	sctp: remove the unnecessary sinfo_stream check in sctp_prsctp_prune_unsent
	sctp: clear out_curr if all frag chunks of current msg are pruned
	cifs: introduce new helper for cifs_reconnect()
	cifs: split out dfs code from cifs_reconnect()
	cifs: support nested dfs links over reconnect
	cifs: Fix connections leak when tlink setup failed
	ata: libata-scsi: simplify __ata_scsi_queuecmd()
	ata: libata-core: do not issue non-internal commands once EH is pending
	drm/display: Don't assume dual mode adaptors support i2c sub-addressing
	nvme: add a bogus subsystem NQN quirk for Micron MTFDKBA2T0TFH
	nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro
	nvme-pci: disable namespace identifiers for the MAXIO MAP1001
	nvme-pci: disable write zeroes on various Kingston SSD
	nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000
	iio: ms5611: Simplify IO callback parameters
	iio: pressure: ms5611: fixed value compensation bug
	ceph: do not update snapshot context when there is no new snapshot
	ceph: avoid putting the realm twice when decoding snaps fails
	x86/sgx: Create utility to validate user provided offset and length
	x86/sgx: Add overflow check in sgx_validate_offset_length()
	binder: validate alloc->mm in ->mmap() handler
	ceph: Use kcalloc for allocating multiple elements
	ceph: fix NULL pointer dereference for req->r_session
	wifi: mac80211: fix memory free error when registering wiphy fail
	wifi: mac80211_hwsim: fix debugfs attribute ps with rc table support
	riscv: dts: sifive unleashed: Add PWM controlled LEDs
	audit: fix undefined behavior in bit shift for AUDIT_BIT
	wifi: airo: do not assign -1 to unsigned char
	wifi: mac80211: Fix ack frame idr leak when mesh has no route
	wifi: ath11k: Fix QCN9074 firmware boot on x86
	spi: stm32: fix stm32_spi_prepare_mbr() that halves spi clk for every run
	selftests/bpf: Add verifier test for release_reference()
	Revert "net: macsec: report real_dev features when HW offloading is enabled"
	platform/x86: ideapad-laptop: Disable touchpad_switch
	platform/x86: touchscreen_dmi: Add info for the RCA Cambio W101 v2 2-in-1
	platform/x86/intel/pmt: Sapphire Rapids PMT errata fix
	platform/x86/intel/hid: Add some ACPI device IDs
	scsi: ibmvfc: Avoid path failures during live migration
	scsi: scsi_debug: Make the READ CAPACITY response compliant with ZBC
	drm: panel-orientation-quirks: Add quirk for Acer Switch V 10 (SW5-017)
	block, bfq: fix null pointer dereference in bfq_bio_bfqg()
	arm64/syscall: Include asm/ptrace.h in syscall_wrapper header.
	nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked
	Revert "drm/amdgpu: Revert "drm/amdgpu: getting fan speed pwm for vega10 properly""
	ALSA: usb-audio: add quirk to fix Hamedal C20 disconnect issue
	RISC-V: vdso: Do not add missing symbols to version section in linker script
	MIPS: pic32: treat port as signed integer
	xfrm: fix "disable_policy" on ipv4 early demux
	xfrm: replay: Fix ESN wrap around for GSO
	af_key: Fix send_acquire race with pfkey_register
	ARM: dts: am335x-pcm-953: Define fixed regulators in root node
	ASoC: hdac_hda: fix hda pcm buffer overflow issue
	ASoC: sgtl5000: Reset the CHIP_CLK_CTRL reg on remove
	ASoC: soc-pcm: Don't zero TDM masks in __soc_pcm_open()
	x86/hyperv: Restore VP assist page after cpu offlining/onlining
	scsi: storvsc: Fix handling of srb_status and capacity change events
	ASoC: max98373: Add checks for devm_kcalloc
	regulator: core: fix kobject release warning and memory leak in regulator_register()
	spi: dw-dma: decrease reference count in dw_spi_dma_init_mfld()
	regulator: core: fix UAF in destroy_regulator()
	bus: sunxi-rsb: Remove the shutdown callback
	bus: sunxi-rsb: Support atomic transfers
	tee: optee: fix possible memory leak in optee_register_device()
	ARM: dts: at91: sam9g20ek: enable udc vbus gpio pinctrl
	selftests: mptcp: more stable simult_flows tests
	selftests: mptcp: fix mibit vs mbit mix up
	net: liquidio: simplify if expression
	rxrpc: Allow list of in-use local UDP endpoints to be viewed in /proc
	rxrpc: Use refcount_t rather than atomic_t
	rxrpc: Fix race between conn bundle lookup and bundle removal [ZDI-CAN-15975]
	net: dsa: sja1105: disallow C45 transactions on the BASE-TX MDIO bus
	nfc/nci: fix race with opening and closing
	net: pch_gbe: fix potential memleak in pch_gbe_tx_queue()
	9p/fd: fix issue of list_del corruption in p9_fd_cancel()
	netfilter: conntrack: Fix data-races around ct mark
	netfilter: nf_tables: do not set up extensions for end interval
	iavf: Fix a crash during reset task
	iavf: Do not restart Tx queues after reset task failure
	iavf: Fix race condition between iavf_shutdown and iavf_remove
	ARM: mxs: fix memory leak in mxs_machine_init()
	ARM: dts: imx6q-prti6q: Fix ref/tcxo-clock-frequency properties
	net: ethernet: mtk_eth_soc: fix error handling in mtk_open()
	net/mlx4: Check retval of mlx4_bitmap_init
	net: mvpp2: fix possible invalid pointer dereference
	net/qla3xxx: fix potential memleak in ql3xxx_send()
	octeontx2-af: debugsfs: fix pci device refcount leak
	net: pch_gbe: fix pci device refcount leak while module exiting
	nfp: fill splittable of devlink_port_attrs correctly
	nfp: add port from netdev validation for EEPROM access
	macsec: Fix invalid error code set
	Drivers: hv: vmbus: fix double free in the error path of vmbus_add_channel_work()
	Drivers: hv: vmbus: fix possible memory leak in vmbus_device_register()
	netfilter: ipset: regression in ip_set_hash_ip.c
	net/mlx5: Do not query pci info while pci disabled
	net/mlx5: Fix FW tracer timestamp calculation
	net/mlx5: Fix handling of entry refcount when command is not issued to FW
	tipc: set con sock in tipc_conn_alloc
	tipc: add an extra conn_get in tipc_conn_alloc
	tipc: check skb_linearize() return value in tipc_disc_rcv()
	xfrm: Fix oops in __xfrm_state_delete()
	xfrm: Fix ignored return value in xfrm6_init()
	net: wwan: iosm: use ACPI_FREE() but not kfree() in ipc_pcie_read_bios_cfg()
	sfc: fix potential memleak in __ef100_hard_start_xmit()
	net: sparx5: fix error handling in sparx5_port_open()
	net: sched: allow act_ct to be built without NF_NAT
	NFC: nci: fix memory leak in nci_rx_data_packet()
	regulator: twl6030: re-add TWL6032_SUBCLASS
	bnx2x: fix pci device refcount leak in bnx2x_vf_is_pcie_pending()
	dma-buf: fix racing conflict of dma_heap_add()
	netfilter: ipset: restore allowing 64 clashing elements in hash:net,iface
	netfilter: flowtable_offload: add missing locking
	fs: do not update freeing inode i_io_list
	dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
	ipv4: Fix error return code in fib_table_insert()
	arcnet: fix potential memory leak in com20020_probe()
	s390/dasd: fix no record found for raw_track_access
	nfc: st-nci: fix incorrect validating logic in EVT_TRANSACTION
	nfc: st-nci: fix memory leaks in EVT_TRANSACTION
	nfc: st-nci: fix incorrect sizing calculations in EVT_TRANSACTION
	net: enetc: manage ENETC_F_QBV in priv->active_offloads only when enabled
	net: enetc: cache accesses to &priv->si->hw
	net: enetc: preserve TX ring priority across reconfiguration
	octeontx2-pf: Add check for devm_kcalloc
	octeontx2-af: Fix reference count issue in rvu_sdp_init()
	net: thunderx: Fix the ACPI memory leak
	s390/crashdump: fix TOD programmable field size
	lib/vdso: use "grep -E" instead of "egrep"
	init/Kconfig: fix CC_HAS_ASM_GOTO_TIED_OUTPUT test with dash
	nios2: add FORCE for vmlinuz.gz
	mmc: sdhci-brcmstb: Re-organize flags
	mmc: sdhci-brcmstb: Enable Clock Gating to save power
	mmc: sdhci-brcmstb: Fix SDHCI_RESET_ALL for CQHCI
	KVM: arm64: pkvm: Fixup boot mode to reflect that the kernel resumes from EL1
	usb: dwc3: exynos: Fix remove() function
	usb: cdnsp: Fix issue with Clear Feature Halt Endpoint
	usb: cdnsp: fix issue with ZLP - added TD_SIZE = 1
	ext4: fix use-after-free in ext4_ext_shift_extents
	arm64: dts: rockchip: lower rk3399-puma-haikou SD controller clock frequency
	iio: light: apds9960: fix wrong register for gesture gain
	iio: core: Fix entry not deleted when iio_register_sw_trigger_type() fails
	bus: ixp4xx: Don't touch bit 7 on IXP42x
	usb: dwc3: gadget: conditionally remove requests
	usb: dwc3: gadget: Return -ESHUTDOWN on ep disable
	usb: dwc3: gadget: Clear ep descriptor last
	nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
	gcov: clang: fix the buffer overflow issue
	mm: vmscan: fix extreme overreclaim and swap floods
	KVM: x86: nSVM: leave nested mode on vCPU free
	KVM: x86: forcibly leave nested mode on vCPU reset
	KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use
	KVM: x86: add kvm_leave_nested
	KVM: x86: remove exit_int_info warning in svm_handle_exit
	x86/tsx: Add a feature bit for TSX control MSR support
	x86/pm: Add enumeration check before spec MSRs save/restore setup
	x86/ioremap: Fix page aligned size calculation in __ioremap_caller()
	Input: synaptics - switch touchpad on HP Laptop 15-da3001TU to RMI mode
	ASoC: Intel: bytcht_es8316: Add quirk for the Nanote UMPC-01
	tools: iio: iio_generic_buffer: Fix read size
	serial: 8250: 8250_omap: Avoid RS485 RTS glitch on ->set_termios()
	Input: goodix - try resetting the controller when no config is set
	Input: soc_button_array - add use_low_level_irq module parameter
	Input: soc_button_array - add Acer Switch V 10 to dmi_use_low_level_irq[]
	Input: i8042 - apply probe defer to more ASUS ZenBook models
	ASoC: stm32: dfsdm: manage cb buffers cleanup
	xen-pciback: Allow setting PCI_MSIX_FLAGS_MASKALL too
	xen/platform-pci: add missing free_irq() in error path
	platform/x86: asus-wmi: add missing pci_dev_put() in asus_wmi_set_xusb2pr()
	platform/x86: acer-wmi: Enable SW_TABLET_MODE on Switch V 10 (SW5-017)
	drm/amdgpu: disable BACO support on more cards
	zonefs: fix zone report size in __zonefs_io_error()
	platform/x86: hp-wmi: Ignore Smart Experience App event
	platform/x86: ideapad-laptop: Fix interrupt storm on fn-lock toggle on some Yoga laptops
	tcp: configurable source port perturb table size
	net: usb: qmi_wwan: add Telit 0x103a composition
	scsi: iscsi: Fix possible memory leak when device_register() failed
	gpu: host1x: Avoid trying to use GART on Tegra20
	dm integrity: flush the journal on suspend
	dm integrity: clear the journal on suspend
	fuse: lock inode unconditionally in fuse_fallocate()
	wifi: wilc1000: validate pairwise and authentication suite offsets
	wifi: wilc1000: validate length of IEEE80211_P2P_ATTR_OPER_CHANNEL attribute
	wifi: wilc1000: validate length of IEEE80211_P2P_ATTR_CHANNEL_LIST attribute
	wifi: wilc1000: validate number of channels
	genirq/msi: Shutdown managed interrupts with unsatifiable affinities
	genirq: Always limit the affinity to online CPUs
	irqchip/gic-v3: Always trust the managed affinity provided by the core code
	genirq: Take the proposed affinity at face value if force==true
	btrfs: free btrfs_path before copying root refs to userspace
	btrfs: free btrfs_path before copying fspath to userspace
	btrfs: free btrfs_path before copying subvol info to userspace
	btrfs: zoned: fix missing endianness conversion in sb_write_pointer
	btrfs: use kvcalloc in btrfs_get_dev_zone_info
	btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
	drm/amd/dc/dce120: Fix audio register mapping, stop triggering KASAN
	drm/amd/display: No display after resume from WB/CB
	drm/amdgpu: Enable Aldebaran devices to report CU Occupancy
	drm/amdgpu: always register an MMU notifier for userptr
	drm/i915: fix TLB invalidation for Gen12 video and compute engines
	cifs: fix missed refcounting of ipc tcon
	Linux 5.15.81

Change-Id: I9fe677a5bc83d3484f86aebbe20dd129c7ca3a42
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-12-04 12:02:10 +00:00
Luiz Capitulino
7d0c25b5fe genirq: Take the proposed affinity at face value if force==true
From: Marc Zyngier <maz@kernel.org>

commit c48c8b829d2b966a6649827426bcdba082ccf922 upstream.

Although setting the affinity of an interrupt to a set of CPUs that doesn't
have any online CPU is generally frowned apon, there are a few limited
cases where such affinity is set from a CPUHP notifier, setting the
affinity to a CPU that isn't online yet.

The saving grace is that this is always done using the 'force' attribute,
which gives a hint that the affinity setting can be outside of the online
CPU mask and the callsite set this flag with the knowledge that the
underlying interrupt controller knows to handle it.

This restores the expected behaviour on Marek's system.

Fixes: 33de0aa4bae9 ("genirq: Always limit the affinity to online CPUs")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/4b7fc13c-887b-a664-26e8-45aed13f048a@samsung.com
Link: https://lore.kernel.org/r/20220414140011.541725-1-maz@kernel.org

Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02 17:41:12 +01:00
Luiz Capitulino
52a93f2dcf genirq: Always limit the affinity to online CPUs
From: Marc Zyngier <maz@kernel.org>

commit 33de0aa4bae982ed6f7c777f86b5af3e627ac937 upstream.

[ Fixed small conflicts due to the HK_FLAG_MANAGED_IRQ flag been
  renamed on upstream ]

When booting with maxcpus=<small number> (or even loading a driver
while most CPUs are offline), it is pretty easy to observe managed
affinities containing a mix of online and offline CPUs being passed
to the irqchip driver.

This means that the irqchip cannot trust the affinity passed down
from the core code, which is a bit annoying and requires (at least
in theory) all drivers to implement some sort of affinity narrowing.

In order to address this, always limit the cpumask to the set of
online CPUs.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220405185040.206297-3-maz@kernel.org

Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02 17:41:11 +01:00
Luiz Capitulino
599cf4b845 genirq/msi: Shutdown managed interrupts with unsatifiable affinities
From: Marc Zyngier <maz@kernel.org>

commit d802057c7c553ad426520a053da9f9fe08e2c35a upstream.

[ This commit is almost a rewrite because it conflicts with Thomas
  Gleixner's refactoring of this code in v5.17-rc1. I wasn't sure if
  I should drop all the s-o-bs (including Mark's), but decided
  to keep as the original commit ]

When booting with maxcpus=<small number>, interrupt controllers
such as the GICv3 ITS may not be able to satisfy the affinity of
some managed interrupts, as some of the HW resources are simply
not available.

The same thing happens when loading a driver using managed interrupts
while CPUs are offline.

In order to deal with this, do not try to activate such interrupt
if there is no online CPU capable of handling it. Instead, place
it in shutdown state. Once a capable CPU shows up, it will be
activated.

Reported-by: John Garry <john.garry@huawei.com>
Reported-by: David Decotigny <ddecotig@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20220405185040.206297-2-maz@kernel.org

Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02 17:41:11 +01:00
Mukesh Ojha
feb2eda5e1 gcov: clang: fix the buffer overflow issue
commit a6f810efabfd789d3bbafeacb4502958ec56c5ce upstream.

Currently, in clang version of gcov code when module is getting removed
gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
dst->functions and it result in the kernel panic in below crash report.
Fix this by properly handling it.

[    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
[    8.899100][  T599] Mem abort info:
[    8.899102][  T599]   ESR = 0x9600004f
[    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
[    8.899105][  T599]   SET = 0, FnV = 0
[    8.899107][  T599]   EA = 0, S1PTW = 0
[    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
[    8.899110][  T599] Data abort info:
[    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
[    8.899113][  T599]   CM = 0, WnR = 1
[    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
[    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
[    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
[    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
....
..,
[    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
[    8.899547][  T599] Hardware name: XXX (DT)
[    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
[    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
[    8.899559][  T599] sp : ffffffc00e733b00
[    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
[    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
[    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
[    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
[    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
[    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
[    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
[    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
[    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
[    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
[    8.899589][  T599] Call trace:
[    8.899590][  T599]  gcov_info_add+0x9c/0xb8
[    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
[    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
[    8.899598][  T599]  do_init_module+0x2a8/0x33c
[    8.899600][  T599]  load_module+0x23cc/0x261c
[    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
[    8.899604][  T599]  invoke_syscall+0x94/0x2bc
[    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
[    8.899609][  T599]  do_el0_svc+0x40/0x54
[    8.899611][  T599]  el0_svc+0x94/0x2f0
[    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
[    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
[    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
[    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---

Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
Fixes: e178a5beb3 ("gcov: clang support")
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02 17:41:09 +01:00
Ramji Jiyani
706d642b1d ANDROID: GKI: Handle no ABI symbol list for modules
ACKs with no ABI symbol lists like mainline,
don't let any unsigned modules load as every
access is being treated as violation as
NO_OF_UNPROTECTED_SYMBOLS will be 0 in this case.

Check NO_OF_UNPROTECTED_SYMBOLS and if it's 0,
allow every symbol access by unsigned modules;
so we can keep the feature enable and also not
break any devices. It should never be 0 with
kernel branches where KMI_SYMBOL_LISTS  have been
enabled.

Bug: 257458145
Bug: 232430739
Test: TH
Fixes: e9669eeb2f45 ("ANDROID: GKI: Add module load time symbol protection")
Change-Id: Iab65e1425473e32baaad0d6c7f0d3eb007ae864f
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
(cherry picked from commit 8e00226a8fffa10b6383e448af785ce44451688e)
2022-12-01 00:22:47 +00:00
Yu Zhao
baeb9a0025 BACKPORT: mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion.  So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs.  The SPF and the Maple Tree also have provided their own
assessments [2][3].  However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it.  In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
Change-Id: If3116e6698cc6967b6992c2017962fac6c2d3a11
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 354ed597442952fb680c9cafc7e4eb8a76f9514c)
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Yu Zhao
7f53b0e704 BACKPORT: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_page
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Change-Id: I7ed3daf288e664e15bfd34991a77467a19a4e39a
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit bd74fdaea146029e4fa12c6de89adbe0779348a9)
[ Resolve conflicts in include/linux/memcontrol.h,
  include/linux/mm_types.h ]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Yu Zhao
37397878ee BACKPORT: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations.  Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq.  Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.

The eviction consumes old generations.  Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.

The protection of pages accessed multiple times through file descriptors
takes place in the eviction path.  Each generation is divided into
multiple tiers.  A page accessed N times through file descriptors is in
tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
bits in page->flags.  The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline.  The first tier
contains single-use unmapped clean pages, which are most likely the best
choices.  In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so.  This
approach has the following advantages:

1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  page_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  page_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Change-Id: I30b26b3086ce1879b83b96eb265f8f0dcb16a1fb
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ac35a490237446b71e3b4b782b1596967edd0aa8)
[Resolve confilcts in mm/Kconfig, mm/swap.c, mm/vmscan.c]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Yu Zhao
d5b2fa1c7b BACKPORT: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Change-Id: I7b24d1e9d263e4eb2c2ee23f2eb143824fcb5201
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ec1c86b25f4bdd9dce6436c0539d2a6ae676e1c4)
[ Resolve conflicts in mm/memory.c, mm/memcontrol.c, mm/Kconfig,
include/linux/mm_inline.h]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
2635d7d108 Revert "FROMLIST: mm: multi-gen LRU: groundwork"
This reverts commit f88ed5a3d3.

To be replaced with upstream version.

Bug: 249601646
Change-Id: I5a206480f838c304fb1c960fec2615894c2421bb
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
e931b1d222 Revert "FROMLIST: mm: multi-gen LRU: minimal implementation"
This reverts commit a1537a68c5.

To be replaced with upstream version.

Bug: 249601646
Change-Id: I3dfbb3ec56cfdb5a2db7ec00c124dae471cce932
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
02dc0d1dda Revert "FROMLIST: mm: multi-gen LRU: support page table walks"
This reverts commit 5280d76d38.

To be replaced with upstream version.

Bug: 249601646
Change-Id: I5bf4095117082b6627d182a8d987ca78a18fa392
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
8994fcd031 Revert "FROMLIST: mm: multi-gen LRU: kill switch"
This reverts commit 76f7f07cbf.

To be replaced with upstream version.

Bug: 249601646
Change-Id: I8356c32ce1962efba62ab2fc8a256b751c974c00
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Greg Kroah-Hartman
f5ea8b2710 Merge 5.15.80 into android14-5.15
Changes in 5.15.80
	mm: hwpoison: refactor refcount check handling
	mm: hwpoison: handle non-anonymous THP correctly
	mm: shmem: don't truncate page if memory failure happens
	ASoC: wm5102: Revert "ASoC: wm5102: Fix PM disable depth imbalance in wm5102_probe"
	ASoC: wm5110: Revert "ASoC: wm5110: Fix PM disable depth imbalance in wm5110_probe"
	ASoC: wm8997: Revert "ASoC: wm8997: Fix PM disable depth imbalance in wm8997_probe"
	ASoC: mt6660: Keep the pm_runtime enables before component stuff in mt6660_i2c_probe
	ASoC: rt1019: Fix the TDM settings
	ASoC: wm8962: Add an event handler for TEMP_HP and TEMP_SPK
	spi: intel: Fix the offset to get the 64K erase opcode
	ASoC: codecs: jz4725b: add missed Line In power control bit
	ASoC: codecs: jz4725b: fix reported volume for Master ctl
	ASoC: codecs: jz4725b: use right control for Capture Volume
	ASoC: codecs: jz4725b: fix capture selector naming
	ASoC: Intel: sof_sdw: add quirk variant for LAPBC710 NUC15
	selftests/futex: fix build for clang
	selftests/intel_pstate: fix build for ARCH=x86_64
	ASoC: rt1308-sdw: add the default value of some registers
	drm/amd/display: Remove wrong pipe control lock
	ACPI: scan: Add LATT2021 to acpi_ignore_dep_ids[]
	RDMA/efa: Add EFA 0xefa2 PCI ID
	btrfs: raid56: properly handle the error when unable to find the missing stripe
	NFSv4: Retry LOCK on OLD_STATEID during delegation return
	ACPI: x86: Add another system to quirk list for forcing StorageD3Enable
	firmware: arm_scmi: Cleanup the core driver removal callback
	i2c: tegra: Allocate DMA memory for DMA engine
	i2c: i801: add lis3lv02d's I2C address for Vostro 5568
	drm/imx: imx-tve: Fix return type of imx_tve_connector_mode_valid
	btrfs: remove pointless and double ulist frees in error paths of qgroup tests
	Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm
	x86/cpu: Add several Intel server CPU model numbers
	ASoC: codecs: jz4725b: Fix spelling mistake "Sourc" -> "Source", "Routee" -> "Route"
	mtd: spi-nor: intel-spi: Disable write protection only if asked
	spi: intel: Use correct mask for flash and protected regions
	KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet
	hugetlbfs: don't delete error page from pagecache
	arm64: dts: qcom: sa8155p-adp: Specify which LDO modes are allowed
	arm64: dts: qcom: sm8150-xperia-kumano: Specify which LDO modes are allowed
	arm64: dts: qcom: sm8250-xperia-edo: Specify which LDO modes are allowed
	arm64: dts: qcom: sm8350-hdk: Specify which LDO modes are allowed
	spi: stm32: Print summary 'callbacks suppressed' message
	ARM: dts: at91: sama7g5: fix signal name of pin PB2
	ASoC: core: Fix use-after-free in snd_soc_exit()
	ASoC: tas2770: Fix set_tdm_slot in case of single slot
	ASoC: tas2764: Fix set_tdm_slot in case of single slot
	ARM: at91: pm: avoid soft resetting AC DLL
	serial: 8250: omap: Fix missing PM runtime calls for omap8250_set_mctrl()
	serial: 8250_omap: remove wait loop from Errata i202 workaround
	serial: 8250: omap: Fix unpaired pm_runtime_put_sync() in omap8250_remove()
	serial: 8250: omap: Flush PM QOS work on remove
	serial: imx: Add missing .thaw_noirq hook
	tty: n_gsm: fix sleep-in-atomic-context bug in gsm_control_send
	bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb()
	ASoC: soc-utils: Remove __exit for snd_soc_util_exit()
	pinctrl: rockchip: list all pins in a possible mux route for PX30
	scsi: scsi_transport_sas: Fix error handling in sas_phy_add()
	block: sed-opal: kmalloc the cmd/resp buffers
	bpf: Fix memory leaks in __check_func_call
	arm64: Fix bit-shifting UB in the MIDR_CPU_MODEL() macro
	siox: fix possible memory leak in siox_device_add()
	parport_pc: Avoid FIFO port location truncation
	pinctrl: devicetree: fix null pointer dereferencing in pinctrl_dt_to_map
	drm/vc4: kms: Fix IS_ERR() vs NULL check for vc4_kms
	drm/panel: simple: set bpc field for logic technologies displays
	drm/drv: Fix potential memory leak in drm_dev_init()
	drm: Fix potential null-ptr-deref in drm_vblank_destroy_worker()
	ARM: dts: imx7: Fix NAND controller size-cells
	arm64: dts: imx8mm: Fix NAND controller size-cells
	arm64: dts: imx8mn: Fix NAND controller size-cells
	ata: libata-transport: fix double ata_host_put() in ata_tport_add()
	ata: libata-transport: fix error handling in ata_tport_add()
	ata: libata-transport: fix error handling in ata_tlink_add()
	ata: libata-transport: fix error handling in ata_tdev_add()
	nfp: change eeprom length to max length enumerators
	MIPS: fix duplicate definitions for exported symbols
	MIPS: Loongson64: Add WARN_ON on kexec related kmalloc failed
	bpf: Initialize same number of free nodes for each pcpu_freelist
	net: bgmac: Drop free_netdev() from bgmac_enet_remove()
	mISDN: fix possible memory leak in mISDN_dsp_element_register()
	net: hinic: Fix error handling in hinic_module_init()
	net: stmmac: ensure tx function is not running in stmmac_xdp_release()
	soc: imx8m: Enable OCOTP clock before reading the register
	net: liquidio: release resources when liquidio driver open failed
	mISDN: fix misuse of put_device() in mISDN_register_device()
	net: macvlan: Use built-in RCU list checking
	net: caif: fix double disconnect client in chnl_net_open()
	bnxt_en: Remove debugfs when pci_register_driver failed
	net: mhi: Fix memory leak in mhi_net_dellink()
	net: dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims
	xen/pcpu: fix possible memory leak in register_pcpu()
	net: ionic: Fix error handling in ionic_init_module()
	net: ena: Fix error handling in ena_init()
	net: hns3: fix setting incorrect phy link ksettings for firmware in resetting process
	bridge: switchdev: Fix memory leaks when changing VLAN protocol
	drbd: use after free in drbd_create_device()
	platform/x86/intel: pmc: Don't unconditionally attach Intel PMC when virtualized
	platform/surface: aggregator: Do not check for repeated unsequenced packets
	cifs: add check for returning value of SMB2_close_init
	net: ag71xx: call phylink_disconnect_phy if ag71xx_hw_enable() fail in ag71xx_open()
	net/x25: Fix skb leak in x25_lapb_receive_frame()
	cifs: Fix wrong return value checking when GETFLAGS
	net: microchip: sparx5: Fix potential null-ptr-deref in sparx_stats_init() and sparx5_start()
	net: thunderbolt: Fix error handling in tbnet_init()
	cifs: add check for returning value of SMB2_set_info_init
	ftrace: Fix the possible incorrect kernel message
	ftrace: Optimize the allocation for mcount entries
	ftrace: Fix null pointer dereference in ftrace_add_mod()
	ring_buffer: Do not deactivate non-existant pages
	tracing: Fix memory leak in tracing_read_pipe()
	tracing/ring-buffer: Have polling block on watermark
	tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
	tracing: Fix wild-memory-access in register_synth_event()
	tracing: Fix race where eprobes can be called before the event
	tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
	tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
	drm/amd/display: Add HUBP surface flip interrupt handler
	ALSA: usb-audio: Drop snd_BUG_ON() from snd_usbmidi_output_open()
	ALSA: hda/realtek: fix speakers for Samsung Galaxy Book Pro
	ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book Pro 360
	Revert "usb: dwc3: disable USB core PHY management"
	slimbus: qcom-ngd: Fix build error when CONFIG_SLIM_QCOM_NGD_CTRL=y && CONFIG_QCOM_RPROC_COMMON=m
	slimbus: stream: correct presence rate frequencies
	speakup: fix a segfault caused by switching consoles
	USB: bcma: Make GPIO explicitly optional
	USB: serial: option: add Sierra Wireless EM9191
	USB: serial: option: remove old LARA-R6 PID
	USB: serial: option: add u-blox LARA-R6 00B modem
	USB: serial: option: add u-blox LARA-L6 modem
	USB: serial: option: add Fibocom FM160 0x0111 composition
	usb: add NO_LPM quirk for Realforce 87U Keyboard
	usb: chipidea: fix deadlock in ci_otg_del_timer
	usb: cdns3: host: fix endless superspeed hub port reset
	usb: typec: mux: Enter safe mode only when pins need to be reconfigured
	iio: adc: at91_adc: fix possible memory leak in at91_adc_allocate_trigger()
	iio: trigger: sysfs: fix possible memory leak in iio_sysfs_trig_init()
	iio: adc: mp2629: fix wrong comparison of channel
	iio: adc: mp2629: fix potential array out of bound access
	iio: pressure: ms5611: changed hardcoded SPI speed to value limited
	dm ioctl: fix misbehavior if list_versions races with module loading
	serial: 8250: Fall back to non-DMA Rx if IIR_RDI occurs
	serial: 8250: Flush DMA Rx on RLSI
	serial: 8250_lpss: Configure DMA also w/o DMA filter
	Input: iforce - invert valid length check when fetching device IDs
	maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault()
	net: phy: marvell: add sleep time after enabling the loopback bit
	scsi: zfcp: Fix double free of FSF request when qdio send fails
	iommu/vt-d: Preset Access bit for IOVA in FL non-leaf paging entries
	iommu/vt-d: Set SRE bit only when hardware has SRS cap
	firmware: coreboot: Register bus in module init
	mmc: core: properly select voltage range without power cycle
	mmc: sdhci-pci-o2micro: fix card detect fail issue caused by CD# debounce timeout
	mmc: sdhci-pci: Fix possible memory leak caused by missing pci_dev_put()
	docs: update mediator contact information in CoC doc
	misc/vmw_vmci: fix an infoleak in vmci_host_do_receive_datagram()
	perf/x86/intel/pt: Fix sampling using single range output
	nvme: restrict management ioctls to admin
	nvme: ensure subsystem reset is single threaded
	serial: 8250_lpss: Use 16B DMA burst with Elkhart Lake
	perf: Improve missing SIGTRAP checking
	ring-buffer: Include dropped pages in counting dirty patches
	tracing: Fix warning on variable 'struct trace_array'
	net: use struct_group to copy ip/ipv6 header addresses
	scsi: target: tcm_loop: Fix possible name leak in tcm_loop_setup_hba_bus()
	scsi: scsi_debug: Fix possible UAF in sdebug_add_host_helper()
	kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
	Input: i8042 - fix leaking of platform device on module removal
	macvlan: enforce a consistent minimal mtu
	tcp: cdg: allow tcp_cdg_release() to be called multiple times
	kcm: avoid potential race in kcm_tx_work
	kcm: close race conditions on sk_receive_queue
	9p: trans_fd/p9_conn_cancel: drop client lock earlier
	gfs2: Check sb_bsize_shift after reading superblock
	gfs2: Switch from strlcpy to strscpy
	9p/trans_fd: always use O_NONBLOCK read/write
	wifi: wext: use flex array destination for memcpy()
	mm: fs: initialize fsdata passed to write_begin/write_end interface
	net/9p: use a dedicated spinlock for trans_fd
	ntfs: fix use-after-free in ntfs_attr_find()
	ntfs: fix out-of-bounds read in ntfs_attr_find()
	ntfs: check overflow when iterating ATTR_RECORDs
	Linux 5.15.80

Change-Id: Idc9aa4c30c528dd194bc813201cbb2c5df8c1d62
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-11-28 16:08:50 +00:00
Li Huafei
c49cc2c059 kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
[ Upstream commit 5dd7caf0bdc5d0bae7cf9776b4d739fb09bd5ebb ]

In __unregister_kprobe_top(), if the currently unregistered probe has
post_handler but other child probes of the aggrprobe do not have
post_handler, the post_handler of the aggrprobe is cleared. If this is
a ftrace-based probe, there is a problem. In later calls to
disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is
NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in
__disarm_kprobe_ftrace() and may even cause use-after-free:

  Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
  WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
  Modules linked in: testKprobe_007(-)
  CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
  [...]
  Call Trace:
   <TASK>
   __disable_kprobe+0xcd/0xe0
   __unregister_kprobe_top+0x12/0x150
   ? mutex_lock+0xe/0x30
   unregister_kprobes.part.23+0x31/0xa0
   unregister_kprobe+0x32/0x40
   __x64_sys_delete_module+0x15e/0x260
   ? do_user_addr_fault+0x2cd/0x6b0
   do_syscall_64+0x3a/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
   [...]

For the kprobe-on-ftrace case, we keep the post_handler setting to
identify this aggrprobe armed with kprobe_ipmodify_ops. This way we
can disarm it correctly.

Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/

Fixes: 0bc11ed5ab ("kprobes: Allow kprobes coexist with livepatch")
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-26 09:24:50 +01:00
Steven Rostedt (Google)
73cf0ff9a3 ring-buffer: Include dropped pages in counting dirty patches
[ Upstream commit 31029a8b2c7e656a0289194ef16415050ae4c4ac ]

The function ring_buffer_nr_dirty_pages() was created to find out how many
pages are filled in the ring buffer. There's two running counters. One is
incremented whenever a new page is touched (pages_touched) and the other
is whenever a page is read (pages_read). The dirty count is the number
touched minus the number read. This is used to determine if a blocked task
should be woken up if the percentage of the ring buffer it is waiting for
is hit.

The problem is that it does not take into account dropped pages (when the
new writes overwrite pages that were not read). And then the dirty pages
will always be greater than the percentage.

This makes the "buffer_percent" file inaccurate, as the number of dirty
pages end up always being larger than the percentage, event when it's not
and this causes user space to be woken up more than it wants to be.

Add a new counter to keep track of lost pages, and include that in the
accounting of dirty pages so that it is actually accurate.

Link: https://lkml.kernel.org/r/20221021123013.55fb6055@gandalf.local.home

Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-26 09:24:49 +01:00
Marco Elver
35c60b4e8c perf: Improve missing SIGTRAP checking
[ Upstream commit bb88f9695460bec25aa30ba9072595025cf6c8af ]

To catch missing SIGTRAP we employ a WARN in __perf_event_overflow(),
which fires if pending_sigtrap was already set: returning to user space
without consuming pending_sigtrap, and then having the event fire again
would re-enter the kernel and trigger the WARN.

This, however, seemed to miss the case where some events not associated
with progress in the user space task can fire and the interrupt handler
runs before the IRQ work meant to consume pending_sigtrap (and generate
the SIGTRAP).

syzbot gifted us this stack trace:

 | WARNING: CPU: 0 PID: 3607 at kernel/events/core.c:9313 __perf_event_overflow
 | Modules linked in:
 | CPU: 0 PID: 3607 Comm: syz-executor100 Not tainted 6.1.0-rc2-syzkaller-00073-g88619e77b33d #0
 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
 | RIP: 0010:__perf_event_overflow+0x498/0x540 kernel/events/core.c:9313
 | <...>
 | Call Trace:
 |  <TASK>
 |  perf_swevent_hrtimer+0x34f/0x3c0 kernel/events/core.c:10729
 |  __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
 |  __hrtimer_run_queues+0x1c6/0xfb0 kernel/time/hrtimer.c:1749
 |  hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
 |  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1096 [inline]
 |  __sysvec_apic_timer_interrupt+0x17c/0x640 arch/x86/kernel/apic/apic.c:1113
 |  sysvec_apic_timer_interrupt+0x40/0xc0 arch/x86/kernel/apic/apic.c:1107
 |  asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:649
 | <...>
 |  </TASK>

In this case, syzbot produced a program with event type
PERF_TYPE_SOFTWARE and config PERF_COUNT_SW_CPU_CLOCK. The hrtimer
manages to fire again before the IRQ work got a chance to run, all while
never having returned to user space.

Improve the WARN to check for real progress in user space: approximate
this by storing a 32-bit hash of the current IP into pending_sigtrap,
and if an event fires while pending_sigtrap still matches the previous
IP, we assume no progress (false negatives are possible given we could
return to user space and trigger again on the same IP).

Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs")
Reported-by: syzbot+b8ded3e2e2c6adde4990@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221031093513.3032814-1-elver@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-26 09:24:49 +01:00
Shang XiaoJing
e57daa7503 tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
commit 22ea4ca9631eb137e64e5ab899e9c89cb6670959 upstream.

When test_gen_kprobe_cmd() failed after kprobe_event_gen_cmd_end(), it
will goto delete, which will call kprobe_event_delete() and release the
corresponding resource. However, the trace_array in gen_kretprobe_test
will point to the invalid resource. Set gen_kretprobe_test to NULL
after called kprobe_event_delete() to prevent null-ptr-deref.

BUG: kernel NULL pointer dereference, address: 0000000000000070
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 246 Comm: modprobe Tainted: G        W
6.1.0-rc1-00174-g9522dc5c87da-dirty #248
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:__ftrace_set_clr_event_nolock+0x53/0x1b0
Code: e8 82 26 fc ff 49 8b 1e c7 44 24 0c ea ff ff ff 49 39 de 0f 84 3c
01 00 00 c7 44 24 18 00 00 00 00 e8 61 26 fc ff 48 8b 6b 10 <44> 8b 65
70 4c 8b 6d 18 41 f7 c4 00 02 00 00 75 2f
RSP: 0018:ffffc9000159fe00 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88810971d268 RCX: 0000000000000000
RDX: ffff8881080be600 RSI: ffffffff811b48ff RDI: ffff88810971d058
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc9000159fe58 R11: 0000000000000001 R12: ffffffffa0001064
R13: ffffffffa000106c R14: ffff88810971d238 R15: 0000000000000000
FS:  00007f89eeff6540(0000) GS:ffff88813b600000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000070 CR3: 000000010599e004 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __ftrace_set_clr_event+0x3e/0x60
 trace_array_set_clr_event+0x35/0x50
 ? 0xffffffffa0000000
 kprobe_event_gen_test_exit+0xcd/0x10b [kprobe_event_gen_test]
 __x64_sys_delete_module+0x206/0x380
 ? lockdep_hardirqs_on_prepare+0xd8/0x190
 ? syscall_enter_from_user_mode+0x1c/0x50
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89eeb061b7

Link: https://lore.kernel.org/all/20221108015130.28326-3-shangxiaojing@huawei.com/

Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:43 +01:00
Shang XiaoJing
3a41c0f2a5 tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
commit e0d75267f59d7084e0468bd68beeb1bf9c71d7c0 upstream.

When trace_get_event_file() failed, gen_kretprobe_test will be assigned
as the error code. If module kprobe_event_gen_test is removed now, the
null pointer dereference will happen in kprobe_event_gen_test_exit().
Check if gen_kprobe_test or gen_kretprobe_test is error code or NULL
before dereference them.

BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 2210 Comm: modprobe Not tainted
6.1.0-rc1-00171-g2159299a3b74-dirty #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:kprobe_event_gen_test_exit+0x1c/0xb5 [kprobe_event_gen_test]
Code: Unable to access opcode bytes at 0xffffffff9ffffff2.
RSP: 0018:ffffc900015bfeb8 EFLAGS: 00010246
RAX: ffffffffffffffea RBX: ffffffffa0002080 RCX: 0000000000000000
RDX: ffffffffa0001054 RSI: ffffffffa0001064 RDI: ffffffffdfc6349c
RBP: ffffffffa0000000 R08: 0000000000000004 R09: 00000000001e95c0
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000800
R13: ffffffffa0002420 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f56b75be540(0000) GS:ffff88813bc00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff9ffffff2 CR3: 000000010874a006 CR4: 0000000000330ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __x64_sys_delete_module+0x206/0x380
 ? lockdep_hardirqs_on_prepare+0xd8/0x190
 ? syscall_enter_from_user_mode+0x1c/0x50
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lore.kernel.org/all/20221108015130.28326-2-shangxiaojing@huawei.com/

Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:43 +01:00
Steven Rostedt (Google)
7291dec4f2 tracing: Fix race where eprobes can be called before the event
commit 94eedf3dded5fb472ce97bfaf3ac1c6c29c35d26 upstream.

The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.

Test for a NULL record to keep this from happening.

Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221117214249.2addbe10@gandalf.local.home

Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:43 +01:00
Shang XiaoJing
6517b97134 tracing: Fix wild-memory-access in register_synth_event()
commit 1b5f1c34d3f5a664a57a5a7557a50e4e3cc2505c upstream.

In register_synth_event(), if set_synth_event_print_fmt() failed, then
both trace_remove_event_call() and unregister_trace_event() will be
called, which means the trace_event_call will call
__unregister_trace_event() twice. As the result, the second unregister
will causes the wild-memory-access.

register_synth_event
    set_synth_event_print_fmt failed
    trace_remove_event_call
        event_remove
            if call->event.funcs then
            __unregister_trace_event (first call)
    unregister_trace_event
        __unregister_trace_event (second call)

Fix the bug by avoiding to call the second __unregister_trace_event() by
checking if the first one is called.

general protection fault, probably for non-canonical address
	0xfbd59c0000000024: 0000 [#1] SMP KASAN PTI
KASAN: maybe wild-memory-access in range
[0xdead000000000120-0xdead000000000127]
CPU: 0 PID: 3807 Comm: modprobe Not tainted
6.1.0-rc1-00186-g76f33a7eedb4 #299
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:unregister_trace_event+0x6e/0x280
Code: 00 fc ff df 4c 89 ea 48 c1 ea 03 80 3c 02 00 0f 85 0e 02 00 00 48
b8 00 00 00 00 00 fc ff df 4c 8b 63 08 4c 89 e2 48 c1 ea 03 <80> 3c 02
00 0f 85 e2 01 00 00 49 89 2c 24 48 85 ed 74 28 e8 7a 9b
RSP: 0018:ffff88810413f370 EFLAGS: 00010a06
RAX: dffffc0000000000 RBX: ffff888105d050b0 RCX: 0000000000000000
RDX: 1bd5a00000000024 RSI: ffff888119e276e0 RDI: ffffffff835a8b20
RBP: dead000000000100 R08: 0000000000000000 R09: fffffbfff0913481
R10: ffffffff8489a407 R11: fffffbfff0913480 R12: dead000000000122
R13: ffff888105d050b8 R14: 0000000000000000 R15: ffff888105d05028
FS:  00007f7823e8d540(0000) GS:ffff888119e00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7823e7ebec CR3: 000000010a058002 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __create_synth_event+0x1e37/0x1eb0
 create_or_delete_synth_event+0x110/0x250
 synth_event_run_command+0x2f/0x110
 test_gen_synth_cmd+0x170/0x2eb [synth_event_gen_test]
 synth_event_gen_test_init+0x76/0x9bc [synth_event_gen_test]
 do_one_initcall+0xdb/0x480
 do_init_module+0x1cf/0x680
 load_module+0x6a50/0x70a0
 __do_sys_finit_module+0x12f/0x1c0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lkml.kernel.org/r/20221117012346.22647-3-shangxiaojing@huawei.com

Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:43 +01:00
Shang XiaoJing
07ba4f0603 tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
commit a4527fef9afe5c903c718d0cd24609fe9c754250 upstream.

test_gen_synth_cmd() only free buf in fail path, hence buf will leak
when there is no failure. Add kfree(buf) to prevent the memleak. The
same reason and solution in test_empty_synth_event().

unreferenced object 0xffff8881127de000 (size 2048):
  comm "modprobe", pid 247, jiffies 4294972316 (age 78.756s)
  hex dump (first 32 bytes):
    20 67 65 6e 5f 73 79 6e 74 68 5f 74 65 73 74 20   gen_synth_test
    20 70 69 64 5f 74 20 6e 65 78 74 5f 70 69 64 5f   pid_t next_pid_
  backtrace:
    [<000000004254801a>] kmalloc_trace+0x26/0x100
    [<0000000039eb1cf5>] 0xffffffffa00083cd
    [<000000000e8c3bc8>] 0xffffffffa00086ba
    [<00000000c293d1ea>] do_one_initcall+0xdb/0x480
    [<00000000aa189e6d>] do_init_module+0x1cf/0x680
    [<00000000d513222b>] load_module+0x6a50/0x70a0
    [<000000001fd4d529>] __do_sys_finit_module+0x12f/0x1c0
    [<00000000b36c4c0f>] do_syscall_64+0x3f/0x90
    [<00000000bbf20cf3>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
unreferenced object 0xffff8881127df000 (size 2048):
  comm "modprobe", pid 247, jiffies 4294972324 (age 78.728s)
  hex dump (first 32 bytes):
    20 65 6d 70 74 79 5f 73 79 6e 74 68 5f 74 65 73   empty_synth_tes
    74 20 20 70 69 64 5f 74 20 6e 65 78 74 5f 70 69  t  pid_t next_pi
  backtrace:
    [<000000004254801a>] kmalloc_trace+0x26/0x100
    [<00000000d4db9a3d>] 0xffffffffa0008071
    [<00000000c31354a5>] 0xffffffffa00086ce
    [<00000000c293d1ea>] do_one_initcall+0xdb/0x480
    [<00000000aa189e6d>] do_init_module+0x1cf/0x680
    [<00000000d513222b>] load_module+0x6a50/0x70a0
    [<000000001fd4d529>] __do_sys_finit_module+0x12f/0x1c0
    [<00000000b36c4c0f>] do_syscall_64+0x3f/0x90
    [<00000000bbf20cf3>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lkml.kernel.org/r/20221117012346.22647-2-shangxiaojing@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: <fengguang.wu@intel.com>
Cc: stable@vger.kernel.org
Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:43 +01:00
Steven Rostedt (Google)
8b318f3032 tracing/ring-buffer: Have polling block on watermark
commit 42fb0a1e84ff525ebe560e2baf9451ab69127e2b upstream.

Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.

That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.

Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.

Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home

Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Primiano Tucci <primiano@google.com>
Cc: stable@vger.kernel.org
Fixes: 1e0d6714ac ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Wang Yufen
2c21ee020c tracing: Fix memory leak in tracing_read_pipe()
commit 649e72070cbbb8600eb823833e4748f5a0815116 upstream.

kmemleak reports this issue:

unreferenced object 0xffff888105a18900 (size 128):
  comm "test_progs", pid 18933, jiffies 4336275356 (age 22801.766s)
  hex dump (first 32 bytes):
    25 73 00 90 81 88 ff ff 26 05 00 00 42 01 58 04  %s......&...B.X.
    03 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000560143a1>] __kmalloc_node_track_caller+0x4a/0x140
    [<000000006af00822>] krealloc+0x8d/0xf0
    [<00000000c309be6a>] trace_iter_expand_format+0x99/0x150
    [<000000005a53bdb6>] trace_check_vprintf+0x1e0/0x11d0
    [<0000000065629d9d>] trace_event_printf+0xb6/0xf0
    [<000000009a690dc7>] trace_raw_output_bpf_trace_printk+0x89/0xc0
    [<00000000d22db172>] print_trace_line+0x73c/0x1480
    [<00000000cdba76ba>] tracing_read_pipe+0x45c/0x9f0
    [<0000000015b58459>] vfs_read+0x17b/0x7c0
    [<000000004aeee8ed>] ksys_read+0xed/0x1c0
    [<0000000063d3d898>] do_syscall_64+0x3b/0x90
    [<00000000a06dda7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

iter->fmt alloced in
  tracing_read_pipe() -> .. ->trace_iter_expand_format(), but not
freed, to fix, add free in tracing_release_pipe()

Link: https://lkml.kernel.org/r/1667819090-4643-1-git-send-email-wangyufen@huawei.com

Cc: stable@vger.kernel.org
Fixes: efbbdaa22b ("tracing: Show real address for trace event arguments")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Daniil Tatianin
00f74b1a98 ring_buffer: Do not deactivate non-existant pages
commit 56f4ca0a79a9f1af98f26c54b9b89ba1f9bcc6bd upstream.

rb_head_page_deactivate() expects cpu_buffer to contain a valid list of
->pages, so verify that the list is actually present before calling it.

Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.

Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru

Cc: stable@vger.kernel.org
Fixes: 77ae365eca ("ring-buffer: make lockless")
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Xiu Jianfeng
1bea037a1a ftrace: Fix null pointer dereference in ftrace_add_mod()
commit 19ba6c8af9382c4c05dc6a0a79af3013b9a35cd0 upstream.

The @ftrace_mod is allocated by kzalloc(), so both the members {prev,next}
of @ftrace_mode->list are NULL, it's not a valid state to call list_del().
If kstrdup() for @ftrace_mod->{func|module} fails, it goes to @out_free
tag and calls free_ftrace_mod() to destroy @ftrace_mod, then list_del()
will write prev->next and next->prev, where null pointer dereference
happens.

BUG: kernel NULL pointer dereference, address: 0000000000000008
Oops: 0002 [#1] PREEMPT SMP NOPTI
Call Trace:
 <TASK>
 ftrace_mod_callback+0x20d/0x220
 ? do_filp_open+0xd9/0x140
 ftrace_process_regex.isra.51+0xbf/0x130
 ftrace_regex_write.isra.52.part.53+0x6e/0x90
 vfs_write+0xee/0x3a0
 ? __audit_filter_op+0xb1/0x100
 ? auditd_test_task+0x38/0x50
 ksys_write+0xa5/0xe0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Kernel panic - not syncing: Fatal exception

So call INIT_LIST_HEAD() to initialize the list member to fix this issue.

Link: https://lkml.kernel.org/r/20221116015207.30858-1-xiujianfeng@huawei.com

Cc: stable@vger.kernel.org
Fixes: 673feb9d76 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Wang Wensheng
fadfcf39fb ftrace: Optimize the allocation for mcount entries
commit bcea02b096333dc74af987cb9685a4dbdd820840 upstream.

If we can't allocate this size, try something smaller with half of the
size. Its order should be decreased by one instead of divided by two.

Link: https://lkml.kernel.org/r/20221109094434.84046-3-wangwensheng4@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: a790087554 ("ftrace: Allocate the mcount record pages as groups")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Wang Wensheng
5c5f264289 ftrace: Fix the possible incorrect kernel message
commit 08948caebe93482db1adfd2154eba124f66d161d upstream.

If the number of mcount entries is an integer multiple of
ENTRIES_PER_PAGE, the page count showing on the console would be wrong.

Link: https://lkml.kernel.org/r/20221109094434.84046-2-wangwensheng4@huawei.com

Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: 5821e1b74f ("function tracing: fix wrong pos computing when read buffer has been fulfilled")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-26 09:24:42 +01:00
Xu Kuohai
7ff4fa179e bpf: Initialize same number of free nodes for each pcpu_freelist
[ Upstream commit 4b45cd81f737d79d0fbfc0d320a1e518e7f0bbf0 ]

pcpu_freelist_populate() initializes nr_elems / num_possible_cpus() + 1
free nodes for some CPUs, and then possibly one CPU with fewer nodes,
followed by remaining cpus with 0 nodes. For example, when nr_elems == 256
and num_possible_cpus() == 32, CPU 0~27 each gets 9 free nodes, CPU 28 gets
4 free nodes, CPU 29~31 get 0 free nodes, while in fact each CPU should get
8 nodes equally.

This patch initializes nr_elems / num_possible_cpus() free nodes for each
CPU firstly, then allocates the remaining free nodes by one for each CPU
until no free nodes left.

Fixes: e19494edab ("bpf: introduce percpu_freelist")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221110122128.105214-1-xukuohai@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-26 09:24:38 +01:00