Commit Graph

1236 Commits

Author SHA1 Message Date
Suren Baghdasaryan
ca96bd7bf1 ANDROID: mm: avoid using vmacache in lockless vma search
When searching vma under RCU protection vmcache should be avoided because
a race with munmap() might result in finding a vma and placing it into
vmcache after munmap() removed that vma and called vmcache_invalidate.
Once that vma is freed, vmcache will be left with an invalid vma pointer.

Bug: 257443051
Change-Id: I62438305fcf5139974f4f7d3bae5b22c74084a59
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:27 -08:00
Suren Baghdasaryan
3f311327f9 ANDROID: mm: introduce vma refcounting to protect vma during SPF
Current mechanism to stabilize a vma during speculative page fault
handling makes a copy of the faulting vma under RCU protection. This
makes it hard to protect elements which do not belong to the vma but
are used by the page fault handler like vma->vm_file.
The problems is that a copy of the vma can't be used to safely
protect the file attached to the original vma unless the file is
also released after RCU grace period (which is how SPF was designed
originally but that caused performance regression and had to be
changed).
To avoid these complications, introduce vma refcounting to stabilize
and operate on the original vma during page fault handling. Page
fault handler finds the vma and increases its refcount under RCU
protection, vma is freed after RCU grace period, vma->vm_file is
released only after refcount indicates no users. This mechanism
guarantees that once get_vma returns a vma, both the vma itself and
vma->vm_file are stable.
Additional benefits of this patch are: we don't need to copy the vma
and no additional logic is needed to stabilize vma->vm_file.

Bug: 257443051
Change-Id: I59d373926d687fcbd56847a8c3500c43bf1844c8
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:27 -08:00
Suren Baghdasaryan
c11ef6356b Revert "ANDROID: add vma->file_ref_count to synchronize vma->vm_file destruction"
This reverts commit a3fe25d923.

File refcounting implemented in this patch is broken and needs to be
redone.
The change in include/linux/mm_types.h which adds file_ref_count into
vm_area_struct is left untouched to keep ABI intact.

Bug: 258731892
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I37984eb2f0981a989f74bcaaa6be42040a2f241e
2022-11-23 10:25:26 -08:00
Greg Kroah-Hartman
ae3c0ab383 ANDROID: GKI: mm.h: add Android ABI padding to a structure
Try to mitigate potential future driver core api changes by adding a
padding to struct vm_operations_struct.

Based on a change made to the RHEL/CENTOS 8 kernel.

Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I78f84148ef4d3524bd6c5b78e53e06503a4ac3ae
2022-07-19 12:47:34 +00:00
Suren Baghdasaryan
4daa3c254e ANDROID: add vma->file_ref_count to synchronize vma->vm_file destruction
In order to prevent destruction of vma->vm_file while it's being used
during speculative page fault handling, introduce an atomic refcounter.

Bug: 234527424
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0e971156f3e76feb45136bac1582a7eaab8c75df
2022-07-19 12:47:27 +00:00
Liangcai Fan
bf46e6f5db BACKPORT: mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged
When initializing transparent huge pages, min_free_kbytes would be
calculated according to what khugepaged expected.

So when transparent huge pages get disabled, min_free_kbytes should be
recalculated instead of the higher value set by khugepaged.

Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit bd3400ea173fb611cdf2030d03620185ff6c0b0e)

Bug: 235523176
Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Change-Id: I815893d25186847933db2a0872528fb15a00b3c8
2022-07-19 03:54:51 +00:00
Greg Kroah-Hartman
06d58f3cef Merge 5.15.54 into android14-5.15
Changes in 5.15.54
	mm/slub: add missing TID updates on slab deactivation
	mm/filemap: fix UAF in find_lock_entries
	Revert "selftests/bpf: Add test for bpf_timer overwriting crash"
	ALSA: usb-audio: Workarounds for Behringer UMC 204/404 HD
	ALSA: hda/realtek: Add quirk for Clevo L140PU
	ALSA: cs46xx: Fix missing snd_card_free() call at probe error
	can: bcm: use call_rcu() instead of costly synchronize_rcu()
	can: grcan: grcan_probe(): remove extra of_node_get()
	can: gs_usb: gs_usb_open/close(): fix memory leak
	can: m_can: m_can_chip_config(): actually enable internal timestamping
	can: m_can: m_can_{read_fifo,echo_tx_event}(): shift timestamp to full 32 bits
	can: mcp251xfd: mcp251xfd_regmap_crc_read(): improve workaround handling for mcp2517fd
	can: mcp251xfd: mcp251xfd_regmap_crc_read(): update workaround broken CRC on TBC register
	bpf: Fix incorrect verifier simulation around jmp32's jeq/jne
	bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals
	usbnet: fix memory leak in error case
	net: rose: fix UAF bug caused by rose_t0timer_expiry
	netfilter: nft_set_pipapo: release elements in clone from abort path
	netfilter: nf_tables: stricter validation of element data
	btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk
	btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
	btrfs: fix invalid delayed ref after subvolume creation failure
	btrfs: fix warning when freeing leaf after subvolume creation failure
	Input: cpcap-pwrbutton - handle errors from platform_get_irq()
	Input: goodix - change goodix_i2c_write() len parameter type to int
	Input: goodix - add a goodix.h header file
	Input: goodix - refactor reset handling
	Input: goodix - try not to touch the reset-pin on x86/ACPI devices
	dma-buf/poll: Get a file reference for outstanding fence callbacks
	btrfs: fix deadlock between chunk allocation and chunk btree modifications
	drm/i915: Disable bonding on gen12+ platforms
	drm/i915/gt: Register the migrate contexts with their engines
	drm/i915: Replace the unconditional clflush with drm_clflush_virt_range()
	PCI/portdrv: Rename pm_iter() to pcie_port_device_iter()
	PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
	media: ir_toy: prevent device from hanging during transmit
	memory: renesas-rpc-if: Avoid unaligned bus access for HyperFlash
	ath11k: add hw_param for wakeup_mhi
	qed: Improve the stack space of filter_config()
	platform/x86: wmi: introduce helper to convert driver to WMI driver
	platform/x86: wmi: Replace read_takes_no_args with a flags field
	platform/x86: wmi: Fix driver->notify() vs ->probe() race
	mt76: mt7921: get rid of mt7921_mac_set_beacon_filter
	mt76: mt7921: introduce mt7921_mcu_set_beacon_filter utility routine
	mt76: mt7921: fix a possible race enabling/disabling runtime-pm
	bpf: Stop caching subprog index in the bpf_pseudo_func insn
	bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC
	riscv: defconfig: enable DRM_NOUVEAU
	RISC-V: defconfigs: Set CONFIG_FB=y, for FB console
	net/mlx5e: Check action fwd/drop flag exists also for nic flows
	net/mlx5e: Split actions_match_supported() into a sub function
	net/mlx5e: TC, Reject rules with drop and modify hdr action
	net/mlx5e: TC, Reject rules with forward and drop actions
	ASoC: rt5682: Avoid the unexpected IRQ event during going to suspend
	ASoC: rt5682: Re-detect the combo jack after resuming
	ASoC: rt5682: Fix deadlock on resume
	netfilter: nf_tables: convert pktinfo->tprot_set to flags field
	netfilter: nft_payload: support for inner header matching / mangling
	netfilter: nft_payload: don't allow th access for fragments
	s390/boot: allocate amode31 section in decompressor
	s390/setup: use physical pointers for memblock_reserve()
	s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
	ibmvnic: init init_done_rc earlier
	ibmvnic: clear fop when retrying probe
	ibmvnic: Allow queueing resets during probe
	virtio-blk: avoid preallocating big SGL for data
	io_uring: ensure that fsnotify is always called
	block: use bdev_get_queue() in bio.c
	block: only mark bio as tracked if it really is tracked
	block: fix rq-qos breakage from skipping rq_qos_done_bio()
	stddef: Introduce struct_group() helper macro
	media: omap3isp: Use struct_group() for memcpy() region
	media: davinci: vpif: fix use-after-free on driver unbind
	mt76: mt76_connac: fix MCU_CE_CMD_SET_ROC definition error
	mt76: mt7921: do not always disable fw runtime-pm
	cxl/port: Hold port reference until decoder release
	clk: renesas: r9a07g044: Update multiplier and divider values for PLL2/3
	KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
	KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook
	scsi: qla2xxx: Move heartbeat handling from DPC thread to workqueue
	scsi: qla2xxx: Fix laggy FC remote port session recovery
	scsi: qla2xxx: edif: Replace list_for_each_safe with list_for_each_entry_safe
	scsi: qla2xxx: Fix crash during module load unload test
	gfs2: Fix gfs2_file_buffered_write endless loop workaround
	vdpa/mlx5: Avoid processing works if workqueue was destroyed
	btrfs: handle device lookup with btrfs_dev_lookup_args
	btrfs: add a btrfs_get_dev_args_from_path helper
	btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
	btrfs: remove device item and update super block in the same transaction
	drbd: add error handling support for add_disk()
	drbd: Fix double free problem in drbd_create_device
	drbd: fix an invalid memory access caused by incorrect use of list iterator
	drm/amd/display: Set min dcfclk if pipe count is 0
	drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw
	NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
	NFSD: COMMIT operations must not return NFS?ERR_INVAL
	riscv/mm: Add XIP_FIXUP for riscv_pfn_base
	iio: accel: mma8452: use the correct logic to get mma8452_data
	batman-adv: Use netif_rx().
	mtd: spi-nor: Skip erase logic when SPI_NOR_NO_ERASE is set
	Compiler Attributes: add __alloc_size() for better bounds checking
	mm: vmalloc: introduce array allocation functions
	KVM: use __vcalloc for very large allocations
	btrfs: don't access possibly stale fs_info data in device_list_add
	KVM: s390x: fix SCK locking
	scsi: qla2xxx: Fix loss of NVMe namespaces after driver reload test
	powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
	powerpc: flexible GPR range save/restore macros
	powerpc/tm: Fix more userspace r13 corruption
	serial: sc16is7xx: Clear RS485 bits in the shutdown
	bus: mhi: core: Use correctly sized arguments for bit field
	bus: mhi: Fix pm_state conversion to string
	stddef: Introduce DECLARE_FLEX_ARRAY() helper
	uapi/linux/stddef.h: Add include guards
	ASoC: rt5682: move clk related code to rt5682_i2c_probe
	ASoC: rt5682: fix an incorrect NULL check on list iterator
	drm/amd/vcn: fix an error msg on vcn 3.0
	KVM: Don't create VM debugfs files outside of the VM directory
	tty: n_gsm: Modify CR,PF bit when config requester
	tty: n_gsm: Save dlci address open status when config requester
	tty: n_gsm: fix frame reception handling
	ALSA: usb-audio: add mapping for MSI MPG X570S Carbon Max Wifi.
	ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
	tty: n_gsm: fix missing update of modem controls after DLCI open
	btrfs: zoned: encapsulate inode locking for zoned relocation
	btrfs: zoned: use dedicated lock for data relocation
	KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
	mm/hwpoison: mf_mutex for soft offline and unpoison
	mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
	mm/memory-failure.c: fix race with changing page compound again
	mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
	tty: n_gsm: fix invalid use of MSC in advanced option
	tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output()
	serial: 8250_mtk: Make sure to select the right FEATURE_SEL
	tty: n_gsm: fix invalid gsmtty_write_room() result
	drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
	drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
	drm/i915: Fix a race between vma / object destruction and unbinding
	drm/mediatek: Use mailbox rx_callback instead of cmdq_task_cb
	drm/mediatek: Remove the pointer of struct cmdq_client
	drm/mediatek: Detect CMDQ execution timeout
	drm/mediatek: Add cmdq_handle in mtk_crtc
	drm/mediatek: Add vblank register/unregister callback functions
	Bluetooth: protect le accept and resolv lists with hdev->lock
	Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event
	io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
	irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling
	irqchip/gic-v3: Refactor ISB + EOIR at ack time
	rxrpc: Fix locking issue
	dt-bindings: soc: qcom: smd-rpm: Add compatible for MSM8953 SoC
	dt-bindings: soc: qcom: smd-rpm: Fix missing MSM8936 compatible
	module: change to print useful messages from elf_validity_check()
	module: fix [e_shstrndx].sh_size=0 OOB access
	iommu/vt-d: Fix PCI bus rescan device hot add
	fbdev: fbmem: Fix logo center image dx issue
	fbmem: Check virtual screen sizes in fb_set_var()
	fbcon: Disallow setting font bigger than screen size
	fbcon: Prevent that screen size is smaller than font size
	PM: runtime: Redefine pm_runtime_release_supplier()
	memregion: Fix memregion_free() fallback definition
	video: of_display_timing.h: include errno.h
	powerpc/powernv: delay rng platform device creation until later in boot
	net: dsa: qca8k: reset cpu port on MTU change
	can: kvaser_usb: replace run-time checks with struct kvaser_usb_driver_info
	can: kvaser_usb: kvaser_usb_leaf: fix CAN clock frequency regression
	can: kvaser_usb: kvaser_usb_leaf: fix bittiming limits
	xfs: remove incorrect ASSERT in xfs_rename
	Revert "serial: sc16is7xx: Clear RS485 bits in the shutdown"
	btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
	virtio-blk: modify the value type of num in virtio_queue_rq()
	btrfs: fix use of uninitialized variable at rm device ioctl
	tty: n_gsm: fix encoding of command/response bit
	ARM: meson: Fix refcount leak in meson_smp_prepare_cpus
	pinctrl: sunxi: a83t: Fix NAND function name for some pins
	ASoC: rt711: Add endianness flag in snd_soc_component_driver
	ASoC: rt711-sdca: Add endianness flag in snd_soc_component_driver
	ASoC: codecs: rt700/rt711/rt711-sdca: resume bus/codec in .set_jack_detect
	arm64: dts: qcom: msm8994: Fix CPU6/7 reg values
	arm64: dts: qcom: sdm845: use dispcc AHB clock for mdss node
	ARM: mxs_defconfig: Enable the framebuffer
	arm64: dts: imx8mp-evk: correct mmc pad settings
	arm64: dts: imx8mp-evk: correct the uart2 pinctl value
	arm64: dts: imx8mp-evk: correct gpio-led pad settings
	arm64: dts: imx8mp-evk: correct vbus pad settings
	arm64: dts: imx8mp-evk: correct eqos pad settings
	arm64: dts: imx8mp-evk: correct I2C1 pad settings
	arm64: dts: imx8mp-evk: correct I2C3 pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct uart pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct eqos pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct i2c2 & mmc settings
	pinctrl: sunxi: sunxi_pconf_set: use correct offset
	arm64: dts: qcom: msm8992-*: Fix vdd_lvs1_2-supply typo
	ARM: at91: pm: use proper compatible for sama5d2's rtc
	ARM: at91: pm: use proper compatibles for sam9x60's rtc and rtt
	ARM: at91: pm: use proper compatibles for sama7g5's rtc and rtt
	ARM: dts: at91: sam9x60ek: fix eeprom compatible and size
	ARM: dts: at91: sama5d2_icp: fix eeprom compatibles
	ARM: at91: fix soc detection for SAM9X60 SiPs
	xsk: Clear page contiguity bit when unmapping pool
	i2c: piix4: Fix a memory leak in the EFCH MMIO support
	i40e: Fix dropped jumbo frames statistics
	i40e: Fix VF's MAC Address change on VM
	ARM: dts: stm32: use usbphyc ck_usbo_48m as USBH OHCI clock on stm32mp151
	ARM: dts: stm32: add missing usbh clock and fix clk order on stm32mp15
	ibmvnic: Properly dispose of all skbs during a failover.
	selftests: forwarding: fix flood_unicast_test when h2 supports IFF_UNICAST_FLT
	selftests: forwarding: fix learning_test when h1 supports IFF_UNICAST_FLT
	selftests: forwarding: fix error message in learning_test
	r8169: fix accessing unset transport header
	i2c: cadence: Unregister the clk notifier in error path
	dmaengine: imx-sdma: Allow imx8m for imx7 FW revs
	misc: rtsx_usb: fix use of dma mapped buffer for usb bulk transfer
	misc: rtsx_usb: use separate command and response buffers
	misc: rtsx_usb: set return value in rsp_buf alloc err path
	Revert "mm/memory-failure.c: fix race with changing page compound again"
	Revert "serial: 8250_mtk: Make sure to select the right FEATURE_SEL"
	dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo
	ida: don't use BUG_ON() for debugging
	dmaengine: pl330: Fix lockdep warning about non-static key
	dmaengine: lgm: Fix an error handling path in intel_ldma_probe()
	dmaengine: at_xdma: handle errors of at_xdmac_alloc_desc() correctly
	dmaengine: ti: Fix refcount leak in ti_dra7_xbar_route_allocate
	dmaengine: qcom: bam_dma: fix runtime PM underflow
	dmaengine: ti: Add missing put_device in ti_dra7_xbar_route_allocate
	dmaengine: idxd: force wq context cleanup on device disable path
	selftests/net: fix section name when using xdp_dummy.o
	Linux 5.15.54

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3ca4c0aa09a3bea6969c7a127d833034a123f437
2022-07-13 19:41:43 +02:00
Naoya Horiguchi
2fa22e7906 Revert "mm/memory-failure.c: fix race with changing page compound again"
commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836 upstream.

Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
page compound again") because now we fetch the page refcount under
hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
longer necessary.

Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-12 16:35:17 +02:00
Naoya Horiguchi
62d1655b92 mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
[ Upstream commit 405ce051236cc65b30bbfe490b28ce60ae6aed85 ]

There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.

Think about the below race window:

  CPU 1                                   CPU 2
  memory_failure_hugetlb
  struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

  get_hwpoison_page -- page is not what we want now...

The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.

A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.

Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-12 16:35:06 +02:00
Miaohe Lin
5429eb5502 mm/memory-failure.c: fix race with changing page compound again
[ Upstream commit 888af2701db79b9b27c7e37f9ede528a5ca53b76 ]

Patch series "A few fixup patches for memory failure", v2.

This series contains a few patches to fix the race with changing page
compound page, make non-LRU movable pages unhandlable and so on.  More
details can be found in the respective changelogs.

There is a race window where we got the compound_head, the hugetlb page
could be freed to buddy, or even changed to another compound page just
before we try to get hwpoison page.  Think about the below race window:

  CPU 1					  CPU 2
  memory_failure_hugetlb
  struct page *head = compound_head(p);
					  hugetlb page might be freed to
					  buddy, or even changed to another
					  compound page.

  get_hwpoison_page -- page is not what we want now...

If this race happens, just bail out.  Also MF_MSG_DIFFERENT_PAGE_SIZE is
introduced to record this event.

[akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]

Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-12 16:35:05 +02:00
Greg Kroah-Hartman
28f0c67d40 Merge 5.15.44 into android14-5.15
Changes in 5.15.44
	HID: amd_sfh: Add support for sensor discovery
	KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
	ice: fix crash at allocation failure
	ACPI: sysfs: Fix BERT error region memory mapping
	MAINTAINERS: co-maintain random.c
	MAINTAINERS: add git tree for random.c
	lib/crypto: blake2s: include as built-in
	lib/crypto: blake2s: move hmac construction into wireguard
	lib/crypto: sha1: re-roll loops to reduce code size
	lib/crypto: blake2s: avoid indirect calls to compression function for Clang CFI
	random: document add_hwgenerator_randomness() with other input functions
	random: remove unused irq_flags argument from add_interrupt_randomness()
	random: use BLAKE2s instead of SHA1 in extraction
	random: do not sign extend bytes for rotation when mixing
	random: do not re-init if crng_reseed completes before primary init
	random: mix bootloader randomness into pool
	random: harmonize "crng init done" messages
	random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
	random: early initialization of ChaCha constants
	random: avoid superfluous call to RDRAND in CRNG extraction
	random: don't reset crng_init_cnt on urandom_read()
	random: fix typo in comments
	random: cleanup poolinfo abstraction
	random: cleanup integer types
	random: remove incomplete last_data logic
	random: remove unused extract_entropy() reserved argument
	random: rather than entropy_store abstraction, use global
	random: remove unused OUTPUT_POOL constants
	random: de-duplicate INPUT_POOL constants
	random: prepend remaining pool constants with POOL_
	random: cleanup fractional entropy shift constants
	random: access input_pool_data directly rather than through pointer
	random: selectively clang-format where it makes sense
	random: simplify arithmetic function flow in account()
	random: continually use hwgenerator randomness
	random: access primary_pool directly rather than through pointer
	random: only call crng_finalize_init() for primary_crng
	random: use computational hash for entropy extraction
	random: simplify entropy debiting
	random: use linear min-entropy accumulation crediting
	random: always wake up entropy writers after extraction
	random: make credit_entropy_bits() always safe
	random: remove use_input_pool parameter from crng_reseed()
	random: remove batched entropy locking
	random: fix locking in crng_fast_load()
	random: use RDSEED instead of RDRAND in entropy extraction
	random: get rid of secondary crngs
	random: inline leaves of rand_initialize()
	random: ensure early RDSEED goes through mixer on init
	random: do not xor RDRAND when writing into /dev/random
	random: absorb fast pool into input pool after fast load
	random: use simpler fast key erasure flow on per-cpu keys
	random: use hash function for crng_slow_load()
	random: make more consistent use of integer types
	random: remove outdated INT_MAX >> 6 check in urandom_read()
	random: zero buffer after reading entropy from userspace
	random: fix locking for crng_init in crng_reseed()
	random: tie batched entropy generation to base_crng generation
	random: remove ifdef'd out interrupt bench
	random: remove unused tracepoints
	random: add proper SPDX header
	random: deobfuscate irq u32/u64 contributions
	random: introduce drain_entropy() helper to declutter crng_reseed()
	random: remove useless header comment
	random: remove whitespace and reorder includes
	random: group initialization wait functions
	random: group crng functions
	random: group entropy extraction functions
	random: group entropy collection functions
	random: group userspace read/write functions
	random: group sysctl functions
	random: rewrite header introductory comment
	random: defer fast pool mixing to worker
	random: do not take pool spinlock at boot
	random: unify early init crng load accounting
	random: check for crng_init == 0 in add_device_randomness()
	random: pull add_hwgenerator_randomness() declaration into random.h
	random: clear fast pool, crng, and batches in cpuhp bring up
	random: round-robin registers as ulong, not u32
	random: only wake up writers after zap if threshold was passed
	random: cleanup UUID handling
	random: unify cycles_t and jiffies usage and types
	random: do crng pre-init loading in worker rather than irq
	random: give sysctl_random_min_urandom_seed a more sensible value
	random: don't let 644 read-only sysctls be written to
	random: replace custom notifier chain with standard one
	random: use SipHash as interrupt entropy accumulator
	random: make consistent usage of crng_ready()
	random: reseed more often immediately after booting
	random: check for signal and try earlier when generating entropy
	random: skip fast_init if hwrng provides large chunk of entropy
	random: treat bootloader trust toggle the same way as cpu trust toggle
	random: re-add removed comment about get_random_{u32,u64} reseeding
	random: mix build-time latent entropy into pool at init
	random: do not split fast init input in add_hwgenerator_randomness()
	random: do not allow user to keep crng key around on stack
	random: check for signal_pending() outside of need_resched() check
	random: check for signals every PAGE_SIZE chunk of /dev/[u]random
	random: allow partial reads if later user copies fail
	random: make random_get_entropy() return an unsigned long
	random: document crng_fast_key_erasure() destination possibility
	random: fix sysctl documentation nits
	init: call time_init() before rand_initialize()
	ia64: define get_cycles macro for arch-override
	s390: define get_cycles macro for arch-override
	parisc: define get_cycles macro for arch-override
	alpha: define get_cycles macro for arch-override
	powerpc: define get_cycles macro for arch-override
	timekeeping: Add raw clock fallback for random_get_entropy()
	m68k: use fallback for random_get_entropy() instead of zero
	riscv: use fallback for random_get_entropy() instead of zero
	mips: use fallback for random_get_entropy() instead of just c0 random
	arm: use fallback for random_get_entropy() instead of zero
	nios2: use fallback for random_get_entropy() instead of zero
	x86/tsc: Use fallback for random_get_entropy() instead of zero
	um: use fallback for random_get_entropy() instead of zero
	sparc: use fallback for random_get_entropy() instead of zero
	xtensa: use fallback for random_get_entropy() instead of zero
	random: insist on random_get_entropy() existing in order to simplify
	random: do not use batches when !crng_ready()
	random: use first 128 bits of input as fast init
	random: do not pretend to handle premature next security model
	random: order timer entropy functions below interrupt functions
	random: do not use input pool from hard IRQs
	random: help compiler out with fast_mix() by using simpler arguments
	siphash: use one source of truth for siphash permutations
	random: use symbolic constants for crng_init states
	random: avoid initializing twice in credit race
	random: move initialization out of reseeding hot path
	random: remove ratelimiting for in-kernel unseeded randomness
	random: use proper jiffies comparison macro
	random: handle latent entropy and command line from random_init()
	random: credit architectural init the exact amount
	random: use static branch for crng_ready()
	random: remove extern from functions in header
	random: use proper return types on get_random_{int,long}_wait()
	random: make consistent use of buf and len
	random: move initialization functions out of hot pages
	random: move randomize_page() into mm where it belongs
	random: unify batched entropy implementations
	random: convert to using fops->read_iter()
	random: convert to using fops->write_iter()
	random: wire up fops->splice_{read,write}_iter()
	random: check for signals after page of pool writes
	ALSA: ctxfi: Add SB046x PCI ID
	Linux 5.15.44

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2d874cba14f13379fc5f874e72634c9179b28742
2022-06-14 11:49:05 +02:00
Greg Kroah-Hartman
a0d26c51d7 Merge 5.15.37 into android14-5.15
Changes in 5.15.37
	floppy: disable FDRAWCMD by default
	bpf: Introduce composable reg, ret and arg types.
	bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL
	bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL
	bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL
	bpf: Introduce MEM_RDONLY flag
	bpf: Convert PTR_TO_MEM_OR_NULL to composable types.
	bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.
	bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
	bpf/selftests: Test PTR_TO_RDONLY_MEM
	bpf: Fix crash due to out of bounds access into reg2btf_ids.
	spi: cadence-quadspi: fix write completion support
	ARM: dts: socfpga: change qspi to "intel,socfpga-qspi"
	mm: kfence: fix objcgs vector allocation
	gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
	iov_iter: Introduce fault_in_iov_iter_writeable
	gfs2: Add wrapper for iomap_file_buffered_write
	gfs2: Clean up function may_grant
	gfs2: Introduce flag for glock holder auto-demotion
	gfs2: Move the inode glock locking to gfs2_file_buffered_write
	gfs2: Eliminate ip->i_gh
	gfs2: Fix mmap + page fault deadlocks for buffered I/O
	iomap: Fix iomap_dio_rw return value for user copies
	iomap: Support partial direct I/O on user copy failures
	iomap: Add done_before argument to iomap_dio_rw
	gup: Introduce FOLL_NOFAULT flag to disable page faults
	iov_iter: Introduce nofault flag to disable page faults
	gfs2: Fix mmap + page fault deadlocks for direct I/O
	btrfs: fix deadlock due to page faults during direct IO reads and writes
	btrfs: fallback to blocking mode when doing async dio over multiple extents
	mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
	selftests/bpf: Add test for reg2btf_ids out of bounds access
	Linux 5.15.37

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I785543e252f972c5a86f313e4b6721e2ff0797e6
2022-06-06 15:02:31 +02:00
Jason A. Donenfeld
64cb7f01dd random: move randomize_page() into mm where it belongs
commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream.

randomize_page is an mm function. It is documented like one. It contains
the history of one. It has the naming convention of one. It looks
just like another very similar function in mm, randomize_stack_top().
And it has always been maintained and updated by mm people. There is no
need for it to be in random.c. In the "which shape does not look like
the other ones" test, pointing to randomize_page() is correct.

So move randomize_page() into mm/util.c, right next to the similar
randomize_stack_top() function.

This commit contains no actual code changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30 09:29:17 +02:00
Andreas Gruenbacher
6e213bc614 gup: Introduce FOLL_NOFAULT flag to disable page faults
commit 55b8fe703bc51200d4698596c90813453b35ae63 upstream

Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
-EFAULT when it would otherwise trigger a page fault.  This is roughly
similar to FOLL_FAST_ONLY but available on all architectures, and less
fragile.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01 17:22:32 +02:00
Yu Zhao
f88ed5a3d3 FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
2022-04-20 17:38:55 +00:00
Charan Teja Reddy
eabd925e61 ANDROID: mm: shmem: use reclaim_pages() to recalim pages from a list
Static code analysis tool reported NULL pointer access in
shrink_page_list() as the commit 26aa2d199d ("mm/migrate: demote
pages during reclaim") expects valid pgdat.

There is already an existing api, reclaim_pages, that tries to reclaim
pages from the list. use it instead of creating custom function.

Bug: 201263305
Fixes: 96f80f6284 ("ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims")
Change-Id: Iaa11feac94c9e8338324ace0276c49d6a0adeb0c
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
2022-04-18 20:31:40 +00:00
Suren Baghdasaryan
0fd37220d8 UPSTREAM: mm: refactor vm_area_struct::anon_vma_name usage code
Avoid mixing strings and their anon_vma_name referenced pointers by
using struct anon_vma_name whenever possible.  This simplifies the code
and allows easier sharing of anon_vma_name structures when they
represent the same name.

[surenb@google.com: fix comment]

Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Colin Cross <ccross@google.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 5c26f6ac9416b63d093e29c30e79b3297e425472)

Bug: 218352794
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I4a6b5602ce7151d1a4b88fac489f86d68089bd4d
2022-03-24 18:44:39 -07:00
Charan Teja Reddy
96f80f6284 ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims
Add the functionality that allow users of shmem to reclaim its pages
without going through the kswapd/direct reclaim path. An example usecase
is: Say that device allocates a larger amount of shmem pages and shares
it with hardware. To faster reclaims such pages, drivers can register
the shrinkers and call reclaim_shmem_address_space().

The implementation of this function is mostly borrowed from
reclaim_address_space() implemented for per process reclaim[1].

[1] https://lore.kernel.org/patchwork/cover/378056/

Bug: 201263305
Change-Id: I03d2c3b9610612af977f89ddeabb63b8e9e50918
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
2022-03-23 11:32:21 -07:00
Michel Lespinasse
7d6787088d BACKPORT: FROMLIST: mm: enable speculative fault handling for supported file types.
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.

From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().

At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
  (would require ->page_mkwrite() changes).

vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/

Conflicts:
    include/linux/vm_event_item.h
    mm/vmstat.c

1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/
since the patch posted upstream at the time had a different structure
with stats for anonymouse and file-backed pagefaults introduced in a
separate patch.

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a
2022-03-23 11:32:19 -07:00
Michel Lespinasse
a2138fee6c FROMLIST: fs: list file types that support speculative faults.
Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.

Initially this is set for files that implement fault() with filemap_fault().

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69
2022-03-23 11:32:19 -07:00
Michel Lespinasse
6e6766ab76 BACKPORT: FROMLIST: mm: add pte_map_lock() and pte_spinlock()
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/

Conflicts:
    include/linux/mm.h

1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
2022-03-23 11:32:15 -07:00
Michel Lespinasse
6ab660d7cb FROMLIST: mm: implement speculative handling in __handle_mm_fault().
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wiring new pages tables.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
2022-03-23 11:32:15 -07:00
Michel Lespinasse
0823d516af FROMLIST: mm: separate mmap locked assertion from find_vma
This adds a new __find_vma() function, which implements find_vma minus
the mmap_assert_locked() assertion.

find_vma() is then implemented as an inline wrapper around __find_vma().

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-13-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia999b8cb8f5eed93040ab4b3caaf90d739da908d
2022-03-23 11:32:14 -07:00
Michel Lespinasse
4e2e391ff7 BACKPORT: FROMLIST: mm: add do_handle_mm_fault()
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.

In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).

The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    mm/memory.c

1. Trivial merge conflict due to folios.

Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
2022-03-23 11:32:14 -07:00
Michel Lespinasse
f2fa9aae2e BACKPORT: FROMLIST: mm: add FAULT_FLAG_SPECULATIVE flag
Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are
attempting speculative fault handling (without holding the mmap lock).

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    include/linux/mm_types.h

1. Merge conflict due to enum fault_flag being defined in mm.h instead of
mm_types.h

Link: https://lore.kernel.org/all/20220128131006.67712-9-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I48ab427dfa4d7bdbe9932588bec7ae99e9e80ae9
2022-03-23 11:32:14 -07:00
Greg Kroah-Hartman
8222792e8e Merge 5.15.19 into android13-5.15
Changes in 5.15.19
	can: m_can: m_can_fifo_{read,write}: don't read or write from/to FIFO if length is 0
	net: sfp: ignore disabled SFP node
	net: stmmac: configure PTP clock source prior to PTP initialization
	net: stmmac: skip only stmmac_ptp_register when resume from suspend
	ARM: 9179/1: uaccess: avoid alignment faults in copy_[from|to]_kernel_nofault
	ARM: 9180/1: Thumb2: align ALT_UP() sections in modules sufficiently
	KVM: arm64: Use shadow SPSR_EL1 when injecting exceptions on !VHE
	s390/module: fix loading modules with a lot of relocations
	s390/hypfs: include z/VM guests with access control group set
	s390/nmi: handle guarded storage validity failures for KVM guests
	s390/nmi: handle vector validity failures for KVM guests
	bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()
	powerpc32/bpf: Fix codegen for bpf-to-bpf calls
	powerpc/bpf: Update ldimm64 instructions during extra pass
	ucount: Make get_ucount a safe get_user replacement
	scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
	udf: Restore i_lenAlloc when inode expansion fails
	udf: Fix NULL ptr deref when converting from inline format
	efi: runtime: avoid EFIv2 runtime services on Apple x86 machines
	PM: wakeup: simplify the output logic of pm_show_wakelocks()
	tracing/histogram: Fix a potential memory leak for kstrdup()
	tracing: Don't inc err_log entry count if entry allocation fails
	ceph: properly put ceph_string reference after async create attempt
	ceph: set pool_ns in new inode layout for async creates
	fsnotify: fix fsnotify hooks in pseudo filesystems
	Revert "KVM: SVM: avoid infinite loop on NPF from bad address"
	psi: Fix uaf issue when psi trigger is destroyed while being polled
	powerpc/audit: Fix syscall_get_arch()
	perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX
	perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake
	drm/etnaviv: relax submit size limits
	drm/atomic: Add the crtc to affected crtc only if uapi.enable = true
	drm/amd/display: Fix FP start/end for dcn30_internal_validate_bw.
	KVM: LAPIC: Also cancel preemption timer during SET_LAPIC
	KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests
	KVM: SVM: Don't intercept #GP for SEV guests
	KVM: x86: nSVM: skip eax alignment check for non-SVM instructions
	KVM: x86: Forcibly leave nested virt when SMM state is toggled
	KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
	KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
	KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
	KVM: PPC: Book3S HV Nested: Fix nested HFSCR being clobbered with multiple vCPUs
	dm: revert partial fix for redundant bio-based IO accounting
	block: add bio_start_io_acct_time() to control start_time
	dm: properly fix redundant bio-based IO accounting
	serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl
	serial: 8250: of: Fix mapped region size when using reg-offset property
	serial: stm32: fix software flow control transfer
	tty: n_gsm: fix SW flow control encoding/handling
	tty: Partially revert the removal of the Cyclades public API
	tty: Add support for Brainboxes UC cards.
	kbuild: remove include/linux/cyclades.h from header file check
	usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
	usb: xhci-plat: fix crash when suspend if remote wake enable
	usb: common: ulpi: Fix crash in ulpi_match()
	usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
	usb: cdnsp: Fix segmentation fault in cdns_lost_power function
	usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode
	usb: dwc3: xilinx: Fix error handling when getting USB3 PHY
	USB: core: Fix hang in usb_kill_urb by adding memory barriers
	usb: typec: tcpci: don't touch CC line if it's Vconn source
	usb: typec: tcpm: Do not disconnect while receiving VBUS off
	usb: typec: tcpm: Do not disconnect when receiving VSAFE0V
	ucsi_ccg: Check DEV_INT bit only when starting CCG4
	mm, kasan: use compare-exchange operation to set KASAN page tag
	jbd2: export jbd2_journal_[grab|put]_journal_head
	ocfs2: fix a deadlock when commit trans
	sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
	PCI/sysfs: Find shadow ROM before static attribute initialization
	x86/MCE/AMD: Allow thresholding interface updates after init
	x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
	powerpc/32s: Allocate one 256k IBAT instead of two consecutives 128k IBATs
	powerpc/32s: Fix kasan_init_region() for KASAN
	powerpc/32: Fix boot failure with GCC latent entropy plugin
	i40e: Increase delay to 1 s after global EMP reset
	i40e: Fix issue when maximum queues is exceeded
	i40e: Fix queues reservation for XDP
	i40e: Fix for failed to init adminq while VF reset
	i40e: fix unsigned stat widths
	usb: roles: fix include/linux/usb/role.h compile issue
	rpmsg: char: Fix race between the release of rpmsg_ctrldev and cdev
	rpmsg: char: Fix race between the release of rpmsg_eptdev and cdev
	scsi: elx: efct: Don't use GFP_KERNEL under spin lock
	scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
	ipv6_tunnel: Rate limit warning messages
	ARM: 9170/1: fix panic when kasan and kprobe are enabled
	net: fix information leakage in /proc/net/ptype
	hwmon: (lm90) Mark alert as broken for MAX6646/6647/6649
	hwmon: (lm90) Mark alert as broken for MAX6680
	ping: fix the sk_bound_dev_if match in ping_lookup
	ipv4: avoid using shared IP generator for connected sockets
	hwmon: (lm90) Reduce maximum conversion rate for G781
	NFSv4: Handle case where the lookup of a directory fails
	NFSv4: nfs_atomic_open() can race when looking up a non-regular file
	net-procfs: show net devices bound packet types
	drm/msm: Fix wrong size calculation
	drm/msm/dsi: Fix missing put_device() call in dsi_get_phy
	drm/msm/dsi: invalid parameter check in msm_dsi_phy_enable
	ipv6: annotate accesses to fn->fn_sernum
	NFS: Ensure the server has an up to date ctime before hardlinking
	NFS: Ensure the server has an up to date ctime before renaming
	KVM: arm64: pkvm: Use the mm_ops indirection for cache maintenance
	SUNRPC: Use BIT() macro in rpc_show_xprt_state()
	SUNRPC: Don't dereference xprt->snd_task if it's a cookie
	powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
	netfilter: conntrack: don't increment invalid counter on NF_REPEAT
	powerpc/64s: Mask SRR0 before checking against the masked NIP
	perf: Fix perf_event_read_local() time
	sched/pelt: Relax the sync of util_sum with util_avg
	net: phy: broadcom: hook up soft_reset for BCM54616S
	net: stmmac: dwmac-visconti: Fix bit definitions for ETHER_CLK_SEL
	net: stmmac: dwmac-visconti: Fix clock configuration for RMII mode
	phylib: fix potential use-after-free
	octeontx2-af: Do not fixup all VF action entries
	octeontx2-af: Fix LBK backpressure id count
	octeontx2-af: Retry until RVU block reset complete
	octeontx2-pf: cn10k: Ensure valid pointers are freed to aura
	octeontx2-af: verify CQ context updates
	octeontx2-af: Increase link credit restore polling timeout
	octeontx2-af: cn10k: Do not enable RPM loopback for LPC interfaces
	octeontx2-pf: Forward error codes to VF
	rxrpc: Adjust retransmission backoff
	efi/libstub: arm64: Fix image check alignment at entry
	io_uring: fix bug in slow unregistering of nodes
	Drivers: hv: balloon: account for vmbus packet header in max_pkt_size
	hwmon: (lm90) Re-enable interrupts after alert clears
	hwmon: (lm90) Mark alert as broken for MAX6654
	hwmon: (lm90) Fix sysfs and udev notifications
	hwmon: (adt7470) Prevent divide by zero in adt7470_fan_write()
	powerpc/perf: Fix power_pmu_disable to call clear_pmi_irq_pending only if PMI is pending
	ipv4: fix ip option filtering for locally generated fragments
	ibmvnic: Allow extra failures before disabling
	ibmvnic: init ->running_cap_crqs early
	ibmvnic: don't spin in tasklet
	net/smc: Transitional solution for clcsock race issue
	video: hyperv_fb: Fix validation of screen resolution
	can: tcan4x5x: regmap: fix max register value
	drm/msm/hdmi: Fix missing put_device() call in msm_hdmi_get_phy
	drm/msm/dpu: invalid parameter check in dpu_setup_dspp_pcc
	drm/msm/a6xx: Add missing suspend_count increment
	yam: fix a memory leak in yam_siocdevprivate()
	net: cpsw: Properly initialise struct page_pool_params
	net: hns3: handle empty unknown interrupt for VF
	sch_htb: Fail on unsupported parameters when offload is requested
	Revert "drm/ast: Support 1600x900 with 108MHz PCLK"
	KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
	ceph: put the requests/sessions when it fails to alloc memory
	gve: Fix GFP flags when allocing pages
	Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"
	net: bridge: vlan: fix single net device option dumping
	ipv4: raw: lock the socket in raw_bind()
	ipv4: tcp: send zero IPID in SYNACK messages
	ipv4: remove sparse error in ip_neigh_gw4()
	net: bridge: vlan: fix memory leak in __allowed_ingress
	Bluetooth: refactor malicious adv data check
	irqchip/realtek-rtl: Map control data to virq
	irqchip/realtek-rtl: Fix off-by-one in routing
	dt-bindings: can: tcan4x5x: fix mram-cfg RX FIFO config
	perf/core: Fix cgroup event list management
	psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n
	psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
	usb: dwc3: xilinx: fix uninitialized return value
	usr/include/Makefile: add linux/nfc.h to the compile-test coverage
	fsnotify: invalidate dcache before IN_DELETE event
	block: Fix wrong offset in bio_truncate()
	mtd: rawnand: mpc5121: Remove unused variable in ads5121_select_chip()
	Linux 5.15.19

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I66399d45af362fa8e1672ba38c0d672e21afc716
2022-02-02 09:32:24 +01:00
Peter Collingbourne
4ca8a0bc83 mm, kasan: use compare-exchange operation to set KASAN page tag
commit 27fe73394a1c6d0b07fa4d95f1bca116d1cc66e9 upstream.

It has been reported that the tag setting operation on newly-allocated
pages can cause the page flags to be corrupted when performed
concurrently with other flag updates as a result of the use of
non-atomic operations.

Fix the problem by using a compare-exchange loop to update the tag.

Link: https://lkml.kernel.org/r/20220120020148.1632253-1-pcc@google.com
Link: https://linux-review.googlesource.com/id/I456b24a2b9067d93968d43b4bb3351c0cec63101
Fixes: 2813b9c029 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01 17:27:05 +01:00
Colin Cross
301c56064d UPSTREAM: mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use.  At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.).  Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc.  This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages.  It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes.  It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace.  Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace.  It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas.  The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].

Userspace can set the name for a region of memory by calling

   prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)

Setting the name to NULL clears it.  The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.

Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps.  Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.

The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string.  Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name.  The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.

CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature.  It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.

The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names.  In that design, name
pointers could be shared between vmas.  However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.

One big concern is about fork() performance which would need to strdup
anonymous vma names.  Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4].  I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.

This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name.  Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.

[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

Changes for prctl(2) manual page (in the options section):

PR_SET_VMA
	Sets an attribute specified in arg2 for virtual memory areas
	starting from the address specified in arg3 and spanning the
	size specified	in arg4. arg5 specifies the value of the attribute
	to be set. Note that assigning an attribute to a virtual memory
	area might prevent it from being merged with adjacent virtual
	memory areas due to the difference in that attribute's value.

	Currently, arg2 must be one of:

	PR_SET_VMA_ANON_NAME
		Set a name for anonymous virtual memory areas. arg5 should
		be a pointer to a null-terminated string containing the
		name. The name length including null byte cannot exceed
		80 bytes. If arg5 is NULL, the name of the appropriate
		anonymous virtual memory areas will be reset. The name
		can contain only printable ascii characters (including
                space), except '[',']','\','$' and '`'.

                This feature is available only if the kernel is built with
                the CONFIG_ANON_VMA_NAME option enabled.

[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
  Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
 added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
 work here was done by Colin Cross, therefore, with his permission, keeping
 him as the author]

Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b)

Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I53d56d551a7d62f75341304751814294b447c04e
2022-01-18 15:30:27 -08:00
Suren Baghdasaryan
f355f9635d Revert "ANDROID: mm: add a field to store names for private anonymous memory"
This reverts commit 60500a4228.
Replacing out-of-tree implementation with the upstream one.

Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic34c8e16d51ccf9f00cb59d2de341e911bcb2828
2022-01-18 14:54:47 -08:00
Kalesh Singh
1a77f04aac Revert "ANDROID: mm: Throttle rss_stat tracepoint"
This reverts commit 77dfeaa02d.

Throttling can now be done using hist triggers and synthetic events

Bug: 145972256
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I39c284040e2fdb815cda980f3a40ef188e59287c
2021-11-17 23:30:23 +00:00
Greg Kroah-Hartman
5606699789 Merge 996fe06160 ("Merge tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
Steps on the way to 5.15-rc1

Change-Id: I3806b714a5a783a7132b1daf766ebb71985fc640
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2021-09-14 16:16:54 +02:00
Greg Kroah-Hartman
c5cd945b24 Merge fd47ff55c9 ("Merge tag 'usb-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb") into android-mainline
Steps on the way to 5.15-rc1.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I42ffa8818bbb2072f043923553c4d8f91d9647a5
2021-09-14 14:42:51 +02:00
Greg Kroah-Hartman
c2b303f98f Merge 4e71add028 ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline
Steps on the way to 5.15-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28
2021-09-14 14:35:23 +02:00
Greg Kroah-Hartman
59437fa4ba Merge 90c90cda05 ("Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux") into android-mainline
Steps on the way to 5.15-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id0e9064c101f599601a6d864ce7ac2ed88083562
2021-09-09 14:02:28 +02:00
Linus Torvalds
cd1adf1b63 Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"
This reverts commit 9857a17f20.

That commit was completely broken, and I should have caught on to it
earlier.  But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

 "try_get_page() has fallen a little behind in terms of maintenance,
  try_get_compound_head() handles speculative page references more
  thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()".  It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07 11:03:45 -07:00
Linus Torvalds
49624efa65 Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux
Pull MAP_DENYWRITE removal from David Hildenbrand:
 "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
  VM_DENYWRITE.

  There are some (minor) user-visible changes:

   - We no longer deny write access to shared libaries loaded via legacy
     uselib(); this behavior matches modern user space e.g. dlopen().

   - We no longer deny write access to the elf interpreter after exec
     completed, treating it just like shared libraries (which it often
     is).

   - We always deny write access to the file linked via /proc/pid/exe:
     sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
     file cannot be denied, and write access to the file will remain
     denied until the link is effectivel gone (exec, termination,
     sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.

  Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
  s390x, ...) and verified via ltp that especially the relevant tests
  (i.e., creat07 and execve04) continue working as expected"

* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
  fs: update documentation of get_write_access() and friends
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  mm: remove VM_DENYWRITE
  binfmt: remove in-tree usage of MAP_DENYWRITE
  kernel/fork: always deny write access to current MM exe_file
  kernel/fork: factor out replacing the current MM exe_file
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
2021-09-04 11:35:47 -07:00
Linus Torvalds
14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Yang Shi
d0505e9f7d mm: hwpoison: don't drop slab caches for offlining non-LRU page
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline.  But
if the page is not slab page all the effort is wasted in vain.  Even
though it is a slab page, it is not guaranteed the page could be freed
at all.

However the side effect and cost is quite high.  It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches.  It could make the most
workingset gone in order to just offline a page.  And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.

Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.

Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:

    BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
      in-flight: 409271:memory_failure_work_func
      pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
    workqueue mm_percpu_wq: flags=0x8
     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
      pending: vmstat_update
    workqueue cgroup_destroy: flags=0x0
      pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
        pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
    Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
    CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
    Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
    Workqueue: events memory_failure_work_func
    RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
    Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
    RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
    RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
    RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
    R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
    R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
    FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
    Call Trace:
     __queue_work+0xd6/0x3c0
     queue_work_on+0x1c/0x30
     uncharge_batch+0x10e/0x110
     mem_cgroup_uncharge_list+0x6d/0x80
     release_pages+0x37f/0x3f0
     __pagevec_release+0x1c/0x50
     __invalidate_mapping_pages+0x348/0x380
     inode_lru_isolate+0x10a/0x160
     __list_lru_walk_one+0x7b/0x170
     list_lru_walk_one+0x4a/0x60
     prune_icache_sb+0x37/0x50
     super_cache_scan+0x123/0x1a0
     do_shrink_slab+0x10c/0x2c0
     shrink_slab+0x1f1/0x290
     drop_slab_node+0x4d/0x70
     soft_offline_page+0x1ac/0x5b0
     memory_failure_work_func+0x6a/0x90
     process_one_work+0x19e/0x340
     worker_thread+0x30/0x360
     kthread+0x116/0x130

The lockup made the machine is quite unusable.  And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.

But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:

    soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.

Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: David Mackey <tdmackey@twitter.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:15 -07:00
John Hubbard
3969b1a654 mm: delete unused get_kernel_page()
get_kernel_page() was added in 2012 by [1].  It was used for a while for
NFS, but then in 2014, a refactoring [2] removed all callers, and it has
apparently not been used since.

Remove get_kernel_page() because it has no callers.

[1] commit 18022c5d86 ("mm: add get_kernel_page[s] for pinning of
    kernel addresses for I/O")
[2] commit 91f79c43d1 ("new helper: iov_iter_get_pages_alloc()")

Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:11 -07:00
John Hubbard
9857a17f20 mm/gup: remove try_get_page(), call try_get_compound_head() directly
try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:11 -07:00
John Hubbard
54d516b1d6 mm/gup: small refactoring: simplify try_grab_page()
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1,
...), just with a different API.  So there is a lot of code duplication
there.

Change try_grab_page() to call try_grab_compound_head(), while keeping the
API contract identical for callers.

Also, now that try_grab_compound_head() always has a caller, remove the
__maybe_unused annotation.

Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:11 -07:00
David Hildenbrand
8d0920bde5 mm: remove VM_DENYWRITE
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
David Hildenbrand
fe69d560b5 kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the
executable during exec. During exec, we already deny_write_access() the
executable, however, after exec completes the VMAs mapped
with VM_DENYWRITE effectively keeps write access denied via
deny_write_access().

Let's deny write access when setting or replacing the MM exe_file. With
this change, we can remove VM_DENYWRITE for mapping executables.

Make set_mm_exe_file() return an error in case deny_write_access()
fails; note that this should never happen, because exec code does a
deny_write_access() early and keeps write access denied when calling
set_mm_exe_file. However, it makes the code easier to read and makes
set_mm_exe_file() and replace_mm_exe_file() look more similar.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
David Hildenbrand
35d7bdc860 kernel/fork: factor out replacing the current MM exe_file
Let's factor the main logic out into replace_mm_exe_file(), such that
all mm->exe_file logic is contained in kernel/fork.c.

While at it, perform some simple cleanups that are possible now that
we're simplifying the individual functions.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
2021-09-03 18:42:01 +02:00
Dave Chinner
de2860f463 mm: Add kvrealloc()
During log recovery of an XFS filesystem with 64kB directory
buffers, rebuilding a buffer split across two log records results
in a memory allocation warning from krealloc like this:

xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
.....
RIP: 0010:get_page_from_freelist+0xdee/0xe40
Call Trace:
 ? complete+0x3f/0x50
 __alloc_pages+0x16f/0x300
 alloc_pages+0x87/0x110
 kmalloc_order+0x2c/0x90
 kmalloc_order_trace+0x1d/0x90
 __kmalloc_track_caller+0x215/0x270
 ? xlog_recover_add_to_cont_trans+0x63/0x1f0
 krealloc+0x54/0xb0
 xlog_recover_add_to_cont_trans+0x63/0x1f0
 xlog_recovery_process_trans+0xc1/0xd0
 xlog_recover_process_ophdr+0x86/0x130
 xlog_recover_process_data+0x9f/0x160
 xlog_recover_process+0xa2/0x120
 xlog_do_recovery_pass+0x40b/0x7d0
 ? __irq_work_queue_local+0x4f/0x60
 ? irq_work_queue+0x3a/0x50
 xlog_do_log_recovery+0x70/0x150
 xlog_do_recover+0x38/0x1d0
 xlog_recover+0xd8/0x170
 xfs_log_mount+0x181/0x300
 xfs_mountfs+0x4a1/0x9b0
 xfs_fs_fill_super+0x3c0/0x7b0
 get_tree_bdev+0x171/0x270
 ? suffix_kstrtoint.constprop.0+0xf0/0xf0
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x2f5/0xaf0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Essentially, we are taking a multi-order allocation from kmem_alloc()
(which has an open coded no fail, no warn loop) and then
reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
then triggering the above warning.

This is a regression caused by converting this code from an open
coded no fail/no warn reallocation loop to using __GFP_NOFAIL.

What we actually need here is kvrealloc(), so that if contiguous
page allocation fails we fall back to vmalloc() and we don't
get nasty warnings happening in XFS.

Fixes: 771915c4f6 ("xfs: remove kmem_realloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09 15:57:43 -07:00
Lee Jones
946e465c81 Merge tag 'v5.14-rc2' into android-mainline
Linux 5.14-rc2

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ia2131de59daa96610741f5a0ff267b0d08697023
2021-07-22 14:14:38 +01:00
Lee Jones
a8b636db7d Merge tag 'v5.14-rc1' into android-mainline
Linux 5.14-rc1

Change-Id: I9765cd4581f6683a6fca3580667017fff9cbaa2b
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2021-07-13 10:02:04 +01:00
Matthew Wilcox (Oracle)
79789db03f mm: Make copy_huge_page() always available
Rewrite copy_huge_page() and move it into mm/util.c so it's always
available.  Fixes an exposure of uninitialised memory on configurations
with HUGETLB and UFFD enabled and MIGRATION disabled.

Fixes: 8cc5fcbb5b ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-12 11:30:56 -07:00
Lee Jones
293f275f4d Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1

Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2021-07-12 11:00:18 +01:00
Lee Jones
8e658623d4 Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1

Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2021-07-12 10:02:27 +01:00