kernel_arpi

Author	SHA1	Message	Date
Suren Baghdasaryan	ca96bd7bf1	ANDROID: mm: avoid using vmacache in lockless vma search When searching vma under RCU protection vmacache should be avoided because a race with munmap() might result in finding a vma and placing it into vmacache after munmap() removed that vma and called vmacache_invalidate. Once that vma is freed, vmacache will be left with an invalid vma pointer. Bug: 257443051 Change-Id: I62438305fcf5139974f4f7d3bae5b22c74084a59 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2022-11-23 10:25:27 -08:00
Suren Baghdasaryan	3f311327f9	ANDROID: mm: introduce vma refcounting to protect vma during SPF Current mechanism to stabilize a vma during speculative page fault handling makes a copy of the faulting vma under RCU protection. This makes it hard to protect elements which do not belong to the vma but are used by the page fault handler like vma->vm_file. The problem is that a copy of the vma can't be used to safely protect the file attached to the original vma unless the file is also released after RCU grace period (which is how SPF was designed originally but that caused performance regression and had to be changed). To avoid these complications, introduce vma refcounting to stabilize and operate on the original vma during page fault handling. Page fault handler finds the vma and increases its refcount under RCU protection, vma is freed after RCU grace period, vma->vm_file is released only after refcount indicates no users. This mechanism guarantees that once get_vma returns a vma, both the vma itself and vma->vm_file are stable. Additional benefits of this patch are: we don't need to copy the vma and no additional logic is needed to stabilize vma->vm_file. Bug: 257443051 Change-Id: I59d373926d687fcbd56847a8c3500c43bf1844c8 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2022-11-23 10:25:27 -08:00
Suren Baghdasaryan	c11ef6356b	Revert "ANDROID: add vma->file_ref_count to synchronize vma->vm_file destruction" This reverts commit `a3fe25d923`. File refcounting implemented in this patch is broken and needs to be redone. The change in include/linux/mm_types.h which adds file_ref_count into vm_area_struct is left untouched to keep ABI intact. Bug: 258731892 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I37984eb2f0981a989f74bcaaa6be42040a2f241e	2022-11-23 10:25:26 -08:00
Greg Kroah-Hartman	ae3c0ab383	ANDROID: GKI: mm.h: add Android ABI padding to a structure Try to mitigate potential future driver core api changes by adding a padding to struct vm_operations_struct. Based on a change made to the RHEL/CENTOS 8 kernel. Bug: 151154716 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I78f84148ef4d3524bd6c5b78e53e06503a4ac3ae	2022-07-19 12:47:34 +00:00
Suren Baghdasaryan	4daa3c254e	ANDROID: add vma->file_ref_count to synchronize vma->vm_file destruction In order to prevent destruction of vma->vm_file while it's being used during speculative page fault handling, introduce an atomic refcounter. Bug: 234527424 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I0e971156f3e76feb45136bac1582a7eaab8c75df	2022-07-19 12:47:27 +00:00
Liangcai Fan	bf46e6f5db	BACKPORT: mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged When initializing transparent huge pages, min_free_kbytes would be calculated according to what khugepaged expected. So when transparent huge pages get disabled, min_free_kbytes should be recalculated instead of the higher value set by khugepaged. Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com> Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit bd3400ea173fb611cdf2030d03620185ff6c0b0e) Bug: 235523176 Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com> Change-Id: I815893d25186847933db2a0872528fb15a00b3c8	2022-07-19 03:54:51 +00:00
Greg Kroah-Hartman	06d58f3cef	Merge 5.15.54 into android14-5.15 Changes in 5.15.54 mm/slub: add missing TID updates on slab deactivation mm/filemap: fix UAF in find_lock_entries Revert "selftests/bpf: Add test for bpf_timer overwriting crash" ALSA: usb-audio: Workarounds for Behringer UMC 204/404 HD ALSA: hda/realtek: Add quirk for Clevo L140PU ALSA: cs46xx: Fix missing snd_card_free() call at probe error can: bcm: use call_rcu() instead of costly synchronize_rcu() can: grcan: grcan_probe(): remove extra of_node_get() can: gs_usb: gs_usb_open/close(): fix memory leak can: m_can: m_can_chip_config(): actually enable internal timestamping can: m_can: m_can_{read_fifo,echo_tx_event}(): shift timestamp to full 32 bits can: mcp251xfd: mcp251xfd_regmap_crc_read(): improve workaround handling for mcp2517fd can: mcp251xfd: mcp251xfd_regmap_crc_read(): update workaround broken CRC on TBC register bpf: Fix incorrect verifier simulation around jmp32's jeq/jne bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals usbnet: fix memory leak in error case net: rose: fix UAF bug caused by rose_t0timer_expiry netfilter: nft_set_pipapo: release elements in clone from abort path netfilter: nf_tables: stricter validation of element data btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref btrfs: fix invalid delayed ref after subvolume creation failure btrfs: fix warning when freeing leaf after subvolume creation failure Input: cpcap-pwrbutton - handle errors from platform_get_irq() Input: goodix - change goodix_i2c_write() len parameter type to int Input: goodix - add a goodix.h header file Input: goodix - refactor reset handling Input: goodix - try not to touch the reset-pin on x86/ACPI devices dma-buf/poll: Get a file reference for outstanding fence callbacks btrfs: fix deadlock between chunk allocation and chunk btree modifications drm/i915: Disable bonding on gen12+ platforms drm/i915/gt: 
Register the migrate contexts with their engines drm/i915: Replace the unconditional clflush with drm_clflush_virt_range() PCI/portdrv: Rename pm_iter() to pcie_port_device_iter() PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset media: ir_toy: prevent device from hanging during transmit memory: renesas-rpc-if: Avoid unaligned bus access for HyperFlash ath11k: add hw_param for wakeup_mhi qed: Improve the stack space of filter_config() platform/x86: wmi: introduce helper to convert driver to WMI driver platform/x86: wmi: Replace read_takes_no_args with a flags field platform/x86: wmi: Fix driver->notify() vs ->probe() race mt76: mt7921: get rid of mt7921_mac_set_beacon_filter mt76: mt7921: introduce mt7921_mcu_set_beacon_filter utility routine mt76: mt7921: fix a possible race enabling/disabling runtime-pm bpf: Stop caching subprog index in the bpf_pseudo_func insn bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC riscv: defconfig: enable DRM_NOUVEAU RISC-V: defconfigs: Set CONFIG_FB=y, for FB console net/mlx5e: Check action fwd/drop flag exists also for nic flows net/mlx5e: Split actions_match_supported() into a sub function net/mlx5e: TC, Reject rules with drop and modify hdr action net/mlx5e: TC, Reject rules with forward and drop actions ASoC: rt5682: Avoid the unexpected IRQ event during going to suspend ASoC: rt5682: Re-detect the combo jack after resuming ASoC: rt5682: Fix deadlock on resume netfilter: nf_tables: convert pktinfo->tprot_set to flags field netfilter: nft_payload: support for inner header matching / mangling netfilter: nft_payload: don't allow th access for fragments s390/boot: allocate amode31 section in decompressor s390/setup: use physical pointers for memblock_reserve() s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE ibmvnic: init init_done_rc earlier ibmvnic: clear fop when retrying probe ibmvnic: Allow queueing resets during probe virtio-blk: avoid preallocating big SGL for data io_uring: ensure that 
fsnotify is always called block: use bdev_get_queue() in bio.c block: only mark bio as tracked if it really is tracked block: fix rq-qos breakage from skipping rq_qos_done_bio() stddef: Introduce struct_group() helper macro media: omap3isp: Use struct_group() for memcpy() region media: davinci: vpif: fix use-after-free on driver unbind mt76: mt76_connac: fix MCU_CE_CMD_SET_ROC definition error mt76: mt7921: do not always disable fw runtime-pm cxl/port: Hold port reference until decoder release clk: renesas: r9a07g044: Update multiplier and divider values for PLL2/3 KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook scsi: qla2xxx: Move heartbeat handling from DPC thread to workqueue scsi: qla2xxx: Fix laggy FC remote port session recovery scsi: qla2xxx: edif: Replace list_for_each_safe with list_for_each_entry_safe scsi: qla2xxx: Fix crash during module load unload test gfs2: Fix gfs2_file_buffered_write endless loop workaround vdpa/mlx5: Avoid processing works if workqueue was destroyed btrfs: handle device lookup with btrfs_dev_lookup_args btrfs: add a btrfs_get_dev_args_from_path helper btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls btrfs: remove device item and update super block in the same transaction drbd: add error handling support for add_disk() drbd: Fix double free problem in drbd_create_device drbd: fix an invalid memory access caused by incorrect use of list iterator drm/amd/display: Set min dcfclk if pipe count is 0 drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id) NFSD: COMMIT operations must not return NFS?ERR_INVAL riscv/mm: Add XIP_FIXUP for riscv_pfn_base iio: accel: mma8452: use the correct logic to get mma8452_data batman-adv: Use netif_rx(). 
mtd: spi-nor: Skip erase logic when SPI_NOR_NO_ERASE is set Compiler Attributes: add __alloc_size() for better bounds checking mm: vmalloc: introduce array allocation functions KVM: use __vcalloc for very large allocations btrfs: don't access possibly stale fs_info data in device_list_add KVM: s390x: fix SCK locking scsi: qla2xxx: Fix loss of NVMe namespaces after driver reload test powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs powerpc: flexible GPR range save/restore macros powerpc/tm: Fix more userspace r13 corruption serial: sc16is7xx: Clear RS485 bits in the shutdown bus: mhi: core: Use correctly sized arguments for bit field bus: mhi: Fix pm_state conversion to string stddef: Introduce DECLARE_FLEX_ARRAY() helper uapi/linux/stddef.h: Add include guards ASoC: rt5682: move clk related code to rt5682_i2c_probe ASoC: rt5682: fix an incorrect NULL check on list iterator drm/amd/vcn: fix an error msg on vcn 3.0 KVM: Don't create VM debugfs files outside of the VM directory tty: n_gsm: Modify CR,PF bit when config requester tty: n_gsm: Save dlci address open status when config requester tty: n_gsm: fix frame reception handling ALSA: usb-audio: add mapping for MSI MPG X570S Carbon Max Wifi. ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX. 
tty: n_gsm: fix missing update of modem controls after DLCI open btrfs: zoned: encapsulate inode locking for zoned relocation btrfs: zoned: use dedicated lock for data relocation KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref mm/hwpoison: mf_mutex for soft offline and unpoison mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler mm/memory-failure.c: fix race with changing page compound again mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() tty: n_gsm: fix invalid use of MSC in advanced option tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output() serial: 8250_mtk: Make sure to select the right FEATURE_SEL tty: n_gsm: fix invalid gsmtty_write_room() result drm/amd: Refactor `amdgpu_aspm` to be evaluated per device drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems drm/i915: Fix a race between vma / object destruction and unbinding drm/mediatek: Use mailbox rx_callback instead of cmdq_task_cb drm/mediatek: Remove the pointer of struct cmdq_client drm/mediatek: Detect CMDQ execution timeout drm/mediatek: Add cmdq_handle in mtk_crtc drm/mediatek: Add vblank register/unregister callback functions Bluetooth: protect le accept and resolv lists with hdev->lock Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event io_uring: avoid io-wq -EAGAIN looping for !IOPOLL irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling irqchip/gic-v3: Refactor ISB + EOIR at ack time rxrpc: Fix locking issue dt-bindings: soc: qcom: smd-rpm: Add compatible for MSM8953 SoC dt-bindings: soc: qcom: smd-rpm: Fix missing MSM8936 compatible module: change to print useful messages from elf_validity_check() module: fix [e_shstrndx].sh_size=0 OOB access iommu/vt-d: Fix PCI bus rescan device hot add fbdev: fbmem: Fix logo center image dx issue fbmem: Check virtual screen sizes in fb_set_var() fbcon: Disallow setting font bigger than screen size fbcon: Prevent that 
screen size is smaller than font size PM: runtime: Redefine pm_runtime_release_supplier() memregion: Fix memregion_free() fallback definition video: of_display_timing.h: include errno.h powerpc/powernv: delay rng platform device creation until later in boot net: dsa: qca8k: reset cpu port on MTU change can: kvaser_usb: replace run-time checks with struct kvaser_usb_driver_info can: kvaser_usb: kvaser_usb_leaf: fix CAN clock frequency regression can: kvaser_usb: kvaser_usb_leaf: fix bittiming limits xfs: remove incorrect ASSERT in xfs_rename Revert "serial: sc16is7xx: Clear RS485 bits in the shutdown" btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2() virtio-blk: modify the value type of num in virtio_queue_rq() btrfs: fix use of uninitialized variable at rm device ioctl tty: n_gsm: fix encoding of command/response bit ARM: meson: Fix refcount leak in meson_smp_prepare_cpus pinctrl: sunxi: a83t: Fix NAND function name for some pins ASoC: rt711: Add endianness flag in snd_soc_component_driver ASoC: rt711-sdca: Add endianness flag in snd_soc_component_driver ASoC: codecs: rt700/rt711/rt711-sdca: resume bus/codec in .set_jack_detect arm64: dts: qcom: msm8994: Fix CPU6/7 reg values arm64: dts: qcom: sdm845: use dispcc AHB clock for mdss node ARM: mxs_defconfig: Enable the framebuffer arm64: dts: imx8mp-evk: correct mmc pad settings arm64: dts: imx8mp-evk: correct the uart2 pinctl value arm64: dts: imx8mp-evk: correct gpio-led pad settings arm64: dts: imx8mp-evk: correct vbus pad settings arm64: dts: imx8mp-evk: correct eqos pad settings arm64: dts: imx8mp-evk: correct I2C1 pad settings arm64: dts: imx8mp-evk: correct I2C3 pad settings arm64: dts: imx8mp-phyboard-pollux-rdk: correct uart pad settings arm64: dts: imx8mp-phyboard-pollux-rdk: correct eqos pad settings arm64: dts: imx8mp-phyboard-pollux-rdk: correct i2c2 & mmc settings pinctrl: sunxi: sunxi_pconf_set: use correct offset arm64: dts: qcom: msm8992-*: Fix vdd_lvs1_2-supply typo ARM: at91: pm: use 
proper compatible for sama5d2's rtc ARM: at91: pm: use proper compatibles for sam9x60's rtc and rtt ARM: at91: pm: use proper compatibles for sama7g5's rtc and rtt ARM: dts: at91: sam9x60ek: fix eeprom compatible and size ARM: dts: at91: sama5d2_icp: fix eeprom compatibles ARM: at91: fix soc detection for SAM9X60 SiPs xsk: Clear page contiguity bit when unmapping pool i2c: piix4: Fix a memory leak in the EFCH MMIO support i40e: Fix dropped jumbo frames statistics i40e: Fix VF's MAC Address change on VM ARM: dts: stm32: use usbphyc ck_usbo_48m as USBH OHCI clock on stm32mp151 ARM: dts: stm32: add missing usbh clock and fix clk order on stm32mp15 ibmvnic: Properly dispose of all skbs during a failover. selftests: forwarding: fix flood_unicast_test when h2 supports IFF_UNICAST_FLT selftests: forwarding: fix learning_test when h1 supports IFF_UNICAST_FLT selftests: forwarding: fix error message in learning_test r8169: fix accessing unset transport header i2c: cadence: Unregister the clk notifier in error path dmaengine: imx-sdma: Allow imx8m for imx7 FW revs misc: rtsx_usb: fix use of dma mapped buffer for usb bulk transfer misc: rtsx_usb: use separate command and response buffers misc: rtsx_usb: set return value in rsp_buf alloc err path Revert "mm/memory-failure.c: fix race with changing page compound again" Revert "serial: 8250_mtk: Make sure to select the right FEATURE_SEL" dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo ida: don't use BUG_ON() for debugging dmaengine: pl330: Fix lockdep warning about non-static key dmaengine: lgm: Fix an error handling path in intel_ldma_probe() dmaengine: at_xdma: handle errors of at_xdmac_alloc_desc() correctly dmaengine: ti: Fix refcount leak in ti_dra7_xbar_route_allocate dmaengine: qcom: bam_dma: fix runtime PM underflow dmaengine: ti: Add missing put_device in ti_dra7_xbar_route_allocate dmaengine: idxd: force wq context cleanup on device disable path selftests/net: fix section name when using xdp_dummy.o Linux 
5.15.54 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I3ca4c0aa09a3bea6969c7a127d833034a123f437	2022-07-13 19:41:43 +02:00
Naoya Horiguchi	2fa22e7906	Revert "mm/memory-failure.c: fix race with changing page compound again" commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836 upstream. Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing page compound again") because now we fetch the page refcount under hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no longer necessary. Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-07-12 16:35:17 +02:00
Naoya Horiguchi	62d1655b92	mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() [ Upstream commit 405ce051236cc65b30bbfe490b28ce60ae6aed85 ] There is a race condition between memory_failure_hugetlb() and hugetlb free/demotion, which causes setting PageHWPoison flag on the wrong page. The one simple result is that wrong processes can be killed, but another (more serious) one is that the actual error is left unhandled, so no one prevents later access to it, and that might lead to more serious results like consuming corrupted data. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... The current code first does prechecks roughly and then reconfirms after taking refcount, but it's found that it makes code overly complicated, so move the prechecks in a single hugetlb_lock range. A newly introduced function, try_memory_failure_hugetlb(), always takes hugetlb_lock (even for non-hugetlb pages). That can be improved, but memory_failure() is rare in principle, so should not be a big problem. Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev Fixes: `761ad8d7c7` ("mm: hwpoison: introduce memory_failure_hugetlb()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-07-12 16:35:06 +02:00
Miaohe Lin	5429eb5502	mm/memory-failure.c: fix race with changing page compound again [ Upstream commit 888af2701db79b9b27c7e37f9ede528a5ca53b76 ] Patch series "A few fixup patches for memory failure", v2. This series contains a few patches to fix the race with changing page compound page, make non-LRU movable pages unhandlable and so on. More details can be found in the respective changelogs. There is a race window where we got the compound_head, the hugetlb page could be freed to buddy, or even changed to another compound page just before we try to get hwpoison page. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is introduced to record this event. [akpm@linux-foundation.org: s@/@/@, per Naoya Horiguchi] Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-07-12 16:35:05 +02:00
Greg Kroah-Hartman	28f0c67d40	Merge 5.15.44 into android14-5.15 Changes in 5.15.44 HID: amd_sfh: Add support for sensor discovery KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID ice: fix crash at allocation failure ACPI: sysfs: Fix BERT error region memory mapping MAINTAINERS: co-maintain random.c MAINTAINERS: add git tree for random.c lib/crypto: blake2s: include as built-in lib/crypto: blake2s: move hmac construction into wireguard lib/crypto: sha1: re-roll loops to reduce code size lib/crypto: blake2s: avoid indirect calls to compression function for Clang CFI random: document add_hwgenerator_randomness() with other input functions random: remove unused irq_flags argument from add_interrupt_randomness() random: use BLAKE2s instead of SHA1 in extraction random: do not sign extend bytes for rotation when mixing random: do not re-init if crng_reseed completes before primary init random: mix bootloader randomness into pool random: harmonize "crng init done" messages random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs random: early initialization of ChaCha constants random: avoid superfluous call to RDRAND in CRNG extraction random: don't reset crng_init_cnt on urandom_read() random: fix typo in comments random: cleanup poolinfo abstraction random: cleanup integer types random: remove incomplete last_data logic random: remove unused extract_entropy() reserved argument random: rather than entropy_store abstraction, use global random: remove unused OUTPUT_POOL constants random: de-duplicate INPUT_POOL constants random: prepend remaining pool constants with POOL_ random: cleanup fractional entropy shift constants random: access input_pool_data directly rather than through pointer random: selectively clang-format where it makes sense random: simplify arithmetic function flow in account() random: continually use hwgenerator randomness random: access primary_pool directly rather than through pointer random: only call crng_finalize_init() for primary_crng 
random: use computational hash for entropy extraction random: simplify entropy debiting random: use linear min-entropy accumulation crediting random: always wake up entropy writers after extraction random: make credit_entropy_bits() always safe random: remove use_input_pool parameter from crng_reseed() random: remove batched entropy locking random: fix locking in crng_fast_load() random: use RDSEED instead of RDRAND in entropy extraction random: get rid of secondary crngs random: inline leaves of rand_initialize() random: ensure early RDSEED goes through mixer on init random: do not xor RDRAND when writing into /dev/random random: absorb fast pool into input pool after fast load random: use simpler fast key erasure flow on per-cpu keys random: use hash function for crng_slow_load() random: make more consistent use of integer types random: remove outdated INT_MAX >> 6 check in urandom_read() random: zero buffer after reading entropy from userspace random: fix locking for crng_init in crng_reseed() random: tie batched entropy generation to base_crng generation random: remove ifdef'd out interrupt bench random: remove unused tracepoints random: add proper SPDX header random: deobfuscate irq u32/u64 contributions random: introduce drain_entropy() helper to declutter crng_reseed() random: remove useless header comment random: remove whitespace and reorder includes random: group initialization wait functions random: group crng functions random: group entropy extraction functions random: group entropy collection functions random: group userspace read/write functions random: group sysctl functions random: rewrite header introductory comment random: defer fast pool mixing to worker random: do not take pool spinlock at boot random: unify early init crng load accounting random: check for crng_init == 0 in add_device_randomness() random: pull add_hwgenerator_randomness() declaration into random.h random: clear fast pool, crng, and batches in cpuhp bring up random: round-robin 
registers as ulong, not u32 random: only wake up writers after zap if threshold was passed random: cleanup UUID handling random: unify cycles_t and jiffies usage and types random: do crng pre-init loading in worker rather than irq random: give sysctl_random_min_urandom_seed a more sensible value random: don't let 644 read-only sysctls be written to random: replace custom notifier chain with standard one random: use SipHash as interrupt entropy accumulator random: make consistent usage of crng_ready() random: reseed more often immediately after booting random: check for signal and try earlier when generating entropy random: skip fast_init if hwrng provides large chunk of entropy random: treat bootloader trust toggle the same way as cpu trust toggle random: re-add removed comment about get_random_{u32,u64} reseeding random: mix build-time latent entropy into pool at init random: do not split fast init input in add_hwgenerator_randomness() random: do not allow user to keep crng key around on stack random: check for signal_pending() outside of need_resched() check random: check for signals every PAGE_SIZE chunk of /dev/[u]random random: allow partial reads if later user copies fail random: make random_get_entropy() return an unsigned long random: document crng_fast_key_erasure() destination possibility random: fix sysctl documentation nits init: call time_init() before rand_initialize() ia64: define get_cycles macro for arch-override s390: define get_cycles macro for arch-override parisc: define get_cycles macro for arch-override alpha: define get_cycles macro for arch-override powerpc: define get_cycles macro for arch-override timekeeping: Add raw clock fallback for random_get_entropy() m68k: use fallback for random_get_entropy() instead of zero riscv: use fallback for random_get_entropy() instead of zero mips: use fallback for random_get_entropy() instead of just c0 random arm: use fallback for random_get_entropy() instead of zero nios2: use fallback for 
random_get_entropy() instead of zero x86/tsc: Use fallback for random_get_entropy() instead of zero um: use fallback for random_get_entropy() instead of zero sparc: use fallback for random_get_entropy() instead of zero xtensa: use fallback for random_get_entropy() instead of zero random: insist on random_get_entropy() existing in order to simplify random: do not use batches when !crng_ready() random: use first 128 bits of input as fast init random: do not pretend to handle premature next security model random: order timer entropy functions below interrupt functions random: do not use input pool from hard IRQs random: help compiler out with fast_mix() by using simpler arguments siphash: use one source of truth for siphash permutations random: use symbolic constants for crng_init states random: avoid initializing twice in credit race random: move initialization out of reseeding hot path random: remove ratelimiting for in-kernel unseeded randomness random: use proper jiffies comparison macro random: handle latent entropy and command line from random_init() random: credit architectural init the exact amount random: use static branch for crng_ready() random: remove extern from functions in header random: use proper return types on get_random_{int,long}_wait() random: make consistent use of buf and len random: move initialization functions out of hot pages random: move randomize_page() into mm where it belongs random: unify batched entropy implementations random: convert to using fops->read_iter() random: convert to using fops->write_iter() random: wire up fops->splice_{read,write}_iter() random: check for signals after page of pool writes ALSA: ctxfi: Add SB046x PCI ID Linux 5.15.44 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I2d874cba14f13379fc5f874e72634c9179b28742	2022-06-14 11:49:05 +02:00
Greg Kroah-Hartman	a0d26c51d7	Merge 5.15.37 into android14-5.15 Changes in 5.15.37 floppy: disable FDRAWCMD by default bpf: Introduce composable reg, ret and arg types. bpf: Replace ARG_XXX_OR_NULL with ARG_XXX \| PTR_MAYBE_NULL bpf: Replace RET_XXX_OR_NULL with RET_XXX \| PTR_MAYBE_NULL bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX \| PTR_MAYBE_NULL bpf: Introduce MEM_RDONLY flag bpf: Convert PTR_TO_MEM_OR_NULL to composable types. bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM. bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem. bpf/selftests: Test PTR_TO_RDONLY_MEM bpf: Fix crash due to out of bounds access into reg2btf_ids. spi: cadence-quadspi: fix write completion support ARM: dts: socfpga: change qspi to "intel,socfpga-qspi" mm: kfence: fix objcgs vector allocation gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable iov_iter: Introduce fault_in_iov_iter_writeable gfs2: Add wrapper for iomap_file_buffered_write gfs2: Clean up function may_grant gfs2: Introduce flag for glock holder auto-demotion gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Eliminate ip->i_gh gfs2: Fix mmap + page fault deadlocks for buffered I/O iomap: Fix iomap_dio_rw return value for user copies iomap: Support partial direct I/O on user copy failures iomap: Add done_before argument to iomap_dio_rw gup: Introduce FOLL_NOFAULT flag to disable page faults iov_iter: Introduce nofault flag to disable page faults gfs2: Fix mmap + page fault deadlocks for direct I/O btrfs: fix deadlock due to page faults during direct IO reads and writes btrfs: fallback to blocking mode when doing async dio over multiple extents mm: gup: make fault_in_safe_writeable() use fixup_user_fault() selftests/bpf: Add test for reg2btf_ids out of bounds access Linux 5.15.37 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I785543e252f972c5a86f313e4b6721e2ff0797e6	2022-06-06 15:02:31 +02:00
Jason A. Donenfeld	64cb7f01dd	random: move randomize_page() into mm where it belongs commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream. randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct. So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function. This commit contains no actual code changes. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-30 09:29:17 +02:00
Andreas Gruenbacher	6e213bc614	gup: Introduce FOLL_NOFAULT flag to disable page faults commit 55b8fe703bc51200d4698596c90813453b35ae63 upstream Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return -EFAULT when it would otherwise trigger a page fault. This is roughly similar to FOLL_FAST_ONLY but available on all architectures, and less fragile. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-01 17:22:32 +02:00
Yu Zhao	f88ed5a3d3	FROMLIST: mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. 
The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. 
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512	2022-04-20 17:38:55 +00:00
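The generation arithmetic described in the multi-gen LRU groundwork message above can be sketched in plain C. This is only an illustration: the values `MIN_NR_GENS = 2` and `MAX_NR_GENS = 4` are assumptions (the message names the macros but does not give `MAX_NR_GENS`), truncation by modulo is an assumed reading of "truncated", and every helper name below is hypothetical rather than taken from the patch.

```c
#include <assert.h>

/* Assumed values for illustration; the commit message does not state MAX_NR_GENS. */
#define MIN_NR_GENS 2u
#define MAX_NR_GENS 4u

/* Hypothetical helper: smallest b such that (1 << b) >= n, for n >= 1. */
static unsigned int order_base_2_sketch(unsigned int n)
{
    unsigned int bits = 0;

    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Truncate a monotonically increasing sequence number to a list index (assumed: modulo). */
static unsigned int gen_from_seq(unsigned long seq)
{
    return (unsigned int)(seq % MAX_NR_GENS);
}

/* Gen counter value: 0 means "not on a list", otherwise index + 1 in [1, MAX_NR_GENS]. */
static unsigned int gen_counter_from_seq(unsigned long seq)
{
    unsigned int counter = gen_from_seq(seq) + 1;

    /* The counter must fit into order_base_2(MAX_NR_GENS + 1) bits. */
    assert(counter < (1u << order_base_2_sketch(MAX_NR_GENS + 1)));
    return counter;
}

/* Sliding window: between MIN_NR_GENS and MAX_NR_GENS generations are tracked. */
static int window_is_valid(unsigned long min_seq, unsigned long max_seq)
{
    unsigned long nr_gens;

    if (max_seq < min_seq)
        return 0;
    nr_gens = max_seq - min_seq + 1;
    return nr_gens >= MIN_NR_GENS && nr_gens <= MAX_NR_GENS;
}
```

Under these assumptions the counter needs three bits, because it has to encode the four list indices plus the extra value 0 for "not on any list".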
Charan Teja Reddy	eabd925e61	ANDROID: mm: shmem: use reclaim_pages() to reclaim pages from a list A static code analysis tool reported a NULL pointer access in shrink_page_list(), as the commit `26aa2d199d` ("mm/migrate: demote pages during reclaim") expects a valid pgdat. There is already an existing API, reclaim_pages(), that tries to reclaim pages from the list. Use it instead of creating a custom function. Bug: 201263305 Fixes: `96f80f6284` ("ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims") Change-Id: Iaa11feac94c9e8338324ace0276c49d6a0adeb0c Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>	2022-04-18 20:31:40 +00:00
Suren Baghdasaryan	0fd37220d8	UPSTREAM: mm: refactor vm_area_struct::anon_vma_name usage code Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Colin Cross <ccross@google.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Xiaofeng Cao <caoxiaofeng@yulong.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 5c26f6ac9416b63d093e29c30e79b3297e425472) Bug: 218352794 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I4a6b5602ce7151d1a4b88fac489f86d68089bd4d	2022-03-24 18:44:39 -07:00
Charan Teja Reddy	96f80f6284	ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims Add functionality that allows users of shmem to reclaim its pages without going through the kswapd/direct reclaim path. An example use case: a device allocates a large amount of shmem pages and shares them with hardware. To reclaim such pages faster, drivers can register shrinkers and call reclaim_shmem_address_space(). The implementation of this function is mostly borrowed from reclaim_address_space() implemented for per process reclaim[1]. [1] https://lore.kernel.org/patchwork/cover/378056/ Bug: 201263305 Change-Id: I03d2c3b9610612af977f89ddeabb63b8e9e50918 Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>	2022-03-23 11:32:21 -07:00
Michel Lespinasse	7d6787088d	BACKPORT: FROMLIST: mm: enable speculative fault handling for supported file types. Introduce vma_can_speculate(), which allows speculative handling for VMAs mapping supported file types. From do_handle_mm_fault(), speculative handling will follow through __handle_mm_fault(), handle_pte_fault() and do_fault(). At this point, we expect speculative faults to continue through one of: - do_read_fault(), fully implemented; - do_cow_fault(), which might abort if missing anon vmas, - do_shared_fault(), not implemented yet (would require ->page_mkwrite() changes). vma_can_speculate() provides an early abort for the do_shared_fault() case, limiting the time spent on trying that unimplemented case. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/ Conflicts: include/linux/vm_event_item.h mm/vmstat.c 1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/ since the patch posted upstream at the time had a different structure with stats for anonymous and file-backed pagefaults introduced in a separate patch. Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a	2022-03-23 11:32:19 -07:00
Michel Lespinasse	a2138fee6c	FROMLIST: fs: list file types that support speculative faults. Add a speculative field to the vm_operations_struct, which indicates if the associated file type supports speculative faults. Initially this is set for files that implement fault() with filemap_fault(). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69	2022-03-23 11:32:19 -07:00
Michel Lespinasse	6e6766ab76	BACKPORT: FROMLIST: mm: add pte_map_lock() and pte_spinlock() pte_map_lock() and pte_spinlock() are used by fault handlers to ensure the pte is mapped and locked before they commit the faulted page to the mm's address space at the end of the fault. The functions differ in their preconditions; pte_map_lock() expects the pte to be unmapped prior to the call, while pte_spinlock() expects it to be already mapped. In the speculative fault case, the functions verify, after locking the pte, that the mmap sequence count has not changed since the start of the fault, and thus that no mmap lock writers have been running concurrently with the fault. After that point the page table lock serializes any further races with concurrent mmap lock writers. If the mmap sequence count check fails, both functions will return false with the pte being left unmapped and unlocked. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/ Conflicts: include/linux/mm.h 1. Fixed pte_map_lock and pte_spinlock macros not to fail when CONFIG_SPECULATIVE_PAGE_FAULT=n Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836	2022-03-23 11:32:15 -07:00
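The "lock first, then re-check the mmap sequence count" step described for pte_map_lock() and pte_spinlock() can be illustrated with a small user-space analogy. This is a sketch under stated assumptions, not the kernel implementation: `struct mm_sketch`, its fields, and all function names are hypothetical, a C11 `atomic_flag` stands in for the page table lock, and a plain atomic counter stands in for the mmap sequence count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of an mm. */
struct mm_sketch {
    atomic_ulong mmap_seq; /* bumped by every (simulated) mmap lock writer */
    atomic_flag ptl;       /* stand-in for the page table lock */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
    atomic_init(&mm->mmap_seq, 0);
    atomic_flag_clear(&mm->ptl);
}

/* Sample the sequence count at the start of a speculative fault. */
static unsigned long fault_begin(struct mm_sketch *mm)
{
    return atomic_load_explicit(&mm->mmap_seq, memory_order_acquire);
}

/* Simulated mmap lock writer: any change invalidates in-flight speculation. */
static void writer_update(struct mm_sketch *mm)
{
    atomic_fetch_add_explicit(&mm->mmap_seq, 1, memory_order_release);
}

static void pte_unlock_sketch(struct mm_sketch *mm)
{
    atomic_flag_clear_explicit(&mm->ptl, memory_order_release);
}

/*
 * Take the lock, then verify that no writer ran since fault_begin().
 * Returns true with the lock held, or false with the lock released.
 */
static bool pte_lock_speculative(struct mm_sketch *mm, unsigned long seq_at_start)
{
    while (atomic_flag_test_and_set_explicit(&mm->ptl, memory_order_acquire))
        ; /* spin until the lock is free */

    if (atomic_load_explicit(&mm->mmap_seq, memory_order_acquire) != seq_at_start) {
        pte_unlock_sketch(mm);
        return false;
    }
    return true;
}
```

A caller that gets `false` back would abandon the speculative attempt and retry through the regular, mmap-locked path; once `true` is returned, the held lock is what serializes against later writers.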
Michel Lespinasse	6ab660d7cb	FROMLIST: mm: implement speculative handling in __handle_mm_fault(). The speculative path calls speculative_page_walk_begin() before walking the page table tree to prevent page table reclamation. The logic is otherwise similar to the non-speculative path, but with additional restrictions: in the speculative path, we do not handle huge pages or wiring new page tables. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a	2022-03-23 11:32:15 -07:00
Michel Lespinasse	0823d516af	FROMLIST: mm: separate mmap locked assertion from find_vma This adds a new __find_vma() function, which implements find_vma minus the mmap_assert_locked() assertion. find_vma() is then implemented as an inline wrapper around __find_vma(). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-13-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ia999b8cb8f5eed93040ab4b3caaf90d739da908d	2022-03-23 11:32:14 -07:00
Michel Lespinasse	4e2e391ff7	BACKPORT: FROMLIST: mm: add do_handle_mm_fault() Add a new do_handle_mm_fault function, which extends the existing handle_mm_fault() API by adding an mmap sequence count, to be used in the FAULT_FLAG_SPECULATIVE case. In the initial implementation, FAULT_FLAG_SPECULATIVE always fails (by returning VM_FAULT_RETRY). The existing handle_mm_fault() API is kept as a wrapper around do_handle_mm_fault() so that we do not have to immediately update every handle_mm_fault() call site. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Conflicts: mm/memory.c 1. Trivial merge conflict due to folios. Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88	2022-03-23 11:32:14 -07:00
Michel Lespinasse	f2fa9aae2e	BACKPORT: FROMLIST: mm: add FAULT_FLAG_SPECULATIVE flag Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are attempting speculative fault handling (without holding the mmap lock). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Conflicts: include/linux/mm_types.h 1. Merge conflict due to enum fault_flag being defined in mm.h instead of mm_types.h Link: https://lore.kernel.org/all/20220128131006.67712-9-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I48ab427dfa4d7bdbe9932588bec7ae99e9e80ae9	2022-03-23 11:32:14 -07:00
Greg Kroah-Hartman	8222792e8e	Merge 5.15.19 into android13-5.15 Changes in 5.15.19 can: m_can: m_can_fifo_{read,write}: don't read or write from/to FIFO if length is 0 net: sfp: ignore disabled SFP node net: stmmac: configure PTP clock source prior to PTP initialization net: stmmac: skip only stmmac_ptp_register when resume from suspend ARM: 9179/1: uaccess: avoid alignment faults in copy_[from\|to]_kernel_nofault ARM: 9180/1: Thumb2: align ALT_UP() sections in modules sufficiently KVM: arm64: Use shadow SPSR_EL1 when injecting exceptions on !VHE s390/module: fix loading modules with a lot of relocations s390/hypfs: include z/VM guests with access control group set s390/nmi: handle guarded storage validity failures for KVM guests s390/nmi: handle vector validity failures for KVM guests bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack() powerpc32/bpf: Fix codegen for bpf-to-bpf calls powerpc/bpf: Update ldimm64 instructions during extra pass ucount: Make get_ucount a safe get_user replacement scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices udf: Restore i_lenAlloc when inode expansion fails udf: Fix NULL ptr deref when converting from inline format efi: runtime: avoid EFIv2 runtime services on Apple x86 machines PM: wakeup: simplify the output logic of pm_show_wakelocks() tracing/histogram: Fix a potential memory leak for kstrdup() tracing: Don't inc err_log entry count if entry allocation fails ceph: properly put ceph_string reference after async create attempt ceph: set pool_ns in new inode layout for async creates fsnotify: fix fsnotify hooks in pseudo filesystems Revert "KVM: SVM: avoid infinite loop on NPF from bad address" psi: Fix uaf issue when psi trigger is destroyed while being polled powerpc/audit: Fix syscall_get_arch() perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake drm/etnaviv: relax submit size limits 
drm/atomic: Add the crtc to affected crtc only if uapi.enable = true drm/amd/display: Fix FP start/end for dcn30_internal_validate_bw. KVM: LAPIC: Also cancel preemption timer during SET_LAPIC KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests KVM: SVM: Don't intercept #GP for SEV guests KVM: x86: nSVM: skip eax alignment check for non-SVM instructions KVM: x86: Forcibly leave nested virt when SMM state is toggled KVM: x86: Keep MSR_IA32_XSS unchanged for INIT KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time KVM: PPC: Book3S HV Nested: Fix nested HFSCR being clobbered with multiple vCPUs dm: revert partial fix for redundant bio-based IO accounting block: add bio_start_io_acct_time() to control start_time dm: properly fix redundant bio-based IO accounting serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl serial: 8250: of: Fix mapped region size when using reg-offset property serial: stm32: fix software flow control transfer tty: n_gsm: fix SW flow control encoding/handling tty: Partially revert the removal of the Cyclades public API tty: Add support for Brainboxes UC cards. 
kbuild: remove include/linux/cyclades.h from header file check usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge usb: xhci-plat: fix crash when suspend if remote wake enable usb: common: ulpi: Fix crash in ulpi_match() usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS usb: cdnsp: Fix segmentation fault in cdns_lost_power function usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode usb: dwc3: xilinx: Fix error handling when getting USB3 PHY USB: core: Fix hang in usb_kill_urb by adding memory barriers usb: typec: tcpci: don't touch CC line if it's Vconn source usb: typec: tcpm: Do not disconnect while receiving VBUS off usb: typec: tcpm: Do not disconnect when receiving VSAFE0V ucsi_ccg: Check DEV_INT bit only when starting CCG4 mm, kasan: use compare-exchange operation to set KASAN page tag jbd2: export jbd2_journal_[grab\|put]_journal_head ocfs2: fix a deadlock when commit trans sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask PCI/sysfs: Find shadow ROM before static attribute initialization x86/MCE/AMD: Allow thresholding interface updates after init x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN powerpc/32s: Allocate one 256k IBAT instead of two consecutives 128k IBATs powerpc/32s: Fix kasan_init_region() for KASAN powerpc/32: Fix boot failure with GCC latent entropy plugin i40e: Increase delay to 1 s after global EMP reset i40e: Fix issue when maximum queues is exceeded i40e: Fix queues reservation for XDP i40e: Fix for failed to init adminq while VF reset i40e: fix unsigned stat widths usb: roles: fix include/linux/usb/role.h compile issue rpmsg: char: Fix race between the release of rpmsg_ctrldev and cdev rpmsg: char: Fix race between the release of rpmsg_eptdev and cdev scsi: elx: efct: Don't use GFP_KERNEL under spin lock scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put() ipv6_tunnel: Rate limit warning messages ARM: 9170/1: fix panic 
when kasan and kprobe are enabled net: fix information leakage in /proc/net/ptype hwmon: (lm90) Mark alert as broken for MAX6646/6647/6649 hwmon: (lm90) Mark alert as broken for MAX6680 ping: fix the sk_bound_dev_if match in ping_lookup ipv4: avoid using shared IP generator for connected sockets hwmon: (lm90) Reduce maximum conversion rate for G781 NFSv4: Handle case where the lookup of a directory fails NFSv4: nfs_atomic_open() can race when looking up a non-regular file net-procfs: show net devices bound packet types drm/msm: Fix wrong size calculation drm/msm/dsi: Fix missing put_device() call in dsi_get_phy drm/msm/dsi: invalid parameter check in msm_dsi_phy_enable ipv6: annotate accesses to fn->fn_sernum NFS: Ensure the server has an up to date ctime before hardlinking NFS: Ensure the server has an up to date ctime before renaming KVM: arm64: pkvm: Use the mm_ops indirection for cache maintenance SUNRPC: Use BIT() macro in rpc_show_xprt_state() SUNRPC: Don't dereference xprt->snd_task if it's a cookie powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06 netfilter: conntrack: don't increment invalid counter on NF_REPEAT powerpc/64s: Mask SRR0 before checking against the masked NIP perf: Fix perf_event_read_local() time sched/pelt: Relax the sync of util_sum with util_avg net: phy: broadcom: hook up soft_reset for BCM54616S net: stmmac: dwmac-visconti: Fix bit definitions for ETHER_CLK_SEL net: stmmac: dwmac-visconti: Fix clock configuration for RMII mode phylib: fix potential use-after-free octeontx2-af: Do not fixup all VF action entries octeontx2-af: Fix LBK backpressure id count octeontx2-af: Retry until RVU block reset complete octeontx2-pf: cn10k: Ensure valid pointers are freed to aura octeontx2-af: verify CQ context updates octeontx2-af: Increase link credit restore polling timeout octeontx2-af: cn10k: Do not enable RPM loopback for LPC interfaces octeontx2-pf: Forward error codes to VF rxrpc: Adjust retransmission backoff efi/libstub: 
arm64: Fix image check alignment at entry io_uring: fix bug in slow unregistering of nodes Drivers: hv: balloon: account for vmbus packet header in max_pkt_size hwmon: (lm90) Re-enable interrupts after alert clears hwmon: (lm90) Mark alert as broken for MAX6654 hwmon: (lm90) Fix sysfs and udev notifications hwmon: (adt7470) Prevent divide by zero in adt7470_fan_write() powerpc/perf: Fix power_pmu_disable to call clear_pmi_irq_pending only if PMI is pending ipv4: fix ip option filtering for locally generated fragments ibmvnic: Allow extra failures before disabling ibmvnic: init ->running_cap_crqs early ibmvnic: don't spin in tasklet net/smc: Transitional solution for clcsock race issue video: hyperv_fb: Fix validation of screen resolution can: tcan4x5x: regmap: fix max register value drm/msm/hdmi: Fix missing put_device() call in msm_hdmi_get_phy drm/msm/dpu: invalid parameter check in dpu_setup_dspp_pcc drm/msm/a6xx: Add missing suspend_count increment yam: fix a memory leak in yam_siocdevprivate() net: cpsw: Properly initialise struct page_pool_params net: hns3: handle empty unknown interrupt for VF sch_htb: Fail on unsupported parameters when offload is requested Revert "drm/ast: Support 1600x900 with 108MHz PCLK" KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest ceph: put the requests/sessions when it fails to alloc memory gve: Fix GFP flags when allocing pages Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values" net: bridge: vlan: fix single net device option dumping ipv4: raw: lock the socket in raw_bind() ipv4: tcp: send zero IPID in SYNACK messages ipv4: remove sparse error in ip_neigh_gw4() net: bridge: vlan: fix memory leak in __allowed_ingress Bluetooth: refactor malicious adv data check irqchip/realtek-rtl: Map control data to virq irqchip/realtek-rtl: Fix off-by-one in routing dt-bindings: can: tcan4x5x: fix mram-cfg RX FIFO config perf/core: Fix cgroup event list management psi: fix "no previous prototype" warnings when 
CONFIG_CGROUPS=n psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n usb: dwc3: xilinx: fix uninitialized return value usr/include/Makefile: add linux/nfc.h to the compile-test coverage fsnotify: invalidate dcache before IN_DELETE event block: Fix wrong offset in bio_truncate() mtd: rawnand: mpc5121: Remove unused variable in ads5121_select_chip() Linux 5.15.19 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I66399d45af362fa8e1672ba38c0d672e21afc716	2022-02-02 09:32:24 +01:00
Peter Collingbourne	4ca8a0bc83	mm, kasan: use compare-exchange operation to set KASAN page tag commit 27fe73394a1c6d0b07fa4d95f1bca116d1cc66e9 upstream. It has been reported that the tag setting operation on newly-allocated pages can cause the page flags to be corrupted when performed concurrently with other flag updates as a result of the use of non-atomic operations. Fix the problem by using a compare-exchange loop to update the tag. Link: https://lkml.kernel.org/r/20220120020148.1632253-1-pcc@google.com Link: https://linux-review.googlesource.com/id/I456b24a2b9067d93968d43b4bb3351c0cec63101 Fixes: `2813b9c029` ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-02-01 17:27:05 +01:00
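The compare-exchange loop mentioned in the KASAN page tag fix above follows a general pattern: recompute the new word from the latest observed value and retry until no concurrent update slipped in. A minimal user-space sketch with C11 atomics, assuming a hypothetical layout in which the tag occupies the top byte of a 64-bit flags word (`TAG_SHIFT`, `TAG_MASK` and the function names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: an 8-bit tag stored in the top byte of the flags word. */
#define TAG_SHIFT 56
#define TAG_MASK  UINT64_C(0xff)

/* Replace only the tag bits; concurrent updates to other bits are never lost. */
static void set_tag_cmpxchg(_Atomic uint64_t *flags, uint8_t tag)
{
    uint64_t old = atomic_load_explicit(flags, memory_order_relaxed);
    uint64_t updated;

    do {
        /* Rebuild from the most recent value; 'old' is refreshed on failure. */
        updated = (old & ~(TAG_MASK << TAG_SHIFT)) | ((uint64_t)tag << TAG_SHIFT);
    } while (!atomic_compare_exchange_weak_explicit(flags, &old, updated,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}

static uint8_t get_tag(_Atomic uint64_t *flags)
{
    return (uint8_t)((atomic_load_explicit(flags, memory_order_relaxed)
                      >> TAG_SHIFT) & TAG_MASK);
}
```

A non-atomic read-modify-write of the same word could overwrite a flag bit set by another CPU between the read and the write, which is the kind of corruption the commit message describes.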
Colin Cross	301c56064d	UPSTREAM: mm: add a field to store names for private anonymous memory In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. 
It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. 
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. 
Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b) Bug: 120441514 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I53d56d551a7d62f75341304751814294b447c04e	2022-01-18 15:30:27 -08:00
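The naming rules in the commit message above (at most 80 bytes including the null byte, printable ASCII only, and none of '[', ']', '\', '$' or '`') can be checked in user space before calling prctl(). The following is a minimal sketch, not the kernel's own validation code. The helper names anon_name_char_ok(), anon_name_is_valid() and name_anon_region() are hypothetical, and the argument order (start address in arg3, length in arg4) is taken from the prctl(2) man page rather than from the text shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in linux/prctl.h. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Limit quoted in the commit message: 80 bytes including the null byte. */
#define ANON_NAME_MAX_BYTES 80

/* Hypothetical helper: printable ASCII (space included), minus [ ] \ $ ` */
static bool anon_name_char_ok(unsigned char c)
{
	if (c < 0x20 || c > 0x7e)
		return false;
	return strchr("[]\\$`", c) == NULL;
}

/* Hypothetical helper: true if the kernel rules quoted above would accept name. */
static bool anon_name_is_valid(const char *name)
{
	size_t i;

	if (name == NULL)
		return true;	/* NULL asks the kernel to reset the name */

	for (i = 0; name[i] != '\0'; i++) {
		if (i + 1 >= ANON_NAME_MAX_BYTES)
			return false;	/* no room left for the null byte */
		if (!anon_name_char_ok((unsigned char)name[i]))
			return false;
	}
	return true;
}

/*
 * Hypothetical wrapper: name (or reset, if name is NULL) the anonymous
 * mapping covering [addr, addr + len). Returns 0 on success, -1 with errno
 * set on failure. Kernels built without CONFIG_ANON_VMA_NAME reject the call.
 */
int name_anon_region(void *addr, size_t len, const char *name)
{
	assert(addr != NULL);

	if (!anon_name_is_valid(name)) {
		errno = EINVAL;
		return -1;
	}
	return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		     (unsigned long)addr, (unsigned long)len,
		     (unsigned long)name);
}
```

Checking the name locally gives a clearer error path than a bare EINVAL from the kernel, but the kernel remains the authority. A caller should still handle failure from prctl() itself.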
Suren Baghdasaryan	f355f9635d	Revert "ANDROID: mm: add a field to store names for private anonymous memory" This reverts commit `60500a4228`. Replacing out-of-tree implementation with the upstream one. Bug: 120441514 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ic34c8e16d51ccf9f00cb59d2de341e911bcb2828	2022-01-18 14:54:47 -08:00
Kalesh Singh	1a77f04aac	Revert "ANDROID: mm: Throttle rss_stat tracepoint" This reverts commit `77dfeaa02d`. Throttling can now be done using hist triggers and synthetic events Bug: 145972256 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I39c284040e2fdb815cda980f3a40ef188e59287c	2021-11-17 23:30:23 +00:00
Greg Kroah-Hartman	5606699789	Merge `996fe06160` ("Merge tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline Steps on the way to 5.15-rc1 Change-Id: I3806b714a5a783a7132b1daf766ebb71985fc640 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2021-09-14 16:16:54 +02:00
Greg Kroah-Hartman	c5cd945b24	Merge `fd47ff55c9` ("Merge tag 'usb-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb") into android-mainline Steps on the way to 5.15-rc1. Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I42ffa8818bbb2072f043923553c4d8f91d9647a5	2021-09-14 14:42:51 +02:00
Greg Kroah-Hartman	c2b303f98f	Merge `4e71add028` ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28	2021-09-14 14:35:23 +02:00
Greg Kroah-Hartman	59437fa4ba	Merge `90c90cda05` ("Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux") into android-mainline Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id0e9064c101f599601a6d864ce7ac2ed88083562	2021-09-09 14:02:28 +02:00
Linus Torvalds	cd1adf1b63	Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly" This reverts commit `9857a17f20`. That commit was completely broken, and I should have caught on to it earlier. But happily, the kernel test robot noticed the breakage fairly quickly. The breakage is because "try_get_page()" is about avoiding the page reference count overflow case, but is otherwise the exact same as a plain "get_page()". In contrast, "try_get_compound_head()" is an entirely different beast, and uses __page_cache_add_speculative() because it's not just about the page reference count, but also about possibly racing with the underlying page going away. So all the commentary about how "try_get_page() has fallen a little behind in terms of maintenance, try_get_compound_head() handles speculative page references more thoroughly" was just completely wrong: yes, try_get_compound_head() handles speculative page references, but the point is that try_get_page() does not, and must not. So there's no lack of maintenance - there are fundamentally different semantics. A speculative page reference would be entirely wrong in "get_page()", and it's entirely wrong in "try_get_page()". It's not about speculation, it's purely about "uhhuh, you can't get this page because you've tried to increment the reference count too much already". The reason the kernel test robot noticed this bug was that it hit the VM_BUG_ON() in __page_cache_add_speculative(), which is all about verifying that the context of any speculative page access is correct. But since that isn't what try_get_page() is all about, the VM_BUG_ON() tests things that are not correct to test for try_get_page(). Reported-by: kernel test robot <oliver.sang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-07 11:03:45 -07:00
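The semantic difference described in the revert above can be modelled in user space with C11 atomics. This is only a sketch of the two contracts, not the kernel code. struct model_page, model_try_get_page() and model_get_speculative() are hypothetical names, and the model ignores compound pages and the RCU context checks that the real speculative path performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count is modelled. */
struct model_page {
	atomic_int refcount;
};

/*
 * Models the try_get_page() contract: the caller already holds a reference,
 * so the page cannot go away. The only failure case is a count that is not
 * positive, i.e. one that has overflowed and wrapped.
 */
static bool model_try_get_page(struct model_page *page)
{
	assert(page);
	if (atomic_load(&page->refcount) <= 0)
		return false;
	atomic_fetch_add(&page->refcount, 1);
	return true;
}

/*
 * Models a speculative reference: the page may be freed concurrently, so the
 * count is incremented only if it is observed to be non-zero, in one atomic
 * step (a compare-and-swap loop).
 */
static bool model_get_speculative(struct model_page *page)
{
	int old;

	assert(page);
	old = atomic_load(&page->refcount);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;	/* count was zero: the page is already on its way out */
}
```

The first helper never needs to ask whether the page is still alive, while the second one exists precisely to ask that question. This is why swapping one for the other changes behavior rather than tidying it.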
Linus Torvalds	49624efa65	Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux Pull MAP_DENYWRITE removal from David Hildenbrand: "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove VM_DENYWRITE. There are some (minor) user-visible changes: - We no longer deny write access to shared libraries loaded via legacy uselib(); this behavior matches modern user space e.g. dlopen(). - We no longer deny write access to the elf interpreter after exec completed, treating it just like shared libraries (which it often is). - We always deny write access to the file linked via /proc/pid/exe: sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file cannot be denied, and write access to the file will remain denied until the link is effectively gone (exec, termination, sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file. Cross-compiled for a bunch of architectures (alpha, microblaze, i386, s390x, ...) and verified via ltp that especially the relevant tests (i.e., creat07 and execve04) continue working as expected" * tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux: fs: update documentation of get_write_access() and friends mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() mm: remove VM_DENYWRITE binfmt: remove in-tree usage of MAP_DENYWRITE kernel/fork: always deny write access to current MM exe_file kernel/fork: factor out replacing the current MM exe_file binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()	2021-09-04 11:35:47 -07:00
Linus Torvalds	14726903c8	Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...	2021-09-03 10:08:28 -07:00
Yang Shi	d0505e9f7d	mm: hwpoison: don't drop slab caches for offlining non-LRU page In the current implementation of soft offline, if non-LRU page is met, all the slab caches will be dropped to free the page then offline. But if the page is not slab page all the effort is wasted in vain. Even though it is a slab page, it is not guaranteed the page could be freed at all. However the side effect and cost is quite high. It does not only drop the slab caches, but also may drop a significant amount of page caches which are associated with inode caches. It could make the most workingset gone in order to just offline a page. And the offline is not guaranteed to succeed at all, actually I really doubt the success rate for real life workload. Furthermore the worse consequence is the system may be locked up and unusable since the page cache release may incur huge amount of works queued for memcg release. Actually we ran into such unpleasant case in our production environment. Firstly, the workqueue of memory_failure_work_func is locked up as below: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s! 
Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15 in-flight: 409271:memory_failure_work_func pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work workqueue mm_percpu_wq: flags=0x8 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 pending: vmstat_update workqueue cgroup_destroy: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072 pending: css_release_work_fn There were over 12K css_release_work_fn queued, and this caused a few lockups due to the contention of worker pool lock with IRQ disabled, for example: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1 Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021 Workqueue: events memory_failure_work_func RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0 Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002 RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140 RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000 R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001 R13: ffff8cb44dab9c00 R14: 
ffffffffbd1ce6a0 R15: ffff8cacaa37f068 FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0 Call Trace: __queue_work+0xd6/0x3c0 queue_work_on+0x1c/0x30 uncharge_batch+0x10e/0x110 mem_cgroup_uncharge_list+0x6d/0x80 release_pages+0x37f/0x3f0 __pagevec_release+0x1c/0x50 __invalidate_mapping_pages+0x348/0x380 inode_lru_isolate+0x10a/0x160 __list_lru_walk_one+0x7b/0x170 list_lru_walk_one+0x4a/0x60 prune_icache_sb+0x37/0x50 super_cache_scan+0x123/0x1a0 do_shrink_slab+0x10c/0x2c0 shrink_slab+0x1f1/0x290 drop_slab_node+0x4d/0x70 soft_offline_page+0x1ac/0x5b0 memory_failure_work_func+0x6a/0x90 process_one_work+0x19e/0x340 worker_thread+0x30/0x360 kthread+0x116/0x130 The lockup made the machine quite unusable. It also evicted most of the working set: the reclaimable slab caches were reduced from 12G to 300MB, and the page caches decreased from 17G to 4G. But the most disappointing thing is that all the effort doesn't take the page offline; it just returns: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () It seems the aggressive behavior for non-LRU pages didn't pay off, so it doesn't make much sense to keep it considering the terrible side effects. Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: David Mackey <tdmackey@twitter.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:15 -07:00
John Hubbard	3969b1a654	mm: delete unused get_kernel_page() get_kernel_page() was added in 2012 by [1]. It was used for a while for NFS, but then in 2014, a refactoring [2] removed all callers, and it has apparently not been used since. Remove get_kernel_page() because it has no callers. [1] commit `18022c5d86` ("mm: add get_kernel_page[s] for pinning of kernel addresses for I/O") [2] commit `91f79c43d1` ("new helper: iov_iter_get_pages_alloc()") Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
John Hubbard	9857a17f20	mm/gup: remove try_get_page(), call try_get_compound_head() directly try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
John Hubbard	54d516b1d6	mm/gup: small refactoring: simplify try_grab_page() try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
David Hildenbrand	8d0920bde5	mm: remove VM_DENYWRITE All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
David Hildenbrand	fe69d560b5	kernel/fork: always deny write access to current MM exe_file We want to remove VM_DENYWRITE, which is currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable; however, after exec completes, the VMAs mapped with VM_DENYWRITE effectively keep write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file(). However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
David Hildenbrand	35d7bdc860	kernel/fork: factor out replacing the current MM exe_file Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
Dave Chinner	de2860f463	mm: Add kvrealloc() During log recovery of an XFS filesystem with 64kB directory buffers, rebuilding a buffer split across two log records results in a memory allocation warning from krealloc like this: xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff) XFS (dm-0): Unmounting Filesystem XFS (dm-0): Mounting V5 Filesystem XFS (dm-0): Starting recovery (logdev: internal) ------------[ cut here ]------------ WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40 ..... RIP: 0010:get_page_from_freelist+0xdee/0xe40 Call Trace: ? complete+0x3f/0x50 __alloc_pages+0x16f/0x300 alloc_pages+0x87/0x110 kmalloc_order+0x2c/0x90 kmalloc_order_trace+0x1d/0x90 __kmalloc_track_caller+0x215/0x270 ? xlog_recover_add_to_cont_trans+0x63/0x1f0 krealloc+0x54/0xb0 xlog_recover_add_to_cont_trans+0x63/0x1f0 xlog_recovery_process_trans+0xc1/0xd0 xlog_recover_process_ophdr+0x86/0x130 xlog_recover_process_data+0x9f/0x160 xlog_recover_process+0xa2/0x120 xlog_do_recovery_pass+0x40b/0x7d0 ? __irq_work_queue_local+0x4f/0x60 ? irq_work_queue+0x3a/0x50 xlog_do_log_recovery+0x70/0x150 xlog_do_recover+0x38/0x1d0 xlog_recover+0xd8/0x170 xfs_log_mount+0x181/0x300 xfs_mountfs+0x4a1/0x9b0 xfs_fs_fill_super+0x3c0/0x7b0 get_tree_bdev+0x171/0x270 ? suffix_kstrtoint.constprop.0+0xf0/0xf0 xfs_fs_get_tree+0x15/0x20 vfs_get_tree+0x24/0xc0 path_mount+0x2f5/0xaf0 __x64_sys_mount+0x108/0x140 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae Essentially, we are taking a multi-order allocation from kmem_alloc() (which has an open coded no fail, no warn loop) and then reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is then triggering the above warning. This is a regression caused by converting this code from an open coded no fail/no warn reallocation loop to using __GFP_NOFAIL. 
What we actually need here is kvrealloc(), so that if contiguous page allocation fails we fall back to vmalloc() and we don't get nasty warnings happening in XFS. Fixes: `771915c4f6` ("xfs: remove kmem_realloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-08-09 15:57:43 -07:00
Lee Jones	946e465c81	Merge tag 'v5.14-rc2' into android-mainline Linux 5.14-rc2 Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: Ia2131de59daa96610741f5a0ff267b0d08697023	2021-07-22 14:14:38 +01:00
Lee Jones	a8b636db7d	Merge tag 'v5.14-rc1' into android-mainline Linux 5.14-rc1 Change-Id: I9765cd4581f6683a6fca3580667017fff9cbaa2b Signed-off-by: Lee Jones <lee.jones@linaro.org>	2021-07-13 10:02:04 +01:00
Matthew Wilcox (Oracle)	79789db03f	mm: Make copy_huge_page() always available Rewrite copy_huge_page() and move it into mm/util.c so it's always available. Fixes an exposure of uninitialised memory on configurations with HUGETLB and UFFD enabled and MIGRATION disabled. Fixes: `8cc5fcbb5b` ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-12 11:30:56 -07:00
Lee Jones	293f275f4d	Merge commit `df8ba5f160` ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline A large step en route to v5.14-rc1 Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9 Signed-off-by: Lee Jones <lee.jones@linaro.org>	2021-07-12 11:00:18 +01:00
Lee Jones	8e658623d4	Merge commit `c288d9cd71` ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline Another small step en route to v5.14-rc1 Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556 Signed-off-by: Lee Jones <lee.jones@linaro.org>	2021-07-12 10:02:27 +01:00

1236 Commits