kernel_arpi

Author	SHA1	Message	Date
Greg Kroah-Hartman	3d8ac88867	Merge 5.15.46 into android14-5.15 Changes in 5.15.46 binfmt_flat: do not stop relocating GOT entries prematurely on riscv parisc/stifb: Implement fb_is_primary_device() parisc/stifb: Keep track of hardware path of graphics card RISC-V: Mark IORESOURCE_EXCLUSIVE for reserved mem instead of IORESOURCE_BUSY riscv: Initialize thread pointer before calling C functions riscv: Fix irq_work when SMP is disabled riscv: Wire up memfd_secret in UAPI header riscv: Move alternative length validation into subsection ALSA: hda/realtek - Add new type for ALC245 ALSA: hda/realtek: Enable 4-speaker output for Dell XPS 15 9520 laptop ALSA: hda/realtek - Fix microphone noise on ASUS TUF B550M-PLUS ALSA: usb-audio: Cancel pending work at closing a MIDI substream USB: serial: pl2303: fix type detection for odd device USB: serial: option: add Quectel BG95 modem USB: new quirk for Dell Gen 2 devices usb: isp1760: Fix out-of-bounds array access usb: dwc3: gadget: Move null pinter check to proper place usb: core: hcd: Add support for deferring roothub registration fs/ntfs3: Update valid size if -EIOCBQUEUED fs/ntfs3: Fix fiemap + fix shrink file size (to remove preallocated space) fs/ntfs3: Keep preallocated only if option prealloc enabled fs/ntfs3: Check new size for limits fs/ntfs3: In function ntfs_set_acl_ex do not change inode->i_mode if called from function ntfs_init_acl fs/ntfs3: Fix some memory leaks in an error handling path of 'log_replay()' fs/ntfs3: Update i_ctime when xattr is added fs/ntfs3: Restore ntfs_xattr_get_acl and ntfs_xattr_set_acl functions cifs: fix potential double free during failed mount cifs: when extending a file with falloc we should make files not-sparse xhci: Allow host runtime PM as default for Intel Alder Lake N xHCI platform/x86: intel-hid: fix _DSM function index handling x86/MCE/AMD: Fix memory leak when threshold_create_bank() fails perf/x86/intel: Fix event constraints for ICL x86/kexec: fix memory leak of elf header buffer x86/sgx: Set active memcg prior to shmem allocation ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP ptrace: Reimplement PTRACE_KILL by always sending SIGKILL btrfs: add "0x" prefix for unsupported optional features btrfs: return correct error number for __extent_writepage_io() btrfs: repair super block num_devices automatically btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage() iommu/vt-d: Add RPLS to quirk list to skip TE disabling drm/vmwgfx: validate the screen formats drm/virtio: fix NULL pointer dereference in virtio_gpu_conn_get_modes selftests/bpf: Fix vfs_link kprobe definition selftests/bpf: Fix parsing of prog types in UAPI hdr for bpftool sync mwifiex: add mutex lock for call in mwifiex_dfs_chan_sw_work_queue b43legacy: Fix assigning negative value to unsigned variable b43: Fix assigning negative value to unsigned variable ipw2x00: Fix potential NULL dereference in libipw_xmit() ipv6: fix locking issues with loops over idev->addr_list fbcon: Consistently protect deferred_takeover with console_lock() x86/platform/uv: Update TSC sync state for UV5 ACPICA: Avoid cache flush inside virtual machines mac80211: minstrel_ht: fix where rate stats are stored (fixes debugfs output) drm/komeda: return early if drm_universal_plane_init() fails. drm/amd/display: Disabling Z10 on DCN31 rcu-tasks: Fix race in schedule and flush work rcu: Make TASKS_RUDE_RCU select IRQ_WORK sfc: ef10: Fix assigning negative value to unsigned variable ALSA: jack: Access input_dev under mutex rtw88: 8821c: fix debugfs rssi value spi: spi-rspi: Remove setting {src,dst}_{addr,addr_width} based on DMA direction tools/power turbostat: fix ICX DRAM power numbers scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg() scsi: lpfc: Fix SCSI I/O completion and abort handler deadlock scsi: lpfc: Fix call trace observed during I/O with CMF enabled cpuidle: PSCI: Improve support for suspend-to-RAM for PSCI OSI mode drm/amd/pm: fix double free in si_parse_power_table() ASoC: rsnd: care default case on rsnd_ssiu_busif_err_status_clear() ASoC: rsnd: care return value from rsnd_node_fixed_index() ath9k: fix QCA9561 PA bias level media: venus: hfi: avoid null dereference in deinit media: pci: cx23885: Fix the error handling in cx23885_initdev() media: cx25821: Fix the warning when removing the module md/bitmap: don't set sb values if can't pass sanity check mmc: jz4740: Apply DMA engine limits to maximum segment size drivers: mmc: sdhci_am654: Add the quirk to set TESTCD bit scsi: megaraid: Fix error check return value of register_chrdev() drm/amdgpu/sdma: Fix incorrect calculations of the wptr of the doorbells scsi: ufs: Use pm_runtime_resume_and_get() instead of pm_runtime_get_sync() scsi: lpfc: Fix resource leak in lpfc_sli4_send_seq_to_ulp() ath11k: disable spectral scan during spectral deinit ASoC: Intel: bytcr_rt5640: Add quirk for the HP Pro Tablet 408 drm/plane: Move range check for format_count earlier drm/amd/pm: fix the compile warning ath10k: skip ath10k_halt during suspend for driver state RESTARTING arm64: compat: Do not treat syscall number as ESR_ELx for a bad syscall drm: msm: fix error check return value of irq_of_parse_and_map() scsi: target: tcmu: Fix possible data corruption ipv6: Don't send rs packets to the interface of ARPHRD_TUNNEL net/mlx5: fs, delete the FTE when there are no rules attached to it ASoC: dapm: Don't fold register value changes into notifications mlxsw: spectrum_dcb: Do not warn about priority changes mlxsw: Treat LLDP packets as control drm/amdgpu/psp: move PSP memory alloc from hw_init to sw_init drm/amdgpu/ucode: Remove firmware load type check in amdgpu_ucode_free_bo regulator: mt6315: Enforce regulator-compatible, not name HID: bigben: fix slab-out-of-bounds Write in bigben_probe of: Support more than one crash kernel regions for kexec -s ASoC: tscs454: Add endianness flag in snd_soc_component_driver scsi: lpfc: Alter FPIN stat accounting logic net: remove two BUG() from skb_checksum_help() s390/preempt: disable __preempt_count_add() optimization for PROFILE_ALL_BRANCHES perf/amd/ibs: Cascade pmu init functions' return value sched/core: Avoid obvious double update_rq_clock warning spi: stm32-qspi: Fix wait_cmd timeout in APM mode dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default ipmi:ssif: Check for NULL msg when handling events and messages ipmi: Fix pr_fmt to avoid compilation issues rtlwifi: Use pr_warn instead of WARN_ONCE mt76: mt7921: accept rx frames with non-standard VHT MCS10-11 mt76: fix encap offload ethernet type check media: rga: fix possible memory leak in rga_probe media: coda: limit frame interval enumeration to supported encoder frame sizes media: hantro: HEVC: unconditionnaly set pps_{cb/cr}_qp_offset values media: ccs-core.c: fix failure to call clk_disable_unprepare media: imon: reorganize serialization media: cec-adap.c: fix is_configuring state usbnet: Run unregister_netdev() before unbind() again openrisc: start CPU timer early in boot nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags ASoC: rt5645: Fix errorenous cleanup order nbd: Fix hung on disconnect request if socket is closed before drm/amd/pm: update smartshift powerboost calc for smu12 drm/amd/pm: update smartshift powerboost calc for smu13 net: phy: micrel: Allow probing without .driver_data media: exynos4-is: Fix compile warning media: hantro: Stop using H.264 parameter pic_num ASoC: max98357a: remove dependency on GPIOLIB ASoC: rt1015p: remove dependency on GPIOLIB ACPI: CPPC: Assume no transition latency if no PCCT nvme: set non-mdts limits in nvme_scan_work can: mcp251xfd: silence clang's -Wunaligned-access warning x86/microcode: Add explicit CPU vendor dependency net: ipa: ignore endianness if there is no header m68k: atari: Make Atari ROM port I/O write macros return void rxrpc: Return an error to sendmsg if call failed rxrpc, afs: Fix selection of abort codes afs: Adjust ACK interpretation to try and cope with NAT eth: tg3: silence the GCC 12 array-bounds warning char: tpm: cr50_i2c: Suppress duplicated error message in .remove() selftests/bpf: fix btf_dump/btf_dump due to recent clang change gfs2: use i_lock spin_lock for inode qadata scsi: target: tcmu: Avoid holding XArray lock when calling lock_page IB/rdmavt: add missing locks in rvt_ruc_loopback ARM: dts: ox820: align interrupt controller node name with dtschema ARM: dts: socfpga: align interrupt controller node name with dtschema ARM: dts: s5pv210: align DMA channels with dtschema arm64: dts: qcom: msm8994: Fix the cont_splash_mem address arm64: dts: qcom: msm8994: Fix BLSP[12]_DMA channels count PM / devfreq: rk3399_dmc: Disable edev on remove() crypto: ccree - use fine grained DMA mapping dir soc: ti: ti_sci_pm_domains: Check for null return of devm_kcalloc fs: jfs: fix possible NULL pointer dereference in dbFree() arm64: dts: qcom: sdm845-xiaomi-beryllium: fix typo in panel's vddio-supply property ALSA: usb-audio: Add quirk bits for enabling/disabling generic implicit fb ALSA: usb-audio: Move generic implicit fb quirk entries into quirks.c ARM: OMAP1: clock: Fix UART rate reporting algorithm powerpc/fadump: Fix fadump to work with a different endian capture kernel fat: add ratelimit to fat_ent_bread() pinctrl: renesas: rzn1: Fix possible null-ptr-deref in sh_pfc_map_resources() ARM: versatile: Add missing of_node_put in dcscb_init ARM: dts: exynos: add atmel,24c128 fallback to Samsung EEPROM ARM: hisi: Add missing of_node_put after of_find_compatible_node cpufreq: Avoid unnecessary frequency updates due to mismatch powerpc/rtas: Keep MSR[RI] set when calling RTAS PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store() KVM: PPC: Book3S HV Nested: L2 LPCR should inherit L1 LPES setting alpha: fix alloc_zeroed_user_highpage_movable() tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate powerpc/powernv/vas: Assign real address to rx_fifo in vas_rx_win_attr powerpc/xics: fix refcount leak in icp_opal_init() powerpc/powernv: fix missing of_node_put in uv_init() macintosh/via-pmu: Fix build failure when CONFIG_INPUT is disabled powerpc/iommu: Add missing of_node_put in iommu_init_early_dart smb3: check for null tcon RDMA/hfi1: Prevent panic when SDMA is disabled Input: gpio-keys - cancel delayed work only in case of GPIO drm: fix EDID struct for old ARM OABI format drm/bridge_connector: enable HPD by default if supported dt-bindings: display: sitronix, st7735r: Fix backlight in example drm/vmwgfx: Fix an invalid read ath11k: acquire ab->base_lock in unassign when finding the peer by addr drm: bridge: it66121: Fix the register page length ath9k: fix ar9003_get_eepmisc drm/edid: fix invalid EDID extension block filtering drm/bridge: adv7511: clean up CEC adapter when probe fails drm: bridge: icn6211: Fix register layout drm: bridge: icn6211: Fix HFP_HSW_HBP_HI and HFP_MIN handling mtd: spinand: gigadevice: fix Quad IO for GD5F1GQ5UExxG spi: qcom-qspi: Add minItems to interconnect-names ASoC: mediatek: Fix error handling in mt8173_max98090_dev_probe ASoC: mediatek: Fix missing of_node_put in mt2701_wm8960_machine_probe x86/delay: Fix the wrong asm constraint in delay_loop() drm/vc4: hvs: Fix frame count register readout drm/mediatek: Fix mtk_cec_mask() drm/vc4: hvs: Reset muxes at probe time drm/vc4: txp: Don't set TXP_VSTART_AT_EOF drm/vc4: txp: Force alpha to be 0xff if it's disabled libbpf: Don't error out on CO-RE relos for overriden weak subprogs x86/PCI: Fix ALi M1487 (IBC) PIRQ router link value interpretation mptcp: reset the packet scheduler on PRIO change nl80211: show SSID for P2P_GO interfaces drm/komeda: Fix an undefined behavior bug in komeda_plane_add() drm: mali-dp: potential dereference of null pointer spi: spi-ti-qspi: Fix return value handling of wait_for_completion_timeout scftorture: Fix distribution of short handler delays net: dsa: mt7530: 1G can also support 1000BASE-X link mode ixp4xx_eth: fix error check return value of platform_get_irq() NFC: NULL out the dev->rfkill to prevent UAF efi: Add missing prototype for efi_capsule_setup_info device property: Check fwnode->secondary when finding properties device property: Allow error pointer to be passed to fwnode APIs target: remove an incorrect unmap zeroes data deduction drbd: fix duplicate array initializer EDAC/dmc520: Don't print an error for each unconfigured interrupt line mtd: rawnand: denali: Use managed device resources HID: hid-led: fix maximum brightness for Dream Cheeky HID: elan: Fix potential double free in elan_input_configured drm/bridge: Fix error handling in analogix_dp_probe regulator: da9121: Fix uninit-value in da9121_assign_chip_model() drm/mediatek: dpi: Use mt8183 output formats for mt8192 signal: Deliver SIGTRAP on perf event asynchronously if blocked sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq sched/psi: report zeroes for CPU full at the system level spi: img-spfi: Fix pm_runtime_get_sync() error checking cpufreq: Fix possible race in cpufreq online error path printk: use atomic updates for klogd work printk: add missing memory barrier to wake_up_klogd() printk: wake waiters for safe and NMI contexts ath9k_htc: fix potential out of bounds access with invalid rxstatus->rs_keyix media: i2c: max9286: Use dev_err_probe() helper media: i2c: max9286: Use "maxim,gpio-poc" property media: i2c: max9286: fix kernel oops when removing module media: hantro: Empty encoder capture buffers by default drm/panel: simple: Add missing bus flags for Innolux G070Y2-L01 ALSA: pcm: Check for null pointer of pointer substream before dereferencing it mtdblock: warn if opened on NAND inotify: show inotify mask flags in proc fdinfo fsnotify: fix wrong lockdep annotations spi: rockchip: Stop spi slave dma receiver when cs inactive spi: rockchip: Preset cs-high and clk polarity in setup progress spi: rockchip: fix missing error on unsupported SPI_CS_HIGH of: overlay: do not break notify on NOTIFY_{OK\|STOP} selftests/damon: add damon to selftests root Makefile drm/msm/dp: Modify prototype of encoder based API drm/msm/hdmi: switch to drm_bridge_connector drm/msm/dpu: adjust display_v_end for eDP and DP scsi: iscsi: Fix harmless double shift bug scsi: ufs: qcom: Fix ufs_qcom_resume() scsi: ufs: core: Exclude UECxx from SFR dump list drm/v3d: Fix null pointer dereference of pointer perfmon selftests/resctrl: Fix null pointer dereference on open failed libbpf: Fix logic for finding matching program for CO-RE relocation mtd: spi-nor: core: Check written SR value in spi_nor_write_16bit_sr_and_check() x86/pm: Fix false positive kmemleak report in msr_build_context() mtd: rawnand: cadence: fix possible null-ptr-deref in cadence_nand_dt_probe() mtd: rawnand: intel: fix possible null-ptr-deref in ebu_nand_probe() x86/speculation: Add missing prototype for unpriv_ebpf_notify() ASoC: rk3328: fix disabling mclk on pclk probe failure perf tools: Add missing headers needed by util/data.h drm/msm/disp/dpu1: set vbif hw config to NULL to avoid use after memory free during pm runtime resume drm/msm/dp: stop event kernel thread when DP unbind drm/msm/dp: fix error check return value of irq_of_parse_and_map() drm/msm/dp: reset DP controller before transmit phy test pattern drm/msm/dp: do not stop transmitting phy test pattern during DP phy compliance test drm/msm/dsi: fix error checks and return values for DSI xmit functions drm/msm/hdmi: check return value after calling platform_get_resource_byname() drm/msm/hdmi: fix error check return value of irq_of_parse_and_map() drm/msm: add missing include to msm_drv.c drm/panel: panel-simple: Fix proper bpc for AM-1280800N3TZQW-T00H kunit: fix debugfs code to use enum kunit_status, not bool drm/rockchip: vop: fix possible null-ptr-deref in vop_bind() spi: cadence-quadspi: fix Direct Access Mode disable for SoCFPGA perf tools: Use Python devtools for version autodetection rather than runtime virtio_blk: fix the discard_granularity and discard_alignment queue limits nl80211: don't hold RTNL in color change request x86: Fix return value of __setup handlers irqchip/exiu: Fix acknowledgment of edge triggered interrupts irqchip/aspeed-i2c-ic: Fix irq_of_parse_and_map() return value irqchip/aspeed-scu-ic: Fix irq_of_parse_and_map() return value x86/mm: Cleanup the control_va_addr_alignment() __setup handler arm64: fix types in copy_highpage() regulator: core: Fix enable_count imbalance with EXCLUSIVE_GET drm/msm/dsi: fix address for second DSI PHY on SDM660 drm/msm/dp: fix event thread stuck in wait_event after kthread_stop() drm/msm/mdp5: Return error code in mdp5_pipe_release when deadlock is detected drm/msm/mdp5: Return error code in mdp5_mixer_release when deadlock is detected drm/msm: return an error pointer in msm_gem_prime_get_sg_table() media: uvcvideo: Fix missing check to determine if element is found in list arm64: stackleak: fix current_top_of_stack() iomap: iomap_write_failed fix spi: spi-fsl-qspi: check return value after calling platform_get_resource_byname() Revert "cpufreq: Fix possible race in cpufreq online error path" regulator: qcom_smd: Fix up PM8950 regulator configuration samples: bpf: Don't fail for a missing VMLINUX_BTF when VMLINUX_H is provided perf/amd/ibs: Use interrupt regs ip for stack unwinding ath11k: Don't check arvif->is_started before sending management frames wilc1000: fix crash observed in AP mode with cfg80211_register_netdevice() HID: amd_sfh: Modify the bus name HID: amd_sfh: Modify the hid name ASoC: fsl: Use dev_err_probe() helper ASoC: fsl: Fix refcount leak in imx_sgtl5000_probe ASoC: imx-hdmi: Fix refcount leak in imx_hdmi_probe ASoC: mxs-saif: Fix refcount leak in mxs_saif_probe regulator: pfuze100: Fix refcount leak in pfuze_parse_regulators_dt dma-direct: factor out a helper for DMA_ATTR_NO_KERNEL_MAPPING allocations dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages ASoC: samsung: Use dev_err_probe() helper ASoC: samsung: Fix refcount leak in aries_audio_probe block: Fix the bio.bi_opf comment kselftest/cgroup: fix test_stress.sh to use OUTPUT dir scripts/faddr2line: Fix overlapping text section failures media: aspeed: Fix an error handling path in aspeed_video_probe() media: exynos4-is: Fix PM disable depth imbalance in fimc_is_probe mt76: mt7921: Fix the error handling path of mt7921_pci_probe() mt76: do not attempt to reorder received 802.3 packets without agg session media: st-delta: Fix PM disable depth imbalance in delta_probe media: atmel: atmel-isc: Fix PM disable depth imbalance in atmel_isc_probe media: i2c: rdacm2x: properly set subdev entity function media: exynos4-is: Change clk_disable to clk_disable_unprepare media: pvrusb2: fix array-index-out-of-bounds in pvr2_i2c_core_init media: vsp1: Fix offset calculation for plane cropping media: atmel: atmel-sama5d2-isc: fix wrong mask in YUYV format check media: hantro: HEVC: Fix tile info buffer value computation Bluetooth: fix dangling sco_conn and use-after-free in sco_sock_timeout Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring Bluetooth: use hdev lock for accept_list and reject_list in conn req nvme: set dma alignment to dword m68k: math-emu: Fix dependencies of math emulation support sctp: read sk->sk_bound_dev_if once in sctp_rcv() net: hinic: add missing destroy_workqueue in hinic_pf_to_mgmt_init ASoC: ti: j721e-evm: Fix refcount leak in j721e_soc_probe_ kselftest/arm64: bti: force static linking media: ov7670: remove ov7670_power_off from ov7670_remove media: i2c: ov5648: fix wrong pointer passed to IS_ERR() and PTR_ERR() media: staging: media: rkvdec: Make use of the helper function devm_platform_ioremap_resource() media: rkvdec: h264: Fix dpb_valid implementation media: rkvdec: h264: Fix bit depth wrap in pps packet regulator: scmi: Fix refcount leak in scmi_regulator_probe ext4: reject the 'commit' option on ext2 filesystems drm/msm/a6xx: Fix refcount leak in a6xx_gpu_init drm: msm: fix possible memory leak in mdp5_crtc_cursor_set() x86/sev: Annotate stack change in the #VC handler drm/msm: don't free the IRQ if it was not requested selftests/bpf: Add missed ima_setup.sh in Makefile drm/msm/dpu: handle pm_runtime_get_sync() errors in bind path drm/i915: Fix CFI violation with show_dynamic_id() thermal/drivers/bcm2711: Don't clamp temperature at zero thermal/drivers/broadcom: Fix potential NULL dereference in sr_thermal_probe thermal/core: Fix memory leak in __thermal_cooling_device_register() thermal/drivers/imx_sc_thermal: Fix refcount leak in imx_sc_thermal_probe bfq: Relax waker detection for shared queues bfq: Allow current waker to defend against a tentative one ASoC: wm2000: fix missing clk_disable_unprepare() on error in wm2000_anc_transition() PM: domains: Fix initialization of genpd's next_wakeup net: macb: Fix PTP one step sync support NFC: hci: fix sleep in atomic context bugs in nfc_hci_hcp_message_tx ASoC: max98090: Move check for invalid values before casting in max98090_put_enab_tlv() net: stmmac: selftests: Use kcalloc() instead of kzalloc() net: stmmac: fix out-of-bounds access in a selftest hv_netvsc: Fix potential dereference of NULL pointer hwmon: (pmbus) Check PEC support before reading other registers rxrpc: Fix listen() setting the bar too high for the prealloc rings rxrpc: Don't try to resend the request if we're receiving the reply rxrpc: Fix overlapping ACK accounting rxrpc: Don't let ack.previousPacket regress rxrpc: Fix decision on when to generate an IDLE ACK net: huawei: hinic: Use devm_kcalloc() instead of devm_kzalloc() hinic: Avoid some over memory allocation net: dsa: restrict SMSC_LAN9303_I2C kconfig net/smc: postpone sk_refcnt increment in connect() dma-direct: factor out dma_set_{de,en}crypted helpers dma-direct: don't call dma_set_decrypted for remapped allocations dma-direct: always leak memory that can't be re-encrypted dma-direct: don't over-decrypt memory arm64: dts: rockchip: Move drive-impedance-ohm to emmc phy on rk3399 arm64: dts: mt8192: Fix nor_flash status disable typo PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3 memory: samsung: exynos5422-dmc: Avoid some over memory allocation ARM: dts: BCM5301X: update CRU block description ARM: dts: BCM5301X: Update pin controller node name ARM: dts: suniv: F1C100: fix watchdog compatible soc: qcom: smp2p: Fix missing of_node_put() in smp2p_parse_ipc soc: qcom: smsm: Fix missing of_node_put() in smsm_parse_ipc PCI: cadence: Fix find_first_zero_bit() limit PCI: rockchip: Fix find_first_zero_bit() limit PCI: mediatek: Fix refcount leak in mtk_pcie_subsys_powerup() PCI: dwc: Fix setting error return on MSI DMA mapping failure ARM: dts: ci4x10: Adapt to changes in imx6qdl.dtsi regarding fec clocks soc: qcom: llcc: Add MODULE_DEVICE_TABLE() KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault crypto: qat - set CIPHER capability for QAT GEN2 crypto: qat - set COMPRESSION capability for QAT GEN2 crypto: qat - set CIPHER capability for DH895XCC crypto: qat - set COMPRESSION capability for DH895XCC platform/chrome: cros_ec: fix error handling in cros_ec_register() ARM: dts: imx6dl-colibri: Fix I2C pinmuxing platform/chrome: Re-introduce cros_ec_cmd_xfer and use it for ioctls can: xilinx_can: mark bit timing constants as const ARM: dts: stm32: Fix PHY post-reset delay on Avenger96 ARM: dts: bcm2835-rpi-zero-w: Fix GPIO line name for Wifi/BT ARM: dts: bcm2837-rpi-cm3-io3: Fix GPIO line names for SMPS I2C ARM: dts: bcm2837-rpi-3-b-plus: Fix GPIO line name of power LED ARM: dts: bcm2835-rpi-b: Fix GPIO line names misc: ocxl: fix possible double free in ocxl_file_register_afu crypto: marvell/cesa - ECB does not IV gpiolib: of: Introduce hook for missing gpio-ranges pinctrl: bcm2835: implement hook for missing gpio-ranges arm: mediatek: select arch timer for mt7629 pinctrl/rockchip: support deferring other gpio params pinctrl: mediatek: mt8195: enable driver on mtk platforms arm64: dts: qcom: qrb5165-rb5: Fix can-clock node name Drivers: hv: vmbus: Fix handling of messages with transaction ID of zero powerpc/fadump: fix PT_LOAD segment for boot memory area mfd: ipaq-micro: Fix error check return value of platform_get_irq() scsi: fcoe: Fix Wstringop-overflow warnings in fcoe_wwn_from_mac() soc: bcm: Check for NULL return of devm_kzalloc() arm64: dts: ti: k3-am64-mcu: remove incorrect UART base clock rates ASoC: sh: rz-ssi: Check return value of pm_runtime_resume_and_get() ASoC: sh: rz-ssi: Propagate error codes returned from platform_get_irq_byname() ASoC: sh: rz-ssi: Release the DMA channels in rz_ssi_probe() error path firmware: arm_scmi: Fix list protocols enumeration in the base protocol nvdimm: Fix firmware activation deadlock scenarios nvdimm: Allow overwrite in the presence of disabled dimms pinctrl: mvebu: Fix irq_of_parse_and_map() return value drivers/base/node.c: fix compaction sysfs file leak dax: fix cache flush on PMD-mapped pages drivers/base/memory: fix an unlikely reference counting issue in __add_memory_block() firmware: arm_ffa: Fix uuid parameter to ffa_partition_probe firmware: arm_ffa: Remove incorrect assignment of driver_data list: introduce list_is_head() helper and re-use it in list.h list: fix a data-race around ep->rdllist drm/msm/dpu: fix error check return value of irq_of_parse_and_map() powerpc/8xx: export 'cpm_setbrg' for modules pinctrl: renesas: r8a779a0: Fix GPIO function on I2C-capable pins pinctrl: renesas: core: Fix possible null-ptr-deref in sh_pfc_map_resources() powerpc/idle: Fix return value of __setup() handler powerpc/4xx/cpm: Fix return value of __setup() handler RDMA/hns: Add the detection for CMDQ status in the device initialization process arm64: dts: marvell: espressobin-ultra: fix SPI-NOR config arm64: dts: marvell: espressobin-ultra: enable front USB3 port ASoC: atmel-pdmic: Remove endianness flag on pdmic component ASoC: atmel-classd: Remove endianness flag on class d component proc: fix dentry/inode overinstantiating under /proc/${pid}/net ipc/mqueue: use get_tree_nodev() in mqueue_get_tree() PCI: imx6: Fix PERST# start-up sequence tty: fix deadlock caused by calling printk() under tty_port->lock crypto: sun8i-ss - rework handling of IV crypto: sun8i-ss - handle zero sized sg crypto: cryptd - Protect per-CPU resource by disabling BH. ARM: dts: at91: sama7g5: remove interrupt-parent from gic node hugetlbfs: fix hugetlbfs_statfs() locking Input: sparcspkr - fix refcount leak in bbc_beep_probe PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits PCI: microchip: Fix potential race in interrupt handling hwrng: omap3-rom - fix using wrong clk_disable() in omap_rom_rng_runtime_resume() powerpc/64: Only WARN if __pa()/__va() called with bad addresses powerpc/perf: Fix the threshold compare group constraint for power10 powerpc/perf: Fix the threshold compare group constraint for power9 macintosh: via-pmu and via-cuda need RTC_LIB powerpc/xive: Add some error handling code to 'xive_spapr_init()' powerpc/xive: Fix refcount leak in xive_spapr_init powerpc/fsl_rio: Fix refcount leak in fsl_rio_setup mfd: davinci_voicecodec: Fix possible null-ptr-deref davinci_vc_probe() nfsd: destroy percpu stats counters after reply cache shutdown mailbox: forward the hrtimer if not queued and under a lock RDMA/hfi1: Prevent use of lock before it is initialized KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer Input: stmfts - do not leave device disabled in stmfts_input_open OPP: call of_node_put() on error path in _bandwidth_supported() f2fs: support fault injection for dquot_initialize() f2fs: fix to do sanity check on inline_dots inode f2fs: fix dereference of stale list iterator after loop body iommu/amd: Enable swiotlb in all cases iommu/mediatek: Fix 2 HW sharing pgtable issue iommu/mediatek: Add list_del in mtk_iommu_remove iommu/mediatek: Remove clk_disable in mtk_iommu_remove iommu/mediatek: Add mutex for m4u_group and m4u_dom in data i2c: at91: use dma safe buffers cpufreq: mediatek: Use module_init and add module_exit cpufreq: mediatek: Unregister platform device on exit iommu/arm-smmu-v3-sva: Fix mm use-after-free MIPS: Loongson: Use hwmon_device_register_with_groups() to register hwmon iommu/mediatek: Fix NULL pointer dereference when printing dev_name i2c: at91: Initialize dma_buf in at91_twi_xfer() dmaengine: idxd: Fix the error handling path in idxd_cdev_register() NFS: Do not report EINTR/ERESTARTSYS as mapping errors NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS NFS: Don't report ENOSPC write errors twice NFS: Do not report flush errors in nfs_write_end() NFS: Don't report errors from nfs_pageio_complete() more than once NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout NFS: Further fixes to the writeback error handling video: fbdev: clcdfb: Fix refcount leak in clcdfb_of_vram_setup dmaengine: stm32-mdma: remove GISR1 register dmaengine: stm32-mdma: fix chan initialization in stm32_mdma_irq_handler() iommu/amd: Increase timeout waiting for GA log enablement i2c: npcm: Fix timeout calculation i2c: npcm: Correct register access width i2c: npcm: Handle spurious interrupts i2c: rcar: fix PM ref counts in probe error paths perf build: Fix btf__load_from_kernel_by_id() feature check perf c2c: Use stdio interface if slang is not supported perf jevents: Fix event syntax error caused by ExtSel video: fbdev: vesafb: Fix a use-after-free due early fb_info cleanup NFS: Always initialise fattr->label in nfs_fattr_alloc() NFS: Create a new nfs_alloc_fattr_with_label() function NFS: Convert GFP_NOFS to GFP_KERNEL NFSv4.1 mark qualified async operations as MOVEABLE tasks f2fs: fix to avoid f2fs_bug_on() in dec_valid_node_count() f2fs: fix to do sanity check on block address in f2fs_do_zero_range() f2fs: fix to clear dirty inode in f2fs_evict_inode() f2fs: fix deadloop in foreground GC f2fs: don't need inode lock for system hidden quota f2fs: fix to do sanity check on total_data_blocks f2fs: don't use casefolded comparison for "." and ".." f2fs: fix fallocate to use file_modified to update permissions consistently f2fs: fix to do sanity check for inline inode objtool: Fix objtool regression on x32 systems objtool: Fix symbol creation wifi: mac80211: fix use-after-free in chanctx code iwlwifi: mvm: fix assert 1F04 upon reconfig fs-writeback: writeback_sb_inodes：Recalculate 'wrote' according skipped pages efi: Do not import certificates from UEFI Secure Boot for T2 Macs bfq: Avoid false marking of bic as stably merged bfq: Avoid merging queues with different parents bfq: Split shared queues on move between cgroups bfq: Update cgroup information before merging bio bfq: Drop pointless unlock-lock pair bfq: Remove pointless bfq_init_rq() calls bfq: Track whether bfq_group is still online bfq: Get rid of __bio_blkcg() usage bfq: Make sure bfqg for which we are queueing requests is online ext4: mark group as trimmed only if it was fully scanned ext4: fix use-after-free in ext4_rename_dir_prepare ext4: fix race condition between ext4_write and ext4_convert_inline_data ext4: fix warning in ext4_handle_inode_extension ext4: fix bug_on in ext4_writepages ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state ext4: fix bug_on in __es_tree_search ext4: verify dir block before splitting it ext4: avoid cycles in directory h-tree ACPI: property: Release subnode properties with data nodes tty: goldfish: Introduce gf_ioread32()/gf_iowrite32() tracing: Fix potential double free in create_var_ref() tracing: Initialize integer variable to prevent garbage return value drm/amdgpu: add beige goby PCI ID PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299 PCI: qcom: Fix runtime PM imbalance on probe errors PCI: qcom: Fix unbalanced PHY init on probe errors staging: r8188eu: prevent ->Ssid overflow in rtw_wx_set_scan() mm, compaction: fast_find_migrateblock() should return pfn in the target zone s390/perf: obtain sie_block from the right address s390/stp: clock_delta should be signed dlm: fix plock invalid read dlm: uninitialized variable on error in dlm_listen_for_all() dlm: fix missing lkb refcount handling ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock scsi: dc395x: Fix a missing check on list iterator scsi: ufs: qcom: Add a readl() to make sure ref_clk gets enabled landlock: Add clang-format exceptions landlock: Format with clang-format selftests/landlock: Add clang-format exceptions selftests/landlock: Normalize array assignment selftests/landlock: Format with clang-format samples/landlock: Add clang-format exceptions samples/landlock: Format with clang-format landlock: Fix landlock_add_rule(2) documentation selftests/landlock: Make tests build with old libc selftests/landlock: Extend tests for minimal valid attribute size selftests/landlock: Add tests for unknown access rights selftests/landlock: Extend access right tests to directories selftests/landlock: Fully test file rename with "remove" access selftests/landlock: Add tests for O_PATH landlock: Change landlock_add_rule(2) argument check ordering landlock: Change landlock_restrict_self(2) check ordering selftests/landlock: Test landlock_create_ruleset(2) argument check ordering landlock: Define access_mask_t to enforce a consistent access mask size landlock: Reduce the maximum number of layers to 16 landlock: Create find_rule() from unmask_layers() landlock: Fix same-layer rule unions drm/amdgpu/cs: make commands with 0 chunks illegal behaviour. drm/nouveau/subdev/bus: Ratelimit logging for fault errors drm/etnaviv: check for reaped mapping in etnaviv_iommu_unmap_gem drm/nouveau/clk: Fix an incorrect NULL check on list iterator drm/nouveau/kms/nv50-: atom: fix an incorrect NULL check on list iterator drm/bridge: analogix_dp: Grab runtime PM reference for DP-AUX drm/i915/dsi: fix VBT send packet port selection for ICL+ md: fix an incorrect NULL check in does_sb_need_changing md: fix an incorrect NULL check in md_reload_sb mtd: cfi_cmdset_0002: Move and rename chip_check/chip_ready/chip_good_for_write mtd: cfi_cmdset_0002: Use chip_ready() for write on S29GL064N media: coda: Fix reported H264 profile media: coda: Add more H264 levels for CODA960 ima: remove the IMA_TEMPLATE Kconfig option Kconfig: Add option for asm goto w/ tied outputs to workaround clang-13 bug RDMA/hfi1: Fix potential integer multiplication overflow errors mmc: core: Allows to override the timeout value for ioctl() path csky: patch_text: Fixup last cpu should be master irqchip/armada-370-xp: Do not touch Performance Counter Overflow on A375, A38x, A39x irqchip: irq-xtensa-mx: fix initial IRQ affinity thermal: devfreq_cooling: use local ops instead of global ops cfg80211: declare MODULE_FIRMWARE for regulatory.db mac80211: upgrade passive scan to active scan on DFS channels after beacon rx um: Use asm-generic/dma-mapping.h um: chan_user: Fix winch_tramp() return value um: Fix out-of-bounds read in LDT setup kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add] ftrace: Clean up hash direct_functions on register failures ksmbd: fix outstanding credits related bugs iommu/msm: Fix an incorrect NULL check on list iterator iommu/dma: Fix iova map result check bug Revert "mm/cma.c: remove redundant cma_mutex lock" mm/page_alloc: always attempt to allocate at least one page during bulk allocation nodemask.h: fix compilation error with GCC12 hugetlb: fix huge_pmd_unshare address update mm/memremap: fix missing call to untrack_pfn() in pagemap_range() xtensa/simdisk: fix proc_read_simdisk() rtl818x: Prevent using not initialized queues ASoC: rt5514: Fix event generation for "DSP Voice Wake Up" control carl9170: tx: fix an incorrect use of list iterator stm: ltdc: fix two incorrect NULL checks on list iterator bcache: improve multithreaded bch_btree_check() bcache: improve multithreaded bch_sectors_dirty_init() bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() bcache: avoid journal no-space deadlock by reserving 1 journal bucket serial: pch: don't overwrite xmit->buf[0] by x_char tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator gma500: fix an incorrect NULL check on list iterator arm64: dts: qcom: ipq8074: fix the sleep clock frequency arm64: tegra: Add missing DFLL reset on Tegra210 clk: tegra: Add missing reset deassertion phy: qcom-qmp: fix struct clk leak on probe errors ARM: dts: s5pv210: Remove spi-cs-high on panel in Aries ARM: pxa: maybe fix gpio lookup tables SMB3: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0 dt-bindings: gpio: altera: correct interrupt-cells vdpasim: allow to enable a vq repeatedly blk-iolatency: Fix inflight count imbalances and IO hangs on offline coresight: core: Fix coresight device probe failure issue phy: qcom-qmp: fix reset-controller leak on probe errors net: ipa: fix page free in ipa_endpoint_trans_release() net: ipa: fix page free in ipa_endpoint_replenish_one() kseltest/cgroup: Make test_stress.sh work if run interactively list: test: Add a test for list_is_head() Revert "random: use static branch for crng_ready()" staging: r8188eu: delete rtw_wx_read/write32() RDMA/hns: Remove the num_cqc_timer variable RDMA/rxe: Generate a completion for unsupported/invalid opcode MIPS: IP27: Remove incorrect `cpu_has_fpu' override MIPS: IP30: Remove incorrect `cpu_has_fpu' override ext4: only allow test_dummy_encryption when supported interconnect: qcom: sc7180: Drop IP0 interconnects interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate fs: add two trivial lookup helpers exportfs: support idmapped mounts fs/ntfs3: Fix invalid free in log_replay md: Don't set mddev private to NULL in raid0 pers->free md: fix double free of io_acct_set bioset md: bcache: check the return value of kzalloc() in detached_dev_do_request() pinctrl/rockchip: support setting input-enable param block: fix bio_clone_blkg_association() to associate with proper blkcg_gq Linux 5.15.46 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I7b65df29c22a01b81a94cd844867a18e73098a15	2022-07-13 11:40:42 +02:00
Rei Yamamoto	20e6ec76ae	mm, compaction: fast_find_migrateblock() should return pfn in the target zone commit bbe832b9db2e1ad21522f8f0bf02775fff8a0e0e upstream. At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit `a984226f45` ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock. Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone. Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com Fixes: `70b44595ea` ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Don Dutile <ddutile@redhat.com> Cc: Wonhyuk Yang <vvghjk1234@gmail.com> Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-09 10:23:21 +02:00
Carlos Llamas	061e34c52e	ANDROID: mm: compaction: fix isolate_and_split_free_page() redefinition Guard isolate_and_split_free_page() with CONFIG_COMPACTION. This fixes the follwoing build error as the function collides with its inline stub from the header file: mm/compaction.c:766:15: error: redefinition of ‘isolate_and_split_free_page’ 766 \| unsigned long isolate_and_split_free_page(struct page page, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from mm/compaction.c:14: ./include/linux/compaction.h:241:29: note: previous definition of ‘isolate_and_split_free_page’ was here 241 \| static inline unsigned long isolate_and_split_free_page(struct page page, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~ Bug: 201263307 Fixes: `8cd9aa93b7` ("ANDROID: implement wrapper for reverse migration") Reported-by: kernelci.org bot <bot@kernelci.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Change-Id: Ie8f3fedcc9d4af5cfdcfd5829377671745ab77d6	2022-03-19 05:19:23 +00:00
Charan Teja Reddy	f47b852faa	ANDROID: implement wrapper for reverse migration Reverse migration is used to do the balancing the occupancy of memory zones in a node in the system whose imabalance may be caused by migration of pages to other zones by an operation, eg: hotremove and then hotadding the same memory. In this case there is a lot of free memory in newly hotadd memory which can be filled up by the previous migrated pages(as part of offline/hotremove) thus may free up some pressure in other zones of the node. Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/ Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c Bug: 201263307 Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>	2022-03-17 21:15:46 +00:00
Linus Torvalds	2d338201d5	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "147 patches, based on `7d2a07b769`. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...	2021-09-08 12:55:35 -07:00
Mike Rapoport	859a85ddf9	mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-08 11:50:22 -07:00
Charan Teja Reddy	65d759c8f9	mm: compaction: support triggering of proactive compaction by user The proactive compaction[1] gets triggered for every 500msec and run compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on the value set to sysctl.compaction_proactiveness. Triggering the compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on the embedded system usecases which may have few MB's of RAM. Enabling the proactive compaction in its state will endup in running almost always on such systems. Other side, proactive compaction can still be very much useful for getting a set of higher order pages in some controllable manner(controlled by using the sysctl.compaction_proactiveness). So, on systems where enabling the proactive compaction always may proove not required, can trigger the same from user space on write to its sysctl interface. As an example, say app launcher decide to launch the memory heavy application which can be launched fast if it gets more higher order pages thus launcher can prepare the system in advance by triggering the proactive compaction from userspace. This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by user. [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a [akpm@linux-foundation.org: tweak vm.rst, per Mike] Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:17 -07:00
Charan Teja Reddy	e1e92bfa38	mm: compaction: optimize proactive compaction deferrals Vlastimil Babka figured out that when fragmentation score didn't go down across the proactive compaction i.e. when no progress is made, next wake up for proactive compaction is deferred for 1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 times, with each wakeup interval of HPAGE_FRAG_CHECK_INTERVAL_MSEC(=500). In each of this wakeup, it just decrement 'proactive_defer' counter and goes sleep i.e. it is getting woken to just decrement a counter. The same deferral time can also achieved by simply doing the HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT thus unnecessary wakeup of kcompact thread is avoided thus also removes the need of 'proactive_defer' thread counter. [akpm@linux-foundation.org: tweak comment] Link: https://lore.kernel.org/linux-fsdevel/88abfdb6-2c13-b5a6-5b46-742d12d1c910@suse.cz/ Link: https://lkml.kernel.org/r/1626869599-25412-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:17 -07:00
Yang Shi	5ac95884a7	mm/migrate: enable returning precise migrate_pages() success count Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Linus Torvalds	71bd934101	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...	2021-07-02 12:08:10 -07:00
Wonhyuk Yang	b55ca5264b	mm/compaction: fix 'limit' in fast_isolate_freepages Because of 'min(1, ...)', fast_isolate_freepages set 'limit' to 0 or 1. This takes away the opportunities of find candinate pages. So, by making enough scans available, increases the probability of finding the appropriate freepage. Tested it on the thpscale and the results are as follows. 5.12.0 5.12.0 valnilla patched Amean fault-both-1 598.15 ( 0.00%) 592.56 ( 0.93%) Amean fault-both-3 1494.47 ( 0.00%) 1514.35 ( -1.33%) Amean fault-both-5 2519.48 ( 0.00%) 2471.76 ( 1.89%) Amean fault-both-7 3173.85 ( 0.00%) 3079.19 ( 2.98%) Amean fault-both-12 8063.83 ( 0.00%) 7858.29 ( 2.55%) Amean fault-both-18 8781.20 ( 0.00%) 7827.70 * 10.86%* Amean fault-both-24 12576.44 ( 0.00%) 12250.20 ( 2.59%) Amean fault-both-30 18503.27 ( 0.00%) 17528.11 * 5.27%* Amean fault-both-32 16133.69 ( 0.00%) 13874.24 * 14.00%* 5.12.0 5.12.0 vanilla patched Ops Compaction migrate scanned 6547133.00 5963901.00 Ops Compaction free scanned 32452453.00 26609101.00 5.12 5.12 vanilla patched Duration User 27.99 28.84 Duration System 244.08 236.76 Duration Elapsed 78.27 78.38 Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com Fixes: `5a811889de` ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-30 20:47:29 -07:00
Liu Xiang	d2155fe54d	mm: compaction: remove duplicate !list_empty(&sublist) check The list_splice_tail(&sublist, freelist) also do !list_empty(&sublist) check, so remove the duplicate call. Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.com Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-30 20:47:29 -07:00
YueHaibing	17adb230d6	mm/compaction: use DEVICE_ATTR_WO macro Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-30 20:47:29 -07:00
Linus Torvalds	65090f30ab	Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ...	2021-06-29 17:29:11 -07:00
Muchun Song	a984226f45	mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-06-29 10:53:50 -07:00
Peter Zijlstra	b03fbd4ff2	sched: Introduce task_is_running() Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org	2021-06-18 11:43:07 +02:00
Ingo Molnar	f0953a1bba	mm: fix typos in comments Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-07 00:26:35 -07:00
Zhiyuan Dai	68d68ff6eb	mm/mempool: minor coding style tweaks Various coding style tweaks to various files under mm/ [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:27 -07:00
Minchan Kim	361a2a229f	mm: replace migrate_[prep\|finish] with lru_cache_[disable\|enable] Currently, migrate_[prep\|finish] is merely a wrapper of lru_cache_[disable\|enable]. There is not much to gain from having additional abstraction. Use lru_cache_[disable\|enable] instead of migrate_[prep\|finish], which would be more descriptive. note: migrate_prep_local in compaction.c changed into lru_add_drain to avoid CPU schedule cost with involving many other CPUs to keep old behavior. Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:24 -07:00
Charan Teja Reddy	06dac2f467	mm: compaction: update the COMPACT[STALL\|FAIL] events properly By definition, COMPACT[STALL\|FAIL] events needs to be counted when there is 'At least in one zone compaction wasn't deferred or skipped from the direct compaction'. And when compaction is skipped or deferred, COMPACT_SKIPPED will be returned but it will still go and update these compaction events which is wrong in the sense that COMPACT[STALL\|FAIL] is counted without even trying the compaction. Correct this by skipping the counting of these events when COMPACT_SKIPPED is returned for compaction. This indirectly also avoid the unnecessary try into the get_page_from_freelist() when compaction is not even tried. There is a corner case where compaction is skipped but still count COMPACTSTALL event, which is that IRQ came and freed the page and the same is captured in capture_control. Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:24 -07:00
Pintu Kumar	ef49843841	mm/compaction: remove unused variable sysctl_compact_memory The sysctl_compact_memory is mostly unused in mm/compaction.c It just acts as a place holder for sysctl to store .data. But the .data itself is not needed here. So we can get ride of this variable completely and make .data as NULL. This will also eliminate the extern declaration from header file. No functionality is broken or changed this way. Link: https://lkml.kernel.org/r/1614852224-14671-1-git-send-email-pintu@codeaurora.org Signed-off-by: Pintu Kumar <pintu@codeaurora.org> Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:24 -07:00
Oscar Salvador	ae37c7ff79	mm: make alloc_contig_range handle in-use hugetlb pages alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:22 -07:00
Oscar Salvador	369fa227c2	mm: make alloc_contig_range handle free hugetlb pages alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated. This has proved to be problematic for some users of alloc_contic_range, e.g: CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free. We can do better by trying to replace such page. Free hugepages are tricky to handle so as to no userspace application notices disruption, we need to replace the current free hugepage with a new one. In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool. The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valulable because in case we need to free the new page, we only need to call __free_pages(). Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used. There is one tricky case when page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make a forward progress. E.g: CPU0 CPU1 free_huge_page() isolate_or_dissolve_huge_page PageHuge() == T alloc_and_dissolve_huge_page alloc_buddy_huge_page() spin_lock_irq(hugetlb_lock) // PageHuge() && !PageHugeFreed && // !PageCount() spin_unlock_irq(hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) update_and_free_page PageHuge() == F __free_pages() 2) enqueue_huge_page SetPageHugeFreed() spin_unlock_irq(&hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) PageHuge() == F (freed by case#1 from CPU0) 2) PageHuge() == T PageHugeFreed() == T - proceed with replacing the page In the case above we retry as the window race is quite small and we have high chances to succeed next time. With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fallback on other node's zones. Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range. Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:22 -07:00
Oscar Salvador	c2ad7a1ffe	mm,compaction: let isolate_migratepages_{range,block} return error codes Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This does not work as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions like e.g: alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-05 11:27:22 -07:00
Vlastimil Babka	6e2b7044c1	mm, compaction: make fast_isolate_freepages() stay within zone Compaction always operates on pages from a single given zone when isolating both pages to migrate and freepages. Pageblock boundaries are intersected with zone boundaries to be safe in case zone starts or ends in the middle of pageblock. The use of pageblock_pfn_to_page() protects against non-contiguous pageblocks. The functions fast_isolate_freepages() and fast_isolate_around() don't currently protect the fast freepage isolation thoroughly enough against these corner cases, and can result in freepage isolation operate outside of zone boundaries: - in fast_isolate_freepages() if we get a pfn from the first pageblock of a zone that starts in the middle of that pageblock, 'highest' can be a pfn outside of the zone. If we fail to isolate anything in this function, we may then call fast_isolate_around() on a pfn outside of the zone and there effectively do a set_pageblock_skip(page_to_pfn(highest)) which may currently hit a VM_BUG_ON() in some configurations - fast_isolate_around() checks only the zone end boundary and not beginning, nor that the pageblock is contiguous (with pageblock_pfn_to_page()) so it's possible that we end up calling isolate_freepages_block() on a range of pfn's from two different zones and end up e.g. isolating freepages under the wrong zone's lock. This patch should fix the above issues. Link: https://lkml.kernel.org/r/20210217173300.6394-1-vbabka@suse.cz Fixes: `5a811889de` ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:34 -08:00
Wonhyuk Yang	15d28d0d11	mm/compaction: fix misbehaviors of fast_find_migrateblock() In the fast_find_migrateblock(), it iterates ocer the freelist to find the proper pageblock. But there are some misbehaviors. First, if the page we found is equal to cc->migrate_pfn, it is considered that we didn't find a suitable pageblock. Secondly, if the loop was terminated because order is less than PAGE_ALLOC_COSTLY_ORDER, it could be considered that we found a suitable one. Thirdly, if the skip bit is set on the page block and we goto continue, it doesn't check nr_scanned. Fourthly, if the page block's skip bit is set, it checks that page block is the last of list, which is unnecessary. Link: https://lkml.kernel.org/r/20210128130411.6125-1-vvghjk1234@gmail.com Fixes: `70b44595ea` ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:34 -08:00
Charan Teja Reddy	40d7e20320	mm/compaction: correct deferral logic for proactive compaction should_proactive_compact_node() returns true when sum of the weighted fragmentation score of all the zones in the node is greater than the wmark_high of compaction, which then triggers the proactive compaction that operates on the individual zones of the node. But proactive compaction runs on the zone only when its weighted fragmentation score is greater than wmark_low(=wmark_high - 10). This means that the sum of the weighted fragmentation scores of all the zones can exceed the wmark_high but individual weighted fragmentation zone scores can still be less than wmark_low which makes the unnecessary trigger of the proactive compaction only to return doing nothing. Issue with the return of proactive compaction with out even trying is its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores across the proactive compaction is same, thinking that compaction didn't make any progress but in reality it didn't even try. With the delay between successive retries for proactive compaction is 500msec, it can result into the deferral for ~30sec with out even trying the proactive compaction. Test scenario is that: compaction_proactiveness=50 thus the wmark_low = 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with sizes 5GB and 6GB respectively. After opening some apps on the android, the weighted fragmentation scores of these zones are 47 and 49 respectively. Since the sum of these fragmentation scores are above the wmark_high which triggers the proactive compaction and there since the individual zones weighted fragmentation scores are below wmark_low, it returns without trying the proactive compaction. As a result the weighted fragmentation scores of the zones are still 47 and 49 which makes the existing logic to defer the compaction thinking that noprogress is made across the compaction. Fix this by checking just zone fragmentation score, not the weighted, in __compact_finished() and use the zones weighted fragmentation score in fragmentation_score_node(). In the test case above, If the weighted average of is above wmark_high, then individual score (not adjusted) of atleast one zone has to be above wmark_high. Thus it avoids the unnecessary trigger and deferrals of the proactive compaction. Link: https://lkml.kernel.org/r/1610989938-31374-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nitin Gupta <ngupta@nitingupta.dev> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:34 -08:00
Miaohe Lin	e2d26aa5fb	mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked The VM_BUG_ON_PAGE(!PageLocked(page), page) is also done in PageMovable. Remove this explicitly one. Link: https://lkml.kernel.org/r/20210109081420.46030-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:34 -08:00
Alex Shi	d99fd5feb0	mm/compaction: remove rcu_read_lock during page compaction isolate_migratepages_block() used rcu_read_lock() with the intention of safeguarding against the mem_cgroup being destroyed concurrently; but its TestClearPageLRU already protects against that. Delete the unnecessary rcu_read_lock() and _unlock(). Hugh Dickins helped on commit log polishing, Thanks! Link: https://lkml.kernel.org/r/1608614453-10739-3-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:34 -08:00
Yu Zhao	46ae6b2cc2	mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list() The parameter is redundant in the sense that it can be potentially extracted from the "struct page" parameter by page_lru(). We need to make sure that existing PageActive() or PageUnevictable() remains until the function returns. A few places don't conform, and simple reordering fixes them. This patch may have left page_off_lru() seemingly odd, and we'll take care of it in the next patch. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:33 -08:00
Alex Shi	c2135f7c57	mm/vmscan: __isolate_lru_page_prepare() cleanup The function just returns 2 results, so using a 'switch' to deal with its result is unnecessary. Also simplify it to a bool func as Vlastimil suggested. Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's suggestion to update comments in function. Link: https://lkml.kernel.org/r/728874d7-2d93-4049-68c1-dcc3b2d52ccd@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-24 13:38:33 -08:00
Rokudo Yan	74e21484e4	mm, compaction: move high_pfn to the for loop scope In fast_isolate_freepages, high_pfn will be used if a prefered one (ie PFN >= low_fn) not found. But the high_pfn is not reset before searching an free area, so when it was used as freepage, it may from another free area searched before. As a result move_freelist_head(freelist, freepage) will have unexpected behavior (eg corrupt the MOVABLE freelist) Unable to handle kernel paging request at virtual address dead000000000200 Mem abort info: ESR = 0x96000044 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000044 CM = 0, WnR = 1 [dead000000000200] address between user and kernel address ranges -000\|list_cut_before(inline) -000\|move_freelist_head(inline) -000\|fast_isolate_freepages(inline) -000\|isolate_freepages(inline) -000\|compaction_alloc(?, ?) -001\|unmap_and_move(inline) -001\|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p -002\|__read_once_size(inline) -002\|static_key_count(inline) -002\|static_key_false(inline) -002\|trace_mm_compaction_migratepages(inline) -002\|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0) -003\|kcompactd_do_work(inline) -003\|kcompactd([X19] p = 0xFFFFFF93227FBC40) -004\|kthread([X20] _create = 0xFFFFFFE1AFB26380) -005\|ret_from_fork(asm) The issue was reported on an smart phone product with 6GB ram and 3GB zram as swap device. This patch fixes the issue by reset high_pfn before searching each free area, which ensure freepage and freelist match when call move_freelist_head in fast_isolate_freepages(). Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com Fixes: `5a811889de` ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-02-05 11:03:47 -08:00
Alex Shi	6168d0da2b	mm/lru: replace pgdat lru_lock with lruvec lock This patch moves per node lru_lock into lruvec, thus bring a lru_lock for each of memcg per node. So on a large machine, each of memcg don't have to suffer from per node pgdat->lru_lock competition. They could go fast with their self lru_lock. After move memcg charge before lru inserting, page isolation could serialize page's memcg, then per memcg lruvec lock is stable and could replace per node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug func in locking which may give some clues if there are sth out of hands. Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/ Hugh Dickins helped on the patch polish, thanks! [alex.shi@linux.alibaba.com: fix comment typo] Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com [alex.shi@linux.alibaba.com: use page_memcg()] Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rong Chen <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-15 14:48:04 -08:00
Alex Shi	9df4131439	mm/compaction: do page isolation first in compaction Currently, compaction would get the lru_lock and then do page isolation which works fine with pgdat->lru_lock, since any page isoltion would compete for the lru_lock. If we want to change to memcg lru_lock, we have to isolate the page before getting lru_lock, thus isoltion would block page's memcg change which relay on page isoltion too. Then we could safely use per memcg lru_lock later. The new page isolation use previous introduced TestClearPageLRU() + pgdat lru locking which will be changed to memcg lru lock later. Hugh Dickins <hughd@google.com> fixed following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-15 14:48:04 -08:00
Hui Su	2271b016bf	mm/compaction: make defer_compaction and compaction_deferred static defer_compaction() and compaction_deferred() and compaction_restarting() in mm/compaction.c won't be used in other files, so make them static, and remove the declaration in the header file. Take the chance to fix a typo. Link: https://lkml.kernel.org/r/20201123170801.GA9625@rlk Signed-off-by: Hui Su <sh_def@163.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Baoquan He <bhe@redhat.com> Cc: Mateusz Nosek <mateusznosek0@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-15 12:13:45 -08:00
Hui Su	2b1a20c3af	mm/compaction: move compaction_suitable's comment to right place Since commit `837d026d56` ("mm/compaction: more trace to understand when/why compaction start/finish"), the comment place is not suitable. So move compaction_suitable's comment to right place. Link: https://lkml.kernel.org/r/20201116144121.GA385717@rlk Signed-off-by: Hui Su <sh_def@163.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-15 12:13:45 -08:00
Yanfei Xu	19d3cf9de1	mm/compaction: rename 'start_pfn' to 'iteration_start_pfn' in compact_zone() There are two 'start_pfn' declared in compact_zone() which have different meanings. Rename the second one to 'iteration_start_pfn' to prevent confusion. Also, remove an useless semicolon. Link: https://lkml.kernel.org/r/20201019115044.1571-1-yanfei.xu@windriver.com Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-15 12:13:45 -08:00
Zi Yan	d20bdd571e	mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate In isolate_migratepages_block, if we have too many isolated pages and nr_migratepages is not zero, we should try to migrate what we have without wasting time on isolating. In theory it's possible that multiple parallel compactions will cause too_many_isolated() to become true even if each has isolated less than COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing immediately prevents that. [vbabka@suse.cz: changelog addition] Fixes: `1da2f328fa` (“mm,thp,compaction,cma: allow THP migration for CMA allocations”) Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Yang Shi <shy828301@gmail.com> Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-11-14 11:26:03 -08:00
Zi Yan	38935861d8	mm/compaction: count pages and stop correctly during page isolation In isolate_migratepages_block, when cc->alloc_contig is true, we are able to isolate compound pages. But nr_migratepages and nr_isolated did not count compound pages correctly, causing us to isolate more pages than we thought. So count compound pages as the number of base pages they contain. Otherwise, we might be trapped in too_many_isolated while loop, since the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384, where COMPACT_CLUSTER_MAX is 32, since we stop isolation after cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX. In addition, after we fix the issue above, cc->nr_migratepages could never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated, thus page isolation could not stop as we intended. Change the isolation stop condition to '>='. The issue can be triggered as follows: In a system with 16GB memory and an 8GB CMA region reserved by hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get stuck (looping in too_many_isolated function) until we kill either task. With the patch applied, oom will kill the application with 10GB THPs and let hugetlb page reservation finish. [ziy@nvidia.com: v3] Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com Fixes: `1da2f328fa` ("cmm,thp,compaction,cma: allow THP migration for CMA allocations") Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-11-14 11:26:03 -08:00
Matthew Wilcox (Oracle)	ab130f9108	mm: rename page_order() to buddy_order() The current page_order() can only be called on pages in the buddy allocator. For compound pages, you have to use compound_order(). This is confusing and led to a bug, so rename page_order() to buddy_order(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-16 11:11:19 -07:00
Mateusz Nosek	62b35fe0eb	mm/compaction.c: micro-optimization remove unnecessary branch The same code can work both for 'zone->compact_considered > defer_limit' and 'zone->compact_considered >= defer_limit'. In the latter there is one branch less which is more effective considering performance. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-13 18:38:34 -07:00
Matthew Wilcox (Oracle)	6c357848b4	mm: replace hpage_nr_pages with thp_nr_pages The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-14 19:56:56 -07:00
Randy Dunlap	a1c1dbeb2e	mm/compaction.c: delete duplicated word Drop the repeated word "a". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:58 -07:00
Alex Shi	860b32729a	mm/compaction: correct the comments of compact_defer_shift There is no compact_defer_limit. It should be compact_defer_shift in use. and add compact_order_failed explanation. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:56 -07:00
Nitin Gupta	d34c0a7599	mm: use unsigned types for fragmentation score Proactive compaction uses per-node/zone "fragmentation score" which is always in range [0, 100], so use unsigned type of these scores as well as for related constants. Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:56 -07:00
Nitin Gupta	25788738eb	mm: fix compile error due to COMPACTION_HPAGE_ORDER Fix compile error when COMPACTION_HPAGE_ORDER is assigned to HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined is to check for CONFIG_HUGETLBFS. Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:56 -07:00
Nitin Gupta	facdaa917c	mm: proactive compaction For some applications, we need to allocate almost all memory as hugepages. However, on a running system, higher-order allocations can fail if the memory is fragmented. Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages keeping allocation latencies low. For a more proactive compaction, the approach taken here is to define a new sysctl called 'vm.compaction_proactiveness' which dictates bounds for external fragmentation which kcompactd tries to maintain. The tunable takes a value in range [0, 100], with a default of 20. Note that a previous version of this patch [1] was found to introduce too many tunables (per-order extfrag{low, high}), but this one reduces them to just one sysctl. Also, the new tunable is an opaque value instead of asking for specific bounds of "external fragmentation", which would have been difficult to estimate. The internal interpretation of this opaque value allows for future fine-tuning. Currently, we use a simple translation from this tunable to [low, high] "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). The score for a node is defined as weighted mean of per-zone external fragmentation. A zone's present_pages determines its weight. To periodically check per-node score, we reuse per-node kcompactd threads, which are woken up every 500 milliseconds to check the same. If a node's score exceeds its high threshold (as derived from user-provided proactiveness value), proactive compaction is started until its score reaches its low threshold value. By default, proactiveness is set to 20, which implies threshold values of low=80 and high=90. This patch is largely based on ideas from Michal Hocko [2]. See also the LWN article [3]. Performance data ================ System: x64_64, 1T RAM, 80 CPU threads. Kernel: 5.6.0-rc3 + this patch echo madvise \| sudo tee /sys/kernel/mm/transparent_hugepage/enabled echo madvise \| sudo tee /sys/kernel/mm/transparent_hugepage/defrag Before starting the driver, the system was fragmented from a userspace program that allocates all memory and then for each 2M aligned section, frees 3/4 of base pages using munmap. The workload is mainly anonymous userspace pages, which are easy to move around. I intentionally avoided unmovable pages in this test to see how much latency we incur when hugepage allocations hit direct compaction. 1. Kernel hugepage allocation latencies With the system in such a fragmented state, a kernel driver then allocates as many hugepages as possible and measures allocation latency: (all latency values are in microseconds) - With vanilla 5.6.0-rc3 percentile latency –––––––––– ––––––– 5 7894 10 9496 25 12561 30 15295 40 18244 50 21229 60 27556 75 30147 80 31047 90 32859 95 33799 Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages) - With 5.6.0-rc3 + this patch, with proactiveness=20 sysctl -w vm.compaction_proactiveness=20 percentile latency –––––––––– ––––––– 5 2 10 2 25 3 30 3 40 3 50 4 60 4 75 4 80 4 90 5 95 429 Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G total free => 98% of free memory could be allocated as hugepages) 2. JAVA heap allocation In this test, we first fragment memory using the same method as for (1). Then, we start a Java process with a heap size set to 700G and request the heap to be allocated with THP hugepages. We also set THP to madvise to allow hugepage backing of this heap. /usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch The above command allocates 700G of Java heap using hugepages. - With vanilla 5.6.0-rc3 17.39user 1666.48system 27:37.89elapsed - With 5.6.0-rc3 + this patch, with proactiveness=20 8.35user 194.58system 3:19.62elapsed Elapsed time remains around 3:15, as proactiveness is further increased. Note that proactive compaction happens throughout the runtime of these workloads. The situation of one-time compaction, sufficient to supply hugepages for following allocation stream, can probably happen for more extreme proactiveness values, like 80 or 90. In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As t he benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat. bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while it tries to bring a node's score within thresholds. Backoff behavior ================ Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect: - Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage. - Set proactiveness=40 - Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries). [1] https://patchwork.kernel.org/patch/11098289/ [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/ [3] https://lwn.net/Articles/817905/ Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Nitin Gupta <ngupta@nitingupta.dev> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:56 -07:00
Vlastimil Babka	b9e20f0da1	mm, compaction: make capture control handling safe wrt interrupts Hugh reports: "While stressing compaction, one run oopsed on NULL capc->cc in __free_one_page()'s task_capc(zone): compact_zone_order() had been interrupted, and a page was being freed in the return from interrupt. Though you would not expect it from the source, both gccs I was using (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before callq compact_zone - long after the "current->capture_control = &capc". An interrupt in between those finds capc->cc NULL (zeroed by an earlier rep stos). This could presumably be fixed by a barrier() before setting current->capture_control in compact_zone_order(); but would also need more care on return from compact_zone(), in order not to risk leaking a page captured by interrupt just before capture_control is reset. Maybe that is the preferable fix, but I felt safer for task_capc() to exclude the rather surprising possibility of capture at interrupt time" I have checked that gcc10 also behaves the same. The advantage of fix in compact_zone_order() is that we don't add another test in the page freeing hot path, and that it might prevent future problems if we stop exposing pointers to uninitialized structures in current task. So this patch implements the suggestion for compact_zone_order() with barrier() (and WRITE_ONCE() to prevent store tearing) for setting current->capture_control, and prevents page leaking with WRITE_ONCE/READ_ONCE in the proper order. Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz Fixes: `5e1f0f098b` ("mm, compaction: capture a page under direct compaction") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Li Wang <liwang@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [5.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-26 00:27:36 -07:00
Ethon Paul	f386775510	mm/compaction: fix a typo in comment "pessemistic"->"pessimistic" There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200411070307.16021-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:23 -07:00
Linus Torvalds	ee01c4d72a	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ...	2020-06-03 20:24:15 -07:00

1 2 3 4 5 ...

333 Commits