Commit Graph

333 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
3d8ac88867 Merge 5.15.46 into android14-5.15
Changes in 5.15.46
	binfmt_flat: do not stop relocating GOT entries prematurely on riscv
	parisc/stifb: Implement fb_is_primary_device()
	parisc/stifb: Keep track of hardware path of graphics card
	RISC-V: Mark IORESOURCE_EXCLUSIVE for reserved mem instead of IORESOURCE_BUSY
	riscv: Initialize thread pointer before calling C functions
	riscv: Fix irq_work when SMP is disabled
	riscv: Wire up memfd_secret in UAPI header
	riscv: Move alternative length validation into subsection
	ALSA: hda/realtek - Add new type for ALC245
	ALSA: hda/realtek: Enable 4-speaker output for Dell XPS 15 9520 laptop
	ALSA: hda/realtek - Fix microphone noise on ASUS TUF B550M-PLUS
	ALSA: usb-audio: Cancel pending work at closing a MIDI substream
	USB: serial: pl2303: fix type detection for odd device
	USB: serial: option: add Quectel BG95 modem
	USB: new quirk for Dell Gen 2 devices
	usb: isp1760: Fix out-of-bounds array access
	usb: dwc3: gadget: Move null pinter check to proper place
	usb: core: hcd: Add support for deferring roothub registration
	fs/ntfs3: Update valid size if -EIOCBQUEUED
	fs/ntfs3: Fix fiemap + fix shrink file size (to remove preallocated space)
	fs/ntfs3: Keep preallocated only if option prealloc enabled
	fs/ntfs3: Check new size for limits
	fs/ntfs3: In function ntfs_set_acl_ex do not change inode->i_mode if called from function ntfs_init_acl
	fs/ntfs3: Fix some memory leaks in an error handling path of 'log_replay()'
	fs/ntfs3: Update i_ctime when xattr is added
	fs/ntfs3: Restore ntfs_xattr_get_acl and ntfs_xattr_set_acl functions
	cifs: fix potential double free during failed mount
	cifs: when extending a file with falloc we should make files not-sparse
	xhci: Allow host runtime PM as default for Intel Alder Lake N xHCI
	platform/x86: intel-hid: fix _DSM function index handling
	x86/MCE/AMD: Fix memory leak when threshold_create_bank() fails
	perf/x86/intel: Fix event constraints for ICL
	x86/kexec: fix memory leak of elf header buffer
	x86/sgx: Set active memcg prior to shmem allocation
	ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
	ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
	ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
	btrfs: add "0x" prefix for unsupported optional features
	btrfs: return correct error number for __extent_writepage_io()
	btrfs: repair super block num_devices automatically
	btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()
	iommu/vt-d: Add RPLS to quirk list to skip TE disabling
	drm/vmwgfx: validate the screen formats
	drm/virtio: fix NULL pointer dereference in virtio_gpu_conn_get_modes
	selftests/bpf: Fix vfs_link kprobe definition
	selftests/bpf: Fix parsing of prog types in UAPI hdr for bpftool sync
	mwifiex: add mutex lock for call in mwifiex_dfs_chan_sw_work_queue
	b43legacy: Fix assigning negative value to unsigned variable
	b43: Fix assigning negative value to unsigned variable
	ipw2x00: Fix potential NULL dereference in libipw_xmit()
	ipv6: fix locking issues with loops over idev->addr_list
	fbcon: Consistently protect deferred_takeover with console_lock()
	x86/platform/uv: Update TSC sync state for UV5
	ACPICA: Avoid cache flush inside virtual machines
	mac80211: minstrel_ht: fix where rate stats are stored (fixes debugfs output)
	drm/komeda: return early if drm_universal_plane_init() fails.
	drm/amd/display: Disabling Z10 on DCN31
	rcu-tasks: Fix race in schedule and flush work
	rcu: Make TASKS_RUDE_RCU select IRQ_WORK
	sfc: ef10: Fix assigning negative value to unsigned variable
	ALSA: jack: Access input_dev under mutex
	rtw88: 8821c: fix debugfs rssi value
	spi: spi-rspi: Remove setting {src,dst}_{addr,addr_width} based on DMA direction
	tools/power turbostat: fix ICX DRAM power numbers
	scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg()
	scsi: lpfc: Fix SCSI I/O completion and abort handler deadlock
	scsi: lpfc: Fix call trace observed during I/O with CMF enabled
	cpuidle: PSCI: Improve support for suspend-to-RAM for PSCI OSI mode
	drm/amd/pm: fix double free in si_parse_power_table()
	ASoC: rsnd: care default case on rsnd_ssiu_busif_err_status_clear()
	ASoC: rsnd: care return value from rsnd_node_fixed_index()
	ath9k: fix QCA9561 PA bias level
	media: venus: hfi: avoid null dereference in deinit
	media: pci: cx23885: Fix the error handling in cx23885_initdev()
	media: cx25821: Fix the warning when removing the module
	md/bitmap: don't set sb values if can't pass sanity check
	mmc: jz4740: Apply DMA engine limits to maximum segment size
	drivers: mmc: sdhci_am654: Add the quirk to set TESTCD bit
	scsi: megaraid: Fix error check return value of register_chrdev()
	drm/amdgpu/sdma: Fix incorrect calculations of the wptr of the doorbells
	scsi: ufs: Use pm_runtime_resume_and_get() instead of pm_runtime_get_sync()
	scsi: lpfc: Fix resource leak in lpfc_sli4_send_seq_to_ulp()
	ath11k: disable spectral scan during spectral deinit
	ASoC: Intel: bytcr_rt5640: Add quirk for the HP Pro Tablet 408
	drm/plane: Move range check for format_count earlier
	drm/amd/pm: fix the compile warning
	ath10k: skip ath10k_halt during suspend for driver state RESTARTING
	arm64: compat: Do not treat syscall number as ESR_ELx for a bad syscall
	drm: msm: fix error check return value of irq_of_parse_and_map()
	scsi: target: tcmu: Fix possible data corruption
	ipv6: Don't send rs packets to the interface of ARPHRD_TUNNEL
	net/mlx5: fs, delete the FTE when there are no rules attached to it
	ASoC: dapm: Don't fold register value changes into notifications
	mlxsw: spectrum_dcb: Do not warn about priority changes
	mlxsw: Treat LLDP packets as control
	drm/amdgpu/psp: move PSP memory alloc from hw_init to sw_init
	drm/amdgpu/ucode: Remove firmware load type check in amdgpu_ucode_free_bo
	regulator: mt6315: Enforce regulator-compatible, not name
	HID: bigben: fix slab-out-of-bounds Write in bigben_probe
	of: Support more than one crash kernel regions for kexec -s
	ASoC: tscs454: Add endianness flag in snd_soc_component_driver
	scsi: lpfc: Alter FPIN stat accounting logic
	net: remove two BUG() from skb_checksum_help()
	s390/preempt: disable __preempt_count_add() optimization for PROFILE_ALL_BRANCHES
	perf/amd/ibs: Cascade pmu init functions' return value
	sched/core: Avoid obvious double update_rq_clock warning
	spi: stm32-qspi: Fix wait_cmd timeout in APM mode
	dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC
	ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default
	ipmi:ssif: Check for NULL msg when handling events and messages
	ipmi: Fix pr_fmt to avoid compilation issues
	rtlwifi: Use pr_warn instead of WARN_ONCE
	mt76: mt7921: accept rx frames with non-standard VHT MCS10-11
	mt76: fix encap offload ethernet type check
	media: rga: fix possible memory leak in rga_probe
	media: coda: limit frame interval enumeration to supported encoder frame sizes
	media: hantro: HEVC: unconditionnaly set pps_{cb/cr}_qp_offset values
	media: ccs-core.c: fix failure to call clk_disable_unprepare
	media: imon: reorganize serialization
	media: cec-adap.c: fix is_configuring state
	usbnet: Run unregister_netdev() before unbind() again
	openrisc: start CPU timer early in boot
	nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
	ASoC: rt5645: Fix errorenous cleanup order
	nbd: Fix hung on disconnect request if socket is closed before
	drm/amd/pm: update smartshift powerboost calc for smu12
	drm/amd/pm: update smartshift powerboost calc for smu13
	net: phy: micrel: Allow probing without .driver_data
	media: exynos4-is: Fix compile warning
	media: hantro: Stop using H.264 parameter pic_num
	ASoC: max98357a: remove dependency on GPIOLIB
	ASoC: rt1015p: remove dependency on GPIOLIB
	ACPI: CPPC: Assume no transition latency if no PCCT
	nvme: set non-mdts limits in nvme_scan_work
	can: mcp251xfd: silence clang's -Wunaligned-access warning
	x86/microcode: Add explicit CPU vendor dependency
	net: ipa: ignore endianness if there is no header
	m68k: atari: Make Atari ROM port I/O write macros return void
	rxrpc: Return an error to sendmsg if call failed
	rxrpc, afs: Fix selection of abort codes
	afs: Adjust ACK interpretation to try and cope with NAT
	eth: tg3: silence the GCC 12 array-bounds warning
	char: tpm: cr50_i2c: Suppress duplicated error message in .remove()
	selftests/bpf: fix btf_dump/btf_dump due to recent clang change
	gfs2: use i_lock spin_lock for inode qadata
	scsi: target: tcmu: Avoid holding XArray lock when calling lock_page
	IB/rdmavt: add missing locks in rvt_ruc_loopback
	ARM: dts: ox820: align interrupt controller node name with dtschema
	ARM: dts: socfpga: align interrupt controller node name with dtschema
	ARM: dts: s5pv210: align DMA channels with dtschema
	arm64: dts: qcom: msm8994: Fix the cont_splash_mem address
	arm64: dts: qcom: msm8994: Fix BLSP[12]_DMA channels count
	PM / devfreq: rk3399_dmc: Disable edev on remove()
	crypto: ccree - use fine grained DMA mapping dir
	soc: ti: ti_sci_pm_domains: Check for null return of devm_kcalloc
	fs: jfs: fix possible NULL pointer dereference in dbFree()
	arm64: dts: qcom: sdm845-xiaomi-beryllium: fix typo in panel's vddio-supply property
	ALSA: usb-audio: Add quirk bits for enabling/disabling generic implicit fb
	ALSA: usb-audio: Move generic implicit fb quirk entries into quirks.c
	ARM: OMAP1: clock: Fix UART rate reporting algorithm
	powerpc/fadump: Fix fadump to work with a different endian capture kernel
	fat: add ratelimit to fat*_ent_bread()
	pinctrl: renesas: rzn1: Fix possible null-ptr-deref in sh_pfc_map_resources()
	ARM: versatile: Add missing of_node_put in dcscb_init
	ARM: dts: exynos: add atmel,24c128 fallback to Samsung EEPROM
	ARM: hisi: Add missing of_node_put after of_find_compatible_node
	cpufreq: Avoid unnecessary frequency updates due to mismatch
	powerpc/rtas: Keep MSR[RI] set when calling RTAS
	PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
	KVM: PPC: Book3S HV Nested: L2 LPCR should inherit L1 LPES setting
	alpha: fix alloc_zeroed_user_highpage_movable()
	tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
	powerpc/powernv/vas: Assign real address to rx_fifo in vas_rx_win_attr
	powerpc/xics: fix refcount leak in icp_opal_init()
	powerpc/powernv: fix missing of_node_put in uv_init()
	macintosh/via-pmu: Fix build failure when CONFIG_INPUT is disabled
	powerpc/iommu: Add missing of_node_put in iommu_init_early_dart
	smb3: check for null tcon
	RDMA/hfi1: Prevent panic when SDMA is disabled
	Input: gpio-keys - cancel delayed work only in case of GPIO
	drm: fix EDID struct for old ARM OABI format
	drm/bridge_connector: enable HPD by default if supported
	dt-bindings: display: sitronix, st7735r: Fix backlight in example
	drm/vmwgfx: Fix an invalid read
	ath11k: acquire ab->base_lock in unassign when finding the peer by addr
	drm: bridge: it66121: Fix the register page length
	ath9k: fix ar9003_get_eepmisc
	drm/edid: fix invalid EDID extension block filtering
	drm/bridge: adv7511: clean up CEC adapter when probe fails
	drm: bridge: icn6211: Fix register layout
	drm: bridge: icn6211: Fix HFP_HSW_HBP_HI and HFP_MIN handling
	mtd: spinand: gigadevice: fix Quad IO for GD5F1GQ5UExxG
	spi: qcom-qspi: Add minItems to interconnect-names
	ASoC: mediatek: Fix error handling in mt8173_max98090_dev_probe
	ASoC: mediatek: Fix missing of_node_put in mt2701_wm8960_machine_probe
	x86/delay: Fix the wrong asm constraint in delay_loop()
	drm/vc4: hvs: Fix frame count register readout
	drm/mediatek: Fix mtk_cec_mask()
	drm/vc4: hvs: Reset muxes at probe time
	drm/vc4: txp: Don't set TXP_VSTART_AT_EOF
	drm/vc4: txp: Force alpha to be 0xff if it's disabled
	libbpf: Don't error out on CO-RE relos for overriden weak subprogs
	x86/PCI: Fix ALi M1487 (IBC) PIRQ router link value interpretation
	mptcp: reset the packet scheduler on PRIO change
	nl80211: show SSID for P2P_GO interfaces
	drm/komeda: Fix an undefined behavior bug in komeda_plane_add()
	drm: mali-dp: potential dereference of null pointer
	spi: spi-ti-qspi: Fix return value handling of wait_for_completion_timeout
	scftorture: Fix distribution of short handler delays
	net: dsa: mt7530: 1G can also support 1000BASE-X link mode
	ixp4xx_eth: fix error check return value of platform_get_irq()
	NFC: NULL out the dev->rfkill to prevent UAF
	efi: Add missing prototype for efi_capsule_setup_info
	device property: Check fwnode->secondary when finding properties
	device property: Allow error pointer to be passed to fwnode APIs
	target: remove an incorrect unmap zeroes data deduction
	drbd: fix duplicate array initializer
	EDAC/dmc520: Don't print an error for each unconfigured interrupt line
	mtd: rawnand: denali: Use managed device resources
	HID: hid-led: fix maximum brightness for Dream Cheeky
	HID: elan: Fix potential double free in elan_input_configured
	drm/bridge: Fix error handling in analogix_dp_probe
	regulator: da9121: Fix uninit-value in da9121_assign_chip_model()
	drm/mediatek: dpi: Use mt8183 output formats for mt8192
	signal: Deliver SIGTRAP on perf event asynchronously if blocked
	sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
	sched/psi: report zeroes for CPU full at the system level
	spi: img-spfi: Fix pm_runtime_get_sync() error checking
	cpufreq: Fix possible race in cpufreq online error path
	printk: use atomic updates for klogd work
	printk: add missing memory barrier to wake_up_klogd()
	printk: wake waiters for safe and NMI contexts
	ath9k_htc: fix potential out of bounds access with invalid rxstatus->rs_keyix
	media: i2c: max9286: Use dev_err_probe() helper
	media: i2c: max9286: Use "maxim,gpio-poc" property
	media: i2c: max9286: fix kernel oops when removing module
	media: hantro: Empty encoder capture buffers by default
	drm/panel: simple: Add missing bus flags for Innolux G070Y2-L01
	ALSA: pcm: Check for null pointer of pointer substream before dereferencing it
	mtdblock: warn if opened on NAND
	inotify: show inotify mask flags in proc fdinfo
	fsnotify: fix wrong lockdep annotations
	spi: rockchip: Stop spi slave dma receiver when cs inactive
	spi: rockchip: Preset cs-high and clk polarity in setup progress
	spi: rockchip: fix missing error on unsupported SPI_CS_HIGH
	of: overlay: do not break notify on NOTIFY_{OK|STOP}
	selftests/damon: add damon to selftests root Makefile
	drm/msm/dp: Modify prototype of encoder based API
	drm/msm/hdmi: switch to drm_bridge_connector
	drm/msm/dpu: adjust display_v_end for eDP and DP
	scsi: iscsi: Fix harmless double shift bug
	scsi: ufs: qcom: Fix ufs_qcom_resume()
	scsi: ufs: core: Exclude UECxx from SFR dump list
	drm/v3d: Fix null pointer dereference of pointer perfmon
	selftests/resctrl: Fix null pointer dereference on open failed
	libbpf: Fix logic for finding matching program for CO-RE relocation
	mtd: spi-nor: core: Check written SR value in spi_nor_write_16bit_sr_and_check()
	x86/pm: Fix false positive kmemleak report in msr_build_context()
	mtd: rawnand: cadence: fix possible null-ptr-deref in cadence_nand_dt_probe()
	mtd: rawnand: intel: fix possible null-ptr-deref in ebu_nand_probe()
	x86/speculation: Add missing prototype for unpriv_ebpf_notify()
	ASoC: rk3328: fix disabling mclk on pclk probe failure
	perf tools: Add missing headers needed by util/data.h
	drm/msm/disp/dpu1: set vbif hw config to NULL to avoid use after memory free during pm runtime resume
	drm/msm/dp: stop event kernel thread when DP unbind
	drm/msm/dp: fix error check return value of irq_of_parse_and_map()
	drm/msm/dp: reset DP controller before transmit phy test pattern
	drm/msm/dp: do not stop transmitting phy test pattern during DP phy compliance test
	drm/msm/dsi: fix error checks and return values for DSI xmit functions
	drm/msm/hdmi: check return value after calling platform_get_resource_byname()
	drm/msm/hdmi: fix error check return value of irq_of_parse_and_map()
	drm/msm: add missing include to msm_drv.c
	drm/panel: panel-simple: Fix proper bpc for AM-1280800N3TZQW-T00H
	kunit: fix debugfs code to use enum kunit_status, not bool
	drm/rockchip: vop: fix possible null-ptr-deref in vop_bind()
	spi: cadence-quadspi: fix Direct Access Mode disable for SoCFPGA
	perf tools: Use Python devtools for version autodetection rather than runtime
	virtio_blk: fix the discard_granularity and discard_alignment queue limits
	nl80211: don't hold RTNL in color change request
	x86: Fix return value of __setup handlers
	irqchip/exiu: Fix acknowledgment of edge triggered interrupts
	irqchip/aspeed-i2c-ic: Fix irq_of_parse_and_map() return value
	irqchip/aspeed-scu-ic: Fix irq_of_parse_and_map() return value
	x86/mm: Cleanup the control_va_addr_alignment() __setup handler
	arm64: fix types in copy_highpage()
	regulator: core: Fix enable_count imbalance with EXCLUSIVE_GET
	drm/msm/dsi: fix address for second DSI PHY on SDM660
	drm/msm/dp: fix event thread stuck in wait_event after kthread_stop()
	drm/msm/mdp5: Return error code in mdp5_pipe_release when deadlock is detected
	drm/msm/mdp5: Return error code in mdp5_mixer_release when deadlock is detected
	drm/msm: return an error pointer in msm_gem_prime_get_sg_table()
	media: uvcvideo: Fix missing check to determine if element is found in list
	arm64: stackleak: fix current_top_of_stack()
	iomap: iomap_write_failed fix
	spi: spi-fsl-qspi: check return value after calling platform_get_resource_byname()
	Revert "cpufreq: Fix possible race in cpufreq online error path"
	regulator: qcom_smd: Fix up PM8950 regulator configuration
	samples: bpf: Don't fail for a missing VMLINUX_BTF when VMLINUX_H is provided
	perf/amd/ibs: Use interrupt regs ip for stack unwinding
	ath11k: Don't check arvif->is_started before sending management frames
	wilc1000: fix crash observed in AP mode with cfg80211_register_netdevice()
	HID: amd_sfh: Modify the bus name
	HID: amd_sfh: Modify the hid name
	ASoC: fsl: Use dev_err_probe() helper
	ASoC: fsl: Fix refcount leak in imx_sgtl5000_probe
	ASoC: imx-hdmi: Fix refcount leak in imx_hdmi_probe
	ASoC: mxs-saif: Fix refcount leak in mxs_saif_probe
	regulator: pfuze100: Fix refcount leak in pfuze_parse_regulators_dt
	dma-direct: factor out a helper for DMA_ATTR_NO_KERNEL_MAPPING allocations
	dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages
	ASoC: samsung: Use dev_err_probe() helper
	ASoC: samsung: Fix refcount leak in aries_audio_probe
	block: Fix the bio.bi_opf comment
	kselftest/cgroup: fix test_stress.sh to use OUTPUT dir
	scripts/faddr2line: Fix overlapping text section failures
	media: aspeed: Fix an error handling path in aspeed_video_probe()
	media: exynos4-is: Fix PM disable depth imbalance in fimc_is_probe
	mt76: mt7921: Fix the error handling path of mt7921_pci_probe()
	mt76: do not attempt to reorder received 802.3 packets without agg session
	media: st-delta: Fix PM disable depth imbalance in delta_probe
	media: atmel: atmel-isc: Fix PM disable depth imbalance in atmel_isc_probe
	media: i2c: rdacm2x: properly set subdev entity function
	media: exynos4-is: Change clk_disable to clk_disable_unprepare
	media: pvrusb2: fix array-index-out-of-bounds in pvr2_i2c_core_init
	media: vsp1: Fix offset calculation for plane cropping
	media: atmel: atmel-sama5d2-isc: fix wrong mask in YUYV format check
	media: hantro: HEVC: Fix tile info buffer value computation
	Bluetooth: fix dangling sco_conn and use-after-free in sco_sock_timeout
	Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring
	Bluetooth: use hdev lock for accept_list and reject_list in conn req
	nvme: set dma alignment to dword
	m68k: math-emu: Fix dependencies of math emulation support
	sctp: read sk->sk_bound_dev_if once in sctp_rcv()
	net: hinic: add missing destroy_workqueue in hinic_pf_to_mgmt_init
	ASoC: ti: j721e-evm: Fix refcount leak in j721e_soc_probe_*
	kselftest/arm64: bti: force static linking
	media: ov7670: remove ov7670_power_off from ov7670_remove
	media: i2c: ov5648: fix wrong pointer passed to IS_ERR() and PTR_ERR()
	media: staging: media: rkvdec: Make use of the helper function devm_platform_ioremap_resource()
	media: rkvdec: h264: Fix dpb_valid implementation
	media: rkvdec: h264: Fix bit depth wrap in pps packet
	regulator: scmi: Fix refcount leak in scmi_regulator_probe
	ext4: reject the 'commit' option on ext2 filesystems
	drm/msm/a6xx: Fix refcount leak in a6xx_gpu_init
	drm: msm: fix possible memory leak in mdp5_crtc_cursor_set()
	x86/sev: Annotate stack change in the #VC handler
	drm/msm: don't free the IRQ if it was not requested
	selftests/bpf: Add missed ima_setup.sh in Makefile
	drm/msm/dpu: handle pm_runtime_get_sync() errors in bind path
	drm/i915: Fix CFI violation with show_dynamic_id()
	thermal/drivers/bcm2711: Don't clamp temperature at zero
	thermal/drivers/broadcom: Fix potential NULL dereference in sr_thermal_probe
	thermal/core: Fix memory leak in __thermal_cooling_device_register()
	thermal/drivers/imx_sc_thermal: Fix refcount leak in imx_sc_thermal_probe
	bfq: Relax waker detection for shared queues
	bfq: Allow current waker to defend against a tentative one
	ASoC: wm2000: fix missing clk_disable_unprepare() on error in wm2000_anc_transition()
	PM: domains: Fix initialization of genpd's next_wakeup
	net: macb: Fix PTP one step sync support
	NFC: hci: fix sleep in atomic context bugs in nfc_hci_hcp_message_tx
	ASoC: max98090: Move check for invalid values before casting in max98090_put_enab_tlv()
	net: stmmac: selftests: Use kcalloc() instead of kzalloc()
	net: stmmac: fix out-of-bounds access in a selftest
	hv_netvsc: Fix potential dereference of NULL pointer
	hwmon: (pmbus) Check PEC support before reading other registers
	rxrpc: Fix listen() setting the bar too high for the prealloc rings
	rxrpc: Don't try to resend the request if we're receiving the reply
	rxrpc: Fix overlapping ACK accounting
	rxrpc: Don't let ack.previousPacket regress
	rxrpc: Fix decision on when to generate an IDLE ACK
	net: huawei: hinic: Use devm_kcalloc() instead of devm_kzalloc()
	hinic: Avoid some over memory allocation
	net: dsa: restrict SMSC_LAN9303_I2C kconfig
	net/smc: postpone sk_refcnt increment in connect()
	dma-direct: factor out dma_set_{de,en}crypted helpers
	dma-direct: don't call dma_set_decrypted for remapped allocations
	dma-direct: always leak memory that can't be re-encrypted
	dma-direct: don't over-decrypt memory
	arm64: dts: rockchip: Move drive-impedance-ohm to emmc phy on rk3399
	arm64: dts: mt8192: Fix nor_flash status disable typo
	PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3
	memory: samsung: exynos5422-dmc: Avoid some over memory allocation
	ARM: dts: BCM5301X: update CRU block description
	ARM: dts: BCM5301X: Update pin controller node name
	ARM: dts: suniv: F1C100: fix watchdog compatible
	soc: qcom: smp2p: Fix missing of_node_put() in smp2p_parse_ipc
	soc: qcom: smsm: Fix missing of_node_put() in smsm_parse_ipc
	PCI: cadence: Fix find_first_zero_bit() limit
	PCI: rockchip: Fix find_first_zero_bit() limit
	PCI: mediatek: Fix refcount leak in mtk_pcie_subsys_powerup()
	PCI: dwc: Fix setting error return on MSI DMA mapping failure
	ARM: dts: ci4x10: Adapt to changes in imx6qdl.dtsi regarding fec clocks
	soc: qcom: llcc: Add MODULE_DEVICE_TABLE()
	KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry
	KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault
	crypto: qat - set CIPHER capability for QAT GEN2
	crypto: qat - set COMPRESSION capability for QAT GEN2
	crypto: qat - set CIPHER capability for DH895XCC
	crypto: qat - set COMPRESSION capability for DH895XCC
	platform/chrome: cros_ec: fix error handling in cros_ec_register()
	ARM: dts: imx6dl-colibri: Fix I2C pinmuxing
	platform/chrome: Re-introduce cros_ec_cmd_xfer and use it for ioctls
	can: xilinx_can: mark bit timing constants as const
	ARM: dts: stm32: Fix PHY post-reset delay on Avenger96
	ARM: dts: bcm2835-rpi-zero-w: Fix GPIO line name for Wifi/BT
	ARM: dts: bcm2837-rpi-cm3-io3: Fix GPIO line names for SMPS I2C
	ARM: dts: bcm2837-rpi-3-b-plus: Fix GPIO line name of power LED
	ARM: dts: bcm2835-rpi-b: Fix GPIO line names
	misc: ocxl: fix possible double free in ocxl_file_register_afu
	crypto: marvell/cesa - ECB does not IV
	gpiolib: of: Introduce hook for missing gpio-ranges
	pinctrl: bcm2835: implement hook for missing gpio-ranges
	arm: mediatek: select arch timer for mt7629
	pinctrl/rockchip: support deferring other gpio params
	pinctrl: mediatek: mt8195: enable driver on mtk platforms
	arm64: dts: qcom: qrb5165-rb5: Fix can-clock node name
	Drivers: hv: vmbus: Fix handling of messages with transaction ID of zero
	powerpc/fadump: fix PT_LOAD segment for boot memory area
	mfd: ipaq-micro: Fix error check return value of platform_get_irq()
	scsi: fcoe: Fix Wstringop-overflow warnings in fcoe_wwn_from_mac()
	soc: bcm: Check for NULL return of devm_kzalloc()
	arm64: dts: ti: k3-am64-mcu: remove incorrect UART base clock rates
	ASoC: sh: rz-ssi: Check return value of pm_runtime_resume_and_get()
	ASoC: sh: rz-ssi: Propagate error codes returned from platform_get_irq_byname()
	ASoC: sh: rz-ssi: Release the DMA channels in rz_ssi_probe() error path
	firmware: arm_scmi: Fix list protocols enumeration in the base protocol
	nvdimm: Fix firmware activation deadlock scenarios
	nvdimm: Allow overwrite in the presence of disabled dimms
	pinctrl: mvebu: Fix irq_of_parse_and_map() return value
	drivers/base/node.c: fix compaction sysfs file leak
	dax: fix cache flush on PMD-mapped pages
	drivers/base/memory: fix an unlikely reference counting issue in __add_memory_block()
	firmware: arm_ffa: Fix uuid parameter to ffa_partition_probe
	firmware: arm_ffa: Remove incorrect assignment of driver_data
	list: introduce list_is_head() helper and re-use it in list.h
	list: fix a data-race around ep->rdllist
	drm/msm/dpu: fix error check return value of irq_of_parse_and_map()
	powerpc/8xx: export 'cpm_setbrg' for modules
	pinctrl: renesas: r8a779a0: Fix GPIO function on I2C-capable pins
	pinctrl: renesas: core: Fix possible null-ptr-deref in sh_pfc_map_resources()
	powerpc/idle: Fix return value of __setup() handler
	powerpc/4xx/cpm: Fix return value of __setup() handler
	RDMA/hns: Add the detection for CMDQ status in the device initialization process
	arm64: dts: marvell: espressobin-ultra: fix SPI-NOR config
	arm64: dts: marvell: espressobin-ultra: enable front USB3 port
	ASoC: atmel-pdmic: Remove endianness flag on pdmic component
	ASoC: atmel-classd: Remove endianness flag on class d component
	proc: fix dentry/inode overinstantiating under /proc/${pid}/net
	ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
	PCI: imx6: Fix PERST# start-up sequence
	tty: fix deadlock caused by calling printk() under tty_port->lock
	crypto: sun8i-ss - rework handling of IV
	crypto: sun8i-ss - handle zero sized sg
	crypto: cryptd - Protect per-CPU resource by disabling BH.
	ARM: dts: at91: sama7g5: remove interrupt-parent from gic node
	hugetlbfs: fix hugetlbfs_statfs() locking
	Input: sparcspkr - fix refcount leak in bbc_beep_probe
	PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
	PCI: microchip: Fix potential race in interrupt handling
	hwrng: omap3-rom - fix using wrong clk_disable() in omap_rom_rng_runtime_resume()
	powerpc/64: Only WARN if __pa()/__va() called with bad addresses
	powerpc/perf: Fix the threshold compare group constraint for power10
	powerpc/perf: Fix the threshold compare group constraint for power9
	macintosh: via-pmu and via-cuda need RTC_LIB
	powerpc/xive: Add some error handling code to 'xive_spapr_init()'
	powerpc/xive: Fix refcount leak in xive_spapr_init
	powerpc/fsl_rio: Fix refcount leak in fsl_rio_setup
	mfd: davinci_voicecodec: Fix possible null-ptr-deref davinci_vc_probe()
	nfsd: destroy percpu stats counters after reply cache shutdown
	mailbox: forward the hrtimer if not queued and under a lock
	RDMA/hfi1: Prevent use of lock before it is initialized
	KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
	Input: stmfts - do not leave device disabled in stmfts_input_open
	OPP: call of_node_put() on error path in _bandwidth_supported()
	f2fs: support fault injection for dquot_initialize()
	f2fs: fix to do sanity check on inline_dots inode
	f2fs: fix dereference of stale list iterator after loop body
	iommu/amd: Enable swiotlb in all cases
	iommu/mediatek: Fix 2 HW sharing pgtable issue
	iommu/mediatek: Add list_del in mtk_iommu_remove
	iommu/mediatek: Remove clk_disable in mtk_iommu_remove
	iommu/mediatek: Add mutex for m4u_group and m4u_dom in data
	i2c: at91: use dma safe buffers
	cpufreq: mediatek: Use module_init and add module_exit
	cpufreq: mediatek: Unregister platform device on exit
	iommu/arm-smmu-v3-sva: Fix mm use-after-free
	MIPS: Loongson: Use hwmon_device_register_with_groups() to register hwmon
	iommu/mediatek: Fix NULL pointer dereference when printing dev_name
	i2c: at91: Initialize dma_buf in at91_twi_xfer()
	dmaengine: idxd: Fix the error handling path in idxd_cdev_register()
	NFS: Do not report EINTR/ERESTARTSYS as mapping errors
	NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS
	NFS: Don't report ENOSPC write errors twice
	NFS: Do not report flush errors in nfs_write_end()
	NFS: Don't report errors from nfs_pageio_complete() more than once
	NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout
	NFS: Further fixes to the writeback error handling
	video: fbdev: clcdfb: Fix refcount leak in clcdfb_of_vram_setup
	dmaengine: stm32-mdma: remove GISR1 register
	dmaengine: stm32-mdma: fix chan initialization in stm32_mdma_irq_handler()
	iommu/amd: Increase timeout waiting for GA log enablement
	i2c: npcm: Fix timeout calculation
	i2c: npcm: Correct register access width
	i2c: npcm: Handle spurious interrupts
	i2c: rcar: fix PM ref counts in probe error paths
	perf build: Fix btf__load_from_kernel_by_id() feature check
	perf c2c: Use stdio interface if slang is not supported
	perf jevents: Fix event syntax error caused by ExtSel
	video: fbdev: vesafb: Fix a use-after-free due early fb_info cleanup
	NFS: Always initialise fattr->label in nfs_fattr_alloc()
	NFS: Create a new nfs_alloc_fattr_with_label() function
	NFS: Convert GFP_NOFS to GFP_KERNEL
	NFSv4.1 mark qualified async operations as MOVEABLE tasks
	f2fs: fix to avoid f2fs_bug_on() in dec_valid_node_count()
	f2fs: fix to do sanity check on block address in f2fs_do_zero_range()
	f2fs: fix to clear dirty inode in f2fs_evict_inode()
	f2fs: fix deadloop in foreground GC
	f2fs: don't need inode lock for system hidden quota
	f2fs: fix to do sanity check on total_data_blocks
	f2fs: don't use casefolded comparison for "." and ".."
	f2fs: fix fallocate to use file_modified to update permissions consistently
	f2fs: fix to do sanity check for inline inode
	objtool: Fix objtool regression on x32 systems
	objtool: Fix symbol creation
	wifi: mac80211: fix use-after-free in chanctx code
	iwlwifi: mvm: fix assert 1F04 upon reconfig
	fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
	efi: Do not import certificates from UEFI Secure Boot for T2 Macs
	bfq: Avoid false marking of bic as stably merged
	bfq: Avoid merging queues with different parents
	bfq: Split shared queues on move between cgroups
	bfq: Update cgroup information before merging bio
	bfq: Drop pointless unlock-lock pair
	bfq: Remove pointless bfq_init_rq() calls
	bfq: Track whether bfq_group is still online
	bfq: Get rid of __bio_blkcg() usage
	bfq: Make sure bfqg for which we are queueing requests is online
	ext4: mark group as trimmed only if it was fully scanned
	ext4: fix use-after-free in ext4_rename_dir_prepare
	ext4: fix race condition between ext4_write and ext4_convert_inline_data
	ext4: fix warning in ext4_handle_inode_extension
	ext4: fix bug_on in ext4_writepages
	ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state
	ext4: fix bug_on in __es_tree_search
	ext4: verify dir block before splitting it
	ext4: avoid cycles in directory h-tree
	ACPI: property: Release subnode properties with data nodes
	tty: goldfish: Introduce gf_ioread32()/gf_iowrite32()
	tracing: Fix potential double free in create_var_ref()
	tracing: Initialize integer variable to prevent garbage return value
	drm/amdgpu: add beige goby PCI ID
	PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299
	PCI: qcom: Fix runtime PM imbalance on probe errors
	PCI: qcom: Fix unbalanced PHY init on probe errors
	staging: r8188eu: prevent ->Ssid overflow in rtw_wx_set_scan()
	mm, compaction: fast_find_migrateblock() should return pfn in the target zone
	s390/perf: obtain sie_block from the right address
	s390/stp: clock_delta should be signed
	dlm: fix plock invalid read
	dlm: uninitialized variable on error in dlm_listen_for_all()
	dlm: fix missing lkb refcount handling
	ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
	scsi: dc395x: Fix a missing check on list iterator
	scsi: ufs: qcom: Add a readl() to make sure ref_clk gets enabled
	landlock: Add clang-format exceptions
	landlock: Format with clang-format
	selftests/landlock: Add clang-format exceptions
	selftests/landlock: Normalize array assignment
	selftests/landlock: Format with clang-format
	samples/landlock: Add clang-format exceptions
	samples/landlock: Format with clang-format
	landlock: Fix landlock_add_rule(2) documentation
	selftests/landlock: Make tests build with old libc
	selftests/landlock: Extend tests for minimal valid attribute size
	selftests/landlock: Add tests for unknown access rights
	selftests/landlock: Extend access right tests to directories
	selftests/landlock: Fully test file rename with "remove" access
	selftests/landlock: Add tests for O_PATH
	landlock: Change landlock_add_rule(2) argument check ordering
	landlock: Change landlock_restrict_self(2) check ordering
	selftests/landlock: Test landlock_create_ruleset(2) argument check ordering
	landlock: Define access_mask_t to enforce a consistent access mask size
	landlock: Reduce the maximum number of layers to 16
	landlock: Create find_rule() from unmask_layers()
	landlock: Fix same-layer rule unions
	drm/amdgpu/cs: make commands with 0 chunks illegal behaviour.
	drm/nouveau/subdev/bus: Ratelimit logging for fault errors
	drm/etnaviv: check for reaped mapping in etnaviv_iommu_unmap_gem
	drm/nouveau/clk: Fix an incorrect NULL check on list iterator
	drm/nouveau/kms/nv50-: atom: fix an incorrect NULL check on list iterator
	drm/bridge: analogix_dp: Grab runtime PM reference for DP-AUX
	drm/i915/dsi: fix VBT send packet port selection for ICL+
	md: fix an incorrect NULL check in does_sb_need_changing
	md: fix an incorrect NULL check in md_reload_sb
	mtd: cfi_cmdset_0002: Move and rename chip_check/chip_ready/chip_good_for_write
	mtd: cfi_cmdset_0002: Use chip_ready() for write on S29GL064N
	media: coda: Fix reported H264 profile
	media: coda: Add more H264 levels for CODA960
	ima: remove the IMA_TEMPLATE Kconfig option
	Kconfig: Add option for asm goto w/ tied outputs to workaround clang-13 bug
	RDMA/hfi1: Fix potential integer multiplication overflow errors
	mmc: core: Allows to override the timeout value for ioctl() path
	csky: patch_text: Fixup last cpu should be master
	irqchip/armada-370-xp: Do not touch Performance Counter Overflow on A375, A38x, A39x
	irqchip: irq-xtensa-mx: fix initial IRQ affinity
	thermal: devfreq_cooling: use local ops instead of global ops
	cfg80211: declare MODULE_FIRMWARE for regulatory.db
	mac80211: upgrade passive scan to active scan on DFS channels after beacon rx
	um: Use asm-generic/dma-mapping.h
	um: chan_user: Fix winch_tramp() return value
	um: Fix out-of-bounds read in LDT setup
	kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add]
	ftrace: Clean up hash direct_functions on register failures
	ksmbd: fix outstanding credits related bugs
	iommu/msm: Fix an incorrect NULL check on list iterator
	iommu/dma: Fix iova map result check bug
	Revert "mm/cma.c: remove redundant cma_mutex lock"
	mm/page_alloc: always attempt to allocate at least one page during bulk allocation
	nodemask.h: fix compilation error with GCC12
	hugetlb: fix huge_pmd_unshare address update
	mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
	xtensa/simdisk: fix proc_read_simdisk()
	rtl818x: Prevent using not initialized queues
	ASoC: rt5514: Fix event generation for "DSP Voice Wake Up" control
	carl9170: tx: fix an incorrect use of list iterator
	stm: ltdc: fix two incorrect NULL checks on list iterator
	bcache: improve multithreaded bch_btree_check()
	bcache: improve multithreaded bch_sectors_dirty_init()
	bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
	bcache: avoid journal no-space deadlock by reserving 1 journal bucket
	serial: pch: don't overwrite xmit->buf[0] by x_char
	tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator
	gma500: fix an incorrect NULL check on list iterator
	arm64: dts: qcom: ipq8074: fix the sleep clock frequency
	arm64: tegra: Add missing DFLL reset on Tegra210
	clk: tegra: Add missing reset deassertion
	phy: qcom-qmp: fix struct clk leak on probe errors
	ARM: dts: s5pv210: Remove spi-cs-high on panel in Aries
	ARM: pxa: maybe fix gpio lookup tables
	SMB3: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op
	docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0
	dt-bindings: gpio: altera: correct interrupt-cells
	vdpasim: allow to enable a vq repeatedly
	blk-iolatency: Fix inflight count imbalances and IO hangs on offline
	coresight: core: Fix coresight device probe failure issue
	phy: qcom-qmp: fix reset-controller leak on probe errors
	net: ipa: fix page free in ipa_endpoint_trans_release()
	net: ipa: fix page free in ipa_endpoint_replenish_one()
	kseltest/cgroup: Make test_stress.sh work if run interactively
	list: test: Add a test for list_is_head()
	Revert "random: use static branch for crng_ready()"
	staging: r8188eu: delete rtw_wx_read/write32()
	RDMA/hns: Remove the num_cqc_timer variable
	RDMA/rxe: Generate a completion for unsupported/invalid opcode
	MIPS: IP27: Remove incorrect `cpu_has_fpu' override
	MIPS: IP30: Remove incorrect `cpu_has_fpu' override
	ext4: only allow test_dummy_encryption when supported
	interconnect: qcom: sc7180: Drop IP0 interconnects
	interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
	fs: add two trivial lookup helpers
	exportfs: support idmapped mounts
	fs/ntfs3: Fix invalid free in log_replay
	md: Don't set mddev private to NULL in raid0 pers->free
	md: fix double free of io_acct_set bioset
	md: bcache: check the return value of kzalloc() in detached_dev_do_request()
	pinctrl/rockchip: support setting input-enable param
	block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
	Linux 5.15.46

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7b65df29c22a01b81a94cd844867a18e73098a15
2022-07-13 11:40:42 +02:00
Rei Yamamoto
20e6ec76ae mm, compaction: fast_find_migrateblock() should return pfn in the target zone
commit bbe832b9db2e1ad21522f8f0bf02775fff8a0e0e upstream.

At present, pages not in the target zone are added to cc->migratepages
list in isolate_migratepages_block().  As a result, pages may migrate
between nodes unintentionally.

This would be a serious problem for older kernels without commit
a984226f45 ("mm: memcontrol: remove the pgdata parameter of
mem_cgroup_page_lruvec"), because it can corrupt the lru list by
handling pages in list without holding proper lru_lock.

Avoid returning a pfn outside the target zone in the case that it is
not aligned with a pageblock boundary.  Otherwise
isolate_migratepages_block() will handle pages not in the target zone.

Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com
Fixes: 70b44595ea ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09 10:23:21 +02:00
Carlos Llamas
061e34c52e ANDROID: mm: compaction: fix isolate_and_split_free_page() redefinition
Guard isolate_and_split_free_page() with CONFIG_COMPACTION. This fixes
the follwoing build error as the function collides with its inline stub
from the header file:

mm/compaction.c:766:15: error: redefinition of ‘isolate_and_split_free_page’
  766 | unsigned long isolate_and_split_free_page(struct page *page,
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mm/compaction.c:14:
./include/linux/compaction.h:241:29: note: previous definition of ‘isolate_and_split_free_page’ was here
  241 | static inline unsigned long isolate_and_split_free_page(struct page *page,
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~

Bug: 201263307
Fixes: 8cd9aa93b7 ("ANDROID: implement wrapper for reverse migration")
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: Ie8f3fedcc9d4af5cfdcfd5829377671745ab77d6
2022-03-19 05:19:23 +00:00
Charan Teja Reddy
f47b852faa ANDROID: implement wrapper for reverse migration
Reverse migration is used to do the balancing the occupancy of memory
zones in a node in the system whose imabalance may be caused by
migration of pages to other zones by an operation, eg: hotremove and
then hotadding the same memory. In this case there is a lot of free
memory in newly hotadd memory which can be filled up by the previous
migrated pages(as part of offline/hotremove) thus may free up some
pressure in other zones of the node.

Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/

Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Bug: 201263307
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
2022-03-17 21:15:46 +00:00
Linus Torvalds
2d338201d5 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "147 patches, based on 7d2a07b769.

  Subsystems affected by this patch series: mm (memory-hotplug, rmap,
  ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
  alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
  checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
  selftests, ipc, and scripts"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
  scripts: check_extable: fix typo in user error message
  mm/workingset: correct kernel-doc notations
  ipc: replace costly bailout check in sysvipc_find_ipc()
  selftests/memfd: remove unused variable
  Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
  configs: remove the obsolete CONFIG_INPUT_POLLDEV
  prctl: allow to setup brk for et_dyn executables
  pid: cleanup the stale comment mentioning pidmap_init().
  kernel/fork.c: unexport get_{mm,task}_exe_file
  coredump: fix memleak in dump_vma_snapshot()
  fs/coredump.c: log if a core dump is aborted due to changed file permissions
  nilfs2: use refcount_dec_and_lock() to fix potential UAF
  nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
  nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
  nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
  nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
  nilfs2: fix NULL pointer in nilfs_##name##_attr_release
  nilfs2: fix memory leak in nilfs_sysfs_create_device_group
  trap: cleanup trap_init()
  init: move usermodehelper_enable() to populate_rootfs()
  ...
2021-09-08 12:55:35 -07:00
Mike Rapoport
859a85ddf9 mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock.  This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.

The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than simple if.

This patch (of 2):

After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
pfn_valid_within().

With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.

Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08 11:50:22 -07:00
Charan Teja Reddy
65d759c8f9 mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered for every 500msec and run
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set to sysctl.compaction_proactiveness.  Triggering the
compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is
not needed for all applications, especially on the embedded system
usecases which may have few MB's of RAM.  Enabling the proactive
compaction in its state will endup in running almost always on such
systems.

Other side, proactive compaction can still be very much useful for getting
a set of higher order pages in some controllable manner(controlled by
using the sysctl.compaction_proactiveness).  So, on systems where enabling
the proactive compaction always may proove not required, can trigger the
same from user space on write to its sysctl interface.  As an example, say
app launcher decide to launch the memory heavy application which can be
launched fast if it gets more higher order pages thus launcher can prepare
the system in advance by triggering the proactive compaction from
userspace.

This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by user.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a

[akpm@linux-foundation.org: tweak vm.rst, per Mike]

Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:17 -07:00
Charan Teja Reddy
e1e92bfa38 mm: compaction: optimize proactive compaction deferrals
Vlastimil Babka figured out that when fragmentation score didn't go down
across the proactive compaction i.e.  when no progress is made, next wake
up for proactive compaction is deferred for 1 << COMPACT_MAX_DEFER_SHIFT,
i.e.  64 times, with each wakeup interval of
HPAGE_FRAG_CHECK_INTERVAL_MSEC(=500).  In each of this wakeup, it just
decrement 'proactive_defer' counter and goes sleep i.e.  it is getting
woken to just decrement a counter.

The same deferral time can also achieved by simply doing the
HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT thus unnecessary
wakeup of kcompact thread is avoided thus also removes the need of
'proactive_defer' thread counter.

[akpm@linux-foundation.org: tweak comment]

Link: https://lore.kernel.org/linux-fsdevel/88abfdb6-2c13-b5a6-5b46-742d12d1c910@suse.cz/
Link: https://lkml.kernel.org/r/1626869599-25412-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:17 -07:00
Yang Shi
5ac95884a7 mm/migrate: enable returning precise migrate_pages() success count
Under normal circumstances, migrate_pages() returns the number of pages
migrated.  In error conditions, it returns an error code.  When returning
an error code, there is no way to know how many pages were migrated or not
migrated.

Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors.  Page reclaim behavior will
depend on this in subsequent patches.

Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:16 -07:00
Linus Torvalds
71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Wonhyuk Yang
b55ca5264b mm/compaction: fix 'limit' in fast_isolate_freepages
Because of 'min(1, ...)', fast_isolate_freepages set 'limit' to 0 or 1.
This takes away the opportunities of find candinate pages.  So, by making
enough scans available, increases the probability of finding the
appropriate freepage.

Tested it on the thpscale and the results are as follows.

                                        5.12.0                 5.12.0
                                      valnilla                patched
Amean     fault-both-1       598.15 (   0.00%)      592.56 (   0.93%)
Amean     fault-both-3      1494.47 (   0.00%)     1514.35 (  -1.33%)
Amean     fault-both-5      2519.48 (   0.00%)     2471.76 (   1.89%)
Amean     fault-both-7      3173.85 (   0.00%)     3079.19 (   2.98%)
Amean     fault-both-12     8063.83 (   0.00%)     7858.29 (   2.55%)
Amean     fault-both-18     8781.20 (   0.00%)     7827.70 *  10.86%*
Amean     fault-both-24    12576.44 (   0.00%)    12250.20 (   2.59%)
Amean     fault-both-30    18503.27 (   0.00%)    17528.11 *   5.27%*
Amean     fault-both-32    16133.69 (   0.00%)    13874.24 *  14.00%*

                                           5.12.0         5.12.0
                                          vanilla        patched
Ops Compaction migrate scanned         6547133.00     5963901.00
Ops Compaction free scanned           32452453.00    26609101.00

                        5.12        5.12
                     vanilla     patched
Duration User          27.99       28.84
Duration System       244.08      236.76
Duration Elapsed       78.27       78.38

Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com
Fixes: 5a811889de ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Liu Xiang
d2155fe54d mm: compaction: remove duplicate !list_empty(&sublist) check
The list_splice_tail(&sublist, freelist) also do !list_empty(&sublist)
check, so remove the duplicate call.

Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.com
Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
YueHaibing
17adb230d6 mm/compaction: use DEVICE_ATTR_WO macro
Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the
code a bit shorter and easier to read.

Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Linus Torvalds
65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Muchun Song
a984226f45 mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as
the 2nd parameter to it (except isolate_migratepages_block()).  But for
isolate_migratepages_block(), the page_pgdat(page) is also equal to the
local variable of @pgdat.  So mem_cgroup_page_lruvec() do not need the
pgdat parameter.  Just remove it to simplify the code.

Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
Peter Zijlstra
b03fbd4ff2 sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-18 11:43:07 +02:00
Ingo Molnar
f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Zhiyuan Dai
68d68ff6eb mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/

[daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
[daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
  Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:27 -07:00
Minchan Kim
361a2a229f mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]
Currently, migrate_[prep|finish] is merely a wrapper of
lru_cache_[disable|enable].  There is not much to gain from having
additional abstraction.

Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which
would be more descriptive.

note: migrate_prep_local in compaction.c changed into lru_add_drain to
avoid CPU schedule cost with involving many other CPUs to keep old
behavior.

Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: John Dias <joaodias@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Charan Teja Reddy
06dac2f467 mm: compaction: update the COMPACT[STALL|FAIL] events properly
By definition, COMPACT[STALL|FAIL] events needs to be counted when there
is 'At least in one zone compaction wasn't deferred or skipped from the
direct compaction'.  And when compaction is skipped or deferred,
COMPACT_SKIPPED will be returned but it will still go and update these
compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
is counted without even trying the compaction.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction.  This indirectly also avoid
the unnecessary try into the get_page_from_freelist() when compaction is
not even tried.

There is a corner case where compaction is skipped but still count
COMPACTSTALL event, which is that IRQ came and freed the page and the
same is captured in capture_control.

Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Pintu Kumar
ef49843841 mm/compaction: remove unused variable sysctl_compact_memory
The sysctl_compact_memory is mostly unused in mm/compaction.c It just
acts as a place holder for sysctl to store .data.

But the .data itself is not needed here.

So we can get ride of this variable completely and make .data as NULL.
This will also eliminate the extern declaration from header file.  No
functionality is broken or changed this way.

Link: https://lkml.kernel.org/r/1614852224-14671-1-git-send-email-pintu@codeaurora.org
Signed-off-by: Pintu Kumar <pintu@codeaurora.org>
Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Oscar Salvador
ae37c7ff79 mm: make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the
range, without a chance to handle them.  Since HugeTLB pages can be
migrated as any LRU or Movable page, it does not make sense to bail out
without trying.  Enable the interface to recognize in-use HugeTLB pages so
we can migrate them, and have much better chances to succeed the call.

Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Oscar Salvador
369fa227c2 mm: make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.

This has proved to be problematic for some users of alloc_contic_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle so as to no userspace application
notices disruption, we need to replace the current free hugepage with a
new one.

In order to do that, a new function called alloc_and_dissolve_huge_page is
introduced.  This function will first try to get a new fresh hugepage, and
if it succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external users
of hugetlb will notice the change.  To allocate the new huge page, we use
alloc_buddy_huge_page(), so we do not have to deal with any counters, and
prep_new_huge_page() is not called.  This is valulable because in case we
need to free the new page, we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted huge
page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and
__prep_account_new_huge_page() for the new huge page to properly
initialize it and increment the hstate->nr_huge_pages counter (previously
decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
enqueue_huge_page() and it is ready to be used.

There is one tricky case when page's refcount is 0 because it is in the
process of being released.  A missing PageHugeFreed bit will tell us that
freeing is in flight so we retry after dropping the hugetlb_lock.  The
race window should be small and the next retry should make a forward
progress.

E.g:

CPU0				CPU1
free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_buddy_huge_page()
				    spin_lock_irq(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock_irq(&hugetlb_lock)
				  spin_lock_irq(hugetlb_lock)
                                   1) PageHuge() == F (freed by case#1 from CPU0)
				   2) PageHuge() == T
                                       PageHugeFreed() == T
                                       - proceed with replacing the page

In the case above we retry as the window race is quite small and we have
high chances to succeed next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Oscar Salvador
c2ad7a1ffe mm,compaction: let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn
== 0 vs pfn != 0 scheme to let the caller know whether there was any error
during isolation.

This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.  Since isolate_migratepages_block will stop returning the next pfn
to be scanned, we reuse the cc->migrate_pfn field to keep track of that.

Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
Vlastimil Babka
6e2b7044c1 mm, compaction: make fast_isolate_freepages() stay within zone
Compaction always operates on pages from a single given zone when
isolating both pages to migrate and freepages.  Pageblock boundaries are
intersected with zone boundaries to be safe in case zone starts or ends in
the middle of pageblock.  The use of pageblock_pfn_to_page() protects
against non-contiguous pageblocks.

The functions fast_isolate_freepages() and fast_isolate_around() don't
currently protect the fast freepage isolation thoroughly enough against
these corner cases, and can result in freepage isolation operate outside
of zone boundaries:

 - in fast_isolate_freepages() if we get a pfn from the first pageblock
   of a zone that starts in the middle of that pageblock, 'highest' can
   be a pfn outside of the zone.

   If we fail to isolate anything in this function, we may then call
   fast_isolate_around() on a pfn outside of the zone and there
   effectively do a set_pageblock_skip(page_to_pfn(highest)) which may
   currently hit a VM_BUG_ON() in some configurations

 - fast_isolate_around() checks only the zone end boundary and not
   beginning, nor that the pageblock is contiguous (with
   pageblock_pfn_to_page()) so it's possible that we end up calling
   isolate_freepages_block() on a range of pfn's from two different
   zones and end up e.g. isolating freepages under the wrong zone's
   lock.

This patch should fix the above issues.

Link: https://lkml.kernel.org/r/20210217173300.6394-1-vbabka@suse.cz
Fixes: 5a811889de ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Wonhyuk Yang
15d28d0d11 mm/compaction: fix misbehaviors of fast_find_migrateblock()
In the fast_find_migrateblock(), it iterates ocer the freelist to find the
proper pageblock.  But there are some misbehaviors.

First, if the page we found is equal to cc->migrate_pfn, it is considered
that we didn't find a suitable pageblock.  Secondly, if the loop was
terminated because order is less than PAGE_ALLOC_COSTLY_ORDER, it could be
considered that we found a suitable one.  Thirdly, if the skip bit is set
on the page block and we goto continue, it doesn't check nr_scanned.
Fourthly, if the page block's skip bit is set, it checks that page block
is the last of list, which is unnecessary.

Link: https://lkml.kernel.org/r/20210128130411.6125-1-vvghjk1234@gmail.com
Fixes: 70b44595ea ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Charan Teja Reddy
40d7e20320 mm/compaction: correct deferral logic for proactive compaction
should_proactive_compact_node() returns true when sum of the weighted
fragmentation score of all the zones in the node is greater than the
wmark_high of compaction, which then triggers the proactive compaction
that operates on the individual zones of the node.  But proactive
compaction runs on the zone only when its weighted fragmentation score
is greater than wmark_low(=wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the
zones can exceed the wmark_high but individual weighted fragmentation zone
scores can still be less than wmark_low which makes the unnecessary
trigger of the proactive compaction only to return doing nothing.

Issue with the return of proactive compaction with out even trying is its
deferral.  It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the
scores across the proactive compaction is same, thinking that compaction
didn't make any progress but in reality it didn't even try.  With the
delay between successive retries for proactive compaction is 500msec, it
can result into the deferral for ~30sec with out even trying the proactive
compaction.

Test scenario is that: compaction_proactiveness=50 thus the wmark_low = 50
and wmark_high = 60.  System have 2 zones(Normal and Movable) with sizes
5GB and 6GB respectively.  After opening some apps on the android, the
weighted fragmentation scores of these zones are 47 and 49 respectively.
Since the sum of these fragmentation scores are above the wmark_high which
triggers the proactive compaction and there since the individual zones
weighted fragmentation scores are below wmark_low, it returns without
trying the proactive compaction.  As a result the weighted fragmentation
scores of the zones are still 47 and 49 which makes the existing logic to
defer the compaction thinking that noprogress is made across the
compaction.

Fix this by checking just zone fragmentation score, not the weighted, in
__compact_finished() and use the zones weighted fragmentation score in
fragmentation_score_node().  In the test case above, If the weighted
average of is above wmark_high, then individual score (not adjusted) of
atleast one zone has to be above wmark_high.  Thus it avoids the
unnecessary trigger and deferrals of the proactive compaction.

Link: https://lkml.kernel.org/r/1610989938-31374-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Miaohe Lin
e2d26aa5fb mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked
The VM_BUG_ON_PAGE(!PageLocked(page), page) is also done in PageMovable.
Remove this explicitly one.

Link: https://lkml.kernel.org/r/20210109081420.46030-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Alex Shi
d99fd5feb0 mm/compaction: remove rcu_read_lock during page compaction
isolate_migratepages_block() used rcu_read_lock() with the intention of
safeguarding against the mem_cgroup being destroyed concurrently; but
its TestClearPageLRU already protects against that.  Delete the
unnecessary rcu_read_lock() and _unlock().

Hugh Dickins helped on commit log polishing, Thanks!

Link: https://lkml.kernel.org/r/1608614453-10739-3-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:34 -08:00
Yu Zhao
46ae6b2cc2 mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()
The parameter is redundant in the sense that it can be potentially
extracted from the "struct page" parameter by page_lru(). We need to
make sure that existing PageActive() or PageUnevictable() remains
until the function returns. A few places don't conform, and simple
reordering fixes them.

This patch may have left page_off_lru() seemingly odd, and we'll take
care of it in the next patch.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:33 -08:00
Alex Shi
c2135f7c57 mm/vmscan: __isolate_lru_page_prepare() cleanup
The function just returns 2 results, so using a 'switch' to deal with its
result is unnecessary.  Also simplify it to a bool func as Vlastimil
suggested.

Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's
suggestion to update comments in function.

Link: https://lkml.kernel.org/r/728874d7-2d93-4049-68c1-dcc3b2d52ccd@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:33 -08:00
Rokudo Yan
74e21484e4 mm, compaction: move high_pfn to the for loop scope
In fast_isolate_freepages, high_pfn will be used if a prefered one (ie
PFN >= low_fn) not found.

But the high_pfn is not reset before searching an free area, so when it
was used as freepage, it may from another free area searched before.  As
a result move_freelist_head(freelist, freepage) will have unexpected
behavior (eg corrupt the MOVABLE freelist)

  Unable to handle kernel paging request at virtual address dead000000000200
  Mem abort info:
    ESR = 0x96000044
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000044
    CM = 0, WnR = 1
  [dead000000000200] address between user and kernel address ranges

  -000|list_cut_before(inline)
  -000|move_freelist_head(inline)
  -000|fast_isolate_freepages(inline)
  -000|isolate_freepages(inline)
  -000|compaction_alloc(?, ?)
  -001|unmap_and_move(inline)
  -001|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p
  -002|__read_once_size(inline)
  -002|static_key_count(inline)
  -002|static_key_false(inline)
  -002|trace_mm_compaction_migratepages(inline)
  -002|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0)
  -003|kcompactd_do_work(inline)
  -003|kcompactd([X19] p = 0xFFFFFF93227FBC40)
  -004|kthread([X20] _create = 0xFFFFFFE1AFB26380)
  -005|ret_from_fork(asm)

The issue was reported on an smart phone product with 6GB ram and 3GB
zram as swap device.

This patch fixes the issue by reset high_pfn before searching each free
area, which ensure freepage and freelist match when call
move_freelist_head in fast_isolate_freepages().

Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com
Fixes: 5a811889de ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Rokudo Yan <wu-yan@tcl.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-05 11:03:47 -08:00
Alex Shi
6168d0da2b mm/lru: replace pgdat lru_lock with lruvec lock
This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
each of memcg per node.  So on a large machine, each of memcg don't have
to suffer from per node pgdat->lru_lock competition.  They could go fast
with their self lru_lock.

After move memcg charge before lru inserting, page isolation could
serialize page's memcg, then per memcg lruvec lock is stable and could
replace per node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort and
lock_page_lruvec_irqsave are open coded to work with compact_control.
Also add a debug func in locking which may give some clues if there are
sth out of hands.

Daniel Jordan's testing show 62% improvement on modified readtwice case on
his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Hugh Dickins helped on the patch polish, thanks!

[alex.shi@linux.alibaba.com: fix comment typo]
  Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Alex Shi
9df4131439 mm/compaction: do page isolation first in compaction
Currently, compaction would get the lru_lock and then do page isolation
which works fine with pgdat->lru_lock, since any page isoltion would
compete for the lru_lock.  If we want to change to memcg lru_lock, we have
to isolate the page before getting lru_lock, thus isoltion would block
page's memcg change which relay on page isoltion too.  Then we could
safely use per memcg lru_lock later.

The new page isolation use previous introduced TestClearPageLRU() + pgdat
lru locking which will be changed to memcg lru lock later.

Hugh Dickins <hughd@google.com> fixed following bugs in this patch's early
version:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to be
the final put_page(), which will take an lruvec lock when PageLRU.

And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe: trylock_page() is not safe to use at this time:
its setting PG_locked can race with the page being freed or allocated
("Bad page"), and can also erase flags being set by one of those "sole
owners" of a freshly allocated page who use non-atomic __SetPageFlag().

Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 14:48:04 -08:00
Hui Su
2271b016bf mm/compaction: make defer_compaction and compaction_deferred static
defer_compaction() and compaction_deferred() and compaction_restarting()
in mm/compaction.c won't be used in other files, so make them static, and
remove the declaration in the header file.

Take the chance to fix a typo.

Link: https://lkml.kernel.org/r/20201123170801.GA9625@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Hui Su
2b1a20c3af mm/compaction: move compaction_suitable's comment to right place
Since commit 837d026d56 ("mm/compaction: more trace to understand
when/why compaction start/finish"), the comment place is not suitable.

So move compaction_suitable's comment to right place.

Link: https://lkml.kernel.org/r/20201116144121.GA385717@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Yanfei Xu
19d3cf9de1 mm/compaction: rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()
There are two 'start_pfn' declared in compact_zone() which have different
meanings.  Rename the second one to 'iteration_start_pfn' to prevent
confusion.

Also, remove an useless semicolon.

Link: https://lkml.kernel.org/r/20201019115044.1571-1-yanfei.xu@windriver.com
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:45 -08:00
Zi Yan
d20bdd571e mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have
without wasting time on isolating.

In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop.  Bailing
immediately prevents that.

[vbabka@suse.cz: changelog addition]

Fixes: 1da2f328fa (“mm,thp,compaction,cma: allow THP migration for CMA allocations”)
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <shy828301@gmail.com>
Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:03 -08:00
Zi Yan
38935861d8 mm/compaction: count pages and stop correctly during page isolation
In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages.  But nr_migratepages and nr_isolated did
not count compound pages correctly, causing us to isolate more pages
than we thought.

So count compound pages as the number of base pages they contain.
Otherwise, we might be trapped in too_many_isolated while loop, since
the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.

In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended.  Change the isolation
stop condition to '>='.

The issue can be triggered as follows:

In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
get stuck (looping in too_many_isolated function) until we kill either
task.  With the patch applied, oom will kill the application with 10GB
THPs and let hugetlb page reservation finish.

[ziy@nvidia.com: v3]

Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Fixes: 1da2f328fa ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:03 -08:00
Matthew Wilcox (Oracle)
ab130f9108 mm: rename page_order() to buddy_order()
The current page_order() can only be called on pages in the buddy
allocator.  For compound pages, you have to use compound_order().  This is
confusing and led to a bug, so rename page_order() to buddy_order().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:19 -07:00
Mateusz Nosek
62b35fe0eb mm/compaction.c: micro-optimization remove unnecessary branch
The same code can work both for 'zone->compact_considered > defer_limit'
and 'zone->compact_considered >= defer_limit'.  In the latter there is one
branch less which is more effective considering performance.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:34 -07:00
Matthew Wilcox (Oracle)
6c357848b4 mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Randy Dunlap
a1c1dbeb2e mm/compaction.c: delete duplicated word
Drop the repeated word "a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:58 -07:00
Alex Shi
860b32729a mm/compaction: correct the comments of compact_defer_shift
There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Nitin Gupta
d34c0a7599 mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Nitin Gupta
25788738eb mm: fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.

Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Nitin Gupta
facdaa917c mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Vlastimil Babka
b9e20f0da1 mm, compaction: make capture control handling safe wrt interrupts
Hugh reports:

 "While stressing compaction, one run oopsed on NULL capc->cc in
  __free_one_page()'s task_capc(zone): compact_zone_order() had been
  interrupted, and a page was being freed in the return from interrupt.

  Though you would not expect it from the source, both gccs I was using
  (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
  ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
  callq compact_zone - long after the "current->capture_control =
  &capc". An interrupt in between those finds capc->cc NULL (zeroed by
  an earlier rep stos).

  This could presumably be fixed by a barrier() before setting
  current->capture_control in compact_zone_order(); but would also need
  more care on return from compact_zone(), in order not to risk leaking
  a page captured by interrupt just before capture_control is reset.

  Maybe that is the preferable fix, but I felt safer for task_capc() to
  exclude the rather surprising possibility of capture at interrupt
  time"

I have checked that gcc10 also behaves the same.

The advantage of fix in compact_zone_order() is that we don't add
another test in the page freeing hot path, and that it might prevent
future problems if we stop exposing pointers to uninitialized structures
in current task.

So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.

Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:36 -07:00
Ethon Paul
f386775510 mm/compaction: fix a typo in comment "pessemistic"->"pessimistic"
There is a typo in comment, fix it.

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200411070307.16021-1-ethp@qq.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:23 -07:00
Linus Torvalds
ee01c4d72a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
2020-06-03 20:24:15 -07:00