75a03dbfc836e86f6e56caf1b8e36a2735c5562f
333 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3d8ac88867 |
Merge 5.15.46 into android14-5.15
Changes in 5.15.46
binfmt_flat: do not stop relocating GOT entries prematurely on riscv
parisc/stifb: Implement fb_is_primary_device()
parisc/stifb: Keep track of hardware path of graphics card
RISC-V: Mark IORESOURCE_EXCLUSIVE for reserved mem instead of IORESOURCE_BUSY
riscv: Initialize thread pointer before calling C functions
riscv: Fix irq_work when SMP is disabled
riscv: Wire up memfd_secret in UAPI header
riscv: Move alternative length validation into subsection
ALSA: hda/realtek - Add new type for ALC245
ALSA: hda/realtek: Enable 4-speaker output for Dell XPS 15 9520 laptop
ALSA: hda/realtek - Fix microphone noise on ASUS TUF B550M-PLUS
ALSA: usb-audio: Cancel pending work at closing a MIDI substream
USB: serial: pl2303: fix type detection for odd device
USB: serial: option: add Quectel BG95 modem
USB: new quirk for Dell Gen 2 devices
usb: isp1760: Fix out-of-bounds array access
usb: dwc3: gadget: Move null pinter check to proper place
usb: core: hcd: Add support for deferring roothub registration
fs/ntfs3: Update valid size if -EIOCBQUEUED
fs/ntfs3: Fix fiemap + fix shrink file size (to remove preallocated space)
fs/ntfs3: Keep preallocated only if option prealloc enabled
fs/ntfs3: Check new size for limits
fs/ntfs3: In function ntfs_set_acl_ex do not change inode->i_mode if called from function ntfs_init_acl
fs/ntfs3: Fix some memory leaks in an error handling path of 'log_replay()'
fs/ntfs3: Update i_ctime when xattr is added
fs/ntfs3: Restore ntfs_xattr_get_acl and ntfs_xattr_set_acl functions
cifs: fix potential double free during failed mount
cifs: when extending a file with falloc we should make files not-sparse
xhci: Allow host runtime PM as default for Intel Alder Lake N xHCI
platform/x86: intel-hid: fix _DSM function index handling
x86/MCE/AMD: Fix memory leak when threshold_create_bank() fails
perf/x86/intel: Fix event constraints for ICL
x86/kexec: fix memory leak of elf header buffer
x86/sgx: Set active memcg prior to shmem allocation
ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
btrfs: add "0x" prefix for unsupported optional features
btrfs: return correct error number for __extent_writepage_io()
btrfs: repair super block num_devices automatically
btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()
iommu/vt-d: Add RPLS to quirk list to skip TE disabling
drm/vmwgfx: validate the screen formats
drm/virtio: fix NULL pointer dereference in virtio_gpu_conn_get_modes
selftests/bpf: Fix vfs_link kprobe definition
selftests/bpf: Fix parsing of prog types in UAPI hdr for bpftool sync
mwifiex: add mutex lock for call in mwifiex_dfs_chan_sw_work_queue
b43legacy: Fix assigning negative value to unsigned variable
b43: Fix assigning negative value to unsigned variable
ipw2x00: Fix potential NULL dereference in libipw_xmit()
ipv6: fix locking issues with loops over idev->addr_list
fbcon: Consistently protect deferred_takeover with console_lock()
x86/platform/uv: Update TSC sync state for UV5
ACPICA: Avoid cache flush inside virtual machines
mac80211: minstrel_ht: fix where rate stats are stored (fixes debugfs output)
drm/komeda: return early if drm_universal_plane_init() fails.
drm/amd/display: Disabling Z10 on DCN31
rcu-tasks: Fix race in schedule and flush work
rcu: Make TASKS_RUDE_RCU select IRQ_WORK
sfc: ef10: Fix assigning negative value to unsigned variable
ALSA: jack: Access input_dev under mutex
rtw88: 8821c: fix debugfs rssi value
spi: spi-rspi: Remove setting {src,dst}_{addr,addr_width} based on DMA direction
tools/power turbostat: fix ICX DRAM power numbers
scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg()
scsi: lpfc: Fix SCSI I/O completion and abort handler deadlock
scsi: lpfc: Fix call trace observed during I/O with CMF enabled
cpuidle: PSCI: Improve support for suspend-to-RAM for PSCI OSI mode
drm/amd/pm: fix double free in si_parse_power_table()
ASoC: rsnd: care default case on rsnd_ssiu_busif_err_status_clear()
ASoC: rsnd: care return value from rsnd_node_fixed_index()
ath9k: fix QCA9561 PA bias level
media: venus: hfi: avoid null dereference in deinit
media: pci: cx23885: Fix the error handling in cx23885_initdev()
media: cx25821: Fix the warning when removing the module
md/bitmap: don't set sb values if can't pass sanity check
mmc: jz4740: Apply DMA engine limits to maximum segment size
drivers: mmc: sdhci_am654: Add the quirk to set TESTCD bit
scsi: megaraid: Fix error check return value of register_chrdev()
drm/amdgpu/sdma: Fix incorrect calculations of the wptr of the doorbells
scsi: ufs: Use pm_runtime_resume_and_get() instead of pm_runtime_get_sync()
scsi: lpfc: Fix resource leak in lpfc_sli4_send_seq_to_ulp()
ath11k: disable spectral scan during spectral deinit
ASoC: Intel: bytcr_rt5640: Add quirk for the HP Pro Tablet 408
drm/plane: Move range check for format_count earlier
drm/amd/pm: fix the compile warning
ath10k: skip ath10k_halt during suspend for driver state RESTARTING
arm64: compat: Do not treat syscall number as ESR_ELx for a bad syscall
drm: msm: fix error check return value of irq_of_parse_and_map()
scsi: target: tcmu: Fix possible data corruption
ipv6: Don't send rs packets to the interface of ARPHRD_TUNNEL
net/mlx5: fs, delete the FTE when there are no rules attached to it
ASoC: dapm: Don't fold register value changes into notifications
mlxsw: spectrum_dcb: Do not warn about priority changes
mlxsw: Treat LLDP packets as control
drm/amdgpu/psp: move PSP memory alloc from hw_init to sw_init
drm/amdgpu/ucode: Remove firmware load type check in amdgpu_ucode_free_bo
regulator: mt6315: Enforce regulator-compatible, not name
HID: bigben: fix slab-out-of-bounds Write in bigben_probe
of: Support more than one crash kernel regions for kexec -s
ASoC: tscs454: Add endianness flag in snd_soc_component_driver
scsi: lpfc: Alter FPIN stat accounting logic
net: remove two BUG() from skb_checksum_help()
s390/preempt: disable __preempt_count_add() optimization for PROFILE_ALL_BRANCHES
perf/amd/ibs: Cascade pmu init functions' return value
sched/core: Avoid obvious double update_rq_clock warning
spi: stm32-qspi: Fix wait_cmd timeout in APM mode
dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC
ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default
ipmi:ssif: Check for NULL msg when handling events and messages
ipmi: Fix pr_fmt to avoid compilation issues
rtlwifi: Use pr_warn instead of WARN_ONCE
mt76: mt7921: accept rx frames with non-standard VHT MCS10-11
mt76: fix encap offload ethernet type check
media: rga: fix possible memory leak in rga_probe
media: coda: limit frame interval enumeration to supported encoder frame sizes
media: hantro: HEVC: unconditionnaly set pps_{cb/cr}_qp_offset values
media: ccs-core.c: fix failure to call clk_disable_unprepare
media: imon: reorganize serialization
media: cec-adap.c: fix is_configuring state
usbnet: Run unregister_netdev() before unbind() again
openrisc: start CPU timer early in boot
nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
ASoC: rt5645: Fix errorenous cleanup order
nbd: Fix hung on disconnect request if socket is closed before
drm/amd/pm: update smartshift powerboost calc for smu12
drm/amd/pm: update smartshift powerboost calc for smu13
net: phy: micrel: Allow probing without .driver_data
media: exynos4-is: Fix compile warning
media: hantro: Stop using H.264 parameter pic_num
ASoC: max98357a: remove dependency on GPIOLIB
ASoC: rt1015p: remove dependency on GPIOLIB
ACPI: CPPC: Assume no transition latency if no PCCT
nvme: set non-mdts limits in nvme_scan_work
can: mcp251xfd: silence clang's -Wunaligned-access warning
x86/microcode: Add explicit CPU vendor dependency
net: ipa: ignore endianness if there is no header
m68k: atari: Make Atari ROM port I/O write macros return void
rxrpc: Return an error to sendmsg if call failed
rxrpc, afs: Fix selection of abort codes
afs: Adjust ACK interpretation to try and cope with NAT
eth: tg3: silence the GCC 12 array-bounds warning
char: tpm: cr50_i2c: Suppress duplicated error message in .remove()
selftests/bpf: fix btf_dump/btf_dump due to recent clang change
gfs2: use i_lock spin_lock for inode qadata
scsi: target: tcmu: Avoid holding XArray lock when calling lock_page
IB/rdmavt: add missing locks in rvt_ruc_loopback
ARM: dts: ox820: align interrupt controller node name with dtschema
ARM: dts: socfpga: align interrupt controller node name with dtschema
ARM: dts: s5pv210: align DMA channels with dtschema
arm64: dts: qcom: msm8994: Fix the cont_splash_mem address
arm64: dts: qcom: msm8994: Fix BLSP[12]_DMA channels count
PM / devfreq: rk3399_dmc: Disable edev on remove()
crypto: ccree - use fine grained DMA mapping dir
soc: ti: ti_sci_pm_domains: Check for null return of devm_kcalloc
fs: jfs: fix possible NULL pointer dereference in dbFree()
arm64: dts: qcom: sdm845-xiaomi-beryllium: fix typo in panel's vddio-supply property
ALSA: usb-audio: Add quirk bits for enabling/disabling generic implicit fb
ALSA: usb-audio: Move generic implicit fb quirk entries into quirks.c
ARM: OMAP1: clock: Fix UART rate reporting algorithm
powerpc/fadump: Fix fadump to work with a different endian capture kernel
fat: add ratelimit to fat*_ent_bread()
pinctrl: renesas: rzn1: Fix possible null-ptr-deref in sh_pfc_map_resources()
ARM: versatile: Add missing of_node_put in dcscb_init
ARM: dts: exynos: add atmel,24c128 fallback to Samsung EEPROM
ARM: hisi: Add missing of_node_put after of_find_compatible_node
cpufreq: Avoid unnecessary frequency updates due to mismatch
powerpc/rtas: Keep MSR[RI] set when calling RTAS
PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
KVM: PPC: Book3S HV Nested: L2 LPCR should inherit L1 LPES setting
alpha: fix alloc_zeroed_user_highpage_movable()
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
powerpc/powernv/vas: Assign real address to rx_fifo in vas_rx_win_attr
powerpc/xics: fix refcount leak in icp_opal_init()
powerpc/powernv: fix missing of_node_put in uv_init()
macintosh/via-pmu: Fix build failure when CONFIG_INPUT is disabled
powerpc/iommu: Add missing of_node_put in iommu_init_early_dart
smb3: check for null tcon
RDMA/hfi1: Prevent panic when SDMA is disabled
Input: gpio-keys - cancel delayed work only in case of GPIO
drm: fix EDID struct for old ARM OABI format
drm/bridge_connector: enable HPD by default if supported
dt-bindings: display: sitronix, st7735r: Fix backlight in example
drm/vmwgfx: Fix an invalid read
ath11k: acquire ab->base_lock in unassign when finding the peer by addr
drm: bridge: it66121: Fix the register page length
ath9k: fix ar9003_get_eepmisc
drm/edid: fix invalid EDID extension block filtering
drm/bridge: adv7511: clean up CEC adapter when probe fails
drm: bridge: icn6211: Fix register layout
drm: bridge: icn6211: Fix HFP_HSW_HBP_HI and HFP_MIN handling
mtd: spinand: gigadevice: fix Quad IO for GD5F1GQ5UExxG
spi: qcom-qspi: Add minItems to interconnect-names
ASoC: mediatek: Fix error handling in mt8173_max98090_dev_probe
ASoC: mediatek: Fix missing of_node_put in mt2701_wm8960_machine_probe
x86/delay: Fix the wrong asm constraint in delay_loop()
drm/vc4: hvs: Fix frame count register readout
drm/mediatek: Fix mtk_cec_mask()
drm/vc4: hvs: Reset muxes at probe time
drm/vc4: txp: Don't set TXP_VSTART_AT_EOF
drm/vc4: txp: Force alpha to be 0xff if it's disabled
libbpf: Don't error out on CO-RE relos for overriden weak subprogs
x86/PCI: Fix ALi M1487 (IBC) PIRQ router link value interpretation
mptcp: reset the packet scheduler on PRIO change
nl80211: show SSID for P2P_GO interfaces
drm/komeda: Fix an undefined behavior bug in komeda_plane_add()
drm: mali-dp: potential dereference of null pointer
spi: spi-ti-qspi: Fix return value handling of wait_for_completion_timeout
scftorture: Fix distribution of short handler delays
net: dsa: mt7530: 1G can also support 1000BASE-X link mode
ixp4xx_eth: fix error check return value of platform_get_irq()
NFC: NULL out the dev->rfkill to prevent UAF
efi: Add missing prototype for efi_capsule_setup_info
device property: Check fwnode->secondary when finding properties
device property: Allow error pointer to be passed to fwnode APIs
target: remove an incorrect unmap zeroes data deduction
drbd: fix duplicate array initializer
EDAC/dmc520: Don't print an error for each unconfigured interrupt line
mtd: rawnand: denali: Use managed device resources
HID: hid-led: fix maximum brightness for Dream Cheeky
HID: elan: Fix potential double free in elan_input_configured
drm/bridge: Fix error handling in analogix_dp_probe
regulator: da9121: Fix uninit-value in da9121_assign_chip_model()
drm/mediatek: dpi: Use mt8183 output formats for mt8192
signal: Deliver SIGTRAP on perf event asynchronously if blocked
sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
sched/psi: report zeroes for CPU full at the system level
spi: img-spfi: Fix pm_runtime_get_sync() error checking
cpufreq: Fix possible race in cpufreq online error path
printk: use atomic updates for klogd work
printk: add missing memory barrier to wake_up_klogd()
printk: wake waiters for safe and NMI contexts
ath9k_htc: fix potential out of bounds access with invalid rxstatus->rs_keyix
media: i2c: max9286: Use dev_err_probe() helper
media: i2c: max9286: Use "maxim,gpio-poc" property
media: i2c: max9286: fix kernel oops when removing module
media: hantro: Empty encoder capture buffers by default
drm/panel: simple: Add missing bus flags for Innolux G070Y2-L01
ALSA: pcm: Check for null pointer of pointer substream before dereferencing it
mtdblock: warn if opened on NAND
inotify: show inotify mask flags in proc fdinfo
fsnotify: fix wrong lockdep annotations
spi: rockchip: Stop spi slave dma receiver when cs inactive
spi: rockchip: Preset cs-high and clk polarity in setup progress
spi: rockchip: fix missing error on unsupported SPI_CS_HIGH
of: overlay: do not break notify on NOTIFY_{OK|STOP}
selftests/damon: add damon to selftests root Makefile
drm/msm/dp: Modify prototype of encoder based API
drm/msm/hdmi: switch to drm_bridge_connector
drm/msm/dpu: adjust display_v_end for eDP and DP
scsi: iscsi: Fix harmless double shift bug
scsi: ufs: qcom: Fix ufs_qcom_resume()
scsi: ufs: core: Exclude UECxx from SFR dump list
drm/v3d: Fix null pointer dereference of pointer perfmon
selftests/resctrl: Fix null pointer dereference on open failed
libbpf: Fix logic for finding matching program for CO-RE relocation
mtd: spi-nor: core: Check written SR value in spi_nor_write_16bit_sr_and_check()
x86/pm: Fix false positive kmemleak report in msr_build_context()
mtd: rawnand: cadence: fix possible null-ptr-deref in cadence_nand_dt_probe()
mtd: rawnand: intel: fix possible null-ptr-deref in ebu_nand_probe()
x86/speculation: Add missing prototype for unpriv_ebpf_notify()
ASoC: rk3328: fix disabling mclk on pclk probe failure
perf tools: Add missing headers needed by util/data.h
drm/msm/disp/dpu1: set vbif hw config to NULL to avoid use after memory free during pm runtime resume
drm/msm/dp: stop event kernel thread when DP unbind
drm/msm/dp: fix error check return value of irq_of_parse_and_map()
drm/msm/dp: reset DP controller before transmit phy test pattern
drm/msm/dp: do not stop transmitting phy test pattern during DP phy compliance test
drm/msm/dsi: fix error checks and return values for DSI xmit functions
drm/msm/hdmi: check return value after calling platform_get_resource_byname()
drm/msm/hdmi: fix error check return value of irq_of_parse_and_map()
drm/msm: add missing include to msm_drv.c
drm/panel: panel-simple: Fix proper bpc for AM-1280800N3TZQW-T00H
kunit: fix debugfs code to use enum kunit_status, not bool
drm/rockchip: vop: fix possible null-ptr-deref in vop_bind()
spi: cadence-quadspi: fix Direct Access Mode disable for SoCFPGA
perf tools: Use Python devtools for version autodetection rather than runtime
virtio_blk: fix the discard_granularity and discard_alignment queue limits
nl80211: don't hold RTNL in color change request
x86: Fix return value of __setup handlers
irqchip/exiu: Fix acknowledgment of edge triggered interrupts
irqchip/aspeed-i2c-ic: Fix irq_of_parse_and_map() return value
irqchip/aspeed-scu-ic: Fix irq_of_parse_and_map() return value
x86/mm: Cleanup the control_va_addr_alignment() __setup handler
arm64: fix types in copy_highpage()
regulator: core: Fix enable_count imbalance with EXCLUSIVE_GET
drm/msm/dsi: fix address for second DSI PHY on SDM660
drm/msm/dp: fix event thread stuck in wait_event after kthread_stop()
drm/msm/mdp5: Return error code in mdp5_pipe_release when deadlock is detected
drm/msm/mdp5: Return error code in mdp5_mixer_release when deadlock is detected
drm/msm: return an error pointer in msm_gem_prime_get_sg_table()
media: uvcvideo: Fix missing check to determine if element is found in list
arm64: stackleak: fix current_top_of_stack()
iomap: iomap_write_failed fix
spi: spi-fsl-qspi: check return value after calling platform_get_resource_byname()
Revert "cpufreq: Fix possible race in cpufreq online error path"
regulator: qcom_smd: Fix up PM8950 regulator configuration
samples: bpf: Don't fail for a missing VMLINUX_BTF when VMLINUX_H is provided
perf/amd/ibs: Use interrupt regs ip for stack unwinding
ath11k: Don't check arvif->is_started before sending management frames
wilc1000: fix crash observed in AP mode with cfg80211_register_netdevice()
HID: amd_sfh: Modify the bus name
HID: amd_sfh: Modify the hid name
ASoC: fsl: Use dev_err_probe() helper
ASoC: fsl: Fix refcount leak in imx_sgtl5000_probe
ASoC: imx-hdmi: Fix refcount leak in imx_hdmi_probe
ASoC: mxs-saif: Fix refcount leak in mxs_saif_probe
regulator: pfuze100: Fix refcount leak in pfuze_parse_regulators_dt
dma-direct: factor out a helper for DMA_ATTR_NO_KERNEL_MAPPING allocations
dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages
ASoC: samsung: Use dev_err_probe() helper
ASoC: samsung: Fix refcount leak in aries_audio_probe
block: Fix the bio.bi_opf comment
kselftest/cgroup: fix test_stress.sh to use OUTPUT dir
scripts/faddr2line: Fix overlapping text section failures
media: aspeed: Fix an error handling path in aspeed_video_probe()
media: exynos4-is: Fix PM disable depth imbalance in fimc_is_probe
mt76: mt7921: Fix the error handling path of mt7921_pci_probe()
mt76: do not attempt to reorder received 802.3 packets without agg session
media: st-delta: Fix PM disable depth imbalance in delta_probe
media: atmel: atmel-isc: Fix PM disable depth imbalance in atmel_isc_probe
media: i2c: rdacm2x: properly set subdev entity function
media: exynos4-is: Change clk_disable to clk_disable_unprepare
media: pvrusb2: fix array-index-out-of-bounds in pvr2_i2c_core_init
media: vsp1: Fix offset calculation for plane cropping
media: atmel: atmel-sama5d2-isc: fix wrong mask in YUYV format check
media: hantro: HEVC: Fix tile info buffer value computation
Bluetooth: fix dangling sco_conn and use-after-free in sco_sock_timeout
Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring
Bluetooth: use hdev lock for accept_list and reject_list in conn req
nvme: set dma alignment to dword
m68k: math-emu: Fix dependencies of math emulation support
sctp: read sk->sk_bound_dev_if once in sctp_rcv()
net: hinic: add missing destroy_workqueue in hinic_pf_to_mgmt_init
ASoC: ti: j721e-evm: Fix refcount leak in j721e_soc_probe_*
kselftest/arm64: bti: force static linking
media: ov7670: remove ov7670_power_off from ov7670_remove
media: i2c: ov5648: fix wrong pointer passed to IS_ERR() and PTR_ERR()
media: staging: media: rkvdec: Make use of the helper function devm_platform_ioremap_resource()
media: rkvdec: h264: Fix dpb_valid implementation
media: rkvdec: h264: Fix bit depth wrap in pps packet
regulator: scmi: Fix refcount leak in scmi_regulator_probe
ext4: reject the 'commit' option on ext2 filesystems
drm/msm/a6xx: Fix refcount leak in a6xx_gpu_init
drm: msm: fix possible memory leak in mdp5_crtc_cursor_set()
x86/sev: Annotate stack change in the #VC handler
drm/msm: don't free the IRQ if it was not requested
selftests/bpf: Add missed ima_setup.sh in Makefile
drm/msm/dpu: handle pm_runtime_get_sync() errors in bind path
drm/i915: Fix CFI violation with show_dynamic_id()
thermal/drivers/bcm2711: Don't clamp temperature at zero
thermal/drivers/broadcom: Fix potential NULL dereference in sr_thermal_probe
thermal/core: Fix memory leak in __thermal_cooling_device_register()
thermal/drivers/imx_sc_thermal: Fix refcount leak in imx_sc_thermal_probe
bfq: Relax waker detection for shared queues
bfq: Allow current waker to defend against a tentative one
ASoC: wm2000: fix missing clk_disable_unprepare() on error in wm2000_anc_transition()
PM: domains: Fix initialization of genpd's next_wakeup
net: macb: Fix PTP one step sync support
NFC: hci: fix sleep in atomic context bugs in nfc_hci_hcp_message_tx
ASoC: max98090: Move check for invalid values before casting in max98090_put_enab_tlv()
net: stmmac: selftests: Use kcalloc() instead of kzalloc()
net: stmmac: fix out-of-bounds access in a selftest
hv_netvsc: Fix potential dereference of NULL pointer
hwmon: (pmbus) Check PEC support before reading other registers
rxrpc: Fix listen() setting the bar too high for the prealloc rings
rxrpc: Don't try to resend the request if we're receiving the reply
rxrpc: Fix overlapping ACK accounting
rxrpc: Don't let ack.previousPacket regress
rxrpc: Fix decision on when to generate an IDLE ACK
net: huawei: hinic: Use devm_kcalloc() instead of devm_kzalloc()
hinic: Avoid some over memory allocation
net: dsa: restrict SMSC_LAN9303_I2C kconfig
net/smc: postpone sk_refcnt increment in connect()
dma-direct: factor out dma_set_{de,en}crypted helpers
dma-direct: don't call dma_set_decrypted for remapped allocations
dma-direct: always leak memory that can't be re-encrypted
dma-direct: don't over-decrypt memory
arm64: dts: rockchip: Move drive-impedance-ohm to emmc phy on rk3399
arm64: dts: mt8192: Fix nor_flash status disable typo
PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3
memory: samsung: exynos5422-dmc: Avoid some over memory allocation
ARM: dts: BCM5301X: update CRU block description
ARM: dts: BCM5301X: Update pin controller node name
ARM: dts: suniv: F1C100: fix watchdog compatible
soc: qcom: smp2p: Fix missing of_node_put() in smp2p_parse_ipc
soc: qcom: smsm: Fix missing of_node_put() in smsm_parse_ipc
PCI: cadence: Fix find_first_zero_bit() limit
PCI: rockchip: Fix find_first_zero_bit() limit
PCI: mediatek: Fix refcount leak in mtk_pcie_subsys_powerup()
PCI: dwc: Fix setting error return on MSI DMA mapping failure
ARM: dts: ci4x10: Adapt to changes in imx6qdl.dtsi regarding fec clocks
soc: qcom: llcc: Add MODULE_DEVICE_TABLE()
KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry
KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault
crypto: qat - set CIPHER capability for QAT GEN2
crypto: qat - set COMPRESSION capability for QAT GEN2
crypto: qat - set CIPHER capability for DH895XCC
crypto: qat - set COMPRESSION capability for DH895XCC
platform/chrome: cros_ec: fix error handling in cros_ec_register()
ARM: dts: imx6dl-colibri: Fix I2C pinmuxing
platform/chrome: Re-introduce cros_ec_cmd_xfer and use it for ioctls
can: xilinx_can: mark bit timing constants as const
ARM: dts: stm32: Fix PHY post-reset delay on Avenger96
ARM: dts: bcm2835-rpi-zero-w: Fix GPIO line name for Wifi/BT
ARM: dts: bcm2837-rpi-cm3-io3: Fix GPIO line names for SMPS I2C
ARM: dts: bcm2837-rpi-3-b-plus: Fix GPIO line name of power LED
ARM: dts: bcm2835-rpi-b: Fix GPIO line names
misc: ocxl: fix possible double free in ocxl_file_register_afu
crypto: marvell/cesa - ECB does not IV
gpiolib: of: Introduce hook for missing gpio-ranges
pinctrl: bcm2835: implement hook for missing gpio-ranges
arm: mediatek: select arch timer for mt7629
pinctrl/rockchip: support deferring other gpio params
pinctrl: mediatek: mt8195: enable driver on mtk platforms
arm64: dts: qcom: qrb5165-rb5: Fix can-clock node name
Drivers: hv: vmbus: Fix handling of messages with transaction ID of zero
powerpc/fadump: fix PT_LOAD segment for boot memory area
mfd: ipaq-micro: Fix error check return value of platform_get_irq()
scsi: fcoe: Fix Wstringop-overflow warnings in fcoe_wwn_from_mac()
soc: bcm: Check for NULL return of devm_kzalloc()
arm64: dts: ti: k3-am64-mcu: remove incorrect UART base clock rates
ASoC: sh: rz-ssi: Check return value of pm_runtime_resume_and_get()
ASoC: sh: rz-ssi: Propagate error codes returned from platform_get_irq_byname()
ASoC: sh: rz-ssi: Release the DMA channels in rz_ssi_probe() error path
firmware: arm_scmi: Fix list protocols enumeration in the base protocol
nvdimm: Fix firmware activation deadlock scenarios
nvdimm: Allow overwrite in the presence of disabled dimms
pinctrl: mvebu: Fix irq_of_parse_and_map() return value
drivers/base/node.c: fix compaction sysfs file leak
dax: fix cache flush on PMD-mapped pages
drivers/base/memory: fix an unlikely reference counting issue in __add_memory_block()
firmware: arm_ffa: Fix uuid parameter to ffa_partition_probe
firmware: arm_ffa: Remove incorrect assignment of driver_data
list: introduce list_is_head() helper and re-use it in list.h
list: fix a data-race around ep->rdllist
drm/msm/dpu: fix error check return value of irq_of_parse_and_map()
powerpc/8xx: export 'cpm_setbrg' for modules
pinctrl: renesas: r8a779a0: Fix GPIO function on I2C-capable pins
pinctrl: renesas: core: Fix possible null-ptr-deref in sh_pfc_map_resources()
powerpc/idle: Fix return value of __setup() handler
powerpc/4xx/cpm: Fix return value of __setup() handler
RDMA/hns: Add the detection for CMDQ status in the device initialization process
arm64: dts: marvell: espressobin-ultra: fix SPI-NOR config
arm64: dts: marvell: espressobin-ultra: enable front USB3 port
ASoC: atmel-pdmic: Remove endianness flag on pdmic component
ASoC: atmel-classd: Remove endianness flag on class d component
proc: fix dentry/inode overinstantiating under /proc/${pid}/net
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
PCI: imx6: Fix PERST# start-up sequence
tty: fix deadlock caused by calling printk() under tty_port->lock
crypto: sun8i-ss - rework handling of IV
crypto: sun8i-ss - handle zero sized sg
crypto: cryptd - Protect per-CPU resource by disabling BH.
ARM: dts: at91: sama7g5: remove interrupt-parent from gic node
hugetlbfs: fix hugetlbfs_statfs() locking
Input: sparcspkr - fix refcount leak in bbc_beep_probe
PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
PCI: microchip: Fix potential race in interrupt handling
hwrng: omap3-rom - fix using wrong clk_disable() in omap_rom_rng_runtime_resume()
powerpc/64: Only WARN if __pa()/__va() called with bad addresses
powerpc/perf: Fix the threshold compare group constraint for power10
powerpc/perf: Fix the threshold compare group constraint for power9
macintosh: via-pmu and via-cuda need RTC_LIB
powerpc/xive: Add some error handling code to 'xive_spapr_init()'
powerpc/xive: Fix refcount leak in xive_spapr_init
powerpc/fsl_rio: Fix refcount leak in fsl_rio_setup
mfd: davinci_voicecodec: Fix possible null-ptr-deref davinci_vc_probe()
nfsd: destroy percpu stats counters after reply cache shutdown
mailbox: forward the hrtimer if not queued and under a lock
RDMA/hfi1: Prevent use of lock before it is initialized
KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
Input: stmfts - do not leave device disabled in stmfts_input_open
OPP: call of_node_put() on error path in _bandwidth_supported()
f2fs: support fault injection for dquot_initialize()
f2fs: fix to do sanity check on inline_dots inode
f2fs: fix dereference of stale list iterator after loop body
iommu/amd: Enable swiotlb in all cases
iommu/mediatek: Fix 2 HW sharing pgtable issue
iommu/mediatek: Add list_del in mtk_iommu_remove
iommu/mediatek: Remove clk_disable in mtk_iommu_remove
iommu/mediatek: Add mutex for m4u_group and m4u_dom in data
i2c: at91: use dma safe buffers
cpufreq: mediatek: Use module_init and add module_exit
cpufreq: mediatek: Unregister platform device on exit
iommu/arm-smmu-v3-sva: Fix mm use-after-free
MIPS: Loongson: Use hwmon_device_register_with_groups() to register hwmon
iommu/mediatek: Fix NULL pointer dereference when printing dev_name
i2c: at91: Initialize dma_buf in at91_twi_xfer()
dmaengine: idxd: Fix the error handling path in idxd_cdev_register()
NFS: Do not report EINTR/ERESTARTSYS as mapping errors
NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS
NFS: Don't report ENOSPC write errors twice
NFS: Do not report flush errors in nfs_write_end()
NFS: Don't report errors from nfs_pageio_complete() more than once
NFSv4/pNFS: Do not fail I/O when we fail to allocate the pNFS layout
NFS: Further fixes to the writeback error handling
video: fbdev: clcdfb: Fix refcount leak in clcdfb_of_vram_setup
dmaengine: stm32-mdma: remove GISR1 register
dmaengine: stm32-mdma: fix chan initialization in stm32_mdma_irq_handler()
iommu/amd: Increase timeout waiting for GA log enablement
i2c: npcm: Fix timeout calculation
i2c: npcm: Correct register access width
i2c: npcm: Handle spurious interrupts
i2c: rcar: fix PM ref counts in probe error paths
perf build: Fix btf__load_from_kernel_by_id() feature check
perf c2c: Use stdio interface if slang is not supported
perf jevents: Fix event syntax error caused by ExtSel
video: fbdev: vesafb: Fix a use-after-free due early fb_info cleanup
NFS: Always initialise fattr->label in nfs_fattr_alloc()
NFS: Create a new nfs_alloc_fattr_with_label() function
NFS: Convert GFP_NOFS to GFP_KERNEL
NFSv4.1 mark qualified async operations as MOVEABLE tasks
f2fs: fix to avoid f2fs_bug_on() in dec_valid_node_count()
f2fs: fix to do sanity check on block address in f2fs_do_zero_range()
f2fs: fix to clear dirty inode in f2fs_evict_inode()
f2fs: fix deadloop in foreground GC
f2fs: don't need inode lock for system hidden quota
f2fs: fix to do sanity check on total_data_blocks
f2fs: don't use casefolded comparison for "." and ".."
f2fs: fix fallocate to use file_modified to update permissions consistently
f2fs: fix to do sanity check for inline inode
objtool: Fix objtool regression on x32 systems
objtool: Fix symbol creation
wifi: mac80211: fix use-after-free in chanctx code
iwlwifi: mvm: fix assert 1F04 upon reconfig
fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
efi: Do not import certificates from UEFI Secure Boot for T2 Macs
bfq: Avoid false marking of bic as stably merged
bfq: Avoid merging queues with different parents
bfq: Split shared queues on move between cgroups
bfq: Update cgroup information before merging bio
bfq: Drop pointless unlock-lock pair
bfq: Remove pointless bfq_init_rq() calls
bfq: Track whether bfq_group is still online
bfq: Get rid of __bio_blkcg() usage
bfq: Make sure bfqg for which we are queueing requests is online
ext4: mark group as trimmed only if it was fully scanned
ext4: fix use-after-free in ext4_rename_dir_prepare
ext4: fix race condition between ext4_write and ext4_convert_inline_data
ext4: fix warning in ext4_handle_inode_extension
ext4: fix bug_on in ext4_writepages
ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state
ext4: fix bug_on in __es_tree_search
ext4: verify dir block before splitting it
ext4: avoid cycles in directory h-tree
ACPI: property: Release subnode properties with data nodes
tty: goldfish: Introduce gf_ioread32()/gf_iowrite32()
tracing: Fix potential double free in create_var_ref()
tracing: Initialize integer variable to prevent garbage return value
drm/amdgpu: add beige goby PCI ID
PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299
PCI: qcom: Fix runtime PM imbalance on probe errors
PCI: qcom: Fix unbalanced PHY init on probe errors
staging: r8188eu: prevent ->Ssid overflow in rtw_wx_set_scan()
mm, compaction: fast_find_migrateblock() should return pfn in the target zone
s390/perf: obtain sie_block from the right address
s390/stp: clock_delta should be signed
dlm: fix plock invalid read
dlm: uninitialized variable on error in dlm_listen_for_all()
dlm: fix missing lkb refcount handling
ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
scsi: dc395x: Fix a missing check on list iterator
scsi: ufs: qcom: Add a readl() to make sure ref_clk gets enabled
landlock: Add clang-format exceptions
landlock: Format with clang-format
selftests/landlock: Add clang-format exceptions
selftests/landlock: Normalize array assignment
selftests/landlock: Format with clang-format
samples/landlock: Add clang-format exceptions
samples/landlock: Format with clang-format
landlock: Fix landlock_add_rule(2) documentation
selftests/landlock: Make tests build with old libc
selftests/landlock: Extend tests for minimal valid attribute size
selftests/landlock: Add tests for unknown access rights
selftests/landlock: Extend access right tests to directories
selftests/landlock: Fully test file rename with "remove" access
selftests/landlock: Add tests for O_PATH
landlock: Change landlock_add_rule(2) argument check ordering
landlock: Change landlock_restrict_self(2) check ordering
selftests/landlock: Test landlock_create_ruleset(2) argument check ordering
landlock: Define access_mask_t to enforce a consistent access mask size
landlock: Reduce the maximum number of layers to 16
landlock: Create find_rule() from unmask_layers()
landlock: Fix same-layer rule unions
drm/amdgpu/cs: make commands with 0 chunks illegal behaviour.
drm/nouveau/subdev/bus: Ratelimit logging for fault errors
drm/etnaviv: check for reaped mapping in etnaviv_iommu_unmap_gem
drm/nouveau/clk: Fix an incorrect NULL check on list iterator
drm/nouveau/kms/nv50-: atom: fix an incorrect NULL check on list iterator
drm/bridge: analogix_dp: Grab runtime PM reference for DP-AUX
drm/i915/dsi: fix VBT send packet port selection for ICL+
md: fix an incorrect NULL check in does_sb_need_changing
md: fix an incorrect NULL check in md_reload_sb
mtd: cfi_cmdset_0002: Move and rename chip_check/chip_ready/chip_good_for_write
mtd: cfi_cmdset_0002: Use chip_ready() for write on S29GL064N
media: coda: Fix reported H264 profile
media: coda: Add more H264 levels for CODA960
ima: remove the IMA_TEMPLATE Kconfig option
Kconfig: Add option for asm goto w/ tied outputs to workaround clang-13 bug
RDMA/hfi1: Fix potential integer multiplication overflow errors
mmc: core: Allows to override the timeout value for ioctl() path
csky: patch_text: Fixup last cpu should be master
irqchip/armada-370-xp: Do not touch Performance Counter Overflow on A375, A38x, A39x
irqchip: irq-xtensa-mx: fix initial IRQ affinity
thermal: devfreq_cooling: use local ops instead of global ops
cfg80211: declare MODULE_FIRMWARE for regulatory.db
mac80211: upgrade passive scan to active scan on DFS channels after beacon rx
um: Use asm-generic/dma-mapping.h
um: chan_user: Fix winch_tramp() return value
um: Fix out-of-bounds read in LDT setup
kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add]
ftrace: Clean up hash direct_functions on register failures
ksmbd: fix outstanding credits related bugs
iommu/msm: Fix an incorrect NULL check on list iterator
iommu/dma: Fix iova map result check bug
Revert "mm/cma.c: remove redundant cma_mutex lock"
mm/page_alloc: always attempt to allocate at least one page during bulk allocation
nodemask.h: fix compilation error with GCC12
hugetlb: fix huge_pmd_unshare address update
mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
xtensa/simdisk: fix proc_read_simdisk()
rtl818x: Prevent using not initialized queues
ASoC: rt5514: Fix event generation for "DSP Voice Wake Up" control
carl9170: tx: fix an incorrect use of list iterator
stm: ltdc: fix two incorrect NULL checks on list iterator
bcache: improve multithreaded bch_btree_check()
bcache: improve multithreaded bch_sectors_dirty_init()
bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
bcache: avoid journal no-space deadlock by reserving 1 journal bucket
serial: pch: don't overwrite xmit->buf[0] by x_char
tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator
gma500: fix an incorrect NULL check on list iterator
arm64: dts: qcom: ipq8074: fix the sleep clock frequency
arm64: tegra: Add missing DFLL reset on Tegra210
clk: tegra: Add missing reset deassertion
phy: qcom-qmp: fix struct clk leak on probe errors
ARM: dts: s5pv210: Remove spi-cs-high on panel in Aries
ARM: pxa: maybe fix gpio lookup tables
SMB3: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op
docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0
dt-bindings: gpio: altera: correct interrupt-cells
vdpasim: allow to enable a vq repeatedly
blk-iolatency: Fix inflight count imbalances and IO hangs on offline
coresight: core: Fix coresight device probe failure issue
phy: qcom-qmp: fix reset-controller leak on probe errors
net: ipa: fix page free in ipa_endpoint_trans_release()
net: ipa: fix page free in ipa_endpoint_replenish_one()
kseltest/cgroup: Make test_stress.sh work if run interactively
list: test: Add a test for list_is_head()
Revert "random: use static branch for crng_ready()"
staging: r8188eu: delete rtw_wx_read/write32()
RDMA/hns: Remove the num_cqc_timer variable
RDMA/rxe: Generate a completion for unsupported/invalid opcode
MIPS: IP27: Remove incorrect `cpu_has_fpu' override
MIPS: IP30: Remove incorrect `cpu_has_fpu' override
ext4: only allow test_dummy_encryption when supported
interconnect: qcom: sc7180: Drop IP0 interconnects
interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
fs: add two trivial lookup helpers
exportfs: support idmapped mounts
fs/ntfs3: Fix invalid free in log_replay
md: Don't set mddev private to NULL in raid0 pers->free
md: fix double free of io_acct_set bioset
md: bcache: check the return value of kzalloc() in detached_dev_do_request()
pinctrl/rockchip: support setting input-enable param
block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
Linux 5.15.46
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7b65df29c22a01b81a94cd844867a18e73098a15
|
||
|
|
20e6ec76ae |
mm, compaction: fast_find_migrateblock() should return pfn in the target zone
commit bbe832b9db2e1ad21522f8f0bf02775fff8a0e0e upstream. At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit |
||
|
|
061e34c52e |
ANDROID: mm: compaction: fix isolate_and_split_free_page() redefinition
Guard isolate_and_split_free_page() with CONFIG_COMPACTION. This fixes
the follwoing build error as the function collides with its inline stub
from the header file:
mm/compaction.c:766:15: error: redefinition of ‘isolate_and_split_free_page’
766 | unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mm/compaction.c:14:
./include/linux/compaction.h:241:29: note: previous definition of ‘isolate_and_split_free_page’ was here
241 | static inline unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Bug: 201263307
Fixes:
|
||
|
|
f47b852faa |
ANDROID: implement wrapper for reverse migration
Reverse migration is used to do the balancing the occupancy of memory zones in a node in the system whose imabalance may be caused by migration of pages to other zones by an operation, eg: hotremove and then hotadding the same memory. In this case there is a lot of free memory in newly hotadd memory which can be filled up by the previous migrated pages(as part of offline/hotremove) thus may free up some pressure in other zones of the node. Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/ Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c Bug: 201263307 Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com> |
||
|
|
2d338201d5 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on
|
||
|
|
859a85ddf9 |
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
65d759c8f9 |
mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered for every 500msec and run compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on the value set to sysctl.compaction_proactiveness. Triggering the compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on the embedded system usecases which may have few MB's of RAM. Enabling the proactive compaction in its state will endup in running almost always on such systems. Other side, proactive compaction can still be very much useful for getting a set of higher order pages in some controllable manner(controlled by using the sysctl.compaction_proactiveness). So, on systems where enabling the proactive compaction always may proove not required, can trigger the same from user space on write to its sysctl interface. As an example, say app launcher decide to launch the memory heavy application which can be launched fast if it gets more higher order pages thus launcher can prepare the system in advance by triggering the proactive compaction from userspace. This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by user. [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a [akpm@linux-foundation.org: tweak vm.rst, per Mike] Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e1e92bfa38 |
mm: compaction: optimize proactive compaction deferrals
Vlastimil Babka figured out that when fragmentation score didn't go down across the proactive compaction i.e. when no progress is made, next wake up for proactive compaction is deferred for 1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 times, with each wakeup interval of HPAGE_FRAG_CHECK_INTERVAL_MSEC(=500). In each of this wakeup, it just decrement 'proactive_defer' counter and goes sleep i.e. it is getting woken to just decrement a counter. The same deferral time can also achieved by simply doing the HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT thus unnecessary wakeup of kcompact thread is avoided thus also removes the need of 'proactive_defer' thread counter. [akpm@linux-foundation.org: tweak comment] Link: https://lore.kernel.org/linux-fsdevel/88abfdb6-2c13-b5a6-5b46-742d12d1c910@suse.cz/ Link: https://lkml.kernel.org/r/1626869599-25412-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5ac95884a7 |
mm/migrate: enable returning precise migrate_pages() success count
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71bd934101 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ... |
||
|
|
b55ca5264b |
mm/compaction: fix 'limit' in fast_isolate_freepages
Because of 'min(1, ...)', fast_isolate_freepages set 'limit' to 0 or 1.
This takes away the opportunities of find candinate pages. So, by making
enough scans available, increases the probability of finding the
appropriate freepage.
Tested it on the thpscale and the results are as follows.
5.12.0 5.12.0
valnilla patched
Amean fault-both-1 598.15 ( 0.00%) 592.56 ( 0.93%)
Amean fault-both-3 1494.47 ( 0.00%) 1514.35 ( -1.33%)
Amean fault-both-5 2519.48 ( 0.00%) 2471.76 ( 1.89%)
Amean fault-both-7 3173.85 ( 0.00%) 3079.19 ( 2.98%)
Amean fault-both-12 8063.83 ( 0.00%) 7858.29 ( 2.55%)
Amean fault-both-18 8781.20 ( 0.00%) 7827.70 * 10.86%*
Amean fault-both-24 12576.44 ( 0.00%) 12250.20 ( 2.59%)
Amean fault-both-30 18503.27 ( 0.00%) 17528.11 * 5.27%*
Amean fault-both-32 16133.69 ( 0.00%) 13874.24 * 14.00%*
5.12.0 5.12.0
vanilla patched
Ops Compaction migrate scanned 6547133.00 5963901.00
Ops Compaction free scanned 32452453.00 26609101.00
5.12 5.12
vanilla patched
Duration User 27.99 28.84
Duration System 244.08 236.76
Duration Elapsed 78.27 78.38
Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com
Fixes:
|
||
|
|
d2155fe54d |
mm: compaction: remove duplicate !list_empty(&sublist) check
The list_splice_tail(&sublist, freelist) also do !list_empty(&sublist) check, so remove the duplicate call. Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.com Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
17adb230d6 |
mm/compaction: use DEVICE_ATTR_WO macro
Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
65090f30ab |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ... |
||
|
|
a984226f45 |
mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b03fbd4ff2 |
sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org |
||
|
|
f0953a1bba |
mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
68d68ff6eb |
mm/mempool: minor coding style tweaks
Various coding style tweaks to various files under mm/ [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
361a2a229f |
mm: replace migrate_[prep|finish] with lru_cache_[disable|enable]
Currently, migrate_[prep|finish] is merely a wrapper of lru_cache_[disable|enable]. There is not much to gain from having additional abstraction. Use lru_cache_[disable|enable] instead of migrate_[prep|finish], which would be more descriptive. note: migrate_prep_local in compaction.c changed into lru_add_drain to avoid CPU schedule cost with involving many other CPUs to keep old behavior. Link: https://lkml.kernel.org/r/20210319175127.886124-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
06dac2f467 |
mm: compaction: update the COMPACT[STALL|FAIL] events properly
By definition, COMPACT[STALL|FAIL] events needs to be counted when there is 'At least in one zone compaction wasn't deferred or skipped from the direct compaction'. And when compaction is skipped or deferred, COMPACT_SKIPPED will be returned but it will still go and update these compaction events which is wrong in the sense that COMPACT[STALL|FAIL] is counted without even trying the compaction. Correct this by skipping the counting of these events when COMPACT_SKIPPED is returned for compaction. This indirectly also avoid the unnecessary try into the get_page_from_freelist() when compaction is not even tried. There is a corner case where compaction is skipped but still count COMPACTSTALL event, which is that IRQ came and freed the page and the same is captured in capture_control. Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ef49843841 |
mm/compaction: remove unused variable sysctl_compact_memory
The sysctl_compact_memory is mostly unused in mm/compaction.c It just acts as a place holder for sysctl to store .data. But the .data itself is not needed here. So we can get ride of this variable completely and make .data as NULL. This will also eliminate the extern declaration from header file. No functionality is broken or changed this way. Link: https://lkml.kernel.org/r/1614852224-14671-1-git-send-email-pintu@codeaurora.org Signed-off-by: Pintu Kumar <pintu@codeaurora.org> Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ae37c7ff79 |
mm: make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
369fa227c2 |
mm: make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contic_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.
We can do better by trying to replace such page.
Free hugepages are tricky to handle so as to no userspace application
notices disruption, we need to replace the current free hugepage with a
new one.
In order to do that, a new function called alloc_and_dissolve_huge_page is
introduced. This function will first try to get a new fresh hugepage, and
if it succeeds, it will replace the old one in the free hugepage pool.
The free page replacement is done under hugetlb_lock, so no external users
of hugetlb will notice the change. To allocate the new huge page, we use
alloc_buddy_huge_page(), so we do not have to deal with any counters, and
prep_new_huge_page() is not called. This is valulable because in case we
need to free the new page, we only need to call __free_pages().
Once we know that the page to be replaced is a genuine 0-refcounted huge
page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and
__prep_account_new_huge_page() for the new huge page to properly
initialize it and increment the hstate->nr_huge_pages counter (previously
decremented by remove_hugetlb_page()). Once done, the page is enqueued by
enqueue_huge_page() and it is ready to be used.
There is one tricky case when page's refcount is 0 because it is in the
process of being released. A missing PageHugeFreed bit will tell us that
freeing is in flight so we retry after dropping the hugetlb_lock. The
race window should be small and the next retry should make a forward
progress.
E.g:
CPU0 CPU1
free_huge_page() isolate_or_dissolve_huge_page
PageHuge() == T
alloc_and_dissolve_huge_page
alloc_buddy_huge_page()
spin_lock_irq(hugetlb_lock)
// PageHuge() && !PageHugeFreed &&
// !PageCount()
spin_unlock_irq(hugetlb_lock)
spin_lock_irq(hugetlb_lock)
1) update_and_free_page
PageHuge() == F
__free_pages()
2) enqueue_huge_page
SetPageHugeFreed()
spin_unlock_irq(&hugetlb_lock)
spin_lock_irq(hugetlb_lock)
1) PageHuge() == F (freed by case#1 from CPU0)
2) PageHuge() == T
PageHugeFreed() == T
- proceed with replacing the page
In the case above we retry as the window race is quite small and we have
high chances to succeed next time.
With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c2ad7a1ffe |
mm,compaction: let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn
== 0 vs pfn != 0 scheme to let the caller know whether there was any error
during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.
Let us rework isolate_migratepages_{range,block} so we can report error
codes. Since isolate_migratepages_block will stop returning the next pfn
to be scanned, we reuse the cc->migrate_pfn field to keep track of that.
Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
6e2b7044c1 |
mm, compaction: make fast_isolate_freepages() stay within zone
Compaction always operates on pages from a single given zone when
isolating both pages to migrate and freepages. Pageblock boundaries are
intersected with zone boundaries to be safe in case zone starts or ends in
the middle of pageblock. The use of pageblock_pfn_to_page() protects
against non-contiguous pageblocks.
The functions fast_isolate_freepages() and fast_isolate_around() don't
currently protect the fast freepage isolation thoroughly enough against
these corner cases, and can result in freepage isolation operate outside
of zone boundaries:
- in fast_isolate_freepages() if we get a pfn from the first pageblock
of a zone that starts in the middle of that pageblock, 'highest' can
be a pfn outside of the zone.
If we fail to isolate anything in this function, we may then call
fast_isolate_around() on a pfn outside of the zone and there
effectively do a set_pageblock_skip(page_to_pfn(highest)) which may
currently hit a VM_BUG_ON() in some configurations
- fast_isolate_around() checks only the zone end boundary and not
beginning, nor that the pageblock is contiguous (with
pageblock_pfn_to_page()) so it's possible that we end up calling
isolate_freepages_block() on a range of pfn's from two different
zones and end up e.g. isolating freepages under the wrong zone's
lock.
This patch should fix the above issues.
Link: https://lkml.kernel.org/r/20210217173300.6394-1-vbabka@suse.cz
Fixes:
|
||
|
|
15d28d0d11 |
mm/compaction: fix misbehaviors of fast_find_migrateblock()
In the fast_find_migrateblock(), it iterates ocer the freelist to find the
proper pageblock. But there are some misbehaviors.
First, if the page we found is equal to cc->migrate_pfn, it is considered
that we didn't find a suitable pageblock. Secondly, if the loop was
terminated because order is less than PAGE_ALLOC_COSTLY_ORDER, it could be
considered that we found a suitable one. Thirdly, if the skip bit is set
on the page block and we goto continue, it doesn't check nr_scanned.
Fourthly, if the page block's skip bit is set, it checks that page block
is the last of list, which is unnecessary.
Link: https://lkml.kernel.org/r/20210128130411.6125-1-vvghjk1234@gmail.com
Fixes:
|
||
|
|
40d7e20320 |
mm/compaction: correct deferral logic for proactive compaction
should_proactive_compact_node() returns true when sum of the weighted fragmentation score of all the zones in the node is greater than the wmark_high of compaction, which then triggers the proactive compaction that operates on the individual zones of the node. But proactive compaction runs on the zone only when its weighted fragmentation score is greater than wmark_low(=wmark_high - 10). This means that the sum of the weighted fragmentation scores of all the zones can exceed the wmark_high but individual weighted fragmentation zone scores can still be less than wmark_low which makes the unnecessary trigger of the proactive compaction only to return doing nothing. Issue with the return of proactive compaction with out even trying is its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores across the proactive compaction is same, thinking that compaction didn't make any progress but in reality it didn't even try. With the delay between successive retries for proactive compaction is 500msec, it can result into the deferral for ~30sec with out even trying the proactive compaction. Test scenario is that: compaction_proactiveness=50 thus the wmark_low = 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with sizes 5GB and 6GB respectively. After opening some apps on the android, the weighted fragmentation scores of these zones are 47 and 49 respectively. Since the sum of these fragmentation scores are above the wmark_high which triggers the proactive compaction and there since the individual zones weighted fragmentation scores are below wmark_low, it returns without trying the proactive compaction. As a result the weighted fragmentation scores of the zones are still 47 and 49 which makes the existing logic to defer the compaction thinking that noprogress is made across the compaction. Fix this by checking just zone fragmentation score, not the weighted, in __compact_finished() and use the zones weighted fragmentation score in fragmentation_score_node(). In the test case above, If the weighted average of is above wmark_high, then individual score (not adjusted) of atleast one zone has to be above wmark_high. Thus it avoids the unnecessary trigger and deferrals of the proactive compaction. Link: https://lkml.kernel.org/r/1610989938-31374-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nitin Gupta <ngupta@nitingupta.dev> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
e2d26aa5fb |
mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked
The VM_BUG_ON_PAGE(!PageLocked(page), page) is also done in PageMovable. Remove this explicitly one. Link: https://lkml.kernel.org/r/20210109081420.46030-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d99fd5feb0 |
mm/compaction: remove rcu_read_lock during page compaction
isolate_migratepages_block() used rcu_read_lock() with the intention of safeguarding against the mem_cgroup being destroyed concurrently; but its TestClearPageLRU already protects against that. Delete the unnecessary rcu_read_lock() and _unlock(). Hugh Dickins helped on commit log polishing, Thanks! Link: https://lkml.kernel.org/r/1608614453-10739-3-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
46ae6b2cc2 |
mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()
The parameter is redundant in the sense that it can be potentially extracted from the "struct page" parameter by page_lru(). We need to make sure that existing PageActive() or PageUnevictable() remains until the function returns. A few places don't conform, and simple reordering fixes them. This patch may have left page_off_lru() seemingly odd, and we'll take care of it in the next patch. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
c2135f7c57 |
mm/vmscan: __isolate_lru_page_prepare() cleanup
The function just returns 2 results, so using a 'switch' to deal with its result is unnecessary. Also simplify it to a bool func as Vlastimil suggested. Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's suggestion to update comments in function. Link: https://lkml.kernel.org/r/728874d7-2d93-4049-68c1-dcc3b2d52ccd@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
74e21484e4 |
mm, compaction: move high_pfn to the for loop scope
In fast_isolate_freepages, high_pfn will be used if a prefered one (ie
PFN >= low_fn) not found.
But the high_pfn is not reset before searching an free area, so when it
was used as freepage, it may from another free area searched before. As
a result move_freelist_head(freelist, freepage) will have unexpected
behavior (eg corrupt the MOVABLE freelist)
Unable to handle kernel paging request at virtual address dead000000000200
Mem abort info:
ESR = 0x96000044
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000044
CM = 0, WnR = 1
[dead000000000200] address between user and kernel address ranges
-000|list_cut_before(inline)
-000|move_freelist_head(inline)
-000|fast_isolate_freepages(inline)
-000|isolate_freepages(inline)
-000|compaction_alloc(?, ?)
-001|unmap_and_move(inline)
-001|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p
-002|__read_once_size(inline)
-002|static_key_count(inline)
-002|static_key_false(inline)
-002|trace_mm_compaction_migratepages(inline)
-002|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0)
-003|kcompactd_do_work(inline)
-003|kcompactd([X19] p = 0xFFFFFF93227FBC40)
-004|kthread([X20] _create = 0xFFFFFFE1AFB26380)
-005|ret_from_fork(asm)
The issue was reported on an smart phone product with 6GB ram and 3GB
zram as swap device.
This patch fixes the issue by reset high_pfn before searching each free
area, which ensure freepage and freelist match when call
move_freelist_head in fast_isolate_freepages().
Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com
Fixes:
|
||
|
|
6168d0da2b |
mm/lru: replace pgdat lru_lock with lruvec lock
This patch moves per node lru_lock into lruvec, thus bring a lru_lock for each of memcg per node. So on a large machine, each of memcg don't have to suffer from per node pgdat->lru_lock competition. They could go fast with their self lru_lock. After move memcg charge before lru inserting, page isolation could serialize page's memcg, then per memcg lruvec lock is stable and could replace per node lru lock. In isolate_migratepages_block(), compact_unlock_should_abort and lock_page_lruvec_irqsave are open coded to work with compact_control. Also add a debug func in locking which may give some clues if there are sth out of hands. Daniel Jordan's testing show 62% improvement on modified readtwice case on his 2P * 10 core * 2 HT broadwell box. https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/ Hugh Dickins helped on the patch polish, thanks! [alex.shi@linux.alibaba.com: fix comment typo] Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com [alex.shi@linux.alibaba.com: use page_memcg()] Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rong Chen <rong.a.chen@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9df4131439 |
mm/compaction: do page isolation first in compaction
Currently, compaction would get the lru_lock and then do page isolation which works fine with pgdat->lru_lock, since any page isoltion would compete for the lru_lock. If we want to change to memcg lru_lock, we have to isolate the page before getting lru_lock, thus isoltion would block page's memcg change which relay on page isoltion too. Then we could safely use per memcg lru_lock later. The new page isolation use previous introduced TestClearPageLRU() + pgdat lru locking which will be changed to memcg lru lock later. Hugh Dickins <hughd@google.com> fixed following bugs in this patch's early version: Fix lots of crashes under compaction load: isolate_migratepages_block() must clean up appropriately when rejecting a page, setting PageLRU again if it had been cleared; and a put_page() after get_page_unless_zero() cannot safely be done while holding locked_lruvec - it may turn out to be the final put_page(), which will take an lruvec lock when PageLRU. And move __isolate_lru_page_prepare back after get_page_unless_zero to make trylock_page() safe: trylock_page() is not safe to use at this time: its setting PG_locked can race with the page being freed or allocated ("Bad page"), and can also erase flags being set by one of those "sole owners" of a freshly allocated page who use non-atomic __SetPageFlag(). Link: https://lkml.kernel.org/r/1604566549-62481-16-git-send-email-alex.shi@linux.alibaba.com Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2271b016bf |
mm/compaction: make defer_compaction and compaction_deferred static
defer_compaction() and compaction_deferred() and compaction_restarting() in mm/compaction.c won't be used in other files, so make them static, and remove the declaration in the header file. Take the chance to fix a typo. Link: https://lkml.kernel.org/r/20201123170801.GA9625@rlk Signed-off-by: Hui Su <sh_def@163.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Baoquan He <bhe@redhat.com> Cc: Mateusz Nosek <mateusznosek0@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2b1a20c3af |
mm/compaction: move compaction_suitable's comment to right place
Since commit
|
||
|
|
19d3cf9de1 |
mm/compaction: rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()
There are two 'start_pfn' declared in compact_zone() which have different meanings. Rename the second one to 'iteration_start_pfn' to prevent confusion. Also, remove an useless semicolon. Link: https://lkml.kernel.org/r/20201019115044.1571-1-yanfei.xu@windriver.com Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d20bdd571e |
mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have
without wasting time on isolating.
In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
immediately prevents that.
[vbabka@suse.cz: changelog addition]
Fixes:
|
||
|
|
38935861d8 |
mm/compaction: count pages and stop correctly during page isolation
In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages. But nr_migratepages and nr_isolated did
not count compound pages correctly, causing us to isolate more pages
than we thought.
So count compound pages as the number of base pages they contain.
Otherwise, we might be trapped in too_many_isolated while loop, since
the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.
In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended. Change the isolation
stop condition to '>='.
The issue can be triggered as follows:
In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
get stuck (looping in too_many_isolated function) until we kill either
task. With the patch applied, oom will kill the application with 10GB
THPs and let hugetlb page reservation finish.
[ziy@nvidia.com: v3]
Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Fixes:
|
||
|
|
ab130f9108 |
mm: rename page_order() to buddy_order()
The current page_order() can only be called on pages in the buddy allocator. For compound pages, you have to use compound_order(). This is confusing and led to a bug, so rename page_order() to buddy_order(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
62b35fe0eb |
mm/compaction.c: micro-optimization remove unnecessary branch
The same code can work both for 'zone->compact_considered > defer_limit' and 'zone->compact_considered >= defer_limit'. In the latter there is one branch less which is more effective considering performance. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
6c357848b4 |
mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a1c1dbeb2e |
mm/compaction.c: delete duplicated word
Drop the repeated word "a". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
860b32729a |
mm/compaction: correct the comments of compact_defer_shift
There is no compact_defer_limit. It should be compact_defer_shift in use. and add compact_order_failed explanation. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d34c0a7599 |
mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is always in range [0, 100], so use unsigned type of these scores as well as for related constants. Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
25788738eb |
mm: fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined is to check for CONFIG_HUGETLBFS. Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Nitin Gupta <nigupta@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
facdaa917c |
mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b9e20f0da1 |
mm, compaction: make capture control handling safe wrt interrupts
Hugh reports:
"While stressing compaction, one run oopsed on NULL capc->cc in
__free_one_page()'s task_capc(zone): compact_zone_order() had been
interrupted, and a page was being freed in the return from interrupt.
Though you would not expect it from the source, both gccs I was using
(4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
callq compact_zone - long after the "current->capture_control =
&capc". An interrupt in between those finds capc->cc NULL (zeroed by
an earlier rep stos).
This could presumably be fixed by a barrier() before setting
current->capture_control in compact_zone_order(); but would also need
more care on return from compact_zone(), in order not to risk leaking
a page captured by interrupt just before capture_control is reset.
Maybe that is the preferable fix, but I felt safer for task_capc() to
exclude the rather surprising possibility of capture at interrupt
time"
I have checked that gcc10 also behaves the same.
The advantage of fix in compact_zone_order() is that we don't add
another test in the page freeing hot path, and that it might prevent
future problems if we stop exposing pointers to uninitialized structures
in current task.
So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.
Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes:
|
||
|
|
f386775510 |
mm/compaction: fix a typo in comment "pessemistic"->"pessimistic"
There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200411070307.16021-1-ethp@qq.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ee01c4d72a |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ... |