889a0c39fef31949d4adad5816ea6445cf1aadb8
514 Commits
04bb2779c9
ANDROID: vendor_hooks: add a field in pglist_data
Add a pglist_data field to record additional node parameters.
Bug: 192052083
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I3d764ab298c71ab9aba245867ee529045551aef4
(cherry picked from commit
98042d19ad
ANDROID: GKI: mm: add Android ABI padding to some structures
Try to mitigate potential future driver core API changes by adding padding to struct vm_area_struct and struct zone. Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel.
Leaf changes summary: 3 artifacts changed
Changed leaf types summary: 3 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable
'struct vm_area_struct at mm_types.h:292:1' changed:
type size changed from 1472 to 1728 (in bits)
4 data member insertions:
'u64 vm_area_struct::android_kabi_reserved1', at offset 1472 (in bits) at mm_types.h:365:1
'u64 vm_area_struct::android_kabi_reserved2', at offset 1536 (in bits) at mm_types.h:366:1
'u64 vm_area_struct::android_kabi_reserved3', at offset 1600 (in bits) at mm_types.h:367:1
'u64 vm_area_struct::android_kabi_reserved4', at offset 1664 (in bits) at mm_types.h:368:1
1435 impacted interfaces:
'struct zone at mmzone.h:420:1' changed:
type size changed from 12800 to 13312 (in bits)
4 data member insertions:
'u64 zone::android_kabi_reserved1', at offset 12672 (in bits) at mmzone.h:569:1
'u64 zone::android_kabi_reserved2', at offset 12736 (in bits) at mmzone.h:570:1
'u64 zone::android_kabi_reserved3', at offset 12800 (in bits) at mmzone.h:571:1
'u64 zone::android_kabi_reserved4', at offset 12864 (in bits) at mmzone.h:572:1
624 impacted interfaces:
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I81702aa833f419928e0e32e9609722b98592c171
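For context, the android_kabi_reserved fields listed in the abidiff output above boil down to spare u64 slots appended to the struct. A minimal sketch of the technique, assuming the ANDROID_KABI_RESERVE macro from the Android common kernel expands to a named u64 as shown (illustration only, not the kernel headers):
typedef unsigned long long u64;            /* stand-in for the kernel's u64 */
#define _ANDROID_KABI_RESERVE(n)  u64 android_kabi_reserved##n
#define ANDROID_KABI_RESERVE(n)   _ANDROID_KABI_RESERVE(n)
struct example_struct {
	unsigned long existing_field;
	/* spare slots frozen into the ABI now; a later change can reuse one
	 * (e.g. via an ANDROID_KABI_USE-style union) without growing the
	 * struct again and breaking the frozen ABI */
	ANDROID_KABI_RESERVE(1);
	ANDROID_KABI_RESERVE(2);
	ANDROID_KABI_RESERVE(3);
	ANDROID_KABI_RESERVE(4);   /* 4 x u64 == the 256-bit size growth abidiff reports */
};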
dcac70709f
ANDROID: add vendor fields to lruvec to record refault stats
struct lruvec :: ANDROID_VENDOR_DATA(1) is a pointer to a struct that records the following information:
1) the count of workingset_restore pages for cached anonymous and file pages
This is used to adjust the strategy and amount of data reclaimed.
Bug: 225795494
Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996
Signed-off-by: Bing Han <bing.han@transsion.com>
(cherry picked from commit 1b14ae01b09dfde89da470cac8415cefaca824fb)
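A hedged sketch of what the vendor pointer could reference, based only on the description above; the struct name and members here are illustrative, not the actual vendor code:
/* Illustrative only: the commit says the ANDROID_VENDOR_DATA(1) field in
 * struct lruvec points at "a struct" tracking workingset restores per type. */
struct vendor_lruvec_stats {
	unsigned long workingset_restore_anon;  /* refaulted-back anonymous pages */
	unsigned long workingset_restore_file;  /* refaulted-back file pages */
};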
33f5d1daec
Merge 5.15.34 into android13-5.15
Changes in 5.15.34
lib/logic_iomem: correct fallback config references
um: fix and optimize xor select template for CONFIG64 and timetravel mode
rtc: wm8350: Handle error for wm8350_register_irq
nbd: add error handling support for add_disk()
nbd: Fix incorrect error handle when first_minor is illegal in nbd_dev_add
nbd: Fix hungtask when nbd_config_put
nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
kfence: count unexpectedly skipped allocations
kfence: move saving stack trace of allocations into __kfence_alloc()
kfence: limit currently covered allocations when pool nearly full
KVM: x86/pmu: Use different raw event masks for AMD and Intel
KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
drm: Add orientation quirk for GPD Win Max
ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111
drm/amd/display: Add signal type check when verify stream backends same
drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj
drm/amd/display: Fix memory leak
drm/amd/display: Use PSR version selected during set_psr_caps
usb: gadget: tegra-xudc: Do not program SPARAM
usb: gadget: tegra-xudc: Fix control endpoint's definitions
usb: cdnsp: fix cdnsp_decode_trb function to properly handle ret value
ptp: replace snprintf with sysfs_emit
drm/amdkfd: Don't take process mutex for svm ioctls
powerpc: dts: t104xrdb: fix phy type for FMAN 4/5
ath11k: fix kernel panic during unload/load ath11k modules
ath11k: pci: fix crash on suspend if board file is not found
ath11k: mhi: use mhi_sync_power_up()
net/smc: Send directly when TCP_CORK is cleared
drm/bridge: Add missing pm_runtime_put_sync
bpf: Make dst_port field in struct bpf_sock 16-bit wide
scsi: mvsas: Replace snprintf() with sysfs_emit()
scsi: bfa: Replace snprintf() with sysfs_emit()
drm/v3d: fix missing unlock
power: supply: axp20x_battery: properly report current when discharging
mt76: mt7921: fix crash when startup fails.
mt76: dma: initialize skip_unmap in mt76_dma_rx_fill
cfg80211: don't add non transmitted BSS to 6GHz scanned channels
libbpf: Fix build issue with llvm-readelf
ipv6: make mc_forwarding atomic
net: initialize init_net earlier
powerpc: Set crashkernel offset to mid of RMA region
drm/amdgpu: Fix recursive locking warning
scsi: smartpqi: Fix kdump issue when controller is locked up
PCI: aardvark: Fix support for MSI interrupts
iommu/arm-smmu-v3: fix event handling soft lockup
usb: ehci: add pci device support for Aspeed platforms
PCI: endpoint: Fix alignment fault error in copy tests
tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
PCI: pciehp: Add Qualcomm quirk for Command Completed erratum
scsi: mpi3mr: Fix reporting of actual data transfer size
scsi: mpi3mr: Fix memory leaks
powerpc/set_memory: Avoid spinlock recursion in change_page_attr()
power: supply: axp288-charger: Set Vhold to 4.4V
net/mlx5e: Disable TX queues before registering the netdev
usb: dwc3: pci: Set the swnode from inside dwc3_pci_quirks()
iwlwifi: mvm: Correctly set fragmented EBS
iwlwifi: mvm: move only to an enabled channel
drm/msm/dsi: Remove spurious IRQF_ONESHOT flag
ipv4: Invalidate neighbour for broadcast address upon address addition
dm ioctl: prevent potential spectre v1 gadget
dm: requeue IO if mapping table not yet available
drm/amdkfd: make CRAT table missing message informational only
vfio/pci: Stub vfio_pci_vga_rw when !CONFIG_VFIO_PCI_VGA
scsi: pm8001: Fix pm80xx_pci_mem_copy() interface
scsi: pm8001: Fix pm8001_mpi_task_abort_resp()
scsi: pm8001: Fix task leak in pm8001_send_abort_all()
scsi: pm8001: Fix tag leaks on error
scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req()
mt76: mt7915: fix injected MPDU transmission to not use HW A-MSDU
powerpc/64s/hash: Make hash faults work in NMI context
mt76: mt7615: Fix assigning negative values to unsigned variable
scsi: aha152x: Fix aha152x_setup() __setup handler return value
scsi: hisi_sas: Free irq vectors in order for v3 HW
scsi: hisi_sas: Limit users changing debugfs BIST count value
net/smc: correct settings of RMB window update limit
mips: ralink: fix a refcount leak in ill_acc_of_setup()
macvtap: advertise link netns via netlink
tuntap: add sanity checks about msg_controllen in sendmsg
Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
Bluetooth: use memset avoid memory leaks
bnxt_en: Eliminate unintended link toggle during FW reset
PCI: endpoint: Fix misused goto label
MIPS: fix fortify panic when copying asm exception handlers
powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
powerpc/secvar: fix refcount leak in format_show()
scsi: libfc: Fix use after free in fc_exch_abts_resp()
can: isotp: set default value for N_As to 50 micro seconds
can: etas_es58x: es58x_fd_rx_event_msg(): initialize rx_event_msg before calling es58x_check_msg_len()
riscv: Fixed misaligned memory access. Fixed pointer comparison.
net: account alternate interface name memory
net: limit altnames to 64k total
net/mlx5e: Remove overzealous validations in netlink EEPROM query
net: sfp: add 2500base-X quirk for Lantech SFP module
usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm
mt76: fix monitor mode crash with sdio driver
xtensa: fix DTC warning unit_address_format
MIPS: ingenic: correct unit node address
Bluetooth: Fix use after free in hci_send_acl
netfilter: conntrack: revisit gc autotuning
netlabel: fix out-of-bounds memory accesses
ceph: fix inode reference leakage in ceph_get_snapdir()
ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
lib/Kconfig.debug: add ARCH dependency for FUNCTION_ALIGN option
init/main.c: return 1 from handled __setup() functions
minix: fix bug when opening a file with O_DIRECT
clk: si5341: fix reported clk_rate when output divider is 2
staging: vchiq_arm: Avoid NULL ptr deref in vchiq_dump_platform_instances
staging: vchiq_core: handle NULL result of find_service_by_handle
phy: amlogic: phy-meson-gxl-usb2: fix shared reset controller use
phy: amlogic: meson8b-usb2: Use dev_err_probe()
phy: amlogic: meson8b-usb2: fix shared reset control use
clk: rockchip: drop CLK_SET_RATE_PARENT from dclk_vop* on rk3568
cpufreq: CPPC: Fix performance/frequency conversion
opp: Expose of-node's name in debugfs
staging: wfx: fix an error handling in wfx_init_common()
w1: w1_therm: fixes w1_seq for ds28ea00 sensors
NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
NFSv4: Protect the state recovery thread against direct reclaim
habanalabs: fix possible memory leak in MMU DR fini
xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
clk: ti: Preserve node in ti_dt_clocks_register()
clk: Enforce that disjoints limits are invalid
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: swap IO handling is slightly different for O_DIRECT IO
NFS: swap-out must always use STABLE writes.
x86: Annotate call_on_stack()
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
serial: samsung_tty: do not unlock port->lock for uart_write_wakeup()
virtio_console: eliminate anonymous module_init & module_exit
jfs: prevent NULL deref in diFree
SUNRPC: Fix socket waits for write buffer space
NFS: nfsiod should not block forever in mempool_alloc()
NFS: Avoid writeback threads getting stuck in mempool_alloc()
selftests: net: Add tls config dependency for tls selftests
parisc: Fix CPU affinity for Lasi, WAX and Dino chips
parisc: Fix patch code locking and flushing
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
rtc: mc146818-lib: change return values of mc146818_get_time()
rtc: Check return value from mc146818_get_time()
rtc: mc146818-lib: fix RTC presence check
drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire()
Drivers: hv: vmbus: Fix potential crash on module unload
Revert "NFSv4: Handle the special Linux file open access mode"
NFSv4: fix open failure with O_ACCMODE flag
scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling
scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map()
scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
vdpa/mlx5: Rename control VQ workqueue to vdpa wq
vdpa/mlx5: Propagate link status from device to vdpa driver
vdpa: mlx5: prevent cvq work from hogging CPU
net: sfc: add missing xdp queue reinitialization
net/tls: fix slab-out-of-bounds bug in decrypt_internal
vrf: fix packet sniffing for traffic originating from ip tunnels
skbuff: fix coalescing for page_pool fragment recycling
ice: Clear default forwarding VSI during VSI release
mctp: Fix check for dev_hard_header() result
net: ipv4: fix route with nexthop object delete warning
net: stmmac: Fix unset max_speed difference between DT and non-DT platforms
drm/imx: imx-ldb: Check for null pointer after calling kmemdup
drm/imx: Fix memory leak in imx_pd_connector_get_modes
drm/imx: dw_hdmi-imx: Fix bailout in error cases of probe
regulator: rtq2134: Fix missing active_discharge_on setting
regulator: atc260x: Fix missing active_discharge_on setting
arch/arm64: Fix topology initialization for core scheduling
bnxt_en: Synchronize tx when xdp redirects happen on same ring
bnxt_en: reserve space inside receive page for skb_shared_info
bnxt_en: Prevent XDP redirect from running when stopping TX queue
sfc: Do not free an empty page_ring
RDMA/mlx5: Don't remove cache MRs when a delay is needed
RDMA/mlx5: Add a missing update of cache->last_add
IB/cm: Cancel mad on the DREQ event when the state is MRA_REP_RCVD
IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition
sctp: count singleton chunks in assoc user stats
dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe
ice: Set txq_teid to ICE_INVAL_TEID on ring creation
ice: Do not skip not enabled queues in ice_vc_dis_qs_msg
ipv6: Fix stats accounting in ip6_pkt_drop
ice: synchronize_rcu() when terminating rings
ice: xsk: fix VSI state check in ice_xsk_wakeup()
net: openvswitch: don't send internal clone attribute to the userspace.
net: ethernet: mv643xx: Fix over zealous checking of_get_mac_address()
net: openvswitch: fix leak of nested actions
rxrpc: fix a race in rxrpc_exit_net()
net: sfc: fix using uninitialized xdp tx_queue
net: phy: mscc-miim: reject clause 45 register accesses
qede: confirm skb is allocated before using
spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op()
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
drbd: Fix five use after free bugs in get_initial_state
scsi: ufs: ufshpb: Fix a NULL check on list iterator
io_uring: nospec index for tags on files update
io_uring: don't touch scm_fp_list after queueing skb
SUNRPC: Handle ENOMEM in call_transmit_status()
SUNRPC: Handle low memory situations in call_status()
SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
iommu/omap: Fix regression in probe for NULL pointer dereference
perf: arm-spe: Fix perf report --mem-mode
perf tools: Fix perf's libperf_print callback
perf session: Remap buf if there is no space for event
arm64: Add part number for Arm Cortex-A78AE
scsi: mpt3sas: Fix use after free in _scsih_expander_node_remove()
scsi: ufs: ufs-pci: Add support for Intel MTL
Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning"
mmc: block: Check for errors after write on SPI
mmc: mmci: stm32: correctly check all elements of sg list
mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete
mmc: core: Fixup support for writeback-cache for eMMC and SD
lz4: fix LZ4_decompress_safe_partial read out of bound
highmem: fix checks in __kmap_local_sched_{in,out}
mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
mm/mempolicy: fix mpol_new leak in shared_policy_replace
io_uring: don't check req->file in io_fsync_prep()
io_uring: defer splice/tee file validity check until command issue
io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFF
io_uring: fix race between timeout flush and removal
x86/pm: Save the MSR validity status at context setup
x86/speculation: Restore speculation related MSRs during S3 resume
perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
btrfs: fix qgroup reserve overflow the qgroup limit
btrfs: prevent subvol with swapfile from being deleted
spi: core: add dma_map_dev for __spi_unmap_msg()
arm64: patch_text: Fixup last cpu should be master
RDMA/hfi1: Fix use-after-free bug for mm struct
gpio: Restrict usage of GPIO chip irq members before initialization
x86/msi: Fix msi message data shadow struct
x86/mm/tlb: Revert retpoline avoidance approach
perf/x86/intel: Don't extend the pseudo-encoding to GP counters
ata: sata_dwc_460ex: Fix crash due to OOB write
perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator
perf/core: Inherit event_caps
irqchip/gic-v3: Fix GICR_CTLR.RWP polling
fbdev: Fix unregistering of framebuffers without device
amd/display: set backlight only if required
SUNRPC: Prevent immediate close+reconnect
drm/panel: ili9341: fix optional regulator handling
drm/amdgpu/display: change pipe policy for DCN 2.1
drm/amdgpu/smu10: fix SoC/fclk units in auto mode
drm/amdgpu/vcn: Fix the register setting for vcn1
drm/nouveau/pmu: Add missing callbacks for Tegra devices
drm/amdkfd: Create file descriptor after client is added to smi_clients list
drm/amdgpu: don't use BACO for reset in S3
KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255
net/smc: send directly on setting TCP_NODELAY
Revert "selftests: net: Add tls config dependency for tls selftests"
bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
selftests/bpf: Fix u8 narrow load checks for bpf_sk_lookup remote_port
rtc: mc146818-lib: fix signedness bug in mc146818_get_time()
SUNRPC: Don't call connect() more than once on a TCP socket
Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
perf python: Fix probing for some clang command line options
tools build: Filter out options and warnings not supported by clang
tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error"
KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
Revert "net/mlx5: Accept devlink user input after driver initialization complete"
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
selftests: cgroup: Test open-time credential usage for migration checks
selftests: cgroup: Test open-time cgroup namespace usage for migration checks
mm: don't skip swap entry even if zap_details specified
Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
x86/bug: Prevent shadowing in __WARN_FLAGS
sched: Teach the forced-newidle balancer about CPU affinity limitation.
x86,static_call: Fix __static_call_return0 for i386
irqchip/gic-v4: Wait for GICR_VPENDBASER.Dirty to clear before descheduling
powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.S
irqchip/gic, gic-v3: Prevent GSI to SGI translations
mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
static_call: Don't make __static_call_return0 static
powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
stacktrace: move filter_irq_stacks() to kernel/stacktrace.c
Linux 5.15.34
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I98049d0d8ebd427296418d31085bfde482ad30e7
e8507816d1
FROMLIST: mm: multi-gen LRU: thrashing prevention
Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as requested by many desktop users [1].
When set to value N, it prevents the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable lags due to thrashing. Larger values like N=3000 make lags less noticeable at the risk of premature OOM kills.
Compared with the size-based approach, e.g., [2], this time-based approach has the following advantages:
1. It is easier to configure because it is agnostic to applications and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.
[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
Link: https://lore.kernel.org/lkml/20220309021230.721028-12-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I482d33f3beaf7723d2f3eeaaa5b4f12bcb9b48a1
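A hypothetical example of setting the knob from C rather than the shell (the sysfs path comes from the commit above; the usual way is simply echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
	int fd = open("/sys/kernel/mm/lru_gen/min_ttl_ms", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* protect roughly the last second of working set, per the rationale above */
	if (write(fd, "1000", 4) != 4)
		perror("write");
	close(fd);
	return 0;
}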
76f7f07cbf
FROMLIST: mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1] which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree also have
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.
[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/
Link: https://lore.kernel.org/lkml/20220309021230.721028-11-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03
5280d76d38
FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-7: 1014393.57 39455.42
patch1-8: 1078507.59 41949.15
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
patch1-8
47.02% lzo1x_1_do_compress (real work)
6.73% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
3.39% walk_pte_range
2.63% ptep_clear_flush
2.29% __zram_bvec_write
2.10% do_raw_spin_lock
1.81% memmove
1.73% obj_malloc
1.53% free_unref_page_list
Configurations:
no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/lkml/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
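As an aside, a toy model of the "generational Bloom filter" idea mentioned in optimization 2 above might look like the following. This is purely illustrative and is not the kernel's data structure: it keeps two filters, queries the current one, repopulates the other during a walk, and swaps them at the end of an iteration so stale entries age out.
#include <stdint.h>
#include <string.h>
#define BLOOM_BITS   (1u << 15)          /* 32768 bits per filter */
#define BLOOM_WORDS  (BLOOM_BITS / 64)
struct gen_bloom {
	uint64_t bits[2][BLOOM_WORDS];       /* two generations of filters */
	unsigned int cur;                    /* index of the filter being queried */
};
static uint64_t hash64(uint64_t x, uint64_t seed)
{
	x ^= seed;
	x *= 0x9e3779b97f4a7c15ULL;          /* simple mix; any decent hash works */
	x ^= x >> 32;
	return x;
}
/* walkers record a populated branch in the filter being rebuilt */
static void bloom_set(struct gen_bloom *b, uint64_t key)
{
	unsigned int h1 = hash64(key, 1) % BLOOM_BITS;
	unsigned int h2 = hash64(key, 2) % BLOOM_BITS;
	uint64_t *next = b->bits[b->cur ^ 1];
	next[h1 / 64] |= 1ULL << (h1 % 64);
	next[h2 / 64] |= 1ULL << (h2 % 64);
}
/* walkers query the current filter to skip branches that were never recorded */
static int bloom_test(const struct gen_bloom *b, uint64_t key)
{
	unsigned int h1 = hash64(key, 1) % BLOOM_BITS;
	unsigned int h2 = hash64(key, 2) % BLOOM_BITS;
	const uint64_t *f = b->bits[b->cur];
	return (f[h1 / 64] >> (h1 % 64)) & (f[h2 / 64] >> (h2 % 64)) & 1;
}
/* at the end of an iteration the freshly populated filter becomes the
 * queried one, and the old one is cleared for reuse */
static void bloom_flip(struct gen_bloom *b)
{
	b->cur ^= 1;
	memset(b->bits[b->cur ^ 1], 0, sizeof(b->bits[0]));
}
int main(void)
{
	struct gen_bloom b = { .cur = 0 };
	bloom_set(&b, 0x7f0000);          /* record a branch during this iteration */
	bloom_flip(&b);                   /* generations swap at iteration end */
	return !bloom_test(&b, 0x7f0000); /* now visible to the query side */
}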
afd94c9ef9
FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.
This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[4, 6]%
Ops/sec KB/sec
patch1-6: 964656.80 37520.88
patch1-7: 1014393.57 39455.42
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
Configurations:
no change
Link: https://lore.kernel.org/lkml/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
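A hypothetical userspace model of the look-around idea described above: when one young entry is found, scan up to BITS_PER_LONG-1 neighbours in the same table, clear their accessed bits too, and return a bitmap of which ones were young so they can be aged without extra rmap trips. Names and types here are illustrative, not the kernel's.
#include <stdint.h>
#define BITS_PER_LONG 64
struct fake_pte {
	unsigned int accessed : 1;   /* stand-in for the hardware accessed bit */
};
/* Scan at most BITS_PER_LONG-1 entries around 'found', clearing the accessed
 * bit of every young neighbour; bit i of the result marks ptes[start + i]. */
static uint64_t look_around(struct fake_pte *ptes, unsigned long nr,
			    unsigned long found)
{
	unsigned long start = found > (BITS_PER_LONG - 1) / 2
			      ? found - (BITS_PER_LONG - 1) / 2 : 0;
	unsigned long end = start + (BITS_PER_LONG - 1);
	uint64_t young = 0;
	unsigned long i;
	if (end > nr)
		end = nr;
	for (i = start; i < end; i++) {
		if (!ptes[i].accessed)
			continue;
		ptes[i].accessed = 0;         /* test-and-clear in one pass */
		young |= 1ULL << (i - start); /* remember it was young */
	}
	return young;
}
int main(void)
{
	struct fake_pte ptes[128] = { 0 };
	ptes[40].accessed = 1;   /* the PTE the rmap walk found */
	ptes[42].accessed = 1;   /* a nearby young PTE picked up for free */
	return look_around(ptes, 128, 40) == 0;   /* expect a non-zero bitmap */
}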
a1537a68c5
FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[38, 40]%
IOPS BW
5.18-ed4643521e6a: 2547k 9989MiB/s
patch1-6: 3540k 13.5GiB/s
Single workload:
memcached (anon): +[103, 107]%
Ops/sec KB/sec
5.18-ed4643521e6a: 469048.66 18243.91
patch1-6: 964656.80 37520.88
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-ed4643521e6a
39.56% page_vma_mapped_walk
19.32% lzo1x_1_do_compress (real work)
7.18% do_raw_spin_lock
4.23% _raw_spin_unlock_irq
2.26% vma_interval_tree_subtree_search
2.12% vma_interval_tree_iter_next
2.11% folio_referenced_one
1.90% anon_vma_interval_tree_iter_first
1.47% ptep_clear_flush
0.97% __anon_vma_interval_tree_subtree_search
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/lkml/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
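A small sketch of two ideas from the description above, written as standalone C and not the kernel code: (1) a page accessed N times through file descriptors sits in tier order_base_2(N); (2) a simple refault-based feedback compares the anon and file types and evicts from the one whose evicted pages come back less often. The threshold logic of the real PID-style controller is not reproduced here.
#include <stdio.h>
/* order_base_2(N): ceil(log2(N)), with order_base_2(1) == 0 */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;
	while ((1UL << order) < n)
		order++;
	return order;
}
struct type_stats {
	unsigned long refaulted;   /* refaults observed for this type */
	unsigned long evicted;     /* pages evicted for this type */
};
/* Return 0 to evict file pages, 1 to evict anon pages: prefer the type
 * whose evicted pages refault less. */
static int pick_type_to_evict(const struct type_stats *file,
			      const struct type_stats *anon)
{
	/* compare refaulted/evicted ratios without dividing: cross-multiply */
	unsigned long file_rate = file->refaulted * (anon->evicted + 1);
	unsigned long anon_rate = anon->refaulted * (file->evicted + 1);
	return anon_rate < file_rate;   /* 1 == evict anon */
}
int main(void)
{
	struct type_stats file = { .refaulted = 120, .evicted = 1000 };
	struct type_stats anon = { .refaulted = 15,  .evicted = 800 };
	printf("4 accesses -> tier %u\n", order_base_2(4));   /* tier 2 */
	printf("evict %s first\n",
	       pick_type_to_evict(&file, &anon) ? "anon" : "file");
	return 0;
}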
f88ed5a3d3
FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0.
There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads.
There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
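A short worked example of the sequence-number bookkeeping the message above describes, as a standalone C sketch (constants and helper names are illustrative; only the relationships come from the text): max_seq and min_seq grow monotonically, while the value kept in the page flags is truncated so it fits in order_base_2(MAX_NR_GENS+1) bits, with 0 meaning "not on a multi-gen list" and 1..MAX_NR_GENS indexing lrugen->lists[].
#include <assert.h>
#define MIN_NR_GENS 2UL
#define MAX_NR_GENS 4UL
/* list index for a sequence number */
static unsigned long lru_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}
/* value stored in the page flags: shifted by one so 0 can mean "off LRU" */
static unsigned long gen_counter_from_seq(unsigned long seq)
{
	return lru_gen_from_seq(seq) + 1;   /* in [1, MAX_NR_GENS] */
}
int main(void)
{
	unsigned long max_seq = 9, min_seq = 7;
	/* the sliding window keeps between MIN_NR_GENS and MAX_NR_GENS generations */
	assert(max_seq - min_seq + 1 >= MIN_NR_GENS);
	assert(max_seq - min_seq + 1 <= MAX_NR_GENS);
	/* within such a window, truncated indices of live generations never collide */
	assert(lru_gen_from_seq(max_seq) != lru_gen_from_seq(min_seq));
	assert(gen_counter_from_seq(max_seq) >= 1);
	return 0;
}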
429f413ed8
mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
commit a431dbbc540532b7465eae4fc8b56a85a9fc7d17 upstream.
The gcc 12 compiler reports a "'mem_section' will never be NULL" warning
on the following code:
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
if (!mem_section)
return NULL;
#endif
if (!mem_section[SECTION_NR_TO_ROOT(nr)])
return NULL;
:
It happens with CONFIG_SPARSEMEM_EXTREME off. The mem_section definition
is
#ifdef CONFIG_SPARSEMEM_EXTREME
extern struct mem_section **mem_section;
#else
extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
#endif
In the !CONFIG_SPARSEMEM_EXTREME case, mem_section is a static
2-dimensional array and so the check "!mem_section[SECTION_NR_TO_ROOT(nr)]"
doesn't make sense.
Fix this warning by moving the "!mem_section[SECTION_NR_TO_ROOT(nr)]"
check up inside the CONFIG_SPARSEMEM_EXTREME block and adding an
explicit NR_SECTION_ROOTS check to make sure that there is no
out-of-bound array access.
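A sketch of what the helper looks like with that change applied, reconstructed from the description above (see the referenced upstream commit for the authoritative version):
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
	unsigned long root = SECTION_NR_TO_ROOT(nr);
	/* explicit bound check so the array access below cannot go out of bounds */
	if (unlikely(root >= NR_SECTION_ROOTS))
		return NULL;
#ifdef CONFIG_SPARSEMEM_EXTREME
	/* only the dynamically allocated layout can have NULL entries */
	if (!mem_section || !mem_section[root])
		return NULL;
#endif
	return &mem_section[root][nr & SECTION_ROOT_MASK];
}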
Link: https://lkml.kernel.org/r/20220331180246.2746210-1-longman@redhat.com
Fixes:
d831f07038
ANDROID: vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux Kernel in one of two ways:
1) Asynchronously via kswapd
2) Synchronously, via direct reclaim
At page allocation time the allocating task is immediately given a page
from the zone free list allowing it to go right back to work doing
whatever it was doing; Probably directly or indirectly executing business
logic.
Just prior to satisfying the allocation, the number of free pages is checked to see
whether it has reached the zone low watermark and, if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often
ranging from tens to hundreds of milliseconds for small order-0 allocations to
half a second or more for order-9 huge-page allocations. In fact, kswapd is
not actually required on a linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.
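A minimal model of the allocation-time behaviour described above, with made-up names (this is not the kernel's watermark code): above the low watermark, pages come straight off the free list; between min and low the allocation still succeeds but kswapd is woken; at or below min the allocating task has to direct-reclaim.
#include <stdio.h>
enum alloc_path { FAST_PATH, WAKE_KSWAPD, DIRECT_RECLAIM };
struct fake_zone {
	long free_pages;
	long watermark_min;
	long watermark_low;
};
static enum alloc_path classify_alloc(const struct fake_zone *z)
{
	if (z->free_pages <= z->watermark_min)
		return DIRECT_RECLAIM;         /* allocating task reclaims itself */
	if (z->free_pages <= z->watermark_low)
		return WAKE_KSWAPD;            /* background reclaim keeps up */
	return FAST_PATH;                      /* plenty of free pages */
}
int main(void)
{
	struct fake_zone z = { .free_pages = 900, .watermark_min = 512,
			       .watermark_low = 1024 };
	printf("path = %d\n", classify_alloc(&z));   /* WAKE_KSWAPD */
	return 0;
}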
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much so I just grab the highest value.
Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit
f47b852faa
ANDROID: implement wrapper for reverse migration
Reverse migration is used to balance the occupancy of memory zones within a node when the imbalance is caused by an operation that migrates pages to other zones, e.g., hot-removing and then hot-adding the same memory. In this case there is a lot of free memory in the newly hot-added region, which can be filled with the pages previously migrated out (as part of offline/hotremove), thus relieving some pressure in the other zones of the node.
Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Bug: 201263307
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
a8b5dc3032
Merge 5.15.17 into android13-5.15
Changes in 5.15.17
KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU
KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock
HID: Ignore battery for Elan touchscreen on HP Envy X360 15t-dr100
HID: uhid: Fix worker destroying device without any protection
HID: wacom: Reset expected and received contact counts at the same time
HID: wacom: Ignore the confidence flag when a touch is removed
HID: wacom: Avoid using stale array indicies to read contact count
ALSA: core: Fix SSID quirk lookup for subvendor=0
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: fix to do sanity check in is_alive()
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()
mtd: rawnand: gpmi: Add ERR007117 protection for nfc_apply_timings
mtd: rawnand: gpmi: Remove explicit default gpmi clock setting for i.MX6
mtd: Fixed breaking list in __mtd_del_partition.
mtd: rawnand: davinci: Don't calculate ECC when reading page
mtd: rawnand: davinci: Avoid duplicated page read
mtd: rawnand: davinci: Rewrite function description
mtd: rawnand: Export nand_read_page_hwecc_oob_first()
mtd: rawnand: ingenic: JZ4740 needs 'oob_first' read page function
riscv: Get rid of MAXPHYSMEM configs
RISC-V: Use common riscv_cpuid_to_hartid_mask() for both SMP=y and SMP=n
riscv: try to allocate crashkern region from 32bit addressible memory
riscv: Don't use va_pa_offset on kdump
riscv: use hart id instead of cpu id on machine_kexec
riscv: mm: fix wrong phys_ram_base value for RV64
x86/gpu: Reserve stolen memory for first integrated Intel GPU
tools/nolibc: x86-64: Fix startup code bug
crypto: x86/aesni - don't require alignment of data
tools/nolibc: i386: fix initial stack alignment
tools/nolibc: fix incorrect truncation of exit code
rtc: cmos: take rtc_lock while reading from CMOS
net: phy: marvell: add Marvell specific PHY loopback
ksmbd: uninitialized variable in create_socket()
ksmbd: fix guest connection failure with nautilus
ksmbd: add support for smb2 max credit parameter
ksmbd: move credit charge deduction under processing request
ksmbd: limits exceeding the maximum allowable outstanding requests
ksmbd: add reserved room in ipc request/response
media: cec: fix a deadlock situation
media: ov8865: Disable only enabled regulators on error path
media: v4l2-ioctl.c: readbuffers depends on V4L2_CAP_READWRITE
media: flexcop-usb: fix control-message timeouts
media: mceusb: fix control-message timeouts
media: em28xx: fix control-message timeouts
media: cpia2: fix control-message timeouts
media: s2255: fix control-message timeouts
media: dib0700: fix undefined behavior in tuner shutdown
media: redrat3: fix control-message timeouts
media: pvrusb2: fix control-message timeouts
media: stk1160: fix control-message timeouts
media: cec-pin: fix interrupt en/disable handling
can: softing_cs: softingcs_probe(): fix memleak on registration failure
mei: hbm: fix client dma reply status
iio: adc: ti-adc081c: Partial revert of removal of ACPI IDs
iio: trigger: Fix a scheduling whilst atomic issue seen on tsc2046
lkdtm: Fix content of section containing lkdtm_rodata_do_nothing()
bus: mhi: pci_generic: Graceful shutdown on freeze
bus: mhi: core: Fix reading wake_capable channel configuration
bus: mhi: core: Fix race while handling SYS_ERR at power up
cxl/pmem: Fix reference counting for delayed work
arm64: errata: Fix exec handling in erratum
240e8d331a
mm_zone: add function to check if managed dma zone exists
commit 62b3107073646e0946bd97ff926832bafb846d17 upstream.
Patch series "Handle warning of allocation failure on DMA zone w/o managed pages", v4.
**Problem observed:
On x86_64, when a crash is triggered and the kdump kernel is entered, a page allocation failure can always be seen:
---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------
***Root cause:
The current kernel assumes that the DMA zone must have managed pages and tries to request pages from it if CONFIG_ZONE_DMA is enabled. This is not always true. E.g., in the kdump kernel of x86_64, only the low 1M is present and locked down at a very early stage of boot, so this low 1M is never added to the buddy allocator to become managed pages of the DMA zone. This exception will always cause a page allocation failure if a page is requested from the DMA zone.
***Investigation:
This failure happens since the below commit merged into Linus's tree.
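Per the subject, the patch adds a helper that reports whether any DMA zone has managed pages. A sketch along those lines, using real mm identifiers but reconstructed from the description rather than copied from the patch:
#ifdef CONFIG_ZONE_DMA
bool has_managed_dma(void)
{
	struct pglist_data *pgdat;
	/* walk every online node and check its DMA zone for managed pages */
	for_each_online_pgdat(pgdat) {
		struct zone *zone = &pgdat->node_zones[ZONE_DMA];
		if (managed_zone(zone))
			return true;
	}
	return false;
}
#endif /* CONFIG_ZONE_DMA */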
37b2d597bb
ANDROID: mm: add cma pcp list
Add a PCP list for __GFP_CMA allocations so as not to deprive MIGRATE_MOVABLE allocations of quick access to pages on their PCP lists.
Bug: 158645321
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts related to new mm features]
Signed-off-by: Isaac J. Manjarres <isaacm@quicinc.com>
Change-Id: I2f238ea5f8e4aef9c45b1a3180ce6b6a36d63d77
af82009880
ANDROID: cma: redirect page allocation to CMA
CMA pages are designed to be used as a fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. Such userspace pages can be migrated or dropped quickly, which gives decent utilization. Additionally, add fallbacks for failed CMA allocations in rmqueue() and __rmqueue_pcplist() (the latter addition being driven by a report from the kernel test robot); these fallbacks were handled differently in the original version of the patch, as the rmqueue() call chain has since changed.
Bug: 158645321
Link: https://lore.kernel.org/lkml/cover.1604282969.git.cgoldswo@codeaurora.org/
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[cgoldswo@codeaurora.org: Place in bugfixes; remove cma_alloc zone flag]
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts to account for new mm features]
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Change-Id: I3dfbc42f1d12416143550042182bf16030ca7190
||
|
|
c0d1ebaba1 |
Merge 2d338201d5 ("Merge branch 'akpm' (patches from Andrew)") into android-mainline
Steps on the way to 5.15-rc1 Resolves merge conflict in: fs/proc/base.c Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic554ca8447961e52fbc6f27d91470a816b59a771 |
||
|
|
c2b303f98f |
Merge 4e71add028 ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline
Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28 |
||
|
|
2d338201d5 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on
|
||
|
|
4b09700244 |
mm: track present early pages per zone
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is required for fast memory hotplug for drivers that add individual
memory blocks, like virtio-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enables auto-onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-175: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
upstream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really required.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar to
adjust_managed_page_count(), and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
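As a sketch of the mechanics described above (field and helper names such as present_early_pages are taken from the description; the exact upstream code may differ slightly):

void adjust_present_page_count(struct page *page, long nr_pages)
{
        struct zone *zone = page_zone(page);

        /*
         * Memory blocks are onlined/offlined as a whole, so a block is
         * either fully "early" (boot memory) or fully hotplugged.
         */
        if (early_section(__pfn_to_section(page_to_pfn(page))))
                zone->present_early_pages += nr_pages;
        zone->present_pages += nr_pages;
        zone->zone_pgdat->node_present_pages += nr_pages;
}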
Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
859a85ddf9 |
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
65d759c8f9 |
mm: compaction: support triggering of proactive compaction by user
The proactive compaction [1] is triggered every 500 msec and runs compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages, based on the value set in sysctl.compaction_proactiveness. Triggering compaction every 500 msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on embedded systems which may have only a few MB of RAM. Enabling proactive compaction in its current state would end up running almost always on such systems.
On the other hand, proactive compaction can still be very useful for getting a set of higher-order pages in a controllable manner (controlled by sysctl.compaction_proactiveness). So, on systems where always-enabled proactive compaction may prove unnecessary, it can still be triggered from user space by writing to its sysctl interface. As an example, an app launcher deciding to launch a memory-heavy application, which can be launched faster if it gets more higher-order pages, can prepare the system in advance by triggering proactive compaction from userspace.
This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by the user.
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a
[akpm@linux-foundation.org: tweak vm.rst, per Mike]
Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
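A rough sketch of how such a trigger can be wired up (the handler name, the proactive_compact_trigger field and other details are illustrative and may not match the upstream patch exactly): a write to the sysctl asks kcompactd on every node to do one proactive compaction pass.

int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
{
        int rc, nid;

        rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (rc)
                return rc;

        if (write && sysctl_compaction_proactiveness) {
                for_each_online_node(nid) {
                        pg_data_t *pgdat = NODE_DATA(nid);

                        if (pgdat->proactive_compact_trigger)
                                continue;

                        /* One-shot request; kcompactd clears it when done. */
                        pgdat->proactive_compact_trigger = true;
                        wake_up_interruptible(&pgdat->kcompactd_wait);
                }
        }

        return 0;
}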
||
|
|
01c8d337d1 |
mm/sparse: set SECTION_NID_SHIFT to 6
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bits 3
and 4 can be overlapped by the sub-field for the early NID, and can be
unexpectedly set on NUMA systems. There are a few non-critical issues
related to this:
- Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces
pfn_to_online_page() through the slow path, but doesn't actually break
the kernel.
- A kdump generation tool like makedumpfile uses this field to calculate
the physical address to read. So wrong bits can make the tool access the
wrong address and fail to create a kdump. This can be avoided by the
tool, so it's not critical.
To fix it, set SECTION_NID_SHIFT to 6, which is the minimum number of
available bits of the section flag field.
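The gist of the fix is a one-line change in include/linux/mmzone.h (a sketch; surrounding definitions abridged), leaving the low bits entirely to the section flag field:

/* Bits 0-5 are reserved for section flags such as SECTION_MARKED_PRESENT,
 * SECTION_HAS_MEM_MAP, SECTION_IS_ONLINE, SECTION_IS_EARLY and
 * SECTION_TAINT_ZONE_DEVICE, so the early NID sub-field now starts at
 * bit 6 instead of bit 3.
 */
#define SECTION_NID_SHIFT       6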
Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev
Fixes:
|
||
|
|
11e02d3729 |
mm: sparse: remove __section_nr() function
As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
293f275f4d |
Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1 Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
8e658623d4 |
Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1 Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
351de44fde |
mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking, and apply any side-effects in the parameter list.
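A sketch of the FLATMEM variant after the change (config guard abridged); as a real function, the nid argument is type-checked and any side-effects in the argument are evaluated even though the value is unused:

/* FLATMEM / single-node configuration */
extern struct pglist_data contig_page_data;

static inline struct pglist_data *NODE_DATA(int nid)
{
        return &contig_page_data;
}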
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
041711ce7c |
mm: fix spelling mistakes
Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
16c9afc776 |
arm64/mm: drop HAVE_ARCH_PFN_VALID
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
51c656aef6 |
include/linux/mmzone.h: add documentation for pfn_valid()
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
44042b4498 |
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
The per-cpu page allocator (PCP) only stores order-0 pages. This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock. This patch extends the PCP allocator to store THP and
"cheap" high-order pages. Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.
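Conceptually, each PCP list is now indexed by (migratetype, order) rather than by migratetype alone; a sketch of such an index helper (name and details assumed, may differ from the final upstream code), with orders above PAGE_ALLOC_COSTLY_ORDER collapsed onto a single THP slot:

static inline unsigned int order_to_pindex(int migratetype, int order)
{
        int base = order;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order > PAGE_ALLOC_COSTLY_ORDER) {
                VM_BUG_ON(order != pageblock_order);
                base = PAGE_ALLOC_COSTLY_ORDER + 1;
        }
#endif

        return (MIGRATE_PCPTYPES * base) + migratetype;
}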
Note that this is not necessarily a universal performance win because of
how it is implemented. High-order pages can cause pcp->high to be
exceeded prematurely for lower-orders so for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.
That said, basic performance testing passed. The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.
netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*
Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations. The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.
Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
43b02ba93b |
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP configuration option is equivalent to FLATMEM. Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead. Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9ee6cf5c6 |
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA configuration options are equivalent. Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead. Done with $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \ $(git grep -wl CONFIG_NEED_MULTIPLE_NODES) $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \ $(git grep -wl NEED_MULTIPLE_NODES) with manual tweaks afterwards. [rppt@linux.ibm.com: fix arm boot crash] Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb1c50d396 |
mm: remove CONFIG_DISCONTIGMEM
There are no architectures that support DISCONTIGMEM left. Remove the configuration option and the dead code it was guarding in the generic memory management code. Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
777c00f5ed |
mm: drop SECTION_SHIFT in code comments
Actually SECTIONS_SHIFT is used in the kernel code, so the code comments
are strictly incorrect. And since commit
|
||
|
|
74f4482209 |
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63
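A simplified sketch of how pcp->high can be derived when the new sysctl is set (the helper name and exact clamping are illustrative, not the verbatim upstream code): take the configured fraction of the zone's managed pages and split it across the CPUs local to that zone, leaving pcp->batch untouched.

static int pcp_high_from_fraction(struct zone *zone, int fraction, int batch)
{
        unsigned long total_pages = zone_managed_pages(zone) / fraction;
        unsigned int nr_local_cpus =
                max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
        int high = total_pages / nr_local_cpus;

        /* Keep at least a few batches worth of pages per CPU. */
        return max(high, batch << 2);
}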
[mgorman@techsingularity.net: fix documentation]
Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c49c2c47da |
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
When kswapd is active then direct reclaim is potentially active. In either case, it is possible that a zone would be balanced if pages were not trapped on PCP lists. Instead of draining remote pages, simply limit the size of the PCP lists while kswapd is active. Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
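A sketch of the clamp described above (the function name and the ZONE_RECLAIM_ACTIVE zone flag are assumptions here; upstream details may vary): while the zone is flagged as under active reclaim, the effective pcp->high shrinks to a few batches so pages drain back to the buddy allocator quickly.

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
{
        int high = READ_ONCE(pcp->high);

        if (unlikely(!high))
                return 0;

        if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
                return high;

        /* Reclaim is active: keep only a small number of pages on the PCP. */
        return min(READ_ONCE(pcp->batch) << 2, high);
}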
||
|
|
3b12e7e979 |
mm/page_alloc: scale the number of pages that are batch freed
When a task is freeing a large number of order-0 pages, it may acquire the zone->lock multiple times freeing pages in batches. This may unnecessarily contend on the zone lock when freeing a very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent frees.
The machines I used to test this are not large enough to illustrate the problem, but a debugging patch shows patterns like the following (slightly edited for clarity):
Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
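A sketch of the scaling logic (a per-cpu free_factor counter is assumed here; not necessarily the exact upstream code): each consecutive round of frees without an intervening allocation doubles the number of pages freed per zone->lock acquisition, clamped between pcp->batch and pcp->high.

static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
{
        int min_nr_free = batch;        /* always leave one batch behind */
        int max_nr_free = high - batch; /* never free more than this at once */

        /* Double the batch each time frees arrive without any allocation. */
        batch <<= pcp->free_factor;
        if (batch < max_nr_free)
                pcp->free_factor++;

        return clamp(batch, min_nr_free, max_nr_free);
}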
||
|
|
bbbecb35a4 |
mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2. The per-cpu page allocator (PCP) is meant to reduce contention on the zone lock but the sizing of batch and high is archaic and neither takes the zone size into account or the number of CPUs local to a zone. With larger zones and more CPUs per node, the contention is getting worse. Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch and high values means that the sysctl can reduce zone lock contention but also increase allocation latencies. This series disassociates pcp->high from pcp->batch and then scales pcp->high based on the size of the local zone with limited impact to reclaim and accounting for active CPUs but leaves pcp->batch static. It also adapts the number of pages that can be on the pcp list based on recent freeing patterns. The motivation is partially to adjust to larger memory sizes but is also driven by the fact that large batches of page freeing via release_pages() often shows zone contention as a major part of the problem. Another is a bug report based on an older kernel where a multi-terabyte process can takes several minutes to exit. A workaround was to use vm.percpu_pagelist_fraction to increase the pcp->high value but testing indicated that a production workload could not use the same values because of an increase in allocation latencies. Unfortunately, I cannot reproduce this test case myself as the multi-terabyte machines are in active use but it should alleviate the problem. The series aims to address both and partially acts as a pre-requisite. pcp only works with order-0 which is useless for SLUB (when using high orders) and THP (unconditionally). To store high-order pages on PCP, the pcp->high values need to be increased first. This patch (of 6): The vm.percpu_pagelist_fraction is used to increase the batch and high limits for the per-cpu page allocator (PCP). The intent behind the sysctl is to reduce zone lock acquisition when allocating/freeing pages but it has a problem. While it can decrease contention, it can also increase latency on the allocation side due to unreasonably large batch sizes. This leads to games where an administrator adjusts percpu_pagelist_fraction on the fly to work around contention and allocation latency problems. This series aims to alleviate the problems with zone lock contention while avoiding the allocation-side latency problems. For the purposes of review, it's easier to remove this sysctl now and reintroduce a similar sysctl later in the series that deals only with pcp->high. Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f19298b951 |
mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign etc., but nothing relies on them being perfectly accurate for functional correctness. The counters are used by userspace to get a general overview of a workload's NUMA behaviour but the page allocator incurs a high cost to maintain perfect accuracy similar to what is required for a vmstat counter like NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to turn off the collection of NUMA statistics like NUMA_HIT.
This patch converts NUMA_HIT and friends to be NUMA events with similar accuracy to VM events. There is a possibility that slight errors will be introduced but the overall trend as seen by userspace will be similar. The counters are no longer updated from vmstat_refresh context as it is unnecessary overhead for counters that may never be read by userspace. Note that counters could be maintained at the node level to save space but it would have a user-visible impact due to /proc/zoneinfo.
[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dbbee9d5cd |
mm/page_alloc: convert per-cpu list protection to local_lock
There is a lack of clarity of what exactly local_irq_save/local_irq_restore protects in page_alloc.c. It conflates the protection of per-cpu page allocation structures with per-cpu vmstat deltas.
This patch protects the PCP structure using local_lock which for most configurations is identical to IRQ enabling/disabling. The scope of the lock is still wider than it should be but this is decreased later.
It is possible for the local_lock to be embedded safely within struct per_cpu_pages but it adds complexity to free_unref_page_list.
[akpm@linux-foundation.org: coding style fixes]
[mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
[lkp@intel.com: Make pagesets static]
Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
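A minimal sketch of the pattern this converts to (struct and field names assumed): a per-CPU local_lock guards the PCP structure, which on !PREEMPT_RT compiles down to the same IRQ disable/enable while giving PREEMPT_RT and PROVE_LOCKING something meaningful to work with.

struct pagesets {
        local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

/* Allocation fast path: local_lock instead of local_irq_save(). */
static struct page *rmqueue_pcplist_sketch(struct zone *zone, int migratetype,
                unsigned int alloc_flags, struct list_head *list)
{
        struct per_cpu_pages *pcp;
        struct page *page;
        unsigned long flags;

        local_lock_irqsave(&pagesets.lock, flags);
        pcp = this_cpu_ptr(zone->per_cpu_pageset);
        page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
        local_unlock_irqrestore(&pagesets.lock, flags);
        return page;
}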
||
|
|
28f836b677 |
mm/page_alloc: split per cpu page lists and zone stats
The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.
This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.
Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst
local_irq_disable();
spin_lock(&lock);
The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a
bonus, local_lock also means that PROVE_LOCKING does something useful.
After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.
The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.
Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.
No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
Baseline Patched
1 56.383 54.225 (+3.83%)
2 40.047 35.492 (+11.38%)
3 37.339 32.643 (+12.58%)
4 35.578 30.992 (+12.89%)
8 33.592 29.606 (+11.87%)
16 32.362 28.532 (+11.85%)
32 31.476 27.728 (+11.91%)
64 30.633 27.252 (+11.04%)
128 30.596 27.090 (+11.46%)
While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.
This patch (of 9):
The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cache
conflict with adjacent per-cpu lists incurring a runtime cost or padding
is required incurring a memory cost.
This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures. It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).
Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.
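In outline, the split described above ends up with two per-CPU structures along these lines (fields abridged; the exact layout may differ):

/* Hot: consulted on every per-cpu page allocation and free. */
struct per_cpu_pages {
        int count;              /* number of pages in the lists */
        int high;               /* high watermark, emptying needed */
        int batch;              /* chunk size for buddy add/remove */

        /* Lists of pages, one per migrate type stored on the pcp-lists */
        struct list_head lists[MIGRATE_PCPTYPES];
};

/* Cold: only vmstat deltas, with relaxed accuracy requirements. */
struct per_cpu_zonestat {
#ifdef CONFIG_SMP
        s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
        s8 stat_threshold;
#endif
#ifdef CONFIG_NUMA
        u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
#endif
};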
[mgorman@techsingularity.net: make it W=1 cleaner]
Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
[mgorman@techsingularity.net: make it W=1 even cleaner]
Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
[lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b19bd1c976 |
mm/mmzone.h: simplify is_highmem_idx()
There is a lot of historical ifdefery in is_highmem_idx() and its helper zone_movable_is_highmem() that was required because of two different paths for nodes and zones initialization that were selected at compile time. Until commit |
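After the simplification, the helper reduces to roughly the following (a sketch of the end result):

static inline bool is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
        return (idx == ZONE_HIGHMEM ||
                (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
#else
        return false;
#endif
}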
||
|
|
85adc860fd |
Merge 6efb943b86 Linux 5.13-rc1 into android-mainline
One giant leap, all the way up to 5.13-rc1 Also take the opportunity to re-align (a.k.a. fix a couple of previous merge conflict fix-up issues) which occurred during this merge-window. Fixes: |
||
|
|
163e8b4fec |
Merge d652502ef4 Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into android-mainline
A tiny step en route to v5.13-rc1 Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I049e80976042ebffc90bb080f09da0afcfd48d77 |
||
|
|
cb152a1a95 |
mm: fix some typos and code style problems
fix some typos and code style problems in mm. gfp.h: s/MAXNODES/MAX_NUMNODES mmzone.h: s/then/than rmap.c: s/__vma_split()/__vma_adjust() swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is swap_state.c: s/whoes/whose z3fold.c: code style problem fix in z3fold_unregister_migration zsmalloc.c: s/of/or, s/give/given Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a08a2ae346 |
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
This can even lead to extreme cases where system goes OOM because
the physically hotplugged memory depletes the available memory before
it is onlined.
b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
c) It might be that there are no PMD_ALIGNED chunks, so the memmap array gets
populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
There are some non-obvious things to consider though.
Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed. This means that the reserved physical range is not
online although it is used. The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns. The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined. For example, hibernation wouldn't save the
content of those vmemmaps into the image so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover anyway
while metadata is reachable from other data structures (e.g. vmemmap
page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.
As per above, the functions that are introduced are:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
fully span.
- mhp_deinit_memmap_on_memory:
Offlines as many sections as vmemmap pages fully span, removes the
range from the zone by remove_pfn_range_from_zone(), and calls
kasan_remove_zero_shadow() for the range.
The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages(). Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
present_pages is done at the end once we know that online_pages()
succeeded.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages(). This is necessary because
offline_pages() tears down some structures based on the fact whether the
node or the zone become empty. If offline_pages() fails, we account back
vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory().
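A condensed sketch of the online path described above (error handling and details trimmed; names follow the description, the exact code may differ):

static int memory_block_online(struct memory_block *mem)
{
        unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
        unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
        struct zone *zone;
        int ret;

        /* Pick the zone for the whole range (ZONE_NORMAL vs ZONE_MOVABLE). */
        zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);

        /* Initialize the pages backing the memmap before onlining the rest. */
        if (nr_vmemmap_pages) {
                ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
                if (ret)
                        return ret;
        }

        ret = online_pages(start_pfn + nr_vmemmap_pages,
                           nr_pages - nr_vmemmap_pages, zone);
        if (ret) {
                if (nr_vmemmap_pages)
                        mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
                return ret;
        }

        /* Vmemmap pages are accounted as present only once onlining succeeded. */
        return 0;
}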
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
d1e153fea2 |
mm/gup: migrate pinned pages out of movable zone
We should not pin pages in ZONE_MOVABLE. Currently, the only movable pages we avoid pinning are CMA pages. Generalize the function that migrates CMA pages to migrate all movable pages. Use is_pinnable_page() to check which pages need to be migrated. Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
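For reference, a sketch of what such a predicate boils down to (the exact upstream definition may differ): a page is long-term pinnable unless it sits in ZONE_MOVABLE or on a CMA pageblock, with the shared zero page treated as pinnable.

static inline bool is_pinnable_page(struct page *page)
{
        /* The shared zero page can always be pinned. */
        if (is_zero_pfn(page_to_pfn(page)))
                return true;

        return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
                 is_migrate_cma_page(page));
}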
||
|
|
9afaf30f7a |
mm/gup: do not migrate zero page
On some platforms ZERO_PAGE(0) might end-up in a movable zone. Do not migrate zero page in gup during longterm pinning as migration of zero page is not allowed. For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I see the following: Boot#1: zero_pfn 0x48a8d zero_pfn zone: ZONE_DMA32 Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE On x86, empty_zero_page is declared in .bss and depending on the loader may end up in different physical locations during boots. Also, move is_zero_pfn() my_zero_pfn() functions under CONFIG_MMU, because zero_pfn that they are using is declared in memory.c which is compiled with CONFIG_MMU. Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |