889a0c39fef31949d4adad5816ea6445cf1aadb8
514 Commits
04bb2779c9
ANDROID: vendor_hooks: add a field in pglist_data
Add a pglist_data field to record additional node parameters.
Bug: 192052083
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I3d764ab298c71ab9aba245867ee529045551aef4
(cherry picked from commit
98042d19ad
ANDROID: GKI: mm: add Android ABI padding to some structures
Try to mitigate potential future driver core API changes by adding padding to struct vm_area_struct and struct zone. Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel.
Leaf changes summary: 3 artifacts changed
Changed leaf types summary: 3 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable
'struct vm_area_struct at mm_types.h:292:1' changed:
type size changed from 1472 to 1728 (in bits)
4 data member insertions:
'u64 vm_area_struct::android_kabi_reserved1', at offset 1472 (in bits) at mm_types.h:365:1
'u64 vm_area_struct::android_kabi_reserved2', at offset 1536 (in bits) at mm_types.h:366:1
'u64 vm_area_struct::android_kabi_reserved3', at offset 1600 (in bits) at mm_types.h:367:1
'u64 vm_area_struct::android_kabi_reserved4', at offset 1664 (in bits) at mm_types.h:368:1
1435 impacted interfaces:
'struct zone at mmzone.h:420:1' changed:
type size changed from 12800 to 13312 (in bits)
4 data member insertions:
'u64 zone::android_kabi_reserved1', at offset 12672 (in bits) at mmzone.h:569:1
'u64 zone::android_kabi_reserved2', at offset 12736 (in bits) at mmzone.h:570:1
'u64 zone::android_kabi_reserved3', at offset 12800 (in bits) at mmzone.h:571:1
'u64 zone::android_kabi_reserved4', at offset 12864 (in bits) at mmzone.h:572:1
624 impacted interfaces:
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I81702aa833f419928e0e32e9609722b98592c171
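For context, the android_kabi_reserved fields listed in the abidiff output above boil down to spare u64 slots appended to the struct. A minimal sketch of the technique, assuming the ANDROID_KABI_RESERVE macro from the Android common kernel expands to a named u64 as shown (illustration only, not the kernel headers):
typedef unsigned long long u64;            /* stand-in for the kernel's u64 */
#define _ANDROID_KABI_RESERVE(n)  u64 android_kabi_reserved##n
#define ANDROID_KABI_RESERVE(n)   _ANDROID_KABI_RESERVE(n)
struct example_struct {
	unsigned long existing_field;
	/* spare slots frozen into the ABI now; a later change can reuse one
	 * (e.g. via an ANDROID_KABI_USE-style union) without growing the
	 * struct again and breaking the frozen ABI */
	ANDROID_KABI_RESERVE(1);
	ANDROID_KABI_RESERVE(2);
	ANDROID_KABI_RESERVE(3);
	ANDROID_KABI_RESERVE(4);   /* 4 x u64 == the 256-bit size growth abidiff reports */
};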
dcac70709f
ANDROID: add vendor fields to lruvec to record refault stats
struct lruvec :: ANDROID_VENDOR_DATA(1) is a pointer to a struct that records the following information:
1) the count of workingset_restore pages for cached anonymous and file pages
This is used to adjust the strategy and amount of data reclaimed.
Bug: 225795494
Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996
Signed-off-by: Bing Han <bing.han@transsion.com>
(cherry picked from commit 1b14ae01b09dfde89da470cac8415cefaca824fb)
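A hedged sketch of what the vendor pointer could reference, based only on the description above; the struct name and members here are illustrative, not the actual vendor code:
/* Illustrative only: the commit says the ANDROID_VENDOR_DATA(1) field in
 * struct lruvec points at "a struct" tracking workingset restores per type. */
struct vendor_lruvec_stats {
	unsigned long workingset_restore_anon;  /* refaulted-back anonymous pages */
	unsigned long workingset_restore_file;  /* refaulted-back file pages */
};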
33f5d1daec
Merge 5.15.34 into android13-5.15
Changes in 5.15.34
lib/logic_iomem: correct fallback config references
um: fix and optimize xor select template for CONFIG64 and timetravel mode
rtc: wm8350: Handle error for wm8350_register_irq
nbd: add error handling support for add_disk()
nbd: Fix incorrect error handle when first_minor is illegal in nbd_dev_add
nbd: Fix hungtask when nbd_config_put
nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
kfence: count unexpectedly skipped allocations
kfence: move saving stack trace of allocations into __kfence_alloc()
kfence: limit currently covered allocations when pool nearly full
KVM: x86/pmu: Use different raw event masks for AMD and Intel
KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
drm: Add orientation quirk for GPD Win Max
ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111
drm/amd/display: Add signal type check when verify stream backends same
drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj
drm/amd/display: Fix memory leak
drm/amd/display: Use PSR version selected during set_psr_caps
usb: gadget: tegra-xudc: Do not program SPARAM
usb: gadget: tegra-xudc: Fix control endpoint's definitions
usb: cdnsp: fix cdnsp_decode_trb function to properly handle ret value
ptp: replace snprintf with sysfs_emit
drm/amdkfd: Don't take process mutex for svm ioctls
powerpc: dts: t104xrdb: fix phy type for FMAN 4/5
ath11k: fix kernel panic during unload/load ath11k modules
ath11k: pci: fix crash on suspend if board file is not found
ath11k: mhi: use mhi_sync_power_up()
net/smc: Send directly when TCP_CORK is cleared
drm/bridge: Add missing pm_runtime_put_sync
bpf: Make dst_port field in struct bpf_sock 16-bit wide
scsi: mvsas: Replace snprintf() with sysfs_emit()
scsi: bfa: Replace snprintf() with sysfs_emit()
drm/v3d: fix missing unlock
power: supply: axp20x_battery: properly report current when discharging
mt76: mt7921: fix crash when startup fails.
mt76: dma: initialize skip_unmap in mt76_dma_rx_fill
cfg80211: don't add non transmitted BSS to 6GHz scanned channels
libbpf: Fix build issue with llvm-readelf
ipv6: make mc_forwarding atomic
net: initialize init_net earlier
powerpc: Set crashkernel offset to mid of RMA region
drm/amdgpu: Fix recursive locking warning
scsi: smartpqi: Fix kdump issue when controller is locked up
PCI: aardvark: Fix support for MSI interrupts
iommu/arm-smmu-v3: fix event handling soft lockup
usb: ehci: add pci device support for Aspeed platforms
PCI: endpoint: Fix alignment fault error in copy tests
tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
PCI: pciehp: Add Qualcomm quirk for Command Completed erratum
scsi: mpi3mr: Fix reporting of actual data transfer size
scsi: mpi3mr: Fix memory leaks
powerpc/set_memory: Avoid spinlock recursion in change_page_attr()
power: supply: axp288-charger: Set Vhold to 4.4V
net/mlx5e: Disable TX queues before registering the netdev
usb: dwc3: pci: Set the swnode from inside dwc3_pci_quirks()
iwlwifi: mvm: Correctly set fragmented EBS
iwlwifi: mvm: move only to an enabled channel
drm/msm/dsi: Remove spurious IRQF_ONESHOT flag
ipv4: Invalidate neighbour for broadcast address upon address addition
dm ioctl: prevent potential spectre v1 gadget
dm: requeue IO if mapping table not yet available
drm/amdkfd: make CRAT table missing message informational only
vfio/pci: Stub vfio_pci_vga_rw when !CONFIG_VFIO_PCI_VGA
scsi: pm8001: Fix pm80xx_pci_mem_copy() interface
scsi: pm8001: Fix pm8001_mpi_task_abort_resp()
scsi: pm8001: Fix task leak in pm8001_send_abort_all()
scsi: pm8001: Fix tag leaks on error
scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req()
mt76: mt7915: fix injected MPDU transmission to not use HW A-MSDU
powerpc/64s/hash: Make hash faults work in NMI context
mt76: mt7615: Fix assigning negative values to unsigned variable
scsi: aha152x: Fix aha152x_setup() __setup handler return value
scsi: hisi_sas: Free irq vectors in order for v3 HW
scsi: hisi_sas: Limit users changing debugfs BIST count value
net/smc: correct settings of RMB window update limit
mips: ralink: fix a refcount leak in ill_acc_of_setup()
macvtap: advertise link netns via netlink
tuntap: add sanity checks about msg_controllen in sendmsg
Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
Bluetooth: use memset avoid memory leaks
bnxt_en: Eliminate unintended link toggle during FW reset
PCI: endpoint: Fix misused goto label
MIPS: fix fortify panic when copying asm exception handlers
powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
powerpc/secvar: fix refcount leak in format_show()
scsi: libfc: Fix use after free in fc_exch_abts_resp()
can: isotp: set default value for N_As to 50 micro seconds
can: etas_es58x: es58x_fd_rx_event_msg(): initialize rx_event_msg before calling es58x_check_msg_len()
riscv: Fixed misaligned memory access. Fixed pointer comparison.
net: account alternate interface name memory
net: limit altnames to 64k total
net/mlx5e: Remove overzealous validations in netlink EEPROM query
net: sfp: add 2500base-X quirk for Lantech SFP module
usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm
mt76: fix monitor mode crash with sdio driver
xtensa: fix DTC warning unit_address_format
MIPS: ingenic: correct unit node address
Bluetooth: Fix use after free in hci_send_acl
netfilter: conntrack: revisit gc autotuning
netlabel: fix out-of-bounds memory accesses
ceph: fix inode reference leakage in ceph_get_snapdir()
ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
lib/Kconfig.debug: add ARCH dependency for FUNCTION_ALIGN option
init/main.c: return 1 from handled __setup() functions
minix: fix bug when opening a file with O_DIRECT
clk: si5341: fix reported clk_rate when output divider is 2
staging: vchiq_arm: Avoid NULL ptr deref in vchiq_dump_platform_instances
staging: vchiq_core: handle NULL result of find_service_by_handle
phy: amlogic: phy-meson-gxl-usb2: fix shared reset controller use
phy: amlogic: meson8b-usb2: Use dev_err_probe()
phy: amlogic: meson8b-usb2: fix shared reset control use
clk: rockchip: drop CLK_SET_RATE_PARENT from dclk_vop* on rk3568
cpufreq: CPPC: Fix performance/frequency conversion
opp: Expose of-node's name in debugfs
staging: wfx: fix an error handling in wfx_init_common()
w1: w1_therm: fixes w1_seq for ds28ea00 sensors
NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
NFSv4: Protect the state recovery thread against direct reclaim
habanalabs: fix possible memory leak in MMU DR fini
xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
clk: ti: Preserve node in ti_dt_clocks_register()
clk: Enforce that disjoints limits are invalid
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: swap IO handling is slightly different for O_DIRECT IO
NFS: swap-out must always use STABLE writes.
x86: Annotate call_on_stack()
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
serial: samsung_tty: do not unlock port->lock for uart_write_wakeup()
virtio_console: eliminate anonymous module_init & module_exit
jfs: prevent NULL deref in diFree
SUNRPC: Fix socket waits for write buffer space
NFS: nfsiod should not block forever in mempool_alloc()
NFS: Avoid writeback threads getting stuck in mempool_alloc()
selftests: net: Add tls config dependency for tls selftests
parisc: Fix CPU affinity for Lasi, WAX and Dino chips
parisc: Fix patch code locking and flushing
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
rtc: mc146818-lib: change return values of mc146818_get_time()
rtc: Check return value from mc146818_get_time()
rtc: mc146818-lib: fix RTC presence check
drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire()
Drivers: hv: vmbus: Fix potential crash on module unload
Revert "NFSv4: Handle the special Linux file open access mode"
NFSv4: fix open failure with O_ACCMODE flag
scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling
scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map()
scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
vdpa/mlx5: Rename control VQ workqueue to vdpa wq
vdpa/mlx5: Propagate link status from device to vdpa driver
vdpa: mlx5: prevent cvq work from hogging CPU
net: sfc: add missing xdp queue reinitialization
net/tls: fix slab-out-of-bounds bug in decrypt_internal
vrf: fix packet sniffing for traffic originating from ip tunnels
skbuff: fix coalescing for page_pool fragment recycling
ice: Clear default forwarding VSI during VSI release
mctp: Fix check for dev_hard_header() result
net: ipv4: fix route with nexthop object delete warning
net: stmmac: Fix unset max_speed difference between DT and non-DT platforms
drm/imx: imx-ldb: Check for null pointer after calling kmemdup
drm/imx: Fix memory leak in imx_pd_connector_get_modes
drm/imx: dw_hdmi-imx: Fix bailout in error cases of probe
regulator: rtq2134: Fix missing active_discharge_on setting
regulator: atc260x: Fix missing active_discharge_on setting
arch/arm64: Fix topology initialization for core scheduling
bnxt_en: Synchronize tx when xdp redirects happen on same ring
bnxt_en: reserve space inside receive page for skb_shared_info
bnxt_en: Prevent XDP redirect from running when stopping TX queue
sfc: Do not free an empty page_ring
RDMA/mlx5: Don't remove cache MRs when a delay is needed
RDMA/mlx5: Add a missing update of cache->last_add
IB/cm: Cancel mad on the DREQ event when the state is MRA_REP_RCVD
IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition
sctp: count singleton chunks in assoc user stats
dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe
ice: Set txq_teid to ICE_INVAL_TEID on ring creation
ice: Do not skip not enabled queues in ice_vc_dis_qs_msg
ipv6: Fix stats accounting in ip6_pkt_drop
ice: synchronize_rcu() when terminating rings
ice: xsk: fix VSI state check in ice_xsk_wakeup()
net: openvswitch: don't send internal clone attribute to the userspace.
net: ethernet: mv643xx: Fix over zealous checking of_get_mac_address()
net: openvswitch: fix leak of nested actions
rxrpc: fix a race in rxrpc_exit_net()
net: sfc: fix using uninitialized xdp tx_queue
net: phy: mscc-miim: reject clause 45 register accesses
qede: confirm skb is allocated before using
spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op()
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
drbd: Fix five use after free bugs in get_initial_state
scsi: ufs: ufshpb: Fix a NULL check on list iterator
io_uring: nospec index for tags on files update
io_uring: don't touch scm_fp_list after queueing skb
SUNRPC: Handle ENOMEM in call_transmit_status()
SUNRPC: Handle low memory situations in call_status()
SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
iommu/omap: Fix regression in probe for NULL pointer dereference
perf: arm-spe: Fix perf report --mem-mode
perf tools: Fix perf's libperf_print callback
perf session: Remap buf if there is no space for event
arm64: Add part number for Arm Cortex-A78AE
scsi: mpt3sas: Fix use after free in _scsih_expander_node_remove()
scsi: ufs: ufs-pci: Add support for Intel MTL
Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning"
mmc: block: Check for errors after write on SPI
mmc: mmci: stm32: correctly check all elements of sg list
mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete
mmc: core: Fixup support for writeback-cache for eMMC and SD
lz4: fix LZ4_decompress_safe_partial read out of bound
highmem: fix checks in __kmap_local_sched_{in,out}
mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
mm/mempolicy: fix mpol_new leak in shared_policy_replace
io_uring: don't check req->file in io_fsync_prep()
io_uring: defer splice/tee file validity check until command issue
io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFF
io_uring: fix race between timeout flush and removal
x86/pm: Save the MSR validity status at context setup
x86/speculation: Restore speculation related MSRs during S3 resume
perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
btrfs: fix qgroup reserve overflow the qgroup limit
btrfs: prevent subvol with swapfile from being deleted
spi: core: add dma_map_dev for __spi_unmap_msg()
arm64: patch_text: Fixup last cpu should be master
RDMA/hfi1: Fix use-after-free bug for mm struct
gpio: Restrict usage of GPIO chip irq members before initialization
x86/msi: Fix msi message data shadow struct
x86/mm/tlb: Revert retpoline avoidance approach
perf/x86/intel: Don't extend the pseudo-encoding to GP counters
ata: sata_dwc_460ex: Fix crash due to OOB write
perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator
perf/core: Inherit event_caps
irqchip/gic-v3: Fix GICR_CTLR.RWP polling
fbdev: Fix unregistering of framebuffers without device
amd/display: set backlight only if required
SUNRPC: Prevent immediate close+reconnect
drm/panel: ili9341: fix optional regulator handling
drm/amdgpu/display: change pipe policy for DCN 2.1
drm/amdgpu/smu10: fix SoC/fclk units in auto mode
drm/amdgpu/vcn: Fix the register setting for vcn1
drm/nouveau/pmu: Add missing callbacks for Tegra devices
drm/amdkfd: Create file descriptor after client is added to smi_clients list
drm/amdgpu: don't use BACO for reset in S3
KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255
net/smc: send directly on setting TCP_NODELAY
Revert "selftests: net: Add tls config dependency for tls selftests"
bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
selftests/bpf: Fix u8 narrow load checks for bpf_sk_lookup remote_port
rtc: mc146818-lib: fix signedness bug in mc146818_get_time()
SUNRPC: Don't call connect() more than once on a TCP socket
Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
perf python: Fix probing for some clang command line options
tools build: Filter out options and warnings not supported by clang
tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error"
KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
Revert "net/mlx5: Accept devlink user input after driver initialization complete"
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
selftests: cgroup: Test open-time credential usage for migration checks
selftests: cgroup: Test open-time cgroup namespace usage for migration checks
mm: don't skip swap entry even if zap_details specified
Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
x86/bug: Prevent shadowing in __WARN_FLAGS
sched: Teach the forced-newidle balancer about CPU affinity limitation.
x86,static_call: Fix __static_call_return0 for i386
irqchip/gic-v4: Wait for GICR_VPENDBASER.Dirty to clear before descheduling
powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.S
irqchip/gic, gic-v3: Prevent GSI to SGI translations
mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
static_call: Don't make __static_call_return0 static
powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
stacktrace: move filter_irq_stacks() to kernel/stacktrace.c
Linux 5.15.34
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I98049d0d8ebd427296418d31085bfde482ad30e7
e8507816d1
FROMLIST: mm: multi-gen LRU: thrashing prevention
Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as requested by many desktop users [1].
When set to value N, it prevents the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable lags due to thrashing. Larger values like N=3000 make lags less noticeable at the risk of premature OOM kills.
Compared with the size-based approach, e.g., [2], this time-based approach has the following advantages:
1. It is easier to configure because it is agnostic to applications and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.
[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
Link: https://lore.kernel.org/lkml/20220309021230.721028-12-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I482d33f3beaf7723d2f3eeaaa5b4f12bcb9b48a1
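A hypothetical example of setting the knob from C rather than the shell (the sysfs path comes from the commit above; the usual way is simply echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
	int fd = open("/sys/kernel/mm/lru_gen/min_ttl_ms", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* protect roughly the last second of working set, per the rationale above */
	if (write(fd, "1000", 4) != 4)
		perror("write");
	close(fd);
	return 0;
}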
76f7f07cbf
FROMLIST: mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1] which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree also have
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.
[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/
Link: https://lore.kernel.org/lkml/20220309021230.721028-11-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03
5280d76d38
FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-7: 1014393.57 39455.42
patch1-8: 1078507.59 41949.15
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
patch1-8
47.02% lzo1x_1_do_compress (real work)
6.73% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
3.39% walk_pte_range
2.63% ptep_clear_flush
2.29% __zram_bvec_write
2.10% do_raw_spin_lock
1.81% memmove
1.73% obj_malloc
1.53% free_unref_page_list
Configurations:
no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/lkml/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
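As an aside, a toy model of the "generational Bloom filter" idea mentioned in optimization 2 above might look like the following. This is purely illustrative and is not the kernel's data structure: it keeps two filters, queries the current one, repopulates the other during a walk, and swaps them at the end of an iteration so stale entries age out.
#include <stdint.h>
#include <string.h>
#define BLOOM_BITS   (1u << 15)          /* 32768 bits per filter */
#define BLOOM_WORDS  (BLOOM_BITS / 64)
struct gen_bloom {
	uint64_t bits[2][BLOOM_WORDS];       /* two generations of filters */
	unsigned int cur;                    /* index of the filter being queried */
};
static uint64_t hash64(uint64_t x, uint64_t seed)
{
	x ^= seed;
	x *= 0x9e3779b97f4a7c15ULL;          /* simple mix; any decent hash works */
	x ^= x >> 32;
	return x;
}
/* walkers record a populated branch in the filter being rebuilt */
static void bloom_set(struct gen_bloom *b, uint64_t key)
{
	unsigned int h1 = hash64(key, 1) % BLOOM_BITS;
	unsigned int h2 = hash64(key, 2) % BLOOM_BITS;
	uint64_t *next = b->bits[b->cur ^ 1];
	next[h1 / 64] |= 1ULL << (h1 % 64);
	next[h2 / 64] |= 1ULL << (h2 % 64);
}
/* walkers query the current filter to skip branches that were never recorded */
static int bloom_test(const struct gen_bloom *b, uint64_t key)
{
	unsigned int h1 = hash64(key, 1) % BLOOM_BITS;
	unsigned int h2 = hash64(key, 2) % BLOOM_BITS;
	const uint64_t *f = b->bits[b->cur];
	return (f[h1 / 64] >> (h1 % 64)) & (f[h2 / 64] >> (h2 % 64)) & 1;
}
/* at the end of an iteration the freshly populated filter becomes the
 * queried one, and the old one is cleared for reuse */
static void bloom_flip(struct gen_bloom *b)
{
	b->cur ^= 1;
	memset(b->bits[b->cur ^ 1], 0, sizeof(b->bits[0]));
}
int main(void)
{
	struct gen_bloom b = { .cur = 0 };
	bloom_set(&b, 0x7f0000);          /* record a branch during this iteration */
	bloom_flip(&b);                   /* generations swap at iteration end */
	return !bloom_test(&b, 0x7f0000); /* now visible to the query side */
}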
afd94c9ef9
FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.
This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[4, 6]%
Ops/sec KB/sec
patch1-6: 964656.80 37520.88
patch1-7: 1014393.57 39455.42
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
Configurations:
no change
Link: https://lore.kernel.org/lkml/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
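A hypothetical userspace model of the look-around idea described above: when one young entry is found, scan up to BITS_PER_LONG-1 neighbours in the same table, clear their accessed bits too, and return a bitmap of which ones were young so they can be aged without extra rmap trips. Names and types here are illustrative, not the kernel's.
#include <stdint.h>
#define BITS_PER_LONG 64
struct fake_pte {
	unsigned int accessed : 1;   /* stand-in for the hardware accessed bit */
};
/* Scan at most BITS_PER_LONG-1 entries around 'found', clearing the accessed
 * bit of every young neighbour; bit i of the result marks ptes[start + i]. */
static uint64_t look_around(struct fake_pte *ptes, unsigned long nr,
			    unsigned long found)
{
	unsigned long start = found > (BITS_PER_LONG - 1) / 2
			      ? found - (BITS_PER_LONG - 1) / 2 : 0;
	unsigned long end = start + (BITS_PER_LONG - 1);
	uint64_t young = 0;
	unsigned long i;
	if (end > nr)
		end = nr;
	for (i = start; i < end; i++) {
		if (!ptes[i].accessed)
			continue;
		ptes[i].accessed = 0;         /* test-and-clear in one pass */
		young |= 1ULL << (i - start); /* remember it was young */
	}
	return young;
}
int main(void)
{
	struct fake_pte ptes[128] = { 0 };
	ptes[40].accessed = 1;   /* the PTE the rmap walk found */
	ptes[42].accessed = 1;   /* a nearby young PTE picked up for free */
	return look_around(ptes, 128, 40) == 0;   /* expect a non-zero bitmap */
}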
a1537a68c5
FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[38, 40]%
IOPS BW
5.18-ed4643521e6a: 2547k 9989MiB/s
patch1-6: 3540k 13.5GiB/s
Single workload:
memcached (anon): +[103, 107]%
Ops/sec KB/sec
5.18-ed4643521e6a: 469048.66 18243.91
patch1-6: 964656.80 37520.88
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-ed4643521e6a
39.56% page_vma_mapped_walk
19.32% lzo1x_1_do_compress (real work)
7.18% do_raw_spin_lock
4.23% _raw_spin_unlock_irq
2.26% vma_interval_tree_subtree_search
2.12% vma_interval_tree_iter_next
2.11% folio_referenced_one
1.90% anon_vma_interval_tree_iter_first
1.47% ptep_clear_flush
0.97% __anon_vma_interval_tree_subtree_search
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/lkml/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
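A small sketch of two ideas from the description above, written as standalone C and not the kernel code: (1) a page accessed N times through file descriptors sits in tier order_base_2(N); (2) a simple refault-based feedback compares the anon and file types and evicts from the one whose evicted pages come back less often. The threshold logic of the real PID-style controller is not reproduced here.
#include <stdio.h>
/* order_base_2(N): ceil(log2(N)), with order_base_2(1) == 0 */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;
	while ((1UL << order) < n)
		order++;
	return order;
}
struct type_stats {
	unsigned long refaulted;   /* refaults observed for this type */
	unsigned long evicted;     /* pages evicted for this type */
};
/* Return 0 to evict file pages, 1 to evict anon pages: prefer the type
 * whose evicted pages refault less. */
static int pick_type_to_evict(const struct type_stats *file,
			      const struct type_stats *anon)
{
	/* compare refaulted/evicted ratios without dividing: cross-multiply */
	unsigned long file_rate = file->refaulted * (anon->evicted + 1);
	unsigned long anon_rate = anon->refaulted * (file->evicted + 1);
	return anon_rate < file_rate;   /* 1 == evict anon */
}
int main(void)
{
	struct type_stats file = { .refaulted = 120, .evicted = 1000 };
	struct type_stats anon = { .refaulted = 15,  .evicted = 800 };
	printf("4 accesses -> tier %u\n", order_base_2(4));   /* tier 2 */
	printf("evict %s first\n",
	       pick_type_to_evict(&file, &anon) ? "anon" : "file");
	return 0;
}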
f88ed5a3d3
FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0.
There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads.
There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
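A short worked example of the sequence-number bookkeeping the message above describes, as a standalone C sketch (constants and helper names are illustrative; only the relationships come from the text): max_seq and min_seq grow monotonically, while the value kept in the page flags is truncated so it fits in order_base_2(MAX_NR_GENS+1) bits, with 0 meaning "not on a multi-gen list" and 1..MAX_NR_GENS indexing lrugen->lists[].
#include <assert.h>
#define MIN_NR_GENS 2UL
#define MAX_NR_GENS 4UL
/* list index for a sequence number */
static unsigned long lru_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}
/* value stored in the page flags: shifted by one so 0 can mean "off LRU" */
static unsigned long gen_counter_from_seq(unsigned long seq)
{
	return lru_gen_from_seq(seq) + 1;   /* in [1, MAX_NR_GENS] */
}
int main(void)
{
	unsigned long max_seq = 9, min_seq = 7;
	/* the sliding window keeps between MIN_NR_GENS and MAX_NR_GENS generations */
	assert(max_seq - min_seq + 1 >= MIN_NR_GENS);
	assert(max_seq - min_seq + 1 <= MAX_NR_GENS);
	/* within such a window, truncated indices of live generations never collide */
	assert(lru_gen_from_seq(max_seq) != lru_gen_from_seq(min_seq));
	assert(gen_counter_from_seq(max_seq) >= 1);
	return 0;
}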
429f413ed8
mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
commit a431dbbc540532b7465eae4fc8b56a85a9fc7d17 upstream.
The gcc 12 compiler reports a "'mem_section' will never be NULL" warning
on the following code:
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
if (!mem_section)
return NULL;
#endif
if (!mem_section[SECTION_NR_TO_ROOT(nr)])
return NULL;
:
It happens with CONFIG_SPARSEMEM_EXTREME off. The mem_section definition
is
#ifdef CONFIG_SPARSEMEM_EXTREME
extern struct mem_section **mem_section;
#else
extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
#endif
In the !CONFIG_SPARSEMEM_EXTREME case, mem_section is a static
2-dimensional array and so the check "!mem_section[SECTION_NR_TO_ROOT(nr)]"
doesn't make sense.
Fix this warning by moving the "!mem_section[SECTION_NR_TO_ROOT(nr)]"
check up inside the CONFIG_SPARSEMEM_EXTREME block and adding an
explicit NR_SECTION_ROOTS check to make sure that there is no
out-of-bound array access.
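A sketch of what the helper looks like with that change applied, reconstructed from the description above (see the referenced upstream commit for the authoritative version):
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
	unsigned long root = SECTION_NR_TO_ROOT(nr);
	/* explicit bound check so the array access below cannot go out of bounds */
	if (unlikely(root >= NR_SECTION_ROOTS))
		return NULL;
#ifdef CONFIG_SPARSEMEM_EXTREME
	/* only the dynamically allocated layout can have NULL entries */
	if (!mem_section || !mem_section[root])
		return NULL;
#endif
	return &mem_section[root][nr & SECTION_ROOT_MASK];
}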
Link: https://lkml.kernel.org/r/20220331180246.2746210-1-longman@redhat.com
Fixes:
d831f07038
ANDROID: vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux Kernel in one of two ways:
1) Asynchronously via kswapd
2) Synchronously, via direct reclaim
At page allocation time the allocating task is immediately given a page
from the zone free list allowing it to go right back to work doing
whatever it was doing; Probably directly or indirectly executing business
logic.
Just prior to satisfying the allocation, the number of free pages is checked to see
whether it has reached the zone low watermark and, if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often
ranging from tens to hundreds of milliseconds for small order-0 allocations to
half a second or more for order-9 huge-page allocations. In fact, kswapd is
not actually required on a linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.
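A minimal model of the allocation-time behaviour described above, with made-up names (this is not the kernel's watermark code): above the low watermark, pages come straight off the free list; between min and low the allocation still succeeds but kswapd is woken; at or below min the allocating task has to direct-reclaim.
#include <stdio.h>
enum alloc_path { FAST_PATH, WAKE_KSWAPD, DIRECT_RECLAIM };
struct fake_zone {
	long free_pages;
	long watermark_min;
	long watermark_low;
};
static enum alloc_path classify_alloc(const struct fake_zone *z)
{
	if (z->free_pages <= z->watermark_min)
		return DIRECT_RECLAIM;         /* allocating task reclaims itself */
	if (z->free_pages <= z->watermark_low)
		return WAKE_KSWAPD;            /* background reclaim keeps up */
	return FAST_PATH;                      /* plenty of free pages */
}
int main(void)
{
	struct fake_zone z = { .free_pages = 900, .watermark_min = 512,
			       .watermark_low = 1024 };
	printf("path = %d\n", classify_alloc(&z));   /* WAKE_KSWAPD */
	return 0;
}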
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much so I just grab the highest value.
Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit
f47b852faa
ANDROID: implement wrapper for reverse migration
Reverse migration is used to balance the occupancy of memory zones within a node when the imbalance is caused by an operation that migrates pages to other zones, e.g., hot-removing and then hot-adding the same memory. In this case there is a lot of free memory in the newly hot-added region, which can be filled with the pages previously migrated out (as part of offline/hotremove), thus relieving some pressure in the other zones of the node.
Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Bug: 201263307
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
a8b5dc3032
Merge 5.15.17 into android13-5.15
Changes in 5.15.17
KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU
KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock
HID: Ignore battery for Elan touchscreen on HP Envy X360 15t-dr100
HID: uhid: Fix worker destroying device without any protection
HID: wacom: Reset expected and received contact counts at the same time
HID: wacom: Ignore the confidence flag when a touch is removed
HID: wacom: Avoid using stale array indicies to read contact count
ALSA: core: Fix SSID quirk lookup for subvendor=0
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: fix to do sanity check in is_alive()
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()
mtd: rawnand: gpmi: Add ERR007117 protection for nfc_apply_timings
mtd: rawnand: gpmi: Remove explicit default gpmi clock setting for i.MX6
mtd: Fixed breaking list in __mtd_del_partition.
mtd: rawnand: davinci: Don't calculate ECC when reading page
mtd: rawnand: davinci: Avoid duplicated page read
mtd: rawnand: davinci: Rewrite function description
mtd: rawnand: Export nand_read_page_hwecc_oob_first()
mtd: rawnand: ingenic: JZ4740 needs 'oob_first' read page function
riscv: Get rid of MAXPHYSMEM configs
RISC-V: Use common riscv_cpuid_to_hartid_mask() for both SMP=y and SMP=n
riscv: try to allocate crashkern region from 32bit addressible memory
riscv: Don't use va_pa_offset on kdump
riscv: use hart id instead of cpu id on machine_kexec
riscv: mm: fix wrong phys_ram_base value for RV64
x86/gpu: Reserve stolen memory for first integrated Intel GPU
tools/nolibc: x86-64: Fix startup code bug
crypto: x86/aesni - don't require alignment of data
tools/nolibc: i386: fix initial stack alignment
tools/nolibc: fix incorrect truncation of exit code
rtc: cmos: take rtc_lock while reading from CMOS
net: phy: marvell: add Marvell specific PHY loopback
ksmbd: uninitialized variable in create_socket()
ksmbd: fix guest connection failure with nautilus
ksmbd: add support for smb2 max credit parameter
ksmbd: move credit charge deduction under processing request
ksmbd: limits exceeding the maximum allowable outstanding requests
ksmbd: add reserved room in ipc request/response
media: cec: fix a deadlock situation
media: ov8865: Disable only enabled regulators on error path
media: v4l2-ioctl.c: readbuffers depends on V4L2_CAP_READWRITE
media: flexcop-usb: fix control-message timeouts
media: mceusb: fix control-message timeouts
media: em28xx: fix control-message timeouts
media: cpia2: fix control-message timeouts
media: s2255: fix control-message timeouts
media: dib0700: fix undefined behavior in tuner shutdown
media: redrat3: fix control-message timeouts
media: pvrusb2: fix control-message timeouts
media: stk1160: fix control-message timeouts
media: cec-pin: fix interrupt en/disable handling
can: softing_cs: softingcs_probe(): fix memleak on registration failure
mei: hbm: fix client dma reply status
iio: adc: ti-adc081c: Partial revert of removal of ACPI IDs
iio: trigger: Fix a scheduling whilst atomic issue seen on tsc2046
lkdtm: Fix content of section containing lkdtm_rodata_do_nothing()
bus: mhi: pci_generic: Graceful shutdown on freeze
bus: mhi: core: Fix reading wake_capable channel configuration
bus: mhi: core: Fix race while handling SYS_ERR at power up
cxl/pmem: Fix reference counting for delayed work
arm64: errata: Fix exec handling in erratum
240e8d331a
mm_zone: add function to check if managed dma zone exists
commit 62b3107073646e0946bd97ff926832bafb846d17 upstream.
Patch series "Handle warning of allocation failure on DMA zone w/o managed pages", v4.
**Problem observed:
On x86_64, when a crash is triggered and the kdump kernel is entered, a page allocation failure can always be seen:
---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------
***Root cause:
The current kernel assumes that the DMA zone must have managed pages and tries to request pages from it if CONFIG_ZONE_DMA is enabled. This is not always true. E.g., in the kdump kernel of x86_64, only the low 1M is present and locked down at a very early stage of boot, so this low 1M is never added to the buddy allocator to become managed pages of the DMA zone. This exception will always cause a page allocation failure if a page is requested from the DMA zone.
***Investigation:
This failure happens since the below commit merged into Linus's tree.
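Per the subject, the patch adds a helper that reports whether any DMA zone has managed pages. A sketch along those lines, using real mm identifiers but reconstructed from the description rather than copied from the patch:
#ifdef CONFIG_ZONE_DMA
bool has_managed_dma(void)
{
	struct pglist_data *pgdat;
	/* walk every online node and check its DMA zone for managed pages */
	for_each_online_pgdat(pgdat) {
		struct zone *zone = &pgdat->node_zones[ZONE_DMA];
		if (managed_zone(zone))
			return true;
	}
	return false;
}
#endif /* CONFIG_ZONE_DMA */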
37b2d597bb
ANDROID: mm: add cma pcp list
Add a PCP list for __GFP_CMA allocations so as not to deprive MIGRATE_MOVABLE allocations of quick access to pages on their PCP lists.
Bug: 158645321
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts related to new mm features]
Signed-off-by: Isaac J. Manjarres <isaacm@quicinc.com>
Change-Id: I2f238ea5f8e4aef9c45b1a3180ce6b6a36d63d77
af82009880
ANDROID: cma: redirect page allocation to CMA
CMA pages are designed to be used as a fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. Such userspace pages can be migrated or dropped quickly, which gives decent utilization. Additionally, add fallbacks for failed CMA allocations in rmqueue() and __rmqueue_pcplist() (the latter addition being driven by a report from the kernel test robot); these fallbacks were handled differently in the original version of the patch, as the rmqueue() call chain has since changed.
Bug: 158645321
Link: https://lore.kernel.org/lkml/cover.1604282969.git.cgoldswo@codeaurora.org/
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[cgoldswo@codeaurora.org: Place in bugfixes; remove cma_alloc zone flag]
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts to account for new mm features]
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Change-Id: I3dfbc42f1d12416143550042182bf16030ca7190
||
|
|
c0d1ebaba1 |
Merge 2d338201d5 ("Merge branch 'akpm' (patches from Andrew)") into android-mainline
Steps on the way to 5.15-rc1 Resolves merge conflict in: fs/proc/base.c Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic554ca8447961e52fbc6f27d91470a816b59a771 |
||
|
|
c2b303f98f |
Merge 4e71add028 ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline
Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28 |
||
|
|
2d338201d5 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on
|
||
|
|
4b09700244 |
mm: track present early pages per zone
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is required for fast memory hotplug for drivers that add individual
memory blocks, like virtio-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enables auto-onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-175: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
upstream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really required.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar to
adjust_managed_page_count(), and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
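As a sketch of the mechanics described above (field and helper names such as present_early_pages are taken from the description; the exact upstream code may differ slightly):

void adjust_present_page_count(struct page *page, long nr_pages)
{
        struct zone *zone = page_zone(page);

        /*
         * Memory blocks are onlined/offlined as a whole, so a block is
         * either fully "early" (boot memory) or fully hotplugged.
         */
        if (early_section(__pfn_to_section(page_to_pfn(page))))
                zone->present_early_pages += nr_pages;
        zone->present_pages += nr_pages;
        zone->zone_pgdat->node_present_pages += nr_pages;
}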
Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
859a85ddf9 |
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
65d759c8f9 |
mm: compaction: support triggering of proactive compaction by user
The proactive compaction [1] is triggered every 500 msec and runs compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages, based on the value set in sysctl.compaction_proactiveness. Triggering compaction every 500 msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on embedded systems which may have only a few MB of RAM. Enabling proactive compaction in its current state would end up running almost always on such systems.
On the other hand, proactive compaction can still be very useful for getting a set of higher-order pages in a controllable manner (controlled by sysctl.compaction_proactiveness). So, on systems where always-enabled proactive compaction may prove unnecessary, it can still be triggered from user space by writing to its sysctl interface. As an example, an app launcher deciding to launch a memory-heavy application, which can be launched faster if it gets more higher-order pages, can prepare the system in advance by triggering proactive compaction from userspace.
This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by the user.
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a
[akpm@linux-foundation.org: tweak vm.rst, per Mike]
Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
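A rough sketch of how such a trigger can be wired up (the handler name, the proactive_compact_trigger field and other details are illustrative and may not match the upstream patch exactly): a write to the sysctl asks kcompactd on every node to do one proactive compaction pass.

int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
{
        int rc, nid;

        rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (rc)
                return rc;

        if (write && sysctl_compaction_proactiveness) {
                for_each_online_node(nid) {
                        pg_data_t *pgdat = NODE_DATA(nid);

                        if (pgdat->proactive_compact_trigger)
                                continue;

                        /* One-shot request; kcompactd clears it when done. */
                        pgdat->proactive_compact_trigger = true;
                        wake_up_interruptible(&pgdat->kcompactd_wait);
                }
        }

        return 0;
}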
||
|
|
01c8d337d1 |
mm/sparse: set SECTION_NID_SHIFT to 6
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bits 3
and 4 can be overlapped by the sub-field for the early NID, and can be
unexpectedly set on NUMA systems. There are a few non-critical issues
related to this:
- Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces
pfn_to_online_page() through the slow path, but doesn't actually break
the kernel.
- A kdump generation tool like makedumpfile uses this field to calculate
the physical address to read. So wrong bits can make the tool access the
wrong address and fail to create a kdump. This can be avoided by the
tool, so it's not critical.
To fix it, set SECTION_NID_SHIFT to 6, which is the minimum number of
available bits of the section flag field.
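The gist of the fix is a one-line change in include/linux/mmzone.h (a sketch; surrounding definitions abridged), leaving the low bits entirely to the section flag field:

/* Bits 0-5 are reserved for section flags such as SECTION_MARKED_PRESENT,
 * SECTION_HAS_MEM_MAP, SECTION_IS_ONLINE, SECTION_IS_EARLY and
 * SECTION_TAINT_ZONE_DEVICE, so the early NID sub-field now starts at
 * bit 6 instead of bit 3.
 */
#define SECTION_NID_SHIFT       6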
Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev
Fixes:
|
||
|
|
11e02d3729 |
mm: sparse: remove __section_nr() function
As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
293f275f4d |
Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1 Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
8e658623d4 |
Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1 Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
351de44fde |
mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking, and apply any side-effects in the parameter list.
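A sketch of the FLATMEM variant after the change (config guard abridged); as a real function, the nid argument is type-checked and any side-effects in the argument are evaluated even though the value is unused:

/* FLATMEM / single-node configuration */
extern struct pglist_data contig_page_data;

static inline struct pglist_data *NODE_DATA(int nid)
{
        return &contig_page_data;
}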
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
041711ce7c |
mm: fix spelling mistakes
Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
16c9afc776 |
arm64/mm: drop HAVE_ARCH_PFN_VALID
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
51c656aef6 |
include/linux/mmzone.h: add documentation for pfn_valid()
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
44042b4498 |
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
The per-cpu page allocator (PCP) only stores order-0 pages. This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock. This patch extends the PCP allocator to store THP and
"cheap" high-order pages. Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.
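Conceptually, each PCP list is now indexed by (migratetype, order) rather than by migratetype alone; a sketch of such an index helper (name and details assumed, may differ from the final upstream code), with orders above PAGE_ALLOC_COSTLY_ORDER collapsed onto a single THP slot:

static inline unsigned int order_to_pindex(int migratetype, int order)
{
        int base = order;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order > PAGE_ALLOC_COSTLY_ORDER) {
                VM_BUG_ON(order != pageblock_order);
                base = PAGE_ALLOC_COSTLY_ORDER + 1;
        }
#endif

        return (MIGRATE_PCPTYPES * base) + migratetype;
}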
Note that this is not necessarily a universal performance win because of
how it is implemented. High-order pages can cause pcp->high to be
exceeded prematurely for lower-orders so for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.
That said, basic performance testing passed. The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.
netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*
Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations. The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.
Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
43b02ba93b |
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP configuration option is equivalent to FLATMEM. Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead. Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9ee6cf5c6 |
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA configuration options are equivalent. Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead. Done with $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \ $(git grep -wl CONFIG_NEED_MULTIPLE_NODES) $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \ $(git grep -wl NEED_MULTIPLE_NODES) with manual tweaks afterwards. [rppt@linux.ibm.com: fix arm boot crash] Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb1c50d396 |
mm: remove CONFIG_DISCONTIGMEM
There are no architectures that support DISCONTIGMEM left. Remove the configuration option and the dead code it was guarding in the generic memory management code. Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
777c00f5ed |
mm: drop SECTION_SHIFT in code comments
Actually SECTIONS_SHIFT is used in the kernel code, so the code comments
are strictly incorrect. And since commit
|
||
|
|
74f4482209 |
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63
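A simplified sketch of how pcp->high can be derived when the new sysctl is set (the helper name and exact clamping are illustrative, not the verbatim upstream code): take the configured fraction of the zone's managed pages and split it across the CPUs local to that zone, leaving pcp->batch untouched.

static int pcp_high_from_fraction(struct zone *zone, int fraction, int batch)
{
        unsigned long total_pages = zone_managed_pages(zone) / fraction;
        unsigned int nr_local_cpus =
                max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
        int high = total_pages / nr_local_cpus;

        /* Keep at least a few batches worth of pages per CPU. */
        return max(high, batch << 2);
}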
[mgorman@techsingularity.net: fix documentation]
Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c49c2c47da |
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
When kswapd is active then direct reclaim is potentially active. In either case, it is possible that a zone would be balanced if pages were not trapped on PCP lists. Instead of draining remote pages, simply limit the size of the PCP lists while kswapd is active. Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
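A sketch of the clamp described above (the function name and the ZONE_RECLAIM_ACTIVE zone flag are assumptions here; upstream details may vary): while the zone is flagged as under active reclaim, the effective pcp->high shrinks to a few batches so pages drain back to the buddy allocator quickly.

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
{
        int high = READ_ONCE(pcp->high);

        if (unlikely(!high))
                return 0;

        if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
                return high;

        /* Reclaim is active: keep only a small number of pages on the PCP. */
        return min(READ_ONCE(pcp->batch) << 2, high);
}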
||
|
|
3b12e7e979 |
mm/page_alloc: scale the number of pages that are batch freed
When a task is freeing a large number of order-0 pages, it may acquire the zone->lock multiple times freeing pages in batches. This may unnecessarily contend on the zone lock when freeing a very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent frees.
The machines I used to test this are not large enough to illustrate the problem, but a debugging patch shows patterns like the following (slightly edited for clarity):
Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
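A sketch of the scaling logic (a per-cpu free_factor counter is assumed here; not necessarily the exact upstream code): each consecutive round of frees without an intervening allocation doubles the number of pages freed per zone->lock acquisition, clamped between pcp->batch and pcp->high.

static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
{
        int min_nr_free = batch;        /* always leave one batch behind */
        int max_nr_free = high - batch; /* never free more than this at once */

        /* Double the batch each time frees arrive without any allocation. */
        batch <<= pcp->free_factor;
        if (batch < max_nr_free)
                pcp->free_factor++;

        return clamp(batch, min_nr_free, max_nr_free);
}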
||
|
|
bbbecb35a4 |
mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2. The per-cpu page allocator (PCP) is meant to reduce contention on the zone lock but the sizing of batch and high is archaic and neither takes the zone size into account or the number of CPUs local to a zone. With larger zones and more CPUs per node, the contention is getting worse. Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch and high values means that the sysctl can reduce zone lock contention but also increase allocation latencies. This series disassociates pcp->high from pcp->batch and then scales pcp->high based on the size of the local zone with limited impact to reclaim and accounting for active CPUs but leaves pcp->batch static. It also adapts the number of pages that can be on the pcp list based on recent freeing patterns. The motivation is partially to adjust to larger memory sizes but is also driven by the fact that large batches of page freeing via release_pages() often shows zone contention as a major part of the problem. Another is a bug report based on an older kernel where a multi-terabyte process can takes several minutes to exit. A workaround was to use vm.percpu_pagelist_fraction to increase the pcp->high value but testing indicated that a production workload could not use the same values because of an increase in allocation latencies. Unfortunately, I cannot reproduce this test case myself as the multi-terabyte machines are in active use but it should alleviate the problem. The series aims to address both and partially acts as a pre-requisite. pcp only works with order-0 which is useless for SLUB (when using high orders) and THP (unconditionally). To store high-order pages on PCP, the pcp->high values need to be increased first. This patch (of 6): The vm.percpu_pagelist_fraction is used to increase the batch and high limits for the per-cpu page allocator (PCP). The intent behind the sysctl is to reduce zone lock acquisition when allocating/freeing pages but it has a problem. While it can decrease contention, it can also increase latency on the allocation side due to unreasonably large batch sizes. This leads to games where an administrator adjusts percpu_pagelist_fraction on the fly to work around contention and allocation latency problems. This series aims to alleviate the problems with zone lock contention while avoiding the allocation-side latency problems. For the purposes of review, it's easier to remove this sysctl now and reintroduce a similar sysctl later in the series that deals only with pcp->high. Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f19298b951 |
mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign etc., but nothing relies on them being perfectly accurate for functional correctness. The counters are used by userspace to get a general overview of a workload's NUMA behaviour but the page allocator incurs a high cost to maintain perfect accuracy similar to what is required for a vmstat counter like NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to turn off the collection of NUMA statistics like NUMA_HIT.
This patch converts NUMA_HIT and friends to be NUMA events with similar accuracy to VM events. There is a possibility that slight errors will be introduced but the overall trend as seen by userspace will be similar. The counters are no longer updated from vmstat_refresh context as it is unnecessary overhead for counters that may never be read by userspace. Note that counters could be maintained at the node level to save space but it would have a user-visible impact due to /proc/zoneinfo.
[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dbbee9d5cd |
mm/page_alloc: convert per-cpu list protection to local_lock
There is a lack of clarity of what exactly local_irq_save/local_irq_restore protects in page_alloc.c. It conflates the protection of per-cpu page allocation structures with per-cpu vmstat deltas.
This patch protects the PCP structure using local_lock which for most configurations is identical to IRQ enabling/disabling. The scope of the lock is still wider than it should be but this is decreased later.
It is possible for the local_lock to be embedded safely within struct per_cpu_pages but it adds complexity to free_unref_page_list.
[akpm@linux-foundation.org: coding style fixes]
[mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
[lkp@intel.com: Make pagesets static]
Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
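A minimal sketch of the pattern this converts to (struct and field names assumed): a per-CPU local_lock guards the PCP structure, which on !PREEMPT_RT compiles down to the same IRQ disable/enable while giving PREEMPT_RT and PROVE_LOCKING something meaningful to work with.

struct pagesets {
        local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

/* Allocation fast path: local_lock instead of local_irq_save(). */
static struct page *rmqueue_pcplist_sketch(struct zone *zone, int migratetype,
                unsigned int alloc_flags, struct list_head *list)
{
        struct per_cpu_pages *pcp;
        struct page *page;
        unsigned long flags;

        local_lock_irqsave(&pagesets.lock, flags);
        pcp = this_cpu_ptr(zone->per_cpu_pageset);
        page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
        local_unlock_irqrestore(&pagesets.lock, flags);
        return page;
}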
||
|
|
28f836b677 |
mm/page_alloc: split per cpu page lists and zone stats
The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.
This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.
Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst
local_irq_disable();
spin_lock(&lock);
The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a
bonus, local_lock also means that PROVE_LOCKING does something useful.
After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.
The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.
Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.
No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
Baseline Patched
1 56.383 54.225 (+3.83%)
2 40.047 35.492 (+11.38%)
3 37.339 32.643 (+12.58%)
4 35.578 30.992 (+12.89%)
8 33.592 29.606 (+11.87%)
16 32.362 28.532 (+11.85%)
32 31.476 27.728 (+11.91%)
64 30.633 27.252 (+11.04%)
128 30.596 27.090 (+11.46%)
While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.
This patch (of 9):
The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cache
conflict with adjacent per-cpu lists incurring a runtime cost or padding
is required incurring a memory cost.
This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures. It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).
Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.
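In outline, the split described above ends up with two per-CPU structures along these lines (fields abridged; the exact layout may differ):

/* Hot: consulted on every per-cpu page allocation and free. */
struct per_cpu_pages {
        int count;              /* number of pages in the lists */
        int high;               /* high watermark, emptying needed */
        int batch;              /* chunk size for buddy add/remove */

        /* Lists of pages, one per migrate type stored on the pcp-lists */
        struct list_head lists[MIGRATE_PCPTYPES];
};

/* Cold: only vmstat deltas, with relaxed accuracy requirements. */
struct per_cpu_zonestat {
#ifdef CONFIG_SMP
        s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
        s8 stat_threshold;
#endif
#ifdef CONFIG_NUMA
        u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
#endif
};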
[mgorman@techsingularity.net: make it W=1 cleaner]
Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
[mgorman@techsingularity.net: make it W=1 even cleaner]
Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
[lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b19bd1c976 |
mm/mmzone.h: simplify is_highmem_idx()
There is a lot of historical ifdefery in is_highmem_idx() and its helper zone_movable_is_highmem() that was required because of two different paths for nodes and zones initialization that were selected at compile time. Until commit |
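After the simplification, the helper reduces to roughly the following (a sketch of the end result):

static inline bool is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
        return (idx == ZONE_HIGHMEM ||
                (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
#else
        return false;
#endif
}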
||
|
|
85adc860fd |
Merge 6efb943b86 Linux 5.13-rc1 into android-mainline
One giant leap, all the way up to 5.13-rc1 Also take the opportunity to re-align (a.k.a. fix a couple of previous merge conflict fix-up issues) which occurred during this merge-window. Fixes: |
||
|
|
163e8b4fec |
Merge d652502ef4 Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into android-mainline
A tiny step en route to v5.13-rc1 Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I049e80976042ebffc90bb080f09da0afcfd48d77 |
||
|
|
cb152a1a95 |
mm: fix some typos and code style problems
fix some typos and code style problems in mm. gfp.h: s/MAXNODES/MAX_NUMNODES mmzone.h: s/then/than rmap.c: s/__vma_split()/__vma_adjust() swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is swap_state.c: s/whoes/whose z3fold.c: code style problem fix in z3fold_unregister_migration zsmalloc.c: s/of/or, s/give/given Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a08a2ae346 |
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
This can even lead to extreme cases where system goes OOM because
the physically hotplugged memory depletes the available memory before
it is onlined.
b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
c) It might be that there are no PMD_ALIGNED chunks, so the memmap array gets
populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
There are some non-obvious things to consider though.
Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed. This means that the reserved physical range is not
online although it is used. The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns. The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined. For example, hibernation wouldn't save the
content of those vmemmaps into the image so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover anyway
while metadata is reachable from other data structures (e.g. vmemmap
page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.
As per above, the functions that are introduced are:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
fully span.
- mhp_deinit_memmap_on_memory:
Offlines as many sections as vmemmap pages fully span, removes the
range from the zone by remove_pfn_range_from_zone(), and calls
kasan_remove_zero_shadow() for the range.
The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages(). Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
present_pages is done at the end once we know that online_pages()
succeeded.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages(). This is necessary because
offline_pages() tears down some structures based on the fact whether the
node or the zone become empty. If offline_pages() fails, we account back
vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory().
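A condensed sketch of the online path described above (error handling and details trimmed; names follow the description, the exact code may differ):

static int memory_block_online(struct memory_block *mem)
{
        unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
        unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
        struct zone *zone;
        int ret;

        /* Pick the zone for the whole range (ZONE_NORMAL vs ZONE_MOVABLE). */
        zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);

        /* Initialize the pages backing the memmap before onlining the rest. */
        if (nr_vmemmap_pages) {
                ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
                if (ret)
                        return ret;
        }

        ret = online_pages(start_pfn + nr_vmemmap_pages,
                           nr_pages - nr_vmemmap_pages, zone);
        if (ret) {
                if (nr_vmemmap_pages)
                        mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
                return ret;
        }

        /* Vmemmap pages are accounted as present only once onlining succeeded. */
        return 0;
}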
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
d1e153fea2 |
mm/gup: migrate pinned pages out of movable zone
We should not pin pages in ZONE_MOVABLE. Currently, the only movable pages we avoid pinning are CMA pages. Generalize the function that migrates CMA pages to migrate all movable pages. Use is_pinnable_page() to check which pages need to be migrated. Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
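For reference, a sketch of what such a predicate boils down to (the exact upstream definition may differ): a page is long-term pinnable unless it sits in ZONE_MOVABLE or on a CMA pageblock, with the shared zero page treated as pinnable.

static inline bool is_pinnable_page(struct page *page)
{
        /* The shared zero page can always be pinned. */
        if (is_zero_pfn(page_to_pfn(page)))
                return true;

        return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
                 is_migrate_cma_page(page));
}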
||
|
|
9afaf30f7a |
mm/gup: do not migrate zero page
On some platforms ZERO_PAGE(0) might end-up in a movable zone. Do not migrate zero page in gup during longterm pinning as migration of zero page is not allowed. For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I see the following: Boot#1: zero_pfn 0x48a8d zero_pfn zone: ZONE_DMA32 Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE On x86, empty_zero_page is declared in .bss and depending on the loader may end up in different physical locations during boots. Also, move is_zero_pfn() my_zero_pfn() functions under CONFIG_MMU, because zero_pfn that they are using is declared in memory.c which is compiled with CONFIG_MMU. Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |