f47b852faa4afaae0e534b616f2864bb587286e4
502 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f47b852faa |
ANDROID: implement wrapper for reverse migration
Reverse migration is used to do the balancing the occupancy of memory zones in a node in the system whose imabalance may be caused by migration of pages to other zones by an operation, eg: hotremove and then hotadding the same memory. In this case there is a lot of free memory in newly hotadd memory which can be filled up by the previous migrated pages(as part of offline/hotremove) thus may free up some pressure in other zones of the node. Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/ Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c Bug: 201263307 Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com> |
||
|
|
a8b5dc3032 |
Merge 5.15.17 into android13-5.15
Changes in 5.15.17
KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU
KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock
HID: Ignore battery for Elan touchscreen on HP Envy X360 15t-dr100
HID: uhid: Fix worker destroying device without any protection
HID: wacom: Reset expected and received contact counts at the same time
HID: wacom: Ignore the confidence flag when a touch is removed
HID: wacom: Avoid using stale array indicies to read contact count
ALSA: core: Fix SSID quirk lookup for subvendor=0
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: fix to do sanity check in is_alive()
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()
mtd: rawnand: gpmi: Add ERR007117 protection for nfc_apply_timings
mtd: rawnand: gpmi: Remove explicit default gpmi clock setting for i.MX6
mtd: Fixed breaking list in __mtd_del_partition.
mtd: rawnand: davinci: Don't calculate ECC when reading page
mtd: rawnand: davinci: Avoid duplicated page read
mtd: rawnand: davinci: Rewrite function description
mtd: rawnand: Export nand_read_page_hwecc_oob_first()
mtd: rawnand: ingenic: JZ4740 needs 'oob_first' read page function
riscv: Get rid of MAXPHYSMEM configs
RISC-V: Use common riscv_cpuid_to_hartid_mask() for both SMP=y and SMP=n
riscv: try to allocate crashkern region from 32bit addressible memory
riscv: Don't use va_pa_offset on kdump
riscv: use hart id instead of cpu id on machine_kexec
riscv: mm: fix wrong phys_ram_base value for RV64
x86/gpu: Reserve stolen memory for first integrated Intel GPU
tools/nolibc: x86-64: Fix startup code bug
crypto: x86/aesni - don't require alignment of data
tools/nolibc: i386: fix initial stack alignment
tools/nolibc: fix incorrect truncation of exit code
rtc: cmos: take rtc_lock while reading from CMOS
net: phy: marvell: add Marvell specific PHY loopback
ksmbd: uninitialized variable in create_socket()
ksmbd: fix guest connection failure with nautilus
ksmbd: add support for smb2 max credit parameter
ksmbd: move credit charge deduction under processing request
ksmbd: limits exceeding the maximum allowable outstanding requests
ksmbd: add reserved room in ipc request/response
media: cec: fix a deadlock situation
media: ov8865: Disable only enabled regulators on error path
media: v4l2-ioctl.c: readbuffers depends on V4L2_CAP_READWRITE
media: flexcop-usb: fix control-message timeouts
media: mceusb: fix control-message timeouts
media: em28xx: fix control-message timeouts
media: cpia2: fix control-message timeouts
media: s2255: fix control-message timeouts
media: dib0700: fix undefined behavior in tuner shutdown
media: redrat3: fix control-message timeouts
media: pvrusb2: fix control-message timeouts
media: stk1160: fix control-message timeouts
media: cec-pin: fix interrupt en/disable handling
can: softing_cs: softingcs_probe(): fix memleak on registration failure
mei: hbm: fix client dma reply status
iio: adc: ti-adc081c: Partial revert of removal of ACPI IDs
iio: trigger: Fix a scheduling whilst atomic issue seen on tsc2046
lkdtm: Fix content of section containing lkdtm_rodata_do_nothing()
bus: mhi: pci_generic: Graceful shutdown on freeze
bus: mhi: core: Fix reading wake_capable channel configuration
bus: mhi: core: Fix race while handling SYS_ERR at power up
cxl/pmem: Fix reference counting for delayed work
arm64: errata: Fix exec handling in erratum
|
||
|
|
240e8d331a |
mm_zone: add function to check if managed dma zone exists
commit 62b3107073646e0946bd97ff926832bafb846d17 upstream. Patch series "Handle warning of allocation failure on DMA zone w/o managed pages", v4. **Problem observed: On x86_64, when crash is triggered and entering into kdump kernel, page allocation failure can always be seen. --------------------------------- DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 0 PID: 1 Comm: swapper/0 Call Trace: dump_stack+0x7f/0xa1 warn_alloc.cold+0x72/0xd6 ...... __alloc_pages+0x24d/0x2c0 ...... dma_atomic_pool_init+0xdb/0x176 do_one_initcall+0x67/0x320 ? rcu_read_lock_sched_held+0x3f/0x80 kernel_init_freeable+0x290/0x2dc ? rest_init+0x24f/0x24f kernel_init+0xa/0x111 ret_from_fork+0x22/0x30 Mem-Info: ------------------------------------ ***Root cause: In the current kernel, it assumes that DMA zone must have managed pages and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not always true. E.g in kdump kernel of x86_64, only low 1M is presented and locked down at very early stage of boot, so that this low 1M won't be added into buddy allocator to become managed pages of DMA zone. This exception will always cause page allocation failure if page is requested from DMA zone. ***Investigation: This failure happens since below commit merged into linus's tree. |
||
|
|
37b2d597bb |
ANDROID: mm: add cma pcp list
Add a PCP list for __GFP_CMA allocations so as not to deprive MIGRATE_MOVABLE allocations quick access to pages on their PCP lists. Bug: 158645321 Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org> [isaacm@codeaurora.org: Resolve merge conflicts related to new mm features] Signed-off-by: Isaac J. Manjarres <isaacm@quicinc.com> Change-Id: I2f238ea5f8e4aef9c45b1a3180ce6b6a36d63d77 |
||
|
|
af82009880 |
ANDROID: cma: redirect page allocation to CMA
CMA pages are designed to be used as fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. The userspace pages can be migrated or dropped quickly which giving decent utilization. Additionally, add a fall-backs for failed CMA allocations in rmqueue() and __rmqueue_pcplist() (the latter addition being driven by a report by the kernel test robot); these fallbacks were dealt with differently in the original version of the patch as the rmqueue() call chain has changed). Bug: 158645321 Link: https://lore.kernel.org/lkml/cover.1604282969.git.cgoldswo@codeaurora.org/ Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> [cgoldswo@codeaurora.org: Place in bugfixes; remove cma_alloc zone flag] Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org> [isaacm@codeaurora.org: Resolve merge conflicts to account for new mm features] Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> Change-Id: I3dfbc42f1d12416143550042182bf16030ca7190 |
||
|
|
c0d1ebaba1 |
Merge 2d338201d5 ("Merge branch 'akpm' (patches from Andrew)") into android-mainline
Steps on the way to 5.15-rc1 Resolves merge conflict in: fs/proc/base.c Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic554ca8447961e52fbc6f27d91470a816b59a771 |
||
|
|
c2b303f98f |
Merge 4e71add028 ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline
Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28 |
||
|
|
2d338201d5 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on
|
||
|
|
4b09700244 |
mm: track present early pages per zone
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is reqired for fast memory hotplug for drivers that add individual
memory blocks, like virito-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
usptream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really rrequired.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar as
adjust_managed_page_count() and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
859a85ddf9 |
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
65d759c8f9 |
mm: compaction: support triggering of proactive compaction by user
The proactive compaction[1] gets triggered for every 500msec and run compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on the value set to sysctl.compaction_proactiveness. Triggering the compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on the embedded system usecases which may have few MB's of RAM. Enabling the proactive compaction in its state will endup in running almost always on such systems. Other side, proactive compaction can still be very much useful for getting a set of higher order pages in some controllable manner(controlled by using the sysctl.compaction_proactiveness). So, on systems where enabling the proactive compaction always may proove not required, can trigger the same from user space on write to its sysctl interface. As an example, say app launcher decide to launch the memory heavy application which can be launched fast if it gets more higher order pages thus launcher can prepare the system in advance by triggering the proactive compaction from userspace. This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by user. [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a [akpm@linux-foundation.org: tweak vm.rst, per Mike] Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
01c8d337d1 |
mm/sparse: set SECTION_NID_SHIFT to 6
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bit 3
and 4 can be overlapped by sub-field for early NID, and can be
unexpectedly set on NUMA systems. There are a few non-critical issues
related to this:
- Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces
pfn_to_online_page() through the slow path, but doesn't actually break
the kernel.
- A kdump generation tool like makedumpfile uses this field to calculate
the physical address to read. So wrong bits can make the tool access to
wrong address and fail to create kdump. This can be avoided by the
tool, so it's not critical.
To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
available bits of section flag field.
Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev
Fixes:
|
||
|
|
11e02d3729 |
mm: sparse: remove __section_nr() function
As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
293f275f4d |
Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1 Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
8e658623d4 |
Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1 Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
351de44fde |
mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking and to apply any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
041711ce7c |
mm: fix spelling mistakes
Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
16c9afc776 |
arm64/mm: drop HAVE_ARCH_PFN_VALID
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
51c656aef6 |
include/linux/mmzone.h: add documentation for pfn_valid()
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
44042b4498 |
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
The per-cpu page allocator (PCP) only stores order-0 pages. This means
that all THP and "cheap" high-order allocations including SLUB contends on
the zone->lock. This patch extends the PCP allocator to store THP and
"cheap" high-order pages. Note that struct per_cpu_pages increases in
size to 256 bytes (4 cache lines) on x86-64.
Note that this is not necessarily a universal performance win because of
how it is implemented. High-order pages can cause pcp->high to be
exceeded prematurely for lower-orders so for example, a large number of
THP pages being freed could release order-0 pages from the PCP lists.
Hence, much depends on the allocation/free pattern as observed by a single
CPU to determine if caching helps or hurts a particular workload.
That said, basic performance testing passed. The following is a netperf
UDP_STREAM test which hits the relevant patches as some of the network
allocations are high-order.
netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*
Functionally, a patch like this is necessary to make bulk allocation of
high-order pages work with similar performance to order-0 bulk
allocations. The bulk allocator is not updated in this series as it would
have to be determined by bulk allocation users how they want to track the
order of pages allocated with the bulk allocator.
Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
43b02ba93b |
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP configuration option is equivalent to FLATMEM. Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead. Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9ee6cf5c6 |
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA configuration options are equivalent. Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead. Done with $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \ $(git grep -wl CONFIG_NEED_MULTIPLE_NODES) $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \ $(git grep -wl NEED_MULTIPLE_NODES) with manual tweaks afterwards. [rppt@linux.ibm.com: fix arm boot crash] Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bb1c50d396 |
mm: remove CONFIG_DISCONTIGMEM
There are no architectures that support DISCONTIGMEM left. Remove the configuration option and the dead code it was guarding in the generic memory management code. Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
777c00f5ed |
mm: drop SECTION_SHIFT in code comments
Actually SECTIONS_SHIFT is used in the kernel code, so the code comments
is strictly incorrect. And since commit
|
||
|
|
74f4482209 |
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63
[mgorman@techsingularity.net: fix documentation]
Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
c49c2c47da |
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
When kswapd is active then direct reclaim is potentially active. In either case, it is possible that a zone would be balanced if pages were not trapped on PCP lists. Instead of draining remote pages, simply limit the size of the PCP lists while kswapd is active. Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3b12e7e979 |
mm/page_alloc: scale the number of pages that are batch freed
When a task is freeing a large number of order-0 pages, it may acquire the zone->lock multiple times freeing pages in batches. This may unnecessarily contend on the zone lock when freeing very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent frees. As the machines I used were not large enough to test this are not large enough to illustrate a problem, a debugging patch shows patterns like the following (slightly editted for clarity) Baseline vanilla kernel time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 With patches time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814 Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bbbecb35a4 |
mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2. The per-cpu page allocator (PCP) is meant to reduce contention on the zone lock but the sizing of batch and high is archaic and neither takes the zone size into account or the number of CPUs local to a zone. With larger zones and more CPUs per node, the contention is getting worse. Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch and high values means that the sysctl can reduce zone lock contention but also increase allocation latencies. This series disassociates pcp->high from pcp->batch and then scales pcp->high based on the size of the local zone with limited impact to reclaim and accounting for active CPUs but leaves pcp->batch static. It also adapts the number of pages that can be on the pcp list based on recent freeing patterns. The motivation is partially to adjust to larger memory sizes but is also driven by the fact that large batches of page freeing via release_pages() often shows zone contention as a major part of the problem. Another is a bug report based on an older kernel where a multi-terabyte process can takes several minutes to exit. A workaround was to use vm.percpu_pagelist_fraction to increase the pcp->high value but testing indicated that a production workload could not use the same values because of an increase in allocation latencies. Unfortunately, I cannot reproduce this test case myself as the multi-terabyte machines are in active use but it should alleviate the problem. The series aims to address both and partially acts as a pre-requisite. pcp only works with order-0 which is useless for SLUB (when using high orders) and THP (unconditionally). To store high-order pages on PCP, the pcp->high values need to be increased first. This patch (of 6): The vm.percpu_pagelist_fraction is used to increase the batch and high limits for the per-cpu page allocator (PCP). The intent behind the sysctl is to reduce zone lock acquisition when allocating/freeing pages but it has a problem. While it can decrease contention, it can also increase latency on the allocation side due to unreasonably large batch sizes. This leads to games where an administrator adjusts percpu_pagelist_fraction on the fly to work around contention and allocation latency problems. This series aims to alleviate the problems with zone lock contention while avoiding the allocation-side latency problems. For the purposes of review, it's easier to remove this sysctl now and reintroduce a similar sysctl later in the series that deals only with pcp->high. Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f19298b951 |
mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign etc but nothing relies on them being perfectly accurate for functional correctness. The counters are used by userspace to get a general overview of a workloads NUMA behaviour but the page allocator incurs a high cost to maintain perfect accuracy similar to what is required for a vmstat like NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to turn off the collection of NUMA statistics like NUMA_HIT. This patch converts NUMA_HIT and friends to be NUMA events with similar accuracy to VM events. There is a possibility that slight errors will be introduced but the overall trend as seen by userspace will be similar. The counters are no longer updated from vmstat_refresh context as it is unnecessary overhead for counters that may never be read by userspace. Note that counters could be maintained at the node level to save space but it would have a user-visible impact due to /proc/zoneinfo. [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA] Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dbbee9d5cd |
mm/page_alloc: convert per-cpu list protection to local_lock
There is a lack of clarity of what exactly local_irq_save/local_irq_restore protects in page_alloc.c . It conflates the protection of per-cpu page allocation structures with per-cpu vmstat deltas. This patch protects the PCP structure using local_lock which for most configurations is identical to IRQ enabling/disabling. The scope of the lock is still wider than it should be but this is decreased later. It is possible for the local_lock to be embedded safely within struct per_cpu_pages but it adds complexity to free_unref_page_list. [akpm@linux-foundation.org: coding style fixes] [mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets] Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net [lkp@intel.com: Make pagesets static] Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
28f836b677 |
mm/page_alloc: split per cpu page lists and zone stats
The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.
This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.
Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst
local_irq_disable();
spin_lock(&lock);
The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a
bonus, local_lock also means that PROVE_LOCKING does something useful.
After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.
The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.
Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.
No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
Baseline Patched
1 56.383 54.225 (+3.83%)
2 40.047 35.492 (+11.38%)
3 37.339 32.643 (+12.58%)
4 35.578 30.992 (+12.89%)
8 33.592 29.606 (+11.87%)
16 32.362 28.532 (+11.85%)
32 31.476 27.728 (+11.91%)
64 30.633 27.252 (+11.04%)
128 30.596 27.090 (+11.46%)
While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.
This patch (of 9):
The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cache
conflict with adjacent per-cpu lists incurring a runtime cost or padding
is required incurring a memory cost.
This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures. It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).
Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.
[mgorman@techsingularity.net: make it W=1 cleaner]
Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
[mgorman@techsingularity.net: make it W=1 even cleaner]
Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
[lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
b19bd1c976 |
mm/mmzone.h: simplify is_highmem_idx()
There is a lot of historical ifdefery in is_highmem_idx() and its helper zone_movable_is_highmem() that was required because of two different paths for nodes and zones initialization that were selected at compile time. Until commit |
||
|
|
85adc860fd |
Merge 6efb943b86 Linux 5.13-rc1 into android-mainline
One giant leap, all the way up to 5.13-rc1 Also take the opportunity to re-align (a.k.a. fix a couple of previous merge conflict fix-up issues) which occurred during this merge-window. Fixes: |
||
|
|
163e8b4fec |
Merge d652502ef4 Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into android-mainline
A tiny step en route to v5.13-rc1 Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I049e80976042ebffc90bb080f09da0afcfd48d77 |
||
|
|
cb152a1a95 |
mm: fix some typos and code style problems
fix some typos and code style problems in mm. gfp.h: s/MAXNODES/MAX_NUMNODES mmzone.h: s/then/than rmap.c: s/__vma_split()/__vma_adjust() swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is swap_state.c: s/whoes/whose z3fold.c: code style problem fix in z3fold_unregister_migration zsmalloc.c: s/of/or, s/give/given Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a08a2ae346 |
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
This can even lead to extreme cases where system goes OOM because
the physically hotplugged memory depletes the available memory before
it is onlined.
b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
c) It might be there are no PMD_ALIGNED chunks so memmap array gets
populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
There are some non-obviously things to consider though.
Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed. This means that the reserved physical range is not
online although it is used. The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns. The current design
expects that this should be OK as the hotplugged memory is considered a
garbage until it is onlined. For example hibernation wouldn't save the
content of those vmmemmaps into the image so it wouldn't be restored on
resume but this should be OK as there no real content to recover anyway
while metadata is reachable from other data structures (e.g. vmemmap
page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.
As per above, the functions that are introduced are:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
fully span.
- mhp_deinit_memmap_on_memory:
Offlines as many sections as vmemmap pages fully span, removes the
range from zhe zone by remove_pfn_range_from_zone(), and calls
kasan_remove_zero_shadow() for the range.
The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages(). Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
present_pages is done at the end once we know that online_pages()
succedeed.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages(). This is necessary because
offline_pages() tears down some structures based on the fact whether the
node or the zone become empty. If offline_pages() fails, we account back
vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory().
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
d1e153fea2 |
mm/gup: migrate pinned pages out of movable zone
We should not pin pages in ZONE_MOVABLE. Currently, we do not pin only movable CMA pages. Generalize the function that migrates CMA pages to migrate all movable pages. Use is_pinnable_page() to check which pages need to be migrated Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9afaf30f7a |
mm/gup: do not migrate zero page
On some platforms ZERO_PAGE(0) might end-up in a movable zone. Do not migrate zero page in gup during longterm pinning as migration of zero page is not allowed. For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I see the following: Boot#1: zero_pfn 0x48a8d zero_pfn zone: ZONE_DMA32 Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE On x86, empty_zero_page is declared in .bss and depending on the loader may end up in different physical locations during boots. Also, move is_zero_pfn() my_zero_pfn() functions under CONFIG_MMU, because zero_pfn that they are using is declared in memory.c which is compiled with CONFIG_MMU. Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
198fba4137 |
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
There are a couple of kernel-doc comments in include/linux/mmzone.h but they have minor formatting issues that would cause kernel-doc warnings. Fix the formatting of those comments, add missing Return: descriptions and link include/linux/mmzone.h to Documentation/core-api/mm-api.rst Link: https://lkml.kernel.org/r/20210426141927.1314326-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
14f1854cdb |
Merge fecfd01539 ("Merge tag 'leds-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds") into android-mainline
Steps on the way to 5.12-rc1 Resolves merge conflicts in: drivers/dma-buf/dma-heap.c drivers/dma-buf/heaps/cma_heap.c drivers/dma-buf/heaps/system_heap.c include/linux/dma-heap.h Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ibb32dbdba5183c9e19f5d1e94016cc1ae9616173 |
||
|
|
3db2a71497 |
Merge 5b47b10e8f ("Merge tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci") into android-mainline
Steps on the way to 5.12-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I8330e967324622f71630bd9ee9e19086688d068e |
||
|
|
1f90a3477d |
mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions
While pfn_to_online_page() is able to determine pfn_valid() at subsection
granularity it is not able to reliably determine if a given pfn is also
online if the section is mixes ZONE_{NORMAL,MOVABLE} with ZONE_DEVICE.
This means that pfn_to_online_page() may return invalid @page objects.
For example with a memory map like:
100000000-1fbffffff : System RAM
142000000-143002e16 : Kernel code
143200000-143713fff : Kernel rodata
143800000-143b15b7f : Kernel data
144227000-144ffffff : Kernel bss
1fc000000-2fbffffff : Persistent Memory (legacy)
1fc000000-2fbffffff : namespace0.0
This command:
echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page
...succeeds when it should fail. When it succeeds it touches an
uninitialized page and may crash or cause other damage (see
dissolve_free_huge_page()).
While the memory map above is contrived via the memmap=ss!nn kernel
command line option, the collision happens in practice on shipping
platforms. The memory controller resources that decode spans of physical
address space are a limited resource. One technique platform-firmware
uses to conserve those resources is to share a decoder across 2 devices to
keep the address range contiguous. Unfortunately the unit of operation of
a decoder is 64MiB while the Linux section size is 128MiB. This results
in situations where, without subsection hotplug memory mappings with
different lifetimes collide into one object that can only express one
lifetime.
Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns. With
SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall back
to a slow-path check for ZONE_DEVICE pfns in an online section. In the
fast path online_section() for a full ZONE_DEVICE section returns false.
Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
[dan.j.williams@intel.com: fix CONFIG_ZONE_DEVICE=n build]
Link: https://lkml.kernel.org/r/CAPcyv4iX+7LAgAeSqx7Zw-Zd=ZV9gBv8Bo7oTbwCOOqJoZ3+Yg@mail.gmail.com
Link: https://lkml.kernel.org/r/161058500675.1840162.7887862152161279354.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
||
|
|
3c381db1fa |
mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo
Let's count the number of CMA pages per zone and print them in
/proc/zoneinfo.
Having access to the total number of CMA pages per zone is helpful for
debugging purposes to know where exactly the CMA pages ended up, and to
figure out how many pages of a zone might behave differently, even after
some of these pages might already have been allocated.
As one example, CMA pages part of a kernel zone cannot be used for
ordinary kernel allocations but instead behave more like ZONE_MOVABLE.
For now, we are only able to get the global nr+free cma pages from
/proc/meminfo and the free cma pages per zone from /proc/zoneinfo.
Example after this patch when booting a 6 GiB QEMU VM with
"hugetlb_cma=2G":
# cat /proc/zoneinfo | grep cma
cma 0
nr_free_cma 0
cma 0
nr_free_cma 0
cma 524288
nr_free_cma 493016
cma 0
cma 0
# cat /proc/meminfo | grep Cma
CmaTotal: 2097152 kB
CmaFree: 1972064 kB
Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
one can be sure when spotting "cma 0", that there are definetly no
CMA pages located in a zone.
[david@redhat.com: v2]
Link: https://lkml.kernel.org/r/20210128164533.18566-1-david@redhat.com
[david@redhat.com: v3]
Link: https://lkml.kernel.org/r/20210129113451.22085-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210127101813.6370-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
2091339d59 |
mm/vmscan.c: make lruvec_lru_size() static
All other references to the function were removed after
commit
|
||
|
|
b603894248 |
mm: memcg: add swapcache stat for memcg v2
This patch adds swapcache stat for the cgroup v2. The swapcache represents the memory that is accounted against both the memory and the swap limit of the cgroup. The main motivation behind exposing the swapcache stat is for enabling users to gracefully migrate from cgroup v1's memsw counter to cgroup v2's memory and swap counters. Cgroup v1's memsw limit allows users to limit the memory+swap usage of a workload but without control on the exact proportion of memory and swap. Cgroup v2 provides separate limits for memory and swap which enables more control on the exact usage of memory and swap individually for the workload. With some little subtleties, the v1's memsw limit can be switched with the sum of the v2's memory and swap limits. However the alternative for memsw usage is not yet available in cgroup v2. Exposing per-cgroup swapcache stat enables that alternative. Adding the memory usage and swap usage and subtracting the swapcache will approximate the memsw usage. This will help in the transparent migration of the workloads depending on memsw usage and limit to v2' memory and swap counters. The reasons these applications are still interested in this approximate memsw usage are: (1) these applications are not really interested in two separate memory and swap usage metrics. A single usage metric is more simple to use and reason about for them. (2) The memsw usage metric hides the underlying system's swap setup from the applications. Applications with multiple instances running in a datacenter with heterogeneous systems (some have swap and some don't) will keep seeing a consistent view of their usage. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
380780e718 |
mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.
The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.
So we convert the NR_FILE_PMDMAPPED account to pages. This patch is
consistent with
|
||
|
|
a1528e21f8 |
mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.
The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.
So we convert the NR_SHMEM_PMDMAPPED account to pages. This patch is
consistent with
|
||
|
|
57b2847d3c |
mm: memcontrol: convert NR_SHMEM_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.
The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.
So we convert the NR_SHMEM_THPS account to pages. This patch is
consistent with
|
||
|
|
bf9ecead53 |
mm: memcontrol: convert NR_FILE_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with if hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.
The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.
So we convert the NR_FILE_THPS account to pages. This patch is consistent
with
|
||
|
|
69473e5de8 |
mm: memcontrol: convert NR_ANON_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters. In
the systems with hundreds of processors it can be GBs of memory. For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.
The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates. But it can make the statistics more accuracy
for the THP vmstat counters.
So we convert the NR_ANON_THPS account to pages. This patch is consistent
with
|