Currently, debugging CMA allocation failures is quite limited. The most
common source of these failures seems to be page migration which doesn't
provide any useful information on the reason of the failure by itself.
alloc_contig_range can report those failures as it holds a list of
migrate-failed pages.
The information logged by dump_page() has already proven helpful for
debugging allocation issues, like identifying long-term pinnings on
ZONE_MOVABLE or MIGRATE_CMA.
Let's use the dynamic debugging infrastructure, such that we avoid
flooding the logs and creating a lot of noise on frequent
alloc_contig_range() calls. This information is helpful for debugging
only.
There are two ifdefery conditions to support common dyndbg options:
- CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE
It aims for supporting the feature with only specific file with
adding ccflags.
- CONFIG_DYNAMIC_DEBUG
It aims for supporting the feature with system wide globally.
A simple example to enable the feature:
Admin could enable the dump like this(by default, disabled)
echo "func alloc_contig_dump_pages +p" > control
Admin could disable it.
echo "func alloc_contig_dump_pages =_" > control
Detail goes Documentation/admin-guide/dynamic-debug-howto.rst
A concern is utility functions in dump_page use inconsistent
loglevels. In the future, we might want to make the loglevels
used inside dump_page() consistent and eventually rework the way
we log the information here. See [1].
[1] https://lore.kernel.org/linux-mm/YEh4doXvyuRl5BDB@google.com/
Link: https://lkml.kernel.org/r/20210311194042.825152-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Rationalise __alloc_pages wrappers", v3.
I was poking around the __alloc_pages variants trying to understand why
they each exist, and couldn't really find a good justification for keeping
__alloc_pages and __alloc_pages_nodemask as separate functions. That led
to getting rid of alloc_pages_current() and then I noticed the
documentation was bad, and then I noticed the mempolicy documentation
wasn't included.
Anyway, this is all cleanups & doc fixes.
This patch (of 7):
We have two masks involved -- the nodemask and the gfp mask, so alloc_mask
is an unclear name.
Link: https://lkml.kernel.org/r/20210225150642.2582252-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_contig_migrate_range already has lru_add_drain_all call via
migrate_prep. It's necessary to move LRU taget pages into LRU list to be
able to isolated. However, lru_add_drain_all call after
__alloc_contig_migrate_range is pointless since it has changed source page
freeing from putback_lru_pages to put_page[1].
This patch removes it.
[1] c6c919eb90, ("mm: use put_page() to free page instead of putback_lru_page()"
Link: https://lkml.kernel.org/r/20210303204512.2863087-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The information that some PFNs are busy is:
a) not helpful for ordinary users: we don't even know *who* called
alloc_contig_range(). This is certainly not worth a pr_info.*().
b) not really helpful for debugging: we don't have any details *why*
these PFNs are busy, and that is what we usually care about.
c) not complete: there are other cases where we fail alloc_contig_range()
using different paths that are not getting recorded.
For example, we reach this path once we succeeded in isolating pageblocks,
but failed to migrate some pages - which can happen easily on ZONE_NORMAL
(i.e., has_unmovable_pages() is racy) but also on ZONE_MOVABLE i.e., we
would have to retry longer to migrate).
For example via virtio-mem when unplugging memory, we can create quite
some noise (especially with ZONE_NORMAL) that is not of interest to users
- it's expected that some allocations may fail as memory is busy.
Let's just drop that pr_info_ratelimit() and rather implement a dynamic
debugging mechanism in the future that can give us a better reason why
alloc_contig_range() failed on specific pages.
Link: https://lkml.kernel.org/r/20210301150945.77012-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During boot, all non-reserved memblock memory is exposed to page_alloc via
memblock_free_pages->__free_pages_core(). This results in
kasan_free_pages() being called, which poisons that memory.
Poisoning all that memory lengthens boot time. The most noticeable effect
is observed with the HW_TAGS mode. A boot-time impact may potentially
also affect systems with large amount of RAM.
This patch changes the tag-based modes to not poison the memory during the
memblock->page_alloc transition.
An exception is made for KASAN_GENERIC. Since it marks all new memory as
accessible, not poisoning the memory released from memblock will lead to
KASAN missing invalid boot-time accesses to that memory.
With KASAN_SW_TAGS, as it uses the invalid 0xFE tag as the default tag for
all memory, it won't miss bad boot-time accesses even if the poisoning of
memblock memory is removed.
With KASAN_HW_TAGS, the default memory tags values are unspecified.
Therefore, if memblock poisoning is removed, this KASAN mode will miss the
mentioned type of boot-time bugs with a 1/16 probability. This is taken
as an acceptable trafe-off.
Internally, the poisoning is removed as follows. __free_pages_core() is
used when exposing fresh memory during system boot and when onlining
memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag
and passes it to __free_pages_ok() through free_pages_prepare() from
__free_pages_core(). If FPI_SKIP_KASAN_POISON is set, kasan_free_pages()
is not called.
All memory allocated normally when the boot is over keeps getting poisoned
as usual.
Link: https://lkml.kernel.org/r/a0570dc1e3a8f39a55aa343a1fc08cd5c2d4cad6.1613692950.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can sometimes end up with kasan_byte_accessible() being called on
non-slab memory. For example ksize() and krealloc() may end up calling it
on KFENCE allocated memory. In this case the memory will be tagged with
KASAN_SHADOW_INIT, which a subsequent patch ("kasan: initialize shadow to
TAG_INVALID for SW_TAGS") will set to the same value as KASAN_TAG_INVALID,
causing kasan_byte_accessible() to fail when called on non-slab memory.
This highlighted the fact that the check in kasan_byte_accessible() was
inconsistent with checks as implemented for loads and stores
(kasan_check_range() in SW tags mode and hardware-implemented checks in HW
tags mode). kasan_check_range() does not have a check for
KASAN_TAG_INVALID, and instead has a comparison against
KASAN_SHADOW_START. In HW tags mode, we do not have either, but we do set
TCR_EL1.TCMA which corresponds with the comparison against
KASAN_TAG_KERNEL.
Therefore, update kasan_byte_accessible() for both SW and HW tags modes to
correspond with the respective checks on loads and stores.
Link: https://linux-review.googlesource.com/id/Ic6d40803c57dcc6331bd97fbb9a60b0d38a65a36
Link: https://lkml.kernel.org/r/20210405220647.1965262-1-pcc@google.com
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A potential use after free can occur in _vm_unmap_aliases where an already
freed vmap_area could be accessed, Consider the following scenario:
Process 1 Process 2
__vm_unmap_aliases __vm_unmap_aliases
purge_fragmented_blocks_allcpus rcu_read_lock()
rcu_read_lock()
list_del_rcu(&vb->free_list)
list_for_each_entry_rcu(vb .. )
__purge_vmap_area_lazy
kmem_cache_free(va)
va_start = vb->va->va_start
Here Process 1 is in purge path and it does list_del_rcu on vmap_block and
later frees the vmap_area, since Process 2 was holding the rcu lock at
this time vmap_block will still be present in and Process 2 accesse it and
thereby it tries to access vmap_area of that vmap_block which was already
freed by Process 1 and this results in use after free.
Fix this by adding a check for vb->dirty before accessing vmap_area
structure since vb->dirty will be set to VMAP_BBMAP_BITS in purge path
checking for this will prevent the use after free.
Link: https://lkml.kernel.org/r/1616062105-23263-1-git-send-email-vjitta@codeaurora.org
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vread() has been linearly searching vmap_area_list for looking up vmalloc
areas to read from. These same areas are also tracked by a rb_tree
(vmap_area_root) which offers logarithmic lookup.
This patch modifies vread() to use the rb_tree structure instead of the
list and the speedup for heavy /proc/kcore readers can be pretty
significant. Below are the wall clock measurements of a Python
application that leverages the drgn debugging library to read and
interpret data read from /proc/kcore.
Before the patch:
-----
$ time sudo sdb -e 'dbuf | head 3000 | wc'
(unsigned long)3000
real 0m22.446s
user 0m2.321s
sys 0m20.690s
-----
With the patch:
-----
$ time sudo sdb -e 'dbuf | head 3000 | wc'
(unsigned long)3000
real 0m2.104s
user 0m2.043s
sys 0m0.921s
-----
Link: https://lkml.kernel.org/r/20210209190253.108763-1-serapheim@delphix.com
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings", v5.
This patch (of 3):
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
restriction was placed initially for simplicity and not because there
exists a technical reason to do so.
This change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL if
VM_DONTEXPAND or VM_PFNMAP mappings are specified.
Lokesh Gidra who works on the Android JVM, provided an explanation of how
such a feature will improve Android JVM garbage collection: "Android is
developing a new garbage collector (GC), based on userfaultfd. The
garbage collector will use userfaultfd (uffd) on the java heap during
compaction. On accessing any uncompacted page, the application threads
will find it missing, at which point the thread will create the compacted
page and then use UFFDIO_COPY ioctl to get it mapped and then resume
execution. Before starting this compaction, in a stop-the-world pause the
heap will be mremap(MREMAP_DONTUNMAP) so that the java heap is ready to
receive UFFD_EVENT_PAGEFAULT events after resuming execution.
To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of
even modest sized memory ranges also took several milliseconds, and
stopping the application for that long isn't acceptable in response-time
sensitive cases.
With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
compaction."
[1] https://lore.kernel.org/linux-mm/20201215030730.NC3CU98e4%25akpm@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmussen@google.com/
Link: https://lkml.kernel.org/r/20210323182520.2712101-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With NUMA balancing, in hint page fault handler, the faulting page will be
migrated to the accessing node if necessary. During the migration, TLB
will be shot down on all CPUs that the process has run on recently.
Because in the hint page fault handler, the PTE will be made accessible
before the migration is tried. The overhead of TLB shooting down can be
high, so it's better to be avoided if possible. In fact, if we delay
mapping the page until migration, that can be avoided. This is what this
patch doing.
For the multiple threads applications, it's possible that a page is
accessed by multiple threads almost at the same time. In the original
implementation, because the first thread will install the accessible PTE
before migrating the page, the other threads may access the page directly
before the page is made inaccessible again during migration. While with
the patch, the second thread will go through the page fault handler too.
And because of the PageLRU() checking in the following code path,
migrate_misplaced_page()
numamigrate_isolate_page()
isolate_lru_page()
the migrate_misplaced_page() will return 0, and the PTE will be made
accessible in the second thread.
This will introduce a little more overhead. But we think the possibility
for a page to be accessed by the multiple threads at the same time is low,
and the overhead difference isn't too large. If this becomes a problem in
some workloads, we need to consider how to reduce the overhead.
To test the patch, we run a test case as follows on a 2-socket Intel
server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).
1. Run a memory eater on NUMA node 1 to use 40GB memory before running
pmbench.
2. Run pmbench (normal accessing pattern) with 8 processes, and 8
threads per process, so there are 64 threads in total. The
working-set size of each process is 8960MB, so the total working-set
size is 8 * 8960MB = 70GB. The CPU of all pmbench processes is bound
to node 1. The pmbench processes will access some DRAM on node 0.
3. After the pmbench processes run for 10 seconds, kill the memory
eater. Now, some pages will be migrated from node 0 to node 1 via
NUMA balancing.
Test results show that, with the patch, the pmbench throughput (page
accesses/s) increases 5.5%. The number of the TLB shootdowns interrupts
reduces 98% (from ~4.7e7 to ~9.7e5) with about 9.2e6 pages (35.8GB)
migrated. From the perf profile, it can be found that the CPU cycles
spent by try_to_unmap() and its callees reduces from 6.02% to 0.47%. That
is, the CPU cycles spent by TLB shooting down decreases greatly.
Link: https://lkml.kernel.org/r/20210408132236.1175607-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Matthew Wilcox" <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the unsigned page_counter underflows, even just by a few pages, a
cgroup will not be able to run anything afterwards and trigger the OOM
killer in a loop.
Underflows shouldn't happen, but when they do in practice, we may just be
off by a small amount that doesn't interfere with the normal operation -
consequences don't need to be that dire.
Reset the page_counter to 0 upon underflow. We'll issue a warning that
the accounting will be off and then try to keep limping along.
[ We used to do this with the original res_counter, where it was a
more straight-forward correction inside the spinlock section. I
didn't carry it forward into the lockless page counters for
simplicity, but it turns out this is quite useful in practice. ]
Link: https://lkml.kernel.org/r/20210408143155.2679744-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since Roman's series "The new cgroup slab memory controller" applied.
All slab objects are charged via the new APIs of obj_cgroup. The new
APIs introduce a struct obj_cgroup to charge slab objects. It prevents
long-living objects from pinning the original memory cgroup in the
memory. But there are still some corner objects (e.g. allocations
larger than order-1 page on SLUB) which are not charged via the new
APIs. Those objects (include the pages which are allocated from buddy
allocator directly) are charged as kmem pages which still hold a
reference to the memory cgroup.
We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do
that, we should store an object cgroup pointer to page->memcg_data for
the kmem pages.
Finally, page->memcg_data will have 3 different meanings.
1) For the slab pages, page->memcg_data points to an object cgroups
vector.
2) For the kmem pages (exclude the slab pages), page->memcg_data
points to an object cgroup.
3) For the user pages (e.g. the LRU pages), page->memcg_data points
to a memory cgroup.
We do not change the behavior of page_memcg() and page_memcg_rcu(). They
are also suitable for LRU pages and kmem pages. Why?
Because memory allocations pinning memcgs for a long time - it exists at a
larger scale and is causing recurring problems in the real world: page
cache doesn't get reclaimed for a long time, or is used by the second,
third, fourth, ... instance of the same job that was restarted into a new
cgroup every time. Unreclaimable dying cgroups pile up, waste memory, and
make page reclaim very inefficient.
We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem, and then the page->memcg will always point
to an object cgroup pointer. At that time, LRU pages and kmem pages will
be treated the same. The implementation of page_memcg() will remove the
kmem page check.
This patch aims to charge the kmem pages by using the new APIs of
obj_cgroup. Finally, the page->memcg_data of the kmem page points to an
object cgroup. We can use the __page_objcg() to get the object cgroup
associated with a kmem page. Or we can use page_memcg() to get the memory
cgroup associated with a kmem page, but caller must ensure that the
returned memcg won't be released (e.g. acquire the rcu_read_lock or
css_set_lock).
Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
[songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>