3ea02fc5d488894de9a9b6fea840f2f30f4ff9cc
17104 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9fc8d3bee7 |
mm/mmap.c: fix missing call to vm_unacct_memory in mmap_region
[ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ] Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory twice because vm_unacct_memory will be called by above unmap_region. But since commit |
||
|
|
4465215448 |
mm/mempolicy: fix get_nodes out of bound access
[ Upstream commit 000eca5d044d1ee23b4ca311793cf3fc528da6c6 ]
When user specified more nodes than supported, get_nodes will access nmask
array out of bounds.
Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com
Fixes:
|
||
|
|
d56c5eac84 |
mm/memremap: fix memunmap_pages() race with get_dev_pagemap()
[ Upstream commit 1e57ffb6e3fd9583268c6462c4e3853575b21701 ]
Think about the below scene:
CPU1 CPU2
memunmap_pages
percpu_ref_exit
__percpu_ref_exit
free_percpu(percpu_count);
/* percpu_count is freed here! */
get_dev_pagemap
xa_load(&pgmap_array, PHYS_PFN(phys))
/* pgmap still in the pgmap_array */
percpu_ref_tryget_live(&pgmap->ref)
if __ref_is_percpu
/* __PERCPU_REF_ATOMIC_DEAD not set yet */
this_cpu_inc(*percpu_count)
/* access freed percpu_count here! */
ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
/* too late... */
pageunmap_range
To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. So
we won't do percpu_ref_tryget_live() against a being freed percpu_ref.
Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com
Fixes:
|
||
|
|
d18a908565 |
memremap: remove support for external pgmap refcounts
[ Upstream commit b80892ca022e9eb484771a66eb68e12364695a2a ] No driver is left using the external pgmap refcount, so remove the code to support it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20211028151017.50234-1-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
86e83233dd |
page_alloc: fix invalid watermark check on a negative value
commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream. There was a report that a task is waiting at the throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was increasing. This is a bug where zone_watermark_fast returns true even when the free is very low. The commit |
||
|
|
51a772c34e |
mm/hmm: fault non-owner device private entries
commit 8a295dbbaf7292c582a40ce469c326f472d51f66 upstream.
If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a
device private PTE is found, the hmm_range::dev_private_owner page is used
to determine if the device private page should not be faulted in.
However, if the device private page is not owned by the caller,
hmm_range_fault() returns an error instead of calling migrate_to_ram() to
fault in the page.
For example, if a page is migrated to GPU private memory and a RDMA fault
capable NIC tries to read the migrated page, without this patch it will
get an error. With this patch, the page will be migrated back to system
memory and the NIC will be able to read the data.
Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com
Fixes:
|
||
|
|
dc124c849c |
hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
commit da9a298f5fad0dc615079a340da42928bc5b138e upstream.
When alloc_huge_page fails, *pagep is set to NULL without put_page first.
So the hugepage indicated by *pagep is leaked.
Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com
Fixes:
|
||
|
|
2722fb0f70 |
mm: fix page leak with multiple threads mapping the same page
commit 3fe2895cfecd03ac74977f32102b966b6589f481 upstream. We have an application with a lot of threads that use a shared mmap backed by tmpfs mounted with -o huge=within_size. This application started leaking loads of huge pages when we upgraded to a recent kernel. Using the page ref tracepoints and a BPF program written by Tejun Heo we were able to determine that these pages would have multiple refcounts from the page fault path, but when it came to unmap time we wouldn't drop the number of refs we had added from the faults. I wrote a reproducer that mmap'ed a file backed by tmpfs with -o huge=always, and then spawned 20 threads all looping faulting random offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge page aligned ranges. This very quickly reproduced the problem. The problem here is that we check for the case that we have multiple threads faulting in a range that was previously unmapped. One thread maps the PMD, the other thread loses the race and then returns 0. However at this point we already have the page, and we are no longer putting this page into the processes address space, and so we leak the page. We actually did the correct thing prior to |
||
|
|
70d0ce332d |
secretmem: fix unhandled fault in truncate
commit 84ac013046ccc438af04b7acecd4d3ab84fe4bde upstream. syzkaller reports the following issue: BUG: unable to handle page fault for address: ffff888021f7e005 PGD |
||
|
|
8c5429a04c |
mm/mempolicy: fix uninit-value in mpol_rebind_policy()
commit 018160ad314d75b1409129b2247b614a9f35894c upstream. mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when pol->mode is MPOL_LOCAL. Check pol->mode before access pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c). BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline] BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 mpol_rebind_policy mm/mempolicy.c:352 [inline] mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline] cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline] cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28b/0x510 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264 mpol_new mm/mempolicy.c:293 [inline] do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853 kernel_set_mempolicy mm/mempolicy.c:1504 [inline] __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline] __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae KMSAN: uninit-value in mpol_rebind_task (2) https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc This patch seems to fix below bug too. KMSAN: uninit-value in mpol_rebind_mm (2) https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy(). When syzkaller reproducer runs to the beginning of mpol_new(), mpol_new() mm/mempolicy.c do_mbind() mm/mempolicy.c kernel_mbind() mm/mempolicy.c `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags` is 0. Then mode = MPOL_LOCAL; ... policy->mode = mode; policy->flags = flags; will be executed. So in mpol_set_nodemask(), mpol_set_nodemask() mm/mempolicy.c do_mbind() kernel_mbind() pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized, which will be accessed in mpol_rebind_policy(). Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain Signed-off-by: Wang Cheng <wanngchenng@gmail.com> Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e4967d2288 |
mm: split huge PUD on wp_huge_pud fallback
commit 14c99d65941538aa33edd8dc7b1bbbb593c324a2 upstream.
Currently the implementation will split the PUD when a fallback is taken
inside the create_huge_pud function. This isn't where it should be done:
the splitting should be done in wp_huge_pud, just like it's done for PMDs.
Reason being that if a callback is taken during create, there is no PUD
yet so nothing to split, whereas if a fallback is taken when encountering
a write protection fault there is something to split.
It looks like this was the original intention with the commit where the
splitting was introduced, but somehow it got moved to the wrong place
between v1 and v2 of the patch series. Rebase mistake perhaps.
Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
Fixes:
|
||
|
|
27056f20d7 |
mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
commit 73f37dbcfe1763ee2294c7717a1f571e27d17fd8 upstream.
When fallocate() is used on a shmem file, the pages we allocate can end up
with !PageUptodate.
Since UFFDIO_CONTINUE tries to find the existing page the user wants to
map with SGP_READ, we would fail to find such a page, since
shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers
!PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would
do if the page wasn't found in the page cache at all.
This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find
if a page exists, and doesn't care whether it still needs to be cleared or
not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same,
except for one critical difference: in the !PageUptodate case, SGP_NOALLOC
will clear the page and then return it. With this change, UFFDIO_CONTINUE
works properly (succeeds) on a shmem file which has been fallocated, but
otherwise not modified.
Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com
Fixes:
|
||
|
|
2fa22e7906 |
Revert "mm/memory-failure.c: fix race with changing page compound again"
commit 2ba2b008a8bf5fd268a43d03ba79e0ad464d6836 upstream.
Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
page compound again") because now we fetch the page refcount under
hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
longer necessary.
Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
62d1655b92 |
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
[ Upstream commit 405ce051236cc65b30bbfe490b28ce60ae6aed85 ]
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.
get_hwpoison_page -- page is not what we want now...
The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes:
|
||
|
|
5429eb5502 |
mm/memory-failure.c: fix race with changing page compound again
[ Upstream commit 888af2701db79b9b27c7e37f9ede528a5ca53b76 ] Patch series "A few fixup patches for memory failure", v2. This series contains a few patches to fix the race with changing page compound page, make non-LRU movable pages unhandlable and so on. More details can be found in the respective changelogs. There is a race window where we got the compound_head, the hugetlb page could be freed to buddy, or even changed to another compound page just before we try to get hwpoison page. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is introduced to record this event. [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi] Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
7a07875fab |
mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
[ Upstream commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c ] When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
d04b62b640 |
mm/hwpoison: mf_mutex for soft offline and unpoison
[ Upstream commit 91d005479e06392617bacc114509d611b705eaac ] Patch series "mm/hwpoison: fix unpoison_memory()", v4. The main purpose of this series is to sync unpoison code to recent changes around how hwpoison code takes page refcount. Unpoison should work or simply fail (without crash) if impossible. The recent works of keeping hwpoison pages in shmem pagecache introduce a new state of hwpoisoned pages, but unpoison for such pages is not supported yet with this series. It seems that soft-offline and unpoison can be used as general purpose page offline/online mechanism (not in the context of memory error). I think that we need some additional works to realize it because currently soft-offline and unpoison are assumed not to happen so frequently (print out too many messages for aggressive usecases). But anyway this could be another interesting next topic. v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/ v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/ v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/ This patch (of 3): Originally mf_mutex is introduced to serialize multiple MCE events, but it is not that useful to allow unpoison to run in parallel with memory_failure() and soft offline. So apply mf_mutex to soft offline and unpoison. The memory failure handler and soft offline handler get simpler with this. Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
c33904fd1e |
mm: vmalloc: introduce array allocation functions
[ Upstream commit a8749a35c39903120ec421ef2525acc8e0daa55c ] Linux has dozens of occurrences of vmalloc(array_size()) and vzalloc(array_size()). Allow to simplify the code by providing vmalloc_array and vcalloc, as well as the underscored variants that let the caller specify the GFP flags. Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
066a5b6784 |
mm/filemap: fix UAF in find_lock_entries
Release refcount after xas_set to fix UAF which may cause panic like this:
page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9
head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0
memcg:ffff888104f91091
flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
...
page dumped because: VM_BUG_ON_PAGE(PageTail(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:632!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26
...
Call Trace:
<TASK>
__invalidate_mapping_pages+0xe7/0x540
drop_pagecache_sb+0x159/0x320
iterate_supers+0x120/0x240
drop_caches_sysctl_handler+0xaa/0xe0
proc_sys_call_handler+0x2b4/0x480
new_sync_write+0x3d6/0x5c0
vfs_write+0x446/0x7a0
ksys_write+0x105/0x210
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f52b5733130
...
This problem has been fixed on mainline by patch 6b24ca4a1a8d ("mm: Use
multi-index entries in the page cache") since it deletes the related code.
Fixes:
|
||
|
|
0515cc9b6b |
mm/slub: add missing TID updates on slab deactivation
commit eeaa345e128515135ccb864c04482180c08e3259 upstream.
The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
the TID stays the same. However, two places in __slab_alloc() currently
don't update the TID when deactivating the CPU slab.
If multiple operations race the right way, this could lead to an object
getting lost; or, in an even more unlikely situation, it could even lead to
an object being freed onto the wrong slab's freelist, messing up the
`inuse` counter and eventually causing a page to be freed to the page
allocator while it still contains slab objects.
(I haven't actually tested these cases though, this is just based on
looking at the code. Writing testcases for this stuff seems like it'd be
a pain...)
The race leading to state inconsistency is (all operations on the same CPU
and kmem_cache):
- task A: begin do_slab_free():
- read TID
- read pcpu freelist (==NULL)
- check `slab == c->slab` (true)
- [PREEMPT A->B]
- task B: begin slab_alloc_node():
- fastpath fails (`c->freelist` is NULL)
- enter __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- take local_lock_irqsave()
- read c->freelist as NULL
- get_freelist() returns NULL
- write `c->slab = NULL`
- drop local_unlock_irqrestore()
- goto new_slab
- slub_percpu_partial() is NULL
- get_partial() returns NULL
- slub_put_cpu_ptr() (enables preemption)
- [PREEMPT B->A]
- task A: finish do_slab_free():
- this_cpu_cmpxchg_double() succeeds()
- [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
From there, the object on c->freelist will get lost if task B is allowed to
continue from here: It will proceed to the retry_load_slab label,
set c->slab, then jump to load_freelist, which clobbers c->freelist.
But if we instead continue as follows, we get worse corruption:
- task A: run __slab_free() on object from other struct slab:
- CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
- task A: run slab_alloc_node() with NUMA node constraint:
- fastpath fails (c->slab is NULL)
- call __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- c->slab is NULL: goto new_slab
- slub_percpu_partial() is non-NULL
- set c->slab to slub_percpu_partial(c)
- [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
from slab-2]
- goto redo
- node_match() fails
- goto deactivate_slab
- existing c->freelist is passed into deactivate_slab()
- inuse count of slab-1 is decremented to account for object from
slab-2
At this point, the inuse count of slab-1 is 1 lower than it should be.
This means that if we free all allocated objects in slab-1 except for one,
SLUB will think that slab-1 is completely unused, and may free its page,
leading to use-after-free.
Fixes:
|
||
|
|
ca67881dce |
init: Initialize noop_backing_dev_info early
[ Upstream commit 4bca7e80b6455772b4bf3f536dcbc19aac424d6a ]
noop_backing_dev_info is used by superblocks of various
pseudofilesystems such as kdevtmpfs. After commit 10e14073107d
("writeback: Fix inode->i_io_list not be protected by inode->i_lock
error") this broke because __mark_inode_dirty() started to access more
fields from noop_backing_dev_info and this led to crashes inside
locked_inode_to_wb_and_lock_list() called from __mark_inode_dirty().
Fix the problem by initializing noop_backing_dev_info before the
filesystems get mounted.
Fixes: 10e14073107d ("writeback: Fix inode->i_io_list not be protected by inode->i_lock error")
Reported-and-tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-and-tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
fd9a5081ee |
mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
commit a04e1928e2ead144dc2f369768bc0a0f3110af89 upstream.
We forget to call untrack_pfn() to pair with track_pfn_remap() when range
is not allowed to hotplug. Fix it by jump err_kasan.
Link: https://lkml.kernel.org/r/20220531122643.25249-1-linmiaohe@huawei.com
Fixes:
|
||
|
|
715455ca5e |
hugetlb: fix huge_pmd_unshare address update
commit 48381273f8734d28ef56a5bdf1966dd8530111bc upstream.
The routine huge_pmd_unshare() is passed a pointer to an address
associated with an area which may be unshared. If unshare is successful
this address is updated to 'optimize' callers iterating over huge page
addresses. For the optimization to work correctly, address should be
updated to the last huge page in the unmapped/unshared area. However, in
the common case where the passed address is PUD_SIZE aligned, the address
is incorrectly updated to the address of the preceding huge page. That
wastes CPU cycles as the unmapped/unshared range is scanned twice.
Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
Fixes:
|
||
|
|
fb49bd85df |
mm/page_alloc: always attempt to allocate at least one page during bulk allocation
commit c572e4888ad1be123c1516ec577ad30a700bbec4 upstream. Peter Pavlisko reported the following problem on kernel bugzilla 216007. When I try to extract an uncompressed tar archive (2.6 milion files, 760.3 GiB in size) on newly created (empty) XFS file system, after first low tens of gigabytes extracted the process hangs in iowait indefinitely. One CPU core is 100% occupied with iowait, the other CPU core is idle (on 2-core Intel Celeron G1610T). It was bisected to |
||
|
|
c142bddf37 |
Revert "mm/cma.c: remove redundant cma_mutex lock"
commit 60a60e32cf91169840abcb4a80f0b0df31708ba7 upstream. This reverts commit |
||
|
|
20e6ec76ae |
mm, compaction: fast_find_migrateblock() should return pfn in the target zone
commit bbe832b9db2e1ad21522f8f0bf02775fff8a0e0e upstream. At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit |
||
|
|
3ec459c881 |
zsmalloc: fix races between asynchronous zspage free and page migration
commit 2505a981114dcb715f8977b8433f7540854851d8 upstream.
The asynchronous zspage free worker tries to lock a zspage's entire page
list without defending against page migration. Since pages which haven't
yet been locked can concurrently migrate off the zspage page list while
lock_zspage() churns away, lock_zspage() can suffer from a few different
lethal races.
It can lock a page which no longer belongs to the zspage and unsafely
dereference page_private(), it can unsafely dereference a torn pointer to
the next page (since there's a data race), and it can observe a spurious
NULL pointer to the next page and thus not lock all of the zspage's pages
(since a single page migration will reconstruct the entire page list, and
create_page_chain() unconditionally zeroes out each list pointer in the
process).
Fix the races by using migrate_read_lock() in lock_zspage() to synchronize
with page migration.
Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com
Fixes:
|
||
|
|
64cb7f01dd |
random: move randomize_page() into mm where it belongs
commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream. randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct. So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function. This commit contains no actual code changes. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
a122529082 |
mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
commit 2839b0999c20c9f6bf353849c69370e121e2fa1a upstream.
When kfence fails to initialize kfence pool, it frees the pool. But it
does not reset memcg_data and PG_slab flag.
Below is a BUG because of this. Let's fix it by resetting memcg_data
and PG_slab flag before free.
[ 0.089149] BUG: Bad page state in process swapper/0 pfn:3d8e06
[ 0.089149] page:ffffea46cf638180 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3d8e06
[ 0.089150] memcg:ffffffff94a475d1
[ 0.089150] flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 0.089151] raw: 0017ffffc0000200 ffffea46cf638188 ffffea46cf638188 0000000000000000
[ 0.089152] raw: 0000000000000000 0000000000000000 00000000ffffffff ffffffff94a475d1
[ 0.089152] page dumped because: page still charged to cgroup
[ 0.089153] Modules linked in:
[ 0.089153] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B W 5.18.0-rc1+ #965
[ 0.089154] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 0.089154] Call Trace:
[ 0.089155] <TASK>
[ 0.089155] dump_stack_lvl+0x49/0x5f
[ 0.089157] dump_stack+0x10/0x12
[ 0.089158] bad_page.cold+0x63/0x94
[ 0.089159] check_free_page_bad+0x66/0x70
[ 0.089160] __free_pages_ok+0x423/0x530
[ 0.089161] __free_pages_core+0x8e/0xa0
[ 0.089162] memblock_free_pages+0x10/0x12
[ 0.089164] memblock_free_late+0x8f/0xb9
[ 0.089165] kfence_init+0x68/0x92
[ 0.089166] start_kernel+0x789/0x992
[ 0.089167] x86_64_start_reservations+0x24/0x26
[ 0.089168] x86_64_start_kernel+0xa9/0xaf
[ 0.089170] secondary_startup_64_no_verify+0xd5/0xdb
[ 0.089171] </TASK>
Link: https://lkml.kernel.org/r/YnPG3pQrqfcgOlVa@hyeyoo
Fixes:
|
||
|
|
cc57b2e4a7 |
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
commit 1825b93b626e99eb9a0f9f50342c7b2fa201b387 upstream. The following VM_BUG_ON_FOLIO() is triggered when memory error event happens on the (thp/folio) pages which are about to be freed: [ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000 [ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000 [ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff) [ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000 [ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio)) [ 1160.251815] ------------[ cut here ]------------ [ 1160.253438] kernel BUG at include/linux/mm.h:788! [ 1160.256162] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G E 5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ #3 [ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd [ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00 [ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292 [ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000 [ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff [ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff [ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000 [ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000 [ 1160.286641] FS: 00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000 [ 1160.289498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0 [ 1160.293031] Call Trace: [ 1160.293724] <TASK> [ 1160.294334] get_hwpoison_page+0x47d/0x570 [ 1160.295474] memory_failure+0x106/0xaa0 [ 1160.296474] ? security_capable+0x36/0x50 [ 1160.297524] hard_offline_page_store+0x43/0x80 [ 1160.298684] kernfs_fop_write_iter+0x11c/0x1b0 [ 1160.299829] new_sync_write+0xf9/0x160 [ 1160.300810] vfs_write+0x209/0x290 [ 1160.301835] ksys_write+0x4f/0xc0 [ 1160.302718] do_syscall_64+0x3b/0x90 [ 1160.303664] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1160.304981] RIP: 0033:0x7eff54b018b7 As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is called from dump_page("hwpoison: unhandlable page") in get_any_page(). The below explains the mechanism of the race: CPU 0 CPU 1 memory_failure get_hwpoison_page get_any_page dump_page compound = PageCompound free_pages_prepare page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP folio_entire_mapcount VM_BUG_ON_FOLIO(!folio_test_large(folio)) So replace dump_page() with safer one, pr_err(). Link: https://lkml.kernel.org/r/20220427053220.719866-1-naoya.horiguchi@linux.dev Fixes: 74e8ee4708a8 ("mm: Turn head_compound_mapcount() into folio_entire_mapcount()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
f8f836100f |
mm/huge_memory: do not overkill when splitting huge_zero_page
commit 478d134e9506c7e9bfe2830ed03dd85e97966313 upstream.
Kernel panic when injecting memory_failure for the global huge_zero_page,
when CONFIG_DEBUG_VM is enabled, as follows.
Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:2499!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
RIP: 0010:split_huge_page_to_list+0x66a/0x880
Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
try_to_split_thp_page+0x3a/0x130
memory_failure+0x128/0x800
madvise_inject_error.cold+0x8b/0xa1
__x64_sys_madvise+0x54/0x60
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc3754f8bf9
Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
We think that raising BUG is overkilling for splitting huge_zero_page, the
huge_zero_page can't be met from normal paths other than memory failure,
but memory failure is a valid caller. So we tend to replace the BUG to
WARN + returning -EBUSY, and thus the panic above won't happen again.
Link: https://lkml.kernel.org/r/f35f8b97377d5d3ede1bc5ac3114da888c57cbce.1651052574.git.xuyu@linux.alibaba.com
Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
Fixes:
|
||
|
|
0e89d770dc |
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
commit b4e61fc031b11dd807dffc46cebbf0e25966d3d1 upstream.
Patch series "mm/memory-failure: rework fix on huge_zero_page splitting".
This patch (of 2):
This reverts commit d173d5417fb67411e623d394aab986d847e47dad.
The commit d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in
memory_failure()") explicitly skips huge_zero_page in memory_failure(), in
order to avoid triggering VM_BUG_ON_PAGE on huge_zero_page in
split_huge_page_to_list().
This works, but Yang Shi thinks that,
Raising BUG is overkilling for splitting huge_zero_page. The
huge_zero_page can't be met from normal paths other than memory
failure, but memory failure is a valid caller. So I tend to replace
the BUG to WARN + returning -EBUSY. If we don't care about the
reason code in memory failure, we don't have to touch memory
failure.
And for the issue that huge_zero_page will be set PG_has_hwpoisoned,
Yang Shi comments that,
The anonymous page fault doesn't check if the page is poisoned or
not since it typically gets a fresh allocated page and assumes the
poisoned page (isolated successfully) can't be reallocated again.
But huge zero page and base zero page are reused every time. So no
matter what fix we pick, the issue is always there.
Finally, Yang, David, Anshuman and Naoya all agree to fix the bug, i.e.,
to split huge_zero_page, in split_huge_page_to_list().
This reverts the commit d173d5417fb6 ("mm/memory-failure.c: skip
huge_zero_page in memory_failure()"), and the original bug will be fixed
by the next patch.
Link: https://lkml.kernel.org/r/872cefb182ba1dd686b0e7db1e6b2ebe5a4fff87.1651039624.git.xuyu@linux.alibaba.com
Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
Fixes:
|
||
|
|
5f00232112 |
mm: fix invalid page pointer returned with FOLL_PIN gups
commit 7196040e19ad634293acd3eff7083149d7669031 upstream. Patch series "mm/gup: some cleanups", v5. This patch (of 5): Alex reported invalid page pointer returned with pin_user_pages_remote() from vfio after upstream commit |
||
|
|
954c78ed8c |
mm/mlock: fix potential imbalanced rlimit ucounts adjustment
commit 5c2a956c3eea173b2bc89f632507c0eeaebf6c4a upstream.
user_shm_lock forgets to set allowed to 0 when get_ucounts fails. So
the later user_shm_unlock might do the extra dec_rlimit_ucounts. Fix
this by resetting allowed to 0.
Link: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com
Fixes:
|
||
|
|
acf3e6843a |
mm/hwpoison: fix error page recovered but reported "not recovered"
commit 046545a661af2beec21de7b90ca0e35f05088a81 upstream.
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a
UCNA signature, and the core reporting and SRAR signature machine check
when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine check
processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent. But returns the
wrong error code.
Console log looks like this:
mce: Uncorrected hardware memory error in user-access at 3710b3400
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
Memory failure: 0x3710b3: already hardware poisoned
Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
mce: Memory error not recovered
kill_accessing_process() is supposed to return -EHWPOISON to notify that
SIGBUS is already set to the process and kill_me_maybe() doesn't have to
send it again. But current code simply fails to do this, so fix it to
make sure to work as intended. This change avoids the noise message
"Memory error not recovered" and skips duplicate SIGBUSs.
[tony.luck@intel.com: reword some parts of commit message]
Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes:
|
||
|
|
13d75c31a8 |
mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
commit 7c25a0b89a487878b0691e6524fb5a8827322194 upstream. userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic() which do not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Fix this by insert flush_dcache_page() after copy_from_user() succeeds. Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com Fixes: |
||
|
|
72dd048723 |
mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()
commit 19b482c29b6f3805f1d8e93015847b89e2f7f3b1 upstream. userfaultfd calls shmem_mfill_atomic_pte() which does not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Insert flush_dcache_page() in non-zero-page case. And replace clear_highpage() with clear_user_highpage() which already considers the cache maintenance. Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com Fixes: |
||
|
|
e36b476a82 |
mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
commit e763243cc6cb1fcc720ec58cfd6e7c35ae90a479 upstream.
userfaultfd calls copy_huge_page_from_user() which does not do any cache
flushing for the target page. Then the target page will be mapped to
the user space with a different address (user address), which might have
an alias issue with the kernel address used to copy the data from the
user to.
Fix this issue by flushing dcache in copy_huge_page_from_user().
Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
Fixes:
|
||
|
|
97a9f80290 |
mm: fix missing cache flush for all tail pages of compound page
commit 2771739a7162782c0aa6424b2e3dd874e884a15d upstream.
The D-cache maintenance inside move_to_new_page() only consider one
page, there is still D-cache maintenance issue for tail pages of
compound page (e.g. THP or HugeTLB).
THP migration is only enabled on x86_64, ARM64 and powerpc, while
powerpc and arm64 need to maintain the consistency between I-Cache and
D-Cache, which depends on flush_dcache_page() to maintain the
consistency between I-Cache and D-Cache.
But there is no issues on arm64 and powerpc since they already considers
the compound page cache flushing in their icache flush function.
HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
riscv, s390 and sh, while arm has handled the compound page cache flush
in flush_dcache_page(), but most others do not.
In theory, the issue exists on many architectures. Fix this by not
using flush_dcache_folio() since it is not backportable.
Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
Fixes:
|
||
|
|
30083f0492 |
kasan: prevent cpu_quarantine corruption when CPU offline and cache shrink occur at same time
commit 31fa985b4196f8a66f027672e9bf2b81fea0417c upstream. kasan_quarantine_remove_cache() is called in kmem_cache_shrink()/ destroy(). The kasan_quarantine_remove_cache() call is protected by cpuslock in kmem_cache_destroy() to ensure serialization with kasan_cpu_offline(). However the kasan_quarantine_remove_cache() call is not protected by cpuslock in kmem_cache_shrink(). When a CPU is going offline and cache shrink occurs at same time, the cpu_quarantine may be corrupted by interrupt (per_cpu_remove_cache operation). So add a cpu_quarantine offline flags check in per_cpu_remove_cache(). [akpm@linux-foundation.org: add comment, per Zqiang] Link: https://lkml.kernel.org/r/20220414025925.2423818-1-qiang1.zhang@intel.com Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
dcecd95a13 |
mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
commit fe673d3f5bf1fc50cdc4b754831db91a2ec10126 upstream Instead of using GUP, make fault_in_safe_writeable() actually force a 'handle_mm_fault()' using the same fixup_user_fault() machinery that futexes already use. Using the GUP machinery meant that fault_in_safe_writeable() did not do everything that a real fault would do, ranging from not auto-expanding the stack segment, to not updating accessed or dirty flags in the page tables (GUP sets those flags on the pages themselves). The latter causes problems on architectures (like s390) that do accessed bit handling in software, which meant that fault_in_safe_writeable() didn't actually do all the fault handling it needed to, and trying to access the user address afterwards would still cause faults. Reported-and-tested-by: Andreas Gruenbacher <agruenba@redhat.com> Fixes: cdd591fc86e3 ("iov_iter: Introduce fault_in_iov_iter_writeable") Link: https://lore.kernel.org/all/CAHc6FU5nP+nziNGG0JAF1FUx-GV7kKFvM7aZuU_XD2_1v4vnvg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
6e213bc614 |
gup: Introduce FOLL_NOFAULT flag to disable page faults
commit 55b8fe703bc51200d4698596c90813453b35ae63 upstream Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return -EFAULT when it would otherwise trigger a page fault. This is roughly similar to FOLL_FAST_ONLY but available on all architectures, and less fragile. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
1d91c912e7 |
iov_iter: Introduce fault_in_iov_iter_writeable
commit cdd591fc86e38ad3899196066219fbbd845f3162 upstream Introduce a new fault_in_iov_iter_writeable helper for safely faulting in an iterator for writing. Uses get_user_pages() to fault in the pages without actually writing to them, which would be destructive. We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that the iterator passed to .read_iter isn't in memory. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
30e66b1dfc |
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
commit a6294593e8a1290091d0b078d5d33da5e0cd3dfe upstream Turn iov_iter_fault_in_readable into a function that returns the number of bytes not faulted in, similar to copy_to_user, instead of returning a non-zero value when any of the requested pages couldn't be faulted in. This supports the existing users that require all pages to be faulted in as well as new users that are happy if any pages can be faulted in. Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make sure this change doesn't silently break things. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
923f05a660 |
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
commit bb523b406c849eef8f265a07cd7f320f1f177743 upstream
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.
Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
19cbd78fb2 |
mm: kfence: fix objcgs vector allocation
commit 8f0b36497303487d5a32c75789c77859cc2ee895 upstream.
If the kfence object is allocated to be used for objects vector, then
this slot of the pool eventually being occupied permanently since the
vector is never freed. The solutions could be (1) freeing vector when
the kfence object is freed or (2) allocating all vectors statically.
Since the memory consumption of object vectors is low, it is better to
chose (2) to fix the issue and it is also can reduce overhead of vectors
allocating in the future.
Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com
Fixes:
|
||
|
|
48b2ab1a96 |
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
commit 319561669a59d8e9206ab311ae5433ef92fd79d1 upstream.
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
CPU0 - mn_tree_inv_end() CPU1 - mmu_interval_notifier_remove()
----------------------------------- ------------------------------------
spin_lock(subscriptions->lock);
seq = subscriptions->invalidate_seq;
spin_lock(subscriptions->lock); spin_unlock(subscriptions->lock);
subscriptions->invalidate_seq++;
wait_event(invalidate_seq != seq);
return;
interval_tree_remove(interval_sub); kfree(interval_sub);
spin_unlock(subscriptions->lock);
wake_up_all();
As the wait_event() condition is true it will return immediately. This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is
still on a deferred list. Fix this by taking the appropriate lock when
reading invalidate_seq to ensure proper synchronisation.
I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).
Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
Fixes:
|
||
|
|
41ba681c63 |
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
commit e4a38402c36e42df28eb1a5394be87e6571fb48a upstream.
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper. This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.
A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:
CPU1 CPU2
--------------------------------------------------------------------
page_fault
do_exit "signal"
wake_oom_reaper
oom_reaper
oom_reap_task_mm (invalidates mm)
exit_mm
exit_mm_release
futex_exit_release
futex_cleanup
exit_robust_list
get_user (EFAULT- can't access memory)
If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.
Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Based on a patch by Michal Hocko.
Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes:
|
||
|
|
9dcb65cdf3 |
mm, hugetlb: allow for "high" userspace addresses
commit 5f24d5a579d1eace79d505b148808a850b417d4c upstream. This is a fix for commit |
||
|
|
07bdd20777 |
memcg: sync flush only if periodic flush is delayed
commit 9b3016154c913b2e7ec5ae5c9a42eb9e732d86aa upstream.
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flush are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than batch value within short amount of
time. Since the rstat flush can happen in the performance critical
codepaths like page faults, such workload can suffer greatly.
This patch fixes this regression by making the rstat flushing
conditional in the performance critical codepaths. More specifically,
the kernel relies on the async periodic rstat flusher to flush the stats
and only if the periodic flusher is delayed by more than twice the
amount of its normal time window then the kernel allows rstat flushing
from the performance critical codepaths.
Now the question: what are the side-effects of this change? The worst
that can happen is the refault codepath will see 4sec old lruvec stats
and may cause false (or missed) activations of the refaulted page which
may under-or-overestimate the workingset size. Though that is not very
concerning as the kernel can already miss or do false activations.
There are two more codepaths whose flushing behavior is not changed by
this patch and we may need to come to them in future. One is the
writeback stats used by dirty throttling and second is the deactivation
heuristic in the reclaim. For now keeping an eye on them and if there
is report of regression due to these codepaths, we will reevaluate then.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
Fixes:
|