Commit Graph

1062 Commits

Vijayanand Jitta
385b0dd1f9 ANDROID: mm: Fix page table lookup in speculative fault path
In the speculative fault path, while doing the page table lookup, the
offset is obtained at each level, the value at that offset is read and
checks are performed on it; later, to get the next level offset, we
read from the previous level offset again. A concurrent page table
reclamation operation could change the value at this offset, and we
would go ahead and access it, resulting in reading an invalid entry.
Fix this by re-reading the previous level entry and comparing it
before performing the next level access.
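
A minimal sketch of the pattern, assuming a pmd-level step of the
speculative walk (illustrative only, not the actual patch):

	/*
	 * Take a private copy of the pmd entry, validate it, and after
	 * dereferencing it re-check that it has not changed, so a concurrent
	 * page table reclaim cannot make us follow a stale pointer.
	 */
	pmd_t pmdval = READ_ONCE(*pmd);

	if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
		goto spf_fail;

	pte = pte_offset_map(pmd, address);

	if (pmd_val(pmdval) != pmd_val(READ_ONCE(*pmd))) {
		pte_unmap(pte);
		goto spf_fail;		/* entry changed under us, fall back */
	}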

Bug: 221005439
Change-Id: I66b3d24ae79c7ee5ccce4ba7a94f028f4cf3fda0
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
2022-03-23 11:32:20 -07:00
Michel Lespinasse
7d6787088d BACKPORT: FROMLIST: mm: enable speculative fault handling for supported file types.
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.

From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().

At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
  (would require ->page_mkwrite() changes).

vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/

Conflicts:
    include/linux/vm_event_item.h
    mm/vmstat.c

1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/
since the patch posted upstream at the time had a different structure
with stats for anonymous and file-backed pagefaults introduced in a
separate patch.

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a
2022-03-23 11:32:19 -07:00
Michel Lespinasse
7045d2d838 FROMLIST: mm: implement speculative handling in do_fault_around()
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.

Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.

Change xfs_filemap_map_pages() to account for the fact that it cannot
block anymore, as it is now running within an rcu read lock.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id771c1e6fa9b883595a48d4df63f448a05916eda
2022-03-23 11:32:19 -07:00
Michel Lespinasse
6877640598 BACKPORT: FROMLIST: mm: implement speculative fault handling in finish_fault()
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-27-michel@lespinasse.org/

Conflicts:
    mm/memory.c

1. Merge conflict due to new vmf->prealloc_pte usage in finish_fault.

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If6046592083eaf12caf5c51c3fbb287a4dfa1ace
2022-03-23 11:32:18 -07:00
Michel Lespinasse
b12e52ca98 FROMLIST: mm: implement speculative handling in __do_fault()
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.

Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.

The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-25-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I995ba94d8e96014ef83ac93fe5a4669afcde34b9
2022-03-23 11:32:18 -07:00
Michel Lespinasse
9b92402808 FROMLIST: mm: anon spf statistics
Add a new CONFIG_SPECULATIVE_PAGE_FAULT_STATS config option,
and dump extra statistics about executed spf cases and abort reasons
when the option is set.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-32-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia53cd88e4a7140aeb26bf8f3869e1fc5270012da
2022-03-23 11:32:17 -07:00
Michel Lespinasse
aa9ae5c915 FROMLIST: mm: implement and enable speculative fault handling in handle_pte_fault()
In handle_pte_fault(), allow speculative execution to proceed.

Use pte_spinlock() to validate the mmap sequence count when locking
the page table.

If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia45d095ec7b8e23f1c5d68b7a7f572a3f6f6df97
2022-03-23 11:32:17 -07:00
Michel Lespinasse
40bc9ed389 FROMLIST: mm: implement speculative handling in wp_page_copy()
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.

Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-27-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Icd2188e9facf5a7fea42000a2808bcda1ad6f0fc
2022-03-23 11:32:16 -07:00
Michel Lespinasse
009020e3d1 FROMLIST: mm: enable speculative fault handling in do_numa_page()
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().

do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.
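
As a minimal sketch (FAULT_FLAG_SPECULATIVE and the early-abort pattern
come from this patch series, not mainline), the check at the top of
do_swap_page() looks roughly like:

	/* Speculative swap-in is not supported yet: fall back to a
	 * normal fault that retries with the mmap lock held. */
	if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
		pte_unmap(vmf->pte);
		return VM_FAULT_RETRY;
	}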

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-22-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0390331facc9ecd37534012abdd9f255ab5bbb12
2022-03-23 11:32:16 -07:00
Michel Lespinasse
fedc4d513e FROMLIST: mm: implement speculative handling in do_numa_page()
Change do_numa_page() to use pte_spinlock() when locking the page table,
so that the mmap sequence counter will be validated in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-21-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If252547faf2a8a6cbba4c0a7ff929071a5f6a657
2022-03-23 11:32:15 -07:00
Michel Lespinasse
c2b2abe724 FROMLIST: mm: enable speculative fault handling through do_anonymous_page()
In the x86 fault handler, only attempt spf if the vma is anonymous.

In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().

In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-20-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I875106fcfa1084f570c2bf8f24a129bdce55316b
2022-03-23 11:32:15 -07:00
Michel Lespinasse
31cf1fd564 FROMLIST: mm: implement speculative handling in do_anonymous_page()
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-19-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5ad955323faabc142c21f62415db039ac889066a
2022-03-23 11:32:15 -07:00
Michel Lespinasse
6e6766ab76 BACKPORT: FROMLIST: mm: add pte_map_lock() and pte_spinlock()
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
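
A simplified sketch of this contract; the names follow the patch
series, mmap_seq_read_check() is a placeholder for the actual seqcount
re-validation, and vmf->seq stands for the sequence count captured when
the speculative fault started:

	static bool pte_map_lock(struct vm_fault *vmf)
	{
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		if (!(vmf->flags & FAULT_FLAG_SPECULATIVE))
			return true;

		/* Re-check the mmap seqcount now that the pte is locked. */
		if (mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
			return true;

		pte_unmap_unlock(vmf->pte, vmf->ptl);
		vmf->pte = NULL;
		return false;
	}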

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/

Conflicts:
    include/linux/mm.h

1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
2022-03-23 11:32:15 -07:00
Michel Lespinasse
6ab660d7cb FROMLIST: mm: implement speculative handling in __handle_mm_fault().
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wiring new page tables.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
2022-03-23 11:32:15 -07:00
Michel Lespinasse
f3f9f17a32 FROMLIST: mm: refactor __handle_mm_fault() / handle_pte_fault()
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().

This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-16-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id550086fe568331aa71c91468f8314faad993b20
2022-03-23 11:32:15 -07:00
Michel Lespinasse
f8a4611b47 FROMLIST: mm: add speculative_page_walk_begin() and speculative_page_walk_end()
Speculative page faults will use these to protect against races with
page table reclamation.

This could always be handled by disabling local IRQs as the fast GUP
code does; however speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
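
A sketch of what these helpers boil down to (following the patch
series; not mainline code):

	#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	/* Page tables are freed via RCU, so an rcu read lock is enough. */
	#define speculative_page_walk_begin()	rcu_read_lock()
	#define speculative_page_walk_end()	rcu_read_unlock()
	#else
	/* Otherwise rely on IRQs being disabled, as the fast GUP path does. */
	#define speculative_page_walk_begin()	local_irq_disable()
	#define speculative_page_walk_end()	local_irq_enable()
	#endif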

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-15-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3efe5fc6a5a49d537cf33e8093daeea42550077a
2022-03-23 11:32:14 -07:00
Michel Lespinasse
4e2e391ff7 BACKPORT: FROMLIST: mm: add do_handle_mm_fault()
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.

In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).

The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
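
The wrapper relationship is roughly the following (signatures are
approximate; the real patch threads the mmap sequence count through):

	static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
			unsigned long address, unsigned int flags,
			struct pt_regs *regs)
	{
		/* Non-speculative callers pass a sequence count of 0. */
		return do_handle_mm_fault(vma, address, flags, 0, regs);
	}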

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    mm/memory.c

1. Trivial merge conflict due to folios.

Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
2022-03-23 11:32:14 -07:00
Michel Lespinasse
57f3bb2b12 BACKPORT: FROMLIST: do_anonymous_page: reduce code duplication
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.

This change reduces the code duplication between the two cases.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    mm/memory.c

1. Trivial merge conflict caused by folios in mem_cgroup_charge call.

Link: https://lore.kernel.org/all/20220128131006.67712-6-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic19579571925878d632e43aa40b9f50cdf473ee6
2022-03-23 11:32:13 -07:00
Michel Lespinasse
82ab55ebcc FROMLIST: do_anonymous_page: use update_mmu_tlb()
update_mmu_tlb() can be used instead of update_mmu_cache() when the
page fault handler detects that it lost the race to another page fault.

It looks like this one call was missed in
https://patchwork.kernel.org/project/linux-mips/patch/1590375160-6997-2-git-send-email-maobibo@loongson.cn
after Andrew asked to replace all update_mmu_cache() calls with an alias
in the previous version of this patch here:
https://patchwork.kernel.org/project/linux-mips/patch/1590031837-9582-2-git-send-email-maobibo@loongson.cn/#23374625
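
The change itself is a one-liner; an abridged sketch of the affected
spot in do_anonymous_page():

	/* Lost the race to another fault: flush the stale TLB entry on
	 * architectures that need it, instead of priming the MMU cache. */
	if (!pte_none(*vmf->pte)) {
		update_mmu_tlb(vma, vmf->address, vmf->pte);
		goto release;
	}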

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-5-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iaad4d3c27e12c2d9bf68d1140709788fc8dead24
2022-03-23 11:32:13 -07:00
Kalesh Singh
1a77f04aac Revert "ANDROID: mm: Throttle rss_stat tracepoint"
This reverts commit 77dfeaa02d.

Throttling can now be done using hist triggers and synthetic events

Bug: 145972256
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I39c284040e2fdb815cda980f3a40ef188e59287c
2021-11-17 23:30:23 +00:00
Greg Kroah-Hartman
d1a66e7942 Merge tag 'v5.15' into android-mainline
Linux 5.15

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I81d96ada6b66bcf73d192ddde9018f1804e7d90b
2021-11-01 07:40:54 +01:00
Yang Shi
eac96c3efd mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
When handling a shmem page fault, a THP with a corrupted subpage could
be PMD mapped if certain conditions are satisfied.  But the kernel is
supposed to send a SIGBUS when trying to map a hwpoisoned page.

There are two paths which may do PMD map: fault around and regular
fault.

Before commit f9ce0be71d ("mm: Cleanup faultaround and finish_fault()
codepaths") the thing was even worse in the fault around path.  The THP
could be PMD mapped as long as the VMA fits, regardless of which subpage
is accessed and corrupted.  After this commit, as long as the head page
is not corrupted the THP could be PMD mapped.

In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.

So introduce a new page flag called HasHWPoisoned on the first tail
page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.

The soft offline path doesn't need this since soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.
But shmem THP didn't get split then migrated at all.

Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28 17:18:55 -07:00
Greg Kroah-Hartman
2d03f6ec4b Merge 92477dd1fa ("Merge tag 's390-5.15-ebpf-jit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux") into android-mainline
Steps on the way to 5.15-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia35478d4c4a085bff847b8cd484bd3290f9529b8
2021-09-23 09:26:41 +02:00
David Howells
6e0e99d58a afs: Fix mmap coherency vs 3rd-party changes
Fix the coherency management of mmap'd data such that 3rd-party changes
become visible as soon as possible after the callback notification is
delivered by the fileserver.  This is done by the following means:

 (1) When we break a callback on a vnode specified by the CB.CallBack call
     from the server, we queue a work item (vnode->cb_work) to go and
     clobber all the PTEs mapping to that inode.

     This causes the CPU to trip through the ->map_pages() and
     ->page_mkwrite() handlers if userspace attempts to access the page(s)
     again.

     (Ideally, this would be done in the service handler for CB.CallBack,
     but the server is waiting for our reply before considering, and we
     have a list of vnodes, all of which need breaking - and the process of
     getting the mmap_lock and stripping the PTEs on all CPUs could be
     quite slow.)

 (2) Call afs_validate() from the ->map_pages() handler to check to see if
     the file has changed and to get a new callback promise from the
     server.

Also handle the fileserver telling us that it's dropping all callbacks,
possibly after it's been restarted by sending us a CB.InitCallBackState*
call by the following means:

 (3) Maintain a per-cell list of afs files that are currently mmap'd
     (cell->fs_open_mmaps).

 (4) Add a work item to each server that is invoked if there are any open
     mmaps when CB.InitCallBackState happens.  This work item goes through
     the aforementioned list and invokes the vnode->cb_work work item for
     each one that is currently using this server.

     This causes the PTEs to be cleared, causing ->map_pages() or
     ->page_mkwrite() to be called again, thereby calling afs_validate()
     again.

I've chosen to simply strip the PTEs at the point of notification reception
rather than invalidate all the pages as well because (a) it's faster, (b)
we may get a notification for other reasons than the data being altered (in
which case we don't want to clobber the pagecache) and (c) we need to ask
the server to find out - and I don't want to wait for the reply before
holding up userspace.

This was tested using the attached test program:

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/mman.h>
	int main(int argc, char *argv[])
	{
		size_t size = getpagesize();
		unsigned char *p;
		bool mod = (argc == 3);
		int fd;
		if (argc != 2 && argc != 3) {
			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
			exit(2);
		}
		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
		if (fd < 0) {
			perror(argv[1]);
			exit(1);
		}

		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (;;) {
			if (mod) {
				p[0]++;
				msync(p, size, MS_ASYNC);
				fsync(fd);
			}
			printf("%02x", p[0]);
			fflush(stdout);
			sleep(1);
		}
	}

It runs in two modes: in one mode, it mmaps a file, then sits in a loop
reading the first byte, printing it and sleeping for a second; in the
second mode it mmaps a file, then sits in a loop incrementing the first
byte and flushing, then printing and sleeping.

Two instances of this program can be run on different machines, one doing
the reading and one doing the writing.  The reader should see the changes
made by the writer, but without this patch, they aren't because validity
checking is being done lazily - only on entry to the filesystem.

Testing the InitCallBackState change is more complicated.  The server has
to be taken offline, the saved callback state file removed and then the
server restarted whilst the reading-mode program continues to run.  The
client machine then has to poke the server to trigger the InitCallBackState
call.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
2021-09-13 09:10:39 +01:00
Greg Kroah-Hartman
e9975a8f2e Merge tag 'v5.14-rc3' into android-mainline
Linux 5.14-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5db742d6b8faf7b40efe1cb3b5beae486010f3fd
2021-07-28 14:45:26 +02:00
Qi Zheng
e4dc348914 mm: fix the deadlock in finish_fault()
Commit 63f3655f95 ("mm, memcg: fix reclaim deadlock with writeback")
fixed the following ABBA deadlock by pre-allocating the pte page table
without holding the page lock.

	                                lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_one
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

Commit f9ce0be71d ("mm: Cleanup faultaround and finish_fault()
codepaths") reworked the relevant code but ignored this race.  This will
cause the deadlock above to appear again, so fix it.
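
A sketch of the kind of change described (abridged): pre-allocate the
pte page table before the page lock can be taken, so pte_alloc_one()
never nests inside lock_page().

	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
		if (!vmf->prealloc_pte)
			return VM_FAULT_OOM;
		smp_wmb(); /* See comment in __pte_alloc() */
	}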

Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com
Fixes: f9ce0be71d ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Lee Jones
293f275f4d Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1

Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2021-07-12 11:00:18 +01:00
Lee Jones
8e658623d4 Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1

Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2021-07-12 10:02:27 +01:00
Alistair Popple
b756a3b5e7 mm: device exclusive memory access
Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory.  This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.

In order to do this introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE).  When a SVM range needs to be marked for exclusive
access by a device all page table mappings for the particular range are
replaced with device exclusive swap entries.  This causes any CPU access
to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping.  This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access.  After
notifiers have been called the device will no longer have exclusive access
to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk.  A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised.  However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.

[dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
  Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda

Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple
9a5cc85c40 mm/memory.c: allow different return codes for copy_nonpresent_pte()
Currently, if copy_nonpresent_pte() returns a non-zero value it is assumed
to be a swap entry which requires further processing outside the loop in
copy_pte_range() after dropping locks.  This prevents other values from
being returned to signal conditions such as failure, which a subsequent
change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.

Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple
4dd845b5a3 mm/swapops: rework swap entry manipulation code
Both migration and device private pages use special swap entries that are
manipulated by a range of inline functions.  The arguments to these are
somewhat inconsistent, so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
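
Roughly, call sites change from a flag-style argument to explicit
read/write constructors keyed on the pfn (abridged sketch):

	/* before */
	entry = make_migration_entry(page, pte_write(pteval));

	/* after this series */
	if (pte_write(pteval))
		entry = make_writable_migration_entry(page_to_pfn(page));
	else
		entry = make_readable_migration_entry(page_to_pfn(page));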

Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Alistair Popple
af5cdaf822 mm: remove special swap entry functions
Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory.  To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU.  This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes.  For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn.  Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.
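
A short usage sketch of the common helper:

	swp_entry_t entry = pte_to_swp_entry(pte);

	if (is_migration_entry(entry) || is_device_private_entry(entry)) {
		struct page *page = pfn_swap_entry_to_page(entry);
		/* operate on the backing page */
	}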

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Yang Shi
f4c0d8367e mm: memory: make numa_migrate_prep() non-static
numa_migrate_prep() will be used by the huge NUMA fault handling as well
in the following patch, so make it non-static.

Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Yang Shi
5db4f15c4f mm: memory: add orig_pmd to struct vm_fault
Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.

When the THP NUMA fault support was added THP migration was not supported
yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code.  It is definitely a maintenance burden to keep
two THP migration implementation for different code paths and it is more
error prone.  Using the generic THP migration implementation allows us
remove the duplicate code and some hacks needed by the old ad hoc
implementation.

A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
THP and NUMA balancing.  Most of them support THP migration except for
S390.  Zi Yan tried to add THP migration support for S390 before but it
was not accepted due to the design of S390 PMD.  For the discussion,
please see: https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1 it is acceptable to skip
huge PMD for S390 for now.

I saw there were some hacks about gup from git history, but I didn't
figure out if they have been removed or not since I just found FOLL_NUMA
code in the current gup implementation and they seem useful.

Patch #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips changing huge PMD to prot_none if thp migration is not supported.

Test
----
Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
VM has 80 vcpus and 64G memory.  The test would create 2 processes to
consume 128G memory together which would incur memory pressure to cause
THP splits.  And it also creates 80 processes to hog cpu, and the memory
consumer processes are bound to different nodes periodically in order to
increase NUMA faults.

The below test script is used:

echo 3 > /proc/sys/vm/drop_caches

# Run stress-ng for 24 hours
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
PID=$!

./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

# Wait for vm stressors forked
sleep 5

PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`

JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`

# Bind load jobs to different nodes periodically to force generate
# cross node memory access
while [ -d "/proc/$PID" ]
do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
done

With the above test the histogram of latency of do_huge_pmd_numa_page is
as shown below.  Since the number of do_huge_pmd_numa_page calls varies
drastically for each run (probably due to the scheduler), I converted the
raw numbers to percentages.

                             patched               base
@us[stress-ng]:
[0]                          3.57%                 0.16%
[1]                          55.68%                18.36%
[2, 4)                       10.46%                40.44%
[4, 8)                       7.26%                 17.82%
[8, 16)                      21.12%                13.41%
[16, 32)                     1.06%                 4.27%
[32, 64)                     0.56%                 4.07%
[64, 128)                    0.16%                 0.35%
[128, 256)                   < 0.1%                < 0.1%
[256, 512)                   < 0.1%                < 0.1%
[512, 1K)                    < 0.1%                < 0.1%
[1K, 2K)                     < 0.1%                < 0.1%
[2K, 4K)                     < 0.1%                < 0.1%
[4K, 8K)                     < 0.1%                < 0.1%
[8K, 16K)                    < 0.1%                < 0.1%
[16K, 32K)                   < 0.1%                < 0.1%
[32K, 64K)                   < 0.1%                < 0.1%

Per the result, the patched kernel is even slightly better than the base
kernel.  I think this is because lock contention against THP split is
lower than in the base kernel due to the refactor.

To exclude the effect of THP split, I also did a test w/o memory pressure.
No obvious regression was spotted.  The below is the test result *w/o*
memory pressure.

                           patched                  base
@us[stress-ng]:
[0]                        7.97%                   18.4%
[1]                        69.63%                  58.24%
[2, 4)                     4.18%                   2.63%
[4, 8)                     0.22%                   0.17%
[8, 16)                    1.03%                   0.92%
[16, 32)                   0.14%                   < 0.1%
[32, 64)                   < 0.1%                  < 0.1%
[64, 128)                  < 0.1%                  < 0.1%
[128, 256)                 < 0.1%                  < 0.1%
[256, 512)                 0.45%                   1.19%
[512, 1K)                  15.45%                  17.27%
[1K, 2K)                   < 0.1%                  < 0.1%
[2K, 4K)                   < 0.1%                  < 0.1%
[4K, 8K)                   < 0.1%                  < 0.1%
[8K, 16K)                  0.86%                   0.88%
[16K, 32K)                 < 0.1%                  0.15%
[32K, 64K)                 < 0.1%                  < 0.1%
[64K, 128K)                < 0.1%                  < 0.1%
[128K, 256K)               < 0.1%                  < 0.1%

The series also survived a series of tests that exercise NUMA balancing
migrations by Mel.

This patch (of 7):

Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
page fault could be removed, just like its PTE counterpart does.

Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:30 -07:00
Axel Rasmussen
c949b097ef userfaultfd/shmem: support minor fault registration for shmem
This patch allows shmem-backed VMAs to be registered for minor faults.
Minor faults are appropriately relayed to userspace in the fault path, for
VMAs with the relevant flag.

This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
minor faults, though, so userspace doesn't yet have a way to resolve such
faults.

Because of this, we also don't yet advertise this as a supported feature.
That will be done in a separate commit when the feature is fully
implemented.

Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Peter Xu
8f34f1eac3 mm/userfaultfd: fix uffd-wp special cases for fork()
We tried to do something similar in b569a17607 ("userfaultfd: wp: drop
_PAGE_UFFD_WP properly when fork") previously, but it's not doing it all
right.  A few fixes around the code path:

1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
   than the new vma.  That's overlooked in b569a17607, so it won't work
   as expected.  Thanks to the recent rework on fork code
   (7a4830c380), we can easily get the new vma now, so switch the
   checks to that.

2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
   huge pmd is a migration huge pmd.  When it happens, instead of using
   pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
   handle them separately.

3. Forget to carry over uffd-wp bit for a write migration huge pmd
   entry.  This also happens in copy_huge_pmd(), where we converted a
   write huge migration entry into a read one.

4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

5. In copy_present_page() when COW is enforced when fork(), we also
   need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
   vma, and when the pte to be copied has uffd-wp bit set.

Remove the comment in copy_present_pte() about this.  It won't help a huge
lot to only comment there, but comment everywhere would be an overkill.
Let's assume the commit messages would help.
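
A sketch of points (1) and (2) above, as they apply to the
copy_huge_pmd() path (abridged): test the new vma's uffd-wp state, and
use the _swp_ variants when the pmd is a migration entry rather than a
present huge pmd.

	if (unlikely(is_swap_pmd(pmd))) {
		if (!userfaultfd_wp(dst_vma) && pmd_swp_uffd_wp(pmd))
			pmd = pmd_swp_clear_uffd_wp(pmd);
	} else if (!userfaultfd_wp(dst_vma) && pmd_uffd_wp(pmd)) {
		pmd = pmd_clear_uffd_wp(pmd);
	}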

[peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
  Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
Fixes: b569a17607 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Mike Rapoport
a9ee6cf5c6 mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
		$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

[rppt@linux.ibm.com: fix arm boot crash]
  Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Liam Howlett
3e418f9888 mm/memory.c: use vma_lookup() in __access_remote_vm()
Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
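
An abridged sketch of the simplification in __access_remote_vm():
vma_lookup() returns NULL unless addr lies inside a VMA, so the extra
start-address check that find_vma() required goes away.

	struct vm_area_struct *vma = vma_lookup(mm, addr);

	if (!vma)
		break;		/* no VMA covers this address */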

Link: https://lkml.kernel.org/r/20210521174745.2219620-22-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Liu Xiang
2797e79f1a mm/memory.c: fix comment of finish_mkwrite_fault()
Fix the return value in comment of finish_mkwrite_fault().

Link: https://lkml.kernel.org/r/20210513093931.15234-1-liu.xiang@zlingsmart.com
Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:51 -07:00
Huang Ying
f4c4a3f484 mm: free idle swap cache page after COW
With commit 09854ba94c ("mm: do_wp_page() simplification"), after COW,
the idle swap cache page (neither the page nor the corresponding swap
entry is mapped by any process) will be left in the LRU list, even if it's
in the active list or the head of the inactive list.  So, the page
reclaimer may take quite some overhead to reclaim these actually unused
pages.

To help the page reclaiming, in this patch, after COW, we try to free the
idle swap cache page.  To avoid introducing much overhead to the hot COW
code path,

a) there's almost zero overhead for non-swap case via checking
   PageSwapCache() firstly.

b) the page lock is acquired via trylock only.

To test the patch, we used pmbench memory accessing benchmark with
working-set larger than available memory on a 2-socket Intel server with a
NVMe SSD as swap device.  Test results shows that the pmbench score
increases up to 23.8% with the decreased size of swap cache and swapin
throughput.

Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>	[use free_swap_cache()]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:49 -07:00
Miaohe Lin
2799e77529 swap: fix do_swap_page() race with swapoff
When I was investigating the swap code, I found the below possible race
window:

CPU 1                                   	CPU 2
-----                                   	-----
do_swap_page
  if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
  swap_readpage
    if (data_race(sis->flags & SWP_FS_OPS)) {
                                        	swapoff
					  	  ..
					  	  p->swap_file = NULL;
					  	  ..
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]

Note that for the pages that are swapped in through the swap cache, this
isn't an issue: the page is locked, and the swap entry will be marked
with SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

Fix this race by using get/put_swap_device() to guard against concurrent
swapoff.
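
An abridged sketch of the guard in do_swap_page():

	struct swap_info_struct *si;

	si = get_swap_device(entry);
	if (unlikely(!si))
		goto out;	/* raced with swapoff; entry is no longer valid */

	/* SWP_SYNCHRONOUS_IO / swap_readpage() path runs with si pinned */

	put_swap_device(si);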

Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com
Fixes: 0bcac06f27 ("mm,swap: skip swapcache for swapin of synchronous device")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:49 -07:00
Greg Kroah-Hartman
377f65b631 Merge tag 'v5.13-rc7' into android-mainline
Linux 5.13-rc7

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9d43072dc419205c72040a957af73400408a2063
2021-06-21 15:14:49 +02:00
Hugh Dickins
22061a1ffa mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
it, but before its page_remove_rmap() gets to decrement
compound_mapcount: generating false "BUG: Bad page cache" reports that
the page is still mapped when deleted.  This commit fixes that, but not
in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has
often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but there apply it to a single locked page.
try_to_unmap() looks more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this
case, and does spin_unlock(pmd_lock) if so - just like
page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.
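
The call added to truncate_cleanup_page() is small; an abridged sketch:

	if (page_mapped(page))
		unmap_mapping_page(page);	/* page is locked here */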

Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Greg Kroah-Hartman
5f0a72b046 Merge 5.13-rc5 into android-mainline
Linux 5.13-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I49ad8b28f9f48adbddcce610f659f698aca396df
2021-06-08 09:11:02 +02:00
Thomas Bogendoerfer
50c25ee97c Revert "MIPS: make userspace mapping young by default"
This reverts commit f685a533a7.

The MIPS cache flush logic needs to know whether the mapping was already
established to decide how to flush caches.  This is done by checking the
valid bit in the PTE.  The commit above breaks this logic by setting the
valid in the PTE in new mappings, which causes kernel crashes.

Link: https://lkml.kernel.org/r/20210526094335.92948-1-tsbogend@alpha.franken.de
Fixes: f685a533a7 ("MIPS: make userspace mapping young by default")
Reported-by: Zhou Yanjie <zhouyanjie@wanyeetech.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Huang Pei <huangpei@loongson.cn>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 08:58:11 -07:00
Lee Jones
85adc860fd Merge 6efb943b86 Linux 5.13-rc1 into android-mainline
One giant leap, all the way up to 5.13-rc1

Also take the opportunity to re-align (a.k.a. fix a couple of previous
merge conflict fix-up issues) which occurred during this merge-window.

Fixes: 4797acfb9c ("Merge 16b3d0cf5b Merge tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into android-mainline")
Fixes: 92f282f338 ("Merge 8ca5297e7e Merge tag 'kconfig-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild into android-mainline")

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ie9389f595776e8f66bba6eaf0fa7a3587c6a5749
2021-05-15 09:09:01 +01:00
Lee Jones
163e8b4fec Merge d652502ef4 Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into android-mainline
A tiny step en route to v5.13-rc1

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I049e80976042ebffc90bb080f09da0afcfd48d77
2021-05-12 15:54:32 +01:00
Ingo Molnar
f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
Yafang Shao
3d1c7fd97e delayacct: clear right task's flag after blkio completes
When I was implementing a latency analyzer tool by using task->delays
and other things, I found an issue in delayacct.  The issue is it should
clear the target's flag instead of current's in delayacct_blkio_end().

When I git blame delayacct, I found there're some similar issues we have
fixed in delayacct_blkio_end().

 - Commit c96f5471ce ("delayacct: Account blkio completion on the
   correct task") fixed the issue that it should account blkio
   completion on the target task instead of current.

 - Commit b512719f77 ("delayacct: fix crash in delayacct_blkio_end()
   after delayacct init failure") fixed the issue that it should check
   the target task's delays instead of the current task's.

It seems that delayacct_blkio_{begin, end} are error prone.

So I introduce a new parameter - the target task 'p' - to these
helpers.  After that change, the call site will specifically set the right
task, which should make it less error prone.
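
A sketch of the reworked completion-side helper (signature approximate;
only the end helper is shown):

	void delayacct_blkio_end(struct task_struct *p);

	/* e.g. the wakeup path now names the task it is accounting for: */
	delayacct_blkio_end(p);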

Link: https://lkml.kernel.org/r/20210414083720.24083-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josh Snyder <joshs@netflix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:32 -07:00
Nicholas Piggin
0c95cba492 mm: apply_to_pte_range warn and fail if a large pte is encountered
apply_to_pte_range might mistake a large pte for bad, or treat it as a
page table, resulting in a crash or corruption.  Add a test to warn and
return error if large entries are found.
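
An abridged sketch of the guard, shown at the pmd level (the other
levels get the equivalent check):

	if (WARN_ON_ONCE(pmd_leaf(*pmd)))
		return -EINVAL;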

Link: https://lkml.kernel.org/r/20210317062402.533919-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 11:20:39 -07:00