Page replacement is handled in the Linux kernel in one of two ways:
1) Asynchronously, via kswapd
2) Synchronously, via direct reclaim
At page allocation time, the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, most likely directly or indirectly executing business logic.
Just prior to satisfying the allocation, the number of free pages is
checked against the zone low watermark, and if it has been reached,
kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often
ranging from tens to hundreds of milliseconds for small order-0
allocations to half a second or more for order-9 huge-page allocations.
In fact, kswapd is not actually required on a Linux system. It exists
for the sole purpose of optimizing performance by preventing direct
reclaims.
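In rough pseudocode, the allocation-time checks described above look
like the following sketch. This is a simplification, not the actual
mm/page_alloc.c flow; zone_watermark_ok() and wakeup_kswapd() are real
kernel helpers, but the two allocation helpers at the bottom are
hypothetical names standing in for the real fast path and the direct
reclaim slow path:

    /* Simplified sketch of the watermark checks at allocation time. */
    static struct page *alloc_page_sketch(struct zone *zone, gfp_t gfp,
                                          unsigned int order)
    {
            /* Free pages at or below the low watermark: wake kswapd so
             * it reclaims asynchronously while we keep allocating. */
            if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                   zone_idx(zone), 0))
                    wakeup_kswapd(zone, gfp, order, zone_idx(zone));

            /* At or below the min watermark: this task must reclaim for
             * itself before the allocation succeeds (direct reclaim). */
            if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                   zone_idx(zone), 0))
                    return alloc_pages_direct_reclaim(zone, gfp, order);

            return take_page_from_freelist(zone, order);  /* hypothetical */
    }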
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory-allocating task can set the stage for collateral damage in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a
single-threaded kswapd that will get worse with each generation of new
hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details.
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. For brevity, I did not include
all of the test output.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption, from vmstat output. The value
  doesn't tend to fluctuate much, so I just grab the highest value.
  Each sample is averaged over 10 seconds.
- dd_cpu - CPU consumption for all of the dd tasks, averaged across the top
  samples since there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks is increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system
CPU was consumed, as expected, since the drives DMA directly into the
user address space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd counter in /proc/vmstat starts to increment.
At that point, metrics are parsed and reported, and the pagecache contents
are dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is
the same test, except we wait for kswapd to wake up before we start
collecting metrics. The script actually keeps track of a few things that
were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely
because fewer tasks were performing page replacement in parallel at any
given time.
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit 0d61a651e4)
The 'kunit_test' member variable in task_struct is defined
under CONFIG_KUNIT. Besides, there are supporting functions
in slub and kasan which get conditionally compiled out.
Allow kunit to be built as a vendor module by removing
compile-time dependencies.
Bug: 215096354
Change-Id: If57b1df6e479aa0388aabc53af5ae10e20a844b2
Signed-off-by: Shiraz Hashim <quic_shashim@quicinc.com>
To support multiple kswapd threads, vendor modules need
access to the kswapd function, so export it.
Bug: 201263306
Change-Id: I442612710835f39836a295e9d1936f86826ab960
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
The anon_vma_name refcount in a deep process chain with many vmas could
grow really high. With the default sysctl_max_map_count (64k) and the
default pid_max (32k), the max number of vmas in the system is
2147450880, and the refcounter has headroom of 1073774592 before it
reaches REFCOUNT_SATURATED (3221225472).
Therefore it's unlikely that an anonymous name refcounter will overflow
with these defaults. Currently the max for pid_max is PID_MAX_LIMIT
(4194304) and for sysctl_max_map_count it's INT_MAX (2147483647). In
this configuration an anon_vma_name refcount overflow becomes
theoretically possible (though it would still require heavy sharing of
that anon_vma_name between processes).
The kref refcounting interface used in the anon_vma_name structure will
detect a counter overflow when it reaches the REFCOUNT_SATURATED value,
but will only generate a warning and freeze the refcounter. This would
lead to the refcounted object never being freed. A determined attacker
could leak memory this way, but it would be a rather expensive and
inefficient way to do so.
To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
leaves INT_MAX/2 (1073741823) values before the counter reaches
REFCOUNT_SATURATED. This should provide enough headroom for raising the
refcounts temporarily.
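The heart of the change is a reuse helper along these lines (a sketch
approximating the upstream fix; kref_read() and REFCOUNT_MAX are the
real interfaces):

    /* Sketch: refuse to share the name once the refcount nears
     * saturation. */
    static struct anon_vma_name *anon_vma_name_reuse(struct anon_vma_name *anon_name)
    {
            if (kref_read(&anon_name->kref) < REFCOUNT_MAX) {
                    anon_vma_name_get(anon_name);
                    return anon_name;
            }
            /* Out of headroom: hand out a fresh copy of the name. */
            return anon_vma_name_alloc(anon_name->name);
    }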
Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96403e11283def1d1c465c8279514c9a504d8630)
Bug: 218352794
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ieaab58f6300d9aff3139eed1c1d3417237d81955
When compiled with CONFIG_SHMEM=n, shmem.c does not include internal.h,
and the isolate_lru_page() function declaration can't be found.
Fix this by making the isolate_lru_page() usage conditional upon
CONFIG_SHMEM inside reclaim_shmem_address_space().
Fixes: 96f80f6284 ("ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia46a57681d26ac103e84ef7caa61a22dbd45cf04
Changes in 5.15.31
crypto: qcom-rng - ensure buffer for generate is completely filled
ocfs2: fix crash when initialize filecheck kobj fails
mm: swap: get rid of livelock in swapin readahead
block: release rq qos structures for queue without disk
drm/mgag200: Fix PLL setup for g200wb and g200ew
efi: fix return value of __setup handlers
alx: acquire mutex for alx_reinit in alx_change_mtu
vsock: each transport cycles only on its own sockets
esp6: fix check on ipv6_skip_exthdr's return value
net: phy: marvell: Fix invalid comparison in the resume and suspend functions
net/packet: fix slab-out-of-bounds access in packet_recvmsg()
atm: eni: Add check for dma_map_single
iavf: Fix double free in iavf_reset_task
hv_netvsc: Add check for kvmalloc_array
drm/imx: parallel-display: Remove bus flags check in imx_pd_bridge_atomic_check()
drm/panel: simple: Fix Innolux G070Y2-L01 BPP settings
net: handle ARPHRD_PIMREG in dev_is_mac_header_xmit()
drm: Don't make DRM_PANEL_BRIDGE dependent on DRM_KMS_HELPERS
net: dsa: Add missing of_node_put() in dsa_port_parse_of
net: phy: mscc: Add MODULE_FIRMWARE macros
bnx2x: fix built-in kernel driver load failure
net: bcmgenet: skip invalid partial checksums
net: mscc: ocelot: fix backwards compatibility with single-chain tc-flower offload
iavf: Fix hang during reboot/shutdown
arm64: fix clang warning about TRAMP_VALIAS
usb: gadget: rndis: prevent integer overflow in rndis_set_response()
usb: gadget: Fix use-after-free bug by not setting udc->dev.driver
usb: usbtmc: Fix bug in pipe direction for control transfers
scsi: mpt3sas: Page fault in reply q processing
Input: aiptek - properly check endpoint type
perf symbols: Fix symbol size calculation condition
btrfs: skip reserved bytes warning on unmount after log cleanup failure
Linux 5.15.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iea69c3aeae614eb6b871993dc29fc1010c064f24
Add functionality that allows users of shmem to reclaim its pages
without going through the kswapd/direct reclaim path. An example use
case: say that a device allocates a large amount of shmem pages and
shares them with hardware. To reclaim such pages faster, drivers can
register shrinkers and call reclaim_shmem_address_space().
The implementation of this function is mostly borrowed from
reclaim_address_space() implemented for per process reclaim[1].
[1] https://lore.kernel.org/patchwork/cover/378056/
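A minimal usage sketch, assuming a driver that holds a shmem-backed
struct file (the shrinker callback and my_drv_shmem_file are
hypothetical; only reclaim_shmem_address_space() is added by this
change):

    /* Hypothetical shrinker scan callback that reclaims the driver's
     * shmem-backed pages without waiting for kswapd/direct reclaim. */
    static unsigned long my_drv_shrink_scan(struct shrinker *shrinker,
                                            struct shrink_control *sc)
    {
            return reclaim_shmem_address_space(my_drv_shmem_file->f_mapping);
    }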
Bug: 201263305
Change-Id: I03d2c3b9610612af977f89ddeabb63b8e9e50918
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
In the speculative fault path, during page table lookup, the offset is
obtained at each level and the value at that offset is read and checked;
later, to get the next level offset, we read from the previous level
offset again. A concurrent page table reclamation operation could change
the value at that offset in the meantime, and accessing it would then
mean reading an invalid entry.
Fix this by re-reading the value at the previous level offset and
comparing it with the cached value before performing the next level
access.
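Illustrated at the pmd level, the pattern is roughly the following (a
sketch of the idea rather than the exact patch; whether a
pmd_same()-style comparison helper is available depends on the
architecture/config):

    pmd_t pmdval = READ_ONCE(*pmd);         /* read the entry once */

    if (pmd_none(pmdval))
            goto spf_abort;
    /* ... perform the usual checks on the cached pmdval ... */

    /* Re-read and compare before descending: a concurrent page table
     * reclamation could have changed the entry under us. */
    if (!pmd_same(pmdval, READ_ONCE(*pmd)))
            goto spf_abort;

    pte = pte_offset_map(&pmdval, address); /* descend via cached value */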
Bug: 221005439
Change-Id: I66b3d24ae79c7ee5ccce4ba7a94f028f4cf3fda0
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.
From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().
At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
(would require ->page_mkwrite() changes).
vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/
Conflicts:
include/linux/vm_event_item.h
mm/vmstat.c
1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/
since the patch posted upstream at the time had a different structure,
with stats for anonymous and file-backed pagefaults introduced in a
separate patch.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a
Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.
Initially this is set for files that implement fault() with filemap_fault().
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69
In the speculative case, we know the page table already exists, and it
must be locked with pte_map_lock(). In the case where no page is found
for the given address, return VM_FAULT_RETRY which will abort the
fault before we get into the vm_ops->fault() callback. This is fine
because if filemap_map_pages does not find the page in page cache,
vm_ops->fault() will not either.
Initialize addr and last_pgoff to correspond to the pte at the original
fault address (which was mapped with pte_map_lock()), rather than the
pte at start_pgoff. The choice of initial values doesn't matter as
they will all be adjusted together before use, so they just need to be
consistent with each other, and using the original fault address and
pte allows us to reuse pte_map_lock() without any changes to it.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-29-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0acf4f9626ec0126cdc9a95a7ff1cd735c1af2ca
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.
Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.
Change xfs_filemap_map_pages() to account for the fact that it cannot
block anymore, as it is now running within an rcu read lock.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id771c1e6fa9b883595a48d4df63f448a05916eda
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-27-michel@lespinasse.org/
Conflicts:
mm/memory.c
1. Merge conflict due to new vmf->prealloc_pte usage in finish_fault.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If6046592083eaf12caf5c51c3fbb287a4dfa1ace
Extend filemap_fault() to handle speculative faults.
In the speculative case, we will only be fishing existing pages out of
the page cache. The logic we use mirrors what is done in the
non-speculative case, assuming that pages are found in the page cache,
are up to date and not already locked, and that readahead is not
necessary at this time. In all other cases, the fault is aborted to be
handled non-speculatively.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-26-michel@lespinasse.org/
Conflicts:
mm/filemap.c
1. Added back file_ra_state variable used by SPF path.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I82eba7fcfc81876245c2e65bc5ae3d33ddfcc368
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.
Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.
The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-25-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I995ba94d8e96014ef83ac93fe5a4669afcde34b9
In handle_pte_fault(), allow speculative execution to proceed.
Use pte_spinlock() to validate the mmap sequence count when locking
the page table.
If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia45d095ec7b8e23f1c5d68b7a7f572a3f6f6df97
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.
Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-27-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Icd2188e9facf5a7fea42000a2808bcda1ad6f0fc
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().
do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-22-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0390331facc9ecd37534012abdd9f255ab5bbb12
In the x86 fault handler, only attempt SPF if the vma is anonymous.
In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().
In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-20-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I875106fcfa1084f570c2bf8f24a129bdce55316b
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-19-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5ad955323faabc142c21f62415db039ac889066a
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.
The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.
In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.
If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
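In outline, the speculative variant of pte_map_lock() behaves like this
condensed sketch (helper names such as mmap_seq_read_check() and the
FAULT_FLAG_SPECULATIVE flag approximate the series and are not upstream
APIs):

    static bool pte_map_lock(struct vm_fault *vmf)
    {
            spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
            pte_t *pte = pte_offset_map(vmf->pmd, vmf->address);

            spin_lock(ptl);
            /* Abort if an mmap writer ran since the fault started. */
            if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
                !mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
                    pte_unmap_unlock(pte, ptl);
                    return false;   /* pte left unmapped and unlocked */
            }
            vmf->pte = pte;
            vmf->ptl = ptl;
            return true;
    }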
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/
Conflicts:
include/linux/mm.h
1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wire new page tables.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().
This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-16-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id550086fe568331aa71c91468f8314faad993b20
Speculative page faults will use these to protect against races with
page table reclamation.
This could always be handled by disabling local IRQs as the fast GUP
code does; however speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
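Concretely, this boils down to a pair of macros along these lines (a
sketch consistent with the series; exact placement and names may
differ):

    #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
    /* Page tables are freed after an RCU grace period, so rcu_read_lock()
     * is enough to keep them alive during the speculative walk. */
    #define speculative_page_walk_begin()   rcu_read_lock()
    #define speculative_page_walk_end()     rcu_read_unlock()
    #else
    /* Otherwise block the IPI-based table freeing, as fast GUP does. */
    #define speculative_page_walk_begin()   local_irq_disable()
    #define speculative_page_walk_end()     local_irq_enable()
    #endif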
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-15-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3efe5fc6a5a49d537cf33e8093daeea42550077a
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
The speculative handling closely mirrors the non-speculative logic.
This includes some x86 specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.
The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-14-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I2c078a173ee39f35af16daeee8c6a1466d10c3e8
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.
In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).
The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
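The wrapper itself is trivial; roughly (a sketch, with the new mmap
sequence count parameter passed as a don't-care value for
non-speculative callers):

    vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                               unsigned int flags, struct pt_regs *regs)
    {
            /* Non-speculative path: the mmap sequence count is unused. */
            return do_handle_mm_fault(vma, address, flags, 0, regs);
    }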
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict due to folios.
Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
This configuration variable will be used to build the code needed to
handle speculative page faults.
This is enabled by default on supported architectures with SMP and MMU set.
The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-7-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie1dc3af30bf3949173b126e6469f372c4505ec8e
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.
This change reduces the code duplication between the two cases.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict caused by folios in mem_cgroup_charge call.
Link: https://lore.kernel.org/all/20220128131006.67712-6-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic19579571925878d632e43aa40b9f50cdf473ee6
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e upstream.
In our testing, a livelocked task was found. Through sysrq printing, the
same stack was found every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0
The reason for the livelock is that swapcache_prepare() always returns
EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
cannot jump out of the loop. We suspect that the task that clears the
SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
priority of the task stuck in a livelock so that the task that clears
the SWAP_HAS_CACHE flag will run. The results show that the system
returns to normal after the priority is lowered.
In our testing, multiple real-time tasks are bound to the same core, and
the task in the livelock is the highest priority task of the core, so
the livelocked task cannot be preempted.
Although cond_resched() is used by __read_swap_cache_async, it is an
empty function in a preemptive system and cannot achieve the purpose
of releasing the CPU. A high-priority task cannot release the CPU
unless preempted by a higher-priority task. But when this task is
already the highest priority task on this core, other tasks will not be
able to be scheduled. So we think we should replace cond_resched() with
schedule_timeout_uninterruptible(1); schedule_timeout_uninterruptible
will call set_current_state first to set the task state, so the task
will be removed from the run queue, achieving the purpose of yielding
the CPU and preventing it from running in kernel mode for too long.
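The change itself is essentially a one-liner in
__read_swap_cache_async():

    schedule_timeout_uninterruptible(1);    /* was: cond_resched(); */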
(akpm: ugly hack becomes uglier. But it fixes the issue in a
backportable-to-stable fashion while we hopefully work on something
better)
Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roger Quadros <rogerq@kernel.org>
Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The FROMLIST patches merged in aosp/1974918 that add vmalloc support to
KASAN now have a few fixes staged in linux-next/akpm. Sync the changes.
Bug: 217222520
Bug: 222221793
Change-Id: I33dd30e3834a4d1bb8eac611b350004afdb08a74
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Guard isolate_and_split_free_page() with CONFIG_COMPACTION. This fixes
the following build error, as the function collides with its inline stub
from the header file:
mm/compaction.c:766:15: error: redefinition of ‘isolate_and_split_free_page’
766 | unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mm/compaction.c:14:
./include/linux/compaction.h:241:29: note: previous definition of ‘isolate_and_split_free_page’ was here
241 | static inline unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Bug: 201263307
Fixes: 8cd9aa93b7 ("ANDROID: implement wrapper for reverse migration")
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: Ie8f3fedcc9d4af5cfdcfd5829377671745ab77d6
When memory is tight, the system may start to compact memory to satisfy
demands for large contiguous memory. If one process tries to lock a
memory page that is being locked and isolated for compaction, it may
wait a long time or even forever. This is because compaction performs a
non-atomic PG_isolated clear while holding the page lock, which may
overwrite the PG_waiters bit set by the process that failed to obtain
the page lock and added itself to the waiting queue to wait for the lock
to be unlocked.
CPU1                                 CPU2
lock_page(page); (successful)
                                     lock_page(); (failed)
__ClearPageIsolated(page);           SetPageWaiters(page) (may be overwritten)
unlock_page(page);
The solution is to not perform non-atomic operations on page flags while
holding the page lock.
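In practice that means using the atomic flag helper for the clear (a
sketch; the actual patch adjusts the migration/compaction path):

    /* Non-atomic variant: a plain load/store of the flags word that can
     * silently discard a concurrent SetPageWaiters() on the same word. */
    __ClearPageIsolated(page);

    /* Atomic variant: a read-modify-write of only the PG_isolated bit,
     * safe against concurrent flag setters. */
    ClearPageIsolated(page);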
Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Vlastimil Babka" <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: "William Kucharski" <william.kucharski@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit 48911e41ddee4fe113bf1e4303dda1a413b169c9
https://github.com/hnaz/linux-mm.git)
Bug: 225086204
Change-Id: I3dc59bf75f12d4ee93779bcbe9336c6376d2d6b6
Reverse migration is used to balance the occupancy of the memory zones
in a node whose imbalance may be caused by migration of pages to other
zones by an operation, e.g. hotremoving and then hotadding the same
memory. In this case there is a lot of free memory in the newly hotadded
memory, which can be filled up by the previously migrated pages (moved
out as part of the offline/hotremove), thus relieving some pressure in
other zones of the node.
Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Bug: 201263307
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
Changes in 5.15.27
mac80211_hwsim: report NOACK frames in tx_status
mac80211_hwsim: initialize ieee80211_tx_info at hw_scan_work
i2c: bcm2835: Avoid clock stretching timeouts
ASoC: rt5668: do not block workqueue if card is unbound
ASoC: rt5682: do not block workqueue if card is unbound
regulator: core: fix false positive in regulator_late_cleanup()
Input: clear BTN_RIGHT/MIDDLE on buttonpads
btrfs: get rid of warning on transaction commit when using flushoncommit
KVM: arm64: vgic: Read HW interrupt pending state from the HW
block: loop:use kstatfs.f_bsize of backing file to set discard granularity
tipc: fix a bit overflow in tipc_crypto_key_rcv()
cifs: do not use uninitialized data in the owner/group sid
cifs: fix double free race when mount fails in cifs_get_root()
HID: amd_sfh: Handle amd_sfh work buffer in PM ops
HID: amd_sfh: Add functionality to clear interrupts
HID: amd_sfh: Add interrupt handler to process interrupts
cifs: modefromsids must add an ACE for authenticated users
selftests/seccomp: Fix seccomp failure by adding missing headers
drm/amd/pm: correct UMD pstate clocks for Dimgrey Cavefish and Beige Goby
selftests/ftrace: Do not trace do_softirq because of PREEMPT_RT
dmaengine: shdma: Fix runtime PM imbalance on error
i2c: cadence: allow COMPILE_TEST
i2c: imx: allow COMPILE_TEST
i2c: qup: allow COMPILE_TEST
net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990
block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern
usb: gadget: don't release an existing dev->buf
usb: gadget: clear related members when goto fail
exfat: reuse exfat_inode_info variable instead of calling EXFAT_I()
exfat: fix i_blocks for files truncated over 4 GiB
tracing: Add test for user space strings when filtering on string pointers
arm64: Mark start_backtrace() notrace and NOKPROBE_SYMBOL
serial: stm32: prevent TDR register overwrite when sending x_char
ext4: drop ineligible txn start stop APIs
ext4: simplify updating of fast commit stats
ext4: fast commit may not fallback for ineligible commit
ext4: fast commit may miss file actions
sched/fair: Fix fault in reweight_entity
ata: pata_hpt37x: fix PCI clock detection
drm/amdgpu: check vm ready by amdgpu_vm->evicting flag
tracing: Add ustring operation to filtering string pointers
ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
NFSD: Fix zero-length NFSv3 WRITEs
io_uring: fix no lock protection for ctx->cq_extra
tools/resolve_btf_ids: Close ELF file on error
mtd: spi-nor: Fix mtd size for s3an flashes
MIPS: fix local_{add,sub}_return on MIPS64
signal: In get_signal test for signal_group_exit every time through the loop
PCI: mediatek-gen3: Disable DVFSRC voltage request
PCI: rcar: Check if device is runtime suspended instead of __clk_is_enabled()
PCI: dwc: Do not remap invalid res
PCI: aardvark: Fix checking for MEM resource type
KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in guest
KVM: s390: Ensure kvm_arch_no_poll() is read once when blocking vCPU
KVM: VMX: Read Posted Interrupt "control" exactly once per loop iteration
KVM: X86: Ensure that dirty PDPTRs are loaded
KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg
KVM: x86: Exit to userspace if emulation prepared a completion callback
i3c: fix incorrect address slot lookup on 64-bit
i3c/master/mipi-i3c-hci: Fix a potentially infinite loop in 'hci_dat_v1_get_index()'
tracing: Do not let synth_events block other dyn_event systems during create
Input: ti_am335x_tsc - set ADCREFM for X configuration
Input: ti_am335x_tsc - fix STEPCONFIG setup for Z2
PCI: mvebu: Check for errors from pci_bridge_emul_init() call
PCI: mvebu: Do not modify PCI IO type bits in conf_write
PCI: mvebu: Fix support for bus mastering and PCI_COMMAND on emulated bridge
PCI: mvebu: Fix configuring secondary bus of PCIe Root Port via emulated bridge
PCI: mvebu: Setup PCIe controller to Root Complex mode
PCI: mvebu: Fix support for PCI_BRIDGE_CTL_BUS_RESET on emulated bridge
PCI: mvebu: Fix support for PCI_EXP_DEVCTL on emulated bridge
PCI: mvebu: Fix support for PCI_EXP_RTSTA on emulated bridge
PCI: mvebu: Fix support for DEVCAP2, DEVCTL2 and LNKCTL2 registers on emulated bridge
NFSD: Fix verifier returned in stable WRITEs
Revert "nfsd: skip some unnecessary stats in the v4 case"
nfsd: fix crash on COPY_NOTIFY with special stateid
x86/hyperv: Properly deal with empty cpumasks in hyperv_flush_tlb_multi()
drm/i915: don't call free_mmap_offset when purging
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
drm/sun4i: dw-hdmi: Fix missing put_device() call in sun8i_hdmi_phy_get
drm/atomic: Check new_crtc_state->active to determine if CRTC needs disable in self refresh mode
ntb_hw_switchtec: Fix pff ioread to read into mmio_part_cfg_all
ntb_hw_switchtec: Fix bug with more than 32 partitions
drm/amdkfd: Check for null pointer after calling kmemdup
drm/amdgpu: use spin_lock_irqsave to avoid deadlock by local interrupt
i3c: master: dw: check return of dw_i3c_master_get_free_pos()
dma-buf: cma_heap: Fix mutex locking section
tracing/uprobes: Check the return value of kstrdup() for tu->filename
tracing/probes: check the return value of kstrndup() for pbuf
mm: defer kmemleak object creation of module_alloc()
kasan: fix quarantine conflicting with init_on_free
selftests/vm: make charge_reserved_hugetlb.sh work with existing cgroup setting
hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list()
drm/amdgpu/display: Only set vblank_disable_immediate when PSR is not enabled
drm/amdgpu: filter out radeon PCI device IDs
drm/amdgpu: filter out radeon secondary ids as well
drm/amd/display: Use adjusted DCN301 watermarks
drm/amd/display: move FPU associated DSC code to DML folder
ethtool: Fix link extended state for big endian
octeontx2-af: Optimize KPU1 processing for variable-length headers
octeontx2-af: Reset PTP config in FLR handler
octeontx2-af: cn10k: RPM hardware timestamp configuration
octeontx2-af: cn10k: Use appropriate register for LMAC enable
octeontx2-af: Adjust LA pointer for cpt parse header
octeontx2-af: Add KPU changes to parse NGIO as separate layer
net/mlx5e: IPsec: Refactor checksum code in tx data path
net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
bpf: Use u64_stats_t in struct bpf_prog_stats
bpf: Fix possible race in inc_misses_counter
drm/amd/display: Update watermark values for DCN301
drm: mxsfb: Set fallback bus format when the bridge doesn't provide one
drm: mxsfb: Fix NULL pointer dereference
riscv/mm: Add XIP_FIXUP for phys_ram_base
drm/i915/display: split out dpt out of intel_display.c
drm/i915/display: Move DRRS code its own file
drm/i915: Disable DRRS on IVB/HSW port != A
gve: Recording rx queue before sending to napi
net: dsa: ocelot: seville: utilize of_mdiobus_register
net: dsa: seville: register the mdiobus under devres
ibmvnic: don't release napi in __ibmvnic_open()
of: net: move of_net under net/
net: ethernet: litex: Add the dependency on HAS_IOMEM
drm/mediatek: mtk_dsi: Reset the dsi0 hardware
cifs: protect session channel fields with chan_lock
cifs: fix confusing unneeded warning message on smb2.1 and earlier
drm/amd/display: Fix stream->link_enc unassigned during stream removal
bnxt_en: Fix occasional ethtool -t loopback test failures
drm/amd/display: For vblank_disable_immediate, check PSR is really used
PCI: mvebu: Fix device enumeration regression
net: of: fix stub of_net helpers for CONFIG_NET=n
ALSA: intel_hdmi: Fix reference to PCM buffer address
ucounts: Fix systemd LimitNPROC with private users regression
riscv/efi_stub: Fix get_boot_hartid_from_fdt() return value
riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
riscv: Fix config KASAN && DEBUG_VIRTUAL
iwlwifi: mvm: check debugfs_dir ptr before use
ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
iommu/amd: Recover from event log overflow
drm/i915: s/JSP2/ICP2/ PCH
drm/amd/display: Reduce dmesg error to a debug print
xen/netfront: destroy queues before real_num_tx_queues is zeroed
thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
mac80211: fix EAPoL rekey fail in 802.3 rx path
blktrace: fix use after free for struct blk_trace
ntb: intel: fix port config status offset for SPR
mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls
xfrm: fix MTU regression
netfilter: fix use-after-free in __nf_register_net_hook()
bpf, sockmap: Do not ignore orig_len parameter
xfrm: fix the if_id check in changelink
xfrm: enforce validity of offload input flags
e1000e: Correct NVM checksum verification flow
net: fix up skbs delta_truesize in UDP GRO frag_list
netfilter: nf_queue: don't assume sk is full socket
netfilter: nf_queue: fix possible use-after-free
netfilter: nf_queue: handle socket prefetch
batman-adv: Request iflink once in batadv-on-batadv check
batman-adv: Request iflink once in batadv_get_real_netdevice
batman-adv: Don't expect inter-netns unique iflink indices
net: ipv6: ensure we call ipv6_mc_down() at most once
net: dcb: flush lingering app table entries for unregistered devices
net: ipa: add an interconnect dependency
net/smc: fix connection leak
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server
btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range
mac80211: fix forwarded mesh frames AC & queue selection
net: stmmac: fix return value of __setup handler
mac80211: treat some SAE auth steps as final
iavf: Fix missing check for running netdev
net: sxgbe: fix return value of __setup handler
ibmvnic: register netdev after init of adapter
net: arcnet: com20020: Fix null-ptr-deref in com20020pci_probe()
ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc()
iavf: Fix deadlock in iavf_reset_task
efivars: Respect "block" flag in efivar_entry_set_safe()
auxdisplay: lcd2s: Fix lcd2s_redefine_char() feature
firmware: arm_scmi: Remove space in MODULE_ALIAS name
ASoC: cs4265: Fix the duplicated control name
auxdisplay: lcd2s: Fix memory leak in ->remove()
auxdisplay: lcd2s: Use proper API to free the instance of charlcd object
can: gs_usb: change active_channels's type from atomic_t to u8
iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
arm64: dts: rockchip: Switch RK3399-Gru DP to SPDIF output
igc: igc_read_phy_reg_gpy: drop premature return
ARM: Fix kgdb breakpoint for Thumb2
mips: setup: fix setnocoherentio() boolean setting
ARM: 9182/1: mmu: fix returns from early_param() and __setup() functions
mptcp: Correctly set DATA_FIN timeout when number of retransmits is large
selftests: mlxsw: tc_police_scale: Make test more robust
pinctrl: sunxi: Use unique lockdep classes for IRQs
igc: igc_write_phy_reg_gpy: drop premature return
ibmvnic: free reset-work-item when flushing
memfd: fix F_SEAL_WRITE after shmem huge page allocated
s390/extable: fix exception table sorting
sched: Fix yet more sched_fork() races
arm64: dts: juno: Remove GICv2m dma-range
iommu/amd: Fix I/O page table memory leak
MIPS: ralink: mt7621: do memory detection on KSEG1
ARM: dts: switch timer config to common devkit8000 devicetree
ARM: dts: Use 32KiHz oscillator on devkit8000
soc: fsl: guts: Revert commit 3c0d64e867
soc: fsl: guts: Add a missing memory allocation failure check
soc: fsl: qe: Check of ioremap return value
netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant
ARM: tegra: Move panels to AUX bus
can: etas_es58x: change opened_channel_cnt's type from atomic_t to u8
net: stmmac: enhance XDP ZC driver level switching performance
net: stmmac: only enable DMA interrupts when ready
ibmvnic: initialize rc before completing wait
ibmvnic: define flush_reset_queue helper
ibmvnic: complete init_done on transport events
net: chelsio: cxgb3: check the return value of pci_find_capability()
net: sparx5: Fix add vlan when invalid operation
iavf: Refactor iavf state machine tracking
iavf: Add __IAVF_INIT_FAILED state
iavf: Combine init and watchdog state machines
iavf: Add trace while removing device
iavf: Rework mutexes for better synchronisation
iavf: Add helper function to go from pci_dev to adapter
iavf: Fix kernel BUG in free_msi_irqs
iavf: Add waiting so the port is initialized in remove
iavf: Fix init state closure on remove
iavf: Fix locking for VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS
iavf: Fix race in init state
iavf: Fix __IAVF_RESETTING state usage
drm/i915/guc/slpc: Correct the param count for unset param
drm/bridge: ti-sn65dsi86: Properly undo autosuspend
e1000e: Fix possible HW unit hang after an s0ix exit
MIPS: ralink: mt7621: use bitwise NOT instead of logical
nl80211: Handle nla_memdup failures in handle_nan_filter
drm/amdgpu: fix suspend/resume hang regression
net: dcb: disable softirqs in dcbnl_flush_dev()
selftests: mlxsw: resource_scale: Fix return value
net: stmmac: perserve TX and RX coalesce value during XDP setup
iavf: do not override the adapter state in the watchdog task (again)
iavf: missing unlocks in iavf_watchdog_task()
MAINTAINERS: adjust file entry for of_net.c after movement
Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
Input: samsung-keypad - properly state IOMEM dependency
HID: add mapping for KEY_DICTATE
HID: add mapping for KEY_ALL_APPLICATIONS
tracing/histogram: Fix sorting on old "cpu" value
tracing: Fix return value of __setup handlers
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not WARN_ON() if we have PageError set
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: add missing run of delayed items after unlink during log replay
btrfs: do not start relocation until in progress drops are done
Revert "xfrm: xfrm_state_mtu should return at least 1280 for ipv6"
proc: fix documentation and description of pagemap
KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
hamradio: fix macro redefine warning
Linux 5.15.27
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie338dd23e0eb61feb540b4256b5d1840fee4db84
Changes in 5.15.26
mm/filemap: Fix handling of THPs in generic_file_buffered_read()
cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
cgroup-v1: Correct privileges check in release_agent writes
x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing
btrfs: tree-checker: check item_size for inode_item
btrfs: tree-checker: check item_size for dev_item
clk: jz4725b: fix mmc0 clock gating
io_uring: don't convert to jiffies for waiting on timeouts
io_uring: disallow modification of rsrc_data during quiesce
selinux: fix misuse of mutex_is_locked()
vhost/vsock: don't check owner in vhost_vsock_stop() while releasing
parisc/unaligned: Fix fldd and fstd unaligned handlers on 32-bit kernel
parisc/unaligned: Fix ldw() and stw() unalignment handlers
KVM: x86/mmu: make apf token non-zero to fix bug
drm/amd/display: Protect update_bw_bounding_box FPU code.
drm/amd/pm: fix some OEM SKU specific stability issues
drm/amd: Check if ASPM is enabled from PCIe subsystem
drm/amdgpu: disable MMHUB PG for Picasso
drm/amdgpu: do not enable asic reset for raven2
drm/i915: Widen the QGV point mask
drm/i915: Correctly populate use_sagv_wm for all pipes
drm/i915: Fix bw atomic check when switching between SAGV vs. no SAGV
sr9700: sanity check for packet length
USB: zaurus: support another broken Zaurus
CDC-NCM: avoid overflow in sanity checking
netfilter: xt_socket: fix a typo in socket_mt_destroy()
netfilter: xt_socket: missing ifdef CONFIG_IP6_NF_IPTABLES dependency
netfilter: nf_tables_offload: incorrect flow offload action array size
tee: export teedev_open() and teedev_close_context()
optee: use driver internal tee_context for some rpc
ping: remove pr_err from ping_lookup
Revert "i40e: Fix reset bw limit when DCB enabled with 1 TC"
gpu: host1x: Always return syncpoint value when waiting
perf evlist: Fix failed to use cpu list for uncore events
perf data: Fix double free in perf_session__delete()
mptcp: fix race in incoming ADD_ADDR option processing
mptcp: add mibs counter for ignored incoming options
selftests: mptcp: fix diag instability
selftests: mptcp: be more conservative with cookie MPJ limits
bnx2x: fix driver load from initrd
bnxt_en: Fix active FEC reporting to ethtool
bnxt_en: Fix offline ethtool selftest with RDMA enabled
bnxt_en: Fix incorrect multicast rx mask setting when not requested
hwmon: Handle failure to register sensor with thermal zone correctly
net/mlx5: Fix tc max supported prio for nic mode
ice: check the return of ice_ptp_gettimex64
ice: initialize local variable 'tlv'
net/mlx5: Update the list of the PCI supported devices
bpf: Fix crash due to incorrect copy_map_value
bpf: Do not try bpf_msg_push_data with len 0
selftests: bpf: Check bpf_msg_push_data return value
bpf: Fix a bpf_timer initialization issue
bpf: Add schedule points in batch ops
io_uring: add a schedule point in io_add_buffers()
net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends
nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info
tipc: Fix end of loop tests for list_for_each_entry()
gso: do not skip outer ip header in case of ipip and net_failover
net: mv643xx_eth: process retval from of_get_mac_address
openvswitch: Fix setting ipv6 fields causing hw csum failure
drm/edid: Always set RGB444
net/mlx5e: Fix wrong return value on ioctl EEPROM query failure
drm/vc4: crtc: Fix runtime_pm reference counting
drm/i915/dg2: Print PHY name properly on calibration error
net/sched: act_ct: Fix flow table lookup after ct clear or switching zones
net: ll_temac: check the return value of devm_kmalloc()
net: Force inlining of checksum functions in net/checksum.h
netfilter: nf_tables: unregister flowtable hooks on netns exit
nfp: flower: Fix a potential leak in nfp_tunnel_add_shared_mac()
net: mdio-ipq4019: add delay after clock enable
netfilter: nf_tables: fix memory leak during stateful obj update
net/smc: Use a mutex for locking "struct smc_pnettable"
surface: surface3_power: Fix battery readings on batteries without a serial number
udp_tunnel: Fix end of loop test in udp_tunnel_nic_unregister()
net/mlx5: DR, Cache STE shadow memory
ibmvnic: schedule failover only if vioctl fails
net/mlx5: DR, Don't allow match on IP w/o matching on full ethertype/ip_version
net/mlx5: Fix possible deadlock on rule deletion
net/mlx5: Fix wrong limitation of metadata match on ecpf
net/mlx5: DR, Fix the threshold that defines when pool sync is initiated
net/mlx5e: MPLSoUDP decap, fix check for unsupported matches
net/mlx5e: kTLS, Use CHECKSUM_UNNECESSARY for device-offloaded packets
net/mlx5: Update log_max_qp value to be 17 at most
spi: spi-zynq-qspi: Fix a NULL pointer dereference in zynq_qspi_exec_mem_op()
gpio: rockchip: Reset int_bothedge when changing trigger
regmap-irq: Update interrupt clear register for proper reset
net-timestamp: convert sk->sk_tskey to atomic_t
RDMA/rtrs-clt: Fix possible double free in error case
RDMA/rtrs-clt: Move free_permit from free_clt to rtrs_clt_close
bnxt_en: Increase firmware message response DMA wait time
configfs: fix a race in configfs_{,un}register_subsystem()
RDMA/ib_srp: Fix a deadlock
tracing: Dump stacktrace trigger to the corresponding instance
tracing: Have traceon and traceoff trigger honor the instance
iio:imu:adis16480: fix buffering for devices with no burst mode
iio: adc: men_z188_adc: Fix a resource leak in an error handling path
iio: adc: tsc2046: fix memory corruption by preventing array overflow
iio: adc: ad7124: fix mask used for setting AIN_BUFP & AIN_BUFM bits
iio: accel: fxls8962af: add padding to regmap for SPI
iio: imu: st_lsm6dsx: wait for settling time in st_lsm6dsx_read_oneshot
iio: Fix error handling for PM
sc16is7xx: Fix for incorrect data being transmitted
ata: pata_hpt37x: disable primary channel on HPT371
Revert "USB: serial: ch341: add new Product ID for CH341A"
usb: gadget: rndis: add spinlock for rndis response list
USB: gadget: validate endpoint index for xilinx udc
tracefs: Set the group ownership in apply_options() not parse_options()
USB: serial: option: add support for DW5829e
USB: serial: option: add Telit LE910R1 compositions
usb: dwc2: drd: fix soft connect when gadget is unconfigured
usb: dwc3: pci: Add "snps,dis_u2_susphy_quirk" for Intel Bay Trail
usb: dwc3: pci: Fix Bay Trail phy GPIO mappings
usb: dwc3: gadget: Let the interrupt handler disable bottom halves.
xhci: re-initialize the HC during resume if HCE was set
xhci: Prevent futile URB re-submissions due to incorrect return value.
nvmem: core: Fix a conflict between MTD and NVMEM on wp-gpios property
mtd: core: Fix a conflict between MTD and NVMEM on wp-gpios property
driver core: Free DMA range map when device is released
btrfs: prevent copying too big compressed lzo segment
RDMA/cma: Do not change route.addr.src_addr outside state checks
thermal: int340x: fix memory leak in int3400_notify()
staging: fbtft: fb_st7789v: reset display before initialization
tps6598x: clear int mask on probe failure
IB/qib: Fix duplicate sysfs directory name
riscv: fix nommu_k210_sdcard_defconfig
riscv: fix oops caused by irqsoff latency tracer
tty: n_gsm: fix encoding of control signal octet bit DV
tty: n_gsm: fix proper link termination after failed open
tty: n_gsm: fix NULL pointer access due to DLCI release
tty: n_gsm: fix wrong tty control line for flow control
tty: n_gsm: fix wrong modem processing in convergence layer type 2
tty: n_gsm: fix deadlock in gsmtty_open()
pinctrl: fix loop in k210_pinconf_get_drive()
pinctrl: k210: Fix bias-pull-up
gpio: tegra186: Fix chip_data type confusion
memblock: use kfree() to release kmalloced memblock regions
ice: Fix race conditions between virtchnl handling and VF ndo ops
ice: fix concurrent reset and removal of VFs
Linux 5.15.26
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ied0cc9bd48b7af71a064107676f37b0dd39ce3cf
In for_each_object_track() we walk the metadata of the slab object in
the callback function (fn), and as a result KASAN reports a false-positive
out-of-bounds access. Fix this by wrapping that function call with
metadata_access_enable()/metadata_access_disable().
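A minimal sketch of the shape of that fix (assumed form; the actual
Android patch may differ). metadata_access_enable() and
metadata_access_disable() are the existing mm/slub.c helpers that tell
KASAN the upcoming slab-metadata reads are intentional; the callback
signature below is illustrative:

  metadata_access_enable();   /* suspend KASAN checks for metadata reads */
  ret = fn(s, object, track); /* callback walks the object's track data  */
  metadata_access_disable();  /* restore normal KASAN checking           */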
Bug: 222651868
Fixes: ee8d2c7884 ("ANDROID: mm: add get_each_object_track function")
Change-Id: Ifb4241a9c3e397a52759d467aa267d1297e297dd
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit cd6e5d5d7d0338fbfe58010f7dde0a521db6b681)
commit f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99 upstream.
Wangyong reports: after enabling the tmpfs filesystem's transparent
hugepage support with the following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program tries to add F_SEAL_WRITE through the following
fcntl() call, but it fails unexpectedly with errno EBUSY:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.
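A hedged userspace sketch of that scenario (my construction for
illustration, not taken from the report; docker's actual sequence
differs, and whether shmem really backs the file with a huge page
depends on kernel configuration):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 2UL << 20;  /* 2 MiB, one PMD-sized huge page */

          /* MFD_ALLOW_SEALING so that seals may be added later */
          int fd = memfd_create("seal-test", MFD_ALLOW_SEALING);
          if (fd < 0 || ftruncate(fd, len) < 0) {
                  perror("setup");
                  return 1;
          }

          /* Fault the file in via a shared mapping so shmem may back it
           * with a THP, then drop the mapping before sealing. */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          memset(p, 1, len);
          munmap(p, len);

          /* With no writable mappings or pins left this should succeed;
           * on affected kernels it could fail with EBUSY. */
          if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
                  printf("F_ADD_SEALS: %s\n", strerror(errno));
          else
                  printf("F_SEAL_WRITE added\n");
          return 0;
  }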
Make memfd_tag_pins() (which compared > 1) as strict as
memfd_wait_for_pins() (which compares != 1): either threshold can be
justified, but given the non-atomic total_mapcount() calculation, it is
better now to be strict. Bear in mind that total_mapcount() itself scans
all of the THP subpages, when choosing to take an XA_CHECK_SCHED latency
break.
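In code, the tightened test has roughly this shape (a sketch matching
the description above, not the verbatim diff):

  struct page *head = compound_head(page);

  /* one reference belongs to the page cache itself; any further
   * count/mapcount imbalance on the head is treated as a pin, so the
   * page is tagged (or kept waiting) for a later recheck */
  bool extra_pins = page_count(head) != total_mapcount(head) + 1;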
Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.
Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reported-by: wangyong <wang.yong12@zte.com.cn>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: CGEL ZTE <cgel.zte@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream.
syzkaller was recently triggering an oversized kvmalloc() warning via
xdp_umem_create().
The triggered warning was added back in 7661809d49 ("mm: don't allow
oversized kvmalloc() calls"). The warning for huge kvmalloc sizes was
added in reaction to a security bug where the size was more than UINT_MAX
but not everything was prepared to handle unsigned long sizes.
Anyway, the AF_XDP-related call trace from this syzkaller report was:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
__sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
__do_sys_setsockopt net/socket.c:2187 [inline]
__se_sys_setsockopt net/socket.c:2184 [inline]
__x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Björn mentioned that requests for >2GB allocation can still be valid:
The structure that is being allocated is the page-pinning accounting.
AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
still fewer than what memcg allows (PAGE_COUNTER_MAX is
LONG_MAX/PAGE_SIZE on 64-bit systems). [...]
I could just change from U32_MAX to INT_MAX, but as I stated earlier
that has a hacky feeling to it. [...] From my perspective, the code
isn't broken, with the memcg limits in consideration. [...]
Linus says:
[...] Pretty much every time this has come up, the kernel warning has
shown that yes, the code was broken and there really wasn't a reason
for doing allocations that big.
Of course, some people would be perfectly fine with the allocation
failing, they just don't want the warning. I didn't want __GFP_NOWARN
to shut it up originally because I wanted people to see all those
cases, but these days I think we can just say "yeah, people can shut
it up explicitly by saying 'go ahead and fail this allocation, don't
warn about it'".
So enough time has passed that by now I'd certainly be ok with [it].
Thus allow call sites to silence such userspace-triggered splats if the
allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
to kvcalloc() this is already the case, so nothing else needed there.
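The resulting call-site pattern looks roughly like this (a sketch; the
variable names are illustrative, not from the patch):

  struct page **pgs;

  /* An oversized request (> INT_MAX bytes) still fails with NULL, but
   * __GFP_NOWARN now suppresses the WARN splat for it. */
  pgs = kvcalloc(npages, sizeof(*pgs), GFP_KERNEL | __GFP_NOWARN);
  if (!pgs)
          return -ENOMEM;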
Fixes: 7661809d49 ("mm: don't allow oversized kvmalloc() calls")
Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>