Page replacement is handled in the Linux kernel in one of two ways:
1) Asynchronously, via kswapd
2) Synchronously, via direct reclaim
At page allocation time, the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, most likely directly or indirectly executing business logic.
Just prior to satisfying the allocation, the number of free pages is
checked against the zone low watermark, and if it has been reached,
kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often
ranging from tens to hundreds of milliseconds for small order-0
allocations to half a second or more for order-9 huge-page allocations.
In fact, kswapd is not actually required on a Linux system. It exists
for the sole purpose of optimizing performance by preventing direct
reclaims.
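In rough pseudocode, the allocation-time checks described above look
like the following sketch. This is a simplification, not the actual
mm/page_alloc.c flow; zone_watermark_ok() and wakeup_kswapd() are real
kernel helpers, but the two allocation helpers at the bottom are
hypothetical names standing in for the real fast path and the direct
reclaim slow path:

    /* Simplified sketch of the watermark checks at allocation time. */
    static struct page *alloc_page_sketch(struct zone *zone, gfp_t gfp,
                                          unsigned int order)
    {
            /* Free pages at or below the low watermark: wake kswapd so
             * it reclaims asynchronously while we keep allocating. */
            if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                   zone_idx(zone), 0))
                    wakeup_kswapd(zone, gfp, order, zone_idx(zone));

            /* At or below the min watermark: this task must reclaim for
             * itself before the allocation succeeds (direct reclaim). */
            if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                   zone_idx(zone), 0))
                    return alloc_pages_direct_reclaim(zone, gfp, order);

            return take_page_from_freelist(zone, order);  /* hypothetical */
    }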
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory-allocating task can set the stage for collateral damage in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a
single-threaded kswapd that will get worse with each generation of new
hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details.
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. For brevity, I did not include
all of the test output.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption, from vmstat output. The value
  doesn't tend to fluctuate much, so I just grab the highest value.
  Each sample is averaged over 10 seconds.
- dd_cpu - CPU consumption for all of the dd tasks, averaged across the top
  samples since there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks is increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system
CPU was consumed, as expected, since the drives DMA directly into the
user address space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd counter in /proc/vmstat starts to increment.
At that point, metrics are parsed and reported, and the pagecache contents
are dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is
the same test, except we wait for kswapd to wake up before we start
collecting metrics. The script actually keeps track of a few things that
were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely
because fewer tasks were performing page replacement in parallel at any
given time.
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit 0d61a651e4)
The 'kunit_test' member variable in task_struct is defined
under CONFIG_KUNIT. Besides, there are supporting functions
in slub and kasan which get conditionally compiled out.
Allow kunit to be built as a vendor module by removing
compile-time dependencies.
Bug: 215096354
Change-Id: If57b1df6e479aa0388aabc53af5ae10e20a844b2
Signed-off-by: Shiraz Hashim <quic_shashim@quicinc.com>
To support multiple kswapd threads, vendor modules need
access to the kswapd function, so export it.
Bug: 201263306
Change-Id: I442612710835f39836a295e9d1936f86826ab960
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
The anon_vma_name refcount in a deep process chain with many vmas could
grow really high. With the default sysctl_max_map_count (64k) and the
default pid_max (32k), the max number of vmas in the system is
2147450880, and the refcounter has headroom of 1073774592 before it
reaches REFCOUNT_SATURATED (3221225472).
Therefore it's unlikely that an anonymous name refcounter will overflow
with these defaults. Currently the max for pid_max is PID_MAX_LIMIT
(4194304) and for sysctl_max_map_count it's INT_MAX (2147483647). In
this configuration an anon_vma_name refcount overflow becomes
theoretically possible (though it would still require heavy sharing of
that anon_vma_name between processes).
The kref refcounting interface used in the anon_vma_name structure will
detect a counter overflow when it reaches the REFCOUNT_SATURATED value,
but will only generate a warning and freeze the refcounter. This would
lead to the refcounted object never being freed. A determined attacker
could leak memory this way, but it would be a rather expensive and
inefficient way to do so.
To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
leaves INT_MAX/2 (1073741823) values before the counter reaches
REFCOUNT_SATURATED. This should provide enough headroom for raising the
refcounts temporarily.
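The heart of the change is a reuse helper along these lines (a sketch
approximating the upstream fix; kref_read() and REFCOUNT_MAX are the
real interfaces):

    /* Sketch: refuse to share the name once the refcount nears
     * saturation. */
    static struct anon_vma_name *anon_vma_name_reuse(struct anon_vma_name *anon_name)
    {
            if (kref_read(&anon_name->kref) < REFCOUNT_MAX) {
                    anon_vma_name_get(anon_name);
                    return anon_name;
            }
            /* Out of headroom: hand out a fresh copy of the name. */
            return anon_vma_name_alloc(anon_name->name);
    }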
Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96403e11283def1d1c465c8279514c9a504d8630)
Bug: 218352794
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ieaab58f6300d9aff3139eed1c1d3417237d81955
When compiled with CONFIG_SHMEM=n, shmem.c does not include internal.h,
and the isolate_lru_page() function declaration can't be found.
Fix this by making the isolate_lru_page() usage conditional upon
CONFIG_SHMEM inside reclaim_shmem_address_space().
Fixes: 96f80f6284 ("ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia46a57681d26ac103e84ef7caa61a22dbd45cf04
Changes in 5.15.31
crypto: qcom-rng - ensure buffer for generate is completely filled
ocfs2: fix crash when initialize filecheck kobj fails
mm: swap: get rid of livelock in swapin readahead
block: release rq qos structures for queue without disk
drm/mgag200: Fix PLL setup for g200wb and g200ew
efi: fix return value of __setup handlers
alx: acquire mutex for alx_reinit in alx_change_mtu
vsock: each transport cycles only on its own sockets
esp6: fix check on ipv6_skip_exthdr's return value
net: phy: marvell: Fix invalid comparison in the resume and suspend functions
net/packet: fix slab-out-of-bounds access in packet_recvmsg()
atm: eni: Add check for dma_map_single
iavf: Fix double free in iavf_reset_task
hv_netvsc: Add check for kvmalloc_array
drm/imx: parallel-display: Remove bus flags check in imx_pd_bridge_atomic_check()
drm/panel: simple: Fix Innolux G070Y2-L01 BPP settings
net: handle ARPHRD_PIMREG in dev_is_mac_header_xmit()
drm: Don't make DRM_PANEL_BRIDGE dependent on DRM_KMS_HELPERS
net: dsa: Add missing of_node_put() in dsa_port_parse_of
net: phy: mscc: Add MODULE_FIRMWARE macros
bnx2x: fix built-in kernel driver load failure
net: bcmgenet: skip invalid partial checksums
net: mscc: ocelot: fix backwards compatibility with single-chain tc-flower offload
iavf: Fix hang during reboot/shutdown
arm64: fix clang warning about TRAMP_VALIAS
usb: gadget: rndis: prevent integer overflow in rndis_set_response()
usb: gadget: Fix use-after-free bug by not setting udc->dev.driver
usb: usbtmc: Fix bug in pipe direction for control transfers
scsi: mpt3sas: Page fault in reply q processing
Input: aiptek - properly check endpoint type
perf symbols: Fix symbol size calculation condition
btrfs: skip reserved bytes warning on unmount after log cleanup failure
Linux 5.15.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iea69c3aeae614eb6b871993dc29fc1010c064f24
Add functionality that allows users of shmem to reclaim its pages
without going through the kswapd/direct reclaim path. An example use
case: say that a device allocates a large amount of shmem pages and
shares them with hardware. To reclaim such pages faster, drivers can
register shrinkers and call reclaim_shmem_address_space().
The implementation of this function is mostly borrowed from
reclaim_address_space() implemented for per process reclaim[1].
[1] https://lore.kernel.org/patchwork/cover/378056/
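A minimal usage sketch, assuming a driver that holds a shmem-backed
struct file (the shrinker callback and my_drv_shmem_file are
hypothetical; only reclaim_shmem_address_space() is added by this
change):

    /* Hypothetical shrinker scan callback that reclaims the driver's
     * shmem-backed pages without waiting for kswapd/direct reclaim. */
    static unsigned long my_drv_shrink_scan(struct shrinker *shrinker,
                                            struct shrink_control *sc)
    {
            return reclaim_shmem_address_space(my_drv_shmem_file->f_mapping);
    }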
Bug: 201263305
Change-Id: I03d2c3b9610612af977f89ddeabb63b8e9e50918
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
In the speculative fault path, during page table lookup, the offset is
obtained at each level and the value at that offset is read and checked;
later, to get the next level offset, we read from the previous level
offset again. A concurrent page table reclamation operation could change
the value at that offset in the meantime, and accessing it would then
mean reading an invalid entry.
Fix this by re-reading the value at the previous level offset and
comparing it with the cached value before performing the next level
access.
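Illustrated at the pmd level, the pattern is roughly the following (a
sketch of the idea rather than the exact patch; whether a
pmd_same()-style comparison helper is available depends on the
architecture/config):

    pmd_t pmdval = READ_ONCE(*pmd);         /* read the entry once */

    if (pmd_none(pmdval))
            goto spf_abort;
    /* ... perform the usual checks on the cached pmdval ... */

    /* Re-read and compare before descending: a concurrent page table
     * reclamation could have changed the entry under us. */
    if (!pmd_same(pmdval, READ_ONCE(*pmd)))
            goto spf_abort;

    pte = pte_offset_map(&pmdval, address); /* descend via cached value */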
Bug: 221005439
Change-Id: I66b3d24ae79c7ee5ccce4ba7a94f028f4cf3fda0
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.
From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().
At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
(would require ->page_mkwrite() changes).
vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/
Conflicts:
include/linux/vm_event_item.h
mm/vmstat.c
1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/
since the patch posted upstream at the time had a different structure,
with stats for anonymous and file-backed pagefaults introduced in a
separate patch.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a
Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.
Initially this is set for files that implement fault() with filemap_fault().
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69
In the speculative case, we know the page table already exists, and it
must be locked with pte_map_lock(). In the case where no page is found
for the given address, return VM_FAULT_RETRY which will abort the
fault before we get into the vm_ops->fault() callback. This is fine
because if filemap_map_pages does not find the page in page cache,
vm_ops->fault() will not either.
Initialize addr and last_pgoff to correspond to the pte at the original
fault address (which was mapped with pte_map_lock()), rather than the
pte at start_pgoff. The choice of initial values doesn't matter as
they will all be adjusted together before use, so they just need to be
consistent with each other, and using the original fault address and
pte allows us to reuse pte_map_lock() without any changes to it.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-29-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0acf4f9626ec0126cdc9a95a7ff1cd735c1af2ca
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.
Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.
Change xfs_filemap_map_pages() to account for the fact that it cannot
block anymore, as it is now running within an rcu read lock.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id771c1e6fa9b883595a48d4df63f448a05916eda
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-27-michel@lespinasse.org/
Conflicts:
mm/memory.c
1. Merge conflict due to new vmf->prealloc_pte usage in finish_fault.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If6046592083eaf12caf5c51c3fbb287a4dfa1ace
Extend filemap_fault() to handle speculative faults.
In the speculative case, we will only be fishing existing pages out of
the page cache. The logic we use mirrors what is done in the
non-speculative case, assuming that pages are found in the page cache,
are up to date and not already locked, and that readahead is not
necessary at this time. In all other cases, the fault is aborted to be
handled non-speculatively.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-26-michel@lespinasse.org/
Conflicts:
mm/filemap.c
1. Added back file_ra_state variable used by SPF path.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I82eba7fcfc81876245c2e65bc5ae3d33ddfcc368
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.
Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.
The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-25-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I995ba94d8e96014ef83ac93fe5a4669afcde34b9
In handle_pte_fault(), allow speculative execution to proceed.
Use pte_spinlock() to validate the mmap sequence count when locking
the page table.
If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia45d095ec7b8e23f1c5d68b7a7f572a3f6f6df97
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.
Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-27-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Icd2188e9facf5a7fea42000a2808bcda1ad6f0fc
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().
do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-22-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0390331facc9ecd37534012abdd9f255ab5bbb12
In the x86 fault handler, only attempt SPF if the vma is anonymous.
In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().
In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-20-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I875106fcfa1084f570c2bf8f24a129bdce55316b
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-19-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5ad955323faabc142c21f62415db039ac889066a
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.
The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.
In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.
If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
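In outline, the speculative variant of pte_map_lock() behaves like this
condensed sketch (helper names such as mmap_seq_read_check() and the
FAULT_FLAG_SPECULATIVE flag approximate the series and are not upstream
APIs):

    static bool pte_map_lock(struct vm_fault *vmf)
    {
            spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
            pte_t *pte = pte_offset_map(vmf->pmd, vmf->address);

            spin_lock(ptl);
            /* Abort if an mmap writer ran since the fault started. */
            if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
                !mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
                    pte_unmap_unlock(pte, ptl);
                    return false;   /* pte left unmapped and unlocked */
            }
            vmf->pte = pte;
            vmf->ptl = ptl;
            return true;
    }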
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/
Conflicts:
include/linux/mm.h
1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wire new page tables.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().
This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-16-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id550086fe568331aa71c91468f8314faad993b20
Speculative page faults will use these to protect against races with
page table reclamation.
This could always be handled by disabling local IRQs as the fast GUP
code does; however speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
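Concretely, this boils down to a pair of macros along these lines (a
sketch consistent with the series; exact placement and names may
differ):

    #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
    /* Page tables are freed after an RCU grace period, so rcu_read_lock()
     * is enough to keep them alive during the speculative walk. */
    #define speculative_page_walk_begin()   rcu_read_lock()
    #define speculative_page_walk_end()     rcu_read_unlock()
    #else
    /* Otherwise block the IPI-based table freeing, as fast GUP does. */
    #define speculative_page_walk_begin()   local_irq_disable()
    #define speculative_page_walk_end()     local_irq_enable()
    #endif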
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-15-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3efe5fc6a5a49d537cf33e8093daeea42550077a
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
The speculative handling closely mirrors the non-speculative logic.
This includes some x86 specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.
The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-14-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I2c078a173ee39f35af16daeee8c6a1466d10c3e8
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.
In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).
The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
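The wrapper itself is trivial; roughly (a sketch, with the new mmap
sequence count parameter passed as a don't-care value for
non-speculative callers):

    vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                               unsigned int flags, struct pt_regs *regs)
    {
            /* Non-speculative path: the mmap sequence count is unused. */
            return do_handle_mm_fault(vma, address, flags, 0, regs);
    }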
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict due to folios.
Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
This configuration variable will be used to build the code needed to
handle speculative page faults.
This is enabled by default on supported architectures with SMP and MMU set.
The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-7-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie1dc3af30bf3949173b126e6469f372c4505ec8e
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.
This change reduces the code duplication between the two cases.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict caused by folios in mem_cgroup_charge call.
Link: https://lore.kernel.org/all/20220128131006.67712-6-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic19579571925878d632e43aa40b9f50cdf473ee6
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e upstream.
In our testing, a livelocked task was found. Through sysrq printing, the
same stack was found every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0
The reason for the livelock is that swapcache_prepare() always returns
EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
cannot jump out of the loop. We suspect that the task that clears the
SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
priority of the task stuck in a livelock so that the task that clears
the SWAP_HAS_CACHE flag will run. The results show that the system
returns to normal after the priority is lowered.
In our testing, multiple real-time tasks are bound to the same core, and
the task in the livelock is the highest priority task of the core, so
the livelocked task cannot be preempted.
Although cond_resched() is used by __read_swap_cache_async, it is an
empty function in a preemptive system and cannot achieve the purpose
of releasing the CPU. A high-priority task cannot release the CPU
unless preempted by a higher-priority task. But when this task is
already the highest priority task on this core, other tasks will not be
able to be scheduled. So we think we should replace cond_resched() with
schedule_timeout_uninterruptible(1); schedule_timeout_uninterruptible
will call set_current_state first to set the task state, so the task
will be removed from the run queue, achieving the purpose of yielding
the CPU and preventing it from running in kernel mode for too long.
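The change itself is essentially a one-liner in
__read_swap_cache_async():

    schedule_timeout_uninterruptible(1);    /* was: cond_resched(); */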
(akpm: ugly hack becomes uglier. But it fixes the issue in a
backportable-to-stable fashion while we hopefully work on something
better)
Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roger Quadros <rogerq@kernel.org>
Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The FROMLIST patches merged in aosp/1974918 that add vmalloc support to
KASAN now have a few fixes staged in linux-next/akpm. Sync the changes.
Bug: 217222520
Bug: 222221793
Change-Id: I33dd30e3834a4d1bb8eac611b350004afdb08a74
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Guard isolate_and_split_free_page() with CONFIG_COMPACTION. This fixes
the following build error, as the function collides with its inline stub
from the header file:
mm/compaction.c:766:15: error: redefinition of ‘isolate_and_split_free_page’
766 | unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mm/compaction.c:14:
./include/linux/compaction.h:241:29: note: previous definition of ‘isolate_and_split_free_page’ was here
241 | static inline unsigned long isolate_and_split_free_page(struct page *page,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Bug: 201263307
Fixes: 8cd9aa93b7 ("ANDROID: implement wrapper for reverse migration")
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: Ie8f3fedcc9d4af5cfdcfd5829377671745ab77d6
When memory is tight, the system may start to compact memory to satisfy
demands for large contiguous memory. If one process tries to lock a
memory page that is being locked and isolated for compaction, it may
wait a long time or even forever. This is because compaction performs a
non-atomic PG_isolated clear while holding the page lock, which may
overwrite the PG_waiters bit set by the process that failed to obtain
the page lock and added itself to the waiting queue to wait for the lock
to be unlocked.
CPU1                                 CPU2
lock_page(page); (successful)
                                     lock_page(); (failed)
__ClearPageIsolated(page);           SetPageWaiters(page) (may be overwritten)
unlock_page(page);
The solution is to not perform non-atomic operations on page flags while
holding the page lock.
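In practice that means using the atomic flag helper for the clear (a
sketch; the actual patch adjusts the migration/compaction path):

    /* Non-atomic variant: a plain load/store of the flags word that can
     * silently discard a concurrent SetPageWaiters() on the same word. */
    __ClearPageIsolated(page);

    /* Atomic variant: a read-modify-write of only the PG_isolated bit,
     * safe against concurrent flag setters. */
    ClearPageIsolated(page);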
Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Vlastimil Babka" <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: "William Kucharski" <william.kucharski@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
(cherry picked from commit 48911e41ddee4fe113bf1e4303dda1a413b169c9
https://github.com/hnaz/linux-mm.git)
Bug: 225086204
Change-Id: I3dc59bf75f12d4ee93779bcbe9336c6376d2d6b6
Reverse migration is used to balance the occupancy of the memory zones
in a node whose imbalance may be caused by migration of pages to other
zones by an operation, e.g. hotremoving and then hotadding the same
memory. In this case there is a lot of free memory in the newly hotadded
memory, which can be filled up by the previously migrated pages (moved
out as part of the offline/hotremove), thus relieving some pressure in
other zones of the node.
Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Bug: 201263307
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
Changes in 5.15.27
mac80211_hwsim: report NOACK frames in tx_status
mac80211_hwsim: initialize ieee80211_tx_info at hw_scan_work
i2c: bcm2835: Avoid clock stretching timeouts
ASoC: rt5668: do not block workqueue if card is unbound
ASoC: rt5682: do not block workqueue if card is unbound
regulator: core: fix false positive in regulator_late_cleanup()
Input: clear BTN_RIGHT/MIDDLE on buttonpads
btrfs: get rid of warning on transaction commit when using flushoncommit
KVM: arm64: vgic: Read HW interrupt pending state from the HW
block: loop:use kstatfs.f_bsize of backing file to set discard granularity
tipc: fix a bit overflow in tipc_crypto_key_rcv()
cifs: do not use uninitialized data in the owner/group sid
cifs: fix double free race when mount fails in cifs_get_root()
HID: amd_sfh: Handle amd_sfh work buffer in PM ops
HID: amd_sfh: Add functionality to clear interrupts
HID: amd_sfh: Add interrupt handler to process interrupts
cifs: modefromsids must add an ACE for authenticated users
selftests/seccomp: Fix seccomp failure by adding missing headers
drm/amd/pm: correct UMD pstate clocks for Dimgrey Cavefish and Beige Goby
selftests/ftrace: Do not trace do_softirq because of PREEMPT_RT
dmaengine: shdma: Fix runtime PM imbalance on error
i2c: cadence: allow COMPILE_TEST
i2c: imx: allow COMPILE_TEST
i2c: qup: allow COMPILE_TEST
net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990
block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern
usb: gadget: don't release an existing dev->buf
usb: gadget: clear related members when goto fail
exfat: reuse exfat_inode_info variable instead of calling EXFAT_I()
exfat: fix i_blocks for files truncated over 4 GiB
tracing: Add test for user space strings when filtering on string pointers
arm64: Mark start_backtrace() notrace and NOKPROBE_SYMBOL
serial: stm32: prevent TDR register overwrite when sending x_char
ext4: drop ineligible txn start stop APIs
ext4: simplify updating of fast commit stats
ext4: fast commit may not fallback for ineligible commit
ext4: fast commit may miss file actions
sched/fair: Fix fault in reweight_entity
ata: pata_hpt37x: fix PCI clock detection
drm/amdgpu: check vm ready by amdgpu_vm->evicting flag
tracing: Add ustring operation to filtering string pointers
ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
NFSD: Fix zero-length NFSv3 WRITEs
io_uring: fix no lock protection for ctx->cq_extra
tools/resolve_btf_ids: Close ELF file on error
mtd: spi-nor: Fix mtd size for s3an flashes
MIPS: fix local_{add,sub}_return on MIPS64
signal: In get_signal test for signal_group_exit every time through the loop
PCI: mediatek-gen3: Disable DVFSRC voltage request
PCI: rcar: Check if device is runtime suspended instead of __clk_is_enabled()
PCI: dwc: Do not remap invalid res
PCI: aardvark: Fix checking for MEM resource type
KVM: VMX: Don't unblock vCPU w/ Posted IRQ if IRQs are disabled in guest
KVM: s390: Ensure kvm_arch_no_poll() is read once when blocking vCPU
KVM: VMX: Read Posted Interrupt "control" exactly once per loop iteration
KVM: X86: Ensure that dirty PDPTRs are loaded
KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg
KVM: x86: Exit to userspace if emulation prepared a completion callback
i3c: fix incorrect address slot lookup on 64-bit
i3c/master/mipi-i3c-hci: Fix a potentially infinite loop in 'hci_dat_v1_get_index()'
tracing: Do not let synth_events block other dyn_event systems during create
Input: ti_am335x_tsc - set ADCREFM for X configuration
Input: ti_am335x_tsc - fix STEPCONFIG setup for Z2
PCI: mvebu: Check for errors from pci_bridge_emul_init() call
PCI: mvebu: Do not modify PCI IO type bits in conf_write
PCI: mvebu: Fix support for bus mastering and PCI_COMMAND on emulated bridge
PCI: mvebu: Fix configuring secondary bus of PCIe Root Port via emulated bridge
PCI: mvebu: Setup PCIe controller to Root Complex mode
PCI: mvebu: Fix support for PCI_BRIDGE_CTL_BUS_RESET on emulated bridge
PCI: mvebu: Fix support for PCI_EXP_DEVCTL on emulated bridge
PCI: mvebu: Fix support for PCI_EXP_RTSTA on emulated bridge
PCI: mvebu: Fix support for DEVCAP2, DEVCTL2 and LNKCTL2 registers on emulated bridge
NFSD: Fix verifier returned in stable WRITEs
Revert "nfsd: skip some unnecessary stats in the v4 case"
nfsd: fix crash on COPY_NOTIFY with special stateid
x86/hyperv: Properly deal with empty cpumasks in hyperv_flush_tlb_multi()
drm/i915: don't call free_mmap_offset when purging
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
drm/sun4i: dw-hdmi: Fix missing put_device() call in sun8i_hdmi_phy_get
drm/atomic: Check new_crtc_state->active to determine if CRTC needs disable in self refresh mode
ntb_hw_switchtec: Fix pff ioread to read into mmio_part_cfg_all
ntb_hw_switchtec: Fix bug with more than 32 partitions
drm/amdkfd: Check for null pointer after calling kmemdup
drm/amdgpu: use spin_lock_irqsave to avoid deadlock by local interrupt
i3c: master: dw: check return of dw_i3c_master_get_free_pos()
dma-buf: cma_heap: Fix mutex locking section
tracing/uprobes: Check the return value of kstrdup() for tu->filename
tracing/probes: check the return value of kstrndup() for pbuf
mm: defer kmemleak object creation of module_alloc()
kasan: fix quarantine conflicting with init_on_free
selftests/vm: make charge_reserved_hugetlb.sh work with existing cgroup setting
hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list()
drm/amdgpu/display: Only set vblank_disable_immediate when PSR is not enabled
drm/amdgpu: filter out radeon PCI device IDs
drm/amdgpu: filter out radeon secondary ids as well
drm/amd/display: Use adjusted DCN301 watermarks
drm/amd/display: move FPU associated DSC code to DML folder
ethtool: Fix link extended state for big endian
octeontx2-af: Optimize KPU1 processing for variable-length headers
octeontx2-af: Reset PTP config in FLR handler
octeontx2-af: cn10k: RPM hardware timestamp configuration
octeontx2-af: cn10k: Use appropriate register for LMAC enable
octeontx2-af: Adjust LA pointer for cpt parse header
octeontx2-af: Add KPU changes to parse NGIO as separate layer
net/mlx5e: IPsec: Refactor checksum code in tx data path
net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
bpf: Use u64_stats_t in struct bpf_prog_stats
bpf: Fix possible race in inc_misses_counter
drm/amd/display: Update watermark values for DCN301
drm: mxsfb: Set fallback bus format when the bridge doesn't provide one
drm: mxsfb: Fix NULL pointer dereference
riscv/mm: Add XIP_FIXUP for phys_ram_base
drm/i915/display: split out dpt out of intel_display.c
drm/i915/display: Move DRRS code its own file
drm/i915: Disable DRRS on IVB/HSW port != A
gve: Recording rx queue before sending to napi
net: dsa: ocelot: seville: utilize of_mdiobus_register
net: dsa: seville: register the mdiobus under devres
ibmvnic: don't release napi in __ibmvnic_open()
of: net: move of_net under net/
net: ethernet: litex: Add the dependency on HAS_IOMEM
drm/mediatek: mtk_dsi: Reset the dsi0 hardware
cifs: protect session channel fields with chan_lock
cifs: fix confusing unneeded warning message on smb2.1 and earlier
drm/amd/display: Fix stream->link_enc unassigned during stream removal
bnxt_en: Fix occasional ethtool -t loopback test failures
drm/amd/display: For vblank_disable_immediate, check PSR is really used
PCI: mvebu: Fix device enumeration regression
net: of: fix stub of_net helpers for CONFIG_NET=n
ALSA: intel_hdmi: Fix reference to PCM buffer address
ucounts: Fix systemd LimitNPROC with private users regression
riscv/efi_stub: Fix get_boot_hartid_from_fdt() return value
riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
riscv: Fix config KASAN && DEBUG_VIRTUAL
iwlwifi: mvm: check debugfs_dir ptr before use
ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
iommu/amd: Recover from event log overflow
drm/i915: s/JSP2/ICP2/ PCH
drm/amd/display: Reduce dmesg error to a debug print
xen/netfront: destroy queues before real_num_tx_queues is zeroed
thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
mac80211: fix EAPoL rekey fail in 802.3 rx path
blktrace: fix use after free for struct blk_trace
ntb: intel: fix port config status offset for SPR
mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls
xfrm: fix MTU regression
netfilter: fix use-after-free in __nf_register_net_hook()
bpf, sockmap: Do not ignore orig_len parameter
xfrm: fix the if_id check in changelink
xfrm: enforce validity of offload input flags
e1000e: Correct NVM checksum verification flow
net: fix up skbs delta_truesize in UDP GRO frag_list
netfilter: nf_queue: don't assume sk is full socket
netfilter: nf_queue: fix possible use-after-free
netfilter: nf_queue: handle socket prefetch
batman-adv: Request iflink once in batadv-on-batadv check
batman-adv: Request iflink once in batadv_get_real_netdevice
batman-adv: Don't expect inter-netns unique iflink indices
net: ipv6: ensure we call ipv6_mc_down() at most once
net: dcb: flush lingering app table entries for unregistered devices
net: ipa: add an interconnect dependency
net/smc: fix connection leak
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server
btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range
mac80211: fix forwarded mesh frames AC & queue selection
net: stmmac: fix return value of __setup handler
mac80211: treat some SAE auth steps as final
iavf: Fix missing check for running netdev
net: sxgbe: fix return value of __setup handler
ibmvnic: register netdev after init of adapter
net: arcnet: com20020: Fix null-ptr-deref in com20020pci_probe()
ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc()
iavf: Fix deadlock in iavf_reset_task
efivars: Respect "block" flag in efivar_entry_set_safe()
auxdisplay: lcd2s: Fix lcd2s_redefine_char() feature
firmware: arm_scmi: Remove space in MODULE_ALIAS name
ASoC: cs4265: Fix the duplicated control name
auxdisplay: lcd2s: Fix memory leak in ->remove()
auxdisplay: lcd2s: Use proper API to free the instance of charlcd object
can: gs_usb: change active_channels's type from atomic_t to u8
iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
arm64: dts: rockchip: Switch RK3399-Gru DP to SPDIF output
igc: igc_read_phy_reg_gpy: drop premature return
ARM: Fix kgdb breakpoint for Thumb2
mips: setup: fix setnocoherentio() boolean setting
ARM: 9182/1: mmu: fix returns from early_param() and __setup() functions
mptcp: Correctly set DATA_FIN timeout when number of retransmits is large
selftests: mlxsw: tc_police_scale: Make test more robust
pinctrl: sunxi: Use unique lockdep classes for IRQs
igc: igc_write_phy_reg_gpy: drop premature return
ibmvnic: free reset-work-item when flushing
memfd: fix F_SEAL_WRITE after shmem huge page allocated
s390/extable: fix exception table sorting
sched: Fix yet more sched_fork() races
arm64: dts: juno: Remove GICv2m dma-range
iommu/amd: Fix I/O page table memory leak
MIPS: ralink: mt7621: do memory detection on KSEG1
ARM: dts: switch timer config to common devkit8000 devicetree
ARM: dts: Use 32KiHz oscillator on devkit8000
soc: fsl: guts: Revert commit 3c0d64e867
soc: fsl: guts: Add a missing memory allocation failure check
soc: fsl: qe: Check of ioremap return value
netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant
ARM: tegra: Move panels to AUX bus
can: etas_es58x: change opened_channel_cnt's type from atomic_t to u8
net: stmmac: enhance XDP ZC driver level switching performance
net: stmmac: only enable DMA interrupts when ready
ibmvnic: initialize rc before completing wait
ibmvnic: define flush_reset_queue helper
ibmvnic: complete init_done on transport events
net: chelsio: cxgb3: check the return value of pci_find_capability()
net: sparx5: Fix add vlan when invalid operation
iavf: Refactor iavf state machine tracking
iavf: Add __IAVF_INIT_FAILED state
iavf: Combine init and watchdog state machines
iavf: Add trace while removing device
iavf: Rework mutexes for better synchronisation
iavf: Add helper function to go from pci_dev to adapter
iavf: Fix kernel BUG in free_msi_irqs
iavf: Add waiting so the port is initialized in remove
iavf: Fix init state closure on remove
iavf: Fix locking for VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS
iavf: Fix race in init state
iavf: Fix __IAVF_RESETTING state usage
drm/i915/guc/slpc: Correct the param count for unset param
drm/bridge: ti-sn65dsi86: Properly undo autosuspend
e1000e: Fix possible HW unit hang after an s0ix exit
MIPS: ralink: mt7621: use bitwise NOT instead of logical
nl80211: Handle nla_memdup failures in handle_nan_filter
drm/amdgpu: fix suspend/resume hang regression
net: dcb: disable softirqs in dcbnl_flush_dev()
selftests: mlxsw: resource_scale: Fix return value
net: stmmac: perserve TX and RX coalesce value during XDP setup
iavf: do not override the adapter state in the watchdog task (again)
iavf: missing unlocks in iavf_watchdog_task()
MAINTAINERS: adjust file entry for of_net.c after movement
Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
Input: samsung-keypad - properly state IOMEM dependency
HID: add mapping for KEY_DICTATE
HID: add mapping for KEY_ALL_APPLICATIONS
tracing/histogram: Fix sorting on old "cpu" value
tracing: Fix return value of __setup handlers
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not WARN_ON() if we have PageError set
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: add missing run of delayed items after unlink during log replay
btrfs: do not start relocation until in progress drops are done
Revert "xfrm: xfrm_state_mtu should return at least 1280 for ipv6"
proc: fix documentation and description of pagemap
KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
hamradio: fix macro redefine warning
Linux 5.15.27
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie338dd23e0eb61feb540b4256b5d1840fee4db84
Changes in 5.15.26
mm/filemap: Fix handling of THPs in generic_file_buffered_read()
cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
cgroup-v1: Correct privileges check in release_agent writes
x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing
btrfs: tree-checker: check item_size for inode_item
btrfs: tree-checker: check item_size for dev_item
clk: jz4725b: fix mmc0 clock gating
io_uring: don't convert to jiffies for waiting on timeouts
io_uring: disallow modification of rsrc_data during quiesce
selinux: fix misuse of mutex_is_locked()
vhost/vsock: don't check owner in vhost_vsock_stop() while releasing
parisc/unaligned: Fix fldd and fstd unaligned handlers on 32-bit kernel
parisc/unaligned: Fix ldw() and stw() unalignment handlers
KVM: x86/mmu: make apf token non-zero to fix bug
drm/amd/display: Protect update_bw_bounding_box FPU code.
drm/amd/pm: fix some OEM SKU specific stability issues
drm/amd: Check if ASPM is enabled from PCIe subsystem
drm/amdgpu: disable MMHUB PG for Picasso
drm/amdgpu: do not enable asic reset for raven2
drm/i915: Widen the QGV point mask
drm/i915: Correctly populate use_sagv_wm for all pipes
drm/i915: Fix bw atomic check when switching between SAGV vs. no SAGV
sr9700: sanity check for packet length
USB: zaurus: support another broken Zaurus
CDC-NCM: avoid overflow in sanity checking
netfilter: xt_socket: fix a typo in socket_mt_destroy()
netfilter: xt_socket: missing ifdef CONFIG_IP6_NF_IPTABLES dependency
netfilter: nf_tables_offload: incorrect flow offload action array size
tee: export teedev_open() and teedev_close_context()
optee: use driver internal tee_context for some rpc
ping: remove pr_err from ping_lookup
Revert "i40e: Fix reset bw limit when DCB enabled with 1 TC"
gpu: host1x: Always return syncpoint value when waiting
perf evlist: Fix failed to use cpu list for uncore events
perf data: Fix double free in perf_session__delete()
mptcp: fix race in incoming ADD_ADDR option processing
mptcp: add mibs counter for ignored incoming options
selftests: mptcp: fix diag instability
selftests: mptcp: be more conservative with cookie MPJ limits
bnx2x: fix driver load from initrd
bnxt_en: Fix active FEC reporting to ethtool
bnxt_en: Fix offline ethtool selftest with RDMA enabled
bnxt_en: Fix incorrect multicast rx mask setting when not requested
hwmon: Handle failure to register sensor with thermal zone correctly
net/mlx5: Fix tc max supported prio for nic mode
ice: check the return of ice_ptp_gettimex64
ice: initialize local variable 'tlv'
net/mlx5: Update the list of the PCI supported devices
bpf: Fix crash due to incorrect copy_map_value
bpf: Do not try bpf_msg_push_data with len 0
selftests: bpf: Check bpf_msg_push_data return value
bpf: Fix a bpf_timer initialization issue
bpf: Add schedule points in batch ops
io_uring: add a schedule point in io_add_buffers()
net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends
nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info
tipc: Fix end of loop tests for list_for_each_entry()
gso: do not skip outer ip header in case of ipip and net_failover
net: mv643xx_eth: process retval from of_get_mac_address
openvswitch: Fix setting ipv6 fields causing hw csum failure
drm/edid: Always set RGB444
net/mlx5e: Fix wrong return value on ioctl EEPROM query failure
drm/vc4: crtc: Fix runtime_pm reference counting
drm/i915/dg2: Print PHY name properly on calibration error
net/sched: act_ct: Fix flow table lookup after ct clear or switching zones
net: ll_temac: check the return value of devm_kmalloc()
net: Force inlining of checksum functions in net/checksum.h
netfilter: nf_tables: unregister flowtable hooks on netns exit
nfp: flower: Fix a potential leak in nfp_tunnel_add_shared_mac()
net: mdio-ipq4019: add delay after clock enable
netfilter: nf_tables: fix memory leak during stateful obj update
net/smc: Use a mutex for locking "struct smc_pnettable"
surface: surface3_power: Fix battery readings on batteries without a serial number
udp_tunnel: Fix end of loop test in udp_tunnel_nic_unregister()
net/mlx5: DR, Cache STE shadow memory
ibmvnic: schedule failover only if vioctl fails
net/mlx5: DR, Don't allow match on IP w/o matching on full ethertype/ip_version
net/mlx5: Fix possible deadlock on rule deletion
net/mlx5: Fix wrong limitation of metadata match on ecpf
net/mlx5: DR, Fix the threshold that defines when pool sync is initiated
net/mlx5e: MPLSoUDP decap, fix check for unsupported matches
net/mlx5e: kTLS, Use CHECKSUM_UNNECESSARY for device-offloaded packets
net/mlx5: Update log_max_qp value to be 17 at most
spi: spi-zynq-qspi: Fix a NULL pointer dereference in zynq_qspi_exec_mem_op()
gpio: rockchip: Reset int_bothedge when changing trigger
regmap-irq: Update interrupt clear register for proper reset
net-timestamp: convert sk->sk_tskey to atomic_t
RDMA/rtrs-clt: Fix possible double free in error case
RDMA/rtrs-clt: Move free_permit from free_clt to rtrs_clt_close
bnxt_en: Increase firmware message response DMA wait time
configfs: fix a race in configfs_{,un}register_subsystem()
RDMA/ib_srp: Fix a deadlock
tracing: Dump stacktrace trigger to the corresponding instance
tracing: Have traceon and traceoff trigger honor the instance
iio:imu:adis16480: fix buffering for devices with no burst mode
iio: adc: men_z188_adc: Fix a resource leak in an error handling path
iio: adc: tsc2046: fix memory corruption by preventing array overflow
iio: adc: ad7124: fix mask used for setting AIN_BUFP & AIN_BUFM bits
iio: accel: fxls8962af: add padding to regmap for SPI
iio: imu: st_lsm6dsx: wait for settling time in st_lsm6dsx_read_oneshot
iio: Fix error handling for PM
sc16is7xx: Fix for incorrect data being transmitted
ata: pata_hpt37x: disable primary channel on HPT371
Revert "USB: serial: ch341: add new Product ID for CH341A"
usb: gadget: rndis: add spinlock for rndis response list
USB: gadget: validate endpoint index for xilinx udc
tracefs: Set the group ownership in apply_options() not parse_options()
USB: serial: option: add support for DW5829e
USB: serial: option: add Telit LE910R1 compositions
usb: dwc2: drd: fix soft connect when gadget is unconfigured
usb: dwc3: pci: Add "snps,dis_u2_susphy_quirk" for Intel Bay Trail
usb: dwc3: pci: Fix Bay Trail phy GPIO mappings
usb: dwc3: gadget: Let the interrupt handler disable bottom halves.
xhci: re-initialize the HC during resume if HCE was set
xhci: Prevent futile URB re-submissions due to incorrect return value.
nvmem: core: Fix a conflict between MTD and NVMEM on wp-gpios property
mtd: core: Fix a conflict between MTD and NVMEM on wp-gpios property
driver core: Free DMA range map when device is released
btrfs: prevent copying too big compressed lzo segment
RDMA/cma: Do not change route.addr.src_addr outside state checks
thermal: int340x: fix memory leak in int3400_notify()
staging: fbtft: fb_st7789v: reset display before initialization
tps6598x: clear int mask on probe failure
IB/qib: Fix duplicate sysfs directory name
riscv: fix nommu_k210_sdcard_defconfig
riscv: fix oops caused by irqsoff latency tracer
tty: n_gsm: fix encoding of control signal octet bit DV
tty: n_gsm: fix proper link termination after failed open
tty: n_gsm: fix NULL pointer access due to DLCI release
tty: n_gsm: fix wrong tty control line for flow control
tty: n_gsm: fix wrong modem processing in convergence layer type 2
tty: n_gsm: fix deadlock in gsmtty_open()
pinctrl: fix loop in k210_pinconf_get_drive()
pinctrl: k210: Fix bias-pull-up
gpio: tegra186: Fix chip_data type confusion
memblock: use kfree() to release kmalloced memblock regions
ice: Fix race conditions between virtchnl handling and VF ndo ops
ice: fix concurrent reset and removal of VFs
Linux 5.15.26
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ied0cc9bd48b7af71a064107676f37b0dd39ce3cf
In for_each_object_track() we walk the metadata of the slab object in
the callback function (fn), and as a result KASAN reports a false-positive
out-of-bounds access. Fix this by wrapping that function call with
metadata_access_enable()/metadata_access_disable().
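A minimal sketch of the shape of that fix (assumed form; the actual
Android patch may differ). metadata_access_enable() and
metadata_access_disable() are the existing mm/slub.c helpers that tell
KASAN the upcoming slab-metadata reads are intentional; the callback
signature below is illustrative:

  metadata_access_enable();   /* suspend KASAN checks for metadata reads */
  ret = fn(s, object, track); /* callback walks the object's track data  */
  metadata_access_disable();  /* restore normal KASAN checking           */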
Bug: 222651868
Fixes: ee8d2c7884 ("ANDROID: mm: add get_each_object_track function")
Change-Id: Ifb4241a9c3e397a52759d467aa267d1297e297dd
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit cd6e5d5d7d0338fbfe58010f7dde0a521db6b681)
commit f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99 upstream.
Wangyong reports: after enabling the tmpfs filesystem's transparent
hugepage support with the following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program tries to add F_SEAL_WRITE through the following
fcntl() call, but it fails unexpectedly with errno EBUSY:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.
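A hedged userspace sketch of that scenario (my construction for
illustration, not taken from the report; docker's actual sequence
differs, and whether shmem really backs the file with a huge page
depends on kernel configuration):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 2UL << 20;  /* 2 MiB, one PMD-sized huge page */

          /* MFD_ALLOW_SEALING so that seals may be added later */
          int fd = memfd_create("seal-test", MFD_ALLOW_SEALING);
          if (fd < 0 || ftruncate(fd, len) < 0) {
                  perror("setup");
                  return 1;
          }

          /* Fault the file in via a shared mapping so shmem may back it
           * with a THP, then drop the mapping before sealing. */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          memset(p, 1, len);
          munmap(p, len);

          /* With no writable mappings or pins left this should succeed;
           * on affected kernels it could fail with EBUSY. */
          if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
                  printf("F_ADD_SEALS: %s\n", strerror(errno));
          else
                  printf("F_SEAL_WRITE added\n");
          return 0;
  }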
Make memfd_tag_pins() (which compared > 1) as strict as
memfd_wait_for_pins() (which compares != 1): either threshold can be
justified, but given the non-atomic total_mapcount() calculation, it is
better now to be strict. Bear in mind that total_mapcount() itself scans
all of the THP subpages, when choosing to take an XA_CHECK_SCHED latency
break.
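In code, the tightened test has roughly this shape (a sketch matching
the description above, not the verbatim diff):

  struct page *head = compound_head(page);

  /* one reference belongs to the page cache itself; any further
   * count/mapcount imbalance on the head is treated as a pin, so the
   * page is tagged (or kept waiting) for a later recheck */
  bool extra_pins = page_count(head) != total_mapcount(head) + 1;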
Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.
Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reported-by: wangyong <wang.yong12@zte.com.cn>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: CGEL ZTE <cgel.zte@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream.
syzkaller was recently triggering an oversized kvmalloc() warning via
xdp_umem_create().
The triggered warning was added back in 7661809d49 ("mm: don't allow
oversized kvmalloc() calls"). The warning for huge kvmalloc sizes was
added in reaction to a security bug where the size was more than UINT_MAX
but not everything was prepared to handle unsigned long sizes.
Anyway, the AF_XDP-related call trace from this syzkaller report was:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
__sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
__do_sys_setsockopt net/socket.c:2187 [inline]
__se_sys_setsockopt net/socket.c:2184 [inline]
__x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Björn mentioned that requests for >2GB allocation can still be valid:
The structure that is being allocated is the page-pinning accounting.
AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
still fewer than what memcg allows (PAGE_COUNTER_MAX is
LONG_MAX/PAGE_SIZE on 64-bit systems). [...]
I could just change from U32_MAX to INT_MAX, but as I stated earlier
that has a hacky feeling to it. [...] From my perspective, the code
isn't broken, with the memcg limits in consideration. [...]
Linus says:
[...] Pretty much every time this has come up, the kernel warning has
shown that yes, the code was broken and there really wasn't a reason
for doing allocations that big.
Of course, some people would be perfectly fine with the allocation
failing, they just don't want the warning. I didn't want __GFP_NOWARN
to shut it up originally because I wanted people to see all those
cases, but these days I think we can just say "yeah, people can shut
it up explicitly by saying 'go ahead and fail this allocation, don't
warn about it'".
So enough time has passed that by now I'd certainly be ok with [it].
Thus allow call sites to silence such userspace-triggered splats if the
allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
to kvcalloc() this is already the case, so nothing else needed there.
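The resulting call-site pattern looks roughly like this (a sketch; the
variable names are illustrative, not from the patch):

  struct page **pgs;

  /* An oversized request (> INT_MAX bytes) still fails with NULL, but
   * __GFP_NOWARN now suppresses the WARN splat for it. */
  pgs = kvcalloc(npages, sizeof(*pgs), GFP_KERNEL | __GFP_NOWARN);
  if (!pgs)
          return -ENOMEM;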
Fixes: 7661809d49 ("mm: don't allow oversized kvmalloc() calls")
Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>