Commit Graph

1088 Commits

Author SHA1 Message Date
Minchan Kim
243f54dd3a ANDROID: vendor hook for TLB batching control
Add vendor hook for flushing TLB batching in zap_pte_range.

Merged CL 232bdcbd660b1129b7d8d0de25a563b476eeb522:
ANDROID: pass argument in zap_pte_range vendor hooks

Bug: 238728493
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: If2de5f070dd7b76624961f5a91440bf69a99ca2d
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit d257ef6764f228145d0fca24998162809bb5b9f7)
2023-01-13 18:43:29 +00:00
Martin Liu
c20204c67d ANDROID: cma: allow to use CMA in swap-in path
Now, we allow to use CMA pages for certain user space
allocations. One of them is anonymous page fault case.
To align the use case, we should also allow to use CMA
pages in swap-in cases. This could help mitigate OOM
on swap-in cases showing plenty of free CMA left.

logd.klogd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
CPU: 0 PID: 433 Comm: logd.klogd Tainted: G        W  OE     5.10.100-android13-0 #1

Call trace:
 dump_backtrace.cfi_jt+0x0/0x8
 show_stack+0x1c/0x2c
 dump_stack_lvl+0xc0/0x13c
 dump_header+0x54/0x238
 oom_kill_process+0xb0/0x158
 out_of_memory+0x17c/0x328
 __alloc_pages_slowpath+0x5c4/0x8d0
 __alloc_pages_nodemask+0x1bc/0x2e0
 __read_swap_cache_async+0xdc/0x370
 swap_vma_readahead+0x3b4/0x488
 swapin_readahead+0x3c/0x54
 do_swap_page+0x1e0/0xaa0
 handle_pte_fault+0x128/0x1e0
 handle_mm_fault+0x308/0x590
 do_page_fault+0x33c/0x478
 do_translation_fault+0x58/0x11c
 do_mem_abort+0x68/0x144
 el0_da+0x24/0x34
 el0_sync_handler+0xc4/0xec
 el0_sync+0x1c0/0x200
Mem-Info:
active_anon:0 inactive_anon:3222 isolated_anon:62
 active_file:232 inactive_file:428 isolated_file:0
 unevictable:37232 dirty:3 writeback:40
 slab_reclaimable:19943 slab_unreclaimable:281193
 mapped:37126 shmem:2815 pagetables:8981 bounce:0
 free:126007 free_pcp:223 free_cma:123062
Node 0 active_anon:16kB inactive_anon:13160kB active_file:292kB inactive_file:2000kB unevictable:148928kB isolated(anon):0kB isolated(file):0kB mapped:148308kB dirty:12kB writeback:164kB shmem:11260kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:20528kB shadow_call_stack:5200kB all_unreclaimable? no
DMA32 free:14128kB min:7572kB low:22636kB high:37700kB reserved_highatomic:4096KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1913856kB managed:1553276kB mlocked:0kB pagetables:1292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:2520kB
lowmem_reserve[]: 0 0 0
Normal free:489888kB min:19808kB low:59220kB high:98632kB reserved_highatomic:36864KB active_anon:20kB inactive_anon:12168kB active_file:0kB inactive_file:1640kB unevictable:148928kB writepending:180kB present:4194304kB managed:4063392kB mlocked:148928kB pagetables:34632kB bounce:0kB free_pcp:1928kB local_pcp:0kB free_cma:489752kB
lowmem_reserve[]: 0 0 0
DMA32: 166*4kB (UME) 163*8kB (UMECH) 592*16kB (UMCH) 5*32kB (UC) 2*64kB (C) 2*128kB (C) 2*256kB (C) 1*512kB (C) 1*1024kB (C) 0*2048kB 0*4096kB = 14032kB
Normal: 969*4kB (C) 77*8kB (C) 40*16kB (C) 17*32kB (C) 5*64kB (C) 1*128kB (C) 2*256kB (C) 1*512kB (C) 0*1024kB 0*2048kB 118*4096kB (C) = 490476kB
40220 total pagecache pages
30 pages in swap cache
Swap cache stats: add 2634625, delete 2635304, find 160621/2963954
Free swap  = 1473788kB
Total swap = 2097148kB

Bug: 229822798
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: Ia0bb6f72e52f77f26062e1769bfd92e831f07cab
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 3e591c63b13772a1de0ffed995b482166d27ed71)
2023-01-05 02:16:50 +00:00
Yu Zhao
d5b2fa1c7b BACKPORT: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Change-Id: I7b24d1e9d263e4eb2c2ee23f2eb143824fcb5201
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ec1c86b25f4bdd9dce6436c0539d2a6ae676e1c4)
[ Resolve conflicts in mm/memory.c, mm/memcontrol.c, mm/Kconfig,
include/linux/mm_inline.h]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Yu Zhao
d232fd437a BACKPORT: mm: x86, arm64: add arch_has_hw_pte_young()
Patch series "Multi-Gen LRU Framework", v14.

What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
   Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
   machines. The old direct reclaim backoff, which tries to enforce a
   minimum fairness among all eligible memcgs, over-swapped by about
   (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
   pulls the plug on swapping once the target is met, trades some
   fairness for curtailed latency:
   https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minior build warnings and conflicts. More comments and nits.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.

   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10(Radix MMU + 64K Page Size) with
   MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfan.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optante,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
   demonstrated similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
   will likely be put to use for both personal computers and data
   centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects with other approaches.

This patch (of 14):

Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2.  On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones.  Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).

Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Change-Id: I7c94aa3ffeb0a8e570c2d7db15183f87658d0141
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit e1fd09e3d1dd4a1a8b3b33bc1fd647eee9f4e475)
[Kalesh Singh - Fix trivial conflict in arch/arm64/include/asm/pgtable.h]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
b3890c0f96 Revert "FROMLIST: mm: x86, arm64: add arch_has_hw_pte_young()"
This reverts commit 1861f17391.

To be replaced with upstream version.

Bug: 249601646
Change-Id: Ib992a9f199a9f30fbbf3f39537d87a8fb605c893
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Kalesh Singh
2635d7d108 Revert "FROMLIST: mm: multi-gen LRU: groundwork"
This reverts commit f88ed5a3d3.

To be replaced with upstream version.

Bug: 249601646
Change-Id: I5a206480f838c304fb1c960fec2615894c2421bb
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2022-11-30 00:28:11 +00:00
Suren Baghdasaryan
4ea18cd059 ANDROID: mm: fix speculative walk which is unsafe under RCU
Speculative page fault handling expects MMU_GATHER_RCU_TABLE_FREE to
guarantee that page tables are stable, however tlb_remove_table() has
a slow-path fall-back case when __get_free_page() returns NULL and
tlb_remove_table_one() gets called. The way synchronization is
implemented in that function is not RCU-safe and require IRQs to be
disabled (see the comment in tlb_remove_table_sync_one()).
Fix the invalid assumption to disable IRQs even when
MMU_GATHER_RCU_TABLE_FREE=y.

Bug: 257443051
Change-Id: I227f351607cf73022cb31f6f7a232cab41cf6a5a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:28 -08:00
Suren Baghdasaryan
ca96bd7bf1 ANDROID: mm: avoid using vmacache in lockless vma search
When searching vma under RCU protection vmcache should be avoided because
a race with munmap() might result in finding a vma and placing it into
vmcache after munmap() removed that vma and called vmcache_invalidate.
Once that vma is freed, vmcache will be left with an invalid vma pointer.

Bug: 257443051
Change-Id: I62438305fcf5139974f4f7d3bae5b22c74084a59
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:27 -08:00
Suren Baghdasaryan
a1f65b39ba ANDROID: mm: skip pte_alloc during speculative page fault
Speculative page fault checks pmd to be valid before starting to handle
the page fault and pte_alloc() should do nothing if pmd stays valid.
If pmd gets changed during speculative page fault, we will detect the
change later and retry with mmap_lock. Therefore pte_alloc() can be
safely skipped and this prevents the racy pmd_lock() call which can
access pmd->ptl after pmd was cleared.

Bug: 257443051
Change-Id: Iec57df5530dba6e0e0bdf9f7500f910851c3d3fd
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:27 -08:00
Suren Baghdasaryan
3f311327f9 ANDROID: mm: introduce vma refcounting to protect vma during SPF
Current mechanism to stabilize a vma during speculative page fault
handling makes a copy of the faulting vma under RCU protection. This
makes it hard to protect elements which do not belong to the vma but
are used by the page fault handler like vma->vm_file.
The problems is that a copy of the vma can't be used to safely
protect the file attached to the original vma unless the file is
also released after RCU grace period (which is how SPF was designed
originally but that caused performance regression and had to be
changed).
To avoid these complications, introduce vma refcounting to stabilize
and operate on the original vma during page fault handling. Page
fault handler finds the vma and increases its refcount under RCU
protection, vma is freed after RCU grace period, vma->vm_file is
released only after refcount indicates no users. This mechanism
guarantees that once get_vma returns a vma, both the vma itself and
vma->vm_file are stable.
Additional benefits of this patch are: we don't need to copy the vma
and no additional logic is needed to stabilize vma->vm_file.

Bug: 257443051
Change-Id: I59d373926d687fcbd56847a8c3500c43bf1844c8
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2022-11-23 10:25:27 -08:00
Suren Baghdasaryan
19f11a21de ANDROID: fix a race between speculative page walk and unmap operations
Speculative page fault walks the page tables under RCU protection and
assumes that page tables are stable after ptl lock is taken. Current
implementation has three issues:
1. While pmd can't be destroyed while in RCU read section, it can be
cleared and result in an invalid ptl lock address. Fix that by
rechecking pmd value after obtaining ptl lock.
2. In case of CONFIG_ALLOC_SPLIT_PTLOCKS, ptl lock is separate from the
pmd and is destroyed by a synchronous call to pgtable_pmd_page_dtor,
which can happen while page walker is in RCU section. Prevent this by
adding a dependency for CONFIG_SPECULATIVE_PAGE_FAULT to require
!CONIG_ALLOC_SPLIT_PTLOCKS.
3. Below sequence when do_mmap happens after the last mmap_seq check
would result in use-after-free issue.

__pte_map_lock
      rcu_read_lock()
      mmap_seq_read_check()

      ptl = pte_lockptr(vmf->pmd)

      spin_trylock(ptl)
      mmap_seq_read_check()
                             mmap_write_lock()
                             do_mmap()
                               unmap_region()
                                 unmap_vmas()
                                 free_pgtables()
                                   ...
                                   free_pte_range
                                   pmd_clear
                                     pte_free_tlb
                                        ...
                                        call_rcu(tlb_remove_table_rcu)

      rcu_read_unlock()
                             tlb_remove_table_rcu
      spin_unlock(ptl) <-- UAF!

To prevent that free_pte_range needs to be blocked if ptl is locked and
is in use.

Bug: 245389404
Change-Id: I7b353f0995fc59e92bb2069bcdc7d1ac29b521b9
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
(cherry picked from commit 365ffc56b4862e9b49097450a1ad10392723bbe0)
2022-11-07 20:11:09 +00:00
Greg Kroah-Hartman
046ce7a74e Merge 5.15.59 into android14-5.15
Changes in 5.15.59
	Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
	Revert "ocfs2: mount shared volume without ha stack"
	ntfs: fix use-after-free in ntfs_ucsncmp()
	fs: sendfile handles O_NONBLOCK of out_fd
	secretmem: fix unhandled fault in truncate
	mm: fix page leak with multiple threads mapping the same page
	hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
	asm-generic: remove a broken and needless ifdef conditional
	s390/archrandom: prevent CPACF trng invocations in interrupt context
	nouveau/svm: Fix to migrate all requested pages
	drm/simpledrm: Fix return type of simpledrm_simple_display_pipe_mode_valid()
	watch_queue: Fix missing rcu annotation
	watch_queue: Fix missing locking in add_watch_to_object()
	tcp: Fix data-races around sysctl_tcp_dsack.
	tcp: Fix a data-race around sysctl_tcp_app_win.
	tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
	tcp: Fix a data-race around sysctl_tcp_frto.
	tcp: Fix a data-race around sysctl_tcp_nometrics_save.
	tcp: Fix data-races around sysctl_tcp_no_ssthresh_metrics_save.
	ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
	ice: do not setup vlan for loopback VSI
	scsi: ufs: host: Hold reference returned by of_parse_phandle()
	Revert "tcp: change pingpong threshold to 3"
	octeontx2-pf: Fix UDP/TCP src and dst port tc filters
	tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf.
	tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
	tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
	scsi: core: Fix warning in scsi_alloc_sgtables()
	scsi: mpt3sas: Stop fw fault watchdog work item during system shutdown
	net: ping6: Fix memleak in ipv6_renew_options().
	ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
	net/tls: Remove the context from the list in tls_device_down
	igmp: Fix data-races around sysctl_igmp_qrv.
	net: pcs: xpcs: propagate xpcs_read error to xpcs_get_state_c37_sgmii
	net: sungem_phy: Add of_node_put() for reference returned by of_get_parent()
	tcp: Fix a data-race around sysctl_tcp_min_tso_segs.
	tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
	tcp: Fix a data-race around sysctl_tcp_autocorking.
	tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
	Documentation: fix sctp_wmem in ip-sysctl.rst
	macsec: fix NULL deref in macsec_add_rxsa
	macsec: fix error message in macsec_add_rxsa and _txsa
	macsec: limit replay window size with XPN
	macsec: always read MACSEC_SA_ATTR_PN as a u64
	net: macsec: fix potential resource leak in macsec_add_rxsa() and macsec_add_txsa()
	net: mld: fix reference count leak in mld_{query | report}_work()
	tcp: Fix data-races around sk_pacing_rate.
	net: Fix data-races around sysctl_[rw]mem(_offset)?.
	tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
	tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
	tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
	tcp: Fix data-races around sysctl_tcp_reflect_tos.
	ipv4: Fix data-races around sysctl_fib_notify_on_flag_change.
	i40e: Fix interface init with MSI interrupts (no MSI-X)
	sctp: fix sleep in atomic context bug in timer handlers
	octeontx2-pf: cn10k: Fix egress ratelimit configuration
	netfilter: nf_queue: do not allow packet truncation below transport header offset
	virtio-net: fix the race between refill work and close
	perf symbol: Correct address for bss symbols
	sfc: disable softirqs for ptp TX
	sctp: leave the err path free in sctp_stream_init to sctp_stream_free
	ARM: crypto: comment out gcc warning that breaks clang builds
	mm/hmm: fault non-owner device private entries
	page_alloc: fix invalid watermark check on a negative value
	ARM: 9216/1: Fix MAX_DMA_ADDRESS overflow
	EDAC/ghes: Set the DIMM label unconditionally
	docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed
	locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
	x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available
	Linux 5.15.59

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4f2002d38aea467e150a912f50d456c41b23de89
2022-08-04 15:18:41 +02:00
Josef Bacik
2722fb0f70 mm: fix page leak with multiple threads mapping the same page
commit 3fe2895cfecd03ac74977f32102b966b6589f481 upstream.

We have an application with a lot of threads that use a shared mmap backed
by tmpfs mounted with -o huge=within_size.  This application started
leaking loads of huge pages when we upgraded to a recent kernel.

Using the page ref tracepoints and a BPF program written by Tejun Heo we
were able to determine that these pages would have multiple refcounts from
the page fault path, but when it came to unmap time we wouldn't drop the
number of refs we had added from the faults.

I wrote a reproducer that mmap'ed a file backed by tmpfs with -o
huge=always, and then spawned 20 threads all looping faulting random
offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge
page aligned ranges.  This very quickly reproduced the problem.

The problem here is that we check for the case that we have multiple
threads faulting in a range that was previously unmapped.  One thread maps
the PMD, the other thread loses the race and then returns 0.  However at
this point we already have the page, and we are no longer putting this
page into the processes address space, and so we leak the page.  We
actually did the correct thing prior to f9ce0be71d, however it looks
like Kirill copied what we do in the anonymous page case.  In the
anonymous page case we don't yet have a page, so we don't have to drop a
reference on anything.  Previously we did the correct thing for file based
faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on
the page we faulted in.

Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable()
case, this makes us drop the ref on the page properly, and now my
reproducer no longer leaks the huge pages.

[josef@toxicpanda.com: v2]
  Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com
Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com
Fixes: f9ce0be71d ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03 12:03:42 +02:00
Greg Kroah-Hartman
56f32ebb01 Merge 5.15.56 into android14-5.15
Changes in 5.15.56
	ALSA: hda - Add fixup for Dell Latitidue E5430
	ALSA: hda/conexant: Apply quirk for another HP ProDesk 600 G3 model
	ALSA: hda/realtek: Fix headset mic for Acer SF313-51
	ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc671
	ALSA: hda/realtek: fix mute/micmute LEDs for HP machines
	ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc221
	ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop
	xen/netback: avoid entering xenvif_rx_next_skb() with an empty rx queue
	fix race between exit_itimers() and /proc/pid/timers
	mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
	mm: split huge PUD on wp_huge_pud fallback
	tracing/histograms: Fix memory leak problem
	net: sock: tracing: Fix sock_exceed_buf_limit not to dereference stale pointer
	ip: fix dflt addr selection for connected nexthop
	ARM: 9213/1: Print message about disabled Spectre workarounds only once
	ARM: 9214/1: alignment: advance IT state after emulating Thumb instruction
	wifi: mac80211: fix queue selection for mesh/OCB interfaces
	cgroup: Use separate src/dst nodes when preloading css_sets for migration
	btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
	drm/panfrost: Put mapping instead of shmem obj on panfrost_mmu_map_fault_addr() error
	drm/panfrost: Fix shrinker list corruption by madvise IOCTL
	fs/remap: constrain dedupe of EOF blocks
	nilfs2: fix incorrect masking of permission flags for symlinks
	sh: convert nommu io{re,un}map() to static inline functions
	Revert "evm: Fix memleak in init_desc"
	xfs: only run COW extent recovery when there are no live extents
	xfs: don't include bnobt blocks when reserving free block pool
	xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
	xfs: drop async cache flushes from CIL commits.
	reset: Fix devm bulk optional exclusive control getter
	ARM: dts: imx6qdl-ts7970: Fix ngpio typo and count
	spi: amd: Limit max transfer and message size
	ARM: 9209/1: Spectre-BHB: avoid pr_info() every time a CPU comes out of idle
	ARM: 9210/1: Mark the FDT_FIXED sections as shareable
	net/mlx5e: kTLS, Fix build time constant test in TX
	net/mlx5e: kTLS, Fix build time constant test in RX
	net/mlx5e: Fix enabling sriov while tc nic rules are offloaded
	net/mlx5e: Fix capability check for updating vnic env counters
	net/mlx5e: Ring the TX doorbell on DMA errors
	drm/i915: fix a possible refcount leak in intel_dp_add_mst_connector()
	ima: Fix a potential integer overflow in ima_appraise_measurement
	ASoC: sgtl5000: Fix noise on shutdown/remove
	ASoC: tas2764: Add post reset delays
	ASoC: tas2764: Fix and extend FSYNC polarity handling
	ASoC: tas2764: Correct playback volume range
	ASoC: tas2764: Fix amp gain register offset & default
	ASoC: Intel: Skylake: Correct the ssp rate discovery in skl_get_ssp_clks()
	ASoC: Intel: Skylake: Correct the handling of fmt_config flexible array
	net: stmmac: dwc-qos: Disable split header for Tegra194
	net: ethernet: ti: am65-cpsw: Fix devlink port register sequence
	sysctl: Fix data races in proc_dointvec().
	sysctl: Fix data races in proc_douintvec().
	sysctl: Fix data races in proc_dointvec_minmax().
	sysctl: Fix data races in proc_douintvec_minmax().
	sysctl: Fix data races in proc_doulongvec_minmax().
	sysctl: Fix data races in proc_dointvec_jiffies().
	tcp: Fix a data-race around sysctl_tcp_max_orphans.
	inetpeer: Fix data-races around sysctl.
	net: Fix data-races around sysctl_mem.
	cipso: Fix data-races around sysctl.
	icmp: Fix data-races around sysctl.
	ipv4: Fix a data-race around sysctl_fib_sync_mem.
	ARM: dts: at91: sama5d2: Fix typo in i2s1 node
	ARM: dts: sunxi: Fix SPI NOR campatible on Orange Pi Zero
	arm64: dts: broadcom: bcm4908: Fix timer node for BCM4906 SoC
	arm64: dts: broadcom: bcm4908: Fix cpu node for smp boot
	netfilter: nf_log: incorrect offset to network header
	netfilter: nf_tables: replace BUG_ON by element length check
	drm/i915/gvt: IS_ERR() vs NULL bug in intel_gvt_update_reg_whitelist()
	xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
	lockd: set fl_owner when unlocking files
	lockd: fix nlm_close_files
	tracing: Fix sleeping while atomic in kdb ftdump
	drm/i915/selftests: fix a couple IS_ERR() vs NULL tests
	drm/i915/dg2: Add Wa_22011100796
	drm/i915/gt: Serialize GRDOM access between multiple engine resets
	drm/i915/gt: Serialize TLB invalidates with GT resets
	drm/i915/uc: correctly track uc_fw init failure
	drm/i915: Require the vm mutex for i915_vma_bind()
	bnxt_en: Fix bnxt_reinit_after_abort() code path
	bnxt_en: Fix bnxt_refclk_read()
	sysctl: Fix data-races in proc_dou8vec_minmax().
	sysctl: Fix data-races in proc_dointvec_ms_jiffies().
	icmp: Fix data-races around sysctl_icmp_echo_enable_probe.
	icmp: Fix a data-race around sysctl_icmp_ignore_bogus_error_responses.
	icmp: Fix a data-race around sysctl_icmp_errors_use_inbound_ifaddr.
	icmp: Fix a data-race around sysctl_icmp_ratelimit.
	icmp: Fix a data-race around sysctl_icmp_ratemask.
	raw: Fix a data-race around sysctl_raw_l3mdev_accept.
	tcp: Fix a data-race around sysctl_tcp_ecn_fallback.
	ipv4: Fix data-races around sysctl_ip_dynaddr.
	nexthop: Fix data-races around nexthop_compat_mode.
	net: ftgmac100: Hold reference returned by of_get_child_by_name()
	net: stmmac: fix leaks in probe
	ima: force signature verification when CONFIG_KEXEC_SIG is configured
	ima: Fix potential memory leak in ima_init_crypto()
	drm/amd/display: Only use depth 36 bpp linebuffers on DCN display engines.
	drm/amd/pm: Prevent divide by zero
	sfc: fix use after free when disabling sriov
	ceph: switch netfs read ops to use rreq->inode instead of rreq->mapping->host
	seg6: fix skb checksum evaluation in SRH encapsulation/insertion
	seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors
	seg6: bpf: fix skb checksum in bpf_push_seg6_encap()
	sfc: fix kernel panic when creating VF
	net: atlantic: remove deep parameter on suspend/resume functions
	net: atlantic: remove aq_nic_deinit() when resume
	KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()
	net/tls: Check for errors in tls_device_init
	ACPI: video: Fix acpi_video_handles_brightness_key_presses()
	mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
	btrfs: rename btrfs_bio to btrfs_io_context
	btrfs: zoned: fix a leaked bioc in read_zone_info
	ksmbd: use SOCK_NONBLOCK type for kernel_accept()
	powerpc/xive/spapr: correct bitmap allocation size
	vdpa/mlx5: Initialize CVQ vringh only once
	vduse: Tie vduse mgmtdev and its device
	virtio_mmio: Add missing PM calls to freeze/restore
	virtio_mmio: Restore guest page size on resume
	netfilter: br_netfilter: do not skip all hooks with 0 priority
	scsi: hisi_sas: Limit max hw sectors for v3 HW
	cpufreq: pmac32-cpufreq: Fix refcount leak bug
	platform/x86: hp-wmi: Ignore Sanitization Mode event
	firmware: sysfb: Make sysfb_create_simplefb() return a pdev pointer
	firmware: sysfb: Add sysfb_disable() helper function
	fbdev: Disable sysfb device registration when removing conflicting FBs
	net: tipc: fix possible refcount leak in tipc_sk_create()
	NFC: nxp-nci: don't print header length mismatch on i2c error
	nvme-tcp: always fail a request when sending it failed
	nvme: fix regression when disconnect a recovering ctrl
	net: sfp: fix memory leak in sfp_probe()
	ASoC: ops: Fix off by one in range control validation
	pinctrl: aspeed: Fix potential NULL dereference in aspeed_pinmux_set_mux()
	ASoC: Realtek/Maxim SoundWire codecs: disable pm_runtime on remove
	ASoC: rt711-sdca-sdw: fix calibrate mutex initialization
	ASoC: Intel: sof_sdw: handle errors on card registration
	ASoC: rt711: fix calibrate mutex initialization
	ASoC: rt7*-sdw: harden jack_detect_handler
	ASoC: codecs: rt700/rt711/rt711-sdca: initialize workqueues in probe
	ASoC: SOF: Intel: hda-loader: Clarify the cl_dsp_init() flow
	ASoC: wcd938x: Fix event generation for some controls
	ASoC: Intel: bytcr_wm5102: Fix GPIO related probe-ordering problem
	ASoC: wm5110: Fix DRE control
	ASoC: rt711-sdca: fix kernel NULL pointer dereference when IO error
	ASoC: dapm: Initialise kcontrol data for mux/demux controls
	ASoC: cs47l15: Fix event generation for low power mux control
	ASoC: madera: Fix event generation for OUT1 demux
	ASoC: madera: Fix event generation for rate controls
	irqchip: or1k-pic: Undefine mask_ack for level triggered hardware
	x86: Clear .brk area at early boot
	soc: ixp4xx/npe: Fix unused match warning
	ARM: dts: stm32: use the correct clock source for CEC on stm32mp151
	Revert "can: xilinx_can: Limit CANFD brp to 2"
	ALSA: usb-audio: Add quirks for MacroSilicon MS2100/MS2106 devices
	ALSA: usb-audio: Add quirk for Fiero SC-01
	ALSA: usb-audio: Add quirk for Fiero SC-01 (fw v1.0.0)
	nvme-pci: phison e16 has bogus namespace ids
	signal handling: don't use BUG_ON() for debugging
	USB: serial: ftdi_sio: add Belimo device ids
	usb: typec: add missing uevent when partner support PD
	usb: dwc3: gadget: Fix event pending check
	tty: serial: samsung_tty: set dma burst_size to 1
	vt: fix memory overlapping when deleting chars in the buffer
	serial: 8250: fix return error code in serial8250_request_std_resource()
	serial: stm32: Clear prev values before setting RTS delays
	serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle
	serial: 8250: Fix PM usage_count for console handover
	x86/pat: Fix x86_has_pat_wp()
	drm/aperture: Run fbdev removal before internal helpers
	Linux 5.15.56

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I763d2a7b49435bf2996b31e201aa9794ab64609e
2022-07-22 17:43:50 +02:00
Gowans, James
e4967d2288 mm: split huge PUD on wp_huge_pud fallback
commit 14c99d65941538aa33edd8dc7b1bbbb593c324a2 upstream.

Currently the implementation will split the PUD when a fallback is taken
inside the create_huge_pud function.  This isn't where it should be done:
the splitting should be done in wp_huge_pud, just like it's done for PMDs.
Reason being that if a callback is taken during create, there is no PUD
yet so nothing to split, whereas if a fallback is taken when encountering
a write protection fault there is something to split.

It looks like this was the original intention with the commit where the
splitting was introduced, but somehow it got moved to the wrong place
between v1 and v2 of the patch series.  Rebase mistake perhaps.

Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
Fixes: 327e9fd489 ("mm: Split huge pages on write-notify or COW")
Signed-off-by: James Gowans <jgowans@amazon.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-21 21:24:11 +02:00
Suren Baghdasaryan
0864756fb0 Revert "ANDROID: Use the notifier lock to perform file-backed vma teardown"
This reverts commit dc8ac508af.
Reason for revert: performance regression.

Bug: 234527424
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5811c7ee2feac7f9057ed0ccd848e5de71f79354
2022-07-19 12:47:27 +00:00
Suren Baghdasaryan
6072c99f21 ANDROID: Use the notifier lock to perform file-backed vma teardown
When a file-backed vma is being released, the userspace can have an
expectation that the vma and the file it's pinning will be released
synchronously. This does not happen when SPF is enabled because vma
and associated file are released asynchronously after RCU grace
period. This is done to prevent pagefault handler from stepping on
a deleted object. Fix this issue by synchronizing the file-backed
pagefault handler with the vma tear-down using notifier lock.

Fixes: 48e35d053f "FROMLIST: mm: rcu safe vma->vm_file freeing"
Bug: 231394031
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Idabf44b8e5a91805e99d79884af77a000dca7637
2022-07-16 17:06:39 +00:00
Greg Kroah-Hartman
3fb2e91e28 Merge 5.15.40 into android-5.15
Changes in 5.15.40
	x86/lib/atomic64_386_32: Rename things
	x86: Prepare asm files for straight-line-speculation
	x86: Prepare inline-asm for straight-line-speculation
	objtool: Add straight-line-speculation validation
	x86/alternative: Relax text_poke_bp() constraint
	kbuild: move objtool_args back to scripts/Makefile.build
	x86: Add straight-line-speculation mitigation
	tools arch: Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench mem memcpy'
	kvm/emulate: Fix SETcc emulation function offsets with SLS
	crypto: x86/poly1305 - Fixup SLS
	objtool: Fix SLS validation for kcov tail-call replacement
	Bluetooth: Fix the creation of hdev->name
	rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
	udf: Avoid using stale lengthOfImpUse
	mm: fix missing cache flush for all tail pages of compound page
	mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
	mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()
	mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
	mm/hwpoison: fix error page recovered but reported "not recovered"
	mm/mlock: fix potential imbalanced rlimit ucounts adjustment
	mm: fix invalid page pointer returned with FOLL_PIN gups
	Linux 5.15.40

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I039f8014f8e898f74f2f98ab71ef29c39a1ad544
2022-06-09 15:39:10 +02:00
Muchun Song
e36b476a82 mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
commit e763243cc6cb1fcc720ec58cfd6e7c35ae90a479 upstream.

userfaultfd calls copy_huge_page_from_user() which does not do any cache
flushing for the target page.  Then the target page will be mapped to
the user space with a different address (user address), which might have
an alias issue with the kernel address used to copy the data from the
user to.

Fix this issue by flushing dcache in copy_huge_page_from_user().

Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
Fixes: fa4d75c1de ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-15 20:18:52 +02:00
Greg Kroah-Hartman
33f5d1daec Merge 5.15.34 into android13-5.15
Changes in 5.15.34
	lib/logic_iomem: correct fallback config references
	um: fix and optimize xor select template for CONFIG64 and timetravel mode
	rtc: wm8350: Handle error for wm8350_register_irq
	nbd: add error handling support for add_disk()
	nbd: Fix incorrect error handle when first_minor is illegal in nbd_dev_add
	nbd: Fix hungtask when nbd_config_put
	nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
	kfence: count unexpectedly skipped allocations
	kfence: move saving stack trace of allocations into __kfence_alloc()
	kfence: limit currently covered allocations when pool nearly full
	KVM: x86/pmu: Use different raw event masks for AMD and Intel
	KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
	KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
	KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
	KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
	drm: Add orientation quirk for GPD Win Max
	ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111
	drm/amd/display: Add signal type check when verify stream backends same
	drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj
	drm/amd/display: Fix memory leak
	drm/amd/display: Use PSR version selected during set_psr_caps
	usb: gadget: tegra-xudc: Do not program SPARAM
	usb: gadget: tegra-xudc: Fix control endpoint's definitions
	usb: cdnsp: fix cdnsp_decode_trb function to properly handle ret value
	ptp: replace snprintf with sysfs_emit
	drm/amdkfd: Don't take process mutex for svm ioctls
	powerpc: dts: t104xrdb: fix phy type for FMAN 4/5
	ath11k: fix kernel panic during unload/load ath11k modules
	ath11k: pci: fix crash on suspend if board file is not found
	ath11k: mhi: use mhi_sync_power_up()
	net/smc: Send directly when TCP_CORK is cleared
	drm/bridge: Add missing pm_runtime_put_sync
	bpf: Make dst_port field in struct bpf_sock 16-bit wide
	scsi: mvsas: Replace snprintf() with sysfs_emit()
	scsi: bfa: Replace snprintf() with sysfs_emit()
	drm/v3d: fix missing unlock
	power: supply: axp20x_battery: properly report current when discharging
	mt76: mt7921: fix crash when startup fails.
	mt76: dma: initialize skip_unmap in mt76_dma_rx_fill
	cfg80211: don't add non transmitted BSS to 6GHz scanned channels
	libbpf: Fix build issue with llvm-readelf
	ipv6: make mc_forwarding atomic
	net: initialize init_net earlier
	powerpc: Set crashkernel offset to mid of RMA region
	drm/amdgpu: Fix recursive locking warning
	scsi: smartpqi: Fix kdump issue when controller is locked up
	PCI: aardvark: Fix support for MSI interrupts
	iommu/arm-smmu-v3: fix event handling soft lockup
	usb: ehci: add pci device support for Aspeed platforms
	PCI: endpoint: Fix alignment fault error in copy tests
	tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
	PCI: pciehp: Add Qualcomm quirk for Command Completed erratum
	scsi: mpi3mr: Fix reporting of actual data transfer size
	scsi: mpi3mr: Fix memory leaks
	powerpc/set_memory: Avoid spinlock recursion in change_page_attr()
	power: supply: axp288-charger: Set Vhold to 4.4V
	net/mlx5e: Disable TX queues before registering the netdev
	usb: dwc3: pci: Set the swnode from inside dwc3_pci_quirks()
	iwlwifi: mvm: Correctly set fragmented EBS
	iwlwifi: mvm: move only to an enabled channel
	drm/msm/dsi: Remove spurious IRQF_ONESHOT flag
	ipv4: Invalidate neighbour for broadcast address upon address addition
	dm ioctl: prevent potential spectre v1 gadget
	dm: requeue IO if mapping table not yet available
	drm/amdkfd: make CRAT table missing message informational only
	vfio/pci: Stub vfio_pci_vga_rw when !CONFIG_VFIO_PCI_VGA
	scsi: pm8001: Fix pm80xx_pci_mem_copy() interface
	scsi: pm8001: Fix pm8001_mpi_task_abort_resp()
	scsi: pm8001: Fix task leak in pm8001_send_abort_all()
	scsi: pm8001: Fix tag leaks on error
	scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req()
	mt76: mt7915: fix injected MPDU transmission to not use HW A-MSDU
	powerpc/64s/hash: Make hash faults work in NMI context
	mt76: mt7615: Fix assigning negative values to unsigned variable
	scsi: aha152x: Fix aha152x_setup() __setup handler return value
	scsi: hisi_sas: Free irq vectors in order for v3 HW
	scsi: hisi_sas: Limit users changing debugfs BIST count value
	net/smc: correct settings of RMB window update limit
	mips: ralink: fix a refcount leak in ill_acc_of_setup()
	macvtap: advertise link netns via netlink
	tuntap: add sanity checks about msg_controllen in sendmsg
	Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
	Bluetooth: use memset avoid memory leaks
	bnxt_en: Eliminate unintended link toggle during FW reset
	PCI: endpoint: Fix misused goto label
	MIPS: fix fortify panic when copying asm exception handlers
	powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
	powerpc/secvar: fix refcount leak in format_show()
	scsi: libfc: Fix use after free in fc_exch_abts_resp()
	can: isotp: set default value for N_As to 50 micro seconds
	can: etas_es58x: es58x_fd_rx_event_msg(): initialize rx_event_msg before calling es58x_check_msg_len()
	riscv: Fixed misaligned memory access. Fixed pointer comparison.
	net: account alternate interface name memory
	net: limit altnames to 64k total
	net/mlx5e: Remove overzealous validations in netlink EEPROM query
	net: sfp: add 2500base-X quirk for Lantech SFP module
	usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm
	mt76: fix monitor mode crash with sdio driver
	xtensa: fix DTC warning unit_address_format
	MIPS: ingenic: correct unit node address
	Bluetooth: Fix use after free in hci_send_acl
	netfilter: conntrack: revisit gc autotuning
	netlabel: fix out-of-bounds memory accesses
	ceph: fix inode reference leakage in ceph_get_snapdir()
	ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
	lib/Kconfig.debug: add ARCH dependency for FUNCTION_ALIGN option
	init/main.c: return 1 from handled __setup() functions
	minix: fix bug when opening a file with O_DIRECT
	clk: si5341: fix reported clk_rate when output divider is 2
	staging: vchiq_arm: Avoid NULL ptr deref in vchiq_dump_platform_instances
	staging: vchiq_core: handle NULL result of find_service_by_handle
	phy: amlogic: phy-meson-gxl-usb2: fix shared reset controller use
	phy: amlogic: meson8b-usb2: Use dev_err_probe()
	phy: amlogic: meson8b-usb2: fix shared reset control use
	clk: rockchip: drop CLK_SET_RATE_PARENT from dclk_vop* on rk3568
	cpufreq: CPPC: Fix performance/frequency conversion
	opp: Expose of-node's name in debugfs
	staging: wfx: fix an error handling in wfx_init_common()
	w1: w1_therm: fixes w1_seq for ds28ea00 sensors
	NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
	NFSv4: Protect the state recovery thread against direct reclaim
	habanalabs: fix possible memory leak in MMU DR fini
	xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
	clk: ti: Preserve node in ti_dt_clocks_register()
	clk: Enforce that disjoints limits are invalid
	SUNRPC/call_alloc: async tasks mustn't block waiting for memory
	SUNRPC/xprt: async tasks mustn't block waiting for memory
	SUNRPC: remove scheduling boost for "SWAPPER" tasks.
	NFS: swap IO handling is slightly different for O_DIRECT IO
	NFS: swap-out must always use STABLE writes.
	x86: Annotate call_on_stack()
	x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
	serial: samsung_tty: do not unlock port->lock for uart_write_wakeup()
	virtio_console: eliminate anonymous module_init & module_exit
	jfs: prevent NULL deref in diFree
	SUNRPC: Fix socket waits for write buffer space
	NFS: nfsiod should not block forever in mempool_alloc()
	NFS: Avoid writeback threads getting stuck in mempool_alloc()
	selftests: net: Add tls config dependency for tls selftests
	parisc: Fix CPU affinity for Lasi, WAX and Dino chips
	parisc: Fix patch code locking and flushing
	mm: fix race between MADV_FREE reclaim and blkdev direct IO read
	rtc: mc146818-lib: change return values of mc146818_get_time()
	rtc: Check return value from mc146818_get_time()
	rtc: mc146818-lib: fix RTC presence check
	drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire()
	Drivers: hv: vmbus: Fix potential crash on module unload
	Revert "NFSv4: Handle the special Linux file open access mode"
	NFSv4: fix open failure with O_ACCMODE flag
	scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling
	scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map()
	scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
	vdpa/mlx5: Rename control VQ workqueue to vdpa wq
	vdpa/mlx5: Propagate link status from device to vdpa driver
	vdpa: mlx5: prevent cvq work from hogging CPU
	net: sfc: add missing xdp queue reinitialization
	net/tls: fix slab-out-of-bounds bug in decrypt_internal
	vrf: fix packet sniffing for traffic originating from ip tunnels
	skbuff: fix coalescing for page_pool fragment recycling
	ice: Clear default forwarding VSI during VSI release
	mctp: Fix check for dev_hard_header() result
	net: ipv4: fix route with nexthop object delete warning
	net: stmmac: Fix unset max_speed difference between DT and non-DT platforms
	drm/imx: imx-ldb: Check for null pointer after calling kmemdup
	drm/imx: Fix memory leak in imx_pd_connector_get_modes
	drm/imx: dw_hdmi-imx: Fix bailout in error cases of probe
	regulator: rtq2134: Fix missing active_discharge_on setting
	regulator: atc260x: Fix missing active_discharge_on setting
	arch/arm64: Fix topology initialization for core scheduling
	bnxt_en: Synchronize tx when xdp redirects happen on same ring
	bnxt_en: reserve space inside receive page for skb_shared_info
	bnxt_en: Prevent XDP redirect from running when stopping TX queue
	sfc: Do not free an empty page_ring
	RDMA/mlx5: Don't remove cache MRs when a delay is needed
	RDMA/mlx5: Add a missing update of cache->last_add
	IB/cm: Cancel mad on the DREQ event when the state is MRA_REP_RCVD
	IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition
	sctp: count singleton chunks in assoc user stats
	dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe
	ice: Set txq_teid to ICE_INVAL_TEID on ring creation
	ice: Do not skip not enabled queues in ice_vc_dis_qs_msg
	ipv6: Fix stats accounting in ip6_pkt_drop
	ice: synchronize_rcu() when terminating rings
	ice: xsk: fix VSI state check in ice_xsk_wakeup()
	net: openvswitch: don't send internal clone attribute to the userspace.
	net: ethernet: mv643xx: Fix over zealous checking of_get_mac_address()
	net: openvswitch: fix leak of nested actions
	rxrpc: fix a race in rxrpc_exit_net()
	net: sfc: fix using uninitialized xdp tx_queue
	net: phy: mscc-miim: reject clause 45 register accesses
	qede: confirm skb is allocated before using
	spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op()
	bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
	drbd: Fix five use after free bugs in get_initial_state
	scsi: ufs: ufshpb: Fix a NULL check on list iterator
	io_uring: nospec index for tags on files update
	io_uring: don't touch scm_fp_list after queueing skb
	SUNRPC: Handle ENOMEM in call_transmit_status()
	SUNRPC: Handle low memory situations in call_status()
	SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
	iommu/omap: Fix regression in probe for NULL pointer dereference
	perf: arm-spe: Fix perf report --mem-mode
	perf tools: Fix perf's libperf_print callback
	perf session: Remap buf if there is no space for event
	arm64: Add part number for Arm Cortex-A78AE
	scsi: mpt3sas: Fix use after free in _scsih_expander_node_remove()
	scsi: ufs: ufs-pci: Add support for Intel MTL
	Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning"
	mmc: block: Check for errors after write on SPI
	mmc: mmci: stm32: correctly check all elements of sg list
	mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete
	mmc: core: Fixup support for writeback-cache for eMMC and SD
	lz4: fix LZ4_decompress_safe_partial read out of bound
	highmem: fix checks in __kmap_local_sched_{in,out}
	mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
	mm/mempolicy: fix mpol_new leak in shared_policy_replace
	io_uring: don't check req->file in io_fsync_prep()
	io_uring: defer splice/tee file validity check until command issue
	io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFF
	io_uring: fix race between timeout flush and removal
	x86/pm: Save the MSR validity status at context setup
	x86/speculation: Restore speculation related MSRs during S3 resume
	perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
	btrfs: fix qgroup reserve overflow the qgroup limit
	btrfs: prevent subvol with swapfile from being deleted
	spi: core: add dma_map_dev for __spi_unmap_msg()
	arm64: patch_text: Fixup last cpu should be master
	RDMA/hfi1: Fix use-after-free bug for mm struct
	gpio: Restrict usage of GPIO chip irq members before initialization
	x86/msi: Fix msi message data shadow struct
	x86/mm/tlb: Revert retpoline avoidance approach
	perf/x86/intel: Don't extend the pseudo-encoding to GP counters
	ata: sata_dwc_460ex: Fix crash due to OOB write
	perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator
	perf/core: Inherit event_caps
	irqchip/gic-v3: Fix GICR_CTLR.RWP polling
	fbdev: Fix unregistering of framebuffers without device
	amd/display: set backlight only if required
	SUNRPC: Prevent immediate close+reconnect
	drm/panel: ili9341: fix optional regulator handling
	drm/amdgpu/display: change pipe policy for DCN 2.1
	drm/amdgpu/smu10: fix SoC/fclk units in auto mode
	drm/amdgpu/vcn: Fix the register setting for vcn1
	drm/nouveau/pmu: Add missing callbacks for Tegra devices
	drm/amdkfd: Create file descriptor after client is added to smi_clients list
	drm/amdgpu: don't use BACO for reset in S3
	KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255
	net/smc: send directly on setting TCP_NODELAY
	Revert "selftests: net: Add tls config dependency for tls selftests"
	bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
	selftests/bpf: Fix u8 narrow load checks for bpf_sk_lookup remote_port
	rtc: mc146818-lib: fix signedness bug in mc146818_get_time()
	SUNRPC: Don't call connect() more than once on a TCP socket
	Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
	perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
	perf python: Fix probing for some clang command line options
	tools build: Filter out options and warnings not supported by clang
	tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
	dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error"
	KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
	Revert "net/mlx5: Accept devlink user input after driver initialization complete"
	ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
	selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
	selftests: cgroup: Test open-time credential usage for migration checks
	selftests: cgroup: Test open-time cgroup namespace usage for migration checks
	mm: don't skip swap entry even if zap_details specified
	Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
	x86/bug: Prevent shadowing in __WARN_FLAGS
	sched: Teach the forced-newidle balancer about CPU affinity limitation.
	x86,static_call: Fix __static_call_return0 for i386
	irqchip/gic-v4: Wait for GICR_VPENDBASER.Dirty to clear before descheduling
	powerpc/64: Fix build failure with allyesconfig in book3s_64_entry.S
	irqchip/gic, gic-v3: Prevent GSI to SGI translations
	mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
	static_call: Don't make __static_call_return0 static
	powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
	stacktrace: move filter_irq_stacks() to kernel/stacktrace.c
	Linux 5.15.34

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I98049d0d8ebd427296418d31085bfde482ad30e7
2022-04-24 16:57:32 +02:00
Yu Zhao
f88ed5a3d3 FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
2022-04-20 17:38:55 +00:00
Yu Zhao
1861f17391 FROMLIST: mm: x86, arm64: add arch_has_hw_pte_young()
Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).

Link: https://lore.kernel.org/lkml/20220309021230.721028-2-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie81175d7e0d239f688d31487b298cf9b4fb66707
2022-04-20 17:38:55 +00:00
Greg Kroah-Hartman
b41a37c036 Merge 5.15.33 into android13-5.15
Changes in 5.15.33
	Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""
	USB: serial: pl2303: add IBM device IDs
	dt-bindings: usb: hcd: correct usb-device path
	USB: serial: pl2303: fix GS type detection
	USB: serial: simple: add Nokia phone driver
	mm: kfence: fix missing objcg housekeeping for SLAB
	hv: utils: add PTP_1588_CLOCK to Kconfig to fix build
	HID: logitech-dj: add new lightspeed receiver id
	HID: Add support for open wheel and no attachment to T300
	xfrm: fix tunnel model fragmentation behavior
	ARM: mstar: Select HAVE_ARM_ARCH_TIMER
	virtio_console: break out of buf poll on remove
	vdpa/mlx5: should verify CTRL_VQ feature exists for MQ
	tools/virtio: fix virtio_test execution
	ethernet: sun: Free the coherent when failing in probing
	gpio: Revert regression in sysfs-gpio (gpiolib.c)
	spi: Fix invalid sgs value
	net:mcf8390: Use platform_get_irq() to get the interrupt
	Revert "gpio: Revert regression in sysfs-gpio (gpiolib.c)"
	spi: Fix erroneous sgs value with min_t()
	Input: zinitix - do not report shadow fingers
	af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register
	net: dsa: microchip: add spi_device_id tables
	selftests: vm: fix clang build error multiple output files
	locking/lockdep: Avoid potential access of invalid memory in lock_class
	drm/amdgpu: move PX checking into amdgpu_device_ip_early_init
	drm/amdgpu: only check for _PR3 on dGPUs
	iommu/iova: Improve 32-bit free space estimate
	virtio-blk: Use blk_validate_block_size() to validate block size
	tpm: fix reference counting for struct tpm_chip
	usb: typec: tipd: Forward plug orientation to typec subsystem
	USB: usb-storage: Fix use of bitfields for hardware data in ene_ub6250.c
	xhci: fix garbage USBSTS being logged in some cases
	xhci: fix runtime PM imbalance in USB2 resume
	xhci: make xhci_handshake timeout for xhci_reset() adjustable
	xhci: fix uninitialized string returned by xhci_decode_ctrl_ctx()
	mei: me: disable driver on the ign firmware
	mei: me: add Alder Lake N device id.
	mei: avoid iterator usage outside of list_for_each_entry
	bus: mhi: pci_generic: Add mru_default for Quectel EM1xx series
	bus: mhi: Fix MHI DMA structure endianness
	docs: sphinx/requirements: Limit jinja2<3.1
	coresight: Fix TRCCONFIGR.QE sysfs interface
	coresight: syscfg: Fix memleak on registration failure in cscfg_create_device
	iio: afe: rescale: use s64 for temporary scale calculations
	iio: inkern: apply consumer scale on IIO_VAL_INT cases
	iio: inkern: apply consumer scale when no channel scale is available
	iio: inkern: make a best effort on offset calculation
	greybus: svc: fix an error handling bug in gb_svc_hello()
	clk: rockchip: re-add rational best approximation algorithm to the fractional divider
	clk: uniphier: Fix fixed-rate initialization
	ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
	cifs: fix handlecache and multiuser
	cifs: we do not need a spinlock around the tree access during umount
	KEYS: fix length validation in keyctl_pkey_params_get_2()
	KEYS: asymmetric: enforce that sig algo matches key algo
	KEYS: asymmetric: properly validate hash_algo and encoding
	Documentation: add link to stable release candidate tree
	Documentation: update stable tree link
	firmware: stratix10-svc: add missing callback parameter on RSU
	firmware: sysfb: fix platform-device leak in error path
	HID: intel-ish-hid: Use dma_alloc_coherent for firmware update
	SUNRPC: avoid race between mod_timer() and del_timer_sync()
	NFS: NFSv2/v3 clients should never be setting NFS_CAP_XATTR
	NFSD: prevent underflow in nfssvc_decode_writeargs()
	NFSD: prevent integer overflow on 32 bit systems
	f2fs: fix to unlock page correctly in error path of is_alive()
	f2fs: quota: fix loop condition at f2fs_quota_sync()
	f2fs: fix to do sanity check on .cp_pack_total_block_count
	remoteproc: Fix count check in rproc_coredump_write()
	mm/mlock: fix two bugs in user_shm_lock()
	pinctrl: ingenic: Fix regmap on X series SoCs
	pinctrl: samsung: drop pin banks references on error paths
	net: bnxt_ptp: fix compilation error
	spi: mxic: Fix the transmit path
	mtd: rawnand: protect access to rawnand devices while in suspend
	can: ems_usb: ems_usb_start_xmit(): fix double dev_kfree_skb() in error path
	can: m_can: m_can_tx_handler(): fix use after free of skb
	can: usb_8dev: usb_8dev_start_xmit(): fix double dev_kfree_skb() in error path
	jffs2: fix use-after-free in jffs2_clear_xattr_subsystem
	jffs2: fix memory leak in jffs2_do_mount_fs
	jffs2: fix memory leak in jffs2_scan_medium
	mm: fs: fix lru_cache_disabled race in bh_lru
	mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
	mm: invalidate hwpoison page cache page in fault path
	mempolicy: mbind_range() set_policy() after vma_merge()
	scsi: core: sd: Add silence_suspend flag to suppress some PM messages
	scsi: ufs: Fix runtime PM messages never-ending cycle
	scsi: scsi_transport_fc: Fix FPIN Link Integrity statistics counters
	scsi: libsas: Fix sas_ata_qc_issue() handling of NCQ NON DATA commands
	qed: display VF trust config
	qed: validate and restrict untrusted VFs vlan promisc mode
	riscv: dts: canaan: Fix SPI3 bus width
	riscv: Fix fill_callchain return value
	riscv: Increase stack size under KASAN
	Revert "Input: clear BTN_RIGHT/MIDDLE on buttonpads"
	cifs: prevent bad output lengths in smb2_ioctl_query_info()
	cifs: fix NULL ptr dereference in smb2_ioctl_query_info()
	ALSA: cs4236: fix an incorrect NULL check on list iterator
	ALSA: hda: Avoid unsol event during RPM suspending
	ALSA: pcm: Fix potential AB/BA lock with buffer_mutex and mmap_lock
	ALSA: hda/realtek: Fix audio regression on Mi Notebook Pro 2020
	rtc: mc146818-lib: fix locking in mc146818_set_time
	rtc: pl031: fix rtc features null pointer dereference
	ocfs2: fix crash when mount with quota enabled
	drm/simpledrm: Add "panel orientation" property on non-upright mounted LCD panels
	mm: madvise: skip unmapped vma holes passed to process_madvise
	mm: madvise: return correct bytes advised with process_madvise
	Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
	mm,hwpoison: unmap poisoned page before invalidation
	mm/kmemleak: reset tag when compare object pointer
	dm stats: fix too short end duration_ns when using precise_timestamps
	dm: fix use-after-free in dm_cleanup_zoned_dev()
	dm: interlock pending dm_io and dm_wait_for_bios_completion
	dm: fix double accounting of flush with data
	dm integrity: set journal entry unused when shrinking device
	tracing: Have trace event string test handle zero length strings
	drbd: fix potential silent data corruption
	powerpc/kvm: Fix kvm_use_magic_page
	PCI: fu740: Force 2.5GT/s for initial device probe
	arm64: signal: nofpsimd: Do not allocate fp/simd context when not available
	arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones
	arm64: dts: qcom: sm8250: Fix MSI IRQ for PCIe1 and PCIe2
	arm64: dts: ti: k3-am65: Fix gic-v3 compatible regs
	arm64: dts: ti: k3-j721e: Fix gic-v3 compatible regs
	arm64: dts: ti: k3-j7200: Fix gic-v3 compatible regs
	arm64: dts: ti: k3-am64: Fix gic-v3 compatible regs
	ASoC: SOF: Intel: Fix NULL ptr dereference when ENOMEM
	Revert "ACPI: Pass the same capabilities to the _OSC regardless of the query flag"
	ACPI: properties: Consistently return -ENOENT if there are no more references
	coredump: Also dump first pages of non-executable ELF libraries
	ext4: fix ext4_fc_stats trace point
	ext4: fix fs corruption when tring to remove a non-empty directory with IO error
	ext4: make mb_optimize_scan performance mount option work with extents
	drivers: hamradio: 6pack: fix UAF bug caused by mod_timer()
	samples/landlock: Fix path_list memory leak
	landlock: Use square brackets around "landlock-ruleset"
	mailbox: tegra-hsp: Flush whole channel
	block: limit request dispatch loop duration
	block: don't merge across cgroup boundaries if blkcg is enabled
	drm/edid: check basic audio support on CEA extension block
	fbdev: Hot-unplug firmware fb devices on forced removal
	video: fbdev: sm712fb: Fix crash in smtcfb_read()
	video: fbdev: atari: Atari 2 bpp (STe) palette bugfix
	rfkill: make new event layout opt-in
	ARM: dts: at91: sama7g5: Remove unused properties in i2c nodes
	ARM: dts: at91: sama5d2: Fix PMERRLOC resource size
	ARM: dts: exynos: fix UART3 pins configuration in Exynos5250
	ARM: dts: exynos: add missing HDMI supplies on SMDK5250
	ARM: dts: exynos: add missing HDMI supplies on SMDK5420
	mgag200 fix memmapsl configuration in GCTL6 register
	carl9170: fix missing bit-wise or operator for tx_params
	pstore: Don't use semaphores in always-atomic-context code
	thermal: int340x: Increase bitmap size
	lib/raid6/test: fix multiple definition linking error
	exec: Force single empty string when argv is empty
	crypto: rsa-pkcs1pad - only allow with rsa
	crypto: rsa-pkcs1pad - correctly get hash from source scatterlist
	crypto: rsa-pkcs1pad - restore signature length check
	crypto: rsa-pkcs1pad - fix buffer overread in pkcs1pad_verify_complete()
	bcache: fixup multiple threads crash
	PM: domains: Fix sleep-in-atomic bug caused by genpd_debug_remove()
	DEC: Limit PMAX memory probing to R3k systems
	media: gpio-ir-tx: fix transmit with long spaces on Orange Pi PC
	media: venus: hfi_cmds: List HDR10 property as unsupported for v1 and v3
	media: venus: venc: Fix h264 8x8 transform control
	media: davinci: vpif: fix unbalanced runtime PM get
	media: davinci: vpif: fix unbalanced runtime PM enable
	btrfs: zoned: mark relocation as writing
	btrfs: extend locking to all space_info members accesses
	btrfs: verify the tranisd of the to-be-written dirty extent buffer
	xtensa: define update_mmu_tlb function
	xtensa: fix stop_machine_cpuslocked call in patch_text
	xtensa: fix xtensa_wsr always writing 0
	drm/syncobj: flatten dma_fence_chains on transfer
	drm/nouveau/backlight: Fix LVDS backlight detection on some laptops
	drm/nouveau/backlight: Just set all backlight types as RAW
	drm/fb-helper: Mark screen buffers in system memory with FBINFO_VIRTFB
	brcmfmac: firmware: Allocate space for default boardrev in nvram
	brcmfmac: pcie: Release firmwares in the brcmf_pcie_setup error path
	brcmfmac: pcie: Declare missing firmware files in pcie.c
	brcmfmac: pcie: Replace brcmf_pcie_copy_mem_todev with memcpy_toio
	brcmfmac: pcie: Fix crashes due to early IRQs
	drm/i915/opregion: check port number bounds for SWSCI display power state
	drm/i915/gem: add missing boundary check in vm_access
	PCI: imx6: Allow to probe when dw_pcie_wait_for_link() fails
	PCI: pciehp: Clear cmd_busy bit in polling mode
	PCI: xgene: Revert "PCI: xgene: Fix IB window setup"
	regulator: qcom_smd: fix for_each_child.cocci warnings
	selinux: access superblock_security_struct in LSM blob way
	selinux: check return value of sel_make_avc_files
	crypto: ccp - Ensure psp_ret is always init'd in __sev_platform_init_locked()
	hwrng: cavium - Check health status while reading random data
	hwrng: cavium - HW_RANDOM_CAVIUM should depend on ARCH_THUNDER
	crypto: sun8i-ss - really disable hash on A80
	crypto: authenc - Fix sleep in atomic context in decrypt_tail
	crypto: mxs-dcp - Fix scatterlist processing
	selinux: Fix selinux_sb_mnt_opts_compat()
	thermal: int340x: Check for NULL after calling kmemdup()
	crypto: octeontx2 - remove CONFIG_DM_CRYPT check
	spi: tegra114: Add missing IRQ check in tegra_spi_probe
	spi: tegra210-quad: Fix missin IRQ check in tegra_qspi_probe
	stack: Constrain and fix stack offset randomization with Clang builds
	arm64/mm: avoid fixmap race condition when create pud mapping
	blk-cgroup: set blkg iostat after percpu stat aggregation
	selftests/x86: Add validity check and allow field splitting
	selftests/sgx: Treat CC as one argument
	crypto: rockchip - ECB does not need IV
	audit: log AUDIT_TIME_* records only from rules
	EVM: fix the evm= __setup handler return value
	crypto: ccree - don't attempt 0 len DMA mappings
	crypto: hisilicon/sec - fix the aead software fallback for engine
	spi: pxa2xx-pci: Balance reference count for PCI DMA device
	hwmon: (pmbus) Add mutex to regulator ops
	hwmon: (sch56xx-common) Replace WDOG_ACTIVE with WDOG_HW_RUNNING
	nvme: cleanup __nvme_check_ids
	nvme: fix the check for duplicate unique identifiers
	block: don't delete queue kobject before its children
	PM: hibernate: fix __setup handler error handling
	PM: suspend: fix return value of __setup handler
	spi: spi-zynqmp-gqspi: Handle error for dma_set_mask
	hwrng: atmel - disable trng on failure path
	crypto: sun8i-ss - call finalize with bh disabled
	crypto: sun8i-ce - call finalize with bh disabled
	crypto: amlogic - call finalize with bh disabled
	crypto: gemini - call finalize with bh disabled
	crypto: vmx - add missing dependencies
	clocksource/drivers/timer-ti-dm: Fix regression from errata i940 fix
	clocksource/drivers/exynos_mct: Refactor resources allocation
	clocksource/drivers/exynos_mct: Handle DTS with higher number of interrupts
	clocksource/drivers/timer-microchip-pit64b: Use notrace
	clocksource/drivers/timer-of: Check return value of of_iomap in timer_of_base_init()
	arm64: prevent instrumentation of bp hardening callbacks
	KEYS: trusted: Fix trusted key backends when building as module
	KEYS: trusted: Avoid calling null function trusted_key_exit
	ACPI: APEI: fix return value of __setup handlers
	crypto: ccp - ccp_dmaengine_unregister release dma channels
	crypto: ccree - Fix use after free in cc_cipher_exit()
	hwrng: nomadik - Change clk_disable to clk_disable_unprepare
	hwmon: (pmbus) Add Vin unit off handling
	clocksource: acpi_pm: fix return value of __setup handler
	io_uring: don't check unrelated req->open.how in accept request
	io_uring: terminate manual loop iterator loop correctly for non-vecs
	watch_queue: Fix NULL dereference in error cleanup
	watch_queue: Actually free the watch
	f2fs: fix to enable ATGC correctly via gc_idle sysfs interface
	sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa
	sched/core: Export pelt_thermal_tp
	sched/uclamp: Fix iowait boost escaping uclamp restriction
	rseq: Remove broken uapi field layout on 32-bit little endian
	perf/core: Fix address filter parser for multiple filters
	perf/x86/intel/pt: Fix address filter config for 32-bit kernel
	sched/fair: Improve consistency of allowed NUMA balance calculations
	f2fs: fix missing free nid in f2fs_handle_failed_inode
	nfsd: more robust allocation failure handling in nfsd_file_cache_init
	sched/cpuacct: Fix charge percpu cpuusage
	sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
	f2fs: fix to avoid potential deadlock
	btrfs: fix unexpected error path when reflinking an inline extent
	f2fs: fix compressed file start atomic write may cause data corruption
	selftests, x86: fix how check_cc.sh is being invoked
	drivers/base/memory: add memory block to memory group after registration succeeded
	kunit: make kunit_test_timeout compatible with comment
	pinctrl: samsung: Remove EINT handler for Exynos850 ALIVE and CMGP gpios
	media: staging: media: zoran: fix usage of vb2_dma_contig_set_max_seg_size
	media: camss: csid-170: fix non-10bit formats
	media: camss: csid-170: don't enable unused irqs
	media: camss: csid-170: set the right HALT_CMD when disabled
	media: camss: vfe-170: fix "VFE halt timeout" error
	media: staging: media: imx: imx7-mipi-csis: Make subdev name unique
	media: v4l2-mem2mem: Apply DST_QUEUE_OFF_BASE on MMAP buffers across ioctls
	media: mtk-vcodec: potential dereference of null pointer
	media: imx: imx8mq-mipi-csi2: remove wrong irq config write operation
	media: imx: imx8mq-mipi_csi2: fix system resume
	media: bttv: fix WARNING regression on tunerless devices
	media: atmel: atmel-sama7g5-isc: fix ispck leftover
	ASoC: sh: rz-ssi: Drop calling rz_ssi_pio_recv() recursively
	ASoC: codecs: Check for error pointer after calling devm_regmap_init_mmio
	ASoC: xilinx: xlnx_formatter_pcm: Handle sysclk setting
	ASoC: simple-card-utils: Set sysclk on all components
	media: coda: Fix missing put_device() call in coda_get_vdoa_data
	media: meson: vdec: potential dereference of null pointer
	media: hantro: Fix overfill bottom register field name
	media: ov6650: Fix set format try processing path
	media: v4l: Avoid unaligned access warnings when printing 4cc modifiers
	media: ov5648: Don't pack controls struct
	media: aspeed: Correct value for h-total-pixels
	video: fbdev: matroxfb: set maxvram of vbG200eW to the same as vbG200 to avoid black screen
	video: fbdev: controlfb: Fix COMPILE_TEST build
	video: fbdev: smscufx: Fix null-ptr-deref in ufx_usb_probe()
	video: fbdev: atmel_lcdfb: fix an error code in atmel_lcdfb_probe()
	video: fbdev: fbcvt.c: fix printing in fb_cvt_print_name()
	ARM: dts: Fix OpenBMC flash layout label addresses
	firmware: qcom: scm: Remove reassignment to desc following initializer
	ARM: dts: qcom: ipq4019: fix sleep clock
	soc: qcom: rpmpd: Check for null return of devm_kcalloc
	soc: qcom: ocmem: Fix missing put_device() call in of_get_ocmem
	soc: qcom: aoss: remove spurious IRQF_ONESHOT flags
	arm64: dts: qcom: sdm845: fix microphone bias properties and values
	arm64: dts: qcom: sm8250: fix PCIe bindings to follow schema
	arm64: dts: broadcom: bcm4908: use proper TWD binding
	arm64: dts: qcom: sm8150: Correct TCS configuration for apps rsc
	arm64: dts: qcom: sm8350: Correct TCS configuration for apps rsc
	firmware: ti_sci: Fix compilation failure when CONFIG_TI_SCI_PROTOCOL is not defined
	soc: ti: wkup_m3_ipc: Fix IRQ check in wkup_m3_ipc_probe
	ARM: dts: sun8i: v3s: Move the csi1 block to follow address order
	vsprintf: Fix potential unaligned access
	ARM: dts: imx: Add missing LVDS decoder on M53Menlo
	media: mexon-ge2d: fixup frames size in registers
	media: video/hdmi: handle short reads of hdmi info frame.
	media: ti-vpe: cal: Fix a NULL pointer dereference in cal_ctx_v4l2_init_formats()
	media: em28xx: initialize refcount before kref_get
	media: usb: go7007: s2250-board: fix leak in probe()
	media: cedrus: H265: Fix neighbour info buffer size
	media: cedrus: h264: Fix neighbour info buffer size
	ASoC: codecs: rx-macro: fix accessing compander for aux
	ASoC: codecs: rx-macro: fix accessing array out of bounds for enum type
	ASoC: codecs: va-macro: fix accessing array out of bounds for enum type
	ASoC: codecs: wc938x: fix accessing array out of bounds for enum type
	ASoC: codecs: wcd938x: fix kcontrol max values
	ASoC: codecs: wcd934x: fix kcontrol max values
	ASoC: codecs: wcd934x: fix return value of wcd934x_rx_hph_mode_put
	media: v4l2-core: Initialize h264 scaling matrix
	media: ov5640: Fix set format, v4l2_mbus_pixelcode not updated
	selftests/lkdtm: Add UBSAN config
	lib: uninline simple_strntoull() as well
	vsprintf: Fix %pK with kptr_restrict == 0
	uaccess: fix nios2 and microblaze get_user_8()
	ASoC: rt5663: check the return value of devm_kzalloc() in rt5663_parse_dp()
	soc: mediatek: pm-domains: Add wakeup capacity support in power domain
	mmc: sdhci_am654: Fix the driver data of AM64 SoC
	ASoC: ti: davinci-i2s: Add check for clk_enable()
	ALSA: spi: Add check for clk_enable()
	arm64: dts: ns2: Fix spi-cpol and spi-cpha property
	arm64: dts: broadcom: Fix sata nodename
	printk: fix return value of printk.devkmsg __setup handler
	ASoC: mxs-saif: Handle errors for clk_enable
	ASoC: atmel_ssc_dai: Handle errors for clk_enable
	ASoC: dwc-i2s: Handle errors for clk_enable
	ASoC: soc-compress: prevent the potentially use of null pointer
	memory: emif: Add check for setup_interrupts
	memory: emif: check the pointer temp in get_device_details()
	ALSA: firewire-lib: fix uninitialized flag for AV/C deferred transaction
	arm64: dts: rockchip: Fix SDIO regulator supply properties on rk3399-firefly
	m68k: coldfire/device.c: only build for MCF_EDMA when h/w macros are defined
	media: stk1160: If start stream fails, return buffers with VB2_BUF_STATE_QUEUED
	media: vidtv: Check for null return of vzalloc
	ASoC: atmel: Add missing of_node_put() in at91sam9g20ek_audio_probe
	ASoC: wm8350: Handle error for wm8350_register_irq
	ASoC: fsi: Add check for clk_enable
	video: fbdev: omapfb: Add missing of_node_put() in dvic_probe_of
	media: saa7134: fix incorrect use to determine if list is empty
	ivtv: fix incorrect device_caps for ivtvfb
	ASoC: atmel: Fix error handling in snd_proto_probe
	ASoC: rockchip: i2s: Fix missing clk_disable_unprepare() in rockchip_i2s_probe
	ASoC: SOF: Add missing of_node_put() in imx8m_probe
	ASoC: mediatek: use of_device_get_match_data()
	ASoC: mediatek: mt8192-mt6359: Fix error handling in mt8192_mt6359_dev_probe
	ASoC: rk817: Fix missing clk_disable_unprepare() in rk817_platform_probe
	ASoC: dmaengine: do not use a NULL prepare_slave_config() callback
	ASoC: mxs: Fix error handling in mxs_sgtl5000_probe
	ASoC: fsl_spdif: Disable TX clock when stop
	ASoC: imx-es8328: Fix error return code in imx_es8328_probe()
	ASoC: SOF: Intel: enable DMI L1 for playback streams
	ASoC: msm8916-wcd-digital: Fix missing clk_disable_unprepare() in msm8916_wcd_digital_probe
	mmc: davinci_mmc: Handle error for clk_enable
	ASoC: atmel: Fix error handling in sam9x5_wm8731_driver_probe
	ASoC: msm8916-wcd-analog: Fix error handling in pm8916_wcd_analog_spmi_probe
	ASoC: codecs: wcd934x: Add missing of_node_put() in wcd934x_codec_parse_data
	ASoC: amd: Fix reference to PCM buffer address
	ARM: configs: multi_v5_defconfig: re-enable CONFIG_V4L_PLATFORM_DRIVERS
	ARM: configs: multi_v5_defconfig: re-enable DRM_PANEL and FB_xxx
	drm/meson: osd_afbcd: Add an exit callback to struct meson_afbcd_ops
	drm/meson: Make use of the helper function devm_platform_ioremap_resourcexxx()
	drm/meson: split out encoder from meson_dw_hdmi
	drm/meson: Fix error handling when afbcd.ops->init fails
	drm/bridge: Fix free wrong object in sii8620_init_rcp_input_dev
	drm/bridge: Add missing pm_runtime_disable() in __dw_mipi_dsi_probe
	drm/bridge: nwl-dsi: Fix PM disable depth imbalance in nwl_dsi_probe
	drm: bridge: adv7511: Fix ADV7535 HPD enablement
	ath10k: fix memory overwrite of the WoWLAN wakeup packet pattern
	drm/v3d/v3d_drv: Check for error num after setting mask
	drm/panfrost: Check for error num after setting mask
	libbpf: Fix possible NULL pointer dereference when destroying skeleton
	bpftool: Only set obj->skeleton on complete success
	udmabuf: validate ubuf->pagecount
	bpf: Fix UAF due to race between btf_try_get_module and load_module
	drm/selftests/test-drm_dp_mst_helper: Fix memory leak in sideband_msg_req_encode_decode
	selftests: bpf: Fix bind on used port
	Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products
	Bluetooth: hci_serdev: call init_rwsem() before p->open()
	mtd: onenand: Check for error irq
	mtd: rawnand: gpmi: fix controller timings setting
	drm/edid: Don't clear formats if using deep color
	drm/edid: Split deep color modes between RGB and YUV444
	ionic: fix type complaint in ionic_dev_cmd_clean()
	ionic: start watchdog after all is setup
	ionic: Don't send reset commands if FW isn't running
	drm/nouveau/acr: Fix undefined behavior in nvkm_acr_hsfw_load_bl()
	drm/amd/display: Fix a NULL pointer dereference in amdgpu_dm_connector_add_common_modes()
	drm/amd/pm: return -ENOTSUPP if there is no get_dpm_ultimate_freq function
	net: phy: at803x: move page selection fix to config_init
	selftests/bpf: Normalize XDP section names in selftests
	selftests/bpf/test_xdp_redirect_multi: use temp netns for testing
	ath9k_htc: fix uninit value bugs
	RDMA/core: Set MR type in ib_reg_user_mr
	KVM: PPC: Fix vmx/vsx mixup in mmio emulation
	selftests/net: timestamping: Fix bind_phc check
	i40e: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
	i40e: respect metadata on XSK Rx to skb
	igc: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
	ixgbe: pass bi->xdp to ixgbe_construct_skb_zc() directly
	ixgbe: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
	ixgbe: respect metadata on XSK Rx to skb
	power: reset: gemini-poweroff: Fix IRQ check in gemini_poweroff_probe
	ray_cs: Check ioremap return value
	powerpc: dts: t1040rdb: fix ports names for Seville Ethernet switch
	KVM: PPC: Book3S HV: Check return value of kvmppc_radix_init
	powerpc/perf: Don't use perf_hw_context for trace IMC PMU
	mt76: connac: fix sta_rec_wtbl tag len
	mt76: mt7915: use proper aid value in mt7915_mcu_wtbl_generic_tlv in sta mode
	mt76: mt7915: use proper aid value in mt7915_mcu_sta_basic_tlv
	mt76: mt7921: fix a leftover race in runtime-pm
	mt76: mt7615: fix a leftover race in runtime-pm
	mt76: mt7603: check sta_rates pointer in mt7603_sta_rate_tbl_update
	mt76: mt7615: check sta_rates pointer in mt7615_sta_rate_tbl_update
	ptp: unregister virtual clocks when unregistering physical clock.
	net: dsa: mv88e6xxx: Enable port policy support on 6097
	mac80211: Remove a couple of obsolete TODO
	mac80211: limit bandwidth in HE capabilities
	scripts/dtc: Call pkg-config POSIXly correct
	livepatch: Fix build failure on 32 bits processors
	net: asix: add proper error handling of usb read errors
	i2c: bcm2835: Use platform_get_irq() to get the interrupt
	i2c: bcm2835: Fix the error handling in 'bcm2835_i2c_probe()'
	mtd: mchp23k256: Add SPI ID table
	mtd: mchp48l640: Add SPI ID table
	igc: avoid kernel warning when changing RX ring parameters
	igb: refactor XDP registration
	PCI: aardvark: Fix reading MSI interrupt number
	PCI: aardvark: Fix reading PCI_EXP_RTSTA_PME bit on emulated bridge
	RDMA/rxe: Check the last packet by RXE_END_MASK
	libbpf: Fix signedness bug in btf_dump_array_data()
	cxl/core: Fix cxl_probe_component_regs() error message
	cxl/regs: Fix size of CXL Capability Header Register
	net:enetc: allocate CBD ring data memory using DMA coherent methods
	libbpf: Fix compilation warning due to mismatched printf format
	drm/bridge: dw-hdmi: use safe format when first in bridge chain
	libbpf: Use dynamically allocated buffer when receiving netlink messages
	power: supply: ab8500: Fix memory leak in ab8500_fg_sysfs_init
	HID: i2c-hid: fix GET/SET_REPORT for unnumbered reports
	iommu/ipmmu-vmsa: Check for error num after setting mask
	drm/bridge: anx7625: Fix overflow issue on reading EDID
	bpftool: Fix the error when lookup in no-btf maps
	drm/amd/pm: enable pm sysfs write for one VF mode
	drm/amd/display: Add affected crtcs to atomic state for dsc mst unplug
	libbpf: Fix memleak in libbpf_netlink_recv()
	IB/cma: Allow XRC INI QPs to set their local ACK timeout
	dax: make sure inodes are flushed before destroy cache
	selftests: mptcp: add csum mib check for mptcp_connect
	iwlwifi: mvm: Don't call iwl_mvm_sta_from_mac80211() with NULL sta
	iwlwifi: mvm: don't iterate unadded vifs when handling FW SMPS req
	iwlwifi: mvm: align locking in D3 test debugfs
	iwlwifi: yoyo: remove DBGI_SRAM address reset writing
	iwlwifi: Fix -EIO error code that is never returned
	iwlwifi: mvm: Fix an error code in iwl_mvm_up()
	mtd: rawnand: pl353: Set the nand chip node as the flash node
	drm/msm/dp: populate connector of struct dp_panel
	drm/msm/dp: stop link training after link training 2 failed
	drm/msm/dp: always add fail-safe mode into connector mode list
	drm/msm/dsi: Use "ref" fw clock instead of global name for VCO parent
	drm/msm/dsi/phy: fix 7nm v4.0 settings for C-PHY mode
	drm/msm/dpu: add DSPP blocks teardown
	drm/msm/dpu: fix dp audio condition
	dm crypt: fix get_key_size compiler warning if !CONFIG_KEYS
	vfio/pci: fix memory leak during D3hot to D0 transition
	vfio/pci: wake-up devices around reset functions
	scsi: fnic: Fix a tracing statement
	scsi: pm8001: Fix command initialization in pm80XX_send_read_log()
	scsi: pm8001: Fix command initialization in pm8001_chip_ssp_tm_req()
	scsi: pm8001: Fix payload initialization in pm80xx_set_thermal_config()
	scsi: pm8001: Fix le32 values handling in pm80xx_set_sas_protocol_timer_config()
	scsi: pm8001: Fix payload initialization in pm80xx_encrypt_update()
	scsi: pm8001: Fix le32 values handling in pm80xx_chip_ssp_io_req()
	scsi: pm8001: Fix le32 values handling in pm80xx_chip_sata_req()
	scsi: pm8001: Fix NCQ NON DATA command task initialization
	scsi: pm8001: Fix NCQ NON DATA command completion handling
	scsi: pm8001: Fix abort all task initialization
	RDMA/mlx5: Fix the flow of a miss in the allocation of a cache ODP MR
	drm/amd/display: Remove vupdate_int_entry definition
	TOMOYO: fix __setup handlers return values
	power: supply: sbs-charger: Don't cancel work that is not initialized
	ext2: correct max file size computing
	drm/tegra: Fix reference leak in tegra_dsi_ganged_probe
	power: supply: bq24190_charger: Fix bq24190_vbus_is_enabled() wrong false return
	scsi: hisi_sas: Change permission of parameter prot_mask
	drm/bridge: cdns-dsi: Make sure to to create proper aliases for dt
	bpf, arm64: Call build_prologue() first in first JIT pass
	bpf, arm64: Feed byte-offset into bpf line info
	xsk: Fix race at socket teardown
	RDMA/irdma: Fix netdev notifications for vlan's
	RDMA/irdma: Fix Passthrough mode in VM
	RDMA/irdma: Remove incorrect masking of PD
	gpu: host1x: Fix a memory leak in 'host1x_remove()'
	libbpf: Skip forward declaration when counting duplicated type names
	powerpc/mm/numa: skip NUMA_NO_NODE onlining in parse_numa_properties()
	powerpc/Makefile: Don't pass -mcpu=powerpc64 when building 32-bit
	KVM: x86: Fix emulation in writing cr8
	KVM: x86/emulator: Defer not-present segment check in __load_segment_descriptor()
	hv_balloon: rate-limit "Unhandled message" warning
	i2c: xiic: Make bus names unique
	power: supply: wm8350-power: Handle error for wm8350_register_irq
	power: supply: wm8350-power: Add missing free in free_charger_irq
	IB/hfi1: Allow larger MTU without AIP
	RDMA/core: Fix ib_qp_usecnt_dec() called when error
	PCI: Reduce warnings on possible RW1C corruption
	net: axienet: fix RX ring refill allocation failure handling
	drm/msm/a6xx: Fix missing ARRAY_SIZE() check
	mips: DEC: honor CONFIG_MIPS_FP_SUPPORT=n
	MIPS: Sanitise Cavium switch cases in TLB handler synthesizers
	powerpc/sysdev: fix incorrect use to determine if list is empty
	powerpc/64s: Don't use DSISR for SLB faults
	mfd: mc13xxx: Add check for mc13xxx_irq_request
	libbpf: Unmap rings when umem deleted
	selftests/bpf: Make test_lwt_ip_encap more stable and faster
	platform/x86: huawei-wmi: check the return value of device_create_file()
	scsi: mpt3sas: Fix incorrect 4GB boundary check
	powerpc: 8xx: fix a return value error in mpc8xx_pic_init
	vxcan: enable local echo for sent CAN frames
	ath10k: Fix error handling in ath10k_setup_msa_resources
	mips: cdmm: Fix refcount leak in mips_cdmm_phys_base
	MIPS: RB532: fix return value of __setup handler
	MIPS: pgalloc: fix memory leak caused by pgd_free()
	mtd: rawnand: atmel: fix refcount issue in atmel_nand_controller_init
	power: ab8500_chargalg: Use CLOCK_MONOTONIC
	RDMA/irdma: Prevent some integer underflows
	Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error"
	RDMA/mlx5: Fix memory leak in error flow for subscribe event routine
	bpf, sockmap: Fix memleak in sk_psock_queue_msg
	bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full
	bpf, sockmap: Fix more uncharged while msg has more_data
	bpf, sockmap: Fix double uncharge the mem of sk_msg
	samples/bpf, xdpsock: Fix race when running for fix duration of time
	USB: storage: ums-realtek: fix error code in rts51x_read_mem()
	drm/i915/display: Fix HPD short pulse handling for eDP
	netfilter: flowtable: Fix QinQ and pppoe support for inet table
	mt76: mt7921: fix mt7921_queues_acq implementation
	can: isotp: sanitize CAN ID checks in isotp_bind()
	can: isotp: return -EADDRNOTAVAIL when reading from unbound socket
	can: isotp: support MSG_TRUNC flag when reading from socket
	bareudp: use ipv6_mod_enabled to check if IPv6 enabled
	ibmvnic: fix race between xmit and reset
	af_unix: Fix some data-races around unix_sk(sk)->oob_skb.
	selftests/bpf: Fix error reporting from sock_fields programs
	Bluetooth: hci_uart: add missing NULL check in h5_enqueue
	Bluetooth: call hci_le_conn_failed with hdev lock in hci_le_conn_failed
	Bluetooth: btmtksdio: Fix kernel oops in btmtksdio_interrupt
	ipv4: Fix route lookups when handling ICMP redirects and PMTU updates
	af_netlink: Fix shift out of bounds in group mask calculation
	i2c: meson: Fix wrong speed use from probe
	netfilter: conntrack: Add and use nf_ct_set_auto_assign_helper_warned()
	i2c: mux: demux-pinctrl: do not deactivate a master that is not active
	powerpc/pseries: Fix use after free in remove_phb_dynamic()
	selftests/bpf/test_lirc_mode2.sh: Exit with proper code
	PCI: Avoid broken MSI on SB600 USB devices
	net: bcmgenet: Use stronger register read/writes to assure ordering
	tcp: ensure PMTU updates are processed during fastopen
	openvswitch: always update flow key after nat
	net: dsa: fix panic on shutdown if multi-chip tree failed to probe
	tipc: fix the timer expires after interval 100ms
	mfd: asic3: Add missing iounmap() on error asic3_mfd_probe
	ice: fix 'scheduling while atomic' on aux critical err interrupt
	ice: don't allow to run ice_send_event_to_aux() in atomic ctx
	drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
	kernel/resource: fix kfree() of bootmem memory again
	staging: r8188eu: convert DBG_88E_LEVEL call in hal/rtl8188e_hal_init.c
	staging: r8188eu: release_firmware is not called if allocation fails
	mxser: fix xmit_buf leak in activate when LSR == 0xff
	fsi: scom: Fix error handling
	fsi: scom: Remove retries in indirect scoms
	pwm: lpc18xx-sct: Initialize driver data and hardware before pwmchip_add()
	pps: clients: gpio: Propagate return value from pps_gpio_probe
	fsi: Aspeed: Fix a potential double free
	misc: alcor_pci: Fix an error handling path
	cpufreq: qcom-cpufreq-nvmem: fix reading of PVS Valid fuse
	soundwire: intel: fix wrong register name in intel_shim_wake
	clk: qcom: ipq8074: fix PCI-E clock oops
	dmaengine: idxd: check GENCAP config support for gencfg register
	dmaengine: idxd: change bandwidth token to read buffers
	dmaengine: idxd: restore traffic class defaults after wq reset
	iio: mma8452: Fix probe failing when an i2c_device_id is used
	serial: 8250_aspeed_vuart: add PORT_ASPEED_VUART port type
	staging:iio:adc:ad7280a: Fix handing of device address bit reversing.
	pinctrl: renesas: r8a77470: Reduce size for narrow VIN1 channel
	pinctrl: renesas: checker: Fix miscalculation of number of states
	clk: qcom: ipq8074: Use floor ops for SDCC1 clock
	phy: dphy: Correct lpx parameter and its derivatives(ta_{get,go,sure})
	phy: phy-brcm-usb: fixup BCM4908 support
	serial: 8250_mid: Balance reference count for PCI DMA device
	serial: 8250_lpss: Balance reference count for PCI DMA device
	NFS: Use of mapping_set_error() results in spurious errors
	serial: 8250: Fix race condition in RTS-after-send handling
	iio: adc: Add check for devm_request_threaded_irq
	habanalabs: Add check for pci_enable_device
	NFS: Return valid errors from nfs2/3_decode_dirent()
	staging: r8188eu: fix endless loop in recv_func
	dma-debug: fix return value of __setup handlers
	clk: imx7d: Remove audio_mclk_root_clk
	clk: imx: off by one in imx_lpcg_parse_clks_from_dt()
	clk: at91: sama7g5: fix parents of PDMCs' GCLK
	clk: qcom: clk-rcg2: Update logic to calculate D value for RCG
	clk: qcom: clk-rcg2: Update the frac table for pixel clock
	dmaengine: hisi_dma: fix MSI allocate fail when reload hisi_dma
	remoteproc: qcom: Fix missing of_node_put in adsp_alloc_memory_region
	remoteproc: qcom_wcnss: Add missing of_node_put() in wcnss_alloc_memory_region
	remoteproc: qcom_q6v5_mss: Fix some leaks in q6v5_alloc_memory_region
	nvdimm/region: Fix default alignment for small regions
	clk: actions: Terminate clk_div_table with sentinel element
	clk: loongson1: Terminate clk_div_table with sentinel element
	clk: hisilicon: Terminate clk_div_table with sentinel element
	clk: clps711x: Terminate clk_div_table with sentinel element
	clk: Fix clk_hw_get_clk() when dev is NULL
	clk: tegra: tegra124-emc: Fix missing put_device() call in emc_ensure_emc_driver
	mailbox: imx: fix crash in resume on i.mx8ulp
	NFS: remove unneeded check in decode_devicenotify_args()
	staging: mt7621-dts: fix LEDs and pinctrl on GB-PC1 devicetree
	staging: mt7621-dts: fix formatting
	staging: mt7621-dts: fix pinctrl properties for ethernet
	staging: mt7621-dts: fix GB-PC2 devicetree
	pinctrl: mediatek: Fix missing of_node_put() in mtk_pctrl_init
	pinctrl: mediatek: paris: Fix PIN_CONFIG_BIAS_* readback
	pinctrl: mediatek: paris: Fix "argument" argument type for mtk_pinconf_get()
	pinctrl: mediatek: paris: Fix pingroup pin config state readback
	pinctrl: mediatek: paris: Skip custom extra pin config dump for virtual GPIOs
	pinctrl: microchip sgpio: use reset driver
	pinctrl: microchip-sgpio: lock RMW access
	pinctrl: nomadik: Add missing of_node_put() in nmk_pinctrl_probe
	pinctrl/rockchip: Add missing of_node_put() in rockchip_pinctrl_probe
	tty: hvc: fix return value of __setup handler
	kgdboc: fix return value of __setup handler
	serial: 8250: fix XOFF/XON sending when DMA is used
	virt: acrn: obtain pa from VMA with PFNMAP flag
	virt: acrn: fix a memory leak in acrn_dev_ioctl()
	kgdbts: fix return value of __setup handler
	firmware: google: Properly state IOMEM dependency
	driver core: dd: fix return value of __setup handler
	jfs: fix divide error in dbNextAG
	netfilter: nf_conntrack_tcp: preserve liberal flag in tcp options
	SUNRPC don't resend a task on an offlined transport
	NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
	kdb: Fix the putarea helper function
	perf stat: Fix forked applications enablement of counters
	clk: qcom: gcc-msm8994: Fix gpll4 width
	vsock/virtio: initialize vdev->priv before using VQs
	vsock/virtio: read the negotiated features before using VQs
	vsock/virtio: enable VQs early on probe
	clk: Initialize orphan req_rate
	xen: fix is_xen_pmu()
	net: enetc: report software timestamping via SO_TIMESTAMPING
	net: hns3: fix bug when PF set the duplicate MAC address for VFs
	net: hns3: fix port base vlan add fail when concurrent with reset
	net: hns3: add vlan list lock to protect vlan list
	net: hns3: format the output of the MAC address
	net: hns3: refine the process when PF set VF VLAN
	net: phy: broadcom: Fix brcm_fet_config_init()
	selftests: test_vxlan_under_vrf: Fix broken test case
	NFS: Don't loop forever in nfs_do_recoalesce()
	net: hns3: clean residual vf config after disable sriov
	net: sparx5: depends on PTP_1588_CLOCK_OPTIONAL
	qlcnic: dcb: default to returning -EOPNOTSUPP
	net/x25: Fix null-ptr-deref caused by x25_disconnect
	net: sparx5: switchdev: fix possible NULL pointer dereference
	octeontx2-af: initialize action variable
	net: prefer nf_ct_put instead of nf_conntrack_put
	net/sched: act_ct: fix ref leak when switching zones
	NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
	net: dsa: bcm_sf2_cfp: fix an incorrect NULL check on list iterator
	fs: fd tables have to be multiples of BITS_PER_LONG
	lib/test: use after free in register_test_dev_kmod()
	fs: fix fd table size alignment properly
	LSM: general protection fault in legacy_parse_param
	regulator: rpi-panel: Handle I2C errors/timing to the Atmel
	crypto: hisilicon/qm - cleanup warning in qm_vf_read_qos
	gcc-plugins/stackleak: Exactly match strings instead of prefixes
	pinctrl: npcm: Fix broken references to chip->parent_device
	rcu: Mark writes to the rcu_segcblist structure's ->flags field
	block/bfq_wf2q: correct weight to ioprio
	crypto: xts - Add softdep on ecb
	crypto: hisilicon/sec - not need to enable sm4 extra mode at HW V3
	block, bfq: don't move oom_bfqq
	selinux: use correct type for context length
	arm64: module: remove (NOLOAD) from linker script
	selinux: allow FIOCLEX and FIONCLEX with policy capability
	loop: use sysfs_emit() in the sysfs xxx show()
	Fix incorrect type in assignment of ipv6 port for audit
	irqchip/qcom-pdc: Fix broken locking
	irqchip/nvic: Release nvic_base upon failure
	fs/binfmt_elf: Fix AT_PHDR for unusual ELF files
	bfq: fix use-after-free in bfq_dispatch_request
	ACPICA: Avoid walking the ACPI Namespace if it is not there
	lib/raid6/test/Makefile: Use $(pound) instead of \# for Make 4.3
	Revert "Revert "block, bfq: honor already-setup queue merges""
	ACPI/APEI: Limit printable size of BERT table data
	PM: core: keep irq flags in device_pm_check_callbacks()
	parisc: Fix handling off probe non-access faults
	nvme-tcp: lockdep: annotate in-kernel sockets
	spi: tegra20: Use of_device_get_match_data()
	atomics: Fix atomic64_{read_acquire,set_release} fallbacks
	locking/lockdep: Iterate lock_classes directly when reading lockdep files
	ext4: correct cluster len and clusters changed accounting in ext4_mb_mark_bb
	ext4: fix ext4_mb_mark_bb() with flex_bg with fast_commit
	sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
	ext4: don't BUG if someone dirty pages without asking ext4 first
	f2fs: fix to do sanity check on curseg->alloc_type
	NFSD: Fix nfsd_breaker_owns_lease() return values
	f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs
	btrfs: harden identification of a stale device
	btrfs: make search_csum_tree return 0 if we get -EFBIG
	f2fs: use spin_lock to avoid hang
	f2fs: compress: fix to print raw data size in error path of lz4 decompression
	Adjust cifssb maximum read size
	ntfs: add sanity check on allocation size
	media: staging: media: zoran: move videodev alloc
	media: staging: media: zoran: calculate the right buffer number for zoran_reap_stat_com
	media: staging: media: zoran: fix various V4L2 compliance errors
	media: atmel: atmel-isc-base: report frame sizes as full supported range
	media: ir_toy: free before error exiting
	ASoC: sh: rz-ssi: Make the data structures available before registering the handlers
	ASoC: SOF: Intel: match sdw version on link_slaves_found
	media: imx-jpeg: Prevent decoding NV12M jpegs into single-planar buffers
	media: iommu/mediatek-v1: Free the existed fwspec if the master dev already has
	media: iommu/mediatek: Return ENODEV if the device is NULL
	media: iommu/mediatek: Add device_link between the consumer and the larb devices
	video: fbdev: nvidiafb: Use strscpy() to prevent buffer overflow
	video: fbdev: w100fb: Reset global state
	video: fbdev: cirrusfb: check pixclock to avoid divide by zero
	video: fbdev: omapfb: acx565akm: replace snprintf with sysfs_emit
	ARM: dts: qcom: fix gic_irq_domain_translate warnings for msm8960
	ARM: dts: bcm2837: Add the missing L1/L2 cache information
	ASoC: madera: Add dependencies on MFD
	media: atomisp_gmin_platform: Add DMI quirk to not turn AXP ELDO2 regulator off on some boards
	media: atomisp: fix dummy_ptr check to avoid duplicate active_bo
	ARM: ftrace: avoid redundant loads or clobbering IP
	ARM: dts: imx7: Use audio_mclk_post_div instead audio_mclk_root_clk
	arm64: defconfig: build imx-sdma as a module
	video: fbdev: omapfb: panel-dsi-cm: Use sysfs_emit() instead of snprintf()
	video: fbdev: omapfb: panel-tpo-td043mtea1: Use sysfs_emit() instead of snprintf()
	video: fbdev: udlfb: replace snprintf in show functions with sysfs_emit
	ARM: dts: bcm2711: Add the missing L1/L2 cache information
	ASoC: soc-core: skip zero num_dai component in searching dai name
	media: imx-jpeg: fix a bug of accessing array out of bounds
	media: cx88-mpeg: clear interrupt status register before streaming video
	uaccess: fix type mismatch warnings from access_ok()
	lib/test_lockup: fix kernel pointer check for separate address spaces
	ARM: tegra: tamonten: Fix I2C3 pad setting
	ARM: mmp: Fix failure to remove sram device
	ASoC: amd: vg: fix for pm resume callback sequence
	video: fbdev: sm712fb: Fix crash in smtcfb_write()
	media: i2c: ov5648: Fix lockdep error
	media: Revert "media: em28xx: add missing em28xx_close_extension"
	media: hdpvr: initialize dev->worker at hdpvr_register_videodev
	ASoC: Intel: sof_sdw: fix quirks for 2022 HP Spectre x360 13"
	tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
	mmc: host: Return an error when ->enable_sdio_irq() ops is missing
	media: atomisp: fix bad usage at error handling logic
	ALSA: hda/realtek: Add alc256-samsung-headphone fixup
	KVM: x86: Reinitialize context if host userspace toggles EFER.LME
	KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root()
	KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU
	KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
	KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_send_ipi()
	KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_flush_tlb()
	KVM: x86: hyper-v: Fix the maximum number of sparse banks for XMM fast TLB flush hypercalls
	KVM: x86: hyper-v: HVCALL_SEND_IPI_EX is an XMM fast hypercall
	powerpc/kasan: Fix early region not updated correctly
	powerpc/lib/sstep: Fix 'sthcx' instruction
	powerpc/lib/sstep: Fix build errors with newer binutils
	powerpc: Add set_memory_{p/np}() and remove set_memory_attr()
	powerpc: Fix build errors with newer binutils
	drm/dp: Fix off-by-one in register cache size
	drm/i915: Treat SAGV block time 0 as SAGV disabled
	drm/i915: Fix PSF GV point mask when SAGV is not possible
	drm/i915: Reject unsupported TMDS rates on ICL+
	scsi: qla2xxx: Refactor asynchronous command initialization
	scsi: qla2xxx: Implement ref count for SRB
	scsi: qla2xxx: Fix stuck session in gpdb
	scsi: qla2xxx: Fix warning message due to adisc being flushed
	scsi: qla2xxx: Fix scheduling while atomic
	scsi: qla2xxx: Fix premature hw access after PCI error
	scsi: qla2xxx: Fix wrong FDMI data for 64G adapter
	scsi: qla2xxx: Fix warning for missing error code
	scsi: qla2xxx: Fix device reconnect in loop topology
	scsi: qla2xxx: edif: Fix clang warning
	scsi: qla2xxx: Fix T10 PI tag escape and IP guard options for 28XX adapters
	scsi: qla2xxx: Add devids and conditionals for 28xx
	scsi: qla2xxx: Check for firmware dump already collected
	scsi: qla2xxx: Suppress a kernel complaint in qla_create_qpair()
	scsi: qla2xxx: Fix disk failure to rediscover
	scsi: qla2xxx: Fix incorrect reporting of task management failure
	scsi: qla2xxx: Fix hang due to session stuck
	scsi: qla2xxx: Fix missed DMA unmap for NVMe ls requests
	scsi: qla2xxx: Fix N2N inconsistent PLOGI
	scsi: qla2xxx: Fix stuck session of PRLI reject
	scsi: qla2xxx: Reduce false trigger to login
	scsi: qla2xxx: Use correct feature type field during RFF_ID processing
	platform: chrome: Split trace include file
	KVM: x86: Check lapic_in_kernel() before attempting to set a SynIC irq
	KVM: x86: Avoid theoretical NULL pointer dereference in kvm_irq_delivery_to_apic_fast()
	KVM: x86: Forbid VMM to set SYNIC/STIMER MSRs when SynIC wasn't activated
	KVM: Prevent module exit until all VMs are freed
	KVM: x86: fix sending PV IPI
	KVM: SVM: fix panic on out-of-bounds guest IRQ
	ubifs: rename_whiteout: Fix double free for whiteout_ui->data
	ubifs: Fix deadlock in concurrent rename whiteout and inode writeback
	ubifs: Add missing iput if do_tmpfile() failed in rename whiteout
	ubifs: Rename whiteout atomically
	ubifs: Fix 'ui->dirty' race between do_tmpfile() and writeback work
	ubifs: Rectify space amount budget for mkdir/tmpfile operations
	ubifs: setflags: Make dirtied_ino_d 8 bytes aligned
	ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock()
	ubifs: Fix to add refcount once page is set private
	ubifs: rename_whiteout: correct old_dir size computing
	nvme: allow duplicate NSIDs for private namespaces
	nvme: fix the read-only state for zoned namespaces with unsupposed features
	wireguard: queueing: use CFI-safe ptr_ring cleanup function
	wireguard: socket: free skb in send6 when ipv6 is disabled
	wireguard: socket: ignore v6 endpoints when ipv6 is disabled
	XArray: Fix xas_create_range() when multi-order entry present
	can: mcba_usb: mcba_usb_start_xmit(): fix double dev_kfree_skb in error path
	can: mcba_usb: properly check endpoint type
	can: mcp251xfd: mcp251xfd_register_get_dev_id(): fix return of error value
	XArray: Update the LRU list in xas_split()
	modpost: restore the warning message for missing symbol versions
	rtc: check if __rtc_read_time was successful
	gfs2: gfs2_setattr_size error path fix
	gfs2: Make sure FITRIM minlen is rounded up to fs block size
	net: hns3: fix the concurrency between functions reading debugfs
	net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware
	rxrpc: fix some null-ptr-deref bugs in server_key.c
	rxrpc: Fix call timer start racing with call destruction
	mailbox: imx: fix wakeup failure from freeze mode
	crypto: arm/aes-neonbs-cbc - Select generic cbc and aes
	watch_queue: Free the page array when watch_queue is dismantled
	pinctrl: pinconf-generic: Print arguments for bias-pull-*
	watchdog: rti-wdt: Add missing pm_runtime_disable() in probe function
	net: sparx5: uses, depends on BRIDGE or !BRIDGE
	pinctrl: nuvoton: npcm7xx: Rename DS() macro to DSTR()
	pinctrl: nuvoton: npcm7xx: Use %zu printk format for ARRAY_SIZE()
	ASoC: mediatek: mt6358: add missing EXPORT_SYMBOLs
	ubi: Fix race condition between ctrl_cdev_ioctl and ubi_cdev_ioctl
	ARM: iop32x: offset IRQ numbers by 1
	block: Fix the maximum minor value is blk_alloc_ext_minor()
	io_uring: fix memory leak of uid in files registration
	riscv module: remove (NOLOAD)
	ACPI: CPPC: Avoid out of bounds access when parsing _CPC data
	vhost: handle error while adding split ranges to iotlb
	spi: Fix Tegra QSPI example
	platform/chrome: cros_ec_typec: Check for EC device
	can: isotp: restore accidentally removed MSG_PEEK feature
	proc: bootconfig: Add null pointer check
	drm/connector: Fix typo in documentation
	scsi: qla2xxx: Add qla2x00_async_done() for async routines
	staging: mt7621-dts: fix pinctrl-0 items to be size-1 items on ethernet
	arm64: mm: Drop 'const' from conditional arm64_dma_phys_limit definition
	ASoC: soc-compress: Change the check for codec_dai
	Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""
	tracing: Have type enum modifications copy the strings
	net: add skb_set_end_offset() helper
	net: preserve skb_end_offset() in skb_unclone_keeptruesize()
	mm/mmap: return 1 from stack_guard_gap __setup() handler
	ARM: 9187/1: JIVE: fix return value of __setup handler
	mm/memcontrol: return 1 from cgroup.memory __setup() handler
	mm/usercopy: return 1 from hardened_usercopy __setup() handler
	af_unix: Support POLLPRI for OOB.
	bpf: Adjust BPF stack helper functions to accommodate skip > 0
	bpf: Fix comment for helper bpf_current_task_under_cgroup()
	mmc: rtsx: Use pm_runtime_{get,put}() to handle runtime PM
	dt-bindings: mtd: nand-controller: Fix the reg property description
	dt-bindings: mtd: nand-controller: Fix a comment in the examples
	dt-bindings: spi: mxic: The interrupt property is not mandatory
	dt-bindings: memory: mtk-smi: No need mediatek,larb-id for mt8167
	dt-bindings: pinctrl: pinctrl-microchip-sgpio: Fix example
	ubi: fastmap: Return error code if memory allocation fails in add_aeb()
	ASoC: SOF: Intel: Fix build error without SND_SOC_SOF_PCI_DEV
	ASoC: topology: Allow TLV control to be either read or write
	perf vendor events: Update metrics for SkyLake Server
	media: ov6650: Add try support to selection API operations
	media: ov6650: Fix crop rectangle affected by set format
	spi: mediatek: support tick_delay without enhance_timing
	ARM: dts: spear1340: Update serial node properties
	ARM: dts: spear13xx: Update SPI dma properties
	arm64: dts: ls1043a: Update i2c dma properties
	arm64: dts: ls1046a: Update i2c node dma properties
	um: Fix uml_mconsole stop/go
	docs: sysctl/kernel: add missing bit to panic_print
	openvswitch: Fixed nd target mask field in the flow dump.
	torture: Make torture.sh help message match reality
	n64cart: convert bi_disk to bi_bdev->bd_disk fix build
	mmc: rtsx: Let MMC core handle runtime PM
	mmc: rtsx: Fix build errors/warnings for unused variable
	KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
	iommu/dma: Skip extra sync during unmap w/swiotlb
	iommu/dma: Fold _swiotlb helpers into callers
	iommu/dma: Check CONFIG_SWIOTLB more broadly
	swiotlb: Support aligned swiotlb buffers
	iommu/dma: Account for min_align_mask w/swiotlb
	coredump: Snapshot the vmas in do_coredump
	coredump: Remove the WARN_ON in dump_vma_snapshot
	coredump/elf: Pass coredump_params into fill_note_info
	coredump: Use the vma snapshot in fill_files_note
	PCI: xgene: Revert "PCI: xgene: Use inbound resources for setup"
	Linux 5.15.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id62bd8a22d0bfa7c2096539d253ffce804bed017
2022-04-20 08:18:54 +02:00
Peter Xu
f5e59185b0 mm: don't skip swap entry even if zap_details specified
commit 5abfd71d936a8aefd9f9ccd299dea7a164a5d455 upstream.

Patch series "mm: Rework zap ptes on swap entries", v5.

Patch 1 should fix a long standing bug for zap_pte_range() on
zap_details usage.  The risk is we could have some swap entries skipped
while we should have zapped them.

Migration entries are not the major concern because file backed memory
always zap in the pattern that "first time without page lock, then
re-zap with page lock" hence the 2nd zap will always make sure all
migration entries are already recovered.

However there can be issues with real swap entries got skipped
errornoously.  There's a reproducer provided in commit message of patch
1 for that.

Patch 2-4 are cleanups that are based on patch 1.  After the whole
patchset applied, we should have a very clean view of zap_pte_range().

Only patch 1 needs to be backported to stable if necessary.

This patch (of 4):

The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.

For example, when the callers specified details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COWed pages), then
we need to look into swap entries because there can be private COWed
pages that was swapped out.

Skipping some swap entries when details is non-NULL may lead to wrongly
leaving some of the swap entries while we should have zapped them.

A reproducer of the problem:

===8<===
        #define _GNU_SOURCE         /* See feature_test_macros(7) */
        #include <stdio.h>
        #include <assert.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/types.h>

        int page_size;
        int shmem_fd;
        char *buffer;

        void main(void)
        {
                int ret;
                char val;

                page_size = getpagesize();
                shmem_fd = memfd_create("test", 0);
                assert(shmem_fd >= 0);

                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, shmem_fd, 0);
                assert(buffer != MAP_FAILED);

                /* Write private page, swap it out */
                buffer[page_size] = 1;
                madvise(buffer, page_size * 2, MADV_PAGEOUT);

                /* This should drop private buffer[page_size] already */
                ret = ftruncate(shmem_fd, page_size);
                assert(ret == 0);
                /* Recover the size */
                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                /* Re-read the data, it should be all zero */
                val = buffer[page_size];
                if (val == 0)
                        printf("Good\n");
                else
                        printf("BUG\n");
        }
===8<===

We don't need to touch up the pmd path, because pmd never had a issue with
swap entries.  For example, shmem pmd migration will always be split into
pte level, and same to swapping on anonymous.

Add another helper should_zap_cows() so that we can also check whether we
should zap private mappings when there's no page pointer specified.

This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
we should do the same check upon migration entry, hwpoison entry and
genuine swap entries too.

To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.

The issue seems to exist starting from the initial commit of git.

[peterx@redhat.com: comment tweaks]
  Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com

Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13 20:59:27 +02:00
Rik van Riel
7d04d6d5c1 mm,hwpoison: unmap poisoned page before invalidation
commit 3149c79f3cb0e2e3bafb7cfadacec090cbd250d3 upstream.

In some cases it appears the invalidation of a hwpoisoned page fails
because the page is still mapped in another process.  This can cause a
program to be continuously restarted and die when it page faults on the
page that was not invalidated.  Avoid that problem by unmapping the
hwpoisoned page when we find it.

Another issue is that sometimes we end up oopsing in finish_fault, if
the code tries to do something with the now-NULL vmf->page.  I did not
hit this error when submitting the previous patch because there are
several opportunities for alloc_set_pte to bail out before accessing
vmf->page, and that apparently happened on those systems, and most of
the time on other systems, too.

However, across several million systems that error does occur a handful
of times a day.  It can be avoided by returning VM_FAULT_NOPAGE which
will cause do_read_fault to return before calling finish_fault.

Link: https://lkml.kernel.org/r/20220325161428.5068d97e@imladris.surriel.com
Fixes: e53ac7374e64 ("mm: invalidate hwpoison page cache page in fault path")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08 14:22:56 +02:00
Rik van Riel
3bae72c2db mm: invalidate hwpoison page cache page in fault path
commit e53ac7374e64dede04d745ff0e70ff5048378d1f upstream.

Sometimes the page offlining code can leave behind a hwpoisoned clean
page cache page.  This can lead to programs being killed over and over
and over again as they fault in the hwpoisoned page, get killed, and
then get re-spawned by whatever wanted to run them.

This is particularly embarrassing when the page was offlined due to
having too many corrected memory errors.  Now we are killing tasks due
to them trying to access memory that probably isn't even corrupted.

This problem can be avoided by invalidating the page from the page fault
handler, which already has a branch for dealing with these kinds of
pages.  With this patch we simply pretend the page fault was successful
if the page was invalidated, return to userspace, incur another page
fault, read in the file from disk (to a new memory page), and then
everything works again.

Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08 14:22:54 +02:00
Vijayanand Jitta
385b0dd1f9 ANDROID: mm: Fix page table lookup in speculative fault path
In speculative fault path, while doing page table lookup, offset
is obtained at each level and value at that offset is read and
checks are perfomed on it, later to get next level offset we read
from previous level offset again. A concurrent page table reclaimation
operation could result in change in value at this offset, and we go
ahead and access it, this would result in reading an invalid entry.
Fix this by reading from previous level offset again and comparing
before performing next level access.

Bug: 221005439
Change-Id: I66b3d24ae79c7ee5ccce4ba7a94f028f4cf3fda0
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
2022-03-23 11:32:20 -07:00
Michel Lespinasse
7d6787088d BACKPORT: FROMLIST: mm: enable speculative fault handling for supported file types.
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.

From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().

At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
  (would require ->page_mkwrite() changes).

vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/

Conflicts:
    include/linux/vm_event_item.h
    mm/vmstat.c

1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/
since the patch posted upstream at the time had a different structure
with stats for anonymouse and file-backed pagefaults introduced in a
separate patch.

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a
2022-03-23 11:32:19 -07:00
Michel Lespinasse
7045d2d838 FROMLIST: mm: implement speculative handling in do_fault_around()
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.

Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and and this saves us from having to handle
synchronization around the vmf->pmd read.

Change xfs_filemap_map_pages() account for the fact that it can not
block anymore, as it is now running within an rcu read lock.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id771c1e6fa9b883595a48d4df63f448a05916eda
2022-03-23 11:32:19 -07:00
Michel Lespinasse
6877640598 BACKPORT: FROMLIST: mm: implement speculative fault handling in finish_fault()
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-27-michel@lespinasse.org/

Conflicts:
    mm/memory.c

1. Merge conflict due to new vmf->prealloc_pte usage in finish_fault.

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If6046592083eaf12caf5c51c3fbb287a4dfa1ace
2022-03-23 11:32:18 -07:00
Michel Lespinasse
b12e52ca98 FROMLIST: mm: implement speculative handling in __do_fault()
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.

Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.

The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-25-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I995ba94d8e96014ef83ac93fe5a4669afcde34b9
2022-03-23 11:32:18 -07:00
Michel Lespinasse
9b92402808 FROMLIST: mm: anon spf statistics
Add a new CONFIG_SPECULATIVE_PAGE_FAULT_STATS config option,
and dump extra statistics about executed spf cases and abort reasons
when the option is set.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-32-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia53cd88e4a7140aeb26bf8f3869e1fc5270012da
2022-03-23 11:32:17 -07:00
Michel Lespinasse
aa9ae5c915 FROMLIST: mm: implement and enable speculative fault handling in handle_pte_fault()
In handle_pte_fault(), allow speculative execution to proceed.

Use pte_spinlock() to validate the mmap sequence count when locking
the page table.

If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia45d095ec7b8e23f1c5d68b7a7f572a3f6f6df97
2022-03-23 11:32:17 -07:00
Michel Lespinasse
40bc9ed389 FROMLIST: mm: implement speculative handling in wp_page_copy()
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.

Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-27-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Icd2188e9facf5a7fea42000a2808bcda1ad6f0fc
2022-03-23 11:32:16 -07:00
Michel Lespinasse
009020e3d1 FROMLIST: mm: enable speculative fault handling in do_numa_page()
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().

do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-22-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0390331facc9ecd37534012abdd9f255ab5bbb12
2022-03-23 11:32:16 -07:00
Michel Lespinasse
fedc4d513e FROMLIST: mm: implement speculative handling in do_numa_page()
change do_numa_page() to use pte_spinlock() when locking the page table,
so that the mmap sequence counter will be validated in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-21-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If252547faf2a8a6cbba4c0a7ff929071a5f6a657
2022-03-23 11:32:15 -07:00
Michel Lespinasse
c2b2abe724 FROMLIST: mm: enable speculative fault handling through do_anonymous_page()
in x86 fault handler, only attempt spf if the vma is anonymous.

In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().

In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-20-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I875106fcfa1084f570c2bf8f24a129bdce55316b
2022-03-23 11:32:15 -07:00
Michel Lespinasse
31cf1fd564 FROMLIST: mm: implement speculative handling in do_anonymous_page()
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-19-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5ad955323faabc142c21f62415db039ac889066a
2022-03-23 11:32:15 -07:00
Michel Lespinasse
6e6766ab76 BACKPORT: FROMLIST: mm: add pte_map_lock() and pte_spinlock()
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/

Conflicts:
    include/linux/mm.h

1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n

Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
2022-03-23 11:32:15 -07:00
Michel Lespinasse
6ab660d7cb FROMLIST: mm: implement speculative handling in __handle_mm_fault().
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wiring new pages tables.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
2022-03-23 11:32:15 -07:00
Michel Lespinasse
f3f9f17a32 FROMLIST: mm: refactor __handle_mm_fault() / handle_pte_fault()
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().

This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-16-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id550086fe568331aa71c91468f8314faad993b20
2022-03-23 11:32:15 -07:00
Michel Lespinasse
f8a4611b47 FROMLIST: mm: add speculative_page_walk_begin() and speculative_page_walk_end()
Speculative page faults will use these to protect against races with
page table reclamation.

This could always be handled by disabling local IRQs as the fast GUP
code does; however speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-15-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3efe5fc6a5a49d537cf33e8093daeea42550077a
2022-03-23 11:32:14 -07:00
Michel Lespinasse
4e2e391ff7 BACKPORT: FROMLIST: mm: add do_handle_mm_fault()
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.

In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).

The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    mm/memory.c

1. Trivial merge conflict due to folios.

Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
2022-03-23 11:32:14 -07:00
Michel Lespinasse
57f3bb2b12 BACKPORT: FROMLIST: do_anonymous_page: reduce code duplication
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.

This change reduces the code duplication between the two cases.

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>

Conflicts:
    mm/memory.c

1. Trivial merge conflict caused by folios in mem_cgroup_charge call.

Link: https://lore.kernel.org/all/20220128131006.67712-6-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic19579571925878d632e43aa40b9f50cdf473ee6
2022-03-23 11:32:13 -07:00
Michel Lespinasse
82ab55ebcc FROMLIST: do_anonymous_page: use update_mmu_tlb()
update_mmu_tlb() can be used instead of update_mmu_cache() when the
page fault handler detects that it lost the race to another page fault.

It looks like this one call was missed in
https://patchwork.kernel.org/project/linux-mips/patch/1590375160-6997-2-git-send-email-maobibo@loongson.cn
after Andrew asked to replace all update_mmu_cache() calls with an alias
in the previous version of this patch here:
https://patchwork.kernel.org/project/linux-mips/patch/1590031837-9582-2-git-send-email-maobibo@loongson.cn/#23374625

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-5-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iaad4d3c27e12c2d9bf68d1140709788fc8dead24
2022-03-23 11:32:13 -07:00
Kalesh Singh
1a77f04aac Revert "ANDROID: mm: Throttle rss_stat tracepoint"
This reverts commit 77dfeaa02d.

Throttling can now be done using hist triggers and synthetic events

Bug: 145972256
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I39c284040e2fdb815cda980f3a40ef188e59287c
2021-11-17 23:30:23 +00:00
Greg Kroah-Hartman
d1a66e7942 Merge tag 'v5.15' into android-mainline
Linux 5.15

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I81d96ada6b66bcf73d192ddde9018f1804e7d90b
2021-11-01 07:40:54 +01:00
Yang Shi
eac96c3efd mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
When handling shmem page fault the THP with corrupted subpage could be
PMD mapped if certain conditions are satisfied.  But kernel is supposed
to send SIGBUS when trying to map hwpoisoned page.

There are two paths which may do PMD map: fault around and regular
fault.

Before commit f9ce0be71d ("mm: Cleanup faultaround and finish_fault()
codepaths") the thing was even worse in fault around path.  The THP
could be PMD mapped as long as the VMA fits regardless what subpage is
accessed and corrupted.  After this commit as long as head page is not
corrupted the THP could be PMD mapped.

In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.

This loophole could be fixed by iterating every subpage to check if any
of them is hwpoisoned or not, but it is somewhat costly in page fault
path.

So introduce a new page flag called HasHWPoisoned on the first tail
page.  It indicates the THP has hwpoisoned subpage(s).  It is set if any
subpage of THP is found hwpoisoned by memory failure and after the
refcount is bumped successfully, then cleared when the THP is freed or
split.

The soft offline path doesn't need this since soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.
But shmem THP didn't get split then migrated at all.

Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28 17:18:55 -07:00
Greg Kroah-Hartman
2d03f6ec4b Merge 92477dd1fa ("Merge tag 's390-5.15-ebpf-jit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux") into android-mainline
Steps on the way to 5.15-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia35478d4c4a085bff847b8cd484bd3290f9529b8
2021-09-23 09:26:41 +02:00
David Howells
6e0e99d58a afs: Fix mmap coherency vs 3rd-party changes
Fix the coherency management of mmap'd data such that 3rd-party changes
become visible as soon as possible after the callback notification is
delivered by the fileserver.  This is done by the following means:

 (1) When we break a callback on a vnode specified by the CB.CallBack call
     from the server, we queue a work item (vnode->cb_work) to go and
     clobber all the PTEs mapping to that inode.

     This causes the CPU to trip through the ->map_pages() and
     ->page_mkwrite() handlers if userspace attempts to access the page(s)
     again.

     (Ideally, this would be done in the service handler for CB.CallBack,
     but the server is waiting for our reply before considering, and we
     have a list of vnodes, all of which need breaking - and the process of
     getting the mmap_lock and stripping the PTEs on all CPUs could be
     quite slow.)

 (2) Call afs_validate() from the ->map_pages() handler to check to see if
     the file has changed and to get a new callback promise from the
     server.

Also handle the fileserver telling us that it's dropping all callbacks,
possibly after it's been restarted by sending us a CB.InitCallBackState*
call by the following means:

 (3) Maintain a per-cell list of afs files that are currently mmap'd
     (cell->fs_open_mmaps).

 (4) Add a work item to each server that is invoked if there are any open
     mmaps when CB.InitCallBackState happens.  This work item goes through
     the aforementioned list and invokes the vnode->cb_work work item for
     each one that is currently using this server.

     This causes the PTEs to be cleared, causing ->map_pages() or
     ->page_mkwrite() to be called again, thereby calling afs_validate()
     again.

I've chosen to simply strip the PTEs at the point of notification reception
rather than invalidate all the pages as well because (a) it's faster, (b)
we may get a notification for other reasons than the data being altered (in
which case we don't want to clobber the pagecache) and (c) we need to ask
the server to find out - and I don't want to wait for the reply before
holding up userspace.

This was tested using the attached test program:

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/mman.h>
	int main(int argc, char *argv[])
	{
		size_t size = getpagesize();
		unsigned char *p;
		bool mod = (argc == 3);
		int fd;
		if (argc != 2 && argc != 3) {
			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
			exit(2);
		}
		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
		if (fd < 0) {
			perror(argv[1]);
			exit(1);
		}

		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (;;) {
			if (mod) {
				p[0]++;
				msync(p, size, MS_ASYNC);
				fsync(fd);
			}
			printf("%02x", p[0]);
			fflush(stdout);
			sleep(1);
		}
	}

It runs in two modes: in one mode, it mmaps a file, then sits in a loop
reading the first byte, printing it and sleeping for a second; in the
second mode it mmaps a file, then sits in a loop incrementing the first
byte and flushing, then printing and sleeping.

Two instances of this program can be run on different machines, one doing
the reading and one doing the writing.  The reader should see the changes
made by the writer, but without this patch, they aren't because validity
checking is being done lazily - only on entry to the filesystem.

Testing the InitCallBackState change is more complicated.  The server has
to be taken offline, the saved callback state file removed and then the
server restarted whilst the reading-mode program continues to run.  The
client machine then has to poke the server to trigger the InitCallBackState
call.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
2021-09-13 09:10:39 +01:00