28f0c67d40e16c8b5dfb3dac7967c87c563918b9
1226 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
28f0c67d40 |
Merge 5.15.44 into android14-5.15
Changes in 5.15.44
HID: amd_sfh: Add support for sensor discovery
KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
ice: fix crash at allocation failure
ACPI: sysfs: Fix BERT error region memory mapping
MAINTAINERS: co-maintain random.c
MAINTAINERS: add git tree for random.c
lib/crypto: blake2s: include as built-in
lib/crypto: blake2s: move hmac construction into wireguard
lib/crypto: sha1: re-roll loops to reduce code size
lib/crypto: blake2s: avoid indirect calls to compression function for Clang CFI
random: document add_hwgenerator_randomness() with other input functions
random: remove unused irq_flags argument from add_interrupt_randomness()
random: use BLAKE2s instead of SHA1 in extraction
random: do not sign extend bytes for rotation when mixing
random: do not re-init if crng_reseed completes before primary init
random: mix bootloader randomness into pool
random: harmonize "crng init done" messages
random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
random: early initialization of ChaCha constants
random: avoid superfluous call to RDRAND in CRNG extraction
random: don't reset crng_init_cnt on urandom_read()
random: fix typo in comments
random: cleanup poolinfo abstraction
random: cleanup integer types
random: remove incomplete last_data logic
random: remove unused extract_entropy() reserved argument
random: rather than entropy_store abstraction, use global
random: remove unused OUTPUT_POOL constants
random: de-duplicate INPUT_POOL constants
random: prepend remaining pool constants with POOL_
random: cleanup fractional entropy shift constants
random: access input_pool_data directly rather than through pointer
random: selectively clang-format where it makes sense
random: simplify arithmetic function flow in account()
random: continually use hwgenerator randomness
random: access primary_pool directly rather than through pointer
random: only call crng_finalize_init() for primary_crng
random: use computational hash for entropy extraction
random: simplify entropy debiting
random: use linear min-entropy accumulation crediting
random: always wake up entropy writers after extraction
random: make credit_entropy_bits() always safe
random: remove use_input_pool parameter from crng_reseed()
random: remove batched entropy locking
random: fix locking in crng_fast_load()
random: use RDSEED instead of RDRAND in entropy extraction
random: get rid of secondary crngs
random: inline leaves of rand_initialize()
random: ensure early RDSEED goes through mixer on init
random: do not xor RDRAND when writing into /dev/random
random: absorb fast pool into input pool after fast load
random: use simpler fast key erasure flow on per-cpu keys
random: use hash function for crng_slow_load()
random: make more consistent use of integer types
random: remove outdated INT_MAX >> 6 check in urandom_read()
random: zero buffer after reading entropy from userspace
random: fix locking for crng_init in crng_reseed()
random: tie batched entropy generation to base_crng generation
random: remove ifdef'd out interrupt bench
random: remove unused tracepoints
random: add proper SPDX header
random: deobfuscate irq u32/u64 contributions
random: introduce drain_entropy() helper to declutter crng_reseed()
random: remove useless header comment
random: remove whitespace and reorder includes
random: group initialization wait functions
random: group crng functions
random: group entropy extraction functions
random: group entropy collection functions
random: group userspace read/write functions
random: group sysctl functions
random: rewrite header introductory comment
random: defer fast pool mixing to worker
random: do not take pool spinlock at boot
random: unify early init crng load accounting
random: check for crng_init == 0 in add_device_randomness()
random: pull add_hwgenerator_randomness() declaration into random.h
random: clear fast pool, crng, and batches in cpuhp bring up
random: round-robin registers as ulong, not u32
random: only wake up writers after zap if threshold was passed
random: cleanup UUID handling
random: unify cycles_t and jiffies usage and types
random: do crng pre-init loading in worker rather than irq
random: give sysctl_random_min_urandom_seed a more sensible value
random: don't let 644 read-only sysctls be written to
random: replace custom notifier chain with standard one
random: use SipHash as interrupt entropy accumulator
random: make consistent usage of crng_ready()
random: reseed more often immediately after booting
random: check for signal and try earlier when generating entropy
random: skip fast_init if hwrng provides large chunk of entropy
random: treat bootloader trust toggle the same way as cpu trust toggle
random: re-add removed comment about get_random_{u32,u64} reseeding
random: mix build-time latent entropy into pool at init
random: do not split fast init input in add_hwgenerator_randomness()
random: do not allow user to keep crng key around on stack
random: check for signal_pending() outside of need_resched() check
random: check for signals every PAGE_SIZE chunk of /dev/[u]random
random: allow partial reads if later user copies fail
random: make random_get_entropy() return an unsigned long
random: document crng_fast_key_erasure() destination possibility
random: fix sysctl documentation nits
init: call time_init() before rand_initialize()
ia64: define get_cycles macro for arch-override
s390: define get_cycles macro for arch-override
parisc: define get_cycles macro for arch-override
alpha: define get_cycles macro for arch-override
powerpc: define get_cycles macro for arch-override
timekeeping: Add raw clock fallback for random_get_entropy()
m68k: use fallback for random_get_entropy() instead of zero
riscv: use fallback for random_get_entropy() instead of zero
mips: use fallback for random_get_entropy() instead of just c0 random
arm: use fallback for random_get_entropy() instead of zero
nios2: use fallback for random_get_entropy() instead of zero
x86/tsc: Use fallback for random_get_entropy() instead of zero
um: use fallback for random_get_entropy() instead of zero
sparc: use fallback for random_get_entropy() instead of zero
xtensa: use fallback for random_get_entropy() instead of zero
random: insist on random_get_entropy() existing in order to simplify
random: do not use batches when !crng_ready()
random: use first 128 bits of input as fast init
random: do not pretend to handle premature next security model
random: order timer entropy functions below interrupt functions
random: do not use input pool from hard IRQs
random: help compiler out with fast_mix() by using simpler arguments
siphash: use one source of truth for siphash permutations
random: use symbolic constants for crng_init states
random: avoid initializing twice in credit race
random: move initialization out of reseeding hot path
random: remove ratelimiting for in-kernel unseeded randomness
random: use proper jiffies comparison macro
random: handle latent entropy and command line from random_init()
random: credit architectural init the exact amount
random: use static branch for crng_ready()
random: remove extern from functions in header
random: use proper return types on get_random_{int,long}_wait()
random: make consistent use of buf and len
random: move initialization functions out of hot pages
random: move randomize_page() into mm where it belongs
random: unify batched entropy implementations
random: convert to using fops->read_iter()
random: convert to using fops->write_iter()
random: wire up fops->splice_{read,write}_iter()
random: check for signals after page of pool writes
ALSA: ctxfi: Add SB046x PCI ID
Linux 5.15.44
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2d874cba14f13379fc5f874e72634c9179b28742
|
||
|
|
a0d26c51d7 |
Merge 5.15.37 into android14-5.15
Changes in 5.15.37
floppy: disable FDRAWCMD by default
bpf: Introduce composable reg, ret and arg types.
bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL
bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL
bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL
bpf: Introduce MEM_RDONLY flag
bpf: Convert PTR_TO_MEM_OR_NULL to composable types.
bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.
bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
bpf/selftests: Test PTR_TO_RDONLY_MEM
bpf: Fix crash due to out of bounds access into reg2btf_ids.
spi: cadence-quadspi: fix write completion support
ARM: dts: socfpga: change qspi to "intel,socfpga-qspi"
mm: kfence: fix objcgs vector allocation
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
iov_iter: Introduce fault_in_iov_iter_writeable
gfs2: Add wrapper for iomap_file_buffered_write
gfs2: Clean up function may_grant
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Eliminate ip->i_gh
gfs2: Fix mmap + page fault deadlocks for buffered I/O
iomap: Fix iomap_dio_rw return value for user copies
iomap: Support partial direct I/O on user copy failures
iomap: Add done_before argument to iomap_dio_rw
gup: Introduce FOLL_NOFAULT flag to disable page faults
iov_iter: Introduce nofault flag to disable page faults
gfs2: Fix mmap + page fault deadlocks for direct I/O
btrfs: fix deadlock due to page faults during direct IO reads and writes
btrfs: fallback to blocking mode when doing async dio over multiple extents
mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
selftests/bpf: Add test for reg2btf_ids out of bounds access
Linux 5.15.37
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I785543e252f972c5a86f313e4b6721e2ff0797e6
|
||
|
|
64cb7f01dd |
random: move randomize_page() into mm where it belongs
commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream. randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct. So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function. This commit contains no actual code changes. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
6e213bc614 |
gup: Introduce FOLL_NOFAULT flag to disable page faults
commit 55b8fe703bc51200d4698596c90813453b35ae63 upstream Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return -EFAULT when it would otherwise trigger a page fault. This is roughly similar to FOLL_FAST_ONLY but available on all architectures, and less fragile. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
f88ed5a3d3 |
FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lore.kernel.org/lkml/20220309021230.721028-6-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512 |
||
|
|
eabd925e61 |
ANDROID: mm: shmem: use reclaim_pages() to recalim pages from a list
Static code analysis tool reported NULL pointer access in shrink_page_list() as the commit |
||
|
|
0fd37220d8 |
UPSTREAM: mm: refactor vm_area_struct::anon_vma_name usage code
Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Colin Cross <ccross@google.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Xiaofeng Cao <caoxiaofeng@yulong.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 5c26f6ac9416b63d093e29c30e79b3297e425472) Bug: 218352794 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I4a6b5602ce7151d1a4b88fac489f86d68089bd4d |
||
|
|
96f80f6284 |
ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims
Add the functionality that allow users of shmem to reclaim its pages without going through the kswapd/direct reclaim path. An example usecase is: Say that device allocates a larger amount of shmem pages and shares it with hardware. To faster reclaims such pages, drivers can register the shrinkers and call reclaim_shmem_address_space(). The implementation of this function is mostly borrowed from reclaim_address_space() implemented for per process reclaim[1]. [1] https://lore.kernel.org/patchwork/cover/378056/ Bug: 201263305 Change-Id: I03d2c3b9610612af977f89ddeabb63b8e9e50918 Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com> |
||
|
|
7d6787088d |
BACKPORT: FROMLIST: mm: enable speculative fault handling for supported file types.
Introduce vma_can_speculate(), which allows speculative handling for VMAs mapping supported file types. From do_handle_mm_fault(), speculative handling will follow through __handle_mm_fault(), handle_pte_fault() and do_fault(). At this point, we expect speculative faults to continue through one of: - do_read_fault(), fully implemented; - do_cow_fault(), which might abort if missing anon vmas, - do_shared_fault(), not implemented yet (would require ->page_mkwrite() changes). vma_can_speculate() provides an early abort for the do_shared_fault() case, limiting the time spent on trying that unimplemented case. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20210407014502.24091-31-michel@lespinasse.org/ Conflicts: include/linux/vm_event_item.h mm/vmstat.c 1. SPF_ATTEMPT_FILE is taken from https://lore.kernel.org/all/20210407014502.24091-36-michel@lespinasse.org/ since the patch posted upstream at the time had a different structure with stats for anonymouse and file-backed pagefaults introduced in a separate patch. Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I3a28af63b41b649f02f8b73d53f6494ad114ee5a |
||
|
|
a2138fee6c |
FROMLIST: fs: list file types that support speculative faults.
Add a speculative field to the vm_operations_struct, which indicates if the associated file type supports speculative faults. Initially this is set for files that implement fault() with filemap_fault(). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69 |
||
|
|
6e6766ab76 |
BACKPORT: FROMLIST: mm: add pte_map_lock() and pte_spinlock()
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure the pte is mapped and locked before they commit the faulted page to the mm's address space at the end of the fault. The functions differ in their preconditions; pte_map_lock() expects the pte to be unmapped prior to the call, while pte_spinlock() expects it to be already mapped. In the speculative fault case, the functions verify, after locking the pte, that the mmap sequence count has not changed since the start of the fault, and thus that no mmap lock writers have been running concurrently with the fault. After that point the page table lock serializes any further races with concurrent mmap lock writers. If the mmap sequence count check fails, both functions will return false with the pte being left unmapped and unlocked. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/ Conflicts: include/linux/mm.h 1. Fixed pte_map_lock and pte_spinlock macros not to fail when CONFIG_SPECULATIVE_PAGE_FAULT=n Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836 |
||
|
|
6ab660d7cb |
FROMLIST: mm: implement speculative handling in __handle_mm_fault().
The speculative path calls speculative_page_walk_begin() before walking the page table tree to prevent page table reclamation. The logic is otherwise similar to the non-speculative path, but with additional restrictions: in the speculative path, we do not handle huge pages or wiring new pages tables. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a |
||
|
|
0823d516af |
FROMLIST: mm: separate mmap locked assertion from find_vma
This adds a new __find_vma() function, which implements find_vma minus the mmap_assert_locked() assertion. find_vma() is then implemented as an inline wrapper around __find_vma(). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Link: https://lore.kernel.org/all/20220128131006.67712-13-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ia999b8cb8f5eed93040ab4b3caaf90d739da908d |
||
|
|
4e2e391ff7 |
BACKPORT: FROMLIST: mm: add do_handle_mm_fault()
Add a new do_handle_mm_fault function, which extends the existing handle_mm_fault() API by adding an mmap sequence count, to be used in the FAULT_FLAG_SPECULATIVE case. In the initial implementation, FAULT_FLAG_SPECULATIVE always fails (by returning VM_FAULT_RETRY). The existing handle_mm_fault() API is kept as a wrapper around do_handle_mm_fault() so that we do not have to immediately update every handle_mm_fault() call site. Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Conflicts: mm/memory.c 1. Trivial merge conflict due to folios. Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88 |
||
|
|
f2fa9aae2e |
BACKPORT: FROMLIST: mm: add FAULT_FLAG_SPECULATIVE flag
Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are attempting speculative fault handling (without holding the mmap lock). Signed-off-by: Michel Lespinasse <michel@lespinasse.org> Conflicts: include/linux/mm_types.h 1. Merge conflict due to enum fault_flag being defined in mm.h instead of mm_types.h Link: https://lore.kernel.org/all/20220128131006.67712-9-michel@lespinasse.org/ Bug: 161210518 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I48ab427dfa4d7bdbe9932588bec7ae99e9e80ae9 |
||
|
|
8222792e8e |
Merge 5.15.19 into android13-5.15
Changes in 5.15.19
can: m_can: m_can_fifo_{read,write}: don't read or write from/to FIFO if length is 0
net: sfp: ignore disabled SFP node
net: stmmac: configure PTP clock source prior to PTP initialization
net: stmmac: skip only stmmac_ptp_register when resume from suspend
ARM: 9179/1: uaccess: avoid alignment faults in copy_[from|to]_kernel_nofault
ARM: 9180/1: Thumb2: align ALT_UP() sections in modules sufficiently
KVM: arm64: Use shadow SPSR_EL1 when injecting exceptions on !VHE
s390/module: fix loading modules with a lot of relocations
s390/hypfs: include z/VM guests with access control group set
s390/nmi: handle guarded storage validity failures for KVM guests
s390/nmi: handle vector validity failures for KVM guests
bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()
powerpc32/bpf: Fix codegen for bpf-to-bpf calls
powerpc/bpf: Update ldimm64 instructions during extra pass
ucount: Make get_ucount a safe get_user replacement
scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
udf: Restore i_lenAlloc when inode expansion fails
udf: Fix NULL ptr deref when converting from inline format
efi: runtime: avoid EFIv2 runtime services on Apple x86 machines
PM: wakeup: simplify the output logic of pm_show_wakelocks()
tracing/histogram: Fix a potential memory leak for kstrdup()
tracing: Don't inc err_log entry count if entry allocation fails
ceph: properly put ceph_string reference after async create attempt
ceph: set pool_ns in new inode layout for async creates
fsnotify: fix fsnotify hooks in pseudo filesystems
Revert "KVM: SVM: avoid infinite loop on NPF from bad address"
psi: Fix uaf issue when psi trigger is destroyed while being polled
powerpc/audit: Fix syscall_get_arch()
perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX
perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake
drm/etnaviv: relax submit size limits
drm/atomic: Add the crtc to affected crtc only if uapi.enable = true
drm/amd/display: Fix FP start/end for dcn30_internal_validate_bw.
KVM: LAPIC: Also cancel preemption timer during SET_LAPIC
KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests
KVM: SVM: Don't intercept #GP for SEV guests
KVM: x86: nSVM: skip eax alignment check for non-SVM instructions
KVM: x86: Forcibly leave nested virt when SMM state is toggled
KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
KVM: PPC: Book3S HV Nested: Fix nested HFSCR being clobbered with multiple vCPUs
dm: revert partial fix for redundant bio-based IO accounting
block: add bio_start_io_acct_time() to control start_time
dm: properly fix redundant bio-based IO accounting
serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl
serial: 8250: of: Fix mapped region size when using reg-offset property
serial: stm32: fix software flow control transfer
tty: n_gsm: fix SW flow control encoding/handling
tty: Partially revert the removal of the Cyclades public API
tty: Add support for Brainboxes UC cards.
kbuild: remove include/linux/cyclades.h from header file check
usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
usb: xhci-plat: fix crash when suspend if remote wake enable
usb: common: ulpi: Fix crash in ulpi_match()
usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
usb: cdnsp: Fix segmentation fault in cdns_lost_power function
usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode
usb: dwc3: xilinx: Fix error handling when getting USB3 PHY
USB: core: Fix hang in usb_kill_urb by adding memory barriers
usb: typec: tcpci: don't touch CC line if it's Vconn source
usb: typec: tcpm: Do not disconnect while receiving VBUS off
usb: typec: tcpm: Do not disconnect when receiving VSAFE0V
ucsi_ccg: Check DEV_INT bit only when starting CCG4
mm, kasan: use compare-exchange operation to set KASAN page tag
jbd2: export jbd2_journal_[grab|put]_journal_head
ocfs2: fix a deadlock when commit trans
sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
PCI/sysfs: Find shadow ROM before static attribute initialization
x86/MCE/AMD: Allow thresholding interface updates after init
x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
powerpc/32s: Allocate one 256k IBAT instead of two consecutives 128k IBATs
powerpc/32s: Fix kasan_init_region() for KASAN
powerpc/32: Fix boot failure with GCC latent entropy plugin
i40e: Increase delay to 1 s after global EMP reset
i40e: Fix issue when maximum queues is exceeded
i40e: Fix queues reservation for XDP
i40e: Fix for failed to init adminq while VF reset
i40e: fix unsigned stat widths
usb: roles: fix include/linux/usb/role.h compile issue
rpmsg: char: Fix race between the release of rpmsg_ctrldev and cdev
rpmsg: char: Fix race between the release of rpmsg_eptdev and cdev
scsi: elx: efct: Don't use GFP_KERNEL under spin lock
scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
ipv6_tunnel: Rate limit warning messages
ARM: 9170/1: fix panic when kasan and kprobe are enabled
net: fix information leakage in /proc/net/ptype
hwmon: (lm90) Mark alert as broken for MAX6646/6647/6649
hwmon: (lm90) Mark alert as broken for MAX6680
ping: fix the sk_bound_dev_if match in ping_lookup
ipv4: avoid using shared IP generator for connected sockets
hwmon: (lm90) Reduce maximum conversion rate for G781
NFSv4: Handle case where the lookup of a directory fails
NFSv4: nfs_atomic_open() can race when looking up a non-regular file
net-procfs: show net devices bound packet types
drm/msm: Fix wrong size calculation
drm/msm/dsi: Fix missing put_device() call in dsi_get_phy
drm/msm/dsi: invalid parameter check in msm_dsi_phy_enable
ipv6: annotate accesses to fn->fn_sernum
NFS: Ensure the server has an up to date ctime before hardlinking
NFS: Ensure the server has an up to date ctime before renaming
KVM: arm64: pkvm: Use the mm_ops indirection for cache maintenance
SUNRPC: Use BIT() macro in rpc_show_xprt_state()
SUNRPC: Don't dereference xprt->snd_task if it's a cookie
powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
netfilter: conntrack: don't increment invalid counter on NF_REPEAT
powerpc/64s: Mask SRR0 before checking against the masked NIP
perf: Fix perf_event_read_local() time
sched/pelt: Relax the sync of util_sum with util_avg
net: phy: broadcom: hook up soft_reset for BCM54616S
net: stmmac: dwmac-visconti: Fix bit definitions for ETHER_CLK_SEL
net: stmmac: dwmac-visconti: Fix clock configuration for RMII mode
phylib: fix potential use-after-free
octeontx2-af: Do not fixup all VF action entries
octeontx2-af: Fix LBK backpressure id count
octeontx2-af: Retry until RVU block reset complete
octeontx2-pf: cn10k: Ensure valid pointers are freed to aura
octeontx2-af: verify CQ context updates
octeontx2-af: Increase link credit restore polling timeout
octeontx2-af: cn10k: Do not enable RPM loopback for LPC interfaces
octeontx2-pf: Forward error codes to VF
rxrpc: Adjust retransmission backoff
efi/libstub: arm64: Fix image check alignment at entry
io_uring: fix bug in slow unregistering of nodes
Drivers: hv: balloon: account for vmbus packet header in max_pkt_size
hwmon: (lm90) Re-enable interrupts after alert clears
hwmon: (lm90) Mark alert as broken for MAX6654
hwmon: (lm90) Fix sysfs and udev notifications
hwmon: (adt7470) Prevent divide by zero in adt7470_fan_write()
powerpc/perf: Fix power_pmu_disable to call clear_pmi_irq_pending only if PMI is pending
ipv4: fix ip option filtering for locally generated fragments
ibmvnic: Allow extra failures before disabling
ibmvnic: init ->running_cap_crqs early
ibmvnic: don't spin in tasklet
net/smc: Transitional solution for clcsock race issue
video: hyperv_fb: Fix validation of screen resolution
can: tcan4x5x: regmap: fix max register value
drm/msm/hdmi: Fix missing put_device() call in msm_hdmi_get_phy
drm/msm/dpu: invalid parameter check in dpu_setup_dspp_pcc
drm/msm/a6xx: Add missing suspend_count increment
yam: fix a memory leak in yam_siocdevprivate()
net: cpsw: Properly initialise struct page_pool_params
net: hns3: handle empty unknown interrupt for VF
sch_htb: Fail on unsupported parameters when offload is requested
Revert "drm/ast: Support 1600x900 with 108MHz PCLK"
KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
ceph: put the requests/sessions when it fails to alloc memory
gve: Fix GFP flags when allocing pages
Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"
net: bridge: vlan: fix single net device option dumping
ipv4: raw: lock the socket in raw_bind()
ipv4: tcp: send zero IPID in SYNACK messages
ipv4: remove sparse error in ip_neigh_gw4()
net: bridge: vlan: fix memory leak in __allowed_ingress
Bluetooth: refactor malicious adv data check
irqchip/realtek-rtl: Map control data to virq
irqchip/realtek-rtl: Fix off-by-one in routing
dt-bindings: can: tcan4x5x: fix mram-cfg RX FIFO config
perf/core: Fix cgroup event list management
psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n
psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
usb: dwc3: xilinx: fix uninitialized return value
usr/include/Makefile: add linux/nfc.h to the compile-test coverage
fsnotify: invalidate dcache before IN_DELETE event
block: Fix wrong offset in bio_truncate()
mtd: rawnand: mpc5121: Remove unused variable in ads5121_select_chip()
Linux 5.15.19
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I66399d45af362fa8e1672ba38c0d672e21afc716
|
||
|
|
4ca8a0bc83 |
mm, kasan: use compare-exchange operation to set KASAN page tag
commit 27fe73394a1c6d0b07fa4d95f1bca116d1cc66e9 upstream.
It has been reported that the tag setting operation on newly-allocated
pages can cause the page flags to be corrupted when performed
concurrently with other flag updates as a result of the use of
non-atomic operations.
Fix the problem by using a compare-exchange loop to update the tag.
Link: https://lkml.kernel.org/r/20220120020148.1632253-1-pcc@google.com
Link: https://linux-review.googlesource.com/id/I456b24a2b9067d93968d43b4bb3351c0cec63101
Fixes:
|
||
|
|
301c56064d |
UPSTREAM: mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com Signed-off-by: Colin Cross <ccross@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Glauber <jan.glauber@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Landley <rob@landley.net> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Shaohua Li <shli@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b) Bug: 120441514 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I53d56d551a7d62f75341304751814294b447c04e |
||
|
|
f355f9635d |
Revert "ANDROID: mm: add a field to store names for private anonymous memory"
This reverts commit
|
||
|
|
1a77f04aac |
Revert "ANDROID: mm: Throttle rss_stat tracepoint"
This reverts commit
|
||
|
|
5606699789 |
Merge 996fe06160 ("Merge tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
Steps on the way to 5.15-rc1 Change-Id: I3806b714a5a783a7132b1daf766ebb71985fc640 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
c5cd945b24 |
Merge fd47ff55c9 ("Merge tag 'usb-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb") into android-mainline
Steps on the way to 5.15-rc1. Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I42ffa8818bbb2072f043923553c4d8f91d9647a5 |
||
|
|
c2b303f98f |
Merge 4e71add028 ("Merge branch 'stable/for-linus-5.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft") into android-mainline
Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ib3f181326491eb896547d802a6f0a1b3be54ce28 |
||
|
|
59437fa4ba |
Merge 90c90cda05 ("Merge tag 'xfs-5.15-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux") into android-mainline
Steps on the way to 5.15-rc1 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id0e9064c101f599601a6d864ce7ac2ed88083562 |
||
|
|
cd1adf1b63 |
Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"
This reverts commit
|
||
|
|
49624efa65 |
Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
|
||
|
|
14726903c8 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ... |
||
|
|
d0505e9f7d |
mm: hwpoison: don't drop slab caches for offlining non-LRU page
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline. But
if the page is not slab page all the effort is wasted in vain. Even
though it is a slab page, it is not guaranteed the page could be freed
at all.
However the side effect and cost is quite high. It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches. It could make the most
workingset gone in order to just offline a page. And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.
Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.
Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:
BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
in-flight: 409271:memory_failure_work_func
pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
workqueue mm_percpu_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
pending: vmstat_update
workqueue cgroup_destroy: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
pending: css_release_work_fn
There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
Workqueue: events memory_failure_work_func
RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
Call Trace:
__queue_work+0xd6/0x3c0
queue_work_on+0x1c/0x30
uncharge_batch+0x10e/0x110
mem_cgroup_uncharge_list+0x6d/0x80
release_pages+0x37f/0x3f0
__pagevec_release+0x1c/0x50
__invalidate_mapping_pages+0x348/0x380
inode_lru_isolate+0x10a/0x160
__list_lru_walk_one+0x7b/0x170
list_lru_walk_one+0x4a/0x60
prune_icache_sb+0x37/0x50
super_cache_scan+0x123/0x1a0
do_shrink_slab+0x10c/0x2c0
shrink_slab+0x1f1/0x290
drop_slab_node+0x4d/0x70
soft_offline_page+0x1ac/0x5b0
memory_failure_work_func+0x6a/0x90
process_one_work+0x19e/0x340
worker_thread+0x30/0x360
kthread+0x116/0x130
The lockup made the machine is quite unusable. And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.
But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:
soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.
Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: David Mackey <tdmackey@twitter.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
3969b1a654 |
mm: delete unused get_kernel_page()
get_kernel_page() was added in 2012 by [1]. It was used for a while for NFS, but then in 2014, a refactoring [2] removed all callers, and it has apparently not been used since. Remove get_kernel_page() because it has no callers. [1] commit |
||
|
|
9857a17f20 |
mm/gup: remove try_get_page(), call try_get_compound_head() directly
try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
54d516b1d6 |
mm/gup: small refactoring: simplify try_grab_page()
try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8d0920bde5 |
mm: remove VM_DENYWRITE
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
fe69d560b5 |
kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
35d7bdc860 |
kernel/fork: factor out replacing the current MM exe_file
Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
de2860f463 |
mm: Add kvrealloc()
During log recovery of an XFS filesystem with 64kB directory
buffers, rebuilding a buffer split across two log records results
in a memory allocation warning from krealloc like this:
xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
.....
RIP: 0010:get_page_from_freelist+0xdee/0xe40
Call Trace:
? complete+0x3f/0x50
__alloc_pages+0x16f/0x300
alloc_pages+0x87/0x110
kmalloc_order+0x2c/0x90
kmalloc_order_trace+0x1d/0x90
__kmalloc_track_caller+0x215/0x270
? xlog_recover_add_to_cont_trans+0x63/0x1f0
krealloc+0x54/0xb0
xlog_recover_add_to_cont_trans+0x63/0x1f0
xlog_recovery_process_trans+0xc1/0xd0
xlog_recover_process_ophdr+0x86/0x130
xlog_recover_process_data+0x9f/0x160
xlog_recover_process+0xa2/0x120
xlog_do_recovery_pass+0x40b/0x7d0
? __irq_work_queue_local+0x4f/0x60
? irq_work_queue+0x3a/0x50
xlog_do_log_recovery+0x70/0x150
xlog_do_recover+0x38/0x1d0
xlog_recover+0xd8/0x170
xfs_log_mount+0x181/0x300
xfs_mountfs+0x4a1/0x9b0
xfs_fs_fill_super+0x3c0/0x7b0
get_tree_bdev+0x171/0x270
? suffix_kstrtoint.constprop.0+0xf0/0xf0
xfs_fs_get_tree+0x15/0x20
vfs_get_tree+0x24/0xc0
path_mount+0x2f5/0xaf0
__x64_sys_mount+0x108/0x140
do_syscall_64+0x3a/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xae
Essentially, we are taking a multi-order allocation from kmem_alloc()
(which has an open coded no fail, no warn loop) and then
reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
then triggering the above warning.
This is a regression caused by converting this code from an open
coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
What we actually need here is kvrealloc(), so that if contiguous
page allocation fails we fall back to vmalloc() and we don't
get nasty warnings happening in XFS.
Fixes:
|
||
|
|
946e465c81 |
Merge tag 'v5.14-rc2' into android-mainline
Linux 5.14-rc2 Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: Ia2131de59daa96610741f5a0ff267b0d08697023 |
||
|
|
a8b636db7d |
Merge tag 'v5.14-rc1' into android-mainline
Linux 5.14-rc1 Change-Id: I9765cd4581f6683a6fca3580667017fff9cbaa2b Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
79789db03f |
mm: Make copy_huge_page() always available
Rewrite copy_huge_page() and move it into mm/util.c so it's always
available. Fixes an exposure of uninitialised memory on configurations
with HUGETLB and UFFD enabled and MIGRATION disabled.
Fixes:
|
||
|
|
293f275f4d |
Merge commit df8ba5f160 ("Merge tag 'kgdb-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux") into android-mainline
A large step en route to v5.14-rc1 Change-Id: I52bb71dc737044a593d1a9dfd7fe02b31e273ff9 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
8e658623d4 |
Merge commit c288d9cd71 ("Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block") into android-mainline
Another small step en route to v5.14-rc1 Change-Id: I24899ab78da7d367574ed69ceaa82ab0837d9556 Signed-off-by: Lee Jones <lee.jones@linaro.org> |
||
|
|
5748fbc533 |
mm: add setup_initial_init_mm() helper
Patch series "init_mm: cleanup ARCH's text/data/brk setup code", v3. Add setup_initial_init_mm() helper, then use it to cleanup the text, data and brk setup code. This patch (of 15): Add setup_initial_init_mm() helper to setup kernel text, data and brk. Link: https://lkml.kernel.org/r/20210608083418.137226-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20210608083418.137226-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nick Hu <nickhu@andestech.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Rich Felker <dalias@libc.org> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71bd934101 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ... |
||
|
|
c4ffefd16d |
mm: fix typos and grammar error in comments
We moves tha -> We move that in mm/swap.c statments -> statements in include/linux/mm.h Link: https://lkml.kernel.org/r/20210509063444.GA24745@hyeyoo Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
5db4f15c4f |
mm: memory: add orig_pmd to struct vm_fault
Pach series "mm: thp: use generic THP migration for NUMA hinting fault", v3. When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. It is definitely a maintenance burden to keep two THP migration implementation for different code paths and it is more error prone. Using the generic THP migration implementation allows us remove the duplicate code and some hacks needed by the old ad hoc implementation. A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP and NUMA balancing. The most of them support THP migration except for S390. Zi Yan tried to add THP migration support for S390 before but it was not accepted due to the design of S390 PMD. For the discussion, please see: https://lkml.org/lkml/2018/4/27/953. Per the discussion with Gerald Schaefer in v1 it is acceptible to skip huge PMD for S390 for now. I saw there were some hacks about gup from git history, but I didn't figure out if they have been removed or not since I just found FOLL_NUMA code in the current gup implementation and they seems useful. Patch #1 ~ #2 are preparation patches. Patch #3 is the real meat. Patch #4 ~ #6 keep consistent counters and behaviors with before. Patch #7 skips change huge PMD to prot_none if thp migration is not supported. Test ---- Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM has 80 vcpus and 64G memory. The test would create 2 processes to consume 128G memory together which would incur memory pressure to cause THP splits. And it also creates 80 processes to hog cpu, and the memory consumer processes are bound to different nodes periodically in order to increase NUMA faults. The below test script is used: echo 3 > /proc/sys/vm/drop_caches # Run stress-ng for 24 hours ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h & PID=$! ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h & # Wait for vm stressors forked sleep 5 PID_1=`pgrep -P $PID | awk 'NR == 1'` PID_2=`pgrep -P $PID | awk 'NR == 2'` JOB1=`pgrep -P $PID_1` JOB2=`pgrep -P $PID_2` # Bind load jobs to different nodes periodically to force generate # cross node memory access while [ -d "/proc/$PID" ] do taskset -apc 8 $JOB1 taskset -apc 8 $JOB2 sleep 300 taskset -apc 58 $JOB1 taskset -apc 58 $JOB2 sleep 300 done With the above test the histogram of latency of do_huge_pmd_numa_page is as shown below. Since the number of do_huge_pmd_numa_page varies drastically for each run (should be due to scheduler), so I converted the raw number to percentage. patched base @us[stress-ng]: [0] 3.57% 0.16% [1] 55.68% 18.36% [2, 4) 10.46% 40.44% [4, 8) 7.26% 17.82% [8, 16) 21.12% 13.41% [16, 32) 1.06% 4.27% [32, 64) 0.56% 4.07% [64, 128) 0.16% 0.35% [128, 256) < 0.1% < 0.1% [256, 512) < 0.1% < 0.1% [512, 1K) < 0.1% < 0.1% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) < 0.1% < 0.1% [16K, 32K) < 0.1% < 0.1% [32K, 64K) < 0.1% < 0.1% Per the result, patched kernel is even slightly better than the base kernel. I think this is because the lock contention against THP split is less than base kernel due to the refactor. To exclude the affect from THP split, I also did test w/o memory pressure. No obvious regression is spotted. The below is the test result *w/o* memory pressure. patched base @us[stress-ng]: [0] 7.97% 18.4% [1] 69.63% 58.24% [2, 4) 4.18% 2.63% [4, 8) 0.22% 0.17% [8, 16) 1.03% 0.92% [16, 32) 0.14% < 0.1% [32, 64) < 0.1% < 0.1% [64, 128) < 0.1% < 0.1% [128, 256) < 0.1% < 0.1% [256, 512) 0.45% 1.19% [512, 1K) 15.45% 17.27% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) 0.86% 0.88% [16K, 32K) < 0.1% 0.15% [32K, 64K) < 0.1% < 0.1% [64K, 128K) < 0.1% < 0.1% [128K, 256K) < 0.1% < 0.1% The series also survived a series of tests that exercise NUMA balancing migrations by Mel. This patch (of 7): Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge page fault could be removed, just like its PTE counterpart does. Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3bc2b6a725 |
mm: sparsemem: split the huge PMD mapping of vmemmap pages
Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series[1]. We disable huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt coalasce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmmemap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ad2fa3717b |
mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it. However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure. In
this case, we just refuse to free the HugeTLB page. This changes behavior
in some corner cases as listed below:
1) Failing to free a huge page triggered by the user (decrease nr_pages).
User needs to try again later.
2) Failing to free a surplus huge page when freed by the application.
Try again later when freeing a huge page next time.
3) Failing to dissolve a free huge page on ZONE_MOVABLE via
offline_pages().
This can happen when we have plenty of ZONE_MOVABLE memory, but
not enough kernel memory to allocate vmemmmap pages. We may even
be able to migrate huge page contents, but will not be able to
dissolve the source huge page. This will prevent an offline
operation and is unfortunate as memory offlining is expected to
succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the
memory savings gained from this feature are worth the risk of
possibly not being able to offline memory in certain situations.
4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
alloc_contig_range() - once we have that handling in place. Mainly
affects CMA and virtio-mem.
Similar to 3). virito-mem will handle migration errors gracefully.
CMA might be able to fallback on other free areas within the CMA
region.
Vmemmap pages are allocated from the page freeing context. In order for
those allocations to be not disruptive (e.g. trigger oom killer)
__GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure. GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
f41f2ed43c |
mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [willy@infradead.org: fix documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
dbe69e4337 |
Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- BPF:
- add syscall program type and libbpf support for generating
instructions and bindings for in-kernel BPF loaders (BPF loaders
for BPF), this is a stepping stone for signed BPF programs
- infrastructure to migrate TCP child sockets from one listener to
another in the same reuseport group/map to improve flexibility
of service hand-off/restart
- add broadcast support to XDP redirect
- allow bypass of the lockless qdisc to improving performance (for
pktgen: +23% with one thread, +44% with 2 threads)
- add a simpler version of "DO_ONCE()" which does not require jump
labels, intended for slow-path usage
- virtio/vsock: introduce SOCK_SEQPACKET support
- add getsocketopt to retrieve netns cookie
- ip: treat lowest address of a IPv4 subnet as ordinary unicast
address allowing reclaiming of precious IPv4 addresses
- ipv6: use prandom_u32() for ID generation
- ip: add support for more flexible field selection for hashing
across multi-path routes (w/ offload to mlxsw)
- icmp: add support for extended RFC 8335 PROBE (ping)
- seg6: add support for SRv6 End.DT46 behavior
- mptcp:
- DSS checksum support (RFC 8684) to detect middlebox meddling
- support Connection-time 'C' flag
- time stamping support
- sctp: packetization Layer Path MTU Discovery (RFC 8899)
- xfrm: speed up state addition with seq set
- WiFi:
- hidden AP discovery on 6 GHz and other HE 6 GHz improvements
- aggregation handling improvements for some drivers
- minstrel improvements for no-ack frames
- deferred rate control for TXQs to improve reaction times
- switch from round robin to virtual time-based airtime scheduler
- add trace points:
- tcp checksum errors
- openvswitch - action execution, upcalls
- socket errors via sk_error_report
Device APIs:
- devlink: add rate API for hierarchical control of max egress rate
of virtual devices (VFs, SFs etc.)
- don't require RCU read lock to be held around BPF hooks in NAPI
context
- page_pool: generic buffer recycling
New hardware/drivers:
- mobile:
- iosm: PCIe Driver for Intel M.2 Modem
- support for Qualcomm MSM8998 (ipa)
- WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
- sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
- Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
- NXP SJA1110 Automotive Ethernet 10-port switch
- Qualcomm QCA8327 switch support (qca8k)
- Mikrotik 10/25G NIC (atl1c)
Driver changes:
- ACPI support for some MDIO, MAC and PHY devices from Marvell and
NXP (our first foray into MAC/PHY description via ACPI)
- HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
- Mellanox/Nvidia NIC (mlx5)
- NIC VF offload of L2 bridging
- support IRQ distribution to Sub-functions
- Marvell (prestera):
- add flower and match all
- devlink trap
- link aggregation
- Netronome (nfp): connection tracking offload
- Intel 1GE (igc): add AF_XDP support
- Marvell DPU (octeontx2): ingress ratelimit offload
- Google vNIC (gve): new ring/descriptor format support
- Qualcomm mobile (rmnet & ipa): inline checksum offload support
- MediaTek WiFi (mt76)
- mt7915 MSI support
- mt7915 Tx status reporting
- mt7915 thermal sensors support
- mt7921 decapsulation offload
- mt7921 enable runtime pm and deep sleep
- Realtek WiFi (rtw88)
- beacon filter support
- Tx antenna path diversity support
- firmware crash information via devcoredump
- Qualcomm WiFi (wcn36xx)
- Wake-on-WLAN support with magic packets and GTK rekeying
- Micrel PHY (ksz886x/ksz8081): add cable test support"
* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
tcp: change ICSK_CA_PRIV_SIZE definition
tcp_yeah: check struct yeah size at compile time
gve: DQO: Fix off by one in gve_rx_dqo()
stmmac: intel: set PCI_D3hot in suspend
stmmac: intel: Enable PHY WOL option in EHL
net: stmmac: option to enable PHY WOL with PMT enabled
net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
net: use netdev_info in ndo_dflt_fdb_{add,del}
ptp: Set lookup cookie when creating a PTP PPS source.
net: sock: add trace for socket errors
net: sock: introduce sk_error_report
net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
net: dsa: include fdb entries pointing to bridge in the host fdb list
net: dsa: include bridge addresses which are local in the host fdb list
net: dsa: sync static FDB entries on foreign interfaces to hardware
net: dsa: install the host MDB and FDB entries in the master's RX filter
net: dsa: reference count the FDB addresses at the cross-chip notifier level
net: dsa: introduce a separate cross-chip notifier type for host FDBs
net: dsa: reference count the MDB entries at the cross-chip notifier level
...
|
||
|
|
44b6ed4cfa |
Merge tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull clang feature updates from Kees Cook: - Add CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in the face of the noinstr attribute, paving the way for PGO and fixing GCOV. (Nick Desaulniers) - x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor) - Small fixes to CFI. (Mark Rutland, Nathan Chancellor) * tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR compiler_attributes.h: cleanups for GCC 4.9+ compiler_attributes.h: define __no_profile, add to noinstr x86, lto: Enable Clang LTO for 32-bit as well CFI: Move function_nocfi() into compiler.h MAINTAINERS: Add Clang CFI section |
||
|
|
65090f30ab |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ... |