9655518f3b1de0c691fd33c096f98e86966c1cc2
659 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e3d8fe0993 |
Merge 5.15.91 into android14-5.15
Changes in 5.15.91 memory: tegra: Remove clients SID override programming memory: atmel-sdramc: Fix missing clk_disable_unprepare in atmel_ramc_probe() memory: mvebu-devbus: Fix missing clk_disable_unprepare in mvebu_devbus_probe() dmaengine: ti: k3-udma: Do conditional decrement of UDMA_CHAN_RT_PEER_BCNT_REG arm64: dts: imx8mp-phycore-som: Remove invalid PMIC property ARM: dts: imx6ul-pico-dwarf: Use 'clock-frequency' ARM: dts: imx7d-pico: Use 'clock-frequency' ARM: dts: imx6qdl-gw560x: Remove incorrect 'uart-has-rtscts' arm64: dts: imx8mm-beacon: Fix ecspi2 pinmux ARM: imx: add missing of_node_put() HID: intel_ish-hid: Add check for ishtp_dma_tx_map arm64: dts: imx8mm-venice-gw7901: fix USB2 controller OC polarity soc: imx8m: Fix incorrect check for of_clk_get_by_name() reset: uniphier-glue: Use reset_control_bulk API reset: uniphier-glue: Fix possible null-ptr-deref EDAC/highbank: Fix memory leak in highbank_mc_probe() firmware: arm_scmi: Harden shared memory access in fetch_response firmware: arm_scmi: Harden shared memory access in fetch_notification tomoyo: fix broken dependency on *.conf.default RDMA/core: Fix ib block iterator counter overflow IB/hfi1: Reject a zero-length user expected buffer IB/hfi1: Reserve user expected TIDs IB/hfi1: Fix expected receive setup error exit issues IB/hfi1: Immediately remove invalid memory from hardware IB/hfi1: Remove user expected buffer invalidate race affs: initialize fsdata in affs_truncate() PM: AVS: qcom-cpr: Fix an error handling path in cpr_probe() arm64: dts: qcom: msm8992: Don't use sfpb mutex arm64: dts: qcom: msm8992-libra: Add CPU regulators arm64: dts: qcom: msm8992-libra: Fix the memory map phy: ti: fix Kconfig warning and operator precedence NFSD: fix use-after-free in nfsd4_ssc_setup_dul() ARM: dts: at91: sam9x60: fix the ddr clock for sam9x60 amd-xgbe: TX Flow Ctrl Registers are h/w ver dependent amd-xgbe: Delay AN timeout during KR training bpf: Fix pointer-leak due to insufficient speculative store 
bypass mitigation phy: rockchip-inno-usb2: Fix missing clk_disable_unprepare() in rockchip_usb2phy_power_on() net: nfc: Fix use-after-free in local_cleanup() net: wan: Add checks for NULL for utdm in undo_uhdlc_init and unmap_si_regs net: enetc: avoid deadlock in enetc_tx_onestep_tstamp() sch_htb: Avoid grafting on htb_destroy_class_offload when destroying htb gpio: use raw spinlock for gpio chip shadowed data gpio: mxc: Protect GPIO irqchip RMW with bgpio spinlock gpio: mxc: Always set GPIOs used as interrupt source to INPUT mode wifi: rndis_wlan: Prevent buffer overflow in rndis_query_oid pinctrl/rockchip: Use temporary variable for struct device pinctrl/rockchip: add error handling for pull/drive register getters pinctrl: rockchip: fix reading pull type on rk3568 net: stmmac: Fix queue statistics reading net/sched: sch_taprio: fix possible use-after-free l2tp: Serialize access to sk_user_data with sk_callback_lock l2tp: Don't sleep and disable BH under writer-side sk_callback_lock l2tp: convert l2tp_tunnel_list to idr l2tp: close all race conditions in l2tp_tunnel_register() octeontx2-pf: Avoid use of GFP_KERNEL in atomic context net: usb: sr9700: Handle negative len net: mdio: validate parameter addr in mdiobus_get_phy() HID: check empty report_list in hid_validate_values() HID: check empty report_list in bigben_probe() net: stmmac: fix invalid call to mdiobus_get_phy() pinctrl: rockchip: fix mux route data for rk3568 HID: revert CHERRY_MOUSE_000C quirk usb: gadget: f_fs: Prevent race during ffs_ep0_queue_wait usb: gadget: f_fs: Ensure ep0req is dequeued before free_request Bluetooth: Fix possible deadlock in rfcomm_sk_state_change net: ipa: disable ipa interrupt during suspend net/mlx5: E-switch, Fix setting of reserved fields on MODIFY_SCHEDULING_ELEMENT net: mlx5: eliminate anonymous module_init & module_exit drm/panfrost: fix GENERIC_ATOMIC64 dependency dmaengine: Fix double increment of client_count in dma_chan_get() net: macb: fix PTP TX timestamp failure 
due to packet padding virtio-net: correctly enable callback during start_xmit l2tp: prevent lockdep issue in l2tp_tunnel_register() HID: betop: check shape of output reports cifs: fix potential deadlock in cache_refresh_path() dmaengine: xilinx_dma: call of_node_put() when breaking out of for_each_child_of_node() phy: phy-can-transceiver: Skip warning if no "max-bitrate" drm/amd/display: fix issues with driver unload nvme-pci: fix timeout request state check tcp: avoid the lookup process failing to get sk in ehash table octeontx2-pf: Fix the use of GFP_KERNEL in atomic context on rt ptdma: pt_core_execute_cmd() should use spinlock device property: fix of node refcount leak in fwnode_graph_get_next_endpoint() w1: fix deadloop in __w1_remove_master_device() w1: fix WARNING after calling w1_process() driver core: Fix test_async_probe_init saves device in wrong array selftests/net: toeplitz: fix race on tpacket_v3 block close net: dsa: microchip: ksz9477: port map correction in ALU table entry register thermal/core: Remove duplicate information when an error occurs thermal/core: Rename 'trips' to 'num_trips' thermal: Validate new state in cur_state_store() thermal/core: fix error code in __thermal_cooling_device_register() thermal: core: call put_device() only after device_register() fails net: stmmac: enable all safety features by default tcp: fix rate_app_limited to default to 1 scsi: iscsi: Fix multiple iSCSI session unbind events sent to userspace cpufreq: Add Tegra234 to cpufreq-dt-platdev blocklist kcsan: test: don't put the expect array on the stack cpufreq: Add SM6375 to cpufreq-dt-platdev blocklist ASoC: fsl_micfil: Correct the number of steps on SX controls net: usb: cdc_ether: add support for Thales Cinterion PLS62-W modem drm: Add orientation quirk for Lenovo ideapad D330-10IGL s390/debug: add _ASM_S390_ prefix to header guard s390: expicitly align _edata and _end symbols on page boundary perf/x86/msr: Add Emerald Rapids perf/x86/intel/uncore: Add Emerald 
Rapids cpufreq: armada-37xx: stop using 0 as NULL pointer ASoC: fsl_ssi: Rename AC'97 streams to avoid collisions with AC'97 CODEC ASoC: fsl-asoc-card: Fix naming of AC'97 CODEC widgets spi: spidev: remove debug messages that access spidev->spi without locking KVM: s390: interrupt: use READ_ONCE() before cmpxchg() scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id r8152: add vendor/device ID pair for Microsoft Devkit platform/x86: touchscreen_dmi: Add info for the CSL Panther Tab HD platform/x86: asus-nb-wmi: Add alternate mapping for KEY_SCREENLOCK lockref: stop doing cpu_relax in the cmpxchg loop firmware: coreboot: Check size of table entry and use flex-array drm/i915: Allow switching away via vga-switcheroo if uninitialized Revert "selftests/bpf: check null propagation only neither reg is PTR_TO_BTF_ID" drm/i915: Remove unused variable x86: ACPI: cstate: Optimize C3 entry on AMD CPUs fs: reiserfs: remove useless new_opts in reiserfs_remount sysctl: add a new register_sysctl_init() interface kernel/panic: move panic sysctls to its own file panic: unset panic_on_warn inside panic() ubsan: no need to unset panic_on_warn in ubsan_epilogue() kasan: no need to unset panic_on_warn in end_report() exit: Add and use make_task_dead. 
objtool: Add a missing comma to avoid string concatenation hexagon: Fix function name in die() h8300: Fix build errors from do_exit() to make_task_dead() transition csky: Fix function name in csky_alignment() and die() ia64: make IA64_MCA_RECOVERY bool instead of tristate panic: Separate sysctl logic from CONFIG_SMP exit: Put an upper limit on how often we can oops exit: Expose "oops_count" to sysfs exit: Allow oops_limit to be disabled panic: Consolidate open-coded panic_on_warn checks panic: Introduce warn_limit panic: Expose "warn_count" to sysfs docs: Fix path paste-o for /sys/kernel/warn_count exit: Use READ_ONCE() for all oops/warn limit reads Bluetooth: hci_sync: cancel cmd_timer if hci_open failed drm/amdgpu: complete gfxoff allow signal during suspend without delay scsi: hpsa: Fix allocation size for scsi_host_alloc() KVM: SVM: fix tsc scaling cache logic module: Don't wait for GOING modules tracing: Make sure trace_printk() can output as soon as it can be used trace_events_hist: add check for return value of 'create_hist_field' ftrace/scripts: Update the instructions for ftrace-bisect.sh cifs: Fix oops due to uncleared server->smbd_conn in reconnect i2c: mv64xxx: Remove shutdown method from driver i2c: mv64xxx: Add atomic_xfer method to driver ksmbd: add smbd max io size parameter ksmbd: add max connections parameter ksmbd: do not sign response to session request for guest login ksmbd: downgrade ndr version error message to debug ksmbd: limit pdu length size according to connection status ovl: fail on invalid uid/gid mapping at copy up KVM: x86/vmx: Do not skip segment attributes if unusable bit is set KVM: arm64: GICv4.1: Fix race with doorbell on VPE activation/deactivation thermal: intel: int340x: Protect trip temperature from concurrent updates ipv6: fix reachability confirmation with proxy_ndp ARM: 9280/1: mm: fix warning on phys_addr_t to void pointer assignment EDAC/device: Respect any driver-supplied workqueue polling value EDAC/qcom: Do not pass 
llcc_driv_data as edac_device_ctl_info's pvt_info net: mana: Fix IRQ name - add PCI and queue number scsi: ufs: core: Fix devfreq deadlocks i2c: designware: use casting of u64 in clock multiplication to avoid overflow netlink: prevent potential spectre v1 gadgets net: fix UaF in netns ops registration error path drm/i915/selftest: fix intel_selftest_modify_policy argument types netfilter: nft_set_rbtree: Switch to node list walk for overlap detection netfilter: nft_set_rbtree: skip elements in transaction from garbage collection netlink: annotate data races around nlk->portid netlink: annotate data races around dst_portid and dst_group netlink: annotate data races around sk_state ipv4: prevent potential spectre v1 gadget in ip_metrics_convert() ipv4: prevent potential spectre v1 gadget in fib_metrics_match() netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE netrom: Fix use-after-free of a listening socket. net/sched: sch_taprio: do not schedule in taprio_reset() sctp: fail if no bound addresses can be used for a given scope riscv/kprobe: Fix instruction simulation of JALR nvme: fix passthrough csi check gpio: mxc: Unlock on error path in mxc_flip_edge() ravb: Rename "no_ptp_cfg_active" and "ptp_cfg_active" variables net: ravb: Fix lack of register setting after system resumed for Gen3 net: ravb: Fix possible hang if RIS2_QFF1 happen net: mctp: mark socks as dead on unhash, prevent re-add thermal: intel: int340x: Add locking to int340x_thermal_get_trip_type() net/tg3: resolve deadlock in tg3_reset_task() during EEH net: mdio-mux-meson-g12a: force internal PHY off on mux switch treewide: fix up files incorrectly marked executable tools: gpio: fix -c option of gpio-event-mon Revert "Input: synaptics - switch touchpad on HP Laptop 15-da3001TU to RMI mode" cpufreq: Move to_gov_attr_set() to cpufreq.h cpufreq: governor: Use kobject release() method to free dbs_data kbuild: Allow kernel installation packaging to override pkg-config block: fix and cleanup 
bio_check_ro x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL netfilter: conntrack: unify established states for SCTP paths perf/x86/amd: fix potential integer overflow on shift of a int Linux 5.15.91 Change-Id: I3349d802533097ac86e5c680fbd40c00c9719ec7 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
21998acd31 |
exit: Use READ_ONCE() for all oops/warn limit reads
commit 7535b832c6399b5ebfc5b53af5c51dd915ee2538 upstream. Use a temporary variable to take full advantage of READ_ONCE() behavior. Without this, the report (and even the test) might be out of sync with the initial test. Reported-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Y5x7GXeluFmZ8E0E@hirez.programming.kicks-ass.net Fixes: 9fc9e278a5c0 ("panic: Introduce warn_limit") Fixes: d4ccd54d28d3 ("exit: Put an upper limit on how often we can oops") Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
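The temporary-variable pattern that the READ_ONCE() commit above describes can be sketched in userspace C. This is a minimal sketch, not the kernel code: the `READ_ONCE` macro here is a simplified stand-in for the kernel's, and `limit_exceeded()` is a hypothetical helper name. It treats a limit of 0 as "checking disabled", as the oops_limit commit in this log describes.

```c
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's READ_ONCE(): one volatile load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Illustrative tunable that another thread (e.g. a sysctl write) may change at any time. */
static unsigned int oops_limit = 10000;

/*
 * Hypothetical helper: returns true when count has reached the limit.
 * The limit is loaded exactly once into a local, so the "is it disabled?"
 * check and the comparison (and any later report) all see the same value.
 */
static bool limit_exceeded(unsigned int count)
{
	unsigned int limit = READ_ONCE(oops_limit);

	return limit != 0 && count >= limit;
}
```

Reading `oops_limit` twice instead (once for the zero check, once for the comparison) would let a concurrent update slip in between the two reads, which is the inconsistency the commit message refers to.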
|
|
fc636b1362 |
exit: Allow oops_limit to be disabled
commit de92f65719cd672f4b48397540b9f9eff67eca40 upstream. In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
339f8a8e52 |
exit: Expose "oops_count" to sysfs
commit 9db89b41117024f80b38b15954017fb293133364 upstream. Since Oops count is now tracked and is a fairly interesting signal, add the entry /sys/kernel/oops_count to expose it to userspace. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
f80fb0001f |
exit: Put an upper limit on how often we can oops
commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream. Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 
12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
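The timing estimate in the commit message above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model using the figures quoted in the message (about 510 microseconds per oops with a graphical console, about 45 milliseconds with a text console, and a parallel speedup of roughly 2); `days_to_reach()` is a hypothetical helper name.

```c
#include <stdint.h>

#define SECONDS_PER_DAY 86400.0

/*
 * Days needed to trigger `events` oopses when each one takes `usec_per_oops`
 * microseconds and parallelism yields an overall `speedup` factor.
 */
static double days_to_reach(uint64_t events, double usec_per_oops, double speedup)
{
	double seconds = (double)events * usec_per_oops / 1e6 / speedup;

	return seconds / SECONDS_PER_DAY;
}
```

With these inputs, `days_to_reach(1ULL << 32, 510.0, 1.0)` comes to about 25 days single-threaded and `days_to_reach(1ULL << 32, 510.0, 2.0)` to just under 13 days, which is the same order as the measured 8-12 days. At 45 milliseconds per oops the single-threaded figure grows to roughly six years.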
|
|
39a26d8721 |
exit: Add and use make_task_dead.
commit 0e25498f8cd43c1b5aa327f373dd094e9a006da7 upstream. There are two big uses of do_exit. The first is its designed use as the guts of the exit(2) system call. The second use is to terminate a task after something catastrophic has happened, like a NULL pointer dereference in kernel code. Add a function make_task_dead that is initially exactly the same as do_exit to cover the cases where do_exit is called to handle catastrophic failure. In time this can probably be reduced to just a light wrapper around do_task_dead. For now keep it exactly the same so that there will be no behavioral differences introducing this new concept. Replace all of the uses of do_exit that use it for catastrophic task cleanup with make_task_dead to make it clear what the code is doing. As part of this, rename rewind_stack_do_exit to rewind_stack_and_make_dead. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
7f53b0e704 |
BACKPORT: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior is unrelated to the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
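Optimization 2 above can be illustrated with a small generational Bloom filter. This is a minimal userspace sketch under stated assumptions, not the patch's implementation: the filter size, the hash, and all names (`gen_bloom`, `gen_bloom_add`, and so on) are hypothetical. The idea shown is that entries recorded for one generation are queried when that generation is walked, and the filter is cleared before it is reused.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 4096u /* assumed size, a power of two */
#define NR_FILTERS  2u    /* one filter per alternating generation */

struct gen_bloom {
	uint8_t bits[NR_FILTERS][FILTER_BITS / 8];
};

/* Derive two bit positions from one multiplicative hash (illustrative hash choice). */
static void item_keys(uint64_t item, uint32_t key[2])
{
	uint64_t h = item * 0x9E3779B97F4A7C15ull;

	key[0] = (uint32_t)(h >> 32) % FILTER_BITS;
	key[1] = (uint32_t)h % FILTER_BITS;
}

static bool bit_test(const uint8_t *bits, uint32_t n)
{
	return bits[n / 8] & (1u << (n % 8));
}

static void bit_set(uint8_t *bits, uint32_t n)
{
	bits[n / 8] |= (uint8_t)(1u << (n % 8));
}

/* Record that item (e.g. a populated page-table branch) is worth visiting in generation seq. */
static void gen_bloom_add(struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	uint32_t key[2];

	item_keys(item, key);
	bit_set(f->bits[seq % NR_FILTERS], key[0]);
	bit_set(f->bits[seq % NR_FILTERS], key[1]);
}

/* May report a false positive, but never misses an item added for generation seq. */
static bool gen_bloom_maybe_contains(const struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	uint32_t key[2];

	item_keys(item, key);
	return bit_test(f->bits[seq % NR_FILTERS], key[0]) &&
	       bit_test(f->bits[seq % NR_FILTERS], key[1]);
}

/* Clear the filter that generation seq will use, discarding stale entries. */
static void gen_bloom_reset(struct gen_bloom *f, unsigned long seq)
{
	memset(f->bits[seq % NR_FILTERS], 0, sizeof(f->bits[0]));
}
```

A walker in this model would skip any branch for which `gen_bloom_maybe_contains()` returns false, and add the branches it finds worthwhile to the next generation's filter. False positives only cost an unnecessary visit, so a small fixed-size filter is acceptable.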
Server benchmark results:
  Single workload:
    fio (buffered I/O): no change
  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66
  Configurations:
    no change
Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset
    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_page
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset
  Configurations:
    no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Change-Id: I7ed3daf288e664e15bfd34991a77467a19a4e39a
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit bd74fdaea146029e4fa12c6de89adbe0779348a9)
[ Resolve conflicts in include/linux/memcontrol.h,
include/linux/mm_types.h ]
Bug: 249601646
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
|
||
|
|
02dc0d1dda |
Revert "FROMLIST: mm: multi-gen LRU: support page table walks"
This reverts commit
|
||
|
|
56f32ebb01 |
Merge 5.15.56 into android14-5.15
Changes in 5.15.56
ALSA: hda - Add fixup for Dell Latitidue E5430
ALSA: hda/conexant: Apply quirk for another HP ProDesk 600 G3 model
ALSA: hda/realtek: Fix headset mic for Acer SF313-51
ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc671
ALSA: hda/realtek: fix mute/micmute LEDs for HP machines
ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc221
ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop
xen/netback: avoid entering xenvif_rx_next_skb() with an empty rx queue
fix race between exit_itimers() and /proc/pid/timers
mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
mm: split huge PUD on wp_huge_pud fallback
tracing/histograms: Fix memory leak problem
net: sock: tracing: Fix sock_exceed_buf_limit not to dereference stale pointer
ip: fix dflt addr selection for connected nexthop
ARM: 9213/1: Print message about disabled Spectre workarounds only once
ARM: 9214/1: alignment: advance IT state after emulating Thumb instruction
wifi: mac80211: fix queue selection for mesh/OCB interfaces
cgroup: Use separate src/dst nodes when preloading css_sets for migration
btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
drm/panfrost: Put mapping instead of shmem obj on panfrost_mmu_map_fault_addr() error
drm/panfrost: Fix shrinker list corruption by madvise IOCTL
fs/remap: constrain dedupe of EOF blocks
nilfs2: fix incorrect masking of permission flags for symlinks
sh: convert nommu io{re,un}map() to static inline functions
Revert "evm: Fix memleak in init_desc"
xfs: only run COW extent recovery when there are no live extents
xfs: don't include bnobt blocks when reserving free block pool
xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
xfs: drop async cache flushes from CIL commits.
reset: Fix devm bulk optional exclusive control getter
ARM: dts: imx6qdl-ts7970: Fix ngpio typo and count
spi: amd: Limit max transfer and message size
ARM: 9209/1: Spectre-BHB: avoid pr_info() every time a CPU comes out of idle
ARM: 9210/1: Mark the FDT_FIXED sections as shareable
net/mlx5e: kTLS, Fix build time constant test in TX
net/mlx5e: kTLS, Fix build time constant test in RX
net/mlx5e: Fix enabling sriov while tc nic rules are offloaded
net/mlx5e: Fix capability check for updating vnic env counters
net/mlx5e: Ring the TX doorbell on DMA errors
drm/i915: fix a possible refcount leak in intel_dp_add_mst_connector()
ima: Fix a potential integer overflow in ima_appraise_measurement
ASoC: sgtl5000: Fix noise on shutdown/remove
ASoC: tas2764: Add post reset delays
ASoC: tas2764: Fix and extend FSYNC polarity handling
ASoC: tas2764: Correct playback volume range
ASoC: tas2764: Fix amp gain register offset & default
ASoC: Intel: Skylake: Correct the ssp rate discovery in skl_get_ssp_clks()
ASoC: Intel: Skylake: Correct the handling of fmt_config flexible array
net: stmmac: dwc-qos: Disable split header for Tegra194
net: ethernet: ti: am65-cpsw: Fix devlink port register sequence
sysctl: Fix data races in proc_dointvec().
sysctl: Fix data races in proc_douintvec().
sysctl: Fix data races in proc_dointvec_minmax().
sysctl: Fix data races in proc_douintvec_minmax().
sysctl: Fix data races in proc_doulongvec_minmax().
sysctl: Fix data races in proc_dointvec_jiffies().
tcp: Fix a data-race around sysctl_tcp_max_orphans.
inetpeer: Fix data-races around sysctl.
net: Fix data-races around sysctl_mem.
cipso: Fix data-races around sysctl.
icmp: Fix data-races around sysctl.
ipv4: Fix a data-race around sysctl_fib_sync_mem.
ARM: dts: at91: sama5d2: Fix typo in i2s1 node
ARM: dts: sunxi: Fix SPI NOR campatible on Orange Pi Zero
arm64: dts: broadcom: bcm4908: Fix timer node for BCM4906 SoC
arm64: dts: broadcom: bcm4908: Fix cpu node for smp boot
netfilter: nf_log: incorrect offset to network header
netfilter: nf_tables: replace BUG_ON by element length check
drm/i915/gvt: IS_ERR() vs NULL bug in intel_gvt_update_reg_whitelist()
xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
lockd: set fl_owner when unlocking files
lockd: fix nlm_close_files
tracing: Fix sleeping while atomic in kdb ftdump
drm/i915/selftests: fix a couple IS_ERR() vs NULL tests
drm/i915/dg2: Add Wa_22011100796
drm/i915/gt: Serialize GRDOM access between multiple engine resets
drm/i915/gt: Serialize TLB invalidates with GT resets
drm/i915/uc: correctly track uc_fw init failure
drm/i915: Require the vm mutex for i915_vma_bind()
bnxt_en: Fix bnxt_reinit_after_abort() code path
bnxt_en: Fix bnxt_refclk_read()
sysctl: Fix data-races in proc_dou8vec_minmax().
sysctl: Fix data-races in proc_dointvec_ms_jiffies().
icmp: Fix data-races around sysctl_icmp_echo_enable_probe.
icmp: Fix a data-race around sysctl_icmp_ignore_bogus_error_responses.
icmp: Fix a data-race around sysctl_icmp_errors_use_inbound_ifaddr.
icmp: Fix a data-race around sysctl_icmp_ratelimit.
icmp: Fix a data-race around sysctl_icmp_ratemask.
raw: Fix a data-race around sysctl_raw_l3mdev_accept.
tcp: Fix a data-race around sysctl_tcp_ecn_fallback.
ipv4: Fix data-races around sysctl_ip_dynaddr.
nexthop: Fix data-races around nexthop_compat_mode.
net: ftgmac100: Hold reference returned by of_get_child_by_name()
net: stmmac: fix leaks in probe
ima: force signature verification when CONFIG_KEXEC_SIG is configured
ima: Fix potential memory leak in ima_init_crypto()
drm/amd/display: Only use depth 36 bpp linebuffers on DCN display engines.
drm/amd/pm: Prevent divide by zero
sfc: fix use after free when disabling sriov
ceph: switch netfs read ops to use rreq->inode instead of rreq->mapping->host
seg6: fix skb checksum evaluation in SRH encapsulation/insertion
seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors
seg6: bpf: fix skb checksum in bpf_push_seg6_encap()
sfc: fix kernel panic when creating VF
net: atlantic: remove deep parameter on suspend/resume functions
net: atlantic: remove aq_nic_deinit() when resume
KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()
net/tls: Check for errors in tls_device_init
ACPI: video: Fix acpi_video_handles_brightness_key_presses()
mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
btrfs: rename btrfs_bio to btrfs_io_context
btrfs: zoned: fix a leaked bioc in read_zone_info
ksmbd: use SOCK_NONBLOCK type for kernel_accept()
powerpc/xive/spapr: correct bitmap allocation size
vdpa/mlx5: Initialize CVQ vringh only once
vduse: Tie vduse mgmtdev and its device
virtio_mmio: Add missing PM calls to freeze/restore
virtio_mmio: Restore guest page size on resume
netfilter: br_netfilter: do not skip all hooks with 0 priority
scsi: hisi_sas: Limit max hw sectors for v3 HW
cpufreq: pmac32-cpufreq: Fix refcount leak bug
platform/x86: hp-wmi: Ignore Sanitization Mode event
firmware: sysfb: Make sysfb_create_simplefb() return a pdev pointer
firmware: sysfb: Add sysfb_disable() helper function
fbdev: Disable sysfb device registration when removing conflicting FBs
net: tipc: fix possible refcount leak in tipc_sk_create()
NFC: nxp-nci: don't print header length mismatch on i2c error
nvme-tcp: always fail a request when sending it failed
nvme: fix regression when disconnect a recovering ctrl
net: sfp: fix memory leak in sfp_probe()
ASoC: ops: Fix off by one in range control validation
pinctrl: aspeed: Fix potential NULL dereference in aspeed_pinmux_set_mux()
ASoC: Realtek/Maxim SoundWire codecs: disable pm_runtime on remove
ASoC: rt711-sdca-sdw: fix calibrate mutex initialization
ASoC: Intel: sof_sdw: handle errors on card registration
ASoC: rt711: fix calibrate mutex initialization
ASoC: rt7*-sdw: harden jack_detect_handler
ASoC: codecs: rt700/rt711/rt711-sdca: initialize workqueues in probe
ASoC: SOF: Intel: hda-loader: Clarify the cl_dsp_init() flow
ASoC: wcd938x: Fix event generation for some controls
ASoC: Intel: bytcr_wm5102: Fix GPIO related probe-ordering problem
ASoC: wm5110: Fix DRE control
ASoC: rt711-sdca: fix kernel NULL pointer dereference when IO error
ASoC: dapm: Initialise kcontrol data for mux/demux controls
ASoC: cs47l15: Fix event generation for low power mux control
ASoC: madera: Fix event generation for OUT1 demux
ASoC: madera: Fix event generation for rate controls
irqchip: or1k-pic: Undefine mask_ack for level triggered hardware
x86: Clear .brk area at early boot
soc: ixp4xx/npe: Fix unused match warning
ARM: dts: stm32: use the correct clock source for CEC on stm32mp151
Revert "can: xilinx_can: Limit CANFD brp to 2"
ALSA: usb-audio: Add quirks for MacroSilicon MS2100/MS2106 devices
ALSA: usb-audio: Add quirk for Fiero SC-01
ALSA: usb-audio: Add quirk for Fiero SC-01 (fw v1.0.0)
nvme-pci: phison e16 has bogus namespace ids
signal handling: don't use BUG_ON() for debugging
USB: serial: ftdi_sio: add Belimo device ids
usb: typec: add missing uevent when partner support PD
usb: dwc3: gadget: Fix event pending check
tty: serial: samsung_tty: set dma burst_size to 1
vt: fix memory overlapping when deleting chars in the buffer
serial: 8250: fix return error code in serial8250_request_std_resource()
serial: stm32: Clear prev values before setting RTS delays
serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle
serial: 8250: Fix PM usage_count for console handover
x86/pat: Fix x86_has_pat_wp()
drm/aperture: Run fbdev removal before internal helpers
Linux 5.15.56
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I763d2a7b49435bf2996b31e201aa9794ab64609e
|
||
|
|
b856e5738b |
fix race between exit_itimers() and /proc/pid/timers
commit d5b36a4dbd06c5e8e36ca8ccc552f679069e2946 upstream. As Chris explains, the comment above exit_itimers() is not correct, we can race with proc_timers_seq_ops. Change exit_itimers() to clear signal->posix_timers with ->siglock held. Cc: <stable@vger.kernel.org> Reported-by: chris@accessvector.net Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
5fdfed7d78 |
ANDROID: vendor_hooks: Add hooks for memory when debug
Add vendor hooks for recording memory usage. Bug: 182443489 Bug: 234407991 Signed-off-by: Liujie Xie <xieliujie@oppo.com> Change-Id: I62d8bb2b6650d8b187b433f97eb833ef0b784df1 |
||
|
|
5280d76d38 |
FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing in common with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
  Single workload:
    fio (buffered I/O): no change
  Single workload:
    memcached (anon): +[5.5, 7.5]%
                Ops/sec      KB/sec
      patch1-7: 1014393.57   39455.42
      patch1-8: 1078507.59   41949.15
  Configurations:
    no change
Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list
    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list
  Configurations:
    no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/lkml/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
|
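Optimization 2 in the commit message above (generational Bloom filters) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel implementation: the class name, the use of SHA-256, the filter size and the `"pmd:..."` keys are all hypothetical. It assumes two filters that alternate by generation number, so that a walk for generation `seq` queries one filter while recording populated branches into the other for the next walk.

```python
import hashlib


class GenerationalBloom:
    """Toy model: two Bloom filters selected by generation number (seq % 2)."""

    def __init__(self, bits=1 << 12, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.filters = [0, 0]  # each filter is a Python int used as a bitset

    def _positions(self, key):
        # Derive `hashes` bit positions from one digest (hypothetical choice).
        digest = hashlib.sha256(repr(key).encode()).digest()
        for i in range(self.hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.bits

    def reset(self, seq):
        # Clear the filter that generation `seq` will use.
        self.filters[seq % 2] = 0

    def add(self, seq, key):
        for pos in self._positions(key):
            self.filters[seq % 2] |= 1 << pos

    def might_contain(self, seq, key):
        # False means "definitely not recorded"; True may be a false positive.
        bitset = self.filters[seq % 2]
        return all((bitset >> pos) & 1 for pos in self._positions(key))


bloom = GenerationalBloom()
seq = 7
bloom.reset(seq + 1)               # prepare the filter for the next walk
bloom.add(seq + 1, "pmd:0x7f00")   # record a populated branch for next time
seq += 1                           # the next walk queries what was recorded
worth_walking = bloom.might_contain(seq, "pmd:0x7f00")
```

A walker would skip a branch only when `might_contain` returns False; a false positive merely costs an unnecessary walk, which is why a probabilistic structure is acceptable here.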
||
|
|
f552a27afe |
io_uring: remove files pointer in cancellation functions
When doing cancellation, we use a parameter to indicate whether it comes from do_exit or exec. A boolean value is good enough for this, so remove the struct files * as it is not necessary. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
c54b245d01 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
"This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users
to have stricter limits on the resources created within a user
namespace."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
cred: add missing return error code when set_cred_ucounts() failed
ucounts: Silence warning in dec_rlimit_ucounts
ucounts: Set ucount_max to the largest positive value the type can hold
kselftests: Add test to check for rlimit changes in different user namespaces
Reimplement RLIMIT_MEMLOCK on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_NPROC on top of ucounts
Use atomic_t for ucounts reference counting
Add a reference to ucounts for each cred
Increase size of ucounts to atomic_long_t |
||
|
|
b4b27b9eed |
Revert "signal: Allow tasks to cache one sigqueue struct"
This reverts commits |
||
|
|
5449162ac0 |
do_wait: make PIDTYPE_PID case O(1) instead of O(n)
Add a special-case when waiting on a pid (via waitpid, waitid, wait4,
etc) to avoid doing an O(n) scan of children and tracees, and instead do
an O(1) lookup. This improves performance when waiting on a pid from a
thread group with many children and/or tracees.
Time to fork and then call waitpid on the child, from a task that
already has N children [1]:
N    | Before  | After
-----|---------|------
1    | 74 us   | 74 us
20   | 72 us   | 75 us
100  | 83 us   | 77 us
500  | 99 us   | 74 us
1000 | 179 us  | 75 us
5000 | 804 us  | 79 us
8000 | 1268 us | 78 us
[1]: https://lkml.org/lkml/2021/3/12/1567
This can make a substantial performance improvement for applications
with a thread that has many children or tracees and frequently needs to
wait on them. Tools that use ptrace to intercept syscalls for a large
number of processes are likely to fall into this category. In
particular this patch was developed while building a ptrace-based second
generation of the Shadow emulator [2], for which it allows us to avoid
quadratic scaling (without having to use a workaround that introduces a
~40% performance penalty) [3]. Other examples of tools that fall into
this category which this patch may help include User Mode Linux [4] and
DetTrace [5].
[2]: https://shadow.github.io/
[3]: https://github.com/shadow/shadow/issues/1134#issuecomment-798992292
[4]: https://en.wikipedia.org/wiki/User-mode_Linux
[5]: https://github.com/dettrace/dettrace
Link: https://lkml.kernel.org/r/20210314231544.9379-1-jnewsome@torproject.org
Signed-off-by: James Newsome <jnewsome@torproject.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
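The measurement described in this commit (time to fork and waitpid on a child from a task that already has N children) can be approximated from userspace. The following is a rough sketch, not the benchmark the author used: the function name is hypothetical, it assumes a Unix system where `os.fork` is available, and absolute numbers will be dominated by interpreter overhead.

```python
import os
import time


def time_fork_and_wait(n_children):
    """Time one fork()+waitpid() while n_children other children exist."""
    read_end, write_end = os.pipe()
    parked = []
    for _ in range(n_children):
        pid = os.fork()
        if pid == 0:
            # Parked child: drop its copy of the write end, then block
            # until every write end is closed and read() returns EOF.
            os.close(write_end)
            os.read(read_end, 1)
            os._exit(0)
        parked.append(pid)

    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)            # measured child exits immediately
    os.waitpid(pid, 0)         # wait on one specific pid (PIDTYPE_PID case)
    elapsed = time.perf_counter() - start

    os.close(write_end)        # release the parked children
    os.close(read_end)
    for child in parked:
        os.waitpid(child, 0)   # reap them so no zombies are left behind
    return elapsed


if __name__ == "__main__":
    for n in (1, 20, 100):
        print(n, f"{time_fork_and_wait(n) * 1e6:.0f} us")
```

Running it for increasing N on kernels with and without this change should show whether the wait cost still grows with the number of children, although fork() itself contributes a large part of each sample.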
||
|
|
21d1c5e386 |
Reimplement RLIMIT_NPROC on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.
To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.
service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)
The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.
We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.
Changelog
v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
inc_rlimit_ucounts() can be checked.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
||
|
|
4bad58ebc8 |
signal: Allow tasks to cache one sigqueue struct
The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite some of these
application scenarios a control task signals workers to start their
computations. There is usually only one signal per worker in flight. This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.
To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.
The task cache is not preallocated so the first signal sent to a task goes
always to the cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.
After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org
One concern leading to the original limitation was to avoid a large amount
of pointlessly cached sigqueues in alive tasks. The other concern was
vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.
The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are fewer than 10 sigqueues cached in ~1500 tasks.
In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.
If this should be limited then this would need an extra counter in struct
user, more atomic instructions and a separate rlimit. Yet another tunable
which is mostly unused.
The caching is actually used. After boot and a full kernel compile on a
64CPU machine with make -j128 the number of 'allocations' looks like this:
From slab: 23996
From task cache: 52223
I.e. it reduces the number of slab cache operations by ~68%.
A typical pattern there is:
<...>-58490 __sigqueue_alloc: for 58488 from slab ffff8881132df460
<...>-58488 __sigqueue_free: cache ffff8881132df460
<...>-58488 __sigqueue_alloc: for 1149 from cache ffff8881103dc550
bash-1149 exit_task_sighand: free ffff8881132df460
bash-1149 __sigqueue_free: cache ffff8881103dc550
The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit and bash sticks it back into its own
cache. Lather, rinse and repeat.
The caching is probably not noticeable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches are
usually just serving from the fast path the slab merging (default) can
depending on the usage pattern of the merged slabs cause occasional slow
path allocations.
The time spared per cached entry is a few microseconds per signal which is
not relevant for e.g. a kernel build, but for signal heavy workloads it's
measurable.
As there is no real downside of this caching mechanism making it
unconditionally available is preferred over more conditional code or new
magic tunables.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de |
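The one-entry cache described in this commit message can be modelled in a few lines. This is a toy sketch, not the kernel code: `Task`, `sigqueue_alloc`, `sigqueue_free` and the `stats` dictionary are hypothetical names, and a plain Python object stands in for a slab allocation. The helper at the end recomputes the ~68% figure from the two counts quoted in the message.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.sigqueue_cache = None   # at most one cached entry per task


def sigqueue_alloc(target, stats):
    """Take the target's cached entry if present, else 'allocate' one."""
    entry = target.sigqueue_cache
    if entry is not None:
        target.sigqueue_cache = None
        stats["cache"] += 1
    else:
        entry = object()             # stands in for a slab allocation
        stats["slab"] += 1
    return entry


def sigqueue_free(consumer, entry, stats):
    """Keep the entry in the consumer's empty cache slot, else release it."""
    if consumer.sigqueue_cache is None:
        consumer.sigqueue_cache = entry
    else:
        stats["freed"] += 1          # slot already taken: back to the slab


def cache_share(from_slab, from_cache):
    """Fraction of allocations served from the task cache."""
    return from_cache / (from_slab + from_cache)


stats = {"slab": 0, "cache": 0, "freed": 0}
worker = Task("worker")
for _ in range(3):                   # one signal in flight at a time
    queued = sigqueue_alloc(worker, stats)
    sigqueue_free(worker, queued, stats)

share = cache_share(23996, 52223)    # counts quoted in the commit message
```

With one signal in flight at a time, only the first allocation reaches the slab, which is the pattern the commit message targets; as it also notes, the first signal always goes to the allocator because the cache is not preallocated.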
||
|
|
b1b6b5a30d |
kernel/io_uring: cancel io_uring before task works
For cancelling io_uring requests it needs either to be able to run currently enqueued task_works or having it shut down by that moment. Otherwise io_uring_cancel_files() may be waiting for requests that won't ever complete. Go with the first way and do cancellations before setting PF_EXITING and so before putting the task_work infrastructure into a transition state where task_work_run() would better not be called. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
a787bdaff8 |
Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
5bc7850232 |
sched: fix exit_mm vs membarrier (v4)
exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.
exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.
The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.
Here is the scenario I have in mind:
Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).
Then, AFAIU, we can have:
Userspace variables:
int x = 0, y = 0;
CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)
x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
|
||
|
|
77f6ab8b77 |
don't dump the threads that had been already exiting when zapped.
Coredump logics needs to report not only the registers of the dumping thread, but (since 2.5.43) those of other threads getting killed. Doing that might require extra state saved on the stack in asm glue at kernel entry; signal delivery logics does that (we need to be able to save sigcontext there, at the very least) and so does seccomp. That covers all callers of do_coredump(). Secondary threads get hit with SIGKILL and caught as soon as they reach exit_mm(), which normally happens in signal delivery, so those are also fine most of the time. Unfortunately, it is possible to end up with secondary zapped when it has already entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm() when mm->core_state is already set, but the stack contents is not what we would have in signal delivery. At least on two architectures (alpha and m68k) it leads to infoleaks - we end up with a chunk of kernel stack written into coredump, with the contents consisting of normal C stack frames of the call chain leading to exit_mm() instead of the expected copy of userland registers. In case of alpha we leak 312 bytes of stack. Other architectures (including the regset-using ones) might have similar problems - the normal user of regsets is ptrace and the state of tracee at the time of such calls is special in the same way signal delivery is. Note that had the zapper gotten to the exiting thread slightly later, it wouldn't have been included into coredump anyway - we skip the threads that have already cleared their ->mm. So let's pretend that zapper always loses the race. IOW, have exit_mm() only insert into the dumper list if we'd gotten there from handling a fatal signal[*] As the result, the callers of do_exit() that have *not* gone through get_signal() are not seen by coredump logics as secondary threads. Which excludes voluntary exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that, so seccomp is fine. 
[*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed out that PF_SIGNALED is already doing just what we need. Cc: stable@vger.kernel.org Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
1aa92cd31c |
pid: move pidfd_get_pid() to pid.c
The process_madvise syscall needs the pidfd_get_pid function to translate a pidfd to a pid, so this patch moves the function to kernel/pid.c. Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ba7d25f3df |
exit: support non-blocking pidfds
Passing a non-blocking pidfd to waitid() currently has no effect, i.e. is not supported. There are users which would like to use waitid() on pidfds that are O_NONBLOCK and mix it with pidfds that are blocking and both pass them to waitid().
The expected behavior is to have waitid() return -EAGAIN for non-blocking pidfds and to block for blocking pidfds without needing to perform any additional checks for flags set on the pidfd before passing it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child process is ready yet. Returning -EAGAIN for non-blocking pidfds makes it easier for event loops that handle EAGAIN specially.
It also makes the API more consistent and uniform. In essence, waitid() is treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking socket.
With the addition of support for non-blocking pidfds we support the same functionality that sockets do. For sockets, recvmsg() supports MSG_DONTWAIT; for pidfds, waitid() supports WNOHANG. Both flags are per-call options. In contrast non-blocking pidfds and non-blocking sockets are a setting on an open file description affecting all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description. Both behaviors, per call and per open file description, have genuine use-cases.
The implementation should be straightforward:
- If a non-blocking pidfd is passed and WNOHANG is not raised we simply raise the WNOHANG flag internally. When do_wait() returns indicating that there are eligible child processes but none have exited yet we set EAGAIN. If no child process exists we continue returning ECHILD.
- If a non-blocking pidfd is passed and WNOHANG is raised waitid() will continue returning 0, i.e. it will not set EAGAIN. This ensures backwards compatibility with applications passing WNOHANG explicitly with pidfds.
A concrete use-case that was brought on-list was Josh's async pidfd library. Ever since the introduction of pidfds and more advanced async io various programming languages such as Rust have grown support for async event libraries. These libraries are created to help build epoll-based event loops around file descriptors. A common pattern is to automatically set all file descriptors they manage to O_NONBLOCK.
For such libraries the EAGAIN error code is treated specially. When a function is called that returns EAGAIN the function isn't called again until the event loop indicates the file descriptor is ready. Supporting EAGAIN when waiting on pidfds makes such libraries just work with little effort.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com |
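The outcomes listed in this commit message can be summarised as a small decision function. This is a model of the described behavior only, not a call into the real syscall: the function name, its boolean parameters and the `"blocks"` marker are hypothetical, and the check order is one plausible reading of the two bullet points.

```python
import errno


def waitid_pidfd_outcome(nonblocking, wnohang, child_exists, child_exited):
    """Toy model of waitid() on a pidfd as described above (not kernel code)."""
    if not child_exists:
        return -errno.ECHILD     # unchanged: no eligible child process
    if child_exited:
        return 0                 # the child is ready and gets reported
    if wnohang:
        return 0                 # explicit WNOHANG keeps returning 0
    if nonblocking:
        return -errno.EAGAIN     # WNOHANG raised internally, mapped to EAGAIN
    return "blocks"              # blocking pidfd: the caller sleeps


# An event loop would treat EAGAIN as "poll the pidfd and retry later".
outcome = waitid_pidfd_outcome(nonblocking=True, wnohang=False,
                               child_exists=True, child_exited=False)
```

The third branch is what preserves backwards compatibility: a caller that already passes WNOHANG never sees EAGAIN, regardless of the O_NONBLOCK setting on the pidfd.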
||
|
|
8043fc147a |
kernel: add a kernel_wait helper
Add a helper that waits for a pid and stores the status in the passed in kernel pointer. Use it to fix the usage of kernel_wait4 in call_usermodehelper_exec_sync that only happens to work due to the implicit set_fs(KERNEL_DS) for kernel threads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
fe81417596 |
exec: use force_uaccess_begin during exec and exit
Both exec and exit want to ensure that the uaccess routines actually do access user pointers. Use the newly added force_uaccess_begin helper instead of an open coded set_fs for that to prepare for kernel builds where set_fs() does not exist. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3950e97543 |
Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been digging into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabilize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christoph's set_fs changes should make security bugs with set_fs much
more difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
... |
||
|
|
9ecc6ea491 |
Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are a bunch of clean ups and selftest improvements along with
two major updates to the SECCOMP_RET_USER_NOTIF filter return:
EPOLLHUP support to more easily detect the death of a monitored
process, and being able to inject fds when intercepting syscalls that
expect an fd-opening side-effect (needed by both container folks and
Chrome). The latter continued the refactoring of __scm_install_fd()
started by Christoph, and in the process found and fixed a handful of
bugs in various callers.
- Improved selftest coverage, timeouts, and reporting
- Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
- Refactor __scm_install_fd() into __receive_fd() and fix buggy
callers
- Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
Dhillon)"
* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
seccomp: Introduce addfd ioctl to seccomp user notifier
fs: Expand __receive_fd() to accept existing fd
pidfd: Replace open-coded receive_fd()
fs: Add receive_fd() wrapper for __receive_fd()
fs: Move __scm_install_fd() to __receive_fd()
net/scm: Regularize compat handling of scm_detach_fds()
pidfd: Add missing sock updates for pidfd_getfd()
net/compat: Add missing sock updates for SCM_RIGHTS
selftests/seccomp: Check ENOSYS under tracing
selftests/seccomp: Refactor to use fixture variants
selftests/harness: Clean up kern-doc for fixtures
seccomp: Use -1 marker for end of mode 1 syscall list
seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
selftests/seccomp: Make kcmp() less required
seccomp: Use pr_fmt
selftests/seccomp: Improve calibration loop
selftests/seccomp: use 90s as timeout
selftests/seccomp: Expand benchmark to per-filter measurements
...
|
||
|
|
3f649ab728 |
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
        xargs perl -pi -e \
            's/\buninitialized_var\(([^\)]+)\)/\1/g;
             s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org> |
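To see what the two substitutions in that script do to a line of C, the same regular expressions can be applied with Python's `re` module. This is an illustrative sketch only: the function name and the sample inputs are hypothetical, and the commit itself used the perl one-liner quoted in the message.

```python
import re

# Same patterns as the perl one-liner in the commit message.
UNINIT = re.compile(r"\buninitialized_var\(([^\)]+)\)")
QUIET = re.compile(r"\s*/\* (GCC be quiet|to make compiler happy) \*/$")


def strip_uninitialized_var(source):
    """Apply both substitutions line by line, as `perl -p` would."""
    out = []
    for line in source.splitlines(keepends=True):
        line = UNINIT.sub(r"\1", line)   # uninitialized_var(x) -> x
        line = QUIET.sub("", line)       # drop the trailing excuse comment
        out.append(line)
    return "".join(out)


sample = "int uninitialized_var(ret); /* GCC be quiet */\n"
cleaned = strip_uninitialized_var(sample)
```

The first pattern unwraps the macro and keeps only its argument; the second removes the two stock trailing comments that typically accompanied it, including the whitespace in front of them.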
||
|
|
3a15fb6ed9 |
seccomp: release filter after task is fully dead
The seccomp filter used to be released in free_task() which is called asynchronously via call_rcu() and assorted mechanisms. Since we need to inform tasks waiting on the seccomp notifier when a filter goes empty we will notify them as soon as a task has been marked fully dead in release_task(). To not split seccomp cleanup into two parts, move filter release out of free_task() and into release_task() after we've unhashed struct task from struct pid, exited signals, and unlinked it from the threadgroups' thread list. We'll put the empty filter notification infrastructure into it in a follow up patch. This also renames put_seccomp_filter() to seccomp_filter_release() which is a more descriptive name of what we're doing here especially once we've added the empty filter notification mechanism in there. We're also NULL-ing the task's filter tree entrypoint which seems cleaner than leaving a dangling pointer in there. Note that this shouldn't need any memory barriers since we're calling this when the task is in release_task() which means it's EXIT_DEAD. So it can't modify its seccomp filters anymore. You can also see this from the point where we're calling seccomp_filter_release(). It's after __exit_signal() and at this point, tsk->sighand will already have been NULLed which is required for thread-sync and filter installation alike. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org> |
||
|
|
8c2f526639 |
umd: Remove exit_umh
The bpfilter code no longer uses the umd_info.cleanup callback. This callback is what exit_umh exists to call. So remove exit_umh and all of its associated bookkeeping.
v1: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87y2o53abg.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-15-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
38fd525a4c |
exit: Factor thread_group_exited out of pidfd_poll
Create an independent helper thread_group_exited which returns true when all threads have passed exit_notify in do_exit. AKA all of the threads are at least zombies and might be dead or completely gone.
Create this helper by taking the logic out of pidfd_poll where it is already tested, and adding a READ_ONCE on the read of task->exit_state.
I will be changing the user mode driver code to use this same logic to know when a user mode driver needs to be restarted.
Place the new helper thread_group_exited in kernel/exit.c and EXPORT it so it can be used by modules.
Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
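The check described in the thread_group_exited commit above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: `struct mock_task`, `mock_read_once()`, `mock_thread_group_empty()` and `mock_thread_group_exited()` are hypothetical stand-ins, the volatile read only approximates READ_ONCE(), and modelling "all threads have exited" as "the task has a nonzero exit_state and no other threads remain in its group" is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct task_struct. */
struct mock_task {
	int exit_state;   /* 0 while running; nonzero once zombie or dead */
	int live_threads; /* threads still in the group, including this one */
};

/* Approximates READ_ONCE(): force a single, non-cached load of the field. */
static int mock_read_once(const int *p)
{
	return *(const volatile int *)p;
}

/* True when this task is the only thread left in its group. */
static bool mock_thread_group_empty(const struct mock_task *t)
{
	assert(t->live_threads >= 1);
	return t->live_threads == 1;
}

/*
 * True when the whole group is at least a zombie. A NULL task models
 * the "completely gone" case where no task is attached any more.
 */
static bool mock_thread_group_exited(const struct mock_task *task)
{
	return !task ||
	       (mock_read_once(&task->exit_state) != 0 &&
		mock_thread_group_empty(task));
}
```

A caller such as a poll handler would treat a `true` result as "report the process as exited"; a task that has itself exited while sibling threads are still running yields `false`.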
|
|
1c340ead18 |
umd: Track user space drivers with struct pid
Use struct pid instead of user space pid values that are prone to wrap around. In addition track the entire thread group instead of just the first thread that is started by exec. There are no multi-threaded user mode drivers today but there is nothing precluding user drivers from being multi-threaded, so it is just a good idea to track the entire process.
Take a reference count on the tgid's in question to make it possible to remove exit_umh in a future change.
As a struct pid is available directly use kill_pid_info.
The prior process signalling code was iffy in using a userspace pid known to be in the initial pid namespace and then looking up its task in whatever the current pid namespace is. It worked only because kernel threads always run in the initial pid namespace.
As the tgid is now refcounted verify the tgid is NULL at the start of fork_usermode_driver to avoid the possibility of silent pid leaks.
v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
884c5e683b |
umh: Separate the user mode driver and the user mode helper support
This makes it clear which code is part of the core user mode helper support and which code is needed to implement user mode drivers.
This makes the kernel smaller for everyone who does not use a usermode driver.
v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
c1e8d7c6a7 |
mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.
[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d8ed45c5dc |
mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead.
The change is generated using coccinelle with the following rule:
// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
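The effect of the conversion described in the coccinelle commit above is that call sites pass the mm itself instead of reaching into it for the lock field. A minimal userspace sketch of that wrapper pattern follows. Everything prefixed `toy_` is hypothetical, and the "lock" is a single-threaded counter that only records state, not a real rwsem; only the wrapper names mirror the ones listed in the rule, and their bodies here are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a reader/writer semaphore. */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* Toy stand-in for struct mm_struct; only the lock field is modelled. */
struct toy_mm {
	struct toy_rwsem mmap_lock;
};

static void toy_down_read(struct toy_rwsem *s)
{
	assert(!s->writer);
	s->readers++;
}

static void toy_up_read(struct toy_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

static void toy_down_write(struct toy_rwsem *s)
{
	assert(!s->writer && s->readers == 0);
	s->writer = true;
}

static void toy_up_write(struct toy_rwsem *s)
{
	assert(s->writer);
	s->writer = false;
}

/* Wrappers: callers name the mm, never the lock field inside it. */
static void mmap_read_lock(struct toy_mm *mm)    { toy_down_read(&mm->mmap_lock); }
static void mmap_read_unlock(struct toy_mm *mm)  { toy_up_read(&mm->mmap_lock); }
static void mmap_write_lock(struct toy_mm *mm)   { toy_down_write(&mm->mmap_lock); }
static void mmap_write_unlock(struct toy_mm *mm) { toy_up_write(&mm->mmap_lock); }
```

With wrappers like these, a call site changes from `down_read(&mm->mmap_sem)` to `mmap_read_lock(mm)`, so the field can later be renamed or its type changed without touching every caller.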
|
|
e31cf2f4ca |
mm: don't include asm/pgtable.h if linux/mm.h is already included
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
	return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
	return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always the
possibility of overriding the generic version with the usual ifdef magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are removed with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
	sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
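The index arithmetic quoted in the commit message above can be checked with concrete numbers. The sketch below assumes `PMD_SHIFT` of 21 and `PTRS_PER_PMD` of 512, values chosen purely for illustration (they correspond to a 4 KiB page, 512-entries-per-table layout) and not taken from the commit. `pmd_offset_in()` is a hypothetical simplification of `pmd_offset()` that takes the table base directly instead of deriving it from a `pud_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; real values are per-architecture. */
#define PMD_SHIFT    21
#define PTRS_PER_PMD 512

/* Minimal stand-in for the kernel's pmd_t. */
typedef struct { uint64_t val; } pmd_t;

/* Same expression as in the commit message: select the PMD slot bits. */
static inline unsigned long pmd_index(unsigned long address)
{
	return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

/* Hypothetical helper: index into a PMD table whose base is already known. */
static inline pmd_t *pmd_offset_in(pmd_t *table, unsigned long address)
{
	assert(table != NULL);
	return table + pmd_index(address);
}
```

Under these assumptions each PMD entry covers 2 MiB, so address 0x200000 falls in slot 1, and the index wraps back to 0 at 0x40000000 (512 slots of 2 MiB each).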
|
|
039aeb9deb |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
|
||
|
|
d479c5a191 |
Merge tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The changes in this cycle are:
- Optimize the task wakeup CPU selection logic, to improve
scalability and reduce wakeup latency spikes
- PELT enhancements
- CFS bandwidth handling fixes
- Optimize the wakeup path by removing rq->wake_list and replacing it
with ->ttwu_pending
- Optimize IPI cross-calls by making flush_smp_call_function_queue()
process sync callbacks first.
- Misc fixes and enhancements"
* tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
sched: Replace rq::wake_list
sched: Add rq::ttwu_pending
irq_work, smp: Allow irq_work on call_single_queue
smp: Optimize send_call_function_single_ipi()
smp: Move irq_work_run() out of flush_smp_call_function_queue()
smp: Optimize flush_smp_call_function_queue()
sched: Fix smp_call_function_single_async() usage for ILB
sched/core: Offload wakee task activation if it the wakee is descheduling
sched/core: Optimize ttwu() spinning on p->on_cpu
sched: Defend cfs and rt bandwidth quota against overflow
sched/cpuacct: Fix charge cpuacct.usage_sys
sched/fair: Replace zero-length array with flexible-array
sched/pelt: Sync util/runnable_sum with PELT window when propagating
sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
sched/fair: Optimize enqueue_task_fair()
sched: Make scheduler_ipi inline
sched: Clean up scheduler_ipi()
sched/core: Simplify sched_init()
...
|
||
|
|
e148a8f948 |
Merge branch 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/readdir updates from Al Viro:
"Finishing the conversion of readdir.c to unsafe_... API.
This includes the uaccess_{read,write}_begin series by Christophe
Leroy"
* 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
readdir.c: get compat_filldir() more or less in sync with filldir()
switch readdir(2) to unsafe_copy_dirent_name()
drm/i915/gem: Replace user_access_begin by user_write_access_begin
uaccess: Selectively open read or write user access
uaccess: Add user_read_access_begin/end and user_write_access_begin/end
|
||
|
|
9d5272f5e3 | Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD | ||
|
|
9d9a6ebfea |
rcuwait: Let rcuwait_wake_up() return whether or not a task was awoken
Propagating the return value of wake_up_process() back to the caller can come in handy for future users, such as for statistics or accounting purposes.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-3-dave@stgolabs.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
c9d64a1b2d |
rcuwait: Fix stale wake call name in comment
The 'trywake' call was renamed to simply 'wake'; update the comment.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-2-dave@stgolabs.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
41cd780524 |
uaccess: Selectively open read or write user access
When opening user access to only perform reads, only open read access.
When opening user access to only perform writes, only open write access.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr |
||
|
|
586b58cac8 |
exit: Move preemption fixup up, move blocking operations down
With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
non-preemptible context look untidy; after the main oops, the kernel prints
a "sleeping function called from invalid context" report because
exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
fixup.
It looks like the same thing applies to profile_task_exit() and
kcov_task_exit().
Fix it by moving the preemption fixup up and the calls to
profile_task_exit() and kcov_task_exit() down.
Fixes:
|
||
|
|
6ade99ec61 |
proc: Put thread_pid in release_task not proc_flush_pid
Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.
Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.
When possible it makes sense to keep get and put in the same function
so it can easily be seen how they pair up.
Fixes:
|
||
|
|
d987ca1c6b |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.
Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.
The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
... |
||
|
|
dbb381b619 |
Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping and timer updates from Thomas Gleixner:
"Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers, which contain only the minimally required information which
is necessary to build the vDSO. These new headers are included from
the kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by
PPC.
- Cleanup and consolidation of the exit related code in posix CPU
timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the
place"
* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
vdso: Fix clocksource.h macro detection
um: Fix header inclusion
arm64: vdso32: Enable Clang Compilation
lib/vdso: Enable common headers
arm: vdso: Enable arm to use common headers
x86/vdso: Enable x86 to use common headers
mips: vdso: Enable mips to use common headers
arm64: vdso32: Include common headers in the vdso library
arm64: vdso: Include common headers in the vdso library
arm64: Introduce asm/vdso/processor.h
arm64: vdso32: Code clean up
linux/elfnote.h: Replace elf.h with UAPI equivalent
scripts: Fix the inclusion order in modpost
common: Introduce processor.h
linux/ktime.h: Extract common header for vDSO
linux/jiffies.h: Extract common header for vDSO
linux/time64.h: Extract common header for vDSO
linux/time32.h: Extract common header for vDSO
linux/time.h: Extract common header for vDSO
...
|
||
|
|
4b9fd8a829 |
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued user-access cleanups in the futex code.
- percpu-rwsem rewrite that uses its own waitqueue and atomic_t
instead of an embedded rwsem. This addresses a couple of
weaknesses, but the primary motivation was complications on the -rt
kernel.
- Introduce raw lock nesting detection on lockdep
(CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
lock differences. This too originates from -rt.
- Reuse lockdep zapped chain_hlocks entries, to conserve RAM
footprint on distro-ish kernels running into the "BUG:
MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
chain-entries pool.
- Misc cleanups, smaller fixes and enhancements - see the changelog
for details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
Documentation/locking/locktypes: Minor copy editor fixes
Documentation/locking/locktypes: Further clarifications and wordsmithing
m68knommu: Remove mm.h include from uaccess_no.h
x86: get rid of user_atomic_cmpxchg_inatomic()
generic arch_futex_atomic_op_inuser() doesn't need access_ok()
x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
objtool: whitelist __sanitizer_cov_trace_switch()
[parisc, s390, sparc64] no need for access_ok() in futex handling
sh: no need of access_ok() in arch_futex_atomic_op_inuser()
futex: arch_futex_atomic_op_inuser() calling conventions change
completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
lockdep: Add posixtimer context tracing bits
lockdep: Annotate irq_work
lockdep: Add hrtimer context tracing bits
lockdep: Introduce wait-type checks
completion: Use simple wait queues
sched/swait: Prepare usage in completions
...
|
||
|
|
b95e31c07c |
posix-cpu-timers: Stop disabling timers on mt-exec
The reasons why the extra posix_cpu_timers_exit_group() invocation has been
added are not entirely clear from the commit message. Today all that
posix_cpu_timers_exit_group() does is stop timers that are tracking the
task from firing. Every other operation on those timers is still allowed.
The practical implication of this is posix_cpu_timer_del() which could
not get the siglock after the thread group leader has exited (because
sighand == NULL) would be able to run successfully because the timer
was already dequeued.
With that locking issue fixed there is no point in disabling all of the
timers. So remove this ``temporary'' hack.
Fixes:
|