Changes in 5.4.42
net: dsa: Do not make user port errors fatal
shmem: fix possible deadlocks on shmlock_user_lock
net: phy: microchip_t1: add lan87xx_phy_init to initialize the lan87xx phy.
KVM: arm: vgic: Synchronize the whole guest on GIC{D,R}_I{S,C}ACTIVER read
gpio: pca953x: Fix pca953x_gpio_set_config
SUNRPC: Add "@len" parameter to gss_unwrap()
SUNRPC: Fix GSS privacy computation of auth->au_ralign
net/sonic: Fix a resource leak in an error handling path in 'jazz_sonic_probe()'
net: moxa: Fix a potential double 'free_irq()'
ftrace/selftests: workaround cgroup RT scheduling issues
drop_monitor: work around gcc-10 stringop-overflow warning
virtio-blk: handle block_device_operations callbacks after hot unplug
sun6i: dsi: fix gcc-4.8
net_sched: fix tcm_parent in tc filter dump
scsi: sg: add sg_remove_request in sg_write
selftests/bpf: fix goto cleanup label not defined
mmc: sdhci-acpi: Add SDHCI_QUIRK2_BROKEN_64_BIT_DMA for AMDI0040
dpaa2-eth: properly handle buffer size restrictions
net: fix a potential recursive NETDEV_FEAT_CHANGE
netlabel: cope with NULL catmap
net: phy: fix aneg restart in phy_ethtool_set_eee
net: stmmac: fix num_por initialization
pppoe: only process PADT targeted at local interfaces
Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu"
tcp: fix error recovery in tcp_zerocopy_receive()
tcp: fix SO_RCVLOWAT hangs with fat skbs
virtio_net: fix lockdep warning on 32 bit
dpaa2-eth: prevent array underflow in update_cls_rule()
hinic: fix a bug of ndo_stop
net: dsa: loop: Add module soft dependency
net: ipv4: really enforce backoff for redirects
netprio_cgroup: Fix unlimited memory leak of v2 cgroups
net: tcp: fix rx timestamp behavior for tcp_recvmsg
nfp: abm: fix error return code in nfp_abm_vnic_alloc()
r8169: re-establish support for RTL8401 chip version
umh: fix memory leak on execve failure
riscv: fix vdso build with lld
dmaengine: pch_dma.c: Avoid data race between probe and irq handler
dmaengine: mmp_tdma: Do not ignore slave config validation errors
dmaengine: mmp_tdma: Reset channel error on release
selftests/ftrace: Check the first record for kprobe_args_type.tc
cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once
ALSA: hda/hdmi: fix race in monitor detection during probe
drm/amd/powerplay: avoid using pm_en before it is initialized revised
drm/amd/display: check if REFCLK_CNTL register is present
drm/amd/display: Update downspread percent to match spreadsheet for DCN2.1
drm/qxl: lost qxl_bo_kunmap_atomic_page in qxl_image_init_helper()
drm/amdgpu: simplify padding calculations (v2)
drm/amdgpu: invalidate L2 before SDMA IBs (v2)
ipc/util.c: sysvipc_find_ipc() incorrectly updates position index
ALSA: hda/realtek - Fix S3 pop noise on Dell Wyse
gfs2: Another gfs2_walk_metadata fix
mmc: sdhci-pci-gli: Fix no irq handler from suspend
IB/hfi1: Fix another case where pq is left on waitlist
ACPI: EC: PM: Avoid premature returns from acpi_s2idle_wake()
pinctrl: sunrisepoint: Fix PAD lock register offset for SPT-H
pinctrl: baytrail: Enable pin configuration setting for GPIO chip
pinctrl: qcom: fix wrong write in update_dual_edge
pinctrl: cherryview: Add missing spinlock usage in chv_gpio_irq_handler
bpf: Fix error return code in map_lookup_and_delete_elem()
ALSA: firewire-lib: fix 'function sizeof not defined' error of tracepoints format
i40iw: Fix error handling in i40iw_manage_arp_cache()
drm/i915: Don't enable WaIncreaseLatencyIPCEnabled when IPC is disabled
bpf, sockmap: msg_pop_data can incorrecty set an sge length
bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size
mmc: alcor: Fix a resource leak in the error path for ->probe()
mmc: sdhci-pci-gli: Fix can not access GL9750 after reboot from Windows 10
mmc: core: Check request type before completing the request
mmc: core: Fix recursive locking issue in CQE recovery path
mmc: block: Fix request completion in the CQE timeout path
gfs2: More gfs2_find_jhead fixes
fork: prevent accidental access to clone3 features
drm/amdgpu: force fbdev into vram
NFS: Fix fscache super_cookie index_key from changing after umount
nfs: fscache: use timespec64 in inode auxdata
NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
netfilter: conntrack: avoid gcc-10 zero-length-bounds warning
drm/i915/gvt: Fix kernel oops for 3-level ppgtt guest
arm64: fix the flush_icache_range arguments in machine_kexec
nfs: fix NULL deference in nfs4_get_valid_delegation
SUNRPC: Signalled ASYNC tasks need to exit
netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
netfilter: nft_set_rbtree: Add missing expired checks
RDMA/rxe: Always return ERR_PTR from rxe_create_mmap_info()
IB/mlx4: Test return value of calls to ib_get_cached_pkey
IB/core: Fix potential NULL pointer dereference in pkey cache
RDMA/core: Fix double put of resource
RDMA/iw_cxgb4: Fix incorrect function parameters
hwmon: (da9052) Synchronize access with mfd
s390/ism: fix error return code in ism_probe()
mm, memcg: fix inconsistent oom event behavior
NFSv3: fix rpc receive buffer size for MOUNT call
pnp: Use list_for_each_entry() instead of open coding
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
Stop the ad-hoc games with -Wno-maybe-initialized
gcc-10: disable 'zero-length-bounds' warning for now
gcc-10: disable 'array-bounds' warning for now
gcc-10: disable 'stringop-overflow' warning for now
gcc-10: disable 'restrict' warning for now
gcc-10 warnings: fix low-hanging fruit
gcc-10: mark more functions __init to avoid section mismatch warnings
gcc-10: avoid shadowing standard library 'free()' in crypto
usb: usbfs: correct kernel->user page attribute mismatch
USB: usbfs: fix mmap dma mismatch
ALSA: hda/realtek - Limit int mic boost for Thinkpad T530
ALSA: hda/realtek - Add COEF workaround for ASUS ZenBook UX431DA
ALSA: rawmidi: Fix racy buffer resize under concurrent accesses
ALSA: usb-audio: Add control message quirk delay for Kingston HyperX headset
usb: core: hub: limit HUB_QUIRK_DISABLE_AUTOSUSPEND to USB5534B
usb: host: xhci-plat: keep runtime active when removing host
usb: cdns3: gadget: prev_req->trb is NULL for ep0
USB: gadget: fix illegal array access in binding with UDC
usb: xhci: Fix NULL pointer dereference when enqueuing trbs from urb sg list
Make the "Reducing compressed framebufer size" message be DRM_INFO_ONCE()
ARM: dts: dra7: Fix bus_dma_limit for PCIe
ARM: dts: imx27-phytec-phycard-s-rdk: Fix the I2C1 pinctrl entries
ARM: dts: imx6dl-yapp4: Fix Ursa board Ethernet connection
drm/amd/display: add basic atomic check for cursor plane
powerpc/32s: Fix build failure with CONFIG_PPC_KUAP_DEBUG
cifs: fix leaked reference on requeued write
x86: Fix early boot crash on gcc-10, third try
x86/unwind/orc: Fix error handling in __unwind_start()
exec: Move would_dump into flush_old_exec
clk: rockchip: fix incorrect configuration of rk3228 aclk_gpu* clocks
dwc3: Remove check for HWO flag in dwc3_gadget_ep_reclaim_trb_sg()
fanotify: fix merging marks masks with FAN_ONDIR
usb: gadget: net2272: Fix a memory leak in an error handling path in 'net2272_plat_probe()'
usb: gadget: audio: Fix a missing error return value in audio_bind()
usb: gadget: legacy: fix error return code in gncm_bind()
usb: gadget: legacy: fix error return code in cdc_bind()
Revert "ALSA: hda/realtek: Fix pop noise on ALC225"
clk: Unlink clock if failed to prepare or enable
arm64: dts: meson-g12b-khadas-vim3: add missing frddr_a status property
arm64: dts: meson-g12-common: fix dwc2 clock names
arm64: dts: rockchip: Replace RK805 PMIC node name with "pmic" on rk3328 boards
arm64: dts: rockchip: Rename dwc3 device nodes on rk3399 to make dtc happy
arm64: dts: imx8mn: Change SDMA1 ahb clock for imx8mn
ARM: dts: r8a73a4: Add missing CMT1 interrupts
arm64: dts: renesas: r8a77980: Fix IPMMU VIP[01] nodes
ARM: dts: r8a7740: Add missing extal2 to CPG node
SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
bpf: Fix sk_psock refcnt leak when receiving message
KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce
Makefile: disallow data races on gcc-10 as well
libbpf: Extract and generalize CPU mask parsing logic
selftest/bpf: fix backported test_select_reuseport selftest changes
bpf: Test_progs, fix test_get_stack_rawtp_err.c build
Linux 5.4.42
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2172c08d027990b35ae55a343f31f0d7588c77a5
[ Upstream commit 3f2c788a13 ]
Jan reported an issue where an interaction between sign-extending clone's
flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes
clone() to consistently fail with EBADF.
The whole story is a little longer. The legacy clone() syscall is odd in a
bunch of ways and here two things interact. First, legacy clone's flag
argument is word-size dependent, i.e. it's an unsigned long whereas most
system calls with flag arguments use int or unsigned int. Second, legacy
clone() ignores unknown and deprecated flags. The two of them taken
together means that users on 64bit systems can pass garbage for the upper
32bit of the clone() syscall since forever and things would just work fine.
Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed
and on v5.7-rc1 where this will fail with EBADF:
int main(int argc, char *argv[])
{
pid_t pid;
/* Note that legacy clone() has different argument ordering on
* different architectures so this won't work everywhere.
*
* Only set the upper 32 bits.
*/
pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
NULL, NULL, NULL, NULL);
if (pid < 0)
exit(EXIT_FAILURE);
if (pid == 0)
exit(EXIT_SUCCESS);
if (wait(NULL) != pid)
exit(EXIT_FAILURE);
exit(EXIT_SUCCESS);
}
Since legacy clone() couldn't be extended this was not a problem so far and
nobody really noticed or cared since nothing in the kernel ever bothered to
look at the upper 32 bits.
But once we introduced clone3() and expanded the flag argument in struct
clone_args to 64 bit we opened this can of worms. With the first flag-based
extension to clone3() making use of the upper 32 bits of the flag argument
we've effectively made it possible for the legacy clone() syscall to reach
clone3() only flags. The sign extension scenario is just the odd
corner-case that we needed to figure this out.
The reason we just realized this now and not already when we introduced
CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup
file descriptor has been given. So the sign extension (or the user
accidently passing garbage for the upper 32 bits) caused the
CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it
didn't find a valid cgroup file descriptor.
Let's fix this by always capping the upper 32 bits for all codepaths that
are not aware of clone3() features. This ensures that we can't reach
clone3() only features by accident via legacy clone as with the sign
extension case and also that legacy clone() works exactly like before, i.e.
ignoring any unknown flags. This solution risks no regressions and is also
pretty clean.
Fixes: 7f192e3cd3 ("fork: add clone3")
Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: libc-alpha@sourceware.org
Cc: stable@vger.kernel.org # 5.3+
Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html
Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.4.29
mmc: core: Allow host controllers to require R1B for CMD6
mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for erase/trim/discard
mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for eMMC sleep command
mmc: sdhci-omap: Fix busy detection by enabling MMC_CAP_NEED_RSP_BUSY
mmc: sdhci-tegra: Fix busy detection by enabling MMC_CAP_NEED_RSP_BUSY
ACPI: PM: s2idle: Rework ACPI events synchronization
cxgb4: fix throughput drop during Tx backpressure
cxgb4: fix Txq restart check during backpressure
geneve: move debug check after netdev unregister
hsr: fix general protection fault in hsr_addr_is_self()
ipv4: fix a RCU-list lock in inet_dump_fib()
macsec: restrict to ethernet devices
mlxsw: pci: Only issue reset when system is ready
mlxsw: spectrum_mr: Fix list iteration in error path
net/bpfilter: fix dprintf usage for /dev/kmsg
net: cbs: Fix software cbs to consider packet sending time
net: dsa: Fix duplicate frames flooded by learning
net: dsa: mt7530: Change the LINK bit to reflect the link status
net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
net: ena: Add PCI shutdown handler to allow safe kexec
net: mvneta: Fix the case where the last poll did not process all rx
net/packet: tpacket_rcv: avoid a producer race condition
net: phy: dp83867: w/a for fld detect threshold bootstrapping issue
net: phy: mdio-bcm-unimac: Fix clock handling
net: phy: mdio-mux-bcm-iproc: check clk_prepare_enable() return value
net: qmi_wwan: add support for ASKEY WWHC050
net/sched: act_ct: Fix leak of ct zone template on replace
net_sched: cls_route: remove the right filter from hashtable
net_sched: hold rtnl lock in tcindex_partial_destroy_work()
net_sched: keep alloc_hash updated after hash allocation
net: stmmac: dwmac-rk: fix error path in rk_gmac_probe
NFC: fdp: Fix a signedness bug in fdp_nci_send_patch()
r8169: re-enable MSI on RTL8168c
slcan: not call free_netdev before rtnl_unlock in slcan_open
tcp: also NULL skb->dev when copy was needed
tcp: ensure skb->dev is NULL before leaving TCP stack
tcp: repair: fix TCP_QUEUE_SEQ implementation
vxlan: check return value of gro_cells_init()
bnxt_en: Fix Priority Bytes and Packets counters in ethtool -S.
bnxt_en: fix memory leaks in bnxt_dcbnl_ieee_getets()
bnxt_en: Return error if bnxt_alloc_ctx_mem() fails.
bnxt_en: Free context memory after disabling PCI in probe error path.
bnxt_en: Reset rings if ring reservation fails during open()
net: ip_gre: Separate ERSPAN newlink / changelink callbacks
net: ip_gre: Accept IFLA_INFO_DATA-less configuration
hsr: use rcu_read_lock() in hsr_get_node_{list/status}()
hsr: add restart routine into hsr_get_node_list()
hsr: set .netnsok flag
net/mlx5: DR, Fix postsend actions write length
net/mlx5e: Enhance ICOSQ WQE info fields
net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset
net/mlx5e: Fix ICOSQ recovery flow with Striding RQ
net/mlx5e: Do not recover from a non-fatal syndrome
cgroup-v1: cgroup_pidlist_next should update position index
nfs: add minor version to nfs_server_key for fscache
cpupower: avoid multiple definition with gcc -fno-common
drivers/of/of_mdio.c:fix of_mdiobus_register()
cgroup1: don't call release_agent when it is ""
dt-bindings: net: FMan erratum A050385
arm64: dts: ls1043a: FMan erratum A050385
fsl/fman: detect FMan erratum A050385
drm/amd/display: update soc bb for nv14
drm/amdgpu: correct ROM_INDEX/DATA offset for VEGA20
drm/exynos: Fix cleanup of IOMMU related objects
iommu/vt-d: Silence RCU-list debugging warnings
s390/qeth: don't reset default_out_queue
s390/qeth: handle error when backing RX buffer
scsi: ipr: Fix softlockup when rescanning devices in petitboot
mac80211: Do not send mesh HWMP PREQ if HWMP is disabled
dpaa_eth: Remove unnecessary boolean expression in dpaa_get_headroom
sxgbe: Fix off by one in samsung driver strncpy size arg
net: hns3: fix "tc qdisc del" failed issue
iommu/vt-d: Fix debugfs register reads
iommu/vt-d: Populate debugfs if IOMMUs are detected
iwlwifi: mvm: fix non-ACPI function
i2c: hix5hd2: add missed clk_disable_unprepare in remove
Input: raydium_i2c_ts - fix error codes in raydium_i2c_boot_trigger()
Input: fix stale timestamp on key autorepeat events
Input: synaptics - enable RMI on HP Envy 13-ad105ng
Input: avoid BIT() macro usage in the serio.h UAPI header
IB/rdmavt: Free kernel completion queue when done
RDMA/core: Fix missing error check on dev_set_name()
gpiolib: Fix irq_disable() semantics
RDMA/nl: Do not permit empty devices names during RDMA_NLDEV_CMD_NEWLINK/SET
RDMA/mad: Do not crash if the rdma device does not have a umad interface
ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
ceph: fix memory leak in ceph_cleanup_snapid_map()
ARM: dts: dra7: Add bus_dma_limit for L3 bus
ARM: dts: omap5: Add bus_dma_limit for L3 bus
x86/ioremap: Fix CONFIG_EFI=n build
perf probe: Fix to delete multiple probe event
perf probe: Do not depend on dwfl_module_addrsym()
rtlwifi: rtl8188ee: Fix regression due to commit d1d1a96bdb
tools: Let O= makes handle a relative path with -C option
scripts/dtc: Remove redundant YYLOC global declaration
scsi: sd: Fix optimal I/O size for devices that change reported values
nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
mac80211: drop data frames without key on encrypted links
mac80211: mark station unauthorized before key removal
mm/swapfile.c: move inode_lock out of claim_swapfile
drivers/base/memory.c: indicate all memory blocks as removable
mm/sparse: fix kernel crash with pfn_section_valid check
mm: fork: fix kernel_stack memcg stats for various stack implementations
gpiolib: acpi: Correct comment for HP x2 10 honor_wakeup quirk
gpiolib: acpi: Rework honor_wakeup option into an ignore_wake option
gpiolib: acpi: Add quirk to ignore EC wakeups on HP x2 10 BYT + AXP288 model
bpf: Fix cgroup ref leak in cgroup_bpf_inherit on out-of-memory
RDMA/core: Ensure security pkey modify is not lost
afs: Fix handling of an abort from a service handler
genirq: Fix reference leaks on irq affinity notifiers
xfrm: handle NETDEV_UNREGISTER for xfrm device
vti[6]: fix packet tx through bpf_redirect() in XinY cases
RDMA/mlx5: Fix the number of hwcounters of a dynamic counter
RDMA/mlx5: Fix access to wrong pointer while performing flush due to error
RDMA/mlx5: Block delay drop to unprivileged users
xfrm: fix uctx len check in verify_sec_ctx_len
xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
xfrm: policy: Fix doulbe free in xfrm_policy_timer
afs: Fix client call Rx-phase signal handling
afs: Fix some tracing details
afs: Fix unpinned address list during probing
ieee80211: fix HE SPR size calculation
mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
netfilter: flowtable: reload ip{v6}h in nf_flow_tuple_ip{v6}
netfilter: nft_fwd_netdev: validate family and chain type
netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
i2c: nvidia-gpu: Handle timeout correctly in gpu_i2c_check_status()
bpf, x32: Fix bug with JMP32 JSET BPF_X checking upper bits
bpf: Initialize storage pointers to NULL to prevent freeing garbage pointer
bpf/btf: Fix BTF verification of enum members in struct/union
bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free
ARM: dts: sun8i-a83t-tbs-a711: Fix USB OTG mode detection
vti6: Fix memory leak of skb if input policy check fails
r8169: fix PHY driver check on platforms w/o module softdeps
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
bpf: Undo incorrect __reg_bound_offset32 handling
USB: serial: option: add support for ASKEY WWHC050
USB: serial: option: add BroadMobi BM806U
USB: serial: option: add Wistron Neweb D19Q1
USB: cdc-acm: restore capability check order
USB: serial: io_edgeport: fix slab-out-of-bounds read in edge_interrupt_callback
usb: musb: fix crash with highmen PIO and usbmon
media: flexcop-usb: fix endpoint sanity check
media: usbtv: fix control-message timeouts
staging: kpc2000: prevent underflow in cpld_reconfigure()
staging: rtl8188eu: Add ASUS USB-N10 Nano B1 to device table
staging: wlan-ng: fix ODEBUG bug in prism2sta_disconnect_usb
staging: wlan-ng: fix use-after-free Read in hfa384x_usbin_callback
ahci: Add Intel Comet Lake H RAID PCI ID
libfs: fix infoleak in simple_attr_read()
media: ov519: add missing endpoint sanity checks
media: dib0700: fix rc endpoint lookup
media: stv06xx: add missing descriptor sanity checks
media: xirlink_cit: add missing descriptor sanity checks
media: v4l2-core: fix a use-after-free bug of sd->devnode
net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
Linux 5.4.29
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iebce1f1b95935de9229e2deb83dae66cf8661a88
commit 8380ce4790 upstream.
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg . In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state(). It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .
Note: This is a special version of the patch created for stable
backports. It contains code from the following two patches:
- mm: memcg/slab: introduce mem_cgroup_from_obj()
- mm: fork: fix kernel_stack memcg stats for various stack implementations
[guro@fb.com: introduce mem_cgroup_from_obj()]
Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.4.12
chardev: Avoid potential use-after-free in 'chrdev_open()'
i2c: fix bus recovery stop mode timing
powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()
usb: chipidea: host: Disable port power only if previously enabled
ALSA: usb-audio: Apply the sample rate quirk for Bose Companion 5
ALSA: hda/realtek - Add new codec supported for ALCS1200A
ALSA: hda/realtek - Set EAPD control to default for ALC222
ALSA: hda/realtek - Add quirk for the bass speaker on Lenovo Yoga X1 7th gen
tpm: Revert "tpm_tis: reserve chip for duration of tpm_tis_core_init"
tpm: Revert "tpm_tis_core: Set TPM_CHIP_FLAG_IRQ before probing for interrupts"
tpm: Revert "tpm_tis_core: Turn on the TPM before probing IRQ's"
tpm: Handle negative priv->response_len in tpm_common_read()
rtc: sun6i: Add support for RTC clocks on R40
kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail
tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined
tracing: Change offset type to s32 in preempt/irq tracepoints
HID: Fix slab-out-of-bounds read in hid_field_extract
HID: uhid: Fix returning EPOLLOUT from uhid_char_poll
HID: hidraw: Fix returning EPOLLOUT from hidraw_poll
HID: hid-input: clear unmapped usages
Input: add safety guards to input_set_keycode()
Input: input_event - fix struct padding on sparc64
drm/i915: Add Wa_1408615072 and Wa_1407596294 to icl,ehl
Revert "drm/amdgpu: Set no-retry as default."
drm/sun4i: tcon: Set RGB DCLK min. divider based on hardware model
drm/fb-helper: Round up bits_per_pixel if possible
drm/dp_mst: correct the shifting in DP_REMOTE_I2C_READ
drm/i915: Add Wa_1407352427:icl,ehl
drm/i915/gt: Mark up virtual engine uabi_instance
IB/hfi1: Adjust flow PSN with the correct resync_psn
can: kvaser_usb: fix interface sanity check
can: gs_usb: gs_usb_probe(): use descriptors of current altsetting
can: tcan4x5x: tcan4x5x_can_probe(): get the device out of standby before register access
can: mscan: mscan_rx_poll(): fix rx path lockup when returning from polling to irq mode
can: can_dropped_invalid_skb(): ensure an initialized headroom in outgoing CAN sk_buffs
gpiolib: acpi: Turn dmi_system_id table into a generic quirk table
gpiolib: acpi: Add honor_wakeup module-option + quirk mechanism
pstore/ram: Regularize prz label allocation lifetime
staging: vt6656: set usb_set_intfdata on driver fail.
staging: vt6656: Fix non zero logical return of, usb_control_msg
usb: cdns3: should not use the same dev_id for shared interrupt handler
usb: ohci-da8xx: ensure error return on variable error is set
USB-PD tcpm: bad warning+size, PPS adapters
USB: serial: option: add ZLP support for 0x1bc7/0x9010
usb: musb: fix idling for suspend after disconnect interrupt
usb: musb: Disable pullup at init
usb: musb: dma: Correct parameter passed to IRQ handler
staging: comedi: adv_pci1710: fix AI channels 16-31 for PCI-1713
staging: vt6656: correct return of vnt_init_registers.
staging: vt6656: limit reg output to block size
staging: rtl8188eu: Add device code for TP-Link TL-WN727N v5.21
serdev: Don't claim unsupported ACPI serial devices
iommu/vt-d: Fix adding non-PCI devices to Intel IOMMU
tty: link tty and port before configuring it as console
tty: always relink the port
arm64: Move __ARCH_WANT_SYS_CLONE3 definition to uapi headers
arm64: Implement copy_thread_tls
arm: Implement copy_thread_tls
parisc: Implement copy_thread_tls
riscv: Implement copy_thread_tls
xtensa: Implement copy_thread_tls
clone3: ensure copy_thread_tls is implemented
um: Implement copy_thread_tls
staging: vt6656: remove bool from vnt_radio_power_on ret
mwifiex: fix possible heap overflow in mwifiex_process_country_ie()
mwifiex: pcie: Fix memory leak in mwifiex_pcie_alloc_cmdrsp_buf
rpmsg: char: release allocated memory
scsi: bfa: release allocated memory in case of error
rtl8xxxu: prevent leaking urb
ath10k: fix memory leak
HID: hiddev: fix mess in hiddev_open()
USB: Fix: Don't skip endpoint descriptors with maxpacket=0
phy: cpcap-usb: Fix error path when no host driver is loaded
phy: cpcap-usb: Fix flakey host idling and enumerating of devices
netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
netfilter: conntrack: dccp, sctp: handle null timeout argument
netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
drm/i915/gen9: Clear residual context state on context switch
Linux 5.4.12
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib8604812a0a41f9e3ab36ef238623fc222096fea
Changes in 5.4.1
Bluetooth: Fix invalid-free in bcsp_close()
ath9k_hw: fix uninitialized variable data
ath10k: Fix a NULL-ptr-deref bug in ath10k_usb_alloc_urb_from_pipe
ath10k: Fix HOST capability QMI incompatibility
ath10k: restore QCA9880-AR1A (v1) detection
Revert "Bluetooth: hci_ll: set operational frequency earlier"
Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
md/raid10: prevent access of uninitialized resync_pages offset
x86/insn: Fix awk regexp warnings
x86/speculation: Fix incorrect MDS/TAA mitigation status
x86/speculation: Fix redundant MDS mitigation message
nbd: prevent memory leak
x86/stackframe/32: Repair 32-bit Xen PV
x86/xen/32: Make xen_iret_crit_fixup() independent of frame layout
x86/xen/32: Simplify ring check in xen_iret_crit_fixup()
x86/doublefault/32: Fix stack canaries in the double fault handler
x86/pti/32: Size initial_page_table correctly
x86/cpu_entry_area: Add guard page for entry stack on 32bit
x86/entry/32: Fix IRET exception
x86/entry/32: Use %ss segment where required
x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALL
x86/entry/32: Unwind the ESPFIX stack earlier on exception entry
x86/entry/32: Fix NMI vs ESPFIX
selftests/x86/mov_ss_trap: Fix the SYSENTER test
selftests/x86/sigreturn/32: Invalidate DS and ES when abusing the kernel
x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise
x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3
futex: Prevent robust futex exit race
ALSA: usb-audio: Fix NULL dereference at parsing BADD
ALSA: usb-audio: Fix Scarlett 6i6 Gen 2 port data
media: vivid: Set vid_cap_streaming and vid_out_streaming to true
media: vivid: Fix wrong locking that causes race conditions on streaming stop
media: usbvision: Fix invalid accesses after device disconnect
media: usbvision: Fix races among open, close, and disconnect
cpufreq: Add NULL checks to show() and store() methods of cpufreq
futex: Move futex exit handling into futex code
futex: Replace PF_EXITPIDONE with a state
exit/exec: Seperate mm_release()
futex: Split futex_mm_release() for exit/exec
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Mark the begin of futex exit explicitly
futex: Sanitize exit state handling
futex: Provide state handling for exec() as well
futex: Add mutex around futex exit
futex: Provide distinct return value when owner is exiting
futex: Prevent exit livelock
media: uvcvideo: Fix error path in control parsing failure
media: b2c2-flexcop-usb: add sanity checking
media: cxusb: detect cxusb_ctrl_msg error in query
media: imon: invalid dereference in imon_touch_event
media: mceusb: fix out of bounds read in MCE receiver buffer
ALSA: hda - Disable audio component for legacy Nvidia HDMI codecs
USBIP: add config dependency for SGL_ALLOC
usbip: tools: fix fd leakage in the function of read_attr_usbip_status
usbip: Fix uninitialized symbol 'nents' in stub_recv_cmd_submit()
usb-serial: cp201x: support Mark-10 digital force gauge
USB: chaoskey: fix error case of a timeout
appledisplay: fix error handling in the scheduled work
USB: serial: mos7840: add USB ID to support Moxa UPort 2210
USB: serial: mos7720: fix remote wakeup
USB: serial: mos7840: fix remote wakeup
USB: serial: option: add support for DW5821e with eSIM support
USB: serial: option: add support for Foxconn T77W968 LTE modules
staging: comedi: usbduxfast: usbduxfast_ai_cmdtest rounding error
powerpc/book3s64: Fix link stack flush on context switch
KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel
Linux 5.4.1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id50109953b5638956d150e4fc648a94b6e347fb5
commit 4610ba7ad8 upstream.
mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().
In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.
As the futex exit code needs further protections against exit races, this
needs to be split into two functions.
Preparatory only, no functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ba31c1a485 upstream.
The futex exit handling is #ifdeffed into mm_release() which is not pretty
to begin with. But upcoming changes to address futex exit races need to add
more functionality to this exit code.
Split it out into a function, move it into futex code and make the various
futex exit functions static.
Preparatory only and no functional change.
Folded build fix from Borislav.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This change adds generic support for Clang's Shadow Call Stack,
which uses a shadow stack to protect return addresses from being
overwritten by an attacker. Details are available here:
https://clang.llvm.org/docs/ShadowCallStack.html
Note that security guarantees in the kernel differ from the
ones documented for user space. The kernel must store addresses
of shadow stacks used by other tasks and interrupt handlers in
memory, which means an attacker capable reading and writing
arbitrary memory may be able to locate them and hijack control
flow by modifying shadow stacks that are not currently in use.
Bug: 145210207
Change-Id: I2a8ba6a3decac50c169731c3121c9dcab96621d2
(am from https://lore.kernel.org/patchwork/patch/1149054/)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.
Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:
#define __STACK_SIZE (8 * 1024 * 1024)
pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
pid_t ret;
void *stack;
stack = malloc(__STACK_SIZE);
if (!stack)
return -ENOMEM;
#ifdef __ia64__
ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#elif defined(__parisc__) /* stack grows up */
ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
#else
ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
return ret;
}
or even crazier variants such as [3].
With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.
/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: 5238e95759/src/basic/raw-clone.h (L31)
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd3 ("fork: add clone3")
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha@sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
Partially revert 16db3d3f11 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.
set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory. This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change. It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.
Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction. While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.
This became more severe when we switched x86 from 4k to 8k kernel
stacks. Starting since 6538b8ea88 ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.
In the particular case
3.12
kernel.threads-max = 515561
4.4
kernel.threads-max = 200000
Neither of the two values is really insane on 32GB machine.
I am not sure we want/need to tune the max_thread value further. If
anything the tuning should be removed altogether if proven not useful in
general. But we definitely need a way to override this auto-tuning.
Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f11 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull copy_struct_from_user() helper from Christian Brauner:
"This contains the copy_struct_from_user() helper which got split out
from the openat2() patchset. It is a generic interface designed to
copy a struct from userspace.
The helper will be especially useful for structs versioned by size of
which we have quite a few. This allows for backwards compatibility,
i.e. an extended struct can be passed to an older kernel, or a legacy
struct can be passed to a newer kernel. For the first case (extended
struct, older kernel) the new fields in an extended struct can be set
to zero and the struct safely passed to an older kernel.
The most obvious benefit is that this helper lets us get rid of
duplicate code present in at least sched_setattr(), perf_event_open(),
and clone3(). More importantly it will also help to ensure that users
implementing versioning-by-size end up with the same core semantics.
This point is especially crucial since we have at least one case where
versioning-by-size is used but with slighly different semantics:
sched_setattr(), perf_event_open(), and clone3() all do do similar
checks to copy_struct_from_user() while rt_sigprocmask(2) always
rejects differently-sized struct arguments.
With this pull request we also switch over sched_setattr(),
perf_event_open(), and clone3() to use the new helper"
* tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
usercopy: Add parentheses around assignment in test_copy_struct_from_user
perf_event_open: switch to copy_struct_from_user()
sched_setattr: switch to copy_struct_from_user()
clone3: switch to copy_struct_from_user()
lib: introduce copy_struct_from_user() helper
To make the 5.4-rc1 merge easier, merge at a prerelease point in time
before the final release happens.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: If613d657fd0abf9910c5bf3435a745f01b89765e
Switch clone3() syscall from it's own copying struct clone_args from
userspace to the new dedicated copy_struct_from_user() helper.
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-3-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Pull scheduler fixes from Ingo Molnar:
- Apply a number of membarrier related fixes and cleanups, which fixes
a use-after-free race in the membarrier code
- Introduce proper RCU protection for tasks on the runqueue - to get
rid of the subtle task_rcu_dereference() interface that was easy to
get wrong
- Misc fixes, but also an EAS speedup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Avoid redundant EAS calculation
sched/core: Remove double update_max_interval() call on CPU startup
sched/core: Fix preempt_schedule() interrupt return comment
sched/fair: Fix -Wunused-but-set-variable warnings
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
sched/membarrier: Skip IPIs when mm->mm_users == 1
selftests, sched/membarrier: Add multi-threaded test
sched/membarrier: Fix p->mm->membarrier_state racy load
sched/membarrier: Call sync_core only before usermode for same mm
sched/membarrier: Remove redundant check
sched/membarrier: Fix private expedited registration check
tasks, sched/core: RCUify the assignment of rq->curr
tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
tasks: Add a count of task RCU users
sched/core: Convert vcpu_is_preempted() from macro to an inline function
sched/fair: Remove unused cfs_rq_clock_task() function
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared). For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g:
Resident file mapping pages vs Resident shared memory pages). The loop
index in check_mm() is used to index rss_stat[] which represents
individual memory type stats. Hence, instead of printing index, print
memory type, thereby improving error message.
Without patch:
--------------
[ 204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
[ 204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
[ 204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
[ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480
With patch:
-----------
[ 69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
[ 69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
[ 69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
[ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480
Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.
Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task(). As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.
Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.
Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.
Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free. The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen. So
I don't see any problem with delaying it.
The function trace_sched_process_free is a trace point and thus
visible to user space. Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on. So I don't expect there to be a regression.
In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed. So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload. So I expect any issues to turn up quickly or not at all.
I have lightly tested this change and everything appears to work
fine.
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To make the 5.4-rc1 merge easier, merge at a prerelease point in time
before the final release happens.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I29b683c837ed1a3324644dbf9bf863f30740cd0b
Pull hmm updates from Jason Gunthorpe:
"This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a
cleanup to the page walker API and a few memremap related changes
round out the series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
and make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of
drivers by using a refcount get/put attachment idiom and remove the
convoluted mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its
only user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without
providing a struct device
- Make walk_page_range() and related use a constant structure for
function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
libnvdimm: Enable unit test infrastructure compile checks
mm, notifier: Catch sleeping/blocking for !blockable
kernel.h: Add non_block_start/end()
drm/radeon: guard against calling an unpaired radeon_mn_unregister()
csky: add missing brackets in a macro for tlb.h
pagewalk: use lockdep_assert_held for locking validation
pagewalk: separate function pointers from iterator data
mm: split out a new pagewalk.h header from mm.h
mm/mmu_notifiers: annotate with might_sleep()
mm/mmu_notifiers: prime lockdep
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
mm/hmm: hmm_range_fault() infinite loop
mm/hmm: hmm_range_fault() NULL pointer bug
mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
mm/mmu_notifiers: remove unregister_no_release
RDMA/odp: remove ib_ucontext from ib_umem
RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
...
This merges Linus's tree as of commit b41dae061b ("Merge tag
'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")
into android-mainline.
This "early" merge makes it easier to test and handle merge conflicts
instead of having to wait until the "end" of the merge window and handle
all 10000+ commits at once.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
Pull ia64 updates from Tony Luck:
"The big change here is removal of support for SGI Altix"
* tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits)
genirq: remove the is_affinity_mask_valid hook
ia64: remove CONFIG_SWIOTLB ifdefs
ia64: remove support for machvecs
ia64: move the screen_info setup to common code
ia64: move the ROOT_DEV setup to common code
ia64: rework iommu probing
ia64: remove the unused sn_coherency_id symbol
ia64: remove the SGI UV simulator support
ia64: remove the zx1 swiotlb machvec
ia64: remove CONFIG_ACPI ifdefs
ia64: remove CONFIG_PCI ifdefs
ia64: remove the hpsim platform
ia64: remove now unused machvec indirections
ia64: remove support for the SGI SN2 platform
drivers: remove the SGI SN2 IOC4 base support
drivers: remove the SGI SN2 IOC3 base support
qla2xxx: remove SGI SN2 support
qla1280: remove SGI SN2 support
misc/sgi-xp: remove SGI SN2 support
char/mspec: remove SGI SN2 support
...
Pull pidfd/waitid updates from Christian Brauner:
"This contains two features and various tests.
First, it adds support for waiting on process through pidfds by adding
the P_PIDFD type to the waitid() syscall. This completes the basic
functionality of the pidfd api (cf. [1]). In the meantime we also have
a new adition to the userspace projects that make use of the pidfd
api. The qt project was nice enough to send a mail pointing out that
they have a pr up to switch to the pidfd api (cf. [2]).
Second, this tag contains an extension to the waitid() syscall to make
it possible to wait on the current process group in a race free manner
(even though the actual problem is very unlikely) by specifing 0
together with the P_PGID type. This extension traces back to a
discussion on the glibc development mailing list.
There are also a range of tests for the features above. Additionally,
the test-suite which detected the pidfd-polling race we fixed in [3]
is included in this tag"
[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491b ("pidfd: fix a poll race when setting exit_state")
* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
waitid: Add support for waiting for the current process group
tests: add pidfd poll tests
tests: move common definitions and functions into pidfd.h
pidfd: add pidfd_wait tests
pidfd: add P_PIDFD to waitid()
Previously, higher 32 bits of exit_signal fields were lost when copied
to the kernel args structure (that uses int as a type for the respective
field). Moreover, as Oleg has noted, exit_signal is used unchecked, so
it has to be checked for sanity before use; for the legacy syscalls,
applying CSIGNAL mask guarantees that it is at least non-negative;
however, there's no such thing is done in clone3() code path, and that
can break at least thread_group_leader.
This commit adds a check to copy_clone_args_from_user() to verify that
the exit signal is limited by CSIGNAL as with legacy clone() and that
the signal is valid. With this we don't get the legacy clone behavior
were an invalid signal could be handed down and would only be detected
and ignored in do_notify_parent(). Users of clone3() will now get a
proper error when they pass an invalid exit signal. Note, that this is
not user-visible behavior since no kernel with clone3() has been
released yet.
The following program will cause a splat on a non-fixed clone3() version
and will fail correctly on a fixed version:
#define _GNU_SOURCE
#include <linux/sched.h>
#include <linux/types.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
pid_t pid = -1;
struct clone_args args = {0};
args.exit_signal = -1;
pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
if (pid < 0)
exit(EXIT_FAILURE);
if (pid == 0)
exit(EXIT_SUCCESS);
wait(NULL);
exit(EXIT_SUCCESS);
}
Fixes: 7f192e3cd3 ("fork: add clone3")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com
[christian.brauner@ubuntu.com: simplify check and rework commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Per task/process data of posix CPU timers is all over the place which
makes the code hard to follow and requires ifdeffery.
Create a container to hold all this information in one place, so data is
consolidated and the ifdeffery can be confined to the posix timer header
file and removed from places like fork.
As a first step, move the cpu_timers list head array into the new struct
and clean up the initializers and simplify fork. The remaining #ifdef in
fork will be removed later.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
From rdma.git
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is a significant simplification, it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.
mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.
It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.
The remaining krefs in the range code were overly cautious, drivers are
already not permitted to free the mirror while a range exists.
Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This adds the P_PIDFD type to waitid().
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
that want to use the pidfd api to exclusively manage processes can do so
now.
One of the things this will unblock in the future is the ability to make
it possible to retrieve the exit status via waitid(P_PIDFD) for
non-parent processes if handed a _suitable_ pidfd that has this feature
set. This is similar to what you can do on FreeBSD with kqueue(). It
might even end up being possible to wait on a process as a non-parent if
an appropriate property is enabled on the pidfd.
With P_PIDFD no scoping of the process identified by the pidfd is
possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
rely on pid-based wait*() syscalls for now.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull pidfd and clone3 fixes from Christian Brauner:
"This contains a bugfix for CLONE_PIDFD when used with the legacy clone
syscall, two fixes to ensure that syscall numbering and clone3
entrypoint implementations will stay consistent, and an update for the
maintainers file:
- The addition of clone3 broke CLONE_PIDFD for legacy clone on all
architectures that use do_fork() directly instead of calling the
clone syscall itself. (Fwiw, cleaning do_fork() up is on my todo.)
The reason this happened was that during conversion of _do_fork()
to use struct kernel_clone_args we missed that do_fork() is called
directly by various architectures. This is fixed by making sure
that the pidfd argument in struct kernel_clone_args is correctly
initialized with the parent_tidptr argument passed down from
do_fork(). Additionally, do_fork() missed a check to make
CLONE_PIDFD and CLONE_PARENT_SETTID mutually exclusive just a
clone() does. This is now fixed too.
- When clone3() was introduced we skipped architectures that require
special handling for fork-like syscalls. Their syscall tables did
not contain any mention of clone3().
To make sure that Arnd's work to make syscall numbers on all
architectures identical (minus alpha) was not for naught we are
placing a comment in all syscall tables that do not yet implement
clone3(). The comment makes it clear that 435 is reserved for
clone3 and should not be used.
- Also, this contains a patch to make the clone3() syscall definition
in asm-generic/unist.h conditional on __ARCH_WANT_SYS_CLONE3. This
lets us catch new architectures that implicitly make use of clone3
without setting __ARCH_WANT_SYS_CLONE3 which is a good indicator
that they did not check whether it needs special treatment or not.
- Finally, this contains a patch to add me as maintainer for pidfd
stuff so people can start blaming me (more)"
* tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
MAINTAINERS: add new entry for pidfd api
unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3
arch: mark syscall number 435 reserved for clone3
clone: fix CLONE_PIDFD support
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
Pull clone3 system call from Christian Brauner:
"This adds the clone3 syscall which is an extensible successor to clone
after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
window for clone(). It cleanly supports all of the flags from clone()
and thus all legacy workloads.
There are few user visible differences between clone3 and clone.
First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
this flag. Second, the CSIGNAL flag is deprecated and will cause
EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
argument in struct clone_args thus freeing up even more flags. And
third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
argument.
The clone3 uapi is designed to be easy to handle on 32- and 64 bit:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
and a separate kernel struct is used that uses proper kernel typing:
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
The system call comes with a size argument which enables the kernel to
detect what version of clone_args userspace is passing in. clone3
validates that any additional bytes a given kernel does not know about
are set to zero and that the size never exceeds a page.
A nice feature is that this patchset allowed us to cleanup and
simplify various core kernel codepaths in kernel/fork.c by making the
internal _do_fork() function take struct kernel_clone_args even for
legacy clone().
This patch also unblocks the time namespace patchset which wants to
introduce a new CLONE_TIMENS flag.
Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
xtensa. These were the architectures that did not require special
massaging.
Other architectures treat fork-like system calls individually and
after some back and forth neither Arnd nor I felt confident that we
dared to add clone3 unconditionally to all architectures. We agreed to
leave this up to individual architecture maintainers. This is why
there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
which any architecture can set once it has implemented support for
clone3. The patch also adds a cond_syscall(clone3) for architectures
such as nios2 or h8300 that generate their syscall table by simply
including asm-generic/unistd.h. The hope is to get rid of
__ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"
* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
arch: handle arches who do not yet define clone3
arch: wire-up clone3() syscall
fork: add clone3
Pull pidfd updates from Christian Brauner:
"This adds two main features.
- First, it adds polling support for pidfds. This allows process
managers to know when a (non-parent) process dies in a race-free
way.
The notification mechanism used follows the same logic that is
currently used when the parent of a task is notified of a child's
death. With this patchset it is possible to put pidfds in an
{e}poll loop and get reliable notifications for process (i.e.
thread-group) exit.
- The second feature compliments the first one by making it possible
to retrieve pollable pidfds for processes that were not created
using CLONE_PIDFD.
A lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these
processes a caller can currently not create a pollable pidfd. This
is a problem for Android's low memory killer (LMK) and service
managers such as systemd.
Both patchsets are accompanied by selftests.
It's perhaps worth noting that the work done so far and the work done
in this branch for pidfd_open() and polling support do already see
some adoption:
- Android is in the process of backporting this work to all their LTS
kernels [1]
- Service managers make use of pidfd_send_signal but will need to
wait until we enable waiting on pidfds for full adoption.
- And projects I maintain make use of both pidfd_send_signal and
CLONE_PIDFD [2] and will use polling support and pidfd_open() too"
[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22
[2] aab6e3eb73/src/lxc/start.c (L1753)
* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add pidfd_open() tests
arch: wire-up pidfd_open()
pid: add pidfd_open()
pidfd: add polling selftests
pidfd: add polling support