Commit Graph

373 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
fae17cd97d Merge 5.15.52 into android14-5.15
Changes in 5.15.52
	tick/nohz: unexport __init-annotated tick_nohz_full_setup()
	x86, kvm: use proper ASM macros for kvm_vcpu_is_preempted
	bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
	xfs: use kmem_cache_free() for kmem_cache objects
	xfs: punch out data fork delalloc blocks on COW writeback failure
	xfs: Fix the free logic of state in xfs_attr_node_hasname
	xfs: remove all COW fork extents when remounting readonly
	xfs: check sb_meta_uuid for dabuf buffer recovery
	xfs: prevent UAF in xfs_log_item_in_current_chkpt
	xfs: only bother with sync_filesystem during readonly remount
	powerpc/ftrace: Remove ftrace init tramp once kernel init is complete
	fs: add is_idmapped_mnt() helper
	fs: move mapping helpers
	fs: tweak fsuidgid_has_mapping()
	fs: account for filesystem mappings
	docs: update mapping documentation
	fs: use low-level mapping helpers
	fs: remove unused low-level mapping helpers
	fs: port higher-level mapping helpers
	fs: add i_user_ns() helper
	fs: support mapped mounts of mapped filesystems
	fs: fix acl translation
	fs: account for group membership
	rtw88: 8821c: support RFE type4 wifi NIC
	rtw88: rtw8821c: enable rfe 6 devices
	net: mscc: ocelot: allow unregistered IP multicast flooding to CPU
	io_uring: fix not locked access to fixed buf table
	Linux 5.15.52

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icfb690703efd8cab1dffa7ca6cce28bbca635c3d
2022-07-13 18:34:48 +02:00
Christian Brauner
38753e9173 fs: support mapped mounts of mapped filesystems
commit bd303368b776eead1c29e6cdda82bde7128b82a7 upstream.

In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.

With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.

In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.

As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this
filesystem.

This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.

Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-02 16:41:17 +02:00
Christian Brauner
f895d0ff47 fs: use low-level mapping helpers
commit 4472071331549e911a5abad41aea6e3be855a1a4 upstream.

In a few places the vfs needs to interact with bare k{g,u}ids directly
instead of struct inode. These are just a few. In previous patches we
introduced low-level mapping helpers that are able to support
filesystems mounted an idmapping. This patch simply converts the places
to use these new helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-02 16:41:16 +02:00
Christian Brauner
7bc23abcb4 fs: move mapping helpers
commit a793d79ea3e041081cd7cbd8ee43d0b5e4914a2b upstream.

The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-02 16:41:15 +02:00
Greg Kroah-Hartman
0a77fca3aa ANDROID: GKI: set vfs-only exports into their own namespace
We have namespaces, so use them for all vfs-exported namespaces so that
filesystems can use them, but not anything else.

Some in-kernel drivers that do direct filesystem accesses (because they
serve up files) are also allowed access to these symbols to keep 'make
allmodconfig' builds working properly, but it is not needed for Android
kernel images.

Bug: 157965270
Bug: 210074446
Cc: Matthias Maennich <maennich@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iaf6140baf3a18a516ab2d5c3966235c42f3f70de
2022-04-07 15:14:24 +02:00
David Anderson
e884438aa5 Revert "FROMLIST: Add flags option to get xattr method paired to..."
Revert submission 1881578

Reason for revert: broken build in CI
Reverted Changes:
Id2c6fa6ee:FROMLIST: Add flags option to get xattr method pai...
Ifa966dabd:FROMLIST: overlayfs: inode_owner_or_capable called...
I46e6c74ff:FROMLIST: overlayfs: override_creds=off option byp...
I0b8fe9f1f:FROMLIST: overlayfs: handle XATTR_NOSECURITY flag ...

Change-Id: Ic4f9a8dd92dc492ed0a474c783497ec525f1c762
Signed-off-by: David Anderson <dvander@google.com>
2021-11-20 03:15:20 +00:00
Greg Kroah-Hartman
36de88a855 Merge 5.15.3 into android13-5.15
Changes in 5.15.3
	xhci: Fix USB 3.1 enumeration issues by increasing roothub power-on-good delay
	usb: xhci: Enable runtime-pm by default on AMD Yellow Carp platform
	Input: iforce - fix control-message timeout
	Input: elantench - fix misreporting trackpoint coordinates
	Input: i8042 - Add quirk for Fujitsu Lifebook T725
	libata: fix read log timeout value
	ocfs2: fix data corruption on truncate
	scsi: scsi_ioctl: Validate command size
	scsi: core: Avoid leaving shost->last_reset with stale value if EH does not run
	scsi: core: Remove command size deduction from scsi_setup_scsi_cmnd()
	scsi: lpfc: Don't release final kref on Fport node while ABTS outstanding
	scsi: lpfc: Fix FCP I/O flush functionality for TMF routines
	scsi: qla2xxx: Fix crash in NVMe abort path
	scsi: qla2xxx: Fix kernel crash when accessing port_speed sysfs file
	scsi: qla2xxx: Fix use after free in eh_abort path
	ce/gf100: fix incorrect CE0 address calculation on some GPUs
	char: xillybus: fix msg_ep UAF in xillyusb_probe()
	mmc: mtk-sd: Add wait dma stop done flow
	mmc: dw_mmc: Dont wait for DRTO on Write RSP error
	exfat: fix incorrect loading of i_blocks for large files
	io-wq: remove worker to owner tw dependency
	parisc: Fix set_fixmap() on PA1.x CPUs
	parisc: Fix ptrace check on syscall return
	tpm: Check for integer overflow in tpm2_map_response_body()
	firmware/psci: fix application of sizeof to pointer
	crypto: s5p-sss - Add error handling in s5p_aes_probe()
	media: rkvdec: Do not override sizeimage for output format
	media: ite-cir: IR receiver stop working after receive overflow
	media: rkvdec: Support dynamic resolution changes
	media: ir-kbd-i2c: improve responsiveness of hauppauge zilog receivers
	media: v4l2-ioctl: Fix check_ext_ctrls
	ALSA: hda/realtek: Fix mic mute LED for the HP Spectre x360 14
	ALSA: hda/realtek: Add a quirk for HP OMEN 15 mute LED
	ALSA: hda/realtek: Add quirk for Clevo PC70HS
	ALSA: hda/realtek: Headset fixup for Clevo NH77HJQ
	ALSA: hda/realtek: Add a quirk for Acer Spin SP513-54N
	ALSA: hda/realtek: Add quirk for ASUS UX550VE
	ALSA: hda/realtek: Add quirk for HP EliteBook 840 G7 mute LED
	ALSA: ua101: fix division by zero at probe
	ALSA: 6fire: fix control and bulk message timeouts
	ALSA: line6: fix control and interrupt message timeouts
	ALSA: mixer: oss: Fix racy access to slots
	ALSA: mixer: fix deadlock in snd_mixer_oss_set_volume
	ALSA: usb-audio: Line6 HX-Stomp XL USB_ID for 48k-fixed quirk
	ALSA: usb-audio: Add registration quirk for JBL Quantum 400
	ALSA: hda: Free card instance properly at probe errors
	ALSA: synth: missing check for possible NULL after the call to kstrdup
	ALSA: pci: rme: Fix unaligned buffer addresses
	ALSA: PCM: Fix NULL dereference at mmap checks
	ALSA: timer: Fix use-after-free problem
	ALSA: timer: Unconditionally unlink slave instances, too
	Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
	ext4: fix lazy initialization next schedule time computation in more granular unit
	ext4: ensure enough credits in ext4_ext_shift_path_extents
	ext4: refresh the ext4_ext_path struct after dropping i_data_sem.
	fuse: fix page stealing
	x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
	x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
	x86/irq: Ensure PI wakeup handler is unregistered before module unload
	x86/iopl: Fake iopl(3) CLI/STI usage
	btrfs: clear MISSING device status bit in btrfs_close_one_device
	btrfs: fix lost error handling when replaying directory deletes
	btrfs: call btrfs_check_rw_degradable only if there is a missing device
	KVM: x86/mmu: Drop a redundant, broken remote TLB flush
	KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup
	KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling
	ia64: kprobes: Fix to pass correct trampoline address to the handler
	selinux: fix race condition when computing ocontext SIDs
	ipmi:watchdog: Set panic count to proper value on a panic
	md/raid1: only allocate write behind bio for WriteMostly device
	hwmon: (pmbus/lm25066) Add offset coefficients
	regulator: s5m8767: do not use reset value as DVS voltage if GPIO DVS is disabled
	regulator: dt-bindings: samsung,s5m8767: correct s5m8767,pmic-buck-default-dvs-idx property
	EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
	mwifiex: fix division by zero in fw download path
	ath6kl: fix division by zero in send path
	ath6kl: fix control-message timeout
	ath10k: fix control-message timeout
	ath10k: fix division by zero in send path
	PCI: Mark Atheros QCA6174 to avoid bus reset
	rtl8187: fix control-message timeouts
	evm: mark evm_fixmode as __ro_after_init
	ifb: Depend on netfilter alternatively to tc
	platform/surface: aggregator_registry: Add support for Surface Laptop Studio
	mt76: mt7615: fix skb use-after-free on mac reset
	HID: surface-hid: Use correct event registry for managing HID events
	HID: surface-hid: Allow driver matching for target ID 1 devices
	wcn36xx: Fix HT40 capability for 2Ghz band
	wcn36xx: Fix tx_status mechanism
	wcn36xx: Fix (QoS) null data frame bitrate/modulation
	PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions
	mwifiex: Read a PCI register after writing the TX ring write pointer
	mwifiex: Try waking the firmware until we get an interrupt
	libata: fix checking of DMA state
	dma-buf: fix and rework dma_buf_poll v7
	wcn36xx: handle connection loss indication
	rsi: fix occasional initialisation failure with BT coex
	rsi: fix key enabled check causing unwanted encryption for vap_id > 0
	rsi: fix rate mask set leading to P2P failure
	rsi: Fix module dev_oper_mode parameter description
	perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
	perf/x86/intel/uncore: Fix invalid unit check
	perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
	RDMA/qedr: Fix NULL deref for query_qp on the GSI QP
	ASoC: tegra: Set default card name for Trimslice
	ASoC: tegra: Restore AC97 support
	signal: Remove the bogus sigkill_pending in ptrace_stop
	memory: renesas-rpc-if: Correct QSPI data transfer in Manual mode
	signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
	signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
	soc: samsung: exynos-pmu: Fix compilation when nothing selects CONFIG_MFD_CORE
	soc: fsl: dpio: replace smp_processor_id with raw_smp_processor_id
	soc: fsl: dpio: use the combined functions to protect critical zone
	mtd: rawnand: socrates: Keep the driver compatible with on-die ECC engines
	mctp: handle the struct sockaddr_mctp padding fields
	power: supply: max17042_battery: Prevent int underflow in set_soc_threshold
	power: supply: max17042_battery: use VFSOC for capacity when no rsns
	iio: core: fix double free in iio_device_unregister_sysfs()
	iio: core: check return value when calling dev_set_name()
	KVM: arm64: Extract ESR_ELx.EC only
	KVM: x86: Fix recording of guest steal time / preempted status
	KVM: x86: Add helper to consolidate core logic of SET_CPUID{2} flows
	KVM: nVMX: Query current VMCS when determining if MSR bitmaps are in use
	KVM: nVMX: Handle dynamic MSR intercept toggling
	can: peak_usb: always ask for BERR reporting for PCAN-USB devices
	can: mcp251xfd: mcp251xfd_irq(): add missing can_rx_offload_threaded_irq_finish() in case of bus off
	can: j1939: j1939_tp_cmd_recv(): ignore abort message in the BAM transport
	can: j1939: j1939_can_recv(): ignore messages with invalid source address
	can: j1939: j1939_tp_cmd_recv(): check the dst address of TP.CM_BAM
	iio: adc: tsc2046: fix scan interval warning
	powerpc/85xx: Fix oops when mpc85xx_smp_guts_ids node cannot be found
	io_uring: honour zeroes as io-wq worker limits
	ring-buffer: Protect ring_buffer_reset() from reentrancy
	serial: core: Fix initializing and restoring termios speed
	ifb: fix building without CONFIG_NET_CLS_ACT
	xen/balloon: add late_initcall_sync() for initial ballooning done
	ovl: fix use after free in struct ovl_aio_req
	ovl: fix filattr copy-up failure
	PCI: pci-bridge-emul: Fix emulation of W1C bits
	PCI: cadence: Add cdns_plat_pcie_probe() missing return
	cxl/pci: Fix NULL vs ERR_PTR confusion
	PCI: aardvark: Do not clear status bits of masked interrupts
	PCI: aardvark: Fix checking for link up via LTSSM state
	PCI: aardvark: Do not unmask unused interrupts
	PCI: aardvark: Fix reporting Data Link Layer Link Active
	PCI: aardvark: Fix configuring Reference clock
	PCI: aardvark: Fix return value of MSI domain .alloc() method
	PCI: aardvark: Read all 16-bits from PCIE_MSI_PAYLOAD_REG
	PCI: aardvark: Fix support for bus mastering and PCI_COMMAND on emulated bridge
	PCI: aardvark: Fix support for PCI_BRIDGE_CTL_BUS_RESET on emulated bridge
	PCI: aardvark: Set PCI Bridge Class Code to PCI Bridge
	PCI: aardvark: Fix support for PCI_ROM_ADDRESS1 on emulated bridge
	quota: check block number when reading the block in quota file
	quota: correct error number in free_dqentry()
	cifs: To match file servers, make sure the server hostname matches
	cifs: set a minimum of 120s for next dns resolution
	mfd: simple-mfd-i2c: Select MFD_CORE to fix build error
	pinctrl: core: fix possible memory leak in pinctrl_enable()
	coresight: cti: Correct the parameter for pm_runtime_put
	coresight: trbe: Fix incorrect access of the sink specific data
	coresight: trbe: Defer the probe on offline CPUs
	iio: buffer: check return value of kstrdup_const()
	iio: buffer: Fix memory leak in iio_buffers_alloc_sysfs_and_mask()
	iio: buffer: Fix memory leak in __iio_buffer_alloc_sysfs_and_mask()
	iio: buffer: Fix memory leak in iio_buffer_register_legacy_sysfs_groups()
	drivers: iio: dac: ad5766: Fix dt property name
	iio: dac: ad5446: Fix ad5622_write() return value
	iio: ad5770r: make devicetree property reading consistent
	Documentation:devicetree:bindings:iio:dac: Fix val
	USB: serial: keyspan: fix memleak on probe errors
	serial: 8250: fix racy uartclk update
	ksmbd: set unique value to volume serial field in FS_VOLUME_INFORMATION
	io-wq: serialize hash clear with wakeup
	serial: 8250: Fix reporting real baudrate value in c_ospeed field
	Revert "serial: 8250: Fix reporting real baudrate value in c_ospeed field"
	most: fix control-message timeouts
	USB: iowarrior: fix control-message timeouts
	USB: chipidea: fix interrupt deadlock
	power: supply: max17042_battery: Clear status bits in interrupt handler
	component: do not leave master devres group open after bind
	dma-buf: WARN on dmabuf release with pending attachments
	drm: panel-orientation-quirks: Update the Lenovo Ideapad D330 quirk (v2)
	drm: panel-orientation-quirks: Add quirk for KD Kurio Smart C15200 2-in-1
	drm: panel-orientation-quirks: Add quirk for the Samsung Galaxy Book 10.6
	Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()
	Bluetooth: fix use-after-free error in lock_sock_nested()
	Bluetooth: call sock_hold earlier in sco_conn_del
	drm/panel-orientation-quirks: add Valve Steam Deck
	rcutorture: Avoid problematic critical section nesting on PREEMPT_RT
	platform/x86: wmi: do not fail if disabling fails
	drm/amdgpu: move iommu_resume before ip init/resume
	MIPS: lantiq: dma: add small delay after reset
	MIPS: lantiq: dma: reset correct number of channel
	locking/lockdep: Avoid RCU-induced noinstr fail
	net: sched: update default qdisc visibility after Tx queue cnt changes
	ACPI: resources: Add DMI-based legacy IRQ override quirk
	rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
	smackfs: Fix use-after-free in netlbl_catmap_walk()
	ath11k: Align bss_chan_info structure with firmware
	crypto: aesni - check walk.nbytes instead of err
	x86/mm/64: Improve stack overflow warnings
	x86: Increase exception stack sizes
	mwifiex: Run SET_BSS_MODE when changing from P2P to STATION vif-type
	mwifiex: Properly initialize private structure on interface type changes
	spi: Check we have a spi_device_id for each DT compatible
	fscrypt: allow 256-bit master keys with AES-256-XTS
	drm/amdgpu: Fix MMIO access page fault
	drm/amd/display: Fix null pointer dereference for encoders
	selftests: net: fib_nexthops: Wait before checking reported idle time
	ath11k: Avoid reg rules update during firmware recovery
	ath11k: add handler for scan event WMI_SCAN_EVENT_DEQUEUED
	ath11k: Change DMA_FROM_DEVICE to DMA_TO_DEVICE when map reinjected packets
	ath10k: high latency fixes for beacon buffer
	octeontx2-pf: Enable promisc/allmulti match MCAM entries.
	media: mt9p031: Fix corrupted frame after restarting stream
	media: netup_unidvb: handle interrupt properly according to the firmware
	media: atomisp: Fix error handling in probe
	media: stm32: Potential NULL pointer dereference in dcmi_irq_thread()
	media: uvcvideo: Set capability in s_param
	media: uvcvideo: Return -EIO for control errors
	media: uvcvideo: Set unique vdev name based in type
	media: vidtv: Fix memory leak in remove
	media: s5p-mfc: fix possible null-pointer dereference in s5p_mfc_probe()
	media: s5p-mfc: Add checking to s5p_mfc_probe().
	media: videobuf2: rework vb2_mem_ops API
	media: imx: set a media_device bus_info string
	media: rcar-vin: Use user provided buffers when starting
	media: mceusb: return without resubmitting URB in case of -EPROTO error.
	ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
	rtw88: fix RX clock gate setting while fifo dump
	brcmfmac: Add DMI nvram filename quirk for Cyberbook T116 tablet
	media: rcar-csi2: Add checking to rcsi2_start_receiver()
	ipmi: Disable some operations during a panic
	fs/proc/uptime.c: Fix idle time reporting in /proc/uptime
	kselftests/sched: cleanup the child processes
	ACPICA: Avoid evaluating methods too early during system resume
	cpufreq: Make policy min/max hard requirements
	ice: Move devlink port to PF/VF struct
	media: imx-jpeg: Fix possible null pointer dereference
	media: ipu3-imgu: imgu_fmt: Handle properly try
	media: ipu3-imgu: VIDIOC_QUERYCAP: Fix bus_info
	media: usb: dvd-usb: fix uninit-value bug in dibusb_read_eeprom_byte()
	net-sysfs: try not to restart the syscall if it will fail eventually
	drm/amdkfd: rm BO resv on validation to avoid deadlock
	tracefs: Have tracefs directories not set OTH permission bits by default
	tracing: Disable "other" permission bits in the tracefs files
	ath: dfs_pattern_detector: Fix possible null-pointer dereference in channel_detector_create()
	KVM: arm64: Propagate errors from __pkvm_prot_finalize hypercall
	mmc: moxart: Fix reference count leaks in moxart_probe
	iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
	ACPI: battery: Accept charges over the design capacity as full
	ACPI: scan: Release PM resources blocked by unused objects
	drm/amd/display: fix null pointer deref when plugging in display
	drm/amdkfd: fix resume error when iommu disabled in Picasso
	net: phy: micrel: make *-skew-ps check more lenient
	leaking_addresses: Always print a trailing newline
	thermal/core: Fix null pointer dereference in thermal_release()
	drm/msm: prevent NULL dereference in msm_gpu_crashstate_capture()
	thermal/drivers/tsens: Add timeout to get_temp_tsens_valid
	block: bump max plugged deferred size from 16 to 32
	floppy: fix calling platform_device_unregister() on invalid drives
	md: update superblock after changing rdev flags in state_store
	memstick: r592: Fix a UAF bug when removing the driver
	locking/rwsem: Disable preemption for spinning region
	lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
	lib/xz: Validate the value before assigning it to an enum variable
	workqueue: make sysfs of unbound kworker cpumask more clever
	tracing/cfi: Fix cmp_entries_* functions signature mismatch
	mt76: mt7915: fix an off-by-one bound check
	mwl8k: Fix use-after-free in mwl8k_fw_state_machine()
	iwlwifi: change all JnP to NO-160 configuration
	block: remove inaccurate requeue check
	media: allegro: ignore interrupt if mailbox is not initialized
	drm/amdgpu/pm: properly handle sclk for profiling modes on vangogh
	nvmet: fix use-after-free when a port is removed
	nvmet-rdma: fix use-after-free when a port is removed
	nvmet-tcp: fix use-after-free when a port is removed
	nvme: drop scan_lock and always kick requeue list when removing namespaces
	samples/bpf: Fix application of sizeof to pointer
	arm64: vdso32: suppress error message for 'make mrproper'
	PM: hibernate: Get block device exclusively in swsusp_check()
	selftests: kvm: fix mismatched fclose() after popen()
	selftests/bpf: Fix perf_buffer test on system with offline cpus
	iwlwifi: mvm: disable RX-diversity in powersave
	smackfs: use __GFP_NOFAIL for smk_cipso_doi()
	ARM: clang: Do not rely on lr register for stacktrace
	gre/sit: Don't generate link-local addr if addr_gen_mode is IN6_ADDR_GEN_MODE_NONE
	can: bittiming: can_fixup_bittiming(): change type of tseg1 and alltseg to unsigned int
	gfs2: Cancel remote delete work asynchronously
	gfs2: Fix glock_hash_walk bugs
	ARM: 9136/1: ARMv7-M uses BE-8, not BE-32
	tools/latency-collector: Use correct size when writing queue_full_warning
	vrf: run conntrack only in context of lower/physdev for locally generated packets
	net: annotate data-race in neigh_output()
	ACPI: AC: Quirk GK45 to skip reading _PSR
	ACPI: resources: Add one more Medion model in IRQ override quirk
	btrfs: reflink: initialize return value to 0 in btrfs_extent_same()
	btrfs: do not take the uuid_mutex in btrfs_rm_device
	spi: bcm-qspi: Fix missing clk_disable_unprepare() on error in bcm_qspi_probe()
	wcn36xx: Correct band/freq reporting on RX
	wcn36xx: Fix packet drop on resume
	Revert "wcn36xx: Enable firmware link monitoring"
	ftrace: do CPU checking after preemption disabled
	inet: remove races in inet{6}_getname()
	x86/hyperv: Protect set_hv_tscchange_cb() against getting preempted
	drm/amd/display: dcn20_resource_construct reduce scope of FPU enabled
	selftests/core: fix conflicting types compile error for close_range()
	perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
	parisc: fix warning in flush_tlb_all
	task_stack: Fix end_of_stack() for architectures with upwards-growing stack
	erofs: don't trigger WARN() when decompression fails
	parisc/unwind: fix unwinder when CONFIG_64BIT is enabled
	parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
	netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream state
	selftests/bpf: Fix strobemeta selftest regression
	fbdev/efifb: Release PCI device's runtime PM ref during FB destroy
	drm/bridge: anx7625: Propagate errors from sp_tx_rst_aux()
	perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
	perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
	perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
	perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
	drm/bridge: it66121: Initialize {device,vendor}_ids
	drm/bridge: it66121: Wait for next bridge to be probed
	Bluetooth: fix init and cleanup of sco_conn.timeout_work
	libbpf: Don't crash on object files with no symbol tables
	Bluetooth: hci_uart: fix GPF in h5_recv
	rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()
	MIPS: lantiq: dma: fix burst length for DEU
	x86/xen: Mark cpu_bringup_and_idle() as dead_end_function
	objtool: Handle __sanitize_cov*() tail calls
	net/mlx5: Publish and unpublish all devlink parameters at once
	drm/v3d: fix wait for TMU write combiner flush
	crypto: sm4 - Do not change section of ck and sbox
	virtio-gpu: fix possible memory allocation failure
	lockdep: Let lock_is_held_type() detect recursive read as read
	net: net_namespace: Fix undefined member in key_remove_domain()
	net: phylink: don't call netif_carrier_off() with NULL netdev
	drm: bridge: it66121: Fix return value it66121_probe
	spi: Fixed division by zero warning
	cgroup: Make rebind_subsystems() disable v2 controllers all at once
	wcn36xx: Fix Antenna Diversity Switching
	wilc1000: fix possible memory leak in cfg_scan_result()
	Bluetooth: btmtkuart: fix a memleak in mtk_hci_wmt_sync
	drm/amdgpu: Fix crash on device remove/driver unload
	drm/amd/display: Pass display_pipe_params_st as const in DML
	drm/amdgpu: move amdgpu_virt_release_full_gpu to fini_early stage
	crypto: caam - disable pkc for non-E SoCs
	crypto: qat - power up 4xxx device
	Bluetooth: hci_h5: Fix (runtime)suspend issues on RTL8723BS HCIs
	bnxt_en: Check devlink allocation and registration status
	qed: Don't ignore devlink allocation failures
	rxrpc: Fix _usecs_to_jiffies() by using usecs_to_jiffies()
	mptcp: do not shrink snd_nxt when recovering
	fortify: Fix dropped strcpy() compile-time write overflow check
	mac80211: twt: don't use potentially unaligned pointer
	cfg80211: always free wiphy specific regdomain
	net/mlx5: Accept devlink user input after driver initialization complete
	net: dsa: rtl8366rb: Fix off-by-one bug
	net: dsa: rtl8366: Fix a bug in deleting VLANs
	bpf/tests: Fix error in tail call limit tests
	ath11k: fix some sleeping in atomic bugs
	ath11k: Avoid race during regd updates
	ath11k: fix packet drops due to incorrect 6 GHz freq value in rx status
	ath11k: Fix memory leak in ath11k_qmi_driver_event_work
	gve: DQO: avoid unused variable warnings
	ath10k: Fix missing frame timestamp for beacon/probe-resp
	ath10k: sdio: Add missing BH locking around napi_schdule()
	drm/ttm: stop calling tt_swapin in vm_access
	arm64: mm: update max_pfn after memory hotplug
	drm/amdgpu: fix warning for overflow check
	libbpf: Fix skel_internal.h to set errno on loader retval < 0
	media: em28xx: add missing em28xx_close_extension
	media: meson-ge2d: Fix rotation parameter changes detection in 'ge2d_s_ctrl()'
	media: cxd2880-spi: Fix a null pointer dereference on error handling path
	media: ttusb-dec: avoid release of non-acquired mutex
	media: dvb-usb: fix ununit-value in az6027_rc_query
	media: imx258: Fix getting clock frequency
	media: v4l2-ioctl: S_CTRL output the right value
	media: mtk-vcodec: venc: fix return value when start_streaming fails
	media: TDA1997x: handle short reads of hdmi info frame.
	media: mtk-vpu: Fix a resource leak in the error handling path of 'mtk_vpu_probe()'
	media: imx-jpeg: Fix the error handling path of 'mxc_jpeg_probe()'
	media: i2c: ths8200 needs V4L2_ASYNC
	media: sun6i-csi: Allow the video device to be open multiple times
	media: radio-wl1273: Avoid card name truncation
	media: si470x: Avoid card name truncation
	media: tm6000: Avoid card name truncation
	media: cx23885: Fix snd_card_free call on null card pointer
	media: atmel: fix the ispck initialization
	scs: Release kasan vmalloc poison in scs_free process
	kprobes: Do not use local variable when creating debugfs file
	crypto: ecc - fix CRYPTO_DEFAULT_RNG dependency
	drm: fb_helper: fix CONFIG_FB dependency
	cpuidle: Fix kobject memory leaks in error paths
	media: em28xx: Don't use ops->suspend if it is NULL
	ath10k: Don't always treat modem stop events as crashes
	ath9k: Fix potential interrupt storm on queue reset
	PM: EM: Fix inefficient states detection
	x86/insn: Use get_unaligned() instead of memcpy()
	EDAC/amd64: Handle three rank interleaving mode
	rcu: Always inline rcu_dynticks_task*_{enter,exit}()
	rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
	netfilter: nft_dynset: relax superfluous check on set updates
	media: venus: fix vpp frequency calculation for decoder
	media: dvb-frontends: mn88443x: Handle errors of clk_prepare_enable()
	crypto: ccree - avoid out-of-range warnings from clang
	crypto: qat - detect PFVF collision after ACK
	crypto: qat - disregard spurious PFVF interrupts
	hwrng: mtk - Force runtime pm ops for sleep ops
	ima: fix deadlock when traversing "ima_default_rules".
	b43legacy: fix a lower bounds test
	b43: fix a lower bounds test
	gve: Recover from queue stall due to missed IRQ
	gve: Track RX buffer allocation failures
	mmc: sdhci-omap: Fix NULL pointer exception if regulator is not configured
	mmc: sdhci-omap: Fix context restore
	memstick: avoid out-of-range warning
	memstick: jmb38x_ms: use appropriate free function in jmb38x_ms_alloc_host()
	net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USE
	hwmon: Fix possible memleak in __hwmon_device_register()
	hwmon: (pmbus/lm25066) Let compiler determine outer dimension of lm25066_coeff
	ath10k: fix max antenna gain unit
	kernel/sched: Fix sched_fork() access an invalid sched_task_group
	net: fealnx: fix build for UML
	net: intel: igc_ptp: fix build for UML
	net: tulip: winbond-840: fix build for UML
	tcp: switch orphan_count to bare per-cpu counters
	crypto: octeontx2 - set assoclen in aead_do_fallback()
	thermal/core: fix a UAF bug in __thermal_cooling_device_register()
	drm/msm/dsi: do not enable irq handler before powering up the host
	drm/msm: Fix potential Oops in a6xx_gmu_rpmh_init()
	drm/msm: potential error pointer dereference in init()
	drm/msm: unlock on error in get_sched_entity()
	drm/msm: fix potential NULL dereference in cleanup
	drm/msm: uninitialized variable in msm_gem_import()
	net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
	thermal/drivers/qcom/lmh: make QCOM_LMH depends on QCOM_SCM
	mailbox: Remove WARN_ON for async_cb.cb in cmdq_exec_done
	media: ivtv: fix build for UML
	media: ir_toy: assignment to be16 should be of correct type
	mmc: mxs-mmc: disable regulator on error and in the remove function
	io-wq: Remove duplicate code in io_workqueue_create()
	block: ataflop: fix breakage introduced at blk-mq refactoring
	blk-wbt: prevent NULL pointer dereference in wb_timer_fn
	platform/x86: thinkpad_acpi: Fix bitwise vs. logical warning
	mailbox: mtk-cmdq: Validate alias_id on probe
	mailbox: mtk-cmdq: Fix local clock ID usage
	ACPI: PM: Turn off unused wakeup power resources
	ACPI: PM: Fix sharing of wakeup power resources
	drm/amdkfd: Fix an inappropriate error handling in allloc memory of gpu
	mt76: mt7921: fix endianness in mt7921_mcu_tx_done_event
	mt76: mt7915: fix endianness warning in mt7915_mac_add_txs_skb
	mt76: mt7921: fix endianness warning in mt7921_update_txs
	mt76: mt7615: fix endianness warning in mt7615_mac_write_txwi
	mt76: mt7915: fix info leak in mt7915_mcu_set_pre_cal()
	mt76: connac: fix mt76_connac_gtk_rekey_tlv usage
	mt76: fix build error implicit enumeration conversion
	mt76: mt7921: fix survey-dump reporting
	mt76: mt76x02: fix endianness warnings in mt76x02_mac.c
	mt76: mt7921: Fix out of order process by invalid event pkt
	mt76: mt7915: fix potential overflow of eeprom page index
	mt76: mt7915: fix bit fields for HT rate idx
	mt76: mt7921: fix dma hang in rmmod
	mt76: connac: fix GTK rekey offload failure on WPA mixed mode
	mt76: overwrite default reg_ops if necessary
	mt76: mt7921: report HE MU radiotap
	mt76: mt7921: fix firmware usage of RA info using legacy rates
	mt76: mt7921: fix kernel warning from cfg80211_calculate_bitrate
	mt76: mt7921: always wake device if necessary in debugfs
	mt76: mt7915: fix hwmon temp sensor mem use-after-free
	mt76: mt7615: fix hwmon temp sensor mem use-after-free
	mt76: mt7915: fix possible infinite loop release semaphore
	mt76: mt7921: fix retrying release semaphore without end
	mt76: mt7615: fix monitor mode tear down crash
	mt76: connac: fix possible NULL pointer dereference in mt76_connac_get_phy_mode_v2
	mt76: mt7915: fix sta_rec_wtbl tag len
	mt76: mt7915: fix muar_idx in mt7915_mcu_alloc_sta_req()
	rsi: stop thread firstly in rsi_91x_init() error handling
	mwifiex: Send DELBA requests according to spec
	iwlwifi: mvm: reset PM state on unsuccessful resume
	iwlwifi: pnvm: don't kmemdup() more than we have
	iwlwifi: pnvm: read EFI data only if long enough
	net: enetc: unmap DMA in enetc_send_cmd()
	phy: micrel: ksz8041nl: do not use power down mode
	nbd: Fix use-after-free in pid_show
	nvme-rdma: fix error code in nvme_rdma_setup_ctrl
	PM: hibernate: fix sparse warnings
	clocksource/drivers/timer-ti-dm: Select TIMER_OF
	x86/sev: Fix stack type check in vc_switch_off_ist()
	drm/msm: Fix potential NULL dereference in DPU SSPP
	drm/msm/dsi: fix wrong type in msm_dsi_host
	crypto: tcrypt - fix skcipher multi-buffer tests for 1420B blocks
	smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
	KVM: selftests: Fix nested SVM tests when built with clang
	libbpf: Fix memory leak in btf__dedup()
	bpftool: Avoid leaking the JSON writer prepared for program metadata
	libbpf: Fix overflow in BTF sanity checks
	libbpf: Fix BTF header parsing checks
	mt76: mt7615: mt7622: fix ibss and meshpoint
	s390/gmap: validate VMA in __gmap_zap()
	s390/gmap: don't unconditionally call pte_unmap_unlock() in __gmap_zap()
	s390/mm: validate VMA in PGSTE manipulation functions
	s390/mm: fix VMA and page table handling code in storage key handling functions
	s390/uv: fully validate the VMA before calling follow_page()
	KVM: s390: pv: avoid double free of sida page
	KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
	irq: mips: avoid nested irq_enter()
	net: dsa: avoid refcount warnings when ->port_{fdb,mdb}_del returns error
	ARM: 9142/1: kasan: work around LPAE build warning
	ath10k: fix module load regression with iram-recovery feature
	block: ataflop: more blk-mq refactoring fixes
	blk-cgroup: synchronize blkg creation against policy deactivation
	libbpf: Fix off-by-one bug in bpf_core_apply_relo()
	tpm: fix Atmel TPM crash caused by too frequent queries
	tpm_tis_spi: Add missing SPI ID
	libbpf: Fix endianness detection in BPF_CORE_READ_BITFIELD_PROBED()
	tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()
	tracing: Fix missing trace_boot_init_histograms kstrdup NULL checks
	cpufreq: intel_pstate: Fix cpu->pstate.turbo_freq initialization
	spi: spi-rpc-if: Check return value of rpcif_sw_init()
	samples/kretprobes: Fix return value if register_kretprobe() failed
	KVM: s390: Fix handle_sske page fault handling
	libertas_tf: Fix possible memory leak in probe and disconnect
	libertas: Fix possible memory leak in probe and disconnect
	wcn36xx: add proper DMA memory barriers in rx path
	wcn36xx: Fix discarded frames due to wrong sequence number
	bpf: Avoid races in __bpf_prog_run() for 32bit arches
	bpf: Fixes possible race in update_prog_stats() for 32bit arches
	wcn36xx: Channel list update before hardware scan
	drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()
	drm/amdgpu/gmc6: fix DMA mask from 44 to 40 bits
	selftests/bpf: Fix fd cleanup in sk_lookup test
	selftests/bpf: Fix memory leak in test_ima
	sctp: allow IP fragmentation when PLPMTUD enters Error state
	sctp: reset probe_timer in sctp_transport_pl_update
	sctp: subtract sctphdr len in sctp_transport_pl_hlen
	sctp: return true only for pathmtu update in sctp_transport_pl_toobig
	net: amd-xgbe: Toggle PLL settings during rate change
	ipmi: kcs_bmc: Fix a memory leak in the error handling path of 'kcs_bmc_serio_add_device()'
	nfp: fix NULL pointer access when scheduling dim work
	nfp: fix potential deadlock when canceling dim work
	net: phylink: avoid mvneta warning when setting pause parameters
	net: bridge: fix uninitialized variables when BRIDGE_CFM is disabled
	selftests: net: bridge: update IGMP/MLD membership interval value
	crypto: pcrypt - Delay write to padata->info
	selftests/bpf: Fix fclose/pclose mismatch in test_progs
	udp6: allow SO_MARK ctrl msg to affect routing
	ibmvnic: don't stop queue in xmit
	ibmvnic: Process crqs after enabling interrupts
	ibmvnic: delay complete()
	selftests: mptcp: fix proto type in link_failure tests
	skmsg: Lose offset info in sk_psock_skb_ingress
	cgroup: Fix rootcg cpu.stat guest double counting
	bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.
	bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.
	of: unittest: fix EXPECT text for gpio hog errors
	cpufreq: Fix parameter in parse_perf_domain()
	staging: r8188eu: fix memory leak in rtw_set_key
	arm64: dts: meson: sm1: add Ethernet PHY reset line for ODROID-C4/HC4
	iio: st_sensors: disable regulators after device unregistration
	RDMA/rxe: Fix wrong port_cap_flags
	ARM: dts: BCM5301X: Fix memory nodes names
	arm64: dts: broadcom: bcm4908: Fix UART clock name
	clk: mvebu: ap-cpu-clk: Fix a memory leak in error handling paths
	scsi: pm80xx: Fix lockup in outbound queue management
	scsi: qla2xxx: edif: Use link event to wake up app
	scsi: lpfc: Fix NVMe I/O failover to non-optimized path
	ARM: s3c: irq-s3c24xx: Fix return value check for s3c24xx_init_intc()
	arm64: dts: rockchip: Fix GPU register width for RK3328
	ARM: dts: qcom: msm8974: Add xo_board reference clock to DSI0 PHY
	RDMA/bnxt_re: Fix query SRQ failure
	arm64: dts: ti: k3-j721e-main: Fix "max-virtual-functions" in PCIe EP nodes
	arm64: dts: ti: k3-j721e-main: Fix "bus-range" upto 256 bus number for PCIe
	arm64: dts: ti: j7200-main: Fix "vendor-id"/"device-id" properties of pcie node
	arm64: dts: ti: j7200-main: Fix "bus-range" upto 256 bus number for PCIe
	arm64: dts: meson-g12a: Fix the pwm regulator supply properties
	arm64: dts: meson-g12b: Fix the pwm regulator supply properties
	arm64: dts: meson-sm1: Fix the pwm regulator supply properties
	bus: ti-sysc: Fix timekeeping_suspended warning on resume
	ARM: dts: at91: tse850: the emac<->phy interface is rmii
	arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality
	soc: qcom: llcc: Disable MMUHWT retention
	arm64: dts: qcom: sc7280: fix display port phy reg property
	scsi: dc395: Fix error case unwinding
	MIPS: loongson64: make CPU_LOONGSON64 depends on MIPS_FP_SUPPORT
	JFS: fix memleak in jfs_mount
	pinctrl: renesas: rzg2l: Fix missing port register 21h
	ASoC: wcd9335: Use correct version to initialize Class H
	arm64: dts: qcom: msm8916: Fix Secondary MI2S bit clock
	arm64: dts: renesas: beacon: Fix Ethernet PHY mode
	iommu/mediatek: Fix out-of-range warning with clang
	arm64: dts: qcom: pm8916: Remove wrong reg-names for rtc@6000
	iommu/dma: Fix sync_sg with swiotlb
	iommu/dma: Fix arch_sync_dma for map
	ALSA: hda: Reduce udelay() at SKL+ position reporting
	ALSA: hda: Use position buffer for SKL+ again
	ALSA: usb-audio: Fix possible race at sync of urb completions
	soundwire: debugfs: use controller id and link_id for debugfs
	power: reset: at91-reset: check properly the return value of devm_of_iomap
	scsi: ufs: core: Fix ufshcd_probe_hba() prototype to match the definition
	scsi: ufs: core: Stop clearing UNIT ATTENTIONS
	scsi: megaraid_sas: Fix concurrent access to ISR between IRQ polling and real interrupt
	scsi: pm80xx: Fix misleading log statement in pm8001_mpi_get_nvmd_resp()
	driver core: Fix possible memory leak in device_link_add()
	arm: dts: omap3-gta04a4: accelerometer irq fix
	ASoC: SOF: topology: do not power down primary core during topology removal
	iio: st_pressure_spi: Add missing entries SPI to device ID table
	soc/tegra: Fix an error handling path in tegra_powergate_power_up()
	memory: fsl_ifc: fix leak of irq and nand_irq in fsl_ifc_ctrl_probe
	clk: at91: check pmc node status before registering syscore ops
	powerpc/mem: Fix arch/powerpc/mm/mem.c:53:12: error: no previous prototype for 'create_section_mapping'
	video: fbdev: chipsfb: use memset_io() instead of memset()
	powerpc: fix unbalanced node refcount in check_kvm_guest()
	powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()
	serial: 8250_dw: Drop wrong use of ACPI_PTR()
	usb: gadget: hid: fix error code in do_config()
	power: supply: rt5033_battery: Change voltage values to µV
	power: supply: max17040: fix null-ptr-deref in max17040_probe()
	scsi: csiostor: Uninitialized data in csio_ln_vnp_read_cbfn()
	RDMA/mlx4: Return missed an error if device doesn't support steering
	usb: musb: select GENERIC_PHY instead of depending on it
	staging: most: dim2: do not double-register the same device
	staging: ks7010: select CRYPTO_HASH/CRYPTO_MICHAEL_MIC
	RDMA/core: Set sgtable nents when using ib_dma_virt_map_sg()
	dyndbg: make dyndbg a known cli param
	powerpc/perf: Fix cycles/instructions as PM_CYC/PM_INST_CMPL in power10
	pinctrl: renesas: checker: Fix off-by-one bug in drive register check
	ARM: dts: stm32: Reduce DHCOR SPI NOR frequency to 50 MHz
	ARM: dts: stm32: fix STUSB1600 Type-C irq level on stm32mp15xx-dkx
	ARM: dts: stm32: fix SAI sub nodes register range
	ARM: dts: stm32: fix AV96 board SAI2 pin muxing on stm32mp15
	ASoC: cs42l42: Always configure both ASP TX channels
	ASoC: cs42l42: Correct some register default values
	ASoC: cs42l42: Defer probe if request_threaded_irq() returns EPROBE_DEFER
	soc: qcom: rpmhpd: Make power_on actually enable the domain
	soc: qcom: socinfo: add two missing PMIC IDs
	iio: buffer: Fix double-free in iio_buffers_alloc_sysfs_and_mask()
	usb: typec: STUSB160X should select REGMAP_I2C
	iio: adis: do not disabe IRQs in 'adis_init()'
	soundwire: bus: stop dereferencing invalid slave pointer
	scsi: ufs: ufshcd-pltfrm: Fix memory leak due to probe defer
	scsi: lpfc: Wait for successful restart of SLI3 adapter during host sg_reset
	serial: imx: fix detach/attach of serial console
	usb: dwc2: drd: fix dwc2_force_mode call in dwc2_ovr_init
	usb: dwc2: drd: fix dwc2_drd_role_sw_set when clock could be disabled
	usb: dwc2: drd: reset current session before setting the new one
	powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCE
	usb: dwc3: gadget: Skip resizing EP's TX FIFO if already resized
	firmware: qcom_scm: Fix error retval in __qcom_scm_is_call_available()
	soc: qcom: rpmhpd: fix sm8350_mxc's peer domain
	soc: qcom: apr: Add of_node_put() before return
	arm64: dts: qcom: pmi8994: Fix "eternal"->"external" typo in WLED node
	arm64: dts: qcom: sdm845: Use RPMH_CE_CLK macro directly
	arm64: dts: qcom: sdm845: Fix Qualcomm crypto engine bus clock
	pinctrl: equilibrium: Fix function addition in multiple groups
	ASoC: topology: Fix stub for snd_soc_tplg_component_remove()
	phy: qcom-qusb2: Fix a memory leak on probe
	phy: ti: gmii-sel: check of_get_address() for failure
	phy: qcom-qmp: another fix for the sc8180x PCIe definition
	phy: qcom-snps: Correct the FSEL_MASK
	phy: Sparx5 Eth SerDes: Fix return value check in sparx5_serdes_probe()
	serial: xilinx_uartps: Fix race condition causing stuck TX
	clk: at91: sam9x60-pll: use DIV_ROUND_CLOSEST_ULL
	clk: at91: clk-master: check if div or pres is zero
	clk: at91: clk-master: fix prescaler logic
	HID: u2fzero: clarify error check and length calculations
	HID: u2fzero: properly handle timeouts in usb_submit_urb
	powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
	powerpc/book3e: Fix set_memory_x() and set_memory_nx()
	powerpc/44x/fsp2: add missing of_node_put
	powerpc/xmon: fix task state output
	ALSA: oxfw: fix functional regression for Mackie Onyx 1640i in v5.14 or later
	iommu/dma: Fix incorrect error return on iommu deferred attach
	powerpc: Don't provide __kernel_map_pages() without ARCH_SUPPORTS_DEBUG_PAGEALLOC
	ASoC: cs42l42: Correct configuring of switch inversion from ts-inv
	RDMA/hns: Fix initial arm_st of CQ
	RDMA/hns: Modify the value of MAX_LP_MSG_LEN to meet hardware compatibility
	ASoC: rsnd: Fix an error handling path in 'rsnd_node_count()'
	serial: cpm_uart: Protect udbg definitions by CONFIG_SERIAL_CPM_CONSOLE
	virtio_ring: check desc == NULL when using indirect with packed
	vdpa/mlx5: Fix clearing of VIRTIO_NET_F_MAC feature bit
	mips: cm: Convert to bitfield API to fix out-of-bounds access
	power: supply: bq27xxx: Fix kernel crash on IRQ handler register error
	RDMA/core: Require the driver to set the IOVA correctly during rereg_mr
	apparmor: fix error check
	rpmsg: Fix rpmsg_create_ept return when RPMSG config is not defined
	mtd: rawnand: intel: Fix potential buffer overflow in probe
	nfsd: don't alloc under spinlock in rpc_parse_scope_id
	rtc: ds1302: Add SPI ID table
	rtc: ds1390: Add SPI ID table
	rtc: pcf2123: Add SPI ID table
	remoteproc: imx_rproc: Fix TCM io memory type
	i2c: i801: Use PCI bus rescan mutex to protect P2SB access
	dmaengine: idxd: move out percpu_ref_exit() to ensure it's outside submission
	rtc: mcp795: Add SPI ID table
	Input: ariel-pwrbutton - add SPI device ID table
	i2c: mediatek: fixing the incorrect register offset
	NFS: Default change_attr_type to NFS4_CHANGE_TYPE_IS_UNDEFINED
	NFS: Don't set NFS_INO_DATA_INVAL_DEFER and NFS_INO_INVALID_DATA
	NFS: Ignore the directory size when marking for revalidation
	NFS: Fix dentry verifier races
	pnfs/flexfiles: Fix misplaced barrier in nfs4_ff_layout_prepare_ds
	drm/bridge/lontium-lt9611uxc: fix provided connector suport
	drm/plane-helper: fix uninitialized variable reference
	PCI: aardvark: Don't spam about PIO Response Status
	PCI: aardvark: Fix preserving PCI_EXP_RTCTL_CRSSVE flag on emulated bridge
	opp: Fix return in _opp_add_static_v2()
	NFS: Fix deadlocks in nfs_scan_commit_list()
	sparc: Add missing "FORCE" target when using if_changed
	fs: orangefs: fix error return code of orangefs_revalidate_lookup()
	Input: st1232 - increase "wait ready" timeout
	drm/bridge: nwl-dsi: Add atomic_get_input_bus_fmts
	mtd: spi-nor: hisi-sfc: Remove excessive clk_disable_unprepare()
	PCI: uniphier: Serialize INTx masking/unmasking and fix the bit operation
	mtd: rawnand: arasan: Prevent an unsupported configuration
	mtd: core: don't remove debugfs directory if device is in use
	remoteproc: Fix a memory leak in an error handling path in 'rproc_handle_vdev()'
	rtc: rv3032: fix error handling in rv3032_clkout_set_rate()
	dmaengine: at_xdmac: call at_xdmac_axi_config() on resume path
	dmaengine: at_xdmac: fix AT_XDMAC_CC_PERID() macro
	dmaengine: stm32-dma: fix stm32_dma_get_max_width
	NFS: Fix up commit deadlocks
	NFS: Fix an Oops in pnfs_mark_request_commit()
	Fix user namespace leak
	auxdisplay: img-ascii-lcd: Fix lock-up when displaying empty string
	auxdisplay: ht16k33: Connect backlight to fbdev
	auxdisplay: ht16k33: Fix frame buffer device blanking
	soc: fsl: dpaa2-console: free buffer before returning from dpaa2_console_read
	netfilter: nfnetlink_queue: fix OOB when mac header was cleared
	dmaengine: dmaengine_desc_callback_valid(): Check for `callback_result`
	dmaengine: tegra210-adma: fix pm runtime unbalance
	dmanegine: idxd: fix resource free ordering on driver removal
	dmaengine: idxd: reconfig device after device reset command
	signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
	m68k: set a default value for MEMORY_RESERVE
	watchdog: f71808e_wdt: fix inaccurate report in WDIOC_GETTIMEOUT
	ar7: fix kernel builds for compiler test
	scsi: target: core: Remove from tmr_list during LUN unlink
	scsi: qla2xxx: Relogin during fabric disturbance
	scsi: qla2xxx: Fix gnl list corruption
	scsi: qla2xxx: Turn off target reset during issue_lip
	scsi: qla2xxx: edif: Fix app start fail
	scsi: qla2xxx: edif: Fix app start delay
	scsi: qla2xxx: edif: Flush stale events and msgs on session down
	scsi: qla2xxx: edif: Increase ELS payload
	scsi: qla2xxx: edif: Fix EDIF bsg
	NFSv4: Fix a regression in nfs_set_open_stateid_locked()
	dmaengine: idxd: fix resource leak on dmaengine driver disable
	i2c: xlr: Fix a resource leak in the error handling path of 'xlr_i2c_probe()'
	gpio: realtek-otto: fix GPIO line IRQ offset
	xen-pciback: Fix return in pm_ctrl_init()
	nbd: fix max value for 'first_minor'
	nbd: fix possible overflow for 'first_minor' in nbd_dev_add()
	io-wq: fix max-workers not correctly set on multi-node system
	net: davinci_emac: Fix interrupt pacing disable
	kselftests/net: add missed icmp.sh test to Makefile
	kselftests/net: add missed setup_loopback.sh/setup_veth.sh to Makefile
	kselftests/net: add missed SRv6 tests
	kselftests/net: add missed vrf_strict_mode_test.sh test to Makefile
	kselftests/net: add missed toeplitz.sh/toeplitz_client.sh to Makefile
	ethtool: fix ethtool msg len calculation for pause stats
	openrisc: fix SMP tlb flush NULL pointer dereference
	net: vlan: fix a UAF in vlan_dev_real_dev()
	net: dsa: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
	ice: Fix replacing VF hardware MAC to existing MAC filter
	ice: Fix not stopping Tx queues for VFs
	kdb: Adopt scheduler's task classification
	ACPI: PMIC: Fix intel_pmic_regs_handler() read accesses
	PCI: j721e: Fix j721e_pcie_probe() error path
	nvdimm/btt: do not call del_gendisk() if not needed
	scsi: bsg: Fix errno when scsi_bsg_register_queue() fails
	scsi: ufs: ufshpb: Use proper power management API
	scsi: ufs: core: Fix NULL pointer dereference
	scsi: ufs: ufshpb: Properly handle max-single-cmd
	selftests: net: properly support IPv6 in GSO GRE test
	drm/nouveau/svm: Fix refcount leak bug and missing check against null bug
	nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
	block/ataflop: use the blk_cleanup_disk() helper
	block/ataflop: add registration bool before calling del_gendisk()
	block/ataflop: provide a helper for cleanup up an atari disk
	ataflop: remove ataflop_probe_lock mutex
	PCI: Do not enable AtomicOps on VFs
	cpufreq: intel_pstate: Clear HWP desired on suspend/shutdown and offline
	net: phy: fix duplex out of sync problem while changing settings
	block: fix device_add_disk() kobject_create_and_add() error handling
	drm/ttm: remove ttm_bo_vm_insert_huge()
	bonding: Fix a use-after-free problem when bond_sysfs_slave_add() failed
	octeontx2-pf: select CONFIG_NET_DEVLINK
	ALSA: memalloc: Catch call with NULL snd_dma_buffer pointer
	mfd: core: Add missing of_node_put for loop iteration
	mfd: cpcap: Add SPI device ID table
	mfd: sprd: Add SPI device ID table
	mfd: altera-sysmgr: Fix a mistake caused by resource_size conversion
	ACPI: PM: Fix device wakeup power reference counting error
	libbpf: Fix lookup_and_delete_elem_flags error reporting
	selftests/bpf/xdp_redirect_multi: Put the logs to tmp folder
	selftests/bpf/xdp_redirect_multi: Use arping to accurate the arp number
	selftests/bpf/xdp_redirect_multi: Give tcpdump a chance to terminate cleanly
	selftests/bpf/xdp_redirect_multi: Limit the tests in netns
	drm: fb_helper: improve CONFIG_FB dependency
	Revert "drm/imx: Annotate dma-fence critical section in commit path"
	drm/amdgpu/powerplay: fix sysfs_emit/sysfs_emit_at handling
	can: etas_es58x: es58x_rx_err_msg(): fix memory leak in error path
	can: mcp251xfd: mcp251xfd_chip_start(): fix error handling for mcp251xfd_chip_rx_int_enable()
	mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()
	zram: off by one in read_block_state()
	perf bpf: Add missing free to bpf_event__print_bpf_prog_info()
	llc: fix out-of-bound array index in llc_sk_dev_hash()
	nfc: pn533: Fix double free when pn533_fill_fragment_skbs() fails
	litex_liteeth: Fix a double free in the remove function
	arm64: arm64_ftr_reg->name may not be a human-readable string
	arm64: pgtable: make __pte_to_phys/__phys_to_pte_val inline functions
	bpf, sockmap: Remove unhash handler for BPF sockmap usage
	bpf, sockmap: Fix race in ingress receive verdict with redirect to self
	bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
	bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
	dmaengine: stm32-dma: fix burst in case of unaligned memory address
	dmaengine: stm32-dma: avoid 64-bit division in stm32_dma_get_max_width
	gve: Fix off by one in gve_tx_timeout()
	drm/i915/fb: Fix rounding error in subsampled plane size calculation
	init: make unknown command line param message clearer
	seq_file: fix passing wrong private data
	drm/amdgpu: fix uvd crash on Polaris12 during driver unloading
	net: dsa: mv88e6xxx: Don't support >1G speeds on 6191X on ports other than 10
	net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_any
	net: hns3: fix ROCE base interrupt vector initialization bug
	net: hns3: fix pfc packet number incorrect after querying pfc parameters
	net: hns3: fix kernel crash when unload VF while it is being reset
	net: hns3: allow configure ETS bandwidth of all TCs
	net: stmmac: allow a tc-taprio base-time of zero
	net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
	net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
	vsock: prevent unnecessary refcnt inc for nonblocking connect
	net/smc: fix sk_refcnt underflow on linkdown and fallback
	cxgb4: fix eeprom len when diagnostics not implemented
	selftests/net: udpgso_bench_rx: fix port argument
	thermal: int340x: fix build on 32-bit targets
	smb3: do not error on fsync when readonly
	ARM: 9155/1: fix early early_iounmap()
	ARM: 9156/1: drop cc-option fallbacks for architecture selection
	parisc: Fix backtrace to always include init funtion names
	parisc: Flush kernel data mapping in set_pte_at() when installing pte for user page
	MIPS: fix duplicated slashes for Platform file path
	MIPS: fix *-pkg builds for loongson2ef platform
	MIPS: Fix assembly error from MIPSr2 code used within MIPS_ISA_ARCH_LEVEL
	x86/mce: Add errata workaround for Skylake SKX37
	PCI/MSI: Move non-mask check back into low level accessors
	PCI/MSI: Destroy sysfs before freeing entries
	KVM: x86: move guest_pv_has out of user_access section
	posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
	irqchip/sifive-plic: Fixup EOI failed when masked
	f2fs: should use GFP_NOFS for directory inodes
	f2fs: include non-compressed blocks in compr_written_block
	f2fs: fix UAF in f2fs_available_free_memory
	ceph: fix mdsmap decode when there are MDS's beyond max_mds
	erofs: fix unsafe pagevec reuse of hooked pclusters
	drm/i915/guc: Fix blocked context accounting
	block: Hold invalidate_lock in BLKDISCARD ioctl
	block: Hold invalidate_lock in BLKZEROOUT ioctl
	block: Hold invalidate_lock in BLKRESETZONE ioctl
	ksmbd: Fix buffer length check in fsctl_validate_negotiate_info()
	ksmbd: don't need 8byte alignment for request length in ksmbd_check_message
	dmaengine: ti: k3-udma: Set bchan to NULL if a channel request fail
	dmaengine: ti: k3-udma: Set r/tchan or rflow to NULL if request fail
	dmaengine: bestcomm: fix system boot lockups
	net, neigh: Enable state migration between NUD_PERMANENT and NTF_USE
	9p/net: fix missing error check in p9_check_errors
	mm/filemap.c: remove bogus VM_BUG_ON
	memcg: prohibit unconditional exceeding the limit of dying tasks
	mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
	mm, oom: do not trigger out_of_memory from the #PF
	mm, thp: lock filemap when truncating page cache
	mm, thp: fix incorrect unmap behavior for private pages
	mfd: dln2: Add cell for initializing DLN2 ADC
	video: backlight: Drop maximum brightness override for brightness zero
	bcache: fix use-after-free problem in bcache_device_free()
	bcache: Revert "bcache: use bvec_virt"
	PM: sleep: Avoid calling put_device() under dpm_list_mtx
	s390/cpumf: cpum_cf PMU displays invalid value after hotplug remove
	s390/cio: check the subchannel validity for dev_busid
	s390/tape: fix timer initialization in tape_std_assign()
	s390/ap: Fix hanging ioctl caused by orphaned replies
	s390/cio: make ccw_device_dma_* more robust
	remoteproc: elf_loader: Fix loading segment when is_iomem true
	remoteproc: Fix the wrong default value of is_iomem
	remoteproc: imx_rproc: Fix ignoring mapping vdev regions
	remoteproc: imx_rproc: Fix rsc-table name
	mtd: rawnand: fsmc: Fix use of SM ORDER
	mtd: rawnand: ams-delta: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: xway: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: mpc5121: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: gpio: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: pasemi: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: orion: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: plat_nand: Keep the driver compatible with on-die ECC engines
	mtd: rawnand: au1550nd: Keep the driver compatible with on-die ECC engines
	powerpc/vas: Fix potential NULL pointer dereference
	powerpc/bpf: Fix write protecting JIT code
	powerpc/32e: Ignore ESR in instruction storage interrupt handler
	powerpc/powernv/prd: Unregister OPAL_MSG_PRD2 notifier during module unload
	powerpc/security: Use a mutex for interrupt exit code patching
	powerpc/64s/interrupt: Fix check_return_regs_valid() false positive
	powerpc/pseries/mobility: ignore ibm, platform-facilities updates
	powerpc/85xx: fix timebase sync issue when CONFIG_HOTPLUG_CPU=n
	drm/sun4i: Fix macros in sun8i_csc.h
	PCI: Add PCI_EXP_DEVCTL_PAYLOAD_* macros
	PCI: aardvark: Fix PCIe Max Payload Size setting
	SUNRPC: Partial revert of commit 6f9f17287e
	drm/amd/display: Look at firmware version to determine using dmub on dcn21
	media: vidtv: move kfree(dvb) to vidtv_bridge_dev_release()
	cifs: fix memory leak of smb3_fs_context_dup::server_hostname
	ath10k: fix invalid dma_addr_t token assignment
	mmc: moxart: Fix null pointer dereference on pointer host
	selftests/x86/iopl: Adjust to the faked iopl CLI/STI usage
	selftests/bpf: Fix also no-alu32 strobemeta selftest
	arch/cc: Introduce a function to check for confidential computing features
	x86/sev: Add an x86 version of cc_platform_has()
	x86/sev: Make the #VC exception stacks part of the default stacks storage
	media: videobuf2: always set buffer vb2 pointer
	media: videobuf2-dma-sg: Fix buf->vb NULL pointer dereference
	Linux 5.15.3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I09574eb6b4fbe930bd13f932cc618846972fcc27
2021-11-19 15:38:07 +01:00
Rongwei Wang
3ea871f0d8 mm, thp: fix incorrect unmap behavior for private pages
commit 8468e937df1f31411d1e127fa38db064af051fe5 upstream.

When truncating pagecache on file THP, the private pages of a process
should not be unmapped mapping.  This incorrect behavior on a dynamic
shared libraries which will cause related processes to happen core dump.

A simple test for a DSO (Prerequisite is the DSO mapped in file THP):

    int main(int argc, char *argv[])
    {
	int fd;

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
	}

	close(fd);
	return 0;
    }

The test only to open a target DSO, and do nothing.  But this operation
will lead one or more process to happen core dump.  This patch mainly to
fix this bug.

Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0a ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 19:17:17 +01:00
Rongwei Wang
fd8e972dc4 mm, thp: lock filemap when truncating page cache
commit 55fc0d91746759c71bc165bba62a2db64ac98e35 upstream.

Patch series "fix two bugs for file THP".

This patch (of 2):

Transparent huge page has supported read-only non-shmem files.  The
file- backed THP is collapsed by khugepaged and truncated when written
(for shared libraries).

However, there is a race when multiple writers truncate the same page
cache concurrently.

In that case, subpage(s) of file THP can be revealed by find_get_entry
in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows:

    page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
    head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
    flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
    raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
    head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
    head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
    page dumped because: VM_BUG_ON_PAGE(PageTail(page))
    ------------[ cut here ]------------
    kernel BUG at mm/truncate.c:213!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
    CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
    Hardware name: ECS, BIOS 0.0.0 02/06/2015
    pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
    Call trace:
     truncate_inode_page+0x64/0x70
     truncate_inode_pages_range+0x550/0x7e4
     truncate_pagecache+0x58/0x80
     do_dentry_open+0x1e4/0x3c0
     vfs_open+0x38/0x44
     do_open+0x1f0/0x310
     path_openat+0x114/0x1dc
     do_filp_open+0x84/0x134
     do_sys_openat2+0xbc/0x164
     __arm64_sys_openat+0x74/0xc0
     el0_svc_common.constprop.0+0x88/0x220
     do_el0_svc+0x30/0xa0
     el0_svc+0x20/0x30
     el0_sync_handler+0x1a4/0x1b0
     el0_sync+0x180/0x1c0
    Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)

This patch mainly to lock filemap when one enter truncate_pagecache(),
avoiding truncating the same page cache concurrently.

Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0a ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Song Liu <song@kernel.org>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18 19:17:16 +01:00
David Anderson
8c288004aa FROMLIST: Add flags option to get xattr method paired to __vfs_getxattr
Add a flag option to get xattr method that could have a bit flag of
XATTR_NOSECURITY passed to it.  XATTR_NOSECURITY is generally then
set in the __vfs_getxattr path when called by security
infrastructure.

This handles the case of a union filesystem driver that is being
requested by the security layer to report back the xattr data.

For the use case where access is to be blocked by the security layer.

The path then could be security(dentry) ->
__vfs_getxattr(dentry...XATTR_NOSECURITY) ->
handler->get(dentry...XATTR_NOSECURITY) ->
__vfs_getxattr(lower_dentry...XATTR_NOSECURITY) ->
lower_handler->get(lower_dentry...XATTR_NOSECURITY)
which would report back through the chain data and success as
expected, the logging security layer at the top would have the
data to determine the access permissions and report back the target
context that was blocked.

Without the get handler flag, the path on a union filesystem would be
the errant security(dentry) -> __vfs_getxattr(dentry) ->
handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested ->
security(lower_dentry, log off) -> lower_handler->get(lower_dentry)
which would report back through the chain no data, and -EACCES.

For selinux for both cases, this would translate to a correctly
determined blocked access. In the first case with this change a correct avc
log would be reported, in the second legacy case an incorrect avc log
would be reported against an uninitialized u:object_r:unlabeled:s0
context making the logs cosmetically useless for audit2allow.

This patch series is inert and is the wide-spread addition of the
flags option for xattr functions, and a replacement of __vfs_getxattr
with __vfs_getxattr(...XATTR_NOSECURITY).

Bug: 204981027
Link: https://lore.kernel.org/lkml/20211117015806.2192263-2-dvander@google.com
Change-Id: Id2c6fa6eeb2b5cca5a11e0cd02a3fbf2a5fcbef4
Signed-off-by: David Anderson <dvander@google.com>
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
2021-11-16 18:14:17 -08:00
Alistair Delva
62d9cb66f6 ANDROID: Add filp_open_block() for zram
We currently plan to disallow use of filp_open() from drivers in GKI,
however the ZRAM driver still needs it. Add a new GKI-only variant of
filp_open() which only permits a block device to be opened, which can
be exported instead. This keeps ZRAM working but cuts down on drivers
that attempt to open and write files in kernel mode.

Bug: 179220339
Bug: 205141088
Change-Id: Id696b4aaf204b0499ce0a1b6416648670236e570
Signed-off-by: Alistair Delva <adelva@google.com>
2021-11-04 20:06:00 +00:00
Kuan-Ying Lee
996496b8ea ANDROID: syscall_check: add vendor hook for open syscall
Through this vendor hook, we can get the timing to check
current running task for the validation of its credential
and open operation.

Bug: 191291287

Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ia644ceb02dbc230ee1d25cad3630c2c3f908e41a
(cherry picked from commit a7a3b31d58)
2021-10-29 14:20:13 +08:00
Jeff Layton
f7e33bdbd6 fs: remove mandatory file locking support
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-23 06:15:36 -04:00
Linus Torvalds
58ec9059b3 Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs name lookup updates from Al Viro:
 "Small namei.c patch series, mostly to simplify the rules for nameidata
  state. It's actually from the previous cycle - but I didn't post it
  for review in time...

  Changes visible outside of fs/namei.c: file_open_root() calling
  conventions change, some freed bits in LOOKUP_... space"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: make sure nd->depth is always valid
  teach set_nameidata() to handle setting the root as well
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
  switch file_open_root() to struct path
2021-07-03 11:41:14 -07:00
Linus Torvalds
71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Collin Fijalkovich
eb6ecbed0a mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE.  This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open().  Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g.  Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()).  It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout.  The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping
to be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
Christian Brauner
cfe80306a0 open: don't silently ignore unknown O-flags in openat2()
The new openat2() syscall verifies that no unknown O-flag values are
set and returns an error to userspace if they are while the older open
syscalls like open() and openat() simply ignore unknown flag values:

  #define O_FLAG_CURRENTLY_INVALID (1 << 31)
  struct open_how how = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
          .resolve = 0,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

  /* succeeds */
  fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

However, openat2() silently truncates the upper 32 bits meaning:

  #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
  #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)

  struct open_how how_lowe32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
  };

  struct open_how how_upper32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

  /* succeeds */
  fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

Fix this by preventing the immediate truncation in build_open_flags().

There's a snafu here though stripping FMODE_* directly from flags would
cause the upper 32 bits to be truncated as well due to integer promotion
rules since FMODE_* is unsigned int, O_* are signed ints (yuck).

In addition, struct open_flags currently defines flags to be 32 bit
which is reasonable. If we simply were to bump it to 64 bit we would
need to change a lot of code preemptively which doesn't seem worth it.
So simply add a compile-time check verifying that all currently known
O_* flags are within the 32 bit range and fail to build if they aren't
anymore.

This change shouldn't regress old open syscalls since they silently
truncate any unknown values anyway. It is a tiny semantic change for
openat2() but it is very unlikely people pass ing > 32 bit unknown flags
and the syscall is relatively new too.

Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-05-28 17:44:37 +02:00
Al Viro
ffb37ca3bd switch file_open_root() to struct path
... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:56:43 -04:00
Linus Torvalds
7d6beb71da Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Christian Brauner
b8b546a061 open: handle idmapped mounts
For core file operations such as changing directories or chrooting,
determining file access, changing mode or ownership the vfs will verify
that the caller is privileged over the inode. Extend the various helpers
to handle idmapped mounts. If the inode is accessed through an idmapped
mount map it into the mount's user namespace. Afterwards the permissions
checks are identical to non-idmapped mounts. When changing file
ownership we need to map the uid and gid from the mount's user
namespace. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
643fe55a06 open: handle idmapped mounts in do_truncate()
When truncating files the vfs will verify that the caller is privileged
over the inode. Extend it to handle idmapped mounts. If the inode is
accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the permissions checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
2f221d6f7b attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
47291baa8d namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
02f92b3868 fs: add file and path permissions helpers
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Jens Axboe
99668f6180 fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:42:26 -05:00
Linus Torvalds
faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Eric W. Biederman
8760c909f5 file: Rename __close_fd to close_fd and remove the files parameter
The function __close_fd was added to support binder[1].  Now that
binder has been fixed to no longer need __close_fd[2] all calls
to __close_fd pass current->files.

Therefore transform the files parameter into a local variable
initialized to current->files, and rename __close_fd to close_fd to
reflect this change, and keep it in sync with the similar changes to
__alloc_fd, and __fd_install.

This removes the need for callers to care about the extra care that
needs to be take if anything except current->files is passed, by
limiting the callers to only operation on current->files.

[1] 483ce1d4b8 ("take descriptor-related part of close() to file.c")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Aleksa Sarai
398840f8bb openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.

Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-03 10:15:50 +01:00
Kees Cook
633fb6ac39 exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late.  This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.

Move the test into the other inode type checks (which already look for
other pathological conditions[1]).  Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.

Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
		    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()

[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Linus Torvalds
e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Christoph Hellwig
eb9d7d390e init: add an init_eaccess helper
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it.  Note that this
theoretically changes behavior as it always is based on the effective
permissions.  But during early init that doesn't make a difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
1097742efc init: add an init_chmod helper
Add a simple helper to chmod with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
b873498f99 init: add an init_chown helper
Add a simple helper to chown with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
4b7ca5014c init: add an init_chroot helper
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chroot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
db63f1e315 init: add an init_chdir helper
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chdir.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
b25ba7c3c9 fs: remove ksys_fchmod
Fold it into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:01 +02:00
Christoph Hellwig
166e07c37c fs: remove ksys_open
Just open code it in the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:00 +02:00
Christoph Hellwig
9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmode operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00
Christoph Hellwig
c04011fe8c fs: add a vfs_fchown helper
Add a helper for struct file based chown operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:00 +02:00
Christian Brauner
60997c3d45 close_range: add CLOSE_RANGE_UNSHARE
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:

unshare(CLONE_FILES);
close_range(3, ~0U);

as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-17 00:07:38 +02:00
Christian Brauner
278a5fbaed open: add close_range()
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once its merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: 9e4f2f3a6b/Modules/_posixsubprocess.c (L220)
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: 5238e95759/src/basic/fd-util.c (L217)
[5]: ddf4b77e11/src/lxc/start.c (L236)
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
     Note, in a recent patch series Florian made grantpt() a nop thereby
     removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: 5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2020-06-17 00:05:19 +02:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Jeff Layton
735e4ae5ba vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back.  It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect.  If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error.  syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures.  Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor.  This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though.  This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Miklos Szeredi
c8ffd8bcdd vfs: add faccessat2 syscall
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

Add a new faccessat(2) syscall with the added flags argument and implement
both flags.

The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR.  Use this value for the kernel interface as well, together
with the explanatory comment.

Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
9470451505 vfs: split out access_override_creds()
Split out a helper that overrides the credentials in preparation for
actually doing the access check.

This prepares for the next patch that optionally disables the creds
override.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Linus Torvalds
9c577491b9 Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathwalk sanitizing from Al Viro:
 "Massive pathwalk rewrite and cleanups.

  Several iterations have been posted; hopefully this thing is getting
  readable and understandable now. Pretty much all parts of pathname
  resolutions are affected...

  The branch is identical to what has sat in -next, except for commit
  message in "lift all calls of step_into() out of follow_dotdot/
  follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
  commit message changed there."

* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
  lookup_open(): don't bother with fallbacks to lookup+create
  atomic_open(): no need to pass struct open_flags anymore
  open_last_lookups(): move complete_walk() into do_open()
  open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
  open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
  open_last_lookups(): consolidate fsnotify_create() calls
  take post-lookup part of do_last() out of loop
  link_path_walk(): sample parent's i_uid and i_mode for the last component
  __nd_alloc_stack(): make it return bool
  reserve_stack(): switch to __nd_alloc_stack()
  pick_link(): take reserving space on stack into a new helper
  pick_link(): more straightforward handling of allocation failures
  fold path_to_nameidata() into its only remaining caller
  pick_link(): pass it struct path already with normal refcounting rules
  fs/namei.c: kill follow_mount()
  non-RCU analogue of the previous commit
  helper for mount rootwards traversal
  follow_dotdot(): be lazy about changing nd->path
  follow_dotdot_rcu(): be lazy about changing nd->path
  follow_dotdot{,_rcu}(): massage loops
  ...
2020-04-02 12:30:08 -07:00
Al Viro
d9a9f4849f cifs_atomic_open(): fix double-put on late allocation failure
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12 18:25:20 -04:00
Al Viro
31d1726d72 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Jens Axboe
35cb6d54c1 fs: make build_open_flags() available internally
This is a prep patch for supporting non-blocking open from io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Aleksa Sarai
fddb5d430a open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb8 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs

Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 09:19:18 -05:00