Commit Graph

749 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
06d58f3cef Merge 5.15.54 into android14-5.15
Changes in 5.15.54
	mm/slub: add missing TID updates on slab deactivation
	mm/filemap: fix UAF in find_lock_entries
	Revert "selftests/bpf: Add test for bpf_timer overwriting crash"
	ALSA: usb-audio: Workarounds for Behringer UMC 204/404 HD
	ALSA: hda/realtek: Add quirk for Clevo L140PU
	ALSA: cs46xx: Fix missing snd_card_free() call at probe error
	can: bcm: use call_rcu() instead of costly synchronize_rcu()
	can: grcan: grcan_probe(): remove extra of_node_get()
	can: gs_usb: gs_usb_open/close(): fix memory leak
	can: m_can: m_can_chip_config(): actually enable internal timestamping
	can: m_can: m_can_{read_fifo,echo_tx_event}(): shift timestamp to full 32 bits
	can: mcp251xfd: mcp251xfd_regmap_crc_read(): improve workaround handling for mcp2517fd
	can: mcp251xfd: mcp251xfd_regmap_crc_read(): update workaround broken CRC on TBC register
	bpf: Fix incorrect verifier simulation around jmp32's jeq/jne
	bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_vals
	usbnet: fix memory leak in error case
	net: rose: fix UAF bug caused by rose_t0timer_expiry
	netfilter: nft_set_pipapo: release elements in clone from abort path
	netfilter: nf_tables: stricter validation of element data
	btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk
	btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
	btrfs: fix invalid delayed ref after subvolume creation failure
	btrfs: fix warning when freeing leaf after subvolume creation failure
	Input: cpcap-pwrbutton - handle errors from platform_get_irq()
	Input: goodix - change goodix_i2c_write() len parameter type to int
	Input: goodix - add a goodix.h header file
	Input: goodix - refactor reset handling
	Input: goodix - try not to touch the reset-pin on x86/ACPI devices
	dma-buf/poll: Get a file reference for outstanding fence callbacks
	btrfs: fix deadlock between chunk allocation and chunk btree modifications
	drm/i915: Disable bonding on gen12+ platforms
	drm/i915/gt: Register the migrate contexts with their engines
	drm/i915: Replace the unconditional clflush with drm_clflush_virt_range()
	PCI/portdrv: Rename pm_iter() to pcie_port_device_iter()
	PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
	media: ir_toy: prevent device from hanging during transmit
	memory: renesas-rpc-if: Avoid unaligned bus access for HyperFlash
	ath11k: add hw_param for wakeup_mhi
	qed: Improve the stack space of filter_config()
	platform/x86: wmi: introduce helper to convert driver to WMI driver
	platform/x86: wmi: Replace read_takes_no_args with a flags field
	platform/x86: wmi: Fix driver->notify() vs ->probe() race
	mt76: mt7921: get rid of mt7921_mac_set_beacon_filter
	mt76: mt7921: introduce mt7921_mcu_set_beacon_filter utility routine
	mt76: mt7921: fix a possible race enabling/disabling runtime-pm
	bpf: Stop caching subprog index in the bpf_pseudo_func insn
	bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC
	riscv: defconfig: enable DRM_NOUVEAU
	RISC-V: defconfigs: Set CONFIG_FB=y, for FB console
	net/mlx5e: Check action fwd/drop flag exists also for nic flows
	net/mlx5e: Split actions_match_supported() into a sub function
	net/mlx5e: TC, Reject rules with drop and modify hdr action
	net/mlx5e: TC, Reject rules with forward and drop actions
	ASoC: rt5682: Avoid the unexpected IRQ event during going to suspend
	ASoC: rt5682: Re-detect the combo jack after resuming
	ASoC: rt5682: Fix deadlock on resume
	netfilter: nf_tables: convert pktinfo->tprot_set to flags field
	netfilter: nft_payload: support for inner header matching / mangling
	netfilter: nft_payload: don't allow th access for fragments
	s390/boot: allocate amode31 section in decompressor
	s390/setup: use physical pointers for memblock_reserve()
	s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
	ibmvnic: init init_done_rc earlier
	ibmvnic: clear fop when retrying probe
	ibmvnic: Allow queueing resets during probe
	virtio-blk: avoid preallocating big SGL for data
	io_uring: ensure that fsnotify is always called
	block: use bdev_get_queue() in bio.c
	block: only mark bio as tracked if it really is tracked
	block: fix rq-qos breakage from skipping rq_qos_done_bio()
	stddef: Introduce struct_group() helper macro
	media: omap3isp: Use struct_group() for memcpy() region
	media: davinci: vpif: fix use-after-free on driver unbind
	mt76: mt76_connac: fix MCU_CE_CMD_SET_ROC definition error
	mt76: mt7921: do not always disable fw runtime-pm
	cxl/port: Hold port reference until decoder release
	clk: renesas: r9a07g044: Update multiplier and divider values for PLL2/3
	KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
	KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook
	scsi: qla2xxx: Move heartbeat handling from DPC thread to workqueue
	scsi: qla2xxx: Fix laggy FC remote port session recovery
	scsi: qla2xxx: edif: Replace list_for_each_safe with list_for_each_entry_safe
	scsi: qla2xxx: Fix crash during module load unload test
	gfs2: Fix gfs2_file_buffered_write endless loop workaround
	vdpa/mlx5: Avoid processing works if workqueue was destroyed
	btrfs: handle device lookup with btrfs_dev_lookup_args
	btrfs: add a btrfs_get_dev_args_from_path helper
	btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
	btrfs: remove device item and update super block in the same transaction
	drbd: add error handling support for add_disk()
	drbd: Fix double free problem in drbd_create_device
	drbd: fix an invalid memory access caused by incorrect use of list iterator
	drm/amd/display: Set min dcfclk if pipe count is 0
	drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw
	NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
	NFSD: COMMIT operations must not return NFS?ERR_INVAL
	riscv/mm: Add XIP_FIXUP for riscv_pfn_base
	iio: accel: mma8452: use the correct logic to get mma8452_data
	batman-adv: Use netif_rx().
	mtd: spi-nor: Skip erase logic when SPI_NOR_NO_ERASE is set
	Compiler Attributes: add __alloc_size() for better bounds checking
	mm: vmalloc: introduce array allocation functions
	KVM: use __vcalloc for very large allocations
	btrfs: don't access possibly stale fs_info data in device_list_add
	KVM: s390x: fix SCK locking
	scsi: qla2xxx: Fix loss of NVMe namespaces after driver reload test
	powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
	powerpc: flexible GPR range save/restore macros
	powerpc/tm: Fix more userspace r13 corruption
	serial: sc16is7xx: Clear RS485 bits in the shutdown
	bus: mhi: core: Use correctly sized arguments for bit field
	bus: mhi: Fix pm_state conversion to string
	stddef: Introduce DECLARE_FLEX_ARRAY() helper
	uapi/linux/stddef.h: Add include guards
	ASoC: rt5682: move clk related code to rt5682_i2c_probe
	ASoC: rt5682: fix an incorrect NULL check on list iterator
	drm/amd/vcn: fix an error msg on vcn 3.0
	KVM: Don't create VM debugfs files outside of the VM directory
	tty: n_gsm: Modify CR,PF bit when config requester
	tty: n_gsm: Save dlci address open status when config requester
	tty: n_gsm: fix frame reception handling
	ALSA: usb-audio: add mapping for MSI MPG X570S Carbon Max Wifi.
	ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
	tty: n_gsm: fix missing update of modem controls after DLCI open
	btrfs: zoned: encapsulate inode locking for zoned relocation
	btrfs: zoned: use dedicated lock for data relocation
	KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
	mm/hwpoison: mf_mutex for soft offline and unpoison
	mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
	mm/memory-failure.c: fix race with changing page compound again
	mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
	tty: n_gsm: fix invalid use of MSC in advanced option
	tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output()
	serial: 8250_mtk: Make sure to select the right FEATURE_SEL
	tty: n_gsm: fix invalid gsmtty_write_room() result
	drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
	drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
	drm/i915: Fix a race between vma / object destruction and unbinding
	drm/mediatek: Use mailbox rx_callback instead of cmdq_task_cb
	drm/mediatek: Remove the pointer of struct cmdq_client
	drm/mediatek: Detect CMDQ execution timeout
	drm/mediatek: Add cmdq_handle in mtk_crtc
	drm/mediatek: Add vblank register/unregister callback functions
	Bluetooth: protect le accept and resolv lists with hdev->lock
	Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event
	io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
	irqchip/gic-v3: Ensure pseudo-NMIs have an ISB between ack and handling
	irqchip/gic-v3: Refactor ISB + EOIR at ack time
	rxrpc: Fix locking issue
	dt-bindings: soc: qcom: smd-rpm: Add compatible for MSM8953 SoC
	dt-bindings: soc: qcom: smd-rpm: Fix missing MSM8936 compatible
	module: change to print useful messages from elf_validity_check()
	module: fix [e_shstrndx].sh_size=0 OOB access
	iommu/vt-d: Fix PCI bus rescan device hot add
	fbdev: fbmem: Fix logo center image dx issue
	fbmem: Check virtual screen sizes in fb_set_var()
	fbcon: Disallow setting font bigger than screen size
	fbcon: Prevent that screen size is smaller than font size
	PM: runtime: Redefine pm_runtime_release_supplier()
	memregion: Fix memregion_free() fallback definition
	video: of_display_timing.h: include errno.h
	powerpc/powernv: delay rng platform device creation until later in boot
	net: dsa: qca8k: reset cpu port on MTU change
	can: kvaser_usb: replace run-time checks with struct kvaser_usb_driver_info
	can: kvaser_usb: kvaser_usb_leaf: fix CAN clock frequency regression
	can: kvaser_usb: kvaser_usb_leaf: fix bittiming limits
	xfs: remove incorrect ASSERT in xfs_rename
	Revert "serial: sc16is7xx: Clear RS485 bits in the shutdown"
	btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
	virtio-blk: modify the value type of num in virtio_queue_rq()
	btrfs: fix use of uninitialized variable at rm device ioctl
	tty: n_gsm: fix encoding of command/response bit
	ARM: meson: Fix refcount leak in meson_smp_prepare_cpus
	pinctrl: sunxi: a83t: Fix NAND function name for some pins
	ASoC: rt711: Add endianness flag in snd_soc_component_driver
	ASoC: rt711-sdca: Add endianness flag in snd_soc_component_driver
	ASoC: codecs: rt700/rt711/rt711-sdca: resume bus/codec in .set_jack_detect
	arm64: dts: qcom: msm8994: Fix CPU6/7 reg values
	arm64: dts: qcom: sdm845: use dispcc AHB clock for mdss node
	ARM: mxs_defconfig: Enable the framebuffer
	arm64: dts: imx8mp-evk: correct mmc pad settings
	arm64: dts: imx8mp-evk: correct the uart2 pinctl value
	arm64: dts: imx8mp-evk: correct gpio-led pad settings
	arm64: dts: imx8mp-evk: correct vbus pad settings
	arm64: dts: imx8mp-evk: correct eqos pad settings
	arm64: dts: imx8mp-evk: correct I2C1 pad settings
	arm64: dts: imx8mp-evk: correct I2C3 pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct uart pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct eqos pad settings
	arm64: dts: imx8mp-phyboard-pollux-rdk: correct i2c2 & mmc settings
	pinctrl: sunxi: sunxi_pconf_set: use correct offset
	arm64: dts: qcom: msm8992-*: Fix vdd_lvs1_2-supply typo
	ARM: at91: pm: use proper compatible for sama5d2's rtc
	ARM: at91: pm: use proper compatibles for sam9x60's rtc and rtt
	ARM: at91: pm: use proper compatibles for sama7g5's rtc and rtt
	ARM: dts: at91: sam9x60ek: fix eeprom compatible and size
	ARM: dts: at91: sama5d2_icp: fix eeprom compatibles
	ARM: at91: fix soc detection for SAM9X60 SiPs
	xsk: Clear page contiguity bit when unmapping pool
	i2c: piix4: Fix a memory leak in the EFCH MMIO support
	i40e: Fix dropped jumbo frames statistics
	i40e: Fix VF's MAC Address change on VM
	ARM: dts: stm32: use usbphyc ck_usbo_48m as USBH OHCI clock on stm32mp151
	ARM: dts: stm32: add missing usbh clock and fix clk order on stm32mp15
	ibmvnic: Properly dispose of all skbs during a failover.
	selftests: forwarding: fix flood_unicast_test when h2 supports IFF_UNICAST_FLT
	selftests: forwarding: fix learning_test when h1 supports IFF_UNICAST_FLT
	selftests: forwarding: fix error message in learning_test
	r8169: fix accessing unset transport header
	i2c: cadence: Unregister the clk notifier in error path
	dmaengine: imx-sdma: Allow imx8m for imx7 FW revs
	misc: rtsx_usb: fix use of dma mapped buffer for usb bulk transfer
	misc: rtsx_usb: use separate command and response buffers
	misc: rtsx_usb: set return value in rsp_buf alloc err path
	Revert "mm/memory-failure.c: fix race with changing page compound again"
	Revert "serial: 8250_mtk: Make sure to select the right FEATURE_SEL"
	dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo
	ida: don't use BUG_ON() for debugging
	dmaengine: pl330: Fix lockdep warning about non-static key
	dmaengine: lgm: Fix an error handling path in intel_ldma_probe()
	dmaengine: at_xdma: handle errors of at_xdmac_alloc_desc() correctly
	dmaengine: ti: Fix refcount leak in ti_dra7_xbar_route_allocate
	dmaengine: qcom: bam_dma: fix runtime PM underflow
	dmaengine: ti: Add missing put_device in ti_dra7_xbar_route_allocate
	dmaengine: idxd: force wq context cleanup on device disable path
	selftests/net: fix section name when using xdp_dummy.o
	Linux 5.15.54

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3ca4c0aa09a3bea6969c7a127d833034a123f437
2022-07-13 19:41:43 +02:00
Greg Kroah-Hartman
99387927eb Merge 5.15.51 into android14-5.15
Changes in 5.15.51
	random: schedule mix_interrupt_randomness() less often
	random: quiet urandom warning ratelimit suppression message
	ALSA: hda/via: Fix missing beep setup
	ALSA: hda/conexant: Fix missing beep setup
	ALSA: hda/realtek: Add mute LED quirk for HP Omen laptop
	ALSA: hda/realtek - ALC897 headset MIC no sound
	ALSA: hda/realtek: Apply fixup for Lenovo Yoga Duet 7 properly
	ALSA: hda/realtek: Add quirk for Clevo PD70PNT
	ALSA: hda/realtek: Add quirk for Clevo NS50PU
	net: openvswitch: fix parsing of nw_proto for IPv6 fragments
	9p: Fix refcounting during full path walks for fid lookups
	9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl
	9p: fix fid refcount leak in v9fs_vfs_get_link
	btrfs: fix hang during unmount when block group reclaim task is running
	btrfs: prevent remounting to v1 space cache for subpage mount
	btrfs: add error messages to all unrecognized mount options
	scsi: ibmvfc: Store vhost pointer during subcrq allocation
	scsi: ibmvfc: Allocate/free queue resource only during probe/remove
	mmc: sdhci-pci-o2micro: Fix card detect by dealing with debouncing
	mmc: mediatek: wait dma stop bit reset to 0
	xen/gntdev: Avoid blocking in unmap_grant_pages()
	MAINTAINERS: Add new IOMMU development mailing list
	mtd: rawnand: gpmi: Fix setting busy timeout setting
	ata: libata: add qc->flags in ata_qc_complete_template tracepoint
	dm era: commit metadata in postsuspend after worker stops
	dm mirror log: clear log bits up to BITS_PER_LONG boundary
	tracing/kprobes: Check whether get_kretprobe() returns NULL in kretprobe_dispatcher()
	drm/i915: Implement w/a 22010492432 for adl-s
	USB: serial: pl2303: add support for more HXN (G) types
	USB: serial: option: add Telit LE910Cx 0x1250 composition
	USB: serial: option: add Quectel EM05-G modem
	USB: serial: option: add Quectel RM500K module support
	drm/msm: Ensure mmap offset is initialized
	drm/msm: Fix double pm_runtime_disable() call
	netfilter: use get_random_u32 instead of prandom
	scsi: scsi_debug: Fix zone transition to full condition
	drm/msm: Switch ordering of runpm put vs devfreq_idle
	scsi: iscsi: Exclude zero from the endpoint ID range
	xsk: Fix generic transmit when completion queue reservation fails
	drm/msm: use for_each_sgtable_sg to iterate over scatterlist
	bpf: Fix request_sock leak in sk lookup helpers
	drm/sun4i: Fix crash during suspend after component bind failure
	bpf, x86: Fix tail call count offset calculation on bpf2bpf call
	scsi: storvsc: Correct reporting of Hyper-V I/O size limits
	phy: aquantia: Fix AN when higher speeds than 1G are not advertised
	KVM: arm64: Prevent kmemleak from accessing pKVM memory
	net: Write lock dev_base_lock without disabling bottom halves.
	net: fix data-race in dev_isalive()
	tipc: fix use-after-free Read in tipc_named_reinit
	igb: fix a use-after-free issue in igb_clean_tx_ring
	bonding: ARP monitor spams NETDEV_NOTIFY_PEERS notifiers
	ethtool: Fix get module eeprom fallback
	net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms
	drm/msm/mdp4: Fix refcount leak in mdp4_modeset_init_intf
	drm/msm/dp: check core_initialized before disable interrupts at dp_display_unbind()
	drm/msm/dp: Drop now unused hpd_high member
	drm/msm/dp: dp_link_parse_sink_count() return immediately if aux read failed
	drm/msm/dp: do not initialize phy until plugin interrupt received
	drm/msm/dp: force link training for display resolution change
	perf arm-spe: Don't set data source if it's not a memory operation
	erspan: do not assume transport header is always set
	net/tls: fix tls_sk_proto_close executed repeatedly
	udmabuf: add back sanity check
	selftests: netfilter: correct PKTGEN_SCRIPT_PATHS in nft_concat_range.sh
	xen-blkfront: Handle NULL gendisk
	x86/xen: Remove undefined behavior in setup_features()
	MIPS: Remove repetitive increase irq_err_count
	afs: Fix dynamic root getattr
	ice: ethtool: advertise 1000M speeds properly
	regmap-irq: Fix a bug in regmap_irq_enable() for type_in_mask chips
	regmap-irq: Fix offset/index mismatch in read_sub_irq_data()
	igb: Make DMA faster when CPU is active on the PCIe link
	virtio_net: fix xdp_rxq_info bug after suspend/resume
	Revert "net/tls: fix tls_sk_proto_close executed repeatedly"
	sock: redo the psock vs ULP protection check
	nvme-pci: add NO APST quirk for Kioxia device
	nvme: move the Samsung X5 quirk entry to the core quirks
	gpio: winbond: Fix error code in winbond_gpio_get()
	s390/cpumf: Handle events cycles and instructions identical
	iio: mma8452: fix probe fail when device tree compatible is used.
	iio: magnetometer: yas530: Fix memchr_inv() misuse
	iio: adc: vf610: fix conversion mode sysfs node name
	usb: typec: wcove: Drop wrong dependency to INTEL_SOC_PMIC
	xhci: turn off port power in shutdown
	xhci-pci: Allow host runtime PM as default for Intel Raptor Lake xHCI
	xhci-pci: Allow host runtime PM as default for Intel Meteor Lake xHCI
	usb: gadget: Fix non-unique driver names in raw-gadget driver
	USB: gadget: Fix double-free bug in raw_gadget driver
	usb: chipidea: udc: check request status before setting device address
	dt-bindings: usb: ohci: Increase the number of PHYs
	dt-bindings: usb: ehci: Increase the number of PHYs
	btrfs: don't set lock_owner when locking extent buffer for reading
	btrfs: fix deadlock with fsync+fiemap+transaction commit
	f2fs: attach inline_data after setting compression
	iio:humidity:hts221: rearrange iio trigger get and register
	iio:chemical:ccs811: rearrange iio trigger get and register
	iio:accel:kxcjk-1013: rearrange iio trigger get and register
	iio:accel:bma180: rearrange iio trigger get and register
	iio:accel:mxc4005: rearrange iio trigger get and register
	iio: accel: mma8452: ignore the return value of reset operation
	iio: gyro: mpu3050: Fix the error handling in mpu3050_power_up()
	iio: trigger: sysfs: fix use-after-free on remove
	iio: adc: stm32: fix maximum clock rate for stm32mp15x
	iio: imu: inv_icm42600: Fix broken icm42600 (chip id 0 value)
	iio: afe: rescale: Fix boolean logic bug
	iio: adc: stm32: Fix ADCs iteration in irq handler
	iio: adc: stm32: Fix IRQs on STM32F4 by removing custom spurious IRQs message
	iio: adc: axp288: Override TS pin bias current for some models
	iio: adc: rzg2l_adc: add missing fwnode_handle_put() in rzg2l_adc_parse_properties()
	iio: adc: adi-axi-adc: Fix refcount leak in adi_axi_adc_attach_client
	iio: adc: ti-ads131e08: add missing fwnode_handle_put() in ads131e08_alloc_channels()
	xtensa: xtfpga: Fix refcount leak bug in setup
	xtensa: Fix refcount leak bug in time.c
	parisc/stifb: Fix fb_is_primary_device() only available with CONFIG_FB_STI
	parisc: Enable ARCH_HAS_STRICT_MODULE_RWX
	powerpc/microwatt: wire up rng during setup_arch()
	powerpc: Enable execve syscall exit tracepoint
	powerpc/rtas: Allow ibm,platform-dump RTAS call with null buffer address
	powerpc/powernv: wire up rng during setup_arch
	drm/msm/dp: Always clear mask bits to disable interrupts at dp_ctrl_reset_irq_ctrl()
	ARM: dts: imx7: Move hsic_phy power domain to HSIC PHY node
	ARM: dts: imx6qdl: correct PU regulator ramp delay
	arm64: dts: ti: k3-am64-main: Remove support for HS400 speed mode
	ARM: exynos: Fix refcount leak in exynos_map_pmu
	soc: bcm: brcmstb: pm: pm-arm: Fix refcount leak in brcmstb_pm_probe
	ARM: Fix refcount leak in axxia_boot_secondary
	memory: samsung: exynos5422-dmc: Fix refcount leak in of_get_dram_timings
	ARM: cns3xxx: Fix refcount leak in cns3xxx_init
	modpost: fix section mismatch check for exported init/exit sections
	ARM: dts: bcm2711-rpi-400: Fix GPIO line names
	random: update comment from copy_to_user() -> copy_to_iter()
	perf build-id: Fix caching files with a wrong build ID
	dma-direct: use the correct size for dma_set_encrypted()
	kbuild: link vmlinux only once for CONFIG_TRIM_UNUSED_KSYMS (2nd attempt)
	powerpc/pseries: wire up rng during setup_arch()
	Linux 5.15.51

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic8819a78d2d84055e7a6d44bdfab6a6cd8296dac
2022-07-13 17:32:01 +02:00
Nikolay Borisov
6618205047 btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
[ Upstream commit f42c5da6c12e990d8ec415199600b4d593c63bf5 ]

In order to make 'real_root' used only in ref-verify it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root' which will contain the root on behalf of
which a delayed ref was created and a 'skip_group' parameter which
will contain callsite-specific override of skip_qgroup.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-12 16:34:50 +02:00
Josef Bacik
d98b5032c9 btrfs: fix deadlock with fsync+fiemap+transaction commit
commit bf7ba8ee759b7b7a34787ddd8dc3f190a3d7fa24 upstream.

We are hitting the following deadlock in production occasionally

Task 1		Task 2		Task 3		Task 4		Task 5
		fsync(A)
		 start trans
						start commit
				falloc(A)
				 lock 5m-10m
				 start trans
				  wait for commit
fiemap(A)
 lock 0-10m
  wait for 5m-10m
   (have 0-5m locked)

		 have btrfs_need_log_full_commit
		  !full_sync
		  wait_ordered_extents
								finish_ordered_io(A)
								lock 0-5m
								DEADLOCK

We have an existing dependency of file extent lock -> transaction.
However in fsync if we tried to do the fast logging, but then had to
fall back to committing the transaction, we will be forced to call
btrfs_wait_ordered_range() to make sure all of our extents are updated.

This creates a dependency of transaction -> file extent lock, because
btrfs_finish_ordered_io() will need to take the file extent lock in
order to run the ordered extents.

Fix this by stopping the transaction if we have to do the full commit
and we attempted to do the fast logging.  Then attach to the transaction
and commit it if we need to.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-29 09:03:27 +02:00
Greg Kroah-Hartman
a0d26c51d7 Merge 5.15.37 into android14-5.15
Changes in 5.15.37
	floppy: disable FDRAWCMD by default
	bpf: Introduce composable reg, ret and arg types.
	bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL
	bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL
	bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL
	bpf: Introduce MEM_RDONLY flag
	bpf: Convert PTR_TO_MEM_OR_NULL to composable types.
	bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.
	bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
	bpf/selftests: Test PTR_TO_RDONLY_MEM
	bpf: Fix crash due to out of bounds access into reg2btf_ids.
	spi: cadence-quadspi: fix write completion support
	ARM: dts: socfpga: change qspi to "intel,socfpga-qspi"
	mm: kfence: fix objcgs vector allocation
	gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
	iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
	iov_iter: Introduce fault_in_iov_iter_writeable
	gfs2: Add wrapper for iomap_file_buffered_write
	gfs2: Clean up function may_grant
	gfs2: Introduce flag for glock holder auto-demotion
	gfs2: Move the inode glock locking to gfs2_file_buffered_write
	gfs2: Eliminate ip->i_gh
	gfs2: Fix mmap + page fault deadlocks for buffered I/O
	iomap: Fix iomap_dio_rw return value for user copies
	iomap: Support partial direct I/O on user copy failures
	iomap: Add done_before argument to iomap_dio_rw
	gup: Introduce FOLL_NOFAULT flag to disable page faults
	iov_iter: Introduce nofault flag to disable page faults
	gfs2: Fix mmap + page fault deadlocks for direct I/O
	btrfs: fix deadlock due to page faults during direct IO reads and writes
	btrfs: fallback to blocking mode when doing async dio over multiple extents
	mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
	selftests/bpf: Add test for reg2btf_ids out of bounds access
	Linux 5.15.37

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I785543e252f972c5a86f313e4b6721e2ff0797e6
2022-06-06 15:02:31 +02:00
Filipe Manana
c81c4f5666 btrfs: fix deadlock due to page faults during direct IO reads and writes
commit 51bd9563b6783de8315f38f7baed949e77c42311 upstream

If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO, we end up ending
in a deadlock. This is triggered by the new test case generic/647 from
fstests.

For a direct IO read we get a trace like this:

  [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
  [967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
  [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
  [967.875992] Call Trace:
  [967.875999]  __schedule+0x3ca/0xe10
  [967.876015]  schedule+0x43/0xe0
  [967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
  [967.876109]  ? do_wait_intr_irq+0xb0/0xb0
  [967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
  [967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
  [967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
  [967.876214]  extent_readahead+0x32d/0x530 [btrfs]
  [967.876253]  ? lru_cache_add+0x104/0x220
  [967.876255]  ? kvm_sched_clock_read+0x14/0x40
  [967.876258]  ? sched_clock_cpu+0xd/0x110
  [967.876263]  ? lock_release+0x155/0x4a0
  [967.876271]  read_pages+0x86/0x270
  [967.876274]  ? lru_cache_add+0x125/0x220
  [967.876281]  page_cache_ra_unbounded+0x1a3/0x220
  [967.876291]  filemap_fault+0x626/0xa20
  [967.876303]  __do_fault+0x36/0xf0
  [967.876308]  __handle_mm_fault+0x83f/0x15f0
  [967.876322]  handle_mm_fault+0x9e/0x260
  [967.876327]  __get_user_pages+0x204/0x620
  [967.876332]  ? get_user_pages_unlocked+0x69/0x340
  [967.876340]  get_user_pages_unlocked+0xd3/0x340
  [967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
  [967.876366]  iov_iter_get_pages+0x8d/0x3a0
  [967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
  [967.876379]  ? lock_release+0x155/0x4a0
  [967.876387]  iomap_dio_bio_actor+0x232/0x410
  [967.876396]  iomap_apply+0x12a/0x4a0
  [967.876398]  ? iomap_dio_rw+0x30/0x30
  [967.876414]  __iomap_dio_rw+0x29f/0x5e0
  [967.876415]  ? iomap_dio_rw+0x30/0x30
  [967.876420]  ? lock_acquired+0xf3/0x420
  [967.876429]  iomap_dio_rw+0xa/0x30
  [967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
  [967.876460]  new_sync_read+0x118/0x1a0
  [967.876472]  vfs_read+0x128/0x1b0
  [967.876477]  __x64_sys_pread64+0x90/0xc0
  [967.876483]  do_syscall_64+0x3b/0xc0
  [967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [967.876490] RIP: 0033:0x7fb6f2c038d6
  [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
  [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
  [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
  [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
  [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
  [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that resulting in reading the pages, through the readahead callback
btrfs_readahead(), and through there we end to attempt to lock again the
same extent range (or a subrange of what we locked before), resulting in
the deadlock.

For a direct IO write, the scenario is a bit different, and it results in
trace like this:

  [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
  [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
  [1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
  [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
  [1330.351906] Call Trace:
  [1330.351913]  __schedule+0x3ca/0xe10
  [1330.351930]  schedule+0x43/0xe0
  [1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
  [1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
  [1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
  [1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
  [1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
  [1330.352133]  ? lru_cache_add+0x104/0x220
  [1330.352135]  ? kvm_sched_clock_read+0x14/0x40
  [1330.352138]  ? sched_clock_cpu+0xd/0x110
  [1330.352143]  ? lock_release+0x155/0x4a0
  [1330.352151]  read_pages+0x86/0x270
  [1330.352155]  ? lru_cache_add+0x125/0x220
  [1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
  [1330.352172]  filemap_fault+0x626/0xa20
  [1330.352176]  ? filemap_map_pages+0x18b/0x660
  [1330.352184]  __do_fault+0x36/0xf0
  [1330.352189]  __handle_mm_fault+0x1253/0x15f0
  [1330.352203]  handle_mm_fault+0x9e/0x260
  [1330.352208]  __get_user_pages+0x204/0x620
  [1330.352212]  ? get_user_pages_unlocked+0x69/0x340
  [1330.352220]  get_user_pages_unlocked+0xd3/0x340
  [1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
  [1330.352246]  iov_iter_get_pages+0x8d/0x3a0
  [1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
  [1330.352259]  ? lock_release+0x155/0x4a0
  [1330.352266]  iomap_dio_bio_actor+0x232/0x410
  [1330.352275]  iomap_apply+0x12a/0x4a0
  [1330.352278]  ? iomap_dio_rw+0x30/0x30
  [1330.352292]  __iomap_dio_rw+0x29f/0x5e0
  [1330.352294]  ? iomap_dio_rw+0x30/0x30
  [1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
  [1330.352339]  new_sync_write+0x11f/0x1b0
  [1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
  [1330.352354]  vfs_write+0x292/0x3c0
  [1330.352359]  __x64_sys_pwrite64+0x90/0xc0
  [1330.352365]  do_syscall_64+0x3b/0xc0
  [1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [1330.352372] RIP: 0033:0x7f4b0a580986
  [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
  [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
  [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
  [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
  [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
  [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).

Fix this by setting the nofault attribute of the given iov_iter and retry
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely, this is because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the moment, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.

This depends on the iov_iter and iomap changes introduced in commit
c03098d4b9ad ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01 17:22:33 +02:00
Andreas Gruenbacher
d3b744791b iomap: Add done_before argument to iomap_dio_rw
commit 4fdccaa0d184c202f98d73b24e3ec8eeee88ab8d upstream

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01 17:22:32 +02:00
Andreas Gruenbacher
30e66b1dfc iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
commit a6294593e8a1290091d0b078d5d33da5e0cd3dfe upstream

Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-01 17:22:28 +02:00
Greg Kroah-Hartman
ec1a28c7c0 Merge 5.15.35 into android13-5.15
Changes in 5.15.35
	drm/amd/display: Add pstate verification and recovery for DCN31
	drm/amd/display: Fix p-state allow debug index on dcn31
	hamradio: defer 6pack kfree after unregister_netdev
	hamradio: remove needs_free_netdev to avoid UAF
	cpuidle: PSCI: Move the `has_lpi` check to the beginning of the function
	ACPI: processor idle: Check for architectural support for LPI
	ACPI: processor idle: Allow playing dead in C3 state
	ACPI: processor: idle: fix lockup regression on 32-bit ThinkPad T40
	btrfs: remove unused parameter nr_pages in add_ra_bio_pages()
	btrfs: remove no longer used counter when reading data page
	btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
	soc: qcom: aoss: Expose send for generic usecase
	dt-bindings: net: qcom,ipa: add optional qcom,qmp property
	net: ipa: request IPA register values be retained
	btrfs: release correct delalloc amount in direct IO write path
	ALSA: core: Add snd_card_free_on_error() helper
	ALSA: sis7019: Fix the missing error handling
	ALSA: ali5451: Fix the missing snd_card_free() call at probe error
	ALSA: als300: Fix the missing snd_card_free() call at probe error
	ALSA: als4000: Fix the missing snd_card_free() call at probe error
	ALSA: atiixp: Fix the missing snd_card_free() call at probe error
	ALSA: au88x0: Fix the missing snd_card_free() call at probe error
	ALSA: aw2: Fix the missing snd_card_free() call at probe error
	ALSA: azt3328: Fix the missing snd_card_free() call at probe error
	ALSA: bt87x: Fix the missing snd_card_free() call at probe error
	ALSA: ca0106: Fix the missing snd_card_free() call at probe error
	ALSA: cmipci: Fix the missing snd_card_free() call at probe error
	ALSA: cs4281: Fix the missing snd_card_free() call at probe error
	ALSA: cs5535audio: Fix the missing snd_card_free() call at probe error
	ALSA: echoaudio: Fix the missing snd_card_free() call at probe error
	ALSA: emu10k1x: Fix the missing snd_card_free() call at probe error
	ALSA: ens137x: Fix the missing snd_card_free() call at probe error
	ALSA: es1938: Fix the missing snd_card_free() call at probe error
	ALSA: es1968: Fix the missing snd_card_free() call at probe error
	ALSA: fm801: Fix the missing snd_card_free() call at probe error
	ALSA: galaxy: Fix the missing snd_card_free() call at probe error
	ALSA: hdsp: Fix the missing snd_card_free() call at probe error
	ALSA: hdspm: Fix the missing snd_card_free() call at probe error
	ALSA: ice1724: Fix the missing snd_card_free() call at probe error
	ALSA: intel8x0: Fix the missing snd_card_free() call at probe error
	ALSA: intel_hdmi: Fix the missing snd_card_free() call at probe error
	ALSA: korg1212: Fix the missing snd_card_free() call at probe error
	ALSA: lola: Fix the missing snd_card_free() call at probe error
	ALSA: lx6464es: Fix the missing snd_card_free() call at probe error
	ALSA: maestro3: Fix the missing snd_card_free() call at probe error
	ALSA: oxygen: Fix the missing snd_card_free() call at probe error
	ALSA: riptide: Fix the missing snd_card_free() call at probe error
	ALSA: rme32: Fix the missing snd_card_free() call at probe error
	ALSA: rme9652: Fix the missing snd_card_free() call at probe error
	ALSA: rme96: Fix the missing snd_card_free() call at probe error
	ALSA: sc6000: Fix the missing snd_card_free() call at probe error
	ALSA: sonicvibes: Fix the missing snd_card_free() call at probe error
	ALSA: via82xx: Fix the missing snd_card_free() call at probe error
	ALSA: usb-audio: Cap upper limits of buffer/period bytes for implicit fb
	ALSA: nm256: Don't call card private_free at probe error path
	drm/msm: Add missing put_task_struct() in debugfs path
	firmware: arm_scmi: Remove clear channel call on the TX channel
	memory: atmel-ebi: Fix missing of_node_put in atmel_ebi_probe
	Revert "ath11k: mesh: add support for 256 bitmap in blockack frames in 11ax"
	firmware: arm_scmi: Fix sorting of retrieved clock rates
	media: rockchip/rga: do proper error checking in probe
	SUNRPC: Fix the svc_deferred_event trace class
	net/sched: flower: fix parsing of ethertype following VLAN header
	veth: Ensure eth header is in skb's linear part
	gpiolib: acpi: use correct format characters
	cifs: release cached dentries only if mount is complete
	net: mdio: don't defer probe forever if PHY IRQ provider is missing
	mlxsw: i2c: Fix initialization error flow
	net/sched: fix initialization order when updating chain 0 head
	net: dsa: felix: suppress -EPROBE_DEFER errors
	net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link
	net/sched: taprio: Check if socket flags are valid
	cfg80211: hold bss_lock while updating nontrans_list
	netfilter: nft_socket: make cgroup match work in input too
	drm/msm: Fix range size vs end confusion
	drm/msm/dsi: Use connector directly in msm_dsi_manager_connector_init()
	drm/msm/dp: add fail safe mode outside of event_mutex context
	net/smc: Fix NULL pointer dereference in smc_pnet_find_ib()
	scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
	scsi: pm80xx: Enable upper inbound, outbound queues
	scsi: iscsi: Move iscsi_ep_disconnect()
	scsi: iscsi: Fix offload conn cleanup when iscsid restarts
	scsi: iscsi: Fix endpoint reuse regression
	scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
	scsi: iscsi: Fix unbound endpoint error handling
	sctp: Initialize daddr on peeled off socket
	netfilter: nf_tables: nft_parse_register can return a negative value
	ALSA: ad1889: Fix the missing snd_card_free() call at probe error
	ALSA: mtpav: Don't call card private_free at probe error path
	io_uring: move io_uring_rsrc_update2 validation
	io_uring: verify that resv2 is 0 in io_uring_rsrc_update2
	io_uring: verify pad field is 0 in io_get_ext_arg
	testing/selftests/mqueue: Fix mq_perf_tests to free the allocated cpu set
	ALSA: usb-audio: Increase max buffer size
	ALSA: usb-audio: Limit max buffer and period sizes per time
	perf tools: Fix misleading add event PMU debug message
	macvlan: Fix leaking skb in source mode with nodst option
	net: ftgmac100: access hardware register after clock ready
	nfc: nci: add flush_workqueue to prevent uaf
	cifs: potential buffer overflow in handling symlinks
	dm mpath: only use ktime_get_ns() in historical selector
	vfio/pci: Fix vf_token mechanism when device-specific VF drivers are used
	net: bcmgenet: Revert "Use stronger register read/writes to assure ordering"
	block: fix offset/size check in bio_trim()
	drm/amd: Add USBC connector ID
	btrfs: fix fallocate to use file_modified to update permissions consistently
	btrfs: do not warn for free space inode in cow_file_range
	drm/amdgpu: conduct a proper cleanup of PDB bo
	drm/amdgpu/gmc: use PCI BARs for APUs in passthrough
	drm/amd/display: fix audio format not updated after edid updated
	drm/amd/display: FEC check in timing validation
	drm/amd/display: Update VTEM Infopacket definition
	drm/amdkfd: Fix Incorrect VMIDs passed to HWS
	drm/amdgpu/vcn: improve vcn dpg stop procedure
	drm/amdkfd: Check for potential null return of kmalloc_array()
	Drivers: hv: vmbus: Deactivate sysctl_record_panic_msg by default in isolated guests
	PCI: hv: Propagate coherence from VMbus device to PCI device
	Drivers: hv: vmbus: Prevent load re-ordering when reading ring buffer
	scsi: target: tcmu: Fix possible page UAF
	scsi: lpfc: Fix queue failures when recovering from PCI parity error
	scsi: ibmvscsis: Increase INITIAL_SRP_LIMIT to 1024
	net: micrel: fix KS8851_MLL Kconfig
	ata: libata-core: Disable READ LOG DMA EXT for Samsung 840 EVOs
	gpu: ipu-v3: Fix dev_dbg frequency output
	regulator: wm8994: Add an off-on delay for WM8994 variant
	arm64: alternatives: mark patch_alternative() as `noinstr`
	tlb: hugetlb: Add more sizes to tlb_remove_huge_tlb_entry
	net: axienet: setup mdio unconditionally
	Drivers: hv: balloon: Disable balloon and hot-add accordingly
	net: usb: aqc111: Fix out-of-bounds accesses in RX fixup
	myri10ge: fix an incorrect free for skb in myri10ge_sw_tso
	spi: cadence-quadspi: fix protocol setup for non-1-1-X operations
	drm/amd/display: Enable power gating before init_pipes
	drm/amd/display: Revert FEC check in validation
	drm/amd/display: Fix allocate_mst_payload assert on resume
	drbd: set QUEUE_FLAG_STABLE_WRITES
	scsi: mpt3sas: Fail reset operation if config request timed out
	scsi: mvsas: Add PCI ID of RocketRaid 2640
	scsi: megaraid_sas: Target with invalid LUN ID is deleted during scan
	drivers: net: slip: fix NPD bug in sl_tx_timeout()
	io_uring: zero tag on rsrc removal
	io_uring: use nospec annotation for more indexes
	perf/imx_ddr: Fix undefined behavior due to shift overflowing the constant
	mm/secretmem: fix panic when growing a memfd_secret
	mm, page_alloc: fix build_zonerefs_node()
	mm: fix unexpected zeroed page mapping with zram swap
	mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
	KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded
	SUNRPC: Fix NFSD's request deferral on RDMA transports
	memory: renesas-rpc-if: fix platform-device leak in error path
	gcc-plugins: latent_entropy: use /dev/urandom
	cifs: verify that tcon is valid before dereference in cifs_kill_sb
	ath9k: Properly clear TX status area before reporting to mac80211
	ath9k: Fix usage of driver-private space in tx_info
	btrfs: fix root ref counts in error handling in btrfs_get_root_ref
	btrfs: mark resumed async balance as writing
	ALSA: hda/realtek: Add quirk for Clevo PD50PNT
	ALSA: hda/realtek: add quirk for Lenovo Thinkpad X12 speakers
	ALSA: pcm: Test for "silence" field in struct "pcm_format_data"
	nl80211: correctly check NL80211_ATTR_REG_ALPHA2 size
	ipv6: fix panic when forwarding a pkt with no in6 dev
	drm/amd/display: don't ignore alpha property on pre-multiplied mode
	drm/amdgpu: Enable gfxoff quirk on MacBook Pro
	x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
	x86/tsx: Disable TSX development mode at boot
	genirq/affinity: Consider that CPUs on nodes can be unbalanced
	tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
	ARM: davinci: da850-evm: Avoid NULL pointer dereference
	dm integrity: fix memory corruption when tag_size is less than digest size
	i2c: dev: check return value when calling dev_set_name()
	smp: Fix offline cpu check in flush_smp_call_function_queue()
	i2c: pasemi: Wait for write xfers to finish
	dt-bindings: net: snps: remove duplicate name
	timers: Fix warning condition in __run_timers()
	dma-direct: avoid redundant memory sync for swiotlb
	drm/i915: Sunset igpu legacy mmap support based on GRAPHICS_VER_FULL
	cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state
	soc: qcom: aoss: Fix missing put_device call in qmp_get
	net: ipa: fix a build dependency
	cpufreq: intel_pstate: ITMT support for overclocked system
	ax25: add refcount in ax25_dev to avoid UAF bugs
	ax25: fix reference count leaks of ax25_dev
	ax25: fix UAF bugs of net_device caused by rebinding operation
	ax25: Fix refcount leaks caused by ax25_cb_del()
	ax25: fix UAF bug in ax25_send_control()
	ax25: fix NPD bug in ax25_disconnect
	ax25: Fix NULL pointer dereferences in ax25 timers
	ax25: Fix UAF bugs in ax25 timers
	Linux 5.15.35

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0dd9eaea7f977df42b0a5b9cb9043c879f62718b
2022-04-24 16:58:59 +02:00
Darrick J. Wong
bb93369f93 btrfs: fix fallocate to use file_modified to update permissions consistently
[ Upstream commit 05fd9564e9faf0f23b4676385e27d9405cef6637 ]

Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files.  This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero range).  Because the call can be used to change file contents, we
should treat it like we do any other modification to a file -- update
the mtime, and drop set[ug]id privileges/capabilities.

The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-20 09:34:14 +02:00
Michel Lespinasse
a2138fee6c FROMLIST: fs: list file types that support speculative faults.
Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.

Initially this is set for files that implement fault() with filemap_fault().

Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69
2022-03-23 11:32:19 -07:00
Andreas Gruenbacher
9289a47a1b iomap: Add done_before argument to iomap_dio_rw
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-06 10:51:33 -08:00
Josef Bacik
4afb912f43 btrfs: fix abort logic in btrfs_replace_file_extents
Error injection testing uncovered a case where we'd end up with a
corrupt file system with a missing extent in the middle of a file.  This
occurs because the if statement to decide if we should abort is wrong.

The only way we would abort in this case is if we got a ret !=
-EOPNOTSUPP and we called from the file clone code.  However the
prealloc code uses this path too.  Instead we need to abort if there is
an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
if we came from the clone file code.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:08:06 +02:00
Josef Bacik
d175209be0 btrfs: update refs for any root except tree log roots
I hit a stuck relocation on btrfs/061 during my overnight testing.  This
turned out to be because we had left over extent entries in our extent
root for a data reloc inode that no longer existed.  This happened
because in btrfs_drop_extents() we only update refs if we have SHAREABLE
set or we are the tree_root.  This regression was introduced by
aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
where we stopped setting SHAREABLE for the data reloc tree.

The problem here is we actually do want to update extent references for
data extents in the data reloc tree, in fact we only don't want to
update extent references if the file extents are in the log tree.
Update this check to only skip updating references in the case of the
log tree.

This is relatively rare, because you have to be running scrub at the
same time, which is what btrfs/061 does.  The data reloc inode has its
extents pre-allocated, and then we copy the extent into the
pre-allocated chunks.  We theoretically should never be calling
btrfs_drop_extents() on a data reloc inode.  The exception of course is
with scrub, if our pre-allocated extent falls inside of the block group
we are scrubbing, then the block group will be marked read only and we
will be forced to cow that extent.  This means we will call
btrfs_drop_extents() on that range when we COW that file extent.

This isn't really problematic if we do this, the data reloc inode
requires that our extent lengths match exactly with the extent we are
copying, thankfully we validate the extent is correct with
get_new_location(), so if we happen to COW only part of the extent we
won't link it in when we do the relocation, so we are safe from any
other shenanigans that arise because of this interaction with scrub.

Fixes: aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:04:36 +02:00
Boris Burkov
146054090b btrfs: initial fsverity support
Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Qu Wenruo
7c11d0ae43 btrfs: subpage: fix a potential use-after-free in writeback helper
[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo
e046786619 btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

 assertion failed: PagePrivate(page) && page->private
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Linus Torvalds
d3acb15a3a Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "iov_iter cleanups and fixes.

  There are followups, but this is what had sat in -next this cycle. IMO
  the macro forest in there became much thinner and easier to follow..."

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
  clean up copy_mc_pipe_to_iter()
  pipe_zero(): we don't need no stinkin' kmap_atomic()...
  iov_iter: clean csum_and_copy_...() primitives up a bit
  copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
  copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
  iterate_xarray(): only of the first iteration we might get offset != 0
  pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
  iov_iter: make iterator callbacks use base and len instead of iovec
  iov_iter: make the amount already copied available to iterator callbacks
  iov_iter: get rid of separate bvec and xarray callbacks
  iov_iter: teach iterate_{bvec,xarray}() about possible short copies
  iterate_bvec(): expand bvec.h macro forest, massage a bit
  iov_iter: unify iterate_iovec and iterate_kvec
  iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
  iterate_and_advance(): get rid of magic in case when n is 0
  csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
  iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
  [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
  iov_iter_npages(): don't bother with iterate_all_kinds()
  ...
2021-07-03 11:30:04 -07:00
Nikolay Borisov
77d255348b btrfs: eliminate insert label in add_falloc_range
By way of inverting the list_empty conditional the insert label can be
eliminated, making the function's flow entirely linear.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:10 +02:00
Qu Wenruo
0528476b6a btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
[BUG]
With current subpage RW support, the following script can hang the fs
with 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:

- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.

We exit the loop until we meet all the following conditions:

- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall, it only works for sector size ==
PAGE_SIZE case.

While can't handle the following subpage case:

  0       32K     64K     96K     128K
  |       |///////||//////|       ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() has already handled lend < lstart
case, we only need to round up @lockstart, and round_down @lockend for
truncate_pagecache_range().

This modification should not change any thing for sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:10 +02:00
Qu Wenruo
f02a85d2d5 btrfs: make btrfs_dirty_pages() to be subpage compatible
Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status update to use
subpage helpers.

Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:09 +02:00
Nikolay Borisov
ec87b42f70 btrfs: use list_last_entry in add_falloc_range
Instead of calling list_entry with head->prev simply call
list_last_entry which makes it obvious which member of the list is
being referred. This allows to remove the extra 'prev' pointer.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:07 +02:00
Al Viro
f0b65f39ac iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
callers do *not* need to do iov_iter_advance() after it.  In case when they end
up consuming less than they'd been given they need to do iov_iter_revert() on
everything they had not consumed.  That, however, needs to be done only on slow
paths.

All in-tree callers converted.  And that kills the last user of iterate_all_kinds()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-10 11:45:14 -04:00
Ritesh Harjani
e7b2ec3d3d btrfs: return value from btrfs_mark_extent_written() in case of error
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return proper error value in case of a failure. All callers
handle it.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 13:11:58 +02:00
Filipe Manana
626e9f41f7 btrfs: fix race leading to unpersisted data and metadata on fsync
When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.

The following example shows one possible scenario for this:

   $ mkfs.btrfs -f /dev/sdc
   $ mount /dev/sdc /mnt

   $ touch /mnt/bar
   $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz

   # Now we have:
   # file bar == inode 257
   # file baz == inode 258

   $ mv /mnt/baz /mnt/foo

   # Now we have:
   # file bar == inode 257
   # file foo == inode 258

   $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

   # fsync bar before foo, it is important to trigger the race.
   $ xfs_io -c "fsync" /mnt/bar
   $ xfs_io -c "fsync" /mnt/foo

   # After this:
   # inode 257, file bar, is empty
   # inode 258, file foo, has 1M filled with 0xcd

   <power failure>

   # Replay the log:
   $ mount /dev/sdc /mnt

   # After this point file foo should have 1M filled with 0xcd and not 0xab

The following steps explain how the race happens:

1) Before the first fsync of inode 258, when it has the "baz" name, its
   ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
   The inode also has the full sync flag set;

2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
   the generation of the current transaction, and set ->last_log_commit
   to 0, which is the current value of ->last_sub_trans (done at
   btrfs_log_inode()).

   The full sync flag is cleared from the inode during the fsync.

   The log sub transaction that was committed had an ID of 0 and when we
   synced the log, at btrfs_sync_log(), we incremented root->log_transid
   from 0 to 1;

3) During the rename:

   We update inode 258, through btrfs_update_inode(), and that causes its
   ->last_sub_trans to be set to 1 (the current log transaction ID), and
   ->last_log_commit remains with a value of 0.

   After updating inode 258, because we have previously logged the inode
   in the previous fsync, we log again the inode through the call to
   btrfs_log_new_name(). This results in updating the inode's
   ->last_log_commit from 0 to 1 (the current value of its
   ->last_sub_trans).

   The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
   the next log transaction;

4) Then a buffered write against inode 258 is made. This leaves the value
   of ->last_sub_trans as 1 (the ID of the current log transaction, stored
   at root->log_transid);

5) Then an fsync against inode 257 (or any other inode other than 258),
   happens. This results in committing the log transaction with ID 1,
   which results in updating root->last_log_commit to 1 and bumping
   root->log_transid from 1 to 2;

6) Then an fsync against inode 258 starts. We flush delalloc and wait only
   for writeback to complete, since the full sync flag is not set in the
   inode's runtime flags - we do not wait for ordered extents to complete.

   Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
   ordered extent completes. The call returns true:

     static inline bool btrfs_inode_in_log(...)
     {
         bool ret = false;

         spin_lock(&inode->lock);
         if (inode->logged_trans == generation &&
             inode->last_sub_trans <= inode->last_log_commit &&
             inode->last_sub_trans <= inode->root->last_log_commit)
                 ret = true;
         spin_unlock(&inode->lock);
         return ret;
     }

   generation has a value of 6 (fs_info->generation), ->logged_trans also
   has a value of 6 (set when we logged the inode during the first fsync
   and when logging it during the rename), ->last_sub_trans has a value
   of 1, set during the rename (step 3), ->last_log_commit also has a
   value of 1 (set in step 3) and root->last_log_commit has a value of 1,
   which was set in step 5 when fsyncing inode 257.

   As a consequence we don't log the inode, any new extents and do not
   sync the log, resulting in a data loss if a power failure happens
   after the fsync and before the current transaction commits.
   Also, because we do not log the inode, after a power failure the mtime
   and ctime of the inode do not match those we had before.

   When the ordered extent completes before we call btrfs_inode_in_log(),
   then the call returns false and we log the inode and sync the log,
   since at the end of ordered extent completion we update the inode and
   set ->last_sub_trans to 2 (the value of root->log_transid) and
   ->last_log_commit to 1.

This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit 209ecbb858
("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
added in the 5.13 merge window. However checking the emptiness of the
list is not really the way to solve this problem, and was never intended
to, because while that solves the problem for COW writes, the problem
persists for NOCOW writes because in that case the list is always empty.

In the case of NOCOW writes, even though we wait for the writeback to
complete before returning from btrfs_sync_file(), we end up not logging
the inode, which has a new mtime/ctime, and because we don't sync the log,
we never issue disk barriers (send REQ_PREFLUSH to the device) since that
only happens when we sync the log (when we write super blocks at
btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
btrfs_sync_file() to user space, we are not guaranteeing that the data is
durably persisted on disk.

Also, while the example above uses a rename exchange to show how the
problem happens, it is not the only way to trigger it. An alternative
could be adding a new hard link to inode 258, since that also results
in calling btrfs_log_new_name() and updating the inode in the log.
An example reproducer using the addition of a hard link instead of a
rename operation:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ touch /mnt/bar
  $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo

  $ ln /mnt/foo /mnt/foo_link
  $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo

  $ xfs_io -c "fsync" /mnt/bar
  $ xfs_io -c "fsync" /mnt/foo

  <power failure>

  # Replay the log:
  $ mount /dev/sdc /mnt

  # After this point file foo often has 1M filled with 0xab and not 0xcd

The reasons leading to the final fsync of file foo, inode 258, not
persisting the new data are the same as for the previous example with
a rename operation.

So fix by never skipping logging and log syncing when there are still any
ordered extents in flight. To avoid making the conditional if statement
that checks if logging an inode is needed harder to read, place all the
logic into an helper function with separate if statements to make it more
manageable and easier to read.

A test case for fstests will follow soon.

For NOCOW writes, the problem existed before commit b5e6c3e170
("btrfs: always wait on ordered extents at fsync time"), introduced in
kernel 4.19, then it went away with that commit since we started to always
wait for ordered extent completion before logging.

The problem came back again once the fast fsync path was changed again to
avoid waiting for ordered extent completion, in commit 487781796d
("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.

However, for COW writes, the race only happens after the recent
commit 209ecbb858 ("btrfs: remove stale comment and logic from
btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
writes, the bug existed before that commit. So tag 5.10+ as the release
for stable backports.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28 20:09:45 +02:00
BingJing Chang
3227788cd3 btrfs: fix a potential hole punching failure
In commit d77815461f ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.

In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.

We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
  while (cur_offset < end) {
	  ...
	  // update cur_offset & len
	  // advance cur_offset & len in hole-punching case if needed
  }

Reported-by: Robbie Ko <robbieko@synology.com>
Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:17 +02:00
Filipe Manana
e2b84217f3 btrfs: update outdated comment at btrfs_replace_file_extents()
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.

That is not true anymore since commit 5e548b3201 ("btrfs: do not set
the full sync flag on the inode during page release"), so update the
comment to remove that part and rephrase it slightly to make it more
clear why the full sync flag is set at btrfs_replace_file_extents().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:17 +02:00
Filipe Manana
bc0939fcfa btrfs: fix race between marking inode needs to be logged and log syncing
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

    inode->logged_trans set to N;

3) The inode's root currently has:

   root->log_transid set to 1
   root->last_log_commit set to 0

   Which means only one log transaction was committed to far, log
   transaction 0. When a log tree is created we set ->log_transid and
   ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and
   so it joins log transaction 1.

   Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full
   sync flag set, so it starts delalloc and waits for the ordered extent
   to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called
   against inode I, which in turn calls btrfs_set_inode_last_trans(),
   which does the following:

     spin_lock(&inode->lock);
     inode->last_trans = trans->transaction->transid;
     inode->last_sub_trans = inode->root->log_transid;
     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   So ->last_trans is set to N and ->last_sub_trans set to 1.
   But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

   - it increments root->log_transid to 2
   - starts writeback for all log tree extent buffers
   - waits for the writeback to complete
   - writes the super blocks
   - updates root->last_log_commit to 1

   It's a lot of slow steps between updating root->log_transid and
   root->last_log_commit;

9) The task doing the ordered extent completion, currently at
   btrfs_set_inode_last_trans(), then finally runs:

     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   Which results in inode->last_log_commit being set to 1.
   The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
    true because we have all the following conditions met:

    inode->logged_trans == N which matches fs_info->generation &&
    inode->last_subtrans (1) <= inode->last_log_commit (1) &&
    inode->last_subtrans (1) <= root->last_log_commit (1) &&
    list inode->extent_tree.modified_extents is empty

    And as a consequence we return without logging the inode, so the
    existing logged version of the inode does not point to the extent
    that was written after the previous fsync.

It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.

However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:

  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
  {
     (...)
     BTRFS_I(inode)->last_trans = fs_info->generation;
     BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
     BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
     (...)

So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
   value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
   like we do for buffered and direct writes at btrfs_file_write_iter(),
   which is all we need to make sure multiple writes and fsyncs to an
   inode in the same transaction never result in an fsync missing that
   the inode changed and needs to be logged. Turn this into a helper
   function and use it both at btrfs_page_mkwrite() and at
   btrfs_file_write_iter() - this also fixes the problem that at
   btrfs_page_mkwrite() we were setting those fields without the
   protection of the inode's spinlock.

This is an extremely unlikely race to happen in practice.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:16 +02:00
Filipe Manana
885f46d87f btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).

That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.

This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.

Scenario 1 - logging overlapping extents

The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:

1) Task A starts an fsync on inode X, which has the full sync runtime flag
   set. First it starts by flushing all delalloc for the inode;

2) Task A then locks the inode and flushes any other delalloc that might
   have been created after the previous flush and waits for all ordered
   extents to complete;

3) In the inode's root we have the following leaf:

   Leaf N, generation == current transaction id:

   ---------------------------------------------------------
   | (...)  [ file extent item, offset 640K, length 128K ] |
   ---------------------------------------------------------

   The last file extent item in leaf N covers the file range from 640K to
   768K;

4) Task B does a memory mapped write for the page corresponding to the
   file range from 764K to 768K;

5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
   btrfs_search_forward() to search for leafs modified in the current
   transaction that contain items for the inode. It finds leaf N and copies
   all the inode items from that leaf into the log tree.

   Now the log tree has a copy of the last file extent item from leaf N.

   At the end of the while loop at copy_inode_items_to_log(), we have the
   minimum key set to:

   min_key.objectid = <inode X number>
   min_key.type = BTRFS_EXTENT_DATA_KEY
   min_key.offset = 640K

   Then we increment the key's offset by 1 so that the next call to
   btrfs_search_forward() leaves us at the first key greater than the key
   we just processed.

   But before btrfs_search_forward() is called again...

6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
   It can be started for several reasons:

     - The async reclaim task is attempting to satisfy metadata or data
       reservation requests, and it has reached a point where it decided
       to flush delalloc;
     - Due to memory pressure the VMM triggers writeback of dirty pages;
     - The system call sync_file_range(2) is called from user space.

7) When the respective ordered extent completes, it trims the length of
   the existing file extent item for file offset 640K from 128K to 124K,
   and a new file extent item is added with a key offset of 764K and a
   length of 4K;

8) Task A calls btrfs_search_forward(), which returns us a path pointing
   to the leaf (can be leaf N or some other) containing the new file extent
   item for file offset 764K.

   We end up copying this item to the log tree, which overlaps with the
   last copied file extent item, which covers the file range from 640K to
   768K.

   When writeback is triggered for log tree's extent buffers, the issue
   will be detected by the tree checker which will dump a trace and an
   error message on dmesg/syslog. If the writeback is triggered when
   syncing the log, which typically is, then we also end up aborting the
   current transaction.

This is the same type of problem fixed in 0c713cbab6 ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").

Scenario 2 - logging a version of the file that never existed

This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:

1) We have an inode I with a size of 1M and two file extent items, one
   covering an extent with disk_bytenr == X for the file range [0, 512K)
   and another one covering another extent with disk_bytenr == Y for the
   file range [512K, 1M);

2) A hole is punched for the file range [512K, 1M);

3) Task A starts an fsync of inode I, which has the full sync runtime flag
   set. It starts by flushing all existing delalloc, locks the inode (VFS
   lock), starts any new delalloc that might have been created before
   taking the lock and waits for all ordered extents to complete;

4) Some other task does a memory mapped write for the page corresponding to
   the file range [640K, 644K) for example;

5) Task A then logs all items of the inode with the call to
   copy_inode_items_to_log();

6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
   be started for several reasons:

     - The async reclaim task is attempting to satisfy metadata or data
       reservation requests, and it has reached a point where it decided
       to flush delalloc;
     - Due to memory pressure the VMM triggers writeback of dirty pages;
     - The system call sync_file_range(2) is called from user space.

7) The ordered extent for the range [640K, 644K) completes and a file
   extent item for that range is added to the subvolume tree, pointing
   to a 4K extent with a disk_bytenr == Z;

8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
   the subvolume tree. It finds two implicit holes:

   - one for the file range [512K, 640K)
   - one for the file range [644K, 1M)

   As a result we end up neither logging a hole for the range [640K, 644K)
   nor logging the file extent item with a disk_bytenr == Z.
   This means that if we have a power failure and replay the log tree we
   end up getting the following file extent layout:

   [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Y ]    [  hole  ]
   0             512K  512K      640K  640K           644K  644K     1M

   Which does not corresponding to any layout the file ever had before
   the power failure. The only two valid layouts would be:

   [ disk_bytenr X ]    [   hole   ]
   0             512K  512K        1M

   and

   [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Z ]    [  hole  ]
   0             512K  512K      640K  640K           644K  644K     1M

This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:

1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
   in the inode's io tree. This prevents the race but also blocks any reads
   during the duration of the fsync, which has a negative impact for many
   common workloads;

2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
   semaphore was recently added by Josef's patch set:

   btrfs: add a i_mmap_lock to our inode
   btrfs: cleanup inode_lock/inode_unlock uses
   btrfs: exclude mmaps while doing remap
   btrfs: exclude mmap from happening during all fallocate operations

   and is used to solve races between memory mapped writes and
   clone/dedupe/fallocate. This also makes us have the same behaviour we
   have regarding other writes (buffered and direct IO) and fsync - block
   them while the inode logging is in progress.

This change uses the second approach due to the performance impact of the
first one.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Josef Bacik
8d9b4a162a btrfs: exclude mmap from happening during all fallocate operations
There's a small window where a deadlock can happen between fallocate and
mmap.  This is described in detail by Filipe:

"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the
   range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a
   memory mapped write for a page in the fallocate target range. This is
   possible since memory mapped writes do not (and can not) lock the
   inode;

3) The fallocate task locks the file range. At this point there is one
   dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when
   attempting to reserve metadata space, since we are low on available
   metadata space. Before blocking (wait on its reservation ticket), it
   starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other
   means, so it decides to flush delalloc for inodes with dirty pages.
   It finds that the inode used in the fallocate operation has a dirty
   page and therefore queues a job (fs_info->flush_workers workqueue) to
   flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because
   it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting
   for a wakeup on its reservation ticket. The async reclaim task is
   waiting on the flush job, which in turn is waiting for locking the file
   range that is currently locked by the fallocate task. So unless some
   other task is able to release enough metadata space, for example an
   ordered extent for some other inode completes, we end up in a deadlock
   between all these tasks.

When this happens stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
  btrfs_fallocate+0xdfb/0x1260 [btrfs]
  ? filename_lookup+0xf1/0x180
  vfs_fallocate+0x14f/0x440
  ioctl_preallocate+0x92/0xc0
  do_vfs_ioctl+0x66b/0x750
  ? __do_sys_newfstat+0x53/0x60
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Josef Bacik
64708539cd btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.

None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Nikolay Borisov
cca5de97ae btrfs: make find_desired_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov
bfc78479eb btrfs: make btrfs_replace_file_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Linus Torvalds
f09b04cc64 Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "More regression fixes and stabilization.

  Regressions:

   - zoned mode
      - count zone sizes in wider int types
      - fix space accounting for read-only block groups

   - subpage: fix page tail zeroing

  Fixes:

   - fix spurious warning when remounting with free space tree

   - fix warning when creating a directory with smack enabled

   - ioctl checks for qgroup inheritance when creating a snapshot

   - qgroup
      - fix missing unlock on error path in zero range
      - fix amount of released reservation on error
      - fix flushing from unsafe context with open transaction,
        potentially deadlocking

   - minor build warning fixes"

* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: do not account freed region of read-only block group as zone_unusable
  btrfs: zoned: use sector_t for zone sectors
  btrfs: subpage: fix the false data csum mismatch error
  btrfs: fix warning when creating a directory with smack enabled
  btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
  btrfs: export and rename qgroup_reserve_meta
  btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
  btrfs: fix spurious free_space_tree remount warning
  btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
  btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
  btrfs: ref-verify: use 'inline void' keyword ordering
2021-03-05 12:21:14 -08:00
Nikolay Borisov
4f6a49de64 btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.

Fixes: a7f8b1c2ac ("btrfs: file: reserve qgroup space after the hole punch range is locked")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:55:44 +01:00
Christoph Hellwig
87fa0f3eb2 mm/filemap: rename generic_file_buffered_read to filemap_read
Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.

Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Linus Torvalds
4f016a316f Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
 "The big change in this cycle is some new code to make it possible for
  XFS to try unaligned directio overwrites without taking locks. If the
  block is fully written and within EOF (i.e. doesn't require any
  further fs intervention) then we can let the unlocked write proceed.
  If not, we fall back to synchronizing direct writes.

  Summary:

   - Adjust the final parameter of iomap_dio_rw.

   - Add a new flag to request that iomap directio writes return EAGAIN
     if the write is not a pure overwrite within EOF; this will be used
     to reduce lock contention with unaligned direct writes on XFS.

   - Amend XFS' directio code to eliminate exclusive locking for
     unaligned direct writes if the circumstances permit"

* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: reduce exclusive locking on unaligned dio
  xfs: split the unaligned DIO write code out
  xfs: improve the reflink_bounce_dio_write tracepoint
  xfs: simplify the read/write tracepoints
  xfs: remove the buffered I/O fallback assert
  xfs: cleanup the read/write helper naming
  xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  xfs: factor out a xfs_ilock_iocb helper
  iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
  iomap: pass a flags argument to iomap_dio_rw
  iomap: rename the flags variable in __iomap_dio_rw
2021-02-21 10:29:20 -08:00
Naohiro Aota
d8e3fb106f btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Qu Wenruo
32443de338 btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only
  call this function after page is mapped to an inode.

  One call site attaches pages to inode manually, thus we have to modify
  the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling.
  Calling set_page_extent_mapped() as soon as possible can simply the
  error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Filipe Manana
d0c2f4fa55 btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9627107     0.153    61.938
 Close        7072076     0.001     3.175
 Rename        407633     1.222    44.439
 Unlink       1943895     0.658    44.440
 Deltree          256    17.339   110.891
 Mkdir            128     0.003     0.009
 Qpathinfo    8725406     0.064    17.850
 Qfileinfo    1529516     0.001     2.188
 Qfsinfo      1599884     0.002     1.457
 Sfileinfo     784200     0.005     3.562
 Find         3373513     0.411    30.312
 WriteX       4802132     0.053    29.054
 ReadX       15089959     0.002     5.801
 LockX          31344     0.002     0.425
 UnlockX        31344     0.001     0.173
 Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

After patchset, with 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9931568     0.111    25.597
 Close        7295730     0.001     2.171
 Rename        420549     0.982    49.714
 Unlink       2005366     0.497    39.015
 Deltree          256    11.149    89.242
 Mkdir            128     0.002     0.014
 Qpathinfo    9001863     0.049    20.761
 Qfileinfo    1577730     0.001     2.546
 Qfsinfo      1650508     0.002     3.531
 Sfileinfo     809031     0.005     5.846
 Find         3480259     0.309    23.977
 WriteX       4952505     0.043    41.283
 ReadX       15568127     0.002     5.476
 LockX          32338     0.002     0.978
 UnlockX        32338     0.001     2.032
 Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

 --> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    8956748     0.342   108.312
 Close        6579660     0.001     3.823
 Rename        379209     2.396    81.897
 Unlink       1808625     1.108   131.148
 Deltree          256    25.632   172.176
 Mkdir            128     0.003     0.018
 Qpathinfo    8117615     0.131    55.916
 Qfileinfo    1423495     0.001     2.635
 Qfsinfo      1488496     0.002     5.412
 Sfileinfo     729472     0.007     8.643
 Find         3138598     0.855    78.321
 WriteX       4470783     0.102    79.442
 ReadX       14038139     0.002     7.578
 LockX          29158     0.002     0.844
 UnlockX        29158     0.001     0.567
 Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9069003     0.303    43.193
 Close        6662328     0.001     3.888
 Rename        383976     2.194    46.418
 Unlink       1831080     1.022    43.873
 Deltree          256    24.037   155.763
 Mkdir            128     0.002     0.005
 Qpathinfo    8219173     0.137    30.233
 Qfileinfo    1441203     0.001     3.204
 Qfsinfo      1507092     0.002     4.055
 Sfileinfo     738775     0.006     5.431
 Find         3177874     0.936    38.170
 WriteX       4526152     0.084    39.518
 ReadX       14213562     0.002    24.760
 LockX          29522     0.002     1.221
 UnlockX        29522     0.001     0.694
 Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

 --> +6.8% throughput, -18.1% max latency

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo
c0fab48095 btrfs: update comment for btrfs_dirty_pages
The original comment is from the initial merge, which has several
problems:

- No holes check any more
- No inline decision is made

Update the out-of-date comment with more correct one.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Nikolay Borisov
149716570b btrfs: cleanup local variables in btrfs_file_write_iter
First replace all inode instances with a pointer to btrfs_inode. This
removes multiple invocations of the BTRFS_I macro, subsequently remove
2 local variables as they are called only once and simply refer to
them directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Christoph Hellwig
2f63296578 iomap: pass a flags argument to iomap_dio_rw
Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument.  The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23 10:06:09 -08:00
Naohiro Aota
f1569c4c10 btrfs: disable fallocate in ZONED mode
fallocate() is implemented by reserving actual extent instead of
reservations. This can result in exposing the sequential write
constraint of host-managed zoned block devices to the application, which
would break the POSIX semantic for the fallocated file.  To avoid this,
report fallocate() as not supported when in ZONED mode for now.

In the future, we may be able to implement "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or similar, so this
returns EOPNOTSUPP.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:04 +01:00
Nikolay Borisov
b06359a325 btrfs: make btrfs_cont_expand take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Nikolay Borisov
217f42eb3d btrfs: make btrfs_truncate_block take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
03fcb1ab6f btrfs: make btrfs_insert_replace_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
dea46d84a3 btrfs: make find_first_non_hole take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
9a56fcd15a btrfs: make btrfs_update_inode take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
76aea53796 btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00