Commit Graph

317 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
b34f70f7bf ANDROID: GKI: sched.h: add Android ABI padding to some structures
Try to mitigate potential future driver core api changes by adding a
pointer to struct signal_struct, struct sched_entity, struct
sched_rt_entity, and struct task_struct.

Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel

Leaf changes summary: 3 artifacts changed
Changed leaf types summary: 3 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

'struct sched_entity at sched.h:444:1' changed:
  type size changed from 3584 to 4096 (in bits)
  4 data member insertions:
    'u64 sched_entity::android_kabi_reserved1', at offset 3584 (in bits) at sched.h:481:1
    'u64 sched_entity::android_kabi_reserved2', at offset 3648 (in bits) at sched.h:482:1
    'u64 sched_entity::android_kabi_reserved3', at offset 3712 (in bits) at sched.h:483:1
    'u64 sched_entity::android_kabi_reserved4', at offset 3776 (in bits) at sched.h:484:1
  1435 impacted interfaces:

'struct sched_rt_entity at sched.h:481:1' changed:
  type size changed from 384 to 640 (in bits)
  4 data member insertions:
    'u64 sched_rt_entity::android_kabi_reserved1', at offset 384 (in bits) at sched.h:504:1
    'u64 sched_rt_entity::android_kabi_reserved2', at offset 448 (in bits) at sched.h:505:1
    'u64 sched_rt_entity::android_kabi_reserved3', at offset 512 (in bits) at sched.h:506:1
    'u64 sched_rt_entity::android_kabi_reserved4', at offset 576 (in bits) at sched.h:507:1
  1435 impacted interfaces:

'struct task_struct at sched.h:624:1' changed:
  type size changed from 28672 to 30208 (in bits)
  8 data member insertions:
    'u64 task_struct::android_kabi_reserved1', at offset 20992 (in bits) at sched.h:1294:1
    'u64 task_struct::android_kabi_reserved2', at offset 21056 (in bits) at sched.h:1295:1
    'u64 task_struct::android_kabi_reserved3', at offset 21120 (in bits) at sched.h:1296:1
    'u64 task_struct::android_kabi_reserved4', at offset 21184 (in bits) at sched.h:1297:1
    'u64 task_struct::android_kabi_reserved5', at offset 21248 (in bits) at sched.h:1298:1
    'u64 task_struct::android_kabi_reserved6', at offset 21312 (in bits) at sched.h:1299:1
    'u64 task_struct::android_kabi_reserved7', at offset 21376 (in bits) at sched.h:1300:1
    'u64 task_struct::android_kabi_reserved8', at offset 21440 (in bits) at sched.h:1301:1
  there are data member changes:
  1435 impacted interfaces:

Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1449735b836399e9b356608727092334daf5c36b
2020-03-19 13:50:19 +01:00
Greg Kroah-Hartman
ce5de62e20 Merge 5.4.24 into android-5.4
Changes in 5.4.24
	io_uring: grab ->fs as part of async offload
	EDAC: skx_common: downgrade message importance on missing PCI device
	net: dsa: b53: Ensure the default VID is untagged
	net: fib_rules: Correctly set table field when table number exceeds 8 bits
	net: macb: ensure interface is not suspended on at91rm9200
	net: mscc: fix in frame extraction
	net: phy: restore mdio regs in the iproc mdio driver
	net: sched: correct flower port blocking
	net/tls: Fix to avoid gettig invalid tls record
	nfc: pn544: Fix occasional HW initialization failure
	qede: Fix race between rdma destroy workqueue and link change event
	Revert "net: dev: introduce support for sch BYPASS for lockless qdisc"
	udp: rehash on disconnect
	sctp: move the format error check out of __sctp_sf_do_9_1_abort
	bnxt_en: Improve device shutdown method.
	bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
	bonding: add missing netdev_update_lockdep_key()
	net: export netdev_next_lower_dev_rcu()
	bonding: fix lockdep warning in bond_get_stats()
	ipv6: Fix route replacement with dev-only route
	ipv6: Fix nlmsg_flags when splitting a multipath route
	ipmi:ssif: Handle a possible NULL pointer reference
	drm/msm: Set dma maximum segment size for mdss
	sched/core: Don't skip remote tick for idle CPUs
	timers/nohz: Update NOHZ load in remote tick
	sched/fair: Prevent unlimited runtime on throttled group
	dax: pass NOWAIT flag to iomap_apply
	mac80211: consider more elements in parsing CRC
	cfg80211: check wiphy driver existence for drvinfo report
	s390/zcrypt: fix card and queue total counter wrap
	qmi_wwan: re-add DW5821e pre-production variant
	qmi_wwan: unconditionally reject 2 ep interfaces
	NFSv4: Fix races between open and dentry revalidation
	perf/smmuv3: Use platform_get_irq_optional() for wired interrupt
	perf/x86/intel: Add Elkhart Lake support
	perf/x86/cstate: Add Tremont support
	perf/x86/msr: Add Tremont support
	ceph: do not execute direct write in parallel if O_APPEND is specified
	ARM: dts: sti: fixup sound frame-inversion for stihxxx-b2120.dtsi
	drm/amd/display: Do not set optimized_require to false after plane disable
	RDMA/siw: Remove unwanted WARN_ON in siw_cm_llp_data_ready()
	drm/amd/display: Check engine is not NULL before acquiring
	drm/amd/display: Limit minimum DPPCLK to 100MHz.
	drm/amd/display: Add initialitions for PLL2 clock source
	amdgpu: Prevent build errors regarding soft/hard-float FP ABI tags
	soc/tegra: fuse: Fix build with Tegra194 configuration
	i40e: Fix the conditional for i40e_vc_validate_vqs_bitmaps
	net: ena: fix potential crash when rxfh key is NULL
	net: ena: fix uses of round_jiffies()
	net: ena: add missing ethtool TX timestamping indication
	net: ena: fix incorrect default RSS key
	net: ena: rss: do not allocate key when not supported
	net: ena: rss: fix failure to get indirection table
	net: ena: rss: store hash function as values and not bits
	net: ena: fix incorrectly saving queue numbers when setting RSS indirection table
	net: ena: fix corruption of dev_idx_to_host_tbl
	net: ena: ethtool: use correct value for crc32 hash
	net: ena: ena-com.c: prevent NULL pointer dereference
	ice: update Unit Load Status bitmask to check after reset
	cifs: Fix mode output in debugging statements
	cfg80211: add missing policy for NL80211_ATTR_STATUS_CODE
	mac80211: fix wrong 160/80+80 MHz setting
	net: hns3: add management table after IMP reset
	net: hns3: fix a copying IPv6 address error in hclge_fd_get_flow_tuples()
	nvme/tcp: fix bug on double requeue when send fails
	nvme: prevent warning triggered by nvme_stop_keep_alive
	nvme/pci: move cqe check after device shutdown
	ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
	audit: fix error handling in audit_data_to_entry()
	audit: always check the netlink payload length in audit_receive_msg()
	ACPICA: Introduce ACPI_ACCESS_BYTE_WIDTH() macro
	ACPI: watchdog: Fix gas->access_width usage
	KVM: VMX: check descriptor table exits on instruction emulation
	HID: ite: Only bind to keyboard USB interface on Acer SW5-012 keyboard dock
	HID: core: fix off-by-one memset in hid_report_raw_event()
	HID: core: increase HID report buffer size to 8KiB
	drm/amdgpu: Drop DRIVER_USE_AGP
	drm/radeon: Inline drm_get_pci_dev
	macintosh: therm_windtunnel: fix regression when instantiating devices
	tracing: Disable trace_printk() on post poned tests
	Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
	amdgpu/gmc_v9: save/restore sdpif regs during S3
	cpufreq: Fix policy initialization for internal governor drivers
	io_uring: fix 32-bit compatability with sendmsg/recvmsg
	netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports
	net/smc: transfer fasync_list in case of fallback
	vhost: Check docket sk_family instead of call getname
	netfilter: ipset: Fix forceadd evaluation path
	netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()
	HID: alps: Fix an error handling path in 'alps_input_configured()'
	HID: hiddev: Fix race in in hiddev_disconnect()
	MIPS: VPE: Fix a double free and a memory leak in 'release_vpe()'
	i2c: altera: Fix potential integer overflow
	i2c: jz4780: silence log flood on txabrt
	drm/i915/gvt: Fix orphan vgpu dmabuf_objs' lifetime
	drm/i915/gvt: Separate display reset from ALL_ENGINES reset
	nl80211: fix potential leak in AP start
	mac80211: Remove a redundant mutex unlock
	kbuild: fix DT binding schema rule to detect command line changes
	hv_netvsc: Fix unwanted wakeup in netvsc_attach()
	usb: charger: assign specific number for enum value
	nvme-pci: Hold cq_poll_lock while completing CQEs
	s390/qeth: vnicc Fix EOPNOTSUPP precedence
	net: netlink: cap max groups which will be considered in netlink_bind()
	net: atlantic: fix use after free kasan warn
	net: atlantic: fix potential error handling
	net: atlantic: fix out of range usage of active_vlans array
	net/smc: no peer ID in CLC decline for SMCD
	net: ena: make ena rxfh support ETH_RSS_HASH_NO_CHANGE
	selftests: Install settings files to fix TIMEOUT failures
	kbuild: remove header compile test
	kbuild: move headers_check rule to usr/include/Makefile
	kbuild: remove unneeded variable, single-all
	kbuild: make single target builds even faster
	namei: only return -ECHILD from follow_dotdot_rcu()
	mwifiex: drop most magic numbers from mwifiex_process_tdls_action_frame()
	mwifiex: delete unused mwifiex_get_intf_num()
	KVM: SVM: Override default MMIO mask if memory encryption is enabled
	KVM: Check for a bad hva before dropping into the ghc slow path
	sched/fair: Optimize select_idle_cpu
	f2fs: fix to add swap extent correctly
	RDMA/hns: Simplify the calculation and usage of wqe idx for post verbs
	RDMA/hns: Bugfix for posting a wqe with sge
	drivers: net: xgene: Fix the order of the arguments of 'alloc_etherdev_mqs()'
	ima: ima/lsm policy rule loading logic bug fixes
	kprobes: Set unoptimized flag after unoptimizing code
	lib/vdso: Make __arch_update_vdso_data() logic understandable
	lib/vdso: Update coarse timekeeper unconditionally
	pwm: omap-dmtimer: put_device() after of_find_device_by_node()
	perf hists browser: Restore ESC as "Zoom out" of DSO/thread/etc
	perf ui gtk: Add missing zalloc object
	x86/resctrl: Check monitoring static key in the MBM overflow handler
	KVM: x86: Remove spurious kvm_mmu_unload() from vcpu destruction path
	KVM: x86: Remove spurious clearing of async #PF MSR
	rcu: Allow only one expedited GP to run concurrently with wakeups
	ubifs: Fix ino_t format warnings in orphan_delete()
	thermal: db8500: Depromote debug print
	thermal: brcmstb_thermal: Do not use DT coefficients
	netfilter: nft_tunnel: no need to call htons() when dumping ports
	netfilter: nf_flowtable: fix documentation
	bus: tegra-aconnect: Remove PM_CLK dependency
	xfs: clear kernel only flags in XFS_IOC_ATTRMULTI_BY_HANDLE
	locking/lockdep: Fix lockdep_stats indentation problem
	mm/debug.c: always print flags in dump_page()
	mm/gup: allow FOLL_FORCE for get_user_pages_fast()
	mm/huge_memory.c: use head to check huge zero page
	mm, thp: fix defrag setting if newline is not used
	kvm: nVMX: VMWRITE checks VMCS-link pointer before VMCS field
	kvm: nVMX: VMWRITE checks unsupported field before read-only field
	blktrace: Protect q->blk_trace with RCU
	Linux 5.4.24

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0b31557e16c72bd30d1e6938ed199918ff326d88
2020-03-05 17:42:40 +01:00
Peter Zijlstra (Intel)
166d6008fa timers/nohz: Update NOHZ load in remote tick
[ Upstream commit ebc0f83c78 ]

The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz.  This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.

Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.

[ swood: Added changelog and removed recheck of stopped tick. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-05 16:43:36 +01:00
Greg Kroah-Hartman
861433ef01 Merge 5.4.7 into android-5.4
Changes in 5.4.7
	af_packet: set defaule value for tmo
	fjes: fix missed check in fjes_acpi_add
	mod_devicetable: fix PHY module format
	net: dst: Force 4-byte alignment of dst_metrics
	net: gemini: Fix memory leak in gmac_setup_txqs
	net: hisilicon: Fix a BUG trigered by wrong bytes_compl
	net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive()
	net: phy: ensure that phy IDs are correctly typed
	net: qlogic: Fix error paths in ql_alloc_large_buffers()
	net-sysfs: Call dev_hold always in rx_queue_add_kobject
	net: usb: lan78xx: Fix suspend/resume PHY register access error
	nfp: flower: fix stats id allocation
	qede: Disable hardware gro when xdp prog is installed
	qede: Fix multicast mac configuration
	sctp: fix memleak on err handling of stream initialization
	sctp: fully initialize v4 addr in some functions
	selftests: forwarding: Delete IPv6 address at the end
	neighbour: remove neigh_cleanup() method
	bonding: fix bond_neigh_init()
	net: ena: fix default tx interrupt moderation interval
	net: ena: fix issues in setting interrupt moderation params in ethtool
	dpaa2-ptp: fix double free of the ptp_qoriq IRQ
	mlxsw: spectrum_router: Remove unlikely user-triggerable warning
	net: ethernet: ti: davinci_cpdma: fix warning "device driver frees DMA memory with different size"
	net: stmmac: platform: Fix MDIO init for platforms without PHY
	net: dsa: b53: Fix egress flooding settings
	NFC: nxp-nci: Fix probing without ACPI
	btrfs: don't double lock the subvol_sem for rename exchange
	btrfs: do not call synchronize_srcu() in inode_tree_del
	Btrfs: make tree checker detect checksum items with overlapping ranges
	btrfs: return error pointer from alloc_test_extent_buffer
	Btrfs: fix missing data checksums after replaying a log tree
	btrfs: send: remove WARN_ON for readonly mount
	btrfs: abort transaction after failed inode updates in create_subvol
	btrfs: skip log replay on orphaned roots
	btrfs: do not leak reloc root if we fail to read the fs root
	btrfs: handle ENOENT in btrfs_uuid_tree_iterate
	Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
	ALSA: pcm: Avoid possible info leaks from PCM stream buffers
	ALSA: hda/ca0132 - Keep power on during processing DSP response
	ALSA: hda/ca0132 - Avoid endless loop
	ALSA: hda/ca0132 - Fix work handling in delayed HP detection
	drm/vc4/vc4_hdmi: fill in connector info
	drm/virtio: switch virtio_gpu_wait_ioctl() to gem helper.
	drm: mst: Fix query_payload ack reply struct
	drm/mipi-dbi: fix a loop in debugfs code
	drm/panel: Add missing drm_panel_init() in panel drivers
	drm: exynos: exynos_hdmi: use cec_notifier_conn_(un)register
	drm: Use EOPNOTSUPP, not ENOTSUPP
	drm/amd/display: verify stream link before link test
	drm/bridge: analogix-anx78xx: silence -EPROBE_DEFER warnings
	drm/amd/display: OTC underflow fix
	iio: max31856: add missing of_node and parent references to iio_dev
	iio: light: bh1750: Resolve compiler warning and make code more readable
	drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 code
	drm/amdgpu: grab the id mgr lock while accessing passid_mapping
	drm/ttm: return -EBUSY on pipelining with no_gpu_wait (v2)
	drm/amd/display: Rebuild mapped resources after pipe split
	ath10k: add cleanup in ath10k_sta_state()
	drm/amd/display: Handle virtual signal type in disable_link()
	ath10k: Check if station exists before forwarding tx airtime report
	spi: Add call to spi_slave_abort() function when spidev driver is released
	drm/meson: vclk: use the correct G12A frac max value
	staging: rtl8192u: fix multiple memory leaks on error path
	staging: rtl8188eu: fix possible null dereference
	rtlwifi: prevent memory leak in rtl_usb_probe
	libertas: fix a potential NULL pointer dereference
	Revert "pinctrl: sh-pfc: r8a77990: Fix MOD_SEL1 bit30 when using SSI_SCK2 and SSI_WS2"
	Revert "pinctrl: sh-pfc: r8a77990: Fix MOD_SEL1 bit31 when using SIM0_D"
	ath10k: fix backtrace on coredump
	IB/iser: bound protection_sg size by data_sg size
	drm/komeda: Workaround for broken FLIP_COMPLETE timestamps
	spi: gpio: prevent memory leak in spi_gpio_probe
	media: am437x-vpfe: Setting STD to current value is not an error
	media: cedrus: fill in bus_info for media device
	media: seco-cec: Add a missing 'release_region()' in an error handling path
	media: vim2m: Fix abort issue
	media: vim2m: Fix BUG_ON in vim2m_device_release()
	media: max2175: Fix build error without CONFIG_REGMAP_I2C
	media: ov6650: Fix control handler not freed on init error
	media: i2c: ov2659: fix s_stream return value
	media: ov6650: Fix crop rectangle alignment not passed back
	media: i2c: ov2659: Fix missing 720p register config
	media: ov6650: Fix stored frame format not in sync with hardware
	media: ov6650: Fix stored crop rectangle not in sync with hardware
	tools/power/cpupower: Fix initializer override in hsw_ext_cstates
	media: venus: core: Fix msm8996 frequency table
	ath10k: fix offchannel tx failure when no ath10k_mac_tx_frm_has_freq
	media: vimc: Fix gpf in rmmod path when stream is active
	drm/amd/display: Set number of pipes to 1 if the second pipe was disabled
	pinctrl: devicetree: Avoid taking direct reference to device name string
	drm/sun4i: dsi: Fix TCON DRQ set bits
	drm/amdkfd: fix a potential NULL pointer dereference (v2)
	x86/math-emu: Check __copy_from_user() result
	drm/amd/powerplay: A workaround to GPU RESET on APU
	selftests/bpf: Correct path to include msg + path
	drm/amd/display: set minimum abm backlight level
	media: venus: Fix occasionally failures to suspend
	rtw88: fix NSS of hw_cap
	drm/amd/display: fix struct init in update_bounding_box
	usb: renesas_usbhs: add suspend event support in gadget mode
	crypto: aegis128-neon - use Clang compatible cflags for ARM
	hwrng: omap3-rom - Call clk_disable_unprepare() on exit only if not idled
	regulator: max8907: Fix the usage of uninitialized variable in max8907_regulator_probe()
	tools/memory-model: Fix data race detection for unordered store and load
	media: flexcop-usb: fix NULL-ptr deref in flexcop_usb_transfer_init()
	media: cec-funcs.h: add status_req checks
	media: meson/ao-cec: move cec_notifier_cec_adap_register after hw setup
	drm/bridge: dw-hdmi: Refuse DDC/CI transfers on the internal I2C controller
	samples: pktgen: fix proc_cmd command result check logic
	block: Fix writeback throttling W=1 compiler warnings
	drm/amdkfd: Fix MQD size calculation
	MIPS: futex: Emit Loongson3 sync workarounds within asm
	mwifiex: pcie: Fix memory leak in mwifiex_pcie_init_evt_ring
	drm/drm_vblank: Change EINVAL by the correct errno
	selftests/bpf: Fix btf_dump padding test case
	libbpf: Fix struct end padding in btf_dump
	libbpf: Fix passing uninitialized bytes to setsockopt
	net/smc: increase device refcount for added link group
	team: call RCU read lock when walking the port_list
	media: cx88: Fix some error handling path in 'cx8800_initdev()'
	crypto: inside-secure - Fix a maybe-uninitialized warning
	crypto: aegis128/simd - build 32-bit ARM for v8 architecture explicitly
	misc: fastrpc: fix memory leak from miscdev->name
	ASoC: SOF: enable sync_write in hdac_bus
	media: ti-vpe: vpe: Fix Motion Vector vpdma stride
	media: ti-vpe: vpe: fix a v4l2-compliance warning about invalid pixel format
	media: ti-vpe: vpe: fix a v4l2-compliance failure about frame sequence number
	media: ti-vpe: vpe: Make sure YUYV is set as default format
	media: ti-vpe: vpe: fix a v4l2-compliance failure causing a kernel panic
	media: ti-vpe: vpe: ensure buffers are cleaned up properly in abort cases
	drm/amd/display: Properly round nominal frequency for SPD
	drm/amd/display: wait for set pipe mcp command completion
	media: ti-vpe: vpe: fix a v4l2-compliance failure about invalid sizeimage
	drm/amd/display: add new active dongle to existent w/a
	syscalls/x86: Use the correct function type in SYSCALL_DEFINE0
	drm/amd/display: Fix dongle_caps containing stale information.
	extcon: sm5502: Reset registers during initialization
	drm/amd/display: Program DWB watermarks from correct state
	x86/mm: Use the correct function type for native_set_fixmap()
	ath10k: Correct error handling of dma_map_single()
	rtw88: coex: Set 4 slot mode for A2DP
	drm/bridge: dw-hdmi: Restore audio when setting a mode
	perf test: Report failure for mmap events
	perf report: Add warning when libunwind not compiled in
	perf test: Avoid infinite loop for task exit case
	perf vendor events arm64: Fix Hisi hip08 DDRC PMU eventname
	usb: usbfs: Suppress problematic bind and unbind uevents.
	drm/amd/powerplay: avoid disabling ECC if RAS is enabled for VEGA20
	iio: adc: max1027: Reset the device at probe time
	Bluetooth: btusb: avoid unused function warning
	Bluetooth: missed cpu_to_le16 conversion in hci_init4_req
	Bluetooth: Workaround directed advertising bug in Broadcom controllers
	Bluetooth: hci_core: fix init for HCI_USER_CHANNEL
	bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()
	x86/mce: Lower throttling MCE messages' priority to warning
	drm/amd/display: enable hostvm based on roimmu active for dcn2.1
	drm/amd/display: fix header for RN clk mgr
	drm/amdgpu: fix amdgpu trace event print string format error
	staging: iio: ad9834: add a check for devm_clk_get
	power: supply: cpcap-battery: Check voltage before orderly_poweroff
	perf tests: Disable bp_signal testing for arm64
	selftests/bpf: Make a copy of subtest name
	net: hns3: log and clear hardware error after reset complete
	RDMA/hns: Fix wrong parameters when initial mtt of srq->idx_que
	drm/gma500: fix memory disclosures due to uninitialized bytes
	ASoC: soc-pcm: fixup dpcm_prune_paths() loop continue
	rtl8xxxu: fix RTL8723BU connection failure issue after warm reboot
	RDMA/siw: Fix SQ/RQ drain logic
	ipmi: Don't allow device module unload when in use
	x86/ioapic: Prevent inconsistent state when moving an interrupt
	media: cedrus: Fix undefined shift with a SHIFT_AND_MASK_BITS macro
	media: aspeed: set hsync and vsync polarities to normal before starting mode detection
	drm/nouveau: Don't grab runtime PM refs for HPD IRQs
	media: ov6650: Fix stored frame interval not in sync with hardware
	media: ad5820: Define entity function
	media: ov5640: Make 2592x1944 mode only available at 15 fps
	media: st-mipid02: add a check for devm_gpiod_get_optional
	media: imx7-mipi-csis: Add a check for devm_regulator_get
	media: aspeed: clear garbage interrupts
	media: smiapp: Register sensor after enabling runtime PM on the device
	md: no longer compare spare disk superblock events in super_load
	staging: wilc1000: potential corruption in wilc_parse_join_bss_param()
	md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bit
	drm: Don't free jobs in wait_event_interruptible()
	EDAC/amd64: Set grain per DIMM
	arm64: psci: Reduce the waiting time for cpu_psci_cpu_kill()
	drm/amd/display: setting the DIG_MODE to the correct value.
	i40e: initialize ITRN registers with correct values
	drm/amd/display: correctly populate dpp refclk in fpga
	i40e: Wrong 'Advertised FEC modes' after set FEC to AUTO
	net: phy: dp83867: enable robust auto-mdix
	drm/tegra: sor: Use correct SOR index on Tegra210
	regulator: core: Release coupled_rdevs on regulator_init_coupling() error
	ubsan, x86: Annotate and allow __ubsan_handle_shift_out_of_bounds() in uaccess regions
	spi: sprd: adi: Add missing lock protection when rebooting
	ACPI: button: Add DMI quirk for Medion Akoya E2215T
	RDMA/qedr: Fix memory leak in user qp and mr
	RDMA/hns: Fix memory leak on 'context' on error return path
	RDMA/qedr: Fix srqs xarray initialization
	RDMA/core: Set DMA parameters correctly
	staging: wilc1000: check if device is initialzied before changing vif
	gpu: host1x: Allocate gather copy for host1x
	net: dsa: LAN9303: select REGMAP when LAN9303 enable
	phy: renesas: phy-rcar-gen2: Fix the array off by one warning
	phy: qcom-usb-hs: Fix extcon double register after power cycle
	s390/time: ensure get_clock_monotonic() returns monotonic values
	s390: add error handling to perf_callchain_kernel
	s390/mm: add mm_pxd_folded() checks to pxd_free()
	net: hns3: add struct netdev_queue debug info for TX timeout
	libata: Ensure ata_port probe has completed before detach
	loop: fix no-unmap write-zeroes request behavior
	net/mlx5e: Verify that rule has at least one fwd/drop action
	pinctrl: sh-pfc: sh7734: Fix duplicate TCLK1_B
	ALSA: bebob: expand sleep just after breaking connections for protocol version 1
	iio: dln2-adc: fix iio_triggered_buffer_postenable() position
	libbpf: Fix error handling in bpf_map__reuse_fd()
	Bluetooth: Fix advertising duplicated flags
	ALSA: pcm: Fix missing check of the new non-cached buffer type
	spi: sifive: disable clk when probe fails and remove
	ASoC: SOF: imx: fix reverse CONFIG_SND_SOC_SOF_OF dependency
	pinctrl: qcom: sc7180: Add missing tile info in SDC_QDSD_PINGROUP/UFS_RESET
	pinctrl: amd: fix __iomem annotation in amd_gpio_irq_handler()
	ixgbe: protect TX timestamping from API misuse
	cpufreq: sun50i: Fix CPU speed bin detection
	media: rcar_drif: fix a memory disclosure
	media: v4l2-core: fix touch support in v4l_g_fmt
	nvme: introduce "Command Aborted By host" status code
	media: staging/imx: Use a shorter name for driver
	nvmem: imx-ocotp: reset error status on probe
	nvmem: core: fix nvmem_cell_write inline function
	ASoC: SOF: topology: set trigger order for FE DAI link
	media: vivid: media_device_cleanup was called too early
	spi: dw: Fix Designware SPI loopback
	bnx2x: Fix PF-VF communication over multi-cos queues.
	spi: img-spfi: fix potential double release
	ALSA: timer: Limit max amount of slave instances
	RDMA/core: Fix return code when modify_port isn't supported
	drm: msm: a6xx: fix debug bus register configuration
	rtlwifi: fix memory leak in rtl92c_set_fw_rsvdpagepkt()
	perf probe: Fix to find range-only function instance
	perf cs-etm: Fix definition of macro TO_CS_QUEUE_NR
	perf probe: Fix to list probe event with correct line number
	perf jevents: Fix resource leak in process_mapfile() and main()
	perf probe: Walk function lines in lexical blocks
	perf probe: Fix to probe an inline function which has no entry pc
	perf probe: Fix to show ranges of variables in functions without entry_pc
	perf probe: Fix to show inlined function callsite without entry_pc
	libsubcmd: Use -O0 with DEBUG=1
	perf probe: Fix to probe a function which has no entry pc
	perf tools: Fix cross compile for ARM64
	perf tools: Splice events onto evlist even on error
	drm/amdgpu: disallow direct upload save restore list from gfx driver
	drm/amd/powerplay: fix struct init in renoir_print_clk_levels
	drm/amdgpu: fix potential double drop fence reference
	ice: Check for null pointer dereference when setting rings
	xen/gntdev: Use select for DMA_SHARED_BUFFER
	perf parse: If pmu configuration fails free terms
	perf probe: Skip overlapped location on searching variables
	net: avoid potential false sharing in neighbor related code
	perf probe: Return a better scope DIE if there is no best scope
	perf probe: Fix to show calling lines of inlined functions
	perf probe: Skip end-of-sequence and non statement lines
	perf probe: Filter out instances except for inlined subroutine and subprogram
	libbpf: Fix negative FD close() in xsk_setup_xdp_prog()
	s390/bpf: Use kvcalloc for addrs array
	cgroup: freezer: don't change task and cgroups status unnecessarily
	selftests: proc: Make va_max 1MB
	drm/amdgpu: Avoid accidental thread reactivation.
	media: exynos4-is: fix wrong mdev and v4l2 dev order in error path
	ath10k: fix get invalid tx rate for Mesh metric
	fsi: core: Fix small accesses and unaligned offsets via sysfs
	selftests: net: Fix printf format warnings on arm
	media: pvrusb2: Fix oops on tear-down when radio support is not present
	soundwire: intel: fix PDI/stream mapping for Bulk
	crypto: atmel - Fix authenc support when it is set to m
	ice: delay less
	media: si470x-i2c: add missed operations in remove
	media: cedrus: Use helpers to access capture queue
	media: v4l2-ctrl: Lock main_hdl on operations of requests_queued.
	iio: cros_ec_baro: set info_mask_shared_by_all_available field
	EDAC/ghes: Fix grain calculation
	media: vicodec: media_device_cleanup was called too early
	media: vim2m: media_device_cleanup was called too early
	spi: pxa2xx: Add missed security checks
	ASoC: rt5677: Mark reg RT5677_PWR_ANLG2 as volatile
	iio: dac: ad5446: Add support for new AD5600 DAC
	bpf, testing: Workaround a verifier failure for test_progs
	ASoC: Intel: kbl_rt5663_rt5514_max98927: Add dmic format constraint
	net: dsa: sja1105: Disallow management xmit during switch reset
	r8169: respect EEE user setting when restarting network
	s390/disassembler: don't hide instruction addresses
	net: ethernet: ti: Add dependency for TI_DAVINCI_EMAC
	nvme: Discard workaround for non-conformant devices
	parport: load lowlevel driver if ports not found
	bcache: fix static checker warning in bcache_device_free()
	cpufreq: Register drivers only after CPU devices have been registered
	qtnfmac: fix debugfs support for multiple cards
	qtnfmac: fix invalid channel information output
	x86/crash: Add a forward declaration of struct kimage
	qtnfmac: fix using skb after free
	RDMA/efa: Clear the admin command buffer prior to its submission
	tracing: use kvcalloc for tgid_map array allocation
	MIPS: ralink: enable PCI support only if driver for mt7621 SoC is selected
	tracing/kprobe: Check whether the non-suffixed symbol is notrace
	bcache: fix deadlock in bcache_allocator
	iwlwifi: mvm: fix unaligned read of rx_pkt_status
	ASoC: wm8904: fix regcache handling
	regulator: core: Let boot-on regulators be powered off
	spi: tegra20-slink: add missed clk_unprepare
	tun: fix data-race in gro_normal_list()
	xhci-pci: Allow host runtime PM as default also for Intel Ice Lake xHCI
	crypto: virtio - deal with unsupported input sizes
	mmc: tmio: Add MMC_CAP_ERASE to allow erase/discard/trim requests
	btrfs: don't prematurely free work in end_workqueue_fn()
	btrfs: don't prematurely free work in run_ordered_work()
	sched/uclamp: Fix overzealous type replacement
	ASoC: wm2200: add missed operations in remove and probe failure
	spi: st-ssc4: add missed pm_runtime_disable
	ASoC: wm5100: add missed pm_runtime_disable
	perf/core: Fix the mlock accounting, again
	selftests, bpf: Fix test_tc_tunnel hanging
	selftests, bpf: Workaround an alu32 sub-register spilling issue
	bnxt_en: Return proper error code for non-existent NVM variable
	net: phy: avoid matching all-ones clause 45 PHY IDs
	firmware_loader: Fix labels with comma for builtin firmware
	ASoC: Intel: bytcr_rt5640: Update quirk for Acer Switch 10 SW5-012 2-in-1
	x86/insn: Add some Intel instructions to the opcode map
	net-af_xdp: Use correct number of channels from ethtool
	brcmfmac: remove monitor interface when detaching
	perf session: Fix decompression of PERF_RECORD_COMPRESSED records
	perf probe: Fix to show function entry line as probe-able
	s390/crypto: Fix unsigned variable compared with zero
	s390/kasan: support memcpy_real with TRACE_IRQFLAGS
	bnxt_en: Improve RX buffer error handling.
	iwlwifi: check kasprintf() return value
	fbtft: Make sure string is NULL terminated
	ASoC: soc-pcm: check symmetry before hw_params
	net: ethernet: ti: ale: clean ale tbl on init and intf restart
	mt76: fix possible out-of-bound access in mt7615_fill_txs/mt7603_fill_txs
	s390/cpumf: Adjust registration of s390 PMU device drivers
	crypto: sun4i-ss - Fix 64-bit size_t warnings
	crypto: sun4i-ss - Fix 64-bit size_t warnings on sun4i-ss-hash.c
	mac80211: consider QoS Null frames for STA_NULLFUNC_ACKED
	crypto: vmx - Avoid weird build failures
	libtraceevent: Fix memory leakage in copy_filter_type
	mips: fix build when "48 bits virtual memory" is enabled
	drm/amdgpu: fix bad DMA from INTERRUPT_CNTL2
	ice: Only disable VF state when freeing each VF resources
	ice: Fix setting coalesce to handle DCB configuration
	net: phy: initialise phydev speed and duplex sanely
	tools, bpf: Fix build for 'make -s tools/bpf O=<dir>'
	RDMA/bnxt_re: Fix missing le16_to_cpu
	RDMA/bnxt_re: Fix stat push into dma buffer on gen p5 devices
	bpf: Provide better register bounds after jmp32 instructions
	RDMA/bnxt_re: Fix chip number validation Broadcom's Gen P5 series
	ibmvnic: Fix completion structure initialization
	net: wireless: intel: iwlwifi: fix GRO_NORMAL packet stalling
	MIPS: futex: Restore \n after sync instructions
	btrfs: don't prematurely free work in reada_start_machine_worker()
	btrfs: don't prematurely free work in scrub_missing_raid56_worker()
	Revert "mmc: sdhci: Fix incorrect switch to HS mode"
	mmc: mediatek: fix CMD_TA to 2 for MT8173 HS200/HS400 mode
	tpm_tis: reserve chip for duration of tpm_tis_core_init
	tpm: fix invalid locking in NONBLOCKING mode
	iommu: fix KASAN use-after-free in iommu_insert_resv_region
	iommu: set group default domain before creating direct mappings
	iommu/vt-d: Fix dmar pte read access not set error
	iommu/vt-d: Set ISA bridge reserved region as relaxable
	iommu/vt-d: Allocate reserved region for ISA with correct permission
	can: xilinx_can: Fix missing Rx can packets on CANFD2.0
	can: m_can: tcan4x5x: add required delay after reset
	can: j1939: j1939_sk_bind(): take priv after lock is held
	can: flexcan: fix possible deadlock and out-of-order reception after wakeup
	can: flexcan: poll MCR_LPM_ACK instead of GPR ACK for stop mode acknowledgment
	can: kvaser_usb: kvaser_usb_leaf: Fix some info-leaks to USB devices
	selftests: net: tls: remove recv_rcvbuf test
	spi: dw: Correct handling of native chipselect
	spi: cadence: Correct handling of native chipselect
	usb: xhci: Fix build warning seen with CONFIG_PM=n
	drm/amdgpu: fix uninitialized variable pasid_mapping_needed
	ath10k: Revert "ath10k: add cleanup in ath10k_sta_state()"
	RDMA/siw: Fix post_recv QP state locking
	md: avoid invalid memory access for array sb->dev_roles
	s390/ftrace: fix endless recursion in function_graph tracer
	ARM: dts: Fix vcsi regulator to be always-on for droid4 to prevent hangs
	can: flexcan: add low power enter/exit acknowledgment helper
	usbip: Fix receive error in vhci-hcd when using scatter-gather
	usbip: Fix error path of vhci_recv_ret_submit()
	spi: fsl: don't map irq during probe
	spi: fsl: use platform_get_irq() instead of of_irq_to_resource()
	efi/memreserve: Register reservations as 'reserved' in /proc/iomem
	cpufreq: Avoid leaving stale IRQ work items during CPU offline
	KEYS: asymmetric: return ENOMEM if akcipher_request_alloc() fails
	mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG
	USB: EHCI: Do not return -EPIPE when hub is disconnected
	intel_th: pci: Add Comet Lake PCH-V support
	intel_th: pci: Add Elkhart Lake SOC support
	intel_th: Fix freeing IRQs
	intel_th: msu: Fix window switching without windows
	platform/x86: hp-wmi: Make buffer for HPWMI_FEATURE2_QUERY 128 bytes
	staging: comedi: gsc_hpdi: check dma_alloc_coherent() return value
	tty/serial: atmel: fix out of range clock divider handling
	serial: sprd: Add clearing break interrupt operation
	pinctrl: baytrail: Really serialize all register accesses
	clk: imx: clk-imx7ulp: Add missing sentinel of ulp_div_table
	clk: imx: clk-composite-8m: add lock to gate/mux
	clk: imx: pll14xx: fix clk_pll14xx_wait_lock
	ext4: fix ext4_empty_dir() for directories with holes
	ext4: check for directory entries too close to block end
	ext4: unlock on error in ext4_expand_extra_isize()
	ext4: validate the debug_want_extra_isize mount option at parse time
	iocost: over-budget forced IOs should schedule async delay
	KVM: PPC: Book3S HV: Fix regression on big endian hosts
	kvm: x86: Host feature SSBD doesn't imply guest feature SPEC_CTRL_SSBD
	kvm: x86: Host feature SSBD doesn't imply guest feature AMD_SSBD
	KVM: arm/arm64: Properly handle faulting of device mappings
	KVM: arm64: Ensure 'params' is initialised when looking up sys register
	x86/intel: Disable HPET on Intel Coffee Lake H platforms
	x86/MCE/AMD: Do not use rdmsr_safe_on_cpu() in smca_configure()
	x86/MCE/AMD: Allow Reserved types to be overwritten in smca_banks[]
	x86/mce: Fix possibly incorrect severity calculation on AMD
	powerpc/vcpu: Assume dedicated processors as non-preempt
	powerpc/irq: fix stack overflow verification
	ocxl: Fix concurrent AFU open and device removal
	mmc: sdhci-msm: Correct the offset and value for DDR_CONFIG register
	mmc: sdhci-of-esdhc: Revert "mmc: sdhci-of-esdhc: add erratum A-009204 support"
	mmc: sdhci: Update the tuning failed messages to pr_debug level
	mmc: sdhci-of-esdhc: fix P2020 errata handling
	mmc: sdhci: Workaround broken command queuing on Intel GLK
	mmc: sdhci: Add a quirk for broken command queuing
	nbd: fix shutdown and recv work deadlock v2
	iwlwifi: pcie: move power gating workaround earlier in the flow
	Linux 5.4.7

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3585238149235bf73bb453e25861d9a6b9193dfa
2019-12-31 17:56:13 +01:00
Rafael J. Wysocki
da06508bcb cpufreq: Avoid leaving stale IRQ work items during CPU offline
commit 85572c2c4a upstream.

The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.

Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.

If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point.  Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever.  In particular, a system-wide
deadlock may occur during CPU online as a result of that.

The failing scenario is as follows.  CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu).  It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline.  The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above.  Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3.  When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.

To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.

Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.

Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.

Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 16:46:06 +01:00
Greg Kroah-Hartman
c32aefc014 Merge 5.4.1 into android-5.4
Changes in 5.4.1
	Bluetooth: Fix invalid-free in bcsp_close()
	ath9k_hw: fix uninitialized variable data
	ath10k: Fix a NULL-ptr-deref bug in ath10k_usb_alloc_urb_from_pipe
	ath10k: Fix HOST capability QMI incompatibility
	ath10k: restore QCA9880-AR1A (v1) detection
	Revert "Bluetooth: hci_ll: set operational frequency earlier"
	Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
	md/raid10: prevent access of uninitialized resync_pages offset
	x86/insn: Fix awk regexp warnings
	x86/speculation: Fix incorrect MDS/TAA mitigation status
	x86/speculation: Fix redundant MDS mitigation message
	nbd: prevent memory leak
	x86/stackframe/32: Repair 32-bit Xen PV
	x86/xen/32: Make xen_iret_crit_fixup() independent of frame layout
	x86/xen/32: Simplify ring check in xen_iret_crit_fixup()
	x86/doublefault/32: Fix stack canaries in the double fault handler
	x86/pti/32: Size initial_page_table correctly
	x86/cpu_entry_area: Add guard page for entry stack on 32bit
	x86/entry/32: Fix IRET exception
	x86/entry/32: Use %ss segment where required
	x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALL
	x86/entry/32: Unwind the ESPFIX stack earlier on exception entry
	x86/entry/32: Fix NMI vs ESPFIX
	selftests/x86/mov_ss_trap: Fix the SYSENTER test
	selftests/x86/sigreturn/32: Invalidate DS and ES when abusing the kernel
	x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise
	x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3
	futex: Prevent robust futex exit race
	ALSA: usb-audio: Fix NULL dereference at parsing BADD
	ALSA: usb-audio: Fix Scarlett 6i6 Gen 2 port data
	media: vivid: Set vid_cap_streaming and vid_out_streaming to true
	media: vivid: Fix wrong locking that causes race conditions on streaming stop
	media: usbvision: Fix invalid accesses after device disconnect
	media: usbvision: Fix races among open, close, and disconnect
	cpufreq: Add NULL checks to show() and store() methods of cpufreq
	futex: Move futex exit handling into futex code
	futex: Replace PF_EXITPIDONE with a state
	exit/exec: Seperate mm_release()
	futex: Split futex_mm_release() for exit/exec
	futex: Set task::futex_state to DEAD right after handling futex exit
	futex: Mark the begin of futex exit explicitly
	futex: Sanitize exit state handling
	futex: Provide state handling for exec() as well
	futex: Add mutex around futex exit
	futex: Provide distinct return value when owner is exiting
	futex: Prevent exit livelock
	media: uvcvideo: Fix error path in control parsing failure
	media: b2c2-flexcop-usb: add sanity checking
	media: cxusb: detect cxusb_ctrl_msg error in query
	media: imon: invalid dereference in imon_touch_event
	media: mceusb: fix out of bounds read in MCE receiver buffer
	ALSA: hda - Disable audio component for legacy Nvidia HDMI codecs
	USBIP: add config dependency for SGL_ALLOC
	usbip: tools: fix fd leakage in the function of read_attr_usbip_status
	usbip: Fix uninitialized symbol 'nents' in stub_recv_cmd_submit()
	usb-serial: cp201x: support Mark-10 digital force gauge
	USB: chaoskey: fix error case of a timeout
	appledisplay: fix error handling in the scheduled work
	USB: serial: mos7840: add USB ID to support Moxa UPort 2210
	USB: serial: mos7720: fix remote wakeup
	USB: serial: mos7840: fix remote wakeup
	USB: serial: option: add support for DW5821e with eSIM support
	USB: serial: option: add support for Foxconn T77W968 LTE modules
	staging: comedi: usbduxfast: usbduxfast_ai_cmdtest rounding error
	powerpc/book3s64: Fix link stack flush on context switch
	KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel
	Linux 5.4.1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id50109953b5638956d150e4fc648a94b6e347fb5
2019-11-29 10:56:00 +01:00
Thomas Gleixner
7d7e93588f exit/exec: Seperate mm_release()
commit 4610ba7ad8 upstream.

mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().

In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.

As the futex exit code needs further protections against exit races, this
needs to be split into two functions.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29 10:10:10 +01:00
Greg Kroah-Hartman
cb33d78781 Merge 5.4-rc1 into android-mainline
Linux 5.4-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I15eec52df70f829acf81ff614a1c2a5fb443a4e0
2019-10-02 19:10:07 +02:00
Linus Torvalds
9c5efe9ae7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier related fixes and cleanups, which fixes
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-28 12:39:07 -07:00
Mathieu Desnoyers
227a4aadc7 sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers
2840cf02fa sched/membarrier: Call sync_core only before usermode for same mm
When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Eric W. Biederman
154abafc68 tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
3fbd7ee285 tasks: Add a count of task RCU users
Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Greg Kroah-Hartman
bfa0399bc8 Merge Linus's 5.4-rc1-prerelease branch into android-mainline
This merges Linus's tree as of commit b41dae061b ("Merge tag
'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")
into android-mainline.

This "early" merge makes it easier to test and handle merge conflicts
instead of having to wait until the "end" of the merge window and handle
all 10000+ commits at once.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812
2019-09-20 16:07:54 -07:00
Linus Torvalds
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Thomas Gleixner
244d49e306 posix-cpu-timers: Move state tracking to struct posix_cputimers
Put it where it belongs and clean up the ifdeffery in fork completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
b7be4ef136 posix-cpu-timers: Switch thread group sampling to array
That allows more simplifications in various places.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
11b8462f7e posix-cpu-timers: Provide array based access to expiry cache
Using struct task_cputime for the expiry cache is a pretty odd choice and
comes with magic defines to rename the fields for usage in the expiry
cache.

struct task_cputime is basically a u64 array with 3 members, but it has
distinct members.

The expiry cache content is different than the content of task_cputime
because

  expiry[PROF]  = task_cputime.stime + task_cputime.utime
  expiry[VIRT]  = task_cputime.utime
  expiry[SCHED] = task_cputime.sum_exec_runtime

So there is no direct mapping between task_cputime and the expiry cache and
the #define based remapping is just a horrible hack.

Having the expiry cache array based allows further simplification of the
expiry code.

To avoid an all in one cleanup which is hard to review add a temporary
anonymous union into struct task_cputime which allows array based access to
it. That requires to reorder the members. Add a build time sanity check to
validate that the members are at the same place.

The union and the build time checks will be removed after conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
3a245c0f11 posix-cpu-timers: Move expiry cache into struct posix_cputimers
The expiry cache belongs into the posix_cputimers container where the other
cpu timers information is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
9eacb5c7e6 sched: Move struct task_cputime to types.h
For upcoming posix-timer changes to avoid include recursion hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192920.909530418@linutronix.de
2019-08-28 11:50:34 +02:00
Thomas Gleixner
2b69942f90 posix-cpu-timers: Create a container struct
Per task/process data of posix CPU timers is all over the place which
makes the code hard to follow and requires ifdeffery.

Create a container to hold all this information in one place, so data is
consolidated and the ifdeffery can be confined to the posix timer header
file and removed from places like fork.

As a first step, move the cpu_timers list head array into the new struct
and clean up the initializers and simplify fork. The remaining #ifdef in
fork will be removed later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner
c506bef424 posix-cpu-timers: Rename thread_group_cputimer() and make it static
thread_group_cputimer() is a complete misnomer. The function does two things:

 - For arming process wide timers it makes sure that the atomic time
   storage is up to date. If no cpu timer is armed yet, then the atomic
   time storage is not updated by the scheduler for performance reasons.

   In that case a full summing up of all threads needs to be done and the
   update needs to be enabled.

- Samples the current time into the caller supplied storage.

Rename it to thread_group_start_cputime(), make it static and fixup the
callsite.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner
19298fbf45 posix-cpu-timers: Provide quick sample function for itimer
get_itimer() needs a sample of the current thread group cputime. It invokes
thread_group_cputimer() - which is a misnomer. That function also starts
eventually the group cputime accouting which is bogus because the
accounting is already active when a timer is armed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-08-28 11:50:26 +02:00
Greg Kroah-Hartman
bea0791583 Merge 5.3-rc2 into android-mainline
Linux 5.3-rc2

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4d36fd27ccc8cd773ba1b97dc3bd382e99a4dd7a
2019-07-29 08:40:17 +02:00
Mathieu Poirier
f9a25f776d cpusets: Rebuild root domain deadline accounting information
When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:01 +02:00
Mathieu Poirier
c22645f4c8 sched/topology: Add partition_sched_domains_locked()
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Matthew Wilcox (Oracle)
7b3c92b85a sched/core: Convert get_task_struct() to return the task
Returning the pointer that was passed in allows us to write
slightly more idiomatic code.  Convert a few users.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Jann Horn
16d51a590a sched/fair: Don't free p->numa_faults with concurrent readers
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:04 +02:00
Greg Kroah-Hartman
37766c2946 Merge 5.3.0-rc1 into android-mainline
Linus 5.3-rc1 release

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic171e37d4c21ffa495240c5538852bbb5a9dcce8
2019-07-23 16:21:59 -07:00
Linus Torvalds
07ab9d5bc5 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
 "Mostly bugfixes, but also:

   - s390 support for KVM selftests

   - LAPIC timer offloading to housekeeping CPUs

   - Extend an s390 optimization for overcommitted hosts to all
     architectures

   - Debugging cleanups and improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: x86: Add fixed counters to PMU filter
  KVM: nVMX: do not use dangling shadow VMCS after guest reset
  KVM: VMX: dump VMCS on failed entry
  KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
  KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
  KVM: Boost vCPUs that are delivering interrupts
  KVM: selftests: Remove superfluous define from vmx.c
  KVM: SVM: Fix detection of AMD Errata 1096
  KVM: LAPIC: Inject timer interrupt via posted interrupt
  KVM: LAPIC: Make lapic timer unpinned
  KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
  KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
  kvm: x86: ioapic and apic debug macros cleanup
  kvm: x86: some tsc debug cleanup
  kvm: vmx: fix coccinelle warnings
  x86: kvm: avoid constant-conversion warning
  x86: kvm: avoid -Wsometimes-uninitized warning
  KVM: x86: expose AVX512_BF16 feature to guest
  KVM: selftests: enable pgste option for the linker on s390
  KVM: selftests: Move kvm_create_max_vcpus test to generic code
  ...
2019-07-20 10:20:27 -07:00
Wanpeng Li
0c5f81dad4 KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside.  There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits.  This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially.  Without patch

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )

While with patch:

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 09:00:40 +02:00
Linus Torvalds
57a8ec387e Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "VM:
   - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

   - more accurate reclaimed slab caches calculations by Yafang Shao

   - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
     Christoph Hellwig

   - !CONFIG_MMU fixes by Christoph Hellwig

   - new novmcoredd parameter to omit device dumps from vmcore, by
     Kairui Song

   - new test_meminit module for testing heap and pagealloc
     initialization, by Alexander Potapenko

   - ioremap improvements for huge mappings, by Anshuman Khandual

   - generalize kprobe page fault handling, by Anshuman Khandual

   - device-dax hotplug fixes and improvements, by Pavel Tatashin

   - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

   - add pte_devmap() support for arm64, by Robin Murphy

   - unify locked_vm accounting with a helper, by Daniel Jordan

   - several misc fixes

  core/lib:
   - new typeof_member() macro including some users, by Alexey Dobriyan

   - make BIT() and GENMASK() available in asm, by Masahiro Yamada

   - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
     code generation, by Alexey Dobriyan

   - rbtree code size optimizations, by Michel Lespinasse

   - convert struct pid count to refcount_t, by Joel Fernandes

  get_maintainer.pl:
   - add --no-moderated switch to skip moderated ML's, by Joe Perches

  misc:
   - ptrace PTRACE_GET_SYSCALL_INFO interface

   - coda updates

   - gdb scripts, various"

[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  fs/select.c: use struct_size() in kmalloc()
  mm: add account_locked_vm utility function
  arm64: mm: implement pte_devmap support
  mm: introduce ARCH_HAS_PTE_DEVMAP
  mm: clean up is_device_*_page() definitions
  mm/mmap: move common defines to mman-common.h
  mm: move MAP_SYNC to asm-generic/mman-common.h
  device-dax: "Hotremove" persistent memory that is used like normal RAM
  mm/hotplug: make remove_memory() interface usable
  device-dax: fix memory and resource leak if hotplug fails
  include/linux/lz4.h: fix spelling and copy-paste errors in documentation
  ipc/mqueue.c: only perform resource calculation if user valid
  include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
  scripts/gdb: add helpers to find and list devices
  scripts/gdb: add lx-genpd-summary command
  drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
  kernel/pid.c: convert struct pid count to refcount_t
  drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
  select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
  select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
  ...
2019-07-17 08:58:04 -07:00
Oleg Nesterov
b772434be0 signal: simplify set_user_sigmask/restore_user_sigmask
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Alexey Dobriyan
e2d9018e81 signal: reorder struct sighand_struct
struct sighand_struct::siglock field is the most used field by far, put
it first so that is can be accessed without IMM8 or IMM32 encoding on
x86_64.

Space savings (on trimmed down VM test config):

add/remove: 0/0 grow/shrink: 8/68 up/down: 49/-1147 (-1098)
Function                                     old     new   delta
complete_signal                              512     533     +21
do_signalfd4                                 335     346     +11
__cleanup_sighand                             39      43      +4
unhandled_signal                              49      52      +3
prepare_signal                               692     695      +3
ignore_signals                                37      40      +3
__tty_check_change.part                      248     251      +3
ksys_unshare                                 780     781      +1
sighand_ctor                                  33      29      -4
ptrace_trap_notify                            60      56      -4
sigqueue_free                                 98      91      -7
run_posix_cpu_timers                        1389    1382      -7
proc_pid_status                             2448    2441      -7
proc_pid_limits                              344     337      -7
posix_cpu_timer_rearm                        222     215      -7
posix_cpu_timer_get                          249     242      -7
kill_pid_info_as_cred                        243     236      -7
freeze_task                                  197     190      -7
flush_old_exec                              1873    1866      -7
do_task_stat                                3363    3356      -7
do_send_sig_info                              98      91      -7
do_group_exit                                147     140      -7
init_sighand                                2088    2080      -8
do_notify_parent_cldstop                     399     391      -8
signalfd_cleanup                              50      41      -9
do_notify_parent                             557     545     -12
__send_signal                               1029    1017     -12
ptrace_stop                                  590     577     -13
get_signal                                  1576    1563     -13
__lock_task_sighand                          112      99     -13
zap_pid_ns_processes                         391     377     -14
update_rlimit_cpu                             78      64     -14
tty_signal_session_leader                    413     399     -14
tty_open_proc_set_tty                        149     135     -14
tty_jobctrl_ioctl                            936     922     -14
set_cpu_itimer                               339     325     -14
ptrace_resume                                226     212     -14
ptrace_notify                                110      96     -14
proc_clear_tty                                81      67     -14
posix_cpu_timer_del                          229     215     -14
kernel_sigaction                             156     142     -14
getrusage                                    977     963     -14
get_current_tty                               98      84     -14
force_sigsegv                                 89      75     -14
force_sig_info                               205     191     -14
flush_signals                                 83      69     -14
flush_itimer_signals                          85      71     -14
do_timer_create                             1120    1106     -14
do_sigpending                                 88      74     -14
do_signal_stop                               537     523     -14
cgroup_init_fs_context                       644     630     -14
call_usermodehelper_exec_async               402     388     -14
calculate_sigpending                          58      44     -14
__x64_sys_timer_delete                       248     234     -14
__set_current_blocked                         80      66     -14
__ptrace_unlink                              310     296     -14
__ptrace_detach.part                         187     173     -14
send_sigqueue                                362     347     -15
get_cpu_itimer                               214     199     -15
signalfd_poll                                175     159     -16
dequeue_signal                               340     323     -17
do_getitimer                                 192     174     -18
release_task.part                           1060    1040     -20
ptrace_peek_siginfo                          408     387     -21
posix_cpu_timer_set                          827     806     -21
exit_signals                                 437     416     -21
do_sigaction                                 541     520     -21
do_setitimer                                 485     464     -21
disassociate_ctty.part                       545     517     -28
__x64_sys_rt_sigtimedwait                    721     679     -42
__x64_sys_ptrace                            1319    1277     -42
ptrace_request                              1828    1782     -46
signalfd_read                                507     459     -48
wait_consider_task                          2027    1971     -56
do_coredump                                 3672    3616     -56
copy_process.part                           6936    6871     -65

Link: http://lkml.kernel.org/r/20190503192800.GA18004@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Dmitry V. Levin
028b6e8a89 clone: fix CLONE_PIDFD support
The introduction of clone3 syscall accidentally broke CLONE_PIDFD
support in traditional clone syscall on compat x86 and those
architectures that use do_fork to implement clone syscall.

This bug was found by strace test suite.

Link: https://strace.io/logs/strace/2019-07-12
Fixes: 7f192e3cd3 ("fork: add clone3")
Bisected-and-tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-14 20:36:12 +02:00
Linus Torvalds
8f6ccf6159 Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull clone3 system call from Christian Brauner:
 "This adds the clone3 syscall which is an extensible successor to clone
  after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
  window for clone(). It cleanly supports all of the flags from clone()
  and thus all legacy workloads.

  There are few user visible differences between clone3 and clone.
  First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
  this flag. Second, the CSIGNAL flag is deprecated and will cause
  EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
  argument in struct clone_args thus freeing up even more flags. And
  third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
  clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
  argument.

  The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

  and a separate kernel struct is used that uses proper kernel typing:

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

  The system call comes with a size argument which enables the kernel to
  detect what version of clone_args userspace is passing in. clone3
  validates that any additional bytes a given kernel does not know about
  are set to zero and that the size never exceeds a page.

  A nice feature is that this patchset allowed us to cleanup and
  simplify various core kernel codepaths in kernel/fork.c by making the
  internal _do_fork() function take struct kernel_clone_args even for
  legacy clone().

  This patch also unblocks the time namespace patchset which wants to
  introduce a new CLONE_TIMENS flag.

  Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
  xtensa. These were the architectures that did not require special
  massaging.

  Other architectures treat fork-like system calls individually and
  after some back and forth neither Arnd nor I felt confident that we
  dared to add clone3 unconditionally to all architectures. We agreed to
  leave this up to individual architecture maintainers. This is why
  there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
  which any architecture can set once it has implemented support for
  clone3. The patch also adds a cond_syscall(clone3) for architectures
  such as nios2 or h8300 that generate their syscall table by simply
  including asm-generic/unistd.h. The hope is to get rid of
  __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"

* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: handle arches who do not yet define clone3
  arch: wire-up clone3() syscall
  fork: add clone3
2019-07-11 10:09:44 -07:00
Linus Torvalds
5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds
c84ca912b0 Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Linus Torvalds
dad1c12ed8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Remove the unused per rq load array and all its infrastructure, by
   Dietmar Eggemann.

 - Add utilization clamping support by Patrick Bellasi. This is a
   refinement of the energy aware scheduling framework with support for
   boosting of interactive and capping of background workloads: to make
   sure critical GUI threads get maximum frequency ASAP, and to make
   sure background processing doesn't unnecessarily move to cpufreq
   governor to higher frequencies and less energy efficient CPU modes.

 - Add the bare minimum of tracepoints required for LISA EAS regression
   testing, by Qais Yousef - which allows automated testing of various
   power management features, including energy aware scheduling.

 - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
   kernel used to modify the scheduler's CPU affinity logic such as
   migrate_disable() - introduce the task->cpus_ptr value instead of
   taking the address of &task->cpus_allowed directly - by Sebastian
   Andrzej Siewior.

 - Misc optimizations, fixes, cleanups and small enhancements - see the
   Git log for details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched/uclamp: Add uclamp support to energy_compute()
  sched/uclamp: Add uclamp_util_with()
  sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
  sched/uclamp: Set default clamps for RT tasks
  sched/uclamp: Reset uclamp values on RESET_ON_FORK
  sched/uclamp: Extend sched_setattr() to support utilization clamping
  sched/core: Allow sched_setattr() to use the current policy
  sched/uclamp: Add system default clamps
  sched/uclamp: Enforce last task's UCLAMP_MAX
  sched/uclamp: Add bucket local max tracking
  sched/uclamp: Add CPU's clamp buckets refcounting
  sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
  sched/debug: Export the newly added tracepoints
  sched/debug: Add sched_overutilized tracepoint
  sched/debug: Add new tracepoint to track PELT at se level
  sched/debug: Add new tracepoints to track PELT at rq level
  sched/debug: Add a new sched_trace_*() helper functions
  sched/autogroup: Make autogroup_path() always available
  sched/wait: Deduplicate code with do-while
  sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
  ...
2019-07-08 16:39:53 -07:00
David Howells
0f44e4d976 keys: Move the user and user-session keyrings to the user_namespace
Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct.  This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.

The issue is that a single user_struct may be represent UIDs in several
different namespaces.

The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that.  It can then be searched to retrieve them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jann Horn <jannh@google.com>
2019-06-26 21:02:32 +01:00
Patrick Bellasi
e8f14172c6 sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

  /proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi
69842cba9a sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Vincent Guittot
8ec59c0f5f sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
unused since commit:

  765d0af19f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

Remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: linux@armlinux.org.uk
Cc: quentin.perret@arm.com
Cc: rafael@kernel.org
Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:39 +02:00
Waiman Long
00f3c5a3df locking/rwsem: Always release wait_lock before waking up tasks
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
stale count value.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:00 +02:00
Ingo Molnar
23da766ab1 Merge tag 'v5.2-rc5' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:12:27 +02:00
Greg Kroah-Hartman
879ebb9016 Merge 5.2-rc5 into android-mainline
Linux 5.2-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-06-17 08:38:06 +02:00
Andrea Arcangeli
59ea6d06cf coredump: fix race condition between collapse_huge_page() and core dumping
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in 04f5866e41 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.

If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.

khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.

collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().

Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.

The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().

So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump...  as long as the coredump
can invoke page faults without holding the mmap_sem for reading.

This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().

Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Christian Brauner
7f192e3cd3 fork: add clone3
This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838252
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-06-09 09:29:28 +02:00
Dietmar Eggemann
0e1fef63d9 sched/core: Remove sd->*_idx
The sched domain per rq load index files also disappear from the
/proc/sys/kernel/sched_domain/cpuX/domainY directories.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:40 +02:00
Dietmar Eggemann
5e83eafbfd sched/fair: Remove the rq->cpu_load[] update code
With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
any more.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:38 +02:00