Changes in 5.15.59
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
Revert "ocfs2: mount shared volume without ha stack"
ntfs: fix use-after-free in ntfs_ucsncmp()
fs: sendfile handles O_NONBLOCK of out_fd
secretmem: fix unhandled fault in truncate
mm: fix page leak with multiple threads mapping the same page
hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
asm-generic: remove a broken and needless ifdef conditional
s390/archrandom: prevent CPACF trng invocations in interrupt context
nouveau/svm: Fix to migrate all requested pages
drm/simpledrm: Fix return type of simpledrm_simple_display_pipe_mode_valid()
watch_queue: Fix missing rcu annotation
watch_queue: Fix missing locking in add_watch_to_object()
tcp: Fix data-races around sysctl_tcp_dsack.
tcp: Fix a data-race around sysctl_tcp_app_win.
tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
tcp: Fix a data-race around sysctl_tcp_frto.
tcp: Fix a data-race around sysctl_tcp_nometrics_save.
tcp: Fix data-races around sysctl_tcp_no_ssthresh_metrics_save.
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
ice: do not setup vlan for loopback VSI
scsi: ufs: host: Hold reference returned by of_parse_phandle()
Revert "tcp: change pingpong threshold to 3"
octeontx2-pf: Fix UDP/TCP src and dst port tc filters
tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf.
tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
scsi: core: Fix warning in scsi_alloc_sgtables()
scsi: mpt3sas: Stop fw fault watchdog work item during system shutdown
net: ping6: Fix memleak in ipv6_renew_options().
ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
net/tls: Remove the context from the list in tls_device_down
igmp: Fix data-races around sysctl_igmp_qrv.
net: pcs: xpcs: propagate xpcs_read error to xpcs_get_state_c37_sgmii
net: sungem_phy: Add of_node_put() for reference returned by of_get_parent()
tcp: Fix a data-race around sysctl_tcp_min_tso_segs.
tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
tcp: Fix a data-race around sysctl_tcp_autocorking.
tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
Documentation: fix sctp_wmem in ip-sysctl.rst
macsec: fix NULL deref in macsec_add_rxsa
macsec: fix error message in macsec_add_rxsa and _txsa
macsec: limit replay window size with XPN
macsec: always read MACSEC_SA_ATTR_PN as a u64
net: macsec: fix potential resource leak in macsec_add_rxsa() and macsec_add_txsa()
net: mld: fix reference count leak in mld_{query | report}_work()
tcp: Fix data-races around sk_pacing_rate.
net: Fix data-races around sysctl_[rw]mem(_offset)?.
tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
tcp: Fix data-races around sysctl_tcp_reflect_tos.
ipv4: Fix data-races around sysctl_fib_notify_on_flag_change.
i40e: Fix interface init with MSI interrupts (no MSI-X)
sctp: fix sleep in atomic context bug in timer handlers
octeontx2-pf: cn10k: Fix egress ratelimit configuration
netfilter: nf_queue: do not allow packet truncation below transport header offset
virtio-net: fix the race between refill work and close
perf symbol: Correct address for bss symbols
sfc: disable softirqs for ptp TX
sctp: leave the err path free in sctp_stream_init to sctp_stream_free
ARM: crypto: comment out gcc warning that breaks clang builds
mm/hmm: fault non-owner device private entries
page_alloc: fix invalid watermark check on a negative value
ARM: 9216/1: Fix MAX_DMA_ADDRESS overflow
EDAC/ghes: Set the DIMM label unconditionally
docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed
locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available
Linux 5.15.59
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4f2002d38aea467e150a912f50d456c41b23de89
[ Upstream commit 02739545951ad4c1215160db7fbf9b7a918d3c0b ]
While reading these sysctl variables, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- .sysctl_rmem
- .sysctl_rwmem
- .sysctl_rmem_offset
- .sysctl_wmem_offset
- sysctl_tcp_rmem[1, 2]
- sysctl_tcp_wmem[1, 2]
- sysctl_decnet_rmem[1]
- sysctl_decnet_wmem[1]
- sysctl_tipc_rmem[1]
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.15.56
ALSA: hda - Add fixup for Dell Latitidue E5430
ALSA: hda/conexant: Apply quirk for another HP ProDesk 600 G3 model
ALSA: hda/realtek: Fix headset mic for Acer SF313-51
ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc671
ALSA: hda/realtek: fix mute/micmute LEDs for HP machines
ALSA: hda/realtek - Fix headset mic problem for a HP machine with alc221
ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop
xen/netback: avoid entering xenvif_rx_next_skb() with an empty rx queue
fix race between exit_itimers() and /proc/pid/timers
mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
mm: split huge PUD on wp_huge_pud fallback
tracing/histograms: Fix memory leak problem
net: sock: tracing: Fix sock_exceed_buf_limit not to dereference stale pointer
ip: fix dflt addr selection for connected nexthop
ARM: 9213/1: Print message about disabled Spectre workarounds only once
ARM: 9214/1: alignment: advance IT state after emulating Thumb instruction
wifi: mac80211: fix queue selection for mesh/OCB interfaces
cgroup: Use separate src/dst nodes when preloading css_sets for migration
btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
drm/panfrost: Put mapping instead of shmem obj on panfrost_mmu_map_fault_addr() error
drm/panfrost: Fix shrinker list corruption by madvise IOCTL
fs/remap: constrain dedupe of EOF blocks
nilfs2: fix incorrect masking of permission flags for symlinks
sh: convert nommu io{re,un}map() to static inline functions
Revert "evm: Fix memleak in init_desc"
xfs: only run COW extent recovery when there are no live extents
xfs: don't include bnobt blocks when reserving free block pool
xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
xfs: drop async cache flushes from CIL commits.
reset: Fix devm bulk optional exclusive control getter
ARM: dts: imx6qdl-ts7970: Fix ngpio typo and count
spi: amd: Limit max transfer and message size
ARM: 9209/1: Spectre-BHB: avoid pr_info() every time a CPU comes out of idle
ARM: 9210/1: Mark the FDT_FIXED sections as shareable
net/mlx5e: kTLS, Fix build time constant test in TX
net/mlx5e: kTLS, Fix build time constant test in RX
net/mlx5e: Fix enabling sriov while tc nic rules are offloaded
net/mlx5e: Fix capability check for updating vnic env counters
net/mlx5e: Ring the TX doorbell on DMA errors
drm/i915: fix a possible refcount leak in intel_dp_add_mst_connector()
ima: Fix a potential integer overflow in ima_appraise_measurement
ASoC: sgtl5000: Fix noise on shutdown/remove
ASoC: tas2764: Add post reset delays
ASoC: tas2764: Fix and extend FSYNC polarity handling
ASoC: tas2764: Correct playback volume range
ASoC: tas2764: Fix amp gain register offset & default
ASoC: Intel: Skylake: Correct the ssp rate discovery in skl_get_ssp_clks()
ASoC: Intel: Skylake: Correct the handling of fmt_config flexible array
net: stmmac: dwc-qos: Disable split header for Tegra194
net: ethernet: ti: am65-cpsw: Fix devlink port register sequence
sysctl: Fix data races in proc_dointvec().
sysctl: Fix data races in proc_douintvec().
sysctl: Fix data races in proc_dointvec_minmax().
sysctl: Fix data races in proc_douintvec_minmax().
sysctl: Fix data races in proc_doulongvec_minmax().
sysctl: Fix data races in proc_dointvec_jiffies().
tcp: Fix a data-race around sysctl_tcp_max_orphans.
inetpeer: Fix data-races around sysctl.
net: Fix data-races around sysctl_mem.
cipso: Fix data-races around sysctl.
icmp: Fix data-races around sysctl.
ipv4: Fix a data-race around sysctl_fib_sync_mem.
ARM: dts: at91: sama5d2: Fix typo in i2s1 node
ARM: dts: sunxi: Fix SPI NOR campatible on Orange Pi Zero
arm64: dts: broadcom: bcm4908: Fix timer node for BCM4906 SoC
arm64: dts: broadcom: bcm4908: Fix cpu node for smp boot
netfilter: nf_log: incorrect offset to network header
netfilter: nf_tables: replace BUG_ON by element length check
drm/i915/gvt: IS_ERR() vs NULL bug in intel_gvt_update_reg_whitelist()
xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
lockd: set fl_owner when unlocking files
lockd: fix nlm_close_files
tracing: Fix sleeping while atomic in kdb ftdump
drm/i915/selftests: fix a couple IS_ERR() vs NULL tests
drm/i915/dg2: Add Wa_22011100796
drm/i915/gt: Serialize GRDOM access between multiple engine resets
drm/i915/gt: Serialize TLB invalidates with GT resets
drm/i915/uc: correctly track uc_fw init failure
drm/i915: Require the vm mutex for i915_vma_bind()
bnxt_en: Fix bnxt_reinit_after_abort() code path
bnxt_en: Fix bnxt_refclk_read()
sysctl: Fix data-races in proc_dou8vec_minmax().
sysctl: Fix data-races in proc_dointvec_ms_jiffies().
icmp: Fix data-races around sysctl_icmp_echo_enable_probe.
icmp: Fix a data-race around sysctl_icmp_ignore_bogus_error_responses.
icmp: Fix a data-race around sysctl_icmp_errors_use_inbound_ifaddr.
icmp: Fix a data-race around sysctl_icmp_ratelimit.
icmp: Fix a data-race around sysctl_icmp_ratemask.
raw: Fix a data-race around sysctl_raw_l3mdev_accept.
tcp: Fix a data-race around sysctl_tcp_ecn_fallback.
ipv4: Fix data-races around sysctl_ip_dynaddr.
nexthop: Fix data-races around nexthop_compat_mode.
net: ftgmac100: Hold reference returned by of_get_child_by_name()
net: stmmac: fix leaks in probe
ima: force signature verification when CONFIG_KEXEC_SIG is configured
ima: Fix potential memory leak in ima_init_crypto()
drm/amd/display: Only use depth 36 bpp linebuffers on DCN display engines.
drm/amd/pm: Prevent divide by zero
sfc: fix use after free when disabling sriov
ceph: switch netfs read ops to use rreq->inode instead of rreq->mapping->host
seg6: fix skb checksum evaluation in SRH encapsulation/insertion
seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors
seg6: bpf: fix skb checksum in bpf_push_seg6_encap()
sfc: fix kernel panic when creating VF
net: atlantic: remove deep parameter on suspend/resume functions
net: atlantic: remove aq_nic_deinit() when resume
KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()
net/tls: Check for errors in tls_device_init
ACPI: video: Fix acpi_video_handles_brightness_key_presses()
mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
btrfs: rename btrfs_bio to btrfs_io_context
btrfs: zoned: fix a leaked bioc in read_zone_info
ksmbd: use SOCK_NONBLOCK type for kernel_accept()
powerpc/xive/spapr: correct bitmap allocation size
vdpa/mlx5: Initialize CVQ vringh only once
vduse: Tie vduse mgmtdev and its device
virtio_mmio: Add missing PM calls to freeze/restore
virtio_mmio: Restore guest page size on resume
netfilter: br_netfilter: do not skip all hooks with 0 priority
scsi: hisi_sas: Limit max hw sectors for v3 HW
cpufreq: pmac32-cpufreq: Fix refcount leak bug
platform/x86: hp-wmi: Ignore Sanitization Mode event
firmware: sysfb: Make sysfb_create_simplefb() return a pdev pointer
firmware: sysfb: Add sysfb_disable() helper function
fbdev: Disable sysfb device registration when removing conflicting FBs
net: tipc: fix possible refcount leak in tipc_sk_create()
NFC: nxp-nci: don't print header length mismatch on i2c error
nvme-tcp: always fail a request when sending it failed
nvme: fix regression when disconnect a recovering ctrl
net: sfp: fix memory leak in sfp_probe()
ASoC: ops: Fix off by one in range control validation
pinctrl: aspeed: Fix potential NULL dereference in aspeed_pinmux_set_mux()
ASoC: Realtek/Maxim SoundWire codecs: disable pm_runtime on remove
ASoC: rt711-sdca-sdw: fix calibrate mutex initialization
ASoC: Intel: sof_sdw: handle errors on card registration
ASoC: rt711: fix calibrate mutex initialization
ASoC: rt7*-sdw: harden jack_detect_handler
ASoC: codecs: rt700/rt711/rt711-sdca: initialize workqueues in probe
ASoC: SOF: Intel: hda-loader: Clarify the cl_dsp_init() flow
ASoC: wcd938x: Fix event generation for some controls
ASoC: Intel: bytcr_wm5102: Fix GPIO related probe-ordering problem
ASoC: wm5110: Fix DRE control
ASoC: rt711-sdca: fix kernel NULL pointer dereference when IO error
ASoC: dapm: Initialise kcontrol data for mux/demux controls
ASoC: cs47l15: Fix event generation for low power mux control
ASoC: madera: Fix event generation for OUT1 demux
ASoC: madera: Fix event generation for rate controls
irqchip: or1k-pic: Undefine mask_ack for level triggered hardware
x86: Clear .brk area at early boot
soc: ixp4xx/npe: Fix unused match warning
ARM: dts: stm32: use the correct clock source for CEC on stm32mp151
Revert "can: xilinx_can: Limit CANFD brp to 2"
ALSA: usb-audio: Add quirks for MacroSilicon MS2100/MS2106 devices
ALSA: usb-audio: Add quirk for Fiero SC-01
ALSA: usb-audio: Add quirk for Fiero SC-01 (fw v1.0.0)
nvme-pci: phison e16 has bogus namespace ids
signal handling: don't use BUG_ON() for debugging
USB: serial: ftdi_sio: add Belimo device ids
usb: typec: add missing uevent when partner support PD
usb: dwc3: gadget: Fix event pending check
tty: serial: samsung_tty: set dma burst_size to 1
vt: fix memory overlapping when deleting chars in the buffer
serial: 8250: fix return error code in serial8250_request_std_resource()
serial: stm32: Clear prev values before setting RTS delays
serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle
serial: 8250: Fix PM usage_count for console handover
x86/pat: Fix x86_has_pat_wp()
drm/aperture: Run fbdev removal before internal helpers
Linux 5.15.56
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I763d2a7b49435bf2996b31e201aa9794ab64609e
[ Upstream commit 310731e2f1611d1d13aae237abcf8e66d33345d5 ]
While reading .sysctl_mem, it can be changed concurrently.
So, we need to add READ_ONCE() to avoid data-races.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Try to mitigate potential future driver core api changes by adding a
padding to struct sock.
Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel
Leaf changes summary: 1 artifact changed
Changed leaf types summary: 1 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable
'struct sock at sock.h:324:1' changed:
type size changed from 6016 to 6528 (in bits)
8 data member insertions:
'u64 sock::android_kabi_reserved1', at offset 6016 (in bits) at sock.h:516:1
'u64 sock::android_kabi_reserved2', at offset 6080 (in bits) at sock.h:517:1
'u64 sock::android_kabi_reserved3', at offset 6144 (in bits) at sock.h:518:1
'u64 sock::android_kabi_reserved4', at offset 6208 (in bits) at sock.h:519:1
'u64 sock::android_kabi_reserved5', at offset 6272 (in bits) at sock.h:520:1
'u64 sock::android_kabi_reserved6', at offset 6336 (in bits) at sock.h:521:1
'u64 sock::android_kabi_reserved7', at offset 6400 (in bits) at sock.h:522:1
'u64 sock::android_kabi_reserved8', at offset 6464 (in bits) at sock.h:523:1
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I61c3d6cf12c087345db71fc6d93ee6bd58969003
Some vendors want to add things to 'struct sock', so give them a u64
where they can then have a pointer off to their private data and they
can do whatever they want to do without breaking or changing any abi for
anyone else.
Note, usually an android trace hook is also needed to use this properly,
so be aware that this will be required as well.
Bug: 171013716
Signed-off-by: Vignesh Saravanaperumal <vignesh1.s@samsung.com>
Change-Id: Iab0b5570753d4a9722ecf9ca9eeb9b9887e2a9d9
This reverts commit 0e189b0893.
It is no longer needed as we can modify the KABI at this point in time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I361d61ed4282366eea314224870ff8d02ebb7311
This reverts commit ff999198ec as it
breaks the KABI. It will be reverted the next KABI gate in a week.
Fixes: 8993e6067f ("Linux 5.15.26")
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iee6f63dce33dc6c1df93dd4aefb06cf561b5b45b
[ Upstream commit ef57c1610dd8fba5031bf71e0db73356190de151 ]
Increase cache locality by moving rx_dst_coookie next to sk->sk_rx_dst
This removes one or two cache line misses in IPv6 early demux (TCP/UDP)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0c0a5ef809f9150e9229e7b13e43183b681b7a39 ]
Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst
This is part of an effort to reduce cache line misses in TCP fast path.
This removes one cache line miss in early demux.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit dacb5d8875cc6cd3a553363b4d6f06760fcbe70c upstream.
Steffen reported a TCP stream corruption for HTTP requests
served by the apache web-server using a cifs mount-point
and memory mapping the relevant file.
The root cause is quite similar to the one addressed by
commit 20eb4f29b6 ("net: fix sk_page_frag() recursion from
memory reclaim"). Here the nested access to the task page frag
is caused by a page fault on the (mmapped) user-space memory
buffer coming from the cifs file.
The page fault handler performs an smb transaction on a different
socket, inside the same process context. Since sk->sk_allaction
for such socket does not prevent the usage for the task_frag,
the nested allocation modify "under the hood" the page frag
in use by the outer sendmsg call, corrupting the stream.
The overall relevant stack trace looks like the following:
httpd 78268 [001] 3461630.850950: probe:tcp_sendmsg_locked:
ffffffff91461d91 tcp_sendmsg_locked+0x1
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139814e sock_sendmsg+0x3e
ffffffffc06dfe1d smb_send_kvec+0x28
[...]
ffffffffc06cfaf8 cifs_readpages+0x213
ffffffff90e83c4b read_pages+0x6b
ffffffff90e83f31 __do_page_cache_readahead+0x1c1
ffffffff90e79e98 filemap_fault+0x788
ffffffff90eb0458 __do_fault+0x38
ffffffff90eb5280 do_fault+0x1a0
ffffffff90eb7c84 __handle_mm_fault+0x4d4
ffffffff90eb8093 handle_mm_fault+0xc3
ffffffff90c74f6d __do_page_fault+0x1ed
ffffffff90c75277 do_page_fault+0x37
ffffffff9160111e page_fault+0x1e
ffffffff9109e7b5 copyin+0x25
ffffffff9109eb40 _copy_from_iter_full+0xe0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139815c sock_sendmsg+0x4c
ffffffff913981f7 sock_write_iter+0x97
ffffffff90f2cc56 do_iter_readv_writev+0x156
ffffffff90f2dff0 do_iter_write+0x80
ffffffff90f2e1c3 vfs_writev+0xa3
ffffffff90f2e27c do_writev+0x5c
ffffffff90c042bb do_syscall_64+0x5b
ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
The cifs filesystem rightfully sets sk_allocations to GFP_NOFS,
we can avoid the nesting using the sk page frag for allocation
lacking the __GFP_FS flag. Do not define an additional mm-helper
for that, as this is strictly tied to the sk page frag usage.
v1 -> v2:
- use a stricted sk_page_frag() check instead of reordering the
code (Eric)
Reported-by: Steffen Froemer <sfroemer@redhat.com>
Fixes: 5640f76858 ("net: use a per task frag allocator")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832 ]
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.
Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.
53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath
|
---queued_spin_lock_slowpath
|
--53.51%--_raw_spin_lock_irqsave
|
--53.51%--__percpu_counter_sum
tcp_check_oom
|
|--39.03%--__tcp_close
| tcp_close
| inet_release
| inet6_release
| sock_close
| __fput
| ____fput
| task_work_run
| exit_to_usermode_loop
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| __GI___libc_close
|
--14.48%--tcp_out_of_resources
tcp_write_timeout
tcp_retransmit_timer
tcp_write_timer_handler
tcp_write_timer
call_timer_fn
expire_timers
__run_timers
run_timer_softirq
__softirqentry_text_start
As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).
But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().
One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.
Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.
percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.
v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
reported by kernel test robot <lkp@intel.com>
Remove unused socket argument from tcp_too_many_orphans()
Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
In order to fix this issue, this patch adds a new spinlock that needs
to be used whenever these fields are read or written.
Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
reading sk->sk_peer_pid which makes no sense, as this field
is only possibly set by AF_UNIX sockets.
We will have to clean this in a separate patch.
This could be done by reverting b48596d1dc "Bluetooth: L2CAP: Add get_peer_pid callback"
or implementing what was truly expected.
Fixes: 109f6e39fa ("af_unix: Allow SO_PEERCRED to work across namespaces.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
is just lockdep wise equivalent. In fact it's an open coded trivial mutex
implementation with some interesting features.
sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
true, then some other task holds the 'mutex', otherwise it is uncontended.
As this locking construct is obviously endangered by lock ordering issues as
any other locking primitive it got lockdep annotated via a dedicated
dependency map sock::sk_lock.dep_map which has to be updated at the lock
and unlock sites.
lock_sock_nested() is a straight forward 'mutex' lock operation:
might_sleep();
spin_lock_bh(sock::sk_lock.slock)
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
The lockdep annotation for sock::sk_lock.owned is for unknown reasons
_after_ the lock has been acquired, i.e. after the code block above and
after releasing sock::sk_lock.slock, but inside the bottom halves disabled
region:
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
The placement after the unlock is obvious because otherwise the
mutex_acquire() would nest into the spin lock held region.
But that's from the lockdep perspective still the wrong place:
1) The mutex_acquire() is issued _after_ the successful acquisition which
is pointless because in a dead lock scenario this point is never
reached which means that if the deadlock is the first instance of
exposing the wrong lock order lockdep does not have a chance to detect
it.
2) It only works because lockdep is rather lax on the context from which
the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
and therefore non-preemptible region is obviously invalid, except for a
trylock which is clearly not the case here.
This 'works' stops working on RT enabled kernels where the bottom halves
serialization is done via a local lock, which exposes this misplacement
because the 'mutex' and the local lock nest the wrong way around and
lockdep complains rightfully about a lock inversion.
The placement is wrong since the initial commit a5b5bb9a05 ("[PATCH]
lockdep: annotate sk_locks") which introduced this.
Fix it by moving the mutex_acquire() in front of the actual lock
acquisition, which is what the regular mutex_lock() operation does as well.
lock_sock_fast() is not that straight forward. It looks at the first glance
like a convoluted trylock operation:
spin_lock_bh(sock::sk_lock.slock)
if (!sock::sk_lock.owned)
return false;
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
return true;
But that's not the case: lock_sock_fast() is an interesting optimization
for short critical sections which can run with bottom halves disabled and
sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
the non contended case by preventing other lockers to acquire
sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
in turn avoids the overhead of doing the heavy processing in release_sock()
including waking up wait queue waiters.
In the contended case, i.e. when sock::sk_lock.owned == true the behavior
is the same as lock_sock_nested().
Semantically this shortcut means, that the task acquired the 'mutex' even
if it does not touch the sock::sk_lock.owned field in the non-contended
case. Not telling lockdep about this shortcut acquisition is hiding
potential lock ordering violations in the fast path.
As a consequence the same reasoning as for the above lock_sock_nested()
case vs. the placement of the lockdep annotation applies.
The current placement of the lockdep annotation was just copied from
the original lock_sock(), now renamed to lock_sock_nested(),
implementation.
Fix this by moving the mutex_acquire() in front of the actual lock
acquisition and adding the corresponding mutex_release() into
unlock_sock_fast(). Also document the fast path return case with a comment.
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both SKB_FRAG_PAGE_ORDER are defined to the same value in
net/core/sock.c and drivers/vhost/net.c.
Move the SKB_FRAG_PAGE_ORDER definition to net/core/sock.h,
as both net/core/sock.c and drivers/vhost/net.c include it,
and it seems a reasonable file to put the macro.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
to give more control to the networking stack and enable it to change
memcg charging behavior. In the future, the networking stack may decide
to avoid oom-kills when fallbacks are more appropriate.
One behavior change in mem_cgroup_charge_skmem() by this patch is to
avoid force charging by default and let the caller decide when and if
force charging is needed through the presence or absence of
__GFP_NOFAIL.
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket
buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and
tcp_sndbuf_expand()). If we've just created a new socket this adjustment
is enabled on it, but if one changes the socket buffer size by
setsockopt(SO_{SND,RCV}BUF*) it becomes disabled.
CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
restore as it first needs to increase buffer sizes for packet queues
restore and second it needs to restore back original buffer sizes. So
after CRIU restore all sockets become non-auto-adjustable, which can
decrease network performance of restored applications significantly.
CRIU need to be able to restore sockets with enabled/disabled adjustment
to the same state it was before dump, so let's add special setsockopt
for it.
Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so
that using these interface one can reenable automatic socket buffer
adjustment on their sockets.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change leverages the infrastructure introduced by the previous
patches to allow soft devices passing to the GRO engine owned skbs
without impacting the fast-path.
It's up to the GRO caller ensuring the slow_gro bit validity before
invoking the GRO engine. The new helper skb_prepare_for_gro() is
introduced for that goal.
On slow_gro, skbs are aggregated only with equal sk.
Additionally, skb truesize on GRO recycle and free is correctly
updated so that sk wmem is not changed by the GRO processing.
rfc-> v1:
- fixed bad truesize on dev_gro_receive NAPI_FREE
- use the existing state bit
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since PTP virtual clock support is added, there can be
several PTP virtual clocks based on one PTP physical
clock for timestamping.
This patch is to extend SO_TIMESTAMPING API to support
PHC (PTP Hardware Clock) binding by adding a new flag
SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are
in use, user space can configure to bind one for
timestamping, but PTP physical clock is not supported
and not needed to bind.
This patch is preparation for timestamp conversion from
raw timestamp to a specific PTP virtual clock time in
core net.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh
scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP sendmsg() path can be lockless, it is possible for another
thread to re-connect an change sk->sk_txhash under us.
There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
pair to document the race.
BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w
write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
sk_set_txhash include/net/sock.h:1937 [inline]
__ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
__ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
__sys_connect_file net/socket.c:1837 [inline]
__sys_connect+0x245/0x280 net/socket.c:1854
__do_sys_connect net/socket.c:1864 [inline]
__se_sys_connect net/socket.c:1861 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1861
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
skb_set_hash_from_sk include/net/sock.h:2211 [inline]
skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
__ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
__do_sys_sendmmsg net/socket.c:2519 [inline]
__se_sys_sendmmsg net/socket.c:2516 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0xbca3c43d -> 0xfdb309e0
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This exports SO_TIMESTAMP_* function for re-use by MPTCP.
Without this there is too much copy & paste needed to support
this from mptcp setsockopt path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the owing socket is shutting down - e.g. the sock reference
count already dropped to 0 and only sk_wmem_alloc is keeping
the sock alive, skb_orphan_partial() becomes a no-op.
When forwarding packets over veth with GRO enabled, the above
causes refcount errors.
This change addresses the issue with a plain skb_orphan() call
in the critical scenario.
Fixes: 9adc89af72 ("net: let skb_orphan_partial wake-up waiters.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).
The main changes are:
1) Add BPF static linker support for extern resolution of global, from Andrii.
2) Refine retval for bpf_get_task_stack helper, from Dave.
3) Add a bpf_snprintf helper, from Florent.
4) A bunch of miscellaneous improvements from many developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.
Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
This reverts commit f211ac1545.
We had similar attempt in the past, and we reverted it.
History:
64a146513f [NET]: Revert incorrect accept queue backlog changes.
8488df894d [NET]: Fix bugs in "Whether sock accept queue is full" checking
I am adding a fat comment so that future attempts will
be much harder.
Fixes: f211ac1545 ("net: correct sk_acceptq_is_full()")
Cc: iuyacan <yacanliu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the mentioned helper can end-up freeing the socket wmem
without waking-up any processes waiting for more write memory.
If the partially orphaned skb is attached to an UDP (or raw) socket,
the lack of wake-up can hang the user-space.
Even for TCP sockets not calling the sk destructor could have bad
effects on TSQ.
Address the issue using skb_orphan to release the sk wmem before
setting the new sock_efree destructor. Additionally bundle the
whole ownership update in a new helper, so that later other
potential users could avoid duplicate code.
v1 -> v2:
- use skb_orphan() instead of sort of open coding it (Eric)
- provide an helper for the ownership change (Eric)
Fixes: f6ba8d33cf ("netem: fix skb_orphan_partial()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "backlog" argument in listen() specifies
the maximom length of pending connections,
so the accept queue should be considered full
if there are exactly "backlog" elements.
Signed-off-by: liuyacan <yacanliu@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-02-16
The following pull-request contains BPF updates for your *net-next* tree.
There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:
[...]
lock_sock(sk);
err = tcp_zerocopy_receive(sk, &zc, &tss);
err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
&zc, &len, err);
release_sock(sk);
[...]
We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).
The main changes are:
1) Adds support of pointers to types with known size among global function
args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.
2) Add bpf_iter for task_vma which can be used to generate information similar
to /proc/pid/maps, from Song Liu.
3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
rewriting bind user ports from BPF side below the ip_unprivileged_port_start
range, both from Stanislav Fomichev.
4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
as well as per-cpu maps for the latter, from Alexei Starovoitov.
5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
for sleepable programs, both from KP Singh.
6) Extend verifier to enable variable offset read/write access to the BPF
program stack, from Andrei Matei.
7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
query device MTU from programs, from Jesper Dangaard Brouer.
8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
tracing programs, from Florent Revest.
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
otherwise too many passes are required, from Gary Lin.
10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
verification both related to zero-extension handling, from Ilya Leoshkevich.
11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.
12) Batch of AF_XDP selftest cleanups and small performance improvement
for libbpf's xsk map redirect for newer kernels, from Björn Töpel.
13) Follow-up BPF doc and verifier improvements around atomics with
BPF_FETCH, from Brendan Jackman.
14) Permit zero-sized data sections e.g. if ELF .rodata section contains
read-only data from local variables, from Yonghong Song.
15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a new config SOCK_RX_QUEUE_MAPPING to compile-in the socket
RX queue field and logic, instead of the XPS config.
This breaks dependency in XPS, and allows selecting it from non-XPS
use cases, as we do in the next patch.
In addition, use the new flag to wrap the logic in sk_rx_queue_get()
and protect access to the sk_rx_queue_mapping field, while keeping
the function exposed unconditionally, just like sk_rx_queue_set()
and sk_rx_queue_clear().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, a percpu_counter with the default batch size (2*nr_cpus) is
used to record the total # of active sockets per protocol. This means
sk_sockets_allocated_read_positive() could be off by +/-2*(nr_cpus^2).
This under/over-estimation could lead to wrong memory suppression
conditions in __sk_raise_mem_allocated().
Fix this by using a more reasonable fixed batch size of 16.
See related commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting") that addresses a similar issue.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20210202193408.1171634-1-weiwan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.
Without this patch:
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
With the patch applied:
0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern
Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
The previous commit 32efcc06d2 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:
a. During handshake of an active open, only counts the first
SYN timeout
b. After handshake of passive and active open, stop updating
after (roughly) TCP_RETRIES1 recurring RTOs
c. After the socket aborts, over count timeout_rehash by 1
This patch fixes this by checking the rehash result from sk_rethink_txhash.
Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-12-03
The main changes are:
1) Support BTF in kernel modules, from Andrii.
2) Introduce preferred busy-polling, from Björn.
3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.
4) Memcg-based memory accounting for bpf objects, from Roman.
5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
selftests/bpf: Fix invalid use of strncat in test_sockmap
libbpf: Use memcpy instead of strncpy to please GCC
selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
selftests/bpf: Add tp_btf CO-RE reloc test for modules
libbpf: Support attachment of BPF tracing programs to kernel modules
libbpf: Factor out low-level BPF program loading helper
bpf: Allow to specify kernel module BTFs when attaching BPF programs
bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
selftests/bpf: Add support for marking sub-tests as skipped
selftests/bpf: Add bpf_testmod kernel module for testing
libbpf: Add kernel module BTF support for CO-RE relocations
libbpf: Refactor CO-RE relocs to not assume a single BTF object
libbpf: Add internal helper to load BTF data by FD
bpf: Keep module's btf_data_size intact after load
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
bpf: Adds support for setting window clamp
samples/bpf: Fix spelling mistake "recieving" -> "receiving"
bpf: Fix cold build of test_progs-no_alu32
...
====================
Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This allows invoking an additional callback under the
socket spin lock.
Will be used by the next patches to avoid additional
spin lock contention.
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.
One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.
This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.
If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.
In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.
Example usage:
$ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
$ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.
Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
Fix build of net/core/stream.o when CONFIG_INET is not enabled.
Fixes these build errors (sample):
ld: net/core/stream.o: in function `sk_stream_write_space':
(.text+0x27e): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x29c): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x2ab): undefined reference to `tcp_stream_memory_free'
ld: net/core/stream.o: in function `sk_stream_wait_memory':
(.text+0x5a1): undefined reference to `tcp_stream_memory_free'
ld: (.text+0x5bf): undefined reference to `tcp_stream_memory_free'
Fixes: 1c5f2ced13 ("tcp: avoid indirect call to tcp_stream_memory_free()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20201118194438.674-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>