Commit Graph

6700 Commits

Author SHA1 Message Date
Michal Kubecek
6b08d6c146 ethtool: support for netlink notifications
Add infrastructure for ethtool netlink notifications. There is only one
multicast group "monitor" which is used to notify userspace about changes
and actions performed. Notification messages (types using suffix _NTF)
share the format with replies to GET requests.

Notifications are supposed to be broadcasted on every configuration change,
whether it is done using the netlink interface or ioctl one. Netlink SET
requests only trigger a notification if some data is actually changed.

To trigger an ethtool notification, both ethtool netlink and external code
use ethtool_notify() helper. This helper requires RTNL to be held and may
sleep. Handlers sending messages for specific notification message types
are registered in ethnl_notify_handlers array. As notifications can be
triggered from other code, ethnl_ok flag is used to prevent an attempt to
send notification before genetlink family is registered.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:01 -08:00
Michal Kubecek
10b518d4e6 ethtool: netlink bitset handling
The ethtool netlink code uses common framework for passing arbitrary
length bit sets to allow future extensions. A bitset can be a list (only
one bitmap) or can consist of value and mask pair (used e.g. when client
want to modify only some bits). A bitset can use one of two formats:
verbose (bit by bit) or compact.

Verbose format consists of bitset size (number of bits), list flag and
an array of bit nests, telling which bits are part of the list or which
bits are in the mask and which of them are to be set. In requests, bits
can be identified by index (position) or by name. In replies, kernel
provides both index and name. Verbose format is suitable for "one shot"
applications like standard ethtool command as it avoids the need to
either keep bit names (e.g. link modes) in sync with kernel or having to
add an extra roundtrip for string set request (e.g. for private flags).

Compact format uses one (list) or two (value/mask) arrays of 32-bit
words to store the bitmap(s). It is more suitable for long running
applications (ethtool in monitor mode or network management daemons)
which can retrieve the names once and then pass only compact bitmaps to
save space.

Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS
flag in request header tells kernel which format to use in reply.
Notifications always use compact format.

As some code uses arrays of unsigned long for internal representation and
some arrays of u32 (or even a single u32), two sets of parse/compose
helpers are introduced. To avoid code duplication, helpers for unsigned
long arrays are implemented as wrappers around helpers for u32 arrays.
There are two reasons for this choice: (1) u32 arrays are more frequent in
ethtool code and (2) unsigned long array can be always interpreted as an
u32 array on little endian 64-bit and all 32-bit architectures while we
would need special handling for odd number of u32 words in the opposite
direction.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:01 -08:00
Michal Kubecek
041b1c5d4a ethtool: helper functions for netlink interface
Add common request/reply header definition and helpers to parse request
header and fill reply header. Provide ethnl_update_* helpers to update
structure members from request attributes (to be used for *_SET requests).

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:01 -08:00
Michal Kubecek
2b4a8990b7 ethtool: introduce ethtool netlink interface
Basic genetlink and init infrastructure for the netlink interface, register
genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
make the build optional. Add initial overall interface description into
Documentation/networking/ethtool-netlink.rst, further patches will add more
detailed information.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:01 -08:00
David S. Miller
2bbc078f81 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolution looks as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c8b ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f80 ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b16

There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b16

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc40 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b16

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap         ((struct page *)VMEMMAP_START)

  >>>>>>> 7c8dce4b16

Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16b8 ("riscv: define vmemmap before pfn_to_page
calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  [...]

Let me know if there are any other issues.

Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
   to a provided BPF object file. This provides an alternative, simplified API
   compared to standard libbpf interaction. Also, add libbpf extern variable
   resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
   generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
   add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name,
   from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
   flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a
   bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
   audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under
   BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters
   to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors.
   Also change XDP sample programs to attach in driver mode by default,
   from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 14:20:10 -08:00
Andy Roulin
c1e4699026 bonding: rename AD_STATE_* to LACP_STATE_*
As the LACP actor/partner state is now part of the uapi, rename the
3ad state defines with LACP prefix. The LACP prefix is preferred over
BOND_3AD as the LACP standard moved to 802.1AX.

Fixes: 826f66b30c ("bonding: move 802.3ad port state flags to uapi")
Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26 13:09:37 -08:00
Florian Westphal
c14ceb0ec7 netfilter: nft_meta: add support for slave device ifindex matching
Allow to match on vrf slave ifindex or name.

In case there was no slave interface involved, store 0 in the
destination register just like existing iif/oif matching.

sdif(name) is restricted to the ipv4/ipv6 input and forward hooks,
as it depends on ip(6) stack parsing/storing info in skb->cb[].

Cc: Martin Willi <martin@strongswan.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-26 17:41:34 +01:00
Richard Cochran
b6fd7b9636 net: Introduce peer to peer one step PTP time stamping.
The 1588 standard defines one step operation for both Sync and
PDelay_Resp messages.  Up until now, hardware with P2P one step has
been rare, and kernel support was lacking.  This patch adds support of
the mode in anticipation of new hardware developments.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 19:51:34 -08:00
Martin Varghese
f66b53fdbb openvswitch: New MPLS actions for layer 2 tunnelling
The existing PUSH MPLS action inserts MPLS header between ethernet header
and the IP header. Though this behaviour is fine for L3 VPN where an IP
packet is encapsulated inside a MPLS tunnel, it does not suffice the L2
VPN (l2 tunnelling) requirements. In L2 VPN the MPLS header should
encapsulate the ethernet packet.

The new mpls action ADD_MPLS inserts MPLS header at the start of the
packet or at the start of the l3 header depending on the value of l3 tunnel
flag in the ADD_MPLS arguments.

POP_MPLS action is extended to support ethertype 0x6558.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:24:45 -08:00
Greg Kroah-Hartman
398d999f96 Merge 5.5-rc3 into staging-next
We need the staging fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-23 07:00:09 -05:00
David S. Miller
ac80010fc9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Mere overlapping changes in the conflicts here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-22 15:15:05 -08:00
Linus Torvalds
78bac77b52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

 7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

 8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
    RCU grace period. From Eric Dumazet.

 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

16) Reject invalid MTU values in stmmac, from Jose Abreu.

17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

20) Disable hardware GRO when XDP is attached to qede, frm Manish
    Chopra.

21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
  sfc: Include XDP packet headroom in buffer step size.
  sfc: fix channel allocation with brute force
  net: dst: Force 4-byte alignment of dst_metrics
  selftests: pmtu: fix init mtu value in description
  hv_netvsc: Fix unwanted rx_table reset
  net: phy: ensure that phy IDs are correctly typed
  mod_devicetable: fix PHY module format
  qede: Disable hardware gro when xdp prog is installed
  net: ena: fix issues in setting interrupt moderation params in ethtool
  net: ena: fix default tx interrupt moderation interval
  net/smc: unregister ib devices in reboot_event
  net: stmmac: platform: Fix MDIO init for platforms without PHY
  llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
  net: hisilicon: Fix a BUG trigered by wrong bytes_compl
  net: dsa: ksz: use common define for tag len
  s390/qeth: don't return -ENOTSUPP to userspace
  s390/qeth: fix promiscuous mode after reset
  s390/qeth: handle error due to unsupported transport mode
  cxgb4: fix refcount init for TC-MQPRIO offload
  tc-testing: initial tdc selftests for cls_u32
  ...
2019-12-22 09:54:33 -08:00
John Rutherford
e1b5e598e5 tipc: make legacy address flag readable over netlink
To enable iproute2/tipc to generate backwards compatible
printouts and validate command parameters for nodes using a
<z.c.n> node address, it needs to be able to read the legacy
address flag from the kernel.  The legacy address flag records
the way in which the node identity was originally specified.

The legacy address flag is requested by the netlink message
TIPC_NL_ADDR_LEGACY_GET.  If the flag is set the attribute
TIPC_NLA_NET_ADDR_LEGACY is set in the return message.

Signed-off-by: John Rutherford <john.rutherford@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 21:18:42 -08:00
Andrey Ignatov
7dd68b3279 bpf: Support replacing cgroup-bpf program in MULTI mode
The common use-case in production is to have multiple cgroup-bpf
programs per attach type that cover multiple use-cases. Such programs
are attached with BPF_F_ALLOW_MULTI and can be maintained by different
people.

Order of programs usually matters, for example imagine two egress
programs: the first one drops packets and the second one counts packets.
If they're swapped the result of counting program will be different.

It brings operational challenges with updating cgroup-bpf program(s)
attached with BPF_F_ALLOW_MULTI since there is no way to replace a
program:

* One way to update is to detach all programs first and then attach the
  new version(s) again in the right order. This introduces an
  interruption in the work a program is doing and may not be acceptable
  (e.g. if it's egress firewall);

* Another way is attach the new version of a program first and only then
  detach the old version. This introduces the time interval when two
  versions of same program are working, what may not be acceptable if a
  program is not idempotent. It also imposes additional burden on
  program developers to make sure that two versions of their program can
  co-exist.

Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH
command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI
flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag
and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to
replace

That way user can replace any program among those attached with
BPF_F_ALLOW_MULTI flag without the problems described above.

Details of the new API:

* If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid
  descriptor of BPF program, BPF_PROG_ATTACH will return corresponding
  error (EINVAL or EBADF).

* If replace_bpf_fd has valid descriptor of BPF program but such a
  program is not attached to specified cgroup, BPF_PROG_ATTACH will
  return ENOENT.

BPF_F_REPLACE is introduced to make the user intent clear, since
replace_bpf_fd alone can't be used for this (its default value, 0, is a
valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Narkyiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
2019-12-19 21:22:25 -08:00
Petr Machata
dcc68b4d80 net: sch_ets: Add a new Qdisc
Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
PRIO-like in how it is configured, meaning one needs to specify how many
bands there are, how many are strict and how many are dwrr, quanta for the
latter, and priomap.

The new Qdisc operates like the PRIO / DRR combo would when configured as
per the standard. The strict classes, if any, are tried for traffic first.
When there's no traffic in any of the strict queues, the ETS ones (if any)
are treated in the same way as in DRR.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18 13:32:29 -08:00
Arnd Bergmann
251ec1c159 y2038: sparc: remove use of struct timex
'struct timex' is one of the last users of 'struct timeval' and is
only referenced in one place in the kernel any more, to convert the
user space timex into the kernel-internal version on sparc64, with a
different tv_usec member type.

As a preparation for hiding the time_t definition and everything
using that in the kernel, change the implementation once more
to only convert the timeval member, and then enclose the
struct definition in an #ifdef.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:33 +01:00
Arnd Bergmann
4f9fbd893f y2038: rename itimerval to __kernel_old_itimerval
Take the renaming of timeval and timespec one level further,
also renaming itimerval to __kernel_old_itimerval, to avoid
namespace conflicts with the user-space structure that may
use 64-bit time_t members.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:33 +01:00
Arnd Bergmann
352c912b0a tsacct: add 64-bit btime field
As there is only a 32-bit ac_btime field in taskstat and
we should handle dates after the overflow, add a new field
with the same information but 64-bit width that can hold
a full time64_t.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:31 +01:00
Arnd Bergmann
2d602bf283 acct: stop using get_seconds()
In 'struct acct', 'struct acct_v3', and 'struct taskstats' we have
a 32-bit 'ac_btime' field containing an absolute time value, which
will overflow in year 2106.

There are two possible ways to deal with it:

a) let it overflow and have user space code deal with reconstructing
   the data based on the current time, or
b) truncate the times based on the range of the u32 type.

Neither of them solves the actual problem. Pick the second
one to best document what the issue is, and have someone
fix it in a future version.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:31 +01:00
Alexandre Belloni
3431ca4837 rtc: define RTC_VL_READ values
Currently, the meaning of the value returned by RTC_VL_READ is undocumented
and left to the driver implementation. In order to get more meaningful
values, define a set of values to use as to make clear to userspace what is
the status of the various voltages feeding the RTC.

Link: https://lore.kernel.org/r/20191214220259.621996-2-alexandre.belloni@bootlin.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2019-12-18 10:37:18 +01:00
Andrew F. Davis
b3b4346544 dma-buf: heaps: Use _IOCTL_ for userspace IOCTL identifier
This is more consistent with the DMA and DRM frameworks convention. This
patch is only a name change, no logic is changed.

Signed-off-by: Andrew F. Davis <afd@ti.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20191216133405.1001-2-afd@ti.com
2019-12-17 21:37:40 +05:30
Daniel Vetter
6c56e8adc0 Merge tag 'drm-misc-next-2019-12-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.6:

UAPI Changes:
- Add support for DMA-BUF HEAPS.

Cross-subsystem Changes:
- mipi dsi definition updates, pulled into drm-intel as well.
- Add lockdep annotations for dma_resv vs mmap_sem and fs_reclaim.
- Remove support for dma-buf kmap/kunmap.
- Constify fb_ops in all fbdev drivers, including drm drivers and drm-core, and media as well.

Core Changes:
- Small cleanups to ttm.
- Fix SCDC definition.
- Assorted cleanups to core.
- Add todo to remove load/unload hooks, and use generic fbdev emulation.
- Assorted documentation updates.
- Use blocking ww lock in ttm fault handler.
- Remove drm_fb_helper_fbdev_setup/teardown.
- Warning fixes with W=1 for atomic.
- Use drm_debug_enabled() instead of drm_debug flag testing in various drivers.
- Fallback to nontiled mode in fbdev emulation when not all tiles are present. (Later on reverted)
- Various kconfig indentation fixes in core and drivers.
- Fix freeing transactions in dp-mst correctly.
- Sean Paul is steping down as core maintainer. :-(
- Add lockdep annotations for atomic locks vs dma-resv.
- Prevent use-after-free for a bad job in drm_scheduler.
- Fill out all block sizes in the P01x and P210 definitions.
- Avoid division by zero in drm/rect, and fix bounds.
- Add drm/rect selftests.
- Add aspect ratio and alternate clocks for HDMI 4k modes.
- Add todo for drm_framebuffer_funcs and fb_create cleanup.
- Drop DRM_AUTH for prime import/export ioctls.
- Clear DP-MST payload id tables downstream when initializating.
- Fix for DSC throughput definition.
- Add extra FEC definitions.
- Fix fake offset in drm_gem_object_funs.mmap.
- Stop using encoder->bridge in core directly
- Handle bridge chaining slightly better.
- Add backlight support to drm/panel, and use it in many panel drivers.
- Increase max number of y420 modes from 128 to 256, as preparation to add the new modes.

Driver Changes:
- Small fixes all over.
- Fix documentation in vkms.
- Fix mmap_sem vs dma_resv in nouveau.
- Small cleanup in komeda.
- Add page flip support in gma500 for psb/cdv.
- Add ddc symlink in the connector sysfs directory for many drivers.
- Add support for analogic an6345, and fix small bugs in it.
- Add atomic modesetting support to ast.
- Fix radeon fault handler VMA race.
- Switch udl to use generic shmem helpers.
- Unconditional vblank handling for mcde.
- Miscellaneous fixes to mcde.
- Tweak debug output from komeda using debugfs.
- Add gamma and color transform support to komeda for DOU-IPS.
- Add support for sony acx424AKP panel.
- Various small cleanups to gma500.
- Use generic fbdev emulation in udl, and replace udl_framebuffer with generic implementation.
- Add support for Logic PD Type 28 panel.
- Use drm_panel_* wrapper functions in exynos/tegra/msm.
- Add devicetree bindings for generic DSI panels.
- Don't include drm_pci.h directly in many drivers.
- Add support for begin/end_cpu_access in udmabuf.
- Stop using drm_get_pci_dev in gma500 and mga200.
- Fixes to UDL damage handling, and use dma_buf_begin/end_cpu_access.
- Add devfreq thermal support to panfrost.
- Fix hotplug with daisy chained monitors by removing VCPI when disabling topology manager.
- meson: Add support for OSD1 plane AFBC commit.
- Stop displaying garbage when toggling ast primary plane on/off.
- More cleanups and fixes to UDL.
- Add D32 suport to komeda.
- Remove globle copy of drm_dev in gma500.
- Add support for Boe Himax8279d MIPI-DSI LCD panel.
- Add support for ingenic JZ4770 panel.
- Small null pointer deference fix in ingenic.
- Remove support for the special tfp420 driver, as there is a generic way to do it.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ba73535a-9334-5302-2e1f-5208bd7390bd@linux.intel.com
2019-12-17 13:57:54 +01:00
Josh Soref
a2ec8b5706 wireguard: global: fix spelling mistakes in comments
This fixes two spelling errors in source code comments.

Signed-off-by: Josh Soref <jsoref@gmail.com>
[Jason: rewrote commit message]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16 19:22:22 -08:00
Greg Kroah-Hartman
b3bb164aa5 Merge 5.5-rc2 into staging-next
We want the staging driver fixes in here, and this resolves merge issues
with the isdn code that was pointed out in linux-next

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-16 09:06:50 +01:00
Andrii Nakryiko
166750bc1d libbpf: Support libbpf-provided extern variables
Add support for extern variables, provided to BPF program by libbpf. Currently
the following extern variables are supported:
  - LINUX_KERNEL_VERSION; version of a kernel in which BPF program is
    executing, follows KERNEL_VERSION() macro convention, can be 4- and 8-byte
    long;
  - CONFIG_xxx values; a set of values of actual kernel config. Tristate,
    boolean, strings, and integer values are supported.

Set of possible values is determined by declared type of extern variable.
Supported types of variables are:
- Tristate values. Are represented as `enum libbpf_tristate`. Accepted values
  are **strictly** 'y', 'n', or 'm', which are represented as TRI_YES, TRI_NO,
  or TRI_MODULE, respectively.
- Boolean values. Are represented as bool (_Bool) types. Accepted values are
  'y' and 'n' only, turning into true/false values, respectively.
- Single-character values. Can be used both as a substritute for
  bool/tristate, or as a small-range integer:
  - 'y'/'n'/'m' are represented as is, as characters 'y', 'n', or 'm';
  - integers in a range [-128, 127] or [0, 255] (depending on signedness of
    char in target architecture) are recognized and represented with
    respective values of char type.
- Strings. String values are declared as fixed-length char arrays. String of
  up to that length will be accepted and put in first N bytes of char array,
  with the rest of bytes zeroed out. If config string value is longer than
  space alloted, it will be truncated and warning message emitted. Char array
  is always zero terminated. String literals in config have to be enclosed in
  double quotes, just like C-style string literals.
- Integers. 8-, 16-, 32-, and 64-bit integers are supported, both signed and
  unsigned variants. Libbpf enforces parsed config value to be in the
  supported range of corresponding integer type. Integers values in config can
  be:
  - decimal integers, with optional + and - signs;
  - hexadecimal integers, prefixed with 0x or 0X;
  - octal integers, starting with 0.

Config file itself is searched in /boot/config-$(uname -r) location with
fallback to /proc/config.gz, unless config path is specified explicitly
through bpf_object_open_opts' kernel_config_path option. Both gzipped and
plain text formats are supported. Libbpf adds explicit dependency on zlib
because of this, but this shouldn't be a problem, given libelf already depends
on zlib.

All detected extern variables, are put into a separate .extern internal map.
It, similarly to .rodata map, is marked as read-only from BPF program side, as
well as is frozen on load. This allows BPF verifier to track extern values as
constants and perform enhanced branch prediction and dead code elimination.
This can be relied upon for doing kernel version/feature detection and using
potentially unsupported field relocations or BPF helpers in a CO-RE-based BPF
program, while still having a single version of BPF program running on old and
new kernels. Selftests are validating this explicitly for unexisting BPF
helper.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191214014710.3449601-3-andriin@fb.com
2019-12-15 16:41:12 -08:00
Vivien Didelot
de1799667b net: bridge: add STP xstats
This adds rx_bpdu, tx_bpdu, rx_tcn, tx_tcn, transition_blk,
transition_fwd xstats counters to the bridge ports copied over via
netlink, providing useful information for STP.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-14 20:02:36 -08:00
Andy Roulin
826f66b30c bonding: move 802.3ad port state flags to uapi
The bond slave actor/partner operating state is exported as
bitfield to userspace, which lacks a way to interpret it, e.g.,
iproute2 only prints the state as a number:

ad_actor_oper_port_state 15

For userspace to interpret the bitfield, the bitfield definitions
should be part of the uapi. The bitfield itself is defined in the
802.3ad standard.

This commit moves the 802.3ad bitfield definitions to uapi.

Related iproute2 patches, soon to be posted upstream, use the new uapi
headers to pretty-print bond slave state, e.g., with ip -d link show

ad_actor_oper_port_state_str <active,short_timeout,aggregating,in_sync>

Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-14 13:07:44 -08:00
Arnd Bergmann
f729a1b0f8 Input: input_event - fix struct padding on sparc64
Going through all uses of timeval, I noticed that we screwed up
input_event in the previous attempts to fix it:

The time fields now match between kernel and user space, but all following
fields are in the wrong place.

Add the required padding that is implied by the glibc timeval definition
to fix the layout, and use a struct initializer to avoid leaking kernel
stack data.

Fixes: 141e5dcaa7 ("Input: input_event - fix the CONFIG_SPARC64 mixup")
Fixes: 2e746942eb ("Input: input_event - provide override for sparc64")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20191213204936.3643476-2-arnd@arndb.de
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-12-13 15:00:36 -08:00
Linus Torvalds
5bd831a469 Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:

 - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
   don't sever if the result is < 0.

   This is mostly for linked timeouts, where if we ask for a pure
   timeout we always get -ETIME. This makes links useless for that case,
   hence allow a case where it works.

 - Five minor optimizations to fix and improve cases that regressed
   since v5.4.

 - An SQTHREAD locking fix.

 - A sendmsg/recvmsg iov assignment fix.

 - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
   subsequently ensuring that works for io_uring.

 - Fix a case where for an invalid opcode we might return -EBADF instead
   of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.

* tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
  io_uring: ensure we return -EINVAL on unknown opcode
  io_uring: add sockets to list of files that support non-blocking issue
  net: make socket read/write_iter() honor IOCB_NOWAIT
  io_uring: only hash regular files for async work execution
  io_uring: run next sqe inline if possible
  io_uring: don't dynamically allocate poll data
  io_uring: deferred send/recvmsg should assign iov
  io_uring: sqthread should grab ctx->uring_lock for submissions
  io-wq: briefly spin for new work after finishing work
  io-wq: remove worker->wait waitqueue
  io_uring: allow unbreakable links
2019-12-13 14:24:54 -08:00
Pi-Hsun Shih
6989310f5d wireless: Use offsetof instead of custom macro.
Use offsetof to calculate offset of a field to take advantage of
compiler built-in version when possible, and avoid UBSAN warning when
compiling with Clang:

==================================================================
UBSAN: Undefined behaviour in net/wireless/wext-core.c:525:14
member access within null pointer of type 'struct iw_point'
CPU: 3 PID: 165 Comm: kworker/u16:3 Tainted: G S      W         4.19.23 #43
Workqueue: cfg80211 __cfg80211_scan_done [cfg80211]
Call trace:
 dump_backtrace+0x0/0x194
 show_stack+0x20/0x2c
 __dump_stack+0x20/0x28
 dump_stack+0x70/0x94
 ubsan_epilogue+0x14/0x44
 ubsan_type_mismatch_common+0xf4/0xfc
 __ubsan_handle_type_mismatch_v1+0x34/0x54
 wireless_send_event+0x3cc/0x470
 ___cfg80211_scan_done+0x13c/0x220 [cfg80211]
 __cfg80211_scan_done+0x28/0x34 [cfg80211]
 process_one_work+0x170/0x35c
 worker_thread+0x254/0x380
 kthread+0x13c/0x158
 ret_from_fork+0x10/0x18
===================================================================

Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20191204081307.138765-1-pihsun@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-12-13 10:45:35 +01:00
Toke Høiland-Jørgensen
911bde0fe5 mac80211: Turn AQL into an NL80211_EXT_FEATURE
Instead of just having an airtime flag in debugfs, turn AQL into a proper
NL80211_EXT_FEATURE, so drivers can turn it on when they are ready, and so
we also expose the presence of the feature to userspace.

This also has the effect of flipping the default, so drivers have to opt in
to using AQL instead of getting it by default with TXQs. To keep
functionality the same as pre-patch, we set this feature for ath10k (which
is where it is needed the most).

While we're at it, split out the debugfs interface so AQL gets its own
per-station debugfs file instead of using the 'airtime' file.

[Johannes:]
This effectively disables AQL for iwlwifi, where it fixes a number of
issues:
 * TSO in iwlwifi is causing underflows and associated warnings in AQL
 * HE (802.11ax) rates aren't reported properly so at HE rates, AQL could
   never have a valid estimate (it'd use 6 Mbps instead of up to 2400!)

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191212111437.224294-1-toke@redhat.com
Fixes: 3ace10f5b5 ("mac80211: Implement Airtime-based Queue Limit (AQL)")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-12-13 10:34:04 +01:00
Michal Kubecek
428c122f5f ethtool: provide link mode names as a string set
Unlike e.g. netdev features, the ethtool ioctl interface requires link mode
table to be in sync between kernel and userspace for userspace to be able
to display and set all link modes supported by kernel. The way arbitrary
length bitsets are implemented in netlink interface, this will be no longer
needed.

To allow userspace to access all link modes running kernel supports, add
table of ethernet link mode names and make it available as a string set to
userspace GET_STRSET requests. Add build time check to make sure names
are defined for all modes declared in enum ethtool_link_mode_bit_indices.

Once the string set is available, make it also accessible via ioctl.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-12 17:07:05 -08:00
Michal Kubecek
f74877a545 rtnetlink: provide permanent hardware address in RTM_NEWLINK
Permanent hardware address of a network device was traditionally provided
via ethtool ioctl interface but as Jiri Pirko pointed out in a review of
ethtool netlink interface, rtnetlink is much more suitable for it so let's
add it to the RTM_NEWLINK message.

Add IFLA_PERM_ADDRESS attribute to RTM_NEWLINK messages unless the
permanent address is all zeros (i.e. device driver did not fill it). As
permanent address is not modifiable, reject userspace requests containing
IFLA_PERM_ADDRESS attribute.

Note: we already provide permanent hardware address for bond slaves;
unfortunately we cannot drop that attribute for backward compatibility
reasons.

v5 -> v6: only add the attribute if permanent address is not zero

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-12 17:07:05 -08:00
Jens Axboe
9e3aa61ae3 io_uring: ensure we return -EINVAL on unknown opcode
If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-11 16:02:32 -07:00
Stefano Garzarella
ef343b35d4 vsock: add VMADDR_CID_LOCAL definition
The VMADDR_CID_RESERVED (1) was used by VMCI, but now it is not
used anymore, so we can reuse it for local communication
(loopback) adding the new well-know CID: VMADDR_CID_LOCAL.

Cc: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-11 15:01:23 -08:00
Daniel Borkmann
bae141f54b bpf: Emit audit messages upon successful prog load and unload
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  ...
  ----
  time->Wed Nov 27 16:04:13 2019
  type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf"
  type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321   \
    success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477    \
    pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001    \
    egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf"                \
    exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf"                  \
    subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD
  ----
  time->Wed Nov 27 16:04:13 2019
  type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD
  ...

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org
2019-12-11 17:41:09 +01:00
Marcel Holtmann
2f48865db3 HID: hidraw: add support uniq ioctl
Add support for reading out the uniq information from the underlying HID
device. This might be the iSerialNumber in case of USB or the BD_ADDR in
case of Bluetooth.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-12-11 15:31:52 +01:00
Arnd Bergmann
f59aba2f75 isdn: capi: dead code removal
The staging isdn drivers are gone, and CONFIG_BT_CMTP is now
the only user. This means a lot of the code in the subsystem
has no remaining callers and can be removed.

Change the capi user space front-end to be part of kernelcapi,
and the combined module to only be compiled if BT_CMTP is
also enabled, then remove the interfaces that have no remaining
callers.

As the notifier list and the capi_drivers list have no callers
outside of kcapi.c, the implementation gets much simpler.

Some definitions from the include/linux/*.h headers are only
needed internally and are moved to kcapi.h.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20191210210455.3475361-2-arnd@arndb.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-11 09:12:38 +01:00
Arnd Bergmann
f10870b05d staging: remove isdn capi drivers
As described in drivers/staging/isdn/TODO, the drivers are all
assumed to be unmaintained and unused now, with gigaset being the
last one to stop being maintained after Paul Bolle lost access
to an ISDN network.

The CAPI subsystem remains for now, as it is still required by
bluetooth/cmtp.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20191210210455.3475361-1-arnd@arndb.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-11 09:11:29 +01:00
Andrew F. Davis
c02a81fba7 dma-buf: Add dma-buf heaps framework
This framework allows a unified userspace interface for dma-buf
exporters, allowing userland to allocate specific types of memory
for use in dma-buf sharing.

Each heap is given its own device node, which a user can allocate
a dma-buf fd from using the DMA_HEAP_IOC_ALLOC.

This code is an evoluiton of the Android ION implementation,
and a big thanks is due to its authors/maintainers over time
for their effort:
  Rebecca Schultz Zavin, Colin Cross, Benjamin Gaignard,
  Laura Abbott, and many other contributors!

Cc: Laura Abbott <labbott@redhat.com>
Cc: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Pratik Patel <pratikp@codeaurora.org>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: Vincent Donnefort <Vincent.Donnefort@arm.com>
Cc: Sudipto Paul <Sudipto.Paul@arm.com>
Cc: Andrew F. Davis <afd@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chenbo Feng <fengc@google.com>
Cc: Alistair Strachan <astrachan@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Dave Airlie <airlied@gmail.com>
Cc: dri-devel@lists.freedesktop.org
Reviewed-by: Brian Starkey <brian.starkey@arm.com>
Acked-by: Sandeep Patil <sspatil@android.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20191203172641.66642-2-john.stultz@linaro.org
2019-12-11 11:13:33 +05:30
Jens Axboe
4e88d6e779 io_uring: allow unbreakable links
Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:06 -07:00
David S. Miller
7da538c1e1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Wait for rcu grace period after releasing netns in ctnetlink,
   from Florian Westphal.

2) Incorrect command type in flowtable offload ndo invocation,
   from wenxu.

3) Incorrect callback type in flowtable offload flow tuple
   updates, also from wenxu.

4) Fix compile warning on flowtable offload infrastructure due to
   possible reference to uninitialized variable, from Nathan Chancellor.

5) Do not inline nf_ct_resolve_clash(), this is called from slow
   path / stress situations. From Florian Westphal.

6) Missing IPv6 flow selector description in flowtable offload.

7) Missing check for NETDEV_UNREGISTER in nf_tables offload
   infrastructure, from wenxu.

8) Update NAT selftest to use randomized netns names, from
   Florian Westphal.

9) Restore nfqueue bridge support, from Marco Oliverio.

10) Compilation warning in SCTP_CHUNKMAP_*() on xt_sctp header.
    From Phil Sutter.

11) Fix bogus lookup/get match for non-anonymous rbtree sets.

12) Missing netlink validation for NFT_SET_ELEM_INTERVAL_END
    elements.

13) Missing netlink validation for NFT_DATA_VALUE after
    nft_data_init().

14) If rule specifies no actions, offload infrastructure returns
    EOPNOTSUPP.

15) Module refcount leak in object updates.

16) Missing sanitization for ARP traffic from br_netfilter, from
    Eric Dumazet.

17) Compilation breakage on big-endian due to incorrect memcpy()
    size in the flowtable offload infrastructure.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 14:03:33 -08:00
Phil Sutter
164166558a netfilter: uapi: Avoid undefined left-shift in xt_sctp.h
With 'bytes(__u32)' being 32, a left-shift of 31 may happen which is
undefined for the signed 32-bit value 1. Avoid this by declaring 1 as
unsigned.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-09 13:02:07 +01:00
Sabrina Dubroca
e27cca96cd xfrm: add espintcp (RFC 8229)
TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.

Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.

Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-12-09 09:59:07 +01:00
Jason A. Donenfeld
e7096c131e net: WireGuard secure network tunnel
WireGuard is a layer 3 secure networking tunnel made specifically for
the kernel, that aims to be much simpler and easier to audit than IPsec.
Extensive documentation and description of the protocol and
considerations, along with formal proofs of the cryptography, are
available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

This commit implements WireGuard as a simple network device driver,
accessible in the usual RTNL way used by virtual network drivers. It
makes use of the udp_tunnel APIs, GRO, GSO, NAPI, and the usual set of
networking subsystem APIs. It has a somewhat novel multicore queueing
system designed for maximum throughput and minimal latency of encryption
operations, but it is implemented modestly using workqueues and NAPI.
Configuration is done via generic Netlink, and following a review from
the Netlink maintainer a year ago, several high profile userspace tools
have already implemented the API.

This commit also comes with several different tests, both in-kernel
tests and out-of-kernel tests based on network namespaces, taking profit
of the fact that sockets used by WireGuard intentionally stay in the
namespace the WireGuard interface was originally created, exactly like
the semantics of userspace tun devices. See wireguard.com/netns/ for
pictures and examples.

The source code is fairly short, but rather than combining everything
into a single file, WireGuard is developed as cleanly separable files,
making auditing and comprehension easier. Things are laid out as
follows:

  * noise.[ch], cookie.[ch], messages.h: These implement the bulk of the
    cryptographic aspects of the protocol, and are mostly data-only in
    nature, taking in buffers of bytes and spitting out buffers of
    bytes. They also handle reference counting for their various shared
    pieces of data, like keys and key lists.

  * ratelimiter.[ch]: Used as an integral part of cookie.[ch] for
    ratelimiting certain types of cryptographic operations in accordance
    with particular WireGuard semantics.

  * allowedips.[ch], peerlookup.[ch]: The main lookup structures of
    WireGuard, the former being trie-like with particular semantics, an
    integral part of the design of the protocol, and the latter just
    being nice helper functions around the various hashtables we use.

  * device.[ch]: Implementation of functions for the netdevice and for
    rtnl, responsible for maintaining the life of a given interface and
    wiring it up to the rest of WireGuard.

  * peer.[ch]: Each interface has a list of peers, with helper functions
    available here for creation, destruction, and reference counting.

  * socket.[ch]: Implementation of functions related to udp_socket and
    the general set of kernel socket APIs, for sending and receiving
    ciphertext UDP packets, and taking care of WireGuard-specific sticky
    socket routing semantics for the automatic roaming.

  * netlink.[ch]: Userspace API entry point for configuring WireGuard
    peers and devices. The API has been implemented by several userspace
    tools and network management utility, and the WireGuard project
    distributes the basic wg(8) tool.

  * queueing.[ch]: Shared function on the rx and tx path for handling
    the various queues used in the multicore algorithms.

  * send.c: Handles encrypting outgoing packets in parallel on
    multiple cores, before sending them in order on a single core, via
    workqueues and ring buffers. Also handles sending handshake and cookie
    messages as part of the protocol, in parallel.

  * receive.c: Handles decrypting incoming packets in parallel on
    multiple cores, before passing them off in order to be ingested via
    the rest of the networking subsystem with GRO via the typical NAPI
    poll function. Also handles receiving handshake and cookie messages
    as part of the protocol, in parallel.

  * timers.[ch]: Uses the timer wheel to implement protocol particular
    event timeouts, and gives a set of very simple event-driven entry
    point functions for callers.

  * main.c, version.h: Initialization and deinitialization of the module.

  * selftest/*.h: Runtime unit tests for some of the most security
    sensitive functions.

  * tools/testing/selftests/wireguard/netns.sh: Aforementioned testing
    script using network namespaces.

This commit aims to be as self-contained as possible, implementing
WireGuard as a standalone module not needing much special handling or
coordination from the network subsystem. I expect for future
optimizations to the network stack to positively improve WireGuard, and
vice-versa, but for the time being, this exists as intentionally
standalone.

We introduce a menu option for CONFIG_WIREGUARD, as well as providing a
verbose debug log and self-tests via CONFIG_WIREGUARD_DEBUG.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: David Miller <davem@davemloft.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-08 17:48:42 -08:00
Linus Torvalds
737214515d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull more input updates from Dmitry Torokhov:

 - fixups for Synaptics RMI4 driver

 - a quirk for Goodinx touchscreen on Teclast tablet

 - a new keycode definition for activating privacy screen feature found
   on a few "enterprise" laptops

 - updates to snvs_pwrkey driver

 - polling uinput device for writing (which is always allowed) now works

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
  Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
  Input: goodix - add upside-down quirk for Teclast X89 tablet
  Input: add privacy screen toggle keycode
  Input: uinput - fix returning EPOLLOUT from uinput_poll
  Input: snvs_pwrkey - remove gratuitous NULL initializers
  Input: snvs_pwrkey - send key events for i.MX6 S, DL and Q
2019-12-07 18:33:01 -08:00
Linus Torvalds
9feb1af97e Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block
Pull more block and io_uring updates from Jens Axboe:
 "I wasn't expecting this to be so big, and if I was, I would have used
  separate branches for this. Going forward I'll be doing separate
  branches for the current tree, just like for the next kernel version
  tree. In any case, this contains:

   - Series from Christoph that fixes an inherent race condition with
     zoned devices and revalidation.

   - null_blk zone size fix (Damien)

   - Fix for a regression in this merge window that caused busy spins by
     sending empty disk uevents (Eric)

   - Fix for a regression in this merge window for bfq stats (Hou)

   - Fix for io_uring creds allocation failure handling (me)

   - io_uring -ERESTARTSYS send/recvmsg fix (me)

   - Series that fixes the need for applications to retain state across
     async request punts for io_uring. This one is a bit larger than I
     would have hoped, but I think it's important we get this fixed for
     5.5.

   - connect(2) improvement for io_uring, handling EINPROGRESS instead
     of having applications needing to poll for it (me)

   - Have io_uring use a hash for poll requests instead of an rbtree.
     This turned out to work much better in practice, so I think we
     should make the switch now. For some workloads, even with a fair
     amount of cancellations, the insertion sort is just too expensive.
     (me)

   - Various little io_uring fixes (me, Jackie, Pavel, LimingWu)

   - Fix for brd unaligned IO, and a warning for the future (Ming)

   - Fix for a bio integrity data leak (Justin)

   - bvec_iter_advance() improvement (Pavel)

   - Xen blkback page unmap fix (SeongJae)

  The major items in here are all well tested, and on the liburing side
  we continue to add regression and feature test cases. We're up to 50
  topic cases now, each with anywhere from 1 to more than 10 cases in
  each"

* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
  block: fix memleak of bio integrity data
  io_uring: fix a typo in a comment
  bfq-iosched: Ensure bio->bi_blkg is valid before using it
  io_uring: hook all linked requests via link_list
  io_uring: fix error handling in io_queue_link_head
  io_uring: use hash table for poll command lookups
  io-wq: clear node->next on list deletion
  io_uring: ensure deferred timeouts copy necessary data
  io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
  null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
  brd: warn on un-aligned buffer
  brd: remove max_hw_sectors queue limit
  xen/blkback: Avoid unmapping unmapped grant pages
  io_uring: handle connect -EINPROGRESS like -EAGAIN
  block: set the zone size in blk_revalidate_disk_zones atomically
  block: don't handle bio based drivers in blk_revalidate_disk_zones
  block: allocate the zone bitmaps lazily
  block: replace seq_zones_bitmap with conv_zones_bitmap
  block: simplify blkdev_nr_zones
  block: remove the empty line at the end of blk-zoned.c
  ...
2019-12-06 10:08:59 -08:00
Linus Torvalds
5ecc9d15f7 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "Most of the rest of MM and various other things. Some Kconfig rework
  still awaits merges of dependent trees from linux-next.

  Subsystems affected by this patch series: mm/hotfixes, mm/memcg,
  mm/vmstat, mm/thp, procfs, sysctl, misc, notifiers, core-kernel,
  bitops, lib, checkpatch, epoll, binfmt, init, rapidio, uaccess, kcov,
  ubsan, ipc, bitmap, mm/pagemap"

* akpm: (86 commits)
  mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h
  um: add support for folded p4d page tables
  um: remove unused pxx_offset_proc() and addr_pte() functions
  sparc32: use pgtable-nopud instead of 4level-fixup
  parisc/hugetlb: use pgtable-nopXd instead of 4level-fixup
  parisc: use pgtable-nopXd instead of 4level-fixup
  nds32: use pgtable-nopmd instead of 4level-fixup
  microblaze: use pgtable-nopmd instead of 4level-fixup
  m68k: mm: use pgtable-nopXd instead of 4level-fixup
  m68k: nommu: use pgtable-nopud instead of 4level-fixup
  c6x: use pgtable-nopud instead of 4level-fixup
  arm: nommu: use pgtable-nopud instead of 4level-fixup
  alpha: use pgtable-nopud instead of 4level-fixup
  gpio: pca953x: tighten up indentation
  gpio: pca953x: convert to use bitmap API
  gpio: pca953x: use input from regs structure in pca953x_irq_pending()
  gpio: pca953x: remove redundant variable and check in IRQ handler
  lib/bitmap: introduce bitmap_replace() helper
  lib/test_bitmap: fix comment about this file
  lib/test_bitmap: move exp1 and exp2 upper for others to use
  ...
2019-12-05 09:46:26 -08:00
Andrey Konovalov
eec028c938 kcov: remote coverage support
Patch series " kcov: collect coverage from usb and vhost", v3.

This patchset extends kcov to allow collecting coverage from backgound
kernel threads.  This extension requires custom annotations for each of
the places where coverage collection is desired.  This patchset
implements this for hub events in the USB subsystem and for vhost
workers.  See the first patch description for details about the kcov
extension.  The other two patches apply this kcov extension to USB and
vhost.

Examples of other subsystems that might potentially benefit from this
when custom annotations are added (the list is based on
process_one_work() callers for bugs recently reported by syzbot):

1. fs: writeback wb_workfn() worker,
2. net: addrconf_dad_work()/addrconf_verify_work() workers,
3. net: neigh_periodic_work() worker,
4. net/p9: p9_write_work()/p9_read_work() workers,
5. block: blk_mq_run_work_fn() worker.

These patches have been used to enable coverage-guided USB fuzzing with
syzkaller for the last few years, see the details here:

  https://github.com/google/syzkaller/blob/master/docs/linux/external_fuzzing_usb.md

This patchset has been pushed to the public Linux kernel Gerrit
instance:

  https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/1524

This patch (of 3):

Add background thread coverage collection ability to kcov.

With KCOV_ENABLE coverage is collected only for syscalls that are issued
from the current process.  With KCOV_REMOTE_ENABLE it's possible to
collect coverage for arbitrary parts of the kernel code, provided that
those parts are annotated with kcov_remote_start()/kcov_remote_stop().

This allows to collect coverage from two types of kernel background
threads: the global ones, that are spawned during kernel boot in a
limited number of instances (e.g.  one USB hub_event() worker thread is
spawned per USB HCD); and the local ones, that are spawned when a user
interacts with some kernel interface (e.g.  vhost workers).

To enable collecting coverage from a global background thread, a unique
global handle must be assigned and passed to the corresponding
kcov_remote_start() call.  Then a userspace process can pass a list of
such handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field
of the kcov_remote_arg struct.  This will attach the used kcov device to
the code sections, that are referenced by those handles.

Since there might be many local background threads spawned from
different userspace processes, we can't use a single global handle per
annotation.  Instead, the userspace process passes a non-zero handle
through the common_handle field of the kcov_remote_arg struct.  This
common handle gets saved to the kcov_handle field in the current
task_struct and needs to be passed to the newly spawned threads via
custom annotations.  Those threads should in turn be annotated with
kcov_remote_start()/kcov_remote_stop().

Internally kcov stores handles as u64 integers.  The top byte of a
handle is used to denote the id of a subsystem that this handle belongs
to, and the lower 4 bytes are used to denote the id of a thread instance
within that subsystem.  A reserved value 0 is used as a subsystem id for
common handles as they don't belong to a particular subsystem.  The
bytes 4-7 are currently reserved and must be zero.  In the future the
number of bytes used for the subsystem or handle ids might be increased.

When a particular userspace process collects coverage by via a common
handle, kcov will collect coverage for each code section that is
annotated to use the common handle obtained as kcov_handle from the
current task_struct.  However non common handles allow to collect
coverage selectively from different subsystems.

Link: http://lkml.kernel.org/r/e90e315426a384207edbec1d6aa89e43008e4caf.1572366574.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:14 -08:00
Masahiro Yamada
1a18374fc3 linux/scc.h: make uapi linux/scc.h self-contained
Userspace cannot compile <linux/scc.h>

    CC      usr/include/linux/scc.h.s
  In file included from <command-line>:32:0:
  usr/include/linux/scc.h:20:20: error: `SIOCDEVPRIVATE' undeclared here (not in a function)
    SIOCSCCRESERVED = SIOCDEVPRIVATE,
                      ^~~~~~~~~~~~~~

Include <linux/sockios.h> to make it self-contained, and add it to the
compile-test coverage.

Link: http://lkml.kernel.org/r/20191108055809.26969-1-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00