Changes in 5.4.8
Revert "MIPS: futex: Restore \n after sync instructions"
Revert "MIPS: futex: Emit Loongson3 sync workarounds within asm"
scsi: lpfc: Fix spinlock_irq issues in lpfc_els_flush_cmd()
scsi: lpfc: Fix discovery failures when target device connectivity bounces
scsi: mpt3sas: Fix clear pending bit in ioctl status
scsi: lpfc: Fix locking on mailbox command completion
scsi: mpt3sas: Reject NVMe Encap cmnds to unsupported HBA
gpio: mxc: Only get the second IRQ when there is more than one IRQ
scsi: lpfc: Fix list corruption in lpfc_sli_get_iocbq
Input: atmel_mxt_ts - disable IRQ across suspend
f2fs: fix to update time in lazytime mode
powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set}
tools/power/x86/intel-speed-select: Remove warning for unused result
platform/x86: peaq-wmi: switch to using polled mode of input devices
iommu: rockchip: Free domain on .domain_free
iommu/tegra-smmu: Fix page tables in > 4 GiB memory
dmaengine: xilinx_dma: Clear desc_pendingcount in xilinx_dma_reset
scsi: target: compare full CHAP_A Algorithm strings
scsi: lpfc: Fix hardlockup in lpfc_abort_handler
scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
scsi: csiostor: Don't enable IRQs too early
scsi: hisi_sas: Replace in_softirq() check in hisi_sas_task_exec()
scsi: hisi_sas: Delete the debugfs folder of hisi_sas when the probe fails
powerpc/pseries: Mark accumulate_stolen_time() as notrace
powerpc/pseries: Don't fail hash page table insert for bolted mapping
Input: st1232 - do not reset the chip too early
selftests/powerpc: Fixup clobbers for TM tests
powerpc/tools: Don't quote $objdump in scripts
dma-debug: add a schedule point in debug_dma_dump_mappings()
dma-mapping: Add vmap checks to dma_map_single()
dma-mapping: fix handling of dma-ranges for reserved memory (again)
dmaengine: fsl-qdma: Handle invalid qdma-queue0 IRQ
leds: lm3692x: Handle failure to probe the regulator
leds: an30259a: add a check for devm_regmap_init_i2c
leds: trigger: netdev: fix handling on interface rename
clocksource/drivers/asm9260: Add a check for of_clk_get
clocksource/drivers/timer-of: Use unique device name instead of timer
dtc: Use pkg-config to locate libyaml
selftests/powerpc: Skip tm-signal-sigreturn-nt if TM not available
powerpc/security/book3s64: Report L1TF status in sysfs
powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
ext4: update direct I/O read lock pattern for IOCB_NOWAIT
ext4: iomap that extends beyond EOF should be marked dirty
jbd2: Fix statistics for the number of logged blocks
scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
scsi: lpfc: Fix unexpected error messages during RSCN handling
scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
f2fs: fix to update dir's i_pino during cross_rename
clk: qcom: smd: Add missing pnoc clock
clk: qcom: Allow constant ratio freq tables for rcg
clk: clk-gpio: propagate rate change to parent
irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
irqchip: ingenic: Error out if IRQ domain creation failed
dma-direct: check for overflows on 32 bit DMA addresses
fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
iommu/arm-smmu-v3: Don't display an error when IRQ lines are missing
i2c: stm32f7: fix & reorder remove & probe error handling
iomap: fix return value of iomap_dio_bio_actor on 32bit systems
Input: ili210x - handle errors from input_mt_init_slots()
scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
scsi: zorro_esp: Limit DMA transfers to 65536 bytes (except on Fastlane)
PCI: rpaphp: Fix up pointer to first drc-info entry
scsi: ufs: fix potential bug which ends in system hang
powerpc/pseries/cmm: Implement release() function for sysfs device
PCI: rpaphp: Don't rely on firmware feature to imply drc-info support
PCI: rpaphp: Annotate and correctly byte swap DRC properties
PCI: rpaphp: Correctly match ibm, my-drc-index to drc-name when using drc-info
powerpc/security: Fix wrong message when RFI Flush is disable
powerpc/eeh: differentiate duplicate detection message
powerpc/book3s/mm: Update Oops message to print the correct translation in use
scsi: atari_scsi: sun3_scsi: Set sg_tablesize to 1 instead of SG_NONE
clk: pxa: fix one of the pxa RTC clocks
bcache: at least try to shrink 1 node in bch_mca_scan()
HID: quirks: Add quirk for HP MSU1465 PIXART OEM mouse
dt-bindings: Improve validation build error handling
HID: logitech-hidpp: Silence intermittent get_battery_capacity errors
HID: i2c-hid: fix no irq after reset on raydium 3118
ARM: 8937/1: spectre-v2: remove Brahma-B53 from hardening
libnvdimm/btt: fix variable 'rc' set but not used
HID: Improve Windows Precision Touchpad detection.
HID: rmi: Check that the RMI_STARTED bit is set before unregistering the RMI transport device
watchdog: imx7ulp: Fix reboot hang
watchdog: prevent deferral of watchdogd wakeup on RT
watchdog: Fix the race between the release of watchdog_core_data and cdev
powerpc/fixmap: Use __fix_to_virt() instead of fix_to_virt()
scsi: pm80xx: Fix for SATA device discovery
scsi: ufs: Fix error handing during hibern8 enter
scsi: scsi_debug: num_tgts must be >= 0
scsi: NCR5380: Add disconnect_mask module parameter
scsi: target: core: Release SPC-2 reservations when closing a session
scsi: ufs: Fix up auto hibern8 enablement
scsi: iscsi: Don't send data to unbound connection
scsi: target: iscsi: Wait for all commands to finish before freeing a session
f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
habanalabs: skip VA block list update in reset flow
gpio/mpc8xxx: fix qoriq GPIO reading
platform/x86: intel_pmc_core: Fix the SoC naming inconsistency
platform/x86: intel_pmc_core: Add Comet Lake (CML) platform support to intel_pmc_core driver
gpio: mpc8xxx: Don't overwrite default irq_set_type callback
gpio: lynxpoint: Setup correct IRQ handlers
tools/power/x86/intel-speed-select: Ignore missing config level
Drivers: hv: vmbus: Fix crash handler reset of Hyper-V synic
apparmor: fix unsigned len comparison with less than zero
drm/amdgpu: Call find_vma under mmap_sem
scripts/kallsyms: fix definitely-lost memory leak
powerpc: Don't add -mabi= flags when building with Clang
cifs: Fix use-after-free bug in cifs_reconnect()
um: virtio: Keep reading on -EAGAIN
io_uring: io_allocate_scq_urings() should return a sane state
of: unittest: fix memory leak in attach_node_and_children
cdrom: respect device capabilities during opening action
cifs: move cifsFileInfo_put logic into a work-queue
perf diff: Use llabs() with 64-bit values
perf script: Fix brstackinsn for AUXTRACE
perf regs: Make perf_reg_name() return "unknown" instead of NULL
s390/zcrypt: handle new reply code FILTERED_BY_HYPERVISOR
mailbox: imx: Clear the right interrupts at shutdown
libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h
s390/unwind: filter out unreliable bogus %r14
s390/cpum_sf: Check for SDBT and SDB consistency
ocfs2: fix passing zero to 'PTR_ERR' warning
mailbox: imx: Fix Tx doorbell shutdown path
s390: disable preemption when switching to nodat stack with CALL_ON_STACK
selftests: vm: add fragment CONFIG_TEST_VMALLOC
mm/hugetlbfs: fix error handling when setting up mounts
kernel: sysctl: make drop_caches write-only
userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
Revert "powerpc/vcpu: Assume dedicated processors as non-preempt"
sctp: fix err handling of stream initialization
md: make sure desc_nr less than MD_SB_DISKS
Revert "iwlwifi: assign directly to iwl_trans->cfg in QuZ detection"
netfilter: ebtables: compat: reject all padding in matches/watchers
6pack,mkiss: fix possible deadlock
powerpc: Fix __clear_user() with KUAP enabled
net/smc: add fallback check to connect()
netfilter: bridge: make sure to pull arp header in br_nf_forward_arp()
inetpeer: fix data-race in inet_putpeer / inet_putpeer
net: add a READ_ONCE() in skb_peek_tail()
net: icmp: fix data-race in cmp_global_allow()
hrtimer: Annotate lockless access to timer->state
tomoyo: Don't use nifty names on sockets.
uaccess: disallow > INT_MAX copy sizes
drm: limit to INT_MAX in create_blob ioctl
xfs: fix mount failure crash on invalid iclog memory access
cxgb4/cxgb4vf: fix flow control display for auto negotiation
net: dsa: bcm_sf2: Fix IP fragment location and behavior
net/mlxfw: Fix out-of-memory error in mfa2 flash burning
net: phy: aquantia: add suspend / resume ops for AQR105
net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
net/sched: add delete_empty() to filters and use it in cls_flower
net_sched: sch_fq: properly set sk->sk_pacing_status
net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
ptp: fix the race between the release of ptp_clock and cdev
tcp: Fix highest_sack and highest_sack_seq
udp: fix integer overflow while computing available space in sk_rcvbuf
bnxt_en: Fix MSIX request logic for RDMA driver.
bnxt_en: Free context memory in the open path if firmware has been reset.
bnxt_en: Return error if FW returns more data than dump length
bnxt_en: Fix bp->fw_health allocation and free logic.
bnxt_en: Remove unnecessary NULL checks for fw_health
bnxt_en: Fix the logic that creates the health reporters.
bnxt_en: Add missing devlink health reporters for VFs.
mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
mlxsw: spectrum: Use dedicated policer for VRRP packets
net: add bool confirm_neigh parameter for dst_ops.update_pmtu
ip6_gre: do not confirm neighbor when do pmtu update
gtp: do not confirm neighbor when do pmtu update
net/dst: add new function skb_dst_update_pmtu_no_confirm
tunnel: do not confirm neighbor when do pmtu update
vti: do not confirm neighbor when do pmtu update
sit: do not confirm neighbor when do pmtu update
net/dst: do not confirm neighbor for vxlan and geneve pmtu update
net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
net: marvell: mvpp2: phylink requires the link interrupt
gtp: fix wrong condition in gtp_genl_dump_pdp()
gtp: avoid zero size hashtable
bonding: fix active-backup transition after link failure
tcp: do not send empty skb from tcp_write_xmit()
tcp/dccp: fix possible race __inet_lookup_established()
hv_netvsc: Fix tx_table init in rndis_set_subchannel()
gtp: fix an use-after-free in ipv4_pdp_find()
gtp: do not allow adding duplicate tid and ms_addr pdp context
bnxt: apply computed clamp value for coalece parameter
ipv6/addrconf: only check invalid header values when NETLINK_F_STRICT_CHK is set
net: phylink: fix interface passed to mac_link_up
net: ena: fix napi handler misbehavior when the napi budget is zero
vhost/vsock: accept only packets with the right dst_cid
mmc: sdhci-of-esdhc: fix up erratum A-008171 workaround
mmc: sdhci-of-esdhc: re-implement erratum A-009204 workaround
mm/hugetlbfs: fix for_each_hstate() loop in init_hugetlbfs_fs()
Linux 5.4.8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9962505e7207f0004499de4666df6862105e990d
[ Upstream commit 204cb79ad4 ]
Currently, the drop_caches proc file and sysctl read back the last value
written, suggesting this is somehow a stateful setting instead of a
one-time command. Make it write-only, like e.g. compact_memory.
While mitigating a VM problem at scale in our fleet, there was confusion
about whether writing to this file will permanently switch the kernel into
a non-caching mode. This influences the decision making in a tense
situation, where tens of people are trying to fix tens of thousands of
affected machines: Do we need a rollback strategy? What are the
performance implications of operating in a non-caching state for several
days? It also caused confusion when the kernel team said we may need to
write the file several times to make sure it's effective ("But it already
reads back 3?").
Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
To make the 5.4-rc1 merge easier, merge at a prerelease point in time
before the final release happens.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: If613d657fd0abf9910c5bf3435a745f01b89765e
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
In every source file that declares a sysctl handler there are some
read-only variables, each holding a single integer, whose address is
assigned to the extra1 and extra2 members so that the sysctl range is
enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundaries, leading to duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
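The shared-array idea can be sketched in plain userspace C as below. This is
an illustration only, not the patch itself: struct range_entry and
store_in_range() are hypothetical stand-ins for struct ctl_table and
proc_dointvec_minmax(), and the macro definitions are assumed (only the
sysctl_vals symbol name is taken from this message).

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared, read-only array replaces per-file zero/one/int_max variables. */
static const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Convenience macros pointing at the array members (definitions assumed). */
#define SYSCTL_ZERO    ((const void *)&sysctl_vals[0])
#define SYSCTL_ONE     ((const void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX ((const void *)&sysctl_vals[2])

/* Hypothetical stand-in for the relevant fields of struct ctl_table. */
struct range_entry {
	int *data;
	const void *extra1; /* minimum, or NULL for no lower bound */
	const void *extra2; /* maximum, or NULL for no upper bound */
};

/* Store val only if it lies inside [*extra1, *extra2]. */
static int store_in_range(const struct range_entry *e, int val)
{
	const int *min = e->extra1;
	const int *max = e->extra2;

	if ((min && val < *min) || (max && val > *max))
		return -EINVAL;
	*e->data = val;
	return 0;
}
```

A boolean knob would then be declared with SYSCTL_ZERO and SYSCTL_ONE as its
bounds, with no file-local zero or one variable needed.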
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data                       old     new   delta
sysctl_vals                  -      12     +12
__kstrtab_sysctl_vals        -      12     +12
max                         14      10      -4
int_max                     16       -     -16
one                         68       -     -68
zero                       128      28    -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
the bucket the task has been refcounted in at enqueue time. This
allows lazy aggregation of the "requested" and "system default" values
at enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
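The aggregation described above can be modelled roughly as follows. This is
a simplified userspace sketch, not the scheduler code: the bucket count, the
struct layout and the helper names bucket_id_for() and uclamp_effective() are
assumptions made for illustration.

```c
#define SCHED_CAPACITY_SCALE 1024u
#define UCLAMP_BUCKETS       5u   /* assumed bucket count for this sketch */
#define UCLAMP_BUCKET_DELTA  (SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS)

/* Hypothetical per-task record of the clamp that was actually applied. */
struct uclamp_se {
	unsigned int value;     /* effective clamp value */
	unsigned int bucket_id; /* cached so dequeue needs no division */
	int active;             /* set while the task is refcounted in a bucket */
};

/* Map a clamp value to its bucket, keeping the top value in the last bucket. */
static unsigned int bucket_id_for(unsigned int value)
{
	unsigned int id = value / UCLAMP_BUCKET_DELTA;

	return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}

/* Effective clamp: the task's request capped by the system default. */
static struct uclamp_se uclamp_effective(unsigned int requested,
					 unsigned int system_default)
{
	struct uclamp_se eff;

	eff.value = requested < system_default ? requested : system_default;
	eff.bucket_id = bucket_id_for(eff.value);
	eff.active = 1;
	return eff;
}
```

With the default configuration (system default equal to
SCHED_CAPACITY_SCALE) the requested value passes through unchanged.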
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.
Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer. Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.
Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.
To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.
Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userfaultfd can be misused to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
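The resulting policy can be sketched as a small decision function. This is a
userspace model under stated assumptions: userfaultfd_permitted() is a
hypothetical name, the capability check is reduced to a boolean argument, and
the -EPERM return value is assumed rather than taken from this message.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * unprivileged_userfaultfd models the vm.unprivileged_userfaultfd knob;
 * has_cap_sys_ptrace models whether the caller holds CAP_SYS_PTRACE.
 */
static int userfaultfd_permitted(int unprivileged_userfaultfd,
				 bool has_cap_sys_ptrace)
{
	/* Knob set to zero: only privileged callers may proceed. */
	if (!unprivileged_userfaultfd && !has_cap_sys_ptrace)
		return -EPERM;
	return 0;
}
```

Only the combination of a zero knob and an unprivileged caller is refused;
every other combination is allowed.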
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.
This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period. In this application, extra_free_kbytes
would be left at an amount equal to or larger than the
maximum number of allocations that happen in any burst.
It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.
[surenb]
Will be reverted as soon as Android framework is reworked to
use upstream-supported watermark_scale_factor instead of
extra_free_kbytes.
Bug: 86445363
Bug: 109664768
Bug: 120445732
Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the
ICMPv6 message types use larger numeric values, a simple bitmask doesn't
fit, so I use a large bitmap. The input and output are in the form of a
list of ranges. Set the default to rate limit all error messages but
Packet Too Big. For Packet Too Big, use ratemask instead of hard-coding it.
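As an illustration of the list-of-ranges format, the sketch below parses a
string such as "0-1,3-127" (all error types except type 2, Packet Too Big)
into a 256-bit bitmap. It is a standalone userspace example, not
proc_do_large_bitmap(); the helper names and the strictness of the parser
are assumptions.

```c
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NBITS     256
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS    ((NBITS + WORD_BITS - 1) / WORD_BITS)

static void set_bit_in(unsigned long *map, unsigned int bit)
{
	map[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
}

static int test_bit_in(const unsigned long *map, unsigned int bit)
{
	return (int)((map[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL);
}

/* Parse "a-b,c,d-e" into map; return 0 on success, -1 on malformed input. */
static int parse_ranges(const char *s, unsigned long *map)
{
	memset(map, 0, NWORDS * sizeof(*map));
	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			/* Second half of a range such as "3-127". */
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= NBITS)
			return -1;
		for (unsigned long b = lo; b <= hi; b++)
			set_bit_in(map, (unsigned int)b);
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return 0;
}
```

A set bit then means "rate limit this message type", so looking up the type
of an outgoing message is a single bit test.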
There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().
Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.
v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
isn't defined, expand the description in ip-sysctl.txt and remove
unnecessary conditional before kfree().
v3: Inline the bitmap instead of allocating it dynamically. A pointer to
it is still needed because of the way proc_do_large_bitmap() works.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 32a5ad9c22 ("sysctl: handle overflow for file-max") hooked up
min/max values for the file-max sysctl parameter via the .extra1 and
.extra2 fields in the corresponding struct ctl_table entry.
Unfortunately, the minimum value points at the global 'zero' variable,
which is an int. This results in a KASAN splat when accessed as a long
by proc_doulongvec_minmax on 64-bit architectures:
| BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
| Read of size 8 at addr ffff2000133d1c20 by task systemd/1
|
| CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x228
| show_stack+0x14/0x20
| dump_stack+0xe8/0x124
| print_address_description+0x60/0x258
| kasan_report+0x140/0x1a0
| __asan_report_load8_noabort+0x18/0x20
| __do_proc_doulongvec_minmax+0x5d8/0x6a0
| proc_doulongvec_minmax+0x4c/0x78
| proc_sys_call_handler.isra.19+0x144/0x1d8
| proc_sys_write+0x34/0x58
| __vfs_write+0x54/0xe8
| vfs_write+0x124/0x3c0
| ksys_write+0xbc/0x168
| __arm64_sys_write+0x68/0x98
| el0_svc_common+0x100/0x258
| el0_svc_handler+0x48/0xc0
| el0_svc+0x8/0xc
|
| The buggy address belongs to the variable:
| zero+0x0/0x40
|
| Memory state around the buggy address:
| ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
| ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
| >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
| ^
| ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
| ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fix the splat by introducing an unsigned long 'zero_ul' and using that
instead.
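The type requirement behind this fix can be shown with a small userspace
sketch. The names long_max and ulong_in_range() are hypothetical; the point
is only that a handler reading its bounds through untyped pointers as
unsigned long needs the pointed-to objects to really be unsigned long.

```c
#include <limits.h>

/*
 * Bounds for an unsigned long knob must themselves be unsigned long
 * objects. Pointing extra1 at a 4-byte int and reading it as an 8-byte
 * unsigned long is the out-of-bounds read that KASAN reported.
 */
static const unsigned long zero_ul = 0;
static const unsigned long long_max = LONG_MAX;

/* Hypothetical range check that sees the bounds only as void pointers. */
static int ulong_in_range(unsigned long val, const void *extra1,
			  const void *extra2)
{
	const unsigned long *min = extra1;
	const unsigned long *max = extra2;

	return val >= *min && val <= *max;
}
```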
Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
Fixes: 32a5ad9c22 ("sysctl: handle overflow for file-max")
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"First batch of fixes in the new merge window:
1) Double dst_cache free in act_tunnel_key, from Wenxu.
2) Avoid NULL deref in IN_DEV_MFORWARD() by failing early in the
ip_route_input_rcu() path, from Paolo Abeni.
3) Fix appletalk compile regression, from Arnd Bergmann.
4) If SLAB objects reach the TCP sendpage method we are in serious
trouble, so put a debugging check there. From Vasily Averin.
5) Memory leak in hsr layer, from Mao Wenan.
6) Only test GSO type on GSO packets, from Willem de Bruijn.
7) Fix crash in xsk_diag_put_umem(), from Eric Dumazet.
8) Fix VNIC mailbox length in nfp, from Dirk van der Merwe.
9) Fix race in ipv4 route exception handling, from Xin Long.
10) Missing DMA memory barrier in hns3 driver, from Jian Shen.
11) Use after free in __tcf_chain_put(), from Vlad Buslov.
12) Handle inet_csk_reqsk_queue_add() failures, from Guillaume Nault.
13) Return value correction when ip_mc_may_pull() fails, from Eric
Dumazet.
14) Use after free in x25_device_event(), also from Eric"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
gro_cells: make sure device is up in gro_cells_receive()
vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
net/x25: fix use-after-free in x25_device_event()
isdn: mISDNinfineon: fix potential NULL pointer dereference
net: hns3: fix to stop multiple HNS reset due to the AER changes
ip: fix ip_mc_may_pull() return value
net: keep refcount warning in reqsk_free()
net: stmmac: Avoid one more sometimes uninitialized Clang warning
net: dsa: mv88e6xxx: Set correct interface mode for CPU/DSA ports
rxrpc: Fix client call queueing, waiting for channel
tcp: handle inet_csk_reqsk_queue_add() failures
net: ethernet: sun: Zero initialize class in default case in niu_add_ethtool_tcam_entry
8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC
fou, fou6: avoid uninit-value in gue_err() and gue6_err()
net: sched: fix potential use-after-free in __tcf_chain_put()
vhost: silence an unused-variable warning
vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock
connector: fix unsafe usage of ->real_parent
vxlan: do not need BH again in vxlan_cleanup()
net: hns3: add dma_rmb() for rx description
...
Currently, when writing
echo 18446744073709551616 > /proc/sys/fs/file-max
/proc/sys/fs/file-max will overflow and be set to 0. That quickly
crashes the system.
This commit sets the max and min value for file-max. The max value is
set to the largest value a long int can hold. Any higher value cannot
currently be used, because the percpu counters are long ints and not
unsigned integers.
Note that the file-max value is ultimately parsed via
__do_proc_doulongvec_minmax(). This function does not report an error
when min or max are exceeded, which means that if a value larger than
long int is written, userspace will not receive an error; instead, the
old value will be kept. There is an argument to be made that this should
be changed and __do_proc_doulongvec_minmax() should return an error when
a dedicated min or max value is exceeded. However, this has the potential
to break userspace, so let's defer this to an RFC patch.
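The "keep the old value, report no error" behaviour described above can be
modelled like this. It is a simplified userspace sketch: store_ulong_minmax()
is a hypothetical name and does not reproduce the real parsing loop.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Model of a min/max-checked store that silently skips out-of-range
 * values: the destination keeps its old contents and the caller still
 * sees success.
 */
static int store_ulong_minmax(unsigned long *dst, unsigned long val,
			      const unsigned long *min,
			      const unsigned long *max)
{
	if ((min && val < *min) || (max && val > *max))
		return 0; /* out of range: skipped without an error */
	*dst = val;
	return 0;
}
```

From the writer's point of view both calls look identical, which is why an
out-of-range write to file-max goes unnoticed unless the value is read back.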
Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
[christian@brauner.io: v4]
Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_get_long() is a funny function. It uses simple_strtoul() and for a
good reason. proc_get_long() wants the parse to always succeed and to
return the maybe incorrect value and the trailing characters to check
against a pre-defined list of acceptable trailing values. However,
simple_strtoul() explicitly ignores overflows which can cause funny
things like the following to happen:
echo 18446744073709551616 > /proc/sys/fs/file-max
cat /proc/sys/fs/file-max
0
(Which will cause your system to silently die behind your back.)
On the other hand kstrtoul() does do overflow detection but does not
return the trailing characters, and also fails the parse when anything
other than '\n' is a trailing character whereas proc_get_long() wants to
be more lenient.
Now, before adding another kstrtoul() function let's simply add a static
parser, strtoul_lenient(), which:
- fails on overflow with -ERANGE
- returns the trailing characters to the caller
The reason why we should fail on ERANGE is that we already do a partial
fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So
we already reject values such as 184467440737095516160 (21 chars) but
accept values such as 18446744073709551616 (20 chars) but both are
overflows. So we should just always reject 64bit overflows and not
special-case this based on the number of chars.
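A userspace approximation of the two properties listed above (fail on
overflow with -ERANGE, hand the trailing characters back to the caller) is
sketched below. It is built on the C library's strtoul() rather than the
kernel's parsing helpers, and the signature is an assumption, so it should be
read as an illustration of the contract and not as the kernel function.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned long from s. On success store the value in *res,
 * point *endp (if non-NULL) at the first unparsed character and
 * return 0. Return -ERANGE on overflow and -EINVAL if no digits were
 * found. Unlike the kernel helper, strtoul() also accepts leading
 * whitespace and a sign.
 */
static int strtoul_lenient(const char *s, char **endp, unsigned int base,
			   unsigned long *res)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;
	if (endp)
		*endp = end;
	*res = val;
	return 0;
}
```

Both overflowing inputs from the text above are rejected the same way,
regardless of how many characters they contain, while input such as "42\n"
still parses and leaves the newline for the caller to inspect.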
Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_BPF_SYSCALL or CONFIG_SYSCTL is disabled, we get
a warning about an unused function:
kernel/sysctl.c:3331:12: error: 'proc_dointvec_minmax_bpf_stats' defined but not used [-Werror=unused-function]
static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write,
The CONFIG_BPF_SYSCALL check was already handled, but the SYSCTL check
is needed on top.
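The shape of such a guard can be sketched as follows. The two CONFIG_ macros
are defined locally so the fragment builds on its own, and
set_stats_enabled() is a hypothetical stand-in for the real handler; the
exact preprocessor condition used by the patch is assumed here.

```c
/* Defined locally for the sketch; a kernel build gets these from Kconfig. */
#define CONFIG_BPF_SYSCALL 1
#define CONFIG_SYSCTL 1

/*
 * Only emit the handler when both options are enabled. If either is
 * off, nothing references the function and -Werror=unused-function
 * would otherwise break the build.
 */
#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL)
static int stats_enabled;

/* Hypothetical handler: accept only 0 or 1. */
static int set_stats_enabled(int val)
{
	if (val != 0 && val != 1)
		return -1;
	stats_enabled = val;
	return 0;
}
#endif
```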
Fixes: 492ecee892 ("bpf: enable program stats")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian@brauner.io>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge misc updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits)
tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
proc: more robust bulk read test
proc: test /proc/*/maps, smaps, smaps_rollup, statm
proc: use seq_puts() everywhere
proc: read kernel cpu stat pointer once
proc: remove unused argument in proc_pid_lookup()
fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
fs/proc/self.c: code cleanup for proc_setup_self()
proc: return exit code 4 for skipped tests
mm,mremap: bail out earlier in mremap_to under map pressure
mm/sparse: fix a bad comparison
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
writeback: fix inode cgroup switching comment
mm/huge_memory.c: fix "orig_pud" set but not used
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
mm/memcontrol.c: fix bad line in comment
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm/compaction: pass pgdat to too_many_isolated() instead of zone
mm: remove zone_lru_lock() function, access ->lru_lock directly
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
JITed BPF programs are indistinguishable from kernel functions, but unlike
kernel code BPF code can be changed often.
The typical approach of "perf record" + "perf report" profiling and tuning
of kernel code works just as well for BPF programs, but kernel code doesn't
need to be monitored whereas BPF programs do.
Users load and run a large number of BPF programs.
These BPF stats allow tools to monitor the usage of BPF on the server.
The monitoring tools will turn sysctl kernel.bpf_stats_enabled
on and off for a few seconds to sample the average cost of the programs.
Data aggregated over hours and days will provide insight into the cost of
BPF, and alarms can trigger in case a given program suddenly gets more
expensive.
The two sched_clock() calls per program invocation add ~20 nsec.
Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
from ~10 nsec to ~30 nsec.
static_key minimizes the cost of the stats collection.
There is no measurable difference before/after this patch
with kernel.bpf_stats_enabled=0
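The accounting scheme described above can be sketched in userspace as
follows. Everything here is an illustrative assumption: the names
run_prog(), struct prog_stats and stats_enabled are hypothetical, a plain
bool stands in for the static_key, and clock_gettime() stands in for
sched_clock().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Per-program counters: number of runs and total time spent in them. */
struct prog_stats {
	uint64_t cnt;
	uint64_t nsecs;
};

/* Stand-in for kernel.bpf_stats_enabled (the kernel uses a static_key). */
bool stats_enabled;

uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run a program; pay for the two clock reads only when stats are on. */
int run_prog(int (*prog)(int), int ctx, struct prog_stats *st)
{
	uint64_t start;
	int ret;

	if (!stats_enabled)
		return prog(ctx);	/* fast path: no timing overhead */

	start = now_ns();
	ret = prog(ctx);
	st->cnt++;
	st->nsecs += now_ns() - start;
	return ret;
}

/* Trivial example "program" used to exercise run_prog(). */
int sample_prog(int ctx)
{
	return ctx + 1;
}
```

A monitoring tool would enable the flag for a few seconds, then read
nsecs / cnt as the average cost per invocation over that window.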
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.
For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:
	{
		.procname	= "monitor_signals",
		.data		= &monitor_sigs,
		.maxlen		= 2*sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
Reproduce:
When passing two parameters, it works normally. But when passing only one
parameter, an "Invalid argument" (EINVAL) error is returned.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
-bash: echo: write error: Invalid argument
[root@cl150 ~]# echo $?
1
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
The following is the result after applying this patch. No error is
returned when the number of input parameters is less than the total
parameters.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# echo $?
0
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
There are three processing functions dealing with numeric parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.
This patch makes __do_proc_doulongvec_minmax behave as
__do_proc_dointvec does, by adding a check of 'left'. The implementation
of __do_proc_douintvec explicitly does not support multiple inputs:
static int __do_proc_douintvec(...){
	...
	/*
	 * Arrays are not supported, keep this simple. *Do not* add
	 * support for them.
	 */
	if (vleft != 1) {
		*lenp = 0;
		return -EINVAL;
	}
	...
}
So only __do_proc_doulongvec_minmax has the problem, and most uses of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax take just one
parameter.
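The intended post-patch behaviour (a short write updates the leading
slots and is not an error) can be shown with a small userspace sketch.
The helper write_ulong_vec() is hypothetical and is not the kernel's
__do_proc_doulongvec_minmax(); it only mirrors the observable result
from the transcript above.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse up to n whitespace-separated unsigned
 * longs from buf into vec. Slots with no matching input keep their
 * previous value, and running out of input is not treated as an error.
 * Tokens beyond the n-th are ignored in this sketch.
 */
int write_ulong_vec(const char *buf, unsigned long *vec, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		char *end;
		unsigned long val;

		while (isspace((unsigned char)*buf))
			buf++;
		if (*buf == '\0')
			break;			/* fewer values than slots: fine */
		if (!isdigit((unsigned char)*buf))
			return -EINVAL;		/* not a number */

		errno = 0;
		val = strtoul(buf, &end, 10);
		if (errno == ERANGE)
			return -EINVAL;
		vec[i] = val;
		buf = end;
	}
	return 0;
}
```

Writing "1 2" and then "3" to a two-element vector leaves it holding
3 and 2, and both writes succeed, matching the second transcript.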
Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An external fragmentation event was previously described as:
When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.
The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.
This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external-fragmentation-causing
event occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark, and kswapd is woken, if the
calling context allows, to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost is
cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.
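As a rough illustration of "relative to the size of the high watermark
and the watermark_boost_factor", the sketch below computes a boost
ceiling and applies one boost step. It rests on assumptions not stated
in this message: that the factor is expressed in units of 1/10000, that
a factor of 0 disables boosting, and that each event raises the boost by
a fixed step. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Ceiling for the boost, in pages: high_wmark * factor / 10000.
 * ASSUMPTION: the factor is in units of 1/10000 (so 15000 = 150%).
 * The multiply is split to limit intermediate overflow.
 */
unsigned long max_boost_pages(unsigned long high_wmark, unsigned long factor)
{
	return (high_wmark / 10000) * factor +
	       ((high_wmark % 10000) * factor) / 10000;
}

/*
 * Apply one fragmentation event: raise the current boost by 'step'
 * pages, capped at the ceiling. ASSUMPTION: factor 0 means disabled.
 */
unsigned long boost_step(unsigned long cur_boost, unsigned long high_wmark,
			 unsigned long factor, unsigned long step)
{
	unsigned long ceiling, next;

	if (factor == 0)
		return cur_boost;	/* boosting disabled */

	ceiling = max_boost_pages(high_wmark, factor);
	next = cur_boost + step;
	return next < ceiling ? next : ceiling;
}
```

Under these assumptions, a zone with a high watermark of 20000 pages and
a factor of 15000 could be boosted by at most 30000 pages, reached
gradually as fragmentation events accumulate.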
This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9: 804694
4.20-rc3+patch: 408912 (49% reduction)
4.20-rc3+patch1-4: 18421 (98% reduction)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%)
Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%*
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%)
Note that external fragmentation causing events are massively reduced by
this patch, whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.
1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 291392
4.20-rc3+patch: 191187 (34% reduction)
4.20-rc3+patch1-4: 13464 (95% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%)
Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%)
Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%)
Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%*
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%)
As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 215698
4.20-rc3+patch: 200210 (7% reduction)
4.20-rc3+patch1-4: 14263 (93% reduction)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%)
Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%)
There is a 93% reduction in fragmentation-causing events, a big
reduction in the huge page fault latency, and a higher allocation
success rate.
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch: 147463 (11% reduction)
4.20-rc3+patch1-4: 11095 (93% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%*
Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%)
There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce CONFIG_STACKLEAK_RUNTIME_DISABLE option, which provides the
'stack_erasing' sysctl. It can be used at runtime to control kernel
stack erasing for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
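The open restriction described at the top of this message can be
summarised as a decision function. This is a simplified userspace sketch
of the stated rule, not the kernel's implementation: struct open_req,
may_open_in_sticky() and the two plain int toggles are hypothetical
stand-ins, and the toggles are treated as simple on/off switches.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Everything the rule needs to know about one open() attempt. */
struct open_req {
	mode_t dir_mode;	/* mode of the containing directory */
	uid_t  dir_uid;		/* owner of the containing directory */
	mode_t file_mode;	/* mode of the existing file */
	uid_t  file_uid;	/* owner of the existing file */
	uid_t  opener_uid;	/* user performing the open */
	bool   o_creat;		/* was O_CREAT passed? */
};

/* Hypothetical stand-ins for the per-type sysctl switches. */
int protected_fifos = 1;
int protected_regular = 1;

/* Returns 0 if the open may proceed, -EACCES if it is refused. */
int may_open_in_sticky(const struct open_req *r)
{
	bool is_fifo = S_ISFIFO(r->file_mode);
	bool is_reg = S_ISREG(r->file_mode);

	if (!r->o_creat)
		return 0;	/* only opens with O_CREAT are restricted */
	if (!(is_fifo && protected_fifos) && !(is_reg && protected_regular))
		return 0;	/* type not covered or protection off */
	if (!(r->dir_mode & S_ISVTX) || !(r->dir_mode & S_IWOTH))
		return 0;	/* not a world-writable sticky directory */
	if (r->file_uid == r->opener_uid || r->file_uid == r->dir_uid)
		return 0;	/* owned by the user or the directory owner */

	return -EACCES;
}
```

For example, a FIFO in a /tmp-like directory that belongs to another
unprivileged user is refused when opened with O_CREAT, which is the
pattern a spoofing attacker relies on.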
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subject]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the task hung checking interval is equal to the timeout; as a
result, a hung task is detected anywhere between timeout and 2*timeout.
This is fine for most interactive environments, but it hurts automated
testing setups (syzbot). In an automated setup we need to strictly order
CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so
that an RCU stall is not detected as a task hung and a task hung is not
detected as a silent machine loss. The large variance in the task hung
detection timeout requires setting the silent machine loss timeout to a
very large value (e.g. if task hung is 3 mins, then silent loss needs to
be set to ~7 mins). The additional 3 minutes significantly reduce testing
efficiency because usually we crash the kernel within a minute, and this
can add hours to the bug localization process as it needs to do dozens of
tests.
Allow setting the checking interval separately from the timeout. This
allows setting the timeout to, say, 3 minutes, but the checking interval
to 10 secs.
The interval is controlled via a new hung_task_check_interval_secs sysctl,
similar to the existing hung_task_timeout_secs sysctl. The default value
of 0 results in the current behavior: checking interval is equal to
timeout.
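The effect on detection latency can be worked through with a small
sketch. The function names are hypothetical, and clamping the interval
to the timeout is an assumption made here for illustration; the message
itself only states that 0 means "same as timeout".

```c
#include <assert.h>

/*
 * Interval actually used between checks, in seconds.
 * 0 keeps the old behaviour (interval == timeout). ASSUMPTION: an
 * interval larger than the timeout is clamped to the timeout.
 */
unsigned long effective_check_interval(unsigned long timeout,
				       unsigned long check_interval)
{
	if (check_interval == 0)
		return timeout;
	return check_interval < timeout ? check_interval : timeout;
}

/*
 * Worst case: the task hangs right after a check, so it is reported at
 * the first check that runs once it has been hung for 'timeout'.
 */
unsigned long worst_case_detection(unsigned long timeout,
				   unsigned long check_interval)
{
	return timeout + effective_check_interval(timeout, check_interval);
}
```

With a 180 second timeout, the default gives a worst case of 360
seconds, while a 10 second checking interval brings it down to 190
seconds, so the silent machine loss timeout can be set much closer to
the task hung timeout.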
[akpm@linux-foundation.org: update hung_task_timeout_max's comment]
Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently one needs to test four kernel configurations to test the
firmware API completely:
0)
CONFIG_FW_LOADER=y
1)
o CONFIG_FW_LOADER=y
o CONFIG_FW_LOADER_USER_HELPER=y
2)
o CONFIG_FW_LOADER=y
o CONFIG_FW_LOADER_USER_HELPER=y
o CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y
3) When CONFIG_FW_LOADER=m the built-in stuff is disabled, we have
no current tests for this.
We can reduce the requirements to three kernel configurations by making
fw_config.force_sysfs_fallback a proc knob we can flip on and off. For
kernels that
disable CONFIG_IKCONFIG_PROC this can also enable one to inspect if
CONFIG_FW_LOADER_USER_HELPER_FALLBACK was enabled at build time by checking
the proc value at boot time.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>