Currently, __note_gp_changes() checks to see if the CPU has slept through
multiple grace periods. If it has, it resynchronizes that CPU's view
of the grace-period state, which includes whether or not the current
grace period needs a quiescent state from this CPU. The fact of this
need (or lack thereof) needs to be in two places, rdp->cpu_no_qs.b.norm
and rdp->core_needs_qs. The former tells RCU's context-switch code to
go get a quiescent state and the latter says that it needs to be reported.
The current code unconditionally sets the former to true, but correctly
sets the latter.
This does not result in failures, but it does unnecessarily increase
the amount of work done on average at context-switch time. This commit
therefore correctly sets both fields.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Kconfig currently controlling compilation of tree.c is:
init/Kconfig:config TREE_RCU
init/Kconfig: bool
...and update.c and sync.c are "obj-y" meaning that none are ever
built as a module by anyone.
Since MODULE_ALIAS is a no-op for non-modular code, we can remove
them from these files.
We leave moduleparam.h behind since the files instantiate some boot
time configuration parameters with module_param() still.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Both timers and hrtimers are maintained on the outgoing CPU until
CPU_DEAD time, at which point they are migrated to a surviving CPU. If a
mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
will splat in native_smp_send_reschedule() when attempting to wake up
the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
[ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
[ 7976.741595] Modules linked in:
[ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
[ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 7976.741595] 0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
[ 7976.741595] 0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
[ 7976.741595] 0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
[ 7976.741595] Call Trace:
[ 7976.741595] [<ffffffff8138ab2e>] dump_stack+0x67/0x99
[ 7976.741595] [<ffffffff8105cabc>] __warn+0xcc/0xf0
[ 7976.741595] [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
[ 7976.741595] [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
[ 7976.741595] [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
[ 7976.741595] [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
[ 7976.741595] [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
[ 7976.741595] [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
[ 7976.741595] [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
[ 7976.741595] [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
[ 7976.741595] [<ffffffff8108068f>] kthread+0xdf/0x100
[ 7976.741595] [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
[ 7976.741595] [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
However, in this case, the wakeup is redundant, because the timer
migration will reprogram timer hardware as needed. Note that the fact
that preemption is disabled does not avoid the splat, as the offline
operation has already passed both the synchronize_sched() and the
stop_machine() that would be blocked by disabled preemption.
This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
to wake up offline CPUs. It also adds a comment stating that the
caller must tolerate lost wakeups when the target CPU is going offline,
and suggesting the CPU_DEAD notifier as a recovery mechanism.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Commit abedf8e241 ("rcu: Use simple wait queues where possible in
rcutree") converts Tree RCU's wait queues to simple wait queues,
but it incorrectly reverts the commit 2aa792e6fa ("rcu: Use
rcu_gp_kthread_wake() to wake up grace period kthreads"). This can
result in redundant self-wakeups.
This commit therefore replaces the simple wait-queue wakeups with
rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit improves the accuracy of the interaction between CPU hotplug
operations and RCU's expedited grace periods by using RCU's online-CPU
state to determine when failed IPIs should be retried.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU grace periods currently rely on a failure indication
from smp_call_function_single() to determine that a given CPU is offline.
This works after a fashion, but is more contorted and less precise than
relying on RCU's internal state. This commit therefore takes a first
step towards relying on internal state.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU CPU stall warnings currently responds to neither the
panic_on_rcu_stall sysctl setting nor the rcupdate.rcu_cpu_stall_suppress
kernel boot parameter. This commit therefore updates the expedited code
to respond to these two controls.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that RCU expedited grace periods are always driven by a workqueue,
there is no need to account for signal reception, and thus no need
to disable expedited RCU CPU stall warnings due to signal reception.
This commit therefore removes the signal-reception checks, leaving a
WARN_ON() to catch possible future bugs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementation of expedited grace periods has the user
task drive the grace period. This works, but has downsides: (1) The
user task must awaken tasks piggybacking on this grace period, which
can result in latencies rivaling that of the grace period itself, and
(2) User tasks can receive signals, which interfere with RCU CPU stall
warnings.
This commit therefore uses workqueues to drive the grace periods, so
that the user task need not do the awakening. A subsequent commit
will remove the now-unnecessary code allowing for signals.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The functions synchronize_rcu_expedited() and synchronize_sched_expedited()
have nearly identical code. This commit therefore consolidates this code
into a new _synchronize_rcu_expedited() function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also start/stop filter related fixes, a perf
event read() fix, a fix uncovered by fuzzing, and an uprobes leak fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Check return value of the perf_event_read() IPI
perf/core: Enable mapping of the stop filters
perf/core: Update filters only on executable mmap
perf/core: Fix file name handling for start/stop filters
perf/core: Fix event_function_local()
uprobes: Fix the memcg accounting
perf intel-pt: Fix occasional decoding errors when tracing system-wide
tools: Sync kvm related header files for arm64 and s390
perf probe: Release resources on error when handling exit paths
perf probe: Check for dup and fdopen failures
perf symbols: Fix annotation of objects with debuginfo files
perf script: Don't disable use_callchain if input is pipe
perf script: Show proper message when failed list scripts
perf jitdump: Add the right header to get the major()/minor() definitions
perf ppc64le: Fix build failure when libelf is not present
perf tools mem: Fix -t store option for record command
perf intel-pt: Fix ip compression
Introduce arch_klp_init_object_loaded() to complete any additional
arch-specific tasks during patching. Architecture code may override this
function.
Signed-off-by: Jessica Yu <jeyu@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull power management fixes from Rafael Wysocki:
"More hibernation-related material: one fix for a recent regression in
the core, one small cleanup of the x86-64 resume code and a
documentation update.
Specifics:
- Fix a hibernate core regression resulting from uncovering a latent
bug in its implementation of memory bitmaps by a recent commit
(James Morse).
- Use __pa() to compute a physical address in the x86-64 code
finalizing resume from hibernation (Rafael Wysocki).
- Update power management documentation related to system sleep
states to remove outdated information from it and to add a
description of a recently introduced hibernation debug feature to
it (Rafael Wysocki)"
* tag 'pm-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
x86/power/64: Use __pa() for physical address computation
PM / sleep: Update some system sleep documentation
When wanting to wakeup readers, __rwsem_mark_wakeup() currently
iterates the wait_list twice while looking to wakeup the first N
queued reader-tasks. While this can be quite inefficient, it was
there such that a awoken reader would be first and foremost
acknowledged by the lock counter.
Keeping the same logic, we can further benefit from the use of
wake_qs and avoid entirely the first wait_list iteration that sets
the counter as wake_up_process() isn't going to occur right away,
and therefore we maintain the counter->list order of going about
things.
Other than saving cycles with O(n) "scanning", this change also
nicely cleans up a good chunk of __rwsem_mark_wakeup(); both
visually and less tedious to read.
For example, the following improvements where seen on some will
it scale microbenchmarks, on a 48-core Haswell:
v4.7 v4.7-rwsem-v1
Hmean signal1-processes-8 5792691.42 ( 0.00%) 5771971.04 ( -0.36%)
Hmean signal1-processes-12 6081199.96 ( 0.00%) 6072174.38 ( -0.15%)
Hmean signal1-processes-21 3071137.71 ( 0.00%) 3041336.72 ( -0.97%)
Hmean signal1-processes-48 3712039.98 ( 0.00%) 3708113.59 ( -0.11%)
Hmean signal1-processes-79 4464573.45 ( 0.00%) 4682798.66 ( 4.89%)
Hmean signal1-processes-110 4486842.01 ( 0.00%) 4633781.71 ( 3.27%)
Hmean signal1-processes-141 4611816.83 ( 0.00%) 4692725.38 ( 1.75%)
Hmean signal1-processes-172 4638157.05 ( 0.00%) 4714387.86 ( 1.64%)
Hmean signal1-processes-203 4465077.80 ( 0.00%) 4690348.07 ( 5.05%)
Hmean signal1-processes-224 4410433.74 ( 0.00%) 4687534.43 ( 6.28%)
Stddev signal1-processes-8 6360.47 ( 0.00%) 8455.31 ( 32.94%)
Stddev signal1-processes-12 4004.98 ( 0.00%) 9156.13 (128.62%)
Stddev signal1-processes-21 3273.14 ( 0.00%) 5016.80 ( 53.27%)
Stddev signal1-processes-48 28420.25 ( 0.00%) 26576.22 ( -6.49%)
Stddev signal1-processes-79 22038.34 ( 0.00%) 18992.70 (-13.82%)
Stddev signal1-processes-110 23226.93 ( 0.00%) 17245.79 (-25.75%)
Stddev signal1-processes-141 6358.98 ( 0.00%) 7636.14 ( 20.08%)
Stddev signal1-processes-172 9523.70 ( 0.00%) 4824.75 (-49.34%)
Stddev signal1-processes-203 13915.33 ( 0.00%) 9326.33 (-32.98%)
Stddev signal1-processes-224 15573.94 ( 0.00%) 10613.82 (-31.85%)
Other runs that saw improvements include context_switch and pipe; and
as expected, this is particularly highlighted on larger thread counts
as it becomes more expensive to walk the list twice.
No change in wakeup ordering or semantics.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: dave@stgolabs.net
Cc: jason.low2@hpe.com
Cc: wanpeng.li@hotmail.com
Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current percpu-rwsem read side is entirely free of serializing insns
at the cost of having a synchronize_sched() in the write path.
The latency of the synchronize_sched() is too high for cgroups. The
commit 1ed1328792 talks about the write path being a fairly cold path
but this is not the case for Android which moves task to the foreground
cgroup and back around binder IPC calls from foreground processes to
background processes, so it is significantly hotter than human initiated
operations.
Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
problem, hopefully it should not be that slow after another commit:
80127a3968 ("locking/percpu-rwsem: Optimize readers and reduce global impact").
We could just add rcu_sync_enter() into cgroup_init() but we do not want
another synchronize_sched() at boot time, so this patch adds the new helper
which doesn't block but currently can only be called before the first use.
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Colin Cross <ccross@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rom Lemarchand <romlem@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Link: http://lkml.kernel.org/r/20160811165413.GA22807@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
d670ec1317 ("posix-cpu-timers: Cure SMP wobbles")
started accounting thread group tasks pending runtime in thread_group_cputime().
Another commit:
6e998916df ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
updated scheduler runtime statistics (call update_curr()) when reading task pending
runtime. Those changes cause bad performance of SYS_times() and
SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls, especially on
larger systems with many CPUs.
While we would like to have cpuclock monotonicity kept i.e. have
problems fixed by above commits stay fixed, we also would like to have
good performance.
However when we notice that change from commit d670ec1317 is not
longer needed to solve problem addressed by that commit, because of
change from the second commit 6e998916df, we can get room for
optimization. Since we update task while reading it's pending runtime
in task_sched_runtime(), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will
see updated values and on testcase from d670ec1317 process cpuclock
will not be smaller than thread cpuclock.
I tested the patch on testcases from commits d670ec1317,
6e998916df and some other cpuclock/cputimers testcases and
did not found cpuclock monotonicity problems or other malfunction.
This patch has the drawback that we will not provide thread group cputime
up-to-date to the last moment. For example when arming cputime timer,
we will arm it with possibly a bit outdated values and that timer will
trigger earlier compared to behaviour without the patch. However that
was the behaviour before d670ec1317 commit (kernel v3.1) so it's
unlikely to affect applications.
Patch improves related syscall performance, as measured by Giovanni's
benchmarks described in commit:
6075620b05 ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")
The benchmark results are:
SYS_clock_gettime():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch
(pre-6e998916dfe3)
2 3.48 2.23 ( 35.68%) 3.06 ( 11.83%) 1.08 ( 68.81%)
5 3.33 2.83 ( 14.84%) 3.25 ( 2.40%) 0.71 ( 78.55%)
8 3.37 2.84 ( 15.80%) 3.26 ( 3.30%) 0.56 ( 83.49%)
12 3.32 3.09 ( 6.69%) 3.37 ( -1.60%) 0.42 ( 87.28%)
21 4.01 3.14 ( 21.70%) 3.90 ( 2.74%) 0.35 ( 91.35%)
30 3.63 3.28 ( 9.75%) 3.36 ( 7.41%) 0.28 ( 92.23%)
48 3.71 3.02 ( 18.69%) 3.11 ( 16.27%) 0.39 ( 89.39%)
79 3.75 2.88 ( 23.23%) 3.16 ( 15.74%) 0.46 ( 87.76%)
110 3.81 2.95 ( 22.62%) 3.25 ( 14.80%) 0.56 ( 85.41%)
128 3.88 3.05 ( 21.28%) 3.31 ( 14.76%) 0.62 ( 84.10%)
SYS_times():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + patch
(pre-6e998916dfe3)
2 3.65 2.27 ( 37.94%) 3.25 ( 11.03%) 1.62 ( 55.71%)
5 3.45 2.78 ( 19.34%) 3.17 ( 7.92%) 2.33 ( 32.28%)
8 3.52 2.79 ( 20.66%) 3.22 ( 8.69%) 2.06 ( 41.44%)
12 3.29 3.02 ( 8.33%) 3.36 ( -2.04%) 2.00 ( 39.18%)
21 4.07 3.10 ( 23.86%) 3.92 ( 3.78%) 2.07 ( 49.18%)
30 3.87 3.33 ( 13.80%) 3.40 ( 12.17%) 1.89 ( 51.12%)
48 3.79 2.96 ( 21.94%) 3.16 ( 16.61%) 1.69 ( 55.46%)
79 3.88 2.88 ( 25.82%) 3.28 ( 15.42%) 1.60 ( 58.81%)
110 3.90 2.98 ( 23.73%) 3.38 ( 13.35%) 1.73 ( 55.61%)
128 4.00 3.10 ( 22.40%) 3.38 ( 15.45%) 1.66 ( 58.52%)
Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of CPU capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).
The flag is intended to indicate that extra care should be taken when
placing tasks on CPUs and this level spans all the different types of
CPUs found in the system (no need to look further up the domain
hierarchy). This information is currently only available through
iterating through the capacities of all the CPUs at parent levels in the
sched_domain hierarchy.
SD 2 [ 0 1 2 3] SD_ASYM_CPUCAPACITY
SD 1 [ 0 1] [ 2 3] !SD_ASYM_CPUCAPACITY
CPU: 0 1 2 3
capacity: 756 756 1024 1024
If the topology in the example above is duplicated to create an eight
CPU example with third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two group
would both have all CPU capacities represented within them.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
5743021831 ("sched/cputime: Count actually elapsed irq & softirq time")
... fixed a bug but also triggered a regression:
On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
on host one by one until there is only one left, then observe CPU utilization
via 'top' in the guest, it shows:
100% st for cpu0(housekeeping)
75% st for other CPUs (nohz full mode)
However, w/o this commit it shows the correct 75% for all four CPUs.
When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.
However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() it is
subtracted from is, so the max cputime limit is required to avoid underflow.
This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.
Suggested-by: Rik van Riel <riel@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reports:
Roughly 10% of the time, ltp testcase getrusage04 fails:
getrusage04 0 TINFO : Expected timers granularity is 4000 us
getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
getrusage04 0 TINFO : utime: 0us; stime: 179us
getrusage04 0 TINFO : utime: 3751us; stime: 0us
getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:
And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.
Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org # 4.3+
Fixes: 9d7fb04276 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
group's leader to indicate that is_software_event(event) is true for all
events in a group. This is the only usage of event->group_flags.
This pattern of setting a group level flags when all events in the group
share a property is useful for the flag introduced in the next patch and
for future CQM/CMT flags. So this patches generalizes group_flags to work
as an aggregate of event level flags.
PERF_GROUP_SOFTWARE denotes an inmutable event's property. All other flags
that I intend to add are also determinable at event initialization.
To better convey the above, this patch renames event's group_flags to
group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.
Individual event flags are stored in the new event->event_caps. Since the
cap flags do not change after event initialization, there is no need to
serialize event_caps. This new field is used when events are added to a
context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
worked.
Lastly, for consistency, updates is_software_event() to rely in event_cap
instead of the context index.
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Function perf_event_mmap() is called by the MM subsystem each time
part of a binary is loaded in memory. There can be several mapping
for a binary, many times unrelated to the code section.
Each time a section of a binary is mapped address filters are
updated, event when the map doesn't pertain to the code section.
The end result is that filters are configured based on the last map
event that was received rather than the last mapping of the code
segment.
For example if we have an executable 'main' that calls library
'libcstest.so.1.0', and that we want to collect traces on code
that is in that library. The perf cmd line for this scenario
would be:
perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main
Resulting in binaries being mapped this way:
root@linaro-nano:~# cat /proc/1950/maps
00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
7fa2464000-7fa2474000 rw-p 00000000 00:00 0
7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
root@linaro-nano:~#
Before 'main()' can execute 'libcstest.so.1.0' has to be loaded in
memory. Once that has been done perf_event_mmap() has been called
4 times, with the last map starting at address 0x7fa25ce000 and
the address filter configured to start filtering when the
IP has passed over address 0x0x7fa25ce72c (0x7fa25ce000 + 0x72c).
But that is wrong since the code segment for library 'libcstest.so.1.0'
as been mapped at 0x7fa25bd000, resulting in traces not being
collected.
This patch corrects the situation by requesting that address
filters be updated only if the mapped event is for a code
segment.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Minor overlapping changes for both merge conflicts.
Resolution work done by Stephen Rothwell was used
as a reference.
Signed-off-by: David S. Miller <davem@davemloft.net>
* pm-sleep:
PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
x86/power/64: Use __pa() for physical address computation
PM / sleep: Update some system sleep documentation
Pull networking fixes from David Miller:
1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
Fietkau.
2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.
3) Fix some tg3 ethtool logic bugs, and one that would cause no
interrupts to be generated when rx-coalescing is set to 0. From
Satish Baddipadige and Siva Reddy Kallam.
4) QLCNIC mailbox corruption and napi budget handling fix from Manish
Chopra.
5) Fix fib_trie logic when walking the trie during /proc/net/route
output than can access a stale node pointer. From David Forster.
6) Several sctp_diag fixes from Phil Sutter.
7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.
8) Checksum fixup fixes in bpf from Daniel Borkmann.
9) Memork leaks in nfnetlink, from Liping Zhang.
10) Use after free in rxrpc, from David Howells.
11) Use after free in new skb_array code of macvtap driver, from Jason
Wang.
12) Calipso resource leak, from Colin Ian King.
13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.
14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.
15) Fix lockdep splats in macsec, from Sabrina Dubroca.
16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
handling.
17) Various tc-action bug fixes, from CONG Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net_sched: allow flushing tc police actions
net_sched: unify the init logic for act_police
net_sched: convert tcf_exts from list to pointer array
net_sched: move tc offload macros to pkt_cls.h
net_sched: fix a typo in tc_for_each_action()
net_sched: remove an unnecessary list_del()
net_sched: remove the leftover cleanup_a()
mlxsw: spectrum: Allow packets to be trapped from any PG
mlxsw: spectrum: Unmap 802.1Q FID before destroying it
mlxsw: spectrum: Add missing rollbacks in error path
mlxsw: reg: Fix missing op field fill-up
mlxsw: spectrum: Trap loop-backed packets
mlxsw: spectrum: Add missing packet traps
mlxsw: spectrum: Mark port as active before registering it
mlxsw: spectrum: Create PVID vPort before registering netdevice
mlxsw: spectrum: Remove redundant errors from the code
mlxsw: spectrum: Don't return upon error in removal path
i40e: check for and deal with non-contiguous TCs
ixgbe: Re-enable ability to toggle VLAN filtering
ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
...
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroups_proc_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 1e2a7d7849 ("irqdomain: Don't set type when mapping an IRQ")
moved the trigger configuration call from the irqdomain mapping to
the interrupt being actually requested.
This patch failed to handle the case where we configure a chained
interrupt, which doesn't get requested through the usual path.
In order to solve this, let's call __irq_set_trigger just before
starting the cascade interrupt. Special care must be taken to
make the flow handler stick, as the .irq_set_type method could
have reset it (it doesn't know we're dealing with a chained
interrupt).
Based on an initial patch by Jon Hunter.
Fixes: 1e2a7d7849 ("irqdomain: Don't set type when mapping an IRQ")
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>