Jiri Olsa [Mon, 8 Feb 2021 20:08:54 +0000 (21:08 +0100)]
perf daemon: Add 'list' command
Add a 'list' command to display all running sessions. It's the default
command if no other command is specified.
Example:
# cat ~/.perfconfig
[daemon]
base=/opt/perfdata
[session-cycles]
run = -m 10M -e cycles --overwrite --switch-output -a
[session-sched]
run = -m 20M -e sched:* --overwrite --switch-output -a
Start the daemon:
# perf daemon start
List sessions:
# perf daemon
[771394:daemon] base: /opt/perfdata
[771395:cycles] perf record -m 10M -e cycles --overwrite --switch-output -a
[771396:sched] perf record -m 20M -e sched:* --overwrite --switch-output -a
List sessions with more info:
# perf daemon -v
[771394:daemon] base: /opt/perfdata
output: /opt/perfdata/output
[771395:cycles] perf record -m 10M -e cycles --overwrite --switch-output -a
base: /opt/perfdata/session-cycles
output: /opt/perfdata/session-cycles/output
[771396:sched] perf record -m 20M -e sched:* --overwrite --switch-output -a
base: /opt/perfdata/session-sched
output: /opt/perfdata/session-sched/output
The 'output' file is perf record output for specific session.
Note you have to stop all running perf processes manually at this point,
stop command is coming in following patches.
Committer notes:
Fixup union initialization to overcome this in multiple older systems:
22 15.74 debian:8 : FAIL gcc version 4.9.2 (Debian 4.9.2-10+deb8u2)
builtin-daemon.c: In function 'send_cmd_list':
builtin-daemon.c:1386:2: error: missing initializer for field 'csv_sep' of 'struct <anonymous>' [-Werror=missing-field-initializers]
};
^
builtin-daemon.c:641:8: note: 'csv_sep' declared here
char csv_sep;
^
cc1: all warnings being treated as errors
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Budankov <abudankov@huawei.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210208200908.1019149-11-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 8 Feb 2021 20:08:49 +0000 (21:08 +0100)]
perf daemon: Add client socket support
Add support for client socket side that will be used to send commands to
the daemon server socket.
This patch adds only the core support, all commands using this
functionality are coming in the following patches.
Committer notes:
Hat to patch patch it to deal with this in some systems:
cc1: warnings being treated as errors
builtin-daemon.c: In function 'send_cmd': MKDIR /tmp/build/perf/bench/
builtin-daemon.c:1368: error: ignoring return value of 'fwrite', declared with attribute warn_unused_result
MKDIR /tmp/build/perf/tests/
make[3]: *** [/tmp/build/perf/builtin-daemon.o] Error 1
And also to not leak the 'line' buffer allocated by getline(), since you
initialized line to NULL and len to zero, man page says:
If *lineptr is set to NULL and *n is set 0 before the call,
then getline() will allocate a buffer for storing the line.
This buffer should be freed by the user program even if
getline() failed.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Budankov <abudankov@huawei.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210208200908.1019149-6-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 8 Feb 2021 20:08:47 +0000 (21:08 +0100)]
perf daemon: Add base option
Add a base option allowing the user to specify a base directory. It
will have precedence over config file base definition coming in the
following patches.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Budankov <abudankov@huawei.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210208200908.1019149-4-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 8 Feb 2021 20:08:46 +0000 (21:08 +0100)]
perf daemon: Add config option
Add a config option and base functionality that takes the option
argument (if specified) and other system config locations and produces
an 'acting' config file path.
The actual config file processing is coming in following patches.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Budankov <abudankov@huawei.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210208200908.1019149-3-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 8 Feb 2021 20:08:45 +0000 (21:08 +0100)]
perf daemon: Add daemon command
Add a daemon skeleton with a minimal base (non) functionality, covering
various setup in start command.
Add an initial perf-daemon.txt with basic info.
This is in response to pople asking for the possibility to be able run
record long running sessions on the background.
The patchset that starts with this adds support to configure and run
record sessions on background via new 'perf daemon' command.
This is useful for being able to use perf as a flight recorder that one
can interact with asking for events to be enabled or disabled, added or
removed, etc.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Budankov <abudankov@huawei.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210208200908.1019149-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yang Li [Mon, 8 Feb 2021 08:45:36 +0000 (16:45 +0800)]
perf script: Simplify bool conversion
Fix the following coccicheck warning:
./tools/perf/builtin-script.c:2789:36-41: WARNING: conversion to bool
not needed here
./tools/perf/builtin-script.c:3237:48-53: WARNING: conversion to bool
not needed here
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/1612773936-98691-1-git-send-email-yang.lee@linux.alibaba.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf arm64/s390: Fix printf conversion specifier for IP addresses
We need to use "%#" PRIx64 for u64 values, not "%lx". In arm64's and
s390x cases the compiler doesn't complain, but lets fix this in case
this code gets copied to a 32-bit arch, like with powerpc 32-bit that
got fixed in the previous patch.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Hewenliang <hewenliang4@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Sun, 7 Feb 2021 08:09:35 +0000 (16:09 +0800)]
perf script: Support filtering by hex address
'perf script' supports '-S' or '--symbol' options to only list the
records for these symbols. A symbol is typically a name or hex address.
If it's hex address, it is the start address of one symbol.
While it would be useful if we can filter trace records by any hex
address (not only the start address of symbol). So now we support
filtering trace records by more conditions, such as:
- symbol name
- start address of symbol
- any hexadecimal address
- address range
The comparison order is defined as:
1. symbol name comparison
2. symbol start address comparison.
3. any hexadecimal address comparison.
4. address range comparison.
The idea is if we can get a valid address from -S list, we add the
address to addr_list for address comparison otherwise we still leave
it to sym_list for symbol comparison.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210207080935.31784-2-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Sun, 7 Feb 2021 08:09:34 +0000 (16:09 +0800)]
perf intlist: Change 'struct intlist' int member to 'unsigned long'
This is to let intlist support addresses as its payload.
One potential problem is it can't support negative number. But so far,
there is no such kind of use case.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210207080935.31784-1-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Paul Cercueil [Mon, 8 Feb 2021 18:11:57 +0000 (18:11 +0000)]
perf stat: Use nftw() instead of ftw()
ftw() has been obsolete for about 12 years now.
Committer notes:
Further notes provided by the patch author:
"NOTE: Not runtime-tested, I have no idea what I need to do in perf
to test this. But at least it compiles now with my uClibc-based
toolchain."
I looked at the nftw()/ftw() man page and for the use made with cgroups
in 'perf stat' the end result is equivalent.
Fixes: f52676ff132c ("perf stat: Support regex pattern in --for-each-cgroup") Signed-off-by: Paul Cercueil <paul@crapouillou.net> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: od@zcrc.me Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/20210208181157.1324550-1-paul@crapouillou.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Tue, 2 Feb 2021 20:09:12 +0000 (12:09 -0800)]
perf stat: Support L2 Topdown events
The TMA method level 2 metrics is supported from the Intel Sapphire
Rapids server, which expose four L2 Topdown metrics events to user
space. There are eight L2 events in total. The other four L2 Topdown
metrics events are calculated from the corresponding L1 and the exposed
L2 events.
Now, the --topdown prints the complete top-down metrics that supported
by the CPU. For the Intel Sapphire Rapids server, there are 4 L1 events
and 8 L2 events displyed in one line.
Add a new option, --td-level, to display the top-down statistics that
equal to or lower than the input level.
The L2 event is marked only when both its L1 parent event and itself
crosse the threshold.
Here is an example:
$ perf stat --topdown --td-level=2 --no-metric-only sleep 1
Topdown accuracy may decrease when measuring long periods.
Please print the result regularly, e.g. -I1000
Kan Liang [Tue, 2 Feb 2021 20:09:10 +0000 (12:09 -0800)]
perf report: Support instruction latency
The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both memory latency
(weight) and the new instruction latency information, users can easily
locate the expensive load instructions, and also understand the time
spent in different stages. The users can optimize their applications in
different pipeline stages.
The 'weight' field is shared among different architectures. Reusing the
'weight' field may impacts other architectures. Add a new field to store
the instruction latency.
Like the 'weight' support, introduce a 'ins_lat' for the global
instruction latency, and a 'local_ins_lat' for the local instruction
latency version.
Add new sort functions, INSTR Latency and Local INSTR Latency,
accordingly.
Add local_ins_lat to the default_mem_sort_order[].
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/1612296553-21962-7-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Tue, 2 Feb 2021 20:09:09 +0000 (12:09 -0800)]
perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
The new sample type, PERF_SAMPLE_WEIGHT_STRUCT, is an alternative of the
PERF_SAMPLE_WEIGHT sample type. Users can apply either the
PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample
type to retrieve the sample weight, but they cannot apply both sample
types simultaneously.
The new sample type shares the same space as the PERF_SAMPLE_WEIGHT
sample type. The lower 32 bits are exactly the same for both sample
type. The higher 32 bits may be different for different architecture.
Add arch specific arch_evsel__set_sample_weight() to set the new sample
type for X86. Only store the lower 32 bits for the sample->weight if the
new sample type is applied. In practice, no memory access could last
than 4G cycles. No data will be lost.
If the kernel doesn't support the new sample type. Fall back to the
PERF_SAMPLE_WEIGHT sample type.
There is no impact for other architectures.
Committer notes:
Fixup related to PERF_SAMPLE_CODE_PAGE_SIZE, present in acme/perf/core
but not upstream yet.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Tue, 2 Feb 2021 20:09:08 +0000 (12:09 -0800)]
perf c2c: Support data block and addr block
'perf c2c' is also a memory profiling tool. Apply the two new data
source fields to 'perf c2c' as well.
Extend 'perf c2c' to display the number of loads which blocked by data or
address conflict.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/1612296553-21962-5-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Tue, 2 Feb 2021 20:09:07 +0000 (12:09 -0800)]
perf tools: Support data block and addr block
Two new data source fields, to indicate the block reasons of a load
instruction, are introduced on the Intel Sapphire Rapids server. The
fields can be used by the memory profiling.
Add a new sort function, SORT_MEM_BLOCKED, for the two fields.
For the previous platforms or the block reason is unknown, print "N/A"
for the block reason.
Add blocked as a default mem sort key for perf report and perf mem
report.
Committer testing:
So in machines without this capability we get a "N/A" filling the new "Blocked"
column:
$ perf mem record ls
arch certs CREDITS Documentation include ipc Kconfig lib MAINTAINERS mm samples security usr block
COPYING crypto drivers fs init Kbuild kernel LICENSES Makefile net README scripts sound tools
virt
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (17 samples) ]
$
$ perf mem report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 6 of event 'cpu/mem-loads,ldlat=30/Pu'
# Total weight : 1381
# Sort order : local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked
#
# Overhead Samples Local Weight Memory access Symbol Shared Object Data Symbol Data Object Snoop TLB access Locked Blocked
# ........ ....... ............ .................... ....................... ............. ...................... ............ ..... ............ ...... .......
#
32.87% 1 454 Local RAM or RAM hit [.] _dl_relocate_object ld-2.31.so [.] 0x00007fe91cef3078 libc-2.31.so Hit L1 or L2 hit No N/A
25.56% 1 353 LFB or LFB hit [.] strcmp ld-2.31.so [.] 0x00005586973855ca ls None L1 or L2 hit No N/A
22.59% 1 312 LFB or LFB hit [.] _dl_cache_libcmp ld-2.31.so [.] 0x00007fe91d0e3b18 ld.so.cache None L1 or L2 hit No N/A
8.47% 1 117 LFB or LFB hit [.] _dl_relocate_object ld-2.31.so [.] 0x00007fe91ceee570 libc-2.31.so None L1 or L2 hit No N/A
6.88% 1 95 LFB or LFB hit [.] _dl_relocate_object ld-2.31.so [.] 0x00007fe91ceed490 libc-2.31.so None L1 or L2 hit No N/A
3.62% 1 50 LFB or LFB hit [.] _dl_cache_libcmp ld-2.31.so [.] 0x00007fe91d0ebe60 ld.so.cache None L1 or L2 hit No N/A
Kan Liang [Tue, 2 Feb 2021 20:09:06 +0000 (12:09 -0800)]
perf tools: Support the auxiliary event
On the Intel Sapphire Rapids server, an auxiliary event has to be
enabled simultaneously with the load latency event to retrieve complete
Memory Info.
Add X86 specific perf_mem_events__name() to handle the auxiliary event.
- Users are only interested in the samples of the mem-loads event.
Sample read the auxiliary event.
- The auxiliary event must be in front of the load latency event in a
group. Assume the second event to sample if the auxiliary event is the
leader.
- Add a weak is_mem_loads_aux_event() to check the auxiliary event for
X86. For other ARCHs, it always return false.
Parse the unique event name, mem-loads-aux, for the auxiliary event.
Committer notes:
According to 23dcad95692abf5d ("perf/x86/intel: Add perf core PMU
support for Sapphire Rapids"), ENODATA is only returned by
sys_perf_event_open() when used with these auxiliary events, with this
in evsel__open_strerror():
case ENODATA:
return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
"Please add an auxiliary event in front of the load latency event.");
This is Ok at this point in time, but fragile long term, I pointed this
out in the e-mail thread, requesting a follow up patch to check if
ENODATA is really for this specific case.
Fixed up sizeof(MEM_LOADS_AUX_NAME) bug pointed out by Namhyung.
Kan Liang [Tue, 2 Feb 2021 20:09:05 +0000 (12:09 -0800)]
tools headers uapi: Update tools's copy of linux/perf_event.h
To get the changes in these csets:
a543bea20a230495 ("perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT") 23dcad95692abf5d ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
This cures the following warning during perf's build:
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Committer notes:
Picked by hand as I had already merged the MMAP buildid patch that also touches
perf_event.h and is also only in {acme,tip}/perf/core, not yet upstream.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/1612296553-21962-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Athira Rajeev [Wed, 3 Feb 2021 06:55:37 +0000 (01:55 -0500)]
perf powerpc: Support exposing Performance Monitor Counter SPRs as part of extended regs
To enable presenting of Performance Monitor Counter Registers (PMC1 to
PMC6) as part of extended regsiters, this patch adds these to
sample_reg_mask in the tool side (to use with -I? option).
Simplified the PERF_REG_PMU_MASK_300/31 definition. Excluded the
unsupported SPRs (MMCR3, SIER2, SIER3) from extended mask value for
CPU_FTR_ARCH_300.
Signed-off-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jianlin Lv [Wed, 3 Feb 2021 14:57:02 +0000 (22:57 +0800)]
perf probe: Add protection to avoid endless loop
if dwarf_offdie() returns NULL, the continue statement forces the next
iteration of the loop without updating the 'off' variable. It will cause
an endless loop in the process of traversing the compile unit. So add
exception protection for looping CUs.
Signed-off-by: Jianlin Lv <Jianlin.Lv@arm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: jianlin.lv@arm.com Link: http://lore.kernel.org/lkml/20210203145702.1219509-1-Jianlin.Lv@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
symbol__inc_addr_samples() calls annotated_source__alloc_histograms ()
to allocate memory for sample histograms using calloc(). Here calloc()
fails since the size of symbol is huge. The size of a symbol is
calculated as difference between its start and end address.
Example histogram allocation that fails is:
sym->name is _end
sym->start is 0xc0000000027a0000
sym->end is 0xc008000003890000
symbol__size(sym) is 0x80000010f0000
In the above case, the difference between sym->start
(0xc0000000027a0000) and sym->end (0xc008000003890000) is huge.
This is same problem as in s390 and arm64 which are fixed in commits:
b2a165847da1 ("perf annotate: Fix s390 gap between kernel end and module start") c47b14d55722 ("perf symbols: Fix arm64 gap between kernel start and module end")
When this symbol was read first, its start and end address was set to
address which matches with data from /proc/kallsyms.
Perf calls function symbols__fixup_end() which sets the end of symbol to
0xc008000003890000, which is the next address and this is the start
address of first module (icmp_checkentry in above) which will make the
huge symbol size of 0x80000010f0000.
On powerpc, kernel text segment is located at 0xc000000000000000 whereas
the modules are located at very high memory addresses,
0xc00800000xxxxxxx. Since the gap between end of kernel text segment and
beginning of first module's address is high, histogram allocation using
calloc fails.
Fix this by detecting the kernel's last symbol and limiting the range of
last kernel symbol to pagesize.
Signed-off-by: Athira Rajeev<atrajeev@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Tested-By: Kajol Jain <kjain@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1609208054-1566-1-git-send-email-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch fixes "perf inject --jit" to properly operate on
namespaced/containerized processes:
* jitdump files are generated by the process, thus they should be
looked up in its mount NS.
* DSOs of injected MMAP events will later be looked up in the process
mount NS, so write them into its NS.
* PIDs & TIDs from jitdump events need to be translated to the PID as
seen by "perf record" before written into MMAP events.
For a process in a different PID NS, the TID & PID given in the jitdump
event are actually ignored; I use the TID & PID of the thread which
mmap()ed the jitdump file. This is simplified and won't do for forks of
the initial process, if they continue using the same jitdump file.
Future patches might improve it.
This was tested by recording a NodeJS process running with
"--perf-prof", inside a Docker container, and by recording another
NodeJS process running in the same namespaces as perf itself, to make
sure it's not broken for non-containerized processes.
Signed-off-by: Yonatan Goldschmidt <yonatan.goldschmidt@granulate.io> Acked-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20201105015604.1726943-1-yonatan.goldschmidt@granulate.io Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Tue, 2 Feb 2021 09:01:18 +0000 (18:01 +0900)]
perf tools: Use scandir() to iterate threads when synthesizing PERF_RECORD_ events
Like in __event__synthesize_thread(), I think it's better to use
scandir() instead of the readdir() loop. In case some malicious task
continues to create new threads, the readdir() loop will run over and
over to collect tids. The scandir() also has the problem but the window
is much smaller since it doesn't do much work during the iteration.
Also add filter_task() function as we only care the tasks.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210202090118.2008551-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Tue, 2 Feb 2021 09:01:17 +0000 (18:01 +0900)]
perf tools: Skip PERF_RECORD_MMAP event synthesis for kernel threads
To synthesize information to resolve sample IPs, it needs to scan task
and mmap info from the /proc filesystem. For each process, it opens
(and reads) status and maps file respectively. But as kernel threads
don't have memory maps so we can skip the maps file.
To find kernel threads, check "VmPeak:" line in /proc/<PID>/status file.
It's about the peak virtual memory usage so only user-level tasks have
that. Note that it's possible to miss the line due to partial reads.
So we should double-check if it's a really kernel thread when there's no
VmPeak line.
Thus check "Threads:" line (which follows the VmPeak line whether or not
it exists) to be sure it's read enough data - just in case of deeply
nested pid namespaces or large number of supplementary groups are
involved.
Namhyung Kim [Tue, 2 Feb 2021 09:01:16 +0000 (18:01 +0900)]
perf tools: Use /proc/<PID>/task/<TID>/status for PERF_RECORD_ event synthesis
To save memory usage, it needs to reduce the number of entries in the
proc filesystem. It's using /proc/<PID>/task directory to traverse
threads in the process and then kernel creates /proc/<PID>/task/<TID>
entries.
After that it checks the thread info using the /proc/<TID>/status file
rather than /proc/<PID>/task/<TID>/status. As far as I can see, they
are the same and contain all the info we need.
Using the latter eliminates the unnecessary /proc/<TID> entry. This can
be useful especially a large number of threads are used in the system.
In my experiment around 1KB of memory on average was saved for each
thread (which is not a thread group leader).
To do this, pass both pid and tid to perf_event_prepare_comm() if it
knows them. In case it doesn't know, passing 0 as pid will do the old
way.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210202090118.2008551-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
John Garry [Thu, 28 Jan 2021 12:00:36 +0000 (20:00 +0800)]
perf vendor events arm64: Reference common and uarch events for A76
Reduce duplication in the JSONs by referencing standard events from
armv8-common-and-microarch.json
In general the "PublicDescription" fields are not modified when somewhat
significantly worded differently than the standard.
Apart from that, description and names for events slightly different to
standard are changed (to standard) for consistency.
Signed-off-by: John Garry <john.garry@huawei.com> Acked-by: Will Deacon <will@kernel.org> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Nakamura, Shunsuke/中村 俊介 <nakamura.shun@fujitsu.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxarm@openeuler.org Link: https://lore.kernel.org/r/1611835236-34696-5-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com> Acked-by: Will Deacon <will@kernel.org> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Nakamura, Shunsuke/中村 俊介 <nakamura.shun@fujitsu.com> Cc: mathieu.poirier@linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: linuxarm@openeuler.org Link: https://lore.kernel.org/r/1611835236-34696-4-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com> Acked-by: Will Deacon <will@kernel.org> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Nakamura, Shunsuke/中村 俊介 <nakamura.shun@fujitsu.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxarm@openeuler.org Link: https://lore.kernel.org/r/1611835236-34696-3-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210124232750.19170-2-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fixes: 61a795e89e89d00d ("perf symbols: Avoid unnecessary symbol loading when dso list is specified") Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20210128131209.GD775562@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Thu, 21 Jan 2021 13:37:52 +0000 (05:37 -0800)]
perf stat: Add Topdown metrics events as default events
The Topdown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. From the Ice Lake and later platforms, the
Topdown information can be retrieved from the dedicated "metrics"
register, which isn't impacted by other events. Also, the Topdown
metrics support both per thread/process and per core measuring. Adding
Topdown metrics events as default events can enrich the default
measuring information, and would not cost any extra multiplexing.
Introduce arch_evlist__add_default_attrs() to allow architecture
specific default events. Add the Topdown metrics events in the X86
specific arch_evlist__add_default_attrs(). Other architectures can add
their own default events later separately.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20210121133752.118327-1-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Event duration_time in a metric expression requires special handling.
Improve test coverage by including a metric whose expression includes
duration_time. The actual metric is a copied from the L1D_Cache_Fill_BW
metric on my broadwell machine.
Signed-off-by: John Garry <john.garry@huawei.com> Acked-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linuxarm@openeuler.org Link: http://lore.kernel.org/lkml/1611578842-5749-1-git-send-email-john.garry@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Tue, 2 Feb 2021 18:26:09 +0000 (10:26 -0800)]
Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
Current release - regressions:
- ip_tunnel: fix mtu calculation
- mlx5: fix function calculation for page trees
Previous releases - regressions:
- vsock: fix the race conditions in multi-transport support
- neighbour: prevent a dead entry from updating gc_list
- dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Previous releases - always broken:
- bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
cgroup getsockopt infra when user space is trying to race against
optlen, from Loris Reiff.
- mac80211: fix station rate table updates on assoc
- r8169: work around RTL8125 UDP HW bug
- igc: report speed and duplex as unknown when device is runtime
suspended
- rxrpc: fix deadlock around release of dst cached on udp tunnel"
* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
net: ipa: fix two format specifier errors
net: ipa: use the right accessor in ipa_endpoint_status_skip()
net: ipa: be explicit about endianness
net: ipa: add a missing __iomem attribute
net: ipa: pass correct dma_handle to dma_free_coherent()
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net: mvpp2: TCAM entry enable should be written after SRAM data
net: lapb: Copy the skb before sending a packet
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
net: ip_tunnel: fix mtu calculation
vsock: fix the race conditions in multi-transport support
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
ibmvnic: device remove has higher precedence over reset
...
Andreas Oetken [Tue, 2 Feb 2021 09:03:04 +0000 (10:03 +0100)]
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
sup_multicast_addr is passed to ether_addr_equal for address comparison
which casts the address inputs to u16 leading to an unaligned access.
Aligning the sup_multicast_addr to u16 boundary fixes the issue.
Jakub Kicinski [Tue, 2 Feb 2021 16:51:25 +0000 (08:51 -0800)]
Merge tag 'mlx5-fixes-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2021-02-01
Please note the first patch in this series
("Fix function calculation for page trees") is fixing a regression
due to previous fix in net which you didn't include in your previous
rc pr. So I hope this series will make it into your next rc pr,
so mlx5 won't be broken in the next rc.
* tag 'mlx5-fixes-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
====================
Jakub Kicinski [Tue, 2 Feb 2021 16:48:17 +0000 (08:48 -0800)]
Merge branch 'net-ipa-a-few-bug-fixes'
Alex Elder says:
====================
net: ipa: a few bug fixes
This series fixes four minor bugs. The first two are things that
sparse points out. All four are very simple and each patch should
explain itself pretty well.
====================
Alex Elder [Mon, 1 Feb 2021 23:26:08 +0000 (17:26 -0600)]
net: ipa: use the right accessor in ipa_endpoint_status_skip()
When extracting the destination endpoint ID from the status in
ipa_endpoint_status_skip(), u32_get_bits() is used. This happens to
work, but it's wrong: the structure field is only 8 bits wide
instead of 32.
Fix this by using u8_get_bits() to get the destination endpoint ID.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Mon, 1 Feb 2021 23:26:07 +0000 (17:26 -0600)]
net: ipa: be explicit about endianness
Sparse warns that the assignment of the metadata mask for a QMAP
endpoint in ipa_endpoint_init_hdr_metadata_mask() is a bad
assignment. We know we want the mask value to be big endian, even
though the value we write is in host byte order. Use a __force
tag to indicate we really mean it.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Mon, 1 Feb 2021 20:50:56 +0000 (21:50 +0100)]
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
So far phy_disconnect() is called before free_irq(). If CONFIG_DEBUG_SHIRQ
is set and interrupt is shared, then free_irq() creates an "artificial"
interrupt by calling the interrupt handler. The "link change" flag is set
in the interrupt status register, causing phylib to eventually call
phy_suspend(). Because the net_device is detached from the PHY already,
the PHY driver can't recognize that WoL is configured and powers down the
PHY.
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS
control message is passed with user-controlled
0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition.
The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400.
So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10
is the closest limit, with 0x10 leftover.
Same condition is currently done in rds_cmsg_rdma_args().
Xie He [Mon, 1 Feb 2021 05:57:06 +0000 (21:57 -0800)]
net: lapb: Copy the skb before sending a packet
When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.
In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.
The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sents the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.
Jakub Kicinski [Tue, 2 Feb 2021 16:37:00 +0000 (08:37 -0800)]
Merge tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
- station rate tables were not updated correctly
after association, leading to bad configuration
- rtl8723bs (staging) was initializing data incorrectly
after the previous fix and needed to move the init
later
* tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
mac80211: fix station rate table updates on assoc
====================
net/mlx5e: Update max_opened_tc also when channels are closed
max_opened_tc is used for stats, so that potentially non-zero stats
won't disappear when num_tc decreases. However, mlx5e_setup_tc_mqprio
fails to update it in the flow where channels are closed.
This commit fixes it. The new value of priv->channels.params.num_tc is
always checked on exit. In case of errors it will just be the old value,
and in case of success it will be the updated value.
Daniel Jurgens [Mon, 1 Feb 2021 16:11:10 +0000 (18:11 +0200)]
net/mlx5: Fix function calculation for page trees
The function calculation always results in a value of 0. This works
generally, but when the release all pages feature is enabled it will
result in crashes.
Fixes: 9bb0189d6ae5 ("net/mlx5: Maintain separate page trees for ECPF and PF functions") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jakub Kicinski [Tue, 2 Feb 2021 04:23:44 +0000 (20:23 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-02-01
This series contains updates to igc and i40e drivers.
Kai-Heng Feng fixes igc to report unknown speed and duplex during suspend
as an attempted read will cause errors.
Kevin Lo sets the default value to -IGC_ERR_NVM instead of success for
writing shadow RAM as this could miss a timeout. Also propagates the return
value for Flow Control configuration to properly pass on errors for igc.
Aleksandr reverts commit 346f948a5385 ("i40e: don't report link up for a VF
who hasn't enabled queues") as this can cause link flapping.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
i40e: Revert "i40e: don't report link up for a VF who hasn't enabled queues"
igc: check return value of ret_val in igc_config_fc_after_link_up
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
igc: Report speed and duplex as unknown when device is runtime suspended
====================
Dongseok Yi [Fri, 29 Jan 2021 23:13:27 +0000 (08:13 +0900)]
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.
A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.
Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.
Vadim Fedorenko [Fri, 29 Jan 2021 22:27:47 +0000 (01:27 +0300)]
net: ip_tunnel: fix mtu calculation
dev->hard_header_len for tunnel interface is set only when header_ops
are set too and already contains full overhead of any tunnel encapsulation.
That's why there is not need to use this overhead twice in mtu calc.
Fixes: c5ed160e7b57 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly") Reported-by: Slava Bacherikov <mail@slava.cc> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Popov [Mon, 1 Feb 2021 08:47:19 +0000 (11:47 +0300)]
vsock: fix the race conditions in multi-transport support
There are multiple similar bugs implicitly introduced by the
commit e29ab509ea9a91e6 ("vsock: add multi-transports support") and
commit 6373aba309c5f8ff ("vsock: prevent transport modules unloading").
The bug pattern:
[1] vsock_sock.transport pointer is copied to a local variable,
[2] lock_sock() is called,
[3] the local variable is used.
VSOCK multi-transport support introduced the race condition:
vsock_sock.transport value may change between [1] and [2].
Let's copy vsock_sock.transport pointer to local variables after
the lock_sock() call.
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
Commit a9ee80847d0d ("net: sched: introduce and use qdisc tree flush/purge helpers")
introduced qdisc tree flush/purge helpers, but erroneously used flush helper
instead of purge helper in qdisc_replace function.
This issue was found in our CI, that tests various qdisc setups by configuring
qdisc and sending data through it. Call of invalid helper sporadically leads
to corruption of vt_tree/cf_tree of hfsc_class that causes kernel oops:
Lijun Pan [Fri, 29 Jan 2021 04:34:01 +0000 (22:34 -0600)]
ibmvnic: device remove has higher precedence over reset
Returning -EBUSY in ibmvnic_remove() does not actually hold the
removal procedure since driver core doesn't care for the return
value (see __device_release_driver() in drivers/base/dd.c
calling dev->bus->remove()) though vio_bus_remove
(in arch/powerpc/platforms/pseries/vio.c) records the
return value and passes it on. [1]
During the device removal precedure, checking for resetting
bit is dropped so that we can continue executing all the
cleanup calls in the rest of the remove function. Otherwise,
it can cause latent memory leaks and kernel crashes.
[1] https://lore.kernel.org/linuxppc-dev/20210117101242.dpwayq6wdgfdzirl@pengutronix.de/T/#m48f5befd96bc9842ece2a3ad14f4c27747206a53 Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Fixes: 76c528960bab ("ibmvnic: Do not process device remove during device reset") Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Link: https://lore.kernel.org/r/20210129043402.95744-1-ljp@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DENG Qingfang [Sat, 30 Jan 2021 13:43:34 +0000 (21:43 +0800)]
net: dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Having multiple destination ports for a unicast address does not make
sense.
Make port_db_load_purge override existent unicast portvec instead of
adding a new port bit.
Fixes: a943b2a0b2ac ("net: dsa: mv88e6xxx: handle multiple ports in ATU") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20210130134334.10243-1-dqfext@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VF queues were not brought up when PF was brought up after being
downed if the VF driver disabled VFs queues during PF down.
This could happen in some older or external VF driver implementations.
The problem was that PF driver used vf->queues_enabled as a condition
to decide what link-state it would send out which caused the issue.
Remove the check for vf->queues_enabled in the VF link notify.
Now VF will always be notified of the current link status.
Also remove the queues_enabled member from i40e_vf structure as it is
not used anymore. Otherwise VNF implementation was broken and caused
a link flap.
The original commit was a workaround to avoid breaking existing VFs though
it's really a fault of the VF code not the PF. The commit should be safe to
revert as all of the VFs we know of have been fixed. Also, since we now
know there is a related bug in the workaround, removing it is preferred.
Fixes: 346f948a5385 ("i40e: don't report link up for a VF who hasn't enabled") Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Mon, 1 Feb 2021 19:15:57 +0000 (11:15 -0800)]
Merge tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"The rockship rkisp1 driver will be promoted from staging in 5.11.
While not too late, do a few uAPI changes which are needed to better
support its functionalities"
* tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: rockchip: rkisp1: extend uapi array sizes
media: rockchip: rkisp1: carry ip version information
media: rockchip: rkisp1: reduce number of histogram grid elements in uapi
media: rkisp1: stats: mask the hist_bins values
media: rkisp1: stats: remove a wrong cast to u8
media: rkisp1: uapi: change hist_bins array type from __u16 to __u32
Hans de Goede [Mon, 1 Feb 2021 15:29:56 +0000 (16:29 +0100)]
staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
Commit 73c6fe37ed09 ("staging: rtl8723bs: fix wireless regulatory API
misuse") moved the wiphy_apply_custom_regulatory() call to earlier in the
driver's init-sequence, so that it gets called before wiphy_register().
But at this point in time the eFuses which code the regulatory-settings
for the chip have not been read by the driver yet, causing
_rtw_reg_apply_flags() to set the IEEE80211_CHAN_DISABLED flag on *all*
channels.
On the device where I initially tested the fix, a Jumper EZpad 7 tablet,
this does not cause any problems because shortly after init the
rtw_reg_notifier() gets called fixing things up. I guess this happens
into response to receiving a (broadcast) packet with regulatory info
from the access-point ?
But on another device with a RTL8723BS wifi chip, an Acer Switch 10E
(SW3-016), the rtw_reg_notifier() never gets called. I assume that some
fuse has been set on this device to ignore regulatory info received from
access-points.
This means that on the Acer the driver is stuck in a state with all
channels disabled, leading to non working Wifi.
We cannot move the wiphy_apply_custom_regulatory() call back, because
that call must be made before the wiphy_register() call.
Instead move the entire rtw_wdev_alloc() call to after the Efuses have
been read, fixing all channels being disabled in the initial channel-map.
Fixes: 73c6fe37ed09 ("staging: rtl8723bs: fix wireless regulatory API misuse") Cc: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20210201152956.370186-2-hdegoede@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Kevin Lo [Thu, 7 Jan 2021 06:10:38 +0000 (14:10 +0800)]
igc: check return value of ret_val in igc_config_fc_after_link_up
Check return value from ret_val to make error check actually work.
Fixes: 3a56c17c0297 ("igc: Add setup link functionality") Signed-off-by: Kevin Lo <kevlo@kevlo.org> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kevin Lo [Sun, 20 Dec 2020 14:18:19 +0000 (22:18 +0800)]
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
This patch sets the default return value to -IGC_ERR_NVM in
igc_write_nvm_srwr. Without this change it wouldn't lead to a shadow RAM
write EEWR timeout.
Fixes: d6c58def7f21 ("igc: Add NVM support") Signed-off-by: Kevin Lo <kevlo@kevlo.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kai-Heng Feng [Wed, 2 Dec 2020 07:50:17 +0000 (15:50 +0800)]
igc: Report speed and duplex as unknown when device is runtime suspended
Similar to commit 778be3147ac3 ("igb: Report speed and duplex as unknown
when device is runtime suspended"), if we try to read speed and duplex
sysfs while the device is runtime suspended, igc will complain and
stops working:
The more generic approach will be wrap get_link_ksettings() with begin()
and complete() callbacks, and calls runtime resume and runtime suspend
routine respectively. However, igc is like igb, runtime resume routine
uses rtnl_lock() which upper ethtool layer also uses.
So to prevent a deadlock on rtnl, take a different approach, use
pm_runtime_suspended() to avoid reading register while device is runtime
suspended.
Felix Fietkau [Mon, 1 Feb 2021 08:33:24 +0000 (09:33 +0100)]
mac80211: fix station rate table updates on assoc
If the driver uses .sta_add, station entries are only uploaded after the sta
is in assoc state. Fix early station rate table updates by deferring them
until the sta has been uploaded.
Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210201083324.3134-1-nbd@nbd.name
[use rcu_access_pointer() instead since we won't dereference here] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Linus Torvalds [Sun, 31 Jan 2021 19:48:12 +0000 (11:48 -0800)]
Merge tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore. And a
cleanup ontop.
This was originally going to go during the next merge window but
people can already trigger a build error with binutils-2.36 which
doesn't emit section symbols - something which objtool relies on - so
let's expedite it"
* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
x86/entry: Emit a symbol for register restoring thunk
Linus Torvalds [Sun, 31 Jan 2021 19:40:57 +0000 (11:40 -0800)]
Merge tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A fix for handling advertised, but non-existent 146818 RTCs correctly.
With the recent UIP handling changes the time readout of non-existent
RTCs hangs forever as the read returns always 0xFF which means the UIP
bit is set.
Sanity check the RTC before registering by checking the RTC_VALID
register for correctness"
* tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtc: mc146818: Detect and handle broken RTCs
Linus Torvalds [Sun, 31 Jan 2021 19:39:32 +0000 (11:39 -0800)]
Merge tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull single stepping fix from Thomas Gleixner:
"A single fix for the single step reporting regression caused by
getting the condition wrong when moving SYSCALL_EMU away from TIF
flags"
[ There's apparently another problem too, fix pending ]
* tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Unbreak single step reporting behaviour
Linus Torvalds [Sun, 31 Jan 2021 19:37:43 +0000 (11:37 -0800)]
Merge tag 'powerpc-5.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One fix for a bug in our soft interrupt masking, which could lead to
interrupt replaying recursing, causing spurious interrupts.
Thanks to Nicholas Piggin"
* tag 'powerpc-5.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: prevent recursive replay_soft_interrupts causing superfluous interrupt
Linus Torvalds [Sun, 31 Jan 2021 19:19:12 +0000 (11:19 -0800)]
Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- SUNRPC: Handle 0 length opaque XDR object data properly
- Fix a layout segment leak in pnfs_layout_process()
- pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
- pNFS/NFSv4: Improve rejection of out-of-order layouts
- pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Handle 0 length opaque XDR object data properly
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
pNFS/NFSv4: Improve rejection of out-of-order layouts
pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
Linus Walleij [Sat, 2 Jan 2021 23:15:10 +0000 (00:15 +0100)]
leds: rt8515: Add Richtek RT8515 LED driver
This adds a driver for the Richtek RT8515 dual channel
torch/flash white LED driver.
This LED driver is found in some mobile phones from
Samsung such as the GT-S7710 and GT-I8190.
A V4L interface is added.
We do not have a proper datasheet for the RT8515 but
it turns out that RT9387A has a public datasheet and
is essentially the same chip. We designed the driver
in accordance with this datasheet. The day someone
needs to drive a RT9387A this driver can probably
easily be augmented to handle that chip too.
Sakari Ailus, Pavel Machek and Andy Shevchenko helped
significantly in getting this driver right.
Cc: Sakari Ailus <sakari.ailus@iki.fi> Cc: newbytee@protonmail.com Cc: Stephan Gerhold <stephan@gerhold.net> Cc: linux-media@vger.kernel.org Cc: phone-devel@vger.kernel.org Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Pavel Machek <pavel@ucw.cz>
Andrea Righi [Wed, 25 Nov 2020 15:18:22 +0000 (16:18 +0100)]
leds: trigger: fix potential deadlock with libata
We have the following potential deadlock condition:
========================================================
WARNING: possible irq lock inversion dependency detected
5.10.0-rc2+ #25 Not tainted
--------------------------------------------------------
swapper/3/0 just changed the state of lock: ffff8880063bd618 (&host->lock){-...}-{2:2}, at: ata_bmdma_interrupt+0x27/0x200
but this lock took another, HARDIRQ-READ-unsafe lock in the past:
(&trig->leddev_list_lock){.+.?}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
This lockdep splat is reported after:
commit 6836e0c29878 ("locking: More accurate annotations for read_lock()")
To clarify:
- read-locks are recursive only in interrupt context (when
in_interrupt() returns true)
- after acquiring host->lock in CPU1, another cpu (i.e. CPU2) may call
write_lock(&trig->leddev_list_lock) that would be blocked by CPU0
that holds trig->leddev_list_lock in read-mode
- when CPU1 (ata_ac_complete()) tries to read-lock
trig->leddev_list_lock, it would be blocked by the write-lock waiter
on CPU2 (because we are not in interrupt context, so the read-lock is
not recursive)
- at this point if an interrupt happens on CPU0 and
ata_bmdma_interrupt() is executed it will try to acquire host->lock,
that is held by CPU1, that is currently blocked by CPU2, so:
* CPU0 blocked by CPU1
* CPU1 blocked by CPU2
* CPU2 blocked by CPU0
*** DEADLOCK ***
The deadlock scenario is better represented by the following schema
(thanks to Boqun Feng <boqun.feng@gmail.com> for the schema and the
detailed explanation of the deadlock condition):
CPU 0: CPU 1: CPU 2:
----- ----- -----
led_trigger_event():
read_lock(&trig->leddev_list_lock);
<workqueue>
ata_hsm_qc_complete():
spin_lock_irqsave(&host->lock);
write_lock(&trig->leddev_list_lock);
ata_port_freeze():
ata_do_link_abort():
ata_qc_complete():
ledtrig_disk_activity():
led_trigger_blink_oneshot():
read_lock(&trig->leddev_list_lock);
// ^ not in in_interrupt() context, so could get blocked by CPU 2
<interrupt>
ata_bmdma_interrupt():
spin_lock_irqsave(&host->lock);
Fix by using read_lock_irqsave/irqrestore() in led_trigger_event(), so
that no interrupt can happen in between, preventing the deadlock
condition.
Apply the same change to led_trigger_blink_setup() as well, since the
same deadlock scenario can also happen in power_supply_update_bat_leds()
-> led_trigger_blink() -> led_trigger_blink_setup() (workqueue context),
and potentially prevent other similar usages.
Linus Torvalds [Sun, 31 Jan 2021 01:51:06 +0000 (17:51 -0800)]
Merge tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four cifs patches found in additional testing of the conversion to the
new mount API: three small option processing ones, and one fixing domain
based DFS referrals"
* tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix dfs domain referrals
cifs: returning mount parm processing errors correctly
cifs: fix mounts to subdirectories of target
cifs: ignore auto and noauto options if given
Linus Torvalds [Sun, 31 Jan 2021 01:42:42 +0000 (17:42 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes in drivers. Both changing strings (one in a comment,
one in a module help text) with no code impact"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix description for parameter ql2xenforce_iocb_limit
scsi: target: iscsi: Fix typo in comment
Linus Torvalds [Sat, 30 Jan 2021 19:48:57 +0000 (11:48 -0800)]
Merge tag 's390-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix max number of VCPUs reported via ultravisor information sysfs
interface.
- Fix memory leaks during vfio-ap resources clean up on KVM pointer
invalidation notification.
- Fix potential specification exception by avoiding unnecessary
interrupts disable after queue reset in vfio-ap.
* tag 's390-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: uv: Fix sysfs max number of VCPUs reporting
s390/vfio-ap: No need to disable IRQ after queue reset
s390/vfio-ap: clean up vfio_ap resources when KVM pointer invalidated
Chinmay Agarwal [Wed, 27 Jan 2021 16:54:54 +0000 (22:24 +0530)]
neighbour: Prevent a dead entry from updating gc_list
Following race condition was detected:
<CPU A, t0> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead.
<CPU B, t1> - Executing: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process()
calls __neigh_lookup() which takes a reference on neighbour entry 'n'.
<CPU A, t2> - Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t2,
'n' couldn't be destroyed.
<CPU B, t3> - Moves further along, arp_process() and calls
neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
the neighbour entry back in gc_list(neigh_mark_dead(), removed it
earlier in t0 from gc_list)
This leads to 'n' still being part of gc_list, but the actual
neighbour structure has been freed.
The situation can be prevented from happening if we disallow a dead
entry to have any possibility of updating gc_list. This is what the
patch intends to achieve.
Fixes: 6547d8e2146f ("neighbor: Fix locking order for gc_list changes") Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>