Merge coregroup support into next
From Srikar's cover letter, with some reformatting:
Cleanup of existing powerpc topologies and add coregroup support on
powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.
Summary of some of the testing done with coregroup patchset.
It includes ebizzy, schbench, perf bench sched pipe and topology
verification. On the left side are results from powerpc/next tree and
on the right are the results with the patchset applied. Topological
verification clearly shows that there is no change in topology with
and without the patches on all the 3 class of systems that were
tested.
Power 9 PowerNV (2 Node/ 160 Cpu System)
----------------------------------------
Baseline Baseline + Coregroup Support
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 993884 1276090 1173476 1165914 54867.201 100 910470 1279820 1171095 1162091 67363.28
^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0th: 455 50.0th: 454
75.0th: 533 75.0th: 543
90.0th: 683 90.0th: 701
95.0th: 743 95.0th: 737
*99.0th: 815 *99.0th: 805
99.5th: 839 99.5th: 835
99.9th: 913 99.9th: 893
min=0, max=1011 min=0, max=2833
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
Total time: 6.083 [sec] Total time: 6.303 [sec]
6.083576 usecs/op 6.303318 usecs/op
164377 ops/sec 158646 ops/sec
Power 9 LPAR (2 Node/ 128 Cpu System)
-------------------------------------
Baseline Baseline + Coregroup Support
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 1058029 1295393 1200414 1188306.7 56786.538 100 943264 1287619 1180522 1168473.2 64469.955
^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0000th: 34 50.0000th: 39
75.0000th: 46 75.0000th: 52
90.0000th: 53 90.0000th: 68
95.0000th: 56 95.0000th: 77
*99.0000th: 61 *99.0000th: 89
99.5000th: 63 99.5000th: 94
99.9000th: 81 99.9000th: 169
min=0, max=8405 min=0, max=23674
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
Total time: 8.768 [sec] Total time: 5.217 [sec]
8.768400 usecs/op 5.217625 usecs/op
114045 ops/sec 191658 ops/sec
Power 8 LPAR (8 Node/ 256 Cpu System)
-------------------------------------
Baseline Baseline + Coregroup Support
N Min Max Median Avg Stddev N Min Max Median Avg Stddev
100 1267615 1965234 1707423 1689137.6 144363.29 100 1175357 1924262 1691104 1664792.1 145876.4
^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
schbench (latency hence lower is better)
Latency percentiles (usec) Latency percentiles (usec)
50.0th: 37 50.0th: 36
75.0th: 51 75.0th: 48
90.0th: 59 90.0th: 55
95.0th: 63 95.0th: 59
*99.0th: 71 *99.0th: 67
99.5th: 75 99.5th: 72
99.9th: 105 99.9th: 170
min=0, max=18560 min=0, max=27031
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
Total time: 6.013 [sec] Total time: 5.930 [sec]
6.013963 usecs/op 5.930724 usecs/op
166279 ops/sec 168613 ops/sec
Topology verification on Power9
Power9 / powernv / SMT4
$ tail /proc/cpuinfo
cpu : POWER9, altivec supported
clock : 3600.000000MHz
revision : 2.2 (pvr 004e 1202)
timebase : 512000000
platform : PowerNV
model : 9006-22P
machine : PowerNV 9006-22P
firmware : OPAL
MMU : Radix
Baseline Baseline + Coregroup Support
lscpu lscpu
------ ------
Architecture: ppc64le Architecture: ppc64le
Byte Order: Little Endian Byte Order: Little Endian
CPU(s): 160 CPU(s): 160
On-line CPU(s) list: 0-159 On-line CPU(s) list: 0-159
Thread(s) per core: 4 Thread(s) per core: 4
Core(s) per socket: 20 Core(s) per socket: 20
Socket(s): 2 Socket(s): 2
NUMA node(s): 2 NUMA node(s): 2
Model: 2.2 (pvr 004e 1202) Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported Model name: POWER9, altivec supported
CPU max MHz: 3800.0000 CPU max MHz: 3800.0000
CPU min MHz: 2166.0000 CPU min MHz: 2166.0000
L1d cache: 32K L1d cache: 32K
L1i cache: 32K L1i cache: 32K
L2 cache: 512K L2 cache: 512K
L3 cache: 10240K L3 cache: 10240K
NUMA node0 CPU(s): 0-79 NUMA node0 CPU(s): 0-79
NUMA node8 CPU(s): 80-159 NUMA node8 CPU(s): 80-159
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
----------------------------------------------------- -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
/proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------ ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801
Baseline
head /proc/schedstat
--------------------
version 15
timestamp 4295043536
cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 4941435230 11106132 1583
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Baseline + Coregroup Support
head /proc/schedstat
--------------------
version 15
timestamp 4296311826
cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Post sudo ppc64_cpu --smt=1 Post sudo ppc64_cpu --smt=1
--------------------- ---------------------
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
----------------------------------------------------- -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------ ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801
Baseline:
head /proc/schedstat
--------------------
version 15
timestamp 4295046242
cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 5408663896 95701034 7697
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Baseline + Coregroup Support
head /proc/schedstat
--------------------
version 15
timestamp 4296314905
cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and
Power 9 (2 node 128 Cpu LPAR) and they showed the topology before and
after the patch to be identical. If Interested, I could provide the
same.
On Power 9 (with device-tree enablement to show coregroups):
$ tail /proc/cpuinfo
processor : 127
cpu : POWER9 (architected), altivec supported
clock : 3000.000000MHz
revision : 2.2 (pvr 004e 0202)
timebase : 512000000
platform : pSeries
model : IBM,9008-22L
machine : CHRP IBM,9008-22L
MMU : Hash
Before patchset:
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA
$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
After patchset:
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA
$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index b727f5f..041f0b9 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -28,6 +28,7 @@
extern int boot_cpuid;
extern int spinning_secondaries;
extern u32 *cpu_to_phys_id;
+extern bool coregroup_enabled;
extern void cpu_die(void);
extern int cpu_to_chip_id(int cpu);
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300..660917491 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
#if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
extern int find_and_online_cpu_nid(int cpu);
+extern int cpu_to_coregroup_id(int cpu);
#else
static inline int find_and_online_cpu_nid(int cpu)
{
return 0;
}
+static inline int cpu_to_coregroup_id(int cpu)
+{
+#ifdef CONFIG_SMP
+ return cpu_to_core_id(cpu);
+#else
+ return 0;
+#endif
+}
+
#endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
#include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8261999..3d96752d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -75,17 +75,28 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
struct task_struct *secondary_current;
bool has_big_cores;
+bool coregroup_enabled;
DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
EXPORT_PER_CPU_SYMBOL(cpu_core_map);
EXPORT_SYMBOL_GPL(has_big_cores);
+enum {
+#ifdef CONFIG_SCHED_SMT
+ smt_idx,
+#endif
+ cache_idx,
+ mc_idx,
+ die_idx,
+};
+
#define MAX_THREAD_LIST_SIZE 8
#define THREAD_GROUP_SHARE_L1 1
struct thread_groups {
@@ -789,10 +800,6 @@ static int init_cpu_l1_cache_map(int cpu)
if (err)
goto out;
- zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
- GFP_KERNEL,
- cpu_to_node(cpu));
-
cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
if (unlikely(cpu_group_start == -1)) {
@@ -801,6 +808,9 @@ static int init_cpu_l1_cache_map(int cpu)
goto out;
}
+ zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+ GFP_KERNEL, cpu_to_node(cpu));
+
for (i = first_thread; i < first_thread + threads_per_core; i++) {
int i_group_start = get_cpu_thread_group_start(i, &tg);
@@ -819,6 +829,74 @@ static int init_cpu_l1_cache_map(int cpu)
return err;
}
+static bool shared_caches;
+
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static int powerpc_smt_flags(void)
+{
+ int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
+
+ if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+ printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+ flags |= SD_ASYM_PACKING;
+ }
+ return flags;
+}
+#endif
+
+/*
+ * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
+ * This topology makes it *much* cheaper to migrate tasks between adjacent cores
+ * since the migrated task remains cache hot. We want to take advantage of this
+ * at the scheduler level so an extra topology level is required.
+ */
+static int powerpc_shared_cache_flags(void)
+{
+ return SD_SHARE_PKG_RESOURCES;
+}
+
+/*
+ * We can't just pass cpu_l2_cache_mask() directly because
+ * returns a non-const pointer and the compiler barfs on that.
+ */
+static const struct cpumask *shared_cache_mask(int cpu)
+{
+ return per_cpu(cpu_l2_cache_map, cpu);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static const struct cpumask *smallcore_smt_mask(int cpu)
+{
+ return cpu_smallcore_mask(cpu);
+}
+#endif
+
+static struct cpumask *cpu_coregroup_mask(int cpu)
+{
+ return per_cpu(cpu_coregroup_map, cpu);
+}
+
+static bool has_coregroup_support(void)
+{
+ return coregroup_enabled;
+}
+
+static const struct cpumask *cpu_mc_mask(int cpu)
+{
+ return cpu_coregroup_mask(cpu);
+}
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+ { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+ { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+ { cpu_mc_mask, SD_INIT_NAME(MC) },
+ { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { NULL, },
+};
+
static int init_big_cores(void)
{
int cpu;
@@ -861,6 +939,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
GFP_KERNEL, cpu_to_node(cpu));
zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
+ if (has_coregroup_support())
+ zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
+ GFP_KERNEL, cpu_to_node(cpu));
+
+#ifdef CONFIG_NEED_MULTIPLE_NODES
/*
* numa_node_id() works after this.
*/
@@ -869,6 +952,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
set_cpu_numa_mem(cpu,
local_memory_node(numa_cpu_lookup_table[cpu]));
}
+#endif
}
/* Init the cpumasks so the boot CPU is related to itself */
@@ -876,6 +960,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
+ if (has_coregroup_support())
+ cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+
init_big_cores();
if (has_big_cores) {
cpumask_set_cpu(boot_cpuid,
@@ -1132,9 +1219,23 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
int i;
l2_cache = cpu_to_l2cache(cpu);
- if (!l2_cache)
- return false;
+ if (!l2_cache) {
+ struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+ /*
+ * If no l2cache for this CPU, assume all siblings to share
+ * cache with this CPU.
+ */
+ if (has_big_cores)
+ sibling_mask = cpu_smallcore_mask;
+
+ for_each_cpu(i, sibling_mask(cpu))
+ set_cpus_related(cpu, i, cpu_l2_cache_mask);
+
+ return false;
+ }
+
+ cpumask_set_cpu(cpu, mask_fn(cpu));
for_each_cpu(i, cpu_online_mask) {
/*
* when updating the marks the current CPU has not been marked
@@ -1166,6 +1267,8 @@ static void remove_cpu_from_masks(int cpu)
set_cpus_unrelated(cpu, i, cpu_sibling_mask);
if (has_big_cores)
set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
+ if (has_coregroup_support())
+ set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
}
}
#endif
@@ -1217,42 +1320,54 @@ static void add_cpu_to_masks(int cpu)
* add it to it's own thread sibling mask.
*/
cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+ cpumask_set_cpu(cpu, cpu_core_mask(cpu));
for (i = first_thread; i < first_thread + threads_per_core; i++)
if (cpu_online(i))
set_cpus_related(i, cpu, cpu_sibling_mask);
add_cpu_to_smallcore_masks(cpu);
- /*
- * Copy the thread sibling mask into the cache sibling mask
- * and mark any CPUs that share an L2 with this CPU.
- */
- for_each_cpu(i, cpu_sibling_mask(cpu))
- set_cpus_related(cpu, i, cpu_l2_cache_mask);
update_mask_by_l2(cpu, cpu_l2_cache_mask);
- /*
- * Copy the cache sibling mask into core sibling mask and mark
- * any CPUs on the same chip as this CPU.
- */
- for_each_cpu(i, cpu_l2_cache_mask(cpu))
- set_cpus_related(cpu, i, cpu_core_mask);
+ if (has_coregroup_support()) {
+ int coregroup_id = cpu_to_coregroup_id(cpu);
- if (pkg_id == -1)
+ cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
+ for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
+ int fcpu = cpu_first_thread_sibling(i);
+
+ if (fcpu == first_thread)
+ set_cpus_related(cpu, i, cpu_coregroup_mask);
+ else if (coregroup_id == cpu_to_coregroup_id(i))
+ set_cpus_related(cpu, i, cpu_coregroup_mask);
+ }
+ }
+
+ if (pkg_id == -1) {
+ struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+ /*
+ * Copy the sibling mask into core sibling mask and
+ * mark any CPUs on the same chip as this CPU.
+ */
+ if (shared_caches)
+ mask = cpu_l2_cache_mask;
+
+ for_each_cpu(i, mask(cpu))
+ set_cpus_related(cpu, i, cpu_core_mask);
+
return;
+ }
for_each_cpu(i, cpu_online_mask)
if (get_physical_package_id(i) == pkg_id)
set_cpus_related(cpu, i, cpu_core_mask);
}
-static bool shared_caches;
-
/* Activate a secondary processor. */
void start_secondary(void *unused)
{
unsigned int cpu = smp_processor_id();
- struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
mmgrab(&init_mm);
current->active_mm = &init_mm;
@@ -1278,14 +1393,20 @@ void start_secondary(void *unused)
/* Update topology CPU masks */
add_cpu_to_masks(cpu);
- if (has_big_cores)
- sibling_mask = cpu_smallcore_mask;
/*
* Check for any shared caches. Note that this must be done on a
* per-core basis because one core in the pair might be disabled.
*/
- if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
- shared_caches = true;
+ if (!shared_caches) {
+ struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+ struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+ if (has_big_cores)
+ sibling_mask = cpu_smallcore_mask;
+
+ if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+ shared_caches = true;
+ }
set_numa_node(numa_cpu_lookup_table[cpu]);
set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1311,64 +1432,19 @@ int setup_profiling_timer(unsigned int multiplier)
return 0;
}
-#ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymetric SMT dependancy */
-static int powerpc_smt_flags(void)
+static void fixup_topology(void)
{
- int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
-
- if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
- printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
- flags |= SD_ASYM_PACKING;
+#ifdef CONFIG_SCHED_SMT
+ if (has_big_cores) {
+ pr_info("Big cores detected but using small core scheduling\n");
+ powerpc_topology[smt_idx].mask = smallcore_smt_mask;
}
- return flags;
-}
#endif
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
-};
-
-/*
- * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
- * This topology makes it *much* cheaper to migrate tasks between adjacent cores
- * since the migrated task remains cache hot. We want to take advantage of this
- * at the scheduler level so an extra topology level is required.
- */
-static int powerpc_shared_cache_flags(void)
-{
- return SD_SHARE_PKG_RESOURCES;
+ if (!has_coregroup_support())
+ powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
}
-/*
- * We can't just pass cpu_l2_cache_mask() directly because
- * returns a non-const pointer and the compiler barfs on that.
- */
-static const struct cpumask *shared_cache_mask(int cpu)
-{
- return cpu_l2_cache_mask(cpu);
-}
-
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *smallcore_smt_mask(int cpu)
-{
- return cpu_smallcore_mask(cpu);
-}
-#endif
-
-static struct sched_domain_topology_level power9_topology[] = {
-#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
- { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
- { NULL, },
-};
-
void __init smp_cpus_done(unsigned int max_cpus)
{
/*
@@ -1382,24 +1458,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
dump_numa_cpu_topology();
-#ifdef CONFIG_SCHED_SMT
- if (has_big_cores) {
- pr_info("Big cores detected but using small core scheduling\n");
- power9_topology[0].mask = smallcore_smt_mask;
- powerpc_topology[0].mask = smallcore_smt_mask;
- }
-#endif
- /*
- * If any CPU detects that it's sharing a cache with another CPU then
- * use the deeper topology that is aware of this sharing.
- */
- if (shared_caches) {
- pr_info("Using shared cache scheduler topology\n");
- set_sched_topology(power9_topology);
- } else {
- pr_info("Using standard scheduler topology\n");
- set_sched_topology(powerpc_topology);
- }
+ fixup_topology();
+ set_sched_topology(powerpc_topology);
}
#ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 481951a..b725fb6 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -897,7 +897,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
static void __init find_possible_nodes(void)
{
struct device_node *rtas;
- u32 numnodes, i;
+ const __be32 *domains;
+ int prop_length, max_nodes;
+ u32 i;
if (!numa_enabled)
return;
@@ -906,25 +908,31 @@ static void __init find_possible_nodes(void)
if (!rtas)
return;
- if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
- min_common_depth, &numnodes)) {
- /*
- * ibm,current-associativity-domains is a fairly recent
- * property. If it doesn't exist, then fallback on
- * ibm,max-associativity-domains. Current denotes what the
- * platform can support compared to max which denotes what the
- * Hypervisor can support.
- */
- if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
- min_common_depth, &numnodes))
+ /*
+ * ibm,current-associativity-domains is a fairly recent property. If
+ * it doesn't exist, then fallback on ibm,max-associativity-domains.
+ * Current denotes what the platform can support compared to max
+ * which denotes what the Hypervisor can support.
+ */
+ domains = of_get_property(rtas, "ibm,current-associativity-domains",
+ &prop_length);
+ if (!domains) {
+ domains = of_get_property(rtas, "ibm,max-associativity-domains",
+ &prop_length);
+ if (!domains)
goto out;
}
- for (i = 0; i < numnodes; i++) {
+ max_nodes = of_read_number(&domains[min_common_depth], 1);
+ for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);
}
+ prop_length /= sizeof(int);
+ if (prop_length > min_common_depth + 2)
+ coregroup_enabled = 1;
+
out:
of_node_put(rtas);
}
@@ -1237,6 +1245,31 @@ int find_and_online_cpu_nid(int cpu)
return new_nid;
}
+int cpu_to_coregroup_id(int cpu)
+{
+ __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+ int index;
+
+ if (cpu < 0 || cpu > nr_cpu_ids)
+ return -1;
+
+ if (!coregroup_enabled)
+ goto out;
+
+ if (!firmware_has_feature(FW_FEATURE_VPHN))
+ goto out;
+
+ if (vphn_get_associativity(cpu, associativity))
+ goto out;
+
+ index = of_read_number(associativity, 1);
+ if (index > min_common_depth + 1)
+ return of_read_number(&associativity[index - 1], 1);
+
+out:
+ return cpu_to_core_id(cpu);
+}
+
static int topology_update_init(void)
{
topology_inited = 1;