sched/numa: Do not set preferred_node on migration to a second choice node

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice node
is the preferred node is when the task is part of a workload
that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit: 68d1b02a58f5d9f584c1fb2923ed60ec68cbbd9b
author: Rik van Riel <riel@redhat.com> Fri Apr 11 13:00:29 2014 -0400
committer: Ingo Molnar <mingo@kernel.org> Wed May 07 13:33:47 2014 +0200
tree: 3a2c4afeca2dd9403a3e7e9d646d9067f4bf7d1d
parent: 5085e2a328849bdee6650b32d52c87c3788ab01c
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecea8d9..051903f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

@@ -1301,7 +1301,16 @@
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	sched_setnuma(p, env.dst_nid);
+	/*
+	 * If the task is part of a workload that spans multiple NUMA nodes,
+	 * and is migrating into one of the workload's active nodes, remember
+	 * this node as the task's preferred numa node, so the workload can
+	 * settle down.
+	 * A task that migrated to a second choice node will be better off
+	 * trying for a better one later. Do not set the preferred node here.
+	 */
+	if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
+		sched_setnuma(p, env.dst_nid);
 
 	/*
 	 * Reset the scan period if the task is being rescheduled on an
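
The condition added by this patch can be illustrated outside the kernel with a small user-space model. This is only a sketch under stated assumptions: every `model_*` name, the plain `unsigned int` bitmask, and the 8-node limit are hypothetical stand-ins for `struct numa_group`, `nodemask_t`, `node_isset()` and `sched_setnuma()`, and none of it is the kernel's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit, chosen to match the 8-node case in the commit message. */
#define MODEL_MAX_NODES 8

/* Hypothetical stand-in for the kernel's numa_group: only the mask is modeled. */
struct model_numa_group {
	unsigned int active_nodes;	/* bit n set => node n is active for the workload */
};

/* Hypothetical stand-in for the fields of task_struct that matter here. */
struct model_task {
	struct model_numa_group *numa_group;	/* NULL if the task is not in a group */
	int preferred_nid;			/* models the task's preferred node */
};

/* Simplified analogue of node_isset() on a plain bitmask. */
static bool model_node_isset(int nid, unsigned int mask)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return (mask >> nid) & 1u;
}

/*
 * Mirrors the check added by the patch: remember dst_nid as the preferred
 * node only if the task belongs to a group and dst_nid is one of that
 * group's active nodes. Returns true if the preferred node was updated.
 */
static bool model_update_preferred(struct model_task *p, int dst_nid)
{
	if (p->numa_group &&
	    model_node_isset(dst_nid, p->numa_group->active_nodes)) {
		p->preferred_nid = dst_nid;
		return true;
	}

	/* Second choice node: leave the preference alone so the task can retry. */
	return false;
}
```

In this model, a task whose group is active on nodes 2 and 3 has its preference updated when it migrates to node 3, but not when it lands on node 5. A task with no group never has its preference changed by the migration, which corresponds to the "second choice node" case described in the commit message.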