Does pinning a process to a CPU core or an SMP node help reduce cache coherency traffic?


It is possible to pin a process to a specific set of CPU cores using the sched_setaffinity() call. The manual page says: "Restricting a process to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a process ceases to execute on one CPU and then recommences execution on a different CPU." That much is almost obvious (or is it?).

What is not so obvious to me is this: does pinning LWPs to a specific CPU or an SMP node reduce cache coherency bus traffic? For example, since the process runs pinned, other CPUs should not be modifying its private memory, so only the CPUs that are part of the same SMP node would need to stay cache-coherent.
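For reference, the pinning itself looks roughly like this (a minimal sketch; the choice of core 0 is arbitrary and error handling is kept short):

    /* Minimal sketch: pin the calling process to CPU core 0
     * using sched_setaffinity(). Core 0 is an arbitrary choice. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);           /* allow only core 0 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* ... from here on, the scheduler keeps us on core 0 ... */
        return EXIT_SUCCESS;
    }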

Tags: linux, performance, cpu, cpu-architecture, cpu-cache. Asked Feb 27 at 14:15 by Vlad Lazarenko.

There should be no CPU socket-to-socket coherency traffic for the pinned process case you describe. Modern Xeon platforms implement snoop filtering in the chipset. The snoop filter indicates when a remote socket cannot have the cache line in question, thus avoiding the need to send cache invalidate messages to that socket.

You can measure this for yourself. Xeon processors implement a large variety of cache statistics counters. You can read the counters in your own code with the rdpmc instruction, or just use a product like VTune.

FYI, using rdpmc is very precise, but a little tricky, since you initially have to set a bit in CR4 to allow using the instruction in user mode.

EDIT: My answer above is outdated for the 55xx series of CPUs, which use QPI links. These links interconnect CPU sockets directly, without an intervening chipset, as in: ark.intel.com/products/37111/Intel-Xeon-... However, since the L3 cache in each CPU is inclusive, snoops over the QPI links occur only when the local L3 cache indicates the line is nowhere in the local socket.

Likewise, the remote socket's L3 can quickly respond to a cross-snoop without bothering the cores, assuming the line isn't there either. So the inclusive L3 caches should minimize inter-socket coherency overhead; it's just not due to a chipset snoop filter in your case.
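If you would rather not deal with raw rdpmc, a lighter-weight option on Linux is the perf_event_open syscall. The sketch below counts the generic PERF_COUNT_HW_CACHE_MISSES event around a region of interest; that generic event is an assumption here, standing in for the model-specific snoop/coherency counters a tool like VTune exposes:

    /* Sketch: count last-level cache misses for a code region via
     * perf_event_open. The generic LLC-miss event is a stand-in for
     * model-specific coherency counters. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* glibc provides no wrapper for perf_event_open */
    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES; /* generic LLC-miss event */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... code under test goes here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("LLC misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }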

If you run on a NUMA system (like an Opteron server or an Itanium), pinning makes sense, but you must be sure to bind the process to the same NUMA node that it allocates memory from. Otherwise it is an anti-optimization. Note that any NUMA-aware operating system will try to keep execution and memory on the same node anyway, even if you tell it nothing at all, to the best of its abilities (some older versions of Windows are rather poor at this, but I wouldn't expect that to be the case with recent Linux).
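A minimal sketch of that execution-plus-memory binding using libnuma (build with -lnuma); the choice of node 0 is arbitrary:

    /* Sketch: keep execution and allocation on the same NUMA node.
     * Build: cc file.c -lnuma. Node 0 is an arbitrary choice. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* run on node 0 and allocate from node 0's local memory,
         * so execution and data stay together */
        numa_run_on_node(0);
        size_t len = 1 << 20;
        char *buf = numa_alloc_onnode(len, 0);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        /* ... work on buf with node-local latency ... */

        numa_free(buf, len);
        return 0;
    }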

If you don't run on a NUMA system, binding a process to a particular core is about the worst thing you can do. The OS will not make processes bounce between CPUs for fun, and if a process must be moved to another CPU, that is not ideal, but the world does not end either. It happens rarely, and when it does, you will hardly be able to tell.

On the other hand, if the process is bound to a CPU and another CPU is idle, the OS cannot use it... that is 100% available processing power gone down the drain.

I am using a Xeon X5570, which is somewhat NUMA. Say I have 4 cores per CPU and 4 CPUs. The question is: will a cache invalidation request be sent to the other CPUs if I run 4 threads constantly busy-waiting with memory fences, but those LWPs are pinned to 4 cores of the same CPU? If so, who ensures that other CPUs cannot theoretically modify the same memory, so that this optimization can be applied? I know some people do that with DMA and device drivers, but how about user-space Linux apps? – Vlad Lazarenko Feb 27 at 14:49

For the 4 cores on the same CPU, I don't see why this should be a problem at all (the last-level cache is shared on the Xeon 55xx anyway, and the CPU does whatever it needs to do to keep L1 in sync just fine; I've never noticed any issue with that). For cores on one of the other 3 CPUs, there's the snoop filter, as pointed out by srking. You won't normally want to busy-wait all the time anyway (that makes little sense), so performance issues should not matter too much either way. – Damon Feb 27 at 15:52

Of course it depends very much on your application, but when one busy-waits, that's usually for very short times between longer tasks, so there is usually not that much contention anyway, even with 16 threads. Thus, any cache consistency effects you might see, even hypothetically, won't really matter much. If they do, you must really be spending most of your time spinning, and then something is wrong from the beginning. – Damon Feb 27 at 15:56

@Damon, load balancing behavior in common OSs isn't all that great. They do tend to bounce processes around between CPUs unnecessarily. – srking Feb 27 at 17:43
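To make the pinned busy-wait scenario from these comments concrete, here is a minimal sketch using pthread_setaffinity_np. It assumes cores 0 through 3 sit on the same socket, which varies by machine; check the "physical id" field in /proc/cpuinfo before relying on that mapping.

    /* Sketch: four busy-waiting threads pinned to cores 0..3.
     * Assumption: those cores share one socket (verify via
     * /proc/cpuinfo). Build: cc -pthread file.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>

    /* flag the threads spin on; the seq_cst atomic ops supply fences */
    static atomic_int go;

    static void *spin(void *arg)
    {
        long core = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);        /* pin this thread to one core */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        while (!atomic_load(&go))   /* busy-wait with an implied fence */
            ;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, spin, (void *)i);

        atomic_store(&go, 1);       /* release the spinners */
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }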
