cpuinfo to decide processor affinity - c

I'm trying to decide on processor affinity rules for my applications according to /proc/cpuinfo. My Red Hat Linux shows:
processor : 0 to 47, meaning the server has 48 processor units
physical id : 0 to 3, meaning the server has 4 CPU sockets
cpu cores : 6, meaning each socket has 6 cores
siblings : 12, meaning each core has 2 hyperthreads
So in total this server has 4 * 6 * 2 = 48 processor units. Am I correct so far?
What I'd like to do is use the sched_setaffinity function. First I'd like to know whether two hyperthreads live in the same core, for example:
processor 0 : physical id : 0, core id : 0 ...
processor 24 : physical id : 0, core id : 0 ...
If in my application I use CPU_SET(0, &mask) in thread1 and CPU_SET(24, &mask) in thread2, then I might be able to say that thread1 and thread2 will share the same L1 cache, and of course the same L2 cache too. Am I correct in this guess?
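For reference, a minimal sketch of the pinning described above, assuming logical CPUs 0 and 24 really are hyperthread siblings (the thread bodies are placeholders, not taken from the original application):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling thread" for sched_setaffinity. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
}

static void *thread1(void *arg) { pin_to_cpu(0);  /* ... receive work ... */ return NULL; }
static void *thread2(void *arg) { pin_to_cpu(24); /* ... process work ... */ return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}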

You can only guarantee fully shared caches if your threads are scheduled on the same core (i.e. on different hyperthreads of that core), in which case your approach is correct.
But keep in mind that scheduling two tasks on the same core will not necessarily make them run faster than scheduling them on different cores: the L3 cache, which is commonly shared among all cores, is very fast.
You need to check how caches are shared among your processors. Most Intel processors share L2 among 2-4 cores and L3 among all cores, while most AMD models only share L3.
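On Linux, one way to check that sharing on a given machine is to read the cache topology the kernel exposes under /sys; a minimal sketch, assuming the usual layout where index0..index3 cover L1d/L1i/L2/L3 (the exact set varies by CPU model):

#include <stdio.h>

int main(void)
{
    char path[128], line[256];
    for (int idx = 0; idx < 4; idx++) {
        /* shared_cpu_list names the logical CPUs that share this cache with CPU 0. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof(line), f))
            printf("index%d shared_cpu_list: %s", idx, line);
        fclose(f);
    }
    return 0;
}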

Related

windows 7 processor interrogation

I have a PC with 24 cores. I have an application that needs to dedicate a thread to one of those cores and the process itself to a few of those cores. The affinity and priorities are currently hard-coded; I would like to programmatically determine what set of cores my application should set its affinity to.
I have read that you should stay away from core 0. I am currently using the last 8 cores of the first CPU for the process and the 12th core for the thread I want to run. Here is sample code which may not be 100% accurate with the parameters:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetProcessAffinityMask(GetCurrentProcess(), 0xFF0);
HANDLE myThread = CreateThread(NULL, 0, entryPoint, NULL, 0, NULL); // default security, stack size, and flags
SetThreadPriority(myThread, THREAD_PRIORITY_TIME_CRITICAL);
SetThreadAffinityMask(myThread, (DWORD_PTR)1 << 11);
I know that even with elevated priorities (even base priority 31) there is no way to dedicate a core to an application (please correct me if I am wrong here, since this is exactly what I want to do; non-programmatic solutions would be fine if they achieved it). That said, the OS itself runs "mostly" on one core or a couple of cores. Is that determined randomly at boot? Can I interrogate the available cores to programmatically determine which set of cores my process and TIME_CRITICAL thread should be running on?
Is there any way to prevent kernel threads from stealing time slices from my TIME_CRITICAL thread?
I understand Windows is not real-time, but I'm doing the best with what I have. The solution needs to apply to Windows 7, but if it is also supported under XP that would be great.
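No answer is recorded here, but for the "interrogate available cores" part, a minimal sketch using GetSystemInfo and GetProcessAffinityMask to build a candidate mask that skips core 0 might look like this (skipping core 0 is only the heuristic mentioned above, not a documented requirement):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    DWORD_PTR processMask, systemMask;

    GetSystemInfo(&si);
    printf("logical processors: %lu\n", si.dwNumberOfProcessors);

    if (GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
        /* Drop bit 0 so the process stays off core 0. */
        DWORD_PTR candidate = systemMask & ~(DWORD_PTR)1;
        printf("system mask: 0x%llx, candidate mask: 0x%llx\n",
               (unsigned long long)systemMask, (unsigned long long)candidate);
        SetProcessAffinityMask(GetCurrentProcess(), candidate);
    }
    return 0;
}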

Run batch file on specific processor

I have a server with dual processors, that is, multiple cores per processor and two physical Xeon processors.
Each process will only run on one processor, which is fine. If you start a multi-threaded app it can only use the cores of one physical processor, not both (a Windows 10 limitation?). I would like to start two instances of the same program so that I can use all the cores on both processors.
How do I start a process from a batch file so that it runs on a specified processor group? I.e. cores 0-16 of processor 1, or cores 0-16 of processor 2?
I've tried:
start /affinity FF file.exe
But that only runs it on cores of one particular processor. I believe I need to set the processor group, but how do I do that using the 'start' command?
I can see you can use hexadecimal masks for the affinity with 'start', but that only seems to work on the cores of the first processor; I can't seem to access the cores of the second processor.
Since there is much confusion over my question, please see below. It's from Task Manager when you try to set an affinity; notice how I have multiple processor groups? That's what I am trying to configure using the 'start' command; '/affinity' only uses cores from group 0.
Judging by your "Processor group" combo, it appears that your system is set to present NUMA nodes, with each physical CPU assigned to a single node. This question talks about how to check that configuration, so assuming that is how you are set up, the command-line flag /node <NUMA index> lets you select which node, so we get:
start /node 1 file.exe
This should start the application on the second NUMA node. Note that you might be able to combine this with the /affinity flag, so to run on just two cores of the first node the following might work:
start /node 0 /affinity 3 file.exe

OpenMP on Kirin 650

Basically, the Kirin 650 is the ARM SoC in Huawei's Honor 5C phone. I am trying to benchmark C++ code directly on the rooted phone I have. The code executes NEON instructions and has OpenMP parallelization. When I call the OpenMP query functions on the phone, they return:
Num Procs = 8
Num Threads = 1
Max Threads = 8
Num Devices = 0
Default Device = 0
Is Initial Device = 1
This information seems coherent, since the Kirin 650 has 8 cores. However, the phone's technical details also specify that the Kirin has 4 cores at up to 1.7 GHz (power-saving cores) and 4 cores at up to 2 GHz (performance cores).
How does OpenMP handle these, since as I understand it the two clusters run asynchronously? With my benchmarks I see speedups when I compare no OpenMP against OpenMP with 2 threads and with 4 threads, but my OpenMP 8-thread results are a disaster.
Do these cores share the same clock (I'm using clock_gettime via the Hayai library to benchmark everything)? Do you have any advice for running OpenMP code on these kinds of platforms?
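For context, a minimal sketch of the query code that would print figures like the above (all calls are standard OpenMP; the values naturally depend on the device):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("Num Procs = %d\n", omp_get_num_procs());
    printf("Num Threads = %d\n", omp_get_num_threads());   /* 1 outside a parallel region */
    printf("Max Threads = %d\n", omp_get_max_threads());
    printf("Num Devices = %d\n", omp_get_num_devices());
    printf("Default Device = %d\n", omp_get_default_device());
    printf("Is Initial Device = %d\n", omp_is_initial_device());
    return 0;
}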

network performance tuning in Linux

I have an application with two threads. Thread1 receives multicast packets from the network card eth1; suppose I use sched_setaffinity to set thread1's CPU affinity to core 1. Thread2 then uses those packets (received by thread1 and stored in global variables on the heap) to do some operations, and I set thread2's CPU affinity to core 7. Suppose cores 1 and 7 are hyperthread siblings of the same physical core; I think the performance would be good, since cores 1 and 7 can share the L1 cache.
I have watched /proc/interrupts and I see that eth1's interrupts land on several CPU cores. So in my case I set thread1's affinity to core 1, but the interrupts happen on many cores. Would that affect performance? Do the packets received from eth1 go directly to main memory no matter which core takes the interrupt?
I don't know much about networking in the Linux kernel; could anyone suggest books or websites covering this topic? Thanks for any comments.
Edit: according to "What Every Programmer Should Know About Memory", section 6.3.5 "Direct Cache Access", I think DCA is what I'd like to know about...
The interrupt will (quite likely) happen on a different core than the one receiving the packet. Depending on how the driver deals with packets, that may or may not matter. If the driver reads the packet (e.g. to make a copy), then it's not ideal, as the cache gets filled on a different CPU. But if the packet is just loaded into memory somewhere using DMA and left there for the software to pick up later, then it doesn't matter [in fact, it's better to have it happen on a different CPU, as "your" CPU gets more time to do other things].
As for hyperthreading, my experience (and that of many others) is that hyperthreading SOMETIMES gives a benefit, but often ends up being similar to not having hyperthreading, because the two threads use the same execution units of the same core. You may want to compare the throughput with both threads set to affinity on the same core as well, to see whether that makes it "better" or "worse". Like most things, it's often the details that make the difference, so your code may be slightly different from someone else's, meaning it works better in one case or the other.
Edit: if your system has multiple sockets, you may also want to ensure that you use a CPU on the socket "nearest" (in the number of QPI/PCI bridge hops) to the network card.
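Following up on that last point, on Linux the NIC's locality can usually be read from sysfs; a minimal sketch, assuming eth1 is a PCI device and the kernel exposes the usual numa_node and local_cpulist attributes (they may be absent or report -1 on single-socket machines):

#include <stdio.h>

/* Print the NUMA node and the "local" CPU list for eth1's underlying device. */
static void show(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f) {
        printf("%s: not available\n", path);
        return;
    }
    if (fgets(line, sizeof(line), f))
        printf("%s: %s", path, line);
    fclose(f);
}

int main(void)
{
    show("/sys/class/net/eth1/device/numa_node");
    show("/sys/class/net/eth1/device/local_cpulist");
    return 0;
}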

cpu cores vs threads

My MacBook Pro, running Boot Camp, has an Intel i7-640M processor, which has 2 cores. Like the other i7 chips, each core is hyperthreaded, so you can have up to 4 threads. I use Visual Studio 2010 C/C++ to determine these:
coresAvailable = omp_get_num_procs();
threadsAvailable = omp_get_max_threads();
The "threadsAvailable" comes back with a value of 4, as expected. But "coresAvailable" also is reported as 4.
What am I missing?
omp_get_num_procs returns the number of CPUs the OS reports, and since a hyperthreaded core reports itself as 2 CPUs, a dual-core hyperthreaded chip reports itself as 4 processors.
omp_get_max_threads returns the maximum number of threads that will be used in a parallel region of code, so it makes sense that this equals the number of CPUs available.
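If you want the physical core count rather than the logical CPU count on Windows, one approach (a sketch independent of the OpenMP runtime) is to count RelationProcessorCore entries returned by GetLogicalProcessorInformation:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* The first call fails with ERROR_INSUFFICIENT_BUFFER and reports the needed size. */
    GetLogicalProcessorInformation(NULL, &len);

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;

    DWORD count = len / sizeof(*info), cores = 0;
    for (DWORD i = 0; i < count; i++)
        if (info[i].Relationship == RelationProcessorCore)
            cores++;

    printf("physical cores: %lu\n", cores);
    free(info);
    return 0;
}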
