Is there a function or any other way to know, programmatically, which core of which processor a given thread of my program (pid) is running on? Both OpenMP and Pthreads solutions would help me, if possible. Thanks.
I think on Linux one can try sched_getcpu().
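For instance, a minimal sketch (Linux/glibc; sched_getcpu() may need _GNU_SOURCE, and the result is only a snapshot, since the scheduler can migrate the thread right after the call):

```c
#define _GNU_SOURCE   /* for sched_getcpu() on glibc */
#include <sched.h>
#include <stdio.h>

int main(void) {
    int cpu = sched_getcpu();   /* core this thread is on right now */
    if (cpu == -1)
        perror("sched_getcpu");
    else
        printf("running on CPU %d\n", cpu);
    return 0;
}
```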
This is going to be platform-specific, I would think. On Windows you can use NtGetCurrentProcessorNumber, but it comes with the caveat that it may disappear in a future release.
I expect this is hard to do, because there's nothing to stop the thread being moved to a new core at any time (in most apps, anyway). As soon as you get the result, it could be out of date.
For pthreads, I think sched_getaffinity() is at least part of the solution. Not sure exactly how pthreads names the CPUs and cores, though.
This is hard to do portably, as the answer depends both on hardware and OS.
The hardware locality library (hwloc) is a new tool which allows you to query CPU/core/thread information (and set affinity bindings) in an OS/hardware-agnostic way. It supports a huge list of hardware and OSes, and so should add a lot of portability to these sorts of queries. Once you have mapped out your system's topology, hwloc_get_last_cpu_location will return the CPU the thread last ran on, where "CPU" can mean a core or a hardware thread.
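A minimal hwloc sketch (based on its documented API; link with -lhwloc):

```c
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    hwloc_topology_t topology;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char *str;

    /* Map out the machine's topology first. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Ask where the calling thread last ran. */
    if (hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_THREAD) == 0) {
        hwloc_bitmap_asprintf(&str, set);   /* bitmap of PU indexes */
        printf("last ran on PU(s): %s\n", str);
        free(str);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topology);
    return 0;
}
```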
Suppose an embedded-systems project where I have a multicore ARM processor (to make it simple, assume 2 cores with no cache shared between them). Suppose my system contains one critical task and several non-critical tasks.
Can I therefore assign the critical task exclusively to "core 1", and all the others exclusively to "core 2"?
If so, how do I do it, and what are the best practices from an implementation point of view (assume I use C)? Should I use a library (if so, which one)? An RTOS?
OK, I see that you asked this over on the EE board as well. They gave the same answer I want to give you: use an operating system of some sort to handle thread affinities. If your RTOS (or whatever you have) does not support this, then look into how it actually handles process/thread scheduling.
Typically, each CPU on a system will be assigned some sort of thread that handles scheduling of tasks. This thread is one of the first things that an OS sets up. Feel free to research some micro kernels out there to see how this is done for your particular processor. You can also find the secret sauce for setting up this thread in the ARM documentation for your particular CPU.
But I am going out on a limb and assuming this is far, far beyond the scope of any assignment given to you for a project. I would hope that affinity support of some sort is built into what you were given. Setting up affinity on a known OS is a few-seconds task, as sketched below. Setting up affinity on a bare-metal system with no OS at all is much more involved.
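To give a flavour of that "few seconds" case on a known OS, here is a minimal Linux/glibc sketch using pthread_setaffinity_np() (the two task functions are hypothetical placeholders; note that pinning these threads does not by itself keep the rest of the system off core 0, which on Linux would need something like the isolcpus boot option):

```c
#define _GNU_SOURCE            /* for CPU_ZERO/CPU_SET and pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single core (Linux-specific, non-portable). */
static int pin_to_core(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof set, &set);
}

static void *critical_task(void *arg)   { (void)arg; /* ... */ return NULL; }
static void *background_task(void *arg) { (void)arg; /* ... */ return NULL; }

int main(void) {
    pthread_t crit, other;

    pthread_create(&crit, NULL, critical_task, NULL);
    pthread_create(&other, NULL, background_task, NULL);

    pin_to_core(crit, 0);    /* critical task on core 1 (index 0) */
    pin_to_core(other, 1);   /* everything else on core 2 (index 1) */

    pthread_join(crit, NULL);
    pthread_join(other, NULL);
    return 0;
}
```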
Original question:
https://electronics.stackexchange.com/questions/356225/multicore-arm-how-to-assign-a-critical-task-to-one-dedicated-core#comment854845_356225
If you don't need real-time functionality, you can do this on a device with a Linux kernel without too much hassle.
See this question here
A few years ago, when I was doing my degree, my teacher told me that if I wrote an infinite loop in C it would crash my computer, making it spend all its processor resources on nothing, and that I would need to reboot the system to make things good again. Today I tested the same situation on my Windows 7 computer and saw that my computer didn't crash and my processor was just sitting "idle". What has changed in the last 5 years to make this specific behaviour different?
An infinite loop will only "crash" the OS if the OS doesn't support preemptive multitasking. In any decent OS the scheduler will make that process take a break once in a while and allow other stuff to run.
At any rate, if the resource usage is low, look at the generated code: the compiler might have done something smart, like optimizing the whole loop away.
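As a quick experiment (a minimal sketch; exact behaviour depends on compiler and OS), the following busy loop keeps one core at 100% while a modern preemptive OS stays responsive; the volatile counter also stops the optimizer from simplifying the loop body away:

```c
int main(void) {
    /* volatile forces a real memory update each iteration, so the
     * optimizer cannot collapse the loop body into nothing. */
    volatile unsigned long counter = 0;

    /* Pins one core at ~100%, but a preemptive scheduler still
     * time-slices other processes: the machine stays usable and
     * the process can be killed in the usual ways. */
    for (;;)
        counter++;
}
```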
Your teacher told you something that wasn't true to begin with, so it isn't surprising that it doesn't happen.
At most, an infinite loop will make your CPU go to 100% but on any modern operating system other processes will still get time slices and you can easily kill it. An OS would not be of much use if a simple mistake by a programmer made the whole machine hang so easily.
Multi-core processors are in common use now, unlike 8 years ago, which means that a single infinite-looping process would only tie up one core nowadays and leave the rest free to do other work. Even so, you'd have to be running a pretty lousy operating system for a single busy-looping process to tie up the whole system.
Windows has had a preemptive multitasker since Windows 95. Even on a single-CPU box, one looping thread would still leave the box usable (though slower), and certainly usable enough to shut down the offending process in the usual way, or to start Task Manager and kill it off.
To truly bork your box, raise the thread and process priorities to real-time and create as many busy threads as there are cores. (Save your work first.)
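For the curious, a minimal Win32 sketch of that experiment (deliberately pathological; really do save your work first):

```c
#include <windows.h>

/* One busy thread per core, never yielding. */
static DWORD WINAPI spin(LPVOID arg) {
    (void)arg;
    for (;;) { }
}

int main(void) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);   /* number of logical processors */

    /* Real-time priority class: the scheduler now favours this
     * process over almost everything, including much of the UI. */
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    for (DWORD i = 0; i < si.dwNumberOfProcessors; i++) {
        HANDLE h = CreateThread(NULL, 0, spin, NULL, 0, NULL);
        SetThreadPriority(h, THREAD_PRIORITY_TIME_CRITICAL);
    }

    Sleep(INFINITE);
    return 0;
}
```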
I am thinking about an idea where a legacy application needs to run at full performance on a Core i7 CPU. Is there any Linux software/utility to combine all the cores for that application, so it can process at higher performance than it gets from only 1 core?
The application is readpst, and it only uses 1 core for processing Outlook PST files.
It's OK if I can't use all the cores; it would be fine to use, say, 3 of them.
Possible? Or am I drunk?
I will rewrite it to use multiple cores, if my C knowledge of multi-process forking is up to it.
Intel Nehalem-based CPUs (i7, i5, i3) already do this to an extent.
By using their Turbo Boost mode, when only a single core is in use it is automatically overclocked until the power and temperature limits are reached.
The newer versions of the i7 (the 2000-series chips) do this even better.
Read this, and this.
"Possible? or am i drunk?"
You're drunk! If this was easy in the general case, Intel would have built it into the processors by now!
What you're looking for is called a 'Single System Image', or SSI. There is scant information on the internet about people doing such a thing, as it tends to be reserved for supercomputing (and perhaps servers).
http://en.wikipedia.org/wiki/Single_system_image
No, the application needs to be multi-threaded to use more than one core. You're of course free to write a multi-threaded version of that application if you wish, but it may not be easy to make sure the different threads don't mess each other up.
If you want it to utilize multiple cores then you could write a multi-threaded version of your program, but only if the work is actually parallelizable. You said you were reading from PST files; take care not to run into I/O bottlenecks.
A great library for working with threads, mutexes, semaphores and so on is POSIX Threads.
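A minimal pthreads sketch of that shape (the worker function and its "share of the input" are hypothetical placeholders; in a real port of readpst a share might be one PST file or one folder; compile with -pthread):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 3   /* e.g. the "3 cores" mentioned above */

/* Hypothetical per-worker job. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("worker %d processing its share of the input\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_WORKERS];
    int ids[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);   /* wait for all workers */

    return 0;
}
```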
There isn't such an application available, but it is possible in principle.
When an OS runs in a VM, the hypervisor could use a few CPUs to identify which parts of the code could run in parallel rather than sequentially, and then actually execute them on a few other CPUs at once.
Then, whenever the executing CPUs go idle (because they finished their work faster than the manager can provide them with new work), they could start calculating the next second's worth of instructions.
The reason this would need to be done at the hypervisor level, and not within the OS, is memory locking: inside the OS it wouldn't be possible.
I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html on a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or on which cores of each node) the different threads have run?
In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the information and getting it back?
What are the best tools for debugging OpenMP programs?
What is the best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job-management system, that may provide the information you want in its log files. Failing that, you could probably insert calls into your threads to check the value of some environment variable. What that is called and how you do this is platform-dependent; I'll leave figuring it out up to you.
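As one concrete, Linux-specific variation on that suggestion (using sched_getcpu() rather than an environment variable), each OpenMP thread can report the core it happens to be on; compile with -fopenmp:

```c
#define _GNU_SOURCE   /* for sched_getcpu() on glibc */
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* Only a snapshot: the scheduler may migrate the thread
         * immediately after this call. */
        printf("OpenMP thread %d is on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```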
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, OS, run-time system, and so on. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer: in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier, but also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used the parallel capabilities of Intel's debugger, but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often out-dated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spend adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know; the placement of threads on different cores is handled entirely by the OS. You speak about nodes, but OpenMP is a multi-thread (not multi-process) parallelization model that allows parallelization only within one machine containing several cores. If you need parallelization across different machines, you have to use a multi-process system like OpenMPI.
The orders of magnitude of communication speeds are:
very fast for communications between cores inside the same CPU; these can be considered instantaneous
~10 GB/s for communications between two CPUs across a motherboard
~100-1000 MB/s for network communications between nodes, depending on the hardware
All the theoretical speeds should be given in your hardware specifications. You should also run small benchmarks to find out what you will really get.
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on a supercomputer; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; the machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.
An Intel Core 2 Duo, for example, is supposed to have a single die but two cores.
So, it should be possible to control what is processed on which core, which means that it is possible to instruct my algorithm to use the two cores in parallel.
The question is how?
Do I need to go down at the kernel level to do this, or is there a simpler way? To be more concrete, what does it take to implement a dual-core-merge-sort?
Judging by your past questions, I'd say you're looking to implement in C/C++, but I believe the answer is roughly the same regardless of language.
If you want to parallelize any operation, make it multithreaded. You can have as many parallel, concurrent threads as you have cores.
Here's a related question:
How to implement divide and conquer algorithms in C# using multithreading?
As I understand it, binding a particular thread to a core or processor is called processor affinity. It's generally not a good idea, because the purpose of the operating system is to juggle threads between processors. It's unlikely that you'll do a better job of this than the OS can.
To implement an algorithm that takes advantage of multiple cores, consider OpenMP.
Of course, algorithms that have strong data dependencies may not parallelize well.
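Here is a minimal OpenMP merge-sort sketch (my own illustration, not from any of the linked answers): the recursion spawns tasks down to a depth cutoff, so on a dual-core machine the two halves of the array sort in parallel. Tasks need OpenMP 3.0 (gcc 4.4 or later, with -fopenmp):

```c
#include <stdio.h>
#include <string.h>

/* Merge sorted runs a[lo..mid) and a[mid..hi) via the scratch array tmp. */
static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof *a);
}

/* Sort a[lo..hi); spawn tasks while depth > 0, so depth 1 gives
 * two parallel subtrees, one per core on a dual-core CPU. */
static void msort(int *a, int *tmp, int lo, int hi, int depth) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        #pragma omp task shared(a, tmp)
        msort(a, tmp, lo, mid, depth - 1);
        msort(a, tmp, mid, hi, depth - 1);
        #pragma omp taskwait        /* both halves done before merging */
    } else {
        msort(a, tmp, lo, mid, 0);
        msort(a, tmp, mid, hi, 0);
    }
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int a[] = {9, 4, 7, 1, 8, 2, 6, 3, 5, 0};
    int n = (int)(sizeof a / sizeof *a);
    int tmp[sizeof a / sizeof *a];

    #pragma omp parallel
    #pragma omp single          /* one thread starts the recursion */
    msort(a, tmp, 0, n, 1);

    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```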
It sounds like you are looking to assign threads to each of the cores.
This is a detailed description of what can be done and how to do it.
Processor affinity
Hope this helps.
A Specimen of Parallel Programming: Parallel Merge Sort Implementation. And here is one in Erlang. For more precise answers, you have to ask a more precise question.
It depends on what programming language you want to use to achieve this. For example, for:
Python - use multiprocessing or the Parallel Python library.
C#/.NET - use parallel framework for multi-core apps.
newLISP - just use spawn()/other stuff for same goal.
C - you may use POSIX threads and such
etc., etc., etc....
Just choose language and look for parallel processing abilities in that language.
Good luck!
While POSIX threads (pthreads) are probably a good place to start, they are not the only option.
Multithreading in C isn't actually trivial; hence, I'd advise fork().
Start one worker process per CPU, each running your merge sort on a subsection of the data, and then reassemble the results in the manager (parent) process.
When working towards a parallel solution, I consider forks before threads, since they're easier to implement and you get a quick preliminary result. Once that works, you might want to take some time to work with pthreads.
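A minimal sketch of that fork-based approach (my own illustration; a shared anonymous mmap makes the children's sorted halves visible to the parent, and qsort stands in for the per-worker sort):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int init[] = {7, 3, 5, 1, 8, 2, 6, 4};
    size_t n = sizeof init / sizeof *init, mid = n / 2;

    /* Shared mapping: writes by the children are visible to the parent. */
    int *data = mmap(NULL, n * sizeof *data, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(data, init, sizeof init);

    for (int half = 0; half < 2; half++) {
        if (fork() == 0) {      /* worker: sort one half, then exit */
            qsort(half ? data + mid : data,
                  half ? n - mid : mid, sizeof *data, cmp);
            _exit(0);
        }
    }
    wait(NULL);
    wait(NULL);                 /* both workers finished */

    /* Manager: merge the two sorted halves. */
    int *out = malloc(n * sizeof *out);
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < n) out[k++] = data[i] <= data[j] ? data[i++] : data[j++];
    while (i < mid) out[k++] = data[i++];
    while (j < n)   out[k++] = data[j++];

    for (k = 0; k < n; k++) printf("%d ", out[k]);
    printf("\n");
    return 0;
}
```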