Using make with -j4 or -j8

Using make with -j4 or -j8 - c

I have 4 processors and am compiling a processor-hungry application. I read that using make with the -j4 switch was recommended for OpenCV; should I instead use -j8? What is the advantage of making for multiple processors?

The answers above are all mostly correct. However, the details are a bit misleading. For example, there's no need to add an extra job for a "managing thread" (note: make is not actually multithreaded). make never counts itself as a job for the purposes of -j, so, as Huygens says above, if you say -j5 you'll get 5 compile jobs running, not 4 plus make.
The reason most people use [number of cores] + [some padding] has nothing to do with make or what it needs, but rather with the nature of the compiler. A compiler is really just a very complicated text translation tool: it reads in text in one form and converts it to "text" (binary) in another form. A lot of this (especially as your language gets more complex, like C++), requires a lot of CPU. But it also requires a lot of disk I/O. Disk I/O is slow, so while one compiler is waiting for some data from the disk, the kernel schedules other jobs to run. That is why you can usefully have more than the number of cores compiles running at the same time.
Exactly how large you can get -j before you start seeing diminishing returns (your builds actually start going slower, at some point, with more -j) depends completely on your hardware, the kinds of builds you're doing, etc. The only way to know for sure is experimentation.
However, [number of cores]+[a few] is typically a good approximation.

As you say the -j flag tells make that it is allowed to spawn the provided amount of 'threads'. Ideally each thread is executed on its own core/CPU, so your multi-core/CPU environment is used to its fullest.
make itself does not compile the source files. This is done by a compiler (gcc). The Makefile (input for make) contains a set of targets. Each target has a set of dependencies (on other targets) and rules how to build the target. make reads the Makefile(s) and manages all targets, dependencies, and build rules. Besides compiling source files you can use make to perform any task that can be described by shell commands.
If you set the allowed number of threads too high, it is not possible to schedule each thread on its own core. Additional scheduling (context) switches are required to let all threads execute. This additional resource usage obviously result in lower performance.
There are multiple rules-of-thumb, but I guess that setting to total amount to <number of cores> + 1 is the most common. The idea behind this is that all cores have their own thread and there is one additional managing thread that handles the targets and which is next to be built.

One CPU per thread plus one manager/loader. Since a thread that does disk operations is technically almost idle from CPU point of view, add one to the total number of cores.
If the CPU is using hyperthreading, you can safely count each core as two cores and double the number of threads, so a quad core Intel Core i7 should get -j9 (eight virtual cores plus manager.) On a quad core AMD use -j5

The -j option is only use to speed up application build, it determines how many jobs make can spawn for the build. You can either set -j<nb core> or even higher -j<nb-core * 1.5> so that compilation can happen in parallel.
It has no impact on the compiled code.
For a 4 core system, you could try make -j6. If make can run parallel builds, it will launch up to 6 simultaneous compilation process (e.g. 6 calls to gcc).

Related

Fork()ing and running on specific set of CPUs

I have a parent process, which I use to spawn a series of child processes, which each run their own program sequentially. Each of these programs change a file over time, I want to read the data from this file and see how it changes as each program runs.
I need two sets of data for this to work, the value of the file at some set interval (I haven't decided on the interval yet), and the time each program takes to run, there are other variables which can influence the execution times of these programs, which I want to see also.
So I figured to get more accurate timing of the child process while still reading from a file I could run them on different cores. I have 8 cores, I would like to run the parent process on 0-3, then fork the child to run on 4-7. I'm not sure if this is possible though within C, and a search around hasn't yielded any answers, which makes me think it isn't.
Within Linux, outside of a program, I can use taskset to do this.
I plan on setting aside 4 of the cores using the kernel parameter isolcpus(). I want as little noise as possible while running the child programs.

Asking the kernel to associate CPU cores with threads or processes is also known as setting the "affinity" between the core and the process/thread.
Under linux, there exists a set of functions that provide this capability. Take a look at the manual page for one of the functions...
man pthread_setaffinity_np
This family of API calls might be able to give you what you need.
That man page has a "see also" section that links to the other functions in this family.
Typically with features such as these that deal with kernel process and thread scheduling, it is entirely dependent on what mood the kernel is in at the time as to whether your requests are met or ignored. Your mileage may very due to system load or the number of available cores. Even if a system has 16 cores, these features may be disabled in the kernel compilation settings (think virtual machines). Equally, you may find that there are some additional options that you may be able to add to your kernel to get better results than the defaults.

Memory Sharing C - Performance

I'm playing around with process creation/ scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function from a shared in memory buffer. Each thread is created using clone, I'm trying and I'm playing around with the various flags, stack size, to measure process creation time, etc. (hence the use of clone)
My experiments are run on a 2 core i7 with hyperthreading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hash functions doubles when I run 4 processes (ak one per logical cpu) over when I run 2 processes. My understanding is that hyperthreading helps when a process is waiting on IO, so for a CPU bound process, it has almost no effect. Is this correct?
The second observation is that I observe pretty high variance (up to 2 seconds) when computing these hash functions (I compute a hash 1 000 000 times). No other process is running on he system (though there are some background threads). I'm struggling to understand why so much variance? Is it strictly due to how the scheduler happens to schedule the processes? I understand that without using sched_affinity, there is no guarantee that they will be located on different cpus, so can that just be explained by them being placed on the same CPU?
Are there any other ways to guarantee improved reliability without relying on sched_affinity?
The third observation is that, even when I run with just 2 threads (so when each should be scheduled on a diff CPU), I find that the performance goes down (not by much, but a little bit). I'm struggling to understand why that is the case? It's the same read-only buffer, and fits in the cache. Is there some contention in accessing the page table? Would it then be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it as read only?

Different threads still run in the context of one process so they should run on the same CPU the process is run on (usually one process is run on one CPU but that is not guaranteed).
When you run two threads instead of processes you have an overhead of switching threads, the more calculations you do the more this switching needs to be done so it will be slower than the same calculations done in one thread.
Furthermore if you run the same calculations in different processes then there is an even bigger overhead of switching between processes but there is more chance you will run on different CPUs so in the long run this will probably be faster, not so much for short calculations.
Even if you don't think you have other processes running the OS has a lot to do all the time and switches to it's own processes that you aren't always aware of.
All of this emanates from the randomness of switching. Hope I helped a bit.

Performance of System()

For the function in c, system(), would it affect the hardware counters if you are trying to see how that command you ran performed
For example lets say im using the Performance API(PAPI) and the program is a precompiled matrix multiplication application
PAPI_start_counters();
system("./matmul");
PAPI_read_counters();
//Print out values
PAPI_stop_counters();
I am obviously missing a bit but what I am trying to find out is it is possible, through the use of said counters to get the performance of a program im running.
from my tests I would get wild numbers like the ones below. they are obviously wrong, just want to find out why
Total Cycles =========== 140733358872510
Instructions Completed =========== 4203968
Floating Point Instructions =========== 0
Floating Point Operations =========== 4196867
Loads =========== 140733358872804
Stores =========== 4204037
Branches Taken =========== 15774436

system() is a very slow function in general. On Linux, it spawns /bin/sh (forking and executing a full shell process), which parses your command, and spawns the second program. Loading these two programs requires loading the code to memory, initializing all their libraries, executing startup code, etc. Only then will the program code actually start executing.
Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.
The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.

Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap only those instructions (or C statements) in counters that you want to measure.
For more information see:
Related discussion here
Intel® Performance Counter Monitor here
Performance measurements with x86 RDTSC instruction here
As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.

The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicate that the executable "matmul" is doing a lot of waiting for external processes (e.g. disk I/O) to complete. I do not know the specifics of the msg FP Instructions and FP ops, but if they are displaying those values differently PAPI has a reason.
What is interesting is that the loads and cycles are obviously connected as well as instructions/fp ops and stores.
I would have to know about the internals of "matmul" in order to give you a better description.

Measuring CPU clocks consumed by a process

I have written a program in C. Its a program created as result of a research. I want to compute exact CPU cycles which program consumes. Exact number of cycles.
Any idea how can I find that?

The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.

No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.

Try OProfile. It use various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of it's use in the article, Memory part 7: Memory performance tools.

I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)

Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing a interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

How does sched_setaffinity() work?

I am trying to understand how the linux syscall sched_setaffinity() works. This is a follow-on from my question here.
I have this guide, which explains how to use the syscall and has a pretty neat (working!) example.
So I downloaded the Linux 2.6.27.19 kernel sources.
I did a 'grep' for lines containing that syscall, and I got 91 results. Not promising.
Ultimately, I'm trying to understand how the kernel is able to set the instruction pointer for a specific core (or processor.)
I am familiar with how single-core-single-thread programs work. One might issue a 'jmp foo' instruction, and this basically sets the IP to the memory address of the 'foo' label. But when one has multiple cores, one has to say "fetch the next instruction at memory address foo, and set the instruction pointer for core number 2 to begin execution there."
Where, in the assembly code, are we specifying which core performs that operation?
Back to the kernel code: what is important here? The file 'kernel/sched.c' has a function called sched_setaffinity(), but returns type "long" - which is inconsistent with its manual page. So what is important here? Which of these modules shows the assembly instructions issued? What module is reading the 'task_struct', looking at the 'cpus_allowed' member, and then translating that into an instruction? (I've also thumbed through the glibc source - but I think it just makes a call to the kernel code to accomplish this task.)

sched_setaffinity() simply tells the scheduler which CPUs is that process/thread allowed to run on, then calls for a re-schedule.
The scheduler actually runs on each one of the CPUs, so it gets a chance to decide what task to execute next on that particular CPU.
If you're interested in how you can actually call some code on other CPUs, I suggest you take a look at smp_call_function_single(). In case we want to call something on another CPU, this calls generic_exec_single(). The latter simply adds the function to the target CPU's call queue and forces a reschedule through some IPI stuff (if the queue was empty).
Bottom line is: there no actual SMP variant of the _jmp_ instruction. Instead, code running on other CPUs cooperates in order to accomplish the task.

I think the thing you are not understanding is that the kernel is running on all the CPU cores. At every timer interrupt (~1000 per second), the scheduler runs on each CPU and chooses a process to run. There is no one CPU that somehow tells the others to start running a process. sched_setaffinity() works by just setting flags on the process. The scheduler reads these flags and will not run that process on its CPU if it is set not to.

Where, in the assembly code, are we specifying which core performs that operation?
There is no assembly involved here. Every task (thread) is assigned to a single CPU (or core in your terms) at a time. To stop running on a given CPU and resume on another, the task has to "migrate" (also this). When a task migrates from one CPU to another, the scheduler picks the CPU which is more idle among the CPUs allowed by sched_setaffinity().
There is no magic assembly instructions issued. The kernel has a more low-level view of the hardware, each CPU is a separate object, very different than how it looks like for user-space processes (in user-space, CPUs are almost invisible).

Check this out: B Operating System Programming Guidelines

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight