I am looking to do some data processing on some 6700 files and am using fork() to handle different sets of calculations on the data. Will getting a CPU with a higher core count allow me to run more forks? Currently I am using a quad-core CPU with 8 threads and calling fork() 8 times, which takes about an hour per file. If I had a 64-core processor and called fork() 64 times (splitting up the calculations), would that decrease the time by a factor of about 8?
Theoretically no, according to Amdahl's law: the part of the work that cannot be parallelized limits the overall speedup, no matter how many cores you add. Probably not in practice either, because many resources are shared (the caches, operating system calls, the disk, etc.), but this really depends on your algorithm. For example, if your algorithm is embarrassingly parallel and CPU-bound, then you may notice a great improvement from increasing the core count to 64.
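To make the Amdahl's law argument concrete, here is a minimal sketch of the formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The function names and the sample values of p are assumptions for illustration; the real parallel fraction of your program has to be measured.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup over a single worker when only
    `parallel_fraction` of the work scales with the worker count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def relative_gain(parallel_fraction: float, old_workers: int, new_workers: int) -> float:
    """How much faster `new_workers` is than `old_workers` under the same model."""
    return (amdahl_speedup(parallel_fraction, new_workers)
            / amdahl_speedup(parallel_fraction, old_workers))


if __name__ == "__main__":
    # Hypothetical parallel fractions; 1.0 is the embarrassingly parallel case.
    for p in (1.0, 0.99, 0.95, 0.90):
        gain = relative_gain(p, 8, 64)
        print(f"parallel fraction {p:.2f}: going from 8 to 64 workers gives x{gain:.2f}")
```

Only the fully parallel case reaches the factor of 8 asked about in the question; any serial portion pulls the gain below that, and the model ignores the shared-resource effects mentioned above.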
A note after reading the comments on the question: if your algorithm has a complexity of O(n!), it may simply be impossible to execute in a realistic time. For example, if your input is n=42 and your machine can do 1 billion operations per second, then the time required to execute your algorithm is greater than the age of the universe.
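The arithmetic behind that claim can be checked with a few lines. This is a rough sketch that assumes exactly n! operations, the rate of 1 billion operations per second from the text, and an age of the universe of about 13.8 billion years (a commonly cited approximation, not a figure from the question).

```python
import math

OPS_PER_SECOND = 1e9              # rate assumed in the text
AGE_OF_UNIVERSE_YEARS = 13.8e9    # approximate, commonly cited value
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def factorial_runtime_years(n: int, ops_per_second: float = OPS_PER_SECOND) -> float:
    """Years needed to perform n! operations at the given rate."""
    return math.factorial(n) / ops_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = factorial_runtime_years(42)
    print(f"42! operations take about {years:.2e} years")
    print(f"that is about {years / AGE_OF_UNIVERSE_YEARS:.2e} times the age of the universe")
```

Adding cores divides this time by at most the core count, which does not change the conclusion.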
Related
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to first give an overview of how threads are managed by the system. Nowadays most processors are multi-core, often with multiple hardware threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single hardware thread. It is physically limited to performing a single task at a time, yet we are still able to run multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches from one task to another quickly enough to give the illusion of multitasking. This process of changing from one task to another is called context switching.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the CPU architecture, this data can be kept in registers, cache, RAM, etc. As technology has advanced, more efficient ways of doing this have been developed. When the task is resumed, its saved data is fetched again and the task continues where it left off.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As said above, switching from one task to another requires a context switch. The saving and loading of data that this involves is pure overhead for your computation and gives you no direct advantage. Having too many tasks therefore requires a lot of context switching, which means a lot of wasted computation time. In the end your program might run slower than it would with fewer tasks.
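One way to see this overhead on your own program is to count the context switches the operating system reports for the process. The sketch below uses Python's standard resource module, which is available on Unix-like systems only; the helper names are made up for this example, and the counts you get depend entirely on your system and its load.

```python
import resource
import time


def context_switches():
    """Return (voluntary, involuntary) context switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def switches_during(work):
    """Run work() and return how many switches of each kind happened meanwhile."""
    voluntary_before, involuntary_before = context_switches()
    work()
    voluntary_after, involuntary_after = context_switches()
    return (voluntary_after - voluntary_before,
            involuntary_after - involuntary_before)


if __name__ == "__main__":
    # A CPU-bound loop tends to be preempted (involuntary switches),
    # while a sleeping task gives up the CPU itself (voluntary switches).
    print("busy loop:", switches_during(lambda: sum(i * i for i in range(2_000_000))))
    print("sleeping :", switches_during(lambda: time.sleep(0.05)))
```

If the involuntary count climbs sharply as you add threads, the threads are probably competing for the same cores.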
Also, since you tagged this question with pthreads, it is worth checking that your threads are actually being run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will run on multiple cores!
In your particular case of application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computing. This sort of task runs well on a GPU, since the operations have no data dependencies on each other and the concurrency is implemented in hardware (modern GPUs have thousands of computing cores!)
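Coming back to the thread-count parameter, a common starting point for CPU-bound work is to cap the requested number of workers at the number of cores and give each worker a contiguous block of grid rows. The sketch below only does that bookkeeping; the function names are hypothetical, and the cap is a rule of thumb that should be checked by timing the real program (I/O-bound work can behave differently).

```python
import os
from typing import List, Optional


def effective_workers(requested: int, cores: Optional[int] = None) -> int:
    """Cap the requested worker count at the number of available cores."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    available = cores if cores is not None else (os.cpu_count() or 1)
    return min(requested, available)


def split_rows(n_rows: int, workers: int) -> List[range]:
    """Split row indices 0..n_rows-1 into contiguous, near-equal chunks,
    one per worker (fewer chunks if there are fewer rows than workers)."""
    if n_rows <= 0:
        return []
    workers = max(1, min(workers, n_rows))
    base, extra = divmod(n_rows, workers)
    chunks = []
    start = 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more row
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    workers = effective_workers(requested=50, cores=4)
    print(workers, split_rows(10, workers))
```

With 50 threads requested on a 4-core machine this yields 4 workers, each owning 2 or 3 of the 10 rows.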
I'm playing around with process creation/scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function over a shared in-memory buffer. Each thread is created using clone, because I'm experimenting with the various flags, the stack size, etc. to measure process creation time (hence the use of clone).
My experiments are run on a 2 core i7 with hyperthreading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hash functions doubles when I run 4 processes (i.e. one per logical CPU) compared to when I run 2 processes. My understanding is that hyperthreading helps when a process is waiting on I/O, so for a CPU-bound process it has almost no effect. Is this correct?
The second observation is that I see pretty high variance (up to 2 seconds) when computing these hash functions (I compute a hash 1 000 000 times). No other process is running on the system (though there are some background threads). I'm struggling to understand why there is so much variance. Is it strictly due to how the scheduler happens to schedule the processes? I understand that without using sched_setaffinity there is no guarantee that they will be placed on different CPUs, so can the variance be explained simply by them being placed on the same CPU?
Are there any other ways to guarantee improved reliability without relying on sched_setaffinity?
The third observation is that, even when I run with just 2 threads (so each should be scheduled on a different CPU), I find that the performance goes down (not by much, but a little). I'm struggling to understand why that is the case. It's the same read-only buffer, and it fits in the cache. Is there some contention in accessing the page table? Would it then be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it as read-only?
Different threads still run in the context of one process so they should run on the same CPU the process is run on (usually one process is run on one CPU but that is not guaranteed).
When you run two threads instead of processes you have the overhead of switching threads. The more calculations you do, the more this switching needs to be done, so it will be slower than the same calculations done in one thread.
Furthermore, if you run the same calculations in different processes then there is an even bigger overhead of switching between processes, but there is more chance you will run on different CPUs, so in the long run this will probably be faster, though not so much for short calculations.
Even if you don't think you have other processes running, the OS has a lot to do all the time and switches to its own processes that you aren't always aware of.
All of this emanates from the randomness of switching. Hope I helped a bit.
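For comparison, the CPU affinity mechanism mentioned in the question (sched_setaffinity on Linux) removes one source of that randomness by restricting where the scheduler may place a process. A minimal sketch using Python's os wrappers follows; they exist only on platforms that support the call (Linux in practice), so the helper, whose name is made up here, returns None elsewhere.

```python
import os


def pin_to_one_cpu(pid: int = 0):
    """Restrict process `pid` (0 = the caller) to a single CPU chosen from
    its currently allowed set. Returns the chosen CPU number, or None if
    this platform does not provide sched_setaffinity."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)
    cpu = min(allowed)                 # arbitrary choice: lowest-numbered allowed CPU
    os.sched_setaffinity(pid, {cpu})
    return cpu


if __name__ == "__main__":
    chosen = pin_to_one_cpu()
    if chosen is None:
        print("CPU affinity is not available on this platform")
    else:
        print(f"now restricted to CPU {chosen}: {os.sched_getaffinity(0)}")
```

Pinning does not stop the kernel from running other work on the same CPU, so some run-to-run variance should still be expected.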
I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
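The timings in the question are consistent with this. Dividing CPU time (user + sys) by wall-clock time (real) gives the average number of logical CPUs kept busy; the sketch below does that for the two runs. The function name is an assumption made for this example.

```python
def average_busy_cpus(real_s: float, user_s: float, sys_s: float) -> float:
    """Average number of logical CPUs kept busy, from `time` output."""
    return (user_s + sys_s) / real_s


# Figures copied from the question.
TWO_THREADS = average_busy_cpus(real_s=9.335, user_s=18.233, sys_s=0.132)
FOUR_THREADS = average_busy_cpus(real_s=9.427, user_s=34.130, sys_s=0.180)

if __name__ == "__main__":
    print(f"2 threads: about {TWO_THREADS:.2f} logical CPUs busy")
    print(f"4 threads: about {FOUR_THREADS:.2f} logical CPUs busy")
```

With 4 threads nearly all 4 logical CPUs are reported busy, yet the wall-clock time is unchanged: the total CPU time almost doubled for the same amount of work, which suggests each thread progressed at roughly half speed while sharing a physical core.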
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the executing threads are each keeping one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading is also "less than one extra core" in performance for sure.
Besides that, you may run into memory throughput issues, or your code may be contending over locks or the like now that you have more threads. Even if your own code is lock-free, that doesn't mean that, for example, I/O or some of the functions you call are able to run completely in parallel - there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.
I am trying to calculate a process utilization on my machine with Intel hyper-threading.
I have one problem when trying to do the calculation:
I am counting the loops my process is doing when running alone on the physical core
and counting it when another process (identical one) is running on the other logical core (same physical core).
I see there is a difference in the number of loops my process is doing. I guess it's related to the fact that I am running on a hyper-threaded machine.
Is there a way to know what is the exact running time my process did so I will be able to add it to my calculation when I am trying to calculate the process utilization?
You can only tell how much of the logical CPU's time a process takes. You can't tell how much it uses the physical CPU, and it isn't really defined.
HyperThreading (or, at least, the more modern SMT) doesn't work by dividing the physical CPU time between two threads. It works by assigning work to the execution units within the CPU (and there are several such units).
So it's possible for both threads to run at once - there are several integer execution units, and some others (memory, floating point).
Bottom line - if a thread takes 100% of the logical CPU, then it takes 100% CPU. That's all you can tell.
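What can be measured is the logical CPU time the process was charged, next to the wall-clock time that elapsed. A minimal sketch using Python's standard time module (the helper names and the loop size are arbitrary choices for illustration):

```python
import time


def measure_cpu_and_wall(work):
    """Run work() and return (cpu_seconds, wall_seconds).
    cpu_seconds is user + system CPU time charged to this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


def busy_loop(iterations: int = 2_000_000) -> int:
    """A stand-in for the counting loop described in the question."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


if __name__ == "__main__":
    cpu, wall = measure_cpu_and_wall(busy_loop)
    print(f"cpu {cpu:.3f}s, wall {wall:.3f}s, logical CPU utilization {cpu / wall:.0%}")
```

The utilization figure can stay close to 100% whether or not a sibling hyperthread is busy; only the number of loops completed in that time changes, which matches the observation in the question.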
I have written code which performs a specific task. If I run it on a different machine (with a different frequency), will it take a different amount of time?
Question
If my code has one printf function, then will its required number of machine cycles be fixed for all machines, or will it depend on the system?
My system frequency is 2.0GHz, what does it mean?
The performance time of the code will depend on the frequency of the CPU, amongst many other things. All other things being equal, a faster CPU will take less time to execute the same instructions. But the number of other things that can affect the timing is vast, including O/S, compiler, memory chips, disk and so on.
If the machines have the same basic architecture, then the number of machine cycles is fixed. However, modern CPU architectures are very complex, and there could easily be variations depending on what else is running on the machine at the same time. If the machines have different chip types (even within a family such as Intel Core 2 Duo), then the results could be different. If the machines are of different architectures (Intel vs SPARC or PowerPC, say), then all bets are off.
If the 'frequency is 2.0 GHz', then it means that the main CPU clock cycles at 2.0 GHz. How many instructions are executed in that time depends on the instructions, and the parallelism (how many cores), and the CPU type, etc. The CPU frequency is separate from the bus frequency which controls how fast memory can be read (so, I'm using a 2.0 GHz CPU but the memory bus runs at 1067 MHz).
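The relationship between clock frequency, cycles and time is simple arithmetic, sketched below. The cycle count used in the example is a made-up figure, not a measurement of printf or any other function.

```python
def cycle_time_ns(frequency_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / frequency_ghz


def runtime_seconds(cycles: float, frequency_ghz: float) -> float:
    """Time needed for a given number of clock cycles at a given frequency."""
    return cycles / (frequency_ghz * 1e9)


if __name__ == "__main__":
    print(f"one cycle at 2.0 GHz lasts {cycle_time_ns(2.0)} ns")
    hypothetical_cycles = 3_000_000  # assumed figure for illustration only
    for ghz in (2.0, 3.0):
        print(f"{hypothetical_cycles} cycles at {ghz} GHz: "
              f"{runtime_seconds(hypothetical_cycles, ghz) * 1e3:.2f} ms")
```

This only converts cycles into time. As explained above, the number of cycles a piece of code needs is itself not fixed across CPU types, so the calculation cannot predict runtime on a different machine by itself.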
Clock speed of a computer of course has its influence on the execution time of a program, but just stating that the processor runs at 2 GHz is absolutely not enough to determine how long exactly the program will run because there are huge differences in "efficiency" between the processor families - an Intel Core family processor will just do a lot more work per time unit than its predecessor, the Pentium 4, when both run at the same speed.
So yes, CPU speed has a serious influence on the execution time of a program but just the GHz value is absolutely not enough. That's why various benchmarks were set up, to be able to compare the work a processor can do in a time unit. These benchmarks will run a mix of instructions that can be considered a typical workload in a chosen scenario, and time how long their execution will take. Check out Whetstone and Dhrystone for some older but relatively easy to understand benchmarks.
The fact that there are tons of benchmarks only proves that it's not easy at all to obtain a comparable value whose relevance everybody can agree on. It remains a topic for debate...
The frequency of the CPU defines how much work it can do within a certain time. The code is the same on all machines (i.e. it's compiled code) so yes the frequency will affect the time it takes to run your program.