function to purposefully have a high cpu load? - c

I'm making a program which controls other processes (stopping and starting them).
One of the criteria is that the load on the computer stays under a certain value.
So I need a function or small program which will cause the load to be very high, for testing purposes. Can you think of anything?
Thanks.

I can think of this one:
for(;;);

If you want to actually generate peak load on a CPU, you typically want a modest-size (so that the working set fits entirely in cache) trivially parallelizable task that someone has hand-optimized to use the vector unit on a processor. Common choices are things like FFTs, matrix multiplications, and basic operations on mathematical vectors.
These almost always generate much higher power and compute load than do more number-theoretic tasks like primality testing because they are essentially branch-free (other than loops), and are extremely homogeneous, so they can be designed to use the full compute bandwidth of a machine essentially all the time.
The exact function that should be used to generate a true maximum load varies quite a bit with the exact details of the processor micro-architecture (different machines have different load/store bandwidth in proportion to the number and width of multiply and add functional units), but numerical software libraries (and signal processing libraries) are great things to start with. See if any that have been hand tuned for your platform are available.
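As a rough illustration of that idea, here is a minimal C sketch that keeps the floating-point units busy with a small matrix multiply whose working set fits in cache. It is nowhere near the hand-tuned vectorized code described above, and the matrix size and the anti-optimization check are arbitrary choices, but it shows the general shape of the workload:

#include <stdio.h>

#define N 64              /* small enough that three N x N matrices fit in cache */

static float a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* fill the inputs with something non-trivial */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (float)(i + j) * 0.5f;
            b[i][j] = (float)(i - j) * 0.25f;
        }

    /* repeat the multiply indefinitely to keep the load high */
    for (;;) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        /* touch the result so the compiler cannot optimize the work away */
        if (c[0][0] == 123456.0f)
            printf("unlikely\n");
    }
    return 0;
}

Note that this loads a single core; run one copy per core (or add threads) if you want to load the whole machine.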

If you need to control how long the CPU burst will be, you can use something like the Sieve of Eratosthenes (an algorithm that finds all primes up to a certain number) and supply a smallish limit (10000) for short bursts and a big one (100000000) for long bursts.
If you want I/O to count towards the load as well, you can write to a file for each test in the sieve.
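A minimal sieve sketch along those lines (the limit comes from the command line, and error handling is kept to a bare minimum):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* the limit controls how long the burst lasts: ~10000 is short, ~100000000 is long */
    long limit = (argc > 1) ? atol(argv[1]) : 10000;
    char *composite = calloc(limit + 1, 1);
    if (!composite)
        return 1;

    long count = 0;
    for (long i = 2; i <= limit; i++) {
        if (composite[i])
            continue;
        count++;                        /* i is prime */
        for (long j = i + i; j <= limit; j += i)
            composite[j] = 1;
    }

    printf("%ld primes up to %ld\n", count, limit);
    free(composite);
    return 0;
}

Keep in mind the sieve allocates one byte per number, so the 100000000 case needs about 100 MB of memory.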

Related

how to know clock cycles, overall performance, etc... of program?

I have 3 different algorithms which all calculate the same stuff.
My goal is to compare all three algorithms, i.e. clock cycles, "how intensive it is for the processor", time needed to get the final result, the overall performance etc...
How can I see/get/analyze all of this information?
I am programming in MATLAB and in C (in Code Composer Studio) for an embedded system.
EDIT: memory usage/management would be useful as well, especially for the embedded system.
First you can compare the size of your output files; often the bigger one is slower.
Getting exact clock cycles is not easy: you have to know how many cycles each assembler instruction needs and add them up for your code.
If you are running directly on your hardware, you can toggle a port at the start and end points and measure the time in between. (Keep in mind there may be interrupts that slow you down.)
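A sketch of that pin-toggle pattern in C is below. The GPIO macro names mentioned in the comment are placeholders, not real device symbols (on a TI part under Code Composer Studio the real names come from your device header), and the clock() call is only a coarse host-side fallback:

#include <stdio.h>
#include <time.h>

/* On the real target you would map these to your GPIO register, e.g. something like
 *   #define TIMING_PIN_HIGH()  (GPIO_PORT_OUT |= TIMING_BIT)
 * where GPIO_PORT_OUT and TIMING_BIT are placeholders for whatever your device header
 * defines. Here they are no-ops so the sketch also runs on a host PC. */
#define TIMING_PIN_HIGH()  ((void)0)
#define TIMING_PIN_LOW()   ((void)0)

/* stand-in for the code you actually want to measure */
static volatile long sink;
static void code_under_test(void)
{
    for (long i = 0; i < 1000000; i++)
        sink += i;
}

int main(void)
{
    TIMING_PIN_HIGH();               /* scope or logic analyzer sees the rising edge */
    clock_t start = clock();         /* coarse host-side fallback measurement */
    code_under_test();
    clock_t end = clock();
    TIMING_PIN_LOW();                /* falling edge marks the end */

    printf("elapsed: %.3f ms\n", 1000.0 * (end - start) / CLOCKS_PER_SEC);
    return 0;
}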
For the MATLAB part, you should use the timeit function to evaluate performance. You can also use profile to inspect which, if any, parts of the code are causing bottlenecks.

Avoiding CUDA thread divergence for MISD type operation

As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (Can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup to calculate the value for each data element in the kernel. All the different threads use the same 3-6 data elements to calculate their output, but in a different way. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help ?
Running one thread per block simply idles 31 of the 32 threads in the single warp you launch, wasting a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how much branch divergence penalty your code incurs.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm, and there really isn't going to be much you can do to avoid branch divergence penalties. One approach which could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than a single large kernel. CUDA supports C++ templating, and that can be a good way of generating a lot of kernel code from a relatively narrow base and making a lot of logic statically evaluable, which can help the compiler. But don't expect miracles - your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).
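To make the grouping idea a bit more concrete, here is a host-side sketch in plain C (not CUDA); the equation "type" tags and the tiny array are invented for illustration. The point is simply to sort the equation indices by expression type before launching work, so that consecutive threads - and therefore threads in the same warp - would evaluate the same expression:

#include <stdio.h>
#include <stdlib.h>

#define NUM_EQNS 8

/* hypothetical tag saying which xdot expression each equation uses */
typedef struct {
    int index;   /* original equation index */
    int type;    /* 0, 1, 2, ... one per distinct expression */
} eqn_t;

static int by_type(const void *a, const void *b)
{
    return ((const eqn_t *)a)->type - ((const eqn_t *)b)->type;
}

int main(void)
{
    /* in the real code the types would come from analysing the equations */
    eqn_t eqns[NUM_EQNS] = {
        {0, 2}, {1, 0}, {2, 1}, {3, 0}, {4, 2}, {5, 1}, {6, 0}, {7, 2}
    };

    /* group equations with the same expression so a warp sees only one branch */
    qsort(eqns, NUM_EQNS, sizeof(eqns[0]), by_type);

    for (int i = 0; i < NUM_EQNS; i++)
        printf("thread %d -> equation %d (type %d)\n", i, eqns[i].index, eqns[i].type);
    return 0;
}

The same ordering could also be used to split the work into one kernel launch per type, along the lines of the multi-kernel suggestion above.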

should use GPU?

How can I know if my serial code will run faster if I used a GPU? I know it depends on a lot of things... i.e. whether the code can be parallelized in a SIMD fashion and all this stuff... but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? Should I not bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time coding my algorithm for the GPU unless I am 100% sure that speed will be gained... that is my problem...
I think that my algorithm could be parallelized on the GPU... would it be worth trying?
It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This is dependent upon the inherent parallelization of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This is mainly dependent upon the bandwidth of the bus between main memory and your particular GPU, and is greatly reduced by the Sandy Bridge architecture, where the CPU and GPU are on the same die. With older architectures, some operations such as matrix multiplication where the inner dimensions are small get no improvement, because it takes longer to transfer the inner vectors back and forth across the system bus than it does to dot product the vectors on the CPU.
Unfortunately these two factors are tough to estimate and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in CUBLAS, which has a very similar API except that it sends the operations over to the GPU to perform.
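As a very rough way of thinking about factor 2, here is a back-of-the-envelope C sketch; all the bandwidth and throughput numbers are made-up placeholders you would replace with measurements for your own hardware and algorithm:

#include <stdio.h>

int main(void)
{
    /* placeholder numbers - substitute measurements for your own system */
    double bytes_to_transfer = 100e6;     /* data copied to and from the GPU */
    double bus_bandwidth     = 6e9;       /* bytes/s over the host<->GPU bus */
    double flops_needed      = 2e9;       /* floating point operations in the task */
    double gpu_throughput    = 200e9;     /* sustained FLOP/s you expect on the GPU */
    double cpu_throughput    = 20e9;      /* sustained FLOP/s on the CPU */

    double t_transfer = bytes_to_transfer / bus_bandwidth;
    double t_gpu      = flops_needed / gpu_throughput + t_transfer;
    double t_cpu      = flops_needed / cpu_throughput;

    printf("CPU estimate: %.3f s\n", t_cpu);
    printf("GPU estimate: %.3f s (of which %.3f s is transfer)\n", t_gpu, t_transfer);
    printf("GPU looks %s here\n", t_gpu < t_cpu ? "worth trying" : "unlikely to help");
    return 0;
}

If the transfer term dominates (as in the small matrix multiplication case above), the GPU estimate will not beat the CPU no matter how fast the device's arithmetic is.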
When looking for a parallel solution you should typically ask yourself questions like these:
How much data do you have?
How much floating point computation do you have?
How complicated is your algorithm, i.e. how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it a realtime computation or not?
Do alternate algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computation, distributed computation, or a combination of GPU and cluster/distributed machines.
If you could share any information on your algorithm and the size of your data, it would be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.

How to do good benchmarking of complex functions?

I am about to embark on very detailed benchmarking of a set of complex functions in C. This is "science level" detail. I'm wondering, what would be the best way to do serious benchmarking? I was thinking about running them, say, 10 times each, averaging the timing results and giving the standard deviation, for instance just using <time.h>. What would you guys do to obtain good benchmarks?
Reporting an average and standard deviation gives a good description of a distribution when the distribution in question is approximately normal. However, this is rarely true of computational performance measurements. Instead, performance measurements tend to more closely resemble a Poisson distribution. This makes sense, because not many random events on a computer will cause a program to go faster; essentially all of the measurement noise is in how many random events occur that cause it to slow down. (A normal distribution, by contrast, makes no intuitive sense at all; it would require the belief that a program has a non-zero probability of finishing in negative time.)
In light of this, I find it most useful to report the minimum time over many runs of a program, rather than the average; the noise in the distribution is typically noise of the measuring system, rather than meaningful information about the algorithm. For complex algorithms that have early out conditions, and other shortcuts, you need to be a little more careful, but the minimum of many runs where each run handles a representative balance of inputs usually works well.
"10 times each" sounds like very few iterations to me. I generally do something on the order of thousands (or more, depending on the function/system) of runs unless that's completely infeasible. At a bare minimum, you need to make sure that you run the timing for sufficiently long as to shake out any dependence on system state, some of which may change at fairly large time granularity.
The other thing you should be aware of is that essentially every system has a platform-specific timer available that is much more accurate than what is available in <time.h>. Find out what it is on your target platform[s] and use it instead.
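As one possible shape for that, here is a sketch using POSIX clock_gettime (a reasonable stand-in on Linux for "a better timer than <time.h>"; other platforms have their own equivalents), reporting the minimum over many runs as suggested above. The function being timed is a placeholder:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* placeholder for the function you actually want to benchmark */
static volatile double sink;
static void function_under_test(void)
{
    double x = 0.0;
    for (int i = 1; i < 100000; i++)
        x += 1.0 / i;
    sink = x;
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int runs = 10000;
    double best = 1e300;

    for (int i = 0; i < runs; i++) {
        double t0 = now_seconds();
        function_under_test();
        double t1 = now_seconds();
        if (t1 - t0 < best)
            best = t1 - t0;
    }

    printf("minimum over %d runs: %.9f s\n", runs, best);
    return 0;
}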
I am assuming you are benchmarking pure algorithmic computation and there is no user input or output that can take an unpredictable amount of time.
Even for purely number-crunching programs, your results can vary from run to run because of other ongoing activity in the system. There are also factors you may choose to ignore depending on the level of accuracy desired, e.g. the impact of cache misses and of different access times through the memory hierarchy.
One method is, as you suggested, calculating an average over a number of runs.
Or you could look at the assembly code, see which instructions are generated, and then get the cycle count for those instructions from the processor documentation. This may not be practical depending on the amount of code you are benchmarking. If you care about memory hierarchy effects, you may want to control the execution environment very carefully, i.e. where the program is loaded, where its data is loaded, etc. But as mentioned, depending on the accuracy desired, you may simply absorb the variation caused by the memory hierarchy into your statistical variation.
You may also need to design the test inputs for your functions carefully to ensure path coverage, and you may choose to publish performance statistics as a function of test input. This will show how the function behaves across a range of inputs.
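A small C sketch of that last idea, timing a placeholder workload over a range of input sizes and printing one line per size (the size range and the workload itself are invented for illustration):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* placeholder workload whose cost depends on the input size n */
static volatile double sink;
static void function_under_test(long n)
{
    double x = 0.0;
    for (long i = 1; i <= n; i++)
        x += (double)i * i;
    sink = x;
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int runs = 1000;

    printf("%12s %14s %14s\n", "input size", "min (s)", "mean (s)");
    for (long n = 1000; n <= 1000000; n *= 10) {
        double best = 1e300, total = 0.0;
        for (int r = 0; r < runs; r++) {
            double t0 = now_seconds();
            function_under_test(n);
            double t1 = now_seconds();
            double dt = t1 - t0;
            total += dt;
            if (dt < best)
                best = dt;
        }
        printf("%12ld %14.9f %14.9f\n", n, best, total / runs);
    }
    return 0;
}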

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
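To make the earlier point about big-integer libraries working in small chunks concrete, here is a minimal C sketch of schoolbook multi-word multiplication with 32-bit limbs (128-bit operands just to keep the example tiny); the LIMBS constant, the bigmul function, and the sample operands are illustrative, not from any real library. A 1024x1024-bit multiply done this way needs on the order of (1024/32)^2 = 1024 of these 32x32 partial products:

#include <stdint.h>
#include <stdio.h>

#define LIMBS 4   /* 4 x 32 bits = 128-bit operands, just to keep it small */

/* r gets 2*LIMBS limbs; a and b are little-endian arrays of 32-bit limbs */
static void bigmul(uint32_t r[2 * LIMBS], const uint32_t a[LIMBS], const uint32_t b[LIMBS])
{
    for (int i = 0; i < 2 * LIMBS; i++)
        r[i] = 0;

    for (int i = 0; i < LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < LIMBS; j++) {
            /* one 32x32 -> 64 bit partial product, plus accumulated value and carry */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + LIMBS] = (uint32_t)carry;
    }
}

int main(void)
{
    /* two arbitrary 128-bit numbers, least significant limb first */
    uint32_t a[LIMBS] = { 0x89abcdefu, 0x01234567u, 0xdeadbeefu, 0x00000001u };
    uint32_t b[LIMBS] = { 0x12345678u, 0x9abcdef0u, 0x00000000u, 0x00000002u };
    uint32_t r[2 * LIMBS];

    bigmul(r, a, b);

    printf("product limbs (most significant first):");
    for (int i = 2 * LIMBS - 1; i >= 0; i--)
        printf(" %08x", r[i]);
    printf("\n");
    return 0;
}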
It's because this speedup would only be O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factored numbers 1000 times faster, I would only have to make the numbers about 10 bits larger and we would be back where we started.
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia