How to calculate the time cost of a nested loop - loops

Is there a math formula that I can put in my Windows calculator to calculate the time cost of my nested loop?
From what I read, I saw things like O(n) but I don't know how to put that in a calculator.
I have a 300-level nested loop, and each loop level goes from 0 to 1000.
So it is like:
for i as integer = 0 to 1000
{
    // 300 more nested loops, each goes from 0 to 1000
}
I am assuming that 1 complete pass will take 1 second (i.e. going through the 300 levels from the topmost loop to the innermost loop will take 1 second).
How can I find out how many seconds it will take to complete the whole loop operation?
Thank you

If you have M nested loops each with N steps, the total number of times the inner-most code will run is N^M. Consequently, if this code takes t seconds to execute, the total running time will be no less than N^M · t.
In your case, N = 1001 and M = 301. Thus
N^M = 1001^301 = 1351006135283321356815039655729366684349803831988739053240801914465955329955010318785683390381893021480038259097571556568393033624125663039020474816807139123124687688667110651554536455554983313531622053666601142890485454586020409971220427023079449603217442510854172465669657551198374621035716253537483681789962979899381794117593366167602159102878741666110186140811771157661856975727227011774938173689257851684515033630838203428990519981303994460521486651131205651440789014004132287034501678194895276766533222238644463714676717122486815519675003623218747516074833762258750872602504763945523812443905022112877989765178585281722001229086777672022301235387486256207147702409003911671750616133700186863997717468027217550738860605995255072363983914048955786591110292838269781981812167453230775137941595147968127169264749956845265384891251656562329411331097495063637387550896404766837508698223985774995150301001.
If, in addition, we assume that the inner-most code is really fast, like t = 1 µs, the total running time will be no less than N^M · t = 1.35... × 10^897 s.
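For anyone who wants to reproduce the estimate on a computer rather than a calculator, here is a small C sketch (N, M, and the assumed t = 1 µs are taken from the numbers above); since N^M overflows every native numeric type, it works with base-10 logarithms:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N = 1001.0;   /* iterations per loop level (0 to 1000 inclusive) */
    double M = 301.0;    /* number of nested loop levels */
    double t = 1e-6;     /* assumed time per innermost iteration, in seconds */

    /* N^M is far too large for any native type, so work with log10. */
    double log10_iterations = M * log10(N);
    double log10_seconds    = log10_iterations + log10(t);

    printf("total iterations ~= 10^%.2f\n", log10_iterations); /* about 10^903 */
    printf("total time       ~= 10^%.2f s\n", log10_seconds);  /* about 10^897 s */
    return 0;
}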
When the universe comes to its end, your code will still be at almost exactly 0.0000000000000000000000000000 % completion, assuming the hardware hasn't faulted by then and human civilization hasn't come to an end (making it difficult for you to find the electrical power you need to keep your hardware running).

Related

Determining number of stalls added by BTB in a processor with 5-stage pipeline and 2-stage BTB for nested loops

[I know that ChatGPT isn't allowed, but I used ChatGPT to make the question easier to understand, because this is a complex topic for me and I need to ask in order to understand it. I double-checked the question multiple times and manually edited it, so I can say it is 100% faithful to what I was asking.]
I am trying to solve an exercise that involves a processor with a 5-stage pipeline and a 2-stage BTB. The exercise asks me to determine the number of stalls that the BTB will add for control hazards, as well as the percentage of stalls needed when executing a program with two nested loops. The loops have a certain number of instructions and are executed a certain number of times, and the BTB introduces a certain number of stalls in the hit and miss cases.
I am not sure how to approach this problem, and I would appreciate any help or advice on how to determine the number of stalls and the percentage of stalls needed in this scenario. Thank you in advance for your help. [Important note: don't provide the full solution; I need to be able to grasp the concept on my own, since I'm still learning this material.]
This is the full problem I need to solve: "A processor has a pipeline of 5 stages and a BTB of 2 stages. I need to execute a program with 2 nested loops, loop1 and loop2. loop1 has a number of instructions N1 = 20. loop2 is the external loop, with N2 = (40 + N1). loop1 is the internal loop and is executed L1 = 30 times, while loop2 is executed L2 = 100 times. In case of a hit, the BTB introduces a number of stalls S1 = 2 if the prediction is wrong, and no stalls if the prediction is right. The BTB introduces a number of stalls SM = 3 if there is a compulsory miss, and in that case it does not replace the jumps that the program has to execute. How many stalls does the BTB add for control hazards? How many stalls (expressed as a percentage) do I need?" What is your modus operandi when solving an exercise like this one? Moreover, what do you do to visualize the situation described by the problem?

How to pick a Random Number using batch within a range

I was wondering how to use the %random% variable to pick a number within a range smaller than 0-30000 (a rough estimate). I read a couple of articles on this website, but they did not address my problem. In my program, I want to draw a random number from 0 to 5. Is there any way to do this?
Use the modulus operator. It divides one number by another and returns the remainder, so dividing by 6 gives a result in the range 0 to 5 (6 values); if you need 1 to 6, add 1. See Set /?. The operators are C operators (https://learn.microsoft.com/en-us/cpp/c-language/c-operators).
The following gives 1 to 6. Note that the modulus operator % has to be escaped with another % in a batch file.
Set /a num=%random% %% 6 + 1
echo %num%
Strictly speaking, the mod operator, %, maps the integers onto the set {0, ..., n-1} in a many-to-one fashion: it is surjective (onto) but not injective (one-to-one), so it is not invertible (but we know what you mean).
Care must be taken in the construction of the first half of your generator -- the part that produces the integer to be modded. If you are modding at irregular clock intervals, then the time of day down to the millisecond is just fine. But if you are modding from within a loop, you must take care that you are not producing only a subset of the full range you wish to mod. That is: there are 1000 possible millisecond values in clock time. If your loop timing is extremely regular, you could be drawing from a small subset of millisecond values on every call, and therefore producing the same modded values on every call, especially if your loop interval in milliseconds divides 1000 evenly.
You can use the rand() generator modulo 6 -- rand() % 6. This is what I do. You must however realize that rand() produces integers in the range 0 through 32767 using a recursive method: the next number produced depends entirely on the internal state left behind by the previous call, so once the seed is fixed the whole sequence is determined and successive draws are not independent of one another.
One more thing: rand() is not a class, and calls to it are GLOBALLY DEPENDENT within your program. So if you need to draw ranged random variables in different parts of your program, the dependence described above still holds, even if you are calling from different classes.
Most languages also provide a method of producing a REAL random number R in the range 0.0 <= R < 1.0. In BASIC this method is rnd(), and you would code something like int(rnd() * 1000) % 6, or some variation of that.
There are other homebrew methods of producing random variables; my fallback is the middle square method, which you can look up anywhere.
Well, I have said a mouthful, and perhaps it seems like I am driving thumbtacks with a sledgehammer, but this information is always useful when using the built-in methods.
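For comparison, the same modulo trick in C looks like this (a minimal sketch using the standard rand() generator discussed above, seeded once from the clock):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand((unsigned)time(NULL));   /* seed the generator once */

    for (int i = 0; i < 10; ++i) {
        int r = rand() % 6;        /* remainder is in 0..5 */
        printf("%d\n", r + 1);     /* add 1 to get 1..6 */
    }
    return 0;
}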

Calculating FLOPs

I am writing a program to calculate how long my CPU takes to do one FLOP. For that I wrote the code below:
before = clock();
y = 4.8;
x = 2.3;
z = 0;
for (i = 0; i < MAX; ++i) {
    z = x * y + z;          /* one multiply and one add per iteration */
}
/* cast to double so the division is not done in integer arithmetic */
printf("%1.20f\n", ((double)(clock() - before) / CLOCKS_PER_SEC) / MAX);
The problem is that I am repeating the same operation. Doesn't the compiler optimize this sort of thing away? If so, what do I have to do to get correct results?
I am not using the rand() function, so it does not interfere with my result.
This has a loop-carried dependency and not enough independent work to do in parallel, so even if anything is executed at all, it is not FLOPs that you are measuring: you will probably be measuring the latency of floating-point addition. The loop-carried dependency chain serializes all those additions. That chain has some little side-chains with multiplications in them, but they don't depend on anything, so only their throughput matters -- and that throughput is going to be better than the latency of an addition on any reasonable processor.
To actually measure FLOPs, there is no single recipe. The optimal conditions depend strongly on the microarchitecture: the number of independent dependency chains you need, the optimal add/multiply ratio, whether you should use FMA -- it all depends. Typically you have to do something more complicated than what you wrote, and if you're set on using a high-level language, you have to somehow trick it into actually doing any work at all.
For inspiration, see "How do I achieve the theoretical maximum of 4 FLOPs per cycle?"
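To make the "independent dependency chains" point concrete, here is a rough C sketch with several independent accumulators (an illustration only, not a proper benchmark: whether it approaches peak throughput -- and whether the compiler vectorizes, reorders, or folds the work away -- depends on your CPU, compiler, and flags):

#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    /* Four independent accumulators, so the additions do not form one
       long serial chain the way z = x*y + z does. */
    double a0 = 1.0, a1 = 1.1, a2 = 1.2, a3 = 1.3;
    double x = 1.0000001;

    clock_t before = clock();
    for (long i = 0; i < N; ++i) {
        a0 = a0 * x + 0.5;   /* each statement only depends on itself */
        a1 = a1 * x + 0.5;
        a2 = a2 * x + 0.5;
        a3 = a3 * x + 0.5;
    }
    double secs = (double)(clock() - before) / CLOCKS_PER_SEC;

    printf("%g\n", a0 + a1 + a2 + a3);  /* use the results so they stay live */
    /* 2 floating-point operations per statement, 4 statements per iteration */
    printf("~%.2f GFLOP/s\n", 2.0 * 4.0 * N / secs / 1e9);
    return 0;
}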
Even if no compiler optimization is going on (the possibilities have already been nicely listed), your variables and result will be in cache after the first loop iteration, and from then on you will be running with far more speed and performance than you would if the program had to fetch new values for each iteration.
So if you want to calculate the time for a single FLOP for a single iteration of this program, you would actually have to provide new input for every iteration. Really consider using rand(), and just seed it with a known value, e.g. srand(1).
Your calculation should also be different: the FLOP count is the number of floating-point operations your program performs, so in your case 2*n (one multiply and one add per iteration, where n = MAX). To calculate the time per FLOP, divide the time used by the number of FLOPs.

count total number of occurrences in parallel

I have a parallelized algorithm that can output a random number from 1 to 1000.
My objective is to compute, for N executions of the algorithm, how many times each number is chosen.
So, for instance, each of 100 threads performs N/100 executions of the algorithm, and the final result is an array of 1000 ints holding the number of occurrences of each number.
Is there a way to parallelize this intelligently? For instance, if I only use one global array, I will have to lock it every time I want to write to it, which will make my algorithm run almost as if there were no parallelization. On the other hand, I can't just make one array of 1000 numbers per thread, only to have each of them be 1% filled and merge them at the end.
This appears to be histogramming. If you want to do it quickly, use a library such as CUB or Thrust.
For cases where there are a small number of bins, one approach is to have each thread operate on its own set of bins, for a segment of the input. Then do a parallel reduction on each bin. If you are clever about the storage organization of your bins, the parallel reduction amounts to summation of matrix columns:
Bins:
             1    2    3    4   ...  1000
Thread   1
Thread   2
Thread   3
       ...
Thread 100
In the above example, each thread takes a segment of the input, and operates on one row of the partial sums matrix.
When all threads are finished with their segments, then sum the columns of the matrix, which can be done very efficiently and quickly with a simple for-loop kernel.
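For what it's worth, here is the same row-per-worker layout sketched on the CPU with OpenMP rather than CUDA (a sketch under assumptions: run_algorithm() is a hypothetical stand-in for the real algorithm, and the thread count and N are made up); each thread fills its own row of the partial-count matrix without locking, and the columns are summed at the end:

/* Compile with -fopenmp; rand_r() is POSIX. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define BINS     1000
#define NTHREADS 100
#define N        1000000L   /* total executions, for illustration */

/* Hypothetical stand-in for the parallelized algorithm: returns 1..1000. */
static int run_algorithm(unsigned *state)
{
    return (int)(rand_r(state) % BINS) + 1;
}

int main(void)
{
    /* One row of partial counts per thread: an NTHREADS x BINS matrix. */
    static long partial[NTHREADS][BINS];
    long total[BINS] = {0};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        unsigned seed = 1234u + (unsigned)t;
        #pragma omp for
        for (long i = 0; i < N; ++i)
            partial[t][run_algorithm(&seed) - 1]++;   /* no locking needed */
    }

    /* Sum the columns of the partial-count matrix. */
    for (int b = 0; b < BINS; ++b)
        for (int t = 0; t < NTHREADS; ++t)
            total[b] += partial[t][b];

    printf("occurrences of 1: %ld\n", total[0]);
    return 0;
}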
There are a couple of things you can do.
If you want to be as portable as possible, you could have one lock for each index.
If this is being run on a Windows system, I would suggest InterlockedIncrement
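For reference, a portable equivalent of the per-index atomic increment could look like the following C11 sketch (InterlockedIncrement itself is the Windows-specific call; the record() helper is hypothetical):

#include <stdatomic.h>

#define BINS 1000

static atomic_long counts[BINS];   /* zero-initialized at program start */

/* Any thread may call this concurrently; no explicit lock is required. */
static inline void record(int value)     /* value is in 1..1000 */
{
    atomic_fetch_add(&counts[value - 1], 1);
}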

Performance Optimization for Matrix Rotation

I'm now stuck on a performance optimization lab from the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
-------                     -------
|1|2|3|                     |3|6|9|
-------                     -------
|4|5|6|   after rotate is   |2|5|8|
-------                     -------
|7|8|9|                     |1|4|7|
-------                     -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with the idea of unrolling the inner loop. The results are:
Code Version      Speed Up
original          1x
unrolled by 2     1.33x
unrolled by 4     1.33x
unrolled by 8     1.55x
unrolled by 16    1.67x
unrolled by 32    1.61x
I also got a code snippet from pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1;
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while (--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while (--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while (--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat 32 rows as a data zone and perform the rotation zone by zone. The speedup of this version is 1.85x, beating all of the loop-unrolled versions.
Here are the questions:
In the inner-loop-unrolling version, why does the improvement shrink as the unrolling factor increases? In particular, going from 8 to 16 does not help as much as going from 4 to 8. Does the result have some relationship with the depth of the CPU pipeline? If so, could the diminishing improvement reflect the pipeline length?
What is the probable reason for the speedup of the data-zone version? It seems that there is not much essential difference from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the Centrino Core 2 Duo has two cores, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (i.e., when you run Task Manager on Windows, or top on Linux, you see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then by the time both finish (assuming each operation in the drive-through takes one 'cycle' or time unit), 2 full operations will be done after 4 cycles have elapsed. If each car did all of its operations at one window, the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying, and getting food, for a total of 6 cycles. So, operation time decreases due to pipelining.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 unrolls causing a decrease in performance may have to do with the bandwidth from the cache to the processor (again a guess; I can't know for sure without seeing your exact code, as well as the machine code). If all the instructions can't fit into the cache or into the registers, then some time is necessary to prepare them all to run (i.e., people have to get into their cars and get to the drive-through in the first place). There will be some reduction in speed if they all arrive at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code goes straight to the pointers into the arrays and accesses them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + the [] offset in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set automatically by the decrement), whereas counting up means you have to explicitly compare against a maximum value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop: I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indices to the base address of the array, which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache, you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 matrix, even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.

Resources