I know that this loop is O(n^2) but what is Big-Omega and Big-Theta? How do you go about calculating them in situations like these?
for (i = 0; i < array.length; i++)
    for (j = 0; j < array.length; j++)
        // bla bla
For starters, see larsmans' comment. The loop logic is not necessarily trivial enough to exclude. Let's say for argument's sake that you're confident that the loop logic will not break out early, that the logic is trivial (i.e. no conditional paths affecting the work performed), and that you are defining your unit of work to be the total logic performed in one pass through the loop.
In this case, your upper and lower bounds are the same. You are guaranteed to execute at least, and at most, on the order of N^2 units of work. You have a Ω(N^2), and a O(N^2). Your lower and upper bounds are identical; you can characterize Θ(N^2).
It bears mentioning again that this is pointless if the loop logic is non-trivial and is especially dependent on what you are actually defining as a unit of work. The point of these notations is to characterize an expected amount of work to be incurred by an algorithm. You can iterate through a loop millions of times, but that doesn't affect this notation if the work you really care about is how many times SomeExpensiveFunction() is called within that loop, and the logic dictates that it is only called once.
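To make the "unit of work" point concrete, here is a minimal C sketch. The names count_work and work_counts are made up for illustration, and the counter expensive_calls merely stands in for a call to SomeExpensiveFunction(). It tallies the same nested loop two ways: by passes through the inner body, and by calls to the expensive step.

```c
#include <stddef.h>

/* Hypothetical tally of two different "units of work" for the same loop. */
struct work_counts {
    size_t inner_passes;     /* unit = one pass through the inner body */
    size_t expensive_calls;  /* unit = one call to the expensive step  */
};

struct work_counts count_work(size_t n)
{
    struct work_counts c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            c.inner_passes++;            /* runs exactly n * n times */
            if (i == 0 && j == 0)
                c.expensive_calls++;     /* stands in for SomeExpensiveFunction();
                                            runs once when n > 0 */
        }
    }
    return c;
}
```

Counted in inner passes the loop is Θ(N^2); counted in expensive calls it is constant, even though the loop structure is identical.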
Whenever I see algorithm optimization, I see lots of talk about reducing loop count. Oftentimes, I see multiple operations being incorporated into one loop that were originally done separately.
Ultimately, the same number of O(1) processes are performed. It's just that one algorithm splits them into multiple iterations. Is there honestly a performance benefit to combining operations, from a scaling perspective?
Overly simplified example. I'm aware this is not a good example, because the cost of the operations inside the loop is low compared to even the act of incrementing i, but you get my point.
let tally1 = 0
let tally2 = 0
for (let i = 0; i < 10; i++) {
    tally1 += 1
}
for (let i = 0; i < 10; i++) {
    tally2 += 1
}
// vs
for (let i = 0; i < 10; i++) {
    tally1 += 1
    tally2 += 1
}
The second version will generally perform better, because the operations that make up the loop itself (the counter increment and the termination test) are executed for one loop instead of two.
So while the operations executed inside the loop will perform no better or worse, the overall execution time will be shorter.
Whether that is relevant or not largely depends on how expensive the operations inside the loop are. If they are cheap, the overhead of the loop will be noticeable, and it may be worth optimizing the code. If they are expensive, it might not be worth the effort.
Besides performance, clarity of the code is also a good thing. So if it doesn't matter from a performance point of view, you should choose the code that is easier to read.
In very short loops, the overhead of the loop construction itself (increment and termination test) is "significant". To the point that compilers may perform "loop unrolling" optimizations, i.e. replicate the loop body to avoid performing the intermediate tests (with some extra care to handle termination).
Loop merging can bring similar speedups.
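As a rough illustration of what unrolling does, here is a hand-written C sketch. The function names are made up for this example, and a compiler's own unrolling may look quite different. The second function does the same work as the first but pays for one increment and one termination test per four elements, with a short clean-up loop for the leftovers.

```c
#include <stddef.h>

/* Plain loop: one increment and one termination test per element. */
long sum_simple(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Manually unrolled by 4: one increment and one test per four elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* leftover 0..3 elements */
        total += a[i];
    return total;
}
```

Whether the unrolled version is actually faster depends on the compiler and target; an optimizing compiler may already produce something similar from the plain loop.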
When the loop bodies are more complicated, the loop overhead becomes negligible by comparison, and performance can even degrade when you merge the loops, because the merged body may need more registers than are available or may degrade cache efficiency.
For ordinary programs, these kinds of micro-optimization are often not worth the effort. They are more relevant in the development of reusable code of general usefulness, such as the BLAS routines.
As far as I know, loops are used in programming to perform repetitive tasks.
There are several kinds of loops, such as for, while, and do while, and their syntax differs. For example, in a while loop we initialize the counter outside the loop, check the condition in while(), and increment or decrement (++ or --) inside the code block, whereas in a for loop we do all of these things (initialization, condition checking, and ++ or --) in the for statement itself.
So my question is: which loop is the most efficient and uses the least memory?
The loops you listed aren't really going to differ in memory usage or "efficiency". Rather, each one should be used in different situations. A for loop is often used when one needs to iterate through some object that contains multiple indexes or lines etc. For example (Java):
for (int i = 0; i < fooString.length(); i++) {
    fooCharArray[i] = fooString.charAt(i);
}
You could also achieve the same with a while loop:
int i = 0;
while (i < fooString.length()) {
    fooCharArray[i] = fooString.charAt(i);
    i++;
}
Often, recursion can achieve the same results as loops, too (though in my example it'd seem slightly wasteful, since a loop could do it so easily). So really, it's more about what you're doing, what's easiest for you, and what makes it the most readable/understandable for you and other programmers.
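For comparison, a recursive version of the same character copy might look like this. It is only a sketch, written in C rather than Java, and copy_chars is a name invented for the example.

```c
#include <stddef.h>

/* Recursive counterpart of the copy loops: copies src[i..len-1] into dst. */
void copy_chars(const char *src, char *dst, size_t i, size_t len)
{
    if (i >= len)                       /* base case: nothing left to copy */
        return;
    dst[i] = src[i];
    copy_chars(src, dst, i + 1, len);   /* recurse on the rest of the string */
}
```

Unless the compiler eliminates the tail call, this uses one stack frame per character, which is why the loop is the more natural choice here.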
I have code that needs to check whether a given buffer of size 2048 is all zero. Right now I make a single traversal, but I wonder if there is a faster way to check that every element is 0. The code I have is as follows:
static int isSilent(Uint8* buf, int length){
    int i;
    for(i=0; i<length; i++)
        if(buf[i] != 0) return 0;
    return 1;
}
EDIT:
I'm doing real time audio processing thus the small buffer size and trying to reduce the delay. Just trying to explore faster ways. Thanks.
No, generally speaking if you want to check something about a whole array, you have to check the whole array. If you know something about the array in advance then of course you may be able to optimize this.
O(n) is the Big-O complexity for a linear scan.
Even when processing in larger aligned units (e.g. on word-or-larger boundaries), the constant-factor improvements are still factored out. Such optimizations/tricks [1], when applicable, may give a better wall-clock time and "run faster", but they don't affect the O(n) bound.
However, make sure this is a relevant issue, as 2048 elements is small, and the loop terminates as soon as a failing case is detected.
[1] There are various tricks that can be used in this case. One of the most notable examples is memcpy, shown in implementations such as this, which uses a "word aligned" loop.
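A minimal sketch of the word-at-a-time idea applied to the zero check could look like this. It assumes uint8_t in place of the question's Uint8 type, and is_silent_wordwise is a name made up here. It is still a linear scan, just with fewer iterations.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all `length` bytes are zero, 0 otherwise. Still O(n). */
int is_silent_wordwise(const uint8_t *buf, size_t length)
{
    size_t i = 0;
    /* Test 8 bytes per iteration; memcpy sidesteps alignment and aliasing issues. */
    for (; i + sizeof(uint64_t) <= length; i += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word);
        if (word != 0)
            return 0;
    }
    /* Check the remaining 0..7 bytes one at a time. */
    for (; i < length; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

Whether this beats the byte loop in practice has to be measured; an optimizing compiler may already vectorize the simple version.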
Theoretically speaking there is no such way. For an unsorted array you must examine every element, so the operation has complexity Θ(n). In practice you might use parallel processing such as SIMD, but this seems like overkill for such a simple (small?) task. Essentially, the more processors you have, the greater the speedup you can achieve. But for any fixed number of processors k ∈ ℕ, k > 0, you still have time complexity Θ(n) for arbitrarily large n.
In a C book there was a true/false question in which the following two statements were described as true.
1) Compiler implements a jump table for cases used in a switch.
2) for loop can be used if we want statements in a loop to get executed at least once.
I have following questions regarding these two points:
What is the meaning of statement number 1?
In my view, the second statement should be false, because for this task we use a do while loop. Am I right?
The first point is somewhat misleading, if it's worded just like that. That might just be the point, of course. :)
It refers to one common way of generating fast code for a switch statement, but there's absolutely no requirement that a compiler does that. Even those that do probably don't do it always, since there are bound to be trade-offs that perhaps only make it worthwhile for a switch with more than n cases. Also, the case values themselves typically have to be "compact" in order to provide a good index to use in the table.
And yes, a do loop is what to use if you want at least one iteration, since it does the test at the end whereas both for and while do it at the start.
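A small sketch of that difference, with made-up function names: both loops start with a condition that is already false, and each function returns how many times its body ran.

```c
/* while: the condition is tested before the first pass. */
int while_runs(void)
{
    int runs = 0;
    int i = 10;
    while (i < 10) {
        runs++;
        i++;
    }
    return runs;        /* 0: the body never executes */
}

/* do-while: the condition is tested after the first pass. */
int do_while_runs(void)
{
    int runs = 0;
    int i = 10;
    do {
        runs++;
        i++;
    } while (i < 10);
    return runs;        /* 1: the body executes once */
}
```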
1) It means that a common optimization is for a compiler to build a "jump table" which is like an array where the values are addresses of instructions where the program will execute next. The array is built such that the indexes correspond to values being switched on. Using a jump table like this is O(1), whereas a cascade of "if/else" statements is O(n) in the number of cases.
2) Sure, you can use a "do-while" loop to execute something "at least once." But you'll find that do-while loops are fairly uncommon in most applications, and "for" loops are the most common--partly because if you omit the first and third parts between their parentheses, they are really just fancy "while" loops! For example:
for (; i < x; ) // same as while (i < x)
for (i = 0; i == 0 || i < x; ) // like i = 0; do ... while (i < x)
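To make point 1 concrete, here is a hand-written C sketch of what a jump table amounts to, using an array of function pointers. The handler names and return values are invented for illustration. A real compiler typically builds a table of code addresses inside one function rather than calling separate functions, and only when it judges that worthwhile.

```c
/* Hypothetical handlers, one per case value 0..3. */
static int handle_zero(void)  { return 100; }
static int handle_one(void)   { return 101; }
static int handle_two(void)   { return 102; }
static int handle_three(void) { return 103; }

/* The "jump table": index = value switched on, entry = code to run. */
static int (*const table[])(void) = {
    handle_zero, handle_one, handle_two, handle_three
};

int dispatch(unsigned value)
{
    if (value >= sizeof table / sizeof table[0])
        return -1;              /* plays the role of `default:` */
    return table[value]();      /* one bounds check, one indexed jump */
}
```

The lookup cost does not grow with the number of cases, which is the O(1) behaviour described above; it also shows why the case values need to be compact for the table to stay small.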
I have long wondered what is more efficient with regards to making better use of CPU caches (which are known to benefit from locality of reference) - two loops each iterating over the same mathematical set of numbers, each with a different body statement (e.g. a call to a function for each element of the set), or having one loop with a body that does the equivalent of two (or more) body statements. We assume identical application state after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Assuming:
Cost of a f and g call is negligible compared to cost of the loop
f and g use most of the cache each by itself, and so the cache would be spilled when one is called after another (the case with a single-loop version)
Intel Core Duo CPU
C language source code
The GCC compiler, "no extra switches"
I want answers outside the "premature optimization is evil" character, if possible.
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
To measure is to know.
I can see three variables (even in a seemingly simple chunk of code):
What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in L2 instruction cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: The inverse does not imply "have a single loop", because:
Do f() and g() operate on large amounts of data, according to i? Then, it'd be nice to know if they operate on the same set of data - again you have to consider whether operating on two different sets screws you up via cache misses.
If f() and g() are indeed that primitive as you first state, and I'm assuming both in code size as well as running time and code complexity, cache locality issues won't arise in little chunks of code like this - your biggest concern would be if some other process were scheduled with actual work to do, and invalidated all the caches until it were your process' turn to run.
A final thought: given that such processes like above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline, and let the compiler unroll the loop. That is because for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line that'd contain i, j, k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.
On the other hand it completely depends on f and g. If both are sufficiently large that each of their code or cacheable data that they use nearly fills a critical cache then swapping between f and g may completely swamp any single loop benefit.
As you say: it depends.
Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data so that the second time (second loop) you iterate over it some if not all will have to be read again.
If the two loops were joined so that each element/block is fetched for the first operation and then is already in cache for the second operation, then no matter how large the data is relative to the cache most if not all of the second operations will take their data from the cache.
Various things, such as the nature of the cache, or the loop's own code being evicted by data and then evicting data when it is fetched again, may cause some misses on the second operation. On a PC with an operating system, many evictions will also occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.
Tuning for a cache is difficult at best. I regularly demonstrate this even on an embedded system with no interrupts, a single task, and the same source code: execution time can vary dramatically simply by changing compiler optimization options or changing compilers, whether different brands or different versions of the same compiler, such as gcc 2.x vs 3.x vs 4.x. (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target.) The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc.
Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. With the same compiler and the same code you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).
That doesn't mean you can't tune for a cache; it just means that figuring out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't work anymore: you might get great results on your computer that day with that program and that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster. Maybe it had nothing to do with your code; your email program may have been downloading a lot of mail in the background during one test and not during the other.
Assuming I understood your question correctly I think the single loop is probably faster in general.
Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and can make a lot of difference to the performance.
From your example:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
I would either fuse the two loops into one loop like this:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}
Or, if this is not possible, do the optimization called Loop-Tiling:
#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps */
                       /* the working-set below your first level cache size */
int j = 0, k = 0;
int i = 0;
int elements = 1000000;
do {
    int n = i + TILE_SIZE;       /* end of the current tile */
    if (n > elements) n = elements;
    // perform loop A on the tile [i, n)
    for (int a = i; a < n; a++)
    {
        j += f(a);
    }
    // perform loop B on the same tile
    for (int a = i; a < n; a++)
    {
        k += g(a);
    }
    i = n;                       /* advance to the next tile */
} while (i != elements);
The trick with loop tiling is, that if the loops share an access pattern the second loop body has the chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you execute loop A a million times because the cache is not large enough to hold all this data.
Breaking the loop into smaller chunks and executing them one after another will help here a lot. The trick is to limit the working-set of memory below the size of your first level cache. I aim for half the size of the cache, so other threads that get executed in-between don't mess up my cache so much.
If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.
But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.
This seems like something the compiler could optimize for you so instead of trying to figure it out yourself and making it fast, use whatever method makes your code more clear and readable. If you really must know, time both methods for input size and calculation type that your application uses (try the code you have now but repeat your calculations many many times and disable optimization).
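If you do want to time both versions, a minimal harness could look like the following. It is only a sketch: f and g here are hypothetical stand-ins for the real work, time_two_loops and time_one_loop are names invented for the example, and clock() measures CPU time at fairly coarse resolution, so repeat the runs and compare averages rather than single numbers.

```c
#include <time.h>

/* Hypothetical stand-ins for the real per-element work. */
static int f(int i) { return i % 7; }
static int g(int i) { return i % 11; }

/* Two-loop version; stores both sums and returns elapsed CPU seconds. */
double time_two_loops(int n, long *j_out, long *k_out)
{
    long j = 0, k = 0;
    clock_t start = clock();
    for (int i = 0; i < n; i++)
        j += f(i);
    for (int i = 0; i < n; i++)
        k += g(i);
    clock_t end = clock();
    *j_out = j;
    *k_out = k;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Single-loop version of the same work. */
double time_one_loop(int n, long *j_out, long *k_out)
{
    long j = 0, k = 0;
    clock_t start = clock();
    for (int i = 0; i < n; i++) {
        j += f(i);
        k += g(i);
    }
    clock_t end = clock();
    *j_out = j;
    *k_out = k;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Returning the sums through j_out and k_out keeps the results in use, so the compiler cannot simply discard the loops, and comparing them confirms that both versions did the same work.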