Why is my loop slower when I remove code - c

When I remove the tests to compute minimum and maximum from the loop, the execution time is actually longer than with the test. How is that possible ?
Edit :
After running more test, it seems the runtime is not constant, ie the same code
can run in 9 sec or 13 sec.... So it was just a repetable coincidence. Repetable until you do enough tests that is...
Some details :
execution time with the min max test : 9 sec
execution time without the min max test : 13 sec
CFLAGS=-Wall -O2 -fPIC -g
gcc 4.4.3 32 bit
Section to remove is now indicated in code
Some guess :
bad cache interaction ?
void FillFullValues(void)
{
int i,j,k;
double X,Y,Z;
double p,q,r,p1,q1,r1;
double Ls,as,bs;
unsigned long t1, t2;
t1 = GET_TICK_COUNT();
MinLs = Minas = Minbs = 1000000.0;
MaxLs = Maxas = Maxbs = 0.0;
for (i=0;i<256;i++)
{
for (j=0;j<256;j++)
{
for (k=0;k<256;k++)
{
X = 0.4124*CielabValues[i] + 0.3576*CielabValues[j] + 0.1805*CielabValues[k];
Y = 0.2126*CielabValues[i] + 0.7152*CielabValues[j] + 0.0722*CielabValues[k];
Z = 0.0193*CielabValues[i] + 0.1192*CielabValues[j] + 0.9505*CielabValues[k];
p = X * InvXn;
q = Y;
r = Z * InvZn;
if (q>0.008856)
{
Ls = 116*pow(q,third)-16;
}
else
{
Ls = 903.3*q;
}
if (q<=0.008856)
{
q1 = 7.787*q+seiz;
}
else
{
q1 = pow(q,third);
}
if (p<=0.008856)
{
p1 = 7.787*p+seiz;
}
else
{
p1 = pow(p,third);
}
if (r<=0.008856)
{
r1 = 7.787*r+seiz;
}
else
{
r1 = pow(r,third);
}
as = 500*(p1-q1);
bs = 200*(q1-r1);
//
// cast on short int for reducing array size
//
FullValuesLs[i][j][k] = (char) (Ls);
FullValuesas[i][j][k] = (char) (as);
FullValuesbs[i][j][k] = (char) (bs);
//// Remove this and get slower code
if (MaxLs<Ls)
MaxLs = Ls;
if ((abs(Ls)<MinLs) && (abs(Ls)>0))
MinLs = Ls;
if (Maxas<as)
Maxas = as;
if ((abs(as)<Minas) && (abs(as)>0))
Minas = as;
if (Maxbs<bs)
Maxbs = bs;
if ((abs(bs)<Minbs) && (abs(bs)>0))
Minbs = bs;
//// End of Remove
}
}
}
TRACE(_T("LMax = %f LMin = %f\n"),(MaxLs),(MinLs));
TRACE(_T("aMax = %f aMin = %f\n"),(Maxas),(Minas));
TRACE(_T("bMax = %f bMin = %f\n"),(Maxbs),(Minbs));
t2 = GET_TICK_COUNT();
TRACE(_T("WhiteBalance init : %lu ms\n"), t2 - t1);
}

I think compiler is trying to unroll the inner loop because you are removing dependency between iterations. But somehow this doesn't help in your case. Maybe because the loop is too big and using too many registers to be unrolled.
Try to turn off unrolling and post results again.
If this is the case, I would suggest you to submit a performance issue to gcc.
PS. I think you can merge if (q>0.008856) and if (q<=0.008856).

Maybe its the cache, maybe unrolling problems, there is only one way to answer this: look at the generated code (e.g. by using the -S option). Maybe you can post it/or spot the difference when comparing them.
EDIT: As you now clarified that it was just the measurement I can only recommend (or better command ;-) you, that when you want to get runtime numbers: ALWAYS put it into some loop and average it. Best to do it outside your programm (in a shell script), so your cache is not already filled with the right data.

Related

Getting timeout error in New Year Chaos problem using scala

I have got a working scala code for New Year Chaos problem in Hackerrank, but Im recieving timeout errors in some test cases
https://www.hackerrank.com/challenges/new-year-chaos/problem?h_l=interview&playlist_slugs%5B%5D=interview-preparation-kit&playlist_slugs%5B%5D=arrays
Please help me with optimization of below code:
def minimumBribes(q: Array[Int]){
val c = q.sorted
var swap = 0
var count_swap=0
import scala.util.control._
val loop = new Breaks
var temp = 0
var flag = true
loop.breakable
{
for (i <- q.length-1 to 0 by -1)
{
swap = 0
if (q(i) != i+1)
{
swap=i-q.indexOf(i+1)
if (swap > 2) {println("Too Chaotic");flag=false;loop.break()}
else
{
temp= q(q.indexOf(i+1))
q(q.indexOf(i+1)) = q(i-1)
q(i-1) = q(i)
q(i) = temp
count_swap += swap
if(q.deep == c.deep){
loop.break()
}
}
}
}
}
if (flag)println(count_swap)
}
To be honest I don't understand your implementation but
1) q.sorted could possibly already run out of time given that n is ~10^5.
2) q.sorted call is actually redundant since it's just a 1..n sequence.
3) using q.indexOf makes your algorithm O(n^2) complex. It's possible to solve it in linear time.

Integer vs Boolean array Swift Performance

I tried executing Sieve Of Eratosthenes algorithm using a large Integer array and a large Bool array.
The integer version seems to execute MUCH faster than the boolean one. What is the possible reason for this?
import Foundation
var n : Int = 100000000;
var prime = [Bool](repeating: true, count: n+1)
var p = 2
let start = DispatchTime.now()
while((p*p)<=n)
{
if(prime[p] == true)
{
var i = p*2
while (i<=n)
{
prime[i] = false
i = i + p
}
}
p = p+1
}
let stop = DispatchTime.now()
let time = (Double)(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1000000.0
print("Time = \(time) ms")
Boolean array execution time : 78223.342295 ms
import Foundation
var n : Int = 100000000;
var prime = [Int](repeating: 1, count: n+1)
var p = 2
let start = DispatchTime.now()
while((p*p)<=n)
{
if(prime[p] == 1)
{
var i = p*2
while (i<=n)
{
prime[i] = 0
i = i + p
}
}
p = p+1
}
let stop = DispatchTime.now()
let time = (Double)(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1000000.0
print("Time = \(time) ms")
Integer array execution time : 8535.54546 ms
TL, DR:
Do not attempt to optimize your code in a Debug build. Always run it through the Profiler. Int was faster then Bool in Debug but the oposite was true when run through the Profiler.
Heap allocation is expensive. Use your memory judiciously. (This question discusses the complications in C, but also applicable to Swift)
Long answer
First, let's refactor your code for easier execution:
func useBoolArray(n: Int) {
var prime = [Bool](repeating: true, count: n+1)
var p = 2
while((p*p)<=n)
{
if(prime[p] == true)
{
var i = p*2
while (i<=n)
{
prime[i] = false
i = i + p
}
}
p = p+1
}
}
func useIntArray(n: Int) {
var prime = [Int](repeating: 1, count: n+1)
var p = 2
while((p*p)<=n)
{
if(prime[p] == 1)
{
var i = p*2
while (i<=n)
{
prime[i] = 0
i = i + p
}
}
p = p+1
}
}
Now, run it in the Debug build:
let count = 100_000_000
let start = DispatchTime.now()
useBoolArray(n: count)
let boolStop = DispatchTime.now()
useIntArray(n: count)
let intStop = DispatchTime.now()
print("Bool array:", Double(boolStop.uptimeNanoseconds - start.uptimeNanoseconds) / Double(NSEC_PER_SEC))
print("Int array:", Double(intStop.uptimeNanoseconds - boolStop.uptimeNanoseconds) / Double(NSEC_PER_SEC))
// Bool array: 70.097249517
// Int array: 8.439799614
So Bool is a lot slower than Int right? Let's run it through the Profiler by pressing Cmd + I and choose the Time Profile template. (Somehow the Profiler wasn't able to separate these functions, probably because they were inlined so I had to run only 1 function per attempt):
let count = 100_000_000
useBoolArray(n: count)
// useIntArray(n: count)
// Bool: 1.15ms
// Int: 2.36ms
Not only they are an order of magnitude faster than Debug but the results are reversed to: Bool is now faster than Int!!! The Profiler doesn't tell us why how so we must go on a witch hunt. Let's check the memory allocation by adding an Allocation instrument:
Ha! Now the differences are laid bare. The Bool array uses only one-eight as much memory as Int array. Swift array uses the same internals as NSArray so it's allocated on the heap and heap allocation is slow.
When you think even more about it: a Bool value only take up 1 bit, an Int takes 64 bits on a 64-bit machine. Swift may have chosen to represent a Bool with a single byte, while an Int takes 8 bytes, hence the memory ratio. In Debug, this difference may have caused all the difference as the runtime must do all kinds of checks to ensure that it's actually dealing with a Bool value so the Bool array method takes significantly longer.
Moral of the lesson: don't optimize your code in Debug mode. It can be misleading!
(A partial answer ...)
As #MartinR mentions in his comments to the question, there is no such major difference between the two cases if you build for release mode (with optimizations); the Bool case is slightly faster due its smaller memory footprint (but equally fast as e.g. UInt8 which has the same footprint).
Running instruments to profile the (non-optimized) debug build, we clearly see that the array element access & assignment is the culprit for the Bool case (an as far as my brief testing has seen; for all types except the integer ones, Int, UInt16, and so on).
We can further ascertain that its not the writing part in particular that yields the overhead, but rather the repeated accessing of the i:th element.
The same explicit read-access tests for an array of integer elements show no such large overhead.
It would almost seem as if the random element access is, for some reason, not working as it should (for non-integer types) when compiling with debug build config.

Ruby native C big integer segfault

I'm working on a Ruby native C method: power mod. Here's what I got:
#define TO_BIGNUM(x) (FIXNUM_P(x) ? rb_int2big(FIX2LONG(x)) : x)
#define CONST2BIGNUM(x) (TO_BIGNUM(INT2NUM(x)))
VALUE method_big_power_mod(VALUE self, VALUE base, VALUE exp, VALUE mod){
VALUE res = TO_BIGNUM(INT2NUM(1));
base = TO_BIGNUM(base);
exp = TO_BIGNUM(exp);
mod = TO_BIGNUM(mod);
while (rb_big_cmp(exp, CONST2BIGNUM(0))) {
if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
VALUE mul = rb_big_mul(res, base);
res = rb_big_modulo(mul, mod);
}
base = rb_big_modulo(rb_big_pow(base, CONST2BIGNUM(2)), mod);
exp = rb_big_div(exp, CONST2BIGNUM(2));
}
return res;
}
It segfaults every time. I isolated the problem to rb_big_modulo calls. gdb stacktrace says that it crashes in the bigdivrem method after calling rb_big_modulo. I tried to look through the source of bignum.c, but I can't figure out what's causing the crash. Am I doing something wrong?
There are two problems that are causing the segfault:
1 - The functions rb_big_* sometimes doesn't return a Bignum object, but when you call then the first arg must be a Bignum object. For example:
if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
VALUE mul = rb_big_mul(res, base); // This maybe return a Fixnum
res = rb_big_modulo(mul, mod); // This will cause a segfault :(
}
2 - The function rb_big_pow when you call it with both args Bignum, it will warn you and will return a Float object where you can't convert easily to a Bignum object. So, you should replace the line where you call it by:
VALUE x = TO_BIGNUM(rb_big_pow(base, INT2NUM(2))); // Power by a Fixnum instead a Bignum
base = TO_BIGNUM(rb_big_modulo(x , mod));
The final implementation will be:
#define TO_BIGNUM(x) (FIXNUM_P(x) ? rb_int2big(FIX2LONG(x)) : x)
#define CONST2BIGNUM(x) (TO_BIGNUM(INT2NUM(x)))
VALUE method_big_power_mod(VALUE self, VALUE base, VALUE exp, VALUE mod){
VALUE res = TO_BIGNUM(INT2NUM(1));
base = TO_BIGNUM(base);
exp = TO_BIGNUM(exp);
mod = TO_BIGNUM(mod);
while (rb_big_cmp(exp, CONST2BIGNUM(0))) {
if (rb_big_modulo(exp, CONST2BIGNUM(2))) {
VALUE mul = TO_BIGNUM(rb_big_mul(res, base));
res = TO_BIGNUM(rb_big_modulo(mul, mod));
}
VALUE x = TO_BIGNUM(rb_big_pow(base, INT2NUM(2)));
base = TO_BIGNUM(rb_big_modulo(x , mod));
exp = TO_BIGNUM(rb_big_div(exp, CONST2BIGNUM(2)));
}
return res;
}
I don't know the performance impact with all these conversions. Maybe, you should test when it is a Fixnum or a Bignumand calculate it using the proper function or benchmark both approaches.
When I ran it, I went thought an infinite loop, but I don't know if I call it with the correct values.

How can I apply a low-shelving filter using Visualdsp++?

I'm very new to DSP. And have to solve the following problem: applying the low shelving filter for an array of data. The original data is displayed in fract16 (VisualDSP++).
I'm writing something as below but not sure it's correct or not.
Does the following code have any problem with overflow?
If 1 is true, how should I do to prevent it?
Any advice on this problem?
fract16 org_data[256]; //original data
float16 ArrayA[],ArrayB[];
long tmp_A0, tmp_A1, tmp_A2, tmp_B1, tmp_B2;
float filter_paraA[3], filter_paraB[3]; // correctness: 0.xxxxx
// For equalizing
// Low-Shelving filter
for ( i=0; i<2; i++)
{
tmp_A1 = ArrayA[i*2];
tmp_A2 = ArrayA[i*2+1];
tmp_B1 = ArrayB[i*2];
tmp_B2 = ArrayB[i*2+1];
for(j=0;j<256;j++){
tmp_A0 = org_data[j];
org_data[j] = filter_paraA[0] * tmp_A0
+ filter_paraA[1] * tmp_A1
+ filter_paraA[2] * tmp_A2
- filter_paraB[1] * tmp_B1
- filter_paraB[2] * tmp_B2;
tmp_A2 = tmp_A1;
tmp_B2 = tmp_B1;
tmp_A1 = tmp_A0;
tmp_B1 = org_data[j];
}
ArrayA[i*2] = tmp_A1;
ArrayA[i*2+1] = tmp_A2;
ArrayB[i*2] = tmp_B1;
ArrayB[i*2+1] = tmp_B2;
}
I don't know what the range is for fract16, just -1 to +1 approx?
The section that stands out to me as possibly generating an overflow is assigning org_data[j] but will be dependent on what you know about your input signal and your filter coefficients. If you can ensure that multiplying filter_paraA[2:0] to signal with values tmp_A2..1 = [1,1,1] is < max(fract16) you should be fine regardless of the 'B' side.
I would recommend adding some checks for overflow in your code. It doesn't necessarily have to fix it, but you would be able to identify an otherwise very tricky bug. Unless you need absolute max performance I would even leave the check code in place but with less output or setting a flag that gets checked.
macA = filter_paraA[0] * tmp_A0 + filter_paraA[1] * tmp_A1 \
+ filter_paraA[2] * tmp_A2;
macB = filter_paraB[1] * tmp_B1 - filter_paraB[2] * tmp_B2;
if((macA-macB)>1){
printf("ERROR! Overflow detected!\n");
printf("tmp_A[] = [%f, %f, %f]\n",tmp_A2,tmp_A1,tmp_A0);
printf("tmp_B[] = [%f, %f]\n",tmp_B1,tmp_B0);
printf(" i = %i, j = %i\n",i,j);
}

Why is this code not deterministic?

The following snippet is code from water-nsq benchmark from SPLASH 2...
if (comp_last > NMOL1)
{
for (mol = StartMol[ProcID]; mol < NMOL; mol++)
{
pthread_mutex_lock(&gl->MolLock[mol % MAXLCKS]);
for ( dir = XDIR; dir <= ZDIR; dir++) {
temp_p = VAR[mol].F[DEST][dir];
temp_p[H1] += PFORCES[ProcID][mol][dir][H1];
temp_p[O] += PFORCES[ProcID][mol][dir][O];
temp_p[H2] += PFORCES[ProcID][mol][dir][H2];
}
pthread_mutex_unlock(&gl->MolLock[mol % MAXLCKS]);
}
comp = comp_last % NMOL;
for (mol = 0; ((mol <= comp) && (mol < StartMol[ProcID])); mol++)
{
pthread_mutex_lock(&gl->MolLock[mol % MAXLCKS]);
for ( dir = XDIR; dir <= ZDIR; dir++)
{
temp_p = VAR[mol].F[DEST][dir];
temp_p[H1] += PFORCES[ProcID][mol][dir][H1];
temp_p[O] += PFORCES[ProcID][mol][dir][O];
temp_p[H2] += PFORCES[ProcID][mol][dir][H2];
}
pthread_mutex_unlock(&gl->MolLock[mol % MAXLCKS]);
}
}
else
{
for (mol = StartMol[ProcID]; mol <= comp_last; mol++)
{
pthread_mutex_lock(&gl->MolLock[mol % MAXLCKS]);
for ( dir = XDIR; dir <= ZDIR; dir++)
{
temp_p = VAR[mol].F[DEST][dir];
temp_p[H1] += PFORCES[ProcID][mol][dir][H1];
temp_p[O] += PFORCES[ProcID][mol][dir][O];
temp_p[H2] += PFORCES[ProcID][mol][dir][H2];
}
pthread_mutex_unlock(&gl->MolLock[mol % MAXLCKS]);
}
}
pthread_barrier_wait(&(gl->start));
The problem is that it is not deterministic at the barrier in the end, that is, if you execute this code two times with same inputs, it gives different answers. In other words, if the lock order of mutexes is changed, the results are different.
And yes I have verified this by noting the memory pages. Also I can assure you that the change occurs in the VAR's (pointed by temp_p) memory.
I want to know why? Because apparently, all threads are putting their own values (PFORCES[ProcID]...) to the sum of temp_p and at the end, that is at the barrier, the results should be same, no matter the order in which threads acquired the locks.
[EDITED]
Also, please note that variables comp, dir and mol are all local variables of the thread and therefore not shared.
Second try.
I can't check it, but I assume that in temp_p[H1] += PFORCES[ProcID][mol][dir][H1]; you are adding doubles or floats.
For floating point types, the order of addition matters! Floating point addition is not associative!
A different thread order means a different addition order. So changes in the outcome are to be expected.
See http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems for some explanation.
I notice that you do not show the declaration of the loop variables, like mol and dir.
Could it be that the are accidently shared between threads?
If so, all kind of race conditions between e.g. one thread's mol++ and other thread's [mol % MAXLCKS] will cause problems.
UPDATE: According to the comments below, this does not seem to be the case.

Resources