I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):
int func(int s, int x, int* a, int* r) {
int i;
// do some stuff
for (i=0;i<a_really_big_int;++i) {
if (s) r[i] = x ^ i;
else r[i] = x ^ a[i];
// and maybe a couple other ways of computing r
// that are equally fast individually
}
// do some other stuff
}
This code gets called so much that this loop is actually a speed bottleneck in the code. I am wondering a couple things:
Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?
If not, what is a good way to optimize this code?
====
Here is an update with a fuller example:
int func(int s,
int start,int stop,int stride,
double *x,double *b,
int *a,int *flips,int *signs,int i_max,
double *c)
{
int i,k,st;
for (k=start; k<stop; k += stride) {
b[k] = 0;
for (i=0;i<i_max;++i) {
/* this is the code in question */
if (s) st = k^flips[i];
else st = a[k]^flips[i];
/* done with code in question */
b[k] += x[st] * (__builtin_popcount(st & signs[i])%2 ? -c[i] : c[i]);
}
}
}
EDIT 2:
In case anyone is curious, I ended up refactoring the code and hoisting the whole inner for loop (with i_max) outside, making the really_big_int loop be much simpler and hopefully easy to vectorize! (and also avoiding doing a bunch of extra logic a zillion times)
One obvious way to optimize the code is to pull the conditional outside the loop:
if (s)
for (i=0;i<a_really_big_int;++i) {
r[i] = x ^ i;
}
else
for (i=0;i<a_really_big_int;++i) {
r[i] = x ^ a[i];
}
A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.
Micro-optimizations
Usually they are not worth the time; reviewing larger issues is more effective.
Yet to micro-optimize, trying a variety of approaches and then profiling them to find the best can make for modest improvements.
In addition to @wallyk's and @kabanus's fine answers, some simplistic compilers benefit from a loop that ends at 0.
// for (i=0;i<a_really_big_int;++i) {
for (i=a_really_big_int; i-- > 0; ) {
[edit 2nd optimization]
OP added a more complete example. One of the issues is that the compiler cannot assume that the memory pointed to by b and the other pointers does not overlap. This prevents certain optimizations.
Assuming they in fact do not overlap, use restrict on b to allow optimizations. const helps too for weaker compilers that do not deduce that. restrict on the others may also benefit, again, if the referenced data does not overlap.
// int func(int s, int start, int stop, int stride, double *x,
// double *b, int *a, int *flips,
// int *signs, int i_max, double *c) {
int func(int s, int start, int stop, int stride, const double * restrict x,
double * restrict b, const int * restrict a, const int * restrict flips,
const int * restrict signs, int i_max, double *c) {
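Putting the two suggestions together (hoisting the test on s and qualifying the pointers), the fuller example might look like the sketch below. The name func_hoisted and the local variables base, sum and odd are mine, not the OP's, and the sketch assumes that none of the arrays overlap and that __builtin_popcount is available (GCC/Clang).

```c
/* Sketch only: func_hoisted is a hypothetical name.
   Assumes none of the arrays overlap, as restrict promises. */
void func_hoisted(int s, int start, int stop, int stride,
                  const double *restrict x, double *restrict b,
                  const int *restrict a, const int *restrict flips,
                  const int *restrict signs, int i_max,
                  const double *restrict c)
{
    for (int k = start; k < stop; k += stride) {
        /* s never changes, so choose the base index once per k
           instead of testing s on every inner iteration */
        const int base = s ? k : a[k];
        double sum = 0.0;
        for (int i = 0; i < i_max; ++i) {
            const int st = base ^ flips[i];
            /* odd parity of the masked bits flips the sign of c[i] */
            const int odd = __builtin_popcount((unsigned)(st & signs[i])) & 1;
            sum += x[st] * (odd ? -c[i] : c[i]);
        }
        b[k] = sum;
    }
}
```

Accumulating into a local and storing b[k] once also spares the compiler from reloading b[k] on every iteration. Whether any of this is faster on a given compiler is something to profile, not assume.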
All the commands in your loop are quick O(1) operations. The if is definitely optimized, and so is your for+if, if all your commands are of the form r[i]=somethingquick. The question may boil down to how small you can make a_really_big_int.
A quick int main that just goes from INT_MIN to INT_MAX summing into a long variable takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your commands may multiply this by a few, which quickly gets to a minute. Bottom line, this may not be avoidable if you really are iterating a ton.
If r[i] are calculated independently, this would be a classic usage for threading/multi-processing.
EDIT:
I think % is optimized anyway by the compiler, but if not, take care that x & 1 is much faster for an odd/even check.
Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If it is only running on systems with AVX2, you could use the __m256i / _mm256_* variants (similarly for avx512*).
int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
size_t i = 0, max = a_really_big_int / 4;
__m128i xv = _mm_set1_epi32(x);
// do some stuff
if (s) {
__m128i iv = _mm_set_epi32(3,2,1,0); // arguments run from the highest element to the lowest, so element 0 holds 0
__m128i four = _mm_set1_epi32(4);
for ( ;i<max; ++i, iv=_mm_add_epi32(iv,four)) {
r[i] = _mm_xor_si128(xv,iv);
}
}else{ /*not (s)*/
for (;i<max;++i){
r[i] = _mm_xor_si128(xv,a[i]);
}
}
// do some other stuff
}
Although the if statement will be optimized away on any decent compiler (unless you asked the compiler not to optimize), I would consider writing the optimization in (just in case you compile without optimizations).
In addition, although the compiler might optimize the "absolute" if statement, I would consider optimizing it manually, either using any available builtin, or using bitwise operations.
i.e.
b[k] += x[st] * c[i] *
(1 - 2 * (__builtin_popcount(st & signs[i]) & 1));
This takes the last bit of popcount (1 == odd, 0 == even) and turns it into a sign factor (-1 if odd, +1 if even) that multiplies c[i]. Because c[i] is a double, the sign cannot be flipped with an XOR mask: ^ is not defined for floating-point operands, and for integers XOR with all ones gives ~c[i] (which is -c[i] - 1), not 0 - c[i].
This will avoid instruction jumps in cases where the second absolute if statement isn't optimized.
P.S.
The sign factor is computed in plain int arithmetic, so it does not depend on how long an int is on your system.
This is a homework question that I am confused about how to approach. There are restrictions as well: I cannot use /, %, or any loops. The function accepts two pointers of type int. Given these two pointers, I need to find whether they are in the same block of memory or in different blocks of memory. In the first case I return 1 for them being in the same block, and 0 otherwise. So my thinking is that if two pointers are in the same block of memory, that must mean they point to the same integer? I'm not sure if this is correct; any hint in the right direction would be greatly appreciated.
Thank you
Floris basically gave you the idea; here's my actual implementation for POSIX:
uintptr_t pagesz = getpagesize();
uintptr_t addr_one = (uintptr_t)ptr1;
uintptr_t addr_two = (uintptr_t)ptr2;
bool in_same_page = (addr_one & ~(pagesz - 1)) == (addr_two & ~(pagesz - 1));
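If you don't want to depend on getpagesize(), the same masking idea can be wrapped in a small helper that takes the block size as a parameter. This is only a sketch: same_block_addr and same_block are names I made up, and it assumes the block size is a power of two and that blocks start at multiples of that size.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if both addresses fall in the same block, 0 otherwise.
   Assumes block_size is a power of two, so ~(block_size - 1) clears the
   offset bits and leaves only the block's start address. */
int same_block_addr(uintptr_t addr_one, uintptr_t addr_two, uintptr_t block_size)
{
    uintptr_t mask = ~(block_size - 1);
    return (addr_one & mask) == (addr_two & mask);
}

/* Pointer front end for the helper above (also a hypothetical name). */
int same_block(const int *p1, const int *p2, uintptr_t block_size)
{
    return same_block_addr((uintptr_t)p1, (uintptr_t)p2, block_size);
}
```

With 1k blocks, addresses 1024 and 2047 land in the same block, while 1023 and 1024 do not even though they are adjacent. No /, % or loops are needed.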
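Assuming that you know how large the blocks of memory are (I assume 1k (2^10)) and that blocks start at multiples of that size, you can shift both addresses right by 10 bits and compare the resulting block numbers. Comparing the difference of the addresses against the block size is not enough, because two addresses only a few bytes apart can sit on either side of a block boundary.
int same_block(int x, int y){
/* x and y hold the addresses; drop the 10 offset bits and compare block numbers */
if(((unsigned)x >> 10) == ((unsigned)y >> 10)){
return 1;
}
return 0;
}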
I made an object that actually represents an array of 8 booleans stored in a char. I made it to learn something more about bitwise operators and about creating your own objects in C. So I've got two questions:
Can I be certain if the below code always works?
Is this a good implementation to make an object that can't get lost in C, unless you release it yourself.
The Code:
/*
* IEFBooleanArray.h
* IEFBooleanArray
*
* Created by ief2 on 8/08/10.
* Copyright 2010 ief2. All rights reserved.
*
*/
#ifndef IEFBOOLEANARRAY_H
#define IEFBOOLEANARRAY_H
#include <stdlib.h>
#include <string.h>
#include <math.h>
typedef char * IEFBooleanArrayRef;
void IEFBooleanArrayCreate(IEFBooleanArrayRef *ref);
void IEFBooleanArrayRelease(IEFBooleanArrayRef ref);
int IEFBooleanArraySetBitAtIndex(IEFBooleanArrayRef ref,
unsigned index,
int flag);
int IEFBooleanArrayGetBitAtIndex(IEFBooleanArrayRef ref,
unsigned index);
#endif
/*
* IEFBooleanArray.c
* IEFBooleanArray
*
* Created by ief2 on 8/08/10.
* Copyright 2010 ief2. All rights reserved.
*
*/
#include "IEFBooleanArray.h"
void IEFBooleanArrayCreate(IEFBooleanArrayRef *ref) {
IEFBooleanArrayRef newReference;
newReference = malloc(sizeof(char));
memset(newReference, 0, sizeof(char));
*ref = newReference;
}
void IEFBooleanArrayRelease(IEFBooleanArrayRef ref) {
free(ref);
}
int IEFBooleanArraySetBitAtIndex(IEFBooleanArrayRef ref, unsigned index, int flag) {
int orignalStatus;
if(index < 0 || index > 7)
return -1;
if(flag == 0)
flag = 0;
else
flag = 1;
orignalStatus = IEFBooleanArrayGetBitAtIndex(ref, index);
if(orignalStatus == 0 && flag == 1)
*ref = *ref + (int)pow(2, index);
else if(orignalStatus == 1 && flag == 0)
*ref = *ref - (int)pow(2, index);
return 0;
}
int IEFBooleanArrayGetBitAtIndex(IEFBooleanArrayRef ref, unsigned index) {
int result;
int value;
value = (int)pow(2, index);
result = value & *ref;
if(result == 0)
return 0;
else
return 1;
}
I'm more of an Objective-C guy, but I really want to learn C more. Can anyone suggest some more "homework" with which I can improve myself?
Thank you,
ief2
1. Don't check unsigned types with < 0, it's meaningless and causes warnings on some compilers.
2. Don't use unsigned types without specifying their size (unsigned int, unsigned char, etc).
3. If flag == 0 why are you setting it to 0?
4. I don't like abstracting the * away in a typedef, but it's not wrong by any means.
5. You don't need to call memset() to set a single byte to 0.
6. Using pow to calculate a bit offset is crazy. Check out the << and >> operators and use those instead.
7. Fully parenthesize your if statement conditions or be prepared for debugging pain in your future.
8. If you use the bitwise operators & and | instead of arithmetic + and - in your SetBitAtIndex function, you won't need all those complicated if statements anyway.
9. Your GetBitAtIndex routine doesn't bounds check index.
From that list, #9 is the only one that means your program won't work in all cases, I think. I didn't exhaustively test it - that's just a first glance check.
pow(2,index) is among the more inefficient ways to produce a bit mask. I can imagine that using the Ackermann function could be worse, but pow() is pretty much on the slow side. You should use (1<<index) instead. Also, the C'ish way to set/clear a bit in a value looks different. Here's a recent question about this:
Simple way to set/unset an individual bit
If you want to munge bits in C in an efficient and portable way, then you really should have a look at the bit twiddling page, that everyone here will suggest to you if you mention "bits" somehow:
http://graphics.stanford.edu/~seander/bithacks.html
The following code sequence:
if(result == 0)
return 0;
else
return 1;
can be written as return (result != 0);, return result; or return !!result; (if result should be forced to 0 or 1). Though it's always a good idea to make an intent clear, most C programmers will prefer 'return result;' because in C this is the way to make your intent clear. The if looks iffy, like a warning sticker saying "Original developer is a Java guy and knows not much about bits" or something.
newReference = malloc(sizeof(char));
memset(newReference, 0, sizeof(char));
malloc + memset(x,0,z) == calloc();
You have a way to report an error (invalid index) for IEFBooleanArraySetBitAtIndex but not for IEFBooleanArrayGetBitAtIndex. This is inconsistent. Make error reporting uniform, or the users of your library will botch error checking.
As for accessing bit #n in your char object, instead of using pow() function, you can use shifting and masking:
Set bit #n:
a = a | (1 << n);
Clear bit #n:
a = a & (~(1 << n));
Get bit #n:
return ((a >> n) & 1);
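Here is a minimal sketch of how the shift-and-mask operations and uniform error reporting could fit together. The names bits_set and bits_get are hypothetical (they are not part of the IEFBooleanArray code), and the flags live in a plain unsigned char owned by the caller.

```c
/* Hypothetical helpers for 8 flags packed into one byte.
   Both return -1 for an out-of-range index, so callers can check
   errors the same way for setting and getting. */
int bits_set(unsigned char *bits, unsigned index, int flag)
{
    if (index > 7)
        return -1;
    if (flag)
        *bits |= (unsigned char)(1u << index);   /* set bit #index */
    else
        *bits &= (unsigned char)~(1u << index);  /* clear bit #index */
    return 0;
}

int bits_get(unsigned char bits, unsigned index)
{
    if (index > 7)
        return -1;
    return (bits >> index) & 1;  /* always 0 or 1 for a valid index */
}
```

A caller would declare unsigned char flags = 0; and pass &flags to bits_set, with no pow() and no need to read the old bit before writing the new one.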
Nobody seems to be mentioning this (I am surprised), but... You can't tell me you're seriously doing malloc(sizeof(char))? That is a very small allocation. It doesn't make sense to make this a heap allocated object. Just declare it as char.
If you want to have some degree of encapsulation, you can do: typedef char IEFBoolArray; and make accessor functions to manipulate an IEFBoolArray. Or even do typedef struct { char value; } IEFBoolArray; But given the size of the data it would be sheer madness to allocate these one at a time on the heap. Have consumers of the type just declare it inline and use the accessors.
Further... Are you sure you want it to be char? You might get slightly better code generated if you promote that to something larger, like int.
In addition to Carl Norum's points:
1. Don't save space by packing into a char this way unless you have to (i.e. you store a lot of bit values). It is much slower, as you have to perform bitwise operations etc.
2. On most architectures you waste memory by mallocing a char. One pointer takes 4 to 8 times more than a char on most modern architectures, and additionally the allocator keeps bookkeeping data about the malloced chunk.
3. A static size is probably not the best approach, as it is inflexible. I wouldn't see any benefit in using special functions for it.
As for the 3rd point, something like:
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
typedef struct {
uint64_t size; /* number of bits */
uint8_t *array; /* 8 bits per element */
}bitarray;
bitarray bitarray_new(uint64_t size) {
bitarray arr;
arr.size = size;
arr.array = calloc((size + 7) / 8, 1); /* zeroed, rounded up to whole bytes */
return arr;
}
void bitarray_free(bitarray arr) {
free(arr.array);
}
void bitarray_set(bitarray arr, uint64_t index, int bit) {
assert (index < arr.size);
if (bit)
arr.array[index/8] |= 1 << (index % 8);
else
arr.array[index/8] &= ~(1 << (index % 8));
}
int bitarray_get(bitarray arr, uint64_t index) {
assert (index < arr.size);
return (arr.array[index/8] >> (index % 8)) & 1;
}
Copyright 2010 ief2. All rights reserved.
Actually they are not. You voluntarily published the code under the cc-by-sa licence, so only some rights are reserved. Additionally, you want us to read and modify the code, so reserving all rights is pointless.
(PS. I would advise against publishing trivial work under restrictive licences anyway, as it does not look professional, unless you have legal reasons to do so)
Is this a good implementation to make an object that can't get lost in C, unless you release it yourself.
Sorry?
char byte_to_ascii(char value_to_convert, volatile char *converted_value) {
if (value_to_convert < 10) {
return (value_to_convert + 48);
} else {
char a = value_to_convert / 10;
double x = fmod((double)value_to_convert, 10.0);
char b = (char)x;
a = a + 48;
b = b + 48;
*converted_value = a;
*(converted_value+1) = b;
return 0;
}
}
The purpose of this function is to take an unsigned char value of 0 through 99 and either return its ASCII equivalent in the case it is 0-9 or fill a small global character array that can be referenced from the calling code after the function completes.
I ask this question because two compilers from the same vendor interpret this code in different ways.
This code was written as a way to parse address bytes sent via RS485 into strings that can easily be passed to a send-lcd-string function.
This code is written for the PIC18 architecture (8 bit uC).
The problem is that the free/evaluation version of a particular compiler generates perfect assembly code that works while suffering a performance hit, but the paid and supposedly superior compiler generates more efficient code at the expense of being able to reference the addresses of all my byte arrays used to drive the graphics on my LCD display.
I know I'm putting lots of mud in the water by using a proprietary compiler for a less than typical architecture, but I hope someone out there has some suggestions.
Thanks.
I would definitely avoid using floating point anything on a PIC. And I would -try not to- use any divisions. How many times do you see sending a non-ascii char to the LCD? Can you save it to the LCD's memory and then call it by its memory position?
Here's what a divide by 10 looks like in my code, note the 17 cycles it needs to complete. Think about how long that will take, and make sure there is nothing else waiting on this.
61: q = d2 / 10;
01520 90482E mov.b [0x001c+10],0x0000
01522 FB8000 ze 0x0000,0x0000
01524 2000A2 mov.w #0xa,0x0004
01526 090011 repeat #17
01528 D88002 div.uw 0x0000,0x0004
0152A 984F00 mov.b 0x0000,[0x001c+8]
If you do a floating point anything in your code, look in the program memory after you've compiled it, on the Symbolic tab (so you can actually read it) and look for the floating point code that will need to be included. You'll find it up near the top (depending on your code), soon(ish) after the _reset label.
Mine starts at line number 223 and memory address of 001BC with _ floatsisf, continues through several additional labels (_fpack, _divsf3, etc) and ends in _funpack, last line at 535 and memory address 0042C. If you can handle (42C-1BC = 0x270 =) 624 bytes of lost program space, great, but some chips have just 2k of space and that's not an option.
Instead of floating point, if it's possible, try to use fixed point arithmetic, in base 2.
As far as not being able to reference all the byte arrays in your LCD, have you checked to make sure that you're not trying to send a null (which is a fine address) but it gets stopped by code checking for the end of an ascii string? (it's happened to me before).
Modulo and integer division can be very, very expensive. I do not know about your particular architecture, but my guess is that it is expensive there as well.
If you need both, division and modulo, do one of them and get the other one by multiplication/difference.
q =p/10;
r = p - q*10;
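Applied to the two-digit conversion from the question, that might look like the sketch below. The name two_digits is hypothetical, and the sketch assumes the input is in the range 0..99 and always writes both digits (so 7 becomes "07").

```c
/* Sketch: write the two ASCII digits of a value in 0..99 into out[0..1],
   using one division and getting the remainder by multiply/subtract.
   two_digits is a hypothetical name, not part of the original code. */
void two_digits(unsigned char value, char out[2])
{
    unsigned char q = value / 10;                       /* tens digit */
    unsigned char r = (unsigned char)(value - q * 10);  /* ones digit */
    out[0] = (char)('0' + q);
    out[1] = (char)('0' + r);
}
```

Whether the multiply is actually cheaper than a second divide on your part is worth checking in the generated assembly.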
I'd probably write that as:
char byte_to_ascii(char value_to_convert, volatile char *converted_value)
{
if (value_to_convert < 10) {
return value_to_convert + '0';
} else {
converted_value[0] = (value_to_convert / 10) + '0';
converted_value[1] = (value_to_convert % 10) + '0';
return 0;
}
}
Is it poor form to convert to floating, call fmod, and convert to integer, instead of just using the % operator? I would say yes. There are more readable ways to slow down a program to meet some timing requirement, for example sleeping in a for loop. No matter what compiler or what tweaking of assembly code or whatever else, this is a highly obfuscated way to control the execution speed of your program, and I call it poor form.
If perfect assembly code means that it works right but it's even slower than the conversions to floating point and back, then use integers and sleep in a for loop.
As for the imperfect assembly code, what's the problem? "at the expense of being able reference the addresses of all my byte arrays"? It looks like type char* is working in your code, so it seems that you can address all your byte arrays the way the C standard says you can. What's the problem?
Frankly, I would say yes..
If you wanted b to be the remainder, either use MOD or roll-your-own:
char a = value_to_convert / 10;
char b = value_to_convert - (10 * a);
Conversion to/from floats is never the way to do things, unless your values really are floats.
Furthermore, I would strongly recommend to stick to the convention of explicitly referring to your datatypes as 'signed' or 'unsigned', and leave the bare 'char' for when it actually is a character (part of a string). You are passing in raw data, which I feel should be an unsigned char (assuming of course, that the source is unsigned!). It is easy to forget if something should be signed/unsigned, and with a bare char, you'll get all sorts of roll-over errors.
Most 8-bit micros take forever for a multiply (and more than forever for a divide), so try and minimise these.
Hope this helps..
The code seems to be doing two very different things, depending on whether it's given a number in the range 0-9 or 10-99. For that reason, I would say that this function is written in poor form: I would split your function into two functions.
Since we're discussing divisions by 10 here..
This is my take. It only uses simple operations and does not even need wide registers.
unsigned char divide_by_10 (unsigned char value)
{
unsigned char q;
q = (value>>1) + (value>>2);
q += (q>>4);
q >>= 3;
value -= (q<<3)+q+q;
return q+((value+6)>>4);
}
Cheers,
Nils
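A shift-and-add routine like this is easy to get subtly wrong, so it is worth checking it against plain division over the whole 8-bit range on a desktop compiler before trusting it on the target. The sketch below restates divide_by_10 so that it is self-contained; count_div10_mismatches is a hypothetical name for the checker.

```c
/* divide_by_10 restated from the answer above so this sketch is self-contained */
unsigned char divide_by_10(unsigned char value)
{
    unsigned char q;
    q = (value >> 1) + (value >> 2);   /* roughly 0.75 * value */
    q += (q >> 4);                     /* roughly 0.797 * value */
    q >>= 3;                           /* estimate of value / 10, never too high */
    value -= (q << 3) + q + q;         /* remainder relative to the estimate */
    return q + ((value + 6) >> 4);     /* add 1 when that remainder is >= 10 */
}

/* Hypothetical checker: counts the inputs in 0..255 for which
   divide_by_10 disagrees with ordinary integer division. */
int count_div10_mismatches(void)
{
    int mismatches = 0;
    for (int v = 0; v <= 255; ++v) {
        if (divide_by_10((unsigned char)v) != v / 10)
            ++mismatches;
    }
    return mismatches;
}
```

If the checker returns 0, the routine agrees with / 10 for every unsigned char input.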
It is typical for optimizers to do unwanted thingies from time to time if you poke around in the internals.
Is your converted_value a global value or otherwise assigned in such a fashion that the compiler knows not to touch it?
PICs don't like doing pointer arithmetic.
As Windows programmer points out, use the mod operator (see below.)
char byte_to_ascii(char value_to_convert, volatile char *converted_value) {
if (value_to_convert < 10) {
return (value_to_convert + 48);
} else {
char a = value_to_convert / 10;
char b = value_to_convert % 10;
a = a + 48;
b = b + 48;
*converted_value = a;
*(converted_value+1) = b;
return 0;
}
}
Yes, I believe that your function:
char byte_to_ascii(char value_to_convert, volatile char *converted_value) {
if (value_to_convert < 10) {
return (value_to_convert + 48);
} else {
char a = value_to_convert / 10;
double x = fmod((double)value_to_convert, 10.0);
char b = (char)x;
a = a + 48;
b = b + 48;
*converted_value = a;
*(converted_value+1) = b;
return 0;
}
}
is in poor form:
Don't use numeric codes for ASCII chars, use the character, i.e. '@' instead of 0x40.
There is no need for using the fmod function.
Here is my example:
// Assuming 8-bit octet
char value;
char largeValue;
value = value_to_convert / 100; // hundreds digit
converted_value[0] = value + '0';
largeValue = value_to_convert - value * 100; // what is left, 0..99
value = largeValue / 10; // tens digit
converted_value[1] = value + '0';
largeValue = largeValue - value * 10; // ones digit
converted_value[2] = largeValue + '0';
converted_value[3] = '\0'; // Null terminator.
Since there are only 3 digits, I decided to unroll the loop. There are no branches to interrupt the prefetching of instructions. No floating point exceptions, just integer arithmetic.
If you want leading spaces instead of zeros, you can try this:
value = (value == 0) ? ' ' : value + '0';
Just to be a nitpicker, but multiple return statements from the same function can be considered bad form (MISRA).
Also, some of the discussions above border on premature optimization. Some tasks must be left to the compiler. However, in such a minimalistic embedded environment, these tricks may still be valid.
Here is my short implementation of Russian Peasant Multiplication. How can it be improved?
Restrictions : only works when a>0,b>0
for(p=0;p+=(a&1)*b,a!=1;a>>=1,b<<=1);
It can be improved by adding whitespace, proper indentation, and a proper function body:
int peasant_mult (int a, int b) {
int p;
for (p = 0;
p += (a & 1) * b, a != 1;
a /= 2, b *= 2);
return p;
}
See? Now it's clear how the three parts of the for declaration are used. Remember, programs are written mainly for human eyes. Unreadable code is always bad code.
And now, for my personal amusement, a tail recursive version:
(defun peasant-mult (a b &optional (sum 0))
"returns the product of a and b,
achieved by peasant multiplication."
(if (= a 1)
(+ b sum)
(peasant-mult (floor (/ a 2))
(* b 2)
(+ sum (* b (logand a 1))))))
I think it's terrible
This is exactly the same code from the compiler's point of view, and (hopefully) a lot clearer
int sum = 0;
while(1)
{
sum += (a & 1) * b;
if(a == 1)
break;
a = a / 2;
b = b * 2;
}
And now I've written it out, I understand it.
There is a really easy way to improve this:
p = a * b;
It has even the advantage that a or b could be less than 0.
If you look at how it really works, you will see that it is just the normal manual multiplication performed in binary. Your computer does it internally this way (1), so the easiest way to use the Russian peasant method is to use the built-in multiplication.
(1) Maybe it has a more sophisticated algorithm, but in principle you can say it works with this algorithm.
There is still a multiplication in the loop. If you wanted to reduce the cost of the multiplications, you could use this instead:
for(p=0;p+=(-(a&1))&b,a!=1;a>>=1,b<<=1);
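Written out as a function, the mask trick could look like the sketch below. The name peasant_mul is hypothetical. I use unsigned operands so that negating (a & 1) and shifting b are well defined, and the loop stops at a == 0, which lifts the a>0 restriction; the result wraps around if it does not fit in an unsigned.

```c
/* Sketch: Russian peasant multiplication without * or /.
   peasant_mul is a hypothetical name. Unsigned arithmetic keeps the
   negation and the shifts well defined; the result wraps modulo 2^N. */
unsigned peasant_mul(unsigned a, unsigned b)
{
    unsigned p = 0;
    while (a != 0) {
        /* 0u - (a & 1u) is all ones when the low bit of a is set, else 0,
           so the AND either keeps b or turns it into 0 */
        p += (0u - (a & 1u)) & b;
        a >>= 1;
        b <<= 1;
    }
    return p;
}
```

For example, peasant_mul(13, 17) adds 17, 68 and 136 (the rows where a is odd) and returns 221.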
I don't find it particularly terrible, obfuscated or unreadable, as others put it, and I don't understand all those downvotes. This said, here is how I would "improve" it:
// Russian Peasant Multiplication ( p <- a*b, only works when a>0, b>0 )
// See http://en.wikipedia.org/wiki/Ancient_Egyptian_multiplication
for( p=0; p+=(a&1)*b, a!=1; a>>=1,b<<=1 );
This is for a code obfuscation contest? I think you can do better. Use misleading variable names instead of meaningless ones, for starters.
p is not initialised.
What happens if a is zero?
What happens if a is negative?
Update: I see that you have updated the question to address the above problems. While your code now appears to work as stated (except for the overflow problem), it's still less readable than it should be.
I think it's incomplete, and very hard to read. What specific sort of feedback were you looking for?
int RussianPeasant(int a, int b)
{
// sum = a * b
int sum = 0;
while (a != 0)
{
if ((a & 1) != 0)
sum += b;
b <<= 1;
a >>= 1;
}
return sum;
}
Answer with no multiplication or division:
int RPM(int a, int b){
int rtn;
for(rtn=0;rtn+=(-(a&1))&b,a!=1;a>>=1,b<<=1);
return rtn;
}