Clamping short to unsigned char - C

I have a simple C function as follows:
unsigned char clamp(short value) {
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return value;
}
Is it possible to rewrite it without using any if / else branching while being efficient?
EDIT:
I basically wish to see if some bitwise-arithmetic-based implementation of clamping is possible. The objective is to process images on a GPU (Graphics Processing Unit); this type of code will run on each pixel. I guess that if branches can be avoided, then overall throughput on the GPU would be higher.
A solution like (value < 0 ? 0 : ((value > 255) ? 255 : value)) is simply a rehash of if/else branching with syntactic sugar, so I am not looking for that.
EDIT 2:
I can cut it down to a single if as follows, but I am not able to do better:
unsigned char clamp(short value) {
    int more = value >> 8;
    if (more) {
        int sign = !(more >> 7);
        return sign * 0xff;
    }
    return value;
}
EDIT 3:
Just saw a very nice implementation of this in FFmpeg code:
/**
 * Clip a signed integer value into the 0-255 range.
 * @param a value to clip
 * @return clipped value
 */
static av_always_inline av_const uint8_t av_clip_uint8_c(int a)
{
    if (a & (~0xFF)) return (-a) >> 31;
    else             return a;
}
This certainly works and reduces it to one if nicely.
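The same idea extends to a fully branch-free variant. A minimal sketch, assuming 32-bit int and arithmetic right shift of negative values (implementation-defined in C, but what gcc and clang do):

static unsigned char clamp_u8(int a)
{
    a &= ~(a >> 31);          /* a < 0: a >> 31 is all ones, so a becomes 0          */
    a |= (255 - a) >> 31;     /* a > 255: (255 - a) >> 31 is all ones, a becomes -1  */
    return (unsigned char)a;  /* truncation keeps the low 8 bits, i.e. 0xff         */
}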

You write that you want to avoid branching on the GPU. It is true that branching can be very costly in a parallel environment, because either both branches have to be evaluated or synchronization has to be applied. But if the branches are small enough, the code will be faster than most arithmetic. The CUDA C best practices guide describes why:
Sometimes, the compiler may [..] optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. [..] When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.
Branch predication is fast. Bloody fast! If you look at the intermediate PTX code generated by the optimizing compiler, you will see that it is superior to even modest arithmetic. So code like that in davmac's answer is probably as fast as it can get.
I know you did not ask specifically about CUDA, but most of the best practices guide also applies to OpenCL and probably to large parts of AMD's GPU programming.
BTW: in virtually every case of GPU code I have ever seen, most of the time is spent on memory access, not on arithmetic. Make sure to profile! http://en.wikipedia.org/wiki/Program_optimization
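For completeness: the predication-friendly way to write the clamp is with plain min/max, a sketch in C (CUDA device code would look the same, using the built-in min()/max() overloads), which typically compiles to predicated or dedicated min/max instructions rather than divergent branches:

static unsigned char clamp_u8(short value)
{
    int v = value;
    v = v > 0   ? v : 0;     /* max(v, 0)   */
    v = v < 255 ? v : 255;   /* min(v, 255) */
    return (unsigned char)v;
}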

If you just want to avoid the actual if/else, use the ?: operator:
return value < 0 ? 0 : (value > 0xff ? 0xff : value);
However, in terms of efficiency this shouldn't be any different.
In practice, you shouldn't worry about efficiency with something so trivial as this. Let the compiler do the optimization.

You could use a 2D lookup table:
unsigned char clamp(short value)
{
    static const unsigned char table[256][256] = { ... };
    const unsigned char x = value & 0xff;
    const unsigned char y = (value >> 8) & 0xff;
    return table[y][x];
}
Sure, this looks bizarre (a 64 KB table for this trivial computation). However, considering that you mentioned you wanted to do this on a GPU, I'm thinking the above could be a texture look-up, which I believe is pretty quick on GPUs.
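The table has to be filled once, of course; a possible host-side initializer (a sketch: the file-scope table stands in for the static const initializer above, and the y index matches the arithmetic right shift assumed in clamp() above):

#include <limits.h>

static unsigned char table[256][256];

static void init_table(void)
{
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
        unsigned char x = (unsigned)v & 0xff;
        unsigned char y = ((unsigned)v >> 8) & 0xff;
        table[y][x] = v < 0 ? 0 : v > 0xff ? 0xff : (unsigned char)v;
    }
}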
Further, if your GPU uses OpenGL, you could of course just use the clamp builtin directly:
clamp(value, 0, 255);
This won't type-convert (there is no 8-bit integer type in GLSL, it seems), but still.

You can do it without an explicit if by using ?: as shown by another poster, or by using interesting properties of abs(), which lets you compute the maximum or minimum of two values.
For example, the expression (a + abs(a)) / 2 returns a for positive numbers and 0 otherwise (the maximum of a and 0); symmetrically, (a + b - abs(a - b)) / 2 gives the minimum of a and b, which the second line below uses with b = 255.
This gives
unsigned char clip(short value)
{
    short a = (value + abs(value)) / 2;
    return (a + 255 - abs(a - 255)) / 2;
}
To convince yourself that this works, here is a test program:
#include <stdio.h>
#include <stdlib.h>   /* for abs() */

unsigned char clip(short value)
{
    short a = (value + abs(value)) / 2;
    return (a + 255 - abs(a - 255)) / 2;
}

void test(short value)
{
    printf("clip(%d) = %d\n", value, clip(value));
}

int main()
{
    test(0);
    test(10);
    test(-10);
    test(255);
    test(265);
    return 0;
}
When run, this prints
clip(0) = 0
clip(10) = 10
clip(-10) = 0
clip(255) = 255
clip(265) = 255
Of course, one may argue that there is probably a test in abs(), but gcc -O3, for example, compiles it to straight-line code:
clip:
    movswl %di, %edi
    movl   %edi, %edx
    sarl   $31, %edx
    movl   %edx, %eax
    xorl   %edi, %eax
    subl   %edx, %eax
    addl   %edi, %eax
    movl   %eax, %edx
    shrl   $31, %edx
    addl   %eax, %edx
    sarl   %edx
    movswl %dx, %edx
    leal   255(%rdx), %eax
    subl   $255, %edx
    movl   %edx, %ecx
    sarl   $31, %ecx
    xorl   %ecx, %edx
    subl   %ecx, %edx
    subl   %edx, %eax
    movl   %eax, %edx
    shrl   $31, %edx
    addl   %edx, %eax
    sarl   %eax
    ret
But note that this will be much less efficient than your original function, which compiles as:
clip:
    xorl   %eax, %eax
    testw  %di, %di
    js     .L1
    movl   $-1, %eax
    cmpw   $255, %di
    cmovle %edi, %eax
.L1:
    rep
    ret
But at least it answers your question :)

How about:
unsigned char clamp (short value) {
    unsigned char r = (value >> 15);             /* uses arithmetic right-shift */
    unsigned char s = !!(value & 0x7f00) * 0xff;
    unsigned char v = (value & 0xff);
    return (v | s) & ~r;
}
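Worked through by hand (again assuming arithmetic right shift), the pieces combine like this:

/* value = -5   (0xfffb): r = 0xff, s = 0xff, v = 0xfb -> (v | s) & ~r = 0x00 */
/* value = 300  (0x012c): r = 0x00, s = 0xff, v = 0x2c -> (v | s) & ~r = 0xff */
/* value = 200  (0x00c8): r = 0x00, s = 0x00, v = 0xc8 -> (v | s) & ~r = 0xc8 */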
But I seriously doubt that it executes any faster than your original version involving branches.

Assuming a two-byte short, and at the cost of readability of the code:
clipped_x = (x & 0x8000) ? 0 : ((x >> 8) ? 0xFF : x);

You should time this ugly but arithmetic-only version.
unsigned char clamp(short value) {
    short pmask = ((value & 0x4000) >> 7) | ((value & 0x2000) >> 6) |
                  ((value & 0x1000) >> 5) | ((value & 0x0800) >> 4) |
                  ((value & 0x0400) >> 3) | ((value & 0x0200) >> 2) |
                  ((value & 0x0100) >> 1);
    pmask |= (pmask >> 1) | (pmask >> 2) | (pmask >> 3) | (pmask >> 4) |
             (pmask >> 5) | (pmask >> 6) | (pmask >> 7);
    value |= pmask;
    short nmask = (value & 0x8000) >> 8;
    nmask |= (nmask >> 1) | (nmask >> 2) | (nmask >> 3) | (nmask >> 4) |
             (nmask >> 5) | (nmask >> 6) | (nmask >> 7);
    value &= ~nmask;
    return value;
}

One way to make it efficient is to declare this function as inline to avoid the function-call overhead. You could also turn it into a macro using the ternary operator, but that would remove the compiler's return-type checking.
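A sketch of the inline version (C99 inline semantics assumed):

static inline unsigned char clamp(short value)
{
    return value < 0 ? 0 : value > 0xff ? 0xff : (unsigned char)value;
}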

Related

Need help constructing a long loop(long x, int n) function in C given 64-bit assembly instructions

I have the following assembly code from the C function long loop(long x, int n),
with x in %rdi and n in %esi, on a 64-bit machine. I've written my comments on what I think the assembly instructions are doing.
loop:
    movl %esi, %ecx    // store the value of n in register ecx
    movl $1, %edx      // store the value of 1 in register edx (rdx); initial mask
    movl $0, %eax      // store the value of 0 in register eax (rax); this is the initial return value
    jmp .L2
.L3:
    movq %rdi, %r8     // store the value of x in register r8
    andq %rdx, %r8     // store the value of (x & mask) in r8
    orq %r8, %rax      // update the return value rax by (x & mask | [rax])
    salq %cl, %rdx     // update the mask rdx by ([rdx] << n)
.L2:
    testq %rdx, %rdx   // test mask & mask, i.e. check whether mask is zero
    jne .L3            // if mask != 0, jump to .L3
    rep; ret
I have the following C function which needs to correspond to the assembly code:
long loop(long x, int n){
    long result = _____;
    long mask;
    // for (mask = ______; mask ________; mask = ______){ // filled in as:
    for (mask = 1; mask != 0; mask <<= n) {
        result |= ________;
    }
    return result;
}
I need some help filling in the blanks; I'm not 100% sure what the assembly instructions do, but I've given it my best shot by commenting on each line.
You've pretty much got it in your comments.
long loop(long x, long n) {
    long result = 0;
    long mask;
    for (mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);
    }
    return result;
}
Because result is the return value, and the return value is stored in %rax, movl $0, %eax loads 0 into result initially.
Inside the for loop, %r8 holds the value that is or'd with result, which, like you mentioned in your comments, is just x & mask.
The function copies every nth bit to result.
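For instance, with n == 2 the mask visits bits 0, 2, 4, ..., 62, so the call keeps every even-numbered bit of x (hypothetical usage):

long even = loop(0x00000000FFFFFFFFL, 2);   /* == 0x0000000055555555 */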
For the record, the implementation is full of missed optimizations, especially if we're tuning for Sandybridge-family where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs)
bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the shift count the same way x86 integer shifts do, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f). i.e. setting bit n in RAX.
(This code is clearly designed to be simple to understand, and doesn't even use the most well-known x86 peephole optimization, xor-zeroing, but it's fun to see how we can implement the same thing efficiently.)
More obviously, doing the and inside the loop is bad. Instead we can just build up a mask with every nth bit set, and return x & mask. This has the added advantage that with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for the ret even if we care about tuning for the branch predictors in AMD Phenom CPUs. (Because it isn't the first byte after a conditional branch.)
# x86-64 System V: x in RDI, n in ESI
mask_bitstride:                  # give the function a meaningful name
    mov  $1, %eax                # mask = 1
    mov  %esi, %ecx              # unsigned bitidx = n (new tmp var)
    # the loop always runs at least 1 iteration, so just fall into it
.Lloop:                          # do {
    bts  %rcx, %rax              #   rax |= 1ULL << bitidx
    add  %esi, %ecx              #   bitidx += n
    cmp  $63, %ecx               #   sizeof(long)*CHAR_BIT - 1
    jbe  .Lloop                  # } while (bitidx <= maxbit);  // unsigned condition
    and  %rdi, %rax              # return x & mask
    ret                          # not following a JCC so no need for a REP prefix even on K10
We assume n is in the 0..63 range because otherwise the C would have undefined behaviour. In that case, this implementation differs from the shift-based implementation in the question. The shl version would treat n==64 as an infinite loop, because shift count = 0x40 & 0x3f = 0, so mask would never change. This bitidx += n version would exit after the first iteration, because idx immediately becomes >=63, i.e. out of range.
A less extreme case: for n=65 the shl version would copy all the bits (effective shift count of 1), while this version would copy only the low bit.
Both versions create an infinite loop for n=0. I used an unsigned compare so negative n will exit the loop promptly.
On Intel Sandybridge-family, the inner loop in the original is 7 uops. (mov = 1 + and=1 + or=1 + variable-count-shl=3 + macro-fused test+jcc=1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.
My version is only 3 uops, and can run about twice as fast. (1 iteration per clock.)
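In C terms, the rewritten loop corresponds to something like this sketch (assuming 0 < n <= 63; the first iteration runs unconditionally, and a C shift by 64 or more would be undefined behaviour):

long mask_bitstride(long x, int n)
{
    unsigned long mask = 1;          /* bit 0 is always set */
    unsigned bitidx = (unsigned)n;
    do {
        mask |= 1UL << bitidx;       /* the bts instruction */
        bitidx += (unsigned)n;
    } while (bitidx <= 63);          /* unsigned loop condition */
    return x & mask;
}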

Changing endianness, is a union more efficient than bitshifts?

As a challenge, I was asked to change the endianness of an int. The idea I had was to use bitshifts:
int swap_endianess(int color)
{
    int a;
    int r;
    int g;
    int b;

    a = (color & (255 << 24)) >> 24;
    r = (color & (255 << 16)) >> 16;
    g = (color & (255 << 8)) >> 8;
    b = (color & 255);
    return (b << 24 | g << 16 | r << 8 | a);
}
But someone told me that it was easier to use a union containing an int and an array of four chars (if an int is stored in 4 chars), fill the int, and then reverse the array.
union u_color
{
    int color;
    char c[4];
};

int swap_endianess(int color)
{
    union u_color ucol;
    char tmp;

    ucol.color = color;
    tmp = ucol.c[0];
    ucol.c[0] = ucol.c[3];
    ucol.c[3] = tmp;
    tmp = ucol.c[1];
    ucol.c[1] = ucol.c[2];
    ucol.c[2] = tmp;
    return (ucol.color);
}
What is the more efficient way of swapping bytes between those two? Are there more efficient ways of doing this?
EDIT
After having tested on an i7, the union way takes about 24 seconds (measured with the time command), while the bitshift way takes about 15 seconds, over 2,000,000,000 iterations.
The thing is that if I compile with -O1, both methods take only 1 second, and 0.001 second with -O2 or -O3.
The bitshift method compiles to bswap in the assembly with -O2 and -O3, but not the union way; gcc seems to recognize the naive pattern but not the complicated union way of doing it. To conclude, read the bottom line of user3386109's answer.
Here is the correct code for a byte swap function
uint32_t changeEndianess( uint32_t value )
{
    uint32_t r, g, b, a;

    r = (value >> 24) & 0xff;
    g = (value >> 16) & 0xff;
    b = (value >> 8) & 0xff;
    a = value & 0xff;
    return (a << 24) | (b << 16) | (g << 8) | r;
}
Here's a function that tests the byte swap function
void testEndianess( void )
{
    uint32_t value = arc4random();
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}
Using the LLVM compiler with full optimization, the resulting assembly code for the testEndianess function is
0x93d0: calll 0xc82e ; call `arc4random`
0x93d5: movl %eax, %ecx ; copy `value` into register CX
0x93d7: bswapl %ecx ; <--- this is the `changeEndianess` function
0x93d9: movl %ecx, 0x8(%esp) ; put 'result' on the stack
0x93dd: movl %eax, 0x4(%esp) ; put 'value' on the stack
0x93e1: leal 0x6536(%esi), %eax ; compute address of the format string
0x93e7: movl %eax, (%esp) ; put the format string on the stack
0x93ea: calll 0xc864 ; call 'printf'
In other words, the LLVM compiler recognizes the entire changeEndianess function and implements it as a single bswapl instruction.
Side note for those wondering why the call to arc4random is necessary. Given this code
void testEndianess( void )
{
    uint32_t value = 0x11223344;
    uint32_t result = changeEndianess( value );
    printf( "%08x %08x\n", value, result );
}
the compiler generates this assembly
0x93dc: leal 0x6524(%eax), %eax ; compute address of format string
0x93e2: movl %eax, (%esp) ; put the format string on the stack
0x93e5: movl $0x44332211, 0x8(%esp) ; put 'result' on the stack
0x93ed: movl $0x11223344, 0x4(%esp) ; put 'value' on the stack
0x93f5: calll 0xc868 ; call 'printf'
In other words, given a hardcoded value as input, the compiler precomputes the result of the changeEndianess function, and puts that directly into the assembly code, bypassing the function entirely.
The bottom line. Write your code the way it makes sense to write your code, and let the compiler do the optimizing. Compilers these days are amazing. Using tricky optimizations in source code (e.g. unions) may defeat the optimizations built into the compiler, actually resulting in slower code.
You can also use this code which might be slightly more efficient:
#include <stdint.h>

extern uint32_t
change_endianness(uint32_t x)
{
    x = (x & 0x0000FFFFLU) << 16 | (x & 0xFFFF0000LU) >> 16;
    x = (x & 0x00FF00FFLU) << 8 | (x & 0xFF00FF00LU) >> 8;
    return (x);
}
This is compiled by gcc on amd64 to the following assembly:
change_endianness:
    roll $16, %edi
    movl %edi, %eax
    andl $16711935, %edi
    andl $-16711936, %eax
    salq $8, %rdi
    sarq $8, %rax
    orl  %edi, %eax
    ret
To get an even better result, you might want to employ inline assembly. The i386 and amd64 architectures provide a bswap instruction to do what you want. As user3386109 explained, compilers might recognize the “naïve” approach and emit bswap instructions, something that doesn't happen with the approach above. Inline assembly is, however, better in case the compiler is not smart enough to detect that it can use bswap.
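A sketch of both options for GCC/Clang, the bswap via inline assembly (x86/amd64 only) and the __builtin_bswap32() intrinsic, the latter usually being preferable because the compiler can still optimize around it:

#include <stdint.h>

static inline uint32_t bswap32_asm(uint32_t x)
{
    __asm__ ("bswap %0" : "+r" (x));   /* x86/amd64 only */
    return x;
}

static inline uint32_t bswap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);       /* GCC and Clang intrinsic */
}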

Can a modern C compiler optimize a combination of bit accesses?

I would like var to be nonzero (true) if any of the bits 1, 3, 5, 7, 9, 11, 13 or 15 of input is set.
One solution which seem to be fairly common is this:
int var = 1 & (input >> 1) ||
          1 & (input >> 3) ||
          1 & (input >> 5) ||
          1 & (input >> 7) ||
          1 & (input >> 9) ||
          1 & (input >> 11) ||
          1 & (input >> 13) ||
          1 & (input >> 15);
However, I'm afraid that this would lead the compiler to generate unnecessarily long code.
The following code would also yield the desired result. Would it be more efficient?
int var = input & 0b1010101010101010;
Thanks!
Your second example is not equivalent.
What you wanted was (using non-standard binary literals):
int var = !!(input & 0b1010101010101010);
Or with hex literals (those are standard):
int var = !!(input & 0xaaaa);
Changes: use of hexadecimal constants and double negation (equivalent to != 0).
This presupposes input is not volatile, nor an atomic type.
A good compiler should optimize both to the same instructions (and most modern compilers are good enough).
In the end, though, test and measure; most compilers will output the produced assembler code, so you don't even need a disassembler!
If input is volatile, the compiler would be required to read it once if bit 1 was set, twice if bit 1 was clear but 3 was set, three times if bits 1 and 3 were clear but 5 was set, etc. The compiler may have ways of optimizing the code for doing the individual bit tests, but would have to test the bits separately.
If input is not volatile, a compiler could optimize the code, but I would not particularly expect it to. I would expect any compiler, however, no matter how ancient, to optimize
int var = ((input & (
    (1 << 1) | (1 << 3) | (1 << 5) | (1 << 7) |
    (1 << 9) | (1 << 11) | (1 << 13) | (1 << 15)
)) != 0);
which would appear to be what you're after.
It's going to depend on the processor and what instructions it has available, as well as how good the optimizing compiler is. I'd suspect that in your case either of those lines of code will compile to the same instructions.
But we can do better than suspect; you can check for yourself. With gcc, use the -S compiler flag (e.g. gcc -O2 -S test.c) to have it output the assembly it generates. Then you can compare the two versions yourself.
The orthodox solution would be to use the forgotten bitfields to map your flags, like
struct
{
    bool B0:  1;
    bool B1:  1;
    bool B2:  1;
    bool B3:  1;
    bool B4:  1;
    bool B5:  1;
    bool B6:  1;
    bool B7:  1;
    bool B8:  1;
    bool B9:  1;
    bool B10: 1;
    bool B11: 1;
    bool B12: 1;
    bool B13: 1;
    bool B14: 1;
    bool B15: 1;
} input;
and use the expression
bool Var= input.B1 || input.B3 || input.B5 || input.B7 || input.B9 || input.B11 || input.B13 || input.B15;
I doubt that an optimizing compiler will use the single-go masking trick, but honestly I have not tried.
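For completeness, a non-portable sketch of how one might overlay such a struct on an actual 16-bit value; bitfield allocation order and padding are implementation-defined, which is the main argument against this approach:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct flags {
    bool B0: 1;  bool B1: 1;  bool B2: 1;  bool B3: 1;
    bool B4: 1;  bool B5: 1;  bool B6: 1;  bool B7: 1;
    bool B8: 1;  bool B9: 1;  bool B10: 1; bool B11: 1;
    bool B12: 1; bool B13: 1; bool B14: 1; bool B15: 1;
};

bool any_odd_bit(uint16_t raw)
{
    struct flags f;
    _Static_assert(sizeof(struct flags) == sizeof(uint16_t), "layout assumption");
    memcpy(&f, &raw, sizeof raw);   /* type-pun via memcpy */
    return f.B1 || f.B3 || f.B5 || f.B7 ||
           f.B9 || f.B11 || f.B13 || f.B15;
}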
How well this is handled depends on the compiler.
I've tested a minor variation of this code:
int test(int input) {
    int var = 1 & (input >> 1) ||
              1 & (input >> 3) ||
              1 & (input >> 5) ||
              1 & (input >> 7) ||
              1 & (input >> 9) ||
              1 & (input >> 11) ||
              1 & (input >> 13) ||
              1 & (input >> 15);
    return var != 0;
}
Results
For x64, all compiled with -O2
GCC:
xor eax, eax
and edi, 43690
setne al
ret
Very good. That's precisely the transformation you were hoping for.
Clang:
    testw  $10922, %di          # imm = 0x2AAA
    movb   $1, %al
    jne    .LBB0_2
    andl   $32768, %edi         # imm = 0x8000
    shrl   $15, %edi
    movb   %dil, %al
.LBB0_2:
    movzbl %al, %eax
    ret
Yeah, that's a bit odd: most of the tests were rolled together, except for the bit-15 test. I see no reason why it would do this; maybe someone else can shed some light on that.
And the real surprise, ICC:
movl %edi, %eax #7.32
movl %edi, %edx #8.26
movl %edi, %ecx #9.26
shrl $1, %eax #7.32
movl %edi, %esi #10.26
shrl $3, %edx #8.26
movl %edi, %r8d #11.26
shrl $5, %ecx #9.26
orl %edx, %eax #7.32
shrl $7, %esi #10.26
orl %ecx, %eax #7.32
shrl $9, %r8d #11.26
orl %esi, %eax #7.32
movl %edi, %r9d #12.25
orl %r8d, %eax #7.32
shrl $11, %r9d #12.25
movl %edi, %r10d #13.25
shrl $13, %r10d #13.25
orl %r9d, %eax #7.32
shrl $15, %edi #14.25
orl %r10d, %eax #7.32
orl %edi, %eax #7.32
andl $1, %eax #7.32
ret #15.21
Ok so it optimized it a bit - no branches, and the 1 &'s are rolled together. But this is disappointing.
Conclusion
Your mileage may vary. To be safe, you can of course use the simple version directly, instead of relying on the compiler.

C - Rotate a 64-bit unsigned integer

Today I have been trying to write a function which should rotate a given 64-bit integer n bits to the right, but also to the left if n is negative. Of course, bits rotated out of the integer shall be rotated in on the other side.
I have kept the function quite simple.
void rotate(uint64_t *i, int n)
{
    uint64_t one = 1;
    if (n > 0) {
        do {
            int storeBit = *i & one;
            *i = *i >> 1;
            if (storeBit == 1)
                *i |= 0x8000000000000000ULL; /* re-insert the bit at the top */
            n--;
        } while (n > 0);
    }
}
possible inputs are:
uint64_t num = 0x2;
rotate(&num, 1); // num should be 0x1
rotate(&num, -1); // num should be 0x2, again
rotate(&num, 62); // num should be 0x8
Unfortunately, I could not figure it out. I was hoping someone could help me.
EDIT: Now the code is online. Sorry, it took a while; I had some difficulties with the editor. I only did the rotation to the right; the rotation to the left is missing.
uint64_t rotate(uint64_t v, int n) {
    n = n & 63U;
    if (n)
        v = (v >> n) | (v << (64 - n));
    return v;
}
gcc -O3 produces:
.cfi_startproc
    andl   $63, %esi
    movq   %rdi, %rdx
    movq   %rdi, %rax
    movl   %esi, %ecx
    rorq   %cl, %rdx
    testl  %esi, %esi
    cmovne %rdx, %rax
    ret
.cfi_endproc
not perfect, but reasonable.
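Adapted to the question's pointer-based API, a sketch; negative n (left rotation) falls out of the n & 63 masking, since rotating left by k is the same as rotating right by 64 - k:

#include <stdint.h>

void rotate(uint64_t *i, int n)
{
    unsigned r = (unsigned)n & 63U;
    if (r)
        *i = (*i >> r) | (*i << (64 - r));
}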
int storeBit = *i & one;
In this line you are assigning a 64-bit unsigned integer to a (probably 4-byte) int. I think your problem is related to this; on little-endian machines things will get complicated if you perform ill-defined operations like that.
Also, if (n > 0) doesn't handle negative n.

Left shift of only part of a number

I need to find the fastest equivalent of the following C code.
int d = 1 << x; /* d=pow(2,x) */
int j = 2*d*(i / d) + (i % d);
What I thought of is to shift the upper 32 - x bits of i left by one.
For example, the following i with x = 5:
1010 1010 1010 1010
will become:
0101 0101 0100 1010
Is there an assembly command for that? How can I perform this operation fast?
Divisions are slow:
int m = (1 << x) - 1;
int j = (i << 1) - (i & m);
update:
or probably faster:
int j = i + (i & (~0 << x));
x86 32-bit assembly (AT&T syntax):
/* int MaskedShiftByOne(int val, int lowest_bit_to_shift) */
mov 8(%esp), %ecx         ; ecx = lowest_bit_to_shift
mov $1, %eax
shl %cl, %eax             ; does 1 << lowest_bit_to_shift (shift count must be in %cl)
mov 4(%esp), %ecx         ; ecx = val
dec %eax                  ; (1 << ...) - 1 == 0xf..f (lower bitmask)
mov %eax, %edx
not %edx                  ; complement - higher mask
and %ecx, %edx            ; higher bits
and %ecx, %eax            ; lower bits
lea (%eax, %edx, 2), %eax ; low + 2 * high
ret
This should work both on Linux and Windows.
Edit: the i + (i & (~0 << x)) is shorter:
mov 8(%esp), %ecx         ; ecx = x
mov $-1, %eax
mov 4(%esp), %edx         ; edx = i
shl %cl, %eax             ; eax = ~0 << x
and %edx, %eax            ; eax = i & (~0 << x)
add %edx, %eax            ; eax = i + (i & (~0 << x))
ret
Moral: Don't ever start with assembly. If you really need it, disassemble highly-optimized compiler output ...
Shift the upper 32 - x bits left by one.
unsigned i = 0xAAAAu;
int x = 5;

i = (i & ((1 << x) - 1)) | ((i & ~((1 << x) - 1)) << 1); // 0x1554A
Some explanations:
(1 << x) - 1 makes a mask to zero the upper 32 - x bits.
~((1 << x) - 1) makes a mask to zero the lower x bits.
After the bits are zeroed, we shift the upper part and OR the two together.
Try this on Codepad.
int m = (1 << x) - 1;
int j = ((i & ~m) << 1) | (i & m);
There is no assembly command to do what you want, but the solution I give is quicker since it avoids the division.
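A quick cross-check of this expression against the original division-based formula (a hypothetical test, exhaustive over 16-bit i for a given x):

#include <assert.h>

void check(int x)
{
    int d = 1 << x;
    int m = d - 1;
    for (int i = 0; i < 0x10000; ++i)
        assert((((i & ~m) << 1) | (i & m)) == 2 * d * (i / d) + (i % d));
}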
Intel syntax:
mov ecx,[esp+4] ;ecx = x
mov eax,[esp+8] ;eax = i
ror eax,cl
inc cl
clc
rcl eax,cl
ret
Moral: Highly-optimized compiler output... isn't.
