SSE shifting integers

I'm trying to understand how shifting with SSE works, but I don't understand the output gdb gives me. Using SSE4 I have a 128-bit vector holding eight 16-bit unsigned integers (uint16_t). I then use the intrinsic _mm_cmpgt_epi16 to compare them against some value; this sets all 16 bits of each element to 1 where the comparison is true and to 0 where it is false. So far so good. Using gdb I get:
(gdb) p/t sse_res[0]
$3 = {1111111111111111111111111111111111111111111111110000000000000000, 1111111111111111111111111111111111111111111111110000000000000000}
Then I would like to shift them to the right (is that correct?) so I just get a numerical value of 1 in case it's true. GDB then gives me an output which I don't understand:
(gdb) p/t shifted
$4 = {11101000000000010010000000000000110000000000000000011, 100111000000000001011000000000001001000000000000001111}
It's not even of the same length as the first, why is this? Just to try it out I used the following intrinsic to shift it one bit to the right:
shifted = _mm_srli_epi16(sse_array[i], 1);
I expected it to shift in just one zero at the right end of every 16bit block.
Update:
I wrote a small example to test the approach with the bitmask. It works fine, but I still don't understand gdb's behavior:
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h>
#include <smmintrin.h>
void print128_num(__m128i vector)
{
uint16_t *values = (uint16_t*) &vector;
printf("Numerical: %i %i %i %i %i %i %i %i \n",
values[0], values[1], values[2], values[3], values[4], values[5],
values[6], values[7]);
}
int main (int argc, char **argv)
{
uint16_t nums[] = {1, 57, 33, 22, 88, 99, 9, 73};
__m128i *nums_sse = (__m128i*)(&nums);
print128_num(*nums_sse);
// vector of 42
__m128i mm42 = _mm_set1_epi16(42);
__m128i sse_res = _mm_cmpgt_epi16(*nums_sse, mm42);
printf("Result of the comparison\n");
print128_num(sse_res);
// bitmask
__m128i mask = _mm_set1_epi16(1);
__m128i finally = _mm_and_si128(sse_res, mask);
printf("Result of the masking\n");
print128_num(finally);
uint16_t* sse_ptr = (uint16_t*)(&finally);
uint32_t result = sse_ptr[0] + sse_ptr[1] + sse_ptr[2] + sse_ptr[3]
+ sse_ptr[4] + sse_ptr[5] + sse_ptr[6] + sse_ptr[7];
printf("Result: %i numbers greater 42\n", result);
return 0;
}
Breakpoint 1, main (argc=1, argv=0x7fff5fbff3b0) at example_comp.c:44
44 printf("Result: %i numbers greater 42\n", result);
(gdb) p/t sse_res
$1 = {11111111111111110000000000000000, 1111111111111111000000000000000011111111111111111111111111111111}
(gdb) p/t mask
$2 = {1000000000000000100000000000000010000000000000001, 1000000000000000100000000000000010000000000000001}
(gdb) p/t finally
$3 = {10000000000000000, 1000000000000000000000000000000010000000000000001}
(gdb) p result
$4 = 4
(gdb)
My gdb version: GNU gdb 6.3.50-20050815 (Apple version gdb-1472) (Wed Jul 21 10:53:12 UTC 2010)
Compiler flags: -Wall -g -O0 -mssse3 -msse4 -std=c99

I don't understand exactly what you're trying to do here, but maybe you can clarify it for us.
So, you have 8 signed integers packed in each of two variables, which you test for greater than. The result looks like it shows that the first 3 are greater, the next is not, the next 3 are greater, the last is not. (_mm_cmpgt_epi16 assumes signed integers in the reference I found.)
Then you want to tell if "it" is true, but I'm not sure what you mean by that. Do you mean they are all greater? (If so, then you could just compare the result against MAX_VALUE or -1 or something like that.)
But the last step is to shift some data to the right piecewise. Notice that is not the same variable as sse_res[0]. Were you expecting to shift that one instead?
Without knowing what was in the data before shifting, we can't tell if it worked correctly, but I assume that gdb is omitting the leading zeroes in its output, which would explain the shorter result.
0000000000011101 29 was 58 or 59
0000000000100100 36 was 72 or 73
0000000000011000 24 was 48 or 49
0000000000000011 3 was 6 or 7
0000000000100111 39 was 78 or 79
0000000000010110 22 was 44 or 45
0000000000100100 36 was 72 or 73
0000000000001111 15 was 30 or 31
Do these numbers look familiar?
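The decoding above can be reproduced by splitting one 64-bit half, as printed by gdb, into its four 16-bit lanes. This is only a sketch: lane16 is a hypothetical helper name, and the constant is the first half of the shifted value from the question with its leading zeroes restored.

```c
#include <stdint.h>

/* Hypothetical helper: returns 16-bit lane `lane` (0 = least significant)
 * of one 64-bit half of the vector as gdb prints it. */
static uint16_t lane16(uint64_t half, unsigned lane)
{
    return (uint16_t)(half >> (16u * lane));
}

/* First half of `shifted` from the question, leading zeroes restored:
 * 0000000000011101 0000000000100100 0000000000011000 0000000000000011 */
static const uint64_t shifted_low_half = 0x001D002400180003ULL;
```

With this constant, lane 0 is 3 and lane 3 is 29, so the element gdb prints first in each half is the one stored last in memory.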
Update:
Thanks for the updated code. It looks like the integers are packed in reverse order, and the leading zeroes are left off in the gdb output.
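As for the original goal of turning each "true" lane into a numerical 1, a logical right shift by 15 keeps only the top bit of every 16-bit lane, so 0xFFFF becomes 1 and 0 stays 0. A minimal sketch, assuming an x86 target with SSE2; count_greater_than is a hypothetical helper name, not something from the question:

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpgt_epi16, _mm_srli_epi16 */
#include <stdint.h>

/* Hypothetical helper: counts how many of the 8 values exceed `threshold`.
 * _mm_cmpgt_epi16 compares as SIGNED 16-bit, so this is only correct
 * for inputs up to 32767. */
static unsigned count_greater_than(const uint16_t vals[8], int16_t threshold)
{
    /* Unaligned load: the array need not be 16-byte aligned. */
    __m128i v    = _mm_loadu_si128((const __m128i *)vals);
    __m128i t    = _mm_set1_epi16(threshold);
    __m128i mask = _mm_cmpgt_epi16(v, t);     /* 0xFFFF or 0x0000 per lane */
    __m128i ones = _mm_srli_epi16(mask, 15);  /* 1 or 0 per lane */

    uint16_t lanes[8];
    _mm_storeu_si128((__m128i *)lanes, ones);

    unsigned count = 0;
    for (int i = 0; i < 8; i++)
        count += lanes[i];
    return count;
}
```

For the values from the question, {1, 57, 33, 22, 88, 99, 9, 73} with a threshold of 42, this returns 4, the same result as the masking version.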

Related

lldb and C code give different results for pow()

I have one variable, Npart which is an int and initialized to 64. Below is my code (test.c):
#include <math.h>
#include <stdio.h>
int Npart, N;
int main(){
Npart = 64;
N = (int) (pow(Npart/1., (1.0/3.0)));
printf("%d %d\n",Npart, N);
return 0;
};
which prints out 64 3, probably due to numerical precision issues. I compile it as follows:
gcc -g3 test.c -o test.x
If I debug with lldb and evaluate the same expression at the command prompt, the following happens:
$ lldb ./test.x
(lldb) target create "./test.x"
Current executable set to './test.x' (x86_64).
(lldb) breakpoint set --file test.c --line 1
Breakpoint 1: where = test.x`main + 44 at test.c:8, address = 0x0000000100000f0c
(lldb) r
Process 20532 launched: './test.x' (x86_64)
Process 20532 stopped
* thread #1: tid = 0x5279e0, 0x0000000100000f0c test.x`main + 44 at test.c:8, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100000f0c test.x`main + 44 at test.c:8
5
6 int main(){
7
-> 8 Npart = 64;
9
10 N = (int) (pow(Npart/1., (1.0/3.0)));
11 printf("%d %d\n",Npart, N);
(lldb) n
Process 20532 stopped
* thread #1: tid = 0x5279e0, 0x0000000100000f12 test.x`main + 50 at test.c:10, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x0000000100000f12 test.x`main + 50 at test.c:10
7
8 Npart = 64;
9
-> 10 N = (int) (pow(Npart/1., (1.0/3.0)));
11 printf("%d %d\n",Npart, N);
12
13 return 0;
(lldb) n
Process 20532 stopped
* thread #1: tid = 0x5279e0, 0x0000000100000f4a test.x`main + 106 at test.c:11, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x0000000100000f4a test.x`main + 106 at test.c:11
8 Npart = 64;
9
10 N = (int) (pow(Npart/1., (1.0/3.0)));
-> 11 printf("%d %d\n",Npart, N);
12
13 return 0;
14 };
(lldb) print Npart
(int) $0 = 64
(lldb) print (int)(pow(Npart/1.,(1.0/3.0)))
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
(int) $1 = 0
(lldb) print (int)(pow(64,1.0/3.0))
(int) $2 = 0
Why is lldb giving different results?
Edit: Clarified the question and provided a minimal verifiable example.

Your code calculates the cube root of 64, which should be 4.
The C code converts the return value to an integer with a cast, which truncates toward zero. 1.0/3.0 cannot be represented exactly as a double, and pow is not guaranteed to be correctly rounded, so the result tends to be slightly inaccurate. On your computer it seems to be a little less than 4.0, which becomes 3 when cast to int. The solution would be, for example, to round with lround first instead:
N = lround(pow(Npart/1., (1.0/3.0)));
As for the lldb, the key is the text:
error: 'pow' has unknown return type; cast the call to its declared return type
i.e. it doesn't know the return type - thus the prototype - of the function. pow is declared as
double pow(double x, double y);
but since the only hint that lldb has about the return type is the cast you provided, lldb thinks the prototype is
int pow(int x, double y);
and that will lead into undefined behaviour - in practice, lldb thinks that the return value should be the int from the EAX register, hence 0 was printed, but the actual return value was in some floating point/SIMD register. Likewise, since the types of the arguments are not known either, you must not pass in an int.
Thus I guess you would get the proper value in the debugger with
print (double)(pow(64.0, 1.0/3.0))
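If the input is always a non-negative integer, a different way to sidestep the truncation problem is to avoid floating point altogether. This is not the lround approach from the answer but a swapped-in alternative, and icbrt is a hypothetical helper name:

```c
/* Hypothetical helper: returns the largest n with n*n*n <= value.
 * Assumes value >= 0; negative inputs simply return 0. */
static int icbrt(int value)
{
    int n = 0;
    /* long long keeps the cube from overflowing for any int input */
    while ((long long)(n + 1) * (n + 1) * (n + 1) <= value)
        n++;
    return n;
}
```

With Npart = 64 this gives 4 exactly, and for 63 it gives 3, without depending on how pow rounds.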

uint8_t Array - Data inside memory

I have a question about a behavior I observed with gdb.
First I compiled this small program with gcc on a 64-bit machine:
#include <stdio.h>
#include <inttypes.h>
void fun (uint8_t *ar)
{
uint8_t i;
for(i = 0; i<4; i++)
{
printf("%i\n",*(&ar[0]+i));
}
}
int main (void)
{
uint8_t ar[4];
ar[0] = 0b11001100;
ar[1] = 0b10101010;
ar[2] = 0b01010110;
ar[3] = 0b00110011;
fun(ar);
return 0;
}
Then I looked at the memory of ar with gdb:
(gdb) p/t ar
$7 = {11001100, 10101010, 1010110, 110011}
(gdb) x ar
0x7fffffffe360: 00110011010101101010101011001100
(gdb) x 0x7fffffffe360
0x7fffffffe360: 00110011010101101010101011001100
(gdb) x 0x7fffffffe361
0x7fffffffe361: 11111111001100110101011010101010
(gdb) x 0x7fffffffe362
0x7fffffffe362: 01111111111111110011001101010110
(gdb) x 0x7fffffffe363
0x7fffffffe363: 00000000011111111111111100110011
I saw that the elements of the uint8_t array were collected together into one 32-bit field. For each following address, the contents are simply shifted to the right by one byte.
&ar[0] -> {ar[3],ar[2],ar[1],ar[0]}
&ar[1] -> {xxxx,ar[3],ar[2],ar[1]}
&ar[2] -> {xxxx,xxxx,ar[3],ar[2]}
&ar[3] -> {xxxx,xxxx,xxxx,ar[3]}
It's a bit strange and I want to know: why does this happen, and can I rely on this behavior? Is it specific to gcc, or is it standard?

In gdb, x just prints out whatever is in the memory location, regardless of its type in the C code. You're just getting some defaults (or previously used settings) for the width (4 bytes in your case) and the format.
For example, do x/b ar to print the location as bytes, and do help x for more info.
If you print it as anything other than bytes, though, the endianness of your processor determines how the memory is interpreted.
Use p to take the type into account, as in p ar.

It has to do with endianness:
On x64, and on every other little-endian machine, the value 0x12345678 is put into memory in the form 78 56 34 12, i.e. with the least significant byte first.
The debugger knows that and shows it to you in this way.
Expressed in hex, making your data easier to read, it looks this way:
Your memory is filled with
CC AA 56 33 FF 7F 00
which makes
the value at offset 0 3356AACC
the value at offset 1 FF3356AA
the value at offset 2 7FFF3356
the value at offset 3 007FFF33
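The same four values can be reproduced in C by assembling each 32-bit word from four consecutive bytes, least significant byte first. A small sketch; read_le32 is a hypothetical helper name, and the byte array is the memory content listed above:

```c
#include <stdint.h>

/* Hypothetical helper: interprets 4 bytes starting at p as a
 * little-endian 32-bit value, independent of the host's endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Memory content from the answer: ar[0..3] followed by unrelated stack bytes. */
static const uint8_t memory[7] = {0xCC, 0xAA, 0x56, 0x33, 0xFF, 0x7F, 0x00};
```

read_le32(memory + 0) yields 0x3356AACC and read_le32(memory + 1) yields 0xFF3356AA, which is what x shows when it reads a 4-byte word at each address.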

Loading ARM CPSR into C and formatting?

I've been given a document teaching ARM assembly, and it now tells me to load the CPSR into a C variable and format the data in a friendly way, such as:
Flags: N Z IRQ FIQ
State: ARM
Mode: Supervisor
Now I've loaded the CPSR into a variable within my program, but I'm struggling to understand what format the CPSR is in. I've seen things using hex to reset flags, and references to which bytes are the control, field, status and extension masks.
I put my CPSR into an int just to see what the data shows and I'm given 1610612752. I'm assuming I shouldn't be loading it into an int but into something else for it to be clearer.
Any hints pushing me to the right direction would be most appreciated.

From this wiki page (http://www.heyrick.co.uk/armwiki/The_Status_register) we get the bit layout of the CPSR (and SPSR):
Bit:   31 30 29 28 27 | 24 | 19..16  | 9 8 7 6 5 | 4..0
Field: N  Z  C  V  Q  | J  | GE[3:0] | E A I F T | M[4:0]
Declare some flags (or just compute these), with Cpsr held in an unsigned 32-bit variable so the shifts are well defined:
int armflag_N = (Cpsr>>31)&1;
int armflag_Z = (Cpsr>>30)&1;
int armflag_C = (Cpsr>>29)&1;
int armflag_V = (Cpsr>>28)&1;
int armflag_Q = (Cpsr>>27)&1;
int armflag_J = (Cpsr>>24)&1;
int armflag_GE = (Cpsr>>16)&15; /* GE[3:0] is 4 bits wide */
int armflag_E = (Cpsr>>9)&1;
int armflag_A = (Cpsr>>8)&1;
int armflag_I = (Cpsr>>7)&1;
int armflag_F = (Cpsr>>6)&1;
int armflag_T = (Cpsr>>5)&1;
int armflag_M = (Cpsr>>0)&31; /* M[4:0] is 5 bits wide */
(The ">>" means to rightshift specified number of bits, and "&" is the bitwise and operator, so "(val>>num)&mask" means rightshift val num bits, and then extract the bits under the mask).
Now you have variables holding the flags. Here is how you could conditionally print a flag:
printf("Flags: ");
printf("%s ", armflag_N ? "N" : "-" );
...
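Putting the pieces together, here is a rough sketch of decoding the value from the question. The helper names cpsr_mode_name and cpsr_flag_string are hypothetical, and the mode table covers only the classic ARM modes:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: names the processor mode held in M[4:0]. */
static const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & 0x1Fu) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1B: return "Undefined";
    case 0x1F: return "System";
    default:   return "Unknown";
    }
}

/* Hypothetical helper: writes "NZCVIFT" into out, with '-' in place of
 * every flag whose bit is clear. out must hold at least 8 chars. */
static void cpsr_flag_string(uint32_t cpsr, char out[8])
{
    static const char     names[7] = {'N', 'Z', 'C', 'V', 'I', 'F', 'T'};
    static const unsigned bits[7]  = {31, 30, 29, 28, 7, 6, 5};

    for (size_t i = 0; i < 7; i++)
        out[i] = ((cpsr >> bits[i]) & 1u) ? names[i] : '-';
    out[7] = '\0';
}
```

The value 1610612752 from the question is 0x60000010 in hex, so this produces the flag string "-ZC----" and the mode "User"; the T bit is clear, which means ARM state.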

Why can't I insert a breakpoint here (I might add all my friends can, and I see no other way to solve this)

(gdb) list 1,20
1 int swap_n_add(int *xp, int *yp)
2 {
3 int x = *xp;
4 int y = *yp;
5
6 *xp = y;
7 *yp = x;
8 return x + y;
9 }
10
11 int main() {
12 int a1 = 534;
13 int a2 = 1057;
14 int sum = swap_n_add(&a1, &a2);
15 int diff = a1 - a2;
16
17 return sum * diff;
18 }
(gdb) b 18
No line 18 in file "swap_n_add.c".
I want to check the value main returns, so I put a breakpoint at 18 to inspect the register there (info register). But it says that line doesn't exist, despite it... saying it does exist. And my friends with identical code can put it there.

I would question the previous setup steps that you haven't shown. If your friends are able to set a breakpoint there and you are not, there is probably something you did wrong (assuming everyone is using the same versions of all the tools).
with gdb version 7.4-2012.04 for Ubuntu and gcc 4.6.3 I can see and set a break point at the line in question:
> gcc -Wall -g file.c <-- compile with -g for debug symbols
> gdb a.out <-- run against the executable
This GDB was configured as "x86_64-linux-gnu". <-- make sure it was configured for
For bug reporting instructions, please see: your architecture
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/mike/C/a.out...done. <-- and that your file loads symbols
(gdb) list 22,39
22 int swap_n_add(int *xp, int *yp) <-- mine are different because I have a bunch
23 { of #include's in my test file
24 int x = *xp;
25 int y = *yp;
26
27 *xp = y;
28 *yp = x;
29 return x + y;
30 }
31
32 int main() {
33 int a1 = 534;
34 int a2 = 1057;
35 int sum = swap_n_add(&a1, &a2);
36 int diff = a1 - a2;
37
38 return sum * diff;
39 }
(gdb) b 39
Breakpoint 1 at 0x400530: file file.c, line 39.
Note that if I pick a line outside of the file, say.. 75, it gives you a message about "no line x in file":
(gdb) b 75
No line 75 in the current file.
Make breakpoint pending on future shared library load? (y or [n])
If you're seeing this it's worth double checking your line numbers, it's possible you mis-counted.
If you want to see the value of the return (sum * diff) you can always set that to a local variable before returning and break on the return.

This may be compiler/debugger-specific. There may be no debug information generated for the closing brace.
Also, if you have optimization enabled, it can make it hard or impossible to put breakpoints at some locations. Try removing the -O parameter to gcc, if you use it.
If it still doesn't help, assign the return value to a new variable and return that variable instead.
Alternatively, you could just switch to the disassembly, put a breakpoint on the ret instruction of main() (the instruction that does function return on x86) and examine the returned value in the CPU registers (should be in eax on x86).

C array permutations with macros

Is it possible to generate a specific permutation of an array with a macro in C?
i.e. If I have an array X with elements:
0 1 2 3 4 5
x = ["0","1","1","0","1","0"]
I was thinking there may be some macro foo for something like this:
#define S_2Permute(x) = [x[5], x[3], x[4], x[2], x[1]]
where I redefine the order of the array, so the element in the original position 5 is now in position 0.
Any ideas?
EXAMPLE USE
I am starting to create an implementation of the DES encryption algorithm. DES requires several permutation/expansions where I would have to re-order all of the elements in the array, sometimes shrinking the array and sometimes expanding it. I was hoping to just be able to define a macro to permute the arrays for me.
EDIT2
Well in DES the first step is something called the initial permutation. So initially I have some 64-bit key, which for this example can be 0-15 hex:
0123456789ABCDEF
which expands to:
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
The IP (initial permutation) would permute this string so that every element in the array would be in a new position:
IP =
58 50 42 34 26 18 10 2
60 52 44 36 28 20 12 4
62 54 46 38 30 22 14 6
64 56 48 40 32 24 16 8
57 49 41 33 25 17 9 1
59 51 43 35 27 19 11 3
61 53 45 37 29 21 13 5
63 55 47 39 31 23 15 7
So the new 1st element in the bitstring would be the 58th element(bit) from the original bitstring.
So I would have all of these bits stored in an array of characters:
x = [0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,1,1,0,0,
1,1,1,1,0,0,0,1,0,0,1,1,0,1,0,1,0,1,1,1,1,0,0,1,1,0,1,1,1,1,0,1,1,1,1]
and then just call
IP_PERMUTE(x);
And macro magic will have moved all of the bits into the new correct positions.

Absolutely - you're almost there with your example. Try this:
#define S_2Permute(x) {x[5], x[3], x[4], x[2], x[1]}
Then later:
int x[] = {1,2,3,4,5,6};
int y[] = S_2Permute(x); // y is now {6,4,5,3,2}
Two things to remember:
1) in C, arrays are numbered from 0, so it's possible you meant:
#define S_2Permute(x) {x[4], x[2], x[3], x[1], x[0]}
2) If you're using gcc, you can compile with -E to see the output from the preprocessor (very good for debugging macro expansions).
However, I don't think I'd actually do it this way - I'd say the code will be easier to read (and potentially less error prone) if you generate the permutations programmatically - and I doubt that it'll be a large performance hit.
Since you say you're having trouble compiling this, here's a test program that works for me in gcc 4.6.1:
#include <stdio.h>
#define S_2Permute(x) {x[5], x[3], x[4], x[2], x[1]}
int main(void) {
int x[] = {1,2,3,4,5,6};
int y[] = S_2Permute(x);
for(int i = 0; i < 5; i++) {
printf("%d,",y[i]);
}
printf("\n");
}
I compiled with gcc test.c -std=c99 -Wall
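To illustrate the programmatic route mentioned above, here is a rough table-driven sketch. It assumes the bits are kept in a uint64_t rather than a char array, and it numbers bits the way the DES tables do (bit 1 is the most significant). permute_bits is a hypothetical helper name; DES_IP is the initial permutation table from the question.

```c
#include <stdint.h>

/* Initial permutation table from the question (1-based source positions). */
static const uint8_t DES_IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2,
    60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6,
    64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1,
    59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5,
    63, 55, 47, 39, 31, 23, 15, 7
};

/* Hypothetical helper: builds an n_out-bit result whose i-th bit (counted
 * from the most significant end) is bit table[i] of the n_in-bit input.
 * n_out may differ from n_in, so the same routine can shrink or expand. */
static uint64_t permute_bits(uint64_t in, const uint8_t *table,
                             unsigned n_out, unsigned n_in)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < n_out; i++) {
        uint64_t bit = (in >> (n_in - table[i])) & 1u;
        out = (out << 1) | bit;
    }
    return out;
}
```

Calling permute_bits(0x0123456789ABCDEFULL, DES_IP, 64, 64) returns 0xCC00CCFFF0AAF0AA. Each additional DES table then becomes one more constant array instead of one more macro.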

I'm new so apologies if it's not ok to offer a different means of solution but have you considered using an inline function instead of a macro?
I love single lines of code that do a lot as much as the next guy, but it makes more sense to me to do it this way:
// I would have an array that defines how I want to swap the positions; I'll assume 5 elements
static const short reordering[5] = {4,1,3,2,0};
// Rotates the values along the positions listed in reordering (one single cycle)
static inline void permuteArray(char array[]) {
char swap = array[reordering[0]];
array[reordering[0]] = array[reordering[1]];
array[reordering[1]] = array[reordering[2]];
array[reordering[2]] = array[reordering[3]];
array[reordering[3]] = array[reordering[4]];
array[reordering[4]] = swap;
}
This may not be as pretty or efficient as a macro, but it could save you some headaches managing and maintaining your code (and could always be swapped for the macro version Timothy suggests).

I am doing something very similar. This is my code. The variable that comes in is a ulong, so I convert it to a bit array, rearrange all the bits, and then turn it back into a ulong.
public override ulong Permutation(ulong input, int[] permuation)
{
byte[] test = BitConverter.GetBytes(input);
BitArray test2 = new BitArray(test);
BitArray final = new BitArray(test);
ulong x = 0;
ulong y = 1;
for (int i = 0; i < permuation.Length; i++)
{
final[i] = test2[(permuation[i]-1)];
}
for (int i = 0; i < final.Length; i++)
{
if (final[i] == true)
{
x += (1 * y);
}
else
{
x += (0 * y);
}
y = y * 2;
}
return x;
}