I had a short interview where the question was something like this: set an integer value to 0xaa55 at address 0x*****9.
The only thing I noticed is that the given address is not aligned on a word boundary, so setting an int *p to that address and dereferencing it should not work. Is the idea then just to use an unsigned char *p and assign the value byte-wise? Is that the point of this interview question? And is there any point in doing this in real life?
You need to get back to the interviewer with a number of subsidiary questions:
1. What is the size in bytes of an int?
2. Is the machine little-endian or big-endian?
3. Does the machine handle non-aligned access automatically?
4. What is the performance penalty for handling non-aligned access automatically?
5. What is the point of this?
The chances are that someone is thinking of marshalling data the quick and dirty way.
You're right that one basic process is to write the bytes via a char * or unsigned char * that is initialized to the relevant address. The answers to my subsidiary questions 1 and 2 determine the exact mechanism to use, but for a 2-byte int in little-endian format, you might use:
unsigned char *p = (unsigned char *)0x*****9; // Copied from question! (the cast is required to convert the integer to a pointer)
unsigned int v = 0xAA55;
*p++ = v & 0xFF;
v >>= 8;
*p = v & 0xFF;
You can generalize to 4-byte or 8-byte integers easily; handling big-endian integers is a bit more fiddly.
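For example, a minimal sketch of the 4-byte big-endian variant (same hypothetical address as above; the most significant byte is stored first):
unsigned char *p = (unsigned char *)0x*****9; // Hypothetical address from the question
unsigned int v = 0xAA55;
*p++ = (v >> 24) & 0xFF; // most significant byte first
*p++ = (v >> 16) & 0xFF;
*p++ = (v >>  8) & 0xFF;
*p   = v & 0xFF;         // least significant byte last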
I assembled some timing code to see what the relative costs were. Tested on a MacBook Pro (2.3 GHz Intel Core i7, 16 GiB 1333 MHz DDR3 RAM, Mac OS X 10.7.5, home-built GCC 4.7.1), I got the following times for the non-optimized code:
Aligned: 0.238420
Marshalled: 0.931727
Unaligned: 0.243081
Memcopy: 1.047383
Aligned: 0.239070
Marshalled: 0.931718
Unaligned: 0.242505
Memcopy: 1.060336
Aligned: 0.239915
Marshalled: 0.934913
Unaligned: 0.242374
Memcopy: 1.049218
When compiled with optimization, I got segmentation faults, even without -DUSE_UNALIGNED, which puzzles me a bit. Debugging was not easy; there seemed to be a lot of aggressive inlining, which meant that variables could not be printed by the debugger.
The code is below. The Clock type and the timer.h header (and timer.c source) are not shown, but can be provided on request (see my profile). They provide high-resolution timing across most platforms (Windows is the shakiest).
#include <string.h>
#include <stdio.h>
#include "timer.h"
static int array[100000];
enum { ARRAY_SIZE = sizeof(array) / sizeof(array[0]) };
static int repcount = 1000;
static void uac_aligned(int value)
{
int *base = array;
for (int i = 0; i < repcount; i++)
{
for (int j = 0; j < ARRAY_SIZE - 2; j++)
base[j] = value;
}
}
static void uac_marshalled(int value)
{
    for (int i = 0; i < repcount; i++)
    {
        char *base = (char *)array + 1;
        for (int j = 0; j < ARRAY_SIZE - 2; j++)
        {
            int v = value;      /* work on a copy so every element gets the full value */
            *base++ = v & 0xFF;
            v >>= 8;
            *base++ = v & 0xFF;
            v >>= 8;
            *base++ = v & 0xFF;
            v >>= 8;
            *base++ = v & 0xFF; /* advance past the last byte too */
        }
    }
}
#ifdef USE_UNALIGNED
static void uac_unaligned(int value)
{
int *base = (int *)((char *)array + 1);
for (int i = 0; i < repcount; i++)
{
for (int j = 0; j < ARRAY_SIZE - 2; j++)
base[j] = value;
}
}
#endif /* USE_UNALIGNED */
static void uac_memcpy(int value)
{
for (int i = 0; i < repcount; i++)
{
char *base = (char *)array + 1;
for (int j = 0; j < ARRAY_SIZE - 2; j++)
{
memcpy(base, &value, sizeof(int));
base += sizeof(int);
}
}
}
static void time_it(int value, const char *tag, void (*function)(int value))
{
Clock c;
char buffer[32];
clk_init(&c);
clk_start(&c);
(*function)(value);
clk_stop(&c);
printf("%-12s %12s\n", tag, clk_elapsed_us(&c, buffer, sizeof(buffer)));
}
int main(void)
{
int value = 0xAA55;
for (int i = 0; i < 3; i++)
{
time_it(value, "Aligned:", uac_aligned);
time_it(value, "Marshalled:", uac_marshalled);
#ifdef USE_UNALIGNED
time_it(value, "Unaligned:", uac_unaligned);
#endif /* USE_UNALIGNED */
time_it(value, "Memcopy:", uac_memcpy);
}
return(0);
}
Alternatively, memcpy() can store to an unaligned address directly, since it works byte-wise (the compound literal requires C99):
memcpy((void *)0x23456789, &(int){0xaa55}, sizeof(int));
Yes, you may need to deal with unaligned multi-byte values in real life. Imagine your device exchanges data with another device. For example, the data may be a message structure sent over a network or a file structure saved to disk. The format of that data may be predefined and not under your control, and the definition of the data structure may not account for the alignment (or even endianness) restrictions of your device. In these situations you'll need to take care when accessing these unaligned multi-byte values.
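For example, a minimal sketch of reading a 16-bit little-endian field that sits at an odd offset in a received buffer (the wire format here is invented for illustration):
#include <stdint.h>

/* Hypothetical message layout: a 1-byte tag followed by a 16-bit little-endian
   value, so the value starts at an odd (unaligned) offset. */
uint16_t read_value(const uint8_t *msg)
{
    const uint8_t *p = msg + 1;             /* unaligned start of the value */
    return (uint16_t)(p[0] | (p[1] << 8));  /* byte-wise, endianness-independent */
}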
I've been reading up on the use of pointers and allocating memory for embedded projects. I must admit that I perhaps don't understand it fully, as I can't seem to figure out where my problem lies.
My two functions are supposed to take 4 float values and return 16 bytes that represent them, in order to transfer them through SPI. It works great, but only for a minute, before the program crashes and my SPI and I2C die, lol.
Here are the functions:
/*Function that wraps a float value, by allocating memory and casting pointers.
Returns 4 bytes that represents input float value f.*/
typedef char byte;
byte* floatToByteArray(float f)
{
byte* ret = malloc(4 * sizeof(byte));
unsigned int asInt = *((int*)&f);
int i;
for (i = 0; i < 4; i++) {
ret[i] = (asInt >> 8 * i) & 0xFF;
}
return ret;
memset(ret, 0, 4 * sizeof(byte)); //Clear allocated memory, to avoid taking all memory
free(ret);
}
/*Takes a list of 4 quaternions, and wraps every quaternion in 4 bytes.
Returns a 16 element byte list for SPI transfer, that effectively contains the 4 quaternions*/
void wrap_quaternions(float Quaternion[4], int8_t *buff)
{
uint8_t m;
uint8_t n;
uint8_t k = 0;
for (m = 0; m < 4; m++)
{
for (n = 0; n < 4; n++)
{
byte* asBytes = floatToByteArray(Quaternion[m]);
buff[n+4*k] = asBytes[n];
}
k++;
}
}
The error message I receive afterwards is the following, in the disassembly window of Atmel Studio:
Atmel studio screenshot
You might drop all the dynamic memory allocation completely.
void floatToByteArray(float f, byte buf[4])
{
memcpy(buf, &f, sizeof(f));
}
void wrap_quaternions(float Quaternion[4], int8_t *buff)
{
for (int i = 0; i < 4; i++)
{
floatToByteArray(Quaternion[i], (byte *)&buff[4*i]);
}
}
With this approach you do not need to care about freeing allocated memory after use. It is also much more efficient because dynamic memory allocation is rather expensive.
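Usage might look like this (sizes taken from the question: 4 floats of 4 bytes each):
float q[4] = { 1.0f, 0.0f, 0.0f, 0.0f };  /* example quaternion */
int8_t spi_buf[16];                       /* 4 floats x 4 bytes */
wrap_quaternions(q, spi_buf);             /* spi_buf now holds the 16 bytes for SPI */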
Gerhardh is correct: the return statement prevents the memory from being released, because the memset() and free() after it are never executed.
If you need to return 4 bytes, you might check if your environment can return a uint32_t or something like that.
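A sketch of that idea (assuming sizeof(float) == sizeof(uint32_t) on your target):
#include <stdint.h>
#include <string.h>

/* Pack the float's bytes into a uint32_t so no heap allocation is needed. */
uint32_t floatToU32(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}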
As already mentioned, the lines below return ret; are never executed. And in any case, if you want to return allocated memory from a function (which is fine), you can't free it in the function itself; it has to be freed by the caller when it is no longer needed. So your calling function should look like:
/*Takes a list of 4 quaternions, and wraps every quaternion in 4 bytes.
Returns a 16 element byte list for SPI transfer, that effectively contains the 4 quaternions*/
void wrap_quaternions(float Quaternion[4], int8_t *buff)
{
uint8_t m;
uint8_t n;
uint8_t k = 0;
for (m = 0; m < 4; m++)
{
byte* asBytes = floatToByteArray(Quaternion[m]); // no need to call it for every n
for (n = 0; n < 4; n++)
{
buff[n+4*k] = asBytes[n];
}
free(asBytes); // asBytes is no longer needed and can be free()d
k++;
}
}
Regarding:
buff[n+4*k] = asBytes[n];
For each float, this results in:
buff[0] << asBytes[0] // from the first call to byte* floatToByteArray(float f)
buff[1] << asBytes[1] // from the second call
buff[2] << asBytes[2] // from the third call
buff[3] << asBytes[3] // from the fourth call
so floatToByteArray() is called (and its result leaked) once per byte rather than once per float. Most of the above problem can be fixed by calling it once per float and using memcpy() to copy the 4 bytes from asBytes[] to buff[], similar to:
memcpy( &buff[ m*4 ], asBytes, 4 );
Of course, there is also the consideration: is the length of a float on your hardware/compiler actually 4 bytes?
'Magic' numbers are numbers with no stated basis, and they make the code much more difficult to understand, debug, etc. Here that number is 4. Suggest using something like length = sizeof( float ); and then using length everywhere that 4 is currently being used, except for the number of entries in the Quaternion[] array. For that 'magic' number, strongly suggest placing the statement #define arraySize 4 early in your code, then using arraySize each time the code references the number of elements in the array.
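A sketch of what that might look like (names follow the suggestions above; the C11 _Static_assert documents the float-size assumption):
#define arraySize 4   /* number of entries in Quaternion[] */

_Static_assert(sizeof(float) == 4, "this code assumes 4-byte floats");

void wrap_quaternions(float Quaternion[arraySize], int8_t *buff)
{
    const size_t length = sizeof( float );   /* bytes per float on this target */
    for (size_t m = 0; m < arraySize; m++)
    {
        byte* asBytes = floatToByteArray(Quaternion[m]); /* one call per float */
        memcpy(&buff[m * length], asBytes, length);
        free(asBytes);                       /* release the allocation */
    }
}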
For my CUDA project I want to give my device function a single integer.
My function looks like this:
__device__ void PBKDF2_CUDA(const uint8_t password[], const int pass_len, const uint8_t Essid[], const int Essid_len, const int c, const int dkLen, uint32_t T_ptr[], int *PW_len_test)
{
uint32_t Hash_ptr[5] = {0};
uint32_t L[5]={0,0,0,0,0};
uint32_t T[8] = {0};
//Maybe working
/*uint8_t * password_shrinked = (uint8_t*)malloc(8 + 1);
for(int i = 0; i < 8; i++)
password_shrinked[i] = password[i];
password_shrinked[8 + 1] = 0;*/
int password_len = pass_len;
if (pass_len != 8)
{
*PW_len_test = pass_len;
password_len = 8;
}
uint8_t * password_shrinked = (uint8_t*)malloc(sizeof(uint8_t)*(password_len + 1));
for (int i = 0; i < password_len; i++)
password_shrinked[i] = password[i];
password_shrinked[password_len + 1] = 0;
//Some other stuff
free(password_shrinked);
};
and I'm calling it from a kernel like this:
__global__ void kernel(uint8_t Password_list[], const int *Password_len, uint8_t Essid[], int *Essid_len, int *rounds,int *dkLen, uint32_t T[], int pmk_size, int *PW_len_test)
{
int idx= threadIdx.x + blockDim.x*blockIdx.x;
printf("Password_len is: %d\n", Password_len);
PBKDF2_CUDA(Password_list+idx*(8), 8, Essid, *Essid_len, *rounds, *dkLen, T+idx*pmk_size, PW_len_test + idx*sizeof(int));
}
Calling the kernel in the main function:
kernel<<<BLOCKS, THREADS>>>(Pass_d, Pass_len_d, Essid_d, Essid_len_d, rounds_d, key_len_d, PMK_d, PMK_size, PW_len_test_d);
Now, regardless of whether I set Pass_len_d to 8, or call the kernel with 8 instead of Pass_len_d, my device function produces garbage (returning wrong values, explanation below). It only works if I set the value manually in the kernel function (as seen above) or in the device function.
By garbage I mean that some returned values are not calculated correctly from the password list (the uint8_t array), while others are calculated correctly. Which words are correctly calculated changes with every run, so I assume there is a race condition somewhere, but I cannot find it.
There's at least one buffer overflow.
password_shrinked[password_len + 1] = 0; writes to a slot one byte above what was allocated.
Remember that if you allocate password_len + 1 bytes, the last location in the array is password_len.
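A minimal sketch of the corrected allocation and terminator write, against the lines from the question:
uint8_t * password_shrinked = (uint8_t*)malloc(sizeof(uint8_t)*(password_len + 1));
for (int i = 0; i < password_len; i++)
    password_shrinked[i] = password[i];
password_shrinked[password_len] = 0; // last valid index, not password_len + 1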
According to its implementation, the function rand() in C (uClibc) uses a mutex to lock internal state (http://sourcecodebrowser.com/uclibc/0.9.27/rand_8c.html). So if I use multiple threads that call it, my program will be slow because all threads will contend for that lock.
So I found drand48(), another random number generator function, which does not take locks (http://sourcecodebrowser.com/uclibc/0.9.27/drand48_8c.html#af9329f9acef07ca14ea2256191c3ce74). But somehow my parallel program is still slower than the serial one! The code is pasted below:
Serial version:
#include <cstdlib>
#define M 100000000
int main()
{
for (int i = 0; i < M; ++i)
drand48();
return 0;
}
Parallel version:
#include <pthread.h>
#include <cstdlib>
#define M 100000000
#define N 4
pthread_t threads[N];
void* f(void* p)
{
for (int i = 0; i < M/N; ++i)
drand48();
}
int main()
{
for (int i = 0; i < N; ++i)
pthread_create(&threads[i], NULL, f, NULL);
for (int i = 0; i < N; ++i)
pthread_join(threads[i], NULL);
return 0;
}
I executed both programs. The serial one runs in ~0.6 seconds and the parallel one in ~2.1 seconds.
Could anyone explain to me why this happens?
Some additional information: I have 4 cores on my PC. I compile the serial version using
g++ serial.cpp -o serial
and the parallel using
g++ parallel.cpp -lpthread -o parallel
Edit:
Apparently, this performance loss happens whenever I update a global variable in my threads. In the example below, the x variable is global (note that in the parallel example the operation is not thread-safe):
Serial:
#include <cstdlib>
#define M 1000000000
int x = 0;
int main()
{
for (int i = 0; i < M; ++i)
x = x + 10 - 10;
return 0;
}
Parallel:
#include <pthread.h>
#include <cstdlib>
#define M 1000000000
#define N 4
pthread_t threads[N];
int x;
void* f(void* p)
{
for (int i = 0; i < M/N; ++i)
x = x + 10 - 10;
}
int main()
{
for (int i = 0; i < N; ++i)
pthread_create(&threads[i], NULL, f, NULL);
for (int i = 0; i < N; ++i)
pthread_join(threads[i], NULL);
return 0;
}
drand48() uses the global struct variable __libc_drand48_data: it keeps state there (writes to it), and that shared state is the source of cache line contention, which is very likely the cause of the performance degradation. It isn't false sharing, as I initially suspected and wrote in the comments; it is bona fide sharing. The reason there is no locking in the implementation of drand48() is twofold:
1. drand48() is not required to be thread-safe: "The drand48(), lrand48(), and mrand48() functions need not be thread-safe."
2. If two threads happen to access it at the same time and their writes to memory are interleaved, there is no harm done: the data structure is not corrupted, and it is, after all, supposed to return pseudo-random data.
There are some subtle considerations (race conditions) in the use of drand48() when one thread is initializing the state, but these are considered harmless.
Notice below in __drand48_iterate how it stores to three 16-bit words in the global variable; this is where the random generator keeps its state, and it is the source of the cache-line contention between your threads:
xsubi[0] = result & 0xffff;
xsubi[1] = (result >> 16) & 0xffff;
xsubi[2] = (result >> 32) & 0xffff;
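A minimal sketch of avoiding the contention: give each thread its own 48-bit state and use erand48(), which keeps the iteration state in the caller-supplied array, so the writes stay thread-local (the per-thread seeding scheme here is arbitrary):
#include <pthread.h>
#include <stdlib.h>
#define M 100000000
#define N 4
pthread_t threads[N];
void* f(void* p)
{
    /* Each thread owns its state array, so no written cache line is shared. */
    unsigned short xsubi[3] = { 0x330e, 0, (unsigned short)(long)p };
    for (int i = 0; i < M/N; ++i)
        erand48(xsubi);
    return NULL;
}
int main()
{
    for (long i = 0; i < N; ++i)
        pthread_create(&threads[i], NULL, f, (void*)i);
    for (int i = 0; i < N; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}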
Source code
You provided the link to the drand48() source code, which I've included below for reference. The problem is cache line contention when the state is updated.
#include <stdlib.h>
/* Global state for non-reentrant functions. Defined in drand48-iter.c. */
extern struct drand48_data __libc_drand48_data;
double drand48(void)
{
double result;
erand48_r (__libc_drand48_data.__x, &__libc_drand48_data, &result);
return result;
}
And here is the source for erand48_r
extern int __drand48_iterate(unsigned short xsubi[3], struct drand48_data *buffer);
int erand48_r (xsubi, buffer, result)
unsigned short int xsubi[3];
struct drand48_data *buffer;
double *result;
{
union ieee754_double temp;
/* Compute next state. */
if (__drand48_iterate (xsubi, buffer) < 0)
return -1;
/* Construct a positive double with the 48 random bits distributed over
its fractional part so the resulting FP number is [0.0,1.0). */
temp.ieee.negative = 0;
temp.ieee.exponent = IEEE754_DOUBLE_BIAS;
temp.ieee.mantissa0 = (xsubi[2] << 4) | (xsubi[1] >> 12);
temp.ieee.mantissa1 = ((xsubi[1] & 0xfff) << 20) | (xsubi[0] << 4);
/* Please note the lower 4 bits of mantissa1 are always 0. */
*result = temp.d - 1.0;
return 0;
}
And the implementation of __drand48_iterate which is where it writes back to the global
int
__drand48_iterate (unsigned short int xsubi[3], struct drand48_data *buffer)
{
uint64_t X;
uint64_t result;
/* Initialize buffer, if not yet done. */
if (unlikely(!buffer->__init))
{
buffer->__a = 0x5deece66dull;
buffer->__c = 0xb;
buffer->__init = 1;
}
/* Do the real work. We choose a data type which contains at least
48 bits. Because we compute the modulus it does not care how
many bits really are computed. */
X = (uint64_t) xsubi[2] << 32 | (uint32_t) xsubi[1] << 16 | xsubi[0];
result = X * buffer->__a + buffer->__c;
xsubi[0] = result & 0xffff;
xsubi[1] = (result >> 16) & 0xffff;
xsubi[2] = (result >> 32) & 0xffff;
return 0;
}
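The same effect shows up in your edited example with the global x. A common fix (sketched here; the 64-byte cache line size is an assumption, so measure on your machine) is to give each thread its own padded slot so the writes land on different cache lines:
#include <pthread.h>
#define M 1000000000
#define N 4
pthread_t threads[N];
/* One padded counter per thread: the padding keeps each slot on its own cache line. */
struct padded { int x; char pad[64 - sizeof(int)]; };
struct padded slots[N];
void* f(void* p)
{
    struct padded* s = &slots[(long)p];
    for (int i = 0; i < M/N; ++i)
        s->x = s->x + 10 - 10;
    return NULL;
}
int main()
{
    for (long i = 0; i < N; ++i)
        pthread_create(&threads[i], NULL, f, (void*)i);
    for (int i = 0; i < N; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}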
I am doing a GHASH for an AES-GCM implementation, and I need to implement this:

X_i = 0                                           for i = 0
X_i = (X_{i-1} XOR A_i) * H                       for i = 1, ..., m-1
X_i = (X_{m-1} XOR (A*_m || 0^{128-v})) * H       for i = m
X_i = (X_{i-1} XOR C_{i-m}) * H                   for i = m+1, ..., m+n-1
X_i = (X_{m+n-1} XOR (C*_n || 0^{128-u})) * H     for i = m+n
X_i = (X_{m+n} XOR (len(A) || len(C))) * H        for i = m+n+1

where v is the bit length of the final block of A, u is the bit length of the final block of C, and || denotes concatenation of bit strings.
How can I do the concatenation of the A block to fill in the zero padding from v to 128 bits, when I do not know the length of the whole block of A?
So I just take the A block and XOR it with an array of 128 bits:
void GHASH(uint8_t H[16], uint8_t len_A, uint8_t A_i[len_A], uint8_t len_C,
uint8_t C_i[len_C], uint8_t X_i[16]) {
uint8_t m;
uint8_t n;
uint8_t i;
uint8_t j;
uint8_t zeros[16] = {0};
if (i == m + n) {
for(j=16; j>=0; j--){
C_i[j] = C_i[j] ^ zeros[j]; //XOR with zero array to fill in 0 of length 128-u
tmp[j] = X_i[j] ^ C_i[j]; // X[m+n+1] XOR C[i] left shift by (128bit-u) and store into tmp
gmul(tmp, H, X_i); //Do Multiplication of tmp to H and store into X
}
}
I am pretty sure that I am not correct. But I have no idea how to do it.
It seems to me that you've got several issues here, and conflating them is a big part of the problem. It'll be much easier when you separate them.
First: passing in a parameter of the form uint8_t len_A, uint8_t A_i[len_A] is not proper syntax and won't give you what you want. You're actually getting uint8_t len_A, uint8_t * A_i, and the length of A_i is determined by how it was declared on the level above, not how you tried to pass it in. (Note that uint8_t * A and uint8_t A[] are functionally identical here; the difference is mostly syntactic sugar for the programmer.)
On the level above, since I don't know if it was declared by malloc() or on the stack, I'm not going to get fancy with memory management issues. I'm going to use local storage for my suggestion.
Unit clarity: You've got a bad case going on here: bit vs. byte vs. block length. Without knowing the core algorithm, it appears to me that the undeclared m & n are block lengths of A & C; i.e., A is m blocks long, and C is n blocks long, and in both cases the last block is not required to be full length. You're passing in len_A & len_C without telling us (or using them in code so we can see) whether they're the bit length u/v, the byte length of A_i/C_i, or the total length of A/C, in bits or bytes or blocks. Based on the (incorrect) declaration, I'm assuming they're the length of A_i/C_i in bytes, but it's not obvious... nor is it the obvious thing to pass. By the name, I would have guessed it to be the length of A/C in bits. Hint: if your units are in the names, it becomes obvious when you try to add bitLenA to byteLenB.
Iteration control: You appear to be passing in 16-byte blocks for the i'th iteration, but not passing in i. Either pass in i, or pass in the full A & C instead of A_i & C_i. You're also using m & n without setting them or passing them in; the same issue applies. I'll just pretend they're all correct at the moment of use and let you fix that.
Finally, I don't understand the summation notation for the i=m+n+1 case, in particular how len(A) & len(C) are treated, but you're not asking about that case so I'll ignore it.
Given all that, let's look at your function:
void GHASH(uint8_t H[], uint8_t len_A, uint8_t A_i[], uint8_t len_C, uint8_t C_i[], uint8_t X_i[]) {
uint8_t tmpAC[16] = {0};
uint8_t tmp[16];
uint8_t * pAC = tmpAC;
if (i == 0) { // Initialization case
for (j=0; j<16; ++j) { // X_0 is the full 16-byte block
X_i[j] = 0;
}
return;
} else if (i < m) { // Use the input memory for A
pAC = A_i;
} else if (i == m) { // Use temp memory init'ed to 0; copy in A as far as it goes
for (j=0; j<len_A; ++j) {
pAC[j] = A_i[j];
}
} else if (i < m+n) { // Use the input memory for C
pAC = C_i;
} else if (i == m+n) { // Use temp memory init'ed to 0; copy in C as far as it goes
for (j=0; j<len_C; ++j) {
pAC[j] = C_i[j];
}
} else if (i == m+n+1) { // Do something unclear to me. Maybe this?
// Use temp memory init'ed to 0; copy in len(A) & len(C)
pAC[0] = len_A; // in blocks? bits? bytes?
pAC[1] = len_C; // in blocks? bits? bytes?
}
    for (j = 0; j < 16; j++) {
        tmp[j] = X_i[j] ^ pAC[j]; // X[i-1] XOR the current 16-byte block, stored into tmp
    }
    gmul(tmp, H, X_i); // multiply by H once per block and store the result into X_i
}
We only copy memory in the last block of A or C, and use local memory for the copy. Most blocks are handled with a single pointer copy to point to the correct bit of input memory.
If you don't care about every little bit of efficiency (I assume this is to experiment, and not for real use?), just reallocate and pad (in practice, you could round up and calloc when you first declare these):
size_t round16(size_t n) {
// if n isn't a multiple of 16, round up to next multiple
if (n % 16) return 16 * (1 + n / 16);
return n;
}
size_t realloc16(uint8_t **data, size_t len) {
// if len isn't a multiple of 16, extend with 0s to next multiple
size_t n = round16(len);
*data = realloc(*data, n);
for (size_t i = len; i < n; ++i) (*data)[i] = 0;
return n;
}
void xor16(uint8_t *result, uint8_t *a, uint8_t *b) {
// 16 byte xor
for (size_t i = 0; i < 16; ++i) result[i] = a[i] ^ b[i];
}
void xorandmult(uint8_t *x, uint8_t *data, size_t n, uint8_t *h) {
    // run along the length of the (extended) data, xoring and multiplying
uint8_t tmp[16];
for (size_t i = 0; i < n / 16; ++i) {
xor16(tmp, x, data+i*16);
multgcm(x, h, tmp);
}
}
void ghash(uint8_t *x, uint8_t **a, size_t len_a, uint8_t **c, size_t len_c, uint8_t *h) {
size_t m = realloc16(a, len_a);
xorandmult(x, *a, m, h);
size_t n = realloc16(c, len_c);
xorandmult(x, *c, n, h);
// then handle lengths
}
uint8_t x[16] = {0};
ghash(x, &a, len_a, &c, len_c, h);
Disclaimer: I'm no expert, just skimmed the spec. The code is uncompiled, unchecked, and not intended for "real" use. Also, the spec supports arbitrary (bit) lengths, but I assume you're working in bytes.
Also, I am still not sure I am answering the right question.
Let me preface this with: I have extremely limited experience with ASM, and even less with SIMD.
But it happens that I have the following MMX/SSE optimised code, that I would like to port across to AltiVec instructions for use on PPC/Cell processors.
This is probably a big ask. Even though it's only a few lines of code, I've had no end of trouble trying to work out what's going on here.
The original function:
static inline int convolve(const short *a, const short *b, int n)
{
int out = 0;
union {
__m64 m64;
int i32[2];
} tmp;
tmp.i32[0] = 0;
tmp.i32[1] = 0;
while (n >= 4) {
tmp.m64 = _mm_add_pi32(tmp.m64,
_mm_madd_pi16(*((__m64 *)a),
*((__m64 *)b)));
a += 4;
b += 4;
n -= 4;
}
out = tmp.i32[0] + tmp.i32[1];
_mm_empty();
while (n --)
out += (*(a++)) * (*(b++));
return out;
}
Any tips on how I might rewrite this to use AltiVec instructions?
My first attempt (a very wrong attempt) looks something like this, but it's not entirely (or even remotely) correct:
static inline int convolve_altivec(const short *a, const short *b, int n)
{
int out = 0;
union {
vector unsigned int m128;
int i64[2];
} tmp;
vector unsigned int zero = {0, 0, 0, 0};
tmp.i64[0] = 0;
tmp.i64[1] = 0;
while (n >= 8) {
tmp.m128 = vec_add(tmp.m128,
vec_msum(*((vector unsigned short *)a),
*((vector unsigned short *)b), zero));
a += 8;
b += 8;
n -= 8;
}
out = tmp.i64[0] + tmp.i64[1];
while (n --)
out += (*(a++)) * (*(b++));
return out;
}
You're not far off - I fixed a few minor problems, cleaned up the code a little, added a test harness, and it seems to work OK now:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <altivec.h>
static int convolve_ref(const short *a, const short *b, int n)
{
int out = 0;
int i;
for (i = 0; i < n; ++i)
{
out += a[i] * b[i];
}
return out;
}
static inline int convolve_altivec(const short *a, const short *b, int n)
{
int out = 0;
union {
vector signed int m128;
int i32[4];
} tmp;
const vector signed int zero = {0, 0, 0, 0};
assert(((unsigned long)a & 15) == 0);
assert(((unsigned long)b & 15) == 0);
tmp.m128 = zero;
while (n >= 8)
{
tmp.m128 = vec_msum(*((vector signed short *)a),
*((vector signed short *)b), tmp.m128);
a += 8;
b += 8;
n -= 8;
}
out = tmp.i32[0] + tmp.i32[1] + tmp.i32[2] + tmp.i32[3];
while (n --)
out += (*(a++)) * (*(b++));
return out;
}
int main(void)
{
const int n = 100;
vector signed short _a[n / 8 + 1];
vector signed short _b[n / 8 + 1];
short *a = (short *)_a;
short *b = (short *)_b;
int sum_ref, sum_test;
int i;
for (i = 0; i < n; ++i)
{
a[i] = rand();
b[i] = rand();
}
sum_ref = convolve_ref(a, b, n);
sum_test = convolve_altivec(a, b, n);
printf("sum_ref = %d\n", sum_ref);
printf("sum_test = %d\n", sum_test);
printf("%s\n", sum_ref == sum_test ? "PASS" : "FAIL");
return 0;
}
(Warning: all of my Altivec experience comes from working on Xbox360/PS3 - I'm not sure how different they are from other Altivec platforms).
First off, you should check your pointer alignment. Most vector load (and store) operations expect 16-byte aligned addresses. If they aren't aligned, things will usually carry on without warning, but you won't get the data you were expecting.
It's possible (but slower) to do unaligned loads, but you basically have to read a bit before and after your data and combine them. See Apple's Altivec page. I've also done it before using the lvlx and lvrx load instructions, and then ORing them together.
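For example, a sketch of the classic permute-based unaligned load (following Apple's documented approach; untested on your target):
static vector signed short load_unaligned(const short *p)
{
    vector signed short lo = vec_ld(0, p);       /* aligned load at or below p */
    vector signed short hi = vec_ld(15, p);      /* aligned load covering the last byte */
    vector unsigned char perm = vec_lvsl(0, p);  /* permute mask derived from p's misalignment */
    return vec_perm(lo, hi, perm);               /* splice the two loads together */
}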
Next up, I'm not sure your multiplies and adds are the same. I've never used either _mm_madd_pi16 or vec_msum, so I'm not positive they're equivalent. You should step through in a debugger and make sure they give you the same output for the same input data. Another possible difference is that they may treat overflow differently (e.g. modular vs. saturate).
Last but not least, you're computing 4 ints at a time instead of 2. So your union should hold 4 ints, and you should sum all 4 of them at the end.