int vs short vectorization

int vs short vectorization - c

I have the following kernel vectorized for arrays with integers:
long valor = 0, i=0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi32(0);
int32_t * const pA = A->data;
int32_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-3 ;i+=4){
vecPi = _mm_loadu_si128((__m128i *)&(pA)[i] );
vecCi = _mm_loadu_si128((__m128i *)&(pB)[i] );
vecQCi = _mm_mullo_epi32(vecPi,vecCi);
vsum = _mm_add_epi32(vsum,vecQCi);
}
vsum = _mm_hadd_epi32(vsum, vsum);
vsum = _mm_hadd_epi32(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
valor += A->data[i] * B->data[i];
valor += sumDot[0];
and it works fine. However, if I change the datatype of A and B to short instead of int, shouldn't I use the following code:
long valor = 0, i=0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi16(0);
int16_t * const pA = A->data;
int16_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-7 ;i+=8){
vecPi = _mm_loadu_si128((__m128i *)&(pA)[i] );
vecCi = _mm_loadu_si128((__m128i *)&(pB)[i] );
vecQCi = _mm_mullo_epi16(vecPi,vecCi);
vsum = _mm_add_epi16(vsum,vecQCi);
}
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
valor += A->data[i] * B->data[i];
valor += sumDot[0];
This second kernel doesn't work and I don't know why. I know that all the entries of the vectors in the first and second case are the same (no overflow as well). Can someone help me finding the mistake?
Thanks

Here's a few things I see.
In both the int and short case, when you're storing the __m128 to sumDot, you use _mm_storeu_si128 on targets that are much smaller than 128 bits. This means you've been corrupting memory, and were lucky you were not bitten.
Related to this, because sumDot is an int[1] even in the short case, you were storing two shorts in one int, and then reading it as an int.
In the short case you're missing one horizontal vector reduction step. Remember that now that you've got 8 shorts per vector, you must now have log_2(8) = 3 vector reduction steps.
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
(Optional) Since you're onto SSE4.1 already, might as well use one of the goodies it has: The PEXTR* instructions. They take the index of the lane from which to extract. You're interested in the bottom lane (lane 0) because that's where the sum ends up after your vector reduction.
/* 32-bit */
sumDot[0] = _mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = _mm_extract_epi16(vsum, 0);
EDIT: Apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16. You must convince it to do so yourself.
/* 32-bit */
sumDot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
EDIT2: I found an even BETTER solution! It uses exactly the instruction we need (PMADDWD), and is identical to the 32-bit code except that the iteration bounds are different, and instead of _mm_mullo_epi16 you use _mm_madd_epi16 in the loop. This only needs two 32-bit vector reduction stages. http://pastebin.com/A9ibkMwP
(Optional) It is good style but will make no difference to use the _mm_setzero_*() functions instead of _mm_set1_*(0).

Related

How to count character occurrences using SIMD

I am given a array of lowercase characters (up to 1.5Gb) and a character c. And I want to find how many occurrences are of the character c using AVX instructions.
unsigned long long char_count_AVX2(char * vector, int size, char c){
unsigned long long sum =0;
int i, j;
const int con=3;
__m256i ans[con];
for(i=0; i<con; i++)
ans[i]=_mm256_setzero_si256();
__m256i Zer=_mm256_setzero_si256();
__m256i C=_mm256_set1_epi8(c);
__m256i Assos=_mm256_set1_epi8(0x01);
__m256i FF=_mm256_set1_epi8(0xFF);
__m256i shield=_mm256_set1_epi8(0xFF);
__m256i temp;
int couter=0;
for(i=0; i<size; i+=32){
couter++;
shield=_mm256_xor_si256(_mm256_cmpeq_epi8(ans[0], Zer), FF);
temp=_mm256_cmpeq_epi8(C, *((__m256i*)(vector+i)));
temp=_mm256_xor_si256(temp, FF);
temp=_mm256_add_epi8(temp, Assos);
ans[0]=_mm256_add_epi8(temp, ans[0]);
for(j=1; j<con; j++){
temp=_mm256_cmpeq_epi8(ans[j-1], Zer);
shield=_mm256_and_si256(shield, temp);
temp=_mm256_xor_si256(shield, FF);
temp=_mm256_add_epi8(temp, Assos);
ans[j]=_mm256_add_epi8(temp, ans[j]);
}
}
for(j=con-1; j>=0; j--){
sum<<=8;
unsigned char *ptr = (unsigned char*)&(ans[j]);
for(i=0; i<32; i++){
sum+=*(ptr+i);
}
}
return sum;
}

I'm intentionally leaving out some parts, which you need to figure out yourself (e.g. handling lengths that aren't a multiple of 4*255*32 bytes), but your most inner loop should look something like the one starting with for(int i...):
_mm256_cmpeq_epi8 will get you a -1 in each byte, which you can use as an integer. If you subtract that from a counter (using _mm256_sub_epi8) you can directly count up to 255 or 128. The inner loop contains just these two intrinsics. You have to stop and
#include <immintrin.h>
#include <stdint.h>
static inline
__m256i hsum_epu8_epu64(__m256i v) {
return _mm256_sad_epu8(v, _mm256_setzero_si256()); // SAD against zero is a handy trick
}
static inline
uint64_t hsum_epu64_scalar(__m256i v) {
__m128i lo = _mm256_castsi256_si128(v);
__m128i hi = _mm256_extracti128_si256(v, 1);
__m128i sum2x64 = _mm_add_epi64(lo, hi); // narrow to 128
hi = _mm_unpackhi_epi64(sum2x64, sum2x64);
__m128i sum = _mm_add_epi64(hi, sum2x64); // narrow to 64
return _mm_cvtsi128_si64(sum);
}
unsigned long long char_count_AVX2(char const* vector, size_t size, char c)
{
__m256i C=_mm256_set1_epi8(c);
// todo: count elements and increment `vector` until it is aligned to 256bits (=32 bytes)
__m256i const * simd_vector = (__m256i const *) vector;
// *simd_vector is an alignment-required load, unlike _mm256_loadu_si256()
__m256i sum64 = _mm256_setzero_si256();
size_t unrolled_size_limit = size - 4*255*32 + 1;
for(size_t k=0; k<unrolled_size_limit ; k+=4*255*32) // outer loop: TODO
{
__m256i counter[4]; // multiple counter registers to hide latencies
for(int j=0; j<4; j++)
counter[j]=_mm256_setzero_si256();
// inner loop: make sure that you don't go beyond the data you can read
for(int i=0; i<255; ++i)
{ // or limit this inner loop to ~22 to avoid branch mispredicts
for(int j=0; j<4; ++j)
{
counter[j]=_mm256_sub_epi8(counter[j], // count -= 0 or -1
_mm256_cmpeq_epi8(*simd_vector, C));
++simd_vector;
}
}
// only need one outer accumulator: OoO exec hides the latency of adding into it
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[0]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[1]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[2]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[3]));
}
uint64_t sum = hsum_epu64_scalar(sum64);
// TODO add up remaining bytes with sum.
// Including a rolled-up vector loop before going scalar
// because we're potentially a *long* way from the end
// Maybe put some logic into the main loop to shorten the 255 inner iterations
// if we're close to the end. A little bit of scalar work there shouldn't hurt every 255 iters.
return sum;
}
Godbolt link: https://godbolt.org/z/do5e3- (clang is slightly better than gcc at unrolling the most inner loop: gcc includes some useless vmovdqa instructions that will bottleneck the front-end if the data is hot in L1d cache, preventing us from running close to 2x 32-byte loads per clock)

If you don't insist on using only SIMD instructions, you can make use
of the VPMOVMSKB instruction in combination with the POPCNT instruction. The former combines the highest bits of each byte into a 32-bit integer mask and the latter counts the 1 bits in this integer (=the count of char matches).
int couter=0;
for(i=0; i<size; i+=32) {
...
couter +=
_mm_popcnt_u32(
(unsigned int)_mm256_movemask_epi8(
_mm256_cmpeq_epi8( C, *((__m256i*)(vector+i) ))
)
);
...
}
I haven't tested this solution, but you should get the gist.

Probably the fastest: memcount_avx2 and memcount_sse2
size_t memcount_avx2(const void *s, int c, size_t n)
{
__m256i cv = _mm256_set1_epi8(c),
zv = _mm256_setzero_si256(),
sum = zv, acr0,acr1,acr2,acr3;
const char *p,*pe;
for(p = s; p != (char *)s+(n- (n % (252*32)));)
{
for(acr0 = acr1 = acr2 = acr3 = zv, pe = p+252*32; p != pe; p += 128)
{
acr0 = _mm256_sub_epi8(acr0, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)p)));
acr1 = _mm256_sub_epi8(acr1, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+32))));
acr2 = _mm256_sub_epi8(acr2, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+64))));
acr3 = _mm256_sub_epi8(acr3, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+96))));
__builtin_prefetch(p+1024);
}
sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr0, zv));
sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr1, zv));
sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr2, zv));
sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr3, zv));
}
for(acr0 = zv; p+32 < (char *)s + n; p += 32)
acr0 = _mm256_sub_epi8(acr0, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)p)));
sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr0, zv));
size_t count = _mm256_extract_epi64(sum, 0)
+ _mm256_extract_epi64(sum, 1)
+ _mm256_extract_epi64(sum, 2)
+ _mm256_extract_epi64(sum, 3);
while(p != (char *)s + n)
count += *p++ == c;
return count;
}
Benchmark skylake i7-6700 - 3.4GHz - gcc 8.3:
memcount_avx2 : 28 GB/s
memcount_sse: 23 GB/s
char_count_AVX2 : 23 GB/s (from post)

image proccessing further optimization

I'm new to optimization and was given a task to optimize a function that processes an image as much as possible. it takes an image, blurs it and then saves the blurred image, and then continues and sharpens the image, and saves also the sharpened image.
Here is my code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
I wanted to ask about the if statements, is there anything better that can replace those? And also more generally speaking can anyone spot an optimization mistakes here, or can offer his inputs?
Thanks a lot!
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
index += 2;
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
------------------------------------------------------------------------------updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = n + 1;
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operate like that : instead of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
index += 2;
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}

Some general optimization guidelines:
If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so actually know where the performance bottlenecks are. Any guesses about the location of any performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.
Some specific ideas for your code:
Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.
Those would convert your four-deep nested loops into a single loop with very little index calculations needed.

A few ideas - untested.
You have if(ii==i && jj=j) to test for the central pixel in your sharpening loop which you do 9x for every pixel. I think it would be faster to remove that if and do exactly the same for every pixel but then make a correction, outside the loop by adding 10x the central pixel.
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.
index=calculateIndex(1,1,n)
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
...
dst[index++] = current_pixel;
}
index+=2; // skip over last pixel of this line and first pixel of next line
}
As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.

Floating average with reading of ADC values

I want to do moving average or something similar to that, because I am getting noisy values from ADC, this is my first try, just to compute moving average, but values goes to 0 everytime, can you help me?
This is part of code, which makes this magic:
unsigned char buffer[5];
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
adc_result0 = adc_read(0); // read adc value at PA0
samples++;
//adc_result1 = adc_read(1); // read adc value at PA1
ADCaverage = (ADCaverage + adc_result0)/samples;
sprintf(buffer, "%d\n", (int)ADCaverage);
char * p = buffer;
while (*p) { USART_Transmit0(*p++); }
_delay_ms(1000);
}
return(0);
}
This result I am sending via usart to display value.

Your equation is not correct.
Let s_n = (sum_{i=0}^{n} x[i])/n then:
s_(n-1) = sum_{i=0}^{n-1} x[i])/(n-1)
sum_{i=0}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=0}^{n} x[i] = n*s_n
sum_{i=0}^{n} x[i] = sum_{i=0}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n]-s_(n-1))
s_n = s_(n-1) + (x[n]-s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;

You can use an exponential moving average which only needs 1 memory unit.
y[0] = (x[0] + y[-1] * (a-1) )/a
Where a is the filter factor.
If a is multiples of 2 you can use shifts and optimize for speed significantly:
y[0] = ( x[0] + ( ( y[-1] << a ) - y[-1] ) ) >> a
This works especially well with left aligned ADC's. Just keep an eye on the word size of the shift result.

Comparison operation in SSE

I am new to the SSE coding. I want to do write a SSE code for my algorithm. I want to convert below C code into SSE code.
for(int i=1;i<height;i++)
{
for(int j=1;j<width;j++)
{
int index = 0;
if(input[width*i + j]<=input[width*(i-1)+(j-1)])) index += 0x80;
if(input[width*i + j]<=input[width*(i-1)+(j )])) index += 0x40;
if(input[width*i + j]<=input[width*(i-1)+(j+1)])) index += 0x20;
if(input[width*i + j]<=input[width*(i )+(j-1)])) index += 0x10;
if(input[width*i + j]<=input[width*(i )+(j+1)])) index += 0x08;
if(input[width*i + j]<=input[width*(i+1)+(j-1)])) index += 0x04;
if(input[width*i + j]<=input[width*(i+1)+(j )])) index += 0x02;
if(input[width*i + j]<=input[width*(i+1)+(j+1)])) index ++;
output[width*(i-1)+(j-1)] = index;
}
}
Here is my SSE code:
unsigned char *dst_d = outputbuffer
float *CT_image_0 = inputbuffer;
float *CT_image_1 = CT_image_0 + width;
float *CT_image_2 = CT_image_1 + width;
for(int i=1;i<height;i++)
{
for(int j=1;j<width;j+=4)
{
__m128 CT_current_00 = _mm_loadu_ps((CT_image_0+j-1));
__m128 CT_current_10 = _mm_loadu_ps((CT_image_1+j-1));
__m128 CT_current_20 = _mm_loadu_ps((CT_image_2+j-1));
__m128 CT_current_01 = _mm_loadu_ps(((CT_image_0+1)+j-1));
__m128 CT_current_11 = _mm_loadu_ps(((CT_image_1+1)+j-1));
__m128 CT_current_21 = _mm_loadu_ps(((CT_image_2+1)+j-1));
__m128 CT_current_02 = _mm_loadu_ps(((CT_image_0+2)+j-1));
__m128 CT_current_12 = _mm_loadu_ps(((CT_image_1+2)+j-1));
__m128 CT_current_22 = _mm_loadu_ps(((CT_image_2+2)+j-1));
__m128 val = CT_current_11;
//Below I tried to write the SSE instruction but that was wrong :(
//--How I can do index + ...operation with this _mm_cmple_ss return value ????
__m128 sample6= _mm_cmple_ss(val,CT_current_00);
sample6 += _mm_cmple_ss(val,CT_current_01);
sample6 += _mm_cmple_ss(val,CT_current_02);
sample6 += _mm_cmple_ss(val,CT_current_10);
sample6 +=_mm_cmple_ss(val,CT_current_12);
sample6 +=_mm_cmple_ss(val,CT_current_20);
sample6 +=_mm_cmple_ss(val,CT_current_21);
sample6 +=_mm_cmple_ss(val,CT_current_22);
}
CT_image_0 +=width;
CT_image_1 +=width;
CT_image_2 +=width;
dst_d += (width-2);
}
I broke my head and trying (as a lay man)to use if condition ...Please give me some idea on this ???

The part that needs work is evidently this:
__m128 sample6= _mm_cmple_ss(val,CT_current_00);
sample6 += _mm_cmple_ss(val,CT_current_01);
sample6 += _mm_cmple_ss(val,CT_current_02);
sample6 += _mm_cmple_ss(val,CT_current_10);
sample6 +=_mm_cmple_ss(val,CT_current_12);
sample6 +=_mm_cmple_ss(val,CT_current_20);
sample6 +=_mm_cmple_ss(val,CT_current_21);
sample6 +=_mm_cmple_ss(val,CT_current_22);
You need to combine all the comparison results into a set of flags, e.g. like this:
__m128i out = _mm_setzero_si128(); // init output flags to all zeroes
__m128i test;
test = _mm_cmple_ss(val, CT_current_00); // compare
test = _mm_and_si128(test, _mm_set1_epi32(0x80)); // mask all but required flag
out = _mm_or_si128(out, test); // merge flags to output mask
test = _mm_cmple_ss(val, CT_current_01);
test = _mm_and_si128(test, _mm_set1_epi32(0x40));
out = _mm_or_si128(out, test);
// ... repeat for each offset and flag value
// ... then finally extract 4 bytes from `out`
// ... and store at output[width*(i-1)+(j-1)]

I do not know what SSE is code but more than likely you would want to run a/or some combination of combining CT_current variables into a string array and then concatenating them into a List with the before before mention (by your code), CT=** specification (where CT** is everything you put afterwards); in order to iterate back to the _m128 you print to, then as you know you can then double iterate as you've done.
Good Luck.

How to make the following code faster

int u1, u2;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long
res1, res2 initialized to zero.
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res1[i + k], simdb);
simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);
simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res2[i + k], simdb);
}
}
l -= 4;
All res1, res2 values are left shifted by 4 bits.
}
The above mentioned code is called many times in my program (profiler shows 98%).
EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct
Any suggestion to make it faster is welcome.

Unfortunately the most obvious optimisations are probably already being done by the compiler:
You can pull &_mulpre[u1] and &mulpre[u2] our of the inner loop.
You can pull &res1[i] our of the inner loop.
Using different variables for the two inner operations, and reordering them, might allow for better pipelining.
Possibly swapping the outer loops would improve cache locality on elm1.

Well, you could always call it fewer times :-)
The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).

l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
_mm_stream_si128 ((__m128i *)&res1[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
_mm_load_si128 ((__m128i *) &res1[i + k]
));
mm_stream_si128 ((__m128i *)&res2[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *)&_mulpre[u2][k]),
_mm_load_si128 ((__m128i *)&res2[i + k])
));
}
}
l -= 4;
All res1, res2 values are left shifted by 4 bits.
}
Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
try _mm_stream_si128(), it might speed up the storing process.
try prefetch

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight