I am new to SSE coding and want to write an SSE version of my algorithm. I want to convert the C code below into SSE code.
// Note: with these bounds, the (i+1) and (j+1) accesses read one past the
// last row/column; the loops should really run to height-1 and width-1.
for(int i=1;i<height;i++)
{
    for(int j=1;j<width;j++)
    {
        int index = 0;
        if(input[width*i + j]<=input[width*(i-1)+(j-1)]) index += 0x80;
        if(input[width*i + j]<=input[width*(i-1)+(j  )]) index += 0x40;
        if(input[width*i + j]<=input[width*(i-1)+(j+1)]) index += 0x20;
        if(input[width*i + j]<=input[width*(i  )+(j-1)]) index += 0x10;
        if(input[width*i + j]<=input[width*(i  )+(j+1)]) index += 0x08;
        if(input[width*i + j]<=input[width*(i+1)+(j-1)]) index += 0x04;
        if(input[width*i + j]<=input[width*(i+1)+(j  )]) index += 0x02;
        if(input[width*i + j]<=input[width*(i+1)+(j+1)]) index ++;
        output[width*(i-1)+(j-1)] = index;
    }
}
Here is my SSE code:
unsigned char *dst_d = outputbuffer;
float *CT_image_0 = inputbuffer;
float *CT_image_1 = CT_image_0 + width;
float *CT_image_2 = CT_image_1 + width;
for(int i=1;i<height;i++)
{
    for(int j=1;j<width;j+=4)
    {
        __m128 CT_current_00 = _mm_loadu_ps(CT_image_0 + j - 1);
        __m128 CT_current_10 = _mm_loadu_ps(CT_image_1 + j - 1);
        __m128 CT_current_20 = _mm_loadu_ps(CT_image_2 + j - 1);
        __m128 CT_current_01 = _mm_loadu_ps((CT_image_0 + 1) + j - 1);
        __m128 CT_current_11 = _mm_loadu_ps((CT_image_1 + 1) + j - 1);
        __m128 CT_current_21 = _mm_loadu_ps((CT_image_2 + 1) + j - 1);
        __m128 CT_current_02 = _mm_loadu_ps((CT_image_0 + 2) + j - 1);
        __m128 CT_current_12 = _mm_loadu_ps((CT_image_1 + 2) + j - 1);
        __m128 CT_current_22 = _mm_loadu_ps((CT_image_2 + 2) + j - 1);
        __m128 val = CT_current_11;
        // Below I tried to write the SSE instructions, but they were wrong :(
        // -- How can I do the index + ... operation with this _mm_cmple_ss return value?
        __m128 sample6 = _mm_cmple_ss(val, CT_current_00);
        sample6 += _mm_cmple_ss(val, CT_current_01);
        sample6 += _mm_cmple_ss(val, CT_current_02);
        sample6 += _mm_cmple_ss(val, CT_current_10);
        sample6 += _mm_cmple_ss(val, CT_current_12);
        sample6 += _mm_cmple_ss(val, CT_current_20);
        sample6 += _mm_cmple_ss(val, CT_current_21);
        sample6 += _mm_cmple_ss(val, CT_current_22);
    }
    CT_image_0 += width;
    CT_image_1 += width;
    CT_image_2 += width;
    dst_d += (width-2);
}
I have been breaking my head trying (as a layman) to use an if condition here. Please give me some ideas on this.
The part that needs work is evidently this:
__m128 sample6 = _mm_cmple_ss(val, CT_current_00);
sample6 += _mm_cmple_ss(val, CT_current_01);
sample6 += _mm_cmple_ss(val, CT_current_02);
sample6 += _mm_cmple_ss(val, CT_current_10);
sample6 += _mm_cmple_ss(val, CT_current_12);
sample6 += _mm_cmple_ss(val, CT_current_20);
sample6 += _mm_cmple_ss(val, CT_current_21);
sample6 += _mm_cmple_ss(val, CT_current_22);
You need to combine all the comparison results into a set of flags, e.g. like this:
__m128i out = _mm_setzero_si128();                         // init output flags to all zeroes
__m128i test;
test = _mm_castps_si128(_mm_cmple_ps(val, CT_current_00)); // compare all 4 lanes (use _ps, not _ss, and cast the float mask to integer)
test = _mm_and_si128(test, _mm_set1_epi32(0x80));          // mask all but the required flag
out = _mm_or_si128(out, test);                             // merge flags into the output mask
test = _mm_castps_si128(_mm_cmple_ps(val, CT_current_01));
test = _mm_and_si128(test, _mm_set1_epi32(0x40));
out = _mm_or_si128(out, test);
// ... repeat for each offset and flag value
// ... then finally extract 4 bytes from `out`
// ... and store at output[width*(i-1)+(j-1)]
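To flesh out those last two comments, one possible sketch of the extraction (assuming `out` ends up holding one flag byte per 32-bit lane, and that `dst_d` points at the start of the current output row as in the question's code):
__m128i zero = _mm_setzero_si128();
__m128i w16  = _mm_packs_epi32(out, zero);       // 4 x 32-bit -> 16-bit; values are 0..255, so no saturation
__m128i b8   = _mm_packus_epi16(w16, zero);      // 16-bit -> 8-bit
*(int *)(dst_d + j - 1) = _mm_cvtsi128_si32(b8); // store 4 result bytes at once (x86 tolerates the unaligned store)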
I have an array of length n filled with 16-bit (int16) raw PCM data at a 44100 sample rate, in stereo, so the array holds two bytes for the left channel, then the right channel, and so on. I tried to implement a simple low pass by converting my array to floating point in the range -1 to 1. The low pass works, but rounding errors cause little pops in the sound.
Right now I simply do this:
INT32 left_id = 0;
INT32 right_id = 1;
DOUBLE filtered_l_db = 0.0;
DOUBLE filtered_r_db = 0.0;
DOUBLE last_filtered_left = 0;
DOUBLE last_filtered_right = 0;
DOUBLE l_db = 0.0;
DOUBLE r_db = 0.0;
DOUBLE low_filter = filter_freq(core->audio->low_pass_cut);
for(UINT32 a = 0; a < (buffer_size/2); ++a)
{
    l_db = ((DOUBLE)input_buffer[left_id]) / (DOUBLE)32768;
    r_db = ((DOUBLE)input_buffer[right_id]) / (DOUBLE)32768;
    /////////////// LOW PASS
    filtered_l_db = last_filtered_left +
                    (low_filter * (l_db - last_filtered_left));
    filtered_r_db = last_filtered_right +
                    (low_filter * (r_db - last_filtered_right));
    last_filtered_left = filtered_l_db;
    last_filtered_right = filtered_r_db;
    INT16 l = (INT16)(filtered_l_db * (DOUBLE)32768);
    INT16 r = (INT16)(filtered_r_db * (DOUBLE)32768);
    output_buffer[left_id] = (output_buffer[left_id] + l);
    output_buffer[right_id] = (output_buffer[right_id] + r);
    left_id += 2;
    right_id += 2;
}
P.S.: the input buffer is an int16 array with PCM data from -32767 to 32767.
I found this function in "Low Pass filter in C", and it was the only one I could understand:
DOUBLE filter_freq(DOUBLE cut_freq)
{
    DOUBLE a = 1.0/(cut_freq * 2 * PI);
    DOUBLE b = 1.0/SAMPLE_RATE;
    return b/(a+b);
}
My aim instead is to keep absolute precision on the wave and to low-pass directly using only integers, at the cost of losing resolution in the filter (and I'm OK with that). I saw a lot of examples but I really didn't understand anything. Would someone be so kind as to explain how this is done, as you would explain it to a little baby (in code or pseudocode)? Thank you.
Assuming the result of the function filter_freq can be written as a fraction m/n, your filter calculation basically is
y_new = y_old + (m/n) * (x - y_old);
which can be transformed to
y_new = ((n * y_old) + m * (x - y_old)) / n;
The integer division / n truncates the result towards 0. If you want rounding instead of truncation, you can implement it as
y_tmp = ((n * y_old) + m * (x - y_old));
if(y_tmp < 0) y_tmp -= (n / 2);
else          y_tmp += (n / 2);
y_new = y_tmp / n;
In order to avoid losing precision from dividing the result by n in one step and multiplying it by n in the next step, you can save the value y_tmp before the division and use it in the next cycle.
y_tmp = (y_tmp + m * (x - y_old));
if(y_tmp < 0) y_new = y_tmp - (n / 2);
else          y_new = y_tmp + (n / 2);
y_new /= n;
If your input data is int16_t, I suggest implementing the calculation using int32_t to avoid overflows (and watch that m * (x - y_old) cannot exceed 31 bits for your choice of m and n).
I tried to convert the filter in your code without checking other parts for possible problems:
INT32 left_id = 0;
INT32 right_id = 1;
int32_t filtered_l_out = 0; // output value after division
int32_t filtered_r_out = 0;
int32_t filtered_l_tmp = 0; // used to keep the output value before division
int32_t filtered_r_tmp = 0;
int32_t l_in = 0; // input value
int32_t r_in = 0;
DOUBLE low_filter = filter_freq(core->audio->low_pass_cut);
// define denominator and calculate numerator
// use power of 2 to allow bit-shift instead of division
const uint32_t filter_shift = 16U;
const int32_t filter_n = 1U << filter_shift;
int32_t filter_m = (int32_t)(low_filter * filter_n);
for(UINT32 a = 0; a < (buffer_size/2); ++a)
{
    l_in = input_buffer[left_id];
    r_in = input_buffer[right_id];
    /////////////// LOW PASS
    filtered_l_tmp = filtered_l_tmp + filter_m * (l_in - filtered_l_out);
    if(filtered_l_tmp < 0) {
        filtered_l_out = filtered_l_tmp - filter_n/2;
    } else {
        filtered_l_out = filtered_l_tmp + filter_n/2;
    }
    //filtered_l_out /= filter_n;
    filtered_l_out >>= filter_shift; // note: right-shifting a negative value is implementation-defined (arithmetic on common compilers)
    /* same calculation for right */
    INT16 l = (INT16)(filtered_l_out);
    INT16 r = (INT16)(filtered_r_out);
    output_buffer[left_id] = (output_buffer[left_id] + l);
    output_buffer[right_id] = (output_buffer[right_id] + r);
    left_id += 2;
    right_id += 2;
}
As the filter is initialized with 0, it may need several samples to follow a possible step to the first input value. Depending on your data, it might be better to initialize the filter based on the first input value, as sketched below.
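A minimal sketch of such an initialization (untested; assumes the buffer holds at least one left/right pair and reuses the variables from the code above):
// Seed the filter state from the first stereo frame so the output
// starts at the first sample instead of ramping up from 0.
filtered_l_out = input_buffer[0];
filtered_r_out = input_buffer[1];
filtered_l_tmp = filtered_l_out * filter_n; // tmp holds the pre-division value
filtered_r_tmp = filtered_r_out * filter_n;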
I am given an array of lowercase characters (up to 1.5 GB) and a character c, and I want to count the occurrences of c using AVX instructions.
unsigned long long char_count_AVX2(char * vector, int size, char c){
    unsigned long long sum = 0;
    int i, j;
    const int con = 3;
    __m256i ans[con];
    for(i=0; i<con; i++)
        ans[i] = _mm256_setzero_si256();
    __m256i Zer = _mm256_setzero_si256();
    __m256i C = _mm256_set1_epi8(c);
    __m256i Assos = _mm256_set1_epi8(0x01);
    __m256i FF = _mm256_set1_epi8(0xFF);
    __m256i shield = _mm256_set1_epi8(0xFF);
    __m256i temp;
    int counter = 0;
    for(i=0; i<size; i+=32){
        counter++;
        shield = _mm256_xor_si256(_mm256_cmpeq_epi8(ans[0], Zer), FF);
        temp = _mm256_cmpeq_epi8(C, *((__m256i*)(vector+i)));
        temp = _mm256_xor_si256(temp, FF);
        temp = _mm256_add_epi8(temp, Assos);
        ans[0] = _mm256_add_epi8(temp, ans[0]);
        for(j=1; j<con; j++){
            temp = _mm256_cmpeq_epi8(ans[j-1], Zer);
            shield = _mm256_and_si256(shield, temp);
            temp = _mm256_xor_si256(shield, FF);
            temp = _mm256_add_epi8(temp, Assos);
            ans[j] = _mm256_add_epi8(temp, ans[j]);
        }
    }
    for(j=con-1; j>=0; j--){
        sum <<= 8;
        unsigned char *ptr = (unsigned char*)&(ans[j]);
        for(i=0; i<32; i++){
            sum += *(ptr+i);
        }
    }
    return sum;
}
I'm intentionally leaving out some parts, which you need to figure out yourself (e.g. handling lengths that aren't a multiple of 4*255*32 bytes), but your innermost loop should look something like the one starting with for(int i...):
_mm256_cmpeq_epi8 will get you a -1 in each matching byte, which you can use as an integer. If you subtract that from a counter (using _mm256_sub_epi8) you can directly count up to 255 or 128. The inner loop contains just these two intrinsics. You have to stop and widen the byte counters (e.g. with _mm256_sad_epu8, as in the hsum_epu8_epu64 helper below) after at most 255 iterations, before they can overflow:
#include <immintrin.h>
#include <stdint.h>
static inline
__m256i hsum_epu8_epu64(__m256i v) {
return _mm256_sad_epu8(v, _mm256_setzero_si256()); // SAD against zero is a handy trick
}
static inline
uint64_t hsum_epu64_scalar(__m256i v) {
__m128i lo = _mm256_castsi256_si128(v);
__m128i hi = _mm256_extracti128_si256(v, 1);
__m128i sum2x64 = _mm_add_epi64(lo, hi); // narrow to 128
hi = _mm_unpackhi_epi64(sum2x64, sum2x64);
__m128i sum = _mm_add_epi64(hi, sum2x64); // narrow to 64
return _mm_cvtsi128_si64(sum);
}
unsigned long long char_count_AVX2(char const* vector, size_t size, char c)
{
__m256i C=_mm256_set1_epi8(c);
// todo: count elements and increment `vector` until it is aligned to 256bits (=32 bytes)
__m256i const * simd_vector = (__m256i const *) vector;
// *simd_vector is an alignment-required load, unlike _mm256_loadu_si256()
__m256i sum64 = _mm256_setzero_si256();
size_t unrolled_size_limit = size - 4*255*32 + 1;
for(size_t k=0; k<unrolled_size_limit ; k+=4*255*32) // outer loop: TODO
{
__m256i counter[4]; // multiple counter registers to hide latencies
for(int j=0; j<4; j++)
counter[j]=_mm256_setzero_si256();
// inner loop: make sure that you don't go beyond the data you can read
for(int i=0; i<255; ++i)
{ // or limit this inner loop to ~22 to avoid branch mispredicts
for(int j=0; j<4; ++j)
{
counter[j]=_mm256_sub_epi8(counter[j], // count -= 0 or -1
_mm256_cmpeq_epi8(*simd_vector, C));
++simd_vector;
}
}
// only need one outer accumulator: OoO exec hides the latency of adding into it
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[0]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[1]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[2]));
sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[3]));
}
uint64_t sum = hsum_epu64_scalar(sum64);
// TODO add up remaining bytes with sum.
// Including a rolled-up vector loop before going scalar
// because we're potentially a *long* way from the end
// Maybe put some logic into the main loop to shorten the 255 inner iterations
// if we're close to the end. A little bit of scalar work there shouldn't hurt every 255 iters.
return sum;
}
Godbolt link: https://godbolt.org/z/do5e3- (clang is slightly better than gcc at unrolling the innermost loop: gcc includes some useless vmovdqa instructions that will bottleneck the front-end if the data is hot in L1d cache, preventing us from running close to 2x 32-byte loads per clock)
If you don't insist on using only SIMD instructions, you can make use of the VPMOVMSKB instruction in combination with the POPCNT instruction. The former combines the highest bits of each byte into a 32-bit integer mask and the latter counts the 1 bits in this integer (= the count of char matches).
int counter = 0;
for(i=0; i<size; i+=32) {
    ...
    counter +=
        _mm_popcnt_u32(
            (unsigned int)_mm256_movemask_epi8(
                _mm256_cmpeq_epi8( C, *((__m256i*)(vector+i)) )
            )
        );
    ...
}
I haven't tested this solution, but you should get the gist.
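For reference, a self-contained sketch of that idea (my own assumptions: size is a multiple of 32 and unaligned loads are acceptable; a scalar tail loop would handle the rest):
#include <immintrin.h>
#include <stddef.h>

// Count occurrences of c using cmpeq + movemask + popcnt (requires AVX2 + POPCNT).
unsigned long long char_count_movemask(const char *vector, size_t size, char c)
{
    __m256i C = _mm256_set1_epi8(c);
    unsigned long long count = 0;
    for (size_t i = 0; i < size; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(vector + i));
        unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(C, chunk));
        count += _mm_popcnt_u32(mask); // one mask bit per matching byte
    }
    return count;
}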
Probably the fastest: memcount_avx2 and memcount_sse2
size_t memcount_avx2(const void *s, int c, size_t n)
{
    __m256i cv = _mm256_set1_epi8(c),
            zv = _mm256_setzero_si256(),
            sum = zv, acr0, acr1, acr2, acr3;
    const char *p, *pe;
    for(p = s; p != (const char *)s + (n - (n % (252*32)));)
    {
        for(acr0 = acr1 = acr2 = acr3 = zv, pe = p + 252*32; p != pe; p += 128)
        {
            acr0 = _mm256_sub_epi8(acr0, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)p)));
            acr1 = _mm256_sub_epi8(acr1, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+32))));
            acr2 = _mm256_sub_epi8(acr2, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+64))));
            acr3 = _mm256_sub_epi8(acr3, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)(p+96))));
            __builtin_prefetch(p+1024);
        }
        sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr0, zv));
        sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr1, zv));
        sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr2, zv));
        sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr3, zv));
    }
    for(acr0 = zv; p + 32 < (const char *)s + n; p += 32)
        acr0 = _mm256_sub_epi8(acr0, _mm256_cmpeq_epi8(cv, _mm256_lddqu_si256((const __m256i *)p)));
    sum = _mm256_add_epi64(sum, _mm256_sad_epu8(acr0, zv));
    size_t count = _mm256_extract_epi64(sum, 0)
                 + _mm256_extract_epi64(sum, 1)
                 + _mm256_extract_epi64(sum, 2)
                 + _mm256_extract_epi64(sum, 3);
    while(p != (const char *)s + n)
        count += *p++ == c;
    return count;
}
Benchmark: Skylake i7-6700 - 3.4 GHz - gcc 8.3:
memcount_avx2   : 28 GB/s
memcount_sse2   : 23 GB/s
char_count_AVX2 : 23 GB/s (from this post)
I want to compute a moving average (or something similar) because I am getting noisy values from an ADC. This is my first try, just computing the moving average, but the value goes to 0 every time. Can you help me?
This is the part of the code that does the magic:
char buffer[8]; // large enough for "65535\n" plus the terminating '\0'
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
    adc_result0 = adc_read(0); // read adc value at PA0
    samples++;
    //adc_result1 = adc_read(1); // read adc value at PA1
    ADCaverage = (ADCaverage + adc_result0)/samples;
    sprintf(buffer, "%d\n", (int)ADCaverage);
    char * p = buffer;
    while (*p) { USART_Transmit0(*p++); }
    _delay_ms(1000);
}
return(0);
}
I send this result via USART to display the value.
Your equation is not correct.
Let s_n = (sum_{i=1}^{n} x[i]) / n. Then:
s_(n-1) = (sum_{i=1}^{n-1} x[i]) / (n-1)
sum_{i=1}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=1}^{n} x[i] = n*s_n
sum_{i=1}^{n} x[i] = sum_{i=1}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n] - s_(n-1))
s_n = s_(n-1) + (x[n] - s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;
You can use an exponential moving average, which only needs one memory unit.
y[0] = ( x[0] + y[-1] * (a-1) ) / a
Where a is the filter factor.
If a is a power of two (a = 2^k) you can use shifts and optimize for speed significantly:
y[0] = ( x[0] + ( ( y[-1] << k ) - y[-1] ) ) >> k
This works especially well with left-aligned ADCs. Just keep an eye on the word size of the shift result.
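A minimal integer sketch of this update (assuming 10-bit ADC readings and k = 3, i.e. a = 8; both values are illustrative):
#include <stdint.h>

#define EMA_SHIFT 3           // k = 3, filter factor a = 8

static uint32_t ema = 0;      // running average state

uint16_t ema_update(uint16_t sample)
{
    // y = (x + y*(2^k - 1)) >> k, using shifts only
    ema = (sample + ((ema << EMA_SHIFT) - ema)) >> EMA_SHIFT;
    return (uint16_t)ema;
}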
I am looking for a way to compare the upper halves of two __m128d variables.
So I looked through https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for relevant intrinsics.
But I can only find intrinsics that compare the lower halves of two variables, for example _mm_comieq_sd.
I wonder why there are no intrinsics for comparing the upper halves and, more importantly: how can I compare the upper halves of two __m128d variables?
Update:
The code looks like this:
j0 = jprev0;
j1 = jprev1;
t_0 = p_i_x - pj_x_0;
t_1 = p_i_x - pj_x_1;
r2_0 = t_0 * t_0;
r2_1 = t_1 * t_1;
t_0 = p_i_y - pj_y_0;
t_1 = p_i_y - pj_y_1;
r2_0 += t_0 * t_0;
r2_1 += t_1 * t_1;
t_0 = p_i_z - pj_z_0;
t_1 = p_i_z - pj_z_1;
r2_0 += t_0 * t_0;
r2_1 += t_1 * t_1;
#if NAMD_ComputeNonbonded_SortAtoms != 0 && ( 0 PAIR ( + 1 ) )
sortEntry0 = sortValues + g;
sortEntry1 = sortValues + g + 1;
jprev0 = sortEntry0->index;
jprev1 = sortEntry1->index;
#else
jprev0 = glist[g ];
jprev1 = glist[g+1];
#endif
pj_x_0 = p_1[jprev0].position.x;
pj_x_1 = p_1[jprev1].position.x;
pj_y_0 = p_1[jprev0].position.y;
pj_y_1 = p_1[jprev1].position.y;
pj_z_0 = p_1[jprev0].position.z;
pj_z_1 = p_1[jprev1].position.z;
// want to use sse to compare those
bool test0 = ( r2_0 < groupplcutoff2 );
bool test1 = ( r2_1 < groupplcutoff2 );
//removing ifs benefits on many architectures
//as the extra stores will only warm the cache up
goodglist [ hu ] = j0;
goodglist [ hu + test0 ] = j1;
hu += test0 + test1;
And I am trying to rewrite it with SSE.
You're asking how to compare upper halves after already having compared the lower halves.
The SIMD way to do compares is with a packed compare instruction, like __m128d _mm_cmplt_pd(__m128d a, __m128d b), which produces a mask as an output instead of setting flags. AVX has an improved vcmppd / vcmpps with a wider choice of compare operators, which you pass as a 3rd arg: _mm_cmp_pd(__m128d a, __m128d b, const int imm8).
const __m128d groupplcutoff2_vec = _mm_set1_pd(groupplcutoff2);
// compilers typically emit SSE3 movddup for this, like _mm_movedup_pd() would.
__m128d r2 = ...;
// bool test0 = ( r2_0 < groupplcutoff2 );
// bool test1 = ( r2_1 < groupplcutoff2 );
__m128d ltvec = _mm_cmplt_pd(r2, groupplcutoff2_vec);
int ltmask = _mm_movemask_pd(ltvec);
bool test0 = ltmask & 1;
// bool test1 = ltmask & 2;
// assuming j is double. I'm not sure from your code, it might be int.
// and you're right, doing both stores unconditionally is prob. fastest, if your code isn't heavy on stores.
// goodglist [ hu ] = j0;
_mm_store_sd (&goodglist[hu], j);
// goodglist [ hu + test0 ] = j1;
_mm_storeh_pd(&goodglist[hu + test0], j);
// don't try to use non-AVX _mm_maskmoveu_si128, it's like movnt. And doesn't do exactly what this needs, anyway, without shuffling j and ltvec.
// hu += test0 + test1;
hu += _mm_popcnt_u32(ltmask); // Nehalem or later. Check the popcnt CPUID flag
The popcnt trick will work just as efficiently with AVX (4 doubles packed in a ymm register). Packed-compare -> movemask plus a little bit manipulation is a useful trick to keep in mind; a sketch follows below.
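For example, a hypothetical AVX version of the compare-and-count step might look like this (names are illustrative, not from the code above):
#include <immintrin.h>

// Packed compare of 4 doubles, movemask, popcnt (requires AVX + POPCNT).
static inline int count_below_cutoff(__m256d r2, __m256d cutoff)
{
    __m256d lt = _mm256_cmp_pd(r2, cutoff, _CMP_LT_OQ); // per-lane r2 < cutoff
    int mask = _mm256_movemask_pd(lt);                  // one bit per lane
    return _mm_popcnt_u32((unsigned)mask);              // number of lanes that passed
}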
I have the following kernel, vectorized for arrays of integers:
long valor = 0, i = 0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi32(0);
int32_t * const pA = A->data;
int32_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-3; i+=4){
    vecPi = _mm_loadu_si128((__m128i *)&(pA)[i]);
    vecCi = _mm_loadu_si128((__m128i *)&(pB)[i]);
    vecQCi = _mm_mullo_epi32(vecPi, vecCi);
    vsum = _mm_add_epi32(vsum, vecQCi);
}
vsum = _mm_hadd_epi32(vsum, vsum);
vsum = _mm_hadd_epi32(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
    valor += A->data[i] * B->data[i];
valor += sumDot[0];
and it works fine. However, if I change the datatype of A and B to short instead of int, shouldn't I use the following code:
long valor = 0, i = 0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi16(0);
int16_t * const pA = A->data;
int16_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-7; i+=8){
    vecPi = _mm_loadu_si128((__m128i *)&(pA)[i]);
    vecCi = _mm_loadu_si128((__m128i *)&(pB)[i]);
    vecQCi = _mm_mullo_epi16(vecPi, vecCi);
    vsum = _mm_add_epi16(vsum, vecQCi);
}
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
    valor += A->data[i] * B->data[i];
valor += sumDot[0];
This second kernel doesn't work and I don't know why. I know that all the entries of the vectors are the same in both cases (and there is no overflow). Can someone help me find the mistake?
Thanks
Here are a few things I see.
In both the int and the short case, when you store the __m128i to sumDot, you use _mm_storeu_si128 on a target that is much smaller than 128 bits. This means you've been corrupting memory and were lucky you were not bitten.
Related to this: because sumDot is an int[1] even in the short case, you were storing two shorts in one int and then reading it back as an int.
In the short case you're also missing one horizontal vector reduction step. Remember that now that you have 8 shorts per vector, you need log_2(8) = 3 vector reduction steps:
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
(Optional) Since you're on SSE4.1 already, you might as well use one of the goodies it has: the PEXTR* instructions. They take the index of the lane from which to extract. You're interested in the bottom lane (lane 0), because that's where the sum ends up after your vector reduction.
/* 32-bit */
sumDot[0] = _mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = _mm_extract_epi16(vsum, 0);
EDIT: Apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16. You must convince it to do so yourself.
/* 32-bit */
sumDot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
EDIT2: I found an even BETTER solution! It uses exactly the instruction we need (PMADDWD) and is identical to the 32-bit code, except that the iteration bounds are different and you use _mm_madd_epi16 instead of _mm_mullo_epi16 in the loop. This only needs two 32-bit vector reduction stages. http://pastebin.com/A9ibkMwP (a sketch of the idea follows below)
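In case the pastebin disappears, here is a sketch of that idea, reusing the names from the question (pA/pB pointing to int16_t data):
// _mm_madd_epi16 multiplies pairs of 16-bit elements and adds adjacent
// products into 32-bit lanes, so the accumulator is already 32-bit wide.
__m128i vsum = _mm_setzero_si128();
long i = 0;
for( ; i < SIZE-7; i += 8){
    __m128i vecPi = _mm_loadu_si128((__m128i *)&pA[i]);
    __m128i vecCi = _mm_loadu_si128((__m128i *)&pB[i]);
    vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vecPi, vecCi));
}
vsum = _mm_hadd_epi32(vsum, vsum); // two 32-bit reduction stages, as in the int version
vsum = _mm_hadd_epi32(vsum, vsum);
sumDot[0] = _mm_cvtsi128_si32(vsum);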
(Optional) It is good style but will make no difference to use the _mm_setzero_*() functions instead of _mm_set1_*(0).