I have a fitness function that scores the values in an int array based on data stored in a 4D array. The profiler says this function is using 80% of CPU time (it needs to be called several million times). I can't seem to optimize it further (if that's even possible). Here is the function:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input) {
register unsigned int i, score = 0;
for(i = len - 3; i--; )
score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
return(score);
}
I've tried to flatten the array to a single dimension but there was no improvement in performance. This is running on an IA32 CPU. Any CPU specific optimizations are also helpful.
Thanks
What is the range of the array items? If you can change the array base type to unsigned short or unsigned char, you might get fewer cache misses because a larger portion of the array fits into the cache.
Most of your time probably goes into cache misses. If you can optimize those away, you can get a big performance boost.
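As a rough sketch of the narrower-type idea: the table has 26^4 = 456,976 entries, so it takes about 1.8 MB with 4-byte entries and about 0.9 MB with 2-byte entries. The names lookup16 and get_i_score16 are made up for this illustration, and it assumes every score fits in 16 bits and every input symbol is in 0..25.

```c
#include <stddef.h>
#include <stdint.h>

enum { ALPHABET = 26 };

/* Hypothetical narrowed table: uint16_t entries instead of unsigned int.
   Only valid if every score fits in 0..65535. */
static uint16_t lookup16[ALPHABET][ALPHABET][ALPHABET][ALPHABET];

/* Same sliding-window sum as get_i_score, over byte-sized input symbols. */
unsigned int get_i_score16(const unsigned char *input, size_t len)
{
    unsigned int score = 0;
    size_t i;

    if (len < 4)
        return 0;
    for (i = 0; i + 3 < len; i++)
        score += lookup16[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}
```

Whether this helps depends on how much of the table the inputs actually touch and on the cache sizes of the target CPU, so it needs to be measured.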
Remember that C/C++ arrays are stored in row-major order, and store your data so that addresses referenced close together in time are also close together in memory. For example, it may make sense to store sub-results in a temporary array. Then you could process exactly one row of sequentially located elements at a time. That way the processor cache will hold the row throughout the iterations and fewer memory operations will be required. However, you might need to restructure your lookup_array access, maybe even split it into four steps (one per dimension of the array).
The problem is definitely related to the size of the matrix. Declaring it as a single flat array will not help, because that is how the compiler already lays it out.
Everything depends on the order in which you access the data, that is, on the contents of the input array.
The only thing you can do is work on locality: read this one, it should give you some inspiration.
By the way, I suggest replacing the input array with four parameters: it will be more intuitive and less error-prone.
Good luck
A few suggestions to improve performance:
Parallelise. This is a very easy reduction to be programmed in OpenMP or MPI.
Reorder data to improve locality. Try sorting input first, for example.
Use streaming processing instructions if the compiler is not already doing it.
About reordering, it would be possible if you flatten the array and use linear coordinates instead.
Another point, compare the theoretical peak performance of your processor (integer operations) with the performance you're getting (do a quick count of the assembly generated instructions, multiply by the length of the input, etc.) and see if there's room for a significant improvement there.
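The reduction mentioned above might look roughly like this with OpenMP. This is only a sketch: get_i_score_omp is a made-up name, the table is a stand-in for the question's global, and len is passed as a parameter instead of being a global.

```c
/* Stand-in for the question's global table. */
static unsigned int lookup_array[26][26][26][26];

/* Sliding-window score with an OpenMP sum reduction.
   Build with OpenMP enabled (e.g. -fopenmp on GCC/Clang); without it the
   pragma is ignored and the loop simply runs serially. */
unsigned int get_i_score_omp(const unsigned int *input, long len)
{
    unsigned int score = 0;
    long i;

    #pragma omp parallel for reduction(+:score)
    for (i = 0; i < len - 3; i++)
        score += lookup_array[input[i]][input[i + 1]][input[i + 2]][input[i + 3]];
    return score;
}
```

Since the function is called several million times, starting a parallel region on every call may cost more than it saves for short inputs. Parallelising the loop that makes those calls is probably the better place to start.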
I have a couple of suggestions:
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input, unsigned int len) {
register unsigned int i, score = 0;
unsigned int *a=input;
unsigned int *b=input+1;
unsigned int *c=input+2;
unsigned int *d=input+3;
for(i = 0; i < (len - 3); i++, a++, b++, c++, d++)
score += lookup_array[*a][*b][*c][*d];
return(score);
}
Or try
for(i = 0; i < (len - 3); i++, a=b, b=c, c=d, d++)
score += lookup_array[*a][*b][*c][*d];
Also, given that there are only 26 values, why are you putting the input array in terms of unsigned ints? If it were char *input, you'd be using 1/4 as much memory and therefore using 1/4 of the memory bandwidth. Obviously the types of a through d have to match. Similarly, if the score values don't need to be unsigned ints, make the array smaller by using chars or uint16_t.
You might be able to squeeze a bit more out by unrolling the loop in some variation of Duff's device.
Multidimensional arrays often force the compiler to emit one or more multiply operations per access, which may be slow on some CPUs. A common workaround is to transform the N-dimensional array into an array of pointers to elements of (N-1) dimensions. With a 4-dimensional array this is quite annoying (26 pointers to 26*26 pointers to 26*26*26 rows...), but I suggest trying it and comparing the results. It is not guaranteed to be faster: compilers are quite smart at optimizing array accesses, while a chain of indirect accesses is more likely to cause cache misses.
Bye
If lookup_array is mostly zeroes, it could definitely be replaced with a hash table lookup on a smaller array. The inline lookup function could calculate the offset from the four indices (for the zero-based indices [4][5][6][7]: (4*26*26*26)+(5*26*26)+(6*26)+7 = 73847). The hash key could just be the lower few bits of the offset (depending on how sparse the array is expected to be). If the offset exists in the hash table, use the value; if it doesn't exist, the value is 0.
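A minimal sketch of such a sparse lookup, assuming open addressing with linear probing and a table of 4096 slots. The table size and the names offset4, sparse_init, sparse_put and sparse_get are all choices made for this illustration.

```c
#include <stdint.h>

#define HASH_BITS 12                      /* assumed table size: 4096 slots */
#define HASH_SIZE (1u << HASH_BITS)
#define EMPTY_KEY 0xFFFFFFFFu             /* no valid offset is this large */

struct slot { uint32_t key; uint32_t value; };
static struct slot table[HASH_SIZE];

/* Linear offset of [a][b][c][d] in a flattened 26^4 array. */
uint32_t offset4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((a * 26u + b) * 26u + c) * 26u + d;
}

/* Mark every slot as unused. */
void sparse_init(void)
{
    uint32_t i;
    for (i = 0; i < HASH_SIZE; i++)
        table[i].key = EMPTY_KEY;
}

/* Store a non-zero score. Assumes fewer than HASH_SIZE entries are stored. */
void sparse_put(uint32_t key, uint32_t value)
{
    uint32_t h = key & (HASH_SIZE - 1);   /* low bits of the offset */
    while (table[h].key != EMPTY_KEY && table[h].key != key)
        h = (h + 1) & (HASH_SIZE - 1);    /* linear probing */
    table[h].key = key;
    table[h].value = value;
}

/* Missing offsets score 0, as in the mostly-zero dense array. */
uint32_t sparse_get(uint32_t key)
{
    uint32_t h = key & (HASH_SIZE - 1);
    while (table[h].key != EMPTY_KEY) {
        if (table[h].key == key)
            return table[h].value;
        h = (h + 1) & (HASH_SIZE - 1);
    }
    return 0;
}
```

The probing loop adds branches and its own memory accesses, so this only pays off if the table ends up small enough to stay in cache. It would need to be measured against the dense array.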
the loop could also be unrolled using something like this if the input has arbitrary length. there are only len accesses of input needed (instead of around len * 4 in the original loop).
register int j, x1, x2, x3, x4;
register unsigned int *p;
p = input;
x1 = *p++;
x2 = *p++;
x3 = *p++;
for (j = (len - 3) / 20; j--; ) {
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = *p++;
score += lookup_array[x2][x3][x4][x1];
x2 = *p++;
score += lookup_array[x3][x4][x1][x2];
x3 = *p++;
score += lookup_array[x4][x1][x2][x3];
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = *p++;
score += lookup_array[x2][x3][x4][x1];
x2 = *p++;
score += lookup_array[x3][x4][x1][x2];
x3 = *p++;
score += lookup_array[x4][x1][x2][x3];
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = *p++;
score += lookup_array[x2][x3][x4][x1];
x2 = *p++;
score += lookup_array[x3][x4][x1][x2];
x3 = *p++;
score += lookup_array[x4][x1][x2][x3];
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = *p++;
score += lookup_array[x2][x3][x4][x1];
x2 = *p++;
score += lookup_array[x3][x4][x1][x2];
x3 = *p++;
score += lookup_array[x4][x1][x2][x3];
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = *p++;
score += lookup_array[x2][x3][x4][x1];
x2 = *p++;
score += lookup_array[x3][x4][x1][x2];
x3 = *p++;
score += lookup_array[x4][x1][x2][x3];
/* that's 20 iterations, add more if you like */
}
for (j = (len - 3) % 20; j--; ) {
x4 = *p++;
score += lookup_array[x1][x2][x3][x4];
x1 = x2;
x2 = x3;
x3 = x4;
}
If you convert it to a flat array of size 26*26*26*26, you only need to lookup the input array once per loop:
unsigned int get_i_score(unsigned int *input)
{
unsigned int i = len - 3, score = 0, index;
index = input[i] * 26 * 26 +
input[i + 1] * 26 +
input[i + 2];
while (i--) /* post-decrement, so the window starting at i == 0 is scored too */
{
index += input[i] * 26 * 26 * 26;
score += lookup_array[index];
index /= 26 ;
}
return score;
}
The additional cost is a multiplication and a division. Whether it ends up being faster in practice - you'll have to test.
(By the way, the register keyword is often ignored by modern compilers - it's usually better to leave register allocation up to the optimiser).
Does the content of the array change much? Perhaps it would be faster to pre-calculate the score, and then modify that pre-calculated score every time the array changes? This is similar to how you can materialize a view in SQL using triggers.
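A sketch of that incremental idea, assuming the input changes one element at a time: position k appears only in the windows starting at k-3 through k, so at most four table entries need to be subtracted and four added back. The helper names window, full_score and update_score are invented for this example.

```c
#include <stddef.h>

/* Stand-in for the question's global table. */
static unsigned int lookup_array[26][26][26][26];

/* Score of the single window starting at position i. */
static unsigned int window(const unsigned int *in, size_t i)
{
    return lookup_array[in[i]][in[i + 1]][in[i + 2]][in[i + 3]];
}

/* Full recomputation, for reference. */
unsigned int full_score(const unsigned int *in, size_t len)
{
    unsigned int score = 0;
    size_t i;

    for (i = 0; i + 3 < len; i++)
        score += window(in, i);
    return score;
}

/* Set in[k] = value and return the updated score.
   Only windows starting at k-3..k contain position k, so at most
   four terms are subtracted and four are added back. */
unsigned int update_score(unsigned int score, unsigned int *in, size_t len,
                          size_t k, unsigned int value)
{
    size_t first = (k >= 3) ? k - 3 : 0;
    size_t i;

    if (len < 4) {          /* no complete window exists */
        in[k] = value;
        return 0;
    }
    for (i = first; i <= k && i + 3 < len; i++)
        score -= window(in, i);
    in[k] = value;
    for (i = first; i <= k && i + 3 < len; i++)
        score += window(in, i);
    return score;
}
```

This replaces roughly len lookups per evaluation with at most eight per changed element, which may matter a lot if the fitness function is called after small mutations.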
Maybe you can eliminate some accesses to the input array by using local variables.
unsigned int lookup_array[26][26][26][26]; /* lookup_array is a global variable */
unsigned int get_i_score(unsigned int *input, unsigned int len) {
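unsigned int i, score, a, b, c, d;
score = 0;
i = len - 3; /* index of the first of the last three elements; assumes len >= 3 */
a = input[i + 0];
b = input[i + 1];
c = input[i + 2];
/* d is assigned inside the loop before its first use */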
for (i = len - 3; i-- > 0; ) {
d = c, c = b, b = a, a = input[i];
score += lookup_array[a][b][c][d];
}
return score;
}
Moving around registers may be faster than accessing memory, although this kind of memory should remain in the innermost cache anyway.
Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about using omp parallel, but then I saw that this function is called inside a for loop, so creating threads that many times would not be very efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
unsigned char resultado;
unsigned char intermedio;
intermedio = + x*(3.0*(p[1] - p[2]) + p[3] - p[0]);
resultado = p[1] + 0.5 * x *(p[2] - p[0] + x*(2.0*p[0] - 5.0*p[1] + 4.0*p[2] - p[3] + x*(3.0*(p[1] - p[2]) + p[3] - p[0])));
return resultado;
}
unsigned char bicubicInterpolate_paralelo (unsigned char p[4][4], unsigned char x, unsigned char y) {
unsigned char arr[4],valorPixelCanal;
arr[0] = cubicInterpolate_paralelo(p[0], y);
arr[1] = cubicInterpolate_paralelo(p[1], y);
arr[2] = cubicInterpolate_paralelo(p[2], y);
arr[3] = cubicInterpolate_paralelo(p[3], y);
valorPixelCanal = cubicInterpolate_paralelo(arr, x);
return valorPixelCanal;
}
this is used inside some nested for:
for(i=0; i<z_img.width(); i++) {
for(j=0; j<z_img.height(); j++) {
//For R,G,B
for(c=0; c<3; c++) {
for(l=0; l<4; l++){
for(k=0; k<4; k++){
arr[l][k] = img(i/zFactor +l, j/zFactor +k, 0, c);
}
}
color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i%zFactor)/zFactor, (unsigned char)(j%zFactor)/zFactor);
}
z_img.draw_point(i,j,color);
}
}
I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16 bit intermediates that I use should be wide enough, the only way for information from the high bits to affect low bits in this code is the right shift by 1 (0.5 * in your code), so really we only need 9 bits, the rest cannot affect the result. Bytes wouldn't be wide enough (unless you have some extra guarantees that I don't know about), but they would be annoying anyway because there is no nice way to multiply them.
I pretended for simplicity that the input takes the form of contiguous arrays of x's, p[0]'s etc, that's not what you need here but I didn't have time to work out all the loading and shuffling.
SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for-loops. Just don't use them inside. Instead, take the for(i=0; i<z_img.width(); i++) { outer loop and split it in 4 bands of width/4. Thread 0 gets 0..width/4, thread 1 gets width/4..width/2 etc.
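The band split could be computed along these lines. This is a sketch only: struct band and band_for_thread are names invented here, and the thread creation itself is left out.

```c
/* Half-open column range [begin, end) handled by one thread. */
struct band { int begin; int end; };

/* Split [0, width) into nthreads contiguous bands; the first
   (width % nthreads) bands get one extra column. */
struct band band_for_thread(int width, int nthreads, int t)
{
    int base = width / nthreads;
    int extra = width % nthreads;
    struct band b;

    b.begin = t * base + (t < extra ? t : extra);
    b.end = b.begin + base + (t < extra ? 1 : 0);
    return b;
}
```

Each thread would then run the outer loop from its own begin to end, which also covers widths that are not a multiple of the thread count.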
On an unrelated note your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.
Using OpenMP, you could try adding #pragma omp parallel for to the outermost for loop. This should solve your problem.
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.
I'm trying to use Intel Intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment the indicated line below out, the program runs. If I save the result of the indicated line, but do not manipulate it in any way, the program runs fine. It is only when I try to (in any way) interact with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?
float proc(float *a, float *b, int n, int c, int width) {
// Operation: SUM: (A - B) ^ 2
__m128 A, B, C;
float total = 0;
for (int d = 0, k = 0; k < c; d += width, k++) {
for (int i = 0; i < n / 4 * 4; i += 4) {
A = _mm_load_ps(&a[i + d]);
B = _mm_load_ps(&b[i + d]);
C = _mm_sub_ps(A, B);
C = _mm_mul_ps(C, C);
C = _mm_hadd_ps(C, C);
C = _mm_hadd_ps(C, C);
total += _mm_cvtss_f32(C); // SEGFAULT HERE
}
for (int i = n / 4 * 4; i < n; i++) {
float diff = a[i + d] - b[i + d]; /* float, so the fractional part is not truncated */
total += diff * diff;
}
}
return total;
}
Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the _mm_cvtss_f32() line (it doesn't have any other visible side effects)? Potential failure causes would be improper alignment of the a and b arrays since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
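One way to check the alignment question is sketched below. The helper names is_aligned16 and alloc_floats16 are hypothetical, and the allocator assumes a C11 library that provides aligned_alloc.

```c
#include <stdint.h>
#include <stdlib.h>

/* Non-zero if p is 16-byte aligned, as _mm_load_ps requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocate n floats on a 16-byte boundary using C11 aligned_alloc.
   aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release with free(). */
float *alloc_floats16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}
```

Note that in the question's loop the load address is &a[i + d], so even with an aligned base pointer the loads are only aligned when d (and therefore width) is a multiple of 4 floats. If that cannot be guaranteed, _mm_loadu_ps is the unaligned counterpart of _mm_load_ps.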
I mentioned in my original comment that movaps has a shorter encoding than movups. This is not correct. I was thinking instead of movaps versus movapd, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps has a shorter encoding.
I'm trying to mix together two 16-bit linear PCM audio streams and I can't seem to overcome the noise issues. I think they are coming from overflow when mixing samples together.
I have the following function ...
short int mix_sample(short int sample1, short int sample2)
{
return #mixing_algorithm#;
}
... and here's what I have tried as #mixing_algorithm#
sample1/2 + sample2/2
2*(sample1 + sample2) - 2*(sample1*sample2) - 65535
(sample1 + sample2) - sample1*sample2
(sample1 + sample2) - sample1*sample2 - 65535
(sample1 + sample2) - ((sample1*sample2) >> 0x10) // same as divide by 65536
Some of them have produced better results than others but even the best result contained quite a lot of noise.
Any ideas how to solve it?
The best solution I have found is given by Viktor Toth. He provides a solution for 8-bit unsigned PCM, and changing that for 16-bit signed PCM, produces this:
int a = 111; // first sample (-32768..32767)
int b = 222; // second sample
int m; // mixed result will go here
// Make both samples unsigned (0..65535)
a += 32768;
b += 32768;
// Pick the equation
if ((a < 32768) && (b < 32768)) {
// Viktor's first equation when both sources are "quiet"
// (i.e. less than middle of the dynamic range)
m = a * b / 32768;
} else {
// Viktor's second equation when one or both sources are loud;
// the product can exceed the range of a 32-bit int, so widen it first
m = (int)(2 * (a + b) - (long long)a * b / 32768 - 65536);
}
// Output is unsigned (0..65536) so convert back to signed (-32768..32767)
if (m == 65536) m = 65535;
m -= 32768;
Using this algorithm means there is almost no need to clip the output as it is only one value short of being within range. Unlike straight averaging, the volume of one source is not reduced even when the other source is silent.
here's a descriptive implementation:
short int mix_sample(short int sample1, short int sample2) {
const int32_t result(static_cast<int32_t>(sample1) + static_cast<int32_t>(sample2));
typedef std::numeric_limits<short int> Range;
if (Range::max() < result)
return Range::max();
else if (Range::min() > result)
return Range::min();
else
return result;
}
to mix, it's just add and clip!
to avoid clipping artifacts, you will want to use saturation or a limiter. ideally, you will have a small int32_t buffer with a small amount of lookahead. this will introduce latency.
more common than limiting everywhere, is to leave a few bits' worth of 'headroom' in your signal.
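A small C sketch of the headroom idea, assuming each source is simply attenuated by a power of two before summing. The function name mix_with_headroom and the headroom_bits parameter are made up for this example.

```c
#include <stdint.h>

/* Mix two 16-bit samples, leaving headroom_bits of headroom per source.
   Each source is divided by 2^headroom_bits before summing; the clamp
   only matters when the headroom is too small for the two sources
   (for example headroom_bits == 0). */
int16_t mix_with_headroom(int16_t a, int16_t b, int headroom_bits)
{
    int32_t div = (int32_t)1 << headroom_bits;
    /* division rather than >>, so negative samples behave portably */
    int32_t sum = a / div + b / div;

    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

With one bit of headroom, two sources can never clip. The price is that each source is 6 dB quieter even when the other one is silent.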
Here is what I did on my recent synthesizer project.
int* unfiltered = (int *)malloc(lengthOfLongPcmInShorts*4);
int i;
for(i = 0; i < lengthOfShortPcmInShorts; i++){
unfiltered[i] = shortPcm[i] + longPcm[i];
}
for(; i < lengthOfLongPcmInShorts; i++){
unfiltered[i] = longPcm[i];
}
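int max = 0;
for(int i = 0; i < lengthOfLongPcmInShorts; i++){
int val = abs(unfiltered[i]);
if(val > max)
max = val; /* track the largest magnitude, not the signed value */
}
if(max == 0)
max = 1; /* all-silent input: avoid dividing by zero */
short int *newPcm = (short int *)malloc(lengthOfLongPcmInShorts*2);
for(int i = 0; i < lengthOfLongPcmInShorts; i++){
/* multiply before dividing: unfiltered[i]/max alone truncates to 0 or +/-1 */
newPcm[i] = (short int)((long long)unfiltered[i] * SHRT_MAX / max); /* SHRT_MAX from <limits.h> */
}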
I added all the PCM data into an integer array, so that I get all the data unfiltered.
After doing that I looked for the absolute max value in the integer array.
Finally, I took the integer array and put it into a short int array by taking each element dividing by that max value and then multiplying by the max short int value.
This way you get the minimum amount of 'headroom' needed to fit the data.
You might be able to do some statistics on the integer array and integrate some clipping, but for what I needed the minimum amount of headroom was good enough for me.
There's a discussion here: https://dsp.stackexchange.com/questions/3581/algorithms-to-mix-audio-signals-without-clipping about why the A+B - A*B solution is not ideal. Hidden down in one of the comments on this discussion is the suggestion to sum the values and divide by the square root of the number of signals. And an additional check for clipping couldn't hurt. This seems like a reasonable (simple and fast) middle ground.
I think they should be functions mapping [MIN_SHORT, MAX_SHORT] -> [MIN_SHORT, MAX_SHORT], and they clearly are not (besides the first one), so overflow occurs.
If unwind's proposition won't work you can also try:
((long int)(sample1) + sample2) / 2
Since you are in time domain the frequency info is in the difference between successive samples, when you divide by two you damage that information. That's why adding and clipping works better. Clipping will of course add very high frequency noise which is probably filtered out.
The difference operator, (similar to the derivative operator), and the sum operator, (similar to the integration operator), can be used to change an algorithm because they are inverses.
Sum of (difference of y) = y
Difference of (sum of y) = y
An example of using them that way in a c program is below.
This c program demonstrates three approaches to making an array of squares.
The first approach is the simple obvious approach, y = x*x .
The second approach uses the equation (difference in y) = (x0 + x1)*(difference in x) .
The third approach is the reverse and uses the equation (sum of y) = x(x+1)(2x+1)/6 .
The second approach is consistently slightly faster than the first one, even though I haven't bothered optimizing it. I imagine that if I tried harder I could make it even better.
The third approach is consistently twice as slow, but this doesn't mean the basic idea is dumb. I could imagine that for some function other than y = x*x this approach might be faster. Also there is an integer overflow issue.
Trying out all these transformations was very interesting, so now I want to know what are some other pairs of mathematical operators I could use to transform the algorithm?
Here is the code:
#include <stdio.h>
#include <time.h>
#define tries 201
#define loops 100000
void printAllIn(unsigned int array[tries]){
unsigned int index;
for (index = 0; index < tries; ++index)
printf("%u\n", array[index]);
}
int main (int argc, const char * argv[]) {
/*
Goal: calculate an array of squares from 0 to tries - 1 as fast as possible
*/
long unsigned int obvious[tries];
long unsigned int sum_of_differences[tries];
long unsigned int difference_of_sums[tries];
clock_t time_of_obvious1;
clock_t time_of_obvious0;
clock_t time_of_sum_of_differences1;
clock_t time_of_sum_of_differences0;
clock_t time_of_difference_of_sums1;
clock_t time_of_difference_of_sums0;
long unsigned int j;
long unsigned int index;
long unsigned int sum1;
long unsigned int sum0;
long signed int signed_index;
time_of_obvious0 = clock();
for (j = 0; j < loops; ++j)
for (index = 0; index < tries; ++index)
obvious[index] = index*index;
time_of_obvious1 = clock();
time_of_sum_of_differences0 = clock();
for (j = 0; j < loops; ++j)
for (index = 1, sum_of_differences[0] = 0; index < tries; ++index)
sum_of_differences[index] = sum_of_differences[index-1] + 2 * index - 1;
time_of_sum_of_differences1 = clock();
time_of_difference_of_sums0 = clock();
for (j = 0; j < loops; ++j)
for (signed_index = 0, sum0 = 0; signed_index < tries; ++signed_index) {
sum1 = signed_index*(signed_index+1)*(2*signed_index+1);
difference_of_sums[signed_index] = (sum1 - sum0)/6;
sum0 = sum1;
}
time_of_difference_of_sums1 = clock();
// printAllIn(obvious);
printf(
"The obvious approach y = x*x took, %f seconds\n",
((double)(time_of_obvious1 - time_of_obvious0))/CLOCKS_PER_SEC
);
// printAllIn(sum_of_differences);
printf(
"The sum of differences approach y1 = y0 + 2x - 1 took, %f seconds\n",
((double)(time_of_sum_of_differences1 - time_of_sum_of_differences0))/CLOCKS_PER_SEC
);
// printAllIn(difference_of_sums);
printf(
"The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, %f seconds\n",
(double)(time_of_difference_of_sums1 - time_of_difference_of_sums0)/CLOCKS_PER_SEC
);
return 0;
}
There are two classes of optimizations here: strength reduction and peephole optimizations.
Strength reduction is the usual term for replacing "expensive" mathematical functions with cheaper functions -- say, replacing a multiplication with two logarithm table lookups, an addition, and then an inverse logarithm lookup to find the final result.
Peephole optimization is the usual term for replacing something like multiplication by a power of two with a left shift. Some CPUs have simple instructions for these operations that run faster than generic integer multiplication for the specific case of multiplying by powers of two.
You can also perform optimizations of individual algorithms. You might write a * b, but there are many different ways to perform multiplication, and different algorithms perform better or worse under different conditions. Many of these decisions are made by the chip designers, but arbitrary-precision integer libraries make their own choices based on the merits of the primitives available to them.
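Two small illustrations of the kind of rewrites described above, written by hand in C. The function names times8 and fill_multiples are invented for this sketch, and a modern compiler will often perform both rewrites on its own.

```c
#include <stddef.h>

/* Multiply by 8 written as a shift; for unsigned values the two forms
   give the same result. */
unsigned int times8(unsigned int x)
{
    return x << 3;
}

/* Fill out[i] = i * stride without a multiplication in the loop:
   the product is carried along as a running sum. */
void fill_multiples(unsigned int *out, size_t n, unsigned int stride)
{
    unsigned int acc = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        out[i] = acc;
        acc += stride;
    }
}
```

The second function follows the same pattern as the question's sum-of-differences loop: an expensive per-element expression is replaced by a cheap update of the previous result.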
When I tried to compile your code on Ubuntu 10.04, I got a segmentation fault right when main() started because you are declaring many megabytes worth of variables on the stack. I was able to compile it after I moved most of your variables outside of main to make them be global variables.
Then I got these results:
The obvious approach y = x*x took, 0.000000 seconds
The sum of differences approach y1 = y0 + 2x - 1 took, 0.020000 seconds
The difference of sums approach y = sum1 - sum0, sum = (x - 1)x(2(x - 1) + 1)/6 took, 0.000000 seconds
The program runs so fast it's hard to believe it really did anything. I put the "-O0" option in to disable optimizations but it's possible GCC still might have optimized out all of the computations. So I tried adding the "volatile" qualifier to your arrays but still got similar results.
That's where I stopped working on it. In conclusion, I don't really know what's going on with your code but it's quite possible that something is wrong.
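One common way to make sure a timed loop has an observable effect is to route a result through a volatile object, roughly as sketched here. The names sink and timed_squares and the loop counts are assumptions for this example, not taken from the question.

```c
#include <time.h>

enum { TRIES = 201, LOOPS = 1000 };

/* Writes to a volatile object count as observable behaviour,
   so the compiler may not remove them. */
static volatile unsigned long sink;

/* Time the y = x*x loop and return a checksum of the last pass.
   elapsed receives the CPU time in seconds as measured by clock(). */
unsigned long timed_squares(double *elapsed)
{
    unsigned long squares[TRIES];
    unsigned long sum = 0;
    unsigned long j, x;
    clock_t t0 = clock();

    for (j = 0; j < LOOPS; ++j) {
        for (x = 0; x < TRIES; ++x)
            squares[x] = x * x;
        sink = squares[j % TRIES];   /* observable use on every pass */
    }
    *elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;

    for (x = 0; x < TRIES; ++x)
        sum += squares[x];
    return sum;
}
```

This keeps the stores to sink alive, but it does not stop an optimiser from noticing that every pass computes the same values. Also, clock() often has a coarse resolution, so very short runs can legitimately report 0.000000 seconds.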
In the below code, is there a way to avoid the if statement?
s = 13; /*Total size*/
b = 5; /*Block size*/
x = 0;
b1 = b;
while(x < s)
{
if(x + b > s)
b1 = s-x;
SendData(x, b1); /*SendData(offset,length);*/
x += b1;
}
Thanks much!
I don't know, maybe you'll think this:
s = 13; /*Total size*/
b = 5; /*Block size*/
x = 0;
while(x + b < s)
{
SendData(x, b); /*SendData(offset,length);*/
x += b;
}
SendData(x, s - x); /* remaining bytes; s%b would be 0 when b divides s evenly */
is better?
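A minimal sketch of the loop-then-tail idea, with a recording stub in place of SendData so the calls can be checked. The tail length is computed as s - x rather than s % b, because s % b is 0 when the block size divides the total evenly. The stub, its arrays and send_all are names made up for this illustration.

```c
enum { MAX_CALLS = 16 };

/* Test double standing in for the question's SendData: records each call. */
static int call_offset[MAX_CALLS];
static int call_length[MAX_CALLS];
static int call_count;

static void SendData(int offset, int length)
{
    if (call_count < MAX_CALLS) {
        call_offset[call_count] = offset;
        call_length[call_count] = length;
    }
    call_count++;
}

/* Send s bytes in blocks of b: full blocks in the loop, remainder after it. */
void send_all(int s, int b)
{
    int x = 0;

    while (x + b < s) {
        SendData(x, b);
        x += b;
    }
    if (x < s)                /* only false when s == 0 */
        SendData(x, s - x);   /* between 1 and b remaining bytes */
}
```

The test moves out of the loop, so it runs once per transfer instead of once per block.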
Don't waste your time on pointless micro-optimizations your compiler probably does for you anyway.
Program for the programmer; not the computer. Compilers get better and better, but programmers don't.
If it makes your program more readable (@PaulPRO's answer), then do it. Otherwise, don't.
You can use a conditional move or branchless integer select to assign b1 without an if-statement:
// if a >= 0, return x, else y
// assumes 32-bit processors
inline int isel( int a, int x, int y ) // inlining is important here
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
return x + ((y - x) & mask);
}
// ...
while(x < s)
{
b1 = isel( x + b - s, s-x, b1 );
SendData(x, b1); /*SendData(offset,length);*/
x += b1;
}
This is only a useful optimization on in-order processors, though. It won't make any difference on a modern PC's x86, which has a fast branch and a reorder unit. It might be useful on some embedded systems (like a Playstation), where pipeline latency matters more for performance than instruction count. I've used it to shave a few microseconds in tight loops.
In theory a compiler "should" be able to turn a ternary expression (b = (a > 0 ? x : y)) into a conditional move, but I've never met one that did.
Of course, in a larger sense everyone who says that this is a pointless optimization compared to the cost of SendData() is correct. The difference between a cmov and a branch is about 4 nanoseconds, which is negligible compared to the cost of a network call. Spending your time fixing this branch which happens once per network call is like driving across town to save 1¢ on gasoline.
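For what it's worth, the select can be written without relying on an arithmetic right shift of a negative value, which C leaves implementation-defined. This is a sketch under that assumption, and isel_portable is a name invented here.

```c
#include <stdint.h>

/* If a >= 0 return x, else y, without a branch in the source.
   0 - (a < 0) is 0 or 0xFFFFFFFF, so no shift of a negative value
   is needed. The arithmetic is done on uint32_t so that y - x cannot
   overflow a signed type; the final conversion back to int32_t is
   implementation-defined for large values but behaves as expected
   on two's complement machines. */
int32_t isel_portable(int32_t a, int32_t x, int32_t y)
{
    uint32_t mask = 0u - (uint32_t)(a < 0);
    uint32_t ux = (uint32_t)x;
    uint32_t uy = (uint32_t)y;

    return (int32_t)(ux + ((uy - ux) & mask));
}
```

In the loop above this would be called as b1 = isel_portable(x + b - s, s - x, b1). Whether the generated machine code is actually branch-free still depends on the compiler and target.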
If you try to remove the if(), you might change your logic and then have to spend a lot of time on testing. I see only one potential change:
s = 13;
b = 5;
x = 0;
b1 = b;
while(x < s)
{
const unsigned int total = x + b; // <--- introduce 'total'
if(total > s)
b1 = s-x;
SendData(x, b1);
x = total; // <--- reusing it
}