divide and store quotient and remainder in different arrays - c

The standard div() function returns its result in a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i],&remainders[i]);
}
}
I know that the assembly language for division does everything in a single instruction, so I need to do the same here to save on CPU cycles, which is basically moving the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} blocks or SSE intrinsics in my C code? It has to be portable.

Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder), you should store the results in temporary variables before writing them back to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder produced by the first division and therefore does the division instruction twice, which is unnecessary. To fix this you can save the remainder when the first division is done, like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds the loop up by a factor of two.
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.
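Putting those two points together, here is a minimal sketch (my own illustration, not code from the question or answer) of a complete trial-division primality test that saves the quotient and remainder as soon as they are computed, so only one division is needed per iteration:

#include <stdio.h>

/* Sketch: trial division by odd numbers, reusing the quotient (for the
   loop bound) and the remainder (for the divisibility test) from the
   same division. */
static int is_prime(unsigned n)
{
    if (n < 2) return 0;
    if (n % 2 == 0) return n == 2;
    unsigned p, q = n / 3, r = n % 3;
    for (p = 3; p <= q; p += 2, q = n / p, r = n % p)
        if (r == 0) return 0;
    return 1;
}

int main(void)
{
    printf("%d %d\n", is_prime(97), is_prime(91)); /* expect: 1 0 */
    return 0;
}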

The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS Visual C 2013) does if I just write out the most naive code I can:
#include <stdio.h> /* printf */
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works, so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from the console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.
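Applied to the original array question, a hedged sketch (the function name and signature are mine, not from the question) would simply write the paired / and % results into the two arrays and let the compiler pick the instruction:

#include <stddef.h>

/* Sketch only: most optimizing compilers emit a single division
   instruction per iteration for the adjacent / and %, although the
   C standard does not guarantee it. */
void div_arrays(const unsigned *num, const unsigned *den,
                unsigned *quot, unsigned *rem, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        quot[i] = num[i] / den[i];
        rem[i]  = num[i] % den[i];
    }
}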

Related

C Float with a basic integer value giving different results

Okay, I have a simple question. In my adventure I seek the largest numbers each data type can hold, and I was trying things like long int, double, float, etc.
But with the simplest assignments, such as float t = 123456789, it gives me 123456792 as output.
Here's the code
#include <stdio.h>
int main()
{
int x = 1234567891 ;
long int y = 9034567891234567899;
long long int z = 9034567891234567891;
float t = 123456789 ;
printf("%i \n%li \n%lli \n%f \n ",x,y,z,t);
}
and the output I'm getting is
1234567891
9034567891234567899
9034567891234567891
123456792.000000
I'm coding on Linux and using gcc. What could be the problem?
For clarity, if you give a higher number like
float t = 123456789123456789
it will get the first 9 digits right, but there is some kind of rounding in the last digits where there should not be.
1234567890519087104.000000
I could have understood it if I were working with fractional values like 0.00123, but these are just straight integers, just to find out the limits of float.
As a visual and experiential learner, I would recommend you take a good look at how a floating point number is represented in the world of bits, with a little help from an online converter such as https://www.h-schmidt.net/FloatConverter/IEEE754.html
Value: 123456789
Hexadecimal representation: 0x4ceb79a3
Binary representation: 01001100111010110111100110100011
sign (0) : +1
exponent(10011001) : 2^26
mantissa(11010110111100110100011): 1.8396495580673218
Value actually stored in float: 1.8396495580673218 * 2^26 = 123456792
Error due to conversion: 3
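To reproduce that decomposition programmatically, here is a small sketch of my own (it assumes float is a 32-bit IEEE 754 single-precision type, as on typical gcc targets):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 123456789;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the float's bytes */

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;  /* biased exponent */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 explicit fraction bits */

    printf("hex representation: 0x%08x\n", (unsigned)bits);
    printf("sign: %u, exponent: 2^%d\n", (unsigned)sign, (int)exponent - 127);
    printf("mantissa bits: 0x%06x\n", (unsigned)mantissa);
    printf("value actually stored: %.1f\n", (double)f);
    return 0;
}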
Here is a closer look at how the compiler actually does its job: https://gcc.godbolt.org/z/C4YyKe
int main()
{
float t = 123456789;
}
main:
push rbp
mov rbp, rsp
movss xmm0, DWORD PTR .LC0[rip]
movss DWORD PTR [rbp-4], xmm0
mov eax, 0
pop rbp
ret
.LC0:
.long 1290500515 //(0x4CEB79A3)
For your adventure seeking the largest numbers of each data type, I guess you can explore standard header files such as float.h and limits.h.
To find the largest contiguous integer value that can be round-tripped from integer to float to integer, the following experiment could be used:
#include <stdio.h>
int main()
{
long i = 0 ;
float fint = 0 ;
while( i == (long)fint )
{
i++ ;
fint = (float)i ;
}
printf( "Largest integer representable exactly by float = %ld\n", i - 1 ) ;
return 0;
}
However, the experiment is largely unnecessary, since the value is predictably 2^24: the float mantissa has 23 explicit bits plus an implicit leading 1, giving 24 bits of precision.
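For instance, a small sketch of my own using float.h gives the same limit without looping (FLT_MANT_DIG is the number of significand bits, 24 for IEEE 754 single precision):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Every integer up to 2^FLT_MANT_DIG (2^24 = 16777216 for IEEE 754
       single precision) round-trips exactly through a float. */
    long limit = 1L << FLT_MANT_DIG;
    printf("Largest contiguous integer exactly representable as float = %ld\n", limit);
    return 0;
}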

Why is using a third variable faster than an addition trick?

When computing fibonacci numbers, a common method is mapping the pair of numbers (a, b) to (b, a + b) multiple times. This can usually be done by defining a third variable c and doing a swap. However, I realised you could do the following, avoiding the use of a third integer variable:
b = a + b; // b2 = a1 + b1
a = b - a; // a2 = b2 - a1 = b1, Ta-da!
I expected this to be faster than using a third variable, since in my mind this new method should only have to consider two memory locations.
So I wrote the following C programs comparing the processes. These mimic the calculation of fibonacci numbers, but rest assured I am aware that they will not calculate the correct values due to size limitations.
(Note: I realise now that it was unnecessary to make n a long int, but I will keep it as it is because that is how I first compiled it)
File: PlusMinus.c
// Using the 'b=a+b;a=b-a;' method.
#include <stdio.h>
int main() {
long int n = 1000000; // Number of iterations.
long int a,b;
a = 0; b = 1;
while (n--) {
b = a + b;
a = b - a;
}
printf("%lu\n", a);
}
File: ThirdVar.c
// Using the third-variable method.
#include <stdio.h>
int main() {
long int n = 1000000; // Number of iterations.
long int a,b,c;
a = 0; b = 1;
while (n--) {
c = a;
a = b;
b = b + c;
}
printf("%lu\n", a);
}
When I run the two with GCC (no optimisations enabled) I notice a consistent difference in speed:
$ time ./PlusMinus
14197223477820724411
real 0m0.014s
user 0m0.009s
sys 0m0.002s
$ time ./ThirdVar
14197223477820724411
real 0m0.012s
user 0m0.008s
sys 0m0.002s
When I run the two with GCC with -O3, the assembly outputs are equal. (I suspect I had confirmation bias when stating that one just outperformed the other in previous edits.)
Inspecting the assembly for each (without optimisations), I see that PlusMinus.s actually has one fewer instruction than ThirdVar.s, but runs consistently slower.
Question
Why does this time difference occur? Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Why does this time difference occur?
There is no time difference when compiled with optimizations (under recent versions of gcc and clang). For instance, gcc 8.1 for x86_64 compiles both to:
Live at Godbolt
.LC0:
.string "%lu\n"
main:
sub rsp, 8
mov eax, 1000000
mov esi, 1
mov edx, 0
jmp .L2
.L3:
mov rsi, rcx
.L2:
lea rcx, [rdx+rsi]
mov rdx, rsi
sub rax, 1
jne .L3
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
add rsp, 8
ret
Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Adding and subtracting could be slower than just moving. However, on most architectures (e.g. an x86 CPU), they are basically the same (1 cycle plus the memory latency), so this does not explain it.
The real problem is, most likely, the dependencies between the data. See:
b = a + b;
a = b - a;
To compute the second line, you have to have finished computing the value of the first. If the compiler uses the expressions as they are (which is the case under -O0), that is what the CPU will see.
In your second example, however:
c = a;
a = b;
b = b + c;
You can compute both the new a and b at the same time, since they do not depend on each other. And, in a modern processor, those operations can actually be computed in parallel. Or, putting it another way, you are not "stopping" the processor by making it wait on a previous result. This is called Instruction-level parallelism.

Toggle a given range of bits of an unsigned int in C

I am trying to replace the following piece of code
// code version 1
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while(time_stx < time_enx) time |= (1 << time_stx++);
with the following one without a loop
// code version 2
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
It turns out that in code version 1, time = 522240; in code version 2, time = 0; when I use
printf("%u\n", time);
to compare the result. I would like to know why is this happening and if there is any faster way to toggle bits in a given range. My compiler is gcc (Debian 4.9.2-10) 4.9.2.
Edit:
Thank you for your replies. I have made a silly mistake and I feel embarrassed posting my question without further inspecting my code. I did
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time1 = 0;
while(time_stx < time_enx) time1 |= (1 << time_stx++); // version 1
//// what I should have done, but forgot:
// time_stx = 11;
// time_enx = 19;
// ...so at this point time_stx == time_enx
unsigned int time2 = (1 << time_enx) - (1 << time_stx); // version 2
// then obviously
printf("time1 = %u\n", time1); // time1 = 522240
printf("time2 = %u\n", time2); // time2 = 0
I am so sorry for any inconvenience incurred.
Remark: both time_stx and time_enx are generated at run time and are not fixed.
As suggested that I made a mistake and the problem is solved now. Thank you!!
Read Bit twiddling hacks. Even if the answer isn't in there, you'll be better educated on bit twiddling. Also, the original code is simply setting the bits in the range; toggling means turning 1 bits into 0 bits and vice versa (normally achieved using ^ or xor).
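To make that distinction concrete, here is a small sketch (my own example, assuming 32-bit unsigned int) contrasting setting and toggling the same bit range:

#include <stdio.h>

int main(void)
{
    unsigned mask = (1u << 19) - (1u << 11);  /* bits 11..18 set: 0x0007F800 */
    unsigned v = 0x000FF0F0;                  /* an arbitrary starting value */

    unsigned set_bits     = v | mask;         /* forces the range to all 1s  */
    unsigned toggled_bits = v ^ mask;         /* flips each bit in the range */

    printf("mask:    0x%08X\n", mask);
    printf("set:     0x%08X\n", set_bits);
    printf("toggled: 0x%08X\n", toggled_bits);
    return 0;
}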
As to the code, I converted three variants of the expression into the following C code:
#include <stdio.h>
static void print(unsigned int v)
{
printf("0x%.8X = %u\n", v, v);
}
static void bit_setter1(void)
{
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while (time_stx < time_enx)
time |= (1 << time_stx++);
print(time);
}
static void bit_setter2(void)
{
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
print(time);
}
static void bit_setter3(void)
{
unsigned int time = 0xFF << 11;
print(time);
}
int main(void)
{
bit_setter1();
bit_setter2();
bit_setter3();
return 0;
}
When I look at the assembler for it (GCC 5.1.0 on Mac OS X 10.10.3), I get:
.globl _main
_main:
LFB5:
LM1:
LVL0:
subq $8, %rsp
LCFI0:
LBB28:
LBB29:
LBB30:
LBB31:
LM2:
movl $522240, %edx
movl $522240, %esi
leaq LC0(%rip), %rdi
xorl %eax, %eax
call _printf
LVL1:
LBE31:
LBE30:
LBE29:
LBE28:
LBB32:
LBB33:
LBB34:
LBB35:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL2:
LBE35:
LBE34:
LBE33:
LBE32:
LBB36:
LBB37:
LBB38:
LBB39:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL3:
LBE39:
LBE38:
LBE37:
LBE36:
LM3:
xorl %eax, %eax
addq $8, %rsp
LCFI1:
ret
That's an amazingly large collection of labels!
The compiler has fully evaluated all three minimal bit_setterN() functions and inlined them, along with the call to print, into the body of main(). That includes evaluating the expressions to 522240 each time.
Compilers are good at optimization. Write clear code and let them at it, and they will optimize better than you can. Clearly, if the 11 and 19 are not fixed in your code (they're some sort of computed variables which can vary at runtime), then the precomputation isn't as easy (and bit_setter3() is a non-starter). Then the non-loop code will work OK, as will the loop code.
For the record, the output is:
0x0007F800 = 522240
0x0007F800 = 522240
0x0007F800 = 522240
If your Debian compiler is giving you a zero from one of the code fragments, then there's either a difference between what you compiled and what you posted, or there's a bug in the compiler. On the whole, and no disrespect intended, it is more likely that you've made a mistake than that the compiler has a bug in it that shows up in code as simple as this.
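If the range bounds really are computed at run time, a loop-free helper along the following lines (my own sketch, assuming 32-bit unsigned int) builds the mask while also guarding the shift-by-32 case, which would otherwise be undefined behaviour:

#include <stdio.h>

/* Sketch: build a mask with bits [lo, hi) set. Shifting a 32-bit value
   by 32 is undefined in C, so that case is handled explicitly. */
static unsigned bit_range_mask(unsigned lo, unsigned hi)
{
    unsigned upper = (hi >= 32) ? ~0u : ((1u << hi) - 1u);
    unsigned lower = (lo >= 32) ? ~0u : ((1u << lo) - 1u);
    return upper & ~lower;
}

int main(void)
{
    printf("%u\n", bit_range_mask(11, 19)); /* prints 522240, i.e. 0x0007F800 */
    return 0;
}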

Is there a more efficient way of splitting a number into its digits?

I have to split a number into its digits in order to display it on an LCD. Right now I use the following method:
pos = 7;
do
{
LCD_Display(pos, val % 10);
val /= 10;
pos--;
} while (pos >= 0 && val);
The problem with this method is that division and modulo operations are extremely slow on an MSP430 microcontroller. Is there any alternative to this method, something that either does not involve division or that reduces the number of operations?
A note: I can't use any library functions, such as itoa. The libraries are big and the functions themselves are rather resource hungry (both in terms of number of cycles, and RAM usage).
You could do subtractions in a loop with predefined base 10 values.
My C is a bit rusty, but something like this:
int num[] = { 10000000,1000000,100000,10000,1000,100,10,1 };
for (pos = 0; pos < 8; pos++) {
int cnt = 0;
while (val >= num[pos]) {
cnt++;
val -= num[pos];
}
LCD_Display(pos, cnt);
}
Yes, there's another way, originally invented (at least AFAIK) by Terje Mathiesen. Instead of dividing by 10, you (sort of) multiply by the reciprocal. The trick, of course, is that in integers you can't represent the reciprocal directly. To make up for that, you work with scaled integers. If we had floating point, we could extract digits with something like:
input = 123
first digit = integer(10 * fraction(input * .1))
second digit = integer(10 * fraction(input * .01))
...and so on for as many digits as needed. To do this with integers, we basically just scale those by 2^32 (and round each up, since we'll use truncating math). In C, the algorithm looks like this:
#include <stdio.h>
// here are our scaled factors
static const unsigned long long factors[] = {
3435973837, // ceil((0.1 * 2**32)<<3)
2748779070, // ceil((0.01 * 2**32)<<6)
2199023256, // etc.
3518437209,
2814749768,
2251799814,
3602879702,
2882303762,
2305843010
};
static const char shifts[] = {
3, // the shift value used for each factor above
6,
9,
13,
16,
19,
23,
26,
29
};
int main() {
unsigned input = 13754;
for (int i=8; i!=-1; i--) {
unsigned long long inter = input * factors[i];
inter >>= shifts[i];
inter &= (unsigned)-1;
inter *= 10;
inter >>= 32;
printf("%u", inter);
}
return 0;
}
The operations in the loop will map directly to instructions on most 32-bit processors. Your typical multiply instruction will take 2 32-bit inputs, and produce a 64-bit result, which is exactly what we need here. It'll typically be quite a bit faster than a division instruction as well. In a typical case, some of the operations will (or at least with some care, can) disappear in assembly language. For example, where I've done the inter &= (unsigned)-1;, in assembly language you'll normally be able to just use the lower 32-bit register where the result was stored, and just ignore whatever holds the upper 32 bits. Likewise, the inter >>= 32; just means we use the value in the upper 32-bit register, and ignore the lower 32-bit register.
For example, in x86 assembly language, this comes out something like:
mov ebx, 9 ; maximum digits we can deal with.
mov esi, offset output_buffer
next_digit:
mov eax, input
mul factors[ebx*4]
mov cl, shifts[ebx]
shrd eax, edx, cl
mov edx, 10 ; overwrite edx => inter &= (unsigned)-1
mul edx
add dl, '0'
mov [esi], dl ; effectively shift right 32 bits by ignoring 32 LSBs in eax
inc esi
dec ebx
jnz next_digit
mov [esi], bl ; zero terminate the string
For the moment, I've cheated a tiny bit, and written the code assuming an extra item at the beginning of each table (factors and shifts). This isn't strictly necessary, but simplifies the code at the cost of wasting 8 bytes of data. It's pretty easy to do away with that too, but I haven't bothered for the moment.
In any case, doing away with the division makes this a fair amount faster on quite a few low- to mid-range processors that lack dedicated division hardware.
Another way is using double dabble. This is a way to convert binary to BCD with only additions and bit shifts, so it's very appropriate for microcontrollers. After splitting into BCD digits you can easily print out each one, for example as sketched below.
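For illustration, here is a minimal double-dabble sketch of my own (not from the original answer) that converts a 16-bit value into five BCD digits:

#include <stdio.h>
#include <stdint.h>

/* Double dabble: convert a 16-bit binary value to 5 BCD digits using only
   shifts, comparisons and additions. */
static void to_bcd(uint16_t val, uint8_t digits[5])
{
    uint32_t bcd = 0;                       /* room for 5 BCD nibbles (20 bits) */
    int i, n;
    for (i = 0; i < 16; i++) {
        /* add 3 to every nibble that is 5 or more before shifting */
        for (n = 0; n < 5; n++)
            if (((bcd >> (4 * n)) & 0xF) >= 5)
                bcd += 3u << (4 * n);
        bcd = (bcd << 1) | ((val >> (15 - i)) & 1u);
    }
    for (n = 0; n < 5; n++)
        digits[n] = (bcd >> (4 * (4 - n))) & 0xF;  /* most significant first */
}

int main(void)
{
    uint8_t d[5];
    int i;
    to_bcd(13754, d);
    for (i = 0; i < 5; i++)
        printf("%u", d[i]);                 /* prints 13754 (with leading zeros for small values) */
    printf("\n");
    return 0;
}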
I would use a temporary string, like:
char buffer[8];
itoa(yourValue, buffer, 10);
int pos;
for(pos=0; pos<8; ++pos)
LCD_Display(pos, buffer[pos]); /* maybe you'll need a cast here */
Edit: since you can't use the library's itoa, I think your solution is already the best, provided you compile with max optimization turned on.
You may take a look at this: Most optimized way to calculate modulus in C
This is my attempt at a complete solution. Credit should go to Guffa for providing the general idea. This should work for 32-bit integers, signed or unsigned, and for 0.
#include <stdlib.h>
#include <stdio.h>
#define MAX_WIDTH (10)
static unsigned int uiPosition[] = {
1u,
10u,
100u,
1000u,
10000u,
100000u,
1000000u,
10000000u,
100000000u,
1000000000u,
};
void uitostr(unsigned int uiSource, char* cTarget)
{
int i, c=0;
for( i=0; i!=MAX_WIDTH; ++i )
{
cTarget[i] = 0;
}
if( uiSource == 0 )
{
cTarget[0] = '0';
cTarget[1] = '\0';
return;
}
for( i=MAX_WIDTH -1; i>=0; --i )
{
while( uiSource >= uiPosition[i] )
{
cTarget[c] += 1;
uiSource -= uiPosition[i];
}
if( c != 0 || cTarget[c] != 0 )
{
cTarget[c] += 0x30;
c++;
}
}
cTarget[c] = '\0';
}
void itostr(int iSource, char* cTarget)
{
if( iSource < 0 )
{
cTarget[0] = '-';
uitostr((unsigned int)(iSource * -1), cTarget + 1);
}
else
{
uitostr((unsigned int)iSource, cTarget);
}
}
int main()
{
char szStr[MAX_WIDTH +1] = { 0 };
// signed integer
printf("Signed integer\n");
printf("int: %d\n", 100);
itostr(100, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", -1);
itostr(-1, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", 1000000000);
itostr(1000000000, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", 0);
itostr(0, szStr);
printf("str: %s\n", szStr);
return 0;
}

Divide without losing remainder

In C, is it possible to divide a dividend by a constant and get the result and the remainder at the same time?
I want to avoid execution of 2 division instructions, as in this example:
val=num / 10;
mod=num % 10;
I wouldn't worry about the instruction count, because the x86 instruction set provides an idivl instruction that computes the quotient and remainder in one instruction. Any decent compiler will make use of this instruction. The documentation here http://programminggroundup.blogspot.com/2007/01/appendix-b-common-x86-instructions.html describes the instruction as follows:
Performs unsigned division. Divides the contents of the double-word
contained in the combined %edx:%eax registers by the value in the
register or memory location specified. The %eax register contains the
resulting quotient, and the %edx register contains the resulting
remainder. If the quotient is too large to fit in %eax, it triggers a
type 0 interrupt.
For example, compiling this sample program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
int x = 39;
int divisor = 1;
int div = 0;
int rem = 0;
printf("Enter the divisor: ");
scanf("%d", &divisor);
div = x/divisor;
rem = x%divisor;
printf("div = %d, rem = %d\n", div, rem);
}
Compiling with gcc -S -O2 (-S saves the temporary file that contains the asm listing) shows that the division and mod in the following lines
div = x/divisor;
rem = x%divisor;
are effectively reduced to the following single instruction:
idivl 28(%esp)
As you can see, there's one instruction that performs both the division and the mod calculation. The idivl instruction remains even if the mod calculation in the C program is removed. After the idivl there are mov instructions:
movl $.LC2, (%esp)
movl %edx, 8(%esp)
movl %eax, 4(%esp)
call printf
These copy the quotient and the remainder onto the stack for the call to printf.
Update
Interestingly, the function div doesn't do anything special other than wrap the / and % operators in a function call. Therefore, from a performance perspective, you will not improve performance by replacing the lines
val=num / 10;
mod=num % 10;
with a single call to div.
There's div():
div_t result = div(num, 10);
// quotient is result.quot
// remainder is result.rem
Don't waste your time with div(). Like Nemo said, the compiler will easily optimize a division followed by a modulus operation into one. Write code that makes optimal sense, and let the computer remove the cruft.
You could always use the div function.
