How can I do 64 bit division on linux kernel? [duplicate] - c

This question already has answers here:
How to divide two 64-bit numbers in Linux Kernel?
(4 answers)
Closed 4 years ago.
I want to do the following in Linux kernel code (32-bit processor):
#define UQ64 long long int
#define UI32 long int
UQ64 qTimeStamp;
UQ64 qSeconds;
UI32 uTimeStampRes;
qTimeStamp = num1;
uTimeStampRes = num2;
// 64 division !
qSeconds = qTimeStamp / uTimeStampRes;
Is there an algorithm to calculate this 64-bit division?
Thanks.

The GCC C compiler generates code that calls functions in the libgcc library to implement the / and % operations with 64-bit operands on 32-bit CPUs. However, the Linux kernel is not linked against the libgcc library, so such code will fail to link when building code for a 32-bit Linux kernel. (When building an external kernel module, the problem may not be apparent until you try and dynamically load the module into the running kernel.)
Originally, the Linux kernel only had the do_div(n,base) macro defined by #include <asm/div64.h>. The usage of this macro is unusual because it modifies its first argument in place to become the quotient resulting from the division, and yields (returns) the remainder from the division as its result. This was done for code efficiency reasons but is a bit of a pain to use. Also, it only supports division of a 64-bit unsigned dividend by a 32-bit divisor.
Linux kernel version 2.6.22 introduced the #include <linux/math64.h> header, which defines a set of functions which is more comprehensive than the old do_div(n,base) macro and is easier to use because they behave like normal C functions.
The functions declared by #include <linux/math64.h> for 64-bit division are listed below. Except where indicated, all of these have been available since kernel version 2.6.26.
One of the functions listed below does not exist yet as of kernel version 4.18-rc8 (it is marked as such in the list). Who knows if it will ever be implemented? (Some other functions declared by the header file, related to multiply and shift operations in later kernel versions, have been omitted below.)
u64 div_u64(u64 dividend, u32 divisor) — unsigned division of 64-bit dividend by 32-bit divisor.
s64 div_s64(s64 dividend, s32 divisor) — signed division of 64-bit dividend by 32-bit divisor.
u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder) — unsigned division of 64-bit dividend by 32-bit divisor with remainder.
s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder) — signed division of 64-bit dividend by 32-bit divisor with remainder.
u64 div64_u64(u64 dividend, u64 divisor) — unsigned division of 64-bit dividend by 64-bit divisor.
s64 div64_s64(s64 dividend, s64 divisor) — (since 2.6.37) signed division of 64-bit dividend by 64-bit divisor.
u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder) — (since 3.12.0) unsigned division of 64-bit dividend by 64-bit divisor with remainder.
s64 div64_s64_rem(s64 dividend, s64 divisor, s64 *remainder) — (does not exist yet as of 4.18-rc8) signed division of 64-bit dividend by 64-bit divisor with remainder.
div64_long(x,y) — (since 3.4.0) macro to do signed division of a 64-bit dividend by a long int divisor (which is 32-bit or 64 bit, depending on the architecture).
div64_ul(x,y) — (since 3.10.0) macro to do unsigned division of a 64-bit dividend by an unsigned long int divisor (which is 32-bit or 64-bit, depending on the architecture).
u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder) — unsigned division of 64-bit dividend by 32-bit divisor by repeated subtraction of the divisor from the dividend, with remainder (may be faster than regular division if the dividend is not expected to be much bigger than the divisor).

You can divide numbers of any size on a computer with any register width; the only difference is how the division is done. On a processor that handles 64-bit integers natively, it will be a single machine-code instruction (I do not know of any 64-bit processor without hardware division). On processors with narrower registers it will be translated into a series of machine-code instructions, or a call to a library function that divides the 64-bit numbers:
uint64_t foo(uint64_t x, uint64_t y)
{
return x/y;
}
On amd64 instruction set:
mov rax, rdi
xor edx, edx
div rsi
ret
On ia32 instruction set:
sub esp, 12
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
push DWORD PTR [esp+28]
call __udivdi3
add esp, 28
ret

Store signed 32-bit in unsigned 64-bit int

Basically, what I want is to "store" a signed 32-bit int inside (in the 32 rightmost bits) an unsigned 64-bit int - since I want to use the leftmost 32 bits for other purposes.
What I'm doing right now is a simple cast and mask:
#define packInt32(X) ((uint64_t)X | INT_MASK)
But this approach has an obvious issue: if X is a positive int (the sign bit is not set), everything goes fine. If it's negative, it becomes messy.
The question is:
How to achieve the above, also supporting negative numbers, in the fastest and most-efficient way?
The "mess" you mention happens because you cast a small signed type to a large unsigned type.
During this conversion the size is adjusted first, applying sign extension. This is what causes your trouble.
You can simply cast the (signed) integer to an unsigned type of the same size first. Then casting to 64 bits will not trigger sign extension:
#define packInt32(X) ((uint64_t)(uint32_t)(X) | INT_MASK)
You need to mask out any bits besides the low order 32 bits. You can do that with a bitwise AND:
#define packInt32(X) (((uint64_t)(X) & 0xFFFFFFFF) | INT_MASK)
A negative 32-bit integer will get sign-extended into 64-bits.
#include <stdint.h>
uint64_t movsx(int32_t X) { return X; }
movsx on x86-64:
movsx:
movsx rax, edi
ret
Masking out the higher 32 bits will instead cause it to be just zero-extended:
#include <stdint.h>
uint64_t mov(int32_t X) { return (uint64_t)X & 0xFFFFFFFF; }
//or uint64_t mov(int32_t X) { return (uint64_t)(uint32_t)X; }
mov on x86-64:
mov:
mov eax, edi
ret
https://gcc.godbolt.org/z/fihCmt
Neither method loses any info from the lower 32-bits, so either method is a valid way of storing a 32-bit integer into a 64-bit one.
The x86-64 code for a plain mov is one byte shorter (3 bytes vs 4). I don't think there should be much of a speed difference, but if there is one, I'd expect the plain mov to win by a tiny bit.
One option is to untangle the sign-extension and the upper value when it is read back, but that can be messy.
Another option is to construct a union with a bit-packed word. This then defers the problem to the compiler to optimise:
union {
int64_t merged;
struct {
int64_t field1:32,
field2:32;
};
};
A third option is to deal with the sign bit yourself. Store a 31-bit absolute value and a 1-bit sign. Not super-efficient, but more likely to be legal if you should ever encounter a non-2's-complement processor where negative signed values can't be safely cast to unsigned. They are rare as hen's teeth, so I wouldn't worry about this myself.
Assuming that the only operations on the 64-bit value will be converting it back to 32 bits (and potentially storing or displaying it), there is no need to apply a mask. The compiler will sign-extend the 32-bit value when casting it to 64 bits, and will take the lowest 32 bits when casting the 64-bit value back to 32 bits.
#define packInt32(X) ((uint64_t)(X))
#define unpackInt32(X) ((int)(X))
Or better, using (inline) functions:
inline uint64_t packInt32(int x) { return ((uint64_t) x) ; }
inline int unpackInt32(uint64_t x) { return ((int) x) ; }

How to get the mantissa of an 80-bit long double as an int on x86-64

frexpl won't work because it keeps the mantissa as part of a long double. Can I use type punning, or would that be dangerous? Is there another way?
x86's float and integer endianness is little-endian, so the significand (aka mantissa) is the low 64 bits of an 80-bit x87 long double.
In assembly, you just load the normal way, like mov rax, [rdi].
Unlike IEEE binary32 (float) or binary64 (double), 80-bit long double stores the leading 1 in the significand explicitly. (Or 0 for subnormal). https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format
So the unsigned integer value (magnitude) of the true significand is the same as what's actually stored in the object-representation.
If you want signed int, too bad; including the sign bit it would be 65 bits but int is only 32-bit on any x86 C implementation.
If you want int64_t, you could maybe right shift by 1 to discard the low bit, making room for a sign bit. Then do 2's complement negation if the sign bit was set, leaving you with a signed 2's complement representation of the significand value divided by 2. (IEEE FP uses sign/magnitude with a sign bit at the top of the bit-pattern)
In C/C++, yes you need to type-pun, e.g. with a union or memcpy. All C implementations on x86 / x86-64 that expose 80-bit floating point at all use a 12 or 16-byte type with the 10-byte value at the bottom.
Beware that MSVC uses long double = double, a 64-bit float, so check LDBL_MANT_DIG from float.h, or sizeof(long double). All 3 static_assert() statements trigger on MSVC, so they all did their job and saved us from copying a whole binary64 double (sign/exp/mantissa) into our uint64_t.
// valid C11 and C++11
#include <float.h> // float numeric-limit macros
#include <stdint.h>
#include <assert.h> // C11 static assert
#include <string.h> // memcpy
// inline
uint64_t ldbl_mant(long double x)
{
// we can assume CHAR_BIT = 8 when targeting x86, unless you care about DeathStation 9000 implementations.
static_assert( sizeof(long double) >= 10, "x87 long double must be >= 10 bytes" );
static_assert( LDBL_MANT_DIG == 64, "x87 long double significand must be 64 bits" );
uint64_t retval;
memcpy(&retval, &x, sizeof(retval));
static_assert( sizeof(retval) < sizeof(x), "uint64_t should be strictly smaller than long double" ); // sanity check for wrong types
return retval;
}
This compiles efficiently on gcc/clang/ICC (on Godbolt) to just one instruction as a stand-alone function (because the calling convention passes long double in memory). After inlining into code with a long double in an x87 register, it will presumably compile to a TBYTE x87 store and an integer reload.
## gcc/clang/ICC -O3 for x86-64
ldbl_mant:
mov rax, QWORD PTR [rsp+8]
ret
For 32-bit, gcc has a weird redundant-copy missed-optimization bug which ICC and clang don't have; they just do the 2 loads from the function arg without copying first.
# GCC -m32 -O3 copies for no reason
ldbl_mant:
sub esp, 28
fld TBYTE PTR [esp+32] # load the stack arg
fstp TBYTE PTR [esp] # store a local
mov eax, DWORD PTR [esp]
mov edx, DWORD PTR [esp+4] # return uint64_t in edx:eax
add esp, 28
ret
C99 makes union type-punning well-defined behaviour, and so does GNU C++. I think MSVC defines it too.
But memcpy is always portable so that might be an even better choice, and it's easier to read in this case where we just want one element.
If you also want the exponent and sign bit, a union between a struct and long double might be good, except that padding for alignment at the end of the struct will make it bigger. It's unlikely that there'd be padding after a uint64_t member before a uint16_t member, though. But I'd worry about :1 and :15 bitfields, because IIRC it's implementation-defined which order the members of a bitfield are stored in.

difference between mult, multu simulating in c

I'm writing a C program that decodes MIPS 32-bit instructions and simulates their functionality (minus the bitwise part). I don't know how I should differentiate between signed and unsigned operations here.
For instance, given registers rs and rt, I need to multiply them and put the result in rd.
for the mult instructions, it's as simple as this:
reg[rd] = reg[rs] * reg[rt];
What should the multu instruction be? Do I need to be doing bitwise operations on the contents of the registers first?
I also need to do:
-add, addu,
-div, divu
-sub, subu
Is it the same distinction in functionality for all of them?
MIPS multiply can't overflow. It's a 32×32-bit operation with a full 64-bit result.
There is a significant difference between signed and unsigned results, as well.
To simulate these easily in C, you'll need C99 integer types from <stdint.h>:
uint32_t reg1, reg2; /* Use this type for the registers, normally */
uint32_t hi, lo; /* special MIPS registers for 64-bit products and dividends */
/* Signed mult instruction: */
int64_t temp = (int64_t)(int32_t)reg1 * (int64_t)(int32_t)reg2;
hi = (uint32_t)((temp>>32) & 0xFFFFFFFF);
lo = (uint32_t)(temp & 0xFFFFFFFF);
The intermediate casts to signed 32-bit types are done so that the cast to a signed 64-bit type will sign-extend before multiplication. The unsigned multiply is similar, except no intermediate cast is needed:
/* Unsigned multu instruction: */
uint64_t tempu = (uint64_t)reg1 * (uint64_t)reg2;
hi = (uint32_t)((tempu>>32) & 0xFFFFFFFF);
lo = (uint32_t)(tempu & 0xFFFFFFFF);
IIRC, in MIPS assembly the only difference between the signed and unsigned variants of add and sub is whether or not they trap on overflow (MIPS has no flags register); mult and multu never trap, but they differ in the upper 32 bits of the result. The multu, then, is easier to implement; for the regular, signed mult you need the sign extension shown above to get the correct upper half.

why X86 provides pair of division and multiply instructions?

I noticed that unsigned int and int share the same instructions for addition and subtraction, but x86 provides idivl / imull for signed integer division and multiplication, and divl / mull for unsigned int. May I know the underlying reason for this?
The results are different when you multiply or divide, depending on whether your arguments are signed or unsigned.
It's really the magic of two's complement that allows us to use the same operation for signed and unsigned addition and subtraction. This is not true in other representations -- ones' complement and sign-magnitude both use a different addition and subtraction algorithm than unsigned arithmetic does.
For example, with 32-bit words, -1 is represented by 0xffffffff. Squaring this, you get different results for signed and unsigned versions:
Signed: -1 * -1 = 1 = 0x00000000 00000001
Unsigned: 0xffffffff * 0xffffffff = 0xfffffffe 00000001
Note that the low word of the result is the same. On processors that don't give you the high bits, only one multiplication instruction is necessary. On PPC, there are three multiplication instructions: one for the low bits, and two for the high bits depending on whether the operands are signed or unsigned.
Most microprocessors implement multiplication and division with a shift-and-add algorithm (or something similar). This of course requires that the sign of the operands be handled separately.
While implementing multiplication and division in a sign-oblivious way would have allowed signed and unsigned integer values to be handled interchangeably, it would be a much less efficient algorithm, and that's likely why it wasn't used.
I just read that some modern CPUs alternatively use Booth encoding, but that algorithm also involves examining the sign of the values.
On x86, the sign is stored in the high bit of the word (when talking about signed vs. unsigned integers).
ADD and SUB use one algorithm for signed and unsigned operands and get the correct result in both cases.
For multiplication and division this does not work, so you have to "tell" the CPU whether to treat the operands as signed or unsigned.
For unsigned values, use MUL and DIV. They operate on the words directly, which is fast.
For signed values, use IMUL and IDIV. Conceptually they take the absolute value of the operands, record the sign of the result, and then perform the operation, which can be slower than MUL and DIV.

How can a 16bit Processor have 4 byte sized long int?

I have a problem with the size of long int on a 16-bit CPU. Looking at its architecture:
No register is more than 16 bits long. So how come long int can have more than 16 bits? In fact, it seems to me that for any processor the maximum size of a data type should be the size of its general-purpose registers. Am I right?
Yes. In fact the C and C++ standards require that sizeof(long int) >= 4.*
*(I'm assuming CHAR_BIT == 8 in this case.)
This is the same deal with 64-bit integers on 32-bit machines. The way it is implemented is to use two registers to represent the lower and upper halves.
Addition and subtraction are done as two instructions:
On x86:
Addition: add and adc where adc is "add with carry"
Subtraction: sub and sbb where sbb is "subtract with borrow"
For example:
long long a = ...;
long long b = ...;
a += b;
will compile to something like:
add eax,ebx
adc edx,ecx
Where eax and edx are the lower and upper parts of a. And ebx and ecx are the lower and upper parts of b.
Multiplication and division for double-word integers are more complicated, but they follow the same sort of grade-school math, where each "digit" is a processor word.
No. If the machine doesn't have registers that can handle 32-bit values, it has to simulate them in software. It could do this using the same techniques that are used in any library for arbitrary precision arithmetic.