How can a 16-bit processor have a 4-byte long int?

I have a problem with the size of long int on a 16-bit CPU. Looking at its architecture:
No register is more than 16 bits long. So how can long int have more than 16 bits? It seems to me that, for any processor, the maximum size of a data type should be the size of the general-purpose register. Am I right?

Yes, it can. In fact, the C and C++ standards require that sizeof(long int) >= 4.*
*(I'm assuming CHAR_BIT == 8 here; the underlying requirement is that long can represent at least a 32-bit range.)
This is the same deal with 64-bit integers on 32-bit machines. The way it is implemented is to use two registers to represent the lower and upper halves.
Addition and subtraction are done as two instructions:
On x86:
Addition: add and adc where adc is "add with carry"
Subtraction: sub and sbb where sbb is "subtract with borrow"
For example:
long long a = ...;
long long b = ...;
a += b;
will compile to something like:
add eax,ebx
adc edx,ecx
Where eax and edx are the lower and upper parts of a. And ebx and ecx are the lower and upper parts of b.
Multiplication and division for double-word integers is more complicated, but it follows the same sort of grade-school math - but where each "digit" is a processor word.
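As a concrete sketch of that grade-school scheme (an illustration of the technique, not any particular compiler's output; the function name and 16-bit-halves layout are my own), here is a 32-bit multiply built from the 16x16 -> 32-bit multiplies a 16-bit CPU typically provides:
#include <stdint.h>
/* Build a 32x32 -> low-32-bit multiply out of 16-bit "digits".
   Only the low 32 bits of the product are kept, so the unsigned
   wrap-around in the shifted cross terms is harmless. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)b, b_hi = (uint16_t)(b >> 16);
    uint32_t lo    = (uint32_t)a_lo * b_lo;  /* digit 0 x digit 0 */
    uint32_t cross = (uint32_t)a_lo * b_hi   /* cross terms land  */
                   + (uint32_t)a_hi * b_lo;  /* 16 bits higher    */
    return lo + (cross << 16);  /* a_hi * b_hi would shift past bit 31 */
}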

No. If the machine doesn't have registers that can handle 32-bit values, it has to simulate them in software. It could do this using the same techniques that are used in any library for arbitrary precision arithmetic.

Related

Does C uses 2's complement internally to evaluate unsigned numbers arithmetic like 5-4?

I have C code as
#include <stdio.h>
int main(void)
{
    unsigned int a = 5;
    unsigned int b = 4;
    printf("%u\n", a - b);
    return 0;
}
The output of the above code is 1. I am thinking that C internally calculated the result by taking the 2's complement of -4 and then using complement arithmetic to evaluate it. Please correct me if I am interpreting anything wrong. (Here I am talking about how C actually calculates the result in binary.)
Does C uses 2's complement internally to evaluate unsigned numbers arithmetic like 5-4?
No, for two reasons.
unsigned int a = 5, b = 4;
printf("%u",a-b);
C guarantees that arithmetic on unsigned integer types is performed modulo the size of the type. So if you computed b-a, you'd get -1 which would wrap around to UINT_MAX, which is probably either 65535 or 4294967295 on your machine. But if you compute a-b, that's just an ordinary subtraction that doesn't overflow or underflow in any way, so the result is an uncomplicated 1 without worrying about 2's complement or modulo arithmetic or anything.
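A quick program to see that wrap-around (plain standard C, restating only what the paragraph above says):
#include <stdio.h>
#include <limits.h>
int main(void)
{
    unsigned int a = 5, b = 4;
    printf("%u\n", a - b);    /* 1: ordinary subtraction, no wrap */
    printf("%u\n", b - a);    /* 4 - 5 wraps modulo UINT_MAX + 1... */
    printf("%u\n", UINT_MAX); /* ...so it prints the same as this */
    return 0;
}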
If your compiler, or your CPU architecture, chooses to implement a - b as a + -b, that's their choice, but it's an implementation detail way beneath the visibility of the C Standard, or ordinary programmers like you and me, and it won't affect the observable results of a C program at all.
Where things get interesting, of course, is with addition and subtraction of signed quantities. Up above I said that under unsigned arithmetic, 4 - 5 is -1 which wraps around to UINT_MAX. Using signed arithmetic, of course, 4 - 5 is -1 which is -1. Under 2's complement arithmetic, it Just So Happens that the bit patterns for -1 and UINT_MAX are identical (typically 0xffff or 0xffffffff), and this is why 2's complement arithmetic is popular, because your processor gets to define, and your C compiler gets to use, just one set of add and sub instructions, that work equally well for doing signed and unsigned arithmetic. But (today, at least), C does not mandate 2's complement arithmetic, and that's the other reason why the answer to your original question is "no".
But to be clear (and to go back to your question): Just about any C compiler, for just about any architecture, is going to implement a - b by emitting some kind of a sub instruction. Whether the processor then chooses to implement sub as a two's complement negate-and-add, or some other kind of negate-and-add, or via dedicated bitwise subtraction-with-borrow logic, is entirely up to that processor, and it doesn't matter (and is probably invisible) as long as it always returns a mathematically appropriate result.
The arithmetic method is generally whatever is most natural on the target hardware. It is not defined by the C language.
When a processor's ALU has at least int-sized registers, a-b will no doubt be implemented as a single SUB instruction (or whatever the target's op-code mnemonic for subtraction might be). In the hardware logic it may well be that the logic is equivalent to a + (~b + 1) (i.e. 2's-complement the RHS and add) - but that is a hardware logic/micro-code implementation issue, not a language or compiler behaviour.
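You can observe that identity from C itself; a minimal check (my own example, using unsigned operands so the wrap-around is well defined):
#include <stdio.h>
int main(void)
{
    unsigned int a = 5, b = 4;
    /* For unsigned types, a + (~b + 1) equals a - b modulo UINT_MAX + 1,
       mirroring the negate-and-add the hardware may perform. */
    printf("%u %u\n", a - b, a + (~b + 1u)); /* prints: 1 1 */
    return 0;
}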
At Godbolt for GCC x86 64-bit, the statement:
unsigned int c = a - b ;
generates the following assembly code (my comments):
mov eax, DWORD PTR [rbp-4] ; Load a to EAX
sub eax, DWORD PTR [rbp-8] ; Subtract b from EAX
mov DWORD PTR [rbp-12], eax ; Write result to c
So in that sense your question is not valid - C does not do anything in particular, the processor performs the subtraction intrinsically.
The C standard allows 2's, 1's and sign+magnitude arithmetic, but in practice the world has settled on 2's complement and machines that use other representations are arcane antiques that probably never had C compilers targeted for them in any case.
There were in any case moves to remove the option for anything other than 2's complement from the language (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm), and C23 has since adopted this: signed integers are now required to be 2's complement.

Signed Multiplication of 1024-bit 2's complement numbers using 32-bit chunks

So I have the following struct definition for my 1024-bit number (I want to use 2's complement representation here and I am on a 32-bit system):
typedef struct int1024
{
int32_t num[32]; //should I use uint32_t?
} int1024;
Basically an array that holds the segments of the number.
For addition, since signed and unsigned addition are the same, I can simply use the instructions add and adc to carry out the extended operation over my bigger array.
Now for multiplication. I was able to get an unsigned multiplication to work using a combination of imul (using the upper and lower results) and adc, following the classic O(n^2) multiplication algorithm. However, I need signed multiplication for my results. I know people say just to use the sign-magnitude approach, i.e. take the 2's complement when negative and at the end apply the 2's complement again if needed. But all the extra negating and adding 1 would be really expensive, since I need to do a lot of multiplications. Is there a technique for doing large signed multiplication for number representations like this? (I don't really want any branching or recursive calls here.) I guess one of my main problems is how to deal with the carry and the high part of the multiplication of negative numbers. (Just an idea: maybe I could use a combination of signed and unsigned multiplication, but I don't know where to begin.) If you don't have time for 1024 bits, a 128-bit answer will suffice (I will just generalize it).
Note that in a 1024-bit integer only the very top bit, bit 1023, is the sign bit and has a place-value of -(2**1023). All the other bits have their normal +(2**pos) place value. i.e. the lower "limbs" all need to be treated as unsigned when you do widening multiplies on them.
You have one sign bit and 1023 lower bits. Not one sign bit per limb.
Also, the difference between signed and unsigned multiply is only in the high half of a widening multiply (N x N => 2N bits). That's why x86 has separate instructions for imul r/m64 and mul r/m64 to do a full-multiply into RDX:RAX (https://www.felixcloutier.com/x86/imul vs. https://www.felixcloutier.com/x86/mul). But for non-widening there's only imul r, r/m and imul r, r/m, imm which compilers use for both unsigned and signed (and so should humans).
Since you want a 1024x1024 => 1024-bit product which discards the upper 1024 bits, you can and should just make your whole thing unsigned.
When you're writing in asm, you can use whatever chunk size the hardware supports, e.g. 64-bit chunks in 64-bit mode, rather than limiting yourself to the C chunk size.
See https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for how to use BMI2 mulx (leaves FLAGS untouched so doesn't disturb adc chains, and only has RDX as an implicit input, with other source and both outputs being explicit, saving on MOV instructions). And optionally also Broadwell ADOX / ADCX to run two dep chains in parallel to get more ILP for medium sized BigInteger stuff like 512x512-bit or 512x64-bit (which they use as an example).
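For reference, here is a C sketch of the truncating schoolbook multiply described above, with the limbs made unsigned per this answer's advice (little-endian limb order assumed; this is an illustration, not the asker's code):
#include <stdint.h>
#define LIMBS 32 /* 32 x 32-bit limbs = 1024 bits, least-significant first */
typedef struct int1024 { uint32_t num[LIMBS]; } int1024;
/* 1024 x 1024 -> low 1024 bits. Signed and unsigned inputs give the same
   truncated product, so every limb is treated as plain unsigned. */
static int1024 mul1024(const int1024 *a, const int1024 *b)
{
    int1024 r = {{0}};
    for (int i = 0; i < LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; i + j < LIMBS; j++) {
            /* 32x32 -> 64-bit partial product plus accumulator plus carry
               cannot overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)a->num[i] * b->num[j]
                       + r.num[i + j] + carry;
            r.num[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        /* carry past limb 31 falls off the end of the truncated product */
    }
    return r;
}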

Size of int on 8-bit machines

The ISO C standard states that a "plain" int object "has the natural size suggested by the architecture of the execution environment".
However, it is also guaranteed that int is at least as large as short, which is at least 16 bits in size.
The natural size suggested by an 8-bit processor, such as a 6502 or 8080, would seem to be an 8-bit int, however that would make int shorter than 16 bits.
So, how large would int be on one of these 8 bit processors?
On the 6502, the only 16-bit register was the program counter; 16-bit integers were handled 8 bits at a time, using multiple instructions. For example, a 16-bit c = a + b becomes:
clc ; clear carry bit
lda A_lo ; lower byte of A into accumulator
adc B_lo ; add lower byte of B to accumulator, put carry to carry bit
sta C_lo ; store the result to lower byte of C
lda A_hi ; higher byte of A into accumulator
adc B_hi ; add higher byte of B using carry bit
sta C_hi ; store the result to higher byte of C
The 8080 and Z80 CPUs of that era had 16-bit registers as well.
The Z80 was still an 8-bit architecture: its 16-bit registers were actually pairs of 8-bit registers, like BC and DE. Operations on them were much slower than on 8-bit registers because the underlying architecture was 8-bit, but this way 16-bit registers and 16-bit operations were provided.
The 8088 architecture was mixed: it had an 8-bit data bus but 16-bit registers, AX, BX, etc., whose lower and upper bytes were also separately usable as 8-bit registers, AL, AH, etc.
So there were different ways to handle 16-bit integers, but an 8-bit int is simply too small to be useful. That's why C and C++ settled on at least 16 bits for int.
From Section 6.2.5 Types, p5
5 An object declared as type signed char occupies the same amount of storage as a "plain" char object. A "plain" int object has the natural size suggested by the architecture of the execution environment (large enough to contain any value in the range INT_MIN to INT_MAX as defined in the header <limits.h>).
And 5.2.4.2.1 Sizes of integer types <limits.h> p1
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
...
minimum value for an object of type int
INT_MIN -32767 // -(2^15 - 1)
maximum value for an object of type int
INT_MAX +32767 // 2^15 - 1
So even on those platforms, int must be at least 16 bits.
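That guarantee can even be checked at compile time with C11's _Static_assert (a trivial restatement of the quoted limits, added for illustration):
#include <limits.h>
/* Any conforming implementation must give int at least the range [-32767, 32767]. */
_Static_assert(INT_MAX >= 32767 && INT_MIN <= -32767,
               "int must provide at least 16 bits of range");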

why X86 provides pair of division and multiply instructions?

I noticed that unsigned int and int share the same instructions for addition and subtraction, but there are idivl / imull for signed integer division and multiplication and divl / mull for unsigned int. May I know the underlying reason for this?
The results are different when you multiply or divide, depending on whether your arguments are signed or unsigned.
It's really the magic of two's complement that allows us to use the same operation for signed and unsigned addition and subtraction. This is not true in other representations -- ones' complement and sign-magnitude both use a different addition and subtraction algorithm than unsigned arithmetic does.
For example, with 32-bit words, -1 is represented by 0xffffffff. Squaring this, you get different results for signed and unsigned versions:
Signed: -1 * -1 = 1 = 0x00000000 00000001
Unsigned: 0xffffffff * 0xffffffff = 0xfffffffe 00000001
Note that the low word of the result is the same. On processors that don't give you the high bits, there is only one multiplication instruction necessary. On PPC, there are three multiplication instructions — one for the low bits, and two for the high bits depending on whether the operands are signed or unsigned.
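You can reproduce that example in C by widening before the multiply (a small demonstration I've added; the printed bit patterns match the two lines above):
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    int32_t  s = -1;
    uint32_t u = 0xFFFFFFFFu;      /* same bit pattern as s */
    int64_t  sp = (int64_t)s * s;  /* signed:   -1 * -1 = 1 */
    uint64_t up = (uint64_t)u * u; /* unsigned: 0xfffffffe00000001 */
    printf("signed:   0x%016llx\n", (unsigned long long)sp);
    printf("unsigned: 0x%016llx\n", (unsigned long long)up);
    return 0;
}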
Most microprocessors implement multiplication and division with a shift-and-add algorithm (or a similar one). This of course requires that the sign of the operands be handled separately.
Implementing multiplication and division with repeated add-and-subtract would have made it possible to ignore the sign, and hence to handle signed and unsigned integer values interchangeably, but it is a much less efficient algorithm, which is likely why it isn't used.
I just read that some modern CPUs alternatively use the Booth encoding method, but that algorithm also involves accounting for the sign of the values.
On x86 the sign is stored in the high bit of the word (speaking of signed versus unsigned integers).
The ADD and SUB instructions use one algorithm for signed and unsigned operands and get the correct result in both cases.
For multiplication and division this does not work, so you have to tell the CPU whether the operands are signed or unsigned.
For unsigned operands, use MUL and DIV. They just operate on the words directly, which is fast.
For signed operands, use IMUL and IDIV. Conceptually these take the absolute values, record the sign for the result, and then perform the operation, which is slower than MUL and DIV.

How to load a pixel struct into an SSE register?

I have a struct of 8-bit pixel data:
struct __attribute__((aligned(4))) pixels {
    char r;
    char g;
    char b;
    char a;
};
I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers?
Unpacking unsigned pixels with SSE2
OK, using the SSE2 integer intrinsics from <emmintrin.h>, first load the pixel into the lower 32 bits of the register:
__m128i xmm0 = _mm_cvtsi32_si128(*(const int*)&pixel);
Then unpack those 8-bit values into 16-bit values in the lower 64 bits of the register, interleaving them with zeros:
xmm0 = _mm_unpacklo_epi8(xmm0, _mm_setzero_si128());
And again unpack those 16-bit values into 32-bit values:
xmm0 = _mm_unpacklo_epi16(xmm0, _mm_setzero_si128());
You should now have each 8-bit channel as a 32-bit integer in the respective four lanes of the SSE register.
Unpacking signed pixels with SSE2
I just read that you want to get those values as 32-bit signed integers, though I wonder what sense a signed pixel in [-128,127] makes. But if your pixel values can indeed be negative, the interleaving with zeros won't work, since it turns a negative 8-bit number into a positive 16-bit number (thus interpreting your numbers as unsigned pixel values). A negative number has to be extended with 1s instead of 0s, but unfortunately that has to be decided dynamically on a component-by-component basis, which SSE is not that good at.
What you could do is compare the values for negativity and use the resulting mask (which conveniently uses 1...1 for true and 0...0 for false) as the interleaving operand instead of the zero register:
xmm0 = _mm_unpacklo_epi8(xmm0, _mm_cmplt_epi8(xmm0, _mm_setzero_si128()));
xmm0 = _mm_unpacklo_epi16(xmm0, _mm_cmplt_epi16(xmm0, _mm_setzero_si128()));
This will properly extend negative numbers with 1s and positives with 0s. But of course this additional overhead (in the form of probably 2-4 additional SSE instructions) is only necessary if your initial 8-bit pixel values can ever be negative, which I still doubt. But if this is really the case, you should rather consider signed char over char, as the latter has implementation-defined signedness (in the same way you should use unsigned char if those are the common unsigned [0,255] pixel values).
Alternative SSE2 unpacking using shifts
As clarified, you don't need signed-8-bit-to-32-bit conversion, but for the sake of completeness harold had another very good idea for SSE2-based sign extension, instead of the comparison-based version above. We first unpack the 8-bit values into the upper byte of the 32-bit values instead of the lower byte. Since we don't care about the lower parts, we just use the 8-bit values again, which frees us from the need for an extra zero register and an additional move:
xmm0 = _mm_unpacklo_epi8(xmm0, xmm0);
xmm0 = _mm_unpacklo_epi16(xmm0, xmm0);
Now we just need to perform an arithmetic right shift of the upper byte into the lower byte, which does the proper sign extension for negative values:
xmm0 = _mm_srai_epi32(xmm0, 24);
This should be more efficient in both instruction count and registers than my comparison-based SSE2 version above.
And since it should be equal in instruction count for a single pixel (though one more instruction when amortized over many pixels) and more register-efficient (no extra zero register) compared to the zero-extension above, it could even be used for the unsigned conversion when registers are scarce, but then with a logical shift (_mm_srli_epi32) instead of an arithmetic shift.
Improved unpacking with SSE4
Thanks to harold's comment, there is an even better option for the first 8-to-32 transformation if you have SSE4 support (SSE4.1 to be precise): it has instructions for doing the complete conversion from 4 packed 8-bit values in the lower 32 bits of the register into 4 32-bit values in the whole register, both for signed and unsigned 8-bit values:
xmm0 = _mm_cvtepu8_epi32(xmm0); //or _mm_cvtepi8_epi32 for signed 8-bit values
Packing pixels with SSE2
As for the follow-up of reversing this transformation: first we pack the signed 32-bit integers into signed 16-bit integers, with saturation:
xmm0 = _mm_packs_epi32(xmm0, xmm0);
Then we pack those 16-bit values into unsigned 8-bit values using saturation:
xmm0 = _mm_packus_epi16(xmm0, xmm0);
We can then finally take our pixel from the lower 32 bits of the register:
*(int*)&pixel = _mm_cvtsi128_si32(xmm0);
Due to the saturation, this whole process will automatically map any negative values to 0 and any values greater than 255 to 255, which is usually intended when working with color pixels.
If you actually need truncation instead of saturation when packing the 32-bit values back into unsigned chars, then you will need to do this yourself, since SSE only provides saturating packing instructions. But this can be achieved by doing a simple:
xmm0 = _mm_and_si128(xmm0, _mm_set1_epi32(0xFF));
right before the above packing procedure. This should amount to just 2 additional SSE instructions, or only 1 additional instruction when amortized over many pixels.
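Putting the unsigned path together, here is a self-contained round trip (my own usage example, assuming unsigned channel data and only SSE2):
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdio.h>
#include <string.h>
struct __attribute__((aligned(4))) pixels { unsigned char r, g, b, a; };
int main(void)
{
    struct pixels p = { 10, 200, 30, 255 };
    int bits;
    memcpy(&bits, &p, sizeof bits); /* avoids the aliasing cast */
    __m128i v = _mm_cvtsi32_si128(bits);            /* pixel in low 32 bits */
    v = _mm_unpacklo_epi8(v, _mm_setzero_si128());  /* 8 -> 16 bit */
    v = _mm_unpacklo_epi16(v, _mm_setzero_si128()); /* 16 -> 32 bit */
    v = _mm_add_epi32(v, _mm_set1_epi32(100));      /* some per-channel math */
    v = _mm_packs_epi32(v, v);  /* 32 -> 16 bit, signed saturation */
    v = _mm_packus_epi16(v, v); /* 16 -> 8 bit, unsigned saturation */
    bits = _mm_cvtsi128_si32(v);
    memcpy(&p, &bits, sizeof p);
    printf("%d %d %d %d\n", p.r, p.g, p.b, p.a);    /* 110 255 130 255 */
    return 0;
}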
