Can anyone please explain this in a easier way? - c

I had take up a computer organization course a year ago and now I have a follow up to it as 'Computer architecture' , I am using the 3rd edition of John Hennessy's book 'Quantitative approach to computer architecture', I went through the MIPS ISA but still need some help , can you explain this line of code in greater detail
Source code:
for(i=1000; i>0; i--)
x[i] = x[i] + s;
Assembly code:
Loop: L.D F0, 0(R1) ; F0 = array element
ADD.D F4, F0, F2 ; add scalar
S.D F4, 0(R1) ; store result
DADDUI R1, R1,# -8 ; decrement address pointer
BNE R1, R2, Loop ; branch if R1 != R2
This is given as an example for loop unrolling to exploit ILP , I have a few doubts . I do get it that the array starts at Mem[0+R1] and goes backwards till Mem[R+8](as given in the text) , any reason for this or they just randomly took this location?
Also why use a DADDUI (unsigned ) when we are adding a signed number (-8) ?
Please give a detailed overview of this so that i can follow along the rest of the topics.
Thanks

The memory accesses are performed to the addesses and in the order as specified by the loop in the source code.
The daddiu instruction is sufficient to perform such address arithmetic. The "negative" value accomplishes subtraction in two's-complement. Addresses are neither negative nor positive; they are just bit-patterns. Refer to an ISA reference to learn more about MIPS and instructions.
The 16-bit signed immediate is added to the 64-bit value in GPR rs and
the 64-bit arithmetic result is placed into GPR rt . No Integer
Overflow exception occurs under any circumstances.
…
The term “unsigned” in the instruction name is a misnomer; this operation is 64-bit modulo arithmetic that does not
trap on overflow. It is appropriate for unsigned arithmetic such as address arithmetic, or integer arithmetic environments that ignore overflow, such as C language arithmetic.
The example is not optimized or unrolled. It's just a literal translation of the source.

Related

How memory store signed and unsigned char

Just started to learn C, and i feel little bit confused.
I have some questions:
If i have the following code:
signed char x = 56;
// ‭In the RAM, I will see 00111000‬ yes/no?
signed char z = -56;
// In the RAM, I will see 11001000 yes/no?
unsigned char y = 200;
// ‭In the RAM, I will see 11001000‬ yes/no?
I have the following code:
if (z<0){
printf("0 is bigger then z ");
}
After compiling, how the assembly instructions know if z is -56 and not 200?(there is a special ASM instructions for signed and unsigned?).
As i mentioned in question number 1, the value of z and y is 11001000, and there is not any indicate to know if its signed or unsigned.
Apologize if i didn't find the right way to ask my question, hope you understand me
Thanks
Numbers are stored in binary. Negative numbers are usually stored as two's complement form, but C language allows different representations. So this one:
signed char z = -56;
// In the RAM, I will see 11001000 yes/no?
usually yes, but may be not on some exotic platforms.
Second question is too implementation specific. For example comparison against zero on x86 may be performed as self-comparison, and flags register would be affected, for unsigned comparison sign flag (SF) is ignored.
The compiler will generate the appropriate instructions for the signed and unsigned cases. I think it might better to see an example. The following code
void foobar();
void foo(unsigned char a)
{
if (a < 10)
foobar();
}
void bar(char a)
{
if (a < 10)
foobar();
}
Will translate to this MIPS code with GCC 5.4 using -O3 flag.
foo:
andi $4,$4,0x00ff
sltu $4,$4,10
bne $4,$0,$L4
nop
j $31
nop
$L4:
j foobar
nop
bar:
sll $4,$4,24
sra $4,$4,24
slt $4,$4,10
bne $4,$0,$L7
nop
j $31
nop
$L7:
j foobar
nop
This is the interesting part of the foo function (which use unsigned char type)
foo:
andi $4,$4,0x00ff
sltu $4,$4,10
As you can see sltu command used which is the unsinged version of slt. (You don't really have to know what it does)
While if we looking at the function bar relevants part
bar:
sll $4,$4,24
sra $4,$4,24
slt $4,$4,10
You can see that slt used which will treat its register operand as signed. The sll and sra pair doing sign extension since here the operands a was signed so its needed, while in unsigned case its not.
So you could see that different instructions generated with respect to the signdess of the operands.
The compiler will generate different instructions depending on whether it is an unsigned or signed type. And that is what tells the processor which way to treat it. So yes there are seperate instructions for signed and unsigned. With Intel processors, there are also seperate instructions depending on the width (char, short, int)
there is a special ASM instructions for signed and unsigned?
Yes, hardware generally has machine code instructions (or instruction sequences) that can
sign extend a byte to word size
zero extend a byte to word size
compare signed quantities for the various relations <, <=, >, >=
compare unsigned quantities for the various relations <, <=, >, >=
how the assembly instructions know if z is -56 and not 200?
In high level languages we associate a type with a variable. From then on the compiler knows the default way to interpret code that uses the variable. (We can override or change that default interpretation using a cast at usages of the variable.)
In machine code, there are only bytes, either in memory or in CPU registers. So, it is not how it is stored that matter (for signed vs. unsigned), but what instructions are used to access the storage. The compiler will use the right set of machine code instructions every time the variable is accessed.
While we store lots of things in memory, the processor has no concept of variable declarations. The processor only sees machine code instructions, and interprets all data types through the eyes of the instruction it is being told to execute.
As an assembly programmer, it is your job to apply the proper instructions (here signed vs. unsigned) to the same variable each time it is used. Using a byte as a signed variable and later as an unsigned variable, is a logic bug that is easy to do in assembly language.
Some assemblers will help if you use the wrong size to access a variable, but none that I know help if you use the proper size but incorrect signed-ness.
Computers do not know nor care about such things. Unsigned and signed is only relevant to the programmer. The value 0xFF can at the same time be -1, 255 or an address or a portion of an address. Part of a floating point number and so on. The computer does not care. HOW the programmer conveys their interpretation of the bits is through the program. Understanding that addition and subtraction also do not care about signed vs unsigned because it is the same logic, but other instructions like multiplies where the result is larger than the inputs or divide where the result is smaller than the inputs then there are unsigned and signed versions of the instructions or your processor may only have one and you have to synthesize the other (or none and you have to synthesize both).
int fun0 ( void )
{
return(5);
}
unsigned int fun1 ( void )
{
return(5);
}
00000000 <fun0>:
0: e3a00005 mov r0, #5
4: e12fff1e bx lr
00000008 <fun1>:
8: e3a00005 mov r0, #5
c: e12fff1e bx lr
no special bits nor nomenclature...bits is bits.
The compiler driven by the users indication of signed vs unsigned in the high level language drive instructions and data values that cause the alu to output flags that indicate greater than, less than, and equal through single flags or combinations, then a conditional branch can be taken based on the flag. Often but not always the compiler will generate the opposite if z < 0 then do something the compiler will say if z >= 0 then jump over the something.

Imagining two types of boolean pairs within the same language without further ado

I have just a little theoretical curiosity. The == operator in C returns 1 in case of positive equality, 0 otherwise. My knowledge of assembly is very limited. However I was wondering if it could be possible, theoretically, to implement a new operator that returns ~0 in case of positive equality, 0 otherwise – but at one condition: it must produce the same number of assembly instructions as the == operator. It's really just a theoretical curiosity, I have no practical uses in mind.
EDIT
My question targets x86 CPUs, however I am very curious to know if there are architectures that natively do that.
SECOND EDIT
As Sneftel has pointed out, nothing similar to the SETcc instructions [1] – but able to convert flag register bits into 0/~0 values (instead of the classical 0/1) – exists. So the answer to my question seems to be no.
THIRD EDIT
A little note. I am not trying to represent a logical true as ~0, I am trying to understand if a logical true can also be optionally represented as ~0 when needed, whithout further effort, within a language that already normally represents true as 1. And for this I had hypothized a new operator that “returns” numbers, not booleans (the natural logical true “returned” by == remains represented as 1) – otherwise I would have asked whether == could be re-designed to “return” ~0 instead of 1. You can think of this new operator as half-belonging to the family of bitwise operators, which “return” numbers, not booleans (and by booleans I don't mean boolean data types, I mean anything outside of the number pair 0/1, which is what a boolean is intended in C as a result of a logical operation).
I know that all of this might sound futile, but I had warned: it is a theoretical question.
However here my question seems to be addressed explicitly:
Some languages represent a logical one as an integer with all bits set. This representation can be obtained by choosing the logically opposite condition for the SETcc instruction, then decrementing the result. For example, to test for overflow, use the SETNO instruction, then decrement the result.
So it seems there is no direct instruction, since using SETNE and then decrementing means adding one more instruction.
EDIT: as other people are pointing out, there are some flavors of "conditionally assign 0/1" out there. Kind of undermines my point :) Apparently, the 0/1 boolean type admits a slightly deeper optimization than a 0/~0 boolean.
The "operator returns a value" notion is a high level one, it's not preserved down to the assembly level. That 1/0 may only exist as a bit in the flags register, or not even that.
In other words, assigning the C-defined value of the equality operator to an int sized variable is not a primitive on the assembly level. If you write x = (a == b), the compiler might implement it as
cmp a, b ; set the Z flag
cmovz x, 1 ; if equals, assign 1
cmovnz x, 0 ; if not equals, assign 0
Or it can be done with conditional jumps. As you can see, assigning a ~0 as the value for TRUE will take the same commands, just with a different operand.
None of the architectures that I'm familiar with implement equality comparison as "assign a 1 or 0 to a general purpose register".
There is no assembly implementation of a C operator. For instance, there is no x86 instruction which compares two arguments and results in a 0 or 1, only one which compares two arguments and puts the result in a bit in the flag register. And that's not usually what happens when you use ==.
Example:
void foo(int a, int b) {
if(a == b) { blah(); }
}
produces the following assembly, more or less:
foo(int, int):
cmp %edi, %esi
je .L12
rep ret
.L12:
jmp blah()
Note that nothing in there involves a 0/1 value. If you want that, you have to really ask for it:
int bar(int a, int b) {
return a == b;
}
which becomes:
bar(int, int):
xor %eax, %eax
cmp %edi, %esi
sete %al
ret
I suspect the existence of the SETcc instructions is what prompted your question, since they convert flag register bits into 0/1 values. There is no corresponding instruction which converts them into 0/~0: GCC instead does a clever little DEC to map them. But in general, the result of == exists only as an abstract and optimizer-determined difference in machine state between the two.
Incidentally, I would not be surprised at all if some x86 implementations chose to fuse SETcc and a following DEC into a single micro-op; I know this is done with other common instruction pairs. There is no simple relationship between a stream of instructions and a number of cycles.
For just 1 extra cycle you can just negate the /output/.
Internally in 8086, the comparison operations only exist in the flags. Getting the value of the flags into a variable takes extra code. It is pretty much the same code whether you want true as 1 or -1. Generally a compiler doesn't actually generate the value 0 or 1 when evaluating an if statement, but uses the Jcc instructions directly on the flags generated by comparison operations. https://pdos.csail.mit.edu/6.828/2006/readings/i386/Jcc.htm
With 80386, SETcc was added, which only ever sets 0 or 1 as the answer, so that is the preferred arrangement if the code insists on storing the answer. https://pdos.csail.mit.edu/6.828/2006/readings/i386/SETcc.htm
And there are lots of new compare instructions that save results to registers going forward. The flags have been seen as a bottleneck for instruction pipeline stalls in modern processors, and very much are disfavoured by code optimisation.
Of course there are all sorts of tricks you can do to get 0, 1, or -1 given a particular set of values to compare. Needless to say the compiler has been optimised to generate 1 for true when applying these tricks, and wherever possible, it doesn't actually store the value at all, but just reorganises your code to avoid it.
SIMD vector comparisons do produce vectors of 0 / -1 results. This is the case on x86 MMX/SSE/AVX, ARM NEON, PowerPC Altivec, etc. (They're 2's complement machines, so I like to write -1 instead of ~0 to represent the elements of all-zero / all-one bits).
e.g. pcmpeqd xmm0, xmm1 replaces each element of xmm0 with xmm0[i] == xmm1[i] ? -1 : 0;
This lets you use them as AND masks, because SIMD code can't branch separately on each vector element without unpacking to scalar and back. It has to be branchless. How to use if condition in intrinsics
e.g. to blend 2 vectors based on a condition, without SSE4.1 pblendvb / blendvps, you'd compare and then AND / ANDNOT / OR. e.g. from Substitute a byte with another one
__m128i mask = _mm_cmpeq_epi8(inp, val); // movdqa xmm1, xmm0 / PCMPEQB xmm1, xmm2
// zero elements in the original where there was a match (that we want to replace)
inp = _mm_andnot_si128(mask, inp); // inp &= ~mask; // PANDN xmm0, xmm1
// zero elements where we keep the original
__m128i tmp = _mm_and_si128(newvals, mask); // newvals & mask; // PAND xmm3, xmm1
inp = _mm_or_si128(inp, tmp); // POR xmm0, xmm1
But if you want to count matches, you can subtract the compare result. total -= -1 avoids having to negate the vector elements. How to count character occurrences using SIMD
Or to conditionally add something, instead of actually blending, just do total += (x & mask), because 0 is the identity element for operations like ADD (and some others like XOR and OR).
See How to access a char array and change lower case letters to upper case, and vice versa and Convert a String In C++ To Upper Case for examples in C with intrinsics and x86 asm.
All of this has nothing to do with C operators and implicit conversion from boolean to integer.
In C and C++, operators return a boolean true/false condition, which in asm for most machines for scalar code (not auto-vectorized) maps to a bit in a flag register.
Converting that to an integer in a register is a totally separate thing.
But fun fact: MIPS doesn't have a flags register: it has some compare-and-branch instructions for simple conditions like reg == reg or reg != reg (beq and bne). And branch on less-than-zero (branch on the sign bit of one register): bltz $reg, target.
(And an architectural $zero register that always reads as zero, so you can use that implement branch if reg !=0 or reg == 0).
For more complex conditions, you use slt (set on less-than) or sltu (set on less-than-unsigned) to compare into an integer register. Like slt $t4, $t1, $t0 implements t4 = t1 < t0, producing a 0 or 1. Then you can branch on that being 0 or not, or combine multiple conditions with boolean AND / OR before branching on that. If one of your inputs is an actual bool that's already 0 or 1, it can be optimized into this without an slt.
Incomplete instruction listing of classic MIPS instructions (not including pseudo-instructions like blt that assemble to slt into $at + bne: http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html
But MIPS32r6 / MIPS64r6 changed this: instructions generating truth values now generate all zeroes or all ones instead of just clearing/setting the 0-bit, according to https://en.wikipedia.org/wiki/MIPS_architecture#MIPS32/MIPS64_Release_6. MIPS32/64 r6 is not binary compatible with previous MIPS ISAs, it also rearranged some opcodes. And because of this change, not even asm source compatible! But it's a definite change for the better.
Fun fact, there is an undocumented 8086 SALC instruction (set AL from carry) that's still supported in 16/32-bit mode by modern Intel (and AMD?) CPUs.
It's basically like sbb al,al without setting flags: AL = CF ? -1 : 0. http://os2museum.com/wp/undocumented-8086-opcodes.
Subtract-with-borrow with the same input twice does x-x - CF on x86, where CF is a borrow for subtraction. And x-x is of course always zero. (On some other ISAs, like ARM, the carry flag meaning is opposite for subtraction, C set means "no borrow".)
In general, you can do sbb edx,edx (or any register you want) to convert CF into a 0 / -1 integer. But this only works for CF; the carry flag is special and there's nothing equivalent for other flags.
Some AMD CPUs even recognize sbb same,same as independent of the old value of the register, only dependent on CF, like xor-zeroing. On other CPUs it still has the same architectural effect, but with a microarchitectural false dependency on the old value of EDX.

Union in C changing machine behavior of Float Addition

New to C programming, and I've been told to avoid unions which in general makes perfect sense and I agree with. However, as part of an academic exercise I'm writing an emulator for hardware single-precision floating point addition by doing bit manipulation operations on unsigned 32-bit integers. I only mention that to explain why I want to use unions; I'm having no trouble with the emulation.
In order to test this emulator, I wrote a test program. But of course I'm trying to find the bit representation of floats on my hardware, so I thought this could be the perfect use for a union. I wrote this union:
typedef union {
float floatRep;
uint32_t unsignedIntRep;
} FloatExaminer;
This way, I can initialize a float with the floatRep member and then examine the bits with the unsignedIntRep member.
This worked most of the time, but when I got to NaN addition, I started running into trouble. The exact situation was that I wrote a function to automate these tests. The gist of it was this:
void addTest(float op1, float op2){
FloatExaminer result;
result.floatRep = op1 + op2;
printf("%f + %f = %f\n", op1, op2, result.floatRep);
//print bit pattern as well
printf("Bit pattern of result: %08x", result.unsignedIntRep);
}
OK, now for the confusing part:
I added a NAN and a NAN with different mantissa bit patterns to differentiate between the two. On my particular hardware, it's supposed to return the second NAN operand (making it quiet if it was signalling). (I'll explain how I know this below.) However, passing the bit patterns op1=0x7fc00001, op2=0x7fc00002 would return op1, 0x7fc00001, every time!
I know it's supposed to return the second operand because I tried--outside the function--initializing as an integer and casting to a float as below:
uint32_t intRep1 = 0x7fc00001;
uint32_t intRep2 = 0x7fc00002;
float *op1 = (float *) &intRep1;
float *op2 = (float *) &intRep2;
float result = *op1 + *op2;
uint32_t *intResult = (uint32_t *)&result;
printf("%08x", *intResult); //bit pattern 0x7fc00002
In the end, I've concluded that unions are evil and I should never use them. However, does anyone know why I'm getting the result I am? Did I make stupid mistake or assumption? (I understand that hardware architecture varies, but this just seems bizarre.)
I'm assuming that when you say "my particular hardware", you are referring to an Intel processor using SSE floating point. But in fact, that architecture has a different rule, according to the Intel® 64 and IA-32 Architectures
Software Developer's Manual. Here's a summary of Table 4.7 ("Rules for handling NaNs") from Volume 1 of that documentation, which describes the handling of NaNs in arithmetic operations: (QNaN is a quiet NaN; SNaN is a signalling NaN; I've only included information about two-operand instructions)
SNaN and QNaN
x87 FPU — QNaN source operand.
SSE — First source operand, converted to a QNaN.
Two SNaNs
x87 FPU — SNaN source operand with the larger significand, converted to a QNaN
SSE — First source operand, converted to a QNaN.
Two QNaNs
x87 FPU — QNaN source operand with the larger significand
SSE — First source operand
NaN and a floating-point value
x87/SSE — NaN source operand, converted to a QNaN.
SSE floating point machine instructions generally have the form op xmm1, xmm2/m32, where the first operand is the destination register and the second operand is either a register or a memory location. The instruction will then do, in effect, xmm1 <- xmm1 (op) xmm2/m32, so the first operand is both the left-hand operand of the operation and the destination. That's the meaningof "first operand" in the above chart. AVX adds three-operand instructions, where the destination might be a different register; it is then the third operand and does not figure in the above chart. The x87 FPU uses a stack-based architecture, where the top of the stack is always one of the operands and the result replaces either the top of the stack or the other operand; in the above chart, it will be noted that the rules do not attempt to decide which operand is "first", relying instead on a simple comparison.
Now, suppose we're generating code for an SSE machine, and we have to handle the C statement:
a = b + c;
where none of those variables are in a register. That means we might emit code something like this: (I'm not using real instructions here, but the principle is the same)
LOAD r1, b (r1 <- b)
ADD r1, c (r1 <- r1 + c)
STORE r1, a (a <- r1)
But we could also do this, with (almost) the same result:
LOAD r1, c (r1 <- c)
ADD r1, b (r1 <- r1 + b)
STORE r1, a (a <- r1)
That will have precisely the same effect, except for additions involving NaNs (and only when using SSE). Since arithmetic involving NaNs is unspecified by the C standard, there is no reason why the compiler should care which of these two options it chooses. In particular, if r1 happened to already have the value c in it, the compiler would probably choose the second option, since it saves a load instruction. (And who is going to complain? We all want the compiler to generate code which runs as quickly as possible, no?)
So, in short, the order of the operands of the ADD instruction will vary with the intricate details of how the compiler chooses to optimize the code, and the particular state of the registers at the moment in which the addition operator is being emitted. It is possible that this will be effected by the use of a union, but it is equally or more likely that it has to do with the fact that in your code using the union, the values being added are arguments to the function and therefore are already placed in registers.
Indeed, different versions of gcc, and different optimization settings, produce different results for your code. And forcing the compiler to emit x87 FPU instructions produces yet different results, because the hardware operates according to a different logic.
Note:
If you want some bedtime reading, you can download the entire Intel SDM (currently 4,684 pages / 23.3MB, but it keeps on getting bigger) from their site.

What does cmp %eax,0x80498d4(,%ebx,4) mean?

I know there are some other questions similar to this, but I'm still having trouble understanding the () part of it. Could someone spell this syntax out for me? Thanks.
cmp %eax,0x80498d4(,%ebx,4)
cmp is the comparison assembly instruction. It performs a comparison between two arguments by signed subtracting the right argument from the left and sets a CPU EFLAGS register. This EFLAGS register can then be used to do conditional branching / moving, etc.
First argument: `%eax (the value in the %eax register)
Second argument: 0x80498d4(,%ebx,4). This is read as offset ( base, index, scale ) In your example, the value of the second argument is the memory location offset 0x80498d4 + base (which I believe defaults to zero if not included) + value in %ebx register * 4 (scaling factor).
Note: I believe base here is empty and defaults to the value 0.
You can take a look at http://docs.oracle.com/cd/E19120-01/open.solaris/817-5477/ennby/index.html for more information on the syntax for Intel x86 assembly instructions.

MUL(Assembler) in C

In Assembler i can use the MUL command and get a 64 bit Result EAX:EDX,
how can i do the same in C ? http://siyobik.info/index.php?module=x86&id=210
My approach to use a uint64_t and shift the Result don't work^^
Thank you for your help (=
Me
Any decent compiler will just do it when asked.
For example using VC++ 2010, the following code:
unsigned long long result ;
unsigned long a = 0x12345678 ;
unsigned long b = 0x87654321 ;
result = (unsigned long long)a * b ;
generates the following assembler:
mov eax,dword ptr [b]
mov ecx,dword ptr [a]
mul eax,ecx
mov dword ptr [result],eax
mov dword ptr [a],edx
Post some code. This works for me:
#include <inttypes.h>
#include <stdio.h>
int main(void) {
uint32_t x, y;
uint64_t z;
x = 0x10203040;
y = 0x3000;
z = (uint64_t)x * y;
printf("%016" PRIX64 "\n", z);
return 0;
}
See if you can get the equivalent of __emul or __emulu for your compiler(or just use this if you've got an MS compiler). though 64bit multiply should automatically work unless your sitting behind some restriction or other funny problem(like _aulmul)
You mean to multiply two 32 bit quantities to obtain a 64 bit result?
This is not foreseen in C by itself, either you have tow 32 bit in such as uint32_t and then the result is of the same width. Or you cast before to uint64_t but then you loose the advantage of that special (and fast) multiply.
The only way I see is to use inline assembler extensions. gcc is quite good in this, you may produce quite optimal code. But this isn't portable between different versions of compilers. (Many public domain compilers adopt the gcc, though, I think)
#include
/* The name says it all. Multiply two 32 bit unsigned ints and get
* one 64 bit unsigned int.
*/
uint64_t mul_U32xU32_u64(uint32_t a, uint32_t x) {
return a * (uint64_t)b; /* Note about the cast below. */
}
This produces:
mul_U32xU32_u64:
movl 8(%esp), %eax
mull 4(%esp)
popl %ebp
ret
When compiled with:
gcc -m32 -O3 -fomit-frame-pointer -S mul.c
Which uses the mul instruction (called mull here for multiply long, which is how the gnu assembler for x86 likes it) in the way that you want.
In this case one of the parameters was pulled directly from the stack rather than placed in a register (the 4(%esp) thing means 4 bytes above the stack pointer, and the 4 bytes being skipped over are the return address) because the numbers were passed into the function and would have been pushed onto the stack (as per the x86 ABI (application binary interface) ).
If you inlined the function or just did the math in it in your code it would most likely result in using the mul instruction in many cases, though optimizing compilers may also replace some multiplications with simpler code if they can tell that it would work (for instance it could turn this into a shift or even a constant if the one or more of the arguments were known).
In the C code at least one of the arguments had to be cast to a 64 bit value so that the compiler would produce a 64 bit result. Even if the compiler had to use code that produced a 64 bit result when multiplying 32 bit values, it may have not considered the top half of it to be important because according to the rules of C operations usually result in a value with the same type as the value with the largest range out of its components (except you can sometimes argue that is not really exactly what it does).
You cannot do exactly that in C, i.e. you cannot multiply two N-bit values and obtain a 2N-bit value as the result. Semantics of C multiplication is different from that of your machine multiplication. In C the multiplication operator is always applied to values of the same type T (so called usual arithmetic conversions take care of that) and produces the result of the same type T.
If you run into overflow on multiplication, you have to use a bigger type for the operands. If there's no bigger type, you are out of luck (i.e. you have no other choice but to use library-level implementation of large multiplication).
For example, if the largest integer type of your platform is a 64-bit type, then at assembly level on your machine you have access to mul operation producing the correct 128-bit result. At the language level you have no access to such multiplication.

Resources