Buffer overflow with integers - c

I have a very basic question. Let's say I have two variables (uint16_t a, uint16_t b) and in memory they are aligned next to each other, like a => 0x0 to 0x15 and b => 0x16 to 0x31.
Let's assume a = 0, b = 65535.
(1) If I increment b (b++), b will become 0, but will it affect a's 0th bit?
(2) If I left-shift b (b = b << 1), will it affect a?
Thank you

No, unless you are doing odd things with pointers or casts.

The answer is no.
a and b are both uint16_t, so they are of an unsigned type, and unsigned overflow (wrap-around) is well defined in C. It won't change the memory next to them.
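Here is a minimal sketch (a struct is used only to force the two values to be adjacent; with two separate locals the compiler may place them anywhere):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Model "a and b next to each other in memory" explicitly. */
    struct { uint16_t a; uint16_t b; } s = { 0, 65535 };

    s.b++;                                /* unsigned wrap-around: 65535 + 1 == 0 */
    printf("a=%d b=%d\n", s.a, s.b);      /* prints a=0 b=0 */

    s.b = 65535;
    s.b = s.b << 1;                       /* the shifted-out bit is simply discarded */
    printf("a=%d b=%d\n", s.a, s.b);      /* prints a=0 b=65534 */

    return 0;
}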

No, a correctly designed system will not have that happen. Also, I will point out that your numeral notation is incorrect by common convention. 0x is generally used to notate hexadecimal numbers, including in the C language, but from the context of your question you are prefixing decimal numbers with it for no apparent reason. For example, 0x31 is equal to 49 in decimal, and two 16-bit variables span bits 0 to 31, not 0 to 49.

Related

Explain how specific C #define works

I have been looking at some of the code at http://www.netlib.org/fdlibm/ to see how some functions work, and in the code for e_log.c some parts say:
hx = __HI(x); /* high word of x */
lx = __LO(x); /* low word of x */
The code for __HI(x) and __LO(x) is:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
which I really don't understand because I am not familiar with this type of C. Can someone please explain to me what __HI(x) and __LO(x) are doing?
Also later in the code for the function there is a statement:
__HI(x) = hx|(i^0x3ff00000);
Can someone please explain to me:
how is it possible to make a function equal to something (I generally work with Python so I don't really know what is going on)?
what are __HI(x) and __LO(x) doing?
what does the program mean by "high word" and "low word" of x?
The final purpose of my analysis is understanding this code in order to port it to a Python implementation.
These macros use compiler-dependent properties to access the representations of double types.
In C, all objects other than bit-fields are represented as sequences of bytes. The fdlibm code you are looking at is designed for implementations where int is four bytes and the double type is represented using eight bytes in a format defined by the IEEE-754 floating-point specification. That format is called binary64 or IEEE-754 basic 64-bit binary floating-point. It is also designed for an implementation where the C compiler guarantees that aliasing via pointer conversions is supported. (This is not guaranteed by the C standard, but C implementations may support it.)
Consider a double object named x. Given these macros:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
When __LO(x) is used in source code, it is replaced by *(int*)&x. The &x takes the address of x. The address of x has type double *. The cast (int *) converts this to int *, a pointer to an int. Then * dereferences this pointer, resulting in a reference to the int that is at the low-address part of x.
When __HI(x) is used in the source code, (int*)&x again points to the low-address part of x. Adding 1 changes it to point to the high-address part. Then * dereferences this, resulting in a reference to the int that is at the high-address part.
The routines in fdlibm are special mathematical routines. To operate, they need to examine and modify the bytes that represent double values. The __LO and __HI macros give them this access.
These definitions of __HI and __LO work for implementations that store the double values in little-endian order (with the “least significant” part of the double in the lower-addressed memory location). The fdlibm code may contain alternate definitions for big-endian systems, likely selected by some #if statement.
In the code __HI(x) = hx|(i^0x3ff00000);, the value 0x3ff00000 is a bit mask for the bits that encode the exponent (and part of the significand) of a double value. Without context, we cannot say precisely what is happening here, but the code appears to be merging hx with some value from i. It is likely completing some computation of the bytes representing a new double value it is creating and storing those bytes in the “high” part of a double object.
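To illustrate, here is a sketch assuming a little-endian machine with 32-bit int and binary64 double, as described above; it uses memcpy instead of the pointer cast so the inspection itself is well defined:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double x = 1.0;                      /* binary64 representation: 0x3FF0000000000000 */
    uint32_t lo, hi;

    memcpy(&lo, (unsigned char *)&x, 4);      /* low-address half  */
    memcpy(&hi, (unsigned char *)&x + 4, 4);  /* high-address half */

    /* On a little-endian machine this prints hi=3ff00000 lo=00000000,
       which is what __HI(x) and __LO(x) would read. */
    printf("hi=%08x lo=%08x\n", (unsigned)hi, (unsigned)lo);
    return 0;
}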
I am adding a reply to complement the one already present (not to replace it).
hx = __HI(x); /* high word of x */
lx = __LO(x); /* low word of x */
Comments are useful... even if in this case the macro names could be clear enough. "High" and "low" refer to the two halves of an integer representation, typically of a 16- or 32-bit value (for an 8-bit value, the two 4-bit halves are usually called "nibbles").
If we take a 16-bit unsigned integer which can range from 0 to 65535, or in hex 0x0000 to 0xFFFF, for example 0x1234, the two halves are:
0x1234
    ^^------------------ lower half, or "low"
  ^^-------------------- upper half, or "high"
Note that "lower" means the less significant part. The correct way to get the two halves, assuming 16 bits, is to make a logical (bitwise) AND with 0xFF to get lo(), and to shift 8 bit right (divide by 256) to get high.
Now, inside a CPU the number 0x1234 is written in two consecutive locations, either as 0x12 then 0x34 if big-endian, or 0x34 then 0x12 if little-endian. Given this, other ways are possible to read single halves, reading the correct one directly from memory without calculation. To get the lo() of 0x1234 in a little endian machine, it is possible to read the single byte in the first location.
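In C, both approaches look like this (a sketch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t v = 0x1234;

    /* Splitting with mask and shift -- works regardless of endianness. */
    uint8_t lo = v & 0xFF;   /* 0x34 */
    uint8_t hi = v >> 8;     /* 0x12 */

    /* Peeking at the first byte in memory -- the result depends on endianness:
       0x34 on a little-endian machine, 0x12 on a big-endian one. */
    unsigned char first = *(unsigned char *)&v;

    printf("hi=%02x lo=%02x first byte in memory=%02x\n", hi, lo, first);
    return 0;
}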
From the question:
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
Both __LO and __HI peek directly into memory rather than using mask and shift, which is less portable. Notice that the value being split in two has twice the size of the machine word: if int is 32 bits, the value being split (a double) is 64 bits long. And there is another caveat: those macros can not only read the halves, they can also be used to write the two halves separately. In fact, from the question:
__HI(x) = hx|(i^0x3ff00000);
the result is to set only the HI part (upper, most significant) of x. Note also the value used, 0x3ff00000, which is a 32-bit mask; this is consistent with x being a 64-bit double whose halves are 32 bits each.
Hope this is clear enough to translate the C to Python. You should work with 64-bit integers. When you need the LO() part, use a bitwise AND with 0xFFFFFFFF; to get HI(), shift right 32 bits or do the equivalent division.
When HI and LO are to the left of an assignment, only that half of the value is written, so you should construct the two halves separately and sum them up (or bitwise-OR them together), as in the sketch below.
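For example, replacing only the high 32-bit half of a 64-bit integer with masks and shifts could look like this (a sketch; the 0x3ff00000 constant is just borrowed from the question):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t x  = 0x4008000000000000ULL;  /* some 64-bit representation      */
    uint32_t hi = 0x3ff00000;             /* the new value for the high half */

    /* Keep the low 32 bits, drop the old high 32 bits, OR in the new ones:
       this is what "__HI(x) = ..." achieves by writing through memory. */
    x = (x & 0xFFFFFFFFULL) | ((uint64_t)hi << 32);

    printf("%016llx\n", (unsigned long long)x);   /* prints 3ff0000000000000 */
    return 0;
}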
Hope it helps...
#define A B
is a preprocessor directive that substitutes the token A with B throughout the source code before compilation.
#define A(x) B
is a function-like preprocessor macro which uses a parameter x in order to do a parameterized preprocessor substitution. In this case, B can be a function of x as well.
Your macros
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x
// called as
__HI(x) = hx|(i^0x3ff00000);
Since it is just a matter of text substitution, the assignment is perfectly legit. Why? Because in both cases the macro expands to an lvalue: an expression that designates an int object which can be read or assigned to.
That lvalue is obtained in both cases by the same steps:
take x's address
cast it to a pointer to int
dereference it (in the case of __LO())
add 1 and then dereference it (in the case of __HI())
What it will actually point to depends on the architecture, because pointer arithmetic is architecture-dependent. Endianness also has to be taken into account.
What we can say is that they are designed to access the lower and higher halves of a data type whose size is 2*sizeof(int) (so, if for example int is 32 bits wide, they give access to the lower 32 bits and to the upper 32 bits). Furthermore, from the macro names we understand that the code assumes a little-endian architecture (the least significant half comes first in memory).
In order to port code containing these macros to Python, you will need to work at a higher level, since Python does not support pointers.
These tips don't solve your specific task, but they give you a working method for this task and similar ones:
A way to understand what a macro does is to check how it is actually translated by the preprocessor. This can be done on most compilers through the -E compiler option (see the sketch after these tips).
Use a debugger to understand the functionality: set a breakpoint just before the call to the macro, and analyze its effects on addresses and variables.
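For instance, running the preprocessor over a tiny file containing the macros makes the substitution visible (a sketch; the expanded form is shown in the trailing comment):

/* demo.c -- run e.g. "gcc -E demo.c" to see the expanded form. */
#define __HI(x) *(1+(int*)&x)
#define __LO(x) *(int*)&x

void f(double x, int hx, int i)
{
    __HI(x) = hx | (i ^ 0x3ff00000);
    /* After preprocessing, the line above reads:
       *(1+(int*)&x) = hx | (i ^ 0x3ff00000);  */
}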

Printing actual bit representation of integers in C

I wanted to print the actual bit representation of integers in C. These are the two approaches that I found.
First:
union int_char {
    int val;
    unsigned char c[sizeof(int)];
} data;

data.val = n1;
// printf("Integer: %p\nFirst char: %p\nLast char: %p\n", &data.val, &data.c[0], &data.c[sizeof(int)-1]);
for (int i = 0; i < sizeof(int); i++)
    printf("%.2x", data.c[i]);
printf("\n");
Second:
for (int i = 0; i < 8 * sizeof(int); i++) {
    int j = 8 * sizeof(int) - 1 - i;
    printf("%d", (val >> j) & 1);
}
printf("\n");
For the second approach, the outputs are 00000002 and 02000000. I also tried the other numbers and it seems that the bytes are swapped in the two. Which one is correct?
Welcome to the exotic world of endian-ness.
Because we write numbers most significant digit first, you might imagine the most significant byte is stored at the lower address.
The electrical engineers who build computers are more imaginative.
Sometimes they store the most significant byte first, but on your platform it's the least significant.
There are even platforms where it's all a bit mixed up - but you'll rarely encounter those in practice.
So we talk about big-endian and little-endian for the most part. It's a joke about Gulliver's Travels where there's a pointless war about which end of a boiled egg to start at. Which is itself a satire of some disputes in the Christian Church. But I digress.
Because your first snippet looks at the value as a series of bytes, it encounters them in the platform's byte order.
But because >> is defined as operating on the bits of the value, it works 'logically', without regard to how the bytes are laid out in memory.
C is right not to define the byte order, because hardware that didn't match whatever model C chose would be burdened with the overhead of shuffling bytes around endlessly and pointlessly.
There sadly isn't a built-in identifier telling you what the model is, though code that detects it can be found (see the sketch below).
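A common runtime check looks like this (a sketch; it simply inspects which byte of a known value is stored first):

#include <stdio.h>

int main(void)
{
    unsigned int n = 1;
    unsigned char *p = (unsigned char *)&n;   /* look at the first byte in memory */

    if (*p == 1)
        puts("little-endian");   /* least significant byte stored first */
    else
        puts("big-endian");      /* most significant byte stored first  */
    return 0;
}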
It will become relevant to you if (a) as above, you want to break down integer types into bytes and manipulate them, or (b) you receive files from other platforms containing multi-byte structures.
Unicode offers something called a BOM (Byte Order Mark) in UTF-16 and UTF-32.
In fact a good reason (among many) for using UTF-8 is that the problem goes away, because each code unit is a single byte.
Footnote:
It's been pointed out quite fairly in the comments that I haven't told the whole story.
The C language specification admits more than one representation of integers, and particularly of signed integers: specifically sign-magnitude, two's complement and ones' complement.
It also permits 'padding bits' that don't represent part of the value.
So in principle along with tackling endian-ness we need to consider representation.
In principle. In practice, all modern computers use two's complement, and extant machines that use anything else are very rare; unless you have a genuine requirement to support such platforms, I recommend assuming you're on a two's-complement system.
The correct hex representation as a string is 00000002, as if you had declared the integer with a hex literal:
int n = 0x00000002; //n=2
or as you would get when printing the integer as hex, as in:
printf("%08x", n);
But when printing the integer's bytes one after the other, you must also consider endianness, which is the byte order of multi-byte integers:
In a big-endian system (some UNIX systems use it) the 4 bytes will be ordered in memory as:
00 00 00 02
while in a little-endian system (most common platforms today) the bytes will be ordered in memory as:
02 00 00 00
The first prints the bytes that represent the integer in the order they appear in memory. Platforms with different endianness will print different results, as they store integers in different ways.
The second prints the bits that make up the integer value, most significant bit first. This result is independent of endianness. The result is also independent of how the >> operator is implemented for signed ints, as it does not look at the bits that may be influenced by the implementation.
The second is a better match to the question "Printing actual bit representation of integers in C", although there is a lot of ambiguity.
It depends on your definition of "correct".
The first one will print the data exactly as it's laid out in memory, so I bet that's the one you're getting the maybe unexpected 02000000 for. *) IMHO, that's the correct one. It could be done more simply by just aliasing with unsigned char * directly (char pointers are always allowed to alias any other pointers; in fact, accessing representations is a use case for char pointers mentioned in the standard):
int x = 2;
unsigned char *rep = (unsigned char *)&x;
for (int i = 0; i < sizeof x; ++i) printf("0x%hhx ", rep[i]);
The second one will print only the value bits **) and take them in order from the most significant to the least significant one. I wouldn't call it correct, because it also assumes that bytes have 8 bits, and because the shifting used is implementation-defined for negative numbers. ***) Furthermore, just ignoring padding bits doesn't seem correct either if you really want to see the representation.
edit: As commented by Gerhardh meanwhile, this second code doesn't print byte by byte but bit by bit. So, the output you claim to see isn't possible. Still, it's the same principle, it only prints value bits and starts at the most significant one.
*) You're on a "little endian" machine. On these machines, the least significant byte is stored first in memory. Read more about Endianness on wikipedia.
**) Representations of types in C may also have padding bits. Some types aren't allowed to include padding (like char), but int is allowed to have them. This second option doesn't alias to char, so the padding bits remain invisible.
***) A correct version of this code (for printing all the value bits) must a) correctly determine the number of value bits (8 * sizeof(int) is wrong because bytes (char) can have more than 8 bits, and even CHAR_BIT * sizeof(int) is wrong, because this would also count padding bits if present) and b) avoid the implementation-defined shifting behavior by first converting to unsigned. It could look for example like this:
#include <stdio.h>

#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
                      + (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))

int main(void)
{
    int x = 2;
    for (unsigned mask = 1U << (IMAX_BITS((unsigned)-1) - 1); mask; mask >>= 1)
    {
        putchar((unsigned) x & mask ? '1' : '0');
    }
    puts("");
}
See this answer for an explanation of this strange macro.

How is an array of char stored?

Here is something weird I found:
When I have a char* s of three elements and assign it to be "21":
1. The printed short int value of s appears to be 12594, which is 00110001 00110010 in binary, i.e. 49 and 50 as separate chars. But according to the ASCII chart, the value of '2' is 50 and '1' is 49.
2. When I shift the value right with *(short*)s >>= 8, the result agrees with (1.): it is '1', or 49. But after I assign *s = '1', the printed string of s still appears to be "1", which I earlier thought would become "11".
I am kind of confused about how the bits are stored in a char now; I hope someone can explain this.
Following is the code I use:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%lu,%lu\n", sizeof(char), sizeof(short));
    char* s = malloc(sizeof(char) * 3);
    *s = '2', *(s+1) = '1', *(s+2) = '\0';
    printf("%s\n", s);
    printf("%d\n", *(short int*)s);
    *(short*)s >>= 8;
    printf("%s\n", s);
    printf("%d\n", *(short int*)s);
    *s = '1';
    printf("%s\n", s);
    return 0;
}
And the output is:
1,2
21
12594
1
49
1
This program is compiled on macOS with gcc.
You need some understanding of the concept of "endianness" here: values can be represented as "little endian" and "big endian".
I am going to skip the discussion of how legal this is and of the undefined behaviour involved.
(Here is however a relevant link, provided by Lundin, credits:
What is the strict aliasing rule?)
But let's look at a pair of bytes in memory, of which the lower-addressed one contains 50 and the higher-addressed one contains 49:
50 49
You introduce them exactly this way, by explicitly setting the lower byte and the higher byte (via the char type).
Then you read them, forcing the compiler to consider them a short, which is a two-byte type on your system.
Compilers and hardware can be created with different "opinions" on what is a good representation of two-byte values in two consecutive bytes. This is called "endianness".
Two compilers, both of which are perfectly standard-conforming, can act like this. The short to be returned is either:
the value from the lower address, multiplied by 256, plus the value from the higher address, or
the value from the higher address, multiplied by 256, plus the value from the lower address.
They do not actually compute it this way; it is a much more efficient mechanism implemented in hardware, but the point is that even the hardware implicitly does one or the other, as spelled out in the sketch below.
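Written out in C, the two conventions amount to this (a sketch using the two bytes 50 and 49 from above):

#include <stdio.h>

int main(void)
{
    unsigned char lower  = 50;   /* byte at the lower address  ('2') */
    unsigned char higher = 49;   /* byte at the higher address ('1') */

    /* Big-endian reading: the lower-addressed byte is the more significant one. */
    unsigned short be = (unsigned short)(lower * 256 + higher);    /* 12849 */

    /* Little-endian reading: the higher-addressed byte is the more significant one. */
    unsigned short le = (unsigned short)(higher * 256 + lower);    /* 12594 */

    printf("big-endian: %u  little-endian: %u\n", (unsigned)be, (unsigned)le);
    return 0;
}

The 12594 you printed matches the little-endian reading.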
You are re-interpreting representations by aliasing types in a way that is not allowed by the standard: you can process a short value as if it were a char array, but not the opposite. Doing that can cause weird errors with optimizing compilers that could assume that the value has never been initialized, or could optimize out a full branch of code that contains Undefined Behaviour.
Then the answer to your question is called endianness. In a big-endian representation, the most significant byte has the lowest address (258, or 0x102, will be represented as the two bytes 0x01, 0x02 in that order), while in a little-endian representation the least significant byte has the lowest address (0x102 is represented as 0x02, 0x01 in that order).
Your system happens to be a little-endian one.

Get byte - how is this wrong?

I want to get the designated byte from a 32 bit integer. I am getting wrong values but I don't know why.
The restrictions to this problem are:
Must use signed bits, and I can't use multiplication.
I specifically need to know what is wrong with the function as it's below.
Here is the function:
int retrieveByteFromWord(int word, int byte)
{
    return (word >> (byte << 3)) & 0xFF;
}
ex:
           (3)      (2)      (1)      (0)    ------ byte number
In word: 10010011 11001100 00110011 10101000
I want to return byte 2 (1100 1100).
retrieveByteFromWord(word, 2) ---- gives: 1100 1100
But for some cases it's wrong and it won't tell me what case.
Any ideas?
Here is the problem:
You just started working for a company that is implementing a set of procedures to operate on a data structure where 4 signed bytes are packed into a 32 bit unsigned. Bytes within the word are numbered from 0(LSB) to 3(MSB). You have been assigned the task of implementing a function for a machine using 2's complement arithmetic and arithmetic right shifts with the following prototype:
typedef unsigned packed_t;
int xbyte(packed_t word, int bytenum);
This is the previous employee's attempt, which got him fired for being wrong:
int xbyte(packed_t word, int bytenum)
{
    return (word >> (bytenum << 3)) & 0xFF;
}
A) What is wrong with the code?
B) Write a correct implementation using only left and right shifts and one subtraction.
I have done B but still don't know why A is wrong. Is it because decimal numbers like 12, 15, 19, 55 get packed into a word and then, when I extract them, they aren't the same numbers anymore? It might be, so I am going to run some tests real fast...
As this is homework I won't give you a full answer, but I'll point you in the right direction. Your problem statement says that:
4 signed bytes are packed into a 32 bit unsigned.
When you bitwise & a 32 bit signed integer with 0xFF the most significant bit - i.e. the sign bit - of the result is always 0, so the original function never returns a negative value regardless of the input.
By way of example...
When you say "retrieveByteFromWord(word, 2) ---- gives: 11001100" you're wrong.
Your return type is a 32 bit integer - not an 8 bit integer. You're not returning 11001100 you're returning 00000000 00000000 00000000 11001100.
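To see it concretely, here is a sketch (the function is the rejected attempt, merely renamed; a packed byte of 0x9C should come back as the signed value -100):

#include <stdio.h>

static int xbyte_original(unsigned word, int bytenum)
{
    return (word >> (bytenum << 3)) & 0xFF;
}

int main(void)
{
    unsigned word = 0x00009C00u;              /* byte 1 holds 0x9C, i.e. -100 as a signed byte */
    printf("%d\n", xbyte_original(word, 1));  /* prints 156, not the expected -100 */
    return 0;
}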
To work with numbers, use signed integer types such as int.
To work with bits, use unsigned integer types such as unsigned. I.e. let the word argument be of type unsigned. That is what the unsigned types are for.
To multiply by 8, write just *8 (this does not mean that that part of the code is technically wrong, just that it is artificially contrived and needlessly unreadable).
Even better, create a self-describing name for that magic number 8, e.g. *bitsPerByte (the standard library calls it CHAR_BIT, which is not particularly self-describing nor readable).
Finally, at the design level, think about designing your functions so that the code that uses a function of yours – each call – becomes clear and readable. E.g. like int const b = byteAt( 2, x );. That can prevent bugs by e.g. preventing wrong actual argument order, and since designing for readability makes the code easier to read, it reduces time spent on that. :-)
Cheers & hth.,
Works fine for positive numbers. You may want to cast word to unsigned so the right shift is well defined even for integers with the MSB set.
int retrieveByteFromWord(int word, int byte)
{
    return ((unsigned)word >> (byte << 3)) & 0xFF;
}

C convert from int to char

I have a simple code
char t = (char)(3000);
Then the value of t is -72. The hex value of 3000 is 0xBB8. I can't understand why the value of t is -72.
Thanks for your answers.
I don't know about Macs. My result is -72. As far as I know, Macs use big endian, so does that affect the result? I don't have any Mac to test on, so I want to hear from Mac people.
The hex value of 3000 is 0xBB8.
And so the hex value of the char (which, by the way, appears to be signed on your compiler) is 0xB8.
If it were unsigned, 0xB8 would be 184. But since it's signed, its actual value is 256 less, i.e. -72.
If you want to know why this is, read about two's complement notation.
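A quick way to see it (a sketch, assuming an 8-bit char that is signed on your platform, so the out-of-range conversion truncates to the low byte the way yours evidently does):

#include <stdio.h>

int main(void)
{
    char t = (char)3000;                 /* 3000 is 0x0BB8; only the low byte 0xB8 survives */
    printf("%d\n", t);                   /* prints -72 where char is signed and 8 bits wide */
    printf("%d\n", (unsigned char)t);    /* prints 184: the same bits read as unsigned      */
    return 0;
}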
A char is typically 8 bits (which can represent only 256 distinct values). Trying to cast 3000 to a char cannot preserve the value, at least not in the way you are intending.
This is happening because 3000 is too big a value and causes an overflow. Char is generally from -128 to 127 signed, or 0 to 255 unsigned, but it can change depending upon the implementation.
char is an integral type with certain range of representable values. int is also an integral type with certain range of representable values. Normally, range of int is [much] wider than that of char. When you try to squeeze into a char an int value that doesn't fit into the range of char, the value will not "fit", of course. The actual result is implementation-defined.
In your case 3000 is an int value that doesn't fit into the range of char on your implementation. So, you won't get 3000 as the result. If you really want to know why it specifically came out as -72, consult the documentation that came with your implementation.
As specified, the 16-bit hex value of 3000 is 0x0BB8. Although implementation-specific, from your posted results this is likely stored in memory as the byte pair B8 0B (some architectures would store it as 0B B8; this is known as endianness).
char, on the other hand, is probably not a 16-bit type. Again, this is implementation-specific, but from your posted results it appears to be 8 bits, which is not uncommon.
So while your program has allocated 8 bits of memory for your value, the value itself needs 16 bits. When the int is converted to char, only the low-order octet, in this case B8, is kept; the 0B is simply discarded. (Nothing else in memory gets written, so despite the narrowing this is not a buffer overflow.)
Assuming two's complement (technically implementation-specific, but a reasonable assumption), the hex value B8 translates to either -72 or 184 in decimal, depending on whether you're dealing with a signed or unsigned type. Since you didn't specify either, your compiler goes with its default. Yet again, this is implementation-specific, and it appears your compiler goes with signed char.
Therefore, you get -72. But don't expect the same results on any other system.
A char is (typically) just 8 bits, so you can't store values as large as 3000 (which would require at least 12 bits). So if you try to store 3000 in a byte, it will just wrap.
Since 3000 is 0xBB8, it requires two bytes, one 0x0B and one 0xB8. If you try to store it in a single byte, you will just get the low one (0xB8). And since a char is (typically) signed, that is -72.
char is used to hold a single character, and you're trying to store a 4-digit int in one. Perhaps you meant to use an array of chars, or a string (char t[5] in this case, leaving room for the terminating '\0').
To convert an int to a string (untested):
#include <stdio.h>

int main(void) {
    int num = 3000;
    char numString[5];                                  /* 4 digits plus the terminating '\0' */
    snprintf(numString, sizeof numString, "%d", num);   /* itoa() is not standard C */
    printf("%s\n", numString);                          /* prints 3000 */
    return 0;
}
Oh, I get it, it's overflow. It's like char only goes from -128 to 127 or something like that, I'm not sure; like if you have a variable whose type's max limit is 127 and you add 1 to it, it wraps around to -128, and so on.

Resources