Why this redefinition of sizeof works

I'm redefining sizeof as:
#undef sizeof
#define sizeof(type) ((char*)((type*)(0) + 1) - (char*)((type*)(0)))
For this to work, the two 0s in the definition need to be the same entity in memory, or in other words, need to have the same address. Is this always guaranteed, or is it compiler-, architecture-, or runtime-dependent?

The 0 here is not an object – it is an address. So the question you ask is something of a non-sequitur.

You are thinking that the zeros are discrete pieces of data that need to be stored somewhere. They aren't: they are being cast to pointers to memory location zero.
When you increment a pointer to a type, it is actually incremented by the size of the type it points to. This is how C array arithmetic works.
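For instance (a minimal sketch; the only assumption is that both pointers point into the same object, which makes the subtraction well defined):
#include <stdio.h>
int main(void) {
    int a[2];
    int *p = a;
    /* p + 1 advances by sizeof(int) bytes, not by 1 byte */
    printf("%zu\n", (size_t)((char *)(p + 1) - (char *)p)); /* prints sizeof(int) */
    return 0;
}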

In practice, a null pointer of a certain type always refers to the same location in memory (especially when constructed the same way, as you do above), simply because any other implementation would be senseless.
However, the standard actually does not guarantee much about this:
"[...] is guaranteed to compare unequal to a pointer to any object or function." 6.3.2.3§3
"[...] Any two null pointers shall compare equal." 6.3.2.3§4
This leaves a lot of leeway. Assume a memory model with two distinct regions. Each region could have its own range of null pointers (say, the first 128 bytes of each region). It is easy to see that even in that weird case, the basic assumptions about null pointers can indeed hold, given a proper compiler that generates correspondingly weird null tests...
So, what else do we know about pointers in general...
What you are trying to do is, first, increment a pointer
"one operand shall be a pointer to a complete object type and the other shall have integer type. (Incrementing is equivalent to adding 1.)" [6.5.6§2]
and then a pointer difference
"both operands are pointers to qualified or unqualified versions of compatible complete object types" [6.5.6§3]
OK, they are (well, assuming type is a complete object type). But what about semantics?
"For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type." [6.5.6§7]
This is actually a bit of a problem: The null pointer need not point to an actual object! (Otherwise you could dereference it safely...) Therefore, incrementing it or subtracting it from another pointer is UB!
To conclude: 0 does not point to an object, and therefore the answer to your question is No.
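For contrast, the same arithmetic is well defined when a real object is involved, because 6.5.6§7 then treats the pointer as pointing into a one-element array. A minimal sketch:
#include <stdio.h>
struct point { double x, y; };
int main(void) {
    struct point obj;       /* an actual object */
    struct point *p = &obj; /* behaves as a pointer into an array of length one */
    /* p + 1 points one past the end, so increment and subtraction are defined */
    printf("%zu\n", (size_t)((char *)(p + 1) - (char *)p)); /* equals sizeof(struct point) */
    return 0;
}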

A strictly standards-conforming compiler could reject this, or return some nonsense. On "typical" machines, integers and pointers have the same size, and casting an integer to a pointer just takes that bit pattern and looks at it as a pointer. There are machines where words contain extra data (type tags, perhaps, or permission bits). Some addresses might be forbidden for certain objects (i.e., nothing can have address 0), and so on. And while it is guaranteed that sizeof(char) == 1, on e.g. Crays a character actually occupies 32 bits.
Besides, the C standard guarantees that the expression in sizeof(expression) is not evaluated at all; only its type is taken. I.e., sizeof(x++) doesn't increment x.
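A quick demonstration of that unevaluated-operand rule (variably modified types are the one exception, where the operand may be evaluated):
#include <stdio.h>
int main(void) {
    int x = 0;
    size_t s = sizeof(x++);   /* operand is not evaluated; only its type matters */
    printf("%zu %d\n", s, x); /* prints sizeof(int) and 0: x was never incremented */
    return 0;
}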

Related

Is it defined behavior to add 0 to a null pointer? [duplicate]

I noticed this warning from Clang:
warning: performing pointer arithmetic on a null pointer
has undefined behavior [-Wnull-pointer-arithmetic]
In details, it is this code which triggers this warning:
int *start = ((int*)0);
int *end = ((int*)0) + count;
The constant literal zero, converted to any pointer type, yields a null pointer, which does not point to any contiguous area of memory but still has the type "pointer to type" needed to do pointer arithmetic.
Why would arithmetic on a null pointer be forbidden when doing the same on a non-null pointer obtained from an integer different than zero does not trigger any warning?
And more importantly, does the C standard explicitly forbid null pointer arithmetic?
Also, this code will not trigger the warning, but this is because the pointer is not evaluated at compile time:
int *start = ((int*)0);
int *end = start + count;
But a good way of avoiding the undefined behavior is to explicitly cast an integer value to the pointer:
int *end = (int *)(sizeof(int) * count);
The C standard does not allow it.
6.5.6 Additive operators (emphasis mine)
8 When an expression that has integer type is added to or
subtracted from a pointer, the result has the type of the pointer
operand. If the pointer operand points to an element of an array
object, and the array is large enough, the result points to an element
offset from the original element such that the difference of the
subscripts of the resulting and original array elements equals the
integer expression. In other words, if the expression P points to the
i-th element of an array object, the expressions (P)+N (equivalently,
N+(P)) and (P)-N (where N has the value n) point to, respectively, the
i+n-th and i-n-th elements of the array object, provided they exist.
Moreover, if the expression P points to the last element of an array
object, the expression (P)+1 points one past the last element of the
array object, and if the expression Q points one past the last element
of an array object, the expression (Q)-1 points to the last element of
the array object. If both the pointer operand and the result point to
elements of the same array object, or one past the last element of the
array object, the evaluation shall not produce an overflow; otherwise,
the behavior is undefined. If the result points one past the last
element of the array object, it shall not be used as the operand of a
unary * operator that is evaluated.
For the purposes of the above, a pointer to a single object is considered as pointing into an array of 1 element.
Now, ((int*)0) does not point at an element of an array object, simply because a pointer holding a null pointer value does not point at any object. This is stated at:
6.3.2.3 Pointers
3 If a null pointer constant is converted to a pointer type, the
resulting pointer, called a null pointer, is guaranteed to compare
unequal to a pointer to any object or function.
So you can't do arithmetic on it. The warning is justified, because as the second highlighted sentence mentions, we are in the case of undefined behavior.
Don't be fooled by the fact that the offsetof macro is possibly implemented like that. The standard library is not bound by the constraints placed on user programs; it can employ deeper knowledge. But doing this in our code is not well defined.
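For reference, the classic null-pointer definition alluded to here looks something like the sketch below (OFFSETOF is a hypothetical name; real <stddef.h> headers nowadays typically use a compiler builtin such as __builtin_offsetof instead):
#include <stddef.h>
/* Only the implementation may define it this way; in user code this is UB. */
#define OFFSETOF(type, member) ((size_t)&(((type *)0)->member))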
When the C Standard was written, the vast majority of C implementations would, for any non-void* pointer value p, uphold the invariants that p+0 and p-0 both yield p, and p-p will yield zero. More generally, operations like a size-zero memcpy or fwrite that operate on a buffer of size N would ignore the buffer address when N was zero. Such behavior would allow programmers to avoid having to write code to handle corner cases. For example, code to output a packet with an optional payload passed via address and length arguments would naturally process (NULL,0) as an empty payload.
Nothing in the published Rationale for the C Standard suggests that implementations whose target platforms would naturally behave in such fashion shouldn't continue to work as they always had. There were, however, a few platforms where it may have been expensive to uphold such behavioral guarantees in cases where p is null.
As with most situations where the vast majority of C implementations would process a construct identically, but implementations might exist where such treatment would be impractical, the Standard characterizes the addition of zero to a null pointer as Undefined Behavior. The Standard allows implementations to, as a form of "conforming language extension", define the behavior of constructs in cases where it imposes no requirements, and it allows conforming (but not strictly conforming) programs to make use of them. According to the published Rationale, the stated intention was that support for such "popular extensions" be regarded as a "quality of implementation" issue to be decided by the marketplace. Implementations that could support them at essentially zero cost would do so, but implementations where such support would be expensive would be free to support such constructs or not based upon their customers' needs.
If one is using a compiler that targets commonplace platforms, and is designed to process the widest range of useful programs reasonably efficiently, then the extended semantics surrounding pointer arithmetic may allow one to write code more efficiently than would otherwise be possible. If one is using a compiler that does not value such compatibility, however, one should recognize that it may treat the Standard's allowance for quirky hardware as an invitation to behave nonsensically even on commonplace hardware. Of course, one should also be aware that such compilers may behave nonsensically in corner cases where adherence to the Standard would require them to forego optimizations that are unsound but would "usually" be safe.

Accessing bytes of an object in C

Unfortunately, I haven't found anything like std-discussion for the ISO C standard, so I'll ask here.
Before answering the question, make sure you are familiar with the idea of pointer provenance (see DR260 and/or http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2263.htm).
6.3.2.3(Pointers) paragraph 7 says:
When a pointer to an object is converted to a pointer to a character type,
the result points to the lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining bytes of the object.
The questions are:
What does "a pointer to an object" mean? Does it mean that if we want to get a pointer to the lowest addressed byte of an object of type T, we must convert a pointer of type cv T* to a pointer to a character type, and that converting a pointer to void obtained from the T* pointer won't give us the desired result? Or is "a pointer to an object" the value of a pointer as determined by pointer provenance, so that a cast to void* does not change the value (analogous to how this was recently formalized in C++17)?
Why does the paragraph explicitly mention increments? Does it mean that adding a value greater than 1 is undefined? Does it mean that decrementing (after we have incremented the result several times, so that we won't go below the lower object boundary) is undefined? In short: is the sequence of bytes composing an object an array?
The general description of pointer addition suggests that for any pointer value p and signed integers x and y where ((p+x)+y) and (x+y) are both defined by the Standard, (p+(x+y)) would behave equivalently to ((p+x)+y). While it might be possible to argue that the Standard doesn't explicitly say that incrementing a pointer five times would be equivalent to adding 5, there is nothing in the Standard that would suggest that quality implementations should not be expected to behave in that fashion.
Note that the authors of the Standard didn't try to make it "language-lawyer-proof". Further, they didn't think anyone would care about whether or not an obviously-inferior implementation was "conforming". An implementation that only works reliably if bytes of an object are accessed sequentially would be less versatile than one which supported reliable indexing, while offering no plausible advantage. Consequently, there should be no need for the Standard to mandate support for indexing, because anyone trying to produce a quality implementation would support it whether the Standard mandated it or not.
Of course, there are some constructs which programmers in the 1990s--or even the authors of the Standard themselves--expected quality compilers to handle reliably, but which some of today's "clever" compilers don't. Whether that means such expectations were unreasonable, or whether they remain accurate when applied to quality compilers, is a matter of opinion. In this particular case, I think the implication that positive indexing should behave like repeated incrementing is strong enough that I wouldn't expect compiler writers to argue otherwise, but I'm not 100% certain that no compiler would ever be "clever"/obtuse enough to look at something like:
int test(unsigned char foo[5][5], int x)
{
    foo[1][0] = 1;
    // Following should yield a pointer that can be used to access the entire
    // array 'foo', but an obtuse compiler writer could perhaps argue that the
    // code is converting the address of foo[0] into a pointer to the first
    // element of that sub-array, and that the resulting pointer is thus only
    // usable to access items within that sub-array.
    unsigned char *p = (unsigned char*)foo;
    // Following should be able to access any element of the object [i.e. foo]
    // whose address was taken
    p[x] = 2;
    return foo[1][0];
}
and decide that it could skip the second read of foo[1][0] since p[x] wouldn't possibly access any element of foo beyond the first row. I would, however, say that programmers should not try to code around the possibility of vandals writing a compiler that would behave that way. No program can be made bullet-proof against vandals writing obtuse-but-"conforming" compilers, and the fact that a program can be undermined by such vandals should hardly be viewed as a defect.
Take a non-char object and create a pointer to it:
int obj;
int *objPtr = &obj;
Convert the pointer to the object into a pointer to char:
char *charPtr = (char *)objPtr;
Now charPtr points to the lowest addressed byte of the int obj.
Increment it:
charPtr++;
Now it points to the next byte of the object, and so on until you reach the size of the object:
int i;
for (i = 0; i < sizeof(obj); i++)
printf("%d", *charPtr++);
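Put together as a runnable sketch (unsigned char is used to avoid sign-extension surprises; the byte values printed depend on the platform's endianness):
#include <stdio.h>
int main(void) {
    int obj = 0x01020304;
    unsigned char *charPtr = (unsigned char *)&obj; /* lowest addressed byte */
    size_t i;
    for (i = 0; i < sizeof obj; i++)
        printf("%02x ", charPtr[i]); /* the remaining bytes, in address order */
    putchar('\n');
    return 0;
}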

Does C have more macros like NULL?

Background:
When deleting a cell in a hash table that uses linear probing, you have to indicate that a value once existed at that cell but can be skipped during a search. The easiest way to solve this is to add another variable to store this information, but this extra variable can be avoided if a guaranteed-invalid memory address is known and is used to represent this state.
Question:
I assume that since 0 is (more often than not) a guaranteed invalid memory address, there must be others besides NULL. So my question is: does C provide a standard macro for any other guaranteed-invalid memory address?
Technically, NULL is not guaranteed to be invalid. It is only guaranteed not to be the address of any object (C11 6.3.2.3:3):
An integer constant expression with the value 0, or such an expression
cast to type void *, is called a null pointer constant(66). If a null
pointer constant is converted to a pointer type, the resulting
pointer, called a null pointer, is guaranteed to compare unequal to a
pointer to any object or function.
(66) The macro NULL is defined in <stddef.h> (and other headers) as a null pointer constant
Your usage does not require the special address value to be invalid either: obviously, you are not accessing it, unless segfaulting is part of the normal behavior of your program.
So you could use the addresses of as many objects as you like, as long as the addresses of these objects are not intended to be part of the normal contents of a cell.
For instance, for an architecture where converting between pointers to objects preserves the representation, you could use:
char a, b, …;
#define NULL1 (&a)
#define NULL2 (&b)
…
Strictly speaking, NULL is not required to be numerically zero at runtime. C translates 0 and NULL, in a pointer context, into an implementation-defined invalid address. That address is often numerically zero, but that is not guaranteed by the C standard. To the best of my knowledge, C itself does not provide any invalid addresses guaranteed to be distinct from NULL.
You can also create your own 'invalid' address pointer:
const void* const SOME_MARKER = (void*) &x;
If you make sure that x (or its address) can never be actually used where you want to use SOME_MARKER you should be safe and 100% portable.
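A sketch of that idea applied to the hash-table question (TOMBSTONE and slot_is_free_for_insert are hypothetical names, not anything the standard provides):
#include <stddef.h>
/* A private object whose address can never collide with stored data,
   provided &tombstone_obj is never inserted as a real value. */
static char tombstone_obj;
#define TOMBSTONE ((void *)&tombstone_obj)
/* A slot is empty (NULL), deleted (TOMBSTONE), or holds a real pointer;
   both of the first two may be overwritten on insert. */
static int slot_is_free_for_insert(void *slot) {
    return slot == NULL || slot == TOMBSTONE;
}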

If pointers are just addresses, why can't I assign a specific int address to a pointer?

As it has been explained to me, a pointer is a variable that holds an address. Why, then, am I unable to do int *p = 5;? Wouldn't that correctly point to memory address 5?
Pointers are data types, not just addresses, and int is a different, incompatible data type, so you cannot assign an int to a pointer.
If you really need that, and you know what you're doing, you can cast an int to the specific pointer type (e.g. char *p = (char*)5;), but you need such things only in rare cases.
a pointer is a variable that holds an address
No no no no no.
So many questions in the C tag on StackOverflow could be avoided if people were actually taught the basics. The rules are:
A variable is a storage location.
A storage location holds a value.
A pointer is a particular kind of value.
A pointer may be dereferenced. Dereferencing a pointer produces a storage location.
A storage location may be addressed. Addressing a storage location produces a pointer.
So let's review your sentence:
a pointer is a variable ...
No, a pointer is a value. There might be a variable that holds a pointer, just like there might be a variable that holds an int. But an int isn't a variable, and neither is a pointer. Similarly, a pointer might be dereferenced to produce the storage location associated with a variable, but that is not saying that the pointer is the variable.
... that holds an address
A pointer by definition is an address, but what an address actually consists of is an implementation detail. A pointer holds something that when dereferenced produces a storage location. A common choice for implementers of C compilers is to make the value of a pointer be a numeric offset into a large block of virtual memory, but there is no requirement that an author of a C compiler use integer offsets into virtual memory as pointers.
There is a requirement that addresses be convertible to integers and that integers be convertible back to addresses, and that these mappings obey certain restrictions on their behaviour. But for all you know the address could be the string "5" and the conversion to integer could be to call atoi on that thing. That would be silly, but it would be legal.
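Those mappings are exposed via the (optional) uintptr_t type from <stdint.h>; for void* the standard guarantees that a round trip through it compares equal to the original pointer, whatever the representation is:
#include <stdint.h>
#include <stdio.h>
int main(void) {
    int x = 42;
    void *p = &x;
    uintptr_t bits = (uintptr_t)p;  /* implementation-defined mapping */
    void *back = (void *)bits;      /* converts back to a pointer equal to p */
    printf("%d\n", *(int *)back);   /* prints 42 */
    return 0;
}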
Why am I unable to do int *p = 5;
Who says that you are unable?
The C11 specification says:
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation. The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
So there you go. If the compiler you are using gives a meaning to mapping 5 to a particular address, you go be awesome. If it does not, then you're in for the worst sort of undefined behaviour. Basically, there is no requirement on the developer of a C compiler to make any conversion between pointers and integers meaningful, except that a constant zero is always a null pointer, and no valid object has the address zero.
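The one context where casting a specific integer to a pointer is routine is freestanding/embedded code, where the implementation documents the mapping. A sketch (the address below is purely illustrative, taken from no real datasheet):
#include <stdint.h>
/* Hypothetical memory-mapped status register on some fictional chip;
   meaningful only where the implementation defines this mapping. */
#define STATUS_REG ((volatile uint32_t *)0x40001000u)
uint32_t read_status(void) {
    return *STATUS_REG; /* performs a hardware register read on such a platform */
}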
Yes, theoretically that would be correct (except that address 5 is probably nothing you should use). You may get a warning or error from the compiler, because assigning a plain number to a pointer is unusual, but with a cast it will accept it:
int *p = (int*)5;
You can't assign an address to a pointer like that, because you don't have the right to use it. Maybe the address is already used by some other program or the OS. You have to allocate an address:
int *n;
n = malloc(sizeof(int));

why is array name a pointer to the first element of the array?

Is this always the case? I mean, is the array name always a pointer to the first element of the array? Why is it so? Is it an implementation thing, or a language feature?
An array name is not itself a pointer, but decays into a pointer to the first element of the array in most contexts. It's that way because the language defines it that way.
From C11 6.3.2.1 Lvalues, arrays, and function designators, paragraph 3:
Except when it is the operand of the sizeof operator, the _Alignof operator, or the unary & operator, or is a string literal used to initialize an array, an expression that has type "array of type" is converted to an expression with type "pointer to type" that points to the initial element of the array object and is not an lvalue.
You can learn more about this topic (and lots about the subtle behaviour involved) from the Arrays and Pointers section of the comp.lang.c FAQ.
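A short illustration of the exceptions listed in the quoted paragraph; sizeof and unary & see the array type itself, while ordinary use decays to a pointer:
#include <stdio.h>
int main(void) {
    int a[10];
    int *p = a;          /* decay: pointer to the first element */
    printf("%zu %zu\n",
           sizeof a,     /* whole array: 10 * sizeof(int) */
           sizeof p);    /* just a pointer */
    int (*pa)[10] = &a;  /* & yields a pointer to the array, not to int */
    (void)pa;            /* silence the unused-variable warning */
    return 0;
}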
Editorial aside: The same kind of behaviour takes place in C++, though the language specifies it a bit differently. For reference, from a C++11 draft I have here, 4.2 Array-to-pointer conversion, paragraph 1:
An lvalue or rvalue of type "array of N T" or "array of unknown bound of T" can be converted to an rvalue of type "pointer to T". The result is a pointer to the first element of the array.
The historical reason for this behavior can be found here.
C was derived from an earlier language named B (go figure). B was a typeless language, and memory was treated as a linear array of "cells", basically unsigned integers.
In B, when you declared an N-element array, as in
auto a[10];
N cells were allocated for the array, and another cell was set aside to store the address of the first element, which was bound to the variable a. As in C, array indexing was done through pointer arithmetic:
a[j] == *(a+j)
This worked pretty well until Ritchie started adding struct types to C. The example he gives in the paper is a hypothetical file system entry, which is a node id followed by a name:
struct {
int inumber;
char name[14];
};
He wanted the contents of the struct type to match the data on the disk; 2 bytes for an integer immediately followed by 14 bytes for the name. There was no good place to stash the pointer to the first element of the array.
So he got rid of it. Instead of setting aside storage for the pointer, he designed the language so that the pointer value would be computed from the array expression itself.
This, incidentally, is why an array expression cannot be the target of an assignment; it's effectively the same thing as writing 3 = 4; - you'd be trying to assign a value to another value.
Carl Norum has given the language-lawyer answer on the question (and got my upvote on it), here comes the implementation detail answer:
To the computer, any object in memory is just a range of bytes, and, as far as memory handling is concerned, uniquely identified by an address to the first byte and a size in bytes. Even when you have an int in memory, its address is nothing more or less than the address of its first byte. The size is almost always implicit: If you pass a pointer to an int, the compiler knows its size because it knows that the bytes at that address are to be interpreted as an int. The same goes for structures: their address is the address of their first byte, their size is implicit.
Now, the language designers could have implemented similar semantics for arrays as they did for structures, but they didn't, for a good reason: copying was then even more inefficient relative to just passing a pointer than it is now, structures were already passed around via pointers most of the time, and arrays are usually meant to be large, prohibitively large to force value semantics on them at the language level.
Thus, arrays were simply forced to be memory objects at all times, by specifying that the name of an array would be virtually equivalent to a pointer. In order not to break the similarity of arrays to other memory objects, the size was again said to be implicit (to the language implementation, not the programmer!): the compiler could just forget about the size of an array when it was passed somewhere else and rely on the programmer to know how many objects were inside the array.
This had the benefit that array accesses are excruciatingly simple; they decay to a matter of pointer arithmetic, of multiplying the index by the size of the object in the array and adding that offset to the pointer. It's the reason why a[5] is exactly the same as 5[a]; it's shorthand for *(a + 5).
Another performance-related aspect is that it is excruciatingly simple to make a subarray from an array: only the start address needs to be calculated. There is nothing that would force us to copy the data into a new array; we just have to remember to use the correct size...
So, yes, it has profound reasons in terms of implementation simplicity and performance that array names decay to pointers the way they do, and we should be glad for it.
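The equivalence mentioned above is easy to check (purely illustrative; nobody should write 5[a] in real code):
#include <stdio.h>
int main(void) {
    int a[6] = {10, 11, 12, 13, 14, 15};
    /* after decay, a[5], *(a + 5), and 5[a] are the same expression */
    printf("%d %d %d\n", a[5], *(a + 5), 5[a]); /* prints 15 15 15 */
    return 0;
}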
