Why (1)["abcd"]+"efg"-'b'+1 becomes "fg"? - c

#include <stdio.h>
int main()
{
printf("%s", (1)["abcd"]+"efg"-'b'+1);
}
Can someone please explain why the output of this code is:
fg
I know (1)["abcd"] points to "bcd" but why +"efg"-'b'+1 is even a valid syntax ?

I know (1)["abcd"] points to "bcd"
No. (1)["abcd"] is a single char (b).
So (1)["abcd"]+"efg"-'b'+1 is: 'b' + "egf" - 'b' + 1 and if you simplify it, it becomes "efg" + 1. Hence it prints fg.
Note: The above answer explains only the observed behaviour which is not strictly legal as per the C language specification. Here's why.
case 1: 'b' < 0 or 'b' > 4
In this case, the expression (1)["abcd"] + "efg" - 'b' + 1 will lead to undefined behaviour, due to the sub-expression (1)["abcd"] + "efg", which is 'b' + "efg" producing an invalid pointer expression (C11, 6.5.5 Multiplicative operators -- quote below).
On the widely used ASCII character set, 'b' is 98 in decimal; on the not-so-widely used EBCDIC character set, 'b' is 130 in decimal. So the sub-expression (1)["abcd"] + "efg" would cause undefined behaviour on a system using either of these two.
So barring a weird architecture, where 'b' <= 4 and 'b' >= 0, this program would cause undefined behaviour due to how the
C language is defined:
C11, 5.1.2.3 Program execution
The semantic descriptions in this International Standard describe the
behavior of an abstract machine in which issues of optimization are
irrelevant. [...] In the abstract machine, all expressions are
evaluated as specified by the semantics. An actual implementation need
not evaluate part of an expression if it can deduce that its value is
not used and that no needed side effects are produced.
which categorically states that whole standard has been defined based on the abstract machine's behaviour.
So in this case, it does cause undefined behaviour.
case 2: 'b' >= 0 or 'b' <= 4 (This is quite imaginary, but in theory, it's possible).
In this case, the subexpression (1)["abcd"] + "efg" can be valid (and in turn, the whole expression (1)["abcd"] + "efg" - 'b' + 1).
The string literal "efg" consists of 4 chars, which is an array type (of type char[N] in C) and and the C standard guarantees (as quoted above) that the pointer expression evaluating to one-past the end of an array doesn't overflow or cause undefined behaviour.
The following are the possible sub-expressions and they are valid:
(1) "efg"+0 (2) "efg"+1 (3) "efg"+2 (4) "efg"+3 and (5) "efg"+4 because C standard states that:
C11, 6.5.5 Multiplicative operators
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
In other words, if the expression P points to the i-th element of an
array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N
(where N has the value n) point to, respectively, the i+n-th and
i−n-th elements of the array object, provided they exist. Moreover, if
the expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array object,
and if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the array
object. If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the last element
of the array object, it shall not be used as the operand of a unary *
operator that is evaluated.
So it's not causing undefined behaviour in this case.
Thanks #zch & #Keith Thompson for digging out the relevant parts of C standard :)

There seems to be some confusion about the difference between the other two answers. Here's what happens, step by step:
(1)["abcd"]+"efg"-'b'+1
The first part, (1)["abcd"] takes advantage of the way arrays are processed in C. Let's look at the following:
int a[5] = { 0, 10, 20, 30, 40 };
printf("%d %d\n", a[2], 2[a]);
The output will be 20 20. Why? because the name of an array of int evaluates to its address, and its data type is pointer to int. Referring to an element of the integer array tells C add an offset to the address of the array and evaluate the result as type int. But this means C doesn't care about the order: a[2] is exactly the same as 2[a].
Similarly, since a is the address of the array, a + 1 is the address of the element at the first offset into the array. Of course, that's equivalent to 1 + a.
A string in C is just another, human-friendly, way of representing an array of type char. So (1)["abcd"] is the same as returning the element at the first offset into an array of the characters a, b, c, d, \0 ... which is the character b.
In C, every character has an integral value (generally its ASCII code). The value of b happens to be 98. The remainder of the evaluation, therefore, involves calculations with integers and an array: the character string "efg".
We have the address of the string. We add and subtract 98 (the ASCII value of the character b), and we add 1. The b's cancel each other, so the net result is one more than the address of the first character in the string, which is the address of the character f.
The %s conversion in the printf() tells C to treat the address as the first character in a string, and to print the entire string until it encounters the null character at the end.
So it prints fg, which is the part of the string "efg" that starts at the f.

Related

Why is the printf statement in the code below printing a value rather than a garbage value?

int main(){
int array[] = [10,20,30,40,50] ;
printf("%d\n",-2[array -2]);
return 0 ;
}
Can anyone explain how -2[array-2] is working and Why are [ ] used here?
This was a question in my assignment it gives the output " -10 " but I don't understand why?
Technically speaking, this invokes undefined behaviour. Quoting C11, chapter §6.5.6
If both the pointer
operand and the result point to elements of the same array object, or one past the last
element of the array object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. [....]
So, (array-2) is undefined behavior.
However, most compilers will read the indexing, and it will likely be able to nullify the +2 and -2 indexing, [2[a] is same as a[2] which is same as *(a+2), thus, 2[a-2] is *((2)+(a-2))], and only consider the remaining expression to be evaluated, which is *(a) or, a[0].
Then, check the operator precedence
-2[array -2] is effectively the same as -(array[0]). So, the result is the value array[0], and -ved.
This is an unfortunate example for instruction, because it implies it's okay to do some incorrect things that often work in practice.
The technically correct answer is that the program has Undefined Behavior, so any result is possible, including printing -10, printing a different number, printing something different or nothing at all, failing to run, crashing, and/or doing something entirely unrelated.
The undefined behavior comes up from evaluating the subexpression array -2. array decays from its array type to a pointer to the first element. array -2 would point at the element which comes two positions before that, but there is no such element (and it's not the "one-past-the-end" special rule), so evaluating that is a problem no matter what context it appears in.
(C11 6.5.6/8 says)
When an expression that has integer type is added to or subtracted from a pointer, .... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Now the technically incorrect answer the instructor is probably looking for is what actually happens on most implementations:
Even though array -2 is outside the actual array, it evaluates to some address which is 2*sizeof(int) bytes before the address where the array's data starts. It's invalid to dereference that address since we don't know that there actually is any int there, but we're not going to.
Looking at the larger expression -2[array -2], the [] operator has higher precedence than the unary - operator, so it means -(2[array -2]) and not (-2)[array -2]. A[B] is defined to mean the same as *((A)+(B)). It's customary to have A be a pointer value and B be an integer value, but it's also legal to use them reversed like we're doing here. So these are equivalent:
-2[array -2]
-(2[array -2])
-(*(2 + (array - 2)))
-(*(array))
The last step acts like we would expect: Adding two to the address value of array - 2 is 2*sizeof(int) bytes after that value, which gets us back to the address of the first array element. So *(array) dereferences that address, giving 10, and -(*(array)) negates that value, giving -10. The program prints -10.
You should never count on things like this, even if you observe it "works" on your system and compiler. Since the language guarantees nothing about what will happen, the code might not work if you make slight changes which seem they shouldn't be related, or on a different system, a different compiler, a different version of the same compiler, or using the same system and compiler on a different day.
Here is how -2[array-2] is evaluated:
First, note that -2[array-2] is parsed as - (2[array-2]). The subscript operator, [...] has higher precedence than the unary - operator. We often think of constants like -2 as single numbers, but it is in fact a - operator applied to a 2.
In array-2, array is automatically converted to a pointer to its first element, so it points to array[0].
Then array-2 attempts to calculate a pointer to two elements before the first element of the array. The resulting behavior is not defined by the C standard because C 2018 6.5.6 8 says that only arithmetic that points to array members and the end of the array is defined.
For illustration only, suppose we are using a C implementation that extends the C standard by defining pointers to use a flat address space and permit arbitrary pointer arithmetic. Then array-2 points two elements before the array.
Then 2[array-2] uses the fact that the C standard defines E1[E2] to be *((E1)+(E2)). That is, the subscript operator is implemented by adding the two things and applying *. Thus, it does not matter which expression is E1 and which is E2. E1+E2 is the same as E2+E1. So 2[array-2] is *(2 + (array-2)). Adding 2 moves the pointer from two elements before the array back to the start of the array. Then applying * produces the element at that location, which is 10.
Finally, applying - gives −10. (Recall that this conclusion is only achieved using our supposition that the C implementation supports a flat address space. You cannot use this in general C code.)
This code invokes undefined behavior and can print anything, including -10.
C17 6.5.2.1 Array subscripting states:
The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2)))
Meaning array[n] is equivalent to *((array) + (n)) and that's how the compiler evaluates subscripting. This allows us to write silly obfuscation like n[array] as 100% equivalent to array[n]. Because *((n) + (array)) is equivalent to *((array) + (n)). As explained here:
With arrays, why is it the case that a[5] == 5[a]?
Looking at the expression -2[array -2] specifically:
[array -2] and [array - 2] are naturally equivalent. In this case the former is just sloppy style purposely used for the sake of obfuscating the code.
Operator precedence tells us to first consider [].
Thus the expression is equivalent to -*( (2) + (array - 2) )
Note that the first - is not part of the integer constant 2. C does not support negative integer constants1), the - is actually the unary minus operator.
Unary minus has lower presedence than [], so the 2 in -2[ "binds" to the [.
The sub-expression (array - 2) is evaluated individually and invokes undefined behavior, as per C17 6.5.6/8:
When an expression that has integer type is added to or subtracted from a pointer, the
result has the type of the pointer operand. /--/ If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Speculatively, one potential form of undefined behavior could be that a compiler decides to replace the whole expression (2) + (array - 2) with array, in which case the whole expression would end up as -*array and prints -10.
There's no guarantees of this and therefore the code is bad. If you were given the assignment to explain why the code prints -10, your teacher is incompetent. Not only is it meaningless/harmful to study obfuscation as part of C studies, it is harmful to rely on undefined behavior or expect it to give a certain result.
1) C rather supports negative integer constant expressions. -2 is an integer constant expression, where 2 is an integer constant of type int.

How does sizeof operator behaves in below code snippet?

Please explain the OP for below code snippet :
int *a="";
char *b=NULL;
float *c='\0' ;
printf(" %d",sizeof(a[1])); // prints 4
printf(" %d",sizeof(b[1])); // prints 1
printf(" %d",sizeof(c[1])); // prints 4
Compiler interprets a[1] as *(a+1) , so a has some address , now it steps 4 bytes ahead , then it will have some garbage value there so how is the OP 4 bytes , even if I do a[0] , still it prints 4 , although it is an empty string , so how come its size is 4 bytes ?
Here we are finding out the size of the variable the pointer is pointing to , so if I say size of a[1] , it means size of *(a+1), Now a has the address of a string constant which is an empty string , after I do +1 to that address it moves 4 bytes ahead , now its at some new address , now how do we know the size of this value , it can be an integer , a character or a float , anything , so how to reach to a conclusion for this ?
The sizeof operator does not evaluate its operand except one case.
From the C Standard (6.5.3.4 The sizeof and alignof operators)
2 The sizeof operator yields the size (in bytes) of its operand, which
may be an expression or the parenthesized name of a type. The size is
determined from the type of the operand. The result is an integer.
If the type of the operand is a variable length array type, the operand is evaluated; otherwise, the operand is not evaluated and the
result is an integer constant.
In this code snippet
int *a="";
char *b=NULL;
float *c='\0' ;
printf(" %d",sizeof(a[1])); // prints 4
printf(" %d",sizeof(b[1])); // prints 1
printf(" %d",sizeof(c[1])); // prints 4
the type of the expression a[1] is int, the type of the expression b[1] is char and the type of the expression c[1] is float.
So the printf calls output correspondingly 4, 1, 4.
However the format specifiers in the calls are specified incorrectly. Instead of "%d" there must be "%zu" because the type of the value returned by the sizeof operator is size_t.
From the same section of the C Standard
5 The value of the result of both operators is implementation-defined,
and its type (an unsigned integer type) is size_t, defined in
<stddef.h> (and other headers).
This is all done statically, i.e. no dereferencing is happening at runtime. This is how the sizeof operator works, unless you use variable-length arrays (VLAs), then it must do work at runtime.
Which is why you can get away with sizeof:ing through a NULL pointer, and other things.
You should still be getting trouble for
int *a = "";
which makes no sense. I really dislike the c initializer too, but at least that makes sense.
sizeof operator happens at compilation (except for VLA's). It is looking at the type of an expression, not the actual data so even something like this will work:
sizeof(((float *)NULL)[1])
and give you the size of a float. Which on your system is 4 bytes.
Live example
Even though this looks super bad, it is all well defined, since no dereference ever actually occurs. This is all operations on type information at compile time.
sizeof() is based on the data type, so whilst it's getting the sizes outside the bounds of memory allocated to your variables, it doesn't matter as it's worked out at compile time rather than run time.

What is the rationale for one past the last element of an array object?

According to N1570 (C11 draft) 6.5.6/8 Additive operators:
Moreover, if the expression P points to the last element of an array
object, the expression (P)+1 points one past the last element of the
array object, and if the expression Q points one past the last element
of an array object, the expression (Q)-1 points to the last element of
the array object
Subclause 6.5.6/9 also contains:
Moreover, if the expression P points either to an element of an array
object or one past the last element of an array object, and the
expression Q points to the last element of the same array object, the
expression ((Q)+1)-(P) has the same value as ((Q)-(P))+1 and as
-((P)-((Q)+1)), and has the value zero if the expression P points one past the last element of the array object, even though the expression
(Q)+1 does not point to an element of the array object.106)
This justifies pointer's arithmetic like this to be valid:
#include <stdio.h>
int main(void)
{
int a[3] = {0, 1, 2};
int *P, *Q;
P = a + 3; // one past the last element
Q = a + 2; // last element
printf("%td\n", ((Q)+1)-(P));
printf("%td\n", ((Q)-(P))+1);
printf("%td\n", -((P)-((Q)+1)));
return 0;
}
I would expect to disallow pointing to element of array out-of-bounds, for which dereference acts as undefined behaviour (array overrun), thus it makes it potentially dangerous. Is there any rationale for this?
Specifying the range to loop over as the half-closed interval [start, end), especially for array indices, has certain pleasing properties as Dijkstra observed in one of his notes.
1) You can compute the size of the range as a simple function of end - start. In particular, if the range is specified in terms of array indices, the number of iterations performed by the loop would be given by end - start. If the range was [start, end], then the number of iterations would have been end - start + 1 - very annoying, isn't it? :)
2) Dijsktra's second observation applies only to the case of (non-negative) integral indices - specifying a range as [start, end) and (start, end] both have the property mentioned in 1). However, specifying it as (start, end] requires you to allow an index of -1 to represent a loop range including the index 0 - you are allowing an "unnatural" value of -1 just for the sake of representing the range. The [start, end) convention does not have this issue, because end is a non-negative integer, and hence a natural choice when dealing with array indices.
Dijsktra's objection to allowing -1 does have similarities to allowing one past the last valid address of the container. However, since the above convention has been in use for so long, it likely persuaded the standards committee to make this exception.
The rationale is quite simple. The compiler is not allowed to place an array at the end of memory. To illustrate, assume that we have a 16-bit machine with 16-bit pointers. The low address is 0x0000. The high address is 0xffff. If you declare char array[256] and the compiler locates array at address 0xff00, then technically the array would fit into the memory, using addresses 0xff00 thru 0xffff inclusive. However, the expression
char *endptr = &array[256]; // endptr points one past the end of the array
would be equivalent to
char *endptr = NULL; // &array[256] = 0xff00 + 0x0100 = 0x0000
Which means that the following loop would not work, since ptr will never be less than 0
for ( char *ptr = array; ptr < endptr; ptr++ )
So the sections you cited are simply lawyer-speak for, "Don't put arrays at the end of a memory region".
Historical note: the earliest x86 processors used a segmented memory scheme wherein memory addresses where specified by a 16-bit pointer register and a 16-bit segment register. The final address was computed by shifting the segment register left by 4 bits and adding to the pointer, e.g.
pointer register 1234
segment register AB00
-----
address in memory AC234
The resulting address space was 1MByte, but there were end-of-memory boundaries every 64Kbytes. That's one reason for using lawyer-speak instead of stating, "Don't put arrays at the end of memory" in plain english.

Negative index in array [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Negative array indexes in C?
Can I use negative indices in arrays?
#include <stdio.h>
int main(void)
{
char a[] = "pascual";
char *p = a;
p += 3;
printf("%c\n", p[-1]); /* -1 is valid here? */
return 0;
}
Yes, -1 is valid in this context, because it points to a valid location in memory allocated to your char a[] array. p[-1] is equivalent to *(p-1). Following the chain of assignments in your example, it is the same as a+3-1, or a+2, which is valid.
EDIT : The general rule is that an addition / subtraction of an integer and a pointer (and by extension, the equivalent indexing operations on pointers) need to produce a result that points to the same array or one element beyond the end of the array in order to be valid. Thanks, Eric Postpischil for a great note.
C 2011 online draft
6.5.6 Additive operators
8 ...if the expression P points to
the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and
(P)-N (where N has the value n) point to, respectively, the i+n-th and i−n-th elements of
the array object, provided they exist....
Emphasis mine. So, in your specific example, p[-1] is valid, since it points to an existing element of a; however, a[-1] would not be valid, since a[-1] points to a non-existent element of a. Similarly, p[-4] would not be valid, a[10] would not be valid, etc.
Of course it is valid.
(C99, 6.5.2p1) "One of the expressions shall have type ‘‘pointer to object type’’, the other expression shall have integer type, and the result has type ‘‘type’’.
Generally using such negative indexes is a Bad Idea(TM). However, I found one place where this can be useful: trig lookup tables. For such a look up table, we need to use some angle measure as the index. For example, I can index sin values for angles between -180 degrees and +180 using degrees as an index. Or if I want to use radions instead, I can use a multiple of some fraction of PI, say PI/3, for the index. Then I can get cos values between -PI and PI by multiples of PI/3.
Yes, this is legal, as C lets you do unsafe pointer arithmetic all day. However, this is confusing, so don't do it. See also this answer to the same question.

Question about C programming

int a, b;
a = 1;
a = a + a++;
a = 1;
b = a + a++;
printf("%d %d, a, b);
output : 3,2
What's the difference between line 3 and 5?
What you are doing is undefined.
You can't change the value of a variable you are about to assign to.
You also can't change the value of a variable with a side effect and also try to use that same variable elsewhere in the same expression (unless there is a sequence point, but in this case there isn't). The order of evaluation for the two arguments for + is undefined.
So if there is a difference between the two lines, it is that the first is undefined for two reasons, and line 5 is only undefined for one reason. But the point is both line 3 and line 5 are undefined and doing either is wrong.
What you're doing on line 3 is undefined. C++ has the concept of "sequence points" (usually delimited by semicolons). If you modify an object more than once per sequence point, it's illegal, as you've done in line 3. As section 6.5 of C99 says:
(2) Between the previous and next sequence point an object shall have its stored value
modified at most once by the evaluation of an expression. Furthermore, the prior value
shall be read only to determine the value to be stored.
Line 5 is also undefined because of the second sentence. You read a to get its value, which you then use in another assignment in a++.
a++ is a post-fix operator, it gets the value of a then increments it.
So, for lines 2,3:
a = 1
a = 1 + 1, a is incremented.
a becomes 3 (Note, the order these operations are performed may vary between compilers, and a can easily also become 2)
for lines 4,5:
a = 1
b = 1 + 1, a is incremented.
b becomes 2, a becomes 2. (Due to undefined behaviour, b could also become 3 of a++ is processed before a)
Note that, other than for understanding how postfix operators work, I really wouldn't recommend using this trick. It's undefined behavior and will get different results when compiled using different compilers
As such, it is not only a needlessly confusing way to do things, but an unreliable, and worst-practice way of doing it.
EDIT: And has others have pointed out, this is actually undefined behavior.
Line 3 is undefined, line 5 is not.
EDIT:
As Prasoon correctly points out, both are UB.
The simple expression a + a++ is undefined because of the following:
The operator + is not a sequence point, so the side effects of each operands may happen in either order.
a is initially 1.
One of two possible [sensible] scenarios may occur:
The first operand, a is evaluated first,
a) Its value, 1 will be stored in a register, R. No side effects occur.
b) The second operand a++ is evaluated. It evaluates to 1 also, and is added to the same register R. As a side effect, the stored value of a is set to 2.
c) The result of the addition, currently in R is written back to a. The final value of a is 2.
The second operand a++ is evaluated first.
a) It is evaluated to 1 and stored in register R. The stored value of a is incremented to 2.
b) The first operand a is read. It now contains the value 2, not 1! It is added to R.
c) R contains 3, and this result is written back to a. The result of the addition is now 3, not 2, like in our first case!
In short, you mustn't rely on such code to work at all.

Resources