Index operator bound to a string literal - c

So I decided to experiment, completely out of randomness.
And I found this:
"Hello World"[1]
It appears to work at first glance, yielding 'e',
even though:
I haven't encountered this anywhere until it happened to show up in my code
It seems semantically controversial (or at least quite suspicious)
I could not find any information on the internet regarding this kind of indexing into a string literal (probably searching in the wrong context?)
Is this actually allowed by the standard, with guaranteed well-defined behavior?

This is semantically correct. "Hello World"[1] is equivalent to *("Hello World" + 1). In this expression the string literal "Hello World" is converted to a pointer to its first element. Therefore, ("Hello World" + 1) is the address of the second element of the string "Hello World".

"Hello World"[1]
is perfectly valid. A string literal is of type array of N characters. The type of "Hello World" is an array of 12 char (i.e., char[12]).

Related

Why are C-strings (char arrays) getting merged together or printing erroneously sometimes in C?

I have been working with strings in C. While working with ways to declare them and initialize them, I found some weird behavior I don't understand.
#include <stdio.h>
#include <string.h>
int main()
{
    char str[5] = "World";
    char str1[] = "hello";
    char str2[] = {'N','a','m','a','s','t','e'};
    char* str3 = "Hi";
    printf("%s %zu\n"
           "%s %zu\n"
           "%s %zu\n"
           "%s %zu\n",
           str, strlen(str),
           str1, strlen(str1),
           str2, strlen(str2),
           str3, strlen(str3));
    return 0;
}
Sample output:
Worldhello 10
hello 5
Namaste 7
Hi 2
In some cases, the above code makes str contain Worldhello, and the rest are as they were initialized. In some other cases, the above code makes str2 contain Namastehello. It happens with different variables I never concatenated. So, how are they getting combined?
To work with strings, you must allow space for a null character at the end of each string. Where you have char str[5]="World";, you allow only five characters, and the compiler fills them with “World”, but there is no space for a null character after them. Although the string literal "World" includes an automatic null character at its end, you did not provide space for it in the array, so it is not copied.
Where you have char str1[]="hello";, the compiler determines the array size by counting the characters, including the null character at the end of the string literal.
Where you have char str2[]={'N','a','m','a','s','t','e'};, there is no string literal, just a list of individual characters. The compiler determines the array size by counting those. Since there is no null character, it does not provide space for it.
One potential consequence of failing to terminate a string with a null character is that printf will continue reading memory beyond the string and printing characters from the values it finds. When the compiler has placed other character arrays after such an array you are printing, characters from those arrays may appear in the output.
If you allow space for a null character in str and provide a zero value in str2, your program will print strings in an orderly way:
#include <stdio.h>
#include <string.h>
int main(void)
{
    char str[6] = "World"; // 5 letters plus a null character.
    char str1[] = "hello";
    char str2[] = {'N', 'a', 'm', 'a', 's', 't', 'e', 0}; // Include a null.
    char *str3 = "Hi";
    printf("%s %zu\n%s %zu\n%s %zu\n%s %zu\n",
           str, strlen(str),
           str1, strlen(str1),
           str2, strlen(str2),
           str3, strlen(str3));
    return 0;
}
Undefined behavior in non-null-terminated, adjacently-stored C-strings
Why do you get this part:
Worldhello 10
hello 5
...instead of this?
World 5
hello 5
The answer is that printf() prints chars until it hits a null character, which is a binary zero, frequently written as the '\0' char. And, the compiler happens to have placed the character array containing hello right after the character array containing World. Since you explicitly forced the size of str to be 5 via str[5], the compiler was unable to fit the automatic null character at the end of the string. So, with hello happening to be (not guaranteed to be) right after World, and printf() printing until it sees a binary zero, it printed World, saw no terminating null char, and continued right on into the hello string right after it. This resulted in it printing Worldhello, and then stopping only when it saw the terminating character after hello, which string is properly terminated.
This code relies on undefined behavior, which is a bug. It cannot be relied upon. But, that is the explanation for this case.
Run it with gcc on a 64-bit Linux machine online here: Online GDB: undefined behavior in NON null-terminated C strings
@Eric Postpischil has a great answer and provides more insight here.
From the C tag wiki:
This tag should be used with general questions concerning the C language, as defined in the ISO 9899 standard (the latest version, 9899:2018, unless otherwise specified — also tag version-specific requests with c89, c99, c11, etc).
You've asked a "how?" question about something that none of those documents defines, and so the answer is undefined in the context of C. You can only experience this phenomenon through undefined behaviour.
how are they are getting combined?
There is no such requirement that any of these variables are "combined" or are immediately located after each other; trying to observe that is undefined behaviour. It may appear to coincidentally work (whatever that means) for you at times on your machine, while failing at other times or using some other machine or compiler, etc. That's purely coincidental and not to be relied upon.
In some cases, the above code assigns str with Worldhello and the rest as they were initialized.
In the context of undefined behaviour, it makes no sense to make claims about how your code functions, as you've already noticed, the functionality is erratic.
I found some weird Behaviour with them.
If you want to prevent erratic behaviour, stop invoking undefined behaviour by accessing arrays out of bounds (i.e. causing strlen to run off the end of an array).
Only two of those variables (str1 and str3) are safe to pass to strlen; you need to ensure the array contains a null terminator.

how can a read-only string literal be used as a pointer?

In C one can do this
printf("%c", *("hello there"+7));
which prints h
How can a read-only string literal like "hello there" be used almost like a pointer? How does this work exactly?
Using 'anonymous' string literals can be fun.
It's common to express dates with the appropriate ordinal suffix. (Eg "1st of May" or "25th of December".)
The following 'collapses' the 'Day of Month' value (1-31) down to values 0-3, then uses that value to index into a "segmented" string literal. This works!
// Folding DoM 'down' to use a compact array of suffixes.
i = DoM;
if( i > 20 ) i %= 10; // Change 21-99 to 0-9.
if( i > 3 ) i = 0; // Every other one ends with "th"
//         0   1   2   3
suffix = &"th\0st\0nd\0rd"[ i * 3 ]; // Acknowledge 3byte regions.
A string literal is a character array (char[]) and is thus implicitly converted, in most expressions, to a char pointer (char *) to the first element of the array.
Thus, in the example in the question ("hello there"+7), 7 is added to a pointer to the first character (h) giving a pointer to the 7th character (counting zero based) which also happens to be a h (the "h" in "there").
Notice that the pointer is to char, not const char. However, it is important to know that writing to the location a string literal occupies is undefined behavior, which means the C standard places no requirements on what happens. Depending on the compiler implementation, it may be impossible (the string literal may be stored in read-only memory), it may have unforeseen side-effects, it may change the string literal without any side-effects or ... basically anything.
It is allowed for two identical or overlapping string literals such as "hello there" and "there" to share the same memory location. Hence, the following expressions may be either true or false depending on the compiler implementation:
"hello" == "hello"
"hello there" + 6 == "there"
Once you know how it is stored, you will understand.
String constants are typically stored in the .rodata section, separate from the code, which is stored in the .text section. When the program runs, it needs the address of a string constant in order to use it. Because strings and arrays vary in length, they cannot simply be held in and passed by a register the way an integer or float can, so strings are accessed through pointers, just like arrays.
In fact, values that cannot be encoded directly in instructions are all stored in sections such as .data and .rodata.

What happens to strings under the hood in C? [closed]

Transitioning myself from Python to C for an algorithms course, it has been really difficult for me to understand how common strings work in this new hell.
From what I've understood:
In C, there are no strings per se, but rather an array of characters.
Variable names of arrays point to the address of the first element in an array (which in memory is lined up), thus lacking the need to point out to every single character.
What confuses me is the following:
char greeting[] = "Hello world";
printf("%s", greeting);
1) How come there is no need to pass an array to greeting[ ] like {"H", "e", "l", "l", "o"} etc but a single string is enough?
2) Why does printf print out the whole message, when it's in actuality a simple array? Does using the string format in printf go through a for loop, printing out each element without a new-line?
char *greeting = "Hello world";
printf("%s", greeting);
3) What? Let me guess... C takes the inserted string, gets its length, creates an array of characters and then does the point (2) magic? What kind of shenanigans does the pointer variable do? Something something a[ ] == &a AND a[0] == *a???
char *moreGreetings[] = {"Hello", "Greetings", "Good morning"};
printf("%s", moreGreetings[0]); // Returns "Hello"
4) I just can't anymore... why does calling moreGreetings[0] call out the whole array of characters "Hello"???
Unless there is a bunch of shenanigans going on under the hood, I have no idea how any of this makes sense. Could someone PLEASE explain what is going on?
Computers are aliens. They think nothing like we do. Computers don't know what strings are.
Programming languages are human-to-alien translators. Python is like reading an idiomatically translated book. C is like reading a literal translation, and even then it does a lot of work.
1) How come there is no need to pass an array to greeting[ ] like {"H", "e", "l", "l", "o"} etc but a single string is enough?
The compiler takes care of it for you. Also you're missing the null byte at the end. And those aren't characters.
C is the ultimate DIY language. Coming from Python it can be very disorienting. C gives you the bare minimum (yes, I see you Assembly programmers waving your arms in the back, don't complicate things). It does this A) to be very fast and B) to let you build anything. Unfortunately it doesn't always do this in the most obvious way. If you don't understand what's going on under the hood in C, the details of how computer memory works, you're in trouble.
For example, be careful of " vs '. 'H' is the single character H, really the small integer 72 (the exact number depends on the character set; 72 is its ASCII value). "H" is a two character array, {'H', '\0'} which is really {72, 0}.
The key thing to understand about strings in C, and all arrays, is they're just a hunk of memory split into 1 byte chunks. That's it. They don't even store their own length, you have to either store that somewhere else (like in a struct) or terminate the list with something.
C strings are a hunk of memory split into 1 byte chunks terminated by a null byte (ie. 0). That's it. These are conceptually equivalent.
const char *string = "Hello";
char string[] = {'H', 'e', 'l', 'l', 'o', '\0'};
Both will contain the same bytes, they differ in how they're stored.
2) Why does printf print out the whole message, when it's in actuality a simple array? Does using the string format in printf go through a for loop, printing out each element without a new-line?
printf is kinda like Python's str. You tell it how to convert the thing into characters, and it'll convert the thing. %s says it's a character array terminated by a null byte. %d says it's an integer. %f says a floating point number. All of these things are represented differently in memory and need different conversions to characters.
How printf actually works is an implementation detail, but it's a good exercise to implement it yourself. And you can do it with a for loop writing out one byte at a time and stopping at the null byte.
for( const char *pos = string; pos[0] != '\0'; pos++ ) {
putchar(pos[0]);
}
Note that rather than indexing through the array, I'm moving forward where the start of the array is. string is nothing more than a pointer to the start of the array. By copying it to pos I can change that pointer without affecting string. This avoids having to allocate an extra integer for the index, and it avoids having to do the extra math of an array lookup. pos[0] just reads the 1 byte that pos currently points at.
And yes, if you forget that null byte it'll just keep on going reading the memory past the end of the string until it happens to see a 0 or the operating system smacks it for going out of the bounds of the process.
3) What? Let me guess... C takes the inserted string, gets its length, creates an array of characters and then does the point (2) magic? What kind of shenanigans does the pointer variable do? Something something a[ ] == &a AND a[0] == *a???
No, C strings don't store the length. To get the length they'd have to iterate through the whole string, and then iterate through the whole string again to print it. Instead they print to the null byte.
4) I just can't anymore... why does calling moreGreetings[0] call out the whole array of characters "Hello"???
Because moreGreetings is an array of pointers to more character arrays. When passed to a function, char *moreGreetings[] behaves like char **moreGreetings: a pointer to a pointer to characters.
It's an array of strings and you asked for the first one, so you get a string out.
Keep in mind, Python is written in C (yeah, there are other implementations now). C is the bottom of the stack (almost). Python, and every other program, eventually has to deal with these same "shenanigans" C does, but really it's dealing with the reality of how computers work.
Often they don't use C strings because they're so ungainly and error prone, they make up their own, but they're still filling fixed sized hunks of memory with numbers and calling them "strings".
The best advice I can give you is to turn on compiler warnings. All of them! C compiler warnings can shine a light on many simple mistakes, but they're off by default. The typical way you turn them on is with -Wall, but that's not all warnings. There's lots and lots of extras. This is the formula I use in my Makefile (have a Makefile).
CFLAGS += -Wall -Wshadow -Wwrite-strings -Wextra -Wconversion -std=c99 -pedantic $(OPTIMIZE)
That turns on "all" warnings, and "extra" warnings, and some additional specific warnings I've found useful. It says I'm using the ISO C standard from 1999 (more on that in a moment) and I want the compiler to be pedantic about following the standard so my code is portable between compilers and environments. I do a lot of Open Source work where that matters, but it's also a good habit when you're getting started so you don't get addicted to non-standard compiler extensions.
About the standard. C is quite old and was only standardized in 1990. Many, many people learned to code with non-standard C, and you see that in a lot of C teaching material. Even though there's a 2011 standard, many C programmers write and teach to C90 or even earlier. Even C99 is considered "new" by many. Visual Studio is particularly bad at standards compliance, but they're finally catching up in the latest versions.
1) How come there is no need to pass an array to greeting[ ] like {"H", "e", "l", "l", "o"} etc but a single string is enough?
Because the C syntax allows for "string" literals, which are a shorthand way of representing a C-style string.
Incidentally, {"H", "e", "l", "l", "o"} is an array of strings, not an array of chars. An array of characters would look like this: {'H', 'e', 'l', 'l', 'o'}, but "Hello" actually represents the array { 'H', 'e', 'l', 'l', 'o', '\0' } (strings work by having a string termination character \0 at the end).
2) Why does printf print out the whole message, when it's in actuality a simple array? Does using the string format in printf go through a for loop, printing out each element without a new-line?
The %s token tells printf that you want it to treat the value as a "string", so it handles it as one, printing characters one by one until it encounters the string termination character \0, which is automatically at the end of any "string" you create using the string literal syntax.
3) What? Let me guess... C takes the inserted string, gets its length, creates an array of characters and then does the point (2) magic? What kind of shenanigans does the pointer variable do? Something something a[ ] == &a AND a[0] == *a???
I have no idea what this question means.
4) I just can't anymore... why does calling moreGreetings[0] call out the whole array of characters "Hello"???
moreGreetings is an array of strings (or an array of pointers to arrays of chars, if you like). So moreGreetings[0] is the first element in that array, which is the "string" "Hello". If you pass that into printf and use %s to tell it to treat the value as a string, then it will.
How come there is no need to pass an array to greeting[ ] like {"H", "e", "l", "l", "o"} etc but a single string is enough?
It is indeed possible to initialize the array element by element.
char greetings[] = {'H', 'e', 'l', 'l', 'o', '\0'};
But this form is tedious to write, so char greetings[] = "Hello" is provided as a shortcut. The two initializations are equivalent.
Why does printf print out the whole message?
printf has different behaviors depending on the format argument it receives. When you ask printf to print a value in string format %s, it takes a pointer to a character and prints its value as well as its subsequent characters one by one until it reaches the null terminator \0.
Why does calling moreGreetings[0] call out the whole array of characters "Hello"?
An array used in an expression is converted to a pointer to its first element. So with char greetings[] = "Hello", both printf("%s", &greetings[0]); and printf("%s", greetings); pass a pointer to the same memory location, which produces the same output. In the same way, moreGreetings[0] is a pointer to the first character of "Hello".
1) It's a language feature: you can initialize an array of chars using a string literal, and it will do what you meant, i.e. char greeting[] = "foo" is interpreted as char greeting[] = {'f', 'o', 'o', '\0'}. This costs nothing, because otherwise char greeting[] = "foo" would simply be a compile-time error.
2) Google "C array decay". In short, passing an array where a pointer is expected will behave as if the pointer to the first element of the array was passed. This is useful in many contexts, especially with strings.
3) See 2).
4) Because you declared an array of pointers to char (strings), and are passing the first of those pointers to printf. It is equivalent to writing printf("%s", "Hello").
1) How come there is no need to pass an array to greeting[ ] like {"H", "e", "l", "l", "o"} etc but a single string is enough?
When you pass an array or a string (a string is just an array of characters), you are giving the memory address of the first element in the array. Because array elements are stored in memory, one after another, all that is needed to access the next element in the array (or character in the string) is to increment the memory address that was passed.
2) Why does printf print out the whole message, when it's in actuality a simple array? Does using the string format in printf go through a for loop, printing out each element without a new-line?
Usually, all the system supports is a simple putchar() function call. In order to use more convenient IO function, libraries were created. The printf function probably uses a for loop to print each element in the string.
3) What? Let me guess... C takes the inserted string, gets its length, creates an array of characters and then does the point (2) magic? What kind of shenanigans does the pointer variable do? Something something a[ ] == &a AND a[0] == *a???
The C compiler does count the length of the string. I just want to clarify that this does not happen at runtime, it happens at compiletime. During runtime, the string is referred to by its pointer.
The pointer variable is just an ordinary variable. It just contains some memory address, somewhere. In order for the compiler to know how to treat the pointer, the pointer is given a type, i.e. int*, char*.
Note: There is such a thing as a void* without a referencing type.
When the program wants to access a memory location directly next to that pointed by some pointer, let's call it int*p, it just increments the value of p by p++ or p + 1.

In C, what exactly does something like "a string"[4] mean and signify?

The following is taken from the C book by Mike Banahan (Link: Section 2.8.1.5)
I understand that "a string" reduces to a pointer to the first character of that string which is stored somewhere in memory. But I am clueless about "a string"[4] and what's given in the book is a bit unclear to me.
How can the size be 4 when the string has 9 characters? Beyond that, would "a string"[0] refer to the first character, "a string"[2] to third character, and so on? If not, can you please explain in simple term what that syntax of the book means?
The line that's killing me is "The first results in an
expression whose type is char and whose value is the internal
representation of the letter ‘r’ ". Where does 'r' come in?
Here's the text taken from that book:
Strings are implemented as arrays and although it might look odd, it
is entirely permissible to use array indexing on them:
- "a string"[4]
- L"a string"[4]
are both valid expressions. The first results in an
expression whose type is char and whose value is the internal
representation of the letter ‘r’ (remember arrays index from zero, not
one). The second has the type wchar_t and also has the value of the
internal representation of the letter ‘r’.
NB: Please ignore the stuff about the wide character part as I feel that's not relevant. Thank you.
You can spell out "a string"[4] as follows:
char *s = "a string";
char ch = s[4];
Does it make things clearer?
s[0]: a
s[1]:
s[2]: s
s[3]: t
s[4]: r
s[5]: i
s[6]: n
s[7]: g
s[8]: \0
The string literal "a string" is an array of characters that, in an expression, is converted to a pointer to its first character. C allows pointers to be subscripted, so a string literal can be subscripted.
Therefore "a string"[4] will give the 5th character r.

What does the line starting with double quote mean in C?

I was asked in one of the interviews, what does the following line print in C? In my opinion following line has no meaning:
"a"[3<<1];
Does anyone know the answer?
Surprisingly, it does have a meaning: it's an indexing into an array of characters that represent a string literal. Incidentally, this particular one indexes at 6, which is outside the limits of the literal, and is therefore undefined behavior.
You can construct an expression that works following the same basic pattern:
char c = "quick brown fox"[3 << 1];
will have the same effect as
char c = 'b';
Think of this:
"Hello world"[0]
is 'H'
"Hello world" is a string literal. A string literal is an array of char and is converted to a pointer to the first element of the array in an expression. "Hello world"[0] means the first element of the array.
It does have meaning. Hint: a[b] means exactly the same as *(a+b). (I don't think this is a great interview question, though.)
"a" is an array of 2 characters, 'a', and 0. 3 << 1 is 3*2 = 6, so it's trying to access the 7th element of a 2-element array. That is undefined behavior.
(Also, the code doesn't print anything, even if the undefined behavior is removed, since no printing functions are called.)
"some_string"[i] returns the character at index i of the given string. 3<<1 is 6. So "a"[3<<1] tries to return the character at index 6 (the 7th character) of the string "a".
In other words the code invokes undefined behavior (and thus, in a sense, really does have no meaning) because it's accessing a char array out of bounds.
