Null character and strings in C


I have the following C code:
#include <stdio.h>
#include <string.h>
int main(void){
char * str = "\012\0345";
char testArr[8] = {'\0','1','2','\0','3','4','5','\0'};
printf("%s\n",str);
printf("**%s**",testArr);
return 0;
}
I'm having trouble understanding the results. I have googled, but I still don't understand why a null character at the start of a string and another in the middle would cause only "5" to display. Also, when I assign each character to the array testArr and then display that array, the result is different, even though the string and the array appear to contain the same characters. With the string str, does the code display "5" because the null characters overwrite what is in memory?
Also, with the array I created using the same characters, none of the data in testArr is displayed. Is it that once the first null is encountered, everything else is ignored? If so, why doesn't the same thing happen with str, which contains the same characters?

An octal escape sequence is \ followed by one to three octal digits, per C 2018 6.4.4.4 1. Per 6.4.4.4 7: “Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.” So, when the compiler sees "\012\0345", it interprets it as the sequence \012 (which is ten), the sequence \034 (which is twenty-eight), and the character 5.
To represent the string you intended, you could use "\00012\000345". Since an octal escape sequence stops at three digits, this is interpreted as the sequence \000, the characters 1 and 2, the sequence \000, and the characters 3, 4, and 5. (A null terminating character will also be appended automatically.)
When you printed "\012\0345", the characters with codes ten and twenty-eight were printed but had no visible effect. (Your C implementation likely uses ASCII, in which case they are control characters. \012 is new-line, so it should have caused a line advance, but you probably did not notice that. \034 is a file-separator control character, which likely has no effect when printed to a regular terminal display.)
When you printed testArr, the null character in the first position ended the string.

Related

Why does a C null terminator `\0` show up as `\000` during GDB debugging?

During my GDB debugging sessions, I've noticed that null terminator characters, denoting the end of a string, and shown as \0 in C files, show up as \000 in GDB when displaying the value of a variable storing such a character.
(gdb) print buffer[10]
$2 = 0 '\000'
Can anyone tell me why that is?

GDB seems to always use 3 octal digits to display character escapes, and for a good reason. Consider the following string:
const char *str = "\1\2\3\4\5";
then
(gdb) p str
$1 = 0x555555556004 "\001\002\003\004\005"
This is because the C standard says that an octal escape sequence consists of at most 3 octal digits. Thus if you write:
"\0a"
it means string literal of two characters - null followed by a. But if you write
"\01"
it means a string literal of one character: ASCII code 1, the Start of Heading control character. In fact, the shortest way to write ASCII null followed by the digit 1 (i.e. ASCII code 49) in a string literal is "\0001". The other possibilities are "\0" "1" using string concatenation; separate escapes "\0\61"; or hex escapes \x..., all of which are longer.
So by always using 3 octal digits, GDB can produce consistent output for strings, such that the output, when copied into a C program, results in the same string at run time. Furthermore, the output routine is simpler because it does not need to consider the following character.

The notation '\0' is an octal escape sequence in a character constant (literal).
An octal escape sequence may contain at most three octal digits.

printf variable placeholder? Is it even a thing?

#include <stdio.h>
char pos[] = {34,92,48,51,51,91,57,59,57,72,37,115,34}; // "\033[9;9H%s"
int main(void) {
printf(pos,"Aaaaaaa"); // (1) This doesn't work as intended
printf("\033[9;9H%s","Aaaaaaa"); // (2) Works as intended
}
So why does (2) work when (1) doesn't?

There are two problems with pos.
First, what you have is a character array, not a null-terminated string. You need to add a 0 to the end of that array.
Second, you don't have the same characters. In the string literal "\033[9;9H%s" there are a total of 8 characters while pos has 13.
The sequence \033 represents a single character whose value is 33 octal or 27 decimal. You instead have the literal characters '\', '0', '3', and '3'. So replace 92,48,51,51 with 27. Also, you have 34 for the first and last characters in pos, which is the double quote character ". These characters are not part of the string literal but are used to denote it in code. So get rid of those.
pos should now look like this:
char pos[] = {27,91,57,59,57,72,37,115,0};

You've got three differences:
You don't need the initial and final 34 character ("), since string (2) doesn't print them out.
You need a null terminator to ensure that you print only your string and nothing more.
If \033 is meant to be an escape character, then its value is just 27, not 92,48,51,51.
After addressing those differences, your pos array:
{34,92,48,51,51,91,57,59,57,72,37,115,34}
Should instead look like this:
{27,91,57,59,57,72,37,115,0}

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
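#include <stdio.h>
#include <string.h>
int main(void) {
char str[]="gdb\0eahr";
printf("sizeof=%zu\n",sizeof(str)); // sizeof yields a size_t, so use %zu
printf("strlen=%zu\n",strlen(str)); // strlen returns a size_t as well
return 0;
}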
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that the 2 separate characters \ and 0 wouldn't actually act as a NUL (\0), but I managed to figure out that they do.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the number of bytes used by the data type, why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%zu\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array, so why is \ not being counted?

In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.

It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NULL "\0" but I managed to figure that it does.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the code to denote this, it translates to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.

Why should char str[]="gdb\0eahr"; be 10 bytes according to the sizeof operator? It is 9 bytes because there are 8 string elements plus the trailing zero.
\0 is only 1 character, not 2. The purpose of \ is to start an escape sequence, so you might also see some of these: \t, \n, \\ and others.
strlen returns 3 because the first string terminator is at position str[3].
A single \ acts as the escape character and is not itself counted in the string's size. If you want a literal \ in your string, you have to write it twice, as \\, which produces a single printable \ character.

The C compiler scans text strings as part of compiling the source code, and during the scan each escape sequence is turned into a single character. The backslash symbol (\) is used to indicate the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of a backslash followed by a letter, there was also a way to specify any character by using a backslash followed by one to three octal digits (0 through 7). So you could specify a line feed character by using either '\n' or '\12', where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
See also Escape sequences in C in Wikipedia.

C string and hex characters

Can anyone explain what is happening in this code?
#include <stdio.h>
void f(const char * str) {
printf("%d\n", str[4]);
}
int main() {
f("\x03""www""\x01""a""\x02""pl");
f("\x03www\x01a\x02pl");
return 0;
}
Why is the output the following?
1
26

The issue is with "\x01""a" versus "\x01a", and the fact that the hex->char conversion and the string concatenation occur during different phases of lexical processing.
In the first case, the hexadecimal character is scanned and converted prior to concatenating the strings, so the first character is seen as \x01. Then the "a" is concatenated, but the hex->char conversion has already been performed, and it's not re-scanned after the concatenation, so you get two characters, \x01 and a.
In the second case, the scanner sees \x01a as a single character, with ASCII code 26.

In C, characters specified in hex (like "\x01") can have more than two digits. In the first case, "\x01""a" is character 1, followed by 'a'. In the second case, "\x01a", that's character 0x1a, which is 26.

Confused about C string constants

When I came across this C language implementation of Porter's stemming algorithm, I found a C-ism I was confused about.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void test( char *s )
{
int len = s[0];
printf("len= %i\n", len );
printf("s[len] = %c\n", s[len] );
}
int main()
{
test("\07" "abcdefg");
return 0;
}
and output:
len = 7
s[len] = g
However, when I input
test("\08" "abcdefgh");
or any string constant that is longer than 7 with the corresponding length in the first pair of quotes (e.g. test("\09" "abcdefghi");), the output is
len = 0
s[len] =
But any input like test("\01" "abcdefgh"); prints out the character in that position (if we call the first character position 1 and not 0 for the moment).
It appears that test( char *s ) reads the number in the first pair of quotes (how it does this I am not sure, since I thought s[0] would only be able to read a single char, i.e. the '\') and prints the character at that position in the string constant in the second pair of quotes.
My question is this: it seems as if we are passing two string constants into test( char *s ). What exactly is happening here? How does the compiler handle a string "split" over two pairs of quotes? Also, is a string of the form "blah" "abcdefg" one consecutive block of memory? I may have overlooked something elementary, but even so I would like to know what it is. I know this is a basic concept, but I could not find a clear example on the web that explains it, and in all honesty I don't follow the output. Any helpful comments are welcome.

There are at least three things going on here:
Literal strings juxtaposed against one another are concatenated by the compiler. "a" "b" is exactly the same as "ab".
The backslash is an escape character, which means it is not copied literally into the resulting string. The notation \01 means "the character with ASCII value 1".
The notation \0... means an octal character constant. Octal numbers are base 8, made up from digits that range from 0 through 7 inclusive. 8 is not a valid octal digit, so "\08" does not follow "\07".

The problem is not in the length of the string, but in the \o syntax for specifying non-printable values in string literals. \o, \oo, and \ooo denote octal constants, i.e. a single character whose value is written in base 8. Since 08 in \08 doesn't represent a valid base 8 number, it is interpreted as \0 followed by the ASCII character 8.
To fix the problem, represent 8 as \10 or \010:
test("\007" "abcdefg");
test("\010" "abcdefgh");
...or switch to hexadecimal, where the \x prefix makes the base more explicit to the casual reader:
test("\x07" "abcdefg");
test("\x08" "abcdefgh");
test("\x09" "abcdefghi");
test("\x0a" "abcdefghij");
...

\number in a character or string literal means the character whose code is the value number. number is interpreted in octal, so the first non-octal digit terminates the number. So "\07" is a one-character string containing the character with code 7, but "\08" is a two-character string containing the character with code 0 followed by the digit 8.
Additionally, code 0 is the null terminator that's used in C to indicate the end of the string. So that second string ends at the beginning, because its first byte is the terminator. This is why the length of the string in your second example is 0.

When two or more string literals are adjacent (separated only by whitespace), the compiler will join them into a single string. Therefore "\07" "abcdefg" is equivalent to "\07abcdefg".
"\07" is an octal escape. An octal escape ends after three digits or at the first non-octal character, whichever comes first. So, when you enter "\08", 8 is not an octal digit, so the escape ends and 0 is stored at s[0].
Now len is 0, so printing s[len] prints the character at s[0], which is the null character and has no printable representation (in ASCII, only codes 32 through 126 are printable).
