Is spacebar a character? - c

Consider this line of text:
First line of text.
If a character array is used to hold the first TEN characters of that line, it will output as:
First lin'\0'
First contains 5 letters, lin contains 3 letters. Where are the other two characters being used?
Is \0 considered two characters?
Or is the space between the words considered a character, so that '\0' is one character?

Yes, space is a character. In ASCII encoding it has code number 32.

The space between the two words has ASCII code 0x20 (040 octal, or 32 decimal); it occupies one byte.
The null at the end of the string, ASCII code 0x00 (0 in both octal and decimal) occupies the other byte.
Note that the space bar is simply the key on the keyboard that generates a space character when typed.
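The byte accounting above can be checked with a small helper. This is only a sketch: load_prefix is a hypothetical name, not something from the question, and it assumes the ten-element array is filled with nine characters plus the terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most cap-1 characters of src into dst
   and always terminate dst. Returns the number of characters copied,
   not counting the terminator. */
size_t load_prefix(char *dst, size_t cap, const char *src)
{
    size_t n;

    if (cap == 0)
        return 0;               /* no room even for the terminator */
    n = strlen(src);
    if (n > cap - 1)
        n = cap - 1;            /* keep one slot free for '\0' */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Called with a char buf[10] and the line from the question, it copies nine characters: five letters, one space at buf[5], three letters, and then the terminator at buf[9].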

'\0' is the null terminator; it is literally the value zero in all implementations.
'\0' is considered a single character because the backslash \ introduces an escape sequence. '\0' and '0' are thus both single characters, but they mean very different things.
Note that a space is represented by yet another ASCII value (32).

C has no "\s" escape sequence (that notation comes from regular expressions); a space is written literally, as " " in a string or as ' ' in a character constant.

Related

Why does a C null terminator `\0` show up as `\000` during GDB debugging?

During my GDB debugging sessions, I've noticed that null terminator characters, denoting the end of a string, and shown as \0 in C files, show up as \000 in GDB when displaying the value of a variable storing such a character.
(gdb) print buffer[10]
$2 = 0 '\000'
Can anyone tell me why that is?
GDB seems to always use 3 octal digits to display character escapes, and for a good reason. Consider the following string
const char *str = "\1\2\3\4\5";
then
(gdb) p str
$1 = 0x555555556004 "\001\002\003\004\005"
This is because the C standard says that an octal escape sequence consists of at most 3 octal digits. Thus if you write:
"\0a"
it means a string literal of two characters: null followed by a. But if you write
"\01"
it means a string literal of one character: ASCII code 1, the Start of Heading control character. In fact, the shortest way to write ASCII null followed by the digit 1 (i.e. ASCII code 49) in a string literal is "\0001". The other possibilities are "\0" "1" using string concatenation; separate escapes "\0\61"; or hex escapes \x...; none of these is shorter.
So by always using 3 octal digits, GDB can produce consistent output for strings - such that when copied to a C program will result in the same string during runtime. Furthermore the output routine is simpler because it does not need to consider the following character.
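To illustrate why fixed-width escapes round-trip safely, here is a minimal sketch of such an output routine. It is not GDB's actual code; escape_octal3 is a hypothetical name, and the caller is assumed to provide a destination of at least 4*n + 1 chars.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: write the n bytes at src into dst as string-literal
   text. Non-printable bytes, the double quote and the backslash are always
   written as a backslash followed by exactly three octal digits, so the
   routine never has to look at the following character.
   Returns the number of chars written, not counting the terminator. */
size_t escape_octal3(const unsigned char *src, size_t n, char *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (isprint(c) && c != '\\' && c != '"')
            dst[out++] = (char)c;
        else
            out += (size_t)sprintf(dst + out, "\\%03o", (unsigned)c);
    }
    dst[out] = '\0';
    return out;
}
```

Given the two bytes 0 and '1', the routine produces the five characters \0001, which a C compiler reads back as the same two bytes. A shorter \0 escape followed by 1 would instead be read as the single byte \01.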
The notation '\0' is a character constant (literal) written as an octal escape sequence.
An octal escape sequence may contain at most three octal digits.

Null character and strings in C

I have the following C code:
#include <stdio.h>
#include <strings.h>
int main(void){
    char * str = "\012\0345";
    char testArr[8] = {'\0','1','2','\0','3','4','5','\0'};
    printf("%s\n",str);
    printf("**%s**",testArr);
    return 0;
}
I'm having trouble understanding the results. I have googled, but I am unsure why a null character at the start of a string and another in the middle would cause only the string "5" to display. Also, when I assign each character to the array testArr and then display that array, the result is different, despite the string and the array having the same characters. With the string str, does the code display "5" because the null characters overwrite what is in memory?
With the array I created using the same characters, none of the data contained in testArr is displayed. Is it that once the first null is encountered everything else is ignored? If so, why doesn't the same behavior occur with the string str, which contains the same characters?
An octal escape sequence is \ followed by one to three octal digits, per C 2018 6.4.4.4 1. Per 6.4.4.4 7: “Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.” So, when the compiler sees "\012\0345", it interprets it as the sequence \012 (which is ten), the sequence \034 (which is twenty-eight), and the character 5.
To represent the string you intended, you could use "\00012\000345". Since an octal escape sequence stops at three digits, this is interpreted as the sequence \000, the characters 1 and 2, the sequence \000, and the characters 3, 4, and 5. (A null terminating character will also be appended automatically.)
When you printed "\012\0345", the characters with codes ten and twenty-eight were printed but had no visible effect. (Your C implementation likely uses ASCII, in which case they are control characters. \012 is new-line, so it should have caused a line advance, but you probably did not notice that. \034 is a file-separator control character, which likely has no effect when printed to a regular terminal display.)
When you printed testArr, the null character in the first position ended the string.

sizeof(string) not including a "\" sign

I've been going through strlen and sizeof for strings (char arrays) and I don't quite get one thing.
I have the following code:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[]="gdb\0eahr";
    printf("sizeof=%zu\n",sizeof(str));
    printf("strlen=%zu\n",strlen(str));
    return 0;
}
the output of the code is:
sizeof=9
strlen=3
At first I was pretty sure that 2 separate characters \ followed by 0 wouldn't actually act as a NUL (\0) but I managed to figure out that they do.
The thing is that I have no idea why sizeof shows 9 and not 10.
Since sizeof counts the number of bytes used by the data type, why doesn't it count the byte for the \?
In a following example:
char str[]="abc";
printf("sizeof=%zu\n",sizeof(str));
that would print out "4" because of the NUL value terminating the array, so why is the \ not being counted?
In a character or string constant, the \ character marks the beginning of an escape sequence, used to represent character values for which there isn't a symbol in the source character set. For example, the escape sequence \n represents the newline character, \b represents the backspace character, \0 represents the zero-valued character (which is also the string terminator), etc.
In the string literal "gdb\0eahr", the escape sequence \0 maps to a single 0-valued character; the actual contents of str are {'g', 'd', 'b', 0, 'e', 'a', 'h', 'r', 0}.
It seems you already have the answer:
At first I was pretty sure that 2 separate characters "\" followed by "0" wouldn't actually act as a NUL "\0" but I managed to figure out that they do.
The sequence \0 is an octal escape sequence for the byte 0. So while there are two characters in the code to denote this, it translates to a single byte in the string.
So you have 7 alphabetic characters, a null byte in the middle, and a null byte at the end. That's 9 bytes.
Why should char str[]="gdb\0eahr"; be 10 bytes with the sizeof operator? It is 9 bytes because there are 8 string elements + trailing zero.
\0 is only 1 character, not 2. The purpose of \ is to escape characters, so you will also see sequences such as \t, \n and \\.
strlen returns 3 because the string terminates at position str[3].
A single \ acts as the escape character and does not count toward the size of the string. To put a literal \ in a string, write it twice, as \\; this produces a single printable \ character.
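The difference between what sizeof sees and what strlen sees can be sketched as follows. The names sample and after_first_nul are hypothetical; the helper simply locates the text that strlen never reaches.

```c
#include <stddef.h>
#include <string.h>

/* 9 bytes in total: 'g' 'd' 'b' 0 'e' 'a' 'h' 'r' 0 */
const char sample[] = "gdb\0eahr";

/* Hypothetical helper: return a pointer to the bytes that follow the
   first null in the n-byte buffer buf, or NULL if there is no null or
   nothing follows it. */
const char *after_first_nul(const char *buf, size_t n)
{
    const char *nul = memchr(buf, '\0', n);

    if (nul == NULL || (size_t)(nul - buf) + 1 >= n)
        return NULL;
    return nul + 1;
}
```

Here sizeof sample is 9 and strlen(sample) is 3, while after_first_nul(sample, sizeof sample) points at "eahr": the data is stored in the array, but string functions stop before it.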
The C compiler scans string literals as part of compiling the source code, and during the scan any escape sequences are turned into single characters. The backslash (\) indicates the start of an escape sequence.
There are several formats for escape sequences. The most basic is a backslash followed by one of several special letters. These two characters are then translated into a single character. Some of these are:
'\n' is turned into a line feed character (0x0A or decimal 10)
'\t' is turned into a tab character (0x09 or decimal 9)
'\r' is turned into a carriage return character (0x0D or decimal 13)
'\\' is turned into a backslash character (0x5C)
This escape sequence idea was used back in the old days so that when a line of text was printed to a teletype machine or printer or a CRT terminal, the programmer could use these and other special command code characters to set where the next character would be printed or to cause the device to do some physical action like ring a bell or feed the paper to the next line.
The escape character also allowed you to embed a double quote (") or a single quote (') into a text string so that you could print text that contain quote marks.
In addition to the above special sequences of a backslash followed by a letter, there was also a way to specify any character by using a backslash followed by one to three octal digits (0 through 7). So you could specify a line feed character by using either '\n' or '\12', where 12 is the octal representation of the hexadecimal value A or the decimal value 10.
Then the ability to use a hexadecimal escape sequence was introduced with the backslash followed by the letter x followed by one or more hexadecimal digits. So you can write a line feed character with '\n' or '\12' or '\xa'.
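The translation the compiler performs can be imitated at run time. The sketch below is a simplified illustration, not how any particular compiler is implemented: unescape and hex_value are hypothetical names, only a few letter escapes are handled, and values that do not fit in a char are simply truncated.

```c
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Translate the escape sequences in src into single bytes in dst.
   dst needs room for strlen(src) bytes. No terminator is appended,
   because the result may itself contain null bytes.
   Returns the number of bytes written. */
size_t unescape(const char *src, char *dst)
{
    size_t out = 0;

    while (*src != '\0') {
        if (*src != '\\') {                 /* ordinary character */
            dst[out++] = *src++;
            continue;
        }
        src++;                              /* skip the backslash */
        if (*src >= '0' && *src <= '7') {   /* octal: at most 3 digits */
            unsigned value = 0;
            for (int d = 0; d < 3 && *src >= '0' && *src <= '7'; d++)
                value = value * 8 + (unsigned)(*src++ - '0');
            dst[out++] = (char)value;
        } else if (*src == 'x') {           /* hex: as many digits as follow */
            unsigned value = 0;
            src++;
            while (hex_value(*src) >= 0)
                value = value * 16 + (unsigned)hex_value(*src++);
            dst[out++] = (char)value;
        } else if (*src != '\0') {          /* letter and punctuation escapes */
            switch (*src) {
            case 'n': dst[out++] = '\n'; break;
            case 't': dst[out++] = '\t'; break;
            case 'r': dst[out++] = '\r'; break;
            default:  dst[out++] = *src; break;   /* \\ \" \' and others */
            }
            src++;
        }
    }
    return out;
}
```

Feeding it the nine source characters of gdb\0eahr yields eight bytes with a zero at index 3. The three spellings \n, \12 and \xa all yield the same byte.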
See also Escape sequences in C in Wikipedia.

What is the difference between '\0' and '\n' in the C programming language?

What is the difference between the '\0' character and the '\n' character in the C programming language?
'\0' is the null character (ASCII 0, NUL), which also serves as the string terminator (C strings are null-terminated): if you have a string like "this is\0a string", string functions will ignore the part after the '\0' (even though it is still stored in the generated code).
'\n' is a newline (ASCII 10). It is noteworthy that in some circumstances, this newline character can actually be transformed. For example, on Windows, where the newline in files is indicated by the "\r\n" sequence (two bytes: ASCII 13, carriage return, followed by ASCII 10, line feed), if you write to a file (e.g. using fprintf()) a string containing a '\n' character, it will be automatically converted to a "\r\n" sequence if the file is open in text mode (which is generally the default).
'\0' is a null: this terminates a string. '\n' is a newline
'\0' is the null character, which indicates the end of a string in C. (printf("%s") will stop printing at the first occurrence of \0 in the string.)
'\n' is a newline, which will simply make the text continue on the next line when printing.
\0 is the null byte, used to terminate strings.
\n is the newline character, 10 in ASCII, used (on Unix) to separate lines.
'\0' is a character constant that is written as an octal-escape-sequence. Its value is 0. It is not the same as '0'. The latter has value 48 in ASCII or 240 in EBCDIC.
'\n' is a character constant that is written as a simple-escape-sequence and denotes the new-line character. Its value is 10 in ASCII.
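A short sketch shows the two characters in their different roles: '\n' is ordinary data inside the string, while '\0' ends the scan. The function name count_newlines is made up for this example.

```c
#include <stddef.h>

/* Count the newline characters in a C string. The loop treats '\n' as
   data to be counted and '\0' as the signal to stop. */
size_t count_newlines(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (*s == '\n')
            count++;
    return count;
}
```

For "one\ntwo\n" the result is 2. For "one\ntwo\0three\n" it is 1, because the scan never gets past the embedded null.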

How to navigate a UTF-8 text file

I have a text file in UTF-8 that I need to navigate in C. I need to split this file into separate smaller files (i.e. cut it in half). When this happens, it sometimes splits a multi-byte character across two different files. When a dumb text editor reads the file containing the second half of the text, it reads the second half of the cut character and becomes confused, thus not displaying the rest of the text correctly. If I read byte-by-byte, how can I tell if I am at the beginning of a character or in the middle? All non-ASCII UTF-8 characters start with the leading bit set to 1, but some are two bytes long and some are three.
Edit: Never mind, I just found out that the number of leading 1s in the first byte gives the length of the character in bytes. I.e. a three-byte character is 1110xxxx 10xxxxxx 10xxxxxx.
if ((*s & 0xc0) == 0x80) /* You are in the middle of a multi-byte character */;
UTF-8 characters are represented using 1 to 4 bytes.
Check a byte, if you have this binary pattern:
10xxxxxx
you are in the middle of a multi-byte character, and you should continue to the next leading byte.
If you have this:
0xxxxxxx
you have a 1-byte character.
110xxxxx
is the leading byte of a 2-byte character
1110xxxx
is the leading byte of a 3-byte character
and
11110xxx
is the leading byte of a 4-byte character
All UTF-8 characters are made of a leading byte and zero or more continuation bytes. All continuation bytes are of the form "10xxxxxx" in binary. So all leading bytes are of one of the two forms: "0xxxxxxx" or "11xxxxxx".
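The bit patterns above are enough to pick a safe place to cut a buffer. Here is a minimal sketch, assuming well-formed UTF-8 input; the function names are hypothetical, and the split point is moved backwards rather than forwards so that the whole character lands in the second piece.

```c
#include <stddef.h>

/* True if b is a continuation byte (10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length in bytes announced by a leading byte, or 0 if b is a
   continuation byte or otherwise not a valid leading byte. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Move pos back until it no longer points at a continuation byte, so
   that buf[0..pos) and buf[pos..len) both hold whole characters. */
size_t utf8_split_point(const unsigned char *buf, size_t len, size_t pos)
{
    if (pos > len)
        pos = len;
    while (pos > 0 && pos < len && utf8_is_continuation(buf[pos]))
        pos--;
    return pos;
}
```

To cut a file roughly in half, one could read it into memory, call utf8_split_point(buf, len, len / 2), and write the two ranges to separate files.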
