character constant:\000 \xhh - c

Can anyone please explain the usage of the character constant \000 and \xhh ie octal numbers and hexadecimal numbers in a character constant?

In C, strings are terminated by a character with the value zero (0). This could be written like this:
char zero = 0;
but this doesn't work inside strings. There is a special syntax used in string literals, where the backslash works as an escape sequence introduction, and is followed by various things.
One such sequence is "backslash zero", that simply means a character with the value zero. Thus, you can write things like this:
char hard[] = "this\0has embedded\0zero\0characters";
Another sequence uses a backslash followed by the letter 'x' and one or more hexadecimal digits, to represent the character with the indicated code. Using this syntax, you could write the zero byte as '\x0' for instance.
EDIT: Re-reading the question, there's also support for such constants in base eight, i.e. octal. They use a backslash followed by one to three octal digits; unlike octal integer constants, no leading zero is required. '\00' is thus a synonym for '\0', and '\0' itself is simply the shortest octal escape.
This is sometimes useful when you need to construct a string containing non-printing characters, or special control characters.
There's also a set of one-character "named" special characters, such as '\n' for newline, '\t' for TAB, and so on.

Those would be used to write otherwise nonprintable characters in the editor. For standard chars, those would be the various control characters; for wchar_t, they could be characters not represented in the editor font.
For instance, this compiles in Visual Studio 2005:
const wchar_t bom = L'\xfeff'; /* Unicode byte-order mark (U+FEFF) */
const wchar_t hamza = L'\x0621'; /* Arabic Letter Hamza */
const char start_of_text = '\002'; /* Start-of-text */
const char end_of_text = '\003'; /* End-of-text */
Edit: Using octal character literals has an interesting caveat. An octal escape sequence can be at most three digits long, which artificially restricts the characters we can enter.
For instance:
/* Letter schwa; capital unicode code point 0x018f (octal 0617)
* small unicode code point 0x0259 (octal 1131)
*/
const wchar_t Schwa2 = L'\x18f'; /* capital letter Schwa, correct */
const wchar_t Schwa1 = L'\617'; /* capital letter Schwa, correct */
const wchar_t schwa1 = L'\x259'; /* small letter schwa, correct */
const wchar_t schwa2 = L'\1131'; /* letter K (octal 113), incorrect */

Octal is base 8 (using digits 0-7) so each digit is 3 bits:
\354 = 11 101 100
Hexadecimal is base 16 (using digits 0-9,A-F) and each digit is 4 bits:
\x23 = 0010 0011
Inside C strings (char arrays/pointers), they are generally used to encode bytes that can't be easily represented.
So, if you want a string which uses ASCII codes like STX and ETX, you can do:
char *msg = "\x02Here's my message\x03";

Related

Why does a C null terminator `\0` show up as `\000` during GDB debugging?

During my GDB debugging sessions, I've noticed that null terminator characters, denoting the end of a string, and shown as \0 in C files, show up as \000 in GDB when displaying the value of a variable storing such a character.
(gdb) print buffer[10]
$2 = 0 '\000'
Can anyone tell me why that is?
GDB seems to always use 3 octal digits to display character escapes - and for a good reason. Consider the following string:
const char *str = "\1\2\3\4\5";
then
(gdb) p str
$1 = 0x555555556004 "\001\002\003\004\005"
This is because the C standard says that an octal escape sequence consists of a maximum of 3 octal digits. Thus if you write:
"\0a"
it means string literal of two characters - null followed by a. But if you write
"\01"
it means a string literal of one character: ASCII code 1, the Start of Heading (SOH) control character. In fact, the shortest way to write ASCII null followed by the digit 1 (i.e. ASCII code 49) in a string literal is "\0001". The other possibilities are "\0" "1" using string concatenation, separate escapes "\0\61", or hex escapes \x..., none of which is shorter.
So by always using 3 octal digits, GDB can produce consistent output for strings, such that the output, when copied into a C program, results in the same string at runtime. Furthermore, the output routine is simpler because it does not need to consider the following character.
The notation '\0' is an octal escape sequence in a character constant (literal).
An octal escape sequence may contain at most three octal digits.

Logical XOR in character arrays

I've been trying to make a program on Vernam Cipher which requires me to XOR two strings. I tried to do this program in C and have been getting an error. The two strings have the same length.
#include<stdio.h>
#include<string.h>
int main()
{
printf("Enter your string to be encrypted ");
char a[50];
char b[50];
scanf("%s",a);
printf("Enter the key ");
scanf("%s",b);
char c[50];
int q=strlen(a);
int i=0;
for(i=0;i<q;i++)
{
c[i]=(char)(a[i]^b[i]);
}
printf("%s",c);
}
Whenever I run the code, I get output as ????? in boxes. What is the method to XOR these two strings ?
I've been trying to make a program on Vernam Cipher which requires me to XOR two strings
Yes, it does, but that's not the only thing it requires. The Vernam cipher involves first representing the message and key in the ITA2 encoding (also known as Baudot-Murray code), and then computing the XOR of each pair of corresponding character codes from the message and key streams.
Moreover, to display the result in the manner you indicate wanting to do, you must first convert it from ITA2 to the appropriate character encoding for your locale, which is probably a superset of ASCII.
The transcoding to and from ITA2 is relatively straightforward, but not so trivial that I'm inclined to write them for you. There is a code chart at the ITA2 link above.
Note also that ITA2 is a stateful encoding that includes shift codes and a null character. This implies that the enciphered message may contain non-printing characters, which could cause some confusion, including a null character, which will be misinterpreted as a string terminator if you are not careful. More importantly, encoding in ITA2 may increase the length of the message as a result of a need to insert shift codes.
Additionally, as a technical matter, if you want to treat the enciphered bytes as a C string, then you need to ensure that it is terminated with a null character. On a related note, scanf() will do that for the strings it reads, which uses one character, leaving you only 49 each for the actual message and key characters.
What is the method to XOR these two strings ?
The XOR itself is not your problem. Your code for that is fine. The problem is that you are XORing the wrong values, and (once the preceding is corrected) outputting the result in a manner that does not serve your purpose.
Whenever I run the code, I get output as ????? in boxes...
XORing two printable characters does not always result in a printable value.
Consider the following:
the ^ operator operates at the bit level.
there is a limited range of values that are printable. (from here):
Control Characters (0–31 & 127): Control characters are not printable characters. They are used to send commands to the PC or the
printer and are based on telex technology. With these characters, you
can set line breaks or tabs. Today, they are mostly out of use.
Special Characters (32–47 / 58–64 / 91–96 / 123–126): Special characters include all printable characters that are neither letters
nor numbers. These include punctuation or technical, mathematical
characters. ASCII also includes the space (a non-visible but printable
character), and, therefore, does not belong to the control characters
category, as one might suspect.
Numbers (48–57): These numbers include the ten Arabic numerals from 0-9.
Letters (65–90 / 97–122): Letters are divided into two blocks, with the first group containing the uppercase letters and the second
group containing the lowercase.
Using the following two strings and the following code:
char str1[] = {"asdf"};
char str2[] = {"jkl;"};
The following demonstrates XORing the elements of the strings:
#include <stdio.h>

int main(void)
{
    char str1[] = {"asdf"};
    char str2[] = {"jkl;"};
    /* sizeof includes the terminating NUL, so the last line printed is 0 ^ 0 */
    for(size_t i = 0; i < sizeof(str1)/sizeof(str1[0]); i++)
    {
        printf("%d ^ %d: %d\n", str1[i], str2[i], str1[i] ^ str2[i]);
    }
    getchar();
    return 0;
}
While all of the input characters are printable (except the terminating NUL character), not all of the XOR results of corresponding characters are:
97 ^ 106: 11 //not printable
115 ^ 107: 24 //not printable
100 ^ 108: 8 //not printable
102 ^ 59: 93
0 ^ 0: 0
This is why you are seeing the odd output. While all of the values may be completely valid for your purposes, they are not all printable.

Octal-based number format in C

I'm having a problem using int values in C that start with zeros (like 00111001).
I know that the C compiler treats a number with a leading zero as an octal number.
My question is how to disable it? I want to turn an 8 digit int into a char array[8].
e.g. 01010001={'0','1','0','1','0','0','0','1'}
If you're using GCC, you can use binary literals. Set the variable to 0b00111001.
https://gcc.gnu.org/onlinedocs/gcc/Binary-constants.html
Surround your number with quotes:
char array[] = "01010001";
The above will make each digit a character in array that you can read as well as placing a '\0' character at the end so that it can be used as a C string - as per your last sentence. Beware, the length of this string (in memory) will be 9 characters though because of this added NUL character.

Confused about C string constants

When I came across this C language implementation of Porter's stemming algorithm, I found a C-ism I was confused about.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void test( char *s )
{
int len = s[0];
printf("len = %i\n", len );
printf("s[len] = %c\n", s[len] );
}
int main()
{
test("\07" "abcdefg");
return 0;
}
and output:
len = 7
s[len] = g
However, when I input
test("\08" "abcdefgh");
or any string constant that is longer than 7 with the corresponding length in the first pair of quotes (i.e. test("\09" "abcdefghi");), the output is
len = 0
s[len] =
But any input like test("\01" "abcdefgh"); prints out the character in that position ( if we call the first character position 1 and not 0 for the moment )
It appears if test( char *s ) reads the number in the first pair of parenthesis ( how it does this I am not sure since I thought s[0] would be able to only read a single char, i.e. the '\' ) and prints the last character at that index + 1 of the string constant in the second pair of parenthesis.
My question is this: It seems as if we are passing two string constants into test( char *s ). What exactly is happening here, meaning, how does the compiler seem to "split" up the string over two pairs of parenthesis? Another question one might have is, is a string of the form "blah" "abcdefg" one consecutive block of memory? It may be the case that I have overlooked something elementary, but even so I would like to know what I overlooked. I know this is a basic concept but I could not find a clear example or situation on the web that explains this and in all honesty I don't follow the output. Any helpful comments are welcomed.
There are at least three things going on here:
Literal strings juxtaposed against one another are concatenated by the compiler. "a" "b" is exactly the same as "ab".
The backslash is an escape character, which means it is not copied literally into the resulting string. The notation \01 means "the character with ASCII value 1".
The notation \0... means an octal character constant. Octal numbers are base 8, made up from digits that range from 0 through 7 inclusive. 8 is not a valid octal digit, so "\08" does not follow "\07".
The problem is not in the length of the string, but in the \o syntax for specifying non-printable values in string literals. \o, \oo, and \ooo denote octal constants, i.e. a single character whose value is written in base 8. Since 08 in \08 doesn't represent a valid base 8 number, it is interpreted as \0 followed by the ASCII character 8.
To fix the problem, represent 8 as \10 or \010:
test("\007" "abcdefg");
test("\010" "abcdefgh");
...or switch to hexadecimal, where the \x prefix makes the base more explicit to the casual reader:
test("\x07" "abcdefg");
test("\x08" "abcdefgh");
test("\x09" "abcdefghi");
test("\x0a" "abcdefghij");
...
\number in a character or string literal means the character whose code is the value number. number is interpreted in octal, so the first non-octal digit terminates the number. So "\07" is a one-character string containing the character with code 7, but "\08" is a two-character string containing the character with code 0 followed by the digit 8.
Additionally, code 0 is the null terminator that's used in C to indicate the end of the string. So that second string ends at the beginning, because its first byte is the terminator. This is why the length of the string in your second example is 0.
When two or more string literals are adjacent (separated only by white-space), the compiler will join them into a single string. Therefore "\07" "abcdefg" is equivalent to "\07abcdefg".
"\07" is an octal escape. An octal escape ends after three digits or at the first non-octal character. So, when you enter "\08", 8 is a non-octal character, therefore the escape ends and 0 is stored at s[0].
Now, len is 0 and printing s[len] will try to print the character at s[0], which has a non-printable ASCII code (only characters with ASCII values 32 through 126 are printable).

Find non-ascii characters from a UTF-8 string

I need to find the non-ASCII characters from a UTF-8 string.
my understanding:
UTF-8 is a character encoding that is a superset of ASCII: values 0-127 are the ASCII characters.
So if, in a UTF-8 string, a character's value is not between 0-127, then it is not an ASCII character, right? Please correct me if I'm wrong here.
On the above understanding i have written following code in C :
Note:
I'm using the Ubuntu gcc compiler to run C code
utf-string is x√ab c
long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
char ch = arr[i];
if (isascii(ch))
printf("Ascii character %c\n", ch);
else
printf("Not ascii character %c\n", ch);
}
Which prints the output like:
length : 9
Ascii character x
Not ascii character
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character
Ascii character c
Ascii character
To naked eye length of x√ab c seems to be 6, but in code it is coming as 9 ?
Correct answer for the x√ab c is 1 ...i.e it has only 1 non-ascii character , but in above output it is coming as 3 (times Not ascii character).
How can i find the non-ascii character from UTF-8 string, correctly.
Please guide on the subject.
What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).
So to count the number of UTF-8 characters you have to do a partial decoding: count the number of bytes that start a character (single-byte codes and start bytes).
See the Wikipedia article on UTF-8 to find out how they are encoded.
Basically there are 3 categories:
single-byte codes 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
However unicode codepoints don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).
When UTF-8 text is stored in a character array, the first byte of each multi-byte character encodes how many bytes that character occupies. The number of consecutive 1's from the MSB of the first byte gives the total number of bytes taken by the non-ASCII character. In the case of '√' the binary form would be: 11100010,10001000,10011010. Counting the leading 1's in the first byte gives the number of bytes occupied as 3. Something like the code below would work for this:
int get_count(char non_ascii_char){
    /*
    Returns the number of bytes occupied by a UTF-8 character, given its
    first byte: the count of consecutive 1 bits starting at the MSB.
    */
    unsigned char lead = (unsigned char)non_ascii_char; /* avoid shifting a negative value */
    int bit_counter = 7, count = 0;
    /*
    bit_counter - index of the bit being examined, from the MSB (7) down to 0
    count - number of consecutive leading 1 bits seen so far
    */
    for(; bit_counter >= 0; bit_counter--){
        if((lead >> bit_counter) & 1){
            count++; /* another leading 1 bit */
        }
        else{
            break; /* stop at the first 0 bit */
        }
    }
    return count; /* e.g. 3 for the first byte 11100010 */
}

Resources