How to escape from hex to decimal - C

I apologise if this is an obvious question. I've been searching online for an answer to this and cannot find one. This isn't relevant to my code per se, it's a curiosity on my part.
I am looking at testing my function to read start and end bytes of a buffer.
If I declare a char array as:
char *buffer;
buffer = "\x0212\x03";
meaning STX12ETX - switching between hex and decimal.
I get the expected error:
warning: hex escape sequence out of range [enabled by default]
I can test the code using all hex values:
"\x02\x31\x32\x03"
What I want to know is: is there a way to end the hex escape so that the characters that follow are treated as ordinary decimal digits?

Will something like this work for you?
char *buffer;
buffer = "\x02" "12" "\x03";
According to the standard:
§ 5.1.1.2 6. Adjacent string literal tokens are concatenated.
§ 6.4.4.4 3. and 7. Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
the escape characters:
\' - single quote '
\" - double quote "
\? - question mark ?
\\ - backslash \
\octal digits
\xhexadecimal digits
So the only way to do it is string-literal concatenation performed by the compiler (listing the literals one after another).
If you want to know more about how literals are constructed by the compiler, look at §6.4.4.4 and §6.4.5; they describe character constants and string literals respectively.
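If it helps, here is a minimal sketch (the variable name matches the question) that prints the bytes to confirm what the concatenation produces:
#include <stdio.h>
int main(void) {
    /* Adjacent string literals are concatenated into one array:
       STX, '1', '2', ETX, plus the terminating null. */
    const char *buffer = "\x02" "12" "\x03";
    for (const char *p = buffer; *p != '\0'; ++p)
        printf("%02X ", (unsigned char)*p);   /* prints: 02 31 32 03 */
    printf("\n");
    return 0;
}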

You can terminate the hex escape by putting a blank after it:
buffer = "\x02 12\x03";
Any character that is not a hex digit ends the escape, so the space stops \x02 from swallowing the 1 and 2. Be aware, though, that the blank is not just a separator: it becomes a byte of the string, so this produces STX, space, '1', '2', ETX rather than STX12ETX. (Also note that \b is the backspace escape, not a marker for a decimal value.)
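A minimal sketch of the trade-off, assuming nothing beyond standard C: the blank does terminate the escape, but it also shows up in the data:
#include <stdio.h>
#include <string.h>
int main(void) {
    const char *with_space   = "\x02 12\x03";       /* STX, ' ', '1', '2', ETX */
    const char *concatenated = "\x02" "12" "\x03";  /* STX, '1', '2', ETX */
    printf("%zu vs %zu\n", strlen(with_space), strlen(concatenated));  /* prints: 5 vs 4 */
    return 0;
}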

No, there's no way to end a hexadecimal escape except by having an invalid (for the hex value) character, but then that character is of course interpreted in its own right.
The C11 draft says (in 6.4.4.4 14):
[...] a hexadecimal escape sequence is terminated only by a non-hexadecimal character.
Octal escapes don't have this problem, they are limited to three octal digits.

You can always use the octal format. An octal escape is at most three digits, so if you pad it to three digits it can never absorb a following digit.
So to get the character with octal code 215 you simply type \215.
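A small sketch (assuming nothing beyond standard C) showing that a complete three-digit octal escape cannot absorb a following digit:
#include <stdio.h>
int main(void) {
    /* \215 is complete after three digits, so '3' and '4' are separate characters. */
    const char *s = "\21534";
    for (; *s != '\0'; ++s)
        printf("%03o ", (unsigned char)*s);   /* prints: 215 063 064 */
    printf("\n");
    return 0;
}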

Related

char array initialization with octal constant

I saw a comment that said initialization of a char array with "\001" would put a nul as the first character. I have seen where \0 does set a nul.
The unedited comment:
char input[SIZE] = ""; is sufficient initialization. while ( '\001' == input[0]) doesn't do what you think it is doing if you have initialized input[SIZE] = "\001"; (which creates an empty-string with the nul-character as the 1st character.)
This program
#include <stdio.h>
#define SIZE 8
int main ( void) {
    char input[SIZE] = "\001";
    if ( '\001' == input[0]) { //also tried 1 == input[0]
        printf ( "octal 1\n\n");
    }
    else {
        printf ( "empty string\n");
    }
    return 0;
}
running on Linux, compiled with gcc, outputs:
octal 1
so the first character is 1 rather than '\0'.
Is this the standard behavior or just something with Linux and gcc? Why does it not set a nul?
Is this the standard behavior or just something with Linux and gcc? Why does it not set a nul?
The behavior of the code you present is as required by the standard. In both string literals and integer character constants, octal escapes may contain one, two, or three digits, and the C standard specifies that
Each octal [...] escape sequence is the longest sequence of
characters that can constitute the escape sequence.
(C2011, 6.4.4.4/7)
In this context it is additionally relevant that \0 is an octal escape sequence, not a special, independent code for the null character. The wider context of the above quotation will make that clear.
In the string literal "\001", the backslash is followed by three octal digits, and an octal escape can have three digits, therefore the escape sequence consists of the backslash and all three digits. The first character of the resulting string is the one with integer value 1.
If for some reason you wanted a string literal consisting of a null character followed by the decimal digits 0 and 1, then you could either express the null with a full three-digit escape,
"\00001"
or split it up like so:
"\0" "01"
C will join adjacent string literals to produce the wanted result.
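A small sketch contrasting the two spellings (the array names are just illustrative):
#include <stdio.h>
int main(void) {
    const char one_escape[] = "\001";      /* { 1, 0 }: one character with value 1, plus the null */
    const char split[]      = "\0" "01";   /* { 0, '0', '1', 0 }: null, then the digit characters */
    printf("%zu %zu\n", sizeof one_escape, sizeof split);  /* prints: 2 4 */
    printf("%d %d\n", one_escape[0], split[0]);            /* prints: 1 0 */
    return 0;
}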
I saw a comment that said initialization of a char array with "\001" would put a nul as the first character.
That comment was in error.
From 6.4.4.1 Integer constants, paragraph 3, emphasis mine:
An octal constant consists of the prefix 0 optionally followed by a sequence of the digits 0 through 7 only.
But what we are looking at here is not an integer constant at all. What we have here is, actually, an octal escape sequence. And that is defined as follows (in 6.4.4.4 Character constants):
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
The definition -- both for integer constants as well as character constants -- is "greedy", as elaborated by paragraph 7:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
That means, if the first octal digit is followed by something that could be an octal digit, that next character is considered an octal digit belonging to that constant (to a maximum of three in the case of character constants -- not so for integer constants!).
Hence, your "\001" is, indeed, a character with the value 1.
Note that, while octal character constants run up to three characters maximum (making such a constant quite safe to use if padded with leading zeroes as necessary to get a length of three digits), hexadecimal character constants run as long as there are hexadecimal digits (potentially overflowing the char type they are meant to initialize).
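A short sketch of that difference, assuming an ASCII execution character set (the exact warning wording varies by compiler):
#include <stdio.h>
int main(void) {
    /* The octal escape stops after at most three digits, so 'B' and 'C' stay separate: */
    const char oct_ok[] = "\101BC";     /* bytes: 'A', 'B', 'C', '\0' */
    /* A hex escape is greedy: "\x41BC" would take all four hex digits as one escape and
       typically draws "hex escape sequence out of range"; concatenation is the portable fix: */
    const char hex_ok[] = "\x41" "BC";  /* bytes: 'A', 'B', 'C', '\0' */
    printf("%s %s\n", oct_ok, hex_ok);  /* prints: ABC ABC */
    return 0;
}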
See http://c0x.coding-guidelines.com/6.4.4.4.html
Octal sequence is defined as:
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
and item 873:
The octal digits that follow the backslash in an octal escape sequence
are taken to be part of the construction of a single character for an
integer character constant or of a single wide character for a wide
character constant.
also item 877:
Each octal or hexadecimal escape sequence is the longest sequence of
characters that can constitute the escape sequence.
Therefore the behaviour is correct. "\001" should not have null byte at position 0.

What do the characters starting with '\' and followed by a number, e.g. '\234', mean?

I have been looking at the source of an app when I came across these characters e.g '\233', '\234', '\235' and when I print them, I get garbage characters.
\233 is the character with the octal code 233.
In decimal this is 2×8² + 3×8 + 3 = 155
The meaning depends on the characterset being used. Codes beyond 127 are not defined in 7-bit ASCII.
As advertised by DevSolar:
http://rootdirectory.de/chrome/site/encoding.html might be helpful
They are octal-escape-sequences, which are used to represent specific byte values in a character constant or string literal.
C11, 6.4.4.4 Character constants:
character-constant:
    ' c-char-sequence '
    L' c-char-sequence '
    u' c-char-sequence '
    U' c-char-sequence '
c-char-sequence:
    c-char
    c-char-sequence c-char
c-char:
    any member of the source character set except the single-quote ', backslash \, or new-line character
    escape-sequence
escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
    universal-character-name
octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit
An octal escape sequence is defined as a backslash followed by one to three octal digits (0-7).
To avoid getting a following decimal digit interpreted as part of the octal sequence, it is common practice to pad an octal escape sequence with leading zeroes. As opposed to octal integer constants, though, a leading zero is not required.
Note that the semantic meaning of such an escape sequence depends on the context. I could write "Fu\303\237", and it could mean "Fuß" (if interpreted as UTF-8) or "FuÃŸ" (if interpreted as CP-1252), depending on what encoding I am assuming the string to be in. What I cannot do, portably, is write either of those strings in the source directly, because the interpretation of any character not in the source character set (i.e., 7-bit ASCII without dollar, at-sign, and backtick) is implementation-defined. While most compilers today can be made to interpret string literals as UTF-8, octal escape sequences are the portable way.
FWIW, there are also hexadecimal escape sequences; however they are not as well-defined: They greedily gobble as many "hex digits" as they can get, even beyond what a char can hold; so if the next character in the string literal is one of [0-9a-fA-F], you have no way of "terminating" the hex escape before that (1); this is why octal sequences are preferred by some.
(1): As M.M pointed out, you could split your string literal in two ("\xAB" "CD").
As for what the various character values could stand for, in which encoding, I recommend a good code table. This one I whipped up myself, as I could not find any existing one listing all the information I needed in one page.
It's an escape sequence, for octal values. The syntax is \nnn.
You can read more about escape sequences in C here.
Garbage is printed because 233 in octal is 155 in decimal, 234 is 156, and 235 is 157. Those values are outside the 7-bit ASCII range.
That notation is an octal-escape-sequence, which represents a character by its octal value in a character constant (char literal) or string literal.
Quoting C11, chapter §6.4.4.4, Character constants
The single-quote ', the double-quote ", the question-mark ?, the backslash \, and
arbitrary integer values are representable according to the following table of escape
sequences:
...
octal character \octal digits
and, regarding the values,
The octal digits that follow the backslash in an octal escape sequence are taken to be part
of the construction of a single character for an integer character constant or of a single
wide character for a wide character constant. The numerical value of the octal integer so
formed specifies the value of the desired character or wide character.
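If you want to see the values those constants carry, here is a minimal sketch (the output assumes a typical 8-bit-char platform):
#include <stdio.h>
int main(void) {
    /* The cast guards against plain char being signed on the platform. */
    printf("%d %d %d\n",
           (unsigned char)'\233',
           (unsigned char)'\234',
           (unsigned char)'\235');   /* prints: 155 156 157 */
    return 0;
}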

Understanding output of printf containing backslash (\012)

Can you please help me to understand the output of this simple code:
const char str[10] = "55\01234";
printf("%s", str);
The output is:
55
34
The character sequence \012 inside the string is interpreted as an octal escape sequence. The value 012, read as octal, is 10 in decimal, which is the line feed (\n) character, rendered as a newline on most terminals.
From the Wikipedia page:
An octal escape sequence consists of \ followed by one, two, or three octal digits. The octal escape sequence ends when it either contains three octal digits already, or the next character is not an octal digit.
Since your sequence contains three valid octal digits, that's how it's going to be parsed. It doesn't continue with the 3 from 34, since that would be a fourth digit and only three digits are supported.
So you could write your string as "55\n34", which is more clearly what you're seeing and which would be more portable since it's no longer hard-coding the newline but instead letting the compiler generate something suitable.
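A tiny sketch confirming the two spellings hold the same bytes, assuming an ASCII execution character set:
#include <stdio.h>
#include <string.h>
int main(void) {
    const char a[10] = "55\01234";
    const char b[]   = "55\n34";
    /* On an ASCII system '\n' is 10, the same value as the octal escape \012,
       so the two arrays hold identical strings. */
    printf("same bytes: %d\n", strcmp(a, b) == 0);   /* prints: same bytes: 1 */
    return 0;
}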
\012 is an escape sequence which represents octal code of symbol:
012 = 10 = 0xa = LINE FEED (in ASCII)
So your string looks like 55[LINE FEED]34.
LINE FEED character is interpreted as newline sequence on many platforms. That is why you see two strings on a terminal.
\012 is a newline (line feed) escape sequence, as others have already stated.
(As chux quite correctly commented, this might differ if the character set in use isn't ASCII. In any case, the digits in this notation are octal.)
This is specified by the standard; for C99, ISO/IEC 9899 says in
6.4.4.4 Character constants
[...]
3 The single-quote ', the double-quote ", the question-mark ?, the backslash \, and
arbitrary integer values are representable according to the following table of escape
sequences:
single quote '           \'
double quote "           \"
question mark ?          \?
backslash \              \\
octal character          \octal digits
hexadecimal character    \x hexadecimal digits
And the range it gets bound to:
Constraints
9 The value of an octal or hexadecimal escape sequence shall be in the range of
representable values for the type unsigned char for an integer character constant, or
the unsigned type corresponding to wchar_t for a wide character constant.

writing escape sequence in C using hex, dec, and oct values?

Can someone explain this question to me? I don't understand how the book arrived at its values or how one would arrive at the answer.
Here is the question:
Suppose that ch is a type char variable. Show how to assign the carriage-return character to ch by using an escape sequence, a decimal value, an octal character constant, and a hex character constant. (Assume ASCII code values.)
Here is the answer:
Assigning the carriage-return character to ch by using:
a) escape sequence: ch='\r';
b) decimal value: ch=13;
c) an octal character constant: ch='\015';
d) a hex character constant: ch='\xd';
I understand the answer to part a, but am completely lost for parts b, c, and d. Can you explain?
Computers represent characters using character encodings, such as ASCII, UTF-8, UTF-16, and ISO 8859-1 (http://en.wikipedia.org/wiki/ISO/IEC_8859-1), among others. The carriage return character was used by early computers as a printer instruction to return the printhead to the leftmost position, and the line feed character was used to index the paper to a new line (which is why DOS uses CRLF for line endings: it worked better with dot-matrix printers). Anyway, the CR character is stored internally as a numeric value, in either a single 8-bit byte/octet or a pair of bytes/octets (16 bits), depending on your language.
The common ASCII character set is found here: http://www.asciitable.com/ and there you can see that CR, '\r', 13, 0xD, et al. are different representations of the same value.
Strings are just sequences of characters stored either as an array of characters with a marker at the end (terminator), or stored with a count of the current string length.
From wiki:
Computers and communication equipment represent characters using a
character encoding that assigns each character to something — an
integer quantity represented by a sequence of bits, typically — that
can be stored or transmitted through a network. Two examples of usual
encodings are ASCII and the UTF-8 encoding for Unicode.
For your questions b, c, and d: all of the values are 13 (in decimal). Run this code to see what's happening:
char ch1 = '\r';
printf("ASCII value of carriage return is %d\n", ch1);
There are two parts to explaining answers b-d.
You need to know that the ASCII code point for 'carriage return' or CR (also known as Control-M) is 13. You can find that out from various sources. It might not be obvious that the Unicode standard is one of those places (but it is) and U+000D is CARRIAGE RETURN (CR). Unicode code points U+0000..U+007F are identical to ASCII; Unicode code points U+0000..U+00FF are identical to ISO 8859-1 (Latin 1).
You need to know that C can use decimal numbers, or octal or hexadecimal escapes, when assigning to characters. Notations such as '\15' or '\015' are octal character constants, and octal 15 is decimal 13. Notations such as '\xD' or '\x0D' (or, indeed, '\x0000000000000D' and all stops en route) are hexadecimal constants, and hex D is also decimal 13. (Note that octal escapes are limited to 1-3 digits, but hex escapes are not so limited, and values larger than '\xFF' typically have implementation-defined representations.)
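A quick sketch (assuming ASCII, where CR is 13) checking that all four spellings from the answer store the same value:
#include <stdio.h>
int main(void) {
    char a = '\r';     /* escape sequence          */
    char b = 13;       /* decimal value            */
    char c = '\015';   /* octal character constant */
    char d = '\xd';    /* hex character constant   */
    printf("%d %d %d %d\n", a, b, c, d);   /* prints: 13 13 13 13 */
    return 0;
}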

What does \x mean in C/C++?

Example:
char arr[] = "\xeb\x2a";
BTW, are the following the same?
"\xeb\x2a" vs. '\xeb\x2a'
\x indicates a hexadecimal character escape. It's used to specify characters that aren't typeable (like a null '\x00').
And "\xeb\x2a" is a literal string (type is char *, 3 bytes, null-terminated), and '\xeb\x2a' is a character constant (type is int, 2 bytes, not null-terminated, and is just another way to write 0xEB2A or 60202 or 0165452). Not the same :)
As others have said, the \x is an escape sequence that starts a "hexadecimal-escape-sequence".
Some further details from the C99 standard:
When used inside a set of single-quotes (') the characters are part of an "integer character constant" which is (6.4.4.4/2 "Character constants"):
a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
and
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
So the sequence in your example of '\xeb\x2a' is an implementation defined value. It's likely to be the int value 0xeb2a or 0x2aeb depending on whether the target platform is big-endian or little-endian, but you'd have to look at your compiler's documentation to know for certain.
When used inside a set of double-quotes (") the characters specified by the hex-escape-sequence are part of a null-terminated string literal.
From the C99 standard 6.4.5/3 "String literals":
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Additional info:
In my opinion, you should avoid using 'multi-character' constants. There are only a few situations where they provide any value over using a regular, old int constant. For example, '\xeb\x2a' could more portably be specified as 0xeb2a or 0x2aeb, depending on what value you really wanted.
One area that I've found multi-character constants to be of some use is to come up with clever enum values that can be recognized in a debugger or memory dump:
enum CommandId {
CMD_ID_READ = 'read',
CMD_ID_WRITE = 'writ',
CMD_ID_DEL = 'del ',
CMD_ID_FOO = 'foo '
};
There are few portability problems with the above (other than platforms that have small ints or warnings that might be spewed). Whether the characters end up in the enum values in little- or big-endian form, the code will still work (unless you're doing something else unholy with the enum values). If the characters end up in the value using an endianness that wasn't what you expected, it might make the values less easy to read in a debugger, but the 'correctness' isn't affected.
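A self-contained sketch of the idea; the enum here is a cut-down, hypothetical version of the one above, and the printed value is only what GCC and Clang typically produce:
#include <stdio.h>
/* Multi-character constant: the numeric value is implementation-defined. */
enum CommandId {
    CMD_ID_READ = 'read'
};
int main(void) {
    /* With GCC or Clang this typically prints 0x72656164, i.e. the bytes
       'r' 'e' 'a' 'd', which is what makes the value easy to spot in a dump. */
    printf("CMD_ID_READ = %#x\n", (unsigned)CMD_ID_READ);
    return 0;
}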
When you say:
BTW, are these the same:
"\xeb\x2a" vs '\xeb\x2a'
They are in fact not. The first creates a character string literal, terminated with a zero byte, containing the two characters whose hex representation you provide. The second creates an integer constant.
It's an escape that indicates the digits following it are the hexadecimal code of a single character.
http://www.austincc.edu/rickster/COSC1320/handouts/escchar.htm
The \x means it's a hex character escape. So \xeb would mean character eb in hex, or 235 in decimal. See http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx for more information.
As for the second, no, they are not the same. The double-quotes, ", means it's a string of characters, a null-terminated character array, whereas a single quote, ', means it's a single character, the byte that character represents.
\x allows you to specify the character by its hexadecimal code.
This allows you to specify characters that are normally not printable (some of which have special escape sequences predefined, such as '\n' = newline, '\t' = tab, and '\b' = backspace).
A useful website is here.
And I quote:
x Unsigned hexadecimal integer
That way, your \xeb is 235 in decimal.
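A last small sketch; the cast matters because whether plain char is signed is implementation-defined:
#include <stdio.h>
int main(void) {
    char c = '\xeb';
    /* Cast before printing the numeric value of the byte. */
    printf("%d\n", (unsigned char)c);   /* prints: 235 */
    return 0;
}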
