String to ASCII code conversion and tie them - c

The code is C, compiled with gcc.
How can I append a string and a char, as in the following example?
unsigned char MyString [] = {"LOREM IPSUM" + 0x28 + "DOLOR"};

unsigned char MyString [] = {"LOREM IPSUM\050DOLOR"};
The \050 is an octal escape sequence, with 050 == 0x28. The language standard also provides hex escape sequences, but "LOREM IPSUM\x28DOLOR" would be interpreted as a three-digit hex escape (\x28D), the meaning of which (since it would overflow the usual 8-bit char) would be implementation-defined. Octal escapes end after at most three digits, which makes them safer to use.
And while we're at it, there is no guarantee whatsoever that your escapes would be considered ASCII. There are machines using EBCDIC natively, you know, and compilers defaulting to UTF-8 -- which would get you into trouble as soon as you go beyond 0x7f. ;-)

Related

unsigned char in C not working as expected

Since unsigned char represents 0 - 255 and the extended ascii code for 'à' is 133, I expected the following C code to print 133
unsigned char uc;
uc='à';
printf("%hhu \n",uc);
Instead, both clang and gcc produce the following error
error: character too large for enclosing character literal type
uc='à';
^
What went wrong?
By the way I copied à from a French language website and pasted the result into the assignment statement. What I suspect is the way I created à may not be valid.
Since unsigned char represents 0 - 255
This is true in most implementations, but the C standard does not require that a char is limited to 8 bits; it can be larger and support a larger range.
and the extended ascii code for 'à' is 133,
There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' probably uses the code point 224 (0xE0), which is most likely stored as UTF-8. Your editor is probably also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, all ASCII characters are stored as they are in ASCII and need 1 byte; all other characters are a combination of 2-4 bytes, and bit 7 is set in every one of them. I suggest you learn how UTF-8 works. UTF-8 is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.
I expected the following C code to print 133
In UTF-8 the code point for à is stored as the two bytes 0xC3 0xA0, which decode to the value 0xE0. You can't store 0xC3 0xA0 in an 8-bit char, so clang reports an error.
You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough. GCC would store the value 0xC3A0 and not 0xE0, because that is the value inside the ''. However, C supports wide characters. The type wchar_t, which may support more characters, is most likely 32 or 16 bits wide on your system. To write a wide character literal, you can use the prefix L. With a wide character literal, the compiler would store the correct value of 0xE0.
Change the code to:
#include <wchar.h>
....
wchar_t wc;
wc=L'à';
printf("%u \n",(unsigned)wc);

Special char Literals

I want to assign a char with a char literal, but it's a special character, say 255 or 13. I know that I can assign my char with a literal int that will be cast to a char: char a = 13; I also know that Microsoft will let me use the hex code as a char literal: char a = '\xd'
I want to know if there's a way to do this that gcc supports also.
Writing something like
char ch = 13;
is mostly portable, to platforms on which the value 13 means the same thing as on your platform (which is every system that uses the ASCII character set, and that is indeed most systems today).
There may be platforms on which 13 can mean something else. However, using '\r' instead should always be portable, no matter the character encoding system.
Using other values, which do not have character literal equivalents, is not portable. And using values above 127 is even less portable, since then you're outside the ASCII table, and into the extended ASCII table, in which the letters can depend on the locale settings of the system. For example, western European and eastern European language settings will most likely have different characters in the 128 to 255 range.
If you want to use a byte which can contain just some binary data and not letters, instead of using char you might be wanting to use e.g. uint8_t, to tell other readers of your code that you're not using the variable for letters but for binary data.
The hexadecimal escape sequence is not specific to Microsoft. It's part of C/C++: http://en.cppreference.com/w/cpp/language/escape
Meaning that to assign a hexadecimal number to a char, this is cross-platform code:
char a = '\xD';
The question already demonstrates assigning a decimal number to a char:
char a = 13;
Octal numbers can be assigned as well, using only the backslash escape (octal 015 is decimal 13):
char a = '\015';
Incidentally, '\0' is common in C/C++ to represent the null-character (independent of platform). '\0' is not a special character that can be escaped. That's actually invoking the octal escape sequence.

writing escape sequence in C using hex, dec, and oct values?

Can someone explain this question to me? I don't understand how the book arrived at its values or how one would arrive at the answer.
Here is the question:
Suppose that ch is a type char variable. Show how to assign the carriage-return character to ch by using an escape sequence, a decimal value, an octal character constant, and a hex character constant. (Assume ASCII code values.)
Here is the answer:
Assigning the carriage-return character to ch by using:
a) escape sequence: ch='\r';
b) decimal value: ch=13;
c) an octal character constant: ch='\015';
d) a hex character constant: ch='\xd';
I understand the answer to part a, but am completely lost for parts b, c, and d. Can you explain?
Computers represent characters using character encodings, such as ASCII, UTF-8, UTF-16, ISO 8859 (http://en.wikipedia.org/wiki/ISO/IEC_8859-1), as well as others. The carriage return character was used by early computers as a printer instruction to return the printhead to the leftmost position. And the linefeed character was used to index the paper to a new line (which is why DOS uses CRLF for line endings; it worked better with dot matrix printers). Anyway, the CR character is stored internally as a numeric value in either a single 8-bit byte/octet or a 16-bit pair of two bytes/octets, depending upon your language.
The common ASCII character set is found here: http://www.asciitable.com/ and you can find that CR, '\r', 13, 0xD, et al. are different representations of the same value.
Strings are just sequences of characters stored either as an array of characters with a marker at the end (terminator), or stored with a count of the current string length.
From wiki:
Computers and communication equipment represent characters using a
character encoding that assigns each character to something — an
integer quantity represented by a sequence of bits, typically — that
can be stored or transmitted through a network. Two examples of usual
encodings are ASCII and the UTF-8 encoding for Unicode.
For your question b,c,d - all values are 13 (in decimal). Run this code to understand what's happening:
char ch1='\r';
printf("Ascii value of carriage return is %d", ch1);
There are two parts to explaining answers b-d.
You need to know that the ASCII code point for 'carriage return' or CR (also known as Control-M) is 13. You can find that out from various sources. It might not be obvious that the Unicode standard is one of those places (but it is) and U+000D is CARRIAGE RETURN (CR). Unicode code points U+0000..U+007F are identical to ASCII; Unicode code points U+0000..U+00FF are identical to ISO 8859-1 (Latin 1).
You need to know that C can use decimal numbers, or octal or hexadecimal escapes when assigning to characters. Notations such as '\15' or '\015' are octal character constants, and octal 15 is decimal 13. Notations such as '\xD' or '\x0D' (or, indeed, '\x0000000000000D' and all stops en route) are hexadecimal constants, and hex D is also decimal 13. (Note that octal escapes are limited to 1-3 digits, while hex escapes are not so limited; values larger than '\xFF' typically have implementation-defined representations.)

C standard: L prefix and octal/hexadecimal escape sequences

I didn't find an explanation in the C standard of how the aforementioned escape sequences are processed in wide strings.
For example:
wchar_t *txt1 = L"\x03A9";
wchar_t *txt2 = L"\xA9\x03";
Are these somehow processed (like prefixing each byte with \x00 byte) or stored in memory exactly the same way as they are declared here?
Also, how does L prefix operate according to the standard?
EDIT:
Let's consider txt2. How would it be stored in memory? As \xA9\x00\x03\x00, or as \xA9\x03 as it was written? The same goes for \x03A9. Would this be considered one wide character, or 2 separate bytes which would be made into two wide characters?
EDIT2:
Standard says:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape
sequence are taken to be part of the construction of a single character for an integer
character constant or of a single wide character for a wide character constant. The
numerical value of the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Now, we have a char literal:
wchar_t txt = L'\xFE\xFF';
It consists of 2 hex escape sequences, therefore it should be treated as two wide characters. If these are two wide characters they can't fit into one wchar_t space (yet it compiles in MSVC) and in my case this sequence is treated as the following:
wchar_t foo = L'\xFFFE';
which is the only hex escape sequence and therefore the only wide char.
EDIT3:
Conclusions: each oct/hex sequence is treated as a separate value ( wchar_t *txt2 = L"\xA9\x03"; consists of 3 elements). wchar_t txt = L'\xFE\xFF'; is not portable - implementation defined feature, one should use wchar_t txt = L'\xFFFE';
There's no processing. L"\x03A9" is simply an array wchar_t const[2] consisting of the two elements 0x3A9 and 0, and similarly L"\xA9\x03" is an array wchar_t const[3].
Note in particular C11 6.4.4.4/7:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can
constitute the escape sequence.
And also C++11 2.14.3/4:
There is no limit to the number of digits in a hexadecimal sequence.
Note also that when you are using a hexadecimal sequence, it is your responsibility to ensure that your data type can hold the value. C11-6.4.4.4/9 actually spells this out as a requirement, whereas in C++ exceeding the type's range is merely "implementation-defined". (And a good compiler should warn you if you exceed the type's range.)
Your code doesn't make sense, though, because the left-hand sides are neither arrays nor pointers. It should be like this:
wchar_t const * p = L"\x03A9"; // pointer to the first element of a string
wchar_t arr1[] = L"\x03A9"; // an actual array
wchar_t arr2[2] = L"\x03A9"; // ditto, but explicitly typed
std::wstring s = L"\x03A9"; // C++ only
On a tangent: This question of mine elaborates a bit on string literals and escape sequences.

What does \x mean in C/C++?

Example:
char arr[] = "\xeb\x2a";
BTW, are the following the same?
"\xeb\x2a" vs. '\xeb\x2a'
\x indicates a hexadecimal character escape. It's used to specify characters that aren't typeable (like a null '\x00').
And "\xeb\x2a" is a literal string (in C an array of 3 chars, null-terminated, which decays to char * in most expressions), and '\xeb\x2a' is a character constant (type is int in C, not null-terminated, with an implementation-defined value that is commonly just another way to write 0xEB2A or 60202 or 0165452). Not the same :)
As others have said, the \x is an escape sequence that starts a "hexadecimal-escape-sequence".
Some further details from the C99 standard:
When used inside a set of single-quotes (') the characters are part of an "integer character constant" which is (6.4.4.4/2 "Character constants"):
a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
and
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
So the sequence in your example of '\xeb\x2a' is an implementation defined value. It's likely to be the int value 0xeb2a or 0x2aeb depending on whether the target platform is big-endian or little-endian, but you'd have to look at your compiler's documentation to know for certain.
When used inside a set of double-quotes (") the characters specified by the hex-escape-sequence are part of a null-terminated string literal.
From the C99 standard 6.4.5/3 "String literals":
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Additional info:
In my opinion, you should avoid using 'multi-character' constants. There are only a few situations where they provide any value over using a regular, old int constant. For example, '\xeb\x2a' could more portably be specified as 0xeb2a or 0x2aeb, depending on what value you really wanted.
One area that I've found multi-character constants to be of some use is to come up with clever enum values that can be recognized in a debugger or memory dump:
enum CommandId {
CMD_ID_READ = 'read',
CMD_ID_WRITE = 'writ',
CMD_ID_DEL = 'del ',
CMD_ID_FOO = 'foo '
};
There are few portability problems with the above (other than platforms that have small ints or warnings that might be spewed). Whether the characters end up in the enum values in little- or big-endian form, the code will still work (unless you're doing something else unholy with the enum values). If the characters end up in the value using an endianness that wasn't what you expected, it might make the values less easy to read in a debugger, but the 'correctness' isn't affected.
When you say:
BTW,are these the same:
"\xeb\x2a" vs '\xeb\x2a'
They are in fact not. The first creates a character string literal, terminated with a zero byte, containing the two characters whose hex representation you provide. The second creates an integer constant.
It's a special character that indicates the string is actually a hexadecimal number.
http://www.austincc.edu/rickster/COSC1320/handouts/escchar.htm
The \x means it's a hex character escape. So \xeb would mean character eb in hex, or 235 in decimal. See http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx for more information.
As for the second, no, they are not the same. The double-quotes, ", means it's a string of characters, a null-terminated character array, whereas a single quote, ', means it's a single character, the byte that character represents.
\x allows you to specify the character by its hexadecimal code.
This allows you to specify characters that are normally not printable (some of which have special escape sequences predefined, such as '\n'=newline, '\t'=tab and '\b'=backspace).
A useful website is here.
And I quote:
x Unsigned hexadecimal integer
That way, your \xeb is like 235 in decimal.

Resources