We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized multibyte sequences.
Maps multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of
the extended character set. The execution character set may also contain multibyte
characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)
Well, after all it seems I was wrong. After contacting David Keaton from WG14 (the working group in charge of the C standard), I got this clarifying reply:
There is a subtle distinction. The source character set is the character set in which source files are written. However, the source character set is just the list of characters available, which does not say anything about the encoding.
Phase 1 maps the multibyte encoding of the source character set onto the abstract source characters themselves.
In other words, a character that looks like this:
<byte 1><byte 2>
is mapped to this:
<character 1>
The first is an encoding that represents a character in the source character set in which the program was written. The second is the abstract character in the source character set.
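To make the distinction between the multibyte encoding and the abstract character concrete, here is a small sketch. It assumes a C11/C18 compiler whose input encoding and u8 literal encoding are UTF-8 (the default for current gcc and clang); the byte values shown follow from that assumption:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* \u00F1 names the single abstract character LATIN SMALL LETTER N    */
    /* WITH TILDE; in a UTF-8 encoded source file the same character      */
    /* appears as the two bytes 0xC3 0xB1. Phase 1 maps that byte         */
    /* sequence back to the one abstract character.                       */
    const char *s = u8"\u00F1";
    printf("%zu bytes:", strlen(s));          /* 2 bytes with UTF-8 output */
    for (size_t i = 0; s[i] != '\0'; i++)
        printf(" 0x%02X", (unsigned char)s[i]);
    printf("\n");                             /* prints: 2 bytes: 0xC3 0xB1 */
    return 0;
}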
You have encountered cross compiling, where a program is compiled on one architecture and executed on another, and the two architectures have different character sets.
5.1.1.2 applies early, when the input file is read: it is converted into the compiler's single internal character set, which clearly must contain all of the characters required by a C program.
However when cross compiling, the execution character set may be different. 5.2.1 is allowing for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms, this is a no-op, but on some ancient platforms it wasn't.
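A tiny illustration of that last point, assuming an ASCII-compatible target (the comments note what an EBCDIC target would do instead):

#include <stdio.h>

int main(void) {
    /* The value of the character constant 'A' comes from the execution   */
    /* character set. On an ASCII/UTF-8 target this prints 0x41; a        */
    /* compiler targeting an EBCDIC machine would emit 0xC1 for the very  */
    /* same source text, because phase 5 converts character constants     */
    /* into the execution character set.                                  */
    printf("'A' == 0x%02X\n", (unsigned)'A');
    return 0;
}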
#include <stdio.h>

int main(void) {
    printf("A");
}
Does a reserved keyword of the language, such as int, also follow the ASCII or Unicode character set? For example, is it split into individual characters that are converted into binary?
The character set of the source files is implementation-defined. Many common compiler systems use ASCII, some even Unicode, but the C standard mandates no specific character set.
This is the first translation phase specified by the C standard:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
The chapter 5.2.1 of the C standard differentiates between the source character set and the execution character set. None of them is defined as a specific character set.
Chapter 5.1.1.2 defines several translation phases, of which the fourth phase executes all preprocessor directives, and as such finishes the preprocessing.
This is the fifth phase:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
So your title question can be answered as "The compiler converts characters from the source character set to the execution character set."
There are actually compilers with different character sets. For example, z88dk uses a target specific execution character set that is not necessarily ASCII, but accepts ASCII source files.
However, this conversion takes place only for character constants and string literals.
Keywords are not affected. They are processed by the preprocessor in their encoding in the source character set and are never converted to any other character set. This source character set can be ASCII or UTF-8, but the specific character set is implementation-defined.
The third phase decomposes the source into tokens:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). [...]
Such a token could be implemented by some integer constant or enumeration member, or any other means the developers see fit.
Concerning your mentioning of "binary": anything in a computer, as we commonly know them, is binary. It is always the interpretation of a binary value that leads to a numeric value, a character, a machine instruction, or any other meaning. Therefore, there are fewer conversions to binary than you might think.
Reserved words of programming languages are almost always composed of plain ASCII characters.
int is stored in the source file as 0x69, 0x6E, 0x74, and the compiler has to parse and identify it as one of the reserved words. This can be done by comparing those three bytes with strings in a table of reserved words, or by using some more advanced technique, such as a hash lookup. Many languages have case-insensitive reserved words; in that case the parsed word int would first have to be converted to a uniform character case, INT.
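As a rough sketch of the table-comparison approach just described (the table and function names are made up for illustration; real compilers typically use generated hash or trie lookups):

#include <stdio.h>
#include <string.h>

/* Illustrative only: a tiny keyword table searched by linear comparison. */
static const char *keywords[] = { "int", "return", "if", "while" };

static int is_keyword(const char *token) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(token, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void) {
    /* The bytes 0x69 0x6E 0x74 read from the source file form the token  */
    /* "int", which the comparison recognizes as a reserved word.         */
    printf("%d\n", is_keyword("int"));   /* prints 1 */
    printf("%d\n", is_keyword("foo"));   /* prints 0 */
    return 0;
}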
The argument of the function printf() is the string literal "A". The compiler will store its value in the object file, typically in the .rodata section, as the sequence of bytes 0x41, 0x00 (a zero-terminated string in ASCII or UTF-8 encoding). It makes no sense to speak of converting the ASCII value of the letter A to binary; it is a question of interpreting the byte contents as the letter A or as its binary value 0x41 = 65.
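A quick way to confirm those two bytes from within a program, assuming an ASCII or UTF-8 execution character set (the common case today):

#include <stdio.h>

int main(void) {
    /* The string literal "A" occupies two bytes: the letter and the      */
    /* terminating null. With an ASCII/UTF-8 execution character set      */
    /* those bytes are 0x41 and 0x00.                                     */
    printf("%zu\n", sizeof "A");                               /* prints 2 */
    printf("0x%02X 0x%02X\n", (unsigned)"A"[0], (unsigned)"A"[1]);
                                                    /* prints 0x41 0x00 */
    return 0;
}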
Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
#include <string.h>

static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
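A quick way to observe that equivalence, assuming a C11 compiler that accepts UTF-8 source files and uses UTF-8 for u8 literals (gcc and clang do by default):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* The same character written two ways: directly as the UTF-8 bytes   */
    /* 0xC3 0xB1 in the source file, and as the universal character name  */
    /* \u00F1. After translation they are indistinguishable.              */
    const char *direct = u8"ñ";
    const char *ucn    = u8"\u00F1";
    printf("%d\n", strcmp(direct, ucn) == 0);   /* prints 1 */
    return 0;
}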
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the string literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, # and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 2, prior to lexical analysis, and Phase 2 only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.
I know there are a few similar questions around relating to this, but it's still not completely clear.
For example: if in my C source file I have lots of defined string literals, does the compiler, as it translates the source file, go through each character of those strings and use a look-up table to get the ASCII number for each character?
I'd guess that when entering characters dynamically into a running C program from standard input, it is the terminal that translates the actual characters to numbers, but then if we have in the code, for example:
if (ch == 'c') { /* ... do something */ }
the compiler must have its own way of understanding and mapping the characters to numbers?
Thanks in advance for some help with my confusion.
The C standard talks about the source character set, which is the set of characters it expects to find in the source files, and the execution character set, which is the set of characters used natively by the target platform.
For most modern computers that you're likely to encounter, the source and execution character sets will be the same.
A line like if (ch == 'c') will be stored in the source file as a sequence of values from the source character set. For the 'c' part, the representation is likely 0x27 0x63 0x27, where the 0x27s represent the single quote marks and the 0x63 represents the letter c.
If the execution character set of the platform is the same as the source character set, then there's no need to translate the 0x63 to some other value. It can just use it directly.
If, however, the execution character set of the target is different (e.g., maybe you're cross-compiling for an IBM mainframe that still uses EBCDIC), then, yes, it will need a way to look up the 0x63 it finds in the source file to map it to the actual value for a c used in the target character set.
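For illustration only, such a lookup could be as simple as a translation function from ASCII code values to their EBCDIC counterparts; the values below are standard EBCDIC code points, but the function itself is just a sketch, not how any real cross compiler is organised:

#include <stdio.h>

/* Illustrative fragment of an ASCII-to-EBCDIC translation a cross        */
/* compiler might consult when emitting character and string constants.   */
static unsigned char to_ebcdic(unsigned char ascii) {
    switch (ascii) {
    case 'A': return 0xC1;   /* EBCDIC 'A' */
    case 'c': return 0x83;   /* EBCDIC 'c' */
    case '0': return 0xF0;   /* EBCDIC '0' */
    default:  return 0x3F;   /* EBCDIC substitute character */
    }
}

int main(void) {
    printf("source 0x%02X -> target 0x%02X\n",
           (unsigned)'c', (unsigned)to_ebcdic('c'));
    return 0;
}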
Outside the scope of what's defined by the standard, there's the distinction between character set and encoding. While a character set tells you what characters can be represented (and what their values are), the encoding tells you how those values are stored in a file.
For "plain ASCII" text, the encoding is typically the identity function: A c has the value 0x63, and it's encoded in the file simply as a byte with the value of 0x63.
Once you get beyond ASCII, though, there can be more complex encodings. For example, if your character set is Unicode, the encoding might be UTF-8, UTF-16, or UTF-32, which represent different ways to store a sequence of Unicode values (code points) in a file.
So if your source file uses a non-trivial encoding, the compiler will have to have an algorithm and/or a lookup table to convert the values it reads from the source file into the source character set before it actually does any parsing.
On most modern systems, the source character set is typically Unicode (or a subset of Unicode). On Unix-derived systems, the source file encoding is typically UTF-8. On Windows, the source encoding might be based on a code page, UTF-8, or UTF-16, depending on the code editor used to create the source file.
On many modern systems, the execution character set is also Unicode, but, on an older or less powerful computer (e.g., an embedded system), it might be restricted to ASCII or the characters within a particular code page.
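To give a feel for what such a conversion involves, here is a deliberately minimal UTF-8 decoding sketch that only handles one- and two-byte sequences; a real compiler front end would also cover three- and four-byte sequences and reject malformed input:

#include <stdio.h>

/* Decode one UTF-8 sequence starting at s; store the code point in *cp   */
/* and return the number of bytes consumed. Only 1- and 2-byte sequences  */
/* are handled here.                                                      */
static int decode_utf8(const unsigned char *s, unsigned *cp) {
    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII         */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {          /* 110xxxxx 10xxxxxx             */
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    *cp = 0xFFFD;                         /* replacement character: case   */
    return 1;                             /* not handled by this sketch    */
}

int main(void) {
    const unsigned char n_tilde[] = { 0xC3, 0xB1, 0 };   /* UTF-8 for ñ   */
    unsigned cp;
    decode_utf8(n_tilde, &cp);
    printf("U+%04X\n", cp);               /* prints U+00F1                */
    return 0;
}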
Edited to address follow-on question in the comments
Any tool that reads text files (e.g., an editor or a compiler) has three options: (1) assume the encoding, (2) take an educated guess, or (3) require the user to specify it.
Most unix utilities assume UTF-8 because UTF-8 is ubiquitous in that world.
Windows tools usually check for a Unicode byte-order mark (BOM), which can indicate UTF-16 or UTF-8. If there's no BOM, it might apply some heuristics (IsTextUnicode) to guess the encoding, or it might just assume the file is in the user's current code page.
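A stripped-down version of that BOM check might look like the following sketch (heuristics such as IsTextUnicode are far more involved; this only recognises the three common BOMs):

#include <stdio.h>
#include <string.h>

/* Very small BOM sniffer: returns a description of the encoding          */
/* suggested by the first bytes of a file, or NULL if there is no BOM.    */
static const char *sniff_bom(const unsigned char *buf, size_t len) {
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)     return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)     return "UTF-16BE";
    return NULL;   /* no BOM: fall back to heuristics or the current code page */
}

int main(void) {
    const unsigned char sample[] = { 0xEF, 0xBB, 0xBF, 'i', 'n', 't' };
    const char *enc = sniff_bom(sample, sizeof sample);
    printf("%s\n", enc ? enc : "no BOM");   /* prints UTF-8 */
    return 0;
}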
For files that have only characters from ASCII, guessing wrong usually isn't fatal. UTF-8 was designed to be compatible with plain ASCII files. (In fact, every ASCII file is a valid UTF-8 file.) Also many common code pages are supersets of ASCII, so a plain ASCII file will be interpreted correctly. It would be bad to guess UTF-16 or UTF-32 for plain ASCII, but that's unlikely given how the heuristics work.
Regular compilers don't expend much code dealing with all of this. The host environment can handle many of the details. A cross-compiler (one that runs on one platform to make a binary that runs on a different platform) might have to deal with mapping between character sets and encodings.
Sort of. Except you can drop the ASCII bit, in full generality at least.
The mapping between character constants like 'c' (which have type int in C) and their numeric equivalents is a function of the encoding used by the architecture that the compiler is targeting. ASCII is one such encoding, but there are others, and the C standard places only minimal requirements on the encoding, an important one being that '0' through '9' must be consecutive, in one block, positive, and able to fit into a char. Another requirement is that 'A' to 'Z' and 'a' to 'z' must be positive values that can fit into a char.
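That digit guarantee is exactly what makes the usual digit-to-value idiom portable, whatever the execution character set:

#include <stdio.h>

int main(void) {
    /* Because '0'..'9' are guaranteed to be consecutive in every          */
    /* execution character set, this conversion is portable. No such       */
    /* guarantee exists for the letters (EBCDIC has gaps in 'a'..'z').     */
    char ch = '7';
    int value = ch - '0';
    printf("%d\n", value);   /* prints 7 */
    return 0;
}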
No, the compiler is not required to have such a thing. Think for a minute about a pre-C11 compiler reading EBCDIC source and translating for an EBCDIC machine. What use would an ASCII look-up table be in such a compiler?
Also think for another minute about what such ASCII look-up table(s) would look like in such a compiler!
According to the C11 standard (5.1.1.2 Translation phases) there are 8 translation phases.
Can anyone give a concrete example for each of the phases.
For example at phase 1 there is:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set...
so can I have an example of what happens when that mapping is executed, and so on for the other phases?
Well, one example of phase one arises when your source code is stored in a record-oriented format, such as in z/OS on the mainframe.
These data sets have fixed record sizes, so if your data set specification was FB80 (fixed, blocked, record length 80), the "line":
int main (void)
would be stored as those fifteen characters followed by sixty-five spaces, and no newline.
Phase one translation would read in the record, possibly strip off the trailing spaces, and add a newline character, before passing the line on to the next phase.
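A sketch of that record-to-line step, with invented names and a plain byte buffer standing in for real record I/O on z/OS:

#include <stdio.h>
#include <string.h>

/* Turn one fixed-length 80-byte record into a logical source line:       */
/* strip trailing blanks and append a new-line end-of-line indicator.     */
static void record_to_line(const char record[80], char line[82]) {
    int end = 80;
    while (end > 0 && record[end - 1] == ' ')
        end--;                     /* drop the padding blanks */
    memcpy(line, record, (size_t)end);
    line[end]     = '\n';          /* introduce the new-line indicator */
    line[end + 1] = '\0';
}

int main(void) {
    char record[80];
    memset(record, ' ', sizeof record);
    memcpy(record, "int main (void)", 15);   /* 15 characters + 65 blanks */
    char line[82];
    record_to_line(record, line);
    printf("%s", line);                      /* prints: int main (void)   */
    return 0;
}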
As per the standard, this is also the phase that handles trigraphs, such as converting ??( into [ on a 3270 terminal that has no support for the [ character.
An example of phase five is if you're writing your code on z/OS (using EBCDIC) but cross-compiling it for Linux/x86 (using ASCII/Unicode).
In that case the source characters within string literals and character constants must have the ASCII representation rather than the EBCDIC one. Otherwise, you're likely to get some truly bizarre output on your Linux box.
I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:
Basic Character Set
Basic Execution Character Set
Basic Source Character Set
Execution Character Set
Extended Character Set
Source Character Set
C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.
The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:
52 upper- and lower-case letters in the Latin alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
Ten decimal digits:
0 1 2 3 4 5 6 7 8 9
29 graphic characters:
! " # % & ' ( ) * + , – . / : ; < = > ? [ \ ] ^ _ { | } ~
4 whitespace characters:
space, horizontal tab, vertical tab, form feed
I believe these are the same as the Basic Character Set, though I'm guessing as C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.
Thanks for any help you can offer! :)
Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:
§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents
§5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:
-fexec-charset=charset
Set the execution character set, used for string and character
constants. The default is UTF-8. charset can be any encoding
supported by the system's "iconv" library routine.
-fwide-exec-charset=charset
Set the wide execution character set, used for wide string and
character constants. The default is UTF-32 or UTF-16, whichever
corresponds to the width of "wchar_t". As with -fexec-charset,
charset can be any encoding supported by the system's "iconv"
library routine; however, you will have problems with encodings
that do not fit exactly in "wchar_t".
-finput-charset=charset
Set the input character set, used for translation from the
character set of the input file to the source character set used by
GCC. If the locale does not specify, or GCC cannot get this
information from the locale, the default is UTF-8. This can be
overridden by either the locale or this command line option.
Currently the command line option takes precedence if there's a
conflict. charset can be any encoding supported by the system's
"iconv" library routine.
To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
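Since GCC delegates these conversions to iconv, the underlying library call is easy to try directly. The following sketch converts a short UTF-8 string to ISO-8859-1 using the POSIX <iconv.h> interface (error handling kept to a minimum):

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void) {
    /* Convert "Año" from UTF-8 to ISO-8859-1 with the same iconv          */
    /* facility that -finput-charset and -fexec-charset rely on.           */
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char in[] = "A\xC3\xB1o";              /* "Año" encoded as UTF-8 bytes */
    char out[16] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");

    for (unsigned char *p = (unsigned char *)out; *p != '\0'; p++)
        printf("0x%02X ", *p);             /* prints: 0x41 0xF1 0x6F       */
    printf("\n");

    iconv_close(cd);
    return 0;
}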
As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and the execution character set. The standard lays out that there are two character sets it's concerned with: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).
You have a "source character set" that is comprised of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.
Similarly for the execution character set (there's a basic execution character set that combined with zero or more extended characters make up the extended execution characters set).
The standard (and your question) enumerate characters that must be in the basic character set - there can be other characters in the basic set.
As far as the difference between the basic 'range' and the extended 'range' of each character set goes, the values of the members of the basic character set must fit within a byte; that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.
The values of characters in the source character sets do not need to agree with the values in the execution character sets (for example, the source character set might be comprised of ASCII, while the execution character set might be EBCDIC).
You might have a look at GNU iconv. Among many others, it will print or convert both Java and C99 strings. iconv is a command-line interface to libiconv, which, very likely, is what your C99 compiler is using internally for these character conversions.
Type iconv -l to see what strings are available on your system. You will need to recompile from source to change that set.
On OS X, I have 141 character sets. On Ubuntu, I have 1,168 character sets (with most of those being aliases).